Towards Social Virtual Listeners:
Computational Models of Human Nonverbal Behaviors
by
Derya Ozkan
Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
May 2014
Copyright 2014 Derya Ozkan
Table of Contents

List of Tables
List of Figures
Abstract

Chapter 1 Introduction
1.1 Motivation
1.2 High Dimensionality
1.3 Multimodal Integration
1.4 Visual Influence
1.5 Variability in Human's Behaviors
1.6 Summary of Significant Achievements
1.7 Summary of Other Achievements
1.8 Outline

Chapter 2 Related Work
2.1 Human Nonverbal Behavior Modeling
2.1.1 Listener Backchannel Prediction
2.2 Feature Selection and Analysis
2.3 Multimodal Data Integration
2.4 Context-based Prediction
2.5 Wisdom of Crowds

Chapter 3 Feature Selection and Analysis
3.1 Introduction
3.2 Sparse Ranking
3.2.1 Conditional Random Fields
3.2.2 Feature Ranking Method
3.3 Consensus of Self-Features
3.4 Experiments
3.4.1 The Dataset
3.4.2 Backchannel Annotations
3.4.3 Multimodal Features and Encodings
3.4.4 Methodology
3.5 Results and Discussion
3.6 Conclusion

Chapter 4 Multimodal Integration
4.1 Introduction
4.2 Latent Mixture of Discriminative Experts
4.2.1 Motivation
4.2.2 LMDE Model
4.2.3 Learning Model Parameters
4.2.4 Inference
4.3 LMDE for Multimodal Prediction
4.3.1 Multimodal Prediction
4.3.2 User-adaptive Prediction Accuracy
4.4 Experimental Setup
4.4.1 Backchannel Annotations
4.4.2 Multimodal Features and Experts
4.4.3 Baseline Models
4.4.4 Methodology
4.5 Results
4.5.1 Comparative Results
4.5.2 Analysis Results
4.6 Conclusions

Chapter 5 Context-Based Prediction
5.1 Introduction
5.2 Context-based Backchannel Prediction with Visual Influence
5.3 Experimental Setup
5.3.1 Multimodal Features and Experts
5.3.2 Baseline Models
5.3.3 Methodology
5.4 Results and Discussions
5.5 Conclusions

Chapter 6 Variability in Human's Behaviors: Modeling Wisdom of Crowds
6.1 Introduction
6.2 Preliminary Study on Parallel Listener Consensus
6.2.1 Parallel Listener Corpus
6.2.2 Building Response Consensus
6.2.3 Experiments
6.2.3.1 Methodology
6.2.3.2 Prediction Models
6.2.4 Results and Discussion
6.3 Wisdom of Crowds
6.3.1 Parasocial Consensus Sampling
6.4 Individual Experts and Wisdom Analysis
6.4.1 Listener Experts
6.4.2 Feature Analysis
6.5 Computational Model: Wisdom-LMDE
6.6 Experiments
6.6.1 Multimodal Features
6.6.2 Baseline Models
6.6.3 Methodology
6.7 Results and Discussion
6.7.1 Model Comparison
6.7.2 Model Analysis
6.7.3 Feature Analysis
6.8 Conclusions

Chapter 7 Conclusions and Future Works
7.1 Conclusions
7.2 Future Challenges

Bibliography
List of Tables

3.1 Group-features with sparse ranking. We incrementally add features as they appear in the regularization path and use them for retraining. Each row shows the features added at that stage, so the model at this stage is retrained with these new features plus the features above it. The final row shows values when using all the features instead of feature selection.
3.2 Selected features with self-feature consensus using histograms of different orders (after outlier rejection).
3.3 Precision, recall and F-1 values of retrained CRFs with the group-feature approach and self-feature consensus.
4.1 Top 5 features from the ranked list of features for each listener expert.
4.2 Performances of individual expert models trained by using only the top 5 features selected by our feature ranking algorithm introduced in Section 3.2. The last two rows represent the LMDE models using the expert models trained with only 5 features selected either by a greedy method (Morency et al., 2008b) or by our sparse feature ranking scheme.
4.3 Performances of baseline models and our LMDE model as we increase the duration of backchannel labels during training.
4.4 Number of backchannel feedbacks provided by each of the 11 listeners in our test set and their corresponding UPA, precision, recall and F-1 score.
5.1 Test performances of the individual expert models for listener backchannel predictions.
5.2 Test performances of the individual expert models for speaker gesture (head nod) predictions.
5.3 Comparison of different models with our approach.
6.1 The performance of our five models measured using only the displayed listeners' ground truth labels.
6.2 The performance of our five models measured using Consensus 1 and Consensus 2 ground truth labels.
6.3 Most predictive features for each listener from our wisdom dataset. This analysis suggests three prototypical patterns for backchannel feedback.
6.4 Comparison of our Wisdom-LMDE approach (on the PCS dataset) with multiple baseline models. The same test set is used in all experiments.
6.5 Comparison of our Wisdom-LMDE approach (on the MultiLis dataset) with multiple baseline models. The same test set is used in all experiments.
6.6 Top 5 features for 3 listener experts in the MultiLis dataset. Although the top feature for the first two listeners is utterance, the last listener ranks continued gaze as the most important feature for providing backchannel feedbacks. On the other hand, the main focus of the first listener is on visual gestures (eye gaze, blinked gaze, blinked continued gaze); whereas the second listener mainly focuses on the speaker's speech (utterance, pause, SHoUT features).
List of Figures

1.1 Overview of our social behavior modeling scheme for recognizing the social behaviors (i.e. backchannel feedbacks) of a target person (listener) while he/she is communicating with an interlocutor (speaker). Our framework includes Feature Analysis, which provides better understanding and analysis of human communication by discovering the most relevant features for the predicted behavior. On one hand, the multimodal features of the speaker are fused together by a Multimodal Integration technique during the listener prediction process. On the other hand, the multimodal features of the speaker are integrated together to predict his/her behavior as well. This context from the speaker is then used to model the Visual Influence of the speaker on listener feedbacks, and to improve the final listener prediction process. During this listener prediction process, the Wisdom of Crowds is modeled to take into account the variability in human's nonverbal behaviors.
3.1 (a) Group-feature approach. Features are selected by using all people's observations at once. This model has the potential to miss some relevant features specific to a subset of the population. (b) Self-feature consensus. Features of each person in the data are ranked first. Then, we select the top n from these ranked lists of self-features to construct the n-th order histogram of feature counts. In this figure, only the 1st and 2nd order histograms are shown.
3.2 A graphical representation of Conditional Random Fields.
3.3 Example of sparse ranking using L1 regularization. As λ goes from higher to lower values, model parameters start to become non-zero based on their relevance to the prediction model.
3.4 Methodology: The output of our prediction model is the probability of providing backchannel feedback over time. During testing, we first find all the peaks in this probability graph (left). Then, we apply a threshold on these peaks and use all the peaks above this threshold as the final prediction of the model (right). This threshold value is learned automatically during validation.
4.1 Example of multimodal prediction model: listener nonverbal backchannel prediction based on the speaker's speech and visual gestures. As the speaker says the word "her", which is the end of the clause ("her" is also the object of the verb "bothering"), and lowers the pitch while looking back at the listener and eventually pausing, the listener is then very likely to head nod (i.e., nonverbal backchannel).
4.2 Latent Mixture of Discriminative Experts: our new dynamic model for multimodal fusion. In this graphical representation, x_j represents the j-th multimodal observation, h_j is a hidden state assigned to x_j, and y_j the class label of x_j. Gray circles are latent variables. The micro dynamics and multimodal temporal relationships are automatically learned by the hidden states h_j during the learning phase.
4.3 An example of how the hidden variables of our LMDE model can learn the temporal dynamics and asynchrony between modalities.
4.4 A sample output sequence of listener feedback probabilities (in blue). Red and green boxes indicate the responses from Listener 1 and Listener 2, respectively. The red and green lines indicate the thresholds on the output probabilities that can correctly assign the backchannel labels to the corresponding listener labels.
4.5 Baseline Models: a) Conditional Random Fields (CRF), b) Latent Dynamic Conditional Random Fields (LDCRF), c) CRF Mixture of Experts (no latent variable).
4.6 Comparison of individual experts with our LMDE model. Top: Recall (x-axis) vs. Precision (y-axis) values for different threshold values. Bottom: Precision, Recall, F1 and UPA scores of the corresponding models for a selected amount of backchannel.
4.7 Comparison of the LMDE model with previously published approaches for multimodal prediction. Top: Recall (x-axis) vs. Precision (y-axis) values for different threshold values. Bottom: Precision, Recall, F1 and UPA scores of the corresponding models for a selected amount of backchannel.
4.8 Output probabilities from LMDE and individual experts for two different sub-sequences. The gray areas in the graph correspond to ground truth backchannel feedbacks of the listener.
5.1 An overview of our approach for predicting listener backchannels in the absence of visual information. Our approach takes into account the context from the speaker by first predicting the nonverbal behaviors of the speaker, and uses these predictions to improve the final listener backchannels.
5.2 Our approach for predicting speaker gestures in dyadic conversations. Using the speaker audio features as input, we first learn a CRF model (expert) for each audio channel, for both speaker gestures and listener backchannels. Then, we merge these CRF experts using a latent variable model that is capable of learning the hidden dynamic among the experts. This second step allows us to incorporate the visual influence of the speaker gestures on the listener behaviors.
6.1 Our Wisdom-LMDE: (1) multiple listeners experience the same series of stimuli (pre-recorded speakers) and (2) a Wisdom-LMDE model is learned using this wisdom of crowds, associating one expert with each listener.
6.2 Picture of the cubicle in which the participants were seated. It illustrates the interrogation mirror and the placement of the camera behind it, which ensures eye contact (from (de Kok & Heylen, 2011)).
6.3 Baseline Models: a) Conditional Random Fields (CRF), b) Latent Dynamic Conditional Random Fields (LDCRF), c) CRF Mixture of Experts (no latent variable).
6.4 Output probabilities from LMDE and individual listener experts for two different sub-sequences. The gray areas in the graph correspond to ground truth backchannel feedbacks of the listener.
Abstract
Human nonverbal communication is a highly interactive process, in which the participants dynamically send and respond to nonverbal signals. These signals play a significant role in determining the nature of a social exchange. Although humans can naturally recognize, interpret and produce these nonverbal signals in social contexts, computers are not equipped with such abilities. Therefore, creating computational models for holding fluid interactions with human participants has become an important topic for many research fields including human-computer interaction, robotics, artificial intelligence, and cognitive sciences. Central to the problem of modeling social behaviors is the challenge of understanding the dynamics involved with listener backchannel feedbacks (i.e. the nods and paraverbals such as "uh-huh" and "mm-hmm" that listeners produce as someone is speaking). In this thesis, I present a framework for modeling visual backchannels of a listener during a dyadic conversation. I address the four major challenges involved in modeling nonverbal human behaviors, more specifically listener backchannels: (1) High Dimensionality: Human communication is a complicated phenomenon that involves many behaviors (i.e. dimensions) such as smile, nod, hand moving, and voice pitch. A better understanding and analysis of social behaviors can be obtained by discovering the subset of features relevant to a specific social signal (e.g., backchannel feedback). In this thesis, I present a new feature ranking scheme which exploits the sparsity of probabilistic models when trained on human behavior problems. This technique gives researchers a new tool to analyze individual differences in social nonverbal communication. Furthermore, I present a feature selection approach which first looks at the important behaviors for each individual, called self-features, before building a consensus. (2) Multimodal Processing: This high dimensional data comes from different communicative channels (modalities) that contain complementary information essential to the interpretation and understanding of human behaviors. Therefore, effective and efficient fusion of these modalities is a challenging task. If integrated carefully, different modalities have the potential to provide complementary information that will improve the model performance. In this thesis, I introduce a new model called Latent Mixture of Discriminative Experts which can automatically learn the temporal relationship between different modalities. Since I train separate experts for each modality, LMDE is capable of improving the prediction performance even with a limited amount of data. (3) Visual Influence: Human communication is dynamic in the sense that people affect each other's nonverbal behaviors (i.e. gesture mirroring). Therefore, while predicting the nonverbal behaviors of a person of interest, the visual gestures from the second interlocutor should also be taken into account. In this thesis, I propose a context-based prediction framework that models the visual influence of an interlocutor in a dyadic conversation, even if the visual modality from the second interlocutor is absent. (4) Variability in Human's Behaviors: It is known that age, gender and culture affect people's social behaviors. Therefore, there are differences in the way people display and interpret nonverbal behaviors. A good model of human nonverbal behaviors should take these differences into account. Furthermore, gathering labeled data sets is time consuming and often expensive in many real life scenarios. In this thesis, I use the "wisdom of crowds", which enables parallel acquisition of opinions from multiple annotators/labelers. I propose a new approach for modeling wisdom of crowds called Wisdom-LMDE, which is able to learn the variations and commonalities among different crowd members (i.e. labelers).
Chapter 1
Introduction
1.1 Motivation
Human face-to-face communication is a dynamic process, in that participants continuously adjust their behaviors based on verbal and nonverbal signals from other people. Humans often coordinate their verbal and nonverbal messages to convey their point. Together with the verbal channel, the nonverbal channel builds up the larger human communication process, and sometimes nonverbal signals constitute the most important part of the message (Knapp & Hall, 2010). Therefore, nonverbal behaviors play an important role in many social interactions. For example, during a physician-patient interaction, the nonverbal behaviors of a patient can affect the physician's judgement of the patient (Schmid & Mast, 2007). Nonverbal cues influence the learning process in classroom settings (Valenzeno et al., 2003). Students' body and facial gestures manifest their interest in the topic and their understanding (Knapp & Hall, 2010). Furthermore, short observations of teacher behavior determine student evaluation (Babad, 2007).
Among other social behaviors, backchannel feedbacks (i.e. the nods and paraverbals such as "uh-huh" and "mm-hmm" that listeners produce as someone is speaking) have received considerable attention due to their pervasiveness across languages and conversational contexts. Backchannel feedbacks play a significant role in determining the nature of a social exchange by showing rapport and engagement (Gratch et al., 2007). When these signals are positive, coordinated and reciprocated, they can lead to feelings of rapport and promote beneficial outcomes in diverse areas such as negotiations and conflict resolution (Drolet & Morris, 2000), psychotherapeutic effectiveness (Tsui & Schultz, 1985), improved test performance in classrooms (Fuchs, 1987) and improved quality of child care (Burns, 1984). By correctly predicting backchannel feedback, virtual agents and robots can create a stronger sense of rapport.
Although humans can naturally display and interpret nonverbal signals in social contexts, computers are not equipped with such abilities. Therefore, supporting such fluid interactions has become an important topic in computer science research (Pantic et al., 2006). Many different models have been proposed to recognize (Mitra & Acharya, 2007; Sebea et al., 2005) or predict (Ward & Tsukahara, 2000; Maatman et al., 2005; Morency et al., 2008b) these nonverbal behaviors.
When building computational models of social nonverbal behaviors, four main challenges arise: (1) High Dimensionality: Human communication is complex since it involves many nonverbal behaviors such as smile, frowning, nod and voice pitch. This high dimensionality brings about difficulties in the understanding and analysis of social behaviors. (2) Multimodal Processing: During face-to-face conversations, participants dynamically send and respond to the nonverbal signals from different channels (i.e. visual, acoustic, and verbal). These different channels contain complementary information and a hidden dynamic among each other. (3) Visual Influence: How can we model the visual influence between participants (i.e. gesture mirroring) if the visual modality is not directly observed? (4) Variability in Human's Behaviors: There are variations in the way that people display or interpret nonverbal behaviors. Therefore, ground truth labels, which are based on these subjective judgements, implicitly include this variability.
In this thesis, we propose a framework for modeling human nonverbal behaviors which directly addresses these four challenges. More specifically, we apply our framework to the task of predicting listener backchannel feedbacks during a dyadic conversation with a speaker. An overview of our framework is given in Figure 1.1.
[Figure 1.1 is a block diagram: a transcribed speaker utterance and the speaker's multimodal features feed a Feature Analysis component and two Multimodal Integration components (one for Speaker Prediction, one for Listener Prediction); Visual Influence Modeling and Wisdom of Crowds Modeling then produce the speaker gestures (i.e. head nods) and the final listener behaviors (i.e. backchannel).]
Figure 1.1: Overview of our social behavior modeling scheme for recognizing the social behaviors (i.e. backchannel feedbacks) of a target person (listener) while he/she is communicating with an interlocutor (speaker). Our framework includes Feature Analysis, which provides better understanding and analysis of human communication by discovering the most relevant features for the predicted behavior. On one hand, the multimodal features of the speaker are fused together by a Multimodal Integration technique during the listener prediction process. On the other hand, the multimodal features of the speaker are integrated together to predict his/her behavior as well. This context from the speaker is then used to model the Visual Influence of the speaker on listener feedbacks, and to improve the final listener prediction process. During this listener prediction process, the Wisdom of Crowds is modeled to take into account the variability in human's nonverbal behaviors.
Our framework first includes Feature Analysis for a better understanding of human communication by revealing the features that are the most relevant to the social behavior (listener backchannels). The Listener Prediction step involves processing of the multimodal features of the interlocutor (speaker), which are fused together by a Multimodal Integration technique. During this prediction process, the Wisdom of Crowds is taken into account to model the differences in the way people display and interpret nonverbal behaviors. Our Speaker Prediction step also looks at the multimodal features of the speaker to predict any missing information (i.e. predicting visual gestures of the speaker when only audio input features are available). Then, this context from the speaker is used to improve the final listener backchannel prediction process.
In the following four sections (Sections 1.2, 1.3, 1.4, and 1.5), we discuss our previous achievements that addressed the above four challenges. Our achievements provide empirical evaluation on the task of listener backchannel feedback prediction during dyadic conversations. We summarize these achievements in Section 1.6 and other achievements in Section 1.7. This chapter ends with an outline of the thesis in Section 1.8.
1.2 High Dimensionality
When building computational models of human nonverbal behaviors, many behaviors (dimensions) need to be considered, since human communication is implicitly multimodal (e.g. smile, nods and speech pitch). This high dimensionality brings challenges when building computational models able to understand and analyze social behaviors, since often only a limited amount of data is available for analysis and learning. Moreover, since we are often collaborating with psychologists and clinicians, keeping all input dimensions brings another challenge when trying to interpret these computational models.
To alleviate this problem of high dimensionality, and to gain a better understanding and analysis of social behaviors, we present a feature selection scheme in Chapter 3 that automatically discovers the subset of features relevant to a specific social behavior. Contrary to the traditional approach for feature selection, which looks at the most relevant features from all human interactions in the dataset, we present a feature selection scheme that first looks at important behaviors for each individual before building a consensus. Consequently, our scheme is able to learn some discriminative features which are targeted to a subset of the population.
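To make the idea concrete, the sketch below shows one minimal way to build such a consensus from per-person feature rankings. It is an illustration only, not the thesis implementation; the function and the toy feature names are hypothetical, and the actual method (Chapter 3) builds histograms over ranked self-features.

    from collections import Counter

    def self_feature_consensus(ranked_features_per_person, top_n=5, min_votes=2):
        """Toy consensus of self-features.

        ranked_features_per_person: one ranked feature list per individual
        top_n: how many of each person's top-ranked self-features to consider
        min_votes: how many individuals must agree before a feature is kept
        """
        votes = Counter()
        for ranking in ranked_features_per_person:
            votes.update(ranking[:top_n])      # each person's top-n self-features vote
        return [feat for feat, count in votes.most_common() if count >= min_votes]

    # Hypothetical example: three listeners with individually ranked speaker features.
    rankings = [
        ["pause", "low_pitch", "gaze_at_listener", "utterance_end", "smile"],
        ["low_pitch", "pause", "utterance_end", "smile", "eyebrow_raise"],
        ["gaze_at_listener", "smile", "pause", "head_nod", "low_pitch"],
    ]
    print(self_feature_consensus(rankings, top_n=3, min_votes=2))

In this toy run, a feature that only one person relies on (e.g. "head_nod") is dropped, while features shared by a subset of individuals survive, which is exactly the behavior the group-feature approach tends to miss.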
To enable the above approach, we propose a sparse feature ranking method based on the L1 regularization technique (Ng, 2004; Smith & Osborne, 2005; Vail, 2007). This method is a non-greedy ranking technique in the sense that all the features have equal influence and they are selected together. A better understanding of human behaviors can allow us to improve the way that machines communicate with humans.
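For reference, this kind of ranking builds on the usual L1-penalized conditional log-likelihood; the exact model and ranking procedure are given in Chapter 3, so the formula below is only a generic sketch, with λ the regularization weight:

\[
\Theta^{*}(\lambda) \;=\; \arg\max_{\Theta}\; \sum_{i}\log P(\mathbf{y}_i \mid \mathbf{x}_i;\Theta)\;-\;\lambda \sum_{k}|\theta_k| .
\]

Sweeping λ from large to small values traces a regularization path: weights become non-zero in an order that reflects their relevance, and features can be ranked by that order of entry (see Figure 3.3).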
1.3 Multimodal Integration
Building computational models of nonverbal behaviors involves learning the dynamics and temporal relationships between cues from multiple modalities (Quek, 2003). These different modalities contain complementary information essential to the interpretation and understanding of human behaviors (Oviatt, 1999). Psycholinguistic studies also suggest that gesture and speech come from a single underlying mental process, and they are related both temporally and semantically (McNeill, 1992; Cassell & Stone, 1999; Kendon, 2004). This brings about the issue of multimodal integration, which refers to effective and efficient fusion of information from multiple communicative channels. If integrated properly, these different channels can provide complementary information that will improve the performance of our computational models of social behaviors.
There are several characteristics that a good fusion process is desired to have. Among others, we discuss three of the most important. First, a good fusion process should allow re-weighting of noisy channels. In other words, it should be able to learn how confident each modality is in achieving a defined task (such as audio-visual speaker detection, human tracking, etc.). Second, effective training should be possible, even with a limited amount of data. And third, the fusion process should be interpretable, so that analysis of each modality's influence is feasible.
In Chapter 4, we introduce a new model called Latent Mixture of Discriminative Experts (LMDE), which directly addresses these three issues. A graphical representation of LMDE is given in Figure 4.2. One of the main advantages of our computational model is that it can automatically discover the hidden structure among modalities and learn the dynamic between them. Since a separate expert is learned for each modality, effective training is possible even with a limited amount of data. Furthermore, our learning process provides a ground for better model interpretability. By analyzing each expert, the most important features in each modality, relevant to the task, can be identified. In Chapter 4, we provide empirical evaluation on the task of backchannel feedback prediction, confirming the importance of combining multiple modalities.
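As a rough sketch of the form this takes (the precise definition is in Chapter 4), the model marginalizes over a sequence of hidden states h while the per-modality experts supply the inputs to that latent layer; with e_k denoting the output of the k-th expert, one can write:

\[
P(\mathbf{y} \mid \mathbf{x}; \Theta)
\;=\; \sum_{\mathbf{h}} P(\mathbf{y} \mid \mathbf{h}; \Theta)\,
P\big(\mathbf{h} \mid e_{1}(\mathbf{x}), \ldots, e_{m}(\mathbf{x}); \Theta\big).
\]

Here the hidden states h_j play the role described in Figure 4.2: they absorb the micro dynamics and the temporal (a)synchrony between the expert outputs.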
1.4 Visual Influence
Human communication is a dynamic process in which behaviors of either participant can affect the other. For instance, the listener may choose to change his/her feedback behaviors based on the gestures that he/she gets from the speaker. Similarly, participants of a conversation often mimic each other's gestures to convey empathy and rapport (Ross et al., 2008; Hatfield et al., 1992; Riek et al., 2010). This phenomenon, which we refer to as visual influence, is essential for fluid human interactions; but research is still needed to build accurate computational models when the visual information is not directly observed.
Inspired by this idea of visual influence, we propose that gestures of the speaker should be integrated while predicting the listener's feedback. However, in many real life scenarios, such as when building socially intelligent virtual agents, we only have the speech or text, without any visual information. Similarly, during a human-virtual agent interaction, the recognition models of some human visual behaviors are not always available or accurate. Even in the absence of visual context, the visual influence should be taken into account to improve our prediction models.
In Chapter 5, we present a context-based prediction framework to model the visual influence of a speaker on the listener feedbacks during a dyadic conversation. We assume an environment where the visual gestures of the speaker are not available. Based on this assumption, we first predict the visual context (i.e. head nods) of the speaker using only the information from the audio channel. Then, we use this context from the speaker to model the visual influence by using an extension of the Latent Mixture of Discriminative Experts model.
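The two-stage structure can be illustrated with a deliberately simple stand-in: the snippet below uses logistic regression on synthetic per-frame data in place of the CRF experts and the latent fusion model of Chapter 5, purely to show how the predicted speaker context is appended to the audio features before the final listener prediction.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # Synthetic per-frame data (not the thesis corpus): speaker audio features,
    # speaker head-nod labels, and listener backchannel labels.
    X_audio = rng.normal(size=(1000, 6))
    speaker_nod = (X_audio[:, 0] + 0.5 * rng.normal(size=1000) > 0.8).astype(int)
    listener_bc = (X_audio[:, 1] + speaker_nod + 0.5 * rng.normal(size=1000) > 1.2).astype(int)

    # Stage 1: predict the missing visual context (speaker head nods) from audio only.
    nod_model = LogisticRegression().fit(X_audio, speaker_nod)
    nod_context = nod_model.predict_proba(X_audio)[:, [1]]

    # Stage 2: append the predicted context to the audio features and train the
    # final listener backchannel predictor on the augmented input.
    X_augmented = np.hstack([X_audio, nod_context])
    bc_model = LogisticRegression().fit(X_augmented, listener_bc)
    print("training accuracy with predicted visual context:",
          round(bc_model.score(X_augmented, listener_bc), 3))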
1.5 Variability in Human's Behaviors
Although there are many similarities in the way people display their nonverbal behaviors, many studies show that environmental factors (such as culture, gender, and age) affect people's nonverbal communication. For instance, David Matsumoto (Matsumoto, 2006) claims that humans learn to modify and manage their basic behaviors based on social circumstance (cultural display rules), and learn rules about how to manage their judgments of the displayed behaviors (cultural decoding rules). Judith Hall (Hall, 1978) studied gender differences in nonverbal behaviors and reported a female advantage in decoding nonverbal communication cues.
Another important factor that affects human nonverbal behaviors is personality. Certain personality traits affect behavioral style (Ambady & Rosenthal, 1998). For instance, extraverted people are more expressive than introverted people. While experiencing the same set of environmental conditions, people may choose to react differently. In the case of backchannelling, some people may give more frequent feedbacks, whereas others may choose to be less active and give feedback only rarely. Furthermore, different people may be paying attention to different cues from the speaker when providing a backchannel. Some people might mainly focus on the speech channel and give feedback in the presence of lexical and/or prosodic cues; on the other hand, some people might be more visual and give feedback when the speaker looks back or smiles.
A good computational model of human nonverbal communication should be able to take these differences into account. This requires opinions of different individuals for the same task. However, in many real life scenarios it is hard to collect the large number of opinions required for training, because it is expensive and/or time consuming. To address this issue, a new direction of research appeared in the last decade, taking full advantage of the "wisdom of crowds" (Surowiecki, 2004). In simple words, wisdom of crowds enables parallel acquisition of opinions from multiple annotators/experts.
In Chapter 6, we propose a new approach to model wisdom of crowds. This approach is based on the Latent Mixture of Discriminative Experts model. In our Wisdom-LMDE model, a discriminative expert is trained for each crowd member. The key advantage of our computational model is that it can automatically discover the prototypical patterns in crowd member perception, and learn the dynamic between different views.
1.6 Summary of Significant Achievements
The significant achievements of this thesis are summarized below:
1. Consensus of Self-Features
In (Ozkan & Morency, 2010), we presented a feature selection approach which first looks at important behaviors for each individual, called self-features, before building a consensus. This technique gives researchers a new tool to analyze individual differences in social nonverbal communication.
2. A sparse feature ranking scheme
To enable efficient feature selection, we proposed a feature ranking scheme in (Ozkan & Morency, 2010) based on a sparse regularization method called L1 regularization. This scheme is a non-greedy ranking method where two or more features can have the same rank, meaning that these features have joint influence and should be selected together.
3. Latent Mixture of Discriminative Experts Model
In (Ozkan et al., 2010), we introduced a new model called Latent Mixture of Discriminative Experts (LMDE) which can automatically learn the temporal relationship between different modalities. One of the main advantages of our computational model is that it can automatically discover the hidden structure among modalities and learn the dynamic between them. Since we train separate experts for each modality, LMDE is capable of improving the prediction performance even with a limited amount of data. Furthermore, our learning process provides a ground for better model interpretability.
4. Visual Influence
Based on the phenomenon of visual influence of one participant's gestures on the behaviors of the other participant in a face-to-face conversation, we proposed a context-based prediction approach in (Ozkan & Morency, 2013) for predicting listener behaviors when the visual modality is not directly observed. In this approach, we first anticipated the interlocutor behaviors, and then used this anticipated visual context to improve our listener prediction model. We modeled the visual influence using a latent variable sequential model.
5. Analysis and Wisdom-LMDE Model
In (de Kok et al., 2010), our preliminary analysis suggested that opinions of more than one person on the same task improve the overall learning performance. Based on this observation, we proposed a new approach for modeling wisdom of crowds in (Ozkan & Morency, 2011). Our approach presents an extension of the Latent Mixture of Discriminative Experts model that can automatically learn the prototypical patterns and hidden dynamic among different members of the crowd.
1.7 Summary of Other Achievements
I have also worked on social behaviors other than listener backchannel feedbacks that influenced my PhD thesis work. One specific topic is related to emotion recognition:
1. Audio-Visual Emotion Recognition
The emotional state of an individual often affects the way that he/she reacts to others (Darwin, 1872; Oatley et al., 2006). In (Ozkan et al., 2012), we presented an approach based on concatenated Hidden Markov Models (co-HMM) to infer dimensional and continuous emotion labels from multiple high level audio and visual cues. This approach had the advantage of explicitly learning the temporal relationships among the audio-visual data and the emotional labels. The first step of our approach involved generating a step-wise representation of the continuous emotion dimensions, in which we modeled the distribution of each emotion dimension by a set of discrete labels. In the second step, we built a generative model, co-HMM, that could estimate the most likely label at each sample. Using the co-HMM model allowed us to learn both the intrinsic dynamics within the same class label and the extrinsic dynamics among different classes. The affective dimensions analyzed in our work were arousal, expectancy, power, and valence. Our model was evaluated on the Second International Audio/Visual Emotion Challenge (AVEC 2012) dataset, and we took second place in the word-level sub-challenge. A complete description of the challenge and the dataset can be found in Schuller et al. (Bjorn Schuller, October 2012).
1.8 Outline
The rest of this thesis is organized as follows:
In Chapter 2, we present related work on human nonverbal behavior modeling, multimodal data integration, wisdom of crowds modeling, and feature selection and analysis.
We present our sparse feature selection scheme and self-feature consensus approach in Chapter 3, which addresses the High Dimensionality challenge.
The Latent Mixture of Discriminative Experts (LMDE) model is presented in Chapter 4 to address the Multimodal Integration challenge.
We present our context-based prediction framework in Chapter 5, which addresses the Visual Influence challenge.
In Chapter 6, we describe our framework for modeling wisdom of crowds based on the Latent Mixture of Discriminative Experts, which addresses the challenge of Variability in Nonverbal Behaviors.
We conclude with discussions and future works in Chapter 7.
Chapter 2
Related Work
To better understand the novel aspects of this thesis and to give technical background to the reader, this chapter presents related work in five core topics. Our first topic covers the literature on human nonverbal behavior modeling, and more specifically on listener backchannel prediction. Then, we present previous work on feature selection and analysis, multimodal data integration, and wisdom of crowds modeling. Finally, we review various context-based prediction/recognition approaches in the literature.
2.1 Human Nonverbal Behavior Modeling
Although humans can naturally encode and decode nonverbal signals in social contexts, computers are not equipped with such abilities. Therefore, supporting such fluid interactions has become an important topic in computer science research (Pantic et al., 2006). Many recognition models have been proposed to recognize different nonverbal behaviors (Mitra & Acharya, 2007; Sebea et al., 2005). On the other hand, some researchers focused on prediction models to encode the right nonverbal behaviors at appropriate times (Ward & Tsukahara, 2000; Maatman et al., 2005; Morency et al., 2008b).
Earlier work in social signal processing focused on multimodal dialogue systems where the gestures and speech may be constrained (Johnston, 1998; Jurafsky et al., 1998). Most of the research in social signal processing over the past decade fits within two main trends that have emerged: (1) recognition of individual multimodal actions such as speech and gestures (e.g., (Eisenstein et al., 2008; Frampton et al., 2009; Gravano et al., 2007)), and (2) recognition/summarization of the social interaction between more than one participant (e.g., meeting analysis (Heylen & op den Akker, 2007; Hsueh & Moore, 2007; Murray & Carenini, 2009; Jovanovic et al., 2006)).
In this thesis, we addressed four main issues (feature selection and analysis, multimodal integration, variability among people, and context-based prediction) involved in the task of listener backchannel prediction in dyadic conversations. In the rest of this section, we will summarize previous work on backchannel prediction.
2.1.1 Listener Backchannel Prediction
The application described in this thesis uses multimodal cues from one person to predict the social behavior of another participant. This type of predictive model has been mostly studied in the context of embodied conversational agents (Nakano et al., 2003; Nakano et al., 2007). Several researchers have developed models to predict when backchannels should happen. In general, these results are difficult to compare, as they utilize different corpora and present varying evaluation metrics. Ward and Tsukahara (Ward & Tsukahara, 2000) propose a unimodal approach where backchannels are associated with a region of low pitch lasting 110ms during speech. Models were produced manually through an analysis of English and Japanese conversational data. Fujie et al. (Fujie et al., 2004) use Hidden Markov Models to perform head nod recognition. In their approach, they combined head gesture detection with prosodic low-level features from the same person to determine strongly positive, weakly positive and negative responses to yes/no type utterances.
Maatman et al. (Maatman et al., 2005) present a multimodal approach where Ward and Tsukahara's prosodic algorithm is combined with a simple method of mimicking head nods. No formal evaluation of the predictive accuracy of the approach was provided, but subsequent evaluations have demonstrated that the generated behaviors do improve subjective feelings of rapport (Kang et al., 2008) and speech fluency (Gratch et al., 2007). Morency et al. (Morency et al., 2008b) showed that Conditional Random Field models can be used to learn predictive features of backchannel feedback. In their approach, multimodal features are simply concatenated in one large feature vector for the CRF model. They show statistical improvement when compared to the rule-based approach of Ward and Tsukahara (Ward & Tsukahara, 2000).
A related project on this topic is the Semaine Project of EU-FP7 (sem, ), which focuses on building Sensitive Artificial Listeners. Towards this effort, Gravano (Gravano, 2009) focused on backchannel-inviting cues as part of their study of turn-taking phenomena. They first analyzed individual acoustic, prosodic and textual backchannel-inviting cues; then, they investigated how such cues combine to form complex signals. In (Neiberg, 2012), Neiberg focused on the communicative functions of vocal feedback like "mhm", "okay" and "yeah, that's right". They categorize feedback as non-lexical, lexical and phrase-based feedback.
Different from prior studies, we present a multimodal approach for predicting listener backchannels that takes into account both the mutual influence between the listener and the speaker and the variability among listener behaviors.
2.2 Feature Selection and Analysis
Feature selection refers to the task of finding a subset of features (i.e. speaker observations) that are most relevant to the model and provide a good representation of the data. It alleviates the problem of overfitting by eliminating noisy features. With only the relevant features, a better understanding and analysis of the data is facilitated.
Feature selection techniques broadly fall into one of three categories (Saeys et al., 2007) (Blum & Langley, 1997):
Embedded methods, which concurrently select features during classifier construction.
Filter methods, which filter out irrelevant features by considering their correlation with the target function.
Wrapper methods, which embed the model hypothesis search within the feature subset search.
In Chapter 3, we will present a sparse feature selection method that falls into the third category. Although computationally expensive, we prefer a wrapper technique, since it considers all the input features at the same time. Therefore, it allows us to capture any dependencies/independencies among the features.
A well known feature selection technique based on L1 regularization was applied to conditional random fields in the robot tag domain (Vail, 2007). Based on the gradient-based feature selection method (grafting) in (Perkins et al., 2003), Riezler and Vasserman (Riezler & Vasserman, 2004) proposed an incremental feature selection technique for Maximum Entropy Modeling. For the task of listener backchannel prediction, Morency et al. (Morency et al., 2008b) proposed a greedy approach where the first feature is selected based on its performance on the task when used individually. Then, new features are selected incrementally based on their effect on the performance when added to the first feature. Different from this greedy approach, all features are present during the selection process in our sparse feature ranking scheme.
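For illustration, the regularization-path idea behind such sparse ranking can be mimicked with an off-the-shelf L1-penalized logistic regression, used here only as a stand-in for the CRF of Chapter 3: features are ranked by the order in which their weights first become non-zero as the penalty is relaxed.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Toy data standing in for multimodal speaker observations and backchannel labels.
    X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                               random_state=0)

    rank_order = []
    for C in np.logspace(-2, 2, 40):                 # small C = strong L1 penalty
        model = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X, y)
        for feat in np.flatnonzero(np.abs(model.coef_[0]) > 1e-6):
            if feat not in rank_order:               # record first entry into the path
                rank_order.append(int(feat))

    print("features ranked by entry into the regularization path:", rank_order)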
In Chapters 4 and 6, we use this sparse feature ranking method to analyze the relevant features of each expert for the task of listener backchannel prediction. Individual differences in nonverbal behaviors have been studied in psychology and sociology. Gallaher (Gallaher, 1992) analyzes individual differences in nonverbal behavior by focusing on intraindividual consistencies and individual differences in the way behaviors are performed by a person. In (Matsumoto, 2006), Matsumoto investigates the role of culture in the nonverbal communication process. Gender has also been studied as one of the influences on nonverbal behavior (Linda L. Carli & Loeber, 1995). The fusion technique proposed in Chapter 4 is a computational approach which enables a better analysis of individual differences in nonverbal behaviors.
2.3 Multimodal Data Integration
The multimodal fusion process involves integration of data from multiple sources, such as audio, video, and text. This fusion can be achieved at mainly three levels: early, late and intermediate (Atrey et al., 2010). Early fusion involves feature level integration, which combines the features before they are used by a learning model. Therefore, it exploits the correlation among all features (Pavlovic, 1998; Terry et al., 2008; Fox & Reilly, 2003). McCowan et al. (McCowan et al., 2005) presented a multimodal approach for recognition of group actions in meetings. In their experiments, early integration gives significantly better frame error rates than all approaches apart from the audio-visual Asynchronous Hidden Markov Model system, which is used to model the interactions between individuals. However, modeling temporal synchrony/asynchrony among modalities is a hard problem in early fusion, since features from different modalities do not always happen at the same time.
On the other hand, late fusion refers to decision level integration, in which the decisions (outputs) of individual modalities are fused together to form a final decision (Foo et al., 2004; Garg et al., 2003; Peng et al., 2009). This level of integration is usually more scalable than feature level integration, since the decisions from multiple modalities are all in the same format (i.e. probabilities over time). Snoek et al. (Snoek et al., 2005) compare early fusion and late fusion for semantic concept learning from multimodal video. In their experiments, late fusion gives better performance for most concepts; however, it comes at the cost of increased learning effort. For both early and late fusion, the classifiers are generic ones that are also used for unimodal data processing.
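The contrast between the two levels can be summarized in a few lines; the sketch below uses logistic regression on synthetic two-modality data (an illustrative stand-in, not a model from this thesis) to show feature-level concatenation versus averaging of per-modality decision probabilities.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(1)

    # Synthetic per-frame features for two modalities sharing one label.
    X_audio = rng.normal(size=(600, 4))
    X_video = rng.normal(size=(600, 3))
    y = ((X_audio[:, 0] + X_video[:, 0]) > 0).astype(int)

    # Early (feature-level) fusion: concatenate modality features before learning.
    early = LogisticRegression().fit(np.hstack([X_audio, X_video]), y)

    # Late (decision-level) fusion: one classifier per modality, then combine
    # their output probabilities (here a simple average).
    audio_clf = LogisticRegression().fit(X_audio, y)
    video_clf = LogisticRegression().fit(X_video, y)
    late_scores = (audio_clf.predict_proba(X_audio)[:, 1] +
                   video_clf.predict_proba(X_video)[:, 1]) / 2
    late_pred = (late_scores > 0.5).astype(int)

    print("early fusion accuracy:", early.score(np.hstack([X_audio, X_video]), y))
    print("late fusion accuracy: ", (late_pred == y).mean())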
In Chapter 4, we present a probabilistic model based on intermediate fusion of multimodal data. In intermediate fusion the integration is done at the model level. Factorial Hidden Markov Models (Ghahramani et al., 1997), Coupled Hidden Markov Models (Ghahramani et al., 1997) and Layered Hidden Markov Models (LHMMs) are examples of statistical models for intermediate fusion of audio-visual data. LHMMs were proposed in (Oliver et al., 2004) for modeling office activity from multiple sensory channels. LHMMs can be seen as a cascade of Hidden Markov Models, where each layer is trained independently and the results from a lower layer are used as input to an upper layer. Barnard and Odobez (Barnard & Odobez, 2005) use this framework in combination with unsupervised clustering of the data for event recognition in sports videos. Zhang et al. (Zhang et al., 2004) present a two-layer HMM framework for group action recognition in meetings. Individual actions of participants are recognized in the lower layer, and the output of this layer is used in the second layer to model group interactions. Different from earlier intermediate fusion techniques, our model relies on discriminative models that can learn the dynamic among different modalities.
Jordan et al. (Jordan, 1994) presented the Hierarchical Mixture of Experts (HME) based on probabilistic splits of the input space. HME models a mixture of component distributions referred to as experts, where the expert mixing ratios are set by gating functions. Bishop and Svensén (Bishop & Svensén, 2003) proposed a variant of this model called Bayesian HME (BME) based on variational inference. HME and BME are mainly used for solving static regression and classification problems. On the other hand, we propose a discriminative model for modeling sequential patterns, where we predict one label per time sample. Sminchisescu et al. (Sminchisescu et al., 2005) used the BME approach for discriminative inference in continuous chain models. Similar to our multimodal integration model proposed in Chapter 4, it can learn the mixing coefficients among experts. In addition to this, our model exploits a latent variable that allows multiple mixing coefficients. In other words, each hidden state in our multimodal integration model can represent a different set of mixing coefficients.
2.4 Context-based Prediction
During a human-human interaction, participants often adapt to each other's nonverbal behaviors (Ramseyer, 2011; Burgoon et al., 1995; Hatfield et al., 1992; Riek et al., 2010). This visual influence between the participants of a social communication plays an important role in various domains. For instance, people often have a more positive judgement of a stranger who mimics their behaviors (Gueguen et al., 2009). In a similar study, Luo et al. (Luo et al., 2013) conducted a pilot study on whether people favour virtual agents that act like them or not. Although it may not be their favorite one, the experiments indicated that people prefer agents that mimic their own gestures. Duggan et al. studied 34 video-recorded physician-patient interactions in (Duggan & Bradshaw, 2008). They proposed that physicians and patients become more similar in their nonverbal rapport-building behaviors while talking about roles and relationships.
While building computational models of human nonverbal behaviors, this mutual influence between the participants arises as context that should be used to enhance the prediction/recognition accuracies of these models. Morency et al. exploited the contextual information from other participants to predict visual feedback and improve recognition of head gestures in human-human interactions in (Morency et al., 2008a). Specifically, they used speaker contextual cues, such as gaze shifts or changes in prosody, for listener backchannel feedback recognition. Sun et al. exploited behavioral information expressed between two interlocutors to detect and identify mimicry (Sun et al., 2012). This information was then used to improve recognition of the interrelationship between the participants of a conversation.
In Chapter 5, we present a context-based framework for modeling the visual influence between the speaker and the listener in a dyadic conversation. We assume an environment in which the visual context from both participants is missing. Based on this assumption, we first model the visual gestures of the speaker and then use this context to improve the listener backchannel feedback prediction.
2.5 Wisdom of Crowds
Wisdom of crowds was first defined and used in the business world by Surowiecki (Surowiecki, 2004). Solomon showed that aggregation of individual decisions produces better decisions than those of either group deliberation or individual expert judgment (Solomon, 2006). Later, it gained great attention from other research areas as well, due to the low cost of data collection. Along with the rise of online services like Amazon Mechanical Turk (Schneider, ), crowdsourcing emerged as an easy way of labeling data. Besides developing new algorithms to process crowdsourced data, issues of labor standards have also been explored in various studies. Quinn et al. presented a classification system for human computation systems (Quinn & Bederson, 2011). Whitla (Whitla, 2009) examined how firms are utilizing crowdsourcing for the completion of marketing tasks. Callison-Burch et al. explored quality control and the variety of data types in using one of the most well-known crowdsourcing platforms, Amazon's Mechanical Turk, for collecting data for human language technologies (Callison-Burch & Dredze, 2010).
As data collection became easier with the use of public crowdsourcing platforms, several machine learning techniques have been developed for modeling wisdom of crowds. Snow et al. (Snow et al., 2008) showed that using non-expert labels for training machine learning algorithms can be as effective as using a gold standard annotation. Welinder et al. (Welinder et al., 2010a) proposed a Bayesian generative probabilistic model for image annotation that uses binary labels of images from many different annotators.
In (Welinder et al., 2010b), a generative Bayesian model was proposed for the annotation process. This model depends on an inference algorithm that estimates the properties of the data being labeled and the annotators labeling them. Given the binary labels of images from many different annotators, the model infers the underlying class of an image, as well as parameters such as image difficulty and annotator competence and bias.
Dredze et. al. (Dredze et al., 2009) presented learning algorithms for two standard
NLP tasks, where the data instances have multiple labels along with the presence of
noise. They assume a scenario where they are given an indication as to the probability
of the given labels being correct for each training instance. Their algorithm works
iteratively: it rst learns a sequence model trained on all given labels weighted by their
priors. Then, it updates the label distributions based on the likelihood assigned by the
learned sequence model.
Raykar et al. (Raykar et al., 2010) proposed a Bayesian approach for supervised
learning tasks for which multiple annotators provide labels but no absolute gold
standard exists. The proposed algorithm works iteratively by assigning a particular gold
standard, measuring the performance of the annotators given that gold standard, and
then refining the gold standard based on the performance measures.
Huang et al. (Huang et al., 2010) introduced the Parasocial Consensus Sampling
(PCS) paradigm, which allows gathering data from multiple individuals experiencing
the same situation. Instead of recording face-to-face interactions, participants in PCS
were asked to achieve a (given) communicative goal by interacting with a mediated
representation of a person. To obtain data for modeling listener backchannel feedback
with PCS, several participants were recruited to watch the same speaker videos and
were instructed to pretend to be an active listener while watching. To convey their
interest in the topic, they were asked to press a key whenever they felt like providing
backchannel feedback such as head nods or paraverbals (e.g., "umh" or "yeah").
In this thesis, we present a framework for modeling wisdom of crowds (Chapter 6).
By using a latent variable model, our framework is able to capture the hidden dynamics
among different annotators.
Chapter 3
Feature Selection and Analysis
3.1 Introduction
Human communication is a complicated phenomenon that includes many dimensions,
such as speech, gestures, and interlocutor behaviors. Therefore, one of the key challenges
in building computational models of social nonverbal behaviors is high dimensionality.
High dimensionality makes it hard to gain a good understanding and analysis of human
communication.
Feature selection addresses this issue of high dimensionality. The process of feature
selection involves finding a subset of features that are most relevant to the learned task.
The traditional approach for feature selection looks at the most relevant features across
all observations. In a scenario of social nonverbal behaviors, this means that all human
interactions in the dataset are used concurrently to select the relevant features. While
commonly used for simplicity, this group-feature approach has the potential to select
features that are not relevant to any specific individual but only to the average model.
In other words, this technique is likely to miss some discriminative features that are
specific to a subset of the population.
In this chapter, we present a feature selection approach to address this issue of
high dimensionality. Unlike the group-feature approach, we first look at the important
behaviors of each individual, called self-features, before building a consensus. This avoids
the problem of the group-feature approach, which focuses on the average model and
overlooks the inherent behavioral differences among people.
Figure 3.1: (a) Group-feature approach: features are selected by using all people's
observations at once. This model has the potential to miss some relevant features specific
to a subset of the population. (b) Self-feature consensus: the features of each person in
the data are ranked first; then we select the top n from these ranked lists of self-features
to construct an n-th order histogram of feature counts. In this figure, only the 1st and
2nd order histograms are shown.
Figure 3.1 compares our self-feature consensus approach to the typical group-feature
approach. To enable efficient feature selection, we propose a feature ranking scheme
based on a sparse regularization method called L1 regularization (Ng, 2004; Smith &
Osborne, 2005; Vail, 2007). This scheme is a non-greedy ranking method in which two
or more features can have the same rank, meaning that these features have joint influence
and should be selected together. Our sparse feature ranking approach can be applied to
both group-features and self-features.
We evaluate our approach on the task of listener feedback prediction (i.e., head nods)
in dyadic interactions. We use a sequential probabilistic model, Conditional Random
Fields, which has recently been used for backchannel prediction (Morency et al., 2008b).
The experiments are conducted on the RAPPORT dataset from (Gratch et al., 2007),
which contains 42 storytelling dyadic interactions.
The rest of this chapter is organized as follows. Our sparse ranking scheme is described
in Section 3.2, and Section 3.3 presents our self-feature consensus framework. In
Section 3.4, we explain the dataset, features, and evaluation metrics used in our
experiments and give results on the task of listener head-nod prediction. Finally, we
conclude with a discussion in Section 3.6.
3.2 Sparse Ranking
In this section, we present a feature ranking scheme that orders the features based on
their relevance to the learned task (i.e., listener backchannel prediction). Our feature
ranking scheme relies on a sparse regularization technique applied during training. We
use Conditional Random Fields as our prediction model, which has been successfully
used for backchannel prediction. For a better understanding, we first describe the
Conditional Random Fields model used in our experiments and then show how sparse
regularization enables feature ranking in a non-greedy manner.
3.2.1 Conditional Random Fields
Conditional Random Field (CRF) (Lafferty et al., 2001b) is a probabilistic discriminative
model for sequential data labeling. It is an undirected graphical model that defines a
single log-linear distribution over label sequences given a particular observation sequence.
A CRF learns a mapping between a sequence of multimodal observations $\mathbf{x} = \{x_1, x_2, \ldots, x_m\}$
and a sequence of labels $\mathbf{y} = \{y_1, y_2, \ldots, y_m\}$. Each $y_j$ is a class label
for the $j$-th frame of a video sequence and is a member of a set $\mathcal{Y}$ of possible class
labels, for example, $\mathcal{Y} = \{\text{head-nod}, \text{other-gesture}\}$. Each frame observation $x_j$ is
represented by a feature vector $\phi(x_j) \in \mathbb{R}^d$, for example, the prosodic features at each
sample.
Figure 3.2: A graphical representation of Conditional Random Fields.
Given the above definitions, the conditional probability of $\mathbf{y}$ is defined as follows:
$$P(\mathbf{y} \mid \mathbf{x}; \theta) = \frac{1}{Z(\mathbf{x})} \exp\Big( \sum_{k} \theta_{k} F_{k}(\mathbf{y}, \mathbf{x}) \Big) \qquad (3.1)$$
where $\theta$ is a vector of linear weights and $Z(\mathbf{x})$ is a normalization factor over all possible
states of $\mathbf{y}$. Each feature function $F_k$ is either a state function $s_k(y_j, \mathbf{x}, j)$ or a transition
function $t_k(y_{j-1}, y_j, \mathbf{x}, j)$. The state function $s_k$ depends on the correlation between the
label at position $j$ and the observation sequence, while the transition function $t_k$ depends
on the entire observation sequence and the labels at positions $j$ and $j-1$ in the label
sequence.
Given a training set consisting of $m$ labeled sequences $(\mathbf{x}_i, \mathbf{y}_i)$ for $i = 1 \ldots m$, training
of conditional random fields involves finding the optimal parameter set $\theta$ that maximizes
the following objective function:
$$L(\theta) = \sum_{i=1}^{m} \log P(\mathbf{y}_i \mid \mathbf{x}_i; \theta) \qquad (3.2)$$
which is the conditional log-likelihood of the training data.
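To make Equations 3.1 and 3.2 concrete, the following minimal Python sketch computes $P(\mathbf{y} \mid \mathbf{x}; \theta)$ for a toy two-label linear-chain CRF by brute-force enumeration of all label sequences. It is an illustration only, not the hCRF implementation used in our experiments; the feature values and weights are made up.

import itertools
import math

LABELS = ["head-nod", "other-gesture"]

def score(y_seq, x_seq, state_w, trans_w):
    # Sum of weighted state and transition feature functions (Equation 3.1).
    s = 0.0
    for j, y in enumerate(y_seq):
        # state functions: weighted correlation between the frame observation and its label
        s += sum(state_w[y][d] * x_seq[j][d] for d in range(len(x_seq[j])))
        if j > 0:
            s += trans_w[(y_seq[j - 1], y)]  # transition functions
    return s

def crf_probability(y_seq, x_seq, state_w, trans_w):
    # P(y | x; theta) = exp(score(y, x)) / Z(x), with Z(x) computed by enumeration.
    num = math.exp(score(y_seq, x_seq, state_w, trans_w))
    Z = sum(math.exp(score(cand, x_seq, state_w, trans_w))
            for cand in itertools.product(LABELS, repeat=len(x_seq)))
    return num / Z

# toy example: 3 frames with 2-dimensional observations (e.g., pause, lowness)
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
state_w = {"head-nod": [0.8, 1.2], "other-gesture": [0.1, 0.1]}
trans_w = {(a, b): (0.5 if a == b else -0.5) for a in LABELS for b in LABELS}
print(crf_probability(["other-gesture", "head-nod", "head-nod"], x, state_w, trans_w))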
3.2.2 Feature Ranking Method
Our method exploits a regularization technique that provides smoothing when the
number of learned parameters is very large compared to the size of the available data.
Figure 3.3: Example of sparse ranking using L1 regularization (x-axis: regularization
parameter $\lambda$). As $\lambda$ goes from higher to lower values, model parameters become non-zero
based on their relevance to the prediction model.
Using a regularization term in the optimization function during training can be seen as
assuming a prior distribution over the model parameters. The two most commonly used
priors are the Gaussian (L2 regularizer) and the Exponential (L1 regularizer) priors. A
Gaussian prior assumes that each model parameter is drawn independently from a
Gaussian distribution and penalizes according to the weighted square of the model
parameters:
$$R(\theta) = \lambda \|\theta\|_2^2 = \lambda \sum_i \theta_i^2 \qquad (3.3)$$
where $\theta$ denotes the model parameters and $\lambda > 0$. This conventional regularization
technique often outperforms the Exponential regularizer; however, the Gaussian
regularizer does not provide good intuition for selecting the relevant features.
On the other hand, an Exponential prior penalizes according to the weighted L1 norm
of the parameters and is defined as follows:
$$R(\theta) = \lambda \|\theta\|_1 = \lambda \sum_i |\theta_i| \qquad (3.4)$$
where $\theta$ denotes the model parameters and $\lambda > 0$. During training of the conditional
random fields, this regularization term is added as a penalty to the log-likelihood function
that is optimized. Therefore, Equation 3.2 becomes:
$$L(\theta) = \sum_{i=1}^{m} \log P(\mathbf{y}_i \mid \mathbf{x}_i; \theta) - R(\theta) \qquad (3.5)$$
Different from L2, L1 regularization results in a sparse parameter vector in which many
of the parameters are exactly zero (Tibshirani, 1994). Therefore, it has been widely
used in different domains for the purpose of feature selection (McCallum & R, 2003;
Vail, 2007). The $\lambda$ in Equation 3.4 determines how much penalty is applied by the
regularization term: larger values impose a larger penalty and thus produce a sparser
parameter vector.
Figure 3.3 shows an example of how $\lambda$ affects the model parameters. In this example,
we trained a single expert with 5 input features: EyeGazes, POS:NN, Utterance,
POS:IN, and EnergyEdge (see Section 3.4.3 for details about the feature representations).
The regularization path in Figure 3.3 was created by starting with a high regularization
penalty, where all the parameters are zero, and then gradually reducing the regularization
until all the parameters have non-zero values. In this path, if a parameter becomes
non-zero at an earlier stage (i.e., large $\lambda$), this signifies that the input feature associated
with this parameter is important. Our ranking scheme is based on this observation: we
rank the features in the order in which they become non-zero along the regularization
path. For the example shown in Figure 3.3, our algorithm ranks the features as follows:
(1) EyeGazes and POS:NN, (2) EnergyEdge, (3) Utterance and POS:IN. The pseudo
code for our sparse feature ranking approach is given in Algorithm 1.
The regularization penalty $\lambda$ determines how sparse the model should be. More
than one of the parameters in $\theta$ may become non-zero at any given regularization
factor. Therefore, our feature ranking scheme allows more than one feature to have the
same rank, meaning that these features have equivalent influence and should be selected
together.
Algorithm 1 Sparse Feature Ranking
ranked_features = empty
for λ = ∞ down to 0 do
    train a CRF with the current λ
    for all nonzero feature parameters θ_i do
        if f_i is NOT in ranked_features then
            ranked_features = {ranked_features, f_i}
        end if
    end for
end for
return ranked_features
Compared to other, greedy methods (e.g., (Morency et al., 2008b)), our sparse feature
ranking algorithm is not fully greedy in the sense that all features are present during
the selection process. Our algorithm is also much more efficient than the greedy method,
since its computational cost is determined by the number of regularization penalty
values used (in our experiments, we use 76 different λ values, which was enough to
obtain a ranking for all the features). The computational cost of the greedy approach,
on the other hand, increases with the number of features (a total of 1629 features are
used in our experiments).
With an L1 regularizer, the objective function defined in Equation 3.5 becomes
non-differentiable. Therefore, we use the Orthant-Wise Limited-memory Quasi-Newton
(OWL-QN) method (Andrew & Gao, 2007), an extension of the L-BFGS optimization
technique, for training L1-regularized log-linear models.
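As an illustration of the regularization-path ranking in Algorithm 1, the following minimal Python sketch ranks features by the order in which their weights become non-zero as the L1 penalty is relaxed. To keep the sketch self-contained, it uses scikit-learn's L1-regularized logistic regression as a stand-in for the L1-regularized CRF trained with OWL-QN in our experiments; in scikit-learn the parameter C is the inverse of the penalty, so sweeping C from small to large corresponds to sweeping λ from large to small. Function and variable names are illustrative.

import numpy as np
from sklearn.linear_model import LogisticRegression

def sparse_feature_ranking(X, y, feature_names, n_penalties=76):
    # Rank features by the order in which their weights become non-zero along a
    # decreasing L1-regularization path (cf. Algorithm 1). Features entering at
    # the same penalty value share the same rank.
    ranked, seen = [], set()
    # decreasing penalty path: large penalty (small C) -> small penalty (large C)
    for C in np.logspace(-3, 2, n_penalties):
        model = LogisticRegression(penalty="l1", C=C, solver="liblinear").fit(X, y)
        newly_active = [f for f, w in zip(feature_names, model.coef_.ravel())
                        if abs(w) > 1e-8 and f not in seen]
        if newly_active:
            ranked.append(newly_active)   # one rank, possibly several features
            seen.update(newly_active)
    return ranked

The returned list groups features that enter the model at the same penalty value, mirroring the non-greedy property of our ranking scheme.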
3.3 Consensus of Self-Features
To capture the relevant features for all the people in our dataset, we present a
self-feature consensus approach for feature selection. Figure 3.1(b) shows an overview of
our self-feature consensus approach. The first step of our algorithm is to find a ranked
subset of the most relevant features for each person individually. We refer to this subset
as self-features. Section 3.2 describes our feature ranking algorithm. Figure 3.1(a) shows
the typical group-feature approach for comparison.
Once the ranked lists of self-features are obtained for each listener in the dataset
using the feature ranking scheme explained in Section 3.2, we create a consensus over
self-features by using only the top n features of each list. A consensus is created by
first composing an n-th order histogram, which is simply a histogram of the features
that uses only the top n features from each self-feature list. This consensus provides
a ranking of self-features, and we expect the relevant features to be present in these
histograms. To remove possible outliers, we apply a threshold on the consensus features
to keep only a subset of relevant features. The intuition behind this threshold is that
the relevant features are expected to appear frequently in the top n of many self-feature
lists corresponding to different people, whereas outlier features would not appear as
often. The minimum required consensus threshold has been set to n + 1 for an n-th
order histogram in our experiments. Figure 3.1(b) shows two consensus examples: first
and second order histograms. Pseudo code of our algorithm is given in Algorithm 2.
Algorithm 2 Self-Feature Consensus
selected_features = empty
for each listener i do
    find self-features F_i
    F_i^n = top n features in F_i
end for
Hist = histogram of all F_i^n's
selected_features = features in Hist with count > minDesiredConsensus
return selected_features
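The following minimal Python sketch mirrors Algorithm 2: it builds the n-th order histogram over each listener's top-n self-features and keeps the features whose counts reach the consensus threshold of n + 1. The input is assumed to be one ranked self-feature list per listener, as produced by the sparse ranking step.

from collections import Counter

def self_feature_consensus(self_features, n):
    # Build the n-th order histogram over each listener's top-n self-features and
    # keep features whose count reaches the consensus threshold (n + 1 here, as in
    # our experiments).
    hist = Counter()
    for ranked in self_features:          # one ranked self-feature list per listener
        hist.update(ranked[:n])           # top-n features of this listener
    min_consensus = n + 1
    return [f for f, count in hist.items() if count >= min_consensus]

# example: 1st-order consensus over toy self-feature lists
lists = [["EyeGaze", "Pause"], ["EyeGaze", "POS:NN"], ["EyeGaze", "Lowness"]]
print(self_feature_consensus(lists, n=1))   # -> ["EyeGaze"] (count 3 >= 2)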
3.4 Experiments
We test the validity of our approach on the multimodal task of predicting listener
nonverbal backchannel (i.e., listener head nods). As mentioned in Section 3.1, backchannel
feedback prediction has received considerable interest due to its pervasiveness across
languages and conversational contexts (Maatman et al., 2005; Morency et al., 2008b).
In the following subsections, we first describe the data and the backchannel annotations.
Then, we present the multimodal features and the experimental methodology.
3.4.1 The Dataset
We use the RAPPORT dataset (Gratch et al., 2007), freely available at
http://rapport.ict.usc.edu/, which contains 42 dyadic interactions between a speaker and
a listener. Data is drawn from a study of face-to-face narrative discourse ("quasi-monologic"
storytelling). In this dataset, participants in groups of two were told they were
participating in a study to evaluate a communicative technology. Subjects were randomly
assigned the role of speaker or listener.
The speaker viewed a short segment of a video clip taken from the Edge Training Systems,
Inc. Sexual Harassment Awareness video. After the speaker finished viewing the video,
the listener was led back into the computer room, where the speaker was instructed to
retell the stories portrayed in the clips to the listener. The listener was asked not to talk
during the story retelling. Elicited stories were approximately two minutes in length on
average. Participants sat approximately 8 feet apart. All video sequences were manually
transcribed and manually annotated to determine the ground-truth backchannels. The
next section describes our annotation procedure.
3.4.2 Backchannel Annotations
In our experiments, we focus on visual backchannels: head nods. A head nod gesture
starts when the person starts moving his/her head vertically. The head nod gesture
ends when the person stops moving or when a new head nod is started. A new head
nod starts if the amplitude of the current head cycle is higher than that of the previous
head cycle. Some listeners' responses may be longer than others, although they all
correspond to a single response. In our data, annotators found a total of 666 head nods.
The duration of these nods varied from 0.16 seconds to 7.73 seconds; the mean and
standard deviation of the backchannel durations are 1.6 and 1.2 seconds, respectively.
The minimum number of head nods given by one listener during one interaction is 1
and the maximum is 47; the mean and standard deviation are 14.8 and 10.9, respectively.
Following Ward and Tsukahara's (Ward & Tsukahara, 2000) original work on backchannel
prediction, we train our model to predict only the start time of the backchannel cue
(i.e., head nod). Following Ward and Tsukahara (2000) again, we define the backchannel
duration as a window of 1.0 second centered around the start time of the backchannel.
A backchannel cue is correctly predicted if at least one prediction of our model happens
during this 1.0-second window. All models tested in this chapter use this same testing
backchannel duration of 1.0 second.
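For clarity, the evaluation rule described above can be written as a small Python check; the prediction times and the 1.0-second window are the only inputs, and the variable names are illustrative.

def correctly_predicted(backchannel_start, prediction_times, window=1.0):
    # A ground-truth backchannel is counted as correctly predicted if at least one
    # model prediction falls inside a `window`-second interval centered on the
    # annotated backchannel start time.
    lo = backchannel_start - window / 2.0
    hi = backchannel_start + window / 2.0
    return any(lo <= t <= hi for t in prediction_times)

print(correctly_predicted(12.3, [5.0, 12.6, 20.1]))   # True: 12.6 lies in [11.8, 12.8]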
3.4.3 Multimodal Features and Encodings
We use four different types of multimodal features in our models: prosodic, lexical,
part-of-speech, and visual gesture features. Prosody refers to the rhythm, pitch, and
intonation of speech. Several studies have demonstrated that listener feedback is
correlated with a speaker's prosody (Nishimura et al., 2007). Listener feedback often
follows speaker pauses or filled pauses such as "um" (see (Cathcart et al., 2003b)). We
encode the following prosodic features, including standard linguistic annotations and
the prosodic features suggested by Ward and Tsukahara (Ward & Tsukahara, 2000):
downslopes in pitch continuing for at least 40ms; regions of pitch lower than the 26th
percentile continuing for at least 110ms (i.e., lowness); drop or rise in energy of speech
(i.e., energy edge); fast drop or rise in energy of speech (i.e., energy fast edge); vowel
volume (i.e., vowels are usually spoken softer); and pause in speech (i.e., no speech).
Gestures performed by the speaker are often correlated with listener feedback (Burgoon
et al., 1995). Eye gaze, in particular, has often been implicated as eliciting listener
feedback. Thus, we encode speaker looking at the listener as our visual gesture feature.
Some studies have suggested an association between lexical features and listener
feedback (Cathcart et al., 2003b). Therefore, we include the top 100 individual words
(i.e., unigrams), selected based on their frequency in the data.
Finally, we attempt to capture syntactic information that may provide relevant cues
by extracting features from a syntactic dependency structure corresponding to the
utterance. Using a part-of-speech tagger (Sagae & Tsujii, 2007a), we extract the
part-of-speech tag of each word (e.g., noun, verb, etc.) as our part-of-speech (POS)
features.
We encode our features using 13 different encodings derived from three encoding
templates, as introduced by (Morency et al., 2008b). The purpose of this encoding
dictionary is to capture different relationships between speaker features and listener
backchannels. For instance, listener backchannels sometimes occur with a delay after a
speaker feature, or only when the speaker feature has been present for a certain amount
of time, and its influence may not be constant over time. To automatically capture these
relations, we use three encoding templates in our experiments: a binary encoding,
designed for speaker features whose influence on listener backchannel is constrained to
the duration of the speaker feature; a step function, a version of the binary encoding
with two additional parameters, the width of the encoded feature and the delay between
the start of the feature and its encoded version; and a ramp function, which linearly
decreases over a set period of time (width parameter). The step and ramp functions are
each used with 6 different (width, delay) parameter pairs: (0.5, 0.0), (1.0, 0.0), (0.5, 0.5),
(1.0, 0.5), (0.5, 1.0), (1.0, 1.0) for step, and (0.5, 1.0), (1.0, 1.0), (2.0, 1.0), (0.5, 0),
(1.0, 0), (2.0, 0) for ramp.
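The following Python sketch illustrates one possible realization of the step and ramp templates on a per-frame binary speaker-feature signal (the binary encoding is the raw signal itself). The frame rate of 30 fps and the anchoring of each template to feature onsets are assumptions made for the sketch; the exact definitions in (Morency et al., 2008b) may differ in detail.

def onsets(signal):
    # Frame indices where a binary speaker feature turns on.
    return [j for j, v in enumerate(signal) if v and (j == 0 or not signal[j - 1])]

def step_encoding(signal, width, delay, fps=30):
    # 1.0 for `width` seconds, starting `delay` seconds after each feature onset.
    out = [0.0] * len(signal)
    for s in onsets(signal):
        a, b = s + int(delay * fps), s + int((delay + width) * fps)
        for j in range(a, min(b, len(signal))):
            out[j] = 1.0
    return out

def ramp_encoding(signal, width, delay, fps=30):
    # Linearly decreasing from 1.0 to 0.0 over `width` seconds after the delay.
    out = [0.0] * len(signal)
    for s in onsets(signal):
        a, n = s + int(delay * fps), int(width * fps)
        for k in range(n):
            if a + k < len(signal):
                out[a + k] = max(out[a + k], 1.0 - k / n)
    return out

# binary encoding is the raw signal itself; e.g., step(1.0, 0.5) on a toy pause signal:
pause = [0] * 10 + [1] * 15 + [0] * 30
encoded = step_encoding(pause, width=1.0, delay=0.5)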
3.4.4 Methodology
We performed hold-out testing by randomly selecting a subset of 10 interactions (out of
42) for the test set. The training set contains the remaining 32 dyadic interactions. All
models evaluated in this chapter were trained with the same training and test sets.
Figure 3.4: Methodology: the output of our prediction model is the probability of
providing a backchannel over time. During testing, we first find all the peaks in this
probability curve (left). Then, we apply a threshold on these peaks and use all the peaks
above this threshold as the final predictions of the model (right). The threshold value is
learned automatically during validation.
The test set does not contain individuals from the training set. Validation of model
parameters was performed using a 3-fold strategy on the training set. For L1
regularization, $\lambda$ ranged over $1000 \cdot 0.95^k$ with $k = [20, 22, \ldots, 170]$. For L2
regularization, the validated range was $10^k$ with $k = [-3, \ldots, 3]$. The training of CRF
models was done using the hCRF library (hCRF, 2007).
Performance is measured using the F-score, the weighted harmonic mean of precision
and recall:
$$F = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \qquad (3.6)$$
Precision is the probability that predicted backchannels correspond to actual listener
behavior. Recall is the probability that a backchannel produced by a listener in our
test set was predicted by the model. We use the same weight for both precision and
recall, the so-called F1. During validation, we find all the peaks (i.e., local maxima) in
the marginal probabilities. These backchannel hypotheses are filtered using the optimal
threshold from the validation set.
Table 3.1: Group-features with sparse ranking. We incrementally add features as they
appear in the regularization path and retrain. Each row shows the features added at
that stage, so the model at that stage is retrained with these new features plus the
features above it. The final row shows the values obtained using all the features instead
of feature selection.
Features | Precision | Recall | F1
EyeGazes-binary | 0.16469 | 0.14164 | 0.1523
... + POS:NN-step(1,.5), VowelVolume-step(.5,1) | 0.15281 | 0.25903 | 0.19222
... + Pause-step(1,0), Lowness-step(1,.5) | 0.19818 | 0.37516 | 0.25935
... + POS:NN-step(1,1) | 0.2002 | 0.1918 | 0.19591
... + Lowness-step(1,0), VowelVolume-step(.5,.5) | 0.20512 | 0.1943 | 0.19956
Baseline: all features (no feature selection) | 0.1643 | 0.6079 | 0.2587
A backchannel (i.e., head nod) is predicted correctly if a peak with a high enough
probability happens during an actual listener backchannel. A graphical representation
of this process is presented in Figure 3.4.
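The testing procedure above can be summarized with the following minimal Python sketch: peaks of the marginal probabilities are kept if they exceed the validated threshold, and predictions and ground-truth backchannels are matched using the 1.0-second window from Section 3.4.2. The frame rate and the matching rule are illustrative assumptions.

def find_peaks(probs):
    # Local maxima of the per-frame backchannel probabilities.
    return [j for j in range(1, len(probs) - 1)
            if probs[j] > probs[j - 1] and probs[j] >= probs[j + 1]]

def f1_at_threshold(probs, truth_starts, threshold, fps=30, window=1.0):
    # Keep peaks above `threshold` as predictions, then score them against the
    # ground-truth backchannel start times with the 1.0-second evaluation window.
    pred = [j / fps for j in find_peaks(probs) if probs[j] >= threshold]
    hit_truth = sum(any(abs(t - s) <= window / 2 for t in pred) for s in truth_starts)
    hit_pred = sum(any(abs(t - s) <= window / 2 for s in truth_starts) for t in pred)
    recall = hit_truth / len(truth_starts) if truth_starts else 0.0
    precision = hit_pred / len(pred) if pred else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

During validation, the threshold that maximizes this F1 value on the validation folds would be selected and reused at test time.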
3.5 Results and Discussion
The goal of our experiments is four-fold: evaluating (1) the group-feature approach
using our sparse ranking, (2) the effect of the order parameter on self-feature consensus,
(3) the selected self-features, and (4) the comparison of self-feature consensus to the
group-feature approach.
Group-features:
For the first experiment, we applied our sparse ranking scheme in a group-feature
manner, concurrently using all the training samples. To show the effect of sparse ranking,
we train a separate CRF for each subset of group-features. For comparison, we trained
another CRF using all features (1833 encoded features). Both CRF approaches were
retrained using L2 regularization, following previous work on CRF-based backchannel
prediction (Morency et al., 2008b).
Table 3.2: Selected features with self-feature consensus using histograms of different
orders (after outlier rejection).
1st Order | 2nd Order | 3rd Order
POS:NN-step(1,1) | POS:NN-step(1,1) | POS:NN-step(1,1)
Utterance-binary | POS:NN-step(1,.5) | POS:NN-step(1,.5)
EyeGaze-binary | Utterance-binary | Utterance-binary
Pause-binary | EyeGaze-binary | EyeGaze-binary
POS:DT-step(1,.5) | EyeGaze-step(1,1) | Pause-step(1,0)
Lowness-step(1,0) | Pause-binary | POS:DT-step(1,.5)
 | Pause-step(1,0) | Lowness-step(1,0)
 | POS:DT-step(1,.5) | Lowness-step(1,.5)
 | | Lowness-step(1,0)
L1 regularization was still used during the sparse ranking step. Precision, recall, and
F-1 values are given in Table 3.1. In each row, features are added as they appear in the
L1 regularization path of our sparse ranking scheme. The best performance occurs in
the third step, with five selected features and an F-1 value of 0.25935. The last row of
Table 3.1 reports the performance when no feature selection is applied (all features are
used). This result shows that sparse ranking can find a relevant subset of features with
performance similar to the baseline model that contains all features.
For the same listener backchannel prediction task, Morency et al. (Morency et al.,
2008b) used a greedy-forward feature selection method on the RAPPORT dataset.
Although the experimental setup was slightly different (i.e., different test and train sets
were used), the best precision, recall, and F-1 values achieved with this method were
0.1862, 0.4106, and 0.2236, respectively.
Effect of Order Parameter:
Our second experiment studies the effect of the order parameter in our self-feature
consensus. We constructed feature histograms with orders 1, 2, and 3 by looking at the
top 1st, 2nd, and 3rd features in each list. Then, we applied thresholds of 2, 3, and 4,
respectively, on the histograms for outlier rejection. The features selected for each order
are listed in Table 3.2. This result is interesting because the same features appear in all
three consensus histograms.
Table 3.3: Precision, recall and F-1 values of the retrained CRFs for the group-feature
approach and the self-feature consensus.
Method | Precision | Recall | F-1
Self-feature consensus, Order 1 | 0.2192 | 0.4939 | 0.3037
Self-feature consensus, Order 2 | 0.23802 | 0.48628 | 0.3196
Self-feature consensus, Order 3 | 0.24449 | 0.28211 | 0.26196
Group-feature approach | 0.19818 | 0.37516 | 0.25935
Baseline: all features | 0.1643 | 0.6079 | 0.2587
This indicates that our self-feature consensus approach is robust to different order values.
Feature Analysis:
For our third experiment, we analyze the features selected for our task of head-nod
prediction. It is interesting that some features are selected by both the self-feature
consensus and the group-feature approach, such as Pause, EyeGaze, Lowness, and
POS:NN. Utterance and POS:DT are the two features selected by the self-feature
consensus approach that do not appear in the top 20 features from the group-feature
approach. POS:DT refers to determiners, such as "the", "this", and "that". Utterance
refers to the beginning of an utterance. Taken together, these two features represent
moments where the speaker starts an utterance with a determiner. To show the relative
importance of the Utterance and POS:DT features, we added these two features to the
list of features obtained by the group-feature approach and trained a new CRF model.
Precision, recall, and F-1 values are 0.21685, 0.38653, and 0.27783, respectively. We see
an improvement over the group-feature approach, showing the importance of self-feature
consensus.
Group-features vs. Self-features:
Our last experiment compares our self-feature consensus approach to the typical
group-feature approach. Using the selected self-features from Table 3.2, we retrained
L2-regularized CRFs over all training instances. Precision and recall values for these
retrained CRFs of the self-feature consensus and the group-feature approach (best result
from the first experiment) are given in Table 3.3. The best F-1 value, achieved with the
2nd order histogram, is 0.3196. Moreover, all three self-feature consensus models achieve
better F-1 scores than the group-feature approach and the CRF trained with all features
(i.e., no feature selection). These results show that using self-features improves listener
backchannel prediction.
3.6 Conclusion
Nonverbal behaviors play an important role in human social interactions, and a key
challenge is to build computational models for understanding and analyzing this
communication dynamic. In this chapter, we proposed a framework for finding the
important features involved in human nonverbal communication. Our self-feature
consensus approach first looks at important behaviors for each individual before building
a consensus. This avoids the problem of the group-feature approach, which focuses on
the average model and overlooks the inherent behavioral differences among people. We
also proposed a feature ranking scheme exploiting the L1 regularization technique. This
scheme relies on the fact that putting more penalty on the model parameters results in
a sparser parameter vector in which only the important features are promoted.
Our framework was tested on the task of listener head-nod prediction in dyadic
interactions. We used the RAPPORT dataset, which contains 42 dyadic interactions
between a speaker and a listener. The results are promising and show an improvement
over the traditional group-feature approach.
Chapter 4
Multimodal Integration
4.1 Introduction
During face-to-face conversations, participants dynamically send and respond to
nonverbal signals from different channels (i.e., visual, acoustic, and verbal). These
different channels contain complementary information essential to the interpretation and
understanding of human behaviors (Oviatt, 1999). Psycholinguistic studies also suggest
that gesture and speech come from a single underlying mental process, and that they are
related both temporally and semantically (McNeill, 1992; Cassell & Stone, 1999; Kendon,
2004).
A good example of such complementarity is how people naturally integrate speech,
gestures, and higher-level language to predict when to give backchannel feedback.
Building computational models of such a predictive process is challenging since it involves
micro dynamics and temporal relationships between cues from different modalities (Quek,
2003). Figure 4.1 shows an example of a multimodal backchannel (listener head nod)
prediction moment: a temporal sequence in which the speaker reaches the end of a
segment (syntactic feature) with a low pitch and looks at the listener before pausing is a
good opportunity for the listener to give nonverbal feedback (e.g., a head nod).
Figure 4.1: Example of a multimodal prediction model: listener nonverbal backchannel
prediction based on the speaker's speech (pitch, words) and visual gestures (gaze), with
the predicted probability P(nod) over time. As the speaker says the word "her", which is
the end of the clause ("her" is also the object of the verb "bothering"), and lowers the
pitch while looking back at the listener and eventually pausing, the listener is then very
likely to head nod (i.e., nonverbal backchannel).
The problem of multimodal behavior prediction has several requirements for effective
and efficient fusion of the multiple modalities (visual, lexical, prosodic, and syntactic
information). A good fusion process should allow re-weighting of noisy channels: while
increasing the weight of modalities that are more successful at the prediction task, it
should reduce the weight of modalities that are noisier. A second requirement is that
effective training should be possible even with a limited amount of data. Furthermore,
the fusion process should be interpretable, allowing the analysis of each modality.
In this chapter, we introduce a new probabilistic model called Latent Mixture of
Discriminative Experts (LMDE), which directly addresses these three issues. A graphical
representation of LMDE is given in Figure 4.2. One of the main advantages of our
computational model is that it can automatically discover the hidden structure among
modalities and learn the dynamics between them. Since a separate expert is learned for
each modality, effective training is possible even with a limited amount of data.
Figure 4.2: Latent Mixture of Discriminative Experts: our new dynamic model for
multimodal fusion. In this graphical representation, $x_j$ represents the $j$-th multimodal
observation, $h_j$ is a hidden state assigned to $x_j$, and $y_j$ is the class label of $x_j$ (the
expert models are indexed by $\alpha$). Gray circles are latent variables. The micro dynamics
and multimodal temporal relationships are automatically learned by the hidden states
$h_j$ during the learning phase.
Furthermore, our learning process provides grounds for better model interpretability: by
analyzing each expert, the most important features in each modality, relevant to the
task, can be identified.
We present an empirical evaluation on the task of backchannel feedback prediction,
confirming the importance of combining different types of multimodal features.
Backchannel feedback includes the nods and paraverbals such as "uh-huh" and "mm-hmm"
that listeners produce while someone else is speaking. Predicting when to give backchannel
feedback is a good example of complementary information, for which people naturally
integrate speech, gestures, and higher-level linguistic features. Figure 4.1 shows an
example of backchannel prediction where a listener head nod is likely. These prediction
models have broad applicability, including the improvement of nonverbal behavior
recognition, the synthesis of natural animations for robots and virtual humans, the
training of culture-specific nonverbal behaviors, and the diagnosis of social disorders
(e.g., autism spectrum disorder).
One last issue directly addressed in this chapter is the evaluation metric for our
multimodal prediction model. Listener feedback varies among people and is often
optional (listeners can always decide whether or not to give feedback). Therefore,
traditional error measurements (i.e., recall, precision, F-score) may not always be
adequate to evaluate the performance of a prediction model. In this chapter, we propose
a new error measurement called User-adaptive Prediction Accuracy (UPA), which takes
into account the differences in people's nonverbal responses.
Our experiments are performed on 45 storytelling dyadic interactions from the
Rapport dataset (Gratch et al., 2007), which we also used for our experiments in
Section 3.4. We compare our LMDE model with previous approaches based on
Conditional Random Fields (CRF) (Lafferty et al., 2001b), Latent-Dynamic CRFs
(Morency et al., 2007), and CRF Mixture of Experts (a.k.a. Logarithmic Opinion Pools
(Smith et al., 2005)), as well as a rule-based random predictor (Ward & Tsukahara, 2000).
All the results are validated with our User-adaptive Prediction Accuracy as well as
traditional error measurements such as the F1-score. We also provide an analysis of the
most important features for each modality and give an intuition on why our intermediate
fusion approach improves prediction performance.
The rest of this chapter is organized as follows. We first present our Latent Mixture
of Discriminative Experts model in Section 4.2. We discuss the challenges in multimodal
prediction modeling and describe our error computation metric in Section 4.3. The
experimental setup is explained in Section 4.4, and results and discussion are given in
Section 4.5. Finally, we conclude with future research directions in Section 4.6.
4.2 Latent Mixture of Discriminative Experts
The task of multimodal prediction involves effective and efficient fusion of information
from multiple sources. One of the desired characteristics of a good prediction model is
that it should be able to learn the temporal relationships between modalities. In this
chapter, we introduce a multimodal fusion algorithm called Latent Mixture of
Discriminative Experts (shown in Figure 4.2) that addresses important challenges
involved in multimodal data processing: (1) the hidden states of LMDE can automatically
learn the hidden dynamics between modalities; (2) by training separate experts, we
improve the prediction performance even with a limited amount of data; and (3) LMDE
provides interpretability of the modalities, which can be accomplished through expert
analysis.
The task of our LMDE model is to learn a mapping between a sequence of multimodal
observations $\mathbf{x} = \{x_1, x_2, \ldots, x_m\}$ and a sequence of labels $\mathbf{y} = \{y_1, y_2, \ldots, y_m\}$. Each
$y_j$ is a class label for the $j$-th frame of a video sequence and is a member of a set $\mathcal{Y}$
of possible class labels, for example, $\mathcal{Y} = \{\text{backchannel}, \text{no-feedback}\}$. Each frame
observation $x_j$ is represented by a feature vector in $\mathbb{R}^d$, for example, the prosodic
features at each sample. For each sequence, we also assume a vector of "sub-structure"
variables $\mathbf{h} = \{h_1, h_2, \ldots, h_m\}$. These variables are not observed in the training examples
and therefore form a set of hidden variables in the model. Each $h_j$ is a member of a set
$\mathcal{H}_{y_j}$ of possible hidden states for the class label $y_j$. $\mathcal{H}$, the set of all possible hidden
states, is defined as the union of all $\mathcal{H}_y$ sets.
In the rest of this section, we first provide some intuitions motivating our model,
then present the details of our LMDE model, explain how we learn the model parameters,
and finally describe how inference is performed.
4.2.1 Motivation
To illustrate how our LMDE can use its latent variables to learn the hidden temporal
relationship between modalities, we present an example (shown in Figure 4.3) based
on the application of predicting listener responses known as backchannel feedback. In
this scenario, the goal is to predict when a listener is most likely to produce a head nod
(i.e., the label $y_{j+3}$) given the input features extracted from the speaker's actions. In
our LMDE model, each source of information (e.g., visual, lexical, auditory) is modeled
by an expert.
Figure 4.3: An example of how the hidden variables of our LMDE model can learn the
temporal dynamics and asynchrony between modalities (two speaker cues, low pitch and
pause, observed over time steps $j$ to $j+5$).
In our example, we have two experts: pause/talking (orange) and low-pitch regions in
speech (green). Figure 4.3 shows the speaker talking with low pitch at time $j+1$. We
know from the literature (Ward & Tsukahara, 2000) that listeners are more likely to give
backchannel feedback (1) during a pause and (2) shortly after a region of low pitch
(usually around 700ms after the low-pitch region). Our LMDE model can easily learn
this asynchronous temporal relationship between speaker pause, speaker low-pitch region,
and listener response by using only two hidden states per label (i.e., $|\mathcal{H}_y| = 2$).
In Figure 4.3, the first two hidden states (light gray circles) of each hidden variable
$h_j$ are associated with the label no-feedback and the last two hidden states (dark gray
circles) are associated with the label backchannel. At time $j$, the speaker is talking and
none of the experts is active. Then, we see a low-pitch region at time $j+1$, which
activates hidden state 2. At time $j+2$, the speaker is still talking but with no low-pitch
region. Note that since the second hidden state was activated at time $j+1$ by the
low-pitch region, the same hidden state stays active (this is possible because of the
transition weights learned during training of our LMDE model; see Section 4.2.2). This
is an example
where our LMDE model exhibits memory functionality through its hidden variables $h_j$.
At time $j+3$, hidden state 3 is activated due to a pause in the speaker's speech, which
triggers the prediction of a listener backchannel at that point in time. Then, at time
$j+4$, the LMDE model returns to hidden state 1 when the speaker starts talking again.
No head nod is predicted at time $j+5$, even though the speaker paused, because no
low-pitch region occurred earlier. Another important aspect of the LMDE model,
illustrated in Figure 4.8 of our experimental results (see Section 4.5), is that the latent
variables $h_j$ can learn multiple mixtures of experts, with one set of mixture weights per
hidden state. More details on the LMDE model and the latent variables are given in
the following subsections.
the following subsections.
4.2.2 LMDE Model
Following Morency et al. (Morency et al., 2007), we define our LMDE model as follows:
$$P(\mathbf{y} \mid \mathbf{x}; \Lambda) = \sum_{\mathbf{h}} P(\mathbf{y} \mid \mathbf{h}, \mathbf{x}; \Lambda) \, P(\mathbf{h} \mid \mathbf{x}; \Lambda) \qquad (4.1)$$
where $\Lambda$ denotes the model parameters learned during training.
To keep training and inference tractable, Morency et al. (Morency et al., 2007)
restrict the model to have disjoint sets of hidden states $\mathcal{H}_{y_j}$ associated with each class
label. Since sequences that have any $h_j \notin \mathcal{H}_{y_j}$ will by definition have $P(\mathbf{y} \mid \mathbf{h}, \mathbf{x}; \Lambda) = 0$,
the latent conditional model becomes:
$$P(\mathbf{y} \mid \mathbf{x}; \Lambda) = \sum_{\mathbf{h} : \forall h_j \in \mathcal{H}_{y_j}} P(\mathbf{h} \mid \mathbf{x}; \Lambda) \qquad (4.2)$$
where
$$P(\mathbf{h} \mid \mathbf{x}; \Lambda) = \frac{\exp\Big( \sum_{l} \Lambda_{l} \, T_{l}(\mathbf{h}) + \sum_{s} \Lambda_{s} \, S_{s}(\mathbf{h}, \mathbf{x}) \Big)}{Z(\mathbf{x}; \Lambda)} \qquad (4.3)$$
For convenience, we split $\Lambda$ into two parts: $\Lambda_l$, the parameters related to the transitions
between hidden states, and $\Lambda_s$, the parameters related to the relationships between the
expert outputs and the hidden states $h_j$. $Z$ is the partition function, and $T_l(\mathbf{h})$ is
defined as follows:
$$T_{l}(\mathbf{h}) = \sum_{j} t_{l}(h_{j-1}, h_{j}, j) \qquad (4.4)$$
where $j$ corresponds to the frame index and $t_{l}(h_{j-1}, h_{j}, j)$ is the transition function.
Each $t_{l}(h_{j-1}, h_{j}, j)$ depends on a pair of hidden variables in the model. The index $l$
ranges over all possible transitions between different hidden states.
What differentiates our LMDE model from the original work of Morency et al. is the
definition of $S_s(\mathbf{h}, \mathbf{x})$:
$$S_{s}(\mathbf{h}, \mathbf{x}) = \sum_{j} s_{s}(h_{j}, \psi(\mathbf{x}, j)) \qquad (4.5)$$
where
$$\psi(\mathbf{x}, j) = [\, q_{j}^{1} \;\; q_{j}^{2} \;\; \ldots \;\; q_{j}^{\alpha} \;\; \ldots \;\; q_{j}^{|e|} \,] \qquad (4.6)$$
and $|e|$ is the total number of experts. Each $s_{s}(h_{j}, \psi(\mathbf{x}, j))$ is a state function that
depends on a single hidden variable $h_{j}$ and the expert output vector $\psi(\mathbf{x}, j)$. The total
number of indices $s$ is equal to the number of experts $|e|$ times the total number of
hidden states $|\mathcal{H}|$. Each transition/state function is associated with a value in the
corresponding model parameters ($\Lambda_l$ and $\Lambda_s$), which can be seen as a weight assigned
to this function. For each hidden state in $\mathcal{H}_{y_j}$, there is a subset of $|e|$ model parameters
in $\Lambda_s$ weighting the different expert outputs. Therefore, using more than one hidden
state per label allows us to learn multiple mixtures of experts. Each $q_{j}^{\alpha}$ is the marginal
probability assigned by expert $\alpha$ at frame $j$ and equals $P_{\alpha}(y_{j} = a \mid \mathbf{x}; \theta_{\alpha})$. Each expert
conditional distribution $P_{\alpha}(\mathbf{y} \mid \mathbf{x}; \theta_{\alpha})$ is defined using the usual conditional random
field formulation:
$$P_{\alpha}(\mathbf{y} \mid \mathbf{x}; \theta_{\alpha}) = \frac{\exp\Big( \sum_{k} \theta_{\alpha,k} \, F_{\alpha,k}(\mathbf{y}, \mathbf{x}) \Big)}{Z_{\alpha}(\mathbf{x}; \theta_{\alpha})} \qquad (4.7)$$
where $\theta_{\alpha}$ represents the model parameters of expert $\alpha$. $F_{\alpha,k}$ is defined as
$$F_{\alpha,k}(\mathbf{y}, \mathbf{x}) = \sum_{j=1}^{m} f_{\alpha,k}(y_{j-1}, y_{j}, \mathbf{x}, j)$$
and each feature function $f_{\alpha,k}(y_{j-1}, y_{j}, \mathbf{x}, j)$ is either a state function $s_{k}(y_{j}, \mathbf{x}, j)$ or a
transition function $t_{k}(y_{j-1}, y_{j}, \mathbf{x}, j)$. Each expert contains a different subset of state
functions $s_{k}(y, \mathbf{x}, j)$, defined in Section 4.4.2.
4.2.3 Learning Model Parameters
Given a training set consisting of $n$ labeled sequences $(\mathbf{x}_i, \mathbf{y}_i)$ for $i = 1 \ldots n$, training is
done in a two-step process. In the first step, we learn the model parameters $\theta_{\alpha}$ of each
expert by using the following objective function from (Kumar & Herbert, 2003; Lafferty
et al., 2001b):
$$L(\theta_{\alpha}) = \sum_{i=1}^{n} \log P_{\alpha}(\mathbf{y}_{i} \mid \mathbf{x}_{i}; \theta_{\alpha}) - R(\theta_{\alpha}) \qquad (4.8)$$
The first term in Equation 4.8 is the conditional log-likelihood of the training data.
The second term is the regularization term (as defined in Section 3.2.2). We choose to
use the Gaussian prior since it consistently gives better prediction results. A Gaussian
prior assumes that each model parameter is drawn independently from a Gaussian
distribution and penalizes according to the weighted square of the model parameters.
It is defined as follows:
$$R(\theta_{\alpha}) = \frac{1}{2\sigma^{2}} \, \|\theta_{\alpha}\|^{2} \qquad (4.9)$$
where $\sigma^{2}$ is the variance, i.e., $P(\theta_{\alpha}) \propto \exp\!\big( -\tfrac{1}{2\sigma^{2}} \|\theta_{\alpha}\|^{2} \big)$. A Gaussian prior
provides smoothing when the number of learned parameters is very high compared to
the size of the available data.
Using a Gaussian prior results in a convex optimization problem that can be solved with
standard techniques. The marginal probabilities $P_{\alpha}(y_{j} = a \mid \mathbf{x}; \theta_{\alpha})$ are computed
using belief propagation. In our experiments, we performed gradient ascent using the
BFGS optimization technique (Nocedal & Wright, 2006).
In the second step, we use the following objective function to learn the optimal
parameters $\Lambda^{*}$:
$$L(\Lambda) = \sum_{i=1}^{n} \log P(\mathbf{y}_{i} \mid \mathbf{x}_{i}; \Lambda) - \frac{1}{2\sigma^{2}} \|\Lambda\|^{2} \qquad (4.10)$$
The first term is the conditional log-likelihood of the training data. The second term is
the log of a Gaussian prior with variance $\sigma^{2}$.
Similar to the first step, we use gradient ascent with the BFGS optimization technique
to search for the optimal parameter values $\Lambda^{*}$.
.
4.2.4 Inference
Similar to the parameter learning process, inference is performed in two steps. Given a
new test sequence $\mathbf{x}$, we first compute the marginal probabilities $P_{\alpha}(y_{j} = a \mid \mathbf{x}; \theta_{\alpha}^{*})$
for each expert. Second, we estimate the most probable sequence of labels $\mathbf{y}^{*}$ that
maximizes our LMDE model:
$$\mathbf{y}^{*} = \arg\max_{\mathbf{y}} \sum_{\mathbf{h} : \forall h_{j} \in \mathcal{H}_{y_{j}}} P(\mathbf{h} \mid \mathbf{x}; \Lambda^{*}) \qquad (4.11)$$
where $\Lambda^{*}$ denotes the parameter values learned during training. To estimate the label
$y_{j}$ of frame $j$, we first compute the marginal probabilities $P(h_{j} = a \mid \mathbf{x}; \Lambda^{*})$ for all
possible hidden states in $\mathcal{H}$. Then, we sum the marginal probabilities according to the
disjoint sets of hidden states $\mathcal{H}_{y_{j}}$. Finally, the label $y_{j}$ associated with the optimal set
is chosen.
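The per-frame label decision described above can be sketched as follows, assuming the hidden-state marginals $P(h_j = h \mid \mathbf{x}; \Lambda^*)$ have already been computed (e.g., by belief propagation) and that each label owns a disjoint set of hidden states. This is an illustrative sketch, not the hCRF implementation used in our experiments.

def predict_labels(hidden_marginals, label_states):
    # Per-frame inference for LMDE: sum the hidden-state marginals over the
    # disjoint set H_y of each label and pick the label with the largest sum.
    # hidden_marginals: list over frames of {hidden_state: P(h_j = state | x)}
    # label_states: {label: set of hidden states belonging to that label}
    labels = []
    for marg in hidden_marginals:
        scores = {y: sum(marg.get(h, 0.0) for h in states)
                  for y, states in label_states.items()}
        labels.append(max(scores, key=scores.get))
    return labels

# toy example with two hidden states per label
states = {"backchannel": {2, 3}, "no-feedback": {0, 1}}
marginals = [{0: 0.5, 1: 0.3, 2: 0.1, 3: 0.1}, {0: 0.1, 1: 0.1, 2: 0.4, 3: 0.4}]
print(predict_labels(marginals, states))   # ['no-feedback', 'backchannel']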
4.3 LMDE for Multimodal Prediction
LMDE is a generic approach designed to integrate information from multiple modalities.
In this section, we first provide a detailed discussion of multimodal prediction, and
more specifically of backchannel prediction, which is used as the main task in our
experiments and in this thesis. Then, we present the User-adaptive Prediction Accuracy,
a new evaluation metric for prediction models.
4.3.1 Multimodal Prediction
Human face-to-face communication is a little like a dance, in that participants
continuously adjust their behaviors based on verbal and nonverbal displays and signals.
A topic of central interest in modeling such behaviors is the patterning of interlocutor
actions and interactions, moment by moment, and one of the key challenges is identifying
the patterns that best predict specific actions. Thus, we are interested in developing
predictive models of communication dynamics that integrate previous and current
actions from all interlocutors to anticipate the most likely next actions of one or all
interlocutors. Humans are good at this: they have an amazing ability to predict, at a
micro level, the actions of an interlocutor (Bavelas et al., 2000), and we know that better
predictions can correlate with more empathy and better outcomes (Goldberg, 2005;
Fuchs, 1987).
Building computational models of such a predictive process involves the dynamics
and temporal relationships between cues from different modalities (Quek, 2003). These
different modalities contain complementary information essential to the interpretation
and understanding of human behaviors (Oviatt, 1999). Psycholinguistic studies also
suggest that gesture and speech come from a single underlying mental process, and that
they are related both temporally and semantically (McNeill, 1992; Cassell & Stone, 1999;
Kendon, 2004).
Among other behaviors, backchannel feedback (the nods and paraverbals such as
"uh-huh" and "mm-hmm" that listeners produce as someone is speaking) has received
considerable interest due to its pervasiveness across languages and conversational contexts.
Figure 4.4: A sample output sequence of listener feedback probabilities (in blue) over
time (in seconds). Red and green boxes indicate the responses from Listener 1 and
Listener 2, respectively. The red and green lines indicate the thresholds on the output
probabilities that would correctly assign the backchannel predictions to the corresponding
listeners.
Several systems have been demonstrated on the task of listener backchannel feedback
prediction (Ward & Tsukahara, 2000; Maatman et al., 2005; Morency et al., 2008b).
Evaluating the results of a backchannel prediction model is challenging, since listener
feedback varies between people and is often optional. While experiencing the same set
of environmental conditions, some people may choose to give frequent feedback, whereas
others may choose to be less active and give feedback only seldom. Therefore, results on
prediction tasks are expected to have lower accuracies than those on recognition tasks,
where the data labels are well established. This indicates the need for a new error
measurement that can take differences in human behaviors into account. We address
this issue in the next section.
4.3.2 User-adaptive Prediction Accuracy
The traditional way to evaluate prediction models is to set a threshold on the output
probability so that a final decision can be made (i.e., backchannel or not). From these
final predictions, typical error metrics, such as F1-score, precision, and recall, can be
measured. The same threshold is applied to all data sequences from different people in
the test set. However, people do not always respond the same way to the same stimuli
(e.g., the speaker's actions). Some people may naturally give a lot of feedback, while
others will give feedback only when the speaker is directly requesting it. For this reason,
using the same threshold for evaluating multiple listeners may not be representative of
the real predictive power of the learned model (e.g., LMDE).
Let us illustrate this problem with the example depicted in Figure 4.4. In this case, we
have two listeners listening to the same speaker but reacting differently. Listener 1 gave
only 1 backchannel, while Listener 2 was more actively nodding his head and gave 5
backchannels. Figure 4.4 shows the output of our LMDE model (backchannel probabilities)
as a continuous blue line, and the potential predictions (local maxima) are depicted by
the red stars. The question now is: can our learned model correctly predict both
listeners? As shown in the figure, there is no single threshold that can correctly predict
both listeners' behaviors. However, given the right thresholds, this model can correctly
predict both listeners. So, what should the evaluation measure and the performance of
the model be?
To address this issue, we propose a new error measurement called User-adaptive
Prediction Accuracy (UPA). The main intuition behind UPA is that we ask our prediction
model for its $n_i$-best predictions, where $n_i$ is the number of times that a particular
listener $i$ gave a backchannel. Following this intuition, UPA is defined as:
$$\mathrm{UPA} = \frac{1}{L} \sum_{i=1}^{N} \frac{P(n_{i})}{n_{i} / l_{i}} \qquad (4.12)$$
where $i$ is the listener id, $N$ is the total number of listeners in the test data, $n_i$ is the
number of backchannels listener $i$ provided during a dyadic interaction, and $l_i$ is the
length of interaction $i$. The denominator term therefore conveys the backchannel
frequency of listener $i$. $L$ is the total length of all interactions with all the listeners.
$P(n_i)$ is a function that compares the $n_i$-best predictions from our LMDE model output
to the ground-truth backchannel labels from listener $i$ and returns the number of
correctly predicted listener backchannels. Predictions from our LMDE model are ranked
by their probability output.
UPA gives us a measure of prediction quality while adapting to people's different
levels of backchannel responsiveness. Consider the case where two different listeners
gave the same number of backchannels during their interactions, but the first interaction
is much longer than the second. One would expect more noise (i.e., peaks) in the output
probabilities of the first interaction, corresponding to possible backchannel opportunities
that the actual listener missed. A model that can correctly find the true backchannel
opportunities even when the listener rarely provides backchannels should therefore be
given a higher weight. We introduce the $l_i$ weighting in Equation 4.12 to capture these
differences in listeners' responses.
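A minimal Python sketch of Equation 4.12 is given below. It assumes, per listener, the per-frame model probabilities, the ground-truth backchannel start times, and the interaction length; the $n_i$-best predictions are taken as the $n_i$ highest peaks, and correctness reuses the window-based matching of Section 3.4.2 as an assumption.

def upa(listeners, fps=30, window=1.0):
    # User-adaptive Prediction Accuracy (Eq. 4.12).
    # `listeners` is a list of (probs, truth_starts, length_sec) triples:
    # per-frame model probabilities, ground-truth backchannel start times,
    # and the interaction length l_i in seconds.
    total_length = sum(length for _, _, length in listeners)
    acc = 0.0
    for probs, truth, length in listeners:
        n_i = len(truth)
        if n_i == 0:
            continue
        # n_i-best predictions: the n_i highest-probability peaks
        peaks = sorted(((probs[j], j / fps) for j in range(1, len(probs) - 1)
                        if probs[j] > probs[j - 1] and probs[j] >= probs[j + 1]),
                       reverse=True)[:n_i]
        correct = sum(any(abs(t - s) <= window / 2 for _, t in peaks) for s in truth)
        acc += correct / (n_i / length)   # P(n_i) divided by the backchannel frequency
    return acc / total_length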
4.4 Experimental Setup
As mentioned in the previous section, we evaluate our LMDE on the multimodal task
of predicting listener nonverbal backchannel. We use the RAPPORT dataset, which was
introduced in Section 3.4.1. In the rest of this section, we first describe the backchannel
labeling and the multimodal speaker features. Then, we explain the baseline models
used for comparison in our tests and the experimental setup.
4.4.1 Backchannel Annotations
We explained our backchannel labeling process in Section 3.4.2. In the experiments in
this chapter, we vary the backchannel duration to determine which one is optimal.
Section 4.5.2 describes these results, where we find the optimal training backchannel
duration to be 0.5 seconds.
4.4.2 Multimodal Features and Experts
This section describes the different multimodal features used to create our five experts.
Prosody. Prosody refers to the rhythm, pitch, and intonation of speech. Several studies
have demonstrated that listener feedback is correlated with a speaker's prosody (Nishimura
et al., 2007; Ward & Tsukahara, 2000; Cathcart et al., 2003b). For example, Ward and
Tsukahara (Ward & Tsukahara, 2000) show that short listener backchannels (listener
utterances like "ok" or "uh-huh" given during a speaker's utterance) are associated with
a lowering of pitch over some interval. Listener feedback often follows speaker pauses
or filled pauses such as "um" (see (Cathcart et al., 2003b)). Using the openSMILE
toolbox (Eyben et al., 2009), we extract the following prosodic features, including
standard linguistic annotations and the prosodic features suggested by Ward and
Tsukahara:
downslopes in pitch continuing for at least 40ms
regions of pitch lower than the 26th percentile continuing for at least 110ms (i.e., lowness)
drop or rise in energy of speech (i.e., energy edge)
fast drop or rise in energy of speech (i.e., energy fast edge)
vowel volume (i.e., vowels are usually spoken softer)
pause in speech (i.e., no speech)
Visual gestures. Gestures performed by the speaker are often correlated with listener
feedback (Burgoon et al., 1995). Eye gaze, in particular, has often been implicated as
eliciting listener feedback. Thus, we manually annotate the following contextual features:
speaker looking at listener (eye gaze)
speaker not looking at listener (~ eye gaze)
smiling
moving eyebrows up
moving eyebrows down
Lexical. Some studies have suggested an association between lexical features and listener
feedback (Cathcart et al., 2003b). Using the transcriptions, we included all individual
words (i.e., unigrams) spoken by the speaker during the interactions.
Part-of-speech tags. In (Cathcart et al., 2003b), a combination of pause duration and
a statistical part-of-speech language model is shown to achieve the best performance for
placing backchannels. Following this work, we use a CRF part-of-speech (POS) tagger
to automatically assign a part-of-speech label to each word, and we include these
part-of-speech tags (e.g., noun, verb, etc.) in our experiments.
Syntactic structure. Finally, we attempt to capture syntactic information that may
provide relevant cues by extracting three types of features from a syntactic dependency
structure corresponding to the utterance. The syntactic structure is produced
automatically using a data-driven left-to-right shift-reduce dependency parser (Sagae &
Tsujii, 2007b), trained on dependency trees extracted from the Switchboard section of
the Penn Treebank (Marcus et al., 1994), which were converted to dependency format
using the Penn2Malt tool (http://w3.msi.vxu.se/~nivre/research/Penn2Malt.html). The
three syntactic features are:
Grammatical function for each word (e.g. subject, object, etc.), taken directly
from the dependency labels produced by the parser
Figure 4.5: Baseline models: (a) Conditional Random Fields (CRF), (b) Latent-Dynamic
Conditional Random Fields (LDCRF), (c) CRF Mixture of Experts (no latent variable).
Part-of-speech of the syntactic head of each word, taken from the dependency
links produced by the parser
Distance and direction from each word to its syntactic head, computed from the
dependency links produced by the parser
4.4.3 Baseline Models
In this subsection, we present the baseline models used in our experiments to compare
with our LMDE model.
Individual experts. Our first baseline consists of a set of CRF chain models, each
trained with a different set of multimodal features (as described in the previous section).
In other words, only visual, prosodic, lexical, or syntactic features are used to train a
single CRF expert (see Figure 4.5(a)).
Multimodal classifiers (early fusion). Our second baseline consists of two models:
CRF and LDCRF (Morency et al., 2007). To train these models, we concatenate all
multimodal features (lexical, syntactic, prosodic, and visual) into one input vector.
Graphical representations of these baseline models are given in Figure 4.5(a) and
Figure 4.5(b).
CRF Mixture of Experts To show the importance of the latent variable in our LMDE model, we trained a CRF-based mixture of discriminative experts. A graphical representation of a CRF Mixture of Experts is given in Figure 6.3. This model is similar to the Logarithmic Opinion Pool (LOP) CRF suggested by Smith et al. (Smith et al., 2005), in the sense that they both factor the CRF distribution into a weighted product of individual expert CRF distributions. However, the main difference between the LOP and CRF Mixture of Experts models is in the definition of the optimization functions. Similar to our LMDE model, training of the CRF Mixture of Experts is performed in two steps: expert models are learned in the first step, and the second-level CRF model parameters are learned in the second step.
Pause-Random Classifier Our last baseline model is a random backchannel generator, which randomly generates backchannels whenever certain pre-defined conditions in the speech are met. Ward and Tsukahara (Ward & Tsukahara, 2000) define these conditions as: (1) coming after at least 700 milliseconds of speech, (2) absence of backchannel feedback within the preceding 800 milliseconds, (3) after 700 milliseconds of wait. We optimized the amount of randomness in our experiments.
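A minimal sketch of such a rule-based random generator is shown below; the 30 Hz frame step, the trigger probability, and the reading of condition (3) as a 700 ms delay before responding are our assumptions:

import random

FRAME = 1.0 / 30  # seconds per frame, matching the 30 Hz observations

def pause_random_predictions(is_speech, p_trigger=0.1, seed=0):
    """Rule-based random baseline: whenever the quoted conditions hold, emit a
    backchannel prediction with probability p_trigger. Condition (3) is read
    here as a 700 ms delay between the trigger and the produced backchannel."""
    rng = random.Random(seed)
    predictions = []
    speech_run = 0.0        # duration of continuous speech so far (condition 1)
    last_output = -1e9      # time of the last emitted backchannel (condition 2)
    for i, speaking in enumerate(is_speech):
        t = i * FRAME
        speech_run = speech_run + FRAME if speaking else 0.0
        if (speech_run >= 0.7 and            # at least 700 ms of speech
                t - last_output >= 0.8 and   # no backchannel in the preceding 800 ms
                rng.random() < p_trigger):   # the "amount of randomness"
            output_time = t + 0.7            # respond after 700 ms of wait
            predictions.append(output_time)
            last_output = output_time
    return predictions

is_speech = [False] * 150 + [True] * 300 + [False] * 150   # 20 s, speech in the middle
print(pause_random_predictions(is_speech)[:5])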
4.4.4 Methodology
We performed held-out testing by randomly selecting a subset of 11 interactions (out of 45) for the test set. The training set contains the remaining 34 dyadic interactions. All models in this chapter were evaluated with the same training and test sets. Validation of all model parameters (regularization term and number of hidden states) was performed using a 3-fold cross-validation strategy on the training set. The regularization term was validated with values 10^k, k = 1..3. Two different numbers of hidden states were tested for the LMDE models: 2 and 3 (note that an LMDE with 1 hidden state is equivalent to the Mixture of CRF Experts model). In our experiments, the optimum number of hidden
states was validated to be 2 when the duration of backchannel labels was set to 0.5, and 3 when the duration was set to 1.0 or 1.5.

Figure 4.6: Comparison of individual experts with our LMDE model. Top: Recall (x-axis) vs. Precision (y-axis) values for different threshold values. Bottom: Precision, Recall, F1 and UPA scores of the corresponding models for the selected amount of backchannel.
The performance is measured by using UPA (described in Section 4.3.2) as well as more conventional metrics: precision, recall, and F-measure. Precision is the probability that predicted backchannels correspond to actual listener behavior. Recall is the probability that a backchannel produced by a listener in our test set was predicted by the model. We use the same weight for both precision and recall, the so-called F1, which is the weighted harmonic mean of precision and recall.

During testing, we find all the "peaks" (i.e., local maxima) of the marginal probabilities P(y_j = a | x; Θ). When computing UPA, the final predictions are selected from
these peaks so that the number of model predictions is equal to the number of listener backchannels in the test sequence. For the F1 score, the prediction model needs to decide on a specific threshold (i.e., amount of backchannel) on the marginal probabilities for all users. The value of this threshold is automatically set during validation. Since we are predicting the start time of a backchannel, an actual listener backchannel is correctly predicted if at least one model prediction happens within the 1 second interval window around the start time of the listener backchannel.
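The sketch below illustrates this evaluation protocol: peaks of the marginal probability are extracted, predictions are kept either by a validated threshold (for F1) or by keeping as many peaks as there are actual backchannels (only the selection step used when computing UPA; the UPA score itself is defined in Section 4.3.2), and a prediction counts as correct if it falls inside the 1 second window (±0.5 s) around a true onset. The helper names and the synthetic data are ours:

import numpy as np

FRAME = 1.0 / 30  # seconds per frame

def local_maxima(prob):
    """Indices of the peaks (local maxima) of the marginal P(y_j = backchannel | x)."""
    return [j for j in range(1, len(prob) - 1) if prob[j - 1] < prob[j] >= prob[j + 1]]

def precision_recall_f1(pred_times, true_onsets, window=0.5):
    """A prediction is correct if it falls within the 1 s window (+/- 0.5 s) of a true onset."""
    hit_pred = sum(any(abs(p - t) <= window for t in true_onsets) for p in pred_times)
    hit_true = sum(any(abs(p - t) <= window for p in pred_times) for t in true_onsets)
    precision = hit_pred / len(pred_times) if pred_times else 0.0
    recall = hit_true / len(true_onsets) if true_onsets else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

prob = np.random.rand(900)            # 30 s of marginal probabilities (illustrative)
true_onsets = [4.0, 11.2, 21.5]       # ground-truth backchannel start times (seconds)
peaks = local_maxima(prob)

# F1-style decision: keep peaks above a threshold chosen during validation
pred_f1 = [j * FRAME for j in peaks if prob[j] >= 0.8]
print(precision_recall_f1(pred_f1, true_onsets))

# Selection step used for UPA: keep as many peaks as there are actual backchannels
top = sorted(peaks, key=lambda j: prob[j], reverse=True)[:len(true_onsets)]
print(precision_recall_f1([j * FRAME for j in top], true_onsets))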
The training of all CRFs and LDCRFs was done using the hCRF library (http://sourceforge.net/projects/hrcf/). The LMDE model was implemented in Matlab based on the hCRF library. The input observations were computed at 30 frames per second. Given the continuous labeling nature of our LMDE model, prediction outputs were also computed at 30Hz.
4.5 Results
In this section we present the results of our empirical evaluation. We designed our experiments so as to test different characteristics of the LMDE model. First, we present our quantitative results that evaluate: (1) the integration of multiple sources of information, (2) the late fusion approach and (3) the latent variable which models the hidden dynamic between experts. Then, we present qualitative analyses related to: (1) the output probabilities from individual experts and the LMDE model, (2) the most relevant features in early and late fusion models, (3) model robustness and (4) UPA analysis.
4.5.1 Comparative Results
Individual Experts We trained one individual expert for each feature type: visual, prosodic, lexical and syntactic (both part-of-speech and syntactic structure). Precision, recall, F1, and UPA values for each individual expert and our LMDE model are shown in Figure 4.6 (Bottom). Even though the experts may not perform well individually, they can bring important information once merged together. The recall-precision curve in Figure 4.6 (Top) shows that our LMDE model was able to take advantage of the complementary information from each expert.

Figure 4.7: Comparison of the LMDE model with previously published approaches for multimodal prediction. Top: Recall (x-axis) vs. Precision (y-axis) values for different threshold values. Bottom: Precision, Recall, F1 and UPA scores of the corresponding models for the selected amount of backchannel.

Table 4.1: Top 5 features from the ranked list of features for each listener expert.

  Expert 1 (Prosodic)   Expert 2 (Visual)   Expert 3 (Lexical)   Expert 4 (POS)   Expert 5 (Syntactic)
  Utterance             ~EyeGaze            she                  POS:NN           DIRDIST:L1
  Pause                 Nod                 um                   POS:PRP          HEADPOS:VBZ
  Vowel Volume          EyeBrows Up         that                 POS:VBG          LABEL:PMOD
  Energy Edge           EyeBrows Down       he                   POS:UH           DIRDIST:L2
  Low Pitch             EyeGaze             women                POS:NNS          LABEL:SUB
Table 4.2: Performances of individual expert models trained by using only the top 5 features selected by our feature ranking algorithm introduced in Section 3.2. The last two rows represent the LMDE models using the expert models trained with only 5 features selected either by a greedy method (Morency et al., 2008b) or by our sparse feature ranking scheme.

  Expert       Precision   Recall   F-1      UPA
  Prosodic5    0.1463      0.5645   0.2324   0.1545
  Visual5      0.1457      0.2671   0.1886   0.1558
  Lexical5     0.1059      0.1706   0.1307   0.1471
  POS5         0.1522      0.5602   0.2394   0.1409
  Syntactic5   0.0995      0.5626   0.1691   0.1302
  Greedy5      0.2007      0.3241   0.2479   0.2585
  LMDE5        0.1914      0.5306   0.2814   0.2331
Late Fusion We compare our approach with two early fusion models: CRF and LDCRF (see Figure 6.3). Figure 4.7 summarizes the results. The CRF model learns direct weights between the input features and the gesture labels. The LDCRF is able to model more complex dynamics between input features with its latent variable. We can see that our LMDE model outperforms both early fusion models because of its late fusion approach.
When merging the features together in an early manner, the noise from one modality may hide or suppress the features from a different modality. By training separate experts for each modality, we are able to reduce the effect of this noise and therefore learn models that generalize better to new multimodal data.
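As a small illustration of this late fusion step, the sketch below stacks the per-frame marginal probabilities of the individual experts into the observation vector that the second-stage (latent-variable) model receives; the toy values are invented for illustration:

def stack_expert_marginals(expert_marginals):
    """Late fusion input: at every frame, concatenate each expert's marginal
    probability of a backchannel into one small observation vector."""
    num_frames = len(expert_marginals[0])
    return [[expert[j] for expert in expert_marginals] for j in range(num_frames)]

# Illustrative per-frame marginals from three experts over five frames
prosodic = [0.1, 0.4, 0.7, 0.3, 0.1]
lexical  = [0.0, 0.2, 0.9, 0.8, 0.2]
visual   = [0.3, 0.3, 0.2, 0.1, 0.1]

fused = stack_expert_marginals([prosodic, lexical, visual])
print(fused[2])   # [0.7, 0.9, 0.2] is what the latent-variable model sees at frame 3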
Latent Variable The CRF Mixture of Experts (Smith et al., 2005) directly merges the expert outputs, while our LMDE model uses a latent variable to model the hidden dynamic between experts (see Figure 6.3-(c)). This comparison (summarized in Figure 4.7) is important since it shows the effect of the latent variable in our LMDE model.
4.5.2 Analysis Results
Expert Analysis Our first analysis looks at the speaker features that are the most relevant to listener feedback prediction. This analysis is performed by applying our sparse feature ranking algorithm described in Section 3.2 to each of the five experts separately. The top 5 features for our five experts are listed in Table 4.1. A first interesting result is the two features appearing in the Prosodic Expert and the one feature appearing in the Visual Expert: pause, low pitch and eye gaze. These features have also been identified in previous work (Ward & Tsukahara, 2000; Morency et al., 2008b) as important cues for backchannelling. Similarly, the um feature in the Lexical Expert can be considered a filled pause and a reasonable cue for backchannel prediction. The Visual Expert selects nod as its second-best feature, which can be associated with the mirroring effect. This suggests that our experts are learning relevant features.
To confirm that these selected features are relevant to the L2-trained models, we trained new experts and LMDE models using the top 5 features of each expert selected by our sparse feature ranking algorithm. In other words, for each expert, we trained a new CRF model using only the top 5 features selected for that expert. The performance of these new expert models is listed in Table 4.2. It is interesting to see that using only five features can perform as well as using all the features. We see some increase in both the F-1 and UPA values for the POS and Syntactic experts when only 5 features are used. We believe that this is due to noise when all features are used. For comparison, we trained a new LMDE model using these new expert models. The performance of this model, which we refer to as LMDE5, is given in Table 4.2. LMDE5 achieves a higher F-1 value than all individual experts, and a very similar UPA value to the original LMDE (note that all the features were present while training the expert models in the original LMDE).

We also compared our sparse feature ranking algorithm to the greedy feature selection method presented in (Morency et al., 2008b). For this purpose, we used this greedy method to select 5 features for each expert, and learned expert models trained
with these 5 features. Then, these expert models are used to learn an LMDE model, referred to as Greedy5. The results are shown in Table 4.2. LMDE5 and Greedy5 achieved similar performance. However, our sparse ranking scheme is a much faster algorithm than the greedy method. The computational cost of the greedy algorithm increases with the number of features, whereas the computational cost of our sparse ranking scheme is determined by the number of regularization penalty values (76 in our experiments).

Figure 4.8: Output probabilities from LMDE and individual experts for two different sub-sequences. The gray areas in the graph correspond to ground truth backchannel feedbacks of the listener.
LMDE Model Analysis Our second analysis focuses on the multimodal integration which happens at the latent variable level in our LMDE model. Figure 4.8 shows the output probabilities for all five individual experts as well as our model. The strength of the latent variable is that it enables a different weighting of the experts at different points in time.

Table 4.3: Performances of baseline models and our LMDE model as we increase the duration of backchannel labels during training.

                Training Backchannel Duration
                0.5               1.0               1.5
  Model         F-1      UPA      F-1      UPA      F-1      UPA
  LMDE          0.3026   0.2640   0.2774   0.2439   0.2751   0.2291
  Early CRF     0.2512   0.1615   0.2384   0.1916   0.2397   0.1764
  Early LDCRF   0.1638   0.0788   0.1856   0.0669   0.1648   0.0496
  Mixture CRF   0.2430   0.2027   0.2245   0.1834   0.2037   0.1576
In the sequence depicted in Figure 4.8, the actual listener gave backchannel feedback 3 times (around 62s, 64s and 71s), as indicated by the gray areas. As we analyze the outputs from the different experts, we see that the Lexical and POS experts were able to learn the backchannel opportunity for the first backchannel feedback at 62s. These two experts are highly weighted (by one of the hidden states) during this part of the sequence. All the experts except the Visual Expert assigned a high chance of backchannel around 65.5s, where there is no listener feedback. The Visual Expert was highly weighted during this time, so that the influence of all the other experts was reduced in the LMDE output. This difference in weighting shows that a different hidden state is active during this part of the sequence.
Model Robustness As mentioned in Section 3.4.2, one of the hyper-parameters of our LMDE prediction model is the duration of the backchannel cues used during training. To analyze the sensitivity of our model to the backchannel duration, we varied the duration from 500 milliseconds to 1500 milliseconds, and retrained our LMDE model and the baseline models. F-1 and UPA values are given in Table 4.3. We observe a drop in the LMDE performance as we increase the duration. This was true for most of the other models, which suggests that it is better to train prediction models with more focused labels (i.e. a narrower backchannel duration). It should also be noted that LMDE outperforms all the other baseline models for all durations.

Table 4.4: Number of backchannel feedbacks provided by each of the 11 listeners in our test set and their corresponding UPA, precision, recall and F-1 scores.

  Num. of feedbacks   UPA     Precision   Recall   F-1
  1                   0.000   0.031       1.000    0.061
  1                   0.000   0.050       1.000    0.095
  2                   0.000   0.031       0.500    0.059
  4                   0.000   0.077       0.500    0.133
  5                   0.200   0.091       0.600    0.158
  8                   0.375   0.104       0.625    0.178
  16                  0.562   0.282       0.687    0.400
  21                  0.238   0.269       0.333    0.298
  23                  0.478   0.433       0.565    0.491
  25                  0.320   0.286       0.720    0.409
  40                  0.500   0.528       0.475    0.500
UPA Analysis In our earlier experiments (see Figure 4.6), we have seen that the Visual and Lexical experts seem to perform about the same based on their F1 values (0.1914 and 0.1943), but their UPA values are quite different (0.1558 and 0.1131). Looking at their F1 results, we would expect these two experts to have very similar recall-precision curves. However, their recall-precision curves in Figure 4.6 indicate that the Visual Expert is a better model than the Lexical Expert, which is already confirmed by our UPA measure. We can see another such example between the POS and Syntactic Experts. The F1 values indicate that the POS Expert (0.1866) is a better model than the Syntactic Expert (0.1395). On the other hand, their UPA values (0.1122 and 0.1252) suggest that these are similar models, which is also confirmed by their precision-recall curves in Figure 4.6. These observations suggest that our UPA measurement is a more representative measure than the F1 score.
To analyze the variability among listeners, we have listed in Table 4.4 the individual test performances and the number of backchannel feedbacks provided by each listener. One interesting conclusion derived from this result is that there is some correlation between the number of feedbacks and the UPA, precision, and F-1 values. As the number of backchannels increases, these values increase as well.
4.6 Conclusions
In this chapter, we addressed three main issues involved in building predictive models of human communicative behaviors. First, we introduced a new model called Latent Mixture of Discriminative Experts (LMDE) for multimodal data integration. Many of the interactions between speech and gesture happen at the sub-gesture or sub-word level. LMDE automatically learns the temporal relationship between different modalities. Since we train separate experts for each modality, LMDE is capable of improving the prediction performance even with a limited amount of data.

We evaluated our model on the task of nonverbal feedback prediction (e.g., head nods). Our experiments confirm the importance of combining the five types of multimodal features: lexical, syntactic structure, POS, visual, and prosody. An important advantage of using our LMDE model is that it enables easy interpretability of the individual experts. As a second contribution, we have presented a sparse feature ranking scheme based on an L1 regularization technique. Our third contribution is a new metric called User-adaptive Prediction Accuracy (UPA). This metric is particularly designed for evaluating prediction models, and we plan to apply it to other prediction models as well. LMDE is a generic model that can be applied to a wide range of problems.

In the next chapter, we will be using a variant of our LMDE model to model the mutual influence between the speaker and the listener during a dyadic conversation.
Chapter 5
Context-Based Prediction
5.1 Introduction
During face-to-face communication, participants often adapt to each other's nonverbal behaviors (Ramseyer, 2011; Burgoon et al., 1995; Hatfield et al., 1992; Riek et al., 2010). For instance, it is more likely that a listener in a dyadic conversation will provide more feedback to a speaker who is very enthusiastic and smiling than to a speaker with monotone speech and no eye contact. Similarly, mimicry is a common phenomenon during human communication. Participants often mimic each other's gestures to convey empathy and rapport (Ross et al., 2008; Hatfield et al., 1992; Riek et al., 2010). Also, people often have a more positive judgement of a stranger who mimics their behaviors (Gueguen et al., 2009). This phenomenon, which we refer to as visual influence in this thesis, is essential for fluid human interactions; but research is still needed to build accurate computational models when the visual information is not directly observed.

One of the challenges in building socially intelligent virtual agents is the absence of visual information. In many real-world applications, we often have only the speech and/or text to be spoken by the virtual human, without any visual context. Another scenario where no visual context is available is phone-to-phone conversations. If we
want to create a virtual representative (i.e. a customer service agent) that is capable of providing backchannel feedback, the only source of information will be the interlocutor's (customer's) voice. As discussed above, a good prediction model of backchannels should be able to take into account the visual influence of an interlocutor even in the absence of visual context. This can be achieved by predicting the nonverbal behaviors of the speaker, and using the predicted visual context to model the visual influence of the speaker gestures on the listener behaviors.

Figure 5.1: An overview of our approach for predicting listener backchannels in the absence of visual information. Our approach takes into account the context from the speaker by first predicting the nonverbal behaviors of the speaker and uses these predictions to improve the final listener backchannels.
In this chapter, we present a context-based prediction model to address this issue of visual influence. An overview of our approach is given in Figure 5.1. We assume an environment where the visual gestures of the speaker are not available. Based on this assumption, we first predict the visual context (i.e. head nods) of the speaker and the backchannels of the listener using only the audio channel from the speaker. We model the visual influence of the speaker on listener feedback by using an extension of the Latent Mixture of Discriminative Experts model (Chapter 4). We evaluate our approach using 45 storytelling dyadic interactions from the RAPPORT dataset. In our experiments, we compare our approach with previous approaches based on Conditional Random Fields (CRF) (Lafferty et al., 2001b), Latent-Dynamic CRFs (Morency et al., 2007), CRF Mixture of Experts (a.k.a. Logarithmic Opinion Pools (Smith et al., 2005)), and a rule-based random predictor (Ward & Tsukahara, 2000).

The rest of this chapter is organized as follows. We present our context-based prediction approach in Section 5.2. The experimental setup and results are given in Section 5.3 and Section 5.4, respectively. Finally, we conclude in Section 5.5.
5.2 Context-based Backchannel Prediction with Visual Influence
The goal of our approach is to predict listener backchannels in dyadic conversations by using the visual influence of speaker gestures on the listener backchannel feedbacks. We assume a situation where no visual context from either the speaker or the listener is available. In other words, we have no access to the speaker's visual information, but only the speech/text information from the speaker.

In our context-based prediction approach, we first infer the speaker gestures, and then exploit this visual context to improve the final listener backchannel predictions (see Figure 5.1). In order to model the visual influence between the speaker and the listener, we use a variant of the Latent Mixture of Discriminative Experts (see Section 4.2.2) called visual-LMDE. One of the main advantages of our LMDE model is that it can automatically discover the hidden structure among modalities and learn the dynamic between them. We extend the LMDE model to also take into consideration the visual influence of speaker gestures.

Our visual-LMDE model is based on a two-step process (an overview is shown in Figure 5.2): in the first step, we learn discriminative experts for speaker gestures and listener backchannels. Speaker expert models are trained using a Conditional Random Field (CRF) (Lafferty et al., 2001b) on one of the four speech dimensions (prosody, lexicons, syntactic structure and part-of-speech tags). These individual experts provide the predicted visual context from the speaker. We learn experts for listener backchannels similarly to the speaker gestures, but using the actual listener backchannel feedback as our labels. In the second step, we merge the speaker experts (visual context) with the listener
experts by using a latent variable model. This process involves using the outputs of these expert models as input to a Latent Dynamic Conditional Random Field (LDCRF) (Morency et al., 2007) that is capable of modeling the visual influence of the speaker gestures on the listener feedback using latent variables.
The task of our LMDE model is to learn a mapping between a sequence of multimodal observations x = {x_1, x_2, ..., x_m} and a sequence of labels y = {y_1, y_2, ..., y_m}. Each y_j is a class label for the j-th frame of a video sequence and is a member of a set Y of possible class labels, for example, Y = {backchannel, no-feedback}. Each frame observation x_j is represented by a feature vector in R^d, for example, the prosodic features at each sample. For each sequence, we also assume a vector of "sub-structure" variables h = {h_1, h_2, ..., h_m}. These variables are not observed in the training examples and will therefore form a set of hidden variables in the model.
We define the visual-LMDE model as follows:

P(\mathbf{y} \mid \mathbf{x}; \Theta) = \sum_{\mathbf{h}:\, \forall h_j \in \mathcal{H}_{y_j}} P(\mathbf{h} \mid \mathbf{x}; \Theta)   (5.1)

where \Theta are the model parameters learned during training and P(\mathbf{h} \mid \mathbf{x}; \Theta) is defined as follows:

P(\mathbf{h} \mid \mathbf{x}; \Theta) = \frac{\exp\left( \sum_p \theta_p\, T_p(\mathbf{h}) + \sum_l \theta_l\, S_l(\mathbf{h}, \mathbf{x}) + \sum_s \theta_s\, S_s(\mathbf{h}, \mathbf{x}) \right)}{Z(\mathbf{x}; \Theta)}   (5.2)
Different from our multimodal LMDE (see Chapter 4) and from Morency et al. (Morency et al., 2007), we learn three sets of parameters: (1) \theta_p, related to the transitions between hidden states, (2) \theta_l, related to the listener expert outputs, and (3) \theta_s, related to the speaker expert outputs. \theta_s and \theta_l model the relationships between the expert outputs and the hidden states h_j. Z is the partition function. T_p(\mathbf{h}) is the transition function between the hidden states. S_l(\mathbf{h}, \mathbf{x}) is the listener state function and is defined as follows:
Figure 5.2: Our approach for predicting speaker gestures in dyadic conversations. Using the speaker audio features as input, we first learn a CRF model (expert) per audio channel, for both speaker gestures and listener backchannels. Then, we merge these CRF experts using a latent variable model that is capable of learning the hidden dynamic among the experts. This second step allows us to incorporate the visual influence of the speaker gestures on the listener behaviors.

S_l(\mathbf{h}, \mathbf{x}) = \sum_j s\big(h_j, [\, q_j^{1}\; q_j^{2}\; \cdots\; q_j^{\alpha}\; \cdots\; q_j^{|e|} \,]\big)   (5.3)
Each q_j^{\alpha} is the marginal probability of expert \alpha at frame j, and is equal to P_{\alpha}(y_j = a \mid \mathbf{x}; \lambda_{\alpha}). Each expert conditional distribution is defined by P_{\alpha}(\mathbf{y} \mid \mathbf{x}; \lambda_{\alpha}) using the usual conditional random field formulation:

P_{\alpha}(\mathbf{y} \mid \mathbf{x}; \lambda_{\alpha}) = \frac{\exp\big( \sum_k \lambda_{\alpha,k}\, F_{\alpha,k}(\mathbf{y}, \mathbf{x}) \big)}{Z_{\alpha}(\mathbf{x}; \lambda_{\alpha})}   (5.4)
where \lambda_{\alpha} represents the model parameters of expert \alpha. F_{\alpha,k} is either a state function s_k(y_j, \mathbf{x}, j) or a transition function t_k(y_{j-1}, y_j, \mathbf{x}, j). Each expert contains a different subset of state functions s_k(\mathbf{y}, \mathbf{x}, j), defined in Section 5.3.1.
The speaker state function S_s(\mathbf{h}, \mathbf{x}) is defined similarly to S_l(\mathbf{h}, \mathbf{x}). The main difference is that we use listener backchannels as the sequence labels y when learning P_{\alpha}(\mathbf{y} \mid \mathbf{x}; \lambda_{\alpha}) for the listener experts in S_l(\mathbf{h}, \mathbf{x}), and use speaker gestures as the sequence labels y for the speaker experts in S_s(\mathbf{h}, \mathbf{x}).
In our framework, each speaker expert learns one aspect (e.g. the prosodic factor) of speech for predicting speaker gestures. Similarly, the listener experts allow us to capture the discriminative characteristics of the speaker's speech for listener backchannel feedback. By using a latent variable model to combine these individual experts, our visual-LMDE model is able to learn both the visual influence and the hidden structure among the experts. More details about training and inference of LMDE can be found in Section 4.2.2.
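A simplified, runnable sketch of this two-step procedure is given below. It uses sklearn-crfsuite as a stand-in for the hCRF-based experts and a plain CRF in place of the latent-variable (LDCRF) second stage, so it only illustrates the data flow, not the thesis implementation; the toy features, labels and names are assumptions:

import sklearn_crfsuite

# Toy per-frame features for two speech channels of one 6-frame sequence
prosody = [[{"pause": 0.0}, {"pause": 1.0}, {"pause": 1.0},
            {"pause": 0.0}, {"pause": 0.0}, {"pause": 1.0}]]
lexical = [[{"w=um": 0.0}, {"w=um": 1.0}, {"w=um": 0.0},
            {"w=um": 0.0}, {"w=um": 1.0}, {"w=um": 0.0}]]
channels = (prosody, lexical)
listener_labels = [["none", "bc", "bc", "none", "none", "bc"]]      # listener backchannels
speaker_labels = [["none", "none", "nod", "nod", "none", "none"]]   # speaker head nods

def train_expert(X, y):
    """One CRF expert per audio channel (step 1)."""
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c2=0.1, max_iterations=50)
    crf.fit(X, y)
    return crf

def marginals(crf, X, label):
    """Per-frame marginal probability of the positive label."""
    return [[frame.get(label, 0.0) for frame in seq] for seq in crf.predict_marginals(X)]

# Step 1: experts for listener backchannels and for speaker gestures, all audio-only
listener_experts = [train_expert(X, listener_labels) for X in channels]
speaker_experts = [train_expert(X, speaker_labels) for X in channels]

# Step 2: fuse all expert marginals frame by frame. The thesis feeds this to an LDCRF
# whose hidden states model the visual influence; a plain CRF is used here for brevity.
streams = ([marginals(e, X, "bc") for e, X in zip(listener_experts, channels)] +
           [marginals(e, X, "nod") for e, X in zip(speaker_experts, channels)])
fused = [[{"expert_%d" % i: stream[0][j] for i, stream in enumerate(streams)}
          for j in range(len(streams[0][0]))]]
second_stage = train_expert(fused, listener_labels)
print(second_stage.predict(fused))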
5.3 Experimental Setup
In our experiments, we use 45 interactions from the RAPPORT dataset (Gratch et al., 2007). Details of this dataset can be found in Sections 3.4.1 and 3.4.2. As mentioned in the previous section, we evaluate our visual-LMDE on the multimodal task of predicting listener nonverbal backchannels. In this section, we first describe the multimodal speaker features. Then, we explain the baseline models used for comparison in our tests, and the experimental setup.
5.3.1 Multimodal Features and Experts
This section describes the different multimodal audio features used to create our four experts.

Prosody: Prosody refers to the rhythm, pitch and intonation of speech. Several studies have demonstrated that listener feedback is correlated with a speaker's prosody (Nishimura et al., 2007; Ward & Tsukahara, 2000; Cathcart et al., 2003b). For example, Ward and Tsukahara (Ward & Tsukahara, 2000) show that short listener backchannels (listener utterances like "ok" or "uh-huh" given during a speaker's utterance) are associated with a lowering of pitch over some interval. Listener feedback often follows speaker pauses or filled pauses such as "um" (see (Cathcart et al., 2003b)). Using the openSMILE (Eyben et al., 2009) toolbox, we extract the following prosodic features, including standard linguistic annotations and the prosodic features suggested by Ward and Tsukahara:
- downslopes in pitch continuing for at least 40ms
- regions of pitch lower than the 26th percentile continuing for at least 110ms (i.e., lowness)
- drop or rise in energy of speech (i.e., energy edge)
- fast drop or rise in energy of speech (i.e., energy fast edge)
- vowel volume (i.e., vowels are usually spoken softer)
- pause in speech (i.e., no speech)
Lexical: Some studies have suggested an association between lexical features and listener feedback (Cathcart et al., 2003b). Using the transcriptions, we included all individual words (i.e., unigrams) spoken by the speaker during the interactions.

Part-of-Speech Tags: In (Cathcart et al., 2003b), the combination of pause duration and a statistical part-of-speech language model is shown to achieve the best performance for placing backchannels. Following this work, we use a CRF part-of-speech (POS) tagger to automatically assign a part-of-speech label to each word. We also include these part-of-speech tags (e.g. noun, verb, etc.) in our experiments.
Syntactic structure: Finally, we attempt to capture syntactic information that may provide relevant cues by extracting three types of features from a syntactic dependency structure corresponding to the utterance. The syntactic structure is produced automatically using a data-driven left-to-right shift-reduce dependency parser (Sagae & Tsujii, 2007b), trained on dependency trees extracted from the Switchboard section of the Penn Treebank (Marcus et al., 1994), converted to dependency trees using the Penn2Malt tool (http://w3.msi.vxu.se/~nivre/research/Penn2Malt.html). The three syntactic features are:
- Grammatical function for each word (e.g. subject, object, etc.), taken directly from the dependency labels produced by the parser
- Part-of-speech of the syntactic head of each word, taken from the dependency links produced by the parser
- Distance and direction from each word to its syntactic head, computed from the dependency links produced by the parser
5.3.2 Baseline Models
In this subsection, we present the baseline models used in our experiments to compare with our visual-LMDE model.

Individual experts: Our first baseline model consists of a set of CRF chain models, each trained with a different set of multimodal features (as described in the previous section). In other words, only prosodic, lexical, part-of-speech or syntactic features are used to train a single CRF expert (see Figure 6.3a).

Multimodal classifiers: Our second baseline consists of two models: CRF and LDCRF (Morency et al., 2007). To train these models, we concatenate all multimodal features (lexical, syntactic and prosodic) in one input vector. Graphical representations of these baseline models are given in Figure 6.3-(a) and Figure 6.3-(b).
Multimodal LMDE: To show the importance of the visual context from the speaker, we train an LMDE model without using any of the speaker experts (see Section 4.2.2). In other words, our baseline LMDE model is trained to directly predict listener backchannels from the speaker audio features.

Pause-Random Classifier: Our last baseline model is a random backchannel generator, which randomly generates backchannels whenever certain pre-defined conditions in the speech are met. Ward and Tsukahara (Ward & Tsukahara, 2000) define these conditions as: (1) coming after at least 700 milliseconds of speech, (2) absence of backchannel feedback within the preceding 800 milliseconds, (3) after 700 milliseconds of wait. We optimized the amount of randomness in our experiments.
CRF Mixture of Experts: To show the importance of the latent variable in our context-based prediction model, we trained a CRF-based mixture of discriminative experts. A graphical representation of a CRF Mixture of Experts is given in Figure 6.3. This model is similar to the Logarithmic Opinion Pool (LOP) CRF suggested by Smith et al. (Smith et al., 2005), in the sense that they both factor the CRF distribution into a weighted product of individual expert CRF distributions. The main difference between the LOP and CRF Mixture of Experts models is in the definition of the optimization functions. Training of the CRF Mixture of Experts is performed in two steps: expert models are learned in the first step, and the second-level CRF model parameters are learned in the second step.
LMDE with Speaker Nods: Our final set of baseline models includes an LMDE model that directly uses the visual context from the speaker (speaker nods). In this baseline model, we first train only the listener expert models, as in the first step of our proposed approach. Then, in the second step, we use the annotated (actual) speaker gestures together with the listener experts as input to the latent variable model. So, the main difference between this baseline model and our approach is that our approach first anticipates the speaker nonverbal behaviors through CRF experts instead of directly using them.
5.3.3 Methodology
We performed held-out testing by randomly selecting a subset of 11 interactions (out of 45) for the test set. The training set contains the remaining 34 dyadic interactions. All models in this chapter were evaluated with the same training and test sets. Validation of all model parameters (regularization term and number of hidden states) was performed using a 3-fold cross-validation strategy on the training set. The regularization term was validated with values 10^k, k = 1..3. Two different numbers of hidden states were tested for the LDCRF models: 3 and 4.

The performance is measured by using the conventional metrics: precision, recall, and F-measure. Precision is the probability that predicted backchannels correspond to actual listener behavior. Recall is the probability that a backchannel produced by a listener in our test set was predicted by the model. We use the same weight for both precision and recall, the so-called F1, which is the weighted harmonic mean of precision and recall. F1 scores are first calculated for each sequence, and the final F1 result is then computed by averaging these sequence scores.

Before reviewing the prediction results, it is important to remember that backchannel feedback is an optional phenomenon, where the actual listener may or may not decide to give feedback (Ward & Tsukahara, 2000). Therefore, results from prediction tasks are expected to have lower accuracies than recognition tasks, where labels are directly observed (e.g., part-of-speech tagging).
During testing, we find all the "peaks" (i.e., local maxima) of the marginal probabilities P(y_j = a | x; Θ). For the F1 score, the prediction model needs to decide on a specific threshold (i.e., amount of backchannel) on the marginal probabilities for all users. The value of this threshold is automatically set during validation. Since we are predicting the start time of a backchannel, an actual listener backchannel is correctly predicted if at least one model prediction happens within the 1 second interval window around the start time of the listener backchannel.
Table 5.1: Test performances of the individual expert models for listener backchannel predictions.

  Listener Expert   F1       Precision   Recall
  Prosodic          0.1913   0.1060      0.9803
  Lexical           0.2073   0.1377      0.4198
  POS               0.2346   0.1446      0.6220
  Syntactic         0.2045   0.1287      0.4956
  visual-LMDE       0.3212   0.2633      0.4117

Table 5.2: Test performances of the individual expert models for speaker gesture (head nod) predictions.

  Speaker Expert    F1       Precision   Recall
  Prosodic          0.2789   0.1669      0.8478
  Lexical           0.2959   0.2068      0.5203
  POS               0.3274   0.2182      0.6556
  Syntactic         0.3175   0.2330      0.4983
  visual-LMDE       0.3313   0.2456      0.5087
The training of all CRFs and LDCRFs was done using the hCRF library (http://sourceforge.net/projects/hrcf/). The LMDE model was implemented in Matlab based on the hCRF library. The input observations were computed at 30 frames per second. Given the continuous labeling nature of our model, prediction outputs were also computed at 30Hz.
5.4 Results and Discussions
In this section we present the results of our empirical evaluation. We designed our experiments to test different characteristics of our visual-LMDE approach: (1) the integration of multiple sources of information, and (2) visual influence.

The performances of the individual CRF experts for predicting listener backchannels are presented in Table 5.1. Our approach combines all these experts to model the visual influence of the speaker on the listener backchannel feedbacks. This integration of multiple sources improves the prediction accuracy for listener backchannels: we obtain an F-1 score of 0.3212 with our visual-LMDE model.

The performances of the individual CRF experts for predicting speaker gestures are also presented in Table 5.2. Although the difference is not significant, we see an improvement in the F-1 score of the prediction model when combining the results from all multimodal experts in our visual-LMDE model. However, note that the main purpose of our framework is to improve the final listener backchannel predictions, and to see whether these prediction models of the speaker are helpful for modeling the visual influence of the speaker.
In our second set of experiments, we evaluate the importance of modeling visual influence. Table 5.3 summarizes our results. The prediction models in the top three rows of the table do not take into account the visual influence of the speaker gestures on the listener feedback behaviors. These models are trained on the speaker audio features to directly infer the listener backchannels. Among these models, LMDE gives the best F-1 score, which shows the importance of late fusion of multiple sources of information (different speech channels). However, our visual-LMDE model outperforms all three of these models, which indicates the importance of using visual influence.

The models listed in the last three rows of Table 5.3 model the visual influence. The CRF Mixture model does not perform as well as the other LMDE models. The main reason for this decrease in performance is that the LMDE models use a latent variable to capture the dynamic among different sources of information, whereas the CRF Mixture approach combines this information directly. Although the last LMDE approach uses the speaker nonverbal behavior information directly in the second step of LMDE, it does not perform as well as our visual-LMDE model, in which we first infer these speaker behaviors instead of directly using them. We hypothesize that, by inferring the speaker gestures, we are able to model a better average speaker behavior and remove the variations in the actual speaker gestures.
Table 5.3: Comparison of different models with our approach.

  Model               F1       Precision   Recall
  Early CRF           0.2173   0.1423      0.4591
  Early LDCRF         0.2115   0.1231      0.7495
  LMDE                0.2764   0.2055      0.4219
  Pause-Random        0.1456   0.1322      0.2031
  CRF Mixture         0.1963   0.1718      0.2288
  LMDE+Speaker Nods   0.2614   0.2071      0.3541
  visual-LMDE         0.3212   0.2633      0.4117
5.5 Conclusions
In this chapter, we proposed a context-based approach for predicting the backchannels of a listener in a dyadic conversation. To model the visual influence of the speaker gestures on the listener behaviors, we used a variant of the Latent Mixture of Discriminative Experts model. Our visual-LMDE approach consists of two steps: we first learn expert models to predict the speaker gestures (head nods) and the listener backchannel feedbacks. Then, we use the visual context (predicted speaker gestures) from the speaker to improve the final listener backchannel predictions.

We evaluated our approach on 45 dyadic interactions from the RAPPORT dataset. Our experiments have shown an improvement over all previous approaches. The results suggest two main conclusions: (1) by modeling the visual influence of the speaker gestures, we can better model the backchannel feedbacks of the listener; and (2) in the case of no available visual speaker information, the predicted speaker visual context helps us to learn an average speaker behavior that is more effectual and less noisy than the actual speaker behaviors.
Chapter 6
Variability in Human's Behaviors: Modeling Wisdom of
Crowds
6.1 Introduction
Many studies have shown that culture, age and gender influence people's nonverbal communication (Matsumoto, 2006; Linda L. Carli & Loeber, 1995). Furthermore, the style of an individual's nonverbal behaviors is affected by his/her personality (Ambady & Rosenthal, 1998). For instance, personality assessments made by human judges can be deduced from prosodic features extracted from the speech signal (Mohammadi et al., 2010). This variability among people's nonverbal behaviors breeds the challenge of modeling the "wisdom of crowds". Wisdom of crowds rests on the idea that the aggregated knowledge of many diverse individuals can outperform experts in estimation and decision-making problems (Surowiecki, 2004; Rauhut & Lorenz, 2011).

Wisdom of crowds has been shown to be effective in many fields such as marketing, decision making, business, and sociology. Solomon (Solomon, 2006) showed that the aggregation of individual decisions produces better decisions than those of either group deliberation or individual expert judgment. Along with the rise of online services like Amazon Mechanical Turk (Schneider, ), crowdsourcing emerged as an easy way of labeling data. Therefore, the challenge of modeling wisdom of crowds has attracted
the attention of many researchers from the computer science community. Welinder et al. (Welinder et al., 2010b) presented a method to estimate the class of an image from (noisy) annotations provided by multiple annotators.

In an attempt to see whether listeners really behave differently, and whether a simple consensus approach that aggregates opinions from more than one person on the same task can improve the overall prediction performance, we conducted preliminary experiments on a parallel listener corpus (de Kok & Heylen, 2011). A parallel listener corpus is collected by recording three listeners in parallel while they interact with the same speaker.

Motivated by our preliminary experiments, we propose to use the Parasocial Consensus Sampling (PCS) paradigm (Huang et al., 2010), which allows us to acquire responses from more than just a few parallel listeners. We present a new method to model wisdom of crowds that is based on the Latent Mixture of Discriminative Experts (LMDE) originally introduced for multimodal fusion (Chapter 4). In our wisdom-LMDE approach, each individual in the crowd is used to train a separate discriminative expert. One key advantage of our approach is that the variation between different experts is learned automatically using a latent variable model. Our wisdom-LMDE approach is depicted in Figure 6.1, where each human listener corresponds to an expert in our model.

In our experiments, we validate the performance of our approach using a dataset of 43 storytelling dyadic interactions. What makes the backchannel prediction task well-suited for our model is that listener feedback varies between people and is often optional (listeners can always decide whether or not to give feedback). Furthermore, some people might be paying more attention to the gestures of the speaker to provide a backchannel, whereas for other people the vocal information might be more effective for providing backchannel feedback.

In the rest of this chapter, we first give our preliminary results on the parallel listener corpus in Section 6.2. Then, we describe the wisdom acquisition technique in Section 6.3. In Section 6.5 we present the Wisdom-LMDE model. In Section 6.6, the experimental setup and evaluations are provided. Finally, we conclude with a discussion in Section 6.8.
Figure 6.1: Our wisdom-LMDE: (1) multiple listeners experience the same series of stimuli (pre-recorded speakers) and (2) a wisdom-LMDE model is learned using this wisdom of crowds, associating one expert with each listener.
6.2 Preliminary Study on Parallel Listener Consensus
To see if opinions of more than one person on the same task improve the overall prediction performance, we conducted preliminary experiments on the Parallel Listener Corpus (MultiLis corpus), which are summarized in the following subsections. We introduce the MultiLis data in Section 6.2.1. Then, we explain how we combine the recordings of multiple listeners into consensus instances in Section 6.2.2. Finally, we present some experimental evaluations in Section 6.2.3.
6.2.1 Parallel Listener Corpus
The MultiLis corpus (de Kok & Heylen, 2011) is a Dutch spoken multimodal corpus of 32 mediated face-to-face interactions totalling 131 minutes. Participants (29 male, 3 female, mean age 25) were assigned the role of either speaker or listener during an interaction. The speakers summarized a video that they had just seen, or reproduced a recipe they had just studied for 10 minutes. Listeners were instructed to memorize as much as possible about what the speaker was telling them. In each session, four participants were invited to record four interactions. Each participant was once speaker and three times listener.

Figure 6.2: Picture of the cubicle in which the participants were seated. It illustrates the interrogation mirror and the placement of the camera behind it, which ensures eye contact (from (de Kok & Heylen, 2011)).
What is unique about this corpus is the fact that it contains recordings of three individual listeners to the same speaker in parallel, while each of the listeners believed they were the sole listener. The speakers saw one of the listeners, believing that they had a one-on-one conversation. We will refer to this listener, who can be seen by the speaker, as the displayed listener. The other two listeners, who cannot be seen by the speaker, will be referred to as concealed listeners. All listeners were placed in a cubicle and saw the speaker on the screen in front of them. The camera was placed behind an interrogation mirror, positioned directly behind the position on which the interlocutor was projected (see Figure 6.2). This made it possible to create the illusion of eye contact. To ensure that the illusion of a one-on-one conversation was not broken, interaction between participants was limited. Speakers and listeners were instructed not to ask for clarifications or to elicit explicit feedback from each other.
The recordings were manually annotated for a number of features. For the listener, the corpus includes annotations of head, eyebrow and mouth movements, and speech transcriptions. A listener response can be any combination of a head nod accompanied by a smile, raised eyebrows accompanied by a smile, or the vocalization of "uh-huh", occurring at about the same time. Among these 2798 annotations, the 2456 responses from the MultiLis corpus with a head movement and/or a vocalization were used as the ground truth labels.

Considering only the displayed listener, we can regard this corpus as any other corpus of recorded one-to-one interactions. But we also have the listening behavior of the concealed listeners at our disposal and can utilize this during learning. The parallel consensus not only improves the quality of negative samples and increases the number of positive samples, but also provides information about the importance of each response opportunity, reducing the effect of outliers.
6.2.2 Building Response Consensus
The algorithm for building listener responses performs a forward-looking search. When an unassigned response is encountered, the algorithm checks whether there are more responses which start within the consensus window of 700 ms from the start time of this response. If there are, all of these are grouped together with the response. The start time of the consensus instance is the onset of the first response; the end time of the consensus instance is the onset of the latest response included in the consensus. This corresponds to the "Window-of-Opportunity" found in the data, which starts with the beginning of the first listener response and ends with the beginning of the last listener response in the consensus. Note that this means that if a consensus of only one response is created, then the start and end times of the consensus are identical. After a consensus is created, we continue our forward-looking search for the next unassigned response. (By analyzing the listener interactions in the training corpus, the minimal response gap was found to be 714 ms; to ensure that the algorithm does not group two responses from the same listener, the consensus window is set to 700 ms.)
Using the MultiLis corpus, the algorithm above created 1733 consensus instances. There are 1140 consensus 1 instances (instances containing only one response), 465 consensus 2 instances (two responses) and 128 consensus 3 instances (responses from all three listeners).
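A minimal sketch of this forward-looking grouping is given below; the (listener, onset-in-seconds) representation of a response is our assumption:

def build_consensus(responses, window=0.7):
    """Forward-looking grouping of listener responses: every response whose onset
    falls within the 700 ms consensus window of the first unassigned response is
    grouped with it. Each response is a (listener_id, onset_in_seconds) pair."""
    responses = sorted(responses, key=lambda r: r[1])
    instances, used = [], set()
    for i, (_, onset) in enumerate(responses):
        if i in used:
            continue
        used.add(i)
        group = [responses[i]]
        for j in range(i + 1, len(responses)):
            if j not in used and responses[j][1] - onset <= window:
                group.append(responses[j])
                used.add(j)
        # start = onset of the first response, end = onset of the latest response
        instances.append({"start": group[0][1], "end": group[-1][1], "size": len(group)})
    return instances

# Three listeners responding around two opportunities
responses = [("A", 10.00), ("B", 10.45), ("C", 10.60), ("A", 18.20), ("C", 18.30)]
for inst in build_consensus(responses):
    print(inst)   # {'start': 10.0, 'end': 10.6, 'size': 3} and {'start': 18.2, ...}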
6.2.3 Experiments
The goal of our experiments is to see whether the aggregation of opinions from more than one person on the same task can improve the overall prediction performance. In this section, we first describe our methodology and the features used in our experiments. Then, we explain our prediction models. Finally, we present the experimental results.
6.2.3.1 Methodology
In our experiments, we use Conditional Random Fields (CRF) (Lafferty et al., 2001a) for data labeling. We use the hCRF library (hCRF, 2007) for the training of our CRF models. Testing is performed on a hold-out set of 10 randomly selected interactions. The remaining 21 dyadic interactions were used for training. Validation of model parameters was performed using a 3-fold strategy on the training set. During training and validation, the regularization term was automatically validated with values 10^k, for k = -3..3.
In all models, the ground truth labels are normalized to the same length of 700 ms. The mean start time of the responses included in each consensus instance is calculated, and each instance starts 350 ms before this mean start time and ends 350 ms after it.
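A small sketch of this 700 ms label normalization (our own helper, assuming onset times in seconds):

def normalized_label(onsets, half_width=0.35):
    """Every ground-truth label is 700 ms long, centred on the mean onset of
    the responses grouped into one consensus instance."""
    mean_onset = sum(onsets) / len(onsets)
    return (mean_onset - half_width, mean_onset + half_width)

print(normalized_label([10.0, 10.45, 10.6]))   # roughly (10.0, 10.7): a 700 ms label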
We used the following speaker features in our experiments: lexical features, pause, eye gaze and prosodic features (pitch (F0) values at a 10 ms interval).
6.2.3.2 Prediction Models
In our experiments, we used three different learning strategies, described below.

Displayed Listener Only: Our first model consists of a CRF chain model trained using the responses of only the displayed listener as the ground truth labels. This model is the main baseline for our experiments, since most previous work used this approach (e.g., (Cathcart et al., 2003a; Morency et al., 2008b; Ward & Tsukahara, 2000)). We refer to this model as the DL Only model in the rest of this chapter.

All Listeners: In the second model, we use the responses of both the primary (displayed) listener and the two secondary (concealed) listeners in the same session. For the secondary listeners, we duplicate speaker-listener pairs by using the same speaker for both listeners. These duplicated listener-speaker pairs can be seen as different sessions in which the speaker has the same features and the listeners have their own responses. We refer to this model as the ALL model in the rest of this chapter.
Consensus 1, 2 and 3: The Consensus 1 model includes all consensus instances, so all the response opportunities to which at least one listener (either the displayed listener or one of the concealed listeners) has responded are used as ground truth labels. The Consensus 2 model only includes response opportunities to which at least two listeners have responded, and the Consensus 3 model only opportunities to which all three listeners have responded.
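The three strategies therefore differ only in how many responses a consensus instance must contain before it is kept as a positive label; a small sketch of that filtering, reusing the consensus-instance representation from the earlier sketch in Section 6.2.2, is:

def consensus_labels(instances, min_responses):
    """Keep a consensus instance as a positive label only if at least
    min_responses listeners responded to it (1, 2 or 3 in our experiments)."""
    return [inst for inst in instances if inst["size"] >= min_responses]

instances = [{"start": 10.0, "end": 10.6, "size": 3},
             {"start": 18.2, "end": 18.3, "size": 2},
             {"start": 25.1, "end": 25.1, "size": 1}]

for k in (1, 2, 3):
    print("Consensus %d: %d positive instances" % (k, len(consensus_labels(instances, k))))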
6.2.4 Results and Discussion
In our experiments, we evaluate and compare the learning strategies described above, measured on the displayed listener only labels (Table 6.1) and on the Consensus 1 and Consensus 2 labels (Table 6.2).

Displayed Listener Only: As a baseline, we measured the performance of our response prediction models on the Displayed Listener Only (DL Only) labels. Table 6.1 shows the performance of our five models on this measure. Our result with learning on DL Only (F1 = 0.265) in this case (our baseline model) is comparable to the result of Morency et
al. (Morency et al., 2008b) (F1 = 0.256), but on a different corpus. Looking at the other approaches which the MultiLis corpus allowed us to take, we can see that learning on Consensus 2 achieves comparable performance (F1 = 0.264), and the performance of the ALL model is only slightly worse. The other approaches do not perform as well as the traditional approach of using DL Only for learning.

Table 6.1: The performance of our five models measured using only the displayed listener's ground truth labels.

  Model                F1      Precision   Recall
  Baseline (DL Only)   0.265   0.268       0.262
  All Listeners        0.255   0.188       0.392
  Consensus 1          0.225   0.166       0.352
  Consensus 2          0.264   0.199       0.391
  Consensus 3          0.239   0.170       0.402
Consensus 1 and 2: As discussed in Section 6.2.1, this corpus provides us with more information than only the responses of the displayed listener. We also have the responses of the concealed listeners available to us, and this information can also be used during evaluation to get a more precise performance measure dealing with exactness and completeness. Since the production of a listener response is optional, we see in our corpus that the displayed listener and the concealed listeners do not always respond at the same time. The displayed listener may miss response opportunities to which one or both of the concealed listeners did respond. A prediction of our model at such a missed response opportunity should not be counted as a wrong prediction, since according to our corpus, these are moments where listeners do provide responses. In our corpus, Consensus 1 provides the broadest coverage of these moments. Therefore, we also looked at the performance of our models using Consensus 1 as ground truth (see Table 6.2). On this measure, the All Listeners (F1 = 0.377), Consensus 2 (F1 = 0.375) and Consensus 3 (F1 = 0.364) models perform significantly better than the Displayed Listener Only (F1 = 0.278) model. The Consensus 1 (F1 = 0.318) model has a performance in between the other models.
Table 6.2: The performance of our five models measured using Consensus 1 and Consensus 2 ground truth labels.

  Model                Consensus 1 F1   Consensus 2 F1
  Baseline (DL Only)   0.278            0.253
  All Listeners        0.377            0.255
  Consensus 1          0.318            0.213
  Consensus 2          0.375            0.287
  Consensus 3          0.364            0.256
However, Huang et al. (Huang et al., 2010) have shown that a virtual human which responds at moments when most people would respond is the most believable. In our corpus, these are the moments where two or three listeners responded to the same opportunity at the same time (Consensus 2). Using these ground truth labels, the response rate is closest to the response rate of the average listener. Again, the Consensus 2 model (F1 = 0.287) performs best on this measure, but the differences with the Displayed Listener Only (F1 = 0.253), All Listeners (F1 = 0.255) and Consensus 3 (F1 = 0.256) models are not significant. So, overall, learning on Consensus 2 performs best in all cases, which supports the use of parallel listener consensus in the learning phase.
In summary, our preliminary experiments showed that combining listener responses from multiple parallel interactions allows us to better capture the differences and similarities between individuals. The parallel consensus helped us improve the prediction performance by identifying better negative samples and reducing outliers. Based on these results, we attempt to automatically capture these differences and similarities between individuals, and propose a framework for modeling wisdom of crowds in the rest of this chapter. We start by explaining our wisdom acquisition technique in the following section.
6.3 Wisdom of Crowds
There might be variations between people's reactions even when they experience the same situation. For instance, in our case of listener backchannel prediction, some people might respond more to the visual gestures of the speaker (e.g. eye gaze), whereas other people might respond to the acoustic characteristics of the speaker's speech (e.g. low pitch). Furthermore, some listeners may prefer to be more active by providing more frequent feedback, whereas others may prefer to stay impassive and provide less frequent feedback. The main goal of this study is to take advantage of these differences in the wisdom of crowds. We assume that by modeling the differences (and commonalities), we can better learn an average listener behavior.

The process of modeling wisdom of crowds refers to the idea that aggregated information from a group might be better than information from any individual in that group. The data acquisition procedure is an important aspect of this process, since it should allow guided crowd members to experience the same situation. We choose to use the following two paradigms for data acquisition: Parasocial Consensus Sampling (PCS) (Huang et al., 2010) and Parallel Listener Consensus (PLC) (de Kok & Heylen, 2011). In Parasocial Consensus Sampling, observers interact with pre-recorded speakers, whereas in Parallel Listener Consensus, interactions occur in real time.

Our first data acquisition technique, Parallel Listener Consensus, has already been introduced in Section 6.2.1. In the rest of this section, we present our second data acquisition approach, based on the Parasocial Consensus Sampling paradigm.
6.3.1 Parasocial Consensus Sampling
Our second data acquisition method exploits the Parasocial Consensus Sampling (PCS)
paradigm, which was introduced by Huang et al. (Huang et al., 2010). Based on the
theory that people behave similarly when interacting through media (known as parasocial
interaction), PCS allows us to gather data from multiple individuals experiencing
the same situation. Instead of recording face-to-face interactions, participants in PCS
are requested to achieve a (given) communicative goal by interacting with a mediated
representation of a person. For instance, when our goal is to model listener behaviors,
participants are presented with prerecorded speaker videos shown on a screen. One
of the main advantages of using PCS data is that it allows a better analysis and
understanding of the differences in the nonverbal behaviors of different people.
To obtain data for modeling listener backchannel feedback with PCS, several participants
are recruited to watch the same speaker videos. During the PCS interaction,
participants were told to pretend that they were an active listener while watching the
video of the speaker. To convey their interest in the topic, they were asked to press a
keyboard key whenever they felt like providing backchannel feedback such as a head nod or
a paraverbal (e.g., "umh" or "yeah"). There were a total of 9 participants, all adults from
Asia, North America and Europe.
The PCS paradigm is applied to 45 dyadic interactions from the Rapport dataset (Gratch
et al., 2006) (see Section 3.4.1). In other words, the speaker videos from this dataset
are used as the mediated representation of a speaker during PCS interactions.
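To make the PCS labeling procedure concrete, the short sketch below illustrates one way the recorded key-press timestamps could be converted into frame-level backchannel labels aligned with a speaker video. This is an illustrative sketch only, not the preprocessing code used in this thesis; the frame rate, tolerance window and data layout are assumptions.

# A minimal sketch (assumed frame rate and tolerance window) of turning PCS
# key-press timestamps into frame-level backchannel labels for one speaker video.
import numpy as np

FRAME_RATE = 30.0       # assumed video frame rate (frames per second)
WINDOW_SEC = 0.5        # assumed tolerance: a key press marks a short feedback window

def keypresses_to_labels(press_times_sec, video_duration_sec,
                         frame_rate=FRAME_RATE, window_sec=WINDOW_SEC):
    """Return a binary label per video frame: 1 if the participant signalled
    a backchannel opportunity around that frame, 0 otherwise."""
    n_frames = int(round(video_duration_sec * frame_rate))
    labels = np.zeros(n_frames, dtype=int)
    for t in press_times_sec:
        start = max(0, int((t - window_sec) * frame_rate))
        end = min(n_frames, int((t + window_sec) * frame_rate) + 1)
        labels[start:end] = 1
    return labels

# Example: one PCS participant pressed the key at 3.2s, 8.7s and 12.1s
# while watching a 20-second speaker video.
labels = keypresses_to_labels([3.2, 8.7, 12.1], video_duration_sec=20.0)
print(labels.sum(), "frames labelled as backchannel opportunities")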
6.4 Individual Experts and Wisdom Analysis
To have a better understanding of the variations among different listeners (experts), we
selected the most important features for all 9 listeners in the PCS dataset for analysis.
Our wisdom modeling approach is based on a mixture of experts. We first learn expert
models for each listener annotator in our datasets and then fuse them together to exploit
the differences and commonalities among these experts. The following subsection explains
how the listener experts are created. In Section 6.4.2, we present the feature analysis
results to understand the variability among listeners.
6.4.1 Listener Experts
Both Parasocial Consensus Sampling and Parallel Listener Consensus paradigms allow
us to obtain labels from multiple listener annotators while interacting with the same
speaker. Therefore, we have multiple sets of listener labels for the same set of speaker
behaviors. In our wisdom modeling approach, we use the labels from one single listener
annotator to learn a prediction model for that specific annotator. We refer to each of
these listener prediction models as listener experts, since they learn the prototypicality
of one specific listener annotator. By ranking the important features for each listener
expert model, we can get a better sense of the speaker features that are most relevant
for triggering a backchannel feedback from each listener.

Listener     Rank 1        Rank 2         Rank 3
Listener 1   pause         label:sub      POS:NN
Listener 2   POS:NN        pause          label:pmod
Listener 3   pause         POS:NN         label:nmod
Listener 4   pause         POS:NN         low pitch
Listener 5   pause         dirdist:L1     low pitch
Listener 6   POS:NN        pause          low pitch
Listener 7   Eyebrow up    dirdist:L8+    POS:NN
Listener 8   eye gaze      dirdist:R1     POS:JJ
Listener 9   lowness       eye gaze       pause
Table 6.3: Most predictive features for each listener from our wisdom dataset. This
analysis suggests three prototypical patterns for backchannel feedback.

All 45 speaker videos in the PCS dataset are annotated by 9 different listener
annotators. Therefore, we created 9 expert models, one for each annotator in the dataset.
For the MultiLis dataset, 3 different listeners are watching the same speaker. In total,
we have 20 listener experts that provide labels in this dataset.
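The listener experts in this thesis are CRF models trained with the hCRF library. The sketch below only illustrates the per-annotator training loop, with a simple frame-wise logistic regression (via scikit-learn) standing in for the CRF; the data shapes and feature values are hypothetical.

# One expert per listener annotator: a per-annotator training loop sketch.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_frames, n_features, n_listeners = 2000, 12, 9   # e.g., 9 PCS annotators

# Hypothetical speaker features (one row per frame) shared by all listeners,
# and one binary backchannel label sequence per listener annotator.
X = rng.normal(size=(n_frames, n_features))
Y = rng.integers(0, 2, size=(n_listeners, n_frames))

listener_experts = []
for listener_id in range(n_listeners):
    expert = LogisticRegression(max_iter=1000)
    expert.fit(X, Y[listener_id])          # one expert model per annotator
    listener_experts.append(expert)

# Each expert later provides per-frame marginals q_j^alpha for the fusion step.
marginals = np.stack([e.predict_proba(X)[:, 1] for e in listener_experts])
print(marginals.shape)   # (n_listeners, n_frames)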
6.4.2 Feature Analysis
To have a better understanding of the differences and commonalities among the listener
experts, we apply a feature ranking method to each listener expert model, which provides
the features that are the most important for each expert. In other words, we
learn the top features of each listener expert that are the most relevant for that listener
to provide backchannel feedback. The input features are the speaker behaviors, which
are explained in more detail in Section 6.6.1.
We use our feature ranking scheme described in Chapter 3 to select the top speaker
features for each listener that are the most relevant for the backchannel prediction task.
This allows us to analyze commonalities and differences between listener experts.
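For illustration, the sketch below shows the general idea behind this ranking: follow an L1 regularization path from a strongly to a weakly regularized model and rank features by how early their weights become non-zero. It uses a frame-wise L1-regularized logistic regression as a stand-in for the L1-regularized CRF of Chapter 3, and the data are synthetic.

# Rank features by the regularization strength at which their weight first activates.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 8))
w_true = np.array([2.0, 0, 0, 1.0, 0, 0, 0, 0.5])          # toy ground truth
y = (X @ w_true + 0.5 * rng.normal(size=1000) > 0).astype(int)
feature_names = [f"f{i}" for i in range(X.shape[1])]

first_active = {}                                # C value at which a weight appears
for C in np.logspace(-3, 1, 30):                 # strong -> weak regularization
    clf = LogisticRegression(penalty="l1", C=C, solver="liblinear").fit(X, y)
    for name, w in zip(feature_names, clf.coef_.ravel()):
        if abs(w) > 1e-8 and name not in first_active:
            first_active[name] = C

ranking = sorted(first_active, key=first_active.get)   # earliest-activated first
print("top features:", ranking[:3])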
The top 3 features for all 9 listener annotators in the PCS dataset are listed in
Table 6.3. We can see interesting groupings when looking at the features selected for all 9
listener experts. For the first 3 listeners, we observe mostly pauses in speech and syntactic
features, whereas for the next 3 experts, prosodic features are also prominent. This is
coherent with the findings in (Nishimura et al., 2007; Ward & Tsukahara, 2000), where
listener feedback was shown to be correlated with the speaker's prosody. It is interesting
to see that visual information appears as one of the top 3 features of only the last 3
experts. Burgoon et al. (Burgoon et al., 1995) also showed that speaker gestures are
often correlated with listener feedback.
These results clearly indicate variations in people's most predictive features. On the
other hand, similarities among some people are also observed. For instance, the first
group of listeners seems to mostly pay attention to speech, pause and POS:NN, as shown
in the top features; however, the ordering of these features is different. Similarly, for
the last group of listeners, which use the visual information, eyebrows up appears as the
top feature for only one of the listeners, whereas the other two mostly use the eye gaze
information.
6.5 Computational Model: Wisdom-LMDE
In order to automatically model the variations among different listeners, we use an
extension of the Latent Mixture of Discriminative Experts (LMDE) explained in Chapter 4.
Our wisdom-LMDE model is based on a two-step process: a Conditional Random
Field model (see Section 6.4.1) is first learned for each listener expert. Then, these
expert models are fused together in the second step to capture the hidden structure
among the experts. See Figure 6.1 for a graphical representation of the CRF and LDCRF
models. Since each expert corresponds to a different listener, LMDE automatically
learns the hidden structure among different listeners.
The task of our wisdom-LMDE model is to learn a mapping between a sequence
of observations x = \{x_1, x_2, \dots, x_m\} and a sequence of labels y = \{y_1, y_2, \dots, y_m\}, as
described in Section 6.4 for CRFs. However, different from a CRF model, we also
assume a vector of "sub-structure" variables h = \{h_1, h_2, \dots, h_m\} for each sequence.
These variables are not observed in the training examples and will therefore form a set
of hidden variables in the model.
We define the wisdom-LMDE model as follows:

P(y \mid x; \theta) = \sum_{h : \forall h_j \in H_{y_j}} P(h \mid x; \theta)    (6.1)

where \theta are the model parameters learned during training and P(h \mid x; \theta) is defined as
follows:

P(h \mid x; \theta) = \frac{\exp\left( \sum_l \theta_l T_l(h, x) + \sum_s \theta_s S_s(h, x) \right)}{Z(x; \theta)}    (6.2)

For convenience, we split \theta into two parts: (1) \theta_l, related to the transitions between
hidden states, and (2) \theta_s, related to the expert outputs. T_l(h, x) is the transition function
between the hidden states, \theta_s models the relationships between the expert outputs and the
hidden states h_j, and Z is the partition function. S_s(h, x) is the expert state function and
is defined as follows:

S_s(h, x) = \sum_j s_s\left( h_j, [\, q_j^1 \; q_j^2 \; \dots \; q_j^\alpha \; \dots \; q_j^{|e|} \,] \right)    (6.3)

Each q_j^\alpha is the marginal probability of expert \alpha at frame j, and is equal to
P_\alpha(y_j = a \mid x; \lambda^\alpha). Each expert conditional distribution P_\alpha(y \mid x; \lambda^\alpha) is defined using the usual
conditional random field formulation, which we also used in Section 6.4 for feature
ranking:

P_\alpha(y \mid x; \lambda^\alpha) = \frac{\exp\left( \sum_k \lambda^\alpha_k F_{\alpha,k}(y, x) \right)}{Z_\alpha(x; \lambda^\alpha)}    (6.4)

where \lambda^\alpha represents the model parameters of expert \alpha. F_{\alpha,k} is either a state
function s_k(y_j, x, j) or a transition function t_k(y_{j-1}, y_j, x, j). State functions s_k depend
on the input speaker features at each frame, defined in Section 5.3.1, while transition
functions t_k can depend on pairs of (backchannel) labels.
In our framework, each listener expert learns a different listener behavior (backchannel
timings) in response to the same speaker gestures. By using a latent variable model
to combine these individual experts, our wisdom-LMDE model is able to learn both
the visual influence of the speaker gestures on the listener behaviors and the hidden
structure among the experts.
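The sketch below illustrates (under assumed array shapes) how the expert outputs enter the second stage: each frame j is represented by the vector [q_j^1 ... q_j^{|e|}] of expert marginals, which the latent-variable model consumes through the state function S_s. It is a toy illustration, not the Matlab/hCRF implementation used in this thesis.

# Building the per-frame expert-marginal observations for the second-stage model.
import numpy as np

def expert_marginal_matrix(expert_marginals):
    """expert_marginals: list of length |e|, each a (n_frames,) array of
    P_alpha(y_j = backchannel | x). Returns a (n_frames, |e|) observation matrix."""
    return np.stack(expert_marginals, axis=1)

# Toy example with 3 experts over a 5-frame sequence.
q1 = np.array([0.1, 0.2, 0.8, 0.3, 0.1])
q2 = np.array([0.2, 0.1, 0.6, 0.7, 0.2])
q3 = np.array([0.0, 0.1, 0.9, 0.2, 0.1])
Q = expert_marginal_matrix([q1, q2, q3])
print(Q.shape)        # (5, 3): one expert-marginal vector per frame

# A simple fixed-weight opinion pool would just average these columns per frame;
# wisdom-LMDE instead learns hidden states h_j that re-weight the experts over time.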
6.6 Experiments
We hypothesize that the variability among multiple listeners' behaviors can be modeled by
merging these experts via a latent variable model, as in wisdom-LMDE. Therefore, the
main goal of our experiments is to test whether this hypothesis holds by presenting both
quantitative and qualitative results that compare our wisdom-LMDE model with state-of-the-art
baseline models, and by analyzing the model and the dominant features of the listener
experts.
To show the generalization of our approach, we present experimental results on two
different datasets: the PCS and MultiLis datasets. In the rest of this section, we describe
the multimodal features, the baseline models used for comparison, and the methodology used
during training of our models.
6.6.1 Multimodal Features
We describe the features used in our datasets: lexical, part-of-speech, syntactic structure,
prosodic and visual.
Lexical: Some studies have suggested an association between lexical features and
listener feedback (Cathcart et al., 2003b). Although lexical features are not as easy
to recognize in real time as the other features, there has been recent progress in
real-time keyword spotting (Igor et al., 2005). From the PCS dataset we include the
following manually transcribed lexical features for completeness:
Top 100 individual words (i.e., unigrams), selected based on their frequency in the data.
In the case of the MultiLis dataset, the lexical features were extracted using the Dutch
automatic speech recognition software SHoUT (Huijbregts, 2008). The recognized words
are collected with their start and end times, together with the silences.
Visual gestures: Gestures performed by the speaker are often correlated with listener
feedback (Burgoon et al., 1995). Eye gaze, in particular, has often been implicated
as eliciting listener feedback. Thus, we use the following contextual features provided
by the PCS dataset:
Speaker looking at the listener
Speaker smiling
Speaker moving eyebrows up
Speaker moving eyebrows down
The MultiLis dataset contains the following two visual features:
Speaker looking at the listener
Speaker blink
Figure 6.3: Baseline Models: a) Conditional Random Fields (CRF), b) Latent Dynamic
Conditional Random Fields (LDCRF), c) CRF Mixture of Experts (no latent variable)
Eye gaze and blink features were manually annotated. For eye gaze, the human coder
annotated whether the speaker was looking at the listener (directly into the camera) or
not. Gazes at the listener were occasionally interrupted by blinks of the speaker. Even
though the gaze was interrupted for a moment, the listener would still have the perception
that the speaker is addressing him/her. Therefore, a "continued gaze" feature
was created, where the blinks between and after a gaze annotation are included in
the interval. From both the normal gaze and the continued gaze features, a "blinked"
variant is created, which only includes the gaze intervals that were preceded by a
blink.
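As an illustration of how these gaze variants could be derived from interval annotations, the sketch below merges gaze intervals whose gap is spanned by a blink ("continued gaze") and keeps gaze intervals preceded by a blink (the "blinked" variant). The interval format, the bridging tolerance and the maximum gap are assumptions, not the MultiLis annotation tooling.

# Deriving "continued gaze" and a "blinked" gaze variant from (start, end) intervals.
def continued_gaze(gaze, blinks, tol=0.1):
    """Merge consecutive gaze intervals whose gap is spanned by a blink.
    gaze, blinks: lists of (start, end) tuples in seconds, sorted by start."""
    merged = []
    for start, end in gaze:
        if merged and any(b_s <= merged[-1][1] + tol and b_e >= start - tol
                          for b_s, b_e in blinks):
            merged[-1] = (merged[-1][0], end)      # a blink bridges the two gazes
        else:
            merged.append((start, end))
    return merged

def blinked_variant(gaze, blinks, max_gap=0.2):
    """Keep only gaze intervals that start within max_gap seconds of a blink end."""
    return [(s, e) for s, e in gaze
            if any(0.0 <= s - b_e <= max_gap for _, b_e in blinks)]

gaze = [(1.0, 2.0), (2.3, 4.0), (6.0, 7.5)]
blinks = [(2.05, 2.25), (5.85, 5.95)]
print(continued_gaze(gaze, blinks))   # [(1.0, 4.0), (6.0, 7.5)]
print(blinked_variant(gaze, blinks))  # [(2.3, 4.0), (6.0, 7.5)]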
Prosody: Prosody refers to the rhythm, pitch and intonation of speech. Several
studies have demonstrated that listener feedback is correlated with a speaker's
prosody (Nishimura et al., 2007). For example, Ward and Tsukahara (Ward & Tsukahara,
2000) show that short listener backchannels (listener utterances like "ok" or "uh-huh"
given during a speaker's utterance) are associated with a lowering of pitch over
some interval. Listener feedback often follows speaker pauses or filled pauses such as
"um" (see (Cathcart et al., 2003b)). The PCS dataset provides the following prosodic
features, including standard linguistic annotations and the prosodic features suggested
by Ward and Tsukahara (a small extraction sketch follows the list):
Downslopes in pitch for at least 40ms
Regions of pitch lower than the 26th percentile continuing for at least 110ms (i.e.,
lowness)
Drop or rise in energy of speech (i.e., energy edge)
Fast drop or rise in energy of speech (i.e., energy fast edge)
Vowel volume (i.e., vowels are spoken softer)
Pause in speech (i.e., no speech)
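The sketch below illustrates how two of these cues, pause and lowness, could be extracted from a per-frame pitch track. The 26th percentile and 110 ms values follow the definitions above; the frame period, the minimum pause length and the input representation are illustrative assumptions rather than the feature-extraction pipeline actually used.

# Extracting "lowness" and pause regions from an assumed 10 ms-hop F0 track
# (F0 = 0 where unvoiced / silent).
import numpy as np

FRAME_SEC = 0.010

def runs(mask):
    """Yield (start_idx, end_idx) index pairs for contiguous True runs."""
    edges = np.flatnonzero(np.diff(np.concatenate(([0], mask.astype(np.int8), [0]))))
    return list(zip(edges[0::2], edges[1::2]))

def lowness_regions(f0, min_dur=0.110):
    voiced = f0[f0 > 0]
    thresh = np.percentile(voiced, 26)                      # 26th percentile of pitch
    low = (f0 > 0) & (f0 < thresh)
    return [(s, e) for s, e in runs(low) if (e - s) * FRAME_SEC >= min_dur]

def pause_regions(f0, min_dur=0.200):                       # assumed minimum pause length
    return [(s, e) for s, e in runs(f0 == 0) if (e - s) * FRAME_SEC >= min_dur]

f0 = np.concatenate([np.full(50, 180.0), np.full(30, 90.0), np.zeros(40),
                     np.full(60, 200.0)])
print(lowness_regions(f0))   # the 30-frame (300 ms) low-pitch stretch
print(pause_regions(f0))     # the 40-frame (400 ms) silence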
Syntactic structure: Finally, we attempt to capture syntactic information that may
provide relevant cues by extracting four types of features from a syntactic dependency
structure corresponding to the utterance. The syntactic structure is produced automatically
for the PCS dataset, using a CRF part-of-speech (POS) tagger and a data-driven
left-to-right shift-reduce dependency parser (Sagae & Tsujii, 2007b), both trained on
POS tags and dependency trees extracted from the Switchboard section of the Penn
Treebank (Marcus et al., 1994), converted to dependency trees using the Penn2Malt
tool (http://w3.msi.vxu.se/~nivre/research/Penn2Malt.html). The four syntactic features are:
Part-of-speech tags for each word (e.g. noun, verb, etc.), taken from the output
of the POS tagger
Grammatical function for each word (e.g. subject, object, etc.), taken directly
from the dependency labels produced by the parser
Part-of-speech of the syntactic head of each word, taken from the dependency
links produced by the parser
Distance and direction from each word to its syntactic head, computed from the
dependency links produced by the parser
We were not able to find a robust equivalent parser for Dutch; therefore, the MultiLis
dataset does not contain the above features.
6.6.2 Baseline Models
In our experiments, we compare the wisdom-LMDE model with several baseline models.
We selected these baselines for specific reasons: they are state of the art, but they also
highlight different aspects of our approach: multiple labels (wisdom), modeling variability
(experts), and learning commonalities (hidden variable).
Actual Listener (AL) Classifiers: Our first baseline model is trained with the labels
from the original listeners, and therefore uses no wisdom of crowds. This baseline consists
of two models: CRF and LDCRF chains (see Figure 6.1). To train these models, we
use the labels of the "Actual Listeners" (AL) from the PCS dataset. In the case of the
MultiLis dataset, we use only the displayed listener's responses as the ground
truth labels.
Consensus Classifier: uses consensus labels to train a CRF model. Consensus
labels are constructed with an approach similar to the one presented in (Huang et al., 2010). To
decide whether there is a feedback at a specific time or not, we look at the response
level (the number of listeners that agree that there is a feedback at that
time). Then, we apply a threshold on the response level to keep only the important feedback moments.
For the PCS dataset, the threshold is set to 3 so that the consensus contains approximately the
same number of head nods as the actual listener. For the MultiLis dataset we have two
options for the threshold: 2 or 3. We present results with 3 in our experiments, since
it performs better than a threshold of 2 (de Kok et al., 2010).
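The consensus construction amounts to counting, per frame, how many listeners marked a feedback and thresholding that response level, as in the sketch below (array shapes are assumed; this is not the original annotation-processing code).

# Consensus labels from multiple listeners by thresholding the response level.
import numpy as np

def consensus_labels(listener_labels, threshold=3):
    """listener_labels: (n_listeners, n_frames) binary array.
    Returns (response_level, consensus) where consensus is binary per frame."""
    response_level = listener_labels.sum(axis=0)
    return response_level, (response_level >= threshold).astype(int)

# Toy example: 4 listeners over 8 frames.
L = np.array([[0, 1, 1, 0, 0, 1, 0, 0],
              [0, 1, 0, 0, 0, 1, 0, 0],
              [0, 0, 1, 0, 0, 1, 0, 1],
              [0, 1, 0, 0, 0, 0, 0, 0]])
level, consensus = consensus_labels(L, threshold=3)
print(level)        # [0 3 2 0 0 3 0 1]
print(consensus)    # [0 1 0 0 0 1 0 0]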
CRF Mixture of Experts: To show the importance of the latent variable in our prediction
model, we trained a CRF-based mixture of discriminative experts. A
graphical representation of a CRF Mixture of Experts is given in Figure 6.3. This model
is similar to the Logarithmic Opinion Pool (LOP) CRF suggested by Smith et al. (Smith
et al., 2005), in the sense that they both factor the CRF distribution into a weighted
product of individual expert CRF distributions. The main difference between the LOP and
CRF Mixture of Experts models is in the definition of the optimization functions. Training
of the CRF Mixture of Experts is performed in two steps: the expert models are learned in the
first step, and the second-level CRF model parameters are learned in the second step.
Multimodal LMDE: is the multimodal LMDE model described in Chapter 4, where
each expert corresponds to one of the different sets of multimodal features. For the PCS dataset,
we use 5 multimodal experts, one for each set of lexical, visual, prosodic, POS, and syntactic
features. For the MultiLis dataset, we have 2 experts, for the lexical and visual
features.
Pause-Random Classifier: This baseline model is a random backchannel generator,
which randomly generates backchannels whenever some pre-defined conditions in the
speech are met. Ward et al. (Ward & Tsukahara, 2000) define these conditions
as: (1) coming after at least 700 milliseconds of speech, (2) absence of backchannel
feedback within the preceding 800 milliseconds, and (3) after 700 milliseconds of wait. We
optimized the amount of randomness in our experiments.
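A minimal sketch of such a rule-based random generator is given below. It is only an illustration of the three conditions listed above: the time step, the emission probability and the exact interpretation of condition (3) are assumptions, not the tuned baseline used in the experiments.

# Random backchannel generator following the three listed conditions (times in seconds).
import random

def random_backchannels(speech_regions, duration, p=0.3, step=0.1, seed=0):
    """speech_regions: (start, end) intervals where the speaker is talking."""
    rng = random.Random(seed)
    preds, last_bc, eligible_since = [], -1e9, None

    def speech_in(t0, t1):
        return sum(max(0.0, min(e, t1) - max(s, t0)) for s, e in speech_regions)

    t = 0.0
    while t < duration:
        cond1 = speech_in(t - 0.7, t) >= 0.7 - 1e-9     # (1) 700 ms of speech before t
        cond2 = (t - last_bc) >= 0.8                    # (2) none in the last 800 ms
        if cond1 and cond2:
            if eligible_since is None:
                eligible_since = t
            if t - eligible_since >= 0.7 and rng.random() < p:   # (3) 700 ms of wait
                preds.append(round(t, 2))
                last_bc, eligible_since = t, None
        else:
            eligible_since = None
        t += step
    return preds

print(random_backchannels([(0.0, 5.0), (7.0, 12.0)], duration=14.0))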
Human Coder: Our last baseline is the human coders from the PCS dataset.
For each of the 9 coders, we use the labels provided by that coder and compare them with
the actual listener labels.
6.6.3 Methodology
We performed hold-out testing on a randomly selected subset of interactions. For the
PCS dataset, we use 10 interactions for testing, and the training set contains the remaining
33 interactions. In the case of the MultiLis dataset, we use 10 interactions for testing
and the remaining 20 interactions for training. In both datasets, model parameters were
automatically validated by using a 3-fold cross-validation strategy on the training set.
The regularization values used are 10^k for k = -1, 0, ..., 3. The numbers of hidden states
validated for the wisdom-LMDE models were 2, 3 and 4.

Table 6.4: Comparison of our Wisdom LMDE approach (on the PCS dataset) with multiple
baseline models. The same test set is used in all experiments.
Model                    Wisdom of Crowds   Precision   Recall   F1-Score
Wisdom LMDE              Yes                0.2473      0.7349   0.3701
Consensus Classifier     Yes                0.2217      0.3773   0.2793
CRF Mixture of Experts   Yes                0.2696      0.4407   0.3345
AL Classifier (CRF)      No                 0.2997      0.2819   0.2906
AL Classifier (LDCRF)    No                 0.1619      0.2996   0.2102
Multimodal LMDE          No                 0.2548      0.3752   0.3035
Random Classifier        No                 0.1277      0.2150   0.1570
Human Coder              Yes                0.1688      0.3164   0.2087
The performance is measured using the F-score, which is the weighted harmonic
mean of precision and recall. Precision is the probability that predicted backchannels
correspond to actual listener behavior. Recall is the probability that a backchannel
produced by a listener in our test set was predicted by the model. We use the same weight
for both precision and recall, the so-called F1 value. During validation, we find all the peaks
(i.e., local maxima) in the marginal probabilities. These backchannel hypotheses are
filtered using the optimal threshold from the validation set. A backchannel (i.e., head
nod) is predicted correctly if a peak happens during an actual listener backchannel with
high enough probability. The same evaluation measurement is applied to all models.
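The sketch below illustrates this evaluation procedure on toy arrays: peaks of the marginal probabilities are filtered by a threshold, precision is computed over the retained peaks, and recall over the ground-truth backchannel regions. It is a simplified stand-in for the actual evaluation scripts, under assumed array shapes.

# Peak-based precision / recall / F1 for frame-level backchannel predictions.
import numpy as np

def evaluate(marginals, truth, threshold):
    """marginals, truth: (n_frames,) arrays; truth is binary. Returns (P, R, F1)."""
    peaks = [j for j in range(1, len(marginals) - 1)
             if marginals[j] >= marginals[j - 1] and marginals[j] >= marginals[j + 1]
             and marginals[j] >= threshold]
    tp = sum(truth[j] == 1 for j in peaks)
    fp = len(peaks) - tp
    # a ground-truth backchannel counts as recalled if any peak lands inside it
    regions = np.flatnonzero(np.diff(np.concatenate(([0], truth, [0])))).reshape(-1, 2)
    recalled = sum(any(s <= j < e for j in peaks) for s, e in regions)
    precision = tp / max(1, tp + fp)
    recall = recalled / max(1, len(regions))
    f1 = 2 * precision * recall / max(1e-9, precision + recall)
    return precision, recall, f1

marg = np.array([0.1, 0.2, 0.6, 0.3, 0.1, 0.1, 0.7, 0.9, 0.4, 0.1])
truth = np.array([0, 0, 1, 1, 0, 0, 0, 1, 1, 0])
print(evaluate(marg, truth, threshold=0.5))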
We use the hCRF library (http://sourceforge.net/projects/hcrf/) for training all CRFs and
LDCRFs. Our latent mixture of discriminative experts model was implemented in Matlab based
on the hCRF library.
6.7 Results and Discussion
In this section, we compare the performances of different models with our approach and
provide an analysis of these results. Before presenting the prediction results, it is important
to remember that backchannel feedback is an optional phenomenon, where the actual
listener may or may not decide to give feedback (Ward & Tsukahara, 2000). Therefore,
results from prediction tasks are expected to have lower accuracies than recognition
tasks, where labels are directly observed.

Table 6.5: Comparison of our Wisdom LMDE approach (on the MultiLis dataset) with
multiple baseline models. The same test set is used in all experiments.
Model                    Wisdom of Crowds   Precision   Recall   F1-Score
Wisdom LMDE              Yes                0.2817      0.5187   0.3651
Consensus Classifier     Yes                0.2574      0.4000   0.3132
CRF Mixture of Experts   Yes                0.2928      0.3925   0.3354
AL Classifier (CRF)      No                 0.2685      0.3171   0.2908
AL Classifier (LDCRF)    No                 0.2904      0.3478   0.3165
Multimodal LMDE          No                 0.2233      0.3923   0.2846
Rule Based Classifier    No                 0.1221      0.5308   0.19852
We designed our experiments to test different characteristics of the wisdom-LMDE
approach. First, we present quantitative results that compare our wisdom-LMDE model
with the baseline models. Second, we present qualitative results for
model analysis and for feature analysis of selected experts.
6.7.1 Model Comparison
In our first set of experiments, we compare the performances of the different baseline
models (see Section 6.6.2) with our Wisdom LMDE approach. In Table 6.4, we list
the experimental results conducted on the PCS dataset. As we can see from the table,
our wisdom-LMDE model achieves the best F-1 score among all the baseline models.
The second best F-1 score is achieved by the CRF Mixture of Experts, which is the only
baseline model that combines the different listener labels in a late-fusion
manner. This result supports our claim that the wisdom of crowds improves the learning
of prediction models. Also, the optimal wisdom-LMDE model had 3 hidden states,
hinting that 3 prototypical patterns exist among our experts. This is coherent with
our analysis in Section 6.4.
To have a better understanding of what these numbers represent, we also computed
the F-1 scores of the Human Coders in the PCS dataset. The average values are listed in
the last row of Table 6.4. The average F-1 score among the 9 coders is 0.2087. The
maximum F-1 score we observed was 0.3016, which is still much lower than the value
achieved by wisdom-LMDE (0.3701).
The experimental results on the MultiLis dataset are listed in Table 6.5. Similar to
the results on the PCS dataset, our wisdom-LMDE model outperforms all of the
baseline models. The second best F-1 score is again achieved by the CRF Mixture of
Experts model. The main difference from the PCS results is that the actual listener (AL)
classifier with the LDCRF model achieves a much higher F-1 score than on the PCS dataset.
The precision in this case is much higher, whereas the recall
stays about the same in both datasets. Since training involves learning a high number of
parameters, LDCRF may overfit the training set when the number of input features is
high. Since we use additional features (POS and syntactic) in the PCS data, this might
have caused the LDCRF model to underperform on that task.
6.7.2 Model Analysis
To understand the advantage of using the latent variables in the wisdom-LMDE model,
we analyze the output probabilities of different models from our experiments on the PCS
dataset. Figure 6.4 shows the output probabilities for two of the listener experts as well
as the output probability from wisdom-LMDE. The advantage of the latent variables is
that they enable a different weighting of the experts at different points in time.
By analyzing the sequence on the left, we observe that there is a high likelihood of
backchannel feedback from the first expert between 8.1s and 9.5s. This expert is highly
weighted (by one of the hidden states) during this part of the sequence. Later,
around 54s, the first expert misses a feedback opportunity, but the second
expert predicts this backchannel feedback correctly. Wisdom-LMDE was able to
obtain a high likelihood for this ground truth region given the learned hidden dynamic
between the multiple experts.

Figure 6.4: Output probabilities from LMDE and individual listener experts for two
different sub-sequences. The gray areas in the graph correspond to ground truth
backchannel feedbacks of the listener.

Table 6.6: Top 5 features for 3 listener experts in the MultiLis dataset. Although the
top feature for the first two listeners is utterance, the last listener ranks continued gaze
as the most important feature for providing backchannel feedback. On the other hand,
the main focus of the first listener is on visual gestures (eye gaze, blinked gaze, blinked
continued gaze), whereas the second listener mainly focuses on the speaker's speech
(utterance, pause, SHoUT features).
Listener 1               Listener 2            Listener 3
utterance                utterance             continued gaze
pause                    pause                 utterance
eye gaze                 eye gaze              eye gaze
blinked gaze             SHoUT SEGMENT         pause
blinked continued gaze   SHoUT SEGMENT [s]     SHoUT [s]
6.7.3 Feature Analysis
In Section 6.4, we already analyzed the most predictive speaker features for each listener
expert in the PCS dataset. In our last set of experiments, we analyze the top features
of some listener experts in the MultiLis dataset. Since we have 20 listener experts in
the MultiLis dataset, we limit our analysis to only 3 experts. We selected these experts
by using a greedy expert selection method similar to the one in (Morency et al., 2009),
which was originally used for feature selection.
Our greedy expert selection method works as follows. We first train one wisdom-LMDE
model for each listener expert. In other words, we trained 20 wisdom-LMDE
models, where each wisdom-LMDE model uses one of the 20 listener CRF experts. Then,
we select the first expert by evaluating these models on the validation set and picking
the one with the best validation error. To select the second expert, we again
train wisdom-LMDE models for the remaining experts, but this time we also include the
CRF expert selected in the first round as an input. The method continues
in this same manner.
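The sketch below summarizes this greedy forward-selection loop. The function train_and_validate is a placeholder for training a wisdom-LMDE on a given expert subset and returning its validation error; here it is replaced by a toy scoring function.

# Greedy forward selection of listener experts by validation error.
def greedy_expert_selection(all_experts, train_and_validate, k=3):
    selected = []
    while len(selected) < k:
        best_expert, best_err = None, float("inf")
        for expert in all_experts:
            if expert in selected:
                continue
            err = train_and_validate(selected + [expert])   # model with one more expert
            if err < best_err:
                best_expert, best_err = expert, err
        selected.append(best_expert)
    return selected

# Toy stand-in: pretend experts 4, 11 and 7 lower the validation error the most.
def fake_validation_error(expert_subset):
    useful = {4: 0.30, 11: 0.25, 7: 0.20}
    return 1.0 - sum(useful.get(e, 0.01) for e in expert_subset)

print(greedy_expert_selection(list(range(20)), fake_validation_error, k=3))  # [4, 11, 7]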
The top 5 features selected for these 3 experts are given in Table 6.6. The first
feature for the first two listener experts is utterance, whereas it is continued gaze for
the third expert. Different from the other experts, the majority of the selected
features for the first expert are related to the visual modality (mainly eye gaze). The
second expert mainly focuses on the speech (utterance and pause) and
the words used in this speech (the two SHoUT SEGMENT features corresponding to words).
Finally, the third expert uses a combination of both speech and visual
gestures.
6.8 Conclusions
Based on our preliminary experiments, which showed that a consensus of multiple opinions
from more than one person on the same task improves the overall prediction performance, we
proposed a new approach for modeling the wisdom of crowds that takes into account the
individual differences among people's nonverbal behaviors. Our approach takes advantage of
both the Parasocial Consensus Sampling and Parallel Listener Consensus paradigms, which
provide labels from multiple observers/annotators interacting through the same
media.
In the first step of our wisdom-LMDE approach, we learned one CRF model per
annotator, each corresponding to an expert. We then fused the information from all
these experts using a latent variable model that is able to capture the differences and
commonalities among them.
We applied our method to the task of listener backchannel feedback prediction
during dyadic conversations, and showed improvements over different baseline models.
Both our earlier analysis and our experimental results show that there are groups
of people with similar behaviors, although there are differences between the groups.
Wisdom-LMDE is a generic model that can be applied to different scenarios (e.g., image
annotation, affect recognition in natural language, etc.).
Chapter 7
Conclusions and Future Works
7.1 Conclusions
Face-to-face communication is a highly interactive process where participants mutually
exchange and interpret linguistic and gestural signals. Even when only one person speaks
at a time, other participants exchange information continuously amongst themselves
and with the speaker through gesture, prosody, gaze, posture and facial expressions.
These signals play a significant role in determining the nature of a social exchange. For
instance, people show rapport and engagement using backchannel feedback (nods and
paraverbals such as "uh-huh" and "mm-hmm" that listeners produce as someone is speaking).
By correctly predicting backchannel feedback, virtual agents and robots can establish a
stronger sense of rapport.
In this thesis, we presented our framework for modeling human nonverbal behaviors,
which directly addresses the following four challenges: (1) high dimensionality, (2)
multimodal processing, (3) visual influence, and (4) variability in human behaviors. We
addressed the first challenge by presenting a sparse feature selection method that gives
researchers a new tool to analyze human nonverbal communication. To address the second
challenge of effective and efficient fusion of multimodal information, we presented
a model called Latent Mixture of Discriminative Experts (LMDE) that can automatically
learn the hidden dynamic between modalities. For the third challenge, we
presented a framework that models the visual influence of one participant's gestures on
the behaviors of the other participant when the visual modality is not directly observed.
Finally, we proposed to use our latent mixture model, LMDE, for modeling the wisdom of
crowds, taking into account the variability in human behaviors.
We presented an empirical evaluation on the task of listener backchannel feedback
prediction during dyadic conversations. Our experiments showed significant improvements
over previous approaches to listener backchannel prediction. The results suggested the
following:
Our sparse feature ranking scheme can find the relevant subset of features, with
performance similar to the baseline model that contains all features. We also
applied this feature ranking algorithm in our later experiments to better
understand the features that are the most relevant for different listeners to
provide a backchannel feedback. In most of these experiments, we found the
following speaker features to be very important: pause, eye gaze, low pitch, and noun.
These results confirm the importance of multimodal processing.
By fusing the multimodal data via a latent variable model in a later step during
training, we can reduce the effect of noise in each modality and therefore learn models
that generalize better to new multimodal data. During our experiments, we
ranked the top features of each multimodal channel for analysis purposes. It
was interesting to see certain features at the top for different modalities, because
these features had never appeared in our previous experiments, where we applied
early fusion of the features from all modalities before any feature
ranking. For instance, the she, um, and that features in the lexical channel, and the possessive
pronoun and verb features in the part-of-speech modality. We believe that these
features are suppressed by other channels during early fusion of features from all
modalities.
When there is no visual speaker information available, predicted speaker visual
context helps us learn an average speaker behavior that is more effectual and
less noisy than the actual speaker behaviors.
There are several groups of people that behave similarly. Although there are
commonalities among these different groups, there exist some variations among
them in their nonverbal behaviors. By modeling this wisdom of crowds, we are able
to learn the prototypical human behavioral patterns and the hidden dynamic among
different crowd members. In our analysis, we have seen three types of people
that pay attention to either (1) the speech and syntactic structure, (2) the prosody, or (3)
the visual gestures of the speaker when providing backchannel feedback.
Among all experiments, we achieved the best performance by modeling the wisdom
of crowds. We obtained an F-1 score of 0.3651, which, to our knowledge, is the best
performance reported so far among studies on backchannel feedback
prediction. This result also points out the importance of taking into account the
variability in human nonverbal behaviors.
7.2 Future Challenges
In this thesis, we presented a framework that addresses four main challenges of building
computational models of human nonverbal behaviors. We demonstrated the advantages
of our framework; however, our approach still has some limitations that require future
studies. Here, we list these limitations and future research directions:
Improved Algorithms
The optimization function of our LMDE model in Chapter 4 is non-convex. Therefore,
the solution obtained by training is not always the optimum. To overcome this issue,
we trained our LMDE with different initializations of the model parameters. However,
this costs a lot of training time. One possible solution to this problem
might be to initialize the latent variable model parameters in the second step of
our LMDE model using the CRF parameters learned in the
first step. In the future, we need to explore better initialization techniques for our
LMDE model.
Our feature selection approach in Chapter 3 looks at the feature weights in the
regularization path that become non-zero first. However, some of the weights
that appear to be non-zero initially may later become zero, or the weight of
a feature may drop significantly later in the regularization path. Our current
scheme does not consider these situations, and a more extensive analysis might
be needed to discover such irregular behaviors.
Gender and Cultural Differences
In Chapter 6, we presented a scheme for modeling the wisdom of crowds. This model
is capable of learning the dynamics among different crowd members. However, in our
current scheme, we do not take into account gender, age or cultural differences.
In the future, we want to extend our approach to model these differences
as well. For instance, we could first group people based on their demographic
information, and then learn one expert for each group before fusing them.
In all of our experiments, we performed held-out testing by randomly selecting a
subset of the data sequences for the test set and using the remaining sequences for the
training set. Our test and training sets come from the same dataset, i.e., the Rapport
Dataset, which is based on dyadic interactions in English. To have a better
understanding of cultural differences in backchannel feedback behavior, we plan to
learn our model on data collected within one culture (e.g., the Rapport Dataset) and
apply it to data collected within another culture (e.g., the MultiLis Dataset).
Deeper Predictive Features
Although our current method for extracting the speaker features requires that
the entire utterance be available for processing, it provides a first step
towards integrating information about syntactic structure in multimodal prediction
models. Many of these features could in principle be computed incrementally
with only a slight degradation in accuracy, with the exception of features that
require dependency links where a word's syntactic head is to the right of the word
itself. We leave an investigation of syntactic features that can be
produced incrementally in real time as future work.
Our prediction model in Chapter 5 models the visual influence between the listener
and the speaker in a dyadic conversation. This visual influence is often
bidirectional, and the listener's behaviors might affect the speaker's behaviors as
well. However, in our current implementation, we only exploit the influence of
speaker gestures on the listener behaviors. In the future, we need to extend our
current framework to take this mutual influence between the participants
into account.
In our context-based prediction model (Chapter 5), we only used the anticipated
speaker gestures, more specifically speaker head nods, to improve listener
backchannel predictions. However, there might be other contextual information
relevant to our task: for instance, other visual gestures of the speaker (e.g., smile,
eye gaze and head movements), the emotional states of both the speaker and the
listener, the demographic background of the participants, and the participants'
personal interest in the topic being discussed.
Virtual Human Integration
In our experiments, we validated our results by using the actual listener behaviors
as our ground truth labels. However, the main goal of this research is to create artificial
virtual listeners that are socially intelligent and can act like humans. Therefore, in
the future, we want to integrate our current results with a real virtual agent that
can produce backchannel feedback during a conversation with a human participant.
This way, we can get judgments from participants on how human-like the
agent behaves.
Going Beyond Backchannel
In our experiments, we applied our framework to the task of listener backchannel
feedback prediction. It was our first step towards building socially intelligent
virtual agents. In the future, we want to extend our work by applying it to other
nonverbal behaviors, such as smile, eye gaze and even hand/body movements.
Similarly, we want to apply our current framework to various social situations:
to recognize/predict the emotions of a person, to predict the right time for turn
taking, and to detect the dominant person in a multi-party conversation.
In our current framework, we learn one model for predicting one single nonverbal
gesture (backchannel). If we want to predict several gestures, as mentioned
above, this would be achieved by learning one separate model for each nonverbal
gesture to be predicted. However, these social signals are often coordinated, and
knowing the label for one gesture might be useful for predicting another. Therefore,
in the future, we want to be able to learn these signals all together in a single
model, i.e., through multi-label classification, where each data instance can be
assigned more than one class label (Aly, 2005). By learning multiple
gesture labels all together, we can exploit the synchrony/asynchrony among these
labels.
Bibliography
Semaine the sensitive agent project.
Aly, Mohamed (2005). Survey on multiclass classication methods.
Ambady, N., & Rosenthal, R. (1998). Nonverbal communication. H. Friedman (Ed.)
Encyclopedia of Mental Health (pp. 775{782). Academic Press.
Andrew, Galen, & Gao, Jianfeng (2007). Scalable training of l1-regularized log-linear
models. International Conference on Machine Learning (ICML). ACM.
Atrey, Pradeep K., Hossain, M. Anwar, El Saddik, Abdulmotaleb, & Kankanhalli, Mo-
han S. (2010). Multimodal fusion for multimedia analysis: A survey. Multimedia
Systems, 16, 345{379.
Babad, Elisha (2007). Teachers' nonverbal behavior and its effects on students. In R. P.
Perry and J. C. Smart (Eds.), The scholarship of teaching and learning in higher
education: An evidence-based perspective, 201-261. Springer Netherlands.
Barnard, Mark, & Odobez, Jean-Marc (2005). Sports event recognition using layered
hmms. IEEE International Conference on Multimedia and Expo (ICME) (pp. 1150{
1153).
Bavelas, J.B., Coates, L., & Johnson, T. (2000). Listeners as co-narrators. Journal of
Personality and Social Psychology (JPSP), 79, 941{952.
Bishop, C., & Svensén, M. (2003). Bayesian hierarchical mixtures of experts. Conference
on Uncertainty in Artificial Intelligence (UAI).
Bjorn Schuller, Michel Valstar, Florian Eyben, Roddy Cowie, & Maja Pantic (October
2012). AVEC 2012 - the continuous audio/visual emotion challenge. To appear in
Proc. of Second International Audio/Visual Emotion Challenge and Workshop (AVEC
2012), Grand Challenge and Satellite of ACM ICMI 2012. Santa Monica: ACM.
Blum, Avrim L., & Langley, Pat (1997). Selection of relevant features and examples in
machine learning. Artificial Intelligence, 97, 245-271.
Burgoon, Judee K., Stern, Lesa A., & Dillman, Leesa (1995). Interpersonal adaptation:
Dyadic interaction patterns. Cambridge: Cambridge University Press.
Burns, M. (1984). Rapport and relationships: The basis of child care. Journal of Child
Care, 2, 47{57.
Callison-Burch, Chris, & Dredze, Mark (2010). Creating speech and language data with
Amazon's Mechanical Turk. Language, 1-12.
Cassell, J., & Stone, M. (1999). Living hand to mouth: Psychological theories about
speech and gesture in interactive dialogue systems. Conference on Artificial Intelligence
(AAAI).
Cathcart, Nicola, Carletta, Jean, & Klein, Ewan (2003a). A shallow model of backchan-
nel continuers in spoken dialogue. European ACL, 51{58.
Cathcart, N., Carletta, Jean, & Klein, Ewan (2003b). A shallow model of backchannel
continuers in spoken dialogue. European chapter of the Association for Computational
Linguistics (EACL) (pp. 51{58).
Darwin, C. (1872). The expression of the emotions in man and animals.
de Kok, Iwan, Ozkan, Derya, Heylen, Dirk, & Morency, Louis-Philippe (2010). Learning
and evaluating response prediction models using parallel listener consensus. International
Conference on Multimodal Interfaces (ICMI).
de Kok, I. A., & Heylen, D. K. J. (2011). The MultiLis corpus - dealing with individual
differences in nonverbal listening behavior. Third COST 2102 International Training
School, Caserta, Italy (pp. 362-375). Berlin: Springer Verlag.
Dredze, Mark, Talukdar, Partha Pratim, & Crammer, Koby (2009). Sequence learning
from data with multiple labels.
Drolet, Aimee L., & Morris, Michael W. (2000). Rapport in conflict resolution: Accounting
for how face-to-face contact fosters mutual cooperation in mixed-motive conflicts.
Journal of Experimental Social Psychology, 36, 26-50.
Duggan, Ashley P., & Bradshaw, Ylisabyth S. (2008). Mutual influence processes in
physician-patient communication: An interaction adaptation perspective. Communication
Research Reports, 25:3, 211-225.
Eisenstein, J., Barzilay, R., & Davis, R. (2008). Gestural cohesion for topic segmentation.
Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies (NAACL-HLT) (pp. 852-860).
Eyben, Florian, Wöllmer, Martin, & Schuller, Björn (2009). openEAR - Introducing the
Munich Open-Source Emotion and Affect Recognition Toolkit. Affective Computing
and Intelligent Interaction (ACII) (pp. 576-581).
Foo, S.W., Lian, Y., & Dong, L. (2004). Recognition of visual speech elements us-
ing adaptively boosted hidden markov models. IEEE Transactions on Circuits and
Systems for Video Technology (pp. 693{705).
Fox, N.A., & Reilly, R.B. (2003). Audio-visual speaker identification based on the use
of dynamic audio and visual features. International Conference on Audio and Video-based
Biometric Person Authentication (IAPR) (pp. 743-751).
Frampton, M., Huang, J., Bui, T., & Peters, S. (2009). Real-time decision detection
in multi-party dialogue. Conference on Empirical Methods in Natural Language Pro-
cessing (EMNLP) (pp. 1133{1141).
Fuchs, D. (1987). Examiner familiarity effects on test performance: implications for
training and practice. Topics in Early Childhood Special Education, 7, 90-104.
Fujie, Shinya, Ejiri, Yasuhi, Nakajima, Kei, Matsusaka, Yosuke, & Kobayashi, Tet-
sunori (2004). A conversation robot using head gesture recognition as para-linguistic
information. IEEE International Symposium on Robot and Human Interactive Com-
munication (RO-MAN) (pp. 159{164).
Gallaher, P. E. (1992). Individual differences in nonverbal behavior: Dimensions of
style. Journal of Personality and Social Psychology, 63, 133-145.
Garg, A., Pavlovic, V., & Rehg, J.M. (2003). Boosted learning in dynamic bayesian
networks for multimodal speaker detection. Proceedings of the IEEE, 91, 1355{1369.
Ghahramani, Zoubin, Jordan, Michael I., & Smyth, Padhraic (1997). Factorial hidden
markov models. Machine Learning. MIT Press.
Goldberg, S.B. (2005). The secrets of successful mediators. Negotiation Journal, 21,
365{376.
Gratch, J., Okhmatovskaia, A., Lamothe, F., Marsella, S., Morales, M., Werf, R.J., &
Morency, L.-P. (2006). Virtual rapport. Proceedings of International Conference on
Intelligent Virtual Agents (IVA), Marina del Rey, CA.
Gratch, Jonathan, Wang, Ning, Gerten, Jillian, & Fast, Edward (2007). Creating rap-
port with virtual agents. Intelligent Virtual Agents (IVA).
Gravano, Agustin (2009). Turn-taking and affirmative cue words in task-oriented dialogue
(Technical Report).
Gravano, A., Benus, S., Chavez, H., Hirschberg, J., & Wilcox, L. (2007). On the role of
context and prosody in the interpretation of 'okay'. Association for Computational
Linguistics (ACL) (pp. 800-807).
Gueguen, N., Jacob, C., & Martin, A. (2009). Mimicry in social interaction: Its effect
on human judgment and behavior. European Journal of Social Sciences, 8.
Hall, Judith A. (1978). Gender effects in decoding nonverbal cues. Psychological
Bulletin.
Hatfield, E., Cacioppo, J., & Rapson, R. (1992). Emotional contagion. Clark MS (ed)
Review of personality and social psychology: emotion and social behavior, 151-171.
hCRF (2007). hcrf library. http://sourceforge.net/projects/hcrf/.
Heylen, D., & op den Akker, R. (2007). Computing backchannel distributions in multi-
party conversations. Association for Computational Linguistics (ACL) Workshop on
Embodied NLP (ACL:EmbodiedNLP) (pp. 17{24).
Hsueh, P.-Y., & Moore, J. (2007). What decisions have you made: Automatic decision
detection in conversational speech. Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies (NAACL-
HLT) (pp. 25{32).
Huang, L., Morency, L.-P., & Gratch, J. (2010). Parasocial consensus sampling: combining
multiple perspectives to learn virtual human behavior. International Conference
on Autonomous Agents and Multiagent Systems (AAMAS).
Huijbregts, Marijn (2008). Segmentation, Diarization and Speech Transcription: Surprise
Data Unraveled. PhD thesis, University of Twente.
Igor, Szoke, Petr, Schwarz, Pavel, Matejka, Lukas, Burget, Michal, Fapso, Martin,
Karafiat, & Jan, Cernocky (2005). Comparison of keyword spotting approaches for
informal continuous speech. MLMI.
Johnston, M. (1998). Multimodal language processing. International Conference on
Spoken Language Processing (ICSLP).
Jordan, Michael I. (1994). Hierarchical mixtures of experts and the em algorithm.
Neural Computation, 6, 181{214.
Jovanovic, N., op den Akker, R., & Nijholt, A. (2006). Addressee identification in face-to-face
meetings. European Chapter of the Association for Computational Linguistics
(EACL).
Jurafsky, D., Shriberg, E., Fox, B., & Curl, T. (1998). Lexical, prosodic and syntactic
cues for dialog acts. Workshop on Discourse Relations (pp. 114-120).
Kang, Sin-Hwa, Gratch, Jonathan, Wang, Ning, & Watt, James (2008). Does the contingency
of agents' nonverbal feedback affect users' social anxiety? International
Conference on Autonomous Agents and Multiagent Systems (AAMAS). Estoril, Portugal.
Kendon, A. (2004). Gesture: Visible action as utterance. Cambridge University Press.
Knapp, Mark L., & Hall, Judith A. (2010). Nonverbal communication in human inter-
action. Boston: Lyn Uhl.
Kumar, S., & Hebert, M. (2003). Discriminative random fields: A framework for
contextual interaction in classification. International Conference on Computer Vision
(ICCV).
Lafferty, J., McCallum, A., & Pereira, F. (2001a). Conditional random fields: Probabilistic
models for segmenting and labeling sequence data. Proceedings of International
Conference on Machine Learning (pp. 282-289). Citeseer.
Lafferty, J., McCallum, A., & Pereira, F. (2001b). Conditional random fields: probabilistic
models for segmenting and labelling sequence data. International Conference
on Machine Learning (ICML).
Linda L. Carli, Suzanne J. LaFleur, & Christopher C. Loeber (1995). Nonverbal behavior,
gender, and influence. Journal of Personality and Social Psychology, 68, 1030-1041.
Luo, Pengcheng, Ng-Thow-Hing, Victor, & Neff, Michael (2013). An examination of
whether people prefer agents whose gestures mimic their own. IVA (pp. 229-238).
Maatman, M., Gratch, J., & Marsella, S. (2005). Natural behavior of a listening agent.
Intelligent Virtual Agents (IVA). Kos, Greece.
Marcus, Mitchell, Kim, Grace, Marcinkiewicz, Mary Ann, MacIntyre, Robert, Bies,
Ann, Ferguson, Mark, Katz, Karen, & Schasberger, Britta (1994). The penn tree-
bank: annotating predicate argument structure. Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Tech-
nologies (NAACL-HLT) (pp. 114{119).
Matsumoto, D. (2006). Culture and nonverbal behavior. The Sage Handbook of Non-
verbal Communication, Sage Publications Inc.
McCallum, Andrew (2003). Efficiently inducing features of conditional random fields.
Conference on Uncertainty in Artificial Intelligence.
McCowan, Iain A., Gatica-Perez, Daniel, Bengio, Samy, Lathoud, Guillaume, Barnard,
Mark, & Zhang, Dong (2005). Automatic analysis of multimodal group actions in
meetings. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI).
McNeill, D. (1992). Hand and mind: What gestures reveal about thought. Univ. Chicago
Press.
Mitra, S., & Acharya, T. (2007). Gesture recognition: A survey. Systems, Man, and
Cybernetics, Part C: Applications and Reviews, IEEE Transactions on, 37, 311 {324.
Mohammadi, Gelareh, Vinciarelli, Alessandro, & Mortillaro, Marcello (2010). The voice
of personality: Mapping nonverbal vocal behavior into trait attributions. Proceedings
of ACM Multimedia Workshop on Social Signal Processing.
Morency, L.-P., de Kok, I., & Gratch, J. (2008a). Context-based recognition during hu-
man interactions: Automatic feature selection and encoding dictionary. International
Conference on Multimodal interfaces (ICMI 2008).
Morency, L.-P., de Kok, I., & Gratch, J. (2008b). Predicting listener backchannels: A
probabilistic multimodal approach. Conference on Intelligent Virtual Agents (IVA).
Morency, Louis-Philippe, de Kok, Iwan, & Gratch, Jonathan (2009). A probabilistic
multimodal approach for predicting listener backchannels. Journal of Autonomous
Agents and Multi-Agent Systems, 20, 70-84.
Morency, Louis-Philippe, Quattoni, Ariadna, & Darrell, Trevor (2007). Latent-dynamic
discriminative models for continuous gesture recognition. IEEE Conference on Computer
Vision and Pattern Recognition (CVPR).
Murray, G., & Carenini, G. (2009). Predicting subjectivity in multimodal conversations.
Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp.
1348{1357).
Nakano, Reinstein, Stocky, & Cassell, Justine (2003). Towards a model of face-to-face
grounding. Association for Computational Linguistics (ACL).
Nakano, Y., Murata, K., Enomoto, M., Arimoto, Y., Asa, Y., & Sagawa, H. (2007).
Predicting evidence of understanding by monitoring user's task manipulation in mul-
timodal conversations. Association for Computational Linguistics (ACL) (pp. 121{
124).
Neiberg, Daniel (2012). Modelling paralinguistic conversational interaction : Towards
social awareness in spoken human-machine dialogue. Doctoral dissertation, KTH,
Speech Communication and Technology. QC 20120914.
Ng, Andrew Y. (2004). Feature selection, l-1 vs. l-2 regularization, and rotational
invariance. International Conference on Machine Learning (ICML).
Nishimura, Ryota, Kitaoka, Norihide, & Nakagawa, Seiichi (2007). A spoken dialog
system for chat-like conversations considering response timing. LNCS, 4629, 599{606.
Nocedal, Jorge, & Wright, Stephen J. (2006). Numerical optimizations. Springer Series
in Operations Research.
Oatley, Keith, Keltner, Dacher, & Jenkins, Jennifer M. (2006). Understanding emotions.
Wiley-Blackwell.
Oliver, Nuria, Garg, Ashutosh, & Horvitz, Eric (2004). Layered representations for
learning and inferring office activity from multiple sensory channels. Computer Vision
and Image Understanding, 96, 163-180.
Oviatt, S. (1999). Ten myths of multimodal interaction. Communications of the ACM.
Ozkan, D., & Morency, L.-P. (2010). Concensus of self-features for nonverbal behavior
analysis. Human Behavior Understanding, in conjunction with International Conference
on Pattern Recognition.
Ozkan, D., & Morency, L.-P. (2011). Modeling wisdom of crowds using latent mixture of
discriminative experts. Association for Computational Linguistics: Human Language
Technologies (ACL).
Ozkan, Derya, & Morency, Louis-Philippe (2013). Prediction of visual backchannels in
the absence of visual context using mutual influence. IVA (pp. 189-202).
Ozkan, D., Sagae, K., & Morency, L.-P. (2010). Latent mixture of discriminative experts
for multimodal prediction modeling. International Conference on Computational Lin-
guistics (COLING).
Ozkan, Derya, Scherer, Stefan, & Morency, Louis-Philippe (2012). Step-wise emotion
recognition using concatenated-hmm. ICMI (pp. 477{484).
Pantic, Maja, Pentland, Alex, Nijholt, Anton, & Huang, Thomas (2006). Human com-
puting and machine understanding of human behavior: A survey. ACM International
Conferance on Multimodal Interfaces (pp. 239{248).
Pavlovic, V. (1998). Multimodal tracking and classification of audio-visual features.
IEEE International Conference on Image Processing (ICIP) (pp. 343-347).
Peng, J., Bo, L., & Xu, J. (2009). Conditional neural fields. Advances in Neural
Information Processing Systems (pp. 1419-1427).
Perkins, Simon, Lacker, Kevin, Theiler, James, Guyon, Isabelle, & Elisseeff, André
(2003). Grafting: Fast, incremental feature selection by gradient descent in function
space. Journal of Machine Learning Research, 3, 1333-1356.
Quek, F. (2003). The catchment feature model for multimodal language analysis. In-
ternational Conference on Computer Vision (ICCV).
Quinn, Alexander J., & Bederson, Benjamin B. (2011). Human computation: a survey
and taxonomy of a growing field. Proceedings of the 2011 annual conference on Human
factors in computing systems (pp. 1403-1412). ACM.
Ramseyer, Fabian (2011). Nonverbal synchrony in psychotherapy: embodiment at the
level of the dyad. USA: Philosophy Documentation Center.
Rauhut, Heiko, & Lorenz, Jan (2011). The wisdom of crowds in one mind: How individ-
uals can simulate the knowledge of diverse societies to reach better decisions. Journal
of Mathematical Psychology, 55, 191 { 197.
Raykar, Vikas C., Yu, Shipeng, Zhao, Linda H., Valadez, Gerardo Hermosillo, Florin,
Charles, Bogoni, Luca, Moy, Linda, & Blei, David (2010). Learning from crowds.
Riek, Laurel D., Paul, Philip C., & Robinson, Peter (2010). When my robot smiles at
me: Enabling human-robot rapport via real-time head gesture mimicry. Journal on
Multimodal User Interfaces, 3, 99{108.
Riezler, Stefan, & Vasserman, Alexander (2004). Incremental feature selection and
l1 regularization for relaxed maximum-entropy modeling. Conference on Empirical
Methods on Natural Language Processing (EMNLP).
Ross, Marina Davila, Menzler, Susanne, & Zimmermann, Elke (2008). Rapid facial
mimicry in orangutan play. Biol Lett, 4, 27{30.
Saeys, Yvan, Inza, Iñaki, & Larrañaga, Pedro (2007). A review of feature selection
techniques in bioinformatics. Bioinformatics (Oxford, England), 23, 2507-2517.
Sagae, Kenji, & Tsujii, Jun'ichi (2007a). Dependency parsing and domain adaptation
with LR models and parser ensembles. Proceedings of the CoNLL Shared Task Session
of EMNLP-CoNLL 2007 (pp. 1044{1050). Prague, Czech Republic: Association for
Computational Linguistics.
Sagae, Kenji, & Tsujii, Jun'ichi (2007b). Dependency parsing and domain adaptation
with LR models and parser ensembles. Association for Computational Linguistics
(ACL) (pp. 1044{1050).
Schmid Mast, Marianne (2007). On the importance of nonverbal communication in
the physician-patient interaction. Patient Education and Counseling, 67, 315-318.
Schneider, Jean. Amazon Mechanical Turk. http://www.mturk.com/.
Sebe, Nicu, & Cohen, Ira (2005). Multimodal approaches for emotion
recognition: A survey.
Sminchisescu, Cristian, Kanaujia, Atul, Li, Zhiguo, & Metaxas, Dimitris (2005). Dis-
criminative density propagation for 3d human motion estimation. IEEE Conference
on Computer Vision and Pattern Recognition (CVPR) (pp. 390{397).
Smith, A., Cohn, T., & Osborne, M. (2005). Logarithmic opinion pools for conditional
random fields. Association for Computational Linguistics (ACL) (pp. 18-25).
Smith, Andrew, & Osborne, Miles (2005). Regularisation techniques for conditional
random fields: Parameterised versus parameter-free. International Joint Conference
on Natural Language Processing (NLP).
Snoek, Cees G M, Worring, Marcel, & Smeulders, Arnold W M (2005). Early versus late
fusion in semantic video analysis. Proceedings of the 13th annual ACM international
conference on Multimedia MULTIMEDIA 05, 399.
Snow, Rion, Jurafsky, Daniel, & Ng, Andrew Y. (2008). Cheap and fast - but is it good?
Evaluating non-expert annotations for natural language tasks.
Solomon, Miriam (2006). Groupthink versus the wisdom of crowds : The social episte-
mology of deliberation and dissent. The Southern Journal of Philosophy, 44, 28{42.
Sun, Xiaofan, Nijholt, Anton, & Pantic, Maja (2012). Towards mimicry recognition
during human interactions: Automatic feature selection and representation. Intelli-
gent Technologies for Interactive Entertainment (pp. 160{169). Heidelberg, Germany:
Springer Verlag.
Surowiecki, James (2004). The wisdom of crowds: Why the many are smarter than
the few and how collective wisdom shapes business, economies, societies and nations.
Doubleday.
Terry, L.H., Shiell, D.J., & Katsaggelos, A.K. (2008). Feature space video stream con-
sistency estimation for dynamic stream weighting in audio-visual speech recognition.
IEEE International Conference on Image Processing (ICIP) (pp. 1316{1319).
Tibshirani, Robert (1994). Regression shrinkage and selection via the lasso. Journal of
the Royal Statistical Society, Series B. 58, 267{288.
Tsui, P., & Schultz, G.L. (1985). Failure of rapport: Why psychotheraputic engagement
fails in the treatment of asian clients. American Journal of Orthopsychiatry, 55, 561{
569.
Vail, Douglas L. (2007). Feature selection in conditional random elds for activity
recognition. IEEE/RSJ International Conference on Intelligent Robots and Systems
(IROS).
Valenzeno, Laura, Alibali, Martha W, & Klatzky, Roberta (2003). Teachers gestures
facilitate students learning: A lesson in symmetry. Contemporary Educational Psy-
chology, 28, 187 { 204.
Ward, N., & Tsukahara, W. (2000). Prosodic features which cue back-channel responses
in english and japanese. Journal of Pragmatics, 23, 1177{1207.
Welinder, Peter, Branson, Steve, Belongie, Serge, & Perona, Pietro (2010a). The mul-
tidimensional wisdom of crowds. Neural Information Processing Systems Conference
(NIPS).
Welinder, Peter, Branson, Steve, Belongie, Serge, & Perona, Pietro (2010b). The mul-
tidimensional wisdom of crowds. Neural Information Processing Systems Conference
(NIPS).
117
Whitla, Paul (2009). Crowdsourcing and its application in marketing activities. Con-
temporary Management Research, 5, 15{28.
Zhang, Dong, Gatica-perez, Daniel, Bengio, Samy, Mccowan, Iain, & Lathoud, Guil-
laume (2004). Modeling individual and group actions in meetings: a two-layer hmm
framework. IEEE Conf. on Computer Vision and Pattern Recognition, Workshop on
Event Mining in Video (CVPREVENT).
118
Abstract
Human nonverbal communication is a highly interactive process in which the participants dynamically send and respond to nonverbal signals. These signals play a significant role in determining the nature of a social exchange. Although humans can naturally recognize, interpret and produce these nonverbal signals in social contexts, computers are not equipped with such abilities. Therefore, creating computational models for holding fluid interactions with human participants has become an important topic for many research fields, including human-computer interaction, robotics, artificial intelligence, and cognitive sciences. Central to the problem of modeling social behaviors is the challenge of understanding the dynamics involved with listener backchannel feedback (i.e., the nods and paraverbals such as "uh-huh" and "mm-hmm" that listeners produce as someone is speaking). In this thesis, I present a framework for modeling the visual backchannels of a listener during a dyadic conversation. I address four major challenges involved in modeling nonverbal human behaviors, more specifically listener backchannels:
(1) High Dimensionality: Human communication is a complicated phenomenon that involves many behaviors (i.e., dimensions) such as smiles, nods, hand gestures, and voice pitch. A better understanding and analysis of social behaviors can be obtained by discovering the subset of features relevant to a specific social signal (e.g., backchannel feedback). In this thesis, I present a new feature ranking scheme which exploits the sparsity of probabilistic models when trained on human behavior problems. This technique gives researchers a new tool to analyze individual differences in social nonverbal communication. Furthermore, I present a feature selection approach which first looks at the important behaviors for each individual, called self-features, before building a consensus.
(2) Multimodal Processing: This high-dimensional data comes from different communicative channels (modalities) that contain complementary information essential to the interpretation and understanding of human behaviors. Effective and efficient fusion of these modalities is therefore a challenging task; if integrated carefully, the different modalities can provide complementary information that improves model performance. In this thesis, I introduce a new model called Latent Mixture of Discriminative Experts (LMDE), which can automatically learn the temporal relationship between different modalities. Since separate experts are trained for each modality, LMDE is capable of improving prediction performance even with a limited amount of data.
(3) Visual Influence: Human communication is dynamic in the sense that people affect each other's nonverbal behaviors (e.g., gesture mirroring). Therefore, when predicting the nonverbal behaviors of a person of interest, the visual gestures of the second interlocutor should also be taken into account. In this thesis, I propose a context-based prediction framework that models the visual influence of an interlocutor in a dyadic conversation, even if the visual modality from the second interlocutor is absent.
(4) Variability in Human's Behaviors: It is known that age, gender and culture affect people's social behaviors, so there are differences in the way people display and interpret nonverbal behaviors. A good model of human nonverbal behaviors should take these differences into account. Furthermore, gathering labeled data sets is time-consuming and often expensive in many real-life scenarios. In this thesis, I use the "wisdom of crowds", which enables parallel acquisition of opinions from multiple annotators/labelers. I propose a new approach for modeling the wisdom of crowds, called wisdom-LMDE, which is able to learn the variations and commonalities among different crowd members (i.e., labelers).
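To make the feature-ranking idea in challenge (1) concrete, the following is a minimal illustrative sketch, not the exact method developed in the thesis: it stands in an L1-regularized logistic regression for the L1-regularized conditional random field, and ranks features by the order in which their weights first become nonzero as the sparsity penalty is relaxed. The data X, labels y, and the list of regularization strengths are hypothetical placeholders.

# Illustrative sketch only: a simple L1-regularized classifier stands in for
# the sparse CRF; features are ranked by when they first receive nonzero weight.
import numpy as np
from sklearn.linear_model import LogisticRegression

def rank_features_by_sparsity(X, y, reg_strengths):
    """Rank features by the step at which their weight first becomes nonzero
    as the L1 penalty is relaxed (in scikit-learn, smaller C = stronger penalty)."""
    entry_order = {}
    for step, c in enumerate(sorted(reg_strengths)):  # strongest penalty first
        model = LogisticRegression(penalty="l1", C=c, solver="liblinear")
        model.fit(X, y)
        for f in np.flatnonzero(np.abs(model.coef_[0]) > 1e-8):
            entry_order.setdefault(f, step)  # record first step the feature appears
    n_features = X.shape[1]
    # Features that never enter the model are ranked last.
    return sorted(range(n_features),
                  key=lambda f: entry_order.get(f, len(reg_strengths)))

# Hypothetical usage: 200 frames, 10 multimodal speaker features,
# binary listener-backchannel labels.
X = np.random.randn(200, 10)
y = (np.random.rand(200) > 0.7).astype(int)
print(rank_features_by_sparsity(X, y, reg_strengths=[0.01, 0.1, 1.0, 10.0]))

In this simplified setting, the features that enter the model while the penalty is still strong are the ones treated as most predictive of backchannel feedback.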