COMPUTATIONAL MODELS FOR MULTIDIMENSIONAL
ANNOTATIONS OF AFFECT
by
Anil Ramakrishna
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Science)
December 2019
Copyright 2019 Anil Ramakrishna
To all the people who inspire me to keep learning;
past, present, and future.
Acknowledgments
I have been fortunate to have met some of the brightest minds of my generation. For this, I am indebted to Dr. Shrikanth Narayanan, who gave me the opportunity, mentorship and freedom to learn and grow as a researcher as well as an individual.
I would like to thank Dr. Aiichiro Nakano, Dr. Morteza Dehghani, Dr. Panayiotis Georgiou and Dr. Jonathan Gratch for serving on my dissertation and qualifier committees. All the suggestions I received from you were critical to the formation of this dissertation.
I would also like to thank my colleagues from the Signal Analysis and
Interpretation Lab (SAIL) at USC for being an important part of my journey.
The countless discussions and debates I have had with so many of you were
instrumental for my work.
Finally, I would like to thank my parents, siblings and friends for standing
by me through the good and bad times. This would not have been possible
without your support.
Contents

Acknowledgments

List of Figures

List of Tables

Abstract

Chapter 1: Introduction
1.1 Contributions
1.2 Organization

Chapter 2: Multidimensional Annotation Fusion: Preliminaries
2.1 Introduction
2.2 Keywords
2.3 Related work
2.4 Motivation

Chapter 3: Additive Gaussian noise model
3.1 Introduction
3.2 Model
3.3 Data
3.4 Experiments
    3.4.1 Baseline: Individual annotator modeling
    3.4.2 Joint annotator - Independent rating (Joint-Ind) modeling
    3.4.3 Joint annotator - Joint rating (Joint-Joint) modeling
    3.4.4 Joint annotator - Conditional rating (Joint-Cond) modeling
3.5 Results
    3.5.1 Setting 1: Training on data from all annotators
    3.5.2 Setting 2: Training on annotators with more than a threshold count of ratings
3.6 Conclusion

Chapter 4: Matrix factorization model
4.1 Introduction
4.2 Model
    4.2.1 Setup
    4.2.2 Global annotation model
    4.2.3 Time series annotation model
4.3 Experiments and Results
    4.3.1 Global annotation model
    4.3.2 Time series annotation model
    4.3.3 Effect of dependency among dimensions
4.4 Conclusion

Chapter 5: Estimation of psycholinguistic norms for sentences
5.1 Introduction
5.2 Model
5.3 Data
5.4 Experiments
5.5 Results
5.6 Conclusion

Chapter 6: Conclusions and Future Work
6.1 Multidimensional annotation agreement

Bibliography

Appendix A: Derivations for the matrix factorization model
A.1 EM update equations for global annotation model
    A.1.1 Components of the joint distribution $p(a^m_1, \ldots, a^m_K, a^m)$
    A.1.2 Model formulation
A.2 EM update equations for time series annotation model
    A.2.1 Model formulation
List of Figures

2.1 Plate notation for a basic annotation model. $a^{m,d}$ is the latent ground truth for the given data point (for the $d$th question) and $a^{m,d}_k$ is the rating provided by the $k$th annotator.

2.2 Annotation model proposed by [1] with a jointly learned predictor. $x^m$ is the set of features for the $m$th data point; $a^{m,d}$ is the $d$th dimension of the latent ground truth, which is modeled as a function of $x^m$; $a^{m,d}_k$ is the rating provided by the $k$th annotator.

2.3 Correlation heatmaps for annotations from a representative sample of emotion annotated datasets; v - valence, a - arousal, d - dominance, p - power.

3.1 Graphical model representation for the proposed model. $x^m$ is the set of features for the $m$th instance, $a^m$ is the latent ground truth and $a^m_k$ is the rating provided by the $k$th annotator for that instance. $x^m$ and $a^m_k$ are observed variables; $a^m$ is latent.

3.2 MSE $E_d$ for the four (baseline, Joint-Ind, Joint-Joint and Joint-Cond) modeling schemes as annotators with less than a threshold count of ratings are dropped. The Y-axis represents $E_d$ and the X-axis represents the minimum number of annotations (cutoff threshold).

4.1 Proposed model. $x^m$ is the set of features for the $m$th data point, $a^{m,d}$ is the latent ground truth for the $d$th dimension and $a^{m,d}_k$ is the rating provided by the $k$th annotator. Vectors $x^m$ and $a^m_k$ (shaded) are observed variables, while $a^m$ is latent. $A^m$ is the set of annotator ratings for the $m$th instance.

4.2 Performance of global annotation model on synthetic dataset; *-statistically significant.

4.3 Performance of global annotation model on artificial dataset; Sat-Saturation, Bri-Brightness; *-statistically significant.

4.4 Performance of global annotation model on the text emotions dataset; *-statistically significant.

4.5 Concordance and Pearson correlation coefficients between ground truth and model predictions for the time series annotation model; *-statistically significant.

4.6 Effect of varying dependency between annotation dimensions for the synthetic model.

4.7 Average $F_k$ plots estimated from the joint model at different step sizes for off-diagonal elements of the annotators' $F_k$ matrices.

5.1 Joint multidimensional annotation fusion model from Section 4.2.3. $F_k$ is estimated from word level annotations of psycholinguistic norms, which is used in predicting norms at the sentence level.

5.2 Performance of proposed and baseline models in predicting sentence level norms.

5.3 Performance of best annotator in our dataset and annotator average.

5.4 Training error on labels predicted; Pred: predictions from proposed model, Word-avg: average of word level norm scores; Ann-avg: annotator average.

6.1 Comparison of agreement values for various distance measures.
List of Tables

3.1 Acoustic prosodic signals and their statistical functionals used as features $x^m$ in this study.

3.2 MSE $E_d$ for annotator label prediction on the four rating dimensions; Ex: Expressiveness, Na: Naturalness, Go: Goodness of Pronunciation, En: Engagement.
Abstract

Affect is an integral aspect of human psychology which regulates all our interactions with external stimuli. It is highly subjective, with different stimuli leading to different affective responses in people due to varying personal and cultural artifacts. Computational modeling of affect is an important problem in Artificial Intelligence, which often involves supervised training of models using a large number of labeled data points. However, training labels are difficult to obtain due to the inherent subjectivity of affective dimensions. The most common approach to obtain the training labels is to collect subjective opinions or ratings from expert or naive annotators, followed by a suitable aggregation of the ratings.

In this dissertation, we will present our contributions towards building computational models for aggregating the subjective ratings of affect, specifically in the multidimensional setting. We propose latent variable models to capture annotator behaviors using additive Gaussian noise and matrix factorization models, which show improved performance in estimating the dimensions of interest. We then apply our matrix factorization model to the task of sentence level estimation of psycholinguistic normatives. Finally, we set up future work in estimating agreement on multidimensional annotations.
Chapter 1
Introduction
Affect is an abstract entity which is said to manifest prior to the realm of personal awareness or consciousness [2]. According to [3], it is fundamental in nature and subsumes several other related concepts such as sentiment, feelings and emotion, along with higher order mental constructs such as humor and mood. An important characteristic of affect and other related concepts is their inherent subjectivity. For example, a given image may evoke different emotions in people depending on their backgrounds. Similarly, different people may react differently to humorous situations. This subjectivity often leads to challenges in building models for recognizing affect.
Modeling affect is an important problem in Artificial Intelligence (AI). Incorporating affect can enrich the quality of interactions with AI agents, and it is relevant in all of the modalities commonly encountered in AI such as speech, vision and language. Modeling of affect spans the interdisciplinary field of Affective Computing (AC), which includes tasks such as emotion recognition, sentiment analysis and opinion mining, along with recognizing higher order constructs such as humor and mood. Typical approaches used in these tasks involve training supervised machine learning models, which assumes the availability of a dataset with training labels. However, these labels are not easy to obtain owing to the subjectivity of affective dimensions. Further, we may not always have a clearly identifiable ground truth, unlike in typical machine learning tasks. For example, when developing an AI system to identify physical attributes of a person (such as race, gender, etc.) from images, we can often easily and reliably identify these attributes without ambiguity. However, affective dimensions such as emotion are social constructs, with culture playing an important role in their recognition [4], because of which there may not be an objective ground truth.

Commonly used strategies to collect training labels for the affective dimensions include: (i) using an approximate proxy from the corpus to identify the labels; for example, if we are building models to predict humor, laughter cues may be used as a proxy, or (ii) combining noisy labels using an annotation fusion model.

As a case study for using a proxy to identify the training labels, we explored the computational modeling of humor from conversations in psychotherapy sessions (Anil Ramakrishna, Timothy Greer, David Atkins, Shrikanth Narayanan, Computational modeling of conversational humor in psychotherapy, in: Proceedings of Interspeech, Hyderabad, India, 2018). We used shared occurrence of laughter between the client and the therapist as a proxy for occurrence of humorous utterances. To capture context, we used a hierarchical 2-layer LSTM network and showed improved performance in recognizing humor compared to a standard baseline. However, similar attempts to use canned laughter from television sitcoms as a proxy to humor failed due to low agreement between human annotations of which utterances were humorous and the utterances followed by the canned laughter. This highlights a limitation of using a proxy: we may not always have access to a reliable approximation to the label of interest. Further, identifying a suitable proxy assumes domain knowledge which may not always be easily accessible.
An alternate approach to obtain training labels is to collect noisy judgments from human annotators, who may be trained experts (such as medical professionals working on diagnostics data) or untrained workers from crowdsourcing platforms such as Amazon Mechanical Turk (MTurk, www.mturk.com) and Crowdflower (www.crowdflower.com). Given noisy labels from such annotators, the typical approach is to aggregate them to obtain the label of interest. Common aggregation strategies include majority voting or simple averaging, but these assume uniform reliability among the annotators, which may not be true with crowdsourcing platforms. To address this, several authors have developed annotation fusion models that capture the behavior of individual annotators to improve the quality of estimated labels. However, most existing works model the annotation dimensions individually, even in settings where we collect annotator ratings on multiple dimensions. For example, while collecting annotations on affective dimensions, it is common to collect ratings on dimensions such as valence, arousal and dominance, and it may be beneficial to model the annotations jointly while aggregating them. In this dissertation, we explore this hypothesis and propose models to perform annotation fusion which make use of correlations between the different annotation dimensions.
1.1 Contributions
The specific contributions made in this dissertation are as follows.

• We propose two multidimensional annotation fusion models with latent ground truth vectors to capture relationships between the dimensions. In both models, the annotator parameters and the ground truth vectors are estimated jointly using the Expectation Maximization [5] algorithm.
  – The first model assumes additive Gaussian noise for the annotators' distortion function.
  – The second model assumes a matrix factorization structure for the distortion function.

• We develop a novel strategy to estimate psycholinguistic normatives at the sentence level by making use of the matrix factorization based annotation fusion model.
1.2 Organization
The overall organization of this dissertation is as follows:
• In Chapter 2, we introduce the problem of annotation fusion and discuss prior works in this domain along with their weaknesses. We also motivate the need for multidimensional annotation fusion.

• In Chapter 3, we present our first model for multidimensional annotation fusion, which uses additive Gaussian noise.

• In Chapter 4, we present the matrix factorization based model to capture multidimensional annotations. Derivations for the model are presented in Appendix A.

• In Chapter 5, we apply the annotation fusion model described in Chapter 4 to the task of estimating sentence level psycholinguistic norms.

• We conclude in Chapter 6 and highlight future directions in the estimation of agreement for multidimensional annotations.
Chapter 2
Multidimensional Annotation
Fusion: Preliminaries
2.1 Introduction
Crowdsourcing is a popular tool used in collecting human judgments on affective constructs such as emotion and engagement. Typical examples include annotations of images or video clips with categorical emotion labels or with continuous dimensions such as valence or arousal. Online platforms such as Amazon Mechanical Turk (MTurk) have recently risen in popularity owing to their inexpensive label costs and also their ability to scale efficiently.

Crowdsourcing is also a popular approach in collecting labels for use in the training of supervised machine learning algorithms. Such labels are typically obtained from domain experts, which can be slow and expensive. For example, in the medical domain, it is often expensive to collect diagnosis information given laboratory tests since this requires judgments of trained professionals. On the other hand, unlabeled patient data may be easily available. Crowdsourcing has been particularly successful in such settings with easy availability of unlabeled data instances, since we can collect a large number of annotations from untrained and inexpensive workers over the Internet, which when combined together may be comparable or even better than expert annotations [6].

A typical crowdsourcing setting involves collecting annotations from a large number of workers, and hence there is a need to robustly combine them to estimate the ground truth. The most common approach for this is to take simple averages for continuous labels or perform majority voting for categorical labels. However, this assumes uniform competency across all the workers, which is not always guaranteed or justified. Several alternative approaches have been proposed to address this challenge, each with a specific structure to the function modeling the annotators' behavior. In practice, it is common to collect annotations on multiple questions for each data instance being labeled in order to reduce costs or annotators' mental load, or even to improve annotation accuracy. For example, while collecting emotion annotations for a given data instance (such as a single image or video segment), collecting labels on dimensions such as valence or arousal together (concurrently or one after another) may be preferred over collecting valence annotations for all instances followed by arousal annotations.
Such a joint annotation task may entail task specific or annotator specific dependencies between the annotated dimensions. In the emotion annotation example, task specific dependencies may occur due to inherent correlations between the valence and arousal dimensions depending on the experimental setup. Annotator specific dependencies may occur due to a given annotator's (possibly incorrect or incomplete) understanding of the annotation dimensions. Hence it is of relevance to jointly model the different annotation dimensions. However, most state of the art models in annotation fusion combine the annotations by modeling the different dimensions independently. The focus of this dissertation is to highlight the benefits of modeling them jointly: joint modeling of the annotation dimensions may result in more accurate estimates of the ground truth as well as give a better picture of the annotators' behavior. In this chapter, we present prior work in the domain of annotation fusion and motivate the need for multidimensional annotation fusion models.
2.2 Keywords
We list a few important keywords and their definitions below.

• Annotators: Workers, for example from crowdsourcing platforms such as MTurk, who provide their judgments on the subjective construct under discussion.

• Annotations: The noisy judgments we obtain from the annotators. We use the terms ratings and annotations interchangeably.

• Ground truth: The objective value of labels for cases in which they can be clearly and unambiguously identified (for example, height of people).

• Reference labels: In cases where there is no unambiguous ground truth (for example, emotion or humor), we aggregate expert opinions to obtain a reference label against which the predictions of our fusion models are compared.

• Data instance: The individual data point for which the subjective ratings are being collected.

Since annotation fusion models are applicable when aggregating annotations from tasks both with and without a well defined ground truth, in the following sections and chapters we overload the term ground truth and use it to refer to the hidden variable of interest estimated by the annotation fusion models, even in problems without a well defined ground truth. In such cases, the predicted estimates from the models are compared with reference labels obtained from experts instead.
2.3 Related work
Several authors, most notably [6], assert the benefits of aggregating opinions from many people, which under certain conditions is often believed to be better than aggregating those from a small number of experts. Often referred to as the wisdom of crowds, this approach has been remarkably popular in recent times, especially in fields such as psychology where a ground truth may not be easily accessible or may not exist. This popularity can be largely attributed to online crowdsourcing platforms such as MTurk that connect researchers with low cost workers from around the globe. Along with cost, scalability of annotations is another major appeal of such tools, leading to their use in machine learning in large scale labeling of data instances such as images [7], audio/video clips [8] and text snippets [9].
Figure 2.1: Plate notation for a basic annotation model. $a^{m,d}$ is the latent ground truth for the given data point (for the $d$th question) and $a^{m,d}_k$ is the rating provided by the $k$th annotator.
Figure 2.1 shows a common setting in the crowdsourcing paradigm. For each data point $m$, annotator $k$ provides a noisy label $a^{m,d}_k$ which depends on the ground truth $a^{m,d}$, where $d$ is the dimension being annotated. Since we collect several annotations for each data point, we need to aggregate them to estimate the unknown ground truth. The most common technique used in aggregating these opinions is to take the average value in the case of numeric labels or perform majority voting in the case of categorical labels, as shown in Equation 2.1.

$\hat{a}^{m,d} = \arg\max_j \sum_k \mathbb{1}\{a^{m,d}_k = j\}$   (2.1)

where $\mathbb{1}\{\cdot\}$ is the indicator function.

While simple and easy to implement, this approach assumes consistent reliability among the different annotators, which seems unreasonable, especially in online platforms such as MTurk. To avoid this, several approaches have been suggested that account for annotator reliability in estimating the ground truth. We explain a few in detail below.
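As a concrete illustration, these two baseline aggregation rules can be written in a few lines of Python; this is a minimal sketch (the function names and data layout are ours, for illustration only):

    import numpy as np
    from collections import Counter

    def aggregate_categorical(ratings):
        # Majority vote (Equation 2.1): the label chosen by the most annotators.
        return Counter(ratings).most_common(1)[0][0]

    def aggregate_continuous(ratings):
        # Simple average for numeric labels.
        return float(np.mean(ratings))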
Early efforts to capture reliability in annotation modeling [10], [11] assumed a specific structure for the functions modeled by each annotator. Given a set of annotations $a^{m,d}_k$ along with the corresponding function parameters, the ground truth is estimated using the MAP estimator

$\hat{a}^{m,d} = \arg\max_j \sum_k \log p(a^{m,d}_k \mid a^{m,d} = j) + \log p(a^{m,d} = j)$   (2.2)

where $p(a^{m,d})$ is the prior probability of the ground truth.
In [10], the categorical ground truth label $a^{m,d} = i$ is modified probabilistically by annotator $k$ using a stochastic matrix $\pi^k$, as shown in Equation 2.3, in which each row is a multinomial conditional distribution given the ground truth.

$P(a^{m,d}_k = j \mid a^{m,d} = i) = \pi^k_{ij}$   (2.3)

Given annotations from $K$ different annotators, their parameters $\pi^k$ and the prior distribution of labels $p_j = P(a^{m,d} = j)$, the ground truth is estimated using MAP estimation as before.

$\hat{a}^{m,d} = \arg\max_j \sum_k \log \pi^k_{j (a^{m,d}_k)} + \log p_j$   (2.4)

The above expression makes a conditional independence assumption for annotations given the ground truth label. Since we do not typically have the annotator parameters $\pi^k$, these are estimated using the EM algorithm.
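To make the estimator concrete, the following sketch computes the MAP label of Equation 2.4 given already-estimated confusion matrices; the dictionary-based data layout is an assumption for illustration:

    import numpy as np

    def map_label(annotations, pi, prior):
        # annotations: annotator id -> observed label j for this data point
        # pi: annotator id -> J x J matrix, pi[k][i, j] = P(rating j | truth i)
        # prior: length-J vector of label priors p_j
        scores = np.log(np.asarray(prior, dtype=float))
        for k, j in annotations.items():
            # Conditional independence of annotations given the ground truth.
            scores = scores + np.log(pi[k][:, j])
        return int(np.argmax(scores))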
Figure 2.2: Annotation model proposed by [1] with a jointly learned predictor. $x^m$ is the set of features for the $m$th data point; $a^{m,d}$ is the $d$th dimension of the latent ground truth, which is modeled as a function of $x^m$; $a^{m,d}_k$ is the rating provided by the $k$th annotator.

Figure 2.2 shows an extension of the model in Figure 2.1 in which we learn a predictor (classifier/regression model) for the ground truth jointly with the annotator parameters. Such a predictor may be used to obtain the ground truth for new data points. This strategy of jointly modeling the annotator functions as well as the ground truth predictor has been shown to have better performance when compared to classifiers trained independently of the estimated ground truth [1]. The ground truth estimate in this model is given by

$\hat{a}^{m,d} = \arg\max_{a^{m,d}} \sum_k \log p(a^{m,d}_k \mid a^{m,d}) + \log p(a^{m,d} \mid x^m)$   (2.5)
Recently, several additional extensions have been proposed to the model of Figure 2.2. For example, in [12] the authors assume varying regions of annotator expertise in the data feature space and account for this using different probabilities of label confusion for each region. The authors show that this leads to a better estimation of annotator reliability and ground truth.

The models described so far were designed for annotation tasks in which the task is to rate some global property of the data point. For example, in
image based emotion annotation, the task may be to provide annotations on dimensions such as valence and arousal conveyed by each image. However, human interactions often involve continuous variations of these dimensions over time [13], which are captured using time series annotations from audio/video clips. In this context, the previous models are applicable only if annotations from each frame are treated independently. However, this entails several unrealistic assumptions, such as independence between frames, zero lag in the annotators, and synchronized response of the annotators to the underlying stimulus.

Several works have been proposed to capture the underlying reaction lag in the annotators. [14] proposed a generalization of Probabilistic Canonical Correlation Analysis (PCCA) [15] named Dynamic PCCA, which captures temporal dependencies of the shared ground truth space in a generative setting. They further extend this model by incorporating a latent time warping process to implicitly handle the reaction lags in annotators. This work is extended in [16], where the authors also jointly learn a function to capture the dependence of the latent ground truth signal on the data points' features, in both generative and discriminative settings similar to the setting of [1]. [17] address the reaction lag by explicitly finding the time shift that maximizes the mutual information between expressive behaviors and the annotations. [18] generalize the work of [17] by using a linear time invariant (LTI) filter, which can also handle any bias or scaling the annotators may introduce.

More recent works in annotation fusion include [19], in which the authors propose a variant of the model in Figure 2.1 with various annotator functions to capture four specific types of annotator behavior. [20] describe a
mechanism named approval voting that allows annotators to provide multiple answers instead of one for instances where they are not confident. [21] use repeated sampling of opinions from annotators over the same data instances to increase reliability in annotations.

Most of the models described above focus on combining annotations on each dimension separately. The model proposed in [16] can indeed be generalized to combine the different annotation dimensions together, but that is not the focus of their work and as such they do not evaluate on this task. However, in many practical applications, annotation tasks are multi-dimensional. For example, while collecting emotion ratings it is routine to collect annotations on valence, arousal, dominance and other related dimensions. In these cases, it may be beneficial to model the different dimensions together since they may be closely related. Further, there may be dependencies between the internal definitions the annotators hold for the annotation dimensions. For example, while annotating emotional dimensions, a given annotator may associate certain valence values with only a certain range of arousal. It is therefore of relevance to model such annotator specific relationships between the different dimensions as part of the annotator distortion function and predictor modeling paradigm. In this dissertation, we address this gap by proposing latent variable models for multidimensional annotation fusion. We motivate this problem further in the next section.
Figure 2.3: Correlation heatmaps for annotations from a representative sample of emotion annotated datasets: (a) iemocap, (b) semaine, (c) movie emotions, (d) recola; v - valence, a - arousal, d - dominance, p - power.
2.4 Motivation
To examine the relationships between annotation dimensions, we plot the absolute values of correlation scores between annotation dimensions from four commonly studied emotion corpora in Figure 2.3: iemocap [22], semaine [23], recola [24] and the movie emotion corpus from [25]. Each of these corpora includes annotations over emotion dimensions such as valence, arousal, dominance and power. For the iemocap corpus we used global annotations, while the others include time series annotations of the affective dimensions from videos. In each case, the correlations were computed between concatenated annotation values from annotators who provide ratings on all the dimensions.
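The computation behind each heatmap is a Pearson correlation between stacked annotation columns; a minimal sketch, assuming each annotator's ratings are stored as an array with one column per dimension:

    import numpy as np

    def dimension_correlations(per_annotator_ratings):
        # Concatenate ratings from annotators who rated all dimensions,
        # then compute the D x D Pearson correlation matrix.
        stacked = np.vstack(per_annotator_ratings)
        return np.abs(np.corrcoef(stacked, rowvar=False))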
As is evident, in almost all cases the annotation dimensions exhibit non-zero correlations, highlighting the need for fusion models that take such correlations into account. The models we propose in the next chapters are aimed at capturing this form of dependency. We attribute the inconsistent correlations between the dimensions across corpora to varying underlying affective narratives as well as differences in perceptions and biases introduced by individual annotators themselves.
Chapter 3
Additive Gaussian noise model
3.1 Introduction
In this chapter, we will present our first model to capture multidimensional annotations, which assumes that each annotator's distortion function adds a Gaussian noise vector to capture the relationship between annotation dimensions. Similar to the models described in Section 2.3, we assume that the ground truth vector for each data point is hidden, while the feature vector corresponding to each file and the annotation vector are observed. With this formulation, we use the EM algorithm to estimate the model parameters.

The chapter is organized as follows: in Section 3.2, we describe the proposed model and provide equations for parameter estimation using the EM algorithm. We describe the data used to evaluate the model in Section 3.3, experiments in Section 3.4 and results in Section 3.5, before concluding in Section 3.6.
Figure 3.1: Graphical model representation for the proposed model. $x^m$ is the set of features for the $m$th instance, $a^m$ is the latent ground truth and $a^m_k$ is the rating provided by the $k$th annotator for that instance. $x^m$ and $a^m_k$ are observed variables; $a^m$ is latent.
3.2 Model
Consider a set of $M$ data points with features $\{x^1, \ldots, x^M\}$, $x^m$ being the feature vector corresponding to the $m$th point. Each data point is associated with a $D$ dimensional ground truth vector for which ratings from several annotators are pooled. In this work, we assume that each data point is annotated by a subset of $K$ annotators. This is a more general setting than assuming that ratings are available from every annotator (as assumed in [1]), and is often the case with data collection over online platforms such as MTurk. We represent the set of ratings for the $m$th data point by a set $A^m$. For example, if annotators 1, 2 and 5 (out of $K$ annotators) provided their ratings, $A^m$ would be the set $\{a^m_1, a^m_2, a^m_5\}$, where $a^m_k$ is the multidimensional rating from the $k$th annotator. The vector $a^m_k$ is $D$-dimensional, represented as $\{a^{m,1}_k, \ldots, a^{m,d}_k, \ldots, a^{m,D}_k\}$, where $a^{m,d}_k$ is the rating by the $k$th annotator for the $d$th dimension of data point $m$. Armed with this notation, we train the annotation fusion model shown as a graphical model in Figure 3.1. This model is inspired by the works of Raykar et al. [1] and Gupta et al. [18]. The model assumes that there exists a latent ground truth $a^m$ (also of dimensionality $D$), which is conditioned on the data features. The relationship between the features and $a^m$ is captured by the function $f(x^m \mid \Theta)$, with parameter $\Theta$. We assume $f$ to be an affine projection of the feature vectors as shown in Equation 3.1, with $\Theta$ being the projection matrix and $[x^m; 1]$ the feature vector augmented with a constant 1.

$a^m = f(x^m \mid \Theta) = \Theta^T [x^m; 1]$   (3.1)
The model further assumes that each annotator's ratings are noisy modifications of the ground truth $a^m$. We assume these modifications to be the addition of a $D$-dimensional Gaussian noise vector with distribution $N(\mu_k, \Sigma_k)$, as shown in Equation 3.2. $\mu_k$ and $\Sigma_k$ represent the mean and covariance matrix of this distribution, respectively.

$a^m_k = a^m + \epsilon_k, \quad \text{where } \epsilon_k \sim N(\mu_k, \Sigma_k)$   (3.2)
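The generative process of Equations 3.1 and 3.2 can be summarized in a short simulation sketch (a toy illustration under the stated assumptions, useful for sanity checking the estimation code; names are ours):

    import numpy as np

    def sample_annotation(x_m, Theta, mu_k, Sigma_k, rng):
        # Equation 3.1: ground truth as an affine projection of the features.
        a_m = Theta.T @ np.append(x_m, 1.0)
        # Equation 3.2: annotator rating = ground truth + Gaussian noise.
        return a_m + rng.multivariate_normal(mu_k, Sigma_k)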
Model training

We estimate the model parameters by maximizing the data log-likelihood. Since the model contains a latent variable (the ground truth $a^m$), we adopt the Expectation Maximization algorithm [5], widely used for similar settings. During model training, our objective is to estimate the model parameters $\Phi = \{\Theta, \mu_1, \Sigma_1, \ldots, \mu_K, \Sigma_K\}$ that maximize the log-likelihood $L$ of the observed annotator ratings given the features. Assuming independent data points, $L$ is given by

$L = \log \prod_{m=1}^{M} p(A^m \mid x^m, \Phi) = \sum_{m=1}^{M} \log p(A^m \mid x^m, \Phi)$   (3.3)
The EM algorithm iteratively performs an E-step followed by an M-step. Detailed derivations of these steps can be found in various resources such as [5], [10] and [12]. We specifically refer the reader to the EM algorithm derivation in [18] for a fusion model similar to the one presented in this chapter. The authors in [18] perform a hard version of the EM algorithm, where in the E-step a point estimate of the ground truth $a^m$ is computed. This is followed by a parameter update in the M-step based on the estimated $a^m$. Popular methods such as Viterbi training [26] and K-means clustering [27] are variants of the hard EM algorithm for training Hidden Markov Models and clustering, respectively. Borrowing formulations from the aforementioned research studies, we summarize the E and M steps for obtaining the parameters of the graphical model shown in Figure 3.1.
EM algorithm

The EM algorithm involves iteratively executing the Expectation and Maximization steps listed below.

Initialization: Initialize the model parameters.

E-step: We estimate the ground truth $\hat{a}^m$ for each $m = 1, \ldots, M$ by solving the optimization problem shown below; $\|\cdot\|_2$ represents the $l_2$-norm.

$\hat{a}^m = \arg\min_{a^m} \sum_{k \in A^m} \left\| \Sigma_k^{-\frac{1}{2}} \left( a^m_k - a^m - \mu_k \right) \right\|_2^2 + \left\| a^m - \Theta^T [x^m; 1] \right\|_2^2$   (3.4)
M-step: Given $\hat{a}^m$, we estimate the model parameters using the following equations, where $M_k$ is the number of data points annotated by annotator $k$ and each sum runs over the set of data points rated by annotator $k$.

$\mu_k = \frac{1}{M_k} \sum_{m'} \left( a^{m'}_k - \hat{a}^{m'} \right)$   (3.5)

$\Sigma_k = \frac{1}{M_k - 1} \sum_{m'} \left( a^{m'}_k - \hat{a}^{m'} - \mu_k \right) \left( a^{m'}_k - \hat{a}^{m'} - \mu_k \right)^T$   (3.6)

$\Theta = \arg\min_{\Theta} \sum_m \left\| \hat{a}^m - \Theta^T [x^m; 1] \right\|_2^2$   (3.7)

Termination: We run the algorithm until convergence of the data log-likelihood $L$.
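One iteration of these E and M steps could look as follows; this is a sketch under the stated assumptions (the feature matrix already carries the appended bias column, and every annotator has enough ratings for a stable covariance estimate), not a reproduction of our exact implementation:

    import numpy as np

    def hard_em_step(X, A, I, Theta, mu, Sigma):
        # X: M x (P+1) features with bias column; A: M x K x D ratings;
        # I: M x K indicator of who rated what (Equation 3.9).
        M, K, D = A.shape
        a_hat = np.zeros((M, D))
        for m in range(M):                        # E-step: solve Equation 3.4
            H, b = np.eye(D), Theta.T @ X[m]
            for k in np.flatnonzero(I[m]):
                P = np.linalg.inv(Sigma[k])
                H += P
                b += P @ (A[m, k] - mu[k])
            a_hat[m] = np.linalg.solve(H, b)
        for k in range(K):                        # M-step: Equations 3.5, 3.6
            rated = I[:, k].astype(bool)
            resid = A[rated, k] - a_hat[rated]
            mu[k] = resid.mean(axis=0)
            Sigma[k] = np.cov(resid, rowvar=False)
        Theta = np.linalg.lstsq(X, a_hat, rcond=None)[0]   # Equation 3.7
        return a_hat, Theta, mu, Sigma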
Model testing

To evaluate our model, we use the task of predicting back the annotator ratings given the parameter estimates. We show later how this task can also be used to address the issue of annotation cost, reducing the cognitive load on the annotator by partial prediction of the ratings. Note that though our model estimates the latent values of the above dimensions, it is hard to evaluate the quality of these estimates as they are unobserved and often subjective in the dataset of interest (as is true for several datasets in the Behavioral Signal Processing domain [28]). In order to predict the rating for the $m$th file from the $k$th annotator, we first predict $a^m$ using Equation 3.1 and then add the mean $\mu_k$ of the noise distribution $N(\mu_k, \Sigma_k)$ corresponding to the $k$th annotator. Note that adding $\mu_k$ to $a^m$ provides the maximum likelihood estimate of $a^m_k$ owing to Equation 3.2 and the Gaussian noise assumption [29].

We use the Mean Squared Error (MSE) computed per dimension, averaged over all the annotators, as our evaluation metric. For dimension $d$ (out of $D$ dimensions), we compute the MSE $E_d$ as shown in Equation 3.8. $I_{mk}$ is an indicator variable marking whether the $k$th rater annotated data point $m$ (Equation 3.9); $a^{m,d}_k$ is the true rating obtained from rater $k$ on data point $m$ and $\hat{a}^{m,d}_k$ is the model prediction.
$E_d = \frac{\sum_{m=1}^{M} \sum_{k=1}^{K} I_{mk} \left( a^{m,d}_k - \hat{a}^{m,d}_k \right)^2}{\sum_{m=1}^{M} \sum_{k=1}^{K} I_{mk}}$   (3.8)

where

$I_{mk} = \begin{cases} 1 & \text{if annotator } k \text{ annotates data point } m \\ 0 & \text{otherwise} \end{cases}$   (3.9)
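A direct vectorized sketch of this metric (the array shapes are assumptions for illustration):

    import numpy as np

    def per_dimension_mse(A, A_hat, I):
        # A, A_hat: M x K x D true and predicted ratings; I: M x K mask.
        mask = I[:, :, None]
        return (mask * (A - A_hat) ** 2).sum(axis=(0, 1)) / I.sum()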
We choose this metric as it allows for evaluation on each dimension independently. Such a metric is particularly relevant in the Behavioral Signal Processing domain, where an evaluation on each dimension of the rating is desired. In the next section, we describe the dataset used in this study.
3.3 Data
We evaluate our model using the SafariBob dataset [30]. The dataset contains multimodal recordings of children watching and imitating video stimuli, each corresponding to a different emotional expression. We extract audio clips from each of these recordings, which are annotated over MTurk. For the purpose of our experiments, we use a set of 244 audio clips (each approximately 25-30 seconds long) which were rated over MTurk by a set of 124 naive annotators. The annotators provide a four dimensional rating ($D = 4$), giving their judgments on expressiveness, naturalness, goodness of pronunciation and engagement of the speaker in each audio clip. The numeric values of these attributes lie in the range of 1 to 5. Each utterance in the dataset is annotated by a subset of 15 (out of 124) annotators. This setting is subsumed by the model proposed in Section 3.2. For further details on the dataset, we refer the reader to [30].
Feature set
We use various statistical functionals computed over a set of acoustic-prosodic properties of the utterance, resulting in a set of 474 features ($x^m$) per file. These features are inspired by prior works in speech emotion recognition [31, 32]. The list of the signals and their statistical functionals used as features is shown in Table 3.1. In the next section, we describe our experimental setup, including the baseline model, and test different variants of the model described in Section 3.2.
Table 3.1: Acoustic prosodic signals and their statistical functionals used as features $x^m$ in this study.

Acoustic-prosodic signals: audio intensity, mel-frequency band, mel-frequency cepstral coefficients and pitch
Statistical functionals:   mean, median, standard deviation, range, skewness and kurtosis
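For illustration, the six functionals of Table 3.1 applied to one frame-level signal might be computed as below (a sketch; the exact feature extraction pipeline is not reproduced here):

    import numpy as np
    from scipy.stats import kurtosis, skew

    def functionals(signal):
        # Mean, median, standard deviation, range, skewness and kurtosis.
        return np.array([np.mean(signal), np.median(signal), np.std(signal),
                         np.ptp(signal), skew(signal), kurtosis(signal)])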
3.4 Experiments
Based on the approach described in Section 3.2, we train models with different assumptions. Since our goal in these experiments is to predict the annotator ratings, we first train a baseline system modeling every annotator individually. This is followed by various modifications of the proposed model to predict annotator ratings. We discuss these models in detail below.
3.4.1 Baseline: Individual annotator modeling
For the baseline, we train individual models for each annotator instead of the joint model described in Section 3.2. We use an affine projection scheme, for which the relationship between the $k$th annotator's ratings and the features is shown in Equation 3.10; $\Theta_k$ is the projection matrix for the $k$th annotator. The parameter $\Theta_k$ is obtained using the minimum mean squared error criterion on the training set, using the data points that the annotator rated.

$a^m_k = f(x^m \mid \Theta_k) = \Theta_k^T [x^m; 1]$   (3.10)
3.4.2 Joint annotator - Independent rating (Joint-Ind) modeling

In this scheme, we train the joint annotator model assuming independence between the dimensions of the multidimensional rating. This is achieved by training a separate model for each annotation dimension entry $a^{m,d}_k$. The training procedure is the same as presented in Section 3.2, with the special case of ratings being scalar. Consequently, we end up with $D = 4$ different models, one for each dimension. This model acts as a strong baseline and can help shed light on the benefits of modeling the dimensions jointly.
3.4.3 Joint annotator - Joint rating (Joint-Joint) modeling

We next model both the annotators and the ratings jointly, as described in Section 3.2. For each annotator, we end up with multidimensional parameters ($\mu_k$, $\Sigma_k$) spanning all four dimensions, which are in turn used to predict the annotator's rating for each data instance. We expect this model to capture any joint relationship between the different dimensions in the ratings, which was not modeled by the previous Joint-Ind model.
3.4.4 Joint annotator - Conditional rating (Joint-Cond) modeling

Figure 3.2: MSE $E_d$ for the four (baseline, Joint-Ind, Joint-Joint and Joint-Cond) modeling schemes as annotators with less than a threshold count of ratings are dropped. The Y-axis represents $E_d$ and the X-axis represents the minimum number of annotations (cutoff threshold).

The Joint-Cond model is an extension of the model described in Section 3.4.3. In this scheme, we assume partial availability of annotator ratings on
a few dimensions. We then use the known distribution parameters for that annotator and the available partial rating to predict the missing dimension. For the sake of brevity we focus on the case when only one of the rating dimensions is missing, noting however that other cases with more than one missing dimension are entirely straightforward. The primary goal of this model is to reduce the cognitive load on the annotator by asking him/her to annotate only a subset of the rating dimensions.

We represent the available subset of rating dimensions in the vector $a^m_k$, barring rating $a^{m,d}_k$ of dimension $d$, as $a^{m,\setminus d}_k$. Further, we represent the mean and covariance matrix entries corresponding to the dimensions barring dimension $d$ as $\mu^{\setminus d}_k$ and $\Sigma^{\setminus d}_k$. In our specific case, $\mu^{\setminus d}_k$ and $\Sigma^{\setminus d}_k$ would be of dimensionalities $3 \times 1$ and $3 \times 3$, respectively. Also, the entries within $\Sigma_k$ storing the covariances between dimension $d$ and the other dimensions are represented as $\Sigma^{d}_k$, a vector of dimensionality $1 \times 3$. Now, given
that the Joint annotator - Joint rating model prediction for the rating at dimension $d$ was $\hat{a}^{m,d}_k$, we update it to $\hat{a}^{m,d+}_k$ with the availability of $a^{m,\setminus d}_k$, as shown in Equation 3.11. This equation follows from the computation of a conditional Gaussian distribution from a joint Gaussian distribution, given partial availability of some of the variables [29].

$\hat{a}^{m,d+}_k = \hat{a}^{m,d}_k + \Sigma^{d}_k \left( \Sigma^{\setminus d}_k \right)^{-1} \left( a^{m,\setminus d}_k - \mu^{\setminus d}_k \right)$   (3.11)

We report the MSE $E_d$ for each $d \in \{1, \ldots, 4\}$ separately.
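Equation 3.11 is the standard conditional Gaussian mean update; a minimal sketch for a single missing dimension $d$ (names are ours, for illustration):

    import numpy as np

    def joint_cond_update(a_hat_d, a_partial, mu_k, Sigma_k, d):
        # Refine the Joint-Joint prediction for dimension d using the
        # annotator's observed ratings on the remaining dimensions.
        rest = [i for i in range(len(mu_k)) if i != d]
        S_cross = Sigma_k[d, rest]                    # 1 x (D-1) cross-covariances
        S_rest = Sigma_k[np.ix_(rest, rest)]          # (D-1) x (D-1) block
        return a_hat_d + S_cross @ np.linalg.solve(S_rest, a_partial - mu_k[rest])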
3.5 Results
We report results from two different experimental settings for the models described above. In the first setting, we use ratings from all annotators over the entire data. However, as some of the annotators annotated only a handful of data points (as few as 2), in the second setting we discard annotators with fewer than a threshold number of ratings. This allows for a more robust estimation of the parameters ($\mu_k$, $\Sigma_k$) per annotator. We use a 10 fold cross validation scheme over each annotator for all the models.
3.5.1 Setting 1: Training on data from all annotators
We first compare the different models by including all the annotators in our corpus, irrespective of the amount of data they annotated. The metric $E_d$ for every dimension $d$ is shown in Table 3.2.

Table 3.2: MSE $E_d$ for annotator label prediction on the four rating dimensions; Ex: Expressiveness, Na: Naturalness, Go: Goodness of Pronunciation, En: Engagement

Dimension d    1 (Ex)   2 (Na)   3 (Go)   4 (En)
Baseline       11.00    10.25    13.86    13.81
Joint-Ind       0.82     0.72     0.76     0.92
Joint-Joint     0.80     0.74     0.69     0.87
Joint-Cond      1.28     3.97    26.08     9.89

From the table, we observe that the Joint-Ind and Joint-Joint models outperform the chosen baseline predictor in all cases. The Joint-Joint
model shows the best performance in 3 out of 4 cases. It makes use of the joint information in the data to make accurate predictions of the annotator ratings, lending confidence to the model's ability to reliably estimate the hidden ground truth in multidimensional annotation settings and making it the model of choice in most cases, including when the number of ratings per annotator is low. The Joint-Cond model does better than the baseline for expressiveness, naturalness and engagement but fares much worse on pronunciation goodness. We attribute this to poor parameter estimation, particularly for annotators with a small number of ratings. In particular, the covariance matrix $\Sigma_k$ is poorly estimated for most annotators, and it plays an important role in determining the Joint-Cond estimate. We expect the model to do well when a sufficient number of ratings is available from every annotator, which is discussed in the next section.

3.5.2 Setting 2: Training on annotators with more than a threshold count of ratings
In this setting, we iteratively remove annotators if they rated fewer than a threshold number of data samples. The metric $E_d$ is then computed only on the retained annotators. The progression of $E_d$ as we increase the threshold is shown in Figure 3.2.
From Figure 3.2, we observe performance trends similar to those of the previous section when the cutoff threshold is low. However, as the minimum number of annotations is increased, the baseline and Joint-Cond models show marked improvements in performance, while the Joint-Ind and Joint-Joint models' performance remains more or less consistent. The improvement is significantly better for the Joint-Cond model, and it outperforms Joint-Ind and Joint-Joint beyond a certain threshold for all the rating dimensions. Hence we can use the Joint-Cond model to reduce the dimensionality of queries made to a given annotator after a sufficient number of ratings has been collected for him/her, in turn reducing the annotator's cognitive load and the overall annotation cost.
3.6 Conclusion
Ratings from multiple annotators are often pooled in several applications to obtain the ground truth. Several previous works [1] have proposed methods for modeling these ratings from multiple annotators. However, such models have not been investigated in the case of multidimensional annotations. In this work, we presented a model for multidimensional annotation fusion and proposed variants which were applied to the task of predicting back annotator labels. We tested the fusion model on the SafariBob dataset with four dimensional ratings and observed that the proposed model outperformed two baselines by making label predictions with low MSE. A further extension was proposed which was shown to be useful in reducing the dimension of ratings presented to annotators after we obtain sufficiently confident parameter estimates.

The model described in this chapter uses additive Gaussian noise to capture the relationship between annotation dimensions. However, such a model fails to capture more nuanced structural relationships between the dimensions. For example, if a given annotator's perception of a dimension scales with one or more of the actual ground truth values, such relationships are not easily captured by the model presented in this chapter. To address this, in the next chapter we propose a matrix factorization based model for multidimensional annotation fusion.
Chapter 4
Matrix factorization model
4.1 Introduction
In the previous chapter, we addressed the need for joint modeling of annotation dimensions by proposing a model that uses additive joint multidimensional Gaussian noise. We evaluated the model on MTurk annotations collected for audio clips of children diagnosed with autism. However, this model fails to capture nuanced relationships between the ground truth values and the annotation dimensions. In this chapter, we address this shortcoming by proposing a matrix factorization based multidimensional annotation fusion model, which decomposes annotation vectors into a data point specific ground truth vector and an annotator specific linear transformation matrix.

The model we propose is an extension of the Factor Analysis model and is applicable to both the global annotation setting (such as when collecting emotion annotations on a picture, or a judgment about the overall tone of a conversation) as well as time series annotations (for example, annotations of audio/video clips). Similar to the model proposed in the previous chapter, this model treats the hidden ground truth as latent variables and estimates them jointly along with the annotator parameters using the Expectation Maximization algorithm [5]. We evaluate the model in both settings on synthetic and real emotion corpora. We also create an artificial annotation task with a controlled ground truth, which is used in the model evaluation for both settings.

The rest of the chapter is organized as follows. In Section 4.2 we describe the proposed model and provide equations for parameter estimation using the EM algorithm. We evaluate the model in Section 4.3 and provide conclusions in Section 4.4.
4.2 Model
4.2.1 Setup

The proposed model is shown in Figure 4.1. Each data point $m$ has a feature vector $x^m$ and an associated multidimensional ground truth $a^m$, which is defined as follows,

$a^m = f(x^m, \Theta) + \epsilon^m$   (4.1)

We assume that from a pool of $K$ annotators, a subset operates on each data point and provides their annotation $a^m_k$.

$a^m_k = g(a^m; F_k) + \eta_k$   (4.2)
Figure 4.1: Proposed model. $x^m$ is the set of features for the $m$th data point, $a^{m,d}$ is the latent ground truth for the $d$th dimension and $a^{m,d}_k$ is the rating provided by the $k$th annotator. Vectors $x^m$ and $a^m_k$ (shaded) are observed variables, while $a^m$ is latent. $A^m$ is the set of annotator ratings for the $m$th instance.
where index $k$ corresponds to the $k$th annotator; $F_k$ is an annotator specific matrix that defines his/her linear weights for each output dimension; $\epsilon^m$ and $\eta_k$ are noise terms defined individually in the next sections, along with the functions $f$ and $g$. In the global annotation setting, both $a^m$ and $a^m_k \in \mathbb{R}^D$, where $D$ is the number of items being annotated; in the time series setting, $a^m$ and $a^m_k \in \mathbb{R}^{T \times D}$, where $T$ is the total duration of the data point (audio/video signal). In all subsequent definitions, we use uppercase letters $M, K, T, D$ to denote various counts and lowercase letters $m, k, t, d$ to denote the corresponding index variables.
We make the following assumptions in our model.

A1 Annotations are independent for different data points.

A2 The annotations for a given data point are independent of each other given the ground truth.

A3 The model ground truths for different annotation dimensions are assumed to be conditionally independent of each other given the features $x^m$.
4.2.2 Global annotation model

In this setting, the ground truth and annotations are $D$ dimensional vectors for each data point. We define the ground truth $a^m$ and annotations $a^m_k$ as follows.

$a^m = \Theta^T x^m + \epsilon^m$   (4.3)

$a^m_k = F_k a^m + \eta_k$   (4.4)

where $x^m \in \mathbb{R}^P$, $\Theta \in \mathbb{R}^{P \times D}$, $\epsilon^m \sim N(0, \sigma^2 I)$, $\sigma^2 \in \mathbb{R}$. The annotator noise $\eta_k$ is defined as $\eta_k \sim N(0, \sigma^2_k I)$, $\sigma^2_k \in \mathbb{R}$. $F_k \in \mathbb{R}^{D \times D}$ is the annotator specific weight matrix. Each annotation dimension value $a^{m,d}_k$ for annotator $k$ is defined as a weighted average of the ground truth vector $a^m$, with weights given by the vector $F_k(d,:)$.
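The generative story of Equations 4.3 and 4.4 in a short simulation sketch (a toy illustration under the stated assumptions, in the spirit of the synthetic experiments described later; names are ours):

    import numpy as np

    def sample_global(x_m, Theta, F, sigma, sigma_k, rng):
        # Equation 4.3: latent ground truth with isotropic Gaussian noise.
        D = Theta.shape[1]
        a_m = Theta.T @ x_m + rng.normal(0.0, sigma, size=D)
        # Equation 4.4: each annotator applies their weight matrix F_k.
        ratings = {k: F[k] @ a_m + rng.normal(0.0, sigma_k[k], size=D)
                   for k in F}
        return a_m, ratings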
Parameter Estimation

The model parameters $\Phi = \{F_k, \Theta, \sigma^2, \sigma^2_k\}$ are estimated using Maximum Likelihood Estimation (MLE), in which they are chosen to be the values that maximize the likelihood function $L$.

$\log L = \sum_{m=1}^{M} \log p(a^m_1, \ldots, a^m_K; \Phi) = \sum_{m=1}^{M} \log \int_{a^m} p(a^m_1, \ldots, a^m_K \mid a^m, F_k, \sigma^2_k) \, p(a^m; \Theta, \sigma^2) \, da^m$   (4.5)

Optimizing Equation 4.5 directly is intractable because of the presence of the integral within the log term, hence we use the EM algorithm. Note that the model we propose assumes that only some random subset of all available annotators provides annotations on a given data point, as shown in Figure 4.1. However, for ease of exposition, we overload the variable $K$ and use it here to indicate the number of annotators that attempt to judge the given data point $m$.
EM algorithm

The Expectation Maximization (EM) algorithm to estimate the model parameters is shown below. It is an iterative algorithm in which the E and M-steps are executed repeatedly until an exit condition is encountered. Complete derivations for the model can be found in Appendix A.1.

Initialization: We initialize by assigning the expected values and covariance matrices for the $M$ ground truth vectors $a^m$ to their sample estimates (i.e. sample mean and sample covariance) from the corresponding annotations. We then estimate the parameters as described in the maximization step using these estimates.
E-step: In this step we take the expectation of the log likelihood function with respect to $p(a^m \mid a^m_1, \ldots, a^m_K)$, and the resulting objective is maximized with respect to the model parameters in the M-step. The equations to compute the expected value and covariance matrix of the latent variable $a^m$ in the E-step are listed below.

$E_{a^m \mid a^m_1 \ldots a^m_K}[a^m] = \Theta^T x^m + \Sigma_{a^m, \bar{a}^m} \, \Sigma_{\bar{a}^m, \bar{a}^m}^{-1} \, (\bar{a}^m - \bar{\mu}^m)$

$\Sigma_{a^m \mid a^m_1 \ldots a^m_K} = \Sigma_{a^m, a^m} - \Sigma_{a^m, \bar{a}^m} \, \Sigma_{\bar{a}^m, \bar{a}^m}^{-1} \, \Sigma_{\bar{a}^m, a^m}$

The $\Sigma$ terms are covariance matrices between the subscripted random variables. $\bar{a}^m$ and $\bar{\mu}^m$ are $DK$ dimensional vectors obtained by concatenating the $K$ annotation vectors $a^m_1, \ldots, a^m_K$ and their corresponding expected values.
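These moments follow from the joint Gaussian distribution implied by Equations 4.3 and 4.4; the sketch below builds the covariance blocks explicitly (the list-based inputs and names are assumptions for illustration):

    import numpy as np

    def e_step_moments(x_m, ratings, F, Theta, sig2, sig2_k):
        # ratings, F, sig2_k: lists over the annotators who rated this point.
        D = Theta.shape[1]
        K = len(ratings)
        mu_a = Theta.T @ x_m                            # prior mean of a^m
        # Cross-covariance of a^m with the stacked annotations (D x DK).
        C_ab = np.hstack([sig2 * Fk.T for Fk in F])
        # Covariance of the stacked annotations (DK x DK).
        C_bb = np.block([[sig2 * F[i] @ F[j].T
                          + (sig2_k[i] * np.eye(D) if i == j else np.zeros((D, D)))
                          for j in range(K)] for i in range(K)])
        resid = np.concatenate(ratings) - np.concatenate([Fk @ mu_a for Fk in F])
        gain = C_ab @ np.linalg.inv(C_bb)
        mean = mu_a + gain @ resid                      # conditional mean
        cov = sig2 * np.eye(D) - gain @ C_ab.T          # conditional covariance
        return mean, cov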
M-step: In this step, we compute current estimates for the parameters as follows. The expectations shown below are over the conditional distribution $a^m \mid a^m_1, \ldots, a^m_K$.

$\Theta = (X^T X)^{-1} \left( X^T E[a^m] \right)$

$F_k = \left( \sum_{m=1}^{M_k} a^m_k \, E[(a^m)^T] \right) \left( \sum_{m=1}^{M_k} E[a^m (a^m)^T] \right)^{-1}$

$\sigma^2 = \frac{1}{MD} \sum_{m=1}^{M} \left( E[(a^m)^T a^m] - 2 \, tr\left( \Theta'^T x^m E[(a^m)^T] \right) + tr\left( x^{mT} \Theta' \Theta'^T x^m \right) \right)$

$\sigma^2_k = \frac{1}{M_k D} \sum_{m=1}^{M_k} \left( (a^m_k)^T a^m_k - 2 \, tr\left( F_k'^T a^m_k E[(a^m)^T] \right) + tr\left( F_k'^T F_k' \, E[a^m (a^m)^T] \right) \right)$

Note the similarity of the update equation for $\Theta$ with the familiar normal equations; we use the soft estimate of $a^m$ to find the expression for $\Theta$ in each iteration. Here, $X$ is the feature matrix for all data points, with the individual feature vectors $x^m$ in its rows. $\Theta'$ and $F_k'$ are parameters from the previous iteration.
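The $F_k$ and $\Theta$ updates then reduce to matrix sums over the E-step moments; a minimal sketch (here E_aaT denotes $E[a^m (a^m)^T]$, i.e. the conditional covariance plus the outer product of the conditional mean; names are ours):

    import numpy as np

    def m_step_F_k(A_k, E_a, E_aaT):
        # A_k: ratings by annotator k; E_a, E_aaT: matching E-step moments.
        num = sum(np.outer(a, e) for a, e in zip(A_k, E_a))
        den = sum(E_aaT)
        return num @ np.linalg.inv(den)

    def m_step_Theta(X, E_a):
        # Normal equations with soft estimates E[a^m] in place of targets.
        return np.linalg.solve(X.T @ X, X.T @ np.vstack(E_a))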
Termination: We run the algorithm until convergence, the criterion for which is a change in model log-likelihood of less than 0.001% from the previous iteration.
4.2.3 Time series annotation model

In this setting, the ground truth and the annotations are matrices with $T$ rows (time) and $D$ columns (annotation dimensions). The ground truth matrix $a^m$ is defined as follows.

$vec(a^m) = vec(X^m \Theta) + \epsilon^m$   (4.6)

where $a^m \in \mathbb{R}^{T \times D}$, $X^m \in \mathbb{R}^{T \times P}$ and $\Theta \in \mathbb{R}^{P \times D}$; $T$ represents the time dimension and is the length of the time series. $X^m$ is the feature matrix in which each row corresponds to features extracted from the data point for one particular time stamp. $vec(\cdot)$ is the vectorization operation, which flattens the input matrix in column first order to a vector. $\epsilon^m \sim N(0, \sigma^2 I) \in \mathbb{R}^{TD}$ is the additive noise vector with $\sigma \in \mathbb{R}$.
In [18], the authors propose a linear model where the annotation function
g(a
m
;F
k
) is a causal linear time invariant (LTI) lter of xed width. The
advantage of using an LTI lter is that it can capture scaling and time-delay
biases introduced by the annotators.
Since the lter width W is chosen such that W T where T is the
number of time stamps for which we have the annotations, the annotation
function for dimension d
0
can be viewed as the left multiplication of a lter
37
matrix B
d
0
k
2 I R
TT
as shown in Equation 4.7.
$$B^{d'}_k = \begin{bmatrix}
b^{d'}_1 & 0 & 0 & 0 & 0 & \cdots & 0 \\
b^{d'}_2 & b^{d'}_1 & 0 & 0 & 0 & \cdots & 0 \\
b^{d'}_3 & b^{d'}_2 & b^{d'}_1 & 0 & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & & & \vdots \\
0 & b^{d'}_W & \cdots & b^{d'}_1 & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \ddots & & \vdots \\
0 & 0 & 0 & 0 & b^{d'}_W & \cdots & b^{d'}_1
\end{bmatrix} \quad (4.7)$$
We extend this model in our work to combine information from all of the annotation dimensions. Specifically, the ground truth is left multiplied by $D$ horizontally concatenated filter matrices, each $\in \mathbb{R}^{T \times T}$ and corresponding to a different dimension, as shown below:

$$a^{m,d}_k = F^d_k\, vec(a^m) + \epsilon_k \quad (4.8)$$

where

$$F^d_k = [B^{d,1}_k, B^{d,2}_k, \ldots, B^{d,D}_k] \quad (4.9)$$

$F^d_k \in \mathbb{R}^{T \times TD}$ with $WD$ unique parameters, and $\epsilon_k \sim \mathcal{N}(0, \sigma^2_k I) \in \mathbb{R}^T$ with $\sigma^2_k \in \mathbb{R}$.
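To make the structure of Equations 4.7 to 4.9 concrete, a small sketch that builds $B^{d'}_k$ and $F^d_k$ from a vector of filter taps might look as follows; this is illustrative only, and the function names are ours.

    import numpy as np

    def causal_filter_matrix(b, T):
        # b: (W,) filter taps b_1..b_W; returns the (T, T) banded matrix of Eq. 4.7.
        B = np.zeros((T, T))
        W = len(b)
        for t in range(T):
            for w in range(min(W, t + 1)):
                B[t, t - w] = b[w]
        return B

    def annotator_filter(bs, T):
        # bs: (D, W), one tap vector per ground-truth dimension; returns F_k^d of
        # Eq. 4.9, i.e. D filter matrices concatenated into a (T, T*D) operator.
        return np.hstack([causal_filter_matrix(b, T) for b in bs])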
Parameter Estimation
Estimating the model parameters as in the global model would require computing expectations over a vector of size $TD$. Since $T$ is the number of time stamps in the task and can be arbitrarily long, this may not be feasible in all tasks. For example, in the movie emotions corpus [25], annotations are collected at a rate of 25 frames per second, with each file of duration of approximately 30 minutes, i.e., approximately 45k annotation frames. To avoid this we use a variant of EM named Hard EM in which, instead of taking expectations over the entire conditional distribution of $a^m$, we find its mode. This variant has been shown to be comparable in performance to the classic EM (Soft EM) despite being significantly faster and simpler [33]. This approach is similar to the parameter estimation strategy devised by [18] in their time series annotation model.
The likelihood function is similar to that of the global model in Equation 4.5, as shown below:

$$\log \mathcal{L} = \sum_{m=1}^{M} \log \int_{a^m} p(a^m_1, \ldots, a^m_K \mid a^m; F_k, \sigma^2_k)\, p(a^m; \eta, \sigma^2)\, da^m$$

However, the integral here is with respect to the flattened vector $vec(a^m)$.

EM algorithm

The EM algorithm for the time series annotation model is listed below. Complete derivations for the model can be found in Appendix A.2.

Initialization  Unlike the global annotation model, we initialize $a^m$ randomly, since we observed better performance compared to initializing it with the annotation means. Given this $a^m$, the model parameters are estimated as described in the maximization step below.

E-step  In this step we assign $a^m$ to the mode of the conditional distribution $q(a^m) = p(a^m \mid a^m_1, \ldots, a^m_K)$. Since this distribution is normal, finding the mode is equivalent to minimizing the following expression:

$$a^m = \underset{a^m}{\mathrm{argmin}} \sum_k \sum_d \left\| a^{m,d}_k - F^d_k\, vec(a^m) \right\|^2_2 + \left\| vec(a^m) - vec(X_m \eta) \right\|^2_2$$
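Because this objective is a sum of squared linear residuals, its minimizer has a closed form. A minimal sketch is shown below (the names are ours, and the filter operators $F^d_k$ are assumed to be built as in the earlier sketch):

    import numpy as np

    def hard_e_step(ann, Fs, prior_vec):
        # ann: dict (k, d) -> (T,) annotation track; Fs: dict (k, d) -> (T, T*D)
        # operator F_k^d; prior_vec: (T*D,) vectorized prior mean vec(X_m eta).
        # Minimizes sum ||a_k^{m,d} - F_k^d v||^2 + ||v - prior_vec||^2 over v.
        TD = prior_vec.shape[0]
        H = np.eye(TD)
        g = prior_vec.copy()
        for key, F in Fs.items():
            H += F.T @ F
            g += F.T @ ann[key]
        return np.linalg.solve(H, g)   # the mode of q(a_m), i.e. vec(a_m)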
M-step  Given the estimate of $a^m$ from the E-step, we substitute it into the likelihood function and maximize with respect to the parameters. The estimates for the different parameters are shown below.

$$\eta = \left( \sum_{m=1}^{M} X_m^T X_m \right)^{-1} \sum_{m=1}^{M} X_m^T a^m$$

$$f^d_k = \left( \sum_{m=1}^{M_k} A^T A \right)^{-1} \sum_{m=1}^{M_k} A^T a^{m,d}_k$$

$$\sigma^2 = \frac{1}{MTD} \sum_{m=1}^{M} \left\| vec(a^m) - vec(X_m \eta) \right\|^2_2$$

$$\sigma^2_k = \frac{1}{M_k TD} \sum_{m=1}^{M_k} \sum_d \left\| a^{m,d}_k - F^d_k\, vec(a^m) \right\|^2_2$$

$A$ is a matrix obtained by reshaping $vec(a^m)$.
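Concretely, the reshaping that produces $A$ can be read off from the banded structure of Equation 4.7: each column of $A$ is a lag-shifted copy of one ground truth dimension, so that $A f^d_k$ reproduces $F^d_k\, vec(a^m)$. A sketch (our naming; it assumes the taps in $f^d_k$ are ordered dimension-major):

    import numpy as np

    def design_matrix(a_hat, W):
        # a_hat: (T, D) current ground truth estimate; returns the (T, W*D)
        # matrix A whose columns are lagged copies of each dimension of a_hat.
        T, D = a_hat.shape
        cols = []
        for d in range(D):
            for w in range(W):
                col = np.zeros(T)
                col[w:] = a_hat[:T - w, d]
                cols.append(col)
        return np.stack(cols, axis=1)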
Termination  We run the algorithm until convergence, the criterion for which was chosen to be when the change in model log-likelihood reduces to less than 0.5% from the previous iteration.
4.3 Experiments and Results
We evaluate the models described above on three different types of data: synthetic data, an artificial task with human annotations, and finally real data. We describe these individually below. We compare our joint models with their independent counterparts, in which each annotation dimension is modeled separately. Update equations for the independent model can be obtained by running the models described above for each dimension separately with $D = 1$. Note that in the global setting the independent model is similar to the regression model proposed in [1] (with the ground truth scaled by the singleton $f^d_k$). In the time series setting it is identical to the model proposed by [18].

The models are evaluated by comparing the estimated $a^m$ with the actual ground truth. We report model performance using two metrics: the concordance correlation coefficient ($\rho_c$) [34] and Pearson's correlation coefficient ($\rho$). $\rho_c$ measures any departure from the concordance line (the line passing through the origin at a 45° angle), and hence it is sensitive to rotations or rescaling in the predicted ground truth. Given two samples $x$ and $y$, the sample concordance coefficient $\hat{\rho}_c$ is defined as shown below:

$$\hat{\rho}_c = \frac{2 s_{xy}}{s_x^2 + s_y^2 + (\bar{x} - \bar{y})^2} \quad (4.10)$$

We also report results in Pearson's correlation to highlight the accuracy of the models in the presence of rotations.
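Equation 4.10 translates directly into code; a one-function sketch:

    import numpy as np

    def concordance_cc(x, y):
        # Sample concordance correlation coefficient (Equation 4.10).
        sxy = np.mean((x - x.mean()) * (y - y.mean()))
        return 2 * sxy / (np.var(x) + np.var(y) + (x.mean() - y.mean()) ** 2)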
As noted before, the models proposed in this paper are closely related to the factor analysis model, which is vulnerable to issues of unidentifiability [35] due to the matrix factorization. Different types of unidentifiability have been studied in the literature, such as factor rotation, scaling and label switching. In our experiments, we handle label switching through manual judgment (by reassigning the estimated ground truth between dimensions if necessary), as is common in psychology [36], but defer the task of choosing an appropriate prior on the rotation matrix $F_k$ to address the other unidentifiabilities to future work.

We report aggregate test set results using C-fold cross validation. To address overfitting, within each fold we evaluate the parameters obtained after each iteration of the EM algorithm by estimating the ground truth on a disjoint validation set, and pick those with the highest performance in concordance correlation $\rho_c$ as the parameter estimates of the model. We then estimate the performance of this parameter set in predicting the ground truth on a separate held-out test set for that fold. Finally, we also report statistically significant differences between the joint and independent models at a 5% false-positive rate ($\alpha = 0.05$) in all our experiments.
4.3.1 Global annotation model
The global annotation model uses the EM algorithm described in Section 4.2.2 to estimate the ground truth for discrete annotations. We evaluate the model in the three settings described below. Statistical significance tests were run by computing bootstrap confidence intervals [37] on the differences in model performance across the C folds.
Synthetic data
We created synthetic data according to the model described in Section 4.2.2, with random features $x_m \in \mathbb{R}^{500}$ for 100 data points, each with 2 dimensions of annotations (i.e., $D = 2$). 10 artificial annotators, each with a unique random $F_k$ matrix, were used to produce annotations for all the files. Elements of the feature matrices were sampled from the standard normal distribution, while the elements of the $F_k$ matrices were sampled from $U(0, 1)$. Elements of the ground truth $a^m$ were sampled from $U(-1, 1)$, and $\eta$ was estimated from $a^m$ and $X$. Since its off-diagonal elements are non-zero, our choice of $F_k$ represents tasks in which the annotation dimensions are related to each other.

[Figure 4.2: Performance of the global annotation model on the synthetic dataset; (a) Concordance ($\rho_c$), (b) Pearson ($\rho$); bars compare the Joint and Independent models for Dim1 and Dim2; *-statistically significant]
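A minimal sketch of this generation process is shown below. The annotator noise level is our assumption (it is not specified above), and $\eta$ is not needed for generation since it was estimated from $a^m$ and $X$ in our experiments.

    import numpy as np

    rng = np.random.default_rng(0)
    M, P, D, K = 100, 500, 2, 10
    X = rng.standard_normal((M, P))             # features ~ N(0, 1)
    truth = rng.uniform(-1, 1, size=(M, D))     # ground truth a_m ~ U(-1, 1)
    F = rng.uniform(0, 1, size=(K, D, D))       # annotator matrices F_k ~ U(0, 1)
    noise_sd = 0.1                              # assumed; not stated in the text
    A = (np.einsum('kde,me->mkd', F, truth)
         + noise_sd * rng.standard_normal((M, K, D)))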
Figure 4.2 shows the performance of the joint and independent models in predicting the ground truth $a^m$. For both dimensions, the proposed joint model predicts $a^m$ with considerably higher accuracy, as shown by the higher correlations, highlighting the advantages of modeling the annotation dimensions jointly when they are expected to be related to each other.
Artificial data

Since crowdsourcing experiments typically involve collecting subjective annotations, they seldom have a well defined ground truth. As a result, most annotation models are evaluated against expert annotations collected from specially trained users. For example, while collecting annotations on medical data, the ground truth estimated by fusing annotations from naive users may be evaluated against reference labels provided by experts such as doctors. However, this poses a circular problem, since the expert annotations themselves may be subjective and combining them may not be straightforward. To address this, we created an artificial task with controlled ground truth, on which we collect annotations from multiple annotators and evaluate the fused annotation values against the known ground truth values, similar to [38]. In our task, the annotators were asked to provide their best estimates of the saturation and brightness values of monochromatic images. The relationship between perceived saturation and brightness is well known as the Helmholtz-Kohlrausch effect, according to which increasing the saturation of an image leads to an increase in perceived brightness, even if the actual brightness is constant [39].

[Figure 4.3: Performance of the global annotation model on the artificial dataset; Sat-Saturation, Bri-Brightness; (a) Concordance ($\rho_c$), (b) Pearson ($\rho$); *-statistically significant]
In our experiments, we collected annotations on images from two regimes: one with fixed saturation and varying brightness, and vice versa. This approach was chosen since it allows us to evaluate the impact of a change in either brightness or saturation while the other is held constant. The color of the images was chosen randomly (and independently of the image's saturation and brightness) between green and blue. Annotations were collected on Mturk, and the annotators were asked to familiarize themselves with saturation and brightness using an online interactive tool before providing their ratings. In both experiments, a reference image with fixed brightness and saturation was inserted after every ten annotation images to prevent any bias in the annotators. The reference images were hidden from the annotators and appeared as regular annotation images. For parameter estimation, RGB values were chosen as the features for each image.

We used the joint model to estimate the ground truth for the two regimes separately, since we expect the relationship between saturation and brightness to be dissimilar in the two cases. From each experiment, the predicted values of the underlying dimension being varied were compared with the actual $a^m$ values. For example, in the experiment with varying saturation and fixed brightness, the joint model was run on the full annotations, but only the estimated values of saturation were compared with the ground truth. For the independent model, we use annotation values of the underlying dimension being varied from each regime, and compare the estimated values with the ground truth.
Figure 4.3 shows the performance of the joint and independent models for this experiment. The joint model leads to better estimates of saturation when compared to the independent model by making use of the annotations on brightness. This agrees with the Helmholtz-Kohlrausch phenomenon described above, since the annotators can perceive the changing saturation as a change in brightness, leading to correlated annotations for the two dimensions. On the other hand, the independent model leads to better estimates of brightness, which seems to have no effect on the perceived saturation annotations. This experiment highlights the benefits of jointly modeling annotations in cases where the annotation dimensions may be correlated or dependent on each other.

[Figure 4.4: Performance of the global annotation model on the text emotions dataset (Anger, Disgust, Fear, Joy, Sadness, Surprise, Valence); (a) Concordance correlation ($\rho_c$), (b) Pearson correlation ($\rho$); bars compare the Joint and Independent models; *-statistically significant]
Real data
Our final experiment for the global model was on the task of annotating news headlines, in which the annotators provide numeric ratings for various emotions. This dataset was first described in the 2007 SemEval task on affective text [40]. Numeric ratings for the original task were labeled in house, and we treat these as expert annotations since the annotators were trained with examples. We use Mturk annotations from [9] as the actual input to our model, from which the ground truth estimates are computed. Sentence level annotations are provided on seven dimensions (D=7): anger, disgust, fear, joy, sadness, surprise and valence (positive/negative polarity). In our experiments, we use sentence level embeddings computed with the pre-trained sentence embedding model sent2vec (https://github.com/epfml/sent2vec) [41] as features.

Figure 4.4 shows the performance of the joint and independent models on this task. The joint model shows better performance in predicting the ground truth for anger, disgust, fear, joy and sadness, but performs worse than the independent model in predicting surprise and valence.
4.3.2 Time series annotation model
In this setting, the annotations are collected on data with a temporal dimension, such as time series data, video or audio signals. Similar to the global model, we evaluate this model in 3 settings: synthetic, artificial, and real data. The evaluation metrics $\rho_c$ and $\rho$ are computed over the estimated and actual ground truth vectors $a^m$ by concatenating the data points into a single vector. The time series models have the window size $W$ as an additional hyperparameter, which is selected using a validation set. In each fold of the dataset, we train model parameters for different window sizes from the set $\{5, 10, 20, 50\}$, and pick $W$ and the related parameters with the highest concordance correlation $\rho_c$ on the validation set. These are then evaluated on a disjoint test set, and we repeat the process for each fold. In each experiment, the parameters were initialized randomly, and the process was repeated 20 times with different random initializations, selecting the best starting point using the validation set. To identify significant differences, we compute the test set performance of the two models for each fold, and run the paired t-test between the two $C$-sized samples. We avoid computing bootstrap confidence intervals due to the smaller test set sizes.

[Figure 4.5: Concordance (a) and Pearson (b) correlation coefficients between ground truth and model predictions for the time series annotation model on the Synthetic (Dim1, Dim2), Artificial (Sat, Bri) and MovieEmotions (Aro, Val) datasets; bars compare the Joint model and the Independent model (Gupta et al. [18]); *-statistically significant]
Synthetic data
The synthetic dataset was created using the model described in Section 4.2.3. Elements of the feature matrix were sampled from the standard normal distribution, while elements of $F_k$ and the ground truth were sampled from $U(0, 1)$. In this setting, each data point includes $T$ feature vectors, one for each time stamp. The time dependent feature matrices were created using a random walk model without drift but with lag, to mimic a real world task. In other words, while creating the $P$-dimensional time series, the feature vectors were held fixed for a time period arbitrarily chosen to be between 2 and 4 time stamps. This was done because in most tasks the underlying dimension (such as emotion) is expected to remain fixed at least for a few seconds. In addition, the transitions between changes in the feature vectors were linear and not abrupt. In our experiments, we chose $P = 500$, $T = 350$, $D = 2$, $M = 18$ and the number of annotators $K = 6$.

Figure 4.5 shows the aggregate results across the C folds ($C = 5$) for the joint and independent models in the 3 settings. On the synthetic dataset, the joint model achieves higher values of Pearson's correlation $\rho$ for both dimensions, and a higher value of $\rho_c$ for dimension 1. For dimension 2, however, the independent model achieves a better $\rho_c$.
Artificial data

We collected annotations on videos with the artificial task of identifying saturation and brightness, described in the previous section. The videos consisted of monochromatic images with the underlying saturation and brightness varied independently of each other. The dimensions were created using a random walk model with lag. The annotations were collected in house using an annotation system developed with the Robot Operating System [42]. 10 graduate students gave their ratings on the two dimensions. Each dimension was annotated independently using a mouse controlled slider. For parameter estimation, the feature vectors for each time stamp were the RGB values.

As seen in Figure 4.5, both models achieve similar performance in predicting the ground truth for saturation and brightness in terms of $\rho$, as well as in predicting saturation in terms of $\rho_c$. The independent model achieves slightly better performance in predicting brightness in terms of concordance correlation (though not statistically significant); however, the performance in terms of $\rho$ suggests that the joint model output differs only by a linear scaling. The joint model appears to be at par with the independent model for the most part, suggesting that the transformation matrix $F_k$ connecting the two dimensions for each annotator is unable to accurately capture the dependencies between the dimensions. This is likely because, unlike in the global annotation model, the underlying brightness and saturation were varied simultaneously and independently of each other (leading to non-linear dependencies between them), and because we limit $F_k$ to capturing only linear relationships.
Real data
We finally evaluate our model on a real world task with time series annotations. We chose the task of predicting the emotion dimensions of valence and arousal from movie clips, first explained in [25]. The associated corpus includes time series annotations of the emotion dimensions on contiguous 30 minute video segments from 12 Academy Award winning movies. This task was chosen because the data set includes both expert annotations as well as annotations from naive users. We treat the expert annotations as reference and evaluate the estimated ground truth dimensions against them; however, we note that the expert labels were provided by just one annotator and may themselves be noisy.

For each movie clip, 6 annotators provided annotations of their perceived valence and arousal using the Feeltrace [43] annotation tool. The features used in our parameter estimation combine audio and video features extracted separately. The audio features were estimated using the emotion recognition baseline features from Opensmile [44] at 25 fps (the same frame rate as the video clips) and aggregated over a window size of 5 seconds using the following statistical functionals: mean, max, min, std, range, kurtosis, skewness and inter-quartile range. The video features were extracted using OpenCV [45] and included frame level luminance, intensity, Hue-Saturation-Value (HSV) color histograms and optical flow [46], which were also aggregated to 5 seconds using simple averaging. The combined features were of size $P = 1225$ for each frame.

[Figure 4.6: Effect of varying dependency between annotation dimensions for the synthetic model; panels show concordance (a) and Pearson (b) correlations for Dim1 and Dim2 as the off-diagonal weight grows from 0.1 to 1, for the Joint and Independent models]

Figure 4.5 shows the performance of the two models on this dataset. The joint model considerably outperforms the independent model in estimating arousal, while the independent model produces better estimates of valence from the annotations. The independent model performs poorly in arousal prediction but shows strong performance on valence, suggesting higher agreement between the annotators and the expert's opinions on valence. The joint model, however, shows a balanced performance, where the information from valence seems to help in predicting arousal.
4.3.3 Effect of dependency among dimensions

To evaluate the impact of the magnitude of the dependency between the annotation dimensions on the performance of the models, we created a set of synthetic annotations for the global model, similar to Section 4.3.1. We created 10 synthetic datasets, each with constant $F_k$ matrices across all annotators. The principal diagonal elements were fixed to 1, while the off-diagonal elements were increased from 0.1 to 1 with a step size of 0.1. Similar to the previous setting, we created 100 annotators, each operating on 10 files. Note that despite the annotators having identical $F_k$ matrices, their annotations on a given file differ because of the noise term $\epsilon_k$ in Equation 4.2.

Figure 4.6 shows the 5-fold cross validated performance of the joint and independent models on this task. As seen in the figure, the joint model consistently outperforms the independent model on both metrics. The two models start with similar performance when the off-diagonal elements are close to zero, since this implies no dependency between the annotation dimensions, and the performance of both models continues to degrade as the off-diagonal elements increase. However, the joint model makes better predictions of the ground truth by making use of the dependency between the dimensions, highlighting the benefits of modeling the annotation dimensions jointly. We also plot the averages of all the predicted $F_k$ matrices at the different step sizes (off-diagonal elements of the synthetic annotators) in Figure 4.7. In each case, the predicted $F_k$ matrices closely resemble the actual annotator matrices, highlighting the accuracy of the joint model. However, as we approach step size 1, the estimated $F_k$ matrices appear washed out, with all terms of the estimated $F_k$ close to 0.5 instead of 1 (Figure 4.7f). We attribute this to scaling unidentifiability introduced by the model during parameter estimation. Addressing this is an important part of our proposed future work.

[Figure 4.7: Average $F_k$ plots estimated from the joint model at different step sizes (0, 0.2, 0.4, 0.6, 0.8, 1) for the off-diagonal elements of the annotators' $F_k$ matrices; color scale from 0 to 1]
4.4 Conclusion
We presented a model to combine multidimensional annotations from crowdsourcing platforms such as Mturk. The model assumes the ground truth to be latent and distorted by the annotators. The latent ground truth and the model parameters are estimated using the EM algorithm. EM updates are derived for both the global and time series annotation settings. We evaluate the model on synthetic and real data. We also propose an artificial task with controlled ground truth and evaluate the model on it.

Weaknesses of the model include vulnerability to unidentifiability issues, like most variants of factor analysis [35]. Typical strategies to address this issue involve adopting a suitable prior constraint on the factor matrix. For example, in PCA, the factors are ordered such that they are orthogonal to each other and arranged in decreasing order of variance. In our experiments, the most severe form of unidentifiability observed was due to label switching, which we addressed using manual judgments. We defer the task of choosing an appropriate prior constraint on $F_k$ to future work.

Future work also includes generalizing the model with Bayesian extensions, in which case the parameters can be estimated using variational inference. Providing theoretical bounds on the model performance, especially with respect to the sample complexity, may be possible since we have assumed normal distributions throughout the model.
Chapter 5
Estimation of psycholinguistic
norms for sentences
5.1 Introduction
Psycholinguistic norms are numeric ratings assigned to linguistic cues such as words or sentences to measure various psychological constructs. Examples include dimensions such as valence, arousal, and dominance, which are used to analyze the affective state of the author. Other examples include norms of higher order mental constructs such as concreteness and imagability, which have been associated with improvements in learning [47]. The ease of computing the norms has enabled their application in a variety of natural language processing tasks such as information retrieval [48], sentiment analysis [49], text based personality prediction [50] and opinion mining. The norms are typically annotated at the word level by psychologists, who provide numeric scores for a curated list of seed words, which are then extrapolated to a larger vocabulary using either semantic relationships such as synonymy and hyponymy or word occurrence based contextual similarity.

Most NLP applications of psycholinguistic norms use sentence or document level scores, but manual annotation of the norms at the sentence level is difficult and not straightforward to generalize. In these cases, estimation of sentence level norms is done by aggregating the word level scores using simple averaging [51, 52], or by using distribution statistics of the word level scores [53]. However, such aggregation strategies may not be accurate at estimating the sentence level scores of the norms. In this work, we propose a new approach to estimate sentence level norms using the joint multidimensional model presented in Chapter 4 along with partial sentence level annotations.

Annotation of the normatives at the sentence level is a challenging task when compared to word level annotation, since it involves evaluating the underlying semantics of the sentence in the abstract space of the corresponding dimension, with some dimensions, such as dominance, being more difficult than others. Dominance is a measure of how dominant or submissive the object behind the word is. Being one of the three basic dimensions frequently used to describe emotional states (along with pleasure and arousal), dominance is commonly used in affective computing, but annotating this dimension at the sentence level is considerably difficult. For example, it is relatively easy to estimate the dominance score for the words happy or angry, but assigning a dominance score to the sentence I'm happy to know that the earthquakes are behind us would be considerably difficult, as it involves words with extreme values of dominance (happy and earthquake). On the other hand, some norms are easier to annotate at the sentence level (for example, valence). We use this fact, along with the joint parameters learned by the fusion model, to predict norms at the sentence level given partial annotations.

The joint multidimensional annotation fusion model presented in Section 4.2.2 assumes a matrix factorization scheme to capture annotator behaviors. The annotations are assumed to be obtained by left multiplying the ground truth vector $a^m$ with an annotator specific linear transformation matrix denoted $F_k$, which captures the individual contributions of the ground truth values of each dimension to the annotation output. In our work, we make use of this parameter $F_k$ to estimate sentence level normative scores. We start by training the joint global annotation model using the EM algorithm listed in Section 4.2.2 at the word level to estimate the annotator parameters, and use the word level estimates of $F_k$ on sentence level ratings from the same set of annotators. To make model predictions for a given dimension, we make use of partial annotations on the remaining dimensions along with $F_k$. Our proposed approach shows improved performance in predicting the sentence level norms when compared to various word level normative aggregation strategies.

The rest of the chapter is organized as follows. In Section 5.2, we briefly introduce the parameters of the joint multidimensional annotation fusion model, and we explain our data collection strategy in Section 5.3. We present our experiments and results in Sections 5.4 and 5.5 before concluding in Section 5.6.
[Figure 5.1: Plate diagram of the joint multidimensional annotation fusion model from Section 4.2.2. $F_k$ is estimated from word level annotations of psycholinguistic norms and is then used to predict norms at the sentence level]
5.2 Model
$$a^m = \eta^T x_m + \epsilon^m$$
$$a^m_k = F_k a^m + \epsilon_k \quad (5.1)$$

A plate notation diagram describing the joint multidimensional annotation model introduced in Section 4.2.2 is shown in Figure 5.1. In this model, each annotator is assumed to operate on the ground truth vector $a^m$ by left multiplying it with a matrix $F_k$, as shown in Equation 5.1. This matrix captures the relationship between all dimensions of the ground truth vector and those of the annotation vector $a^m_k$. The key idea used in our approach is that, for a given annotator, the relationships between the annotation dimensions are similar for both word and sentence level annotations. In other words, the matrix $F_k$ is assumed to be identical for word and sentence annotations of the norms. Given multidimensional ratings from a set of annotators at the word and sentence levels, we use the joint fusion model to estimate the annotator specific parameters $F_k$ at the word level, which are then used at the sentence level to estimate the psycholinguistic norms.
5.3 Data
We collected word level annotations on the affective norms of valence, arousal and dominance using Mturk, for words sampled from [54]. This corpus was chosen because it provides expert ratings on valence, arousal and dominance for nearly 14,000 English words. Annotators were asked to provide numeric ratings between 1 and 5 (inclusive) for each dimension, on assignments consisting of a set of 20 randomly sampled words. In total, we collected annotations on 200 words. The instructions for the annotation assignments included definitions along with examples for each of the dimensions being annotated. After filtering incomplete and noisy submissions, we retained only those annotators who provided ratings for at least 100 words in the subsequent sentence level annotation task, to ensure sufficient training data.

Sentence level annotations were collected on sentences from the Emobank corpus [55], which includes expert ratings on valence, arousal and dominance for 10,000 English sentences. 21 annotators from the word level annotation task were invited to provide labels for 100 sentences randomly sampled from this corpus. The assignments were presented in a similar fashion to the word level annotations, with each assignment including 10 sentences and the workers providing numeric ratings of valence, arousal and dominance for each sentence. We use the annotator specific parameters $F_k$ estimated at the word level to predict psycholinguistic norms for the sentences given partial annotations, using the approach described in the next section.
5.4 Experiments
Given the annotator parameters $F_k$ estimated at the word level, we use partial annotator ratings at the sentence level to predict the norms. For example, while predicting sentence level scores of valence, we use the sentence level annotator ratings on arousal and dominance along with the word level parameter matrix $F^{word}_k$, and we repeat the process for each dimension. The use of partial annotations enables us to predict sentence level norms on challenging psycholinguistic dimensions using ratings on dimensions which may be easier to annotate.

In our experiments, we make use of the IID Gaussian noise assumption in Equation 5.1, which reduces the task of predicting the sentence level norm to the linear regression problem shown in Equation 5.2. Rows of the matrix $F^{word}_k$ are treated as features of the regression model, with the vector $a^m$ as the regression parameter. Given the partial annotations $a^{m,\neg d}_k$ and the matrix $F^{word}_k$, the regression parameter vector $a^m$ can be estimated using the normal equations or gradient descent.
$$\begin{bmatrix} a^{m,\neg d}_1 \\ \vdots \\ a^{m,\neg d}_K \end{bmatrix} = \begin{bmatrix} F^{word}_1[\neg d, :] \\ \vdots \\ F^{word}_K[\neg d, :] \end{bmatrix} a^m \quad (5.2)$$

where $\neg d$ denotes all dimensions except $d$, the dimension to predict.
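As an illustration, Equation 5.2 can be solved per sentence with ordinary least squares; a sketch is shown below (our naming; it assumes the partial ratings are stacked annotator-major, with dimensions kept in their original order minus $d$):

    import numpy as np

    def predict_norm(F_word, partial, d):
        # F_word: (K, D, D) word-level annotator matrices; partial: (K, D-1)
        # sentence ratings on all dimensions except d. Solves Equation 5.2.
        rows = np.vstack([np.delete(Fk, d, axis=0) for Fk in F_word])
        a_m, *_ = np.linalg.lstsq(rows, partial.reshape(-1), rcond=None)
        return a_m[d]   # predicted sentence-level score on dimension d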
We compare the predicted norms with expert ratings from the Emobank corpus, which act as our reference to evaluate model performance. As baselines, we compute different aggregations of the word level normative scores after filtering out non-content words, as is common in the literature [51]. The aggregation functions we evaluated are: unweighted average, maximum, minimum and sum of the word level norms. We use the Concordance Correlation Coefficient (CCC; Equation 4.10) and Pearson's correlation as our evaluation metrics. We report our results in the next section.

Finally, we repeat the above experiments with three other dimensions: pleasantness, imagability and genderladenness. Since we do not have reference labels to evaluate the predictions from the different models on these dimensions, we simply report the training errors of several regression models when trained on the model predictions. In this case, low training errors imply higher learnability, which can act as a proxy for the quality of the predictions. For the regression models, we used support vector regression with $l_1$ and $l_2$ losses, as well as ridge regression. Model hyperparameters were tuned using 5-fold cross validation. For each dimension, the lowest mean squared error (MSE) across all the regression models explored is reported.

[Figure 5.2: Performance of the proposed and baseline models (word average, max, min, sum) in predicting sentence level norms for Val, Aro and Dom; (a) CCC, (b) Correlation]
5.5 Results
Figure 5.2 shows the performance of the proposed model and the different baselines. As seen in the figure, the proposed model outperforms the baselines in predicting valence and arousal on both evaluation metrics, suggesting the efficacy of the approach. Using partial ratings at the sentence level along with the matrix $F_k$, which captures the relationships between the dimensions, the proposed approach outperforms the baseline word aggregation schemes on these two dimensions. On the other hand, model performance on dominance appears considerably low on both metrics. To further investigate the reason for this, we created plots for each dimension comparing the best possible annotator in our dataset with the average rating across all annotators, as shown in Figure 5.3. Evidently, for dominance, we notice very low values on the two correlation metrics and a high MSE, suggesting high disagreement between our annotators and those of the Emobank corpus on this dimension. This may be due to differing definitions and/or interpretations of dominance between the two sets of annotators. Addressing this is likely to improve the performance of the proposed model in predicting dominance.

[Figure 5.3: Performance of the best annotator in our dataset and the annotator average for Val, Aro and Dom; (a) CCC, (b) Correlation, (c) MSE]
Figure 5.4 shows the training set MSE for our experiment on pleasantness, imagability and genderladenness. As seen in the figure, the proposed approach achieves the best performance in at least one dimension (imagability), warranting further exploration of these dimensions, perhaps by collecting expert ratings.

[Figure 5.4: Training error on the predicted labels for Pleasantness, Imagability and Gender Ladenness; Pred: predictions from the proposed model, Word-avg: average of word level norm scores, Ann-avg: annotator average]
5.6 Conclusion
We presented a novel approach to estimate sentence level psycholinguistic norms and showed improvements over standard baselines. We evaluated our approach on annotations of valence, arousal and dominance. Future work includes evaluating the model on other dimensions such as pleasantness. The primary challenge lies in obtaining expert ratings for these dimensions at the sentence level. Recently, alternate schemes have been proposed to evaluate models in the absence of a reliable ground truth or reference, such as the evaluation strategy used in the AVEC 2018 challenge [56]. The challenge organizers proposed a scheme in which annotation fusion models are evaluated by training and testing baseline regression models on the labels predicted by the fusion models, on disjoint sets. High performance on the test set suggests consistent learnability of the predicted labels and can act as a proxy for label quality. We aim to expand our annotation experiments to other psycholinguistic norms and use this strategy to evaluate our approach in future work.
Chapter 6
Conclusions and Future Work
In this dissertation, we presented our work on multidimensional annotation fusion for subjective ratings on affective dimensions. We presented two latent variable models, which used additive Gaussian noise and a matrix factorization model, respectively, to capture the annotators' distortion functions. We then applied the matrix factorization model to the task of predicting psycholinguistic norms at the sentence level and showed improved results compared to baseline models that aggregate word level scores. We also identified appropriate future work for some of the tasks described above. We now present our proposed future work on the task of computing agreement on multidimensional annotations.
6.1 Multidimensional annotation agreement
Computing agreement between annotators is an important step in most data collection projects, as it provides a measure of the reliability of the annotators. Agreement is usually measured using a single numeric score, with a high value suggesting high quality labels. However, most existing strategies to compute agreement are limited to univariate settings. In the case of multivariate annotations, the common practice is either to compute agreement for each dimension separately and report an array of agreement scores, which can be cumbersome, or to report a suitable aggregate such as the average or median agreement score, which discards useful information. To address this, we propose to develop a new metric which provides a single numeric score capturing the agreement between annotators in the multidimensional setting. Specifically, we propose to extend Cohen's $\kappa$ [57], one of the most frequently used metrics for computing agreement between annotators.
$$\kappa = \frac{P_o - P_e}{1 - P_e} \quad (6.1)$$

where $P_o$ is the observed agreement and $P_e$ is the expected agreement due to chance.

Equation 6.1 shows the formula for computing $\kappa$, which is given by the ratio of the observed agreement beyond chance to the best possible observable agreement (equal to 1) beyond chance. To extend this, we use a reformulation of $\kappa$ given by [58], shown in Equation 6.2. This variant of $\kappa$ is obtained by subtracting from 1 the ratio of the observed disagreement $\delta$ to the expected disagreement due to chance $\delta_e$.

$$\kappa = 1 - \frac{\delta}{\delta_e} \quad (6.2)$$

where

$$\delta = \frac{1}{m} \sum_{m} \Delta(a^m_1, a^m_2), \qquad \delta_e = \frac{1}{m^2} \sum_{m_1} \sum_{m_2} \Delta(a^{m_1}_1, a^{m_2}_2)$$

and $\Delta$ is a distance measure.
As seen in Equation 6.2, both $\delta$ and $\delta_e$ use a distance measure $\Delta$ to compute disagreement. In the formulation of [58], the Euclidean ($l_2$) distance is used to measure disagreement, but it is not clear whether this is the optimal choice. We propose to expand on this work by evaluating other distance measures, such as the $l_1$ (Equation 6.3) and $l_\infty$ (Equation 6.4) distances. Each of these has specific advantages over the $l_2$ distance. For example, the use of $l_1$ avoids over-penalizing differences in annotator scales in cases where the annotators operate on different internal ranges. Similarly, the use of the $l_\infty$ distance may lead to $\kappa$ distributions with lower entropy, which could lead to distinctly defined regions, a property often desirable in agreement metrics. Our proposed work includes exploring various distance measures for computing multidimensional agreement with the formula listed in Equation 6.2 and drawing comparisons between them.

[Figure 6.1: Comparison of $\kappa$ for the L1, L2 and L-infinity distance measures, as the standard deviation $\sigma$ of the additive noise grows from 0.1 to 10]

$$L_1 = \sum_d \left| a^m_{1,d} - a^m_{2,d} \right| \quad (6.3)$$

$$L_\infty = \max_d \left| a^m_{1,d} - a^m_{2,d} \right| \quad (6.4)$$
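A small sketch of the resulting metric with a pluggable distance is shown below (illustrative only; the function names are ours):

    import numpy as np

    def multidim_kappa(r1, r2, p=2):
        # r1, r2: (m, D) ratings from two annotators. Implements Equation 6.2
        # with Delta = the vector p-norm; pass p=1 for Equation 6.3, p=np.inf
        # for Equation 6.4, or the default p=2 for the Euclidean case.
        m = len(r1)
        dist = lambda u, v: np.linalg.norm(u - v, p)
        observed = np.mean([dist(r1[i], r2[i]) for i in range(m)])
        expected = np.mean([dist(r1[i], r2[j])
                            for i in range(m) for j in range(m)])
        return 1.0 - observed / expected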
Evaluating the quality of the agreement obtained from the different distance measures is challenging, since they often behave comparably. To illustrate this, we created two synthetic annotators who differ only by additive Gaussian noise, and plotted the estimated $\kappa$ for the different distance measures described above as we increase the standard deviation of the additive noise. As seen in Figure 6.1, in this annotation experiment all three distance measures lead to similar decreases in agreement, and it is unclear whether one is superior to another. To draw better comparisons, we would need more carefully designed annotation experiments which highlight the key differences between the distance measures, and we propose to explore this further in our future work.
Bibliography

[1] V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and L. Moy, "Learning from crowds," The Journal of Machine Learning Research, vol. 11, pp. 1297–1322, 2010.

[2] M. D. Munezero, C. S. Montero, E. Sutinen, and J. Pajunen, "Are they different? Affect, feeling, emotion, sentiment, and opinion detection in text," IEEE Transactions on Affective Computing, vol. 5, no. 2, pp. 101–111, 2014.

[3] K. S. Fleckenstein, "Defining affect in relation to cognition: A response to Susan McLeod," Journal of Advanced Composition, vol. 11, no. 2, pp. 447–453, 1991. [Online]. Available: http://www.jstor.org/stable/20865808

[4] U. Schimmack, S. Oishi, and E. Diener, "Cultural influences on the relation between pleasant emotions and unpleasant emotions: Asian dialectic philosophies or individualism-collectivism?" Cognition & Emotion, vol. 16, no. 6, pp. 705–719, 2002.

[5] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B (Methodological), pp. 1–38, 1977.

[6] J. Surowiecki, The Wisdom of Crowds. Anchor, 2005.

[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "Imagenet: A large-scale hierarchical image database," in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 248–255.

[8] C. Vondrick, D. Patterson, and D. Ramanan, "Efficiently scaling up crowdsourced video annotation," International Journal of Computer Vision, vol. 101, no. 1, pp. 184–204, 2013.

[9] R. Snow, B. O'Connor, D. Jurafsky, and A. Y. Ng, "Cheap and fast -- but is it good?: Evaluating non-expert annotations for natural language tasks," in Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2008, pp. 254–263.

[10] A. P. Dawid and A. M. Skene, "Maximum likelihood estimation of observer error-rates using the EM algorithm," Applied Statistics, pp. 20–28, 1979.

[11] P. Smyth, U. M. Fayyad, M. C. Burl, P. Perona, and P. Baldi, "Inferring ground truth from subjective labelling of Venus images," in Advances in Neural Information Processing Systems, 1995, pp. 1085–1092.

[12] K. Audhkhasi and S. Narayanan, "A globally-variant locally-constant model for fusion of labels from multiple diverse experts without using reference labels," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 35, no. 4, pp. 769–783, 2013.

[13] A. Metallinou and S. S. Narayanan, "Annotation and processing of continuous emotional attributes: Challenges and opportunities," in 2nd International Workshop on Emotion Representation, Analysis and Synthesis in Continuous Time and Space (EmoSPACE 2013), Apr. 2013.

[14] M. A. Nicolaou, V. Pavlovic, and M. Pantic, "Dynamic probabilistic CCA for analysis of affective behaviour," in European Conference on Computer Vision. Springer, 2012, pp. 98–111.

[15] F. R. Bach and M. I. Jordan, "A probabilistic interpretation of canonical correlation analysis," University of California, Berkeley, Tech. Rep., 2005.

[16] M. A. Nicolaou, V. Pavlovic, and M. Pantic, "Dynamic probabilistic CCA for analysis of affective behavior and fusion of continuous annotations," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 7, pp. 1299–1311, 2014.

[17] S. Mariooryad and C. Busso, "Correcting time-continuous emotional labels by modeling the reaction lag of evaluators," IEEE Transactions on Affective Computing, vol. 6, no. 2, pp. 97–108, 2015.

[18] R. Gupta, K. Audhkhasi, Z. Jacokes, A. Rozga, and S. Narayanan, "Modeling multiple time series annotations based on ground truth inference and distortion," IEEE Transactions on Affective Computing, 2016.

[19] Y. E. Kara, G. Genc, O. Aran, and L. Akarun, "Modeling annotator behaviors for crowd labeling," Neurocomputing, vol. 160, pp. 141–156, 2015.

[20] N. B. Shah, D. Zhou, and Y. Peres, "Approval voting and incentives in crowdsourcing," arXiv preprint arXiv:1502.05696, 2015.

[21] V. S. Sheng, F. Provost, and P. G. Ipeirotis, "Get another label? Improving data quality and data mining using multiple, noisy labelers," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2008, pp. 614–622.

[22] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, "IEMOCAP: Interactive emotional dyadic motion capture database," Language Resources and Evaluation, vol. 42, no. 4, p. 335, 2008.

[23] G. McKeown, M. Valstar, R. Cowie, M. Pantic, and M. Schroder, "The SEMAINE database: Annotated multimodal records of emotionally colored conversations between a person and a limited agent," IEEE Transactions on Affective Computing, vol. 3, no. 1, pp. 5–17, 2012.

[24] F. Ringeval, A. Sonderegger, J. Sauer, and D. Lalanne, "Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions," in Automatic Face and Gesture Recognition (FG), 10th IEEE International Conference and Workshops on. IEEE, 2013, pp. 1–8.

[25] N. Malandrakis, A. Potamianos, G. Evangelopoulos, and A. Zlatintsi, "A supervised approach to movie emotion tracking," in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on. IEEE, 2011, pp. 2376–2379.

[26] M. Franzini, K.-F. Lee, and A. Waibel, "Connectionist Viterbi training: A new hybrid method for continuous speech recognition," in Acoustics, Speech, and Signal Processing, 1990. ICASSP-90., 1990 International Conference on. IEEE, 1990, pp. 425–428.

[27] J. A. Hartigan and M. A. Wong, "Algorithm AS 136: A k-means clustering algorithm," Journal of the Royal Statistical Society, Series C (Applied Statistics), vol. 28, no. 1, pp. 100–108, 1979.

[28] S. Narayanan and P. G. Georgiou, "Behavioral signal processing: Deriving human behavioral informatics from speech and language," Proceedings of the IEEE, vol. 101, no. 5, pp. 1203–1233, 2013.

[29] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.

[30] R. B. Grossman, L. R. Edelson, and H. Tager-Flusberg, "Emotional facial and vocal expressions during story retelling by children and adolescents with high-functioning autism," Journal of Speech, Language, and Hearing Research, vol. 56, no. 3, pp. 1035–1044, 2013.

[31] B. Schuller, S. Steidl, and A. Batliner, "The INTERSPEECH 2009 emotion challenge," in INTERSPEECH, vol. 2009. Citeseer, 2009, pp. 312–315.

[32] R. Gupta, C.-C. Lee, and S. S. Narayanan, "Classification of emotional content of sighs in dyadic human interactions," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Mar. 2012.

[33] V. I. Spitkovsky, H. Alshawi, D. Jurafsky, and C. D. Manning, "Viterbi training improves unsupervised dependency parsing," in Proceedings of the Fourteenth Conference on Computational Natural Language Learning. Association for Computational Linguistics, 2010, pp. 9–17.

[34] I. Lawrence and K. Lin, "A concordance correlation coefficient to evaluate reproducibility," Biometrics, pp. 255–268, 1989.

[35] L. R. Fabrigar, D. T. Wegener, R. C. MacCallum, and E. J. Strahan, "Evaluating the use of exploratory factor analysis in psychological research," Psychological Methods, vol. 4, no. 3, p. 272, 1999.

[36] H. F. Kaiser, "The varimax criterion for analytic rotation in factor analysis," Psychometrika, vol. 23, no. 3, pp. 187–200, 1958.

[37] B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap. CRC Press, 1994.

[38] B. M. Booth, K. Mundnich, and S. S. Narayanan, "A novel method for human bias correction of continuous-time annotations," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 3091–3095.

[39] D. Corney, J.-D. Haynes, G. Rees, and R. B. Lotto, "The brightness of colour," PLoS ONE, vol. 4, no. 3, p. e5091, 2009.

[40] C. Strapparava and R. Mihalcea, "SemEval-2007 task 14: Affective text," in Proceedings of the 4th International Workshop on Semantic Evaluations, ser. SemEval '07. Stroudsburg, PA, USA: Association for Computational Linguistics, 2007, pp. 70–74. [Online]. Available: http://dl.acm.org/citation.cfm?id=1621474.1621487

[41] M. Pagliardini, P. Gupta, and M. Jaggi, "Unsupervised learning of sentence embeddings using compositional n-gram features," CoRR, vol. abs/1703.02507, 2017. [Online]. Available: http://arxiv.org/abs/1703.02507

[42] M. Quigley, B. Gerkey, K. Conley, J. Faust, T. Foote, J. Leibs, E. Berger, R. Wheeler, and A. Ng, "ROS: An open-source Robot Operating System," in Proc. of the IEEE Intl. Conf. on Robotics and Automation (ICRA) Workshop on Open Source Robotics, Kobe, Japan, May 2009.

[43] R. Cowie, E. Douglas-Cowie, S. Savvidou, E. McMahon, M. Sawey, and M. Schröder, "FEELtrace: An instrument for recording perceived emotion in real time," in ISCA Tutorial and Research Workshop (ITRW) on Speech and Emotion, 2000.

[44] F. Eyben, M. Wöllmer, and B. Schuller, "Opensmile: The Munich versatile and fast open-source audio feature extractor," in Proceedings of the 18th ACM International Conference on Multimedia, ser. MM '10. New York, NY, USA: ACM, 2010, pp. 1459–1462. [Online]. Available: http://doi.acm.org/10.1145/1873951.1874246

[45] G. Bradski, "The OpenCV Library," Dr. Dobb's Journal of Software Tools, 2000.

[46] L. S. Chen, T. S. Huang, T. Miyasato, and R. Nakatsu, "Multimodal human emotion/expression recognition," in Automatic Face and Gesture Recognition, 1998. Proceedings. Third IEEE International Conference on. IEEE, 1998, pp. 366–371.

[47] A. Paivio, J. C. Yuille, and S. A. Madigan, "Concreteness, imagery, and meaningfulness values for 925 nouns," Journal of Experimental Psychology, vol. 76, no. 1p2, p. 1, 1968.

[48] S. Tanaka, A. Jatowt, M. P. Kato, and K. Tanaka, "Estimating content concreteness for finding comprehensible documents," in Proceedings of the Sixth ACM International Conference on Web Search and Data Mining. ACM, 2013, pp. 475–484.

[49] F. Å. Nielsen, "A new ANEW: Evaluation of a word list for sentiment analysis in microblogs," arXiv preprint arXiv:1103.2903, 2011.

[50] F. Mairesse, M. A. Walker, M. R. Mehl, and R. K. Moore, "Using linguistic cues for the automatic recognition of personality in conversation and text," Journal of Artificial Intelligence Research, vol. 30, pp. 457–500, 2007.

[51] A. Ramakrishna, V. R. Martínez, N. Malandrakis, K. Singla, and S. Narayanan, "Linguistic analysis of differences in portrayal of movie characters," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 2017, pp. 1669–1678.

[52] N. Malandrakis and S. S. Narayanan, "Therapy language analysis using automatically generated psycholinguistic norms," in INTERSPEECH, 2015, pp. 1952–1956.

[53] J. Gibson, N. Malandrakis, F. Romero, D. C. Atkins, and S. S. Narayanan, "Predicting therapist empathy in motivational interviews using language features inspired by psycholinguistic norms," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.

[54] A. B. Warriner, V. Kuperman, and M. Brysbaert, "Norms of valence, arousal, and dominance for 13,915 English lemmas," Behavior Research Methods, vol. 45, no. 4, pp. 1191–1207, 2013.

[55] S. Buechel and U. Hahn, "Emobank: Studying the impact of annotation perspective and representation format on dimensional emotion analysis," in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, 2017, pp. 578–585.

[56] F. Ringeval, B. Schuller, M. Valstar, R. Cowie, H. Kaya, M. Schmitt, S. Amiriparian, N. Cummins, D. Lalanne, A. Michaud et al., "AVEC 2018 workshop and challenge: Bipolar disorder and cross-cultural affect recognition," in Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop. ACM, 2018, pp. 3–13.

[57] J. Cohen, "A coefficient of agreement for nominal scales," Educational and Psychological Measurement, vol. 20, no. 1, pp. 37–46, 1960.

[58] K. J. Berry and P. W. Mielke Jr, "A generalization of Cohen's kappa agreement measure to interval measurement and multiple raters," Educational and Psychological Measurement, vol. 48, no. 4, pp. 921–933, 1988.
Appendix A

Derivations for the matrix factorization model

A.1 EM update equations for the global annotation model

A.1.1 Components of the joint distribution $p(a^m_1, \ldots, a^m_K, a^m)$

To help with the model formulation, we first derive the parameters of the joint distribution $p(a^m_1, \ldots, a^m_K, a^m)$. Since the product of two normal distributions is also normal [29], this joint distribution is also normal and is given by:
$$\begin{bmatrix} a^m \\ a^m_1 \\ \vdots \\ a^m_K \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \eta^T x_m \\ F_1 \eta^T x_m \\ \vdots \\ F_K \eta^T x_m \end{bmatrix}, \begin{bmatrix} \Sigma & \Sigma_1^T & \ldots & \Sigma_K^T \\ \Sigma_1 & \Sigma_{11} & \ldots & \Sigma_{1K} \\ \vdots & \vdots & \ddots & \vdots \\ \Sigma_K & \Sigma_{K1} & \ldots & \Sigma_{KK} \end{bmatrix} \right) \quad (A.1)$$
The different components of the covariance matrix in Equation A.1 are derived below.

$$\Sigma = \mathrm{Cov}(a^m) = \sigma^2 I$$

$$\begin{aligned}
\Sigma_k &= E[a^m_k (a^m)^T] - E[a^m_k]\, E[(a^m)^T] \\
&= E[(F_k a^m + \epsilon_k)(a^m)^T] - E[F_k a^m + \epsilon_k]\, E[(a^m)^T] \\
&= F_k (\sigma^2 I)
\end{aligned}$$

$$\begin{aligned}
\Sigma_{kk} &= \mathrm{Cov}(F_k a^m + \epsilon_k) \\
&= \mathrm{Cov}(F_k a^m) + \sigma^2_k I \\
&= F_k \Sigma F_k^T + \sigma^2_k I \\
&= \sigma^2 F_k F_k^T + \sigma^2_k I
\end{aligned}$$

$$\begin{aligned}
\Sigma_{k_1 k_2} &= E_{a^m}[\mathrm{Cov}(a^m_{k_1}, a^m_{k_2} \mid a^m)] + \mathrm{Cov}(E[a^m_{k_1} \mid a^m], E[a^m_{k_2} \mid a^m]) \\
&= \mathrm{Cov}(E[a^m_{k_1} \mid a^m], E[a^m_{k_2} \mid a^m]) \\
&= \mathrm{Cov}(F_{k_1} a^m, F_{k_2} a^m) \\
&= F_{k_1} \Sigma (F_{k_2})^T \\
&= \sigma^2 F_{k_1} F_{k_2}^T
\end{aligned}$$
In the derivation of $\Sigma_{k_1 k_2}$, the first equality is a direct application of the law of total covariance, and the second follows from the conditional independence of the annotation values $a^m_{k_i}$ given the ground truth $a^m$.

Finally, owing to the jointly normal distribution, $p(a^m \mid a^m_1, \ldots, a^m_K)$ is also normal:

$$p(a^m \mid a^m_1, \ldots, a^m_K) \sim \mathcal{N}(\mu_{a^m \mid a^m_1 \ldots a^m_K}, \Sigma_{a^m \mid a^m_1 \ldots a^m_K})$$

Also, by the definition of conditional normal distributions, given a normal vector of the form

$$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} \right)$$

the conditional distribution $p(x_1 \mid x_2) \sim \mathcal{N}(\mu_{x_1 \mid x_2}, \Sigma_{x_1 \mid x_2})$ has the following form:

$$\mu_{x_1 \mid x_2} = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2 - \mu_2) \quad (A.2)$$

$$\Sigma_{x_1 \mid x_2} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \quad (A.3)$$
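As a quick numerical instantiation of Equations A.2 and A.3 (synthetic numbers, purely illustrative):

    import numpy as np

    rng = np.random.default_rng(1)
    L = rng.standard_normal((4, 4))
    S = L @ L.T + np.eye(4)              # a valid joint covariance
    mu = rng.standard_normal(4)
    i1, i2 = slice(0, 2), slice(2, 4)    # x_1 = first two, x_2 = last two entries
    x2 = rng.standard_normal(2)          # an observed value of x_2
    gain = S[i1, i2] @ np.linalg.inv(S[i2, i2])
    mu_cond = mu[i1] + gain @ (x2 - mu[i2])   # Equation A.2
    S_cond = S[i1, i1] - gain @ S[i2, i1]     # Equation A.3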
A.1.2 Model formulation

We begin by introducing a new distribution $q(a^m)$ in Equation 4.5. We drop the parameters from the likelihood function expansion for convenience.

$$\log \mathcal{L} = \sum_{m=1}^{M} \log \int_{a^m} q(a^m)\, \frac{p(a^m_1, \ldots, a^m_K \mid a^m)\, p(a^m)}{q(a^m)}\, da^m \quad (A.4)$$

Using Jensen's inequality on the log of an expectation, we can bound the above as follows:

$$\log \mathcal{L} \geq \sum_{m=1}^{M} \int_{a^m} q(a^m) \log \frac{p(a^m_1, \ldots, a^m_K \mid a^m)\, p(a^m)}{q(a^m)}\, da^m \quad (A.5)$$
The bound above becomes tight when the expectation is taken over a constant value, i.e.,

$$\frac{p(a^m_1, \ldots, a^m_K \mid a^m)\, p(a^m)}{q(a^m)} = c$$

Solving for the constant $c$, we have

$$q(a^m) = \frac{p(a^m_1, \ldots, a^m_K, a^m)}{p(a^m_1, \ldots, a^m_K)} = p(a^m \mid a^m_1, \ldots, a^m_K)$$
E-Step

The E-step involves simply assuming $q(a^{*m})$ to follow the conditional distribution $p(a^{*m} \mid a^m_1, \ldots, a^m_K)$.

To help with future computations, we also compute the following expectations. The first two are a result of Equations A.2 and A.3, the third follows from the definition of covariance, and the last one is a standard result.
\[
E_{a^{*m} \mid a^m_1 \ldots a^m_K}[a^{*m}] = \Theta^T x^m + \Sigma_{a^{*m},\, a^m_1 \ldots a^m_K} \left(\Sigma_{a^m_1 \ldots a^m_K,\, a^m_1 \ldots a^m_K}\right)^{-1} (a^m - \mu^m)
\]
\[
\Sigma_{a^{*m} \mid a^m_1 \ldots a^m_K} = \Sigma_{a^{*m},\, a^{*m}} - \Sigma_{a^{*m},\, a^m_1 \ldots a^m_K} \left(\Sigma_{a^m_1 \ldots a^m_K,\, a^m_1 \ldots a^m_K}\right)^{-1} \Sigma_{a^m_1 \ldots a^m_K,\, a^{*m}}
\]
\[
E_{a^{*m} \mid a^m_1 \ldots a^m_K}\left[a^{*m} (a^{*m})^T\right] = \Sigma_{a^{*m} \mid a^m_1 \ldots a^m_K} + E_{a^{*m} \mid a^m_1 \ldots a^m_K}[a^{*m}]\; E_{a^{*m} \mid a^m_1 \ldots a^m_K}[(a^{*m})^T]
\]
\[
E_{a^{*m} \mid a^m_1 \ldots a^m_K}\left[(a^{*m})^T a^{*m}\right] = \mathrm{trace}\left(\Sigma_{a^{*m} \mid a^m_1 \ldots a^m_K}\right) + E_{a^{*m} \mid a^m_1 \ldots a^m_K}[(a^{*m})^T]\; E_{a^{*m} \mid a^m_1 \ldots a^m_K}[a^{*m}]
\]
Here, $a^m$ and $\mu^m$ are $DK$-dimensional vectors obtained by concatenating the $K$ annotation vectors $a^m_1, \ldots, a^m_K$ and their corresponding expected values $F_1 \Theta^T x^m, \ldots, F_K \Theta^T x^m$.
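Given the posterior mean and covariance, for instance from the posterior_ground_truth sketch above, these moments are two lines of NumPy; the variable names are assumptions carried over from that sketch:

```python
# mu_post: (D,), Sigma_post: (D, D) from the E-step
E_outer = Sigma_post + np.outer(mu_post, mu_post)    # E[a*^m (a*^m)^T]
E_inner = np.trace(Sigma_post) + mu_post @ mu_post   # E[(a*^m)^T a*^m]
```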
M-step

In the M-step, we find the parameters of the model by maximizing Equation A.5. We first write this equation as an expectation and an equality. The expectation below is with respect to $q(a^{*m}) = p(a^{*m} \mid a^m_1, \ldots, a^m_K)$; we drop the subscript for ease of exposition.
\[
\log L = \sum_{m=1}^{M} E_{a^{*m} \mid a^m_1 \ldots a^m_K} \left[ \log \frac{p(a^m_1, \ldots, a^m_K \mid a^{*m})\, p(a^{*m})}{q(a^{*m})} \right]
\]
\[
\log L = \sum_{m=1}^{M} \left\{ E \log p(a^m_1, \ldots, a^m_K \mid a^{*m}) + E \log p(a^{*m}) + H \right\}
\]
\[
\log L = \sum_{m=1}^{M} \left\{ \sum_{k=1}^{K} E \log p(a^m_k \mid a^{*m}) + E \log p(a^{*m}) + H \right\} \tag{A.6}
\]
where $p(a^{*m})$ and $p(a^m_k \mid a^{*m})$ are given by Equations 4.3 and 4.4, respectively. The last equation above uses the fact that we assume independence among annotators given the ground truth; expectation also commutes with the linear sum over the $K$ terms.

Here, $H$ is the entropy of $p(a^{*m} \mid a^m_1, \ldots, a^m_K)$. We maximize Equation A.6 with respect to each of the parameters to obtain the M-step updates.
Estimating $F_k$. Differentiating Equation A.6 with respect to $F_k$ and equating the derivative to 0,
\[
\nabla_{F_k} Q = 0
\]
\[
\nabla_{F_k} \sum_{m=1}^{M_k} E\left[ (a^m_k - F_k a^{*m})^T (\sigma_k^2 I)^{-1} (a^m_k - F_k a^{*m}) \right] = 0
\]
\[
\nabla_{F_k} \frac{1}{\sigma_k^2} \sum_{m=1}^{M_k} E\left[ (a^m_k - F_k a^{*m})^T (a^m_k - F_k a^{*m}) \right] = 0
\]
\[
\sum_{m=1}^{M_k} -2\, a^m_k\, E[(a^{*m})^T] + 2\, F_k\, E[a^{*m} (a^{*m})^T] = 0
\]
\[
\Rightarrow F_k = \left( \sum_{m=1}^{M_k} a^m_k\, E[(a^{*m})^T] \right) \left( \sum_{m=1}^{M_k} E[a^{*m} (a^{*m})^T] \right)^{-1}
\]
where $M_k$ is the number of points annotated by user $k$.

We used the following facts in the above derivation: $\mathrm{trace}(x) = x$ for scalar $x$; $\mathrm{trace}(AB) = \mathrm{trace}(BA)$; $\nabla_A\, \mathrm{trace}(A^T x) = x$ and $\nabla_A\, \mathrm{trace}(A^T A B) = AB + AB^T$ for a matrix $A$. We also use the fact that expectation and the trace of a matrix commute, since the trace is a linear sum.
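A minimal NumPy sketch of this update, assuming the E-step has produced, for each instance annotated by $k$, the posterior mean and second moment (the names mu_post and E_outer are illustrative, following the earlier snippets):

```python
import numpy as np

def update_F_k(A_k, mu_post, E_outer):
    """Closed-form M-step for annotator k's distortion matrix F_k.

    A_k     : (M_k, D) annotations by annotator k
    mu_post : (M_k, D) posterior means E[a*^m]
    E_outer : (M_k, D, D) posterior second moments E[a*^m (a*^m)^T]
    """
    num = sum(np.outer(a, mu) for a, mu in zip(A_k, mu_post))  # sum_m a_k^m E[(a*^m)^T]
    den = E_outer.sum(axis=0)                                  # sum_m E[a*^m (a*^m)^T]
    return num @ np.linalg.inv(den)
```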
Estimating $\Theta$. Similarly, to find $\Theta$, we differentiate Equation A.6 with respect to $\Theta$ and equate it to 0.
\[
\nabla_{\Theta} Q = 0
\]
\[
\nabla_{\Theta} \sum_{m=1}^{M} E\left[ (a^{*m} - \Theta^T x_m)^T (\sigma^2 I)^{-1} (a^{*m} - \Theta^T x_m) \right] = 0
\]
\[
\nabla_{\Theta} \frac{1}{\sigma^2} \sum_{m=1}^{M} E\left[ (a^{*m} - \Theta^T x_m)^T (a^{*m} - \Theta^T x_m) \right] = 0
\]
\[
\sum_{m=1}^{M} -2\, x_m\, E[(a^{*m})^T] + 2\, x_m x_m^T \Theta = 0
\]
\[
\Theta = \left( \sum_{m=1}^{M} x_m x_m^T \right)^{-1} \sum_{m=1}^{M} x_m\, E[(a^{*m})^T]
\]
\[
\Rightarrow \Theta = (X^T X)^{-1} (X^T E[a^*])
\]
which looks like the familiar normal equation, except that we use the expected value of $a^*$. Here, $X$ is the matrix of features of the $M$ data points; it includes the individual feature vectors $x_m$ in its rows.
Estimating $\sigma$. Differentiating Equation A.6 with respect to $\sigma$ and equating to 0, we have
\[
\nabla_{\sigma} Q = 0
\]
\[
\nabla_{\sigma} \sum_{m=1}^{M} \left\{ -D \log \sigma - \frac{1}{2\sigma^2} \left( E[(a^{*m})^T a^{*m}] - 2\, \mathrm{tr}(\Theta^T x_m E[(a^{*m})^T]) + \mathrm{tr}(x_m^T \Theta \Theta^T x_m) \right) \right\} = 0
\]
\[
\sum_{m=1}^{M} \left\{ -\frac{D}{\sigma} + \frac{1}{\sigma^3} \left( E[(a^{*m})^T a^{*m}] - 2\, \mathrm{tr}(\Theta^T x_m E[(a^{*m})^T]) + \mathrm{tr}(x_m^T \Theta \Theta^T x_m) \right) \right\} = 0
\]
\[
MD\sigma^2 = \sum_{m=1}^{M} \left\{ E[(a^{*m})^T a^{*m}] - 2\, \mathrm{tr}\left(\Theta^T x_m E[(a^{*m})^T]\right) + \mathrm{tr}(x_m^T \Theta \Theta^T x_m) \right\}
\]
\[
\Rightarrow \sigma^2 = \frac{1}{MD} \sum_{m=1}^{M} \left\{ E[(a^{*m})^T a^{*m}] - 2\, \mathrm{tr}\left(\Theta^T x_m E[(a^{*m})^T]\right) + \mathrm{tr}(x_m^T \Theta \Theta^T x_m) \right\}
\]
Estimating $\sigma_k$. Differentiating Equation A.6 with respect to $\sigma_k$ and equating to 0, we have
\[
\nabla_{\sigma_k} Q = 0
\]
\[
\nabla_{\sigma_k} \sum_{m=1}^{M_k} \left\{ -D \log \sigma_k - \frac{1}{2\sigma_k^2} \left( (a^m_k)^T a^m_k - 2\, \mathrm{tr}(F_k^T a^m_k E[(a^{*m})^T]) + \mathrm{tr}(F_k^T F_k E[a^{*m} (a^{*m})^T]) \right) \right\} = 0
\]
\[
\sum_{m=1}^{M_k} \left\{ -\frac{D}{\sigma_k} + \frac{1}{\sigma_k^3} \left( (a^m_k)^T a^m_k - 2\, \mathrm{tr}(F_k^T a^m_k E[(a^{*m})^T]) + \mathrm{tr}(F_k^T F_k E[a^{*m} (a^{*m})^T]) \right) \right\} = 0
\]
\[
\Rightarrow \sigma_k^2 = \frac{1}{D M_k} \sum_{m=1}^{M_k} \left\{ (a^m_k)^T a^m_k - 2\, \mathrm{tr}\left(F_k^T a^m_k E[(a^{*m})^T]\right) + \mathrm{tr}\left(F_k^T F_k E[a^{*m} (a^{*m})^T]\right) \right\}
\]
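Both variance updates are averages of expected squared residuals. A sketch of the $\sigma_k^2$ update under the same assumed names as the earlier snippets (the $\sigma^2$ update is analogous, with $\Theta^T x_m$ in place of $F_k a^{*m}$):

```python
import numpy as np

def update_sigma2_k(A_k, F_k, mu_post, E_outer):
    """M-step for annotator k's noise variance sigma_k^2."""
    M_k, D = A_k.shape
    total = 0.0
    for a, mu, S in zip(A_k, mu_post, E_outer):
        total += a @ a \
                 - 2 * np.trace(F_k.T @ np.outer(a, mu)) \
                 + np.trace(F_k.T @ F_k @ S)
    return total / (D * M_k)
```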
A.2 EM update equations for time series annotation model

A.2.1 Model formulation

Similar to the process described in Appendix A.1, the log-likelihood function for the time series model is bounded as shown below (cf. Equation A.5).
\[
\log L \geq \sum_{m=1}^{M} \int_{a^{*m}} q(a^{*m}) \log \frac{p(a^m_1, \ldots, a^m_K \mid a^{*m})\, p(a^{*m})}{q(a^{*m})}\, da^{*m} \tag{A.7}
\]
The bound becomes tight when $q(a^{*m}) = p(a^{*m} \mid a^m_1, \ldots, a^m_K)$.
E-step

Computing the expectation over the entire distribution $q(a^{*m})$ is computationally expensive since $a^{*m}$ is a matrix. To avoid this, we instead use hard EM, in which we assume a Dirac delta distribution for $a^{*m}$ centered at the mode of $q(a^{*m})$. This is a common practice in latent variable models and is the approach followed by [18] in estimating the annotator filter parameters. We assign this value to $\hat{a}^{*m}$ in the E-step:
\[
\begin{aligned}
\hat{a}^{*m} &= \operatorname*{argmax}_{a^{*m}}\; q(a^{*m}) \\
&= \operatorname*{argmax}_{a^{*m}}\; p(a^{*m} \mid a^m_1, \ldots, a^m_K) \\
&= \operatorname*{argmax}_{a^{*m}}\; \frac{p(a^{*m}, a^m_1, \ldots, a^m_K)}{p(a^m_1, \ldots, a^m_K)} \\
&= \operatorname*{argmax}_{a^{*m}}\; p(a^m_1, \ldots, a^m_K \mid a^{*m})\, p(a^{*m}) \\
&= \operatorname*{argmax}_{a^{*m}}\; \log\left[ p(a^m_1, \ldots, a^m_K \mid a^{*m})\, p(a^{*m}) \right] \\
\Rightarrow \hat{a}^{*m} &= \operatorname*{argmax}_{a^{*m}}\; \log p(a^m_1, \ldots, a^m_K \mid a^{*m}) + \log p(a^{*m} \mid x^m)
\end{aligned}
\]
Since we assume that each annotator is independent of the others given the ground truth, we have
\[
\hat{a}^{*m} = \operatorname*{argmax}_{a^{*m}}\; \log \prod_k p(a^m_k \mid a^{*m}) + \log p(a^{*m})
\]
\[
\hat{a}^{*m} = \operatorname*{argmax}_{a^{*m}}\; \sum_k \log p(a^m_k \mid a^{*m}) + \log p(a^{*m})
\]
Further, since each annotation dimension $a^{m,d}_k$ is assumed to be independent given $a^{*m}$, we have
\[
\hat{a}^{*m} = \operatorname*{argmax}_{a^{*m}}\; \sum_k \sum_d \log p(a^{m,d}_k \mid a^{*m}) + \log p(a^{*m})
\]
Finally, since both $a^{m,d}_k$ and $a^{*m}$ are defined using iid Gaussian noise, the above maximization problem is equivalent to the following minimization.
\[
\hat{a}^{*m} = \operatorname*{argmin}_{a^{*m}}\; \sum_k \sum_d \left\| a^{m,d}_k - F^d_k\, \mathrm{vec}(a^{*m}) \right\|_2^2 + \left\| \mathrm{vec}(a^{*m}) - \mathrm{vec}(X_m \Theta) \right\|_2^2
\]
For convenience, we reshape $a^{*m}$ into a vector and optimize with respect to the flattened vector. If we choose $\mathrm{vec}(a^{*m}) = v$ and $\mathrm{vec}(X_m \Theta) = y$, the objective becomes
\[
Q(v) = \sum_k \sum_d \left\| a^{m,d}_k - F^d_k v \right\|_2^2 + \left\| v - y \right\|_2^2
\]
Differentiating $Q$ with respect to $v$ and equating the gradient to 0, we get
\[
\nabla_v Q = 0
\]
\[
\nabla_v \sum_k \sum_d (a^{m,d}_k - F^d_k v)^T (a^{m,d}_k - F^d_k v) + (v - y)^T (v - y) = 0
\]
\[
\nabla_v \sum_k \sum_d \left[ (a^{m,d}_k)^T a^{m,d}_k + v^T (F^d_k)^T F^d_k v - 2 (a^{m,d}_k)^T F^d_k v \right] + \left( v^T v - 2 y^T v + y^T y \right) = 0
\]
\[
\sum_k \sum_d \left[ 2 (F^d_k)^T F^d_k v - 2 (F^d_k)^T a^{m,d}_k \right] + (2v - 2y) = 0
\]
\[
\Rightarrow v = \left( \sum_k \sum_d (F^d_k)^T F^d_k + I \right)^{-1} \left( \sum_k \sum_d (F^d_k)^T a^{m,d}_k + y \right)
\]
We can extract $\hat{a}^{*m}$ by reshaping $v$ back into a matrix.
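A NumPy sketch of this closed-form mode computation; the filter matrices $F^d_k$ are assumed to be pre-materialized as dense arrays acting on the flattened ground truth, and all shapes and names are illustrative:

```python
import numpy as np

def hard_e_step(A, F, y):
    """Mode of q(a*_m) for the time series model.

    A : (K, D, T) observed annotation time series a_k^{m,d}
    F : (K, D, T, T*D) filter matrices F_k^d applied to vec(a*_m)
    y : (T*D,) vec(X_m Theta), the regression prediction
    """
    K, D, T = A.shape
    n = y.shape[0]
    lhs = np.eye(n)          # the identity from the prior term
    rhs = y.copy()
    for k in range(K):
        for d in range(D):
            lhs += F[k, d].T @ F[k, d]
            rhs += F[k, d].T @ A[k, d]
    v = np.linalg.solve(lhs, rhs)
    return v.reshape(T, D)   # back to matrix form (ordering convention assumed)
```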
M-step

Given the point estimate for $a^{*m}$, the log-likelihood Equation A.7 can now be written as a function of the model parameters.
\[
\log L = \sum_{m=1}^{M} \left[ \sum_{k=1}^{K} \log p(a^m_k \mid a^{*m}; F^d_k, \sigma_k) + \log p(a^{*m}; \Theta, \sigma) \right]
\]
In the M-step, we optimize the above equation with respect to the parameters $\Phi = \{F_k, \sigma_k, \Theta, \sigma\}$.
\[
Q(F_k, \sigma_k, \Theta, \sigma) = \sum_{m=1}^{M} \left[ \sum_{k=1}^{K} \log p(a^m_k \mid a^{*m}; F^d_k, \sigma_k) + \log p(a^{*m}; \Theta, \sigma) \right] \tag{A.8}
\]
Estimating $F^d_k$. Since each $F^d_k$ is a filter matrix constructed from a vector $f^d_k \in \mathbb{R}^{WD}$, we differentiate Equation A.8 with respect to $f^d_k$.
\[
\nabla_{f^d_k} Q = 0
\]
\[
\nabla_{f^d_k} \sum_{m=1}^{M_k} \log p(a^m_k \mid a^{*m}; F^d_k, \sigma_k) = 0
\]
\[
\nabla_{f^d_k} \sum_{m=1}^{M_k} \left\| a^{m,d}_k - F^d_k\, \mathrm{vec}(a^{*m}) \right\|_2^2 = 0
\]
In the last step we make use of the fact that $a^m_k$ depends on $a^{*m}$ through Gaussian noise. We also discard all other dimensions $d' \neq d$, since these do not depend on $f^d_k$. To estimate $f^d_k$, we can rearrange $F^d_k\, \mathrm{vec}(a^{*m})$ so that $f^d_k$ becomes the parameter vector of a linear regression problem, with the independent variables represented by a matrix $A$ obtained by constructing a filtering matrix out of $\mathrm{vec}(a^{*m})$. Hence, the optimization problem becomes
). Hence, the optimization problem becomes
f
d
k
M
k
X
m=1
jja
m;d
k
Af
d
k
jj
2
2
= 0
)f
d
k
=
M
k
X
m=1
A
T
A
1
M
k
X
m=1
A
T
a
m;d
k
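This rearrangement is the usual trick of writing a convolution as a matrix-vector product. Below is a simplified sketch, assuming a causal length-$W$ filter over a single flattened sequence; the full model couples all $D$ dimensions, so the true design matrix has $WD$ columns, and both the window convention and the names are assumptions:

```python
import numpy as np

def filter_design_matrix(a_star_vec, T, W):
    """Rows are length-W causal windows of the flattened ground truth."""
    n = a_star_vec.shape[0]
    A = np.zeros((T, W))
    for t in range(T):
        for w in range(W):
            idx = t - w          # past samples within the window
            if 0 <= idx < n:
                A[t, w] = a_star_vec[idx]
    return A

def update_f_dk(targets, designs):
    """Pooled least squares over the instances annotated by k.

    targets : list of (T,) arrays a_k^{m,d}
    designs : list of (T, W) matrices A built from vec(a*_m)
    """
    AtA = sum(A.T @ A for A in designs)
    Atb = sum(A.T @ b for A, b in zip(designs, targets))
    return np.linalg.solve(AtA, Atb)
```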
Estimating $\sigma_k$. Differentiating Equation A.8 with respect to $\sigma_k$ and equating the gradient to 0, we have
\[
\nabla_{\sigma_k} Q = 0
\]
\[
\nabla_{\sigma_k} \sum_{m=1}^{M_k} \log p(a^m_k \mid a^{*m}; F^d_k, \sigma_k) = 0
\]
\[
\nabla_{\sigma_k} \sum_{m=1}^{M_k} \sum_d \log \left[ \frac{1}{\left| 2\pi \sigma_k^2 I \right|^{1/2}}\, e^{ -\frac{1}{2\sigma_k^2} \left\| a^{m,d}_k - F^d_k\, \mathrm{vec}(a^{*m}) \right\|_2^2 } \right] = 0
\]
\[
\nabla_{\sigma_k} \sum_{m=1}^{M_k} \sum_d \left[ -T \log \sigma_k - \frac{1}{2\sigma_k^2} \left\| a^{m,d}_k - F^d_k\, \mathrm{vec}(a^{*m}) \right\|_2^2 \right] = 0
\]
\[
-\frac{M_k D T}{\sigma_k} + \frac{1}{\sigma_k^3} \sum_{m=1}^{M_k} \sum_d \left\| a^{m,d}_k - F^d_k\, \mathrm{vec}(a^{*m}) \right\|_2^2 = 0
\]
\[
\Rightarrow \sigma_k^2 = \frac{1}{M_k D T} \sum_{m=1}^{M_k} \sum_d \left\| a^{m,d}_k - F^d_k\, \mathrm{vec}(a^{*m}) \right\|_2^2
\]
Estimating $\Theta$. Differentiating Equation A.8 with respect to $\Theta$ and equating the gradient to 0, we have
\[
\nabla_{\Theta} Q = 0
\]
\[
\nabla_{\Theta} \sum_{m=1}^{M} \left\| \mathrm{vec}(a^{*m}) - \mathrm{vec}(X_m \Theta) \right\|_2^2 = 0
\]
By definition, each column of $\Theta$ is independent of the others. Hence we can estimate each $\theta^d$ separately (taking derivatives of the above equation cancels all terms except those in $\theta^d$).
\[
\nabla_{\theta^d} \sum_{m=1}^{M} (a^{*m,d} - X_m \theta^d)^T (a^{*m,d} - X_m \theta^d) = 0
\]
\[
\nabla_{\theta^d} \sum_{m=1}^{M} \left[ (a^{*m,d})^T a^{*m,d} - 2 (a^{*m,d})^T X_m \theta^d + (\theta^d)^T X_m^T X_m \theta^d \right] = 0
\]
\[
\theta^d = \left( \sum_{m=1}^{M} X_m^T X_m \right)^{-1} \sum_{m=1}^{M} X_m^T a^{*m,d}
\]
We can combine the estimation of all the columns of $\Theta$ as follows.
\[
\Rightarrow \Theta = \left( \sum_{m=1}^{M} X_m^T X_m \right)^{-1} \sum_{m=1}^{M} X_m^T a^{*m}
\]
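A sketch of this pooled normal-equations update, with names assumed from the earlier snippets:

```python
import numpy as np

def update_Theta_ts(a_star, X):
    """Pooled normal equations: Theta = (sum X_m^T X_m)^-1 sum X_m^T a*_m.

    a_star : list of (T, D) ground-truth estimates from the hard E-step
    X      : list of (T, P) per-instance feature matrices
    """
    XtX = sum(Xm.T @ Xm for Xm in X)
    Xta = sum(Xm.T @ am for Xm, am in zip(X, a_star))
    return np.linalg.solve(XtX, Xta)  # (P, D)
```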
Estimating $\sigma$. Differentiating Equation A.8 with respect to $\sigma$ and equating the gradient to 0, we have
\[
\nabla_{\sigma} Q = 0
\]
\[
\nabla_{\sigma} \sum_{m=1}^{M} \log p(a^{*m}; \Theta, \sigma) = 0
\]
From Equation 4.6, $a^{*m}$ was defined by adding zero-mean Gaussian noise to $\mathrm{vec}(X_m \Theta)$. Writing $v = \mathrm{vec}(a^{*m})$ and $y = \mathrm{vec}(X_m \Theta)$, we have
\[
\nabla_{\sigma} \sum_{m=1}^{M} \log \left[ \frac{1}{\left| 2\pi\sigma^2 I \right|^{1/2}}\, e^{-\frac{1}{2} (v-y)^T (\sigma^2 I)^{-1} (v-y)} \right] = 0
\]
\[
\nabla_{\sigma} \sum_{m=1}^{M} \left[ -TD \log \sigma - \frac{1}{2\sigma^2} \left\| v - y \right\|_2^2 \right] = 0
\]
\[
\sum_{m=1}^{M} \left[ -\frac{TD}{\sigma} + \frac{1}{\sigma^3} \left\| v - y \right\|_2^2 \right] = 0
\]
\[
\Rightarrow \sigma^2 = \frac{1}{MTD} \sum_{m=1}^{M} \left\| \mathrm{vec}(a^{*m}) - \mathrm{vec}(X_m \Theta) \right\|_2^2
\]
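And the corresponding update in code, again under assumed names:

```python
import numpy as np

def update_sigma2_ts(a_star, X, Theta):
    """sigma^2 = (1 / MTD) * sum_m ||vec(a*_m) - vec(X_m Theta)||^2."""
    M = len(a_star)
    T, D = a_star[0].shape
    total = sum(np.sum((am - Xm @ Theta) ** 2) for am, Xm in zip(a_star, X))
    return total / (M * T * D)
```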