Generating Psycholinguistic Norms and Applications
by
Nikolaos Malandrakis
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Science)
May 2020
Copyright 2020 Nikolaos Malandrakis
Dedication
This dissertation is dedicated to my mother Ioanna Malandraki.
Acknowledgements
I would like to express my gratitude to my advisor, professor Shrikanth Narayanan, for
his guidance, patience and support through the better and worse parts of this long journey.
I benefited greatly from the creative freedom I was allowed and from the understanding
when things did not progress in an ideal way. And of course I gained much from the op-
portunities afforded by the sprawling ecosystem of the Signal Analysis and Interpretation
Laboratory (SAIL).
I would be remiss to not properly thank the other members of SAIL. The support and
feedback from my colleagues and friends in SAIL was critical throughout my studies.
The members of SAIL form a large research family, with work ethic only matched by
partying endurance. Work hard and play hard guys!
Special thanks to my former advisor, co-author and collaborator of many years, Alexandros Potamianos. Some lessons take time to fully appreciate, but last for a lifetime. It is clear now how useful the lessons I learned from Alex really were.
I would also like to thank the members of the greater ELISA-LORELEI team: Professors Kevin Knight, Jon May, Ji Heng and many, many others. It was a valuable experience, teaching me not just the ways of research, but also leadership, management and delegation. And of course I thank the members of my own sub-group, who had to endure my supervision while I was still learning how to supervise.
I would like to thank the many co-authors and collaborators of the cross-disciplinary studies I was involved in, especially Professor Jonathan Gratch and Doctor Armen Arevian. Apart from the interesting work we did, working with people from different fields exposes one to new ideas and ways of thinking that can be transplanted to one's own field.
Finally, to my family and friends, thank you from the bottom of my heart for your
love and support. I’m almost there!
Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
1 Introduction
2 Prior Work
2.1 Generating Affective Norms for Words
2.2 Generating Affective Norms for Sentences
2.3 Generating Norms beyond Emotion
3 Using Distributional Semantics to Generate Affective Norms for Words and Sentences
3.1 Introduction
3.2 Affective Model
3.2.1 Word Level Tagging
3.2.2 Multi-word Term Tagging
3.2.3 Sentence Level Tagging
3.2.4 Fusion of n-gram models
3.2.4.1 Interpolation
3.2.4.2 Back-off
3.2.4.3 Weighted Interpolation
3.3 Experimental Procedure
3.3.1 Corpora
The work presented in Chapter 3 was published in the following article:
Nikolaos Malandrakis, Alexandros Potamianos, Elias Iosif, Shrikanth Narayanan, “Distributional Semantic Models for Affective Text Analysis”, IEEE Transactions on Audio, Speech, and Language Processing, Volume: 21, Issue: 11, Nov. 2013, pp. 2379–2392
3.3.2 Corpus creation and Semantic similarity
3.3.3 Affective Lexicon and Word Affective Ratings
3.3.4 Sentence Affective Ratings
3.4 Results
3.4.1 Baseline Performance
3.4.2 Similarity metric selection
3.4.3 Seed word selection
3.4.4 Word Affective Ratings
3.4.5 Sentence Affective Ratings
3.4.5.1 Baseline Performance
3.4.5.2 Fusion of n-gram models
3.5 Conclusions
4 Adapting Norms to Task Domain
4.1 Introduction
4.2 Creating affective ratings
4.3 Adaptation
4.3.1 Pragmatic Constraints
4.3.2 Perplexity
4.3.3 Adapting model parameters
4.4 Experimental procedure
4.5 Results
4.6 Conclusions
5 Going Beyond Affect
5.1 Introduction
5.2 The task: Therapy language analysis
5.3 Expanding Word Psycholinguistic Norms
5.4 Norms for larger passages
5.5 Corpora and Experimental Procedure
5.5.1 Manually annotated norms
5.5.2 Raw text corpus
5.5.3 General psychotherapy corpus
5.6 Experimental Procedure
5.7 Results
The work presented in Chapter 4 was published in the following article:
Nikolaos Malandrakis, Alexandros Potamianos, Kean J. Hsu, Kalina N. Babeva, Michelle C. Feng, Gerald C. Davison, Shrikanth Narayanan, “Affective Language Model Adaptation via Corpus Selection”, Proceedings of ICASSP, 2014, pp. 4838–4842
The work presented in Chapter 5 was published in the following article:
Nikolaos Malandrakis, Shrikanth Narayanan, “Therapy Language Analysis using Automatically Generated Psycholinguistic Norms”, Proceedings of Interspeech, 2015, pp. 1947–1951
5.7.1 Word-level norm estimation
5.7.2 Therapy transcript analysis
5.8 Conclusions
6 Using Neural Networks to Generate Sentence Norms
6.1 Introduction
6.2 Approach
6.3 Experimental Procedure
6.3.1 Corpora
6.3.2 Experiments
6.4 Results
6.4.1 Word ratings
6.4.2 Sentence ratings
6.5 Conclusions
7 Application to Sentiment Analysis: SemEval 2014
7.1 Model Description
7.1.1 Preprocessing
7.1.2 Lexicon-based features
7.1.2.1 Third party lexica
7.1.2.2 Emotiword: expansion and adaptation
7.1.2.3 Statistics extraction
7.1.3 Tweet-level similarity ratings
7.1.4 Character features
7.1.5 Contrast features
7.1.6 Feature selection and Training
7.2 Results
7.3 Conclusions
8 Conclusions and Future Work
8.1 Word-level norm generation
8.2 Domain adaptation of Norms
8.3 Bigram and sentence-level norm generation
8.4 Future Work
Reference List
The work presented in Chapter 7 was published in the following article:
Nikolaos Malandrakis, Michael Falcone, Colin Vaz, Jesse James Bisogni, Alexandros Potamianos, Shrikanth Narayanan, “SAIL: Sentiment Analysis using Semantic Similarity and Contrast Features”, Proceedings of SemEval, 2014, pp. 512–516
A Appendix
A.1 Generating Word Norms across languages
A.2 Using Norms to describe Movie language
List of Tables

2.1 The 14 seeds used in the experiments by Turney and Littman.
3.1 The functions of similarity used.
3.2 Training sample using 10 seed words.
4.1 Corpus size after pragmatic constraints and/or perplexity thresholding has been applied, for the ATSS and Twitter experiments.
4.2 Performance for each experiment using linear combinations of the generic and adapted lexicon models. Presented is the maximum accuracy achieved in each case, as well as the parameters of the adapted model and the weight w assigned to it.
5.1 Word norm estimation performance. Cardinality of the dataset and Pearson correlation to the ground truth for 10-fold regression over manually annotated words.
5.2 Pearson correlation of Client and Therapist norms, for Client centered therapy, psychoanalytic psychology and Overall.
5.3 Factor p-values and direction of difference. Significant differences are denoted with ↑ or ↓ at the p < 0.05 level and ⇑ or ⇓ at the p < 0.001 level.
5.4 The effect of increased experience. Significance denoted with ↑ or ↓ at the p < 0.05 level and ⇑ or ⇓ at the p < 0.001 level.
6.1 Word-level Pearson's ρ performance. OLS regression using context vectors against OLS and SVR using GloVe embeddings. Statistical significance marked by †: p < 0.05 or ††: p < 0.001.
6.2 Word-level Pearson's ρ performance. Regression against individual neural networks. Statistical significance marked by †: p < 0.05 or ††: p < 0.001.
6.3 Word-level Pearson's ρ performance. OLS regression using context-based similarity against individual neural networks using GloVe embeddings. Statistical significance marked by †: p < 0.05 or ††: p < 0.001.
6.4 Word-level Pearson's ρ performance. Individual neural networks against per-resource multi-task networks. Statistical significance marked by †: p < 0.05 or ††: p < 0.001.
6.5 Word-level Pearson's ρ performance. Individual neural networks against a universal multi-task network. Statistical significance marked by †: p < 0.05 or ††: p < 0.001.
6.6 Sentence-level Pearson's ρ performance, for the complete data scenario. Statistical significance marked by †: p < 0.05 or ††: p < 0.001.
6.7 Sentence-level results, Pearson's ρ performance, for the missing data scenario. Statistical significance marked by †: p < 0.05 or ††: p < 0.001.
7.1 Performance and rank achieved by our submission for all datasets of subtasks A and B.
7.2 Selected features for subtask B.
7.3 Performance on all data sets of subtask B after removing 1 set of features. Performance difference with the complete system listed if greater than 1%.
A.1 Word-level Pearson Correlation performance for the task of cross-lingual Norm estimation. Results shown, per target language, for Valence and Arousal and for the cases where the word model was trained using only the English lexicon or all lexica apart from the target language one.
List of Figures

3.1 Example of word rating fusion, showing the per-word ratings and the phrase ratings produced by the three unigram fusion schemes.
3.2 Performance of the affective lexicon creation algorithm using similarities based on co-occurrence counts from the 116m corpus. Correlation for the ANEW-CV experiment using: (a) a linear kernel and (b) a square root kernel.
3.3 Performance of the affective lexicon creation algorithm using co-occurrence based similarities. Correlation for the ANEW-CV experiment using: (a) the 116m corpus and different window sizes at 150 seeds and (b) corpora of different sizes.
3.4 Performance of the affective lexicon creation algorithm using context-based similarities. Correlation for the ANEW-CV experiment using: (a) the 116m corpus and different window sizes and (b) a window size of 1 and corpora of different sizes.
3.5 Performance of the affective lexicon creation algorithm using different seed selection algorithms and analysis of the wrapper selected seeds for the ANEW-CV experiment using: (a) the G similarity metric, (b) the S_1 similarity metric. The corresponding rank distributions of the top 50 seeds per fold selected by a wrapper when using: (c) the G similarity metric, (d) the S_1 similarity metric.
3.6 Accuracy of the affective lexicon creation algorithm: (a) ANEW-CV experiment, (b) GINQ-PD experiment.
3.7 Correlation of the affective lexicon creation algorithm: (a) ANEW-CV experiment, (b) BAWLR-CV experiment.
3.8 Binary classification accuracy of the sentence rating algorithm as a function of the number of seed words, when using only unigram terms and: (a) the G similarity metric, (b) the S_1 similarity metric.
3.9 Binary classification accuracy of the sentence rating algorithm as a function of the bigram selection threshold (backoff rate) for the SemEval'07-Task14 dataset: (a) the G similarity metric and 300 seeds, (b) the S_1 similarity metric and 600 seeds.
4.1 Overview of the lexicon expansion method.
4.2 Classification accuracy as a function of the size of the corpus used for lexicon creation, using perplexity, pragmatic constraints and perplexity or neither to generate the corpus. The data point labels correspond to perplexity threshold values. Performance shown for (a) ATSS anger, (b) ATSS distress and (c) TWITTER sentiment.
5.1 Sample means as functions of therapist experience for client-centered therapy and psychoanalytic psychology.
6.1 Structure overview for the word and sentence models.
A.1 Norm dimensions used to characterize movie language.
A.2 Movie results examples: “Forrest Gump” and “Fight Club”.
A.3 Cross-Movie comparisons: Male and Female action heroes.
Chapter 1
Introduction
Language norms are (typically numeric) representations of the normative (expected) con-
tent of language, usually collected in a lexicon/thesaurus. The most commonly used va-
riety are emotional norms, representing word polarity, valence, arousal or dominance,
though norms for aspects of language beyond emotion have existed for a long time
[20, 84] and have been popular in behavioral studies as a way to select appropriate stim-
uli [68]. These include norms describing the degree of abstractness, the complexity of
meaning and age or gender affinity of individual words, and can potentially be useful for
a variety of applications.
Affective text analysis is the most popular task for norm use due to the many appli-
cations for which emotional information is relevant, such as sentiment analysis/opinion
mining for news stories [39], blogs and public forums [8], product reviews [83, 28] or
even multimedia documents [36, 35, 7], where extracting emotion from transcripts has a
limited yet important role.
Manually annotated lexica containing norms [11, 69, 20, 84] have limited computa-
tional applications due to their small sizes (typically a few hundred up to a few thousand
words), so machine learning methods are used to expand them [22, 38, 48, 77] and create
much larger resources like SentiWordNet [22] and WORDNET AFFECT [71]. The bulk
of research has focused on emotion norms, particularly emotional valence (emotional
evaluation, from negative to positive), though work on some other norm dimensions is
also available [38, 48, 77], but significantly less common.
Generating norms (typically affect) for higher level linguistic structures, typically
sentences, presents more challenges. Unlike words, there are almost no resources with
continuous annotations for sentences. In this work we investigate two scenarios with respect to sentence data availability: the scenario where we have very little data for a target norm and the scenario where we have no data for the target norm. For both scenarios we can use word-level annotations to augment sentence data. Given a norm lexicon of sufficient coverage, sentence-level ratings can be created by combining word-level ratings. Often, when lexica with continuous affective ratings are available, sentence-level ratings are estimated as simple numerical combinations of word ratings (typically the arithmetic mean). There have been attempts at using syntactic rules with encouraging results, e.g., [50], though such approaches are, so far, limited to using binary or ternary word affective ratings. The end result norms are typically binary or ternary labels.
In this thesis we propose a framework for automatic norm expansion applicable to many norms, spanning typical emotional dimensions and beyond, based on data-driven similarity measures. Our goal is to produce highly accurate norms for words and sentences, on continuous scales, for a variety of norm dimensions. Overall, the main contributions of this work are:
• We generalize the affective lexicon expansion method proposed in [78, 79]. The
resulting method can use a variety of semantic similarity metrics and seed selection
strategies, while taking advantage of explicit supervision to generate norms for a
variety of dimensions. The method is capable of producing norms for multi-word
expressions [65].
• We investigate the factors contributing to norm estimation performance, such as the
semantic similarity metrics used, the context window size and the characteristics
of the seed (reference) words.
• We use corpus selection strategies to generate domain-specific norms. The approach is shown to be successful and produces norms that are more valuable as features for downstream tasks.
• Motivated by the language modeling literature, we propose a framework for com-
bining n-gram ratings of varying order and utilizing multi-word term detection
methods, to estimate sentence-level affective ratings. We use a structure similar to
an n-gram language model with back-off and propose multi-word term selection
criteria (for activating the back-off strategy).
• We use neural networks and multi-task learning to transfer knowledge across dif-
ferent norm dimensions and enable the generation of more accurate sentence-level
norms for dimensions where we do not have sentence-level annotations.
The structure of this thesis is as follows:
• Chapter 2 includes a review of prior work on norm expansion for words and sen-
tences.
• Chapter 3 describes the initial approach used to generate affect-only norms for
words and the methodology for generating sentence level norms.
• Chapter 4 describes an adaptation method based on corpus selection that can be
used to create domain-specific norms.
• Chapter 5 describes the generalization of the norm expansion method to norms
beyond emotion.
• Chapter 6 describes the use of multi-task neural networks to improve word-level
norm estimation and enable the generation of sentence norms without any sentence
data for the target dimension.
• Chapter 7 describes a complete application of the affective norms, utilizing many of the techniques described in the previous chapters in the context of the SemEval sentiment analysis challenge.
• Chapter 8 includes the discussion and future work.
• Chapter A includes some extra results for some of the more ambitious experiments conducted over the course of this thesis which do not fit in the preceding chapters.
Chapter 2
Prior Work
Virtually all existing prior work focuses on the generation of norms for affective dimen-
sions.
2.1 Generating Affective Norms for Words
The task of assigning affective ratings, such as binary “positive - negative” labels, also
known as semantic orientation [26], is an active research area. The underlying assumption
for most semantic orientation algorithms is that semantic similarity can be translated to
affective similarity. Therefore, given some metric of similarity between two words one
may derive the similarity between their affective ratings. In [78], a set of words with
known affective ratings together with the semantic similarities between these words and
an unseen word are used to estimate affective ratings for the new word. The reference
words that are used to bootstrap the affective model are usually referred to as seed words.
The nature of the seed words can vary; they may be the lexical labels of affective cat-
egories (e.g., “anger”, “happiness”), small sets of words with unambiguous meaning or
even all words in a large lexicon. Having a set of seed words and an appropriate similarity
measure, the next step is devising a method of combining these to create the final rating.
In most cases the desired rating is some form of binary label like “fear” - “not fear”,
in which case a classification scheme, like nearest neighbor may be used to provide the
Table 2.1: The 14 seeds used in the experiments by Turney and Littman.
positive: good, superior, positive, correct, fortunate, nice, excellent
negative: bad, inferior, negative, wrong, unfortunate, nasty, poor
final result. Alternatively, continuous/pseudo-continuous ratings may be estimated via
algebraic combination of similarities and ratings of seed words [72].
In [78, 79], hit counts from conjunctive “NEAR” web queries are used to measure
co-occurrence of words in web documents; semantic similarity is estimated for hits via
point-wise mutual information. The estimated valence \hat{v}(w_j) of each new word w_j is expressed as a linear combination of the valence ratings v(w_i) of the seeds w_i and the semantic similarities between the new word and each seed d(w_i, w_j) as:
\hat{v}(w_j) = \sum_{i=1}^{N} v(w_i)\, d(w_i, w_j)    (2.1)
The seeds used are N = 14 adjectives (7 pairs of antonyms) shown in Table 2.1 and their ratings are assumed to be binary (1 or −1)¹.
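For illustration, the estimator in (2.1) reduces to a similarity-weighted sum over the seed words. The following minimal Python sketch is not part of the original work: the seed list follows Table 2.1, while the similarity function is left abstract and could be any of the metrics discussed below.

```python
# Minimal sketch of Eq. (2.1): valence as a similarity-weighted sum over seed words.
# `similarity` stands in for any d(w_i, w_j), e.g. a PMI estimate from web hit counts.

SEEDS = {
    "good": 1, "superior": 1, "positive": 1, "correct": 1,
    "fortunate": 1, "nice": 1, "excellent": 1,
    "bad": -1, "inferior": -1, "negative": -1, "wrong": -1,
    "unfortunate": -1, "nasty": -1, "poor": -1,
}

def estimate_valence(word, similarity):
    """Return the Eq. (2.1) estimate for `word` given a similarity function."""
    return sum(rating * similarity(seed, word) for seed, rating in SEEDS.items())
```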
WordNet-based methods [21, 5, 71, 25] start with a small set of annotated words,
usually with binary ratings. These sets are then expanded by exploiting synonymy, hy-
pernymy and hyponymy relations (traversal of the WordNet network) along with simple
rules. Various approaches are then used to calculate the similarity between unseen words
and the seed words, including using contextual similarity between glosses [21] and synset
¹ The method is shown to work very well in terms of binary (positive/negative) classification, achieving an 82.8% accuracy on the General Inquirer dataset. This method depends on the, now defunct, Altavista NEAR queries. As shown in [79, 74] the method performs much worse when using conjunctive AND queries instead.
distance metrics [25]. The main benefit of resource-based methods is the ability to cre-
ate ratings per sense of each word, however ratings can only be produced for words in
WordNet.
Most of the aforementioned work utilizes the notion of semantic similarity between
words or terms in order to infer affective ratings. Semantic similarity metrics can be
roughly categorized into: i) ontology-based similarity measures, e.g., [12], where simi-
larity features are extracted from ontologies (usually WordNet), ii) context-based simi-
larity measures [56], where similarity of context is used to estimate semantic similarity
between words or terms, iii) co-occurrence based similarity metrics where the frequency
of co-occurrence of terms in (web) documents is the main feature used for estimating
semantic similarity [78, 74], and iv) combinations of the aforementioned methods [10].
Context-based methods form the basis of distributional semantic models (DSM), dis-
tinguished into unstructured and structured types [9]. Unstructured approaches do not
consider the linguistic structure of context: a window is centered on the target word and
the surrounding contextual features within the window are extracted [3, 31]. For struc-
tured approaches the extracted contextual features correspond to syntactic relationships,
which are typically extracted by dependency parsing and represented as word tuples
[55, 9]. Recently corpus-based methods (especially context-based metrics) were shown to perform on par with ontology-based metrics [31], especially when using semantic networks as generalizations of distributional semantic models [32].
2.2 Generating Affective Norms for Sentences
Having created an affective lexicon, the next step is the combination of these word ratings
to create ratings for larger lexical units, phrases or sentences. Initially the affect-bearing
words need to be selected, depending on their part-of-speech tags [16], affective rating
and/or the sentence structure [6]. Then their individual ratings are combined, typically in
a simple fashion, such as a numeric average. More complex approaches involve taking
7
into account sentence structure, word/phrase level interactions such as valence shifters
[59] and large sets of manually created rules [16, 6]. In [50] a supervised method is used
to train the parameters of multiple hand-selected rules of composition. However these
complex methods have shown little improvement over simpler distributional approaches.
Furthermore, the application of syntactic rules becomes prohibitively complex when us-
ing continuous word/sentence ratings: even the simplest of rules would require multiple
parameters/cases.
2.3 Generating Norms beyond Emotion
This is a relatively recent research subject and as such relatively little related work exists.
In [77] an expansion of the method of [78] is used to estimate word-level concreteness
values. The authors select 40 words to act as seeds (20 as paradigms of high concrete-
ness and 20 as paradigms of high abstractness) and use LSA vectors to estimate similar-
ity. In [38] WordNet relations (synonymy, hypernymy, hyponymy) are used to propagate
imagability scores across WordNet, with related words merely assigned the same norm
values. In [48] distributional similarity is combined with WordNet relations (synonymy,
hypernymy, hyponymy) to generate imagability norms. Given the relatively recent de-
velopment of this subject, no related work exists on the generation of these norms for
sentences.
Chapter 3
Using Distributional Semantics to Generate Affective Norms for Words and Sentences
3.1 Introduction
In this chapter we describe a framework for the generation of affective norms, focusing
on emotional valence, for words, bigrams and sentences. At the word level the method
utilizes distributional semantic similarity to estimate norms for unknown words based on
their similarity to known words. At the sentence level we leverage the ability of the word
model to generate norms for higher order n-grams, in this instance bigrams, and generate
sentence-level norm values by combining the constituent higher order norms.
3.2 Affective Model
As in [78], we start from an existing, manually annotated lexicon. A subset of words is
automatically selected from the lexicon to serve as seed words for the affective model.
The work presented in this chapter was published in the following article:
Nikolaos Malandrakis, Alexandros Potamianos, Elias Iosif, Shrikanth Narayanan, “Distributional Semantic Models for Affective Text Analysis”, IEEE Transactions on Audio, Speech, and Language Processing, Volume: 21, Issue: 11, Nov. 2013, pp. 2379–2392
The affective rating for a new word/term is estimated as a linear combination of the prod-
ucts between semantic similarities and affective ratings of the seed words. We modify the
method in [78] by adding: i) weights to the equation, one per seed word, so as to adjust
each seed word’s contribution to the final output, and ii) a function (kernel) that modifies
the semantic similarity score contribution to the model. The weights are selected so as to
minimize the mean square training error.
The trainable weights are meant to capture the relevance of each seed word in the af-
fective model. For instance, a seed word with high affective (or semantic) variance might
be a less robust predictor of the affective scores of unseen words. Words with high affec-
tive variance typically have multiple part-of-speech tags and word senses, or their valence
rating is highly context-dependent. In addition, a set of seed words might not provide
a detailed and representative description of the affective/semantic space, e.g., selecting
only words with positive valence scores significantly hurts performance of the model
1
.
Rather than attempting to estimate the individual contribution of each parameter to the
relevance of seed words in our model, we use machine learning to automatically esti-
mate linear weights for each seed word. The weights are estimated in order to minimize
estimation error on the bootstrap affective lexicon using the Least Squares Estimation
(LSE) algorithm, as detailed below. A simplified version of the affective model was first
proposed in [44].
3.2.1 Word Level Tagging
We want to characterize the affective content of words in a continuous valence range of [−1, 1] (from very negative to very positive), from the reader (i.e., perceiver) perspective.
¹ For more details on the effect of these factors on seed selection, see Section 3.4.3.
\begin{bmatrix} 1 & f(d(w_1,w_1))\,v(w_1) & \cdots & f(d(w_1,w_N))\,v(w_N) \\ \vdots & \vdots & \ddots & \vdots \\ 1 & f(d(w_K,w_1))\,v(w_1) & \cdots & f(d(w_K,w_N))\,v(w_N) \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_N \end{bmatrix} = \begin{bmatrix} v(w_1) \\ \vdots \\ v(w_K) \end{bmatrix}    (3.2)
Table 3.1: The functions of similarity used.
linear: f(d(·)) = d(·)
exp: f(d(·)) = e^{d(·)}
log: f(d(·)) = log(d(·))
sqrt: f(d(·)) = \sqrt{d(·)}
We model the valence of each word as a linear combination of its semantic similarities to a set of seed words and the valence ratings of these words:
\hat{v}(w_j) = a_0 + \sum_{i=1}^{N} a_i\, v(w_i)\, f(d(w_i, w_j)),    (3.1)
where w_j is the word we aim to characterize, w_1, ..., w_N are the seed words, v(w_i) is the valence rating for seed word w_i, a_i is the weight corresponding to word w_i (that is estimated as described next), d(w_i, w_j) is a measure of semantic similarity between words w_i and w_j (see Section 3.2.1) and f() is a simple function from Table 3.1. The function f() serves to non-linearly rescale the similarity metric d(w_i, w_j) and will be henceforth referred to as the kernel of the affective model.
Assuming we have a training corpus of K words with known ratings (the manually annotated affective lexicon we start from) and a set of N < K seed words (a subset of the lexicon) for which we need to estimate weights a_i, we can use (3.1) to create a system of K linear equations with N+1 unknown variables as shown in (3.2): the N weights a_1, ..., a_N and the extra weight a_0, which is the shift (bias). The optimal values of these variables can be estimated using LSE. Once the weights of the seed words are estimated, the valence of an unseen word w_j can be computed using (3.1). Note that no additional training corpus or data are required here; the weights are estimated on the affective lexicon used to bootstrap the model.
The valence estimator defined in (3.1) uses a metric d(w_i, w_j) of the semantic similarity between words w_i and w_j. In this work, we use both co-occurrence based and context-based similarity metrics.
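For concreteness, the training step in (3.2) is an ordinary least-squares problem with one row per annotated word and one column per seed word plus a bias column. The following sketch is an illustrative reading, not code from the thesis; the similarity function and kernel are passed in as arguments and the function names are assumptions.

```python
import numpy as np

def train_weights(train_words, valences, seeds, similarity, kernel=lambda d: d):
    """Solve the K x (N+1) least-squares system of Eq. (3.2).
    `seeds` is a list of (seed_word, seed_valence) pairs."""
    A = np.ones((len(train_words), len(seeds) + 1))        # first column: the bias term
    for k, word in enumerate(train_words):
        for i, (seed, v_seed) in enumerate(seeds, start=1):
            A[k, i] = kernel(similarity(seed, word)) * v_seed
    a, *_ = np.linalg.lstsq(A, np.asarray(valences, dtype=float), rcond=None)
    return a                                               # a[0] = a_0 (bias), a[1:] = seed weights

def estimate_valence(word, a, seeds, similarity, kernel=lambda d: d):
    """Apply Eq. (3.1) to an unseen word."""
    return a[0] + sum(a_i * v_seed * kernel(similarity(seed, word))
                      for a_i, (seed, v_seed) in zip(a[1:], seeds))
```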
Co-occurrencebasedsimilaritymetrics estimate the similarity between two words/terms
using the frequency of co-existence within larger lexical units (sentences, documents).
The underlying assumption is that terms that co-exist often are likely to be related se-
mantically. One popular method to estimate co-occurrence is to pose conjunctive queries
to a web search engine; the number of returned hits is an estimate of the frequency of
co-occurrence [31]. Co-occurrence based metrics do not depend on annotated language
resources like ontologies nor require downloading documents or snippets, as is the case
for context-based semantic similarities.
In the equations that follow, w_i, ..., w_{i+n} are the query words, {D; w_i, ..., w_{i+n}} is the set of results {D} returned for these query words. The number of documents in each result set is noted as |D; w_i, ..., w_{i+n}|. We investigate the performance of four different co-occurrence based metrics, defined next.
Jaccard coefficient computes similarity as:
J(w_i, w_j) = \frac{|D; w_i, w_j|}{|D; w_i| + |D; w_j| - |D; w_i, w_j|}.    (3.3)
Dice coefficient is a variation of the Jaccard coefficient, defined as:
C(w_i, w_j) = \frac{2\,|D; w_i, w_j|}{|D; w_i| + |D; w_j|}.    (3.4)
Mutual information [10] is an info-theoretic measure that derives the similarity between w_i and w_j via the dependence between their number of occurrences. Point-wise Mutual Information (PMI) is defined as:
I(w_i, w_j) = \log \frac{|D; w_i, w_j| / |D|}{\left(|D; w_i| / |D|\right)\left(|D; w_j| / |D|\right)}.    (3.5)
Mutual information is unbounded and can take any value in [-\infty, +\infty]. Positive values translate into similarity, negative values into dissimilarity (presence of one word tends to exclude the other) and zero into independence, lack of relation.
Google-based Semantic Relatedness: Normalized Google Distance is proposed in [80, 18] and defined as:
E(w_i, w_j) = \frac{\max\{L\} - \log|D; w_i, w_j|}{\log|D| - \min\{L\}},    (3.6)
where L = \{\log|D; w_i|, \log|D; w_j|\}. This metric is unbounded, taking values in [0, +\infty]. In [23], a bounded (in [0, 1]) similarity metric is proposed based on Normalized Google Distance called Google-based Semantic Relatedness, defined as:
G(w_i, w_j) = e^{-2E(w_i, w_j)}.    (3.7)
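All four co-occurrence metrics can be computed from the same four counts: the two individual document counts, the joint count and the collection size. The sketch below assumes those counts are already available (e.g., from NEAR/IND queries or a local index) and non-zero; it is illustrative only.

```python
import math

def cooccurrence_similarities(n_i, n_j, n_ij, n_total):
    """Compute Eqs. (3.3)-(3.7) from document counts:
    n_i = |D;w_i|, n_j = |D;w_j|, n_ij = |D;w_i,w_j|, n_total = |D|.
    All counts are assumed to be strictly positive."""
    jaccard = n_ij / (n_i + n_j - n_ij)                                      # Eq. (3.3)
    dice = 2 * n_ij / (n_i + n_j)                                            # Eq. (3.4)
    pmi = math.log((n_ij / n_total) / ((n_i / n_total) * (n_j / n_total)))   # Eq. (3.5)
    logs = (math.log(n_i), math.log(n_j))
    ngd = (max(logs) - math.log(n_ij)) / (math.log(n_total) - min(logs))     # Eq. (3.6)
    return {"jaccard": jaccard, "dice": dice, "pmi": pmi,
            "G": math.exp(-2 * ngd)}                                         # Eq. (3.7)
```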
Context-based similarity metrics compute similarity between feature vectors ex-
tracted from term context, i.e., using a “bag-of-words” context model, using a metric
like cosine similarity or Kullback-Leibler divergence. The basic assumption behind these
metrics is that similarity of context implies similarity of meaning, i.e., terms that appear
in similar lexical environment (left and right contexts) have a close semantic relation
[63],[56]. “Bag-of-words” [30] models assume that the feature vector consists of words
or terms that occur in text independently of each other. The context-based metrics pre-
sented here employ a context window of fixed size (H words) for feature extraction.
Specifically, the right and left contexts of length K are considered for each occurrence of a word or term of interest w in the corpus, i.e., [v_{K,L} ... v_{2,L} v_{1,L}] w [v_{1,R} v_{2,R} ... v_{K,R}], where v_{i,L} and v_{i,R} represent the i-th word to the left and to the right of w, respectively. The feature vector for word or term w is defined as T_{w,H} = (t_{w,1}, t_{w,2}, ..., t_{w,V}), where t_{w,i} is a non-negative integer and H is the context window size. Note that the length of the feature vector is equal to the vocabulary size V, i.e., all words in the vocabulary are features. The i-th feature value t_{w,i} reflects the (frequency of) occurrence of vocabulary word v_i within the left or right context window of (all occurrences of) the term w. The value of t_{w,i} may be defined as a (normalized or unnormalized) function of the frequency of occurrence of feature i in the context of w. Once the feature weighting scheme is selected, the “bag-of-words”-based metric S_H computes the similarity between two words or terms, w_1 and w_2, as the cosine similarity of their corresponding feature vectors, T_{w_1,H} and T_{w_2,H}, as follows [30]:
S_H(w_1, w_2) = \frac{\sum_{i=1}^{V} t_{w_1,i}\, t_{w_2,i}}{\sqrt{\sum_{i=1}^{V} (t_{w_1,i})^2}\,\sqrt{\sum_{i=1}^{V} (t_{w_2,i})^2}}    (3.8)
where H is the context window length and V is the vocabulary size. The cosine similarity metric assigns 0 similarity score when w_1, w_2 have no common context (completely dissimilar words), and 1 for identical words. Various feature weighting schemes can be used to compute the value of t_{w,i}. The binary weighting metric used in this work assigns weight t_{w,i} = 1 when the i-th word in the vocabulary exists at the left or right context of at least one instance of the word w, and 0 otherwise. Alternative weighting schemes such as tf-idf are more popular, but we opt for binary weights that perform best in semantic similarity tasks [31, 67] and are computationally simpler.
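With binary weights, the context vector of a word is simply the set of vocabulary words observed within the window around its occurrences, and (3.8) becomes a set-overlap computation. A short sketch under that assumption (illustrative code, not from the thesis):

```python
import math
from collections import defaultdict

def binary_context_sets(sentences, window):
    """For every word, collect the set of words seen within `window` positions
    of any of its occurrences (binary feature weighting). `sentences` is an
    iterable of token lists."""
    contexts = defaultdict(set)
    for tokens in sentences:
        for idx, word in enumerate(tokens):
            lo, hi = max(0, idx - window), min(len(tokens), idx + window + 1)
            contexts[word].update(tokens[lo:idx] + tokens[idx + 1:hi])
    return contexts

def context_similarity(w1, w2, contexts):
    """Eq. (3.8) with binary weights: the cosine of two 0/1 vectors equals
    |intersection| / sqrt(|set1| * |set2|)."""
    c1, c2 = contexts.get(w1, set()), contexts.get(w2, set())
    if not c1 or not c2:
        return 0.0
    return len(c1 & c2) / math.sqrt(len(c1) * len(c2))
```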
3.2.2 Multi-word Term Tagging
So far we have used the terms “word” and “term” interchangeably when referring to the
targets of the method described in Section 3.2.1. The method has no requirement that
would limit us to estimating word ratings or even limit us to the English language: it can work for any term of any length and for any language as long as we have a starting affective lexicon and an appropriately large text corpus. When applying the method to bigrams, only the semantic similarity metric has to be extended to handle both unigrams and bigrams. In principle, the co-occurrence and context-based metrics d() used for unigrams can also be used to estimate the semantic similarity between n-grams².
3.2.3 Sentence Level Tagging
We assume that the affect rating of sentence s = w_1 w_2 ... w_N can be estimated via the composition [57] of the affective scores of its constituent words w_i. The simplest fusion model (and also by far the most popular) is a simple linear combination of the partial ratings:
v_a(s) = b_0 + b_1 \frac{1}{N} \sum_{i=1}^{N} v(w_i),    (3.9)
where b_0 and b_1 are trainable weights corresponding to an offset and unigrams w_i respectively. Linear fusion assumes that words should be weighted equally independently of their strong or weak affective content. As a result, a sentence containing only a few strongly polarized terms might end up having low absolute valence (due to averaging). Next, we propose a weighted average scheme, where terms with higher absolute valence values are weighted more:
v_w(s) = b_0 + b_1 \frac{1}{\sum_{i=1}^{N} |v(w_i)|} \sum_{i=1}^{N} v(w_i)^2\, \mathrm{sign}(v(w_i)),    (3.10)
² The generalization is straightforward for context-based metrics and indeed such metrics have been
successfully used to estimate the semantic similarity between multi-word terms [31]. However, for co-
occurrence based metrics that use word counts, the mean and dynamic range of similarity scores is very
different between unigrams and bigrams, making their fusion a challenge (see also Section 3.2.4). No
bigram seed words are necessary to bootstrap the model.
where sign(·) is the signum function. One could also generalize to higher powers or to other non-linear scaling functions. Next, we consider non-linear min-max fusion, where the term with the highest absolute valence value dominates the meaning of the sentence:
v_m(s) = b_0 + b_1\, v(w_z),\quad z = \arg\max_i |v(w_i)|,    (3.11)
where argmax is the argument of the maximum. One could also consider combinations of linear and non-linear fusion methods, as well as syntactic- and pragmatic-dependent fusion rules.
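The three fusion schemes of (3.9)-(3.11) differ only in how the per-word valences are pooled. A minimal sketch follows, with the trainable weights b_0 and b_1 fixed at 0 and 1 for readability (illustrative only, not code from the thesis):

```python
def linear_fusion(vals):
    """Eq. (3.9) with b_0 = 0, b_1 = 1: plain average of the word valences."""
    return sum(vals) / len(vals)

def weighted_fusion(vals):
    """Eq. (3.10): words with higher absolute valence weigh more."""
    norm = sum(abs(v) for v in vals)
    if norm == 0:
        return 0.0
    return sum(v * v * (1 if v >= 0 else -1) for v in vals) / norm

def max_fusion(vals):
    """Eq. (3.11): the word with the highest absolute valence dominates."""
    return max(vals, key=abs)
```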
The use of the simple fusion schemes proposed above with only the words of each
sentence, carries the implicit assumption of a compositional model of semantics and
affect. Specifically, estimating the affective score of a sentence is assumed to be simply
a problem of appropriately scaling the contribution of each word’s affective score to
estimate a sentence level score. Although the compositionality assumption might be
reasonable in many cases (and as we shall see in Section 3.4 produces good results),
there are many cases of compound expressions where their semantic and affective content
cannot be accurately estimated as a (weighted) sum of its words. Such examples include:
1) modifiers such as negation that can alter the meaning and (reverse) affective scores,
and 2) idiomatic multi-word expressions that cannot be semantically parsed word-for-
word³. To address these concerns, we extend the above models to using terms (of length
n) instead of just words: a model using n-grams instead of unigrams (words) will attempt
to combine the partial ratings of all overlapping n-grams within a sentence.
³ Note that deviation from the expected meaning and affective content of a multi-word expression may
also occur due to contextual or pragmatic constraints, e.g., “wicked” can have high positive valence in
certain contexts. However, such semantic/affective variability can occur both for words and multi-word
expressions and are not treated directly here.
3.2.4 Fusion of n-gram models
In this section, we attempt to improve on the performance of unigram- and bigram-only
affective models by utilizing them as building blocks to create models that employ un-
igrams and bigrams. The proposed fusion algorithms are motivated by language mod-
eling. Here, instead of n-gram probabilities (for language models), we are combining
affective scores. The main fusion strategies, however, are similar: 1) interpolation of
the valence scores of the unigrams and bigrams (or higher-order), and 2) back-off from
bigrams to the unigrams when a certain criterion is satisfied. Much like language mod-
eling the back-off criterion should be related to n-gram counts. For affect, additional
criteria may be devised that are related with the “degree of compositionality” (semantic
or affective) of each n-gram. For bigrams that appear rarely in our corpus it may be
advantageous to back-off to a unigram where adequate statistics to accurately estimate
affective scores exist.
3.2.4.1 Interpolation
For sentences that consist of the word sequence w_1 w_2 ... w_N we create a unigram (θ_1) and a bigram (θ_2) affective model, respectively, that estimate the sentence level affective score as follows⁴:
v(s \mid θ_1) = \frac{1}{N} \sum_{i=1}^{N} v(w_i),\qquad v(s \mid θ_2) = \frac{1}{N-1} \sum_{i=1}^{N-1} v(w_i w_{i+1})    (3.12)
⁴ For simplicity, we only present the equations for the simple linear model. It is straightforward to generalize to non-linear fusion schemes.
where the valence v(w_i) of word w_i and the valence v(w_i w_{i+1}) of bigram w_i w_{i+1} are both estimated using Eq. (3.1). We combine the scores of the unigram and bigram models as follows:
v_{in}(w_i w_j) = b_1\, v(w_i w_j \mid θ_1) + b_2\, v(w_i w_j \mid θ_2),    (3.13)
v_{in}(s) = b_0 + \frac{1}{N}\left[\frac{b_1}{2}\big(v(w_1) + v(w_N)\big) + \sum_{i=1}^{N-1} v_{in}(w_i w_{i+1})\right]    (3.14)
where b_i are linear weights that can be estimated via machine learning on held-out data and the term \frac{1}{2} b_1 (v(w_1) + v(w_N)) serves the need to use each unigram in the sentence an equal amount of times (by adding the ratings of the first and last unigram). It is straightforward to extend the proposed method to higher order n-gram models.
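As an illustration of (3.13)-(3.14), the sketch below interpolates the two models over all overlapping bigrams and adds half of the first and last unigram scores. It assumes the unigram-model score of a bigram is the mean of its two word scores, which is what makes the end-correction give every unigram equal weight; this reading and the function names are assumptions, not taken from the thesis.

```python
def interpolated_sentence_score(words, v_uni, v_bi, b0=0.0, b1=1.0, b2=1.0):
    """Sketch of Eqs. (3.13)-(3.14). `v_uni` maps a word to its valence,
    `v_bi` maps a bigram (tuple of two words) to its valence."""
    n = len(words)
    total = 0.5 * b1 * (v_uni[words[0]] + v_uni[words[-1]])    # end-correction term
    for w1, w2 in zip(words, words[1:]):
        uni_part = 0.5 * (v_uni[w1] + v_uni[w2])               # bigram under the unigram model
        total += b1 * uni_part + b2 * v_bi[(w1, w2)]           # Eq. (3.13)
    return b0 + total / n                                      # Eq. (3.14)
```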
3.2.4.2 Back-off
Here instead of interpolating the affective scores of different n-gram models, we propose
a criterion for alternating between the unigram and bigram model [43]. Specifically we
define the selection criterion c(i, j) for bigram w_i w_j; we utilize bigram w_i w_j if c(i, j) is larger than some threshold t or revert to the unigrams w_i and w_j otherwise, i.e.,
v_{bo}(w_i w_j) = \begin{cases} b_1\, v(w_i w_j \mid θ_1), & c(i, j) \le t \\ b_2\, v(w_i w_j \mid θ_2), & c(i, j) > t \end{cases}    (3.15)
where b_1 and b_2 are the trainable weights of the unigram and bigram models respectively. After performing term selection, we combine the scores:
v_{bo}(s) = b_0 + \frac{1}{N}\left[\frac{1}{2} b_1\big(v(w_1) + v(w_N)\big) + \sum_{i=1}^{N-1} v_{bo}(w_i w_{i+1})\right].    (3.16)
The criterion c(i, j) for selecting the appropriate n-gram model utilizes both the frequency of occurrence of the n-gram in our corpus and the degree of compositionality of the n-gram. Specifically, the following criteria are proposed:
1. The probability of occurrence of the bigram w_i w_j:
c_p(i, j) = p(w_i w_j).    (3.17)
2. A mutual information-like criterion that measures the probability of co-occurrence of words w_i and w_j (a simple measure of compositionality):
c_m(i, j) = \frac{p(w_i w_j)}{p(w_i)\, p(w_j)}.    (3.18)
3. The absolute difference between the valence scores estimated via the bigram and unigram models (a measure of affective compositionality):
c_{nc}(i, j) = \left|v(w_i w_j) - 0.5\,[v(w_i) + v(w_j)]\right|.    (3.19)
Note that the n-gram frequency-based criterion c_p() can be combined with the degree of compositionality criteria c_m() and/or c_{nc}() producing the following criteria:
c_{ts}(i, j) = p(w_i w_j)\, \log c_m(i, j),\qquad c_{pnc}(i, j) = p(w_i w_j)\, \log c_{nc}(i, j).    (3.20)
The thresholds t_p, t_m, ..., t_{pnc} are estimated for each criterion on held-out data.
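The criteria only require n-gram probabilities from the corpus and the already estimated valences. A small sketch follows (hypothetical helper; probabilities and valences are assumed precomputed, and quantities passed to the logarithm are assumed strictly positive):

```python
import math

def backoff_criteria(w_i, w_j, p_uni, p_bi, v_uni, v_bi):
    """Bigram selection criteria of Eqs. (3.17)-(3.20). `p_uni`/`p_bi` map
    words/bigrams to occurrence probabilities, `v_uni`/`v_bi` to valences."""
    c_p = p_bi[(w_i, w_j)]                                            # Eq. (3.17)
    c_m = c_p / (p_uni[w_i] * p_uni[w_j])                             # Eq. (3.18)
    c_nc = abs(v_bi[(w_i, w_j)] - 0.5 * (v_uni[w_i] + v_uni[w_j]))    # Eq. (3.19)
    return {
        "c_p": c_p,
        "c_m": c_m,
        "c_nc": c_nc,
        "c_ts": c_p * math.log(c_m),                                  # Eq. (3.20)
        "c_pnc": c_p * math.log(c_nc),                                # Eq. (3.20)
    }
```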
3.2.4.3 Weighted Interpolation
Weighted interpolation extends the interpolation and back-off models. Similarly to the
back-off model we use a compositionality criterion c(i, j) for bigram w_i w_j, however in weighted interpolation bigram and corresponding unigram ratings are interpolated when c(i, j) is over a threshold t:
v_{wi}(w_i w_j) = \begin{cases} V_1, & c(i, j) \le t \\ V_1 + V_2, & c(i, j) > t \end{cases}    (3.21)
where V_1 = b_1\, v(w_i w_j \mid θ_1) and V_2 = b_2\, v(w_i w_j \mid θ_2). The final collection of terms will include all unigrams in the sentence and some bigrams (appropriately weighted). As before, we combine the selected terms to produce the final sentence rating:
v_{wi}(s) = b_0 + \frac{1}{N}\left[\frac{1}{2} b_1\big(v(w_1) + v(w_N)\big) + \sum_{i=1}^{N-1} v_{wi}(w_i w_{i+1})\right].    (3.22)
3.3 Experimental Procedure
Next we present the corpora used for training and evaluation of the proposed algorithms.
In addition, the experimental procedure for semantic similarity computation, affective
lexicon creation and sentence-level affective score computation is outlined.
3.3.1 Corpora
The main corpus used for creating the affective lexicon is the Affective Norms for English
Words (ANEW) dataset. ANEW consists of 1034 words, rated in 3 continuous dimen-
sions of arousal, valence and dominance. In this work, we only use the valence ratings
provided in ANEW⁵. Looking at quantized values, the dataset contains 586 positive and
448 negative words.
The second corpus used for evaluation of the affective lexicon creation algorithm
is the General Inquirer (GINQ) corpus that contains 2005 negative and 1636 positive
words. The General Inquirer corpus was created by merging words with multiple entries
in the original lists of 2293 negative and 1914 positive words. It is comparable to the
dataset used in [78, 79]. After removing the words that overlap with ANEW, we are left
with 1443 positive and 1754 negative words.
⁵ The method is applicable to arousal and dominance, however for the purposes of this work we focus
on the more popular dimension of valence. Valence was selected over arousal and dominance due to its
greater applicability and larger volume of prior work enabling comparisons.
To evaluate the lexicon creation method on a non-English dictionary, we used the
Berlin Affective Word List Reloaded (BAWL-R) dataset. BAWL-R contains 2902 Ger-
man words annotated in continuous scales (we use only valence). In quantized form, the
set contains 1636 positive and 1266 negative words.
For the sentence level tagging task the SemEval 2007: Task 14 corpus is used [70].
This SemEval corpus contains news headlines, 250 in the development set which are
used for training and 1000 in the testing set which are used for evaluation. The headlines
are manually rated in a fine-grained valence scale of [−100, 100], which is rescaled to [−1, 1] for our experiments. In quantized form the set contains 474 positive and 526 negative samples.
3.3.2 Corpus creation and Semantic similarity
In our experiments we utilized four different similarity metrics based on web co-occurrence,
mentioned in Section 3.2.1, namely, Dice coefficient, Jaccard coefficient, point-wise mu-
tual information (PMI) and Google-based Semantic Relatedness as well as a single con-
textual similarity metric, cosine similarity with binary weights.
All similarity metrics employed require a corpus in order to calculate frequency
statistics or collect context. In this work we use three corpora derived from the web and
created by submitting queries to the Yahoo! search engine and collecting the response.
The first corpus is the web, which is only used to compute co-occurrence based sim-
ilarities. Co-occurrence based similarity metrics require the individual (IND) words’
number of occurrences as well as the number of times that the two words co-exist within
a set distance. Usually this distance is unlimited (anywhere within a document); this
method is used by the AND operator of web search engines. However it is possible to
limit that distance, e.g., the Altavista NEAR operator used to obtain co-occurrence in
[79] limited co-occurrence to a distance of 10 words. The alternative we used was the
Yahoo! NEAR operator, which was an undocumented feature of the Yahoo! engine. This
corpus will be henceforth referred to as “web”.
Using the web directly poses practical challenges. The vast number of queries re-
quired can have a significant cost (in terms of both time and monetary cost). More
importantly, the desirable distance-limited joint queries are not supported by most search
engines: we obtained enough data for our experiments from the Yahoo! engine, however
as of this writing the Yahoo! engine no longer supports the NEAR operator. To allevi-
ate these problems we created two more corpora by posing IND queries to the Yahoo!
search engine and collected the top |D| (if available) snippets (the short excerpts (page
samples) shown under each result, typically one or two sentences automatically selected
by the search engine) for each word. Each snippet contains at least two sentences from
the result: the title and a preview of the content. The second and third corpora were built
using this process.
The second corpus is task dependent: we created a vocabulary that contained all
words in (all) our evaluation corpora, posed a single IND query for each of them to
Yahoo! and collected 1000 snippets (where available) from each query. The corpus
contains 14 million sentences and was indexed using the Lucene indexing engine [1],
effectively creating a local search engine. This 14m corpus was used to compute both
co-occurrence based and context-based similarities. Using the 14m snippet corpus one
can emulate hit counts obtained by NEAR queries (on the “web” corpus), e.g., by esti-
mating co-occurrence counts within the same sentence. To compute context-based simi-
larities, the left and right contexts of all occurrences of w_1 and w_2 are examined and the corresponding feature vectors are constructed. The parameters of context based metrics are the number |D| of web documents used and the size K of the context window. In all experiments presented in this work |D| = 1000, whereas the values used for K are 1, 2, 5 and 10. This corpus will be noted henceforth as 14m.
The third corpus is created similarly to the second one, by collecting snippets, how-
ever it is task-independent: to create it we used a vocabulary of the English language,
specifically the one that comes with the Aspell spellchecker [2] containing 135,433
words. For each of them we posed an IND query and collected up to 500 snippets.
The final corpus contains 116 million sentences and will be noted henceforth as 116m.
As with 14m, the downloaded text was indexed with Lucene and used to compute both
co-occurrence based and context-based similarities. This corpus was created⁶ as part of the EU-IST PORTDIAL project, http://www.portdial.eu/.
3.3.3 Affective Lexicon and Word Affective Ratings
The following tasks and associated experimental setup have been used for model training
and performance evaluation in this work:
• ANEW-CV: 10-fold cross-validation on the ANEW dataset, i.e., model training
and evaluation on the ANEW dataset.
• GINQ-PD: model training on the ANEW dataset, evaluation on GINQ dataset.
• BAWLR-CV: 10-fold cross-validation on the BAWL-R dataset.
In all cases the seed words were selected from the training set (training fold in the case
of cross-validation), therefore on cross-validation experiments the seeds are different for
each fold. Given a set of candidate seeds (in most cases the entire training set), we
applied a simple method to select the desired seeds. Looking at Turney and Littman's method [78], and also confirmed by our experiments, it seems that good seeds need to have a high absolute valence rating. It also proved beneficial to ensure that the seed
set is as close to balanced (sum of seed valence is zero) as possible. Therefore our
selection method started by sorting the positive and negative seeds separately by their
valence rating. Then positive and negative seeds were added to the seed set iteratively
so as to minimize the absolute value of the sum of their valence ratings, yet maximize
their absolute valence ratings (or frequencies), until the required number N was reached.
More on seed selection is given in Section 3.4.3.
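The selection procedure just described can be written as a short greedy loop: candidates are considered in order of decreasing absolute valence, and at each step the one that keeps the running sum of seed valences closest to zero is added. The sketch below is one possible reading of that procedure, not code from the thesis.

```python
def select_seeds(lexicon, n_seeds):
    """Greedy, balance-aware seed selection. `lexicon` maps word -> valence."""
    pos = sorted((w for w, v in lexicon.items() if v > 0), key=lambda w: -lexicon[w])
    neg = sorted((w for w, v in lexicon.items() if v < 0), key=lambda w: lexicon[w])
    seeds, balance = [], 0.0
    while len(seeds) < n_seeds and (pos or neg):
        # prefer whichever highest-|valence| candidate keeps the sum closest to zero
        take_pos = bool(pos) and (not neg or abs(balance + lexicon[pos[0]])
                                  <= abs(balance + lexicon[neg[0]]))
        word = pos.pop(0) if take_pos else neg.pop(0)
        seeds.append(word)
        balance += lexicon[word]
    return seeds
```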
⁶ The main motivation behind creating this large task independent corpus is that the performance of semantic similarity metrics has been shown to improve due to better coverage of rare word senses of common words and more uniform word occurrence probabilities. For more details see [32].
The semantic similarity between each of the N seed words and each of the words in
the test set (“unseen” words) was computed, as discussed in the previous section. Next
for each value of N, the optimal weights of the linear equation system matrix in (3.2)
were estimated using LSE. Finally, for each word in the test set the valence ratings were
computed using (3.1) and evaluated against the ground truth.
A toy training example using N = 10 features and the Google semantic relatedness co-occurrence based metric is shown in Table 3.2. The second column v(w_i) shows the manually annotated valence of word w_i, while the third column a_i shows the corresponding linear weight computed by the LSE algorithm. Their product (final column) v(w_i) a_i is a measure of the affective “shift” of the valence of each word per “unit of similarity” to that seed word (see also (3.1)). The last row in the table corresponds to the bias term a_0 in (3.1) that takes a small positive value. Note that the coefficients a_i take positive values and are not bounded in [0, 1], although similarity metrics are bounded at [0, 1] and target valence values are also bounded in [−1, 1]. There is no obvious intuition behind the a_i scores, e.g., it is not clear why “suicide” should receive much higher weighting than “funeral”. The weights might be related to the semantic and affective variance of the seed words.
The following objective evaluation metrics were used to measure the performance of the affective lexicon expansion algorithm: (i) Pearson correlation between the manually labeled and automatically computed valence ratings and (ii) binary classification accuracy of positive vs. negative ratings, i.e., the continuous ratings produced are converted to binary decisions and compared to the ground truth. Statistical significance testing was conducted using the paired-sample t-test (right-sided) for the cross-validation experiments and McNemar's test for the non-cross-validation experiments. Unless mentioned otherwise, we set the statistical significance threshold at p < 0.001.
Table 3.2: Training sample using 10 seed words.
w_i            v(w_i)    a_i     v(w_i) a_i
triumphant      0.96     1.48      1.42
rape           -0.94     0.72     -0.67
love            0.93     0.57      0.53
suicide        -0.94     3.09     -2.91
paradise        0.93     1.77      1.65
funeral        -0.90     0.53     -0.48
loved           0.91     1.53      1.40
rejected       -0.88     0.50     -0.44
joy             0.90     1.00      0.90
murderer       -0.87     1.99     -1.73
w_0 (offset)    1       -0.06     -0.06
3.3.4 Sentence Affective Ratings
The SemEval'07-Task 14 corpus was used to evaluate the various n-gram fusion methods. All unseen words/terms in the sentence corpus were added to the lexicon using the affective lexicon expansion algorithm outlined above (3983 unigrams and 6630 bigrams overall). The model used to create the required ratings was trained using all of the words in the ANEW corpus as training samples and N of them as seed words. The ratings of the terms in each sentence were then combined to create the sentence rating. We employed content word selection when considering unigram terms: unigrams that were not nouns, verbs, adjectives or adverbs were ignored. To identify content words, part-of-speech tagging was performed using TreeTagger [66]. A toy example can be seen in Fig. 3.1.
Figure 3.1: Example of word rating fusion, showing the per-word ratings and the phrase ratings produced by the three unigram fusion schemes: "watching (0.57) cute (0.71) puppies (0.50) makes (0.00) me (0.11) happy (0.82)" yields linear: 0.41, weighted average: 0.64, max: 0.82.
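A small sketch of the three unigram fusion schemes applied to ratings like those in Fig. 3.1; the weighted-average variant shown here weights each word by the magnitude of its rating, which is one plausible reading of the scheme in Section 3.2.3 rather than its exact definition.

    def fuse_unigrams(ratings):
        """Combine per-word valence ratings into a single sentence rating.

        ratings: list of valence scores for the content words of a sentence.
        """
        linear = sum(ratings) / len(ratings)
        # Weight each word by the magnitude of its rating (extreme words count more).
        norm = sum(abs(r) for r in ratings)
        weighted = sum(r * abs(r) for r in ratings) / norm if norm > 0 else 0.0
        # Keep the single most extreme rating, preserving its sign.
        extreme = max(ratings, key=abs)
        return {"average": linear, "weighted average": weighted, "max": extreme}

    print(fuse_unigrams([0.57, 0.71, 0.50, 0.00, 0.11, 0.82]))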
In order to evaluate the performance of the sentence level affective scores we used
the classification accuracy for the 2-class (positive, negative) problem. Statistical signif-
icance testing was conducted using McNemar’s test.
Figure 3.2: Performance of the affective lexicon creation algorithm using similarities based on co-occurrence counts from the 116m corpus. Correlation for the ANEW-CV experiment, as a function of the number of seeds, using: (a) a linear kernel (metrics C, J, G, I vs. the baseline) and (b) a square root kernel (sqrt(C), sqrt(J), sqrt(G), sqrt(I) vs. the baseline).
3.4 Results
In this section, we evaluate the proposed algorithms on a variety of word- and sentence-
level affective tasks. The following issues are investigated: i) the relative performance
of the co-occurrence and context-based semantic similarity metrics for estimating con-
tinuous valence ratings of words, ii) the effect of corpus size and type on performance,
iii) how to select seed words for the affective model and iv) the performance of various
unigram- and bigram-level fusion strategies (interpolation, back-off, weighted interpola-
tion) for obtaining sentence-level affective ratings.
Figure 3.3: Performance of the affective lexicon creation algorithm using co-occurrence based similarities. Correlation for the ANEW-CV experiment using: (a) the 116m corpus and different window sizes at 150 seeds (co-occurrence accepted with no bound, up to a given distance, or at exactly a given distance) and (b) corpora of different sizes (G/14m, G/116m, G/web vs. the baseline), as a function of the number of seeds.
3.4.1 Baseline Performance
The baseline performance for word-level affective tasks is that of the method proposed in [78, 79]. The 14 words shown in Table 1 were used as seeds, together with co-occurrence similarities (the mutual information metric I) estimated via NEAR web queries; note that using NEAR conjunctive queries was essential to achieving good performance with this method. In [79], the binary classification accuracy for the GINQ dataset was reported at 82.8%; our implementation yielded somewhat higher performance at 84%. In addition, the same setup was run for the ANEW task, achieving 0.66 correlation and 82% binary accuracy. We do not report baseline performance for the BAWL-R experiment, since no seed words were proposed for German in [79].
Figure 3.4: Performance of the affective lexicon creation algorithm using context-based similarities. Correlation for the ANEW-CV experiment as a function of the number of seeds using: (a) the 116m corpus and different window sizes (S_1, S_2, S_5, S_10) and (b) a window size of 1 and corpora of different sizes (S_1/116m, S_1/14m).
3.4.2 Similarity metric selection
The first and arguably most important parameter of the affective model detailed in Section 3.2.1 is the semantic similarity metric used. The related method in [79] uses the mutual information similarity metric I estimated via NEAR web queries. In our initial experiments, we observed significant performance differences between various co-occurrence based metrics, e.g., see [44]. In addition to the type of similarity metric used, the similarity estimation method also significantly affects performance, most importantly the size and type of corpus used to calculate statistics and the term proximity requirements. All experiments reported in this section are for the ANEW-CV task, using correlation with human judgments as the performance metric.
The co-occurrence based similarity metrics used in this work are the same as those used in [44]; however, the method for estimating co-occurrence counts has been updated. Specifically, we can estimate co-occurrence counts either using web hits (web) or on a corpus of snippets created via web queries over vocabulary lists of various sizes (14m, 116m). The performance of the Dice C, Jaccard J, mutual information I and Google G co-occurrence metrics, calculated over the 116m corpus, is shown in Fig. 3.2(a) as a function of the number of seed words, evaluated on ANEW-CV using a linear model kernel. (Since this corpus is composed of independent sentences rather than full documents, the co-occurrence statistics are very similar to the result of a NEAR query: we only get a hit if the two terms co-occur within the same sentence. Seed selection is performed here using the heuristic of maximum absolute valence and zero mean valence over all seed words, as detailed in Section 3.3.3.) The relative performance of the similarity metrics is similar to that reported in [44], with I and G performing significantly better (p < 10^-16) than J and C. All metrics perform better than the baseline (see Section 3.4.1) provided that a few hundred seeds are used to bootstrap the model. Note, however, that metrics J and C are more robust to the seed selection process: their performance is flat over a wide range of numbers of seeds. Their lower overall performance can be rectified by using the model kernels defined in Table 3.1. Performance using a square root kernel is shown in Fig. 3.2(b). The non-linear rescaling has a significant effect (p < 10^-16) on performance when using the C and J similarity metrics: with a logarithmic or square root kernel they can reach or, in some cases, overtake G and I, though overall G and I still prove to be better choices. Kernels can also improve the performance of the best performing similarity metrics, but the differences are smaller and less consistent. G and I perform very similarly in most cases, with a slight edge to G.
From previous work, e.g., [74, 44], it is clear that word proximity is an important feature when estimating semantic similarities for the affective model. Restricting the search engine so that it registers a co-occurrence when w_i and w_j occur within a small distance (NEAR queries), rather than whenever they co-occur within a document at any distance (AND queries), provides a noticeable performance boost. Next we investigate the optimal co-occurrence distance. For this experiment we estimated similarities using the (best-performing) G similarity metric on the 116m corpus. Results are reported on the ANEW-CV task for various distance requirements: accepting a co-occurrence if the term distance is up to n, or alternatively if the term distance is exactly n. The results are shown in Fig. 3.3(a) as a function of the number of seeds. As expected, close proximity is an important feature: the best performance of the "equal to" experiment is achieved for distance 2, while for the "up to" experiment there is virtually no performance gain at distances over 5. There is no performance drop when moving to larger distances, but this is an artifact of the snippet corpus; moving beyond sentence boundaries, e.g., defining co-occurrence at the document level, significantly reduces the performance of semantic similarity based affective models, as shown in [74, 44].
Corpus size and type also significantly affect similarity estimation and model performance. In Fig. 3.3(b), we report the correlation performance on the ANEW-CV task using the G metric estimated on each of the three available corpora (web, 14m and 116m). NEAR queries are used to obtain the co-occurrence statistics for the web corpus, while co-occurrence at the snippet level is computed for the 14m and 116m corpora. Performance for similarities estimated on the large corpus (116m) is significantly better (p < 10^-3 over 300 seeds) than for the small corpus (14m). The 116m and web corpora have similar performance for a few hundred seeds (around 300 seeds), yet the 116m corpus achieves better performance for fewer seeds and a higher top performance. These results further validate the use of a corpus as a substitute for web NEAR queries, which, as discussed in Section 3.3.2, relied on an undocumented feature of the Yahoo! engine that has since been removed.
Next we investigate the performance of context based similarity metrics as a function of the context window length K and corpus size (14m, 116m). Correlation performance on the ANEW-CV task is shown in Fig. 3.4(a) for context based metrics with window lengths of K = 1, 2, 5, 10. The best performance is consistently obtained for small window sizes of K = 1, 2. This is consistent with the results in [31, 32], where K = 1 provided the best performance for a word-level semantic similarity task. Different lexical weighting schemes for the context vectors were also investigated (not reported here); as expected, the binary weighting scheme performed best for affect classification, as was also the case for the semantic similarity estimation tasks in [31, 32]. Correlation performance when context based similarities are estimated on the 14m or 116m corpus is shown in Fig. 3.4(b) as a function of the number of seed words. Estimating context vectors on the larger corpus significantly (p < 10^-3 over 100 seeds) outperforms the smaller corpus. For a more detailed analysis of why a large task-independent corpus is expected to provide better performance on semantic similarity estimation tasks, see [32].
Based on the results reported in this section, henceforth we focus our attention on the Google co-occurrence based semantic similarity metric G and the binary weighted context based semantic similarity metric S_1 with context window K = 1. Next, results are reported in terms of correlation and binary classification accuracy for G and S_1 on a variety of word-level and sentence-level affective tasks.
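For concreteness, the S_1 metric can be sketched as follows: binary context vectors collected with a window of one word on each side, compared with the cosine similarity. The tokenization and corpus handling below are heavily simplified relative to the actual pipeline.

    import math
    from collections import defaultdict

    def build_context_sets(sentences, window=1):
        """Collect, for every word, the set of words seen within `window` positions."""
        contexts = defaultdict(set)
        for sentence in sentences:
            tokens = sentence.lower().split()
            for i, tok in enumerate(tokens):
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                contexts[tok].update(t for j, t in enumerate(tokens[lo:hi], lo) if j != i)
        return contexts

    def s1_similarity(contexts, w1, w2):
        """Cosine similarity between the binary context vectors of w1 and w2."""
        c1, c2 = contexts.get(w1, set()), contexts.get(w2, set())
        if not c1 or not c2:
            return 0.0
        return len(c1 & c2) / math.sqrt(len(c1) * len(c2))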
3.4.3 Seed word selection
Seed words act as points of reference in affective space, relative to which all other words are rated. As such, their selection from a set of candidates is an important step of the rating creation process. Next we try to answer the question of what the qualitative features of a "good" seed word, or a good set of seed words, are. For this purpose, we used a supervised feature selection method in the form of a wrapper and evaluated the automatically selected seed word sets against a range of potentially relevant factors: (i) number of possible part-of-speech tags, (ii) number of possible senses, (iii) seed word frequency of occurrence, (iv) mean and standard deviation of the semantic similarities, (v) standard deviation of valence, (vi) absolute value of valence and (vii) the valence rating of the seed word. The number of part-of-speech tags and word senses was estimated from WordNet. The mean and standard deviation of semantic similarity scores were estimated between each seed word and all words in the ANEW dataset using the G or S_1 semantic similarity metric. The valence, absolute valence and standard deviation of valence (where the standard deviation was computed over all human annotations of each word) were taken from the ANEW dataset.
The experiment was conducted as a modification of the ANEW-CV experiment. For each of the 10 folds, we performed an internal 10-fold cross-validation experiment (splitting only the training set into 10 folds, an approach referred to as double-loop cross-validation) and used the performance of different seed sets in the internal loop to select a seed set for that fold of the external loop. The seed set search strategy was forward, best-first: starting from an empty set, we generated bigger sets by adding one seed word at a time (the one that most improves the previous performance), with no substitutions or deletions. The criterion for seed selection was the mean square error in the internal experiment. We ran the process up to N = 150 seeds, creating 10 ordered seed sets of length 150, one for each fold of the external loop, and evaluated the final performance on the external loop experiment.
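A schematic version of this forward, best-first search follows; the `evaluate` callable stands in for the internal cross-validation loop (mean square error of the model trained with a candidate seed set) and is assumed to be supplied by the caller.

    def forward_seed_selection(candidates, evaluate, max_seeds=150):
        """Greedy forward selection: repeatedly add the candidate seed that most
        reduces the internal cross-validation error; no substitutions or deletions.

        candidates: list of candidate seed words
        evaluate:   callable mapping a seed list to an error score (lower is better)
        """
        selected, remaining = [], list(candidates)
        while remaining and len(selected) < max_seeds:
            scored = [(evaluate(selected + [c]), c) for c in remaining]
            best_error, best_seed = min(scored)
            selected.append(best_seed)
            remaining.remove(best_seed)
        return selected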
Correlation performance on the ANEW dataset is shown in Fig. 3.5(a) and Fig. 3.5(b) when using the G and S_1 similarity metrics, respectively, over the 116m corpus. As a comparison, we provide the performance attained when using our unsupervised selection method based on absolute valence and seed set balance, as detailed in Section 3.3.3. There is a clear benefit to using a wrapper: performance is significantly (p < 10^-16 under 50 seeds) better when using a small number of seed words, and the model reaches optimal performance with fewer seed words. However, the performance benefit dissipates fast (at 150-200 seed words) and, while a wrapper will reach optimal performance faster, that optimal performance is not significantly higher than that achieved by a model using our unsupervised selection method, especially for the G metric.
To identify the features that make a good seed, we looked at the rank distributions of the selected seed words across the various factors, shown in Fig. 3.5(c) and Fig. 3.5(d) when using the G and S_1 similarity metrics, respectively. To make the results clearer we used only the top 50 seeds selected for each fold, for a total of 500 samples. Box plots range from the 25th to the 75th percentile of each distribution, while the dot in the box indicates the distribution median. In both cases, valence is the most relevant factor defining a good set of seed words: the selected seed words have very high absolute valence ratings and a very narrow range of possible affective interpretations (low standard deviation of valence). The selected seed sets are also close to balanced (high absolute valence of individual seeds combined with high valence variance across the set).
For all the experiments that follow we use the unsupervised seed selection method, since: i) the absolute valence heuristic is validated by the results in Fig. 3.5(c),(d), and ii) the performance gap between the unsupervised selection method and the double-loop cross-validation method is small when over 100 seeds are used. However, note that if using a very small set of seed words is a priority, a supervised seed selection algorithm can achieve good performance with as few as 20-40 seeds. Note also that the correlation performance of the supervised seed selection algorithm is significantly (p < 10^-8) higher than the baseline method of [79] (solid blue line in Fig. 3.5(a),(b)) for the same number of seeds (14).
3.4.4 Word Affective Ratings
In this section, we report results using the unsupervised seed selection method, the 116m corpus and the best performing similarity metrics G and S_1 with a linear kernel, to evaluate the overall performance of the method on a variety of word level affective tasks. In Fig. 3.6, 2-class classification accuracy is shown for the binary word polarity detection ANEW-CV (a) and GINQ-PD (b) tasks. In Fig. 3.7, correlation performance is shown for the continuous polarity rating estimation ANEW-CV (a) and BAWLR-CV (b) tasks. Results are shown for the similarity metrics G (estimated on the web and 116m corpora) and S_1 (estimated on the 14m and 116m corpora), as well as the baseline performance of the method described in [79]. For the German task (BAWLR-CV), results for the G and S_1 metrics estimated on the 170m corpus are reported. For the ANEW-CV experiment, correlation with the ground truth of up to 0.87 and binary classification accuracy of up to 91% is achieved using the context based similarity S_1 estimated on the large 116m corpus. For the GINQ-PD experiment, the best performance achieved is a classification accuracy of 87.3% for the S_1 metric estimated on the 116m corpus. Comparable results for this experiment found in the literature include 82.8% [79], 81.9% [76] and 82.1% [25]. In the BAWLR-CV experiment the model reaches 0.82 correlation with the ground truth.
Of note is the good performance and robustness of the context-based metrics across all experiments; they perform clearly better and provide a model that is stable with respect to the seed selection process. In fact, the model continues to improve even when adding sub-optimal seeds, and does not exhibit the large performance drop of co-occurrence based models for large numbers of seeds. Also important is the ability of the method to perform well when applied to a different language (German) in the BAWLR-CV experiment. Though in absolute terms performance in BAWLR-CV is lower than in the comparable English experiment of ANEW-CV, performance is still good, particularly considering that the proposed model and similarity estimation process are language-agnostic. (Although we have not yet performed a detailed evaluation of semantic similarity metrics for German, preliminary experiments indicate that context based semantic similarity metrics perform worse for morphologically rich languages, probably due to the larger number of word forms in these languages.)
3.4.5 Sentence Affective Ratings
For the sentence level affective rating task, we started from (a subset of) the seed words of the ANEW dataset and performed lexicon expansion for all unigrams and bigrams in our sentence corpus. Specifically, G and S_1 semantic similarities were estimated between all unigrams and bigrams in the sentence corpus and the ANEW seed words. Similarity metrics were estimated on the 116m corpus. The affective model was then used to create ratings for all unigrams and bigrams in the sentence corpus. These affective ratings were then combined using one of the (linear, weighted, max) fusion methods described in Section 3.2.3. Unigram and bigram affective ratings were fused using one of the methods defined in Section 3.2.4, i.e., interpolation, back-off or weighted interpolation.
The Least Squares Estimation (LSE) algorithm was used to estimate the unigram and bigram weights on held-out data (the SemEval development set); note that only the n-gram fusion weights were estimated on the sentence development set, while the affective model seed weights were estimated on the ANEW dataset. The various sentence level affective models were then evaluated on the sentence corpus (the SemEval test set).
3.4.5.1 Baseline Performance
In order to establish a baseline, we used the fusion schemes defined in Section 3.2.3, using only unigram terms. Sentence level classification accuracy as a function of the number of seeds is shown in Fig. 3.8 for the G and S_1 metrics estimated on the 116m corpus. Performance peaks at about 72% for the G metric and 72.5% for the S_1 metric, an improvement over previous results reported in [44]. The improvement of the word rating algorithm and the addition of supervised training to the sentence model provide a fairly minimal improvement in performance. The simple numeric average performs better throughout our experiments and benefits, particularly in terms of stability, from the supervised training. Sentence ratings exhibit performance dynamics very similar to word-level ratings, e.g., optimal performance occurs for a similar number of seed words.
3.4.5.2 Fusion of n-gram models
Creating sentence ratings using the higher order models described in Section 3.2.4 poses a specific challenge: we need to select which terms to use from a pool of unigrams and bigrams. To do so, we used the criteria described in Section 3.2.4, i.e., interpolation, back-off and weighted interpolation. To investigate the performance of various combinations of unigrams and bigrams, we selected two specific word models (one using the G similarity metric and 300 seeds, and one using the S_1 similarity metric and 600 seeds) and applied different term selection criteria during the sentence rating creation process. Sentence level classification accuracy as a function of the bigram rejection rate (back-off rate) is shown in Fig. 3.9 for the G (a) and S_1 (b) metrics. The figures can be read as follows: we start at the bottom left, with the models using only bigram terms. Then we move to the right by replacing bigrams with unigrams (back-off) according to the selection criterion, until we reach the right edge, where the models use only unigram terms (baseline). From that point we move back towards the upper-left corner by keeping all unigram terms and adding increasingly more bigrams (weighted interpolation), again based on the selection criterion, until we reach the left edge, where the model uses all unigrams and bigrams (interpolation).
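The following sketch illustrates how the three strategies decide which n-gram ratings enter the sentence model, given a per-bigram score under one of the selection criteria and a rejection threshold; the data layout and names are illustrative assumptions.

    def select_terms(unigrams, bigrams, scores, threshold, mode):
        """Choose which n-gram ratings to fuse into the sentence rating.

        unigrams:  list of unigram terms in the sentence
        bigrams:   list of bigram terms in the sentence
        scores:    dict mapping each bigram to its selection criterion value
        threshold: bigrams scoring below this are rejected
        mode:      'interpolation', 'backoff' or 'weighted'
        """
        if mode == "interpolation":            # keep all unigrams and all bigrams
            return unigrams + bigrams
        kept = [b for b in bigrams if scores[b] >= threshold]
        if mode == "weighted":                 # all unigrams plus the selected bigrams
            return unigrams + kept
        if mode == "backoff":                  # selected bigrams replace their unigrams
            covered = {w for b in kept for w in b.split()}
            return [u for u in unigrams if u not in covered] + kept
        raise ValueError(mode)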
Performance when using only bigrams (dotted cyan line) is noticeably lower than when using only unigrams (dotted purple line), probably due to the lack of bigram seeds to bootstrap the affective model. Despite this shortcoming, combining bigrams with unigrams significantly (p < 10^-3 at 80% bigram rejection in both cases) improves the performance of the affective model over the unigram baseline. The performance gain is noticeably higher when using the context based semantic similarity S_1, as shown in Fig. 3.9(b). The performance gap between G and S_1 is also large for the bigram-only experiment, 66% vs. 68.5%; as mentioned in Section 3.2.2, there are similarity metric scaling issues when creating bigram ratings that are more pronounced when using co-occurrence based similarities.
In Fig. 3.9, classification accuracy is shown only for the two best term selection criteria, c_ts and c_nc, to improve readability. The two criteria detect terms in very different ways: c_ts is often used for term extraction or compound detection (i.e., non-compositional semantic constructs), while c_nc estimates the degree of affective non-compositionality. The semantic criterion provides the best absolute performance when using a back-off model, while there is no clear winner when using the weighted interpolation model. Focusing on the back-off model performance for S_1 in Fig. 3.9(b), we observe a significant (p < 10^-6 at 70% bigram rejection) improvement over the unigram baseline: accuracy improves from 72.4% when using only unigrams to around 75% at a back-off rate of around 0.7. Weighted interpolation performs worse than the back-off model, up until it converges to the interpolation model (bigram rejection rate under 10%).
None of the proposed term selection criteria perform better than the (simple) interpola-
tion baseline. Interpolation is the best performing model reaching an accuracy of 75.9%,
a small improvement over the back-off model (at 75%).
Overall, the inclusion of bigrams leads to significantly improved performance over
the unigram only models, with accuracy reaching 75.9%. Comparable results in liter-
ature are 62% [75], 66% [49], 71% [50] and 72.8% (using cross-validation) [15]. The
interpolation model also achieved a correlation to the ground truth of 0.61, compared to
0.5 achieved by the best system in [70].
3.5 Conclusions
We proposed a method of creating sentence affective ratings based on the combination of
partial affective ratings of word n-grams. At the core of this method is an affective lex-
icon expansion algorithm capable of creating continuous n-gram affective ratings based
on a set of manually labeled seed words and semantic similarity ratings calculated over
web data. This algorithm achieves state-of-the-art results in lexical affective tasks and
is generic enough to work in languages other than English, achieving high performance
in creating ratings for German words. Most importantly it does not require any linguis-
tic resources other than the affective ratings of a few hundred words in each language.
Sentence level ratings were obtained from n-gram ratings using linear and non-linear fu-
sion methods. Interpolation and back-off models were proposed for combining unigram
and bigram affective ratings. Overall, a simple linear equation containing the weighted
ratings of all terms, both unigram and bigram, proved to be the best performing solution
achieving state-of-the-art performance in the SemEval’07-Task14.
Future work should include further refinement of the lexicon creation model specifi-
cally targeted at the creation of more accurate higher-order n-gram ratings. Incorporating
morphosyntactic information into the model is also important especially for morpholog-
ically rich languages. The current word/sentence models can be used to create chunk
ratings, e.g., for noun phrases or compound nouns, reducing the required complexity of
syntactic rules; a simple syntactic model can then be used to model non-linear interaction
between these chunks.
Figure 3.5: Performance of the affective lexicon creation algorithm using different seed selection algorithms and analysis of the wrapper-selected seeds for the ANEW-CV experiment using: (a) the G similarity metric, (b) the S_1 similarity metric (correlation as a function of the number of seeds, for the wrapper and the unsupervised selection vs. the baseline). The corresponding rank distributions of the top 50 seeds per fold selected by the wrapper when using: (c) the G similarity metric, (d) the S_1 similarity metric (factors: valence, absolute valence, std of valence, mean and std of similarities, frequency, number of senses, number of part-of-speech tags).
Figure 3.6: Accuracy of the affective lexicon creation algorithm as a function of the number of seeds (S_1/116m, S_1/14m, G/116m, G/web vs. the baseline): (a) ANEW-CV experiment, (b) GINQ-PD experiment.
Figure 3.7: Correlation of the affective lexicon creation algorithm as a function of the number of seeds: (a) ANEW-CV experiment (S_1/116m, S_1/14m, G/116m, G/web vs. the baseline), (b) BAWLR-CV experiment (S_1/170m, G/170m).
Figure 3.8: Binary classification accuracy of the sentence rating algorithm as a function of the number of seed words, when using only unigram terms and the average, weighted average and max fusion schemes: (a) the G similarity metric, (b) the S_1 similarity metric.
Figure 3.9: Binary classification accuracy of the sentence rating algorithm as a function of the bigram selection threshold (back-off rate) for the SemEval'07-Task14 dataset, comparing unigram-only, bigram-only, interpolation, back-off (BO) and weighted interpolation (WI) models under the c_nc and c_ts criteria: (a) the G similarity metric and 300 seeds, (b) the S_1 similarity metric and 600 seeds.
Chapter 4
Adapting Norms to Task Domain
4.1 Introduction
The analysis of the affective content of language is a significant part of many applications involving written or spoken language, such as sentiment analysis [51], news headline analysis [70] and emotion recognition from multimedia streams (audio, video, text) [36, 35]. Affective text analysis can happen at various levels, targeting different lexical units (words, phrases, sentences, utterances) as appropriate to the task. Analyzing the content of utterances typically involves compositional [57] models of affect, which express the meaning of an utterance through some combination of the meanings (typically affective ratings) of the words it contains. Word ratings are provided by affective lexica, either manually annotated, such as Affective Norms for English Words (ANEW) [11], or, more typically, automatically expanded lexica such as SentiWordNet [22] and WORDNET AFFECT [71]. These word ratings are then combined through a variety of methods, making use of part-of-speech tags [16], sentence structure [6] or hand-tuned rules [50].
The work presented in this chapter was published in the following article:
Nikolaos Malandrakis, Alexandros Potamianos, Kean J. Hsu, Kalina N. Babeva, Michelle C. Feng, Gerald C. Davison, Shrikanth Narayanan, "Affective Language Model Adaptation via Corpus Selection", Proceedings of ICASSP, 2014, pp. 4838-4842.
A common problem for affective language analysis is the large variety of topics and
discourse patterns that may be observed and their effect on content interpretation. Dif-
ferent domains can contain text of different topics, leading to words being used with
different senses, or text created using different styles of speech/writing, e.g. informal or
self-referential. This poses challenges since most popular resources are domain-agnostic
and therefore sub-optimal if the task is focused on a narrow domain. There are two main
solutions to this problem: 1) topic modeling of general purpose data and 2) domain adap-
tation via data selection. Topic modeling [37, 46] typically aims to represent the meaning
of words (and by extension sentences and beyond) through a probabilistic mixture of
topic-specific models, in which case the affective content is estimated over all topics.
While these models show promise, they do not fit particularly well with other compu-
tational frameworks: in the recent SemEval sentiment analysis challenge [51] virtually
no submissions used topic modeling, opting for methods based on affective lexica. Here
instead we take the direct domain adaptation approach that has been very successful in
the language modeling and grammar induction literature [53, 33].
Our proposed method involves automatically adapting an affective lexicon in order to
better suit a task. Virtually all automatically generated lexica are created based on some
form of word similarity and the assumption that semantic similarity implies affective
similarity. Therefore if we can estimate domain-dependent word similarity scores then
we can create domain-dependent affective word/term ratings. Our method of lexicon
expansion [44], unlike popular alternatives [22, 71], is purely data-driven, utilizing web-
harvested data and estimating similarity scores through statistics. A simple, yet general
and efficient way to adapt to a specific domain is to filter the data used to estimate the
semantic similarity model. The data selection process we propose is inspired by similar
methods of harvesting data from the web used for language modeling [53] and grammar
induction [33]. Given a small amount of in-domain data we can, in an unsupervised
fashion, select similar data from a large corpus through the use of pragmatic constraints
introduced in [33] and perplexity, leading to a smaller corpus that is more relevant to the
task. Using this corpus we can create domain-specific similarities and affective ratings.
Compared to previous research and topic modeling our approach differs in that it gener-
ates a single model rather than a mixture of models. It also results in an affective lexicon,
a resource that is more versatile, since it can fit within most computational frameworks.
Next we outline the basic semantic-affective model of [42, 78], detail how to expand it to any type of target label (e.g., distress, anger, sentiment), and describe how to adapt it using: 1) adaptation of the semantic model through utterance selection and 2) direct adaptation of the semantic-affective map. The proposed methods are evaluated on affective tasks involving both transcribed speech data and Twitter text data.
4.2 Creating affective ratings
To generate affective ratings for utterances we use a compositional framework. The
utterance is broken into a bag of all words and bigrams it contains, affective ratings for
them are estimated from a lexicon and finally statistics of these word/bigram ratings are
combined into a sentence rating.
The bootstrap lexicon we use is automatically expanded using a method first presented in [44] and extended in [42]; it builds on [78]. We start from a manually annotated lexicon, preferably annotated on continuous affective scales, and pick from the lexicon the words with the most extreme ratings, e.g., for valence we pick the most positive and most negative words, and use them as dimensions to define a semantic model: a space representing semantic similarities to these seed words. Then we use a large corpus to calculate statistics and estimate semantic similarity metrics that allow us to place any word or bigram in the semantic space. We also define an affective model, a space, in this case one of arousal-valence-dominance, in which we aim to place any new word. The mapping from the semantic model to the affective one is trained on the annotated lexicon using Least Squares Estimation (LSE) and is a simple linear function

\hat{v}(w_j) = a_0 + \sum_{i=1}^{N} a_i v(w_i) d(w_i, w_j),    (4.1)
where w_j is the word whose affect we aim to characterize, w_1 ... w_N are the N seed words, v(w_i) is the affective rating of seed word w_i, a_i is the (trainable) weight corresponding to word w_i and d(w_i, w_j) is a measure of semantic similarity between words w_i and w_j. While d(w_i, w_j) may be any estimate of semantic similarity, we have found that the cosine similarity between the binary weighted context vectors of w_i and w_j performs best [42]. An overview of the lexicon expansion model can be seen in Fig. 4.1. For more details see [42].
Figure 4.1: Overview of the lexicon expansion method: corpus statistics, selection of seed dimensions from a lexicon annotated for valence, arousal and dominance, and training of the semantic-to-affective mapping.
The final step is the mapping of the semantic-affective model in (4.1) to various categorical labels at the sentence or paragraph level, e.g., sentiment, distress, anger, politeness, frustration. Typically this last step is achieved via a classifier that takes as input the 3-D affective rating of each token (unigrams and bigrams) and produces sentence/utterance level statistics of these ratings (e.g., mean, median, variance). The feature fusion scheme is trained separately for each categorical label. Provided there is enough in-domain data, one could also build a direct mapping from the semantic similarity space to the label space, i.e., not use the affective rating as an intermediate step, or combine the semantic-affective-label model with a semantic-label model; given the limited space we do not present results on such model combinations here.
4.3 Adaptation
Generating the values of the similarity metric used in (4.1) requires a text corpus large enough to contain multiple instances of the words and bigrams we want to generate ratings for. Size requirements like this have driven researchers to the web, which should contain sufficient data for virtually any need. However, it is still necessary to sample the web in order to collect appropriate data, e.g., we may harvest only data from Twitter for some tasks.
Instead of adapting the semantic-affective space directly, we choose to adapt the semantic similarity scores by re-estimating semantic similarity on a subset of the original corpus that better matches the content and style of our in-domain data. (Direct adaptation is possible, but constraints would have to be imposed on the semantic space; unfortunately, the semantic space, although often represented as an inner product space in distributed semantic models (DSMs), is far from metric, i.e., the triangle inequality is often violated. Adapting the corpus used to estimate semantic similarity is an elegant way to bypass the problem of adapting this non-metric space.) Adaptation thus boils down to an utterance selection method. In this work, motivated by work in language modeling and grammar induction, we utilize two criteria to perform utterance selection: pragmatic constraints and perplexity.
4.3.1 Pragmatic Constraints
Pragmatic constraints are terms or keywords that are highly characteristic of a domain. For example, when generating a domain-independent corpus we would search for "delay"; however, if we know that the application is related to air travel, we can also use terms that are highly characteristic of the domain, like "flight" and "plane". By constraining our sub-corpus to contain one of these words we obtain a sub-corpus that is conditioned on the domain, which in turn allows us to estimate domain-dependent probabilities and other metrics, e.g., semantic similarity. While in this example the pragmatic words are content words, that is not necessarily the case: the target application may be better characterized by stylistic elements, e.g., interview transcripts contain many self-references, which may make the word "I" highly characteristic. Identifying these characteristic terms can be done by comparing an in-domain corpus with a generic corpus. Intuitively, highly characteristic terms should appear relatively more often in the in-domain corpus and should also appear in multiple separate samples (utterances), or equivalently should have a high value of the score proposed in [33]:

D(w) \frac{P_{in}(w)}{P_{out}(w)},    (4.2)

where D(w) is the number of in-domain samples (sentences, documents or otherwise) the term occurs in, P_{in}(w) is the probability of the term in the in-domain corpus and P_{out}(w) is the probability of the term in the generic corpus.
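A minimal sketch of this scoring, assuming both corpora are available as lists of tokenized utterances; the add-one smoothing of P_out is an assumption made here to avoid division by zero for words absent from the generic corpus.

    from collections import Counter

    def pragmatic_scores(in_domain, generic):
        """Score each in-domain word by D(w) * P_in(w) / P_out(w), as in (4.2)."""
        doc_freq = Counter(w for utt in in_domain for w in set(utt))   # D(w)
        in_counts = Counter(w for utt in in_domain for w in utt)
        out_counts = Counter(w for utt in generic for w in utt)
        n_in = sum(in_counts.values())
        n_out = sum(out_counts.values())
        scores = {}
        for w, count in in_counts.items():
            p_in = count / n_in
            p_out = (out_counts[w] + 1) / (n_out + len(out_counts))    # smoothed P_out(w)
            scores[w] = doc_freq[w] * p_in / p_out
        return scores

    # e.g. constraints = sorted(scores, key=scores.get, reverse=True)[:20]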
4.3.2 Perplexity
Perplexity is a popular way of estimating the degree of fit between two corpora: a language model is trained on one and its perplexity is calculated on the other. In this context, we can generate a language model using the in-domain corpus and use it to evaluate each instance contained in the generic corpus [53]. Instances that are lexically more similar to the instances in the in-domain corpus will be assigned lower perplexity scores; therefore we can apply a threshold on perplexity to decide whether an instance should be included in our task-dependent corpus. Once the corpus has been selected using pragmatic constraints and/or perplexity thresholding, the semantic similarity metrics are re-estimated on the selected sub-corpus.
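The sketch below illustrates perplexity-based filtering; for brevity it uses an add-one-smoothed bigram language model rather than the Witten-Bell-smoothed trigram models used in the actual experiments.

    import math
    from collections import Counter

    def train_bigram_lm(utterances):
        """Add-one smoothed bigram model over tokenized in-domain utterances."""
        unigrams, bigrams = Counter(), Counter()
        for utt in utterances:
            tokens = ["<s>"] + utt + ["</s>"]
            unigrams.update(tokens[:-1])
            bigrams.update(zip(tokens[:-1], tokens[1:]))
        vocab_size = len(unigrams) + 1          # +1 reserved for unseen words
        return unigrams, bigrams, vocab_size

    def perplexity(model, utt):
        """Perplexity of one generic-corpus sentence under the in-domain model."""
        unigrams, bigrams, vocab_size = model
        tokens = ["<s>"] + utt + ["</s>"]
        log_prob = 0.0
        for prev, cur in zip(tokens[:-1], tokens[1:]):
            prob = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
            log_prob += math.log(prob)
        return math.exp(-log_prob / (len(tokens) - 1))

    # Sentences whose perplexity falls under a chosen threshold are kept:
    # kept = [s for s in generic_corpus if perplexity(model, s) < threshold]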
4.3.3 Adapting model parameters
Instead of, or in addition to, adapting the semantic similarity model d(·,·) in (4.1), one could directly adapt the semantic-affective mapping, i.e., the parameters a_i in (4.1), using in-domain data (or, more realistically, a mix of in-domain and general purpose data) as outlined in [44]. This method is also evaluated and compared to semantic space adaptation on the Twitter data.
4.4 Experimental procedure
The main word corpus we use to train the lexicon creation algorithm is the Affective
Norms for English Words (ANEW) dataset. ANEW consists of 1034 words, rated in 3
continuous dimensions of arousal, valence and dominance.
To train sentence-level models and evaluate their performance we use subsets of the
Articulated Thoughts in Simulated Situations [27] (ATSS) paradigm corpus and the Se-
mEval2013 Task 2 [51] twitter dataset. The ATSS paradigm corpus is composed of
manually transcribed sessions of a psychological experiment, where participants are pre-
sented with short segments of emotion-inducing scenarios and respond, typically with a
few sentences per utterance. These utterances are manually annotated on multiple scales.
For these experiments we use a subset of 1176 utterances and binary labels of anger (522
positive) and distress (445 positive). The twitter corpus contains individual tweets an-
notated as positive, negative and neutral. For these experiments we use the training set,
composed of 9816 tweets and containing 1493 negative, 4649 neutral and 3674 positive
samples.
In order to evaluate various methods of utterance selection we need a domain-independent starting corpus. To create one we use the English vocabulary packaged with the aspell spellchecker, containing 135,433 words, pose a query for each word to the Yahoo! search engine and collect the snippets (short representative excerpts of the document shown under each result) of the top 500 results. Each snippet is usually composed of two sentences: title and content. The corpus contains approximately 117 million sentences.
This corpus, as well as filtered versions thereof, is used to calculate statistics and
generate the required semantic similarity metrics to be used by the word/term model.
The model itself is created by selecting seed words from ANEW and training on the
entire ANEW corpus and then used to generate arousal, valence and dominance ratings
for all words and bigrams contained in the evaluated utterance corpora.
Utterance level features are created by calculating word and bigram rating statistics across each utterance. The statistics used are: cardinality, minimum, maximum, range (maximum minus minimum), extremum (value furthest from zero), sum, average and standard deviation. Statistics are calculated across all terms and across subgroups based on rough part-of-speech tags: verbs, adjectives, adverbs, nouns and their combinations of two, three and four. For example, one feature may be the maximum arousal across all verbs and adjectives.
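A sketch of this feature extraction step, assuming each word already carries a norm rating and a coarse part-of-speech tag; the grouping and statistic names are illustrative and the exact feature set used in the experiments may differ.

    import itertools
    import statistics

    STATS = {
        "cardinality": len,
        "min": min,
        "max": max,
        "range": lambda v: max(v) - min(v),
        "extremum": lambda v: max(v, key=abs),
        "sum": sum,
        "average": statistics.mean,
        "std": lambda v: statistics.pstdev(v) if len(v) > 1 else 0.0,
    }

    def utterance_features(word_norms, pos_tags, dim="valence"):
        """Statistics of word norm ratings over POS-based subgroups of an utterance.

        word_norms: one dict per word, e.g. {"valence": 0.3, "arousal": -0.1, ...}
        pos_tags:   one coarse tag per word ("verb", "noun", "adj", "adv" or other)
        """
        groups = ["verb", "noun", "adj", "adv"]
        subsets = [None] + [set(c) for n in range(1, 5)
                            for c in itertools.combinations(groups, n)]
        features = {}
        for subset in subsets:
            vals = [w[dim] for w, t in zip(word_norms, pos_tags)
                    if subset is None or t in subset]
            name = "all" if subset is None else "+".join(sorted(subset))
            for stat, fn in STATS.items():
                features["%s_%s_%s" % (dim, name, stat)] = fn(vals) if vals else 0.0
        return features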
It should be noted that up to and including feature extraction we are using the affective
dimensions of valence, arousal and dominance. While these dimensions are not the same
as the utterance-level labels, they should be capable of representing them, e.g., anger
should fall within the negative valence, high arousal, high dominance part of the space.
The task of moving these ratings to the desired affective representation is handled by the
supervised sentence model. The model, which uses the extracted features after feature selection, is a Naive Bayes Tree, a decision tree with a Naive Bayes classifier at each node.
4.5 Results
Next we present the baseline results (general purpose corpus used for semantic similarity estimation), as well as the adaptation results obtained using pragmatic constraints and/or perplexity thresholding. To select the words that will form the pragmatic constraints we use (4.2), with the in-domain corpus being the evaluated utterance corpus and the generic corpus being the web-harvested 117m sentences. Using these we can score and rank every word in the in-domain corpus; however, we do not know how many of these words we should pick. Data selection will result in a corpus that may be more salient to the task, but will also be smaller, and we have shown [42] that performance increases with corpus size. In order to keep a fairly large corpus, we keep the top-20 words for each training corpus and use them to filter the original 117m sentences.
To filter by perplexity we train trigram language models (with Witten-Bell smoothing) on the in-domain corpora and use them to calculate the perplexity of each sentence contained in the generic corpus. As before, there is no optimal perplexity value to aim for, since a lower threshold will lead to a smaller corpus; for these experiments we use thresholds of 100, 300, 1000 and 6000. Perplexity thresholding is applied to the original 117m sentences or to a corpus already filtered via pragmatic constraints. The sizes of the resulting filtered corpora are shown in Table 4.1. All filtered corpora are substantially smaller than the initial corpus, and as the perplexity constraint gets stricter the corpora can become very small. However, even a corpus of fifty thousand sentences is very large compared to most available annotated in-domain corpora.
Table 4.1: Corpus size after pragmatic constraints and/or perplexity thresholding has been applied, for the ATSS and Twitter experiments.
Pragmatic     Perplexity    Sentences
Constraints   Threshold     ATSS           Twitter
no            -             116,749,758    116,749,758
no            100           177,786        48,868
no            300           4,837,935      1,241,524
no            1000          25,786,774     12,412,022
no            6000          57,932,887     36,044,486
yes           -             24,432,892     30,193,306
yes           100           96,768         24,434
yes           300           2,096,241      620,762
yes           1000          9,116,490      6,206,011
yes           6000          15,907,177     18,022,243
To set baselines we use a domain-independent lexicon model trained using all 117m sentences. As shown in [42], this model is very accurate, reaching a Pearson correlation to the ground truth of 0.87, so it is a good representation of what a domain-independent model can do. In the case of Twitter, we also compare against the supervised adaptation of the semantic-affective model proposed in [44]. This method performs no corpus selection, but rather re-trains the a_i coefficients in (4.1): words in the training utterances are assigned ratings equal to the average rating of all utterances they occur in and are then used as training samples to re-train the lexicon model. For Twitter we only re-train the valence model, since sentiment polarity is very similar to valence, and map the negative/neutral/positive labels to valence values of -1, 0 and 1.
To evaluate on the ATSS paradigm corpus we attempt prediction of the binary values of anger and distress using 10-fold cross-validation. The performance achieved, in terms of classification accuracy as a function of corpus size, is shown in Fig. 4.2. The baseline performance, achieved without any filtering, is 75% for anger and 69% for distress, with chance baselines of 56% and 62% respectively. Using perplexity alone can lead to an improvement over the baseline; however, that improvement is much smaller than that achieved by using pragmatic constraints, with or without an (infinite) perplexity threshold. The combination of pragmatic constraints and perplexity results in a notable improvement over the baseline, reaching accuracies of 77% and 73% respectively. Of note is the difference in optimal perplexity thresholds between the dimensions of anger and distress, indicating that the sample labels could (or perhaps should) be used as part of the filtering process.
To evaluate on Twitter we attempt prediction of the ternary sentiment value using 10-fold cross-validation; the performance is also shown in Fig. 4.2. The baseline performance is 58%, whereas the chance baseline is 47%. As in the case of ATSS anger, we see a substantial improvement when using a combination of pragmatic constraints and perplexity, reaching a peak of 62%, or 4% over the baseline. In this case the improvement gained by including pragmatic constraints rather than using just perplexity is much less pronounced than for ATSS, which is probably the result of picking a sub-optimal number of pragmatic words as constraints. The supervised adaptation of the a_i coefficients improves on the baseline, but the difference is very small, particularly when compared to the results of the unsupervised method.
Performance improves notably in all cases, showing the validity of the main idea of adapting the semantic similarity model. Pragmatic constraints seem to be a better selection criterion than perplexity, though peak performance is achieved by combining the two.
Finally, we investigate the use of a mixture of the adapted models used above and the task-independent model. To do so we take a weighted linear combination of the models, w d_{in}(·,·) + (1-w) d_{out}(·,·), where d_{in}(·,·) is the semantic similarity estimated over the filtered corpus and d_{out}(·,·) is the semantic similarity estimated over the task-independent 117m sentence corpus. The maximum performance achieved, as well as the corresponding weights w, are shown in Table 4.2. As expected, the better performing adapted models get weighted more in the mixture (typically 80-20%). Combining the in-domain and out-of-domain models provides very little benefit in terms of maximum performance, but it increases robustness considerably, smoothing out the performance shown in Fig. 4.2. All but the worst performing in-domain models can achieve similar performance levels when used in a mixture, though only the better in-domain models are assigned high weights.
Table 4.2: Performance for each experiment using linear combinations of the generic and adapted lexicon models. Presented is the maximum accuracy achieved in each case, as well as the parameters of the adapted model and the weight w assigned to it.
Experiment           Pragmatic     Perplexity    w     acc.
                     Constraints   Threshold
ATSS Anger           yes           6000          0.8   77.7%
ATSS Distress        yes           1000          0.8   73.9%
Twitter Sentiment    yes           1000          0.9   62.1%
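The mixture itself is a per-pair linear combination of the two similarity functions; a minimal sketch (names illustrative):

    def mixed_similarity(d_in, d_out, w):
        """Weighted combination of the in-domain and generic similarity functions."""
        return lambda t1, t2: w * d_in(t1, t2) + (1.0 - w) * d_out(t1, t2)

The weight w is selected empirically; Table 4.2 reports the best-performing values for each experiment.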
4.6 Conclusions
We proposed a method of adapting an affective lexicon generation method to specific tasks through the use of corpus selection, as part of a system that generates utterance-level affective ratings. The method was shown to provide notable improvements in prediction accuracy on speech and Twitter datasets. Future work should focus on finding optimal filtering parameters, including the number of pragmatic words and the perplexity thresholds, as well as on the role of labels in the corpus selection process. We will also investigate how to optimally combine the various adaptation methods at the level of the semantic similarities, the semantic-affective map and the affective-label map, as well as the sampling of the semantic-affective space via seed word selection.
Figure 4.2: Classification accuracy as a function of the size of the corpus used for lexicon creation, using perplexity, pragmatic constraints and perplexity, or neither to generate the corpus. The data point labels correspond to perplexity threshold values (100, 300, 1000, 6000, inf). Performance shown for (a) ATSS anger, (b) ATSS distress and (c) Twitter sentiment.
Chapter 5
Going Beyond Affect
5.1 Introduction
In this chapter we propose a method of expanding psycholinguistic norms using a variation of a method of emotional lexicon expansion that we previously proposed [42]. The method is shown to achieve state-of-the-art prediction performance when applied to eleven different dimensions, most of which have never before been automatically generated. We then use the expanded psycholinguistic norms to quantify aspects of the difference between client-centered therapy and psychoanalysis.
The work presented in this chapter was published in the following article:
Nikolaos Malandrakis, Shrikanth Narayanan, "Therapy Language Analysis using Automatically Generated Psycholinguistic Norms", Proceedings of Interspeech, 2015, pp. 1947-1951.
Language norms are (typically numeric) representations of the normative (expected)
content of language, usually collected in a lexicon/thesaurus. The most commonly used
variety are emotional norms, representing word polarity, valence, arousal or dominance.
Manually annotated lexica [11, 69] have limited computational applications due to their
small sizes (typically a few hundred up to a few thousand words), so machine learning
methods are used to expand them and create automatically generated lexica [22]. Emo-
tional norms have been utilized for a variety of tasks and remain at the core of cutting
edge sentiment analysis systems [62]. Norms for aspects of language beyond emotion
have existed for a long time [20, 84] and have been popular in behavioral studies as a
way to select appropriate stimuli [68]. These include norms describing the degree of
abstractness, the complexity of meaning and age or gender affinity of individual words,
so there should be potential for a variety of uses. Very recently some of these norms have
become the target for expansion and used in computational analyses [38, 48, 77] though
they still are not particularly common.
The method used to expand word-level psycholinguistic norms is detailed in Sec-
tion 5.3. The extra steps used to create norms per sentence, turn or other larger units are
described in Section 5.4. The corpora and methodology used in the experiments are pre-
sented in Section 5.5 and Section 5.6. Results are presented and discussed in Section 5.7.
Finally conclusions and future work are discussed in Section 5.8.
5.2 The task: Therapy language analysis
Psychotherapy has been increasingly popular in recent years as a subject of compu-
tational examinations, particularly with respect to the automatic evaluation of therapy
quality. Studies have been conducted on discerning the differences between different
methods/schools of therapy [73] and on modeling specific aspects of the therapist-patient
interaction [14].
In this chapter we focus on investigating the differences, in terms of therapist and pa-
tient language, between two popular schools of therapy: psychoanalytic psychotherapy
(also known as psychoanalysis) and client-centered therapy. Psychoanalytic psychother-
apy (henceforth referred to as PP), as defined by Sigmund Freud, targets the patient’s
subconscious mind. The therapist is meant to discover hidden or repressed memories,
emotions and motivations over the course of therapy and use the gained insights to treat
the patient. Client-centered therapy [61] (henceforth referred to as CCT), devised by
Carl Rogers, is defined by the assumption that the patient has within himself the con-
cepts of self-understanding and self-healing. The therapist’s mission is to facilitate the
self-healing process by providing an accepting and understanding environment. As part
of that mission statement, CCT practitioners eschew the use of the word “patient” for
“client”, since the subject is viewed as a collaborator in the healing process. While the
difference in definitions should translate into notable differences in therapist language,
quantifying these differences is not simple. In this chapter, we propose using language
norms as a means to describe the language used and therefore any differences.
5.3 Expanding Word Psycholinguistic Norms
The method of generating word norms is an extension of the approach described in [42], which in turn is a generalization of the method presented in [79]. At its core is the assumption that semantic similarity can be used to estimate similarity in any other domain: words with similar meanings will have similar norm values. More specifically, we assume that the norm value v(t_j) corresponding to term t_j can be estimated through a linear combination of the semantic similarities s(t_j, w_i) between the term and known seed words w_i and the norms v(w_i), using the equation

\hat{v}(t_j) = a_0 + \sum_i a_i s(t_j, w_i) v(w_i),    (5.1)

where a_i are trainable weights corresponding to the seed words w_i. Starting from a manually annotated lexicon containing K words, we can select N < K words as seed words and create a system of K equations with N + 1 variables (a_0 ... a_N) that can be solved using Least Squares Estimation (LSE).
This method works very well, producing state-of-the-art results when estimating
emotional norms [42]. Applying it to a larger and more varied set of norms complicates one aspect of the process: seed word selection. The seed words $w_i$ have to be selected
separately for each norm, from the words available in the original manual annotations, a
process that may be affected by the nature of the norm in some way. Also, since norm
lexica from different sources contain different words, the number of similarity ratings
required increases dramatically.
A solution can be reached via a simple modification to the method. Since the weights
$a_i$ are trainable, they will compensate for the absence of $v(w_i)$. For example, if a norm has a value of $x$ and we eliminate it, the corresponding trained coefficient will increase in value by a factor of $x$. Therefore we eliminate all $v(w_i)$ and get the new equation:

$$\hat{v}(t_j) = a_0 + \sum_i a_i \, s(t_j, w_i), \qquad (5.2)$$
which allows us to use any set of words, multi-word terms or concepts as seeds, inde-
pendently of the norm we are estimating.
The similarity metric $s(\cdot)$ used in (5.2) is a critical component of the process. Through the course of prior experiments [42] we found context-based similarity to provide the best performance, a finding that holds for norms beyond emotion. The metric used is the cosine between context vectors with binary weights and a window size of one: the context vector for word $w_i$ contains ones in all places corresponding to terms that occur right next to it (window size of one) at least once in some large corpus. For a detailed description of the experiments that led to the selection of this similarity metric, as well
as the performance impact on emotion norm estimation, see [42].
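A minimal sketch of this word-level pipeline follows, under the stated assumptions (binary context vectors with a window of one, cosine similarity to seed words, least-squares fit of equation (5.2)). Function and variable names are illustrative and scikit-learn is used for the regression; it also assumes every word of interest appears in the vector table.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def binary_context_vectors(corpus_sentences, vocabulary):
        """Binary co-occurrence vectors with a window of one (adjacent tokens only)."""
        index = {w: i for i, w in enumerate(vocabulary)}
        vectors = {w: np.zeros(len(vocabulary)) for w in vocabulary}
        for sentence in corpus_sentences:
            for left, right in zip(sentence, sentence[1:]):
                if left in vectors and right in index:
                    vectors[left][index[right]] = 1.0
                if right in vectors and left in index:
                    vectors[right][index[left]] = 1.0
        return vectors

    def cosine(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a.dot(b) / denom) if denom > 0 else 0.0

    def similarity_features(word, seed_words, vectors):
        """One feature per seed word: the similarities s(t_j, w_i) of equation (5.2)."""
        return np.array([cosine(vectors[word], vectors[s]) for s in seed_words])

    def train_norm_model(annotated_norms, seed_words, vectors):
        """annotated_norms: dict mapping manually annotated words to a norm value."""
        words = list(annotated_norms)
        X = np.vstack([similarity_features(w, seed_words, vectors) for w in words])
        y = np.array([annotated_norms[w] for w in words])
        return LinearRegression().fit(X, y)   # learns a_0 (intercept) and the a_i

    def estimate_norm(model, word, seed_words, vectors):
        return float(model.predict(similarity_features(word, seed_words, vectors)[None, :])[0])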
5.4 Norms for larger passages
Given the word norm model, we can generate norms for any word that occurs in any
particular corpus we want to process. Still, creating norms for larger lexical units such
as sentences or turns is not straightforward.
The dominant approach in emotion-related literature, and in our prior work, is extract-
ing statistics that describe the word norm distribution. The target passage is tokenized
and filtered, then a lexicon lookup is performed and words are replaced by their norms
before statistics are estimated. The most common variant of that approach includes
part-of-speech tagging, then removal of all words that are not content words (adjec-
tives, nouns, verbs and adverbs), followed by extracting the average norm value across
the content words. More complex variants can generate large numbers of features via
combining multiple filtering criteria (such as multiple different part-of-speech selection
criteria) with multiple statistics, including the minimum, maximum, slopes etc. Mapping
these multiple statistics to passage norms can be achieved using machine learning.
For the purposes of this chapter, the passage norm is estimated as the average norm
across content words.
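A minimal sketch of this passage-level scoring, assuming Penn Treebank style part-of-speech tags and a word_norms lookup (a hypothetical name) produced by the word-level expansion model:

    CONTENT_TAGS = ("JJ", "NN", "VB", "RB")   # adjectives, nouns, verbs, adverbs (Penn prefixes)

    def passage_norm(tagged_tokens, word_norms):
        """Average norm value over content words; tagged_tokens is [(token, pos), ...]."""
        values = [word_norms[tok.lower()]
                  for tok, pos in tagged_tokens
                  if pos.startswith(CONTENT_TAGS) and tok.lower() in word_norms]
        return sum(values) / len(values) if values else 0.0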
5.5 Corpora and Experimental Procedure
Following is a short description of the datasets used for the experiments presented in this
chapter.
5.5.1 Manually annotated norms
In order to generate psycholinguistic norms we need to start from some manual annota-
tions. To that end we use eleven dimensions from three sources. More norm dimensions
are available in these resources: the ones selected for expansion were picked based on
their projected utility for our applications in behavioral informatics. Following is a list
and short description for each dimension of interest.
From the Affective Norms for English Words (ANEW) [11] we use the norms for
arousal, valence and dominance, the three dimensions of the widely used dimensional
model of affect [64]. Arousal represents excitement, the degree of physical activation in
preparation for action. Dominance is the degree of perceived control over one’s circum-
stances. Valence is the continuous polarity (from very negative to very positive) of an
emotion.
From the MRC Psycholinguistic database [84] we use the norms for concreteness,
imagability, age of acquisition and familiarity. Concreteness is the degree to which
something can be perceived using the five senses (from very abstract to very concrete).
Imagability is the degree to which one may create a mental image of the word’s subject.
Age of acquisition indicates the expected age at which one acquires (can use correctly)
the word. Familiarity is the degree of exposure to and knowledge of the word.
From the Paivio, Yuille and Madigan norms [20] we use the norms for pleasantness,
pronouncability, context availability and gender ladenness. Pleasantness is very similar
to valence and is a degree of how pleasant the feelings associated with a word are. Pro-
nouncability signifies how easy a word is to pronounce. Context availability represents
the number of different contexts in which a word may appear. Gender Ladenness rep-
resents the degree of perceived feminine or masculine association of a word (from very
masculine to very feminine).
5.5.2 Raw text corpus
The norm generating equation (5.2) requires a semantic similarity estimate $s(\cdot)$. In this chapter semantic similarity is calculated as the cosine of context vectors calculated over a large corpus of raw text. The corpus was created by posing a query to the Yahoo! search engine for every word in the English version of the Aspell spell-checker [2] and collecting the
top 500 result previews. Each preview is composed of a title and a sample of the content,
each being a single sentence. Overall the collected corpus contains approximately 117
million sentences.
5.5.3 General psychotherapy corpus
The general psychotherapy corpus, maintained and updated by the “Alexander Street
Press” (http://alexanderstreet.com/) and made available via library subscription, contains
transcripts of therapy sessions. Over 1200 sessions in total are included, spanning a large
variety of therapeutic approaches and experimental conditions. Beyond the therapist
and client transcripts, the corpus includes metadata such as therapist and client demo-
graphics and client symptoms.
As noted in other studies [29], this corpus has some unfavorable characteristics. It is
missing a lot of metadata, with most only available for a minority of the provided ses-
sions, and it is unbalanced with respect to almost all variables since it was not specifically
designed for analysis as proposed here. Despite that, the large size makes it a valuable
resource.
For these experiments we used the therapist and client language, the therapist school
of therapy (CCT or PP) and the therapist’s experience (under 10 years, 11-20
years). The subset of the original corpus that fits these criteria contains 312 therapy
sessions.
Table 5.1: Word norm estimation performance. Cardinality of the dataset and Pearson
correlation to the ground truth for 10-fold regression over manually annotated words.
Dimension #Samples Pearson
Arousal 1034 0.70
Dominance 1034 0.77
Valence 1034 0.88
Pleasantness 925 0.82
Concreteness 4295 0.87
Imagability 4829 0.85
Age of Acquisition 1904 0.86
Familiarity 4924 0.86
Pronouncability 925 0.72
Context Availability 925 0.78
Gender Ladenness 925 0.80
5.6 Experimental Procedure
To evaluate the performance of the word norm expansion algorithm we performed 10-
fold cross-validation experiments for each norm dimension. In all cases we rescaled the
original norm values to the range $[-1, 1]$. The seed words $w_i$ of (5.2) were selected using word frequencies calculated over the corpus described in Section 5.5.2. The top 10000 most frequent words with length longer than 3 characters were used as seeds for all experiments. Given each training set, we generated the $K \times 10000$ matrix of similarities to the seed words. Dimensionality reduction was performed by applying Principal Component Analysis (PCA) and keeping the first $N$ components. These transformed similarities became the similarity terms $s(\cdot)$ of (5.2), which in this case represent similarities to concepts
(combinations of words), rather than individual words. For the experiments presented
in this chapter, the first 500 principal components were used for the MRC and ANEW
sourced norms, whereas only the first 300 components were used for norms taken from
the smaller Paivio-Yuille-Madigan resource.
To generate norms for any new words we trained models using all manually annotated
samples as training samples. The resulting model for each norm dimension is composed
of a coefficient vector containing all $a_i$ and a PCA transformation matrix, required to map from the original semantic space of 10000 dimensions. For any new word we calculated the 10000 similarities to the seed words, transformed them using the trained PCA matrices and
plugged into the learned equations to get the corresponding norms.
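A sketch of this PCA-plus-regression pipeline using scikit-learn; the matrix and component sizes follow the description above, while the function names are illustrative:

    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline

    def train_norm_pipeline(similarity_matrix, norm_values, n_components=500):
        """similarity_matrix: K x 10000 similarities of training words to the seed words."""
        model = make_pipeline(PCA(n_components=n_components), LinearRegression())
        model.fit(similarity_matrix, norm_values)
        return model

    def score_new_words(model, new_similarities):
        """new_similarities: M x 10000 similarities of unseen words to the same seeds."""
        return model.predict(new_similarities)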
The analysis of therapy transcripts started by part-of-speech tagging the therapist and client utterances with TreeTagger [66]. Non-content words were removed and the content word vocabulary was created. For each word in that vocabulary we generated all possible norms and calculated the norm averages per session. The result was eleven norms per session, per interlocutor.
The target of analysis was the therapist language given the client language. To do that
we estimated the therapist norms given the client norms, assuming a linear relationship,
and kept the residuals (the errors). These estimation errors are attributed to the therapist
and become the dependent variable of our analysis. The independent variables are: the
school of therapy the therapist belongs to and the years of experience he or she has. Using
more independent variables was desirable, but was not possible due to the unbalanced
corpus that resulted in empty groups. The method of analysis was Welch’s ANOVA, a variant of ANOVA for unequal variances.
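The residual computation can be sketched as follows. The two-group Welch's t-test shown here is only a simplified stand-in for the full Welch's ANOVA (which also models the interaction term); all names are illustrative.

    import numpy as np
    from scipy import stats

    def therapist_residuals(client_norm, therapist_norm):
        """Residuals of a per-session linear fit of therapist norms on client norms."""
        client_norm = np.asarray(client_norm)
        therapist_norm = np.asarray(therapist_norm)
        slope, intercept = np.polyfit(client_norm, therapist_norm, deg=1)
        return therapist_norm - (intercept + slope * client_norm)

    def compare_groups(residuals, group_labels, group_a="CCT", group_b="PP"):
        """Welch's t-test (unequal variances) between two therapist groups."""
        labels = np.asarray(group_labels)
        a = residuals[labels == group_a]
        b = residuals[labels == group_b]
        return stats.ttest_ind(a, b, equal_var=False)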
5.7 Results
5.7.1 Word-level norm estimation
The performance of the word norm expansion algorithm is shown in Table 5.1. For each
dimension the number of manually annotated words and the Pearson correlation to the
ground truth are listed. Overall the model performs very well, with Pearson coefficients
around 0.8 for almost all dimensions. Arousal from language is predictably harder to
estimate and pronouncability is not strictly defined by semantics alone, so the relatively
low performance of the model on these dimensions was expected.
Most of these dimensions have not been automatically expanded before, so compar-
isons are difficult. The 0.88 Pearson for valence is higher than the 0.87 cited in [42], the
0.87 Pearson for concreteness is higher than the 0.81 cited in [77] and the 0.85 Pearson for imagability is higher than the 0.56 cited in [48]. Therefore this system performs better overall than related attempts in the literature.
5.7.2 Therapy transcript analysis
As a first step, we looked at the relation between the language norms of the therapist and
client, by calculating the Pearson correlation between their per-session features, for each
school of therapy and overall. The results are presented in Table 5.2. All correlations over 0.15 are significant at the $p < 0.05$ level and correlations over 0.25 are significant at the $p < 0.001$ level. As expected, there is significant correlation between therapist
Table 5.2: Pearson correlation of client and therapist norms, for client-centered therapy (CCT), psychoanalytic psychotherapy (PP) and overall.
CCT PP Overall
Arousal 0.15 0.08 0.18
Dominance 0.20 0.20 0.19
Valence 0.20 0.24 0.19
Pleasantness 0.15 0.22 0.18
Concreteness 0.47 0.26 0.32
Imagability 0.43 0.17 0.24
Age of Acquisition 0.37 0.19 0.29
Familiarity 0.30 0.33 0.29
Pronouncability 0.31 0.32 0.29
Context Availability 0.32 0.26 0.24
Gender Ladenness 0.35 0.32 0.30
and client along almost all dimensions. The differences in correlation with the client
between the two schools of therapy are not significant with the exception of concreteness,
imagability and age of acquisition where the CCT therapists correlate better. The finding
seems consistent with the school descriptions: the CCT therapists should follow and
empower the client, while the PP therapists have specific goals that may conflict with the
short-term goals of the client (what he or she wants to discuss).
To compare therapist language we perform statistical analysis on their norms, nor-
malized by client norms using linear regression, with respect to the school of therapy
(SoT) and the therapists’ experience (TE). The results are presented in Table 5.3. Most
of these differences have a strong interaction component that warrants further investiga-
tion before commenting on main effects. One exception is dominance, the difference of
which is attributed to higher therapist experience, an indicator that a therapist with more
experience gives a stronger impression of being in control.
To investigate the interaction of SoT and TE we created the factor plots shown in
Fig. 5.1 and ran pair-wise statistical tests for all values of the interacting terms. This
investigation reveals only one main effect, only one factor that differentiates between
schools for all levels of experience: arousal, with PP transcripts showing higher values.
Another way to interpret this would be that the CCT practitioners use language consistent
with higher calmness levels.
Beyond arousal, all other effects only become significant at higher experience lev-
els. The effect of increased experience on the two schools is shown in Table 5.4, which
also includes the differences between schools at the higher experience level. Experience
appears to have much more of an effect on CCT than PP, with CCT therapist language
becoming more pleasant (higher pleasantness), simpler (lower age of acquisition, higher
pronouncability) and more accessible (higher familiarity, higher context availability). A
lot of these trends are also evident for PP practitioners, as seen in Fig. 5.1, but none are
significant, apart from a significant increase in concreteness. The differences between
the two schools at the high level of experience mirror the changes for CCT: the thera-
pist language in CCT transcripts is simpler, more accessible and more pleasant than the
language in PP, in addition to the main effect of appearing more relaxed.
The observed differences seem consistent with the CCT target of providing an “ac-
cepting and understanding” environment for the client.
5.8 Conclusions
We proposed and evaluated a method for creating psycholinguistic norms for words
based on manually annotated lexica and semantic similarity. The method assumes a
linear relation between semantics and all other aspects and is trained using LSE. The
method achieved state-of-the-art results on word-level norm generation. The norms were
Table 5.3: Factor p-values and direction of difference. Significant differences are denoted with ↑ or ↓ at the p < 0.05 level and ⇑ or ⇓ at the p < 0.001 level.
p-value (SoT, TE, SoT*TE) and direction (PP, TE↑)
Arousal 0.001 0.033 0.001 ⇑ ↑
Dominance 0.167 0.041 0.200 ↑
Valence 0.071 0.060 0.091
Pleasantness 0.658 0.494 0.001
Concreteness 0.003 0.002 0.018 ↑ ↑
Imagability 0.004 0.001 0.017 ↑ ⇑
Age of Acquisition 0.928 0.059 0.000
Familiarity 0.274 0.002 0.000 ↑
Pronouncability 0.138 0.001 0.000 ⇑
Context Availability 0.017 0.000 0.000 ↑ ⇑
Gender Ladenness 0.613 0.470 0.003
used to analyze the differences between different schools of therapy and the findings
were consistent with the theoretical definitions of the schools, with client-centered ther-
apist speech appearing simpler, calmer and more pleasant. Future work will include the
expansion of the norm model to more dimensions, which will enable a more detailed
description of language. At the analysis level, moving to sentence or turn-level analysis
could help provide further insights. Finally, future work will apply the approach pre-
sented here in conjunction with analysis of other related data (e.g., speech) to obtain a
more comprehensive account of the mechanisms, quality and efficacy of psychotherapy
within a behavioral informatics framework [52].
Table 5.4: The effect of increased experience. Significance denoted with ↑ or ↓ at the p < 0.05 level and ⇑ or ⇓ at the p < 0.001 level.
CCT PP CCT vs PP
Arousal ⇓
Dominance
Valence ↑
Pleasantness ⇑ ⇑
Concreteness ↑
Imagability
Age of Acquisition ⇓ ⇓
Familiarity ⇑ ⇑
Pronouncability ⇑ ⇑
Context Availability ⇑ ↑
Gender Ladenness
[Figure: eleven panels, one per norm dimension (Arousal, Dominance, Valence, Pleasantness, Concreteness, Imagability, Age of Acquisition, Familiarity, Pronouncability, Context Availability, Gender ladenness), each plotting per-session sample means against therapist experience (<10 years vs. 11-20 years) for client-centered therapy and psychoanalytic psychology.]
Figure 5.1: Sample means as functions of therapist experience for client-centered therapy and psychoanalytic psychology.
Chapter 6
Using Neural Networks to Generate Sentence Norms
6.1 Introduction
This chapter describes our experiments in the generation of word and sentence level
norms using neural networks and multi-task learning.
We have shown in previous chapters that we can generate highly accurate word-level
norms and that, given sentence-level supervision, we can combine these into sentence-
level norms to great effect. However, there is very limited availability of sentence-level
resources that are annotated in continuous scales for the desired norms. Even for the
most popular affective dimensions of arousal, valence and dominance there are very
few resources [13] that would enable us to use sentence-level model supervision. These
resources are expensive and time-consuming to develop and can be difficult to annotate
as the definitions of these norms become less intuitive at the sentence level. As a result,
there is a great variety of norms for which we only have word-level annotations and very
few for which we have word and sentence-level annotations.
The lack of sentence level annotations severely limits the computational possibilities
with respect to these norm dimensions. If we wanted to describe the concreteness of a
passage, with only word level annotations available, we would be limited to using word-
level statistics such as the mean of word concreteness scores. Ideally we would like to
leverage other norm dimensions, for which we have sentence annotations, to improve the
estimates for dimensions where such annotations do not exist. Intuitively, we should be
able to use the annotations for other dimensions to learn how to compose sentence se-
mantics and the annotations for the target dimension to learn how to convert the semantic
representations into the desired norm values.
This chapter describes our attempts to exploit the relationships between norm dimen-
sions for which we have sentence-level annotations and norm dimensions for which we
do not, in order to improve sentence-level results for the latter. The framework used
is multi-task learning, where we train a model to execute multiple tasks, with the goal
of jointly (1) defining a semantic space in which words and sentences can be mapped,
(2) creating transformation functions that can infer the norm values from that semantic
space, and (3) implicitly learning the relations between dimensions.
6.2 Approach
[Figure: the word model maps word embeddings through a stack of shared dense layers followed by separate task-specific dense layers, one stack per norm dimension; the sentence model adds a GRU composition layer between the embeddings and the shared dense layers. The word model is trained on words only, the sentence model on words and sentences.]
Figure 6.1: Structure overview for the word and sentence models.
Multi Task Learning refers to any learning strategy that combines multiple tasks with
some sharing of data, features or parameters between the tasks. For this work we utilize
the simplest form of multi task learning, with feature sharing and hard parameter sharing.
We train models where the different tasks have to share inputs (features) and some subset
of the layers. Fig. 6.1 shows the structure of the word and sentence level models.
For the word-level tagging task we use a simple feed-forward network composed
of a series of dense (fully connected) layers, with some of the layers being shared and
some being separate. If the model is applied to only a single psycholinguistic dimension
it collapses to a multi-layered perceptron. For the word-level task we assume that we
have complete data, with annotations for all dimensions, therefore any performance gains
from multi-task learning will be the result of increased robustness of the model, a benefit
typically associated with multi-task learning.
For the sentence level task we use a similar structure of dense layers with the addition
of a composition layer that combines the meaning of all tokens in a sentence into a single vector representation. The composition is handled by a Gated Recurrent Unit
(GRU). We assume that the word and sentence tagging tasks for any given dimension
are the same task and they will use the same components of the network. Our problem
formulation states that we do not have sentence level annotations for one dimension,
therefore one stack of task-specific dense layers will not be trained on any sentence
data; however, the corresponding output should still benefit from the sentence annotations
available for the other tasks, via the shared layers.
The multi-task models are used to jointly learn how to generate multiple psycholin-
guistic dimensions and are optimized using the combined loss for all outputs.
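A compact sketch of the sentence-level multi-task model under hard parameter sharing (PyTorch; the class name, the 200-unit layers and the two shared layers are illustrative choices, not the exact configuration used):

    import torch
    import torch.nn as nn

    class MultiTaskNormModel(nn.Module):
        """GRU composition + shared dense stack + one dense head per norm dimension."""
        def __init__(self, embedding_matrix, task_names, hidden=200, dropout=0.2):
            super().__init__()
            # embedding_matrix: FloatTensor of pre-trained (e.g. GloVe) vectors, kept frozen.
            self.embed = nn.Embedding.from_pretrained(embedding_matrix, freeze=True)
            emb_dim = embedding_matrix.size(1)
            self.gru = nn.GRU(emb_dim, hidden, batch_first=True)  # sentence composition
            self.shared = nn.Sequential(
                nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(dropout),
                nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(dropout),
            )
            # Task-specific heads: one scalar output per psycholinguistic dimension.
            self.heads = nn.ModuleDict({
                name: nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                    nn.Dropout(dropout), nn.Linear(hidden, 1))
                for name in task_names
            })

        def forward(self, token_ids):
            embedded = self.embed(token_ids)        # (batch, seq_len, emb_dim)
            _, last_state = self.gru(embedded)      # last_state: (1, batch, hidden)
            shared = self.shared(last_state.squeeze(0))
            return {name: head(shared).squeeze(-1) for name, head in self.heads.items()}

    def multitask_loss(outputs, targets, label_mask):
        """Equal-weight MSE aggregated over the dimensions that have labels in the batch."""
        mse = nn.MSELoss()
        losses = [mse(outputs[name][label_mask[name]], targets[name][label_mask[name]])
                  for name in outputs if label_mask[name].any()]
        return torch.stack(losses).mean()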
6.3 Experimental Procedure
Our experiments start from words, where we use single and multi-task models to generate
labels, before moving to the sentence level and the missing data scenario. This section describes the data
and experimental procedure.
6.3.1 Corpora
At the word level we use eleven dimensions from three sources, selected based on their
expected utility to computational analysis.
From the Extended Affective Norms for English Words (EANEW) [82] we use the
norms for arousal, valence and dominance, the three dimensions of the widely used di-
mensional model of affect [64]. Arousal represents excitement, the degree of physical
activation in preparation for action. Dominance is the degree of perceived control over
one’s circumstances. Valence is the continuous polarity (from very negative to very pos-
itive) of an emotion. The corpus contains ratings for all three dimensions for 13,915
words.
From the MRC Psycholinguistic database [84] we use the norms for concreteness,
imagability, age of acquisition and familiarity. Concreteness is the degree to which
something can be perceived using the five senses (from very abstract to very concrete).
Imagability is the degree to which one may create a mental image of the word’s subject.
Age of acquisition indicates the expected age at which one acquires (can use correctly)
the word. Familiarity is the degree of exposure to and knowledge of the word. The
corpus contains concreteness ratings for 4295 words, imagability for 4829 words, age of
acquisition for 1904 words and familiarity for 4924 words.
From the Paivio, Yuille and Madigan norms [20] we use the norms for pleasantness,
pronouncability, context availability and gender ladenness. Pleasantness is very similar
to valence and is a degree of how pleasant the feelings associated with a word are. Pro-
nouncability signifies how easy a word is to pronounce. Context availability represents
the number of different contexts in which a word may appear. Gender Ladenness rep-
resents the degree of perceived feminine or masculine association of a word (from very
masculine to very feminine). The corpus contains ratings for all four dimensions for 925
words.
At the sentence level the only compatible resource currently available is EmoBank
[13], which includes norm values for arousal, valence and dominance for sentences and
longer passages. EmoBank contains ratings from the reader perspective, the writer per-
spective, as well as composite scores. For the purposes of this work we use the composite
scores for all three dimensions for 10,062 texts.
To initialize our models and in lieu of using a large raw text corpus we use the pre-
trained GloVe embeddings generated from a corpus of 840 billion tokens [58].
All annotated corpora were re-scaled from their original annotation value ranges to the range of $[0, 1]$. Note that we did not normalize means or variances; we only made the ratings lie on the same scales.
Tokenization and part-of-speech tagging of the texts was performed using CoreNLP
[45].
6.3.2 Experiments
At the word level, similarly to previous work, we start from regression models that trans-
form word semantic representations into norm values. The models used are Ordinary
Least Squares (OLS) linear regression and a Support Vector Regressor (SVR) with a lin-
ear kernel. Then we move to using individual neural networks for each norm dimension,
which are tuned separately and can have different topologies. The next step is using multi-
task learning for each word resource, attempting to jointly train models for each. Finally
we use a single multi-task model to generate all ratings from all dimensions.
At the sentence level we first evaluate the scenario for which we have complete data,
including sentence-level annotations for all dimensions. This serves as an upper bound
of what we may expect if some of those sentence-level annotations are missing. For
this scenario we evaluate both single-task and multi-task models. Then we add word-
level annotations to the training set and re-evaluate. Comparing the performance of the
models using only sentence data with those that also include words in the training set
is an indirect way of validating the compatibility between our word and sentence-level
resources: a necessary step since they were annotated separately.
The next step is to evaluate the missing data scenario. We assume that we have word-
level annotations for all dimensions and sentence-level annotations for all dimensions
except one, which is our target. As baselines we use the word norm models from the
previous sections to generate word norms for content words (adjectives, adverbs, nouns
and verbs) in our texts and calculate their average, maximum and minimum.
All experiments are conducted using five-fold cross-validation. The fold splits are
stratified with respect to label existence: they include similar amounts of missing la-
bels for all dimensions, to facilitate multi-task learning. For neural network models the
training fold is further split into 90%/10% training and validation sets and the combined
validation set loss is used as an early stopping criterion. The validation set loss is also
used for hyper-parameter tuning, primarily component dimensionalities and number of
layers, of the various models.
All models are mostly composed of fully connected layers with equal dimensions for
any given model, e.g. a given model could have five fully connected layers with 200
cells each. The compositional component of the sentence model is a Gated Recurrent
Unit (GRU). We are using a dropout of 0.2 after every layer. The loss function used
is the mean square error, potentially aggregated across multiple outputs. All models
are initialized using GloVe word embeddings, which are not re-trained (they are frozen)
during model training.
Multi-task models use hard parameter sharing and equal weight losses. We did con-
duct experiments using different weighting schemes, including uncertainty weighting [19]
and GradNorm [17], without positive results.
We evaluate using Pearson’s correlation coefficient between the system outputs and
the ground truth annotations. Significance testing was performed using a bootstrap test
with ten thousand permutations.
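A sketch of a paired bootstrap test of the difference in Pearson's correlation between two systems; this is an illustrative resampling scheme, and the exact test used in the experiments may differ in detail:

    import numpy as np
    from scipy.stats import pearsonr

    def bootstrap_correlation_test(truth, pred_a, pred_b, n_resamples=10000, seed=0):
        """Fraction of bootstrap resamples in which system A does NOT beat system B
        in Pearson's r: an approximate one-sided p-value for "A is better than B"."""
        rng = np.random.default_rng(seed)
        truth, pred_a, pred_b = map(np.asarray, (truth, pred_a, pred_b))
        wins = 0
        for _ in range(n_resamples):
            idx = rng.integers(0, len(truth), len(truth))   # resample with replacement
            r_a = pearsonr(truth[idx], pred_a[idx])[0]
            r_b = pearsonr(truth[idx], pred_b[idx])[0]
            wins += r_a > r_b
        return 1.0 - wins / n_resamples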
Table 6.1: Word-level Pearson’s performance. OLS regression using context vectors against OLS and SVR using GloVe embeddings. Statistical significance marked by †: p < 0.05 or ††: p < 0.001.
Dimension OLS - cont. OLS - embed. SVR - embed.
EANEW
arousal 0.630†† 0.614 0.609
dominance 0.703 0.702 0.698
valence 0.810† 0.805 0.803
MRC
age of acquisition 0.856 0.837 0.848
concreteness 0.871†† 0.830 0.831
familiarity 0.858 0.857 0.852
imagability 0.836†† 0.788 0.784
Paivio
context availability 0.771 0.698 0.773
gender ladenness 0.785 0.742 0.798
pleasantness 0.801 0.733 0.795
pronounceability 0.701 0.627 0.691
6.4 Results
6.4.1 Word ratings
Table 6.1 shows the performance of an ordinary least squares (OLS) regressor utiliz-
ing context-based semantic similarity scores or GloVe embeddings as features and an
SVR regressor using GloVe embeddings as features. The OLS context-based scores
correspond to the method detailed in previous chapters. Given the restriction that the
transformation function from the semantic space to the norm space needs to be linear,
using word embeddings leads to significantly worse results. Using an SVR instead of
OLS with embeddings leads to, on average, improved results, but still does not match
the original method. However, this linearity restriction is artificial: we assumed that the
semantics-norm transformation is linear, then engineered a semantic representation that
would best fit that assumption, so naturally that representation would perform best. We
expected embedding results to improve by moving to a non-linear transformation.
Table 6.2: Word-level Pearson’s performance. Regression against individual neural networks. Statistical significance marked by †: p < 0.05 or ††: p < 0.001.
Dimension Regression Neural
EANEW
arousal 0.614 0.626††
dominance 0.702 0.718††
valence 0.805 0.845††
MRC
age of acquisition 0.848 0.850
concreteness 0.831 0.883††
familiarity 0.857 0.862
imagability 0.788 0.845††
Paivio
context availability 0.773 0.792†
gender ladenness 0.798 0.816†
pleasantness 0.795 0.823††
pronounceability 0.691 0.704
Table 6.2 shows the performance of the best (per dimension) regressor using GloVe
embeddings as features against the performance of neural networks trained separately
for each norm dimension. As expected, moving to a non-linear transformation function
significantly improved embedding results across almost all dimensions. Beyond that,
the performance achieved improves upon our previous results. Table 6.3 compares the
results of our existing method with those of individual neural networks. Overall, across
all dimensions the individual neural networks significantly outperform regressors, using
either context-based similarity or word embeddings, and set a new state-of-the-art.
Table 6.3: Word-level Pearson’s performance. OLS regression using context-based similarity against individual neural networks using GloVe embeddings. Statistical significance marked by †: p < 0.05 or ††: p < 0.001.
Dimension OLS Neural
EANEW
arousal 0.630 0.626
dominance 0.703 0.718††
valence 0.810 0.845††
MRC
age of acquisition 0.856 0.850
concreteness 0.871 0.883††
familiarity 0.858 0.862
imagability 0.836 0.845†
Paivio
context availability 0.771 0.792
gender ladenness 0.785 0.816††
pleasantness 0.801 0.823†
pronounceability 0.701 0.704
Up to this point we have been treating each norm dimension as an entirely separate
task with a distinct model. However, some dimensions are related, e.g. arousal, va-
lence and dominance are strongly dependent if not necessarily correlated. Attempting to
exploit the inter-dimension dependencies we used multi-task models jointly trained for
multiple dimensions. Table 6.4 shows results for multi-task models trained per-resource,
so three models each corresponding to one annotated lexicon. While we were able to
achieve improved results for the EANEW lexicon, the overall performance is worse,
Table 6.4: Word-level Pearson’s performance. Individual neural networks against per-resource multi-task networks. Statistical significance marked by †: p < 0.05 or ††: p < 0.001.
Dimension Individual Per Resource
EANEW
arousal 0.626 0.637††
dominance 0.718 0.722
valence 0.845 0.838
MRC
age of acquisition 0.850 0.853
concreteness 0.883† 0.877
familiarity 0.862† 0.857
imagability 0.845 0.844
Paivio
context availability 0.792 0.783
gender ladenness 0.816†† 0.792
pleasantness 0.823† 0.812
pronounceability 0.704 0.696
since the multi-task models fail on the other two lexica. The difference on average is
minimal however. Table 6.5 shows the results for a universal multi-task model that is
trained to produce all eleven norm dimensions. The performance achieved is further re-
duced from the per-resource multi-task scenario and the universal model is significantly
worse than individual models for most norm dimensions. There are a few possible rea-
sons for this outcome. Multi-task learning is a form of regularization, a way to avoid
overfitting, however we have already taken measures towards that in the form of dropout
and early stopping. It is possible that these limit the utility of regularization or even lead
to under-fitting. Another possible explanation comes from the results of the EANEW
specific model, the only one that achieves positive results. EANEW is the largest lexicon
we have and the three dimensions contained within are known to be highly dependent
Table 6.5: Word-level Pearson’s performance. Individual neural networks against a universal multi-task network. Statistical significance marked by †: p < 0.05 or ††: p < 0.001.
Dimension Individual Universal
EANEW
arousal 0.626†† 0.612
dominance 0.718 0.716
valence 0.845†† 0.836
MRC
age of acquisition 0.850†† 0.824
concreteness 0.883†† 0.848
familiarity 0.862†† 0.837
imagability 0.845†† 0.824
Paivio
context availability 0.792 0.788
gender ladenness 0.816†† 0.712
pleasantness 0.823 0.837
pronounceability 0.704 0.704
and therefore can be assumed to be compatible for the purposes of multi-task learning.
While some of the other dimensions can also be assumed to be dependent/compatible,
that is not true for every dimension of the MRC and Paivio resources or globally across
all dimensions, e.g. it is unlikely that age of acquisition and gender ladenness are com-
patible. It is perhaps this limited compatibility between the norm dimensions, in the global
case as well as the MRC and Paivio cases, that leads to the multi-task model results.
Regardless, we do see some potential for improved performance in the EANEW results,
so there is some value to the multi-task approach, but only in specific cases.
To summarize the word-level results, we showed that neural networks can work quite
well for the task and can achieve state-of-the-art performance across all norm dimen-
sions when using separate models tuned to each dimension. Using multi-task learning to
exploit the dependencies between dimensions can also work, but we need to ensure the
existence of said dependencies, that the norm dimensions jointly learned are compatible,
either through prior knowledge or experimental validation.
6.4.2 Sentence ratings
Table 6.6: Sentence-level Pearson’s performance, for the complete data scenario. Statistical significance marked by †: p < 0.05 or ††: p < 0.001.
Pearson’s
arousal dominance valence
single task
sent only 0.530†† 0.317†† 0.677
sent+words 0.516 0.301 0.677
multi task
sent only 0.532†† 0.332†† 0.674
sent+words 0.516 0.312 0.680
Table 6.6 shows the performance of sentence models trained on complete data: as-
suming that we have sentence level annotations for all dimensions. We trained separate
(per dimension) and multi-task models and used either only sentence data for training
or sentence and word data. The evaluation is performed only on sentence data. These
results serve as an upper bound of the performance we can expect when evaluating the
scenario where we have no sentence data for one of these dimensions. Comparing the
single-task and multi-task results we see no difference for arousal and valence, but there
is a significant improvement in the case of dominance. This is however only one of
three dimensions and overall multi-task learning did not lead to significant improvement
on average. The comparison between sentence only and sentence plus word results can
be viewed as an indirect measure of compatibility between the EANEW and EmoBank
datasets. Adding words to the training set leads to significant performance drops for
arousal and dominance, therefore the sets are not perfectly compatible, however the re-
duction is not particularly dramatic so we feel the sets are compatible enough for our
purposes.
Table 6.7: Sentence-level Pearson’s performance, for the missing data scenario. Statistical significance marked by †: p < 0.05 or ††: p < 0.001.
Pearson’s
arousal dominance valence
words baseline
mean 0.222 0.197 0.481
max 0.235 0.112 0.303
min 0.080 0.169 0.469
neural multi task 0.333†† 0.242†† 0.474
Table 6.7 shows the performance of sentence models trained on incomplete data: as-
suming that we have sentence level annotations for all dimensions except for one, which
is our evaluation target. The baselines correspond to simple statistics extracted over all
content words in each sentence, from word ratings generated by our best performing
word model. These are compared to multi-task neural models trained to generate rat-
ings for each dimension. As expected, these results are notably worse than what can be
achieved with complete data. The neural model produces significantly better results than
any of the baselines for the arousal and dominance dimensions, but fails to outperform
the mean of word ratings in the case of valence, where there is no significant difference.
This is more indicative of the potency of the mean value for this specific case: the mean
of word ratings has been used to estimate polarity or valence ratings with positive re-
sults, but - as also shown in the arousal and dominance results - it is highly unlikely that
a single statistic will work so well for any arbitrary dimension. Overall the results of
the neural model are positive. The model outperforming the baselines verifies the initial
assumptions, that we can leverage other dimensions to learn how to compose meaning
and benefit from the underlying dependencies.
6.5 Conclusions
We presented a neural framework for the generation of psycholinguistic norms with com-
plete data at the word level and incomplete data at the sentence level. At the word level
the use of simple, independent neural networks achieved the best results and led to state-
of-the-art performance for a variety of psycholinguistic norms from different sources.
Multi-task learning, overall, failed to improve upon these results, which we feel is an in-
dicator of limited compatibility between our different tasks. The multi-task approach did
produce positive results in cases where we had prior knowledge of norm compatibility.
At the sentence level the multi-task models with incomplete data were capable of
transferring knowledge from norm dimensions with available sentence annotations to
norm dimensions without. The models achieved significantly higher performance than
baselines based only on word ratings and present a valid alternative for sentence norm
generation for the - far too common - scenario where no annotations exist.
In the future we hope to validate the approach on a larger and more diverse set of norm
dimensions, as annotations become available. The other theme that emerged over the
course of this work was compatibility, either between the different norm dimensions or
between the word and sentence-level annotations. We intend to investigate and quantify
this aspect, the goal being to predict if the additional data is likely to lead to improved
performance.
Chapter 7
Application to Sentiment Analysis: SemEval 2014
This chapter describes our submissions to SemEval 2014 task 9 [62], which dealt with
sentiment analysis in Twitter. The system is an expansion of our submission to the same
task in 2013 [40], which used only token rating statistics as features. We expanded the
system by using multiple lexica and more statistics, added steps to the pre-processing
stage (including negation and multi-word expression handling), incorporated pairwise
tweet-level semantic similarities as features and finally performed feature extraction on
substrings and used the partial features as indicators of irony, sarcasm or humor.
7.1 Model Description
7.1.1 Preprocessing
POS-tagging/Tokenization was performed using the ARK Tweet NLP tagger [54], a
Twitter-specific tagger.
Negations were detected using the list from Christopher Potts’ tutorial. All tokens up to
the next punctuation were marked as negated.
The work presented in this chapter was published in the following article:
Nikolaos Malandrakis, Michael Falcone, Colin Vaz, Jesse James Bisogni, Alexandros Potamianos,
Shrikanth Narayanan, “SAIL: Sentiment Analysis using Semantic Similarity and Contrast Features”,
Proceedings of SemEval, 2014, pp. 512–516
Hashtag expansion into word strings was performed using a combination of a word in-
sertion Finite State Machine and a language model. A normalized perplexity threshold
was used to detect if the output was a “proper” English string and expansion was not
performed if it was not.
Multi-word Expressions (MWEs) were detected using the MIT jMWE library [34].
MWEs are non-compositional expressions [65], which should be handled as a single to-
ken instead of attempting to reconstruct their meaning from their parts.
7.1.2 Lexicon-based features
The core of the system was formed by the lexicon-based features. We used a total of four
lexica and some derivatives.
7.1.2.1 Third party lexica
We used three third party affective lexica.
SentiWordNet [22] provides continuous positive, negative and neutral ratings for each
sense of every word in WordNet. We created two versions of SentiWordNet: one where
ratings are averaged over all senses of a word (e.g., one rating for “good”) and one where
ratings are averaged over lexeme-pos pairs (e.g., one rating for the adjective “good” and
one for the noun “good”).
NRC Hashtag [47] Sentiment Lexicon provides continuous polarity ratings for tokens,
generated from a collection of tweets that had a positive or a negative word hashtag.
Sentiment140 [47] Lexicon provides continuous polarity ratings for tokens, generated
from the sentiment140 corpus of 1.6 million tweets, with emoticons used as positive and
negative labels.
7.1.2.2 Emotiword: expansion and adaptation
To create our own lexicon we used an automated algorithm of affective lexicon expansion
based on the one presented in [44, 42], which in turn is an expansion of [78].
We assume that the continuous (in $[-1, 1]$) valence, arousal and dominance ratings of any term $t_j$ can be represented as a linear combination of its semantic similarities $d_{ij}$ to a set of seed words $w_i$ and the known affective ratings of these words $v(w_i)$, as follows:

$$\hat{v}(t_j) = a_0 + \sum_{i=1}^{N} a_i \, v(w_i) \, d_{ij}, \qquad (7.1)$$

where $a_i$ is the weight corresponding to seed word $w_i$ (that is estimated as described next). For the purposes of this work, $d_{ij}$ is the cosine similarity between context vectors computed over a corpus of 116 million web snippets (up to 1000 for each word in the
Aspell spellchecker) collected using the Yahoo! search engine.
Given the starting, manually annotated, lexicon Affective Norms for English Words
[11] we selected 600 out of the 1034 words contained in it to serve as seed words and
all 1034 words to act as the training set and used Least Squares Estimation to estimate
the weights $a_i$. Seed word selection was performed by a simple heuristic: we want seed
words to have extreme affective ratings (high absolute value) and the set to be close to
balanced (sum of seed ratings equal to zero). The equation learned was used to generate
ratings for any new terms.
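One possible (illustrative) implementation of this heuristic, alternating between the most extreme remaining positive and negative words so the running sum of seed ratings stays near zero; the actual selection procedure may differ in detail:

    def select_seeds(word_ratings, n_seeds=600):
        """word_ratings: dict of word -> affective rating in [-1, 1]."""
        ranked = sorted(word_ratings, key=lambda w: abs(word_ratings[w]), reverse=True)
        positives = [w for w in ranked if word_ratings[w] >= 0]
        negatives = [w for w in ranked if word_ratings[w] < 0]
        seeds, total = [], 0.0
        while len(seeds) < n_seeds and (positives or negatives):
            # Pick from whichever sign pulls the running sum back toward zero.
            pool = negatives if total > 0 and negatives else (positives or negatives)
            word = pool.pop(0)
            seeds.append(word)
            total += word_ratings[word]
        return seeds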
The lexicon created by this method is task-independent, since both the starting lexi-
con and the raw text corpus are task-independent. To create task-specific lexica we used
corpus filtering on the 116 million sentences to select ones that match our domain, us-
ing either a normalized perplexity threshold (using a maximum likelihood trigram model
created from the training set tweets) or a combination of pragmatic constraints (keywords
with high mutual information with the task) and perplexity threshold [41]. Then we re-
calculated semantic similarities on the filtered corpora. In total we created three lexica:
a task-independent (base) version and two adapted versions (filtered by perplexity alone
and filtered by combining pragmatics and perplexity), all containing valence, arousal and
dominance token ratings.
7.1.2.3 Statistics extraction
The lexica provide up to 17 ratings for each token. To extract tweet-level features we used
simple statistics and selection criteria. First, all token unigrams and bigrams contained in
a tweet were collected. Some of these n-grams were selected based on a criterion: POS
tags, whether a token is (part of) a MWE, is negated or was expanded from a hashtag.
The criteria were applied separately to token unigrams and token bigrams (POS tags only
applied to unigrams). Then ratings statistics were extracted from the selected n-grams:
length (cardinality), min, max, max amplitude, sum, average, range (max minus min),
standard deviation and variance. We also created normalized versions by dividing by
the same statistics calculated over all tokens, e.g., the maximum of adjectives over the
maximum of all unigrams. The results of this process are features like “maximum of
Emotiword valence over unigram adjectives” and “average of SentiWordNet objectivity
among MWE bigrams”.
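A sketch of the statistics extraction for one selected n-gram group and one lexicon rating; names are illustrative, and the real system repeats this over all lexica, selection criteria and n-gram orders:

    import numpy as np

    STATS = {
        "length": len,
        "min": np.min, "max": np.max, "sum": np.sum, "avg": np.mean,
        "max_amplitude": lambda v: np.max(np.abs(v)),
        "range": lambda v: np.max(v) - np.min(v),
        "std": np.std, "var": np.var,
    }

    def ngram_statistics(selected_ratings, all_ratings, prefix):
        """Statistics over the ratings of a selected n-gram group, plus versions normalized
        by the same statistic over all n-grams (e.g. max of adjectives over max of all)."""
        features = {}
        if not selected_ratings:
            return features
        sel, ref = np.asarray(selected_ratings), np.asarray(all_ratings)
        for name, fn in STATS.items():
            value = fn(sel)
            features[f"{prefix}_{name}"] = value
            denominator = fn(ref)
            if denominator != 0:
                features[f"{prefix}_{name}_norm"] = value / denominator
        return features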
7.1.3 Tweet-level similarity ratings
Our lexicon was formed under the assumption that semantic similarity implies affective
similarity, which should apply to larger lexical units like entire tweets. To estimate se-
mantic similarity scores between tweets we used the publicly available TakeLab semantic
similarity toolkit [81] which is based on a submission to SemEval 2012 task 6 [4]. We
used the data of SemEval 2012 task 6 to train three semantic similarity models corre-
sponding to the three datasets of that task, plus an overall model. Using these models
we created four similarity ratings between each tweet of interest and each tweet in the
training set. These similarity ratings were used as features of the final model.
7.1.4 Character features
Capitalization features are frequencies and relative frequencies at the word and letter
level, extracted from all words that either start with a capital letter, have a capital letter
in them (but the first letter is non-capital) or are in all capital letters.
Punctuation features are frequencies, relative frequencies and punctuation unigrams.
Character repetition features are frequencies, relative frequencies and longest string
statistics of words containing a repetition of the same letter.
Emoticon features are frequencies, relative frequencies, and emoticon unigrams.
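An illustrative sketch of a few of these character-level counts; the regular expressions, especially the emoticon pattern, are rough approximations rather than the exact ones used:

    import re
    from collections import Counter

    def character_features(tokens):
        """Word-level frequencies and relative frequencies of capitalization,
        punctuation, letter repetition and emoticons."""
        n = max(len(tokens), 1)
        all_caps = sum(1 for t in tokens if t.isalpha() and t.isupper())
        initial_cap = sum(1 for t in tokens if t[:1].isupper() and not t.isupper())
        repeated = sum(1 for t in tokens if re.search(r"(.)\1\1", t))   # e.g. "sooo"
        # Counter keys could also be emitted as punctuation unigram features.
        punct = Counter(t for t in tokens if t and all(not c.isalnum() for c in t))
        emoticons = sum(1 for t in tokens
                        if re.fullmatch(r"[:;=8][-o*']?[)(\]\[dDpP/\\]", t))
        return {
            "all_caps": all_caps, "all_caps_rel": all_caps / n,
            "initial_cap": initial_cap, "initial_cap_rel": initial_cap / n,
            "char_repetition": repeated, "char_repetition_rel": repeated / n,
            "punct_tokens": sum(punct.values()), "emoticons": emoticons,
        }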
7.1.5 Contrast features
Cognitive Dissonance is an important phenomenon associated with complex linguistic
cases like sarcasm, irony and humor [60]. To estimate it we used a simple approach,
inspired by one-liner joke detection: we assumed that the final few tokens of each tweet
(the “suffix”) contrast the rest of the tweet (the “prefix”) and created split versions of the
tweet where the last $N$ tokens are the suffix and all other tokens are the prefix, for $N = 2$ and $N = 3$. We repeated the feature extraction process for all features mentioned above
(except for the semantic similarity features) for the prefix and suffix, nearly tripling the
total number of features.
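A minimal sketch of the prefix/suffix splitting; the feature extraction itself is then re-run on each part, as described above:

    def contrast_splits(tokens, suffix_lengths=(2, 3)):
        """Split a tweet into a prefix and a short suffix, assuming the final tokens
        may contrast the rest (a rough cue for irony, sarcasm or humor)."""
        splits = []
        for n in suffix_lengths:
            if len(tokens) > n:
                splits.append({"prefix": tokens[:-n], "suffix": tokens[-n:]})
        return splits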
7.1.6 Feature selection and Training
The extraction process led to tens of thousands of candidate features, so we performed
forward stepwise feature selection using a correlation criterion [24] and used the result-
ing set of 222 features to train a model. The model chosen is a Naive Bayes tree, a tree
with Naive Bayes classifiers on each leaf. The motivation comes from considering this a
two-stage problem: subjectivity detection and polarity classification, making a hierarchical model a natural choice. The feature selection and model training/classification were
conducted using Weka [85].
Table 7.1: Performance and rank achieved by our submission for all datasets of subtasks
A and B.
task dataset avg. F1 rank
A
LJ2014 70.62 16
SMS2013 74.46 16
TW2013 78.47 14
TW2014 76.89 13
TW2014SC 65.56 15
B
LJ2014 69.34 15
SMS2013 56.98 24
TW2013 66.80 10
TW2014 67.77 7
TW2014SC 57.26 2
7.2 Results
We took part in subtasks A and B of SemEval 2014 task 9, submitting constrained runs
trained with the data the task organizers provided. Subtask B was the priority and the
subtask A model was created as an afterthought: it only uses the lexicon-based and
morphology features for the target string and the entire tweet as features of an NB Tree.
The overall performance of our submission on all datasets (LiveJournal, SMS, Twitter
2013, Twitter 2014 and Twitter 2014 Sarcasm) can be seen in Table 7.1. The subtask A
system performed badly, ranking near the bottom (among 20 submissions) on all datasets,
a result perhaps expected given the limited attention we gave to the model. The subtask
B system did very well on the three Twitter datasets, ranking near the top (among 42
teams) on all three sets and placing second on the sarcastic tweets set, but did notably
worse on the two non-Twitter sets.
Table 7.2: Selected features for subtask B.
Features number
Lexicon-derived 178
By lexicon
Ewrd / S140 / SWNet / NRC 71 / 53 / 33 / 21
By POS tag
all (ignore tag) 103
adj / verb / proper noun 25 / 11 / 11
other tags 28
By function
avg / min / sum / max 45 / 40 / 38 / 26
other functions 29
Semantic similarity 29
Punctuation 7
Emoticon 5
Other features 3
Contrast 72
prefix / suffix 54 / 18
A compact list of the features selected by the subtask B system can be seen in Ta-
ble 7.2. The majority of features (178 of 222) are lexicon-based, 29 are semantic simi-
larities to known tweets and the rest are mainly punctuation and emoticon features. The
lexicon-based features mostly come from Emotiword, though that is probably because
Emotiword contains a rating for every unigram and bigram in the tweets, unlike the other
lexica. The most important part-of-speech tags are adjectives and verbs, as expected,
with proper nouns being also highly important, presumably as indicators of attribution.
Still, most features are calculated over all tokens (including stop words). Finally it is
worth noting the 72 contrast features selected.
Table 7.3: Performance on all data sets of subtask B after removing one set of features.
Performance difference with the complete system listed if greater than 1%.
Features removed
LJ2014 SMS2013 TW2013 TW2014 TW2014SC
avg. F1 diff avg. F1 diff avg. F1 diff avg. F1 diff avg. F1 diff
None (Submitted) 69.3 57.0 66.8 67.8 57.3
Lexicon-derived 43.6 -25.8 38.2 -18.8 49.5 -17.4 51.5 -16.3 43.5 -13.8
Emotiword 67.5 -1.9 56.4 63.5 -3.3 66.1 -1.7 54.8 -2.5
Base 68.4 56.3 65.0 -1.9 66.4 -1.4 59.6 2.3
Adapted 69.3 57.4 66.7 67.5 50.8 -6.5
Sentiment140 68.1 -1.3 54.5 -2.5 64.4 -2.4 64.2 -3.6 45.4 -11.9
NRC Tag 70.6 1.3 58.5 1.6 66.3 66.0 -1.7 55.3 -2.0
SentiWordNet 68.7 56.0 66.2 68.1 52.7 -4.6
per Lexeme 69.3 56.7 66.1 68.0 52.7 -4.5
per Lexeme-POS 68.8 57.1 66.7 67.4 55.0 -2.2
SemanticSimilarity 69.0 58.2 1.2 64.9 -2.0 65.5 -2.2 52.2 -5.0
Punctuation 69.7 57.4 66.6 67.1 53.9 -3.4
Emoticon 69.3 57.0 66.8 67.8 57.3
Contrast 69.2 57.5 66.7 67.0 51.9 -5.4
Prefix 69.5 57.2 66.8 67.2 47.4 -9.9
Suffix 68.6 57.2 66.5 67.9 56.3
We also conducted a set of experiments using partial feature sets: each time we use
all features minus one set, then apply feature selection and classification. The results are
presented in Table 7.3. As expected, the lexicon-based features are the most important
ones by a wide margin though the relative usefulness of the lexica changes depending on
the dataset: the Twitter-specific NRC lexicon actually hurts performance on non-tweets,
while the task-independent Emotiword hurts performance on the sarcastic tweets set.
Overall, though, using all of them is the optimal choice. Among the other features only semantic
similarity provides a relatively consistent improvement.
A lot of features provide very little benefit on most sets, but virtually everything is
important for the sarcasm set. Lexica, particularly the Twitter-specific ones like Sentiment140 and the adapted version of Emotiword, make a big difference, perhaps indicating
some domain-specific aspects of sarcasm expression (though such assumptions are shaky
at best due to the small size of the test set). The contrast features perform their intended
function well, providing a large performance boost when dealing with sarcastic tweets
and perhaps explaining our high ranking on that dataset.
Overall the subtask B system performed very well and the semantic similarity fea-
tures and contrast features provide potential for further growth.
7.3 Conclusions
We presented a system for Twitter sentiment analysis combining lexicon-based features
with semantic similarity and contrast features. The system proved very successful, achiev-
ing high ranks among all competing systems in the tasks of sentiment analysis of generic
and sarcastic tweets.
Future work will focus on the semantic similarity and contrast features by attempting
to more accurately estimate semantic similarity and using some more systematic way of
identifying the “contrasting” text areas.
Chapter 8
Conclusions and Future Work
This thesis presented a data-driven framework of norm generation for words and sen-
tences. In the following we will briefly summarize the thesis contributions and discuss
how these can be expanded in our future work.
8.1 Word-level norm generation
This thesis proposed the use of distributional semantic representations, specifically context-
based similarity vectors and word embeddings, and supervised learning to facilitate the
creation of high quality word-level norms. Multiple semantic representations and mod-
els were evaluated, with neural networks providing the best results. The method has been
successful in estimating a wide variety of norms. Multi-Task learning shows promise, but
applicability appears to be limited by norm compatibility. Quantitative evaluation indi-
cates very high performance on every norm we have attempted to generate and state-of-
the-art performance on norms for which there exists comparable literature. The method
has been used to generate norms never expanded before and even norms in non-English
languages.
8.2 Domain adaptation of Norms
This thesis proposed using corpus filtering/selection strategies as a method to create
domain-specific semantic similarities and in turn domain-specific word norm values. The
use of pragmatic constraints and perplexity was evaluated, with the combination of the
two yielding the best results. Our results, evaluating the use of domain-adapted norms
as classification features for a variety of tasks, indicate that the approach creates norm
values more relevant to the task. Each task requires individual tuning of the corpus se-
lection approach, because there is a trade-off between how well the corpus fits the task
and how large that corpus is (which affects semantic similarity estimation).
8.3 Bigram and sentence-level norm generation
For the scenario of very limited data availability (a few hundred sentences) for a norm
dimension, this thesis proposed the use of bigram norm values, calculated by gener-
ating bigram context and then similarity vectors, for the creation of continuous norm
values for sentences. This approach leverages the purely data-driven nature of the lexi-
con expansion algorithm to generate bigram ratings directly, without requiring a meaning
composition algorithm. Our results show that the resulting bigram norm ratings are of
significantly lower quality than the corresponding unigram norm ratings and multiple
strategies were proposed on how to combine unigram and bigram ratings into sentence
ratings, with the weighted averaging of all n-gram ratings producing the best results.
The sentence-level algorithm produced state-of-the-art results at the task of estimating
continuous valence for sentences, with only 3 parameters and limited sentence-level su-
pervision.
For the scenario of no sentence-level data availability for a target norm, this thesis proposed the use of multi-task neural networks to transfer knowledge from other norms for which sentence annotations exist. The approach proved successful, achieving improved performance and showing that we can exploit other norm dimensions to learn composition rules, and word annotations to learn the relations between dimensions as well as the mapping from the joint semantic space into a norm space.
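The following is a condensed sketch of the multi-task idea: a shared encoder with one small regression head per norm dimension, so that dimensions with sentence annotations shape the shared composition layers that dimensions with only word-level data then reuse. The feature dimensionality, layer sizes, and training loop are simplified assumptions, not the architecture used in the thesis.

```python
# Hard parameter sharing across norm dimensions: shared layers + per-norm heads.
# Each dimension contributes its own (features, rating) batches, so a dimension
# lacking sentence data can supply word-level items only. Toy data throughout.
import torch
import torch.nn as nn

class MultiTaskNormNet(nn.Module):
    def __init__(self, in_dim, norm_names, hidden=64):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        # one scalar regression head per norm dimension
        self.heads = nn.ModuleDict({n: nn.Linear(hidden, 1) for n in norm_names})

    def forward(self, x, norm_name):
        return self.heads[norm_name](self.shared(x)).squeeze(-1)

norms = ["valence", "arousal", "dominance"]
net = MultiTaskNormNet(in_dim=50, norm_names=norms)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

batches = {n: (torch.randn(8, 50), torch.randn(8)) for n in norms}  # toy batches
for _ in range(100):
    for name, (x, y) in batches.items():
        opt.zero_grad()
        loss_fn(net(x, name), y).backward()   # each head trains on its own data
        opt.step()
```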
8.4 Future Work
This thesis explored the use of multi-task models for word-level norm estimation with mixed results. We only achieved significant improvement when the norm dimensions learned jointly were known to be highly dependent/compatible, as in the case of arousal, valence and dominance. The issue of task compatibility is well known in the multi-task learning research community, where it is desirable to cluster tasks into groups for which joint learning is most likely to yield positive results, but it remains an open problem. It is worth investigating further, possibly by devising a data-driven method of ascertaining the compatibility between norm dimensions that would let us create more meaningful groups.
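One possible data-driven proxy, offered here purely as an assumption rather than an established procedure, would be to correlate the existing ratings of each pair of norms over their shared vocabulary and group strongly related dimensions before joint training; the norm names and values below are invented for illustration.

```python
# Hypothetical compatibility measure: absolute Pearson correlation of two
# norms' ratings over the words they both cover. All ratings are toy values.
import numpy as np

def compatibility(norm_a, norm_b):
    """norm_a, norm_b: {word: rating}; returns |Pearson r| over shared words."""
    shared = sorted(set(norm_a) & set(norm_b))
    if len(shared) < 3:
        return 0.0
    a = np.array([norm_a[w] for w in shared])
    b = np.array([norm_b[w] for w in shared])
    return abs(np.corrcoef(a, b)[0, 1])

valence   = {"joy": 0.9, "grief": -0.8, "calm": 0.4, "rage": -0.7}
dominance = {"joy": 0.6, "grief": -0.5, "calm": 0.3, "rage": 0.2}
aoa       = {"joy": 0.3, "grief": 0.8, "calm": 0.6, "rage": 0.2}
print(compatibility(valence, dominance))  # high -> candidates for joint training
print(compatibility(valence, aoa))        # low  -> probably train separately
```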
This thesis presented some norm generation experiments for non-English languages and even across languages. Some of the performance achieved in these cross-lingual experiments is actually higher than that achieved in the corresponding within-language experiments, indicating that there may be merit in combining resources from multiple languages to improve results on the target language, though results so far have been mixed. We should investigate this further by combining annotations from resources in various languages and targeting English. Pivoting across languages can be achieved through both machine translation and multi-lingual embeddings.
This thesis presented a multi-task learning approach, utilizing word and sentence
level annotations from multiple norm dimensions to generate improved sentence-level
norm estimates for dimensions where only word-level data exist. The experiments pre-
sented only evaluate the method on the three dimensions of emotion. We can use the
method to generate norms for any other dimension, but no data exist against which to
validate. For some, but not all, of these dimensions it may be relatively straightforward
95
to collect annotations. As an alternative we can perhaps use proxy tasks, for example
using the age ratings of books to evaluate generated norms of age of acquisition.
Appendix A
Appendix
A.1 Generating Word Norms across Languages
For the purposes of the DARPA LORELEI project we wanted the ability to generate norms across languages. We used machine translation (MT) to pivot across languages and affective lexica from multiple languages to train the norm expansion model. Semantic similarity and all other processing were done in English (lexica from other languages were translated to English using MT). Table A.1 shows the results obtained when training a model on just the English lexicon (ANEW) or on the concatenation of all lexica apart from the one in the target language. The results are fairly good, and combining multiple lexica does improve performance in the target language. It is also worth noting the performance achieved on English, which is actually higher than what we get from a model trained on English data alone.
Table A.1: Word-level Pearson Correlation performance for the task of cross-lingual
Norm estimation. Results shown, per target language, for Valence and Arousal and for
the cases where the word model was trained using only the English lexicon or all lexica
apart from the target language one.
             Valence            Arousal
             EN only   All      EN only   All
English      -         0.90     -         0.71
German       0.68      0.72     0.50      0.55
Spanish      0.86      0.87     0.70      0.72
Finnish      0.84      0.88     0.54      0.51
French       0.75      0.80     0.40      0.49
Dutch        0.74      0.77     0.58      0.58
Portuguese   0.86      0.87     0.58      0.64
Greek        0.82      0.83     0.51      0.52
Turkish      0.77      0.81     0.51      0.54
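A schematic sketch of the setup behind Table A.1 follows: every non-English lexicon is mapped into English (the `translate` stub stands in for the MT pivot), the translated entries are concatenated into one training lexicon, and a single English-side regressor is trained on everything except the target language. The function names, regressor choice, and data are placeholders, not the actual LORELEI pipeline.

```python
# Cross-lingual norm training sketch: hold out the target language's lexicon,
# translate the rest into English, and fit one regressor on the combined data.
import numpy as np
from sklearn.linear_model import Ridge

def translate(word, lang):
    # placeholder for the MT pivot; a real system would return the English gloss
    return word

def train_cross_lingual(lexica, embeddings, target_lang):
    """lexica: {lang: {word: rating}}; trains on all languages except target_lang."""
    X, y = [], []
    for lang, lex in lexica.items():
        if lang == target_lang:
            continue
        for word, rating in lex.items():
            en = translate(word, lang)
            if en in embeddings:
                X.append(embeddings[en]); y.append(rating)
    return Ridge().fit(np.vstack(X), np.array(y))  # apply to translated target words

rng = np.random.default_rng(1)
emb = {w: rng.normal(size=20) for w in ["good", "bad", "calm"]}
lexica = {"en": {"good": 0.8, "bad": -0.7}, "es": {"good": 0.9, "calm": 0.3},
          "de": {"bad": -0.6, "calm": 0.2}}
model = train_cross_lingual(lexica, emb, target_lang="de")
```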
A.2 Using Norms to describe Movie Language
We have used norm aggregates as a way to describe movie dialogs. The approach was simple aggregation: the mean across content words per sentence, then the mean across sentences to create character- or movie-level scores, and finally the mean across movies to create values for genres or release years. The results have been very reasonable and interpretable. From the norm values we can see, for instance, that Forrest Gump uses a vocabulary with a very low age of acquisition, or we can compare the vocabularies of Tyler and Jack, the two alternate personalities from “Fight Club”. Comparisons across movies are also possible, though there are potential pitfalls, since each movie can be considered a separate domain (movies can take place in entirely different universes). This is a target-rich environment and an application that is entirely enabled by the norms; however, more validation is required.
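A compact sketch of this aggregation pipeline is given below: average word norms over the content words of each sentence, average sentences to get a character- or movie-level score, then average movies to get genre- or year-level values. The stop-word list, norm values, and dialog lines are illustrative assumptions only.

```python
# Hierarchical mean aggregation of word norms: word -> sentence -> movie -> genre.
from statistics import mean

STOP = {"the", "a", "an", "and", "to", "of", "is", "was", "can"}  # toy stop list

def sentence_score(sentence, word_norms):
    vals = [word_norms[w] for w in sentence.lower().split()
            if w not in STOP and w in word_norms]
    return mean(vals) if vals else None

def movie_score(sentences, word_norms):
    scores = [s for s in (sentence_score(x, word_norms) for x in sentences)
              if s is not None]
    return mean(scores) if scores else None

def genre_score(movies, word_norms):
    scores = [s for s in (movie_score(m, word_norms) for m in movies)
              if s is not None]
    return mean(scores) if scores else None

aoa = {"run": 0.2, "ponder": 0.8, "dog": 0.1, "existential": 0.9}  # toy AoA norms
movie_a = ["the dog can run", "run to the dog"]
movie_b = ["ponder the existential dread"]
print(movie_score(movie_a, aoa), genre_score([movie_a, movie_b], aoa))
```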
Figure A.1: Norm dimensions used to characterize movie language.
Figure A.2: Movie results examples: “Forrest Gump” and “Fight Club”.
Figure A.3: Cross-Movie comparisons: Male and Female action heroes.
Abstract
Lexical norms are numeric representations of the normative characteristics of language, and are popular as high-level representations of meaning used in a variety of applications. Psycholinguistic norms in particular capture elements of the subjective expression through, or perception of, language, such as emotional state, concreteness or even gender ladenness. In this work we explore the automatic generation of psycholinguistic norms from data-driven semantic representations, semantic similarity and word embeddings. We present a framework that enables the generation of highly accurate word-level norms estimates, then generalize to sentences and longer passages. We also explore the use of multi-task learning to generate sentence-level norms with only word-level supervision. The proposed framework achieves state-of-the-art performance and the resulting norms have proven valuable in the analysis of language use in a variety of tasks.