Language Understanding in Context: Incorporating Information About Sources and Targets
by
Justin Garten
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Science)
August 2018
Copyright 2018 Justin Garten
Table of Contents

Chapter 1: Introduction
Chapter 2: Dictionaries and Distributions: Combining Expert Knowledge and Large Scale Textual Data Content Analysis
Chapter 3: Bridging the Gap Between Continuous Representations and Categorical Features
Chapter 4: Incorporating Demographic Embeddings into Language Understanding
Bibliography
Chapter 1
Introduction
The meaning of an utterance depends on context. This is trivially true with deictics [28], where a phrase such as "this goes here" tells us little outside of a particular situation. But it is generally true when considering language as a record of the intents, actions, and reactions of real humans. Nonetheless, the current dominant techniques in computational language understanding generally don't consider contextual factors. Semantic representations are trained on large bodies of "neutral" data, yielding results which capture only certain facets of human language understanding [45]. Supervised models are based on corpora of fragmented text and labels, isolated from their sources. These techniques are extremely powerful and have proved very useful for tasks and questions where we can ignore local variation as noise. But as we go beyond that subset of tasks, context becomes increasingly important.
Chapter 2 extends existing tools for exploring psychological concepts in text to make use of context in the form of distributed representations. Prior psychological text analysis has made extensive use of concept dictionaries, generally applied through word count methods. Here, we introduce Distributed Dictionary Representations (DDR), a method that applies psychological dictionaries using semantic similarity rather than word counts. This allows for the measurement of the similarity between dictionaries and spans of text ranging from complete documents to individual words. We show how DDR enables dictionary authors to place greater emphasis on construct validity without sacrificing linguistic coverage, demonstrate the benefits of DDR on two real-world tasks, and finally conduct an extensive study of the interaction between dictionary size and task performance. These studies allow us to examine how DDR and word count methods complement one another as tools for leveraging concept dictionaries and show where each is best applied. Finally, we provide references to tools and resources to make this method both available and accessible to a broad audience.
Chapter 3 considers how prior beliefs about the speaker affect interpretation of the intent of a statement. The difference between a joke and an insult, or a comment and a request, is personal and situation-dependent. While computational efforts at language understanding have focused on extracting statistical regularities from large quantities of text, psychological modeling has focused on individual and contextual factors. Here, we explore approaches to bridging this gap, combining statistical linguistic background knowledge with contextual beliefs. We compare the impacts of three approaches: (1) treating contextual factors as noise, (2) training separate models based on context, and (3) combining text representation and context into a single model. We explore these possibilities on a dataset of US presidential debates annotated to identify speaker intent. This is a useful domain given prior work showing the importance of knowledge of speaker party and incumbency status to human interpretation of political speech. We show that it is possible to combine state-of-the-art continuous representations of text with categorical features in a manner which improves both predictive accuracy and interpretability and is easily applicable to a wide range of tasks.
Chapter 4 focuses on the other side of this relationship, looking at how differences between listeners affect their reactions to statements. Prior theoretical work has considered the interaction between individuals and language as essential to understanding both the intentionality and meaning of utterances in context [34, 13, 32]. Interpretation of an utterance shifts based on a variety of factors including personal history, background knowledge, and our relationship to the source. While obviously an incomplete model of individual differences, demographic factors provide a useful starting point and allow us to capture some of this variance. However, the relevance of specific demographic factors varies between situations: where age might be the key factor in one context, ideology might dominate in another. To address this challenge, we introduce a method for combining demographics and context into situated demographic embeddings, mapping representations into a continuous geometric space appropriate for the given domain, and show the resulting representations to be functional and interpretable. We further demonstrate how to make use of related external data so as to apply this approach in low-resource situations. Finally, we show how these representations can be incorporated to improve modeling of real-world natural language understanding tasks, improving model performance and helping with issues of data sparsity.
Chapter 2
Dictionaries and Distributions: Combining Expert
Knowledge and Large Scale Textual Data Content Analysis
2.1 Introduction
Language and communication play a central role in psychological research both as direct objects of study and as windows into underlying psychological processes. In order to automate analysis of large quantities of text-based communication, psychological researchers have primarily captured psychological phenomena by developing and applying domain dictionaries [113, 99], lists of words which are considered indicative of a particular latent factor. These dictionaries have generally been applied using word-count methods that involve counting the frequency of dictionary words in samples of text. This intuitive approach has provided a simple method of applying domain knowledge to large sources of data. It has been successfully applied to topics ranging from sentiment analysis [119] to group differences in moral concerns [41] to the evaluation of depression in clinical patients [104].
This work has also led to insights which have fed back into both linguistics and computer science. One notable discovery has been the importance of closed-class terms to understanding psychological properties from language [98]. A number of word classes such as determiners, pronouns, and conjunctions, and sub-classes such as modal verbs, are considered to be closed since they are relatively fixed, with words rarely added or removed. Given the Zipfian distribution of language [102], these small sets of common words compose around 60% of many English texts. Preferring to focus on content words, many computational approaches dismissed these as "stopwords" [126] which could be safely ignored. However, psychological applications of dictionaries and word counts showed these to be essential to understanding a range of phenomena including emotional state [97], authorship identification [9], and social hierarchies [65].
While word count is an ideal method for applying psychological dictionaries composed of closed-class terms, many dictionaries which have attempted to codify psychological concepts (such as positive and negative emotions, terms associated with depression, morally loaded terms, etc.) are composed of open-class terms. These classes, such as nouns, adjectives, and verbs, are much larger than closed classes (even if the individual terms are less frequently used) and present a range of unique challenges.
First, even a well-developed open-class dictionary will struggle to capture a concept in all possible contexts. No researcher can be familiar with all possible sociolects [78], that is, all dialects associated with combinations of age groups, ethnic groups, socioeconomic classes, gender cohorts, regional clusters, etc. There is no such thing as domain independence when it comes to language. This can have pernicious effects, as measures will be most effective when applied to groups similar to the researchers and their immediate cohort [52]; [82] referred to this as "home-field disadvantage." While it's possible to bring in representatives of particular groups of interest, this vastly increases the complexity of dictionary generation and is still limited to groups which the researchers are aware of.
Second, the contextual dependence of language means that a simple list of open-class words can only cover narrow strips of a concept. Even in the simple domain of product reviews, words can have opposite senses within a single category. For cameras, "long" is positive in the sense of having a "long battery life" while negative when referring to a "long focusing time" [76, 58]. A cold beer is good while a cold therapist is probably best avoided.
Third, the dynamic and generative aspects of language mean that even a theoretically universal dictionary would not remain so for long. While this is less of an issue in some contexts, such as fixed sets of historical texts [109], it is particularly salient in modern online contexts such as social media. Terms rapidly appear, disappear, and change meanings. Linguistic resources which expect consistent usage, "correct" grammar, or even recognizable spelling are of limited use for text coming from Facebook or Twitter [70]. While lexical drift might not pose a major short-term threat to dictionary validity, it is likely that as the time between dictionary construction and application increases, the chance of error increases. Dictionaries can be updated, but this requires yet more resource investment.
All of this combines to make dictionary development both resource-intensive and challenging to do well. While the most representative words for a category might be easy to recall, low-frequency words can easily escape even the most diligent expert. Variations in colloquial lexicons across social and cultural groups make implicit bias a constant threat.
Systems such as Linguistic Inquiry and Word Count (LIWC) [99] have helped both by providing a validated set of high-quality dictionaries and by decreasing the cost of dictionary-based text analysis through pattern matching tools. A pattern like "ador*" (from the LIWC affect dictionary) automatically captures words like "adore", "adoration", and "adoringly". However, because this pattern matching relies on morphological similarity, semantically unrelated terms can also be caught (e.g., "adornment" in the previous example). Given the risk of such unwanted terms, knowing how and when to use these patterns requires additional expertise (such as awareness of relative word frequencies in the target domains, needed to know when spurious wildcard matches can be safely ignored), which further complicates dictionary authorship.

One possible answer is to consider semantic rather than morphological similarity. This is the approach we explore here: applying psychological dictionaries by measuring their similarity to words or segments of text rather than asking whether or not words are explicitly in the dictionary.
Of course, this raises the question of how to measure semantic similarity. To this end, we make use of distributed representations, where words are represented as points in a low-dimensional space (generally 10-1000 dimensions). Such spaces offer a simple way to determine similarity between words in terms of distance in the space. For example, two words which are highly similar to one another, such as "doctor" and "physician", would be near one another in the space.

While often treated as a recent development, distributed representations have been explored for decades. Geometric spaces were first used to represent semantic structure as early as the 1950s in psychology [93] and continued to develop in the information retrieval community through the 1970s [107]. One of the more popular approaches for psychological applications, Latent Semantic Analysis (LSA) [24], emerged by combining work from the information retrieval community with psychological research on word meaning and language learning [71].
More recently, cognitive psychologists developed Parallel Distributed Processing (PDP) [106], where distributed representations approached their current form. Work on PDP not only demonstrated how these representations might be used in complex tasks but also showed how they could be learned from and serve as the natural inputs and outputs of neural networks.
Critically, these multi-dimensional spaces proved to be relatively easy to generate using large
bodies of unlabeled text. The primary approach has been based on the distributional hypothesis,
a formalization of J.R. Firth's aphorism, "you shall know a word by the company it keeps" [30].
Effectively, this says that if two words occur in similar contexts, they're likely to be more similar
to one another. So, if we saw the sentences "the cat ran across the room" and "the dog ran across
the room", we could infer that cats and dogs likely had some things in common. While a single
instance might not tell us much, over billions of words and contexts, we are able to build highly
detailed representations.
Modern methods of generating distributed representations [21, 86, 100] prove both to be more efficient and to produce representations that show surprising semantic regularities (although recent work has suggested that these regularities also existed in LSA [72], obscured due to non-linearity). For example, the nearest neighbors of terms tended to be highly meaningful. Given the word "frog", the nearest terms in one representation were "frogs", "toad", and then the names of particular species of frogs ("litoria", "leptodactylidae", "rana") [100].
Further, it was found that representations trained in this way did not just encode attributional similarity (a measure of synonymy between words) but also relational similarity (a measure of analogical similarity between two pairs of words) [83, 122]. Simple linear transformations encoded meaningful semantic, syntactic, and structural relationships. This famously allowed for the solution of simple analogical reasoning problems such as "man is to woman as king is to (queen)" [88].
For psychological research, these techniques offer a unique opportunity. In particular, the structural regularities observed in distributed representations provide a route past some of the challenges around applying dictionaries of open-class words.
In the current paper, we introduce the Distributed Dictionary Representation (DDR) method and explore its application to two domains, one learning to predict the sentiment of movie reviews and another attempting to determine the moral loading of posts on Twitter. Finally, we carry out an extensive evaluation of how a dictionary's size and structure impact its effectiveness when applied through DDR.
2.2 Distributed Dictionary Representations
We demonstrate a novel method of combining psychological dictionary methods and distributed representations which indicates that these two methods are not only compatible, but that combining the two adds to the flexibility of both and opens new avenues for exploration. Our method, which we term Distributed Dictionary Representation (DDR), averages the representations of the words in a dictionary and uses that average to represent a given concept as a point in the semantic space. We can use this representation to provide a continuous measure of how similar other words are to a given concept.
One advantage of this method is to improve the ability to apply dictionaries to small pieces of text (down to individual words). This is critical as more and more social scientific text analysis makes use of social media posts [90, 67, 26, 25], which are often no more than a few words long. At that length, it is unlikely that any words from an open-class dictionary will be present to be counted. Prior social media research has noted precisely this difficulty [46], with the common solution being to aggregate multiple short posts into larger documents [120]. The disadvantage is that we lose post-level granularity and the ability to track changes over time, which is critical in a number of areas such as clinical psychology.
DDR also has a number of benefits for dictionary development. Since the purpose of the dictionary is now to identify the core of a concept rather than every possible word which might be associated with that concept, it is possible to produce a dictionary from a small list of the most salient words. This makes it easier for researchers to generate new dictionaries and apply them to explore theoretical concepts where the resources may not have been previously available for large-scale text analysis. By making use of distributional semantic similarity, researchers can focus on concept validity rather than dealing with linguistic issues.
2.2.1 Method Details
The goal of the DDR method (available as open-source software at https://github.com/USC-CSSL/DDR/) is to take a list of words characteristic of a category (often referred to as a concept dictionary in social scientific research) and use those words to generate a continuous measure of similarity between that concept and any other piece of text. The key factor that separates this from the standard application of a dictionary using word count methods is the combination with a pre-trained distributed representation. This representation can either be trained on a wide-coverage corpus of text (web pages, news articles, etc.) or on a domain-specific corpus (social media posts, interview transcripts, etc.).

DDR creates a concept representation by finding the vector representation of each of the words in the dictionary and averaging them. This is similar to the averaging method for generating sentence or document representations from word embeddings [71, 33, 89]. Using this concept representation, other words, phrases, or documents can be compared to determine their similarity to the category.

For example, imagine that a psychologist wished to study happiness and, for simplicity, assume that their dictionary consisted only of the words "happy" and "joy". Given a distributed representation, DDR would find the vector representations of these two words and average them to produce a concept representation based on this dictionary.
Formally, we consider a non-empty dictionary $D$ of $m$ words $\{w_1, w_2, \ldots, w_m\}$ and a pre-trained $n$-dimensional distributed representation $R$. $R$ can be treated as a map defined over the words in its vocabulary $V$ such that, for each word in its vocabulary, $R$ maps that word to an $n$-dimensional real-valued vector:

$$R(w) = [d_1, d_2, \ldots, d_n], \quad \forall w \in V$$
So, to take a word from our previous example, $R$ would map "joy" to an $n$-dimensional vector corresponding to the word's location in the distributed representation.
The next step is to generate the representation of the concept dictionary $C_R$ in the chosen distributed representation $R$. We first find which words in the dictionary are present in the vocabulary of the distributed representation, taking the intersection of the dictionary $D$ and the vocabulary $V$:

$$D_R = D \cap V$$
Finally, we add the representations of the words in this intersection together and normalize this value to generate a concept representation compatible with the word representations in the distributed representation:

$$C_R = \frac{\sum_{w \in D_R} R(w)}{\left\lVert \sum_{w \in D_R} R(w) \right\rVert}$$
With this category representation $C_R$, we can now calculate its similarity to any word $w$ in the vocabulary $V$. We make use of cosine similarity, a measure which defines similarity in terms of the angle $\theta$ between the vectors (for normalized vectors, this is order-equivalent to Euclidean distance). While this simple symmetric notion of similarity has been shown to be inadequate by previous research [123, 83], it is still useful: prior work has shown that the local structure of nearest neighbors for terms is highly semantically relevant [63, 87] even if it does not capture the full psycholinguistic notion of concept similarity. Similarity is maximized when $\cos\theta = 1$, meaning that $\theta = 0$ (i.e., the vectors are parallel). A value of $-1$ signals that $\theta = \pi$ (i.e., the two are pointing in opposite directions and so are maximally dissimilar). Cosine similarity returns a value between 1 (maximum similarity) and $-1$ (maximum dissimilarity) and can be calculated by:

$$\cos\theta = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$$

where $\lVert x \rVert$ is the length of vector $x$:

$$\lVert x \rVert = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$$

This formula allows us to make use of a computational shortcut: by pre-normalizing all vectors in the space (usually done in a single pre-processing pass over the entire distributed representation), the previous calculation reduces to $\cos\theta = x \cdot y$.
We apply a similar method for determining the similarity of a document or phrase to a dictionary. As a first step, a document summary vector $T_R$ is generated using the same word-averaging method described for generating the concept representation. That is, we find the representation for each word in the document in the distributed representation, add each of those together, and normalize the resulting vector. (While this method of generating document representations ignores word order and the syntactic structure of the document, this sort of "bag of words" representation is sufficient for many domains.)

We can make use of this document representation in the same way we previously made use of word vectors. In particular, we can find the similarity of the document representation $T_R$ to the concept representation $C_R$ with the same distance metric we discussed previously, returning a final similarity score ranging between 1 and $-1$:

$$D(C_R, T_R) = C_R \cdot T_R$$
Given this, we can combine a dictionary with a distributed representation to produce a measure of category similarity in terms of that representation. Just as applying a dictionary through word counts yields a single scalar value (in that case, the number of words in a document found in the dictionary divided by the total number of words in the document), DDR produces a single scalar value representing the distributional similarity of the dictionary to the document.
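To make the preceding steps concrete, here is a minimal sketch of the core DDR computation in Python, using gensim to load a pre-trained embedding. The helper names (ddr_concept, ddr_text_vector, ddr_similarity), the embedding path, and the toy dictionary are illustrative assumptions, not the interface of the released DDR package.

```python
import numpy as np
from gensim.models import KeyedVectors

# Load a pre-trained distributed representation R (path is a placeholder).
R = KeyedVectors.load_word2vec_format("embeddings.bin", binary=True)

def ddr_concept(dictionary, R):
    """Average the vectors of the dictionary words found in R's vocabulary,
    then normalize to unit length (the concept representation C_R)."""
    vectors = [R[w] for w in dictionary if w in R]
    if not vectors:
        raise ValueError("No dictionary words found in the embedding vocabulary.")
    c = np.sum(vectors, axis=0)
    return c / np.linalg.norm(c)

def ddr_text_vector(text, R):
    """Average and normalize the vectors of the words in a text (bag of words)."""
    vectors = [R[w] for w in text.lower().split() if w in R]
    if not vectors:
        return None
    t = np.sum(vectors, axis=0)
    return t / np.linalg.norm(t)

def ddr_similarity(dictionary, text, R):
    """Cosine similarity between the concept and text representations; because
    both are unit length, this reduces to a dot product."""
    t = ddr_text_vector(text, R)
    return None if t is None else float(np.dot(ddr_concept(dictionary, R), t))

# Example: a tiny "happiness" dictionary applied to a short text.
print(ddr_similarity(["happy", "joy"], "what a wonderful and joyful day", R))
```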
2.2.2 Understanding DDR
It's useful at this point to look at an example to see how DDR works and what information is being captured by the underlying notion of semantic similarity. In Study 2, we make use of the Moral Foundations Dictionary [41], an operationalization of Moral Foundations Theory [50], which posits five key moral domains (see Study 2 for details).

One of the dictionaries we consider is designed to specify the authority virtue domain and is composed of the words "authority", "obey", "respect", and "tradition". By creating a representation of this dictionary in DDR, we can directly examine the nearest neighbors of this category (the available code makes it easy to explore this for other words and dictionaries). When considered using distributed representations trained on the full text of Wikipedia, the top 10 nearest words we find are: "obedience", "deference", "regard", "adherence", "uphold", "govern", "obeyed", "affirm", "dignity", and "respecting". In figure 2.1 we show an expanded view of the nearest neighbors of this category.

Figure 2.1: Nearest neighbors of the authority MFD domain.

We can compare this to the results for the individual words in the category. For example, the nearest neighbors of the word "authority" alone are "jurisdiction", "government", "authorities", "responsibility", and "commission" (see figure 2.2). The nearest neighbors of "respect" are "regard", "deference", "respecting", "fairness", and "disrespect" (see figure 2.3). Each of these words has a slightly different focus than the average of the dictionary. As we would hope, the combination of words helps to clarify the concept we wish to explore.

We can see this as well with a larger, more extensively validated dictionary, the positive emotions category from LIWC. Here, the nearest neighbors of the dictionary are: "endearing", "earnestness", "heartfelt", "captivating", "youthful", "exuberant", "likable", "amiable", "carefree", and "alluring" (see figure 2.4). Once again, the dictionary representation seems to be capturing the kinds of terms we would hope to catch with this method.
Figure 2.2: Nearest neighbors of the word "authority".

Figure 2.3: Nearest neighbors of the word "respect".

Figure 2.4: Nearest neighbors of the LIWC positive emotions dictionary.

Beyond the direct application of existing dictionaries, distributed representations can also assist in the process of dictionary creation. Given a few words which a dictionary author believes to be highly relevant, looking for the most distributionally similar terms can help to spur further development: it can not only suggest new words but also indicate when a given term has alternative senses which may obscure the intended category.
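As a sketch of how the nearest-neighbor inspection above might be reproduced, gensim's similar_by_vector can be queried with a concept representation built by the ddr_concept helper from the earlier sketch; the exact output will depend on the particular embedding loaded.

```python
# Reusing the embedding R and the ddr_concept helper from the earlier sketch.
authority_virtue = ["authority", "obey", "respect", "tradition"]
concept = ddr_concept(authority_virtue, R)

# Words whose vectors lie closest to the averaged dictionary representation.
for word, score in R.similar_by_vector(concept, topn=10):
    print(f"{word}\t{score:.3f}")
```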
With these examples, it becomes easier to understand some of the differences between applying a dictionary through word counts and through DDR. In particular, we can consider this in terms of the different ways a dictionary can go wrong. Let's say we had a dictionary of 100 terms associated with depression. We could add millions of obscure scientific terms to this dictionary and, while it would destroy its face validity, it would not affect its performance when applied through word counting to a set of standard psychological interviews (assuming that none of the patients were discussing particular gene pathways and such). However, those terms would completely change the performance of this dictionary when applied through DDR, since they would completely shift the representation of the dictionary.

On the flip side, adding a single high-frequency term (such as "the" or "a") to our original dictionary would have minimal impact on its performance with DDR. That one term would barely shift the averaged dictionary representation. However, through word count, the frequency of such a common term would have a large impact on the results of that dictionary.
This points to the different strengths and weaknesses of these two methods when applying psychological dictionaries. Word count is far more sensitive to high-frequency terms, which is part of the reason we suggest preferring it when dealing with closed-class terms. DDR is more sensitive to the semantic relations among the terms in the dictionary, leading to our suggestion to prefer it with open-class terms. Further, we believe this is part of the reason why smaller dictionaries may work better with DDR, as it is easier to maintain semantic coherence in these cases.

In subsequent experiments, we both explore this comparison and focus in particular on the impact of different dictionaries and distributed representations on the overall effectiveness of DDR.
2.3 Study 1: Sentiment Analysis
One of the key uses of psychological text analysis is in the inference of the intents and attitudes which underlie statements. Language is more than just a means of expressing a collection of facts (in spite of logical positivism's [14] best efforts). The critical questions often revolve less around what was said than why it was said.

A number of the factors behind the production of a piece of text have been explored, but one of the most studied is the simplified notion of sentiment analysis [76]: is the writer or speaker positive or negative about the topic under discussion? Extensive work has been done on targets ranging from movies to presidential candidates [38, 96].
Much of this has been driven by the ease of combining the text of reviews with discrete annotations such as star ratings for movie reviews [95]. This has allowed for the simplified creation of large labeled datasets, ideal for the application of supervised learning methods. Approaches have continued to evolve, both in terms of problem formulation [16] and as the full weight of modern machine learning techniques has been brought to bear [112, 118, 74].
In this study, we focus on predicting the polarity of movie reviews. Movie reviews provide an interesting domain as they have been shown to be among the more difficult areas for sentiment analysis [121]. This is due to a number of issues including the blend of writing about the movie and writing about elements of the movie (a movie about a failed heist is not necessarily a failure), the tendency to offer separate appraisals for elements of the movie (e.g., "a wonderful performance that was more than the writing deserved"), and the range of genres (a "hilarious" comedy is good while "hilarious" might be an insult if applied to a movie attempting solemnity).
Our aim is not to compete with state-of-the-art classification methods, but rather to examine how and when distributed representations can allow us to extract a clearer signal when applying dictionaries to text. We compare DDR against word count methods, using both to generate equivalent features which we evaluate in terms of performance on a downstream polarity classification task. Within DDR, we evaluate different dictionary sizes and compare the effects of using different distributed representations.
2.3.1 Method
We make use of a dataset of 2000 movie reviews [95] introduced to explore machine learning techniques for sentiment analysis. This widely used dataset [81, 111, 73] was originally obtained from the Internet Movie Database (IMDb) archive (http://www.imdb.com/reviews/). Reviews were automatically sorted as positive or negative based on star ratings. To limit the impact of individual authors, there was a limit of 20 reviews from any single writer. The full dataset of all labeled reviews can be downloaded from the Cornell archive (http://www.cs.cornell.edu/people/pabo/movie-review-data/).

We compare two general approaches for applying dictionaries: word count and DDR. For a given dictionary and piece of text, word counting returns a value between 0 and 1, representing the normalized count of dictionary words in the document. DDR returns a value between -1 and 1, representing the distributional similarity of the dictionary to the document.
In order to keep the evaluation as consistent as possible, we standardize several factors across the experiments. All classification is done using logistic regression [55]. While not the highest performing classification method available, it has the virtue of model simplicity while maintaining sufficient power to handle issues such as differing means of independent variable values (critical for this dataset).

All evaluations are done on the full set of 2000 documents with 10-fold cross-validation. We evaluate results in terms of F1 score [101], which is calculated as the harmonic mean of precision and recall. Precision (or positive predictive value) is the ratio of true positives to total predicted positives of a classifier, while recall (or sensitivity) measures the ratio of correctly predicted positives to the total size of the class. By taking the harmonic mean of these two values, F1 balances the two.
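The following is a minimal sketch of this evaluation setup using scikit-learn; the feature matrix X and binary polarity labels y are assumed to come from one of the feature-generation steps described below.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def evaluate_features(X, y):
    """Mean F1 for a logistic regression classifier under 10-fold cross-validation."""
    scores = cross_val_score(LogisticRegression(), X, y, cv=10, scoring="f1")
    return scores.mean()
```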
The first method we evaluate is a direct application of the LIWC [99, 119] word count method and dictionaries to this dataset. In particular, we count instances of words in the positive emotions category (containing words such as "love", "nice", and "sweet") and the negative emotions category (containing words such as "hate", "ugly", and "annoyed") for each of the documents. Based on prior evaluations of psychological dictionaries [35], we chose to use the LIWC [119] positive and negative categories over other dictionaries such as PANAS-X [125]. Not only is LIWC widely used, these dictionaries have been shown to be more effective for sentiment analysis on this dataset [35].
However, while prior studies made use of the positive and negative emotion LIWC dictionaries, we wanted to confirm that this was in fact a valid choice. As such, we performed word counts for all LIWC 2007 dictionaries. We then calculated the information gain [75, 8, 27], a means of measuring the capacity of a variable to reduce uncertainty, for the results from each of the dictionaries. The positive and negative emotion dictionaries had gains of 0.0270 and 0.0170 respectively (while these values are low, in this case we care primarily about the relative informativeness of the dictionaries). The only other two LIWC categories in this range were negation, with an information gain of 0.0214, and discrepancy, with a gain of 0.0177. Given that prior work had focused on the positive and negative emotion dictionaries, we chose to focus on these categories. Negation and discrepancy seemed to be picking up the tendency of certain reviews to equivocate (e.g., "good acting but..."). While this would be an interesting phenomenon for future exploration, we felt it to be beyond the scope of the present paper.
To generate features for use in classification, we first ran the basic LIWC word count (including morphological matching) to get a total count of the words in the document and of the words from the selected dictionaries. Given this, we found the percentage of the document composed of positive and negative words and used these values as features for a logistic regression model.
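A sketch of these word count features; it assumes the LIWC dictionaries are available as plain sets of words (pattern expansion is discussed below) and uses whitespace tokenization in place of LIWC's own tokenizer.

```python
def wordcount_features(text, pos_words, neg_words):
    """Fraction of document tokens appearing in each dictionary."""
    tokens = text.lower().split()
    if not tokens:
        return [0.0, 0.0]
    pos = sum(t in pos_words for t in tokens)
    neg = sum(t in neg_words for t in tokens)
    return [pos / len(tokens), neg / len(tokens)]

# X = [wordcount_features(doc, positive_set, negative_set) for doc in reviews]
```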
With DDR, we tested several combinations of dictionaries and representations. We made use of three representations: one publicly available set (available at https://code.google.com/p/word2vec/) trained on approximately 100 billion words from Google News articles (http://news.google.com/); one trained on the full text of the English Wikipedia (https://dumps.wikimedia.org/), approximately 2.9 billion words in total; and one trained on approximately 11 million words from movie reviews (available at http://ai.stanford.edu/~amaas/data/sentiment/) beyond those in our test set.

All distributed representations were trained using Word2Vec [86], making use of the skip-gram with negative sampling model with the following parameters: window 10, negative 25, hs 0, sample 1e-4, iter 15. Given the different training sets, each distributed representation had a different vocabulary size. The Google News representations had a vocabulary of approximately 3 million words, the Wikipedia representations had a vocabulary of approximately 2 million words, and the IMDb representations had a vocabulary of approximately 45,000 words.

While the sizes of these spaces were very different, we felt that this corresponded to a common research situation. In many cases, researchers have access to large quantities of open-domain text or even pre-trained distributed representations while having access to a much smaller amount of data in their focal domain. Thus, the choice of whether to make use of general-purpose representations trained on more data or more focused representations trained on less data is salient to many real-world tasks.
The LIWC dictionaries make extensive use of pattern matching (e.g., providing root patterns rather than complete words, such as "ador*" to capture "adore", "adoration", etc.). Since DDR makes use of the representations of individual words, we first expanded the LIWC patterns against a separate corpus of movie reviews. Originally, the positive emotions dictionary contained 405 words and patterns and the negative emotions dictionary contained 500. Positive emotions expanded to 1286 full words and negative emotions expanded to 1718 words.
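One way this expansion might be carried out is sketched below, with Python's fnmatch as a stand-in for LIWC's own pattern matcher; the corpus vocabulary is assumed to be a set of lowercased word types from the separate movie review corpus.

```python
import fnmatch

def expand_patterns(patterns, vocabulary):
    """Expand LIWC-style wildcard patterns (e.g. 'ador*') into the concrete
    vocabulary words they match; plain words pass through unchanged."""
    expanded = set()
    for pattern in patterns:
        if "*" in pattern:
            expanded.update(w for w in vocabulary if fnmatch.fnmatch(w, pattern))
        else:
            expanded.add(pattern)
    return sorted(expanded)

# e.g. expand_patterns(["ador*", "nice"], vocab) -> ["adoration", "adore", ..., "nice"]
```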
For use with DDR, we compared these expanded LIWC positive and negative emotions dictionaries with a small set of representative words chosen to explore task-specific dictionary creation. For the task-specific dictionaries (which we refer to as seed dictionaries), we chose four words characteristic of positive and negative reviews. For the positive, we chose "great", "loved", "amazing", and "incredible". For the negative, we used "hated", "horrible", "crap", and "awful".

Using DDR, we combined each of the two dictionary pairs (expanded LIWC and seed) with each of the three distributed representations. We used each of these to generate two features for the movie reviews in our test set: measures of the similarity of the review to the positive and to the negative categories. This yielded six sets of features, each of which we could directly compare against the word count method.
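Putting these pieces together, a sketch of how one dictionary pair and one embedding yield the two DDR features per review, reusing the helpers from the earlier sketches; the seed lists are the ones given above.

```python
import numpy as np

seed_pos = ["great", "loved", "amazing", "incredible"]
seed_neg = ["hated", "horrible", "crap", "awful"]

# Precompute the two concept representations once per dictionary/embedding pair.
pos_concept = ddr_concept(seed_pos, R)
neg_concept = ddr_concept(seed_neg, R)

def ddr_features(text):
    """Similarity of a review to the positive and negative concepts."""
    t = ddr_text_vector(text, R)
    if t is None:
        return [0.0, 0.0]
    return [float(np.dot(pos_concept, t)), float(np.dot(neg_concept, t))]

# X = np.array([ddr_features(doc) for doc in reviews]); F1 via evaluate_features(X, y)
```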
2.3.2 Results
All results (see Table 2.1) were based on 10-fold cross-validation to minimize the effects of overfitting (especially given the relatively small test set). Reported values are averaged over 10 iterations of 10-fold cross-validation. Significance was estimated via permutation testing with 10,000 iterations over paired samples.
Model                                 Precision  Sensitivity  F1
Full LIWC Dictionary - Word count     0.657      0.659        0.658_a
Full LIWC - Wikipedia embeddings      0.659      0.649        0.654_a
Full LIWC - IMDb embeddings           0.695      0.682        0.689_b
Full LIWC - Google News embeddings    0.715      0.699        0.707_c
Seed LIWC - Wikipedia embeddings      0.665      0.654        0.660_a
Seed LIWC - IMDb embeddings           0.764      0.762        0.763_d
Seed LIWC - Google News embeddings    0.745      0.723        0.734_e

Table 2.1: Results for Study 1: performance on the 2000 document movie sentiment corpus. Subscript letters indicate significant difference from other scores, p < 0.0001, calculated using random permutation tests.
Using the LIWC dictionaries with the standard word count method yielded an F1 score of 0.658, in line with prior work considering this method [66].

For the DDR tests, when combined with the Wikipedia-derived representations, neither the LIWC nor the seed dictionaries showed a significant improvement over the word count results. However, the results were different when making use of representations trained on Google News and IMDb. When making use of the full dictionary with IMDb- and Google News-derived vectors, the DDR F1 scores improved to 0.697 and 0.707, respectively. Notably, when using the seed dictionary, the combination with IMDb and Google News representations further increased this improvement, yielding F1 scores of 0.763 and 0.734. To determine whether F1 scores were significantly different between feature sets, random permutation tests with 10,000 iterations were conducted. Specifically, for each feature set, 1,000 10-fold models were estimated and the F1 score from each fold was extracted, yielding 10,000 F1 scores for each method. Random permutation testing was then performed to test the null hypothesis that the sampled F1 scores for each method were drawn from the same population distribution. This approach is recommended because random permutation tests are robust to violations of normality, and F1 scores are known to not be normally distributed, which violates the assumptions of the dependent-samples t-test [110, 85].
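A sketch of a paired permutation test over per-fold F1 scores; the sign-flipping scheme for paired differences is an assumption about the exact procedure, which the text does not spell out.

```python
import numpy as np

def paired_permutation_test(f1_a, f1_b, iterations=10_000, seed=0):
    """Two-sided permutation test of the mean paired difference in F1 scores.
    Under the null hypothesis the sign of each paired difference is exchangeable."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(f1_a) - np.asarray(f1_b)
    observed = abs(diffs.mean())
    signs = rng.choice([-1, 1], size=(iterations, len(diffs)))
    null_means = np.abs((signs * diffs).mean(axis=1))
    return float((null_means >= observed).mean())  # permutation p-value
```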
2.3.3 Discussion
This study demonstrates how, with the proper choice of distributed representations, DDR can provide benefits both for classic, extensively validated dictionaries and for a potentially new style of dictionary generation. It should, however, be seen as augmenting rather than replacing existing best practices. While a great deal of work in the computational realm focuses on raw performance results, for social scientific research, model interpretability remains a key factor when trying to draw explanations of the underlying concepts. These results suggest that we can combine some of the benefits of both theory-driven and bottom-up methods.
In terms of dictionary generation, these results point to the potential to apply dictionary methods to novel areas of social scientific research. While the ability to develop a large, psychologically and linguistically validated dictionary remains out of the reach of most teams, it is far more feasible to find a small set of keywords corresponding to a given concept of interest. In many studies, this already takes place in the early phases of theory validation.

In conjunction with DDR, these small sets of core words are sufficient to allow for large-scale text analysis, either in support of concept validation or in an application domain as demonstrated here.
While the task-specific seed dictionary performed better in this case, this result should not be overly generalized. A set of words chosen as characteristic of a given domain should generally outperform a more broad-coverage dictionary. What is noteworthy here is less the performance difference than the fact that simplified dictionary generation worked at all. With word count methods, small dictionaries generally have too few hits to generate a viable signal on most documents (too few words from the dictionary are present in any given document). The ability of DDR to generate a clean signal with smaller dictionaries has the potential to open up a variety of tasks and theoretical constructs to text analysis.
The statistically significant improvement in the performance of the LIWC dictionaries when applied with DDR compared to word count demonstrates that these benefits can be realized with established dictionaries. This is a critical test, as it would have been just as easy to imagine this type of expansion diluting the overall value of a large dictionary rather than enhancing it.
The variation in DDR performance based on the choice of distributed representation was one of the most intriguing aspects of this study. For example, DDR with representations trained on Wikipedia performed notably worse for both seed and full LIWC dictionaries. We believe that the reason for this is that the task was focused on determining sentiment, while Wikipedia explicitly rejects the inclusion of personal sentiment or opinions in articles (editing instructions specifically require articles to be written from a "neutral point of view", https://en.wikipedia.org/wiki/Wikipedia:Neutral_point_of_view). As such, the very information critical for this task may not have been present in the distributed representation. Comparatively, the Google News and IMDb training corpora contained a larger representation of sentiment-relevant contexts. While we caution against overinterpretation of the fact that the IMDb distributed representation produced the best overall performance in this test, we do find it notable that a domain-specific distributed representation trained on a fraction of the data performed comparably to much larger representations.
It is also worth noting that while the Google News embeddings outperformed the IMDb embeddings when full-dictionary features were used, the opposite was the case for the seed-dictionary models. While we explore this issue further in Study 3, we have not yet determined a simple rule for predicting the precise interaction of a dictionary and representation with DDR.

In sum, this study provides evidence for two key findings. First, it shows that using small sets of domain-central terms to construct concept dictionaries is a viable technique when combined with DDR. Second, regardless of the generation technique, when paired with appropriate representations, DDR is able to improve the performance of dictionaries on application tasks. This offers both a new route for dictionary generation as well as a new approach to applying dictionaries to text analysis tasks.
2.4 Study 2: The Morality of Twitter
In this study, we continue comparing the application of dictionaries through DDR and word count methods in a more difficult context: the identification of moral rhetoric in social media posts. This task has a number of features that make it useful as a follow-up to the previous study. First, rather than considering the binary task of positive/negative sentiment, this task is both multi-class (we consider 10 moral categories and an additional non-moral class) and multi-label, in that a single post can include multiple moral dimensions (except for the exclusive non-moral category). These properties make the identification of moral rhetoric difficult even for human raters. Second, social media posts are far shorter and noisier than movie reviews, complicating the task of extracting a clean signal. Third, the domain of moral rhetoric lacks external annotation (such as star ratings for movie reviews) around which to construct a labeled dataset. As such, this case corresponds more closely to the common situation in psychological analysis where validation and application of an underlying theory interact with one another.
To represent moral sentiment, we make use of Moral Foundations Theory [41], which classifies moral concerns into five domains: Care/harm (sensitivity to the suffering of others), Fairness/cheating (reciprocal social interactions and the motivations to be fair and just when working together), Loyalty/betrayal (promoting in-group cooperation, sacrifice, and trust), Authority/subversion (endorsing social hierarchy), and Purity/degradation (promoting cleanliness of the soul over hedonism) [50].

These categories have been operationalized using the Moral Foundations Dictionary [MFD; 41], a collection of terms representative of the positive (virtue) and negative (vice) aspects of each foundation, which yields a total of 10 moral sentiment dimensions.
We consider moral rhetoric in the context of Twitter posts collected during the period surrounding Hurricane Sandy, specifically looking at posts calling for donations. This analysis faces a set of challenges raised by the brevity of these messages, which have a famously fixed limit of 140 characters [120].

The short length means that, compared to the article-length movie reviews analyzed in the previous study, there is very little text to work with. Given this restriction, comments often have a greater dependency on shared contexts and cultural nuance. References are terse and unexplained, jokes are brief and allusive, and short-hand is the norm. Further, the casual nature of the medium means that unconventional spelling and grammar (both intentional and not) are rampant, complicating the task of natural language processing. In short, though Twitter has been the focus of a great deal of psychological social media research, it remains an exceptionally difficult domain of analysis.

As in the previous study, we test the quality of the signal extracted with a given dictionary by comparing features derived from word count and DDR on the task of identifying moral rhetoric.
2.4.1 Method
To create this data set, we sampled 3000 Tweets from a set of 7 million posts related to Hurricane Sandy between October 23, 2012 and November 5, 2012. The raw set was filtered to exclude retweets and those lacking location information, then limited to Tweets discussing donation. Three trained coders each coded 2000 of these Tweets (each with an overlap of 1000 Tweets) on 11 dimensions: the five Moral Foundations broken down into virtues and vices, and a "non-moral" dimension, which was used to indicate that a given Tweet did not contain moral content. Excluding the non-moral dimension, all dimensions were permitted to overlap, so that a given Tweet could be coded as containing moral rhetoric relevant to multiple moral concerns. Coders were trained over multiple sessions, first being introduced to the overall MFT framework, with subsequent sessions detailing the domains and covering potential ambiguities. They were not specifically trained on the MFD.

Given the low base rate of expressions of moral sentiment, we pre-selected Tweets based on their nearness to the distributed semantic representations of each moral domain. Specifically, for a given moral dimension, we selected the 250 Tweets that loaded highest on that dimension, yielding a sample of 2500 Tweets. To ensure that non-moral Tweets would also be represented in the sample, an additional 500 Tweets were randomly selected from a subset of Tweets that, across all dimensions, had moral loadings that fell within one standard deviation of 0.
MFD category      Seed words
Authority virtue  authority, obey, respect, tradition
Authority vice    subversion, disobey, disrespect, chaos
Care virtue       kindness, compassion, nurture, empathy
Care vice         suffer, cruel, hurt, harm
Fairness virtue   fairness, equality, justice, rights
Fairness vice     cheat, fraud, unfair, injustice
Loyalty virtue    loyal, solidarity, patriot, fidelity
Loyalty vice      betray, treason, disloyal, traitor
Sanctity virtue   purity, sanctity, sacred, wholesome
Sanctity vice     impurity, depravity, degradation, unnatural

Table 2.2: Seed words selected for each of the MFD categories.
We used the three sets of independently coded data to compare the information value of different feature sets. First, we tested word counts based on the MFD dictionaries. We directly counted the words in the MFD categories and represented each Tweet as a 10-dimensional feature vector based on the normalized counts for each category.

Second, we applied DDR as in Study 1. We made use of the Google News and Wikipedia distributed representations discussed in Study 1 but replaced the IMDb-trained representation with another publicly available set trained using Twitter data as the domain-specific comparison set. This set was trained using GloVe [100] on 2 billion tweets with a resulting vocabulary size of approximately 1.2 million words (available at http://nlp.stanford.edu/projects/glove/). For all three representations, we created separate concept representations for each of the 10 MFD categories. Then, for each tweet, we calculated the distance between the tweet and the 10 concept representations to yield features for use in classification.

Third, we continued our exploration of dictionary size and distributed representation expansions. In consultation with the original MFD authors, we selected four words most representative of each of the MFD categories. These seed dictionaries were then applied as in the previous step where, for each of the 10 MFD categories, we found the concept representation of the seed dictionary and used the distance between these concept representations and each tweet to generate 10 features for use in classification. The chosen seed words are listed in full in table 2.2.
Model                    M Precision  M Sensitivity  M F1
Full MFD - word count    0.181        0.457          0.275_a
Full MFD - Google News   0.363        0.837          0.485_b
Full MFD - Wikipedia     0.294        0.758          0.405_c
Full MFD - Twitter       0.312        0.764          0.421_d
Seed MFD - Google News   0.372        0.840          0.496_e
Seed MFD - Wikipedia     0.302        0.755          0.411_f
Seed MFD - Twitter       0.305        0.763          0.415_f

Table 2.3: Results for Study 2: method performance averaged across coders and dimensions. Subscript letters indicate significant difference from other scores, p < 0.0001, calculated using random permutation tests.
As in the previous study, all classification was done using logistic regression with 10-fold cross-validation. Because the rate of positive codes within each coded dimension was unbalanced (e.g., for a given dimension, many more Tweets were coded as not containing rhetoric relevant to that dimension than were coded as containing relevant rhetoric), positive cases were upsampled by selecting cases with replacement from the lower-frequency class. Comparisons were once again made in terms of F1 score.
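A sketch of the upsampling step for a single MFD dimension; scikit-learn's resample stands in for whatever routine was actually used, and labels are assumed to be binary (1 = relevant rhetoric present).

```python
import numpy as np
from sklearn.utils import resample

def upsample_minority(X, y, seed=0):
    """Sample the minority class with replacement until the two classes balance."""
    X, y = np.asarray(X), np.asarray(y)
    minority = 1 if y.sum() < len(y) - y.sum() else 0  # positives are usually rarer
    X_min, y_min = X[y == minority], y[y == minority]
    X_maj, y_maj = X[y != minority], y[y != minority]
    X_up, y_up = resample(X_min, y_min, replace=True,
                          n_samples=len(X_maj), random_state=seed)
    return np.vstack([X_maj, X_up]), np.concatenate([y_maj, y_up])
```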
2.4.2 Results
As we made use of human-annotated data as the gold standard for this task, evaluation of inter-annotator agreement was key. Agreement was measured using Prevalence and Bias Adjusted Kappa (PABAK) [11, 108], an extension of Cohen's Kappa robust to unbalanced data sets. PABAK, which can be evaluated using the same rough guidelines as Kappa, was reasonably high for all dimensions (for the moral dimensions averaged across coder pairs, M = 0.81, SD = 0.07).
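For two coders and a dichotomous code, PABAK reduces to a simple function of observed agreement; a minimal sketch, assuming the codes are given as equal-length binary lists.

```python
def pabak(codes_a, codes_b):
    """Prevalence- and bias-adjusted kappa for two raters and binary codes:
    PABAK = 2 * observed_agreement - 1."""
    agreement = sum(a == b for a, b in zip(codes_a, codes_b)) / len(codes_a)
    return 2 * agreement - 1
```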
All classifiers were evaluated on precision, sensitivity, and F1 score for each of the 10 MFD categories. Significance of F1 differences across methods was calculated using permutation testing with 10,000 iterations.

In all of the cases we examined for this study, features derived from the application of dictionaries through distributed representations using DDR significantly outperformed features derived from word count methods. This was true across all three of the distributed representations we considered.
Of the three distributed representations explored, the best results for both the full MFD and the seed subset were generated through the combination with the Google News vectors. For these, using the full MFD yielded an F1 of 0.485 while the seed set yielded a significantly higher F1 of 0.496.

Features generated from the Wikipedia distributed representations yielded the lowest overall performance of the DDR tests, with F1 scores of 0.405 for the full MFD and 0.411 for the seed MFD. The Twitter-derived distributed representations were slightly better than the Wikipedia results, with the full MFD features yielding an F1 of 0.421, which was significantly better than either of the Wikipedia tests and significantly worse than either of the two Google News derived feature sets. The Twitter seed MFD features, with an F1 of 0.415, were not statistically different from the Wikipedia full MFD features.

While seed features significantly outperformed those derived from the full MFD when concept representations were generated in terms of the Google News and Wikipedia models, features derived from the seed dictionary were significantly worse when generated using the Twitter distributed representations.
2.4.3 Discussion
This study supports the results of Study 1: applying open-class dictionaries through DDR is able to extract a clear signal. The combination of greater conceptual and task complexity suggests that these results may generalize across a range of domains and contexts.

These results were particularly interesting in terms of applicability to short texts such as social media posts. In this case, the relatively low performance of features derived from word count methods is unsurprising. Many of the posts included no words from any of the MFD dictionaries, meaning the classifier could do no better than chance and often seemed to overfit the limited signal available. When dealing with short text fragments, the ability of DDR to generate a finer-grained measure of similarity is valuable.
This has implications for more than just social media. Even when longer texts are available, DDR allows researchers to consider components of the documents. By taking measurements at the paragraph, sentence, or even subsentential levels, DDR allows for consideration of how usage shifts over the course of a discourse or across topics.
Given the structure of the task in terms of the detection of 10 highly related concepts, it was unclear whether DDR would provide useful information or blur the dimensions. However, the results here suggest that even with these more finely structured and related categories (compared to positive/negative sentiment in the previous task), DDR is able to improve the detection of signal. This further suggests that there is in fact semantic separation among the various MFD categories.
In terms of the comparison between the full MFD and a small subset taken from each category, this study reinforces our findings on the applicability of much shorter lists of words than found in traditional psychological dictionaries. While the seed dictionaries outperformed the full MFD in two of three cases (when combined with the Google News and Wikipedia representations), the important factor is not the strong claim that dictionary authors should use smaller word lists but rather the more modest claim that they can make use of smaller lists.

Further, this highlights the value of DDR for dictionaries that are still in the developmental process. Not all dictionaries are as extensively developed or well-validated as those found in LIWC, and this study suggests the utility of DDR in precisely this case. Ideally, this would allow other researchers to explore concepts prior to extensive validation and refinement of an underlying dictionary. Given only a handful of the most salient words, DDR simplifies the process of applying and refining those concepts.
2.5 Study 3: Dictionary Selection
The previous two studies demonstrated the utility of the combination of dictionaries and distributed representations. In both studies, DDR was better able to measure the similarity between a dictionary and a piece of text than classic word count methods. However, while we observed differences in the effectiveness of particular combinations of dictionaries and distributed representations for particular tasks, the reasons for those differences were unclear.

In particular, while we considered both large, validated dictionaries and smaller, "seed" dictionaries, it was unclear how much we could generalize from the observed results. In both studies, we observed that seed dictionaries generally performed as well as or better than the full dictionaries (with the exception of the combination with the Twitter representations in Study 2). However, with so few examples, it was unclear whether this was due to the particular examples we considered and what factors would affect this.
To consider these questions about DDR, we repeated the structure of Study 1 with 5.1 million unique dictionary pairs (10.2 million total dictionaries). To generate these, we sampled from the LIWC positive and negative emotion dictionaries to generate dictionaries ranging from 2 to 900 words. We evaluated each pair of generated positive and negative dictionaries on the same movie review sentiment task as Study 1, allowing us to compare how dictionary size interacts with task performance.
With this data, we were further able to explore how the structure of the dictionaries affects their applicability to a downstream task. Since DDR works by averaging the representations of words in the dictionary to generate a concept representation, a natural starting point is to consider the similarity of the selected words. In particular, we evaluated how the clustering of dictionary words in the distributed representation affected resulting classification performance.
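To make the averaging step concrete, the following is a minimal sketch of how a DDR concept representation and similarity score can be computed, assuming `embeddings` is a Python dictionary mapping words to NumPy vectors (for example, loaded from a pre-trained Word2Vec model); the function names and the whitespace tokenization are illustrative assumptions rather than a reference implementation.

```python
import numpy as np

def concept_representation(dictionary_words, embeddings):
    """Average the vectors of all dictionary words found in the embedding vocabulary."""
    vectors = [embeddings[w] for w in dictionary_words if w in embeddings]
    return np.mean(vectors, axis=0)  # assumes at least one word is in the vocabulary

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def ddr_similarity(text, dictionary_words, embeddings):
    """Similarity between a span of text and a dictionary-defined concept."""
    concept = concept_representation(dictionary_words, embeddings)
    document = concept_representation(text.lower().split(), embeddings)  # documents are averaged the same way
    return cosine_similarity(concept, document)

# Example: similarity of a short post to a small "positive emotion" seed dictionary.
# score = ddr_similarity("what a wonderful, happy day", ["happy", "joy", "love", "wonderful"], embeddings)
```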
2.5.1 Method
In this study, we make use of the same framework as in Study 1, using DDR-derived features to predict the polarity of movie reviews [95]. We generate dictionary pairs by sampling from the LIWC positive and negative emotion dictionaries. For distributed representations, we compare the two top performing distributed representations from Study 1: the IMDb-trained representations and the Google News representations.
As one of our key questions concerned the ideal size of a dictionary, we considered a range of dictionary sizes. Given that the intersection of dictionary and distributed representation vocabulary varied, we sampled separately for each of the two distributed representations.
Dictionary pairs were generated by sampling from the intersection of the expanded LIWC positive and negative emotion dictionaries with the distributed representations. For the IMDb representations, this yielded intersections of 888 and 1203 words for the positive and negative dictionaries. For the Google News representations, the intersections were 988 and 1383 words. We sampled words without replacement from each of these two sets to create dictionary pairs of length 2, 3, ..., 10, 20, ..., 100, 200, ..., 900 for a total of 26 separate sizes (the IMDb representations did not include the 900 case as the intersection was not large enough). For each size, we generated 100,000 dictionary pairs for each of the two distributed representations, yielding 5.1 million pairs total.
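As a sketch of this sampling procedure, assuming `positive_vocab` and `negative_vocab` hold (as lists) the LIWC words that intersect a given embedding vocabulary; the names and generator structure are illustrative.

```python
import random

# 2-10, 20-100 by tens, 200-900 by hundreds: 26 sizes in total
SIZES = list(range(2, 11)) + list(range(20, 101, 10)) + list(range(200, 1000, 100))
PAIRS_PER_SIZE = 100_000

def sample_dictionary_pairs(positive_vocab, negative_vocab, sizes=SIZES, n_pairs=PAIRS_PER_SIZE):
    for size in sizes:
        if size > len(positive_vocab) or size > len(negative_vocab):
            continue  # e.g., the IMDb positive intersection (888 words) cannot support size 900
        for _ in range(n_pairs):
            positive = random.sample(positive_vocab, size)  # sampling without replacement
            negative = random.sample(negative_vocab, size)
            yield positive, negative
```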
For each of these dictionary pairs, we repeated the experiment of Study 1: first generating concept representations for each of the dictionaries using DDR, then finding the distance of documents to those concept representations, and finally using those distances as features to train a classifier for sentiment polarity. For each of the resulting feature sets, we evaluated performance using 10-fold cross validation, reporting averaged F1. Within each of the sample sizes, we compared several measures of overall performance: the mean over all samples, the mean of samples two standard deviations above the mean, and the best overall sample.
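The per-pair evaluation can be sketched as follows, reusing the `ddr_similarity` function above and scikit-learn for the classifier and cross validation; the logistic regression classifier and other specifics here are assumptions rather than the exact configuration used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def evaluate_pair(positive_dict, negative_dict, reviews, labels, embeddings):
    """Two DDR features per review (similarity to each dictionary), 10-fold CV, averaged F1."""
    X = np.array([[ddr_similarity(text, positive_dict, embeddings),
                   ddr_similarity(text, negative_dict, embeddings)] for text in reviews])
    y = np.array(labels)  # 1 = positive review, 0 = negative review
    scores = cross_val_score(LogisticRegression(), X, y, cv=10, scoring="f1")
    return scores.mean()
```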
We also used these results to explore whether variations in dictionary structure affected resulting classifier performance. In particular, we wanted to see whether dictionaries which were more semantically clustered (that is, where the words of the dictionaries were nearer to one another in the semantic space) led to better resulting performance when combined with DDR.
Towards this, we evaluated the semantic coherence of each generated dictionary. For each pair of words in the dictionary, we calculate their cosine similarity and average those together for all such pairs. This requires n(n-1)/2 calculations for a dictionary with n words. As this is O(n^2), it is tractable for any reasonably sized dictionary.
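A sketch of this coherence measure, again assuming `embeddings` maps words to NumPy vectors:

```python
from itertools import combinations

import numpy as np

def semantic_coherence(dictionary_words, embeddings):
    """Mean cosine similarity over all n(n-1)/2 word pairs in the dictionary."""
    vectors = [embeddings[w] for w in dictionary_words if w in embeddings]
    similarities = [np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
                    for a, b in combinations(vectors, 2)]
    return float(np.mean(similarities))
```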
We calculated this all-pairs similarity for each of the 10.2 million dictionaries (from the 5.1 million dictionary pairs) and used the resulting measures for each dictionary pair as features to predict the resulting F1 scores by fitting a linear regression model. We did this for each sample size, calculating an R^2 for each size.
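This regression step can be sketched as follows, where `X` holds the two coherence values for each dictionary pair at a given size and `y` holds the corresponding F1 scores (both assumed to have been collected in the evaluation loop above):

```python
from sklearn.linear_model import LinearRegression

# X: shape (n_pairs, 2) -- coherence of the positive and negative dictionary in each pair
# y: shape (n_pairs,)   -- averaged F1 from 10-fold cross validation for that pair
model = LinearRegression().fit(X, y)
r_squared = model.score(X, y)  # coefficient of determination for this sample size
```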
2.5.2 Results
We started by looking at the mean performance over all samples at each size in order to compare how dictionary size affected performance. The results can be seen in Figure 2.5 (complete results can be found in Tables 1 and 2 in the supplementary material). The mean performance asymptotically approaches the performance of the full dictionary for both distributed representations.
For these results, the large number of samples made traditional significance tests irrelevant (with 100,000 samples for each sample group, almost any difference is significant), so we focused instead on measures of effect size. In particular, we made use of Cohen's d [20] to compare the predictive power of the size of the dictionary pairs. Table 3 in the supplementary materials provides a complete pairwise comparison of the set of samples for each dictionary size. While dictionaries with similar numbers of words showed only marginal differences (mean value of 0.156 for adjacent sizes), larger gaps yielded very large effect sizes (the d between samples of size 2 and 100 was 2.950). The large numbers of samples kept the variance small (< 0.0005 for all reported values).
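For reference, Cohen's d between two samples of F1 scores can be computed as in the sketch below, which uses the pooled standard deviation; treat this as one common formulation rather than necessarily the exact variant used here.

```python
import numpy as np

def cohens_d(sample_a, sample_b):
    """Difference of means divided by the pooled standard deviation."""
    a, b = np.asarray(sample_a, dtype=float), np.asarray(sample_b, dtype=float)
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# e.g., cohens_d(f1_scores_size_100, f1_scores_size_2) corresponds to the 2.950 reported above
```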
Given that we are randomly sampling from each of the two dictionaries, this result makes intuitive sense. As we increase the sample size, those samples increasingly closely approximate the complete set of words. At smaller sizes, we're more likely to capture sets of words whose DDR representation diverges significantly from the overall concept representation.
Figure 2.5: Mean F1 of dictionary sizes.
Figure 2.6: Mean F1 of dictionaries performing 2 standard deviations above the mean.
However, our concern is not with randomly sampled dictionaries; the aim is to determine how size affects ideal performance. As such, we looked specifically at sampled dictionary pairs which performed more than 2 standard deviations above the mean for each sample size. As seen in Figure 2.6, we see a very different pattern here (raw results can be found in Tables 4 and 5 in the supplementary material). Overall performance rises quickly up to dictionaries of 30 words, subsequently falling towards the performance of the full dictionaries.
It is interesting to note the difference in performance between making use of representations trained on Google News and IMDb text. In this case, combining dictionaries with IMDb representations produces better results with smaller-sized dictionaries while doing worse with larger dictionaries. The performance difference at the large scale fits with the previous results where Google News representations performed better when using the full LIWC dictionaries (see Table 2.1 in Study 1). As the sample sizes increase, we would expect the results to increasingly closely approximate those results. However, it is unclear why the opposite was observed for smaller dictionaries.
Figure 2.7: Best performing sample at each size.
We next looked directly at the best performing dictionaries for each sample size. Although the single best sample would ordinarily be an outlier, given the large number of samples at each size it is reasonable to consider these best cases. Although there is noise in this data, as seen in Figure 2.7 (and Table 6 in the supplementary materials), the overall pattern matches that seen with the results for the +2 standard deviation case. Once again, performance increases up to dictionaries of length 30 (40 for the Google News vectors) and subsequently declines. We also see the same pattern where IMDb representations outperform at smaller sizes while declining more rapidly as dictionary size increases.
Finally, given this large stock of dictionary pairs and performance results, we considered the interaction of semantic similarity and resulting classifier performance. While we expected that semantic coherence would be positively correlated with dictionary performance, we found an almost complete lack of correlation. The maximum observed coefficient of determination was 0.209, with a mean of 0.017 across all sample sizes.
2.5.3 Discussion
This study confirms that, with DDR, smaller dictionaries can outperform larger ones. In particular, it shows that the performance of the seed dictionaries in the previous two studies was not merely due to chance or selection effects. At the same time, some of the particular findings of this study raise a number of questions for future research.
One issue we would caution against is overgeneralizing from the particular numbers in this case. While we found 30-word dictionaries to produce the best overall results on the downstream classification, there is nothing to suggest that this particular number would generalize to other applications, tasks, or dictionary types.
However, the overall structure of the results strongly supports the findings of the first two studies that smaller dictionaries are sufficient when applied through DDR. In particular, we note the case of the best performing samples at each size, where the difference in resulting performance was small for dictionaries smaller than 100 words. When combined with IMDb representations, the resulting performance ranged between F1 scores of 0.766 and 0.792, while for Google News those scores ranged between 0.737 and 0.759.
This suggests that, for a well-designed dictionary, length is not a critical factor. Dictionary authors should feel free to incorporate as many or as few words as are necessary to get at the desired theoretical construct without undue focus on size (unless sampling at random, which does not seem to be a common technique for dictionary generation). This is a style of dictionary which is only possible when applied through DDR, as word count methods depend on linguistic coverage.
We were somewhat surprised that, for these examples, dictionaries of size 30 produced the best overall results. This was larger than we'd expected given our prior work. However, as noted, we don't believe that this particular number can be applied without validation to other combinations of domain and dictionaries. As before, given the small differences in performance, we feel that concept validity should be the guide rather than any other optimality criterion.
Also surprising was the lack of correlation between semantic coherence and resulting performance. While our prediction had been that more semantically similar dictionaries would perform better, we found no correlation. In hindsight, this makes sense given the experiment and DDR's structure. As we sampled from coherent categories, the particular sampling choices capture different facets of a single concept rather than separate concepts. In terms of DDR, what matters is the generated concept representation, not the particular words used to generate it. Two dictionaries might have completely different words and structures yet generate similar concept representations. This points to a number of possibilities in terms of alternatives to single points for representing dictionary concepts. We leave this for future work.
2.6 Conclusions and Future Work
Concept dictionaries have served as one of the major tools for theory-driven text analysis, producing impressive results across a wide range of problems and tasks. But, in spite of these successes, challenges remain. First, the difficulty of developing and validating broad-coverage sets of words has meant that not all teams have had the skills and resources to build dictionaries in their domains. Second, existing techniques have struggled to apply these dictionaries to shorter texts (such as social media posts).
At the same time, while statistical natural language processing has been going through a period of explosive growth, many of its tools have not been adopted by social scientists due to their atheoretic nature. For many social scientific applications, better classifier performance doesn't help if it can't be related to an underlying model.
In this paper, we introduce DDR, a method designed to bridge the gap between these approaches. Data driven yet conceptually seeded, DDR combines the strengths of statistical methods with the theory-driven structure of conceptual dictionaries. By generating a distributed concept representation based on the words in a dictionary, it provides a continuous measure of similarity between a concept and any other word, phrase, or document. This provides a range of benefits for both dictionary authorship and application.
Here, we would like to emphasize that we do not view DDR as a replacement for word count methods. Whenever the object of inquiry is a closed set of words (that is, a fixed set of terms which completely cover a category), word count methods remain more appropriate. For example, many linguistic categories such as pronouns, articles, and conjunctions are composed of a relatively fixed set of terms. The study of these closed classes has proven to be extremely rich for both linguistic and psychological research [98]. The notion of similarity on which DDR is based makes little sense in these contexts.
For open class terms, the situation is more complicated. For short texts, the decision is clear in favor of DDR. There simply isn't enough context for word count methods to extract a clear signal from most dictionaries. For longer documents, the choice is less clear-cut. While DDR is once again effective in this case, word count's applicability depends on the dictionary structure. If the dictionary includes terms which are sufficiently high frequency in the chosen domain, word count may still be used. In general, though, it is safe to use word counts with closed class terms (or in domains where the relevant words can be completely enumerated) and DDR with open class dictionaries, especially when dealing with shorter documents. DDR doesn't replace existing methods, but rather augments them with a new set of tools.
In terms of dictionary authorship, DDR helps with several major challenges. First, it allows authors to focus on the conceptual core of a category rather than attempting to determine all possible words which might be associated with that category. Given the breadth and dynamic nature of language, a complete enumeration will generally be infeasible. However, with DDR, authors can focus on the key elements that define a category, making use of semantic rather than morphological similarity to find related terms.
Second, DDR allows for limited domain adaptation through the choice of distributed representation. This opens up the potential for a given concept to be more easily explored across a range of domains and application settings. Further, it allows researchers to make use of representations trained on smaller quantities of domain-specific text, allowing for even more focused adaptation.
Third, DDR provides a means for dictionary authors to directly explore the structure of the dictionaries they are creating. By looking at the relations of candidate dictionary words in the context of a range of distributed representations, dictionary authors have another tool for evaluating and validating their work. The combination of DDR and measurements of semantic coherence allows them to rapidly evaluate the impact of changes to their dictionaries and how the words they have selected hold together. None of this is a replacement for existing psycholinguistic or behavioral validations; however, the results shown here suggest that DDR can be a valuable addition to researchers' toolkits.
Further, DDR offers new scope and application for existing dictionaries. In both Study 1 and Study 2, we demonstrated how DDR improved the performance of well-validated dictionaries on real-world applications. We believe this blend of making existing dictionaries more useful while greatly easing the task of generating new dictionaries to be a powerful combination.
Study 2 in particular demonstrated some of these advantages of DDR. One facet of this comes
from DDR returning a smoother measure of similarity between texts and dictionaries. Few of
the social media posts contained any of the words in the original dictionaries, and so would
have been beyond the scope of traditional word-count analysis. However, by applying those
dictionaries through DDR, it is possible to generate a smooth measure of similarity between posts
and dictionaries, even when there is no word-level overlap between the two. As an increasing
amount of social interaction is captured by precisely these sorts of short texts (whether on Twitter,
Facebook, or any of the various chat applications), this capability will be increasingly valuable.
Study 1 points to the ability of DDR to provide a simplified version of domain adaptation. By applying a simplified dictionary to a distributed representation trained on the target domain, we were able to get better results than with combinations of both larger dictionaries and more extensive distributed representations trained on generic domains. While this is by no means a complete solution to these challenges, it at least provides a small step, and the tools made possible by these measures of distributional similarity may help in analyzing these challenges going forward.
A large number of avenues remain for future work. While we have shown a number of in-
triguing results in terms of particular combinations of dictionaries and representations, we are far
from establishing a general rule for which representation will be most appropriate. In fact, the
comparison of these strengths and weaknesses for particular conceptual domains may prove to
yield a useful window on the underlying structures of those representations.
Additionally, while we have made use of the simplifying assumption that the structure of
a concept can be approximated by distance to a single point in semantic space, there is room
for further exploration. It remains for future work to determine how more complex models of
conceptual structure could provide better mechanisms for evaluation and application.
Nonetheless, DDR provides a new level of flexibility and applicability for theory-driven text analysis. Combining distributed representations and dictionaries, this method makes it possible to leverage the strengths of both. Critically, it does so in a way that takes advantage of existing work. DDR doesn't make current dictionaries obsolete; rather, it improves their performance and expands their applicability. It doesn't attempt to restructure distributed representations, but rather leverages their strengths to explore theory-driven constructs. In providing a bridge between these two approaches, we hope that it will serve to enrich both.
Chapter 3
Bridging the Gap Between Continuous Representations
and Categorical Features
3.1 Introduction
Interpreting the intent of a statement depends on its context. Given the sentence "I want more", we need to know who is speaking to even identify the "I" and something about the situation to know what "more" they might want. This subjectivity and context-sensitivity is widespread throughout language, where we often care not only about what was said, but also why it was said. Topics such as humor [29], sarcasm [12], or political persuasion [19] highlight this interaction between prior knowledge of the speaker and interpretation of an utterance.
For psychological modeling, these contextual differences have been the focus, attempting to understand language at the intersection of intent, action, and response. As such, psychological efforts to convert text into features have focused on subjective encoding and theory-driven measures. Part of this has involved directly hand-coded features. For example, clinical transcripts might be coded with relevant measures such as affect, avoidance, or coping strategies. Automated psychological text analysis has made extensive use of domain dictionaries (expert-generated lists of words characteristic of a particular construct), applied either through word counts [99] or semantic similarity [37]. While these methods have allowed for fine-grained situational analyses, expert attention doesn't easily scale to larger amounts of data, and concept dictionaries will at best capture pre-selected slices of semantic and pragmatic intent.
Computational methods have approached the issue of text understanding from the other direction: learning linguistic representations based on large-scale statistical regularities in language rather than local features. Representations are learned based on large quantities of text, generally chosen to be domain-neutral. The current dominant approach uses neural networks to learn a mapping function for converting raw text to a continuous representation [69, 94]. For learning to predict a particular aspect of a piece of text such as sentiment [96], the standard tactic is to collect a large corpus of text fragments with appropriate labels and use those to learn a mapping between text and label. While these methods have proved remarkably successful given large quantities of data [51], they predict the average response rather than individual responses.
While both the computational and social scientific approaches capture important aspects of language, neither is sufficient by itself. Representations trained on large amounts of linguistic data capture an element of linguistic background knowledge which is critical, as we don't learn language from scratch with each new interaction. But this background is insufficient, as we also make use of local cues and knowledge of the speaker to interpret utterances. Based on the identity of the speaker, we distinguish between terms of endearment and insults, identify jokes, and decide how much to trust a given claim.
Given the importance of both facets, increasingly the two approaches are being brought together. Psychological text analysis has rapidly expanded to make use of state-of-the-art representations and techniques. On the computational side, a number of approaches have been explored to include information about sources into the learning of continuous representations. For example, representations learned with demographic information have been shown to improve classification performance on a range of tasks [56, 62], particularly in venues such as social media [57] where group-based linguistic variations can be extreme. In particular, word representations learned on demographically split data [5, 36] have proved useful for community-specific classification in areas such as sentiment analysis [128].
However, these studies have relied on the availability of sufficient data to train domain-specific representations, often on the order of millions of documents. Here we consider situations where we lack enough data to train domain-specific representations but have prior theoretical or experimental reasons to explore particular contextual factors. Under those conditions, we consider three options. The first is simply to ignore the context. This is the present computational default and provides a useful baseline for comparison. The second is to split the data based on the variable of interest and train separate models. The third is to explicitly combine the continuous text representations with categorical background information in a single model.
We compare the relative merits of these approaches on the task of modeling speaker intent in political debate. Prior work has shown this domain to highlight the interaction between message, speaker, and audience [44, 40]. For example, strong partisans are more likely to accept or reject a claim based on whether the source shares their affiliation [19, 64]. These tensions are highlighted in debates, where the goal is explicitly to persuade listeners, but the target varies widely as politicians simultaneously attempt to attract undecided voters, motivate their own supporters, and demotivate the other side's voters [60]. Emotional arguments [79] can be difficult to parse without extensive prior knowledge, and some statements are even deliberately obscured so that only groups who already agree with a stance will even realize that a position is being taken [77]. As such, contextual clues are essential for interpreting the intent of a given utterance.
To explore these questions, we annotated US general election presidential debates from the years 1960-2016 on five factors: attack, policy, value, group, and personal statements (for examples see Table 3.1). The result is a dataset which combines a difficult baseline task (intent recognition) with a situation where prior work suggests that background information about the identity of the speaker should be important. We use this to explore how this prior knowledge can be combined with learned text representations to improve the overall model structure and better model listener interpretation.
Annotation      Example
Attack          "The man is practicing fuzzy math again." (Bush, 2000)
Policy          "I'm going to invest in homeland security, and I'm going to make sure we're not cutting COPS programs..." (Kerry, 2004)
Value           "What is more moral than peace?" (Ford, 1976)
In(out)-group   "I fully recognize I'm not of Washington." (Bush, 2000)
Personal        "My wife Tipper, who is here, actually went on a military plane with General Sholicatchvieli..." (Gore, 2000)
Table 3.1: Examples from each of the annotation categories.
We demonstrate that while it is possible to model intent without knowledge about the speaker, including this information can significantly improve those models. Further, we show that the overall impact of this information on model complexity is minimal for both learning and inference. Finally, we discuss how this method can be applied to other situations, allowing for more general incorporation of continuous representations with categorical factors.
3.2 Study 1: Identifying Intent of Political Speech
In this study, our aim was to evaluate the potential for modeling identification of the intents of politicians engaged in political debates based solely on linguistic information: text alone, without background knowledge. This served first to determine whether or not there was sufficient regularity in this data to model at all. Second, it established a baseline for further experiments, providing a strong comparison for the following studies. Finally, it allowed us to compare alternative modeling approaches and build off of the strongest of these. We begin by introducing the dataset used for this and the subsequent studies.
3.2.1 Dataset: Political Debates
Given that prior political science research suggests that political debate does not even approx-
imate an ideal deliberative structure [84], we aimed to explore these texts in terms of persuasive
argumentation, including argument types which are conventionally dismissed as fallacies [79].
Persuasive arguments are particularly relevant here as context is inherent to their structure, in
particular the identity of the speaker. We focused on two types of arguments which are increas-
ingly considered central to political choices and highlight the interaction between linguistic and
background context in terms of language understanding: values and group identities.
The first axis we considered was the gap between personal and group identity claims. This highlights the interaction between text and context in that understanding these arguments depends on a mix of situation and prior knowledge. While, traditionally, candidates were thought of as "selling" themselves to an evaluative electorate, recent work suggests that far more decision making occurs in terms of group identity [23, 15]. Politicians either suggest that they are a part of a favored group or that their opponents are part of a disfavored one [117]. In US electoral politics, claiming that someone is part of the "Washington elite" or doesn't understand "real Americans" has fallen beneath the level of cliche. This has become particularly salient in light of recent elections around the world where racial, economic, and national identities have played critical roles. For the present study, this is useful in that context and background are inherent to understanding when a group identity is being referenced.
The other axis we considered was the gap between discussions of policies and the values that underpin those policies. This division underscores many theories of rational voting wherein individuals may lack the expertise to evaluate specific policies but can more easily determine a preferred value orientation [105, 17]. The choice to focus on policy details or broad values provides important cues to how a politician is framing a given debate [59]. Given that recent work suggests that political groups can fail to understand one another when discussing values [41], incorporating context provides valuable information for identifying these arguments. Knowledge of the political affiliation of a given speaker offers valuable clues as to how value arguments are being deployed and allows for far finer interpretation of ambiguous statements.
In order to evaluate these axes, US general election presidential debates from the years 1960-2016 were annotated by seven undergraduate students, each trained in the coding scheme and given an initial trial annotation. Each debate was assigned to at least two annotators to allow for agreement evaluation. The final set covers 20 debates with a total of 43 annotations, yielding paired annotations for 4670 debate turns with 29,395 sentences and 292k words.
Utterances were annotated based on the presence or absence of five discourse strategies commonly used in debates: attacking one's opponent, description of policy, statement of values, group identity claims, and discussion of personal characteristics. A single utterance could be tagged with multiple labels, and examples of all label combinations were observed. This allows us to consider the two dimensions described above while comparing whether statements on these dimensions are used to support a candidate or attack their opponent. An utterance could be left unlabeled if none of the categories were present. Examples of each of the categories are provided in Table 3.1, and the complete coding guide is included in the supplemental materials.
Argument structures are generally difficult to annotate [91], and this is particularly true with these types of persuasive arguments. This task combines the difficulty of highly subjective tasks such as sarcasm detection [6] with the unique requirements of historical text annotation. Annotators are asked to label arguments from periods with which they may not be familiar. One of our motivating considerations was that, due to the nature of these arguments, different annotators would pick up on different facets of a statement. As such, we considered inter-annotator agreement on both positive and negative cases. That is, we considered and evaluated two versions of the data, one where a case was considered a true positive only if both annotators agreed (conventional inter-annotator agreement) and the other where a case was considered a true negative only if both annotators agreed (effectively taking the union of positive annotations). All experiments are reported in terms of both interpretations of the data.
Given the ambiguity previously discussed in political rhetoric, we focused on the turn level. This yielded paired annotations for a total of 4670 turns, with 1453 of those from the candidates (as opposed to moderators or audience questions). The total counts for each label are reported in Table 3.2.
Category Count
Attack 1154
Policy 984
Value 840
Personal 922
Table 3.2: Candidate turn-level label counts for the union of positive annotations.
Category Kappa Count
Attack 0.49 822
Policy 0.64 733
Value 0.46 450
Personal 0.43 504
Table 3.3: Turn-level agreement scores and counts for positive agreement cases.
While we feel that annotator union is in many ways the more natural approach for this data, it is nonetheless valuable to consider the results in terms of traditional inter-annotator agreement. The agreement for the candidate turns is shown in Table 3.3.
The kappas range from 0.43 to 0.64, generally considered to signify "moderate" to "significant" agreement [124]. Given the inherent subjectivity of the task, this is well within the expected range. Due to significantly lower agreement on the in-group category (kappa = 0.21), we excluded it from further analysis.
3.2.2 Methods
We compare four possibilities for automatically classifying these labels: support vector machines (SVM), per-class neural network classifiers, a thresholded softmax neural model, and a multi-task neural network. Each of these represents a different approach to the problem and considers the task in a slightly different way. The SVM allows us to consider a traditional bag-of-words approach, while the neural models compare alternative approaches to handling the multi-class nature of the task. The per-class network treats the labels as fully independent. The thresholded softmax treats the labels as intrinsically correlated. And the multi-task network treats the labels as related but pushes the correlations into a common representation. These are explained in detail below.
All models were evaluated with leave-one-out cross validation on the annotated debates. For each of the 18 annotation pairs, we separately train a model based on the other 17 (with a random 90/10 train/validation split for the learning phase) and evaluate on the held-out debate. We report averaged results, with significance between models evaluated by the bootstrapped variant of Yuen's method [129], a two-sample trimmed mean test which accounts for potentially unequal variances [80].
As a baseline, we trained separate SVM [115] classifiers for each of the five labels. We considered as features the tf-idf [2] adjusted frequencies of terms in each turn. Given the diversity of the topics discussed, we wanted to clearly evaluate how much of the signal could be captured with a simple bag-of-words approach.
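A minimal scikit-learn sketch of this baseline (one binary classifier per label; the specific classes and parameters here are assumptions rather than the exact configuration used):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def train_label_classifier(turn_texts, binary_labels):
    """Bag-of-words baseline: tf-idf features over debate turns, linear SVM for one label."""
    model = make_pipeline(TfidfVectorizer(lowercase=True), LinearSVC())
    model.fit(turn_texts, binary_labels)
    return model

# One such model is trained for each annotation category, e.g.:
# attack_model = train_label_classifier(train_turns, train_attack_labels)
```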
Our second approach continues to treat the labels separately, but shifts to a recurrent neural network model for generating text representations. The general process of these models is to take a series of words (for example, sentences or paragraphs) and pass them one at a time through a neural network which stores an evolving representation as the words are processed. The final output of the network is treated as the representation of the span of text.
In order to pass words into such a network, they need to first be converted into continuous representations. Learning these representations is difficult, as current methods largely depend on distributional similarity, which requires large quantities of training data in order to learn high-quality representations. Given this, we pre-trained word embeddings using the Word2Vec model [88]. As a source of training data, we made use of the full text of the English Wikipedia (http://wikipedia.org/), as this resource included the historical information to cover the periods spanned by the debates.
Within this general architecture, there are multiple choices of how to structure such a recurrent network. The particular architecture we made use of relies on Long Short-Term Memory (LSTM) units [54]. These have been found to better capture the long-ranging dependencies often observed in linguistic information, especially over multiple sentence spans [114, 61]. The final output of this
network was a 300-dimensional latent representation of the debate turn. This turn representation was passed to a standard feed-forward neural network [116] consisting of a 100-node dense layer and a final binary output layer. All neural network models made use of rectified linear units [92] as non-linear activation functions, were optimized using the Adam algorithm [68], and were trained over 30 epochs.
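As a sketch of this per-label network in Keras (layer sizes follow the description above; the sequence length, embedding-matrix handling, and loss are assumptions):

```python
from tensorflow.keras.layers import Dense, Embedding, Input, LSTM
from tensorflow.keras.models import Model

MAX_LEN = 200    # assumed maximum turn length in tokens
EMBED_DIM = 300  # Word2Vec vectors pre-trained on Wikipedia

def build_per_label_model(vocab_size, embedding_matrix):
    words = Input(shape=(MAX_LEN,))
    embedded = Embedding(vocab_size, EMBED_DIM, weights=[embedding_matrix],
                         trainable=False)(words)
    turn_repr = LSTM(300)(embedded)                    # 300-dimensional turn representation
    hidden = Dense(100, activation="relu")(turn_repr)  # 100-node dense layer
    output = Dense(1, activation="sigmoid")(hidden)    # binary prediction for a single label
    model = Model(words, output)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

# One such model is trained per label for 30 epochs.
```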
The third approach is motivated by the fact that we allow multiple labels for a single turn and consists of modifying the standard softmax [103] approach to multi-class classification. The softmax works by normalizing network output values associated with multiple classes to yield a distribution over those classes. We begin with the same network structure; however, rather than taking the maximum value as a fixed prediction, we instead apply a cutoff value where any labels over the cutoff are treated as positive predictions. This cutoff value was treated as a separate hyperparameter and selected based on grid search between 0.1 and 0.5. We report results for the best average value: 0.2. The overall network structure was similar to the first model with the exception that the final binary output layer was replaced with a 5-class softmax output.
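The thresholded variant only changes the output layer and the decision rule, roughly as in the following sketch (the 0.2 cutoff is the value reported above):

```python
import numpy as np

def thresholded_predictions(softmax_probs, cutoff=0.2):
    """Treat every label whose softmax probability clears the cutoff as a positive prediction.

    softmax_probs: array of shape (n_turns, 5), the output of the 5-class softmax layer
    that replaces the binary output layer of the previous model.
    """
    return (np.asarray(softmax_probs) >= cutoff).astype(int)
```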
Our final approach handled the multi-label nature of this task directly as a multi-task learning problem. Here, rather than building separate models for each class, we trained a single network to predict the complete set of labels for a given turn. We make use of the same recurrent network described above for generating turn representations. However, rather than having that representation passed to a single prediction network, it passes to five separate prediction networks (one for each of the five labels) as shown in Figure 3.1. This keeps the predictions separate while allowing information from each of the five categories to pass back through the network and jointly influence the structure of the representations.
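A Keras sketch of this multi-task variant, with one shared LSTM encoder feeding five small prediction heads (sizes mirror the per-label network; the details are assumptions):

```python
from tensorflow.keras.layers import Dense, Embedding, Input, LSTM
from tensorflow.keras.models import Model

LABELS = ["attack", "policy", "value", "group", "personal"]

def build_multitask_model(vocab_size, embedding_matrix, max_len=200):
    words = Input(shape=(max_len,))
    embedded = Embedding(vocab_size, 300, weights=[embedding_matrix], trainable=False)(words)
    turn_repr = LSTM(300)(embedded)  # shared turn representation
    outputs = [Dense(1, activation="sigmoid", name=label)(
                   Dense(100, activation="relu")(turn_repr))
               for label in LABELS]  # one small prediction head per label
    model = Model(words, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy")  # same loss applied to every head
    return model
```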
3.2.3 Results and Discussion
As expected, F1 scores are lower when considering positive examples to be those where annotators agreed on the presence of a label (i.e. the intersection of the annotations by multiple annotators, which reduces the number of positive examples for each category), and substantially higher when considering positive examples to be those where any of the annotators assigned a label (i.e. the union of the annotations).

Figure 3.1: The multi-task network structure.

            SVM     Sep     Thr     Mul
Attack      0.49a   0.31b   0.60c   0.61c
Policy      0.72a   0.72a   0.53b   0.72a
Value       0.37a   0.47b   0.35c   0.54d
Personal    0.32a   0.49b   0.30a   0.47c
Table 3.4: F1 scores for the positive annotation case for the intersection interpretation of the data: linear SVM with bag-of-words features, per-class LSTM (Sep), thresholded softmax (Thr), and multitask network (Mul), averaged over leave-one-out cross validation over the full set of debates. Subscript letters indicate significant difference from other scores.

            Per-label   Multitask
Attack      0.76a       0.78b
Policy      0.86a       0.86a
Value       0.78a       0.79b
Personal    0.79a       0.79a
Table 3.5: F1 scores for the positive annotation case for the union of annotations, averaged over leave-one-out cross validation for the full set of debates. Subscript letters indicate significant difference from other scores.
While the overall performance was lower with annotation intersection, it was in the range of human annotator agreement. We generally found these results to be driven by low recall. The results were significantly better when considering the union of annotations rather than their intersection, as seen in Table 3.5. In many ways, these results mirrored those of the annotation task, where the labels humans struggled with were also those which proved most difficult to classify.
As seen in Tables 3.4 and 3.5, on the "attack" and "value" labels, the multitask network significantly outperformed all other models for both data interpretations. For "personal" statements, the separated model did significantly better on the intersection interpretation, although there was no significant difference on the union interpretation. For both the intersection and union data interpretations there was no significant difference in the results on the "policy" label between the SVM, separated, and multi-task models. The thresholded softmax model did significantly worse on all categories and data interpretations.
We were interested in why the SVM was able to predict the policy label better than other labels. In our analysis, the policy label was most amenable to a bag-of-words approach, as the vocabulary here differed most sharply from any of the other labels (discussions of "legislation", "bills", etc.). Since this method treated all words as separate features, it was able to take advantage of the unique vocabulary in these cases.
While the multitask network provided an indirect method of learning correlations through the lower layer representations, one area for future work is to explore whether more use could be made of the between-label statistics. While this would complicate the overall model, it could serve to better capture potential interaction effects. However, among the approaches explored, the multitask network not only had the best overall performance, it was also approximately five times faster to train than the per-label LSTM models. While not critical here, this would become increasingly important on tasks with larger numbers of labels. Given that it was the best performing model in both cases, we focused further experiments on the multi-task model. Naturally, in the case with a single label or dependent variable, this would be equivalent to the single-task network. And, while many psychological modeling tasks focus on a single dependent variable, this capacity allows for more flexible applications such as correlated predictions or multi-level modeling.
These results demonstrated both that there was sufficient regularity for modeling and that the task was sufficiently difficult that text representations alone were insufficient to completely model speaker intent.
3.3 Study 2: Split Data by Party
Study 1 only considered the raw text of the debate turns. However, prior work has suggested a wide range of background information as critical to understanding political rhetoric. In this experiment, we begin with the most widely studied of these: the gap between political parties. Differences in the issue profiles and rhetorical strategies of the two primary US political parties have been extensively studied [53]. (Although the period covered included two independent candidates, Anderson in 1980 and Perot in 1992, we focus on the two primary parties due to the small amount of independent-candidate data.) Prior to incorporating this difference directly into a model, we wanted to consider the possibility of training completely separate models based on this divide. This involved splitting the data based on the party of the speaker and training two separate models. Evaluating these models both within and across parties allowed us to consider both the differences from the general models of Study 1 as well as potential differences in generalizability based on the restricted training sets.
3.3.1 Method
We use the multitask network from Study 1 with the previously described settings. Here, we split the available data by party, training separate models on either Republican or Democratic data. We evaluate each model both against the same party it was trained on and against the other party.
Test        All     Republican          Democrat
Train       All     R        D          R        D
Attack      0.61    0.58a    0.46b      0.43a    0.71b
Policy      0.72    0.72a    0.43b      0.57a    0.68b
Value       0.54    0.46a    0.19b      0.25a    0.52b
Personal    0.47    0.46a    0.27b      0.32a    0.45b
Table 3.6: F1 scores for the multitask model trained and tested on Republicans (R), Democrats (D), or the complete training set (All). Results based on inter-annotator agreement. Subscript letters indicate significant difference from other scores for models applied to the same test data.
Test        All     Republican          Democrat
Train       All     R        D          R        D
Attack      0.78    0.82a    0.76b      0.75a    0.88b
Policy      0.86    0.88a    0.86b      0.89a    0.92b
Value       0.79    0.75a    0.74a      0.76a    0.85b
Personal    0.79    0.83a    0.72b      0.79a    0.83b
Table 3.7: F1 scores for the multitask model trained and tested on Republicans (R), Democrats (D), or the complete training set (All). Results based on union of annotations. Subscript letters indicate significant difference from other scores for models applied to the same test data.
For the same-party case, we evaluate based on leave-one-out cross validation as in the previous experiment. When testing on the alternate party, we train on the complete available dataset and average test results over the available debates. Results are reported for both interpretations of inter-annotator aggregation. Significance between models is evaluated by the bootstrapped variant of Yuen's method [129], with comparisons reported between models tested on the same data.
3.3.2 Results and Discussion
While we expected the intra-party models to do well, we anticipated that they would be hurt by
the fact that they had access to less than half the training data of the complete model. However,
we found that, when trained and tested on the same group, they generally did as well as the
complete model and in many cases better. This was particularly true when considering the union
data interpretation.
Most dramatic was the attack label, where the model trained and tested on Democrats had an absolute improvement of 10 points on both the annotator intersection and union data (from 61% to 71% and from 78% to 88%, respectively). Equally interesting was how much worse these models did when trained on one party and tested on the other. All differences were significant within test groups with the exception of the "value" label when testing on Republicans with the union data interpretation. In this case, there was no significant difference regardless of the source of the training data.
As noted, the differences between test sets make an absolute comparison across groups impossible; however, we can draw some interesting inferences from the relative results. The two cases where the combined model outperformed all of the split data models when using intersection data were value arguments and personal statements. Of the two, personal statements seem natural: it is unsurprising that politicians talking about their own charms sound somewhat similar regardless of party. However, value statements were the category where we had expected the most dramatic differences between parties. For the union data, the only case where a model did worse than the baseline when trained and tested on the same party was for Republicans on the "value" category. While this corresponds with prior work examining a flip in many Republican positions over this period [1], the precise reasons for this observation will require follow-up with more targeted experiments. Overall, these results suggest that, when sufficient data is available, training separate models can be a viable way of handling context.
3.4 Study 3: Split by Incumbent/Challenger
Here, we follow the same method described above but apply it to incumbents and challengers. Given the results in Study 2, we wished to evaluate this approach to incorporating context on a different division to be able to compare the results. The choice of incumbency was a result of both prior work and our qualitative analysis, where we observed that challengers seemed to attack more than incumbents. As such, we hoped to test both general differences and were particularly interested in whether the attack category would be better represented by models split along this axis. For each year, the candidate from the party in power is treated as the incumbent. In the included years, this was either the current president or vice president.
Test        All     Incumbent           Challenger
Train       All     I        C          I        C
Attack      0.61    0.55a    0.49b      0.42a    0.62b
Policy      0.72    0.61a    0.69b      0.60a    0.70b
Value       0.54    0.56a    0.35b      0.40a    0.44b
Personal    0.47    0.36a    0.29b      0.25a    0.52b
Table 3.8: F1 scores for the multitask model trained and tested on incumbents (I), challengers (C), or the complete training set (All). Results based on inter-annotator agreement. Subscript letters indicate significant difference from other scores for models applied to the same test data.
Test        All     Incumbent           Challenger
Train       All     I        C          I        C
Attack      0.78    0.82a    0.77b      0.75a    0.83b
Policy      0.86    0.90a    0.90a      0.85a    0.89b
Value       0.79    0.81a    0.77b      0.73a    0.78b
Personal    0.79    0.81a    0.79b      0.72a    0.82b
Table 3.9: F1 scores for the multitask model trained and tested on incumbents (I), challengers (C), or the complete training set (All). Results based on union of annotations. Subscript letters indicate significant difference from other scores for models applied to the same test data.
As in Study 2, we evaluated models both on the same group they were trained on (with leave-one-out cross validation) and on the other group.
3.4.1 Results and Discussion
As seen in Table 3.8 for the agreement data and Table 3.9 for the union data, the results show that splitting data along these lines can significantly improve our ability to model speaker intent. In almost all cases when testing on the same data (either for incumbents or challengers), we find that the model trained on the same group significantly outperforms its counterpart trained on the opposite group. The single exception was policy statements of incumbents, where for the intersection data the model trained on challengers significantly outperformed the incumbent model, and for the union data there was no significant difference between training on incumbents or challengers.
As in the party case, it appears that this piece of background knowledge provided valuable additional information to the models. For this split, we were particularly interested in the attack category, as it seemed the most likely to show a difference on this axis. Not only was the performance significantly improved when evaluating models trained and tested on the same group, the relative performance of this category compared with the base model showed better results than any other category (note that since these models were evaluated on different test sets, a direct comparison is impossible). However, without targeted follow-up studies, these results are suggestive rather than conclusive in terms of causal processes. For example, some of the observed difference may be that Democrats were over-represented in the challenger set and, as seen in Study 2, their attacks proved far easier to classify.
Further, we do see the same pattern observed in the previous experiment where performance
declines dramatically on the cross-group evaluation. The inability of a model trained on one group
to predict the behavior of the other is another indication that these splits are meaningful and are
removing otherwise unmodeled variance from the models.
3.5 Study 4: Side Information Model
The previous two experiments fit with prior work and strongly suggested that useful information is contained in these categories. Naturally, the next question was how these divisions would interact. For example, do Republican challengers behave differently from Democratic challengers? However, splitting on both party and incumbency left us with insufficient data to train on. More generally, splitting training/testing data by category is neither scalable nor extensible. Given this, we moved on to evaluate the potential of incorporating these contextual factors directly into a single model.
3.5.1 Method
We extend the multi-task network described in Study 1. The overall architecture is largely similar to the one described there. Turn representations are generated using the same recurrent neural network structure as described before, using the same set of word embeddings pre-trained on Wikipedia data. However, rather than passing this text representation directly to the prediction sub-networks, we first combine it with categorical information about the speaker in the form of their party identification and incumbency status.

Figure 3.2: The multi-task model with party and incumbency information.

            Base    +P      +I      +PI
Attack      0.61a   0.64b   0.63b   0.66c
Policy      0.72a   0.74b   0.71a   0.75c
Value       0.54a   0.54a   0.52b   0.55c
Personal    0.47a   0.49b   0.53c   0.49b
Table 3.10: F1 scores for the baseline multitask model, the model with party information added (+P), incumbency information added (+I), and both added (+PI). Data generated based on inter-annotator agreement. Subscript letters indicate significant difference from other scores for models applied to the same test data.

            Base    +P      +I      +PI
Attack      0.78a   0.81b   0.80b   0.80b
Policy      0.86a   0.87a   0.88b   0.87a
Value       0.79a   0.77b   0.78c   0.79a
Personal    0.79a   0.80b   0.81c   0.80b
Table 3.11: F1 scores for the baseline multitask model, the model with party information added (+P), incumbency information added (+I), and both added (+PI). Data generated based on union of annotations. Subscript letters indicate significant difference from other scores for models applied to the same test data.
Given the nature of this information, we convert party and incumbency into one-hot vectors, where each categorical value has a separate "slot" in the vector, with the actual value set to one and all other values set to zero. To combine these values with the text representations, we concatenate the resulting one-hot vectors with the turn representation, such that rather than passing on a 300-dimensional vector, we pass on a 304-dimensional vector. The resulting network structure can be seen in Figure 3.2.
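A Keras sketch of this late-fusion step, in which the one-hot party and incumbency indicators are concatenated with the 300-dimensional turn representation before the prediction heads (the input handling and names are assumptions):

```python
from tensorflow.keras.layers import Concatenate, Dense, Embedding, Input, LSTM
from tensorflow.keras.models import Model

def build_side_info_model(vocab_size, embedding_matrix, max_len=200):
    words = Input(shape=(max_len,))
    party = Input(shape=(2,))       # one-hot: [Republican, Democrat]
    incumbency = Input(shape=(2,))  # one-hot: [incumbent, challenger]

    embedded = Embedding(vocab_size, 300, weights=[embedding_matrix], trainable=False)(words)
    turn_repr = LSTM(300)(embedded)
    fused = Concatenate()([turn_repr, party, incumbency])  # 300 + 2 + 2 = 304 dimensions

    heads = [Dense(1, activation="sigmoid", name=label)(
                 Dense(100, activation="relu")(fused))
             for label in ["attack", "policy", "value", "group", "personal"]]
    model = Model([words, party, incumbency], heads)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```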
3.5.2 Results and Discussion
As seen in Table 3.10 for the inter-annotator agreement interpretation of the data, including this information consistently and significantly improved overall model performance, particularly when both party and incumbency information was included. However, although the model with both types of information outperformed the baseline in all cases, the best performance on the personal statements category was achieved by the model where only incumbency information was included.
Our hypothesis is that this corresponds to differences in how much time known and unknown candidates need to spend discussing themselves and discussing accomplishments, effectively setting up different priors for the two categories which overwhelm party-based differences. Exploring these differences is one area for future work. For the union interpretation of the data in Table 3.11, the results are similar, although the differences in performance are less dramatic (as is expected when starting from a higher baseline). While the additional information still provides a statistically significant lift in performance for all labels except for the "value" category (where there is no significant difference between the baseline and the complete model), the difference is smaller.
This experiment suggests that combining these types of categorical features with continuous representations may be an effective strategy for incorporating background and contextual information. These models provided the best overall performance observed in any of our experiments on the full dataset (although better results were occasionally found when training and testing on a pre-selected subset of the data, as explored in Studies 2 and 3). Beyond that, this approach has two key advantages. The first is that this information can be included with negligible increases in training time and data preparation. The second is that it allows for the inclusion of an arbitrary number of factors in a given model. While in some controlled settings this may not be important, in many contexts this flexibility is essential.
Given the late-fusion process, in which the side information is only incorporated after the LSTM layers have generated a turn representation, the impact on the total number of parameters and training time is negligible. In our tests, there was no measurable difference (given the noise in the stochastic gradient descent process) in overall training time.
Overall, the results of this experiment suggest the applicability of this method to combining
continuous representations with background knowledge. It remains an open question what types of
features are best introduced in this manner, but the ease of doing so opens a range of opportunities.
3.6 Conclusion
The previous experiments demonstrate how context and background knowledge can be in-
corporated into predictive text analysis models. In particular, we demonstrate how to combine
categorical information about the identity of speakers with continuous representations of text.
This opens the potential of bridging one gap between computational and psychological modeling.
Taking advantage of the benefits of both of these approaches allows for consideration of global
linguistic context (in the form of word representations trained on large quantities of text), local
linguistic context (the continuous turn representations), and local background context (in this
case, prior knowledge about party and incumbency of politicians). Each of these facets is im-
portant for language understanding in real-world contexts and their incorporation here opens the
potential for richer and more varied modeling of language for both prediction and understanding.
In terms of the case study we focus on, prior work on political rhetoric has established the
importance of partisan identity and candidate status to understanding what is being said at any
moment. Basic facts about the identities of the speakers are assumed as common knowledge in
these contexts and provide critical background knowledge for understanding. Further, speakers
act on the assumption that the listeners know something about who they are; this assumption
defines many of their moves as they attempt to navigate the narratives about their candidacies.
For language analysis, incorporation of these pieces of background information is natural as we
would expect that it is precisely the sort of information which would not be present in the text
itself. In these studies, we have shown that this information can be used to improve computational
analysis of political rhetoric. In particular, considering "types" of speakers (in terms of party and
status) can improve the ability to model and understand the intent of highly subjective statements.
This applies far more generally than just political speech. Language is more than an informational channel; it is a form of social action [4] and can only be fully understood in context.
Human understanding of language relies on our history, both local and global, to interpret even
the simplest of phrases. There is no private language [127] and no meaning without context.
This capacity to incorporate context is critical for psychological research as the focus is more
explicitly on the processes which underlie language use and understanding. While the experiments
here focused on prediction using neural networks, these methods apply to any modeling task
capable of handling continuous features. Text can be incorporated naturally with categorical
features without requiring that the text itself be treated categorically. For example, rather than
converting variations in prompts into categorical variables, the texts of the prompts could be
converted into continuous representations allowing for a measure of semantic similarity between
them. Similarly, free text participant responses can be more easily compared rather than forcing
researchers to rely on scale answers. Additionally, this removes a potentially problematic degree
of freedom in terms of potential interaction between the structure of the experiment and the
categories the researcher selects to represent the prompts or responses.
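As a rough illustration of this alternative, the sketch below converts prompt (or free-text response) texts into simple averaged word-vector representations and measures their semantic similarity. The averaging scheme, the 300-dimensional vectors, and the toy random `word_vectors` stand-in are assumptions made to keep the example self-contained; any pre-trained embedding model could be substituted.

```python
import numpy as np

def embed(text, word_vectors, dim=300):
    """Average pre-trained word vectors (e.g., GloVe or word2vec) to get a
    simple continuous representation of a prompt or free-text response."""
    vecs = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(a, b):
    """Cosine similarity between two continuous representations."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# `word_vectors` would normally be loaded from pre-trained embeddings;
# a tiny random stand-in is used here so the sketch runs on its own.
rng = np.random.default_rng(0)
word_vectors = {w: rng.normal(size=300)
                for w in "you see a runner taking a shortcut".split()}

similarity = cosine(embed("You see a runner", word_vectors),
                    embed("You see a shortcut", word_vectors))
```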
We demonstrated two possible approaches to this challenge: training separate models based
on context and incorporating context in terms of categorical variables directly with a continuous
representation. Both have advantages and disadvantages depending on the circumstances. In
situations where a single context variable is known to be critical, training separate models can be
valuable. This does depend on future data having access to the same information, as we observed off-domain performance of these models to lag far behind models which didn't consider context
at all.
However, training separate models has several disadvantages. For research purposes, the fact that the models are trained and evaluated on separate data makes comparison and inference difficult. It also restricts study design. As the number of variables and available data climbs, training separate models becomes infeasible. In these cases, the method of incorporating continuous representations of text with categorical variables provides a more flexible and extensible approach to modeling and analysis.
Another key contribution of this work is the development of a novel annotated dataset of value
and identity statements over the challenging domain of political debates. This dataset is freely
available, which we hope will help further work on these types of arguments, critical to the analysis
of modern political rhetoric. Especially in light of recent elections around the world where identity
arguments have played a central role, the capacity to model the ways in which these arguments are
deployed is essential. The ability to computationally identify these arguments opens the door to
applications such as automated real-time monitoring of the types of arguments being made across
a variety of media. This is especially critical as automated micro-targeting becomes the norm.
The development of an immune system for the public sphere will depend on similarly automated
tracking and identification capabilities. Additionally, for researchers focused on domains such as
polarization, the capacity to process large quantities of political rhetoric to identify regions for
closer inspection is extremely valuable.
Language use depends on understanding and adapting to context. That includes the common
linguistic background required for basic language understanding, the local discourse context, and
the knowledge of human behavior that underlies the usage of language in any interactive setting.
Pulling these together is critical whether the goal is to build more accurate predictive models
or to understand the psychological processes behind a given utterance. Systems which simplify
this fusion of different types of information will be critical for many types of work which attempt to look at language not merely as a collection of symbols but as a fundamental tool of human interaction.
Chapter 4
Incorporating Demographic Embeddings into Language
Understanding
4.1 Introduction
The meaning of an utterance depends on context. This is trivially true with deictics [28] where a phrase such as "this goes here" tells us little outside of a context. But it is generally true when considering language as a record of the intents, actions, and reactions of real humans. Nonetheless, the current dominant techniques in computational language understanding generally ignore these factors. Semantic representations are trained on large bodies of "neutral" data, yielding results which capture only certain facets of human language understanding [45]. Supervised models are based on corpora of fragmented text and labels, isolated from their sources. These techniques are extremely powerful and have proved very useful, at least for tasks and questions where we can ignore local variation as noise. But if we want to go beyond that subset of tasks, context becomes increasingly important.
One critical aspect of this is the identity of the participants in a given interaction. Prior work
in fields including linguistics, philosophy, and cognitive science [34, 13, 32] has considered the
importance of the interaction between individuals and language as essential to understanding both
the intentionality and meaning of utterances in context. At the simplest level, two individuals
can interpret the same statement differently, at times very differently, and dismissing these differences as "mistakes" or "noise" only serves to limit our modeling efforts to the most arid
of settings. But the choice to focus on statistical regularity was not made from ignorance, but
rather due to practical limitations of data, computational resources, and methods. Modeling the
variation in human response can easily devolve into a study of exceptions, treating each interaction
as novel and losing sight of the patterns and structure which make language possible at all.
Rather than attempt to model the full scope of human variation, we focus on a simplified representation of individuals, using demographics as an obviously incomplete but useful starting point. However, just as a sentence or phrase doesn't have a single meaning across contexts, the impact of identity, especially as summarized by demographic factors, varies depending on the situation. This is clear even from surface factors such as word choice. Whereas regional variation will explain whether we expect someone to use "crawfish" versus "crawdad" or "boot" versus "trunk", age will be far more informative as to the likelihood of observing "hook up" versus "Netflix and chill." Generally, no single measure of concept similarity will capture the range of human or situational variation [123, 39].
In this paper, we explore combining demographics and context into situated demographic
representations and demonstrate how these can be applied to modeling language understanding.
The goal is to take raw demographic variables and transform them so as to capture situational
rather than abstract similarity between individuals. Our approach is to learn a mapping from
demographics into a continuous geometric representation where measures of similarity are mean-
ingful for the given task. Just as we don't expect to learn language anew with each interaction,
language processing generally depends on representations trained on large quantities of external
data. Our method is similarly capable of training situated demographic representations on related,
higher-resource, tasks. This allows for wider applicability and opens the potential for exploring
the relationship between the context in which a representation is learned and the domain in which
it's applied.
We apply this approach to an area where prior psychological research has found demographic-driven differences in individual responses: moral reasoning [49]. For example, liberals and conservatives have been found to respond differently to a wide range of moral concerns [41]. We
compare modeling participants' responses based on (1) representations of the text by itself (2)
text plus demographic factors and (3) text plus situated demographic factors encoded using our
approach. We show that the incorporation of situated demographic representations is better able
to model participant responses. We follow this by exploring a practical issue where distributed
representations have proved useful in other domains: improving modeling with sparse or missing
data.
Prior work has looked at how incorporating demographic information on speakers can improve
classification performance on a range of tasks [56, 62] and has proven particularly valuable in
venues such as social media [57] where group-based linguistic variations can be extreme. Word
representations have also been learned on demographically split data [5, 36] which has been shown
to be useful for community-specic classication in areas such as sentiment analysis [128]. Our
work differs from these in using demographics to model respondents rather than sources and in
generating domain-specic demographic representations.
The main contributions of this paper are: (1) a novel method for encoding demographics
and context into a combined continuous representation; (2) an exploration of the use of those
representations on a concrete language understanding task; (3) a demonstration of the utility of
demographic information when faced with missing data; and (4) the release of a new dataset of
responses to the moral vignette stimuli developed by [18] along with demographic information on
the respondents.
4.2 Demographic Embeddings
In this section, we introduce a general method for learning situated demographic embeddings.
Some of the factors we have already discussed provide useful constraints on the choice of modeling
approach. First, since even the trivial examples we have discussed so far involve non-order pre-
serving transforms, the method must be able to learn non-linear mappings. For example, given
three individuals (A, B, C), imagine that we had a task focused on regional variation with A from
California, B from Kansas, and C from Ohio. In that case we might find that A is most similar
to B with C more similar to B than A yielding: the order A, B, C. But for another task where
age was most salient, imagine that A was 20, C 25, and B 60. Here we might prefer the order: A,
C, B. Given that linear maps can only preserve or reverse order metrics, to allow for these two
possibilities we require the ability to learn non-linear transformations.
Second, in order to allow geometric interpretability and facilitate downstream interaction with
other semantic representations, we prefer a method which yields a continuous representation.
Finally, as learning interaction effects requires multiple times more data than learning first-order effects, we require a method which can handle larger quantities of data. Given these factors, we
chose to generate representations using a shallow feedforward neural network model.
The overall approach is to train a model with an initial input layer, a single non-linear hidden
layer, and an arbitrary number of linear output layers. The size of the hidden layer determines
the size of the resulting embeddings. Effectively, the hidden layer is learning a representation
optimized for the output tasks. This is similar to the approach of [86] but, where they learned word
similarity based on predicting other words in the local context, we learn demographic similarity
based on the chosen output tasks. The choice of which tasks to use for this training depends on
the domain at hand.
After training, network weights are saved and a new network initialized consisting of only
the input and hidden layers. At this point, arbitrary demographic inputs may be entered with
the output of the hidden layer yielding the representation for that input. This is different from situations like word embeddings, where training yields representations for a fixed subset of words from the training set. Instead, we learn a mapping function which can be applied to arbitrary future inputs. This flexibility allows the generation of representations not only for observed cases, but also for previously unseen examples or even examples with missing data.
Figure 4.1: The multi-task training network.
4.2.1 Training Moral Embeddings
In our experiments below, we make use of a version of this architecture trained to learn
demographic embeddings adapted for use in the moral reasoning domain. The goal is to leverage
a much larger set of related data in order to learn how to map demographics into a domain-specific
geometric space. We train on responses to the Moral Foundations Questionnaire (MFQ) collected
on the YourMorals.org website [42]. This dataset included 133,237 individuals who had both
answered the full set of 32 MFQ questions and provided the required demographic information.
As seen in Figure 4.1, the network is structured as a multi-task network [21]. Categorical
demographic variables are converted to one-hot representations and concatenated to generate the
input vector. This feeds into a single 20-dimensional hidden layer (the value of 20 was selected by grid search) with rectified linear activation functions [92]. Finally, the output from the hidden
layer is connected to dense output layers for each of the 32 individual MFQ questions. Training
makes use of the Adam optimizer [68] with models trained to convergence based on a randomly
selected 20% validation sample (generally 50-100 epochs).
After this model is trained, the resulting weight parameters are saved and a new network is
instantiated based on the input and 20-dimensional hidden layer (now the output layer). The
outputs of this layer are used as the representation of the input demographic values.
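A minimal sketch of this training setup, written with the Keras functional API, is shown below. The input width (`demo_dim`), the use of a mean-squared-error regression loss for the MFQ items, and the placeholder variable names are assumptions not specified in the text; the 20-dimensional ReLU hidden layer, the 32 per-question output heads, the Adam optimizer, and the extraction of the hidden layer as the embedding follow the description above.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Width of the concatenated one-hot demographic vector (assumed value).
demo_dim = 30
inputs = keras.Input(shape=(demo_dim,))
hidden = layers.Dense(20, activation="relu", name="embedding")(inputs)

# One linear output head per MFQ item (32 in total): a multi-task network.
outputs = [layers.Dense(1, name=f"mfq_{i}")(hidden) for i in range(32)]

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")   # regression loss is an assumption
# model.fit(X_demographics, [y[:, i] for i in range(32)],
#           validation_split=0.2, epochs=100)

# After training, keep only the input-to-hidden mapping: its 20-d activations
# are the situated demographic embedding for any (possibly unseen) input.
embedding_model = keras.Model(inputs, hidden)
# demo_embedding = embedding_model.predict(new_demographics)
```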
We used the YourMorals dataset to train moral-focused demographic embeddings and then
applied the resulting mapping function to the 950 participants in the study discussed below. We
found that these representations, in spite of being trained on a different (though clearly related)
task, combined to yield the best overall model for predicting participant reactions.
4.2.2 Observations on Moral Space
Before considering experimental results, it is useful to consider the structure of this mapping
from raw demographics to moral space. Given that this method generates a geometric space, it
is simple to apply standard measures of vector space similarity. The effects of this transform, in particular inversions of ordering between the spaces, proved to be highly intuitive. For one example of this, consider three users, U1: 30 years old, religious, conservative; U2: 30 years old, non-religious, liberal; and U3: 60 years old, non-religious, liberal. In demographic space, U2 is strictly closer to U1 than U3 is. That is, Dist(U1, U2) < Dist(U1, U3) for any standard distance metric. But, as seen in Figure 4.2, in the transformed moral space, U3 is closer to U1 than U2 is (Dist(U1, U2) > Dist(U1, U3); in the model used in this paper, the cosine similarity between U1 and U2 is 0.9757, while the similarity between U1 and U3 is 0.9796). This makes intuitive sense in the context of contemporary US politics and culture, as we would expect, all else being equal, that an older liberal would be more conservative than a young liberal. The mapping learned is non-affine (in this case, non-order preserving on the distance metric) but is capturing precisely the sort of relations we would wish to find.
Capturing this type of relationship opens the potential of applying this method not just for
use in predictive modeling, but also for directly exploring the structures of these similarity spaces.
In the example of the transformation described above, on another domain (for example, questions
relating to media consumption) we wouldn't expect the same inversion (where two thirty-year-olds
would likely be more similar regardless of political/religious leanings). This opens the potential
of directly exploring the space of transforms.
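The sketch below illustrates this inversion numerically. The raw feature coding of the three users is hypothetical, and the final (commented) lines assume the `embedding_model` from the earlier sketch; only the reported cosine similarities (0.9757 and 0.9796) come from the text.

```python
import numpy as np

# Hypothetical raw encodings of the three users above: [age, religious,
# non-religious, conservative, liberal]; the real feature coding differs.
u1 = np.array([30.0, 1, 0, 1, 0])   # 30, religious, conservative
u2 = np.array([30.0, 0, 1, 0, 1])   # 30, non-religious, liberal
u3 = np.array([60.0, 0, 1, 0, 1])   # 60, non-religious, liberal

# In raw demographic space, U2 is strictly closer to U1 than U3 is.
assert np.linalg.norm(u1 - u2) < np.linalg.norm(u1 - u3)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# After mapping through the trained embedding model, the ordering can invert:
# m1, m2, m3 = (embedding_model.predict(u[None, :])[0] for u in (u1, u2, u3))
# cosine(m1, m3) > cosine(m1, m2)   # e.g., 0.9796 vs. 0.9757 in the model above
```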
Figure 4.2: Example transform into moral space.
4.3 Study 1: Applying Demographic Embeddings for Language Understanding
While we have considered some of the qualitative aspects of these embeddings, in this section
we evaluate their usefulness for a concrete natural language understanding task. In particular, we
model responses to a set of moral vignettes [18], 14-17 word stories designed to evoke a particular
moral concern which participants evaluated in terms of degree of moral wrongness and level of
emotional response. This domain was chosen based on prior work which has found strong correlations between moral reasoning and demographic factors [41, 49].
To evaluate this, we compare the results of predicting responses based on: (1) continuous
representations trained from text features only, (2) adding demographic representations to those
text representations, (3) adding demographic cluster information to the text representations, and
(4) adding domain-adapted demographic embeddings to the text representations.
These allow for several useful evaluations and comparisons. Text features are useful in eval-
uating how much variance is explained by the story itself, especially in light of work suggesting
that such features can explain a majority of variance in many survey studies [3]. The addition of
demographic features allows us to compare how much general demographics improves our ability
to model these responses. The comparison of demographic embeddings with raw demographic in-
formation allows evaluation of the impact of situational adaptation. And comparing demographic
embeddings with demographic clusters lets us evaluate whether changes in model accuracy are
due to simple noise reduction from the embedding or whether the information contained in the
outside task was helpful.
4.3.1 Dataset
We make use of a collection of "moral vignettes" [18], short (14-17 word) scenarios. These vignettes were designed to capture a scene, or context, which evokes a potential moral violation (for example, in the fairness domain: "You see a runner taking a shortcut on the course during the marathon in order to win."). The vignettes make use of Moral Foundations Theory [47, 43, 48], which classifies moral concerns into five domains: Care/harm, Fairness/cheating, Loyalty/betrayal, Authority/subversion, and Sanctity/degradation [50]. The original authors carried out a study in which each vignette was annotated by approximately 30 annotators, with the final collection consisting of vignettes which reached a minimum of 70% agreement as to the primary moral domain.
For this study, we selected the five most agreed upon vignettes for each moral foundation, the most agreed upon non-moral stories (those lacking any moral content), and the five stories with the highest combination of degree of violation and disagreement between annotators. This yielded a total of 40 vignettes.
Based on a target of at least 50 responses to each vignette for the primary ideological groups,
we recruited a total of 950 participants through Amazon Mechanical Turk (allowing margins for
platform demographic bias and attention check failures). Each was asked to evaluate a random
subset of 10 of the vignettes on the basis of which foundation was involved (or none), the degree of
moral wrongness, and their subjective level of emotional response to the story. Additionally, for
each participant we collected demographic data including age, religious affiliation and religiosity, gender, and political affiliation. This yielded a total of 9500 individual vignette evaluations with
paired demographic data.
4.3.2 Method
Text representations were generated using InferSent [22], which learns a model for generating continuous sentence representations based on a range of natural language inference data. For this task, we trained a model (code available from https://github.com/facebookresearch/InferSent) to yield 256-dimensional text embeddings. We additionally trained a model with 4096 dimensions (the number used by the authors to achieve optimal prediction results), but the difference in performance on our task between 4096 and 256 was negligible. The resulting 256-unit model was used to generate representations for each of the vignettes used in the study.
Models were trained to predict users' responses to the questions "How morally wrong is the action depicted in this scenario?" and "How strong was your emotional response to the behavior depicted in the scenario?". Given the small overall size of the dataset and the fact that our goal was to explore model differences rather than maximize predictive accuracy, linear regression
models with elastic net regularization were estimated for each of the following feature sets: (1) text
representations; (2) text representations and raw demographic variables; (3) text representations
and k-means demographic cluster membership; and (4) text representations and demographic
embeddings.
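A simplified sketch of this comparison is shown below, using scikit-learn's elastic net with cross-validation. The random stand-in matrices, their dimensions, and the single round of 10-fold scoring are assumptions made to keep the example self-contained; the study itself used repeated cross-validation as described next.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Random stand-ins for the real feature matrices: InferSent vectors for the
# vignette each response refers to, coded demographics, k=2 cluster one-hots,
# 20-d situated embeddings, and the scale response being predicted.
rng = np.random.default_rng(0)
n = 500
text_reprs = rng.normal(size=(n, 256))
demographics = rng.normal(size=(n, 8))
clusters = np.eye(2)[rng.integers(0, 2, size=n)]
demo_embeddings = rng.normal(size=(n, 20))
y = rng.normal(size=n)

feature_sets = {
    "text": text_reprs,
    "text + demographics": np.hstack([text_reprs, demographics]),
    "text + k-means": np.hstack([text_reprs, clusters]),
    "text + embedding": np.hstack([text_reprs, demo_embeddings]),
}

for name, X in feature_sets.items():
    # Elastic-net regression with internal 10-fold CV for the penalty,
    # scored by out-of-fold R^2; repeating this yields a score distribution.
    model = make_pipeline(StandardScaler(), ElasticNetCV(cv=10))
    scores = cross_val_score(model, X, y, cv=10, scoring="r2")
    print(f"{name}: mean R^2 = {scores.mean():.3f}")
```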
Each model was fit with 10 repetitions of 10-fold cross-validation. Then model performance was evaluated, in terms of R², via a second round of ten repetitions of 10-fold cross-validation, yielding 100 estimates of R² for each feature set. Finally, to facilitate between-model performance comparisons, permutation tests with 10,000 iterations were conducted to test the hypothesis that the R² scores for two given feature sets are drawn from the same null distribution (i.e., that the proportion of variance explained does not vary meaningfully between feature sets). The benefit of this approach, compared to a simple t-test, for example, is that permutation tests make no assumptions about the form of the distributions under study (whereas a t-test assumes samples are drawn from a Gaussian distribution) [110, 85].
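The sketch below shows one way such a permutation test can be run on two sets of R² estimates. It implements a plain two-sample (unpaired) permutation test on the difference in means; whether the original analysis used a paired variant is not specified here, so treat this as illustrative.

```python
import numpy as np

def permutation_test(scores_a, scores_b, n_iter=10_000, seed=0):
    """Two-sample permutation test on the difference in mean R^2.
    Under the null the group labels are exchangeable, so we repeatedly
    shuffle the pooled scores and count how often the permuted difference
    is at least as extreme as the observed one."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([scores_a, scores_b])
    n_a = len(scores_a)
    observed = abs(np.mean(scores_a) - np.mean(scores_b))
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        if abs(pooled[:n_a].mean() - pooled[n_a:].mean()) >= observed:
            hits += 1
    return (hits + 1) / (n_iter + 1)

# Usage with two hypothetical arrays of 100 cross-validated R^2 estimates:
# p = permutation_test(r2_text_plus_embedding, r2_text_only)
```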
Figure 4.3: Model comparison for predicting participants' reported emotional response
The included demographic features were age, gender (free-response with results mapped to
male/female/other), political ideology (liberal, conservative, moderate, other), and religiosity
(five levels ranging from highly religious to non-religious).
Clusters were generated based on participant demographics using the k-nearest neighbors al-
gorithm [31]. k was determined based on comparing model performance over values ranging from
k = 2 to k = 10. The best performance was found at k = 2, which was used for the results
reported below.
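A sketch of the clustering step is shown below, assuming the k-means procedure named in the feature-set comparison and a random stand-in for the coded demographics; in the actual study, k was chosen by refitting the downstream regression models for each candidate value and comparing performance.

```python
import numpy as np
from sklearn.cluster import KMeans

# Random stand-in for the coded participant demographics (950 participants).
rng = np.random.default_rng(0)
demographics = rng.normal(size=(950, 8))

# k = 2 gave the best downstream performance in the study; in practice the
# selection loop would refit the regression models for each k in 2..10.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(demographics)

# One-hot cluster membership, concatenated with the text features downstream.
cluster_onehot = np.eye(2)[labels]
```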
As described previously, a mapping function for generating demographic embeddings was
learned using a separate dataset of answers to the Moral Foundations Questionnaire. We applied
this map to the participants in this study, generating a 20-dimensional demographic embedding
for each.
Figure 4.4: Model comparison for predicting participants' evaluations of degree of moral response
Table 4.1: Mean R² with standard error by model and domain, predicting participants' reported emotional response
Domain Text Text + embedding Text + demographics Text + k-means
1 Authority 0.018 (0.0011) 0.127 (0.0031) 0.120 (0.0031) 0.067 (0.0024)
2 Care 0.026 (0.0014) 0.053 (0.0017) 0.052 (0.0017) 0.026 (0.0013)
3 Fairness 0.139 (0.0033) 0.167 (0.0034) 0.163 (0.0035) 0.137 (0.0033)
4 Loyalty 0.071 (0.0022) 0.108 (0.0028) 0.103 (0.0027) 0.088 (0.0025)
5 Non-moral 0.011 (7e-04) 0.012 (7e-04) 0.010 (6e-04) 0.014 (8e-04)
6 Sanctity 0.092 (0.0028) 0.133 (0.0032) 0.133 (0.0032) 0.109 (0.0028)
Table 4.2: Mean R² with standard error by model and domain, predicting participants' evaluation of degree of moral wrongness
Domain Text Text + embedding Text + demographics Text + k-means
1 Authority 0.059 (0.0023) 0.144 (0.0034) 0.141 (0.0032) 0.102 (0.0028)
2 Care 0.051 (0.0022) 0.085 (0.0026) 0.084 (0.0026) 0.05 (0.0021)
3 Fairness 0.222 (0.0035) 0.247 (0.0036) 0.245 (0.0037) 0.233 (0.0035)
4 Loyalty 0.108 (0.0029) 0.127 (0.003) 0.125 (0.003) 0.115 (0.0029)
5 Non-moral 0.009 (5e-04) 0.01 (6e-04) 0.009 (5e-04) 0.01 (7e-04)
6 Sanctity 0.064 (0.0021) 0.109 (0.0024) 0.108 (0.0025) 0.085 (0.0022)
4.3.3 Results and Discussion
Overall, there are several key points here. First, domain-adapted demographic embeddings provided the best overall ability to model responses to moral scenarios (p < 0.0001 when comparing against text representations). Second, demographic information (in any form) improves our ability to predict responses to moral scenarios. This is true in all cases except for non-moral scenarios (for these, due to most participants rating them as generating no emotional response and expressing no moral wrongness, there was simply no variance to explain). Finally, both raw demographics and moral embeddings were significantly better at predicting responses than demographic clusters (p < 0.0001).
The low R² values when working with only text features are unsurprising given the nature of the data. These vignettes were designed to avoid using similar words or words from the descriptions of the foundations themselves [18], in effect being structured to limit the value of simple textual features. All methods of including demographic information showed significantly improved performance over the text-only models. However, both the use of raw demographic information and moral embeddings did significantly better (p < 0.0001) than making use of
demographic clusters. As the clustering method reduces multiple demographic factors to a choice
between, in this case, two possible clusters, it appears that the potential advantage of noise
reduction is in this case outweighed by the lost information.
The comparison between adapted demographic embeddings and raw demographic represen-
tations is both more complex and more interesting. The embeddings provided a small over-
all improvement in the ability to model participant responses. When compared with random
permutation tests, this difference was significant for reported emotional response for all domains (p < 0.0001 for Authority, Fairness, and Loyalty and p < 0.05 for Care) except Sanctity (p = 0.408). Results were similar for evaluating model differences in predicting moral wrongness (p < 0.0001 for Fairness and Loyalty and p < 0.05 for Authority and Care) with the exception of Sanctity (p = 0.0393). The improvement observed when predicting emotional response suggests
that higher order interactions may be more important there, but more targeted follow-ups would
be required to confirm this.
4.4 Study 2: Demographic Embeddings for Missing Data
While embeddings in other contexts have been explored in terms of the abstract representations
they encode, they have also proven extremely useful for dealing with practical modeling challenges.
One such issue is the general challenge of dealing with missing and sparse data in high dimensional
contexts [10]. The ability to learn similarity metrics between words has allowed for far more
generalizable approaches to a range of downstream language modeling tasks [7]. For example,
a classifier trained for sentiment might never have seen the word "tedious" but if "boring" and "tiresome" were both in the training set, the fact that "tedious" is distributionally similar would allow for a much better prediction than just treating "tedious" as an unknown word (a common previous approach for dealing with this issue).
In this study, we explore whether similar advantages might hold true for demographic em-
beddings. If our training data is missing particular groups, can we make use of domain-specific
similarity to improve the quality of subsequent predictions? If a model never saw a particular
subgroup in training, how well can it predict that group's behavior when encountered later? We
compare making use of demographic embeddings with applying the raw demographic features
(the two best performing models from Study 1).
4.4.1 Method
The general method is similar to that of Study 1 in that we rst train models based on a
subset of the data and then test those models against a held out subset. The dierence comes in
how the data is split. Rather than randomly-sampled cross validation, in this study, we iteratively
remove a single group, training on all remaining data and testing solely on that missing group
(models are still trained with cross validation to avoid overfitting the training set).
We rst iterated through all possible values of the demographic variables (age, gender, re-
ligiosity, and ideology) with age bucketed into the groups: 0-30, 31-50, and 50+. A separate
test/train split was generated for each of these values by putting all participants with the selected
value into the test set and leaving all other participants as the training set. For example, when
testing the case of age=31-50, all participants in that age range (regardless of other demographic
factors) were separated and held out as a test group. Thus, during training, the model would
never observe anyone with the selected demographic value. This yielded a total of 13 separate
train/test splits for the data (one for each possible missing value).
A similar procedure was then repeated for each possible pair of factors, with each possible
value for each demographic factor combined with all possible values for the other factors. For
example, starting with ideology="conservative", we considered each possible combination with
values of religiosity, gender, and age. This yielded a total of 84 possible combinations.
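The single-factor splits can be generated along the following lines. The column names, age buckets, and toy data frame are assumptions for illustration; pair-of-factor splits follow the same pattern with two masks combined.

```python
import pandas as pd

# Toy stand-in for the response data: one row per vignette evaluation,
# with hypothetical column names and the age buckets described above.
df = pd.DataFrame({
    "age_group":   ["0-30", "31-50", "50+", "31-50"],
    "gender":      ["female", "male", "female", "other"],
    "religiosity": [1, 3, 5, 2],
    "ideology":    ["liberal", "conservative", "moderate", "liberal"],
})

FACTORS = ("age_group", "gender", "religiosity", "ideology")

def single_factor_splits(df, factors=FACTORS):
    """Yield (factor, value, train, test) where everyone holding `value` of
    `factor` is held out, so the model never sees that value in training.
    Pair-of-factor splits combine two such masks in the same way."""
    for factor in factors:
        for value in df[factor].unique():
            mask = df[factor] == value
            yield factor, value, df[~mask], df[mask]

for factor, value, train, test in single_factor_splits(df):
    print(factor, value, len(train), len(test))
```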
For each data split generated either by a single factor or pair of factors, we compare two ways
of encoding the situation and generating features for modeling based on the two best performing
models from Study 1. The first makes use of the full set of raw demographic values with the sentence representations ("raw demographics"). The second combines the sentence representations with pre-trained demographic embeddings ("embedding").
In all cases we follow the procedures outlined in Study 1: training linear models with elastic net
regularization. Models are optimized based on 10-fold cross validation over the training set with
pre-processing applied to center, scale, and remove zero-variance variables. Model performance is
evaluated based on R² of the test set (as this is a separate testing set, R² can be negative).
Model performance is compared via permutation tests with 10,000 iterations. These steps were
repeated separately for both the prediction of participants' reported degree of subjective emotional
response to a vignette and their reported evaluation of the degree of moral wrongness embodied
in the vignette.
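For each such split, the evaluation then looks roughly like the following: fit the elastic net on the training portion and score R² on the held-out group, where a negative value is possible because that group was never seen in training. The random stand-in arrays and their shapes are assumptions.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Random stand-ins for one split: 276 features (e.g., 256 text + 20 embedding).
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(800, 276)), rng.normal(size=800)
X_test, y_test = rng.normal(size=(150, 276)), rng.normal(size=150)

model = make_pipeline(StandardScaler(), ElasticNetCV(cv=10))
model.fit(X_train, y_train)

# Because the held-out group was never seen in training, R^2 here can be
# negative (worse than simply predicting the training mean).
print(r2_score(y_test, model.predict(X_test)))
```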
Figure 4.5: Average R² over the missing-data test instances, separated by domain. Evaluated against user responses to their subjective level of emotional response.
4.4.2 Results and Discussion
Table 4.3: Mean R² with standard error by model and domain, predicting participants' reported emotional response over missing-data models
Domain Raw demographics Embeddings p-value
1 authority 0.0617 (0.0149) 0.0825 (0.0146) 0.03
2 care 0.0423 (0.0103) 0.0404 (0.006) 0.82
3 fairness 0.0859 (0.0111) 0.1272 (0.0084) 0.00
4 loyalty 0.0614 (0.0103) 0.0903 (0.0089) 0.00
5 sanctity 0.1124 (0.0161) 0.1335 (0.0168) 0.00
6 non-moral 0.0176 (0.0039) 0.0135 (0.0026) 0.19
The use of demographic embeddings did significantly better than the raw demographic model across all moral domains except Care. This was observed for both participant evaluations of subjective emotional response (Figure 4.5 and Table 4.3) and evaluations of the degree of moral wrongness of a given vignette (Figure 4.6 and Table 4.4). The primary difference was that the raw demographic model was trained on values for variables which were not present in the test set. While for some variables (age and religiosity), simple linear interpolation would provide some information, as previously noted, many of the observed transformations into moral space were non-linear. For categorical variables (ideology and gender), the situation was even worse, as unseen values were initialized to zero (in initial experiments, those values had been randomly initialized, leading to far worse performance).
Figure 4.6: Average R² over the missing-data test instances, separated by domain. Evaluated against user responses to their subjective evaluation of the moral wrongness of the situation.
Table 4.4: Mean R² with standard error by model and domain, predicting participants' evaluation of degree of moral wrongness over missing-data models
Domain Raw demographics Embeddings p-value
1 authority 0.0774 (0.0165) 0.1244 (0.0147) 0.00
2 care 0.0634 (0.0144) 0.0702 (0.0113) 0.38
3 fairness 0.122 (0.0123) 0.204 (0.0104) 0.00
4 loyalty 0.0567 (0.0074) 0.1082 (0.0086) 0.00
5 sanctity 0.0719 (0.0151) 0.1048 (0.018) 0.00
6 non-moral 0.0216 (0.0061) 0.0203 (0.0046) 0.77
The two exceptions to this pattern were the Non-moral and Care categories, neither of which showed a significant difference between models. For the non-moral category, this is similar to what was observed in Study 1. If we specifically adapt a model to a particular domain, the better that adaptation, the worse the model is likely to do outside of that domain (especially in a case such as this where examples were selected specifically to avoid overlap with the domain in question). However, the reason for the difference on the Care domain is not immediately obvious. Prior work on moral values has suggested that this domain is connected to the basic patterns of reproduction and attachment [49] and is one of the two "binding" concerns (along with Fairness) that are thought to show less variance over demographic groups. However, prior work focused on political differences [41] found less variation in the "fairness" domain between ideological groups in the US (where we found a significant difference), potentially undercutting this hypothesis. More targeted follow-up work will be required to evaluate the precise reasons for this finding.
The use of domain-adapted demographic embeddings depends on the availability of sufficient quantities of domain-related data. In this case, we were able to make use of a much larger set of questionnaire data on a closely related topic. When available, this allows for highly relevant embeddings to be trained and offers a viable response to the issue of missing data. However, as the amount of related data declines and/or the closeness of the relationship between the domains diminishes, the effectiveness of demographic embeddings is likely to vary more widely. The further exploration of the importance of these tradeoffs in a wider range of domains remains an interesting area for future work.
4.5 Discussion and Future Work
Incorporating contextual factors into computational language understanding is important both
in terms of advancing practical modeling as well as improving our theoretical understanding of
linguistic representation. Just as our understanding of semantics has advanced due to the ability
to compare the information captured by different language embedding models, we hope that
consideration of context can similarly advance our understanding of pragmatics. One key facet
of this is exploring questions of representing language in context, in particular modeling the
interactions between individuals with backgrounds, intents, and beliefs.
The present work aims to contribute to this goal by exploring the question of how to combine
representations of language and individuals given a particular situation. Our primary contribution
is a novel method for learning domain-specific demographic embeddings and validating their
potential for inclusion in natural language understanding tasks. Additional contributions include
(1) providing a new dataset of responses to moral stories which extends existing work in the
domain, and (2) confirming prior work on the importance of demographic differences in responses to moral concerns and demonstrating the value of that information in predictive modeling.
The general approach of training situational demographic embeddings has potential applica-
tions to a wide range of questions and tasks. Given that there is no single measure of similarity
which captures the range of human responses across contexts, focusing instead on domain-specific
representation provides a better foundation for computational modeling and better captures hu-
man intuitive characterizations.
We were particularly intrigued by the structure of the moral embedding space learned here
where the relationships captured by the transformation were highly intuitive. In particular, the
non-linearities in the mapping seemed to capture precisely the sort of information we would hope
for in such a model. A more formal exploration of the structure of these spaces is a key area for
future work.
This also provides a potentially valuable method for extending statistical models to account
for differences in individual responses. If we aim to model subjective language understanding,
incorporating information about the respondent will be essential. The features required to predict
the response to a piece of text go beyond the text itself. This provides an intriguing point of contact
between social scientific experimentation and computational language modeling.
In particular, it points to the importance of going beyond averaged responses in our statistical
learning methods. Most training corpora used for language processing have focused on providing
sets of text and labels. Differences in individual responses are either washed out as average values or discarded as "bad" data. This has both focused efforts on questions where we wouldn't expect high degrees of subjective difference and complicated modeling and learning given that the key
information for a given model may not even be present in a training set.
Treating linguistic annotation as records of human responses under particular circumstances
is a better frame for these data than "gold standard" truth on a given problem. In the case of our current study, we showed that basic demographic information provided significant improvements
in model performance, but the more important general point is that annotators cannot be treated
as interchangeable. Just as in any psychological study, what we are recording are human responses
where disagreement has as much to tell us as agreement.
In computational language modeling, the shift to continuous representations has provided
benefits including improved model performance, better handling of missing data, the facilitation of gradient-based learning, and direct advancements in the understanding of semantic representation. We believe that continuous demographic representations have the potential to provide a similar range of benefits. While we demonstrated the ability of these representations to improve model
performance, as with continuous language representations, we expect that these early results will
be supplanted as improved training methods are developed. We have demonstrated here that
demographic embeddings have the potential to address similar challenges.
One major question which we leave for future work is the question of how to generate repre-
sentations. Here we have separately learned linguistic and demographic representations and then
combined those as features for further modeling. This carries the assumption that language can be represented in the abstract, effectively filtering those abstract representations through individual differences. Another possible approach would be to generate linguistic representations directly in the context of individual differences, effectively learning personalized semantic spaces. Regardless,
just as distributional semantic representations have evolved rapidly over recent years, we expect
to see methods for representing individual dierences go through a similar transformation.
Finally, while we focused on demographic context here, that is only one aspect of understanding
a situation. There are a great many other factors to consider, including local discourse context, the
relationship of the speaker and listener, and shared background knowledge. While the general task
of creating fully context-aware models of language understanding and response is AI-complete,
incorporation of this information in more local forms will be essential as we move into increasingly
dynamic and rich domains.
Bibliography
[1] Joseph A Aistrup. The southern strategy revisited: Republican top-down advancement in
the South. University Press of Kentucky, 2015.
[2] Akiko Aizawa. An information-theoretic perspective of tf–idf measures. Information Processing & Management, 39(1):45–65, 2003.
[3] Jan Ketil Arnulf, Kai Rune Larsen, Øyvind Lund Martinsen, and Chih How Bong. Predicting survey responses: How and why semantics shape survey statistics on organizational behaviour. PloS one, 9(9):e106361, 2014.
[4] John Langshaw Austin. How to do things with words. Oxford university press, 1975.
[5] David Bamman, Chris Dyer, and Noah A Smith. Distributed representations of geographically situated language. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 828–834, 2014.
[6] David Bamman and Noah A Smith. Contextualized sarcasm detection on twitter. In
ICWSM, pages 574{577. Citeseer, 2015.
[7] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of machine learning research, 3(Feb):1137–1155, 2003.
[8] George EP Box and William J Hill. Discrimination among mechanistic models. Technometrics, 9(1):57–71, 1967.
[9] Ryan L Boyd and James W Pennebaker. Did shakespeare write double falsehood? iden-
tifying individuals by creating psychological signatures with text analysis. Psychological
science, page 0956797614566658, 2015.
[10] Peter Bühlmann and Sara Van De Geer. Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media, 2011.
[11] Ted Byrt, Janet Bishop, and John B Carlin. Bias, prevalence and kappa. Journal of clinical
epidemiology, 46(5):423{429, 1993.
[12] Carol A Capelli, Noreen Nakagawa, and Cary M Madden. How children understand sarcasm:
The role of context and intonation. Child Development, 61(6):1824{1841, 1990.
[13] Rudolf Carnap. Meaning and necessity: a study in semantics and modal logic. University
of Chicago Press, 1947.
[14] Rudolf Carnap. Logical positivism. 1959.
[15] Michael X Delli Carpini, Fay Lomax Cook, and Lawrence R Jacobs. Public deliberation,
discursive participation, and citizen engagement: A review of the empirical literature. Annu.
Rev. Polit. Sci., 7:315{344, 2004.
[16] Qiang Chen, Wenjie Li, Yu Lei, Xule Liu, and Yanxiang He. Learning to adapt credible
knowledge in cross-lingual sentiment analysis. In Proceedings of the 53rd Annual Meeting of
the Association for Computational Linguistics and the 7th International Joint Conference
on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 419{429, 2015.
[17] Thomas Christiano. Voting and democracy. Canadian journal of philosophy, 25(3):395{414,
1995.
[18] Scott Clifford, Vijeth Iyengar, Roberto Cabeza, and Walter Sinnott-Armstrong. Moral foundations vignettes: A standardized stimulus database of scenarios based on moral foundations theory. Behavior research methods, 47(4):1178–1198, 2015.
[19] Geoffrey L Cohen. Party over policy: The dominating impact of group influence on political beliefs. Journal of personality and social psychology, 85(5):808, 2003.
[20] Jacob Cohen. Statistical power analysis for the behavioral sciences, 2nd edition. Hillsdale, New Jersey: Lawrence Erlbaum Associates, 1988.
[21] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, pages 160–167. ACM, 2008.
[22] Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. Super-
vised learning of universal sentence representations from natural language inference data.
arXiv preprint arXiv:1705.02364, 2017.
[23] Robyn M Dawes, Alphons JC Van de Kragt, and John M Orbell. Cooperation for the benefit of us–not me, or my conscience. University of Chicago Press, 1990.
[24] Scott C. Deerwester, Susan T Dumais, Thomas K. Landauer, George W. Furnas, and
Richard A. Harshman. Indexing by latent semantic analysis. JASIS, 41(6):391{407, 1990.
[25] Morteza Dehghani, Kate Johnson, Joe Hoover, Eyal Sagi, Justin Garten, Niki Jitendra Par-
mar, Stephen Vaisey, Rumen Iliev, and Jesse Graham. Purity homophily in social networks.
Journal of Experimental Psychology: General, 145(3):366, 2016.
[26] Johannes C Eichstaedt, Hansen Andrew Schwartz, Margaret L Kern, Gregory Park, Dar-
win R Labarthe, Raina M Merchant, Sneha Jha, Megha Agrawal, Lukasz A Dziurzynski,
Maarten Sap, et al. Psychological language on twitter predicts county-level heart disease
mortality. Psychological science, 26(2):159{169, 2015.
[27] Valerii Vadimovich Fedorov. Theory of optimal experiments. Elsevier, 1972.
[28] Charles J Fillmore. Lectures on deixis. CSLI publications, 1997.
[29] Gary Alan Fine. Sociological approaches to the study of humor. In Handbook of humor
research, pages 159{181. Springer, 1983.
[30] John R Firth. A synopsis of linguistic theory, 1930–1955. 1957.
[31] Evelyn Fix and Joseph L Hodges Jr. Discriminatory analysis-nonparametric discrimination:
consistency properties. Technical report, California Univ Berkeley, 1951.
[32] Jerry A Fodor. Representations: Philosophical essays on the foundations of cognitive science.
MIT Press, 1981.
[33] Peter W Foltz, Walter Kintsch, and Thomas K Landauer. The measurement of textual
coherence with latent semantic analysis. Discourse processes, 25(2-3):285{307, 1998.
[34] Gottlob Frege. Über Sinn und Bedeutung. Zeitschrift für Philosophie und philosophische Kritik, 100(1):25–50, 1892.
[35] J. A. Frimer and M. J. Brandt. Conservatives display greater happiness but only when they
are in power: A linguistic analysis of the u.s. congress. 2015.
[36] Aparna Garimella, Carmen Banea, and Rada Mihalcea. Demographic-aware word associ-
ations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language
Processing, pages 2285{2295, 2017.
[37] Justin Garten, Joe Hoover, Kate M Johnson, Reihane Boghrati, Carol Iskiwitch, and
Morteza Dehghani. Dictionaries and distributions: Combining expert knowledge and large
scale textual data content analysis. Behavior research methods, pages 1{18, 2017.
[38] Namrata Godbole, Manja Srinivasaiah, and Steven Skiena. Large-scale sentiment analysis
for news and blogs. ICWSM, 7(21):219{222, 2007.
[39] Robert L Goldstone, Douglas L Medin, and Dedre Gentner. Relational similarity and the
nonindependence of features in similarity judgments. Cognitive psychology, 23(2):222{262,
1991.
[40] C Goodwin. Notes on story structure and the organization of participation. Structures of
Social Action: Studies in Conversation Analysis, pages 225{246, 1984.
[41] Jesse Graham, Jonathan Haidt, and Brian A Nosek. Liberals and conservatives rely on different sets of moral foundations. Journal of personality and social psychology, 96(5):1029, 2009.
[42] Jesse Graham, Brian A Nosek, Jonathan Haidt, Ravi Iyer, Spassena Koleva, and Peter H
Ditto. Mapping the moral domain. Journal of personality and social psychology, 101(2):366,
2011.
[43] Joshua Greene and Jonathan Haidt. How (and where) does moral judgment work? Trends
in cognitive sciences, 6(12):517{523, 2002.
[44] H Paul Grice. Logic and conversation. 1975, pages 41{58, 1975.
[45] Thomas L Griffiths, Mark Steyvers, and Joshua B Tenenbaum. Topics in semantic representation. Psychological review, 114(2):211, 2007.
[46] John F Gunn and David Lester. Twitter postings and suicide: An analysis of the postings
of a fatal suicide in the 24 hours prior to death. Suicidologi, 17(3), 2015.
[47] Jonathan Haidt. The emotional dog and its rational tail: a social intuitionist approach to
moral judgment. Psychological review, 108(4):814, 2001.
[48] Jonathan Haidt. The moral emotions. Handbook of affective sciences, 11:852–870, 2003.
[49] Jonathan Haidt. The righteous mind: Why good people are divided by politics and religion.
Vintage, 2012.
[50] Jonathan Haidt, Jesse Graham, and Craig Joseph. Above and below left–right: Ideological narratives and moral foundations. Psychological Inquiry, 20(2-3):110–119, 2009.
[51] Alon Halevy, Peter Norvig, and Fernando Pereira. The unreasonable effectiveness of data. IEEE Intelligent Systems, 24(2):8–12, 2009.
[52] Joseph Henrich, Steven J Heine, and Ara Norenzayan. The weirdest people in the world?
Behavioral and brain sciences, 33(2-3):61{83, 2010.
[53] John Heritage and David Greatbatch. Generating applause: A study of rhetoric and response
at party political conferences. American journal of sociology, 92(1):110{157, 1986.
[54] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[55] David W Hosmer Jr and Stanley Lemeshow. Applied logistic regression. John Wiley & Sons,
2004.
[56] Dirk Hovy. Demographic factors improve classification performance. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 752–762, 2015.
[57] Dirk Hovy and Anders Søgaard. Tagging performance correlates with author age. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), volume 2, pages 483–488, 2015.
[58] Rumen Iliev, Morteza Dehghani, and Eyal Sagi. Automated text analysis in psychology:
methods, applications, and future developments. Language and Cognition, 7(02):265{290,
2015.
[59] Shanto Iyengar. Speaking of values: The framing of american politics. In The Forum,
volume 3, pages 1{8. bepress, 2005.
[60] Jennifer Jerit. Survival of the fittest: Rhetoric during the course of an election campaign. Political Psychology, 25(4):563–575, 2004.
[61] Yangfeng Ji, Gholamreza Haffari, and Jacob Eisenstein. A latent variable recurrent neural network for discourse relation language models. arXiv preprint arXiv:1603.01913, 2016.
[62] Anders Johannsen, Dirk Hovy, and Anders Søgaard. Cross-lingual syntactic variation over age and gender. In Proceedings of the Nineteenth Conference on Computational Natural Language Learning, pages 103–112, 2015.
[63] Michael N Jones and Douglas JK Mewhort. Representing word meaning and order infor-
mation in a composite holographic lexicon. Psychological review, 114(1):1, 2007.
[64] John T Jost, Christopher M Federico, and Jaime L Napier. Political ideology: Its structure, functions, and elective affinities. Annual review of psychology, 60:307–337, 2009.
[65] Ewa Kacewicz, James W Pennebaker, Matthew Davis, Moongee Jeon, and Arthur C Graesser. Pronoun use reflects standings in social hierarchies. Journal of Language and Social Psychology, page 0261927X13502654, 2013.
[66] Jeffrey H Kahn, Renee M Tobin, Audra E Massey, and Jennifer A Anderson. Measuring emotional expression with the linguistic inquiry and word count. The American journal of psychology, pages 263–286, 2007.
[67] Margaret L Kern, Johannes C Eichstaedt, H Andrew Schwartz, Lukasz Dziurzynski, Lyle H
Ungar, David J Stillwell, Michal Kosinski, Stephanie M Ramones, and Martin EP Seligman.
The online social self an open vocabulary approach to personality. Assessment, 21(2):158{
169, 2014.
[68] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv
preprint arXiv:1412.6980, 2014.
[69] Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Anto-
nio Torralba, and Sanja Fidler. Skip-thought vectors. In Advances in neural information
processing systems, pages 3294{3302, 2015.
[70] Efthymios Kouloumpis, Theresa Wilson, and Johanna Moore. Twitter sentiment analysis:
The good the bad and the omg! Icwsm, 11:538{541, 2011.
[71] Thomas K Landauer and Susan T Dumais. A solution to plato's problem: The latent seman-
tic analysis theory of acquisition, induction, and representation of knowledge. Psychological
review, 104(2):211, 1997.
[72] Omer Levy and Yoav Goldberg. Linguistic regularities in sparse and explicit word represen-
tations. In Proceedings of the Eighteenth Conference on Computational Natural Language
Learning, Baltimore, Maryland, USA, June. Association for Computational Linguistics,
2014.
[73] Gang Li and Fei Liu. Application of a clustering method on sentiment analysis. Journal of
Information Science, 38(2):127{139, 2012.
[74] Jiwei Li, Dan Jurafsky, and Eduard Hovy. When are tree structures necessary for deep learning of representations? arXiv preprint arXiv:1503.00185, 2015.
[75] Dennis V Lindley. On a measure of the information provided by an experiment. The Annals
of Mathematical Statistics, pages 986{1005, 1956.
[76] Bing Liu. Sentiment analysis and subjectivity. Handbook of natural language processing,
2:627{666, 2010.
[77] Ian Haney López. Dog whistle politics: How coded racial appeals have reinvented racism and wrecked the middle class. Oxford University Press, 2015.
[78] Max M Louwerse. Semantic variation in idiolect and sociolect: Corpus linguistic evidence
from literary texts. Computers and the Humanities, 38(2):207{221, 2004.
[79] Fabrizio Macagno and Douglas Walton. Emotive language in argumentation. Cambridge
University Press, 2014.
[80] P Mair, R Wilcox, and F Schoenbrodt. Wrs2: a collection of robust statistical methods,
2016.
[81] Jon D McAuliffe and David M Blei. Supervised topic models. In Advances in neural information processing systems, pages 121–128, 2008.
[82] Douglas L Medin, Will Bennis, and Michael Chandler. Culture and the home-field disadvantage. Perspectives on Psychological Science, 5(6):708–713, 2010.
[83] Douglas L Medin, Robert L Goldstone, and Dedre Gentner. Similarity involving attributes and relations: Judgments of similarity and difference are not inverses. Psychological Science, 1(1):64–69, 1990.
[84] Tali Mendelberg. The deliberative citizen: Theory and evidence. Political decision making,
deliberation and participation, 6(1):151{193, 2002.
[85] Joshua Menke and Tony R Martinez. Using permutations instead of student's t distribution for p-values in paired-difference algorithm comparisons. In Neural Networks, 2004. Proceedings. 2004 IEEE International Joint Conference on, volume 2, pages 1331–1335. IEEE, 2004.
[86] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[87] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
[88] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In HLT-NAACL, pages 746–751, 2013.
[89] Jeff Mitchell and Mirella Lapata. Vector-based models of semantic composition. In ACL, pages 236–244, 2008.
[90] Lewis Mitchell, Morgan R Frank, Kameron Decker Harris, Peter Sheridan Dodds, and
Christopher M Danforth. The geography of happiness: Connecting twitter sentiment and
expression, demographics, and objective characteristics of place. 2013.
[91] Elena Musi, Debanjan Ghosh, and Smaranda Muresan. Towards feasible guidelines for the
annotation of argument schemes. ACL 2016, page 82, 2016.
[92] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 807–814, 2010.
[93] Charles E Osgood, George J Suci, and Percy H Tannenbaum. The measurement of meaning. Urbana: University of Illinois Press, 1957.
[94] Hamid Palangi, Li Deng, Yelong Shen, Jianfeng Gao, Xiaodong He, Jianshu Chen, Xinying Song, and Rabab Ward. Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 24(4):694–707, 2016.
[95] Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 115–124. Association for Computational Linguistics, 2005.
[96] Bo Pang and Lillian Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1–2):1–135, 2008.
[97] James W Pennebaker. Writing about emotional experiences as a therapeutic process. Psychological science, 8(3):162–166, 1997.
[98] James W Pennebaker. The secret life of pronouns. New Scientist, 211(2828):42–45, 2011.
[99] James W Pennebaker, Martha E Francis, and Roger J Booth. Linguistic inquiry and word count: LIWC 2001. Mahwah: Lawrence Erlbaum Associates, 2001.
[100] Jeffrey Pennington, Richard Socher, and Christopher D Manning. GloVe: Global vectors for word representation. Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 2014), 12:1532–1543, 2014.
[101] David Martin Powers. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. 2011.
[102] David MW Powers. Applications and explanations of Zipf's law. In Proceedings of the joint conferences on new methods in language processing and computational natural language learning, pages 151–160. Association for Computational Linguistics, 1998.
[103] Kevin L Priddy and Paul E Keller. Artificial neural networks: An introduction, volume 68. SPIE Press, 2005.
[104] Nairan Ramirez-Esparza, Cindy K Chung, Ewa Kacewicz, and James W Pennebaker. The psychology of word use in depression forums in English and in Spanish: Testing two text analytic approaches. In ICWSM, 2008.
[105] Henry S Richardson. Democratic autonomy: Public reasoning about the ends of policy.
Oxford University Press on Demand, 2002.
[106] David E Rumelhart, James L McClelland, PDP Research Group, et al. Parallel distributed
processing, volume 1. IEEE, 1988.
[107] Gerard Salton, Anita Wong, and Chung-Shu Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620, 1975.
[108] Julius Sim and Chris C Wright. The kappa statistic in reliability studies: use, interpretation, and sample size requirements. Physical therapy, 85(3):257–268, 2005.
[109] David A Smith, Jeffrey A Rydberg-Cox, and Gregory R Crane. The Perseus Project: A digital library for the humanities. Literary and Linguistic Computing, 15(1):15–25, 2000.
[110] Mark D Smucker, James Allan, and Ben Carterette. A comparison of statistical significance tests for information retrieval evaluation. In Proceedings of the Sixteenth ACM Conference on Conference on Information and Knowledge Management, CIKM '07, pages 623–632, New York, NY, USA, 2007. ACM.
[111] Richard Socher, Jeffrey Pennington, Eric H Huang, Andrew Y Ng, and Christopher D Manning. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 151–161. Association for Computational Linguistics, 2011.
[112] Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the conference on empirical methods in natural language processing (EMNLP), volume 1631, page 1642. Citeseer, 2013.
[113] Philip Stone, Dexter C Dunphy, Marshall S Smith, and DM Ogilvie. The General Inquirer: A computer approach to content analysis. Journal of Regional Science, 8(1):113–116, 1968.
[114] Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. LSTM neural networks for language modeling. In Thirteenth Annual Conference of the International Speech Communication Association, 2012.
[115] Johan AK Suykens and Joos Vandewalle. Least squares support vector machine classifiers. Neural processing letters, 9(3):293–300, 1999.
[116] Daniel Svozil, Vladimir Kvasnicka, and Jiri Pospichal. Introduction to multi-layer feed-forward neural networks. Chemometrics and intelligent laboratory systems, 39(1):43–62, 1997.
[117] David L Swanson and Paolo Mancini. Politics, media, and modern democracy: An international study of innovations in electoral campaigning and their consequences. Greenwood Publishing Group, 1996.
[118] Kai Sheng Tai, Richard Socher, and Christopher D Manning. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075, 2015.
[119] Yla R Tausczik and James W Pennebaker. The psychological meaning of words: LIWC and computerized text analysis methods. Journal of language and social psychology, 29(1):24–54, 2010.
[120] Andranik Tumasjan, Timm Oliver Sprenger, Philipp G Sandner, and Isabell M Welpe. Predicting elections with Twitter: What 140 characters reveal about political sentiment. ICWSM, 10:178–185, 2010.
[121] Peter D Turney. Thumbs up or thumbs down?: Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 417–424. Association for Computational Linguistics, 2002.
[122] Peter D Turney. Similarity of semantic relations. Computational Linguistics, 32(3):379–416, 2006.
[123] Amos Tversky. Features of similarity. Psychological review, 84(4):327, 1977.
[124] Anthony J Viera and Joanne M Garrett. Understanding interobserver agreement: the kappa statistic. Fam Med, 37(5):360–363, 2005.
[125] David Watson and Lee Anna Clark. The PANAS-X: Manual for the positive and negative affect schedule-expanded form. 1999.
[126] W John Wilbur and Karl Sirotkin. The automatic identification of stop words. Journal of information science, 18(1):45–55, 1992.
[127] Ludwig Wittgenstein. Philosophical investigations. John Wiley & Sons, 1953.
[128] Yi Yang and Jacob Eisenstein. Putting things in context: Community-specific embedding projections for sentiment analysis. Arxiv-Social Media Intelligence, 2015.
[129] Karen K Yuen. The two-sample trimmed t for unequal population variances. Biometrika, 61(1):165–170, 1974.
Abstract
Meaning depends on context. This applies in obvious cases like deictics or sarcasm as well as more subtle situations like framing or persuasion. One key aspect of this is the identity of the participants in an interaction. Our interpretation of an utterance shifts based on a variety of factors including personal history, background knowledge, and our relationship to the source. Nonetheless, the current dominant techniques in computational language understanding generally don't consider contextual factors. Supervised models are based on corpora of fragmented text and labels, isolated from their sources. These techniques are extremely powerful and have proved very useful—for tasks and questions where we can ignore local variation as noise. But as we go beyond that subset of tasks, context becomes increasingly important. Here I explore methods for incorporating contextual information, in particular prior knowledge about speakers and listeners into computational language understanding.