Integrating Annotator Biases into Modeling Subjective Language Classification Tasks
by
Aida Mostafazadeh Davani
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Science)
August 2022
Copyright 2022 Aida Mostafazadeh Davani
Dedication
I dedicate this thesis to Afsaneh and Esmaeil, my mother and father, who supported me in every single
step of this journey with their whole heart; my siblings, Yalda and Sahand, whose physical absence has
been the most formidable challenge of this journey; and my tiny family, Mohamad, Leo, and Noqa, whose
love has been my primary source of light.
Acknowledgements
This thesis is the outcome of years of learning, research, and growth. I owe my greatest appreciation
to every individual mentor, teacher, collaborator, friend, and companion who helped, encouraged, and
endorsed me in this journey.
I owe a debt of gratitude to Morteza Dehghani, my PhD advisor, who believed in my capabilities, helped
me thrive in a new environment, and cheered for me with every accomplishment. I am grateful to my dear
friends and collaborators, Mohammad Atari and Brendan Kennedy, who enriched my journey with their
supportive presence and critical minds.
I thank my collaborators during this journey who were so humble in sharing their knowledge with me,
especially Xiang Ren, who helped me broaden my horizon, gain confidence in my field of studies, and aim
for the highest goals; and Vinodkumar Prabhakaran, who helped me shape my ideas and body of research
into a well-adjusted, established contribution to the field.
Table of Contents
Dedication ii
Acknowledgements iii
List of Tables vii
List of Figures viii
Abstract x
Chapter 1: Introduction 1
Chapter 2: Aggregation in Subjective Annotation Tasks 4
2.1 Background 5
2.2 Impacts of Aggregation 7
2.2.1 Q1: Do aggregated labels represent individual annotators uniformly? 7
2.2.2 Q2: Do aggregated labels represent all social groups uniformly? 9
2.3 Discussion 10
Chapter 3: Hate speech classifiers learn human-like social stereotypes 11
3.1 Background 12
3.2 Study 1 13
3.2.1 Participants 15
3.2.2 Stimuli 15
3.2.3 Explicit Stereotype Measure 15
3.2.4 Hate speech Annotation Task 16
3.2.5 Disagreement 16
3.2.6 Annotators’ Tendency 17
3.2.7 Analysis 18
3.2.8 Results 19
3.3 Study 2 20
3.3.1 Data 21
3.3.2 Quantifying Social Stereotypes 21
3.3.3 Results 22
3.4 Study 3 24
3.4.1 Hate Speech Classifier 26
3.4.2 Quantifying Social Stereotypes 27
3.4.3 Results 27
3.5 Discussion 30
Chapter 4: Disaggregated Multi-annotator Modeling 34
4.1 Background 35
4.1.1 Detecting Online Abuse 35
4.1.2 Detecting Emotions 36
4.1.3 Annotation Disagreement 36
4.1.4 Prediction Uncertainty 38
4.2 Multi-Annotator Method 39
4.2.1 Baseline model using majority labels 40
4.2.2 Ensemble Approach 41
4.2.3 Multi-label Approach 41
4.2.4 Multi-task Approach 42
4.3 Experiments 42
4.3.1 Data 42
4.3.2 Experimental Setup 43
4.3.3 Results on GHC 44
4.3.3.1 Prediction Results 44
4.3.3.2 Modeling Uncertainty 45
4.3.3.3 Computation Time 47
4.3.4 Results on GoEmotions 47
4.4 Analysis 49
4.4.1 Error Analysis 50
4.4.2 Uncertainty vs. Error 52
4.5 Discussion 53
4.5.1 Advantages of Multi-Annotator Modeling 53
4.5.2 Limitations and Challenges 55
Chapter 5: Integrating Annotators’ Psychological Assessments into Modeling Subjective Language Classification Tasks 57
5.1 Background 58
5.1.1 Annotation Disagreement 58
5.1.2 Annotator Modeling 59
5.2 Rationalized Annotator Modeling 60
5.2.1 Problem Formulation 60
5.2.2 Rationalized Annotator Modeling 61
5.2.2.1 Hard rationalization 61
5.2.2.2 Soft rationalization 62
5.3 Experiment 62
5.3.1 Data 62
5.3.1.1 Sentiment 62
5.3.1.2 Bias in Hate Speech 63
5.3.1.3 Moral Foundations Reddit Corpus 63
5.3.2 Results 64
5.4 Rationalizing Annotator Information 65
5.4.1 Psychological Measures 66
5.4.2 Demographic Information 66
5.4.3 Political Conservatism 67
5.5 Discussion 67
Chapter 6: Conclusions 70
Bibliography 72
Appendices 88
A.1 Chapter 3 88
A.1.1 Study 1 88
A.1.1.1 Test and Annotation Items 88
A.1.1.2 Study of All Annotators 94
A.1.1.3 Implicit Bias 95
A.1.2 Study 2 96
List of Tables
2.1 Statistics of three datasets annotated based on their Hate speech [85], Sentiment [44], and Emotions [37] content. 4
4.1 The average and standard deviation of precision, recall, and f-score of model predictions evaluated during 5 iterations of 5-fold stratified cross validation. The Majority Vote section represents models’ performance on predicting the majority vote, while the Individual Labels section reports performance on predicting each raw annotation. 43
4.2 Training time (in minutes); the time it takes to train each model on 80% of the GHC. 47
4.3 The average and standard deviation of model prediction f-score on the GoEmotions dataset, evaluated across 5 iterations using the pre-defined train-test splits in the dataset. 48
4.4 Examples from the GHC for which the baseline predictions differ from the multi-task predictions’ majority vote. (We acknowledge that individual readers may disagree with the annotation labels presented above.) 51
5.1 Annotator information provided in three datasets used for annotator modeling. The datasets vary significantly in their number of annotators, representing different approaches to annotation collection. 62
A.1 Test items in the annotation survey; participants were filtered based on their correct answers to these items. 88
A.2 All annotated items in Study 1: for each social group under study, 7 social media posts mentioning that social group are considered in the study. 94
List of Figures
2.1 The distribution of annotator agreement with the aggregated label for eight tasks under three datasets for Emotions, Sentiment and Hate Speech. The lack of uniformity means that annotator perspectives are not equally captured in the majority labels. 8
2.2 Average and standard deviation of annotator agreement with aggregated labels, calculated for annotators grouped by their socio-demographics under gender, race, and political affiliation. 10
3.1 The overview of Study 1. Novice annotators are asked to label each social media post based on its hate speech content. Then, their annotation behaviors, per social group token, are taken to be the number of posts they labeled as hate speech, their disagreement with other annotators, and their tendency to identify hate speech. 14
3.2 The relationship between the stereotypical competence of social groups and (1) the number of hate labels annotators detected, (2) their tendency to detect hate speech, as quantified by the Rasch model, and (3) their ratio of disagreement with other participants (top to bottom). 18
3.3 The overview of Study 2. We investigate a dataset of social media posts and evaluate the inter-annotator disagreement and majority label for each document in relation to language-encoded stereotypes of mentioned social groups. Contrary to Study 1, stereotypes do not vary at annotator-level. 21
3.4 The effect of mentioned social groups’ stereotype content, as measured based on their semantic similarity to the dictionaries of warmth and competence, on annotators’ disagreement. 23
3.5 The overview of Study 3. During each iteration of model training, the neural network learns to detect hate speech based on a subset of the annotated dataset and a pre-trained language model. The false predictions of the model are then calculated for each social group token mentioned in test items. 25
3.6 Social groups’ higher stereotypical competence and warmth are associated with higher false positive predictions in hate speech detection. 28
3.7 Social groups’ higher stereotypical competence and warmth are associated with higher false negative predictions in hate speech detection. 29
4.1 Comparison between approaches for the multi-annotator model (ensemble, multi-label, and multi-task) and majority label prediction (baseline). Annotation prediction models are trained based on all annotations and apply majority voting to predict the final label. 40
4.2 Correlation of different approaches for estimating prediction uncertainty with annotation disagreement on the GHC. Annotation modeling approaches better correlate with disagreement. 45
4.3 Correlation matrix of approaches for estimating uncertainty. MC dropout and Softmax have high correlation. Our multi-annotator models also have higher internal correlations. 46
4.4 Correlation of different approaches for estimating prediction uncertainty with annotation disagreement for the GoEmotions dataset. 50
4.5 Violin plots denoting distribution across uncertainty for true positive, false positive, false negative, and true negative predictions on the GHC. 52
5.1 The annotator-aware model generates an annotator representation based on their demographic/psychological profile to predict several annotations for each input text. 61
5.2 Hard rationales for predicting Care and Equality annotations (from left to right). The average scores show that while moral foundation concerns of annotators are more applicable for predicting Care, they are of less importance for predicting Equality annotations. 66
5.3 Hard rationales for predicting Loyalty and Purity annotations (from left to right) based on the demographic information of annotators. The average scores are mostly identical for the two foundations. 67
5.4 Hard rationales for predicting moral foundations annotations based on annotators’ political standings on different issues. 68
A.1 Disagreement scores on different subsets of the dataset, based on whether the posts include hate speech and social group tokens (SGT). The horizontal lines demonstrate the error bars. 97
Abstract
Subjective annotation tasks are inherently nuanced due to annotators’ individual differences in under-
standing of language. Training Natural Language Processing (NLP) models for making predictions in
subjective tasks based on human-annotated datasets is also marked by challenges; model decisions are
rarely generalizable to judgements of unseen annotators. Therefore, modeling an acceptable interpreta-
tion of subjective tasks requires integrating psychological dimensions that capture individual differences
in perceiving language for each specific task.
This thesis provides an alternative approach for modeling subjective NLP tasks by tailoring represen-
tations based on annotators’ varying perceptions of language. First, NLP datasets for subjective tasks are
investigated to demonstrate how aggregating annotations into single ground truth labels impacts the rep-
resentation of different perspectives in language resources. Then, the impacts of annotators’ social biases
are explored to capture the sources of human-like biases in annotated datasets and language classifiers.
And lastly, alternative approaches for incorporating annotators’ individual differences into modeling their
annotation behaviors are presented.
In a broad sense, this thesis provides evidence against the propriety of modeling an aggregated label
for subjective language understanding tasks. Demonstrating that this common practice in NLP modeling
leads to encoding normative social biases into language resources and NLP models, this thesis provides
frameworks, and motivates future efforts, for incorporating varying perspectives of language into design-
ing NLP datasets and models.
Chapter 1
Introduction
Natural language processing (NLP) refers to the effort to design artificial intelligence solutions that
reflect humans’ understanding of text and spoken words. The dominant practice for creating modern
NLP systems for decision making is to train models on human-annotated textual data that encode human
judgements. As a result of this process, NLP models are significantly affected by the human annotators’
perceptions of language, and this is especially the case in subjective language understanding tasks [1],
such as detecting emotions, hate speech, and toxicity in text. In such tasks, a single correct answer does
not always exist, and annotators’ disagreements are driven by the variance in their individual biases and
values. The common practice for dealing with inter-annotator disagreements is to create ground truths by
aggregating the collected annotations into a single label. Nevertheless, capturing the meaningful nuances
of inter-annotator disagreements is essential for incorporating different populations’ and annotators’ sub-
jective perceptions of language into understanding and mitigating NLP models’ unintended outcomes,
such as their tendency for reflecting and propagating human-like biases [27].
Previous research has discovered evidence of the effects of annotators’ experiences on their disagree-
ments in specific tasks. For instance, women who are victims of online sexual harassment are more likely
to identify their recent experience as very or extremely upsetting [151]. Also, amateur annotators, compared
to feminist and antiracist activists, are more likely to annotate a non-hateful piece of text as hate speech
[147]. However, the impact of annotators’ psychological and cognitive differences on annotated datasets
and trained models remains largely unknown. For example, if a specific personality trait (e.g., extrover-
sion) is associated with overestimating the ‘happiness’ expressed in language, how biased will an emotion
classifier be at detecting ‘happiness’ when it is trained on the annotations of a group of primarily
extroverted annotators?
This thesis combines social-psychological theories and computational linguistic methods to under-
stand the effects of annotators’ biases, such as their social stereotypes, on NLP models. Based on evidence
extracted from annotated datasets, I argue that annotators bring their diverse perspectives into their an-
notations, and aggregating labels into a single ground truth disregards the individual differences among
annotators. Similarly, models fail to represent minority perspectives, when they are trained on the aggre-
gated annotations. In a broad sense, I first examine the model training process to demonstrate the influence
of annotators’ psychological profiles on the NLP models, and then apply the findings to incorporate anno-
tators’ unique perceptions of language into modeling the NLP tasks.
Chapter 2 analyzes the annotation aggregation process for curating datasets of subjective tasks, and
explores whether the majority vote represents individual annotators and their perspectives evenly [127]. I
demonstrate that annotation aggregation may unfairly disregard the perspectives of certain annotators, and
even certain demographic groups. Specifically, I examine the ratio of agreement between the majority vote
and each annotator in multiple annotated datasets of subjective tasks. The results show that the aggregated
majority vote does not uniformly agree with the perspectives of all annotators and the level of agreement
may vary significantly across different demographic groups that annotators identify with.
To investigate how disregarding annotators’ disagreements affects NLP models, Chapter 3 assesses the
impact of annotators’ biases and differences on the process of curating an annotated dataset and training a
language classification model for a hate speech detection task [31]. Specifically, I ask how annotators’ social
stereotypes bias annotation behaviors, training dataset, and predictions of the trained model. The results
demonstrate that annotators are biased in their hate speech labeling and this bias is due to how individuals
and groups stereotype other social groups. This bias is also reflected in the hate speech dataset and classifier
trained on the majority vote of annotations. These findings motivate the alternative approaches, presented
in Chapters 4 and 5, for modeling annotator-level labels rather than the aggregated ground truths.
Chapter 4 describes an alternative multi-annotator modeling approach for predicting individual anno-
tations rather than the aggregated majority vote and demonstrates its efficacy in modeling several sub-
jective tasks [32]. Designed as a multi-task classifier, the multi-annotator model treats predicting each
annotator’s labels as a separate task and preserves each annotator’s integrity, which potentially affects their
annotation behaviors. Besides performing with an accuracy as good as or better than that of the baseline single
models, the model provides an estimation of the prediction uncertainty that allows model users to diagnose
incorrect labels.
Ultimately, Chapter 5 presents my solution for distinguishing the psychological factors that are re-
quired for modeling annotators’ unique perspectives in specific language understanding tasks. Along with
finding these associations, the framework provides annotator-aware modeling approaches that incorporate
annotators’ important individual differences into the modeling process.
In summary, my work argues against the interchangeability of human annotators by presenting evi-
dence from existing annotated datasets and NLP training pipelines. I introduce and evaluate an alternative
framework for modeling annotations in subjective language understanding tasks that both distinguishes
and incorporates annotators’ individual biases and subjectivity into the modeling procedure.
Chapter 2
Aggregation in Subjective Annotation Tasks
Annotators’ socio-demographic factors, moral values, and lived experiences often influence their inter-
pretations of language, especially in subjective tasks such as identifying political stances [100], sentiment
[44], and online abuse [25, 145, 115]. Recent research has pushed for determining the extent of annotators’
individual differences in interpreting language. For instance, [145] found that feminist and anti-racist
activists systematically disagree with crowd-workers in their hate speech annotations. Similarly, annota-
tors’ political affiliation is shown to correlate with how they annotate the neutrality of political stances
[100]. In such cases, obtaining a single ground truth — often through majority voting — has the potential
adverse effect of sidelining minority perspectives in data, and reinforcing societal disparities and harms.
Here, we seek to answer two questions about majority voting in subjective tasks:
• Q1: Does aggregated data uniformly capture all annotators’ perspectives, when available?
• Q2: Does aggregated data reflect certain demographic groups’ perspectives more so than others?
Dataset #instances #annotators #annotations
Hate speech 27,665 18 86,529
Sentiment 14,071 1,481 59,240
Emotion 58,011 82 211,224
Table 2.1: Statistics of three datasets annotated based on their Hate speech [85], Sentiment [44], and Emo-
tions [37] content.
Our analysis demonstrates that in the annotations for many tasks, the aggregated majority vote does not
uniformly reflect the perspectives of all annotators in the annotator pool. For many tasks in our analysis,
a significant proportion of the annotators had very low agreement scores (0 to 0.4) with the majority vote
label. While certain individual annotators’ labels may have low agreement with the majority label due to
valid/expected reasons (e.g., if they produced noisy labels), we further show that these agreement scores
may vary significantly across different socio-demographic groups that annotators identify with. This find-
ing has important fairness implications, as it demonstrates how the aggregation step can sometimes cause
the final dataset to under-represent certain groups’ perspectives.
Meaningfully addressing such issues in multiply-annotated datasets requires understanding and ac-
counting for systematic disagreements between annotators. However, most annotated datasets only
release the aggregated labels, without any annotator-level information. We argue that dataset develop-
ers should consider including annotator-level labels as well as annotators’ socio-demographic information
(when viable to do so responsibly) when releasing datasets, especially those capturing relatively subjective
tasks. Inclusion of this information will enable more research on how to account for systematic disagree-
ments between annotators in training tasks.
2.1 Background
NLP has a long history of developing techniques to interpret subjective language [155, 1]. While all human
judgments embed some degree of subjectivity, some tasks such as sentiment analysis [96], affect modeling
[2, 97], emotion detection [70], and hate speech detection [153] are agreed upon as relatively more subjec-
tive in nature. As [1] points out, achieving a single real ‘ground truth’ is not possible, nor essential in the
case of such subjective tasks. Instead, we should investigate how to model the subjective interpretations of the
annotators, and how to account for them in application scenarios.
However, the current practice in the NLP community continues to be applying different aggregation
strategies to arrive at a single score or label that makes it amenable to train and evaluate supervised
machine learning models. Oftentimes, datasets are released with only the final scores/labels, essentially
obfuscating important nuances in the task. The information released about the annotations can be at one
of the following four levels of information-richness.
Firstly, the most common approach is one in which multiple annotations obtained for a data instance
are aggregated to derive a single “ground truth” label, and these labels are the only annotations included in
the released dataset (e.g., [55]). The aggregation strategy most commonly used, especially in large datasets,
is majority voting, although smaller datasets sometimes use adjudication by an ‘expert’ (often one of the
study authors themselves) to arrive at a single label (e.g., in [147]) when there are substantial disagree-
ments between annotators. These aggregation approaches rely on the assumption that there always exists
a single correct label, and that either the majority label or the ‘expert’ label is more likely to be that correct
label. What this assumption fails to account for is that in many subjective tasks, e.g., detecting hate speech, the
perceptions of individual annotators may be as valuable as an ‘expert’ perspective.
Secondly, some datasets (e.g., [81, 34]) release the distribution across labels rather than a single ag-
gregated label. In binary classification tasks, this corresponds to the percentage of annotators who chose
one of the labels. In multi-class classification, this may be the distribution across labels obtained for an
instance. While this provides more information than a single aggregated label does (e.g., identifies the
instances with high disagreement), it fails to capture annotator-level systematic differences.
Thirdly, some datasets release annotations made by each individual annotator in an anonymous fash-
ion (e.g., [85, 82]). Such annotator-level labels allow downstream dataset users to investigate and account
for systematic differences between individual annotators’ perspectives on the tasks, although they do not
contain any information about each annotator’s socio-cultural background. Finally, some recent datasets
(e.g., [44]) also release such socio-demographic information about the annotators in addition to annotator-
level labels. This information may include various identity subgroups the annotators self-identify with
(e.g., gender, race, age range, etc.), or survey responses from the annotators that capture their value sys-
tems, lived experiences, or expertise, as they relate to the specific task at hand. Such information, while
tricky to share responsibly, would help enable analysis around representation of marginalized perspectives
in datasets, as we demonstrate in the next section.
2.2 Impacts of Aggregation
In this section, we investigate how the aggregation of multiple annotations into a single label impacts rep-
resentations of individual and group perspectives in the resulting datasets. We analyze annotations for
eight binary classification tasks, across three different datasets: hate-speech [85], sentiment [44], and
emotion [37]. Table 2.1 shows the number of instances, annotators and individual annotations present
in the datasets. For hate-speech and emotion datasets, we use the binary label in the raw annotations,
whereas for the sentiment dataset, we map the 5-point ordinal labels (-2, -1, 0, +1, +2) in the raw data to a
binary distinction denoting whether the text was deemed positive or negative. (We do this mapping for the
purposes of this analysis, where we are focusing on binary tasks; ideally, a more nuanced 5-point labeling
schema would be more useful.)
While the emotion dataset
contains annotations for 28 different emotions, in this work, for brevity, we focused on the annotations for
only the six standard Ekman emotions [47] — anger, disgust, fear, joy, sadness, and surprise. In particular,
we use the raw annotations for these six emotions, rather than the mapping of all 28 emotions onto these
six emotions that [37] use in some of their experiments.
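To make the mapping above concrete, here is a minimal Python sketch (illustrative only, not the original analysis code; the column names are hypothetical, and treating a neutral rating of 0 as non-positive is an assumption, since the text does not specify it):

    import pandas as pd

    # Hypothetical long-format sentiment annotations: one row per (annotator, text).
    sentiment = pd.DataFrame({
        "annotator_id": ["a1", "a1", "a2"],
        "text_id": [1, 2, 1],
        "rating": [-2, 0, 2],  # 5-point ordinal label in {-2, -1, 0, +1, +2}
    })

    # Binary distinction: was the text deemed positive?
    sentiment["label"] = (sentiment["rating"] > 0).astype(int)

    # For the emotion dataset, restrict the 28 emotion columns to the six
    # standard Ekman emotions and use their raw (un-mapped) annotations.
    EKMAN = ["anger", "disgust", "fear", "joy", "sadness", "surprise"]
    # emotions = emotions[["annotator_id", "text_id"] + EKMAN]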
2.2.1 Q1: Do aggregated labels represent individual annotators uniformly?
Figure 2.1: The distribution of annotator agreement with the aggregated label for eight tasks under three
datasets for Emotions, Sentiment and Hate Speech. The lack of uniformity means that annotator perspectives
are not equally captured in the majority labels.
First, we investigate whether the aggregated labels obtained through majority voting provide a more or less
equal representation for all annotator perspectives. For this analysis, we calculate the majority label for
each instance as the label that half or more annotators who annotated that instance agreed on. We then
measure Cohen’s Kappa agreement score for each individual annotator’s labels and the majority labels on
the subset of instances they annotated. While lower agreement scores between some individual annotators
and the majority vote are expected (e.g., if the annotator produced noisy labels, or they misunderstood the
task), the assumption is that the majority label captures the perspective of the ‘average human annotator’
within the annotator pool.
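A rough Python sketch of this computation (illustrative only, assuming a hypothetical long-format table with annotator_id, text_id, and label columns, and breaking ties in favor of the positive class):

    import pandas as pd
    from sklearn.metrics import cohen_kappa_score

    def agreement_with_majority(annotations: pd.DataFrame) -> pd.Series:
        """Cohen's Kappa between each annotator and the majority label,
        computed on the subset of instances that annotator labeled."""
        # Majority label per instance: 1 if half or more of its annotators chose 1.
        majority = annotations.groupby("text_id")["label"].mean().ge(0.5).astype(int)

        scores = {}
        for annotator, subset in annotations.groupby("annotator_id"):
            own = subset.set_index("text_id")["label"]
            scores[annotator] = cohen_kappa_score(own, majority.loc[own.index])
        return pd.Series(scores, name="kappa_with_majority")

Plotting these per-annotator scores as a histogram, separately for each task, gives the kind of distribution summarized in Figure 2.1.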
Figure 2.1 represents the histogram of annotators’ agreement scores with majority votes for all eight
tasks. While the majority vote in some tasks such as joy and sadness (to some extent) does represent most
of the annotator pool more or less uniformly (i.e., the majority vote agrees with most annotators at around
the same rate), in most cases, the majority vote under-represents or outright ignores the perspectives of a
substantial number of annotators. For instance, the majority vote for disgust has very low agreement (κ < 0.3)
with almost one-third (27 out of 82) of the annotator pool. Similarly, the majority vote for sentiment has very
low agreement with around one-third (450+) of its annotator pool.
2.2.2 Q2: Do aggregated labels represent all social groups uniformly?
While the analysis on Q1 reveals that certain annotator perspectives are more likely to be ignored in the
majority vote, it is especially problematic from a fairness perspective if these differences vary across differ-
ent social groups. Here, we investigate whether specific socio-demographic groups and their perspectives
are unevenly disregarded through annotation aggregation. To this end, we analyze the sentiment analysis
dataset [44] since it includes raw annotations as well as annotators’ self-identified socio-demographic in-
formation. Furthermore, as observed in Figure 2.1, a large subset of annotators in this dataset are in low
agreement with the aggregated labels.
We study three demographic attributes, namely race, gender, and political affiliation, and compare the
agreement scores between the aggregated labels and the individual annotators’ labels within each group.
Figure 2.2 shows the average and standard deviation of annotators’ agreement scores with aggregated
labels for each demographic group: race (Asian, Black, and White), gender (Male and Female), and political
affiliation (Conservative, Moderate, and Liberal). We removed social groups with fewer than 50 annotators
from this analysis for lack of sufficient data points; these include other racial groups such as ‘Middle Eastern’
(2 annotators) and ‘Native Hawaiian or Pacific Islander’ (4 annotators), and non-binary gender identity (one annotator).
We perform three one-way ANOVA tests to test whether annotators belonging to different demo-
graphic groups have significantly different agreement scores with the aggregated labels, on average. The
results show significant differences among racial groups (F(2, 2387) = 3.77, p = 0.02); in particular, White
annotators show an average agreement of 0.42 (SD = 0.26), significantly higher (p = 0.03, according to a post-
hoc Tukey test) than Black annotators with an average of 0.37 (SD = 0.27). The differences between average
agreement scores across different political groups are not statistically significant, although moderate anno-
tators on average have higher agreement (0.42) compared to conservative and liberal annotators (0.40 and
0.38, respectively). Similarly, annotation agreements of male and female annotators are not significantly
different.
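Sketched in Python (the data frame and its columns are assumptions about the analysis inputs, not the original code), this group comparison amounts to a one-way ANOVA followed by a post-hoc Tukey test:

    import pandas as pd
    from scipy.stats import f_oneway
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    def compare_agreement_by_group(df: pd.DataFrame, group_col: str = "race"):
        """One-way ANOVA of per-annotator agreement scores across groups,
        followed by pairwise Tukey HSD comparisons.

        `df` is assumed to have one row per annotator with columns
        `kappa_with_majority` and `group_col`."""
        samples = [g["kappa_with_majority"].values for _, g in df.groupby(group_col)]
        f_stat, p_value = f_oneway(*samples)
        tukey = pairwise_tukeyhsd(df["kappa_with_majority"], df[group_col], alpha=0.05)
        return f_stat, p_value, tukey

The same call can be repeated with gender or political affiliation as the grouping column.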
Figure 2.2: Average and standard deviation of annotator agreement with aggregated labels, calculated for
annotators grouped by their socio-demographics under gender, race, and political affiliation.
2.3 Discussion
Building models to predict or measure subjective phenomena based on human annotations should involve
explicit consideration for the unique perspectives each annotator brings forth in their annotations. Annota-
tors are not interchangeable– that is, they draw from their socially-embedded experiences and knowledge
when making annotation judgments. As a result, retaining their perspectives separately in the datasets will
enable dataset users to account for these differences according to their needs. Our two analyses reveal that
annotation aggregation may unfairly disregard perspectives of certain annotators, and sometimes certain
socio-demographic groups. Moreover, releasing the aggregated labels rather than annotator-level labels
prevents alternative practices for dealing with the impact of aggregation on representation.
Chapter 3
Hate speech classifiers learn human-like social stereotypes
In the previous chapter, we found that the aggregated majority votes do not uniformly reflect the
perspectives of all annotators in subjective tasks. Moreover, agreement with the aggregated label may vary
significantly across different socio-demographic groups that annotators identify with. This evidence points
to possible patterns in annotators’ behaviors that lead to a high tendency to disagree with the majority label.
Consequently, we ask whether annotators tend to systematically disagree with each other and if so, what
individual differences of annotators lead to these patterns of disagreement. We further ask if annotators’
biases and disagreements have observable effects on the annotated datasets, and also the NLP models that
are trained on the aggregated annotations.
Here, we focus on annotator biases in terms of their social stereotypes, and assess the effect of this
cognitive bias on annotations of hateful language. We further examine the association between annotators’
biases and erroneous automated classification of texts by hate speech classifiers. Specifically, in Study
1 we investigate the impact of novice annotators’ stereotypes on their hate-speech-annotation behavior.
In Study 2 we examine the effect of language-embedded stereotypes on expert annotators’ aggregated
judgements in a large annotated corpus. Finally, in Study 3 we demonstrate how language-embedded
stereotypes are associated with systematic prediction errors in a neural-network hate speech classifier.
Our results demonstrate that hate speech classifiers learn human-like biases which can further perpetuate
social inequalities when propagated at scale.
3.1 Background
Artificial Intelligence (AI) technologies are prone to acquiring cultural, social, and institutional biases from
the real-world data on which they are trained [102, 103, 111]. AI models trained on biased datasets both
reflect and amplify those biases [28]. For example, the dominant practice in modern Natural Language
Processing (NLP) — which is to train AI systems on large corpora of human-generated text data — leads to
representational biases, such as preferring European American names over African American names [18],
associating words with more negative sentiment with phrases referencing persons with disabilities [77],
making ethnic stereotypes by associating Hispanics with housekeepers, and Asians with professors [58],
and assigning men to computer programming and women to homemaking [15].
Moreover, NLP models are particularly susceptible to amplifying biases when their task involves eval-
uating language generated by or describing a social group [14]. For example, previous research has shown
that toxicity detection models associate documents containing features of African-American English with
higher offensiveness than text without those features [138, 33]. Similarly, [41] demonstrate that models
trained on social media posts are prone to erroneously classifying “I am gay” as hate speech. There-
fore, applying such models for moderating social-media platforms can yield disproportionate removal of
social-media posts generated by or mentioning marginalized groups [33]. This unfair assessment nega-
tively impacts marginalized groups’ representation in online platforms, which leads to disparate impacts
on historically excluded groups [50].
Mitigating biases in hate speech detection, necessary for viable automated content moderation [34,
107], has recently gained momentum [33, 41, 138, 86, 125]. Most current supervised algorithms for hate
speech detection rely on data resources that potentially reflect real-world biases: (1) text representation,
which maps textual data to their numeric representations in a semantic space; and (2) human annotations,
which represent subjective judgements about the hate speech content of the text, constituting the training
dataset. Both (1) and (2) can introduce biases into the final model. First, a classifier may become biased due
to how the mapping of language to numeric representations is affected by stereotypical co-occurrences in
the training data of the language model. For example, a semantic association between phrases referencing
persons with disabilities and words with more negative sentiment in the language model can impact a
classifier’s evaluation of a sentence about disability [77]. Second, individual-level biases of annotators can
impact the classifier in stereotypical directions. For example, a piece of rhetoric about disability can be
analyzed and labeled differently depending upon annotators’ social biases.
Although previous research has documented stereotypes in text representations [58, 15, 101, 144, 20],
the impact of annotators’ biases on training data and models remains largely unknown. Filling this gap in
our understanding of the effect of human annotation on biased NLP models is the focus of this work. As ar-
gued by [13] and [89], a comprehensive evaluation of human-like biases in hate speech classification needs
to be grounded in social psychological theories of prejudice and stereotypes, in addition to how they are
manifested in language. In this paper, we rely on the Stereotype Content Model [SCM; 52] which suggests
that social perceptions and stereotyping form along two dimensions, namely warmth (e.g., trustworthi-
ness, friendliness) and competence (e.g., capability, assertiveness). The SCM’s main tenet is that perceived
warmth and competence underlie group stereotypes. Hence, different social groups can be positioned in
different locations in this two-dimensional space, since much of the variance in stereotypes of groups is
accounted for by these basic social psychological dimensions.
3.2 Study 1
Here, we investigate the effect of individuals’ social stereotypes on their hate speech annotations. Specifi-
cally, we aim to determine whether novice annotators’ stereotypes (perceived warmth and/or competence)
of a mentioned social group lead to higher rates of labeling text as hate speech and higher rates of disagreement
with other annotators.
Figure 3.1: The overview of Study 1. Novice annotators are asked to label each social media post based
on its hate speech content. Then, their annotation behaviors, per social group token, are taken to be the
number of posts they labeled as hate speech, their disagreement with other annotators and their tendency
to identify hate speech.
We conduct a study on a nationally stratified sample (in terms of age, ethnicity, gender, and political
orientation) of US adults. First, we ask participants to rate eight US-relevant social groups on different
stereotypical traits (e.g., friendliness). Then, participants are presented with social media posts mentioning
the social groups and are asked to label the content of each post based on whether it attacks the dignity
of that group. We expect the perceived warmth and/or competence of the social groups to be associated
with participants’ annotation behaviors, namely their rate of labeling text as hate speech and disagreeing
with other annotators.
3.2.1 Participants
To achieve a diverse set of annotations, we recruited a relatively large (N = 1,228) set of participants in a US
sample stratified across participants’ gender, age, ethnicity, and political ideology through Qualtrics Panels.
After filtering participants based on quality-check items (described below), our final sample included 857
American adults (381 male, 476 female) ranging in age from 18 to 70 (M = 46.7, SD = 16.4), about half
Democrats (50.4%) and half Republicans (49.6%), with diverse reported race/ethnicity (67.8% White or
European-American, 17.5% Black or African-American, 17.7% Hispanic or Latino/Latinx, 9.6% Asian or
Asian-American).
3.2.2 Stimuli
To compile a set of stimuli items for this study, we selected posts from the Gab Hate Corpus [GHC; 85],
which includes 27,665 social-media posts collected from the corpus of Gab.com [56], each annotated for
their hate speech content by at least three expert annotators. We collected all posts with high disagreement
among the GHC’s (original) annotators (based on Equation 3.1 for quantifying item disagreement) which
mention at least one social group. We searched for posts mentioning one of the eight most frequently
targeted social groups in the GHC: (1) women; (2) immigrants; (3) Muslims; (4) Jews; (5) communists;
(6) liberals; (7) African-Americans; and (8) homosexual individuals. We selected seven posts per group,
resulting in a set of 56 items in total (the Supplementary Materials includes all items).
3.2.3 Explicit Stereotype Measure
We assessed participants’ warmth and competence stereotypes of the 8 US social groups in our study based
on their perceived traits for a typical member of each group. In other words, we asked participants to rate a
typical member of each social group (e.g., Muslims) based on their “friendliness”, “helpfulness”, “violence”,
and “intelligence”. Following previous studies of perceived stereotypes [30], participants were asked to
rate these traits from low (e.g., “unfriendly”) to high (e.g., “friendly”) using an 8-point semantic differential
scale. We considered the average of the first three traits as the indicator of perceived warmth (Cronbach’s
αs ranged between .90 [women] and .95 [Muslims]) and the fourth item as the perceived competence.
3.2.4 Hate speech Annotation Task
We asked participants to annotate the 56 items based on a short definition of hate speech [85]: “Language
that intends to attack the dignity of a group of people, either through an incitement to violence, encour-
agement of the incitement to violence, or the incitement to hatred.”
Participants could proceed with the study only after they acknowledged understanding the provided
definition of hate speech. We then tested their understanding of the definition by placing three synthetic
“quality-check” items among survey items, two of which included clear and explicit hateful language di-
rectly matching our definition and one item that was simply informational (see Supplementary Materials).
Overall, 371 out of the original 1,228 participants failed to satisfy these conditions and were removed from
the data (the replication of our analyses with all participants yielded similar results, reported in Supplementary Materials).
3.2.5 Disagreement
Throughout this paper, we assess annotation disagreement at different levels:
• Item disagreement, d^{(i)}: For an item i, item disagreement d^{(i)} is the number of annotator pairs that
disagree on the item’s label, divided by the number of all possible annotator pairs.
d^{(i)} = \frac{n^{(i)}_1 \times n^{(i)}_0}{\binom{n^{(i)}_1 + n^{(i)}_0}{2}}   (3.1)
Here, n^{(i)}_1 and n^{(i)}_0 show the number of hate and non-hate labels assigned to i, respectively.
• Participant item-level disagreement, d^{(p,i)}: For each participant p and each item i, we define d^{(p,i)} as
the ratio of participants with whom p agreed, to the size of the set of participants who annotated
the same item (P).
d^{(p,i)} = \frac{|\{p' \mid p' \in P,\; y_{p,i} = y_{p',i}\}|}{|P|}   (3.2)
Here, y_{p,i} is the label that p assigned to i.
• Group-level disagreement, d^{(p,S)}: For a specific set of items S and an annotator p, d^{(p,S)} captures how
much p disagrees with others over items in S. We calculate d^{(p,S)} by averaging d^{(p,i)} over all items
i ∈ S.
d^{(p,S)} = \frac{1}{|S|} \sum_{i \in S} d^{(p,i)}   (3.3)
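The three measures translate directly into code; a small Python sketch (a re-implementation for illustration, not the study's analysis scripts):

    from typing import List

    def item_disagreement(labels: List[int]) -> float:
        """Equation 3.1: disagreeing annotator pairs over all possible pairs."""
        n1 = sum(1 for y in labels if y == 1)
        n0 = len(labels) - n1
        pairs = (n1 + n0) * (n1 + n0 - 1) / 2  # (n1 + n0) choose 2
        return (n1 * n0) / pairs if pairs else 0.0

    def participant_item_disagreement(p_label: int, item_labels: List[int]) -> float:
        """Equation 3.2, as defined above: the share of the item's annotators
        whose label equals participant p's label."""
        return sum(1 for y in item_labels if y == p_label) / len(item_labels)

    def group_level_disagreement(p_labels: List[int],
                                 items_labels: List[List[int]]) -> float:
        """Equation 3.3: average of the participant item-level scores over
        the items in S (p_labels and items_labels are aligned by item)."""
        scores = [participant_item_disagreement(y, labels)
                  for y, labels in zip(p_labels, items_labels)]
        return sum(scores) / len(scores)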
3.2.6 Annotators’ Tendency
To explore participants’ annotation behaviors relative to other participants, we rely on the Rasch model
[131]. The Rasch model is a psychometric method that models participants’ responses — here, annotations
— to items by calculating two sets of parameters, namely the ability of each participant and the difficulty of
each item. To provide an estimation of these two sets of parameters, the Rasch model iteratively fine-tunes
their values to ultimately fit the best probability model to participants’ responses to items. Here, we use
a Rasch model for each subset of items that mention a specific social group, leading to 8 ability scores for
each participant.
It should be noted that while Rasch models consider each response as either correct or incorrect and
generate an ability score for each participant, we assume no “ground truth” for the hate labels. Therefore,
rather than interpreting the participants’ score as their ability in predicting the correct answer, we interpret
the scores as participants’ tendency for predicting a higher number of hate labels. Throughout this study we
use tendency to refer to the ability parameter.
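The study fits the Rasch model with the eRm package in R; purely for intuition, the sketch below shows a joint maximum-likelihood approximation of the same two parameter sets (tendency per participant, difficulty per item) under a logistic response model. It is a hedged stand-in for illustration, not the estimation procedure used here.

    import numpy as np
    from scipy.optimize import minimize

    def fit_rasch_jml(responses: np.ndarray):
        """Joint maximum-likelihood sketch of a Rasch model.

        responses: (n_participants, n_items) binary matrix of hate labels.
        Returns (tendency, difficulty), centered for identifiability."""
        n_p, n_i = responses.shape

        def neg_log_lik(params):
            theta = params[:n_p]      # participant tendency ("ability")
            b = params[n_p:]          # item difficulty
            logits = theta[:, None] - b[None, :]
            # Bernoulli log-likelihood with a stable log(1 + e^x) term.
            return -(responses * logits - np.logaddexp(0.0, logits)).sum()

        result = minimize(neg_log_lik, np.zeros(n_p + n_i), method="L-BFGS-B")
        theta, b = result.x[:n_p], result.x[n_p:]
        shift = theta.mean()
        return theta - shift, b - shift

Fitting one such model per social group's subset of items yields the eight tendency scores per participant described above.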
Figure 3.2: The relationship between the stereotypical competence of social groups and (1) the number of
hate labels annotators detected, (2) their tendency to detect hate speech – as quantified by the Rasch model
–, and (3) their ratio of disagreement with other participants (top to bottom)
3.2.7 Analysis
We estimate associations between participants’ social stereotypes about each social group with their an-
notation behaviors evaluated on items mentioning that social group. Namely, the dependent variables are
(1) the number of hate labels, (2) the tendency (via the Rasch model) to detect hate speech relative to oth-
ers, and (3) the ratio of disagreement with other participants — as quantified by group-level disagreement.
To analyze annotation behaviors concerning each social group, we considered each pair of participant (N
= 857) and social group (n_group = 8) as an observation (n_total = 6,856), which includes the social group’s
perceived warmth and competence based on the participant’s answer to the explicit stereotype measure,
as well as their annotation behaviors on items that mention that social group. Since each observation is
nested in and affected by individual-level and social-group-level variables, we fit cross-classified multi-level
models to analyze the association of annotation behaviors with social stereotypes. Figure 3.1 illustrates our
methodology in conducting Study 1. All analyses were performed in R (3.6.1), and the eRm (1.0.1) package
was used for the Rasch model.
3.2.8 Results
We first investigated the relation between participants’ social stereotypes about each social group and the
number of hate speech labels they assigned to items mentioning that group. The result of a cross-classified
multi-level Poisson model, with the number of hate speech labels as the dependent variable and warmth
and competence as independent variables, shows that a higher number of items are categorized as hate
speech when participants perceive that social group as high on competence (β = 0.03, SE = 0.006, p < .001).
In other words, a one point increase in a participant’s rating of a social group’s competence (on the scale
of 1 to 8) is associated with a 3.0% increase in the number of hate labels they assigned to items mentioning
that social group. Warmth scores were not significantly associated with the number of hate-speech labels
(β = 0.01, SE = 0.006, p = .128).
We then compared annotators’ relative tendency to assign hate speech labels to items mentioning each
social group, calculated by the Rasch models. We conducted a cross-classified multi-level linear model to
predict participants’ tendency as the dependent variable, and each social group’s warmth and competence
as independent variables. The result shows that participants demonstrate a higher tendency (to assign hate
speech labels) on items that mention a social group they perceive as highly competent (β = 0.07, SE = 0.013,
p < .001). However, warmth scores were not significantly associated with participants’ tendency
scores (β = 0.02, SE = 0.014, p = .080).
Finally, we analyzed participants’ group-level disagreement for items that mention each social group.
We use a logistic regression model to predict the disagreement ratio, which is a value between 0 and 1. The
results of a cross-classified multi-level logistic regression, with group-level disagreement ratio as the de-
pendent variable and warmth and competence as independent variables, show that participants disagreed
more on items that mention a social group which they perceive as low on competence (β = −0.29, SE = 0.001,
p < .001). In other words, a one point decrease in a participant’s rating of a social group’s competence
(on the scale of 1 to 8) is associated with a 25.2% increase in their odds of disagreement on items mention-
ing that social group. Warmth scores were not significantly associated with the odds of disagreement
(β = 0.05, SE = 0.050, p = .322).
In summary, as represented in Figure 3.2, the results of Study 1 demonstrate that when novice anno-
tators perceive a social group as high on competence they (1) assign more hate speech labels to, (2) show
higher tendency for identifying hate speech for, and (3) disagree less with other annotators on documents
mentioning those groups. These associations collectively denote that when annotators stereotypically per-
ceive a social group as highly competent, they tend to become more sensitive or alert about hate speech
directed toward that group. These results support the idea that hate speech annotation is affected by
annotators’ stereotypes (specifically the perceived competence) of the target social group.
3.3 Study 2
The high levels of inter-annotator disagreements in hate speech annotation [132] can be attributed to nu-
merous factors, including annotators’ varying perceptions of hateful language, or ambiguities of the
text being annotated [6]. Aggregating these annotations into single ground-truth labels leads to dispro-
portionate representation of individual annotators in annotated datasets [127]. Here, we explore the effect
of social stereotypes on hate speech annotations in a large annotated dataset of social media posts.
Annotated datasets of hate speech rarely report psychological assessments of their annotators [11, 38],
and even if they do, little variance may exist among the few annotators who code an item. Therefore,
rather than relying on annotators’ self-reported social stereotypes, here we analyze stereotypes based on
the semantic representation of social groups in pre-trained language models, which have been shown to
reflect biases from large text corpora [12]. Figure 3.3 illustrates the methodology of Study 2, schematically.
Figure 3.3: The overview of Study 2. We investigate a dataset of social media posts and evaluate the inter-
annotator disagreement and majority label for each document in relation to language-encoded stereotypes
of mentioned social groups. Contrary to Study 1, stereotypes do not vary at annotator-level.
3.3.1 Data
We analyzed the GHC [85, discussed in Study 1] which includes 27,665 social-media posts labeled for
hate speech content by 18 annotators. This dataset includes 91,967 annotations in total, where each post
is annotated by at least three coders. Based on our definition of item disagreement in Equation 3.1, we
computed the inter-annotator disagreement and the majority vote for each of the 27,665 annotated posts
and used them as the dependent variables in our analyses.
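In code, deriving these two dependent variables from a long-format annotation table is straightforward; a minimal sketch (the column names are assumptions about the released format):

    import pandas as pd

    def pairwise_disagreement(labels) -> float:
        """Equation 3.1: disagreeing annotator pairs over all possible pairs."""
        labels = list(labels)
        n1 = sum(labels)
        pairs = len(labels) * (len(labels) - 1) / 2
        return (n1 * (len(labels) - n1)) / pairs if pairs else 0.0

    def per_post_outcomes(annotations: pd.DataFrame) -> pd.DataFrame:
        """Per-post inter-annotator disagreement and majority hate label,
        from rows with columns text_id and label (0 = non-hate, 1 = hate)."""
        return annotations.groupby("text_id")["label"].agg(
            disagreement=pairwise_disagreement,
            majority=lambda s: int(s.mean() >= 0.5),
        )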
3.3.2 Quantifying Social Stereotypes
We analyzed a list of social group tokens suggested by [41]. To quantify social stereotypes directed toward each social group, we calculated the similarity of the semantic representation of that social group term to the dictionaries of competence and warmth developed and validated by [119]. The competence and warmth dictionaries consist of 192 and 184 tokens, respectively, and have been shown to measure linguistic markers of competence and warmth reliably and efficiently in different contexts.
Based on previous approaches for finding associations between words and dictionaries [18, 58], we calculated the similarity of each social group token with the entirety of words in the dictionaries of warmth and competence in a latent vector space. Specifically, for each social group token s and each word w in the dictionaries of warmth (D_w) or competence (D_c), we first obtain their numeric representations (R(s) ∈ R^t and R(w) ∈ R^t, respectively) from pre-trained English word embeddings [GloVe; 118]. The representation function R(·) maps each word to a t-dimensional vector, trained based on word co-occurrences in a corpus of English Wikipedia articles. Then, the warmth and competence scores for each social group token were calculated by averaging the cosine similarity of the numeric representation of the social group token with the numeric representations of the words in the two dictionaries.

\[
W_s = \frac{1}{|D_w|} \sum_{w \in D_w} \cos\big(R(s), R(w)\big) \tag{3.4}
\]

\[
C_s = \frac{1}{|D_c|} \sum_{w \in D_c} \cos\big(R(s), R(w)\big) \tag{3.5}
\]
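The sketch below illustrates Equations 3.4 and 3.5. The GloVe file path and the tiny example dictionaries are placeholders; the actual warmth and competence dictionaries [119] contain 184 and 192 validated tokens.

```python
# A minimal sketch of Equations 3.4 and 3.5: scoring a social group token by its
# average cosine similarity to the words of a warmth or competence dictionary.
import numpy as np

def load_glove(path):
    """Load GloVe vectors from the standard 'word v1 v2 ...' text format."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def dictionary_score(group_token, dictionary, vectors):
    """Average cosine similarity between a group token and all dictionary words."""
    s = vectors[group_token]
    sims = []
    for w in dictionary:
        if w in vectors:
            v = vectors[w]
            sims.append(np.dot(s, v) / (np.linalg.norm(s) * np.linalg.norm(v)))
    return float(np.mean(sims))

vectors = load_glove("glove.6B.300d.txt")           # hypothetical local path
warmth_dict = ["friendly", "warm", "kind"]          # illustrative subset only
competence_dict = ["capable", "skilled", "smart"]   # illustrative subset only

print(dictionary_score("man", competence_dict, vectors))
print(dictionary_score("elder", warmth_dict, vectors))
```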
3.3.3 Results
We examined the effects of the quantified social stereotypes on hate speech annotations captured in the dataset. Specifically, we compared post-level annotation disagreements with the mentioned social group's warmth and competence. For example, based on this method, "man" is the most semantically similar social group token to the dictionary of competence (C_man = 0.22), while "elder" is the social group token with the closest semantic representation to the dictionary of warmth (W_elder = 0.19). Of note, we investigated the effect of these stereotypes on hate speech annotation of social media posts that mention at least one social group token (N_posts = 5535). Since some posts mention more than one social group token, we considered each mentioned social group token as an observation (N_observations = 7550) and conducted a multi-level model, with mentioned social group tokens as the level-1 variable and posts as the level-2 variable. We conducted two logistic regression analyses to assess the impact of (1) the warmth and (2) the competence of the mentioned social group as independent variables, with the inter-annotator disagreement as the dependent variable. The results of the two models demonstrate that both higher warmth (β = −2.62, SE = 0.76, p < .001) and higher competence (β = −5.27, SE = 0.62, p < .001) scores were associated with lower disagreement. Similar multi-level logistic regressions with the majority hate label of the posts as the dependent variable and either social groups' warmth or competence as the independent variable show that competence predicts lower hate (β = −7.77, SE = 3.47, p = .025), but there was no significant relationship between perceived warmth and the hate speech content (β = −3.74, SE = 4.05, p = .355).

Figure 3.4: The effect of mentioned social groups' stereotype content, as measured based on their semantic similarity to the dictionaries of warmth and competence, on annotators' disagreement.
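As a simplified illustration, the sketch below fits a single-level logistic regression with statsmodels. The dissertation's analyses are multi-level (social group tokens nested in posts); this sketch ignores the nesting, treats disagreement as a binary outcome, and uses hypothetical column and file names.

```python
# A simplified, single-level illustration of the logistic regressions in Study 2.
# The multi-level structure used in the dissertation is omitted here.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("ghc_group_mentions.csv")   # hypothetical file: one row per mentioned group token

X = sm.add_constant(df[["warmth"]])          # or ["competence"] for the second model
y = df["disagreement"]                       # 1 if annotators disagreed on the post, else 0

model = sm.Logit(y, X).fit()
print(model.summary())
```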
In this study, we demonstrated that language-encoded dimensions of stereotypes (i.e., warmth and
competence) are associated with annotator disagreement over the document’s hate speech label. As in
Study 1, annotators agreed more on their judgements about social media posts that mention stereotypically
more competent groups. Moreover, we observed higher inter-annotator disagreement on social media
posts that mentioned stereotypically cold social groups (Figure 3.4). While Study 1 demonstrated novice
annotators’ higher tendency for detecting hate speech targeting stereotypically competent groups, we
found a lower likelihood of hate labels for posts that mention stereotypically competent social groups in
this dataset. This discrepancy is potentially due to the fact that while novice annotators are more sensitive
about hate speech directed toward stereotypically competent groups (e.g., Whites), these groups are not
perceived as the main targets of hate speech by expert annotators of the GHC dataset.
3.4 Study 3
Previous research has demonstrated that NLP models, trained on human-annotated datasets, are prone to
patterns of false predictions associated with specific social group tokens [14, 33]. For example, trained
hate speech classifiers may have a higher probability of assigning a hate speech label to a non-hateful post
that mentions the word “gay” but are less likely to mislabel posts that mention the word “straight.” Such
patterns of false predictions are known as prediction bias [68, 41], which impact models’ performance
on input data associated with specific social groups. Previous research has investigated several sources
leading to prediction bias, such as disparate representation of specific social groups in the training data and
language models, or the choice of research design and machine learning algorithm [74]. However, to our
knowledge, no study has evaluated prediction bias with regard to annotators’ social stereotypes. In Study
3, we investigate whether stereotypes about social groups influence hate speech classifiers’ prediction bias
toward those groups.
In Study 1, we examined social stereotypes reported by novice annotators to predict sources of vari-
ance in annotation behaviors. We discovered less disagreement and more annotator sensitivity about hate
speech directed toward competent groups. In Study 2, we investigated the annotated dataset, created by
Figure 3.5: The overview of Study 3. During each iteration of model training the neural network learns to
detect hate speech based on a subset of the annotated dataset and a pre-trained language model. The false
predictions of the model are then calculated for each social group token mentioned in test items.
aggregating expert annotators’ judgements, and discovered high annotator disagreement and lower sen-
sitivity about hate in documents mentioning social groups which are represented as cold and incompetent
in language models. Accordingly, we expect hate speech classifiers, trained on the aggregated annotations,
to be affected by such stereotypes, and perform less accurately and in a biased way on social-media posts
that mention stereotypically cold and incompetent social groups. To detect patterns of false predictions
for specific social groups (prediction bias), we first train neural network models on different subsets of an
expert-annotated corpus of hate speech (GHC; described in Study 1). We then evaluate the frequency of
false predictions provided for each social group and their association with the social groups’ stereotypes.
Figure 3.5 illustrates an overview of the methodology of this study.
3.4.1 Hate Speech Classifier
We designed a hate speech classifier based on a pre-trained language model, Bidirectional Encoder Representations from Transformers [BERT; 40]. Given an input sentence x, pre-trained BERT generates a multi-dimensional numeric representation of the text, g(x) ∈ R^768. This representation vector is considered as the input for a fully connected function h, which applies a softmax function on the linear transformation of g(x):

\[
h(g(x)) = \mathrm{softmax}\big(W_h \cdot g(x) + B_h\big) \tag{3.6}
\]

During training, BERT's internal parameters and additional parameters, specifically W_h ∈ R^{768×2} and B_h ∈ R^{2×1}, which are respectively the weight and bias matrices, are fine-tuned to achieve the best prediction on the GHC. We implemented the classification model using the transformers (v3.1) library of HuggingFace [156] and trained h and g in parallel during six epochs on an NVIDIA GeForce RTX 2080 SUPER GPU using the "Adam" optimizer [88] with a learning rate of 10^−7.
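The sketch below illustrates this fine-tuning setup with HuggingFace transformers. BertForSequenceClassification bundles the pre-trained encoder g(x) with a linear-plus-softmax head as in Equation 3.6; the toy training data is a placeholder for the tokenized GHC split, and this is not the exact training script used in the experiments.

```python
# A minimal sketch of fine-tuning a BERT-based hate speech classifier.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-7)

# Toy stand-in for the GHC training split: (post text, majority label) pairs.
train_data = [("example post one", 0), ("example post two", 1)]

model.train()
for epoch in range(6):
    for text, label in train_data:
        optimizer.zero_grad()
        inputs = tokenizer(text, return_tensors="pt", truncation=True).to(device)
        labels = torch.tensor([label]).to(device)
        loss = model(**inputs, labels=labels).loss   # cross-entropy over the softmax head
        loss.backward()
        optimizer.step()
```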
To analyze the model's performance on documents mentioning each social group token, we trained the model on a subset of the GHC and evaluated its predictions on the rest of the dataset. To account for possible variations in the resulting model, caused by selecting different subsets of the dataset for training, we performed 100 iterations of model training and evaluation. In each iteration, we trained the model on a randomly selected 80% of the dataset (n_train = 22,132) and recorded the model predictions on the remaining 20% of the samples (n_test = 5,533). Then, we explored model predictions (n_prediction = 100 × 5,533) to capture false predictions for instances that mention at least one social group token. By comparing the model prediction with the majority vote for that instance, provided in the GHC, we detected all incorrect predictions. For each social group token, we specifically capture the number of false-negative (hate speech instances which are labeled as non-hateful) and false-positive (non-hateful instances labeled as hate speech) predictions. For each social group token, the false-positive and false-negative ratios are calculated by dividing the number of false predictions by the total number of posts mentioning the social group token.
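A minimal sketch of this per-token error bookkeeping follows; the prediction records are illustrative placeholders, and the denominator follows the description above (all posts mentioning the token).

```python
# Count false-positive and false-negative predictions per social group token,
# then normalize by the number of test posts mentioning that token.
from collections import defaultdict

# each record: (mentioned social group tokens, majority label, model prediction)
predictions = [
    (["gay"], 0, 1),
    (["gay"], 1, 1),
    (["buddhist"], 0, 0),
    (["white", "jew"], 1, 0),
]

counts = defaultdict(lambda: {"fp": 0, "fn": 0, "total": 0})
for tokens, gold, pred in predictions:
    for token in tokens:
        counts[token]["total"] += 1
        counts[token]["fp"] += int(gold == 0 and pred == 1)
        counts[token]["fn"] += int(gold == 1 and pred == 0)

for token, c in counts.items():
    print(token, "FP ratio:", c["fp"] / c["total"], "FN ratio:", c["fn"] / c["total"])
```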
3.4.2 Quantifying Social Stereotypes
We quantified each social group’s stereotype content based on Equations 3.4 and 3.5 from Study 2. Recall
that we calculated the similarity of each social group with dictionaries of warmth and competence, based
on their semantic representations in a latent vector space of English. In each analysis, we considered
either warmth or competence of social groups as the independent variable to predict false-positive and
false-negative predictions as dependent variables.
3.4.3 Results
On average, the trained model achieved an F1 score of 48.22% (SD = 3%) on the test sets over the 100 iterations. Since the GHC includes a varying number of posts mentioning each social group token, the predictions (n_prediction = 553,300) include a varying number of items for each social group token (M = 2,284.66, Mdn = 797.50, SD = 3,269.20). "White", the most frequent social group token, appears in 16,155 of the predictions, and "non-binary" is the least frequent social group token, with only 13 observations. We account for this imbalance by adding the log-transform of the number of test samples for each social group token as an offset to the regression analyses conducted in this section.
The average false-positive ratio of social group tokens was 0.58 (SD = 0.24), with a maximum of 1.00
false-positive ratio for several social groups, including “bisexual”, and the minimum of 0.03 false-positive
ratio for “Buddhist.” In other words, models always predicted incorrect hate speech labels for non-hateful
social-media posts mentioning ‘bisexuals’ while rarely making those mistakes for posts mentioning “Bud-
dhists”. The average false-negative ratio of social group tokens was 0.12 (SD = 0.11), with a maximum of
Figure 3.6: Social groups’ higher stereotypical competence and warmth is associated with higher false
positive predictions in hate speech detection
0.49 false-negative ratio associated with "homosexual" and the minimum of 0.0 false-negative ratio for several social groups, including "Latino." In other words, models predicted incorrect non-hateful labels for social-media posts mentioning "homosexuals" while hardly making those mistakes for posts mentioning "Latino". These statistics are consistent with observations of previous findings [34, 93, 41, 113], which identify false-positive errors as the more critical issue with hate speech classifiers.
We conducted Poisson regressions to assess the number of false-positive and false-negative hate speech predictions for social-media posts that mention each social group. In two Poisson models, false-positive predictions were considered as the dependent variable, and social groups' (1) warmth or (2) competence, calculated from a pre-trained language model (see Study 2), was considered as the independent variable, along with the log-transform of the number of test samples for each social group token as the offset. The same settings were considered in two other Poisson models to assess false-negative predictions as the dependent variable, with either warmth or competence as the independent variable. The results indicate that the number of false-positive predictions is negatively associated with the social groups' language-embedded
Figure 3.7: Social groups’ higher stereotypical competence and warmth is associated with higher false
negative predictions in hate speech detection
warmth (β = −0.09, SE = 0.01, p < .001) and competence scores (β = −0.23, SE = 0.01, p < .001). Therefore, texts that mention social groups that are perceived as cold and incompetent are more likely to be misclassified as containing hate speech. In other words, a one-point increase in the social groups' warmth and competence is, respectively, associated with an 8.4% and 20.3% decrease in the model's false-positive error ratios. Moreover, the number of false-negative predictions is also negatively associated with the social groups' warmth (β = −0.04, SE = 0.01, p < .001) and competence scores (β = −0.10, SE = 0.01, p < .001). Therefore, texts that mention social groups that are perceived as cold and incompetent are more likely to be misclassified as not containing hate speech; a one-point increase in the social groups' warmth is associated with a 3.6% decrease in the model's false-negative error ratio, and a one-point increase in competence is associated with a 9.8% decrease in the model's false-negative error ratio. Figures 3.6 and 3.7 respectively depict the associations of the two stereotype dimensions with the proportions of false-positive and false-negative predictions for social groups.
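The sketch below illustrates such a Poisson regression with an exposure offset, using statsmodels. Column names and the CSV file are hypothetical; the dissertation's exact model specification may differ.

```python
# A minimal sketch of a Poisson regression of per-token false-positive counts on
# warmth (or competence), with log(number of test samples) as an offset.
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("group_error_counts.csv")   # hypothetical file: one row per social group token

X = sm.add_constant(df[["warmth"]])          # swap in ["competence"] for the other model
model = sm.GLM(
    df["false_positives"],                   # count of false-positive predictions
    X,
    family=sm.families.Poisson(),
    offset=np.log(df["n_test_samples"]),     # exposure: number of test predictions for the token
).fit()
print(model.summary())
```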
In summary, this study demonstrates that hate speech classifiers trained on annotated datasets predict
erroneous labels for documents mentioning specific social groups. Particularly, the results indicate that
documents mentioning stereotypically colder and less competent social groups, which lead to higher dis-
agreement among expert annotators based on Study 2, drive higher error rates in hate speech classifiers.
This pattern of high false predictions (both false-positives and false-negatives) for social groups stereo-
typed as cold and incompetent implies that prediction bias in hate speech classifiers is associated with
social stereotypes, and resembles human-like biases that we documented in the previous studies.
3.5 Discussion
Here, we integrate theory-driven and data-driven approaches [152] to investigate human annotators’ so-
cial stereotypes as a source of bias in hate speech datasets and classifiers. In three studies, we combine
social psychological theoretical frameworks and computational linguistic methods to make theory-driven
predictions about hate-speech-annotation behavior and empirically test the sources of bias in hate speech
classifiers. Overall, we find that hate speech annotation behaviors, often assumed to be objective, are
impacted by social stereotypes, and that this in turn adversely influences automated content moderation.
In Study 1, we investigated the association between participants’ self-reported social stereotypes against
8 different social groups, and their annotation behavior on a small subset of social-media posts about those
social groups. Our findings indicate that for novice annotators judging social groups as competent is as-
sociated with a higher tendency toward detecting hate and lower disagreement with other annotators.
We reasoned that novice annotators prioritize protecting the groups they perceive as warm and compe-
tent. These results can be interpreted based on the Behaviors from Intergroup Affect and Stereotypes
framework [BIAS; 30]: groups judged as competent elicit passive facilitation (i.e., obligatory association),
whereas those judged as lacking competence elicit passive harm (i.e., ignoring). Here, novice annotators
might tend to “ignore” social groups judged to be incompetent and not assign “hate speech” labels to in-
flammatory posts attacking these social groups.
However, Study 1’s results may not uncover the pattern of annotation biases in hate speech datasets
labeled by expert annotators who are thoroughly trained for this specific task [115], and have specific ex-
periences that affect their perception of online hate [145]. In addition, expert annotation of hate speech is
a goal-oriented behavior: expert annotators look for cues of derogation and dehumanization carefully to
protect minoritized groups. In these cases, annotators actively evaluate not only label correctness but also
the consequences of their labeling behavior. In Study 2, we examined the role of social group tokens and
language-encoded stereotype content in expert annotators’ disagreements in a large dataset containing
outgroup-derogatory and dehumanizing language. We found that, similar to Study 1, texts that included
groups that are stereotyped to be warm and competent (e.g., Whites) were highly agreed upon by annota-
tors. However, unlike Study 1, we find that posts mentioning groups stereotyped as incompetent — typical
targets of hate speech — are more frequently labeled as hate speech. In simpler words, novice annotators
tend to focus on protecting groups they perceive as competent, but expert annotators tend to focus on
common targets of hate in the corpus.
To empirically demonstrate the effect of annotation bias on supervised models, in Study 3, we evaluated a hate speech classifier's performance on an expert-annotated dataset. We used the count of incorrect predic-
tions to operationalize the classifier’s unintended bias in assessing hate speech toward specific groups [68].
Study 3’s findings suggested that stereotype content of a mentioned social group is significantly associ-
ated with biased classification of hate speech such that more false-positive and false-negative predictions
are generated for documents that mention groups that are stereotyped to be cold and incompetent. These
results demonstrate that biased predictions are more frequent for the same social groups that evoked more
disagreements between annotators in Study 2. Similar to [32], these findings specifically challenge super-
vised learning approaches that only consider the majority vote for training a hate speech classifier and
dispose of the annotation biases reflected in inter-annotator disagreements.
It should be noted that while Study 1 assesses social stereotypes as reported by novice annotators, Studies 2 and 3 rely on a semantic representation of such stereotypes. Since previous work on language representation has shown that semantic representations encode socially embedded biases, in Studies 2 and 3 we referred to the construct under study as normative social stereotypes. In comparing the results among the three studies, we demonstrated that while novice annotators' self-reported social stereotypes impact their annotation behaviors, the annotated datasets and hate speech classifiers tend to be affected by normative social stereotypes, which are encoded in the aggregated data sources.
While our focus in this work is on hate speech annotation, we believe our findings apply in other do-
mains which involve subjective annotations, and annotators’ individual differences should be considered
as a significant source for biases in automated assessment of language. We should note that our work
is limited to English text classifiers and pretrained models, and participants from the US. Given that the
increase in hate speech is not limited to the US, it is important to extend our findings in terms of research
participants and language resources. Future works can investigate the relationship between human stereo-
types and annotation behavior in other languages and cultures. Lastly, hate speech annotation behavior,
like all types of social evaluation, is a complex task since it involves different perceivers, targets, and di-
mensions that vary in priority and their relation to other dimensions. Here, we applied SCM to quantify
social stereotypes, but other novel theoretical frameworks such as the Agent-Beliefs-Communion model
[91] can be applied in the future to uncover other sources of bias.
Our findings suggest that hate speech classifiers trained on human annotations will also acquire par-
ticular social stereotypes toward historically marginalized groups. Our results have two specific and direct
implications: First, supervised learning approaches may benefit from modeling annotation biases, which
are reflected in inter-annotator disagreements, rather than the current practice, which is to treat them as
unexplained noise in human judgement, to be disposed of through annotation aggregation (e.g., major-
ity voting). The second implication of the present work concerns psychology theory development: while
societal contexts have been often considered in social psychological theories, existing theoretical frame-
works were not advanced with the vast societal reach of AI in mind. Our work is an example of how
well-established theories can be applied to explain the novel interactions between algorithms and people.
Large amounts of data that are being constantly recorded in ever-changing socio-technical environments
call for integrating novel technologies and associated problems in the process of theory development [109].
Chapter 4
Disaggregated Multi-annotator Modeling
Obtaining multiple annotator judgements on the same data instances is a common practice in NLP in order
to improve the quality of final labels [141, 110]. In case of disagreements between annotations, they are
often aggregated by majority voting, averaging [136], or adjudicating by an ‘expert’ [147], to derive a single
ground truth or gold label that is later used for training supervised machine learning models. However, as
mentioned in the previous sections, in many subjective tasks there often exists no single “right” answer
[1] and annotators might be influenced by their psychological differences. Therefore, enforcing a single ground truth sacrifices the valuable nuances embedded in annotators' assessments of the language and their meaningful disagreements [7, 21].
To address this, we propose a simple alternative to annotation aggregation when training a machine learning model on human-annotated datasets: a multi-annotator architecture that preserves and models the internal consistency of each annotator's labels as well as their systematic disagreements with other annotators. We show that the multi-task framework [99] provides an efficient way to implement a multi-annotator architecture that captures the differences between individual annotators' perspectives using the subset of data instances they labeled, while also benefiting from the shared underlying layers fine-tuned for the task using the entire dataset. Preserving different annotators' perspectives until the prediction step
provides better flexibility for downstream applications. In particular, we demonstrate that it provides bet-
ter estimates for uncertainty in predictions. This will improve decision making in practice, for instance, to
determine when not to make a prediction or when to recommend a manual review.
4.1 Background
Learning to recognize and interpret subjective language has a long history in NLP [155, 1]. While all
human judgments embed some degree of subjectivity, it is commonly agreed that certain NLP tasks tend to
be more subjective in nature. Examples of such relatively subjective tasks include sentiment analysis [112,
96], affect modeling [2, 97], emotion detection [70, 104], and hate speech detection [153]. [1] argue that
achieving a single real ‘ground truth’ is not possible, nor essential, in subjective tasks, and call for finding
ways to model subjective interpretations of annotators, rather than seeking to reduce the variability in
annotations. While which NLP tasks count as subjective may be contested, we focus on two tasks that are
markedly subjective in nature.
4.1.1 Detecting Online Abuse
NLP-aided approaches to detecting abusive behavior online are an active research area [139, 105, 24]. Re-
searchers have developed typologies of online abuse [146], constructed datasets annotated with different
types of abusive language [153, 129, 150], and built NLP models to detect them efficiently [34, 108]. Re-
searchers have also expanded the focus to more subtle forms of abuse such as condescension and microag-
gressions [16, 83].
However, recent research has demonstrated that these models tend to reflect and propagate various
societal biases, causing disparate harms to marginalized groups. For instance, toxicity prediction models
were shown to have biases towards mentions of certain identity terms [41], specific named entities [126],
and disabilities [77]. Similarly these models are shown to overestimate the prevalence of toxicity in African
American Vernacular English [138, 33, 157]. Most of these studies demonstrate association biases present
in data; for instance, [77] show that discussions about mental illness are often associated with topics such
as gun violence, homelessness, and drugs, likely the reason for the learned association of mental illness
related terms with toxicity. While whether a piece of text is hateful or not depends also on the context
[128], not much work has investigated the human annotator biases present in the training labels, and how
they impact downstream predictions.
4.1.2 Detecting Emotions
Detecting emotions from language has been a significant area of research in NLP for the past two decades
[95, 4, 39, 71, 123]. Annotated datasets used for training emotion detection models vary across domains,
and use different taxonomies of emotions. While several datasets [143, 17] include a small set of labels
representing the six Ekman emotions [47] (anger, disgust, fear, joy, sadness, and surprise), or bipolar dimensions of affect (arousal and valence) [135], others such as [37] and [29] include a wider range of
emotion labels according to the Plutchik emotion wheel [122] or the complex semantic space of emotions
[26]. Perceiving emotions is a subjective task affected by various contextual factors, such as time, speaker,
mood, personality, and culture [106]. Since aggregating annotations of emotion expressions loses such
contextual nuances, some researchers provide a distributional representation of emotions [49, 5]. Here,
we use annotations for the six Ekman emotions present in the dataset released by [37] to demonstrate how
our multi-annotator approach can capture emotions in a dis-aggregated fashion.
4.1.3 Annotation Disagreement
Researchers have studied different sources of annotator disagreements. [92] argued that there are at least
two types of disagreement in content coding: random variation, that comes as an unavoidable by-product
of human coding, and systematic disagreement, that is influenced by features of the data or annotators. [45]
identifies different sources of disagreement as (a) the clarity of an annotation label (i.e., task descriptions),
(b) the ambiguity of the text, and (c) differences in workers. [7] also studied inter-annotator disagree-
ment in association with features of the input, showing that it reflects semantic ambiguity of the training
instances. Textual features have been shown to predict annotators’ disagreement in determining the mean-
ing of ambiguous words [3]. Acknowledging inter-annotator disagreement as an indicator of annotator
differences, [84] clustered crowd-workers based on their annotation behaviors, and proposed a method for
interpreting annotation disagreements and their sources.
For highly subjective tasks such as hate speech and emotion detection, annotation disagreements can
be rooted in the differing subjectivities and value systems of annotators. In these cases, annotators build
a subjective social reality as a basis for social judgments and behaviors [67], which explains their labeling
procedure. For example, in interviews with annotators in an aggression labeling task, [115] found that
expert annotators from communities discussed in gang-related tweets drew on their lived experience to
produce different label judgements compared with graduate student researchers. Such annotators whose
lived experiences bring important perspectives to the task would be dramatically underrepresented on
generic crowd work platforms and, by definition, would be outvoted in disagreements subject to majority
vote. Majority vote also necessarily obfuscates differences among groups underrepresented in annotator
pools, such as older adults who can exhibit views on aging distinct from crowd workers [42], the majority
of whom tend to be younger [133].
Some studies have proposed alternatives to majority voting when aggregating multiple annotations.
In early work, [35] used the EM algorithm to obtain maximum likelihood estimates of the “true” label
to account for annotator errors. [36] used the individual annotation distributions to predict areas of un-
certainty in veridicality assessment. [73] proposed an approach based on item-response model that uses
posterior entropy to choose which annotators are trustworthy. [154] developed a pointwise mutual in-
formation metric to quantify the amount of information in an annotator’s judgment that can be used to
estimate the "correct" label of an instance. [63] explore multiple annotators' judgements to disentangle sta-
ble opinions from noise by estimating intra-annotator consistency. All these approaches aim to obtain the
“correct” label, accounting for erroneous or non-trustworthy annotators, whereas we focus on retaining
the annotator disagreements through the modeling process.
A few studies have explored approaches for utilizing annotation disagreement during model training.
[124] explored applying higher cost for errors made on unanimous annotations to decrease the penalty of
mis-labeling inputs with higher disagreement. Similarly, [120] incorporated annotator disagreement into
the loss function of a structured perceptron model for better predicting part-of-speech tags. Our work also
utilizes annotator disagreements rather than resolving them in the data stage; however, we use a multi-
task architecture using a shared representation to model annotator disagreements, rather than using it
in the loss function. [23] use a multi-task approach to model annotator differences in machine translation
annotations. While they use a Gaussian Process approach, we use the multi-task approach on top of
pre-trained language models [99]. [22] proposed an approach where they model individual annotators
separately in an inner layer to improve the final prediction. In contrast, our method uses the multi-task
architecture, and provides the additional ability to utilize multiple predictions during deployment, for
instance, to measure uncertainty. [53] also leveraged annotator disagreement using a multi-task model
that adds an auxiliary task to predict the soft label distribution over annotator labels, which improves the
performance even in less subjective tasks such as part-of-speech tagging. In contrast, our approach models
several annotators’ labels as multiple tasks and obtains their disagreement.
4.1.4 Prediction Uncertainty
Model uncertainty denotes the confidence of model predictions, which has specific applications in non-
deterministic machine learning tasks. For instance, interpreting model outputs and its confidence is critical
in autonomous vehicle driving, where wrong predictions are costly or harmful [140]. In subjective tasks,
uncertainty embeds additional information that supports result interpretation [62]. For example, the level
of uncertainty could help determine when and how moderators take part in a human-in-the-loop content
moderation [19, 98].
The simplest approach for uncertainty estimation is through prediction probability from a Softmax
distribution [69]. However, as the input data gets farther from the training data, this probability estima-
tion naturally yields extrapolations with unsupported high confidence [57]. Instead, [57] proposed the
Monte Carlo dropout approach to estimate uncertainty by iteratively applying dropouts to all layers of
the model and calculating the variance of generated outputs. Such estimations based on the probability
of a single ground truth label overlook the many factors that contribute to uncertainty [90]. In contrast,
[114] demonstrate the benefits of measuring uncertainty for the ground truth label by fitting a probabilis-
tic model to individual annotators’ observed labels. Similarly, we demonstrate that calculating annotation
disagreement by predicting a set of annotations for the input yields a better estimation of uncertainty than
estimations based on the probability of the majority label.
4.2 Multi-Annotator Method
We define the classification task on an annotated dataset D = (X, A, Y), in which X is a set of text instances, A is the set of annotators, and Y is the annotation matrix, in which each entry y_ij ∈ {0, 1} represents the label assigned to x_i ∈ X by a_j ∈ A. In most annotated datasets, Y includes many missing values, because each annotator only labels a subset of all instances. We use ȳ_{i,·} to refer to the annotations present for item x_i. Similarly, we use ȳ_{·,j} to refer to the annotations made by annotator a_j. The classification task aims to predict maj(ȳ_{i,·}) ∈ {0, 1}, which is the label assigned to x_i based on the majority vote over ȳ_{i,·}. We use majority vote, the most commonly used aggregation method; however, our proposed approach leaves open the choice of the aggregation method.
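The sketch below illustrates this notation with a toy annotation matrix; the values are placeholders only.

```python
# A minimal sketch: an annotation matrix Y with missing entries (NaN), the
# available annotations ȳ_i for an item, and its majority vote.
import numpy as np

# rows = items x_i, columns = annotators a_j; NaN = annotator did not label the item
Y = np.array([
    [1.0, np.nan, 0.0, 1.0],
    [0.0, 0.0, np.nan, np.nan],
])

def available(Y, i):
    """Return ȳ_i: the labels actually provided for item i."""
    row = Y[i]
    return row[~np.isnan(row)]

def majority(Y, i):
    """maj(ȳ_i): 1 if more than half of the available labels are positive."""
    labels = available(Y, i)
    return int(labels.sum() * 2 > len(labels))

print(available(Y, 0), majority(Y, 0))   # [1. 0. 1.] 1
```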
Figure 4.1: Comparison between approaches for multi-annotator model (ensemble, multi-label and multi-
task) and majority label prediction (baseline). Annotation prediction models are trained based on all an-
notations and apply majority voting to predict the final label.
We consider three different multi-annotator architectures: ensemble, multi-label, and multi-task. Figure 4.1 shows the schematic differences between these three variations. All variations use BERT for language representation [40]. For each instance x_i, a generic representation h_i ∈ R^d is generated by the pre-trained BERT-base, and then fine-tuned along with other components of the classifier during training. The size of the representation vector, d, is defined by the BERT configuration and is set to 768 for the pre-trained BERT-base. While our experiments are all performed with BERT-base, our methods are not restricted to BERT in their nature, and can be implemented with other pre-trained language models, e.g., RoBERTa [158].
4.2.1 Baseline model using majority labels
The baseline model is a single-task classifier trained to predict the aggregated label for each instance (i.e., the majority vote, in our case). It is built by adding a fully-connected layer to the BERT-base output h_i; this layer applies a linear transformation followed by a Softmax function to generate the probability of the majority label, P(maj(ȳ_{i,·}) | h_i). Compared to the other models described in this section, the baseline model does not make use of the annotation matrix Y, as it directly predicts the aggregated label maj(ȳ_{i,·}).
4.2.2 Ensemble Approach
An intuitive approach towards multi-annotator models might be to train an ensemble of models, each trained on a different annotator's labels. This approach is not always practical, as it may increase the training time prohibitively. The ensemble approach applies |A| single-task classifiers, one for training and predicting the annotations generated by each annotator. During training, the j-th classifier is independently fine-tuned to predict ȳ_{·,j}, which includes all annotations provided by the j-th annotator. During test time, we aggregate the outputs by the majority vote of all |A| models to predict P(maj(ȳ_{i,·}) | x_i).¹
4.2.3 Multi-label Approach
A more practical approach for multi-annotator modeling is to consider the problem as a multi-label problem where each label denotes an individual annotator's label. More specifically, the multi-label approach attempts to learn to predict |A| labels for each input using a multi-label classification framework. The model first adds a fully-connected layer to transform each h_i to an |A|-dimensional vector, and then applies a Sigmoid function to the j-th dimension to generate y_ij. Since Y includes many missing values, the classification loss is calculated based only on the available labels y_ij ∈ ȳ_{i,·}. However, during test time, all |A| outputs are aggregated to predict P(maj(ȳ_{i,·}) | x_i).
¹ During prediction, multi-annotator models do not have access to the list of annotators who originally provided the labels for each instance. Therefore, the original majority vote is predicted as the majority vote among all annotators.
4.2.4 Multi-task Approach
The multi-task based approach learns multiple annotators' perspectives (labels) as separate classification tasks, all of which share encoder layers that generate the same representation of the input sentence, h_i, while each task has its own separate fully-connected layer and softmax activation. Compared to the multi-label approach, the multi-task model includes a fully-connected layer explicitly fine-tuned for each annotator. However, compared to the ensemble approach, the representation layers which generate h_i are fine-tuned based on the outputs of all annotation tasks. The loss function is computed as the summation of the losses over all available labels ȳ_{i,·} for each instance x_i. During test time, the model considers the outputs of all annotation tasks to predict the majority label P(maj(ȳ_{i,·}) | x_i).
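The sketch below illustrates this architecture: a shared BERT encoder, one small head per annotator, a loss summed only over the labels each annotator actually provided, and a majority vote over the per-annotator predictions. It uses the [CLS] token's final hidden state as h_i and marks missing labels with −1; it is a simplified illustration, not the exact training code used in the experiments.

```python
# A minimal sketch of the multi-task, multi-annotator architecture.
import torch
import torch.nn as nn
from transformers import BertModel

class MultiAnnotatorModel(nn.Module):
    def __init__(self, num_annotators, hidden_size=768):
        super().__init__()
        self.encoder = BertModel.from_pretrained("bert-base-uncased")   # shared layers
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_size, 2) for _ in range(num_annotators)]  # one head per annotator
        )

    def forward(self, input_ids, attention_mask):
        # [CLS] token representation as the shared h_i
        h = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
        # one (batch, 2) logit tensor per annotator head, stacked to (batch, |A|, 2)
        return torch.stack([head(h) for head in self.heads], dim=1)

def multi_task_loss(logits, labels):
    """Sum cross-entropy over available labels only; missing labels are marked -1."""
    loss_fn = nn.CrossEntropyLoss(ignore_index=-1, reduction="sum")
    return loss_fn(logits.reshape(-1, 2), labels.reshape(-1))

def predict_majority(logits):
    """Majority vote over the per-annotator predictions for each instance."""
    votes = logits.argmax(dim=-1)                    # (batch, |A|)
    return (votes.float().mean(dim=1) > 0.5).long()
```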
4.3 Experiments
4.3.1 Data
For this study, we perform experiments on two datasets annotated for subjective tasks: Gab Hate Corpus
[GHC; 85] and GoEmotions dataset [37]. Both datasets capture per-annotator labels for instances along
with corresponding annotators’ anonymous ID, allowing us to model each annotator separately.
We used the GHC [85], which includes |X| = 27,665 social-media posts collected from a public corpus of Gab.com [56], each annotated for whether or not they contain hate speech. [85] define hate speech as language that dehumanizes, attacks human dignity, derogates, incites violence, or supports hateful ideology, such as white supremacy. Each instance is annotated by at least three annotators from a set of 18 annotators. The number of annotations varies for each instance (M(|ȳ_{i,·}|) = 3.13, SD(|ȳ_{i,·}|) = 0.39). The number of annotated instances per annotator also varies significantly (M(|ȳ_{·,j}|) = 4,807.17, SD(|ȳ_{·,j}|) = 3,184.89).
Model         Majority Vote: Precision / Recall / F1          Individual Labels: Precision / Recall / F1
Baseline      49.53 ± 3.8 / 68.78 ± 4.4 / 57.32 ± 1.2         - / - / -
Ensemble      63.98 ± 1.1 / 46.09 ± 1.9 / 53.54 ± 1.0         60.92 ± 0.7 / 60.97 ± 0.8 / 60.94 ± 0.3
Multi-label   66.02 ± 2.2 / 50.16 ± 2.0 / 56.94 ± 1.0         67.22 ± 1.4 / 55.33 ± 2.0 / 60.65 ± 0.7
Multi-task    59.03 ± 0.9 / 59.98 ± 0.6 / 59.49 ± 0.2         63.71 ± 1.3 / 62.76 ± 1.5 / 63.20 ± 0.3

Table 4.1: The average and standard deviation of precision, recall, and F1-score of model predictions, evaluated during 5 iterations of 5-fold stratified cross-validation. The Majority Vote columns report models' performance on predicting the majority vote, while the Individual Labels columns report performance on predicting each raw annotation.
We use a subset of the GoEmotions dataset [37], which contains Reddit posts annotated for 28 emotions, split across pre-defined train (|X|_train = 43,410), test (|X|_test = 5,427), and validation (|X|_val = 5,426) subsets. Our experiments focus on the emotion annotations for the six Ekman [47] emotions — anger, disgust, fear, joy, sadness, and surprise. Each instance in GoEmotions is annotated by three to five annotators from a set of |A| = 82 annotators. The number of annotations varies for each instance (M(|ȳ_{i,·}|) = 3.58, SD(|ȳ_{i,·}|) = 0.91), and in total, there are 194,412 annotations. The number of annotated instances varies significantly across annotators (M(|ȳ_{·,j}|) = 2,370.88, SD(|ȳ_{·,j}|) = 2,180.02).
4.3.2 Experimental Setup
We implemented the classification models using the transformers (v3.1) library from HuggingFace [156] and trained the models for three epochs on an NVIDIA GeForce RTX 2080 SUPER. The training steps employ the Adam optimizer [88]. Our experimental settings are configured similarly to [85], with a learning rate of 1e−7. Since the GHC does not have specific train and test subsets, we conducted 5 iterations of stratified 5-fold cross-validation for evaluation, changing only the random state for each iteration.
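The sketch below illustrates this evaluation protocol; the toy texts and labels stand in for the loaded GHC data.

```python
# A minimal sketch of 5 iterations of stratified 5-fold cross-validation,
# changing only the random state per iteration.
from sklearn.model_selection import StratifiedKFold

texts = [f"post {i}" for i in range(10)]        # placeholder posts
majority_labels = [0, 1] * 5                    # placeholder majority-vote labels

for iteration in range(5):
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=iteration)
    for fold, (train_idx, test_idx) in enumerate(skf.split(texts, majority_labels)):
        # train a model on train_idx and evaluate on test_idx (training omitted here)
        print(f"iteration {iteration}, fold {fold}: {len(train_idx)} train / {len(test_idx)} test")
```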
4.3.3 Results on GHC
4.3.3.1 Prediction Results
Table 4.1 reports the average and standard deviation of the precision, recall, and F1-scores for the various models, across the 5 iterations. The baseline model, which is trained using the majority vote as ground truth, is also tested against the majority vote labels. For the ensemble, multi-label, and multi-task models, we conduct two types of evaluation: first, we test how well the majority vote of predicted labels matches the majority vote of annotations (columns 2-4 in Table 4.1); second, we report how well the individual predicted labels for each instance match the annotations (where available) by annotators (columns 5-7 in Table 4.1).
We observe that the ensemble model performs significantly worse (F1 = 53.54) than the baseline single-task model (F1 = 57.32) in predicting the majority label. This is presumably due to the fact that each base model in the ensemble is trained using only the examples labeled by the corresponding annotator. Since the number of annotations varies significantly for different annotators (see Section 4.3.1), many base models end up with lower performance, resulting in lower overall performance.
Multi-label and multi-task models share most layers across different annotator heads. Thus, each annotator head benefits from the updates to the shared layers owing to all instances, regardless of whether that annotator labeled them or not. The multi-label model performs slightly worse (F1 = 56.94) than the baseline model. In contrast, the multi-task model, which has a fully connected layer fine-tuned for each annotator, posted a significantly higher F-score (F1 = 59.49) than the baseline model. In other words, fine-tuning each annotator head separately and then taking the majority vote performs better than taking the majority vote first and then training on that noisier label.
Moreover, the baseline model yields higher performance variance among different iterations, such that its standard deviations of precision, recall, and F1 exceed those of the other three methods. One possible explanation is that aggregating annotations based on majority votes disposes of information about each
annotator and inserts noise into the labels. In other words, modeling each annotator, and their presumable
internal consistency, could lead to more stable prediction results. However, this hypothesis requires further
investigation.
We now evaluate the individual predictions made by the multi-annotator models (prior to the majority vote) on how well they match individual annotators' labels (Table 4.1). All three multi-annotator approaches obtain higher F1-scores than the baseline model does in predicting majority labels (note that these are different tasks, and not directly comparable). The multi-task model achieved the highest F1-score of 63.20. The result suggests that the multi-task model benefits from fine-tuning annotators separately (thereby avoiding inconsistencies due to majority votes) as well as learning from all instances in a shared fashion.
4.3.3.2 Modeling Uncertainty
Next, we study how well we can model uncertainty in predictions. We compare uncertainty in predictions with annotator disagreement, measured as the variance of the annotations.

\[
\sigma^2(\bar{y}_{i,\cdot}) = \frac{\sum_{j}[y_{ij}=1]\;\sum_{j}[y_{ij}=0]}{|\bar{y}_{i,\cdot}|^{2}} \tag{4.1}
\]
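As a minimal illustration of Equation 4.1: the disagreement is the product of the positive and negative label counts divided by the squared number of annotations, i.e., p(1 − p) for the fraction p of positive labels. The sketch below assumes a binary label list for one item.

```python
# Disagreement per Equation 4.1; the same quantity computed over predicted
# per-annotator labels serves as the model's uncertainty estimate.
def disagreement(labels):
    n = len(labels)
    n_pos = sum(labels)
    return (n_pos * (n - n_pos)) / (n ** 2)

print(disagreement([1, 0, 1]))   # 0.222... (high disagreement)
print(disagreement([0, 0, 0]))   # 0.0     (full agreement)
```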
Since the ensemble, multi-label, and multi-task models all make separate predictions corresponding to each annotator, we can calculate the uncertainty in predictions to be the variance of the predicted annotations for each instance x_i. However, modeling prediction uncertainty in the case of single predictions is an open question. We compare our results with other common approaches for estimating uncertainty in single-task predictions, such as the Softmax probability of the final output for predicting the majority vote [69], and Monte Carlo dropout [57], or MC dropout, which iteratively applies dropouts to all layers of the model and calculates the variance in predictions.

Figure 4.2: Correlation of different approaches for estimating prediction uncertainty with annotation disagreement on the GHC. Annotation modeling approaches better correlate with disagreement.

Figure 4.3: Correlation matrix of approaches for estimating uncertainty. MC dropout and Softmax have high correlation. Our multi-annotator models also have higher internal correlations.
Figure 4.2 shows the correlations of uncertainty estimation using each method with the annotation disagreement calculated as σ²(ȳ_{i,·}). While traditional estimations such as Softmax and MC dropout have a moderate correlation with annotator disagreements, the uncertainty measured by our three multi-annotator methods shows significantly better correlation, with the ensemble method posting a slightly higher correlation than the other two methods. In other words, in addition to performing better on predicting majority votes, multi-annotator models also predict model uncertainty better than traditional approaches.
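For reference, the sketch below illustrates the MC dropout baseline: dropout layers are kept active at inference time, and the variance of the positive-class probability over repeated stochastic forward passes is used as the uncertainty estimate. The `model` and `inputs` variables are assumed to be a fine-tuned baseline classifier and a tokenized batch, as in the earlier training sketch.

```python
# A minimal sketch of MC-dropout-based uncertainty for a single-prediction model.
import torch

def mc_dropout_uncertainty(model, inputs, n_samples=20):
    model.train()                                   # keep dropout active during inference
    probs = []
    with torch.no_grad():
        for _ in range(n_samples):
            logits = model(**inputs).logits
            probs.append(torch.softmax(logits, dim=-1)[:, 1])   # P(hate) per instance
    return torch.stack(probs).var(dim=0)            # variance across stochastic passes
```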
We further analyze the pair-wise correlation between estimations of uncertainty by different approaches
(Figure 4.3). As expected, the Softmax and MC dropout methods are highly correlated, and similarly, our
Models Training Time (in mins)
Baseline 20.5
Ensemble 158.4
Multi-label 22.8
Multi-task 22.3
Table 4.2: Training time (in minutes); the time it takes to train each model on 80% of the GHC.
methods show high correlation among themselves. It is also interesting to note that the uncertainty estimated by our methods also correlates significantly with traditional methods (i.e., between 0.6 and 0.7), except for the multi-task and MC dropout methods, which have a lower correlation of 0.53. The fact that the uncertainty scores for the multi-task and multi-label models are highly correlated with each other (0.86) suggests that they both identify textual features that cause disagreement. We verified this by training a separate model with the same BERT-based setup and a Sigmoid activation to directly predict the annotator disagreement. The uncertainty predicted by this model obtained a similar correlation with the annotator uncertainty (0.47) as the multi-task and multi-label models.
4.3.3.3 Computation Time
We now assess the computation cost associated with the different approaches. Table 4.2 shows the time it
took to train a single cross-validation fold, i.e., 80% of the dataset. As expected, the ensemble approach takes
the longest to train, as it requires training |A| different models (each with varying training set sizes), and the baseline takes the shortest time. Impressively, the multi-label and multi-task models do not take significantly more time to train. In other words, while the multi-task model trains additional layers for annotators, it adds only a marginal computation cost relative to the baseline model.
4.3.4 Results on GoEmotions
In this section, we describe results obtained on the six binary classification tasks performed using the
GoEmotions dataset. Since the multi-task approach obtained better performance overall on GHC, we report
the results on only the multi-task approach here. We start by assessing how well the multi-annotator model matches the single-task performance of predicting the majority label. Table 4.3 reports the average and standard deviation of F1-scores over 5 iterations of training and testing. Unlike GHC, where we used 5-fold cross-validation, for the GoEmotions dataset we use the pre-defined train, validation, and test splits in the dataset. We verified that these splits are stratified w.r.t. annotators. As in the GHC experiments, while the baseline model is trained and tested on the majority vote, the multi-task model is trained on the available annotator-level annotations for each instance, and the predictions from all classifier heads are aggregated to get the final label during testing.
Results obtained on the full dataset are shown in the second and third columns of Table 4.3. While the multi-task model outperformed the baseline in predicting two emotions — joy and sadness, it underperformed the baseline for the other four emotions, although the ranges of F1-scores largely overlap. It is also observed that the standard deviations of the multi-task model's F1-scores are significantly larger than what was observed for GHC.
On further inspection, we found that many annotators contributed very few annotations in the dataset. For instance, 29 annotators had fewer than 1000 annotations in the training set, six of them having fewer than 100. In addition, the label distribution is extremely skewed for all six emotions — ranging from 1.6% positive labels for fear on average across all annotators, to 4.0% positive labels on average for joy. Consequently, many annotator heads have too few positive instances to learn from; some had zero positive
Full Dataset (|A| = 82) Subset (|A| = 53)
Emotion Baseline Multi-task Baseline Multi-task
Anger 40.38± 4.4 39.01± 6.4 41.95± 6.1 42.75± 4.4
Disgust 38.79± 3.9 38.31± 1.9 37.72± 2.0 35.77± 2.0
Fear 58.96± 5.0 54.97± 6.1 57.68± 3.7 58.58± 2.3
Joy 47.80± 2.2 49.53± 3.6 47.45± 3.1 46.26± 1.2
Sadness 49.22± 5.2 50.36± 3.2 47.55± 5.4 48.00± 3.4
Surprise 40.96± 2.9 38.97± 3.6 39.44± 5.7 40.22± 2.2
Table 4.3: The average and standard deviation of model prediction f-score on the GoEmotions dataset,
evaluated across 5 iterations using the pre-defined train-test splits in the dataset.
instances in the training set. This makes the corresponding learning tasks in the multi-task setting hard or
even impossible on this dataset, and might explain the lower performance and higher variance in F1-scores.
In order to make a fairer comparison, we performed our experiments on a subset of the dataset which only includes the annotations by the 53 annotators who had more than 1000 annotations. Results obtained on this subset are in the fourth and fifth columns of Table 4.3. Our multi-annotator model outperforms the baseline model on predicting the majority label in four of the six tasks — anger, fear, sadness, and surprise, while obtaining slightly lower results on disgust and joy. While the F1-score ranges of the baseline and multi-task models still largely overlap, the multi-task model fares significantly better when there are enough instances for each annotator head to learn from. The multi-task model also reported lower standard deviation in performance than the baseline model, suggesting better robustness in the learned model.
The main advantage of our multi-annotator model is the ability to capture multiple perspectives effi-
ciently. In that respect, our model fared better at modeling annotator uncertainty across the board. As shown in Figure 4.4, our multi-annotator model obtained better correlation overall with annotator disagreement than the Softmax and MC dropout approaches across all six emotions, both in the full dataset as well as the subset (joy in the full dataset being the only exception). This further demonstrates the strength of our approach, which does not come at any significant cost of performance or efficiency (training the multi-task model on the full dataset takes 6.1 minutes per epoch, compared to 5.2 minutes for the baseline model).
4.4 Analysis
In this section, we further analyze the multi-task model and its outputs, as it posted the overall best perfor-
mance among the three approaches, considering the predictive performance, uncertainty modeling corre-
lation, and time efficiency. We focus on the GHC model for this analysis.
Figure 4.4: Correlation of different approaches for estimating prediction uncertainty with annotation disagreement for the GoEmotions dataset.
4.4.1 Error Analysis
We first qualitatively analyze the mismatches between the multi-task and baseline models on their majority vote predictions. Among all GHC instances (|X| = 27,665), the multi-task and baseline models disagreed on 1,945 labels. Table 4.4 shows some examples of such instances along with the corresponding majority vote and the percentage of annotators who labeled them as hate speech. Table 4.4 also provides the multi-task model's majority label and the percentage of prediction heads labeling them as hate speech (columns 4-5), as well as the baseline model's prediction (column 6).
The most common type of mismatch (57.94% of mismatches) occurs when an instance deemed non-
hateful (by majority vote of annotations) is correctly labeled by the multi-task model but incorrectly labeled
by the baseline (first set of rows in Table 4.4). In other words, these samples represent the baseline model’s
Annotation Multi-task Baseline
Text Label %Hate Label %Hate Label
Agreed. Traitors deserve to be shot. PERIOD. Not-Hate 0.33 Not-Hate 0.39 Hate
Are the democrats guilty of treason? Not-Hate 0.25 Not-Hate 0.39 Hate
I don’t know, you tell me ya prick. Not-Hate 0.33 Not-Hate 0.39 Hate
They should force the ‘refugees’ to wear this crap. Hate 0.67 Not-Hate 0.33 Hate
At this point, nobody gives a shit! Hate 0.67 Not-Hate 0.44 Hate
It’s OK to be a Nazi!! Hate 1.00 Not-Hate 0.44 Hate
Arise WHITE PEOPLE! We’re back. Not-Hate 0.33 Hate 0.56 Not-Hate
Shocking! A jew defending another jew! Who’d have thought? Not-Hate 0.33 Hate 0.56 Not-Hate
I am beginning to think being straight is misconduct. Not-Hate 0.0 Hate 0.56 Not-Hate
Armenia is a nation of mongrel bastards. Hate 1.0 Hate 0.78 Not-Hate
Hope they both get AIDS. Hate 1.0 Hate 0.72 Not-Hate
I am so NOT afraid of you gay boy. Hate 0.67 Hate 0.83 Not-Hate
Table 4.4: Examples from the GHC for which the baseline predictions differ from the multi-task predictions' majority vote. (We acknowledge that individual readers may disagree with the annotation labels presented above.)
false-positive predictions, most of which include specific tokens, such as slur words and social group to-
kens. The next most common type of model mismatch (22.31% of mismatches) occurred when an instance
that was deemed hateful (by majority vote) is mislabeled by the multi-task model and labeled correctly by
the baseline model. In general, these two types of mismatches correspond to the positive predictions of the
baseline model. A possible explanation for the frequency of such mismatches is the high rate of positive
predictions by the baseline model, which is also supported by the higher recall and lower precision scores
of the baseline model (Table 4.1).
The other two types of mismatches occurred when the baseline and multi-task models predicted non-hateful and hateful labels, respectively. When this mismatch is over an instance deemed non-hateful by the majority vote of annotations (12.19% of mismatches), the multi-task model is making a false-positive error, and we observe mentions of social group names in the text. A large number of such instances had an even split (54% - 44%) between labels across individual predictions (see Table 4.4), suggesting the model was unsure. The least common type of disagreement is over instances deemed hateful by both the majority vote of annotations and our multi-task model, but mis-classified by the baseline model (7.56% of mismatches).
Figure 4.5: Violin plots denoting distribution across uncertainty for true positive, false positive, false neg-
ative, and true negative predictions on GHC.
4.4.2 Uncertainty vs. Error
Now, we investigate whether the uncertainty in predictions is correlated with whether the multi-task
model was able to correctly predict the majority label. Note that the value of uncertainty, based on Equation
4.1, falls between 0 and 0.25. We observe that the mean value for uncertainty in correct predictions was
0.049 compared to 0.170 when the model was incorrect. Figure 4.5a shows the corresponding violin plots.
While most incorrect predictions had high uncertainty, a small but significant number of errors were made
with certainty.
Separating this analysis across true positives, false positives, false negatives, and true negatives rep-
resents a more informative picture. For instance, the model is almost always certain about true neg-
atives (M(uncertainty) = 0.040). Similarly, the model is almost always uncertain about false positives
(M(uncertainty) = 0.199), something we also observed in the error analysis presented in Section 4.4.1. On
the other hand, both true positives and false negatives have a bi-modal distribution of uncertainty, with
similar mean uncertainty values of 0.140 and 0.141, respectively. In sum, a negative prediction with high
uncertainty is more likely to be a false negative, in our case.
4.5 Discussion
We presented multi-annotator approaches that predict individual labels corresponding with each annota-
tor of a subjective task, as an alternative to the more common practice of deriving (and predicting) a single
“ground-truth” label, such as the majority vote or average of multiple annotations. We demonstrate that
our method based on a multi-task architecture obtains better performance for modeling each annotator (63.2
F1-score, micro-averaged across annotators in GHC), and even when aggregating annotators’ predictions,
our approach matches or outperforms the baseline across seven tasks. Our study focuses on majority vote
as the baseline aggregation approach to demonstrate how this commonly used approach loses meaning-
ful information. Other aggregation strategies such as MACE [73] and Bayesian methods [116] could be
explored in future work as complementary approaches that can work with the multi-annotator framework.
4.5.1 Advantages of Multi-Annotator Modeling
One core advantage of our method, which can further be leveraged in practice, is its ability to provide
multiple predictions for each instance. As demonstrated in Figures 4.2 and 4.4, the multiple predictions
can be used to derive an uncertainty estimate that better matches the disagreement between annotators. The
estimated uncertainty could be used to determine when not to make a prediction or to route the example
to a manual content moderation queue as it may be an example that annotators likely disagreed on. One
could also investigate how to learn an uncertainty threshold to make cleverer predictions. For instance,
based on our analysis in Section 4.4, a negative prediction with high uncertainty is very likely to be a false negative.
One could use this knowledge in a deployment scenario and predict a positive label in case of a negative
majority prediction with high uncertainty.
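A minimal sketch of this deployment heuristic follows; the threshold value and the function name are hypothetical, not tuned values from our experiments.

def final_label(majority_prediction, uncertainty, flip_threshold=0.2):
    # Flip a negative majority prediction to positive when the annotator heads
    # are highly uncertain, since such cases tend to be false negatives.
    # The threshold is illustrative and would be learned on a validation set.
    if majority_prediction == 0 and uncertainty >= flip_threshold:
        return 1
    return majority_prediction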
Predicting multiple annotations rather than a ground truth is specifically essential in subjective tasks.
As [2] argues, in many subjective tasks, the aim is not to find an accurate answer; instead, a model can pro-
duce the most acceptable answer based on responses from different judgements. Accordingly, our method
contrasts with approaches for enhancing ground-truth generation prior to modeling. Our approach aims to
preserve annotators’ consistency in labeling by delaying the annotation aggregation until the final stage.
As a final step, if required, application-driven approaches can be employed to find the most appropriate
answer. For instance, an aggregation approach based on MACE [73, 116] could be applied to the predicted
individual labels to find a final label that considers the trustworthiness of individual annotators.
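MACE itself estimates annotator competence with an unsupervised generative model [73]; purely as an illustration of the general idea of aggregating the predicted individual labels with per-annotator trust weights, a simplified stand-in could look like the following sketch (the trust scores are assumed to come from such a model and are not MACE output).

import numpy as np

def trust_weighted_vote(predicted_labels, trust):
    # Aggregate the predicted per-annotator labels into one final label,
    # weighting each annotator by an externally estimated trust score.
    predicted_labels = np.asarray(predicted_labels, dtype=float)
    trust = np.asarray(trust, dtype=float)
    score = np.dot(trust, predicted_labels) / trust.sum()
    return int(score >= 0.5)

# Three predicted annotator labels; the middle annotator is trusted least.
print(trust_weighted_vote([1, 0, 1], trust=[0.9, 0.3, 0.8]))  # -> 1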
Researchers have pointed out that in more objective tasks, such as commonsense knowledge or word
sense disambiguation, training a model on the judgements of a specific set of annotators lacks generalizability
to annotations generated by new annotators [61]. However, in subjective tasks such as affect and online
abuse detection, different annotator perspectives, and their contrasts can be useful [63].
Another advantage of having multiple prediction heads in a multi-task architecture is that we could
adapt the same model to different value systems. For instance, in cases where annotators with different
moral beliefs systematically produce different labels [145, 42, 115], one could use the multi-task approach to
have a single global model that can adjust predictions to be conditioned on different value systems. This is
valuable for international media platforms to build and deploy global models that attend to local cultures
and values without retraining entirely separate models for each culture.
Multi-annotator modeling can also be applied in scenarios that may benefit from obtaining several
perspectives for a single instance. For example, in detecting affect in language, a range of subjective hu-
man knowledge, interpretation, and experience can be modeled through a multi-annotator architecture.
This approach would generate a range of affective states either along affect categories, such as anger and
happiness, or dimensions, such as arousal and pleasantness [1, 2], which correspond with different sub-
jective perceptions of the text. Another example is sarcasm detection, where an ambiguous sarcastic text
is labeled differently according to annotators’ thresholds for sarcasm [130]. In a multi-annotator setting,
the internal consistency of each annotator’s threshold for sarcasm may be preserved in the training process.
4.5.2 Limitations and Challenges
Our approach is not without limitations. Our experiments were computationally viable because of the rel-
atively small number of annotators in our annotator pool (18 for GHC and 82 for the GoEmotions dataset),
which is not usually the case with large crowd-sourced datasets. For instance, the dataset by [42] has
over 1.4K individual annotators, and [82] built a dataset with over 8K annotators. Fine-tuning that many
separate annotator heads will be computationally expensive and may not be a viable option. However,
clustering annotators based on their agreements and aggregating annotator labels into cluster labels could
address this issue. In that scenario, the multi-task model would include separate classifier heads for each
cluster of annotators. The number of clusters could be determined based on availability of computational
resources and data factors to enhance the multi-task approach. This is an important direction of research
for future work.
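The clustering idea could be sketched as below, assuming access to an annotator-by-item label matrix with missing entries; grouping annotators by pairwise agreement and clustering with k-means is one possible choice, not the exact procedure of our experiments.

import numpy as np
from sklearn.cluster import KMeans

def cluster_annotators(labels, n_clusters=10):
    # labels: (n_annotators, n_items) array with entries in {0, 1} and np.nan
    # where an annotator did not label an item. Returns a cluster id per
    # annotator, so the multi-task model can use one head per cluster.
    n = labels.shape[0]
    agreement = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            shared = ~np.isnan(labels[i]) & ~np.isnan(labels[j])
            if shared.any():
                agreement[i, j] = agreement[j, i] = (
                    labels[i, shared] == labels[j, shared]
                ).mean()
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(agreement)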
The proposed approach, along with other methods for incorporating individual annotators and their
disagreements, is only viable when annotated datasets include annotator-level labels for each instance.
However, most multiply annotated datasets contain only per-instance majority labels [147, 81], or aggre-
gate percentages [34, 82]. Even in cases where the raw annotations were released, the multi-annotator
model requires enough annotations from each annotator to model them effectively. However,
we observed that the dataset designers may not have envisioned such a utility of annotator-level labels
for downstream analysis. For instance, in the GoEmotions dataset, many annotators labeled fewer than
1000 instances, making annotator-level modeling difficult. Moreover, the high cost of gathering a large
number of annotations per annotator in crowdsourcing platforms may limit the data collection and call
for post-hoc modeling solutions. One way to tackle this issue is by choosing a subset of top-performing
annotator heads (during the validation step) for the final prediction. Future work should look into such
post-processing steps that could further improve the performance.
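For instance, the head-selection step could be as simple as the following sketch, assuming validation F1 scores have already been computed per annotator head; the names and the choice of k are illustrative.

def select_heads(val_f1_per_head, k=10):
    # Keep the k annotator heads with the highest validation F1; the final
    # label is then aggregated only over this subset of heads.
    ranked = sorted(val_f1_per_head, key=val_f1_per_head.get, reverse=True)
    return ranked[:k]

# e.g., select_heads({"annotator_3": 0.71, "annotator_9": 0.64, "annotator_1": 0.52}, k=2)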
The main challenge in enabling further exploration of open questions about annotator disagreements and
efficient ways to model them is the lack of annotator-level labels. This largely stems from the
practice of considering crowd annotators as interchangeable, and not accounting for the differences in
their perspectives. We recommend that data providers consider releasing individual annotation labels, when
feasible to do so, in an anonymized way and with appropriate consent. We also encourage researchers to
design data collection efforts in a way that includes a sufficient number of annotations by each annotator,
so that systematic differences in their annotation behaviors could be better understood and accounted for.
Chapter 5
Integrating Annotators' Psychological Assessments into Modeling Subjective Language Classification Tasks
Annotators can disagree significantly on the labels they assign to text [8, 145, 137]; therefore, language
annotation efforts often report high ratios of disagreements within the labels due to annotators’ subjective
biases [46]. [38] argue that annotator subjectivity matters to a different extent for different tasks and that includ-
ing the uncertainty or inter-annotator disagreement on each instance can act as a signal in the dataset.
However, the systematic variation in annotators’ judgements [148] is commonly reduced through ma-
jority voting, averaging [136], adjudication [147, 76] or other aggregation strategies for creating single
ground-truth labels, resulting in under-representation of minority perspectives [127].
Alternatively, researchers have proposed new methods for modeling nuances in annotations, rather
than aggregating them into ground-truth labels. [73] introduce a method for incorporating annotators'
reliability into generating the aggregated label, rather than following the common practice of treating annotators
as interchangeable [78]. [32] introduce a multi-annotator model to predict several annotations for each
instance rather than aggregating the labels prior to training the model. While these methods acknowledge
annotators’ disagreements as a source of information and incorporate annotator-level labels into the mod-
eling process, they do not integrate annotators’ background information into modeling their judgements.
In this chapter, we argue that, depending on the task, specific background information about annotators
can indicate the sources of disagreement, explain annotation behaviors, and improve overall prediction
results. We use this information to train classifier models that are aware of annotators' psychological
differences. The proposed model processes the input language along with annotators' profiles to predict
each individual annotator's judgement about the text rather than a single label.
Besides integrating the annotator profiles into the classifier model and predicting a distribution of
labels rather than a single general answer, our approach introduces a framework for distinguishing the
annotator features that are most important for predicting annotators’ varying perspectives. Based on
these results, we select a stratified set of annotators to capture a balanced representation of perspectives in
the annotator pool and enhance the modeling process of subjective tasks.
5.1 Background
Crowdsourcing approaches for creating annotated datasets often aim to collect several annotations for each
instance in order to increase the reliability of the final label and prevent noisy annotations from impacting
the results. The most efficient way to do so is by recruiting a large number of annotators, each responsible
for labeling a small portion of the corpus. However, recent analyses have shown that treating annotators
as inter-changeable is not the preferred approach for dealing with subjective language understanding tasks
[117, 38, 127, 31]. Modeling the nuances encoded in annotations and inter-annotator disagreements has
recently been explored as an alternative solution for subjective tasks.
5.1.1 Annotation Disagreement
When datasets include a set of annotations per instance, the distribution of these labels, and the disagree-
ment extracted from the set, become two possible pieces of information that potentially help the modeling
process. [10] argues that disagreement — even on objective tasks — should be considered as a source of
information rather than being resolved. To operationalize the subjectivity in creating annotated datasets,
[134] proposes a descriptive annotation paradigm for surveying and modelling different beliefs.
Others have incorporated the inter-annotator agreements for each item to weight the items’ effect
on the loss value and achieved improvements on the downstream tasks [121]. [54] leveraged annotator
disagreement using a multi-task model that adds an auxiliary task to predict the soft label distribution
over annotator labels, which improves the performance even in less subjective tasks such as part-of-speech
tagging.
[87] applies an item response theory model to the variations in annotations of hate speech to decompose
the binary hate speech labels into a continuous, infinitely divisible spectrum of sentiment ranging from
extremely negative to extremely positive, and uses the resulting scores in a multi-task model for predicting the latent
variables. While these methods are driven by the intuition of considering the variation in annotators'
perspectives, they still fall short of preserving the integrity of the labels provided by each annotator.
5.1.2 Annotator Modeling
Acknowledging the difference in annotators' perceptions of subjective tasks has led a number of pre-
vious model designers to incorporate information at the annotator level as the social factors needed for
contextualizing language [75] in modeling subjective tasks. [72] shows that providing the age or gender
information of the authors of text to a classifier consistently and significantly improves the performance
over demographic-agnostic models. [59] model users’ responses to particular questionnaire items based
on their demographic information by training a demographics embedding layer, which can further be used
in isolation to generate embeddings for any unseen set of demographic information.
[51] add annotators' sentiment about the writer of the text as a value in {−1, 0, 1} (representing
negative, neutral, or positive bias) to model their labels. Their experiments show that modeling a subjective
task with information about the context and the annotators increases the performance of the model. [32]
introduce a multi-annotator architecture that models each annotator's perspective separately using a
multi-task approach. While these methods model annotations based on annotators' differences, they do
not incorporate the psychological profiles of annotators into modeling their behaviors.
5.2 Rationalized Annotator Modeling
In this chapter, we propose a framework for modeling annotations in subjective tasks by incorporating an-
notators' psychological profiles. In the simplest scenario (annotator-aware), the model uses all provided
annotator information; in the rationalized scenario, the model selects a rationale, which indicates what
annotator information is to be used and to what extent that information is helpful for modeling the annota-
tions. In other words, for predicting each annotation, the rationale is the subset of annotator information
that is selected by the model to explain the final prediction. The rationale can therefore be interpreted
as the annotator information that has affected the model's prediction of each annotation. Creating ratio-
nales during model training is done by extracting hard or soft importance values for annotator features,
discussed in this section.
5.2.1 Problem Formulation
We formalize the annotation modeling question in a scenario with N instances labeled by A annotators.
The model uses the input text x_j (0 ≤ j < N) and an annotator vector a_i ∈ R^d (0 ≤ i < A), where each
of the d numeric values represents a one-hot encoding of demographic/psychological information about the
i-th annotator. The predicted value is the annotation assigned to the j-th instance by the i-th annotator,
y_ij ∈ {0, 1}. (In this chapter we exclusively deal with binary annotations; however, the approach is not restricted to binary output values.)
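A minimal sketch of the annotator-aware architecture described above and in Figure 5.1 is shown below; it assumes a text encoder that returns a fixed-size representation and appends the annotator vector directly to it, and the layer sizes are illustrative rather than the exact configuration used in our experiments.

import torch
import torch.nn as nn

class AnnotatorAwareClassifier(nn.Module):
    # Predicts y_ij from the representation of text x_j concatenated with the
    # (optionally masked) annotator vector a_i.
    def __init__(self, text_encoder, text_dim, annotator_dim, hidden_dim=128):
        super().__init__()
        self.text_encoder = text_encoder  # e.g., a frozen pretrained encoder
        self.classifier = nn.Sequential(
            nn.Linear(text_dim + annotator_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, text_inputs, annotator_vec):
        h_text = self.text_encoder(text_inputs)          # (batch, text_dim)
        h = torch.cat([h_text, annotator_vec], dim=-1)   # append a_i
        return torch.sigmoid(self.classifier(h)).squeeze(-1)  # P(y_ij = 1)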
Figure 5.1: The annotator-aware model generates an annotator representation based on their demo-
graphic/psychological profile to predict several annotations for each input text.
5.2.2 Rationalized Annotator Modeling
In the rationalized version, the model generates a binary mask z_i ∈ {0, 1}^d, of the same length as the
annotator vector, to indicate which annotator information should be used for predicting the label. After
applying the mask z_i to the annotator vector, a_i × z_i is appended to the text representation (same as the
original Annotator-Aware scenario) to predict the label. We compare the following two approaches for
creating z_i, which, inspired by [80], we refer to as hard and soft rationalization:
5.2.2.1 Hard rationalization
This method is an unsupervised approach, introduced by [94], that generates the mask vector z_i for a
given annotator vector a_i directly from a Bernoulli distribution. The goal of the method is to generate the
probability of each attribute being selected in the rationale independently from the probabilities estimated
for other attributes. For instance, hard rationalization can select the age dimension of the annotator
vector and disregard the gender dimension for predicting an annotation. Therefore, applying the mask to
the annotator vector will cause specific dimensions to be disregarded throughout prediction.

Dataset     #Annotators   Profiles
Sentiment   1,481         demographics
Hate        1,200         demographics, explicit biases
Morality    20            demographics, psychological profile
Table 5.1: Annotator information provided in the three datasets used for annotator modeling. The datasets vary significantly in their number of annotators, representing different approaches to annotation collection.
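Returning to hard rationalization, the mask generator could be sketched as follows. [94] train such a selector with policy-gradient methods; this sketch instead uses a straight-through estimator purely to stay short and differentiable, so it is an assumption rather than the original formulation.

import torch
import torch.nn as nn

class HardRationaleMask(nn.Module):
    # Produces a binary mask z_i over the d annotator attributes and applies
    # it to the annotator vector a_i.
    def __init__(self, annotator_dim):
        super().__init__()
        self.selector = nn.Linear(annotator_dim, annotator_dim)

    def forward(self, annotator_vec):
        probs = torch.sigmoid(self.selector(annotator_vec))  # P(z_i[k] = 1)
        z_hard = torch.bernoulli(probs)                      # sample the mask
        # Straight-through: hard mask on the forward pass, soft gradients back.
        z = z_hard + probs - probs.detach()
        return annotator_vec * z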
5.2.2.2 Soft rationalization
In contrast to hard rationalization, the method introduced by [79] tends to provide more faithful
explanations by separating the rationale generation process from the prediction model. The approach
starts by generating continuous importance values s_i ∈ R^d that are then binarized to create the mask
vector z_i. The rationalization a_i × z_i is then independently used for predicting the results.
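The binarization step could look like the sketch below; following the spirit of [79], the continuous scores s_i are assumed to come from a separately trained support model, and keeping the top-k attributes is just one simple binarization rule, not necessarily the one used in our experiments.

import torch

def binarize_importance(scores, k=3):
    # Turn continuous importance scores s_i into a binary mask z_i by keeping
    # the k highest-scoring annotator attributes.
    z = torch.zeros_like(scores)
    top = torch.topk(scores, k=k, dim=-1).indices
    z.scatter_(-1, top, 1.0)
    return z

# The masked vector a_i * z_i is then fed to an independently trained predictor:
# rationale = annotator_vec * binarize_importance(scores)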
5.3 Experiment
5.3.1 Data
As discussed earlier, most annotated datasets treat annotators as interchangeable and do not provide
annotator-level labels and information. For this research, however, we rely on datasets that do report the labels
provided by each annotator, along with annotator-level information that could be encoded into annotator
vectors. Table 5.1 lists the annotator profiles included in each dataset.
5.3.1.1 Sentiment
This dataset includes 14,071 blog posts from a prominent “elderblogger” community [44, 43], annotated by 1,481 anno-
tators. Each item is labeled based on its sentiment on a scale of very negative to very positive. Following
the process described by the authors, we mapped the continuous labels to binary values. The data curators
of this dataset also published annotators' answers to several questionnaires that assess their perceptions
about ageism. The annotator information captured in this dataset includes:
• Demographic information including age, race, income, education, marital status, and political
identification.
5.3.1.2 Bias in Hate Speech
The dataset includes hate speech annotations from a relatively large (N = 1,228) set of participants in a
US sample stratified across participants’ gender, age, ethnicity, and political ideology [31]. Each annotator
labeled 56 high-disagreement social media posts from the Gab Hate Corpus [85] that mention at
least one social group, and filled out a survey that captures the following annotator information:
• Demographic information including age, sex, sexual orientation, race, education, religion, and
perceived social class.
• Explicit biases evaluated based on their responses to how friendly, peaceful, helpful, or intelligent
they perceive each of the 8 social groups [30].
5.3.1.3 Moral Foundations Reddit Corpus
We also provide an annotated dataset that includes 2k Reddit posts, each annotated by 20 annotators.
Annotators are research assistants who are trained to detect concerns relating to Moral Foundations
[64] expressed in text. Each instance is either labeled with a list of moral labels or labeled as non-moral. The
survey includes the following items:
• Demographic information including age, gender, sexual orientation, race, religion, and perceived
social class.
            Sentiment              Hate Speech Bias
            F1    P     R          F1    P     R
Majority    80.5  82.4  78.8       73.8  65.6  84.5
AA          81.5  76.7  87.0       66.8  69.6  64.2
AA_Hard     80.6  76.9  84.7       60.4  67.3  54.8
AA_Soft     80.5  77.2  84.1       59.6  69.0  52.4
(a) Results on the Sentiment and Hate Speech Bias datasets.

            Care              Equality          Proportionality   Authority         Loyalty           Purity
            F1   P    R       F1   P    R       F1   P    R       F1   P    R       F1   P    R       F1   P    R
Majority    44.6 64.1 34.2    43.5 63.1 33.1    24.8 59.1 15.7    17.5 46.4 10.8    19.0 58.0 11.4    15.6 52.6 9.1
AA          44.7 34.2 64.6    45.3 34.3 66.7    29.9 22.8 43.5    40.4 31.2 57.5    34.0 25.6 50.3    22.0 16.5 33.0
AA_Hard     49.7 39.6 66.7    44.1 32.7 67.9    28.6 19.6 52.9    37.2 26.1 64.7    27.0 18.8 48.1    11.4 7.2  28.0
AA_Soft     48.8 38.2 67.6    47.1 37.1 64.6    34.2 26.2 49.2    35.8 26.6 54.6    26.0 20.5 35.7    25.2 17.3 46.7
(b) Results on the Moral Foundations Reddit Corpus; predicting each moral foundation label is considered as a binary classification task.
Table 5.2: Results of predicting annotations. Each model has been trained with 5-fold cross-validation and the performance is evaluated on all predictions. The Majority model provides the majority vote of the training set to predict the annotations in the test set. Annotator-aware models either use all provided annotator attributes (AA) or apply the Hard and Soft rationalization methods (AA_Hard and AA_Soft, respectively).
• Moral foundations questionnaire 2 [9] which measures an individual's moral values in terms of
the six foundations: care, equality, proportionality, authority, loyalty and purity.
• Big five personality traits [142] explain important individual differences in people’s patterns of
thinking, feeling, and behaving in terms of extroversion, agreeableness, conscientiousness, negative
emotionality, and open-mindedness.
• Political conservatism [48] which captures individuals' standings on political concerns in the
US, such as abortion, limited government, military and national security, gun ownership, and patri-
otism. Participants’ political concerns are ranked based on their positive or negative standings on
the topics.
5.3.2 Results
We first evaluate the performance of the three proposed Annotator-Aware models (the initial setting and
the two rationalization approaches) on predicting the individual annotations, and compare these results with a
baseline classifier trained on majority votes. This evaluation is motivated by the fact that, in a baseline
setting, a classifier is trained on a single majority vote and applied to different settings with the purpose
of providing acceptable results regardless of contextual variations. Tables 5.2a and 5.2b show the results
of this evaluation on the different datasets. Note that for the Moral Foundations Reddit Corpus, we
considered the detection of each moral foundation as an independent binary classification task.
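A sketch of this evaluation protocol, with 5-fold cross-validation and scores pooled over all per-annotation predictions, is given below. It is illustrative rather than the exact evaluation code, and train_and_predict stands in for fitting whichever model is being evaluated.

from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import KFold

def evaluate(rows, train_and_predict):
    # rows: list of (text, annotator_vector, label) triples, one per annotation.
    # train_and_predict: fits a model on the training rows and returns binary
    # predictions for the test rows. Scores are pooled across the 5 folds.
    gold, pred = [], []
    kfold = KFold(n_splits=5, shuffle=True, random_state=0)
    for train_idx, test_idx in kfold.split(rows):
        train = [rows[i] for i in train_idx]
        test = [rows[i] for i in test_idx]
        pred.extend(train_and_predict(train, test))
        gold.extend(label for _, _, label in test)
    return precision_recall_fscore_support(gold, pred, average="binary")[:3]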
In the case of the moral foundations (Table 5.2b), the single model trained on the majority vote per-
forms poorly on predicting the individual annotations. This can be due to the small number of items in the training
set: while the annotator-aware models are trained on 80% of the annotations (N_annotations ≈ 40k), the
single majority model is trained on a much smaller dataset, created by aggregating the labels of each
instance in the 80% training split (N_instances ≈ 2k). While it can be argued that having 20 annotations per item is not the
common practice in creating NLP models, the results show that modeling annotations can be especially helpful
for datasets with a small set of items.
Overall, the results demonstrate that incorporating annotators' demographic/psychological profiles
into modeling annotations yields more acceptable estimations of individual annotations compared to training
a single model on the majority vote.
5.4 Rationalizing Annotator Information
The results of our experiments (Tables 5.2a and 5.2b) do not show an improvement in performance due
to using the rationales. However, in this section we explore the created rationales (from either the Hard or Soft
rationalization method) as a means for investigating which annotator attributes are more informative
for predicting the annotator-level labels. In our analyses, we focus on the Hard rationalizations for the
Moral Foundations Reddit Corpus, since the variation of annotator profiles captured in the dataset enables a
more thorough comparison.
Figure 5.2: Hard rationales for predicting Care and Equality annotations (from left to right). The average
scores show that while moral foundation concerns of annotators are more applicable for predicting Care,
they are of less importance for predicting Equality annotations.
5.4.1 Psychological Measures
We first compare the weight of each psychological measure captured in the rationalization models for
predicting each specific moral foundation. Specifically, we compared different morality concerns [9] with
the Big Five personality traits [142]. We intuitively expect annotators' scores on the moral foundations question-
naire to be more informative for predicting their annotations in all six moral foundation detection tasks.
However, this is only the case in predicting the Care annotations. Figure 5.2 presents the ranking of the
psychological attributes based on their overall Hard rationalization scores on the dataset.
5.4.2 Demographic Information
The importance of demographic information in rationalizing the annotation predictions is restricted to
the Loyalty and Purity foundations. In predicting these two foundations, the orders of importance are mostly
identical: annotators being straight, non-religious, women, or Asian have the most rationalization
effect, while being non-binary, identifying with sexual orientations other than straight, homosexual,
or bisexual, or being Black have the least effect. Figure 5.3 presents the overall rankings of the demographic
attributes.
Figure 5.3: Hard rationales for predicting Loyalty and Purity annotations (from left to right) based on the
demographic information of annotators. The average scores are mostly identical for the two foundations.
5.4.3 Political Conservatism
We next compare the effect of annotators' standings on different US political concerns on their
annotation behaviors, based on the Hard rationalization model. While the political conservatism measures
have no varying effects on predicting Proportionality and Authority, they differentially affect the other moral
foundations. Interestingly, as Figure 5.4 shows, the order of importance is mostly the same across the different
political concerns in predicting Loyalty, Equality, and Purity, and the rankings are reversed for predicting
Care. This can be due to the fact that liberals value individualizing moral foundations (e.g., Care) more
than other foundations [64].
5.5 Discussion
In this chapter, we introduced a modeling framework that relies on annotator representations along with
the input text representation to predict the annotation. By doing so, the resulting model is capable of
providing an acceptable label that is compatible with the given demographic/psychological profile of the
perceiver.
Figure 5.4: Hard rationales for predicting moral foundations annotations based on annotators’ political
standings on different issues.
Based on our analyses, annotators' profiles can help identify the demographic/psychological attributes
that impact subjective annotations. However, the common practice in data curation is to remove annotator-
level variation by aggregating the labels. Alternatively, we propose three recommendations aimed at
avoiding these issues:
Annotator-level labels: As discussed in Chapter 1, most multiply annotated datasets are released with
no annotator-level information, and instances are labeled by the aggregated votes [147, 81] or aggregate
percentages [34, 82]. Alternatively, as part of the dataset curation, annotator-level labels can be released,
preferably in an anonymous fashion, leaving open the choice of whether and how to utilize or aggregate
these labels to the dataset users.
Socio-demographic information: Information about the sociodemographic identities of the annota-
tors is crucial to ascertain whether datasets (and the models trained on them) equitably represent perspec-
tives of various social groups. We urge dataset developers to include socio-demographic information of
annotators, when viable to do so responsibly.
Documentation about recruitment, selection, and assignment of annotators: Finally, we urge
dataset developers to document how the annotators were recruited, the criteria used to select them and
assign data to them, and any efforts to ensure representational diversity, through transparency artefacts
such as datasheets [60] or data statements [11].
Chapter 6
Conclusions
Language is subjective; our interpretation and understanding of language can vary based on our beliefs and
worldviews, personalities, biases, and experiences. This thesis presents a framework for training NLP
models for subjective language understanding tasks. I demonstrate that aggregating annotators' judge-
ments to create a gold standard and modeling a single correct answer not only often ignores the sub-
jectivity of language, but also leads to modeling those human beliefs that are most dominant, either in
the real world or as represented in human-generated datasets.
Chapter 2 examines the necessity of explicitly considering the unique perspectives that each annotator
expresses in their annotations for building models used for predicting or measuring subjective phenom-
ena. Given that, in assessing subjective tasks, annotators are not interchangeable and their judgements are
derived from their experiences and subjectivity, we recommend retaining all perspectives separately in the
datasets and enabling dataset users to account for these differences accordingly in the modeling process.
Later, in Chapter 5, we follow this recommendation by curating such a dataset, which includes com-
prehensive demographic/psychological profiles of annotators along with their assessments of language.
In Chapter 3, I assess the role of individual annotators and normative social biases in creating biased
annotated datasets and automated language classifier models. The results demonstrate that individual an-
notators are affected by their social biases. Moreover, annotated datasets and trained models are impacted
by the normative social biases encoded in the aggregated annotations and language models. These find-
ings raise a challenge for model designers because while individuals are affected by their biases, a model
trained on ground truths only reflects those biases that are more significantly represented in our language
resources. Therefore, our choice of language resources, that is, whose language is being represented, has
a crucial impact on the model we create. These results have further implications for training AI and NLP
models, because of the legitimate concern that these technologies may perpetuate social stereotypes and
societal inequalities.
In Chapters 4 and 5, I introduce solutions that are not restricted to modeling single gold labels. Alter-
natively, these approaches aim to inform NLP models about annotators’ profiles, along with their percep-
tions about language. As a result, the trained models are more capable of predicting acceptable labels for
each individual annotator, compared to providing a prediction based on the majority vote. Besides opera-
tionalizing the main motivation of this thesis (annotators are not interchangeable), the annotator-aware
model (Chapter 5) rationalizes which annotator attributes are more informative in each specific task. This
functionality assists model designers of subjective tasks with stratifying the pool of annotators to achieve a
balanced representation of different perspectives.
In this thesis, I demonstrated that creating datasets based on majority votes under-represents the per-
spectives of specific individuals and social groups. Once models are trained on such ground truth, the
normative biases and stereotypes are likely to propagate into model predictions. This is especially con-
cerning when the models are applied for decision making, because models cannot imagine an alternative
future and are restricted to recreating the gold standards, biases, and discriminations embedded in our
historically recorded data.
Bibliography
[1] Cecilia Ovesdotter Alm. “Subjective natural language problems: Motivations, applications,
characterizations, and implications”. In: Proceedings of the 49th Annual Meeting of the Association
for Computational Linguistics: Human Language Technologies. 2011, pp. 107–112.url:
https://aclanthology.org/P11-2019.pdf.
[2] Ebba Cecilia Ovesdotter Alm. “Affect in* Text and Speech”. PhD thesis. University of Illinois at
Urbana-Champaign, 2008.url:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.172.9934&rep=rep1&type=pdf.
[3] Héctor Martínez Alonso, Anders Johannsen, Oier Lopez de Lacalle, and Eneko Agirre. “Predicting
word sense annotation agreement”. In: Proceedings of the First Workshop on Linking
Computational Models of Lexical, Sentential and Discourse-level Semantics. 2015, pp. 89–94.url:
https://aclanthology.org/W15-2711.pdf.
[4] Saima Aman and Stan Szpakowicz. “Identifying expressions of emotion in text”. In: International
Conference on Text, Speech and Dialogue. Springer. 2007, pp. 196–205.url:
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1081.5218&rep=rep1&type=pdf.
[5] Atsushi Ando, Satoshi Kobashikawa, Hosana Kamiyama, Ryo Masumura, Yusuke Ijima, and
Yushi Aono. “Soft-target training with ambiguous emotional utterances for dnn-based speech
emotion classification”. In: 2018 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE. 2018, pp. 4964–4968.url:
https://ieeexplore.ieee.org/abstract/document/8461299.
[6] Lora Aroyo, Lucas Dixon, Nithum Thain, Olivia Redfield, and Rachel Rosen. “Crowdsourcing
subjective tasks: the case study of understanding toxicity in online discussions”. In: Companion
Proceedings of The 2019 World Wide Web Conference. 2019, pp. 1100–1105.url:
https://dl.acm.org/doi/pdf/10.1145/3308560.3317083?casa_token=8rfJfsh_ImoAAAAA:
C2QlwpeeiL1Tr-v927_wFd9ZC9F8BMGnwc2i_Ul2yzn0SBbpwZO8qt9dUv1oBfLo0hdBjMMoSOI.
[7] Lora Aroyo and Chris Welty. “Crowd truth: Harnessing disagreement in crowdsourcing a relation
extraction gold standard”. In: WebSci2013. ACM 2013 (2013).url:
https://www.academia.edu/download/66797651/Crowd_Truth_Harnessing_disagreement_in_c20210503-
14654-7wfrq6.pdf.
[8] Lora Aroyo and Chris Welty. “Truth is a lie: Crowd truth and the seven myths of human
annotation”. In: AI Magazine 36.1 (2015), pp. 15–24.
[9] Mohammad Atari, Jonathan Haidt, Jesse Graham, Sena Koleva, Sean T Stevens, and
Morteza Dehghani. “Morality Beyond the WEIRD: How the Nomological Network of Morality
Varies Across Cultures”. In: (2022).
[10] Valerio Basile, Michael Fell, Tommaso Fornaciari, Dirk Hovy, Silviu Paun, Barbara Plank,
Massimo Poesio, and Alexandra Uma. “We Need to Consider Disagreement in Evaluation”. In:
Proceedings of the 1st Workshop on Benchmarking: Past, Present and Future. Online: Association for
Computational Linguistics, Aug. 2021.doi: 10.18653/v1/2021.bppf-1.3.
[11] Emily M Bender and Batya Friedman. “Data statements for natural language processing: Toward
mitigating system bias and enabling better science”. In: Transactions of the Association for
Computational Linguistics 6 (2018), pp. 587–604.url:
https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl_a_00041/1567666/tacl_a_00041.pdf.
[12] Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. “On the
Dangers of Stochastic Parrots: Can Language Models Be Too Big?” In: Proceedings of the 2021
ACM Conference on Fairness, Accountability, and Transparency. 2021, pp. 610–623.url:
https://dl.acm.org/doi/pdf/10.1145/3442188.3445922.
[13] Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. “Language (Technology) is
Power: A Critical Survey of “Bias” in NLP”. In: Proceedings of the 58th Annual Meeting of the
Association for Computational Linguistics. Online: Association for Computational Linguistics, July
2020, pp. 5454–5476.doi: 10.18653/v1/2020.acl-main.485.
[14] Su Lin Blodgett and Brendan O’Connor. “Racial disparity in natural language processing: A case
study of social media african-american english”. In: arXiv preprint arXiv:1707.00061 (2017).url:
https://arxiv.org/pdf/1707.00061.pdf.
[15] Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. “Man is
to computer programmer as woman is to homemaker? debiasing word embeddings”. In: Advances
in neural information processing systems. 2016, pp. 4349–4357.url:
https://proceedings.neurips.cc/paper/2016/file/a486cd07e4ac3d270571622f4f316ec5-Paper.pdf.
[16] Luke Breitfeller, Emily Ahn, David Jurgens, and Yulia Tsvetkov. “Finding microaggressions in the
wild: A case for locating elusive phenomena in social media posts”. In: Proceedings of the 2019
Conference on Empirical Methods in Natural Language Processing and the 9th International Joint
Conference on Natural Language Processing (EMNLP-IJCNLP). 2019, pp. 1664–1674.url:
https://aclanthology.org/D19-1176.pdf.
[17] Sven Buechel and Udo Hahn. “Emobank: Studying the impact of annotation perspective and
representation format on dimensional emotion analysis”. In: Proceedings of the 15th Conference of
the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers.
2017, pp. 578–585.url: https://aclanthology.org/E17-2092.pdf.
[18] Aylin Caliskan, Joanna J Bryson, and Arvind Narayanan. “Semantics derived automatically from
language corpora contain human-like biases”. In: Science 356.6334 (2017), pp. 183–186.url:
https://www.science.org/doi/full/10.1126/science.aal4230?casa_token=EdWaokUTLXYAAAAA:
SAi06P_G7If59YRsSQ8tUJnbQgAC_UkmwqTtGNYo33jLUWkB_z3oa96pAEfARTLVcU8b7G9iMm7e.
[19] Eshwar Chandrasekharan, Chaitrali Gandhi, Matthew Wortley Mustelier, and Eric Gilbert.
“Crossmod: A cross-community learning-based system to assist reddit moderators”. In:
Proceedings of the ACM on human-computer interaction 3.CSCW (2019), pp. 1–30.
[20] Tessa ES Charlesworth, Victor Yang, Thomas C Mann, Benedek Kurdi, and Mahzarin R Banaji.
“Gender Stereotypes in Natural Language: Word Embeddings Show Robust Consistency Across
Child and Adult Language Corpora of More Than 65 Million Words”. In: Psychological Science 32
(2021), pp. 218–240.url:
https://journals.sagepub.com/doi/pdf/10.1177/0956797620963619?casa_token=7Z8c_TE7Ui0AAAAA:
65RUiwfWUf_jnoJYTBaC9pU30F1hDaR6ScpGTxtHojECXyfrFds4YozX5pUqoR-4_jXZw02GAMI.
[21] Veronika Cheplygina and Josien PW Pluim. “Crowd disagreement about medical images is
informative”. In:Intravascularimagingandcomputerassistedstentingandlarge-scaleannotationof
biomedical data and expert label synthesis. Springer, 2018, pp. 105–111.url:
https://arxiv.org/pdf/1806.08174.pdf.
[22] Huang-Cheng Chou and Chi-Chun Lee. “Every rating matters: Joint learning of subjective labels
and individual annotators for speech emotion classification”. In: ICASSP 2019-2019 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2019,
pp. 5886–5890.url:
https://www.researchgate.net/profile/Huang-Cheng-Chou/publication/332791139_Every_Rating_
Matters_Joint_Learning_of_Subjective_Labels_and_Individual_Annotators_for_Speech_Emotion_
Classification/links/5df0d69ba6fdcc283717cca3/Every-Rating-Matters-Joint-Learning-of-
Subjective-Labels-and-Individual-Annotators-for-Speech-Emotion-Classification.pdf.
[23] Trevor Cohn and Lucia Specia. “Modelling Annotator Bias with Multi-task Gaussian Processes:
An Application to Machine Translation Quality Estimation”. In: Proceedings of the 51st Annual
Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Sofia, Bulgaria:
Association for Computational Linguistics, Aug. 2013, pp. 32–42.url:
https://aclanthology.org/P13-1004.
[24] Michele Corazza, Stefano Menini, Elena Cabrio, Sara Tonelli, and Serena Villata. “A multilingual
evaluation for online hate speech detection”. In: ACM Transactions on Internet Technology (TOIT)
20.2 (2020), pp. 1–22.url: https://hal.archives-ouvertes.fr/hal-02972184/document.
[25] Gloria Cowan and Désirée Khatchadourian. “Empathy, ways of knowing, and interdependence as
mediators of gender differences in attitudes toward hate speech and freedom of speech”. In:
Psychology of Women Quarterly 27.4 (2003), pp. 300–308.url:
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.852.3266&rep=rep1&type=pdf.
[26] Alan Cowen, Disa Sauter, Jessica L Tracy, and Dacher Keltner. “Mapping the passions: Toward a
high-dimensional taxonomy of emotional experience and expression”. In: Psychological Science in
the Public Interest 20.1 (2019), pp. 69–90.url:
https://journals.sagepub.com/doi/pdf/10.1177/1529100619850176.
[27] Kate Crawford. “The atlas of AI”. In: The Atlas of AI. Yale University Press, 2021.
[28] Kate Crawford. “The trouble with bias”. In: Conference on Neural Information Processing Systems,
invited speaker. 2017.url: https://www.youtube.com/watch?v=fMym_BKWQzk.
[29] Crowdflower. https://www.figureeight.com/data/sentiment-analysis-emotion-text/. 2016.
[30] Amy J. C. Cuddy, Susan T. Fiske, and Peter Glick. “The BIAS map: behaviors from intergroup
affect and stereotypes.” In: Journal of personality and social psychology 92.4 (2007), p. 631.url:
http://www.europhd.net/sites/europhd/files/images/onda_2/07/18th_lab/scientific_materials/
guan/cuddy_fiske_glick_2007.pdf.
[31] Aida Mostafazadeh Davani, Mohammad Atari, Brendan Kennedy, and Morteza Dehghani. “Hate
speech classifiers learn human-like social stereotypes”. In: arXiv preprint arXiv:2110.14839 (2021).
url: https://arxiv.org/pdf/2110.14839.pdf.
[32] Aida Mostafazadeh Davani, Mark Díaz, and Vinodkumar Prabhakaran. “Dealing with
Disagreements: Looking Beyond the Majority Vote in Subjective Annotations”. In: Transactions of
the Association for Computational Linguistics 10 (2022), pp. 92–110.url:
https://arxiv.org/pdf/2110.05719.pdf.
[33] Thomas Davidson, Debasmita Bhattacharya, and Ingmar Weber. “Racial Bias in Hate Speech and
Abusive Language Detection Datasets”. In: Proceedings of the Third Workshop on Abusive
LanguageOnline. Florence, Italy: Association for Computational Linguistics, Aug. 2019, pp. 25–35.
doi: 10.18653/v1/W19-3504.
[34] Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. “Automated hate speech
detection and the problem of offensive language”. In: Proceedings of the International AAAI
Conference on Web and Social Media. Vol. 11. 1. 2017.url:
https://ojs.aaai.org/index.php/ICWSM/article/download/14955/14805.
[35] Alexander Philip Dawid and Allan M Skene. “Maximum likelihood estimation of observer
error-rates using the EM algorithm”. In: Journal of the Royal Statistical Society: Series C (Applied
Statistics) 28.1 (1979), pp. 20–28.url:
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.469.1377&rep=rep1&type=pdf.
[36] Marie-Catherine De Marneffe, Christopher D Manning, and Christopher Potts. “Did it happen?
The pragmatic complexity of veridicality assessment”. In: Computational linguistics 38.2 (2012),
pp. 301–333.url: https://direct.mit.edu/coli/article-pdf/38/2/301/1801598/coli_a_00097.pdf.
[37] Dorottya Demszky, Dana Movshovitz-Attias, Jeongwoo Ko, Alan Cowen, Gaurav Nemade, and
Sujith Ravi. “GoEmotions: A Dataset of Fine-Grained Emotions”. In: Proceedings of the 58th
Annual Meeting of the Association for Computational Linguistics. Online: Association for
Computational Linguistics, July 2020.url: https://aclanthology.org/2020.acl-main.372.
[38] Emily Denton, Mark Díaz, Ian Kivlichan, Vinodkumar Prabhakaran, and Rachel Rosen. “Whose
Ground Truth? Accounting for Individual and Collective Identities Underlying Dataset
Annotation”. In: arXiv preprint arXiv:2112.04554 (2021).url:
https://arxiv.org/pdf/2112.04554.pdf.
[39] Bart Desmet and Véronique Hoste. “Emotion detection in suicide notes”. In: Expert Systems with
Applications 40.16 (2013), pp. 6351–6358.url:
https://www.researchgate.net/profile/Bart-Desmet/publication/257405079_Emotion_detection_in_
suicide_notes/links/5d92229e92851c33e94b24a6/Emotion-detection-in-suicide-notes.pdf.
[40] J. Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding”. In: NAACL-HLT. 2019.url:
https://arxiv.org/pdf/1810.04805.pdf&usg=ALkJrhhzxlCL6yTht2BRmH9atgvKFxHsxQ.
[41] Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. “Measuring and
mitigating unintended bias in text classification”. In: Proceedings of the 2018 AAAI/ACM
Conference on AI, Ethics, and Society. 2018, pp. 67–73.url:
https://dl.acm.org/doi/pdf/10.1145/3278721.3278729.
[42] Mark Díaz. “Biases as Values: Evaluating Algorithms in Context”. PhD thesis. Northwestern
University, 2020.url:
http://markjdiaz.com/wp-content/uploads/2021/01/Diaz_BiasesAsValues.pdf.
[43] Mark Díaz. Older Adult Annotator Demographic and Attitudinal Survey. Version V1. 2020.doi:
10.7910/DVN/GXS7DI.
[44] Mark Díaz, Isaac Johnson, Amanda Lazar, Anne Marie Piper, and Darren Gergle. “Addressing
age-related bias in sentiment analysis”. In: Proceedings of the 2018 CHI Conference on Human
Factors in Computing Systems. 2018, pp. 1–14.url:
https://dl.acm.org/doi/pdf/10.1145/3173574.3173986.
[45] Anca Dumitrache. “Crowdsourcing disagreement for collecting semantic annotation”. In:
European Semantic Web Conference. Springer. 2015, pp. 701–710.url:
https://link.springer.com/chapter/10.1007/978-3-319-18818-8_43.
[46] Carsten Eickhoff. “Cognitive biases in crowdsourcing”. In: Proceedings of the eleventh ACM
international conference on web search and data mining. 2018, pp. 162–170.
[47] Paul Ekman. “An argument for basic emotions”. In: Cognition & emotion 6.3-4 (1992), pp. 169–200.
url: https://asset-pdf.scinapse.io/prod/1966797434/1966797434.pdf.
[48] Jim AC Everett. “The 12 item social and economic conservatism scale (SECS)”. In: PloS one 8.12
(2013), e82131.
[49] Haytham M Fayek, Margaret Lech, and Lawrence Cavedon. “Modeling subjectiveness in emotion
recognition with deep neural networks: Ensembles vs soft labels”. In: 2016 international joint
conference on neural networks (IJCNN). IEEE. 2016, pp. 566–570.url:
https://ieeexplore.ieee.org/abstract/document/7727250/.
[50] Michael Feldman, Sorelle A Friedler, John Moeller, Carlos Scheidegger, and
Suresh Venkatasubramanian. “Certifying and removing disparate impact”. In: proceedings of the
21th ACM SIGKDD international conference on knowledge discovery and data mining. 2015,
pp. 259–268.url: https://dl.acm.org/doi/pdf/10.1145/2783258.2783311?casa_token=k-
r3vSRqmswAAAAA:7WIQzLbNa6AvhIUDwY3qG8o5XhscCC0DEb891YkdhjwbAsCr2_mLNPdW-ilUbCKV0DYAnqZI_eY.
[51] Elisa Ferracane, Greg Durrett, Junyi Jessy Li, and Katrin Erk. “Did they answer? Subjective acts
and intents in conversational discourse”. In: Proceedings of the 2021 Conference of the North
American Chapter of the Association for Computational Linguistics: Human Language Technologies.
Online: Association for Computational Linguistics, June 2021, pp. 1626–1644.doi:
10.18653/v1/2021.naacl-main.129.
[52] Susan T. Fiske, Amy J. C. Cuddy, P Glick, and J Xu. “A model of (often mixed) stereotype content:
competence and warmth respectively follow from perceived status and competition.” In: Journal
of personality and social psychology 82.6 (2002), p. 878.url:
https://doi.apa.org/doiLanding?doi=10.1037%5C%2F0022-3514.82.6.878.
[53] Tommaso Fornaciari, Alexandra Uma, Silviu Paun, Barbara Plank, Dirk Hovy, and
Massimo Poesio. “Beyond Black & White: Leveraging Annotator Disagreement via Soft-Label
Multi-Task Learning”. In: Proceedings of the 2021 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies. Online: Association for
Computational Linguistics, June 2021, pp. 2591–2597.doi: 10.18653/v1/2021.naacl-main.204.
[54] Tommaso Fornaciari, Alexandra Uma, Silviu Paun, Barbara Plank, Dirk Hovy, and
Massimo Poesio. “Beyond Black & White: Leveraging Annotator Disagreement via Soft-Label
Multi-Task Learning”. In: Proceedings of the 2021 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies. 2021, pp. 2591–2597.
url: https://aclanthology.org/2021.naacl-main.204.pdf.
[55] Antigoni Maria Founta, Constantinos Djouvas, Despoina Chatzakou, Ilias Leontiadis,
Jeremy Blackburn, Gianluca Stringhini, Athena Vakali, Michael Sirivianos, and Nicolas Kourtellis.
“Large scale crowdsourcing and characterization of twitter abusive behavior”. In: Twelfth
International AAAI Conference on Web and Social Media. 2018.url:
https://www.aaai.org/ocs/index.php/ICWSM/ICWSM18/paper/download/17909/17041.
[56] Gavin Gaffney. Pushshift Gab Corpus. https://files.pushshift.io/gab/. Accessed: 2019-5-23. 2018.
[57] Yarin Gal and Zoubin Ghahramani. “Dropout as a bayesian approximation: Representing model
uncertainty in deep learning”. In: international conference on machine learning. PMLR. 2016,
pp. 1050–1059.url: http://proceedings.mlr.press/v48/gal16.pdf.
[58] Nikhil Garg, Londa Schiebinger, Dan Jurafsky, and James Zou. “Word embeddings quantify 100
years of gender and ethnic stereotypes”. In:ProceedingsoftheNationalAcademyofSciences 115.16
(2018), E3635–E3644.url: https://www.pnas.org/doi/pdf/10.1073/pnas.1720347115.
[59] Justin Garten, Brendan Kennedy, Joe Hoover, Kenji Sagae, and Morteza Dehghani. “Incorporating
demographic embeddings into language understanding”. In: Cognitive science 43.1 (2019), e12701.
url: https://onlinelibrary.wiley.com/doi/pdfdirect/10.1111/cogs.12701.
[60] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan,
Hanna Wallach, Hal Daumé III, and Kate Crawford. “Datasheets for datasets”. In: arXiv preprint
arXiv:1803.09010 (2018).
[61] Mor Geva, Yoav Goldberg, and Jonathan Berant. “Are We Modeling the Task or the Annotator? An
Investigation of Annotator Bias in Natural Language Understanding Datasets”. In: Proceedings of
the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International
Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China:
Association for Computational Linguistics, Nov. 2019, pp. 1161–1166.doi: 10.18653/v1/D19-1107.
[62] Asma Ghandeharioun, Brian Eoff, Brendan Jou, and Rosalind Picard. “Characterizing Sources of
Uncertainty to Proxy Calibration and Disambiguate Annotator and Data Bias”. In: 2019 IEEE/CVF
International Conference on Computer Vision Workshop (ICCVW). IEEE. 2019, pp. 4202–4206.url:
https://arxiv.org/pdf/1909.09285.pdf.
[63] Mitchell L Gordon, Kaitlyn Zhou, Kayur Patel, Tatsunori Hashimoto, and Michael S Bernstein.
“The Disagreement Deconvolution: Bringing Machine Learning Performance Metrics In Line
With Reality”. In: Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems.
2021.url: http://www.kayur.org/papers/chi2021.pdf.
[64] Jesse Graham, Jonathan Haidt, and Brian A Nosek. “Liberals and conservatives rely on different
sets of moral foundations.” In: Journal of personality and social psychology 96.5 (2009), p. 1029.
[65] Anthony G Greenwald, Debbie E McGhee, and Jordan LK Schwartz. “Measuring individual
differences in implicit cognition: the implicit association test.” In: Journal of personality and social
psychology 74.6 (1998), p. 1464.
[66] Anthony G Greenwald, T Andrew Poehlman, Eric Luis Uhlmann, and Mahzarin R Banaji.
“Understanding and using the Implicit Association Test: III. Meta-analysis of predictive validity.”
In: Journal of personality and social psychology 97.1 (2009), p. 17.
[67] Rainer Greifeneder, Herbert Bless, and Klaus Fiedler. Social cognition: How individuals construct
social reality. Psychology Press, 2017.url:
https://www.taylorfrancis.com/books/mono/10.4324/9781315784731/social-cognition-herbert-
bless-klaus-fiedler.
[68] Moritz Hardt, Eric Price, and Nati Srebro. “Equality of opportunity in supervised learning”. In:
Advances in neural information processing systems. 2016, pp. 3315–3323.url:
https://proceedings.neurips.cc/paper/2016/file/9d2682367c3935defcb1f9e247a97c0d-Paper.pdf.
[69] Dan Hendrycks and Kevin Gimpel. “A Baseline for Detecting Misclassified and
Out-of-Distribution Examples in Neural Networks”. In: Proceedings of International Conference on
Learning Representations (2017).url: https://arxiv.org/pdf/1610.02136.pdf.
[70] Julia Hirschberg, Jackson Liscombe, and Jennifer Venditti. “Experiments in emotional speech”. In:
ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition. 2003.url:
https://www.researchgate.net/profile/Julia-Hirschberg/publication/246140762_Experiments_in_
Emotional_Speech/links/556f094208aeab7772282916/Experiments-in-Emotional-Speech.pdf.
[71] Julia Hirschberg and Christopher D Manning. “Advances in natural language processing”. In:
Science 349.6245 (2015), pp. 261–266.url:
https://nlp.stanford.edu/~manning/xyzzy/Hirschberg-Manning-Science-2015.pdf.
[72] Dirk Hovy. “Demographic factors improve classification performance”. In: Proceedings of the 53rd
annual meeting of the Association for Computational Linguistics and the 7th international joint
conference on natural language processing (volume 1: Long papers). 2015, pp. 752–762.url:
https://aclanthology.org/P15-1073.pdf.
[73] Dirk Hovy, Taylor Berg-Kirkpatrick, Ashish Vaswani, and Eduard Hovy. “Learning whom to trust
with MACE”. In: Proceedings of the 2013 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies. 2013, pp. 1120–1130.
url: https://aclanthology.org/N13-1132.pdf.
[74] Dirk Hovy and Shrimai Prabhumoye. “Five sources of bias in natural language processing”. In:
Language and Linguistics Compass 15.8 (2021), e12432.url:
https://compass.onlinelibrary.wiley.com/doi/pdf/10.1111/lnc3.12432.
[75] Dirk Hovy and Diyi Yang. “The importance of modeling social factors of language: Theory and
practice”. In: Proceedings of the 2021 Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies. 2021, pp. 588–602.url:
https://aclanthology.org/2021.naacl-main.49.pdf.
[76] Eduard Hovy, Mitch Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel.
“OntoNotes: the 90% solution”. In: Proceedings of the human language technology conference of the
NAACL, Companion Volume: Short Papers. 2006, pp. 57–60.url:
https://aclanthology.org/N06-2015.pdf.
[77] Ben Hutchinson, Vinodkumar Prabhakaran, Emily Denton, Kellie Webster, Yu Zhong, and
Stephen Denuyl. “Social Biases in NLP Models as Barriers for Persons with Disabilities”. In:
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online:
Association for Computational Linguistics, July 2020, pp. 5491–5501.doi:
10.18653/v1/2020.acl-main.487.
[78] Lilly C Irani and M Six Silberman. “Turkopticon: Interrupting worker invisibility in amazon
mechanical turk”. In: Proceedings of the SIGCHI conference on human factors in computing systems.
2013, pp. 611–620.
[79] Sarthak Jain, Sarah Wiegreffe, Yuval Pinter, and Byron C. Wallace. “Learning to Faithfully
Rationalize by Construction”. In: Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics. Online: Association for Computational Linguistics, July 2020.doi:
10.18653/v1/2020.acl-main.409.
[80] Shan Jiang and Christo Wilson. “Structurizing Misinformation Stories via Rationalizing
Fact-Checks”. In: Proceedings of the 59th Annual Meeting of the Association for Computational
Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1:
Long Papers). Online: Association for Computational Linguistics, Aug. 2021, pp. 617–631.doi:
10.18653/v1/2021.acl-long.51.
[81] Jigsaw. Toxic Comment Classification Challenge. Accessed: 2021-05-01. 2018.url:
https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data.
[82] Jigsaw. Unintended Bias in Toxicity Classification. Accessed: 2021-05-01. 2019.url:
https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/data.
[83] David Jurgens, Libby Hemphill, and Eshwar Chandrasekharan. “A Just and Comprehensive
Strategy for Using NLP to Address Online Abuse”. In: Proceedings of the 57th Annual Meeting of
the Association for Computational Linguistics. Florence, Italy: Association for Computational
Linguistics, July 2019, pp. 3658–3666.doi: 10.18653/v1/P19-1357.
[84] Sanjay Kairam and Jeffrey Heer. “Parting crowds: Characterizing divergent interpretations in
crowdsourced annotation tasks”. In: Proceedings of the 19th ACM Conference on
Computer-Supported Cooperative Work & Social Computing. 2016, pp. 1637–1648.url:
https://dl.acm.org/doi/pdf/10.1145/2818048.2820016.
[85] Brendan Kennedy, Mohammad Atari, Aida Mostafazadeh Davani, Leigh Yeh, Ali Omrani,
Yehsong Kim, Kris Coombs Jr., Shreya Havaldar, Gwenyth Portillo-Wightman, Elaine Gonzalez,
Joe Hoover, Aida Azatian, Gabriel Cardenas, Alyzeh Hussain, Austin Lara, Adam Omary,
Christina Park, Xin Wang, Clarisa Wijaya, Yong Zhang, Beth Meyerowitz, and Morteza Dehghani.
The Gab Hate Corpus: A collection of 27k posts annotated for hate speech. 2020.doi:
10.31234/osf.io/hqjxn.
[86] Brendan Kennedy, Xisen Jin, Aida Mostafazadeh Davani, Morteza Dehghani, and Xiang Ren.
“Contextualizing Hate Speech Classifiers with Post-hoc Explanation”. In: Proceedings of the 58th
Annual Meeting of the Association for Computational Linguistics. Online: Association for
Computational Linguistics, July 2020, pp. 5435–5442.doi: 10.18653/v1/2020.acl-main.483.
[87] Chris J Kennedy, Geoff Bacon, Alexander Sahn, and Claudia von Vacano. “Constructing interval
variables via faceted Rasch measurement and multitask deep learning: a hate speech application”.
In: arXiv preprint arXiv:2009.10277 (2020).url: https://arxiv.org/pdf/2009.10277.pdf.
[88] Diederik P. Kingma and Jimmy Ba. “Adam: A Method for Stochastic Optimization”. In: 3rd
International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9,
2015, Conference Track Proceedings. Ed. by Yoshua Bengio and Yann LeCun. 2015.url:
http://arxiv.org/abs/1412.6980.
[89] Svetlana Kiritchenko, Isar Nejadgholi, and Kathleen C Fraser. “Confronting Abusive Language
Online: A Survey from the Ethical and Human Rights Perspective”. In: arXiv preprint
arXiv:2012.12305 (2020).url:
https://www.jair.org/index.php/jair/article/download/12590/26695.
[90] Michael Kläs and Anna Maria Vollmer. “Uncertainty in machine learning applications: A
practice-driven classification of uncertainty”. In: International Conference on Computer Safety,
Reliability, and Security. Springer. 2018, pp. 431–438.url:
https://link.springer.com/chapter/10.1007/978-3-319-99229-7_36.
[91] Alex Koch, Roland Imhoff, Ron Dotsch, Christian Unkelbach, and Hans Alves. “The ABC of
stereotypes about groups: Agency/socioeconomic success, conservative–progressive beliefs, and
communion.” In: Journal of personality and social psychology 110.5 (2016), p. 675.url:
https://www.researchgate.net/profile/Alex-Koch-2/publication/303086913_The_ABC_of_
Stereotypes_About_Groups_AgencySocioeconomic_Success_Conservative-
Progressive_Beliefs_and_Communion/links/592fdf100f7e9beee761b0a8/The-ABC-of-Stereotypes-
About-Groups-Agency-Socioeconomic-Success-Conservative-Progressive-Beliefs-and-
Communion.pdf.
[92] Klaus Krippendorff. “Agreement and information in the reliability of coding”. In: Communication
Methods and Measures 5.2 (2011), pp. 93–112.url:
https://repository.upenn.edu/cgi/viewcontent.cgi?article=1286&context=asc_papers.
[93] Irene Kwok and Yuzhou Wang. “Locate the hate: Detecting tweets against blacks”. In: Proceedings
of the AAAI Conference on Artificial Intelligence . Vol. 27. 2013.url:
https://www.aaai.org/ocs/index.php/AAAI/AAAI13/paper/viewPDFInterstitial/6419/6821.
[94] Tao Lei, Regina Barzilay, and Tommi Jaakkola. “Rationalizing Neural Predictions”. In: Proceedings
of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, Texas:
Association for Computational Linguistics, Nov. 2016, pp. 107–117.doi: 10.18653/v1/D16-1011.
[95] Jackson Liscombe, Jennifer Venditti, and Julia Hirschberg. “Classifying subject ratings of
emotional speech using acoustic features”. In: Eighth European Conference on Speech
Communication and Technology. 2003.url:
https://academiccommons.columbia.edu/doi/10.7916/D8VX0QTJ/download.
[96] Bing Liu et al. “Sentiment analysis and subjectivity.” In: Handbook of natural language processing
2.2010 (2010), pp. 627–666.
[97] Hugo Liu, Henry Lieberman, and Ted Selker. “A model of textual affect sensing using real-world
knowledge”. In: Proceedings of the 8th international conference on Intelligent user interfaces. 2003,
pp. 125–132.url:
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.73.2973&rep=rep1&type=pdf.
[98] Tong Liu. “Human-in-the-Loop Learning from Crowdsourcing and Social Media”. In: (2020).url:
https://scholarworks.rit.edu/cgi/viewcontent.cgi?article=11619&context=theses.
[99] Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. “Multi-Task Deep Neural
Networks for Natural Language Understanding”. In: Proceedings of the 57th Annual Meeting of the
Association for Computational Linguistics. Florence, Italy: Association for Computational
Linguistics, July 2019, pp. 4487–4496.doi: 10.18653/v1/P19-1441.
[100] Yiwei Luo, Dallas Card, and Dan Jurafsky. “Detecting Stance in Media On Global Warming”. In:
Findings of the Association for Computational Linguistics: EMNLP 2020. Online: Association for
Computational Linguistics, Nov. 2020, pp. 3296–3315.doi: 10.18653/v1/2020.findings-emnlp.296.
[101] Thomas Manzini, Lim Yao Chong, Alan W Black, and Yulia Tsvetkov. “Black is to Criminal as
Caucasian is to Police: Detecting and Removing Multiclass Bias in Word Embeddings”. In:
Proceedings of the 2019 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).
Minneapolis, Minnesota: Association for Computational Linguistics, June 2019, pp. 615–621.doi:
10.18653/v1/N19-1062.
[102] Melissa D McCradden, Shalmali Joshi, James A Anderson, Mjaye Mazwi, Anna Goldenberg, and
Randi Zlotnik Shaul. “Patient safety and quality improvement: Ethical principles for a regulatory
approach to bias in healthcare machine learning”. In: Journal of the American Medical Informatics
Association 27.12 (2020), pp. 2024–2027.url:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7727331/.
[103] Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. “A
survey on bias and fairness in machine learning”. In: ACM Computing Surveys (CSUR) 54.6 (2021),
pp. 1–35.url: https://dl.acm.org/doi/pdf/10.1145/3457607?casa_token=UpREOO4pAAkAAAAA:
wkYt6GQ7CTW1uQAkPjF7Ml4TjTj1Xl2Pp__CkREN7eR9LxCURspytmLPdc0JZuvwePwVol9sFzU.
[104] Rada Mihalcea and Hugo Liu. “A corpus-based approach to finding happiness.” In: AAAI Spring
Symposium: Computational Approaches to Analyzing Weblogs. 2006, pp. 139–144.url:
https://www.aaai.org/Papers/Symposia/Spring/2006/SS-06-03/SS06-03-027.pdf.
[105] Pushkar Mishra, Helen Yannakoudakis, and Ekaterina Shutova. “Tackling online abuse: A survey
of automated abuse detection methods”. In: arXiv preprint arXiv:1908.06024 (2019).url:
https://arxiv.org/pdf/1908.06024.pdf.
[106] Emily Mower, Angeliki Metallinou, Chi-Chun Lee, Abe Kazemzadeh, Carlos Busso, Sungbok Lee,
and Shrikanth Narayanan. “Interpreting ambiguous emotional expressions”. In: 2009 3rd
International Conference on Affective Computing and Intelligent Interaction and Workshops . IEEE.
2009, pp. 1–8.url: http://web.eecs.umich.edu/~emilykmp/EmilyPapers/MowerACII.pdf.
[107] Marzieh Mozafari, Reza Farahbakhsh, and Noël Crespi. “Hate speech detection and racial bias
mitigation in social media based on BERT model”. In: PloS one 15.8 (2020), e0237861.url:
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0237861.
[108] Marzieh Mozafari, Reza Farahbakhsh, and Noel Crespi. “A BERT-based transfer learning
approach for hate speech detection in online social media”. In: International Conference on
Complex Networks and Their Applications. Springer. 2019, pp. 928–940.url:
https://arxiv.org/pdf/1910.12574.pdf.
[109] Michael Muthukrishna and Joseph Henrich. “A problem in theory”. In: Nature Human Behaviour
3.3 (2019), pp. 221–229.url: https://www.nature.com/articles/s41562-018-0522-1.
[110] Stefanie Nowak and Stefan Rüger. “How reliable are annotations via crowdsourcing: a study
about inter-annotator agreement for multi-label image annotation”. In: Proceedings of the
international conference on Multimedia information retrieval. 2010, pp. 557–566.url:
https://oro.open.ac.uk/25874/1/mir354s-nowak.pdf.
[111] Ziad Obermeyer, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. “Dissecting racial
bias in an algorithm used to manage the health of populations”. In: Science 366.6464 (2019),
pp. 447–453.url:
https://www.science.org/doi/full/10.1126/science.aax2342?casa_token=OFqf1aG_5dQAAAAA:
lprTuuAwoatqGGY_4_EY53aFYvWhb68oCjLT_BAMa7DxJsgrwB__vlGnbBp3K0f8bQzh2YiThs7a.
[112] Bo Pang and Lillian Lee. “A Sentimental Education: Sentiment Analysis Using Subjectivity
Summarization Based on Minimum Cuts”. In: Proceedings of the 42nd Annual Meeting of the
Association for Computational Linguistics (ACL-04). Barcelona, Spain, July 2004, pp. 271–278.doi:
10.3115/1218955.1218990.
[113] Ji Ho Park, Jamin Shin, and Pascale Fung. “Reducing Gender Bias in Abusive Language Detection”.
In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
Brussels, Belgium: Association for Computational Linguistics, 2018.doi: 10.18653/v1/D18-1302.
[114] Rebecca J Passonneau and Bob Carpenter. “The benefits of a model of annotation”. In:
Transactions of the Association for Computational Linguistics 2 (2014), pp. 311–326.url:
https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl_a_00185/1566907/tacl_a_00185.pdf.
[115] Desmond Patton, Philipp Blandfort, William Frey, Michael Gaskell, and Svebor Karaman.
“Annotating social media data from vulnerable populations: Evaluating disagreement between
domain experts and graduate student annotators”. In: Proceedings of the 52nd Hawaii International
Conference on System Sciences. 2019.url:
https://scholarspace.manoa.hawaii.edu/bitstream/10125/59653/0213.pdf.
[116] Silviu Paun, Bob Carpenter, Jon Chamberlain, Dirk Hovy, Udo Kruschwitz, and Massimo Poesio.
“Comparing Bayesian Models of Annotation”. In: Transactions of the Association for
Computational Linguistics 6 (2018), pp. 571–585.url:
https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl_a_00040/1567662/tacl_a_00040.pdf.
[117] Ellie Pavlick and Tom Kwiatkowski. “Inherent disagreements in human textual inferences”. In:
Transactions of the Association for Computational Linguistics 7 (2019), pp. 677–694.
[118] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. “GloVe: Global Vectors for
Word Representation”. In: Empirical Methods in Natural Language Processing (EMNLP). 2014,
pp. 1532–1543.url: http://www.aclweb.org/anthology/D14-1162.
[119] Agnieszka Pietraszkiewicz, Magdalena Formanowicz, Marie Gustafsson Sendén, Ryan L Boyd,
Sverker Sikström, and Sabine Sczesny. “The big two dictionaries: Capturing agency and
communion in natural language”. In: European journal of social psychology 49.5 (2019),
pp. 871–887.url:
https://onlinelibrary.wiley.com/doi/abs/10.1002/ejsp.2561.
[120] Barbara Plank, Dirk Hovy, and Anders Søgaard. “Learning part-of-speech taggers with
inter-annotator agreement loss”. In: Proceedings of the 14th Conference of the European Chapter of
the Association for Computational Linguistics. Gothenburg, Sweden: Association for
Computational Linguistics, Apr. 2014, pp. 742–751.doi: 10.3115/v1/E14-1078.
[121] Barbara Plank, Dirk Hovy, and Anders Søgaard. “Learning part-of-speech taggers with
inter-annotator agreement loss”. In: Proceedings of the 14th Conference of the European Chapter of
the Association for Computational Linguistics. 2014, pp. 742–751.url:
https://aclanthology.org/E14-1078.pdf.
[122] Robert Plutchik. “A general psychoevolutionary theory of emotion”. In: Theories of emotion.
Elsevier, 1980, pp. 3–33.url:
https://www.sciencedirect.com/science/article/pii/B9780125587013500077.
[123] Soujanya Poria, Navonil Majumder, Rada Mihalcea, and Eduard Hovy. “Emotion recognition in
conversation: Research challenges, datasets, and recent advances”. In: IEEE Access 7 (2019),
pp. 100943–100953.url: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8764449.
[124] Vinodkumar Prabhakaran, Michael Bloodgood, Mona Diab, Bonnie Dorr, Lori Levin,
Christine Piatko, Owen Rambow, and Benjamin Van Durme. “Statistical Modality Tagging from
Rule-based Annotations and Crowdsourcing”. In: 2012, pp. 57–64.
[125] Vinodkumar Prabhakaran, Ben Hutchinson, and Margaret Mitchell. “Perturbation Sensitivity
Analysis to Detect Unintended Model Biases”. In: Proceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and the 9th International Joint Conference on Natural
Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational
Linguistics, Nov. 2019, pp. 5740–5745.doi: 10.18653/v1/D19-1578.
[126] Vinodkumar Prabhakaran, Ben Hutchinson, and Margaret Mitchell. “Perturbation Sensitivity
Analysis to Detect Unintended Model Biases”. In: Proceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and the 9th International Joint Conference on Natural
Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational
Linguistics, Nov. 2019, pp. 5740–5745.doi: 10.18653/v1/D19-1578.
[127] Vinodkumar Prabhakaran, Aida Mostafazadeh Davani, and Mark Diaz. “On Releasing
Annotator-Level Labels and Information in Datasets”. In: Proceedings of The Joint 15th Linguistic
Annotation Workshop (LAW) and 3rd Designing Meaning Representations (DMR) Workshop. Punta
Cana, Dominican Republic: Association for Computational Linguistics, Nov. 2021, pp. 133–138.
doi: 10.18653/v1/2021.law-1.14.
[128] Vinodkumar Prabhakaran, Zeerak Talat, Seyi Akiwowo, and Bertie Vidgen. “Online Abuse and
Human Rights: WOAH Satellite Session at RightsCon 2020”. In: Proceedings of the Fourth
Workshop on Online Abuse and Harms. Online: Association for Computational Linguistics, Nov.
2020, pp. 1–6.doi: 10.18653/v1/2020.alw-1.1.
[129] Ilan Price, Jordan Gifford-Moore, Jory Flemming, Saul Musker, Maayan Roichman,
Guillaume Sylvain, Nithum Thain, Lucas Dixon, and Jeffrey Sorensen. “Six Attributes of
Unhealthy Conversations”. In: Proceedings of the Fourth Workshop on Online Abuse and Harms.
Online: Association for Computational Linguistics, Nov. 2020, pp. 114–124.doi:
10.18653/v1/2020.alw-1.15.
[130] Rachel Rakov and Andrew Rosenberg. ““Sure, I did the right thing”: a system for sarcasm
detection in speech.” In: Interspeech. 2013, pp. 842–846.url:
https://lpp.ilpga.fr/PDF/IS130109/IS130109.PDF.
[131] Georg Rasch. Probabilistic models for some intelligence and attainment tests. ERIC, 1993.url:
https://eric.ed.gov/?id=ED419814.
[132] Björn Ross, Michael Rist, Guillermo Carbonell, Benjamin Cabrera, Nils Kurowsky, and
Michael Wojatzki. “Measuring the reliability of hate speech annotations: The case of the european
refugee crisis”. In: arXiv preprint arXiv:1701.08118 (2017).url: https://arxiv.org/abs/1701.08118.
[133] Joel Ross, Lilly Irani, M Six Silberman, Andrew Zaldivar, and Bill Tomlinson. “Who are the
crowdworkers? Shifting demographics in Mechanical Turk”. In: CHI’10 extended abstracts on
Human factors in computing systems. 2010, pp. 2863–2872.url:
https://www.academia.edu/download/43592369/Who_are_the_crowdworkers_shifting_demogr20160310-
18708-cv9zu3.pdf.
[134] Paul Röttger, Bertie Vidgen, Dirk Hovy, and Janet B Pierrehumbert. “Two Contrasting Data
Annotation Paradigms for Subjective NLP Tasks”. In: arXiv preprint arXiv:2112.07475 (2021).
[135] James A Russell. “Core affect and the psychological construction of emotion.” In: Psychological
review 110.1 (2003), p. 145.url: https://www.academia.edu/download/30925178/psyc-rev2003.pdf.
[136] Marta Sabou, Kalina Bontcheva, Leon Derczynski, and Arno Scharl. “Corpus Annotation through
Crowdsourcing: Towards Best Practice Guidelines.” In: LREC. 2014, pp. 859–866.url:
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1048.7024&rep=rep1&type=pdf.
[137] Joni Salminen, Hind Almerekhi, Ahmed Mohamed Kamel, Soon-gyo Jung, and Bernard J Jansen.
“Online hate ratings vary by extremes: A statistical analysis”. In: Proceedings of the 2019
Conference on Human Information Interaction and Retrieval. 2019, pp. 213–217.url:
https://dl.acm.org/doi/pdf/10.1145/3295750.3298954.
[138] Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A Smith. “The risk of racial bias
in hate speech detection”. In: Proceedings of the 57th annual meeting of the association for
computational linguistics. 2019, pp. 1668–1678.url:
https://www.aclweb.org/anthology/P19-1163.pdf.
[139] Anna Schmidt and Michael Wiegand. “A survey on hate speech detection using natural language
processing”. In: Proceedings of the fifth international workshop on natural language processing for
social media. 2017, pp. 1–10.url: https://www.aclweb.org/anthology/W17-1101.pdf.
[140] Patrick Schwab and Walter Karlen. “CXPlain: Causal Explanations for Model Interpretation under
Uncertainty”. In: Advances in Neural Information Processing Systems (NeurIPS). 2019.url:
https://arxiv.org/pdf/1910.12336.
[141] Rion Snow, Brendan O’Connor, Daniel Jurafsky, and Andrew Ng. “Cheap and Fast – But is it
Good? Evaluating Non-Expert Annotations for Natural Language Tasks”. In: Proceedings of the
2008 Conference on Empirical Methods in Natural Language Processing. Honolulu, Hawaii:
Association for Computational Linguistics, Oct. 2008, pp. 254–263.url:
https://aclanthology.org/D08-1027.
[142] Christopher J Soto and Oliver P John. “Short and extra-short forms of the Big Five Inventory–2:
The BFI-2-S and BFI-2-XS”. In: Journal of Research in Personality 68 (2017), pp. 69–81.
[143] Carlo Strapparava and Rada Mihalcea. “Semeval-2007 task 14: Affective text”. In: Proceedings of
the Fourth International Workshop on Semantic Evaluations (SemEval-2007). 2007, pp. 70–74.url:
https://www.aclweb.org/anthology/S07-1013.pdf.
[144] Nathaniel Swinger, Maria De-Arteaga, Neil Thomas Heffernan IV, Mark DM Leiserson, and
Adam Tauman Kalai. “What are the biases in my word embedding?” In: Proceedings of the 2019
AAAI/ACM Conference on AI, Ethics, and Society. 2019, pp. 305–311.url:
https://dl.acm.org/doi/pdf/10.1145/3306618.3314270?casa_token=S9HCbmHj3PMAAAAA:
cBFIiPUhZPFa7YQPJC6JeO7n23N_qJDpEWxK0kERHOvQxOR3dL7-7sw6DNiiN2nY6f5NeHyjx30.
[145] Zeerak Talat. “Are you a racist or am i seeing things? annotator influence on hate speech
detection on twitter”. In: Proceedings of the first workshop on NLP and computational social science .
2016, pp. 138–142.url: https://aclanthology.org/W16-5618.pdf.
[146] Zeerak Talat, Thomas Davidson, Dana Warmsley, and Ingmar Weber. “Understanding abuse: A
typology of abusive language detection subtasks”. In: arXiv preprint arXiv:1705.09899 (2017).url:
https://arxiv.org/pdf/1705.09899.
[147] Zeerak Talat and Dirk Hovy. “Hateful symbols or hateful people? predictive features for hate
speech detection on twitter”. In: Proceedings of the NAACL student research workshop. 2016,
pp. 88–93.url: https://www.aclweb.org/anthology/N16-2013.pdf.
[148] Alexandra N Uma, Tommaso Fornaciari, Dirk Hovy, Silviu Paun, Barbara Plank, and
Massimo Poesio. “Learning from Disagreement: A Survey”. In: Journal of Artificial Intelligence
Research 72 (2021), pp. 1385–1470.
[149] Michele Vecchione, Francesco Dentale, Guido Alessandri, and Claudio Barbaranelli. “Fakability of
implicit and explicit measures of the Big Five: Research findings from organizational settings”. In:
International Journal of Selection and Assessment 22.2 (2014), pp. 211–218.
[150] Bertie Vidgen, Tristan Thrush, Zeerak Talat, and Douwe Kiela. “Learning from the Worst:
Dynamically Generated Datasets to Improve Online Hate Detection”. In: Proceedings of the 59th
Annual Meeting of the Association for Computational Linguistics and the 11th International Joint
Conference on Natural Language Processing (Volume 1: Long Papers). Online: Association for
Computational Linguistics, Aug. 2021, pp. 1667–1682.doi: 10.18653/v1/2021.acl-long.132.
[151] Emily A Vogels. “The state of online harassment”. In: Pew Research Center 13 (2021).
[152] Claudia Wagner, Markus Strohmaier, Alexandra Olteanu, Emre Kıcıman, Noshir Contractor, and
Tina Eliassi-Rad. “Measuring algorithmically infused societies”. In: Nature (2021).doi:
10.1038/s41586-021-03666-1.
[153] William Warner and Julia Hirschberg. “Detecting hate speech on the world wide web”. In:
Proceedings of the second workshop on language in social media. 2012, pp. 19–26.url:
https://www.aclweb.org/anthology/W12-2103.pdf.
[154] Tamsyn P Waterhouse. “Pay by the bit: an information-theoretic metric for collective human
judgment”. In: Proceedings of the 2013 conference on Computer supported cooperative work. 2013,
pp. 623–638.url: https://research.google/pubs/pub40700.pdf.
[155] Janyce Wiebe, Theresa Wilson, Rebecca Bruce, Matthew Bell, and Melanie Martin. “Learning
subjective language”. In: Computational linguistics 30.3 (2004), pp. 277–308.url:
https://direct.mit.edu/coli/article-pdf/30/3/277/1798072/0891201041850885.pdf.
[156] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi,
Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer,
Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao,
Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. “Transformers:
State-of-the-Art Natural Language Processing”. In: Proceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing: System Demonstrations. Online: Association for
Computational Linguistics, Oct. 2020, pp. 38–45.doi: 10.18653/v1/2020.emnlp-demos.6.
[157] Xuhui Zhou, Maarten Sap, Swabha Swayamdipta, Yejin Choi, and Noah Smith. “Challenges in
Automated Debiasing for Toxic Language Detection”. In: Proceedings of the 16th Conference of the
European Chapter of the Association for Computational Linguistics: Main Volume. Apr. 2021,
pp. 3143–3155.url: https://aclanthology.org/2021.eacl-main.274.
[158] Minghao Zhu, Youzhe Song, Ge Jin, and Keyuan Jiang. “Identifying Personal Experience Tweets
of Medication Effects Using Pre-trained RoBERTa Language Model and Its Updating”. In:
Proceedings of the 11th International Workshop on Health Text Mining and Information Analysis.
Online: Association for Computational Linguistics, Nov. 2020, pp. 127–137.doi:
10.18653/v1/2020.louhi-1.14.
Appendices
A.1 Chapter 3
A.1.1 Study 1
A.1.1.1 Test and Annotation Items
Tables A.1 and A.2 respectively present the 3 test items and all annotation items labeled by annotators.
Test Item Expected Label
I want to see Sandirians raped to death collectively. Hate
Now if we could only get a few jew heads on pikes that would be great. Hate
Over 99% of Syrian refugees admitted to UK in 2017 were Muslims. Non-Hate
Table A.1: Test items in the annotation survey; participants were filtered based on their correct answers to these items.
Annotation Item Target
Texans in the vast plains, rivers, and the farmlands must gather, arm up, and declare
independence from the United States of America. We are a nation of warriors that fell
at the Alamo only to rise victorious, just as the Spartans did in the narrow pass against
the Persians. The USA needs us. We do not need those sickening f[**]ktards.
Immigrant
So if robots can plant the food, pick the food, and serve the food, why do we need illegal
aliens again? Not to vote democrat. They have already moved on to simply changing
the counts.
Immigrant
The Kalergi Plan for White Genocide, (funded by George Soros), through unending
Turd World Mass Migration into ALL White Nations. They Come. They breed, rape,
terrorize, defecate, leech, infect and plunder... WE DIE!!!!
Immigrant
Lesson: be careful what you wish for. Especially if you are an idiot. Never could figure
out how people with such a better work ethic come from such shit hole countries? They
are like Superman. Get near their home countries & they become worthless, but super
worker here? Strange how that happens. Like Work kryptonite.
Immigrant
Go back to your shithole and learn how to be civilized. Immigrant
Everything on Instagram is in Portuguese now! Build that Wall! Immigrant
You know your country is MULTICULTURAL... When there’s armed patrols on the
streets.
Immigrant
A church in Berlin was damaged after unknown attackers pelted stones at the building
while shouting “Allahu Akbar!”
Muslim
“A place to discuss Jesus with fellow clean-living Christians”. lol, that description. But
wasn’t Christianity responsible for the church pedophilia?
Muslim
I have no problem with how China treats these bastard Uighurs. They are only treating
them like how the rest of us get treated by muzzies.
Muslim
Look forward to reading UR next book. My 1st thought is that we are in this situation
through “cock-up” rather than “conspiracy”. Or more precisely, “cock-up” on our side
& something that’s not quite conspiracy on the Muslim side: them just acting as they
always do. We should have learned from History, but we were complacent.
Muslim
Muslims are like a virus. We need the vaccine NOW. Muslim
Islam doesn’t seek to assimilate into ANY culture, it seeks to assimilate cultures into it,
you’re naive to believe any differently with 1400 years of evidence.
Muslim
Five alleged Muslim extremists accused of training children to carry out school shoot-
ings were released on bond after their attorney complained that the men were only
being targeted because they’re “black and Muslim,” as opposed to “white and Chris-
tian.”
Muslim
Commies, They need to be driven from this world. Now. And forever. Misanthrope! Communist
There are two types of socialists: the evil socialist, and the stupid socialist. Evil socialists
are despots who seek control. Stupid socialists think socialism works and is good.
Communist
Tulsi Gabbard is basically an out-and-out socialist and open-borders zealot. That’s not
crazy enough for modern D’s, though. You have to also want war with Russia. And the
poor girl just couldn’t get her mind right on that issue.
Communist
This is Antifa, folks. Filthy, degenerate communist subhumans. Every single one. Communist
Who cares what John Dean says. The commies keep living in the past. Communist
IF the libtards and socialists (also known as libtards) keep trying to bring down OUR
President, me and 63,000,000 of my closest friends will be in your face BIGGLY. And we
know what to do because I am Jungle Warfare qualified and I did bring home 3 Bronze
Stars from the Nam... AND I have never forgotten my oath.
Communist
Yesterday, Judge Naomi B ruled that Trump can’t block commies/animals from his pri-
vate Twit acct because 1st Amend.
Communist
“Top psychiatrist: today’s left are mentally ill”. We already knew this, but it’s nice to
have it officially confirmed.
Liberal
Now that Infowars has been hit by the Silicon Valley, I am amazed Twitter didn’t join
in. But they are the same liberal far left trash and probably wanted to join the other
kids kicking a dog when it’s down. Scum.
Liberal
Bitching on social media may be cathartic, but the best way to fight the biased lefty press
is not to watch, mention, click, link, post, or refer to their lefty crap. Even by complain-
ing (mentioning), you’re giving them data to sell to their advertisers. Take away their
ability to sell your data, take away your eyeballs and they’ll go out of business over
time.
Liberal
Popular leftist terms that do not exist: - Toxic Masculinity - Patriarchy - White Priv-
ilege. No matter how long the gender studies majors at McDonald’s try to push this
propaganda we must push back with facts.
Liberal
“I find wearing these little yellow stars on the chest is a good way to expiate our past
sins, it reminds us of our privilege and keeps the memory of our crimes alive now that
the last nazi is dead.” Conservative, circa 2025.
Liberal
Liberals: These bikini coffee chains are sexist and a bad influence on impressionable
young girls. Also Liberals: Let’s take our kids down to the Folsom street fair where
they can see transvestites swinging their junk around and getting whipped in public.
Liberal
That hair probably costs over $200 including the trip to the salon, and she’s holding an
$800 iPhone... She’s wearing makeup and jewelry, too, while she howls & cries about
how oppressed she is. Liberals have no sense of reality.
Liberal
He did fuck all except promoting violence against white people, giving money to ter-
rorism and trying and starting a race war... Useless President who still comes up first
on auto-correct after typing President...
African-American
Black Conservatives on Twitter: CNN’s @JeffreyToobin called Antifa, a left-wing po-
litical terror group with few, if any, black participants, an “African-American organiza-
tion.” These violent stereotypes are racist and wrong. Using it to smear POTUS Trump
— even worse!
African-American
Planned Parenthood is fighting back against Blacks catching on to what Margaret
Sanger was all about: Limiting the Black population. It’s why she formed PP in the
first place. So now, billboards making it look bright and shiny and liberating to kill off
their unborn.
African-American
Guardian journalist, Daniel José Camacho, spews anti-white hatred after Super Bowl.
(Guardianista nonwhite journalists openly now show their anti-white racism).
African-American
It doesn’t matter if there are non-whites who share some of our nationalist views. they
are still different from us. Should they be allowed in our ethnostate? If it were up to me,
the answer is no. I don’t support a policy of, “anyone who’s like Joe the Jew is welcome
in our white ethnostate.” The whole idea of an ethnostate is to preserve our genetics
and identity.
African-American
Mixing cultures is an issue, you end up with a mutt race. African-American
Western world is built on RACISM – BBC Newsnight guest makes SHOCK announce-
ment. If not for the white man this “professor of black studies”, yeah I know LOL, would
be walking naked amongst the lions & zebras in Apefreaka.
African-American
Horrific! A friend and former business partner who also is a retired policeman sent this
along to me today. It describes a case he had, which was the wreckage left behind by
one homosexual priest. Predators. Gay predators. Which sounds like a joke.
Gay
People who say “nobody’s born a racist, you have to be taught”, are also the ones who
tell their children to be gay degenerates and killing white people is right.
Gay
Romney is remembered for the Obamacare prototype, for not fighting Gay Marriage,
Speaking in front of Planned Parenthood, Being part of the Vulture Capital firm Bain,
that destroyed companies and US Jobs with debt, and showing contempt for both the
poor and the Deplorables.
Gay
When even a flag triggers them, then you know that Hollywood is full of butthurt
snowflakes!
Gay
And here we see the fa***t it its natural habitat Gay
I said homosexuality is NOT natural and have been suspended until I take down my
tweets, they will have a long wait, I hardly use twitter.
Gay
Pornography is a filthy tool by ((them)) to normalize miscegenation, homosexuality,
and other degenerate perversions.
Gay
Well, yea, but they also believe the greeks were Israelites... Jew
If they were only work camps that got twisted later on into death camps, does that mean
the jew sees being made to finally work and not be a leech, as the *same* as death?
Jew
In 50 years either humanity will exist or Israel will exist. Not both. Jew
Why do you follow Jew hater Cantwell? Oh, don’t tell me your friend Danny is Jewish.
So was Soros.
Jew
Trump is the king of neo nazis and I am a true neo nazi. You’re a fa***t jew boomer. Jew
Saying that the Holocaust happened makes you an Anti-Semite because it means you
wanted 6 million jews to die.
Jew
Why do people here believe the Holocaust happened? Why would you WANT TO
BELIEVE SUCH AN ATROCITY HAPPENED TO THE JEWS? Of course it didn’t happen,
otherwise the world wouldn’t have Israel and all the Nobel prize winners that came after
WW2.
Jew
Sharia May is to set up a new unit to deal with fake news. Stupid woman, if she wants
to stop fake news, just shut down the BBC.
Woman
Feminism leads to a decrease in the birth population which then requires mass migra-
tion from the 3rd world to support the aging population since feminists aren’t having
enough children to replace the population.
Woman
Feminism is an infection! Woman
Damn that bitch is dumb.. with her pink hair, fucking Garth glasses and shitty old
novelty guitar.
Woman
Front Hole’, Please explain the objectification of women in the current context. Thanks. Woman
Perfect example of why women shouldn’t be voting. Woman
Feminism is now Female Supremacism as women have all the rights that men have and
they still have special privileges in the courts and in our culture. Cultural Marxists
who hate America, Whites, and the West are the leaders and preachers of Third Wave
Feminism.
Woman
Table A.2: All annotated items in Study 1: for each social group under study, 7 social media posts mentioning that social group are considered in the study.
A.1.1.2 Study of All Annotators
We replicate the results of Study 1 on the whole set of participants (N = 1,228). We first investigated the relation between participants' social stereotypes about each social group and the number of hate speech labels they assigned to items mentioning that group. The result of a cross-classified multi-level Poisson model, with the number of hate speech labels as the dependent variable and warmth and competence as independent variables, shows that a higher number of items are categorized as hate speech when participants perceive that social group as high on competence (β = 0.02, SE = 0.005, p < .001). In other words, a one-point increase in a participant's rating of a social group's competence (on a scale of 1 to 8) is associated with a 1.9% increase in the number of hate labels they assigned to items mentioning that social group. However, warmth scores were not significantly associated with the number of hate speech labels (β = 0.01, SE = 0.006, p = .286).
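For illustration, the sketch below is not the analysis code used in this thesis; it shows how a simplified fixed-effects Poisson regression of this form can be fit with statsmodels and how a coefficient translates into a percentage change in the expected count. The column names (n_hate_labels, warmth, competence) are hypothetical, the data are simulated, and the crossed random effects for participants and social groups included in the reported models are omitted for brevity.

# Minimal sketch (not the original analysis code): a fixed-effects Poisson
# regression relating stereotype ratings to hate-label counts. The reported
# models are cross-classified multi-level models; annotator- and group-level
# random effects are omitted here.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per (participant, social group) pair.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "warmth": rng.integers(1, 9, size=400),      # ratings on a 1-8 scale
    "competence": rng.integers(1, 9, size=400),  # ratings on a 1-8 scale
})
df["n_hate_labels"] = rng.poisson(np.exp(0.5 + 0.02 * df["competence"]))

model = smf.glm("n_hate_labels ~ warmth + competence", data=df,
                family=sm.families.Poisson()).fit()

# A Poisson coefficient is a log rate ratio: a one-point increase in the
# predictor multiplies the expected count by exp(beta).
beta = model.params["competence"]
print(f"Percent change per one-point increase in competence: "
      f"{100 * (np.exp(beta) - 1):.1f}%")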
We then analyzed participants' group-level disagreement for items that mention each social group. The results of a cross-classified multi-level logistic regression, with the group-level disagreement ratio as the dependent variable and warmth and competence as independent variables, show that participants disagreed more on items mentioning a social group that they perceive as low on competence (β = −0.17, SE = 0.034, p < .001). In other words, a one-point decrease in a participant's rating of a social group's competence (on a scale of 1 to 8) is associated with a 15.5% increase in the odds of disagreement on items mentioning that social group. Contrary to the original results, warmth scores were also significantly associated with the odds of disagreement (β = 0.07, SE = 0.036, p = .044).
Finally, we compared annotators' relative tendency to assign hate speech labels to items mentioning each social group, as calculated by the Rasch models. As mentioned before, by tendency we refer to the ability parameter calculated by the Rasch model for each participant. We conducted a cross-classified multi-level linear model to predict participants' tendency as the dependent variable, with each social group's warmth and competence as independent variables. The result shows that participants demonstrate a higher tendency (to assign hate speech labels) on items that mention a social group they perceive as highly competent (β = 0.04, SE = 0.010, p < .001). Warmth scores were only marginally associated with participants' tendency scores (β = 0.02, SE = 0.010, p = .098). Except for the significant association of warmth stereotypes with the odds of disagreement, the results are the same for both analyses.
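For reference, the tendency (ability) parameter is the annotator-level parameter of a Rasch model [131]; in the standard dichotomous form, the probability that annotator j assigns a hate label to item i depends on the annotator's ability θ_j and the item's difficulty b_i:

P(X_ij = 1 | θ_j, b_i) = exp(θ_j − b_i) / (1 + exp(θ_j − b_i))

The exact parameterization used in the analyses may include additional facets (as in faceted Rasch measurement, e.g., [87]), but the ability parameter plays the same role of an annotator's overall tendency to assign the hate label.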
A.1.1.3 Implicit Bias
To analyze the impact of our participants' implicit biases on their annotation judgments, each participant was assigned to complete an Implicit Association Test [65, IAT]. Each IAT assesses participants' implicit bias toward one of the 8 groups studied in Study 1 (~150 participants per IAT). The IAT [65] is a computer-based task designed to measure the strength of automatic associations between two opposing target categories and two opposing attributes. In each trial, participants are instructed to categorize a stimulus (e.g., a word) as quickly and accurately as possible into one of two target categories or two attributes. In a first combined block, the two target categories and the two attributes are paired according to a certain associative pattern. In a second combined block, the location of the target categories is switched. A measure of the implicit associations can be obtained by computing the difference between the mean latencies of the first and the second combined block. Previous work has shown that IAT scores can be used to assess attitudes and stereotypes, showing adequate levels of criterion validity [66] and less proneness to impression management concerns compared with self-report measures [149].

In our study, the IAT provided a mechanism to quantify each participant's bias in preferring one group over the other in a randomly selected group pair. The social group pairs assessed in the IAT were the social groups mentioned in the hate speech items (Mexican vs American, Christianity vs Islam, Communism vs Capitalism, Liberal vs Conservative, White vs Black, Gay vs Straight, Female vs Male, and Jewish vs Christian). As a result, the calculated IAT score for each participant represents their implicit bias (a value between -1.5 and +1.5) with respect to a specific social group, such that a positive value represents a higher bias against that group.
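As a rough sketch of how such a score can be computed from trial-level latencies, the snippet below implements a simple D-style measure: the difference between the mean latencies of the two combined blocks, scaled by their pooled standard deviation. This is not the exact scoring algorithm used in the study (which may apply additional trial-level corrections), and the column names are hypothetical.

# Minimal sketch of an IAT-style D score (not the exact scoring procedure used
# in the study): difference of the mean response latencies in the two combined
# blocks, divided by the pooled standard deviation of all latencies.
import numpy as np
import pandas as pd

def iat_d_score(trials: pd.DataFrame) -> float:
    """trials has hypothetical columns 'block' ('compatible'/'incompatible')
    and 'latency_ms', one row per trial."""
    trials = trials[trials["latency_ms"] < 10_000]  # drop implausible latencies
    compat = trials.loc[trials["block"] == "compatible", "latency_ms"]
    incompat = trials.loc[trials["block"] == "incompatible", "latency_ms"]
    pooled_sd = np.std(pd.concat([compat, incompat]), ddof=1)
    # Positive values: slower responses in the incompatible block, i.e., a
    # stronger automatic association with the compatible pairing.
    return float((incompat.mean() - compat.mean()) / pooled_sd)

# Hypothetical usage with simulated latencies:
rng = np.random.default_rng(1)
example = pd.DataFrame({
    "block": ["compatible"] * 20 + ["incompatible"] * 20,
    "latency_ms": np.concatenate([rng.normal(700, 120, 20),
                                  rng.normal(850, 150, 20)]),
})
print(iat_d_score(example))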
We investigated the relation between participants' implicit bias about each social group and the number of hate speech labels they assigned to items mentioning that group. We conducted a multi-level Poisson model, with social group as the level-1 variable, the number of hate speech labels as the dependent variable, and the implicit bias score as the independent variable. The result shows that a higher number of items mentioning a social group are categorized as hate speech when participants show a higher implicit bias score for that group (β = 0.09, SE = 0.027, p < .01). In other words, a one-point increase in a participant's implicit bias score for a social group is associated with a 0.1% increase in the number of hate labels they assigned to items mentioning that social group.
A.1.2 Study 2
First, in order to show that annotating hateful rhetoric leads to high levels of disagreement, even among expert annotators, we compared the occurrence of inter-annotator disagreement between hateful and non-hateful social media posts. Two-sample permutation tests (5,000 permutations) based on mean and median suggest that annotators disagree more on posts that are labeled as hate speech (M = 0.50, Md = 0.67, SD = 0.28) compared with those that are labeled as not hateful (M = 0.13, Md = 0.00, SD = 0.26) by the majority vote (p < .001). This effect is not surprising, since hate speech annotation has been shown to be a non-trivial task that requires careful consideration of the social dynamics between the person who generates a piece of text and the person who perceives it. In addition, recognizing non-hate content is substantially easier than flagging inflammatory and potentially hateful content [147].
We then assessed the association of textual mentions of social groups with inter-annotator item disagreements. Two-sample permutation tests (5,000 permutations) based on mean and median suggest that posts that mention social group tokens triggered more disagreement (p < .001): in the presence of social group tokens, the average item disagreement is 0.30 (Md = 0.00, SD = 0.32), as opposed to an average item disagreement of 0.13 (Md = 0.00, SD = 0.26) for posts without social group tokens.
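As a minimal sketch of the testing procedure (not the original analysis code), the function below computes a two-sided permutation p-value for the difference in mean disagreement between two groups of posts by repeatedly shuffling the group assignment; the variable names are hypothetical, and the reported analyses additionally repeat the test with the median as the statistic.

# Minimal sketch of a two-sample permutation test on mean item disagreement.
import numpy as np

def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation p-value for the difference in means."""
    rng = np.random.default_rng(seed)
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign posts to the two groups
        diff = pooled[:len(a)].mean() - pooled[len(a):].mean()
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_permutations

# Hypothetical usage: disagreement scores for posts with vs. without SGTs.
# obs_diff, p_value = permutation_test(disagreement_sgt, disagreement_no_sgt)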
To examine the interaction between the presence of social group tokens and the hate speech label, we conducted a two-way ANOVA comparing item disagreements, with the binary hate label and the presence of social group tokens as the factors. Both factors (presence of a social group token and hateful content) are significantly associated with item disagreement (p < .001 for both). Figure A.1 shows the distribution of disagreement scores for different subsets of the dataset. Generally, mentioning a social group token leads to higher inter-annotator disagreement; however, in hateful posts, mentioning a social group token actually results in annotators agreeing.

Figure A.1: Disagreement scores on different subsets of the dataset, based on whether the posts include hate speech and social group tokens (SGT). The horizontal lines represent error bars.
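As a minimal sketch of the two-way ANOVA above (again not the original code), such an analysis can be fit with statsmodels; the column names (disagreement, is_hate, has_sgt) are hypothetical, and dependencies between items are ignored.

# Minimal sketch: two-way ANOVA on item-level disagreement with the binary
# hate label and social-group-token presence as factors (plus interaction).
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

def disagreement_anova(items: pd.DataFrame) -> pd.DataFrame:
    """items: one row per post, with hypothetical columns 'disagreement'
    (float), 'is_hate' (0/1), and 'has_sgt' (0/1)."""
    model = smf.ols("disagreement ~ C(is_hate) * C(has_sgt)", data=items).fit()
    return anova_lm(model, typ=2)  # Type II sums of squares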
Abstract
Subjective annotation tasks are inherently nuanced due to annotators' individual differences in understanding of language. Training Natural Language Processing (NLP) models for making predictions in subjective tasks based on human-annotated datasets is also marked by challenges; model decisions are rarely generalizable to judgements of unseen annotators. Therefore, modeling an acceptable interpretation of subjective tasks requires integrating psychological dimensions that capture individual differences in perceiving language for each specific task.
This thesis provides an alternative approach for modeling subjective NLP tasks by tailoring representations based on annotators' varying perceptions of language.
First, NLP datasets for subjective tasks are investigated to demonstrate how aggregating annotation into single ground truth labels impacts the representation of different perspectives in language resources. Then, the impacts of annotators' social biases are explored to capture the sources of human-like biases in annotated datasets and language classifiers. And lastly, alternative approaches for incorporating annotators' individual differences into modeling their annotation behaviors are presented.
In a broad sense, this thesis provides evidence against the propriety of modeling an aggregated label for subjective language understanding tasks. Demonstrating that this common practice in NLP modeling leads to encoding normative social biases into language resources and NLP models, this thesis provides frameworks, and motivates future efforts, for incorporating varying perspectives of language into designing NLP datasets and models.