Computational Narrative Models of Character Representations
to Estimate Audience Perception
by
Victor Raul Martinez Palacios
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Science)
August 2021
Copyright 2021 Victor Raul Martinez Palacios
To Ana Luisa,
Victor Hugo,
and Ana Thaís
Acknowledgements
I wish to take this chance to express my sincere gratitude to the faculty, colleagues, friends, and
family that made all of this possible.
To my advisor, Professor Shrikanth Narayanan: I am extremely grateful for your guidance and
your complete confidence in my skills as a researcher, programmer, and communicator. Thank you
for providing me with this unique and crazy opportunity.
I am deeply indebted to two USC professors who went above and beyond to make sure I had
all the favorable circumstances needed to succeed: Francisco Valero-Cuevas and Panayiotis
Georgiou. Francisco, it was thanks to you and your firm trust that I got to meet the
wonderful people at SAIL. Thank you for being the initial piece that set it all in motion. Panos,
thank you for being the unofficial advisor to everyone at SAIL. I really appreciate all the support,
trust, and advice you have given me over the years.
My gratitude goes to all the amazing members of my quals/proposal and dissertation committees:
Professors Morteza Dehghani, Bistra Dilkina, Jonathan Gratch, Yan Liu, Jonathan May,
Paul Rosenbloom, and Violet Peng. Your keen observations greatly helped in shaping the overall
framework of this work, leading to an immeasurable improvement in the quality of this dissertation.
This project would not have been possible without the help of some extraordinary collaborators:
Yalda Uhls, thank you for your valuable social scientific insights and perspectives that helped me
look beyond the data to the real-world impact of my work; David Atkins, Zac E. Imel, and all the
people involved in the DEPTH project. You really defined for me what it means to work hard and
party hard. I will always look back fondly on all the fun meetings we had.
To all my colleagues and friends who supported me throughout this adventure. My mentors at
SAIL: Dogan, Nikos Malandrakis, Jimmy, Colin, and Naveen. Thank you for teaching me the
ropes of Ph.D. life. My colleagues and friends: Krishna, Amrutha, Nikos, Ardulov,
Karan, Karel, and Manoj. I am extremely lucky to have met you and to be able to call you my
friends. Thank you for all the coffee breaks and for helping me bounce off all the crazy ideas.
To the friends that I have come to consider family. Pedro, Ramon, Jesus, Jose Manuel, Montse,
Yus, Mire, Mike, Hugo, Alan, Fer B. and Hector. Thank you for always being a constant in my
tumultuous life. And to the folks that I was fortunate enough to meet on this side of the border.
Axel, Juan Carlos, Octavio, Santiago, Anaya, Rodrigo, Sharon and Katja. You certainly made
your mark on my life in L.A., improving it vastly.
Finally, to all my family. To my grandmother, aunt Diana and uncles Pablo and Pedro. Thank
you for keeping me close to your hearts, always reaching out when things seemed difficult. To my
uncle Pablo and aunt Mary Carmen, for teaching me the importance of family bonds and their
transcendence over geographical borders. And finally, to my mother, father, and sister, to whom
I dedicate this work. You always made sure I had everything I needed, pushing me to follow my
dreams.
This achievement is thanks to you.
Contents
Dedication ii
Acknowledgements iii
List of Tables x
List of Figures xii
Abstract xiv
1 Introduction 1
1.1 Characters and the Audience Experience . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 The Present Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Research Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5 Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Prior Work 7
2.1 Media Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.1 Under-representation in Media . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.2 Gender Stereotypes in Media . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.3 Portrayals of Risk-Behaviors . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Methods for Media Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.1 Manual Annotations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.2 Automatic Content Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Beyond the Media Domain: Narratives in Psychotherapy . . . . . . . . . . . . . . . . 15
2.3.1 Working Relations in Therapy . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.2 Automatic Assessment of Psychotherapy . . . . . . . . . . . . . . . . . . . . . 17
I Character Representations from Dialogues 18
3 Violence Rating Prediction from Movie Scripts 19
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.1 Movie Screenplay Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.2 Violence Ratings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.3.1 Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3.2 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4.1 Model Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4.2 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4.3 Sensitivity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.5.1 Classification Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.5.2 Ablation Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.5.3 Sensitivity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.5.4 Attention Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4 Multi-Task Rating Prediction from Movie Scripts 35
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2.1 Semantic representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2.2 Sentiment representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2.3 Role of Movie Genre . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.2.4 Ratings Prediction Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.3 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.3.1 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.4 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.4.1 Model Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.4.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.4.3 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.5.1 Classification Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.5.2 Performance Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.6 Co-Occurrence Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
II Character Representations from Actions 52
5 MovieSRL: Automatic Identification of Character Actions 53
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.2.1 Problem statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.2.2 Proposed Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.2.3 Domain Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.3 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.3.1 Dataset construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.3.2 Manual Annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.4.1 Model Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.4.2 Model Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.5.1 Semantic Role Labeling Performance . . . . . . . . . . . . . . . . . . . . . . . 67
5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
6 Boys don't cry (or snuggle or dance): A large-scale analysis of gendered actions
in film. 70
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.2 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.3.1 Character Gender Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.3.2 Regression Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.4 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.4.1 Gender Estimation Performance . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.4.2 Regression Model Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.5 Large-scale Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.5.1 Agency and Gender . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.5.2 The Male Gaze Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.5.3 Representations of Disability. . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.5.4 Portrayals of Emotion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.6.1 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
III Character Representations from Roles 86
7 Identifying Therapist and Client Personae from Alliance Estimation 87
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
7.2 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
7.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
7.3.1 Character Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
7.3.2 Personae Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
7.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
7.4.1 Linear Mixed Effect Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
7.4.2 Regression Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
7.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
7.6 Personae Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
7.6.1 Persona distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
7.6.2 Topic contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
7.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
8 Conclusion and Future Work 98
8.1 Characters' Representations and Audience's Experience . . . . . . . . . . . . . . . . 98
8.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
References 100
Appendices 115
A SRL System Performance 116
B Number of actions per genre 118
C Results for Agent's Actions 119
D Results for Patient's Actions 127
E Results for Agent–Patient interactions 129
List of Tables
3.1 Description of the Movie Screenplay Dataset. . . . . . . . . . . . . . . . . . . . . . . 22
3.2 Raw frequency (no.) and percentage (perc.) distribution of violence ratings . . . . . 23
3.3 Frequency (no.) and percentage (%) distribution of violence ratings after encoding
as a categorical variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.4 Summary of feature representation at utterance and movie level. TF - Term frequency,
IDF - Inverse document frequency, Functionals - Mean, Variance, Maximum, Minimum,
and Range . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.5 Classification results: 5-fold CV precision (Prec), recall (Rec) and F1 macro average
scores for each classifier. In parentheses: the number of units in each hidden layer. . 30
3.6 5-fold CV ablation experiments using GRU-16. The Δ column shows the difference between
the original model and individual ablations. '-' indicates removing a certain feature 31
3.7 5-fold cross validation F scores (macro-average) for the two best models by varying
model parameters. |V| = vocabulary size for word and character n-grams. . . . . . 32
4.1 Movie content rating counts and percentage distribution. Median split was induced
on all ratings to balance class distribution. . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2 10-fold cross validation multi-task classification performance. Precision (P), recall
(R) and F1 macro average scores reported (percentages). Models trained independently
for each task are denoted by a double line. The best model (shown in bold)
performs significantly better than baseline for violence (perm. test n = 10^5, p = 0.002)
and substance-abuse (n = 10^5, p = 0.006). . . . . . . . . . . . . . . . . . . . . . . . 46
4.3 10-fold CV ablation experiments using Bi-GRU (16). F1 macro average score (percentage)
reported. In parentheses: difference between full model and the individual
ablation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.1 Summary of system performance results. Precision (Prec), Recall (Rec) and
macro-average F1 scores reported from the test set. Additional results can be found in
Table A.1 in the appendix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.1 Descriptive statistics of the dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.2 Gender heuristics classication report. . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.3 Distribution of agents and patients according to their gender . . . . . . . . . . . . . 79
7.1 Number of sessions per client, per therapist and per (client, therapist) pair in the
available dataset. Support is the total number of clients, therapists, or pairs. . . . . 90
7.2 Cross-validation estimates of the mean (μ) and standard deviation (σ) of MSE for
regression models (lower is better). The persona model performs significantly better
than the baselines. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
7.3 Mean, standard deviation and mode for persona distribution per participant and
joint distribution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
A.1 SRL System Performance Complete Results . . . . . . . . . . . . . . . . . . . . . . . 117
B.1 Distribution of the number of actions per genre . . . . . . . . . . . . . . . . . . . . 118
C.1 GLMM results for the agent's actions. . . . . . . . . . . . . . . . . . . . . . . . . . . 126
D.1 GLMM results for the patients' actions . . . . . . . . . . . . . . . . . . . . . . . . . . 128
E.1 GLMM results for agent–patient interactions. . . . . . . . . . . . . . . . . . . . . . 132
List of Figures
2.2.1 Two examples of semantic role annotation for action, agent and patient. Notice the
difference between patients (animated) and objects (non-animated). . . . . . . . . . 14
3.3.1 Recurrent Neural Network with attention: Each utterance is represented as a vector
of concatenated feature-types. A sequence of k utterances is fed to an RNN with attention,
resulting in an H-dimensional representation. This vector is then concatenated
with the genre representation and fed to the softmax layer for classification. . . . . . 27
3.5.1 Examples of utterances with highest and lowest attention weights for a few movies.
green: correctly identified; blue: depends on context (implicit); red: misidentified . . . 34
4.2.1 Multi-task model for content rating classification: Each utterance is represented by
semantic and sentiment features, fed to independent RNN encoders. The sequences
of hidden states from the encoders serve as input for task-specific layers (gray boxes). 39
4.2.2 Risk behavior rating co-occurrence: on average, when one risk-behavior rating
increases, so do the others. Error bars denote 95% confidence intervals. . . . . . . . 40
4.5.1 10-fold cross validation multi-task classication performance based on GRU dimen-
sion (d) and sequence length (m). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.6.1 Attention weights for violence and sex for (a) The Exorcist, and (b) From Russia
With Love. Sex-sentiment (green) leads the violence-semantics (red) by 31 ( = 0.23)
and 203 ( = 0.29) utterances respectively. . . . . . . . . . . . . . . . . . . . . . . . . 51
5.1.1 Two examples taken from the dataset. In contrast to previous SRL datasets, we
explicitly identify characters as the sole sources and targets of the actions. . . . . . . 54
5.2.1 Our proposed SRL system. Starting at the bottom, the system takes a sentence as
input and obtains a highly-contextualized representation for each token (sub-words)
using the BERT transformer. The sequence of representations is fed into an RNN and
softmax layers for sequence labeling. As part of the post-processing, a set of heuristics
aggregates multiword expressions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.3.1 Example of the typical structure of a movie script. Reproduced with permission
from Slugline (https://slugline.co/). . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.3.2 Labeling Task: Annotators are presented with a sentence and an action. They are
asked to select the agent (source) and patient (target) of the action. For cases
where one of these is missing, the annotator has the option to check the 'Does not
say' box. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.5.1 Actions with the largest coefficients for each of our models. Each coefficient (β)
represents the change in the log odds of the character being male/female when the frequency
of the action changes by one unit. Actions with negative coefficients are more likely to be
portrayed by female characters (purple); actions with positive coefficients are more likely
to be played by male characters (teal) . . . . . . . . . . . . . . . . . . . . . . . . . . 81
7.2.1 Histogram of the therapeutic alliance ratings. . . . . . . . . . . . . . . . . . . . . . . 90
7.3.1 Personae and topic distributions from psychotherapy sessions: Personae are distributions
over the topics of conversation (shown in green). Sets are learned for clients
and therapists independently (shown in blue and light pink respectively). Therapist
and client are each assigned a single persona per session. . . . . . . . . . . . . . . . 92
Abstract
Stories play an important part in how we weave together day-to-day events to make sense
of what happens around us and in the lives of others. They help shape our identities, inform
our world view, and allow us to understand other people's perspectives. The impact of these
stories, whether measured through economic gains, emotional response, or societal influence, is
closely tied to the audience's experience of these narratives. As such, understanding an audience's
narrative experience could allow for computational models that predict a story's impact at both
the individual and the societal level. It may also provide an avenue through which intelligent agents
might infer interpersonal relationships, human behaviors, and current societal norms. However,
current computational models are limited in their ability to unravel the complex interaction that
defines the spectator's narrative experience.
While the exact mechanisms behind an audience's narrative experience remain unidentified,
we know that the audience's relationship to the characters plays a central role. For example, the
audience's emotional response has been linked to their identification with the main characters,
their comprehension of the characters' motives, as well as the resolution of the characters' fates.
In this dissertation, I propose that computational models could provide a better estimate of the
audience's narrative experience by considering how the characters of the story are being portrayed.
To this end, this dissertation presents my contributions in models that automatically infer character
representations from the aspects by which audiences perceive characters: from their dialogues, from
their actions and behaviors, and from their roles. Each aspect is explored through a particular
task. For representations from dialogues, state-of-the-art NLP techniques are developed to predict
movie ratings of violent and risk-behavior content. For representations from actions, I construct
datasets and labeling models for action-agent-patient triads that enable large-scale analysis of
gender biases in the media. Finally, for role representations, I present a model that leverages
character representations from personal narratives to automatically infer a better estimate of the
client's perception of a shared bond during psychotherapy. In each task, our results demonstrate
that our models present a significant improvement over the previous state of the art. This
provides empirical support to our claim that character representations may be leveraged for the
information they carry about an audience's narrative experience.
Chapter 1
Introduction
Stories are universal. Through stories, we share passions, fears, sadness, hardships, and joys that
help us understand ourselves and others better. Every day we come across several stories, each
with its own context, goal, and determination: when we share our daily experiences around the
dinner table; in the news that informs us of the relevant events happening across the globe; when we
narrate pieces of our life story to a therapist; while role-playing a fantastic adventure in imaginary
worlds through the magic of video games; or when we are captivated by the life events of real or
fictional characters narrated in movies and books. Stories are told to facilitate the transmission of
culture, strengthen the social bonds of human society, and provide a reliable medium for preserving
information, one that has carried on across millennia (Rose, 2017).
To tell a compelling story, storytellers must present audiences with an engaging conflict, one
that draws the most appropriate and emotionally lively response from its audience. As Bruner
(1986) argues, the essence of the storyteller's work is to construct a landscape in the reader's mind
by shaping the constituents of the story (i.e., characters, plot, conflict and setting). This is by no
means a simple task. For one, a story is not just a shallow structure to lay out the plot by means
of language and sophisticated narration techniques, what Barthes and Duisit (1975) refer to as
the discourse or syuzhet. It also encompasses the fabula, a deeper structure of the chronological
sequence of events as they occurred in the time-space of the story that, when reconstructed by
the reader, conveys the original intent of the narrative (Propp, 1968). Part of the audience's
response to a narrative originates from the causal and temporal combination of events, characters
and actions captured across these structures (Barthes & Duisit, 1975; Propp, 1968). For another
thing, the prospective audience plays an active role in defining the story's meaning. Audiences are
not typically a monolithic, single-minded entity, but a collection of readers (or spectators), each
with their own social, cultural and personal contexts. These contexts play an essential role in
moderating that person's (and, by aggregation, the audience's) response to the narrative (Labov,
2013; Polanyi, 1981; B. H. Smith, 1980). Coming back to Bruner (1986), storytellers have to
construct a plot, characters, conflict and setting in such a way that:

[...] the plights [of the story] must be set forth with sufficient subjunctivity to allow
them to be rewritten by the [audience], rewritten so as to allow play for [their] imagination (p. 35)
From a computational perspective, the problem arises because we do not have good models for
understanding how narrative content changes the cognitive and affective state of a reader, watcher
or interactive participant (Riedl, 2017). While the lack of such models has a clear impact on
the entertainment and marketing fields (who would not want to know, beforehand, the emotional
impact that an ad or a movie might have on an audience?), it also presents an open challenge for
A.I. systems. Let me exemplify this by briefly discussing three possible applications for audience
models in A.I. systems. The first application is in content creation. Hypothetically, we could think
of a system that automatically generates educational lectures for a class of students and uploads
them to a video sharing platform. To maximize its impact (e.g., audience reach, retention, number
of views, revenue, etc.), the educational A.I. needs to know its audience. An audience model
would allow it to adapt its communication strategy depending on the cognitive differences,
development, and capabilities of its watchers (e.g., K-12 children vs. college students). A second,
and similar, application of audience models in A.I. can be found in the context of automatic
story generation. Here, machine learning models can be used to produce narratives that maximize
engagement. Given the broad range of people who might watch or interact with these automatically
generated stories, one of the things that our hypothetical system ought to consider is the
age-appropriateness of each narrative. This is only possible if the A.I. has a notion of the audience's
perception of that narrative, as well as the socially accepted themes for that audience. The
third and final example is in the context of conversational agents. By recognizing an audience
and their individual expectations, a conversational agent might be able to recall past interactions,
incorporate new topics into the conversation, or follow up on previous discussions. This can serve as
an inexpensive way for agents to emulate memory and user personalization (both of which have been
linked to an increased perception of rapport and overall ratings for the interaction (V. R. Martinez &
Kennedy, 2020)), and could prove beneficial in increasing the desirability of other assistive agents
such as Siri, Cortana, or Alexa (Riedl, 2017). It follows, then, that in order for A.I. models to
better understand the audience's perception of a narrative, they must start by unraveling the complex
interplay between the different structures of a story (i.e., its content and intended meaning); the
story's constituents (i.e., characters, plot, conflict, setting); and the expectations of its audience.
Within this context, my dissertation aims to narrow the gap in our collective knowledge by focusing
on the audience's narrative perception as a function of the characters' portrayals.
1.1 Characters and the Audience Experience
Many, if not most, stories follow a character or group of characters on their journey to resolve
a conflict, with a particular focus on the group's personal perspectives, difficulties, and the relations
between them. While a precise definition of a character is fuzzy at best, one can refer to the process
of characterization by which the audience attributes or infers qualities from the characters'
actions, speech, and appearance (Baldick, 1996). Characters play a central role in explaining
how audiences experience narratives, for example, by defining how they recall the narrative
and their emotional reaction to the plot (Bal, Butterman, & Bakker, 2011; Hoffner & Cantor,
1991). As Hoffner and Cantor (1991) argue, how viewers form impressions of characters "mediate
short- and long-term emotional reactions to depicted events and to characters themselves" (p. 63),
which "[...] promotes an understanding of the viewer's response to entertainment media" (p. 64).
Similarly, for Niemiec and Wedding (2013), the audience's narrative experience and the perception
of that narrative's subtext is influenced by their recognition of the strengths and virtues portrayed
by the characters. Anti-heroes notwithstanding, audiences relate to the characters that align
with their own norms of morality, worldview and other allegiances (the heroes of the story) and
dissociate from the exemplars of their undesirable traits (the villains) (MacDorman, 2019; Shafer &
Raney, 2012). As audiences become more familiar with a character, they begin to build a relation,
often experiencing stronger emotional reactions to the actions that the characters take or the things
that happen to them (Hoffner & Cantor, 1991). Within the context of TV and film, Cohen
(2001) conceptualizes this relational effect as identification, the mechanism through which "audience
members experience reception and interpretation of the text from the inside, as if the events were
happening to them" (p. 245). Further stating,

Identification is tied to the social effects of media in general (e.g., Basil, 1996; Maccoby
& Wilson, 1957); to the learning of violence from violent films and television, specifically
(Huesmann, Lagerspetz, & Eron, 1984); and is a central mechanism for explaining
such effects (p. 245)
Identification plays an important part in shaping aspects of our self and social identity. Identifying
with other people we encounter, both in real and fictional situations, is crucial to the development
of socialization skills during childhood (Mead, 1934). Additionally, identification helps individuals
to develop new social attitudes and update what they consider the norm (Cape, 2003). When we
identify ourselves with the characters we see on TV and film, for a brief moment we get to live the
life of others. This vicarious experience allows viewers to temporarily adopt external points of view
and experience alternative social realities (Cohen, 2001; Erikson, 1968), conceding them a mental
space to internalize surrogate ideas, images, attitudes and identities (Erikson, 1968; Mead, 1934).
Nevertheless, identification also makes us susceptible to media's long-term negative effects.
Because identification involves an internalization of a character's attributes, repeated exposure to
powerful and seductive imagery filled with negative biases and stereotypes could have a measurable
impact on our perceptions of self and society (Cohen, 2001). As Meyrowitz (1998) notes, even if
the identification is merely temporary, it may induce extreme behavior in adolescents, some
of which may gravely impact their later stages of life. This notion of ill effects from
repeated media exposure coincides with Gerbner's Cultivation Theory (Gerbner, Gross, Morgan,
& Signorielli, 1980), which suggests that repetitive exposure to a consistent set of
media messages gradually leads viewers to accept those messages and portrayals as reality. For
example, people with aggressive dispositions are influenced by exposure to media violence (Eyal
& Rubin, 2003), which increases the probability that they will participate in violent acts in real
life. Likewise, people who repeatedly watch media content that promotes traditional gender stereotypes
are expected to be more inclined to accept such stereotypes as truth (Jerald, Ward, Moss, Thomas,
& Fletcher, 2017).
1.2 The Present Dissertation
In this dissertation, I propose that by learning a character's representation, computational models
might be able to provide a better estimate of the audience's perception of a story. To this end, I
present four computational models that learn character representations from both fictional and
real-world narratives. The character representations learned throughout this work aim to capture
particular aspects of the way audiences perceive the characters of a story: through their
speech, actions, roles, and appearance. The computational models developed for this work can be
used to estimate the audience's perceptions of violent and risk-behavior content in movies, as well
as an individual's perception of working relationships in the context of psychotherapy.
Specifically, this work addresses the following two research questions:
RQ1 Can we design computational models to learn representations for the characters of a story?
RQ2 Can we leverage characters' representations to improve estimates of an audience's emotional
response to a narrative?
1.3 Research Statement
Characters' representations, learned from their actions, speech, appearance, and roles,
provide a way to estimate an audience's perception of a story, and its impact at an
individual and societal level.
1.4 Contributions
The main contributions of this work are:
Representations from Dialogue
Developed a computational model that estimates the audience's perception of violence from
characters' representations learned from dialogue.
Designed a multitask model to predict a viewer's perceptions of violent, sexual, and substance-abusive
content from characters' representations learned from dialogue.
Representations from Action
Implemented a transformer-based model to produce large-scale automatic labeling of character
actions and behaviors.
Formulated a statistical approach to estimate differences in the characters' representations of
actions and behaviors with respect to their assumed genders.
Representations from Roles and Traits
Engineered computational models that leverage characters' trait representation to estimate
real-life perceptions of working alliance between clients and therapists.
1.5 Structure
This dissertation is structured into three parts. The first part presents my contributions in learning
character representations from dialogue. In Chapter 3, I propose a computational model that
takes dialogues extracted from movie scripts to predict an audience's perception of movie violence.
In Chapter 4, I improve upon this model to predict perceptions of violent, sexual, and substance-abusive
movie content.
The second part presents the work on learning representations of characters' actions and behaviors.
In Chapter 5, I present a transformer-based model designed to automatically identify actions
alongside the characters playing the roles of agents and patients. In Chapter 6, I propose a statistical
approach to identify gender differences in the portrayal of actions and behaviors.
Finally, the third part showcases one of the possible applications of character representation
learning to real-world domains. In Chapter 7, I present our work on modelling some of the narrative
elements of psychotherapy, and how character representations (and their similarities) provide a
better estimate for the perception of a working relationship between client and therapist.
Chapter 2
Prior Work
2.1 Media Studies
Movies are often described as having the power to influence individual beliefs and values, as well
as to update an individual's existing social boundaries based on what is shown on screen as the
'norm' (Cape, 2003; Cohen, 2001; Erikson, 1968; Gerbner et al., 1980; Meyrowitz, 1998). This
influence has long sparked concerns about the potential effects of repeated exposure, particularly
the exposure of at-risk audiences to mature content in films. As such, researchers have placed a particular
focus on studying how exposure to content shown in the media might influence an audience.
These studies typically center on the frequency of particular demographic portrayals, stereotypes, and
characters engaging in risk behaviors.
2.1.1 Under-representation in Media
A fair share of media studies center on providing simple statistics to emphasize group under-representation
in the media, particularly that of historically under-represented groups such as
women, people of color, and LGBTQ+ people (England, Descartes, & Collier-Meek, 2011; Fast, Vachovsky,
& Bernstein, 2016; Gálvez, Tiffenberg, & Altszyler, 2019). For example, a manual content analysis
of the 101 top-grossing G-rated films from 1990 to 2005 revealed that fewer than one out of three
(28%) of the speaking characters (both real and animated) were female, and only 2.7 percent of
characters were depicted with a disability (S. L. Smith & Cook, 2008). With respect to race,
only 12.5% of all roles in top-grossing Hollywood movies are played by Black characters. The
representation gap between the Black population in the U.S. and their on-screen representation has
been closing with recent trends. However, other groups still lag behind: only 4.5% of
all speaking characters were Latino and 3% were Asian, 13.8 and 3 percentage points below the U.S.
Census respectively (Hunt & Ramón, 2020; S. L. Smith et al., 2019; S. L. Smith & Cook, 2008).
Unfortunately, simply increasing the number of appearances of marginalized groups will not be a
solution in itself, since these portrayals often emphasize intersectional stereotypes.
Increasing their frequency might actually exacerbate their problematic effects (Collins, 2011), for
instance, by perpetuating the notion that Black men are scary or angry, or that Latino (or Black) women
are loudmouthed and sassy (Schacht, 2019). In spite of an active effort devoted to tackling on-screen
under-representation, with a recent trend toward including more female and nonwhite
characters, these tropes still persist in Hollywood (Hunt & Ramón, 2020).
2.1.2 Gender Stereotypes in Media
Both experimental and correlational studies have found a strong link between exposure to media
content that features gender stereotypes and greater endorsement of traditional gender beliefs
(Davies, Spencer, & Steele, 2005; Signorielli & Kahlenberg, 2001). Conversely, if an individual
were repeatedly exposed to counter-stereotypical portrayals, one could expect a weaker endorsement
of such traditional gender beliefs. One example of this can be found in a 2018 report surveying
2,021 women who were regular viewers of the '90s hit TV series The X-Files, which featured actor Gillian
Anderson in the prominent role of Dr. Dana Scully. Among other results, heavy viewers of the show
reported far more positive beliefs about the importance of a career in STEM than non/light viewers
(21st Century Fox, The Geena Davis Institute on Gender in Media, and J. Walter Thompson
Intelligence, 2018). Despite the evidence of the beneficial effects of counter-stereotypical portrayals,
films continue to regard certain characteristics as 'masculine' (e.g., intellect, independence,
ambition, athleticism, bravery), usually condemning them as "inappropriate and unseemly
in women" (Giannetti & Leach, 1999). Most women actors appear as slim and attractive
figures (Fouts & Burggraf, 1999; Y. Zhang, Dixon, & Conrad, 2010), in overly emotional roles (Fast,
Vachovsky, & Bernstein, 2016), or as housewives (Paek, Nelson, & Vilela, 2011; Sink & Mastro,
2017) and victims (Clifford, Jensen III, & Petee, 2009; De Ceunynck, De Smedt, Daniels, Wouters,
& Baets, 2015; Stabile, 2009). Recent trends notwithstanding, Smith (S. L. Smith & Cook, 2008)
suggests that most on-screen female representations in film and TV can be reduced to the
dimensions of hyper-sexualized attractiveness and/or complete passiveness. In the former, characters
are to be lusted after by other (often male) characters and viewers alike; in the latter, characters
display no agency, motivation, or personality beyond that of the romantic interest. Similarly, Tasker
(1993) argues that films reduce the female role to the "threatened object", a convenient, yet still
significant, plot device whose sole purpose is to move the story forward by springing the male hero
into action. This trope is particularly common in action films, where women have traditionally
been fought over and avenged rather than doing the fighting or avenging themselves (Tasker, 1993).
2.1.3 Portrayals of Risk-Behaviors
Most of the surveyed studies agree that constant exposure to portrayals of characters engaging
in risky behaviors increases the willingness of children and adolescents to imitate similar behavior.
These portrayals of violence, sexual content, and substance abuse tend to include scenes of fighting,
bloodshed, and gunplay; intercourse and nudity; and alcohol, drug, and smoking use, respectively.
For example, a study of 2,321 seventh graders from 16 Southern California middle
schools found that more than 50% of the students were exposed to alcohol and substance-abuse
content in media more than once, and that this exposure was associated with a higher probability
of alcohol consumption by the eighth grade (Tucker, Miles, & D'Amico, 2013). Children exposed to
violent content displayed an increase in aggressive affect and reinforced notions of violent behavior
as an acceptable problem-solving strategy (Anderson & Bushman, 2001; Huesmann & Malamuth,
1986). Teenagers (14 to 16 years old) were more likely to engage in sexual activity after repeated
exposure to sexual content (Brown et al., 2006). This influence seems to be prevalent even among
young adults. Attitudes towards safe sex in 437 undergraduate students (19–21 years old) changed
in response to watching TV content with sexual overtones (Moyer-Gusé & Nabi, 2011), with a reduced
effect when the content was presented in a humorous way (Moyer-Gusé, Mahood, & Brookes,
2011). These effects might be moderated by external factors such as a viewer's characteristics and
the degree to which the viewer identifies with the characters (Eyal & Rubin, 2003; Igartua, 2010).
Surveys of 1,000 African-American adolescents (14–17) found that those who saw themselves as more
aligned with the characters in "Black-oriented" movies were more likely to be affected by the
characters' portrayals of aggressive and drinking behaviors (Moyer-Gusé, Chung, & Jain, 2011).
2.2 Methods for Media Analysis
2.2.1 Manual Annotations
To understand the prevalence of certain types of content (e.g., risk behaviors or stereotypes) in
film and TV, researchers typically rely on data generated by human annotators. This manual
content analysis is considered the standard approach in the fields of Media Studies and Social
Sciences (Rudy, Popova, & Linz, 2010). The manual labor involved, which tends to be repetitive,
exhausting, and error-prone, also limits such studies to a small sample size, generally under 100 samples.
This includes studies of portrayals of violence in 74 G-rated animated films (Yokota & Thompson, 2000)
and in 77 of the 1999-2000 top-grossing films (Webb, Jenkins, Browne, Afifi, & Kraus, 2007), and of teen
sex in 90 of the top-grossing films of the last decades (Callister, Stern, Coyne, Robinson, & Bennion,
2011). Among other findings, these studies showed that a majority of the movies analyzed contained
violent acts in spite of MPA (Motion Picture Association; formerly the MPAA) ratings that suggested
these movies were appropriate for all ages (e.g., a G rating) (Webb et al., 2007; Yokota & Thompson, 2000).
A manual annotation approach was also used to study the relation between characters' demographics
and their participation in violent acts in TV series (Bell-Jordan, 2008; Potter et al., 1995) and film
(Bleakley, Jamieson, & Romer, 2012; Schlesinger et al., 1998; S. L. Smith et al., 1998). These studies
concluded that women are more frequently portrayed as victims of violence, while most of the perpetrators
are played by middle-aged white actors. Yet, to the best of our knowledge, there has not been a large-scale
systematic analysis of the pervasiveness of these stereotypical portrayals.
2.2.2 Automatic Content Analysis
As a complementary approach, multiple works have proposed the use of machine learning methods
to scale the analysis of media content. These methods are used to analyze large collections of published
media, including film, TV, news, gaming, music, publishing, film criticism, consumer goods, mascots,
and advertising (Geena Davis Institute on Gender in Media, 2019a, 2019b; Somandepalli et
al., 2021). A majority of these works aim to provide systematic, large-scale evidence of a
gender gap by identifying cues in either audio or visual channels (Geena Davis Institute on Gender
in Media, 2019a, 2019b). For example, researchers have developed algorithms that automatically detect
actors' faces and voices to accurately estimate measures of screen and speaking time in TV and
film (Guha, Huang, Kumar, Zhu, & Narayanan, 2015; Hebbar, Somandepalli, & Narayanan, 2019,
2018; Kumar, Nasir, Georgiou, & Narayanan, 2016); another group of researchers designed models
to automatically recognize instances of risky behaviors in film, both for violence (Chen, Hsu, Wang,
& Su, 2011) and for sexual depictions (Liu et al., 2008).
Audio-Visual Approaches
A considerable number of works propose machine learning systems that identify patterns in audio-visual
cues to understand what is being shown on screen (Bojanowski et al., 2013; Duchenne, Laptev,
Sivic, Bach, & Ponce, 2009; Laptev, Marszalek, Schmid, & Rozenfeld, 2008; Marszalek, Laptev, &
Schmid, 2009). Some of these features capture visual aspects, such as action scene detection or
shot transition pace (Brezeale & Cook, 2008), or auditory aspects, such as energy (Giannakopoulos,
Kosmopoulos, Aristidou, & Theodoridis, 2006) and spectrograms (Giannakopoulos, Pikrakis, &
Theodoridis, 2007). As an example, Huang, Xiong, Rao, Wang, and Lin (2020) review systems
that target character action identification through the construction of bounding boxes for
the characters. On the assumption that different types of features complement each other, multiple
systems have explored combinations of audio-visual factors (e.g., Chen et al., 2011; Demarty, Penet,
Ionescu, Gravier, & Soleymani, 2014).
While there has been much advancement in the field, these approaches still face limitations.
Visual-based systems have been held back by an action-boundary definition that is inherently
fuzzy, leading to significant inter-annotator disagreement (Idrees et al., 2017), and by annotation
quantities that are limited and insufficient for data-hungry deep learning paradigms (Huang et al.,
2020). Most audio-based systems tend to fail on tracks containing more than just speech
(such as music and environmental sound), and have no way to deal with silent video portions or
clips (Chen et al., 2011).
Perhaps the most limiting aspect of this approach is that its application requires
access to the produced content, as many of these systems rely on visual and sound effects. At that
stage of production, any insight gained from the system might come too late or be too expensive
to act upon.
Textual Approaches
As a viable alternative, researchers have explored the use of additional textual information, for
instance, by jointly modeling the information provided by the movie scripts (Cascante-Bonilla,
Sitaraman, Luo, & Ordonez, 2019; Kukleva, Tapaswi, & Laptev, 2020), or by applying a
mixture of natural language processing techniques directly to the movie scripts (P. J. Gorinski &
Lapata, 2015; Kagan, Chesney, & Fire, 2020; Ramakrishna, Malandrakis, Staruk, & Narayanan,
2015; Ramakrishna, Martínez, Malandrakis, Singla, & Narayanan, 2017; Sap, Prasettio, Holtzman,
Rashkin, & Choi, 2017a; Srivastava, Chaturvedi, & Mitchell, 2016; Trovati & Brady, 2014).
Automating the Bechdel-Wallace Test
One of the most popular measurements of representation, and one that can be easily calculated
from textual analysis of movie scripts, is the Bechdel-Wallace Test. First proposed by American
cartoonist Alison Bechdel (Bechdel, 2008), the Bechdel-Wallace Test is a validation experiment for
women's representation in movies. For a movie to pass, it has to include at least two female characters
talking to one another about something other than a man. This test is currently considered the
standard for female representation in movies (Kagan et al., 2020). While certainly useful and easy
to automate (Agarwal, Zheng, Kamath, Balasubramanian, & Ann Dey, 2015), the test is not all-encompassing.
Movies that pass the Bechdel-Wallace test can still display an array of stereotypical
representations, since the test omits tropes that are communicated through characters' actions and
behaviors (England et al., 2011).
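The three conditions are simple enough that, given a parsed script, the check reduces to a few lines of code. The sketch below is illustrative only: the turn tuples, gender labels, and `mentions_man` keyword list are assumptions standing in for real script parsing and coreference resolution, which is where the actual difficulty lies.

```python
# Minimal sketch of an automated Bechdel-Wallace check.
# Assumes each conversational turn has already been parsed into
# (speaker, speaker_gender, addressee_gender, line) tuples; real systems
# must derive these from the script via character lists and coreference.

MALE_TERMS = {"he", "him", "his", "man", "men", "boyfriend", "husband"}

def mentions_man(line):
    """Crude lexical proxy for 'talking about a man'."""
    tokens = line.lower().replace(",", " ").replace(".", " ").split()
    return any(tok in MALE_TERMS for tok in tokens)

def passes_bechdel(turns):
    """True if two women talk to each other about something other than a man."""
    for speaker, spk_gender, addr_gender, line in turns:
        if spk_gender == "F" and addr_gender == "F" and not mentions_man(line):
            return True
    return False

scene = [
    ("Ripley", "F", "F", "Seal the door and check the charge."),
    ("Lambert", "F", "F", "He said the shuttle was ready."),
]
print(passes_bechdel(scene))  # True: the first turn satisfies all three conditions
```

In practice, the third condition is the hard part: deciding whether a conversation is "about a man" requires coreference and topic analysis far beyond a keyword list, which is why automated implementations such as Agarwal et al. (2015) treat it as a learning problem.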
Risk Behaviors and Abusive Language
Research on identifying risky behaviors from character dialogue centers around the concept of
Abusive Language (AL). AL (Waseem, Davidson, Warmsley, & Weber, 2017) is an umbrella term
that includes offensive language, hate speech, sexism, and racism (for a detailed review, see Schmidt
& Wiegand, 2017). In the context of social media, AL typically targets under-represented groups
(Burnap & Williams, 2015) or particular demographics (L. Dixon, Li, Sorensen, Thain, & Vasserman,
2017; Park & Fung, 2017). AL computational
models are usually designed using popular document classification techniques (Mirończuk & Protasiewicz,
2018) based on multiple linguistic features, for example, word or character n-grams
(Mehdad & Tetreault, 2016; Nobata, Tetreault, Thomas, Mehdad, & Chang, 2016; Park & Fung,
2017) and distributed semantic representations (Djuric et al., 2015; Pavlopoulos, Malakasiotis, &
Androutsopoulos, 2017; Wulczyn, Thain, & Dixon, 2017). Under the assumption that abusive messages
contain specific negative words (e.g., slurs, insults, etc.), many works have constructed lexical
resources to capture these types of words (for example, Davidson, Warmsley, Macy, & Weber, 2017;
Wiegand, Ruppenhofer, Schmidt, & Greenberg, 2018). In addition to linguistic features, works
have explored the use of social network meta-data (Pavlopoulos et al., 2017), with the effectiveness
of this approach still up for debate (Zhong et al., 2016). With respect to the computational
models used, most studies employ either traditional machine learning approaches or deep-learning
methods. Examples of the former are Support Vector Machines (Nobata et al., 2016) and Logistic
Regression classifiers (Djuric et al., 2015; Wulczyn et al., 2017). For the latter, studies have successfully
used convolutional neural networks (Park & Fung, 2017; Z. Zhang, Robinson, & Tepper, 2018)
and recurrent neural networks (e.g., Founta et al., 2018; Pavlopoulos et al., 2017). There have
been a few attempts at combining these two approaches, with varying degrees of success (Djuric
et al., 2015; Golem, Karan, & Šnajder, 2018; Park & Fung, 2017). Recently, approaches to AL
identification have explored the use of highly contextualized word representations, a paradigm shift from
traditional NLP approaches that has found success in a multitude of tasks (Devlin,
Chang, Lee, & Toutanova, 2019; Peters et al., 2018; A. Radford et al., 2019). One of these
developments is BERT (Devlin et al., 2019), a language model that outperforms its predecessors thanks
to an innovative architecture that incorporates information from both the left and right contexts.
This incorporation is done through an interlacing of 12 or 24 fully-connected dense layers, each with
a multi-head attention layer (Vaswani et al., 2017). Most recently, Mozafari, Farahbakhsh, and
Crespi (2019) explored the use of BERT as a feature generator to identify instances of AL in social
media messages.
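The word and character n-gram features used by the feature-based classifiers above can be sketched with no external dependencies; in a real pipeline these sparse counts would be vectorized and fed to a linear classifier such as an SVM or logistic regression (function names here are illustrative):

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-gram counts; popular in abusive-language detection
    because they are robust to obfuscated spellings (e.g., 'id1ot')."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def word_ngrams(text, n=2):
    """Word n-gram (here bigram) counts over whitespace tokens."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

print(char_ngrams("banana").most_common(1))                    # [('ana', 2)]
print(word_ngrams("the quick brown fox")[("quick", "brown")])  # 1
```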
Identifying Characters' Actions from Text
We identified two approaches typically taken to identify characters' actions from text. One uses
automated parsing techniques to determine the subject-verb-object structure of action descriptions.
13
Figure 2.2.1: Two examples of semantic role annotation for action, agent, and patient. Notice the
difference between patients (animated) and objects (non-animated).
This is done through a combination of named entity recognition (NER) and heuristics over the
syntactic dependency parse tree (Srivastava et al., 2016; Trovati & Brady, 2014). This approach
is limited by the underlying systems, as they are usually trained on out-of-domain, non-literary
data. For NER systems, recent efforts have been made to create datasets in literary domains,
particularly in the case of books and movie scripts (Bamman, Popat, & Shen, 2019). However,
to the best of our knowledge, no similar effort has been made to adapt the dependency tree
parser. We believe that the underlying structural assumptions required by NER systems may not
generalize to movie action descriptions. As such, acquiring additional in-domain training data for
NER systems is a difficult task.
The second approach develops systems that identify verbs and all sentence constituents that
fill a semantic role. This procedure is known as Semantic Role Labeling (SRL) or shallow semantic
parsing (Gildea & Jurafsky, 2002). The goal of SRL is to answer the question "Who did what to
whom?" by analyzing the propositions expressed by the verbs of a sentence. Typical semantic arguments
include Agents, Patients, Instruments, etc., as well as adjuncts such as Locative, Temporal,
Manner, Cause, etc. (Carreras & Màrquez, 2005). An example of two annotated action descriptions
is presented in Figure 2.2.1. In the top description, we see an agent (Gimli) performing two
actions (falling and staring). Depending on the action of interest, the semantic arguments vary:
for falling, 'backwards' is a modifier of Direction; for staring, 'disbelief' and 'at the ring' are
modifiers of Manner and Object, respectively. Similarly, the bottom action description has an agent
(Boromir), a predicate (looks), and a modifier of Patient ('Elrond and Gandalf').
Classical approaches to SRL rely on carefully designed features (e.g., word properties, syntactic
connections, distances and paths between nodes in the semantic dependency tree) alongside expensive
techniques such as Integer Linear Programming or Dynamic Programming (Daza & Frank, 2018;
Pradhan, Ward, Hacioglu, Martin, & Jurafsky, 2005; Täckström, Ganchev, & Das, 2015; Zhao,
Chen, & Kit, 2009). With the advent of deep learning and its impressive performance on a
variety of NLP tasks, the feature-engineering approach went out of vogue, replaced by end-to-end
neural models (He, Lee, Lewis, & Zettlemoyer, 2017; Marcheggiani, Frolov, & Titov, 2017). In
general, these models treat the problem as a supervised sequence labeling task, using deep LSTM
architectures that assign a label to each token within the sentence (Daza & Frank, 2018). Recently,
these models gave way to novel architectures that leverage attention mechanisms as an alternative
to data-hungry recurrent networks (Vaswani et al., 2017). Perhaps the most popular of these
architectures is the Bidirectional Encoder Representations from Transformers (BERT; Devlin et
al., 2019). BERT presents a new method of pre-training language representations which obtains
state-of-the-art results on a wide array of textual tasks. This model learns a language model (i.e.,
a probability distribution over a sequence of words) by using the surrounding text to establish
context, which allows the model to learn a different representation for each token depending
on the context in which it appears. Recently, works have explored ways to leverage pre-trained
BERT models to further push the state of the art in semantic role labeling without relying
on lexical or syntactic features (Gardner et al., 2017; Shi & Lin, 2019). These models represent the
current state of the art after significantly improving performance results for SRL in English.
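As an illustration of the sequence-labeling formulation, the snippet below hand-labels the second description from Figure 2.2.1 with BIO tags over PropBank-style roles and collapses them into spans. The tag sequence is constructed by hand for illustration, not produced by an actual SRL model:

```python
# Hand-labeled illustration of SRL output for
# "Boromir looks at Elrond and Gandalf", in the BIO-over-roles scheme
# used by tagging-based SRL models.
tokens = ["Boromir", "looks", "at", "Elrond", "and", "Gandalf"]
tags   = ["B-ARG0", "B-V", "B-ARG1", "I-ARG1", "I-ARG1", "I-ARG1"]

def spans(tokens, tags):
    """Collapse BIO tags into {role: phrase}, answering 'who did what to whom'."""
    out = {}
    for tok, tag in zip(tokens, tags):
        if tag == "O":
            continue
        role = tag.split("-", 1)[1]
        out[role] = (out.get(role, "") + " " + tok).strip()
    return out

print(spans(tokens, tags))
# {'ARG0': 'Boromir', 'V': 'looks', 'ARG1': 'at Elrond and Gandalf'}
```

A tagging model predicts exactly this kind of tag sequence, one verb at a time; recovering agent-action-patient triples is then the span-collapsing step shown above.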
2.3 Beyond the Media Domain: Narratives in Psychotherapy
Psychotherapy is a commonly used process in which mental health disorders are treated through
communication between an individual and a trained mental health professional (Flemotomos et al.,
2021). Problems helped by psychotherapy include difficulties in coping with daily life; the impact
of trauma, medical illness, or loss, like the death of a loved one; and specific mental disorders,
like depression or anxiety (American Psychiatric Association, 2019). Given such a broad range of
difficulties, it is no surprise that clients often do not know the exact root cause of their adversities.
Instead, clients usually begin a therapy session by recounting a life story related to the problems
they want to work through, the significance and feelings they have associated with these events,
and how they relate to their relationships with others. These narratives allow clients to distance
themselves from particular issues, and to gain a new and objective perspective on the problem
(Morgan, 2000).
Professional therapists unravel a client's complex narrative to understand what is of interest,
and to raise awareness of particular traits and characteristics that may have caused the client's
current affliction. As a way to explore a person's internal state, personal narratives have been
studied in mental health (Hall & Powell, 2011), medical interviews (Diers, 2004; Mishler, 1984),
and children's language therapy (Nicolopoulou & Trapp, 2018). In Narrative Therapy, one of the
many popular approaches to psychotherapy, the therapeutic process centers around these client
narratives (White, White, Wijaya, & Epston, 1990). Therapists and clients build storylines
based on the client's dreams, values, goals, and skills (Morgan, 2000). These storylines uncover
the true nature of a client, separate from their problems. However, it is important to note that
narrative therapy still lacks the clinical and empirical support found across other psychotherapy
methods (Etchison & Kleist, 2000; Vromans & Schweitzer, 2011); that is, the effectiveness of this
type of therapy has only been supported by case studies and qualitative research.
2.3.1 Working Relations in Therapy
Several factors contribute to positive therapeutic outcomes, some of which are strongly related to
the combination of individuals and their relationship (Baldwin, Wampold, & Imel, 2007; Goldberg
et al., 2020; Lambert & Barley, 2001; M. N. Thompson, Goldberg, & Nielsen, 2018). As such, many
psychology studies have focused on understanding the client-therapist working relationship. Similarities
between therapist and client personalities have been associated with longer sessions, higher
therapeutic alliance, and better overall therapy outcomes (Coleman, 2006; Taber, Leibert, & Agaskar, 2011).
The Project MATCH Research Group (1998) studied whether the outcomes of treatments for alcoholism
were improved by selecting the type of treatment depending on a patient's characteristics. Their results
suggest that there is no benefit in matching patients to a particular treatment. Whereas the
treatment type was not significant, the way particular therapists interacted with alcoholic patients
had a substantial impact on those patients' outcomes (Bower, 1997; Peele, 1998).
One specific relationship factor, known as therapeutic alliance (Flückiger, Del Re, Wampold, &
Horvath, 2018), corresponds to the collaborative aspects of the therapist-client relationship, including
the perception of a shared bond and agreement on the focus of the therapy treatment. This
factor is a major contributing element in psychotherapy success (Flückiger et al., 2018; Goldberg
et al., 2020; Taber et al., 2011).
2.3.2 Automatic Assessment of Psychotherapy
Other psychotherapy approaches, such as Cognitive Behavior Therapy (CBT) or Motivational
Interviewing (MI), emphasize the need for empirically driven quality assessments of therapy.
Traditionally, this assessment is addressed by human raters who evaluate recorded sessions along
specific dimensions, often codified through constructs relevant to the approach and domain (Flemotomos
et al., 2021). Nevertheless, providing regular and immediate performance evaluation is
both time-consuming and cost-prohibitive in real-world settings (Flemotomos et al.,
2021). To overcome this limitation, researchers have turned to machine learning methods for automatic
assessment of psychotherapy sessions. For example, initial methods explored the use of
unsupervised topic modelling techniques as a higher-level measure of content, and their relation to
mental health outcomes (Howes, Purver, & McCabe, 2013, 2014). Other works, as Flemotomos et
al. (2021) note, span a wide range of approaches, including text-based (Can et al., 2015; Gibson,
Can, Georgiou, Atkins, & Narayanan, 2017; Imel, Steyvers, & Atkins, 2015; Xiao, Can, Georgiou,
Atkins, & Narayanan, 2012) and audio-based (Black et al., 2013; Xiao, Imel, Georgiou, Atkins, &
Narayanan, 2015) methods.
Part I
Character Representations from
Dialogues
Chapter 3
Violence Rating Prediction from
Movie Scripts
In this chapter we describe the computational model we designed to estimate expert ratings of violent
content in a film from the language used in its movie script. By leveraging the language used
in movie scripts, rather than post-production audio-visual effects, our method becomes applicable
to movies in the earlier stages of content creation, even before a movie is produced. It also makes our
method complementary to previous works that rely on these effects. Our approach is based on
a broad range of features designed to capture lexical, semantic, sentiment, and abusive-language
characteristics. We use these features to learn a vector representation for (1) the complete movie and
(2) each act in the movie. The former representation is used to train a movie-level classification
model, and the latter to train deep-learning sequence classifiers that make use of context. We
tested our models on a dataset of 732 Hollywood scripts annotated by experts for violent content.
Our performance evaluation suggests that linguistic features are a good indicator of violent content.
Furthermore, our ablation studies show that semantic and sentiment features are the most
important predictors of violence in this data. This was the first work to show that the language used in
movie scripts is a strong indicator of violent content.
The work presented in this chapter was published in the following article: Martinez, V. R., Somandepalli, K., Singla, K., Ramakrishna, A., Uhls, Y. T., & Narayanan, S. (2019). Violence Rating Prediction from Movie Scripts. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01), 671-678.
3.1 Introduction
Violence is an important narrative tool, despite some of its ill effects. It is used to enhance a viewer's experience, boost movie profits (Barranco, Rader, & Smith, 2017; K. M. Thompson & Yokota, 2004), and facilitate global market reach (Sparks, Sherry, & Lubsen, 2005). Including violent content may modify a viewer's perception of how exciting a movie is by intensifying the sense of relief when a plot line is resolved favorably (Sparks et al., 2005; Topel, 2007).
There is a sweet spot for how much violent content filmmakers should include to maximize their gains. Too much violence may lead to a movie being rated NC-17,¹ which severely restricts both the promotion of the film and its viewership. This is sometimes considered a "commercial death sentence" (Cornish & Block, 2012; Susman, 2013), and usually forces filmmakers to trim the violent content to receive a rating of R or lower. Moreover, K. M. Thompson and Yokota (2004) showed that MPA ratings can be inconsistent. Hence, it is crucial to develop an objective measure of violence in media content.
Using violent content is often a trade-off between the economic advantage and the social responsibility of the filmmakers. The impact of portrayed violence on society, especially on children and young adults, has long been studied (see American Academy of Pediatrics, 2001). Violent content has been implicated in evoking aggressive behavior in real life (Anderson & Bushman, 2001) and in cultivating the perception of the world as a dangerous place (Anderson & Dill, 2000). However, this type of content does not appear to increase severe forms of violence (e.g., homicide, aggravated assault) at the societal level (Markey, French, & Markey, 2015), and its impact on an individual is highly dependent on personality predispositions (Alia-Klein et al., 2014). Casting people of certain demographics as perpetrators more frequently than others may contribute to the creation of negative stereotypes (Potter et al., 1995; S. L. Smith et al., 1998), which may put these populations at a higher risk of misrepresentation (Eyal & Rubin, 2003; Igartua, 2010). Thus, it is important to study violence in movies at scale.
There is a demand for scalable tools to identify violent content, given the increase in movie production (777 movies released in 2017, up 8% from 2016 (Motion Picture Association of America, 2017)). Most efforts to detect violence in movies have used audio- and video-based classifiers (e.g., Ali & Senan, 2018; Dai et al., 2015). No study to date has explored language use from subtitles or scripts, which limits the application of existing methods to after the visual and sound effects have been added to a movie.

¹The Motion Picture Association of America's rating system classifies media into 5 categories, ranging from suitable for all audiences (G) to adults only (NC-17). NC-17 films do not admit anyone under 17. In contrast, restricted (R) films may contain adult material but still admit children accompanied by their parents.
There are many reasons why identifying violent content from movie scripts is useful: 1) such a measure can provide filmmakers with an objective assessment of how violent a movie is; 2) it can help identify subtleties that producers may otherwise not pick up on when violence is measured through action rather than language; 3) it could suggest appropriate changes to a movie script even before production begins; and 4) it has the potential to support large-scale analysis of portrayals of violence from a who-is-doing-what perspective, which could provide social scientists with additional empirical evidence for creating awareness of negative stereotypes in film.
The objective of this chapter is to understand the relation between language used in movie
scripts and the portrayed violence. In order to study this relationship, we present experiments
to computationally model violence using features that capture lexical, semantic, sentiment and
abusive language characteristics. In the following sections we describe our computational framework
followed by a battery of experiments to validate our approach.
3.2 Dataset
3.2.1 Movie Screenplay Dataset
We use the movie screenplays collected by Ramakrishna et al. (2017), an extension of Movie-DiC (Banchs, 2012). It contains 945 Hollywood movies from 12 different genres (1920-2016). Unlike other datasets of movie summaries or scripts, this corpus is larger and readily provides the actors' utterances extracted from the scripts (see Table 3.1).
The number of utterances per movie script in our dataset varied widely, ranging between 159 and 4141 (μ = 1481.70, σ = 499.05), with a median of M = 1413.5. Assuming that movies follow a three-act structure and that each act has the same number of utterances, each act comprises about 500 utterances. In all sequence modeling experiments we focus only on the last act, because filmmakers often include more violent content towards the climax of a movie. This is supported by excitation-transfer theory (Zillmann, 1971), which suggests that a viewer experiences a sense of relief, intensified by the transfer of excitation from violence, when the movie plot is resolved favorably. We also evaluate our sequence models using the introductory segment of the movie to assess this assumption (see Section 3.5.3).

Number of movies        945
# Genres                 12
# Characters          6,907
# Utterances        530,608

Table 3.1: Description of the Movie Screenplay Dataset.
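The equal-size act segmentation described above can be sketched as follows; the function name and the heuristic of cutting the utterance sequence into contiguous, near-equal thirds are our own illustration of the idea:

```python
def split_into_acts(utterances, n_acts=3):
    """Split a movie's utterance sequence into n_acts contiguous,
    near-equal segments (the equal-size three-act heuristic)."""
    k = len(utterances)
    bounds = [round(i * k / n_acts) for i in range(n_acts + 1)]
    return [utterances[bounds[i]:bounds[i + 1]] for i in range(n_acts)]

# Each act holds roughly a third of the utterances; the sequence
# models described below would then use the final act, acts[-1].
acts = split_into_acts([f"utt_{i}" for i in range(1400)])
```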
3.2.2 Violence Ratings
In order to measure the amount of violent content in a movie, we used expert ratings obtained from Common Sense Media (CSM). CSM is a non-profit organization that provides education and advocacy to families to promote safe technology and media for children.² Expert raters, trained by CSM, review books, movies, TV shows, video games, apps, music and websites in terms of age-appropriate educational content, violence, sex, profanity and more, to help parents make media choices for their kids. CSM experts watch movies to rate their violent content from 0 (lowest) to 5 (highest). Ratings include a brief rationale (e.g., fighting scenes, gunplay). Each rating is manually checked by the Executive Editor to ensure consistency across raters. These ratings can be accessed directly from the CSM website.

Of the 945 movie scripts in the dataset, we found 732 movies (76.64%) for which CSM had ratings. The distribution of rating labels is given in Table 3.2. To balance the negative skewness of the rating distribution, we selected two cut-offs and encoded violent content as a three-level categorical variable (LOW < 3, MED = 3, HIGH > 3). The induced distribution is shown in Table 3.3.

²https://www.commonsensemedia.org/about-us
         0     1      2      3      4      5   Total
no.     40    48     83    261    135    165     732
%     5.46  6.56  11.34  35.66  18.44  22.54   100.0

Table 3.2: Raw frequency (no.) and percentage (%) distribution of violence ratings.

        LOW    MED   HIGH  Total
no.     171    261    300    732
%     23.36  35.65  40.98  100.0

Table 3.3: Frequency (no.) and percentage (%) distribution of violence ratings after encoding as a categorical variable.
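The two cut-offs can be sketched as a simple mapping; applying it to the raw counts of Table 3.2 reproduces the induced distribution of Table 3.3 (the function name is our own):

```python
from collections import Counter

def encode_violence(rating):
    """Map a 0-5 CSM violence rating onto the three-level categorical
    variable used in this chapter (LOW < 3, MED = 3, HIGH > 3)."""
    if rating < 3:
        return "LOW"
    return "MED" if rating == 3 else "HIGH"

# Raw rating counts from Table 3.2; encoding them yields Table 3.3.
raw_counts = {0: 40, 1: 48, 2: 83, 3: 261, 4: 135, 5: 165}
dist = Counter()
for rating, count in raw_counts.items():
    dist[encode_violence(rating)] += count
```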
3.3 Methodology
Movie scripts often contain both the dialogues of the actors (utterances) and scene descriptions. We pre-processed our data to keep only the actors' utterances, discarding scene descriptions (e.g., camera panning, explosion, interior, exterior) for two reasons: 1) our objective is to study the relation between violence and what a person said, so we do not want to bias our models with descriptive references to the setup; additionally, these descriptions vary widely in style and are not consistent in their depth of detail (in publicly available scripts); 2) this enables us to express a movie script as a sequence of actors speaking one after another, using models such as recurrent neural networks (RNNs).

Following this pre-processing, we collected language features from all the utterances. Our features can be divided into five categories: N-grams, Linguistic and Lexical, Sentiment, Abusive Language, and Distributional Semantics. These features were obtained at two units of analysis: 1) utterance level, where the text in each utterance is considered independently, for use in sequence models; and 2) movie level, where all the utterances are treated as a single document, for classification models. Because movie genres are generally related to the amount of violence in a movie (e.g., romance vs. horror), we evaluated all our models by including movie genre as a one-hot encoded feature vector; since a movie can belong to multiple genres, we used the primary genre. See Table 3.4 for a summary of the feature extraction methods at the two levels. We now describe each feature category in detail.
                  Utterance-level  Movie-level
N-grams           TF (IDF)         TF (IDF)
Linguistic        TF               TF
Sentiment         Scores           Functionals
Abusive Language  TF               TF
Semantic          Average          Average

Table 3.4: Summary of feature representation at utterance and movie level. TF - term frequency, IDF - inverse document frequency, Functionals - mean, variance, maximum, minimum, and range.
3.3.1 Features
N-grams:
We included unigrams and bigrams to capture the relation of the words to violent content. Because screenwriters often portray violence using offensive words, including their censored versions, we added 3-, 4- and 5-character n-grams as additional features. This window size of 3-5 is consistent with Nobata et al. (2016), who showed it to be effective for modeling offensive word bastardization (e.g., fudge) or censoring (e.g., f**k). These features were then transformed using term frequencies (TF) or TF-IDF (Sparck Jones, 1972). We set up additional experiments to assess the choice of transformation and the vocabulary size (see Section 3.4.3).
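A minimal sketch of this extraction with scikit-learn; the example utterances are invented, and the vectorizer settings simply mirror the choices described above (word unigrams and bigrams, character 3-5-grams, capped vocabulary, TF vs. TF-IDF):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

utterances = ["What the fudge!", "Get down, NOW!", "He pulled the trigger."]

# Word unigrams and bigrams as raw term frequencies (TF),
# with the vocabulary capped to the most frequent entries.
word_tf = CountVectorizer(ngram_range=(1, 2), max_features=5000)

# Character 3-5-grams within word boundaries, intended to catch
# bastardized (e.g. "fudge") or censored (e.g. "f**k") words.
char_tf = CountVectorizer(analyzer="char_wb", ngram_range=(3, 5),
                          max_features=5000)

# TF-IDF is the alternative transformation assessed in Section 3.4.3.
word_tfidf = TfidfVectorizer(ngram_range=(1, 2), max_features=5000)

X_word = word_tf.fit_transform(utterances)
X_char = char_tf.fit_transform(utterances)
```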
Linguistic and Lexical Features.
Violent content may be used to modify the viewers' perception of how exciting a movie is (Sparks et al., 2005). Excitation can be communicated in writing, for example, by repeating exclamation marks to add emphasis, or by capitalizing all letters to indicate yelling or loudness. Hence, we include the number of punctuation marks (periods, quotes, question marks), the number of character repetitions, and the number of capitalized letters. These features were also proposed in the context of abusive language by Nobata et al. (2016).
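These surface cues are straightforward to compute; the following sketch shows an illustrative subset (the feature names and the exact repetition pattern are our own, not the chapter's specification):

```python
import re

def surface_features(utterance):
    """Counts of surface cues of excitation: punctuation, runs of
    repeated characters, and capitalized letters (an illustrative
    subset of the linguistic features described above)."""
    return {
        "n_periods": utterance.count("."),
        "n_questions": utterance.count("?"),
        "n_exclaims": utterance.count("!"),
        "n_quotes": utterance.count('"') + utterance.count("'"),
        "n_repeats": len(re.findall(r"(.)\1{2,}", utterance)),
        "n_caps": sum(c.isupper() for c in utterance),
    }

feats = surface_features("GET OUT!!! Nooo...")
```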
Psycho-linguistic studies traditionally represent text documents by the percentage of words that belong to a set of categories defined by predetermined dictionaries. These dictionaries, which map words to categories, are often manually crafted based on existing theory. In our work, we obtained word percentages across 192 lexical categories using Empath (Fast, Chen, & Bernstein, 2016). Empath is similar to other popular tools in psychology studies, such as Linguistic Inquiry and Word Count (LIWC; Pennebaker, Boyd, Jordan, & Blackburn, 2015) and the General Inquirer (GI; Stone, Dunphy, & Smith, 1966). We chose Empath because it analyzes text on a wider range of lexical categories, including those in LIWC and GI.
Sentiment Features.
We include a set of features from sentiment classification tasks because it is likely that violent utterances contain words with negative sentiment. We used two sentiment analysis tools that are commonly used to process text.

AFINN-111 (Nielsen, 2011): for a given sentence, it produces a single score (-5 to +5) by summing the valence (a dimensional measure of positive or negative sentiment) ratings of all words in the sentence. AFINN-111 has also been used effectively for movie summarization (P. J. Gorinski & Lapata, 2018).

VADER (Gilbert, 2014): the Valence Aware Dictionary and sEntiment Reasoner (VADER) is a lexicon- and rule-based sentiment analyzer that produces a score (-1 to +1) for a document.
For the two measures described above, we estimated a movie-level sentiment score using the
statistical functionals (mean, variance, max, min and range) across all the utterances in the script.
Formally, let U = \{u_1, u_2, \ldots, u_k\} be a sequence of utterances with associated sentiment measures S_U = \{s_1, s_2, \ldots, s_k\}. We obtain a representation of movie-level sentiment S_M \in \mathbb{R}^5:

    S_M(U) = \begin{pmatrix} \mu(S_U) \\ \sigma^2(S_U) \\ \max(S_U) \\ \min(S_U) \\ \max(S_U) - \min(S_U) \end{pmatrix}

where \mu(S_U) = \frac{1}{k} \sum_{i=1}^{k} s_i and \sigma^2(S_U) = \frac{1}{k-1} \sum_{i=1}^{k} \left( s_i - \mu(S_U) \right)^2.

We also obtain the percentage of words in the lexical categories of positive and negative emotions from Empath. Finally, we concatenate S_M(U) from AFINN-111 and VADER with the two measures from Empath to obtain a 12-dimensional movie-level sentiment feature.
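The five statistical functionals can be sketched directly; the per-utterance scores below are invented stand-ins for the AFINN-111 or VADER outputs:

```python
import numpy as np

def sentiment_functionals(scores):
    """S_M(U): mean, (sample) variance, max, min and range of the
    per-utterance sentiment scores s_1..s_k."""
    s = np.asarray(scores, dtype=float)
    return np.array([s.mean(), s.var(ddof=1), s.max(), s.min(),
                     s.max() - s.min()])

# Toy per-utterance scores. AFINN-111 and VADER each yield a 5-d
# vector; with the two Empath percentages this gives 12 dimensions.
S_M = sentiment_functionals([-0.5, 0.1, 0.8, -0.9])
```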
Distributed Semantics
By including pre-trained word embeddings, models can leverage semantic similarities between
words. This helps with generalization as it allows our models to adapt to words not previously seen
in training data. In our feature set we include a 300-dimensional word2vec word representation
trained on a large news corpus (Mikolov, Sutskever, Chen, Corrado, & Dean, 2013). We obtained
utterance-level embeddings by averaging the word representations in an utterance. Similarly, we
obtained movie-level embeddings by averaging all the utterance-level embeddings. This hierarchical
procedure was suggested by Schmidt and Wiegand (2017). Other approaches have suggested that
paragraph2vec (Le & Mikolov, 2014) provides a better representation than averaging word embed-
dings (Djuric et al., 2015). Thus, in our experiments we also evaluated the use of paragraph2vec
for utterance representation (See Section 3.5.3).
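The hierarchical averaging can be sketched as follows; the toy 3-dimensional lookup table stands in for the 300-dimensional word2vec vectors, and the function names are our own:

```python
import numpy as np

def average_embedding(tokens, emb, dim):
    """Mean of the available word vectors; zero vector if none match."""
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def movie_embedding(utterances, emb, dim):
    """Hierarchical averaging: word vectors -> utterance vectors ->
    movie vector (the procedure suggested by Schmidt & Wiegand, 2017)."""
    return np.mean([average_embedding(u.lower().split(), emb, dim)
                    for u in utterances], axis=0)

# Toy 3-d embedding table for illustration only.
emb = {"kill": np.array([1.0, 0.0, 0.0]),
       "him":  np.array([0.0, 1.0, 0.0])}
movie_vec = movie_embedding(["Kill him", "kill"], emb, dim=3)
```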
Abusive Language Features
As described before, violence in movies is related to abusive language. Explicit abusive language can often be identified by specific keywords, making lexicon-based approaches well suited for identifying this type of language (Davidson et al., 2017; Schmidt & Wiegand, 2017). We collected the following lexicon-based features: (i) the number of insults and hate blacklist words from Hatebase³ (Davidson et al., 2017; Nobata et al., 2016); (ii) the cross-domain lexicon of abusive words from Wiegand et al. (2018); and (iii) the human-annotated hate-speech terms collected by Davidson et al. (2017).

Implicit abusive language, however, is not trivial to capture, as it is heavily dependent on context and domain (e.g., Twitter vs. movies). Although we do not model this directly, features we have included, such as N-grams (Schmidt & Wiegand, 2017) and word embeddings (Wulczyn et al., 2017), have been shown to be effective at identifying implicit abusive language, albeit in social media. Additionally, we use sequence modeling, which keeps track of the context that can capture the implicit nature of abusive language. A detailed study of the relation between domain, context and implicit abusive language is part of our future work.

³www.hatebase.org
Figure 3.3.1: Recurrent neural network with attention: each utterance is represented as a vector of concatenated feature types. A sequence of k utterances is fed to an RNN with attention, resulting in an H-dimensional representation. This vector is then concatenated with the genre representation and fed to the softmax layer for classification.
3.3.2 Models
Support Vector Machines
We train a Linear Support Vector Classifier (LinearSVC) using the movie-level features to classify a movie script into one of the three categories of violent content (i.e., LOW/MED/HIGH). We chose LinearSVC because such models were shown to outperform deep-learning methods when trained on a similar set of features (Nobata et al., 2016).
Recurrent Neural Networks
We investigate whether context can improve violence prediction, similar to previous works (e.g., Founta et al., 2018) on related tasks. In this work, we consider two main forms of context: conversational context and movie genre. The former refers to what is being said in relation to what has been previously said; this follows from the fact that most utterances are not independent from one another, but rather follow a thread of conversation. The latter takes into account that utterances in a movie follow a particular theme set by the movie's genre (e.g., action, sci-fi). Our proposed architecture (see Figure 3.3.1) captures both forms of context. The conversational context is captured by the RNN layer, which takes all past utterances as input to update the representation of the utterance of interest. Movie genre is encoded as a one-hot representation concatenated to the output of the attention layer. This allows our model to learn that utterances considered violent in one genre may not be considered violent in another.
Formally, let U = \{u_1, u_2, \ldots, u_k\} be a sequence of utterances, each represented by a fixed-length vector x_t \in \mathbb{R}^D, and let M_G be a one-hot representation of a movie's genre. The RNN layer transforms the sequence \{x_1, x_2, \ldots, x_k\} into a sequence of hidden vectors \{h_1, h_2, \ldots, h_k\}. These hidden vectors are aggregated by an attention mechanism (Bahdanau, Cho, & Bengio, 2014), which outputs a weighted sum of the hidden states,

    h_{sum} = \sum_{i=1}^{k} \alpha_i h_i

where each weight \alpha_i is obtained by training a dense layer over the sequence of hidden vectors. The aggregated hidden state h_{sum} is concatenated to M_G (see Figure 3.3.1) and input to a dense layer for classification.
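The attention step can be sketched as a NumPy forward pass; the dense-layer weights w and the toy hidden states are illustrative values, not learned parameters from the actual Keras model:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(H, w, b=0.0):
    """Single-dense-layer attention over RNN hidden states H (k x H_dim):
    score each state, softmax the scores into weights alpha_i, and
    return the weighted sum h_sum = sum_i alpha_i * h_i."""
    scores = H @ w + b          # one dense unit applied per time step
    alpha = softmax(scores)     # attention weights, sum to 1
    return alpha, alpha @ H     # shapes (k,) and (H_dim,)

# Three toy hidden states; in the full model h_sum would then be
# concatenated with the genre one-hot M_G before the softmax layer.
H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
alpha, h_sum = attend(H, w=np.array([2.0, 0.0]))
```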
3.4 Experiments
In this section we discuss the model implementation, hyper-parameter selection, baseline models and sensitivity analysis setup. The code to replicate all experiments is publicly available.⁴
3.4.1 Model Implementation
Linear SVC was implemented using scikit-learn (Pedregosa et al., 2011). Features were centered and scaled using sklearn's robust scaler. We estimated the model's performance and the optimal penalty parameter C \in \{0.01, 1, 10, 100, 1000\} through nested 5-fold cross validation (CV).
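A minimal sketch of this nested-CV setup with scikit-learn; the feature matrix and labels below are random placeholders, not the actual movie-level features, and the default accuracy scoring stands in for the macro-F1 reported in the chapter:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.preprocessing import RobustScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV, cross_val_score

# Toy stand-in for the movie-level feature matrix and LOW/MED/HIGH
# labels (encoded 0/1/2); the real features come from Section 3.3.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))
y = rng.integers(0, 3, size=60)

# Robust scaling + LinearSVC; the inner CV selects the penalty C.
pipe = make_pipeline(RobustScaler(), LinearSVC(max_iter=5000))
inner = GridSearchCV(pipe, {"linearsvc__C": [0.01, 1, 10, 100, 1000]}, cv=5)

# Outer loop: the nested 5-fold CV estimate of generalization.
scores = cross_val_score(inner, X, y, cv=5)
```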
RNN models were implemented in Keras (Chollet, 2015). We used the Adam optimizer with a mini-batch size of 16 and a learning rate of 0.001. To prevent overfitting, we use drop-out of 0.5 and train until convergence (i.e., consecutive losses differing by less than 10^{-8}). For the RNN layer, we evaluated Gated Recurrent Units (Cho, van Merrienboer, Bahdanau, & Bengio, 2014) and Long Short-Term Memory cells (Hochreiter & Schmidhuber, 1997). Both models were trained with the number of hidden units H \in \{4, 8, 16, 32\}. Albeit uncommon in most deep-learning approaches, we opted for 5-fold CV to estimate our models' performance, to be certain that the models do not overfit the data.
3.4.2 Baselines
For baseline classifiers, we consider both explicit and implicit abusive language classifiers. For the explicit case, we trained SVCs using lexicon-based approaches; the lexicons considered were the word list from Hatebase, the manually curated n-gram list from Davidson et al. (2017), and the cross-domain lexicon from Wiegand et al. (2018). Additionally, we compare against implementations of two state-of-the-art models for implicit abusive language classification: Nobata et al. (2016), a LinearSVC trained on Linguistic, N-gram and Semantic features plus the Hatebase lexicon; and Pavlopoulos et al. (2017), an RNN with deep attention. Unlike our approach, deep attention learns the attention weights using more than one dense layer. In addition to these baseline models, we also compare against RNN models trained using only Semantic features (i.e., only word embeddings).

⁴https://github.com/usc-sail/mica-violence-ratings-predictions-from-movie-scripts

                                Prec   Rec  F-score
Abusive Language Classifiers
  Hatebase                      29.8  37.4   30.5
  Davidson et al. (2017)        13.6  33.1   19.3
  Wiegand et al. (2018)         28.3  34.8   26.8
  Nobata et al. (2016)          55.4  54.5   54.8
  Pavlopoulos et al. (2017)     53.3  52.0   52.5
Semantic-only (word2vec)
  Linear SVC                    56.3  55.8   56.0
  GRU (16)                      53.6  52.3   51.5
  LSTM (16)                     52.7  54.0   52.5
Movie-level features
  Linear SVC                    60.5  58.4   59.1
Utterance-level features
  GRU (4)                       52.4  49.5   49.5
  GRU (8)                       58.2  58.2   58.2
  GRU (16)                      60.9  60.0   60.4
  GRU (32)                      58.8  58.4   58.4
  LSTM (4)                      54.1  54.2   52.1
  LSTM (8)                      56.6  57.6   57.0
  LSTM (16)                     57.4  57.2   57.2
  LSTM (32)                     56.4  56.2   55.9

Table 3.5: Classification results: 5-fold CV precision (Prec), recall (Rec) and F1 macro-average scores for each classifier. In parentheses: the number of units in each hidden layer.
3.4.3 Sensitivity Analysis
We present model performance under different selections of the initial feature extraction parameters. First, we evaluate the impact of limiting the vocabulary size to the most frequent |V| word n-grams and character n-grams; for this, we explored |V| \in \{500, 2000, 5000\}. Additionally, we evaluated whether the TF or TF-IDF word n-gram transformation was better. We also assessed the use of word embeddings (word2vec) against embeddings trained on both words and paragraphs (paragraph2vec), since previous work suggested that paragraph2vec creates a better representation than averaging word embeddings (Djuric et al., 2015). Finally, the sequence models were evaluated on each of the three segments obtained from the three-act segmentation heuristic.
              Prec   Rec  F-score     Δ
All           60.9  60.0   60.4      0.0
Ablations
 -Genre       59.9  59.0   59.4     -1.0
 -N-grams     59.9  59.2   59.5     -0.9
 -Linguistic  62.1  59.0   60.1     -0.3
 -Sentiment   59.6  58.3   58.8     -1.6
 -Abusive     60.6  58.9   59.6     -0.8
 -Semantic    59.8  58.3   58.9     -1.5

Table 3.6: 5-fold CV ablation experiments using GRU-16. The Δ column shows the difference in F-score between the original model and each individual ablation; '-' indicates removing a certain feature.
3.5 Results
3.5.1 Classication Results
Table 3.5 shows the macro-averaged classification performance of the baseline models and of our proposed models. Precision, recall and F-score (F1) for all models were estimated using 5-fold CV. Consistent with previous works, lexicon-based approaches resulted in a higher number of false positives, leading to high recall but low precision (Schmidt & Wiegand, 2017); in contrast, both the implicit abusive language classifiers and our methods achieve a better balance between precision and recall. In line with previous work (Nobata et al., 2016), for the feature set we selected, traditional machine learning approaches perform better than deep-learning methods trained only on word2vec vectors.

Our results suggest that models trained on the complete feature set performed better than the other models. The difference in performance is significant (permutation tests, n = 10^5, all p < 0.05), suggesting that the additional language features contribute to the classification performance. As shown in the next section, this increase can be attributed mostly to the Sentiment features. The best performance is obtained using a 16-unit GRU with attention (GRU-16) trained on all features. GRU-16 performed significantly better than the baselines (permutation test, smallest |Δ| = 0.056, n = 10^5, all p < 0.05), and better than the RNN models trained on word2vec only (|Δ| = 0.079, n = 10^5, p < 0.05). We were unable to find statistical differences in performance between the LinearSVC trained on movie-level features and GRU-16 (permutation test, p > 0.05).
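A sketch of an approximate paired permutation test of the kind referenced here; this is a common recipe (randomly swapping the two models' predictions per example), and the exact statistic used for the reported numbers may differ in details:

```python
import numpy as np

def permutation_test(y_true, pred_a, pred_b, metric, n=10_000, seed=0):
    """Approximate paired permutation test: randomly swap the two
    models' predictions per example and count how often the score gap
    is at least as large as the observed one."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    pred_a, pred_b = np.asarray(pred_a), np.asarray(pred_b)
    observed = abs(metric(y_true, pred_a) - metric(y_true, pred_b))
    hits = 0
    for _ in range(n):
        swap = rng.random(len(y_true)) < 0.5
        a = np.where(swap, pred_b, pred_a)
        b = np.where(swap, pred_a, pred_b)
        hits += abs(metric(y_true, a) - metric(y_true, b)) >= observed
    return (hits + 1) / (n + 1)  # add-one-smoothed p-value

# Toy comparison: a perfect model vs. a majority-class model.
accuracy = lambda y, p: float(np.mean(y == p))
y = [0, 1, 2, 1] * 25
p_val = permutation_test(y, y, [0] * 100, accuracy, n=2000)
```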
Model       Semantic       N-grams   |V|   F-score (%)
Linear SVC  word2vec       TF         500     58.3
                           TF        2000     59.6
                           TF        5000     59.1
                           TF-IDF    5000     58.9
            paragraph2vec  TF        5000     59.1
GRU-16      word2vec       TF         500     55.9
                           TF        2000     59.1
                           TF        5000     60.4
                           TF-IDF    5000     56.9
            paragraph2vec  TF        5000     59.1

Table 3.7: 5-fold cross validation F-scores (macro-average) for the two best models under varying model parameters. |V| = vocabulary size for word and character n-grams.
3.5.2 Ablation Studies
We explore how each feature contributes to the classification task through individual ablation tests. Differences in model performance were estimated using 5-fold CV and are shown in Table 3.6. Overall, our results suggest that GRU-16 takes advantage of all feature types (all ablations below zero), with Sentiment and the word2vec generalizations (i.e., Semantic) contributing the most.

Our ablation studies showed no significant differences in classification performance (permutation test, n = 10^5, all p > 0.05). This could be because our non-parametric tests do not have enough statistical power, or because there is no real difference in how the model performs. A possible explanation for no real difference is that the language features share redundant information; for instance, when N-grams are not present, the classifier may be relying more on the Semantic features. This might also explain why Sentiment accounts for the highest ablation drop (Δ = -1.6), since word embeddings and n-grams ignore sentiment-related information (Tang et al., 2014). We found the linguistic features to be the least informative (ablation drop of Δ = -0.3); when they are removed, the classifier achieves a higher precision score. An explanation for this behavior is that the Linguistic features include general-domain lexicons, which tend to produce low-precision, high-recall classifiers (Schmidt & Wiegand, 2017); hence, removing these features reduced recall and increased precision. Finally, removing genre resulted in a drop in performance (Δ = -1.0), suggesting the importance of genre for violence rating prediction.
3.5.3 Sensitivity Analysis
We measured the performance of our best classifiers (LinearSVC and GRU-16) with respect to the choice of different parameters. Table 3.7 shows the 5-fold CV macro-averaged F1 estimates. Our results suggest that using the IDF transformation negatively impacts the performance of both classifiers; however, these differences were not significant (permutation test, |Δ| = 0.035, n = 10^5, p > 0.05).

We did not find any significant difference (permutation test, |Δ| ≈ 0.00, n = 10^5, p > 0.05) in performance when using paragraph2vec rather than averaging word embeddings. A possible explanation is that, unlike the multi-line user comments studied in previous approaches, utterances are typically only one or two short lines of text.

Regarding vocabulary size, the classifiers seem to be impacted differently: LinearSVC achieves a better score when 2000 word and character n-grams are used, whereas GRU-16 performs better the bigger the vocabulary.

Finally, the scores suggest that GRU-16 performs better when trained on the final segment than when trained on the first (|Δ| = 0.067) or the second segment (|Δ| = 0.017). The difference in performance was significant with respect to the first segment (permutation test, n = 10^5, p < 0.05). This result suggests that filmmakers include more violent content towards the end of a movie script.
3.5.4 Attention Analysis
By exploring the utterances with the highest and lowest attention weights, we can get an idea of what the model labels as violent. We would expect utterances assigned a higher attention weight to be more violent than utterances with lower attention weights. To investigate whether the attention scores highlight violent utterances, we obtained the weights from GRU-16 on a few movie scripts, selected from a held-out batch of 16. We sorted the utterances from each movie based on their attention weights. To illustrate a few examples (see Figure 3.5.1), we picked the top- or bottom-most utterances when a movie was rated HIGH or LOW, respectively. For movies predicted as HIGH, the top utterances show themes related to killing or death. The model also appears to pick up on more subtle indications of aggression, such as "losing one's temper".
Figure 3.5.1: Examples of utterances with the highest and lowest attention weights for a few movies. Green - correctly identified; blue - depends on context (implicit); red - misidentified.

However, the model assigned a high attention weight to an utterance about sports (marked in red). This suggests that movie genre, although helpful, does not disambiguate subtle contexts for violence. Understanding these contexts is part of our future work.
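The probing procedure (sorting utterances by attention weight and inspecting the extremes) can be sketched as follows; the utterances and weights below are invented for illustration:

```python
def rank_by_attention(utterances, weights, top=3):
    """Return the `top` utterances with the highest and the lowest
    attention weights, as used to probe what GRU-16 treats as violent."""
    ranked = sorted(zip(weights, utterances), reverse=True)
    highest = [u for _, u in ranked[:top]]
    lowest = [u for _, u in ranked[-top:]]
    return highest, lowest

utts = ["I'll kill you.", "Nice weather.", "Drop the gun!", "Good morning."]
highest, lowest = rank_by_attention(utts, [0.45, 0.05, 0.40, 0.10], top=2)
```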
3.6 Conclusion
We presented an approach to identify violence from the language used in movie scripts. This can prove beneficial for filmmakers seeking to edit content and for social scientists studying representations in movies.

Our work is the first to study how linguistic features can be used to predict violence in movies at both the utterance and the movie level. This comes with certain limitations; for example, our approach does not account for modifications in post-production (e.g., an actor delivering a line with a threatening tone). We aim to address this in future work with multi-modal approaches using audio, video and text. Our results suggest that sentiment-related features were the most informative among those considered.
Chapter 4
Multi-Task Rating Prediction from
Movie Scripts
In this chapter we present extensions of the computational model for violence prediction that leverage additional risk behaviors, such as sexual and substance-abuse content. We propose a multi-task model that learns movie representations from sequences of character utterances, where the features for each utterance are extracted from models adapted from recent NLP techniques, such as language representation learning and sentiment classification. We tested our model on a dataset of about one thousand Hollywood scripts, rated by experts for risk-behavior content. Our results present a significant improvement over the state of the art for violent content estimation, and novel baselines for estimating substance-abuse and sexual content from language use in film.
4.1 Introduction
In one of the longest-running movie franchises in history, fictional British Secret Service agent James Bond is more often than not portrayed as an extremely charming gentleman, a cold-blooded killer, a smoker, and a severe alcoholic (N. Wilson, Tucker, Heath, & Scarborough, 2018). This is
The work presented in this chapter was published in the following article: Martinez, V., Somandepalli, K., Tehranian-Uhls, Y., & Narayanan, S. (2020). Joint Estimation and Analysis of Risk Behavior Ratings in Movie Scripts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 4780-4790).
not a unique character trait, as other critically acclaimed films, such as The Exorcist (Friedkin, 1973), Pulp Fiction (Tarantino, 1994), and A Clockwork Orange (Kubrick, 1972), follow narratives where the main characters engage in a similar collection of risk behaviors. The portrayals of these risk behaviors typically include acts of violence, sexual behavior and substance abuse, in scenes of fighting, bloodshed and gunplay; intercourse and nudity; and alcohol, smoking and drug use, respectively. While these tend to attract audiences (Barranco et al., 2017) and facilitate a movie's global market reach (Sparks et al., 2005), they have long sparked concerns about the potential side effects of repeated exposure, particularly for at-risk populations such as children and adolescents, where this exposure has been linked to an increased risk of engaging in violence (Anderson & Bushman, 2001; Bushman & Huesmann, 2001), smoking and alcohol consumption (Dal Cin, Worth, Dalton, & Sargent, 2008; Sargent et al., 2005), and earlier sexual initiation (Brown et al., 2006).
Although various automated tools have been designed to recognize portrayals of risk behaviors (e.g., Chen et al., 2011; Liu et al., 2008), many rely on cinematic principles from film theory, such as illumination, rapid shot transitions or musical score selection (Brezeale & Cook, 2008). This limits their practical impact to an almost-final edit of the content, specifically after visual and sound effects have been added, making it too late or expensive to implement any modifications. Hence, there is an opportunity in identifying these depictions at an earlier stage of content creation, so as to offer additional useful insights for filmmakers and movie producers during the complex creative process.
To this end, our work builds on two key insights. First, while prior works each focus
on a specific behavior, risk behaviors frequently co-occur with one another, both in real life (Brener
& Collins, 1998) and in entertainment media (Bleakley et al., 2017; Bleakley, Romer, & Jamieson,
2014; K. M. Thompson & Yokota, 2004). Second, language use in movie scripts can characterize
portrayals of risk behaviors at the earliest form of content creation, even before production
begins; for example, by identifying when Mr. Bond orders his usual alcoholic drink, Pulp Fiction's
main characters plot to kill someone, or the evil incarnate in The Exorcist curses in a sexually
explicit manner.
The present work, to the best of our knowledge, is the first to model the co-occurrence of
risk behaviors from linguistic cues found in movie scripts. Our proposed model is a multi-task
approach that predicts a movie script's violent, sexual, and substance-abusive content from vector
representations of the characters' utterances. We hypothesize that this multi-task approach will
improve violent content classification, while also providing insight into its relation to other
dimensions of risk behaviors depicted in film media.
Specifically, the contributions of this work are:
1. A multi-task model that significantly improves the state-of-the-art for violent content rating
prediction by leveraging the co-occurrence of sexual and substance-abusive content.
2. MovieBERT¹: a domain-specific fine-tuned BERT model (Devlin et al., 2019) pre-trained over
a large collection of film and TV scripts. We use this model to obtain better representations
of the semantics of a character's language.
3. A novel large-scale analysis of the joint portrayals of violence, sex, and substance abuse in
film, and their relation to other ratings.
4.2 Method
Our model learns to map sequences of character-utterance representations to overall movie-level
ratings. Each representation is composed of two parts: one representing the utterance's semantics and one
its sentiment. These representations are obtained from models trained on larger out-of-domain
corpora, but they have been validated on related tasks in domains similar to those we study in this work
(e.g., classification of movie review sentiment (Pagliardini, Gupta, & Jaggi, 2018)). Our decision
to start from character-utterance representations (as opposed to word representations) stems from
the limited number of labeled, expert-curated content ratings in our dataset (see Section 4.3).
4.2.1 Semantic representations
The unique aspect of this work is the use of highly contextualized vector representations for the
particular domain of movie scripts to predict content ratings. These techniques have shown remarkable
success on a variety of NLP tasks such as sentiment classification (Devlin et al., 2019)
and identifying abusive language (AL) in social media (Mozafari et al., 2019).
¹ https://github.com/usc-sail/mica-riskybehavior-identification
A. Sentence embeddings
We obtain 700-dimensional Sent2Vec representations (Pagliardini et al., 2018), a sentence-level extension of
word2vec (Mikolov et al., 2013), from either of two pre-trained sources: (a) BookCorpus
(Zhu et al., 2015), and (b) our own collection of 6,000 movie and TV scripts (see Section 4.3).
B. Highly contextualized representations
Bidirectional Encoder Representations from Transformers (BERT; Devlin et al., 2019) is a
language model that outperforms its predecessors thanks to an innovative architecture that incorporates
information from both the left and right contexts. This is done through an interlacing
of n fully-connected dense layers, each with a multi-head attention layer (Vaswani et al., 2017).
From BERT, we obtain vector representations for every utterance, using either of
two pre-trained models: (a) BERT-base (n = 12; 768-dimensional), and (b) BERT-large (n = 24;
1024-dimensional), both trained on a large corpus of documents from Wikipedia and BookCorpus.
C. MovieBERT
A common approach to achieving near state-of-the-art results is to fine-tune a
large pre-trained model (such as BERT) for a particular task. This aims to keep the generalization
power of the original model while adapting its vocabulary to the language use of a particular
domain. Following this idea, here we fine-tune a BERT-base model by continuing its training
over the 6,000-movie-script dataset. Our adapted model, movieBERT, consists of 12 transformer
layers that learn a 768-dimensional representation of a movie script. We train this model on an
85%–15% train-test data split and, as done by Devlin et al. (2019), optimize the model for
two tasks: next-sentence prediction and masked language modeling. In the former, the model has
to predict the sentence that follows a given sentence; in the latter, a random word in a sentence
is masked with a token, and the model has to recover the original word. We initialize the weights
of our model with those of the pre-trained BERT-base model, and continue training for 10,000
steps using the base model's parameters: learning rate of 2 × 10⁻⁵, batch size of 32, and sequence
length of 128. MovieBERT achieves 96.5% accuracy on the next-sentence prediction task and
65.9% accuracy on masked language modeling, an absolute improvement over the BERT-base
model of 24.5% and 12.43%, respectively. To obtain sentence-level representations, we concatenate
and then average-pool the outputs of the last two layers.

Figure 4.2.1: Multi-task model for content rating classification: Each utterance is represented by
semantic and sentiment features, fed to independent RNN encoders. The sequence of hidden states
from the encoders serves as input for task-specific layers (gray boxes).
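The pooling step can be sketched in a few lines of NumPy. This is an illustration of the concatenate-then-average-pool operation only, not the actual movieBERT code; the per-token outputs of the last two layers are assumed to be given as arrays.

```python
import numpy as np

def sentence_embedding(last_layer, second_to_last_layer):
    """Concatenate the last two transformer layers along the feature
    axis, (T, 768) + (T, 768) -> (T, 1536), then average-pool over
    the T tokens to obtain one sentence-level vector."""
    stacked = np.concatenate([last_layer, second_to_last_layer], axis=-1)
    return stacked.mean(axis=0)

# Toy check: a 4-token utterance with 768-dimensional layer outputs
tokens, dim = 4, 768
emb = sentence_embedding(np.ones((tokens, dim)), np.zeros((tokens, dim)))
print(emb.shape)  # (1536,)
```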
4.2.2 Sentiment representations
Previous works show the benefits of including lexical features that capture the sentiment
characteristics expressed in language for media content prediction tasks (V. Martinez et al., 2019;
Shafaei, Samghabadi, Kar, & Solorio, 2019). However, most approaches to sentiment analysis on
movie scripts rely on manually constructed sentiment lexica (e.g., P. Gorinski & Lapata, 2015;
P. J. Gorinski & Lapata, 2018). These lexica have a limited vocabulary, which is costly to scale or
adapt to new domains. In contrast, here we explore neural-network-based sentiment models that
learn representations from language used in the related task of movie reviews (Socher et al., 2013).
Figure 4.2.2: Risk behavior rating co-occurrence: on average, when one risk-behavior rating
increases, so do the others. Error bars denote 95% confidence intervals.
While we are aware of the possible mismatch between the language use of movie reviews and that
of movie scripts, our work relies on the assumption that these reviews provide a good initial step
towards capturing sentiment expressed in movie scripts. These models not only learn how words
are used from a larger vocabulary but also consider the relations between words, which may
allow them to generalize better to unseen data. In this work, we experiment with two neural
models: bidirectional long short-term memory networks (Bi-LSTM; Tai, Socher, & Manning, 2015)
and bidirectional encoder representations from transformers (Devlin et al., 2019). We chose these
models because they provide a good trade-off between the number of parameters and performance
on the sentiment prediction task (J. Barnes, Klinger, & Schulte im Walde, 2017), and due
to their outstanding performance in NLP tasks. Our sentiment representations are obtained from
the last hidden state of the Bi-LSTM and from the second-to-last layer of the BERT transformer.
4.2.3 Role of Movie Genre
Movie genres relate the elements of a story, plot, setting, and characters to a specific category.
Categorizing a movie indirectly assists in shaping its characters and story, and
determines the plot and the best setting to use. Movie genre thus carries information about the type
of content one could expect in a movie, especially in the case of violent content (V. Martinez et
al., 2019). Our models therefore include movie genre as an additional feature. Genres for each movie
were obtained from IMDb² and transformed into a multi-hot encoding.
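A multi-hot genre encoding of this kind can be sketched as follows. This is an illustration only; the genre names and the helper function are hypothetical, not part of the actual pipeline.

```python
def multi_hot_encode(genre_lists):
    """Build a fixed genre vocabulary, then map each movie's genre
    list to a binary vector with a 1 for every genre it belongs to."""
    vocab = sorted({g for genres in genre_lists for g in genres})
    index = {g: i for i, g in enumerate(vocab)}
    vectors = []
    for genres in genre_lists:
        vec = [0] * len(vocab)
        for g in genres:
            vec[index[g]] = 1
        vectors.append(vec)
    return vocab, vectors

# Hypothetical IMDb genre lists for three scripts
vocab, vecs = multi_hot_encode([["Horror", "Thriller"],
                                ["Romance"],
                                ["Horror", "Romance"]])
print(vocab)  # ['Horror', 'Romance', 'Thriller']
print(vecs)   # [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
```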
4.2.4 Ratings Prediction Model
Our model (see Figure 4.2.1) takes a sequence of utterance representations as input, and outputs
predictions for the target content ratings. Formally, let K be the number of content ratings to output
(the number of tasks), and let {u_t}, t = 1, ..., N, be a sequence of N character utterances. For each u_t, we obtain
features f_1t and f_2t, corresponding to the semantic and sentiment aspects of language, respectively.
These representations are input to separate bi-directional RNN layers. To improve model generalization,
a dropout layer (probability p) was added after the feature extraction layer. Each RNN
takes a sequence of representations and outputs a sequence of m hidden vectors {h_j1, ..., h_jm},
h_jl ∈ R^d, where j = 1, 2 corresponds to semantic and sentiment features respectively. Each hidden
vector represents a state of conversational context, i.e., what is being said in relation to what has
been previously said. This context is important because most utterances are
not independent of one another, but follow a conversation thread.
Both hidden-vector sequences {h_1i} and {h_2i}, i = 1, ..., m, go through k ∈ {1, ..., K} task-specific
units, represented as gray boxes in Figure 4.2.1. Each task-specific unit is composed of a sequence
of four layers: (i) two separate self-attention mechanisms; (ii) a concatenation layer; (iii) a z-dimensional
dense layer; and (iv) a softmax prediction layer. Self-attention (Bahdanau et al., 2014)
aggregates the sequence of hidden vectors into a representation of what characters say during the
movie. These attention layers, denoted by {α_kj ∈ R^m : j = 1, 2}, are not shared between the
tasks, to allow them to focus on what is important for their particular type of content. We chose
this approach as it showed improved performance over our initial experiments with multi-head
attention (Vaswani et al., 2017). Each attention output corresponds to a weighted sum of the hidden
states with the α_kj weights,

    A_kj = Σ_{i=1}^{m} α_kji h_ji.
In the concatenation layer, these aggregated representations are coupled with the movie-genre
encoding g to form v_k = [A_k1, A_k2, g], which serves as input for the z-dimensional dense layer. This yields

    s_k = σ(W_k v_k + b_k)

where σ denotes the ReLU activation, and W_k and b_k are the weight matrix and bias to be learned. We predict
the ratings through a prediction layer as ŷ_k = softmax(s_k). The complete model is trained by
minimizing the aggregated loss

    L = Σ_k l_k(y_k, ŷ_k)

where l_k is the cross-entropy loss associated with the k-th task.

² https://www.imdb.com/

             LOW (< 3)     MED (= 3)     HIGH (> 3)
violence     304 (30.7%)   329 (33.3%)   356 (36%)
sexual       446 (45.1%)   329 (33.3%)   214 (21.6%)
substance    469 (47.4%)   225 (39.6%)   129 (13.0%)

Table 4.1: Movie content rating counts and percentage distribution. A median split was induced on
all ratings to balance the class distribution.
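A task-specific unit of the kind described above can be sketched in plain NumPy. This is a simplified illustration rather than the authors' Keras implementation: for brevity, the dense layer here maps directly to the three rating classes, and all weights are random stand-ins.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def task_specific_unit(H_sem, H_sent, genre, w_sem, w_sent, W, b):
    """One gray box from Figure 4.2.1: self-attention over each hidden
    sequence, concatenation with the genre vector, dense + softmax."""
    alpha_sem = softmax(H_sem @ w_sem)       # attention weights, shape (m,)
    alpha_sent = softmax(H_sent @ w_sent)
    A_sem = alpha_sem @ H_sem                # weighted sum, shape (d,)
    A_sent = alpha_sent @ H_sent
    v = np.concatenate([A_sem, A_sent, genre])
    s = np.maximum(0.0, W @ v + b)           # ReLU dense layer
    return softmax(s)                        # rating class probabilities

rng = np.random.default_rng(0)
m, d, n_genres, n_classes = 500, 16, 23, 3
probs = task_specific_unit(
    rng.normal(size=(m, d)), rng.normal(size=(m, d)),
    rng.normal(size=n_genres),
    rng.normal(size=d), rng.normal(size=d),
    rng.normal(size=(n_classes, 2 * d + n_genres)),
    rng.normal(size=n_classes),
)
print(probs.shape)  # (3,)
```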
4.3 Data
We collected a large number of movie scripts from three publicly available sources. The first
source was related works that shared their movie script datasets (P. J. Gorinski & Lapata, 2018;
Ramakrishna et al., 2017); the second source was online collections of produced scripts³, and the
final source was online communities where non-produced scripts are shared⁴. In total we collected
12,706 scripts, some of which correspond to produced films or TV episodes. To improve the quality
of this dataset, we cleaned it by extracting text, limiting the collection to files with more than 1,000 lines, and
replacing non-ASCII characters; files that raised errors were removed from the collection. This
procedure resulted in 6,057 movie scripts spanning 23 genres, with an average of 1450.6 utterances
per movie (σ = 456.11, M = 1447.0). We use this collection to fine-tune movieBERT.
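The cleaning step can be sketched as follows. This is a simplified stand-in for the actual pipeline: the 1,000-line threshold comes from the text above, while the choice of '?' as the replacement character is an assumption.

```python
def clean_script(text, min_lines=1_000):
    """Keep only scripts longer than min_lines lines, replacing any
    non-ASCII character; return None for files that should be dropped."""
    lines = text.splitlines()
    if len(lines) <= min_lines:
        return None
    return "\n".join(line.encode("ascii", errors="replace").decode("ascii")
                     for line in lines)

short = clean_script("INT. HOUSE - NIGHT\nJOHN\nHello.")
long_script = clean_script("Café scene.\n" * 1_001)
print(short)                        # None: too few lines
print(long_script.splitlines()[0])  # Caf? scene.
```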
To evaluate the performance of our model, and to directly compare it to previous work, we manually
aligned a subset of 989 movie scripts from our dataset to the content ratings found in V. Martinez
et al. (2019). These ratings come from Common Sense Media (CSM)⁵, a non-profit organization
that promotes safe technology and media for children. CSM experts rate movies from 0 (lowest)
to 5 (highest), with each rating manually checked by the executive editor to ensure consistency
across raters. A manual inspection of the dataset revealed that the movies with the lowest scores
across all risk behaviors correspond to the romantic genre, whereas the movies with the riskiest
content were in the horror genre. Additionally, we investigated whether CSM expert raters capture the
co-occurrence of risk behavior portrayals. Figure 4.2.2 shows that, on average, when one risk-behavior
rating increases, so do the others. This was corroborated by significant positive Spearman's correlations
between violence and sexual content (r_s = 0.161, p < 0.001), violence and substance abuse
(r_s = 0.129, p < 0.001), and sexual content and substance abuse (r_s = 0.467, p < 0.001).

³ imsdb.com and scriptdrive.org
⁴ reddit.com/r/Screenwriting
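Rank correlations of this kind can be computed directly with SciPy. The toy ratings below are invented for illustration, not the actual CSM data.

```python
from scipy.stats import spearmanr

# Hypothetical 0-5 expert ratings for a handful of movies
violence = [1, 2, 3, 4, 5, 2, 3]
sexual   = [1, 1, 2, 3, 5, 2, 4]

rho, p_value = spearmanr(violence, sexual)
print(rho > 0)  # the two ratings tend to rise together
```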
4.3.1 Preprocessing
We follow a procedure similar to that described in V. Martinez et al. (2019), which discards scene
headers, actions, and transitions to represent a movie script as a sequence of actors speaking one
after another. This leads to a natural formulation of a sequence learning model for capturing
the dialog narrative using recurrent neural networks. Additionally, we transformed the five-point
ratings into three categories using a median split on each rating, to counter class imbalance and to
be consistent with previous work. The distribution of the ratings is shown in Table 4.1.
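The median split can be sketched with NumPy (the ratings below are illustrative):

```python
import numpy as np

def median_split_3(ratings):
    """Map 0-5 expert ratings to LOW (< median), MED (== median),
    and HIGH (> median) categories."""
    med = np.median(ratings)
    return np.where(ratings < med, "LOW",
                    np.where(ratings == med, "MED", "HIGH"))

ratings = np.array([0, 1, 3, 3, 4, 5])  # median is 3
print(median_split_3(ratings).tolist())
# ['LOW', 'LOW', 'MED', 'MED', 'HIGH', 'HIGH']
```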
4.4 Experimental Setup
In this section we discuss the model implementation, parameter selection, baseline models and
sensitivity analysis setup.
4.4.1 Model Implementation
Our model was implemented in Keras⁶. Although not common in most deep-learning approaches,
we performed 10-fold cross-validation (CV) to obtain a more reliable estimate of our model's
performance. In each fold, the model was trained until convergence (i.e., until the difference in loss
between consecutive epochs was less than 10⁻⁸). To prevent over-fitting, we used the Adam optimizer with a small
learning rate (0.001), a batch size of 16, and a high dropout probability (p = 0.5). For the RNN
layer, we used Gated Recurrent Units (GRU; Cho et al., 2014). For the sentiment models, Bi-LSTM
parameters were informed by the work of Tai et al. (2015): 50-dimensional hidden representation,
dropout (p = 0.1), trained with the Adam optimizer on a batch size of 25 and an L2 penalty of 10⁻⁴. To
allow for a fair comparison, all the BERT pre-trained models and movieBERT had the same set of
parameters as the BERT-base model: 12 layers, 768 dimensions, learning rate of 2 × 10⁻⁵, sequence
length of 128, and batch size of 32. For the initial experiments, we set the hidden dimension size
to d = 16, to help prevent overfitting, and the sequence length to m = 500,
which is approximately the duration of one movie act (i.e., one third of a movie). This selection was informed
by previous works (V. Martinez et al., 2019; Shafaei et al., 2019).

⁵ http://www.commonsensemedia.org
⁶ https://keras.io
4.4.2 Experiments
In our first set of experiments, we compare the predictive power of each of the proposed features for
predicting risk behavior content. In a second set, we explore how varying the number of dimensions
(d ∈ {8, 16, 32, 64}) and the utterance sequence length (m ∈ {100, 300, 500, 1000}) impacts the
performance of our model. Additionally, we explore the individual contribution of each feature to
the overall prediction task using ablation studies. For all experiments, we report macro-averaged
precision, recall, and F-score (F1) estimated through 10-fold cross-validation.
4.4.3 Baselines
As baselines, we compare against: (i) AL classification (Nobata et al., 2016), since AL likely
includes sexual and drug-related terms; (ii) the state-of-the-art for violence rating prediction from
movie scripts (V. Martinez et al., 2019); and (iii) BERT-only document classification systems
(Adhikari, Ram, Tang, & Lin, 2019). Additionally, to measure whether performance improves with
the inclusion of co-occurring risk behaviors, we compare our model against the same architecture
without the multi-task approach.
4.5 Results
4.5.1 Classification Results
Table 4.2 presents the classification performance of the baselines and our proposed model. In line
with previous results (V. Martinez et al., 2019; Shafaei et al., 2019), we observe that including
sentiment features (either in the form of lexica or neural network representations) greatly improves
model performance. Even without the multi-task framework, our model architecture shows
significant improvement over the baselines (permutation test, n = 10⁵, all p < 0.05). This is likely
due to our design choice of reducing model complexity by focusing only on the informative
features (i.e., semantics, sentiment, and genre) instead of dealing with redundant features (e.g., n-grams,
word2vec, AL lexica). By including the co-occurrence information in the form of additional
tasks, our proposed multi-task model with task-specific attention gained an average of 1.22 F1
points. It also yields the best model (movieBERT + sentiment + movie-genre), with an F1 of
67.7% for (d = 16, m = 500), performing significantly better than the previous state-of-the-art
model for violent content rating prediction (perm. test, n = 10⁵, p = 0.002), as well as the AL
baselines for violence (perm. test, n = 10⁵, p = 0.005) and substance-abuse content (perm. test,
n = 10⁵, p = 0.006).
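The significance tests reported here are permutation tests over per-fold scores. A generic paired version can be sketched as follows; the per-fold F1 values are invented for illustration, and the function is a stand-in, not the authors' exact procedure.

```python
import numpy as np

def paired_permutation_test(scores_a, scores_b, n_perm=10_000, seed=0):
    """Two-sided paired permutation test: randomly flip the sign of the
    per-fold score differences and count how often the permuted mean
    difference is at least as extreme as the observed one."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a, float) - np.asarray(scores_b, float)
    observed = abs(diffs.mean())
    hits = 0
    for _ in range(n_perm):
        signs = rng.choice([-1.0, 1.0], size=diffs.size)
        if abs((diffs * signs).mean()) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Made-up per-fold F1 scores for two hypothetical models
model_a = [0.68, 0.67, 0.69, 0.66, 0.70, 0.68, 0.67, 0.69, 0.68, 0.66]
model_b = [0.60, 0.61, 0.59, 0.62, 0.60, 0.58, 0.61, 0.60, 0.59, 0.62]
p = paired_permutation_test(model_a, model_b)
print(p < 0.05)  # the difference is unlikely under the null
```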
Semantic               Sentiment     Genre     Violence          Sex               Subs. Abuse
                                               P     R     F1    P     R     F1    P     R     F1
Single-Task Baselines
Adhikari et al. (2019)
  BERT (base)          --            No        57.4  55.7  56.1  39.2  34.0  29.2  30.4  35.1  31.9
Nobata et al. (2016)
  Abusive Lang.        --            No        52.4  52.4  52.3  44.3  44.3  44.2  42.8  42.4  42.6
V. Martinez et al. (2019)
  AL + word2vec        Lexical       Yes       60.1  61.1  60.4  --    --    --    --    --    --
No Multi-Task, Bi-GRU (16)
  Sent2Vec (BookCorpus)  Bi-LSTM     Yes       64.7  65.6  64.9  45.2  43.8  43.2  52.5  45.1  46.1
  Sent2Vec (adapted)     Bi-LSTM     Yes       64.5  65.6  64.8  47.2  43.3  42.4  51.7  46.4  47.8
  BERT (base)            Bi-LSTM     Yes       64.1  64.5  64.2  46.5  44.3  43.5  50.3  46.0  47.3
  BERT (large)           Bi-LSTM     Yes       63.0  63.7  63.2  44.2  42.1  40.2  52.8  44.2  45.0
  movieBERT              Bi-LSTM     Yes       66.9  67.3  67.0  47.6  47.4  47.3  51.1  47.2  48.5
Proposed: Multi-Task & Task-specific Attention, Bi-GRU (16)
  Sent2Vec (BookCorpus)  Bi-LSTM     Yes       66.3  67.2  66.5  17.7  18.6  17.7  17.2  16.9  16.9
  Sent2Vec (adapted)     Bi-LSTM     Yes       64.0  64.8  64.0  45.0  43.9  43.6  49.9  47.0  47.9
  BERT (base)            Bi-LSTM     Yes       67.4  67.8  67.5  49.5  47.0  46.8  53.5  47.6  49.1
  movieBERT*             Bi-LSTM     Yes       67.6  68.3  67.7  49.8  47.9  47.9  51.7  48.7  49.6
  BERT (large)           Bi-LSTM     Yes       64.3  65.0  64.5  46.1  44.5  43.8  53.6  46.9  48.6
  movieBERT              BERT (base) Yes       66.2  66.5  66.3  48.7  46.1  46.2  50.8  48.8  49.6

Table 4.2: 10-fold cross-validation multi-task classification performance. Precision (P), recall (R), and macro-averaged
F1 scores are reported as percentages. Models in the No Multi-Task block were trained independently for each task. The
best model (marked with *) performs significantly better than the baselines for violence (perm. test, n = 10⁵, p = 0.002)
and substance abuse (n = 10⁵, p = 0.006).
Figure 4.5.1: 10-fold cross-validation multi-task classification performance as a function of GRU
dimension (d) and sequence length (m).
While the proposed model also improves sexual content rating prediction, this improvement is
non-significant (p > 0.05). As previously mentioned, this could be attributed to the fact that
the MPA's ratings are particularly sensitive to sexual content (K. M. Thompson & Yokota, 2004). In
fact, filmmakers are advised to avoid the repeated usage of sexually derived words, either as an
expletive or in a sexual context, so as to avoid a non-family-friendly rating (Myers, 2018). Thus, they
might refer to sexual acts through euphemisms or innuendos, which the model seems
unable to pick up on. Our experiments using BERT for sentiment representations (last row
in Table 4.2) did not significantly improve performance any further (p > 0.05). Future work will
explore further fine-tuning to better capture affective language.
4.5.2 Performance Analysis
Parameter Selection
We evaluate model performance under different parameter selections, namely the number of
hidden dimensions in the GRU layer (d) and the length of the character utterance sequences (m).
The model performance for different dimensions is presented in the left section of Figure 4.5.1. For
all tasks, we notice an improvement in performance at d = 16, which drops for higher dimensions.
This suggests that the larger models are overfitting the data. There is a slight improvement in
sexual content estimation for d = 8 (F1 = 48.1), but its performance is not significantly different
Sem.  Sent.  Genre    Violence       Sex           Subs. Abuse    Avg.
 X     X     X        67.6 (0.0)     47.9 (0.0)    49.6 (0.0)      0.0
 --    X     X        60.8 (-6.8)    42.6 (-5.3)   38.2 (-11.4)   -7.83
 X     --    X        65.2 (-2.4)    46.9 (-0.1)   49.0 (-0.6)    -0.96
 X     X     --       64.5 (-3.1)    47.0 (-0.9)   50.0 (+0.4)    -1.2

Table 4.3: 10-fold CV ablation experiments using Bi-GRU (16). Macro-averaged F1 score (percentage)
reported. In parentheses: difference between the full model and the individual ablation.
from the original model (perm. test, p > 0.05).
With respect to m, the right section of Figure 4.5.1 presents the F1 performance of the multi-task
model. Overall, we see that longer sequences improve the model's performance. However,
there was no significant difference between the performance at m = 500 and at m = 1000
(perm. test, p > 0.05). Although we did not test sequences longer than 1000 utterances, the diminishing
performance gains between increments of m lead us to believe that the model is saturated, which
suggests that longer sequence lengths would not provide significant performance gains.
Ablation studies
Table 4.3 shows the individual contributions of each of the three representations. We find that
semantic representations are the most important source of information: removing this feature
results in an average performance drop of 7.83 F1 points. This difference in performance was significant
for the violence (perm. test, n = 10⁵, p = 0.003) and substance-abuse (perm. test, n = 10⁵, p < 0.0001)
tasks. The second most informative feature was genre, closely followed by sentiment, with average
performance drops of 1.2 and 0.96 points, respectively. These results suggest that, while useful, our
sentiment features still have scope for improvement. In particular, we note that a potential limiting
factor might be the possible mismatch between the language used in movie reviews and that of the
movie scripts. A study on how to bridge this possible mismatch will be part of our future work.
Attention Analysis
Finally, we verify our assumption that the attention layers are correctly identifying the important
aspects of language with respect to each behavior. We do so by exploring how the attention
weights are distributed across the movie scripts. Each of the 6 attention layers (two per task:
one for semantic and one for sentiment features) learns an m-dimensional weight vector, where each entry
corresponds to a particular utterance in the sequence. The higher the weight, the more importance
the model assigns to that particular utterance. For example, for the violent behavior task, we would
expect utterances assigned a higher attention weight to be more reflective of violent expressions than
utterances with lower attention weights. To verify that each attention layer is correctly focusing
on the behavior we are interested in, we set up a hypothesis test comparing the maximum
weight of each attention layer for movies rated HIGH against movies rated LOW on each behavior.
Our null hypothesis is that there is no difference in the way attention concentrates weights for
different levels of the behavior. We reject this null hypothesis for the semantics of the
violence task (Mann-Whitney U = 59377.5, n1 = 356, n2 = 304, p = 0.015), and for the sentiment
in the sexual content task (Mann-Whitney U = 52937.5, n1 = 214, n2 = 446, p = 0.011). These
results suggest that our model picks up on violence by focusing on the content of the words, whereas
identification of sexual behaviors depends on the emotional aspects of the language.
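Mann-Whitney U comparisons of this kind can be reproduced with SciPy. The per-movie maximum attention weights below are invented for illustration, not taken from the study.

```python
from scipy.stats import mannwhitneyu

# Hypothetical maximum attention weights per movie
high_rated = [0.91, 0.88, 0.95, 0.90, 0.87]
low_rated  = [0.70, 0.65, 0.72, 0.68, 0.71]

u_stat, p_value = mannwhitneyu(high_rated, low_rated,
                               alternative="two-sided")
print(u_stat)         # every HIGH weight exceeds every LOW weight, so U = n1*n2 = 25
print(p_value < 0.05)
```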
4.6 Co-Occurrence Analysis
In this section, we focus on some of the insights that our proposed model may provide film-makers
and producers during the creative process. In particular, our analysis centers on three insights: first,
understanding how joint portrayals of risk behaviors appear on screen; second, identifying
temporal patterns that arise from these joint portrayals; and finally, showcasing the relation
between risk behaviors and MPA ratings. For this analysis, we re-trained the best-performing
model over the complete movie script dataset (n = 989).

On the relation between joint portrayals of risk behaviors. We find a strong association
between predictions of substance-abuse and sexual content: the odds of a movie script being rated
high on sexual content are twice as high when it has a high rating in substance abuse compared
to when it has a low rating (95% Confidence Interval [CI]: 2.01 to 34.05). Moreover, we find
that the odds of rating high on all three risk behaviors simultaneously are inversely proportional
to the predicted violence rating (95% CI, HIGH: 0.11 to 0.82 and MED: 0.12 to 0.88). This
suggests that film-makers compensate low levels of violence with joint portrayals of sexual and
substance-abuse behaviors.
On the temporal patterns of the joint portrayals. If there is a temporal relation between
the portrayals, then when the model picks up a cue for a particular behavior at time t (i.e., a spike in the
attention signal), we expect to see a corresponding spike in the attention signal of another task some
time after t. To compute this relation, for each movie script we obtained the maximum correlation
and its corresponding time lag (τ ∈ [−m, m]) using the sample cross-correlation function (CCF)
between the attention weights of each task. The CCF is a measure of similarity between two time series
as a function of the displacement of one relative to the other. As an example, Figure 4.6.1 shows
the co-evolution of attention weights and the lags corresponding to their maximum correlation
for two renowned movies: The Exorcist (Friedkin, 1973) and From Russia With Love (Young, 1963).
On average, attention to the sexual sentiment content precedes attention to violence semantics
by τ = 15.50 utterances (95% CI: 10.88 to 17.4), with an average correlation coefficient of
r_z = 0.192 ± 0.02. This lag increases for movies with higher content ratings on both violence
and sex (τ = 21.46, r_z = 0.202), whereas movies with low sex and violent content have almost
no temporal difference and a significantly lower correlation coefficient (τ = 0.75, r_z = 0.172;
perm. test, n = 10⁵, p = 0.034). These results suggest, as Bleakley et al. (2014) point out, that
characters engage in sexual and violent behaviors within a small time span of one another.
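The lag at which the cross-correlation peaks can be computed as sketched below. This is a simplified version assuming two equal-length attention-weight series; a positive lag means the first series leads the second.

```python
import numpy as np

def max_ccf_lag(leading, trailing):
    """Return the lag at which the cross-correlation between two
    equal-length series peaks. A positive lag means `leading`
    spikes before `trailing` does."""
    x = (leading - leading.mean()) / leading.std()
    y = (trailing - trailing.mean()) / trailing.std()
    ccf = np.correlate(y, x, mode="full")
    lags = np.arange(-(len(x) - 1), len(y))
    return int(lags[np.argmax(ccf)])

# Toy attention signals: a spike at t=10 followed by one at t=15
sex_attention = np.zeros(50)
sex_attention[10] = 1.0
violence_attention = np.zeros(50)
violence_attention[15] = 1.0
print(max_ccf_lag(sex_attention, violence_attention))  # 5
```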
On the relation between risk behaviors and MPA ratings. Finally, we measure the relation
between the predicted risk behaviors and the movie's MPA rating. We find that as sexual
content increases, the association between violent (or substance-abuse) content and MPA rating
decreases. Specifically, movies with a high sexual rating are more likely to be rated R⁷, irrespective
of their violent or substance-abuse content (odds ratio OR = 12.172; 95% CI: 7.86 to 19.46).
In contrast, the MPA rating of a movie with low sexual content is strongly associated with both
its violent content rating (χ²(6) = 18.595, p = 0.004) and its substance-abuse content rating
(χ²(3) = 17.99, p < 0.001). These results point out the over-sensitivity of MPA raters towards
sexual content and corroborate previous findings from small manually annotated samples of
films (K. M. Thompson & Yokota, 2004; Tickle, Beach, & Dalton, 2009).

⁷ R–Restricted: under 17 requires accompanying parent or adult guardian.
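An odds ratio with its confidence interval on the log-odds scale can be computed from a 2×2 contingency table as follows. The counts are invented for illustration, not the study's data.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio for a 2x2 table [[a, b], [c, d]] (e.g., rows = high/low
    substance-abuse rating, columns = high/low sexual rating), with a
    95% confidence interval computed on the log-odds scale."""
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, (lo, hi)

or_, (lo, hi) = odds_ratio_ci(30, 10, 10, 30)
print(or_)          # 9.0
print(lo < or_ < hi)
```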
Figure 4.6.1: Attention weights for violence and sex for (a) The Exorcist, and (b) From Russia With
Love. Sex-sentiment (green) leads violence-semantics (red) by 31 and 203 utterances (maximum
correlations of 0.23 and 0.29), respectively.
4.7 Conclusion
We designed a multi-task model to capture the co-occurrence of depictions of violent content, as well
as sexual and substance-abuse risk behaviors, in film through the language data available in scripts.
Our proposed model achieves significant improvements over previous state-of-the-art models for
violent content rating prediction. While complementing audio-visual methods, our language-based
models can be used to identify subtleties in the way risk behavior content is portrayed, before
production begins, offering a valuable tool for content creators and decision makers in entertainment
media.
Part II

Character Representations from Actions
Chapter 5
MovieSRL: Automatic Identification of Character Actions
In this chapter we describe a computational model to identify actions and their characters from scene
descriptions found in movie scripts. We frame this problem as a semantic role labeling task, where
the model has to label predicates and their constituents as actions, agents, patients, or none. We propose
a simple transformer-based model built on the most recent developments in natural language
processing. Additionally, we construct a data resource of more than 9,000 manually labeled action
descriptions, both to implement domain adaptation and to obtain an accurate estimate of in-domain
performance. Our results show that this approach achieves significantly higher performance
at identifying characters engaging in actions than general-domain state-of-the-art approaches.
5.1 Introduction
Most works in computational narrative understanding focus on identifying characters and their
actions. Particularly, in the context of film and movie scripts, characters have been studied from a
social network perspective, where connections represent characters' interactions (Kagan et al., 2020),
or analyzed based on their relationships and portrayals (e.g., Ramakrishna et al., 2017; Sap,
Prasettio, Holtzman, Rashkin, & Choi, 2017b). However, to accurately identify characters and
their actions (predicates), most of these works have either heavily relied on manual annotations,
which limits their sample size due to scalability, or applied NLP systems trained on data that does
not come from the film domain.

Figure 5.1.1: Two examples taken from the dataset. In contrast to previous SRL datasets, we
explicitly identify characters as the sole sources and targets of the actions.
To analyze the actions taken or experienced by characters, computational linguists use, to the best of our
knowledge, one of two approaches. The first approach employs parsing techniques
to determine the subject-verb-object structure of sentences through a combination of dependency
parsing (DEP) and named entity recognition (NER) systems (Srivastava et al., 2016; Trovati &
Brady, 2014). The second approach identifies the predicate and its constituents (i.e., the agents
and patients of an action) through the use of semantic role labeling (SRL) systems. Both approaches
depend on underlying systems trained on nonliterary data, for the simple reason that data
is readily available for other domains. For NER systems, recent efforts have been made to create
datasets in literary domains, particularly for books and movie scripts, as a way to
improve state-of-the-art performance (Bamman et al., 2019). Yet no similar effort has been made
to improve SRL systems.
The motivation of this work is two-fold. First, we are able to show (see Section 5.4) that domain
mismatch heavily impacts the performance of state-of-the-art SRL models when they are directly applied
to movie scripts. Second, even though SRL models are built to answer the question who did
what to whom?, the data typically used to train these systems (e.g., news article datasets such
as Weischedel et al., 2011; Xue et al., 2015, 2016) makes no distinction between patients that are objects
and patients that are characters. Take, for example, the sentences shown in Figure 5.1.1. AllenNLP (Gardner et al.,
2017), a BERT-based state-of-the-art SRL system, correctly labels the start of each sentence as
the agent of the action. It also marks the latter part of these sentences as the patients. This
ultimately leads to unintended consequences for computational analyses of characters' actions,
either by incurring a labeling mistake or by requiring additional development to distinguish
between character and non-character patients, a task that remains an open research question
for computational models (Bamman et al., 2019).
To address these limitations, we present a dataset of 9,613 sentences manually annotated for
actions, agents (sources), and patients (targets) (see Section 5.3). Our annotation procedure explicitly
marks only characters as the sources and targets of the actions. Furthermore, we present a simple
BERT-based approach that accurately identifies characters and their actions (see Section 5.2). Our
model achieves significantly better performance than state-of-the-art results on our dataset (see
Section 5.5). The dataset and models are freely available for download under the Creative Commons
ShareAlike 4.0 license at https://sail.usc.edu/~ccmi/actions-agents-and-patients/.
5.2 Method
In this section, we provide an in-depth overview of the development of the computational model
that accurately identifies the action and its participants (agents and patients). This model is based
on current state-of-the-art BERT-based models for SRL (Shi & Lin, 2019). Additionally, we
present the steps performed for the manual annotation of a large-scale sample of character action
descriptions from a collection of Hollywood movie scripts. This data resource serves as a way to
train and domain-adapt our model, which results in a significant improvement in performance over
competitive baselines.
5.2.1 Problem statement
We frame the problem of automatically identifying the set of actions and their participating
characters as a Semantic Role Labeling (SRL) task, with a few differences. Given a sentence, the
SRL task consists of analyzing the propositions expressed by some target verbs of the sentence. In
particular, for each target verb, all constituents in the sentence which fill a semantic role of the verb
have to be recognized. Typical semantic arguments include Agent, Patient, Instrument, etc., as well
as adjuncts such as Locative, Temporal, Manner, and Cause (Carreras & Màrquez, 2005).
According to Shi and Lin (2019), a typical formulation of the SRL task splits into four sub-
tasks: predicate detection, predicate sense disambiguation, argument identification, and argument
classification. We start from the assumption that there are models that support the first two
tasks in a reliable manner. Specifically, in our experiments, we use Spacy (Honnibal, Montani,
Van Landeghem, & Boyd, 2020) for predicate identification, which allows us to focus entirely on
the argument identification and classification sub-tasks. Furthermore, in contrast to the traditional
SRL task (Carreras & Màrquez, 2005), we are only interested in the characters performing the
action and those the action is done to. Hence, our label set can be restricted to actions, agents, and
patients only. Another point of contrast is that we explicitly make the distinction between objects
(inanimate) and patients (characters).
5.2.2 Proposed Model
We follow Shi and Lin (2019) in applying a simple yet powerful recipe for SRL: obtain word vector
representations from a pre-trained BERT-based architecture to train a Recurrent Neural Network
for sequence labeling. Our proposed model (see Figure 5.2.1) learns to map the sequence of tokens
from an input sentence to a sequence of labels for actions, agents, and patients. Inputs to our model
are sentences, which are tokenized and fed into a BERT model to obtain highly contextualized word
representations. These representations are then used as input to a Recurrent Neural Network to
produce a sequence of token-level labels.
In contrast to Shi and Lin (2019), who provide the predicate as an additional input feature to the
model, our current setup restricts role labeling to a single predicate per sentence. If a sentence has
more than one predicate, we create a separate copy for each predicate; the same setting was applied
by Daza and Frank (2018) and Zhou and Xu (2015). In the following sections, we describe the
steps taken by our model in depth.
Input representation
We selected a BERT model for our input representation because of its remarkable success on a
variety of NLP tasks, such as question answering, dialogue systems, and information extraction (Devlin
et al., 2019). In this work, we start from the original BERT model¹ trained for the general domain
on a large unlabeled plain-text corpus, that is, the complete English Wikipedia and BookCorpus.

Figure 5.2.1: Our proposed SRL system. Starting at the bottom, the system takes a sentence
as input and obtains a highly-contextualized representation for each token (sub-word) using the
BERT transformer. The sequence of representations is fed into an RNN and softmax layers for
sequence labeling. As part of the post-processing, a set of heuristics aggregates multiword expressions.
To obtain an input representation, we feed an action description into the model as an input
sequence. This sequence starts with a sentence delimiter ([CLS]) and ends in a separator delimiter
([SEP]), as follows:
[CLS] word word word ... [SEP]
That is, we follow the traditional format used for sentence encoding in the BERT transformer (De-
vlin et al., 2019).
Our first step is to tokenize the sentence elements using WordPiece (Wu et al., 2016). The
WordPiece tokens are fed into pre-trained BERT models, from which we obtain one vector
representation for each of the tokens. Formally, the sentence representation step maps a delimited
sequence of k words

    [CLS], w_1, w_2, ..., w_k, [SEP]

into n WordPiece tokens,

    t_[CLS], t_1, t_2, ..., t_n, t_[SEP].

Note that the lengths of these two sequences might differ, as the tokenizer might split a
word into multiple sub-word tokens. The token sequence is then mapped into the representation
sequence by the BERT model. Let H = {h_[CLS], h_1, h_2, ..., h_n, h_[SEP]} denote the sequence of
BERT representations. The dimension of each h_i is given by the BERT model, and corresponds
to 768 and 1024 for the bert-base and bert-large models, respectively. The sequence of highly-
contextualized vector representations is then fed into an RNN for token-level prediction.
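The sub-word splitting noted above can be illustrated with a toy greedy longest-match tokenizer. The tiny vocabulary here is hypothetical (the actual WordPiece tokenizer uses a learned vocabulary of roughly 30,000 entries), so real splits will differ, but the principle is the same:

```python
def wordpiece_split(word, vocab):
    """Greedily match the longest vocabulary entry, left to right.
    Continuation pieces carry the '##' prefix, as in WordPiece."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:          # no vocabulary entry matches
            return ["[UNK]"]
        tokens.append(piece)
        start = end
    return tokens

# One word may become several tokens, so token and word counts differ.
vocab = {"bor", "##omir", "looks"}
print(wordpiece_split("boromir", vocab))  # ['bor', '##omir']
print(wordpiece_split("looks", vocab))    # ['looks']
```

Real tokenizers also lowercase (for uncased models) and handle punctuation; the point here is only why the token sequence can be longer than the word sequence.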
Token-level Prediction
We use a bidirectional recurrent neural network (RNN) to obtain the semantic role labels for each
word in a sentence. RNNs are a class of neural networks specialized in processing sequences of
inputs. With each input, the RNN updates its internal state (memory) and produces a probability
distribution over the labels for that input. Here, we feed the output of BERT into the RNN layer
as a sequence of tokens, for which we obtain a sequence of probability distributions. To obtain the
sequence of SRL labels (i.e., Verb, Agent, and Patient), each token is assigned the label with
the maximum posterior probability. For the RNN, we explore two popular configurations: long
short-term memory cells (LSTM) (Hochreiter & Schmidhuber, 1997) and gated recurrent units
(GRU) (Chung, Gülçehre, Cho, & Bengio, 2014).

¹ https://github.com/google-research/bert
Formally, the sequence obtained from BERT, H, is then fed into a bidirectional RNN to learn
a mapping between the tokens and the SRL labels of interest. The RNN takes the sequence of
representations H and outputs a sequence of n hidden vectors {v_[CLS], v_1, ..., v_n, v_[SEP]}. Each v_i
is constructed as the concatenation of the left-side and right-side contexts, v_i = [→v_i ; ←v_i]. To predict
a label, we use a fully connected dense layer and a softmax function over all labels:

    s_i = φ(v_i)
    ŷ = softmax(s_i)

where φ is the activation function. In our experiments, this function corresponds to a linear
activation function φ(x) = xAᵀ + b, where A and b are learnable parameters of the model. Finally,
the complete model is trained using a weighted cross-entropy loss

    L(y, ŷ) = w_C (−ŷ[C] + log Σ_j e^{ŷ_j})

where w_C is the weight associated with the true class C.
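The per-token weighted cross-entropy described above can be sketched in plain Python; the scores and class weights below are illustrative stand-ins, and in practice the loss is computed by PyTorch over whole batches:

```python
import math

def weighted_cross_entropy(scores, true_class, class_weights):
    """w_C * (log-sum-exp of all scores - score of the true class)."""
    log_sum_exp = math.log(sum(math.exp(s) for s in scores))
    return class_weights[true_class] * (log_sum_exp - scores[true_class])

# Hypothetical scores for the 4 labels (None, Action, Agent, Patient).
scores = [2.0, 0.5, 0.1, -1.0]
weights = [0.1, 1.0, 1.0, 2.0]   # up-weight the rare Patient class
print(weighted_cross_entropy(scores, 0, weights))  # small: confident and down-weighted
print(weighted_cross_entropy(scores, 3, weights))  # large: low score and up-weighted
```

Up-weighting rare labels makes mistakes on Agent and Patient tokens cost more than mistakes on the abundant None tokens, which is how the class imbalance is counteracted during training.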
Post-processing
From the SRL system, we collect two outputs: the subword WordPiece token representation of the
sentence, and the sequence of token-level labels. To recover the word-level sentence, we postprocess
the outputs by removing the special tokens introduced by BERT (i.e., [CLS], [SEP], [PAD] and
[MASK]) and merging back WordPiece tokens into words. The word-level label is calculated from its
corresponding tokens as the mode of the tokens' labels. Additionally, we post-process the data further
to accommodate cases where agents and patients are composed of more than one word. Examples
of this include honorifics (e.g., `Mr. Anderson', `Captain Crunch') and expressions with
more than one character (e.g., `Mr. Smith and his wife'). We restructure our word sequence into a
multiword expression (MWE) sequence by (i) concatenating consecutive words with the same label
into a MWE, and (ii) merging consecutive MWEs joined by a conjunction. MWE labels are
assigned to be equal to the label of the left-most word. For example, after applying our model
and post-processing procedure to the sentence presented in Figure 5.2.1, we are able to discern
that "Boromir" is the agent of the action "look", and that the patients correspond to "Elrond and
Gandalf", a multiword expression.
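The two merging heuristics can be sketched as follows; the label names, the 'O' tag for unlabeled words, and the conjunction list are illustrative choices, not the exact implementation:

```python
CONJUNCTIONS = {"and", "or"}  # illustrative conjunction list

def merge_multiword(pairs):
    """pairs: list of (word, label) tuples for one sentence.
    (i) merge runs of identical labels; (ii) fold 'X and Y' into one
    expression when X and Y share a label. The resulting MWE keeps the
    label of its left-most word."""
    # Step (i): group consecutive words that share a label.
    groups = []
    for word, label in pairs:
        if groups and groups[-1][1] == label:
            groups[-1][0].append(word)
        else:
            groups.append([[word], label])
    # Step (ii): merge MWEs separated only by a conjunction.
    merged, i = [], 0
    while i < len(groups):
        words, label = list(groups[i][0]), groups[i][1]
        while (i + 2 < len(groups) and label != "O"
               and groups[i + 2][1] == label
               and all(w.lower() in CONJUNCTIONS for w in groups[i + 1][0])):
            words += groups[i + 1][0] + groups[i + 2][0]
            i += 2
        merged.append((" ".join(words), label))
        i += 1
    return merged

sentence = [("Boromir", "AGENT"), ("looks", "ACTION"), ("at", "O"),
            ("Elrond", "PATIENT"), ("and", "O"), ("Gandalf", "PATIENT")]
print(merge_multiword(sentence))
# [('Boromir', 'AGENT'), ('looks', 'ACTION'), ('at', 'O'),
#  ('Elrond and Gandalf', 'PATIENT')]
```

The grouping in step (i) also recovers honorific spans such as `Mr. Anderson' whenever both words carry the same label.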
Parameter Selection
As part of this work, we explore two pre-trained BERT models: bert-base and bert-large. The
difference between the two is the size of their output representations and their number of trainable
parameters: 110M parameters with a 768-dimensional representation, and 340M parameters with a
1024-dimensional output, respectively. In their typical setup, these models do not consider word
casing. However, it seems reasonable to assume that movie scripts follow proper grammar;
in particular, we expect the characters' names to be capitalized. Hence, in our experiments we
investigate the advantage of pre-trained BERT models that do not discard token capitalization,
namely BERT-base-cased and BERT-large-cased. For the RNN layer, we explored performance
differences due to different architectures: gated recurrent units (Cho et al., 2014) or long short-
term memory cells (Hochreiter & Schmidhuber, 1997). Additionally, we explore a varying number
of dimensions (d ∈ {50, 100, 300}) for the hidden representation of the RNN.
5.2.3 Domain Adaptation
Domain adaptation aims to readjust the BERT language model and its vocabulary to the way
language is used in a particular domain, without offsetting the generalization power of the original
model. This procedure typically results in models that achieve state-of-the-art performance on
domain-related tasks. In this work, we perform domain adaptation of BERT language models by
training the model end-to-end. We back-propagate the errors from the sequence labeling task
through the network, all the way through the BERT transformer layers (yellow box in Figure 5.2.1).
This results in an update of the BERT layer parameters, which adapts them to the patterns in the
language of the movie scripts.

Figure 5.3.1: Example of the typical structure of a movie script. Reproduced with permission from
Slugline (https://slugline.co/).
5.3 Dataset
We construct a data resource to estimate the performance of our proposed model, as well as to
perform domain adaptation of the BERT language models. In this section, we briefly describe the
construction of the dataset and the annotation process.
5.3.1 Dataset construction
We obtain movie scripts from Scriptbase (P. J. Gorinski & Lapata, 2015), a corpus of 912 Hollywood
movie scripts from 31 genres spanning the years 1909–2013. One of the motivating factors in the
selection of this resource is its ample use within the research community, particularly in the analysis
of characters' portrayals (e.g., V. Martinez, Somandepalli, Tehranian-Uhls, & Narayanan, 2020;
Sap et al., 2017b). Moreover, this corpus readily provides named entity recognition, co-reference
resolution, and gender information for the main characters in each movie.
Typically, movie scripts are structured as a sequence of scenes (see Figure 5.3.1). Each scene
starts with a heading, which sets the location where the scene is going to take place. This is
followed by an action paragraph where the scene is described. This action paragraph tells the
reader what is going to happen on the screen and describes the characters and their actions. Next
comes a dialogue section, where a sequence of characters talk one after the other. Finally, the end
of a scene is marked by a transition into the next location (e.g., CUT TO, FADE TO). From the
initial set of 912 movies, we were able to collect 131,954 scenes. The average number of scenes
per movie in the dataset is 1363.45, and the average number of genres per movie is 3. The most
popular genres are Drama (n = 349), Thriller (n = 303), and Comedy (n = 247).
From each movie script, we collect all of its action descriptions and discard the rest of the
elements. This allows us to focus on the characters' actions and behaviors, and provides a
complementary approach to prior works on character dialogue (Kagan et al., 2020; Ramakrishna et al., 2017;
Sap et al., 2017a). We split each action paragraph into sentences and identify the actions (verbs)
using Spacy (Honnibal et al., 2020). With respect to the action descriptions, we parsed a total of
1,242,107 sentences (μ = 1363.45, σ = 560.03, M = 1313 per description). Since a sentence can
have more than one action, these descriptions amount to 1,634,230 action instances over a set of
84,513 unique actions. The most common actions are `looks' (n = 44,725), `turns' (25,215), `takes'
(19,137), `see' (17,716), and `walks' (16,279). These actions were also the most frequent ones across
all genders and movie genres.
5.3.2 Manual Annotation
We select a sample of 12,500 sentences to be coded by human annotators. This sample is
constructed by rejection sampling, where each sentence has to contain at least one verb (action).
Our annotation procedure consists of two tasks: labeling and verification. For labeling, we present
non-expert human annotators with a sentence and an action. They are asked to identify the agents
and patients by either selecting a character from a list of characters or checking the `Does not
say' box (see Figure 5.3.2). Each sentence gets annotated by three non-experts, and their agreement is
used as the presumptive label for the next stage. If there is no agreement between the annotators,
we discard the sentence from the sample. In the verification task, another set of non-experts is
presented with the sentence and the presumptive labels (i.e., action, agent, and patient). Their task is
to consider whether these labels are correct or not. Cases where they did not consider the labels
correct were discarded. As a final step, one of the authors checked all the sentences and verified
the labels for consistency. Annotators were hired through the Mechanical Turk platform
(https://www.mturk.com/), with a remuneration scheme that followed best practices, ensuring that
annotators received at least an hourly minimum wage (as calculated in the U.S.).

Figure 5.3.2: Labeling Task: Annotators are presented with a sentence and an action. They are
asked to select the agent (source) and patient (target) of the action. For cases where one of
these is missing, the annotator has the option to check the `Does not say' box.
Labeling
Semantic banks such as PropBank usually represent arguments as syntactic constituents (spans),
whereas CoNLL shared tasks follow a dependency-based approach where the goal is to identify the
syntactic heads of arguments (Shi & Lin, 2019). Our annotation style follows the work of Li et
al. (2018), by unifying these two annotation schemes into one framework where we annotate the
syntactic heads of the constituents, and not their full extent.
Given the novelty of our application domain (movie scripts), we make several departures from
the original annotation guidelines, noted here. These aim to ensure that annotators do not mix
non-characters with characters when labeling the targets of an action.
While previous works cover unrestricted annotations for the constituents (Carreras & Màrquez,
2005; Li et al., 2018), we are only interested in the case where agents and patients correspond to
characters in a movie. To reduce the workload on the annotators and minimize the chance for
labeling mistakes, we provide the annotators with two things. First, from the part-of-speech tags
provided by the corpus, we identify all verbs in a given sentence. For each verb, we create a separate
annotation task where annotators identify the agents and patients for that particular action.
Second, we reduce the annotation task to a selection process over a reduced set of entities. These
entities are selected to be the most likely to represent a character. However, accurately identifying
the literary figures in a text is still an open research problem (Bamman et al., 2019). Instead, here
we reduce the list of possible entities by removing the words that are least likely to represent
a character. We start by filtering out words that are not pronouns, proper nouns, nouns, or noun
phrases. From the remainder, we remove the most common words, since these are not normally
used to refer to a character (e.g., door, eyes, room, hand, car, head, floor, etc.). For the special
case of honorifics (such as Mr. or Miss), we follow Bamman et al. (2019) in considering these tokens
as part of an entire maximal span (e.g., [Mr. Collins] or [Miss Havisham]). We include this maximal
span as a single entry in our select box. For cases where the agent (or patient) is not explicitly
stated in the sentence, annotators have the option to check the `Does not say' box.
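The candidate filtering can be sketched as follows; the POS tags, honorific list, and common-word list are hypothetical stand-ins for the corpus-provided annotations and the actual frequency-derived word list:

```python
HONORIFICS = {"Mr.", "Mrs.", "Miss", "Dr.", "Captain"}   # illustrative
COMMON_WORDS = {"door", "eyes", "room", "hand", "car", "head", "floor"}
KEEP_POS = {"PRON", "PROPN", "NOUN"}

def candidate_entities(tagged_words):
    """tagged_words: list of (word, pos) pairs for one sentence.
    Returns the entities offered to annotators as agent/patient choices."""
    candidates, i = [], 0
    while i < len(tagged_words):
        word, pos = tagged_words[i]
        # Honorifics are kept as part of a maximal span with the name.
        if word in HONORIFICS and i + 1 < len(tagged_words):
            candidates.append(word + " " + tagged_words[i + 1][0])
            i += 2
            continue
        if pos in KEEP_POS and word.lower() not in COMMON_WORDS:
            candidates.append(word)
        i += 1
    return candidates

sentence = [("Mr.", "PROPN"), ("Collins", "PROPN"), ("opens", "VERB"),
            ("the", "DET"), ("door", "NOUN"), ("for", "ADP"), ("her", "PRON")]
print(candidate_entities(sentence))  # ['Mr. Collins', 'her']
```

Filtering out frequent inanimate nouns such as `door' keeps the select box short, which is what reduces annotator workload and labeling mistakes.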
Verification
For the verification task, we present a single annotator with the sentence, the action to be annotated,
and the presumptive labels for agents and patients when applicable, or a `Does not say' string
when not. The annotator is prompted with a single question: "Is [AGENT] doing the [ACTION]
to [PATIENT]?" and radio buttons for Yes, No, or Does not say. If the annotator does not agree with
the result, or they cannot say whether it is correct or not, we discard this sentence from consideration.
Annotation Results
From the 12,500 sampled sentences, we found 14,344 actions to be annotated. Annotators agreed
on both the agents and patients for a large majority of the cases (n = 11,775; 82.09%). However,
in one out of five cases, the verification annotator considered their answers to be incorrect
(n = 2,162; 18.36%). The remaining 9,613 (81.63%) sentences were manually corrected by one of
the authors. This is presented as the full dataset.

On average, the labeling task took 50.21 seconds with a standard deviation of 173.69, while the
verification task took 28.49 seconds with a standard deviation of 30.78. A posterior analysis of the
verification errors reveals that most of the errors were caused by a few annotators who completed
a large number of annotations (top-10 percentile) with over 90% of them marked as `Does not say'.
5.4 Experiments
In this section, we discuss the experiments used to measure the performance of our proposed model
for SRL in movie scripts and to measure the reliability of the gender estimation procedure.
5.4.1 Model Implementation
The BERT-base SRL model is implemented in PyTorch with the HuggingFace transformers
package (Wolf et al., 2019). We train our SRL model in an end-to-end fashion with a weighted
cross-entropy loss and the Adam optimizer (Kingma & Ba, 2014). The initial learning rate was set to
3 × 10⁻⁵, with a batch size of 32 and an L2 penalty of 10⁻⁴. To counteract class imbalance, class
weights are calculated as the inverse label frequency of the train set. The maximum token sequence
length is truncated at 156 tokens, a parameter selected to cover 85% of the cases in the dataset.
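The inverse-frequency weighting can be sketched as follows (the label counts below are illustrative, not the actual training-set statistics):

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by total count / class count, so rare labels
    (e.g., Patient) contribute more to the loss than abundant ones."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: total / count for label, count in counts.items()}

train_labels = ["None"] * 900 + ["Action"] * 60 + ["Agent"] * 30 + ["Patient"] * 10
weights = inverse_frequency_weights(train_labels)
print(weights["None"], weights["Patient"])  # rare Patient gets the largest weight
```

In practice, this weight vector is passed to the weighted cross-entropy loss during training.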
5.4.2 Model Performance
Model performance was estimated over the manually annotated dataset of n = 9,613 action
descriptions. For training, we used 75% of the available data (n = 7,209). Performance was estimated on
15% of the data, held back as a test set (n = 1,419). A development set (10%, n = 985) was used
for parameter optimization and early stopping. Our experiments measure the model's ability to
correctly categorize each token by its corresponding semantic role label. Hence, this can be seen as
a 4-way classification task (i.e., Action, Agent, Patient, and None). We report the average accuracy
and micro-average F-score for this classification task.
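For reference, per-class precision, recall, and F1 over token-level predictions can be computed as follows (toy gold and predicted label sequences for illustration):

```python
def per_class_scores(gold, pred, cls):
    """Precision, recall, and F1 for one label over paired sequences."""
    tp = sum(1 for g, p in zip(gold, pred) if g == cls and p == cls)
    fp = sum(1 for g, p in zip(gold, pred) if g != cls and p == cls)
    fn = sum(1 for g, p in zip(gold, pred) if g == cls and p != cls)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

gold = ["Agent", "Agent", "Patient", "None"]
pred = ["Agent", "Patient", "Patient", "None"]
print(per_class_scores(gold, pred, "Agent"))    # precision 1.0, recall 0.5
print(per_class_scores(gold, pred, "Patient"))  # precision 0.5, recall 1.0
```

Averaging these per-class scores over the four labels, weighted by token counts or not, gives the aggregate F-scores reported in the results.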
Baselines
We compare the performance of our proposed model with other SRL approaches found in prior
work: a NER + heuristics method, and state-of-the-art BERT-based SRL systems. The former
was implemented using the method proposed by Sap et al. (2017a). For each action description,
we obtain part-of-speech tags and a syntactic dependency tree using Spacy (Honnibal et al., 2020).
From this, verbs are tagged as actions, and Agent/Patient labels are obtained by inspecting noun
phrases (NPs) and their syntactic roles. NPs playing the role of subjects get tagged as Agents.
Similarly, NPs playing the role of syntactic objects get labeled as Patients.
For state-of-the-art SRL systems, we implemented SimpleBERT (Shi & Lin, 2019) and
AllenNLP (Gardner et al., 2017). Even though our architectures are quite similar, there are a few
differences between our approaches, particularly with respect to the inputs and the sequence labeling
layer. A first difference is that AllenNLP uses a time-distributed linear dense layer to classify
the sequence of outputs from the BERT system into the sequence of semantic role labels. In
contrast, SimpleBERT and our method use a single RNN layer, as this might be better adapted to
handling sequence data.
Second, both SimpleBERT and AllenNLP extend the sentence representation of BERT to
include the current predicate, as a way of informing the model which action to attend to. Instead,
we restrict our inputs to only a single predicate per sentence. This approach, which is also used
by Daza and Frank (2018) and Zhou and Xu (2015), showed better performance in our initial
experiments. We control for this possible source of noise by presenting two versions of the baselines.
In the first version, the models are only provided with the action description and have to perform
predicate identification and semantic role labeling. In the second version, we provide the sentence
and the gold-standard predicate, so the models only need to do the semantic role labeling task. We
refer to the second version as the oracle predicate version.
Third, and the biggest difference, both SimpleBERT and AllenNLP are trained on non-
literary data. Specifically, these models learn from newswire datasets such as CoNLL-2012 (Pradhan,
Moschitti, Xue, Uryupina, & Zhang, 2012) and OntoNotes 4.0 (Weischedel et al., 2011). These
baselines provide a direct comparison to the performance of out-of-the-box state-of-the-art SRL
systems when applied to a novel real-world application. Additionally, we compare our approach to
that of SimpleBERT trained end-to-end. This provides a comparison point to the state-of-the-art
baseline after domain adaptation.
5.5 Results
5.5.1 Semantic Role Labeling Performance
Table 5.1 presents a summary of the performance results³. With respect to identifying agents and
patients, the baseline models achieve high precision but suffer from low recall. While high precision
is certainly important, we must consider that this model is to be used in the context of a large-scale
analysis of characters and their actions (see Chapter 6). To obtain the most representative sample
for such an analysis, we would want our model to retrieve as many instances of actions and their
participants as possible. Hence, models with a higher recall ought to be preferred.

Even though there is a clear domain mismatch in the way the baselines were trained (newswire
vs. movie scripts), both baselines can still recover some of the signal present in the dataset.
In contrast, our naïve approach of relying on a pre-trained BERT-base (uncased) model resulted in
no Patient label being produced, and thus a 0.00% F1 score for that category. This suggests that
movie scripts, and character names specifically, consistently follow proper grammar and capitalization.
Our results show that the domain adaptation of the BERT language model resulted in the biggest
improvement overall. For example, domain-adapting SimpleBERT (Shi & Lin, 2019) resulted in a
6-point increase in action identification, and about 3 to 4 points in agent and patient classification.
Furthermore, our proposed model, trained end-to-end, achieved over 30 absolute points above the
baseline performance. The best performing model was our proposed conjunction of transformer and
RNN (GRU), where the transformer was initialized with a BERT-base cased pre-trained model,
and the complete set of parameters was updated end-to-end. This model achieves F1 scores of
96.80, 89.78, and 73.00 percent for action, agent, and patient, respectively. Compared to the baseline
models, the difference in these performances was found to be significant (permutation test, n = 10⁵,
all p < 0.05). Surprisingly, even though we did not precondition BERT with the current predicate
(as both SimpleBERT and AllenNLP do), our model was able to correctly infer the action for most
of the sentences.
Finally, we investigated changes in the performance of the proposed model due to different RNN
dimensions (see Table A.1 in the appendix). The model seems to saturate around a dimension
of 300, with higher dimensions not performing particularly differently from the current size. This
result also suggests that the poor performance of the SRL baselines could be due to their larger
size.

³ A complete list of results can be found in Table A.1 in the appendix.
5.6 Conclusion
This work presents a novel computational model to automatically identify actions as well as the
characters that participate in them. We frame the problem as an SRL task where the model has
to identify the predicate (action) and its constituents (agents and patients). Additionally, we
constructed a dataset of manual annotations to evaluate the performance of our model and improve
it via domain adaptation. Our results demonstrate that our model outperforms competitive state-
of-the-art baselines. As we will show in the following chapter, our model allows us to
scale our character analysis by automatically producing reliable labels for a large-scale dataset of
character actions.
Model                          Features           Accuracy  Action  ------ Agent ------  ----- Patient -----
                                                              F1    Prec    Rec    F1    Prec    Rec    F1
Baselines (Oracle Predicate)
  Sap et al. (2017a)           NER + DepTree        64.54     -     93.82  54.41  68.88  94.62  34.45  51.57
  Gardner et al. (2017)        BERT-base uncased    60.65     -     90.86  31.99  47.31  72.46  28.82  41.24
  Shi and Lin (2019)           BERT-large cased     78.20     -     95.10  63.42  76.10  41.30  38.33  39.76
Baselines
  Sap et al. (2017a)           NER + DepTree        48.60   69.14   86.15  49.72  63.05  85.19  26.51  40.44
  Gardner et al. (2017)        BERT-base uncased    41.80   69.23   71.63  27.85  40.11  37.44  23.63  28.98
  Shi and Lin (2019)           BERT-large cased     51.90   69.70   91.55  54.78  68.55  29.44  34.87  31.93
  Shi and Lin (2019)           Domain adapted       66.00   75.10   86.93  60.82  71.57  25.41  58.77  35.48
Proposed Model
  BERT+LSTM                    BERT-base uncased    31.50   17.82   86.03  58.02  69.30   0.00   0.00   0.00
  BERT+LSTM                    BERT-base cased      68.90   81.88   86.64  60.91  71.60  47.59  63.45  54.39
  BERT+LSTM                    Domain adapted       89.30   96.84   88.90  90.39  89.64  82.42  61.70  70.57
  BERT+GRU                     BERT-base cased      67.80   87.11   86.84  60.91  71.60  45.55  50.88  48.07
  BERT+GRU                     Domain adapted       90.20   96.80   89.82  89.74  89.78  74.10  71.93  73.00

Table 5.1: Summary of system performance results. Precision (Prec), Recall (Rec), and macro-average F1 scores reported from the test
set. Additional results can be found in Table A.1 in the appendix.
Chapter 6
Boys don't cry (or snuggle or dance):
A large-scale analysis of gendered
actions in film.
In this chapter, we describe a large-scale analysis of over 1.2M action descriptions. The goal is
to gain a better understanding of the ways in which filmmakers, sometimes unwillingly, tend
to communicate and perpetuate gender stereotypes. To this end, we use the model described
in Chapter 5 to reliably label, at scale, samples of characters engaging in actions. We then identify
differences in the frequency of portrayals of these actions based on the gender of their agents, patients,
or both. Our results highlight a film industry where female characters are still bound by their looks and
show less agency, and where LGBT+ persons and persons with disabilities are under-represented.
6.1 Introduction
There is a clear disparity in the way characters are portrayed in TV and film media, particularly
with respect to their assumed gender¹. Women are often presented in decorative (e.g., for their
body and beauty), family-oriented, and demure roles (Fouts & Burggraf, 1999; Paek et al., 2011;
Sink & Mastro, 2017; Uray & Burnaz, 2003; Y. Zhang et al., 2010), whereas men are typically
shown as independent, authoritarian, and professional agents, irrespective of age and physical
appearance (Gauntlett, 2008; Reichert & Carpenter, 2004). While media often reflects the views of
a society, it also has a considerable impact on how gender stereotypes get reinforced (Holtzman &
Sharpe, 2014). Through assumed behaviors and social roles, media portrayals are a major influence
on the way we construct our beliefs and ideas around gender-appropriate behaviors and norms,
particularly during the formative years of childhood and youth (Browne Graves, 1999; Cheryan, Drury,
& Vichayapai, 2013; Gerbner, Gross, Morgan, & Signorielli, 1994; Steinke, 2017). Even throughout
our adulthood, these portrayals can still guide the way we think (Entman, 1989), how we create
our worldview and perceptions of others (Widdershoven, Josselson, & Lieblich, 1993), the way we
dress (J. D. Wilson & MacGillivray, 1998), and how we define our own self-identity (Polce-Lynch,
Myers, Kliewer, & Kilmartin, 2001).

¹ We acknowledge that gender lies on a spectrum, and reducing it to a male-female-neutral categorization might
be considered overly simplistic. However, at this time, we are limited by the labeling procedures. A more nuanced
understanding of gender portrayals is something we may pursue in future work.
While quantifying the gender disparity in the media, studies over the past two decades have
highlighted an industry where women tend to be both under-represented (Geena Davis Institute
on Gender in Media, 2019a; S. L. Smith, Choueiti, & Pieper, 2017) and depicted in a stereotypical
manner (Gauntlett, 2008; Wood, 1994). The former is so gravely unbalanced that male
leads outnumber female leads two-to-one, with male characters speaking and appearing on the
screen twice as often as their female counterparts (Geena Davis Institute on Gender in Media,
2019a). Generally speaking, however, the bulk of efforts to study diversity in the representation
of characters in the media are largely qualitative and require immense manual work with human
annotations and/or surveys (Somandepalli et al., 2021). These cannot match the scale at which
media content is currently being produced or consumed; in fact, these efforts have been unable to
produce systematic data for both science and media scholarship at scale. To provide supporting
evidence for the gender gap at a larger scale, researchers have recently turned to machine learning
models. These applications range from works that automatically detect actors' faces and voices in
TV and film (Guha et al., 2015; Hebbar et al., 2019, 2018; Kumar et al., 2016) to works on film
narrative understanding through the analysis of character networks (Kagan et al., 2020; Ramakrishna
et al., 2017). A number of these are applied directly to movie scripts to gain insight into the first
stage of content creation, where the suggested modifications could be implemented at lower
cost. For example, a linguistic analysis of movie scripts found that romantic movies tend to include
more feminine language while action movies tend to have more masculine language (Ramakrishna
et al., 2015). With respect to characters' dialogue, linguistic analysis studies have found that
male characters are associated with a higher number of words related to achievement, whereas
female characters are usually written with more positive language, lower agency, and less power
than their male counterparts (Ramakrishna et al., 2017; Sap et al., 2017a). Other types of studies
center around the social network inferred from scene sharing among characters (Kagan et al., 2020;
Ramakrishna et al., 2017). These studies demonstrate that, with few exceptions, men play almost
all central roles across all genres, and for every three characters interacting in a movie, at least two
are men.
One limiting factor for these automatic approaches is that gender stereotypes are not bound to
dialogue or scene co-appearance; they can also be communicated through the actions and behaviors
of the characters (England et al., 2011). For instance, consider the following three traditional
stereotypes. First, the tomboy, a stereotype embodied by girls who are interested in science, vehicle
mechanics, sports, or other gender non-conforming behaviors or appearances. Second, that of the
spinster or crazy cat lady, an unmarried woman of a certain age whose sole narrative arc centers
around her quest to find a partner. Finally, the scary angry man, which often portrays persons of
color as innately savage, animalistic, destructive, and criminal. Even though these stereotypes are
typically not described explicitly through dialogue (and one would be hard pressed to infer them
from a character's scene co-appearance), audiences still internalize their cautionary tales through
an understanding of the character's actions and behaviors (Niemiec & Wedding, 2013).
To address this limitation, we present three contributions. First, we collect over 1.2 million
action descriptions (i.e., sentences that describe an action occurring during a scene) from a dataset
of 1,000 movie scripts. To accurately identify actions and their constituents at such a large scale,
our second contribution is a deep-learning sequence-labelling model based on state-of-the-art pre-
trained language models (Devlin et al., 2019). To train and estimate the performance of our model,
we manually annotate a subset of action descriptions. These annotations allow the model to learn
the different ways in which scriptwriters describe characters' actions and behaviors. Our model
performs significantly better than competitive alternatives, and provides us with a way to scale
up labeling for the dataset. The final dataset, one of the largest of its kind, contains annotations
for more than 50,000 different actions and over 20,000 different characters participating in these
actions (either as agents or patients).
Our third and main contribution is a large-scale statistical framework to uncover differences
in how characters engage in actions depending on their role (agent, patient) and their assumed
genders. To this end, we perform a series of regression analyses over the 1.2+ million action
descriptions to estimate frequency of portrayal as a function of role and gender. As part of our
results, we provide insights into some of the nuanced aspects in which certain stereotypes are being
communicated through characters' actions and behaviors. These insights can be categorized into
one of four groups. First, on how female characters are often shown with less agency than male
characters (Sap et al., 2017a). Second, on the emphasis placed on the female appearance and sexual
objectification of women actors (Eva-maria Jacobsson, 1999; Mulvey, 1989). Third, on the marked
gender disparities of caregivers, particularly for those characters that use a wheelchair (England et
al., 2011). And finally, on how gender plays a role in the frequency of affective portrayals, either
by frequently casting women into overly emotional roles (Fast, Vachovsky, & Bernstein, 2016;
Gauntlett, 2008; Stabile, 2009) or by a clear absence of male portrayals of affection (W. W. Dixon,
2012; McKinnon, 2015).
We hope that this will serve as an example of how this framework can prove beneficial for
filmmakers, scriptwriters, and society in general.
In summary, the contributions of this work are:
1. We construct and provide an action description dataset with over 1.2 million descriptions
obtained from movie scripts.
2. An automatic state-of-the-art machine learning method to identify actions, agents and patients
from linguistic cues found in the action descriptions.
3. A novel statistical framework on 1.2 million action descriptions to highlight biases in the
portrayals of actions, agents and patients.
To provide other researchers with the opportunity to explore additional hypotheses, we have
made the dataset and labeling models freely available for download under the Creative Commons
ShareAlike 4.0 license at https://sail.usc.edu/~ccmi/actions-agents-and-patients/.
6.2 Dataset
Our dataset construction follows the details presented in Section 5.3. In summary, we start from
the movie scripts collected from Scriptbase (P. J. Gorinski & Lapata, 2015), a corpus of 912
Hollywood movie scripts from 31 genres over the years 1909–2013. From each movie script, we
collect all of its action descriptions and discard the rest of the elements. This allows us to focus on
the characters' actions and behaviors, and provides a complementary approach to prior works on
character dialogue (Kagan et al., 2020; Ramakrishna et al., 2017; Sap et al., 2017a). We split
each action paragraph into sentences and identify the actions (verbs) using spaCy (Honnibal et
al., 2020). With respect to the action descriptions, we parsed a total of 1,242,107 sentences
(µ = 1363.45, σ = 560.03, Mdn = 1313 per movie). Since a sentence can have more than one
action, these descriptions amount to 1,634,230 action instances over a set of 84,513 unique actions.
Table 6.1 presents descriptive statistics of the constructed dataset.
We use the system presented in Chapter 5 to automatically identify actions and their participants
at a large scale. Our system is able to identify a total of 2,136,207 instances of characters partici-
pating in actions. Most of these are given by pronouns: for example, the most common agents are
`He' (n = 109,408) and `She' (n = 59,043), and the most frequent patients are `him' (n = 67,567)
and `it' (n = 61,659). The most common actions are `looks' (n = 44,725), `turns' (25,215), `takes'
(19,137), `see' (17,716) and `walks' (16,279). These actions were also the most frequent ones across
all genders and movie genres. The frequency of the actions varied highly depending on the genre
of the movie (see Table B.1 in the appendix).
6.3 Method
The focus of this work is to discover patterns in the characters' actions, and how these vary according
to the characters' role (i.e., agent and/or patient) and gender. We propose a statistical model to
identify significant differences in the frequency of the action portrayals due to the role and gender
of its participants. We do this through a series of three studies. Our first study focuses on
                                 Min      Mean    Median     Max     Std.
Movies: 912
Genres: 31
  Per movie                        2      3.00      3.00       7     0.95
Scenes: 131,954
  Per movie                        1   1363.45    136.00     646    68.52
Action Descriptions: 1,242,107
  Per movie                        1   1363.45   1313.00    4901   560.03
Actions: 1,634,230
  Per movie                        1   1797.83   1765.00    4057   633.98
  Per action description           0      1.32      1.00      16     1.06
Agents: 1,175,237
  Per movie                        1   1008.93    977.00    3061   361.33
  Per action description           0      0.95      1.00      11     0.56
Patients: 960,970
  Per movie                        2   1056.01   1000.50    4286   464.94
  Per action description           0      0.77      1.00      14     0.89
Table 6.1: Descriptive statistics of the dataset
explaining action frequency as a function of agents' gender (agent-only). Separately, the second
study investigates this relation for the patients' gender (patient-only). Finally, our third study
investigates the effects of the agent's and patient's gender simultaneously (agent-patient). In
the following sections, we provide an in-depth overview of the steps required for the large-scale
statistical analysis.
6.3.1 Character Gender Estimation
To obtain the characters' assumed gender, we follow a hierarchical heuristic approach. This ap-
proach is heavily informed by prior work in the same domain (Kagan et al., 2020; Ramakrishna
et al., 2017; Sap et al., 2017a). Our gender estimation method proceeds as follows: first, for movie
scripts of an already produced film, we obtain the character's gender from the casting of that role
on IMDb (http://imdb.com). For the remainder of the characters, we rely on the following heuristics:
1. For proper names (as identified by their part-of-speech tag), we estimate gender using histor-
ical U.S. census data (Social Security Administration, 1998).
2. We use gendered pronouns as markers of that character's gender. The set of pronouns was
selected from Twenge, Campbell, and Gentile (2012), and includes the following words: female
pronouns (e.g., she, hers, her, and herself); male pronouns (e.g., he, his, him, and himself);
and neutral (or plural) pronouns (e.g., we, they, them).
3. Lookup over a manually labeled word list containing the 3,000 most frequent words and their
gender. This list includes gendered words for family and relationships (e.g., uncle, aunt, wife,
husband), common gendered nouns (e.g., boy, gal), and other words where gender is evident
(e.g., nun, congresswoman, policeman).
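The hierarchical fallback above can be sketched as follows. This is an illustrative toy version: the small lookup tables stand in for the IMDb casting data, U.S. census name statistics, and the manually labeled word list, which are far larger in the actual pipeline.

```python
# Sketch of the hierarchical gender heuristic described above.
# The dictionaries are illustrative stand-ins, not the real resources.
CENSUS_NAMES = {"mary": "female", "john": "male"}
PRONOUNS = {
    "she": "female", "her": "female", "hers": "female", "herself": "female",
    "he": "male", "his": "male", "him": "male", "himself": "male",
    "we": "neutral", "they": "neutral", "them": "neutral",
}
GENDERED_WORDS = {"aunt": "female", "uncle": "male", "nun": "female",
                  "congresswoman": "female", "policeman": "male"}

def estimate_gender(token, imdb_lookup=None, is_proper_name=False):
    """Return 'female', 'male', 'neutral', or None, trying each source in order."""
    word = token.lower()
    if imdb_lookup and word in imdb_lookup:      # 0. casting information (IMDb)
        return imdb_lookup[word]
    if is_proper_name and word in CENSUS_NAMES:  # 1. census first-name data
        return CENSUS_NAMES[word]
    if word in PRONOUNS:                         # 2. gendered pronouns
        return PRONOUNS[word]
    return GENDERED_WORDS.get(word)              # 3. gendered word list
```

Each stage is only consulted when the previous, more reliable one fails, which mirrors the hierarchical design described in the text.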
We estimate the performance of our gender estimation through a manual verification process.
This process involved a manual inspection of the dataset to collect 400 character names alongside
their gender (i.e., non-gendered, male, female, or neutral). We report precision, recall, and F1 for
each one of these groups, as well as average metrics across all groups.
6.3.2 Regression Analysis
The underlying assumption of our work is that if there is no difference between the number of times
an action is portrayed by any particular gender, then this action is not stereotypical. Under this
assumption, we should expect to see that the frequencies of certain actions vary as a function of the
agent's and patient's gender. This gives us a natural framework for posing the problem of identifying
stereotypes as a regression over the frequency of actions and the genders of its participants.
To uncover this relation, we use a Poisson-regression generalized linear mixed model (GLMM) (Bres-
low & Clayton, 1993). GLMMs are an extension of generalized linear models (e.g., logistic regres-
sion) that include both fixed and random effects. A fixed-effects approach was chosen over a random-
effects approach because our data contains repeated measurements (i.e., a single character can
participate in multiple actions, multiple times). Furthermore, we use a Poisson regression as it is
particularly useful for response variables that represent counts or frequencies (Coxe, West, & Aiken,
2009). All major conclusions about frequency of portrayals hold regardless of the fixed-effects or
random-effects approach.
The formal specification of the GLMM is as follows: let Y = [y_1, y_2, ..., y_N] be our response
variable. Each y_i corresponds to the number of times we see action i in our dataset. We assume
that Y follows a multi-variate Poisson distribution, and we model the expected value as a linear
combination of unknown parameters,

    g(E[Y | γ]) = Xβ + Zγ,    Z = [C_a, m_g]                    (6.3.1)

where X ∈ R^{N×p} is the matrix of predictor variables; Z ∈ R^{N×q}, the design matrix of q random
effects; and β ∈ R^p and γ ∈ R^q the vectors of the regression coefficients for the fixed effects and
random effects, respectively. The link function g(·) (and its inverse g^{-1}(·)) is what allows the
response variables to come from different distributions. In our work, we use the log-link function
g(µ) = log(µ) to induce a Poisson distribution.
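To make the log link concrete, the following is a minimal fixed-effects-only Poisson regression fit by iteratively reweighted least squares. This is a sketch under simplifying assumptions: it omits the random-effects term Zγ and the deviance machinery of a full GLMM, and the simulated data is purely illustrative.

```python
import numpy as np

def fit_poisson(X, y, n_iter=100, tol=1e-8):
    """Poisson regression with log link, fit by iteratively reweighted
    least squares (IRLS); fixed effects only, no random-effects term."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)                      # inverse link: E[y] = exp(X beta)
        z = eta + (y - mu) / mu               # working response for the log link
        W = mu                                # IRLS weights for the Poisson family
        beta_new = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Simulated counts with log E[y] = 1.0 + 0.5 * x
rng = np.random.default_rng(0)
x = rng.normal(size=5000)
X = np.column_stack([np.ones_like(x), x])
y = rng.poisson(np.exp(1.0 + 0.5 * x))
beta_hat = fit_poisson(X, y)   # should recover roughly [1.0, 0.5]
```

In the actual analysis the design matrix additionally carries the action and genre random effects; fitting is then done by deviance minimization, as described below.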
Predictors. We vary our predictor variables X according to the study. For (agent-only) and
(patient-only), X corresponds to an indicator variable of the gender of the agents or patients,
respectively. For (agent-patient), X is designed to reflect the gender group of agents and patients.
This gender group is constructed as a categorical value for the combinations of male-to-male, male-
to-female, female-to-male, and female-to-female.
Random effects. We control for two sources of variability as part of the random-effect matrix Z.
The first effect controls for the distribution of actions, that is, the fact that naturally some actions
will occur more often than others. For this, Z incorporates actions as a categorical co-variate (C_a).
The second effect we control for is the movie genre. This comes from the fact that certain actions
are more common than others for a particular type of movie. For example, we expect the action
`run' to be more common in action movies than in dramas and romantic comedies. We obtain all of
the movie's genres from IMDb, and transform them into a binary matrix, m_g ∈ {0, 1}^{N×G}, where
each entry indicates whether that movie has that particular genre. We include m_g as an additional
co-variate in our model.
Parameter estimation. Parameter estimation is performed by deviance minimization, a gener-
alization of the idea of using the sum of squares of residuals in ordinary least squares to cases where
model-fitting is achieved by maximum likelihood (Breslow & Clayton, 1993). For each study, we
subsample the action description dataset to include only complete records. This means that the
(agent-only) and (patient-only) models are fitted to subsets for which we have a known gender
for agents and patients, respectively. Similarly, the (agent-patient) model is fitted to those records
for which we know the gender of both agent and patient. In all cases, we observe that the gender
distributions of the sub-samples still follow the gender distribution of the complete dataset. More-
over, all follow previous reports in the literature, that is, a 2-to-1 male-to-female ratio (W. Radford
& Gallé, 2015; S. L. Smith et al., 2017). After fitting, we perform a factor analysis on the coefficient
estimates to gain insights into the subset of actions that occur more frequently as a function of the
agent's and the patient's genders.
Goodness of Fit. In each of our studies, we validate the GLMM regression approach by com-
paring three nested models. All models tested have the frequency of portrayal as the response
variable, and control for movie genres and action description. The only difference between nested
setups is the number of explanatory variables used. The first setup is the null model. This does not
use any explanatory variables and relies solely on the control variables Z to predict the variable Y.
The second experimental setup incorporates the gender of agents (or patients) as an independent
categorical value (following R notation: Y ~ gender + Z, where gender is a categorical variable for
the agent's or patient's gender). By comparing these two setups, we can provide statistical evidence
for the existence of a relation between the gender of the characters (X) and the frequency of portrayals
(Y). The third setup incorporates an interaction term between the actions and genders (in R
notation, Y ~ gender*action + Z). This provides evidence for the hypothesis that the interaction
between the agent's (or patient's) gender and actions makes for a better predictor than just the
action and gender on their own. All model comparisons were done using a likelihood ratio test
(χ² test) at a significance level of α = 0.05.
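The nested-model comparison can be sketched as follows. For simplicity this uses the closed-form chi-square survival function for one degree of freedom (a general-purpose routine such as scipy.stats.chi2.sf would handle arbitrary degrees of freedom); the log-likelihood values are hypothetical.

```python
import math

def lr_statistic(loglik_null, loglik_full):
    """Likelihood-ratio statistic; under H0 it follows a chi^2 distribution
    with df equal to the number of extra parameters in the fuller model."""
    return 2.0 * (loglik_full - loglik_null)

def chi2_sf_df1(x):
    """Survival function (p-value) of a chi^2 with one degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

# Hypothetical log-likelihoods: adding the gender predictor improves fit by 10
lr = lr_statistic(-5000.0, -4990.0)   # = 20.0
p_value = chi2_sf_df1(lr)
reject_null = p_value < 0.05          # compare against alpha = 0.05
```

A significant result, as in the comparisons above, indicates that the fuller model explains the portrayal frequencies better than the nested one.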
               precision   recall   f1-score
Female              91.0     67.0       77.0
Male                91.0     63.0       75.0
Neutral             95.0     80.0       87.0
Non-gendered        60.0     91.0       72.0
Average            84.25    75.25      77.75
Table 6.2: Gender heuristics classification report.
            Female    Neutral      Male   Unknown       Total
Agent      227,196    176,596   479,136    34,186     917,114
Patient    144,890    168,092   284,757    15,734     613,473
Total      372,086    344,688   763,893    49,920   1,530,587
Table 6.3: Distribution of agents and patients according to their gender
6.4 Validation
6.4.1 Gender Estimation Performance
With respect to the gender estimation, we identified a total of n = 2,136,207 character instances.
Our hierarchical heuristic method is able to infer the gender of a majority (71.64%) of instances
(n = 1,530,587). From these, 917,114 correspond to agents and 613,473 to patients. In line
with previous research, the sample of genders contains male characters in a 2-to-1 proportion to
female characters (W. Radford & Gallé, 2015; S. L. Smith et al., 2017). We manually identified
49,885 cases of character references which should have a gender, but for which our heuristics were
not able to determine which gender it should be. A majority of these cases (56.4%) are unresolved
coreferences (e.g., I, you, me), and could be addressed in future work when appropriate literary
coreference systems become widely available.
Table 6.2 presents the results of our manual validation of the gender classification heuristics.
Our method achieves 75% accuracy and a macro-F1 score of 77.75%, with individual F1 scores
ranging from 72.0% (non-gendered) to 87.0% (neutral). From the class-level results, we see that our
gender classification method is quite precise when labeling female, male, and neutral characters.
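The per-class and macro-averaged scores reported here follow the standard definitions, sketched below with hypothetical error counts (not the actual evaluation data):

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall, and F1 from true/false positive and
    false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def macro_average(per_class_counts):
    """Unweighted mean of per-class (precision, recall, F1) triples."""
    scores = [precision_recall_f1(*c) for c in per_class_counts]
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))

# Hypothetical counts for two classes: (tp, fp, fn)
counts = [(90, 10, 30), (60, 40, 20)]
macro_p, macro_r, macro_f1 = macro_average(counts)
```

The macro average weights every class equally, which is why the rare non-gendered class can pull the average down despite the strong per-class precision.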
6.4.2 Regression Model Validation
Across all three studies, our goodness-of-fit test revealed that models that include gender predictors
performed significantly better than the null models (χ² tests, all p < 0.0001). Including interactions
between gender and action improved the performance of all models (χ² tests, all p < 0.0001). These
results support our assumption that the gender of the agents and patients plays an important role
in defining the frequency of portrayals for a particular set of actions.
6.5 Large-scale Analysis
We are able to find 571 actions for which the characters' gender plays a significant role (t-test,
α = 0.05) in the frequency of the action portrayal. A complete list of these results is presented
as part of our supplementary materials (see Table C.1, Table D.1 and Table E.1). We note that
due to limitations in our system, some of these results can be attributed to errors in the processing
pipeline. For example, errors in the lemmatizer or parsing modules (e.g., confusing the past tense of
lie with lay), or actions that can only be applied to objects and not characters (e.g., dress, wear).
Through manual inspection of the results, we identify and remove these instances from our results
(we still present them in our result tables for completeness). Figure 6.5.1 presents the actions where
the relation between frequency of portrayal and characters' gender is the strongest, that is, the
ones with the largest regression coefficients. In the following, we highlight some of the findings and
share some insights into their significance.
6.5.1 Agency and Gender
Previous works argued that male characters are generally given more agency than female char-
acters (Geena Davis Institute on Gender in Media, 2019a; Sap et al., 2017a). We are able to
corroborate this finding through proportion tests. Our results reveal that the proportion of male
agents is significantly higher than that of female agents (Z = 294.12, p < 2e-16). Additionally, the
proportion of female patients is larger than that of male patients (Z = 294.12, p < 2e-16).
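The proportion tests above are standard two-sample z-tests for equality of proportions. A sketch with hypothetical counts follows; the function does not reproduce the exact Z statistics reported, which come from the full agent and patient counts in Table 6.3.

```python
import math

def two_proportion_ztest(count1, n1, count2, n2):
    """Z statistic for H0: p1 == p2, using the pooled-variance estimate."""
    p1, p2 = count1 / n1, count2 / n2
    p_pool = (count1 + count2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Hypothetical example: 600 of 1000 agents are male vs. 300 of 1000 female
z = two_proportion_ztest(600, 1000, 300, 1000)   # positive: first share larger
```

A large positive Z (compared against the standard normal) rejects the hypothesis that the two proportions are equal, as in the agent/patient comparisons above.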
The actions we found to be significantly dependent on the gender of either the agent or the
patient parallel those collected by Sap et al. (2017a). For example, male agents are less likely to be
shown `letting' other male characters do something to them (agent-patient: β = -2.27); in fact,
they are not often seen being `helped', `lifted' or `carried' by other male characters (agent-patient:
β = -1.75, β = -1.85, and β = -1.87, respectively). Moreover, male characters are less likely to
be `forced', `stopped' or `rushed' by other male characters (agent-patient: β = -1.86, β = -1.98,
and β = -1.53, respectively).
Figure 6.5.1: Actions with the largest coefficients for each of our models. Each coefficient (β)
represents the change in log odds of the character being male/female when the frequency of the
action changes by one unit. Negative values are more likely to be portrayed by female characters
(purple); positive coefficients are more likely to be played by male characters (teal).
6.5.2 The Male Gaze Theory
The `male gaze' theory (Mulvey, 1989) posits three different looks associated with a film: one of
the camera (usually controlled by a man, either a staffer or director), one of the characters looking
at each other, and one originating from the spectators or audiences. In all of these, the woman is
the passive receiver of the gaze and the man is the active spectator of the woman; the woman is
taken as an object, subjected to a controlling and curious gaze of the man (Eva-maria Jacobsson,
1999). Our results highlight differences in the emphasis placed on the female appearance and sexual
objectification of women actors. We are able to show that female characters are more likely to be
gawked or looked at by other characters (patient-only: β = -2.76 and β = -2.88, respectively).
Moreover, male agents are not likely to glance, stare, gaze, or look at other males (agent-patient:
β = -1.55, β = -1.64, β = -1.84, and β = -1.67, respectively). As Mulvey argues, "female
portrayals can be characterised by their `to-be-looked-at-ness' with man being `the bearer of the
look'" (Mulvey, 1989).
6.5.3 Representations of Disability
Our framework highlights gender differences in the portrayal of characters living with a disabil-
ity, specifically for those that use a wheelchair. We found that whenever the character requires
someone to push their wheelchair, male agents are unlikely to `wheel' either female patients or
male patients (agent-patient: β = -3.45 and β = -1.74, respectively). This result might be a
reflection of real life. Women are often the predominant providers of informal care for family mem-
bers with chronic medical conditions or disabilities, including the elderly and adults with mental
illnesses, more than twice as often as men (Grigoryeva, 2014; Sharma, Chakrabarti, & Grover,
2016). This may have led to the role of female characters as nurturing care-givers, a stereotype
often over-represented in romantic dramas. In fact, a manual inspection of our sample revealed
that a majority of the `wheel' instances originate from two movies with central themes revolving
around disability and romance: Me Before You (Sharrock, 2016) and The Sessions (Lewin, 2012).
The rest of the instances can be attributed to the science fiction genre, with Gattaca (Niccol, 1997),
a highly acclaimed dystopian futuristic drama, and the superhero saga of X-Men (Singer, Ratner,
& Vaughn, 2000). These movies have been previously studied for their stereotypical representa-
tions of disability. For example, disability being used as a narrative tool to reassure able-bodied
audiences of their "normality" (C. Barnes, 2012; Ellis, 2003). The characters with disabilities are
often portrayed as angry because of the inability to accept their own limitations, and death is often
shown as a merciful way to deal with the problem of disability (Enns, 2000).
6.5.4 Portrayals of Emotion
As agents, male characters are less likely to be shown `sobbing', `crying' or `staring hor-
rified' (agent-only: β = -1.47, β = -1.01, and β = -3.44, respectively). Likewise, our results
suggest that male characters rarely `scream' at other characters, regardless of their gender (agent-
patient: β = -2.59 for male patients, β = -1.84 for female patients). Male agents do not `snap',
`laugh' or `smile' at other male characters (agent-patient: β = -1.82, β = -2.46, and β = -2.19,
respectively). These results fall in line with previous media studies that suggested that female
characters are typically stereotyped through portrayals of emotional outbursts (Fast, Vachovsky, &
Bernstein, 2016).
Other results show that female characters are more likely to be patients of aggressive actions. We
see this in examples of actions such as kidnap, drug, and hassle, where the patient is more likely to be
a female character (patient-only: β = -2.21, β = -2.89, and β = -2.60, respectively). Similarly,
female characters are less likely to be `lured' by other characters (patient-only: β = -2.41). From
their interactions, we see that male characters are not often portrayed `struggling' with other male
characters (agent-patient: β = -1.82). These findings seem to corroborate previous literature
on aggressive portrayals, particularly those suggesting that women are typically portrayed in
distress or in need of protection (Clifford et al., 2009; De Ceunynck et al., 2015; Gauntlett, 2008;
Stabile, 2009). One exception is the action `shot'. This has a higher frequency when the shot is
targeted at a male actor than when the target is female (patient-only: β = 3.19). One possible
explanation for this result is that male characters just happen to be portrayed more often in action
scenes involving a gun. As previous literature suggested, most of the perpetrators are portrayed by
middle-aged white male actors (Bleakley et al., 2012; Potter et al., 1995; S. L. Smith et al., 1998),
in action-driven male-dominated narratives where the conflict is resolved with the villain's demise.
Additional results highlight gender differences in the way affective interactions happen on the
screen, specifically with respect to the male-to-male dyad. Two male characters are rarely shown
on screen in actions such as `kissing', `wrapping [arms]', `dancing', or `hugging' (agent-patient:
β = -3.18, β = -2.91, β = -2.91, and β = -2.78, respectively). With heterosexuality being the
predominant assumption for characters in movies, one possibility is that our results merely reflect a
historical prohibition of public same-sex shows of affection (W. W. Dixon, 2012; McKinnon, 2015).
Whereas portrayals of mixed-sex affection were invisibly ordinary, same-sex shows of affection
evoked disgust and `cultural squeamishness', sometimes even to the point of violence (Tillmann-
Healy, 2002; Warner, 2002). If same-sex shows of affection induce such a response, why is it that
our results only capture the male-to-male dyad and not the female-to-female dyad? One possible
explanation is the sexualization of female shows of affection (Morris & Sloop, 2006). For example,
an on-screen lesbian kiss can also be read as a form of sexualized entertainment for the heterosexual
male viewer. Following this reasoning, we perform an additional comparison using GLMMs fitted
on the subset where the agent and patient are assigned the same assumed gender (i.e., M→M
vs. F→F). The relation between the frequency of the affective actions `kiss', `hug' and `wrap' and
the genders of the agent and patient remains significant (t-test, all p < 0.05). In all cases, the
affective actions were less likely to be portrayed by two males than two females (same-sex:
β = -3.22, β = -2.96, and β = -2.93, respectively).
6.6 Conclusion
TV and film are among the most universal mass media in history (Gerbner et al., 1980). They
have a tremendous power to affect the ways in which people think and behave. When character
depictions are perceived by the viewer as similar to their standard everyday reality, the media
message is amplified, creating a more powerful and influential suggestion (Gerbner et al., 1980;
Gerbner, Gross, Morgan, Signorielli, & Shanahan, 2002). According to past research, women are
expected to be less dominant, more emotional, less technical, and more nurturing, whereas men are
supposed to be assertive, competitive, and independent, avoiding weakness, insecurities, and
emotional outbursts (England et al., 2011; Fast, Vachovsky, & Bernstein, 2016; Jerald et al., 2017;
Koenig, 2018; Prentice & Carranza, 2002; Sap et al., 2017a).
This work presents a novel large-scale analysis of the actions taken by characters, and how
these actions are related to gender biases in media. Our results uncover linguistic patterns in the
action descriptions in movie scripts where male characters are portrayed with higher agency than
female characters; female characters are often cast in emotional supporting roles; male affection
is rarely presented on screen, especially when the patient of affection is also male; and where
female characters are often the patients of actions that draw attention to their appearance and
looks. Thus, our work complements previous literature and provides large-scale empirical evidence
to support their claims.
6.6.1 Limitations
Some of the limiting factors of our analysis originate from the limitations of the processing pipeline.
We are bound by the capabilities of the parser to identify words, their part-of-speech tags, as well
as the semantic role they play.
Another limitation is that of yearly trends. One could argue that the frequency of actions
could be explained by changes in vocabulary or film trends over the years. To address this limitation,
we performed initial experiments that incorporate a movie's year of release into the regression models
as an additional random effect. However, these models failed to converge. We hypothesize that
this is due to a non-uniform distribution of the number of films released per year (Watson, 2019).
Further work will look into possible groupings of consecutive years as a way of inducing the required
uniform distribution.
Part III
Character Representations from Roles
Chapter 7
Identifying Therapist and Client
Personae from Alliance Estimation
Psychotherapy, from a narrative perspective, is the process in which a client relates an on-going
life-story to a therapist. In each session, a client will recount events from their life, some of which
stand out as more significant than others. These significant stories can ultimately shape one's
identity. In this chapter, we study these narratives in the context of therapeutic alliance, a self-
reported measure of the perception of a shared bond between client and therapist. We propose
that alliance can be predicted from the interactions between certain types of clients and certain
types of therapists. To validate this method, we obtained 1235 transcribed sessions with client-reported
alliance to train an unsupervised approach that discovers groups of therapists and clients based on
common types of narrative characters, or personae. We measure the strength of the relation between
personae and alliance in two experiments. Our results show that (1) alliance can be explained by
the interactions between the discovered character types, and (2) models trained on therapist and
client personae achieve significant performance gains compared to competitive supervised baselines.
Finally, exploratory analysis reveals important character traits that lead to an improved perception
of alliance.
The work presented in this chapter was published in the following article: Martinez, et al. \Identifying Therapist
and Client Personae for Therapeutic Alliance Estimation." Proceedings of Interspeech, 2019.
7.1 Introduction
In psychotherapy, clients seek to overcome a particular difficulty or problem with the help of a
professional. It is uncommon for a client to know exactly what the root cause of their adversity
really is. Instead, clients usually begin a therapy session by recounting a life-story related to the
problems they want to work through, the significance and feelings they have associated with these
events, and how they relate to their relationships with others. Some of these stories will stand out
as more significant than others, usually the ones stemming from negative events. These significant
stories can ultimately shape one's identity (Morgan, 2000). Professional therapists unravel a client's
complex narratives to understand what is of interest, and to raise awareness of particular traits
and characteristics that may have caused the client's current afflictions. Moreover, therapists might
also benefit from telling a story of their own. For example, they might build rapport with their client
by recounting a similar life-experience, or deliver a therapeutic narrative focused on the client;
that is, when the therapist's story is really a retelling of the client's story. Hence, understanding
the narrative elements of these stories, and the similarities between them, might help provide
insights into how therapy works.
In this work we introduce a novel approach for analyzing therapeutic outcomes from the stories told during psychotherapy sessions. Several factors contribute to positive therapeutic outcomes, some of which are strongly related to the combination of individuals and their relationship (Baldwin et al., 2007; Goldberg et al., 2020; Lambert & Barley, 2001; M. N. Thompson et al., 2018). One specific relationship factor, known as therapeutic alliance (Flückiger et al., 2018), corresponds to the collaborative aspects of the therapist-client relationship, including the perception of a shared bond and the agreement on the focus of the therapy treatment. This factor is a major contributing element in psychotherapy success (Flückiger et al., 2018; Goldberg et al., 2020; Taber et al., 2011). Unlike other counting measurements (e.g., the ratio of open to closed questions; Miller, Moyers, Ernst, & Amrhein, 2003), it is unclear how alliance might be captured by what is discussed in each session (Baldwin et al., 2007; Goldberg et al., 2020). Instead, automatic assessment of therapeutic alliance might require higher-order cognitive and affective models of the individuals. To address this limitation, we propose to model alliance as a function of the interaction between certain types of therapists and certain types of clients. These types are automatically discovered from therapy
transcripts by identifying the attributes shared among the client's and therapist's characters in the stories told throughout the sessions. Characters' attributes are obtained through a Personae model (Bamman, O'Connor, & Smith, 2013). Personae, also known as character archetypes, are classes of characters grouped by similar traits, behaviors, and motivations (Jung, 1943). For example, common personae in storytelling include The Hero, The Villain, and The Wise Old Man. While these models have found success in narrative understanding (Bamman et al., 2013; Kim, Katerenchuk, Billet, Park, & Li, 2019), to the best of our knowledge no one has investigated their application to real-life narratives. Our approach is as follows: in each session, client and therapist tell stories, and as with other stories, these narratives contain characters, setting, plot, conflicts, and resolution. The characters of those stories are imperfect portraits of persons in the client's life, which include both the client themselves (as the protagonist) and the therapist. For the client and therapist characters, we automatically identify their personae by using an unsupervised model for narrative understanding (Bamman et al., 2013). This model extends the ideas of unsupervised topic modelling, in which documents (i.e., therapy sessions) are represented by mixtures of their characters' archetypes, and each archetype corresponds to a mixture of topics. To evaluate our approach, we first train a Persona model with automatically transcribed sessions. Then we analyze the relation between the interactions of characters' personae and alliance to show that the discovered personae are useful for the task of alliance estimation. Finally, we identify the most important character traits which promote the perception of alliance. To the best of our knowledge, this is the first work to connect the narrative processes underlying psychotherapy to alliance, which measures the impact (success) of an intervention.
7.2 Data
For this study, we obtained 1235 recorded sessions of dyadic interactions between a therapist and a client. Sessions were collected between September 2017 and December 2018 at a US university counseling center. Explicit consent from both the therapist and clients was obtained before recording. The total number of clients and therapists is 386 and 40, respectively. The average duration of the sessions was 50.71 ± 10.32 minutes. Table 7.1 summarizes the distribution of the number of sessions per therapist and client. Before the start of each session except the first one, self-reported
                          min   mean   max   support
# sessions per client       1   3.20    13       386
# sessions per therapist    1  30.88   131        40
# sessions per pair         1   3.11    13       397

Table 7.1: Number of sessions per client, per therapist, and per (client, therapist) pair in the available dataset. Support is the total number of clients, therapists, or pairs.
Figure 7.2.1: Histogram of the therapeutic alliance ratings.
alliance is collected from the clients. Therapeutic alliance is scored along 4 dimensions using the short form of the Working Alliance Inventory (Hatcher & Gillaspy, 2006; Horvath & Greenberg, 1989). This form includes four items: "[Therapist] and I are working towards mutually agreed upon goals", "I believe the way we are working on my problem is correct", "I feel that [Therapist] appreciates me", and "[Therapist] really understands me". Each item is scored by the client on a 7-point Likert scale and the average is used as the final alliance rating. A histogram of the alliance ratings in the available data is shown in Figure 7.2.1.
7.3 Method
As previously stated, we aim to model alliance as a function of the interaction between client's and therapist's types. As an indirect way of inferring a therapist's (client's) traits and motivations, we present a method based on computational techniques for persona discovery. These personae induce a natural clustering of characters, which we show can then be used to estimate reported alliance more accurately than linguistic-based methods.
First, sessions were automatically transcribed using a speech processing pipeline. This pipeline is based on state-of-the-art models offered by Kaldi (Povey et al., 2011). It consists of four steps:
1. Voice Activity Detection (VAD)
2. Diarization
3. Automatic Speech Recognition (ASR)
4. Role assignment
For VAD, a two-layer feed-forward network with a softmax inference layer at the frame level was used. Diarization was based on the x-vector/PLDA paradigm (Sell et al., 2018). For ASR, a time-delay neural network (Peddinti et al., 2015) and a tri-gram language model were trained on more than 4,000 hours of data from publicly available speech corpora augmented with noise and reverberation. We also adapted it using in-domain psychotherapy data. For role assignment, the two diarized speaker clusters were assigned to either therapist or client roles, following the method presented by Flemotomos et al. (2018). Performance of this system was evaluated on additional psychotherapy sessions provided by the counseling center. Unweighted Average Recall for VAD was 82.7%, Diarization Speaker Error Rate was 6.4%, and ASR Word Error Rate was 36.4%. The resulting ASR transcripts were processed for lemmatization, dependency parsing, and co-reference resolution using CoreNLP (Manning et al., 2014). As a last step, we split each transcription by speaker into two different documents.
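The final splitting step can be sketched as follows. The utterance format and role labels here are illustrative assumptions, not the exact data structures used in this work:

```python
def split_by_role(utterances):
    """Split a diarized, role-assigned transcript into one document per role.

    `utterances` is a list of (role, text) pairs, where role is either
    "therapist" or "client" (the labels produced by the role-assignment step).
    Returns a dict mapping each role to its concatenated text.
    """
    docs = {"therapist": [], "client": []}
    for role, text in utterances:
        docs[role].append(text)
    return {role: " ".join(parts) for role, parts in docs.items()}

# Illustrative three-turn session:
session = [
    ("therapist", "How have you been feeling this week?"),
    ("client", "I had a rough couple of days."),
    ("therapist", "Tell me more about that."),
]
docs = split_by_role(session)
```

Each of the two resulting documents is then processed independently by the downstream persona model.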
7.3.1 Character Identification
Narratives may contain any number of characters; however, we are only interested in those characters that represent the therapist and the client. To identify when a participant narrates actions corresponding to one of these two characters, we rely on the following assumption: the therapist (client) refers to their own character using first-person pronouns only; conversely, they refer to the other participant's character with second-person pronouns. For example, when narrating a story about the client's character, the therapist uses second-person singular pronouns (i.e., "you"), whereas the client refers to that same character using first-person singular pronouns (i.e., "I"). This process yields four different character combinations: (1) therapist character from therapist text,
Figure 7.3.1: Personae and topic distributions from psychotherapy sessions: personae are distributions over the topics of conversation (shown in green). Sets are learned for clients and therapists independently (shown in blue and light pink, respectively). Therapist and client get assigned a single persona per session.
(2) therapist character from client text, (3) client character from therapist text, and (4) client character from client text. We explore the contribution of each of these combinations as part of our experiments.
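Under this assumption, the mapping from (speaker role, pronoun) to character can be sketched as below; the pronoun lists are a simplification of the co-reference output actually used:

```python
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}
SECOND_PERSON = {"you", "your", "yours", "yourself"}

def character_of(speaker_role, pronoun):
    """Map a pronoun uttered by `speaker_role` ("therapist" or "client")
    to the character it refers to, or None for any other pronoun."""
    p = pronoun.lower()
    other = "client" if speaker_role == "therapist" else "therapist"
    if p in FIRST_PERSON:
        return speaker_role   # speakers self-refer in first person
    if p in SECOND_PERSON:
        return other          # and address the other participant in second person
    return None

# e.g. a therapist saying "you" is narrating the client's character
```

Note that this simple rule does not handle the impersonal "you", a limitation discussed in the conclusion of this chapter.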
7.3.2 Personae Model
For each session, we assign a persona to each of the two characters. Each persona is selected from a distribution learned from the data using the unsupervised topic modelling technique first proposed by Bamman et al. (2013). This model extends Latent Dirichlet Allocation (Blei, Ng, & Jordan, 2003) to narratives by assuming each story is generated by a mixture of characters' traits. These traits aim to capture the way in which a character is revealed through the narrative: by the actions they take toward others, the actions done to them, and the attributes used to describe them. Following this idea, a persona is represented as a triplet of multinomial distributions over the topics, capturing the action verbs, possessives, and modifiers. Each topic is represented as a weighted distribution over the complete vocabulary. Given a set of documents, Personae models learn P persona representations from a group of K topics, where P and K are hyper-parameters. For this work, we applied the Personae model to the co-referenced transcriptions (see Figure 7.3.1). In each session, we extract persona representations for client and therapist transcriptions independently. This enforces the notion of different personae for different roles, and allows our evaluation to better weigh the interaction between therapist and client types. Each participant is assigned the single persona corresponding to the maximum posterior probability.
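The final assignment step, picking the persona with the highest posterior probability, can be sketched as follows; the posterior values are made up for illustration:

```python
def assign_persona(posterior):
    """Return the index of the persona with maximum posterior probability.

    `posterior` is a length-P list of posterior probabilities over the
    learned personae for one character in one session.
    """
    return max(range(len(posterior)), key=lambda p: posterior[p])

# Illustrative posterior over P = 4 personae for one character:
posterior = [0.05, 0.70, 0.15, 0.10]
persona = assign_persona(posterior)  # persona 1
```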
7.4 Evaluation
7.4.1 Linear Mixed Effect Models
We use linear mixed effect models (LMEs) to measure the effect of characters' personae on therapeutic alliance. We fit an LME with alliance as the response variable, and therapist and client personae as fixed-effect variables. Additionally, we control for the therapist and client by including their unique anonymized identifiers as random effects. We experiment both with and without interaction terms. Quality of models is evaluated using the Akaike Information Criterion (AIC), a measure based on information theory that rewards models on their goodness-of-fit while penalizing their complexity. Models are compared against the null model (i.e., therapist and client identifiers only), and against models with varying numbers of topics (K ∈ {10, 20, 30, 50}) and numbers of personae (P ∈ {5, 10, 20, 30}). These comparisons are done using likelihood tests, correcting for multiple tests using the Holm-Bonferroni method. For our experiments, we split our dataset into train (85%, n = 1049) and dev (15%, n = 186) sets. Best values for K and P correspond to the model with the lowest value of AIC. However, this pair need not be unique, as it is possible for models with different K and P to perform just as well as the model with the best pair. To identify the minimum set of parameters that gives rise to the best models, we follow the steps suggested by Burnham and Anderson (2004).
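The likelihood-ratio comparisons with Holm-Bonferroni correction can be sketched as follows. The log-likelihood values in the usage note are illustrative, and the χ² survival function is written in closed form for 2 degrees of freedom only (the general case needs an incomplete gamma function):

```python
import math

def lr_pvalue_df2(ll_null, ll_full):
    """p-value of a likelihood-ratio test with 2 degrees of freedom.

    For df = 2 the chi-square survival function has the closed form
    exp(-x / 2), so no special functions are needed.
    """
    lam = 2.0 * (ll_full - ll_null)   # LR test statistic
    return math.exp(-lam / 2.0)

def holm_bonferroni(pvalues, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Sort p-values ascending; the i-th smallest is compared against
    alpha / (m - i), stopping at the first non-rejection.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvalues[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # all larger p-values are also not rejected
    return reject
```

For example, `lr_pvalue_df2(-550.0, -548.0)` computes the p-value for a likelihood gain of 2 log-units, and `holm_bonferroni` then corrects the family of such comparisons across the (K, P) grid.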
7.4.2 Regression Experiments
Additionally, we train machine learning models to capture the relation between alliance and the assigned personae. As mentioned before, each persona consists of a triplet of multinomial distributions over action verbs, possessives, and modifiers. We concatenate these distributions into a single vector to obtain a vectorial representation of the personae. We train a support vector regressor (SVR) with a linear kernel on the vector representations of the personae to predict alliance. We chose the linear kernel since it performed better in our preliminary experiments, and it allows us to identify the components that are most important for predicting alliance. Model performance is estimated using mean squared error (MSE) with cross-validation in a leave-one-therapist-out fashion. We compare the performance of our model to SVRs trained using uni- and bi-gram language models from either participant's speech, as well as trained on their joint text. These baselines were selected due to their success in related tasks (Gibson et al., 2017).
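The leave-one-therapist-out split can be sketched with a small helper; in the actual experiments an SVR is fit inside each fold, whereas here only the fold generation is shown:

```python
def leave_one_therapist_out(therapist_ids):
    """Yield (train_indices, test_indices) pairs, one fold per therapist.

    `therapist_ids[i]` is the therapist of session i; each fold holds out
    every session of exactly one therapist, so performance estimates are
    never inflated by seeing a test therapist during training.
    """
    for held_out in sorted(set(therapist_ids)):
        test = [i for i, t in enumerate(therapist_ids) if t == held_out]
        train = [i for i, t in enumerate(therapist_ids) if t != held_out]
        yield train, test

# Sessions by therapists A, A, B, C give three folds:
folds = list(leave_one_therapist_out(["A", "A", "B", "C"]))
```

This is equivalent to scikit-learn's `LeaveOneGroupOut` with therapist identifiers as the group labels.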
7.5 Results
Model Selection Our results show that regardless of the choice of K and P, models with only one of the therapist or client personae do not perform significantly better than the null model (AIC(null) = 1091.75, AIC(therapist) = 1088.75, AIC(client) = 1095.90; χ² tests, all p > 0.05). In contrast, including both therapist and client personae significantly increases the explanatory power of the models (AIC_min = 721.32; χ² tests, p < 0.05). Furthermore, the interaction between therapist personae and client personae significantly improved the descriptive power of these models (AIC_min(no-interaction) = 1137.18, AIC_min(interactions) = 721.32, χ²(284) = 983.86, p < 0.001). These results are in line with previous works suggesting that alliance is the product of the dyadic interaction between client and therapist (Baldwin et al., 2007). We compare each of the four character combinations produced by our method. We found that models trained with the therapist character from the client's text plus the client character from the therapist's text achieved the best result of any possible combination of characters (AIC(t&c) = 721.32, AIC_min(others) = 781.08, χ²(27) = 43.194, p < 0.05). This character combination seems to capture two important aspects of alliance: the shared relationship bond, by considering characters across roles, and a therapist who narrates client-focused stories instead of just recounting their own life experiences. The best model achieved AIC_min = 721.32 with K = 30 topics, P = 30 personae, and character cross-interaction (client from therapist speech, and vice versa). No other model is within 10 AIC units of this result. Thus, our model selection procedure yields a single best model. This model is significantly better than the models with other choices of K and P (χ²(2) = 119.09, p < 0.001).
Regression Results The performance of the regression models is presented in Table 7.2. Consistent with the results of the LME models, using only therapist or client text performs poorly compared to using both therapist and client information. The poor performance of the linguistic models supports the finding that an individual's language use alone does not capture alliance information (Goldberg et al., 2020). Including persona information
                                  Mean Squared Error
Therapist-only (unigram)           9.27 ± 7.38
Therapist-only (bigram)           13.09 ± 7.77
Client-only (unigram)             15.40 ± 10.67
Client-only (bigram)              20.69 ± 11.13
Therapist + Client (unigram)       3.04 ± 1.62
Therapist + Client (bigram)        4.32 ± 2.31
Personae (K = 30, P = 30)          0.69 ± 0.51
  - Client Only                    0.69 ± 0.50
  - Therapist Only                 0.70 ± 0.54

Table 7.2: Cross-validation estimates of the mean (μ) and standard deviation (σ) of MSE for the regression models (lower is better). The Persona model performs significantly better than the baselines.
about either the therapist or the client significantly improved the performance over the separate therapist or client linguistic models (t-test, t(60) = 3.94, p < 0.001 and t-test, t(60) = 9.13, p < 0.001, respectively). Furthermore, the personae model also performs better than the supervised model that considers both therapist and client text (t-test, t(60) = 7.01, p < 0.01). However, no statistical differences (p > 0.05) were found between the model with interactions and the models with either the therapist's or the client's personae. The poor performance of the linguistic models suggests that the regression weights are over-fitting to specific instances of word usage. This further suggests that the personae model is successfully capturing higher-order interactions that are not represented in the vocabulary of the participants.
7.6 Personae Analysis
To understand what the Persona model captures from the data, we inspect the distributions of personae and topics. We focus on those clients and therapists that appear more than once to increase the reliability of the analysis. Thus the subset of our data for this analysis contains n = 802 sessions, with 31 unique therapists and 204 unique clients. The number of sessions per therapist ranges from 2 to 105 (μ = 25.87, σ = 26.47); for clients this ranged between 2 and 12 (μ = 3.93, σ = 2.19).
7.6.1 Persona distribution
Table 7.3 shows distribution statistics for the discovered personae. Interestingly, all 31 therapists changed their persona at least once. Thus, our models are not just creating one persona for each therapist across all sessions, but instead grouping therapists into certain personae depending on the characteristics of a session. Moreover, this suggests that therapists are changing their personae in accordance with each client. In contrast, clients portrayed fewer personae on average. For most of the clients, therapists see between 2 and 3 personae
                      μ      σ    Mode (n)
Therapist          10.25   6.61    29 (77)
Client              3.37   1.68    16 (50)
Client + Therapist  1.87   1.23    16 and 2 (8)

Table 7.3: Mean (μ), standard deviation (σ), and mode for the persona distribution per participant and the joint distribution.
(61.76%). This suggests that clients' personae are dynamic and change through the therapeutic process. Investigating whether this change is meaningful and what it is capturing will be explored in future work. With respect to the topics, clients describe the therapist's mode (i.e., most frequent persona) with topics related to arguments, love, and court (as in court of law, but also "the ball is in your court"); therapists describe clients with the most popular persona using topics of wellness, issues and expectations, cognitive skills (such as think and feel), and religion-related terms (e.g., god, religious). With respect to their interactions, we observe 428 personae pairs out of the 900 possible ones. There was a significant difference in alliance between personae pairs (Kruskal-Wallis, H = 477.59, p < 0.05). Once again, this result supports our claim that interactions between certain types of characters achieve higher levels of alliance than others.
7.6.2 Topic contribution
We inspect the sign and magnitude of the linear SVR coefficients as a measure of topic importance. The most important indicators of high alliance are clients that solve schedule conflicts (e.g., "should we schedule for next week?"), therapists that show empathy (e.g., "[no need] to be blaming yourself either, you worked hard on that paper", "I hope I'm not communicating judgment"), especially when clients are going through worrying or stressful moments (e.g., "it triggered a panic attack", "[I] worry too much about what I need to do tomorrow"). The top indicators of low alliance are uncommitted or distracted therapists (e.g., "let me check on that", "let me know next week"), therapists not completely understanding what the client is going through (e.g., "What's it like to just sit with this anxiety?", "I wonder what could have been, and for me"), or therapists invalidating a client's feelings (e.g., "I'd imagine that I would have made the decision to stay", "yeah yeah like just because you're feeling anxious doesn't mean you're going to [...]").
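Ranking components by the sign and magnitude of a linear model's weights can be sketched as below; the feature names and weights are illustrative, not the fitted values from this study:

```python
def top_indicators(weights, names, k=2):
    """Return the k most positively and k most negatively weighted features.

    Positive weights indicate features associated with higher alliance,
    negative weights with lower alliance.
    """
    ranked = sorted(zip(weights, names), reverse=True)  # descending by weight
    high = [name for w, name in ranked[:k]]
    low = [name for w, name in ranked[-k:]]
    return high, low

# Illustrative coefficients from a linear regressor:
weights = [0.8, -0.6, 0.3, -0.9]
names = ["empathy", "distraction", "scheduling", "invalidation"]
high, low = top_indicators(weights, names, k=1)  # (["empathy"], ["invalidation"])
```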
7.7 Conclusion
Our claim that alliance is captured by the interaction between the discovered character types is supported by three results: (1) LME models achieve significantly better results with interaction terms; (2) there is a significant difference between alliance ratings for certain pairs of therapists and clients; and (3) machine learning models with persona representations predict alliance significantly better than supervised linguistic baselines. There are two main limitations to this work: first, we did not make any effort to recover from downstream errors of the role-matching and ASR systems, which might induce errors in the words selected for the topics; and second, our assumption about how participants refer to each character does not consider the case of the impersonal you, which might induce errors in the topics selected for the personae. Both of these limitations will be addressed in future work.
Chapter 8
Conclusion and Future Work
When people think back on their favorite stories, they usually recall them by the cast of characters that populate them (Hoffner & Cantor, 1991). Tales like Cinderella, the Arabian Nights, or the Nordic Sagas are best remembered as the stories of the eponymous maiden who lost her crystal slipper; Shahrazad, the brilliant woman who outsmarts a bloodthirsty king; and Thor Odinson, the god of thunder who must defend Earth from a gargantuan poison-dribbling snake. Accordingly, characters play a central role in explaining how audiences experience narratives (Bal et al., 2011). Yet, in contrast to our ample knowledge of how science and logical reasoning proceed, we know little, in any formal sense, about how audiences perceive a character, and the influence it might have in shaping that audience's experience of a narrative (Bruner, 1986).
This dissertation aims to narrow this gap in our knowledge. Specifically, we propose that to make better predictions about the audience's narrative experience, computational models need to incorporate representations of the characters. Throughout this work we present computational models that learn to predict an audience's response to the characters' portrayals in a story. In the following, I will briefly summarize the dissertation contributions and discuss how these can be further explored in future work.
8.1 Characters' Representations and Audience's Experience
Part I: From their dialogues In the first part of this work, I present models for identifying risk-behavior movie content along the dimensions of violence, sexual content, and substance abuse. These models characterize movie content based on representations learned from what the characters said. To the best of our knowledge, this approach is the first to study the relation between characters' language use and portrayals of risk behaviors. Moreover, our approach can be used even before production begins, which might provide filmmakers and producers with a reliable estimation of how their content might be perceived by an audience.
Part II: From their actions In the second part of this work, I showcase models to support a large-scale analysis of characters' actions. These models identify differences in action portrayals based on the gender of the characters engaging in the action. Such differences might be responsible for shaping stereotyped attitudes in society. Thus, our model aims to provide filmmakers and producers with a way to check for such stereotypes, even before production begins.
Part III: From their roles In the last part of this work, I present models that predict therapeutic alliance as a response to the characters' traits in a shared narrative. These models identify characters' portrayals from the role they play within the client's personal stories and the therapist's recounts of these stories. To the best of our knowledge, this is the first work to connect the narrative processes underlying psychotherapy to alliance, which measures the impact (success) of an intervention.
8.2 Future Work
This work is part of an overarching initiative to go beyond simple frequency statistics and assess the quality of character portrayals in TV and film. As such, I am interested in the possibility of a holistic, multi-view understanding of characters' portrayals. A simple starting point would be to combine the textual character representations with representations learned from visual (e.g., appearance, clothing, setting) and audio (e.g., accent, pace, intonation) modalities. These approaches could provide a way for models to grasp a fuller picture of how the characters might be perceived, by allowing the model to pick up on the interplay between actions, speech, appearance, and roles.
Finally, an exciting prospect is to go beyond gender biases and investigate intersectionality, for example, whether persons of color of a particular gender portray certain actions more frequently than others. Even though our current framework provides an easy way to incorporate these variables, there are still several limitations to overcome to automate the labeling of these constructs at scale (e.g., defining an appropriate ontology). Following this idea, one could take these works as a basis to construct models which can predict higher-order social constructs for the characters. For example, one possibility is to extend the characters' action representations into the risk-behavior framework. In this scenario, characters' representations would reflect a character's participation in violent, sexual, or substance-abusive acts. By incorporating demographic information, the spatial distribution of these representations could reflect the stereotypes perpetuated by the film industry and their possible evolution through time.
References
21st Century Fox, The Geena Davis Institute on Gender in Media, and J. Walter Thompson Intelligence. (2018). The Scully Effect: I Want to Believe... In STEM (Tech. Rep.). Los Angeles, CA: Geena Davis Institute on Gender in Media. Online [Accessed: Feb 1st, 2021].
Adhikari, A., Ram, A., Tang, R., & Lin, J. (2019). DocBERT: BERT for document classification. CoRR, abs/1904.08398.
Agarwal, A., Zheng, J., Kamath, S., Balasubramanian, S., & Ann Dey, S. (2015, May-June). Key female characters in film have more to talk about besides men: Automating the Bechdel test. In Proceedings of the 2015 conference of the North American chapter of the Association for Computational Linguistics: Human language technologies (pp. 830-840). Denver, Colorado: Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/N15-1084 doi: 10.3115/v1/N15-1084
Ali, A., & Senan, N. (2018). Violence Video Classification Performance Using Deep Neural Networks. In International Conference on Soft Computing and Data Mining (pp. 225-233).
Alia-Klein, N., Wang, G.-J., Preston-Campbell, R. N., Moeller, S. J., Parvaz, M. A., Zhu, W., ... Volkow, N. D. (2014, 09). Reactions to Media Violence: It's in the Brain of the Beholder. PLOS ONE, 9(9). doi: 10.1371/journal.pone.0107260
American Academy of Pediatrics. (2001). Media violence. Pediatrics, 108(5), 1222-1226.
American Psychiatric Association. (2019). What is Psychotherapy? https://www.psychiatry.org/patients-families/psychotherapy.
Anderson, C. A., & Bushman, B. J. (2001). Effects of violent video games on aggressive behavior, aggressive cognition, aggressive affect, physiological arousal, and prosocial behavior: A meta-analytic review of the scientific literature. Psychological Science, 12(5).
Anderson, C. A., & Dill, K. E. (2000). Video games and aggressive thoughts, feelings, and behavior in the laboratory and in life. Journal of Personality and Social Psychology, 78(4), 772.
Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. CoRR, abs/1409.0473. Retrieved from http://arxiv.org/abs/1409.0473
Bal, P. M., Butterman, O. S., & Bakker, A. B. (2011). The influence of fictional narrative experience on work outcomes: A conceptual analysis and research model. Review of General Psychology, 15(4), 361-370.
Baldick, C. (1996). The Concise Oxford Dictionary of Literary Terms. Oxford University Press.
Baldwin, S. A., Wampold, B. E., & Imel, Z. E. (2007). Untangling the alliance-outcome correlation: Exploring the relative importance of therapist and patient variability in the alliance. Journal of Consulting and Clinical Psychology, 75(6).
Bamman, D., O'Connor, B., & Smith, N. A. (2013, August). Learning latent personas of film characters. In Proceedings of the 51st annual meeting of the Association for Computational Linguistics (volume 1: Long papers). Sofia, Bulgaria: Association for Computational Linguistics.
Bamman, D., Popat, S., & Shen, S. (2019, June). An annotated dataset of literary entities. In Proceedings of the 2019 conference of the North American chapter of the Association for Computational Linguistics: Human language technologies, volume 1 (long and short papers) (pp. 2138-2144). Minneapolis, Minnesota: Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/N19-1220 doi: 10.18653/v1/N19-1220
Banchs, R. E. (2012). Movie-DiC: A Movie Dialogue Corpus for Research and Development. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, ACL 2012 (pp. 203-207).
Barnes, C. (2012). Understanding the social model of disability. Routledge Handbook of Disability Studies, 12-29.
Barnes, J., Klinger, R., & Schulte im Walde, S. (2017). Assessing state-of-the-art sentiment models on state-of-the-art sentiment datasets. In Proceedings of the 8th workshop on computational approaches to subjectivity, sentiment and social media analysis.
Barranco, R. E., Rader, N. E., & Smith, A. (2017). Violence at the Box Office. Communication Research, 44(1).
Barthes, R., & Duisit, L. (1975). An introduction to the structural analysis of narrative. New Literary History, 6(2), 237-272.
Basil, M. D. (1996). Identification as a mediator of celebrity effects. Journal of Broadcasting & Electronic Media, 40(4), 478-495.
Bechdel, A. (2008). The essential dykes to watch out for. Houghton Mifflin Harcourt.
Bell-Jordan, K. E. (2008). Black.White. and a Survivor of The Real World: Constructions of Race on Reality TV. Critical Studies in Media Communication, 25(4).
Black, M. P., Katsamanis, A., Baucom, B. R., Lee, C.-C., Lammert, A. C., Christensen, A., ... Narayanan, S. S. (2013). Toward automating a human behavioral coding system for married couples' interactions using speech acoustic features. Speech Communication, 55(1), 1-21.
Bleakley, A., Ellithorpe, M. E., Hennessy, M., Khurana, A., Jamieson, P., & Weitz, I. (2017). Alcohol, sex, and screens: Modeling media influence on adolescent alcohol and sex co-occurrence. The Journal of Sex Research, 54(8), 1026-1037.
Bleakley, A., Jamieson, P. E., & Romer, D. (2012). Trends of sexual and violent content by gender in top-grossing US films, 1950-2006. Journal of Adolescent Health, 51(1).
Bleakley, A., Romer, D., & Jamieson, P. E. (2014). Violent film characters' portrayal of alcohol, sex, and tobacco-related behaviors. Pediatrics, 133(1).
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan).
Bojanowski, P., Bach, F., Laptev, I., Ponce, J., Schmid, C., & Sivic, J. (2013). Finding actors and actions in movies. In Proceedings of the IEEE international conference on computer vision (pp. 2280-2287).
Bower, B. (1997). Alcoholics synonymous: Heavy drinkers of all stripes may get comparable help from a variety of therapies. Science News, 151(4).
Brener, N. D., & Collins, J. L. (1998). Co-occurrence of health-risk behaviors among adolescents in the United States. Journal of Adolescent Health, 22(3), 209-213.
Breslow, N. E., & Clayton, D. G. (1993). Approximate inference in generalized linear mixed models. Journal of the American Statistical Association, 88(421), 9-25.
Brezeale, D., & Cook, D. J. (2008). Automatic video classification: A survey of the literature. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 38(3).
Brown, J. D., L'Engle, K. L., Pardun, C. J., Guo, G., Kenneavy, K., & Jackson, C. (2006). Sexy media matter: exposure to sexual content in music, movies, television, and magazines predicts black and white adolescents' sexual behavior. Pediatrics, 117(4).
Browne Graves, S. (1999). Television and prejudice reduction: When does television as a vicarious
experience make a dierence? Journal of Social Issues, 55(4), 707{727.
Bruner, J. (1986). Actual minds, possible worlds. Harvard University Press.
Burnap, P., & Williams, M. L. (2015). Cyber hate speech on Twitter: An application of machine classification and statistical modeling for policy and decision making. Policy & Internet, 7(2), 223–242.
Burnham, K. P., & Anderson, D. R. (2004). Multimodel inference: Understanding AIC and BIC in model selection. Sociological Methods & Research, 33(2).
Bushman, B. J., & Huesmann, L. R. (2001). Effects of televised violence on aggression. Handbook of children and the media, 223–254.
Callister, M., Stern, L. A., Coyne, S. M., Robinson, T., & Bennion, E. (2011). Evaluation of sexual content in teen-centered films from 1980 to 2007. Mass Communication and Society, 14(4).
Can, D., Marin, R., Georgiou, P., Imel, Z. E., Atkins, D., & Narayanan, S. S. (2015). "It sounds like...": A natural language processing approach to detecting counselor reflections in motivational interviewing. Journal of Counseling Psychology. doi: 10.1037/cou0000111
Cape, G. S. (2003). Addiction, stigma and movies. Acta Psychiatrica Scandinavica, 107(3), 163–169.
Carreras, X., & Màrquez, L. (2005). Introduction to the CoNLL-2005 shared task: Semantic role labeling. In Proceedings of the ninth conference on computational natural language learning (CoNLL-2005) (pp. 152–164).
Cascante-Bonilla, P., Sitaraman, K., Luo, M., & Ordonez, V. (2019). Moviescope: Large-scale
analysis of movies using multiple modalities. CoRR, abs/1908.03180.
Chen, L.-H., Hsu, H.-W., Wang, L.-Y., & Su, C.-W. (2011). Violence detection in movies. In 2011 eighth international conference on computer graphics, imaging and visualization.
Cheryan, S., Drury, B. J., & Vichayapai, M. (2013). Enduring influence of stereotypical computer science role models on women's academic aspirations. Psychology of Women Quarterly, 37(1), 72–79.
Cho, K., van Merrienboer, B., Bahdanau, D., & Bengio, Y. (2014). On the Properties of Neural
Machine Translation: Encoder-Decoder Approaches. CoRR, abs/1409.1259. Retrieved from
http://arxiv.org/abs/1409.1259
Chollet, F. (2015). Keras. https://keras.io.
Chung, J., Gülçehre, Ç., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555.
Clifford, J. E., Jensen III, C. J., & Petee, T. A. (2009). Does gender make a difference? Women, Violence, and the Media: Readings in Feminist Criminology, 124.
Cohen, J. (2001). Defining identification: A theoretical look at the identification of audiences with media characters. Mass Communication & Society, 4(3), 245–264.
Coleman, D. (2006). Therapist–client five-factor personality similarity: A brief report. Bulletin of the Menninger Clinic, 70(3).
Collins, R. L. (2011). Content analysis of gender roles in media: Where are we now and where should we go? Sex Roles, 64(3-4), 290–298.
Cornish, A., & Block, M. (2012). NC-17 Rating Can Be A Death Sentence For Movies. [Radio broadcast episode] https://www.npr.org/2012/08/21/159586654/nc-17-rating-can-be-a-death-sentence-for-movies.
Coxe, S., West, S. G., & Aiken, L. S. (2009). The analysis of count data: A gentle introduction to Poisson regression and its alternatives. Journal of Personality Assessment, 91(2), 121–136.
Dai, Q., Zhao, R., Wu, Z., Wang, X., Gu, Z., Wu, W., & Jiang, Y. (2015). Fudan-Huawei at MediaEval 2015: Detecting Violent Scenes and Affective Impact in Movies with Deep Learning. In Working Notes Proceedings of the MediaEval 2015 Workshop.
Dal Cin, S., Worth, K. A., Dalton, M. A., & Sargent, J. D. (2008). Youth exposure to alcohol use and brand appearances in popular contemporary movies. Addiction, 103(12).
Davidson, T., Warmsley, D., Macy, M. W., & Weber, I. (2017). Automated Hate Speech Detection and the Problem of Offensive Language. In Proceedings of the Eleventh International Conference on Web and Social Media, ICWSM 2017 (pp. 512–515).
Davies, P. G., Spencer, S. J., & Steele, C. M. (2005). Clearing the air: Identity safety moderates the effects of stereotype threat on women's leadership aspirations. Journal of Personality and Social Psychology, 88(2), 276.
Daza, A., & Frank, A. (2018, July). A sequence-to-sequence model for semantic role labeling. In Proceedings of the third workshop on representation learning for NLP (pp. 207–216). Melbourne, Australia: Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/W18-3027 doi: 10.18653/v1/W18-3027
De Ceunynck, T., De Smedt, J., Daniels, S., Wouters, R., & Baets, M. (2015). "Crashing the gates": Selection criteria for television news reporting of traffic crashes. Accident Analysis & Prevention, 80, 142–152.
Demarty, C.-H., Penet, C., Ionescu, B., Gravier, G., & Soleymani, M. (2014). Multimodal violence detection in Hollywood movies: State-of-the-art and Benchmarking. In Fusion in Computer Vision (pp. 185–208). Springer.
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019, June). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: Human language technologies, volume 1 (long and short papers) (pp. 4171–4186). Minneapolis, Minnesota: Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/N19-1423 doi: 10.18653/v1/N19-1423
Diers, D. (2004). Speaking of nursing: Narratives of practice, research, policy, and the profession. Jones & Bartlett Learning.
Dixon, L., Li, J., Sorensen, J., Thain, N., & Vasserman, L. (2017). Measuring and Mitigating Unintended Bias in Text Classification. In Proceedings of AAAI/ACM conference on artificial intelligence, ethics and society.
Dixon, W. W. (2012). Straight: Constructions of heterosexuality in the cinema. SUNY Press.
Djuric, N., Zhou, J., Morris, R., Grbovic, M., Radosavljevic, V., & Bhamidipati, N. (2015). Hate Speech Detection with Comment Embeddings. In Proceedings of the 24th International Conference on World Wide Web (pp. 29–30).
Duchenne, O., Laptev, I., Sivic, J., Bach, F., & Ponce, J. (2009). Automatic annotation of human actions in video. In 2009 IEEE 12th international conference on computer vision (pp. 1491–1498).
Ellis, K. (2003). Reinforcing the stigma: The representation of disability in Gattaca. Australian Screen Education Online(31), 111–114.
England, D. E., Descartes, L., & Collier-Meek, M. A. (2011). Gender role portrayal and the Disney princesses. Sex Roles, 64(7), 555–567.
Enns, A. (2000). The spectacle of disabled masculinity in John Woo's "heroic bloodshed" films. Quarterly Review of Film & Video, 17(2), 137–145.
Entman, R. M. (1989). How the media affect what people think: An information processing approach. The Journal of Politics, 51(2), 347–370.
Erikson, E. H. (1968). Identity: Youth and crisis (No. 7). WW Norton & company.
Etchison, M., & Kleist, D. M. (2000). Review of narrative therapy: Research and utility. The Family Journal, 8(1).
Eva-Maria Jacobsson. (1999). A Female Gaze? (Online [Accessed Feb 19th, 2021])
Eyal, K., & Rubin, A. M. (2003). Viewer aggression and homophily, identification, and parasocial relationships with television characters. Journal of Broadcasting & Electronic Media, 47(1).
Fast, E., Chen, B., & Bernstein, M. S. (2016). Empath: Understanding topic signals in large-scale text. In Proceedings of the Conference on Human Factors in Computing Systems, CHI 2016 (pp. 4647–4657).
Fast, E., Vachovsky, T., & Bernstein, M. (2016). Shirtless and dangerous: Quantifying linguistic signals of gender bias in an online fiction writing community. In Proceedings of the international AAAI conference on web and social media (Vol. 10).
Flemotomos, N., Martinez, V., Gibson, J., Atkins, D., Creed, T., & Narayanan, S. (2018). Language features for automated evaluation of cognitive behavior psychotherapy sessions. Proc. Interspeech 2018.
Flemotomos, N., Martinez, V. R., Chen, Z., Singla, K., Ardulov, V., Peri, R., . . . Narayanan, S.
(2021). \Am I A Good Therapist?" Automated Evaluation Of Psychotherapy Skills Using
Speech And Language Technologies.
Flückiger, C., Del Re, A., Wampold, B. E., & Horvath, A. O. (2018). The alliance in adult psychotherapy: A meta-analytic synthesis. Psychotherapy.
Founta, A., Chatzakou, D., Kourtellis, N., Blackburn, J., Vakali, A., & Leontiadis, I. (2018). A Unified Deep Learning Architecture for Abuse Detection. CoRR, abs/1802.00385. Retrieved from http://arxiv.org/abs/1802.00385
Fouts, G., & Burggraf, K. (1999). Television situation comedies: Female body images and verbal reinforcements. Sex Roles, 40(5-6), 473–481.
Friedkin, W. (1973). The Exorcist. Warner Bros. Pictures.
Gálvez, R. H., Tiffenberg, V., & Altszyler, E. (2019). Half a century of stereotyping associations between gender and intellectual ability in films. Sex Roles, 81(9-10), 643–654.
Gardner, M., Grus, J., Neumann, M., Tafjord, O., Dasigi, P., Liu, N. F., . . . Zettlemoyer, L. S. (2017). AllenNLP: A deep semantic natural language processing platform.
Gauntlett, D. (2008). Media, gender and identity: An introduction. Routledge.
Geena Davis Institute on Gender in Media. (2019a). The Geena Benchmark Report 2007–2017 (Report). Los Angeles, CA: Geena Davis Institute on Gender in Media. (Online [Accessed Jan 21st, 2021])
Geena Davis Institute on Gender in Media. (2019b). What 2.7M YouTube ads reveal about gender bias in marketing. https://www.thinkwithgoogle.com/future-of-marketing/management-and-culture/diversity-and-inclusion/gender-representation-media-bias/. Google.
Gerbner, G., Gross, L., Morgan, M., & Signorielli, N. (1980). The "mainstreaming" of America: Violence profile number 11. Journal of Communication, 30(3), 10–29.
Gerbner, G., Gross, L., Morgan, M., & Signorielli, N. (1994). Growing up with television: The cultivation perspective. In J. Bryant & D. Zillmann (Eds.), Media effects: Advances in theory and research (pp. 17–41). New Jersey, NJ: Lawrence Erlbaum Associates, Inc.
Gerbner, G., Gross, L., Morgan, M., Signorielli, N., & Shanahan, J. (2002). Growing up with television: Cultivation processes. In J. Bryant & D. Zillmann (Eds.), Media effects: Advances in theory and research (2nd ed.) (Vol. 2, pp. 43–67). New Jersey, NJ: Lawrence Erlbaum Associates Publishers.
Giannakopoulos, T., Kosmopoulos, D., Aristidou, A., & Theodoridis, S. (2006). Violence content classification using audio features. In Advances in artificial intelligence. Springer Berlin Heidelberg.
Giannakopoulos, T., Pikrakis, A., & Theodoridis, S. (2007). A multi-class audio classification method with respect to violent content in movies using Bayesian networks. In 2007 IEEE 9th workshop on multimedia signal processing (pp. 90–93).
Giannetti, L. D., & Leach, J. (1999). Understanding movies (Vol. 1) (No. 1). Prentice Hall, Englewood Cliffs, NJ.
Gibson, J., Can, D., Georgiou, P., Atkins, D., & Narayanan, S. (2017, August). Attention networks
for modeling behavior in addiction counseling. In Proceedings of interspeech.
Gilbert, C. H. E. (2014). VADER: A parsimonious rule-based model for sentiment analysis of social
media text. In Eighth International Conference on Weblogs and Social Media ICWSM 2014.
Gildea, D., & Jurafsky, D. (2002). Automatic labeling of semantic roles. Computational Linguistics, 28(3), 245–288.
Goldberg, S. B., Flemotomos, N., Martinez, V. R., Tanana, M., Kuo, P., Pace, B. T., . . . Atkins, D. C. (2020). Machine learning and natural language processing in psychotherapy research: Alliance as example use case. Journal of Counseling Psychology, 67(4), 438–448.
Golem, V., Karan, M., & Šnajder, J. (2018). Combining Shallow and Deep Learning for Aggressive Text Detection. In Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018) (pp. 188–198). Association for Computational Linguistics.
Gorinski, P. J., & Lapata, M. (2015, May–June). Movie script summarization as graph-based scene extraction. In Proceedings of the 2015 conference of the North American chapter of the association for computational linguistics: Human language technologies (pp. 1066–1076). Denver, Colorado: Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/N15-1113 doi: 10.3115/v1/N15-1113
Gorinski, P. J., & Lapata, M. (2018). What's this movie about? A joint neural network architecture
for movie content analysis. In Proceedings of the 2018 conference of the north american chapter
of the association for computational linguistics: Human language technologies.
Grigoryeva, A. (2014). When gender trumps everything: the division of parent care among siblings.
Princeton, NJ: Center for the Study of Social Organization.
Guha, T., Huang, C.-W., Kumar, N., Zhu, Y., & Narayanan, S. S. (2015). Gender representation in cinematic content: A multimodal approach. In Proceedings of the 2015 ACM on international conference on multimodal interaction (pp. 31–34).
Hall, J. M., & Powell, J. (2011). Understanding the person through narrative. Nursing research
and practice, 2011.
Hatcher, R. L., & Gillaspy, J. A. (2006). Development and validation of a revised short version of
the working alliance inventory. Psychotherapy Research, 16(1).
He, L., Lee, K., Lewis, M., & Zettlemoyer, L. (2017, July). Deep semantic role labeling: What works and what's next. In Proceedings of the 55th annual meeting of the association for computational linguistics (volume 1: Long papers) (pp. 473–483). Vancouver, Canada: Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/P17-1044 doi: 10.18653/v1/P17-1044
Hebbar, R., Somandepalli, K., & Narayanan, S. (2019). Robust speech activity detection in movie audio: Data resources and experimental evaluation. In IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 4105–4109).
Hebbar, R., Somandepalli, K., & Narayanan, S. S. (2018). Improving gender identification in movie audio using cross-domain data. In Interspeech (pp. 282–286).
Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735–1780.
Hoffner, C., & Cantor, J. (1991). Perceiving and responding to mass media characters. In J. Bryant & D. Zillmann (Eds.), Responding to the screen: Reception and reaction processes (chap. 4). Routledge.
Holtzman, L., & Sharpe, L. (2014). Media messages: What film, television, and popular music teach us about race, class, gender, and sexual orientation. Routledge.
Honnibal, M., Montani, I., Van Landeghem, S., & Boyd, A. (2020). spaCy: Industrial-strength
Natural Language Processing in Python. Zenodo. Retrieved from https://spacy.io/ doi:
10.5281/zenodo.1212303
Horvath, A. O., & Greenberg, L. S. (1989). Development and validation of the working alliance
inventory. Journal of counseling psychology, 36(2).
Howes, C., Purver, M., & McCabe, R. (2013). Investigating topic modelling for therapy dialogue analysis. In Proceedings of the IWCS 2013 workshop on computational semantics in clinical text (CSCT 2013). Association for Computational Linguistics.
Howes, C., Purver, M., & McCabe, R. (2014). Linguistic indicators of severity and progress in
online text-based therapy for depression. In Proceedings of the workshop on computational
linguistics and clinical psychology: From linguistic signal to clinical reality.
Huang, Q., Xiong, Y., Rao, A., Wang, J., & Lin, D. (2020). MovieNet: A holistic dataset for movie understanding. In Proceedings of the European conference on computer vision (ECCV).
Huesmann, L. R., Lagerspetz, K., & Eron, L. D. (1984). Intervening variables in the TV violence–aggression relation: Evidence from two countries. Developmental Psychology, 20(5), 746.
Huesmann, L. R., & Malamuth, N. M. (1986). Media violence and antisocial behavior: An overview. Journal of Social Issues, 42(3), 1–6.
Hunt, D., & Ramón, A.-C. (2020). Hollywood Diversity Report: A Tale of Two Hollywoods (Tech. Rep.). Los Angeles, CA: University of California, Los Angeles. Online [Accessed: Feb 2nd, 2021].
Idrees, H., Zamir, A. R., Jiang, Y.-G., Gorban, A., Laptev, I., Sukthankar, R., & Shah, M. (2017). The THUMOS challenge on action recognition for videos "in the wild". Computer Vision and Image Understanding, 155, 1–23.
Igartua, J.-J. (2010). Identification with characters and narrative persuasion through fictional feature films. Communications, 35(4), 347–373.
Imel, Z. E., Steyvers, M., & Atkins, D. C. (2015). Computational psychotherapy research: Scaling up the evaluation of patient–provider interactions. Psychotherapy, 52(1), 19.
Jerald, M. C., Ward, L. M., Moss, L., Thomas, K., & Fletcher, K. D. (2017). Subordinates, sex objects, or sapphires? Investigating contributions of media use to Black students' femininity ideologies and stereotypes about Black women. Journal of Black Psychology, 43(6), 608–635.
Jung, C. G. (1943). Two essays on analytical psychology. Routledge.
Kagan, D., Chesney, T., & Fire, M. (2020). Using data science to understand the film industry's gender gap. Palgrave Communications, 6(1), 1–16.
Kim, H., Katerenchuk, D., Billet, D., Park, H., & Li, B. (2019). Learning Joint Gaussian Representations for Movies, Actors, and Literary Characters. In Proceedings of the thirty-third AAAI conference on artificial intelligence. AAAI.
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization (Vol. abs/1412.6980).
Retrieved from https://arxiv.org/abs/1412.6980
Koenig, A. M. (2018). Comparing prescriptive and descriptive gender stereotypes about children,
adults, and the elderly. Frontiers in Psychology, 9, 1086. doi: 10.3389/fpsyg.2018.01086
Kubrick, S. (1972). A Clockwork Orange. Warner Bros. Pictures.
Kukleva, A., Tapaswi, M., & Laptev, I. (2020). Learning interactions and relationships between movie characters. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9849–9858).
Kumar, N., Nasir, M., Georgiou, P. G., & Narayanan, S. S. (2016). Robust multichannel gender classification from speech in movie audio. In Interspeech (pp. 2233–2237).
Labov, W. (2013). The language of life and death: The transformation of experience in oral
narrative. Cambridge University Press.
Lambert, M. J., & Barley, D. E. (2001). Research summary on the therapeutic relationship and
psychotherapy outcome. Psychotherapy: Theory, research, practice, training, 38(4).
Laptev, I., Marszalek, M., Schmid, C., & Rozenfeld, B. (2008). Learning realistic human actions from movies. In 2008 IEEE conference on computer vision and pattern recognition (pp. 1–8).
Le, Q., & Mikolov, T. (2014). Distributed representations of sentences and documents. In International Conference on Machine Learning.
Lewin, B. (2012). The Sessions. Fox Searchlight Pictures.
Li, Z., He, S., Cai, J., Zhang, Z., Zhao, H., Liu, G., . . . Si, L. (2018, October–November). A unified syntax-aware framework for semantic role labeling. In Proceedings of the 2018 conference on empirical methods in natural language processing (pp. 2401–2411). Brussels, Belgium: Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/D18-1262 doi: 10.18653/v1/D18-1262
Liu, A., Zhang, Y., Song, Y., Zhang, D., Li, J., & Yang, Z. (2008). Human attention model for semantic scene analysis in movies. In IEEE international conference on multimedia and expo.
Maccoby, E. E., & Wilson, W. C. (1957). Identification and observational learning from films. The Journal of Abnormal and Social Psychology, 55(1), 76.
MacDorman, K. F. (2019). In the uncanny valley, transportation predicts narrative enjoyment
more than empathy, but only for the tragic hero. Computers in Human Behavior, 94.
Manning, C., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S., & McClosky, D. (2014, June). The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd annual meeting of the association for computational linguistics: System demonstrations (pp. 55–60). Baltimore, Maryland: Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/P14-5010 doi: 10.3115/v1/P14-5010
Marcheggiani, D., Frolov, A., & Titov, I. (2017, August). A simple and accurate syntax-agnostic neural model for dependency-based semantic role labeling. In Proceedings of the 21st conference on computational natural language learning (CoNLL 2017) (pp. 411–420). Vancouver, Canada: Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/K17-1041 doi: 10.18653/v1/K17-1041
Markey, P. M., French, J. E., & Markey, C. N. (2015). Violent Movies and Severe Acts of Violence:
Sensationalism Versus Science. Human Communication Research, 41(2).
Marszalek, M., Laptev, I., & Schmid, C. (2009). Actions in context. In 2009 IEEE conference on computer vision and pattern recognition (pp. 2929–2936).
Martinez, V., Somandepalli, K., Singla, K., Ramakrishna, A., Uhls, Y., & Narayanan, S. (2019). Violence rating prediction from movie scripts. In Proceedings of the 33rd AAAI conference on artificial intelligence.
Martinez, V., Somandepalli, K., Tehranian-Uhls, Y., & Narayanan, S. (2020, November). Joint estimation and analysis of risk behavior ratings in movie scripts. In Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP) (pp. 4780–4790). Online: Association for Computational Linguistics.
Martinez, V. R., & Kennedy, J. (2020). A multiparty chat-based dialogue system with concurrent
conversation tracking and memory. In Proceedings of the 2nd conference on conversational
user interfaces. New York, NY, USA: Association for Computing Machinery.
McKinnon, S. (2015). Watching men kissing men: The Australian reception of the gay male kiss on-screen. Journal of the History of Sexuality, 24(2), 262–287.
Mead, G. H. (1934). Mind, self and society (Vol. 111). Chicago: University of Chicago Press.
Mehdad, Y., & Tetreault, J. (2016). Do Characters Abuse More Than Words? In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue (pp. 299–303).
Meyrowitz, J. (1998). Multiple media literacies. Journal of Communication, 48(1), 96–108.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems (pp. 3111–3119).
Miller, W. R., Moyers, T. B., Ernst, D., & Amrhein, P. (2003). Manual for the motivational interviewing skill code (MISC) (Tech. Rep.). Albuquerque, NM: Center on Alcoholism, Substance Abuse and Addictions, University of New Mexico.
Mirończuk, M., & Protasiewicz, J. (2018). A recent overview of the state-of-the-art elements of text classification. Expert Systems with Applications, 106.
Mishler, E. G. (1984). The discourse of medicine: Dialectics of medical interviews (Vol. 3). Greenwood Publishing Group.
Morgan, A. (2000). What is narrative therapy? An easy-to-read introduction (1st ed.). Dulwich Centre Publications.
Morris, C. E., & Sloop, J. M. (2006). "What lips these lips have kissed": Refiguring the politics of queer public kissing. Communication and Critical/Cultural Studies, 3(1), 1–26.
Motion Picture Association of America. (2017). Theme Report: A comprehensive analysis and survey of the theatrical and home entertainment market environment (THEME) for 2017 (Tech. Rep.). Retrieved from https://www.mpaa.org/wp-content/uploads/2018/04/MPAA-THEME-Report-2017%5FFinal.pdf ([Accessed: Jul 25th, 2018])
Moyer-Gusé, E., Chung, A. H., & Jain, P. (2011). Identification with characters and discussion of taboo topics after exposure to an entertainment narrative about sexual health. Journal of Communication, 61(3), 387–406.
Moyer-Gusé, E., Mahood, C., & Brookes, S. (2011). Entertainment-education in the context of humor: Effects on safer sex intentions and risk perceptions. Health Communication, 26(8).
Moyer-Gusé, E., & Nabi, R. L. (2011). Comparing the effects of entertainment and educational television programming on risky sexual behavior. Health Communication, 26(5).
Mozafari, M., Farahbakhsh, R., & Crespi, N. (2019). A BERT-based transfer learning approach
for hate speech detection in online social media. CoRR, abs/1910.12574.
Mulvey, L. (1989). Visual pleasure and narrative cinema. In Visual and other pleasures (pp. 14–26). Springer.
Myers, S. (2018). Reader question: Is there a rule as to how many "cuss words" can be used in a script? https://bit.ly/35hKwhY. The Black List. (Accessed: 04/09/2020)
Niccol, A. (1997). Gattaca. Columbia Pictures.
Nicolopoulou, A., & Trapp, S. (2018, July). Narrative interventions for children with language disorders: A review of practices and findings.
Nielsen, F. Å. (2011). A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. In Proceedings of the ESWC2011 Workshop on 'Making Sense of Microposts': Big things come in small packages (pp. 93–98).
Niemiec, R. M., & Wedding, D. (2013). Positive psychology at the movies: Using films to build virtues and character strengths. Hogrefe Publishing.
Nobata, C., Tetreault, J., Thomas, A., Mehdad, Y., & Chang, Y. (2016). Abusive language detection in online user content. In Proceedings of the 25th International Conference on World Wide Web.
Paek, H.-J., Nelson, M. R., & Vilela, A. M. (2011). Examination of gender-role portrayals in television advertising across seven countries. Sex Roles, 64(3), 192–207.
Pagliardini, M., Gupta, P., & Jaggi, M. (2018). Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features. In NAACL 2018 - Conference of the North American chapter of the association for computational linguistics.
Park, J. H., & Fung, P. (2017). One-step and Two-step Classification for Abusive Language Detection on Twitter. CoRR, abs/1706.01206.
Pavlopoulos, J., Malakasiotis, P., & Androutsopoulos, I. (2017). Deeper Attention to Abusive User Content Moderation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017 (pp. 1125–1135).
Peddinti, V., Chen, G., Manohar, V., Ko, T., Povey, D., & Khudanpur, S. (2015). JHU ASpIRE system: Robust LVCSR with TDNNs, iVector adaptation and RNN-LMs. In 2015 IEEE workshop on automatic speech recognition and understanding (ASRU).
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., . . . Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct), 2825–2830.
Peele, S. (1998). Ten radical things NIAAA research shows about alcoholism. The Addictions Newsletter, 5(6).
Pennebaker, J. W., Boyd, R. L., Jordan, K., & Blackburn, K. (2015). The development and
psychometric properties of LIWC2015 (Tech. Rep.). Austin, TX: The University of Texas at
Austin.
Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018).
Deep contextualized word representations. CoRR, abs/1802.05365.
Polanyi, M. (1981). The creative imagination. In The concept of creativity in science and art (pp. 91–108). Springer.
Polce-Lynch, M., Myers, B. J., Kliewer, W., & Kilmartin, C. (2001). Adolescent self-esteem and gender: Exploring relations to sexual harassment, body image, media influence, and emotional expression. Journal of Youth and Adolescence, 30(2), 225–244.
Potter, W. J., Vaughan, M. W., Warren, R., Howley, K., Land, A., & Hagemeyer, J. C. (1995). How real is the portrayal of aggression in television entertainment programming? Journal of Broadcasting & Electronic Media, 39(4), 496–516.
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., . . . Veselý, K. (2011, December). The Kaldi speech recognition toolkit. In IEEE 2011 workshop on automatic speech recognition and understanding. IEEE Signal Processing Society. (IEEE Catalog No.: CFP11SRW-USB)
Pradhan, S., Moschitti, A., Xue, N., Uryupina, O., & Zhang, Y. (2012). CoNLL-2012 shared task:
Modeling multilingual unrestricted coreference in OntoNotes. In Proceedings of the Sixteenth
Conference on Computational Natural Language Learning (CoNLL 2012). Jeju, Korea.
Pradhan, S., Ward, W., Hacioglu, K., Martin, J., & Jurafsky, D. (2005, June). Semantic role labeling using different syntactic views. In Proceedings of the 43rd annual meeting of the association for computational linguistics (ACL'05) (pp. 581–588). Ann Arbor, Michigan: Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/P05-1072 doi: 10.3115/1219840.1219912
Prentice, D. A., & Carranza, E. (2002). What women and men should be, shouldn't be, are allowed to be, and don't have to be: The contents of prescriptive gender stereotypes. Psychology of Women Quarterly, 26(4), 269–281.
Project MATCH Research Group. (1998). Matching patients with alcohol disorders to treatments: Clinical implications from Project MATCH. Journal of Mental Health, 7(6), 589–602.
Propp, V. (1968). Morphology of the folktale (2nd ed.; S. Pirkova-Jakobson, Trans.). Austin: University of Texas Press. (Original work published 1928)
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models
are unsupervised multitask learners. OpenAI blog, 1(8), 9.
Radford, W., & Gallé, M. (2015). Roles for the boys? Mining cast lists for gender and role distributions over time. In Proceedings of the 24th international conference on world wide web.
Ramakrishna, A., Malandrakis, N., Staruk, E., & Narayanan, S. (2015, September). A quantitative analysis of gender differences in movies using psycholinguistic normatives. In Proceedings of the 2015 conference on empirical methods in natural language processing (pp. 1996–2001). Lisbon, Portugal: Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/D15-1234 doi: 10.18653/v1/D15-1234
Ramakrishna, A., Martínez, V. R., Malandrakis, N., Singla, K., & Narayanan, S. (2017, July). Linguistic analysis of differences in portrayal of movie characters. In Proceedings of the 55th annual meeting of the association for computational linguistics (volume 1: Long papers) (pp. 1669–1678). Vancouver, Canada: Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/P17-1153 doi: 10.18653/v1/P17-1153
Reichert, T., & Carpenter, C. (2004). An update on sex in magazine advertising: 1983 to 2003. Journalism & Mass Communication Quarterly, 81(4), 823–837.
Riedl, M. O. (2017). Computational narrative intelligence: Past, present and future. Retrieved from https://mark-riedl.medium.com/computational-narrative-intelligence-past-present-and-future-99e58cf25ffa
Rose, J. A. (2017). To teach science, tell stories (Unpublished doctoral dissertation). Duke
University.
Rudy, R. M., Popova, L., & Linz, D. G. (2010). The context of current content analysis of gender
roles: An introduction to a special issue. Springer.
Sap, M., Prasettio, M. C., Holtzman, A., Rashkin, H., & Choi, Y. (2017a). Connotation Frames of Power and Agency in Modern Films. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017 (pp. 2329–2334).
Sap, M., Prasettio, M. C., Holtzman, A., Rashkin, H., & Choi, Y. (2017b, September). Connotation frames of power and agency in modern films. In Proceedings of the 2017 conference on empirical methods in natural language processing (pp. 2329–2334). Copenhagen, Denmark: Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/D17-1247 doi: 10.18653/v1/D17-1247
Sargent, J. D., Beach, M. L., Adachi-Mejia, A. M., Gibson, J. J., Titus-Ernstoff, L. T., Carusi, C. P., . . . Dalton, M. A. (2005). Exposure to movie smoking: Its relation to smoking initiation among US adolescents. Pediatrics, 116(5).
Schacht, K. (2019). What Hollywood movies do to perpetuate racial stereotypes (Tech. Rep.). Bonn,
Germany: Deutsche Welle. Online [Accessed: Feb 1st, 2021].
Schlesinger, P., Haynes, R., Boyle, R., McNair, B., Dobash, R. E., & Dobash, R. (1998). Men
viewing violence. Great Britain, Broadcasting Standards Commission London.
Schmidt, A., & Wiegand, M. (2017). A survey on hate speech detection using natural language pro-
cessing. In Proceedings of the Fifth International Workshop on Natural Language Processing
for Social Media.
Sell, G., Snyder, D., McCree, A., Garcia-Romero, D., Villalba, J., Maciejewski, M., . . . others
(2018). Diarization is hard: Some experiences and lessons learned for the JHU team in the
inaugural DIHARD challenge. In Proc. Interspeech.
Shafaei, M., Samghabadi, N. S., Kar, S., & Solorio, T. (2019). Rating for parents: Predicting chil-
dren suitability rating for movies based on language of the movies. CoRR, abs/1908.07819.
Shafer, D. M., & Raney, A. A. (2012). Exploring how we enjoy antihero narratives. Journal of
Communication, 62(6), 1028{1046.
Sharma, N., Chakrabarti, S., & Grover, S. (2016). Gender differences in caregiving among family-caregivers
of people with mental illnesses. World Journal of Psychiatry, 6(1), 7.
Sharrock, T. (2016). Me Before You. New Line Cinema.
Shi, P., & Lin, J. (2019). Simple BERT models for relation extraction and semantic role labeling.
CoRR, abs/1904.05255.
Signorielli, N., & Kahlenberg, S. (2001). Television's world of work in the nineties. Journal of
Broadcasting & Electronic Media, 45(1), 4–22.
Singer, B., Ratner, B., & Vaughn, M. (2000). X-Men. 20th Century Studios.
Sink, A., & Mastro, D. (2017). Depictions of gender on primetime television: A quantitative
content analysis. Mass Communication and Society, 20(1), 3–22.
Smith, B. H. (1980). Narrative versions, narrative theories. Critical Inquiry, 7(1).
Smith, S. L., Choueiti, M., Case, A., Pieper, K., Clark, H., Hernandez, K., . . . Mota, M. (2019).
Latinos in Film: Erasure On Screen & Behind the Camera. Online [Accessed: Feb 2nd,
2021].
Smith, S. L., Choueiti, M., & Pieper, K. (2017). Inequality in 900 Popu-
lar Films (Tech. Rep.). USC Annenberg School for Communication and Journal-
ism. Retrieved from http://annenberg.usc.edu/sites/default/files/Dr%5FStacy%5FL%
5FSmith-Inequality%5Fin%5F900%5FPopular%5FFilms.pdf
Smith, S. L., & Cook, C. A. (2008). Gender Stereotypes: An Analysis of Popular Films and TV
(Report). Los Angeles, CA: Geena Davis Institute on Gender in Media. (Online [Accessed
Jan, 27th 2021])
Smith, S. L., Wilson, B. J., Kunkel, D., Linz, D., Potter, W. J., Colvin, C. M., & Donnerstein, E.
(1998). Violence in television programming overall: University of California, Santa Barbara
study. National Television Violence Study, 3, 5–220.
Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A. Y., & Potts, C. (2013). Re-
cursive deep models for semantic compositionality over a sentiment treebank. In Proceedings
of the 2013 conference on empirical methods in natural language processing, EMNLP 2013
(pp. 1631–1642).
Social Security Administration. (1998). Name Distributions in the Social Security Area. Retrieved
from https://www.ssa.gov/oact/babynames/ (Accessed: Dec 10th, 2018)
Somandepalli, K., Guha, T., Martinez, V. R., Kumar, N., Adam, H., & Narayanan, S. (2021).
Computational Media Intelligence: Human-Centered Machine Analysis of Media. Proceedings
of the IEEE, 1–20.
Sparck Jones, K. (1972). A statistical interpretation of term specificity and its application in
retrieval. Journal of Documentation, 28(1), 11–21.
Sparks, G. G., Sherry, J., & Lubsen, G. (2005). The appeal of media violence in a full-length
motion picture: An experimental investigation. Communication Reports, 18(1-2).
Srivastava, S., Chaturvedi, S., & Mitchell, T. (2016). Inferring interpersonal relations in narrative
summaries. In Proceedings of the AAAI conference on artificial intelligence (Vol. 30).
Stabile, C. A. (2009). "Sweetheart, this ain't gender studies": Sexism and superheroes. Communication
and Critical/Cultural Studies, 6(1), 86–92.
Steinke, J. (2017). Adolescent girls' STEM identity formation and media images of STEM professionals:
Considering the influence of contextual cues. Frontiers in Psychology, 8, 716.
Stone, P. J., Dunphy, D. C., & Smith, M. S. (1966). The general inquirer: A computer approach
to content analysis. MIT press.
Susman, G. (2013). Whatever happened to nc-17 movies?
https://www.rollingstone.com/movies/movie-news/whatever-happened-to-nc-17-movies-
172123/. Rolling Stone. (Accessed: 2018-08-29)
Taber, B. J., Leibert, T. W., & Agaskar, V. R. (2011). Relationships among client–therapist
personality congruence, working alliance, and therapeutic outcome. Psychotherapy, 48(4).
Täckström, O., Ganchev, K., & Das, D. (2015). Efficient inference and structured learning for
semantic role labeling. Transactions of the Association for Computational Linguistics, 3,
29–41.
Tai, K. S., Socher, R., & Manning, C. D. (2015). Improved semantic representations from tree-
structured long short-term memory networks. CoRR, abs/1503.00075.
Tang, D., Wei, F., Yang, N., Zhou, M., Liu, T., & Qin, B. (2014). Learning Sentiment-Specific
Word Embedding for Twitter Sentiment Classification. In Proceedings of the 52nd Annual
Meeting of the Association for Computational Linguistics, ACL 2014 (pp. 1555–1565).
Tarantino, Q. (1994). Pulp Fiction. Miramax.
Tasker, Y. (1993). Spectacular bodies: Gender, genre, and the action cinema. Psychology Press.
Thompson, K. M., & Yokota, F. (2004). Violence, sex, and profanity in films: correlation of movie
ratings with content. Medscape General Medicine, 6(3).
Thompson, M. N., Goldberg, S. B., & Nielsen, S. L. (2018). Patient financial distress and treatment
outcomes in naturalistic psychotherapy. Journal of counseling psychology, 65(4).
Tickle, J. J., Beach, M. L., & Dalton, M. A. (2009). Tobacco, alcohol, and other risk behaviors
in film: how well do MPAA ratings distinguish content? Journal of Health Communication,
14(8).
Tillmann-Healy, L. (2002). Men kissing. Ethnographically speaking: Autoethnography, literature,
and aesthetics, 336{343.
Topel, F. (2007). TMNT's Censored Violence. Retrieved from http://www.canmag.com/nw/6673-
kevin-munroe-tmnt-violence.
Trovati, M., & Brady, J. (2014). Towards an automated approach to extract and compare fictional
networks: an initial evaluation. In 2014 25th international workshop on database and expert
systems applications (pp. 246{250).
Tucker, J. S., Miles, J. N., & D'Amico, E. J. (2013). Cross-lagged associations between substance
use-related media exposure and alcohol use during middle school. Journal of Adolescent
Health, 53(4).
Twenge, J. M., Campbell, W. K., & Gentile, B. (2012). Male and female pronoun use in US books
reflects women's status, 1900–2008. Sex Roles, 67(9), 488–493.
Uray, N., & Burnaz, S. (2003). An analysis of the portrayal of gender roles in Turkish television
advertisements. Sex Roles, 48(1-2), 77–87.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., . . . Polosukhin, I.
(2017). Attention is all you need. In Advances in neural information processing systems.
Vromans, L. P., & Schweitzer, R. D. (2011). Narrative therapy for adults with major depressive
disorder: Improved symptom and interpersonal outcomes. Psychotherapy Research, 21(1).
Warner, M. (2002). Publics and counterpublics. Public Culture, 14(1), 49–90.
Waseem, Z., Davidson, T., Warmsley, D., & Weber, I. (2017, August). Understanding Abuse: A
Typology of Abusive Language Detection Subtasks. In Proceedings of the First Workshop on
Abusive Language Online.
Watson, A. (2019). Movie releases in north america from 2000-2018. https://www.statista.com/
statistics/187122/movie-releases-in-north-america-since-2001/.
Webb, T., Jenkins, L., Browne, N., Afifi, A. A., & Kraus, J. (2007). Violent entertainment pitched
to adolescents: an analysis of PG-13 films. Pediatrics, 119(6), e1219–e1229.
Weischedel, R., Palmer, M., Marcus, M., Hovy, E., Pradhan, S., Ramshaw, L., . . . Houston, A.
(2011). Ontonotes release 4.0. LDC2011T03. Web Download. Philadelphia: Linguistic Data
Consortium.
White, M., White, M. K., Wijaya, M., & Epston, D. (1990). Narrative means to therapeutic ends.
WW Norton & Company.
Widdershoven, G., Josselson, R., & Lieblich, A. (1993). The narrative study of lives. CA: Sage.
Wiegand, M., Ruppenhofer, J., Schmidt, A., & Greenberg, C. (2018). Inducing a Lexicon of
Abusive Words – a Feature-Based Approach. In Proceedings of the 2018 Conference of the
North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, NAACL-HLT 2018.
Wilson, J. D., & MacGillivray, M. S. (1998). Self-perceived influences of family, friends, and media
on adolescent clothing choice. Family and Consumer Sciences Research Journal, 26(4),
425–443.
Wilson, N., Tucker, A., Heath, D., & Scarborough, P. (2018). Licence to swill: James Bond's
drinking over six decades. Medical Journal of Australia, 209(11), 495–500.
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., . . . Brew, J. (2019). Hugging-
face's transformers: State-of-the-art natural language processing. CoRR, abs/1910.03771.
Retrieved from http://arxiv.org/abs/1910.03771
Wood, J. T. (1994). Gendered media: The influence of media on views of gender. Gendered lives:
Communication, gender, and culture, 9, 231–244.
Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., . . . Dean, J. (2016).
Google's neural machine translation system: Bridging the gap between human and machine
translation. CoRR, abs/1609.08144. Retrieved from http://arxiv.org/abs/1609.08144
Wulczyn, E., Thain, N., & Dixon, L. (2017). Ex machina: Personal attacks seen at scale. In
Proceedings of the 26th International Conference on World Wide Web.
Xiao, B., Can, D., Georgiou, P. G., Atkins, D., & Narayanan, S. S. (2012). Analyzing the language
of therapist empathy in motivational interview based psychotherapy. In Proceedings of the
2012 Asia Pacific Signal and Information Processing Association annual summit and conference
(pp. 1–4).
Xiao, B., Imel, Z. E., Georgiou, P. G., Atkins, D. C., & Narayanan, S. S. (2015). "Rate my
therapist": automated detection of empathy in drug and alcohol counseling via speech and
language processing. PloS one, 10(12), e0143055.
Xue, N., Ng, H. T., Pradhan, S., Prasad, R., Bryant, C., & Rutherford, A. (2015, July). The
CoNLL-2015 shared task on shallow discourse parsing. In Proceedings of the nineteenth
conference on computational natural language learning - shared task (pp. 1–16). Beijing,
China: Association for Computational Linguistics. Retrieved from https://www.aclweb
.org/anthology/K15-2001 doi: 10.18653/v1/K15-2001
Xue, N., Ng, H. T., Pradhan, S., Rutherford, A., Webber, B., Wang, C., & Wang, H. (2016,
August). CoNLL 2016 shared task on multilingual shallow discourse parsing. In Proceedings
of the CoNLL-16 shared task (pp. 1–19). Berlin, Germany: Association for Computational
Linguistics. Retrieved from https://www.aclweb.org/anthology/K16-2001 doi: 10.18653/
v1/K16-2001
Yokota, F., & Thompson, K. M. (2000). Violence in G-rated animated lms. Journal of the
American Medical Association, 283(20).
Young, T. (1963). From Russia with Love. United Artists.
Zhang, Y., Dixon, T. L., & Conrad, K. (2010). Female body image as a function of themes in rap
music videos: A content analysis. Sex Roles, 62(11-12), 787–797.
Zhang, Z., Robinson, D., & Tepper, J. (2018). Detecting Hate Speech on Twitter Using a
Convolution-GRU Based Deep Neural Network. In The Semantic Web.
Zhao, H., Chen, W., & Kit, C. (2009, August). Semantic dependency parsing of NomBank and
PropBank: An efficient integrated approach via a large-scale feature selection. In Proceedings
of the 2009 conference on empirical methods in natural language processing (pp. 30–39).
Singapore: Association for Computational Linguistics. Retrieved from https://www.aclweb
.org/anthology/D09-1004
Zhong, H., Li, H., Squicciarini, A. C., Rajtmajer, S. M., Griffin, C., Miller, D. J., & Caragea,
C. (2016). Content-Driven Detection of Cyberbullying on the Instagram Social Network. In
International Joint Conferences on Artificial Intelligence.
Zhou, J., & Xu, W. (2015, July). End-to-end learning of semantic role labeling using recurrent neu-
ral networks. In Proceedings of the 53rd annual meeting of the association for computational
linguistics and the 7th international joint conference on natural language processing (volume 1:
Long papers) (pp. 1127–1137). Beijing, China: Association for Computational Linguistics.
Retrieved from https://www.aclweb.org/anthology/P15-1109 doi: 10.3115/v1/P15-1109
Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., & Fidler, S. (2015).
Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies
and Reading Books. In 2015 IEEE international conference on computer vision, ICCV 2015,
santiago, chile, december 7-13, 2015.
Zillmann, D. (1971). Excitation transfer in communication-mediated aggressive behavior. Journal
of Experimental Social Psychology, 7(4).
Appendices
Appendix A
SRL System Performance
This appendix presents the complete table of performance results for the SRL systems presented in Chapter 5.
Model names follow a legend that encodes different attributes of each experiment:
bi(lstm/gru)N The RNN used, either LSTM or GRU, with the bi prefix indicating a bidirectional RNN. The size
of the hidden dimension is given by the trailing number N.
bert(un)cased The BERT model used (cased or uncased).
classweights Whether a class weighting scheme was used; weights are computed according to the class
frequencies on the training set.
fulltune Whether the transformer layer weights were adapted (fine-tuned) or kept frozen.
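The class-weighting scheme from the legend can be sketched as inverse-frequency weights over the training labels. The exact formula used in the experiments is not spelled out here, so the normalization below (the common "balanced" heuristic, in which weights average to 1 across classes) is an assumption:

```python
from collections import Counter

def class_weights(labels):
    # Inverse-frequency weights over the training labels, normalized so
    # that the weights average to 1 across classes. This mirrors the
    # "balanced" heuristic; the exact scheme used in the experiments
    # may differ.
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}

# Toy training set: the rare "Patient" tag gets the largest weight.
weights = class_weights(["O", "O", "O", "Agent", "Agent", "Patient"])
```

With this toy input, "Patient" (1 of 6 labels) receives weight 2.0 while the majority tag "O" receives 2/3, so the loss penalizes mistakes on rare roles more heavily.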
Action Agent Patient
Model name Trained On Test On Accuracy micro-F1 F1 Precision Recall F1 Precision Recall F1
BASELINES
Shi and Lin (2019) (oracle) Conll2012 test 78.20% 84.10% 100.00% 95.10% 63.42% 76.10% 41.30% 38.33% 39.76%
Shi and Lin (2019) Conll2012 test 51.90% 63.00% 69.70% 91.55% 54.78% 68.55% 29.44% 34.87% 31.93%
Gardner et al. (2017) (oracle) Conll2012 test 60.65% 74.18% 100.00% 90.86% 31.99% 47.31% 72.46% 28.82% 41.24%
Gardner et al. (2017) Conll2012 test 41.80% 53.05% 69.23% 71.63% 27.85% 40.11% 37.44% 23.63% 28.98%
Sap et al. (2017a) (oracle) - test 64.54% 77.82% 100.00% 93.82% 54.41% 68.88% 94.62% 35.45% 51.57%
Sap et al. (2017a) - test 48.60% 63.91% 69.14% 86.15% 49.72% 63.05% 85.19% 26.51% 40.44%
SIMPLEBERT + CLASS WEIGHTS + DOMAIN ADAPTED
Shi and Lin (2019) (LSTM (50)) train test 64.10% 70.40% 76.88% 86.70% 60.82% 71.49% 44.08% 31.58% 36.80%
Shi and Lin (2019) (LSTM (100)) train test 69.80% 70.40% 77.10% 86.30% 61.10% 71.55% 12.50% 0.58% 1.12%
BERT BASE UNCASED + LSTM
LSTM (50) train test 38.80% 54.20% 34.68% 87.72% 55.32% 67.85% 0.00% 0.00% 0.00%
LSTM (100) train test 32.40% 47.60% 22.24% 87.81% 55.78% 68.23% 0.00% 0.00% 0.00%
LSTM (300) train test 31.50% 45.50% 17.82% 86.03% 58.02% 69.30% 0.00% 0.00% 0.00%
BERT BASE CASED + LSTM
LSTM (50) train test 24.20% 37.80% 4.19% 87.05% 58.30% 69.83% 0.00% 0.00% 0.00%
LSTM (100) train test 68.40% 70.50% 62.81% 86.70% 60.82% 71.49% 0.00% 0.00% 0.00%
LSTM (300) train test 22.80% 36.20% 0.00% 86.73% 60.35% 71.18% 0.00% 0.00% 0.00%
BERT BASE CASED + LSTM + CLASS WEIGHTS
LSTM (50) train test 63.70% 70.30% 76.64% 86.72% 60.91% 71.56% 43.90% 31.58% 36.73%
LSTM (100) train test 69.90% 70.90% 77.63% 86.80% 60.73% 71.46% 33.33% 0.88% 1.71%
LSTM (300) train test 68.90% 74.20% 81.88% 86.84% 60.91% 71.60% 47.59% 63.45% 54.39%
BERT BASE CASED + LSTM + CLASS WEIGHTS + DOMAIN ADAPTED
LSTM (50) (full tune) train test 89.00% 90.70% 96.91% 89.40% 88.15% 88.77% 73.94% 66.37% 69.95%
LSTM (100) (full tune) train test 89.00% 90.90% 96.63% 90.14% 88.71% 89.42% 55.75% 65.20% 60.11%
LSTM (300) (full tune) train test 89.30% 91.20% 96.84% 88.90% 90.39% 89.64% 82.42% 61.70% 70.57%
LSTM (500) (full tune) train test 88.80% 90.80% 96.47% 89.64% 88.81% 89.22% 77.51% 65.50% 71.00%
BERT BASE CASED + GRU + CLASS WEIGHTS
GRU (50) train test 67.80% 73.40% 87.11% 86.84% 60.91% 71.60% 45.55% 50.88% 48.07%
GRU (100) train test 67.10% 72.70% 79.60% 86.85% 61.01% 71.67% 45.63% 61.11% 52.25%
GRU (300) train test 69.30% 74.20% 82.04% 86.79% 61.29% 71.84% 46.60% 66.08% 54.66%
BERT BASE CASED + GRU + CLASS WEIGHTS + DOMAIN ADAPTED
GRU (50) (full tune) train test 90.20% 91.30% 96.80% 89.82% 89.74% 89.78% 74.10% 71.93% 73.00%
GRU (100) (full tune) train test 88.30% 90.90% 96.21% 89.92% 89.09% 89.50% 82.06% 62.87% 71.19%
GRU (300) (full tune) train test 89.20% 91.00% 96.85% 90.40% 87.87% 89.12% 75.90% 68.13% 71.80%
Table A.1: SRL System Performance Complete Results
Appendix B
Number of actions per genre
This appendix presents the distribution of the number of actions with respect to each movie genre. As
discussed in Chapter 6, the frequency of actions varies considerably with the genre of the movie.
Genre n Min Mean Median Max Std.Dev.
Drama 667873 225 1730.24 1683.00 3858 628.15
Thriller 602828 1 1976.49 1981.00 3787 568.60
Comedy 405557 3 1480.14 1454.50 3038 517.71
Action 398232 1 2011.27 2029.00 4057 656.47
Crime 312967 619 1819.58 1792.00 3378 517.01
Independent 283392 3 1548.59 1495.00 3139 606.83
Horror 251608 3 1965.69 1993.00 3787 602.61
Sci-Fi 241616 1 1996.83 2076.00 4057 722.01
Period 228820 228 1922.86 1941.00 3858 623.69
Romance 222793 3 1675.14 1653.00 3711 640.19
Adventure 222761 1 2006.86 2024.00 3858 657.40
Fantasy 185164 1 1991.01 2040.00 3840 684.41
Mystery 158679 529 2008.59 2034.00 3378 581.82
Biography 70195 308 1799.87 1717.00 3613 613.90
War 58277 779 2158.41 2141.00 3186 552.32
Historical 52947 779 2117.88 2145.00 3858 803.09
Family 36245 502 1725.95 1709.00 3840 694.80
Sport 35785 785 1626.59 1594.50 3613 602.93
Music 33518 619 1675.90 1564.50 2868 601.21
Animation 28619 502 1683.47 1709.00 2474 476.21
Martial 26897 3 1681.06 1786.00 2777 652.24
Western 25261 1435 2296.45 2227.00 3858 707.12
Foreign 20330 739 2033.00 1997.00 3186 764.22
Parody 19134 502 1275.60 1175.00 2228 463.17
Erotic 16849 1128 1684.90 1634.50 2500 377.17
Film 13354 1339 2225.67 2057.00 3094 630.58
Musical 11466 800 1638.00 1661.00 2488 529.27
Blaxploitation 6842 935 1710.50 1660.50 2586 595.70
Expressionism 3643 1740 1821.50 1821.50 1903 81.50
Surfing 2930 2930 2930.00 2930.00 2930 0.00
Mockumentary 2308 919 1154.00 1154.00 1389 235.00
Table B.1: Distribution of the number of actions per genres
Appendix C
Results for Agent's Actions
This appendix presents the results of the GLMM on agents' actions. Only significant regressors (p < 0.05)
are reproduced. A positive estimate (β > 0) corresponds to an action that is more likely to be portrayed
by a male character; likewise, a negative estimate (β < 0) suggests an action that is less likely to be
portrayed by a male agent. Manually identified errors are color coded: blush for errors propagated from
outside the SRL system (e.g., parsing, lemmatization), and gray for errors due to mislabels from our SRL
system.
GLMM results for the agent's actions
Action Estimate Std. Error Z P-Val Significance
adapt -4.32 0.64 -6.77 0.00 ***
sob -1.47 0.31 -4.71 0.00 ***
stream -1.39 0.31 -4.45 0.00 ***
track -1.39 0.32 -4.31 0.00 ***
pucker -3.16 0.75 -4.21 0.00 ***
orbit -3.84 0.95 -4.05 0.00 ***
compete -3.23 0.81 -4.00 0.00 ***
blank -2.96 0.75 -3.93 0.00 ***
pans -1.83 0.47 -3.93 0.00 ***
gut punch -4.91 1.25 -3.92 0.00 ***
cry -1.01 0.27 -3.81 0.00 ***
start rub -3.43 0.90 -3.80 0.00 ***
cluck -2.26 0.60 -3.79 0.00 ***
asleep 2.79 0.75 3.70 0.00 ***
stare horrify -3.44 0.95 -3.63 0.00 ***
plan -2.36 0.65 -3.61 0.00 ***
enter toss -3.70 1.03 -3.58 0.00 ***
can muster -2.90 0.82 -3.52 0.00 ***
reprimand -4.38 1.25 -3.50 0.00 ***
nod sip -4.07 1.18 -3.44 0.00 ***
start whip -4.30 1.25 -3.44 0.00 ***
start back -2.78 0.81 -3.43 0.00 ***
jar -3.70 1.08 -3.41 0.00 ***
patch -2.76 0.81 -3.40 0.00 ***
emblazon -2.82 0.83 -3.40 0.00 ***
disclose -2.77 0.82 -3.39 0.00 ***
look baffle -3.72 1.11 -3.35 0.00 ***
powder -3.46 1.03 -3.35 0.00 ***
stop spin -3.97 1.18 -3.35 0.00 ***
blossom -3.43 1.03 -3.33 0.00 ***
would look -2.83 0.86 -3.31 0.00 ***
undaunte -2.79 0.84 -3.31 0.00 ***
rivite -4.06 1.25 -3.24 0.00 **
start gather -2.47 0.77 -3.20 0.00 **
giggle -1.03 0.33 -3.17 0.00 **
sigh start 3.26 1.03 3.16 0.00 **
outstretche -2.12 0.68 -3.14 0.00 **
stop follow -3.91 1.25 -3.13 0.00 **
stare wait -3.90 1.25 -3.12 0.00 **
ladle -2.72 0.87 -3.11 0.00 **
can offer -3.88 1.25 -3.10 0.00 **
snuggle -1.49 0.48 -3.10 0.00 **
get rid -3.40 1.10 -3.09 0.00 **
encrust -3.87 1.25 -3.09 0.00 **
stride 0.87 0.28 3.09 0.00 **
begin do -3.64 1.18 -3.08 0.00 **
looks -1.23 0.40 -3.07 0.00 **
notice turn -3.63 1.18 -3.07 0.00 **
lever -3.32 1.09 -3.04 0.00 **
choke gasping -3.79 1.25 -3.03 0.00 **
review -1.52 0.50 -3.03 0.00 **
exceed -3.78 1.25 -3.02 0.00 **
fast -3.11 1.03 -3.01 0.00 **
unholster -2.40 0.81 -2.98 0.00 **
would say -2.82 0.95 -2.97 0.00 **
effect -3.71 1.25 -2.97 0.00 **
belch -2.01 0.68 -2.96 0.00 **
portray -3.49 1.18 -2.95 0.00 **
sit write -3.27 1.11 -2.94 0.00 **
disrobe -3.48 1.18 -2.94 0.00 **
look concern -3.27 1.11 -2.94 0.00 **
ricochet -1.85 0.63 -2.94 0.00 **
finish cut -3.02 1.03 -2.92 0.00 **
quiver -1.49 0.51 -2.92 0.00 **
game -3.63 1.25 -2.90 0.00 **
flex -1.36 0.47 -2.89 0.00 **
turn glance -3.31 1.15 -2.88 0.00 **
nod shake -3.61 1.25 -2.88 0.00 **
turn nod -2.73 0.95 -2.88 0.00 **
will do -1.65 0.58 -2.87 0.00 **
hightail -3.23 1.13 -2.87 0.00 **
perturb 2.58 0.90 2.85 0.00 **
fires miss -3.26 1.15 -2.84 0.00 **
stare fascinate 2.28 0.81 2.83 0.00 **
s quiet -1.78 0.63 -2.83 0.00 **
affront -3.53 1.25 -2.82 0.00 **
swap -3.07 1.09 -2.81 0.00 **
rob -2.36 0.84 -2.80 0.01 **
start slide -3.50 1.25 -2.80 0.01 **
be release -3.30 1.18 -2.79 0.01 **
come stumble -3.49 1.25 -2.79 0.01 **
polished -3.46 1.25 -2.76 0.01 **
lacerate -3.46 1.25 -2.76 0.01 **
hump -1.51 0.55 -2.76 0.01 **
indulge -2.61 0.95 -2.75 0.01 **
pan -0.84 0.31 -2.75 0.01 **
glassy eye -3.15 1.15 -2.75 0.01 **
scare nervous -3.43 1.25 -2.74 0.01 **
traverse -2.23 0.82 -2.74 0.01 **
be carry -2.59 0.95 -2.73 0.01 **
appear come -3.03 1.11 -2.73 0.01 **
spin startled -2.81 1.03 -2.73 0.01 **
swell -1.12 0.41 -2.72 0.01 **
derange -2.46 0.90 -2.72 0.01 **
come upon -1.77 0.65 -2.71 0.01 **
s there -1.62 0.60 -2.71 0.01 **
bellow -1.15 0.43 -2.70 0.01 **
become intrigue -3.38 1.25 -2.70 0.01 **
deploy -3.03 1.13 -2.69 0.01 **
hug -0.73 0.27 -2.69 0.01 **
seem unaware -3.18 1.18 -2.69 0.01 **
unpin -2.55 0.95 -2.69 0.01 **
start spin -3.16 1.18 -2.67 0.01 **
blindfold -2.96 1.11 -2.67 0.01 **
seem frightened -2.53 0.95 -2.67 0.01 **
avenge -3.15 1.18 -2.67 0.01 **
look pale -3.15 1.18 -2.66 0.01 **
well -1.55 0.58 -2.65 0.01 **
inquire -2.32 0.87 -2.65 0.01 **
sit slump -1.38 0.52 -2.65 0.01 **
nod hand -3.03 1.15 -2.64 0.01 **
reset -2.04 0.77 -2.64 0.01 **
heap -2.04 0.77 -2.64 0.01 **
car roar -3.30 1.25 -2.64 0.01 **
sashay -1.84 0.70 -2.64 0.01 **
stand ght -3.30 1.25 -2.63 0.01 **
boom -1.31 0.50 -2.63 0.01 **
can regain -3.29 1.25 -2.63 0.01 **
tilt -0.78 0.29 -2.63 0.01 **
resolve -1.69 0.64 -2.63 0.01 **
glass -3.11 1.18 -2.63 0.01 **
seem bewildered -3.28 1.25 -2.62 0.01 **
start stumble -3.28 1.25 -2.62 0.01 **
spur -1.22 0.47 -2.62 0.01 **
ping -2.94 1.13 -2.61 0.01 **
smash 0.77 0.29 2.61 0.01 **
elate -3.09 1.18 -2.61 0.01 **
translate -1.18 0.45 -2.60 0.01 **
be come -2.22 0.86 -2.60 0.01 **
humiliate -1.27 0.49 -2.60 0.01 **
stop hum -3.25 1.25 -2.60 0.01 **
approach stare -3.23 1.25 -2.58 0.01 **
wall -2.87 1.11 -2.58 0.01 **
commence -2.26 0.87 -2.58 0.01 **
smile say -3.05 1.18 -2.58 0.01 **
sheathe -2.25 0.87 -2.58 0.01 *
look triumphant -3.21 1.25 -2.57 0.01 *
stop confuse -3.20 1.25 -2.56 0.01 *
sigh put 3.18 1.25 2.54 0.01 *
shroud -1.46 0.58 -2.53 0.01 *
stop frighten -3.17 1.25 -2.53 0.01 *
cost -2.89 1.15 -2.52 0.01 *
lie pass -2.89 1.15 -2.52 0.01 *
broach -3.14 1.25 -2.51 0.01 *
align -2.78 1.11 -2.50 0.01 *
can carry -2.95 1.18 -2.50 0.01 *
weld -1.75 0.70 -2.50 0.01 *
stumble fall -2.79 1.13 -2.48 0.01 *
nod confuse -2.93 1.18 -2.48 0.01 *
enter turn -1.43 0.58 -2.48 0.01 *
emphasize -1.77 0.71 -2.47 0.01 *
corral -2.55 1.03 -2.47 0.01 *
wing -2.68 1.08 -2.47 0.01 *
swat -0.92 0.37 -2.47 0.01 *
will kill -2.73 1.11 -2.46 0.01 *
wild -1.85 0.75 -2.46 0.01 *
claim -1.70 0.69 -2.45 0.01 *
spin run -2.90 1.18 -2.45 0.01 *
stone face -1.75 0.72 -2.43 0.01 *
neaten -3.04 1.25 -2.43 0.02 *
stops -2.08 0.86 -2.43 0.02 *
stop dig -2.19 0.90 -2.43 0.02 *
embolden -2.73 1.12 -2.43 0.02 *
keep glance -1.48 0.61 -2.42 0.02 *
start tear -1.95 0.81 -2.42 0.02 *
hallucinate -2.28 0.95 -2.41 0.02 *
laugh delighted -3.01 1.25 -2.41 0.02 *
aim 0.69 0.29 2.41 0.02 *
interfere -3.01 1.25 -2.41 0.02 *
wail -0.86 0.36 -2.40 0.02 *
bottom -3.00 1.25 -2.40 0.02 *
keep drive -1.69 0.71 -2.40 0.02 *
waken -1.80 0.75 -2.40 0.02 *
impale -1.27 0.53 -2.40 0.02 *
wait watch -2.70 1.13 -2.40 0.02 *
get stick -2.83 1.18 -2.39 0.02 *
skitter -2.69 1.12 -2.39 0.02 *
invent -2.65 1.11 -2.38 0.02 *
repulse -1.34 0.56 -2.38 0.02 *
barricade -2.82 1.18 -2.38 0.02 *
begin kiss -2.98 1.25 -2.38 0.02 *
goad -2.73 1.15 -2.38 0.02 *
stay focus -2.15 0.90 -2.38 0.02 *
combine -1.95 0.82 -2.37 0.02 *
see look -2.97 1.25 -2.37 0.02 *
relive -1.64 0.69 -2.37 0.02 *
get carry -2.96 1.25 -2.37 0.02 *
be interview -2.55 1.08 -2.36 0.02 *
hoe 2.13 0.90 2.36 0.02 *
emaciate -2.79 1.18 -2.36 0.02 *
shaky -2.95 1.25 -2.36 0.02 *
film -0.99 0.42 -2.35 0.02 *
soap -1.89 0.81 -2.35 0.02 *
evacuate -2.78 1.18 -2.35 0.02 *
refract -2.94 1.25 -2.35 0.02 *
can cover -2.94 1.25 -2.35 0.02 *
shriek -0.80 0.34 -2.35 0.02 *
damn -2.60 1.11 -2.35 0.02 *
fire 0.63 0.27 2.34 0.02 *
stifle -0.80 0.34 -2.34 0.02 *
nod think -2.11 0.90 -2.33 0.02 *
adore -1.39 0.60 -2.33 0.02 *
ratchet -2.92 1.25 -2.33 0.02 *
barely avoid -2.92 1.25 -2.33 0.02 *
grieve -2.91 1.25 -2.33 0.02 *
stand clutch -2.91 1.25 -2.33 0.02 *
wait look -1.57 0.68 -2.33 0.02 *
torment -1.42 0.61 -2.32 0.02 *
deepen -1.58 0.68 -2.32 0.02 *
hollow -2.09 0.90 -2.31 0.02 *
exhibit -1.94 0.84 -2.30 0.02 *
start fill -2.72 1.18 -2.30 0.02 *
carpet -2.87 1.25 -2.30 0.02 *
scream -0.60 0.26 -2.30 0.02 *
should know -2.63 1.15 -2.29 0.02 *
sit opposite -1.64 0.71 -2.29 0.02 *
can watch -2.07 0.90 -2.29 0.02 *
begin push -2.62 1.15 -2.29 0.02 *
emulate -2.86 1.25 -2.29 0.02 *
smile move -1.59 0.69 -2.29 0.02 *
throb -1.84 0.81 -2.28 0.02 *
expand -1.40 0.61 -2.28 0.02 *
sparkle -0.91 0.40 -2.28 0.02 *
underline -1.58 0.69 -2.27 0.02 *
tight -2.05 0.90 -2.27 0.02 *
comfort -0.77 0.34 -2.27 0.02 *
look scare 2.84 1.25 2.27 0.02 *
can buy -2.84 1.25 -2.27 0.02 *
begin back -1.85 0.82 -2.26 0.02 *
camouflage -2.33 1.03 -2.26 0.02 *
start edge -2.82 1.25 -2.26 0.02 *
fishtail -1.83 0.81 -2.25 0.02 *
start roll -1.30 0.58 -2.25 0.02 *
ruffle -1.06 0.47 -2.25 0.02 *
turn step -2.66 1.18 -2.25 0.02 *
dissipate -1.89 0.84 -2.25 0.02 *
become tense -2.58 1.15 -2.25 0.02 *
unpocket -2.44 1.09 -2.25 0.02 *
start scrape -2.81 1.25 -2.24 0.02 *
awake watch 2.80 1.25 2.24 0.03 *
stop kiss -1.73 0.77 -2.23 0.03 *
appal -0.91 0.41 -2.23 0.03 *
can walk -1.88 0.84 -2.23 0.03 *
service -2.64 1.18 -2.23 0.03 *
keep firing -2.79 1.25 -2.23 0.03 *
enter pull -2.63 1.18 -2.23 0.03 *
reassemble -2.55 1.15 -2.23 0.03 *
start load -2.62 1.18 -2.22 0.03 *
soften -0.72 0.33 -2.22 0.03 *
turn get -2.62 1.18 -2.21 0.03 *
grin 0.62 0.28 2.21 0.03 *
rain -1.37 0.62 -2.21 0.03 *
spoon -1.23 0.56 -2.21 0.03 *
smile find -2.76 1.25 -2.20 0.03 *
breathtake -2.76 1.25 -2.20 0.03 *
move in -1.67 0.76 -2.20 0.03 *
resume play -2.60 1.18 -2.20 0.03 *
daub 2.60 1.18 2.20 0.03 *
can follow -2.75 1.25 -2.20 0.03 *
squint 0.67 0.30 2.20 0.03 *
move onto 2.75 1.25 2.19 0.03 *
succumb -1.76 0.81 -2.19 0.03 *
look frustrated -1.87 0.86 -2.19 0.03 *
nod go -2.51 1.15 -2.19 0.03 *
finish fill -2.42 1.11 -2.18 0.03 *
disembark -2.35 1.08 -2.18 0.03 *
end -0.61 0.28 -2.18 0.03 *
psyche -2.58 1.18 -2.18 0.03 *
swirl -0.85 0.39 -2.18 0.03 *
turn leave -2.06 0.95 -2.18 0.03 *
can fire -2.35 1.08 -2.18 0.03 *
spin look -2.57 1.18 -2.18 0.03 *
pause wait -2.72 1.25 -2.17 0.03 *
deform -2.49 1.15 -2.17 0.03 *
assimilate -2.72 1.25 -2.17 0.03 *
stop pour -2.06 0.95 -2.17 0.03 *
attract -0.83 0.38 -2.17 0.03 *
mobilize 2.23 1.03 2.17 0.03 *
beautiful -2.05 0.95 -2.17 0.03 *
trigger -1.31 0.60 -2.16 0.03 *
mangle -2.70 1.25 -2.16 0.03 *
nonplus -1.55 0.72 -2.16 0.03 *
hear cry 2.70 1.25 2.15 0.03 *
stand try -2.55 1.18 -2.15 0.03 *
stop scream -1.41 0.66 -2.15 0.03 *
turn reveal -2.42 1.12 -2.15 0.03 *
smile put -1.54 0.72 -2.15 0.03 *
go diving -2.69 1.25 -2.15 0.03 *
stand frozen -0.78 0.37 -2.14 0.03 *
turn try -1.61 0.75 -2.14 0.03 *
cook -0.76 0.36 -2.14 0.03 *
invade -1.44 0.68 -2.13 0.03 *
look searchingly -2.66 1.25 -2.13 0.03 *
reckon -1.92 0.90 -2.13 0.03 *
loose -1.14 0.54 -2.12 0.03 *
spits -1.86 0.87 -2.12 0.03 *
continue read -1.52 0.71 -2.12 0.03 *
juice -2.66 1.25 -2.12 0.03 *
yellow -2.43 1.15 -2.12 0.03 *
sew -1.00 0.47 -2.12 0.03 *
continue open -2.18 1.03 -2.11 0.03 *
immobilize -1.53 0.72 -2.11 0.03 *
boggle -2.64 1.25 -2.11 0.03 *
paralyze -0.91 0.43 -2.11 0.03 *
sing -0.59 0.28 -2.10 0.04 *
snort -0.78 0.37 -2.10 0.04 *
start turn -1.51 0.72 -2.10 0.04 *
follow turn -2.62 1.25 -2.10 0.04 *
pirouette -2.35 1.12 -2.09 0.04 *
pause notice 2.16 1.03 2.09 0.04 *
pale -0.93 0.45 -2.09 0.04 *
dazzle -1.13 0.54 -2.09 0.04 *
borrow -1.39 0.66 -2.09 0.04 *
appear carry -1.39 0.66 -2.09 0.04 *
unlace -1.82 0.87 -2.09 0.04 *
slit -2.24 1.08 -2.08 0.04 *
smile raise -2.60 1.25 -2.08 0.04 *
remain defiant -2.60 1.25 -2.08 0.04 *
lie look -2.46 1.18 -2.08 0.04 *
ice -1.97 0.95 -2.08 0.04 *
salivate -2.60 1.25 -2.08 0.04 *
gyrate -1.36 0.66 -2.08 0.04 *
shiver -0.67 0.32 -2.07 0.04 *
include -0.65 0.31 -2.07 0.04 *
delete -2.22 1.07 -2.07 0.04 *
bludgeon -1.96 0.95 -2.07 0.04 *
divide -2.18 1.05 -2.07 0.04 *
loathe -2.32 1.12 -2.07 0.04 *
straighten look -2.44 1.18 -2.06 0.04 *
look alarmed -1.60 0.77 -2.06 0.04 *
resume read -2.58 1.25 -2.06 0.04 *
holler -1.30 0.63 -2.06 0.04 *
revs -2.58 1.25 -2.06 0.04 *
fuse -2.13 1.03 -2.06 0.04 *
get suck -1.55 0.75 -2.06 0.04 *
pounce -1.04 0.51 -2.06 0.04 *
age -0.78 0.38 -2.06 0.04 *
turn head -1.13 0.55 -2.05 0.04 *
come shoot 2.56 1.25 2.05 0.04 *
stop cry -1.02 0.50 -2.04 0.04 *
quote -1.75 0.86 -2.04 0.04 *
smile touch -1.48 0.72 -2.04 0.04 *
look haunt -2.55 1.25 -2.04 0.04 *
toy -0.94 0.46 -2.04 0.04 *
enthrone 2.54 1.25 2.03 0.04 *
turn move -2.28 1.12 -2.03 0.04 *
commandeer -2.09 1.03 -2.03 0.04 *
skip -0.72 0.35 -2.02 0.04 *
smile impressed -1.83 0.90 -2.02 0.04 *
freshen -1.91 0.95 -2.02 0.04 *
smile proud -1.62 0.81 -2.02 0.04 *
lie asleep -0.94 0.47 -2.01 0.04 *
parade -1.55 0.77 -2.01 0.04 *
groove -2.51 1.25 -2.01 0.04 *
cease -1.10 0.55 -2.01 0.04 *
cop -2.51 1.25 -2.00 0.05 *
bite -0.57 0.28 -2.00 0.05 *
can be -2.51 1.25 -2.00 0.05 *
unsettle -1.44 0.72 -2.00 0.05 *
stay silent -1.61 0.81 -2.00 0.05 *
last -2.25 1.12 -2.00 0.05 *
match -0.76 0.38 -1.99 0.05 *
grace -2.06 1.03 -1.99 0.05 *
frost -1.89 0.95 -1.99 0.05 *
regret -1.12 0.56 -1.99 0.05 *
peg -2.28 1.15 -1.98 0.05 *
move across -2.23 1.12 -1.98 0.05 *
flash open -1.53 0.77 -1.98 0.05 *
confuse take -2.48 1.25 -1.98 0.05 *
ignite 0.86 0.44 1.98 0.05 *
start cry -0.69 0.35 -1.98 0.05 *
jettison -2.47 1.25 -1.98 0.05 *
retaliate -2.33 1.18 -1.97 0.05 *
shimmy -1.23 0.62 -1.97 0.05 *
can reach -1.13 0.58 -1.97 0.05 *
sweat 0.66 0.34 1.97 0.05 *
mat -1.48 0.75 -1.97 0.05 *
resurface -2.32 1.18 -1.97 0.05 *
slam open -2.25 1.15 -1.96 0.05 *
Table C.1: GLMM results for the agent's actions.
Appendix D
Results for Patients' Actions
This chapter presents the results of the GLMM on the patients' actions. Only significant regressors (p < 0.05) were reproduced. A positive estimate (β > 0) corresponds to an action which is more likely to be portrayed as being done to a male character. Likewise, negative estimates (β < 0) suggest an action which is less likely to be portrayed as being done to a male character. Manually identified errors are color coded: blush for errors propagated from outside the SRL system (e.g., parsing, lemmatization), and gray for errors due to mislabels coming from our SRL system.
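Assuming the GLMM uses a logistic link, as is typical when the outcome is binary (male vs. non-male character), each estimate is a log-odds coefficient and can be exponentiated into an odds ratio for easier interpretation. A minimal sketch; the value below is illustrative, not an endorsement of any particular row:

```python
import math

def odds_ratio(estimate: float) -> float:
    """Convert a logistic GLMM coefficient (log-odds) to an odds ratio."""
    return math.exp(estimate)

# An estimate of -2.21 would multiply the odds of the patient
# being a male character by roughly a tenth.
print(round(odds_ratio(-2.21), 2))  # → 0.11
```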
GLMM results for the patients' actions
Action Estimate Std. Error Z P-Val Significance
expressionless look -4.13 1.61 -2.56 0.01 *
plot -3.71 1.45 -2.56 0.01 *
gawk -2.76 1.12 -2.47 0.01 *
corrode 3.54 1.44 2.45 0.01 *
constrict -3.47 1.45 -2.40 0.02 *
seat -3.85 1.61 -2.39 0.02 *
rot -2.30 0.97 -2.37 0.02 *
come -3.39 1.44 -2.35 0.02 *
see -3.38 1.44 -2.34 0.02 *
begin singe 3.38 1.44 2.34 0.02 *
stop -3.36 1.44 -2.33 0.02 *
leafs -2.51 1.08 -2.33 0.02 *
rivite -3.73 1.61 -2.32 0.02 *
expel -3.19 1.39 -2.30 0.02 *
seem aware -3.29 1.44 -2.28 0.02 *
dazed -3.65 1.61 -2.27 0.02 *
fall -3.65 1.61 -2.27 0.02 *
comply -3.13 1.39 -2.26 0.02 *
compress -2.61 1.16 -2.26 0.02 *
proposition -2.69 1.19 -2.26 0.02 *
continue kiss 3.63 1.61 2.26 0.02 *
headline 3.26 1.44 2.26 0.02 *
package -3.24 1.44 -2.25 0.02 *
remember -3.60 1.61 -2.24 0.03 *
lure -2.41 1.08 -2.23 0.03 *
fence -2.93 1.32 -2.23 0.03 *
will pray -2.65 1.19 -2.22 0.03 *
shoot 3.19 1.45 2.21 0.03 *
closes -3.05 1.39 -2.20 0.03 *
ad libbe -3.48 1.61 -2.17 0.03 *
dole -3.48 1.61 -2.17 0.03 *
retrace 3.12 1.44 2.16 0.03 *
volunteer 3.12 1.44 2.16 0.03 *
unsnap 2.70 1.26 2.15 0.03 *
pass touch -3.45 1.61 -2.15 0.03 *
nake -3.09 1.44 -2.14 0.03 *
lilt -3.43 1.61 -2.13 0.03 *
strum -2.53 1.19 -2.13 0.03 *
look -2.88 1.36 -2.12 0.03 *
grimace 2.86 1.36 2.11 0.03 *
hear 3.04 1.44 2.10 0.04 *
stand -3.03 1.45 -2.09 0.04 *
boot -2.26 1.08 -2.09 0.04 *
be pull -3.00 1.44 -2.08 0.04 *
hassle -2.60 1.26 -2.07 0.04 *
cherish -2.98 1.44 -2.07 0.04 *
lob 2.31 1.12 2.06 0.04 *
gutte -2.45 1.19 -2.06 0.04 *
infect 2.94 1.44 2.04 0.04 *
pity -2.81 1.39 -2.03 0.04 *
rise go -2.71 1.34 -2.02 0.04 *
endow -2.92 1.44 -2.02 0.04 *
cavort 2.78 1.39 2.01 0.04 *
lie scatter -2.78 1.39 -2.00 0.05 *
drug -2.89 1.44 -2.00 0.05 *
start leak -2.89 1.44 -2.00 0.05 *
do -3.20 1.61 -1.99 0.05 *
begin talk -2.70 1.35 -1.99 0.05 *
brew 2.74 1.39 1.98 0.05 *
extricate -3.18 1.61 -1.98 0.05 *
kidnap -2.21 1.12 -1.97 0.05 *
Table D.1: GLMM results for the patients' actions.
Appendix E
Results for Agent–Patient Interactions
This chapter presents the results of the GLMM on the interactions between agents and patients. Only significant regressors (p < 0.05) were reproduced. Group encodes gender dynamics. For example, M→M encodes male-to-male action portrayals, whereas M→F corresponds to male-to-female interactions. A positive estimate (β > 0) indicates an action which is more likely for that group of characters to portray. Likewise, negative estimates (β < 0) suggest an action which is less likely to be portrayed by that agent–patient group. Manually identified errors are color coded: blush for errors propagated from outside the SRL system (e.g., parsing, lemmatization), and gray for errors due to mislabels coming from our SRL system.
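The gender-dynamics group for each interaction can be thought of as a simple pairing of the agent's and patient's gender labels, e.g., male-to-male (M→M). A hypothetical sketch of this encoding, not the dissertation's actual code:

```python
def dynamics_group(agent_gender: str, patient_gender: str) -> str:
    """Encode an agent-to-patient interaction as a gender-dynamics group.

    Gender labels are assumed to be single-letter codes such as 'M' or 'F'.
    """
    return f"{agent_gender}→{patient_gender}"

# Each labeled (agent, action, patient) triad contributes one observation
# to the GLMM, grouped by this label.
print(dynamics_group("M", "M"))  # M→M
print(dynamics_group("M", "F"))  # M→F
```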
GLMM results for the agent-patient interactions
Group Action Estimate Err. Z P-val Significance
M→M kiss -3.18 0.75 -4.27 0.00 ***
M→M wrap -2.91 0.77 -3.80 0.00 ***
M→M hug -2.78 0.75 -3.70 0.00 ***
M→M scream -2.59 0.75 -3.46 0.00 ***
M→M cover -2.59 0.75 -3.45 0.00 ***
M→M laugh -2.46 0.74 -3.34 0.00 ***
M→M wear -2.35 0.73 -3.20 0.00 **
M→M remove -2.49 0.78 -3.20 0.00 **
M→M dance -2.91 0.91 -3.19 0.00 **
M→M close -2.34 0.73 -3.19 0.00 **
M→M begin -2.34 0.74 -3.17 0.00 **
M→M hesitate -2.71 0.88 -3.10 0.00 **
M→M let -2.27 0.73 -3.10 0.00 **
M→M lie -2.30 0.75 -3.08 0.00 **
M→M set -2.29 0.75 -3.03 0.00 **
M→M find -2.21 0.73 -3.02 0.00 **
M→M smile -2.19 0.73 -3.01 0.00 **
M→M hear -2.20 0.73 -3.01 0.00 **
M→M seem -2.19 0.74 -2.96 0.00 **
M→M dress -2.22 0.75 -2.95 0.00 **
M→M face -2.43 0.82 -2.95 0.00 **
M→M hurry -2.31 0.78 -2.95 0.00 **
M→M hold -2.13 0.73 -2.94 0.00 **
M→M kneel -2.21 0.76 -2.92 0.00 **
M→M roll -2.17 0.74 -2.92 0.00 **
M→M start -2.11 0.73 -2.90 0.00 **
M→M eat -2.44 0.85 -2.88 0.00 **
M→M cross -2.12 0.74 -2.87 0.00 **
M→M move -2.08 0.73 -2.86 0.00 **
M→M continue -2.07 0.74 -2.82 0.00 **
M→M fall -2.07 0.74 -2.80 0.01 **
M→M put -2.03 0.73 -2.79 0.01 **
M→M touch -2.12 0.76 -2.78 0.01 **
M→M pour -2.06 0.74 -2.78 0.01 **
M→M play -2.12 0.77 -2.76 0.01 **
M→M sit -2.00 0.73 -2.76 0.01 **
M→M embrace -2.31 0.84 -2.75 0.01 **
M→M slip -2.09 0.76 -2.74 0.01 **
M→M make -2.00 0.73 -2.74 0.01 **
M→M leave -2.01 0.74 -2.73 0.01 **
M→M check -2.04 0.75 -2.72 0.01 **
M→M stop -1.98 0.73 -2.71 0.01 **
M→M rise -2.02 0.75 -2.71 0.01 **
M→M take -1.96 0.73 -2.70 0.01 **
M→M arrive -2.12 0.79 -2.69 0.01 **
M→M finish -2.19 0.82 -2.69 0.01 **
M→M go -1.95 0.73 -2.68 0.01 **
M→M open -1.95 0.73 -2.67 0.01 **
M→M catch -1.96 0.74 -2.66 0.01 **
M→M realize -1.94 0.73 -2.65 0.01 **
M→M come -1.90 0.73 -2.62 0.01 **
M→M pick -1.91 0.73 -2.61 0.01 **
M→F wheel -3.45 1.32 -2.60 0.01 **
M→M try -1.89 0.73 -2.59 0.01 **
M→M notice -1.90 0.73 -2.59 0.01 **
M→M hang -1.92 0.74 -2.59 0.01 **
M→M bring -1.96 0.76 -2.59 0.01 **
M→M feel -1.93 0.75 -2.59 0.01 **
M→M whisper -2.02 0.79 -2.56 0.01 *
M→M speak -1.89 0.74 -2.56 0.01 *
M→M want -1.91 0.75 -2.54 0.01 *
M→M place -1.99 0.79 -2.53 0.01 *
M→M carry -1.87 0.74 -2.51 0.01 *
M→M turn -1.82 0.72 -2.51 0.01 *
M→M lift -1.85 0.74 -2.50 0.01 *
M→M can -1.86 0.74 -2.50 0.01 *
M→M pull -1.81 0.73 -2.50 0.01 *
M→M shut -2.05 0.82 -2.49 0.01 *
M→M read -1.90 0.76 -2.48 0.01 *
M→M enter -1.81 0.73 -2.47 0.01 *
M→M follow -1.78 0.73 -2.44 0.01 *
M→M struggle -1.82 0.75 -2.44 0.01 *
M→M push -1.78 0.73 -2.44 0.01 *
M→M keep -1.84 0.76 -2.43 0.02 *
M→M pass -1.78 0.74 -2.42 0.02 *
M→M lean -1.77 0.73 -2.42 0.02 *
M→M appear -1.78 0.74 -2.40 0.02 *
M→M run -1.74 0.73 -2.39 0.02 *
M→M force -1.86 0.78 -2.39 0.02 *
M→M wave -1.81 0.76 -2.38 0.02 *
M→M get -1.73 0.73 -2.37 0.02 *
M→M ca -1.75 0.74 -2.37 0.02 *
M→M work -1.80 0.76 -2.36 0.02 *
M→M help -1.75 0.74 -2.36 0.02 *
M→M see -1.71 0.73 -2.36 0.02 *
M→M slide -1.78 0.76 -2.36 0.02 *
M→M draw -1.79 0.76 -2.35 0.02 *
M→M shake -1.73 0.74 -2.35 0.02 *
M→M talk -1.72 0.74 -2.32 0.02 *
M→M reach -1.69 0.73 -2.31 0.02 *
M→M look -1.67 0.72 -2.31 0.02 *
M→M seat -1.76 0.77 -2.27 0.02 *
M→M drop -1.67 0.74 -2.26 0.02 *
M→M know -1.65 0.73 -2.26 0.02 *
M→M stare -1.64 0.73 -2.25 0.02 *
M→M exit -1.70 0.76 -2.24 0.03 *
M→M listen -1.70 0.76 -2.24 0.03 *
M→M lay -1.85 0.83 -2.24 0.03 *
M→M stand -1.62 0.73 -2.24 0.03 *
M→M indicate -1.88 0.84 -2.23 0.03 *
M→M sense -1.88 0.84 -2.23 0.03 *
M→M ask -1.92 0.86 -2.22 0.03 *
M→M walk -1.61 0.73 -2.22 0.03 *
M→M return -1.74 0.79 -2.21 0.03 *
M→M grab -1.61 0.73 -2.21 0.03 *
M→F scream -1.84 0.84 -2.20 0.03 *
M→M head -1.67 0.76 -2.19 0.03 *
M→M pause -1.68 0.77 -2.19 0.03 *
M→M throw -1.60 0.73 -2.19 0.03 *
M→M study -1.68 0.77 -2.18 0.03 *
M→M watch -1.56 0.73 -2.14 0.03 *
M→M wheel -1.74 0.81 -2.13 0.03 *
M→M say -1.59 0.75 -2.13 0.03 *
M→M snap -1.82 0.86 -2.11 0.03 *
M→M raise -1.57 0.74 -2.11 0.03 *
M→M tell -1.68 0.80 -2.11 0.04 *
M→M glance -1.55 0.74 -2.10 0.04 *
M→M give -1.53 0.73 -2.10 0.04 *
M→M gaze -1.84 0.88 -2.08 0.04 *
M→M show -1.56 0.75 -2.08 0.04 *
M→M can see -1.64 0.80 -2.05 0.04 *
M→M pat -1.64 0.80 -2.05 0.04 *
M→M rush -1.53 0.75 -2.05 0.04 *
M→F pour -1.71 0.83 -2.05 0.04 *
M→M drag -1.53 0.75 -2.04 0.04 *
M→M lower -1.62 0.80 -2.03 0.04 *
M→M stay -1.79 0.88 -2.02 0.04 *
M→F realize -1.66 0.82 -2.02 0.04 *
M→M jump -1.50 0.74 -2.02 0.04 *
M→M step -1.47 0.73 -2.01 0.04 *
M→F find -1.62 0.81 -2.00 0.05 *
M→M wait -1.48 0.74 -1.99 0.05 *
M→M call -1.49 0.76 -1.97 0.05 *
M→M lose -1.73 0.88 -1.97 0.05 *
Table E.1: GLMM results for agent–patient interactions.
Abstract
Stories play an important part in how we weave together day-to-day events, make sense of what happens around us, and understand the lives of others. They help shape our identities, inform our world view, and allow us to understand other people's perspectives. The impact of these stories, whether measured through economic gains, emotional response, or societal influence, is closely tied to the audience's experience of these narratives. As such, understanding an audience's narrative experience could allow for computational models that predict a story's impact at both the individual and the societal level. It may also provide a venue where intelligent agents might infer interpersonal relationships, human behaviors, and current societal norms. However, current computational models are limited in their ability to unravel the complex interactions that define the spectator's narrative experience.

While the exact mechanisms behind an audience's narrative experience remain unidentified, we know that the audience's relationship to the characters plays a central role. For example, the audience's emotional response has been linked to their identification with the main characters, their comprehension of the characters' motives, and the resolution of the characters' fates. In this dissertation, I propose that computational models could provide a better estimate of the audience's narrative experience by considering how the characters of the story are portrayed. To this end, this dissertation presents my contributions in models that automatically infer character representations from the aspects by which audiences perceive characters: their dialogues, their actions and behaviors, and their roles. Each aspect is explored through a particular task. For representations from dialogues, state-of-the-art NLP techniques are developed to predict movie ratings of violent and risk-behavior content.
From the representations of actions, I construct datasets and labeling models for action-agent-patient triads that enable large-scale analysis of gender biases in the media. Finally, from the role representations, I present a model that leverages character representations from personal narratives to automatically infer a better estimate of the client's perception of a shared bond during psychotherapy. In each task, our results demonstrate that our models yield a significant improvement over the previous state of the art. This provides empirical support for our claim that character representations may be leveraged for their information about an audience's narrative experience.
Asset Metadata
Creator
Martinez Palacios, Victor Raul
(author)
Core Title
Computational narrative models of character representations to estimate audience perception
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science
Degree Conferral Date
2021-08
Publication Date
07/16/2021
Defense Date
04/19/2021
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
character representation,characters,computational models,computational narrative,deep learning,long short-term memory,LSTM,media studies,narratives,natural language processing,OAI-PMH Harvest,psychotherapy,recurrent neural network,RNN,stories
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Narayanan, Shrikanth (committee chair), Dehghani, Morteza (committee member), Dilkina, Bistra (committee member)
Creator Email
victorrm@usc.edu,vrmp00@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC15595901
Unique identifier
UC15595901
Legacy Identifier
etd-MartinezPa-9753
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Martinez Palacios, Victor Raul
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu