COMPUTATIONAL MODELING OF MENTAL HEALTH THERAPY SESSIONS
by
Leili Tavabi
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
August 2023
Copyright 2023 Leili Tavabi
Dedication
To my loving parents who have always taught me to fly without fear of falling.
To my amazing sister Nazgol Tavabi, who has always been there for me.
To my wonderful family and friends from near and afar whose love has always kept me going.
To my beloved Behnam Shahbazi, for his love and support through all of life’s thick and thin.
Acknowledgements
First and foremost, I would like to express my gratitude to my advisor, Prof. Mohammad
Soleymani. His advice and continuous support have made the long PhD journey really enjoyable
and rewarding. Mohammad’s breadth and depth of knowledge, his discipline and work ethic, and his ability to remain a genuinely caring and supportive advisor are all qualities I aspire to.
I would also like to thank my colleagues, Dr. Trang Tran and Dr. Kalin Stefanov, whose help and support were invaluable in keeping my sanity intact during the challenges of working with real-world data. The majority of my PhD time was focused on an interdisciplinary NIH grant through which I was able to work with and learn from amazing collaborators, Prof. Stefan Scherer, Dr. Brian Borsari, Dr. Joannalyn Delacruz and Dr. Joshua Woolley, for which I am really grateful.
I would also like to thank the rest of my doctoral committee: Prof. Maja Mataric and Prof.
Shrikanth Narayanan, for their invaluable input on my work. Additionally, I am thankful to Prof.
Jon Gratch, Prof. Gale Lucas, Prof. Morteza Dehghani and Prof. Bistra Dilkina for their support
and feedback throughout the different stages of my PhD.
I would also like to thank my fellow lab-mates Yufeng Yin, Minh Tran, Di Chang, Zongjian Li
and Soheil Rayatdoost, amazing co-authors Zihao He, Liupei Lu and Larry Zhang, along with my
friends at the Institute for Creative Technologies (ICT): Setareh Nasihati Gilani, Su Lei, Kushal
Chawla, Jessie Hoegen and many others, for making my time at ICT really enjoyable.
Last but not least, I would like to express my gratitude to my amazing family and wonderful
friends for their immeasurable support throughout this journey.
Table of Contents
Dedication ii
Acknowledgements iii
List of Tables vii
List of Figures ix
Abstract xi
Chapter 1: Introduction 1
1.1 Thesis Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Chapter 2: Background 6
2.1 Machine Learning in Mental Health Therapy . . . . . . . . . . . . . . . . . . . 6
2.2 Motivational Interviewing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Behavioral coding systems in Motivational Interviewing . . . . . . . . . . . . . 8
2.4 In-session behaviors linked with subsequent outcomes . . . . . . . . . . . . . . 9
2.5 Automatic coding of in-session behaviors and outcome prediction . . . . . . . . 10
2.6 Dialogue act classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.7 Modeling empathetic opportunities . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.8 Automated quality assessment of MI sessions . . . . . . . . . . . . . . . . . . . 15
Chapter 3: MI Intent Recognition 18
3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3.1 Input Features and Embeddings . . . . . . . . . . . . . . . . . . . . . . 21
3.3.2 Statistical Analysis on MI Codes . . . . . . . . . . . . . . . . . . . . . . 23
3.3.3 Feedforward MI Code Prediction . . . . . . . . . . . . . . . . . . . . . 24
3.3.4 Recurrent MI Code Prediction . . . . . . . . . . . . . . . . . . . . . . . 26
3.4 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.4.1 Feedforward Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.4.2 Recurrent Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.2.1 Error Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4.2.2 Salient Features . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.5 Summary of Findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Chapter 4: Speaker Turn Change Modeling 36
4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.3.1 Utterance Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.3.2 Speaker Turn Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.3.3 Conversational Context Modeling . . . . . . . . . . . . . . . . . . . . . 40
4.3.4 Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.4 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.5 Summary of Findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Chapter 5: Computational Modeling of Empathetic Opportunities 47
5.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.2 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.3.1 Multimodal Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . 52
5.3.2 Ground-Truth Labels . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.3.3 Behavior Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.3.4 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.3.5 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.4 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.5 Summary of Findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Chapter 6: Automated assessment of MI sessions 64
6.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.2 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.3.1 Session Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.3.2 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.3.3 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.4 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6.4.1 Base Models vs. Emotion Models . . . . . . . . . . . . . . . . . . . . . 73
6.4.2 Cross Corpus vs. Within Corpus . . . . . . . . . . . . . . . . . . . . . . 74
6.4.3 Therapist Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.5 Ethical Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.6 Summary of Findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Chapter 7: Conclusion and Future Directions 82
7.1 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
7.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Bibliography 87
List of Tables
3.1 Example dialogue excerpt from the dataset . . . . . . . . . . . . . . . . . . . . . 20
3.2 Client MI codes class distribution. . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3 Behavioral outcomes class distribution . . . . . . . . . . . . . . . . . . . . . . . 21
3.4 Most significant LIWC features across client MISC codes . . . . . . . . . . . . . 23
3.5 MI codes prediction results for three-class classification. Average micro F1-scores
and their standard deviations (in parentheses) are given. . . . . . . . . . . . . . . 29
3.6 Precision and recall for MI code prediction . . . . . . . . . . . . . . . . . . . . 29
3.7 Confusion matrix for MI code prediction (ST: Sustain Talk, FN: Follow/Neutral,
CT: Change Talk) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.8 Outcome predictions results for two-class classification. Random baseline (mean
F1-score over 1000 trials) and F1-scores over the entire dataset are given. (CTBAC:
Change in Typical Blood Alcohol Content, CPROB: Change in Alcohol-Related
Problems) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.9 F1-Score Classification Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.1 Specs of the three DA classification benchmark datasets. |C| denotes the number of
DA classes; |P| denotes the number of parties; Train/Val/Test denotes the number
of conversations/utterances in the corresponding split. . . . . . . . . . . . . . . . 38
4.2 The accuracies using different chunk sizes on SwDA. . . . . . . . . . . . . . . . 43
4.3 The accuracies using different chunk sizes on MRDA. . . . . . . . . . . . . . . . 43
4.4 Results of DA classification following three different approaches. “Ours ¬Speaker”
represents our method without adding speaker turn embeddings; “Ours+Topic”
represents the proposed method using speaker turn and topic-aware embeddings
for fair comparison to baselines utilizing topic information. State-of-the-art results
are highlighted in bold. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.1 Human-Agent dialogue excerpts with different empathy responses. . . . . . . . . 51
5.2 Distribution of classes for two sets of labels . . . . . . . . . . . . . . . . . . . . 55
5.3 F1-scores for three-class classification. . . . . . . . . . . . . . . . . . . . . . . . 60
5.4 Confusion matrices (RNN fusion). . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.5 Instances of RNN Fusion model’s correct/incorrect predictions on MTurk labels
(Positive, Negative, None). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.1 Dataset statistics: session length (in minutes) and average number of turns . . . . 69
6.2 Most common bi-grams across speakers and quartiles;
∗ bac: blood alcohol content 70
6.3 Performance results (CCC scores) comparing distil-RoBERTa vs. emotion-distil-
RoBERTa encoders, under therapist-independent and therapist-dependent scenarios 74
6.4 CCC score results in within-corpus and cross-corpus experiments, therapist-independent
setting. The within-corpus results include the mean across the cross validation
folds, with standard deviation in parentheses. The cross-corpus results are obtained
from training and validation on the entire datasets. . . . . . . . . . . . . . . . . . 75
6.5 Significant therapist codes associated with empathy; **p-value < 0.01; *p-value <
0.05; positive (+) and negative (-) associations with empathy are shown in parentheses 77
6.6 Sample dialogue excerpts from the datasets including corresponding MI codes. T
denotes the therapist and C denotes the client. . . . . . . . . . . . . . . . . . . . 79
List of Figures
3.1 Feed-forward network architecture for client code classification . . . . . . . . . . 25
3.2 Recurrent network architecture for client code classification . . . . . . . . . . . . 27
3.3 Confusion matrices of classification results by LIWC (left) and RoBERTa (right)
features. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4 Example dialogue with correct and incorrect classifications. red (true → predicted)
denotes misclassification by RoBERTa but correctly classified by LIWC; blue
(true label) denotes correct classification by both models. . . . . . . . . . . . . . 33
3.5 Example dialog with correct and incorrect classifications. blue (true label) denotes
correct classification by our models, red (true → predicted) denotes misclassifica-
tion by both models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.1 The overall framework of our proposed method. In this toy example, the conversa-
tion consists of five utterances. . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2 A toy example of slicing a conversation of length 10 into 3 chunks of length of 4. 42
5.1 A participant and the virtual agent, Ellie. . . . . . . . . . . . . . . . . . . . . . . 50
5.2 Box plots of verbal and nonverbal behavior with significant differences among
different classes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.3 Multimodal static fusion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.4 Multimodal RNN fusion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.5 F1-scores of VADER sentiment analysis with different thresholds. . . . . . . . . 61
6.1 Histograms of empathy ratings across our two datasets and the combination of
both datasets (normalized between [0-1]) . . . . . . . . . . . . . . . . . . . . . . 68
6.2 The model includes an utterance encoder (distil-RoBERTa [81] or emotion-distil-
RoBERTa [58]) whose output is projected to a lower-dimensional space by the
following linear layer. The sequence of utterance-level representations is then fed
to a Bidirectional Gated Recurrent Unit (Bi-GRU) layer. The GRU is followed by
a two-head self-attention layer on GRU’s hidden states whose output is mean- and
max-pooled into the final vector embedding, which is fed to a final linear layer for
regression. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6.3 Empathy ratings across different therapists. Therapist IDs 1-13 are from PRIME
dataset; Therapist IDs 14-17 are from NEXT dataset. . . . . . . . . . . . . . . . 78
Abstract
Mental health conditions have increased worldwide in recent years, with about 20% of adults
in the US having experienced a mental health disorder. However, more than half of those affected do not receive the care and treatment they need [146]. Despite the prevalence of mental health disorders, there is a large gap between the need for and the availability of resources for diagnosis and treatment. Recent advancements in machine learning and deep learning provide an opportunity for developing AI-assisted analysis and assessment of therapy sessions through behavior analysis and understanding, and through measuring clients’ symptoms.
Automatic behavior analysis can augment clinical resources in diagnosis and in the assessment of treatment quality and efficacy. Behavior perception modules built from clients’ and therapists’ in-session
data can be utilized in multiple ways, including modeling client and therapist in-session behaviors
to investigate their associations with treatment quality; building the underlying components for
intelligent machines capable of running automated clinical interviews for probing indicators of
mental health disorders; and identifying and utilizing salient behavioral patterns associated with
certain disorders toward AI-assisted diagnosis.
This dissertation is mainly aligned with the following directions:
Automated assessment of therapy sessions: We develop behavior analysis models for auto-
mated assessment of therapy sessions, with a special focus on Motivational Interviewing (MI).
Using transcripts from real-world MI sessions, we propose approaches for modeling and analyzing
client-therapist dialogue toward automatic recognition of client intent on a granular utterance level,
as well as global session-level quality metrics such as therapist empathy; we further explore the association of in-session data with subsequent behavioral outcomes. Additionally, for a more
interpretable understanding, we identify psychologically-relevant features associated with the
modeled constructs.
To validate our models on a more general task with large publicly available datasets, we train
and evaluate our model architectures on benchmark datasets for dialogue act classification while
proposing learnable embeddings for modeling turn changes in dialogue. Our approach significantly
improves task performance compared to the existing models.
Development of perception models to facilitate intelligent human-agent interactions: Fo-
cusing on empathy as an important construct in therapy, we develop models for recognizing
opportunities for expressing empathy in human-agent interactions. For this goal, we train our
models on multimodal data from clinical interviews designed to probe indicators of mental health
disorders. The proposed models provide the means for leading more emotionally-intelligent
human-agent interactions.
This work strives to computationally study and model aspects of therapy sessions and clinical
interviews that can provide the means for more efficient and objective assessment of the sessions,
and important indicators of consequent behavioral outcomes.
Chapter 1
Introduction
Mental disorders are characterized by clinically significant disturbances in an individual’s cognition,
emotion regulation, or behavior. They are usually associated with distress or impairment in
important areas of functioning [147]. Mental health conditions have been increasing worldwide
in recent years and are reported to affect 1 in 5 adults, making them one of the leading causes of
disability worldwide. Over half of the adults with mental health disorders do not receive treatment, mainly due to the limited availability of clinical resources, among other reasons [146].
Automated behavior analysis, utilizing the recent advancements of large neural networks, can
be used to augment clinical resources by providing efficient and objective tools for AI-assisted
diagnosis, automated assessment of treatment quality and efficacy, and discovering behavioral
patterns from in-session client-therapist data associated with desired behavioral outcomes. Models
of behavior perception can provide many advantages to the field of mental health, including but
not limited to, automating some of the resource-intensive processes required for understanding
and assessment of therapy sessions; and identifying patterns of verbal and non-verbal behaviors
associated with specific disorders like depression and PTSD, which can be used for AI-assisted
diagnosis.
The major part of this dissertation focuses on using real-world therapy sessions from Motiva-
tional Interviewing (MI). Motivational Interviewing is a collaborative, goal-oriented conversation
style focused on the language of change. It is designed to strengthen personal motivation for and
commitment to a specific goal and focuses on resolving the client’s ambivalence and eliciting their
own intrinsic reasons for behavior change [96]. MI efficacy in promoting behavior change largely rests on two underlying processes: 1) eliciting in-session arguments for change and 2) the client-therapist interpersonal relationship, which are further described below:
1. Eliciting in-session arguments for change: Behavior change is promoted by eliciting the client’s arguments for change, or Change Talk, whereas Sustain Talk favors the behavioral status quo [106]. The client’s in-session attitude toward change has been shown to carry important indicators of subsequent behavior change.
2. Client-therapist interpersonal relationship: The relational factors consist of different constructs, including empathy and MI spirit. Empathy, the therapist’s ability to understand and relate to the patient’s experience, may be an important component of building a strong therapeutic alliance [1] and has been shown to be an influential factor in promoting behavior change [149]. Additionally, an underlying spirit of MI has been described as a crucial component of MI efficacy [126]. “MI spirit” is focused on a collaborative conversation that evokes the client’s own motivation while honoring their autonomy, and is the culmination of three therapist characteristics: collaboration, evocation and autonomy. There is early evidence that MI spirit is linked with change talk and desirable behavioral outcomes [94]. These factors are rated at the global session level by focusing on the therapist’s behavior (and their facilitation of the described constructs).
Standardized MI coding systems provide guidelines for coding client intent with respect to the client’s attitude toward change, and for coding therapist intent from a repertoire of actions therapists can take to resolve ambivalence and encourage behavior change, e.g., open/closed-ended questions, simple/complex reflections or affirmations. These client and therapist utterance-level codes have
been widely used for the assessment of MI sessions and studying associations of in-session
behaviors with subsequent outcomes. Additionally, global session-level ratings of empathy and
MI spirit, among other factors, provide an overall assessment of session quality. These utterance-
level and session-level codings are performed by trained coders who often have to take multiple
passes for each session. This process is highly resource-intensive and subjective. As part of
this dissertation, we focus on building models for automatic recognition of client utterance-level
intent and global session-level quality metrics, while outperforming existing state-of-the-art models trained on similar datasets. We propose two different approaches for encoding dialogue history
context and highlight the importance of history context for intent recognition. We further provide
an interpretable and psychologically-relevant understanding by identifying salient features across
different classes of client intent and explore the associations of in-session data with subsequent
behavioral outcomes.
Although trained and evaluated on MI data, the proposed approaches can be utilized for modeling
other goal-oriented therapy sessions, including Cognitive Behavioral Therapy (CBT), or more
generally goal-oriented dialogue. To study the effectiveness of our proposed approaches on
general-domain dialogue and in comparison with state-of-the-art models in Dialogue Act (DA)
classification, we train and evaluate our models on benchmark DA datasets [72, 135, 80] while
introducing a simple but effective turn change encoding and reaching state-of-the-art performance.
Focusing on the importance of empathy in shaping dyadic interactions, which is critical in
the context of therapy and mental health, we build perception models for augmenting embodied
intelligent machines toward running more emotionally-aware interactions. We study opportunities
for expressing empathy in human-agent clinical interviews and develop multimodal models using
the client’s verbal and nonverbal behaviors for recognition of opportunities for expressing empathy.
We propose different multimodal fusion approaches while demonstrating the critical role and
challenge of finding ground-truth labels for the complex and interpersonal empathy construct.
1.1 Thesis Overview
This dissertation is structured as follows:
In Chapter 2, we provide background on prior art related to different sections of this disser-
tation: we review existing work on the computational modeling of therapy sessions using natural language, describe existing approaches to dialogue act classification, and provide background on multimodal modeling of empathy in human-machine interactions.
In Chapter 3, we describe our approaches for modeling client intent in Motivational Interviewing
(MI) sessions, exploring its association with subsequent behavioral outcomes, as well as providing
an interpretable understanding of the language used across different client intents/talk types.
Chapter 4 presents our models for encoding dialogue history by introducing turn change encodings
for the task of Dialogue Act (DA) classification on benchmark datasets. We evaluate our model on multiple datasets and compare results with existing work on DA classification.
In Chapter 5, we focus on modeling opportunities for empathetic expressions in human-agent
clinical interviews. We propose multimodal models using the client’s facial expressions, speech
prosody and language to predict opportunities throughout the dyadic interaction where the agent
can express empathy.
In Chapter 6, we describe our approach for automated assessment of MI sessions by focusing on
empathy as a key factor for reaching desired therapeutic outcomes. We present our results for
predicting ratings of therapist empathy by focusing on different stages of the therapy session. We
perform multiple within- and cross-dataset experiments to investigate the salience of different stages of the therapy session for estimating session-level empathy.
Finally, in Chapter 7, we provide a review and summary of findings and contributions, along with
a discussion on future directions.
Chapter 2
Background
2.1 Machine Learning in Mental Health Therapy
The growing prevalence of mental health disorders, along with the unmet need for clinical resources,
highlights the importance of automated behavior analysis for facilitating AI-assisted approaches to
diagnosis, personalized treatment, and monitoring and assessment of treatment quality. Here we
describe a few approaches to using AI-based behavior analysis for augmenting clinical resources:
AI-assisted diagnosis: There is growing interest in using automatic human behavior analysis toward computer-aided diagnosis of depression, among other disorders. Such works utilize
behavioral cues including facial expressions and speech prosody, because of convincing evidence
that depression and related mental health disorders are associated with changes in patterns of
verbal and nonverbal behavior [35, 133, 71, 39, 154]. Facial activity, gesturing, head movements
and expressivity are among behavioral signals that are strongly correlated with depression. Early
paralinguistic investigations into depressed speech found that patients consistently demonstrated
prosodic speech abnormalities such as reduced pitch, reduced pitch range, slower speaking rate,
and higher articulation errors [39]. Facial expression and head gestures are also good predictors
of depression; e.g., a more downward angle of the gaze, less intense smiles, and shorter average
duration of smiles have been reported as the most salient facial cues of depression [132]. Further,
body expressions, gestures, head movements, and linguistic cues have also been reported to provide
relevant cues for depression detection [100, 123, 112, 3].
Probing indicators of mental health disorders using intelligent agents: Mental health is one
of the domains where intelligent interactive agents have been shown to be capable of providing a safe and
secure environment for patients to share personal and sensitive experiences. Studies have shown
that people are more comfortable disclosing to a virtual agent compared with a human, due to fear
of judgment [85]. Following this finding, prior works have designed virtual human interviewers
to create interactions favorable to the automatic assessment of distress indicators, which include
verbal and nonverbal indicators correlated with depression, anxiety or Post-Traumatic Stress
Disorder (PTSD) [43, 140, 141]. A large body of work has developed machine learning models for
augmenting embodied agents and robots with the necessary capabilities for running automated and
realistic interactions including prediction of backchannel opportunities [101, 109] and empathy
[26].
2.2 Motivational Interviewing
Motivational interviewing (MI) is a client-centered counseling approach that aims to elicit and
strengthen motivation for behavior change. It is designed to strengthen personal commitment to
a specific goal by eliciting the person’s own reasons for behavior change within an atmosphere
of acceptance and compassion. MI is based on a respectful and curious approach that facilitates
the natural process of change while honoring the client’s autonomy. It is a balance between
following (good listening) and directing (giving information and advice), while refraining from
any unsolicited advice, directing or confrontation. MI has been widely used in healthcare, social
work, and addiction treatment settings, such as alcohol and substance abuse treatment [97]. The key principles
of motivational interviewing include expressing empathy, developing discrepancy, rolling with
resistance, and supporting self-efficacy [97]. These principles are applied through a variety of
techniques, such as reflective listening, open-ended questioning, and affirmations [127]. MI is a
well-established approach to behavior change that is supported by a growing body of research,
and its client-centered and collaborative style makes it a particularly useful tool for practitioners
working in healthcare and social work settings.
2.3 Behavioral coding systems in Motivational Interviewing
Previous research has shown that client and therapist in-session behaviors throughout an MI
session can provide strong insights into session quality on different fronts, for example:
• Documenting therapist adherence to MI
• Providing detailed session feedback for therapists in the process of learning MI
• Enabling the investigation of subsequent behavioral outcomes using psychotherapy process
measures
• Obtaining new insights about MI and its underlying processes of efficacy
The Motivational Interviewing Skill Code (MISC) [95] and Motivational Interviewing Treat-
ment Integrity (MITI) [103] are two commonly used coding systems. MISC and MITI are different
tools used for accomplishing different tasks. MISC is a more comprehensive coding system
examining both interviewer and client behaviors, coding for both the global session-level rating
scales and more granular utterance-level behavior codes for both the client and therapist. Similar to
MISC, MITI also provides global as well as utterance-level codings for client and therapist behaviors, though with a higher emphasis on therapist behaviors. MISC and MITI differ in the scales
of their global ratings, depending on the version being used. The MISC is an exhaustive and
mutually-exclusive coding system, which is different from MITI. Many specific behaviors that are
coded in the MISC are collapsed into a single category in the MITI, or left uncoded entirely. Our
datasets are coded following the MISC 2.5 or MITI 3.1 coding systems. These systems provide global ratings of empathy and MI spirit on different Likert scales: MISC 2.5 and MITI 3.1 use 7-point and 5-point Likert scales, respectively.
2.4 In-session behaviors linked with subsequent outcomes
MI is grounded in client-centered principles that are built on two major components: 1) relational processes, including therapeutic empathy, respect for the client’s autonomy and MI spirit, which are characterized by the overall therapeutic atmosphere and relationship, and 2) technical processes, including open questions and simple and complex reflections, which are characterized by therapist skills. These principles are designed to create a safe and exploratory environment for the client to
verbalize their personal values, capacities and reasons regarding behavior change. In MI causal
theory, it is hypothesized that the client’s statements for and against change mediate intervention
efficacy. Therefore, the technical hypothesis of MI posits that the therapist-implemented MI
strategies are related to the client’s expressed attitude toward behavior change, and that those client statements can be utilized to predict subsequent behavioral outcomes [87].
A meta-analysis testing these hypotheses on MI efficacy demonstrates that MI-consistent
skills were correlated with more client change talk, and MI-inconsistent skills were correlated with more sustain talk. When focusing on the proportions of these indicators, it was shown that a higher proportion of MI consistency (i.e., total MI-consistent skills / total MI skills) was related to a higher proportion of change talk (total change talk / (total change talk + sustain talk)), and a higher proportion of change talk was related to reductions in risk behaviors at follow-up sessions. When
testing for independent effects of change talk and sustain talk on subsequent outcomes, change
talk was not significant but sustain talk was significantly and positively associated with worse
outcomes [86]. While the relational hypothesis was not validated in the described meta-analysis,
other work demonstrates that relational processes, such as therapist expression of empathy, have positive effects on client within-session collaboration and engagement [105] as well as on follow-up
alcohol use outcomes [54].
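As a concrete illustration of the two proportion measures above, the short sketch below computes them from hypothetical per-session counts (the numbers are invented purely for illustration):

```python
# Hypothetical per-session counts, used only to illustrate the two ratios above.
mi_consistent = 42    # MI-consistent therapist skills
mi_inconsistent = 6   # MI-inconsistent therapist skills
change_talk = 35      # client change talk utterances
sustain_talk = 15     # client sustain talk utterances

# Proportion of MI consistency: total MI-consistent skills / total MI skills
p_mi_consistent = mi_consistent / (mi_consistent + mi_inconsistent)

# Proportion of change talk: total change talk / (total change talk + sustain talk)
p_change_talk = change_talk / (change_talk + sustain_talk)

print(f"MI consistency proportion: {p_mi_consistent:.2f}")  # 0.88
print(f"Change talk proportion:    {p_change_talk:.2f}")    # 0.70
```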
2.5 Automatic coding of in-session behaviors and outcome prediction
Using in-session behavioral codings of the client and therapist language has been shown to be effective for estimating the final outcomes of therapy sessions in Motivational Interviewing, Cognitive Behavioral Therapy, and other approaches [86, 153]. Multiple efforts have therefore focused on the automated
prediction of these behavioral codings in order to avoid the costly and time-consuming manual
effort of annotation. Ewbank et al. [47] propose a neural network classification model for MI
behavioral codings, used in CBT sessions. They focus on five categorical behavioral codings from
MI (change-talk active, change-talk explore, follow/neutral, sustain talk and describing problems)
for a multi-label classification of client utterances. They encode both client and therapist utterances
as a sequence of word embeddings obtained by word2vec [92], along with an added dimension
to represent the speaker role and feed the sequence of utterance embeddings to a bidirectional
Long Short-Term Memory (LSTM). They further look into associations of the selected behavioral
codings with the desired outcome of the CBT sessions by running a logistic regression and observe
that the quantity of sustain talk was negatively associated with reliable improvement. Xiao et al.
[158] leverage bi-directional Gated Recurrent Units (GRU) on sequences of word embeddings
pretrained on in-domain data to predict therapist and client codings. Huang et al. [65] combine
the topic- and word-level content of the current utterance, verbal context (five previous therapist
utterances), and codes (ten previous codes), along with a domain adaptation mechanism on topic
embeddings. They use this data for code classification of individual MI sessions and examine
how the distribution of content changes across time stages within sessions. In other domains of
therapy, Tseng et al. [150] approach human behavior estimation in couples therapy, in which
couples with real marital issues discuss selected topics. They first extract semantic information
using seq2seq models into deep sentence embeddings, which are then fed into a Recurrent Neural
Network (RNN) for estimating the behavioral codings of each speaker. The models are trained for
automated coding of negative sentiment by attributing the codes from the entire session to all the
utterances within that session as a form of data augmentation.
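To make this family of models concrete, the following is a minimal sketch of an utterance-level code classifier of the kind described above: word embeddings with an appended speaker-role dimension fed to a bidirectional LSTM. It is an illustration only, not the implementation of any of the cited works, and the vocabulary size, dimensions and number of code classes are placeholders.

```python
import torch
import torch.nn as nn

class UtteranceCoder(nn.Module):
    """Bi-LSTM over word embeddings with an appended speaker-role dimension."""
    def __init__(self, vocab_size=10000, emb_dim=300, hidden=128, n_codes=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # +1 input dimension for the speaker-role indicator (0 = client, 1 = therapist)
        self.lstm = nn.LSTM(emb_dim + 1, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, n_codes)

    def forward(self, token_ids, speaker_role):
        # token_ids: (batch, seq_len); speaker_role: (batch,) with 0/1 per utterance
        x = self.embed(token_ids)
        role = speaker_role.float().view(-1, 1, 1).expand(-1, x.size(1), 1)
        x = torch.cat([x, role], dim=-1)
        _, (h, _) = self.lstm(x)
        # concatenate the final forward and backward hidden states
        h = torch.cat([h[0], h[1]], dim=-1)
        return self.classifier(h)

# toy usage on random token ids
model = UtteranceCoder()
logits = model(torch.randint(0, 10000, (2, 12)), torch.tensor([0, 1]))
print(logits.shape)  # torch.Size([2, 5])
```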
Compared to the existing literature for behavioral code prediction using unimodal language
or speech, the multimodal domain remains relatively less explored. Black et al. [14] use speech
prosody toward measuring different session-level behavioral observations of married couples
partaking in problem-solving interactions (e.g., the level of blame toward the other spouse). They
used prosodic, spectral, and voice quality features to capture global acoustic properties for each
spouse and trained gender-specific and gender-independent classifiers for the classification of the
extreme instances of the selected codes (i.e. “low” versus “high” blame). Singla et al. [137] use
a multimodal approach, combining prosodic, speech pause and lexical features, to classify and
predict CBT codes using LSTM models. The work demonstrates improved results when using
an attention mechanism on multimodal data.
In our work in Chapter 3, we propose different models for client intent recognition, along with a statistical analysis of the interpretable features associated with different talk types, an error analysis of the model outputs, and an exploration of the predictability of behavioral outcomes using in-session data.
2.6 Dialogue act classification
Existing works have proposed different approaches for the task of Dialogue Act (DA) classification using multiple benchmark datasets. DA classification is similar to the intent classification in therapy sessions discussed earlier, but targets more general-domain task-oriented dialogue such as telephone conversations and meetings.
For the task of DA classification, Chen et al. [30] propose a CRF-attentive structured network
and apply a structured attention network to the CRF (Conditional Random Field) layer in order to
simultaneously model contextual utterances and the corresponding DAs. Li et al. [79] introduce a
dual-attention hierarchical RNN to capture information about both DAs and topics, where the best
results are achieved by a transductive learning model. Raheja et al. [122] utilize a context-aware
self-attention mechanism coupled with a hierarchical RNN. Colombo et al. [37] leverage the
seq2seq model to learn the global tag dependencies instead of the widely used CRF that captures
local dependencies; this method, however, requires beam search, which introduces additional complexity.
Some existing works focus on speaker roles for the purpose of encoding dialogue context
in conversations involving distinguishable speaker roles, such as guide versus tourist. For encoding
role-based context information, existing works use individual recurrent modules for each speaker
role, modeling the role-dependent goals and speaking styles and taking the sum of the resulting
representations from each speaker [31, 28]. Similarly, Hazarika et al. [59] obtain history context
representations per speaker by modeling separate memory cells using GRUs for each speaker;
therefore, speaker-based histories undergo identical but separate computations before being
combined for the downstream task. In another approach, Qin et al. [121] treat an utterance
as a vertex and add an edge between utterances of the same speakers to construct cross-utterance
connections, based on specific speaker roles.
Different from speaker role-based methods, our work in Chapter 4 focuses on speaker turns and remains useful when speakers are not associated with specific roles. Additionally, previous methods incorporate speaker information by proposing more complex and specialized models, which inevitably introduce a large number of parameters to train, whereas we introduce two global additive embedding vectors, requiring negligible modifications to a recurrent model and introducing O(1) space complexity.
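As a simplified sketch of this idea (not the exact model presented in Chapter 4), the two turn-change vectors can be stored in a two-row embedding table whose rows are added to the utterance representations before the recurrent context model; the dimensionality below is a placeholder.

```python
import torch
import torch.nn as nn

class TurnChangeEncoder(nn.Module):
    """Adds one of two global learnable vectors to each utterance embedding,
    depending on whether the speaker changed relative to the previous utterance."""
    def __init__(self, dim=256):
        super().__init__()
        # row 0: same speaker as previous turn, row 1: speaker changed
        self.turn_emb = nn.Embedding(2, dim)  # O(1) extra parameters

    def forward(self, utt_reprs, speaker_ids):
        # utt_reprs: (num_utts, dim); speaker_ids: (num_utts,) integer speaker labels
        changed = torch.zeros_like(speaker_ids)
        changed[1:] = (speaker_ids[1:] != speaker_ids[:-1]).long()
        return utt_reprs + self.turn_emb(changed)

# toy usage: five utterances, with speaker changes at positions 2, 3 and 4
utts = torch.randn(5, 256)
speakers = torch.tensor([0, 0, 1, 0, 1])
enriched = TurnChangeEncoder()(utts, speakers)
print(enriched.shape)  # torch.Size([5, 256])
```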
2.7 Modeling empathetic opportunities
The development of emotionally intelligent and empathetic agents has been a long-standing goal
of AI. Bickmore et al. [13] showed how embodied agents can employ empathy to form better
social relationships. Brave et al. [20] show that empathetic emotions lead to greater likeability
and trustworthiness of the agent. Existing works have mostly examined empathetic interactions
through game-playing contexts [77, 20, 12]. Clavel et al. [34] surveyed sentiment analysis and its
applications to human-agent interaction. They found that the existing sentiment analysis methods
deployed in human-agent interaction are not designed for socio-affective interactions. Hence, they
recommend building systems that can support socio-affective interactions in addition to enhancing
engagement and agent likability.
Sentiment analysis usually focuses on recognizing the polarity of sentiment expressed towards
an entity [138]. Learning empathetic opportunities in interactive systems requires more than
mere recognition of polarity, since empathetic responses are reactions to personal misfortunes or successes, not to just any emotionally charged utterance. Recent multimodal sentiment
analysis approaches use deep neural networks trained and evaluated on social media videos to
detect sentiment. Zadeh et al. [162] used a Tensor Fusion Network to model intra-modality and
inter-modality dynamics in multimodal sentiment analysis. Their tensor fusion network consists
of modality embedding sub-networks, a tensor fusion layer modeling the unimodal, bimodal and
trimodal interactions using a three-fold Cartesian product of modality embeddings, along with
a final sentiment inference sub-network conditioned on the tensor fusion layer. Hazarika et al.
[59] propose a conversational memory network for emotion recognition in dyadic interactions,
considering emotion dynamics. They use GRUs to model past utterances of each speaker into
memories to leverage contextual information from the conversation history. Majumder et al. [89]
model emotions in conversations by distinguishing individual parties throughout the conversation
flow. They consider three major aspects in dialogue by modeling individual party states, context
from the preceding utterances as well as the emotion of the preceding utterance by employing
three GRUs. Their network feeds incoming utterances into two GRUs, the global GRU and the party GRU, to update the context and the party state, respectively. The global GRU encodes corresponding
party information while encoding an utterance. By attending over the global GRU, the model
represents information from all previous utterances and the speaker state. Depending on the
context, information is updated and fed into the emotion GRU for emotion representation.
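For concreteness, the snippet below sketches the tensor-fusion step described above: each modality embedding is augmented with a constant 1 before an outer product, so that unimodal, bimodal and trimodal interaction terms all appear in the fused tensor. The embedding sizes are placeholders and this is not the original implementation.

```python
import torch

def tensor_fusion(z_text, z_audio, z_video):
    """Outer product of modality embeddings, each augmented with a constant 1,
    so the result contains unimodal, bimodal and trimodal interaction terms."""
    one = torch.ones(1)
    t = torch.cat([z_text, one])   # (d_text + 1,)
    a = torch.cat([z_audio, one])  # (d_audio + 1,)
    v = torch.cat([z_video, one])  # (d_video + 1,)
    # (d_text+1, d_audio+1, d_video+1) fusion tensor, flattened for a downstream network
    return torch.einsum('i,j,k->ijk', t, a, v).flatten()

fused = tensor_fusion(torch.randn(32), torch.randn(16), torch.randn(16))
print(fused.shape)  # torch.Size([9537]) = (32+1) * (16+1) * (16+1)
```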
Existing works mostly leverage online datasets that benefit from large amounts of data [29,
114, 118, 116], or use highly curated offline datasets that adopt professional actors for predefined
and highly expressive scenarios [118, 155, 90]. In Chapter 5, we focus on real-world data obtained
from people talking with a virtual agent in a semi-structured interview imitating a therapy session.
This is an inherently challenging domain due to the limited amount of real-world data with
relatively lower expressiveness and unstructured spoken dialogue.
2.8 Automated quality assessment of MI sessions
In MI, the therapist is focused on promoting behavior change by encouraging clients to verbalize
their desire for change (change talk). MI theory posits that change talk should have a linear,
positive slope over the course of the session as the therapist selectively evokes and reflects change
talk [93, 98]. This has been examined by dividing the session into equal parts, though how the
session is divided has differed in a variety of studies. In the first systematic examination of client
change language over the course of an MI session, Amrhein et al. [4] divide MI sessions into
ten deciles (each being 1/10 of the session’s length), and this approach has been replicated in
later work [2, 5, 102, 17]. These studies also showed that the amount of change talk increased
quadratically over the session, suggesting that fewer divisions can capture the same fluctuations of
client change language, with later work using session quintiles (1/5 of the session’s length) [64].
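For illustration, a session can be divided into such equal parts with a few lines of code; the snippet below is a sketch on a hypothetical list of utterances and is not the exact procedure of the cited studies.

```python
import numpy as np

# Hypothetical list of coded client utterances from one MI session.
session_utterances = [f"utterance_{i}" for i in range(137)]

# Decile-style and quintile-style divisions of the session;
# np.array_split tolerates lengths that are not exact multiples of the divisor.
deciles = np.array_split(session_utterances, 10)
quintiles = np.array_split(session_utterances, 5)

print([len(d) for d in deciles])    # [14, 14, 14, 14, 14, 14, 14, 13, 13, 13]
print([len(q) for q in quintiles])  # [28, 28, 27, 27, 27]
```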
Quality assessment of therapy sessions can provide valuable insights into how a competent
therapist operates, and what kind of therapist-client interactions are productive. Researchers
have explored approaches to build automatic systems for quality assessment, for different types
of therapy. For example, Xiao et al. [156] trained a model to predict empathy levels (high vs.
low) using n-gram language model features from manual and automatically recognized speech
transcripts in MI sessions, with encouraging results regarding human rating correlation (0.65). In
[27], the authors extended these approaches, by integrating the language model features into a
Hidden Markov Model (HMM) in order to capture the dynamic interactions between utterances
in the MI sessions. They show that the dynamic model improved on the accuracy of empathy
level prediction compared to a static model as in [159]. To leverage the semantic aspects beyond
word counts (n-grams), researchers have also used Linguistic Inquiry and Word Count (LIWC)
[111] features to predict empathy levels. Lord et al. [82] used LIWC features to compute language
style synchrony between clients and therapists, finding that higher empathy ratings are correlated
with higher synchrony, controlling for therapist reflections. Similarly, Gibson et al. [56] found
that these psychologically-motivated LIWC features carry complementary information to standard
n-gram features in predicting therapist empathy. This is likely because LIWC features were also
found useful in distinguishing change vs. sustain talk in client language [144, 7, 143].
Researchers have also attempted to automatically code session behavior (MI codes) as an
intermediate step for predicting session-level quality metrics. For example, Can et al. [23] used
Conditional Random Fields (CRF) in a sequence tagging framework to predict MI codes based
on speech, which are then used in estimating session quality measures like empathy. Leveraging
the advances in neural network models, a series of more recent works has focused on using word embeddings, from non-contextualized representations such as GloVe [113] and word2vec [91] to
newer large contextualized models such as BERT [45], for natural language understanding. For
example, Gibson et al. [55] modeled MI sessions using a recurrent neural network (RNN) applied
to word2vec embeddings in order to obtain utterance-level representations, which are then used to
predict empathy levels. In [52], the authors used GloVe embeddings as well as LIWC features
to estimate Cognitive Behavioral Therapy (CBT) session quality as measured by the Cognitive
Therapy Rating Scale (CTRS) scores, finding that therapist-related language features have more
predictive power than client language. In their follow-up study, Flemotomos et al. [51] expanded
their analyses by incorporating highly contextualized representations, i.e., by using BERT-based
embedding in their classifiers and achieving consistent performance improvements for session
quality assessment over simple n-gram features.
All of these empirical studies use entire therapy sessions for predicting therapy quality metrics. However, clinicians are often interested in behaviors throughout the session, as they relate to the topics of discussion, in order to facilitate a more fine-grained understanding. Our work in Chapter 6 addresses this limitation by investigating how language from temporal segments of a session can be used to estimate empathy. This is done by training a regression model using expert-annotated
empathy scores for MI therapy sessions as ground-truth labels.
Chapter 3
MI Intent Recognition
3.1 Motivation
Client in-session behaviors provide meaningful insights into the client’s attitude toward behavior change,
which in turn has an important effect on the possibility of post-session behavior change. These
behaviors can be expressive language like “I really need to cut drinking” (willingness to change),
or can be implied indirectly like “I am not drinking any more than my friends” (resistance to
change). They can also appear through non-verbal communication, like speech prosody, which can convey the presence or absence of sincerity in expressing that attitude. MISC coding [95] provides granular behavioral coding for both the client and the therapist at the utterance level, and its effective use in the assessment of MI has long been established in psychotherapy research [104].
On the other hand, manual coding of utterance-level intent requires trained individuals to make
multiple passes over each session, which is highly costly and time-consuming.
Toward building a more efficient pipeline for the assessment of MI sessions, we build an automated
coding system for client language. We do so by developing a neural network architecture that
takes as input the anchor utterance (whose intent we are predicting), in addition to the preceding
local history consisting of both therapist and client utterances. We formulate the problem as a 3-class classification of client utterances with respect to their attitude toward behavior change: (i) Change Talk (CT), signaling willingness to change; (ii) Sustain Talk (ST), indicating a desire not to change or to preserve the status quo; and (iii) Follow/Neutral (FN), language unrelated to change (e.g., commenting on politics).
For automated coding of client MI codes, we perform two sets of experiments, which we describe in two sections. In Section 3.3.3, we describe our approach to 1) automated coding of client utterances using feed-forward networks; 2) an interpretable understanding of the differences in client language across different MI codes using psychologically-relevant features; and 3) exploring the possibility of using in-session data for behavioral outcome prediction. In Section 3.3.4, we build on the findings from the experiments described in Section 3.3.3 to 1) build improved models for code prediction using history context and Recurrent Neural Networks (RNNs); and 2) perform statistical post-analysis of our model predictions to gain an interpretable understanding of the model performance and the patterns that our model has captured or failed to capture.
3.2 Dataset
In this work, we utilized a clinical dataset [18, 25], namely the PRIME dataset, consisting of real-world motivational interviewing sessions with college students with alcohol-related issues. The
dataset consists of audio recordings, manual transcriptions, and MISC codes. The study has been
approved by the relevant IRBs, and the data collection has been performed with the consent of the
participating volunteers. The transcriptions include the sessions’ metadata, including manual MISC codes for both the therapist and client utterances, along with the speaker tag (client/therapist) for each utterance. Table 3.1 shows an example segment of a client-therapist dialogue.
Table 3.1: Example dialogue excerpt from the dataset
Speaker Transcript MISC Code
Therapist I mean, it sounds to me like, when you tell me that your average is like five to ten drinks, so that’s already a heavy-drinking episode. Complex Reflection
Client Yeah, yeah. I guess that contributes to heavy-drinking. Follow/Neutral
Client But like, and like I’m sure you hear this all the time—but like, I have a couple friends that are like way past me, and they drink a lot more, so yeah I wouldn’t consider myself like a heavy drinker. Sustain Talk
Client And I realize like according to this, I am, but I do realize it’s bad Change Talk
Therapist So you’re comparing yourself to the people you’re around. Simple Reflection
Client Yeah, right, like my friends that are like most like me. Follow/Neutral
Using this dataset, we have access to real-world sessions from 219 individual clients with 12
unique therapists. The clients involved in this dataset have an average age of 18.8 years with a
40:60 female to male ratio. The dataset consists of a total of 41,494 client utterances and 51,802
therapist utterances. In this work, we primarily focus on the classification of client utterances into
three main categories of MI codes, while taking into account the preceding utterances from both the
therapist and client as part of the history context. A subset of the sessions also includes behavioral
measures related to the therapy’s desired outcome, reduced alcohol consumption. For each session,
we have two behavioral measures: Change in Typical Blood Alcohol Content (CTBAC), and
Change in Alcohol-Related Problems (CPROB). Blood Alcohol Content and the Alcohol-Related Problems inventory were administered during the MI session and at a 6-month follow-up. The
change of these measures over the course of this 6-month period is used in our analysis as the
behavioral outcome measure. A positive value of CTBAC or CPROB indicates an increase in blood alcohol content or alcohol-related problems, respectively, and therefore an undesirable outcome, and vice versa. Additionally, zero change in both of these measures is also considered an undesired outcome, since it suggests the therapy session has not been fully effective. Based on the values of the CTBAC and CPROB measures, we divide the sessions into two main categories in terms of outcome: desired or undesired. Out of the 219 sessions, 166 have behavioral outcomes. Tables 3.2 and 3.3 show the distribution of the data across the MI codes and behavioral outcomes, which indicates an imbalanced dataset in both cases.
Table 3.2: Client MI codes class distribution.
Sustain Talk Follow/Neutral Change Talk
0.13 0.59 0.28
Table 3.3: Behavioral outcomes class distribution
Undesired Desired
Blood Alcohol Content 0.55 0.45
Alcohol-Related Problems 0.70 0.30
3.3 Approach
3.3.1 Input Features and Embeddings
We utilized different types of features and embeddings for performing statistical analysis and
prediction.
Interpretable LIWC features: The Linguistic Inquiry and Word Count (LIWC) is a dictionary-
based tool that assigns scores to documents in psychologically meaningful categories including
social, affective and cognitive processes [111]. We used LIWC for our statistical analysis due
to its interpretability and for the purpose of identifying important textual features in separating
utterances with different MI codes.
Pre-trained BERT Embeddings: For text representation in our classification models in Section 3.3.3, we utilize embeddings from the pretrained language model Bidirectional Encoder Representations from Transformers (BERT) [44]. BERT is a self-supervised language representation model, pre-trained on large corpora of text, and it has provided significant advancements on different tasks in Natural Language Processing (NLP), including text classification. We therefore use BERT to
take advantage of its powerful pre-trained representations. We extract BERT embeddings (using
bert-base-uncased) per utterance, for both clients and therapists, and obtain 768-dimensional
representational vectors.
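A minimal sketch of this extraction step is shown below. The use of the HuggingFace transformers library and of the [CLS] token state as the 768-d utterance vector are assumptions made for illustration, since the text does not specify the toolkit or the pooling strategy.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def utterance_embedding(text: str) -> torch.Tensor:
    """Return a 768-d vector for one utterance (here: the [CLS] token state)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[0, 0]  # hidden state at the [CLS] position

vec = utterance_embedding("I really need to cut down on my drinking.")
print(vec.shape)  # torch.Size([768])
```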
Pre-trained RoBERTa Embeddings: In the follow-up work described in Section 3.3.4, we propose an improved model while also utilizing a more powerful encoder, a successor of BERT with a more advanced pre-training approach and more pre-training data, called the Robustly Optimized BERT pre-training Approach (RoBERTa) [81]. RoBERTa differs from BERT in several aspects: removal of the Next Sentence Prediction objective, introduction of dynamic masking, and pre-training on a larger dataset with larger mini-batches and longer sequences. These changes improve the representations on our data, especially since dialogue utterances in psychotherapy can consist of very long sequences. Our preliminary experiments in Section 3.3.4, fine-tuning both BERT and RoBERTa on our task, showed that RoBERTa performed better.
3.3.2 Statistical Analysis on MI Codes
With the goal of identifying the language differences between client utterances categorized by
different MISC codes, we used LIWC [111] interpretable features for analyzing the in-session
client language. To that end, we executed a hierarchical Analysis of Variance (ANOV A) across the
LIWC features, with MI codes being nested under subjects. We used the F-statistic for statistical
significance and reported the most significant features in Table 3.4. All p-values are very close
to zero (< 10⁻⁴) and are therefore not reported. The significant results in this test indicate that the
mean of a feature is significantly different in at least one out of the three classes.
Among the significant features obtained by LIWC are “informal” terms, including “assent”, i.e., words like “agree”, “yes” and “ok”, which are highly frequent in utterances categorized as follow/neutral. Additionally, “words per sentence” tends to be lower for follow/neutral instances compared to change talk and sustain talk, since follow/neutral utterances include short utterances like “ok”, “yes/no” or backchannels. These features, including “informal”, “assent” and “words per sentence”, are associated mostly with follow/neutral, which is likely the reason they are identified as strong discriminants.
Table 3.4: Most significant LIWC features across client MISC codes
Feature F-Statistic
Informal 7.880
Function 7.215
Assent 7.160
Words per sentence 6.966
Analytic 6.443
3.3.3 Feedforward MI Code Prediction
We examine the model’s prediction performance using two sets of data: 1) Taking only the client’s
utterance and 2) Using the client’s current utterance and the history context from preceding client
and therapist utterances.
Client Utterances: The BERT encoder first maps the client utterances to fixed-size vector
embeddings, which are passed to the classification layer for a single-utterance code prediction.
Client Utterance and History Context: Three client-therapist dialogue turn changes preceding
the current client utterance are extracted to represent the history context. The size of the history
window is selected based on empirical analyses. The contextual client and therapist utterances are
encoded separately to account for the inherent differences of the used language based on their roles
and are fed into separate linear layers. The current client utterance is also encoded using a linear layer of equal size. The current utterance and the contextual utterances are thus encoded into three fixed-size representation vectors. These vectors are concatenated and passed through a hidden
linear layer, before being fed to the final classification layer. The described network architecture is
shown in Figure 3.1.
Experimental Setup: As described in Section 3.2, our dataset consists of 219 real-world MI
sessions with 219 different clients. We extract the client utterances as our data points, with their
corresponding manually-coded MISC codes. The datasets of client utterances amount to a size
of 41,494 data points, on which we perform a 3-class classification, with a one-subject-out cross-
validation. The dataset is imbalanced between the three classes, with ’Sustain Talk’ as the minority class.
Figure 3.1: Feed-forward network architecture for client code classification
To handle the data imbalance, a cross-entropy loss is used with a weight vector, where the
weight of each class is inversely proportional to its frequency. The weights are learned from the
train set within each fold, where 10% of the training data is held out as the validation set. The
evaluation results for the 3-class classification of MISC codes are computed using F1-score and
the model with the best performance on the validation set is selected for each fold. We optimize
the network using Adam, with a batch size of 256 and a learning rate of 10^-3.
We use pre-trained BERT embeddings, which consist of 768-d vectors. We utilized a linear
encoder network mapping the input feature space to a 256-d embedding space. The 256-d
representational vectors are then fed to the classification layer.
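The following PyTorch sketch conveys the overall shape of this feedforward model and the class-weighted loss; the layer names, activation choices, and the exact wiring are illustrative assumptions rather than the verbatim implementation.

```python
import torch
import torch.nn as nn

class ContextualMICodeClassifier(nn.Module):
    """Sketch: separate linear encoders for the current client utterance and the
    client/therapist history context (768-d BERT vectors), concatenation, a hidden
    fusion layer, and a 3-class output layer."""
    def __init__(self, in_dim=768, hid_dim=256, n_classes=3):
        super().__init__()
        self.enc_current = nn.Linear(in_dim, hid_dim)
        self.enc_client_ctx = nn.Linear(in_dim, hid_dim)
        self.enc_therapist_ctx = nn.Linear(in_dim, hid_dim)
        self.fusion = nn.Linear(3 * hid_dim, hid_dim)
        self.out = nn.Linear(hid_dim, n_classes)

    def forward(self, current, client_ctx, therapist_ctx):
        h = torch.cat([
            torch.relu(self.enc_current(current)),
            torch.relu(self.enc_client_ctx(client_ctx)),
            torch.relu(self.enc_therapist_ctx(therapist_ctx)),
        ], dim=-1)
        return self.out(torch.relu(self.fusion(h)))

# Class-weighted cross-entropy with weights inversely proportional to class frequency;
# the frequencies below are placeholders computed from the training fold.
class_freqs = torch.tensor([0.13, 0.59, 0.28])  # ST, FN, CT
weights = 1.0 / class_freqs
criterion = nn.CrossEntropyLoss(weight=weights / weights.sum())
```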
For prediction of our two target behavioral outcomes, we use the same data on the session
level. We take the subjects with either positive or negative behavioral change (those with no
behavioral changes are also categorized as a negative outcome). We have 166 subjects with
Change in Alcohol-Related Problem labels and Change in Blood Alcohol Level, which is the data
our behavior prediction models are trained on. We extract the client utterances from each session
as a sequence, removing the utterances from the therapist. We approach this 2-class classification
of behavioral changes following different perspectives: First, to identify whether data pertaining
to specific MISC codes have higher prediction power and therefore, looking at sequences of data
pertaining to change talk, sustain talk and follow/neutral codes individually; and second, to use the
client data from the session holistically for the prediction. Similar to the models used in MI code
prediction, the outcome prediction models also consist of 256-d encoding and fusion layers.
3.3.4 Recurrent MI Code Prediction
In this section, we improve on our classification model from the previous section, for the prediction
of client MI codes. Previously, we demonstrated the importance of conversation history when
classifying utterance intent. Our model encoded context from individual speakers separately to
account for role and speaker differences. However, when modeling the speakers independently, we
lose information on the dyadic aspects of the dialogue which is an essential factor in the progress
of the conversation and therefore client intent. To better encode the dyadic aspects of the dialogue,
rather than encoding each speaker’s history context separately, we encode the history window
from both speakers as a whole. We take the history window including both speakers, along with
the final utterance (whose MI code we are predicting) as input to our recurrent model.
The history window, similar to the previous sections, consists of a total of 3 turn changes
across speakers, where each turn consists of one or more consecutive utterances per speaker. In
the beginning of the session, where the history context is shorter than the specified threshold, the
context history consists of those limited preceding utterances. The size of the context window
was selected empirically among 3, 4 or 5 turn changes. The history window was extracted by
focusing on number of turn changes in order to have at least a certain amount of dyadic interaction,
and not be limited by instances of only one speaker talking consecutively. Our input samples
contain between 6 and 28 utterances depending on the dynamic of the dialogue, e.g. an example
input could be [T C T T T C C T C], where T denotes a therapist’s utterance and C denotes a
client’s. The motivation for using the entire window of context and final utterance is that our
recurrent neural network encoder output would carry more information from the final utterance
and closer context while retaining relevant information from the beginning of the window. We
also investigated encoding the current utterance separate from the history context using a separate
linear layer but did not see improvements in the classification results.
We use RoBERTa embeddings for obtaining utterance representations. For our RoBERTa
embeddings, each utterance representation is the concatenation of (1) CLS token (2) mean pooling
and (3) max-pooling of the tokens from the last hidden state. Additionally, we add a binary
dimension for each utterance to indicate the speaker. Figure 3.2 illustrates this process.
Figure 3.2: Recurrent network architecture for client code classification (the RoBERTa [CLS], max-pooled, and mean-pooled token embeddings, N × [1 × 2304], are concatenated with a speaker ID to give N × [1 × 2305] inputs to a GRU).
Similar to 3.3.3, we used LIWC features for the purpose of performing interpretable analysis,
this time on the classifier misclassifications. We also performed the MI code classification using
the LIWC features for comparison. Similar to the RoBERTa embedding inputs, we added the
binary dimension to the LIWC vectors to indicate the speaker. The history context representation
for both RoBERTa and LIWC is obtained by concatenating the utterance-level representation
vectors into a 2d matrix. These inputs are then fed into a unidirectional GRU, and the last hidden
state is used for the classification layer.
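A minimal sketch of this recurrent classifier is given below; the feature-building helper and class names are illustrative, but the input dimensionality (3 × 768 pooled RoBERTa features plus a speaker bit) follows the description above.

```python
import torch
import torch.nn as nn

def pooled_roberta_features(last_hidden_state, speaker_id):
    """Build one 2305-d utterance vector: [CLS] token, max-pool, and mean-pool of the
    token embeddings (3 x 768), plus a binary speaker indicator."""
    cls_vec = last_hidden_state[0]                    # (768,)
    max_vec = last_hidden_state.max(dim=0).values     # (768,)
    mean_vec = last_hidden_state.mean(dim=0)          # (768,)
    spk = torch.tensor([float(speaker_id)])           # 0 = therapist, 1 = client
    return torch.cat([cls_vec, max_vec, mean_vec, spk])

class RecurrentMICodeClassifier(nn.Module):
    """Sketch: a unidirectional GRU over the dyadic context window plus the current
    utterance; the last hidden state feeds a 3-class output layer."""
    def __init__(self, in_dim=2305, hid_dim=256, n_classes=3):
        super().__init__()
        self.gru = nn.GRU(in_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, n_classes)

    def forward(self, window):        # window: (batch, num_utterances, 2305)
        _, h_n = self.gru(window)
        return self.out(h_n[-1])      # logits over {ST, FN, CT}
```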
Experimental Setup: For training, we use a 5-fold subject-independent cross-validation. 10%
of the train data from each fold is randomly selected in a stratified fashion, and held out as the
validation set. We optimize the network using AdamW [83], with a learning rate of 10^-4 and
batch size of 32. We train our model for 25 epochs with early stopping after 10 epochs, and
select the model with the highest macro F1 on the validation set. To handle class imbalance, we
use a cross-entropy loss with a weight vector inversely proportional to the number of samples in
each class. The GRU hidden dimension is 256 and 32 when running on RoBERTa and LIWC
representations, respectively.
3.4 Results and Discussion
3.4.1 Feedforward Model
Table 3.5 demonstrates the classification results for estimation of MI codes. Our results demonstrate
that adding history context to our text model leads to statistically significant improvement compared
to using only the current client utterance. We compare our classification performance with the
results from the most relevant previous work using a similar dataset and problem formulation
[8]. They take a multimodal approach for a 3-class classification of client utterance codes. They
use pre-trained word embeddings GloVe (Global Vectors for Word Representation) and LIWC
for the text modality, and use COVAREP [40] for speech. They train logistic regression models using different combinations of the three feature sets, and show that using all three feature sets combined obtains the highest classification performance, although results from the different combinations show that speech makes only a minor improvement over the text-only model. Our models
outperform this baseline from previous work, reaching F1-score of 0.721 compared to the previous
0.566, by encoding historical context as well as the client utterance, while also taking advantage
of more advanced encoders like BERT.
Table 3.5: MI codes prediction results for three-class classification. Av-
erage micro F1-scores and their standard deviations (in parentheses) are
given.
Data/Model Micro F1-score
Utterance 0.701 (0.065)
Utterance + Context 0.721 (0.062)
Baseline (prior work) [8] 0.566
The model performance across the three classes is shown in Table 3.6. It can be seen that the
hardest class for the model to recognize is Sustain Talk, which could be partly due to its very low
frequency in the dataset.
Table 3.6: Precision and recall for MI code prediction
Precision Recall F1
Sustain Talk 0.43 0.53 0.47
Follow/Neutral 0.85 0.77 0.81
Change Talk 0.60 0.66 0.63
The confusion matrix from the code prediction can be found in Table 3.7. It can be seen that
the misclassification of Sustain Talk instances is mostly due to its confusion with Change Talk,
which is aligned with observations from previous work indicating that sustain talk and change talk
are more difficult to discriminate [8].
Table 3.8 shows the results from the binary outcome prediction models for both target behaviors.
The results are reported for individual MI codes as well as the entire sequence of client utterances
Table 3.7: Confusion matrix for MI code prediction
(ST: Sustain Talk, FN: Follow/Neutral, CT: Change Talk)
ST FN CT
ST 0.53 0.19 0.28
FN 0.08 0.77 0.15
CT 0.15 0.19 0.66
Table 3.8: Outcome prediction results for two-class classification. Ran-
dom baseline (mean F1-score over 1000 trials) and F1-scores over the
entire dataset are given. (CTBAC: Change in Typical Blood Alcohol
Content, CPROB: Change in Alcohol-Related Problems)
Data F1-score
CTBAC CPROB
CT 0.517 0.494
FN 0.507 0.491
ST 0.545 0.416
ALL 0.495 0.535
(referred to as ’all’), for assessing their influence on the behavioral outcome. We observe that
using the entire sequence provides the best prediction for CPROB (Change in alcohol-related
PROBlems), while using only the ST (Sustain Talk) utterances has the best overall performance
for CTBAC (Change in Typical Blood Alcohol Content). Nonetheless, there is no consistent
pattern of one talk type significantly outperforming others in prediction. Prior knowledge from the
literature shows that ST (Sustain Talk) can be used as an important predictor of behavioral outcome,
whereas the same finding has not held true for CT (Change Talk) [6]. We see the same results for
one of the behavioral outcomes (CTBAC) but not consistently for both. Further investigation is
needed to evaluate the predictive power of each MI code with the outcome. Overall, the outcome
prediction results show marginal improvement over the chance baseline, which validates the
challenge associated with predicting human behavior by accessing only a small interaction window.
Moreover, the post-session behavioral assessment was performed at a 6-month follow-up, which makes the problem of behavior prediction even more challenging.
3.4.2 Recurrent Model
In this section, we compare results from our follow-up approach on client intent recognition with
the previous feedforward model described in 3.3.3. The classification results are shown in Table 3.9.
The model trained on RoBERTa embeddings outperforms the model trained on LIWC features, as expected given their stronger representation power, and also beats our previous results (considered as the baseline), reaching F1-macro = 0.66. The baseline results are obtained from our previously discussed
model under this work’s evaluation setting, demonstrating that our follow-up approach obtains
significant improvement over the previous model. Improved results over the baseline model are
likely due to the following: 1) The previous linear model encodes the client and therapist utterances
from the context history separately, therefore potentially missing information from the dyadic
interaction. 2) The RNN in our current model temporally encodes the dyadic interaction window.
3) Using RoBERTa embeddings improved over BERT embeddings, as RoBERTa was trained on
larger datasets and on longer sequences, making them more powerful representations.
Table 3.9: F1-Score Classification Results
Class LIWC RoBERTa Baseline
ST 0.41 0.50 0.46
FN 0.78 0.84 0.81
CT 0.56 0.64 0.63
All (macro) 0.58 0.66 0.63
All (micro) 0.65 0.74 0.71
The results from other work on classifying client codes in MI range from F1-macro=0.44 [23]
to F1-macro=0.54 [24] on different datasets. The authors of [7], who used a dataset similar to ours, reached F1=0.57. Huang et al. [66] obtained F1-macro=0.70 by using (ground-truth) labels from
prior utterances as the model input and domain adaptation for theme shifts throughout the session.
3.4.2.1 Error Analysis
Figure 3.3 shows the confusion matrices from classification results by the model using LIWC
features vs. RoBERTa embeddings. Comparing between classes, Sustain Talk gets misclassified
about equally as Follow/Neutral and Change Talk by RoBERTa but it is much more often mis-
classified as Change Talk by LIWC. On the other hand, Change Talk is more often misclassified
as Follow/Neutral by RoBERTa, but misclassified as Sustain Talk by LIWC. Of the wrongly
classified utterances by LIWC, 47% were correctly classified by RoBERTa. Of the RoBERTa
misclassifications (11k utterances), about 30% were correctly classified by LIWC. Some examples
of these cases are presented in Figure 3.4, which seem to be associated with certain key words
related to salient features (Section 3.4.2.2). When both RoBERTa and LIWC misclassified, they
gave the same wrong prediction on 70% of those utterances. Some anecdotal examples of such
cases are shown in Figure 3.5, and most seem to be highly context-dependent, suggesting that
better modeling of context would potentially be useful. We also experimented with a simple
concatenation of RoBERTa and LIWC features, but did not find significant improvements over the
RoBERTa-only model.
3.4.2.2 Salient Features
Using statistical analysis, we aimed to investigate the interpretable LIWC features for instances
where the RoBERTa model has made erroneous classifications to identify features the RoBERTa
representations may have failed to capture. To that end, we used hierarchical Analysis of Variance
(ANOVA), with talk types nested under sessions to account for individual differences. We looked
into samples where RoBERTa representations might be limited (i.e., misclassified), while LIWC
features were correct in the classification. Using ANOV A, we found the most prominent features
Figure 3.3: Confusion matrices of classification results by LIWC (left) and RoBERTa (right) features (normalized by true labels).
Figure 3.4: Example dialogue with correct and incorrect classifications.
red (true→ predicted) denotes misclassification by RoBERTa but correctly
classified by LIWC; blue (true label) denotes correct classification by both
models.
Speaker Utterance
Therapist What varies your drinking?
Client Money, (CT→ ST)
Client if I have work to do I won’t drink. (CT)
Therapist Okay.
... ...
Client Anxious thing is kinda like I don’t have control, like I, I’m shaky and stuff like that. (CT)
Therapist Ok. Is your heart racing faster or, and, and that type of thing?
Client No, it’s not really anxious, it’s kinda just like a ... (CT→ ST)
Therapist It’s more shaky?
Client It’s like agitated, kind of. (CT→ ST)
Figure 3.5: Example dialog with correct and incorrect classifications.
blue (true label) denotes correct classification by our models, red (true →
predicted) denotes misclassification by both models.
Speaker Utterance
Therapist Oh, ok, so the summer you usually drink a little more
Client Yeah. (FN)
Therapist and then when you get to school, it’s...
Client Kinda cut down a little bit. (CT)
Therapist I see, because of like, school and classes and stuff.
Client Yeah. (CT→ FN)
Therapist And working on the weekends.
Client Yeah. (CT→ FN)
in such samples across the 3 classes: ‘swear’ (6.06), ‘money’ (5.29), ‘anger’ (2.24), ‘death’
(2.19), and ‘affiliation’ (2.00), where numbers in parentheses denote F-statistic from hierarchical
ANOVA. This is consistent with our error analysis shown in Figure 3.4. The mean scores of the
‘swear,’ ‘money,’ and ‘anger’ categories are higher for Change Talk compared to other classes. We
hypothesize that ‘swear’ and ‘anger’ in Change Talk may represent anger toward oneself regarding
drinking behavior. Words in the ‘money’ category might be related to the high cost of alcohol
(especially with college-age clients), which can be motivation for behavior change. The Change
Talk samples misclassified by the RoBERTa model may indicate the model’s failure to capture
such patterns for this specific task.
3.5 Summary of Findings
In this chapter, we explored the task of MI code (intent) classification for client utterances in
Motivational Interviewing. Our experiments showed the importance of incorporating dialogue
history for identifying client intent. In addition, our experiments showed that modeling the dialogue
history in a way to preserve the dynamic of the conversation between speakers is important for
reaching better performance in intent recognition.
We also explored the possibility of predicting behavioral outcomes using in-session client
data. Our experiment showed above-chance performance and did not lead to consistent patterns
when comparing the predictability of behavioral outcomes when using utterances with different
MI codes/intents. Our results further emphasized the challenge of predicting long-term human
behavior change, especially given a limited window of interaction. Utilizing more longitudinal
data, along with contextual metadata, may provide more promising results for modeling behavior
change.
We used statistical analysis and interpretable psychologically-relevant features to identify
salient features across client MI codes and further performed error analysis on our model’s
results. Our results showed that the most salient features across classes were those distinguishing
Follow/Neutral (FN) from other classes. Additionally, error analysis shows that more advanced
context encoding can further improve the model performance for intent classification. Through
error analysis, we also identified LIWC categories that the language model RoBERTa may be
failing to capture for this specific task.
With this work [143, 145], we aimed to develop systems for enhancing effective modeling
and assessment of client intent in MI, which can potentially generalize to other types of therapy.
Identifying patterns of change language can facilitate MI strategies that will assist clinicians with
treatment while providing efficient means for training new therapists. These steps contribute to the
long-term goal of providing cost- and time-effective evaluation of treatment fidelity, education of
new therapists, and ultimately broadening access to lower-cost clinical resources for the general
population.
Chapter 4
Speaker Turn Change Modeling
4.1 Motivation
Following the direction of MI code prediction in the previous chapter, which focused on client utterance intent in MI sessions, in this section we propose a new approach for encoding speaker turn changes in dialogue. We model dialogue turn changes across speakers in a simple and generalizable way that can be applied to different dyadic or multi-party dialogues regardless of the number of speakers, the existence of distinct speaker roles, the interaction scenario, etc. Although this is a general approach, the model would be able to encode
role-specific language and/or individual characteristics specifically for dyadic dialogue. Therefore
the proposed approach would be of special importance in types of dialogue where there is a clear
distinction between roles of the speakers, like therapy.
Intents, which signify the primary goal of the user at the utterance level, are mainly domain-dependent; hence, each goal-oriented dialogue system has its own set of intents. Dialogue acts, on the other hand, refer to the role or function each utterance plays in a conversation at the level of illocutionary force, and therefore constitute the basic units of linguistic communication. Dialogue acts include things like making a request, asking a question, or giving a backchannel
response [134]. To evaluate our model performance against state-of-the-art natural language
architectures, we train and evaluate our approach on the more general task of Dialogue Act (DA)
classification, on commonly used DA classification benchmark datasets.
Dialogue Act (DA) classification is the task of classifying utterances with respect to the
function they serve in a dialogue. DA classification is of critical importance in NLP, as it underlies
various tasks such as dialogue generation [78] and intent recognition [63], thus providing effective
means for domains like dialogue systems [62], talking avatars [161] and therapy [157, 145, 143].
Recent studies of DA classification have leveraged deep learning techniques, reaching promis-
ing results. Generally, these methods utilize hierarchical Recurrent Neural Networks (RNNs) to
model structural information between utterances, words, and characters [122, 79, 152, 30, 74, 19].
However, most of these approaches treat spoken dialogue similarly to written text, thereby neglecting
to explicitly model turn-taking across different speakers. Inherently, computational understanding
of dialogue, which has been generated by multiple parties with different goals and idiosyncrasies
in an interactive and uncontrolled environment [31], requires modeling turn-taking behavior and
temporal dynamics of a conversation. For instance, in a dyadic conversation, given an utterance
with dialogue act “Question” from speaker A, if the following utterance is from speaker B, then
the corresponding act is likely to be “Answer”; however, if there is no change in speakers, then the
following act is less likely to be “Answer.” Therefore, modeling turn changes in conversations is
essential.
In this regard, we aim to incorporate the speaker turns into utterance encodings. More
specifically, we model speaker turns in conversations and introduce two speaker turn embeddings
that are combined with the utterance embeddings. We further describe the approach and results in
the following sections.
4.2 Dataset
We evaluate the performance of our model on three public datasets: the Switchboard Dialogue Act
Corpus (SwDA) [72, 136, 139], the Meeting Recorder Dialogue Act Corpus (MRDA) [135], and
the DailyDialog corpus (DyDA) [80]. SwDA (https://github.com/cgpotts/swda) contains dyadic telephone conversations labeled with 43 DA classes; the conversations are assigned to 66 manually-defined topics. MRDA (https://github.com/NathanDuran/MRDA-Corpus) consists of multi-party meeting conversations and 5 DA classes. The DyDA corpus (http://yanran.li/dailydialog) consists of human-written daily dyadic conversations labeled with 4 DA classes; the conversations are assigned to 10 topics. For SwDA and MRDA, we use the train, validation and test splits following [76]. For DyDA, we use its original splits [80]. Specs of the three datasets are summarized in Table 4.1.
Table 4.1: Specs of the three DA classification benchmark datasets. |C| denotes the number of DA classes; |P| denotes the number of parties; Train/Val/Test denotes the number of conversations/utterances in the corresponding split.
Dataset |C| |P| Train Val Test
SwDA 43 2 1003/193K 112/20K 19/4.5K
MRDA 5 multiple 51/75K 11/15.3K 11/15K
DyDA 4 2 11K/87.1K 1K/8K 1K/7.7K
4.3 Approach
4.3.1 Utterance Modeling
We use the pretrained language model RoBERTa to encode utterances, which enables us to utilize
the powerful representations obtained from pretraining on large amounts of data. Given an utterance u, we take the embedding of the [CLS] token from the last layer as the utterance embedding, denoted as e(u).
4.3.2 Speaker Turn Modeling
Different from text written by a single author in a non-interactive environment, dialogues usually
involve multiple parties, in minimally-controlled environments where each speaker has their own
goals and speaking styles [31]. Therefore, it is critically important to model how speakers take
turns individually and inform the model when there is a speaker turn change. To this end, we
introduce two conversation-invariant speaker turn embeddings. These two embeddings are trained
across all speakers in the train set and are independent of any given conversation or speaker
pair. The two embeddings are learnable parameters during the optimization and have the same
size as the utterance embeddings, which are generated by a speaker turn embedding layer with
speaker labels as input. Note that in a dyadic conversation, speaker labels (0/1) naturally indicate
speaker turn changes. This idea is inspired by the positional encoding in Transformers [151],
where the authors introduce a positional embedding with the same size as the token embedding at
each position, and the positional embeddings are shared across different input sequences. For a
multi-party conversation corpus, because our goal is to model speaker turns instead of assigning a
different embedding to each speaker, we re-label the speakers and flip the speaker label (from 0 to
1 and vice versa) when there is a speaker turn change; for example, if the original speaker sequence is ⟨0, 0, 1, 2, 3, 3, 1⟩, after re-labeling it becomes ⟨0, 0, 1, 0, 1, 1, 0⟩, which can then be represented
by the two introduced speaker turn embeddings. This simplifies turn-change modeling, as the
number of speakers in different conversations can be different.
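The re-labeling step can be written as a small helper function; the name below is ours and merely illustrates the mapping described above.

```python
from typing import List

def to_turn_labels(speakers: List[int]) -> List[int]:
    """Map an arbitrary speaker sequence to binary turn labels: start at 0 and flip
    whenever the speaker changes."""
    labels, current = [], 0
    for i, spk in enumerate(speakers):
        if i > 0 and spk != speakers[i - 1]:
            current = 1 - current
        labels.append(current)
    return labels

print(to_turn_labels([0, 0, 1, 2, 3, 3, 1]))  # [0, 0, 1, 0, 1, 1, 0]
```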
Encoding speaker turns instead of individual speaker styles/characteristics provides the fol-
lowing advantages: 1) in datasets with many different speakers across relatively short dialogue
sessions, it is challenging to transfer the learned speaker representations across different sessions;
2) the simplicity of this mechanism makes it more scalable for multi-party dialogue sessions with
larger number of speakers.
To obtain the speaker-turn-aware utterance embedding g(u, s), given an utterance u and its binary speaker turn label s, the speaker turn embedding f(s) is added to the utterance embedding e(u), such that g(u, s) = e(u) + f(s), s ∈ {0, 1}. The idea of taking the sum is
also inspired by Transformers where they add the positional embeddings to token embeddings
for sequence representation [151]. We also considered the concatenation of the speaker turn
embedding and the utterance embedding, resulting in inferior performance compared to taking the
sum.
4.3.3 Conversational Context Modeling
Context plays an important role in modeling dialogue, which should be taken into account when
performing DA classification. Given a sequence of independently encoded speaker turn-aware utterance embeddings ⟨g(u_t, s_t)⟩, t = 1, ..., n, in a conversation C, we used a Bi-GRU [32] to inform each utterance of its context, such that ⟨q(u_t, s_t)⟩ = GRU(⟨g(u_t, s_t)⟩), where the q(u_t, s_t) are contextualized speaker turn-aware utterance embeddings taken from the hidden states of the Bi-GRU model. These embeddings are then fed into a fully connected layer for DA classification, which is optimized using a cross-entropy loss. Different from existing work [122, 79, 152, 30, 74, 19], we do not use a CRF layer in our method, because our experiments indicate that it brings modest performance gains at the expense of adding more complexity.
The overall framework of our model is shown in Figure 4.1.
Figure 4.1: The overall framework of our proposed method. In this toy example, the conversation consists of five utterances.
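Putting Sections 4.3.1 through 4.3.3 together, a minimal PyTorch sketch of the model could look as follows; the class name, hidden size, and output size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpeakerTurnDAClassifier(nn.Module):
    """Sketch: add a learned speaker turn embedding to each utterance embedding,
    contextualize the sequence with a Bi-GRU, and classify every utterance."""
    def __init__(self, emb_dim=768, hid_dim=256, n_classes=43):
        super().__init__()
        self.turn_embedding = nn.Embedding(2, emb_dim)   # two turn labels: 0 / 1
        self.bigru = nn.GRU(emb_dim, hid_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hid_dim, n_classes)

    def forward(self, utt_emb, turn_labels):
        # utt_emb: (batch, n_utts, emb_dim) RoBERTa [CLS] embeddings e(u_t)
        # turn_labels: (batch, n_utts) LongTensor of binary turn labels s_t
        g = utt_emb + self.turn_embedding(turn_labels)   # g(u_t, s_t) = e(u_t) + f(s_t)
        q, _ = self.bigru(g)                             # contextualized q(u_t, s_t)
        return self.classifier(q)                        # per-utterance DA logits
```

A standard cross-entropy loss over the (non-padded) utterances can then be used for optimization, as described above.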
4.3.4 Experiment Setup
On SwDA and DyDA, which are two dyadic conversation corpora, we use the original speaker
labels, since the turn-change labels are equivalent to speaker labels in dyadic settings. Because MRDA is a multi-party conversation corpus, we obtained its binary speaker turn change labels from the sequence of speaker labels as described in Section 4.3.2. On DyDA, because the
maximum length of conversations (number of utterances) is less than 50, we treat each conversation
as a data point and pad all conversations to the maximum length. However, conversations in SwDA
and MRDA are much lengthier (up to 500 in SwDA and 5,000 in MRDA); to avoid memory
overflow when training on a GPU, we slice the conversations into shorter fixed-length chunk sizes
of 128 and 350 for SwDA and MRDA respectively, as shown in Figure 4.2, where each chunk
would represent a data point.
Figure 4.2: A toy example of slicing a conversation of length 10 into 3 chunks of length of 4.
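The training-time chunking can be sketched as follows; what to pad with (a dummy utterance or a dedicated pad label) is an implementation detail assumed here for illustration.

```python
def chunk_conversation(utterances, chunk_size, pad_token=None):
    """Slice a long conversation into fixed-length chunks, padding the final chunk;
    each chunk is then treated as an independent training data point."""
    chunks = []
    for start in range(0, len(utterances), chunk_size):
        chunk = utterances[start:start + chunk_size]
        chunk = chunk + [pad_token] * (chunk_size - len(chunk))  # pad the tail chunk
        chunks.append(chunk)
    return chunks

# Toy example mirroring Figure 4.2: 10 utterances, chunk size 4 -> 3 chunks.
print(chunk_conversation([f"u{i}" for i in range(1, 11)], chunk_size=4))
```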
The slicing operation is only needed for training, not for validation or testing, because during training a computation graph is maintained, which consumes significantly more GPU memory. The maximum feasible chunk sizes without a CUDA memory overflow, on our machines with 11GB of GPU memory, are 300 (> 128) and 700 (> 350) on SwDA and MRDA, respectively.
We trained our model using Adam optimizer on 2 GTX 1080Ti GPUs. On SwDA and MRDA,
we use a batch size of 2; and on DyDA, the batch size is 10. All batch sizes are the maximum
before a memory overflow happens. On all three datasets, we use a learning rate of 1e-4, train the model for a maximum of 50 epochs, and report the test accuracy at the epoch where the best
validation accuracy is achieved. The running times for an epoch are ~20min, ~5min, and ~45min
across SwDA, MRDA and DyDA respectively.
Keeping all other hyperparameters fixed, we show the results of using different chunk sizes
on SwDA and MRDA in Table 4.2 and Table 4.3 respectively. On both datasets, with the chunk
size increasing from a small value, the performance improves, as more context information is available for the RNN to leverage. However, beyond a certain value, the performance deteriorates as the chunk size further increases, since vanishing and exploding gradients occur in the RNN and it forgets long-term dependencies. Therefore, we argue that slicing a long conversation into shorter chunks leads to better DA classification performance than taking the holistic conversation as input.
Table 4.2: The accuracies using different chunk sizes on SwDA.
chunk size 32 64 85 128 160 196 256 300
accuracy 82.9 82.7 82.8 83.2 82.7 83.0 82.9 82.3
Table 4.3: The accuracies using different chunk sizes on MRDA.
chunk size 85 175 350 700
accuracy 91.3 91.1 91.4 91.3
Baselines: We consider deep learning based approaches from literature as baselines including
DRLM-Cond [70], Bi-LSTM-CRF [74], CRF-ASN [30], ALDMN [152], SelfAtt-CRF [122],
SGNN [124], DAH-CRF-Manual [79], and Seq2Seq [37]. We report the results of DRLM-Cond
and Bi-LSTM-CRF on DyDA implemented by [79]. Our proposed speaker turn modeling is
usable in other embedding-based approaches to DA classification, but because none of the recently published works have made their code available, we do not implement the proposed speaker turn
modeling on top of the baselines.
For a fair comparison with DAH-CRF-Manual_conv [79], where manual conversation-level topic labels are used, we assign all utterances in a conversation the corresponding conversation topic label. To utilize the topic information, following the idea of the speaker turn embedding in Section 4.3.2, we introduce an embedding h(m) for each topic m and add it to the speaker turn-aware utterance embedding, such that l(u, s, m) = g(u, s) + h(m), where l(u, s, m) is the obtained speaker turn- and topic-aware utterance embedding.
Note that we do not compare our results to DAH-CRF-LDA_conv and DAH-CRF-LDA_utt [79], which are categorized as transductive learning because they utilize the data from training,
validation and test sets to perform LDA topic modeling and use the learned topic labels to supervise
the training process. In contrast, our method and all baselines are categorized as inductive learning,
which do not use supervision from the validation or test set. In addition, we do not compare to
Seq2Seq [37] on SwDA where they adopt a different test split from the one used in our method
and the baselines.
4.4 Results and Discussion
The results from our method and the baselines are shown in Table 4.4. Our method achieves
state-of-the-art results on SwDA and DyDA; on MRDA it achieves performance comparable to
the state-of-the-art. Notably, on SwDA and MRDA, comparing the proposed model (Ours) to the
model without speaker turn embeddings (Ours¬Speaker), we observe significant improvements in
performance, signifying the effectiveness of modeling speaker turns in dialogue representation.
On DyDA, the performance improves only slightly after applying speaker turn modeling; we argue that this is because conversations in DyDA have a consistent speaker turn change after each utterance, following the pattern ⟨0, 1, 0, 1, 0, 1⟩; such a pattern is more predictable, and therefore, modeling
speaker turns provides limited auxiliary information, from the perspective of information theory.
In addition, on DyDA, the model Ours¬Speaker outperforms the baselines, although this
is not observed on SwDA and MRDA. We hypothesize that this may be due to the fact
Table 4.4: Results of DA classification following three different ap-
proaches.
“Ours¬Speaker” represents our method without adding speaker turn em-
beddings; “Ours+Topic” represents the proposed method using speaker
turn and topic-aware embeddings for fair comparison to baselines utilizing
topic information. State-of-the-art results are highlighted in bold.
Dataset SwDA MRDA DyDA
DRLM-Cond 77.0 88.4 81.1
Bi-LSTM-CRF 79.2 90.9 83.6
CRF-ASN 80.8 91.4 -
ALDMN 81.5 - -
SelfAtt-CRF 82.9 91.1 -
SGNN 83.1 86.7 -
DAH-CRF-Manual 80.9 - 86.5
Seq2Seq - 91.6 -
Ours¬Speaker 82.4 90.7 86.8
Ours 83.2 91.4 86.9
Ours+Topic 82.4 - 87.5
that RoBERTa [81] is pretrained on a large corpus of written text, which makes it better suited for
processing the human-written conversations in DyDA, in comparison to the transcripts of telephone
conversations and meeting records in SwDA and MRDA. As a result, the generated utterance
embeddings are of higher quality, leading to the high performance of Ours¬Speaker on DyDA.
In terms of modeling topics, on DyDA, topic information significantly improves the classifica-
tion performance; in contrast, on SwDA, the performance suffers when utilizing topic information,
as can be observed from the comparison of Ours and Ours+Topic. Therefore, leveraging topic
labels does not consistently lead to performance improvement; in contrast, performance is consistently improved by encoding speaker turn changes on all three datasets.
4.5 Summary of Findings
In this work [61], we aimed to model speaker turn changes in dialogue to tackle the task of
DA classification. We introduced conversation-invariant speaker turn embeddings, which were
merged with the utterance encodings. This simple change helps the model encode information
regarding the speaker turn changes, and utilizes the information from the dialogue structure;
thereby distinguishing the dialogue encoding from the written-text encodings used in prior
literature. Experiments on three benchmark datasets demonstrate the effectiveness of our approach
on dialogue act classification. This simple and scalable module can be easily added to other
models to obtain significantly better results in dialogue act and intent recognition or more generally
encoding turn or speaker-related information in dialogue.
Chapter 5
Computational Modeling of Empathetic Opportunities
5.1 Motivation
Empathy is the “capacity” to share and understand another’s “state of mind” or emotions [68],
and it plays an important role in shaping rapport between interlocutors which is essential in
clinical settings for building doctor-patient relationships [107]. Empathy is highly important in the
context of mental health therapy sessions, where studies have shown that the therapist’s ability to
understand and relate to the patient’s experience may be an important component of building a
strong therapeutic alliance [1]. Empathy is therefore a key factor for successful therapy.
Prior works have studied the use of embodied interactive agents in assisting with running
clinical interviews with the purpose of probing indicators of mental health disorders. Studies have
shown that people are more comfortable disclosing to a virtual agent compared with a human
during clinical interviews, due to fear of judgement [85]. This makes intelligent agents and robots
great candidates for running automated clinical interviews for the purpose of mental health data
collection. Improved and efficient access to behavioral data, facilitated by virtual agents and robots, enables more efficient use of clinical resources and addresses the foundational need for data in designing systems that assist clinicians with diagnosis, treatment quality assessment, and related tasks. Expression of emotions by intelligent and embodied
machines can increase their believability by creating an illusion of life [11]. Recent studies have
shown that virtual agents’ expressions of empathetic emotions can improve users’ engagement [73],
task performance [110] and perception of the virtual agent [20, 115, 108]. Therefore expressions
of empathy are crucial for effective human-machine interactions in clinical interviews. To facilitate
such empathetic interactions, in this chapter, we focus on building models that use multimodal
client behaviors to identify opportunities for expressing empathy. We build our models by using
data from real-world clinical interviews run by a Wizard-of-Oz (WoZ) controlled virtual agent
probing indicators of mental health disorders like depression and Post-Traumatic-Stress-Disorder
(PTSD).
A large body of prior work has focused on multimodal recognition of sentiment and human
emotions from online videos or interactive experiences [117, 22, 21]. Existing work has made
notable progress towards sentiment recognition from vast online datasets. Nonetheless, despite
the increasing attention towards emotionally intelligent and empathetic interactive companions,
recognition of empathy has not been extensively explored due to the limited amount of data and the
complexity of defining ground-truth labels. Automatic recognition of empathy, although similar
to sentiment, requires different and more complex modeling. Recognition of opportunities for
empathetic responses must account for subjectivity as well as the intensity of the sentiment needed to elicit an empathetic response. The threshold for expressing empathetic responses can
vary from person to person and is also affected by interpersonal relationships and the context of
the conversation. “I am concerned about global warming.” and “I lost my mother to cancer.” are
expected to elicit different responses in terms of empathy.
In this work, building upon the work on multimodal sentiment analysis, we propose a multi-
modal machine learning framework for identifying opportunities for empathetic responses during
human-agent interactions. To this end, we analyzed interactions between an agent and a user during
a semi-structured interview probing symptoms of mental health disorders. During the interview,
the agent asks a set of questions, where each question is possibly followed by shorter follow-up
questions with respect to the user’s previous responses. Our developed model determines when the
agent needs to express empathy and with what polarity. We focus on the prediction of empathy in
a minimally-controlled environment with real-world users throughout the human-agent dialogue
interaction. The dataset and approach are described in the following sections.
The main contributions of this work include:
• An analysis of verbal and nonverbal behaviors prompting empathetic responses.
• Providing a machine learning framework for identifying empathetic opportunities in an
uncontrolled dyadic interaction with real-world users.
• An analysis of different strategies for creating ground-truth labels for empathetic responses.
5.2 Dataset
We use a portion of the Distress Analysis Interview Corpus-Wizard-of-Oz (DAIC-WOZ) for
training and evaluating our method. DAIC-WOZ is a subset of DAIC that contains semi-structured
interviews designed to support the assessment of psychological distress conditions such as depres-
sion and PTSD [57]. The interviews were collected as part of an effort to create a virtual agent
that conducts semi-structured interviews to identify verbal and nonverbal indicators of mental
illness.
Figure 5.1: A participant and the virtual agent, Ellie.
The subset of the corpus examined in this work includes the Wizard-of-Oz interviews conducted by a virtual agent controlled by two trained human wizards in a separate room. In this
two-wizard arrangement, one wizard controlled the agent’s verbal behavior while the other handled
her nonverbal behavior. The interview was structured to start with a set of general rapport-building
questions and continue to query potential symptoms of mental health such as quality of sleep.
In this setup, a fixed set of top-level questions were provided to the wizard to be asked during
the interview. In addition to asking the top-level questions, the wizard was provided with a finite
repertoire of response options to act as a good listener by providing back-channels, empathy and
continuation prompts [43] (see Figure 5.1).
Verbal and nonverbal behaviors of participants were captured by a front-facing camera and
head-worn microphone. In this work, we extract dialogue excerpts eliciting empathetic responses
from the sessions by looking at the agent’s expressions of empathy such as “I’m sorry to hear
that.” or “That sounds like a great situation.” Each instance consists of the participants’ verbal and
non-verbal (audiovisual) responses to each main question and the follow-up questions. Follow-up
questions such as “Can you tell me more about that?” were asked to elicit further disclosure and
encourage more elaborate responses. Example dialogue excerpts are shown in Table 5.1.
Table 5.1: Human-Agent dialogue excerpts with different empathy re-
sponses.
Dialogue Excerpt
Negative
A: How have you been feeling lately?
H: Um kind of uh I guess sorta sorta depressed generally
A: Tell me more about that
H: Uh just uh feeling tired and sluggish and um less less motivated and less interested in things
A: I’m sorry to hear that.
Positive
A: What are you most proud of in your
life?
H: Uh I’m proud that I’ve come a long
way from when I first moved out here
I’m uh a lot more disciplined um I read a
lot uh I do crosswords and I think I’ve I
think I know what’s important in life now
and I’m more focused and going after what
I want
A: That’s so good to hear.
None
A: What are somethings you wish you
could change about yourself?
H: Um I wish I could be taller I wish I
could be more inclined to play basketball
so I then become go to the NBA and be a
millionaire I know that’s all unrealistic but
just answering honestly.
Due to the nature of the predefined semi-structured interview, the dialogue turns take minimal
influence from the dialogue history and are therefore considered independently. The data is
segmented into small time windows consisting of the users’ transcribed text, video and audio that
have resulted in either positive, negative or no empathetic response from the virtual agent. Overall,
we had 2,185 data points (dialogue excerpts) extracted from conversations of 186 participants. The
average length of the dialogue excerpts was 30.6 seconds, while the average number of turns per
data point was 3.2 turns.
5.3 Approach
5.3.1 Multimodal Feature Extraction
Textual Features: For text input, we use BERT [45], which is pre-trained on large amounts of data, as our text embedding model, using only the participants’ utterances from the dialogue excerpts. We avoid using the agent’s utterances in the classification because of the unfair
advantage it may provide to the recognition model. We obtained a 768-d vector representation of
the transcribed text per data entry [160].
Audio Features: Two types of feature sets were extracted for the representation of speech
prosody: (i) the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) and (ii) Mel-
frequency cepstral coefficients (MFCC), extracted using OpenSMILE [50]. eGeMAPS provides a
set of acoustic features hand-selected by experts for their potential to detect affect in speech, and
has been widely used in literature due to their performance, as well as theoretical significance
[48]. This feature set consists of 23 features such as fundamental frequency and loudness. The MFCC features represent 13-band mel-frequency cepstral coefficients computed from 25ms audio frames. MFCCs and their first- and second-order derivatives were extracted [50, 49] to obtain a temporal T × 39 representation per data entry.
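For illustration, a roughly equivalent T × 39 representation can be computed with librosa as a stand-in for the OpenSMILE extraction used here; the frame and hop settings below are assumptions.

```python
import numpy as np
import librosa

def mfcc_with_deltas(wav_path, sr=16000, n_mfcc=13):
    """Sketch of the T x 39 audio representation: 13 MFCCs plus their first- and
    second-order deltas, computed over ~25ms frames with a 10ms hop."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
    d1 = librosa.feature.delta(mfcc)
    d2 = librosa.feature.delta(mfcc, order=2)
    return np.concatenate([mfcc, d1, d2], axis=0).T   # shape (T, 39)
```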
Visual Features: For the visual representation, we experimented with two different feature-
sets: (i) 17 action units and 6 head pose features were extracted per frame using OpenFace [10]
and (ii) face embedding obtained from a pre-trained ResNet model [60]. OpenFace is used to
extract the intensity of facial action units, representing 17 action units based on the Facial Action
Coding System (FACS) [46] along with head pose variations per frame, therefore providing a
T × 23 representation. For the face embedding, we extracted masked and aligned faces per
frame using OpenFace [9] and fed it to ResNet-50, a convolutional neural network pre-trained on
ImageNet [42], and extracted the representation from the penultimate layer, to obtain a T × 2048
representation.
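A sketch of the per-frame face embedding extraction with torchvision is shown below, assuming a recent torchvision and standard ImageNet preprocessing; the aligned face crops themselves come from OpenFace.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

# ImageNet-pretrained ResNet-50 with the classification head removed, so the model
# outputs the 2048-d penultimate-layer representation per face crop.
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
resnet.fc = nn.Identity()
resnet.eval()

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed_faces(pil_faces):
    """pil_faces: list of T aligned face crops (PIL images) -> (T, 2048) tensor."""
    batch = torch.stack([preprocess(img) for img in pil_faces])
    return resnet(batch)
```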
5.3.2 Ground-Truth Labels
Wizard judgments: We extracted the ground-truth labels from the empathetic and non-empathetic
responses of the human-controlled virtual agent. The agent’s responses are divided into three
classes: negative empathy, positive empathy or no empathy. Negative empathy responses include
utterances such as “That sounds really hard” and “I’m sorry to hear that”, positive empathy
includes utterances like “That’s so good to hear”, “That sounds like a great situation”, and no
empathy shows that the agent moved on to the next question or expressed fillers or back-channels
without sentiment. Utilizing the predefined repertoire of agent responses, we map the key phrases
to positive/negative/no empathy, and use them to extract ground-truth labels for the three classes.
Mechanical Turk Ratings: To validate the wizard-controlled agent’s empathetic responses, we
collected labels via Amazon Mechanical Turk (MTurk) for the collected dialogue excerpts. We
recruited five raters per instance (257 unique participants), all from the United States to avoid
language barriers. For each data point, the raters were given the text data, i.e., the dialogue sequence, and were asked to select the proper categorical response toward the user at the end of each
conversation. For further clarification, we provided example responses belonging to each category.
Each assignment consisted of 20 tasks (data points) plus two control questions (with obvious
responses) to eliminate raters that did not pay attention to the task and provided random answers.
One control question contained a devastating story about the participant’s mother passing away
while the other control question involved a very happy and inspiring story about the participant. We
repeated the experiment on data points that had wrong answers to either of the control questions to
obtain valid ratings. We additionally eliminated the instances where there was no majority vote
among raters (7% of the data).
The Fleiss’ kappa was calculated to measure inter-annotator agreement for the entire data
across five raters which showed fair agreement with κ = 0.33. A comparison between the majority
vote of the MTurk raters and the wizard’s responses shows 58% agreement. More analysis indicates
that the difference is mainly caused by MTurk raters annotating certain entries as either positive
or negative where there was, in fact, no empathetic response by the wizard. This is likely the
result of the raters looking at data entries independently and not as part of an entire dialogue. For
instance, the wizard may not have expressed empathy where it was fit to avoid redundancy of such
expressions throughout the interaction. The low inter-rater agreement from MTurk annotations
demonstrates the intrinsic complexity of the task, which speaks to the nature of empathy as a
social construct and the empathy level of the person expressing it. Furthermore, the task becomes
Table 5.2: Distribution of classes for two sets of labels
Negative Positive None
Wizard 20.6% 40.6% 38.8%
MTurk 24.9% 46.0% 29.1%
more difficult due to the individual differences among the annotators with respect to their own
personal experiences and self-identification with the user.
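For reference, the agreement statistic can be computed as in the sketch below, assuming a ratings matrix with one row per dialogue excerpt and one column per rater; the 0/1/2 label encoding is hypothetical.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# ratings: (n_items, n_raters) categorical labels, e.g. 0 = negative, 1 = positive, 2 = none.
ratings = np.array([[1, 1, 2, 1, 1],
                    [0, 0, 0, 2, 0],
                    [2, 2, 1, 2, 2]])

counts, _ = aggregate_raters(ratings)   # (n_items, n_categories) count table
print(fleiss_kappa(counts))             # Fleiss' kappa across the five raters
```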
Table 5.2 shows the distribution of data across the different classes. Throughout the experiments, we evaluate and report results for both sets of labels to account for the differences between them.
5.3.3 Behavior Analysis
To study the verbal and nonverbal indicators associated with instances of behavior that elicit
empathetic responses, we used interpretable features from each modality for investigating such
associations. For vision, we used facial action units, for speech, we opted for eGeMAPS features
and for language we used LIWC. Linguistic Inquiry and Word Count (LIWC) is a dictionary-based
tool that generates scores along different dimensions, including linguistic variables such as the number of conjunctions and pronouns, as well as affective and cognitive constructs [111].
After selecting a set of features, we ran a one-way analysis of variance (ANOVA) and visually inspected the box plots of significant results (p < 10^-5). The behavioral features that stood
out are shown in Figure 5.2. The sentiment of language, tone, positive (posemo) and negative
emotions (negemo) according to LIWC are strong indicators for recognizing empathetic response
opportunities. The language used in describing less pleasant situations is more formal which might
show that participants were less comfortable sharing them. Social processes including mentioning
family members were higher during the description of negative experiences, pointing toward
interpersonal issues. Cognitive processes (cogproc) which involve describing causation, certainty
and insight were lower for positive instances which demonstrates that the expressions of positive
experiences were in simpler language. We could not observe any visible differences among audio
features. Action units associated with positive expressions, AU06 (cheek raiser) and AU12 (lip corner puller), are strong indicators of positive empathetic instances. AU15, or lip corner depressor,
which is associated with sadness, also showed stronger activation during negative instances. This
demonstrates that visual features in addition to verbal behavior might be able to assist in the
recognition of sentiment for providing empathetic responses.
Figure 5.2: Box plots of verbal and nonverbal behavior with significant differences among different classes.
(Panels: LIWC tone, affect, posemo, negemo, social, cogproc, and informal scores, and facial action units AU06, AU12, and AU15, each plotted across the positive, negative, and none classes.)
Figure 5.3: Multimodal static fusion.
5.3.4 Models
Unimodal models: For every modality, an encoder maps its input representations to a fixed-size
vector or embedding. In unimodal classification, each of these encoders is then followed by a
softmax layer for three-class classification. Language information is encoded with instance-based
encoders. These encoders consist of a single fully connected (FC) layer of a fixed size. Sequences
of audio and visual features were fed to a single-layer GRU that maps the vision and speech
representations to a fixed-size embedding, keeping only the last state. The obtained representations
from unimodal encoders are followed by a softmax layer for classification. Additionally, we
developed a multimodal model that fused the aforementioned encoders, described below.
Static fusion: In this architecture, features from different modalities are initially passed through
unimodal encoders, and their resulting embeddings were concatenated and fed into a fully-
connected fusion layer followed by a softmax classifier. The structure of this static fusion network
is illustrated in Figure 5.3.
RNN Fusion: Similar to the static fusion model, the RNN fusion architecture initially produces
unimodal embeddings for each modality. However, in the case of vision and audio, with RNN
Figure 5.4: Multimodal RNN fusion.
encoders, the temporal embeddings learned through single-layer GRUs, are concatenated and
fed into an RNN fusion layer consisting of a single-layer GRU. The text embedding is then
concatenated with the output from the last state of the RNN fusion layer and fed to a single
fully-connected layer for final fusion (static). The output is finally passed through a softmax
classifier for three-class classification. The RNN fusion network structure is shown in Figure 5.4.
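As a rough sketch of the static fusion variant (feature dimensions follow Section 5.3.1 and the 128-d embedding size in Section 5.3.5; names and activation choices are illustrative assumptions):

```python
import torch
import torch.nn as nn

class StaticFusionClassifier(nn.Module):
    """Sketch: GRU encoders for temporal audio and video features, a linear encoder
    for the BERT text vector, concatenation, dropout, and a 3-class output."""
    def __init__(self, audio_dim=39, video_dim=2048, text_dim=768, emb_dim=128, n_classes=3):
        super().__init__()
        self.audio_gru = nn.GRU(audio_dim, emb_dim, batch_first=True)
        self.video_gru = nn.GRU(video_dim, emb_dim, batch_first=True)
        self.text_fc = nn.Linear(text_dim, emb_dim)
        self.dropout = nn.Dropout(0.2)
        self.out = nn.Linear(3 * emb_dim, n_classes)

    def forward(self, audio_seq, video_seq, text_vec):
        # audio_seq: (B, T_a, 39); video_seq: (B, T_v, 2048); text_vec: (B, 768)
        _, h_a = self.audio_gru(audio_seq)
        _, h_v = self.video_gru(video_seq)
        h_t = torch.relu(self.text_fc(text_vec))
        fused = torch.cat([h_a[-1], h_v[-1], h_t], dim=-1)
        return self.out(self.dropout(fused))   # logits; softmax applied in the loss
```

The RNN fusion variant described above additionally fuses the audio and video GRU outputs with a further GRU before concatenating with the text embedding.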
5.3.5 Experimental Setup
In this work, we evaluate our methods on a dataset of 2,185 instances of conversation excerpts
from 186 participants. Given the size of the dataset at hand, we opted for a simpler neural
network architecture that can capture the patterns associated with empathetic responses while
generalizing well. The model takes temporal audio and video input features per data entry and
a single representation vector for text. We discard all data shorter than 1.5 seconds and apply
random cropping of a 90-second window for long video and audio inputs (the average length of
the data is 90 seconds) during training. During evaluation, a middle segment with a max duration
of 90 seconds is extracted.
For each modality, we designed an encoder network mapping the input feature space to a
128-d embedding space. In both architectures, video and audio inputs are fed separately into two
1-layer GRUs to obtain individual embeddings for both modalities. Only for ResNet, due to the
higher dimensionality of the original space, we added a 128-d fully connected layer after GRU.
For textual data, the BERT vector representation is fed into a fully-connected layer to obtain a
compact representation, reducing the feature dimensions from 768 to 128. The embeddings from
all modalities are consistent across the two fusion networks. The two models employ different fusion architectures: (i) the static fusion model concatenates the three embeddings and feeds the multimodal representation vector to a fully-connected layer, with a dropout rate of 0.2, to obtain a final vector of size three containing the scores for the three classes, on which a softmax classifier performs the classification; (ii) the RNN fusion model first fuses the temporal video and audio sequences using a GRU of size 128 and then concatenates the bimodal representation with the text embedding. Similar to the static fusion network, the multimodal representation is fed to a fully-connected layer, with a dropout rate of 0.2, obtaining the final
probability vector on which a softmax classifier performs the classification. A cross-entropy loss is used in this setup, with a class-weight vector computed from the training set to account for the data imbalance, and the evaluation results are reported using the micro F1-score. A 10-fold cross-validation is used for training and evaluation on the dataset. We optimize the network using Adam, with
a batch size of 32 and a learning rate of 10^-4. 20% of training data is held out in each iteration
for validation, and the best-performing model on the validation set is selected. In the case of
multimodal models, the encoders and fusion layers are all trained jointly for 100 epochs.
Since there is no prior work whose results are directly comparable with our work, we compare
our results against a text-based sentiment analysis method, given the similarities between our
problem and classical sentiment analysis. For our text baseline, we use Valence Aware Dictionary
and Sentiment Reasoner (VADER) which is a lexicon and rule-based sentiment analysis tool [67].
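A minimal sketch of how such a VADER baseline can be run is shown below, using the publicly available vaderSentiment package; the mapping of the compound score to the three classes via the recommended 0.05 threshold follows the comparison discussed in the next section, and the helper name is ours.

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def vader_label(utterance: str, threshold: float = 0.05) -> str:
    """Map VADER's compound score to the three empathy-opportunity classes."""
    compound = analyzer.polarity_scores(utterance)["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "none"

# Example usage (hypothetical utterance):
print(vader_label("I was always feeling down and depressed."))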
5.4 Results and Discussion
To inform our design decisions for the multimodal networks, we initially trained and evaluated
unimodal classifiers using different feature sets. The results from unimodal classification, evaluated
by micro F1-scores are shown in Table 5.3.
Table 5.3: F1-scores for three-class classification.
Modality     Features/Models   MTurk   Wizard
Audio        MFCC              0.38    0.36
Audio        eGeMAPS           0.37    0.35
Video        AU+Pose           0.38    0.35
Video        ResNet            0.46    0.43
Text         BERT              0.64    0.61
Multimodal   Static Fusion     0.69    0.61
Multimodal   RNN Fusion        0.71    0.61
Baseline     VADER (text)      0.58    0.44
Unimodal classification results demonstrate the superiority of text in content representation and predictive power, outperforming the visual and audio modalities. This result is consistent with prior work on multimodal sentiment analysis [162, 116]; the gap is possibly widened by the real-world setting and the low expressiveness of this interactive scenario.
The multimodal networks are trained on the best-performing feature sets from each modality, i.e., ResNet for video, MFCCs for audio and BERT for language. The audio representations had low predictive power for both MFCC and eGeMAPS in unimodal classification, which may be the result of the audio recording quality. When training the models with MTurk annotations, the multimodal networks show an increase in performance using the RNN fusion model, which speaks to the temporal inter-dynamics of audio and
video captured by this network. The multimodal networks gain an overall advantage over the
textual unimodal network which is the highest performing unimodal classifier in this task (see
Table 5.3).
Figure 5.5: F1-scores of VADER sentiment analysis with different thresholds, for MTurk and wizard labels.
Our unimodal text classifier outperforms the text sentiment baseline. Using the recommended
threshold on compound sentiment score, i.e. 0.05 for VADER, a text-based sentiment analysis
achieves F1 = 0.58 for MTurk labels and F1 = 0.44 for wizard labels. We also tested the
sensitivity of the threshold value and found that the best possible results are only slightly different
(see Figure 5.5). Hence, our text-based method using BERT comfortably outperforms the VADER results, which further validates our approach.
The results demonstrate that model performance is higher when trained on MTurk labels, for both multimodal and unimodal classification. The aggregate of labels from five annotators provides higher reliability and potentially lower between-person variability.
Table 5.4: Column-normalized confusion matrices for the RNN fusion model (rows: ground-truth labels; columns: predictions).
MTurk labels:
           Pred. Negative   Pred. Positive   Pred. None
Negative   72.12%           4.15%            11.43%
Positive   9.85%            78.44%           30.60%
None       18.03%           17.41%           57.97%
Wizard labels:
           Pred. Negative   Pred. Positive   Pred. None
Negative   49.65%           1.59%            12.02%
Positive   14.24%           74.87%           30.80%
None       36.11%           23.54%           57.18%
Additionally, the wizard has an understanding of the conversation context and may experience different interpersonal connections to the story or person, which would affect the empathetic responses beyond the ability of our model.
The column-wise-normalized confusion matrices for the RNN fusion model across wizard and MTurk ratings are shown in Table 5.4. The results show similar patterns for both label sets and indicate that false predictions are mainly confusions of positive or negative responses with no empathy, i.e., predicting positive/negative responses where none was necessary, or predicting no empathy where positive empathy would have been a better response. To deploy such a system in real interactions, high precision in detection is necessary, as confusion of positive and negative responses will disrupt the interaction. Examples of the model's predictions on MTurk labels are shown in Table 5.5. Instances like the second entry are dependent on the personalities and interpersonal relationships of the interlocutors. However, instances like the third entry can be disruptive to the interaction and require further attention.
Table 5.5: Instances of RNN Fusion model’s correct/incorrect predictions
on MTurk labels (Positive, Negative, None).
Dialogue Excerpt Prediction/Label
A: What got you to seek help?
H: My mood was just not right I was always feeling down
and depressed and lack of energy always wanting to sleep
um lack of interest
Neg/Neg
A: What’s your dream job?
H: Designing for the movie industry
A: How hard is that?
H: Extremely so I never really pursued it
Non/Neg
A: What do you do when you’re annoyed?
H: When I’m annoyed you know I really don’t get annoyed
that much I just let it go it’s not worth the pain and problems
they could cause if I can’t straighten out a problem let it go Neg/Pos
5.5 Summary of Findings
In this work [142], we reported on our efforts in automatic recognition of opportunities for providing empathetic responses. Our analysis demonstrated that verbal content and facial expressions of
emotions are important channels for recognizing such opportunities. We developed and evaluated
a neural network model capable of multimodal learning of these empathetic opportunities. The
best unimodal result was achieved through the text modality, which may be due to the high
representation power of the pre-trained BERT network along with the natural communicative
affordances of language over other modalities. Fusing the verbal channel with facial expressions,
our recurrent neural network fusion provided the best result of F1 = 0.71, which is comparable
to the recent work on multimodal sentiment analysis [162]. Analysis of two sets of ground-truth
labels from the experiments and independent observers shows that empathy, similar to other social
constructs, may suffer from indistinct boundaries that can be affected by interpersonal relationships
and individuals’ personalities.
Chapter 6
Automated assessment of MI sessions
6.1 Motivation
The quality and effectiveness of psychotherapy sessions are highly influenced by the therapists’
ability to lead the conversation with empathy and acceptance. Manual assessment of the quality
of therapy sessions is labor-intensive and difficult to scale. In this work, we propose a method
for estimating session-level therapist empathy ratings for Motivational Interviewing (MI) using
therapist language, which has applications in clinical assessment and training. We analyze
different stages within therapy sessions to investigate the importance of each stage and its topics
of conversation in estimating session-level therapist empathy. We perform experiments on two
datasets of MI therapy sessions for alcohol use disorder with session-level empathy scores provided
by expert annotators. We achieve average CCC (Concordance Correlation Coefficient) scores
of 0.596 and 0.408 for estimating therapist empathy under therapist-dependent and therapist-
independent evaluation settings. Our results suggest that therapist responses to clients' discussions of activities and experiences around the problematic behavior (in this case, alcohol abuse), along with the therapist's use of in-depth reflections, are the most significant factors in the perception of therapist empathy.
Empathy in psychotherapy has been described as: “To sense the client’s private world as if
it were your own, but without ever losing the ‘as if’ quality – this is empathy, and this seems
essential to therapy” [125]. Empathy is hypothesized to be one of the key ingredients in creating
a good therapeutic relationship, which in turn, is the best predictor of success in psychotherapy
[15]. Indeed, greater therapist empathy has been linked to better therapeutic outcomes [148, 130]
further highlighting the importance of empathy in counseling.
In this chapter, we focus on empathy in Motivational Interviewing (MI). MI has been shown
to have positive outcomes for multiple disorders, including alcohol use disorder [128]. MI
focuses on strengthening personal motivation by eliciting the client’s own reasons for change
while respecting the client’s agency. Empathy is therefore an important pillar in MI. It is crucial
for the therapist to empathize and understand the client’s reasons and motives to best facilitate
behavioral change. Empathy is one of the consistent evaluation metrics for assessing MI sessions’
quality. Standardized MI coding systems like the Motivational Interviewing Skill Code 2.5 (MISC)
[95] and the Motivational Interviewing Treatment Integrity 3.1 (MITI) [103] both use therapist
empathy as an important metric for assessing session quality. However, behavioral coding, i.e.,
the process of listening to audio recordings to observe therapist behaviors for quality assessment,
is highly costly and time-consuming, and therefore, hard to scale. Specifically, obtaining the
session quality ratings requires trained third-party coders to review and rate the session following the aforementioned standardized coding systems, MISC and MITI, on 7-point and 5-point Likert scales, respectively.
The current work focuses on building models that utilize therapist language for estimating
the session-level empathy ratings (standardized across datasets). To this aim, we utilize real-
world MI therapy datasets for alcohol abuse [18, 36]. Motivated by past work demonstrating the
potential relevance of the content of certain temporal segments of MI sessions to outcomes [64,
53], we divide each session into four roughly equal-length sequential segments (quartiles), each
representing a different stage of the conversation. We aim to determine whether the language
from these segments (often with different topics elicited by the therapists) has different predictive
power for session-level empathy estimation. Through our analyses, we show that the language of
therapists and clients generally follows a common progression. We conduct multiple experiments
to study the importance of the content in each quartile for understanding empathy. We demonstrate
that the utterances from the second quartile, which focuses on clients’ activities and experiences
around alcohol, may be more predictive of session-level empathy.
The main contributions of this work are as follows.
• We propose and evaluate a regression model for estimating therapist empathy using spoken
language. We demonstrate that language encoders pre-trained for emotion recognition pro-
vide better results compared with general purpose encoders, demonstrating the significance
of affect in therapist empathy.
• We investigate and analyze the therapist and client language across the session quartiles,
showing that certain quartiles (i.e., discussion of client activities around the problematic
behavior) are more predictive of empathy overall.
6.2 Data
In this work, we leverage two clinical datasets of real-world Motivational Interviewing sessions.
Our datasets, PRIME and NEXT, come from MI sessions with two populations: 1) college students mandated to take part in MI sessions due to alcohol-related problems [18], and 2) community-based underage (ages 17-20) heavy drinkers transitioning out of high school who were not immediately planning to enroll in a 4-year college; these participants were non-treatment-seeking volunteers recruited via advertisements and recruitment events held at local high schools, community colleges, etc. [36]. Both populations underwent single-session motivational interventions (MIs), delivered as face-to-face meetings lasting approximately 50-60 minutes and including personalized feedback to promote less risky drinking.
The PRIME dataset contains 219 brief motivational intervention (BMI) sessions with mandated college students. The sessions
include audio files and manual transcriptions. They are coded following the MISC 2.5 guidelines
for local utterance-level behaviors, as well as global ratings of empathy and other MI-related
measures like therapists’ acceptance and MI spirit. 20% of the sessions were randomly selected
and double-coded to verify inter-rater reliability. Intraclass correlation coefficients (ICCs; two-way
mixed, single measure) were calculated for each variable to determine inter-rater reliability across
rater pairs [16]. For this dataset, the ICC scores for therapists’ global measures range from 0.47 to
0.78, which is considered “fair” to “excellent” [33].
The NEXT dataset comprises 82 MI sessions with community-based underage drinkers and consists only of audio recordings. We used the Google Automatic Speech Recognition (ASR) service to automatically transcribe the sessions. We manually verified the ASR quality on a subset of sessions and found that the transcriptions have only minor issues that do not affect this work
(e.g. misrecognition of proper names associated with the sites, missing or inserting disfluencies
such as ‘uh’s and ‘um’s). The sessions are annotated with utterance-level codes, as well as global
ratings of therapist skills like empathy and acceptance following the MITI 3.1 coding system.
Similar to the first dataset, 20% of the sessions were randomly selected for double-coding and
ICC scores (two-way mixed, single measure) were computed [88]. ICC for this dataset was 0.83,
which is considered “excellent” [33].
Since the two datasets follow different coding systems with different Likert scales (7-point for MISC and 5-point for MITI), we scale the empathy ratings to the range 0–1. Fig. 6.1 shows the histograms of empathy ratings across the datasets, and dataset statistics are presented in Table 6.1.
Figure 6.1: Histograms of empathy ratings across our two datasets (PRIME and NEXT) and the combination of both datasets (normalized between 0 and 1).
Table 6.1: Dataset statistics: number of sessions, average session length (in minutes) and average number of turns; standard deviations in parentheses.
        # sess.   avg. length (min)   avg. # turns
PRIME   219       49.9 (13.7)         422 (128.2)
NEXT    82        54.3 (12.1)         600 (138.5)
6.3 Approach
6.3.1 Session Segmentation
In this work, we focus on using the therapist’s spoken language for estimating the empathy ratings
with a regression model. MI therapists are trained to follow a common (but flexible) structure,
which inspired our approach to studying language patterns at the quartile level. A typical session
roughly progresses as follows. In the first quartile (Q1), the therapist asks about the client's drinking habits and discusses how alcohol fits into their life. The client provides information on their overall drinking patterns, such as the number of drinks on a typical drinking day, the average number of
drinking days per week, etc. In the second quartile (Q2), the client discusses the activities they
partake in while drinking. These activities mostly revolve around social activities such as parties
and drinking games. The therapist provides insights into some of the physiological effects of such
drinking behaviors. The third quartile (Q3) focuses on personalized feedback, where the therapist
usually provides statistics and quantitative assessment of the drinking behaviors of the client, e.g.
how the client compares to other people of their age. In the final quartile (Q4), the client and
therapist discuss potential actions corresponding to their plan for change.
We provide a preliminary analysis of the common language within session quartiles for more
insights into the data. We use class-based tf-idf (Term Frequency - Inverse Document Frequency)
analysis to obtain the top n-grams per quartile, treating each quartile (combined across all sessions) as a document. Table 6.2 shows the most frequent bi-grams per speaker and session quartile, which reveal an overall pattern of session progression.
Table 6.2: Most common bi-grams across speakers and quartiles; *bac: blood alcohol content.
Q1
  PRIME therapist: heaviest week, questions started, drinks single, single day, maximum number
  PRIME client: Thursday Friday, week yeah, Saturday like, sounds right, number drinks
  NEXT therapist: love just, adult make, really opens, related alcohol, life young, hopes dreams
  NEXT client: community college, looking job, just hanging, make money, going school
Q2
  PRIME therapist: slurred speech, particular bacs*, outer brain, tolerance heard, emotional center, people associate
  PRIME client: slurred speech, flip cup, self conscious, reaction time, drinking game
  NEXT therapist: regret getting, trouble having, standard drink, age 21, measure alcohol
  NEXT client: high tolerance, shit faced, funny like, high number, binge drinking
Q3
  PRIME therapist: went estimation, called perceived, thought average, people overestimate, myth alcohol
  PRIME client: talk people, easier talk, bad time, nights week, neglected responsibilities
  NEXT therapist: later regretted, alcohol dependence, spend drinking, percent money, lead dangerous
  NEXT client: alcohol level, like woke, spend time, heavy drinker, family history
Q4
  PRIME therapist: seal envelope, related problems, alcohol problems, drinkers alcohol, severe consequences
  PRIME client: drink unattended, designated driver, good idea, avoid drinking, leave drink
  NEXT therapist: lot today, make hard, really appreciate, keeping track, complete stranger
  NEXT client: drinking games, peer pressure, avoid drinking, need help, drug use
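As a rough illustration of the class-based tf-idf analysis described above, the sketch below treats each quartile (one speaker's utterances concatenated across all sessions) as a document and extracts the highest-scoring bi-grams. The function name, stop-word handling and top-k value are our own choices, not the original implementation.

from sklearn.feature_extraction.text import TfidfVectorizer

def top_bigrams_per_quartile(quartile_docs, top_k=5):
    """`quartile_docs` is a list of four strings, one per quartile, each the
    concatenation of a speaker's utterances across all sessions."""
    vectorizer = TfidfVectorizer(ngram_range=(2, 2), stop_words="english")
    tfidf = vectorizer.fit_transform(quartile_docs)      # shape: (4, vocab_size)
    vocab = vectorizer.get_feature_names_out()
    top = {}
    for q in range(tfidf.shape[0]):
        scores = tfidf[q].toarray().ravel()
        best = scores.argsort()[::-1][:top_k]
        top[f"Q{q + 1}"] = [vocab[i] for i in best]
    return top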
The decision to divide the sessions into quartiles has multiple practical advantages. First, the quartiles can be mapped onto the four structural segments of the sessions described above. Second, a quarter of a session is a more interpretable unit for clinician training and supervision than a larger (whole-session) or smaller (e.g., decile) portion. Finally, analyzing session language at the quartile level alleviates the loss of information associated with the limited context or input-sequence length of a neural network model when modeling the entire session.
Since we do not have access to precise annotations of the start or end points of each stage in the sessions (and the session structure may in practice be more fluid and flexible), we divide each session into four equal-duration quartiles and extract the utterances within each quartile. We
study the therapist language within the quartiles to explain the progression and investigate the
importance of each quartile in the estimation of perceived therapist empathy.
6.3.2 Model
In this work, we investigate the therapist's in-session language with respect to its perceived empathy level. To encode the therapist utterances, we leverage recent advances in language representation by fine-tuning the pre-trained distil-RoBERTa model (distilled Robustly Optimized BERT Pretraining Approach) [81]. We obtain the language representations for the input window and feed the sequence to an initial linear layer for dimensionality reduction, followed by a single-layer bidirectional GRU. We take the outputs over the entire sequence and feed them into a multi-head self-attention layer [151] to learn the relative importance of each utterance within the input window. The weighted sequence representations are aggregated into a final representation of the input window by concatenating the mean- and max-pooled hidden states of the entire sequence. These learned representations are passed through a final linear layer for regression.
Using this network architecture, depicted in Fig. 6.2, we build a regression model for learning continuous empathy ratings for session quartiles. We further aggregate these quartile-level estimations to obtain the session-level empathy by taking the average across quartiles. In addition to the base encoder, and motivated by the affective nature of empathy, which involves identification with others' emotional experiences, we also train our models with an emotion-distil-RoBERTa encoder [58], which is pre-trained on multiple emotion datasets [41, 99, 119, 131, 129].
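A minimal PyTorch sketch of this architecture is given below. It is a simplified reconstruction under our own assumptions (e.g., using the encoder's start-token embedding as the utterance representation and processing one quartile per forward pass), not the original implementation.

import torch
import torch.nn as nn
from transformers import AutoModel

class QuartileEmpathyRegressor(nn.Module):
    """Utterance encoder -> linear projection -> Bi-GRU -> two-head
    self-attention -> mean/max pooling -> linear regression head."""

    def __init__(self, encoder_name="distilroberta-base",
                 proj_dim=512, gru_dim=256, n_heads=2, dropout=0.5):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.proj = nn.Linear(self.encoder.config.hidden_size, proj_dim)
        self.gru = nn.GRU(proj_dim, gru_dim, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * gru_dim, n_heads, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.head = nn.Linear(4 * gru_dim, 1)  # concat of mean- and max-pooled states

    def forward(self, input_ids, attention_mask):
        # input_ids / attention_mask: (num_turns, max_tokens) for one quartile
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        utt = hidden[:, 0]                      # start-token embedding (assumption)
        seq = self.proj(utt).unsqueeze(0)       # (1, num_turns, proj_dim)
        states, _ = self.gru(seq)               # (1, num_turns, 2 * gru_dim)
        attended, _ = self.attn(states, states, states)
        attended = self.dropout(attended)
        pooled = torch.cat([attended.mean(dim=1), attended.max(dim=1).values], dim=-1)
        return self.head(pooled).squeeze(-1)    # quartile-level empathy estimate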
6.3.3 Experimental Setup
In this work, we train and evaluate our method on a combined dataset of 301 real-world MI sessions.
We extract a fixed-size window containing the first 64 therapist turns of each quartile. The choice of window size was
based on the average length of session quartiles in terms of the number of therapists’ speech
turns and the hardware constraints. We use distil-RoBERTa [81] or emotion-distil-RoBERTa [58]
encoders for our text representation while fine-tuning the final layer for our task. The dimension
of the input vector embeddings is 768, and the hidden dimensions for the initial linear layer and
GRU are 512 and 256, respectively, and the dropout rate after self-attention is set to 0.5.
We use 5-fold cross-validation for training and evaluation of the overall dataset, and report
the test results by selecting the model with the highest validation performance. We perform both
therapist-dependent and therapist-independent cross-validation. In the therapist-dependent cross-
validation, the splits are not disjoint by the therapist, meaning sessions from the same therapist can
appear in both train and test sets. On the other hand, in therapist-independent cross-validation,
sessions from the same therapist do not appear in both training and testing data, preventing the
model from learning any therapist-specific patterns or idiosyncrasies. We optimize the network
weights using AdamW, with a batch size of 8 and a learning rate of 5e-5. The small batch size is
due to the memory constraints on GPUs given the large input size for each sample. We use the
Concordance Correlation Coefficient (CCC), a widely used evaluation metric for regression that
measures agreement between the prediction and ground-truth values. Past work demonstrated that
using CCC loss results in superior performance in emotion recognition [75]. Therefore, we opt to
use a CCC loss rather than the more commonly used Mean Squared Error (MSE) loss.
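For reference, a minimal sketch of a CCC-based training loss (1 - CCC over a batch of predictions) is shown below; the epsilon term is our own addition for numerical stability.

import torch

def ccc_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """1 - CCC over a batch of predicted and ground-truth empathy scores.
    CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2)."""
    pred_mean, target_mean = pred.mean(), target.mean()
    pred_var = pred.var(unbiased=False)
    target_var = target.var(unbiased=False)
    cov = ((pred - pred_mean) * (target - target_mean)).mean()
    ccc = 2 * cov / (pred_var + target_var + (pred_mean - target_mean) ** 2 + eps)
    return 1.0 - ccc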
6.4 Results and Discussion
Figure 6.2: The model includes an utterance encoder (distil-RoBERTa [81] or emotion-distil-RoBERTa [58]) whose output is projected to a lower-dimensional space by the following linear layer. The sequence of utterance-level representations is then fed to a Bidirectional Gated Recurrent Unit (Bi-GRU) layer. The GRU is followed by a two-head self-attention layer on the GRU's hidden states, whose output is mean- and max-pooled into the final vector embedding, which is fed to a final linear layer for regression.
6.4.1 Base Models vs. Emotion Models
In our first set of experiments, we compare two different encoders for language representation, namely, distil-RoBERTa [81] and emotion-distil-RoBERTa [58], to see whether an encoder pre-trained on emotion recognition tasks provides performance improvements for estimating empathy as an affective construct. Our previous experiments showed that the distilled model versions are on par with their full model counterparts (distil-RoBERTa vs. RoBERTa-base) in terms of performance. This may be due to our limited data size, so we focused on distilled encoders across all experiments.
In Table 6.3, we provide the model performance across different session quartiles, under both therapist-dependent and therapist-independent settings. From these results, we can see that the emotion encoder outperforms the base RoBERTa encoder on average, leading to higher performance on the session-level estimations.
The performance gap is more evident in the therapist-independent setting, especially for the first quartile, Q1 (from 0.150 to 0.367), potentially due to affect-related language in that quartile.
Comparing the performance across the quartiles, Q2 is consistently more predictive in the therapist-
dependent setting, although the pattern is not consistent for the therapist-independent scenario. The
session-level estimation, obtained by aggregating the predictions across all quartiles, performs best in all but one of the cases, reaching CCC scores of 0.596 and 0.408 for the therapist-dependent and therapist-independent cases, respectively. As expected, there is a drop in performance in the therapist-independent evaluation setting, since the model cannot utilize individual therapist characteristics for
recognition. This gap is exacerbated by the real-world nature of our dataset and the large variation
in the number of sessions per therapist, which we will further discuss in Section 6.4.3.
Table 6.3: Performance results (CCC scores) comparing distil-RoBERTa vs. emotion-distil-RoBERTa encoders, under therapist-dependent and therapist-independent scenarios.
therapist-dependent therapist-independent
base emotion base emotion
Q1 0.510 (0.05) 0.488 (0.08) 0.150 (0.09) 0.367 (0.16)
Q2 0.546 (0.03) 0.544 (0.04) 0.264 (0.15) 0.341 (0.12)
Q3 0.436 (0.06) 0.528 (0.08) 0.291 (0.07) 0.310 (0.10)
Q4 0.450 (0.10) 0.470 (0.11) 0.341 (0.16) 0.344 (0.17)
sess. 0.572 (0.04) 0.596 (0.06) 0.320 (0.12) 0.408 (0.16)
6.4.2 Cross Corpus vs. Within Corpus
In our next set of experiments, we use the emotion encoder, due to its superior performance, under the therapist-independent configuration, which yields results that are more robust to individual therapist characteristics. As shown in Table 6.2, the sessions generally follow a similar structure of progression despite some differences in the discussed content. To further analyze this aspect, we explore the overall
generalizability of our model predictions across datasets under within-corpus and cross-corpus
evaluations.
To this end, we first train and evaluate our model within corpus on our larger dataset (PRIME).
Next, we perform cross-corpus testing by using PRIME as the training set and NEXT as the
validation set with the main goal of identifying whether certain session quartiles are more similar
in language. The results are shown in Table 6.4, with the first column providing the within-corpus
results using a 5-fold therapist-independent cross-validation. The remaining columns on the right
provide the scores when training on PRIME and validating on NEXT under different combinations
of session quartiles.
Table 6.4: CCC score results in within-corpus and cross-corpus experiments, therapist-independent setting. The within-corpus results include the mean across the cross-validation folds, with standard deviation in parentheses. The cross-corpus results are obtained from training and validation on the entire datasets.
PRIME Dataset
Within-corpus Cross-corpus testing (NEXT Dataset): Q1 Q2 Q3 Q4 sess.
Q1 0.173 (0.18) 0.107 0.178 -0.018 0.067 0.114
Q2 0.342 (0.18) 0.229 0.188 0.134 0.102 0.198
Q3 0.293 (0.19) 0.037 0.186 -0.007 0.058 0.053
Q4 0.228 (0.27) 0.092 0.191 -0.009 0.069 0.112
sess. 0.299 (0.20) 0.134 0.194 0.016 0.090 —
We first compare the within-corpus results, in which the model was trained and tested on
PRIME. Compared to the results on the combined dataset (last column of Table 6.3), the results for
Q2 and Q3 are on par across the two experiments, suggesting that these quartiles are most similar
in terms of therapist language across datasets. On the other hand, the performance on Q1 and
Q4 seems to suffer from significant drops, likely due to the differences in dataset characteristics.
Additionally, in the cross-corpus setting, results were consistent when the model was trained on
Q2, i.e., the Q2 model transfers well to all test quartiles. Conversely, models trained on any quartile also performed most consistently on the Q2 test quartile. These findings suggest that there is a commonality in the language of Q2 across datasets. Recall that Q2 is when the
clients are describing their activities around drinking, and the therapist informs them of possible
behavioral effects while expressing understanding in a non-judgemental manner, in accordance
with MI protocols.
6.4.3 Therapist Analysis
One of the challenges of the dataset in the therapist-independent scenario is the high imbalance of
the number of sessions across different therapists, ranging from 1 to 44. In the PRIME dataset, we have 219 sessions run by 13 therapists (IDs 1-13), and in the NEXT dataset we have 82 sessions run by 3 therapists. One session had missing therapist information, so we coded it as conducted by a separate therapist; i.e., the NEXT dataset includes therapist IDs 14-17. While the 17 therapists
follow the same guidelines across datasets, they have different variances in empathy ratings across
sessions. Fig. 6.3 shows the empathy ratings across different therapists. As described earlier, the
two datasets are rated using different Likert scales, which we have scaled to [0,1] for our analysis.
Our research question in this section is: what constitutes a more empathetic therapist? We explore this question by studying the types of language and MI codes therapists use. MI codes
are utterance-level categories following standardized MISC/MITI coding systems. These codes
categorize the session utterances into therapist- and client-specific categories. In this analysis, we
focus on therapist codes, which include simple/complex reflections, open/closed-ended questions,
giving information, facilitation, etc. To this end, we categorize the therapists into groups of high
vs. low empathy using their average empathy ratings across sessions. We select a threshold of 0.7,
leading to a balanced grouping. We then obtain the normalized usage of each MI code per therapist
across different quartiles, by taking the average across sessions. We run Kruskal-Wallis tests
across the two groups for different therapist-specific codes including simple/complex reflections,
open/closed-ended questions, giving information, etc. Table 6.5 shows the MI codes that significantly distinguish high- vs. low-empathy therapists across quartiles.
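A minimal sketch of this group comparison is shown below; the helper name and return format are ours, and determining the sign of the association by comparing group means is an assumption about how the (+)/(-) directions in Table 6.5 were derived.

import numpy as np
from scipy.stats import kruskal

def compare_code_usage(usage_high, usage_low, alpha=0.05):
    """Kruskal-Wallis test on per-therapist normalized usage of one MI code
    (e.g., complex reflections in Q2) for high- vs. low-empathy groups."""
    stat, p_value = kruskal(usage_high, usage_low)
    direction = "+" if np.mean(usage_high) > np.mean(usage_low) else "-"
    return {"H": stat, "p": p_value, "significant": p_value < alpha,
            "association": direction}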
Table 6.5: Significant therapist codes associated with empathy; ** p-value < 0.01; * p-value < 0.05; positive (+) and negative (-) associations with empathy are shown in parentheses.
Significant therapist codes
Q1
MI-consistent** (+); Open-ended question* (+)
Giving information* (-)
Q2
Complex reflection** (+); MI-consistent** (+)
MI-inconsistent* (-); Giving information* (-)
Q3
MI-consistent** (+); Complex reflection* (+)
Q4 Complex reflection* (+)
These findings are consistent with what we would expect, based on general MI guidelines. In
particular, ‘MI-consistent’ is a large category that includes codes like ‘advice with permission,’
‘affirm,’ ‘emphasize control,’ and ‘support.’ The results show that the ‘MI-consistent’ category is
significantly and positively associated with perceived empathy across most quartiles. ‘Complex
reflection,’ which consists of reflections that add substantial meaning or emphasis to what the client has said and can be a good indicator of empathetic understanding, is also significantly and positively associated with therapist empathy; its association is strongest in Q2 (p-value < 0.01), where the client is describing their activities and experiences around drinking alcohol. This further explains why Q2 is a prominent quartile for the estimation of session-level empathy in the majority of our experiments. This result also suggests that effective therapists
exhibit deep and empathetic understanding around discussion of alcohol-related activities and
experiences.
Figure 6.3: Empathy ratings across different therapists. Therapist IDs 1-
13 are from PRIME dataset; Therapist IDs 14-17 are from NEXT dataset.
Other statistically significant codes include ‘MI-inconsistent’ and ‘Giving information’, which
are both negatively associated with empathy. ‘MI-inconsistent’ consists of those actions directly
proscribed by MI guidelines (such as giving advice without permission, confronting, directing),
which appear to be detrimental to perceived therapist empathy. ‘Giving information,’ which is
the category for when the therapist explains something, educates or provides feedback, is also
negatively associated with empathy. This suggests that these types of speech should be used in
moderation by therapists. Some example dialogue excerpts are shown in Table 6.6, noted with the
types of MI codes associated with each utterance.
Table 6.6: Sample dialogue excerpts from the datasets including
corresponding MI codes. T denotes the therapist and C denotes the
client.
T: ... you can see that on a typical occasion and on the heavier
occasion you are getting above that (Giving information)
C: oh yeah i wouldn’t i wouldn’t consider driving after more
than like two drinks basically a... (Change talk)
T: okay so it sounds like the avoiding the driving is pretty
important to you (Complex reflection)
C: ... that’s definitely not a positive effect of drinking you know
balance and movement are affected (Change talk)
T: so it sounds like for you that would be really difficult
especially if you know you tend to try to stay in control of
yourself and situations and having that impairment
(Complex reflection)
C: yeah i was drinking every day for like four years right and
(Follow/neutral)
T: ... it is less for sure and you are functioning
so it makes sense that you know you wouldn’t think that
the number would be as high as it is but when you look at ...
(MI-consistent)
6.5 Ethical Impact
The datasets and labels used in this chapter and Chapter 3 are the result of a secondary analysis of past studies, which were reviewed by their relevant IRBs (see the original studies [18, 36]).
The original data were recorded with informed consent from the participating clients to be used for
research. The original studies allowed for secondary analysis of the audio recordings for training
and research purposes, in accordance with the goals of the original study and the participants’
consent. In the original studies, the audio data were reviewed and cleared of any identifiable
information, such as names and addresses. Therefore, the research presented in this chapter was deemed IRB-exempt analysis of secondary data by the USC IRB, which designated the data non-identifiable. Nevertheless, we ensured that data were always transferred and stored by encrypted
and password-protected means. When speech data from the second dataset were transcribed by
the Google Cloud speech-to-text service, data were transferred through encrypted connections and
were only kept in the protected cloud storage for the minimum necessary duration to complete the
transcription. We also did not allow the cloud service provider to log the data during transcription
for additional protection. We chose Google Cloud Platform due to its superior performance in
transcription and its reputation and ability to provide secure and compliant services for storage
and analyses of sensitive data. Specifically, Google uses encryption when receiving speech data; it does not claim ownership over the speech data or the resulting transcripts, and does not store or reveal the information when logging is not enabled by the user.*
* https://cloud.google.com/speech-to-text/docs/data-usage-faq
6.6 Summary of Findings
In this work, we presented a method for computational understanding of therapist empathy in
MI therapy sessions. To this end, we utilized therapist language from individual quartiles within
the sessions. We developed and evaluated a neural network architecture for estimating empathy
ratings per session using therapist language. We conducted experiments and analyses within
and across datasets to gather insights into the importance of each quartile within the session
progression. Our results indicate that the second quartile of the session, which is commonly
focused on discussing clients’ experiences around drinking alcohol, may be of higher importance
for estimating therapist empathy. We achieved promising results for estimating empathy using
pre-trained language encoders with affect-aware representations. Moreover, our analyses show
that therapists with higher empathy ratings tend to provide more complex reflections, which are
most significant during the second quartile. This finding provides evidence for the importance of complex reflections for therapist empathy, as they demonstrate a deeper understanding of the client by the therapist.
Therapist empathy is key to successful therapy. Modeling and understanding empathy requires
effective and efficient means, in order to facilitate therapist training and therapy quality assessment.
With this work, we provide evidence for the salience of certain topics and therapy techniques for
effective modeling and assessment of empathy in MI therapy sessions.
Chapter 7
Conclusion and Future Directions
The recent advances in natural language and multimodal behavior perception have opened up
exciting new opportunities in health care. The design and development of low-cost and objective
behavior perception models enable gathering and assessment of behavioral data at different levels
of granularity to investigate associations with symptoms and treatment outcomes. In mental health,
this includes identifying clients' verbal and nonverbal behaviors associated with specific disorders for diagnosis, and monitoring in-session behaviors associated with successful therapeutic outcomes.
Behavior analysis can be additionally used to track and adjust treatment, according to measured
symptoms. In this chapter, we present a summary of contributions and future directions from the
research presented in this dissertation.
7.1 Summary of Contributions
We have proposed a model for client intent recognition in Motivational Interviewing (MI). We
demonstrate the importance of modeling history context in intent classification. Through experiments across different history context encoding approaches, we demonstrate that the best
results are obtained when encoding the dyadic interactions of the client and therapist, by taking
the entire history window without separating speaker utterances. Additionally, through error
analysis on misclassified instances, we showed psychologically meaningful dimensions that the
pre-trained language encoders may have failed to capture in association with this task, including
language around anger, swearing and money. This type of language is higher for Change Talk
compared to other codes, possibly representing the client’s anger toward themselves. Additionally,
the language about money is also higher for Change Talk, potentially referencing the high cost
of alcohol, a good motivator toward reducing alcohol consumption. We further investigate the predictability of long-term behavior change from in-session data, obtaining promising results by modeling the sequence of client utterances from the entire session. Our findings show promise for long-term (6-month) behavior change prediction, given the challenge of such a task, through analyzing a relatively short interaction (about an hour). Additionally, we performed multiple experiments using different subsets of client utterances pertaining to individual codes (i.e., Change Talk, Sustain Talk and Follow/Neutral) for behavioral outcome prediction. However, we did not see a consistent pattern of one talk type significantly outperforming the others; our best results were achieved using the entire set of client utterances.
For incorporating speaker turn changes in dialogue, we propose a simple and effective approach
by introducing learnable embeddings that are capable of encoding role-specific language and style
in dyadic dialogue. We train and evaluate our models on Dialogue Act classification benchmarks
achieving state-of-the-art performance. We showed that incorporating the speaker turn-change embeddings provides significant performance improvements over comparable state-of-the-art models.
Due to the importance of empathy in shaping therapeutic alliance and leading to desired
therapeutic outcomes, we propose an approach for computational modeling of empathy. We build
multimodal models for predicting opportunities for expressing empathy using client verbal and
non-verbal behaviors in clinical interviews. We showed that the multimodal models provide the highest performance, while language and facial expressions provide the strongest unimodal results,
respectively. By studying two sets of ground-truth labels from the experiments and independent
observers, we show that empathy, similar to other social constructs, may suffer from indistinct
boundaries that can be affected by interpersonal relationships and individuals’ personalities.
We further focus on therapist empathy as a metric for assessing the quality of MI sessions.
We propose a model for estimating session-level therapist empathy using therapist language and
study the language from different session quartiles, each roughly focusing on a different stage
of the therapy session. We demonstrate that the second quartile of the session, commonly focused
on discussing clients’ experiences around the problematic behavior (in this case alcohol abuse),
may be of higher importance for estimating therapist empathy. Moreover, our analyses show
that therapists with higher empathy ratings tend to provide more complex reflections. Complex
reflections consist of reflections that add substantial meaning or emphasis to what the client had
said and are likely a good indicator of empathetic understanding. With this work, we provide
evidence for the salience of specific topics (experiences around problematic behaviors) and therapy
techniques (deep reflection), where empathy is most effectively modeled and potentially where it
is most important for building an empathetic alliance.
7.2 Future Directions
We outline below several future directions that we believe will be valuable for advancing machine learning models in mental health:
Pretraining publicly available language models on mental health therapy data: The existing
advancements in natural language processing have been made possible with large pre-trained
language models. These models are pre-trained on data from books, Wikipedia, news or general
web data, with limited exposure to spoken dialogue from therapy. Therefore, continued pre-training
on available data from acted or real-world therapy sessions and clinical interviews is expected to
provide significant improvements in their representation power for mental health-related tasks.
Some existing efforts have provided models like MentalBERT and MentalRoBERTa [69], which are trained on Reddit posts discussing mental health-related issues like depression and anxiety. Other work provides psychBERT [51] by adapting BERT to a large archive of recorded Cognitive Behavioral Therapy sessions [38]. Despite the high importance of such models, they are either pre-trained on web-based data, which lacks the characteristics of spoken language and its intricacies, or they are not publicly available. A publicly available model pre-trained on mental health therapy data, in
accordance with ethical and privacy-preserving guidelines, can be immensely helpful for future
advancements in computational modeling of therapy.
Investigation of client and therapist dyadic behaviors: Entrainment or coordination is the
phenomenon whereby conversational partners tend to speak more similarly to each other, in terms of language and speech (e.g., pitch, loudness or speaking rate), as the interaction unfolds. Entrainment has been
linked with increased rapport [84], and dialogue success [120]. Studying the entrainment of the
client and therapist throughout the therapy session and within different stages can provide further
insights on the effective strategies associated with therapeutic outcomes. For example, evidence
on the degree to which the therapist should lead or follow within each stage of the session can
provide valuable insights for therapists in training.
Multimodal inspection of client intent sincerity and commitment in MI: MI theory posits that
the client’s in-session expressions of Change Talk and Sustain Talk are associated with behavioral outcomes. However, the sincerity of such expressions and its association with outcomes have not been investigated. For example, clients may express Change Talk to appease the therapist, without
an intention to follow through. Multimodal analysis of client intent, through cross-comparison
of language and speech, can provide interesting new insights when studying the link between
in-session behaviors and subsequent outcomes.
Bibliography
[1] Steven J Ackerman and Mark J Hilsenroth. “A review of therapist characteristics and
techniques positively impacting the therapeutic alliance”. In: Clinical psychology review
23.1 (2003), pp. 1–33.
[2] Efrat Aharonovich, Paul Amrhein, Adam Bisaga, Edward Nunes, and Deborah Hasin.
“Cognition, commitment language, and behavioral change among cocaine-dependent
patients”. In: Psychology of Addictive Behaviors 23 (4 2008).
[3] Tim Althoff, Kevin Clark, and Jure Leskovec. “Large-scale Analysis of Counseling
Conversations: An Application of Natural Language Processing to Mental Health”. In:
Transactions of the Association for Computational Linguistics 4 (2016), pp. 463–476.
[4] Paul Amrhein, William Miller, Carolina Yahne, Michael Palmer, and Laura Fulcher.
“Client commitment language during motivational interviewing predicts drug use
outcomes”. In: Journal of Consulting and Clinical Psychology 71 (5 2003).
[5] Timothy Apodaca, Molly Magill, Richard Longabaugh, Kristina Jackson, and Peter Monti.
“Effect of a Significant Other on Client Change Talk in Motivational Interviewing”. In:
Journal of Consulting and Clinical Psychology 81 (1 2012).
[6] Timothy R Apodaca, Brian Borsari, Kristina M Jackson, Molly Magill,
Richard Longabaugh, Nadine R Mastroleo, and Nancy P Barnett. “Sustain talk predicts
poorer outcomes among mandated college student drinkers receiving a brief motivational
intervention.” In: Psychology of Addictive Behaviors 28.3 (2014), p. 631.
[7] Chanuwas Aswamenakul, Lixing Liu, Kate B Carey, Joshua Woolley, Stefan Scherer, and
Brian Borsari. “Multimodal Analysis of Client Behavioral Change Coding in Motivational
Interviewing”. In: Proceedings of the 20th ACM International Conference on Multimodal
Interaction. 2018, pp. 356–360.
[8] Chanuwas Aswamenakul, Lixing Liu, Kate B. Carey, Joshua Woolley, Stefan Scherer, and
Brian Borsari. “Multimodal Analysis of Client Behavioral Change Coding in Motivational
Interviewing”. In: Proceedings of the 20th ACM International Conference on Multimodal
Interaction. ICMI ’18. Boulder, CO, USA: Association for Computing Machinery, 2018,
pp. 356–360. ISBN: 9781450356923. DOI: 10.1145/3242969.3242990.
[9] Tadas Baltrusaitis, Peter Robinson, and Louis Philippe Morency. “OpenFace: An open
source facial behavior analysis toolkit”. In: 2016 IEEE Winter Conference on Applications
of Computer Vision, WACV 2016. IEEE, Mar. 2016, pp. 1–10. ISBN: 9781509006410.
[10] Tadas Baltrusaitis, Amir Zadeh, Yao Chong Lim, and Louis-Philippe Morency. “Openface
2.0: Facial behavior analysis toolkit”. In: 2018 13th IEEE International Conference on
Automatic Face & Gesture Recognition (FG 2018). IEEE. 2018, pp. 59–66.
[11] Joseph Bates et al. “The role of emotion in believable agents”. In: Communications of the
ACM 37.7 (1994), pp. 122–125.
[12] Christian Becker, Helmut Prendinger, Mitsuru Ishizuka, and Ipke Wachsmuth.
“Evaluating affective feedback of the 3D agent max in a competitive cards game”. In:
International Conference on Affective Computing and Intelligent Interaction. Springer.
2005, pp. 466–473.
[13] Timothy W Bickmore. “Relational agents: Effecting change through human-computer
relationships”. PhD thesis. Massachusetts Institute of Technology, 2003.
[14] Matthew P. Black, Athanasios Katsamanis, Brian R. Baucom, Chi-Chun Lee,
Adam C. Lammert, Andrew Christensen, Panayiotis G. Georgiou, and
Shrikanth S. Narayanan. “Toward automating a human behavioral coding system for
married couples’ interactions using speech acoustic features”. In: Speech Communication
55.1 (Jan. 2013). ISSN: 0167-6393. DOI: 10.1016/j.specom.2011.12.003. (Visited on
05/30/2020).
[15] Arthur C Bohart and Leslie S Greenberg. Empathy and psychotherapy: An introductory
overview. American Psychological Association, 1997.
[16] Brian Borsari, Timothy R Apodaca, Kristina M Jackson, Nadine R Mastroleo,
Molly Magill, Nancy P Barnett, and Kate B Carey. “In-session processes of brief
motivational interventions in two trials with mandated college students”. In: Journal of
consulting and clinical psychology 83.1 (Feb. 2015), pp. 56–67.
[17] Brian Borsari, Timothy R. Apodaca, Kristina M. Jackson, Anne Fernandez,
Nadine R. Mastroleo, Molly Magill, Nancy P. Barnett, and Kate B. Carey. “Trajectories of
In-Session Change Language in Brief Motivational Interventions with Mandated College
Students”. In: Journal of consulting and clinical psychology 86.2 (Feb. 2018),
pp. 158–168. ISSN: 0022-006X. DOI: 10.1037/ccp0000255. (Visited on 05/14/2020).
[18] Brian Borsari, John TP Hustad, Nadine R Mastroleo, Tracy O’Leary Tevyaw,
Nancy P Barnett, Christopher W Kahler, Erica Eaton Short, and Peter M Monti.
“Addressing alcohol use and problems in mandated college students: A randomized
clinical trial using stepped care.” In: Journal of consulting and clinical psychology 80.6
(2012), p. 1062.
[19] Chandrakant Bothe, Cornelius Weber, Sven Magg, and Stefan Wermter. “A context-based
approach for dialogue act recognition using simple recurrent neural networks”. In: arXiv
preprint arXiv:1805.06280 (2018).
[20] Scott Brave, Clifford Nass, and Kevin Hutchinson. “Computers that care: investigating the
effects of orientation of emotion exhibited by an embodied computer agent”. In:
International journal of human-computer studies 62.2 (2005), pp. 161–178.
[21] Erik Cambria, Isabelle Hupont, Amir Hussain, Eva Cerezo, and Sandra Baldassarri.
“Sentic avatar: Multimodal affective conversational agent with common sense”. In:
Toward Autonomous, Adaptive, and Context-Aware Multimodal Interfaces. Theoretical
and Practical Issues. Springer, 2011, pp. 81–95.
[22] Erik Cambria, Björn Schuller, Yunqing Xia, and Catherine Havasi. “New avenues in
opinion mining and sentiment analysis”. In: IEEE Intelligent systems 28.2 (2013),
pp. 15–21.
[23] Doğan Can, David C Atkins, and Shrikanth S Narayanan. “A dialog act tagging approach
to behavioral coding: A case study of addiction counseling conversations”. In: Sixteenth
Annual Conference of the International Speech Communication Association. 2015.
[24] Jie Cao, Michael Tanana, Zac E Imel, Eric Poitras, David C Atkins, and Vivek Srikumar.
“Observing dialogue in therapy: Categorizing and forecasting behavioral codes”. In: arXiv
preprint arXiv:1907.00326 (2019).
[25] Kate B Carey, James M Henson, Michael P Carey, and Stephen A Maisto. “Computer
versus in-person intervention for students violating campus alcohol policy.” In: Journal of
consulting and clinical psychology 77.1 (2009), p. 74.
[26] Ginevra Castellano, Ana Paiva, Arvid Kappas, Ruth Aylett, Helen Hastie,
Wolmet Barendregt, Fernando Nabais, and Susan Bull. “Towards empathic virtual and
robotic tutors”. In: Artificial Intelligence in Education: 16th International Conference,
AIED 2013, Memphis, TN, USA, July 9-13, 2013. Proceedings 16. Springer. 2013,
pp. 733–736.
[27] Sandeep Nallan Chakravarthula, Bo Xiao, Zac E. Imel, David C. Atkins, and
Panayiotis G. Georgiou. “Assessing empathy using static and dynamic behavior models
based on therapist’s language in addiction counseling”. In: Interspeech. 2015.
[28] Po-Chun Chen, Ta-Chung Chi, Shang-Yu Su, and Yun-Nung Chen. “Dynamic time-aware
attention to speaker roles and contexts for spoken language understanding”. In: 2017
IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE. 2017,
pp. 554–560.
[29] Minghai Chen, Sen Wang, Paul Pu Liang, Tadas Baltrušaitis, Amir Zadeh, and
Louis-Philippe Morency. “Multimodal sentiment analysis with word-level fusion and
reinforcement learning”. In: Proceedings of the 19th ACM International Conference on
Multimodal Interaction. ACM. 2017, pp. 163–171.
[30] Zheqian Chen, Rongqin Yang, Zhou Zhao, Deng Cai, and Xiaofei He. “Dialogue act
recognition via crf-attentive structured network”. In: The 41st international acm sigir
conference on research & development in information retrieval. 2018, pp. 225–234.
[31] Ta-Chung Chi, Po-Chun Chen, Shang-Yu Su, and Yun-Nung Chen. “Speaker role
contextual modeling for language understanding and dialogue policy learning”. In: arXiv
preprint arXiv:1710.00164 (2017).
[32] Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. “On the
properties of neural machine translation: Encoder-decoder approaches”. In: arXiv preprint
arXiv:1409.1259 (2014).
[33] Domenic V Cicchetti. “Guidelines, criteria, and rules of thumb for evaluating normed and
standardized assessment instruments in psychology.” In: Psychological assessment 6.4
(1994), p. 284.
[34] Chloe Clavel and Zoraida Callejas. “Sentiment analysis: from opinion mining to
human-agent interaction”. In: IEEE Transactions on affective computing 7.1 (2016),
pp. 74–93.
[35] Jeffrey F. Cohn, Tomas Simon Kruez, Iain Matthews, Ying Yang, Minh Hoai Nguyen,
Margara Tejera Padilla, Feng Zhou, and Fernando De la Torre. “Detecting Depression
from Facial Actions and Vocal Prosody”. In: Proc. 3rd International Conference on
Affective Computing and Intelligent Interaction and Workshops. 7 pages. Amsterdam,
Netherlands: IEEE, 2009.
[36] Suzanne M Colby, Lindsay Orchowski, Molly Magill, James G Murphy, Linda A Brazil,
Timothy R Apodaca, Christopher W Kahler, and Nancy P Barnett. “Brief motivational
intervention for underage young adult drinkers: Results from a randomized clinical trial”.
In: Alcoholism: clinical and experimental research 42.7 (2018), pp. 1342–1351.
[37] Pierre Colombo, Emile Chapuis, Matteo Manica, Emmanuel Vignon, Giovanna Varni, and
Chloe Clavel. “Guiding attention in sequence-to-sequence models for dialogue act
prediction”. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. 05.
2020, pp. 7594–7601.
[38] Torrey A Creed, Sarah A Frankel, Ramaris E German, Kelly L Green, Shari Jager-Hyman,
Kristin P Taylor, Abby D Adler, Courtney B Wolk, Shannon W Stirman,
Scott H Waltman, et al. “Implementation of transdiagnostic cognitive therapy in
community behavioral health: The Beck Community Initiative.” In: Journal of consulting
and clinical psychology 84.12 (2016), p. 1116.
[39] Nicholas Cummins, Stefan Scherer, Jarek Krajewski, Sebastian Schnieder, Julien Epps,
and Thomas F Quatieri. “A review of depression and suicide risk assessment using speech
analysis”. In: Speech Communication 71 (July 2015), pp. 10–49.
[40] Gilles Degottex, John Kane, Thomas Drugman, Tuomo Raitio, and Stefan Scherer.
“COVAREP—A collaborative voice analysis repository for speech technologies”. In: 2014
ieee international conference on acoustics, speech and signal processing (icassp). IEEE.
2014, pp. 960–964.
[41] Dorottya Demszky, Dana Movshovitz-Attias, Jeongwoo Ko, Alan Cowen,
Gaurav Nemade, and Sujith Ravi. “GoEmotions: A Dataset of Fine-Grained Emotions”.
In: Proceedings of the 58th Annual Meeting of the Association for Computational
Linguistics. Online: Association for Computational Linguistics, July 2020, pp. 4040–4054.
DOI: 10.18653/v1/2020.acl-main.372.
[42] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. “Imagenet: A
large-scale hierarchical image database”. In: 2009 IEEE conference on computer vision
and pattern recognition. IEEE. 2009, pp. 248–255.
[43] David DeVault, Ron Artstein, Grace Benn, Teresa Dey, Ed Fast, Alesia Gainer,
Kallirroi Georgila, Jon Gratch, Arno Hartholt, Margaux Lhommet, et al. “SimSensei
Kiosk: A virtual human interviewer for healthcare decision support”. In: Proceedings of
the 2014 international conference on Autonomous agents and multi-agent systems.
International Foundation for Autonomous Agents and Multiagent Systems. 2014,
pp. 1061–1068.
[44] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “BERT:
Pre-training of Deep Bidirectional Transformers for Language Understanding”. In: Proc.
NAACL. 2019, pp. 4171–4186.
[45] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “Bert: Pre-training
of deep bidirectional transformers for language understanding”. In: arXiv preprint
arXiv:1810.04805 (2018).
[46] P. Ekman and W.V . Friesen. The Facial Action Coding System (FACS). Consulting
Psychologists Press, Stanford University, Palo Alto, 1978.
91
[47] MP Ewbank, R Cummins, V Tablan, A Catarino, S Buchholz, and AD Blackwell.
“Understanding the relationship between patient language and outcomes in
internet-enabled cognitive behavioural therapy: A deep learning approach to automatic
coding of session transcripts”. In: Psychotherapy Research (2020), pp. 1–13.
[48] Florian Eyben, Klaus R. Scherer, Bjorn W. Schuller, Johan Sundberg, Elisabeth Andre,
Carlos Busso, Laurence Y . Devillers, Julien Epps, Petri Laukka, Shrikanth S. Narayanan,
and Khiet P. Truong. “The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for
V oice Research and Affective Computing”. In: IEEE Transactions on Affective Computing
7.2 (Apr. 2016), pp. 190–202. ISSN: 19493045.
[49] Florian Eyben, Felix Weninger, Florian Gross, and Björn Schuller. “Recent developments
in openSMILE, the munich open-source multimedia feature extractor”. In: Proceedings of
the 21st ACM international conference on Multimedia - MM ’13. New York, New York,
USA: ACM Press, Oct. 2013, pp. 835–838. ISBN: 9781450324045.
[50] Florian Eyben, Martin Wöllmer, and Björn Schuller. “OpenSMILE: The Munich Versatile
and Fast Open-source Audio Feature Extractor”. In: Proceedings of the 18th ACM
International Conference on Multimedia. MM ’10. Firenze, Italy: ACM, 2010,
pp. 1459–1462. ISBN: 978-1-60558-933-6.
[51] Nikolaos Flemotomos, Victor R Martinez, Zhuohao Chen, Torrey A Creed,
David C Atkins, and Shrikanth Narayanan. “Automated quality assessment of cognitive
behavioral therapy sessions through highly contextualized language representations”. In:
PloS one 16.10 (2021), e0258639.
[52] Nikolaos Flemotomos, Victor R Martinez, James Gibson, David C Atkins, Torrey A Creed,
and Shrikanth S Narayanan. “Language Features for Automated Evaluation of Cognitive
Behavior Psychotherapy Sessions.” In: Interspeech. 2018, pp. 1908–1912.
[53] Jon Fokas Kathryn and Houck and Barbara McCrady. “Inside Alcohol Behavioral Couple
Therapy (ABCT): In-session speech trajectories and drinking outcomes”. In: Journal of
Substance Use & Addiction Treatment (JSAT) 118 (2020).
[54] Jacques Gaume, Gerhard Gmel, Mohamed Faouzi, and Jean-Bernard Daeppen.
“Counselor skill influences outcomes of brief motivational interventions”. In: Journal of
substance abuse treatment 37.2 (2009), pp. 151–159.
[55] James Gibson, Dogan Can, Bo Xiao, Zac E Imel, David C Atkins, Panayiotis Georgiou,
and Shrikanth Narayanan. “A deep learning approach to modeling empathy in addiction
counseling”. In: Commitment 111 (2016), p. 21.
[56] James Gibson, Nikos Malandrakis, Francisco Romero, David C. Atkins, and
Shrikanth S. Narayanan. “Predicting therapist empathy in motivational interviews using
language features inspired by psycholinguistic norms”. In: Interspeech. 2015.
92
[57] Jonathan Gratch, Ron Artstein, Gale M Lucas, Giota Stratou, Stefan Scherer,
Angela Nazarian, Rachel Wood, Jill Boberg, David DeVault, Stacy Marsella, et al. “The
distress analysis interview corpus of human and computer interviews.” In: LREC. Citeseer.
2014, pp. 3123–3128.
[58] Jochen Hartmann. Emotion English DistilRoBERTa-base. online. 2022. URL:
%5Curl%7Bhttps://huggingface.co/j-hartmann/emotion-english-distilroberta-base/%7D.
[59] Devamanyu Hazarika, Soujanya Poria, Amir Zadeh, Erik Cambria,
Louis-Philippe Morency, and Roger Zimmermann. “Conversational memory network for
emotion recognition in dyadic dialogue videos”. In: Proceedings of the conference.
Association for Computational Linguistics. North American Chapter. Meeting. V ol. 2018.
NIH Public Access. 2018, p. 2122.
[60] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep residual learning for
image recognition”. In: Proceedings of the IEEE conference on computer vision and
pattern recognition. 2016, pp. 770–778.
[61] Zihao He, Leili Tavabi, Kristina Lerman, and Mohammad Soleymani. “Speaker turn
modeling for dialogue act classification”. In: arXiv preprint arXiv:2109.05056 (2021).
[62] Ryuichiro Higashinaka, Kenji Imamura, Toyomi Meguro, Chiaki Miyazaki,
Nozomi Kobayashi, Hiroaki Sugiyama, Toru Hirano, Toshiro Makino, and
Yoshihiro Matsuo. “Towards an open-domain conversational system fully based on natural
language processing”. In: Proceedings of COLING 2014, the 25th International
Conference on Computational Linguistics: Technical Papers. 2014, pp. 928–939.
[63] Ryuichiro Higashinaka, Katsuhito Sudoh, and Mikio Nakano. “Incorporating discourse
features into confidence scoring of intention recognition results in spoken dialogue
systems”. In: Speech Communication 48.3-4 (2006), pp. 417–436.
[64] Jon Houck, Sarah Hunter, Jennifer Benson, Linda Cochrum, Lauren Rowell, and
Elizabeth D’Amico. “Temporal variation in facilitator and client behavior during group
motivational interviewing sessions”. In: Psychology of Addictive Behaviors 29.4 (2015),
pp. 941–949.
[65] Xiaolei Huang, Lixing Liu, Kate Carey, Joshua Woolley, Stefan Scherer, and
Brian Borsari. “Modeling Temporality of Human Intentions by Domain Adaptation”. In:
Proceedings of the 2018 Conference on Empirical Methods in Natural Language
Processing. Brussels, Belgium: Association for Computational Linguistics, Oct. 2018,
pp. 696–701. DOI: 10.18653/v1/D18-1074.
[66] Xiaolei Huang, Lixing Liu, Kate Carey, Joshua Woolley, Stefan Scherer, and
Brian Borsari. “Modeling temporality of human intentions by domain adaptation”. In:
Proceedings of the 2018 Conference on Empirical Methods in Natural Language
Processing. 2018, pp. 696–701.
93
[67] Clayton J Hutto and Eric Gilbert. “Vader: A parsimonious rule-based model for sentiment
analysis of social media text”. In: Eighth international AAAI conference on weblogs and
social media. May 2014.
[68] Flora Ioannidou and Vaya Konstantikaki. “Empathy and emotional intelligence: What is it
really about?” In: International Journal of caring sciences 1.3 (2008), p. 118.
[69] Shaoxiong Ji, Tianlin Zhang, Luna Ansari, Jie Fu, Prayag Tiwari, and Erik Cambria.
“Mentalbert: Publicly available pretrained language models for mental healthcare”. In:
arXiv preprint arXiv:2110.15621 (2021).
[70] Yangfeng Ji, Gholamreza Haffari, and Jacob Eisenstein. “A latent variable recurrent
neural network for discourse relation language models”. In: arXiv preprint
arXiv:1603.01913 (2016).
[71] Jyoti Joshi, Roland Goecke, Sharifa Alghowinem, Abhinav Dhall, Michael Wagner,
Julien Epps, Gordon Parker, and Michael Breakspear. “Multimodal assistive technologies
for depression diagnosis and monitoring”. In: Journal on Multimodal User Interfaces 7.3
(2013), pp. 217–228.
[72] Dan Jurafsky. “Switchboard SWBD-DAMSL shallow-discourse-function annotation
coders manual”. In: Institute of Cognitive Science Technical Report (1997).
[73] Jonathan Klein, Youngme Moon, and Rosalind W Picard. “This computer responds to user
frustration”. In: CHI’99 extended abstracts on Human factors in computing systems. 1999,
pp. 242–243.
[74] Harshit Kumar, Arvind Agarwal, Riddhiman Dasgupta, and Sachindra Joshi. “Dialogue
act sequence labeling using hierarchical encoder with crf”. In: Proceedings of the AAAI
Conference on Artificial Intelligence . V ol. 32. 1. 2018.
[75] Duc Le, Zakaria Aldeneh, and Emily Mower Provost. “Discretized Continuous Speech
Emotion Recognition with Multi-Task Deep Recurrent Neural Network”. In: Proc.
Interspeech 2017. 2017, pp. 1108–1112. DOI: 10.21437/Interspeech.2017-94.
[76] Ji Young Lee and Franck Dernoncourt. “Sequential short-text classification with recurrent
and convolutional neural networks”. In: arXiv preprint arXiv:1603.03827 (2016).
[77] Iolanda Leite, André Pereira, Samuel Mascarenhas, Carlos Martinho, Rui Prada, and
Ana Paiva. “The influence of empathy in human–robot relations”. In: International
journal of human-computer studies 71.3 (2013), pp. 250–260.
[78] Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, and Dan Jurafsky.
“Adversarial learning for neural dialogue generation”. In: arXiv preprint
arXiv:1701.06547 (2017).
94
[79] Ruizhe Li, Chenghua Lin, Matthew Collinson, Xiao Li, and Guanyi Chen. “A
dual-attention hierarchical recurrent neural network for dialogue act classification”. In:
arXiv preprint arXiv:1810.09154 (2018).
[80] Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. “Dailydialog: A
manually labelled multi-turn dialogue dataset”. In: arXiv preprint arXiv:1710.03957
(2017).
[81] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy,
Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. “Roberta: A robustly optimized
bert pretraining approach”. In: 2019.
[82] Sarah Peregrine Lord, Elisa Sheng, Zac E. Imel, John Baer, and David C. Atkins. “More
Than Reflections: Empathy in Motivational Interviewing Includes Language Style
Synchrony Between Therapist and Client”. en. In: Behavior Therapy 46.3 (May 2015),
pp. 296–303. ISSN: 00057894. DOI: 10.1016/j.beth.2014.11.002. (Visited on
05/25/2020).
[83] Ilya Loshchilov and Frank Hutter. “Decoupled Weight Decay Regularization”. In:
International Conference on Learning Representations. 2019. URL:
https://openreview.net/forum?id=Bkg6RiCqY7.
[84] Nichola Lubold and Heather Pon-Barry. “Acoustic-prosodic entrainment and rapport in
collaborative learning dialogues”. In: Proceedings of the 2014 ACM workshop on
Multimodal Learning Analytics Workshop and Grand Challenge. 2014, pp. 5–12.
[85] Gale M Lucas, Jonathan Gratch, Aisha King, and Louis-Philippe Morency. “It’s only a
computer: Virtual humans increase willingness to disclose”. In: Computers in Human
Behavior 37 (2014), pp. 94–100.
[86] Molly Magill, Timothy R Apodaca, Brian Borsari, Jacques Gaume, Ariel Hoadley,
Rebecca EF Gordon, J Scott Tonigan, and Theresa Moyers. “A meta-analysis of
motivational interviewing process: Technical, relational, and conditional process models
of change.” In: Journal of consulting and clinical psychology 86.2 (2018), p. 140.
[87] Molly Magill, Jacques Gaume, Timothy R. Apodaca, Justin Walthers,
Nadine R. Mastroleo, Brian Borsari, and Richard Longabaugh. “The Technical
Hypothesis of Motivational Interviewing: A Meta-Analysis of MI’s Key Causal Model”.
In: Journal of consulting and clinical psychology 82.6 (Dec. 2014), pp. 973–983. ISSN:
0022-006X. DOI: 10.1037/a0036833. (Visited on 05/25/2020).
[88] Molly Magill, Tim Janssen, Nadine R. Mastroleo, Ariel Hoadley, Justin Walthers,
Nancy P. Barnett, and Suzanne M Colby. “Motivational Interviewing Technical Process
and Moderated Relational Process With Underage Young Adult Heavy Drinkers”. In:
Psychology of Addictive Behaviors 33 (2019), pp. 128–138.
95
[89] Navonil Majumder, Soujanya Poria, Devamanyu Hazarika, Rada Mihalcea,
Alexander Gelbukh, and Erik Cambria. “Dialoguernn: An attentive rnn for emotion
detection in conversations”. In: arXiv preprint arXiv:1811.00405 (2018).
[90] Angeliki Metallinou, Athanasios Katsamanis, and Shrikanth Narayanan. “Tracking
continuous emotional trends of participants during affective dyadic interactions using
body language and speech information”. In: Image and Vision Computing 31.2 (2013),
pp. 137–152.
[91] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. “Distributed
Representations of Words and Phrases and their Compositionality”. In: Advances in
Neural Information Processing Systems. Ed. by C.J. Burges, L. Bottou, M. Welling,
Z. Ghahramani, and K.Q. Weinberger. V ol. 26. Curran Associates, Inc., 2013.
[92] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. “Distributed
representations of words and phrases and their compositionality”. In: Advances in neural
information processing systems. 2013, pp. 3111–3119.
[93] William Miller and Stephen Rollnick. Motivational interviewing: Helping people change.
Guilford Press, 2013.
[94] William R Miller, R Gayle Benefield, and J Scott Tonigan. “Enhancing motivation for
change in problem drinking: a controlled comparison of two therapist styles.” In: (2001).
[95] William R Miller, Theresa B Moyers, Denise Ernst, and Paul Amrhein. “Manual for the
motivational interviewing skill code (MISC)”. In: Unpublished manuscript. Albuquerque:
Center on Alcoholism, Substance Abuse and Addictions, University of New Mexico (2003).
[96] William R Miller and Stephen Rollnick. Motivational interviewing: Helping people
change. Guilford press, 2012.
[97] William R Miller and Stephen Rollnick. Motivational interviewing: Helping people
change (applications of motivational interviewing). Guilford press, 2013.
[98] William R. Miller and Gary S. Rose. “Toward a Theory of Motivational Interviewing”. In:
The American psychologist 64.6 (Sept. 2009), pp. 527–537. ISSN: 0003-066X. DOI:
10.1037/a0016830. (Visited on 05/25/2020).
[99] Saif Mohammad, Felipe Bravo-Marquez, Mohammad Salameh, and Svetlana Kiritchenko.
“Semeval-2018 task 1: Affect in tweets”. In: Proceedings of the 12th international
workshop on semantic evaluation. 2018, pp. 1–17.
[100] Michelle Morales, Stefan Scherer, and Rivka Levitan. “A Cross-modal Review of
Indicators for Depression Detection Systems”. In: Proc 4th Workshop on Computational
Linguistics and Clinical Psychology – From Linguistic Signal to Clinical Reality.
Vancouver, BC: ACL, 2017, pp. 1–12.
96
[101] Louis-Philippe Morency, Iwan de Kok, and Jonathan Gratch. “A probabilistic multimodal
approach for predicting listener backchannels”. In: Autonomous agents and multi-agent
systems 20 (2010), pp. 70–84.
[102] Jon Morgenstern, Alexis Kuerbis, Paul Amrhein, Lisa Hail, Kevin Lynch, and
James McKay. “Motivational Interviewing: A Pilot Test of Active Ingredients and
Mechanisms of Change”. In: Psychology of Addictive Behaviors 26 (4 2012).
[103] TB Moyers, T Martin, JK Manuel, WR Miller, and D Ernst. “Revised global scales:
Motivational interviewing treatment integrity 3.1. 1 (MITI 3.1. 1)”. In: Unpublished
manuscript, University of New Mexico, Albuquerque, NM (2010).
[104] Theresa Moyers, Tim Martin, Delwyn Catley, Kari Jo Harris, and Jasjit S Ahluwalia.
“Assessing the integrity of motivational interviewing interventions: Reliability of the
motivational interviewing skills code”. In: Behavioural and Cognitive Psychotherapy 31.2
(2003), pp. 177–184.
[105] Theresa B Moyers, William R Miller, and Stacey ML Hendrickson. “How does
motivational interviewing work? Therapist interpersonal skill predicts client involvement
within motivational interviewing sessions.” In: Journal of consulting and clinical
psychology 73.4 (2005), p. 590.
[106] Theresa B Moyers and Stephen Rollnick. “A motivational interviewing perspective on
resistance in psychotherapy”. In: Journal of clinical psychology 58.2 (2002), pp. 185–193.
[107] Tim Norfolk, Kamal Birdi, and Deirdre Walsh. “The role of empathy in establishing
rapport in the consultation: a new model”. In: Medical education 41.7 (2007),
pp. 690–697.
[108] Magalie Ochs, Catherine Pelachaud, and David Sadek. “An empathic virtual dialog agent
to improve human-machine interaction”. In: Proceedings of the 7th international joint
conference on Autonomous agents and multiagent systems-Volume 1. 2008, pp. 89–96.
[109] Hae Won Park, Mirko Gelsomini, Jin Joo Lee, Tonghui Zhu, and Cynthia Breazeal.
“Backchannel opportunity prediction for social robot listeners”. In: 2017 IEEE
International Conference on Robotics and Automation (ICRA). IEEE. 2017,
pp. 2308–2314.
[110] Timo Partala and Veikko Surakka. “The effects of affective interventions in
human–computer interaction”. In: Interacting with computers 16.2 (2004), pp. 295–309.
[111] James W Pennebaker, Martha E Francis, and Roger J Booth. “Linguistic inquiry and word
count: LIWC 2001”. In: Mahway: Lawrence Erlbaum Associates 71.2001 (2001), p. 2001.
97
[112] James W Pennebaker, Matthias R Mehl, and Kate G Niederhoffer. “Psychological
Aspects of Natural Language Use: Our Words, Our Selves”. In: Annual Review of
Psychology 54.1 (2003), pp. 547–577.
[113] Jeffrey Pennington, Richard Socher, and Christopher Manning. “GloVe: Global Vectors
for Word Representation”. In: Proceedings of the 2014 Conference on Empirical Methods
in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational
Linguistics, Oct. 2014, pp. 1532–1543. DOI: 10.3115/v1/D14-1162.
[114] Verónica Pérez-Rosas, Rada Mihalcea, and Louis-Philippe Morency. “Utterance-level
multimodal sentiment analysis”. In: Proceedings of the 51st Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers). V ol. 1. 2013,
pp. 973–982.
[115] Rosalind W Picard and Karen K Liu. “Relative subjective count and assessment of
interruptive technologies applied to mobile monitoring of stress”. In: International
Journal of Human-Computer Studies 65.4 (2007), pp. 361–375.
[116] Soujanya Poria, Erik Cambria, and Alexander Gelbukh. “Deep convolutional neural
network textual features and multiple kernel learning for utterance-level multimodal
sentiment analysis”. In: Proceedings of the 2015 conference on empirical methods in
natural language processing. 2015, pp. 2539–2544.
[117] Soujanya Poria, Erik Cambria, Newton Howard, Guang-Bin Huang, and Amir Hussain.
“Fusing audio, visual and textual clues for sentiment analysis from multimodal content”.
In: Neurocomputing 174 (2016), pp. 50–59.
[118] Soujanya Poria, Iti Chaturvedi, Erik Cambria, and Amir Hussain. “Convolutional MKL
based multimodal emotion recognition and sentiment analysis”. In: 2016 IEEE 16th
international conference on data mining (ICDM). IEEE. 2016, pp. 439–448.
[119] Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria,
and Rada Mihalcea. “MELD: A Multimodal Multi-Party Dataset for Emotion Recognition
in Conversations”. In: Proceedings of the 57th Annual Meeting of the Association for
Computational Linguistics. Florence, Italy: Association for Computational Linguistics,
July 2019, pp. 527–536. DOI: 10.18653/v1/P19-1050.
[120] Robert Porzel, Annika Scheffler, and Rainer Malaka. “How entrainment increases
dialogical effectiveness”. In: Proceedings of the IUI. V ol. 6. Citeseer Sydney, NSW. 2006,
pp. 35–42.
[121] Libo Qin, Zhouyang Li, Wanxiang Che, Minheng Ni, and Ting Liu. “Co-GAT: A
Co-Interactive Graph Attention Network for Joint Dialog Act Recognition and Sentiment
Classification”. In: Proceedings of the AAAI Conference on Artificial Intelligence . V ol. 35.
15. 2021, pp. 13709–13717.
98
[122] Vipul Raheja and Joel Tetreault. “Dialogue act classification with context-aware
self-attention”. In: arXiv preprint arXiv:1904.02594 (2019).
[123] Nairan Ramirez-Esparza, Cindy K Chung, Ewa Kacewicz, and James W Pennebaker.
“The Psychology of Word Use in Depression Forums in English and in Spanish: Texting
Two Text Analytic Approaches”. In: International Conference on Weblogs and Social
Media. Seattle, WA: AAAI, 2008, pp. 102–108.
[124] Sujith Ravi and Zornitsa Kozareva. “Self-governing neural networks for on-device short
text classification”. In: Proceedings of the 2018 Conference on Empirical Methods in
Natural Language Processing. 2018, pp. 887–893.
[125] Carl R Rogers. “The necessary and sufficient conditions of therapeutic personality
change.” In: Journal of consulting psychology 21.2 (1957), p. 95.
[126] Stephen Rollnick and William R Miller. “What is motivational interviewing?” In:
Behavioural and cognitive Psychotherapy 23.4 (1995), pp. 325–334.
[127] Stephen Rollnick, William R Miller, Christopher C Butler, and Mark S Aloia.
Motivational interviewing in health care: helping patients change behavior. 2008.
[128] Sune Rubak, Annelli Sandbæk, Torsten Lauritzen, and Bo Christensen. “Motivational
interviewing: a systematic review and meta-analysis”. In: British journal of general
practice 55.513 (2005), pp. 305–312.
[129] Elvis Saravia, Hsien-Chi Toby Liu, Yen-Hao Huang, Junlin Wu, and Yi-Shin Chen.
“CARER: Contextualized Affect Representations for Emotion Recognition”. In:
Proceedings of the 2018 Conference on Empirical Methods in Natural Language
Processing. Brussels, Belgium: Association for Computational Linguistics, Oct. 2018,
pp. 3687–3697. DOI: 10.18653/v1/D18-1404.
[130] Ruth Scheeffer. “Toward effective counseling and psychotherapy”. In: Arquivos
Brasileiros de Psicologia Aplicada 23.1 (1971), pp. 151–152.
[131] Klaus R Scherer and Harald G Wallbott. “Evidence for universality and cultural variation
of differential emotion response patterning.” In: Journal of personality and social
psychology 66.2 (1994), p. 310.
[132] Stefan Scherer, Giota Stratou, Jonathan Gratch, Jill Boberg, Marwa Mahmoud,
Albert (Skip) Rizzo, and Louis-Philippe Morency. “Automatic Behavior Descriptors for
Psychological Disorder Analysis”. In: Proc. 10th IEEE International Conference and
Workshops on Automatic Face & Gesture Recognition (FG). 8 pages. Shanghai, P. R.
China: IEEE, Apr. 2013.
99
[133] Stefan Scherer, Giota Stratou, Gale Lucas, Marwa Mahmoud, Jill Boberg,
Jonathan Gratch, Albert (Skip) Rizzo, and Louis-Philippe Morency. “Automatic
audiovisual behavior descriptors for psychological disorder analysis”. In: Image and
Vision Computing 32.10 (Oct. 2014), pp. 648–658.
[134] John R Searle, PG Searle, S Willis, John Rogers Searle, et al. Speech acts: An essay in the
philosophy of language. V ol. 626. Cambridge university press, 1969.
[135] Elizabeth Shriberg, Raj Dhillon, Sonali Bhagat, Jeremy Ang, and Hannah Carvey. The
ICSI meeting recorder dialog act (MRDA) corpus. Tech. rep. INTERNATIONAL
COMPUTER SCIENCE INST BERKELEY CA, 2004.
[136] Elizabeth Shriberg, Andreas Stolcke, Daniel Jurafsky, Noah Coccaro, Marie Meteer,
Rebecca Bates, Paul Taylor, Klaus Ries, Rachel Martin, and Carol Van Ess-Dykema. “Can
prosody aid the automatic classification of dialog acts in conversational speech?” In:
Language and speech 41.3-4 (1998), pp. 443–492.
[137] Karan Singla, Zhuohao Chen, Nikolaos Flemotomos, James Gibson, Dogan Can,
David Atkins, and Shrikanth Narayanan. “Using Prosodic and Lexical Information for
Learning Utterance-level Behaviors in Psychotherapy”. en. In: Interspeech 2018. ISCA,
Sept. 2018, pp. 3413–3417. DOI: 10.21437/Interspeech.2018-2551. (Visited on
05/30/2020).
[138] Mohammad Soleymani, David Garcia, Brendan Jou, Björn Schuller, Shih-Fu Chang, and
Maja Pantic. “A survey of multimodal sentiment analysis”. In: Image and Vision
Computing 65 (2017), pp. 3–14.
[139] Andreas Stolcke, Klaus Ries, Noah Coccaro, Elizabeth Shriberg, Rebecca Bates,
Daniel Jurafsky, Paul Taylor, Rachel Martin, Carol Van Ess-Dykema, and Marie Meteer.
“Dialogue act modeling for automatic tagging and recognition of conversational speech”.
In: Computational linguistics 26.3 (2000), pp. 339–373.
[140] Leili Tavabi. “Multimodal machine learning for interactive mental health therapy”. In:
2019 International Conference on Multimodal Interaction. 2019, pp. 453–456.
[141] Leili Tavabi, Anna Poon, Albert Skip Rizzo, and Mohammad Soleymani.
“Computer-based PTSD assessment in VR exposure therapy”. In: HCI International
2020–Late Breaking Papers: Virtual and Augmented Reality: 22nd HCI International
Conference, HCII 2020, Copenhagen, Denmark, July 19–24, 2020, Proceedings 22.
Springer. 2020, pp. 440–449.
[142] Leili Tavabi, Kalin Stefanov, Setareh Nasihati Gilani, David Traum, and
Mohammad Soleymani. “Multimodal learning for identifying opportunities for empathetic
responses”. In: 2019 International Conference on Multimodal Interaction. 2019,
pp. 95–104.
100
[143] Leili Tavabi, Kalin Stefanov, Larry Zhang, Brian Borsari, Joshua D Woolley,
Stefan Scherer, and Mohammad Soleymani. “Multimodal Automatic Coding of Client
Behavior in Motivational Interviewing”. In: Proceedings of the 2020 International
Conference on Multimodal Interaction. 2020, pp. 406–413.
[144] Leili Tavabi, Trang Tran, Kalin Stefanov, Brian Borsari, Joshua Woolley, Stefan Scherer,
and Mohammad Soleymani. “Analysis of Behavior Classification in Motivational
Interviewing”. In: Proceedings of the Seventh Workshop on Computational Linguistics
and Clinical Psychology: Improving Access. Online: Association for Computational
Linguistics, June 2021, pp. 110–115. DOI: 10.18653/v1/2021.clpsych-1.13.
[145] Leili Tavabi, Trang Tran, Kalin Stefanov, Brian Borsari, Joshua Woolley, Stefan Scherer,
and Mohammad Soleymani. “Analysis of Behavior Classification in Motivational
Interviewing”. In: Proceedings of the Seventh Workshop on Computational Linguistics
and Clinical Psychology: Improving Access. 2021, pp. 110–115.
[146] The State Of Mental Health In America 1.
https://www.mhanational.org/issues/state-mental-health-america. 2023.
[147] The State Of Mental Health In America 2.
https://www.who.int/news-room/fact-sheets/detail/mental-disorders. 2023.
[148] Charles Truax. “Research on certain therapist interpersonal skill in relation to process and
outcome”. In: Handbook of Psychotherapy and Befavioral Change (1971).
[149] Charles B Truax and Robert Carkhuff. Toward effective counseling and psychotherapy:
Training and practice. Transaction Publishers, 2007.
[150] Shao-Yen Tseng, Brian R Baucom, and Panayiotis G Georgiou. “Approaching Human
Performance in Behavior Estimation in Couples Therapy Using Deep Sentence
Embeddings.” In: INTERSPEECH. 2017, pp. 3291–3295.
[151] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. “Attention is all you need”. In:
Advances in neural information processing systems. 2017, pp. 5998–6008.
[152] Yao Wan, Wenqiang Yan, Jianwei Gao, Zhou Zhao, Jian Wu, and S Yu Philip. “Improved
dynamic memory network for dialogue act classification with adversarial training”. In:
2018 IEEE International Conference on Big Data (Big Data). IEEE. 2018, pp. 841–850.
[153] Henny A Westra. “Comparing the predictive capacity of observed in-session resistance to
self-reported motivation in cognitive behavioral therapy”. In: Behaviour research and
therapy 49.2 (2011), pp. 106–113.
101
[154] James R. Williamson, Thomas F. Quatieri, Brian S. Helfer, Rachelle Horwitz, Bea Yu, and
Daryush D. Mehta. “V ocal Biomarkers of Depression Based on Motor Incoordination”. In:
Proc. 3rd ACM International Workshop on Audio/Visual Emotion Challenge. A VEC’13.
Barcelona, Spain: ACM, 2013, pp. 41–48.
[155] Martin Wöllmer, Angeliki Metallinou, Florian Eyben, Björn Schuller, and
Shrikanth Narayanan. “Context-sensitive multimodal emotion recognition from speech
and facial expression using bidirectional lstm modeling”. In: Proc. INTERSPEECH 2010,
Makuhari, Japan. 2010, pp. 2362–2365.
[156] Bo Xiao, Dogan Can, Panayiotis G Georgiou, David Atkins, and Shrikanth S Narayanan.
“Analyzing the language of therapist empathy in motivational interview based
psychotherapy”. In: Proceedings of The 2012 Asia Pacific Signal and Information
Processing Association Annual Summit and Conference. IEEE. 2012, pp. 1–4.
[157] Bo Xiao, Dogan Can, James Gibson, Zac E Imel, David C Atkins, Panayiotis G Georgiou,
and Shrikanth S Narayanan. “Behavioral Coding of Therapist Language in Addiction
Counseling Using Recurrent Neural Networks.” In: Interspeech. 2016, pp. 908–912.
[158] Bo Xiao, Do˘ gan Can, James Gibson, Zac E. Imel, David C. Atkins, Panayiotis Georgiou,
and Shrikanth S. Narayanan. “Behavioral Coding of Therapist Language in Addiction
Counseling Using Recurrent Neural Networks”. en. In: Sept. 2016, pp. 908–912. DOI:
10.21437/Interspeech.2016-1560. (Visited on 06/05/2020).
[159] Bo Xiao, Zac E Imel, Panayiotis G Georgiou, David C Atkins, and Shrikanth S Narayanan.
“" Rate My Therapist": Automated Detection of Empathy in Drug and Alcohol Counseling
via Speech and Language Processing”. In: PloS one 10.12 (2015), e0143055.
[160] Han Xiao. bert-as-service. https://github.com/hanxiao/bert-as-service. 2018.
[161] Lei Xie, Naicai Sun, and Bo Fan. “A statistical parametric approach to video-realistic
text-driven talking avatar”. In: Multimedia tools and applications 73.1 (2014),
pp. 377–396.
[162] Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency.
“Tensor fusion network for multimodal sentiment analysis”. In: arXiv preprint
arXiv:1707.07250 (2017).
102
Abstract
The prevalence of mental health conditions has increased worldwide in recent years, with about 20% of adults in the US having experienced a mental health disorder; however, more than half of this population does not receive the care and treatment they need.
Despite the prevalence of mental health disorders, there is a large gap between the need for diagnosis and treatment and the resources available to meet it. Recent advances in machine learning and deep learning provide an opportunity to develop AI-assisted analysis and assessment of therapy sessions through behavior understanding and the measurement of clients' symptoms.
Automatic behavior analysis can augment clinical resources in diagnosis and in the assessment of treatment quality and efficacy. Behavior perception modules built from clients' and therapists' in-session data can be used in multiple ways: modeling in-session client and therapist behaviors to investigate their associations with treatment quality; building the underlying components of intelligent agents capable of conducting automated clinical interviews that probe indicators of mental health disorders; and identifying salient behavioral patterns associated with specific disorders to support AI-assisted diagnosis.
This dissertation is organized around the following directions:
Automated assessment of therapy sessions: We develop behavior analysis models for the automated assessment of therapy sessions, with a special focus on Motivational Interviewing (MI). Using transcripts from real-world MI sessions, we propose approaches for modeling and analyzing client-therapist dialogue toward automatic recognition of client intent at the utterance level as well as session-level quality metrics such as therapist empathy, and we further explore the association of in-session data with subsequent behavioral outcomes. For a more interpretable understanding, we also identify psychologically relevant features associated with the modeled constructs. To validate our models on a more general task with large publicly available datasets, we train and evaluate our architectures on benchmark datasets for dialogue act classification, proposing learnable embeddings that model turn changes in dialogue. Our approach significantly improves task performance compared to existing models.
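To make the turn-change modeling concrete, below is a minimal sketch in PyTorch. It is only an illustration under assumed dimensions and class counts, not the dissertation's actual implementation: a learnable embedding indexed by a binary speaker-change flag is added to pooled utterance embeddings before a recurrent encoder predicts a dialogue act for each utterance. All names (e.g., TurnAwareDialogueActTagger, turn_change) and default values are hypothetical.

import torch
import torch.nn as nn

class TurnAwareDialogueActTagger(nn.Module):
    """Toy model: utterance embeddings plus a learnable speaker-change embedding."""
    def __init__(self, utt_dim=768, hidden_dim=256, num_acts=4):
        super().__init__()
        # index 0 = same speaker as the previous utterance, 1 = speaker changed
        self.turn_embedding = nn.Embedding(2, utt_dim)
        self.encoder = nn.GRU(utt_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_acts)

    def forward(self, utt_embs, turn_change):
        # utt_embs: (batch, num_utts, utt_dim) pooled utterance representations
        # turn_change: (batch, num_utts) binary speaker-change indicators
        x = utt_embs + self.turn_embedding(turn_change)
        hidden, _ = self.encoder(x)
        return self.classifier(hidden)  # (batch, num_utts, num_acts) logits

# Toy usage with random tensors standing in for real utterance features.
model = TurnAwareDialogueActTagger()
utterances = torch.randn(2, 10, 768)        # stand-in for pretrained sentence embeddings
turn_flags = torch.randint(0, 2, (2, 10))   # 1 wherever the speaker changes
logits = model(utterances, turn_flags)      # shape: (2, 10, 4)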
Development of perception models to facilitate intelligent human-agent interactions: Focusing on empathy as an important construct in therapy, we develop models for recognizing opportunities for expressing empathy in human-agent interactions. To this end, we train our models on multimodal data from clinical interviews designed to probe indicators of mental health disorders. The proposed models provide a basis for more emotionally intelligent human-agent interactions.
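As a hypothetical illustration of such a perception module, the following sketch fuses utterance-level text and acoustic features with a simple late-fusion classifier that flags moments where an empathetic response may be appropriate. The feature types and dimensions (e.g., 88 acoustic functionals) are assumptions for the example, and the class and variable names are invented; this is not the system described in the dissertation.

import torch
import torch.nn as nn

class EmpathyOpportunityClassifier(nn.Module):
    """Toy late-fusion model over text and acoustic utterance features."""
    def __init__(self, text_dim=768, audio_dim=88, hidden_dim=128):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(2 * hidden_dim, 1))

    def forward(self, text_feat, audio_feat):
        # Concatenate per-modality projections, then score the fused representation.
        fused = torch.cat([self.text_proj(text_feat), self.audio_proj(audio_feat)], dim=-1)
        return self.head(fused)  # one logit per utterance: empathy opportunity or not

# Toy usage with random tensors standing in for real features.
model = EmpathyOpportunityClassifier()
text_features = torch.randn(4, 768)    # stand-in for pooled transformer embeddings
audio_features = torch.randn(4, 88)    # stand-in for utterance-level acoustic functionals
probabilities = torch.sigmoid(model(text_features, audio_features))  # shape: (4, 1)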
Overall, this work computationally studies and models aspects of therapy sessions and clinical interviews, providing the means for more efficient and objective assessment of sessions as well as important indicators of subsequent behavioral outcomes.
Asset Metadata
Creator: Tavabi, Leili (author)
Core Title: Computational modeling of mental health therapy sessions
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science
Degree Conferral Date: 2023-08
Publication Date: 08/11/2023
Defense Date: 05/03/2023
Publisher: University of Southern California. Libraries (digital)
Tags: dialogue, mental health, motivational interviewing, natural language processing, OAI-PMH Harvest, therapy
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisors: Soleymani, Mohammad (committee chair), Mataric, Maja (committee member), Narayanan, Shrikanth (committee member), Scherer, Stefan (committee member)
Creator Email: ltavabi@usc.edu; leili.tavabi@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-oUC113298047
Unique Identifier: UC113298047
Identifier: etd-TavabiLeil-12245.pdf (filename)
Legacy Identifier: etd-TavabiLeil-12245
Document Type: Dissertation
Rights: Tavabi, Leili
Internet Media Type: application/pdf
Type: texts
Source: 20230814-usctheses-batch-1084 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus, Los Angeles, California 90089, USA
Repository Email: cisadmin@lib.usc.edu