SPOKEN LANGUAGE PROCESSING IN LOW RESOURCE SCENARIOS WITH
APPLICATIONS IN AUTOMATED BEHAVIORAL CODING
by
Zhuohao Chen
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
December 2023
Copyright 2023 Zhuohao Chen
Dedication
This dissertation is dedicated to my mom and dad.
Acknowledgements
My career as a PhD student has come to an end. It has been an unforgettable journey for me. There
are many people I want to express my gratitude to because, without them, I could never finish
this program.
I would like to thank my advisor, Dr. Shrikanth Narayanan, for providing me the opportunity to be a member of this great lab. Under his guidance and encouragement, I was able to grow into a qualified scientist and a better man.
I would like to thank the committee members of my qualifying exam and dissertation, Dr. Morteza Dehghani, Dr. Keith Jenkins, Dr. Jay Kuo and Dr. David Traum, for providing valuable feedback to make this dissertation better.
I would also like to thank my colleagues at SAIL. I will never forget the moments we worked and played together. Specifically, I want to mention Dr. Dogan Can and Dr. James Gibson for
their great help in the rookie year of my PhD.
I want to extend a special thanks to Dr. Alexander Sawchuk, Dr. Richard Leahy, Mrs. Tanya
Acevedo-Lam, Mrs. Diane Demetras, and Mr. Tim Boston for their incredible support over the
past few years.
Finally, I would like to thank my parents, who have always been my greatest source of support.
I dedicate this dissertation to them.
Table of Contents
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Behavioral Coding with Applications . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Automated Behavioral Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Data Sparsity Problem in Automated Behavioral Coding . . . . . . . . . . . . . . 2
1.4 Research Directions and Road Map . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.5 Road Map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Chapter 2: Machine Learning Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1 Long Document Classification with Transformers . . . . . . . . . . . . . . . . . . 6
2.2 Domain Adaptation in Text Classification . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Meta-learning and Task Similarity in Natural Language Processing . . . . . . . . 8
Chapter 3: Session Level Behavioral Coding . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.3.1 Hierarchical framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.3.2 Framework with local quality estimates . . . . . . . . . . . . . . . . . . . 13
3.3.2.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3.2.2 Connecting Session Quality and Segment Quality . . . . . . . . 14
3.4 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.5 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.5.1 BERT Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.5.2 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.6 Analysis and Discussion of Experimental Results . . . . . . . . . . . . . . . . . . . 27
3.6.1 Attention weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.6.2 Word distributions over segments . . . . . . . . . . . . . . . . . . . . . . . 30
3.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Chapter 4: Domain Adaptation for Utterance Level Behavioral Coding with Limited Labels 33
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.3.1 Problem Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.3.2 A Domain Adversarial Network with Label Proportion Estimation . . . . 35
4.3.2.1 Moments and Matrices Definition . . . . . . . . . . . . . . . . . 37
4.3.2.2 Label Proportions Estimation . . . . . . . . . . . . . . . . . . . 38
4.4 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.5 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.5.1 Experiments on Behavioral Coding in Psychotherapy . . . . . . . . . . . . 44
4.5.2 Experiments on Yelp Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Chapter 5: Meta-Transfer Learning for Utterance Level Behavioral Coding When Both Sam-
ples and Labels are Limited . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.3.1 Task Augmentation via Label Clustering . . . . . . . . . . . . . . . . . . . 52
5.3.2 Meta-learning Framework with Augmented Tasks . . . . . . . . . . . . . 54
5.4 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.5 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.5.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.5.2 Baseline Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.5.4 Effect of the Size of Label Subsets . . . . . . . . . . . . . . . . . . . . . . . 61
5.5.5 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.5.6 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.6 Extension of the Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.6.1 Adaptation to Prompt-Based Learning. . . . . . . . . . . . . . . . . . . . . 63
5.6.2 Unsupervised Meta-learning Framework with Task Augmentation . . . . 65
5.7 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Chapter 6: Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
List of Tables
3.1 CBT behavior codes defined by the CTRS manual . . . . . . . . . . . . . . . . . . 19
3.2 Statistics describing the datasets used for the experiments. . . . . . . . . . . . . . 20
3.3 Evaluation results for total CTRS scores, M: #utterances/segment, k: #times
processing the segment scores estimator, SQE: segment quality estimator . . . . . 26
3.4 Evaluation results for different segment length, Approach: BERT-small + LSTM
M: #utterances/segment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.5 Comparison of different segment quality estimator modes (for M = 40), update
the segment quality scores for k = 1 time. None: without using a segment
quality estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.1 Results of Behavioral Coding Experiments. . . . . . . . . . . . . . . . . . . . . . . 45
4.2 Results of Yelp Experiments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.1 Data statistics for behavior codes in Motivational Interviewing psychotherapy. . . 56
5.2 The UARs achieved on predicting therapist’s code. . . . . . . . . . . . . . . . . . . 59
5.3 The UARs achieved on predicting patient’s code. . . . . . . . . . . . . . . . . . . . 59
5.4 Effect of the size of label subset K, 8-way classification tasks of therapist. . . . . . 61
5.5 Effect of the size of label subset K, 3-way classification tasks of patient. . . . . . . 61
5.6 The UARs (%) achieved on predicting patient’s code with different framework
and NLP paradigms. MP refers to manual prompts. . . . . . . . . . . . . . . . . . 64
5.7 The number of utterances for clinical note sections . . . . . . . . . . . . . . . . . 68
5.8 The comparison between UMTA and other models. . . . . . . . . . . . . . . . . . 68
6.1 Statistics describing the SwDA datasets for the 42 tags scheme. . . . . . . . . . . . 88
6.2 Statistics describing the SwDA datasets for the 7 tags scheme. . . . . . . . . . . . 88
6.3 Label clustering results for therapist’s codes . . . . . . . . . . . . . . . . . . . . . . 89
6.4 Label clustering results for patient’s codes . . . . . . . . . . . . . . . . . . . . . . . 89
List of Figures
3.1 Proposed framework using hierarchical transformers. . . . . . . . . . . . . . . . . 13
3.2 Segment Quality Estimator. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.3 Distribution of the 11 CTRS codes (and the total CTRS) . . . . . . . . . . . . . . . 20
3.4 Mean attention weights across the sessions consisting of exactly 10 long
segments (of 40 utterances each) in the testing set. . . . . . . . . . . . . . . . . . . 28
3.5 Divide the segments in terms of the relative performance within the session. . . . 29
3.6 Comparison of term frequencies of key words between the low-score groups and
the high-score groups; low-score group consists of the segments whose estimated
scores are among the lowest 50% in the session they belong to, high-score group
consists of the segments whose estimated scores are among the highest 50% in
the session they belong to. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.1 The structure of DAN-LPE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2 Label distributions of MI behavioral coding Data. . . . . . . . . . . . . . . . . . . 45
4.3 Label distributions of Yelp Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.1 An example of task augmentation. . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.2 The flowchart of MTA framework. . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.3 The comparison between standard MTA and the random version of MTA for
predicting therapist’s behavioral codes. . . . . . . . . . . . . . . . . . . . . . . . . 62
5.4 The comparison between standard MTA and the random version of MTA for
predicting patient’s behavioral codes. . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.5 The flowchart of UMTA framework. . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.6 The flowchart of UMTA framework. . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.1 An example episode of doctor-patient conversation. . . . . . . . . . . . . . . . . . 93
Abstract
Advances in spoken language processing techniques have dramatically augmented productivity
and improved the quality of life. One of the most striking applications of such techniques is
automated human behavioral coding in domains such as diagnostic or therapeutic clinical inter-
actions. Behavioral coding is a procedure during which trained human coders listen to audio
recordings and review session transcripts to assess the session quality and specific interaction
attributes and mechanisms. Developing computational models of speech and natural language
processing for automated behavioral coding helps reduce the need for manual annotation and the burden placed on experts. However, most existing automated behavioral coding methods assume that we have enough in-domain samples and labels, an assumption that often does not hold in practice in low-resource scenarios.
In this dissertation, I discuss the roots of the data sparsity problem for automated behavioral
coding and address these issues using advanced spoken language processing techniques. Specif-
ically, I adopt the hierarchical transformer framework, domain adaptation model, meta-learning
and task augmentation approaches to build computational linguistics models for modeling hu-
man interactions in different low-resource scenarios. I compare these novel algorithms to baseline approaches and show improved performance.
We evaluate our automated behavioral coding algorithms in psychotherapy, which serves as an exemplary application domain. The datasets we use in our experiments are from cognitive behavioral therapy and motivational interviewing. Beyond that, we further apply our models to other styles of text data to demonstrate the generalizability of our algorithms.
Chapter 1
Introduction
1.1 Behavioral Coding with Applications
Understanding human communication and modeling interaction patterns in a dialogue setting is an inherently complex and challenging task. The gold-standard method for monitoring the qual-
ity of conversation is behavioral coding - a manual procedure in which experts manually identify
and annotate behaviors of the participants [10]. Among various applications, behavioral coding is
beneficial in understanding behaviors in psychotherapy interactions, which enables better treat-
ment by establishing metrics for therapy quality and tracking patient progress. However, such
a process requires human raters to listen to long audio recordings or read through manually transcribed sessions, which leads to a prohibitively high cost in terms of both time and human resources, and is therefore often infeasible in real-life scenarios [37].
1.2 Automated Behavioral Coding
To overcome such problems related to manual annotation and coding, computational approaches
for modeling and assessing the quality of conversation-based behavioral signals [79] have been
recently developed and used in multiple clinical domains such as for autism diagnosis [18], un-
derstanding oncology communication [4, 26], and supporting primary care [84]. A significant
amount of work has been particularly focused on psychotherapy interactions, including addic-
tion counseling [19, 119, 47, 118, 117] and couple therapy sessions [14, 110, 46]. Those methods
have focused on predicting both utterance-level and globally-coded, session-level behaviors using
either linguistic or acoustic features. Additionally, there have been many studies of multimodal
approaches that utilize both linguistic and acoustic characteristics [102, 5, 28, 105, 26].
1.3 Data Sparsity Problem in Automated Behavioral Coding
In the real world, however, implementing automated behavioral coding algorithms is associated
with several data sparsity challenges: (1) manual transcription, and especially annotation, re-
quires well-trained experts to perform the behavioral coding, which leads to a prohibitively high
cost; (2) the applied coding scheme varies a lot and depends on the specific domain and nature
of the interactions; (3) the high level of privacy and sensitivity of the domains (e.g., clinical interactions between doctor and patient, or therapist and client) makes in-domain data scarce; (4) some behavioral codes are assigned at multiple levels – from the utterance level to the full conversational level – which makes the label (code) distributions diverse and possibly limited.
1.4 Research Directions and Road Map
In this dissertation, I propose to address the aforementioned data-hungry challenges in automatic
behavioral coding. Specifically, I handle the low-resource issues in three scenarios and offer dif-
ferent spoken language processing methods.
Session Level Behavioral Coding
In psychotherapy, the quality of a session is typically assessed by the behavioral codes at the
full conversational level. These codes are inherently sparse because they are assigned only once per therapy session. To handle this, we propose a hierarchical framework wherein we divide
each psychotherapy session into conversation segments to assess the overall session quality in
a compositional way. In particular, we incorporate the local quality estimator to assess the local
performance and improve the evaluation accuracy of the session quality by modeling fluctuations
of a therapist’s performance within a session.
Domain Adaptation for Utterance Level Behavioral Coding with Limited Labels
Many conversational domains adopt utterance level behavioral coding, which captures local be-
haviors and encodes local events in the conversation of interest. To build a high-quality auto-
mated behavioral coding model, we need enough manually annotated codes. However, this holds
only for the domains that have been extensively studied. The conversations from a new domain
might have never been coded. In this situation, we utilize the labeled data from a well-known
domain to train a computational model for an unlabeled target domain. We propose a domain ad-
versarial network framework with label proportions estimation, which learns domain-invariant
features and reduces the label distribution shift to improve the domain adaptation.
Meta-Transfer Learning for Utterance Level Behavioral Coding When Both Samples and Labels
are Limited
Due to the high level of privacy and sensitivity, it is difficult to collect and access a large amount of
in-domain conversational data in real-world domains such as clinical interactions. In these cases,
samples and labels are both limited. We leverage an open resource and apply meta-learning to
benefit from fine-tuning the automated behavioral coding model with limited in-domain data.
Moreover, we propose a novel task augmentation method by introducing the concept of “analogy tasks” – tasks similar to the target task. We construct large numbers of analogy tasks using open resources to improve task transferability and meta-transfer learning, leading to better performance of the automated behavioral coding model.
1.5 Road Map
The remaining parts of this dissertation are structured as follows:
• Chapter 2 introduces the related machine learning background on natural language pro-
cessing in low resource scenarios.
• Chapter 3 proposes a hierarchical transformer framework with a local quality estimator to
automatically evaluate the conversational level behavioral codes.
• Chapter 4 presents a domain adversarial network framework with estimations of the label
proportions to adapt the automated behavioral coding model to an unlabeled domain. This
framework is able to learn domain-invariant features and address the label distribution shift
simultaneously.
• Chapter 5 leverages an open resource dataset for automated behavioral coding when both
samples and labels are limited, and the task is unique. In particular, I apply task augmentation to meta-transfer learning, which benefits knowledge transfer.
• Chapter 6 summarizes the contributions of this dissertation and describes the ongoing and proposed future work.
Chapter 2
Machine Learning Background
2.1 Long Document Classification with Transformers
Our understanding of natural language, including spoken language processing, has benefited
from the development of neural network language models. Recently proposed models are able
to capture rich contextual information, making them widely useful across a variety of domains.
Notably, BERT (Bidirectional Encoder Representations from Transformers) models have demon-
strated significant improvements in multiple natural language processing (NLP) tasks [33]. Re-
searchers have extended the transformers to architectures with sparse attention mechanisms to
better obtain language model representations of long text. Many of these works are based
on left-to-right autoregressive models [103, 91, 31]. Autoencoding models such as Longformer
[13] and BigBird [126] combine the windowed local-context self-attention and global attention,
which can process up to 4,096 input tokens.
[83] proposed a transformer-based hierarchical framework for the task of long document clas-
sification. The method chunks input documents into blocks, fine-tunes BERT with those blocks
to obtain their representations, and then employs a recurrent neural network [94] to perform
classification.
2.2 Domain Adaptation in Text Classification
Domain adaptation techniques in text classification have been studied for many years to address
the problem that the target dataset is unlabeled [16, 15, 82, 17]. For example, [15] first applied a
structural correspondence learning (SCL) [16] to cross-domain sentiment classification. [82] pro-
posed spectral feature alignment (SFA) to reduce the gap between domain-specific words
of the domains. The study [17] modeled the cross-domain classification task as an embedding
learning. Recently, deep adversarial networks [48] have achieved success across many tasks in-
cluding text classification. The domain-adversarial neural networks (DANN) proposed by [44]
outperforms the traditional approaches in domain adaptation tasks for sentiment analysis. It
implements a domain classifier to learn domain-invariant features. [22] applied the adversar-
ial deep networks to cross-lingual sentiment classification. Other studies extended DANN for
cross-lingual and multi-source scenarios [22, 71, 131, 21].
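To make the mechanism concrete, below is a minimal TensorFlow/Keras sketch of a gradient reversal layer and DANN-style two-head wiring; the layer sizes, the fixed reversal strength lambda_, the number of classes, and the assumption of pre-computed utterance embeddings are illustrative choices rather than details from the cited papers.

```python
import tensorflow as tf

class GradientReversal(tf.keras.layers.Layer):
    """Identity in the forward pass; scales gradients by -lambda_ in the backward pass."""

    def __init__(self, lambda_=1.0, **kwargs):
        super().__init__(**kwargs)
        self.lambda_ = lambda_

    def call(self, x):
        @tf.custom_gradient
        def _reverse(x):
            def grad(dy):
                return -self.lambda_ * dy
            return tf.identity(x), grad
        return _reverse(x)

# Minimal DANN-style wiring: shared encoder, label head, and a domain head
# that sees the features only through the gradient reversal layer.
num_classes = 8                                    # illustrative number of behavioral codes
inputs = tf.keras.Input(shape=(768,))              # assumed pre-computed utterance embeddings
features = tf.keras.layers.Dense(256, activation="relu")(inputs)
label_head = tf.keras.layers.Dense(num_classes, activation="softmax", name="label")(features)
domain_head = tf.keras.layers.Dense(2, activation="softmax", name="domain")(
    GradientReversal(lambda_=1.0)(features))
model = tf.keras.Model(inputs, [label_head, domain_head])
model.compile(optimizer="adam",
              loss={"label": "sparse_categorical_crossentropy",
                    "domain": "sparse_categorical_crossentropy"})
```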
A shortcoming of DANN models is that they do not account for the problem of label distribution shift [104]. To overcome this issue, researchers have developed algorithms to estimate the
label distribution of the unlabeled target domain, and re-weight the samples feeding into the
domain adapter in DANN [67, 104, 27]. There are also many studies that focus only on the estimation of the target label distribution [20, 128, 81, 69, 7].
2.3 Meta-learning and Task Similarity in Natural Language Processing
Meta-learning refers to machine learning algorithms that learn from the output of other machine
learning algorithms, which is also called “learning to learn". It aims at fast adaptation to new tasks
with small amounts of data through acquiring knowledge from multiple source tasks. Among
different approaches to meta-learning, one proposal is learning the initialization of a network that
is good at adapting to new tasks. [35] applied this proposal to the General Language Understanding
Evaluation (GLUE) benchmark [114] and explored the model-agnostic meta-learning (MAML)
[40] and its variants called first-order MAML (FO-MAML) and Reptile.
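As a minimal illustration of the “learn a good initialization” idea, the sketch below performs one first-order, Reptile-style meta-update in TensorFlow; the learning rates, the number of inner steps, and the sample_task_batches iterator are assumptions made for illustration, not settings from the cited works.

```python
import tensorflow as tf

def reptile_meta_step(model, sample_task_batches, inner_lr=1e-3, meta_lr=0.1, inner_steps=5):
    """One Reptile-style update: adapt the weights to a sampled task with a few SGD
    steps, then move the shared initialization a small step toward the adapted weights.
    `sample_task_batches` is assumed to be an iterator yielding (x, y) batches of one task."""
    init_weights = [w.numpy() for w in model.trainable_variables]
    inner_opt = tf.keras.optimizers.SGD(learning_rate=inner_lr)
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

    for _ in range(inner_steps):                        # inner-loop adaptation on one task
        x, y = next(sample_task_batches)
        with tf.GradientTape() as tape:
            loss = loss_fn(y, model(x, training=True))
        grads = tape.gradient(loss, model.trainable_variables)
        inner_opt.apply_gradients(zip(grads, model.trainable_variables))

    for w, w0 in zip(model.trainable_variables, init_weights):
        w.assign(w0 + meta_lr * (w.numpy() - w0))       # outer (meta) update of the initialization
```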
Many works have shown that generating source tasks similar to the target task can benefit
meta-learning. In past decades, researchers have explored the relationships between classification tasks in terms of task similarity using traditional machine learning algorithms [106, 11, 121, 129]. Other
recent work mapped the functions into a vector space [2, 123, 3] to estimate the transferability us-
ing a non-symmetric distance. [113] further developed the task embeddings approach and applied
it to the NLP field to predict the most transferable source tasks. [127] modeled the underlying
structure among different tasks to reduce the amount of labeled training data needed. However, the com-
mon theme in all these approaches is that they require fine-tuning the target task and exhaustive
optimization of parameters. The transferability estimation, unfortunately, is not robust if there
are insufficient training samples. [109] investigated the correlation of the label distributions be-
tween those tasks and proposed a negative conditional entropy (NCE) measure to estimate the
task transferability. This algorithm only requires the source model and the labeled target samples
without fine-tuning the in-domain data. [80] developed a variant of NCE measure called the Log
Expected Empirical Prediction (LEEP) that denotes the average log-likelihood of the expected
empirical predictor. [133] estimated the task similarity by model parameters.
Chapter 3
Session Level Behavioral Coding
3.1 Introduction
In psychotherapy, we evaluate the quality of a session by the behavioral codes at the full conversa-
tional level. The codes are assigned only once per therapy session based on (global) session-level attributes. Building computational models for predicting these behavioral codes automatically therefore suffers from the sparsity of codes. In addition, automation is challenged by the inherent complexity of a session, notably due to the long session duration, the structure addressing multiple varied facets of the therapy process, and the rich diversity across specific client-therapist sessions.
To handle these challenges, we adopt a transformer-based hierarchical framework [83] for automated behavioral coding. In particular, we incorporate local quality estimates to model fluctuations of a therapist's performance within a session. Specifically, we model the session quality as the weighted average of the local quality scores and explore two approaches to determine the segment weights.
This chapter is based on [25].
3.2 Related Works
Previous works on automatically evaluating the quality of CBT (cognitive behavioral therapy) sessions have thus adopted
coarse features based either on the word frequency [43] or on the distributions of local behavioral
acts [19, 23], which are limited in capturing potentially useful contextual information available
within an interaction.
Recently proposed models are able to capture rich contextual information making them widely
useful across a variety of domains. Notably, BERT (Bidirectional Encoder Representations from
Transformers) models have demonstrated significant improvements in multiple natural language
processing (NLP) tasks [33]. However, one of the key challenges related to BERT is its limited
capability to handle long sequences due to memory-related limitations. In particular, traditional
BERT models can handle sequences of at most 512 tokens, which is a serious limitation when pro-
cessing long, multi-turn conversations. To address this problem, researchers have extended the
transformers to architectures that can better process long text. Many of these works are based on
left-to-right autoregressive models which makes them unsuitable for tasks requiring bidirectional
information [103, 91, 31]. Recently, autoencoding models such as Longformer [13] and BigBird
[126] were proposed, which combine windowed local-context self-attention and global attention,
and can process up to 4,096 input tokens. However, such approaches aggravate the data-hungry
problem of adapting the language model to domains with limited resources, while the maximum
allowed length is still not sufficient to process real-world clinical sessions.
In the study of [83], the author developed a transformer-based hierarchical framework to
classify long documents. The method segments input documents into smaller blocks, fine-tunes
BERT with those blocks to obtain their representations, and then employs a recurrent neural
network to perform classification. However, an evident shortcoming of this approach is that all
the segments simply inherit the label of the document that they belong to, which is crude.
3.3 Method
3.3.1 Hierarchical framework
We partition a psychotherapy session $C$ into $N$ segments $\{C_1, C_2, \ldots, C_N\}$. In order to avoid splitting the session in-between utterances, which comprise complete thought units, we divide the session into segments every $M$ utterances. We denote the session-level score of $C$ as $s$ and the local quality corresponding to the segment $C_i$ as $s_i$.
As our base configuration, we adopted the hierarchical transformers structure presented in the red dashed box of Fig. 3.1, which corresponds to the framework in [83]. In that case, the segment quality is simply considered to be the score of the session it belongs to. So, every segment is labeled with the score $y_i = s$ and those segment-label pairs are used for fine-tuning BERT.
We obtain the segment embeddings by the pooled output (embedding of the initial [CLS] token)
from the last transformer block of BERT and feed them into a predictor for session quality eval-
uation. The predictor includes a bidirectional LSTM layer [52] to capture dependency features,
followed by an additive self-attention layer [9]. Different activation functions (linear/sigmoid)
of the final dense layer allow for either regression or classification tasks. The loss function for
regression is the mean squared error. In the classification scenario, we trained the predictor us-
ing the cross-entropy loss with the categorized codes and we assigned weights for each class
inversely proportional to their class frequencies.
Figure 3.1: Proposed framework using hierarchical transformers.
In this setting, however, all the segments are assigned the label corresponding to the entire
session. This means that the behavior of a therapist is assumed to be constant throughout a
session, which is rarely true in the real world. To handle this limitation, we incorporate a local
quality estimator which lets us model fluctuations in quality, e.g., of a therapist’s performance,
within a session.
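A minimal Keras sketch of the base predictor described above, assuming the segment embeddings (BERT [CLS] vectors) have already been extracted; the layer sizes are illustrative, and the scoring function is a simplified stand-in for the additive self-attention layer of [9].

```python
import tensorflow as tf

class SessionPredictor(tf.keras.Model):
    """BiLSTM over segment embeddings, attention pooling, and a dense output layer."""

    def __init__(self, seg_dim=256, task="regression"):
        super().__init__()
        self.blstm = tf.keras.layers.Bidirectional(
            tf.keras.layers.LSTM(seg_dim, return_sequences=True))
        self.score = tf.keras.layers.Dense(1, activation="tanh")       # per-segment attention score
        self.out = tf.keras.layers.Dense(
            1, activation="linear" if task == "regression" else "sigmoid")

    def call(self, seg_embeddings):
        # seg_embeddings: (batch, num_segments, seg_dim), e.g. BERT [CLS] vectors per segment
        h = self.blstm(seg_embeddings)                   # (batch, N, 2*seg_dim)
        alpha = tf.nn.softmax(self.score(h), axis=1)     # (batch, N, 1) segment attention weights
        pooled = tf.reduce_sum(alpha * h, axis=1)        # attention-weighted session representation
        return self.out(pooled)

# Regression on the session-level score with mean squared error, as described above.
model = SessionPredictor(task="regression")
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
```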
3.3.2 Framework with local quality estimates
3.3.2.1 Motivation
The motivation behind the approach we followed is based on the idea that the score of a session-
level behavioral code offers a measure of the overall skill of a therapist over an entire conversation.
However, since a psychotherapy session is typically several tens of minutes long, a segment of
the conversation by itself often represents a rich and meaningful exchange of ideas and, thus,
can be evaluated and scored: the labels (scores) for the segments are interpretable. To that end, we implemented a local quality estimator to assess the quality of each segment by modeling the global scores as the (weighted) average of local scores.

Algorithm 1 Training/Prediction Scheme of Our Hierarchical Framework
1: Initialization: split each conversation $C$ into segments $C_1, C_2, \ldots, C_N$; denote the session-level score of $C$ as $s$; initialize the segment quality scores as $y_1^0 = s, y_2^0 = s, \ldots, y_N^0 = s$.
2: for $k = 0$ to $K$ do
3:    Fine-tune BERT using the local quality scores $y_1^k, y_2^k, \ldots, y_N^k$ with a regression task to learn representations of the segments.
4:    Feed the segment representations into the LSTM-based model to train the segment quality estimator (SQE) with a regression task.
5:    Use the trained SQE to predict the global quality $\hat{s}$ and the local qualities $\hat{s}_1, \hat{s}_2, \ldots, \hat{s}_N$.
6:    Correct the shift and update the local qualities: $\bar{s}_i = \hat{s}_i + s - \hat{s}$, $y_i^{k+1} = \bar{s}_i$.
7: end for
8: Train the predictor with either a regression or a classification task to make a final prediction of the global quality.
3.3.2.2 Connecting Session Quality and Segment Quality
We approximate the session quality $s$ as a weighted average of the performance across the (temporal) segments of an interaction. To estimate the local quality, we implemented the segment quality estimator shown in Fig. 3.2a. It has a similar structure to that of the predictor for a regression task, the only difference being that we replaced the self-attention layer with a linearly activated one so that its output can be a linear combination of its inputs:

h = \sum_{i=1}^{N} \alpha_i h_i    (3.1)

where $h_i$ are the outputs of the BLSTM layers, $h$ is the hidden state fed into the final dense layer, and $\alpha_i$ denote the segment weights.

Figure 3.2: Segment Quality Estimator. (a) Model & Training; (b) Estimation (fixed model parameters).
To determine the segment weights $\alpha_i$, two modes of the segment quality estimator are investigated:

1. Segment weights are assumed to be proportional to the number of utterances the segments contain, as shown in Eq. 3.2. According to this equation, and based on the way we split each session, the weights of the segments within a session are equal, possibly apart from the last one (the last segment might not have exactly $M$ utterances). We implemented an average pooling layer after the BLSTM (Fig. 3.2a). The session quality is approximately equal to the average of its segment qualities. We denote this mode as even.

\alpha_i = \frac{\#\text{utterances in } C_i}{\text{total } \#\text{utterances in } C}    (3.2)

2. Segments are assumed to have different importance towards the estimation of the overall session quality, even when they have the same length in terms of the number of utterances.
In that case, we implement an attention layer (Fig. 3.2a), where the internal context vector learns the segment weights $\alpha_i$ [122], as given by Eq. 3.3. The attention layer applies a single-layer MLP to derive the keys $z_i$. Then the attention weights $\alpha_i$ are computed by a softmax function. The internal context vector $u$ is randomly initialized and jointly learned during the training process. The padded sequences are always masked so their attention weights equal zero. We adopt these attention weights as the segment weights and denote this mode as uneven due to the (potentially) uneven distribution of $\alpha_i$.

z_i = \tanh(W h_i + b), \quad \alpha_i = \frac{\exp(z_i^{\top} u)}{\sum_j \exp(z_j^{\top} u)}, \quad h = \sum_{i=1}^{N} \alpha_i h_i    (3.3)
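One possible Keras implementation of this attention layer is sketched below; the key dimensionality and the masking convention are illustrative assumptions.

```python
import tensorflow as tf

class ContextAttention(tf.keras.layers.Layer):
    """Eq. 3.3 sketch: z_i = tanh(W h_i + b), alpha_i = softmax(z_i^T u), h = sum_i alpha_i h_i."""

    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.proj = tf.keras.layers.Dense(units, activation="tanh")     # keys z_i
        self.u = self.add_weight(name="context_vector", shape=(units,),
                                 initializer="glorot_uniform", trainable=True)

    def call(self, h, mask=None):
        # h: (batch, N, d) BLSTM outputs for the N segments of a session
        z = self.proj(h)                                     # (batch, N, units)
        scores = tf.tensordot(z, self.u, axes=1)             # (batch, N), z_i^T u
        if mask is not None:                                 # padded segments get zero weight
            scores += (1.0 - tf.cast(mask, scores.dtype)) * -1e9
        alpha = tf.nn.softmax(scores, axis=-1)               # segment weights, summing to 1
        pooled = tf.reduce_sum(h * alpha[..., tf.newaxis], axis=1)   # weighted session state
        return pooled, alpha                                 # also expose alpha for analysis
```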
For both modes of the segment quality estimator, the weights of the segments in a session satisfy the condition $\sum_{i=1}^{N} \alpha_i = 1$. Having the segment weights determined, the prediction for the session quality is now

\hat{s} = L(h) = \sum_{i=1}^{N} \alpha_i L(h_i) \triangleq \sum_{i=1}^{N} \alpha_i \hat{s}_i    (3.4)

where $L$ represents the linear activation of the dense layer. Eq. 3.4 indicates that we can decompose the global score estimate $\hat{s}$ as the weighted sum of the local score estimates $\hat{s}_i$. As shown in Fig. 3.2b, we can obtain these local estimates by feeding the hidden states $h_i$ directly into the dense layer.

In order to account for the deviation between the prediction and the true labels, we update the estimates by correcting the shift (top layer in Fig. 3.2b):
\bar{s}_i = \hat{s}_i + s - \hat{s}, \quad i \in \{1, \ldots, N\}    (3.5)

so that

\sum_{i=1}^{N} \alpha_i \bar{s}_i = \sum_{i=1}^{N} \alpha_i (\hat{s}_i + s - \hat{s}) = \sum_{i=1}^{N} \alpha_i \hat{s}_i + s \sum_{i=1}^{N} \alpha_i - \hat{s} \sum_{i=1}^{N} \alpha_i = \hat{s} + s - \hat{s} = s.    (3.6)

Eq. 3.6 indicates that the weighted average of the modified segment quality estimates is equal to the true score of a session.

The complete learning procedure of our hierarchical framework is given in Algorithm 1: we feed the updated segment labels $\{y_i = \bar{s}_i\}_{i=1}^{N}$ into BERT and repeat fine-tuning to get better representations. The loss function for training the segment quality estimator via regression is also the mean squared error. We iterate this process multiple times before employing the predictor to make a final prediction of the overall session quality.
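To make Eqs. 3.4–3.6 concrete, here is a small NumPy sketch; the segment weights and local estimates are made-up numbers used only for illustration.

```python
import numpy as np

def correct_local_estimates(s_hat_local, alpha, s_true):
    """Recover the global estimate as a weighted sum of local estimates (Eq. 3.4),
    then shift the local estimates so that their weighted average equals the true
    session score (Eqs. 3.5-3.6)."""
    s_hat = float(np.dot(alpha, s_hat_local))          # Eq. 3.4: global estimate
    s_bar = s_hat_local + s_true - s_hat                # Eq. 3.5: shift-corrected local scores
    assert np.isclose(np.dot(alpha, s_bar), s_true)     # Eq. 3.6: weighted average matches s
    return s_bar

# Example: a 4-segment session with true total CTRS 40 and even segment weights.
alpha = np.full(4, 0.25)
s_hat_local = np.array([30.0, 45.0, 38.0, 50.0])
print(correct_local_estimates(s_hat_local, alpha, s_true=40.0))
```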
3.4 Dataset
Cognitive Behavioral Therapy (CBT) is a widely used evidence-based psychotherapy that aims
at enabling shifts in the patient’s thinking and behavioral patterns [12]. In this work, we use
the CBT data that comes from the Beck Community Initiative [29], a public-academic partner-
ship, and consists of 1,118 sessions with durations ranging from 10 to 90 minutes. Out of those,
292 sessions are accompanied by professional, manual transcriptions, while all the sessions were
automatically transcribed by an automated speech pipeline developed for psychotherapy inter-
actions [42], consisting of voice activity detection, diarization, automatic speech recognition,
speaker role assignment (therapist vs. patient), and utterance segmentation, which converted
speech to punctuated text. We adapted this pipeline to the CBT domain using 100 manually tran-
scribed sessions and evaluated the performance on the remaining 192 ones, with the estimated
word error rate being 45.81%. The error analysis revealed that errors were highly influenced by
the presence of speech fillers (e.g., ‘um’, ‘huh’, etc.) and other idiosyncrasies of conversational
speech. It should be noted that error rates around this value have been reported to be typical
in conversational medical interactions [61] and inevitably degrade the downstream tasks. From
a practical viewpoint, it is important to study the feasibility of applying NLP techniques under
real-world circumstances where perfect transcriptions are not available. Several studies have
successfully performed NLP tasks on decoded transcripts with similar word error rates across
different applications [78, 132]. In our own previous work, we have shown that for the specific
problem in the same data domain, the performance degradation of such end-to-end systems due
to the automatically derived transcriptions is relatively small when we adopt the frequency-based
features [23].
The entire available data set has been manually annotated to assess session quality. The
quality evaluation of CBT is based on the Cognitive Therapy Rating Scale (CTRS, [125]), which
defines the session-level behavioral codes shown in Table 3.1. Each session is evaluated according
to 11 codes scored on a 7 point Likert scale ranging from 0 (poor) to 6 (excellent). The sum over
all the 11 codes, called the total CTRS score, is typically used as the overall measure of session
quality and ranges from 0 to 66. We binarized each CTRS code by assigning codes greater than
or equal to 4 as ‘high’ and less than 4 as ‘low’, since 4 is the primary anchor indicating the skill
is fully present, but still with room for improvement [125]. For the overall CTRS quality, we
binarized it by setting to ‘high’ sessions with total CTRS score greater than or equal to 40 and
Table 3.1: CBT behavior codes defined by the CTRS manual
Abbr.   CTRS code   Train ‘low’/‘high’ (count)   Test ‘low’/‘high’ (count)
ag agenda 594/308 146/70
at application of cognitive behavioral techniques 735/167 167/49
co collaboration 528/374 120/96
fb feedback 706/196 163/53
gd guided discovery 716/186 164/52
hw homework 757/145 167/49
ip interpersonal effectiveness 215/687 56/160
cb focusing on key cognitions and behaviors 615/287 131/85
pt pacing and efficient use of time 635/267 136/80
sc strategy for change 606/296 131/85
un understanding 603/299 144/72
total total score 683/188 156/60
‘low’ those with total CTRS less than 40, since a score of 40 is regarded as the benchmark for CBT
competence [100]. The score distributions for different CTRS codes across all the 1,118 sessions
are given in Fig. 3.3. In this study, a total of 28 doctoral-level CBT experts served as raters. In
order to prevent rater drift, they were required to demonstrate calibration before coding, which
resulted in high inter-rater reliability (ICC = 0.84, [29]).
We split the data into training and testing sets with a roughly 80%:20% ratio across therapists,
with the sessions in the training set being used both for training our models and for domain
adaptation. The label distributions of the binarized codes for both the training and testing sets
are shown in Table 3.1. Besides the CBT data, we used automatically transcribed psychother-
apy sessions from a university counseling center (UCC) decoded by the same automated speech
pipeline mentioned above to adapt the BERT models. As shown in Table 3.2, the UCC data set
contains more sessions than the CBT one, while the two sets are similar in terms of domain, ses-
sion duration, and number of words per utterance, which demonstrates its appropriateness for
Figure 3.3: Distribution of the 11 CTRS codes (and the total CTRS)
pre-training and adapting the language model.
Table 3.2: Statistics describing the datasets used for the experiments.
Dataset   sessions (count)   therapists (count)   session duration [min] (mean ± std)   utterances per session (mean ± std)   words per utterance (mean ± std)
CBT       1118               383                  35.5 ± 12.8                           656.3 ± 270.3                         8.4 ± 8.3
UCC       4268               59                   38.9 ± 10.0                           665.2 ± 226.2                         10.2 ± 11.2
3.5 Experiments and Results
Based on the sequence length distribution on the available dataset, and in order to better exploit
the maximum allowed BERT sequence length of 510 tokens, we split each session into sequential
segments comprising M = 40 utterances, with an average sequence length equal to 327.9 words. We found that only 3.9% of the total number of segments in the CBT dataset we considered were longer than 510 tokens. For comparison, we also tried different values for the segment length M and experimentation showed that M = 40 yields the best results, as explained in Section 3.5.3.
3.5.1 BERT Adaptation
Due to memory-related limitations (GeForce GTX 1080, 11G) and training efficiency, we adopted
a smaller pre-trained uncased BERT variant with 4 layers and 256 hidden states† — denoted as
BERT-small in this paper — allowing us to select a larger batch size when training with long
sequences.
We adapted BERT-small to the CBT domain via domain-adaptive pretraining with the UCC
data (Table 3.2) and task-adaptive pretraining [55, 50] with the training set of CBT over a 90%-10%
train-eval split. We continued training with the UCC utterances for 1 epoch and CBT utterances
for 10 epochs using the following parameters: learning rate of 2e-5, batch size of 64, and sequence
length of 64. The adapted model, called cbtBERT-utt, achieves an accuracy of 76.6% on the next
sentence prediction task and 44.8% on the masked language model task.
We continued adapting the model using the long segments (M = 40) to improve its ability
to learn features of longer sequences. Since the number of available segments is very small and
some contextual information is filtered, we augmented the segment samples using the following
strategy: for each session, we split the transcript 8 times by setting the length of the first segment
equal to 5, 10, 15, 20, 25, 30, 35 and 40 utterances. We continued training cbtBERT-utt on those
segments for 10 epochs with a learning rate of 2e-5, batch size of 8, and sequence length of 512.
We denote this adapted model as cbtBERT-segment.
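A simple sketch of this splitting strategy, assuming a session is represented as a list of utterance strings (the function name is illustrative):

```python
def augment_segmentations(utterances, M=40, first_lengths=(5, 10, 15, 20, 25, 30, 35, 40)):
    """Split one session 8 ways by varying the length of the first segment, then
    chunking the remainder every M utterances."""
    versions = []
    for first in first_lengths:
        segments = [utterances[:first]]
        segments += [utterances[i:i + M] for i in range(first, len(utterances), M)]
        versions.append([seg for seg in segments if seg])   # drop any empty tail segment
    return versions

# e.g. a 100-utterance session yields 8 alternative segmentations of the same transcript
print([len(v) for v in augment_segmentations([f"utt{i}" for i in range(100)])])
```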
† https://github.com/google-research/bert
3.5.2 Experimental Setup
We perform both classification and regression tasks for evaluation, which not only predicts whether
a session ‘is good’ but also ‘how good it is’.
All our models were implemented in Tensorflow [1] and 20% of the training data were used
for validation. For BERT fine-tuning, we selected the best learning rate (among 1e-5, 2e-5, 3e-
5, 4e-5, and 5e-5) on the validation set and used decoupled weight decay regularizer [76] and a
linear learning rate scheduler for optimization. The model was trained for 10 epochs based on
the mean squared error loss function with a batch size of 64. The max sequence length was set
to 512 tokens for segments of 40 utterances and 64 tokens for segments comprising 1 utterance.‡
For the predictor and segment quality estimator models (presented in Fig. 3.1 and Fig. 3.2), an
Adam optimizer [60] was employed with a fixed learning rate of 0.001. The BLSTM layer has the
same dimension as the hidden states of BERT. The maximum sequence length was set to 1,600 for
short segments and 40 for long segments. We trained the models for up to 50 epochs with a batch
size of 64 and an early stopping strategy based on the validation loss. For the regression tasks, we
re-scaled the session scores by $f(s) = (s - \frac{A}{2}) / \frac{A}{2}$ for fast convergence. The value of $A$ equals 66 for predicting the total CTRS scores and 6 for predicting each of the individual CTRS codes, so that the normalized labels are always in the range [-1, 1]. After prediction, we map the range [-1, 1] back to [0, 66] for the total CTRS and [0, 6] for the other CTRS codes.
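Written out as a small Python sketch (the example value is only a worked illustration):

```python
def normalize_score(s, A):
    """Map a raw score in [0, A] to [-1, 1]: f(s) = (s - A/2) / (A/2)."""
    return (s - A / 2) / (A / 2)

def denormalize_score(f, A):
    """Inverse mapping applied after prediction."""
    return f * (A / 2) + A / 2

# e.g. for the total CTRS (A = 66), a score of 40 maps to ~0.21 and back to 40.0
print(normalize_score(40, 66), denormalize_score(normalize_score(40, 66), 66))
```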
‡ We also tried setting the max sequence length to 512 tokens for the segments comprising 1 utterance and achieved comparable performance with the results using 64 tokens.
3.5.3 Results
We predict the total CTRS scores via different approaches and report the root mean squared error
(RMSE) and mean absolute error (MAE) for the regression tasks and the macro-averaged F1 score
for the classification tasks.
As a baseline, we perform linear regression (LR) and support vector machine (SVM) based
classification, coupled with unigrams under a term frequency-inverse document frequency (tf-
idf) transformation, which was reported to achieve the best results for the task in [43]. We denote
these two methods by tf-idf + LR and tf-idf + SVM. We also compare the results of our hierarchical
framework to the model’s performance when replacing the BERT embeddings by the segment-
level averaged glove embeddings [85] or by paragraph vectors (doc2vec, [30]). We extract these
segment-level features and directly feed them into the LSTM-based predictor in Fig. 3.1; we denote
those approaches by glove + LSTM and doc2vec + LSTM. Besides, we perform the evaluation
tasks using the pre-trained BERT-small, and two sparse attention transformer-based models —
Longformer [13] and BigBird [126] — by truncating the document to the maximum sequence
length.
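One possible realization of these frequency-based baselines with scikit-learn is sketched below; the toy texts and scores exist only to make the snippet runnable and are not from the CBT corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# tf-idf + LR (regression on total CTRS) and tf-idf + SVM ('low'/'high' classification).
tfidf_lr = make_pipeline(TfidfVectorizer(), LinearRegression())
tfidf_svm = make_pipeline(TfidfVectorizer(), LinearSVC())

train_texts = ["what would you like to put on the agenda today",
               "tell me a little bit about how your week went"]      # toy placeholders
tfidf_lr.fit(train_texts, [42.0, 18.0])       # made-up total CTRS scores
tfidf_svm.fit(train_texts, ["high", "low"])   # made-up binarized labels
```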
For our framework, we additionally evaluate the performance with respect to whether fine-
tuning BERT with segment scores or not is involved, and with respect to the number of times
k we call the segment quality estimator for updating those scores. If k = 0, the structure is
equivalent to the normal hierarchical framework described in Section 3.3.1 and introduced in
[83]. Furthermore, we report and make comparisons between the two proposed modes of the
segment quality estimator: even — in each session, every segment contributes equally towards
the session quality — and uneven — in each session, the contribution of each segment towards
23
the session quality is learnt through an attention mechanism. Based on the pre-trained language
model we use, the approaches are named as BERT-small + LSTM, BERT-cbt-utt + LSTM and
BERT-cbt-segment + LSTM.
The experimental results of the different approaches are shown in Table 3.3. For all the meth-
ods, the difference between RMSE and MAE is relatively small which indicates that outliers do not
greatly affect the predictions. Based on the results, we observe that, for evaluating such lengthy
conversations, frequency-based methods perform better than simple neural network methods.
The transformer-based model — BERT-small — achieves a low performance because it only re-
ceives 512 tokens for each session, ignoring most of the information. The transformer models
with sparse attentions — Longformer and BigBird — increase the maximum possible input se-
quence length to 4,096 and evaluate the session quality more accurately. However, this length
is still not sufficient for CBT conversations, and, as a result, the performance is worse when
compared to the hierarchical approaches. The results of the hierarchical framework suggest that
fine-tuning substantially improves the performance when the segment is long. However, fine-
tuning is not as effective and might even lead to performance degradation if single utterances
are used as segments, since assigning the global session quality to very short chunks of text may
result in inaccuracies. Specifically for the BERT-small + LSTM configuration, we experimented
with various values of M for the segment length and found that M = 40 yielded the best results, as shown in Table 3.4. The performance of the task improves as we increase M from 1 to 40. However, when we set M = 80, the number of words exceeds the maximum sequence length (512) of BERT; the performance degrades because the long chunks cannot be fully fed into the model. This is why we used M = 40 for the subsequent experiments.
Comparing the various pre-trained BERT models, we conclude that adapting the language model
with in-domain data leads to improved prediction performance. Additionally, when using long
segments as inputs, cbtBERT-segment consistently yields better results than cbtBERT-utt, which
confirms its suitability for handling longer sequences.
The performance of the hierarchical transformers framework is significantly improved by
incorporating a segment quality estimator (SQE), which also outperforms other approaches in
Table 3.3. Moreover, we find that the best results are achieved when we select the “even" mode
for the segment quality estimator, which indicates that, for predicting the total CTRS scores, it
is reasonable to model the global quality as the average of the local quality scores. The results
also show that the system performance is improved as we increase the number k of SQE updates,
although a plateauing trend is observed.
To further compare the two segment quality estimator modes, we predict each of the CTRS
codes with both modes and also present the performance without incorporating a segment qual-
ity estimator. The segment length for these experiments is set to 40 utterances (M = 40) and
the results are reported in Table 3.5. For both modes, the number of times updating the segment
quality estimator is set to one (k = 1). The ‘None’ column in the table corresponds to performing
the tasks without a segment scores estimator (k = 0). From the table, we can observe that using
a segment scores estimator, regardless of the mode selected, leads to improved prediction perfor-
mance for the majority of the CTRS codes. The segment quality estimator with the “even" mode
yields the best results for 8 out of 11 codes. However, for the codes agenda and homework, the
“uneven” mode leads to more accurate predictions. We posit that which mode of the segment quality estimator performs better depends on the inherent characteristics of the codes and on the ability of the employed attention mechanism to robustly learn the segment weights. It seems
Table 3.3: Evaluation results for total CTRS scores. M: #utterances/segment; k: #times processing the segment scores estimator; SQE: segment quality estimator.

Approach                         BERT fine-tune   M    k   SQE mode   RMSE/MAE     F1 score (%)
Frequency-based Methods
tf-idf + LR                      -                -    -   -          9.48/7.49    -
tf-idf + SVM                     -                -    -   -          -            69.0
Neural Network Methods
glove + LSTM                     -                1    -   -          10.05/8.09   59.6
glove + LSTM                     -                40   -   -          9.90/7.99    60.2
doc2vec + LSTM                   -                1    -   -          9.88/7.91    62.2
doc2vec + LSTM                   -                40   -   -          9.75/7.80    63.0
Transformer Models
BERT-small                       -                -    -   -          9.89/7.93    61.9
Longformer                       -                -    -   -          9.35/7.31    67.9
BigBird                          -                -    -   -          9.30/7.25    68.5
Hierarchical Framework¹
BERT-small + LSTM²               no               1    0   -          9.78/7.82    62.6
BERT-small + LSTM²               yes              1    0   -          9.88/7.89    62.2
BERT-small + LSTM²               no               40   0   -          9.68/7.70    63.5
BERT-small + LSTM²               yes              40   0   -          8.78/6.97    70.7
BERT-cbt-utt + LSTM              no               1    0   -          9.57/7.59    65.3
BERT-cbt-utt + LSTM              yes              1    0   -          9.56/7.67    64.6
BERT-cbt-utt + LSTM              no               40   0   -          9.45/7.50    65.5
BERT-cbt-utt + LSTM              yes              40   0   -          8.59/6.80    72.0
BERT-cbt-segment + LSTM          no               40   0   -          9.27/7.29    67.9
BERT-cbt-segment + LSTM          yes              40   0   -          8.47/6.59    73.0+
BERT-cbt-segment + LSTM + SQE    yes              40   1   Even       8.19/6.35    74.7
BERT-cbt-segment + LSTM + SQE    yes              40   1   Uneven     8.25/6.40    74.3
BERT-cbt-segment + LSTM + SQE    yes              40   2   Even       8.12/6.29    75.0
BERT-cbt-segment + LSTM + SQE    yes              40   2   Uneven     8.22/6.38    74.5
BERT-cbt-segment + LSTM + SQE    yes              40   3   Even       8.09/6.27    75.1∗
BERT-cbt-segment + LSTM + SQE    yes              40   3   Uneven     8.22/6.37    74.5
¹ Approaches without fine-tuning correspond to the single-task models in [41]
² Corresponds to the framework in [83]
∗ is significantly higher than + at p < 0.05 based on Student's t-test.
Table 3.4: Evaluation results for different segment lengths. Approach: BERT-small + LSTM; M: #utterances/segment.
M Average words per segment RMSE/MAE F1 scores (%)
1 8.4 9.88/7.89 62.2
5 41.8 9.36/7.33 66.9
20 166.2 8.85/7.07 70.2
40 327.9 8.78/6.97 70.7
80 647.5 9.02/7.14 69.5
Table 3.5: Comparison of different segment quality estimator modes (for M = 40), updating the segment quality scores for k = 1 time. None: without using a segment quality estimator.

CTRS code   MAE/RMSE (None / Even / Uneven)            F1 scores % (None / Even / Uneven)
ag          0.86/1.10   0.85/1.10   0.81/1.05          74.6+   74.6    76.6∗
at          0.95/1.20   0.90/1.16   0.93/1.18          64.5+   66.3∗   65.6
co          0.74/0.97   0.73/0.96   0.75/0.98          69.6    70.5    68.9
fb          0.97/1.20   0.89/1.13   0.93/1.15          66.5+   69.5∗   68.0
gd          0.74/0.98   0.71/0.95   0.73/0.98          63.2+   66.6∗   64.4
hw          1.00/1.24   0.97/1.21   0.96/1.20          67.0    68.2    68.7
ip          0.69/0.90   0.67/0.89   0.69/0.90          60.5    61.5    60.3
cb          0.89/1.10   0.91/1.12   0.91/1.13          69.8    69.3    69.3
pt          0.84/1.16   0.82/1.13   0.84/1.16          64.0    65.5    64.4
sc          1.00/1.25   0.96/1.19   0.98/1.20          66.9+   69.4∗   68.6
un          0.66/0.87   0.62/0.83   0.64/0.85          60.2    62.4    61.5
∗ is significantly higher than + at p < 0.05 based on Student's t-test.
that, for most of the codes, all the segments are almost equally important, and the uniformly dis-
tributed segment weights used by the ‘even’ mode are more accurate than the estimated weights
learned by the attention mechanism. For codes like agenda and homework, the global quality is
mainly associated with only a few segments within the session, so the estimated weights using the attention mechanism are more accurate. We further analyze this behavior in Sec. 3.6.1.
3.6 Analysis and Discussion of Experimental Results
In this section, we investigate the contribution of each segment via the analysis of the attention
weights and discuss how the segment quality estimates benefit the prediction of the session-level
scores.
Figure 3.4: Mean attention weights across the sessions consisting of exactly 10 long segments (of
40 utterances each) in the testing set.
3.6.1 Attention weights
We recorded the attention weights from the predictor using the best approach in Table 3.3 for
a subset of the sessions in the test set, each of which consisted of ten segments (of M = 40
utterances) to observe their behavior qualitatively through time. The average attention weights
through time, across the selected sessions, for the CTRS codes agenda, homework, understanding,
and total score are presented in Fig. 3.4. As we can see, the attention mechanism assigns higher
weights in the beginning when predicting the CTRS dimensions of agenda and homework. In
CBT, the therapist sets the agenda collaboratively with the client to establish key topics to be dis-
cussed and reviews the homework in the early stages of a session. We also observe that the tail of
the homework curve goes up because the therapist is expected to assign new homework to their
patient at the end of the session. Understanding is the CTRS code used to evaluate the listening
Figure 3.5: Divide the segments in terms of the relative performance within the session.
and empathic skills of the therapist and whether he/she successfully captured the patient’s ‘in-
ternal reality’ throughout a session. Thus, this therapist behavior is, on average, approximately
equally important in each segment and, as a result, the attention weights are evenly distributed.
In general, the average attention weights of the codes understanding and total score seem to be
much more evenly distributed through time compared to those of agenda and homework, which
partially explains that the uneven mode is surpassed by the even mode for predicting most of
the CBT codes but achieves the best results for agenda and homework. The different behaviors
between the attention weights distributions for different codes suggest that the even mode and
uneven mode of our approach complement each other while evaluating the session quality from
various perspectives. It is also interesting to point out that similar conclusions are drawn in our
concurrent work in [41], despite the different approaches that are followed.
Figure 3.6: Comparison of term frequencies of key words between the low-score groups and the
high-score groups; low-score group consists of the segments whose estimated scores are among
the lowest 50% in the session they belong to, high-score group consists of the segments whose
estimated scores are among the highest 50% in the session they belong to.
3.6.2 Word distributions over segments
Since we do not have reference (expert-provided) scores at the segment level, it is not possible
to directly evaluate the accuracy of the segment quality estimates and confirm the fluctuations
within a session. We can, however, perform a simple assessment by comparing the occurrence
of the most informative words. To that end, we perform a backward selection to find the subset
of the five best features/words for predicting the total CTRS scores using the tf-idf features of
each session. The words selected are ‘agenda’, ‘evidence’, ‘feeling’, ‘helpful’ and ‘homework’.
The Spearman correlations between the tf-idf features of these words and the total CTRS scores
all fall into the range [0.6, 0.8]. These correlation scores indicate that the particular words tend
to exist more frequently in sessions with higher total CTRS scores.
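As an illustration of this kind of analysis, the snippet below sketches how session-level tf-idf features for a handful of keywords can be correlated with the total CTRS scores. It is a minimal sketch rather than the exact pipeline used here; the variable names (`session_texts`, `total_ctrs`) are placeholders for the session transcripts and their expert-rated scores.

```python
# Minimal sketch (not the exact analysis pipeline): compute session-level tf-idf
# features for a small keyword set and their Spearman correlation with the total
# CTRS scores.
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.stats import spearmanr

def keyword_correlations(session_texts, total_ctrs, keywords):
    vectorizer = TfidfVectorizer(vocabulary=keywords)
    tfidf = vectorizer.fit_transform(session_texts).toarray()  # sessions x keywords
    corrs = {}
    for idx, word in enumerate(vectorizer.get_feature_names_out()):
        rho, _ = spearmanr(tfidf[:, idx], total_ctrs)
        corrs[word] = rho
    return corrs

# Example call with the five words selected by the backward search described above:
# corrs = keyword_correlations(session_texts, total_ctrs,
#                              ['agenda', 'evidence', 'feeling', 'helpful', 'homework'])
```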
Again, by using the best approach in Table 3.3, we obtain the segment quality estimates (M =
40) of the sessions in the testing set. As illustrated in Fig. 3.5, we cluster these segments into
two groups: 1) segments with estimated scores that are among the lowest 50% in the
session they belong to, denoted as “low-score group"; 2) segments with estimated scores that are
among the highest 50% in the session they belong to, denoted as “high-score group". We then
compute the term frequency of the 5 words described above for both groups. As illustrated in
Fig. 3.6, all of those words are more likely to exist in a segment with a high estimated score.
For the words ‘agenda’, ‘evidence’, ‘helpful’ and ‘homework’, the term frequencies of the “high-
score group" are more than three times higher than the ones of the “low-score group". These
comparisons suggest that the estimated segment scores can provide insights into the variability
in a therapist’s performance within a session.
3.7 Conclusion
This chapter introduces a hierarchical framework to evaluate the quality of psychotherapy con-
versations with a focus on cognitive behavioral therapy (CBT). We split sessions into blocks (con-
versation segments), employ BERT to learn segment embeddings, and use those features within
an LSTM-based model to make predictions about session quality. We additionally implement
a local quality estimator to model the estimated session quality as a linear combination of the
segment-level quality estimates. The experimental results show a substantial gain over baselines.
They suggest that incorporating such a local quality estimator leads to better segment repre-
sentations and consistent improvements for assessing the overall session quality in most of the
CTRS codes. In addition, we discuss how the estimated segment scores benefit the prediction
tasks by comparing the differences between segments within the same session. We should
note that an important added benefit of our proposed approach is enhanced interpretability of
the predicted results. By modeling the session quality as a function of local estimates, we get
insights into specific salient parts of the therapy session and into how particular conversation
segments contribute to the overall CTRS scores.
Chapter 4
Domain Adaptation for Utterance Level Behavioral Coding with
Limited Labels
4.1 Introduction
In many behavioral coding applications, utterance level codes can capture local characteristics
of the speakers and encode local events in the conversation of interest. In many cases, one coding
scheme can be applied to multiple sub-fields. To build an utterance level automated behavioral
coding model of high quality, we need manually annotated codes. However, a data sparsity
problem sometimes exists in such scenarios: we only have enough labeled data in the domains that
have been extensively studied, while for many other domains the labels are scarce. For example,
behavioral codes in Motivational Interviewing (MI) sessions [77] defined by the Motivational
Interviewing Skill Code (MISC) manual [53] are primarily applied in addiction, but they can
also be used in the fields of classroom management [93, 68] and health coaching. The conversations
from an under-studied domain or a brand new domain might have never been coded.
To address this issue, we adopt the domain-adversarial neural network (DANN) proposed
by [44] to perform domain adaptation and train a predictor for an unlabeled domain. The
shortcoming of this method is that such models assume that the label proportions across the
domains remain the same, an assumption that is often not met in real-world psychotherapy
behavioral coding tasks. In this chapter, we propose a domain adversarial network framework with
label proportions estimation (DAN-LPE) which learns domain-invariant features and estimates
the target label proportions. The label distribution shift is handled by re-weighting the samples
fed into the domain adaptor based on the estimated label proportions.
4.2 Related Works
[44] proposed DANN models for domain adaptation in text classification. However, they
assume that the label proportions across the domains remain the same, an assumption that is often
not met in real-world behavioral coding tasks.
The changes in the distribution of labels in different domains are known as prior probability
shift or label distribution shift, which prohibits the DANN from learning domain-invariant fea-
tures [130]. To estimate the label distribution shift, [95] proposed an EM algorithm to obtain the
new prior probabilities by maximizing the likelihood of the new data. This approach has been
successfully applied by [20]. In the study of [128] the kernel mean matching (KMM) method
was demonstrated to correct the shift. [81] further developed the KMM algorithm for continuous
target shift adaptation. A new attempt to quantify the shift is the Black Box Shift Estimation
(BBSE) [69, 7], a moment-matching approach using confusion matrices that achieves accurate
estimates on high-dimensional data sets of natural images. However, these approaches rely on
an anti-causal hypothesis in which the labels cause the features [99]. Recently, [67] proposed to
address the label distribution shift problem by incorporating distribution matching of intermediate
features into the DANN to estimate label proportions as well as perform the domain adaptation.
[104] and [27] developed algorithms to learn the domain shift by matching the distributions of
the predicted labels and used importance weights in domain-invariant representation learning.
4.3 Method
4.3.1 Problem Setup
Let $P$ and $Q$ be the source and target domains defined on $\mathcal{X}\times\mathcal{Y}$ and let $f$ be a text classifier
function. We use $x\in\mathcal{X}=\mathbb{R}^d$ and $y\in\mathcal{Y}=\{1,2,\dots,L\}$ to denote the feature and label variables.
The output of the classifier prediction is denoted by $\hat{y}=f(x)$. We use $p$ and $q$ to indicate the
probability density functions of $P$ and $Q$, respectively. The source and target datasets are represented
by $S_P=\{(x_1^P,y_1^P),(x_2^P,y_2^P),\dots,(x_M^P,y_M^P)\}$ and $S_Q=\{(x_1^Q,y_1^Q),(x_2^Q,y_2^Q),\dots,(x_N^Q,y_N^Q)\}$.
$S_P$ includes the training set $S_{P_T}$ and the validation set $S_{P_V}$. The prior distributions of $P$ and $Q$ are
given by $\alpha_l=p(y=l)$ and $\beta_l=q(y=l)$.
4.3.2 A Domain Adversarial Network with Label Proportion Estimation
The structure of the DAN-LPE is shown in Fig. 4.1. The red box in the figure presents the DANN
structure consisting of a feature extractor $F$, a text classifier $C$ and a domain classifier $D$. We
train the module $C$ to classify the text accurately and $D$ to make the feature distributions between
the source and target domains indistinguishable by back-propagation with gradient reversal.
However, the performance of $D$ would decline if the prior distributions of $P$ and $Q$ differ
a lot. To reduce the label distribution shift we implement a prior distribution estimator, shown above the
red box in Fig. 4.1, to estimate the target label proportions and apply the trick introduced in [67]
by re-weighting the samples fed into $D$ based on this varying estimate. The prior distribution
estimator takes the confusion matrix of the training set and the label predictions of the target set as
inputs.
Figure 4.1: The structure of DAN-LPE.
4.3.2.1 Moments and Matrices Definition
We first define the training set $S_{P_T}=\{(x_1^{P_T},y_1^{P_T}),(x_2^{P_T},y_2^{P_T}),\dots,(x_{M_T}^{P_T},y_{M_T}^{P_T})\}$, which
contains $M_T$ samples. The moments and matrices of $p$ and $q$ are denoted as follows:
$$P_{ij}=p(y=i,\hat{y}=j),\qquad Q_{ij}=q(y=i,\hat{y}=j),$$
$$P_i^j=p(\hat{y}=j\mid y=i)=\frac{p(y=i,\hat{y}=j)}{p(y=i)}=\frac{P_{ij}}{\sum_k P_{ik}},\qquad
Q_i^j=q(\hat{y}=j\mid y=i)=\frac{q(y=i,\hat{y}=j)}{q(y=i)}=\frac{Q_{ij}}{\sum_k Q_{ik}}.$$
We also present the plug-in estimates using the samples from $S_{P_T}$ and $S_Q$:
$$\hat{P}_{ij}=\frac{1}{M_T}\sum_{k=1}^{M_T}\mathbb{1}\{y_k^{P_T}=i,\,f(x_k^{P_T})=j\},\qquad
\hat{Q}_{ij}=\frac{1}{N}\sum_{k=1}^{N}\mathbb{1}\{y_k^{Q}=i,\,f(x_k^{Q})=j\},$$
$$\hat{P}_i^j=\frac{\sum_{k=1}^{M_T}\mathbb{1}\{y_k^{P_T}=i,\,f(x_k^{P_T})=j\}}{\sum_{k=1}^{M_T}\mathbb{1}\{y_k^{P_T}=i\}}=\frac{\hat{P}_{ij}}{\sum_k \hat{P}_{ik}},\qquad
\hat{Q}_i^j=\frac{\sum_{k=1}^{N}\mathbb{1}\{y_k^{Q}=i,\,f(x_k^{Q})=j\}}{\sum_{k=1}^{N}\mathbb{1}\{y_k^{Q}=i\}}=\frac{\hat{Q}_{ij}}{\sum_k \hat{Q}_{ik}}.$$
We do not have access to $\hat{Q}_{ij}$ and $\hat{Q}_i^j$ since $Q$ is an unlabeled domain. However, the distribution
of the label predictions on $Q$ is accessible:
$$\beta_i=q(y=i)=\sum_{j=1}^{L}q(y=i,\hat{y}=j)=\sum_{j=1}^{L}Q_{ij},$$
$$q(\hat{y}=j)=\sum_{i=1}^{L}q(y=i,\hat{y}=j)=\sum_{i=1}^{L}Q_{ij}=\sum_{i=1}^{L}\beta_i Q_i^j.$$
We can get the estimate of $q(\hat{y}=j)$ by:
$$\hat{q}(\hat{y}=j)=\frac{1}{N}\sum_{k=1}^{N}\mathbb{1}\{f(x_k^{Q})=j\}
=\frac{1}{N}\sum_{k=1}^{N}\sum_{i=1}^{L}\mathbb{1}\{y_k^{Q}=i,\,f(x_k^{Q})=j\}=\sum_{i=1}^{L}\hat{Q}_{ij}.$$
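The plug-in quantities above can be computed directly from model predictions. The following is a minimal sketch (not the exact implementation used in this work); the array names are illustrative and the labels are assumed to be encoded as integers in {0, ..., L-1}.

```python
import numpy as np

def plugin_estimates(y_true_src, y_pred_src, y_pred_tgt, L):
    """Compute the joint confusion estimate P_hat[i, j] ~ p(y=i, yhat=j) on the
    labeled source training set, its row-normalized version p(yhat=j | y=i),
    and the predicted-label distribution q_hat(yhat=j) on the unlabeled target set."""
    P_hat = np.zeros((L, L))
    for yt, yp in zip(y_true_src, y_pred_src):
        P_hat[yt, yp] += 1.0
    P_hat /= len(y_true_src)                                   # joint estimate
    row_sums = np.maximum(P_hat.sum(axis=1, keepdims=True), 1e-12)
    P_cond = P_hat / row_sums                                  # conditional estimate
    q_hat = np.bincount(np.asarray(y_pred_tgt), minlength=L) / len(y_pred_tgt)
    return P_hat, P_cond, q_hat
```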
4.3.2.2 Label Proportions Estimation
The crucial component of the DAN-LPE is the way to update the label proportion estimates. In
this section, we discuss two estimation methods - Black Box Shift Estimation and Joint Adversar-
ial Training Estimation.
Black Box Shift Estimation:
Black Box Shift Estimation (BBSE) [69] is a moment-matching approach whose authors rely on
three assumptions:
A1: The conditional feature distributions are domain-invariant, i.e., $p(x\mid y)=q(x\mid y),\ \forall x\in\mathcal{X},\,y\in\mathcal{Y}$;
A2: For every $y\in\mathcal{Y}$ with $q(y)>0$, we require $p(y)>0$;
A3: Access to a black box predictor $f:\mathcal{X}\to\mathcal{Y}$ for which the expected confusion matrix
$C_p(f):=p(f(x),y)\in\mathbb{R}^{|\mathcal{Y}|\times|\mathcal{Y}|}$ is invertible.
From assumption A1 we have $p(\hat{y}\mid y)q(y)=q(\hat{y}\mid y)q(y)$, so that
$$q(\hat{y})=\sum_{y\in\mathcal{Y}}q(\hat{y}\mid y)q(y)=\sum_{y\in\mathcal{Y}}p(\hat{y}\mid y)q(y).$$
Now we can calculate the target label distribution $q(y)$ by the following equation:
$$\beta=\bar{P}^{-1}\,[q(\hat{y}=1),q(\hat{y}=2),\dots,q(\hat{y}=L)]^{T},\qquad
\bar{P}=\begin{bmatrix} P_1^1 & P_2^1 & \dots & P_L^1\\ P_1^2 & P_2^2 & \dots & P_L^2\\ \vdots & \vdots & \ddots & \vdots\\ P_1^L & P_2^L & \dots & P_L^L \end{bmatrix},\tag{4.1}$$
where the matrix $\bar{P}$ and the predicted label distribution $[q(\hat{y}=1),q(\hat{y}=2),\dots,q(\hat{y}=L)]^{T}$
can be obtained via the text classifier trained with source samples.
The BBSE method is simple to implement; however, the assumption that $p(x\mid y)=q(x\mid y)$ is
a strong one and rarely holds in practice.
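For concreteness, a minimal sketch of a BBSE-style estimate is given below: it solves the linear system relating the source conditional confusion matrix to the predicted-label distribution on the target set. The pseudo-inverse, clipping, and renormalization steps are practical safeguards added here for illustration and are not part of the formal estimator.

```python
import numpy as np

def bbse_estimate(P_cond, q_hat):
    """BBSE-style estimate of the target label proportions: solve
    P_bar @ beta = q_hat, where P_bar[j, i] = p_hat(yhat=j | y=i)."""
    P_bar = P_cond.T                      # rows indexed by the predicted label j
    beta = np.linalg.pinv(P_bar) @ q_hat  # pseudo-inverse guards against ill-conditioning
    beta = np.clip(beta, 0.0, None)       # clip any small negative components
    return beta / beta.sum()              # renormalize to a valid distribution
```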
Joint Adversarial Training Estimation:
We define a vector $\gamma=[\gamma_1,\gamma_2,\dots,\gamma_L]^{T}$ to be the estimator of $\beta$. Unlike BBSE, where we
predict the target label distribution before the adversarial training, we aim at achieving, during the
training procedure, 1) domain-invariant features such that $p(x\mid y)=q(x\mid y)$, which implies $P_i^j=Q_i^j$,
and 2) an accurate estimate of the target label proportions, $\gamma=\beta$. In this condition, the equality
$\sum_{i=1}^{L}\gamma_i P_i^j=\sum_{i=1}^{L}\beta_i Q_i^j=q(\hat{y}=j)$ holds $\forall j$. So we propose the following loss function
$$J_\gamma=\sum_{j=1}^{L}\Big(\sum_{i=1}^{L}\gamma_i P_i^j-q(\hat{y}=j)\Big)^{2}.\tag{4.2}$$
Replacing with the plug-in estimates we get:
$$\hat{J}_\gamma=\sum_{j=1}^{L}\Big(\sum_{i=1}^{L}\gamma_i \hat{P}_i^j-\hat{q}(\hat{y}=j)\Big)^{2}.\tag{4.3}$$
We set $J_\gamma=0$, which implies $\sum_i\gamma_i P_i^j=q(\hat{y}=j),\ \forall j$, and we get:
$$\bar{P}\gamma=[q(\hat{y}=1),q(\hat{y}=2),\dots,q(\hat{y}=L)]^{T}.\tag{4.4}$$
Then we conclude the following implication.
Theorem 1. Assume $p(x\mid y)=q(x\mid y)$ and $\bar{P}$ is an invertible matrix; then $J_\gamma\ge 0$ is a convex function
of $\gamma$ and the equality $J_\gamma=0$ is attained when $\gamma=\beta$.
The gradient of $J_\gamma$ is given by:
$$\frac{\partial J_\gamma}{\partial\gamma_k}=2\sum_{j=1}^{L}P_k^j\Big(\sum_{i=1}^{L}\gamma_i P_i^j-q(\hat{y}=j)\Big).\tag{4.5}$$
We modify the equation in a similar way as before and relate $\gamma$ only to the observable data:
$$\frac{\partial \hat{J}_\gamma}{\partial\gamma_k}=2\sum_{j=1}^{L}\hat{P}_k^j\Big(\sum_{i=1}^{L}\gamma_i \hat{P}_i^j-\hat{q}(\hat{y}=j)\Big).\tag{4.6}$$
The proposed prior $\gamma$ is updated by gradient descent using Equation (4.6). However, since $\gamma$ is
constrained by $\sum_i\gamma_i=1$, we apply projected gradient descent instead:
$$G(\gamma)=\frac{\partial \hat{J}_\gamma}{\partial\gamma}=\Big[\frac{\partial \hat{J}_\gamma}{\partial\gamma_1},\frac{\partial \hat{J}_\gamma}{\partial\gamma_2},\dots,\frac{\partial \hat{J}_\gamma}{\partial\gamma_L}\Big]^{T},\qquad
\gamma_t=\gamma_{t-1}-\lambda_L\Big(G(\gamma_{t-1})-\Big\langle G(\gamma_{t-1}),\frac{\mathbf{1}}{\sqrt{L}}\Big\rangle\cdot\frac{\mathbf{1}}{\sqrt{L}}\Big),\tag{4.7}$$
where $\lambda_L$ is the learning rate for updating $\gamma$. To avoid negative proportion estimates, we also set
a lower bound $\gamma_i\ge 0.001$. Once $\gamma_i<0.001$, we set
$$\gamma_k=\gamma_k+\gamma_i-0.001,\quad k=\operatorname*{argmax}_j \gamma_j,\qquad \gamma_i=0.001.\tag{4.8}$$
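The update of Equations (4.6)-(4.8) can be written compactly as in the sketch below. This is an illustrative implementation, not the exact code used in the experiments; `P_cond_hat` stores the estimates $\hat{P}_i^j$ with rows indexed by the true label $i$ and columns by the predicted label $j$.

```python
import numpy as np

def update_gamma(gamma, P_cond_hat, q_hat, lr, steps, floor=0.001):
    """Projected gradient descent on the loss of Equation (4.3),
    following Equations (4.6)-(4.8)."""
    L = len(gamma)
    ones = np.ones(L) / np.sqrt(L)
    for _ in range(steps):
        residual = gamma @ P_cond_hat - q_hat          # sum_i gamma_i * P_i^j - q_hat(j)
        grad = 2.0 * P_cond_hat @ residual              # Equation (4.6)
        grad = grad - np.dot(grad, ones) * ones         # project out the all-ones direction
        gamma = gamma - lr * grad                       # Equation (4.7)
        # Enforce the lower bound of Equation (4.8)
        for i in range(L):
            if gamma[i] < floor:
                k = int(np.argmax(gamma))
                gamma[k] += gamma[i] - floor
                gamma[i] = floor
    return gamma
```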
We define $J_C$ and $J_D$ as the loss functions of the text classifier $C$ and the domain classifier
$D$. To eliminate the prior shift, we re-weight the samples from $P$ in $D$ based on their labels. Let
$w_i=\gamma_i/\tilde{\alpha}_i$, where $\tilde{\alpha}_i$ is the prior distribution of $S_{P_T}$. For a mini-batch of size $B$, the instances from
$P$ and $Q$ are $B_P=\{(\mu_1,\nu_1),\dots,(\mu_{B/2},\nu_{B/2})\}$ and $B_Q=\{(\mu'_1,\nu'_1),\dots,(\mu'_{B/2},\nu'_{B/2})\}$, and the
sample weight vector of $B_P$ is $w^{T}=[w_{\nu_1},w_{\nu_2},\dots,w_{\nu_{B/2}}]$. We compute $J_D$ by
$$J_D=\frac{1}{B}\Big(\sum_{i=1}^{B/2}\frac{w_{\nu_i}}{\lVert w\rVert}\,c(\mu_i)+\sum_{i=1}^{B/2}c(\mu'_i)\Big),\tag{4.9}$$
where $c(\cdot)$ denotes the cross-entropy loss. With this new loss function, samples from the same
class in the source and target domains make closer contributions in $D$, which helps the domain
adapter suffer less from the label distribution shift.
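A minimal PyTorch-style sketch of this re-weighted domain loss is shown below. It assumes domain label 0 for source samples and 1 for target samples, and that `gamma` and `alpha_tilde` are given as tensors; the exact normalization used in the thesis implementation may differ slightly.

```python
import torch
import torch.nn.functional as F

def domain_loss(domain_logits_src, domain_logits_tgt, src_labels, gamma, alpha_tilde):
    """Re-weighted domain classifier loss in the spirit of Equation (4.9).
    Source samples are weighted by w_y = gamma_y / alpha_tilde_y; target samples are not."""
    w = (gamma / alpha_tilde)[src_labels]            # one weight per source sample
    w = w / w.norm()                                  # normalize by ||w||
    loss_src = F.cross_entropy(domain_logits_src,
                               torch.zeros(len(src_labels), dtype=torch.long),
                               reduction='none')      # domain label 0 = source
    loss_tgt = F.cross_entropy(domain_logits_tgt,
                               torch.ones(domain_logits_tgt.size(0), dtype=torch.long),
                               reduction='none')      # domain label 1 = target
    batch = len(src_labels) + domain_logits_tgt.size(0)
    return ((w * loss_src).sum() + loss_tgt.sum()) / batch
```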
Algorithm 2 Domain Adversarial Network with Label Proportion Estimation
Step 1:
Initialization: $\gamma=[\frac{1}{L},\frac{1}{L},\dots,\frac{1}{L}]^{T}$; $\lambda_D>0$, $\lambda_L>0$; $T,T_0,k,m\in\mathbb{N}$
for $t=1$ to $T$ do
  Sample a mini-batch training set for $C$ and $D$ respectively
  Fix $\gamma$
  Update $C$ parameters using $\nabla J_C$
  Update $F$ parameters using $\nabla J_C-\lambda_D\nabla J_D$
  Update $D$ parameters using $\nabla J_D$
  if $t>T_0$ and $t \bmod k=0$ then
    Predict the labels of $S_{P_T}$ and $S_Q$
    for $j=1$ to $m$ do
      Update $\gamma$ by Equations (4.6)-(4.8)
    end for
  end if
end for
Step 2:
Perform DANN with fixed $\gamma$ and the modified loss function of $D$ in Equation (4.9)
The complete pseudo-code of this learning procedure is given in Algorithm 2. In the first step
it alternates between training a domain adversarial network and estimating the label proportions to obtain
an estimate of the target prior distribution. During this procedure, we obtain an increasingly
accurate estimate of the target label proportions, the label distribution shift effect is
reduced by re-weighting the samples in $D$, and better domain-invariant features are learned. Since
the label distribution shift still matters a lot in the early epochs, we need a second step that performs
a regular DANN with the fixed $\gamma$ obtained in the first step and the modified loss function $J_D$ of
Equation (4.9).
The hyper-parameters of step 1 in Algorithm 2 are quite flexible. The number of iterations
$T$ is deemed adequate when the validation loss does not increase too much. The role of $T_0$ is
to guarantee that a decent text classification model has already been trained before we start updating $\gamma$. We
update $\gamma$ every $k$ iterations, which reduces the number of times we need to predict $S_{P_T}$ and $S_Q$ and accelerates the
process. The parameters $\lambda_L$ and $m$ control how fast and how smoothly $\gamma$ changes. The DAN-LPE is not
very sensitive to these hyper-parameters. When $\gamma$ is fixed to the prior distribution of $S_{P_T}$, step
1 of Algorithm 2 is equivalent to the basic DANN.
4.4 Dataset
We use data from Motivational Interviewing (MI) counseling sessions where the utterance level
behavior of the therapist is coded following the Motivational Interviewing Skill Code (MISC)
[53] manual. Some of the codes have similar functions to dialogue acts, such as “Open Question",
“Closed Question" and “Facilitate" (similar to backchannel), which can be coded even by anyone who
knows English. However, some of these MISC codes inherently overlap with each other in their
construction, and training classifiers for these confusable codes can help improve the behavioral
coding performance [28]. In this experiment we classify the utterances of Giving Information
(GI), Simple Reflection (RES) and Complex Reflection (REC) collected from MI sessions of alcohol
addiction (A), drug abuse (D) [6] and general psychology conversations (G) from a US university
counseling center, with each category containing around 10,000 samples. The label proportions
are shown in Fig. 4.2.
4.5 Experiments and Results
4.5.1 Experiments on Behavioral Coding in Psychotherapy
In the DAN-LPE setting we implement a word embedding layer, followed by a bidirectional LSTM
layer and an attention mechanism implemented as in [122] for $F$. $C$ takes the output of the
attention layer as input, with a dimension of 128, and has another hidden layer of the same size before
the softmax output layer. $D$ has a structure similar to that of $C$, except that it has a gradient reversal layer
and replaces the 3-way classification with a binary classification. Dropout of $p=0.4$ is set for all the
hidden dense layers. As shown in Fig. 4.2b, the data are highly imbalanced, so we evaluate the
performance by the Unweighted Average Recall (UAR). In module $C$ of both step 1 and step 2 of
Algorithm 2, we assign weights to each class inversely proportional to their class frequencies
to make the algorithm more robust as well as to improve the F-score.
To evaluate the label proportions estimation, we define $\hat{\gamma}_{dl}$ to be the estimate using DAN-LPE,
$\hat{\gamma}_{b}$ to be the estimate using BBSE, and $\hat{\alpha}$ and $\hat{\beta}$ to be the label proportions of the samples in the source
and target datasets. The results are measured by the Euclidean distance between the estimate and
the actual label proportions of the target set.
In the behavioral coding experiments, DAN-LPE outperforms BBSE in label proportion estimation,
reduces the label shift in all the tasks, and achieves the overall best classification performance,
as shown in Table 4.1. It gains the highest UAR in all the tasks. The smallest improvement is
achieved in the last task, where the estimated proportions do not decrease the label shift much. The
Figure 4.2: Label distributions of MI behavioral coding Data.
Table 4.1: Results of Behavioral Coding Experiments.
Task    Unweighted Average Recall    Estimation Results
P->Q    DNN    DANN    DAN-LPE    $\lVert\hat{\beta}-\hat{\gamma}_{dl}\rVert_2$    $\lVert\hat{\beta}-\hat{\gamma}_{b}\rVert_2$    $\lVert\hat{\beta}-\hat{\alpha}\rVert_2$
A->G 0.520 0.518 0.536 0.14 0.15 0.27
G->A 0.512 0.511 0.523 0.10 0.17 0.27
D->G 0.531 0.533 0.558 0.05 0.08 0.24
G->D 0.556 0.561 0.565 0.06 0.15 0.24
A->D 0.632 0.640 0.656 0.13 0.15 0.25
D->A 0.603 0.605 0.605 0.19 0.25 0.25
Figure 4.3: Label distributions of Yelp Data.
DANN only achieves a UAR comparable with that of DNN and even degrades for some tasks. From
the results we also find that the behavioral coding task is hard because the behavior codes are
human-defined and correlated, so they are easily confused with each other. However, the DAN-LPE
shows its robustness and still gives reasonable proportion estimates for the data in the unlabeled
domain.
4.5.2 Experiments on Yelp Data
We further generalize our algorithm to a sentiment analysis task with the Yelp Open Dataset
[124]. It includes 192,609 businesses and 6,685,900 reviews of more than 20 categories. In each
review a user expresses opinions about a business and gives a rating ranging from 1 to 5. We
compute the average review rating $z_i$ of each business and label the business with
$$y=\begin{cases}1, & \text{if } z_i<3.4,\\ 0, & \text{if } z_i>3.6.\end{cases}\tag{4.10}$$
Businesses with $3.4\le z_i\le 3.6$ are filtered out to create a gap between the two classes. We select the data of
Financial Services (F), Hotel & Travel (H), Beauty & Spas (B) and Pets (P) for the domain adaptation
tasks. Their label distributions are shown in Fig. 4.3; the businesses of Financial Services and
Hotel & Travel are more likely to receive negative reviews, possibly because they serve essential
needs about which people tend to be stricter. We sample 2800 businesses for each domain,
preserving the label proportions, and predict the class using their reviews. Among the samples of
each domain, 10% are split into the validation set.
We extract the features for each business using the following steps (a minimal sketch of this pipeline is given after the list):
1) remove punctuation and the stop words from the Natural Language Toolkit (NLTK) [74];
2) apply stemming using the Porter algorithm implemented in NLTK;
3) negate the words between a negation and the following punctuation [32];
4) find 500 words by intersecting the 837 most common words of each domain and form
the bag-of-words vector for each review by the occurrence of these tokens;
5) get the feature vector of the business by averaging the vectors of all its reviews.
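The sketch below illustrates steps 1)-3) and 5) of this pipeline with NLTK; the negation handling is a simple approximation, the NLTK stopword list is assumed to have been downloaded, and the vocabulary `vocab` is assumed to have been selected as in step 4).

```python
# Illustrative sketch of the review preprocessing steps; the exact negation and
# vocabulary-selection details in the thesis implementation may differ.
import string
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

NEGATIONS = {'not', 'no', 'never', "n't"}
STOP = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def tokenize(review):
    tokens, negating = [], False
    for raw in review.lower().split():
        word = raw.strip(string.punctuation)
        if not word:
            negating = False                      # punctuation-only token ends negation
            continue
        if word in NEGATIONS:
            negating = True
            continue
        if word in STOP:
            continue
        word = STEMMER.stem(word)
        tokens.append('NOT_' + word if negating else word)
        if raw[-1] in string.punctuation:
            negating = False                      # negation scope ends at punctuation
    return tokens

def business_vector(reviews, vocab):
    index = {w: i for i, w in enumerate(vocab)}
    vecs = np.zeros((len(reviews), len(vocab)))
    for r, review in enumerate(reviews):
        for tok in tokenize(review):
            if tok in index:
                vecs[r, index[tok]] += 1          # bag-of-words counts per review
    return vecs.mean(axis=0)                      # average over all reviews of the business
```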
In the DAN-LPE setting we implement a standard neural network with 2 hidden layers of 32
dimensions. $D$ takes the output of the first layer as its input and has another hidden layer of the
same size. Dropout of $p=0.6$ is set for all the hidden layers. We compare DAN-LPE with SVM,
DNN and DANN. The DNN is constructed by $F+C$ and the DANN by $F+C+D$. For DNN, DANN and
DAN-LPE, the learning rate is fixed to $10^{-4}$ and the mini-batch size is 64. For optimization,
Table 4.2: Results of Yelp Experiments.
Task    Accuracy    Estimation Results
P->Q    SVM    DNN    DANN    DAN-LPE    $\lVert\hat{\beta}-\hat{\gamma}_{dl}\rVert_2$    $\lVert\hat{\beta}-\hat{\gamma}_{b}\rVert_2$    $\lVert\hat{\beta}-\hat{\alpha}\rVert_2$
B->H 0.881 0.882 0.884 0.886 0.08 0.15 0.40
B->F 0.869 0.876 0.883 0.884 0.10 0.13 0.32
P->H 0.842 0.863 0.858 0.865 0.03 0.06 0.47
P->F 0.871 0.879 0.880 0.883 0.13 0.17 0.38
H->B 0.862 0.861 0.858 0.868 0.05 0.05 0.40
H->P 0.871 0.878 0.875 0.879 0.06 0.08 0.47
F->B 0.885 0.879 0.877 0.896 0.03 0.05 0.32
F->P 0.840 0.828 0.826 0.845 0.03 0.07 0.38
B->P 0.884 0.892 0.893 0.893 0.02 0.05 0.07
P->B 0.896 0.907 0.908 0.908 0.06 0.07 0.07
H->F 0.881 0.885 0.883 0.885 0.02 0.03 0.08
F->H 0.846 0.839 0.852 0.849 0.13 0.17 0.08
the Adam optimizer [60] was applied following an early stopping strategy. In the first step of
DAN-LPE, we set $\lambda_D=0.05$, $\lambda_L=0.01$, $T=8000$, $T_0=2000$ and $m=k=5$.
The results of the Yelp experiments are presented in Table 4.2. We found that the DAN-LPE has
an overall more accurate label proportions estimate than BBSE. In the first eight tasks $\hat{\alpha}$ and $\hat{\beta}$
differ a lot and the DANN does not show much improvement over the DNN. In some tasks it even
degrades the classification performance. In these experiments, the DAN-LPE shows a significant
gain because the label proportions estimate $\hat{\gamma}$ reduces the label shift. In the last four tasks the label
proportions between the source and target domains are close and DANN gets the best accuracy
in three of them. However, the DAN-LPE algorithm performs comparably with DANN in these
tasks since it does not degrade in estimating $\beta$.
4.6 Conclusion
In this chapter, we proposed the DAN-LPE framework to handle the label shift in DANN for
unsupervised domain adaptation of automated behavioral coding. In DAN-LPE we estimate the
target label distribution and learn the domain-invariant features simultaneously. We derived the
formula to update the label proportions estimate using the confusion matrix and the target label predictions,
and re-weighted the samples in the domain classifier to better learn domain-invariant features.
Experiments show that the DAN-LPE gives a much better estimate than the BBSE and evidently
reduces the label shift. When the DANN does not gain much from the domain adapter under a large
shift, the DAN-LPE structure successfully corrects the shift and achieves better performance in
predicting the behavioral codes. In addition, we generalized our algorithm to the application of sentiment
analysis with Yelp data, and the results are consistent with the experiments on psychotherapy
behavioral coding.
Chapter 5
Meta-Transfer Learning for Utterance Level Behavioral Coding
When Both Samples and Labels are Limited
5.1 Introduction
Human behavioral coding is associated with data sparsity challenges due to the high level of privacy
and sensitivity of the data and the high cost of human annotation by experts. In many
cases, samples and labels of in-domain data are both limited which makes it hard to train com-
putational models for automated behavioral coding. This section aims to discuss the algorithms
for predicting behavior codes directly from psychotherapy utterances through classification tasks
with limited in-domain data.
To overcome the data sparsity issue of the in-house data, our strategy is to make use of ex-
ternal datasets. In this chapter, we leverage publicly available resources and transfer knowledge
to the low-resource behavioral coding task by performing an intermediate language model train-
ing via meta-learning. We introduce a task augmentation method to produce a large number of
“analogy tasks” — tasks similar to the target one — and demonstrate that the proposed framework
enhances meta-transfer learning and improves the classification accuracy of automated behav-
ioral coding tasks. Unlike most previous meta-learning frameworks, which require auxiliary tasks
from various datasets, our work uses only one dataset and produces the source tasks by a task
augmentation procedure. The task augmentation framework evaluates the correlations between
the source and target labels. It produces source tasks by choosing subsets of source labels whose
classes are in one-to-one correspondence with target classes. Using this strategy, we can generate
a large number of source tasks similar to the target task and thus improve the performance of
meta-learning.
This chapter is mainly based on [24].
5.2 Related Works
Recently, substantial work has shown the success of universal language representations via pre-
training context-rich language models on large corpora [86, 56]. Among these language models,
BERT (Bidirectional Encoder Representations from Transformers) has achieved state-of-the-art
performance in many natural language processing (NLP) tasks and provided strong baselines in
low-resource scenarios [34]. However, these models rely on self-supervised pre-training on a
large out-of-domain text corpus. Prior works have explored addressing this difficulty by inter-
mediate task pre-training with some other high-resource dataset before fine-tuning with
in-domain data [54, 70, 113]. However, not all source tasks yield positive gains; some-
times the intermediate task might even lead to negative transfer [89, 64, 88]. To improve the
chance of finding a good transfer source, we might need to collect as many source tasks as possible.
Another approach for low-resource in-domain NLP tasks is meta-learning, which aims at finding
a good initialization for fine-tuning on the target task [49, 36, 90]; this also calls for enough source tasks
and is affected by the task similarity [111, 59, 133].
5.3 Method
5.3.1 Task Augmentation via Label Clustering
We define a low-resource target task on $\mathcal{X}\times\mathcal{Y}$ and use $x\in\mathcal{X}$ to denote the data and $y\in\mathcal{Y}=\{1,2,\dots,M\}$
to denote the target labels. We additionally assume a data-rich source task defined on $\mathcal{X}\times\mathcal{Z}$
with samples $\{(x_1,z_1),(x_2,z_2),\dots,(x_n,z_n)\}$ supported by a much larger label set denoted by
$z\in\mathcal{Z}=\{1,2,\dots,N\}$, $N>M$. Our task augmentation procedure aims at producing numerous
tasks similar to the target task; we will refer to these as the “analogy tasks”.
Algorithm 3 Task Augmentation
Initialize model parameters $\theta$ from a released BERT; $K\in\mathbb{N}$.
Create empty label subsets: $C_1=\emptyset, C_2=\emptyset,\dots,C_M=\emptyset$.
Fine-tune BERT with in-domain samples to obtain the classifier $f$
for $i=1$ to $K$ do
  for $j=1$ to $M$ do
    For the target label $y=j$, select $z^{*}\in\mathcal{Z}$ by Equations (5.1) and (5.2), then add it to $C_j$
    Remove the label $z^{*}$ from $\mathcal{Z}$
  end for
end for
Select one label from each of $C_1,C_2,\dots,C_M$ to produce $M^K$ augmented tasks
The high-level idea is to construct tasks with class labels similar to the target ones. Thus
we explore the relationships between $\mathcal{Y}$ and $\mathcal{Z}$. We initialize $M$ label subsets $C_1=\emptyset, C_2=\emptyset,\dots,C_M=\emptyset$
to gather the source labels corresponding to $y=1,y=2,\dots,y=M$, respectively. In the first step, we fine-tune BERT on
the in-domain data to obtain a dummy classifier $f$. Then, we feed the source samples into $f$ and
obtain the predicted labels $\hat{\mathcal{Z}}=\{f(x_1),f(x_2),\dots,f(x_n)\}$.
Figure 5.1: An example of task augmentation.
Figure 5.2: The flowchart of MTA framework.
For any pair of a target label $y$ and a source label $z$, we define the similarity function $\mathrm{Sim}(\cdot)$ expressed
by Equation (5.1). The value of $\mathrm{Sim}(y,z)$ represents the proportion of the source samples
within the class $z$ which are assigned the label $y$ by $f$. For any target label $y$, we determine the
most similar label $z^{*}$ from the source data by Equation (5.2). Next, we apply Equations (5.1) and
(5.2) to each of the target labels alternately to cluster the source labels into the label subsets
$C_1,C_2,\dots,C_M$ with $|C_1|=|C_2|=\dots=|C_M|=K$, where $K$ is the size of the label subsets.
Finally, we generate the source tasks by selecting one label from each of the subsets, resulting in
$M^K$ analogy tasks. The details of this procedure are given in Algorithm 3.
$$\mathrm{Sim}(y,z)=\frac{\sum_{k=1}^{n}\mathbb{1}\{f(x_k)=y,\,z_k=z\}}{\sum_{k=1}^{n}\mathbb{1}\{z_k=z\}}.\tag{5.1}$$
$$z^{*}=\operatorname*{argmax}_{z\in\mathcal{Z}}\ \mathrm{Sim}(y,z).\tag{5.2}$$
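A minimal sketch of Equations (5.1)-(5.2) and the clustering loop of Algorithm 3 is given below; the inputs are the dummy classifier's predictions on the source samples and their original source labels (encoded as integers), and the function and variable names are illustrative.

```python
import numpy as np

def cluster_source_labels(pred_target_labels, source_labels, M, N, K):
    """Group source labels into M subsets of size K (Algorithm 3).
    pred_target_labels[k] = f(x_k) in {0..M-1}; source_labels[k] = z_k in {0..N-1}.
    Assumes M * K <= N so that enough source labels are available."""
    counts = np.zeros((M, N))
    for y_hat, z in zip(pred_target_labels, source_labels):
        counts[y_hat, z] += 1
    # Sim[y, z]: fraction of source samples with label z predicted as target label y
    sim = counts / np.maximum(counts.sum(axis=0, keepdims=True), 1)   # Equation (5.1)
    subsets = [[] for _ in range(M)]
    available = set(range(N))
    for _ in range(K):
        for y in range(M):
            z_star = max(available, key=lambda z: sim[y, z])          # Equation (5.2)
            subsets[y].append(z_star)
            available.remove(z_star)
    return subsets
```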
Fig. 5.1 presents a task augmentation example. We expect the produced analogy
tasks to benefit meta-transfer learning in three aspects: 1) the task similarity and knowledge
transfer are improved; 2) the large number of analogy tasks increases generalization, which
helps the meta-learner find a good common initialization of the model parameters; 3) the classification
layers can be shared across all the tasks.
5.3.2 Meta-learning Framework with Augmented Tasks
After task augmentation, we apply an optimization-based meta-learning algorithm for intermediate
training with the produced analogy tasks $\mathcal{T}_1,\mathcal{T}_2,\dots,\mathcal{T}_{M^K}$. In particular, here we use Reptile,
which has shown superior text classification results [36].
Algorithm 4 Reptile with Augmented Tasks
Initialize model parameters $\theta$ from a released BERT; $m\in\mathbb{N}$, $\alpha,\beta>0$
for iteration in $1,2,\dots$ do
  Sample a batch of tasks $\{\tau_i\}$ from the augmented tasks based on one of the proposed sampling methods
  for all $\tau_i$ do
    Compute $\theta_i^{m}$ by $m$ gradient descent steps with learning rate $\alpha$
  end for
  Update $\theta=\theta+\beta\,\frac{1}{|\{\tau_i\}|}\sum_{\tau_i}(\theta_i^{m}-\theta)$
end for
We denote this meta-learning-based framework with task augmentation as MTA and propose
two task sampling methods:
Uniform: choose a task by uniformly selecting one source label from each of the label subsets,
resulting in the same probability $\frac{1}{M^K}$ for every task.
PPTS: choose an analogy task with a probability proportional to the task size, to make
the best use of the instances (see Appendix 6.2).
We describe the training procedure in Algorithm 4, where $\alpha$ and $\beta$ denote the learning rates
for the inner and outer loops, respectively, and $m$ denotes the number of update steps for the inner loop. This
intermediate task provides a good initialization for fine-tuning the target task. Fig. 5.2 shows the
flowchart of the framework.
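For illustration, the sketch below shows one meta-iteration of Reptile over a batch of augmented tasks, in the spirit of Algorithm 4. It is a simplified PyTorch sketch, not the exact training code; `sample_task` is a placeholder that yields mini-batches for one analogy task drawn with the Uniform or PPTS strategy, and `model` is assumed to be a classifier returning logits.

```python
import torch

def reptile_step(model, sample_task, inner_steps, inner_lr, outer_lr, n_tasks):
    """One Reptile meta-iteration over a batch of augmented tasks (Algorithm 4 sketch)."""
    theta = {n: p.detach().clone() for n, p in model.named_parameters()}
    deltas = {n: torch.zeros_like(p) for n, p in theta.items()}
    for _ in range(n_tasks):
        with torch.no_grad():                       # reset to the meta-parameters
            for n, p in model.named_parameters():
                p.copy_(theta[n])
        opt = torch.optim.SGD(model.parameters(), lr=inner_lr)
        for _, (x, y) in zip(range(inner_steps), sample_task()):
            loss = torch.nn.functional.cross_entropy(model(x), y)
            opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():                       # accumulate (theta_i^m - theta) / n_tasks
            for n, p in model.named_parameters():
                deltas[n] += (p.detach() - theta[n]) / n_tasks
    with torch.no_grad():                           # outer update theta <- theta + beta * mean delta
        for n, p in model.named_parameters():
            p.copy_(theta[n] + outer_lr * deltas[n])
```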
5.4 Dataset
We use data from Motivational Interviewing (MI) sessions of alcohol and drug abuse problems
[8, 6] for the target task. The corpus consists of 345 transcribed sessions with behavioral codes
annotated at the utterance level according to the Motivational Interviewing Skill Code (MISC)
Code Description #Train #Test
Therapist
FA Facilitate 19397 5838
GI Giving information 17746 5064
RES Simple reflection 7236 2137
REC Complex reflection 4974 1510
QUC Closed question 6421 1569
QUO Open question 5011 1475
MIA MI adherent 4898 1346
MIN MI non-adherent 1358 237
Patient
FN Follow/Neutral 56204 15426
POS Change talk 6146 1737
NEG Sustain talk 5121 1407
Table 5.1: Data statistics for behavior codes in Motivational Interviewing psychotherapy.
manual [53]. The original MISC has 19 behavioral codes. However, the instances of some codes
are sparse; [19] and [18] proposed different ways to cluster the codes in order to address the
sparsity. We take the strategy proposed by [18], grouping all counselor codes into 8 categories
and client codes into 3 categories, as described in Table 5.1. We split the data into training
and testing sets with a roughly 80%:20% ratio across speakers, resulting in 276 training sessions
and 67 testing sessions. The statistics of the data are shown in Table 5.1.
We perform the intermediate task with the SwDA dataset, which consists of telephone con-
versations with a dialogue act tag for each utterance. We concatenate the parts of an interrupted
utterance together, following [116], which results in 196K training utterances and 4K testing ut-
terances. This dataset supports 42 distinct tags, with more details displayed in Appendix 6.2.
5.5 Experiments and Results
We adopt the MI dataset introduced in section 5.4 to perform two tasks: predicting the behavioral
codes of the therapist and of the patient. We use SwDA as the source dataset for intermediate
tasks to train the BERT model with Reptile. We set the number of sessions for both the training
and validation sets to 1, 5, and 25 to simulate low-resource situations at different levels. We pick
sessions randomly to make pairs of training and validation, and we repeat this 15 times. For each
level of data sparsity, we report the averaged prediction results over 15 runs to reduce the effect
of data variations.
5.5.1 Experimental Setup
Our BERT model was implemented in PyTorch (version 1.3.1) and initialized with BERT-base*.
We threshold the word sequence length to 50, covering more than 99% of the sentences from
either the source or the target dataset. The model is trained using the Adam optimizer [60] with a
batch size of 64. For Reptile, we set the learning rate to 5e-5 for the inner loop and 1e-5 for the
outer loop and fix the inner update step $m$ to 3. We run Reptile by sampling 8 tasks in each
step and pre-training the model for 4 epochs. In the fine-tuning stage, we select the learning rate
from {5e-6, 1e-5, 2e-5, 3e-5} and the number of epochs from {1, 3, 5, 10} based on the lowest validation
loss.
To handle the class imbalance, we assign a weight for each class inversely proportional to its
class frequency in the fine-tuning stage. In the meta-transfer learning, we assign a weight for
each sample inversely proportional to the frequency of the label subset it belongs to, as shown in
∗
https://github.com/huggingface/pytorch-pretrained-BERT
Fig. 5.1. The performance of classification tasks is evaluated by the unweighted average recall
(UAR).
5.5.2 Baseline Methods
We compare our methods to the following baseline methods:
BERT: We directly fine-tune BERT with the limited in-domain data.
Pre-train-42: We pre-train BERT before the fine-tuning stage with the Switchboard-DAMSL data
using a 42-class classification task adopting its standard label tags.
Pre-train-7: We cluster the labels into simpler 7 tags, as described by [101], and pre-train the
intermediate task of BERT with the SwDA dataset using a 7-class classification task.
To explore the impact of label clustering, we perform the task with additional two approaches:
Pre-train-LC-Shared: After label clustering, we pre-train the model by classifying samples into
the label subsets they belong to as in Fig. 5.1. The classification layer is shared between pre-
training and fine-tuning stage.
Pre-train-LC-Unshared: The setup is the same as in Pre-train-LC-Shared, but the classification
layer is randomly initialized for fine-tuning.
Approach
Nb. Training Sessions
1 5 25
BERT 0.512 0.577 0.626
Pre-train-42 0.528 0.584 0.630
Pre-train-7 0.543 0.592 0.638
Pre-train-LC-Shared 0.533 0.584 0.633
Pre-train-LC-Unshared 0.552 0.597 0.643
MTA-Uniform 0.555 0.601 0.646
MTA-PPTS 0.574 0.618 0.660
Table 5.2: The UARs achieved on predicting therapist’s code.
Approach
Nb. Training Sessions
1 5 25
BERT 0.408 0.469 0.528
Pre-train-42 0.407 0.463 0.523
Pre-train-7 0.410 0.466 0.529
Pre-train-LC-Shared 0.445 0.497 0.545
Pre-train-LC-Unshared 0.446 0.499 0.545
MTA-Uniform 0.448 0.499 0.547
MTA-PPTS 0.461 0.511 0.555
Table 5.3: The UARs achieved on predicting patient’s code.
5.5.3 Results
The results of different algorithms for predicting therapist’s and patient’s codes are presented in
Tables 5.2 and 5.3. For the therapist-related tasks, both Pre-train-42 and Pre-train-7 outperform
fine-tuning BERT directly because some of the therapist's codes (e.g., “Open Question” or “Closed
Question”) are similar in function to dialog acts such as “Open Question” and “Yes-No Question”.
The Pre-train-7 approach groups the source labels in a reasonable way, making the source task closer to the
target task and achieving better performance than Pre-train-42. However, both failed to improve
the accuracy of predicting patient behavior since the codes reflect whether the patient shows a
motivation to change their behavior and thus do not have evident similarities to these dialogue
acts. The results of Pre-train-LC-Unshared are better when compared to direct fine-tuning and
regular pre-training. The greater improvement in the patient’s task indicates that gathering the
source labels similar to target labels is effective. However, sharing the classification layer when
fine-tuning degrades the performance in the task of therapists. This drop is because the pre-
trained models do not provide a good initialization, and thus, when we fine-tune BERT, it becomes
difficult to escape from local minima.
The results in red in Tables 5.2 and 5.3 are for our proposed framework, where we set the
size of label subsetsK to be 3 and 8 for the therapist’s task and patient’s task, respectively. The
outcomes show that our framework with task augmentation performs better than the baseline
approaches. We further compare the performance using the different task sampling strategies
proposed in Section 5.3.2, and the results demonstrate thatPPTS is superior toUniform achieving
significantly better UAR scores than any other approaches at ( p< 0.05) based on Student’s t-test.
Approach
Size of label subset K
2 3 4 5
Pre-train-LC-Unshared 0.585 0.597 0.596 0.589
MTA-PPTS 0.603 0.618 0.615 0.608
Table 5.4: Effect of the size of label subset K, 8-way classification tasks of therapist.
Approach
Size of label subset K
2 5 8 11
Pre-train-LC-Unshared 0.477 0.492 0.499 0.493
MTA-PPTS 0.488 0.501 0.511 0.504
Table 5.5: Effect of the size of label subset K, 3-way classification tasks of patient.
5.5.4 Effect of the Size of Label Subsets
We test the effect of K using Pre-train-LC-Unshared and MTA-PPTS with 5 training sessions. From
the results in Tables 5.4 and 5.5 we find that an optimal K should be neither too small nor too big.
WhenK is small, we utilize too little source data. A bigger value ofK leads to a larger number
of samples and augmented tasks. However, at the same time, it weakens the task similarity.
5.5.5 Ablation Study
To better understand how the proposed framework improves the classification tasks, we performed
an ablation study by grouping the clusters randomly, instead of using the similarity metric
in Equation (5.2). We denote the modified framework as MTA-PPTS-random and perform tasks
with 5 training sessions. Fig. 5.3 and Fig. 5.4 present the comparison between MTA-PPTS and
MTA-PPTS-random for different values of K. Unlike MTA-PPTS, which has an optimal value of
K, the result of MTA-PPTS-random improves monotonically as K increases, presumably because
the random selection procedure does not affect the similarity between the produced source and
Figure 5.3: The comparison between standard MTA and the random version of MTA for predicting
therapist’s behavioral codes.
target tasks. The MTA-PPTS-random still improves accuracy for all K values compared to direct
BERT fine-tuning (57.7% and 46.9% in Tables 5.2 and 5.3, respectively). However, it degrades the
performance relative to MTA-PPTS. This suggests that the proposed label clustering strategy increases
task similarity and benefits performance.
5.5.6 Limitations
Our work suffers from several limitations. First, we only leverage a single open dataset for the
intermediate task. There are other conversation-based corpora with utterance-level labels that
we have not explored yet, such as Persuasion For Good Corpus [115] and DailyDialog Corpus
[66]. Secondly, we adopted BERT-base as the language model throughout all the experiments,
without any domain-adaptive pre-training. For example, we could perform domain-adaptive pre-training with a
publicly available general psychotherapy corpus [57]. Finally, in our framework, we force the size of the
Figure 5.4: The comparison between standard MTA and the random version of MTA for predicting
patient’s behavioral codes.
label subsets to be the same in the label clustering stage, which might be sub-optimal. A more
sophisticated clustering algorithm is needed.
5.6 Extension of the Framework
In this section, we extend the MTA-PPTS to prompt-based tuning and unsupervised schemes to
show the generalization of our framework.
5.6.1 Adaptation to Prompt-Based Learning.
As machine learning grows, the paradigm in NLP keeps evolving [72]. Traditional NLP mod-
els relied heavily on feature engineering, where researchers extracted salient features such as
word frequencies. In the early 2010s, scientists focused on architecture engineering because of
the development of neural network models. Since the late 2010s, the most popular framework
became the “pre-train and fine-tune" paradigm, where the language model representations are
Fine-tuning Method    Direct Fine-tuning    MTA-PPTS
BERT 46.9 51.1
MP 47.2 51.3
P-tuning 49.7 53.2
Table 5.6: The UARs (%) achieved on predicting patient’s code with different framework and NLP
paradigms. MP refers to manual prompts.
pre-trained by self-supervised training with large raw textual data and then fine-tuned on a task-
specific dataset. The latest NLP paradigm, called “pre-train, prompt, predict", uses the
probabilities obtained from language model pre-training to predict the output of the downstream
tasks [65]. This state-of-the-art paradigm makes the language model pre-training and the fine-tuning
of the downstream tasks more consistent and reduces the need for large supervised datasets,
and it has become the basis of recent large language models (LLMs) such as BLOOM [96], Chat-
GPT†, LLaMa-1 [107] and Llama-2 [108].
We conduct experiments on predicting the client's behavioral codes using the data in Sec. 5.4,
with the number of training sessions fixed to 5. To adapt our MTA framework to the latest
NLP paradigm, we perform the “Meta-learning Pre-training" and “BERT Fine-tuning" processes
in the flowchart of Fig. 5.2 with prompt-based algorithms. Specifically, we adopt 1) MP Fine-
tuning: fine-tuning the language model with manual prompts [97, 45]; the prompt template
is “[X], The client is [MASK] about changing his behavior", where [X] represents the input text and
the masked word belongs to {positive, negative, neutral} (a minimal sketch of scoring this template is shown below); and
2) P-tuning: incorporating an encoder to learn prompts in the continuous space [73].
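As an illustration of the manual-prompt setup, the sketch below scores the three verbalizer words at the [MASK] position of the template quoted above using a masked language model. The zero-shot-style scoring and the mapping of the verbalizer words to the client codes are simplifications of the actual fine-tuning procedure.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
VERBALIZER = ['positive', 'negative', 'neutral']   # assumed mapping to POS / NEG / FN

def prompt_scores(utterance):
    """Score the verbalizer words at the [MASK] position of the manual template."""
    text = f"{utterance}, The client is {tokenizer.mask_token} about changing his behavior"
    inputs = tokenizer(text, return_tensors='pt', truncation=True)
    mask_pos = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]       # logits over the vocabulary
    word_ids = tokenizer.convert_tokens_to_ids(VERBALIZER)
    return logits[0, word_ids]                              # one score per class word
```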
Table 5.6 shows the prediction results of different frameworks and NLP paradigms, where the
text in red represents the prompt-based methods. Firstly, the MP fine-tuning performs slightly
†
https://chat.openai.com/
better than the BERT fine-tuning because the template is hand-crafted by humans. Meanwhile, the
P-tuning achieves a larger improvement by adopting continuous prompts generated and optimized
by a prompt encoder [73]. Moreover, the MTA-PPTS framework significantly improves the per-
formance of all fine-tuning methods in Table 5.6, which shows that our algorithm can be updated
easily as the NLP paradigm develops.
5.6.2 Unsupervised Meta-learning Framework with Task Augmentation
One of the shortcomings of the framework proposed in this chapter is that it assumes the source
dataset to be labeled. In some cases, however, we want to leverage closer but unlabeled corpora
to improve the knowledge transfer. Thus we generalize our method with an algorithm of Unsu-
pervised Meta-learning with Task Augmentation (UMTA). The flowchart of the UMTA framework is
shown in Fig. 5.5; it is similar to that of MTA except that we perform utterance clustering on
the source corpus (the external dataset) and produce latent reasoning labels. Specifically, we use
BERT to extract features from the data and then use k-means clustering to group them into clus-
ters. The extracted features are the pooled output (the embedding of the initial [CLS] token), so that
the instances within the same cluster are semantically similar. In order to make the clustering results
more related to the target classes, we use the BERT model $F$ after the first fine-tuning on our
in-house data. Next, we label the utterances by their cluster indices. These clusters are used to con-
struct simulated source tasks, and we hypothesize that they benefit meta-learning performance
by increasing task variability. Fig. 5.6 shows the procedure of producing the source tasks in UMTA.
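A minimal sketch of this clustering step is given below, assuming a BertModel-style encoder whose output exposes `pooler_output`; the hyper-parameters (batch size, maximum sequence length, number of clusters) are illustrative rather than the exact values used in the experiments.

```python
import torch
from sklearn.cluster import KMeans

def cluster_utterances(utterances, tokenizer, bert_model, n_clusters, batch_size=64):
    """Assign each unlabeled source utterance a latent label, i.e., its k-means cluster
    index, using the pooled [CLS] embedding of the (already fine-tuned) BERT encoder."""
    bert_model.eval()
    feats = []
    with torch.no_grad():
        for i in range(0, len(utterances), batch_size):
            batch = tokenizer(utterances[i:i + batch_size], padding=True,
                              truncation=True, max_length=50, return_tensors='pt')
            out = bert_model(**batch)
            feats.append(out.pooler_output)        # embedding of the initial [CLS] token
    feats = torch.cat(feats).cpu().numpy()
    kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(feats)
    return kmeans.labels_                           # latent labels used to build source tasks
```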
Figure 5.5: The flowchart of UMTA framework.
Figure 5.6: The procedure of producing the source tasks in UMTA.
We evaluate the UMTA algorithm on a task of clinical section classification. In clinical visits,
clinical note writing is a time-consuming and cost-prohibitive manual task for clinicians. Al-
though virtual medical scribes have been proposed to generate clinical notes (semi-)automatically,
the data sparsity issue is still a challenging problem in practice [39, 58, 62]. Identifying the topic of
clinical utterances in doctor-patient conversations is one of the key strategies for automation [92,
84, 98, 63]. The (in-house) target task data consist of speech audio recordings of dyadic clinician-
patient conversations and full clinical documents. Clinician-role participants are real nurses,
nurse practitioners, and medical doctors, while patient-role participants are mock patients.
After the visits, clinicians completed clinical notes according to the SOAP (Subjective, Objec-
tive, Assessment, and Plan) coding scheme [87]. The in-house speech audio data were manually
transcribed at the utterance level by a transcription service provider. Finally, a specialized labeling
team of in-house experts manually annotated the clinical note sections: seven sections
from the (sub)headings of clinical notes and “Other". An example of a labeled snippet is shown in
Appendix 6.2. Table 5.7 shows the statistics of the section label distribution. The sections “allergies", “family
history" and “past medical history" are merged in our experiments because the data size of these
sections was too small. The in-house data consist of 6,860 utterances in total. We partitioned
the data into train/dev/test sets by session in a ratio of 28/10/10, with no overlap of tele-
health visits. We also generated machine transcriptions of the in-house data using Amazon
Transcribe Medical.
The source dataset we used consists of simulated clinician-patient conversations purchased from exter-
nal medical data vendors. These data were collected from various specialties and scenarios, includ-
ing in-patient and out-patient, and telehealth and offline visits. In total, the dataset has 300,000 utterances
transcribed by outsourced workers.
Abbr. Topic description
Nb. of utterances
Train Dev Test
PS positive reported symptoms 1017 391 326
NS negative/denied symptoms 253 76 82
SH social history 137 53 46
Med    confirmed past medical history, confirmed allergies, confirmed family history    164 37 76
Plan what clinician asks patient to do 601 191 203
None None of above 1866 652 588
Total instances 4069 1407 1358
Table 5.7: The number of utterances for clinical note sections
Data Format BERT Fine-tuning Self-training UMTA
Manual 75.0± 0.4 75.5± 0.2 77.0± 0.1 ± 0.1 ± 0.1
ASR 74.0± 0.5 73.6± 0.4 76.2± 0.1 ± 0.1 ± 0.1
Table 5.8: The comparison between UMTA and other models.
We perform UMTA on clinical section classification with manual transcript data and machine
transcript data, i.e., the 1-best output of Automatic Speech Recognition (ASR). The details of the
framework and experimental setups are described in Appendix 6.2. Table 5.8 shows the per-
formances of UMTA and other models. The accuracy results we present are all
relative values compared to the baseline in the table. The numbers in the table show that the best
accuracy is achieved by UMTA for both manual transcripts and ASR transcripts. Moreover, we
compare UMTA with self-training [112], which is another popular way of leveraging external data.
Interestingly, self-training improves accuracy over two-phase fine-tuning on human transcripts,
while it does not on ASR transcripts. However, UMTA improves accuracy over two-phase fine-tuning
on both human transcripts and ASR transcripts, suggesting the robustness of UMTA to ASR
errors. This application demonstrates that our framework of meta-learning with task augmentation can
be extended to the unsupervised scheme, in which we can utilize the source datasets even if they
are unlabeled.
5.7 Conclusion and Future Work
This chapter leveraged publicly available datasets to build computational models for predicting
behavioral codes in psychotherapy conversations with limited samples. We employed a meta-
learning framework with task augmentation to address the data sparsity problem and improve
the performance over baseline methods. We also discussed the effect of a hyper-parameter and
the task sampling strategy in our framework. We generalized our algorithm to an unsupervised mode
and applied it to the clinical section classification of doctor-patient conversations. Finally, we showed
that our framework can benefit prompt-based tuning as well.
Chapter 6
Conclusion and Future Work
6.1 Summary
This dissertation addresses three common challenges related to data sparsity in automated behav-
ioral coding. To tackle these challenges, spoken language processing approaches are proposed.
The challenge of automated behavioral coding at the session level is addressed by implement-
ing a hierarchical transformer framework with a local quality estimator. This enhances the accu-
racy of the prediction of behavioral codes and provides an alternative method for long document
classification and regression tasks.
To perform automated behavioral coding in unlabeled domains, I incorporate label pro-
portion estimates into domain adversarial networks and re-weight the samples fed into the
domain adapter. This correction helps to alleviate the label distribution shift in domain adaptation
tasks, facilitating the training of computational models.
For the utterance level behavioral coding task with limited samples and labels, publicly avail-
able data is utilized. A meta-learning framework with task augmentation is employed to improve
knowledge transfer and performance. Furthermore, this proposed framework is adapted and gen-
eralized to different natural language processing (NLP) paradigms, extending its applicability to
the unsupervised scheme.
6.2 Future Work
The following list discusses the possible future directions.
Dialogue Segmentation for Session level Behavioral Coding
In automated behavioral coding at the session level, I developed a hierarchical transformer frame-
work that segments the conversation into blocks of equal length in terms of utterances. However, this ignores
the inner relationships between sentences. Applying dialogue segmentation strategies [38, 120]
can group utterances into topically coherent units and might improve the prediction accuracy of
session quality.
Multi-label Classification for Utterance Level Behavioral Codes
In many spoken dialogue applications, it is common for a single utterance to be associated with
multiple tags. Additionally, when employing an end-to-end automated behavioral coding pipeline
with audio recordings, it is more practical to segment the dialogue at the turn level instead of at
the utterance level, as the latter approach tends to result in higher errors. As a result, adapting
our frameworks to handle multi-label spoken language processing tasks in low-resource scenarios
becomes highly advantageous and valuable in real-world applications.
Incorporating Acoustic and Prosodic Features
Prior fully supervised learning tasks in automated behavioral coding have demonstrated the ben-
efits of including acoustic and prosodic features alongside lexical features to enhance behavior
prediction [102, 5, 28, 105, 26]. As a result, a promising avenue for future research lies in the
development of multimodal techniques tailored for low-resource behavioral coding.
Bibliography
[1] Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean,
Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. “Tensorflow: A
system for large-scale machine learning”. In: 12th USENIX Symposium on Operating
Systems Design and Implementation (OSDI 16). 2016, pp. 265–283.
[2] Alessandro Achille, Michael Lam, Rahul Tewari, Avinash Ravichandran, Subhransu Maji,
Charless C Fowlkes, Stefano Soatto, and Pietro Perona. “Task2vec: Task embedding for
meta-learning”. In: Proceedings of the IEEE/CVF International Conference on Computer
Vision. 2019, pp. 6430–6439.
[3] Alessandro Achille, Giovanni Paolini, Glen Mbeng, and Stefano Soatto. “The information
complexity of learning tasks, their structure and their distance”. In: Information and
Inference: A Journal of the IMA 10.1 (2021), pp. 51–72.
[4] Firoj Alam, Morena Danieli, and Giuseppe Riccardi. “Annotating and modeling empathy
in spoken conversations”. In: Computer Speech & Language 50 (2018), pp. 40–61.
[5] Victor Ardulov, Madelyn Mendlen, Manoj Kumar, Neha Anand, Shanna Williams,
Thomas Lyon, and Shrikanth Narayanan. “Multimodal interaction modeling of child
forensic interviewing”. In: Proceedings of the 20th ACM International Conference on
Multimodal Interaction. 2018, pp. 179–185.
[6] David C Atkins, Mark Steyvers, Zac E Imel, and Padhraic Smyth. “Scaling up the
evaluation of psychotherapy: evaluating motivational interviewing fidelity via statistical
text classification”. In: Implementation Science 9.1 (2014), pp. 1–11.
[7] Kamyar Azizzadenesheli, Anqi Liu, Fanny Yang, and Animashree Anandkumar.
“Regularized learning for domain adaptation under label shifts”. In: arXiv preprint
arXiv:1903.09734 (2019).
[8] John S Baer, Elizabeth A Wells, David B Rosengren, Bryan Hartzler, Blair Beadnell, and
Chris Dunn. “Agency context and tailored training in technology transfer: A pilot
evaluation of motivational interviewing training for community counselors”. In: Journal
of substance abuse treatment 37.2 (2009), pp. 191–202.
[9] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. “Neural Machine Translation
by Jointly Learning to Align and Translate”. In: Proceedings of the 3rd International
Conference on Learning Representations. 2015.
[10] Roger Bakeman. “Behavioral Observation and Coding”. In: Handbook of Research
Methods in Social and Personality Psychology (2000), p. 138.
[11] Bart Bakker and Tom Heskes. “Task Clustering and Gating for Bayesian Multitask
Learning”. In: Journal of Machine Learning Research 4 (2003), pp. 83–99.
[12] Judith S Beck. Cognitive behavior therapy: Basics and beyond. New York, NY, 2020.
[13] Iz Beltagy, Matthew E Peters, and Arman Cohan. “Longformer: The long-document
transformer”. In: arXiv preprint arXiv:2004.05150 (2020).
[14] Matthew P Black, Athanasios Katsamanis, Brian R Baucom, Chi-Chun Lee,
Adam C Lammert, Andrew Christensen, Panayiotis G Georgiou, and
Shrikanth S Narayanan. “Toward automating a human behavioral coding system for
married couples’ interactions using speech acoustic features”. In: Speech communication
55.1 (2013), pp. 1–21.
[15] John Blitzer, Mark Dredze, and Fernando Pereira. “Biographies, bollywood, boom-boxes
and blenders: Domain adaptation for sentiment classification”. In: Proceedings of the 45th
annual meeting of the association of computational linguistics. 2007, pp. 440–447.
[16] John Blitzer, Ryan McDonald, and Fernando Pereira. “Domain adaptation with structural
correspondence learning”. In: Proceedings of the 2006 conference on empirical methods in
natural language processing. Association for Computational Linguistics. 2006,
pp. 120–128.
[17] Danushka Bollegala, Tingting Mu, and John Yannis Goulermas. “Cross-domain
sentiment classification using sentiment sensitive embeddings”. In: IEEE Transactions on
Knowledge and Data Engineering 28.2 (2015), pp. 398–410.
[18] Daniel Bone, Somer L Bishop, Matthew P Black, Matthew S Goodwin, Catherine Lord,
and Shrikanth S Narayanan. “Use of machine learning to improve autism screening and
diagnostic instruments: effectiveness, efficiency, and multi-instrument fusion”. In:
Journal of Child Psychology and Psychiatry 57.8 (2016), pp. 927–937.
[19] Doğan Can, David C Atkins, and Shrikanth S Narayanan. “A dialog act tagging approach
to behavioral coding: A case study of addiction counseling conversations”. In: Sixteenth
Annual Conference of the International Speech Communication Association. 2015.
[20] Yee Seng Chan and Hwee Tou Ng. “Estimating class priors in domain adaptation for
word sense disambiguation”. In: Proceedings of the 21st International Conference on
Computational Linguistics and the 44th annual meeting of the Association for
Computational Linguistics. Association for Computational Linguistics. 2006, pp. 89–96.
[21] Xilun Chen and Claire Cardie. “Multinomial adversarial networks for multi-domain text
classification”. In: arXiv preprint arXiv:1802.05694 (2018).
[22] Xilun Chen, Yu Sun, Ben Athiwaratkun, Claire Cardie, and Kilian Weinberger.
“Adversarial deep averaging networks for cross-lingual sentiment classification”. In:
Transactions of the Association for Computational Linguistics 6 (2018), pp. 557–570.
[23] Zhuohao Chen, Nikolaos Flemotomos, Victor Ardulov, Torrey A Creed, Zac E Imel,
David C Atkins, and Shrikanth Narayanan. “Feature Fusion Strategies for End-to-End
Evaluation of Cognitive Behavior Therapy Sessions”. In: International Conference of the
IEEE Engineering in Medicine and Biology Society (EMBC) (2021).
[24] Zhuohao Chen, Nikolaos Flemotomos, Zac E Imel, David C Atkins, and
Shrikanth Narayanan. “Leveraging Open Data and Task Augmentation to Automated
Behavioral Coding of Psychotherapy Conversations in Low-Resource Scenarios”. In:
arXiv preprint arXiv:2210.14254 (2022).
[25] Zhuohao Chen, Nikolaos Flemotomos, Karan Singla, Torrey A Creed, David C Atkins,
and Shrikanth Narayanan. “An automated quality evaluation framework of
psychotherapy conversations with local quality estimates”. In: Computer Speech &
Language (2022), p. 101380.
[26] Zhuohao Chen, James Gibson, Ming-Chang Chiu, Qiaohong Hu, Tara K Knight,
Daniella Meeker, James A Tulsky, Kathryn I Pollak, and Shrikanth Narayanan.
“Automated Empathy Detection for Oncology Encounters”. In: IEEE International
Conference on Healthcare Informatics (ICHI) (2020).
[27] Zhuohao Chen, Singla Karan, David C Atkins, Zac E Imel, and Shrikanth Narayanan. “A
label proportions estimation technique for adversarial domain adaptation in text
classification”. In: arXiv preprint arXiv:2003.07444 (2020).
[28] Zhuohao Chen, Karan Singla, James Gibson, Dogan Can, Zac E Imel, David C Atkins,
Panayiotis Georgiou, and Shrikanth Narayanan. “Improving the prediction of therapist
behaviors in addiction counseling by exploiting class confusions”. In: ICASSP 2019-2019
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE.
2019, pp. 6605–6609.
[29] Torrey A Creed, Sarah A Frankel, Ramaris E German, Kelly L Green, Shari Jager-Hyman,
Kristin P Taylor, Abby D Adler, Courtney B Wolk, Shannon W Stirman,
Scott H Waltman, et al. “Implementation of transdiagnostic cognitive therapy in
community behavioral health: The Beck Community Initiative.” In: Journal of consulting
and clinical psychology 84.12 (2016), p. 1116.
[30] Andrew M Dai, Christopher Olah, and Quoc V Le. “Document embedding with
paragraph vectors”. In: arXiv preprint arXiv:1507.07998 (2015).
[31] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and
Ruslan Salakhutdinov. “Transformer-XL: Attentive Language Models beyond a
Fixed-Length Context”. In: Proceedings of the 57th Annual Meeting of the Association for
Computational Linguistics. July 2019, pp. 2978–2988.
[32] Sanjiv R Das and Mike Y Chen. “Yahoo! for Amazon: Sentiment extraction from small
talk on the web”. In: Management science 53.9 (2007), pp. 1375–1388.
[33] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “BERT:
Pre-training of Deep Bidirectional Transformers for Language Understanding”. In:
Proceedings of the 2019 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short
Papers). June 2019, pp. 4171–4186.
[34] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “BERT:
Pre-training of Deep Bidirectional Transformers for Language Understanding”. In:
Proceedings of the 2019 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short
Papers). Minneapolis, Minnesota: Association for Computational Linguistics, June 2019,
pp. 4171–4186. doi: 10.18653/v1/N19-1423.
[35] Zi-Yi Dou, Keyi Yu, and Antonios Anastasopoulos. “Investigating Meta-Learning
Algorithms for Low-Resource Natural Language Understanding Tasks”. In: Proceedings
of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th
International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019,
pp. 1192–1197.
[36] Zi-Yi Dou, Keyi Yu, and Antonios Anastasopoulos. “Investigating Meta-Learning
Algorithms for Low-Resource Natural Language Understanding Tasks”. In: Proceedings
of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th
International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong
Kong, China: Association for Computational Linguistics, Nov. 2019, pp. 1192–1197. doi:
10.18653/v1/D19-1112.
[37] Christopher G Fairburn and Zafra Cooper. “Therapist competence, therapy quality, and
therapist training”. In: Behaviour research and therapy 49.6-7 (2011), pp. 373–378.
[38] Song Feng, Hui Wan, Chulaka Gunasekara, Siva Patel, Sachindra Joshi, and Luis Lastras.
“doc2dial: A Goal-Oriented Document-Grounded Dialogue Dataset”. In: Proceedings of
the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Online: Association for Computational Linguistics, Nov. 2020, pp. 8118–8128. doi:
10.18653/v1/2020.emnlp-main.652.
[39] Gregory Finley, Erik Edwards, Amanda Robinson, Michael Brenndoerfer,
Najmeh Sadoughi, James Fone, Nico Axtmann, Mark Miller, and
David Suendermann-Oeft. “An automated medical scribe for documenting clinical
encounters”. In: Proceedings of the 2018 Conference of the North American Chapter of the
Association for Computational Linguistics: Demonstrations. New Orleans, Louisiana:
Association for Computational Linguistics, June 2018, pp. 11–15. doi:
10.18653/v1/N18-5003.
[40] Chelsea Finn, Pieter Abbeel, and Sergey Levine. “Model-agnostic meta-learning for fast
adaptation of deep networks”. In: International Conference on Machine Learning. PMLR.
2017, pp. 1126–1135.
[41] Nikolaos Flemotomos, Victor R Martinez, Zhuohao Chen, Torrey A Creed,
David C Atkins, and Shrikanth Narayanan. “Automated Quality Assessment of
Cognitive Behavioral Therapy Sessions Through Highly Contextualized Language
Representations”. In: PLOS ONE (2021).
[42] Nikolaos Flemotomos, Victor R Martinez, Zhuohao Chen, Karan Singla, Victor Ardulov,
Raghuveer Peri, Derek D Caperton, James Gibson, Michael J Tanana,
Panayiotis Georgiou, et al. “Automated Evaluation Of Psychotherapy Skills Using
Speech And Language Technologies”. In: Behavior Research Methods (2021).
[43] Nikolaos Flemotomos, Victor R Martinez, James Gibson, David C Atkins, Torrey Creed,
and Shrikanth S Narayanan. “Language Features for Automated Evaluation of Cognitive
Behavior Psychotherapy Sessions.” In: INTERSPEECH. 2018, pp. 1908–1912.
[44] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle,
François Laviolette, Mario Marchand, and Victor Lempitsky. “Domain-adversarial
training of neural networks”. In: The Journal of Machine Learning Research 17.1 (2016),
pp. 2096–2030.
[45] Tianyu Gao, Adam Fisch, and Danqi Chen. “Making Pre-trained Language Models Better
Few-shot Learners”. In: Proceedings of the 59th Annual Meeting of the Association for
Computational Linguistics and the 11th International Joint Conference on Natural
Language Processing (Volume 1: Long Papers). Online: Association for Computational
Linguistics, Aug. 2021, pp. 3816–3830. doi: 10.18653/v1/2021.acl-long.295.
[46] James Gibson, David C Atkins, Torrey A Creed, Zac Imel, Panayiotis Georgiou, and
Shrikanth Narayanan. “Multi-label multi-task deep learning for behavioral coding”. In:
IEEE Transactions on Affective Computing 13.1 (2019), pp. 508–518.
[47] James Gibson, Dogan Can, Panayiotis G Georgiou, David C Atkins, and
Shrikanth S Narayanan. “Attention Networks for Modeling Behaviors in Addiction
Counseling.” In: INTERSPEECH. 2017, pp. 3251–3255.
[48] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley,
Sherjil Ozair, Aaron Courville, and Yoshua Bengio. “Generative adversarial nets”. In:
Advances in neural information processing systems. 2014, pp. 2672–2680.
[49] Jiatao Gu, Yong Wang, Yun Chen, Victor O. K. Li, and Kyunghyun Cho. “Meta-Learning
for Low-Resource Neural Machine Translation”. In: Proceedings of the 2018 Conference on
Empirical Methods in Natural Language Processing. Brussels, Belgium: Association for
Computational Linguistics, Oct. 2018, pp. 3622–3631. doi: 10.18653/v1/D18-1398.
[50] Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy,
Doug Downey, and Noah A. Smith. “Don’t Stop Pretraining: Adapt Language Models to
Domains and Tasks”. In: Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics. July 2020, pp. 8342–8360.
[51] Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy,
Doug Downey, and Noah A. Smith. “Don’t Stop Pretraining: Adapt Language Models to
Domains and Tasks”. In: Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics. Online: Association for Computational Linguistics, July 2020,
pp. 8342–8360. doi: 10.18653/v1/2020.acl-main.740.
[52] Sepp Hochreiter and Jürgen Schmidhuber. “Long short-term memory”. In: Neural
computation 9.8 (1997), pp. 1735–1780.
[53] JM Houck, TB Moyers, WR Miller, LH Glynn, and KA Hallgren. “Motivational
interviewing skill code (MISC) version 2.5”. In: (Available from
http://casaa.unm.edu/download/misc25.pdf) (2010).
[54] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone,
Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly.
“Parameter-efficient transfer learning for NLP”. In: International Conference on Machine
Learning. PMLR. 2019, pp. 2790–2799.
[55] Jeremy Howard and Sebastian Ruder. “Universal Language Model Fine-tuning for Text
Classification”. In: Proceedings of the 56th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers). July 2018, pp. 328–339.
[56] Jeremy Howard and Sebastian Ruder. “Universal Language Model Fine-tuning for Text
Classification”. In: Proceedings of the 56th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia: Association for
Computational Linguistics, July 2018, pp. 328–339. doi: 10.18653/v1/P18-1031.
[57] Zac E Imel, Mark Steyvers, and David C Atkins. “Computational psychotherapy
research: Scaling up the evaluation of patient–provider interactions.” In: Psychotherapy
52.1 (2015), p. 19.
[58] Serena Jeblee, Faiza Khan Khattak, Noah Crampton, Muhammad Mamdani, and
Frank Rudzicz. “Extracting relevant information from physician-patient dialogues for
automated clinical note taking”. In: Proceedings of the Tenth International Workshop on
Health Text Mining and Information Analysis (LOUHI 2019). Hong Kong: Association for
Computational Linguistics, Nov. 2019, pp. 65–74. doi: 10.18653/v1/D19-6209.
[59] Sharu Theresa Jose and Osvaldo Simeone. “An information-theoretic analysis of the
impact of task similarity on meta-learning”. In: 2021 IEEE International Symposium on
Information Theory (ISIT). IEEE. 2021, pp. 1534–1539.
[60] Diederik P Kingma and Jimmy Ba. “Adam: A Method for Stochastic Optimization”. In:
Proceedings of the 3rd International Conference on Learning Representations.
2015.
[61] Jodi Kodish-Wachs, Emin Agassi, Patrick Kenny III, and J Marc Overhage. “A systematic
comparison of contemporary automatic speech recognition engines for conversational
clinical speech”. In: AMIA Annual Symposium Proceedings. Vol. 2018. American Medical
Informatics Association. 2018, p. 683.
[62] Kundan Krishna, Sopan Khosla, Jeffrey Bigham, and Zachary C. Lipton. “Generating
SOAP Notes from Doctor-Patient Conversations Using Modular Summarization
Techniques”. In: Proceedings of the 59th Annual Meeting of the Association for
Computational Linguistics and the 11th International Joint Conference on Natural
Language Processing (Volume 1: Long Papers). Online: Association for Computational
Linguistics, Aug. 2021, pp. 4958–4972. doi: 10.18653/v1/2021.acl-long.384.
[63] Kundan Krishna, Amy Pavel, Benjamin Schloss, Jeffrey P Bigham, and Zachary C Lipton.
“Extracting structured data from physician-patient conversations by predicting
noteworthy utterances”. In: Explainable AI in Healthcare and Medicine. Springer, 2021,
pp. 155–169.
[64] Lukas Lange, Jannik Strötgen, Heike Adel, and Dietrich Klakow. “To Share or not to
Share: Predicting Sets of Sources for Model Transfer Learning”. In: Proceedings of the
2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta
Cana, Dominican Republic: Association for Computational Linguistics, Nov. 2021,
pp. 8744–8753. doi: 10.18653/v1/2021.emnlp-main.689.
[65] Brian Lester, Rami Al-Rfou, and Noah Constant. “The Power of Scale for
Parameter-Efficient Prompt Tuning”. In: Proceedings of the 2021 Conference on Empirical
Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic:
Association for Computational Linguistics, Nov. 2021, pp. 3045–3059. doi:
10.18653/v1/2021.emnlp-main.243.
[66] Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. “DailyDialog: A
Manually Labelled Multi-turn Dialogue Dataset”. In: Proceedings of the Eighth
International Joint Conference on Natural Language Processing (Volume 1: Long Papers).
Taipei, Taiwan: Asian Federation of Natural Language Processing, Nov. 2017,
pp. 986–995. url: https://aclanthology.org/I17-1099.
[67] Yitong Li, Michael Murias, Samantha Major, Geraldine Dawson, and David E Carlson.
“On Target Shift in Adversarial Domain Adaptation”. In: arXiv preprint arXiv:1903.06336
(2019).
[68] Ariel Linden, Susan W Butterworth, and James O Prochaska. “Motivational
interviewing-based health coaching as a chronic care intervention”. In: Journal of
evaluation in clinical practice 16.1 (2010), pp. 166–174.
[69] Zachary C Lipton, Yu-Xiang Wang, and Alex Smola. “Detecting and correcting for label
shift with black box predictors”. In: arXiv preprint arXiv:1802.03916 (2018).
[70] Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith.
“Linguistic Knowledge and Transferability of Contextual Representations”. In:
Proceedings of the 2019 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short
Papers). Minneapolis, Minnesota: Association for Computational Linguistics, June 2019,
pp. 1073–1094. doi: 10.18653/v1/N19-1112.
[71] Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. “Adversarial multi-task learning for text
classification”. In: arXiv preprint arXiv:1704.05742 (2017).
[72] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and
Graham Neubig. “Pre-train, prompt, and predict: A systematic survey of prompting
methods in natural language processing”. In: ACM Computing Surveys 55.9 (2023),
pp. 1–35.
[73] Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and
Jie Tang. “GPT understands, too”. In: arXiv preprint arXiv:2103.10385 (2021).
[74] Edward Loper and Steven Bird. “NLTK: the natural language toolkit”. In: arXiv preprint
cs/0205028 (2002).
[75] Ilya Loshchilov and Frank Hutter. “Decoupled Weight Decay Regularization”. In:
International Conference on Learning Representations. 2018.
[76] Ilya Loshchilov and Frank Hutter. “Decoupled weight decay regularization”. In:
Proceedings of the 7th International Conference on Learning Representations. 2019.
[77] William R Miller and Stephen Rollnick. Motivational interviewing: Helping people change.
Guilford press, 2012.
[78] Mohamed Morchid, Richard Dufour, and Georges Linarès. “Impact of word error rate on
theme identification task of highly imperfect human–human conversations”. In:
Computer Speech & Language 38 (2016), pp. 68–85.
[79] Shrikanth Narayanan and Panayiotis G Georgiou. “Behavioral signal processing:
Deriving human behavioral informatics from speech and language”. In: Proceedings of
the IEEE 101.5 (2013), pp. 1203–1233.
[80] Cuong Nguyen, Tal Hassner, Matthias Seeger, and Cedric Archambeau. “Leep: A new
measure to evaluate transferability of learned representations”. In: International
Conference on Machine Learning. PMLR. 2020, pp. 7294–7305.
[81] Tuan Duong Nguyen, Marthinus Christoffel, and Masashi Sugiyama. “Continuous target
shift adaptation in supervised learning”. In: Asian Conference on Machine Learning. 2016,
pp. 285–300.
[82] Sinno Jialin Pan, Xiaochuan Ni, Jian-Tao Sun, Qiang Yang, and Zheng Chen.
“Cross-domain sentiment classification via spectral feature alignment”. In: Proceedings of
the 19th international conference on World wide web. ACM. 2010, pp. 751–760.
[83] Raghavendra Pappagari, Piotr Zelasko, Jesús Villalba, Yishay Carmiel, and Najim Dehak.
“Hierarchical Transformers for Long Document Classification”. In: 2019 IEEE Automatic
Speech Recognition and Understanding Workshop (ASRU). IEEE. 2019, pp. 838–844.
[84] Jihyun Park, Dimitrios Kotzias, Patty Kuo, Robert L Logan Iv, Kritzia Merced,
Sameer Singh, Michael Tanana, Efi Karra Taniskidou, Jennifer Elston Lafata,
David C Atkins, et al. “Detecting conversation topics in primary care office visits from
transcripts of patient-provider interactions”. In: Journal of the American Medical
Informatics Association 26.12 (2019), pp. 1493–1504.
[85] Jeffrey Pennington, Richard Socher, and Christopher D Manning. “Glove: Global vectors
for word representation”. In: Proceedings of the 2014 conference on empirical methods in
natural language processing (EMNLP). 2014, pp. 1532–1543.
[86] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark,
Kenton Lee, and Luke Zettlemoyer. “Deep Contextualized Word Representations”. In:
Proceedings of the 2018 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). New
Orleans, Louisiana: Association for Computational Linguistics, June 2018,
pp. 2227–2237. doi: 10.18653/v1/N18-1202.
[87] Vivek Podder, Valerie Lew, and Sassan Ghassemzadeh. “SOAP notes”. In: StatPearls
[Internet]. StatPearls Publishing, 2021.
[88] Clifton Poth, Jonas Pfeiffer, Andreas Rücklé, and Iryna Gurevych. “What to Pre-Train
on? Efficient Intermediate Task Selection”. In: Proceedings of the 2021 Conference on
Empirical Methods in Natural Language Processing. 2021, pp. 10585–10605.
[89] Yada Pruksachatkun, Jason Phang, Haokun Liu, Phu Mon Htut, Xiaoyi Zhang,
Richard Yuanzhe Pang, Clara Vania, Katharina Kann, and Samuel R. Bowman.
“Intermediate-Task Transfer Learning with Pretrained Language Models: When and
Why Does It Work?” In: Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics. Online: Association for Computational Linguistics, July 2020,
pp. 5231–5247. doi: 10.18653/v1/2020.acl-main.467.
[90] Kun Qian and Zhou Yu. “Domain Adaptive Dialog Generation via Meta Learning”. In:
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
Florence, Italy: Association for Computational Linguistics, July 2019, pp. 2639–2649. doi:
10.18653/v1/P19-1253.
[91] Jack W Rae, Anna Potapenko, Siddhant M Jayakumar, and Timothy P Lillicrap.
“Compressive Transformers for Long-Range Sequence Modelling”. In: Proceedings of the
8th International Conference on Learning Representations. 2020.
[92] Alvin Rajkomar, Anjuli Kannan, Kai Chen, Laura Vardoulakis, Katherine Chou,
Claire Cui, and Jeffrey Dean. “Automatically charting symptoms from patient-physician
conversations using machine learning”. In: JAMA internal medicine 179.6 (2019),
pp. 836–838.
[93] Wendy M Reinke, Keith C Herman, and Randy Sprick. Motivational interviewing for
effective classroom management: The classroom check-up. Guilford press, 2011.
[94] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. “Learning
representations by back-propagating errors”. In: nature 323.6088 (1986), pp. 533–536.
[95] Marco Saerens, Patrice Latinne, and Christine Decaestecker. “Adjusting the outputs of a
classifier to new a priori probabilities: a simple procedure”. In: Neural computation 14.1
(2002), pp. 21–41.
[96] Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić,
Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon,
Matthias Gallé, et al. “Bloom: A 176b-parameter open-access multilingual language
model”. In: arXiv preprint arXiv:2211.05100 (2022).
[97] Timo Schick and Hinrich Schütze. “It’s Not Just Size That Matters: Small Language
Models Are Also Few-Shot Learners”. In: Proceedings of the 2021 Conference of the North
American Chapter of the Association for Computational Linguistics: Human Language
Technologies. Online: Association for Computational Linguistics, June 2021,
pp. 2339–2352. doi: 10.18653/v1/2021.naacl-main.185.
[98] Benjamin Schloss and Sandeep Konam. “Towards an automated SOAP note: classifying
utterances from medical conversations”. In: Machine Learning for Healthcare Conference.
PMLR. 2020, pp. 610–631.
[99] Bernhard Schölkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and
Joris Mooij. “On causal and anticausal learning”. In: arXiv preprint arXiv:1206.6471 (2012).
[100] Brian F Shaw, Irene Elkin, Jane Yamaguchi, Marion Olmsted, T Michael Vallis,
Keith S Dobson, Alice Lowery, Stuart M Sotsky, John T Watkins, and Stanley D Imber.
“Therapist competence ratings in relation to clinical outcome in cognitive therapy of
depression.” In: Journal of Consulting and Clinical Psychology 67.6 (1999), p. 837.
[101] Elizabeth Shriberg, Andreas Stolcke, Daniel Jurafsky, Noah Coccaro, Marie Meteer,
Rebecca Bates, Paul Taylor, Klaus Ries, Rachel Martin, and Carol Van Ess-Dykema. “Can
prosody aid the automatic classification of dialog acts in conversational speech?” In:
Language and speech 41.3-4 (1998), pp. 443–492.
[102] Karan Singla, Zhuohao Chen, Nikolaos Flemotomos, James Gibson, Dogan Can,
David C Atkins, and Shrikanth S Narayanan. “Using Prosodic and Lexical Information
for Learning Utterance-level Behaviors in Psychotherapy.” In: INTERSPEECH. 2018,
pp. 3413–3417.
[103] Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, and Armand Joulin. “Adaptive
attention span in transformers”. In: arXiv preprint arXiv:1905.07799 (2019).
[104] Remi Tachet des Combes, Han Zhao, Yu-Xiang Wang, and Geoffrey J Gordon. “Domain
adaptation with conditional distribution matching and generalized label shift”. In:
Advances in Neural Information Processing Systems 33 (2020), pp. 19276–19289.
[105] Leili Tavabi, Kalin Stefanov, Larry Zhang, Brian Borsari, Joshua D Woolley,
Stefan Scherer, and Mohammad Soleymani. “Multimodal Automatic Coding of Client
Behavior in Motivational Interviewing”. In: Proceedings of the 2020 International
Conference on Multimodal Interaction. 2020, pp. 406–413.
[106] Sebastian Thrun and Joseph O’Sullivan. “Discovering structure in multiple learning
tasks: The TC algorithm”. In: ICML. Vol. 96. 1996, pp. 489–497.
[107] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux,
Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al.
“Llama: Open and efficient foundation language models”. In: arXiv preprint
arXiv:2302.13971 (2023).
[108] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi,
Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale,
et al. “Llama 2: Open Foundation and Fine-Tuned Chat Models”. In: arXiv preprint
arXiv:2307.09288 (2023).
[109] Anh T Tran, Cuong V Nguyen, and Tal Hassner. “Transferability and hardness of
supervised classification tasks”. In: Proceedings of the IEEE/CVF International Conference
on Computer Vision. 2019, pp. 1395–1405.
[110] Shao-Yen Tseng, Brian R Baucom, and Panayiotis G Georgiou. “Approaching Human
Performance in Behavior Estimation in Couples Therapy Using Deep Sentence
Embeddings.” In: INTERSPEECH. 2017, pp. 3291–3295.
[111] Arun Venkitaraman, Anders Hansson, and Bo Wahlberg. “Task-similarity aware
meta-learning through nonparametric kernel regression”. In: arXiv preprint
arXiv:2006.07212 (2020).
[112] Tu Vu, Minh-Thang Luong, Quoc Le, Grady Simon, and Mohit Iyyer. “STraTA:
Self-Training with Task Augmentation for Better Few-shot Learning”. In: Proceedings of
the 2021 Conference on Empirical Methods in Natural Language Processing. Online and
Punta Cana, Dominican Republic: Association for Computational Linguistics, Nov. 2021,
pp. 5715–5731. doi: 10.18653/v1/2021.emnlp-main.462.
[113] Tu Vu, Tong Wang, Tsendsuren Munkhdalai, Alessandro Sordoni, Adam Trischler,
Andrew Mattarella-Micke, Subhransu Maji, and Mohit Iyyer. “Exploring and Predicting
Transferability across NLP Tasks”. In: Proceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing (EMNLP). Online: Association for
Computational Linguistics, Nov. 2020, pp. 7882–7926. doi:
10.18653/v1/2020.emnlp-main.635.
[114] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and
Samuel Bowman. “GLUE: A Multi-Task Benchmark and Analysis Platform for Natural
Language Understanding”. In: Proceedings of the 2018 EMNLP Workshop BlackboxNLP:
Analyzing and Interpreting Neural Networks for NLP. 2018, pp. 353–355.
[115] Xuewei Wang, Weiyan Shi, Richard Kim, Yoojung Oh, Sijia Yang, Jingwen Zhang, and
Zhou Yu. “Persuasion for Good: Towards a Personalized Persuasive Dialogue System for
Social Good”. In: Proceedings of the 57th Annual Meeting of the Association for
Computational Linguistics. Florence, Italy: Association for Computational Linguistics,
July 2019, pp. 5635–5649. doi: 10.18653/v1/P19-1566.
[116] Nick Webb, Mark Hepple, and Yorick Wilks. “Dialogue act classification based on
intra-utterance features”. In: Proceedings of the AAAI Workshop on Spoken Language
Understanding. Vol. 4. Citeseer. 2005, p. 5.
[117] Zixiu Wu, Simone Balloccu, Vivek Kumar, Rim Helaoui, Diego Reforgiato Recupero, and
Daniele Riboni. “Creation, Analysis and Evaluation of AnnoMI, a Dataset of
Expert-Annotated Counselling Dialogues”. In: Future Internet 15.3 (2023), p. 110.
[118] Zixiu Wu, Rim Helaoui, Diego Reforgiato Recupero, and Daniele Riboni. “Towards
Low-Resource Real-Time Assessment of Empathy in Counselling”. In: Proceedings of the
Seventh Workshop on Computational Linguistics and Clinical Psychology: Improving
Access. Online: Association for Computational Linguistics, June 2021, pp. 204–216. doi:
10.18653/v1/2021.clpsych-1.22.
[119] Bo Xiao, Dogan Can, James Gibson, Zac E Imel, David C Atkins, Panayiotis G Georgiou,
and Shrikanth S Narayanan. “Behavioral Coding of Therapist Language in Addiction
Counseling Using Recurrent Neural Networks.” In: Interspeech. 2016, pp. 908–912.
[120] Linzi Xing and Giuseppe Carenini. “Improving Unsupervised Dialogue Topic
Segmentation with Utterance-Pair Coherence Scoring”. In: Proceedings of the 22nd
Annual Meeting of the Special Interest Group on Discourse and Dialogue. 2021, pp. 167–177.
[121] Ya Xue, Xuejun Liao, Lawrence Carin, and Balaji Krishnapuram. “Multi-task learning for
classification with dirichlet process priors.” In: Journal of Machine Learning Research 8.1
(2007).
[122] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy.
“Hierarchical attention networks for document classification”. In: Proceedings of the 2016
conference of the North American chapter of the association for computational linguistics:
human language technologies. 2016, pp. 1480–1489.
[123] Huaxiu Yao, Ying Wei, Junzhou Huang, and Zhenhui Li. “Hierarchically structured
meta-learning”. In: International Conference on Machine Learning. PMLR. 2019,
pp. 7045–7054.
[124] YelpData. Yelp Open Dataset [online]. https://www.yelp.com/dataset. 2019.
[125] Jeffrey Young and Aaron T Beck. Cognitive therapy scale: Rating manual. Vol. 36. Bala
Cynwyd, PA: Beck Institute for Cognitive Behavior Therapy, 1980.
[126] Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti,
Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. “Big Bird:
Transformers for Longer Sequences.” In: NeurIPS. 2020.
[127] Amir R Zamir, Alexander Sax, William Shen, Leonidas J Guibas, Jitendra Malik, and
Silvio Savarese. “Taskonomy: Disentangling task transfer learning”. In: Proceedings of
the IEEE conference on computer vision and pattern recognition. 2018, pp. 3712–3722.
[128] Kun Zhang, Bernhard Schölkopf, Krikamol Muandet, and Zhikun Wang. “Domain
adaptation under target and conditional shift”. In: International Conference on Machine
Learning. 2013, pp. 819–827.
[129] Yu Zhang and Dit-Yan Yeung. “A convex formulation for learning task relationships in
multi-task learning”. In: Proceedings of the Twenty-Sixth Conference on Uncertainty in
Artificial Intelligence. 2010, pp. 733–742.
[130] Han Zhao, Remi Tachet des Combes, Kun Zhang, and Geoffrey J Gordon. “On learning
invariant representation for domain adaptation”. In: arXiv preprint arXiv:1901.09453
(2019).
[131] Han Zhao, Shanghang Zhang, Guanhang Wu, Geoffrey J Gordon, et al. “Multiple source
domain adaptation with adversarial learning”. In: (2018).
[132] Yang Zheng, Yongkang Liu, and John HL Hansen. “Navigation-orientated natural
spoken language understanding for intelligent vehicle dialogue”. In: 2017 IEEE Intelligent
Vehicles Symposium (IV). IEEE. 2017, pp. 559–564.
[133] Pan Zhou, Yingtian Zou, Xiao-Tong Yuan, Jiashi Feng, Caiming Xiong, and Steven Hoi.
“Task similarity aware meta learning: Theory-inspired improvement on maml”. In:
Uncertainty in Artificial Intelligence. PMLR. 2021, pp. 23–33.
Appendix
A. Proof of Theorem 1 (See page 40)
Under the assumption that p(x|y) = q(x|y) and P_i^j = Q_i^j, Equation (4.2) can be modified as

\[
J_\gamma = \sum_{j=1}^{L}\left(\sum_{i=1}^{L}\gamma_i P_i^j - q(\hat{y}=j)\right)^2
= \sum_{j=1}^{L}\left(\sum_{i=1}^{L}\gamma_i P_i^j - \beta_i Q_i^j\right)^2
= \sum_{j=1}^{L}\left(\sum_{i=1}^{L}\gamma_i P_i^j - \beta_i P_i^j\right)^2
= \sum_{j=1}^{L}\left(\sum_{i=1}^{L}(\gamma_i - \beta_i)P_i^j\right)^2. \tag{6.1}
\]

Obviously, J_γ ≥ 0 and the equality is satisfied when γ = β. Let q̄ = [q(ŷ = 1), q(ŷ = 2), ..., q(ŷ = L)]^T; then Equation (6.1) can also be expressed as

\[
J_\gamma = \lVert \bar{P}\cdot\gamma - \bar{q} \rVert^2. \tag{6.2}
\]

The Hessian matrix of J_γ is 2\bar{P}^T\bar{P}, which is a positive semidefinite matrix. Thus we conclude that J_γ is convex.
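Because J_γ is convex, the label proportions γ can be estimated with any off-the-shelf constrained least-squares routine. The following is a minimal numerical sketch, not taken from the dissertation, using NumPy and SciPy; the confusion matrix P̄ and the predicted target distribution q̄ below are made-up values for illustration.

import numpy as np
from scipy.optimize import minimize

# Hypothetical inputs for L = 3 classes:
# P_bar[j, i] ~ p(y_hat = j | y = i), estimated on labeled source data
P_bar = np.array([[0.8, 0.1, 0.2],
                  [0.1, 0.7, 0.1],
                  [0.1, 0.2, 0.7]])
# q_bar[j] ~ q(y_hat = j), classifier predictions on unlabeled target data
q_bar = np.array([0.5, 0.3, 0.2])

L = len(q_bar)

def objective(gamma):
    # J_gamma = || P_bar @ gamma - q_bar ||^2  (Equation 6.2)
    residual = P_bar @ gamma - q_bar
    return residual @ residual

# gamma must be a valid probability vector: gamma_i >= 0 and sum_i gamma_i = 1
constraints = ({"type": "eq", "fun": lambda g: g.sum() - 1.0},)
bounds = [(0.0, 1.0)] * L
init = np.full(L, 1.0 / L)

result = minimize(objective, init, bounds=bounds, constraints=constraints, method="SLSQP")
gamma_hat = result.x  # estimated target label proportions
print(gamma_hat)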
B. The SwDA Dataset
This section presents the dialogue act distributions of the SwDA dataset. The statistics for
the 42-tag scheme and the simpler 7-tag scheme are presented in Tables 6.1 and 6.2, respectively.
Dialogue Act: Utterances (count) | Dialogue Act: Utterances (count)
statement-non-opinion: 74k | collaborative completion: 0.7k
backchannel: 38k | repeat-phrase: 0.7k
statement-opinion: 26k | open question: 0.6k
abandoned/uninterpretable: 15k | rhetorical questions: 0.6k
agree/accept: 11k | hold-before-answer/agreement: 0.5k
appreciation: 4.7k | reject: 0.3k
yes-no-question: 4.7k | negative non-no answers: 0.3k
non-verbal: 3.6k | signal-non-understanding: 0.3k
yes answers: 3k | other answer: 0.3k
Conventional-closing: 2.6k | conventional-opening: 0.2k
wh-question: 1.9k | or-clause: 0.2k
no answers: 1.4k | dispreferred answers: 0.2k
response acknowledgement: 1.3k | 3rd-party-talk: 0.1k
hedge: 1.2k | offers, options commits: 0.1k
declarative yes-no-question: 1.2k | self-talk: 0.1k
backchannel in question form: 1k | downplayer: 0.1k
quotation: 0.9k | maybe/accept-part: 0.1k
summarize/reformulate: 0.9k | tag-question: 0.1k
other: 0.9k | declarative wh-question: 0.1k
affirmative non-yes answers: 0.8k | apology: 0.1k
action-directive: 0.7k | thinking: 0.1k
Table 6.1: Statistics describing the SwDA dataset for the 42-tag scheme.
Dialogue Act Utterances (count)
statement 100k
backchannel 38k
question 8.6k
agreement 11k
appreciation 4.7k
incomplete 15k
other 23k
Table 6.2: Statistics describing the SwDA dataset for the 7-tag scheme.
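The 7-tag scheme is a coarser grouping of the 42-tag scheme. As a rough illustration, not taken from the dissertation and using a made-up partial mapping, the coarse counts can be obtained by summing the fine-grained counts under a tag-to-group mapping.

from collections import Counter

# Hypothetical mapping from a few fine-grained SwDA tags to coarse groups;
# the full 42-tag mapping is omitted here and any unlisted tag falls into "other".
COARSE_MAP = {
    "statement-non-opinion": "statement",
    "statement-opinion": "statement",
    "backchannel": "backchannel",
    "agree/accept": "agreement",
    "appreciation": "appreciation",
    "abandoned/uninterpretable": "incomplete",
    "yes-no-question": "question",
    "wh-question": "question",
    "open question": "question",
}

def coarse_counts(fine_counts):
    """Aggregate fine-grained tag counts into the coarse 7-tag scheme."""
    counts = Counter()
    for tag, n in fine_counts.items():
        counts[COARSE_MAP.get(tag, "other")] += n
    return counts

# Example with a handful of approximate counts from Table 6.1
example = {"statement-non-opinion": 74000, "statement-opinion": 26000,
           "backchannel": 38000, "agree/accept": 11000}
print(coarse_counts(example))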
C. Examples of Label Subsets
This part shows examples of label clustering results for predicting the therapist’s and patient’s
codes with five in-domain training sessions. Tables 6.3 and 6.4 present the produced label subsets
that achieve the median performance among 15 runs of Reptile-TA-PPTS.
Behavioral Code Clustered Similar Labels
Facilitate backchannel, yes answer, no answer
Giving Information statement-opinion, statement-non-opinion, dispreferred-answers
Simple Reflection quotation, declarative yes-no-question, declarative wh-question
Complex Reflection non-verbal, hedge, summarize/reformulate
Closed Question yes-no-question, or-clause, tag-question
Open Question wh-question, open-question, self-talk
MI adherent appreciation, downplayer, thanking
MI non-adherent action-directive, offers/options commits, 3rd-party-talk
Table 6.3: Label clustering results for therapist’s codes.
Behavioral Code: Clustered Similar Labels
Follow/Neutral: backchannel, no answer, non-verbal, yes answer, response acknowledgement, tag-question, repeat-phrase, backchannel in question form
Change Talk (positive): quotation, declarative, yes-no-question, offers/options commits, statement-opinion, declarative wh-question, rhetorical-questions, 3rd-party-talk, yes-no-question
Sustain Talk (negative): statement-non-opinion, collaborative completion, hedge, action-directive, other answers, dispreferred answers, declarative yes-no-question, affirmative non-yes answers
Table 6.4: Label clustering results for patient’s codes.
D. A proposition for the Sampling Strategy PPTS
Proposition 1 If we adopt the sampling strategy PPTS, then every unique instance within the label
subsets has the same chance of being picked.
Proof. Define the label subsets after label clustering as C_1 = {c_1^1, c_1^2, ..., c_1^K}, C_2 = {c_2^1, c_2^2, ..., c_2^K}, ..., C_M = {c_M^1, c_M^2, ..., c_M^K}.

Let x be an arbitrary instance whose label is contained in the label subset C_i, 1 ≤ i ≤ M.

Consider the following process: 1) sample an analogy task with probability proportional to the task size; 2) randomly sample an instance from the selected task.

We compute the probability that the picked instance is x by

\[
P(x) = \sum_{T} P(T)\, P(x \mid T)
= \sum_{T} \frac{|T|}{\sum_{T'} |T'|}\cdot\frac{1}{|T|}\cdot \mathbb{1}\{x \in T\}
= \frac{\sum_{T} \mathbb{1}\{x \in T\}}{\sum_{T} |T|} \tag{6.3}
\]

where T denotes any analogy task. Note that an arbitrary label c can be enrolled in exactly K^{M-1} analogy tasks, so every instance appears in exactly K^{M-1} tasks. Equation (6.3) can therefore be rewritten as

\[
P(x) = \frac{K^{M-1}}{K^{M-1}\sum_{i=1}^{M}\sum_{j=1}^{K}\lvert c_i^j\rvert}
= \frac{1}{\sum_{i=1}^{M}\sum_{j=1}^{K}\lvert c_i^j\rvert}. \tag{6.4}
\]
The probability does not depend on the label and is thus the same for every instance x. Please note
that the proposition will not hold if the sizes of the label subsets C_i are different.
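The proposition can also be checked empirically. Below is a minimal simulation sketch, not taken from the dissertation, that builds all K^M analogy tasks from made-up label subsets (M = 2, K = 2), samples tasks proportionally to their size, and verifies that every instance is picked approximately uniformly.

import itertools
import random
from collections import Counter

# Hypothetical label subsets: each label maps to its list of instance ids; sizes differ on purpose.
subsets = [
    {"c11": ["a1", "a2", "a3"], "c12": ["b1"]},
    {"c21": ["d1", "d2"], "c22": ["e1", "e2", "e3", "e4"]},
]

# An analogy task picks exactly one label from every subset, giving K^M tasks in total.
tasks = []
for combo in itertools.product(*(s.items() for s in subsets)):
    instances = [x for _, label_instances in combo for x in label_instances]
    tasks.append(instances)

sizes = [len(t) for t in tasks]

counts = Counter()
n_draws = 200_000
for _ in range(n_draws):
    # 1) sample a task with probability proportional to its size
    task = random.choices(tasks, weights=sizes, k=1)[0]
    # 2) sample an instance uniformly from that task
    counts[random.choice(task)] += 1

# Every instance should be picked with probability close to 1 / (total number of instances)
total = sum(len(x) for s in subsets for x in s.values())
for instance, c in sorted(counts.items()):
    print(instance, round(c / n_draws, 4), "expected", round(1 / total, 4))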
E. Details of the Clinical Section Classification Task
Our in-house data is a mock customer dataset from a specific business institution that simulates a nursing situation. For every conversation, both transcriptions and clinical notes are provided. A dedicated labeling team, which demonstrated calibration prior to the coding process, performed the annotation. Figure 6.1 shows an episode of a labeled snippet.
Figure 6.1: An example episode of a doctor-patient conversation.
BERT fine-tuning strategies
First, instead of feeding a single utterance as the input to the model, we add preceding and following utterances and examine their impact on accuracy. Our hypothesis is that contextual information can benefit prediction accuracy. Second, we add speaker-role information to the input by using role-specific tokens: "[PAT]" for the patient’s utterances and "[CLI]" for the clinician’s utterances, placed in front of each utterance. For example, {[PAT], U_{i-2}, [CLI], U_{i-1}, [CLI], U_i, [PAT], U_{i+1}, [CLI], U_{i+2}} is used as the input for predicting the section of the clinician’s utterance U_i with a context size of 2 (ranging from U_{i-2} to U_{i+2}). After that, we perform two-phase fine-tuning in order to learn role-specific language patterns: in the first phase, we perform regular BERT fine-tuning on all utterances; in the second phase, we fine-tune the model on role-specific utterance data.
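A minimal sketch of how such a role-prefixed, context-windowed input could be assembled is shown below. This is illustrative code under our own assumptions rather than the exact implementation used in this work; the utterances, roles, and context size are hypothetical, and the role tokens are registered as special tokens so the tokenizer does not split them.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Register "[PAT]" and "[CLI]" as special tokens so they stay intact.
tokenizer.add_special_tokens({"additional_special_tokens": ["[PAT]", "[CLI]"]})
# If a model is loaded, model.resize_token_embeddings(len(tokenizer)) would also be needed.

def build_input(utterances, roles, i, context_size=2):
    """Build the role-prefixed input window centered on utterance i."""
    lo = max(0, i - context_size)
    hi = min(len(utterances), i + context_size + 1)
    pieces = [f"{roles[j]} {utterances[j]}" for j in range(lo, hi)]
    return " ".join(pieces)

# Hypothetical example with context size 2, predicting the section of utterance U_i (i = 2)
utterances = ["How are you feeling today?", "A bit dizzy.", "Any pain?", "No pain.", "Good."]
roles = ["[CLI]", "[PAT]", "[CLI]", "[PAT]", "[CLI]"]
text = build_input(utterances, roles, i=2)
encoded = tokenizer(text, truncation=True, max_length=128, return_tensors="pt")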
Experimental Setup
We set the maximum sequence length depending on how many contextual utterances we incorporated, such that it covered more than 99% of the sentences. For all the BERT fine-tuning processes,
we selected the best learning rate among {1e-5, 2e-5, 3e-5} on the validation set. We employed a decoupled weight decay regularizer and a linear learning rate scheduler for optimization [75].
The model was trained with a batch size of 64 for 5 epochs, and the checkpoint with the lowest validation loss was selected. For the intermediate meta-learning task, we set λ = 5e-5 and δ = 2e-5. We pre-trained the model for 3 epochs, sampled 8 tasks per step, and fixed the number of inner update steps n_s to 5. The best number of cluster groups K for UMTA was also determined by the validation loss.
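As a rough sketch of the optimization setup, our own illustrative code rather than the exact training script, the decoupled weight decay optimizer and linear schedule can be instantiated as follows; the number of labels, weight decay value, and number of training steps are hypothetical placeholders.

import torch
from transformers import BertForSequenceClassification, get_linear_schedule_with_warmup

# num_labels is hypothetical (set to the number of clinical sections in practice)
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=10)

# AdamW implements decoupled weight decay [75]; learning rate chosen from {1e-5, 2e-5, 3e-5}
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)

# Linear learning rate schedule over all training steps (5 epochs, batch size 64)
num_training_steps = 5 * 1000  # hypothetical: 1000 steps per epoch
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=num_training_steps)

# Inside the training loop:
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()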
Language Model Adaptation
To obtain a better pre-trained language model for our task, we adapted BERT to the healthcare conversation domain via domain-adaptive pre-training [51], using masked word prediction and next sentence prediction on external data. To learn role and contextual information, we prefixed the role tokens "[CLI]" and "[PAT]" to each utterance and spliced the corpus every three utterances. We trained BERT for 20,000 steps on the external data, setting the learning rate to 2e-5, the batch size to 32, and the maximum sequence length to 128.
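A minimal sketch of the corpus preparation for this domain-adaptive pre-training is given below. It is illustrative code under our own assumptions; the output path and utterance format are hypothetical. It prefixes the role tokens and splices the dialogue into chunks of three utterances, which can then be fed to a standard masked-language-model and next-sentence-prediction pre-training pipeline.

def prepare_pretraining_corpus(dialogues, out_path="adaptation_corpus.txt"):
    """dialogues: list of sessions; each session is a list of (role, utterance) pairs
    with role in {"CLI", "PAT"}. Writes one three-utterance chunk per line and a blank
    line between sessions, a common plain-text pre-training corpus format."""
    with open(out_path, "w", encoding="utf-8") as f:
        for session in dialogues:
            tagged = [f"[{role}] {utt}" for role, utt in session]
            # splice the corpus every three utterances
            for k in range(0, len(tagged), 3):
                chunk = " ".join(tagged[k:k + 3])
                f.write(chunk + "\n")
            f.write("\n")

# Hypothetical usage
dialogues = [[("CLI", "Any allergies?"), ("PAT", "Penicillin."), ("CLI", "Noted."),
              ("PAT", "Also pollen."), ("CLI", "Thanks.")]]
prepare_pretraining_corpus(dialogues)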
Abstract
Advances in spoken language processing techniques have dramatically augmented productivity and improved the quality of life. One of the most striking applications of such techniques is automated human behavioral coding in domains such as diagnostic or therapeutic clinical interactions. Behavioral coding is a procedure during which trained human coders listen to audio recordings and review session transcripts to assess the session quality and specific interaction attributes and mechanisms. Developing computational models of speech and natural language processing for automated behavioral coding helps reduce the need for manual annotation and the burden placed on experts. However, most existing automated behavioral coding methods assume that enough in-domain samples and labels are available, an assumption that does not hold in practice in low-resource scenarios. In this dissertation, I discuss the roots of the data sparsity problem for automated behavioral coding and address these issues using advanced spoken language processing techniques. Specifically, I adopt hierarchical transformer frameworks, domain adaptation models, and meta-learning and task augmentation approaches to build computational linguistics models for modeling human interactions in different low-resource scenarios. I compare these novel algorithms to baseline approaches and show improved performance. We evaluate our automated behavioral coding algorithms in psychotherapy, which is considered an expository domain. The datasets used in our experiments are from cognitive behavior therapy and motivational interviewing. Beyond that, we further apply our models to other styles of text data to demonstrate the generalizability of our algorithms.