Interaction Dynamics and Coordination for Behavioral Analysis in Dyadic Conversations

by

Md Nasir

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Electrical Engineering)

May 2020

Acknowledgements

First, I would like to express my sincere gratitude to my advisors Shrikanth (Shri) Narayanan and Panayiotis (Panos) Georgiou for their support. Shri's vision and perspective towards a bigger picture and Panos's meticulousness and rigor taught me to strive for excellence as a researcher. Their patience, motivation, and immense knowledge have shaped my PhD life and upcoming career. If it were not for them, I could not have finished my PhD journey.

I also express my gratitude towards the SAIL members who mentored me in the early years of my PhD, particularly Naveen, Jangwon, Rahul, and Theodora. My gratitude also goes to other current and past SAILers who extended their help and support through intellectual and academic discussions as well as friendship. Special thanks to Arindam, Sandeep, Rimita, Krishna, Amrutha, Colin, Zhaojun, and Tanaya for their constant support within and outside the lab. In particular, I spent a good amount of time (also a great time) with Arindam, as roommates and colleagues, starting from our undergrad days. Thank you for all your help and friendship over the years.

I am indebted to Shiva and Divya (and Sandeep) for being part of my support system. There was probably no coffee shop on campus where we did not spend our afternoons discussing our lives.

I am indebted to all my friends who made me feel at home during my time at USC and who were always so helpful in numerous ways. Special thanks to Pradipta, Sulagna, Anupriya, Debtanu, Sayantani, Agnimitra, Sebanti, Shinjini, Shreya, Pratyusha, Samprita, Anastasia and Chinmoy. I also had the privilege of having many good friends who constantly stood by me despite being all over the world. I express my heartfelt thanks and gratitude to Aditi, Anushmita, Soumyadip, Mimansa, Krysta, Shouvik, Sridip, Pavel, Sayani, Ushasi, Pranoy, Sourajit, Nazneen and many others.

Finally, I would like to thank my parents for their constant support and motivation in all of my endeavors. They fostered a passion for learning and an inquisitiveness for knowledge in me during my childhood that finally brought me to the academic journey of doctoral studies. They encouraged me to be myself, believe in my decisions and follow my heart in every single step of my life. Supporting my decision to move far away from home to fulfill my dreams was particularly difficult for them, but they never showed a bit of hesitation. I cannot thank them enough for their love throughout my life and all the sacrifices they made. I would also like to thank my younger sisters Nafisa and Nazia for their support and love throughout my PhD and my life in general.

Table of Contents

Acknowledgements
Abstract
Chapter 1: Introduction
Chapter 2: Dynamic Functionals and Couples Therapy Outcome Prediction
    2.1 Introduction
    2.2 Related Literature
    2.3 Couples Therapy Corpus and Outcomes
    2.4 Acoustic Feature Extraction
        2.4.1 Preprocessing of Audio Data
        2.4.2 Different types of acoustic features
        2.4.3 Static functionals
        2.4.4 Dynamic functionals
            2.4.4.1 Short-term dynamic functionals
            2.4.4.2 Long-term dynamic functionals
    2.5 Manually-derived behavioral codes as features
    2.6 Correlation Analysis of Features with Outcomes
    2.7 Classification Experiments
        2.7.1 Experiments with different feature sets
        2.7.2 Classifier
        2.7.3 Feature selection
        2.7.4 Results
    2.8 Conclusion
Chapter 3: Nonlinear Dynamical Systems Modeling of Dyadic Interactions
    3.1 Introduction
    3.2 Related literature: Entrainment measures
    3.3 Similarity measures in various domains
        3.3.1 Probabilistic/Information Theoretic Measures
        3.3.2 Similarity Measures used for Clustering
        3.3.3 Wavelet-based Similarity Measures
        3.3.4 System Theoretic Measures
    3.4 Feature Extraction
        3.4.1 Audio Preprocessing
        3.4.2 Prosodic Features: Pitch and Energy
    3.5 Complexity of Nonlinear Dynamical Systems
        3.5.1 Reconstructed State Space Embedding
        3.5.2 Different Complexity Measures
            3.5.2.1 Lyapunov Exponents
            3.5.2.2 Fractal Dimensions
    3.6 Dataset 1: Couples Therapy Corpus
        3.6.1 Behavioral Codes
        3.6.2 Outcome Ratings
        3.6.3 Preprocessing and Variables of Interest in the Study
    3.7 Dataset 2: Suicide Risk Assessment Corpus
    3.8 Individual and Joint Complexity Measures
    3.9 Experiments and Results
        3.9.1 Verification of Joint Complexity Measures
        3.9.2 Relation to Behavioral Codes
        3.9.3 Relation to Outcomes
    3.10 Experimental Results on Suicide Corpus
        3.10.1 Complexity during Interview Sessions
        3.10.2 Correlation with Emotional Bond
    3.11 Conclusion
Chapter 4: Deep Neural Network Modeling of Dyadic Interactions
    4.1 Introduction
    4.2 Preprocessing and Feature Extraction for Vocal Entrainment
        4.2.1 Preprocessing
        4.2.2 Feature Extraction
        4.2.3 Turn-level Features
    4.3 Deep Unsupervised Learning Framework for Entrainment
        4.3.1 Basic Principle of Learning Entrainment from Data
        4.3.2 Encoding Approach
        4.3.3 Nuisance Factors
        4.3.4 Triplet Network for Entrainment
        4.3.5 i-vector Modeling for Negative Sampling
    4.4 Network Architecture and Entrainment Distance Measures
        4.4.1 Encoder Approach: Neural Entrainment Distance (NED)
        4.4.2 Triplet Network-based Entrainment Distance Measures (TNED and iTNED)
    4.5 Datasets
        4.5.1 Training Data
        4.5.2 Evaluation Data
    4.6 Verification of the Measures
    4.7 Correlation Analyses
        4.7.1 Emotional Bond in Suicide Risk Assessment
        4.7.2 Behavioral Codes in Couples Therapy
    4.8 Application of the Measures as Features
        4.8.1 Classification Setup
    4.9 Conclusion
Chapter 5: Quantification of Linguistic Coordination
    5.1 Introduction
    5.2 Linguistic Coordination Distances in Conversations
        5.2.1 Utterance-level distances as building blocks
            5.2.1.1 Word Mover's Distance (WMD)
            5.2.1.2 Use of sentence embeddings
            5.2.1.3 Fusion Approach
        5.2.2 Conversational Linguistic Distances
            5.2.2.1 Local Interpersonal Distance
            5.2.2.2 Session-level measures
    5.3 Datasets
        5.3.1 Motivational Interviewing corpus
        5.3.2 Couples Therapy corpus
        5.3.3 Cornell Movie Dialogs Corpus
    5.4 Experiments
        5.4.1 Baselines
        5.4.2 Case Study 1: Empathy in Motivational Interviews
        5.4.3 Case Study 2: Couples Therapy
            5.4.3.1 Individual behavioral codes
            5.4.3.2 Therapy outcome
        5.4.4 Case Study 3: Analysis of Coordination in Movie Dialogs
    5.5 Conclusion and Future Work
Chapter 6: Summary and Future work
    6.1 Summary
    6.2 Future Work
Reference List

Abstract

Analysis of interaction dynamics in dyadic conversations can provide important insights into the behavior patterns of the interlocutors. We explore characterization of interaction dynamics in the form of two common interpersonal adaptation mechanisms in conversations: vocal entrainment and linguistic coordination. First, we show how modeling dyadic interactions through nonlinear dynamical systems can provide complexity measures that capture entrainment. These measures, albeit knowledge-driven, are able to capture the nonlinear nature of entrainment during interactions, yielding improved performance over existing linear measures found in the literature. We then propose a deep neural network-based unsupervised learning framework for entrainment and leverage the ability to learn from real conversational data to provide novel distance measures indicative of entrainment. We also propose measuring linguistic coordination in conversations by using neural word embeddings and learning distance measures that capture lexical, syntactic and semantic similarity between interlocutors. Our experiments show that the proposed measures can successfully distinguish real conversations from fake ones by detecting the presence of entrainment or coordination. We also demonstrate their applications in relation to several behaviors and outcomes in observational psychotherapy domains such as couples therapy, suicide risk assessment, and motivational interviewing. Furthermore, we find that incorporating measures characterizing interaction dynamics as features significantly improves the classification performance of predicting the therapy outcome of couples with marital conflict.

Chapter 1
Introduction

Computational study of human behavior is one of the central themes of research and practice in modern psychology and sociology. Expression and perception of human behavior highly influence human interactions and social relationships. Given the importance and the widespread nature of the problem domain, computational modeling of human behavior is becoming an interdisciplinary area of increasing research activity, which also involves signal processing, machine learning and data analytics. The emerging field, also known as Behavioral Signal Processing (BSP) [76, 185], refers to computational methods that support measurement, analysis, and modeling of human behavior and interactions. More specifically, it aims to analyze highly complex and intricate expressions of behavior through multiple modalities of human communication such as speech, language and visual expressions (often characterized by real-world signals) and map them to behavioral constructs, often abstract and complex, based on the area of interest. The technology and algorithms developed under the umbrella of BSP to obtain behavioral informatics can be applied across a variety of domains ranging from healthcare to commerce. Understanding expressed and perceived behavior in human interactions based on signal-driven analytics can lead to the recognition and quantification of various types of typical and atypical behaviors of interest, and to informed decision-making. BSP research has found applications in a variety of clinical domains including couples therapy [23, 24, 25, 26, 76, 79, 156], Autism Spectrum Disorder [29, 45], and addiction counseling [36, 102, 172, 273].
These studies are promising towards the creation of automated support systems for psychotherapists in creating objective measures for diagnostics, intervention assessment and planning. Parallel work with a focus on improving social interactions rather than the health domains can be found in [206, 259]. Researchers have explored behavioral information gathered from various modalities such as vocal patterns of speech [25, 26, 79, 156], spoken language use [43, 76] and visual cues like head motion [275] or body gestures.

In behavioral signal processing, we often focus our analysis on dyadic conversations, interactions that take place between two interlocutors. Such conversations are typically rich with lexical information ("what is being said"), vocal patterns ("how it is being said") and other nonverbal cues. Moreover, we tend to use contextual information (from both the primary speaker and the interlocutor) to add to the progress of the conversation. In a natural conversation, the context often evolves over multiple turns, unlike dialog acts, where the context is shorter and often limited to the question or prompt of the interlocutor.

Modeling and analysis of this multimodal and dynamic information encoded in dyadic interactions is a complex task, and there are several different aspects to analyze. There are two aspects of characterizing dyadic interactions that I have studied in my work: interpersonal dynamics, i.e., how one interlocutor influences the other during the interaction, and temporal dynamics, i.e., how the speaker is influenced by temporal context or the past history of the conversation. One should note that these two aspects are not conceptually disjoint. For example, temporal dynamics also includes interpersonal dynamics along with the self-dynamics of the speaker.

Dynamic Functionals and Couples Therapy Outcome Prediction

Next, as an application of interpersonal behavior analysis using data-driven techniques, we analyze the vocal speech patterns of couples engaged in problem-solving interactions to infer the eventual outcome of their relationship – whether it improves or not – over the course of couples therapy. We formulate the outcome prediction as binary (improvement vs. no improvement) and multiclass (different levels of improvement) classification problems and use machine learning techniques to automatically discern the underlying patterns of these classes from the speech signal.

We propose short-term and long-term dynamic functionals that capture the variation of features from one speaker to the other and also over the course of the therapy session. We compare the prediction using features directly derived from speech with prediction using clinically relevant behavioral ratings (e.g., relationship satisfaction, blame patterns, negativity) manually coded by experts after observing the interactions. It should be noted that the human behavioral codes are based on watching videos of the interactions, which provide access to additional information beyond the vocal patterns (solely relied upon by the proposed prediction scheme), including language use and visual nonverbal cues. In addition to evaluating how well directly signal-derived acoustic features compare with manually derived behavioral codes as features for prediction, we also evaluate the prediction of the outcome when both feature streams are used together. We also investigate the benefit of explicitly accounting for the dynamics and mutual influence of the dyadic behavior towards the prediction task.
The experimental results show that dynamic functionals that measure relative vocal changes within and across interlocutors contribute to improved outcome prediction.

Quantifying Entrainment in Conversations

Vocal entrainment is an established social adaptation mechanism. It can be loosely defined as one speaker's spontaneous adaptation to the speaking style of the other speaker. Entrainment is a fairly complex, multifaceted process, closely associated with many other mechanisms such as coordination, synchrony and convergence. While there are various aspects and levels of entrainment [162], there is also general agreement that entrainment is a sign of positive behavior towards the other speaker [19, 116, 268]. There have been a myriad of works with the central focus on the phenomenon that speakers often tend to 'accommodate' or 'sound similar' over the course of a conversation. Different studies have come up with different terms to indicate the same (or a similar) phenomenon: convergence [81, 202], alignment [211], entrainment [30], etc. The authors of [162] used distinct terms for entrainment at different levels, such as convergence, proximity and synchrony. In [153, 156] the authors used the term vocal entrainment when analyzing speech patterns. In [274] and [260] the term mimicry has been used when analyzing multimodal communication.

The celebrated "Speech Accommodation Theory" [81] studied this phenomenon. Giles et al. further analyzed 'convergence' and 'divergence' of speech and non-verbal characteristics, such as utterance length, speech rate, intensity and response latency, by individuals in response to each other's communicative behavior in social settings [80]. Different speech strategies (such as complementarity, over- and under-accommodation) have been theoretically recognized as predictive of high-level behavioral patterns. In [174, 181, 215, 266], the authors identified this phenomenon as influencing social and romantic relationships. A high degree of vocal entrainment has been associated with various interpersonal behavioral attributes, such as high empathy [276], more agreement and less blame towards the partner and positive outcomes in couples therapy [190], and a high emotional bond [187]. A good understanding of entrainment provides insights into various interpersonal behaviors and facilitates the recognition and estimation of these behaviors in the realm of Behavioral Signal Processing [78, 185]. Moreover, it also contributes to the modeling and development of 'human-like' spoken dialog systems or conversational agents.

Unfortunately, quantifying entrainment has always been a challenging problem. There is a scarcity of reliable labeled speech databases on entrainment, possibly due to the subjective and diverse nature of its definition. This makes it difficult to capture entrainment using supervised models, unlike many other behaviors. Early studies on entrainment relied on highly subjective and context-dependent manual observation coding for measuring entrainment. Objective methods based on extracted speech features have employed classical synchrony measures such as Pearson's correlation [162] and traditional (linear) time-series analysis techniques [142]. Lee et al. [156, 276] proposed a measure based on PCA representations of prosody and MFCC features of consecutive turns. Most of these approaches assume a linear relationship between features of consecutive speaker turns, which is not necessarily true, given the complex nature of entrainment.
For example, the effect of rising pitch or energy can potentially have a nonlinear influence across speakers. This motivated us to adopt a nonlinear dynamical systems modeling approach and explore the possibility of complexity measures for capturing entrainment.

Nonlinear Dynamical Systems Modeling of Dyadic Interactions

Dynamical systems and chaotic analysis have been extensively studied for modeling speech waveforms [9, 186, 253] because of their ability to account for the nonlinear phenomena underlying the time-series. One characterization of nonlinear systems is through the notion of complexity, which provides a quantitative measure of the 'degrees of freedom' or 'detail' in the minimal representation of the system. For example, a chaotic or irregular signal corresponds to higher complexity than a deterministic or periodic signal. Researchers have used complexity measures and other nonlinear dynamical features for several speech-based applications such as speaker recognition [208], phoneme classification [139], pathological speech classification [137] and speech synthesis [9].

However, little effort has been made on analyzing complexity patterns of interacting signal streams, such as the ones found in dyadic interactions. Coordination and accommodation, also known as entrainment, are commonly occurring phenomena in such interactions [35, 81, 224], where interlocutors tend to adapt to each other's verbal and non-verbal behavior as reflected in their speech patterns. This includes convergence or divergence of their patterns, as well as how they synchronize in time. Within the framework of nonlinear dynamical systems modeling of speech, the interactions can be viewed as joint (coupled) dynamical systems. One can argue that the complexity of the joint system formed by two interlocutors depends on the extent of their mutual influence. More specifically, the complexity can be deemed lower if there is more coordination between the speakers in the form of behavioral similarity and synchrony [224].

We propose a framework to analyze different complexity measures of spoken interactions using the prosody features (pitch and energy) of the observed speech signal. We associate the feature streams from each speaker with individual dynamical systems, while features from both speakers together are used to model a joint system, capturing coordination between individuals. The complexity measures computed on these systems are then investigated in relation to behavioral codes characterizing the dyad and the outcome of the couples therapy.

In another case study, we investigate the characteristics of patient-therapist interactions during suicide risk assessment interviews. While researchers have attempted to predict suicide risk itself from speech acoustic features [52, 71], to the best of our knowledge, there has been no work on analyzing suicide risk assessment interviews and interaction dynamics using vocal cues. In addition to the similarity measures used to quantify vocal entrainment proposed by Lee et al. [156], the current work employs the notion of complexity in a dynamical systems approach by modeling speech features as nonlinear time-series [190]. For two interlocutors during a conversation, the jointly characterized complexity of their speech patterns can be associated with the degree of their entrainment.
More specifically, in the case of lower behavioral similarity and less synchrony, there is more variability in the underlying system, resulting in higher system complexity [224].

Deep Neural Network Modeling of Dyadic Interactions

Along with the rise of data-driven approaches based on machine learning algorithms, there have been some attempts to simulate nonlinear dynamical methods using neural networks. These approaches provide the same measures, but in a more robust way. One of the earliest works [2] in this domain showed that it was possible to reconstruct the attractor of a noise-corrupted signal using back-propagation. A range of machine learning methods are exploring the connection between "signal distances", not directly comparable to metrics of entrainment but with potential technical parallels. For example, autoencoders [113, 114] have been employed for the purpose of obtaining a hidden, bottleneck embedding that best represents the data. This is similar to Bidirectional Associative Memories [141]. While this can be looked upon as compression, it also derives a minimalistic representation that may be more applicable to direct distance metric comparison.

Recently, various complexity measures (such as the largest Lyapunov exponent) of feature streams based on nonlinear dynamical systems modeling showed promising results in capturing entrainment [187, 190]. A limitation of this modeling, however, is the assumption of the short-term stationary or slowly varying nature of the features. While this can be reasonable for global or session-level complexity, the measure is not very meaningful for capturing turn-level or local entrainment. Nonlinear dynamical measures also suffer from poor scalability to multidimensional feature sets, including spectral coefficients such as MFCCs. Further, all of the above metrics are knowledge-driven and do not exploit the vast amount of information that can be gained from existing interactions.

A more holistic approach is to capture entrainment in consecutive speaker turns through a more robust nonlinear function. Conceptually speaking, such a formulation of entrainment is closely related to the problem of learning a transfer function which maps vocal patterns of one speaker turn to the next. A compelling choice for nonlinearly approximating the transfer function is to employ Deep Neural Networks (DNNs). This is supported by recent promising applications of deep learning models, both in supervised and unsupervised paradigms, in modeling and classification of emotions and behaviors from speech. For example, in [164] the authors learned, in an unsupervised manner, a latent embedding towards identifying behavior in out-of-domain tasks. Similarly, in [122, 123] the authors employ Neural Predictive Coding to derive embeddings that link to speaker characteristics in an unsupervised manner.

We propose an unsupervised training framework to contextually learn the transfer function that ties the two speakers. The learned bottleneck embedding contains cross-speaker information closely related to entrainment. We define a distance measure between the consecutive speaker turns represented in the bottleneck feature embedding space. We call this metric the Neural Entrainment Distance (NED). We then experimentally investigate the validity and effectiveness of the NED measure in association with interpersonal behavior.
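To make the encoding idea concrete before the formal treatment in Chapter 4, below is a minimal PyTorch sketch of the basic recipe: an encoder-decoder pair is trained, without labels, to predict the features of the following turn from the current one, and the distance between consecutive turns in the bottleneck space is read as an entrainment measure. The layer sizes, feature dimensionality, optimizer settings and function names are illustrative assumptions only, not the architecture or the exact NED definition used in this work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TurnEncoder(nn.Module):
    """Maps a turn-level feature vector to a low-dimensional bottleneck.
    Dimensions here are placeholders, not the Chapter 4 architecture."""
    def __init__(self, feat_dim=100, bottleneck_dim=30):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.Tanh(),
            nn.Linear(64, bottleneck_dim),
        )

    def forward(self, x):
        return self.net(x)

class NextTurnDecoder(nn.Module):
    """Predicts the next turn's features from the bottleneck, i.e., a crude
    stand-in for the cross-speaker transfer function."""
    def __init__(self, feat_dim=100, bottleneck_dim=30):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(bottleneck_dim, 64), nn.Tanh(),
            nn.Linear(64, feat_dim),
        )

    def forward(self, z):
        return self.net(z)

def train_step(enc, dec, opt, turn_t, turn_next):
    """One unsupervised step: reconstruct the following turn from the current one."""
    opt.zero_grad()
    loss = F.mse_loss(dec(enc(turn_t)), turn_next)
    loss.backward()
    opt.step()
    return loss.item()

def entrainment_distance(enc, turn_t, turn_next):
    """Distance between consecutive turns in the learned bottleneck space;
    smaller values are read as stronger entrainment."""
    with torch.no_grad():
        return torch.norm(enc(turn_t) - enc(turn_next), dim=-1).mean().item()

# Example usage with random stand-in turns (batch of 1, 100-dim features):
# enc, dec = TurnEncoder(), NextTurnDecoder()
# opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), 1e-3)
# train_step(enc, dec, opt, torch.randn(1, 100), torch.randn(1, 100))
# ned = entrainment_distance(enc, torch.randn(1, 100), torch.randn(1, 100))
```

In this formulation, a smaller distance between the embeddings of consecutive turns is interpreted as stronger entrainment; Chapter 4 refines the idea with nuisance-factor handling and triplet-based training.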
Quantification of Linguistic Coordination using Neural Embeddings

When people engage in conversations in social settings, they tend to coordinate with each other and show similar behavior in various modalities. This tendency, known as entrainment or coordination, is exhibited through facial expressions [221], head motion [277], vocal patterns (vocal entrainment) [156, 189], as well as the use of language (linguistic coordination) [210]. Linguistic coordination is a well-established phenomenon in both spoken and written communication that has many collaborative benefits. It is often associated with a wide range of positive social behaviors and outcomes, such as task success in collaborative games [171, 194], building effective dialogues [216] and rapport [41], engagement in tutoring scenarios [263], and successful negotiation [251].

Drawing inspiration from the recent success of distributed representations of words and sentences, we propose a framework for computing linguistic coordination based on distances between utterances from the interlocutors. Our first approach relies on a recently proposed measure known as Word Mover's Distance (WMD) [148] and extends it to compute a distance that captures linguistic coordination. Since WMD builds upon word2vec word embeddings, shown to contain semantic and syntactic information [182], our proposed measure attempts to capture both semantic and syntactic aspects of linguistic coordination. We also describe an alternate approach of generating an embedding for each utterance and obtaining distances in the embedding space. Finally, we combine the word-level and utterance-level representation approaches into a hybrid approach.

One of the main novelties in our work is in jointly integrating multiple aspects of coordination into a single measure. In our framework, we also propose to measure the coordination locally and then normalize it globally to account for the individual tendency of coordination. We experimentally validate our measure in relation to the therapist's empathy towards their patient in Motivational Interviewing as well as outcome and affective behaviors in Couples Therapy.

Chapter 2
Dynamic Functionals and Couples Therapy Outcome Prediction

2.1 Introduction

Behavioral Signal Processing (BSP) [76, 185] refers to computational methods that support measurement, analysis, and modeling of human behavior and interactions. The main goal is to support decision making of domain experts, such as mental health researchers and clinicians. BSP maps real-world signals to behavioral constructs, often abstract and complex, and has been applied in a variety of clinical domains including couples therapy [25, 76, 156], Autism Spectrum Disorder [29], and addiction counseling [36, 172]. Parallel work with a focus on social context rather than the health domains can be found in [206, 259]. Notably, couples therapy has been one of the key application domains of Behavioral Signal Processing. There have been significant efforts in characterizing the behavior of individuals engaged in conversation with their spouses during problem-solving interaction sessions. Researchers have explored information gathered from various modalities such as vocal patterns of speech [25, 26, 79, 156], spoken language use [43, 76] and visual body gestures [275]. These studies are promising towards the creation of automated support systems for psychotherapists in creating objective measures for diagnostics, intervention assessment and planning.
This entails not only characterizing and understanding a range of clinically meaningful behavior traits and patterns but, critically, also measuring behavior change in response to treatment. A systematic and objective study and monitoring of the outcome relevant to the respective condition can facilitate positive and personalized interventions. In particular, in clinical psychology, predicting the outcome of the relationship of a couple undergoing counseling (or measuring it from couple interactions, without couple- or therapist-provided metrics) has been a subject of long-standing interest [14, 103, 237].

Many previous studies have manually investigated what the behavioral traits and patterns of a couple can tell us about their relationship outcome, for example, whether a couple could successfully recover from their marital conflict or not. Often the monitoring of outcomes involves a prolonged period of time post treatment (up to 5 years), and highly subjective self-reporting and manual observational coding [15]. Such an approach suffers from the inherent limitations of qualitative observational assessment, subjective biases of the experts, and great variability in the self-reporting of behavior by the couples. Having a computational framework for outcome prediction can be beneficial towards the assessment of the employed therapy strategies and the quality of treatment, and can also help provide feedback to the experts.

In this chapter, we analyze the vocal speech patterns of couples engaged in problem-solving interactions to infer the eventual outcome of their relationship – whether it improves or not – over the course of therapy. The proposed data-driven approach focuses primarily on the acoustics of the interaction, which are unobtrusively obtainable and known to offer rich behavioral information. We adopt well-established speech signal processing techniques, in conjunction with novel data representations inspired by psychological theories, to design the computational scheme for the therapy outcome prediction considered. We formulate the outcome prediction as binary (improvement vs. no improvement) and multiclass (different levels of improvement) classification problems and use machine learning techniques to automatically discern the underlying patterns of these classes from the speech signal.

We compare the prediction using features directly derived from speech with prediction using clinically relevant behavioral ratings (e.g., relationship satisfaction, blame patterns, negativity) manually coded by experts after observing the interactions. It should be noted that the human behavioral codes are based on watching videos of the interactions, which provide access to additional information beyond the vocal patterns (solely relied upon by the proposed prediction scheme), including language use and visual nonverbal cues.

In addition to evaluating how well directly signal-derived acoustic features compare with manually derived behavioral codes as features for prediction, we also evaluate the prediction of the outcome when both feature streams are used together.

[Figure 2.1. Overview of the work described in this chapter. We use 2 out of 3 interactions (shown on left). We employ automated feature extraction from acoustics and/or human behavioral coding (center) and machine learning (right) to derive outcomes.]
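As a concrete illustration of this evaluation plan, the sketch below compares an acoustic-only, a codes-only, and a fused feature set under cross-validation on the binary task. It is a minimal stand-in using randomly generated placeholder data; the feature dimensionalities, the linear SVM, and the 10-fold protocol are assumptions for illustration, not the classifier, feature selection, or setup reported in Section 2.7.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Hypothetical inputs, one row per outcome instance (141 in this corpus):
#   X_acoustic -- static + dynamic acoustic functionals per session pair
#   X_codes    -- expert behavioral ratings (33 codes)
#   y          -- binary outcome: 1 = improvement, 0 = no improvement
rng = np.random.default_rng(0)
X_acoustic = rng.normal(size=(141, 200))
X_codes = rng.normal(size=(141, 33))
y = rng.integers(0, 2, size=141)

clf = make_pipeline(StandardScaler(), LinearSVC(C=0.1, dual=False))

# Early fusion by feature concatenation; compare the three feature streams.
for name, X in [("acoustic only", X_acoustic),
                ("codes only", X_codes),
                ("fused", np.hstack([X_acoustic, X_codes]))]:
    acc = cross_val_score(clf, X, y, cv=10).mean()
    print(f"{name:>13}: mean CV accuracy = {acc:.3f}")
```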
We also investigate the benefit of explicitly accounting for the dynamics and mutual influence of the dyadic behavior towards the prediction task. The experimental results show that dynamic functionals that measure relative vocal changes within and across interlocutors contribute to improved outcome prediction.

The outline of the chapter is as follows. We discuss relevant literature in Section 2.2. The Couples Therapy Corpus used in the study is described in Section 2.3 and illustrated in Figure 2.1. An overview of the methodologies for speech acoustic feature extraction is given in Section 2.4, and the use of behavioral codes as features is described in Section 2.5. We provide an analysis of the proposed acoustic features in Section 2.6 and the results of the classification experiments in Section 2.7. Finally, we conclude the chapter with a discussion of our findings as well as possible directions for future research in Section 2.8.

2.2 Related Literature

Clinical psychotherapy is an important treatment method for a wide range of psychological problems and disorders, including depression, addiction, anxiety, domestic violence and relationship distress. Studies have shown that a typical therapy client is likely to be better off than 75% of untreated individuals, on average [240].

Over the years, different approaches to psychotherapy have been proposed, with methodical differences but a shared common goal focused on the personal and social well-being of the individual. In couples therapy, some widely used approaches are Emotionally Focused Couples Therapy (EFCT) [124], Gottman's Method of Couples Therapy [90], Traditional Behavioral Couples Therapy (TBCT) [120], Cognitive Behavioral Couples Therapy (CBCT) [13, 66], and Integrative Behavioral Couples Therapy (IBCT) [48]. Many studies have compared these different schools of therapy in terms of effectiveness and realizability. Recent works have shown that even though TBCT works well on a short-term basis, IBCT turns out to be the most effective towards a positive long-term marital outcome [14, 47].

Apart from the inherent nuances of the therapy methods, the subjectivity of the therapist and the specific characteristics of the clients can potentially play an important role in therapy. Therefore, it is critical to assess the quality and effectiveness of the therapy process by observing its outcome. Based on this objective, there have been numerous studies on therapy outcomes and comparative analyses of different therapy methods relating to the outcomes. Many of these works focus on the very definition of therapy outcome and the choice of outcome variables by accommodating contextual differences [118, 150, 151, 198, 199, 240]. Often, monitoring of the outcome over the course of the therapy serves as a good indicator of therapy effectiveness. This has triggered a lot of research on longitudinal outcome studies [135, 180].

Among the different outcome studies, a considerable amount of research has been undertaken in the specific domain of couples therapy, including studies that have focused on defining proper metrics for marital therapy outcomes. One obvious outcome, of course, would be whether the couple stayed in the relationship or not within a certain time after the intervention.
However, divorce (or the absence of it) does not always reflect the degree of marital satisfaction; whether a couple in a distressed relationship goes through divorce depends on a number of external factors like age, education, culture, religious beliefs and the socio-economic status of the spouses [31, 226]. Most of the studies on couples therapy outcomes have therefore focused on outcomes based on the couple's behavior, either observed from their interactions or gathered through carefully designed questionnaires. One of the first studies of this kind was conducted by Bentler and Newcomb [18], who found a high correlation between certain psychological variables, such as self-perception and other personality traits, reported by the couple through a questionnaire, and their marital success. As a general trend of outcome studies in couples therapy, researchers have typically proposed relevant behavioral descriptors of the couple and analyzed how they are related to, and predictive of, marital outcome. Gottman and Krokoff [88] found certain interaction patterns, such as defensiveness and withdrawal, to be detrimental to long-term marital success in the empirical studies they conducted. In [89], the authors showed codified observed behaviors, such as withdrawal, sadness, and humor, to be indicators of marital success, with a cascade representation of possible gradual deterioration with time. Another set of constructed variables, such as disappointment, withdrawal, and fondness, describing the history of oral interviews of the couple, was used by Buehlman et al. [34]. Another work by Gottman et al. [87] received widespread attention for the prediction of marital satisfaction and divorce. It also made many recommendations for therapy based on what it deemed beneficial or detrimental for marriage. Other works with similar behavioral coding-based approaches for the prediction of marital success or failure can be found in the literature [40, 86, 93]. A comprehensive survey of marital outcome prediction studies can be found in [131]. Further, two recent books by Gottman [91, 92] have summarized his work on this topic.

In summary, a significant amount of research in clinical psychology has sought an answer to the question "What leads to a divorce or an unsuccessful marriage?". Even though these studies have provided important insights into the key factors for marital success, they suffer from certain drawbacks. According to Heyman [108], these shortcomings range from technical issues, like a lack of rigorous statistical validation of the hypotheses of the studies, to more practical shortcomings, such as a lack of sufficient reliable data [109]. Another criticism of these studies is that the high prediction accuracy rates reported are often misleading, as the experiments were mostly data-fitting analyses instead of prediction with cross-validation, and hence subject to overfitting [111]. The limitation of relying on self-reported behavioral traits of the couples was highlighted in [110]. Kim et al. [136] also argued against the generalizability of these works and highlighted the importance of further research and investigation of behavioral process models of relationship outcomes. Another work [245] also raised concerns about possible methodological flaws in many previous works, and in [87] in particular.

A more recent study investigated the different factors responsible for an unsuccessful marriage [5].
It categorized these factors into three categories: demographic (e.g., education), intrapersonal (e.g., depression) and interpersonal (e.g., intimacy, commitment). According to the findings of the hierarchical linear modeling technique used in this work, interpersonal factors have the strongest contribution to the success of a marriage. Moreover, it found that the effect is even stronger during the initial stages of the therapy. In a follow-up of the same study 2 years after the termination of the therapy, communication factors such as encoded arousal (based on pitch) and power processes were also included [12]. These communication factors were found to be the strongest predictors of the treatment response after 2 years. Finally, a 5-year follow-up showed that commitment is a key factor behind the outcome [11]. This study was based on the Couples Therapy corpus [47], which is used in the current work and described in a later section.

Over the past two decades, psychology and social science have seen many changes in computational aspects, coinciding with advances in machine learning, artificial intelligence and, more recently, fields like social signal processing [206, 207, 259, 260] and behavioral signal processing [77, 185]. Researchers have shown that thin slices [4], or small segments of conversational dynamics, can predict interpersonal or behavioral traits or outcomes such as negotiation trends [53], personality [209, 234], depression [71], deception [3, 62], and agreement [112].

In couples therapy, researchers have investigated various signal processing and machine learning based computational methods to study key emotions and behaviors expressed through different modalities of interactions. A majority of these works have used the aforementioned Couples Therapy corpus to validate the signal-driven approaches with real-world data. A particularly relevant work on couples therapy is one that used speech acoustic features to predict different behavioral classes [25, 26], e.g., determining automatically if a person blames his/her spouse during a conversation. Another work [156] analyzed dyadic interaction dynamics, notably the process of entrainment or mutual adaptation of behavior through the course of an interaction, and related it to predicting the perceived affectivity. In [76], the authors presented a framework for extracting behavioral information from language use by the couples, while [23] showed the utility of combining speech and language information for behavioral prediction. Finally, some early results from our current work on the prediction of marital outcome from acoustic features were presented in [192], with a simpler methodology and basic analyses. In the current work, we developed an improved framework that extracts both short-term and long-term temporal changes in acoustic features.

2.3 Couples Therapy Corpus and Outcomes

The Couples Therapy corpus used in this work is a collection of video recordings of interactions of real couples in distressed relationships. The corpus was collected as part of a longitudinal study on couples therapy by collaborating researchers from the University of California, Los Angeles and the University of Washington [47]. The clinical trial that created this corpus primarily focused on analyzing whether Integrative Behavioral Couples Therapy (IBCT) is more efficacious than Traditional Behavioral Couples Therapy (TBCT).
To the best of our knowledge, it is also the largest such collection of randomized clinical couples therapy interaction data [47]. All study procedures were approved by the Institutional Review Boards at the University of California, Los Angeles and the University of Washington; written consent was provided by all study participants, and treatment was provided according to the principles of the Declaration of Helsinki.

One hundred and thirty-four chronically distressed couples were recruited to participate in this study. All of them were male-female pairs, legally married for 10.0 years on average (SD = 7.6). They were also selected after a screening for psychopathological conditions that might interfere with the behavioral aspects of interest, such as schizophrenia, bipolar disorder or antisocial personality disorder. The mean ages of the husbands and wives in the study were 43.49 years (SD = 8.74) and 41.62 years (SD = 8.59), respectively. The majority of the participants identified themselves as Caucasian (husbands: 79.1%, wives: 76.1%); other ethnic groups include African American (husbands: 6.7%, wives: 8.2%), Asian or Pacific Islander (husbands: 6.0%, wives: 4.5%), Latino or Latina (husbands: 5.2%, wives: 5.2%) and Native American/Alaskan (husbands: 0.7%).

The study consisted of three recording sessions collected over a span of 2 years for each couple, as illustrated in Figure 2.1. The first session took place just before the therapy started; the second one was after 26 weeks of therapy, and the last session was recorded after two years. However, some of the couples did not follow up and, as a consequence, the corresponding post-therapy sessions (26 weeks or 2 years) are missing. Each spouse chose an issue critical to their relationship and discussed it with their partner in each of these problem-solving interactions. The short-term goal of these sessions was the mutual understanding of these conflicting problems and reaching a resolution. Every session again has two parts based on the problem under discussion: whether it was chosen by the husband or the wife. The couples had their interaction in the absence of any therapist or research staff.

Table 2.1. Behavioral coding systems used in the dataset: SSIRS (Social Support Interaction Rating System [107]) and CIRS (Couple Interaction Rating System [125])

SSIRS: global positive affect, global negative affect, use of humor, influence of humor by the other, sadness, anger/frustration, belligerence/domineering, contempt/disgust, tension/anxiety, defensiveness, affection, satisfaction, solicits partner's suggestions, instrumental support offered, emotional support offered, submissive or dominant, topic being a relationship issue, topic being a personal issue, discussion about husband, discussion about wife

CIRS: acceptance of the other, blame, responsibility for self, solicits partner's perspective, states external origins, discussion, clearly defines problem, offers solutions, negotiates, makes agreements, pressures for change, withdraws, avoidance

Behavioral Coding: Observational interaction measures by experts: As a part of the corpus, we also have manually-specified behavioral annotations for each spouse in each session, based on observations of the recorded audio-visual interaction of the couple.
The behavioral attributes of interest, which we refer to as the behavioral codes or simply codes, consist of 33 behavioral dimensions combining two established behavioral coding systems: the Couples Interaction Rating System (CIRS, [125]) and the Social Support Interaction Rating System (SSIRS, [107]). These codes are summarized in Table 2.1. Every session was annotated by multiple (2 to 9) human experts, and the average of their ratings is used as the reference. For the data we used, the average inter-annotator agreement of these codes in terms of Krippendorff's α [147] is 0.7528.

Marital Outcome Measures: The aforementioned couples therapy corpus has been used in a number of research studies on marital outcome in response to different therapies [14, 15, 47]. The two common scales to measure marital satisfaction are the Dyadic Adjustment Scale (DAS, [244]) and the Global Distress Scale (GDS, [242]). Simple comparison of pre-therapy and post-therapy scores using these scales can tell us empirically whether there has been any improvement in the relationship. Couples were categorized into four categories using the formula provided in Jacobson and Truax [121] and a composite relationship satisfaction score based on a combination of the DAS and the GDS. This categorical approach is more interpretable than a continuous score and useful for the couples therapy domain, since the categories are based on clinically significant change. In psychotherapy, the clinical significance of a change is qualitatively defined as the extent to which therapy moves a couple within the control group or functional population. Operational definitions of clinical significance are based on various statistical approaches and are discussed in [121]. The four derived categories are as follows:

Type 1: deteriorated (i.e., they got measurably worse over treatment)
Type 2: no change (i.e., no meaningful improvement)
Type 3: improved (i.e., they got measurably better over treatment, but the change is still clinically insignificant)
Type 4: recovered (i.e., they got measurably better over treatment and their score is above the upper cut-off for clinically significant distress)

These outcome types represent the recovery (or the lack thereof) of the couples at the time of either 26 weeks or 2 years relative to the time they started the therapy. In other words, one such outcome variable is associated with every combination of interaction sessions (pre-therapy to post-therapy). These outcome ratings will be considered as the reference labels for our automatic classification tasks in this study.

Even though the original corpus had 134 couples, the outcome ratings could not be recorded for some couples due to reasons such as dropout from the study, or a lack of sufficient information to rate them. The audio quality of some of the recordings was also poor. Moreover, some couples had these outcomes labeled only for one of the post-therapy sessions (either after 26 weeks or 2 years). After taking into account all such cases in the dataset, we had 141 instances of outcomes, which included (i) outcome after 26 weeks relative to pre-treatment, and (ii) outcome after 2 years relative to pre-treatment. Therefore, we have 141 samples in our analyzed dataset, every sample belonging to one of the four outcome classes (with ratings 1 through 4) shown in Table 2.2. Among these, 53 couples have both outcome variables (26 weeks and 2 years), and 35 couples have only one.
There are 229 recordings in total, each with two 10-minute problem-solving interactions, resulting in 458 10-minute interactions altogether.

Table 2.2. Number of data samples with different outcome ratings

Outcome | Decline | No Change | Partial Recovery | Recovery
Rating  | 1       | 2         | 3                | 4
Count   | 12      | 26        | 34               | 67

2.4 Acoustic Feature Extraction

In this section, we describe the process of acoustic feature extraction from the speech recorded during the dyadic conversations. Our aim is to capture cues from the recorded speech acoustic signal relevant to the behavioral outcomes of the speaker in general, and to the outcome of the couples therapy in particular. As a starting point, we extracted standard speech features of various kinds, including those which represent segmental spectral characteristics and prosody. Furthermore, we designed additional meta-features from these standard acoustic features to extract short- and long-term dynamics of the vocal cues of the interlocutors. These meta-features range from turn-level (L1) features within a session to cross-session features (L2). We discuss them in further detail in the following subsections.

2.4.1 Preprocessing of Audio Data

In this section, we describe the preprocessing steps employed to prepare the recorded speech data for automated feature extraction and subsequent analysis. We started with all the sessions that remained after the initial screening based on the availability of outcome measures. For every 10-minute session, we had single-channel continuous audio recorded from a far-field microphone (16 kHz, 16 bit). Originally the audio was collected with an analog recorder, and digital copies were made prior to processing of the data.

Voice Activity Detection: In our study, we focus on acoustic features extracted only for the speech regions in the audio recordings of the conversations. For this purpose, we used an automatic Voice Activity Detection (VAD) system, as described in [256], to separate the audio stream into speech and non-speech regions. This robust algorithm exploits the spectral characteristics of the audio signal to distinguish speech from background audio. More specifically, it extracts audio features like spectral shape, harmonicity, and long-term spectral variability with a long-duration context window and feeds them to a Multilayer Perceptron classifier. Since we do not have VAD ground truth (manually labeled speech and non-speech regions) for the couples therapy dataset, we used the manual transcripts to force-align the text with the audio [26] and come up with a proxy for the ground truth. On the evaluation subset of the data, the miss rate of VAD (speech detected as non-speech) was 17.1% and the false detection rate (non-speech detected as speech) was 13.6%.

Speaker Diarization: Since the speech was recorded continuously with a single-channel microphone during a conversation, we need to segment the speech regions belonging to each speaker (the husband's or the wife's speech) prior to further speech analyses. To achieve this, we performed speaker diarization in a two-step method: first, the algorithm segments the speech stream based on possible speaker changes using Generalized Likelihood Ratio based criteria in a frame-based analysis, following which speaker-homogeneous segments are clustered using agglomerative clustering [105]. This way we partition the entire interaction session into regions spoken by each of the speakers.
We also automatically identified the speakers as husband or wife using their average pitch information [205]. This simplistic approach was adequate since these conversations always involve two people of different genders, whose pitch patterns tend to be distinct. Based on a performance evaluation similar to that for VAD, the diarization error rate (DER) was found to be 27.6%. While this error rate is not satisfactorily low, it might reflect inaccuracies in the references, which are obtained by automatic speech-to-text alignment. There are also some instances of overlapped speech in the dataset, which are not recognized by the diarization algorithm.

2.4.2 Different types of acoustic features

Following the preprocessing steps, we extracted various acoustic features from each of the 458 10-minute sessions, which are already segmented into speaker-specific speech regions and separated from silence regions. The initial feature extraction is done on a frame-by-frame basis from the audio, every 10 ms with a 25 ms Hamming window. Pitch, intensity and Harmonics-to-Noise Ratio (HNR) were computed with the Praat toolbox [27], while all other features were extracted using openSMILE [68]. In total, we used 74 acoustic features in this study, deemed relevant for capturing the behavioral information of interest [26], and summarized in Table 2.3. While a larger number of acoustic features could be derived, given the data sample size we restricted the features to a smaller set that nevertheless captured essential speech properties, grouped into three categories: prosodic features, spectral features, and voice quality features.

Table 2.3. Basic acoustic features used in the study

Spectral: 15 MFCCs and their derivatives, 8 MFBs and their derivatives, 8 LSFs and their derivatives
Prosody: intensity, pitch and their derivatives
Voice quality: jitter, shimmer, Harmonics-to-Noise Ratio and their derivatives

Spectral features: Even though vocal prosody is more easily interpretable in terms of reflecting emotion and other psychological states of a speaker, speech spectral features are known to encode critical behavioral information [22, 26, 100, 149, 156, 157]. In this work, we use 15 Mel-frequency cepstral coefficients (MFCCs), 8 log Mel-frequency band features (MFBs) and 8 line spectral frequencies (LSFs). Their derivatives were also used as features.

Prosodic features: Pitch, intensity and their derivatives were the prosodic features used in our study. These features have been of wide interest in psychology research due to the interpretability they afford of the underlying behavioral mechanisms [73, 127, 214]. Prior behavioral signal processing research in couples therapy has also validated this through predictive modeling [26, 79, 156, 271]. We used Praat [27] to extract pitch (f0) and intensity, while other prosodic features were extracted using openSMILE [68].

Voice quality features: Jitter and shimmer are two widely used features for voice quality, and were also considered in this study. Jitter is the short-term cycle-to-cycle variation of pitch, whereas the analogous quantity for amplitude is called shimmer [69]. It has been shown that these capture paralinguistic information, and they are used in emotion recognition [7]. We have also used the derivatives of both jitter (also known as jitter-of-jitter) and shimmer. Another voice quality feature that we considered is the Harmonics-to-Noise Ratio (HNR), which estimates the noise level in the human voice signal.
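For a rough illustration of the frame-level analysis above, the sketch below extracts a subset of the Table 2.3 features on the same 25 ms / 10 ms Hamming-window grid. It uses librosa as a stand-in for the Praat and openSMILE tools actually used in this work, so exact values will differ; the pitch search range, the longer pitch analysis window, and the restriction to MFCCs, mel-band energies, pitch and intensity are illustrative choices.

```python
import librosa
import numpy as np

def frame_level_features(wav_path, sr=16000):
    """Frame-level features on a 25 ms Hamming window with a 10 ms shift:
    15 MFCCs, 8 log mel-band energies, pitch, intensity, and their deltas.
    A librosa stand-in for the Praat/openSMILE pipeline of Section 2.4.2."""
    y, sr = librosa.load(wav_path, sr=sr)
    n_fft, hop = int(0.025 * sr), int(0.010 * sr)  # 400 and 160 samples

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=15, n_fft=n_fft,
                                hop_length=hop, window="hamming")
    mfb = librosa.power_to_db(librosa.feature.melspectrogram(
        y=y, sr=sr, n_mels=8, n_fft=n_fft, hop_length=hop, window="hamming"))

    # Pitch tracking needs a longer analysis window than 25 ms to cover low f0.
    f0, _, _ = librosa.pyin(y, fmin=75.0, fmax=500.0, sr=sr,
                            frame_length=1024, hop_length=hop)
    intensity = librosa.feature.rms(y=y, frame_length=n_fft, hop_length=hop)

    # Stack the 25 base features and append their first derivatives.
    feats = np.nan_to_num(np.vstack([mfcc, mfb, f0[np.newaxis, :], intensity]))
    return np.vstack([feats, librosa.feature.delta(feats)])  # (50, num_frames)

# Example (hypothetical file name):
# feats = frame_level_features("session_speaker_region.wav")
```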
2.4.3 Static functionals

Frame-level analysis results in a high-dimensional data stream, both due to the high dimensionality of the features extracted within each frame and due to the high frame rate. In order to represent the vocal characteristics more compactly, statistics of the frame-level features, such as the mean, median and standard deviation, are often computed. In this work, we do the same for each of the interlocutors (husband and wife), resulting in two sets of static functionals for every session. As these are computed over one session for every speaker, without considering the temporal dynamics or the influence of the other speaker, we call them static functionals. This approach is common in most literature deriving session-level attributes from frame-level speech analysis [25, 26, 235, 262]; a sketch of the computation follows Table 2.4.

Representation       Input         Scope                   Definition
Raw features         Audio         25 ms window            As described in Table 2.3
Static functionals   Raw features  1 session (10 minutes)  Statistics over entire session
Short-term dynamic   Turns         1 session (10 minutes)  Statistics over all turns
Long-term dynamic    Segments      Duration of therapy     Delta between two sessions
Table 2.4. Different feature representations used in the study
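A minimal sketch of the static-functional computation is given below; the functional set shown (mean, median, standard deviation) mirrors the statistics named above, while the array layout and variable names are assumptions.

```python
import numpy as np

def static_functionals(frames):
    """frames: (n_frames, n_features) array of one speaker's frame-level
    features over a session. Returns one flat session-level vector of the
    mean, median and standard deviation of every feature."""
    stats = [np.nanmean, np.nanmedian, np.nanstd]   # NaN-safe for gaps
    return np.concatenate([f(frames, axis=0) for f in stats])

# One static-functional vector per interlocutor per session
# (variable names hypothetical):
# x_husband = static_functionals(husband_frames)
# x_wife = static_functionals(wife_frames)
```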
2.4.4 Dynamic functionals

Most literature aimed at extracting emotion or other behavioral constructs at a global level from speech relies on static functionals over the frame-level features or low-level descriptors [26, 157, 262]. This is a reasonable way to reduce the representation overhead of information for high-level inference. Yet it has also been recognized that, due to the high degree of data compression, important temporal information may be lost. This has motivated some works to employ diverse temporal information of speech features, especially in emotion recognition [165, 197].

Important behavioral patterns are inherently dynamic. For example, dynamic coordination of speech characteristics reflects the psychological states of the interlocutors [81]. In social contexts, it is also reflective of, and influential to, the nature of social relationships through communicative behavior [80, 215]. This motivates the use of the dynamic features that we discuss below. These are designed to be robust and to potentially capture dynamical patterns of speech encoded with behavioral information.

2.4.4.1 Short-term dynamic functionals

The acoustic features described in the previous section are based on features of each speaker in isolation, and hence do not fully capture interaction phenomena like dyadic coordination and entrainment. To address this, turn-level analysis is often adopted, for example in the context of emotion recognition [158, 236]. Lee et al. [156] have shown that interlocutors tend to adapt to each other's behavior during their interaction. This phenomenon, known as behavioral entrainment, is also reflected in speech acoustic patterns and thus motivates the use of features that can capture such coupled changes. The short-term dynamic functionals are computed as follows:

1. The mean of each acoustic feature over each turn of a speaker is computed. This way, every turn taken by the interlocutors is represented by the averaged acoustic features of that turn.

2. Next, we compute the differences ("deltas") between corresponding features in adjacent turns, within and across speakers. In the dyadic conversation setting of couples, we obtain three types of differences: husband-husband (HH) delta, husband-wife (HW) delta, and wife-wife (WW) delta features. One should note that another possible set of functionals, namely wife-husband (WH) deltas, contains the same information, albeit with a reversed sign. Hence they are not considered, to avoid unnecessarily increasing the feature dimensionality.

3. Finally, we use the statistical functionals of the turn-level delta features (as listed in Table 2.4) as short-term dynamic functionals.

The rationale behind using turn-level measures is that these turn-level differences or delta features can capture useful information about the mutual influence and self-influence of the behavioral patterns of the speakers over time within a session. The central idea of turn-level delta features is presented through a schematic in Figure 2.2.

Figure 2.2. Short-term dynamic functionals capture the statistics of differences between the means of features of adjacent turns in the interaction, both within an interlocutor (e.g., wife-to-wife turn changes) and across interlocutors (e.g., wife-to-husband turn changes)

2.4.4.2 Long-term dynamic functionals

Since we want to extract information about changes in a marital relationship between two different time points, one before therapy and the other after therapy, we constructed a set of functionals that connects both sessions. They are computed as described below:

1. After removing the silence regions, we split each session into four equal segments.

2. Next, we perform session-level feature normalization by subtracting the mean from each feature and dividing by the standard deviation, computed over that session. This reduces the effect of any mismatch in recording conditions between sessions.

3. Then we take the average of every feature over each quarter, separately for the husband and the wife. Each of these average values essentially represents a cumulative sample from the respective quarter.

4. Finally, we compute differences between the representative features from each quarter of the pre-therapy session and the corresponding quarter of the post-therapy session. These represent long-term functionals of the features with respect to the pre- and post-therapy sessions.

Conceptually, the design of the long-term dynamic functionals aims to capture two different aspects. First, it captures information from the four quarters of a session, thus allowing the features to represent the coarse evolution of dynamics within a session. Second, it captures the direct change in dynamics between the sessions before and after therapy. A sketch of both delta computations is given below.
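The sketch illustrates the turn-level deltas of the short-term functionals and the quarter-level pre/post deltas of the long-term functionals. The input layouts and function names are hypothetical.

```python
import numpy as np

def turn_deltas(turn_means, speakers):
    """Short-term deltas: turn_means is an (n_turns, n_feats) array of
    per-turn feature means in temporal order; speakers is a matching list
    of 'H'/'W' labels. Adjacent-turn differences are grouped into HH, HW
    and WW streams; WH pairs are folded into HW with the sign flipped,
    since they carry the same information."""
    deltas = {'HH': [], 'HW': [], 'WW': []}
    for i in range(len(speakers) - 1):
        d = turn_means[i + 1] - turn_means[i]
        pair = speakers[i] + speakers[i + 1]
        if pair == 'WH':
            deltas['HW'].append(-d)
        else:
            deltas[pair].append(d)
    return {k: np.asarray(v) for k, v in deltas.items()}

def quarter_deltas(pre, post, n_quarters=4):
    """Long-term deltas: pre/post are (n_frames, n_feats) arrays of one
    speaker's features from the pre- and post-therapy sessions, already
    z-normalized per session (step 2 above)."""
    def q_means(x):
        return np.array([q.mean(axis=0) for q in np.array_split(x, n_quarters)])
    return q_means(post) - q_means(pre)   # one delta row per quarter
```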
2.5 Manually-derived behavioral codes as features

In this study, our aim is to investigate whether and how well we can automatically recognize the outcome of marital therapy directly from the speech acoustic features of a couple's interaction. The factors that underlie and influence an outcome such as relationship status are complex and multifaceted. It is against this backdrop that we explore what insights an automated, signal-driven machine-learning approach can offer. We are also interested in investigating how this direct signal-based prediction compares to a human-driven approach of manually extracting behavioral information and using it to predict relationship status change post-therapy. For this purpose, we used the annotations for a set of behavioral codes provided by experts, as described in Section 2.3. The code set consists of 33 codes in total. All behavioral codes were defined using elaborate guidelines and rated on a scale from 1 ("not present") to 9 ("maximally present"). For example, a rating of 8 on the behavioral code for "blame" means the individual was heavily blaming his/her partner during the interaction, whereas a rating of 1 means there was no blame at all. It should be noted that these codes are based on the judgments of raters using all modalities of interaction present in the video recordings, i.e., speech patterns, facial expressions and other gestures, and language information. In other words, these codes are based on both the verbal and non-verbal behavior of the couple, made available to the trained annotators. On the other hand, one limitation of the codes is that, since they are each designed for the behaviors of interest in specific research studies, they do not capture the complete behavioral information exhibited by the individuals. Furthermore, they are also affected by the subjective bias inherent in human annotations [188].

2.6 Correlation Analysis of Features with Outcomes

After extracting the speech acoustic features and computing functionals of those features, we analyze their relevance to the outcome variable of interest, i.e., the relationship status of the couple. In this section, we present a correlation-based analysis to compare the relevance of different features to the task of inferring the outcome.

We compute Pearson's correlation coefficient between the outcome and every acoustic feature considered (represented by its static functionals). For this experiment, we binarized the outcome variable into two classes: recovery (outcome rating 4) vs. no recovery (outcome ratings 1, 2, and 3 combined). Pearson's correlation ranges between -1 and +1 and quantifies both the degree and the direction of the linear association between the variables. More specifically, a positive value of the coefficient means that higher levels of one variable are associated with higher levels of the other, while a negative value means that higher levels of one variable are associated with lower levels of the other.

Rank  Feature      Category       Functional  Coefficient  p-value
1     MFCC         spectral       mean        -0.2997      0.0003
2     Loudness     prosodic       std. dev.    0.2983      0.0003
3     MFB          spectral       median       0.2859      0.0005
4     Jitter       voice-quality  mean        -0.2791      0.0006
5     Pitch delta  prosodic       mean         0.2772      0.0008
Table 2.5. Pearson's correlation coefficients of the top 5 features and the corresponding functionals (all correlations are statistically significant, i.e., p < 0.05)
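The correlation screening described above amounts to the following computation; scipy's pearsonr also supplies the two-tailed p-value used in the significance test discussed next. The data layout is an assumption.

```python
import numpy as np
from scipy.stats import pearsonr

def rank_by_correlation(X, ratings):
    """X: (n_samples, n_features) static functionals; ratings: outcome
    ratings 1-4, binarized into recovery (4) vs. no recovery (1-3).
    Returns (feature index, r, two-tailed p) triples sorted by |r|."""
    y = (np.asarray(ratings) == 4).astype(float)
    rows = []
    for j in range(X.shape[1]):
        r, p = pearsonr(X[:, j], y)   # two-tailed test against r == 0
        rows.append((j, r, p))
    return sorted(rows, key=lambda t: -abs(t[1]))
```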
In Table 2.5, we report the five features most correlated with the outcome, based on the magnitude of Pearson's correlation coefficient. In this analysis, for every acoustic feature we chose the functional with the highest correlation magnitude; we then compared these across all features to arrive at this list of most relevant features. It should be noted that some of the features are correlated among themselves, and thus this list cannot be considered a sufficient way of identifying the efficacy of the features. However, it provides a straightforward and interpretable way to look into the relevance of the features, complementing the classification experiments discussed in the following section. Moreover, we perform a two-tailed significance test of correlation to determine whether these correlations are statistically significant. More specifically, we tested against the null hypothesis that the corresponding feature is not correlated with the binary outcome variable. For all the features mentioned in Table 2.5, p < 0.001 is obtained, which indicates significant correlation.

In Figure 2.3, we show the scatterplot of the two (normalized) prosodic features with the highest correlation coefficient values: standard deviation of loudness (r = 0.2983) and mean pitch delta (r = 0.2772). From the plot (as well as from the positive sign of the correlation coefficients), one can infer that large changes in pitch (i.e., high values of mean pitch delta) and high variation in loudness (i.e., high values of its standard deviation) are associated with a positive outcome.

Figure 2.3. Scatter plot of the two prosodic features (normalized) with highest correlation: loudness (r = 0.2983) and pitch delta (r = 0.2772). The corresponding static functionals are standard deviation and mean, respectively. Class 0 and class 1 represent no recovery and recovery cases, respectively.

2.7 Classification Experiments

The goal of our classification experiments is to investigate the possibility of inferring a distressed couple's marital outcome from the speech patterns of their interaction. As mentioned in Section 2.3 and shown in Table 2.2, the outcome can take one of 4 defined ratings (1-4). It should be noted from Table 2.2 that the different numbers of couples belonging to the different outcome classes create a large class imbalance, which affects the performance of most classification algorithms [1]. We therefore decided to conduct multiple classification experiments, listed below:

Experiment 1: Classification of all data samples into 2 classes, i.e., complete recovery (rating 4) vs. incomplete or no recovery (ratings 1, 2, 3 combined)

Experiment 2: Classification of instances of no (or incomplete) recovery into finer levels, i.e., rating 1 vs. rating 2 vs. rating 3

Experiment 3: Classification into each possible outcome, i.e., ratings 1 through 4

As the number of classes increases from Experiment 1 to Experiment 3, the difficulty of the classification also increases: Expt. 3 > Expt. 2 > Expt. 1.

2.7.1 Experiments with different feature sets

For each of these experiments, we investigate the performance of various feature sets extracted from the pre- and post-therapy sessions:

1. acoustic features with static functionals,
2. acoustic features with dynamic functionals (both short-term and long-term),
3. acoustic features (with all functionals),
4. manually (human-)derived behavioral codes as features,
5. all features (acoustic features with all functionals and behavioral codes combined).

For each of the classification tasks, we perform z-score normalization on every feature and use a feature selection method to select an optimal subset of features. Also, to account for variability in the dataset, 10-fold cross-validation is performed. While generating the cross-validation subsets, the two post-therapy sessions from the same couple (after 26 weeks and 2 years) are always put together in a single subset (either training or test). In this way, we ensured that there was no data contamination between the training and test datasets.
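This couple-disjoint splitting can be expressed with a group-aware cross-validator, as sketched below with scikit-learn's GroupKFold; the toy data are placeholders.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))               # toy feature matrix
y = rng.integers(0, 2, size=40)            # toy binary outcome
couple_ids = np.repeat(np.arange(20), 2)   # two sessions per couple

for train_idx, test_idx in GroupKFold(n_splits=10).split(X, y, groups=couple_ids):
    # No couple ever contributes to both the training and the test fold.
    assert not set(couple_ids[train_idx]) & set(couple_ids[test_idx])
```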
2.7.2 Classifier

We set up the prediction problem as three different classification problems and use the well-known Support Vector Machine (SVM) algorithm for all three. The SVM is a binary classifier by origin, yet it has since been extended to solve multi-class problems and shown to perform well [119]. For the multi-class problems, we used the one-against-all method, which, as the name suggests, decomposes the multiclass problem into a number of binary classification problems. Throughout all experiments we used the radial basis function (RBF) kernel. The standard parameters of the RBF-kernel SVM, namely C and γ, were optimized by a simple grid search, separately for each feature set and each experiment. As an example, C = 1000 and γ = 0.001 were chosen as optimal for Expt. 1 with all features.

2.7.3 Feature selection

The feature extraction (Section 2.4) leads to a high-dimensional feature set, particularly compared to the sample size of the available training data. We perform feature selection to choose a subset of the original features that provides the maximum information in the context of a particular classification problem. We consider two feature selection approaches in this work. First, we use a simple correlation-based feature selection method, where we rank all features using Pearson's correlation coefficient (discussed earlier in Section 2.6) as the selection criterion. Next, we also use the Mutual Information Maximisation (MIM) [163] feature selection method available as part of the FEAST toolbox [32]. In this method, every feature X_k is given a mutual information score with respect to the class label Y as follows:

J_MIM = I(X_k; Y)   (2.1)

Features with the highest mutual information scores are selected, and the optimal number of features is determined using cross-validation. We obtained better prediction results using the MIM method and decided to use it for all subsequent experiments.

2.7.4 Results

Table 2.6 shows the classification accuracy of the different feature sets using the SVM classifier. In the table, the mean accuracy and standard deviation over all cross-validation folds are reported for each setup. In addition, the original dimensionality of each feature set is reported. Every feature set was reduced by feature selection prior to the actual classification; across the different experiments, around 10% to 20% of the original features were selected. The first row contains the chance accuracy, computed as the percentage of samples belonging to the largest class.

Feature set                 Dim.    Expt. 1        Expt. 2        Expt. 3
                                    mean    SD     mean    SD     mean    SD
Chance                      -       51.8    -      47.2    -      48.2    -
Behavioral codes            264     75.6    13.5   65.4    14.7   61.8    11.2
Static functionals          3552    76.4    10.0   70.9    13.8   63.2    11.4
Dynamic functionals         6696    78.9    7.6    71.1    12.8   61.5    12.3
Acoustic (all functionals)  10248   79.3    10.2   72.6    13.0   64.1    12.8
All features                9144    79.6    7.4    74.6    12.6   64.1    13.2
Table 2.6. Classification accuracy (mean and standard deviation over all folds of cross-validation) for the different experiments (columns) with different feature sets (rows)
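A sketch of the overall classification pipeline follows. scikit-learn's mutual_info_classif plays the role of the J_MIM score of eqn. (2.1) (the thesis used the FEAST toolbox), and the one-vs-rest wrapper reproduces the one-against-all decomposition; the k, C and γ values shown are placeholders for the grid-searched values.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

pipe = Pipeline([
    ('zscore', StandardScaler()),                      # per-feature z-normalization
    ('mim', SelectKBest(mutual_info_classif, k=100)),  # keep the top-k I(X_k; Y) features
    ('svm', OneVsRestClassifier(                       # one-against-all decomposition
        SVC(kernel='rbf', C=1000, gamma=0.001))),
])
# Per cross-validation fold: pipe.fit(X_train, y_train); pipe.score(X_test, y_test)
```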
Feature set                 Expt. 1        Expt. 2        Expt. 3
                            mean    SD     mean    SD     mean    SD
Behavioral codes            0.68    0.12   0.49    0.11   0.48    0.11
Static functionals          0.56    0.10   0.60    0.07   0.52    0.09
Dynamic functionals         0.63    0.05   0.59    0.07   0.50    0.09
Acoustic (all functionals)  0.70    0.09   0.64    0.08   0.57    0.11
All features                0.78    0.07   0.64    0.09   0.56    0.10
Table 2.7. F-scores (mean and standard deviation over all folds of cross-validation) for the different experiments (columns) with different feature sets (rows)

As our dataset is highly imbalanced (especially for the multiclass classification), we also computed F-measures [200] of the predicted labels for each setup. The mean and standard deviation of the F-scores over all cross-validation folds are shown in Table 2.7. By definition, the F-measure lies in the interval (0, 1), and a higher value signifies better classification quality.

There are several observations to make from the obtained classification accuracy and F-measures. First, in general, classification based on speech acoustic features tends to outperform classification with the behavioral codes extracted by human experts. Specifically, acoustic features (with all functionals) outperformed behavioral codes in terms of accuracy by 2.1% in Expt. 1, 6.9% in Expt. 2, and 1.6% in Expt. 3 (absolute). It is encouraging to see that acoustic features derived directly from the signal can capture information relevant to predicting couples' relationship status, better even than domain experts can via the manually coded behaviors.

Comparing the different acoustic features, we observe that dynamic functionals perform better than static ones in Expts. 1 and 2. In Expt. 3, however, static functionals achieved better accuracy. The significance and complementarity of both can be seen through the use of all the features together.

The results of fusing the manual-rating-based features and the acoustic features are mixed. While fusion appears to help classification in Experiments 1 and 2, we obtain lower accuracy in Experiment 3. We believe the reason for this might be overfitting on some behavioral features. For this experiment, the training accuracy (averaged over cross-validation folds) using all features is 73.4%, about 9% higher than the accuracy on the test subsets. This indicates that some behavioral codes may have been selected by the feature selection algorithm from the combined feature set because they helped achieve high accuracy on the training subsets of cross-validation, but failed to do so on the test subsets. Moreover, issues like data imbalance and data sparsity become more prominent in Experiment 3 due to the higher number of classes. Another possible explanation for this pattern of findings is that Experiment 3 involves prediction of both changes in and levels of relationship satisfaction, while Experiments 1 and 2 involve prediction of only changes in relationship satisfaction. Previously published work on this corpus [12] found that associations between acoustic features and levels of relationship satisfaction depend on the wives' pre-treatment relationship satisfaction and on the type of couples therapy a couple received. Type of couples therapy and wife pre-treatment relationship satisfaction were not included in the analyses in the current chapter, and that may be one reason for the different pattern of results across experiments.

We also perform a two-tailed exact binomial test [229] to verify whether the difference in classification results of different feature sets (reflected in the accuracy and F-score measures) is statistically significant. In particular, our null hypothesis in each test is that the results of the two feature sets are not significantly different from each other. The p-values are reported in Table 2.8. We observe that using acoustic features produces significantly different results in comparison to using behavioral codes. The differences in performance between all acoustic features (including dynamic functionals) and static functionals only are significant as well. Finally, in most cases, combining acoustic features and behavioral codes makes a significant difference in performance, which indicates the presence of complementary information in the behavioral codes and the acoustic features. The only exception is all features combined vs. the acoustic feature set with all functionals for Experiment 3.

Comparison                           Expt. 1   Expt. 2   Expt. 3
Acoustic (all) vs. Behavioral Codes  0.016*    0.028*    0.027*
Acoustic (all) vs. Static            0.034*    0.042*    0.039*
All features vs. Behavioral Codes    0.013*    0.008*    0.025*
All features vs. Acoustic (all)      0.025*    0.045*    0.079
Table 2.8. p-values of the statistical significance test against the null hypothesis that there is no significant difference in the performance of the two compared feature sets. Entries marked * indicate a statistically significant difference (p < 0.05)

In addition, we report in Table 2.9 the 95% confidence intervals of the statistic computed in each hypothesis test, using the Clopper-Pearson method [50].

Comparison                           Expt. 1          Expt. 2          Expt. 3
Acoustic (all) vs. Behavioral Codes  (0.019, 0.243)   (0.284, 0.395)   (0.159, 0.271)
Acoustic (all) vs. Static            (0.276, 0.294)   (0.221, 0.258)   (0.376, 0.457)
All features vs. Behavioral Codes    (0.009, 0.133)   (0.156, 0.237)   (0.184, 0.208)
All features vs. Acoustic (all)      (0.240, 0.303)   (0.298, 0.334)   (0.029, 0.311)
Table 2.9. 95% confidence intervals of the statistic for the significance test comparing different feature sets
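As one plausible reading of this test (the thesis does not spell out the exact statistic), the sketch below runs a two-sided exact binomial test on the samples where exactly one of the two compared feature sets is correct, and reports the 95% Clopper-Pearson interval via scipy.

```python
from scipy.stats import binomtest

def compare_classifiers(correct_a, correct_b):
    """Two-sided exact binomial test on the trials where exactly one of
    the two classifiers is correct (a McNemar-style reading of the exact
    binomial test above). correct_a/correct_b are per-sample booleans.
    Returns the p-value and a 95% Clopper-Pearson interval."""
    only_a = sum(a and not b for a, b in zip(correct_a, correct_b))
    only_b = sum(b and not a for a, b in zip(correct_a, correct_b))
    res = binomtest(only_a, only_a + only_b, p=0.5, alternative='two-sided')
    ci = res.proportion_ci(confidence_level=0.95, method='exact')
    return res.pvalue, (ci.low, ci.high)
```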
As can be observed, the confidence intervals are narrow in most cases. The software employed in this work can be found at http://scuba.usc.edu/software.

2.8 Conclusion

In this chapter, we presented a study on automatically predicting the marital relationship status of distressed couples in therapy using acoustic information from their speech. We presented a framework for capturing behaviorally significant acoustic features from the spoken interactions of couples engaged in problem-solving discussions. We also introduced knowledge-driven features capturing short-term and long-term acoustic descriptors, inspired by previous studies on human interactions. We compared this automatic approach of capturing important behavioral information directly from the speech signal to the traditional approach taken by psychologists, i.e., manual coding of behavior from therapy sessions.

In the multiple classification experiments, we observed that the acoustic features from speech capture more information relevant to predicting the marital outcomes than the behavioral dimensions manually constructed by human experts. Even though the behavioral codes were not designed to predict outcomes themselves, they function as behavioral descriptors of the couple, and one can expect them to be informative about the outcome, based on the observational methods of psychology.
In the future, we can also analyze the importance of other communication modalities, including language use (i.e., what is being spoken) and visual cues (e.g., head movement and other facial and body expressions). One can also investigate more complex temporal modeling (e.g., hidden Markov models, dynamical systems modeling) of the behaviors captured through the acoustic features extracted from the speech signal. Also, automatic recognition of the mental states (such as emotional arousal) of the speakers and investigation of the dynamics of local behavioral cues might be useful.

Chapter 3

Nonlinear Dynamical Systems Modeling of Dyadic Interactions

3.1 Introduction

In this work, we explore the possible link between system coordination and the complexity of conversational speech. Our analysis focuses on interactions of married couples that were clinically assessed to have a distressed relationship. This work is motivated by several studies in the emerging domain of behavioral signal processing [185], which have shown that human interaction dynamics are influenced by the underlying behavioral states of the interlocutors [26]. Lee et al. [156] quantified entrainment reflected in prosody and investigated its relationship with codified behavioral attributes of speakers, such as positivity and negativity. Several studies have shown the importance of mutual influence on emotion during dyadic interactions [46, 154, 278]. Finally, our previous study on couples' interactions used different acoustic features, including turn-level information within and across speakers, to predict possible relationship status change [192].

The current work also attempts to link signal-driven approaches to quantifying a couple's behavior with the extensive theoretical and empirical research in couples therapy that takes a dynamical systems perspective. Felmlee and Greenberg [70] modeled the dyadic interaction of intimate couples as a dynamical system and argued that 'cooperation' between the couple leads to more stability in the system. Karney et al. [132] investigated how complexity in cognitive behavior influences marital relationships. Gottman et al. [84] also proposed mathematical models for behavioral constructs (such as marital satisfaction) in couples.

We propose a framework to analyze different complexity measures of spoken interactions using prosodic features (pitch and energy) of the observed speech signal. We associate the feature streams from each speaker with individual dynamical systems, while the features from both speakers together are used to model a joint system, capturing coordination between the individuals. The complexity measures computed on these systems are then investigated in relation to behavioral codes characterizing the dyad and the outcome of the couples therapy.

3.2 Related literature: Entrainment measures

While previous work on behavioral synchrony has offered many insights into human interaction dynamics, methods for assessing and quantifying the degree of behavioral entrainment have received little attention. Except for a few notable studies (e.g., [28, 85]), the computational techniques for quantifying entrainment have been largely based on log-linear models of highly reductionistic, categorical manual observation coding of behaviors.
Another body of work has provided empirical evidence of entrainment [115, 143, 144, 159, 160, 161] and its role in various social scenarios and contexts, such as tutorial dialogs [263], supreme court hearings [20], multi-party conversations [170, 219] and even acted interactions, e.g., movies [56].

Many works have shown that accommodation takes place through prosodic characteristics such as pitch, intensity and intonation contours [59]. The work in [97] studied pitch and energy convergence in dyadic interviews and associated them with the quality of the interview. Another work [99] analyzed the long-term averaged spectra (LTAS) of the low-frequency spectrum, computed by FFT, of conversations from the Larry King Live show. In [98], very similar methods were applied to more general hypotheses, along with accommodation of voice frequency and amplitude. Many other works used correlation and statistical measures of speech rate, speech duration and silence duration as entrainment measures [65, 178]. A more rigorous time-series approach based on the cross-correlation function (CCF) and the time-aligned moving average (TAMA) [142, 145] was proposed to study the bidirectional nature of entrainment and feedback. Another work [58] used the correlation of prosodic features as an entrainment measure and applied Fisher's transformation to verify its significance. The joint distribution of prosodic patterns in consecutive turns of the speakers, preceded by quantization, proved useful for measuring empathy in addiction counseling sessions [272]. This provides a simple yet powerful way to measure how much influence one speaker has on the other, as well as the nature of that influence. The study in [153] used correlation coefficients, mutual information and spectral coherence of pitch and energy in married couples for predicting positive vs. negative affect. In [155], PCA-based measures of entrainment through acoustic features (prosodic as well as spectral features) were used and analyzed for their association with behaviors in couples, such as withdrawal and discussion. A more rigorous treatment of similar measures was presented in [156] and shown to be useful in detecting affect in couples. That measure was computed as the KL divergence (both directional and symmetric) between consecutive speaker turns, via a distribution-like treatment of the quantized variance vectors of the PCA space of features.

3.3 Similarity measures in various domains

3.3.1 Probabilistic/Information Theoretic Measures

Transfer Entropy [232]: Assuming that two time series are finite-length Markov processes, the transfer entropy X → Y measures the deviation of process Y from being independent of the other process's (X's) past samples. It is a directional measure, computed as the expected Kullback-Leibler divergence between two conditional distributions: one conditioned on the past samples of Y alone, and one conditioned on the past samples of X and Y together. It has been extended to the multivariate case by [228]. Applications: connectivity between different parts of the brain [258, 269], finding the direction of disturbance in chemical processes [16], structural systems [195].
Application: biomedical and clinical data Kolmogorov-Sinai (KS) entropy [51]: it can be defined as the mean rate of cre- ation of information in the state of the system (with respect to its history). If the phase space of a system hasD degrees of freedom and is partitioned into hyper- cubes of content e D and the state of is measured at intervals of time t, then KS entropy, H KS = lim t!0 lim e!0 lim n!¥ [H n+1 H n ] where H n is the entropy of the subsequence of samples up to index n. Applications: It was introduced to analyze complex physiological time series such as ECG, as a part of multiscale entropy analysis [51]. 51 Approximate entropy [212]: It can be described as a measure of rate of entropy for approximating a Markov chain as a process. It is the log-likelihood of a pattern in the time-series being close to each other for m observations on next incremental comparisons. ApEn(m;r;N)= lim N!¥ [F m (r)F m+1 (r)] F m (r) is a statistic defined as a function of r runs, N data points and observation window length m. Details of the algorithm to compute this statistic can be found in [213]. Applications: Heart-rate signal complexity for detecting aborted-SIDS infants [213]. 3.3.2 Similarity Measures used for Clustering Clustering has been an interesting and challenging problems in time series analysis [166]. There are three broad categories of approaches for this kind of clustering: using the raw time series, either in time domain or frequency domain using features extracted from the time series, (pretty much application domain- dependent). using the model that is built to characterize the time series Among these, the first approach is very relevant, since most of these clustering meth- ods rely on some similarity measures between two given time series. For example, Golay 52 et al. [83] used two cross-correlation based distance metrics for times series clustering for applications in fMRI. d 1 = 1 cc 1+ cc b ; 0<b < 1 d 2 = 2(1 cc) where cc is the Pearson’s cross-correlation between the two time series. Since it is based on Pearson’s correlation, it cannot capture the temporal co-evolution of time series. Another distance metric, Short Time Series (STS), used in fuzzy clustering of time series [183] considered each time series as a piecewise linear function and is defined as the sum of the squared differences of the slopes in the two time series. Researchers [128] also came up with probabilistic distance measures which captures how different the probabilities of transitions are for the two time series. J divergence and Chernoff information divergence are such measures, which are based on KL diver- gence and rely on the stationarity assumption. However, extensions of the same has been proposed in [54, 239] for measuring non-stationary time series by considering locally stationary segments. 3.3.3 Wavelet-based Similarity Measures One popular way of finding this similarity involves wavelet analysis. 53 One such work [247] employed Haar wavelet basis to perform orthonormal decom- position of two time series and proposed various distance metrics in that representation to closely correspond ‘subjective sense of similarity’ of those two time series. Another work [101] proposed performing Cross Wavelet Transform (XWT) with two time series and computing the Wavelet Coeherence, which provides a measure of local similarity between the time series in time-frequency space. 3.3.4 System Theoretic Measures Most of these measures assume an underlying state-space model. 
The most basic metrics are based on Fourier Transform and filter analysis. For example the dimension of filter with a standard input such as a step signal or a si- nusoid to analyze a signal with certain minimum error of prediction would provide a simplistic way to describe how ‘complex’ a signal is. Detrended Fluctuation Analysis (DFA) Correlation Dimension Information dimension Recurrence quantification analysis (RQA) [265] measure the tendency of a signal to show similar repetitive characteristics over time through Recurrence Plots (RP) . Lyapunov Coefficient [64] 54 Cross-recurrence analysis (CRA) [177] is an approach using complex network to characterize the simalirty of two time series. This method embeds both time series into the same phase space and uses recurrence plot techniques to find patterns. It has been used for measuring limb co-ordination [225]. 3.4 Feature Extraction 3.4.1 Audio Preprocessing As the first step, we segment the raw audio stream into speaker-homogeneous regions. The segmentation quality is crucial because the dynamical analysis performed on a cer- tain speaker’s speech may be erroneous if it is corrupted by speech segments from an- other speaker. Therefore, instead of relying on a fully automatic process of voice activ- ity detection (V AD) and speaker diarization, we use a speech-text alignment algorithm, SailAlign [133]. It uses the transcription of the audio to obtain more accurate timestamps for each segment. 3.4.2 Prosodic Features: Pitch and Energy We extract two commonly used prosodic features for all speech regions: pitch and energy. A state-of-the-art pitch tracking algorithm implemented in Praat toolbox [27] is used to extract pitch and energy. However, since pitch detection is not very robust under noisy conditions, we perform a smoothing by median filtering with a window size of 5 samples 55 for every speaker-homogeneous segment on the raw pitch stream. We also perform linear interpolation for the instances in the speech region where pitch is detected to be zero such as for unvoiced phoneme regions. This step also attempts to rectify jumps in pitch that may happen due to doubling or halving error. We do not perform any smoothing for energy since this feature is more robust to outliers. Then pitch normalization is performed for each speaker on a logarithmic scale: f 0 norm = log( f 0 = f 0 m ), where f 0 m is the mean pitch per speaker. Similarly, we normalize energy as E norm = E=E m where E m denotes the mean energy per speaker. 3.5 Complexity of Nonlinear Dynamical Systems According to dynamical systems theory, a time-series can be described through a map- ping function starting from an initial state and evolving over a state space. In this section, we first describe a method to obtain an embedded representation of the state space of the considered time-series; then we discuss four different measures of complexity of the dy- namical system. The complexity measures are calculated on reconstructed space (with the exception of the last measure in Section 3.5.2, obtained by Katz’s algorithm) and aim to characterize how chaotically the system behaves. 3.5.1 Reconstructed State Space Embedding The reconstructed state-space basically consists of a set of shifted versions of the orig- inal signal. Let us consider a scalar time series z(n) sampled from s(t), which is the 56 observed signal of an nonlinear dynamical system with a finite-dimensional state space. 
3.5 Complexity of Nonlinear Dynamical Systems

According to dynamical systems theory, a time series can be described through a mapping function starting from an initial state and evolving over a state space. In this section, we first describe a method to obtain an embedded representation of the state space of the considered time series; then we discuss four different measures of complexity of the dynamical system. The complexity measures are calculated on the reconstructed space (with the exception of the last measure in Section 3.5.2, obtained by Katz's algorithm) and aim to characterize how chaotically the system behaves.

3.5.1 Reconstructed State Space Embedding

The reconstructed state space essentially consists of a set of shifted versions of the original signal. Let us consider a scalar time series z(n) sampled from s(t), which is the observed signal of a nonlinear dynamical system with a finite-dimensional state space. The temporal evolution of the system is characterized by a mapping function Φ as x(t) = Φ_t(x(0)), where x denotes a state in the original state space of the system. Given this formulation, one can construct a mapping F from the original state space to a reconstructed state space in R^d, as shown in eqn. (3.1), where d is called the embedding dimension and Δ is the time delay. This mapping is also known as the delay coordinates map.

x ↦ y = F(x) = (s(t), s(t + Δ), ..., s(t + (d - 1)Δ))   (3.1)

Although this formulation was originally justified by the celebrated Takens' theorem [250] for continuous signals s(t), it was later extended to discrete-time signals or time series z(n) by considering z(n) as a sampled version of s(t) [129, 204, 231]. As can be seen from eqn. (3.1), the embedding depends on the parameters d and Δ. In this work, we first estimate the time delay Δ by finding the location of the first local minimum of the mutual information function of the signal with its delayed versions [72]. Next, we estimate the optimal embedding dimension d using Cao's method [37], which requires the value of Δ. Finally, we embed the time series into the reconstructed state space using eqn. (3.1). Once the reconstructed state space has been established, the temporal evolution of the system can be examined from this space itself. Analysis of the reconstructed state space reveals different characteristics of the chaotic patterns (attractors) present in the system through different complexity measures.
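The delay-coordinates map of eqn. (3.1) is straightforward to implement; the sketch below assumes d and Δ have already been chosen (e.g., by Cao's method and the first minimum of the auto-mutual-information, which are not implemented here).

```python
import numpy as np

def delay_embed(z, d, delta):
    """Delay-coordinates map of eqn. (3.1): embed a scalar series z(n)
    into R^d using time delay `delta` (in samples)."""
    n_points = len(z) - (d - 1) * delta
    return np.column_stack([z[i * delta : i * delta + n_points]
                            for i in range(d)])

# e.g., a noisy sinusoid embedded in 3 dimensions with a delay of 8 samples:
z = np.sin(0.1 * np.arange(1000)) + 0.01 * np.random.randn(1000)
Y = delay_embed(z, d=3, delta=8)   # Y.shape == (984, 3)
```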
3.5.2 Different Complexity Measures

3.5.2.1 Lyapunov Exponents

The Lyapunov exponents (LE) describe the sensitivity of a system to its initial conditions and the dynamics of neighboring trajectories in the state space embedding. Formally, a Lyapunov exponent is defined as the average exponential rate of convergence or divergence of two neighboring trajectories in a given direction of the embedded space. Let us assume that ‖δx(0)‖ is the initial separation of two trajectories and ‖δx(t)‖ is their separation after time t. Then the Lyapunov exponent λ_i in the i-th direction is given by:

‖δx(t)‖ / ‖δx(0)‖ = e^{λ_i t}   (t → ∞)   (3.2)

Largest Lyapunov exponent (LLE): The largest Lyapunov exponent λ_m is of specific interest because it is much easier to compute in a robust way and provides a measure of the complexity of the dynamical system. It is often calculated to discriminate between the periodic and chaotic nature of a dynamical system. A positive value of the largest Lyapunov exponent typically indicates chaos, whereas a negative value indicates orbits in the phase space approaching a common fixed point. The notion of Lyapunov exponents has been extended to nonlinear time series (discrete time) [64], as required in real-world applications. The present study uses the robust algorithm proposed by Sato et al. [230] to estimate the largest Lyapunov exponent from the reconstructed time series embedding.

3.5.2.2 Fractal Dimensions

The fractal dimension is a general term for several measures of the geometric complexity of a set or a pattern. In the case of a dynamical system in the embedded space, it refers to the active degrees of freedom of the system as reflected in the chaotic behavior of the attractor [252]. The fractal dimension represents a lower bound on the number of equations required to model the underlying dynamical system. In this work, we used three measures of fractal dimension [67], as described next.

Correlation Dimension (CD) [96] is probably the most widely used fractal dimension measure for nonlinear time series analysis. It is the second member (q = 2) of the infinite family of generalized dimensions D_q defined by Grassberger et al. [95]. The correlation dimension D_2 is defined by the power-law scaling of the correlation sum C_d(r), which computes the fraction of pairs of points closer than r, as shown in eqns. (3.3a) and (3.3b):

C_d(r) ∝ r^{D_2}   (3.3a)
C_d(r) = (2 / (N(N - 1))) Σ_{i=1}^{N} Σ_{j=i+1}^{N} Θ(r - ‖y_i - y_j‖)   (3.3b)

where Θ(·) is the Heaviside step function, with Θ(x) = 1 for x ≥ 0 and zero elsewhere, and y_i, y_j are points in the reconstructed state space. Takens [250] proposed a maximum likelihood estimator for the correlation dimension, which has been used in this work.

Information Dimension (ID), proposed by Badii and Politi [218], considers the k nearest neighbors of a reference point and uses the distances of these points from the reference point in the time-delay reconstructed space, as shown below:

D = log N / ((1/N) Σ_{n=1}^{N} log r_n)   (3.4)

where r_n is the average distance of the n-th reference point from its neighbors, and N is the number of samples. This is known as a fixed-mass approach, with k as a parameter.

Fractal Dimension by Katz's algorithm (KD) [134] treats the signal as a piecewise linearly connected set of points in the time series representation (z(n) vs. n). If L is the total length of the curve, i.e., the sum of distances between consecutive points, one can compute the fractal dimension D with the following formula:

D = log n / (log d_c - log L + log n)   (3.5)

where n = L / ā, with ā denoting the average distance between successive points, and d_c is the diameter of the curve, defined as

d_c = max_i max_j ‖z_i - z_j‖   (3.6)

Numerous definitions of fractal dimension have been introduced in the literature to quantify the complexity of nonlinear dynamical systems. Although these definitions are related, and similar in the sense that a higher value means a more complex system, some of them are intuitively different and capture different aspects of complexity. The above three approaches were chosen in an attempt to sample over different classes of algorithms with distinct characteristics, in order to obtain complementary information about the system.
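Of the three, Katz's dimension operates directly on the raw series and makes a compact illustration. The sketch below follows eqns. (3.5)-(3.6), including the O(N^2) all-pairs diameter as written in eqn. (3.6).

```python
import numpy as np
from scipy.spatial.distance import pdist

def katz_fd(z):
    """Katz's fractal dimension (eqns. 3.5-3.6): the series is treated as
    a piecewise-linear curve of points (n, z(n))."""
    pts = np.column_stack([np.arange(len(z), dtype=float), z])
    steps = np.linalg.norm(np.diff(pts, axis=0), axis=1)
    L = steps.sum()          # total curve length
    a_bar = steps.mean()     # average distance between successive points
    n = L / a_bar
    d_c = pdist(pts).max()   # curve diameter per eqn. (3.6); O(N^2) pairs
    return np.log(n) / (np.log(d_c) - np.log(L) + np.log(n))

# e.g., katz_fd(np.random.randn(500)) yields a value above 1 for a noisy curve
```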
Some examples of the codes are blame, sadness, agreement, humor, negativity etc. Each session was annotated by two to nine trained evaluators using these two rating systems on an integer scale of 1 to 9 and the average of their ratings are used as the reference. 3.6.2 Outcome Ratings Finally, the corpus also included therapy outcome ratings of the couples based on where they stood in terms of their relationship compared to the condition before therapy. Each couple had one rating for each of the post-therapy sessions–26 weeks and 2 years. The ratings are provided on a 4-point scale; 1 (decline), 2 (no change), 3 (partial recovery), and 4 (complete recovery). 3.6.3 Preprocessing and Variables of Interest in the Study In this work, we use 372 sessions out of total 574 sessions in the corpus, as the rest were too noisy to achieve good alignment, just as in some of the earlier studies [26, 156]. We choose two behavioral codes, agreement and blame, out of 33 codes for analysis. These two codes have respectively positive and negative correlation with the outcome; also intuitively speaking, agreement is more related to coordination of the speaker, whereas blame is not. Since annotations for these codes are at the session-level and for both husband and wife, we have 744 samples in total for the experiments with behavioral 62 codes. On the other hand, a pre-therapy session and either of the corresponding post- therapy sessions (26 weeks or 2 years) constitute one sample for outcome experiments. As there were many couples with one or more of these sessions missing, we ended up with 64 samples for outcomes after discarding the missing sets. 3.7 Dataset 2: Suicide Risk Assessment Corpus The second dataset employed in this work was collected as a part of research on sui- cide risk among military personnel[33]. Although 97 active duty soldiers participated in this study, we currently have complete speech recordings for 61 subjects and anno- tations for 54 subjects (45 male and 9 female). Based on self-reports, the majority of the participants (75.9%) were Caucasian. The other patients identified themselves as African-American, Native American and Hispanic or Latino. All participants had ac- tive suicide ideation during the week preceding the interaction, while 22 of them had attempted suicide at least once in the past. Upon completion of the informed consent process, they were invited for participation in the study. The study consisted of an interview session followed by crisis intervention and a post-study follow-up on the same day. Five therapists trained in suicide risk assess- ment and intervention procedures conducted the study. The conversation took place between the patient and one of the therapists, and was recorded using two directional microphones. The duration of the interview sessions ranged from 10 minutes to 1 hour, varying from patient to patient, resulting in 53 hours in total. The interview session was 63 Enrollment Semi-structured Interview RFL Intervention Questionnaire for ratings No recorded speech No audio Pre Post 1 month Follow-up 3 months Follow-up 6 months Follow-up RFL RFL = Reasons for Living Figure 3.1. 
The interview session was semi-structured: the therapist asked a set of questions related to the patient's reasons for living, elaboration of suicidal thoughts, history of attempts, etc. The questions were carefully designed for suicide risk assessment, based on the Beck Scale for Suicidal Ideation (BSSI) [17] and the Suicide Attempt Self-Injury Interview (SASII) structure [168]. Examples of these questions include: "Describe exactly what method you used to injure yourself." and "With all this going on, what would you say are your reasons for living, or your reasons for not killing yourself?". The patient's answer to the latter question, also known as reasons for living (RFL) [169], was manually transcribed. After the interview session, an intervention session and a post-study conversation were conducted. The basic structure of a session is shown in Figure 3.1.

The patient was asked to rate a number of measures: emotional bond, the extent of reasons for living, and 10 other attributes related to the patient's moods (urge, happiness, burden, hope, etc.). These ratings were self-reported through a written questionnaire on a scale from 1 to 100. In this work, we focus on only one attribute: the emotional bond between the therapist and the patient, as perceived by the patient [106]. For most of the subjects, there were one or more additional follow-up sessions (after 1, 3 and 6 months), which did not have any annotations or transcripts.

3.8 Individual and Joint Complexity Measures

Following the preprocessing and feature extraction steps described in Section 3.4, we obtain pitch and energy streams for every session. Each feature stream is also divided into speaker-homogeneous regions associated with either the husband or the wife. Thus we have two speaker-specific sub-streams for each session, allowing us to consider the speech of the husband and the wife as individual systems, with pitch and energy as the observed variables. Alternatively, if we consider the original feature stream as the observed variable, we can model the speech of the husband and the wife together as a joint system. Since the features are normalized per speaker, this construction does not introduce any apparent discontinuity. On each of these feature streams (husband, wife, and joint), we apply the complexity measures described in Section 3.5.2 (LLE, CD, ID, and KD). We denote a complexity measure as C, a function of the feature stream. Finally, we compute a normalized complexity of the joint feature stream of the couple (s_J) with respect to the individual feature streams of the husband (s_h) and the wife (s_w):

normalized joint complexity = C_{s_J} / sqrt(C_{s_h} · C_{s_w})   (3.7)
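Given any of the complexity measures from Section 3.5.2, eqn. (3.7) is a one-liner; the sketch below reuses the katz_fd illustration from earlier, with the stream variables as placeholders.

```python
def normalized_joint_complexity(complexity, s_h, s_w, s_j):
    """Eqn. (3.7): complexity of the joint stream normalized by the
    geometric mean of the two individual-stream complexities."""
    return complexity(s_j) / (complexity(s_h) * complexity(s_w)) ** 0.5

# e.g., with per-speaker-normalized pitch streams (placeholder variables):
# njc = normalized_joint_complexity(katz_fd, pitch_h, pitch_w, pitch_joint)
```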
3.9 Experiments and Results

We set up our experiments to investigate the following hypotheses regarding the speech-prosody-based system complexity measures:

H1: The joint complexity measures are meaningful representations of interactions. (Section 3.9.1)

H2: The complexity measures relate to the behavioral codes. (Section 3.9.2)

H3: The complexity measures relate to the outcome of the couples therapy. (Section 3.9.3)

3.9.1 Verification of Joint Complexity Measures

In this experiment, we verify whether the complexity measures of the joint feature streams are meaningful. To set up the test, we first reverse the turn sequence of one of the interlocutors (husband or wife, chosen randomly) and then combine it with the intact stream of the other on a turn-by-turn basis to create a stream (s_A) corresponding to an artificial dialog. These artificial feature streams should have higher complexity than the original conversations, as the latter involve phenomena such as speech accommodation [81] and entrainment [156]. We test this hypothesis with all four complexity measures on the pitch and energy streams; the percentages of sessions for which C_{s_A} > C_{s_J} are shown in Table 3.1. The results show that all of these measures support this hypothesis for over 92% of the interactions.

Prosody   LLE     CD      ID      KD
Pitch     98.92   97.31   94.35   96.51
Energy    95.16   95.43   92.74   94.09
Table 3.1. Percentage of sessions with C_{s_A} > C_{s_J} for the different complexity measures

3.9.2 Relation to Behavioral Codes

We perform a correlation analysis of the different complexity measures of prosody with two behavioral codes. As candidate feature streams, we use the individual feature streams of the specific person whose behavioral codes are being considered, along with the joint feature stream. For each of these streams, the highest Spearman's ρ (in absolute value) from each feature set, as a correlation measure with both behavioral codes, is reported in Table 3.2. As baseline feature sets, we use:

Baseline 1: mean pitch and energy;

Baseline 2: individual and joint distributions of prosodic patterns preceded by quantization, as used in [272].

Feature set   agreement (pitch, energy)   blame (pitch, energy)
Baseline 1    0.2802, 0.2621              0.2445, 0.2643
Baseline 2    0.2426, 0.2344              0.2523, 0.2712
Individual    0.2664, 0.2830              0.2536, 0.2719
Joint         0.2935, 0.3187              0.2313, 0.2787
Table 3.2. Spearman's ρ (absolute) of the most correlated feature from different feature sets with agreement and blame

Complexity measures of both pitch and energy turn out to have generally high correlations with both codes, agreement and blame, when compared to the baseline features. Moreover, the joint complexity features appear to be more correlated with agreement than with blame. This is in accordance with the intuition that agreement is a more interpersonal and dynamic behavior, and hence can be captured well by the joint dynamical features. We also find that the joint complexity measures are negatively correlated with agreement and positively with blame (i.e., in the last row of Table 3.2, the first two ρ values have a negative sign, while the latter two are positive). We also perform a statistical significance test of the individual and joint complexity results against the null hypothesis that they are not correlated with the behavioral codes. For each of the measures (both individual and joint, corresponding to the last two rows of Table 3.2), p < 0.05 is obtained, indicating significant correlation.

3.9.3 Relation to Outcomes

Finally, we perform another experiment to investigate the importance of the complexity measures, with relationship outcome as the variable of interest. Along with the two baseline feature sets mentioned before, we use complexity measures of the joint feature streams and the normalized joint complexity measures as defined in eqn. (3.7). The results are shown in Table 3.3.

Feature set                   pitch    energy
Baseline 1                    0.2772   0.2983
Baseline 2                    0.2181   0.1878
Joint                         0.3473   0.2565
Normalized Joint, eqn. (3.7)  0.4146   0.2636
Table 3.3. Spearman's ρ (absolute) of the most correlated feature from different feature sets with therapy outcome
Spearman’sr (absolute) of the most correlated feature from different feature sets with therapy outcome We find that the normalized joint complexity of pitch has the highest correlation with outcome. However, according to the results, complexity of energy seems to be less relevant to outcome. In both cases, normalized joint complexity feature has higher correlation than the unnormalized one. The correlations between outcome and complex- ity measures (the last two rows of Table 3.3) are also found to be statistically signifi- cant (p< 0:05) and negative in sign. The latter observation might indicate that higher complexity in interaction is related to lower value of the outcome variable, i.e., decline in the relationship of the couple. 69 3.10 Experimental Results on Suicide Corpus 3.10.1 Complexity during Interview Sessions The risk assessment interview sessions were conducted with predesigned questions. Un- like intervention sessions where the therapist consciously tries to sympathize with the patient to help her or him cope with the crisis, the purpose of interview sessions is to quickly obtain relevant information from the patient. With this objective at hand, the therapist-patient entrainment may not be at par with their entrainment during intervention or follow-up sessions. To test this hypothesis, we conduct an experiment by computing the normalized joint complexity of therapist-patient pair for the interview, as described in Section ??. Each complexity measure (LLE and CD) is used for each of the fea- ture streams (pitch and energy). In addition, PCA-based acoustic similarity measures are computed. We repeat the same operation on the intervention and follow-up sessions and compute the the average complexity of those sessions of the same patient-therapist pair, which we refer to as the baseline complexity. Then we check for what percentage of the subjects the complexity in interview is higher than the baseline complexity. The results are shown in Table 3.4, where we also present the p-values obtained in the Stu- dent’s t-test against the null hypothesis that there is no significant difference in the two aforementioned complexity measures. Results indicate that the majority (up to about 74%) of the subjects have higher com- plexity in interviews than their baseline complexity. This observation also turns out to be 70 Measure C(interview) > C(baseline) Percentage of subjects p-value † PCA-based similarity 67.21 0.0072 LLE with pitch 72.13 0.0005 LLE with energy 70.49 0.0014 CD with pitch 65.57 0.0150 CD with energy 73.77 0.0002 Table 3.4. Results of testing for higher complexity (lower similarity) in interview sessions in comparison to other sessions, i.e., C(interview) > C(baseline) †p< 0:05 indicates statistically significant difference we test for similarity(interview)< similarity(baseline) in this case 10 20 30 40 50 60 subjects -3 -2 -1 0 1 2 3 C(interview) - C(baseline) Figure 3.2. Sorted difference in normalized therapist-patient complexity measure, using correlation dimension (CD) for energy statistically significant as p< 0:05 for all measures. Figure 3.2 shows the difference in the two complexities for correlation dimension (CD) for energy feature stream by sorting them in increasing order. 71 Measure Pearson’s correlation r p-value † PCA-based similarity 0.2480 0.1132 LLE with pitch 0:3022 0.0419 LLE with energy 0:3737 0.0148 CD with pitch 0:2733 0.0473 CD with energy 0:3815 0.0127 Table 3.5. 
3.10.2 Correlation with Emotional Bond

The Pearson's correlation coefficients between emotional bond and the complexity (and similarity) measures are presented in Table 3.5. All complexity measures are negatively correlated with the emotional bond perceived by the patient (p < 0.05), as reported in their survey.

Measure                 Pearson's correlation r    p-value†
PCA-based similarity     0.2480                    0.1132
LLE with pitch          -0.3022                    0.0419
LLE with energy         -0.3737                    0.0148
CD with pitch           -0.2733                    0.0473
CD with energy          -0.3815                    0.0127

Table 3.5. Correlation between emotional bond and various complexity (or similarity) measures. †p < 0.05 indicates statistically significant correlation.

The negative values of the correlation coefficients r indicate that higher complexity, or lower entrainment, is associated with lower emotional bond. Only the PCA-based similarity does not show significant correlation. The reason for this might be a limitation of the PCA-based approach, namely its inability to capture the temporal dynamics of features within a speaker turn. However, the positive sign of r for the similarity measure is consistent with the findings for the complexity measures.

3.11 Conclusion

Human dyadic spoken interactions can be modeled as coupled dynamical systems. Such models may be useful for behavioral analysis of the interlocutors, with an emphasis on the characterization of behavioral entrainment and mutual influence. In this chapter, we explore these opportunities using prosodic features as observations of the dynamical systems. We then investigate different complexity measures of these systems and evaluate their correlation with the behaviors of couples during interactions, as well as with the therapy outcomes. The experimental results show that these complexity measures are useful for behavior analysis. We also observe that increased complexity in speech during dyadic interactions is associated with negative behavior (such as blame) and indicates decline in the couple's relationship.

We find that joint complexity measures tend to be higher during the risk assessment interview sessions of suicide prevention therapy, when compared to the baseline complexity. This indicates a lower degree of therapist-patient entrainment during the interview sessions. Based on the interactions, the patients evaluate their perceived emotional bond with the therapist. We investigate the statistical relationship between the ratings of emotional bond and the computed complexity measures. Results show that the joint complexity of the speakers, a notion opposite to entrainment, is negatively correlated with emotional bond. This finding is intuitively justified and consistent with previous studies in psychology [10]. The speech-based approach for the analysis of interactions presented in this work can be useful for guidance in conducting more effective interviews in suicide prevention.

This work also suggests a number of future directions. In the nonlinear dynamical systems framework, we intend to develop other measures of joint complexity, especially a measure that can capture asymmetry in the interaction dynamics. Given the asymmetric roles of the therapist and patient, such measures could be highly useful. Moreover, complexity-based measures may reflect characteristics of the patient, particularly his or her ability to connect to the other person during a conversation. A careful analysis of this might be informative towards assessment of the suicide risk itself.
Chapter 4

Deep Neural Network Modeling of Dyadic Interactions

4.1 Introduction

Analyzing interpersonal human interactions in social settings is one of the central themes of psychology, sociology and other social sciences. A significant amount of research has attempted to understand and draw inferences from the process of such interactions in computational social science and other emerging fields like behavioral signal processing [185] and social signal processing [259]. Often researchers focus on dyadic conversations, a common form of interaction that takes place between two interlocutors. Such conversations are rich in lexical information ("what is being said"), vocal patterns ("how something is being said") and other nonverbal vocal and visual cues. Moreover, contextual information from both the primary speaker and the interlocutor is used to aid the progress of the conversation. This multimodal and adaptive process of interpersonal communication has been conceptually studied and theorized in many psychological and behavioral studies [35, 74, 264].

There has been a myriad of work focusing on the phenomenon that speakers often tend to 'accommodate', or 'sound similar' to one another, over the course of a conversation. Different studies have come up with different terminology to indicate the same (or a similar) phenomenon: convergence [81, 202], alignment [211], entrainment [30], etc. Levitan et al. [162] used distinct terms for entrainment at different levels, such as convergence, proximity and synchrony. In another line of work [153, 156], the authors use the term vocal entrainment when analyzing speech patterns. Similar phenomena have been referred to as 'acoustic-prosodic entrainment' by other researchers [160, 161, 162, 270]. In [274] and [260], the term mimicry has been used while analyzing multimodal communication. In this work, we use the term vocal entrainment in association with the phenomenon of a speaker adapting to their interlocutor's vocal patterns during a conversation.

In Speech Accommodation Theory [81] (which later evolved into Communication Accommodation Theory, CAT), this phenomenon has been considered a driving force of the interaction. Giles et al. further addressed different aspects of accommodation, namely 'convergence' and 'divergence'. There is another form of accommodation called disentrainment, where the interlocutors tend to actively maintain a larger distance than usual, which appears to be beneficial in some circumstances. CAT encompasses various characteristics of speech and non-verbal behavior, such as utterance length [21], speech rate [246], intensity, response latency [21], and pause frequency and length [21, 39], exhibited by individuals in response to each other's communicative behavior in social settings [80]. Different speech strategies, such as complementarity and over- and under-accommodation, have been theoretically recognized as predictive of high-level behavioral patterns.

Entrainment is also often associated with different measures of dialogue or interaction effectiveness [161], such as perceived social desirability [193], naturalness of the conversation [194], smoother interaction [44, 162] and rapport building [174]. A high degree of vocal entrainment has also been associated with various interpersonal behavioral attributes, such as high empathy of a clinical provider [276], increased interpersonal agreement, less blame towards the partner, positive outcomes in couples therapy [190], and high emotional bond [187]. Understanding and analyzing entrainment can provide insights into the interpersonal behaviors expressed in an interaction. In [174, 181, 215, 266], the authors identified this phenomenon as influencing social and romantic relationships.
Another work [203] found significant correlation between entrainment in the phonetic patterns of college roommates and their closeness. As a useful characterization of the interaction dynamics of the interlocutors, entrainment can potentially facilitate the recognition and estimation of these behaviors in the realm of behavioral signal processing [185]. It has also become a part of spoken dialog systems (SDS), used to make a conversational agent sound more 'natural' and attentive to the user's speech [267].

While prominent theories have been presented to explain the mechanism of entrainment, and many of the aforementioned studies have empirically shown evidence of entrainment, quantifying entrainment in a conversation is a non-trivial task. In early efforts in entrainment research, instances of entrainment in a conversation were directly judged (and annotated) by trained experts observing the interaction. Studies have employed behavioral coding methods [38] to identify different behaviors on a local scale or along micro-units of behavior, such as variation of prosody in a specific direction (e.g., rising pitch, nods, smiles). Generally, a measure of entrainment on a more global (interaction-level) scale is computed based on the covariation or co-occurrence of these annotated behaviors [61]. However, such human observational methods face limitations. Not only is the manual coding of behavioral data tedious and time-consuming, but the subjectivity of the human expert's judgment often leads to poor inter-annotator agreement. An annotator is also sometimes restricted by the coding scheme, because none of the labels accurately matches the observed behavior. Moreover, manual measures of entrainment make it difficult to isolate entrainment in different modalities.

Prior work that has attempted to measure how much a speaker entrains to the interlocutor includes correlation, recurrence analysis, time-series analysis and spectral methods [61]. These methods are limited in scope because of the diverse and complex nature of entrainment as a phenomenon. For example, the majority of these approaches assume a linear association among the vocal features of speakers, which does not necessarily hold in real conversations. Furthermore, complex forms of entrainment, such as disentrainment, cannot be captured using these measures.

A nonlinear dynamical systems approach was proposed in [190], where complexity measures were used to quantify entrainment. However, this approach relies on finding entrainment separately in different features, which has limited applicability when measuring entrainment between the vocal patterns of the interlocutors as a whole. Also, all of these approaches are knowledge-driven and compute measures of entrainment directly from the features based on domain theory or heuristics. However, factors other than entrainment can contribute to similarity or dissimilarity in the vocal patterns of interlocutors. If we could 'learn' how entrainment is manifested in conversation from real conversational data, we could selectively exploit that information in order to come up with entrainment measures that are capable of addressing the aforementioned drawbacks. In our recent work [189], we proposed a more direct, data-driven encoding-based approach using neural networks to extract information relevant to entrainment and compute a distance measure.
Neural network models can potentially learn the nonlinear mapping between the original speech feature space and the information that can be entrained, under specific learning frameworks. In this paper, we first present a general framework to extract information from raw speech features and use it for computing entrainment distances. We show two specific approaches under this framework: first, an encoding approach, and second, a triplet network based approach. The validity of the entrainment distances obtained by these approaches is evaluated through distinguishing real conversations from fake ones. We also test their efficacy in real-world psychotherapy applications, in relation to different relevant behavioral attributes.

4.2 Preprocessing and Feature Extraction for Vocal Entrainment

First, we briefly describe the preprocessing and feature extraction steps required for modeling entrainment and computing the associated distances from audio.

4.2.1 Preprocessing

We perform a few preprocessing steps on the raw audio waveform before extracting features. These steps are required to obtain boundaries of audio segments to identify consecutive speaker turns. The first step is voice activity detection (VAD), which isolates speech regions from non-speech regions. Then, we perform speaker diarization to distinguish contiguous speech segments spoken by different speakers. This step typically involves speaker-homogeneous segmentation followed by clustering the segments into the different speakers in the conversation. This way we can obtain the consecutive speaker turns that are used for measuring entrainment in a given conversation.

For training our models, however, we prefer the cleaner boundaries typically obtained via manual transcripts, to minimize preprocessing error. Hence we use the transcripts with speaker turn boundaries available in our training dataset to get the timings of speaker changes as well as of pauses within a turn. Assuming that these time stamps are reasonably accurate, they are used as oracle VAD and diarization for the training data. For the evaluation datasets, we perform automated VAD and diarization on the raw audio as the preprocessing steps, to simulate the real-world application of our models.

Following the preprocessing steps, we divide a single turn into multiple inter-pausal units (IPUs) wherever there is a pause of at least 50 ms duration within the turn. The notion behind using IPUs comes from the hypothesis that entrainment is most prominent between the last IPU of the previous speaker turn and the leading IPU of the next speaker turn [162]. Hence, for each turn with multiple IPUs, we only consider the first and the last IPU of every turn to capture entrainment-related information.

4.2.2 Feature Extraction

In the feature extraction step, we extract 38 frame-level acoustic features from the segments (IPUs). The extracted features fall into different categories: 4 prosody features (pitch, energy and their first-order deltas), 31 spectral features (MFCCs, LSFs etc.) and 3 voice quality features (shimmer and 2 variants of jitter). We do not include any deltas of the spectral and voice quality features, as our analyses showed that they do not significantly contribute to capturing entrainment. In particular, we found these features to have very low correlation (r < 0.05) between consecutive turns in our initial analysis. We extract the selected 38 features with a Hamming window of 25 ms width and 10 ms shift using the OpenSMILE feature extraction toolkit [68]. A post-processing step is performed for pitch by applying median-filter based smoothing with a window size of 5 frames. The reason for this additional step is the lack of robustness in pitch extraction and the possibility of halving or doubling errors. Finally, we z-score normalize the frame-level features across the whole session, with the exception of pitch and energy. These two features are normalized by dividing them by their respective means.
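The post-processing just described can be summarized in a short sketch. This is a minimal illustration, assuming the frame-level features of one session are stacked in a NumPy array and that the column indices of pitch and energy are known; it glosses over details such as handling unvoiced (zero-pitch) frames.

```python
import numpy as np
from scipy.signal import medfilt

def postprocess_features(frames, pitch_idx, energy_idx):
    """Session-level post-processing of frame-level features.

    frames: (n_frames, 38) array of OpenSMILE features.
    Pitch is median-smoothed over 5 frames to suppress halving and
    doubling errors; pitch and energy are mean-normalized, all other
    features are z-score normalized across the whole session.
    """
    out = frames.astype(float).copy()
    out[:, pitch_idx] = medfilt(out[:, pitch_idx], kernel_size=5)
    for j in range(out.shape[1]):
        col = out[:, j]
        if j in (pitch_idx, energy_idx):
            out[:, j] = col / col.mean()
        else:
            out[:, j] = (col - col.mean()) / col.std()
    return out
```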
4.2.3 Turn-level Features

Our framework models entrainment with a directional distance measure from speaker 1 to speaker 2 given a particular turn change, as shown in Figure 4.3. In this case, the segments of interest are the last IPU of speaker 1's turn and the first IPU of the subsequent turn by speaker 2, which are shown using bounding boxes in the figure. We then compute six statistical functionals over all frames within each IPU. For each pair of turns, we obtain two sets of functionals of features representing those turns. The functionals used in this work are as follows: mean, median, standard deviation, 1st percentile, 99th percentile and range (99th percentile - 1st percentile). Thus we represent each turn with 38 × 6 = 228 turn-level features extracted from the corresponding IPU. In the rest of the paper, we will use $x_1$ and $x_2$ to denote the turn-level feature vectors of the last IPU of speaker 1 and the first IPU of speaker 2, respectively.
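The mapping from an IPU to its 228-dimensional turn-level representation is then a straightforward computation of the six functionals per feature; a minimal sketch follows, with the array shapes assumed as above.

```python
import numpy as np

def turn_level_vector(ipu_frames):
    """Six functionals per feature over an IPU's frames.

    ipu_frames: (n_frames, 38) post-processed features.
    Returns a (228,) vector: mean, median, standard deviation,
    1st percentile, 99th percentile and range (99th - 1st).
    """
    p1 = np.percentile(ipu_frames, 1, axis=0)
    p99 = np.percentile(ipu_frames, 99, axis=0)
    return np.concatenate([
        ipu_frames.mean(axis=0),
        np.median(ipu_frames, axis=0),
        ipu_frames.std(axis=0),
        p1,
        p99,
        p99 - p1,
    ])
```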
4.3 Deep Unsupervised Learning Framework for Entrainment

In this section, we first discuss the basic principle of modeling entrainment using deep learning models in an unsupervised framework.

4.3.1 Basic Principle of Learning Entrainment from Data

With our goal of obtaining a useful distance measure between consecutive turns (represented by turn-level feature vectors), we rely on capturing information that maximally relates to entrainment. The majority of prior works on vocal entrainment also rely on this idea of finding a distance (or similarity) measure. For example, [162] uses correlation between the turn-level features, while [156] proposed using their lower-dimensional representations. Let us assume the first speaker turn is represented by $x_1$ and the next speaker turn by $x_2$. Without loss of generality, we can describe this class of approaches for quantifying entrainment with a measure $D_{ent}(\cdot)$ as follows:

$D_{ent}(x_1, x_2) = d(T(x_1), T(x_2))$    (4.1)

As evident from the above equation, there are two components in designing such a measure. One is the choice of an appropriate transformation function $T(\cdot)$ to apply to the feature vectors ($x_1$ and $x_2$) in order to obtain a representation that ideally preserves the information that relates to entrainment and remains invariant to other factors such as speaker characteristics, channel information and so on. The other is choosing a distance metric (or similarity measure) $d(\cdot)$ for the chosen representation space.

[Figure 4.1. An overview of the basic framework for unsupervised learning of entrainment: each speaker turn passes through the transformation $T(\cdot)$, and an L1/L2 distance between the transformed representations yields the entrainment distance.]

For example, measuring entrainment by computing Pearson's correlation [162] between all instances of $x_1$ and $x_2$ assumes correlation as the similarity measure and no further transformation of the feature vectors (mathematically speaking, the identity transformation function). On the other hand, [156] uses a PCA and vector quantization-based method as the transformation function and Kullback-Leibler divergence as $d(\cdot)$. As discussed earlier, these approaches (and other approaches from the prior literature as well) do not make any well-devised attempt to design $T(\cdot)$ in such a way that it captures the entrainable information while remaining invariant to other 'nuisance' factors that do not entrain across speakers. If $e$ represents a vector in the entrainable information space and $q$ is the vector notation for the other nuisance factors, we can model the feature vector $x$ as a nonlinear function $F(\cdot)$ over them, i.e., $x = F(e, q)$. Our goal is to estimate this function and obtain $e$ from $x$ via unsupervised learning of representations, such that

$e_1 \approx T(x_1)$ and $e_2 \approx T(x_2)$    (4.2)

In the 'perfect' scenario, where the next speaker entrains as much as possible to the previous speaker, $e_1 = e_2$, which means $D_{ent}(x_1, x_2) \approx 0$ from Equation (4.1). However, in real conversations it is highly unlikely for interlocutors to entrain perfectly, hence $D_{ent}(x_1, x_2) > 0$.

The main contribution of this work is in approximating $T(\cdot)$ with deep neural network architectures without direct supervision. It is difficult to obtain manual annotations for the degree of entrainment, and practically impossible to accurately identify the (feature) information contributing to entrainment. Hence, we use consecutive speaker turn pairs from spontaneous conversations to train models that can approximate $T(\cdot)$, with the underlying assumption that some degree of entrainment occurs across turns in such conversations. Even if this assumption does not necessarily hold for every turn pair, it is expected to hold for the majority of turn pairs if the conversations are deemed natural, as found in the prior literature [162].

4.3.2 Encoding Approach

In our first approach, we adopt a feed-forward deep neural network (DNN) as an encoder to encode possible entrainable information from speaker turns. Some initial results of this approach were published in [189]. The basic idea is to extract relevant information from the previous speaker turn to predict the next speaker turn, since that information is deemed to be associated with entrainment between the speakers.

[Figure 4.2. An overview of unsupervised training of the encoder approach: turn-level functionals of speaker 1's turn are encoded by a DNN, and a DNN decoder predicts the functionals of speaker 2's turn.]

The different components of the model are described below:

1. First, we use $x_1$ as the input to the encoder network. We choose the output of the encoder network, $z$, to be an undercomplete representation of $x_1$ by restricting the dimensionality of $z$ to be lower than that of $x$.

2. $z$ is then passed through another feed-forward network, used as a decoder, to predict $x_2$. The output of the decoder is denoted as $\hat{x}_2$.

3. Then $\hat{x}_2$ and its reference $x_2$ are compared to obtain the loss function of the model, $L(x_2, \hat{x}_2)$.

Even though this deep neural network resembles autoencoder architectures, it does not reconstruct its input but rather tries to encode relevant information from one turn to predict the next turn, parallel to [122, 123, 164]. Thus the bottleneck embedding $z$ can be considered closely related to the entrainment embedding $e$ mentioned above.
4.3.3 Nuisance Factors

While the encoder approach tries to capture entrainment by modeling the information flow across consecutive speaker turns, it can capture all aspects of similarity between the turns. Its main limitation is that it may capture similarity between speaker turns that is not relevant to entrainment (nuisance factors), such as speaker and channel characteristics. For example, the embedding learned in this approach might include acoustic channel information, making the entrainment distance sensitive to channel characteristics: if the two speakers share the same channel, the distance could be smaller compared to the situation where they use different channels. Also, speakers with similar voices would exhibit a lower distance even if they do not entrain much during their conversation. This latter limitation might not be that detrimental, as the network should learn to ignore speaker information unless there is an abundance of similar-sounding speaker pairs in the training data. In either case, a desirable property of the transformation function $T(\cdot)$ is that it should learn to ignore speaker and channel characteristics while learning the entrainable information. The encoder approach is not explicitly incentivized to ignore such nuisance factors. Hence, we attempt to use a learning framework that addresses this limitation. Triplet networks [117] have shown promising results in learning embeddings such that distances in the embedding space map directly to similarity (or dissimilarity) in a variable of interest. They have been successfully used in computer vision applications like face identification and verification [233], as well as in speech tasks such as speaker diarization [243] and verification [123]. In our application, we address the problem of learning a meaningful embedding with entrainment information using a triplet network. The following subsection describes the triplet network and the associated loss.

4.3.4 Triplet Network for Entrainment

A triplet network is an architecture suitable for representation learning by minimizing a triplet loss computed in the latent embedding space. In a triplet network, each input consists of three samples, or a triplet: an anchor $x_a$, a positive sample $x_p$ belonging to the same class as the anchor, and a negative sample $x_n$ from a different class. Let the embedding be represented by $T(\cdot)$, which maps a turn-level feature vector $x$ to $T(x)$ with the additional constraint that $\|T(x)\|_2 = 1$. We want to ensure that, in the embedding space, the anchor sample is closer to the positive sample than it is to the negative sample. So we want

$\|T(x_a) - T(x_p)\|_2^2 + \alpha < \|T(x_a) - T(x_n)\|_2^2$    (4.3)

where $\alpha$ is a margin enforced between anchor-positive and anchor-negative pairs. If we consider all triplets in the training set $\mathcal{T}$ with cardinality $N$, denoted by $(x_a^i, x_p^i, x_n^i) \in \mathcal{T}$, the total loss we minimize is given by

$L = \sum_i \max\left( \|T(x_a^i) - T(x_p^i)\|_2^2 - \|T(x_a^i) - T(x_n^i)\|_2^2 + \alpha,\ 0 \right)$    (4.4)

In a typical binary classification setup, it is trivial to assign labels to the samples and subsequently choose triplets, with each anchor having a corresponding positive and negative exemplar under different mining strategies. In our case, however, we do not have such labels for entrainment. Rather, we assume that consecutive turn pairs belong to the same category, as they share entrainable information, while any non-consecutive turn pair does not belong to the same category, as those turns are not directly entrained. Thus we consider a speaker turn the anchor, the immediately following turn the positive sample, and another, non-consecutive turn selected from the training data the negative sample.
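The loss in Equation (4.4) is straightforward to express in code. Below is a minimal PyTorch sketch; the margin value and batch layout are illustrative assumptions, not taken from the original implementation.

```python
import torch.nn.functional as F

def entrainment_triplet_loss(emb_a, emb_p, emb_n, margin=0.2):
    """Triplet loss of Equation (4.4) over L2-normalized embeddings.

    emb_a: anchor turns, emb_p: the immediately following turns
    (positives), emb_n: non-consecutive turns (negatives); all of
    shape (batch, embedding_dim).
    """
    emb_a = F.normalize(emb_a, p=2, dim=1)  # enforce ||T(x)||_2 = 1
    emb_p = F.normalize(emb_p, p=2, dim=1)
    emb_n = F.normalize(emb_n, p=2, dim=1)
    d_ap = (emb_a - emb_p).pow(2).sum(dim=1)  # squared L2, anchor-positive
    d_an = (emb_a - emb_n).pow(2).sum(dim=1)  # squared L2, anchor-negative
    return F.relu(d_ap - d_an + margin).sum()
```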
It is crucial to select good negative samples for robust embedding learning [233], and a hard triplet mining strategy amounts to choosing negative samples that are close to the anchor in some feature space. The anchor and positive samples can be selected trivially from consecutive speaker turns, which share a greater similarity due to potential entrainment. The negative sample, on the other hand, can come from any turn that is not assumed to entrain with the anchor turn. We first employ an online semi-hard negative sampling strategy for selecting the turn representing the negative sample from the dataset. However, our end goal in using the triplet network is to minimize the effect of nuisance factors, and the traditional approach of semi-hard sampling in the original feature space might not be optimal for that goal. Hence, we attempt to improve the embedding learning by choosing a negative sample that shares some similarity with the anchor with respect to the nuisance factors, yet is not entrained to it. In the next section, we describe how we use i-vector modeling for the negative sample selection.

4.3.5 i-vector Modeling for Negative Sampling

i-vector modeling is a popular technique in speaker recognition and verification [60], as it has been shown to capture speaker identity information. i-vectors have also been used in speaker adaptation techniques for automatic speech recognition [130], accounting for the variability in speech from different speakers. The technique employs a total variability framework in which speech segments are transformed into a low-dimensional space using joint factor analysis. The total variability space is trained in an unsupervised manner and hence captures both speaker and channel variabilities. This makes it well suited for negative sample selection, since high similarity (low distance) in the i-vector space between the anchor and the negative sample is beneficial for learning entrainment information while ignoring the nuisance factors (speaker and channel information). For this reason, the i-vector is preferred over the x-vector [241], which has recently shown state-of-the-art performance in speaker verification and exhibits invariance to channel characteristics.

[Figure 4.3. An overview of the triplet network model for entrainment: turn-level functionals of the anchor (speaker 1), the positive (speaker 2) and the negative (a turn similar to the anchor in i-vector space) are passed through weight-shared networks, and a triplet loss is computed on the embeddings.]

The i-vector extraction technique can be modeled as the mapping of a speech utterance (a speaker turn in our application) $\mathcal{S} = \{s_t\}_{t=1}^{N}$ with $s_t \in \mathbb{R}^F$ to a fixed-length vector $v \in \mathbb{R}^D$. A Gaussian mixture model based universal background model (GMM-UBM) is learned from the data, and its Baum-Welch statistics are computed to obtain a super-vector $q$ for each utterance $\mathcal{S}$, assumed to obey the following affine relationship:

$q = m + Tv$    (4.5)

where $m$ is the mean super-vector from the UBM and $T$ is the low-rank total-variability matrix capturing speaker and channel information.

In our work, we use i-vector extraction to devise a hard negative sampling technique. The strategy is described as follows:

1. First, i-vector extraction is performed for all utterances representing the speaker turns in the training dataset $\mathcal{T}$. Let us denote an utterance by $\mathcal{S}_i \in \mathcal{T}$, its turn-level feature vector by $x_i$ and its i-vector representation by $v_i$.
2. The selection of the negative sample in the triplet $(x_a, x_p, x_n)$ can be posed as the choice of an utterance $\mathcal{S}_n \in \mathcal{T}$, given $\mathcal{S}_a$ and $\mathcal{S}_p$ already chosen as two consecutive utterances in a conversation.

3. Ideally, we want to choose $n$ in such a way that the negative sample is as close to the anchor as possible in the i-vector space, i.e., $v_n$ is close to $v_a$.

4. Since it is computationally inefficient to compute the distances of all utterances from the anchor, we randomly choose $K$ utterances from $\mathcal{T} \setminus \{\mathcal{S}_a, \mathcal{S}_p\}$ and form a candidate pool $\mathcal{C}$.

5. Finally, $n$ is chosen such that its utterance is the nearest to the anchor in the i-vector space among the candidate pool:

$n = \operatorname{argmin}_k \|v_k - v_a\|_2, \quad \mathcal{S}_k \in \mathcal{C}$    (4.6)
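The five steps above can be condensed into a short sampling routine. This is a minimal sketch assuming precomputed i-vectors in a NumPy array; the pool size K and the uniform candidate draw are illustrative choices.

```python
import numpy as np

def sample_negative(ivecs, anchor_idx, positive_idx, K=64, rng=None):
    """i-vector based hard negative sampling (steps 1-5 above).

    ivecs: (n_turns, D) i-vectors of all training turns.
    Returns the index of the candidate turn whose i-vector is nearest
    to the anchor's, so that the negative shares nuisance factors
    (speaker/channel) with the anchor without being entrained to it.
    """
    rng = rng or np.random.default_rng()
    pool = [i for i in range(len(ivecs)) if i not in (anchor_idx, positive_idx)]
    candidates = rng.choice(pool, size=K, replace=False)
    dists = np.linalg.norm(ivecs[candidates] - ivecs[anchor_idx], axis=1)
    return int(candidates[np.argmin(dists)])
```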
4.4 Network Architecture and Entrainment Distance Measures

4.4.1 Encoder Approach: Neural Entrainment Distance (NED)

In the encoder approach, we use two fully connected layers as hidden layers in both the encoder and the decoder network. Batch normalization layers and Rectified Linear Unit (ReLU) activation layers (in that order) are used between the fully connected layers in both networks. The dimension of the embedding is chosen to be 30. The numbers of neuron units in the hidden layers are: [228 → 128 → 30 → 128 → 228]. We use the smooth L1 norm, a variant of the L1 norm that is more robust to outliers [82], so that

$L(x_2, \hat{x}_2) = \|x_2 - \hat{x}_2\|_{smooth_1} = \sum_{k=1}^{N} smooth_{L_1}(x_{2k} - \hat{x}_{2k})$    (4.7)

where

$smooth_{L_1}(d) = \begin{cases} 0.5\,d^2 & \text{if } |d| \le 1 \\ |d| - 0.5 & \text{otherwise} \end{cases}$    (4.8)

and $N$ is the dimension of $x$, which is 228 in our case. For training the network, we choose a subset (80% of all sessions) of the Fisher corpus and use all turn-level feature pairs $(x_1, x_2)$. We employ the Adam optimizer [138] and a minibatch size of 128 for training the network. The validation error is computed on the validation subset (10% of the data) of the Fisher corpus, and the best model is chosen.

After the unsupervised training phase, we use the encoder network to obtain the embedding representation $z$ from any turn-level feature vector $x$. To quantify the entrainment from a turn to its subsequent turn, we extract turn-level feature vectors from their final and initial IPUs, respectively, denoted as $x_i$ and $x_j$. Next, we encode $x_i$ and $x_j$ using the pretrained encoder network and obtain $z_i$ and $z_j$ as the outputs. Then we compute a distance measure $d_{NE}$, which we term the Neural Entrainment Distance (NED), between the two turns by taking the smooth L1 distance between $z_i$ and $z_j$:

$d_{NE}(x_i, x_j) = \|z_i - z_j\|_{smooth_1} = \sum_{k=1}^{M} smooth_{L_1}(z_{ik} - z_{jk})$    (4.9)

where $smooth_{L_1}(\cdot)$ is defined in Equation (4.8) and $M$ is the dimensionality of the embedding. Note that even though the smooth L1 distance is symmetric in nature, our distance measure is still asymmetric because of the directionality in the training of the neural network model.

4.4.2 Triplet Network-based Entrainment Distance Measures (TNED and iTNED)

In the triplet network approach, we use a network somewhat similar to the encoder as a building block, applied to all three components of the triplet. The major difference lies in an additional fully connected layer followed by a ReLU layer at the input. We also do not use any batch normalization layers in this architecture. The dimension of the embedding layer on which the triplet loss is computed is chosen to be 30. The architecture in terms of the neurons in the hidden layers is as follows: [228 → 256 → 128 → 30]. We train the network on the same training data as before, with the Adam optimizer and a minibatch size of 128. The best model, along with the margin $\alpha$ in the triplet loss, is chosen based on the loss on the validation set. Finally, the trained network is used to obtain the embeddings $e_1$ and $e_2$ from each consecutive turn pair $(x_1, x_2)$. The entrainment distance between the two turns is then computed by taking the L2 distance between the embeddings:

$d_{TNE}(x_i, x_j) = \|e_i - e_j\|_2$    (4.10)

We train two versions of this model: (1) an online semi-hard sampling based model, and (2) an i-vector sampling based model. The only difference between these two models lies in the sampling strategy. We denote the distances computed using these two models as the Triplet Network Entrainment Distance (TNED) and the i-vector based Triplet Network Entrainment Distance (iTNED), respectively. All of these distances are turn-level measures, in the sense that they relate to local entrainment exhibited across consecutive speaker turns. We typically take the average over the distances between all consecutive turn pairs to obtain a global measure for the entire session.
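For concreteness, a minimal PyTorch sketch of the NED encoder and the distance in Equation (4.9) follows. The layer sizes match the description above, but the exact placement of batch normalization and activations is an assumption; torch's built-in smooth L1 loss (with its default threshold of 1) implements Equation (4.8).

```python
import torch
import torch.nn as nn

class TurnEncoder(nn.Module):
    """Encoder half of the NED model: 228 -> 128 -> 30."""
    def __init__(self, in_dim=228, hid=128, emb=30):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hid), nn.BatchNorm1d(hid), nn.ReLU(),
            nn.Linear(hid, emb),
        )

    def forward(self, x):
        return self.net(x)

def ned(encoder, x_i, x_j):
    """Neural Entrainment Distance (Equation 4.9): smooth L1 distance
    between the encoded embeddings of two consecutive turns."""
    encoder.eval()
    with torch.no_grad():
        z_i, z_j = encoder(x_i), encoder(x_j)
    return nn.functional.smooth_l1_loss(z_i, z_j, reduction="sum")
```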
4.5 Datasets

We use a number of datasets in this work for training and evaluation.

4.5.1 Training Data

We require a conversational training dataset with a reasonable extent of possible entrainment between the speakers (given our assumption that proximal turns are more entrainable than distant ones). The dataset also needs to have an adequate number of turn pairs so that we can train our neural network models on it.

Fisher Corpus English Part 1: We chose Fisher Corpus English Part 1 (LDC2004S13) [49] for training, which satisfies both of these requirements. It consists of a large number of spontaneous telephonic dyadic conversations between native English speakers. More specifically, there are 5850 such conversations, each lasting up to 10 minutes. Since these are telephonic conversations, the sampling rate of the recordings is 8 kHz. The dataset is accompanied by manual transcripts that contain time-stamps of speaker turn boundaries as well as boundaries of pauses within a turn. We use this information for the extraction of turns instead of diarization, to minimize potential preprocessing error. We randomly chose 80% of the dataset for training, and 10% of the data is set aside as the validation set during the training process.

4.5.2 Evaluation Data

We use the following datasets for the evaluation experiments in our work.

Fisher Corpus: A test subset is formed by setting aside the remaining 10% of the Fisher corpus. This subset is used for the verification experiment.

Suicide Risk Assessment Corpus: This dataset [10] contains recorded conversations of suicidal patients being interviewed by their therapist during suicide risk assessment sessions. The participants were active duty military personnel who either had attempted suicide or had suicidal thoughts prior to the sessions. The subset of the corpus employed in the current work includes therapist-patient interviews of 54 subjects, with each session ranging from 10 minutes to 1 hour. The patients were asked several questions related to their personal history, the reasons leading to their suicidal thoughts, an elaboration of their reasons for living, etc. Immediately after the interview session, the patient was asked to provide a self-reported score for perceived emotional bond, an attribute which entails the therapist's empathy for the patient and the patient's feeling of trust towards them. It was rated on a scale from 1 to 10.

Couples Therapy Corpus: This dataset [47] is a collection of recorded interactions of 134 heterosexual married couples with serious and chronic marital distress. Each couple participated in up to three sessions of interactions at different time points as part of a longitudinal study. More specifically, the sessions took place at the beginning of the therapy, after six months, and finally two years after the onset of the therapy. In each session the couple had a conversation for 10 minutes on a topic chosen by the husband and another 10 minutes on the wife's chosen topic. The interactions were recorded with a far-field microphone while no therapist or research staff was present.

Behavioral Codes: The audio-visual recordings of the sessions were manually annotated with a set of behavioral attributes, also known as behavioral codes. These codes are used to describe each spouse's behavior in each 10-minute subsession. This corpus used 32 such behavioral codes in total, following two well-established behavioral coding systems: 13 codes from the Couples Interaction Rating System (CIRS, [125]) and 19 codes from the Social Support Interaction Rating System (SSIRS, [107]). A few examples of these codes are blame, sadness, agreement, humor and negativity. A number of trained evaluators (ranging from two to nine) annotated these codes on an integer scale of 1 to 9, where a higher rating indicated a higher extent of the rated behavior. The average of the ratings from multiple annotators is used as the reference in this work. We use 372 sessions, each with ratings for both spouses, for the experiment involving behavioral codes.

Outcome Ratings: The Couples Therapy Corpus also includes annotations of the therapy outcomes of the couples. The outcome ratings were based on the quality of their relationship compared to the condition before the therapy started, following the methods provided by Jacobson and Truax [121]. The outcome ratings are provided on a 4-point scale:

- 1: deteriorated (i.e., they got measurably worse over treatment)
- 2: no change (i.e., no meaningful improvement)
- 3: improved (i.e., they got measurably better over treatment, but the change is still clinically insignificant)
- 4: recovered (i.e., they got measurably better over treatment and their score is above the upper cut-off for clinically significant distress)

Each couple has one rating for each of the post-therapy sessions, 26 weeks and 2 years relative to the time they started the therapy.

4.6 Verification of the Measures

Before the proposed entrainment distance measures are tested on data from real-world applications, we conduct an analysis to verify that the measures are suitable for capturing the inherent behavioral inter-dependencies present in human-human interactions, as conceptualized in psychological studies. This verification step is implemented by creating fake sessions from modified real conversations and then testing whether the proposed measures can distinguish between them with a simple rule-based classification.

From each real session (denoted by $\mathcal{S}_{real}$), a fake session ($\mathcal{S}_{fake}$) is created by randomly shuffling the speaker turns. The shuffling is performed under the constraint that no two turns of the same speaker can be adjacent as a result of the shuffling. Then the distance measures are used to identify the real session from the pair ($\mathcal{S}_{real}$, $\mathcal{S}_{fake}$). The steps of the experiment are outlined as follows:
1. A fake session $\mathcal{S}_{fake}$ is created from a real session $\mathcal{S}_{real}$.

2. The session-level entrainment distance (or similarity) is computed for both the real and the fake session by taking the average over the turn-level measures.

3. Then, for each pair of sessions ($\mathcal{S}_{real}$, $\mathcal{S}_{fake}$), we use a simple rule to identify the real session from its fake counterpart: the session with the lower distance (or higher similarity) is inferred to be the real one. This rule is based on the hypothesis that higher entrainment is observed across consecutive turns in a natural conversation than across randomly paired turns in a fake session.

4. If the inferred real session is the true real session, it is counted as correctly classified.

We repeat this experiment 30 times over the whole evaluation dataset to account for the randomness in creating the fake sessions. The average classification accuracy and the standard deviation are reported in Table 4.2. This experiment is conducted on two datasets: a subset (10%) of the Fisher corpus and the Suicide corpus.
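A minimal sketch of this protocol is given below, assuming each session is a list of (speaker, turn-feature) pairs and that a turn-level distance function (e.g., NED) is supplied; the rejection-sampling shuffle is an illustrative way to enforce the no-adjacent-same-speaker constraint.

```python
import random
import numpy as np

def make_fake_session(turns, rng):
    """Shuffle turns so no two adjacent turns share a speaker.
    turns: list of (speaker_id, turn_feature_vector)."""
    while True:
        fake = list(turns)
        rng.shuffle(fake)
        if all(a[0] != b[0] for a, b in zip(fake, fake[1:])):
            return fake

def session_distance(turns, turn_distance):
    """Session-level measure: average turn-level distance over all
    consecutive turn pairs."""
    return np.mean([turn_distance(a[1], b[1])
                    for a, b in zip(turns, turns[1:])])

def real_vs_fake_accuracy(sessions, turn_distance, n_runs=30, seed=0):
    rng = random.Random(seed)
    accs = []
    for _ in range(n_runs):
        correct = [session_distance(s, turn_distance)
                   < session_distance(make_fake_session(s, rng), turn_distance)
                   for s in sessions]
        accs.append(np.mean(correct))
    return float(np.mean(accs)), float(np.std(accs))
```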
We use a number of baseline measures, as listed in Table 4.1.

Baseline 1    L2 distance directly computed between turn-level features ($x_i$ and $x_j$)
Baseline 2    Smooth L1 distance directly computed between turn-level features ($x_i$ and $x_j$)
Baseline 3    Convergence, proximity and synchrony by Levitan et al. [162]
Baseline 4    PCA-based symmetric acoustic similarity measure by Lee et al. [156]
Baseline 5    Nonlinear dynamical systems-based complexity measure [187]

Table 4.1. Baselines used in the verification and correlation analysis experiments

Since Baselines 4 and 5 refer to multiple measures originally proposed in the literature, we report the best performing one from the respective set, thus providing an upper-bound performance for comparison. Baselines 3 and 4 are similarity measures; hence the session with the higher value of the measure is inferred to be real.

Measure       Fisher corpus    Suicide corpus
Baseline 1    70.55 (6.13)     71.49 (6.08)
Baseline 2    72.10 (5.83)     70.44 (6.69)
Baseline 3    86.27 (5.58)     82.62 (4.31)
Baseline 4    92.32 (3.01)     88.12 (5.93)
Baseline 5    90.21 (5.40)     88.54 (5.87)
NED           98.87 (0.97)     91.92 (2.32)
TNED          97.23 (0.62)     90.91 (3.05)
iTNED         96.07 (0.86)     94.63 (1.98)

Table 4.2. Results of Experiment 1: classification accuracy (%) of real vs. fake sessions (averaged over 30 runs; standard deviation shown in parentheses)

From the results reported in Table 4.2, we observe that all three of our proposed measures achieve higher accuracy than all baselines when evaluated on the Fisher corpus, NED being the best performing measure with almost 99% accuracy. This shows that the proposed measures are able to successfully distinguish between real and fake conversations by identifying the degree of entrainment present in them. iTNED achieves slightly lower accuracy than NED and TNED, while still performing better than the baselines.

The proposed measures also perform better than all baselines on the Suicide corpus. The accuracy of the NED measure declines by about 7% absolute on the Suicide corpus compared to the Fisher corpus, and similar behavior is observed for TNED. The likely reason is data mismatch, as the models were trained on Fisher. However, iTNED, while also trained on the Fisher corpus, performs better than NED here and loses only 3% absolute accuracy compared to its evaluation on the Fisher corpus itself. This observation validates our earlier argument that iTNED is more robust to variation in channel and speaker characteristics than NED.

We also run a two-tailed exact binomial test [229] to check the statistical significance of our observations in the classification experiment. On both the Fisher and the Suicide corpus, we find the superiority of all three proposed measures over the corresponding best performing baseline (Baseline 4 and Baseline 5, respectively) to be statistically significant at the significance level p < 0.05. Also, iTNED is found to perform significantly better than NED on the Suicide corpus.

4.7 Correlation Analyses

While the verification experiment was designed to validate the proposed measures by testing for the presence of entrainment as a fundamental property of natural conversations, it is still a proof-of-concept analysis for measures capturing entrainment. In the current section, we experimentally show the potential usability of the measures in relation to certain behavioral constructs typically associated with entrainment. In particular, we focus on two case studies of modeling specific behaviors using real-world psychotherapy datasets from two important domains: emotional bond in suicide risk assessment interviews, and agreement and blame behavior in couples therapy interactions. These experiments can be thought of as indirect validation of the ability of the proposed measures to capture entrainment, by demonstrating their relation with the associated behaviors.

4.7.1 Emotional Bond in Suicide Risk Assessment

Prior literature on suicide risk assessment interviews has found that a high emotional bond in patient-therapist interactions is associated with a higher degree of entrainment; this was also experimentally validated in our previous work [187, 189]. In this analysis, we compute Pearson's correlation between the entrainment measures and the emotional bond ratings reported by the patient for the interviews conducted by the therapist for suicide risk assessment. Since the proposed measures are asymmetric in nature, we compute them for both patient-to-therapist and therapist-to-patient entrainment. We also compute the correlation of emotional bond with the baselines used in Experiment 1. We report Pearson's correlation coefficients (r) for this experiment in Table 4.3, along with their p-values. We test against the null hypothesis H_0 that there is no linear association between emotional bond and the candidate measure.

Measure       Pearson's correlation r    p-value
Baseline 1     0.2392                    0.1394
Baseline 2     0.1980                    0.2031
Baseline 3     0.2959                    0.0468
Baseline 4     0.2480                    0.1132
Baseline 5    -0.3815                    0.0127
NED-TP         0.1317                    0.3999
TNED-TP        0.1201                    0.4284
iTNED-TP       0.1756                    0.2435
NED-PT        -0.4479                    0.0095
TNED-PT       -0.4413                    0.0087
iTNED-PT      -0.4672                    0.0043

Table 4.3. Correlation between emotional bond and various measures; TP: therapist-to-patient, PT: patient-to-therapist. p < 0.05 indicates statistically significant correlation.

We summarize the results of our experiment in Table 4.3. We find the patient-to-therapist measures for all of our proposed models (NED, TNED and iTNED) to be negatively correlated with emotional bond, and we notice high statistical significance (p < 0.01) for these correlations. The negative sign of the r values is consistent with previous studies, as a higher distance in acoustic features indicates lower entrainment.
On the other hand, the therapist-to-patient measures do not have a significant correlation with emotional bond. A possible explanation for this finding is that the emotional bond is reported by the patient and is influenced by the degree of their perceived therapist entrainment. Thus, equipped with an asymmetric measure, we are also able to identify the latent directionality of the emotional bond metric. The complexity measure (Baseline 5) also shows statistically significant correlation, but the magnitude of r is lower than that of the proposed measures.

Finally, we conduct another statistical significance test, based on Fisher's transformation, to compare the correlation performance of different measures. The test relies on the construction of a confidence interval for two possibly dependent variables for the comparison of their correlation with a third variable of interest [279]. We find that the correlation obtained with any of the three proposed measures is significantly stronger than that of the best performing baseline (Baseline 5). However, the difference between iTNED-PT and the other two proposed measures is found to be not significant.

To analyze the embeddings encoded by our model, we also compute a t-SNE [175] transformation of the differences of all patient-to-therapist turn embedding pairs, denoted as $z_i - z_j$ in Equation (4.9). Using NED and iTNED, Figures 4.4 and 4.5 respectively show a session with high emotional bond and another with low emotional bond (with values of 7 and 1, respectively) as 2-dimensional scatter plots. Some visible separation can be noticed in the figures between the sessions with low and high emotional bond.

[Figure 4.4. t-SNE plot of the difference vectors of encoded turn-level embeddings for sessions with low and high emotional bond, for NED.]

[Figure 4.5. t-SNE plot of the difference vectors of encoded turn-level embeddings for sessions with low and high emotional bond, for iTNED.]
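The visualization is a standard t-SNE projection of the embedding differences. A minimal sketch with scikit-learn follows; the array names and random seed are illustrative.

```python
from sklearn.manifold import TSNE

def tsne_of_embedding_differences(z_prev, z_next, seed=0):
    """2-D t-SNE of the difference vectors (z_i - z_j) between encoded
    consecutive-turn embeddings, as in Figures 4.4 and 4.5.

    z_prev, z_next: (n_turn_pairs, 30) arrays of turn embeddings.
    """
    diffs = z_prev - z_next
    return TSNE(n_components=2, random_state=seed).fit_transform(diffs)

# Hypothetical usage: scatter-plot the two output columns, colored by
# whether the session's emotional bond rating is low or high.
```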
4.7.2 Behavioral Codes in Couples Therapy

In the next experiment, we focus on two behavioral codes employed to describe the behavior of couples with marital conflict, namely agreement and blame. In the literature, researchers have associated high behavioral entrainment with high agreement between the interlocutors [63]. This relationship was also quantitatively verified in our prior work [190], where a positive correlation was found between an entrainment measure and the code agreement in the Couples Therapy corpus. On the other hand, it has been shown that an interlocutor exhibiting more entrainment towards their partner is likely to show less blame [190, 274]. In the current work, we test the efficacy of our proposed entrainment distance measures in reflecting these behaviors through their correlation with agreement and blame. For comparison, we use the same set of baselines as in the previous correlation analysis experiment. The Pearson correlation coefficients (r) and the corresponding p-values for this experiment are reported in Table 4.4. The null hypothesis H_0 for this experiment is that there is no linear association between the behavioral code and the candidate measure.

Measure       agreement              blame
              r          p-value     r          p-value
Baseline 1     0.2393    0.0349       0.1949    0.0609
Baseline 2     0.2459    0.0335       0.2163    0.0464
Baseline 3     0.2958    0.0186       0.2393    0.0349
Baseline 4     0.2794    0.0233       0.2450    0.0338
Baseline 5    -0.3187    0.0107       0.2787    0.0250
NED           -0.3227    0.0095       0.2943    0.0192
TNED          -0.3399    0.0056       0.3091    0.0150
iTNED         -0.3345    0.0059       0.3249    0.0092

Table 4.4. Correlation between couples therapy behavioral codes and various measures. p < 0.05 indicates statistically significant correlation.

The results for both agreement and blame show the efficacy of our proposed measures, reflected in the correlation coefficient values. The proposed measures as well as the baselines pass the significance test (p < 0.05), except Baseline 1 for blame. For agreement, all proposed measures show a stronger level of significance with p < 0.01, while for blame only iTNED shows the same. All three proposed distance measures are negatively correlated with agreement, indicating an association of higher entrainment with a higher extent of agreement with the interlocutor. On the other hand, blame is found to be positively correlated with all the proposed distance measures, which is justified by the notion of a lack of entrainment being associated with blame.

Similar to the previous experiment with emotional bond, we also check the statistical significance of the differences in the correlation performances. While the correlation with agreement obtained by TNED and iTNED is found to be significantly stronger than that of the baseline with the strongest correlation (Baseline 5), the difference for NED is not significant. For blame, on the other hand, the superiority of all three proposed measures over the best performing baseline (Baseline 5) is statistically significant. The improvement in correlation for iTNED over NED is also found to be significant.

4.8 Application of the Measures as Features

There is a large body of work that associates social success with entrainment, some of which focuses on the relationship quality between couples or romantic partners [190, 220, 266]. In [190], a moderate correlation was found between entrainment and the outcome measure in couples undergoing therapy. Marital outcome is a complex phenomenon, and it depends on interpersonal behavior that is thought to be manifested in long-term changes in the verbal and non-verbal cues during the interactions between the partners. In this experiment, we investigate the possible benefits of leveraging the proposed distance measures for predicting marital outcome. The rationale behind this experiment lies in testing the capability of our measures to characterize the complex dynamics of interactions, which was shown to be useful in our prior work on predicting outcome using acoustic features [191]. Along with standard acoustic features traditionally used for predictive analysis of behavior, that work used short-term and long-term dynamic functionals to capture the dynamics within and across speakers to improve the performance of outcome prediction. Our goal in the current context is to find out whether it is possible to further improve the performance by including the proposed entrainment distance measures. This also seeks an answer to the question of whether the proposed measures manifest additional useful information about the interpersonal dynamics that simpler dynamic functionals fail to capture.

4.8.1 Classification Setup

We binarize the outcome labels into two classes: recovery (outcome rating 4: 51.8% of the samples) vs. no-recovery (outcome ratings 1, 2 and 3 combined: 48.2% of the samples). This also addresses the class imbalance for outcome originally present in the Couples Therapy corpus.
We then perform a binary classification to predict the binarized outcome class of each instance, consisting of a pre-therapy and a post-therapy session. We use a Support Vector Machine (SVM) classifier with a radial basis function (RBF) kernel for this purpose. Prior to the actual classification task, a feature selection method based on mutual information maximization is adopted. As features, we use a set of different acoustic features (prosodic, spectral and voice-quality). Since most acoustic feature representations used here are based on either turn-level features or frame-level descriptors, we use six statistics computed over frames or turns (as applicable), such as mean, standard deviation and maximum. We use a number of baseline feature sets:

Static functionals: functionals computed over all speech segments from each spouse

Dynamic functionals: functionals that represent within-speaker and across-speaker dynamics; short-term functionals are computed across turns within a session, while long-term functionals capture dynamics between the pre-therapy and post-therapy sessions of the same couple

All functionals: static and dynamic functionals combined

For further details of the experimental setup, please refer to our previous work [191].
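A minimal scikit-learn sketch of this setup is shown below. The number of selected features is an illustrative placeholder; the exact configuration is in [191].

```python
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def outcome_classifier(k_features=50):
    """Mutual-information feature selection followed by an RBF-kernel
    SVM, mirroring the classification setup described above."""
    return make_pipeline(
        StandardScaler(),
        SelectKBest(mutual_info_classif, k=k_features),
        SVC(kernel="rbf"),
    )

# Hypothetical usage with 10-fold cross validation over feature
# matrix X and binarized outcome labels y:
# scores = cross_val_score(outcome_classifier(), X, y, cv=10)
# print(scores.mean(), scores.std())
```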
Finally, we compute functionals of the proposed turn-level entrainment distances (NED, TNED and iTNED, separately) and use them in addition to the aforementioned functionals for the outcome prediction task. The results of this experiment, in terms of 10-fold cross validation mean accuracy (as well as standard deviation) for the different feature sets, are reported in Table 4.5.

Feature set                mean accuracy (%)    SD
Chance                     51.8
Static functionals         76.4                 10.0
Dynamic functionals        78.9                  7.6
All functionals            79.3                 10.2
All functionals + NED      80.9                  8.7
All functionals + TNED     81.0                  9.1
All functionals + iTNED    81.4                  8.4

Table 4.5. Classification accuracy of marital outcome with different feature sets

As evident from the results, using the functionals of the proposed measures as features further improves the performance of the outcome prediction task. With NED, the absolute improvement in classification accuracy is 1.6% in comparison with the previous best-performing acoustic feature set, while including TNED achieves a 1.7% improvement. Finally, the feature set with the iTNED measures outperforms all systems and achieves 81.4% accuracy, which is 2.1% above the previous best system that did not include entrainment measures. To test whether these improvements are statistically significant, we perform a two-tailed binomial exact test [229]. Specifically, our null hypothesis is that the sample of predicted classes from our previous best system and that from the proposed system (all functionals + entrainment distances) come from the same distribution. We run this test once each for NED, TNED and iTNED and obtain p-values of 0.0349, 0.0306 and 0.0281, respectively. Hence, at the significance level of p < 0.05, all these measures contribute to significantly improving the classification performance. Thus, the proposed entrainment distances are found to capture complementary information about the dynamics of the interaction between the spouses, improving the classification accuracy of their marital outcome.

4.9 Conclusion

A data-driven measure that captures interpersonal influence in the form of entrainment in vocal patterns can help us understand human conversation as a dynamic process. The measure can also be potentially useful in various applications, such as behavioral analysis and building natural spoken dialog systems. In this paper, we propose a general framework to obtain entrainment distances across consecutive speaker turns in a dyadic interaction. The distances are computed in a latent embedding space that is learned in an unsupervised manner from natural conversational data. As a validation, we first show that the proposed measures can distinguish between real and fake conversations by capturing the entrainment present in the real ones, obtaining better performance in this validation task than the baselines. We then use the measures in two psychotherapy applications and find them to have statistically significant associations with behavioral attributes such as emotional bond, agreement and blame. In all of the experiments, most of our proposed measures, iTNED in particular, perform better than the baselines. The directional nature of these measures was justified in relation to emotional bond, which is itself asymmetric in nature. The promise shown by the unsupervised data-driven framework for learning vocal entrainment further encourages building models that can exploit the temporal context within a turn, preferably with an attention mechanism. Since such a model is difficult to realize in practice due to the assumption of the presence of entrainment in every consecutive turn pair, we intend to explore alternative strategies to address this issue. As another possible future direction, we will investigate supervised learning of entrainment using the pretrained embeddings as features, with limited session-level annotations as weak supervision.

Chapter 5

Quantification of Linguistic Coordination

5.1 Introduction

Understanding linguistic coordination and quantifying it is beneficial for characterizing interpersonal behavior in psychotherapy and for monitoring the quality and efficacy of therapy [140, 185]. Another potential application lies in spoken dialog systems and conversational agents, where the system can learn to use linguistic coordination to communicate efficiently with the human user and create a common ground [171].

According to Pickering and Garrod's model [210], there exist several different components of linguistic coordination: lexical, syntactic and semantic. Among these, lexical entrainment has arguably been the focus of the most attention, primarily in psycholinguistics [30, 75]. While it is a complex and multifaceted phenomenon, a number of studies have explored specific forms of lexical entrainment, such as linguistic style matching [196], similarity in the choice of high-frequency words [194], similarity in referring expressions [30], and similarity in style words [55]. Researchers in computational linguistics have also tried to quantitatively measure lexical entrainment in conversational settings. For example, [194] used a unigram model of different classes of words and measured lexical entrainment as the cumulative difference in unigram scores between the interlocutors.

However, the majority of the computational approaches for measuring linguistic coordination have been limited to lexical entrainment, agnostic to coordination in the semantic space or in syntactic structures. Coordination in semantics is closely related to cohesion [104], another mechanism in linguistics, which ties together different words used in continuation of a shared context.
Approaches towards the quantification of cohesion have primarily been used in tasks like text classification and discourse segmentation [176]. In these applications, however, cohesion is defined within a document, as opposed to the cohesion between the interlocutors in dyadic conversations, which is what we are interested in. There have been only a few attempts to model the latter, by exploring the relation between synonymous words (e.g., via WordNet) used by different speakers in the domain of intelligent tutoring systems [94, 263]. However, this body of work suffers from the limitation that two words might be semantically or syntactically related even without being synonyms. Further, using any of the lexical entrainment or cohesion measures alone does not provide a complete representation of linguistic coordination.

Addressing the aforementioned limitations, we propose the use of neural word and sentence embeddings for the quantification of linguistic coordination. We elaborate on multiple measures based on different approaches to computing utterance-level distances, and then use them as building blocks to come up with different session-level measures.

5.2 Linguistic Coordination Distances in Conversations

Our approach to computing a linguistic coordination measure for each utterance of a speaker relies on quantifying the extent to which their interlocutor's following utterances coordinate with the given utterance. We measure the coordination across utterances in two different ways: using Word Mover's Distance, which exploits ties between words, and using sentence embeddings. In addition, we show how they can be fused together. We then propose using the utterance-level distances as building blocks in a longer context to finally compute conversational linguistic distances that can capture lexical and semantic dissimilarity.

5.2.1 Utterance-level distances as building blocks

We present two alternate approaches to computing distances between two utterances: Word Mover's Distance and leveraging sentence embeddings. Finally, we also show how we can combine both of these approaches, which are assumed to capture complementary information.

5.2.1.1 Word Mover's Distance (WMD)

Word Mover's Distance (WMD) was introduced by Kusner et al. [148] as a distance measure between text documents. The measure is based on the concept of neural word embeddings, which provide distributed vector representations of words in a document. Although any neural word embedding could be used in measuring WMD, it was originally proposed using one of the most popular word embeddings, word2vec [182]. word2vec has been shown to contain semantic and syntactic information [182], making WMD suitable for capturing different aspects of linguistic coordination. Unlike the original WMD paper, we include stop words (which do not carry much semantic information) in our framework, in order to capture lexical entrainment patterns of using similar high-frequency and style words. WMD is essentially a bag-of-words approach where each document is a collection of words represented as vectors in the embedding space. In principle, it can be interpreted as the minimum transport cost to reach the embedded words of one document from the embedded words of the other. Inherently, this measure relies on the individual distances between pairs of words in the vector space as building blocks.
For a pair of words $w_i$ and $w_j$, with embedding vectors $v_i = e(w_i)$ and $v_j = e(w_j)$, the Euclidean distance between the embeddings is computed as the first step:

$d(w_i, w_j) = \lVert v_i - v_j \rVert$   (5.1)

Based on this, the distance between a pair of utterances $U_1$ and $U_2$ is formulated as follows:

$\mathrm{WMD}(U_1, U_2) = \min_{T \geq 0} \sum_{i=1}^{m} \sum_{j=1}^{n} T_{ij} \, d(w_i, w_j)$   (5.2)

subject to

$\sum_{j=1}^{n} T_{ij} = \frac{c_i^1}{n} \quad \forall i \in \{1, \ldots, m\}$, and

$\sum_{i=1}^{m} T_{ij} = \frac{c_j^2}{m} \quad \forall j \in \{1, \ldots, n\}$,

where $m$ and $n$ are the numbers of unique words in $U_1$ and $U_2$ respectively, and $c_i^k$ is the frequency of $w_i$ in $U_k$. The computation of WMD involves a constrained optimization problem of finding an optimal flow matrix $T$, which can be solved using many exact and approximate techniques. In fact, this is a special case of the earth mover's distance computation, a widely known transportation problem [227]. In Figure 5.1, we illustrate how the WMD between two utterances is computed in the vector space of word embeddings (only two dimensions are shown for interpretability). The optimal selection of $T$ can be interpreted as finding ties between neighboring words in the vector space, as seen in the figure. Although WMD was originally introduced for documents, it has more recently also been applied to sentences [223], and in this work, we use it for utterances.

[Figure 5.1. Illustration of WMD, where each word from one utterance is mapped to the most similar word in the other utterance; shown for the example utterances "The pasta was very tasty today." and "The spaghetti was utterly delicious." in a two-dimensional word2vec embedding space]
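As an illustrative sketch (not the exact implementation used in this work), such an utterance-level WMD can be computed with gensim's wmdistance on pretrained word2vec vectors; the vector file path is a placeholder, the tokenization is deliberately simple, and stop words are kept, as discussed above.

```python
from gensim.models import KeyedVectors

# Load pretrained 300-dimensional word2vec vectors (path is a placeholder).
wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def utterance_wmd(u1: str, u2: str) -> float:
    """WMD between two utterances; stop words kept, punctuation stripped."""
    tokens1 = [w.strip(".,!?").lower() for w in u1.split()]
    tokens2 = [w.strip(".,!?").lower() for w in u2.split()]
    return wv.wmdistance(tokens1, tokens2)

print(utterance_wmd("The pasta was very tasty today.",
                    "The spaghetti was utterly delicious."))
```

Note that wmdistance drops out-of-vocabulary tokens and relies on an external earth mover's distance solver (pyemd or POT, depending on the gensim version), so heavily disfluent speech transcripts may need extra preprocessing.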
5.2.1.2 Use of sentence embeddings

Following the popularity and widespread application of word representations, researchers have also been working on developing semantic representations of sentences and documents [42, 152, 201, 254, 255]. Many of these sentence embeddings have shown promising results in several semantic tasks such as sentiment analysis and natural language inference. In our application, however, we are mainly interested in an embedding that is trained via a conversational model leveraging the input-response relationship; for instance, given an input utterance from a speaker, we aim to predict the response of the other speaker. Such a model should learn the typical relationship between two consecutive utterances, and hence the embedding would represent the information that is typically coordinated. In particular, we employ the neural conversation model proposed in [261] and use the hidden states from an encoder-decoder network trained on a spoken conversational dataset. This is adopted from the work of Tseng et al. [255], who used similar embeddings for behavior classification.

Our encoder-decoder (or seq2seq [248]) network is trained on the Fisher corpus, which has already been described in Chapter 4. We use a recurrent neural network, namely a three-layered LSTM, as the encoder and the decoder, with each layer consisting of 512 memory cells. We also use an attention mechanism [8], which can potentially focus on the specific words in the input utterance that are more salient for coordination. However, as noted in [261], this model is limited by its short context length, which is just one utterance. We should note that the goal of this model is not to simulate a perfect query-response system, but rather to represent utterances using embeddings that carry information regarding how they relate to their immediate neighboring utterances. The network architecture is shown in Figure 5.2.

[Figure 5.2. Illustration of the neural conversation model with attention, with an input "ABC" followed by a response "XY"; the sentence embedding is taken from the encoder's hidden states]

Once the network is trained, we extract the embedding for each utterance (considering it as a single sentence) in the test data and take the cosine distance between the embeddings of two utterances as a measure of linguistic distance.

5.2.1.3 Fusion Approach

The two previous measures of utterance-level distance are motivated to capture different and possibly complementary aspects and scopes of linguistic elements. WMD measures similarity in words across utterances and does not consider the temporal context of words within utterances. The sentence embedding obtained through the neural conversation modeling (NCM) approach, on the other hand, looks at an utterance as a single entity where the context is preserved, but with a loss of detailed representation of the individual words within the utterance. Hence, we propose a fusion approach that linearly combines the two distance measures. If the WMD distance and the sentence embedding distance between two utterances are denoted as $d_{WMD}$ and $d_{NCM}$, the combined distance is computed as follows:

$d_{combined} = \lambda d_{WMD} + (1 - \lambda) d_{NCM}$,   (5.3)

where $\lambda$ is a parameter between 0 and 1. We set the optimal value of $\lambda$ to be 0.6 based on the model's performance in distinguishing real vs. fake conversations on the validation subset of the Fisher corpus. The details of this experiment have already been described in Chapter 4.
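A minimal, self-contained sketch of the NCM distance and the fusion of Equation (5.3) follows; the encode() function is a hypothetical stand-in for the trained seq2seq encoder (here a deterministic random vector, purely so the sketch runs).

```python
import numpy as np

def encode(utterance: str) -> np.ndarray:
    """Stand-in for the trained encoder's utterance embedding (hypothetical)."""
    rng = np.random.default_rng(abs(hash(utterance)) % (2**32))
    return rng.normal(size=512)  # 512-dim, matching the LSTM cell size above

def ncm_distance(u1: str, u2: str) -> float:
    """Cosine distance between the two utterances' encoder embeddings."""
    e1, e2 = encode(u1), encode(u2)
    return 1.0 - float(e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2)))

def fused_distance(d_wmd: float, d_ncm: float, lam: float = 0.6) -> float:
    """Equation (5.3): linear combination of the WMD and NCM distances."""
    return lam * d_wmd + (1.0 - lam) * d_ncm
```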
5.2.2 Conversational Linguistic Distances

In the previous subsection, we discussed how we can compute a measure of the linguistic difference between two utterances. Here we describe how it is extended to a distance measure capturing linguistic coordination, which we name the Conversational Linguistic Distance (CLiD). More specifically, we propose an unnormalized and a normalized distance (uCLiD and nCLiD).

5.2.2.1 Local Interpersonal Distance

Although linguistic coordination occurs at multiple levels, we focus on capturing it at a local scale, i.e., between consecutive turns of the interlocutors. The alternative is to measure the coordination globally by considering all the words used by each of the interlocutors as a single document and computing the distance between these documents. While similar approaches have been adopted in prior work on lexical entrainment [194], the coarse resolution of such a measure can potentially fail to capture the dynamics of the conversation.

On the other hand, measuring the distance between one speaker turn and the immediately following one is a simple local measure that is appealing for our purpose. However, local coordination is not necessarily expressed in the immediate response to the primary speaker's turn; rather, it might be sustained and exhibited after a few turns [211]. Hence, we propose a scheme where we consider a predefined number of turns (defined as the context length) in response to the utterance of the primary speaker (referred to as the anchor), and choose the minimum of the distances of every pair formed by the anchor utterance and a response. This can be interpreted as the maximum coordination exhibited towards the primary speaker by their interlocutor in the causal vicinity of the original utterance. In a similar approach, [222] considered a predefined time window (as opposed to a fixed number of turns) as the context length to find instances of syntactic coordination.

Let us consider the scenario where two interlocutors A and B converse with each other and each of them takes $N$ turns. We can use any of the three utterance-level distances ($D_{WMD}$, $D_{NCM}$, or $D_{fus}$) and denote it as $D(\cdot,\cdot)$ in the following formulation. $A_1, A_2, \ldots, A_N$ and $B_1, B_2, \ldots, B_N$ represent the utterances of A and B respectively. Given a context length $k$, for every anchor utterance $A_i$, we compute a distance $d_i^{A \to B}$ over the next $k$ utterances of B following $A_i$ as follows:

$d_i^{A \to B} = \min_{i \leq j \leq i+k-1, \; j \leq N} D(A_i, B_j)$   (5.4)

It should be noted that we obtain two sequences of directional distance measures for the entire session, $\{d_i^{A \to B}\}$ and $\{d_i^{B \to A}\}$, due to the asymmetric nature of Equation (5.4).

5.2.2.2 Session-level measures

Although local distance measures provide a good characterization of the interpersonal coordination that happens throughout the course of a conversation, an aggregated session-level measure obtained from the local distances can be more useful for session-level analysis in applications like behavioral analysis. We simply take the average of the local distances defined in Equation (5.4) over the whole session to compute the session-level measure, which we call the unnormalized Conversational Linguistic Distance (uCLiD):

$\mathrm{uCLiD} = \frac{1}{N} \sum_{i=1}^{N} d_i^{A \to B}$   (5.5)

In this equation, only the uCLiD for $A \to B$ has been shown, which captures interlocutor A's coordination with B; $B \to A$ can be computed similarly. While the uCLiD measure captures how much overall linguistic coordination occurs between interlocutors in a conversation, it is also influenced by the nature of the conversation: whether it is a structured conversation on a pre-decided topic, an unrestricted spontaneous interaction, or something in between. It can also be affected by the extent to which the interlocutors tend to use similar language in the conversation as a whole, as a result of coordinating with their own language. To account for these phenomena, we use a normalized distance which attempts to provide a more suitable measure for applications where the nature of the conversation is not important. We draw inspiration from a similar approach by Jones et al. [126], who compute a factor called the Zelig Quotient for normalization. In our work, we first define a normalization factor $\alpha$, computed as the average pairwise utterance-level distance throughout the session, including within and across interlocutors. Next, the normalized distance measure, which we term the normalized Conversational Linguistic Distance (nCLiD), is computed by dividing uCLiD by $\alpha$, as follows:

$\mathrm{nCLiD} = \frac{\mathrm{uCLiD}}{\alpha}$,   (5.6)

where

$\alpha = \frac{2}{N(N-1)} \sum_{i=1}^{N} \sum_{j=i+1}^{N} D(A_i, A_j) + \frac{2}{N(N-1)} \sum_{i=1}^{N} \sum_{j=i+1}^{N} D(B_i, B_j) + \frac{2}{N(N+1)} \sum_{i=1}^{N} \sum_{j=i}^{N} D(A_i, B_j)$   (5.7)

On the right-hand side of Equation (5.7), the first two terms are the average utterance-level distances within A and within B, which relate to each speaker's tendency to change their language throughout the conversation. The third term represents the overall tendency of each interlocutor to accommodate the other. To summarize, we obtain uCLiD and nCLiD measures to quantify linguistic coordination in both directions, $A \to B$ and $B \to A$.
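Under these definitions, Equations (5.4)-(5.7) translate directly into code; the following is a compact reference sketch, assuming utterance lists A and B of equal length and any utterance-level distance function D (for instance, the fused distance above).

```python
import numpy as np

def local_distances(A, B, D, k=6):
    """Eq. (5.4): for each anchor A[i], min distance to B's next k responses."""
    N = len(A)
    return np.array([
        min(D(A[i], B[j]) for j in range(i, min(i + k, N)))
        for i in range(N)
    ])

def uclid(A, B, D, k=6):
    """Eq. (5.5): unnormalized CLiD in the A -> B direction."""
    return local_distances(A, B, D, k).mean()

def nclid(A, B, D, k=6):
    """Eqs. (5.6)-(5.7): uCLiD divided by the session normalization factor alpha."""
    N = len(A)
    within_A = np.mean([D(A[i], A[j]) for i in range(N) for j in range(i + 1, N)])
    within_B = np.mean([D(B[i], B[j]) for i in range(N) for j in range(i + 1, N)])
    across = np.mean([D(A[i], B[j]) for i in range(N) for j in range(i, N)])
    alpha = within_A + within_B + across
    return uclid(A, B, D, k) / alpha
```

Swapping the roles of A and B gives the measures in the B -> A direction.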
5.3 Datasets

Three datasets are used in this work: a corpus consisting of five independent clinical studies in addiction counseling (the Motivational Interviewing corpus), a corpus consisting of interactions of married couples undergoing marital therapy (the Couples Therapy corpus), and a corpus of movie dialogs (the Cornell Movie Dialogs corpus).

5.3.1 Motivational Interviewing corpus

This corpus consists of therapist-patient interactions in Motivational Interviewing (MI), a form of addiction counseling in psychotherapy. In each interview, the aim of the therapist is to help the patient, who is seeking therapy for substance addiction, make behavioral changes by resolving ambivalence about their problems. There are 145 interactions in total, collected from five clinical studies: ARC, ESPSB, ESB21, iCHAMP and HMCBI [6]. The interactions, which range from 20 minutes to an hour, take place between therapists and real patients struggling with alcohol, marijuana and poly-drug addiction. Each interaction was recorded on tape and manually transcribed and annotated for speaker labels, turn timings, back-channels, disfluencies, etc. In addition, each therapist was assigned an overall, session-level rating for the behavior code empathy based on the Motivational Interviewing Treatment Integrity (MITI) manual [184]. The rating was performed on a Likert scale from 1 to 7, where low (high) values indicate low (high) levels of empathy exhibited by the therapist.

5.3.2 Couples Therapy corpus

We have already introduced this corpus in earlier chapters. In this work, our focus lies on analyzing only two codes from the SSIRS system: Global Positive Affect and Global Negative Affect. The corpus also includes the therapy outcomes of the couples as a measure of their relationship quality relative to the beginning of the therapy. Rated on two occasions (26 weeks and/or 2 years), which we refer to as post-therapy sessions, the outcome is given on a 4-point scale: 1 (deterioration), 2 (no change), 3 (partial recovery), and 4 (complete recovery).

5.3.3 Cornell Movie Dialogs Corpus

This corpus [57] consists of movie dialogs collected and curated by researchers at Cornell University. It was created by matching movie scripts from various sources with the IMDb database. There are over 220,000 dialogs from 617 movies spanning a diverse set of genres. The dataset also contains some metadata from IMDb, such as genre, release year and top-billed characters for each movie. It is one of the largest datasets of metadata-rich movie conversations.

5.4 Experiments

We applied the proposed measures in three case studies using the datasets described in Section 5.3. Obtaining ground-truth labels for coordination is not only a challenging task but also highly subjective. Due to the lack of any accepted labeled dataset in the domain, we use a number of proxy behavioral attributes that are known to be associated with coordination to indirectly evaluate the proposed measures. In this section, we describe the correlation analysis experiments conducted to indirectly validate our proposed measures. We use both unnormalized and normalized distances for each of the three utterance-level distances (WMD, NCM, fusion), resulting in six measures.

5.4.1 Baselines

We use a number of baseline methods to compare with the proposed method:

- Turn-level lexical similarity based on TF-IDF [167] (a minimal sketch of this baseline follows the list)
- Cohesion (distance) measure based on WordNet [263]
- Global WMD, measured between the language of the interlocutors taken together, as described in Section 5.2.2.1
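For the first baseline, the following is a rough sketch of turn-level lexical similarity under our simplifying assumptions (not necessarily the exact configuration of [167]): the cosine similarity of TF-IDF vectors of consecutive turns, averaged over the session.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def turn_level_tfidf_similarity(turns):
    """Mean cosine similarity of TF-IDF vectors of consecutive turns."""
    vec = TfidfVectorizer()            # IDF fitted on the session's own turns
    X = vec.fit_transform(turns)
    sims = [cosine_similarity(X[i], X[i + 1])[0, 0]
            for i in range(len(turns) - 1)]
    return sum(sims) / len(sims)

print(turn_level_tfidf_similarity([
    "I think we should talk about the budget.",
    "Sure, the budget has been on my mind too.",
    "Good, then let's start with the rent.",
]))
```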
5.4.2 Case Study 1: Empathy in Motivational Interviews

Deemed an important interpersonal behavior in counseling-based psychotherapy, empathy has been shown to be positively associated with entrainment both in domain theory [217] and in computational studies [173, 276]. In this case study, we compute the Spearman's ρ correlation between the proposed linguistic coordination measures (uCLiD and nCLiD) and the empathy ratings. Due to the asymmetric nature of the proposed measure, we obtain each of these measures in two directions: therapist-to-patient and patient-to-therapist. Since empathy is a behavior expressed by the therapist, intuitively it should not be affected by how much coordination the patient exhibits. As a verification, we found no significant correlation between the therapist-to-patient distance (using the nCLiD measure) and empathy (ρ = 0.0521, p = 0.4344). Hence we consider only the patient-to-therapist coordination distance, focusing only on the coordination exhibited by the therapist. We empirically set the context length parameter of our measure to k = 6 and use a 300-dimensional pre-trained word2vec model (pre-trained on Google News, with a vocabulary of 3 million words and phrases). We also report the p-values against the null hypothesis H0 that there is no monotonic (rank-ordered) association between empathy and the candidate measure. We repeat the same procedure for the baselines as well.

Measure           Spearman's ρ    p-value
uCLiD-WMD         -0.2283         0.0103
uCLiD-NCM         -0.2419         0.076
uCLiD-fusion      -0.2465         0.069
nCLiD-WMD         -0.2639         0.0026
nCLiD-NCM         -0.2447         0.0072
nCLiD-fusion      -0.3120         0.0009
TF-IDF [167] †     0.1152         0.1675
WordNet [263]     -0.0952         0.2546
global WMD        -0.1710         0.0398

Table 5.1. Correlation between empathy and various coordination measures (p < 0.05 indicates a statistically significant correlation; † denotes a similarity measure, while the other measures are distances)

From the results shown in Table 5.1, we can observe that both the unnormalized and the normalized measures (uCLiD and nCLiD, respectively) exhibit stronger correlations than the baselines. We also notice the improvement from normalization, as nCLiD turns out to be the most highly correlated measure. The negative sign of the correlation values is expected for the proposed measures, since we expect sessions with higher empathy to have higher coordination and, hence, lower distance. We also observe p-values lower than 0.05 for most of the proposed measures, indicating a statistically significant association between empathy and the proposed measures.
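The correlation analysis itself is straightforward; the following sketch uses SciPy, assuming parallel arrays of session-level distances and empathy ratings (the values shown are placeholders).

```python
from scipy.stats import spearmanr

# Patient-to-therapist nCLiD per session, and session-level empathy ratings (1-7).
distances = [0.81, 0.92, 0.74, 1.05, 0.88]   # placeholder values
empathy   = [6, 4, 7, 3, 5]                  # placeholder values

rho, p_value = spearmanr(distances, empathy)
print(f"Spearman rho = {rho:.4f}, p = {p_value:.4f}")
```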
5.4.3 Case Study 2: Couples Therapy

5.4.3.1 Individual behavioral codes

In the Couples Therapy domain, we first explore the possible association of linguistic coordination with positive and negative affect. We adopt the same context length parameter value for our measures (k = 6) and use the same baselines for comparison as in the previous case study. We consider the coordination exhibited by a subject (husband or wife) with their spouse for the behavior ratings of the former. For example, as far as the husband's positive affective behavior is concerned, we only analyze how much the husband coordinated with the wife during the session.

The results in Table 5.2 show that we obtain higher correlation values for our proposed measures than for the baselines, and that the normalized measure again exhibits the strongest correlation. Judging by the sign of ρ, coordination distance is higher for subjects with lower positive affect and lower for subjects with lower negative affect, which is consistent with the literature associating entrainment with behavior [156].

                     positive                      negative
Measure           ρ          p-value            ρ          p-value
uCLiD-WMD         -0.2903    9.9 x 10^-5        0.3142     3.4 x 10^-8
uCLiD-NCM         -0.2851    2.3 x 10^-4        0.2916     1.1 x 10^-6
uCLiD-fusion      -0.3273    3.0 x 10^-9        0.3234     8.5 x 10^-9
nCLiD-WMD         -0.3068    1.2 x 10^-7        0.3371     2.1 x 10^-10
nCLiD-NCM         -0.3017    9.5 x 10^-6        0.3089     2.6 x 10^-7
nCLiD-fusion      -0.3380    3.5 x 10^-10       0.3298     4.7 x 10^-9
TF-IDF [167] †     0.1542    0.0001             -0.2119    2 x 10^-4
WordNet [263]     -0.0847    0.0020             0.0952     0.0005
global WMD        -0.1310    0.0001             0.1556     0.0001

Table 5.2. Correlation between various coordination measures and affective behaviors (positive and negative); † denotes a similarity measure, while the other measures are distances

5.4.3.2 Therapy outcome

We hypothesize that the coordination distance between the spouses (measured by the average of the husband-to-wife and wife-to-husband distances) decreases in the post-therapy session with respect to the pre-therapy session if the couple has fully recovered (outcome rating "4"). We conduct a paired Wilcoxon signed-rank test against the null hypothesis H0 that the pre- and post-therapy measures come from the same distribution. In this experiment, we only use the fusion method for the utterance-level distance. We obtain p = 0.018 for the uCLiD-fusion measure and p = 0.011 for the nCLiD-fusion measure. This indicates a statistically significant (p < 0.05) observation that the couples who had recovered also exhibited lower coordination distance, or in other words, higher linguistic coordination, after therapy.
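This paired test is available in SciPy; a sketch assuming per-couple pre- and post-therapy nCLiD-fusion values for the fully recovered couples (the values shown are placeholders):

```python
from scipy.stats import wilcoxon

# One entry per fully recovered couple (placeholder values).
pre_therapy  = [0.95, 1.02, 0.88, 0.99, 0.91, 1.05]
post_therapy = [0.87, 0.93, 0.90, 0.85, 0.84, 0.96]

stat, p_value = wilcoxon(pre_therapy, post_therapy)  # paired signed-rank test
print(f"Wilcoxon statistic = {stat}, p = {p_value:.3f}")
```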
5.4.4 Case Study 3: Analysis of Coordination in Movie Dialogs

We apply our proposed measures to movie dialogs to analyze the degree of linguistic coordination in movies. It should be noted that movies can be seen as imagined conversations as conceived by scriptwriters, and do not necessarily portray the characteristics of real-world spoken conversations [57, 146, 179]. Hence we wanted to investigate whether there is any noticeable difference in the extent of linguistic coordination between movie dialogs and real-life conversations. In particular, we computed the nCLiD-fusion measure for all the dialogs in the Cornell Movie Dialogs corpus as well as in the Fisher corpus, as an example of real-world conversations.

The average linguistic coordination distance as measured by our proposed metric is found to be 0.9782 in the movie corpus, while the mean of the measure in the Fisher corpus is 0.8128. We also run a two-sample t-test for the difference between the two means and obtain p = 0.0037, indicating statistical significance. This suggests that movie dialogs tend to have lower linguistic coordination than real-world spontaneous conversations. This is consistent with the prior literature and could be a result of the lack of direct social benefits in movie conversations as opposed to real life.

Further, we analyze linguistic coordination between major characters across different movie genres. In this experiment, we consider the six genres with the most movies in the dataset. The genres can be overlapping, in the sense that a movie may fall under multiple genres; in such cases we include the movie in each of its genres. We only consider conversations between the two top-billed characters of each movie and compute the linguistic coordination for every such dialog segment. We report the mean and standard deviation of nCLiD across the different genres in Table 5.3.

Genre       nCLiD-fusion mean    SD
Romance     0.7622               0.1473
Comedy      0.8623               0.1592
Drama       0.9252               0.1550
Action      1.032                0.1334
Sci-fi      1.131                0.0902
Horror      1.189                0.1237

Table 5.3. nCLiD-fusion in movie dialogs across different genres (sorted in increasing order of mean)

We observe that the Romance genre has the most coordination and the Horror genre has the least. This is intuitive if we take into account the likely nature of the social relationships between the major characters in these genres.

5.5 Conclusion and Future Work

In this work, we present a novel distance measure to quantify linguistic coordination in dyadic conversations. Equipped with neural word embeddings, our proposed measure can potentially capture different aspects of linguistic coordination (lexical, semantic and syntactic). From the experiments performed in the case studies, we establish the usefulness of the measure in capturing interpersonal behavioral information. In the future, we intend to study the effect of the context length parameter on our measure. We could use more recent and potentially more powerful neural word embedding techniques (such as BERT or ELMo) instead of word2vec in a framework similar to the one presented here. Motivated by the efficacy of neural word embeddings in relation to linguistic coordination, we would also like to explore models that jointly learn an embedding encoding the shared linguistic information between the interlocutors, similar to [255]. We would also like to investigate linguistic coordination in-the-wild through ASR transcripts, using embeddings such as conf2vec [238]. Another possible research direction is to investigate modeling a fused measure combining linguistic and vocal coordination.

Chapter 6
Summary and Future work

In this chapter, we briefly summarize the key ideas, findings and improvements obtained from the research work described in the previous chapters, and then discuss some directions for future work.

6.1 Summary

First, we presented a framework for using temporal dynamics to predict the marital relationship status of distressed couples in therapy, using acoustic information from their spoken interactions. We introduced knowledge-driven features capturing short-term and long-term descriptors of dynamics for this task. As we showed in Chapter 2, this automatic approach of capturing dynamics and relevant behavioral information directly from the speech signal performs better at predicting the outcome than the traditional approach taken by psychologists, i.e., manual coding of behavior from therapy sessions. This is a promising finding considering the fact that human coders had utilized multiple modalities (speech, visual and lexical information) in their coding process. Even though behavioral codes are not designed to predict outcomes themselves, they function as behavioral descriptors of the couple, and one can expect them to be informative towards the outcome based on the observational methods of psychology. We also found that dynamic functionals are better than traditional static functionals of acoustic features for outcome prediction. This work opens up avenues for many other research applications and similar frameworks for various behavioral outcome prediction tasks, such as assessing the results of treatment for various disorders and conditions.

The later part of our research involved analyzing the interpersonal and temporal dynamics during dyadic conversations, which are closely related to the phenomenon of entrainment.
Our first approach involved modeling dyadic spoken interactions as nonlinear dynamical systems for behavioral analysis. We used various complexity measures as a characterization of the entrainment in the vocal patterns of the interlocutors during the conversation. First, we experimentally verified the measures, and then presented results on applications of complexity measures in two important domains of behavioral analysis: couples therapy and suicide risk assessment interviews.

To address the limitations of the knowledge-driven approach of complexity measures, we explored how we could leverage data collected from real conversations and learn what information could be transferred as entrainment. This led us to a data-driven, deep neural network based approach of capturing entrainment in speech. We also provided experimental results showing the improvement obtained by introducing this approach.

Finally, we proposed quantifying coordination in the language modality in the form of linguistic coordination. We leveraged distributed representations of words and sentences using neural network models to compute distances between utterances, and then measured linguistic coordination on a global scale. We showed the application of these conversational linguistic distance measures in relation to different behaviors: affective behaviors in couples therapy, and the empathy of the therapist towards the patient in motivational interviewing. Using the proposed measures, we also analyzed the extent of linguistic coordination between major characters in dialogs from movies across different genres.

6.2 Future Work

Our previous work on modeling entrainment with deep neural modeling was promising enough to motivate us to explore better learning mechanisms for exploiting data-driven approaches to capture entrainment. The proposed neural network-based measures have the limitation that they do not exploit the temporal structure of vocal patterns within a turn; rather, they use functionals to represent a turn. We can address this shortcoming by richer modeling of the features in a turn that preserves the temporal information, such as recurrent neural networks (RNNs). Moreover, employing an attention mechanism [257] between the encoder and decoder might help build architectures even more capable of preserving the information relevant to entrainment. That way, we can also learn the relative importance of speech features across time within a turn in the context of entrainment. Also, using longer context by looking into multiple turns and capturing the information in a context vector might prove useful. We also intend to explore (weakly) supervised learning of entrainment using the bottleneck embeddings as features, in the presence of session-level annotations. Another promising direction of research is in analyzing the trajectory of local (turn-level) entrainment during the course of the conversation and identifying 'salient' events corresponding to certain behaviors.

In addition to speech and language, we would also explore other modalities for entrainment, such as entrainment in facial expressions, head motion and body gestures. Although many of these modalities have already been studied extensively for entrainment or mimicry, the relationships across different modalities are yet to be explored. We will study cross-modal characteristics of entrainment in different modalities and analyze how, if at all, entrainment in one modality affects entrainment in another modality.
We will explore the possibility of finding a single multimodal entrainment measure by learning a joint embedding space using neural networks that captures visual, vocal and lexical information. This could be achieved either by combining the embeddings obtained in different modalities, or by jointly learning a single embedding that retains information from different modalities. Finally, we would also investigate the presence and nature of entrainment in the affective space (valence, arousal) in dyadic interactions.

Bibliography

[1] Rehan Akbani, Stephen Kwek, and Nathalie Japkowicz. Applying support vector machines to imbalanced datasets. In European Conference on Machine Learning, pages 39–50. Springer, 2004.
[2] AM Albano, A Passamante, T Hediger, and Mary Eileen Farrell. Using neural nets to look for chaos. Physica D: Nonlinear Phenomena, 58(1-4):1–9, 1992.
[3] Justin S Albrechtsen, Christian A Meissner, and Kyle J Susa. Can intuition improve deception detection performance? Journal of Experimental Social Psychology, 45(4):1052–1055, 2009.
[4] Nalini Ambady and Robert Rosenthal. Thin slices of expressive behavior as predictors of interpersonal consequences: A meta-analysis. Psychological Bulletin, 111(2):256, 1992.
[5] David C Atkins, Sara B Berns, William H George, Brian D Doss, Krista Gattis, and Andrew Christensen. Prediction of response to treatment in a randomized clinical trial of marital therapy. Journal of Consulting and Clinical Psychology, 73(5):893, 2005.
[6] David C Atkins, Mark Steyvers, Zac E Imel, and Padhraic Smyth. Scaling up the evaluation of psychotherapy: evaluating motivational interviewing fidelity via statistical text classification. Implementation Science, 9(1):49, 2014.
[7] Jo-Anne Bachorowski and Michael J Owren. Vocal expression of emotion: Acoustic properties of speech are associated with emotional intensity and context. Psychological Science, 6(4):219–224, 1995.
[8] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[9] Michael Banbrook, Stephen McLaughlin, and Iain Mann. Speech characterization and synthesis by nonlinear methods. IEEE Transactions on Speech and Audio Processing, 7(1):1–17, 1999.
[10] B.R. Baucom, A.O. Crenshaw, C.J. Bryan, T.A. Clemans, T.O. Bruce, and M.D. Rudd. Patient and clinician vocally encoded emotional arousal as predictors of response to brief interventions for suicidality. Brief Cognitive Behavioral Interventions to Reduce Suicide Attempts in Military Personnel. Association for Behavioral and Cognitive Therapies, 2014.
[11] Brian R Baucom, David C Atkins, Lorelei Simpson Rowe, Brian D Doss, and Andrew Christensen. Prediction of treatment response at 5-year follow-up in a randomized clinical trial of behaviorally based couple therapies. Journal of Consulting and Clinical Psychology, 83(1):103, 2015.
[12] Brian R Baucom, David C Atkins, Lorelei E Simpson, and Andrew Christensen. Prediction of response to treatment in a randomized clinical trial of couple therapy: a 2-year follow-up. Journal of Consulting and Clinical Psychology, 77(1):160, 2009.
[13] Donald H Baucom, Norman B Epstein, Jennifer S Kirby, and Jaslean J LaTaillade. Cognitive-behavioral couple therapy. Handbook of Cognitive-Behavioral Therapies, page 411, 2002.
[14] Katherine JW Baucom, Brian R Baucom, and Andrew Christensen. Changes in dyadic communication during and after integrative and traditional behavioral couple therapy.
Behaviour Research and Therapy, 65:18–28, 2015.
[15] Katherine JW Baucom, Mia Sevier, Kathleen A Eldridge, Brian D Doss, and Andrew Christensen. Observed communication in couples two years after integrative and traditional behavioral couple therapy: outcome and link with five-year follow-up. Journal of Consulting and Clinical Psychology, 79(5):565, 2011.
[16] Margret Bauer, John W Cox, Michelle H Caveness, James J Downs, and Nina F Thornhill. Finding the direction of disturbance propagation in a chemical process using transfer entropy. IEEE Transactions on Control Systems Technology, 15(1):12–21, 2007.
[17] Aaron T Beck, Maria Kovacs, and Arlene Weissman. Assessment of suicidal intention: the Scale for Suicide Ideation. Journal of Consulting and Clinical Psychology, 47(2):343, 1979.
[18] Peter M Bentler and Michael D Newcomb. Longitudinal study of marital success and failure. Journal of Consulting and Clinical Psychology, 46(5):1053, 1978.
[19] Štefan Beňuš. Social aspects of entrainment in spoken interaction. Cognitive Computation, 6(4):802–813, 2014.
[20] Štefan Beňuš, Agustín Gravano, Rivka Levitan, Sarah Ita Levitan, Laura Willson, and Julia Hirschberg. Entrainment, dominance and alliance in supreme court hearings. Knowledge-Based Systems, 71:3–14, 2014.
[21] Frances R. Bilous and Robert M. Krauss. Dominance and accommodation in the conversational behaviours of same- and mixed-gender dyads. Language and Communication, 8(3):183–194, 1988. Special Issue on Communicative Accommodation: Recent Developments.
[22] Dmitri Bitouk, Ragini Verma, and Ani Nenkova. Class-level spectral features for emotion recognition. Speech Communication, 52(7):613–625, 2010.
[23] Matthew Black, Panayiotis G Georgiou, Athanasios Katsamanis, Brian R Baucom, and Shrikanth S Narayanan. "You made me do it": Classification of blame in married couples' interactions by fusing automatically derived speech and language information. In INTERSPEECH, pages 89–92, 2011.
[24] Matthew Black, Athanasios Katsamanis, Chi-Chun Lee, Adam Lammert, Brian Baucom, Andrew Christensen, Panayiotis Georgiou, and Shrikanth S. Narayanan. Automatic classification of married couples' behavior using audio features. In Proceedings of InterSpeech, Makuhari, Japan, September 2010.
[25] Matthew Black, Athanasios Katsamanis, Chi-Chun Lee, Adam C Lammert, Brian R Baucom, Andrew Christensen, Panayiotis G Georgiou, and Shrikanth S Narayanan. Automatic classification of married couples' behavior using audio features. In INTERSPEECH, pages 2030–2033, 2010.
[26] Matthew P Black, Athanasios Katsamanis, Brian R Baucom, Chi-Chun Lee, Adam C Lammert, Andrew Christensen, Panayiotis G Georgiou, and Shrikanth S Narayanan. Toward automating a human behavioral coding system for married couples' interactions using speech acoustic features. Speech Communication, 55(1):1–21, 2013.
[27] Paul Boersma and David Weenink. PRAAT, a system for doing phonetics by computer. Glot International, 5(9/10):341–345, 2001.
[28] Steven M Boker and Jean-Philippe Laurenceau. Dynamical systems modeling: An application to the regulation of intimacy and disclosure in marriage. Models for Intensive Longitudinal Data, pages 195–218, 2006.
[29] Daniel Bone, Chi-Chun Lee, Matthew P Black, Marian E Williams, Sungbok Lee, Pat Levitt, and Shrikanth Narayanan. The psychologist as an interlocutor in autism spectrum disorder assessment: Insights from a study of spontaneous prosody. Journal of Speech, Language, and Hearing Research, 57(4):1162–1177, 2014.
[30] Susan E Brennan. Lexical entrainment in spontaneous dialog. Proceedings of ISSD, 96:41–44, 1996.
[31] Clifford L Broman. Thinking of divorce, but staying married: The interplay of race and marital satisfaction. Journal of Divorce & Remarriage, 37(1-2):151–161, 2002.
[32] Gavin Brown, Adam Pocock, Ming-Jie Zhao, and Mikel Luján. Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. The Journal of Machine Learning Research, 13(1):27–66, 2012.
[33] Craig J Bryan, Jim Mintz, Tracy A Clemans, Bruce Leeson, T Scott Burch, Sean R Williams, Emily Maney, and M David Rudd. Effect of crisis response planning vs. contracts for safety on suicide risk in US army soldiers: A randomized clinical trial. Journal of Affective Disorders, 212:64–72, 2017.
[34] Kim T Buehlman, John M Gottman, and Lynn F Katz. How a couple views their past predicts their future: Predicting divorce from an oral history interview. Journal of Family Psychology, 5(3-4):295, 1992.
[35] Judee K Burgoon, Lesa A Stern, and Leesa Dillman. Interpersonal adaptation: Dyadic interaction patterns. Cambridge University Press, 2007.
[36] Dogan Can, Rebeca Marin, Panayiotis Georgiou, Zac E. Imel, David Atkins, and Shrikanth S. Narayanan. "It sounds like...": A natural language processing approach to detecting counselor reflections in motivational interviewing. Journal of Counseling Psychology, 2015.
[37] Liangyue Cao. Practical method for determining the minimum embedding dimension of a scalar time series. Physica D: Nonlinear Phenomena, 110(1):43–50, 1997.
[38] Joseph N Cappella. Coding mutual adaptation in dyadic nonverbal interaction. The Sourcebook of Nonverbal Measures: Going Beyond Words, pages 383–392, 2005.
[39] Joseph N Cappella and Sally Planalp. Talk and silence sequences in informal conversations III: Interspeaker influence. Human Communication Research, 7(2):117–132, 1981.
[40] Sybil Carrère, Kim T Buehlman, John M Gottman, James A Coan, and Lionel Ruckstuhl. Predicting marital stability and divorce in newlywed couples. Journal of Family Psychology, 14(1):42, 2000.
[41] Justine Cassell, Alastair J Gill, and Paul A Tepper. Coordination in conversation and rapport. In Proceedings of the Workshop on Embodied Language Processing, pages 41–50. Association for Computational Linguistics, 2007.
[42] Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al. Universal sentence encoder. arXiv preprint arXiv:1803.11175, 2018.
[43] S.N. Chakravarthula, R. Gupta, B. Baucom, and P. Georgiou. A language-based generative model framework for behavioral analysis of couples' therapy. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2090–2094, April 2015.
[44] Tanya L Chartrand and John A Bargh. The chameleon effect: the perception–behavior link and social interaction. Journal of Personality and Social Psychology, 76(6):893, 1999.
[45] Theodora Chaspari, Daniel Bone, James Gibson, Chi-Chun Lee, and Shrikanth S. Narayanan. Using physiology and language cues for modeling verbal response latencies of children with ASD. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2013.
[46] Theodora Chaspari, Sohyun C Han, Daniel Bone, Adela C Timmons, Laura Perrone, Gayla Margolin, and Shrikanth S Narayanan.
Quantifying regulation mechanisms in dating couples through a dynamical systems model of acoustic and physiological arousal. In submitted to INTERSPEECH, 2016.
[47] Andrew Christensen, David C Atkins, Sara Berns, Jennifer Wheeler, Donald H Baucom, and Lorelei E Simpson. Traditional versus integrative behavioral couple therapy for significantly and chronically distressed married couples. Journal of Consulting and Clinical Psychology, 72(2):176, 2004.
[48] Andrew Christensen, Neil S Jacobson, and Julia C Babcock. Integrative behavioral couple therapy. Guilford Press, 1995.
[49] Christopher Cieri, David Miller, and Kevin Walker. The Fisher corpus: a resource for the next generations of speech-to-text. In LREC, volume 4, pages 69–71, 2004.
[50] Charles J Clopper and Egon S Pearson. The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika, 26(4):404–413, 1934.
[51] Madalena Costa, Ary L Goldberger, and C-K Peng. Multiscale entropy analysis of complex physiologic time series. Physical Review Letters, 89(6):068102, 2002.
[52] Nicholas Cummins, Stefan Scherer, Jarek Krajewski, Sebastian Schnieder, Julien Epps, and Thomas F Quatieri. A review of depression and suicide risk assessment using speech analysis. Speech Communication, 71:10–49, 2015.
[53] Jared R Curhan and Alex Pentland. Thin slices of negotiation: predicting outcomes from conversational dynamics within the first 5 minutes. Journal of Applied Psychology, 92(3):802, 2007.
[54] Rainer Dahlhaus. On the Kullback-Leibler information divergence of locally stationary processes. Stochastic Processes and their Applications, 62(1):139–168, 1996.
[55] Cristian Danescu-Niculescu-Mizil, Michael Gamon, and Susan Dumais. Mark my words!: linguistic style accommodation in social media. In Proceedings of the 20th International Conference on World Wide Web, pages 745–754. ACM, 2011.
[56] Cristian Danescu-Niculescu-Mizil and Lillian Lee. Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs. In Proceedings of the 2nd Workshop on Cognitive Modeling and Computational Linguistics, pages 76–87. Association for Computational Linguistics, 2011.
[57] Cristian Danescu-Niculescu-Mizil and Lillian Lee. Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs. In Proceedings of the 2nd Workshop on Cognitive Modeling and Computational Linguistics, pages 76–87. Association for Computational Linguistics, 2011.
[58] Céline De Looze, Catharine Oertel, Stéphane Rauzy, and Nick Campbell. Measuring dynamics of mimicry by means of prosodic cues in conversational speech. In International Conference on Phonetic Sciences (ICPhS), Hong Kong, pages 1294–1297, 2011.
[59] Céline De Looze, Stefan Scherer, Brian Vaughan, and Nick Campbell. Investigating automatic measurements of prosodic accommodation and its dynamics in social interaction. Speech Communication, 58:11–34, 2014.
[60] Najim Dehak, Patrick J Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet. Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 19(4):788–798, 2010.
[61] Emilie Delaherche, Mohamed Chetouani, Ammar Mahdhaoui, Catherine Saint-Georges, Sylvie Viaux, and David Cohen. Interpersonal synchrony: A survey of evaluation methods across disciplines. IEEE Transactions on Affective Computing, 3(3):349–365, 2012.
[62] Bella M DePaulo, James J Lindsay, Brian E Malone, Laura Muhlenbruck, Kelly Charlton, and Harris Cooper. Cues to deception. Psychological Bulletin, 129(1):74, 2003.
[63] Norah E Dunbar and Robert Mejia. A qualitative analysis of power-based entrainment and interactional synchrony in couples. Personal Relationships, 20(3):391–405, 2013.
[64] J-P Eckmann, S Oliffson Kamphorst, David Ruelle, and S Ciliberto. Liapunov exponents from time series. Physical Review A, 34(6):4971, 1986.
[65] Jens Edlund, Mattias Heldner, and Julia Hirschberg. Pause and gap length in face-to-face interaction. In INTERSPEECH, pages 2779–2782, 2009.
[66] Norman Epstein. Cognitive therapy with couples. Springer, 1983.
[67] Rosana Esteller, George Vachtsevanos, Javier Echauz, and Brian Litt. A comparison of waveform fractal dimension algorithms. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, 48(2):177–183, 2001.
[68] Florian Eyben, Martin Wöllmer, and Björn Schuller. openSMILE: the Munich versatile and fast open-source audio feature extractor. In Proceedings of the International Conference on Multimedia, pages 1459–1462, 2010.
[69] Mireia Farrús, Javier Hernando, and Pascual Ejarque. Jitter and shimmer measurements for speaker recognition. In INTERSPEECH, pages 778–781, 2007.
[70] Diane H Felmlee and David F Greenberg. A dynamic systems model of dyadic interaction. The Journal of Mathematical Sociology, 23(3):155–180, 1999.
[71] Daniel J France, Richard G Shiavi, Stephen Silverman, Marilyn Silverman, and D Mitchell Wilkes. Acoustical properties of speech as indicators of depression and suicidal risk. IEEE Transactions on Biomedical Engineering, 47(7):829–837, 2000.
[72] Andrew M Fraser and Harry L Swinney. Independent coordinates for strange attractors from mutual information. Physical Review A, 33(2):1134, 1986.
[73] Robert W Frick. Communicating emotion: The role of prosodic features. Psychological Bulletin, 97(3):412, 1985.
[74] Cindy Gallois and Howard Giles. Communication accommodation theory. The International Encyclopedia of Language and Social Interaction, 2015.
[75] Simon Garrod and Anthony Anderson. Saying what you mean in dialogue: A study in conceptual and semantic co-ordination. Cognition, 27(2):181–218, 1987.
[76] Panayiotis G Georgiou, Matthew P Black, Adam C Lammert, Brian R Baucom, and Shrikanth S Narayanan. "That's aggravating, very aggravating": Is it possible to classify behaviors in couple interactions using automatically derived lexical features? In International Conference on Affective Computing and Intelligent Interaction, pages 87–96. Springer, 2011.
[77] Panayiotis G. Georgiou, Matthew P. Black, and Shrikanth S. Narayanan. Behavioral signal processing for understanding (distressed) dyadic interactions: some recent developments. In Proceedings of the 2011 Joint ACM Workshop on Human Gesture and Behavior Understanding, J-HGBU '11, pages 7–12. ACM.
[78] Panayiotis G Georgiou, Matthew P Black, and Shrikanth S Narayanan. Behavioral signal processing for understanding (distressed) dyadic interactions: some recent developments. In Proceedings of the 2011 Joint ACM Workshop on Human Gesture and Behavior Understanding, pages 7–12, Scottsdale, AZ, 2011. ACM.
[79] James Gibson, Athanasios Katsamanis, Matthew P Black, and Shrikanth S Narayanan. Automatic identification of salient acoustic instances in couples' behavioral interactions using diverse density support vector machines. In INTERSPEECH, pages 1561–1564, 2011.
[80] Howard Giles, Justine Coupland, and Nikolas Coupland. Contexts of accommodation: Developments in applied sociolinguistics. Cambridge University Press, 1991.
[81] Howard Giles and Peter Powesland. Accommodation theory. Springer, 1997.
[82] Ross Girshick. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ICCV '15, pages 1440–1448, Washington, DC, USA, 2015. IEEE Computer Society.
[83] Xavier Golay, Spyros Kollias, Gautier Stoll, Dieter Meier, Anton Valavanis, and Peter Boesiger. A new correlation-based fuzzy logic clustering algorithm for fMRI. Magnetic Resonance in Medicine, 40(2):249–260, 1998.
[84] John Gottman, Catherine Swanson, and Kristin Swanson. A general systems theory of marriage: Nonlinear difference equation modeling of marital interaction. Personality and Social Psychology Review, 6(4):326–340, 2002.
[85] John Gottman, Catherine Swanson, and Kristin Swanson. A general systems theory of marriage: Nonlinear difference equation modeling of marital interaction. Personality and Social Psychology Review, 6(4):326–340, 2002.
[86] John M Gottman. The roles of conflict engagement, escalation, and avoidance in marital interaction: a longitudinal view of five types of couples. Journal of Consulting and Clinical Psychology, 61(1):6, 1993.
[87] John M Gottman, James Coan, Sybil Carrere, and Catherine Swanson. Predicting marital happiness and stability from newlywed interactions. Journal of Marriage and the Family, pages 5–22, 1998.
[88] John M Gottman and Lowell J Krokoff. Marital interaction and satisfaction: a longitudinal view. Journal of Consulting and Clinical Psychology, 57(1):47, 1989.
[89] John M Gottman and Robert W Levenson. Marital processes predictive of later dissolution: behavior, physiology, and health. Journal of Personality and Social Psychology, 63(2):221, 1992.
[90] John Mordechai Gottman. Gottman method couple therapy. Clinical Handbook of Couple Therapy, 4:138–164, 2008.
[91] John Mordechai Gottman. Marital interaction: Experimental investigations. Elsevier, 2013.
[92] John Mordechai Gottman. What predicts divorce?: The relationship between marital processes and marital outcomes. Psychology Press, 2014.
[93] John Mordechai Gottman and Robert Wayne Levenson. The timing of divorce: predicting when a couple will divorce over a 14-year period. Journal of Marriage and Family, 62(3):737–745, 2000.
[94] Arthur C Graesser, Danielle S McNamara, Max M Louwerse, and Zhiqiang Cai. Coh-Metrix: Analysis of text on cohesion and language. Behavior Research Methods, Instruments, & Computers, 36(2):193–202, 2004.
[95] Peter Grassberger. Generalized dimensions of strange attractors. Physics Letters A, 97(6):227–230, 1983.
[96] Peter Grassberger and Itamar Procaccia. Measuring the strangeness of strange attractors. Physica D: Nonlinear Phenomena, 9:189–208, 1983.
[97] Stanford Gregory, Stephen Webster, and Gang Huang. Voice pitch and amplitude convergence as a metric of quality in dyadic interviews. Language & Communication, 13(3):195–217, 1993.
[98] Stanford W Gregory, Kelly Dagan, and Stephen Webster. Evaluating the relation of vocal accommodation in conversation partners' fundamental frequencies to perceptions of communication quality. Journal of Nonverbal Behavior, 21(1):23–43, 1997.
[99] Stanford W Gregory Jr and Stephen Webster. A nonverbal signal in voices of interview partners effectively predicts communication accommodation and social status perceptions.
Journal of Personality and Social Psychology, 70(6):1231, 1996.
[100] Michael Grimm, Kristian Kroschel, Emily Mower, and Shrikanth Narayanan. Primitives-based evaluation and estimation of emotions in speech. Speech Communication, 49(10):787–800, 2007.
[101] Aslak Grinsted, John C Moore, and Svetlana Jevrejeva. Application of the cross wavelet transform and wavelet coherence to geophysical time series. Nonlinear Processes in Geophysics, 11(5/6):561–566, 2004.
[102] Rahul Gupta, Panayiotis Georgiou, David Atkins, and Shrikanth Narayanan. Predicting client's inclination towards target behavior change in motivational interviewing and investigating the role of laughter. In Proceedings of Interspeech, September 2014.
[103] W Kim Halford, Matthew R Sanders, and Brett C Behrens. A comparison of the generalization of behavioral marital therapy and enhanced behavioral marital therapy. Journal of Consulting and Clinical Psychology, 61(1):51, 1993.
[104] Michael Alexander Kirkwood Halliday and Ruqaiya Hasan. Cohesion in English. Routledge, 2014.
[105] Kyu Jeong Han, Samuel Kim, and Shrikanth S Narayanan. Strategies to improve the robustness of agglomerative hierarchical clustering under data source variation for speaker diarization. IEEE Transactions on Audio, Speech, and Language Processing, 16(8):1590–1601, 2008.
[106] Robert L Hatcher and J Arthur Gillaspy. Development and validation of a revised short version of the Working Alliance Inventory. Psychotherapy Research, 16(1):12–25, 2006.
[107] C Heavey, D Gill, and A Christensen. Couples interaction rating system 2 (CIRS2). University of California, Los Angeles, 2002.
[108] Richard E Heyman. Observation of couple conflicts: clinical assessment applications, stubborn truths, and shaky foundations. Psychological Assessment, 13(1):5, 2001.
[109] Richard E Heyman, Bushra R Chaudhry, Dominique Treboux, Judith Crowell, Chiyoko Lord, Dina Vivian, and Everett B Waters. How much observational data is enough? An empirical test using marital interaction coding. Behavior Therapy, 32(1):107–122, 2002.
[110] Richard E Heyman, Shari R Feldbau-Kohn, Miriam K Ehrensaft, Jennifer Langhinrichsen-Rohling, and K Daniel O'Leary. Can questionnaire reports correctly classify relationship distress and partner physical abuse? Journal of Family Psychology, 15(2):334, 2001.
[111] Richard E Heyman and Amy M Smith Slep. The hazards of predicting divorce without crossvalidation. Journal of Marriage and Family, 63(2):473–479, 2001.
[112] Dustin Hillard, Mari Ostendorf, and Elizabeth Shriberg. Detection of agreement vs. disagreement in meetings: Training with unlabeled data. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003 short papers, Volume 2, pages 34–36. Association for Computational Linguistics, 2003.
[113] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
[114] Geoffrey E Hinton, Terrence J Sejnowski, and David H Ackley. Boltzmann machines: Constraint satisfaction networks that learn. Carnegie-Mellon University, Department of Computer Science, Pittsburgh, PA, 1984.
[115] J. Hirschberg. Speaking more like you: Entrainment in conversational speech. In Twelfth Annual Conference of the International Speech Communication Association, 2011.
[116] Julia Hirschberg.
Speaking more like you: Entrainment in conversational speech. In Twelfth Annual Conference of the International Speech Communication Association, 2011.
[117] Elad Hoffer and Nir Ailon. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pages 84–92. Springer, 2015.
[118] Kenneth I Howard, Karla Moras, Peter L Brill, Zoran Martinovich, and Wolfgang Lutz. Evaluation of psychotherapy: Efficacy, effectiveness, and patient progress. American Psychologist, 51(10):1059, 1996.
[119] Chih-Wei Hsu and Chih-Jen Lin. A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, 13(2):415–425, 2002.
[120] Neil S Jacobson and Gayla Margolin. Marital therapy: Strategies based on social learning and behavior exchange principles. Psychology Press, 1979.
[121] Neil S Jacobson and Paula Truax. Clinical significance: a statistical approach to defining meaningful change in psychotherapy research. Journal of Consulting and Clinical Psychology, 59(1):12, 1991.
[122] Arindam Jati and Panayiotis Georgiou. Speaker2vec: Unsupervised learning and adaptation of a speaker manifold using deep neural networks with an evaluation on speaker segmentation. In Proceedings of Interspeech, Stockholm, Sweden, August 2017.
[123] Arindam Jati and Panayiotis Georgiou. Neural predictive coding using convolutional neural networks towards unsupervised learning of speaker characteristics. IEEE Transactions on Speech, Audio, and Language Processing, 2018.
[124] Susan M Johnson, John Hunsley, Leslie Greenberg, and Dwayne Schindler. Emotionally focused couples therapy: Status and challenges. Clinical Psychology: Science and Practice, 6(1):67–79, 1999.
[125] J Jones and A Christensen. Couples interaction study: Social support interaction rating system. University of California, Los Angeles, 1998.
[126] Simon Jones, Rachel Cotterill, Nigel Dewdney, Kate Muir, and Adam Joinson. Finding Zelig in text: A measure for normalising linguistic accommodation. In Proceedings of COLING 2014, pages 455–465, 2014.
[127] Patrik N Juslin and Klaus R Scherer. Vocal expression of affect. Oxford University Press, 2005.
[128] Yoshihide Kakizawa, Robert H Shumway, and Masanobu Taniguchi. Discrimination and clustering for multivariate time series. Journal of the American Statistical Association, 93(441):328–340, 1998.
[129] Holger Kantz and Thomas Schreiber. Nonlinear time series analysis, volume 7. Cambridge University Press, 2004.
[130] Martin Karafiát, Lukáš Burget, Pavel Matějka, Ondřej Glembek, and Jan Černocký. iVector-based discriminative adaptation for automatic speech recognition. In 2011 IEEE Workshop on Automatic Speech Recognition & Understanding, pages 152–157. IEEE, 2011.
[131] Benjamin R Karney and Thomas N Bradbury. The longitudinal course of marital quality and stability: A review of theory, methods, and research. Psychological Bulletin, 118(1):3, 1995.
[132] Benjamin R Karney and Brynna Gauer. Cognitive complexity and marital interaction in newlyweds. Personal Relationships, 17(2):181–200, 2010.
[133] Athanasios Katsamanis, Matthew Black, Panayiotis G Georgiou, Louis Goldstein, and S Narayanan. SailAlign: Robust long speech-text alignment. In Proceedings of the Workshop on New Tools and Methods for Very-Large Scale Phonetics Research, 2011.
[134] Michael J Katz. Fractals and the analysis of waveforms. Computers in Biology and Medicine, 18(3):145–156, 1988.
150 [135] Martin B Keller, Philip W Lavori, Barbara Friedman, Eileen Nielsen, Jean En- dicott, Pat McDonald-Scott, and Nancy C Andreasen. The longitudinal interval follow-up evaluation: a comprehensive method for assessing outcome in prospec- tive longitudinal studies. Archives of general psychiatry, 44(6):540–548, 1987. [136] Hyoun K Kim, Deborah M Capaldi, and Lynn Crosby. Generalizability of gottman and colleagues’ affective process models of couples’ relationship outcomes. Jour- nal of Marriage and Family, 69(1):55–72, 2007. [137] Jangwon Kim, Md Nasir, Rahul Gupta, Maarten Van Segbroeck, Daniel Bone, Matthew P Black, Zisis Iason Skordilis, Zhaojun Yang, Panayiotis G Georgiou, and Shrikanth S Narayanan. Automatic estimation of parkinson’s disease severity from diverse speech tasks. In Sixteenth Annual Conference of the International Speech Communication Association, 2015. [138] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. [139] Iasonas Kokkinos and Petros Maragos. Nonlinear speech analysis using mod- els for chaotic systems. Speech and Audio Processing, IEEE Transactions on, 13(6):1098–1109, 2005. [140] Sander L Koole and Wolfgang Tschacher. Synchrony in psychotherapy: A review and an integrative framework for the therapeutic alliance. Frontiers in psychology, 7:862, 2016. [141] Bart Kosko. Bidirectional associative memories. IEEE Transactions on Systems, man, and Cybernetics, 18(1):49–60, 1988. [142] Spyros Kousidis, David Dorran, Ciaran Mcdonnell, and Eugene Coyle. Conver- gence in human dialogues time series analysis of acoustic feature. 2009. [143] Spyros Kousidis, David Dorran, Ciaran Mcdonnell, and Eugene Coyle. Conver- gence in human dialogues time series analysis of acoustic feature. 2009. [144] Spyros Kousidis, David Dorran, Yi Wang, Brian Vaughan, Charlie Cullen, Dermot Campbell, Ciaran McDonnell, and Eugene Coyle. Towards measuring continuous acoustic feature convergence in unconstrained spoken dialogues. 2008. [145] Spyros Kousidis, David Dorran, Yi Wang, Brian Vaughan, Charlie Cullen, Dermot Campbell, Ciaran McDonnell, and Eugene Coyle. Towards measuring continuous acoustic feature convergence in unconstrained spoken dialogues. 2008. [146] Sarah Kozloff. Overhearing film dialogue. Univ of California Press, 2000. [147] Klaus Krippendorff. Estimating the reliability, systematic error and random error of interval data. Educational and Psychological Measurement, 30(1):61–70, 1970. 151 [148] Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. From word embed- dings to document distances. In International Conference on Machine Learning, pages 957–966, 2015. [149] Oh-Wook Kwon, Kwokleung Chan, Jiucang Hao, and Te-Won Lee. Emotion recognition by speech signals. In Proceedings of International Conference EU- ROSPEECH, 2003. [150] Michael J Lambert and Allen E Bergin. The effectiveness of psychotherapy. Hand- book of psychotherapy and behavior change (4th ed.), 1994. [151] Michael J Lambert and Clara E Hill. Assessing psychotherapy outcomes and processes. Handbook of psychotherapy and behavior change (4th ed.), 1994. [152] Quoc Le and Tomas Mikolov. Distributed representations of sentences and docu- ments. In International conference on machine learning, pages 1188–1196, 2014. [153] Chi-Chun Lee, Matthew Black, Athanasios Katsamanis, Adam C Lammert, Brian R Baucom, Andrew Christensen, Panayiotis G Georgiou, and Shrikanth S Narayanan. 
Quantification of prosodic entrainment in affective spontaneous spo- ken interactions of married couples. In Eleventh Annual Conference of the Inter- national Speech Communication Association, 2010. [154] Chi-Chun Lee, Carlos Busso, Sungbok Lee, and Shrikanth S Narayanan. Model- ing mutual influence of interlocutor emotion states in dyadic spoken interactions. In INTERSPEECH, pages 1983–1986, 2009. [155] Chi-Chun Lee, Athanasios Katsamanis, Brian R Baucom, Panayiotis G Geor- giou, and Shrikanth S Narayanan. Using measures of vocal entrainment to inform outcome-related behaviors in marital conflicts. In Signal & Information Process- ing Association Annual Summit and Conference (APSIPA ASC), 2012 Asia-Pacific, pages 1–5. IEEE, 2012. [156] Chi-Chun Lee, Athanasios Katsamanis, Matthew P Black, Brian R Baucom, An- drew Christensen, Panayiotis G Georgiou, and Shrikanth S Narayanan. Comput- ing vocal entrainment: A signal-derived PCA-based quantification scheme with application to affect analysis in married couple interactions. Computer Speech & Language, 28(2):518–539, 2014. [157] Chi-Chun Lee, Emily Mower, Carlos Busso, Sungbok Lee, and Shrikanth Narayanan. Emotion recognition using a hierarchical binary decision tree ap- proach. Speech Communication, 53(9):1162–1171, 2011. [158] Chul Min Lee and Shrikanth S Narayanan. Toward detecting emotions in spoken dialogs. Speech and Audio Processing, IEEE Transactions on, 13(2):293–303, 2005. 152 [159] R. Levitan, A. Gravano, and J. Hirschberg. Entrainment in speech preceding backchannels. Proc. of ACL 2011, 2011. [160] Rivka Levitan, Stefan Benus, Agustın Gravano, and Julia Hirschberg. Entrainment and turn-taking in human-human dialogue. In AAAI Spring Symposium on Turn- Taking and Coordination in Human-Machine Interaction, 2015. [161] Rivka Levitan, Agust´ ın Gravano, Laura Willson, Stefan Benus, Julia Hirschberg, and Ani Nenkova. Acoustic-prosodic entrainment and social behavior. In Pro- ceedings of the 2012 Conference of the North American Chapter of the Associa- tion for Computational Linguistics: Human language technologies, pages 11–19. Association for Computational Linguistics, 2012. [162] Rivka Levitan and Julia Hirschberg. Measuring acoustic-prosodic entrainment with respect to multiple levels and dimensions. In Twelfth Annual Conference of the International Speech Communication Association, 2011. [163] David D Lewis. Feature selection and feature extraction for text categorization. In Proceedings of the workshop on Speech and Natural Language, pages 212–217. Association for Computational Linguistics, 1992. [164] Haoqi Li, Brian Baucom, and Panayiotis Georgiou. Unsupervised latent be- havior manifold learning from acoustic features: Audio2behavior. In Proceed- ings of IEEE International Conference on Audio, Speech and Signal Processing (ICASSP), New Orleans, Louisiana, March 2017. [165] Yang Li and Yunxin Zhao. Recognizing emotions in speech using short-term and long-term features. In ICSLP, 1998. [166] T Warren Liao. Clustering of time series data—a survey. Pattern recognition, 38(11):1857–1874, 2005. [167] Noah Liebman and Darren Gergle. Capturing turn-by-turn lexical similarity in text-based communication. In Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing, pages 553–559. ACM, 2016. [168] Marsha M Linehan, Katherine Anne Comtois, Milton Z Brown, Heidi L Heard, and Amy Wagner. 
Suicide Attempt Self-Injury Interview (SASII): development, reliability, and validity of a scale to assess suicide attempts and intentional self- injury. Psychological assessment, 18(3):303, 2006. [169] Marsha M Linehan, Judith L Goodstein, Stevan L Nielsen, and John A Chiles. Reasons for staying alive when you are thinking of killing yourself: the reasons for living inventory. Journal of consulting and clinical psychology, 51(2):276, 1983. 153 [170] Diane Litman, Susannah Paletz, Zahra Rahimi, Stefani Allegretti, and Caitlin Rice. The teams corpus and entrainment in multi-party spoken dialogues. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1421–1431, 2016. [171] Jos´ e Lopes, Maxine Eskenazi, and Isabel Trancoso. From rule-based to data- driven lexical entrainment models in spoken dialog systems. Computer Speech & Language, 31(1):87–112, 2015. [172] Sarah Peregrine Lord, Do˘ gan Can, Michael Yi, Rebeca Marin, Christopher W Dunn, Zac E Imel, Panayiotis Georgiou, Shrikanth Narayanan, Mark Steyvers, and David C Atkins. Advancing methods for reliably assessing motivational in- terviewing fidelity using the motivational interviewing skills code. Journal of substance abuse treatment, 49:50–57, 2015. [173] Sarah Peregrine Lord, Elisa Sheng, Zac E Imel, John Baer, and David C Atkins. More than reflections: empathy in motivational interviewing includes language style synchrony between therapist and client. Behavior therapy, 46(3):296–303, 2015. [174] Nichola Lubold and Heather Pon-Barry. Acoustic-prosodic entrainment and rap- port in collaborative learning dialogues. In Proceedings of the 2014 ACM work- shop on Multimodal Learning Analytics Workshop and Grand Challenge, pages 5–12. ACM, 2014. [175] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Jour- nal of machine learning research, 9(Nov):2579–2605, 2008. [176] Okumura Manabu and Honda Takeo. Word sense disambiguation and text seg- mentation based on lexical cohesion. In Proceedings of the 15th conference on Computational linguistics-Volume 2, pages 755–761. Association for Computa- tional Linguistics, 1994. [177] Norbert Marwan, Jonathan F Donges, Yong Zou, Reik V Donner, and J¨ urgen Kurths. Complex network approach for recurrence analysis of time series. Physics Letters A, 373(46):4246–4254, 2009. [178] Joseph D Matarazzo and Arthur N Wiens. Interviewer influence on durations of interviewee silence. Journal of Experimental Research in Personality, 1967. [179] Robert McKee. Story: Substance, structure, style, and the principles of screen- writing. Methuen, 1999. [180] NM Menezes, T Arenovich, and RB Zipursky. A systematic review of longitudinal outcome studies of first-episode psychosis. Psychological medicine, 36(10):1349– 1362, 2006. 154 [181] Jan Michalsky, Heike Schoormann, and Oliver Niebuhr. Conversational quality is affected by and reflected in prosodic entrainment. In Proc. 9th International Conference on Speech Prosody 2018, pages 389–392, 2018. [182] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Dis- tributed representations of words and phrases and their compositionality. In Ad- vances in neural information processing systems, pages 3111–3119, 2013. [183] Carla S M¨ oller-Levet, Frank Klawonn, Kwang-Hyun Cho, and Olaf Wolkenhauer. Fuzzy clustering of short time-series and unevenly distributed sampling points. In International Symposium on Intelligent Data Analysis, pages 330–340. Springer, 2003. 
[184] Theresa B Moyers, Tim Martin, Jennifer K Manuel, and William R Miller. The motivational interviewing treatment integrity (MITI) code: Version 2.0. [185] Shrikanth Narayanan and Panayiotis G Georgiou. Behavioral Signal Processing: deriving human behavioral informatics from speech and language. Proceedings of the IEEE. Institute of Electrical and Electronics Engineers, 101(5):1203, 2013. [186] Shrikanth S Narayanan and Abeer A Alwan. A nonlinear dynamical systems anal- ysis of fricative consonants. The Journal of the Acoustical Society of America, 97(4):2511–2524, 1995. [187] Md Nasir, Brian Baucom, Craig J Bryan, Shrikanth Narayanan, and Panayiotis Georgiou. Complexity in speech and its relation to emotional bond in therapist- patient interactions during suicide risk assessment interviews. In Proceedings of Interspeech. August 2017. 2017. [188] Md Nasir, Brian Baucom, Panayiotis Georgiou, and Shrikanth Narayanan. Redun- dancy analysis of behavioral coding for couples therapy and improved estimation of behavior from noisy annotations. In Acoustics, Speech and Signal Process- ing (ICASSP), 2015 IEEE International Conference on, pages 1886–1890. IEEE, 2015. [189] Md Nasir, Brian Baucom, Shrikanth Narayanan, and Panayiotis Georgiou. To- wards an unsupervised entrainment distance in conversational speech using deep neural networks. In Interspeech / arXiv:1804.08782, 2018. [190] Md Nasir, Brian Baucom, Shrikanth S Narayanan, and Panayiotis Georgiou. Com- plexity in prosody: A nonlinear dynamical systems approach for dyadic conver- sations; behavior and outcomes in couples therapy. Interspeech 2016, pages 893– 897, 2016. 155 [191] Md Nasir, Brian Robert Baucom, Panayiotis Georgiou, and Shrikanth Narayanan. Predicting couple therapy outcomes based on speech acoustic features. PloS one, 12(9):e0185123, 2017. [192] Md Nasir, Wei Xia, Bo Xiao, Brian Baucom, Shrikanth Narayanan, and Panayi- otis Georgiou. Still together?: The role of acoustic features in predicting marital outcome. In Proceedings of Interspeech, Dresden, Germany, September 2015. [193] Michael Natale. Convergence of mean vocal intensity in dyadic communication as a function of social desirability. Journal of Personality and Social Psychology, 32(5):790, 1975. [194] Ani Nenkova, Agustin Gravano, and Julia Hirschberg. High frequency word en- trainment in spoken dialogue. In Proceedings of the 46th annual meeting of the association for computational linguistics on human language technologies: Short papers, pages 169–172. Association for Computational Linguistics, 2008. [195] JM Nichols, M Seaver, ST Trickey, MD Todd, C Olson, and L Overbey. Detecting nonlinearity in structural systems using the transfer entropy. Physical Review E, 72(4):046217, 2005. [196] Kate G Niederhoffer and James W Pennebaker. Linguistic style matching in social interaction. Journal of Language and Social Psychology, 21(4):337–360, 2002. [197] Stavros Ntalampiras and Nikos Fakotakis. Modeling the temporal evolution of acoustic parameters for speech emotion recognition. Affective Computing, IEEE Transactions on, 3(1):116–125, 2012. [198] Benjamin M Ogles, Michael J Lambert, and Kevin S Masters. Assessing outcome in clinical practice. Allyn & Bacon, 1996. [199] David E Orlinsky, Klaus Grawe, and Barbara K Parks. Process and outcome in psychotherapy: Noch einmal. Handbook of psychotherapy and behavior change (4th ed.), 1994. [200] Arzucan ¨ Ozg¨ ur, Levent ¨ Ozg¨ ur, and Tunga G¨ ung¨ or. Text categorization with class-based and corpus-based keyword selection. 
In Computer and Information Sciences-ISCIS 2005, pages 606–615. Springer, 2005. [201] Matteo Pagliardini, Prakhar Gupta, and Martin Jaggi. Unsupervised learning of sentence embeddings using compositional n-gram features. arXiv preprint arXiv:1703.02507, 2017. [202] Jennifer S Pardo. On phonetic convergence during conversational interaction. The Journal of the Acoustical Society of America, 119(4):2382–2393, 2006. 156 [203] Jennifer S Pardo, Rachel Gibbons, Alexandra Suppes, and Robert M Krauss. Pho- netic convergence in college roommates. Journal of Phonetics, 40(1):190–197, 2012. [204] Ulrich Parlitz. Nonlinear time-series analysis. In Nonlinear Modeling, pages 209– 239. Springer, 1998. [205] Eluned S Parris and Michael J Carey. Language independent gender identifica- tion. In Acoustics, Speech, and Signal Processing, 1996. ICASSP-96. Conference Proceedings., 1996 IEEE International Conference on, volume 2, pages 685–688. IEEE, 1996. [206] Alex Pentland. Socially aware, computation and communication. Computer, 38(3):33–40, 2005. [207] Alex Sandy Pentland. Social signal processing [exploratory dsp]. Signal Process- ing Magazine, IEEE, 24(4):108–111, 2007. [208] Adriano Petry and Dante Augusto Couto Barone. Speaker identification using nonlinear dynamical features. Chaos, Solitons & Fractals, 13(2):221–231, 2002. [209] Fabio Pianesi, Nadia Mana, Alessandro Cappelletti, Bruno Lepri, and Massimo Zancanaro. Multimodal recognition of personality traits in social interactions. In Proceedings of the 10th international conference on Multimodal interfaces, pages 53–60. ACM, 2008. [210] Martin J Pickering and Simon Garrod. Toward a mechanistic psychology of dia- logue. Behavioral and brain sciences, 27(2):169–190, 2004. [211] Martin J Pickering and Simon Garrod. Alignment as the basis for successful com- munication. Research on Language & Computation, 4(2):203–228, 2006. [212] Steve Pincus. Approximate entropy (apen) as a complexity measure. Chaos: An Interdisciplinary Journal of Nonlinear Science, 5(1):110–117, 1995. [213] Steven M Pincus and Ary L Goldberger. Physiological time-series analysis: what does regularity quantify? American Journal of Physiology-Heart and Circulatory Physiology, 266(4):H1643–H1656, 1994. [214] Jeff Pittam. Voice in social interaction: An interdisciplinary approach, volume 5. Sage Publications, 1994. [215] Margaret J Pitts and Howard Giles. Social psychology and personal relationships: Accommodation and relational influence across time and contexts. The handbook of interpersonal communication, pages 15–31, 2008. 157 [216] Robert Porzel, Annika Scheffler, and Rainer Malaka. How entrainment increases dialogical effectiveness. In Proceedings of the IUI, volume 6, pages 35–42. Cite- seer, 2006. [217] Stephanie D Preston and Frans BM De Waal. Empathy: Its ultimate and proximate bases. Behavioral and brain sciences, 25(1):1–20, 2002. [218] Remo Radii and Antonio Politi. Statistical description of chaotic attractors: the dimension function. Journal of Statistical Physics, 40(5-6):725–750, 1985. [219] Zahra Rahimi, Anish Kumar, Diane Litman, Susannah Paletz, and Mingzhi Yu. Entrainment in multi-party spoken dialogues at multiple linguistic levels. Proc. Interspeech 2017, pages 1696–1700, 2017. [220] Fabian Ramseyer and Wolfgang Tschacher. Nonverbal synchrony in psychother- apy: coordinated body movement reflects relationship quality and outcome. Jour- nal of consulting and clinical psychology, 79(3):284, 2011. [221] Fabian Ramseyer and Wolfgang Tschacher. 
Nonverbal synchrony of head-and body-movement in psychotherapy: different signals have different associations with outcome. Frontiers in psychology, 5:979, 2014. [222] David Reitter, Frank Keller, and Johanna D. Moore. Computational modelling of structural priming in dialogue. In Proceedings of the Human Language Tech- nology Conference of the NAACL, pages 121–124. Association for Computational Linguistics, 2006. [223] Fuji Ren and Ning Liu. Emotion computing using word mover’s distance features based on ren cecps. PloS one, 13(4):e0194136, 2018. [224] Daniel C Richardson, Rick Dale, and Kevin Shockley. Synchrony and swing in conversation: coordination, temporal dynamics, and communication. Embodied communication, pages 75–94, 2008. [225] Michael J Richardson, Stacy Lopresti-Goodman, Marisa Mancini, Bruce Kay, and RC Schmidt. Comparing the attractor strength of intra-and interpersonal interlimb coordination using cross-recurrence analysis. Neuroscience Letters, 438(3):340– 345, 2008. [226] Amy E Rodrigues, Julie H Hall, and Frank D Fincham. What predicts divorce and relationship dissolution. Handbook of divorce and relationship dissolution, pages 85–112, 2006. [227] Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. A metric for distributions with applications to image databases. In Sixth International Conference on Com- puter Vision, pages 59–66. IEEE, 1998. 158 [228] Jakob Runge, Jobst Heitzig, Vladimir Petoukhov, and J¨ urgen Kurths. Escaping the curse of dimensionality in estimating multivariate transfer entropy. Physical review letters, 108(25):258701, 2012. [229] Steven L Salzberg. On comparing classifiers: Pitfalls to avoid and a recommended approach. Data mining and knowledge discovery, 1(3):317–328, 1997. [230] Shinichi Sato, Masaki Sano, and Yasuji Sawada. Practical methods of measuring the generalized dimension and the largest Lyapunov exponent in high dimensional chaotic systems. Progress of Theoretical Physics, 77(1):1–5, 1987. [231] Thomas Schreiber. Interdisciplinary application of nonlinear time series methods. Physics reports, 308(1):1–64, 1999. [232] Thomas Schreiber. Measuring information transfer. Physical review letters, 85(2):461, 2000. [233] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE con- ference on computer vision and pattern recognition, pages 815–823, 2015. [234] Bj¨ orn Schuller. V oice and speech analysis in search of states and traits. In Com- puter Analysis of Human Behavior, pages 227–253. Springer, 2011. [235] Bj¨ orn Schuller, Anton Batliner, Dino Seppi, Stefan Steidl, Thurid V ogt, Johannes Wagner, Laurence Devillers, Laurence Vidrascu, Noam Amir, Loic Kessous, et al. The relevance of feature type for the automatic classification of emotional user states: low level descriptors and functionals. In Eighth Annual Conference of the International Speech Communication Association, 2007. [236] Bj¨ orn Schuller, Bogdan Vlasenko, Ricardo Minguez, Gerhard Rigoll, and Andreas Wendemuth. Comparing one and two-stage acoustic modeling in the recognition of emotion in speech. In Automatic Speech Recognition & Understanding, 2007. ASRU. IEEE Workshop on, pages 596–600, 2007. [237] Mia Sevier, Kathleen Eldridge, Janice Jones, Brian D Doss, and Andrew Chris- tensen. Observed communication and associations with satisfaction during tra- ditional and integrative behavioral couple therapy. Behavior therapy, 39(2):137– 150, 2008. 
[238] Prashanth Gurunath Shivakumar and Panayiotis Georgiou. Confusion2vec: To- wards enriching vector space word representations with representational ambigu- ities. arXiv preprint arXiv:1811.03199, 2018. [239] Robert H Shumway. Time-frequency clustering and discriminant analysis. Statis- tics & probability letters, 63(3):307–314, 2003. 159 [240] Mary L Smith and Gene V Glass. Meta-analysis of psychotherapy outcome stud- ies. American psychologist, 32(9):752, 1977. [241] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur. X-vectors: Robust dnn embeddings for speaker recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Process- ing (ICASSP), pages 5329–5333. IEEE, 2018. [242] Douglas K Snyder. Multidimensional assessment of marital satisfaction. Journal of Marriage and the Family, pages 813–823, 1979. [243] Huan Song, Megan Willi, Jayaraman J Thiagarajan, Visar Berisha, and Andreas Spanias. Triplet Network with Attention for Speaker Diarization. [244] Graham B Spanier. Measuring dyadic adjustment: New scales for assessing the quality of marriage and similar dyads. Journal of Marriage and the Family, pages 15–28, 1976. [245] Scott M Stanley, Thomas N Bradbury, and Howard J Markman. Structural flaws in the bridge from basic research on marriage to interventions for couples. Journal of Marriage and Family, 62(1):256–264, 2000. [246] Richard L Street Jr. Speech convergence and speech evaluation in fact-finding interviews. Human Communication Research, 11(2):139–169, 1984. [247] Zbigniew R Struzik and Arno Siebes. The haar wavelet transform in the time series similarity paradigm. In PKDD, volume 99, pages 12–22. Springer, 1999. [248] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014. [249] G´ abor J Sz´ ekely, Maria L Rizzo, Nail K Bakirov, et al. Measuring and testing dependence by correlation of distances. The annals of statistics, 35(6):2769–2794, 2007. [250] Floris Takens. Detecting strange attractors in turbulence. Springer, 1981. [251] Paul J Taylor and Sally Thomas. Linguistic style matching and negotiation out- come. Negotiation and Conflict Management Research, 1(3):263–281, 2008. [252] James Theiler. Estimating fractal dimension. JOSA A, 7(6):1055–1073, 1990. [253] Naftali Tishby. A dynamical systems approach to speech processing. In Acoustics, Speech, and Signal Processing, 1990. ICASSP-90., 1990 International Conference on, pages 365–368. IEEE, 1990. 160 [254] Shao-Yen Tseng, Brian Baucom, and Panayiotis Georgiou. Unsupervised online multitask learning of behavioral sentence embeddings. PeerJ Computer Science, 5:e200, 2019. [255] Shao-Yen Tseng, Brian R Baucom, and Panayiotis G Georgiou. Approaching human performance in behavior estimation in couples therapy using deep sentence embeddings. In INTERSPEECH, pages 3291–3295, 2017. [256] Maarten Van Segbroeck, Andreas Tsiartas, and Shrikanth S. Narayanan. A ro- bust frontend for V AD: Exploiting contextual, discriminative and spectral cues of human voice. In INTERSPEECH, August 2013. [257] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017. [258] Raul Vicente, Michael Wibral, Michael Lindner, and Gordon Pipa. 
Transfer en- tropy—a model-free measure of effective connectivity for the neurosciences. Jour- nal of computational neuroscience, 30(1):45–67, 2011. [259] Alessandro Vinciarelli, Maja Pantic, and Herv´ e Bourlard. Social signal process- ing: Survey of an emerging domain. Image and Vision Computing, 27(12):1743– 1759, 2009. [260] Alessandro Vinciarelli, Maja Pantic, Herv´ e Bourlard, and Alex Pentland. So- cial signals, their function, and automatic analysis: a survey. In Proceedings of the 10th international conference on Multimodal interfaces, pages 61–68. ACM, 2008. [261] Oriol Vinyals and Quoc Le. A neural conversational model. arXiv preprint arXiv:1506.05869, 2015. [262] Bogdan Vlasenko, Bj¨ orn Schuller, Andreas Wendemuth, and Gerhard Rigoll. Frame vs. turn-level: emotion recognition from speech considering static and dynamic processing. In Affective Computing and Intelligent Interaction, pages 139–147. Springer, 2007. [263] Arthur Ward and Diane Litman. Measuring convergence and priming in tutorial dialog. University of Pittsburgh, Tech. Report, 2007. [264] James H Watt and C Arthur VanLear. Dynamic patterns in communication pro- cesses. Sage Publications, Inc, 1996. [265] Charles L Webber Jr and Norbert Marwan. Recurrence quantification analysis. Springer, 2015. 161 [266] Sarah Weidman, Mara Breen, and Katherine C Haydon. Prosodic speech entrain- ment in romantic relationships. In proceedings of Speech Prosody, 2016. [267] Andreas Weise. Towards a spoken dialog system capable of acoustic-prosodic entrainment. PhD thesis, CITY UNIVERSITY OF NEW YORK, 2017. [268] Joan Welkowitz and Marta Kuc. Interrelationships among warmth, genuineness, empathy, and temporal speech patterns in interpersonal interaction. Journal of Consulting and Clinical Psychology, 41(3):472, 1973. [269] Michael Wibral, Benjamin Rahm, Maria Rieder, Michael Lindner, Raul Vicente, and Jochen Kaiser. Transfer entropy in magnetoencephalographic data: quantify- ing information flow in cortical and cerebellar networks. Progress in biophysics and molecular biology, 105(1):80–97, 2011. [270] Megan M Willi, Stephanie A Borrie, Tyson S Barrett, Ming Tu, and Visar Berisha. A discriminative acoustic-prosodic approach for measuring local entrainment. arXiv preprint arXiv:1804.08663, 2018. [271] Wei Xia, James Gibson, Bo Xiao, Brian Baucom, and Panayiotis Georgiou. An acoustic-based local sequential learning model for behavioral analysis of couple therapy. In Proceedings of Interspeech, Dresden, Germany, September 2015. [272] Bo Xiao, Daniel Bone, Maarten Van Segbroeck, Zac E Imel, David C Atkins, Panayiotis G Georgiou, and Shrikanth S Narayanan. Modeling therapist empathy through prosody in drug addiction counseling. In INTERSPEECH, pages 213– 217, 2014. [273] Bo Xiao, Dogan Can, Panayiotis Georgiou, David Atkins, and Shrikanth S. Narayanan. Analyzing the language of therapist empathy in motivational inter- view based psychotherapy. In Proceedings of APSIPA Annual Summit and Con- ference, December 2012. [274] Bo Xiao, Panayiotis Georgiou, and Shrikanth Narayanan. Head motion modeling for human behavior analysis in dyadic interaction. IEEE Transactions on Multi- media, 17(7):1107–1119, July 2015. [275] Bo Xiao, Panayiotis G Georgiou, B Baucom, and Shrikanth S Narayanan. Data driven modeling of head motion towards analysis of behaviors in couple interac- tions. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE Interna- tional Conference on, pages 3766–3770, 2013. 
[276] Bo Xiao, Panayiotis G Georgiou, Zac E Imel, David C Atkins, and Shrikanth Narayanan. Modeling therapist empathy and vocal entrainment in drug addiction counseling. In INTERSPEECH, pages 2861–2865, 2013. 162 [277] Bo Xiao, Panayiotis G Georgiou, Chi-Chun Lee, Brian Baucom, and Shrikanth S Narayanan. Head motion synchrony and its correlation to affectivity in dyadic interactions. In 2013 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2013. [278] Zhaojun Yang and Shrikanth Narayanan. Modeling mutual influence of multi- modal behavior in affective dyadic interactions. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pages 2234–2238. IEEE, 2015. [279] Guang Yong Zou. Toward using confidence intervals to compare correlations. Psychological methods, 12(4):399, 2007. 163
Abstract
Analysis of interaction dynamics in dyadic conversations can provide important insights into the behavior patterns of the interlocutors. We explore the characterization of interaction dynamics in the form of two common interpersonal adaptation mechanisms in conversations: vocal entrainment and linguistic coordination. First, we show how modeling dyadic interactions through nonlinear dynamical systems can provide complexity measures that capture entrainment. These measures, albeit knowledge-driven, are able to capture the nonlinear nature of entrainment during interactions, yielding improved performance over existing linear measures found in the literature. We then propose a deep neural network-based unsupervised learning framework for entrainment and leverage its ability to learn from real conversational data to provide novel distance measures indicative of entrainment. We also propose measuring linguistic coordination in conversations by using neural word embeddings and learning distance measures that capture lexical, syntactic, and semantic similarity between interlocutors. Our experiments show that the proposed measures can successfully distinguish real conversations from fake ones by detecting the presence of entrainment or coordination. We also demonstrate their applications in relation to several behaviors and outcomes in observational psychotherapy domains such as couples therapy, suicide risk assessment, and motivational interviewing. Furthermore, we find that incorporating measures characterizing interaction dynamics as features significantly improves the performance of classifiers that predict the therapy outcome of couples with marital conflict.
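To make the embedding-based coordination idea above concrete, the following is a minimal illustrative sketch in Python, not one of the models developed in this work: it scores coordination as the cosine similarity between the mean word-embedding vectors of adjacent turns spoken by different interlocutors. The embeddings table is a hypothetical stand-in for any pretrained word-embedding lookup (a dict mapping words to numpy arrays), and turn_vector, coordination_scores, and dim are names introduced only for this example.

import numpy as np

def turn_vector(turn, embeddings, dim=300):
    # Mean of the embedding vectors of the in-vocabulary words in a turn;
    # a zero vector stands in for turns with no known words.
    vecs = [embeddings[w] for w in turn.lower().split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def coordination_scores(turns, speakers, embeddings, dim=300):
    # Cosine similarity between each turn and the immediately preceding
    # turn of the other interlocutor; higher values suggest closer
    # lexical and semantic alignment between the two speakers.
    scores = []
    for i in range(1, len(turns)):
        if speakers[i] == speakers[i - 1]:
            continue  # coordination is defined across speakers, not within
        u = turn_vector(turns[i - 1], embeddings, dim)
        v = turn_vector(turns[i], embeddings, dim)
        denom = np.linalg.norm(u) * np.linalg.norm(v)
        if denom > 0.0:
            scores.append(float(u @ v / denom))
    return scores

# Example usage with a toy two-turn exchange:
# turns = ["i had a rough day", "sounds like a really rough day"]
# speakers = ["A", "B"]
# scores = coordination_scores(turns, speakers, embeddings)

Such a fixed turn-level cosine similarity is only a crude proxy; the measures proposed in this dissertation are instead learned from real conversational data.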