Computational Modeling of Behavioral Attributes in Conversational Dyadic Interactions

by

Sandeep Nallan Chakravarthula

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (Electrical Engineering)

May 2021

Copyright 2021 Sandeep Nallan Chakravarthula

Acknowledgements

This thesis is the culmination of a long struggle and a tumultuous journey to learn how to be a good researcher. I would not be where I am today were it not for some incredible people whose company I was fortunate to find myself in. I am extremely grateful to my gurus, Prof. Panayiotis Georgiou and Prof. Shrikanth Narayanan, for their generous support over all these years and for giving me the time I needed to grow, pick myself up after every stumble and persist in my endeavours until I had achieved my goals. The value of their mentorship in conceptualizing and communicating research problems was exceeded only by their spirit of kindness. It is something that I hope to emulate in my career as well. I thank my committee faculty members, Prof. Gayla Margolin, Prof. Keith Jenkins and Prof. Jonathan Gratch for their invaluable feedback during my qualifying exam and defense. I also deeply value the mentorship of my research collaborators, Prof. Brian Baucom at the University of Utah and Dr. Maija Reblin at the Moffitt Cancer Center. Working with them on problems of critical importance and real-life impact not only invigorated my research with a strong sense of purpose but also allowed me to learn from their expertise and broaden my knowledge. My sincere thanks go to EEB staff members Diane Demetras and Tanya Acevedo-Lam for their unmatched problem-solving abilities and calming presence, and USC's High Performance Computing facilities, without which I could not have run the millions of experiments needed for my research.

My time at USC and in Los Angeles has been a most magical one, thanks to the many friends I have made along the way. I will always cherish the countless hours spent with my wonderful core group of Nasir, Divya and Shiva, in discussing our lives, hopes and dreams, while also single-handedly propping up the coffee economy on campus. I am also grateful to Naveen for taking me under his wing during my initial years and helping me discover and appreciate LA and its many cultures and hidden gems. I will not soon forget the surreal years spent in the basement with my fellow SCUBA comrades Haoqi, Shao-Yen, Taejin and Arindam, nor those spent in far sunnier places with my SAIL colleagues, Amrutha, Krishna, Colin and Manoj. I thank my roommates, Sushil and Alan, for maintaining a welcoming and fun atmosphere at home and motivating me to lead a fitter lifestyle. I am forever indebted to my dear friends Bargav, Priya, Kavitha, Abhinav, Nagu, Sai, Aakanksha, Akshay, Omkar and Yogesh for their warm and comforting support over all these years. It was absolutely sheer luck that I ran into them nearly a decade ago and since then, many of them have become my family and home away from home. Here's to many more such decades. Finally, I offer my unconditional love and devotion to my family, Amma, Nanagaru and Pranu, for their boundless love, patience and faith in me all this time. Over thousands of kilometers and video calls, you nourished and kept my spirit alive and were always there to encourage, criticize, cajole, inspire and celebrate with me through all my setbacks and triumphs.
Words cannot convey how much I love and miss you all and appreciate the sacrifices you have made for me; this thesis is as much yours as it is mine. I humbly thank God for everything so far and pray for the well-being of everyone.

Table of Contents

Acknowledgements
List Of Tables
List Of Figures
Abstract
Chapter 1: Introduction
  1.1 Behavioral Expression in Conversational Interactions
  1.2 Observational Coding in Mental Health domains
  1.3 Behavioral Signal Processing (BSP)
  1.4 Focus of Dissertation
  1.5 Summary of Findings
    1.5.1 Modeling Speaker Behaviors from Speaker Interaction Cues
    1.5.2 Determining Observation Window Lengths for Modeling Behaviors
    1.5.3 Contextualizing Speaker Behavior with Partner Interaction Cues
  1.6 Organization of Dissertation
Chapter 2: Modeling Speaker Behaviors from Speaker Interaction Cues
  2.1 Introduction
  2.2 Static Behavior Model (SBM) (Modeling approach; Training; Testing)
  2.3 Dynamic Behavior Model (DBM)
    2.3.1 Activation-DBM (Modeling approach; Training; Testing)
    2.3.2 Likelihood-DBM (Modeling approach; Training; Testing)
  2.4 Neural Behavior Model (NBM)
    2.4.1 Word-level Recurrent NBM (Motivation; Modeling approach; Training; Testing)
    2.4.2 Turn-level NBM (Motivation; Modeling approach; Training; Testing)
  2.5 Experiments
    2.5.1 Comparing SBM vs DBM
      2.5.1.1 Case Study 1: Classifying Negative behavior in Couples Therapy
      2.5.1.2 Case Study 2: Rating Empathetic behavior in Addiction Counseling
    2.5.2 Evaluating Word-level Recurrent NBM
      2.5.2.1 Case Study 3: Classifying Negative behavior in Couples Therapy
    2.5.3 Evaluating Turn-level Feedforward NBM
      2.5.3.1 Case Study 4: Classifying Positive, Constructive and Hostile behavior in Cancer Care
  2.6 Discussion and Conclusions
Chapter 3: Determining Observation Window Lengths for Modeling Behaviors
  3.1 Introduction
  3.2 Multi-scale nature of behavioral constructs
  3.3 Framework of Analysis
  3.4 Problem Statement
  3.5 Proposed Analysis Methodology
    3.5.1 Metrics (Behavior Construct Similarity; Behavior Relationship Consistency)
    3.5.2 Procedure
  3.6 Windowed Scoring of Text
  3.7 Behavior Scoring Models
    3.7.1 N-gram Maximum Likelihood Model (Modeling approach; Training; Testing)
    3.7.2 Neural Estimation Model (Modeling approach; Training; Testing)
  3.8 Experiments
    3.8.1 Analysis of window length for behavior estimation
      3.8.1.1 Case Study 5: Estimating Interaction and Social Support behaviors in Couples Therapy
  3.9 Discussion and Conclusions
Chapter 4: Contextualizing Speaker Behavior with Partner Interaction Cues
  4.1 Introduction
  4.2 Influence Dynamic Behavior Model (Modeling approach; Training; Testing)
  4.3 Multi-Scale Multimodal Speaker-Partner Interaction Cues
    4.3.1 Modeling approach (Turn-level semantic representations; BERT-based turn-level sentiment cues; LIWC-based word-level sentiment cues; Multi-scale lexical behavior cues; Acoustic Low-level Descriptors (LLDs); Speaker-Partner Vocal Entrainment; Speaker-Partner Turn-Taking Cues)
    4.3.2 Training
    4.3.3 Testing
  4.4 Experiments
    4.4.1 Influence-DBM vs Likelihood-DBM
      4.4.1.1 Case Study 6: Identifying Negative behavior in Couples Therapy
    4.4.2 Modeling Long-Term Speaker Behaviors from Multi-Scale Multimodal Speaker-Partner Interaction Cues
      4.4.2.1 Case Study 7: Assessing Suicidal Risk in Military Couples
  4.5 Discussion and Conclusions
Chapter 5: Conclusion and proposed work
  5.1 Summary
  5.2 Proposed Work
References
Appendix A: Algorithms and Intermediate Results of Behavior Observation Window Length Analysis
  A.1 Algorithms
  A.2 Intermediate Results for N-gram model (Behavior Construct Similarity; Behavior Relationship Consistency)
  A.3 Intermediate Results for Neural model (Behavior Construct Similarity; Behavior Relationship Consistency; ELMo Layer Weights)

List Of Tables

2.1 Model hyperparameters
2.2 Class weighting schemes
2.3 Data Demographics on 20% least/most negativity sessions
2.4 Classification Accuracy of Behavioral Models using 1-grams
2.5 Classification Accuracy of Behavioral Models for different versions of the dataset
2.6 Classification accuracy (%) on negativity for different input sequence lengths
2.7 Comparison of agreement using Krippendorff's alpha
2.8 Description of RMICS2 Behavior Codes
2.9 Number of samples per behavior for different partitions
2.10 Mean (Std. Deviation) UAR % of Test-fold Behavior Classification
3.1 All possible window lengths at which an utterance with O_p words can be scored
3.2 Description of behavior codes in Couples Therapy corpus
4.1 Couples Therapy Outcome Demographics
4.2 Comparison of Test Classification Accuracy %, with the best performing model indicated in bold
4.3 Demographics of suicide risk labels of subjects
4.4 Test Macro-Recall % for Baseline Risk Prediction with Clean Data
4.5 Test Macro-Recall % for 6-month, 12-month Risk Prediction with Clean Data
4.6 Test Macro-Recall % for Baseline Risk Prediction with Noisy Data
4.7 Test Macro-Recall % for 6-month, 12-month Risk Prediction with Noisy Data
A.1 Spearman Correlation between window-level scores of Acceptance and target behaviors with the N-gram model
A.2 Spearman Correlation between window-level scores of Blame and target behaviors with the N-gram model
A.3 Spearman Correlation between window-level scores of Blame and target behaviors with the Neural model

List Of Figures

1.1 Behavioral analysis by an expert
1.2 Automated quantification of behavior using a moving-window modeling approach
1.3 Toy example illustrating the effect of observation window length on behavior estimation
2.1 Conceptualizing the proposed graphical model versus the baseline
2.2 Activation-based DBM versus the Likelihood-based DBM
2.3 Training frame-level RNN using global rating values
2.4 Recurrent Neural Network system for predicting behavior
2.5 Turn-level Feedforward Neural Behavior Model
2.6 Comparison of the distribution of word scores in a session for the 1-hot and w2v-joint systems
2.7 Test recall versus extended evaluation window size
3.1 Automated quantification of behavior from lexical cues using a moving-window approach
3.2 Computation of Behavior Construct Similarity (BCS) at two observation window lengths
3.3 Computation of Behavior Relationship Consistency (BRC) at two observation window lengths
3.4 Flowchart of the analysis procedure for determining the appropriate window length of a target behavior
3.5 N-gram model used to estimate the behavior of a sample utterance
3.6 Neural model that uses an O-word-long window to estimate the behavior score of a sample utterance
3.7 Grouping of Couples Therapy behaviors based on their relation to each other
3.8 Appropriate window lengths of behaviors estimated with the N-gram and Neural models
3.9 Trained ELMo layer weights for different behavior groups
3.10 Comparison of best modeling performance from both models over all window lengths
4.1 Behavior generation process over 3 speaker turns in the Influence Model
4.2 Histogram of best parameter configurations for 1,2,3-gram LDBM and Influence models
A.1 Behavior Construct Similarity for the N-gram model
A.2 Sample distribution of Dominance and External scores at window lengths 3, 100 and session-length
A.3 Behavior Construct Similarity for the Neural model

Abstract

Conversational dyadic interactions are a natural and ubiquitous form of communication for exchanging ideas and building relationships. Observing how a person behaves during an interaction can provide insights into how they are feeling at that moment (e.g., happy, angry), what their general state of mind is like (e.g., depressed, worried) and even how they might behave in future (e.g., is this person likely to harm themselves?).
Machine Learning-based computational models can be used to automatically perform such assessments in applications ranging from personal smart assistants to health care domains. This thesis presents my work on building computational models of a speaker's behavioral attributes based on their spoken language cues when interacting with their partner. Such models typically rely on some fundamental assumptions about the many complex underlying processes by which communication takes place between the interlocutors and, hence, it is crucial to understand and verify their validity and appropriateness. Accordingly, I investigate three aspects of the modeling process: (i) which frameworks are effective at capturing behavior information in spoken language, (ii) how much context is needed for accurate modeling and how it depends on the target behavior, and (iii) how the partner's cues can be utilized for better understanding the speaker's behaviors. I show that N-gram and Neural Network based frameworks can accurately identify behavioral attributes from spoken language, with short contexts found to be adequate for capturing affective expressions while longer contexts are required for more complex exchanges such as problem solving. In addition, incorporating the partner's cues and their influence on the speaker is observed not only to contribute to better modeling of the speaker's behavioral attributes but also to reveal patterns in their interaction dynamics that are linked to outcomes in the future. The effectiveness of these approaches is demonstrated through real-world applications in mental health domains such as assessing suicidal risk of military personnel, identifying positive communication between cancer-afflicted couples and rating empathy levels of therapists.

Chapter 1
Introduction

1.1 Behavioral Expression in Conversational Interactions

Conversational dyadic interactions are a natural and ubiquitous form of communication for exchanging ideas and building relationships. Observing how a person behaves during an interaction can provide insights into how they are feeling at that moment (e.g., happy, angry), what their general state of mind is like (e.g., depressed, worried) and even how they might behave in future (e.g., is this person likely to harm themselves?). The nature of interactions between two speakers can not only provide insights into their behavioral profiles and interaction dynamics but can also have ramifications for their interpersonal relationship. Effective communication between couples, for instance, can facilitate decision-making, improve intimacy and relationship quality, and improve well-being for both spouses. However, evidence is scant when it comes to determining what might be the normal or even ideal amount or content of communication between two speakers [6]. Recognizing behaviors based on the interaction cues of speakers can help analyze such interactions and identify behaviors which are beneficial for effective communication. Human behavior expression is extremely complex and dynamic and conveys a multitude of information, including information related to one's mental health. Interpersonal interactions, in particular, contain significant heterogeneity and variability with respect to context, speaker traits and identity, etc., and, thus, attaining complete insight into human interaction mechanisms is challenging.
1.2 Observational Coding in Mental Health domains

Observational methods have been the gold standard of communication and behavioral analysis [61] and have led to important advances in areas such as the understanding of couple conflict, affect, and intimacy [56, 73]. Many of these studies have relied on the development of coding systems in which human coders identify specific communication behaviors that occur within interactions. The need to capture a sufficient sample of behavior from a large enough sample of participants to adequately power analyses requires observational studies with large amounts of raw data that then take a long time for human coders to effectively and reliably process.

A typical observational process is shown in Fig. 1.1. Guided by expert-defined behavioral representations, humans are relied upon to derive behavioral constructs using observed data. Upon undergoing training according to the appropriate coding manual, these human expert annotators then perform an audiovisual observation of the interlocutors during the interaction. Upon finishing the observation, the annotators then quantitatively gauge the degree to which they think the interlocutors exhibited a certain behavior, keeping in mind how the average interlocutor would behave. The ratings from these experts are evaluated for inter-annotator agreement and used by counselors to guide the therapy for the interlocutors.

As such, these techniques are often resource intensive. For instance, human coders require extensive initial training to ensure their work is valid and reliable, followed by supervision throughout coding, usually through ongoing meetings, review of coding, and assessment. Depending on the coding system, coding can take 2-3 times the length of the original recording [128]. The act of human coding is also fatiguing, even with opportunities for breaks, rest, and reflection [45], and may result in coder drift, error, or burnout over time.

1.3 Behavioral Signal Processing (BSP)

Engineering approaches offer a viable way to study human interactions and replicate the behavior coding process described above. The field of Behavioral Signal Processing [52, 108] offers encouraging results towards using computational tools and models of human interactions to inform research and practice across a variety of behavior-centered domains. The approach relies on using domain knowledge to guide feature design and machine learning methods, whose effectiveness is validated through experiments on real datasets.

Figure 1.1: Behavioral analysis by an expert: in practice, experts observe directly and infer diagnosis and treatment outcomes. In some clinical cases and in observational research the data may be revisited through audiovisual recordings and coded by experts. [Image of couple interacting courtesy of Prof. A. Christensen, Clinical Psychology Department, University of California Los Angeles (UCLA)].

Great strides have been made over the last few years on establishing Signal Processing and Natural Language Processing methods as viable in estimating the behavioral state of the interlocutors. Automated coding systems have been shown to be effective at quantifying behaviors from speech and spoken language such as Negativity [21, 31, 51, 148], Depression [60, 103] and Empathy [53, 121, 157].
In couples therapy, example work includes [19–21, 51, 54, 83–85, 88, 159]. Spoken language, in particular, has been found to be a rich source of information for behavioral cues across a variety of domains and applications, such as Couples Therapy [51], Addiction Counseling [157] and Cancer Care [107].

Figure 1.2: Automated quantification of behavior using a moving-window modeling approach

Many of these approaches employ a moving-window approach for quantifying the degree of behavior exhibited by a speaker, as illustrated in Figure 1.2. First, the behaviors that are expressed by an interlocutor are encoded to different degrees in their spoken language. Then, using an appropriately long observation window, the system observes the cues being generated by the interlocutor and estimates the short-term behavior information with the help of the appropriate behavior model. Finally, this short-term information is integrated over the course of the interaction, resulting in a summary of the behavior for that interlocutor. This approach is followed in many works due to its effectiveness as well as its ability to provide short-term behavior analysis of interactions.

1.4 Focus of Dissertation

The process of modeling the speaker's behavior from their language cues during interactions with their partner is characterized by several aspects that are jointly critical for its effectiveness. This dissertation focuses on investigating three such aspects: (i) which frameworks are effective at capturing behavior information in spoken language, (ii) how much context is needed for accurate modeling and how it depends on the target behavior, and (iii) how the partner's cues can be utilized for better understanding the speaker's behaviors. Each of these aspects needs to be determined and designed in a manner that is appropriate to the nature of the behaviors and interactions to be modeled. The rest of this dissertation presents an empirical investigation into the different aspects of the modeling process and examines the impact of different design choices for various behavioral constructs across different mental health domains. Below, I present a summary of my findings so far, followed by a roadmap of the structure of the dissertation.

1.5 Summary of Findings

1.5.1 Modeling Speaker Behaviors from Speaker Interaction Cues

Existing work [51] has assumed that during an interaction, an interlocutor is constantly in the same behavioral state when interacting with their partner. This is equivalent to modeling them as a single-state generative model, which is limiting and does not reflect human behavior, which dynamically adapts based on various stimuli, internal and external. I propose Dynamic Behavior Models (DBM) [31, 33], which model speakers as switching between different latent variable states over the course of an interaction, and show that they perform better than existing models. I also propose a Neural Network-based model [150] that leverages out-of-domain information in a context-independent manner to better estimate behavior. The effectiveness of neural architectures at estimating affective constructs is explored both at the word level and at the
It is observed that context-dependent latent-space models and context-independent neural network-based models are both eective at modeling how humans express aective behavior such as negative and positive during interactions. 1.5.2 Determining Observation Window Lengths for Modeling Behaviors When assessing a person's behavior based on their interaction cues, humans look at factors such as the intensity of expression, context and how frequently the behavior is observed [12]. The latter two imply that an appropriately long window is used to observe the cues before making a judgment about the behavior; for lexical cues, the length of this observation window is measured in terms of the number of words spoken. While some behaviors can be assessed based on short-duration cues, others require observations along longer time-scales. For example, one can sense that a person is Angry if they say something as brief as \Shut up!", but it is dicult to judge whether they are Engaged in a discussion unless a longer and more involved conversation is observed. Based on this, it is intuitive to expect that evaluating dierent behaviors would require dierent observation window lengths. Such associations have been exhibited by humans when judging characteristics such as personality traits [22], non-verbal behaviors [106] and group dynamics [133]. However, it is not clear as to how these associations manifest in automated systems that quantify behaviors based on interaction cues. Unlike emotions, which are simple and rapid [14] and can be reliably estimated from short observations such as a few seconds [136], a sentence [163] or a speaker turn [27]), the rich variety of human behaviors can be much more complicated and long-ranging. Even expert coders in the eld of psychological research typically rst have to be trained according to domain-specic guidelines or manuals before they can start coding patients' behaviors, such as CIRS [63] and MITI [105]. This complexity can potentially give rise to uncertainty at the time of assessment, which then necessitates longer observations in order to achieve condent and reliable annotation. Furthermore, the annotation time-frames for coding 6 dierent behaviors can range from as short as 30 seconds [66] to as long as 10 minutes [63], demonstrating the potential variability in observation lengths. These facets of behavior coding demonstrate the need for investigating the role of the length of observation for specic behavioral characterization. Honey it’s not your fault please High Low Medium Honey Honey it’s not Honey it’s not your fault it’s not your fault please it’s not your not your fault your fault please it’s not your fault please 5-WORD WINDOW 3-WORD WINDOW 1-WORD WINDOW TRUE True degree of Supportive behavior High Low Medium High Low Medium High Low Medium Aggregate estimate of Supportive behavior Figure 1.3: Toy example illustrating the eect of observation window length on behavior estimation: The true degree of Supportive behavior of the utterance \Honey it's not your fault please" is High. The system's predicted behavior is the aggregate of the behavior estimates from all the windows. At short window lengths (1, 3 words), insucient information leads to noisy estimates and an incorrect prediction that the degree of behavior is Medium. 
These considerations assume significant importance in applications that rely on moving-window approaches to estimate behavior. Such approaches typically first break down an interaction into windowed segments using a fixed-length observation window, such as a fixed number of words or speaker turns. Then, behavior estimates are computed within each window and combined over all the windows to obtain an aggregate estimate of behavior that characterizes the entire interaction. These approaches are used in psychotherapy research, where interactions, or "sessions", are often analyzed and evaluated at session level [53, 148, 157]. In such applications, the choice of length of the observation window is important; too short a window can result in noisy or incorrect estimates due to insufficient information being used, and as a result, the aggregate behavior will be inaccurate, as illustrated in the toy example in Figure 1.3. This choice also becomes important in multi-label tasks where recognizing different behaviors might necessitate the use of a different window length for each behavior.

I propose a systematic framework [30] for analyzing the observation window length for behavior quantification by measuring the agreement between the system's behavior scores and the corresponding human-annotated behavior ratings, with respect to the final summary of behavior as well as inter-behavior relations, at different window lengths. This analysis is performed using different models, window lengths and context-independent aggregation schemes for a large and diverse set of behaviors, and appropriate recommendations are made for each behavior and type of behavior.

1.5.3 Contextualizing Speaker Behavior with Partner Interaction Cues

Traditional approaches [19, 51, 157], as well as the proposed ones above, while effective, ignore the partner's information when estimating the speaker's behavior and do not explicitly model the interpersonal influence between speaker and partner that characterizes their interaction. This is a considerable omission since the partner's information is known to reduce ambiguity and contribute to better modeling of behavior [86, 161]. I propose the Influence DBM [29], which incorporates the partner's effect on the speaker's behavior using a sequentially-integrated latent-space modeling framework, and show that it improves upon the previously proposed models. In addition, the explicit parameterization of the degree to which the speaker is influenced by their partner is found to be associated with the outcomes of their post-therapy relationship quality. Finally, I show that explicit incorporation of conversational dynamics between the speaker and partner, in the form of entrainment dynamics [109] and turn-taking speech and pause features, can help assess the speaker's long-term behavioral outcomes such as suicidal risk [32] with a high level of accuracy.

1.6 Organization of Dissertation

The dissertation is structured as follows: Chapter 2 describes the existing work on behavior modeling from language that adopts a static behavior-state approach and introduces the proposed improvements in the form of dynamic-state extensions to previous work as well as neural network-based models.
Chapter 3 re-examines the modeling and observation window length assumptions in the models thus far and performs an analysis of what the appropriate design choices should be for different types of behaviors, based on the nature of the concepts they convey. Chapter 4 then reviews the behavior modeling work using the speaker only and demonstrates the value of the partner's information for inferring the speaker's behavioral attributes in a two-fold manner: as a modeling extension to the dynamic-state behavior prediction model by incorporating the influence of the partner over the speaker, and as a feature-based approach by using interaction dynamics information such as entrainment and turn-taking patterns to predict the speaker's suicidal risk. In closing, Chapter 5 provides a summary of the findings and conclusions from the work thus far and uses these to motivate the design of a modeling framework that jointly incorporates the cross-behavior and multi-scale influence dynamics between speaker and partner to model different types of behavioral attributes.

Chapter 2
Modeling Speaker Behaviors from Speaker Interaction Cues

2.1 Introduction

Single-speaker models are frameworks that rely on the observed cues of a speaker during an interaction in order to estimate their degree of behavior expression. Existing works [51, 157] typically assume that the human is a behavioral state model that generates utterances from the same state for the entire duration. This is equivalent to modeling each interlocutor in the interaction as a single-state generative model, as shown in Fig. 2.1 (left). This is clearly a simplistic assumption and does not reflect human behavior, which dynamically adapts based on various stimuli, internal and external. Using the Static Behavior Model, similar to the one in [51], in Sec. 2.2 as the baseline, I present my proposed model, the Dynamic Behavior Model, in Sec. 2.3, which allows for transitions between two behavioral states throughout the interaction and thus makes turn-level decisions instead of session-level decisions.

2.2 Static Behavior Model (SBM)

In the Static Behavior Modeling (SBM) framework, a person's behavior is assumed to remain the same throughout the interaction, irrespective of external stimuli such as the spouse's utterances, the topic being discussed, etc. This is the same model that was proposed in [51] and in effect corresponds to behavioral averaging. Thus, all the utterances observed in that session are generated from the same behavioral state, as in Fig. 2.1 (left).

Figure 2.1: Conceptualizing the proposed graphical model on the right versus the baseline on the left

2.2.1 Modeling approach

This approach relies on the assumption that human behavior can be classified as one of two classes, C_0 and C_1, representing "Low" and "High" degrees. Then, identifying the behavioral state of the interlocutor is equivalent to identifying the class label C_i ∈ {C_0, C_1} from the whole transcript. In this work I employ a Maximum Likelihood (ML) formulation for the binary classification. For the set of observed utterances corresponding to the whole transcript, U = {U(1), ..., U(M)}:

P(Low or High Behavior | U) = P(C_0 or C_1 | U)

For class C_i ∈ {C_0, C_1}:

C_i = argmax_{C_j} P(U | C_j) P(C_j) / P(U)    (2.1)
    = argmax_{C_j} P(U | C_j) P(C_j)           (2.2)

Assuming a balanced class distribution, the class prior probabilities are equal, i.e. P(C_0) = P(C_1). Therefore, the decision can be rewritten as:

C_i = argmax_{C_j} P(U | C_j)    (2.3)

Equation 2.3 represents the final decision scheme for the SBM.
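As an illustration, a minimal Python sketch of the decision in Eqs. 2.2-2.3 is given below; the toy unigram language model and its log_prob interface are hypothetical stand-ins for the SRILM class language models described in the training and testing sections that follow.

class ToyUnigramLM:
    def __init__(self, word_logprobs, oov_logprob=-10.0):
        self.word_logprobs = word_logprobs          # log P(word | class), e.g. from training counts
        self.oov_logprob = oov_logprob              # crude smoothing for unseen words

    def log_prob(self, utterance):
        return sum(self.word_logprobs.get(w, self.oov_logprob) for w in utterance.lower().split())

def sbm_classify(utterances, lm_c0, lm_c1, log_prior_c0=0.0, log_prior_c1=0.0):
    """Eq. 2.2: score the whole transcript under each class LM; equal priors reduce this to Eq. 2.3."""
    score_c0 = log_prior_c0 + sum(lm_c0.log_prob(u) for u in utterances)
    score_c1 = log_prior_c1 + sum(lm_c1.log_prob(u) for u in utterances)
    return "C0 (Low)" if score_c0 >= score_c1 else "C1 (High)"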
2.2.2 Training

For the implementation, I use language models (LMs) to represent probabilistic models of lexical content, and build them using the SRILM toolkit [143]. I replace maximum likelihood schemes with minimum perplexity schemes wherever applicable. Perplexity is a measure of how well a probability model predicts an observation; the lower the value, the better the model. For an utterance U(m), it is calculated as:

PP(U(m)) = P(w_1 w_2 ... w_N)^(-1/N)    (2.4)

where P(w_1 w_2 ... w_N) is the joint probability of occurrence of the words and w_i is the i-th word of utterance U(m), i ∈ {1, 2, ..., N}. From (2.4) it can be seen that minimizing perplexity is equivalent to maximizing probability.

From the SBM assumption, it is clear that the classes and states are equivalent. Therefore, for building a model of a particular behavioral state C_0/C_1, I collect all utterances from the corresponding sessions and train the language models L_0/L_1 respectively. While doing so, I combine the trained LMs with a Universal Background Model (UBM) in order to smooth the language models, using an interpolation weight of 0.1 for the UBM. Thus, the SBM consists of two LMs for each speaker that model their language structure corresponding to the lowest and the highest presence of behavior.

2.2.3 Testing

For a given test session, I assign the C_1/C_0 label to a speaker after comparing the perplexities from their LMs. Given a set of M utterances U = {U(1), ..., U(M)} from a speaker during the test session, I compute the LM perplexities based on L_0 and L_1, and the class label assignment is shown in equation 2.5:

C_i = argmin_j Σ_m PP_j{U(m)}    (2.5)

where PP_j{·} represents the perplexity score of an utterance computed by LM L_j.

2.3 Dynamic Behavior Model (DBM)

Dynamic Behavior Models (DBMs) allow for a person's behavior to change over time. This is modeled in the form of transitions between different behavioral states throughout a session, shown in Fig. 2.1 (right). In this work I simplify the model to have 2 states and assume that behavior remains short-term stationary (i.e. behavior does not change within an utterance, but only from one utterance to another). I denote utterance states as S_i ∈ {S_0, S_1}. The behavioral state of the interlocutor, labeled by the human annotators as Low/High Behavior or C_0/C_1, does not provide a one-to-one correspondence any more. A person expressing High behavior (C_1) can generate from both states (S_0, S_1), with the nature of the states S_0, S_1 described in the training stage. Given only one turn, similar to the formulation of the SBM, an ML model will result in:

P(S_i | U(m)) ∝ P(U(m) | S_i) P(S_i)    (2.6)

This method estimates turn-level state probabilities. Human coders integrate behavioral information and give a summative opinion at the session level. In order to reach inference from the DBM for session-level behavioral descriptors, such turn-level behavioral information needs to be integrated. There are many perception-inspired methods for behavioral integration. For instance, previous work [90] evaluated whether global behavior could be judged based on a locally isolated, yet highly informative event, or whether integrating information over time was more effective. The premise of such work is that the human perception process is capable of integrating local events to generate an overall impression at the global level, but this process is not transparent and as such is difficult to replicate. What can be done instead is to use local information to derive the same global decisions.
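Both fusion schemes described next operate on two per-utterance quantities: the perplexity of Eq. 2.4 and the turn-level state preference implied by Eq. 2.6. A minimal sketch, assuming the same hypothetical log_prob interface (natural-log probabilities) as in the earlier SBM sketch:

import math

def perplexity(utterance, lm):
    # Eq. 2.4: PP(U) = P(w_1 ... w_N)^(-1/N); lm.log_prob returns ln P(w_1 ... w_N).
    n_words = len(utterance.split())
    return math.exp(-lm.log_prob(utterance) / n_words)

def turn_level_state(utterance, lm_s0, lm_s1):
    # Eq. 2.6 with equal state priors: the more likely latent state for a single turn,
    # implemented as the minimum-perplexity choice used throughout this chapter.
    return "S0" if perplexity(utterance, lm_s0) <= perplexity(utterance, lm_s1) else "S1"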
In this work I employ two fusion methods for the DBM: Activation-based and Likelihood-based. The DBM assumes that an interaction consists of multiple behavioral instantiations, even if the entire session is tied to one behavioral type. In scenarios where there is only access to session-level labels C_0/C_1, the utterance-level labels S_0/S_1 are latent. They are iteratively estimated through semi-supervised learning methods. The learning convergence is verified indirectly through the total training perplexity, since local label information is not present. The ADBM and LDBM tie utterance-level labels to the session-level behavior in different perceptual ways, as explained below.

2.3.1 Activation-DBM

2.3.1.1 Modeling approach

The Activation-based DBM (ADBM), shown in Fig. 2.2 (left), decides the global behavior using a majority-vote principle, thus assuming that all talk turns carry the same perceptual weight. The turn-level decision for utterance U(m) about the behavioral state S_i is:

S_i = argmax_{S_j} P(U(m) | S_j) P(S_j)    (2.7)

Figure 2.2: Activation-based DBM (left) versus the Likelihood-based DBM (right)

For the session-level decision, a mapping is needed from the dominant behavioral state to the behavioral class, S_i → C_j, ∀ i, j ∈ {0, 1}. This mapping is learned at the training stage, as explained below.

2.3.1.2 Training

In the ADBM, language models L_0/L_1 are built to represent behavioral states S_0/S_1, as opposed to classes C_0/C_1 in the SBM. However, they are initialized exactly in the same way as in the SBM, and are then re-trained until the training perplexity converges. Each utterance is classified independently of the rest using an Expectation Maximization (EM) [41] learning scheme, as described in Algorithm 1. After convergence, each session is comprised of both states. From the association of a session with a class based on human coding (C_0 = Low or C_1 = High), and of a session with states (S_0 or S_1) based on the EM above, the mapping from states to classes is learned. This is done by computing the proportions of state occupancies in each class and associating each class with its dominant state. This, based on the initialization, usually results in an association of S_0 with C_0 and of S_1 with C_1.

Algorithm 1: EM algorithm for state convergence in the activation-based DBM
  Initialize: utterances in C_0 sessions → S_0, utterances in C_1 sessions → S_1
  Build language model L_0 from utterances in S_0 and L_1 from utterances in S_1
  while training perplexity does not converge do
    E-step: classify every utterance U(m), m ∈ {1, ..., M}, into a state:
      obtain perplexities PP_0{U(m)} and PP_1{U(m)} from L_0 and L_1;
      assign state(U(m)) = S_0 if PP_0{U(m)} < PP_1{U(m)}, else S_1
    M-step: rebuild L_0 from the S_0 utterances and L_1 from the S_1 utterances
  end while
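A compact Python rendering of Algorithm 1 is sketched below (an illustration under the stated assumptions, not the exact implementation used here); train_lm is a hypothetical helper that builds a smoothed state LM from a list of utterances, and perplexity is the helper sketched earlier.

def adbm_em(c0_utterances, c1_utterances, train_lm, perplexity, max_iters=20, tol=1e-3):
    """Algorithm 1: alternate between rebuilding the state LMs and reassigning utterances to S0/S1."""
    s0, s1 = list(c0_utterances), list(c1_utterances)        # initialization: C0 -> S0, C1 -> S1
    prev_total = float("inf")
    for _ in range(max_iters):
        lm0, lm1 = train_lm(s0), train_lm(s1)                # (re)build L0, L1
        everything = s0 + s1
        # E-step: each utterance goes to whichever state LM assigns it the lower perplexity.
        s0 = [u for u in everything if perplexity(u, lm0) <= perplexity(u, lm1)]
        s1 = [u for u in everything if perplexity(u, lm0) > perplexity(u, lm1)]
        # Convergence is checked indirectly through the total training perplexity.
        total = sum(min(perplexity(u, lm0), perplexity(u, lm1)) for u in everything)
        if prev_total - total < tol:
            break
        prev_total = total
    return lm0, lm1, s0, s1

The state-to-class mapping used at test time (Eq. 2.8) then follows from the proportions of S_0/S_1 occupancy within the C_0- and C_1-labeled sessions, as described above.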
2.3.1.3 Testing

Given a therapist's test session, state labels are assigned to each utterance U(m) independently, based on the LM perplexities from L_0 and L_1. The most dominant state S_k is identified and the behavioral class that maximizes (2.8) is selected:

C_i = argmax_{C_j} P(S_k | C_j)    (2.8)

2.3.2 Likelihood-DBM

2.3.2.1 Modeling approach

The Likelihood-based DBM (LDBM), shown in Fig. 2.2 (right), uses a decision scheme based on Hidden Markov Modeling of the behavioral classes C_j. The states S_0 and S_1 that generate the utterances are now assumed to be hidden, and the HMM of each class, HMM(C_j), is allowed to generate from both states. Thus, the underlying states must be deduced from the model that provides the best decoded sequence. This modeling is meant to reflect the real-world observation that the same utterance can be indicative of different types of behavior, depending on the context. Thus, the decision rule is:

C_i = argmax_j P(S_j | U; λ_j)    (2.9)

where λ_j is the set of HMM parameters of class C_j, U = {U(1), ..., U(M)} is the set of utterances in that session, and S_j = {S(1), ..., S(M)} is the state sequence decoded by HMM(C_j).

Algorithm 2: Viterbi-EM algorithm for state and class parameter convergence in the likelihood-based DBM
  Initialize: utterances in C_0 sessions → S_0, utterances in C_1 sessions → S_1
  Build language model L_0 from utterances in S_0 and L_1 from utterances in S_1
  Initialize π, Λ_0, Λ_1
  while training perplexity does not converge do
    E-step: decode the C_0 utterances using Λ_0, L_0, L_1, π:
      for every session utterance U(m) do
        obtain probabilities P_0{U(m)}, P_1{U(m)} from L_0, L_1
        if m = 1 (start of session) then
          δ_k(m) = π_k · P_k{U(m)}, k ∈ {0, 1}
        else
          δ_k(m) = max_j [ δ_j(m−1) · Λ_0(j, k) ] · P_k{U(m)}, j, k ∈ {0, 1}
          ψ_m(k) = S_0 if δ_0(m−1) · Λ_0(0, k) ≥ δ_1(m−1) · Λ_0(1, k), else S_1, k ∈ {0, 1}
        end if
      end for
      decode the state sequence by backtracking with δ_k(M) and ψ_m(k), m ∈ {M, M−1, ..., 2}, k ∈ {0, 1}
      repeat the E-step for the C_1 utterances, replacing Λ_0 with Λ_1
    M-step: re-estimate the states and class parameters:
      build L_k from all U(m) whose state = S_k, k ∈ {0, 1}
      update Λ_n(i, j) = count(⟨i, j⟩ state pairs in class C_n) / Σ_k count(⟨i, k⟩ state pairs in class C_n), i, j, k ∈ {0, 1}, n ∈ {0, 1}
  end while

2.3.2.2 Training

In the LDBM, each behavioral class is represented by an HMM that describes the characteristics of transitions between different behavioral states. The transition matrix of C_i is initialized to heavily favor S_j → S_i transitions, ∀j, over the rest. I then use the Viterbi-EM algorithm to obtain converged estimates of the model parameters and behavioral states, as described in Algorithm 2. At the end of training, the LDBM is associated with HMMs that correspond to human-coded Low and High presence of behaviors (HMM(C_i)). Each behavioral state S_i is associated with a language model, while each class C_i consists of a common initial-state probability vector π and a matrix Λ_i that governs state transitions in that class. For example, Λ_0(i, j) represents the probability of a C_0-rated speaker transitioning to state j, given that they were previously in state i.

2.3.2.3 Testing

The test utterances U are decoded using the HMMs of C_0/C_1, thereby obtaining the most likely state sequences S_0/S_1 and their corresponding likelihoods. The LDBM then picks the class with the highest likelihood as the global behavior label for the test session, as shown in (2.10):

C_i = argmax_j P(S_j | U; λ_j)    (2.10)

where U is the set of test utterances {U(1), ..., U(M)}, S_j is the most likely state sequence predicted by HMM(C_j), and λ_j is the set of HMM parameters of class C_j, {Λ_j, L_0, L_1, π}.
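The E-step decoding of Algorithm 2 and the test-time rule of Eq. 2.10 amount to a standard Viterbi pass per class HMM. A minimal sketch follows (illustrative only; it works in the probability domain for readability, whereas a practical implementation would use the log domain to avoid underflow, and utt_prob(u, lm) is an assumed helper returning P(U(m) | S_k) under the state LM for S_k).

def viterbi_decode(utterances, state_lms, trans, init, utt_prob):
    """One class HMM from Algorithm 2: trans[i][j] = Lambda(i, j), init[k] = pi_k.
    Returns the most likely state sequence and its likelihood."""
    n = len(init)
    delta = [[0.0] * n for _ in utterances]                   # best path scores
    psi = [[0] * n for _ in utterances]                       # backpointers
    for k in range(n):
        delta[0][k] = init[k] * utt_prob(utterances[0], state_lms[k])
    for m in range(1, len(utterances)):
        for k in range(n):
            best_j = max(range(n), key=lambda j: delta[m - 1][j] * trans[j][k])
            psi[m][k] = best_j
            delta[m][k] = delta[m - 1][best_j] * trans[best_j][k] * utt_prob(utterances[m], state_lms[k])
    last = max(range(n), key=lambda k: delta[-1][k])
    path = [last]
    for m in range(len(utterances) - 1, 0, -1):                # backtrack
        path.append(psi[m][path[-1]])
    return list(reversed(path)), delta[-1][last]

def ldbm_classify(utterances, state_lms, trans_c0, trans_c1, init, utt_prob):
    """Eq. 2.10: the class whose HMM yields the higher decoded likelihood wins."""
    _, lik0 = viterbi_decode(utterances, state_lms, trans_c0, init, utt_prob)
    _, lik1 = viterbi_decode(utterances, state_lms, trans_c1, init, utt_prob)
    return "C0 (Low)" if lik0 >= lik1 else "C1 (High)"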
The LDBM then picks the class with the highest likelihood as the global behavior label for the test session, as shown in (2.10) C i = arg max j P (S j jU; j ) (2.10) Where U is set of test utterances,fU(1);:::;U(M)g S j is the most likely state sequence predicted by HMM(C j ) j is set of HMM parameters of class C j ,f j ,L 0 ,L 1 ,g 2.4 Neural Behavior Model (NBM) The framework for obtaining (i) continuous metrics of behavior, or behavior trajectories, in an (ii) online, sliding window, manner frame is crucial to providing psychologists with real-time feed- back on patient interactions. This framework should be (iii) robust to Out-Of-Vocabulary (OOV) phrases and allow for (iv) long and variable-length context where that is available and helpful. 18 Another requirement is to create a framework that easily allows the (v) fusion of behavior trajec- tories with other modalities, such as acoustic features [155]. A Neural Network based modeling framework, as demonstrated in this section, meets all of the above requirements. Two variants of models in this framework are tested: (i) a word-level recurrent model that maps a xed number of words to a short-term score which is then aggregated to obtain the target behavior score and (ii) a turn-level feedforward model that maps a sentence, with variable number of words, directly to the target behavior score. 2.4.1 Word-level Recurrent NBM 2.4.1.1 Motivation Recurrent neural networks have become increasingly popular for sequence learning tasks as they are adept at integrating temporal information from the entire sequence history as opposed to a xed window of data in feed-forward neural networks. This dynamic context is especially valuable in natural language processing where semantic meaning may have long-term dependencies across any number of words. RNNs have been shown to perform better than statistical language models in such data-sparse situations by learning distributed representations for words [17,99]. However the training of RNNs generally requires large amounts of data with accurate labels; something generally not available in the psychotherapy domain. Therefore, I propose the use of pretrained distributed representations of words from out-of-domain large corpora to alleviate the problem of data sparsity. In addition, I train the RNN using a weakly supervised method to account for the missing frame-level labels. Recurrent neural networks (RNN), namely the Long Short-Term Memory (LSTM) architec- ture, have demonstrated incredible abilities in handling long range dependencies for improved sequence learning [145]. In Natural Language Processing, LSTMs have produced state-of-the-art results in many tasks [58,144]. However, the use of RNNs in behavior estimation has seen limited 19 success due to data limitations. Firstly, due to privacy restrictions, data with rich information of behavior in psychotherapy sessions is often severely limited in quantity. Secondly, due to the eort required for annotation, very often only the session labels are given and there is no ground truth for individual frames. In this section I address these problems and propose an LSTM-RNN system for capturing behavior trajectories in couples interaction in a low data resource environment. To allow for training of the RNN with limited data, I use pretrained word representations learned from out- of-domain corpora and joint optimization. I also show the viability of using session-level labels for learning frame-level behavior. 
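As a small illustration of this weak-supervision idea, which is elaborated in the following subsection, the sketch below assigns the single session-level rating to every sliding word window extracted from that session; the window length and shift shown here are illustrative choices, not the values prescribed by this work.

```python
def session_to_training_pairs(session_words, session_rating, window_len=3, shift=1):
    """Weak supervision: every word window inherits the session-level rating.

    session_words:  list of word tokens from one interaction
    session_rating: the single global rating assigned to that session
    Returns a list of (word_window, label) training pairs.
    """
    pairs = []
    last_start = max(len(session_words) - window_len, 0)
    for start in range(0, last_start + 1, shift):
        window = session_words[start:start + window_len]
        pairs.append((window, session_rating))
    return pairs

# Example: a 6-word session rated 7.5 yields four 3-word sequences, all labeled 7.5.
pairs = session_to_training_pairs(
    ["you", "never", "listen", "to", "me", "anymore"], 7.5)
```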
Using a fusion of the frame-level behavior trajectories I show that the ratings predicted by my proposed system achieve inter-annotator agreement comparable to those of trained human annotators. ρ’i ρ RNN RNN RNN wi-2 wi-1 wi Session Data Session Rating ρ Session-level human process Frame-level RNN training process Sliding window Time Figure 2.3: Training frame-level RNN using global rating values. Since human raters do not provide behavioral ratings for each utterance in the session I instead use the global rating as training labels for the individual sequences. In other words, all word sequences within a session are trained with the same label as the global rating. This method assumes that sequences of words from a session are related to global rating in a non-linear, complex 20 manner. This is depicted in Figure 2.3 where the session-level label is assumed to be a proxy for the frame-label 0 . This also infers that the longer the context-window the less the mismatch between the global rating and 0 . Ideally one would like the whole session to be passed as a training sample, however this would drastically reduce the size of the training set and render it infeasible. Nevertheless, a larger window can help identify lexical combinations that contribute towards the expression, and consequently estimation, of specic behaviors. 2.4.1.2 Modeling approach LSTM LSTM LSTM Feed forward Feed forward Feed forward w0 w1 w2 Embedding Embedding Embedding W2V W2V W2V Session label prediction … Fusion layer Figure 2.4: Recurrent Neural Network system for predicting behavior. I propose a 3-layer recursive neural network architecture as shown in Figure 2.4. The input is encoded as a one-hot vector w, where the n-th word in the dictionary is represented by setting the n-th element in w to 1 and all other elements to 0. I assume a vocabulary of N unique words and 0 n N. The rst layer in the RNN maps the one-hot vectors w into intermediate continuous vectors using an embedding layer [98]. 21 The next hidden layer consists of the LSTM blocks that, employing memory cells, will store non-linear representations of the current sequence history and be better able to encode context. To prevent overtting a dropout layer is added after the LSTM. Finally the last layer is a feed-forward layer that performs non-linear mappings to better approximate the human scale of behavior. The RNN is then trained for a xed number of epochs using an adaptive learning rate optimizer [44]. For evaluation purposes, and to better approximate the human annotation process, I employ a fusion layer after the RNN to combine the behavior metrics over all the time-steps and obtain a prediction of the global rating. 2.4.1.3 Training In this work, I investigate two options for generating such representations. One is to directly train this on the limited, but domain-specic training data, denoted as 1Hot. Another option that also addresses the problem of data sparsity and allows for a more generalized model, is to incorporate out-of-domain knowledge by pretraining word representations on larger corpora, denoted as w2v. It is expected that employing this second method will have advantages: First, by using pre- trained word representations, I can mitigate the issue of data sparsity in the training data. High- quality word representations will map similar words to closely spaced points in the vector repre- sentation space. This allows for the use of a smaller number of parameters and hyper-parameters in constructing and training the RNN. 
Second, by training on the word representations the sys- tem will generalize well in regards to out-of-vocabulary words. Words that were not seen during training will still be mapped to a continuous vector that preserves its semantic relationships to words that were seen during training. The RNN will therefore be able to produce reasonable if not accurate predictions when encountering out-of-vocabulary words in a sequence. To learn high-quality word representations I use the Google News corpus [2] which contains more than 4 billion words. I also introduce 1 million words from the General Psychotherapy 22 corpus transcripts from [1] to allow the word representations to be more representative of the target domain. The word representations are learned through the methods described in [98]. Since the nal objective is to estimate the behavior metrics for word sequences, I reduce the vector dimensionality from the commonly-used size of 300. In my experiments, I tried vector dimensionality congurations of 300, 50, and 10. The continuous word representations are incorporated into the RNN system by xing the weights in the embedding layer with the learned word to vector mappings. These weights are then maintained during training to preserve the learned word representations. Using pretrained word representations the RNN learns to predict the behavior ratings from continuous vectors that capture the semantic relationships between words. However, although these word vectors encode a lot of semantic information they are not optimized for predicting behavior. By jointly training these word vectors with the behavior ratings the word represen- tations become more indicative of behavior where appropriate while still maintaining semantic relationship. In training the RNN with pretrained word representations, I initialize with the above learned word vectors and allow the weights in the embedding layer to be updated to allow for this joint optimization, denoted as w2v-joint. 2.4.1.4 Testing The RNN system is trained to predict behavioral ratings for dierent sequences of words. Since local-level annotations are not available to compare these predictions with, I evaluate the system at the global session-score level. This is done by fusing the local predictions to arrive at a global predicted score, similar to the human process of integrating behavioral information over time to arrive at a gestalt opinion of the session. I observed that, in general, the median predicted rating exhibited lesser bias as an estimator of the true rating than the mean rating, possibly due to the former's robustness to outliers. Therefore, I used an RBF-Kernel Support Vector Regressor to learn a mapping from the median 23 predicted rating to the true rating on the training data. At test time, I applied this map on the median predicted test rating to obtain the predicted session-level rating, which I then compared against the true session-level rating that had been used to train the RNN system. 2.4.2 Turn-level NBM 2.4.2.1 Motivation The word-level recurrent NBM utilizes a sequence of 3 words at a time, in order to arrive at an estimate of a behavior score. While this approach has been shown to work well for predicting some behaviors, a 3-word context might not always be sucient to capture dierent types of behavioral attributes. Complex behaviors, such as depression, that entail long-term or slow- varying phenomenon are often expressed over a turn or even multiple turns. 
In such a scenario, the word-level model might not be adequate for capturing such behaviors, despite the presence of a recurrent module. Hence, I also investigate turn-level modeling of behavior within the Neural framework. 2.4.2.2 Modeling approach For every turn t i consisting of a sequence of M ti words, I extract its sentence-level embedding l i of dimension 600 by passing the word sequence through an encoder-decoder conversational model [149] and extracting the encoder output. The conversational model was trained using a sequence-to-sequence [145] framework in a multi-task fashion by jointly predicting a movie dialogue turn given its previous turn as well as the sentiment category of the previous turn, based on whether the majority of the words belong to Positive or Negative sentiment, as dened by the LIWC dictionary [146]. For comparing the performance of the lexical modality against the acoustic modality, I also extract acoustic features using openSMILE toolkit [46] with the default eGeMAPS conguration. The eGeMAPS feature set includes 88 features related to frequency, energy, spectral, cepstral, 24 and dynamic information. The nal feature vectorsa i of each turnt i are generated by calculating the following statistics: mean, coecient of variation, 20th, 50th, and 80th percentile, range of 20th to 80th percentile, the spectral slopes from 0-500Hz and 500-1500Hz, mean of the Alpha Ratio, the Hammarberg Index, and mean and standard deviation of the slope of rising/falling signal parts. “I don’t like the way you do things” UTTERANCE TURN-LEVEL EMBEDDING INPUT LAYER OUTPUT LAYER NEGATIVE PREDICTED BEHAVIOR HIDDEN LAYERS Figure 2.5: Turn-level Feedforward Neural Behavior Model Once I obtain the turn-level lexical and acoustic embeddingsl i anda i respectively, I feed them separately to the input of a multi-layer feedforward Neural Network model, shown in Figure 2.5, whose last layer outputs softmax probabilities for a distribution over categorical behavioral labels. 2.4.2.3 Training The acoustic system DNN has 3 to 4 hidden layers while the lexical system has 1 to 3 hidden layers and both use ReLU activation functions. I tried all hyperparameter combinatorics shown in Table. 2.1. To address the high class imbalance, during training, I employ weighted categorical cross entropy loss. Assuming that the number of samples in each class isx i , wherei2 [1; 2; 3] in the 25 Table 2.1: Model hyperparameters Acoustic hidden layers conguration (64, 32, 16), (128, 64, 32), (128, 64, 32, 32) batch size 32, 64, 128, 256 class weights 1 xi P i (xi) , P i (xi) xi , maxixi xi Lexical hidden layers conguration (300, 200, 100), (200, 100, 50), (300, 200), (200, 100), (100, 50), (300), (200), (100), (50) batch size 25 rate decay 1e-1, 5e-1 class weights maxixi xi Training optimizer Adam, SGD learning rate 1e-2, 1e-3, 1e-4 case, I use three methods to calculate the class weight w i , shown in Table. 2.2. When optimizing for the acoustic system, no major dierence was observed between the weighing methods 2 & 3, while method 1 was under performing. I thus only employed method 3 for the lexical system. Table 2.2: Class weighting schemes (1) 1 xi P i (xi) (2) P i (xi) xi (3) maxixi xi 2.4.2.4 Testing I use a leave-one-couple-out nested cross-validation paradigm for the behavior classication task. In every test fold i, I set the samples of couple C i as test data and use the samples of remaining couples C j ;j6=i to train and validate the models. 
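The following is a minimal sketch of this leave-one-couple-out arrangement using scikit-learn's grouped splitters; it is illustrative only and does not enforce the additional constraint, described next, that every split contain samples of all three behaviors.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, GroupShuffleSplit

def nested_couple_folds(X, y, couple_ids, seed=0):
    """Yield (train_idx, val_idx, test_idx) with no couple shared across splits.

    X: turn-level feature matrix (numpy array), y: behavior labels,
    couple_ids: array mapping each turn to its couple.
    """
    couple_ids = np.asarray(couple_ids)
    outer = LeaveOneGroupOut()                      # one held-out couple per fold
    for dev_idx, test_idx in outer.split(X, y, groups=couple_ids):
        # couple-disjoint 80/20 train/validation split of the development portion
        inner = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
        tr_rel, va_rel = next(inner.split(X[dev_idx], y[dev_idx],
                                          groups=couple_ids[dev_idx]))
        yield dev_idx[tr_rel], dev_idx[va_rel], test_idx
```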
I then pick the model that performed best on the validation data, run it on the samples of couple C i and compute the test evaluation metric. Since the data has 85 unique couples, this process is repeated over 85 folds and the test evaluation metric is averaged over all of them. In every test fold, I create an 80% training, 20% validation split such that every couple appears in only one split and each split contains samples of all 3 behaviors. For model evaluation and checkpoint selection, I use the Unweighted Average Recall (UAR) metric which accounts for class imbalance. The model hyperparameters, such as number of hidden layers, hidden layer size, learning rate, etc. (shown in Table. 2.1), are tuned on the validation data using grid search. 26 In partition experiments, since one model is built for each partition, I halve the model size so that the total number of parameters remains the same as when not using any partitions. I used PyTorch [118] and scikit-learn [119] in the experiments. 2.5 Experiments In this section, I perform a comparison of the models introduced above to understand how well each one ts the expression of dierent behavior constructs in spoken language. For this, I employ three datasets of real-life interactions in the mental health domain with their associated set of behavioral constructs. I rst compare the SBM against the dierent variants of the DBM on the Couples Therapy and Addiction Counseling corpora. Then, I test the word-level Neural model on the same task in Couples Therapy to check whether the improvements in modeling translate into better performance. Finally, I test the turn-level on the Cancer Care discussion corpus to check how well the model can perform when estimating behavior at the sentence-level context. 2.5.1 Comparing SBM vs DBM 2.5.1.1 Case Study 1: Classifying Negative behavior in Couples Therapy 1. Introduction In couples therapy, example work includes [19{21, 51, 54, 83{85, 88, 159]. These methods however suer from the simplistic assumption of estimating a single behavioral state from the whole interaction: the inherent underlying assumption is that the human is a behavioral state generator and they generate from the same state for the entire duration. 2. Aim In this experiment, I specically investigate two goals: (1) Human behavior changes through- out an interaction and better models of this evolution can improve automated behavioral 27 annotation and (2) Human perception of this evolution can be quite complex and non-linear and better techniques than averaging need to be investigated. For this purpose, I contrast the DBM against the SBM which allows only a constant session-long behavioral state. I use Negativity in a couples therapy task as the case study. I present results and analysis on both models for capturing the local behavior information and predicting the session level negativity label. 3. Dataset I use the data of 134 couples from the UCLA/UW Couple Therapy Research Project [36] which contains audio and video recordings and transcripts of interactions between real cou- ples. Each couple had at least 2 interactions or \sessions", once with each participant leading the discussion on a topic of their choice, and the total number of sessions per couple ranged from 2 to 6. In each session, both the husband and the wife were rated for a total of 31 CIRS [63] and SSIRS [71] behaviors by trained human annotators with a sense of what \typical" behavior is like during these interactions. 
The annotators were asked to ob- serve both verbal and nonverbal expressions when rating each behavior independently and in many cases, dierent annotators rated dierent behaviors. Each behavior in each session was rated by 3 to 9 annotators, with most of them being rated by 3 to 4. The rating was done on a Likert scale from 1 to 9, where 1 represents \absence of behavior", 5 represents \some presence of behavior" and 9 represents \strong presence of behavior". More details about the recruitment, data collection and the annotations can be found in [12,36]. For the task, I analyze the `Negativity' behavioral code. In order to simplify the code learn- ing, I only use sessions with mean annotator ratings in the top 20% (`High Negativity') and bottom 20% (`Low Negativity') of the code range and I binarize intoC 1 andC 0 respectively. In this manner, I pose the learning problem as a binary classication one, as was also done in the earlier work [19]. Table. 2.3 lists the statistics on the chosen set of data and I describe 28 the modeling schemes in the next section. For more information, the reader can refer to [21,63,71] Husband Wife Class Label C 0 C 1 C 0 C 1 Code Range 1.00-1.75 5.33-9.00 1.00-2.00 6.33-9.00 No. of Files 70 70 70 70 Table 2.3: Data Demographics on 20% least/most negativity sessions 4. Results and Discussion The performance of all 3 models - SBM, ADBM and LDBM - are shown in Table. 2.4. Model Husband Wife Average SBM 79.29% 83.57% 81.43% Activation-DBM 85.00% 79.29% 82.15% Likelihood-DBM 83.57% 88.57% 86.07% Table 2.4: Classication Accuracy of Behavioral Models using 1-grams The likelihood-DBM has the best average performance across both spouses, at 86.07%, while the SBM has the lowest, with 81.43%. This matches the expectations since the likelihood- DBM is less rigid in its assumptions about behavioral changes. Thus, it is better equipped to capture information about dynamic behavior, as compared to the other two models. While the SBM helps in identifying which state a person's behavior more likely corresponded to, it is limited in its ability to model changes in behavior within a session. The Activation- DBM can model behavioral changes within the same session, and performs well, but it allows rapid changes in behavioral states across successive utterances. I feel that such fast changes in behavior do not realistically mirror the usual changes in human behavioral expression. In addition, there is no dependence on previous utterances, meaning that context is ignored. The Likelihood-DBM avoids the pitfalls associated with the rst two models, and while more complex, it is more accurate in predicting behavior from language. Note that both dynamic 29 implementations can likely be improved through supervised learning and it is notable that the achieved performance is through only loose semi-supervised clustering. 2.5.1.2 Case Study 2: Rating Empathetic behavior in Addiction Counseling 1. Introduction The presence of empathy generally involves taking the perspective of others, and responding with sensitivity and care appropriate to the suering of another [9]. High empathy ratings are associated with positive outcomes in a variety of human interactions [15,48]. Empathy is considered an essential quality of therapists in psychotherapy and drug abuse counseling in particular and is associated with positive clinical outcomes such as decreased substance use [59,102]. 
In psychotherapy studies, therapists are often rated on their empathy levels by third-party observers (coders), based on multimodal behavioral cues expressed throughout the interaction. Previous work on inferring empathy has used acoustic cues [81, 156] as well as lexical in- formation stream using maximum likelihood models [157], while other work has focused on simulating and synthesizing empathy through computer agents [24, 97]. Xiao et al. [160] found that vocal similarity between client and therapist was signicantly correlated with empathy and that it contributed to better classication when combined with speaking time features. 2. Aim In this experiment I test the performance of the SBM and the DBMs for modeling empathetic behavior during an an interaction. The SBM assumes a xed degree of empathy throughout an interaction while the DBMs allow transitions between high- and low- empathy states. Through the non-causal human perception mechanisms, these states can be perceived and integrated as high- or low- gestalt empathy. I also demonstrate the robustness of both SBM 30 and DBM to transcription errors stemming from ASR rather than human transcriptions. the results suggest that empathy manifests itself in dierent forms over time and is best captured by context-dependent models. 3. Dataset I employ the CTT (Context Tailored Training) data set collected in a therapist training study [7]. The study's focus was on Motivational Interviewing [101], which is a type of addiction counseling that helps people resolve ambivalence and emphasizes the intrinsic motivation of changing addictive behaviors. Each session in the set received an overall empathy code in the range of 1 to 7 given by human-experts according to a specic coding manual named \Motivational Interviewing Treatment Integrity" [104]. Intra-Class Correlations (ICC) for inter-coder and intra-coder comparisons were 0.670.16 and 0.790.13 respectively, which prove coder reliability in the annotation. I binarize the set of 200 sessions, based on the average ratings, into two classes: C 0 (1 to 4) representing low-empathy and C 1 (4.5 to 7) representing high-empathy. In total there are 133 unique therapists in the set. Audio recording of each session is about 20 min with single channel, far-eld microphone. These sessions were also manually transcribed with speaker labels and timing information (i.e., manual diarization). Moreover, I trained a large vocabulary Automatic Speech Recognizer (ASR) on additional in- domain data using the Deep Neural Network model implemented in the Kaldi library [126]. I applied the ASR to the CTT set in two settings: with manual and automatic diarization. For the latter, I conducted Voice Activity Detection and Diarization on the audio signal before decoding, plus speaker role identication on the decoded transcripts. In average, I obtained Word Error Rate of 43.1% and 44.6% for the manual and automatic diarization cases, respectively. We, therefore, have three versions of the dataset to test the models on: 31 (1) Manual Diarization Manual Transcription (MDMT), (2) Manual Diarization Automatic Transcription (MDAT) and (3) Automatic Diarization Automatic Transcription (ADAT). 4. Results and Discussion Table 2.5 displays the classication accuracies of evaluation on all 3 behavioral models for each noisy version of the dataset. 
Table 2.5: Classification Accuracy of Behavioral Models for different versions of the dataset
Dataset Type   Model    1-gram Accuracy%   2-gram Accuracy%   3-gram Accuracy%
Chance                  60.0               60.0               60.0
MDMT           SBM      79.0               78.0               75.5
               ADBM     81.5               82.5               80.0
               LDBM     86.5               80.5               81.0
MDAT           SBM      79.0               70.5               71.0
               ADBM     80.0               77.5               75.0
               LDBM     82.0               74.5               72.0
ADAT           SBM      73.5               66.5               68.5
               ADBM     74.0               69.0               68.0
               LDBM     78.5               69.0               68.5

As expected, the highest classification accuracy for all the models is obtained in the case of manual diarization and manual transcription. The Likelihood-DBM performs the best among all 3 models, with its highest classification accuracy close to 87%. Looking at the unigram column, the ADBM performs worse than the LDBM, but better than the SBM. This is expected since the DBM can better capture behavior than the SBM and the LDBM weighs data according to their importance rather than equally weighting them as in the ADBM. The results from bigram and trigram models demonstrate a drop in performance that can be attributed to data sparsity. This performance drop is visible in all models, while I would expect the context captured by the higher-order n-gram features to provide better performance if data sparsity issues were not prevalent.

I see that, for the SBM and the ADBM, errors in diarization bring about a more significant drop in performance than errors in transcription. Errors in diarization result in parts of the interaction being attributed to the wrong speaker and hence create local, yet pronounced errors. In contrast, errors in transcription are distributed throughout the interaction. The SBM averages the whole interaction so it can handle errors better. The LDBM, due to its decoding, is more influenced by errors as they can propagate through the decoded sequence. Thus I observe that the LDBM is affected more by signal noise. Nevertheless, even with the high word error rates of such signals, I see that the LDBM still outperforms all other algorithms and does significantly better than chance.

2.5.2 Evaluating Word-level Recurrent NBM

2.5.2.1 Case Study 3: Classifying Negative behavior in Couples Therapy

1. Introduction
Among the recent advances in behavioral signal processing for modeling behavior from spoken language are [51], who used a static behavior model where a person's behavioral state was assumed to remain the same throughout the interaction, and Chakravarthula et al. [31], who used a dynamic behavior model to capture transitions between different behavioral states. These works, however, suffer from the drawbacks of being unable to handle extremely long contexts or out-of-vocabulary words.

2. Aim
In this experiment, I investigate the application of recurrent neural networks for capturing behavior trajectories through larger context windows. I address the issue of data sparsity and improve robustness by introducing out-of-domain knowledge through pretrained word representations. Finally, I test the system on tasks in the Couples Therapy domain and show that the system can accurately estimate true rating values of couples interactions using a
Results and Discussion In the experiments I used a leave-one-out cross-validation scheme on each couple to separate train and test data. In each fold one couple is held out while the rest are used for training. I applied a sliding window with a 1-word shift across each session to generate multiple training sequences and trained each RNN architecture for 25 epochs. I also tried dierent dimension sizes for the pretrained word vectors and found that the best results can be obtained from a dimension size of 10. I rst focused on binary classication of \Negativity" at the session level which is easier to compare with human annotations. A threshold was applied to the average of behavior metrics in a session to classify that session into High or Low Negativity. For each congura- tion an Equal-Error Rate threshold for the binarization task was obtained from the training data. I trained using dierent context length for each of the proposed RNN congurations. Table 2.6: Classication accuracy (%) on negativity for dierent input sequence lengths. RNN Con- guration Input sequence length (words) unigram bigram trigram 1Hot 87:86 85:71 86:43 w2v 87:5 87:1 86:8 w2v-joint 88:93 88:21 87:86 34 The classication accuracy for the dierent RNN congurations with varying input sequence lengths is shown in Table. 2.6. I observe, as expected due to limited data, a slightly de- creasing accuracy as context is increased, but I also see that the accuracy drop is minimal. I also observe that the pretrained word representations (w2v) are more robust than embed- dings that only employ only domain data (1hot) but can become even more robust by joint training (w2v-joint). Note that while the relative improvement is signicant it is also limited by the upper limit { even humans do not agree 100% { so the binary evaluation task is limiting the evalua- tion abilities. For instance, if the upper limit was 100% then I have about 15% relative improvement but if the upper limit is 92% then this jumps to a relative 40% improvement. To better evaluate the system performance I estimated the behavior ratings which are ob- tained through the fusion layer. I compared the estimated behavior ratings to those from human annotators using Krippendor's alpha. In the rst comparison I randomly replaced a human annotation with the predicted rating for all sessions. I found that the jointly op- timized word representations gave ratings that had better agreement with human ratings than conventional one-hot vectors. Next, I replaced human annotations that deviated most from the mean with the predicted ratings. In this setting I found that the predicted ratings had higher inter-annotator agreement than human-only annotations. This shows that with jointly optimized word representations the RNN system can achieve better inter-annotator agreement than outlier human annotators. The inter-annotator agreement of the predicted ratings for the dierent comparisons is shown in Table. 2.7. In estimating the degree of behavior, I observe that the w2v-joint system provides a more reasonable distribution of behavior scores: the behavioral histogram is more skewed towards the true rating value. For example, Fig. 2.6 shows the distribution of the sequence scores for one session. 35 Table 2.7: Comparison of agreement using Krippendor's alpha. 
Annotator Conguration Krippendor's alpha 1Hot w2v-joint All human annotators 0:821 Random replacement with random predictions (average) 0:492 Random replacement with machine predictions (average) 0:7611 0:7739 Outlier replaced with machine prediction 0:7997 0.8249 Figure 2.6: Comparison of distribution of scores for words in a session for 1-hot system at the bottom versus w2v-joint above. I observe that the distribution of w2v short-term decisions closer resembles likely interactions of a negative session while the 1-hot system has very few discrimi- nating data points due to sparsity. 2.5.3 Evaluating Turn-level Feedforward NBM 2.5.3.1 Case Study 4: Classifying Positive, Constructive and Hostile behavior in Cancer Care 1. Introduction 36 Cancer impacts the quality of life of those diagnosed as well as their spouse caregivers, in addition to potentially in uencing their day-to-day behaviors. There is evidence that eective communication between spouses can improve well-being related to cancer but it is dicult to eciently evaluate the quality of daily life interactions using manual annotation frameworks. Automated recognition of behaviors based on the interaction cues of speakers can help analyze interactions in such couples and identify behaviors which are benecial for eective communication. 2. Aim In this work, I present and detail a dataset of dyadic interactions in 85 real-life cancer- aicted couples and a set of observational behavior codes pertaining to interpersonal com- munication attributes. I employ the turn-level feedforward NBM for classifying these be- haviors based on turn-level acoustic and lexical speech patterns. Furthermore, I investigate the eect of controlling for factors such as gender, patient/caregiver role and conversation content on behavior classication. Analysis of the preliminary results indicates the chal- lenges in this task due to the nature of the targeted behaviors and suggests that techniques incorporating contextual processing might be better suited to tackle this problem. 3. Dataset Data were gathered as part of a prospective observational study of couples coping with ad- vanced cancer. All procedures were conducted with the approval of the Institutional Review Board. Advanced cancer patients and their spouse caregivers were recruited from thoracic and gastrointestinal clinics at a National Cancer Institute-designated Comprehensive Can- cer Center. Inclusion criteria for patients were (a) a diagnosis of stage III or IV non-small cell lung or pancreatic, esophageal, gastric, gallbladder, colorectal, hepatocellular, and bile duct cancers; (b) Karnofsky Performance Status score of 70+ 1 ; (c) a prognosis of more 1 The Karnofsky Performance Scale Index allows patients to be classied as to their functional impairment. In short, above 70 patients are still able to mostly take care of themselves. 37 than 6 months; and (d) undergoing active treatment at an NCI-designated Comprehensive Cancer Center. Patients were to be cohabiting with a spouse/partner who self-identied as providing some care and also agreed to participate. Participants were required to be over 18 years of age and English-speaking/writing. A detailed description of study methods can be found elsewhere [129,130]. Couples were asked to interact with each other in two structured discussions used in previous research [94]. First, couples engaged in a 10-minute neutral structured discussion (describing daily routines) which served as a baseline. 
Next, participants independently completed the Cancer Inventory of Problem Situations, [65] in which a list of 20 common cancer concerns (e.g. lack of energy, nances, over-protection) are rated as being not a problem, somewhat of a problem, or a severe problem. After completing the concern list, items for which at least one person rated as a severe problem or both listed as at least somewhat of a problem were used as a prompt for the second structured discussion. Couples were asked to have a 10-minute conversation in which they were asked to describe the issue, how it made them feel, and why they felt it was a concern. In both interactions, the experimenter was present, but did not facilitate or participate in discussion. Interactions were audio-recorded at 44.1 KHz in naturalistic environments (e.g., clinic consult rooms, participant homes) using Sony Digital Recorders (ICD-UX533) with a lavalier microphone (Olympus ME-52W) worn by each participant. After the structured discussions, couples were asked to continue wearing the audio recorders for the rest of the day to capture their natural interactions. Two trained coders used Noldus Observer [113] to review recordings and identify and times- tamp communication behavior in the structured discussions using the Rapid Marital Inter- action Coding System, 2nd Edition (RMICS2) [66, 67]. In RMICS2, the unit of analysis is speaker turn (i.e. each time an individual takes the oor to speak is a turn; an interaction 38 Table 2.8: Description of RMICS2 Behavior Codes Code Denition Examples (utterances not within context) High Hostile Intense negative aect. Profoundly negative statements. Contempt, belligerence, character assassination. \Why are I still even in this relationship?" \Of course you always think about yourself. That's just what you do." Low Hostile Mild to medium intensity negative aect and verbal content that is mildly to moderately negative. Blame/ criticism (focus on behav- ior), demands. \I'm bothered by you not xing anything around the house." \Shut up." Constructive Problem Discussion All constructive approaches to discussing or solving problems. Includes descriptions of the problem, solutions, and questions. \When I take that medication I feel groggy." \I could be better at scheduling." Low Positive Measured positive aect and statements with positive content (focused on others' behavior) that facilitate low-level bonding within the couple. \I truly wish that I could be closer to one an- other" \I appreciate the fact that you cleaned the house the other day without me asking you to do so." High Positive Intense positive aect, statements with posi- tive content that facilitate high-level bonding within the couple. \I love you" \You're the funniest person I know." Dysphoric Aect Sad or depressed expressed emotional states. \I'm useless. I can't even mow the lawn any- more." \I just don't know what else I can do." Other Talk about the experimental situation. Not relevant to ongoing discussion. \Do you think the recorder stopped?" where patient speaks, spouse speaks, and patient speaks again consists of 3 turns), and each speaker turn is labeled with a single hierarchical communication code. A random sample of 20 percent of recordings was coded by both raters to calculate reliability. Inter-rater agreement was excellent, with Kappas above .88 for all codes. The RMICS2 codes represent the emotional valence of the turn and are organized in a hier- archy (i.e., negative codes, positive codes, neutral code). 
In addition to hostile, positive, and dysphoric affect codes, a constructive problem solving code represents a more emotionally-neutral discussion of the problem or conversational topic. Table 2.8 provides the definition of these codes along with representative examples. The ordinal set of RMICS2 behaviors consists of 5 codes: "High Hostile", "Low Hostile", "Constructive Problem Discussion", "Low Positive" and "High Positive", in this order. However, since the number of samples in all classes except "Constructive Problem Discussion" was small, I combined the "High" and "Low" codes into a single class for both Positive and Hostile. This resulted in 3 behavior classes: "Hostile", "Constructive" and "Positive". In this work, I focus only on the ordinal set of behaviors; hence, I did not use the remaining independent codes "Dysphoric Affect" and "Other".

Speaker attributes such as gender have been shown to be related to aspects of behavior expression and perception [151] and emotion recognition [3]. The topic of conversation (e.g., sensitive issues) can also cause some behaviors to be expressed more strongly than others. Thus, in addition to classifying behavior based on turn-level speech and text, I also investigate the effect of conditioning this task on different attributes of the speaker and interaction. I do so by partitioning the data based on the attribute, training and testing models for each partition separately and comparing their performance to that of the models trained on the original, un-partitioned data. In this work, I focus on three partitions:
(a) Speaker Gender: Male / Female
(b) Speaker Role: Patient / Caregiver
(c) Interaction Content: Neutral / Stress
In the "Role" partition, Patient refers to the speaker recruited from the clinic and Caregiver refers to their spouse. In the "Content" partition, Neutral refers to the part of the interaction where speakers discussed their daily routines while Stress refers to the part where they discussed a cancer concern. Table 2.9 shows the number of samples per behavior for each partition after pre-processing.

Table 2.9: Number of samples per behavior for different partitions
Partition             Constructive   Hostile   Positive
None                  13450          176       1369
Gender    Male        6673           72        715
          Female      6777           104       654
Role      Patient     6670           76        728
          Caregiver   6780           100       641
Content   Neutral     7584           54        467
          Stress      5866           122       902

4. Results and Discussion

Table 2.10: Mean (Std. Deviation) UAR % of test-fold behavior classification
Partition   Acoustic        Lexical
None        45.36 (12.85)   57.42 (14.45)
Gender      44.00 (12.79)   56.85 (14.57)
Role        42.38 (11.83)   55.38 (13.84)
Content     45.40 (14.49)   56.40 (15.44)

Table 2.10 shows the average UAR%, computed across all test folds, for each modality and partitioning scheme. The best performing classifier is the lexical system trained on the entire data without partitioning, which achieves 57.42%. For comparison, I computed the average UAR% due to chance, by randomly assigning labels according to the class priors, which was found to be 43%. These results demonstrate the potential feasibility of using a system to automatically identify these behaviors based on conversational speech cues. Comparing modalities, I see that the lexical system performs better than the acoustic system across all partitioning schemes. This is consistent with findings from past relevant work [20, 151] and is likely due to the fact that the lexical features are computed on human transcriptions while the acoustic features are computed on the noisy raw audio signal.
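For reference, the UAR and the prior-based chance level reported above can be computed as in the sketch below; scikit-learn's macro-averaged recall is equivalent to UAR, and the number of random trials is an illustrative choice.

```python
import numpy as np
from sklearn.metrics import recall_score

def uar(y_true, y_pred):
    """Unweighted Average Recall: the mean of the per-class recalls."""
    return recall_score(y_true, y_pred, average="macro")

def chance_uar(y_true, n_trials=1000, seed=0):
    """Chance-level UAR from labels drawn according to the class priors."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    classes, counts = np.unique(y_true, return_counts=True)
    priors = counts / counts.sum()
    draws = [uar(y_true, rng.choice(classes, size=len(y_true), p=priors))
             for _ in range(n_trials)]
    return float(np.mean(draws))
```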
I also tested feature-level and decision-level fusion of the acoustic and lexical modalities but they did not perform better than the lexical modality. An inspection of the best hyperpa- rameters from every test fold revealed a trend towards larger models, Adam optimizer and 1e-3 learning rate for the lexical system and larger batch sizes and ratio-based class weights for the acoustic system. Comparing partitions, I see that conditioning on the content of the interaction works best for the acoustic modality whereas using the data without any partitions works best for lexical. It is also observed that accounting for the speaker's gender did not improve the result for both lexical and acoustic systems, implying that there might not exist a signicant 41 1 3 5 7 9 11 13 Window size 40 50 60 70 80 90 Average UAR (%) Lexical system Acoustic system Random assignment Figure 2.7: Test recall versus extended evaluation window size gender dierence in how spouses express these behaviors. This matches the domain-experts' experience and expectation. While inspecting the model predictions, I observed that utterances neighboring the occur- rences of \Hostile" and \Positive" were also labeled as such in many cases. A possible explanation for this phenomenon is the presence of annotation delay attributed to the an- notators' reaction lag [95], wherein annotations are positioned multiple seconds after the actual occurrence of the behavior. I refer to this as a delayed, or \backward" mislabeling of the ground-truth. Conversely, labeling an occurrence as hostile or positive might lead an annotator to disregard residual but similar behavior in subsequent occurrences, resulting in a \forward" mislabeling of the ground-truth. Further to the mislabeling possibility, the ability to coarsely identify a region of hostility is more important than identifying the specic utterance labeled as such. Hostility usually builds up and human experts before any manual inspection would still require context to better understand the issue. I thus tested identifying for the presence of \Hostile" and \Positive" within a \tolerance" window, where a classication was judged to be correct if it was present inside a symmetric window centered on the target utterance. Test results using this evaluation are shown in 42 Figure 2.7, where window size K +1 represents tolerance of K 2 neighboring utterances on each side. I observed that the average UAR% for both acoustic and lexical systems increases at a higher rate than that for chance when using a window size of 3. Moreover, 54.35% of the false negatives in utterances whose ground truth label was either \Hostile" or \Positive" are corrected when using a window size of 5. These ndings oer potential empirical evidence of the need to often consider multiple turns to identify such behaviors and that there may be some annotation oset inherent to the way humans integrate information. Furthermore, they encourage the use of a context-based approach, where the neighboring turns of an utterance are also considered when classifying its behavior. 2.6 Discussion and Conclusions Automating the psychological evaluations can have a signicant impact by providing less expensive and more accurate methods of behavioral coding. In this work, I proposed a Dynamic Behavior Model based scheme which models a spouse in couples' therapy as transitioning through multiple behavioral states to obtain an overall perception of negativity. 
I tested two models for dynamically modeling the behavior: Activation-based DBM and Likelihood-based DBM which outperformed the Static Behavioral Model. Furthermore, the LDBM performed slightly better than ADBM as it provided more freedom to the way behavior could be expressed. I also performed demonstrated similar improvements of the DBMs over the SBM in the task of predicting therapist empathy in psychotherapy-based counseling. the proposed work models a therapist as transitioning through multiple behavioral states, in order to obtain an overall percep- tion of empathy. The LDBM performed better than ADBM as it takes context into consideration while being a more realistic model of how human behavior evolves: an interlocutor is likely to change their behavior, but their past behavior is likely to play a factor on what their next behavior 43 will be. I also tested the robustness of the 3 models to errors in diarization as well as transcription and found that the LDBM outperforms the other two models. In psychological evaluations of therapy sessions, ratings for behaviors are very often annotated at the global session-level. This coarse resolution drastically increases the diculty of learning frame-level or utterance-level behaviors. I have developed an RNN system for estimating behav- ior in variable-length context windows at the frame level. This enables us to observe continuous metrics of behavior in a sliding window and allows for fusion of behavior from dierent modalities. The RNN was trained in a data limited environment and only global ratings. I showed that by pretraining word representations on out-of-domain large vocabulary corpora and performing joint optimization I can solve the issue of data sparsity in the data and achieve increased robustness to out-of-vocabulary words. Finally I applied top level fusion on the frame-level behavior metrics to evaluate the behavior trajectories and estimate the true session rating. The estimated be- havior rating from the system achieves high agreement with trained human annotators and even outperforms outlier human annotations. In the work I showed that a RNN system can be trained in a data limited environment to obtain meaningful behavior trajectories in a couples interaction session. This is the rst step in allowing for detailed online analysis by psychologists of the interplay of behaviors in couples interactions at a ner resolution. In future work I plan to apply transfer learning between dierent behavior codes to obtain a better model of complex behaviors. I also plan to build a more complete model through the fusion of behavior metrics from dierent modalities. For future experiments I plan to also include the noisy unused portions of the data. Current observational studies in psychology often involve the time-consuming and expensive process of annotating specic behaviors in lengthy sessions. In the future I hope to deploy the system for a more automated method of evaluating behavior in human interactions. Finally, I described a turn-level feedforward neural network based model for classifying the behaviors of real-life cancer-aicted couples based on the language and vocal characteristics they 44 use when conversing with each other. the results showed that the lexical system performed better than the acoustic system and that there were no major dierences between dierent genders, patient/caregiver roles and interaction content with respect to how spouses expressed behaviors. 
A windowed-evaluation analysis of the results indicated that aective behavior might manifest in a region around target instances and, thus, context-based classication approaches might be more eective at identifying them. In future work, I will employ automatic transcription for the lexical modality and investigate advanced methods for fusing acoustic and lexical systems. In these works, I only considered one speaker's information when predicting their behavior and demonstrated the eectiveness of the models in dierent domains. However, behavior evolution in dyadic interactions is often a co-dependent process shared among the two speakers, which means that the partner's information is valuable too. Therefore, in the next section, I extend this model to incorporate dependencies amongst spouses to model their behavioral in uences. 45 Chapter 3 Determining Observation Window Lengths for Modeling Behaviors 3.1 Introduction Human interactions involve social-cognitive abilities of varying levels of complexity such as speech detection, language understanding, emotion recognition and appropriate response generation. Among these, the ability to reliably and accurately assess a person's behavior 1 by observing their verbal and non-verbal cues is a considerably complex and important one. Such a skill is especially important for both delivery and assessment in psychological research domains such as Couples Therapy [36], Addiction Counseling [7] and Cancer Care [127]. In these encounters, hu- man experts perform formal behavioral coding by observing interactions between the provider and the client and quantitatively annotating their behavior along dierent dimensions, which is then used to provide feedback and improve the clinical eectiveness of care. Subsequently, there have been eorts [108] to automate this behavior annotation (or coding) process using machine learning so that rapid and inexpensive feedback can be provided to the stakeholders. Previous work has shown that automated coding systems are eective at quantifying 1 we use the term `behavior' to refer to not just physical actions such as facial expressions, body gestures and speech but the underlying state of mind that is expressed through these actions, and how those are perceived by domain experts 46 behaviors as varied as Negativity [21, 31, 51, 148], Depression [60, 103] and Empathy [53, 121, 157] from multimodal interaction cues. However, there are some critical aspects of this behavior as- sessment process which humans can handle naturally and easily but machines still cannot. One such aspect is the notion of how much information needs to be observed in order to reliably assess behavior. It is intuitive to expect that evaluating dierent behaviors would require dierent ob- servation window lengths. In fact, such associations have been exhibited by humans when judging characteristics such as personality traits [22], non-verbal behaviors [106] and group dynamics [133]. It is not clear as to how these associations manifest in automated systems that quantify behaviors based on interaction cues. Unlike emotions, which are simple and rapid [14] and can be reliably estimated from short observations such as a few seconds [136], a sentence [163] or a speaker turn [27]), the rich variety of human behaviors can be much more complicated and long-ranging. 
Even expert coders in the eld of psychological research typically rst have to be trained according to domain-specic guidelines or manuals before they can start coding patients' behaviors, such as CIRS [63] and MITI [105]. This complexity can potentially give rise to uncertainty at the time of assessment, which then necessitates longer observations in order to achieve condent and reliable annotation. Furthermore, the annotation time-frames for coding dierent behaviors can range from as short as 30 seconds [66] to as long as 10 minutes [63], demonstrating the potential variability in observation lengths. These facets of behavior coding demonstrate the need for investigating the role of the length of observation for specic behavioral characterization. In this section, I present a systematic analysis of the observation window length for quantifying behavior and how this varies for dierent behaviors. Specically, I am interested in empirically identifying the minimum amount of language information, measured in number of words, from which a behavior can be judged. Through this analysis, I aim to address the following questions: 47 1. Do dierent behaviors need observations of dierent lengths in order to be quantied from language? 2. How is the nature or type of behavior related to the length of its required observation window? My proposed analysis framework consists of two components: (1) A pair of evaluation metrics that describe the window-level and interaction-level quality of behavior predictions by the system at a given window length, and (2) A step-by-step procedure that progressively examines how these metrics for a given behavior change as the window length is varied, based on which the appropriate window length is determined. I conduct my analysis on the Couples Therapy corpus [36] which contains real-life interactions coded for a large and diverse set of behaviors in human interaction. I then compare my ndings against existing literature on similar behavior analyses using modalities such as audio, as well as work in psychology related to constructs such as aect and personality traits. 3.2 Multi-scale nature of behavioral constructs A body of work in psychology that is related to, but not the same as, my notion of \window length" is the one which studies Thin Slices of observed behavior [4]. It refers to excerpts or snippets of an interaction that can be used to arrive at a similar judgment of behavior to as if the entire interaction had been used. Essentially, it implies that an entire interaction can be replaced with just a windowed part (which is dierent from my aim of identifying the best window through which to view the entire interaction). The eect of the location of these slices has been investigated as well; the conventional approach is to situate the slices near the start of the interaction. The eectiveness of thin slices has been observed in many applications, ranging from judging personality traits [22, 80] such as the \big ve" [96] to viewer impressions of TED talks [39] such as \funny" and \inspiring". 48 Notably, Carney et al. [28] studied the accuracy of impressions for Aect, Extraversion and Intelligence at dierent thin-slice durations, locations, etc. Accuracy was measured as the corre- lation between the true value of a construct (whether rated or self-reported) and the impression based on the thin slice, and it was observed that, in general, accuracy increased as the slice length increased from 5 seconds to 5 minutes. 
Furthermore, they found that Negative aect could be assessed with similar accuracies at all slice lengths whereas Positive aect was best assessed only when thicker slices were used. These works provide an encouraging support for analyzing the window length of behaviors along similar lines. There has also been a great deal of work in psychology on studying how humans perceive and process events characterized by concepts such as good and bad. Specically, there exists a notion that \bad" is \stronger" than \good" [13], meaning that undesirable or unpleasant events have a greater impact on an individual than desirable, pleasant ones. The behavior constructs that I am interested in analyzing are similar to concepts that have been shown to exhibit this phenomenon in previous works. For instance, Oehman et al. [114] found that people detected threatening faces more quickly and accurately than those depicting friendly expressions. In a similar experiment, Krull et al. [79] showed videos of either happy or sad individuals to participants and reported that happiness evoked more spontaneous inferences while sadness drew slower ones. This shows that dierent concepts are perceived dierently, depending on their valence; hence, in this section, I investigate how the nature of dierent behavior constructs is tied to aspects of their expression and perception in language such as window length, aggregation mechanism, etc. Some approaches in machine learning and speech processing that are similar to ours have investigated the accuracy of behavior prediction using acoustic vocal cues. Xia et al. [155] found that as the observation window used to compute acoustic features was increased from 2 seconds to 50 seconds, the classication accuracy generally improved, with Negative and Positive behaviors gaining the most. Similar results were reported by Li et al. [91] who classied behaviors such as Acceptance, Negativity and Sadness by employing emotion-based behavior models on acoustic 49 features. In their models, the receptive eld (measured in seconds) of a 1-D Convolutional Neural Network-based system served as the observation window for vocal cues. In general, they found that behaviors relating to negative aect, such as Negativity, were classied more accurately than behaviors such as Acceptance and Sadness. They also observed that increasing the receptive eld from 4 seconds to 64 seconds generally resulted in better classication, with Sadness performing best at 16 seconds while Negativity performed best at 64 seconds. Other eorts have addressed related aspects; for instance, Lee et al. [90] examined whether the behavior annotation process is driven more by a gradual, causal mechanism or by isolated salient events, which imply the use of long and short observation windows respectively. While these works have contributed to a better understanding of the eect of observation windows, they are limited in the variety of constructs that are analyzed. Furthermore, they mostly focus on acoustic and vocal cues and not enough on the lexical characteristics. Hence, the novelty of my work lies in (1) analyzing the eect of observation lengths in the lexical modality and (2) performing this analysis using a large and diverse set of real-life human behavior constructs. Through my analysis, I aim to understand the relation between the nature of the behavior and how long an automated system needs to observe its expression in language in order to accurately estimate it. 
3.3 Framework of Analysis

3.4 Problem Statement

Figure 3.1 depicts a typical system setup used in previous works [53, 148, 157] that employed a moving-window approach for estimating behaviors. Since my proposed analysis assumes such a setup, I will use its components to formally describe the problem of determining the appropriate observation window length for estimating behaviors.
Figure 3.1: Automated quantification of behavior from lexical cues using a moving-window approach: During an interaction, interlocutors express behaviors such as B_i through conversational cues such as language cues. The text of the conversation is decomposed into windowed chunks, each chunk L words long. Then, model M is used to score the text inside each window, resulting in a trajectory of window-level scores. Finally, a functional F is applied on the trajectory to obtain a summary of the behavior during the interaction.
Let's suppose that I want to estimate behavior B_i using a set of observed data samples D, where each sample is a sequence of words. Let $A_i \in \mathbb{R}^{|D|}$ be the ground-truth annotations of B_i in D and let C be a metric that is used to evaluate the estimation results against A_i; the higher the value of C, the more accurate my estimates are.
Let E_lang(B_i) represent the degree to which the behavior B_i is actually expressed in language. This is not known beforehand, so I assume it to be high and that it is possible to estimate B_i from language. Let M denote a machine learning model (e.g., a Deep Neural Network) that estimates a scalar value from a sequence of words and let θ represent its learnable parameters (e.g., weights). Finally, let L represent the window length at which B_i is observed in language and F denote a statistical functional that maps a sequence of scalar values to a single scalar value. Then, the quality of behavior quantification can be expressed as:
$Q_i = C\big(F\big(M(E_{\text{lang}}(B_i), L, D; \theta)\big),\, A_i\big)$   (3.1)
My goal is to identify the window length $L_i^*$ that maximizes Q_i for B_i:
$L_i^* = \arg\max_{L} Q_i$   (3.2)
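As a concrete reading of Eqns. 3.1 and 3.2, the sketch below wires the pieces together for one behavior: a scoring model M, a window length L, an aggregating functional F and an evaluation metric C, with the best length selected by exhaustive search over the tested lengths. This is a minimal illustration under stated assumptions, not the exact implementation; in particular, `score_window` is a hypothetical stand-in for the trained model M.

```python
# Minimal sketch of Eqns. 3.1-3.2: evaluate Q_i at each candidate window length and keep
# the argmax. `score_window` stands in for M(.; theta) and is assumed to map a list of
# words (one window) to a scalar behavior score.
import numpy as np
from scipy.stats import spearmanr

def quality_Q(sessions, annotations, L, score_window, functional=np.median):
    """Q_i = C(F(M(E_lang(B_i), L, D; theta)), A_i), with C chosen as Spearman correlation."""
    per_session = []
    for words in sessions:                                    # each session: a list of words
        windows = [words[t:t + L] for t in range(max(1, len(words) - L + 1))]
        per_session.append(functional([score_window(w) for w in windows]))
    rho, _ = spearmanr(per_session, annotations)
    return rho

def best_window_length(sessions, annotations, lengths, score_window):
    """L_i* = argmax_L Q_i over the tested window lengths (Eqn. 3.2)."""
    q = {L: quality_Q(sessions, annotations, L, score_window) for L in lengths}
    return max(q, key=q.get), q
```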
I argue that Q_i can be high only when E_lang(B_i), L, M, θ and F are all appropriate together. If even one of them is flawed or incompatible with the rest, then it would adversely affect Q_i, as explained below:
E_lang(B_i): The behavior B_i must be sufficiently expressed in the lexical channel to begin with; otherwise, it might not be possible to observe it using lexical cues alone (for example, if it is instead primarily expressed through nonverbal vocal cues such as laughter). In general, I do not have information about E_lang(B_i) beforehand; instead, I simply assume that it is high enough so that it is possible to estimate B_i from language.
L: The observation window must be long enough to observe B_i; otherwise, the incomplete information from partial observations can lead to noisy or incorrect estimates.
M: The model must be well-suited for capturing B_i. For example, quantifying a behavior that is based on the actions of both speakers requires a model that looks at both speakers; using a single-speaker model instead would result in inaccurate estimates due to insufficient information.
θ: The model must be well-trained; otherwise, its estimates might be inaccurate. This is dependent on the training process, the amount and quality of data used to estimate parameters, etc.
F: The aggregating functional must be well-suited for summarizing B_i; otherwise, the resulting aggregate estimate might not match the ground-truth annotations A_i. For example, the functional mode, which identifies the most frequently occurring value, might not be appropriate for summarizing a behavior that is expressed very infrequently.
I reason that a high value of Q is indicative of all the aforementioned factors being appropriate and a low value is indicative of a limitation in at least one of these factors. Based on this, I now proceed to analyze the variation in Q for different behaviors as window length L is varied.

3.5 Proposed Analysis Methodology

In this section, I describe in detail my proposed framework of behavior window length analysis. I first describe two evaluation metrics that quantify how well the system is able to accurately estimate a behavior. Following this, I present a step-by-step procedure that progressively employs both metrics to determine the appropriate window lengths for each behavior.

3.5.1 Metrics

3.5.1.1 Behavior Construct Similarity

Figure 3.2: Computation of Behavior Construct Similarity (BCS) for behavior B_i at two observation window lengths L_1 and L_2: The session-level aggregates are more highly correlated with human expert ratings at L_2 (0.8) than at L_1 (0.5). Therefore, L_2 is considered to be more appropriate than L_1 for estimating B_i.
My first proposed metric, the Behavior Construct Similarity (BCS), measures how well the system matches humans in terms of understanding the overall, aggregate behavior content of the entire interaction. The higher the value, the more similar the system is to humans and, hence, the more reliable the behavior estimates are. For behavior B_i and window length L_j, it is computed as:
$BCS_i(j) = R\big(F\big(M(E_{\text{lang}}(B_i), L_j, D; \theta)\big),\, A_i\big)$   (3.3)
where R refers to the Spearman Correlation between human annotations A_i and the system estimates, which are expected to be ordinal variables, in general. For the functional F, I test 3 statistics: median, minimum and maximum; the former provides a useful, "average" summary of behavior, as shown in previous works [150, 151], while the latter two represent outlier events such as unusually large or small scores. As can be seen from Eqn. 3.3, BCS is a direct implementation of Eqn. 3.1, with Spearman's Correlation R as the choice of the evaluation metric C. It is computed at the session level using Algorithm 3 in A.1 and is, thus, a session-level measure of the system performance. It takes values in the range [-1, 1], where -1 represents no similarity while 1 represents full similarity. Figure 3.2 shows an example of how BCS can be used to determine the appropriate window length for behavior B_i.
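To make Eqn. 3.3 concrete, the following is a minimal sketch of how BCS could be computed for one behavior at one window length, given the window-level score trajectories already produced by a trained model; it is not the exact Algorithm 3 from the appendix, and the function and variable names are hypothetical.

```python
# Minimal sketch of the BCS computation in Eqn. 3.3: Spearman correlation between
# session-level aggregates of the window-level scores and the human ratings, for each
# candidate functional F.
import numpy as np
from scipy.stats import spearmanr

def bcs(window_score_trajectories, human_ratings, functional=np.median):
    """window_score_trajectories: list of 1-D arrays, one trajectory per session,
    produced by model M at a fixed window length L_j.
    human_ratings: session-level annotations A_i (one rating per session)."""
    aggregates = [functional(traj) for traj in window_score_trajectories]   # F per session
    rho, _ = spearmanr(aggregates, human_ratings)                           # R = Spearman
    return rho

# Trying the three functionals described in the text:
# functionals = {"median": np.median, "minimum": np.min, "maximum": np.max}
# bcs_values = {name: bcs(trajectories, ratings, f) for name, f in functionals.items()}
```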
While BCS provides the best validation possible for system estimates by directly comparing them against human judgments, it nevertheless suffers from two limitations:
1. A poor choice of the functional F in Eqn. 3.3 can result in inaccurate aggregates and, hence, low BCS, even if a sufficiently long window is used. In such a scenario, relying on BCS alone would lead to an incorrect conclusion that the window length is not appropriate.
2. BCS cannot be quickly computed at any arbitrary window length, since I first require human annotations at that window length to compare against. This requirement becomes practically infeasible when analyzing a large set of window lengths.
In order to account for these, I propose an additional metric that can be quickly computed at any arbitrary window length and which does not rely on any aggregation, as described below.

3.5.1.2 Behavior Relationship Consistency

My second proposed evaluation metric, the Behavior Relationship Consistency (BRC), measures how well the system matches humans in terms of perceptual relations between different behaviors. These refer to notions of similarity and dissimilarity that arise between different constructs due to the way they are defined; for example, Happiness is considered similar to Joy and Satisfaction but opposite to Sadness. Thornton et al. [147] studied emotional transitions and found that similar emotions (e.g., Anger and Disgust) frequently tend to co-occur whereas dissimilar ones (e.g., Anger and Joy) do not. I expect behaviors to exhibit such phenomena as well and reason that a system that can accurately model related behaviors would produce estimates that also consistently reflect these relations. The BRC metric measures this consistency; the higher its value, the greater the consistency and, thus, the more reliable the behavior estimates are.
Figure 3.3: Computation of Behavior Relationship Consistency (BRC) between behaviors B_i and B_j at two observation window lengths L_1 and L_2: The correlation between B_i's and B_j's estimates is more similar to the correlation between B_i's and B_j's human ratings (-0.7) at L_2 (-0.8) than at L_1 (0.3). Therefore, L_2 is considered to be more appropriate than L_1 for estimating B_i and B_j.
BRC is defined for a pair of behaviors {B_i, B_j} and measures how close the Spearman Correlation between their window-level estimates at window length L_k is to the Spearman Correlation between their ground-truth annotations. It is calculated as:
$BRC_{i,j}(k) = 1 - \frac{\left|Q^*_{i,j} - Q'_{i,j}(k)\right|}{2}$   (3.4)
where $Q^*_{i,j} = R(A_i, A_j)$ and $Q'_{i,j}(k) = R\big(M(E_{\text{lang}}(B_i), L_k, D; \theta),\, M(E_{\text{lang}}(B_j), L_k, D; \theta)\big)$.
BRC is computed at the window level using Algorithm 5 in A.1 and is, thus, a window-level measure of the system performance. It takes values in the range [0, 1], where 0 represents no consistency while 1 represents full consistency. Figure 3.3 shows an example of how BRC can be used to determine the appropriate window length for behaviors B_i and B_j. It can be seen that Q'_{i,j} in Eqn. 3.4 is similar to Q_i from Eqn. 3.1; while Q_i evaluates against human annotations, Q'_{i,j} evaluates against estimates of other behaviors. Thus, the effectiveness of B_i's BRC is directly dependent on B_j's estimates; the more accurate they are, the more I can rely on B_i's BRC with B_j.
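The following is a minimal sketch of the pairwise BRC computation in Eqn. 3.4, assuming the window-level estimates of the two behaviors at length L_k are already available and aligned; the function and argument names are hypothetical.

```python
# Minimal sketch of Eqn. 3.4: consistency between the model's inter-behavior correlation
# and the inter-behavior correlation of the human ratings, mapped to [0, 1].
from scipy.stats import spearmanr

def brc_pair(est_i, est_j, ratings_i, ratings_j):
    """est_i, est_j: aligned window-level estimates of behaviors B_i and B_j at length L_k.
    ratings_i, ratings_j: session-level human annotations A_i and A_j."""
    q_star, _ = spearmanr(ratings_i, ratings_j)   # Q*_{i,j}: correlation of human ratings
    q_prime, _ = spearmanr(est_i, est_j)          # Q'_{i,j}(k): correlation of estimates
    return 1.0 - abs(q_star - q_prime) / 2.0      # both lie in [-1, 1], so BRC is in [0, 1]

# Example from Figure 3.3: Q* = -0.7 and Q' = -0.8 give BRC = 1 - 0.1/2 = 0.95.
```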
Using this principle, given multiple behavior pairs {B_i, B_j : j ∈ J}, B_i's weighted BRC is calculated as a weighted sum of its individual BRCs with each B_j, proportional to their BCS:
$BRC_i(k) = \sum_{j \in J} w_j(k)\, BRC_{i,j}(k)$   (3.5)
where
$w_j(k) = \frac{BCS_j(k)}{\sum_{m \in J} BCS_m(k)}$   (3.6)
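Below is a minimal sketch of the weighted BRC of Eqns. 3.5-3.6, assuming the pairwise BRC values of B_i against a set of reliably estimated behaviors, and those behaviors' BCS values at the same window length, have already been computed; the names are placeholders.

```python
# Minimal sketch of Eqns. 3.5-3.6: each pairwise BRC is weighted by how reliably the
# paired behavior itself is estimated (its BCS), normalized over the reference set J.
def weighted_brc(pair_brc, ref_bcs):
    """pair_brc: dict {behavior j: BRC_{i,j}(k)} for B_i against each reference behavior.
    ref_bcs: dict {behavior j: BCS_j(k)} for the same reference behaviors at length L_k."""
    total = sum(ref_bcs[j] for j in pair_brc)
    if total == 0:
        return 0.0                                   # degenerate case: no reliable references
    return sum((ref_bcs[j] / total) * pair_brc[j] for j in pair_brc)
```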
3.5.2 Procedure

With the analysis metrics defined, I now proceed to employ them in the following multi-stage fashion, depicted in the flowchart in Figure 3.4.
Figure 3.4: Flowchart of the analysis procedure for determining the appropriate window length of target behavior B_i: In each stage, I check if the system's estimates of B_i satisfy a condition, summarized under "CONDITION" and denoted in blue boxes under "OPERATION". If satisfied, I determine the appropriate window length as denoted in green boxes under "OPERATION" and summarized under "RESULT". If the condition is not satisfied, I simply proceed to the next stage. This procedure continues for 4 stages, beyond which I cannot make determinations about B_i's window length.
Stage 1: First, I focus on those behaviors which the system can estimate with high accuracy. As argued in Sec. 3.4, if a behavior B_i can be accurately estimated at window length L, then this implies that L is appropriate for B_i. The appropriate window length L_i^* I am interested in is the minimum number of words required to capture B_i. Therefore, I identify all window lengths L at which its BCS, defined in Eqn. 3.3, is high and simply select the shortest one.
Identify $B_i : BCS_i(k) > Y_1 \;\; \forall k \in \mathbf{L}$
Window length $L_i^* = \min\{\mathbf{L}\}$
Y_1 is a pre-defined threshold; the closer this threshold is to 1, the more confident I can be that the estimate of a behavior is indeed similar. I refer to the set of behaviors identified in Stage 1 as "reliable" behaviors, denoted using B_rel.
Stage 2: Next, I examine those behaviors whose estimates are not highly similar to human judgment (i.e., BCS lower than Y_1) but which nevertheless are more similar at some window lengths than others. While low similarity can be attributed to factors other than window length, since I do not vary them, any change must be solely due to the window length. Therefore, for such behaviors, I check if their BCS shows a significant fluctuation over the entire range of window lengths, and simply pick the length at which it was highest.
Identify $B_i : \max_k\{BCS_i(k)\} > \min_k\{BCS_i(k)\}$ (significant, p < 0.05)
Window length $L_i^* = \arg\max_k BCS_i(k)$
I check the change in BCS for statistical significance by calculating the 95% confidence interval for differences in dependent overlapping correlations using Zou's method [164], as recommended by Diedenhofen et al. [43]. The change in BCS is considered to be statistically significant if the interval does not contain 0; else, it is not significant.
Stage 3: In Stage 3, I analyze those behaviors whose estimates were not found to be similar to human judgments, as evinced from their low values of BCS. As explained in Sec. 3.5.1.1, however, a low value of BCS for a behavior B_i does not automatically imply that its window-level estimates themselves are inaccurate; rather, it might be due to an inappropriate aggregating functional used in computing BCS. Therefore, I inspect the window-level estimates directly by examining how consistent they are, given by the weighted BRC defined in Eqn. 3.5. I compute the weighted BRC of B_i with respect to the B_rel behaviors that were identified in Stage 1 as being accurately estimated. I then check if it is higher than a pre-defined threshold Y_2 and pick the shortest window length at which this is true; as before, the closer Y_2 is to 1, the more confident I am that the estimate of a behavior is consistent.
Identify $B_i : \exists\, k \in \mathbf{L} : BRC_i(k) > Y_2$
Window length $L_i^* = \min\{k \in \mathbf{L} : BRC_i(k) > Y_2\}$
Stage 4: Lastly, I inspect behaviors whose estimates are not highly consistent (i.e., BRC lower than Y_2) but which nevertheless show significantly more consistency at some window lengths than others. Therefore, similar to Stage 2, I check if their BRC varies significantly over the entire range of window lengths, and simply pick the length at which it was highest.
Identify $B_i : \max_k\{BRC_i(k)\} > \min_k\{BRC_i(k)\}$ (significant, p < 0.05)
Window length $L_i^* = \arg\max_k BRC_i(k)$
I check the change in BRC for statistical significance by calculating the 95% confidence interval for differences in independent correlations using Zou's method [164], as recommended by Diedenhofen et al. [43].
End: For behaviors that show neither similarity nor consistency in their estimates with human judgments, I do not analyze them any further and simply conclude that I am unable to make any determinations at this point about their appropriate window lengths.

3.6 Windowed Scoring of Text

Let's suppose that the text transcript T_p, from the p-th session, contains O_p words. Using an observation window of length L_k, I first decompose it into its constituent windows. If O_p > L_k, then I get O_p - L_k + 1 windows, each one containing L_k words; else, I get just one window. Then, using model M, I estimate the behavior within each window independently. Assuming that T_p = w_1, w_2, ..., w_{O_p}, there are O_p possible lengths at which it can be windowed, as shown in the first column of Table 3.1.

Observation Window | Window Decomposition | No. of Windows | Resolution
session-length | {w_1, w_2, ..., w_{O_p}} | 1 | Very Coarse
(O_p - 1)-word | {w_1, w_2, ..., w_{O_p-1}}, {w_2, w_3, ..., w_{O_p}} | 2 | Coarse
... | ... | ... | ...
2-word | {w_1, w_2}, {w_2, w_3}, ..., {w_{O_p-1}, w_{O_p}} | O_p - 1 | Fine
1-word | {w_1}, {w_2}, ..., {w_{O_p}} | O_p | Very Fine
Table 3.1: All possible window lengths at which an utterance with O_p words can be scored
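A minimal sketch of the window decomposition just described follows; it returns the overlapping L_k-word windows of a transcript, or the whole transcript as a single window when it is shorter than L_k. The function name is illustrative only.

```python
# Minimal sketch of the moving-window decomposition in Sec. 3.6: a transcript with O_p
# words and a window of L_k words yields O_p - L_k + 1 overlapping windows, or a single
# window when the transcript is shorter than the window.
def decompose(transcript_words, window_len):
    n = len(transcript_words)
    if n <= window_len:
        return [transcript_words]
    return [transcript_words[t:t + window_len] for t in range(n - window_len + 1)]

# For example, a 7-word utterance scored with a 3-word window gives 7 - 3 + 1 = 5 windows:
# decompose("I don't like the way you do".split(), 3)
```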
The window with the coarsest resolution is the "session-length" window, since it views the entire session as a whole, resulting in a single behavior estimate, or "score", for T_p. On the other hand, the "1-word" window provides the finest resolution possible, since a score is generated for each word in the session, resulting in a trajectory of scores for T_p. In this work, I test the following observation window lengths: {3, 10, 30, 50, 100}, where N represents a window that is N words long. 3 and 10 can be qualitatively thought of as "short" windows, 30 and 50 as "medium"-length windows and 100 as a "long" window.
The number of scores resulting from each session depends on the window length used; the longer the window, the fewer the number of scores. For instance, since there are 1325 sessions in my dataset, estimating behavior with a "session-length" window would result in one score per session and, thus, 1325 scores in total. On the other hand, estimating with a 1-word window results in 1067727 scores in total, over all the sessions. In my work, the window length varies from 3 to 100 and, hence, the total number of scores varies from 1065077 to 936738 respectively.

3.7 Behavior Scoring Models

From Eqn. 3.2, I see that the analysis of window lengths depends on the model M; therefore, it is possible that the results of my analysis might be different for different choices of M. In such a scenario, running an analysis using just one modeling framework might be limiting, since the results might be reflective of the model and not necessarily the behaviors. This can be addressed by testing different models and observing traits that are specific to each model as well as traits that are consistent across different models, thereby providing a more comprehensive understanding about how behaviors are expressed through language and how they are affected by the observation window length. Hence, in this work, I conduct my analysis using two different modeling frameworks: (1) Maximum Likelihood N-gram models and (2) Deep Neural Networks, the details of which are provided in this section.

3.7.1 N-gram Maximum Likelihood Model

3.7.1.1 Modeling approach

I use a Maximum Likelihood method closely following [50, 131], which employs N-gram Language Models (LMs) in a cumulative fashion as shown in Figure 3.5. An N-gram LM takes as input a text sequence - for example, an utterance W consisting of O words, W = {w_1, w_2, ..., w_O} - and outputs a likelihood probability given by:
$P(W) = P(w_1, w_2, \ldots, w_O) \approx \prod_{n=1}^{O} P(w_n \mid w_{n-1}, w_{n-2}, \ldots, w_{n-N+1})$   (3.7)
I use N-gram LMs since they have been shown to be accurate at behavior estimation in previous works [33, 51] and are easy and simple to train.
Figure 3.5: N-gram model used to estimate behavior of a sample utterance: Pre-built sets of K-2 binary LMs provide likelihoods for the utterance on a behavior scale from 1 to K (in my data, K=9). Posterior probabilities are then calculated and the expected value of the resulting distribution is used as the estimated behavior score of the utterance.
3.7.1.2 Training

My modeling method, shown in Figure 3.5, assumes that behavior is rated on a scale from 1 to K, where 1 indicates the lowest degree ("absence of behavior") and K indicates the highest degree ("strong presence of behavior"). I train K-2 pairs of LMs, where the r-th pair performs a binary classification of behavior belonging to Class 0, i.e. the range [1, r+1], or Class 1, i.e. the range (r+1, K]. Let's denote the behavior score of utterance W as x (which I want to estimate); then, the r-th LM pair's N-gram likelihood probabilities can be expressed as:
$P^r_0(W) \equiv P(W \mid 1 \le x \le r+1)$   (3.8)
$P^r_1(W) \equiv P(W \mid r+1 < x \le K)$   (3.9)
I implement Maximum Likelihood models through 3-gram LMs trained with Good-Turing discounting using the SRILM toolkit [143]. A leave-one-couple-out scoring scheme is used, where models are trained on data from Z-1 couples and subsequently used to score data from the Z-th couple, in order to prevent overfitting. Ideally, these models would be trained at the same window length at which they would be tested, thereby resulting in five sets of models, one for each of the window lengths {3, 10, 30, 50, 100} words. However, it is not practically feasible to train N-gram models on sequences longer than 5 words, due to the curse of dimensionality [16], where the amount of training data required increases exponentially with the order N. Therefore, I instead train a single set of 3-gram models, i.e. at window length 3 words, and use them for testing at all the window lengths. I show in the results that this train-test mismatch does not particularly bias my analysis towards always selecting 3 words as the appropriate window length.

3.7.1.3 Testing

Given an input utterance, the r-th LM pair provides likelihood probabilities of its behavior lying in the ranges [1, r+1] and (r+1, K]. I compute binary posteriors from these binary likelihoods using Bayes Rule and assuming uniform priors, as shown in Eqn. 3.10. Then, using Eqn. 3.11, the binary posteriors are converted into a probability mass function where the r-th point represents the probability of behavior lying in the range [r, r+1]. Finally, the behavior score x for utterance W is obtained by computing the expected value of this probability mass function, as in Eqn. 3.12.
$P(x \le r+1 \mid W) \equiv P(1 \le x \le r+1 \mid W) = \frac{P^r_0(W)\, P(1 \le x \le r+1)}{P^r_0(W)\, P(1 \le x \le r+1) + P^r_1(W)\, P(r+1 < x \le K)}$   (3.10)
$P(r < x \le r+1 \mid W) = P(x \le r+1 \mid W) - P(x \le r \mid W)$   (3.11)
$x = \sum_{r=1}^{K-1} \left(r + \tfrac{1}{2}\right) P(r < x \le r+1 \mid W)$   (3.12)
For my analysis, I test at all window lengths but only train 3-gram LMs, due to the difficulty of training higher-order LMs caused by the curse of dimensionality [16]. I show in the results that this mismatch does not particularly bias my results.
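To illustrate Eqns. 3.10-3.12, here is a minimal sketch of turning the K-2 pairs of per-class likelihoods into a behavior score. The likelihood values are assumed to come from pre-trained binary LMs queried externally (e.g., SRILM 3-gram models), and the prior handling shown is one plausible reading of the uniform prior (uniform over the 1-to-K scale, so class priors are proportional to range widths); names are illustrative.

```python
# Minimal sketch of the N-gram scoring in Eqns. 3.10-3.12 for a K-point behavior scale.
# lik0[r], lik1[r] hold the r-th LM pair's likelihoods P_0^r(W) and P_1^r(W) of the
# window text W, for r = 1, ..., K-2.
def ngram_behavior_score(lik0, lik1, K=9):
    # Binary posteriors P(x <= r+1 | W) via Bayes rule (Eqn. 3.10), plus the two
    # trivial endpoints P(x <= 1 | W) = 0 and P(x <= K | W) = 1.
    cdf = {1: 0.0, K: 1.0}
    for r in range(1, K - 1):
        prior0 = r / (K - 1)                 # mass of [1, r+1] under a uniform prior
        prior1 = 1.0 - prior0                # mass of (r+1, K]
        num = lik0[r] * prior0
        cdf[r + 1] = num / (num + lik1[r] * prior1)
    # PMF over the bins (r, r+1] (Eqn. 3.11) and its expected value (Eqn. 3.12).
    score = 0.0
    for r in range(1, K):
        p_bin = cdf[r + 1] - cdf[r]
        score += (r + 0.5) * p_bin
    return score
```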
3.7.2 Neural Estimation Model

Figure 3.6: Neural model that uses an O-word-long window to estimate the behavior score of a sample utterance: Given an utterance, the ELMo word embedding sequence is mapped to a fixed-length hidden representation and passed through a fully connected layer and ReLU6 activation to obtain the estimate of the behavior score.
From Eqn. 3.1, I see that my analysis is fundamentally tied to the behavior model M, in which case different choices of M might bias my results differently. In order to examine this effect, I repeat my analysis with a Recurrent Neural Network model as shown in Figure 3.6. Since this is only a comparative analysis, I limit its scope by training and testing at two window lengths: short (3 words) and medium-length (30 words). Efforts to analyze at a long window (100 words) were unsuccessful due to instability during training for behaviors with heavily skewed rating distributions, as explained in 3.7.2.2.

3.7.2.1 Modeling approach

I use a model similar to the one in [150], which was shown to accurately estimate Negative behavior of speakers in dyadic interactions. It consists of a Gated Recurrent Unit (GRU) [35] followed by a fully connected layer and a ReLU6 [78] activation. ReLU6 is a modified version of the standard ReLU activation that takes an input x and provides an output y that is bounded between 0 and 6:
$y = \min(\max(x, 0), 6)$   (3.13)
My model, shown in Figure 3.6, is similar to the one used by Tseng et al. [150] for classifying Negative behavior from language, but with a few changes: the Long Short-Term Memory [68] unit is replaced with a Gated Recurrent Unit (GRU) [35] and the word2vec [99] embeddings are replaced by ELMo [124] embeddings. Finally, while [150] post-processed the system outputs using Support Vector Regression, I do not use such transformations, since I am interested in analyzing the properties of the system outputs themselves. Instead, I simply use a ReLU6 layer in order to ensure that my predictions are bounded, similar to the ground truth annotations A.
In order to construct the input embeddings, I use ELMo [124] embeddings, which capture semantic and syntactic relations in a deep, contextual manner. For every word, ELMo provides embeddings from 3 layers, each of dimension 1024, and a set of 3 mixing weights as well as a scaling weight that can be trained in a task-specific manner. As recommended by Peters et al. [124], I obtain a single embedding for each word by taking the scaled, weighted sum of all 3 embeddings. These mixing weights are softmax-normalized, similar to attention [57] weights, and, when trained, represent the relative importance of each layer in estimating behavior. Generally, it has been observed in deep neural networks that lower layers tend to learn simpler representations, such as edges in images and phrase-level information in text, whereas higher layers tend to learn more complicated representations, such as objects in images and semantic features in text [70, 115]. In particular, the higher layers in ELMo have been found to model complex characteristics of language such as polysemy and semantics better than the lower layers [124]. Therefore, observing how these weights differ for each behavior group can illuminate how their linguistic characteristics are different.

3.7.2.2 Training

The size of the GRU hidden representation V is tuned to be either 10 or 100 and the sample minibatch size was set to 64. Given a word sequence and the window length, zero padding is performed at the end wherever required. Dropout [141] of 0.2 is applied before the fully connected layer and all the model parameters are trained using backpropagation by optimizing the L1 loss in conjunction with the Adam [75] optimizer. A separate model is trained for each behavior, without any shared parameters or layers, so as to ensure that the results are indicative of that behavior only.
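A minimal PyTorch sketch of this architecture follows. It assumes the three per-word ELMo layer outputs have already been computed and are provided as a tensor (the ELMo network itself is not included), and the learnable mixing weights, scaling weight, GRU, fully connected layer and ReLU6 output mirror the description above; the class and argument names are illustrative, not the exact implementation used in the thesis.

```python
# Minimal sketch of the neural scoring model (ELMo layer mixing -> GRU -> FC -> ReLU6).
# Input: precomputed ELMo layer outputs of shape (batch, seq_len, 3, 1024).
import torch
import torch.nn as nn

class BehaviorScorer(nn.Module):
    def __init__(self, elmo_dim=1024, hidden_size=100, dropout=0.2):
        super().__init__()
        self.mix_logits = nn.Parameter(torch.zeros(3))    # softmax-normalized layer weights
        self.gamma = nn.Parameter(torch.ones(1))           # task-specific scaling weight
        self.gru = nn.GRU(elmo_dim, hidden_size, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_size, 1)
        self.relu6 = nn.ReLU6()                             # bounds the output score to [0, 6]

    def forward(self, elmo_layers):
        # Scaled, weighted sum over the 3 ELMo layers -> (batch, seq_len, 1024)
        weights = torch.softmax(self.mix_logits, dim=0).view(1, 1, 3, 1)
        embeddings = self.gamma * (weights * elmo_layers).sum(dim=2)
        _, last_hidden = self.gru(embeddings)               # (1, batch, hidden_size)
        score = self.relu6(self.fc(self.dropout(last_hidden[-1])))
        return score.squeeze(-1)                            # one scalar score per window

# Training would optimize an L1 loss with Adam, one separate model per behavior:
# criterion = nn.L1Loss(); optimizer = torch.optim.Adam(model.parameters())
```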
I employ a 6-fold nested cross-validation setup where in every test fold, four folds are used to train the model while the fth fold is used to optimize the model hyper-parameters, learning rate range, etc. Similar to the setup with N-gram models, I ensure that no dyad appears in more than one fold. Instead of performing grid search for tuning the learning rate, I use the Cyclical Learning Rate schedule as proposed by Smith [138]. For each behavior and model conguration, I rst perform a \range test" to determine the minimum and maximum learning rates at which training remains stable. I then cyclically vary the learning rate between its minimum and maximum value during training, saving a model checkpoint at the end of every epoch when the learning rate would be at 66 its lowest. Finally, at testing time, inspired by Huang et al. [69], instead of using just the last or the best checkpoint, I use an ensemble average of all of them. While I was able to analyze the Neural model at two window lengths: 3-word (i.e. 3 unrolled time steps) and 30-word (30 unrolled time steps), I was unable to do so at a long window, i.e. 100-word. This was due to instability during training caused by the \dying ReLU" problem in behaviors with highly skewed distributions of human ratings. Specically, this problem occurred when randomly-shued minibatches ended up with nearly all its samples having the same rating as a result of the skewed distribution. This would then result in a large gradient update that would cause the network weights to update in such a manner that the output ReLU6 layer would henceforth only output a 0 value, thereby rendering the Neural Model ineective. While I could resolve this problem at 3-word and 30-word window lengths by identifying stable learning rate ranges, I was unable to do the same at the 100-word window length. As a result, I analyze the Neural Model at the 3-word and 30-word window lengths. 3.7.2.3 Testing At runtime, given an input windowed sequence of O words W =fw 1 ;w 2 ;:::;w O g words from an observation window, I rst mix, for each word, its ELMo embeddings using ELMo's weights. This gives us a sequence of O word embeddings. This sequence is then passed to the GRU whose hidden state representations are of dimension V. Finally, the last hidden state of the GRU is passed to a fully connected layer, followed by a Relu6 [78] layer, resulting in a scalar value that represents the behavior score of the windowed sequence of words. 67 3.8 Experiments 3.8.1 Analysis of window length for behavior estimation 3.8.1.1 Case Study 5: Estimating Interaction and Social Support behaviors in Couples Therapy 1. Introduction I perform my analysis on the Couples therapy dataset which was introduced in Sec. 2.5.1. The Couples Therapy project [36] involved 134 real-life chronically distressed couples that attended marital therapy over a period of up to 1 year. Its dataset consists of hundreds of real-life interactions as well as a rich and diverse set of human-annotated codes character- izing the behavior of the participants in these interactions. In this experiment, however, I signicantly expand the scope of the modeling eort by focusing on the task of estimating a large and diverse set of behaviors, instead of binary classication of a single behavior. 2. 
Aim
I analyze the behavior modeling performance of both the N-gram and the Neural modeling frameworks for a large and diverse set of behaviors, in order to arrive at determinations for the observation window lengths that are best suited for different behaviors.
3. Data
The corpus consists of audio-visual recordings, with manual transcriptions, of husband-wife couples discussing topics of marital distress in 10-minute interactions. Each couple had at least 2 interactions or "sessions", once with each participant leading the discussion on a topic of their choice, and the total number of sessions per couple ranged from 2 to 6. In each session, both the husband and the wife were rated for a total of 13 CIRS [63] and 18 SSIRS [71] behavior codes by trained human annotators with a sense of what "typical" behavior is like during these interactions. The annotators were asked to observe both verbal and nonverbal expressions when rating each behavior independently and, in many cases, different annotators rated different behaviors. Each behavior in each session was rated by 3 to 9 annotators, with most of them being rated by 3 to 4. The rating was done on a Likert scale from 1 to 9, where 1 represents "absence of behavior", 5 represents "some presence of behavior" and 9 represents "strong presence of behavior". More details about the recruitment, data collection and the annotations can be found in [12, 36].

CIRS2 Code - Description
Acceptance - Indicates understanding, acceptance, respect for partner's views, feelings and behaviors
Perspective - Tries to understand partner's views, feelings by clarifying and asking to hear them out
Responsibility - Implies self-power over feelings, thoughts, behaviors on issue being discussed
External - Softens criticism of partner by attributing their undesirable behaviors to external origins
Define - Articulates problems clearly, facilitates everyone's participation in problem solving process
Solution - Suggests specific solutions that could solve the problem
Negotiates - Offers compromises or bargains
Agreement - States terms of agreement, willingness to follow them with partner
Blame - Blames, accuses, criticizes partner and uses critical sarcasm and character assassinations
Change - Requests, demands, nags, pressures for change in partner
Withdrawal - Generally non-verbal, becomes silent, refuses to respond, discuss, argue, defend
Avoidance - Minimizes importance and denies existence of problem, diverts attention, delays discussion
Discussion - Discusses problem, shows engagement, interest and willingness in discussing issue

SSIRS Code - Description
Positive - Overtly expresses warmth, support, acceptance, affection, positive negotiation
Negative - Overtly expresses rejection, defensiveness, blaming, and anger
Anger - Expresses anger, frustration, hostility, or resentment during the interaction
Belligerence - Quarrels, argues, verbalizes nasty comments and mean rhetorical questions
Disgust - Shows disregard, scorn, lack of respect and makes patronizing and insulting comments
Sadness - Cries, sighs, speaks in a soft or low tone, expresses unhappiness and disappointment
Anxiety - Expresses discomfort and stress, answers with short yes/no responses without elaboration
Defensiveness - Deflects criticism by defending self, accusing partner of similar behavior
Affection - Expresses warmth and caring for partner, speaks warmly, uses endearments
Satisfaction - Feels satisfaction about how topic of discussion is defined, discussed, and resolved
Dominance - Commands course of interaction, dominates conversation, changes subject frequently
Solicits Suggestions - Shows interest in and seeks partner's suggestions, help in handling issue
Instrumental Support - Offers positive advice for clear, concrete actions to support partner
Emotional Support - Emphasizes feelings, builds confidence, and raises self-esteem in partner
Table 3.2: Description of behavior codes in Couples Therapy corpus
Solicits Suggestions Shows interest in and seeks partner's suggestions, help in handling issue Instrumental Support Oers positive advice for clear, concrete actions to support partner Emotional Support Emphasizes feelings, builds condence, and raises self-esteem in partner Table 3.2: Description of behavior codes in Couples Therapy corpus Consistent with previous work [21, 51, 85, 89, 148], for each speaker and behavior, I take the average of the annotators' ratings as the true rating in that session. Therefore, for each speaker in every session, I have the manual transcription of their utterances and their behavior ratings in that session. I disregard 4 irrelevant codes such as \Is the topic of discussion a personal issue ?" and \Is the discussion about husband's behavior ?", since they are tied more to the topic of interaction and less to the speaker's behavior. The resulting set of 27 behaviors that will be analyzed in this experiment are listed in Table 3.2 and categorized as follows: 69 Couples Interaction Rating System 2 (CIRS2): This set contains 13 codes that describe a speaker when interacting with their partner about a problem. Social Support Interaction Rating System (SSIRS): This set contains 14 codes that measure emotional features and ratings of the interaction. I only consider those sessions where both speakers were rated for all 27 behaviors, resulting in 1325 sessions in total. Since the content and nature of interaction vary from one couple to another, the number of words spoken by a speaker during a session ranges from around 50 to 2500, with a mean of 805 words and a standard deviation of 305. 4. Behavioral Grouping As seen in Table 3.2, there exist some perceptual relations between the 27 Couples Therapy behaviors based on their denitions. For example, Negative is similar to Blame but opposite to Positive, while Withdrawal is similar to Avoidance but opposite to Discussion. Grouping these behaviors based on such relations can lend more interpretability to my analysis and help us better study the link between the nature of a behavior and its observation length. Hence, I group the 27 behaviors by clustering their human expert ratings using the k-means algorithm described in Algorithm 4 in A.1. Figure 3.7 shows the resulting 4 behavior groups. My behavior groups closely resemble the work by Sevier et al. [137] which also derived 4 \scales" of behavior using a Principal Component Analysis-based approach: Negativity, Withdrawal, Positivity and Problem-Solving. Hence, I name my 4 groups of behaviors in similar fashion. The rst group is Problem-Solving since it pertains to a back-and-forth style of interaction, with behaviors such as Discussion, Negotiates and Solutions. The second group consists of behaviors such as Anger, Blame and Disgust; hence, I refer to it as Negative. Similarly, I name the third group Positive since it contains Aection, Positive, Satisfaction, etc. The last group contains behaviors such as Anxiety, Sadness and 70 Figure 3.7: Grouping of Couples Therapy behaviors based on their relation to each other: Behaviors are clustered in the space of their human annotations using the k-means algorithm. The cell in the (i;j) position shows the Spearman Correlation between human ratings of the corresponding behaviors i and j. Yellow (Blue) indicates highly positive (negative) correlation. 
5. Results
I now present the final results of my analysis framework, which show the best window length for behavior estimation in the N-gram and Neural models. I detail the intermediate BCS results for these models, which show the similarity between aggregate behavior estimates and ground truth ratings, in A.2.1 and A.3.1 respectively. Similarly, the intermediate BRC results, which show the similarity between inter-behavior relationships in behavior estimates and ground truth ratings, are given in A.2.2 and A.3.2 respectively. Based on these, I used thresholds Y_1 = 0.59 and Y_2 = 0.95 for my analysis procedure.
In general, I observed that both the N-gram model and the Neural model performed best when estimating Negative behaviors, with their BCS close to 0.5 on average. Performance was lower in both models for Positive behaviors, with BCS around 0.4 on average. In the case of Problem-Solving and Dysphoric behaviors, both models exhibited a wide range of BCS values, from 0.1 to 0.53. I interpret and compare the performance of both models in greater detail in the upcoming sections.
6. Relation between observation window length and behavior
Figure 3.8: Appropriate window lengths of behaviors estimated with (left bar plot) the N-gram model and (right bar plot) the Neural model. Absence of a bar for a behavior implies that my analysis framework was unable to determine appropriate window lengths for that behavior.
Figure 3.8 shows the final analysis results for both modeling frameworks. The left bar plot shows the results with the N-gram model at the five window lengths tested: {3, 10, 30, 50, 100} words. The right bar plot shows the results with the Neural model at the two window lengths tested: {3, 30} words. Each behavior is shown against its appropriate observation window length, as determined by my analysis, and sorted within its behavior group. I was unable to determine appropriate window lengths for behaviors whose estimates were neither similar nor consistent with human judgments; such behaviors are shown without a bar in the figure. With the N-gram model, these behaviors are Perspective and Withdrawal, while in the Neural model, they are Discussion, Perspective and Define.
I wish to clarify here that these results should not be interpreted as global optimums. For example, I see that with the N-gram model, the behavior Anger is best estimated using a 3-word long window. What this means is that 3 words is the best window length, among the ones that I tested, for estimating Anger.
Another way to interpret this result is that short windows (3 words) are better than medium-length windows (30 words) or long windows (100 words) for estimating Anger.
First, I focus on the results obtained with the N-gram model, shown in the left side plot in Figure 3.8. In general, I found that the BCS of behaviors did not vary greatly, as can be seen in Figure A.1. However, in cases where there were statistically significant changes in BCS, I was able to determine appropriate window lengths using Stage 2 of the analysis framework. For behaviors whose BCS did not significantly change, such as Affection and Solutions, I was able to determine their window lengths by inspecting their BRC using Stages 3 and 4 of the analysis framework. I refer the readers to A.2 for a more in-depth discussion of the N-gram model analysis results.
I see that most of the behaviors in my dataset tend to perform best with short windows, such as Acceptance and Negative at 3 words and Positive and Change at 10 words. At the same time, I see some behaviors that perform best when scored using much longer observation windows, such as Solutions at 50 words and Avoidance at 100 words. I also note that more than half of the behaviors perform better at windows longer than 3 words, even though the N-gram models were trained on 3 words. This shows that the train-test mismatch mentioned in Sec. 3.7.1.2 did not particularly bias my results towards always selecting 3 words as the appropriate window length.
At the group level, I see that all the Negative behaviors perform best with short observation windows. This seems to be in line with my intuition about the emotional, short-term nature of these behaviors that lends itself to brief expressions. These findings match the observation by Baumeister et al. [13] that humans show heightened awareness of and react more quickly to negative information than to positive information. They also match the findings by Carney et al. [28], who reported that Negative affect could be quantified well using thin slices whereas Positive affect required thicker slices.
The remaining groups, on the other hand, appear to be expressed over a wide range of lengths. Among these, Positive and Dysphoric behaviors mostly work best at short observations (10 words or fewer). For Positive behaviors, this is likely due to their affective content which, while not as brief as negative ones, is nevertheless short-term. Dysphoric behaviors, on the other hand, are characterized by a lack of participation and expression and are thus likely to be marked by brief expressions, which could be why they tend to do best at short window lengths. Finally, Problem-Solving behaviors are evenly split between either very short (3 words) or much longer windows (50 - 100 words). This is a little surprising, since I would normally expect them to be mostly, if not completely, long-range due to their extended, back-and-forth nature. Upon inspection, I found that these behaviors exhibited their highest similarity with human ratings (as seen from their BCS) at short windows but their highest consistency
This will help us understand how they vary based on the choice of modeling framework. 7. Relation between observation window length and modeling framework I now focus on the comparison analysis with the Neural model, shown in the right side plot in Figure 3.8. I show the appropriate observation window length for each behavior, sorted by its group. Once again, I see that Negative and Positive behaviors show a greater preference for short observation windows than Problem-Solving and Dysphoric behaviors. In particular, all the Problem-Solving behaviors perform best when estimated using longer windows. This is further supported by the trained ELMo weights in Figure 3.9 which show that Problem- Solving behaviors rely heavily on the top layer, which is associated with complex and high-level language aspects that are typically long-term. Therefore, the core nding that aect-based behaviors are best captured using shorter window lengths while non-aect-based behaviors are best captured using longer observation windows is seen to hold consistently across models. There do exist some dierences, however, that appear to be driven more by the nature of the modeling framework and less by the behaviors. For instance, with the Neural model I see that most of the behaviors perform best at 30 words, with only less than a quarter performing well at 3 words. This is possibly due to its Gated Recurrent Unit, which was 75 bottom middle top 0 0.1 0.2 0.3 0.4 0.5 Average Weight POSITIVE BEHAVIORS bottom middle top 0 0.1 0.2 0.3 0.4 0.5 Average Weight PROBLEM-SOLVING BEHAVIORS bottom middle top ELMo Layers 0 0.1 0.2 0.3 0.4 0.5 Average Weight NEGATIVE BEHAVIORS bottom middle top ELMo Layers 0 0.1 0.2 0.3 0.4 0.5 Average Weight DYSPHORIC BEHAVIORS Figure 3.9: Trained ELMo layer weights for dierent behavior groups originally designed to handle long-context dependencies and, thus, works better when fed information from a longer observation window. I also see that Negative behaviors perform better at medium-length windows than at short ones, on average. This is in contrast to the N-gram model, where they all performed best at short window lengths. To understand the reason for this dierence, I inspected their BCS and BRC values and found that while the Neural model's estimates were more consistent at 3 words, they were more similar at 30 words. This suggests that better functionals might be required to accurately summarize Negative behaviors at short windows when using the Neural model. 8. Relation between behavior and modeling framework Finally, I compare the two models in terms of how well they estimate each behavior and which functionals they used in doing so. This can provide insights into the estimation process and help us understand which of the two models is a better t for a behavior. Figure 3.10 shows, for every behavior, the best performance of each model over all window lengths and functionals. 
Figure 3.10: Comparison of best modeling performance from both models over all window lengths for different behaviors: Performance here refers to the similarity between behavior estimates and human judgments, as measured by the Behavior Construct Similarity (BCS) metric.
I see that the N-gram model performs as well as, if not better than, the Neural model when quantifying most Negative and Positive behaviors. While this might seem counter-intuitive, I have observed a similar result in my previous work, where N-gram-based and Neural-based models performed similarly when classifying the behavior construct Negative [29], which is part of the Negative group in this work. Furthermore, as I saw earlier, most of these behaviors were better quantified at shorter windows than longer ones. This suggests that short, frequently used expressions carry a considerable amount of information about how much affect a person is expressing. Since an N-gram model is ideal for capturing fixed-length, short expressions, it appears to be better suited than the Neural model for this task.
In the case of Problem-Solving and Dysphoric behaviors, I see that the Neural model performs as well as, if not better than, the N-gram model. Since these behaviors are more complex and ambiguous than affect-based ones, estimating them accurately requires the ability to handle long context dependencies in a sophisticated, non-linear manner. This is precisely the advantage that the Neural model offers over the N-gram model; hence, in line with my expectation, I see that it performs better for these behaviors.
Among the aggregating functionals, I see that the median is the best one for all of the Negative and Positive behaviors, similar to previous works [150, 151]. Since the median represents the "typical" value, this suggests that affect-based behaviors are steadily expressed throughout the entire interaction, rather than impulsively or rarely. In the Problem-Solving and Dysphoric groups, however, functionals that represent extreme deviations from the "typical" value are seen to perform well. In particular, Dominance, an overt and high-arousal behavior, is best aggregated as the maximum, while Withdrawal, a subtle and low-arousal behavior, is best aggregated as the minimum. This suggests that the expression patterns of non-affect-based behaviors might be highly infrequent and impulsive. These findings are thematically aligned with Lee et al. [90], who showed that humans use different processes for different behaviors when forming an overall impression over the course of an interaction.
In general, while my findings agree with previous works, they diverge slightly from some studies that deal with the audio and video channels. For instance, while my analysis showed that Sadness and Positive were best captured at similar window lengths, Krull et al. [79] reported that sad faces evoked less spontaneous reactions than happy faces, implying that they were captured at different window lengths in the visual modality. Similarly, Li et al.
[91] reported that, in contrast to my ndings, behaviors such as Blame and Negativity performed better with longer observation windows while Sadness performed best at shorter windows. This suggests that aect-based behaviors are suciently expressed through all three modal- ities - audio, video and lexical - but over dierent time-scales, in which case multi-scale approaches might be benecial when fusing information across modalities. Furthermore, 78 dysphoric behaviors such as Sadness do not appear to be strongly detected in either the audio or the lexical modality, regardless of how long they are observed, but appear to be well captured in the visual modality. This provides additional motivation for the use of multimodal approaches when estimating a general set of behaviors. 3.9 Discussion and Conclusions In this section, I analyzed how long a system needs to observe conversational language cues, measured in number of words, in order to quantify dierent behaviors. I proposed an analysis framework and associated evaluation metrics that can be used to determine appropriate window lengths for behavior estimation, even in scenarios where reference human judgments are not available to compare against at every possible window length. I applied my analysis to the Couples Therapy dataset which contains a rich and diverse set of behaviors observed in real-life interactions. I also examined the robustness of my analysis to two dierent behavior modeling methods, a Maximum Likelihood N-gram model and a Deep Neural Network model. Finally, I compared my ndings with those from similar work in psychology, machine learning and speech processing and addressed pertinent issues related to the nature of human behavior expression in spoken language. My analysis showed that aect-based behaviors are steadily and frequently expressed during a conversation and can be reliably captured from short lexical cues. On the other hand, behaviors involving complex, back-and-forth deliberations tend to be expressed in the form of rare and extreme events and require much longer observation windows in order to be accurately understood. Finally, the expression of dysphoria appears to be dicult to detect from language alone, even when observed using long windows. The ndings from this work are of relevance not only to machine learning-based behavior estimation approaches but also to psychological research studies that deal with manual annotations 79 of behaviors. For instance, future studies might nd it benecial, both in terms of cost as well as time, to consider which types of behaviors are of primary importance when deciding on the length of interactions to be collected. Studies focused on negative and positive aect-based behaviors may be able to elicit and measure them over relatively brief periods of time. On the other hand, studies focused on discussion-oriented behaviors will likely require considerably longer intervals that can generate larger amounts of text. The next step in this work would be to extrinsically evaluate my ndings across dierent be- havior modeling tasks and checking if they translate into improved performance over using the same window length for all behaviors. It is also worth investigating how the window length re- quirements change when employing a multimodal analysis system that uses acoustic and visual cues in addition to the lexical cues. 
As a supplement to this work, I would like to crowdsource hu- man annotations of how accurately humans can assess dierent behaviors using dierent amounts of text information, thus conducting a study similar to those involving thin slices. A related eort in that direction would also be to test on datasets with dialog acts and utterance-level annotations of behavior for direct evaluation. Finally, I plan on investigating if functionals that mimic human- like perception, such as primacy and recency [142], might be a better t for behavior aggregation during an interaction. 80 Chapter 4 Contextualizing Speaker Behavior with Partner Interaction Cues 4.1 Introduction Conversations in social settings are often marked by each person expressing behaviors that are not only driven by their own internal state of mind but also aected by how the other person responds to them [100]. The nature of this phenomenon, referred to as interpersonal in uence, can vary signicantly, depending on the individual traits of the speakers as well as the relationship between them. Even for the same speaker, the type of in uence might be dierent with dierent people, causing their behavior to change dierently. For example, a person might respond positively when their friend compliments them but might respond in a lukewarm manner when a stranger compliments them. Therefore, creating models that explicitly incorporate in uence can provide new understanding of the longer-term dyad dynamics along with better behavioral estimation. While traditional works have incorporated information from the partner, they either did not explicitly model the underlying states of the interlocutors [19, 150] or completely ignored the rated speaker's partner [31, 158]. These models lack explicit understanding of the interpersonal in uence. While interlocutor in uence models have been proposed in the past [86, 161], they assume direct interaction between the latent states of the speakers which is not applicable for 81 my problem. Outside of BSP, mutual in uence models have been proposed, such as [38] which predicts \attachment security" between mother and child based on their previous measures of security. However, these deal with fully observed processes which are dierent from the latent behavior I am interested in modeling. In this section, I present my proposed model, which I refer to as the in uence model. It is text- based and describes how a speaker continuously perceives their partner's behavior based on their responses and how this perception, in addition to their own past behavior, aects their behavior over time. It also species parameters that determine the strength of the in uence mechanisms. Thus, it provides a more complete understanding of how and why a person's behavior changes over time. Additionally, I investigate if the interpersonal in uence between the speakers of a couple during therapy relates to their relationship outcomes. It has been shown that interlocutor in u- ence in couples dyadic interactions is important for relationship functioning [11], and therefore, outcomes. Empirical relations have also been found between outcomes and interaction-dependent measures such as vocal entrainment (through withdrawal behavior) [87] and dyadic prosodic fea- tures [110]. Therefore, I want to examine whether my model parameters, designed to describe the characteristics of dyadic interactions, can also provide information about outcomes. 
Finally, I build on findings from the previous chapters to create comprehensive behavioral representations based on interaction cues. The partner's information is incorporated in the form of interaction dynamics representations for modeling the speaker's behaviors. I then investigate the effectiveness of the different behavioral representations for classifying long-term behavioral attributes of the speaker in real-world applications. The effectiveness of diverse sources of language-based behavioral information is shown through an analysis of the most useful features in different prediction tasks. In addition, the partner's information is found to be universally valuable for accurately modeling the speaker's behaviors. Finally, in cases where language-based cues are insufficient for capturing dysphoric behaviors, I find that the speaker's and partner's vocal cues can address those limitations.

4.2 Influence Dynamic Behavior Model

My proposed model is an extension of the Likelihood-Dynamic Behavior Model (LDBM), which was described in Sec. 2.3.2. My proposed influence model introduces an additional latent "pseudo-state" that represents how the target speaker perceives their partner's behavior. Its purpose is to simulate the generation of the partner's utterance and to affect the next state of the speaker. I do this to model the real-world scenario where a speaker does not know the state of mind of the partner they are interacting with [100] - the best they can do is to make an inference based on what was said. I call the partner's perceived-by-target-speaker state a pseudo-state to distinguish it from a state which represents the partner's internal state of mind. In the following sections, I describe my proposed model of behavior generation in language.

4.2.1 Modeling approach

Let U_s denote the speaker's utterance sequence (e.g., "No", "No", "I said no") and S_s denote a speaker state sequence (e.g., "Insult", "Compliment", "Insult"). Different behavior classes share the same set of states but differ in their transition probabilities. For example, a "High Anger" speaker might tend to stay in a particular state and not change, whereas a "Low Anger" speaker might frequently change states. I use S_s to denote the state of the target speaker (whose behavior I am interested in identifying) and S̃_p to denote the pseudo-state of the partner (from whom I am obtaining supplementary information).

Furthermore, my influence model specifies parameters that control how much a speaker's behavior is influenced by their own past behavior versus their partner, motivated by the domain literature [18]. α and β respectively control how much the speaker's current utterance and partner explain their current behavior, relative to their previous state, which is always weighted by 1. Speakers who do not listen much to their partner and instead continually espouse only their point of view have β < 1. In contrast, speakers that strongly react to what their partners say during discussions have β > 1.

The behavior generation process is as follows: before the i-th speaking turn, the target speaker is in state S_s^{i-1} and, having observed the partner generate utterance U_p^{i-1}, has perceived them to be in pseudo-state S̃_p^{i-1}. Then, in the i-th speaking turn, based on S_s^{i-1} and S̃_p^{i-1}, the speaker transitions to state S_s^i and generates U_s^i. As an illustration, Fig. 4.1 depicts this process over 3 turns of an interaction.

Figure 4.1: Behavior generation process over 3 speaker turns in the Influence Model
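The following minimal sketch simulates the turn-by-turn generative loop described above: the speaker perceives a pseudo-state from the partner's last utterance, transitions based on that pseudo-state and their own previous state, and would then emit an utterance from the new state's language model. All probabilities and the toy perception rule are illustrative assumptions, not trained values from this work, and the α/β weighting is omitted here for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
STATES = [0, 1]                          # two latent behavior states (S0, S1)

A_self = np.array([[0.8, 0.2],           # toy P(S_s^i | S_s^{i-1}) for one class
                   [0.3, 0.7]])
A_partner = np.array([[0.6, 0.4],        # toy P(S_s^i | partner pseudo-state)
                      [0.4, 0.6]])

def perceive_pseudo_state(partner_utt: str) -> int:
    """Stand-in for scoring the partner's utterance with the state language
    models and keeping the most likely state (the 'perception' step)."""
    return 1 if "no" in partner_utt.lower() else 0

def next_speaker_state(prev_state: int, pseudo_state: int) -> int:
    """Combine the self- and partner-driven transition terms (an unweighted,
    renormalised product here; the model weights the partner term by beta)."""
    p = A_self[prev_state] * A_partner[pseudo_state]
    return int(rng.choice(STATES, p=p / p.sum()))

state = 0
for partner_utt in ["That's fine, whatever you want", "No, I already said no"]:
    pseudo = perceive_pseudo_state(partner_utt)
    state = next_speaker_state(state, pseudo)
    # an utterance would be emitted here from the new state's language model (omitted)
```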
It should be noted that while S_s^i depends on S_s^{i-1}, the pseudo-state S̃_p does not depend on its own past values; it is inferred solely from the partner's corresponding utterance. For an interaction consisting of M speaking turns, the complete behavior generation process is given by:

P(S_s, \tilde{S}_p, U_s, U_p; \alpha, \beta) = P(S_s, \tilde{S}_p; \beta) \, P(U_s, U_p \mid S_s, \tilde{S}_p; \alpha)
  = P(S_s^1) \prod_{j=2}^{M} P(S_s^j \mid S_s^{j-1}) \prod_{j=2}^{M} P(S_s^j \mid \tilde{S}_p^{j-1}) \prod_{j=1}^{M} P(U_s^j \mid S_s^j) \prod_{j=1}^{M-1} P(U_p^j \mid \tilde{S}_p^j)    (4.1)

where U_s, S_s, U_p and S̃_p denote the speaker's utterance sequence, speaker state sequence, partner's utterance sequence and partner pseudo-state sequence respectively. P(S_s^j | S_s^{j-1}) and P(S_s^j | S̃_p^{j-1}) are transition probabilities that denote how a speaker's previous state and their partner's previous pseudo-state are likely to affect the speaker's choice of next state. P(U_s^j | S_s^j) is an emission probability describing the likelihood of the speaker saying U_s^j when occupying state S_s^j. P(U_p^j | S̃_p^j) describes the likelihood, according to the speaker, of the partner saying U_p^j in the perceived state S̃_p^j. Since both emission probabilities deal with the speaker's perspective, they are obtained from the same model.

Finally, I define a metric for quantifying the degree of the partner's influence on the speaker, ρ = log10(1/β). Positive values indicate that the speaker is influenced more by their previous state than by their partner, whereas negative values signify the opposite (speaker influenced more by partner than by previous state). A value of 0 signifies equal influence from the past and the partner's states.

4.2.2 Training

The core training methodology is similar to that of the LDBM in my previous work [31]. I use two states, S_0 and S_1, which are represented by statistical language models (LMs) that I term language-to-behavior (L2B) models. In this case the L2B model is an n-gram LM as in [33, 51], trained using the SRILM toolkit [143]. These state LMs are used to score text to obtain the emission probabilities P(U_s^j | S_s^j) and P(U_p^j | S̃_p^j) in Eqn. 4.1. Both classes C_0 and C_1 use the same states but through different state transition probabilities. The state models, along with the transition probabilities, are trained using the Expectation Maximization (EM) algorithm [41], following a procedure similar to Algorithm 2 described in Sec. 2.3.2.2.

In this first attempt at creating the influence model, my focus was on establishing its benefits. Thus, rather than risk overfitting through optimization, I decided to explore a range of α and β in logarithmic steps of 10 from 10^-3 to 10^3, resulting in a search space of 49 points. The main idea is to get a general sense of the importance of the information streams. For example, α=1, β=1 implies that "the rated speaker is affected as much by their own past behavior as by their spouse's". The training procedure is as follows:

1. In a test fold, pick a parameter configuration (α, β)
2. Initialize and iteratively train state models with (α, β) using the LDBM training described in [33]
3. Estimate state transition probabilities per class, classify dev couple sessions and repeat for all dev sessions
4. Pick the best configuration (α, β) based on dev accuracy
5. Use this (α, β) to train class and state models, evaluate on the test session and repeat for all test folds

I avoided estimating transition probabilities at each iteration since they tended to skew towards not changing states as a result of the turns initially having the same label. For this experiment and evaluation I have 134 models, one for each test couple, due to the n-fold nature of my evaluation.
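As a minimal sketch of the coarse grid search over (α, β) and of the influence metric ρ = log10(1/β) described above, the snippet below assumes a placeholder train_and_score_dev callable standing in for steps 2-3 (EM training of the state models followed by dev-set classification); it is not the experimental code.

```python
import itertools
import numpy as np

GRID = 10.0 ** np.arange(-3, 4)      # 10^-3 ... 10^3 in steps of 10: 7 x 7 = 49 configurations

def pick_configuration(train_and_score_dev):
    """Return the (alpha, beta) pair with the highest dev-set accuracy."""
    best_cfg, best_acc = None, -np.inf
    for alpha, beta in itertools.product(GRID, GRID):
        acc = train_and_score_dev(alpha, beta)   # steps 2-3 for this configuration
        if acc > best_acc:
            best_cfg, best_acc = (alpha, beta), acc
    return best_cfg

def influence_metric(beta: float) -> float:
    """rho > 0: the speaker's own past state dominates; rho < 0: the partner dominates."""
    return float(np.log10(1.0 / beta))
```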
The training procedure for the modified LDBM is similar, with only α being optimized over the dev set. For baseline comparison, I also trained the LDBM with fixed α=1. It should be noted that the LDBM with α=1 is very similar to, but not the same as, the influence model with α=1, β=0, since their training data are different.

4.2.3 Testing

Given a session consisting of M rated speaker turns U_s and M-1 spouse turns U_p, the goal is to classify the rated speaker as either C_0 or C_1. For each class, I obtain the decoded state sequences that best explain the observed utterances, along with their probability. The class that is most likely to have generated the utterances is then picked as the label of the rated speaker, as denoted below:

C_i = \arg\max_{C_j} P(S_s^j, \tilde{S}_p^j \mid U_s, U_p; \alpha, \beta)    (4.2)
    = \arg\max_{C_j} P(S_s^j, \tilde{S}_p^j, U_s, U_p; \alpha, \beta)    (4.3)

where S_s^j is the speaker state sequence decoded by the model of C_j for U_s, S̃_p^j is the perceived partner state sequence decoded by the model of C_j for U_p, and α and β are the optimized parameters used to train the final model. The joint probability in Eqn. 4.3 reduces to the conditionally independent probabilities in Eqn. 4.1. Using this scheme, I classify all test couples with their corresponding fold models trained in Sec. 4.2.2 and compute the classification accuracy. The same scheme is also used to test the LDBM and obtain its classification accuracy.
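A minimal sketch of this decision rule follows; decode_log_likelihood is a hypothetical stand-in for decoding the best state and pseudo-state sequences under one class's trained models and returning the joint log-probability of Eqn. 4.3. It is illustrative only, not the code used in these experiments.

```python
from typing import Callable, List, Tuple

def classify_session(speaker_turns: List[str], partner_turns: List[str],
                     decode_log_likelihood: Callable[[str, List[str], List[str]], float],
                     classes: Tuple[str, str] = ("C0", "C1")) -> str:
    """Return the class whose model best explains the observed utterances."""
    scores = {c: decode_log_likelihood(c, speaker_turns, partner_turns) for c in classes}
    return max(scores, key=scores.get)
```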
4.3 Multi-Scale Multimodal Speaker-Partner Interaction Cues

In this section, I describe a comprehensive behavioral representation of the speaker and the partner that builds on the findings in this thesis thus far. Such a representation can be fed directly to downstream tasks where the goal is to predict the speaker's or the partner's behavioral attributes based on their interaction cues. First, I incorporated lexical semantic embeddings at the turn-level, which can capture long contexts and which were shown to be effective for the task described in Sec. 2.5.3.1. I then extracted sentiment information at the word-level, which was found to be useful for training the turn-level embeddings, using both count-based as well as model-based techniques. I also built models of the behavior codes described in the Couples Therapy corpus, using the appropriate observation window length as determined by my proposed analysis framework described in Sec. 3.5.2. As I had discussed in Sec. 3.9, language alone might not be sufficient for capturing dysphoric behaviors; hence, I also extracted acoustic voice quality features similar to the ones described in Sec. 2.4.2.2, which have been shown to be useful for capturing similar constructs [21, 49]. Finally, I captured the interaction dynamics between the speaker and partner as expressed through vocal speech synchrony as well as turn-taking speech and pause patterns. I now detail each of these individual feature sets.

4.3.1 Modeling approach

4.3.1.1 Turn-level semantic representations

I used the 600-dimensional sentence embeddings described in Sec. 2.4.2.2, which had been trained using a sequence-to-sequence model to predict dialogue turns and majority sentiment class in a multi-task framework.

4.3.1.2 BERT-based turn-level sentiment cues

To extract sentiment information from language, I used RoBERTa [93], a deep learning-based model that was trained on a sentiment analysis task over thousands of movie reviews [140]. RoBERTa is an improved version of a general-purpose language representation model, BERT [42], which takes as input a pair of sentences and learns a vector-space encoding of words by jointly training to predict missing words in the sentences and to classify whether the two sentences are related to each other or not. Although it is trained as a model for language representation in context, BERT has been shown to perform very well with minimal fine-tuning on many Natural Language Processing tasks. The sentiment analysis task that was used to train RoBERTa is a binary classification of whether the sentiment expressed in the text of a movie review is positive or negative. For example, a positive movie review might contain phrases such as "refreshingly honest", "ultimately touching", and "one of the best films of the year", while a negative movie review might contain phrases such as "this sloppy drama is an empty vessel" and "a complete failure". Although BERT achieves 94.9% accuracy on this classification task, RoBERTa improves it to 96.2% and achieves state-of-the-art performance by modifying training regime parameters such as batch size, sequence lengths, and word masking patterns. In this study, I used a pretrained implementation of RoBERTa that is available online (https://demo.allennlp.org/sentiment-analysis). Given an input sentence, RoBERTa outputs probabilities for negative and positive sentiment, which can be interpreted as the degree of sentiment expressed in that sentence. Because a sentence can only be negative or positive, the two probabilities sum to one and, hence, in this paper, I used only the probability of negative sentiment; the higher (lower) the probability, the more negative (positive) the sentence is. In each transcript, I first extracted negative sentiment probabilities for each speaker turn, and this was repeated for all speakers over all interactions. Finally, I derive first- and second-order temporal differences (delta (Δ), delta-delta (ΔΔ)) and compute 9 session-level statistical functionals (min, max, mean, median, range, std. deviation, quartiles) on top of them to obtain a 30-dimensional feature.

4.3.1.3 LIWC-based word-level sentiment cues

Since emotion expressed in language has been reported previously [153] to be associated with suicidal risk, I computed count-based statistics of LIWC [120] positive and negative emotion words from the session transcripts. 6 lexical features were extracted: the proportions of both emotions in the speaker's language throughout the session, followed by the log-ratios of the speaker's and their partner's proportions for all 4 combinations of emotions (e.g. log-ratio of speaker-negative-proportion to partner-positive-proportion, etc.).

4.3.1.4 Multi-scale lexical behavior cues

I built N-gram behavior models as described in Sec. 3.7.1 for each of the 27 Couples Therapy codes described in Table 3.2 at their corresponding appropriate observation window length, as shown in Figure 3.8. I obtained score trajectories for each behavior, following which I derive first- and second-order temporal differences (delta (Δ), delta-delta (ΔΔ)) and compute 9 session-level statistical functionals (min, max, mean, median, range, std. deviation, quartiles) on top of them to obtain a 648-dimensional feature vector.

4.3.1.5 Acoustic Low-level Descriptors (LLDs)

Acoustic features have been shown to be useful in prior work as markers of suicide risk [40, 135, 153] and depression [40, 132].
I used OpenSMILE [46] for extracting the standard openEAR Emobase feature set [47] from the speech segments of each speaker separately. This set includes various prosody features (pitch, intensity, etc.), voice quality features (jitter, shimmer, etc.), and spectral features (MFCCs, line spectral frequencies, etc.). I then took six statistical functionals (1st percentile, 99th percentile, difference between the 99th and 1st percentiles, mean, median, std. deviation) over the low-level descriptors to obtain 228 session-level features for each speaker.

4.3.1.6 Speaker-Partner Vocal Entrainment

Entrainment is the phenomenon by which interlocutors tend to spontaneously synchronize their speech patterns, such as vocal style or language use, with their partner's patterns over the course of an interaction. Such adaptation often signifies that the two interlocutors are actively engaged in conversation with each other and is usually associated with positive interaction dynamics. For each turn, I extracted a 384-dimensional vocal entrainment representation that had been learned in an unsupervised manner by using an encoder-decoder network to predict the vocal features of adjacent speech segments in telephonic conversations [109]. For each interlocutor, I then computed two variants of these representations: (i) their average representation over the course of the interaction and (ii) the average of the difference between their and their partner's representations over the course of the interaction. This resulted in 2 sets of 384-dimensional entrainment embeddings, one directional and the other undirected.

4.3.1.7 Speaker-Partner Turn-Taking Cues

The dynamics of turn-taking and pausing during an interaction have been linked to suicidal risk and psychological distress in multiple studies [40, 123, 153]. To capture them, I extract features relating to speech duration, number of words, speech rate and pause for every speaker and also compute differences between the speaker's and their partner's features during a turn change. This is performed locally in every turn as well as globally over the entire session where applicable. For each of the local features, I derive first- and second-order temporal differences (delta (Δ), delta-delta (ΔΔ)) and compute 9 session-level statistical functionals (min, max, mean, median, range, std. deviation, quartiles) on top of them. This results in 167 session-level features.

4.3.2 Training

A Support Vector Machine (SVM) was used as the classifier, and multimodal features were tested through feature-level fusion. When performing fusion, I tested all possible combinations of the speaker's and the partner's features. This allowed me to observe the usefulness of the partner's features for speaker behavior modeling in multiple ways, as well as to identify the combination of their features that would be best suited for the downstream task. Feature dimensionality reduction was applied using Principal Component Analysis such that 95% of the total energy was retained. Sample weighting was applied to address any class imbalance, and hyperparameters such as the feature normalization scheme (min-max, z-score), the SVM kernel (linear, RBF), the SVM penalty C and the RBF kernel coefficient γ (both over 10^-5, 10^-4.5, ..., 10^5) were tuned to optimize the classifier.
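Several of the feature sets above share the same aggregation recipe: first- and second-order temporal differences of a per-turn trajectory, followed by session-level statistical functionals on each stream. A minimal sketch of that recipe is shown below; it is illustrative only, and the exact functional list and dimensionality bookkeeping vary slightly per feature set in this chapter.

```python
import numpy as np

def functionals(x: np.ndarray) -> np.ndarray:
    """min, max, mean, median, range, std. deviation and the three quartiles."""
    q1, q2, q3 = np.percentile(x, [25, 50, 75])
    return np.array([x.min(), x.max(), x.mean(), np.median(x),
                     x.max() - x.min(), x.std(), q1, q2, q3])

def session_features(trajectory: np.ndarray) -> np.ndarray:
    """Aggregate a per-turn trajectory (e.g. negative-sentiment probabilities
    or a behavior-model score curve) into session-level statistics computed
    over the raw stream and its delta / delta-delta streams."""
    delta = np.diff(trajectory, n=1)
    delta2 = np.diff(trajectory, n=2)
    return np.concatenate([functionals(s) for s in (trajectory, delta, delta2)])
```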
4.3.3 Testing

I used leave-one-couple-out cross-validation where, in fold i, couple C_i was picked as the test split and the remaining couples C_j, j ≠ i, were randomly assigned to an 80:20 train:validation split such that both splits had similar label distributions and no couple appeared in more than 1 split. The classifier was then trained on the train split, optimized on the validation split and used to predict the suicidal risk of the speaker(s) in couple C_i. Performance was evaluated using macro-average recall of classification in order to account for class imbalance. To determine whether there were statistically significant (p < 0.05) differences between my results and chance, I ran McNemar's test for binary classifications and the Stuart-Maxwell test for the 3-class scenario. In partition experiments, a separate classifier was created for each partition and, at test time, the appropriate one was used.

4.4 Experiments

4.4.1 Influence-DBM vs Likelihood-DBM

4.4.1.1 Case Study 6: Identifying Negative behavior in Couples Therapy

1. Introduction

Traditional models [19, 31, 51, 150] that have been shown to work well using language in the Couples Therapy domain have incorporated information only from the rated speaker and not from the partner. However, the partner's observations are important for better understanding why the speaker behaved the way they did. In this section, I perform an experiment to verify whether incorporating the partner's information can indeed help us better predict the speaker's behavior.

2. Aim

In this section, I compare the performance of my proposed influence model against existing models which do not incorporate the partner's influence on the speaker. In addition, I also analyze whether my influence model's characterization of the speaker-partner dynamics can provide information about their long-term relationship outcomes.

3. Dataset

For this experiment, I used the same dataset as in Sec. 2.5.1.1, shown in Table 2.3. I performed the same binary classification task of classifying the label of Negative behavior. Each couple was also rated by the annotators at the 26-week or the 2-year time period on its recovery since starting therapy, referred to as "outcomes" [12]. They were rated as follows: 1 indicating deterioration in the relationship, 2 indicating no change, 3 indicating measurably better improvement and 4 indicating recovery. I use 2-year outcome ratings for 79 couples belonging to the Top/Bottom 20 percentile classes. These are the ones I am interested in analyzing, and their demographics are shown in Table 4.1.

Table 4.1: Couples Therapy Outcome Demographics
Outcome       Decline   No Change   Better   Recovery
Rating        1         2           3        4
No. Couples   18        9           18       34

4. Results, Analysis and Discussion

The comparison of test results between the baseline LDBM (α=1), the LDBM and the influence model is shown in Table 4.2.

Table 4.2: Comparison of Test Classification Accuracy % with best performing model indicated in bold
Model     LDBM (α=1)   LDBM    Influence Model
1-gram    84.64        82.86   85.00
2-gram    86.78        86.43   88.93
3-gram    85.71        87.86   88.21

I see that my proposed model improves upon the existing one and the baseline in every context scenario. This indicates that the partner's responses do provide supplementary information about why the speaker behaves in a certain way. I also see that while the LDBM gets better with more context, the influence model improves from 1-gram to 2-gram but slightly drops in accuracy from 2-gram to 3-gram.
One reason for this could be the coarse sampling of the control parameters, while another could simply be the sparsity in learning 3-grams in bigger models. I note that my best performing model matches the Neural Behavior Model proposed by Tseng et al. in [150], which was described in Sec. 2.4. This points to complementary gains from improved L2B models [150] and from incorporating interlocutor dynamics as in this work.

I also examined which (α, β) configuration in the influence model and in the LDBM performed best in different test folds. This can provide insight into how relevant each stream of information is, on average. The histograms of best configurations for both models are shown in Fig. 4.2a to Fig. 4.2d. It is interesting to note that the best performing model, the 2-gram influence model, overwhelmingly picked α = β = 1, implying equal importance of all the information streams. This too stresses the importance of incorporating the spouse's feedback when studying a rated speaker's Negative behavior.

Figure 4.2: Histogram of best parameter configurations for 1-, 2- and 3-gram LDBM and Influence models. (a) LDBM 1-, 2-, 3-gram; (b) Influence 1-gram; (c) Influence 2-gram; (d) Influence 3-gram.

Finally, I checked whether the speaker-partner dynamics, as characterized by my influence model, were associated with the couples' therapy outcomes. For each outcome rating, I first selected the couples with that rating. Then, for each couple, I obtained its average dev classification accuracy over all test folds for each parameter configuration (α, β). I then picked the parameter configuration with the highest accuracy and added 1 count to its bin (equal fractional counts if there were multiple such configurations). By repeating this for all couples and normalizing the counts, I obtained a 2-D histogram. Since the weights corresponding to the speaker's state and the partner's effect are 1 and β respectively, I converted the 2-D histogram into a 1-D histogram with the new axis ρ = log10(1/β) and repeated this for all outcomes. The 1-D histogram relates to interpersonal influence in the following way: ρ = 0 denotes an equal contribution of the rated speaker's past behavior and the spouse's; ρ > 0 corresponds to the speaker's past behavior being more dominant than the spouse's, whereas ρ < 0 denotes the reverse. In order to check whether the outcome histograms are generated from different distributions, I fit a Gaussian distribution to each one. I obtained fits with mean values of 0.2, 0.16, -0.18 and 0.05 for outcomes 1, 2, 3 and 4 respectively; all Gaussians exhibit a high variance. I see that couples with the best outcome, 4, have a mean closest to 0, signifying equal influence, while couples with the worst outcome, 1, have a mean farthest from 0, signifying one person's influence dominating the other's. This shows that the best outcome is associated with a degree of influence that is closest to a balanced interaction, while the worst outcome is associated with a degree that is farthest from balanced.

4.4.2 Modeling Long-Term Speaker Behaviors from Multi-Scale Multimodal Speaker-Partner Interaction Cues

4.4.2.1 Case Study 7: Assessing Suicidal Risk in Military Couples

1. Introduction

According to the Centers for Disease Control and Prevention (https://www.cdc.gov/nchs/fastats/suicide.htm), 39,518 suicides were reported in 2011, making suicide the 10th leading cause of death overall and the leading non-natural cause of death for Americans. Among military personnel, rates of suicide are even higher [72].
One of the biggest challenges to improving the success of suicide prevention efforts in the military is the absence of reliable methods for predicting who will engage in suicidal behaviors and when they will do so. This restricts my ability to identify at-risk military personnel and ensure that they are getting the best available and most appropriate treatment for their psychological symptoms. It also impacts not only them but also their spouses, who are at increased risk for a wide range of psychological and physical health symptoms [116, 134]. Clinical interviews and surveys are the best available methods for identifying if and when a person is at increased or heightened risk. However, these methods do not work well for measuring suicide risk in soldiers (https://theactionalliance.org/sites/default/files/agenda.pdf). The major limitations of clinical interviews are that they require in-person interaction with a health care professional and that most service members who die by suicide do not interact with such professionals immediately prior to the event. Likewise, surveys about suicidal thoughts and feelings are not informative if military personnel are unwilling to acknowledge or are unaware of their psychological distress. The ability to assess risk directly at home is, therefore, considered valuable, for which machine learning (ML) is being examined as a viable platform.

There is significant literature on using ML for identifying attributes related to suicidal risk; I refer readers to Burke et al. [26] for a comprehensive review. Most works either deal with static, non-interactive scenarios such as microblog posts [25, 34] and written answers [37, 122], or with interactive but highly structured settings such as interviews with therapists [49] and social workers [135, 153]. Existing works typically use information from only one modality, such as text [37, 122] or audio [49, 55, 135]. Recently, there have been efforts on using multimodal approaches for quantifying suicidal risk [123, 153], which is the topic of interest in my work as well. However, despite this progress, these approaches are often constrained by heavy manual supervision that ensures clean, directly usable features but also greatly limits the scope of their deployment in real-world conditions. As a result, it is not feasible to deploy these methods at home for suicidal risk assessment.

Distressed couples' conversations, which have been well studied within the broad realm of family studies [10, 36, 64], offer a potential setting for performing such assessments. They have been well analyzed using ML-based computational approaches that have been found to be useful across a variety of behavioral and health domains [23, 108]. For instance, in Couples Therapy [36], multiple works have effectively quantified behaviors related to speakers' mental states such as Blame, Positive and Sadness using the speaker's language [51] and vocal traits [21]. Similarly, in Cancer Care [129] interactions, lexical and acoustic cues have been found to be useful in predicting Hostile and Positive behaviors [107].

2. Aim

In this experiment, I investigate whether military personnel's conversations with their spouses at home can provide useful markers of their suicidal risk. Furthermore, I am interested in understanding whether the speaker-partner cues within an interaction can reveal valuable insights into how much at risk the speaker is, beyond the scope of the interaction.
I test the effectiveness of the Multi-Scale Multimodal Speaker-Partner Interaction Cues in real-world conditions by extracting all of my features from raw, noisy data in an ecologically meaningful manner. My system uses an automatic diarization and speech recognition front-end with operating conditions that require limited manual supervision and is, hence, readily deployable.

3. Dataset

My dataset consists of 62 mixed-sex couples, a total of 124 individuals. They were recruited for a study of behavioral and cognitive markers of suicide risk among geographically dispersed military service members. The study criteria required that they be in the National Guard, a Reserve Component, or a recent Veteran who served during the Operation Enduring Freedom / Operation Iraqi Freedom era, be married/cohabitating, be at least 18 years old, be fluent in English and have reliable internet access at home. Using a standard interview protocol [111], each person was assigned one of 3 labels based on their history of suicidal behaviors: (1) none if they had no history, (2) ideation if they had experienced suicidal thoughts but did not act on them and (3) attempt if they had attempted suicide in the past. According to the World Health Organization (https://www.who.int/en/news-room/fact-sheets/detail/suicide), a prior attempt is the most important risk factor for suicide; hence, these labels represent the degree of suicidal risk, from none representing no risk to attempt representing severe risk. Risk was evaluated at three points in time: (i) Baseline, or lifetime risk before the interaction, (ii) 6-month, or risk during the 6 months after the interaction, and (iii) 12-month, or risk during the 6 months after the 6-month evaluation. Table 4.3 shows the label demographics of the participants.

Table 4.3: Demographics of suicide risk labels of subjects (some subjects dropped out as the study progressed)
Risk \ Label   none   ideation   attempt
Baseline       65     37         22
6-month        98     15         1
12-month       101    7          0

While the primary focus of this work is to accurately classify the 3 degrees of a person's suicidal risk, I am also interested in investigating the 2 constituent "one-versus-rest" binary classification scenarios. These could be of importance in scenarios where the goal is to distinguish no risk from some risk, or to identify and isolate attempt, the most important risk factor. Hence, I perform classification experiments for the following 3 scenarios:

(a) Degree of Risk: none vs ideation vs attempt
(b) No-Risk vs Risk: none vs {ideation / attempt}
(c) Non-Severe vs Severe Risk: {none / ideation} vs attempt

Behavior expression patterns are known to vary across speakers of different genders [64] and are also likely to be influenced by the nature and topic of the interaction [10]. To examine whether gender has an impact on the classification performance, I employ the following data partitioning schemes in my experiments:

(a) None: Same model for all speakers and sessions (1 model)
(b) Gender: Separate models for Husband and Wife (2 models)

As part of their participation, couples completed 2 relationship-change (RC) and 1 reasons-for-living (RFL) conversations, or "sessions", in their homes. Each session was video-recorded for 10 minutes, and couples spent 5 to 10 minutes between sessions on filling out questionnaires and reading instructions. In the RFL session, they were asked to discuss what they found meaningful, or what their reasons for living were.
In the RC sessions, they were asked to discuss one of their top areas of discontent and conflict in their relationship, with each person getting to select the topic in one session. I denote the session where the wife picked the topic as W-Conflict and the one where the husband picked as H-Conflict. Randomization was applied to the order of RC and RFL as well as to the order of H-Conflict and W-Conflict. I obtained audio streams from all but one session of one couple, resulting in 370 sessions. I obtained two versions of the dataset:

- Human-annotated speaker timestamps and identity, as well as manual transcription of what each speaker said. I refer to this version as the "Clean" data.
- Automatically diarized and transcribed, which I refer to as the "Noisy" data.

I now detail the process of automatically diarizing and transcribing the data.

4. Automatic Processing

In general, the audio quality was observed to be good; however, some couples recorded their interactions in a noisy environment or did not sit close enough to the recorder to be clearly intelligible. Nevertheless, I retained all their samples, consistent with the goal of this work of using data reflecting real-world operational use conditions. The first, important step towards automatically analyzing couples' conversations is speaker diarization, i.e. identifying "who spoke when" during the interaction. I employ the x-vector [139] based diarization system proposed in [117] for extracting speaker embeddings in each speech segment and clustering them using the spectral clustering approach described in [117]. As part of this approach, a pruning parameter p is tuned to maximize the performance of the diarization, for which I created a dev corpus of randomly selected 5-minute snippets from sessions of 16 couples and manually annotated the speaker labels and timestamps (I thank the members of the USC SCUBA lab for manually annotating the speaker IDs and their corresponding speech segment timestamps). After tuning, the system achieved 20.35% Diarization Error with 2.31% Speaker Error on the dev corpus. Once I obtained the speaker labels S1 and S2, the pitch of their corresponding segments was extracted, and the ID Husband was assigned to the speaker with the lower median pitch and Wife to the remaining speaker. I evaluated this assignment on the dev corpus using ground-truth speaker labels and obtained 100% accuracy. I used the Kaldi [126] ASpIRE chain model (https://kaldi-asr.org/models/m1) but adapted the Language Model (LM) on related psychotherapy data in order to improve recognition accuracy. Adaptation was performed by interpolating, with equal weights, 3 LMs that were trained on ASpIRE [77], cantab-TEDLIUM [154] and a mix of Couples Therapy [36] and Motivational Interviewing [5] corpora, using SRILM [143]. Session transcripts were built from 1-best hypotheses, and spurious word insertions were eliminated by setting a minimum threshold of 0.9 for the confusion confidence score and 4 for the confusion rank.
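As an illustration of the pitch-based role assignment used in the processing pipeline above, the sketch below compares the median F0 of the two diarized speaker clusters; the F0 values are assumed to come from any standard pitch tracker, and the function name is a placeholder rather than the exact tooling used in this work.

```python
import numpy as np

def assign_roles(f0_cluster_a: np.ndarray, f0_cluster_b: np.ndarray) -> dict:
    """Map diarized clusters 'A'/'B' to Husband/Wife by the median F0 (Hz) of
    their voiced frames; the lower-pitched cluster is labelled Husband."""
    f0_a = f0_cluster_a[f0_cluster_a > 0]   # keep voiced frames only
    f0_b = f0_cluster_b[f0_cluster_b > 0]
    if np.median(f0_a) < np.median(f0_b):
        return {"Husband": "A", "Wife": "B"}
    return {"Husband": "B", "Wife": "A"}
```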
5. Results & Discussion

Tables 4.4 and 4.5 show the results of the best systems for baseline and future risk prediction with clean data, respectively. I see that the system is better at predicting future risk, with the highest performance of 80.48% for 12-month risk, than at predicting baseline risk. This is in line with expectations, since future risk is evaluated closer in time to the interactions than the baseline risk, which is a lifetime variable.

I can see that every system makes use of multiple diverse features, ranging from lexical and acoustic to interaction dynamics, hence emphasizing the need for multimodal approaches. I also see that every system makes use of both the speaker's as well as the partner's features, even when predicting the speaker's risk. In addition, it is observed that entrainment and turn-taking features contribute to nearly every system in every task. This demonstrates that the partner's information is critical for accurate modeling of the speaker's behaviors. Finally, I see that the lexical features appear frequently in the best systems, both at the word-level and turn-level as well as in the multi-scale behavior features.

Table 4.4: Test Macro-Recall % for Baseline Risk Prediction with Clean Data. Best system's features shown in blue for the acoustic modality and green for the lexical modality.
Prediction tasks: Degree of Risk | Risk vs. No Risk | Severe vs. Non-Severe Risk
Chance: 33 | 50 | 50
Best System: 52.47 | 69.36 | 75.23
Speaker Features (per task): Entrainment, Entrainment, Entrainment; Low-level Descriptors, Sentiment Count, Multi-scale Behavior; Multi-scale Behavior, Sentence Embedding, Sentence Embedding; Turn-taking Cues, Sentiment Count, Sentiment Score; Turn-taking Cues
Partner Features (per task): Sentence Embedding, Low-level Descriptors, Low-level Descriptors; Sentiment Count, Sentence Embedding, Sentiment Count; Turn-taking Cues, Sentiment Count, Sentiment Score; Turn-taking Cues, Turn-taking Cues

Table 4.5: Test Macro-Recall % for 6-month and 12-month Risk Prediction with Clean Data. Best system's features shown in blue for the acoustic modality and green for the lexical modality.
Prediction task (both columns): Risk vs. No Risk - 6-month | 12-month
Chance: 50 | 50
Best System: 76.21 | 80.48
Speaker Features (per column): Entrainment, Entrainment; Low-level Descriptors, Sentence Embedding; Multi-scale Behavior, Sentiment Score; Sentence Embedding, Turn-taking Cues; Sentiment Count; Turn-taking Cues
Partner Features (per column): Low-level Descriptors, Sentence Embedding; Sentiment Score, Sentiment Score; Turn-taking Cues

Tables 4.6 and 4.7 show the results of the best systems for baseline and future risk prediction with noisy data, respectively. I immediately see a drop in performance of up to 5% average recall compared to the clean data results. This is to be expected, since the noisy data introduces errors into the modeling and classification process. Nevertheless, my system is still able to achieve a high of 77% average recall when predicting 12-month risk. Similar to the clean data, I observe that the system is better at predicting future risk than baseline risk, underscoring the earlier conjecture.

Once again, I see that the partner's features are always used for risk prediction and that either turn-taking or entrainment is present in the best features, thereby stressing the importance of using the partner's information. Interestingly, I see that word-level features, such as LIWC-based sentiment, feature prominently in the noisy systems, in contrast to the sentence-level features in the clean data. This could be because the sentence embeddings are corrupted by erroneous transcription, whereas the impact of such errors on individual words is much smaller.

Table 4.6: Test Macro-Recall % for Baseline Risk Prediction with Noisy Data. Best system's features shown in blue for the acoustic modality and green for the lexical modality.
Prediction tasks: Degree of Risk | Risk vs. No Risk | Severe vs. Non-Severe Risk
Chance: 33 | 50 | 50
Best System: 50.13 | 65.20 | 70.49
Speaker Features (per task): Entrainment, Entrainment, Entrainment; Multi-scale Behavior, Sentiment Count, Multi-scale Behavior; Sentence Embedding, Sentiment Count
Partner Features (per task): Low-level Descriptors, Low-level Descriptors, Low-level Descriptors; Sentiment Count, Sentiment Count, Sentiment Count; Sentiment Score

Table 4.7: Test Macro-Recall % for 6-month and 12-month Risk Prediction with Noisy Data. Best system's features shown in blue for the acoustic modality and green for the lexical modality.
Prediction task (both columns): Risk vs. No Risk - 6-month | 12-month
Chance: 50 | 50
Best System: 72.77 | 79.49
Speaker Features (per column): Entrainment, Entrainment; Low-level Descriptors, Low-level Descriptors; Multi-scale Behavior, Sentence Embedding; Sentiment Score, Sentiment Count; Turn-taking Cues
Partner Features (per column): Low-level Descriptors, Sentiment Score; Multi-scale Behavior, Turn-taking Cues; Sentiment Count, Sentiment Score; Turn-taking Cues

4.5 Discussion and Conclusions

In this chapter, I proposed a model of how a speaker's behavior changes over time as a result of interacting with their partner. My approach described a generative process of how one person is perceived based on what they said and how this perception, in conjunction with the speaker's own past behavior, affects their subsequent behavior. By achieving higher accuracy over a single-speaker model on the task of classifying Negative behavior in Couples Therapy, I demonstrated the effectiveness of incorporating information from both participants in a conversation. In addition, I investigated whether my model could provide insights into how interaction dynamics relate to the long-term quality of a couple's relationship. I found that some outcomes tend to be associated with how proportionately a speaker is affected by their partner's perceived behavior with respect to their own past behavior.

I also demonstrated the feasibility of an automated, multimodal approach to classifying the speaker's long-term behaviors, such as suicidal risk, based on how they and their partner behaved with each other during conversational interactions. Employing a multimodal approach was found to be effective across different risk prediction tasks over different noise conditions, with an individual best performance of 80.48% average recall for predicting future risk. I also showed that extracting a diverse set of behavior-laden language cues, such as word-level, turn-level and multi-scale, contributed to the best performance in many tasks. In addition, the partner's information was found to be extremely valuable, both in terms of characterizing the entrainment dynamics as well as in explicit observation of their cues. Finally, I showed that this approach could work in real-world, noisy conditions without precipitous drops in performance.

As part of future work, I will implement comprehensive optimization of the model while also jointly training all of its components. I also plan on investigating which behaviors benefit from influence modeling, which ones do not, and why. Another area where my model can potentially be improved is in replacing the discrete behavior states with continuous ones, such as in the Kalman filter or recursive neural networks. I will also investigate incorporating my more recent advances in language-to-behavior mapping, as in [150]. I also plan on further investigating the relation between influence parameters and outcomes with the help of the extensions mentioned above - refined estimation of the parameters α and β for all behaviors - as well as by jointly analyzing the influence parameters and states.

Chapter 5
Conclusion and proposed work

5.1 Summary

This dissertation presented various approaches for modeling the behaviors of a speaker based on the language cues observed in their interactions with spouses. First, I showcased work on quantifying a specific behavior of a single speaker by observing only their cues and evaluated their effectiveness on behavior classification tasks in mental health domains such as Couples Therapy and Addiction Counseling. I found that temporal-context-dependent as well as statistically-pooled models were able to accurately capture constructs such as Negative and Empathy at different scales of observation, from just a few words to a speaker turn.

Next, I proposed two variants of a Neural Network-based modeling framework that leverages out-of-domain information in a context-independent manner to better estimate behavior. The effectiveness of neural architectures at estimating affective constructs is explored at both the word-level as well as the sentence-level across different applications. The first variant, a word-level Recurrent Neural Model, was shown to perform better than the previous models at the same task while also improving upon the inter-annotator agreement with the worst annotator removed. The second variant, a sentence-level Feedforward Neural Model, was shown to accurately identify useful regions of behaviors even in scenarios where affective behavior is rarely observed. I find that context-dependent latent-space models and context-independent neural network-based models are both effective at modeling how humans express affective behaviors such as negative and positive during interactions.

Then, as part of developing a framework that can model speaker-partner influence across multiple behaviors, I analyzed a variety of behavior constructs in order to understand the appropriate scales at which their cues should be observed and the appropriateness of different statistical aggregation mechanisms. Affect-related constructs were, in general, accurately quantifiable at short observation scales using an aggregation mechanism that focused on the typical degree of behavior expressed during an interaction. On the other hand, communicative behaviors, such as those related to Problem-Solving, were found to perform better at longer observation scales using aggregation mechanisms that captured extreme-valued events. Dysphoric behaviors, however, could not be estimated accurately solely from language cues, therefore suggesting that perhaps a multimodal approach would be needed.

Following this, I investigated the usefulness of the partner's information for better modeling the speaker's short-term and long-term behavioral attributes. First, I proposed an extension to a latent-state, temporal-context-dependent single-speaker model that also incorporates the partner's influence on the speaker. I tested this model on similar behavior classification tasks and demonstrated that, in addition to improving the modeling accuracy, my model parameters also exhibited an association with long-term therapy outcomes that are of interest to therapists in these domains.
Finally, I built upon the findings from the previous works to propose a multi-scale multimodal representation of speaker and partner interaction cues that would also characterize their interaction dynamics. I evaluated this comprehensive behavioral profile on the task of suicidal risk classification based solely on interaction cues and found that the partner's information, interaction dynamics and lexical information from diverse sources, whether word-level sentiment or multi-scale behavior, were all extremely useful for accurately predicting both the lifetime as well as the future risk.

5.2 Proposed Work

My findings indicate that the process of modeling behaviors from language cues should be a multifaceted one, with multiple scales, modeling frameworks and aggregation mechanisms to be accounted for in order to handle a wide variety of applications. Hence, I now list 3 principles to guide my proposed work, which will aim to create a comprehensive behavior modeling framework:

1. Ability to extract and integrate cues from multiple scales of the interaction, such as word-level, turn-level, segment-level, etc.
2. Capacity for explicitly characterizing speaker-partner dynamics using causal, temporal-context-dependent and non-linear mechanisms
3. Ability to focus on salient parts of the interaction and to integrate cues in a learned, appropriate manner in order to gather adequate information for jointly modeling multiple behavioral constructs

There exist many techniques and methods for extracting and integrating information at multiple scales, the most popular and well-suited framework for which is the Convolutional Neural Network (CNN) [82]. In the past, many works have successfully used CNNs to automatically learn to extract effective features from text documents for applications such as sentiment analysis [74]. However, such methods consider information from only one scale and do not address the problem of integrating information across scales. Spatial Pyramid Pooling (SPP) [62] has been proposed as a technique to resolve this problem and has been shown to be successful in many computer vision tasks. Hence, I propose to employ SPP in conjunction with CNNs on the transcripts of my interactions to extract language cues of both speakers at different observation window lengths.

In order to selectively focus on salient parts of the interaction and to integrate information over time, I employ the attention mechanism [57], which has been demonstrated to be a valuable tool in many natural language processing tasks [8]. Specifically, I propose to use variants such as Self-Attention [92] and Multi-head Attention [152], both of which compute different aggregates of the same information by focusing on different aspects and integrating them in different ways. Using this, I propose to implement a rich system of information aggregation that can be jointly used for learning the various types of target behavioral constructs in my applications.

Finally, I propose to incorporate the above-described multi-scale, multi-attention techniques into existing speaker-partner influence approaches in order to extend them to handle multi-behavior influence modeling. Existing techniques [125, 162] are designed for emotion recognition, where classification is performed at only the turn-level scale. Hence, I propose to extend these techniques by replacing the pre-determined turn-level features with multi-scale features that can be learned to maximize the behavior estimation or classification accuracy. I will also work on incorporating information from different modalities, since this is expected to contribute to better modeling performance for behaviors such as those in the Dysphoric group, which are typically not expressed adequately in spoken language. Finally, I will aim to implement this multi-behavior influence framework using generative models such as Variational Auto-Encoders (VAEs) [76], which have been used in previous works to learn latent variables for generating data, so that I can learn models of how human speakers influence each other's behaviors and express them in different modalities.

References

[1] General psychotherapy corpus. In Alexander Street Press.
[2] http://www.statmt.org/wmt14/training-monolingual-news-crawl/.
[3] Kaat Alaerts, Evelien Nackaerts, Pieter Meyns, Stephan P Swinnen, and Nicole Wenderoth. Action and emotion recognition from point light displays: an investigation of gender differences. PloS one, 6(6):e20989, 2011.
[4] Nalini Ambady and Robert Rosenthal. Thin slices of expressive behavior as predictors of interpersonal consequences: A meta-analysis. Psychological bulletin, 111(2):256, 1992.
[5] David C Atkins, Mark Steyvers, Zac E Imel, and Padhraic Smyth. Scaling up the evaluation of psychotherapy: evaluating motivational interviewing fidelity via statistical text classification. Implementation Science, 2014.
[6] Hoda Badr. New frontiers in couple-based interventions in cancer care: refining the prescription for spousal communication. Acta Oncologica, 56(2):139-145, 2017.
[7] John S Baer, Elizabeth A Wells, David B Rosengren, Bryan Hartzler, Blair Beadnell, and Chris Dunn. Agency context and tailored training in technology transfer: A pilot evaluation of motivational interviewing training for community counselors. Journal of substance abuse treatment, 37(2):191-202, 2009.
[8] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[9] C Daniel Batson. These things called empathy: eight related but distinct phenomena. 2009.
[10] Brian R Baucom, Pamela T McFarland, and Andrew Christensen. Gender, topic, and time in observed demand-withdraw interaction in cross- and same-sex couples. Journal of Family Psychology, 24(3):233, 2010.
[11] Katherine JW Baucom, Brian R Baucom, and Andrew Christensen. Changes in dyadic communication during and after integrative and traditional behavioral couple therapy. Behaviour research and therapy, 65:18-28, 2015.
[12] Katherine JW Baucom, Mia Sevier, Kathleen A Eldridge, Brian D Doss, and Andrew Christensen. Observed communication in couples two years after integrative and traditional behavioral couple therapy: Outcome and link with five-year follow-up. Journal of consulting and clinical psychology, 79(5):565, 2011.
[13] Roy F Baumeister, Ellen Bratslavsky, Catrin Finkenauer, and Kathleen D Vohs. Bad is stronger than good. Review of general psychology, 5(4):323-370, 2001.
[14] Roy F Baumeister, Kathleen D Vohs, C Nathan DeWall, and Liqing Zhang. How emotion shapes behavior: Feedback, anticipation, and reflection, rather than direct causation. Personality and social psychology review, 11(2):167-203, 2007.
[15] Paul S Bellet and Michael J Maloney. The importance of empathy as an interviewing skill in medicine. JAMA, 266(13):1831-1832, 1991.
[16] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model.
Journal of machine learning research, 3(Feb):1137{1155, 2003. [17] Yoshua Bengio, Holger Schwenk, Jean-S ebastien Sen ecal, Fr ederic Morin, and Jean-Luc Gauvain. Innovations in Machine Learning: Theory and Applications, chapter Neural Prob- abilistic Language Models, pages 137{186. Springer Berlin Heidelberg, Berlin, Heidelberg, 2006. [18] Sara B Berns, Neil S Jacobson, and John M Gottman. Demand{withdraw interaction in couples with a violent husband. Journal of Consulting and Clinical Psychology, 67(5):666, 1999. [19] M. P. Black, A. Katsamanis, C.-C. Lee, A. Lammert, B. R. Baucom, A. Christensen, P. G. Georgiou, and S. S. Narayanan. Automatic classication of married couples' behavior using audio features. In Proceedings of InterSpeech, 2010. [20] Matthew Black, Panayiotis Georgiou, Athanasios Katsamanis, Brian Baucom, and Shrikanth Narayanan. \You made me do it": Classication of blame in married couples' interaction by fusing automatically derived speech and language information. In Proceedings of InterSpeech, Florence, Italy, August 2011. [21] Matthew P Black, Athanasios Katsamanis, Brian R Baucom, Chi-Chun Lee, Adam C Lam- mert, Andrew Christensen, Panayiotis G Georgiou, and Shrikanth S Narayanan. Toward automating a human behavioral coding system for married couples' interactions using speech acoustic features. Speech communication, 55(1):1{21, 2013. [22] Melinda C Blackman and David C Funder. The eect of information on consensus and accuracy in personality judgment. Journal of Experimental Social Psychology, 34(2):164{ 181, 1998. [23] Daniel Bone, Chi-Chun Lee, Theodora Chaspari, James Gibson, and Shrikanth Narayanan. Signal processing and machine learning for mental health research and clinical applications. IEEE Signal Processing Magazine, 2017. [24] Hana Boukricha, Ipke Wachsmuth, Maria Nella Carminati, and Pia Knoeferle. A compu- tational model of empathy: Empirical evaluation. In Aective Computing and Intelligent Interaction (ACII), 2013 Humaine Association Conference on, pages 1{6. IEEE, 2013. [25] Scott R Braithwaite, Christophe Giraud-Carrier, Josh West, Michael D Barnes, and Carl Lee Hanson. Validating machine learning algorithms for twitter data against established mea- sures of suicidality. JMIR mental health, 2016. [26] Taylor A Burke, Brooke A Ammerman, and Ross Jacobucci. The use of machine learning in the study of suicidal and non-suicidal self-injurious thoughts and behaviors: A systematic review. Journal of aective disorders, 2018. [27] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. Iemocap: Interactive emotional dyadic motion capture database. Language resources and evaluation, 42(4):335, 2008. 111 [28] Dana R Carney, C Randall Colvin, and Judith A Hall. A thin slice perspective on the accuracy of rst impressions. Journal of Research in Personality, 41(5):1054{1072, 2007. [29] Sandeep-Nallan Chakravarthula, Brian Baucom, and Panayiotis Georgiou. Modeling inter- personal in uence of verbal behavior in couples therapy dyadic interactions. Proc. Inter- speech 2018, pages 2339{2343, 2018. [30] Sandeep Nallan Chakravarthula, Brian RW Baucom, Shrikanth Narayanan, and Panayiotis Georgiou. An analysis of observation length requirements for machine understanding of human behaviors from spoken language. Computer Speech & Language, 66:101162, 2021. [31] Sandeep Nallan Chakravarthula, Rahul Gupta, Brian Baucom, and Panayiotis Georgiou. 
A language-based generative model framework for behavioral analysis of couples' therapy. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pages 2090{2094. IEEE, 2015. [32] Sandeep Nallan Chakravarthula, Md Nasir, Shao-Yen Tseng, Haoqi Li, Tae Jin Park, Brian Baucom, Craig J Bryan, Shrikanth Narayanan, and Panayiotis Georgiou. Automatic pre- diction of suicidal risk in military couples using multimodal interaction cues from couples conversations. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6539{6543. IEEE, 2020. [33] Sandeep Nallan Chakravarthula, Bo Xiao, Zac E Imel, David C Atkins, and Panayiotis G Georgiou. Assessing empathy using static and dynamic behavior models based on thera- pist's language in addiction counseling. In Sixteenth Annual Conference of the International Speech Communication Association, 2015. [34] Qijin Cheng, Tim MH Li, Chi-Leung Kwok, Tingshao Zhu, and Paul SF Yip. Assessing suicide risk and emotional distress in chinese social media: A text mining and machine learning study. Journal of medical internet research, 2017. [35] Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder{decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103{111, 2014. [36] A. Christensen, D.C. Atkins, S. Berns, J. Wheeler, D.H. Baucom, and L.E. Simpson. Tradi- tional versus integrative behavioral couple therapy for signicantly and chronically distressed married couples. Journal of Consulting and Clinical Psychology, 72(2):176{191, 2004. [37] Benjamin L Cook, Ana M Progovac, Pei Chen, Brian Mullin, Sherry Hou, and Enrique Baca-Garcia. Novel use of natural language processing (nlp) to predict suicidal ideation and psychiatric symptoms in a text-based mental health intervention in madrid. Computational and mathematical methods in medicine, 2016. [38] William L Cook and David A Kenny. The actor{partner interdependence model: A model of bidirectional eects in developmental studies. International Journal of Behavioral Devel- opment, 29(2):101{109, 2005. [39] Ailbhe Cullen and Naomi Harte. Thin slicing to predict viewer impressions of ted talks. In Proceedings of the 14th International Conference on Auditory-Visual Speech Processing, 2017. [40] Nicholas Cummins, Stefan Scherer, Jarek Krajewski, Sebastian Schnieder, Julien Epps, and Thomas F Quatieri. A review of depression and suicide risk assessment using speech analysis. Speech Communication, 71:10{49, 2015. 112 [41] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incom- plete data via the em algorithm. Journal of the royal statistical society. Series B (method- ological), pages 1{38, 1977. [42] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre- training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018. [43] Birk Diedenhofen and Jochen Musch. cocor: A comprehensive solution for the statistical comparison of correlations. PloS one, 10(4):e0121945, 2015. [44] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121{ 2159, 2011. [45] Lee Ellington, Maija Reblin, Patricia Berry, Janine Giese-Davis, and Margaret F Clayton. 
Re ective research: supporting researchers engaged in analyzing end-of-life communication. Patient education and counseling, 91(1):126{128, 2013. [46] Florian Eyben, Felix Weninger, Florian Gross, and Bj orn Schuller. Recent developments in opensmile, the munich open-source multimedia feature extractor. In Proceedings of the 21st ACM international conference on Multimedia, pages 835{838. ACM, 2013. [47] Florian Eyben, Martin W ollmer, and Bj orn Schuller. Openear|introducing the munich open-source emotion and aect recognition toolkit. In 2009 3rd international conference on aective computing and intelligent interaction and workshops, pages 1{6. IEEE, 2009. [48] Norma Deitch Feshbach. 12 parental empathy and child adjustment/maladjustment. Em- pathy and its development, page 271, 1990. [49] Daniel Joseph France, Richard G Shiavi, Stephen Silverman, Marilyn Silverman, and M Wilkes. Acoustical properties of speech as indicators of depression and suicidal risk. IEEE transactions on Biomedical Engineering, 2000. [50] Eibe Frank and Mark Hall. A simple approach to ordinal classication. In European Con- ference on Machine Learning, pages 145{156. Springer, 2001. [51] Panayiotis G Georgiou, Matthew P Black, Adam C Lammert, Brian R Baucom, and Shrikanth S Narayanan. \that's aggravating, very aggravating": Is it possible to classify be- haviors in couple interactions using automatically derived lexical features? In International Conference on Aective Computing and Intelligent Interaction, pages 87{96. Springer, 2011. [52] Panayiotis G Georgiou, Matthew P Black, and Shrikanth S Narayanan. Behavioral signal processing for understanding (distressed) dyadic interactions: some recent developments. In Proceedings of the 2011 joint ACM workshop on Human gesture and behavior understanding, pages 7{12. ACM, 2011. [53] James Gibson, Dogan Can, Bo Xiao, Zac E Imel, David C Atkins, Panayiotis Georgiou, and Shrikanth Narayanan. A deep learning approach to modeling empathy in addiction counseling. Commitment, 111:21, 2016. [54] James Gibson, Bo Xiao, Panayiotis Georgiou, and S. Narayanan. An audio-visual approach to learning salient behaviors in couples' problem solving discussions. In International Con- ference on Audio, Speech and Signal Processing, 2013. 113 [55] John Gideon, Heather T Schatten, Melvin G McInnis, and Emily Mower Provost. Emotion recognition from natural phone conversations in individuals with and without recent suici- dal ideation. In The 20th Annual Conference of the International Speech Communication Association INTERSPEECH 2019, 2019. [56] John M Gottman and Cliord I Notarius. Marital research in the 20th century and a research agenda for the 21st century. Family process, 41(2):159{197, 2002. [57] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013. [58] Alex Graves and J urgen Schmidhuber. Framewise phoneme classication with bidirectional lstm and other neural network architectures. Neural Networks, 18(5):602{610, 2005. [59] Leslie S Greenberg, Jeanne C Watson, Robert Elliot, and Arthur C Bohart. Empathy. Psychotherapy: Theory, Research, Practice, Training, 38(4):380, 2001. [60] Rahul Gupta, Nikolaos Malandrakis, Bo Xiao, Tanaya Guha, Maarten Van Segbroeck, Matthew Black, Alexandros Potamianos, and Shrikanth Narayanan. Multimodal predic- tion of aective dimensions and depression in human-computer interactions. In Proceedings of the 4th International Workshop on Audio/Visual Emotion Challenge, pages 33{40. ACM, 2014. 
[61] Donald P Hartmann and David D Wood. Observational methods. In International handbook of behavior modication and therapy, pages 107{138. Springer, 1990. [62] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE transactions on pattern analysis and machine intelligence, 37(9):1904{1916, 2015. [63] C Heavey, D Gill, and A Christensen. Couples interaction rating system 2 (cirs2). University of California, Los Angeles, 7, 2002. [64] Christopher L Heavey, Christopher Layne, and Andrew Christensen. Gender and con ict structure in marital interaction: A replication and extension. Journal of consulting and clinical psychology, 61(1):16, 1993. [65] Richard L Heinrich, Cyndie Coscarelli Schag, and Patricia A Ganz. Living with cancer: The cancer inventory of problem situations. Journal of Clinical Psychology, 40(4):972{980, 1984. [66] Richard E Heyman. Rapid marital interaction coding system (rmics). In Couple observa- tional coding systems, pages 81{108. Routledge, 2004. [67] Richard E Heyman, Robert L Weiss, and J Mark Eddy. Marital interaction coding system: Revision and empirical evaluation. Behaviour Research and Therapy, 33(6):737{746, 1995. [68] Sepp Hochreiter and J urgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735{1780, 1997. [69] Gao Huang, Yixuan Li, Geo Pleiss, Zhuang Liu, John E Hopcroft, and Kilian Q Wein- berger. Snapshot ensembles: Train 1, get m for free. In International Conference on Learning Representations, 2017. 114 [70] Ganesh Jawahar, Beno^ t Sagot, Djam e Seddah, Samuel Unicomb, Gerardo I~ niguez, M arton Karsai, Yannick L eo, M arton Karsai, Carlos Sarraute, Eric Fleury, et al. What does bert learn about the structure of language? In 57th Annual Meeting of the Association for Computational Linguistics (ACL), Florence, Italy, 2019. [71] J Jones and A Christensen. Couples interaction study: Social support interaction rating system. University of California, Los Angeles, 7, 1998. [72] Mark S Kaplan, Nathalie Huguet, Bentson H McFarland, and Jason T Newsom. Suicide among male veterans: a prospective population-based study. Journal of Epidemiology & Community Health, 61(7):619{624, 2007. [73] Patricia K Kerig and Donald H Baucom. Couple observational coding systems. Taylor & Francis, 2004. [74] Yoon Kim. Convolutional neural networks for sentence classication. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746{1751, 2014. [75] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015. [76] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013. [77] Tom Ko, Vijayaditya Peddinti, Daniel Povey, Michael L Seltzer, and Sanjeev Khudanpur. A study on data augmentation of reverberant speech for robust speech recognition. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017. [78] Alex Krizhevsky and Geo Hinton. Convolutional deep belief networks on cifar-10. Unpub- lished manuscript, 40(7):1{9, 2010. [79] Douglas S Krull and Jody C Dil. Do smiles elicit more inferences than do frowns? the eect of emotional valence on the production of spontaneous inferences. Personality and Social Psychology Bulletin, 24(3):289{300, 1998. [80] Sheherezade L Krzyzaniak, Douglas E Colman, Tera D Letzring, Jennifer S McDonald, and Jeremy C Biesanz. 
The eect of information quantity on distinctive accuracy and normativity of personality trait judgments. European Journal of Personality, 2019. [81] Shiro Kumano, Kazuhiro Otsuka, Masafumi Matsuda, and Junji Yamato. Analyzing per- ceived empathy/antipathy based on reaction time in behavioral coordination. In Automatic Face and Gesture Recognition (FG), 2013 10th IEEE International Conference and Work- shops on, pages 1{8. IEEE, 2013. [82] Yann LeCun, L eon Bottou, Yoshua Bengio, Patrick Haner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278{2324, 1998. [83] C.-C. Lee, A. Katsamanis, M. P. Black, B. R. Baucom, P. G. Georgiou, and S. S. Narayanan. An analysis of PCA-based vocal entrainment measures in married couples' aective spoken interactions. In Proceedings of InterSpeech, Florence, Italy, 2011. [84] C.C. Lee, A. Katsamanis, M. Black, B. Baucom, P. Georgiou, and S. Narayanan. Aec- tive state recognition in married couples interactions using PCA-based vocal entrainment measures with multiple instance learning. Aective Computing and Intelligent Interaction, pages 31{41, 2011. 115 [85] Chi-Chun Lee, Matthew Black, Athanasios Katsamanis, Adam C Lammert, Brian R Bau- com, Andrew Christensen, Panayiotis G Georgiou, and Shrikanth S Narayanan. Quan- tication of prosodic entrainment in aective spontaneous spoken interactions of married couples. In Eleventh Annual Conference of the International Speech Communication Asso- ciation, 2010. [86] Chi-Chun Lee, Carlos Busso, Sungbok Lee, and Shrikanth S Narayanan. Modeling mutual in uence of interlocutor emotion states in dyadic spoken interactions. In Tenth Annual Conference of the International Speech Communication Association, 2009. [87] Chi-Chun Lee, Athanasios Katsamanis, Brian R Baucom, Panayiotis G Georgiou, and Shrikanth S Narayanan. Using measures of vocal entrainment to inform outcome-related be- haviors in marital con icts. In Signal & Information Processing Association Annual Summit and Conference (APSIPA ASC), 2012 Asia-Pacic, pages 1{5. IEEE, 2012. [88] Chi-Chun Lee, Athanasios Katsamanis, Matthew P. Black, Brian Baucom, Andrew Chris- tensen, Panayiotis G. Georgiou, and Shrikanth S. Narayanan. Computing vocal entrain- ment: A signal-derived PCA-based quantication scheme with application to aect analysis in married couple interactions. Computer, Speech, and Language, 2012. [89] Chi-Chun Lee, Athanasios Katsamanis, Matthew P Black, Brian R Baucom, Andrew Chris- tensen, Panayiotis G Georgiou, and Shrikanth S Narayanan. Computing vocal entrainment: A signal-derived pca-based quantication scheme with application to aect analysis in mar- ried couple interactions. Computer Speech & Language, 28(2):518{539, 2014. [90] Chi-Chun Lee, Athanasios Katsamanis, Panayiotis G Georgiou, and Shrikanth S Narayanan. Based on isolated saliency or causal integration? toward a better understanding of human annotation process using multiple instance learning and sequential probability ratio test. In Thirteenth Annual Conference of the International Speech Communication Association, 2012. [91] Haoqi Li, Brian Baucom, and Panayiotis Georgiou. Linking emotions to behaviors through deep transfer learning. PeerJ Computer Science, 6:e246, 2020. [92] Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. A structured self-attentive sentence embedding. 5th International Conference on Learning Representations (ICLR 2017), 2017. 
[93] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019. [94] Sharon Manne, Marne Sherman, Stephanie Ross, Jamie Ostro, Richard E Heyman, and Kevin Fox. Couples' support-related communication, psychological distress, and relation- ship satisfaction among women with early stage breast cancer. Journal of consulting and clinical psychology, 72(4):660, 2004. [95] Soroosh Mariooryad and Carlos Busso. Analysis and compensation of the reaction lag of evaluators in continuous emotional annotations. In 2013 Humaine Association Conference on Aective Computing and Intelligent Interaction, pages 85{90. IEEE, 2013. [96] Robert R McCrae, Paul T Costa Jr, and Catherine M Busch. Evaluating comprehensive- ness in personality systems: The california q-set and the ve-factor model. Journal of Personality, 54(2):430{446, 1986. 116 [97] Scott W McQuiggan and James C Lester. Modeling and evaluating empathy in embodied companion agents. International Journal of Human-Computer Studies, 65(4):348{360, 2007. [98] Tomas Mikolov, Kai Chen, Greg Corrado, and Jerey Dean. Ecient estimation of word representations in vector space. In In Proceedings of Workshop at ICLR, 2013. [99] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Je Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111{3119, 2013. [100] Mario Mikulincer, Victor Florian, Philip A Cowan, and Carolyn Pape Cowan. Attachment security in couple relationships: A systemic model and its implications for family dynamics. Family process, 41(3):405{434, 2002. [101] William R Miller and Stephen Rollnick. Motivational interviewing: Helping people change. Guilford Press, 2012. [102] William R Miller and Gary S Rose. Toward a theory of motivational interviewing. American psychologist, 64(6):527, 2009. [103] Michelle Renee Morales, Stefan Scherer, and Rivka Levitan. A linguistically-informed fusion approach for multimodal depression detection. In NAACL HLT, page 13, 2018. [104] TB Moyers, T Martin, JK Manuel, WR Miller, and D Ernst. Revised global scales: Moti- vational Interviewing Treatment Integrity 3.0, 2007. [105] Theresa B Moyers, Tim Martin, Jennifer K Manuel, William R Miller, and D Ernst. The mo- tivational interviewing treatment integrity (miti) code: Version 2.0. Retrieved from Verf ubar unter: www. casaa. unm. edu [01.03. 2005], 2003. [106] Nora A Murphy, Judith A Hall, Mollie A Ruben, Denise Frauendorfer, Marianne Schmid Mast, Kirsten E Johnson, and Laurent Nguyen. Predictive validity of thin-slice nonverbal behavior from social interactions. Personality and Social Psychology Bulletin, page 0146167218802834, 2018. [107] S. Nallan Chakravarthula, H. Li, S.Y. Tseng, M. Reblin, and P. Georgiou. Predicting behavior in cancer-aicted patient and spouse interactions using speech and language. In- terspeech, 2019. [108] Shrikanth Narayanan and Panayiotis G Georgiou. Behavioral signal processing: Deriv- ing human behavioral informatics from speech and language. Proceedings of the IEEE, 101(5):1203{1233, 2013. [109] Md Nasir, Brian Baucom, Shrikanth Narayanan, and Panayiotis Georgiou. Towards an unsupervised entrainment distance in conversational speech using deep neural networks. Proc. Interspeech 2018, pages 3423{3427, 2018. 
[110] Md Nasir, Brian Robert Baucom, Panayiotis Georgiou, and Shrikanth Narayanan. Predict- ing couple therapy outcomes based on speech acoustic features. PloS one, 12(9):e0185123, 2017. [111] Matthew K Nock, Elizabeth B Holmberg, Valerie I Photos, and Bethany D Michel. Self- injurious thoughts and behaviors interview: Development, reliability, and validity in an adolescent sample. Psychological Assessment, 2007. [112] John Nolan. Stable distributions: models for heavy-tailed data, 2003. 117 [113] Lucas PJJ Noldus. The observer: a software system for collection and analysis of observa- tional data. Behavior Research Methods, Instruments, & Computers, 23(3):415{429, 1991. [114] Arne Ohman, Daniel Lundqvist, and Francisco Esteves. The face in the crowd revisited: a threat advantage with schematic stimuli. Journal of personality and social psychology, 80(3):381, 2001. [115] Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization. Distill, 2(11):e7, 2017. [116] Nansook Park. Military children and families: strengths and challenges during peace and war. American Psychologist, 66(1):65, 2011. [117] Tae Jin Park, Manoj Kumar, Nikolaos Flemotomos, Monisankha Pal, Raghuveer Peri, Rim- ita Lahiri, Panayiotis Georgiou, and Shrikanth Narayanan. The second dihard challenge: System description for usc-sail team. Proc. Interspeech 2019, pages 998{1002, 2019. [118] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic dier- entiation in pytorch. NIPS 2017 workshop, 2017. [119] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blon- del, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Jour- nal of Machine Learning Research, 12:2825{2830, 2011. [120] James W Pennebaker, Martha E Francis, and Roger J Booth. Linguistic inquiry and word count: Liwc 2001. Mahway: Lawrence Erlbaum Associates, 71(2001):2001, 2001. [121] Ver onica P erez-Rosas, Rada Mihalcea, Kenneth Resnicow, Satinder Singh, and Lawrence An. Understanding and predicting empathic behavior in counseling therapy. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1426{1435, 2017. [122] John P Pestian, Jacqueline Grupp-Phelan, Kevin Bretonnel Cohen, Gabriel Meyers, Linda A Richey, Pawel Matykiewicz, and Michael T Sorter. A controlled trial using natural language processing to examine the language of suicidal adolescents in the emergency department. Suicide and Life-Threatening Behavior, 2016. [123] John P Pestian, Michael Sorter, Brian Connolly, Kevin Bretonnel Cohen, Cheryl McCullum- smith, Jery T Gee, Louis-Philippe Morency, Stefan Scherer, Lesley Rohlfs, and STM Re- search Group. A machine learning approach to identifying the thought markers of suicidal subjects: a prospective multicenter trial. Suicide and Life-Threatening Behavior, 47(1):112{ 121, 2017. [124] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227{2237, 2018. 
[125] Soujanya Poria, Erik Cambria, Devamanyu Hazarika, Navonil Majumder, Amir Zadeh, and Louis-Philippe Morency. Context-dependent sentiment analysis in user-generated videos. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 873{883, 2017. 118 [126] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlcek, Y. Qian, P. Schwarz, et al. The Kaldi speech recognition toolkit. In Proc. ASRU, 2011. [127] Maija Reblin, Brian RW Baucom, Margaret F Clayton, Rebecca Utz, Michael Caserta, Dale Lund, Kathi Mooney, and Lee Ellington. Communication of emotion in home hospice cancer care: Implications for spouse caregiver depression into bereavement. Psycho-Oncology, 2019. [128] Maija Reblin, Margaret F Clayton, Kevin K John, and Lee Ellington. Addressing method- ological challenges in large communication data sets: Collecting and coding longitudinal interactions in home hospice cancer care. Health communication, 31(7):789{797, 2016. [129] Maija Reblin, Richard E Heyman, Lee Ellington, Brian RW Baucom, Panayiotis G Geor- giou, and Susan T Vadaparampil. Everyday couples' communication research: Overcoming methodological barriers with technology. Patient education and counseling, 101(3):551{556, 2018. [130] Maija Reblin, Steven K Sutton, Susan T Vadaparampil, Richard E Heyman, and Lee Elling- ton. Behind closed doors: How advanced cancer couples communicate at home. Journal of psychosocial oncology, pages 1{14, 2018. [131] Viktor Rozgi c, Bo Xiao, Athanasios Katsamanis, Brian Baucom, Panayiotis G Georgiou, and Shrikanth Narayanan. Estimation of ordinal approach-avoidance labels in dyadic inter- actions: Ordinal logistic regression approach. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2368{2371. IEEE, 2011. [132] Michelle Hewlett Sanchez, Dimitra Vergyri, Luciana Ferrer, Colleen Richey, Pablo Gar- cia, Bruce Knoth, and William Jarrold. Using prosodic and spectral features in detecting depression in elderly males. In Twelfth Annual Conference of the International Speech Com- munication Association, 2011. [133] Patricia Satterstrom, Jerey T Polzer, Lisa B Kwan, Oliver P Hauser, Wannawiruch Wiruchnipawan, and Marina Burke. Thin slices of workgroups. Organizational Behavior and Human Decision Processes, 151:104{117, 2019. [134] Steven L Sayers. Family reintegration diculties and couples therapy for military veterans and their spouses. Cognitive and Behavioral Practice, 18(1):108{119, 2011. [135] Stefan Scherer, John Pestian, and Louis-Philippe Morency. Investigating the speech char- acteristics of suicidal adolescents. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013. [136] Bj orn Schuller, Michel Valster, Florian Eyben, Roddy Cowie, and Maja Pantic. Avec 2012: the continuous audio/visual emotion challenge. In Proceedings of the 14th ACM interna- tional conference on Multimodal interaction, pages 449{456. ACM, 2012. [137] Mia Sevier, Kathleen Eldridge, Janice Jones, Brian D Doss, and Andrew Christensen. Ob- served communication and associations with satisfaction during traditional and integrative behavioral couple therapy. Behavior therapy, 39(2):137{150, 2008. [138] Leslie N Smith. Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 464{472. IEEE, 2017. 
[139] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur. X-vectors: Robust DNN embeddings for speaker recognition. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., pages 5329{5333, Apr. 2018. 119 [140] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, An- drew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631{1642, 2013. [141] Nitish Srivastava, Georey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhut- dinov. Dropout: a simple way to prevent neural networks from overtting. The journal of machine learning research, 15(1):1929{1958, 2014. [142] Dirk D Steiner and Jerey S Rain. Immediate and delayed primacy and recency eects in performance evaluation. Journal of Applied Psychology, 74(1):136, 1989. [143] Andreas Stolcke. Srilm-an extensible language modeling toolkit. In Seventh international conference on spoken language processing, 2002. [144] Martin Sundermeyer, Ralf Schl uter, and Hermann Ney. Lstm neural networks for language modeling. In Proceedings of Interspeech, pages 194{197, 2012. [145] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104{3112, 2014. [146] Yla R Tausczik and James W Pennebaker. The psychological meaning of words: Liwc and computerized text analysis methods. Journal of language and social psychology, 29(1):24{54, 2010. [147] Mark A Thornton and Diana I Tamir. Mental models accurately predict emotion transitions. Proceedings of the National Academy of Sciences, 114(23):5982{5987, 2017. [148] Shao-Yen Tseng, Brian Baucom, and Panayiotis Georgiou. Approaching human perfor- mance in behavior estimation in couples therapy using deep sentence embeddings. In Pro- ceedings of Interspeech. August 2017, 2017. [149] Shao-Yen Tseng, Brian Baucom, and Panayiotis Georgiou. Unsupervised online multitask learning of behavioral sentence embeddings. PeerJ Computer Science, 5:e200, 2019. [150] Shao-Yen Tseng, Sandeep Nallan Chakravarthula, Brian R Baucom, and Panayiotis G Geor- giou. Couples behavior modeling and annotation using low-resource lstm language models. In INTERSPEECH, pages 898{902, 2016. [151] Shao-Yen Tseng, Haoqi Li, Brian Baucom, and Panayiotis Georgiou. Honey, i learned to talk: Multimodal fusion for behavior analysis. In Proceedings of the 2018 on International Conference on Multimodal Interaction, pages 239{243. ACM, 2018. [152] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998{6008, 2017. [153] Verena Venek, Stefan Scherer, Louis-Philippe Morency, John Pestian, et al. Adolescent suicidal risk assessment in clinician-patient interaction. IEEE Transactions on Aective Computing, 8(2):204{215, 2017. [154] Will Williams, Niranjani Prasad, David Mrva, Tom Ash, and Tony Robinson. Scaling recurrent neural network language models. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015. 120 [155] Wei Xia, James Gibson, Bo Xiao, Brian Baucom, and Panayiotis G Georgiou. A dynamic model for behavioral analysis of couple interactions using acoustic features. 
In Sixteenth Annual Conference of the International Speech Communication Association, 2015. [156] Bo Xiao, Daniel Bone, Maarten Van Segbroeck, Zac E Imel, David C Atkins, Panayiotis G Georgiou, and Shrikanth S Narayanan. Modeling therapist empathy through prosody in drug addiction counseling. In Fifteenth Annual Conference of the International Speech Communication Association, 2014. [157] Bo Xiao, Dogan Can, Panayiotis G Georgiou, David Atkins, and Shrikanth S Narayanan. Analyzing the language of therapist empathy in motivational interview based psychotherapy. In Signal & Information Processing Association Annual Summit and Conference (APSIPA ASC), 2012 Asia-Pacic, pages 1{4. IEEE, 2012. [158] Bo Xiao, Panayiotis Georgiou, Zac E. Imel, David Atkins, and S. Narayanan. \Rate my therapist": Automated detection of empathy in drug and alcohol counseling via speech and language processing. PLOS ONE, December 2015. [159] Bo Xiao, Panayiotis Georgiou, and S. Narayanan. Data driven modeling of head motion towards analysis of behaviors in couple interactions. In International Conference on Audio, Speech and Signal Processing, 2013. [160] Bo Xiao, Panayiotis G Georgiou, Zac E Imel, David C Atkins, and Shrikanth Narayanan. Modeling therapist empathy and vocal entrainment in drug addiction counseling. In IN- TERSPEECH, pages 2861{2865, 2013. [161] Zhaojun Yang and Shrikanth Narayanan. Modeling mutual in uence of multimodal behavior in aective dyadic interactions. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pages 2234{2238. IEEE, 2015. [162] Sung-Lin Yeh, Yun-Shao Lin, and Chi-Chun Lee. An interaction-aware attention network for speech emotion recognition in spoken dialogs. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6685{6689. IEEE, 2019. [163] AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2236{2246, 2018. [164] Guang Yong Zou. Toward using condence intervals to compare correlations. Psychological methods, 12(4):399, 2007. 
Appendix A

Algorithms and Intermediate Results of Behavior Observation Window Length Analysis

A.1 Algorithms

Algorithm 3: Behavior Construct Similarity
1: for each behavior B_i in B do
2:   for each window length L_k in L do
3:     for each functional F_z in F do
4:       Initialize empty lists G, H
5:       for each session transcript T_p in T containing O_p words do
6:         Score T_p at window length L_k to get the trajectory of scores
7:           S_i^p = {S_i^1, S_i^2, ..., S_i^(max(1, O_p - L_k + 1))} for behavior B_i
8:         Compute aggregate score F_z(S_i^p)
9:         Append F_z(S_i^p) to G
10:        Append ground-truth annotation A_i^p to H
11:      end for
12:      Compute Spearman correlation R_z(k) between G and H
13:    end for
14:    BCS_i(k) = {R_z(k) for all z}
15:  end for
16: end for

Algorithm 4: Similarity-based Grouping of Behaviors
1: Calculate R_global = R_G as in Algorithm 5
2: for number of clusters U in {2, 3, 4} do
3:   for 10000 random initializations do
4:     Run K-Means clustering on R_global with U clusters
5:     Store clustering scheme
6:   end for
7:   Pick the most frequently occurring unique clustering scheme cls_U
8:   Calculate cluster size disparity dsp_U = Range(cluster sizes in cls_U)
9: end for
10: Pick Behavior Grouping cls_U* where U* = argmin_U dsp_U

Algorithm 5: Behavior Relationship Consistency
1: for each behavior B_i in B do
2:   for each behavior B_j in B such that j != i do
3:     for each window length L_k in L do
4:       Initialize empty lists C_i, C_j, G_i, G_j
5:       for each session transcript T_p in T containing O_p words do
6:         Score T_p at window length L_k to get the trajectories of scores
7:           S_i^p = {S_i^1, S_i^2, ..., S_i^(max(1, O_p - L_k + 1))} for B_i,
8:           S_j^p = {S_j^1, S_j^2, ..., S_j^(max(1, O_p - L_k + 1))} for B_j
9:         Append S_i^p to C_i, S_j^p to C_j
10:        Append ground-truth annotation A_i^p to G_i, A_j^p to G_j
11:      end for
12:      Compute Spearman correlations R_C(i, j) between C_i and C_j, R_G(i, j) between G_i and G_j
13:      BRC_{i,j}(k) = 1 - |R_C(i, j) - R_G(i, j)| / 2
14:    end for
15:  end for
16: end for

A.2 Intermediate Results for N-gram model

A.2.1 Behavior Construct Similarity

Figure A.1 shows the BCS of the N-gram model scores at the five observation window length values that were tested: {3, 10, 30, 50, 100} words. During the analysis procedure in Sec. 3.5.2, we set the BCS threshold Y_1 = 0.59 since the highest BCS, as can be seen in Figure A.1, is around 0.6. Every behavior is represented by a trajectory, and each point on the trajectory represents the Spearman correlation between ground-truth annotations and the aggregated model scores at that window length. While we test three functionals for aggregation - minimum, median and maximum - we only use the one that performed best, on average, across all window lengths for our analysis. Hence, we only show the best performing functional for each behavior in the BCS plot; all correlations are statistically significant (p < 0.05). We do not use the mean functional because, in some instances, we observe that the system's window-level scores were impulsive. Fitting them to an α-stable distribution results in an α ≤ 1, for which the mean is undefined [112]; the other three statistics, however, are still defined. The best performing behaviors with the N-gram model are Acceptance and Blame, with BCS values greater than 0.6 at nearly all window lengths; hence, we use these as the reliable behaviors B_rel during our analysis. With respect to behavior groups, we see that the Negative behaviors are, on average, the best estimated ones, followed by Positive behaviors.
The BCS for Problem-Solving behaviors varies from moderate (0.45 for Solutions) to extremely low (0.06 for External). Finally, with the exception of Change, the BCS for Dysphoric behaviors is, in general, extremely low. This matches previous studies which have found that behavioral constructs related to negative and positive affect tend to be estimated well from low-level lexical features. From these results, we can now also see that they are, in fact, estimated much more reliably than higher-level and more complex behaviors related to dysphoria and problem-solving. This could be because these behaviors are not expressed sufficiently in language, or because their expression in language, even if sufficient, is too complex to be modeled using N-gram phrases and simple statistics.

Figure A.1: Behavior Construct Similarity for N-gram model: Spearman correlation between human annotations and functional-aggregated scores of the N-gram model at different observation window lengths (3, 10, 30, 50, 100 words), plotted separately for the Negative, Positive, Dysphoric and Problem-Solving groups with the best functional (minimum, median or maximum) per behavior. All correlations are statistically significant (p < 0.05).

In evaluating the choice of functionals, median appears to be the best aggregation method for nearly every behavior. On the other hand, maximum and minimum perform best for some behaviors such as Dominance and Withdrawal. In cases where median is the best functional, we see that the BCS does not change much even when going from the shortest window length to the longest one possible. This, however, does not imply that all window lengths are equally appropriate for such behaviors. As shown in Figure A.2, the scores from the N-gram model tend to be symmetrically distributed, a pattern which was also reported in [150]. As a result, any change in scores resulting from changes in the window length would not be reflected by the median and, hence, the BCS would not change, giving the false impression that all windows are equally appropriate. Hence, to further disambiguate this, we also check the Behavior Relationship Consistency (BRC) metric.

Figure A.2: Sample distribution of Dominance and External scores at window lengths 3, 100 and session-length: In both behaviors, the median 3-gram as well as 100-gram scores are very similar to the session-level scores, possibly due to symmetrical distributions.
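Before moving to the BRC results, the following is a minimal sketch of how a BCS value such as those reported in Figure A.1 can be computed, following Algorithm 3. The scorer score_windows is a hypothetical stand-in for the N-gram (or Neural) behavior model that returns one score per sliding window; it is not the implementation used in this thesis.

```python
import numpy as np
from scipy.stats import spearmanr

def behavior_construct_similarity(transcripts, annotations, window_lengths, score_windows):
    """BCS: Spearman correlation between functional-aggregated window scores
    and session-level human annotations, per window length and functional."""
    functionals = {"min": np.min, "median": np.median, "max": np.max}
    bcs = {}
    for L in window_lengths:
        for name, F in functionals.items():
            # aggregate each session's trajectory of window scores with the functional
            aggregated = [F(score_windows(transcript, L)) for transcript in transcripts]
            rho, _ = spearmanr(aggregated, annotations)
            bcs[(L, name)] = rho
    return bcs
```

Only the functional that performs best on average across window lengths would then be reported for each behavior, as described above.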
A.2.2 Behavior Relationship Consistency

Tables A.1 and A.2 display the Spearman correlations between different target behaviors and Acceptance and Blame respectively, which are the reliable behaviors for the N-gram model. Q represents the "true correlation", i.e. the correlation between the ground-truth annotations. Q'(L) represents the correlation between the model's scores at window length L. Negative values signify that the behaviors are dissimilar or opposite, whereas positive values signify that the two behaviors are similar. For example, we can see from Table 3.2 that Anger is similar to Blame but dissimilar to Acceptance. This is reflected in the Q for Acceptance and Anger in Table A.1, which is -0.653 since they are dissimilar, whereas the Q for Blame and Anger in Table A.2 is 0.673 since they are similar. Next, we calculate the normalized weights at each window length for the two reliable behaviors Acceptance and Blame as shown in Eqn. 3.6; these are shown in the BRC weight row of Tables A.1 and A.2 respectively. Finally, we plug Q, Q' and these weights into Eqn. 3.5 to obtain the BRC of each target behavior at every window length.

Reliable Behavior: Acceptance
BRC weight: 0.501 (L=3), 0.507 (L=10), 0.516 (L=30), 0.514 (L=50), 0.516 (L=100)

Target Behavior        Q       Q'(3)   Q'(10)  Q'(30)  Q'(50)  Q'(100)
Discussion             0.178   -0.007  0.015   0.018   0.012   0.005
External               0.256   0.181   0.166   0.192   0.216   0.256
Negotiates             0.237   0.199   0.198   0.227   0.252   0.292
Perspective            0.099   0.068   0.041   0.021   0.006   -0.022
Responsibility         0.328   0.198   0.221   0.257   0.284   0.329
Solicit-suggestions    0.379   0.297   0.305   0.349   0.381   0.434
Solutions              0.282   0.323   0.326   0.351   0.37    0.397
Anger                  -0.653  -0.499  -0.564  -0.627  -0.66   -0.71
Belligerence           -0.646  -0.483  -0.538  -0.598  -0.632  -0.684
Defensiveness          -0.564  -0.446  -0.487  -0.548  -0.585  -0.644
Disgust                -0.66   -0.424  -0.485  -0.545  -0.579  -0.632
Negative               -0.729  -0.656  -0.708  -0.757  -0.782  -0.819
Affection              0.582   0.283   0.299   0.348   0.384   0.44
Agreement              0.347   0.269   0.259   0.284   0.307   0.34
Define                 0.517   0.439   0.497   0.556   0.59    0.646
Positive               0.67    0.575   0.619   0.674   0.705   0.753
Satisfaction           0.563   0.456   0.486   0.538   0.572   0.624
Support-emotional      0.492   0.239   0.257   0.301   0.333   0.386
Support-instrumental   0.46    0.32    0.342   0.397   0.432   0.487
Anxiety                -0.299  -0.107  -0.146  -0.182  -0.201  -0.233
Avoidance              -0.17   0.004   -0.015  -0.022  -0.022  -0.022
Change                 -0.474  -0.47   -0.528  -0.579  -0.606  -0.653
Dominance              -0.144  -0.201  -0.202  -0.209  -0.216  -0.237
Sadness                -0.131  -0.059  -0.069  -0.075  -0.074  -0.07
Withdrawal             -0.164  0.019   -       0.003   0.008   0.014

Table A.1: Spearman correlation between window-level scores of Acceptance and target behaviors with the N-gram model: Q and Q' refer to the correlations used to calculate the pair BRC in Eqn. 3.4. The BRC weight row gives the proportional weight used to calculate the individual BRC in Eqn. 3.5. All correlations are statistically significant (p < 0.05) unless marked as -.
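To make the connection between these table entries and the BRC explicit, here is a minimal sketch of the pair-wise BRC from Algorithm 5 (the quantity of Eqn. 3.4), together with a weighted roll-up across reliable behaviors. The exact form of Eqn. 3.5 is defined in Chapter 3, so the roll-up below is an illustrative assumption, and all function names are hypothetical.

```python
import numpy as np
from scipy.stats import spearmanr

def pair_brc(window_scores_i, window_scores_j, annotations_i, annotations_j):
    """Pair-wise BRC of Algorithm 5: 1 - |R_C(i, j) - R_G(i, j)| / 2."""
    r_c, _ = spearmanr(window_scores_i, window_scores_j)  # correlation of model score trajectories
    r_g, _ = spearmanr(annotations_i, annotations_j)       # correlation of ground-truth annotations
    return 1.0 - abs(r_c - r_g) / 2.0

def weighted_brc(pair_brcs, weights):
    """Assumed roll-up for Eqn. 3.5: weighted average of the pair BRCs against
    each reliable behavior, with weights normalized to sum to 1."""
    weights = np.asarray(weights, dtype=float)
    return float(np.dot(weights / weights.sum(), pair_brcs))
```

The pair BRC equals 1 when the model's score correlation matches the ground-truth correlation exactly, and decreases toward 0 as the relationship between the two behaviors becomes more distorted.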
During the analysis procedure in Sec. 3.5.2, we set the BRC threshold Y_2 = 0.95 in order to be as close to 1 as practically possible. This now enables us to observe changes in the quality of the N-gram model scores which are otherwise not reflected in the BCS. For instance, for the behavior Solutions, the BCS is nearly the same, around 0.45, at all the window lengths. However, when compared to its ground-truth correlations with Acceptance and Blame (0.282 and -0.184 respectively), we see that the N-gram score correlations at 50 words (0.37 and -0.153 respectively) are more similar to them than those at 3 words (0.323 and 0.006 respectively). Therefore, based on this, we can conclude that a window length of 50 words is more appropriate for scoring the behavior Solutions than 3 words.

Reliable Behavior: Blame
BRC weight: 0.499 (L=3), 0.493 (L=10), 0.484 (L=30), 0.486 (L=50), 0.484 (L=100)

Target Behavior        Q       Q'(3)   Q'(10)  Q'(30)  Q'(50)  Q'(100)
Discussion             0.085   -0.21   -0.191  -0.174  -0.161  -0.146
External               -0.073  0.212   0.176   0.12    0.08    0.015
Negotiates             -0.083  0.181   0.138   0.076   0.033   -0.036
Perspective            -       0.141   0.124   0.123   0.127   0.139
Responsibility         -0.239  0.176   0.102   0.029   -0.016  -0.092
Solicit-suggestions    -0.23   0.083   0.013   -0.067  -0.116  -0.195
Solutions              -0.184  0.006   -0.057  -0.117  -0.153  -0.207
Anger                  0.673   0.736   0.739   0.762   0.778   0.806
Belligerence           0.677   0.738   0.735   0.756   0.773   0.8
Defensiveness          0.522   0.639   0.635   0.666   0.689   0.729
Disgust                0.69    0.707   0.708   0.729   0.745   0.773
Negative               0.693   0.774   0.784   0.809   0.825   0.852
Affection              -0.352  0.152   0.081   -0.004  -0.058  -0.143
Agreement              -0.295  0.118   0.072   0.008   -0.032  -0.097
Define                 -0.353  -0.536  -0.552  -0.594  -0.622  -0.674
Positive               -0.547  -0.201  -0.297  -0.39   -0.44   -0.522
Satisfaction           -0.537  -0.095  -0.182  -0.27   -0.32   -0.401
Support-emotional      -0.326  0.181   0.119   0.042   -0.008  -0.086
Support-instrumental   -0.343  0.089   0.004   -0.089  -0.143  -0.23
Anxiety                0.17    0.425   0.408   0.413   0.417   0.426
Avoidance              0.085   0.333   0.307   0.287   0.273   0.253
Change                 0.7     0.703   0.704   0.726   0.743   0.77
Dominance              0.293   0.174   0.224   0.234   0.24    0.254
Sadness                0.198   0.394   0.354   0.328   0.312   0.288
Withdrawal             -       0.293   0.268   0.241   0.224   0.2

Table A.2: Spearman correlation between window-level scores of Blame and target behaviors with the N-gram model: Q and Q' refer to the correlations used to calculate the pair BRC in Eqn. 3.4. The BRC weight row gives the proportional weight used to calculate the individual BRC in Eqn. 3.5. All correlations are statistically significant (p < 0.05) unless marked as -.

A.3 Intermediate Results for Neural model

A.3.1 Behavior Construct Similarity

Figure A.3 shows the BCS of the Neural model scores at the two observation window lengths tested, 3 and 30 words. For each behavior, a bar represents the Spearman correlation between ground-truth annotations and the aggregated Neural model score at that window length. Similar to the N-gram model, we used the best performing functional, on average, for our analysis and display it for each behavior in the BCS plot; all correlations are statistically significant (p < 0.05). Similar to the N-gram model, for the analysis procedure in Sec. 3.5.2, we set the BCS threshold Y_1 = 0.59 since the highest BCS, as can be seen in Figure A.3, is around 0.6. The best performing behavior with the Neural model is Blame; hence, it is used as the reliable behavior in our analysis. Once again, we see that the best estimated behaviors belong to the Negative group, followed by Positive.
Interestingly, however, we observe low-to-moderate BCS for both Dysphoric behaviors, which are generally subtle and non-verbal, as well as Problem-Solving behaviors, which are generally verbose. This shows that the contextual embeddings of ELMo, in conjunction with the long-term, non-linear processing of the GRU, are able to handle both scenarios' diverse linguistic requirements equally well. We also see that the BCS for some behaviors changes noticeably as the window length increases from 3 to 30 words. Technically, this variation can be attributed not just to the change in window length but also to the quality of the model trained at that window length. However, since we tuned for the best Neural model at each window length, we assume that the quality of training is similar across window lengths and that the variation in BCS is mostly due to the change in window length.

Figure A.3: Behavior Construct Similarity for Neural model: Spearman correlation between human annotations and functional-aggregated scores of the Neural model at observation window lengths of 3 and 30 words, plotted per behavior group with the best functional (minimum, median or maximum) per behavior. All correlations are statistically significant (p < 0.05).

A.3.2 Behavior Relationship Consistency

Table A.3 shows the Spearman correlations between target behaviors and Blame, the reliable behavior in the Neural model. Q represents the "true correlation", i.e. the correlation between the ground-truth annotations. Q'(L) represents the correlation between the model's scores at window length L. Since we have only one reliable behavior, we calculate the BRC using Eqn. 3.5 with its weight set to 1. Similar to the N-gram model, for the analysis procedure in Sec. 3.5.2, we set the BRC threshold Y_2 = 0.95. Then, using both BCS and BRC, we analyze the Neural model scores of all the behaviors, the results of which are shown in Figure 3.8. Since we are comparing just two window lengths, 3 and 30 words, the resolution of our analysis here is slightly coarse and doesn't necessarily reflect exhaustive trends. For instance, a behavior that actually requires 10-word windows might perform better at 30 words than at 3 words simply because of the increased context and not because it is best observed at 30 words. Hence, the window length results for the Neural model should be interpreted in a relative light, i.e. short window vs. longer window.
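As a small worked example of this single-weight case, consider the pair BRC of Algorithm 5 applied to the values in Table A.3 below; this illustrates only the arithmetic of the metric, not the full analysis of Sec. 3.5.2. For Anger at a 3-word window, Q = 0.673 and Q'(3) = 0.678, so

BRC_Anger(3) = 1 - |0.678 - 0.673| / 2 = 0.9975,

while for Avoidance at a 30-word window, Q = 0.085 and Q'(30) = -0.005, giving 1 - |-0.005 - 0.085| / 2 = 0.955; the first comfortably clears the BRC threshold Y_2 = 0.95, whereas the second only barely does.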
Reliable Behavior: Blame

Target Behavior        Q       Q'(3)   Q'(30)
External               -0.073  -0.18   -0.134
Negotiates             -0.083  -0.381  -0.263
Perspective            -       0.063   0.0902
Responsibility         -0.239  -0.311  -0.346
Solicit-suggestions    -0.23   -0.388  -0.42
Solutions              -0.184  -0.306  -0.356
Anger                  0.673   0.678   0.659
Belligerence           0.677   0.605   0.55
Defensiveness          0.522   0.408   0.454
Disgust                0.69    0.634   0.421
Negative               0.693   0.732   0.457
Acceptance             -0.75   -0.656  -0.618
Affection              -0.352  -0.481  -0.528
Agreement              -0.295  -0.464  -0.347
Positive               -0.547  -0.61   -0.59
Satisfaction           -0.537  -0.465  -0.464
Support-emotional      -0.326  -0.541  -0.358
Support-instrumental   -0.343  -0.571  -0.39
Anxiety                0.171   0.369   0.349
Avoidance              0.085   -       -0.005
Change                 0.7     0.76    0.725
Dominance              0.293   0.394   0.425
Sadness                0.198   0.204   0.177
Withdrawal             -       -0.118  -0.035

Table A.3: Spearman correlation between window-level scores of Blame and target behaviors with the Neural model: Q and Q' refer to the correlations used to calculate the pair BRC in Eqn. 3.4. Since we have only one reliable behavior, its weight in Eqn. 3.5 is 1. All correlations are statistically significant (p < 0.05) unless marked as -.

A.3.3 ELMo Layer Weights

The 3 layers in ELMo, from bottom to top, are the input token layer and 2 bidirectional language model (biLM) layers. Figure 3.9 displays the mixing weights of each layer, averaged over all model checkpoints, from each test fold and for each behavior in a group. We see that Problem-Solving behaviors tend to use information predominantly from the top layer, followed by the middle layer. This means that the models used to estimate these behaviors rely mostly on the biLM representations which, as we noted earlier, pertain to complex and high-level characteristics of language. While Negative and Positive behaviors similarly place heavy emphasis on the top layer, they assign a much larger weight to the bottom layer, which typically encodes word-level features; this matches the notion that it is possible to express them using short expressions. Finally, we see that Dysphoric behaviors assign similar weights to all 3 layers, implying that we need to extract information from all 3 aspects of language in order to capture them.
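For reference, the mixing weights discussed above act as a learned scalar mix over ELMo's three layers, in the spirit of Peters et al. [124]. The following is a minimal sketch of that combination, assuming the three layer activations have already been extracted; the tensor shapes and variable names are illustrative assumptions, not the thesis implementation.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Softmax-normalized weighted sum of ELMo's 3 layers (token layer + 2 biLM layers)."""
    def __init__(self, num_layers=3):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))  # learned per-layer mixing weights
        self.gamma = nn.Parameter(torch.ones(1))               # global scaling factor

    def forward(self, layer_activations):
        # layer_activations: (num_layers, seq_len, dim) stack of per-layer token embeddings
        norm_weights = torch.softmax(self.weights, dim=0)
        mixed = (norm_weights.view(-1, 1, 1) * layer_activations).sum(dim=0)
        return self.gamma * mixed

# Usage sketch: three hypothetical 1024-dimensional layer outputs for a 30-word window
layer_outputs = torch.randn(3, 30, 1024)
mix = ScalarMix()
window_representation = mix(layer_outputs)          # (30, 1024), fed to a downstream behavior model
layer_weights = torch.softmax(mix.weights, dim=0)   # the per-layer weights visualized in Figure 3.9
```

Inspecting the softmax-normalized weights after training is what yields the per-behavior layer profiles described above.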