Behavioral Signal Processing: Computational Approaches for Modeling and Quantifying Interaction Dynamics in Dyadic Human Interactions

by Chi-Chun Lee

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, in Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING), December 2012. Copyright 2012 Chi-Chun Lee

Dedication

I would like to dedicate this thesis to my family and friends. Thank you my parents and my brother for raising me and supporting me through my study, my work, and simply my entire life. You have made sure that I do not have to worry about anything else except concentrating on my work and pursuing what I have been so determined to do. Thank you Dolphine, my love, for being there for me, being my best support during my most difficult times, keeping me going strong, being my best company, and best of all, taking the best care of me. You are the utmost motivator for me to complete this thesis, and I am so blessed to have you and a wonderful life with you in the future. Thank you all my dear friends; you have given me so many wonderful and joyful times over the past years. You have provided me with a different perspective that, I have come to realize, was the driving force for me to "think outside the box". This thesis is dedicated to all of you, and I look forward to an adventurous and exciting future with all of you.

Acknowledgements

I would like to acknowledge that the completion of this thesis would not have been possible without the help of the most admirable advisor, Professor Shrikanth Narayanan; research professors, Professor Sungbok Lee and Professor Panayiotis Georgiou; dissertation and qualification committees, Professor C.-C. Jay Kuo, Professor Antonio Ortega, and Professor Gayla Margolin; and all my wonderful lab mates and mentors. To Shri: without your encouragement, your wealth of knowledge, and your effort in guiding me through the maturation of my quest of understanding and carrying out scientifically relevant research addressing large questions, I would not have been able to complete this thesis. To the research professors and thesis committees: your close and helpful guidance has definitely pushed me and started me off on the route of finishing much of this research work, and more in the future. Last, I was blessed to have an excellent mentor, Dr. Carlos Busso, when I started my PhD life. Every member of SAIL (the Signal Analysis and Interpretation Laboratory) has also helped me along the way at each step. All of your help, along with our fun times together, has made my life as a PhD student just that much better!

Table of Contents

Dedication
Acknowledgements
List of Figures
List of Tables
Abstract
Chapter 1: Introduction
1.1 Behavioral Signal Processing: Behavioral Informatics
1.1.1 Challenge and Complexity in BSP
1.1.2 BSP Application Domains
1.2 BSP: Computational Methods for Dyadic Interaction Dynamics
1.2.1 Complexities in Computationally Quantifying and Modeling Interaction Dynamics
1.2.2 Questions of Focus for Engineers in the Study of Dyadic Interactions
1.3 Thesis Contribution and Outline
1.3.1 Research in Conversation Analysis
1.3.2 Research in Affective Computing
1.3.3 Research in Modeling Human Social-Communication Interaction Dynamics
1.3.4 Research in Modeling Human Annotation Perception
Chapter 2: Predicting Interruptions in Dialog
2.1 Introduction and Motivation
2.2 Database and Annotation
2.3 Feature Extraction
2.3.1 Interrupter Gestural Features
2.3.2 Interruptee Acoustic Features
2.4 Statistical Modeling Framework
2.4.1 Review of Hidden Conditional Random Field
2.5 Experiment Setup and Results
2.5.1 Experiment I: Results
2.5.2 Experiment II: Results
2.5.3 Feature Selection Discussion
2.6 Conclusions and Future Work
Chapter 3: Recognizing Emotion with Single Speaker Modeling
3.1 Introduction and Motivation
3.2 Emotional Databases
3.2.1 AIBO Database
3.2.2 USC IEMOCAP Database
3.2.3 Acoustic Feature Extraction
3.2.4 Feature Selection and Normalization
3.3 Emotion Classification Framework
3.3.1 Building the hierarchical decision tree
3.3.2 Building the hierarchical decision tree for the AIBO database and the IEMOCAP database
3.3.3 Classifier for Binary Classification Tasks
3.4 Experiment Setup and Results
3.4.1 AIBO Database
3.4.2 USC IEMOCAP Database
3.5 Conclusions
Chapter 4: Recognizing Emotion with Joint Interlocutors Modeling
4.1 Introduction and Motivation
4.2 Database and Annotation
4.2.1 IEMOCAP Database
4.2.2 Emotion Annotation
4.3 Dynamic Bayesian Network Model
4.4 Experimental Results and Discussion
4.4.1 Feature Extraction
4.4.2 Experiment Setup
4.4.3 Results and Discussion
4.5 Conclusions and Future Work
Chapter 5: Quantifying Vocal Entrainment in Dialog
5.1 Introduction and Motivation
5.2 The Couple Therapy Corpus
5.2.1 Pre-processing / Audio Feature Extraction
5.2.2 Behavioral Codes of Interest
5.3 Signal-derived Vocal Entrainment Quantification
5.3.1 PCA-based Similarity Measures
5.3.2 Representative Vocal Features
5.3.3 Vocal Entrainment Measures in Dialogs
5.4 Analysis of Vocal Entrainment Measures
5.4.1 Natural Cohesiveness of Dialogs
5.4.2 Entrainment in Affective Interactions
5.5 Affect Classification using Entrainment Measures
5.5.1 Classification Framework
5.5.2 Classification Setup
5.5.3 Classification Results and Discussions
5.6 Conclusion and Future Work
Chapter 6: Analyzing Vocal Entrainment in Marital Conflict
6.1 Introduction and Motivation
6.2 The Couple Therapy Corpus
6.3 PCA-based Vocal Entrainment Measures
6.3.1 Symmetric Entrainment Measures
6.3.2 Directional Entrainment Measures
6.3.3 Canonical Correlation Analysis
6.4 Analyses Results and Discussions
6.4.1 Correlation Analysis: the Four Behavioral Dimensions
6.4.2 Canonical Correlation Analysis: Withdrawal
6.5 Lessons Learnt from Correlation Analysis
6.6 Vocal Entrainment and Demand and Withdraw in Couple Conflict
6.6.1 Demand and Withdraw
6.6.2 Behavioral Influence and Polarization of Demand and Withdraw
6.6.3 Data Analysis
6.6.4 Results
Chapter 7: Modeling Human Annotation Perception
7.1 Introduction
7.2 Corpus Description
7.3 Computational Framework
7.3.1 Multiple Instance Learning
7.3.2 Sequential Probability Ratio Test
7.4 Analysis Setup
7.4.1 Lexical Feature Extraction
7.4.2 Classification Setup
7.5 Detection Results and Discussions
7.6 Isolated-Saliency vs. Causal-Integration
7.7 Conclusion and Future Works
Chapter 8: Conclusions and Future work
8.1 Future Research Directions
8.1.1 Algorithmic Development - Turn Taking and Affective Dynamics
8.1.2 Application Domains - Applying Vocal Entrainment in Analysis
8.1.3 Transferring Computational Models - Quantifying Other Aspects of Entrainment
8.1.4 CreativeIT: Synthesizing Actors' Improvisation Interaction
8.1.5 Data-driven Human Behavioral Science Study
Bibliography

List of Figures

1.1 BSP: An interdisciplinary research domain and applications
1.2 Introduction: Schematics of Studying Dyadic Human Interactions
1.3 Introduction: Interpersonal Interaction Dynamics in Dyadic Interaction
2.1 Predicting Interruption: Markers Placement
3.1 Proposed Classification Framework: A Hierarchical Binary Decision Tree with the easiest task as the first stage and the most ambiguous task as the last stage
3.2 Left: Proposed Hierarchical Structure for the AIBO Database; Right: Hierarchical Structure for the USC IEMOCAP Database
3.3 Left: Conventional Hierarchical Structure for the AIBO Database; Right: Conventional Hierarchical Structure for the USC IEMOCAP Database
4.1 Recognizing Emotion: Example of Analysis Windows
4.2 Recognizing Emotion: K-Means Clustering Output of Valence-Activation
4.3 Recognizing Emotion: Proposed Dynamic Bayesian Network Structure
4.4 Recognizing Emotion: Structures of Emotion States Evolution
5.1 Quantifying Vocal Entrainment: Example of Computing Measures Quantifying Vocal Entrainment for Turns H_t in a Dialog
5.2 Quantifying Vocal Entrainment: Categories of Conceptualization of Dynamic Interplay of Directionality of Influences in Dyadic Interactions
5.3 Quantifying Vocal Entrainment: Examples of Vocal Entrainment Measures (dsim_Lto, dsim_Lfr, ssim_Lu, ssim_Lw), computed for one couple in different affective interactions; (a) and (b) correspond to positive affect, (c) and (d) correspond to negative affect
5.4 Quantifying Vocal Entrainment: Dynamic Bayesian Network Representation of (a) HMM and (b) FHMM

List of Tables

2.1 Predicting Interruptions: Summary of Experiment I
2.2 Predicting Interruption: Summary of Experiment II
2.3 Predicting Interruption: Features Selected
3.1 AIBO Database: Table of Emotion Utterances
3.2 USC IEMOCAP Database: Number of Emotion Utterances per Category
3.3 Acoustic Features Extracted (16 × 2 × 12 = 384)
3.4 Experiment I: Summary of Result
3.5 Experiment II: Summary of Result
3.6 Experiment: Summary of the USC IEMOCAP Database Classification Result
4.1 Recognizing Emotion: Emotion Label Clustering (k = 5)
4.2 Recognizing Emotion: Valence & Activation Clustering (k = 3)
4.3 Recognizing Emotion: Summary of Experiment Accuracy Percentage
5.1 Quantifying Vocal Entrainment: Summarization of Methods in Computing the Eight Vocal Entrainment Measures
5.2 Quantifying Vocal Entrainment: Analyzing Vocal Entrainment Measures for Natural Cohesiveness (1000 runs): Percentage of Rejecting H_o at the α = 0.05 Level
5.3 Quantifying Vocal Entrainment: Analyzing Vocal Entrainment Measures for Affective Interactions (Positive Affect vs. Negative Affect): One-sided p-value Presented
5.4 Quantifying Vocal Entrainment: Results of Binary Affective State (Positive vs. Negative) Recognition: Percentage of Accurately Classified (%)
6.1 Summary of correlation analyses between vocal entrainment quantitative descriptors and the four behavioral dimensions. A 'p' refers to a statistically significant positive correlation, and 'n', negative correlation (** indicates p-value < 0.01 and * indicates p-value < 0.05)
6.2 Summary of the three Analyses and the analyzed behaviors
6.3 Summary of canonical correlation analysis results of 'withdrawal' behavioral dimension with vocal entrainment
6.4 Summary of MLM results of demand and withdraw with directional vocal entrainment
7.1 Summary of detection results (percentage of accurately detected sessions): numbers in bold indicate the highest performing decision framework for that specific task
7.2 Summary of detection results (percentage of accurately detected sessions): numbers in bold indicate the highest performing decision framework for that specific task
7.3 SPRT_st: median and 75% quantile of decision time, measured as number of turns required divided by the total number of turns of each session for the 40% task

Abstract

Behavioral Signal Processing (BSP) is an emerging interdisciplinary research domain, operationally defined as computational methods that model human behavior signals, with the goal of enhancing the capabilities of domain experts in facilitating better decision making, in terms of both scientific discovery in the human behavioral sciences and human-centered system design. Quantitative understanding of human behavior, both typical and atypical, and mathematical modeling of interaction dynamics are core elements in BSP. This thesis focuses on computational approaches to modeling and quantifying interaction dynamics in dyadic human interactions.

The study of interaction dynamics has long been at the center of multiple research disciplines in the human behavioral sciences (e.g., psychology). Exemplary scientific questions addressed range from studying scenarios of interpersonal communication (verbal interaction modeling; human affective state generation, display, and perception mechanisms), to modeling domain-specific interactions (such as assessment of the quality of theatrical acting or of children's reading ability), to analyzing atypical interactions (for example, models of distressed married couples' behavior and response to therapeutic interventions, and quantitative diagnostics and treatment tracking of children with Autism and of people with psycho-pathologies such as addiction and depression).
In engineering, a metaphorical analogy and framework for this notion in behavioral science is based on the idea of conceptualizing a dyadic interaction as a coupled dynamical system: an interlocutor is viewed as a dynamical system whose state evolution is based not only on its own past history but also on the other interlocutor's state. However, the evolution of these "coupled states" is often hidden by nature; an interlocutor in a conversation can at best "fully observe" the expressed behaviors of the other interlocutor. This observation, or partial insight into the other interlocutor's state, is taken as "input" into the system, coupling with the evolution of its own state. This, in turn, "outputs" behaviors to be taken as "input" by the other interlocutor. This complex dynamic in essence captures the flow of a dyadic interaction quantitatively. The challenge in modeling human interactions is, therefore, multi-fold: the coupling dynamic between the interlocutors in an interaction spans multiple levels, operates along variable time scales, and differs between interaction contexts. At the same time, each interlocutor's internal behavioral dynamic produces a coupling that is multimodal across the verbal and nonverbal communicative channels.

In this thesis, I will focus on developing computational methods for carrying out studies aimed at understanding and modeling interaction dynamics in dyadic interactions. Specifically, I will first demonstrate the efficacy of jointly modeling interlocutors' behaviors for better prediction of interruptions in conversation. Since turn taking is a highly coordinated behavioral phenomenon between interlocutors, it is beneficial to model both speakers together to achieve better prediction accuracy. Second, I have contributed to the domain of affective computing, recognizing human emotional states through behavioral signals extracted from audio-video recordings with a hierarchical classification structure. Furthermore, I have demonstrated that jointly modeling the interlocutors' emotional states with a dynamic Bayesian network (DBN) improves over a single-speaker emotion recognition system. Next, I have developed a computational tool for quantifying vocal entrainment, a natural, spontaneous matching of vocal behavior between interlocutors, demonstrating the ability to quantify subtle interaction dynamics. This computational tool, developed in close collaboration with psychologists, was able to bring further insights in the domain of mental health (specifically, distressed married couples) with regard to the cyclical behavior of demand and withdraw. Lastly, I have presented an initial computational approach for studying the perceptual process of human observers, viewed as distal interacting entities, in the context of subjective human behavior judgments. Since most studies in behavioral science rely heavily on trained annotators to carry out analyses of human behaviors, I have designed, given an existing database with multiple annotators' ratings, an initial computational approach to understand the underlying perception mechanism.

Chapter 1: Introduction

1.1 Behavioral Signal Processing: Behavioral Informatics

Behavioral Signal Processing (BSP) is an emerging interdisciplinary research domain, operationally defined as computational methods that model human behavior signals. A human can be considered a complex system, with different internal states and processes governing the underlying mechanisms of behavior generation.
These behaviors, in signal processing terms, are called "behavioral signals," with humans viewed as systems characterized by hidden and time-varying internal states. Behavioral signals are manifested in overt and covert cues, are processed and used by humans explicitly or implicitly, and oftentimes are fundamental in facilitating human analysis and decision making. The goal of BSP is to computationally model these signals, and the outcome of BSP is called "behavioral informatics": computational methods aimed at enhancing the capabilities of domain experts in facilitating better decision making, in terms of both scientific discovery in the human behavioral sciences and human-centered system design.

Behavioral informatics methods can potentially make a profound impact on both behavioral science and engineering systems. Because of its interdisciplinary nature, encompassing research domains across the behavioral sciences, machine learning, and signal processing, BSP provides an opportunity to integrate human domain-specific knowledge into the design of human-machine dialog and interaction interfaces. The integration of human knowledge into end-user system design can push the boundary of what is possible in the development of human-centric systems. Furthermore, BSP, with its computational ability, can develop machine learning algorithms to sense and predict human behaviors, enabling machines to carry out tasks that humans are capable of. BSP provides a grounding domain for engineers to advance state-of-the-art machine learning and signal processing techniques to better handle the complexity involved in modeling human behaviors. On the behavioral science front, by modeling human behaviors with complex mathematical models built on objective behavioral signals, BSP can provide computational tools that expand the capability of behavioral science researchers to find new scientific insights about human behaviors that were not possible before. Lastly, the ability to process large amounts of behavioral data with computational methods opens up the opportunity to conduct human-subject experiments directly on existing corpora in a "data-driven" manner.

1.1.1 Challenge and Complexity in BSP

A standard framework and pipeline for carrying out research in the domain of BSP often starts with careful experimental design, with appropriate recruitment of human subjects and well-designed control and experimental variables. The second part involves extensive data collection, often in the form of audio-video recordings, and quantitative data analysis. This second part of the framework is based on the assumption that a given subject often has a "hidden" internal state, and that his/her expressed behaviors are a full or partial realization of that hidden internal state carrying the intended messages.

Figure 1.1: BSP: An interdisciplinary research domain and applications

We are then able to perceive and capture these expressed behaviors through data recording devices, e.g., audio-video recordings, physiological sensors, and other advanced technological devices. In order to extract meaningful and objective behavioral descriptors, signal processing methods on these raw recorded signals are explored.
With the appropriate behavioral descriptors and modeling references given by humans (explicit or implicit judgments and descriptions), behavioral modeling techniques can be used to quantitatively understand or predict behaviors of interest.

There are many technical challenges associated with this framework of behavioral informatics. Some of the apparent challenges lie in the data collection process and the signal processing methodologies. In order to robustly and automatically model a behavior, an adequate and representative number of samples are required to train machine learning techniques. The instrumentation of recording spaces can also be a challenge in making the observed data ecologically valid. Another apparent challenge lies in the signal processing methodologies: how to devise a signal processing method that extracts the most informative features to best describe a specific human behavior or internal state? Besides these apparent challenges, there are inherent uncertainties in modeling human behaviors and internal states. First of all, the expressed behaviors offer only a partial window into the hidden state of a particular subject. Another issue arises from the subjective nature of human evaluation, which is commonly used as ground truth for supervised machine learning techniques. Lastly, there are issues of heterogeneity and variability in how data are generated and used.

Despite these challenges, there has been great progress in engineering toward modeling and quantifying human behaviors and internal states across multiple levels. Automatic speech recognition, diarization, face detection, body pose estimation, etc., can all be seen as "low-level" modeling of human behaviors. At a higher level, there has been work on recognition systems for human behaviors in conversation, such as backchannel prediction [89] and disfluency detection [122]. At the level of internal state, there is a great body of research on engineering studies of emotion recognition [20, 88, 90]. Lastly, there are several works in the domain of behavioral signal processing where engineers provide computational tools to aid studies of psychological significance [11, 32, 35, 60].

1.1.2 BSP Application Domains

BSP and its outcome of behavioral informatics have direct benefits for various downstream application domains, such as security, commerce, education, mental health and well-being, and human-centric systems. Security and commerce applications can benefit from robust and accurate modeling of user behaviors and internal states. Applications in education can benefit greatly from the ability of behavioral informatics to sense and predict human behaviors for a better design of automatic tutoring or language learning systems. Mental health and well-being is another domain that can progress in a new direction, pushing the state of the art in mental health diagnosis, assessment, and even intervention practice with the computational methods provided by BSP in terms of mathematical modeling and quantification of objective behavioral signals. All of these efforts fall under the umbrella of better design of technologies and systems that can benefit humans in all aspects of decision making: human-centric system development and design.
1.2 BSP: Computational Methods for Dyadic Interaction Dynamics

Quantitative understanding of human behavior, both typical and atypical, and mathematical modeling of interaction dynamics are core elements in BSP. This thesis focuses on computational approaches to modeling and quantifying interaction dynamics in dyadic interactions. Dyadic interactions are one of the most common real-life interpersonal interaction scenarios. Across various research domains, the dyadic interaction is a core unit of analysis for understanding detailed interaction dynamics. The types of scientific research questions asked in analyzing dyadic interactions across domains of human communication studies can be roughly categorized by attributes of the interactions, such as the form of communication and the strength and type of the relationship. The form of communication, e.g., face-to-face, telephonic, emails, chats, lectures, etc., dictates how humans exchange information (verbally, non-verbally, through lexical content only, etc.). The relationship strength between the interacting dyad exists on a continuum; it can range from complete strangers to intimate partners. The relationship type also varies widely: it can involve casual and typical interactions, e.g., friends talking and siblings interacting, more task-oriented settings, e.g., job interviews and doctor's visits, or even atypical interactions, e.g., psychologists interacting with children with Autism, or chronically distressed couples' spoken interactions.

While each research domain focuses on different aspects of various types of dyadic interactions, the flow of the analysis can generally be conceptualized into three major parts (see Figure 1.2). The first part involves formulating a hypothesis about a problem of interest in a particular context of dyadic human interaction. The second part involves behavioral data collection, often in the form of audio-video recordings, and quantitative data analysis. The last part involves interpretations based on the analysis results from the second part along with pre-existing knowledge. Different domains have established various ways of "quantifying" human behaviors, e.g., self-report, observational coding based on coding manuals, surveys, etc. These measures were established to "measure" patterns of variation in the interaction dynamics to aid domain experts in quantitatively analyzing and judging human behaviors in relation to the formulated hypothesis. Quantifying and modeling interaction dynamics is at the core of providing grounding evidence for the study of human interactions. We will focus on the technological aspects of part two of this research process, utilizing computational methods to aid the various studies of dyadic interaction by quantifying and modeling interaction dynamics with behavioral signal processing methodologies.

Figure 1.2: Introduction: Schematics of Studying Dyadic Human Interactions

1.2.1 Complexities in Computationally Quantifying and Modeling Interaction Dynamics

Aside from the challenges in behavioral informatics mentioned above, another layer of complexity exists because human behavior modeling often occurs while the subject is engaged in an interaction. As indicated in the psychology literature [16], as soon as two people are engaged in a conversation, their internal states are coupled and their behaviors become mutually dependent.
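As a toy numerical illustration of this mutual dependence (and of the coupled-dynamical-systems view developed in the following paragraphs), the sketch below simulates two interlocutors whose hidden states drift toward a noisy observation of each other's expressed behavior. The update rule, coupling constant, and noise level are purely illustrative assumptions, not a model proposed in this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_coupled_dyad(T=100, coupling=0.3, noise=0.1):
    """Toy simulation: two hidden states that are mutually dependent.

    Each interlocutor's state evolves from its own history plus a noisy
    observation of the partner's expressed behavior (partial observability).
    All dynamics here are illustrative, not estimated from data.
    """
    s1, s2 = 1.0, -1.0                   # initial internal states
    trajectory = []
    for _ in range(T):
        b1 = s1 + noise * rng.normal()   # expressed behavior = noisy production
        b2 = s2 + noise * rng.normal()
        # Each state keeps part of its own history and drifts toward the
        # (noisily perceived) behavior of the other interlocutor.
        s1 = 0.9 * s1 + coupling * (b2 + noise * rng.normal() - s1)
        s2 = 0.9 * s2 + coupling * (b1 + noise * rng.normal() - s2)
        trajectory.append((s1, s2))
    return np.array(trajectory)

states = simulate_coupled_dyad()
print("final states:", states[-1])       # the two states drift toward each other
```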
In the field of signals and systems, the study of human behaviors can be abstracted by viewing humans as interacting systems (e.g., stochastic coupled dynamical systems with evolving internal states) and observable, measurable signals (e.g., speech, video, gestural kinematics, and physiological data) as encoding expressive behaviors that carry meaningful information. The challenge in such interaction modeling is multi-fold: the coupling dynamic between the interlocutors in an interaction spans multiple levels, operates along variable time scales, and differs between interaction contexts. At the same time, each interlocutor's internal behavioral dynamic produces a coupling that is multimodal across the verbal and nonverbal communicative channels.

Figure 1.3: Introduction: Interpersonal Interaction Dynamics in Dyadic Interaction

Decoding the inter- and intrapersonal coupling effects displayed and perceived through behavioral cues is the essential building block leading to a better understanding of human behaviors. The challenge of jointly modeling inter- and intrapersonal interaction at multiple levels has brought exciting research opportunities in human-centered engineering.

The coupling effect is best illustrated in Figure 1.3. A speaker, say Speaker 2, with an initial internal state, produces expressed behaviors through a noisy behavioral production process. Speaker 2's interacting partner, Speaker 1, perceives Speaker 2's behaviors through a noisy perception process. This perception alters the internal state of Speaker 1, which in turn produces modulated behaviors of Speaker 1. These behaviors then feed back into Speaker 2 and create a loop of dynamics. The concept of interaction dynamics is a result of the coupling between internal states and the coordination between the interacting dyad's behaviors.

This additional complexity requires further technological advances in statistical modeling frameworks to adequately model multiple interacting processes, since the interlocutors' internal states are inherently coupled. At the same time, new computational frameworks need to be developed to quantify the various mutual dependencies between the interlocutors' expressed behaviors.

1.2.2 Questions of Focus for Engineers in the Study of Dyadic Interactions

The study of interaction modeling is very broad and multi-faceted. A critical issue for the use of technology in aiding the various psychological analyses is what and where technology can help with the problems of interest to human behavioral science researchers. In this thesis, while studies on different aspects of human behaviors are presented, I concentrate on three main themes, posed as three questions, related to the behavioral signal processing aspects of the study of human interaction dynamics.

1. Given the knowledge from psychology on the nature of interaction dynamics, can we incorporate this knowledge into a better design of recognition systems?

2. Given the strength of engineering approaches, namely their quantitative nature, can we develop a computational framework to quantify subtle yet essential interaction dynamics that can further strengthen or aid the various studies of different dyadic interaction scenarios?

3.
Given that human annotators can be viewed as "distal interacting" partners because of their essential role in providing insights about the interaction of interest back to the experts, can we understand computationally how annotators make a subjective judgment about an interaction at the session level given the behavioral signals derived locally at the turn level?

1.3 Thesis Contribution and Outline

My thesis addresses four major elements of interpersonal interaction dynamics: the conversation process (Chapter 2), affective computing (Chapters 3 and 4), social-communicative interaction dynamics (Chapters 5 and 6), and human perception modeling (Chapter 7).

1.3.1 Research in Conversation Analysis

Understanding turn taking is a critical element in analyzing the human conversation process. An interruption is often considered a deviation or perturbation from a smooth turn-taking structure and is often a region of interest for behavioral analysis. For example, the nature of an interruption can signal dominance or the absence of social engagement. My novel contribution is the development of a predictive model of interruptions in dialog through direct modeling of both interlocutors' behaviors. Results from the prediction model provide a hint into the cognitive planning expressed in human behaviors just prior to turn-taking changes.

1.3.2 Research in Affective Computing

Affective computing is a rapidly advancing domain of research in human-centered engineering for the design of human-centric systems. Humans are capable of transmitting and receiving emotional expressions naturally and spontaneously, and machines without this ability are often deemed not to promote efficient and effective interactions. There have been numerous studies attempting to enable machines to sense and recognize the emotional states of humans using objective signals. I have contributed both in terms of robust emotion recognition (Chapter 3) and in the direct modeling of the coupling of the interlocutors' emotional states to improve overall emotion recognition as the system decodes through the entire dialog (Chapter 4).

1.3.3 Research in Modeling Human Social-Communication Interaction Dynamics

One of the perennial challenges in the behavioral sciences is quantifying the complex, often subtle, interplay between the interlocutors. I have focused on modeling an essential phenomenon, vocal entrainment: a naturally occurring synchrony in the coordination of vocal behaviors between the interacting dyad. Entrainment has long been established as an important attribute in describing interaction dynamics in psychology, yet its essential modeling mechanism has received little attention in the past three decades. I have contributed a robust quantification of vocal entrainment using a novel abstract PCA space representation. My approach has successfully addressed challenges due to asynchronous turn-taking structures and the multivariate nature of acoustic cues, and has introduced a novel way of capturing the directionality of entrainment (Chapter 5). This computational framework is not only applicable to affect recognition, but has also demonstrated its potential in mental health research and practice (Chapter 6).

1.3.4 Research in Modeling Human Annotation Perception

Humans are capable of internalizing descriptions of behaviors in order to assign a label or a numerical value to behaviors exhibited by others.
This ability is used broadly in behavioral science research to study various phenomena of human behavior and interaction. These annotators can be imagined as a distal interacting partner to the interaction of interest. A common approach to manual observational coding is to observe the entire interaction and assign a global metric describing the behavior of the specific interaction. I have contributed a novel computational framework for understanding whether these human annotators make decisions, i.e., through their perceptual process, based on isolated salient local behavioral information or through a causal integration of information as they view the entire sequence of behavioral data (Chapter 7).

Chapter 2: Predicting Interruptions in Dialog

2.1 Introduction and Motivation

During dyadic spontaneous human conversation, interruptions occur frequently and often correspond to breaks in the information flow between conversation partners. Accurately predicting such dialog events not only provides insights into the modeling of human interactions and conversational turn-taking behaviors, but can also serve as an essential module in the design of natural human-machine interfaces. Further, we can capture information such as the likely interruption conditions and the interrupter's signaling by incorporating both conversational agents in the prediction model (in this chapter, we define the interrupter as the person who takes over the speaking turn and the interruptee as the person who yields the turn). This modeling is predicated on the knowledge that conversation flow is the result of the interplay between the interlocutors' behaviors.

Several previous works [72, 76, 119] have analyzed different aspects of interruption in human dialogs in terms of the prosodic, gestural, and lexical cues exhibited under different conditions of interruption. The work presented in this chapter is novel in the sense that it utilizes information available before a turn change occurs to perform prediction of interruption rather than just recognition. Our hypothesis is motivated by a similar theory discussed in [44, 79], namely that the intentions of speakers are transmitted multimodally. Hence, during an interaction, the interrupter would exhibit different nonverbal behaviors while preparing to interrupt than when participating in coordinated, smooth turn-taking conversation. This study relies on the nonverbal (gestural) behaviors of the interrupter and the vocal (acoustic) behaviors of the interruptee. Many of our gestural cues, such as mouth opening, eyebrow raises, and rigid head motion, are extracted from direct motion capture data and carry intuitively higher-level implications. They provide interpretable results and offer guidance for future efforts on automatic video feature extraction. Further, discriminant models have been shown to outperform generative models in several classification tasks, and their assumption about the independence of observations across time is more relaxed. We utilize the Hidden Conditional Random Field (HCRF) [115], a dynamic discriminant model, for the interruption prediction task.

The IEMOCAP database [17] was used in the present study. It provides detailed information on different modalities (speech; gestures of the face, head, and hands) expressed in natural human-human conversational settings. Furthermore, in order to cover more general cases of interruption, interruptions were annotated based on human judgment instead of a syntactic structure based solely on instances of overlapping speech [95]. The proposed prediction model achieves an F-measure of 0.54, an accuracy of 70.68%, and an unweighted accuracy (average per-class accuracy) of 66.05% by using acoustic cues from the interruptee and gestural cues from the interrupter over a duration of one second before the turn change happens.
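For clarity, the metrics quoted above and used throughout this chapter can be computed as in the short sketch below; unweighted accuracy is the average per-class recall defined in the text. The use of scikit-learn and the toy labels are illustrative assumptions; the thesis itself does not specify a metrics implementation.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def interruption_metrics(y_true, y_pred):
    """Metrics reported in this chapter for the binary task:
    1 = interruption, 0 = smooth transition."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        # unweighted accuracy = average of per-class recalls (robust to class skew)
        "unweighted_accuracy": recall_score(y_true, y_pred, average="macro"),
        "precision": precision_score(y_true, y_pred),   # w.r.t. the interruption class
        "recall": recall_score(y_true, y_pred),
        "f_measure": f1_score(y_true, y_pred),
    }

# Tiny illustrative example (fabricated labels, not data from the corpus):
y_true = np.array([1, 0, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 0, 0, 1, 1, 0])
print(interruption_metrics(y_true, y_pred))
```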
2.2 Database and Annotation

We used the IEMOCAP database for the present study [17]. It was collected for the purpose of studying different modalities in expressive spoken dialog interaction. The database was recorded in five dyadic sessions, and each session consists of a different pair of male and female actors both acting out scripted plays and engaging in spontaneous dialogs in hypothetical real-life scenarios. In this chapter, we are interested in the spontaneous portions of the database since they closely resemble real-life conversation. During each spontaneous dialog, 61 markers (two on the head, 53 on the face, and three on each hand) were attached to one of the interlocutors to record the (x, y, z) positions of each marker. Figure 2.1 illustrates the placement of the markers. The markers were then placed on the other actor and recorded again with the same set of scenarios to complete a session. The recorded speech data from both subjects were available for every dialog. The database was transcribed and segmented by humans, and the time boundaries resulting from automatic forced alignment are assumed to correspond to the actual speech portion of each subject.

Figure 2.1: Predicting Interruption: Markers Placement.

We used the Anvil software [63] as our annotation tool, as it provides a multimodal annotation interface. Our interruption annotation scheme is based on subjective judgment rather than syntactic structure. An interruption was labeled if the utterance made by the interrupter was intended to stop the interruptee's flow of speech. Annotators were instructed to be aware that an interruption can happen without the occurrence of overlapping speech, and that an overlapping speech instance that is cooperative in nature should be noted as a smooth transition. In total, we annotated 1763 turn transitions, of which 1558 were smooth transitions and 215 were interruptions. Since the distribution of these two types of turn transitions is highly unequal, we downsampled the data by including only three sessions (six subjects) of the IEMOCAP database, with three dialogs chosen from each recording session. Subjects and dialogs were selected to include the majority of the annotated interruptions. In total, 382 turn transitions, 130 interruptions and 252 smooth transitions, are used as our dataset in this chapter.

2.3 Feature Extraction

For every given turn transition, we extracted two sets of features: one corresponding to the interrupter's body gestural cues and one corresponding to the interruptee's acoustic cues. The features were calculated, at 60 frames per second, over a total duration of one second before the interrupter starts speaking. We assume that this duration captures relevant behaviors associated with turn taking. Only acoustic cues were extracted from the interruptee because no markers were placed on the interruptee, and only gestural cues were extracted from the interrupter because the interrupter has not started speaking during the time of interest.

2.3.1 Interrupter Gestural Features

The following features were extracted for the interrupter:

• Mouth opening distances, denoted (M_z, M_x)
• First-order polynomial parametrization of the right and left eyebrows, denoted (A_r, B_r, A_l, B_l)
• Six degrees of rigid head motion: pitch, roll, yaw, and translations in x, y, and z, denoted (P, R, Y, T_x, T_y, T_z)

M_z was calculated as the absolute distance between markers Mou3 and Mou7, as shown in Figure 2.1, and M_x was calculated as the distance between markers Mou1 and Mou5. The eyebrow's shape was parametrized by a linear equation for each frame (Z = A·X + B). In our preliminary experiment, a second-order polynomial parametrization resulted in a negligible coefficient for the X^2 term. We only considered the (x, z) directions: people rarely move their eyebrows in the y direction, which is the forward-backward direction in our database after normalization of head movements. A_r and A_l are the slopes of the polynomial calculated from the right and left eyebrow marker positions, respectively; B_r and B_l are the intercepts. The slope and intercept can easily be associated with tilting and raising of the eyebrows. T_x, T_y, and T_z were derived from the nose marker, and P, R, Y were computed from all the markers using a technique based on Singular Value Decomposition (SVD) [17].

2.3.2 Interruptee Acoustic Features

The interruptee's energy and pitch values (denoted E, F) were calculated using the Praat toolbox [12] at 60 frames per second over the same time windows described previously.

Concatenation of the two sets of features, along with deltas computed from the interrupter's eyebrow parametrization and mouth opening distances and from the interruptee's acoustic cues, resulted in a 22-dimensional feature vector serving as the observation input for our prediction model.
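As a concrete illustration of the per-frame gestural features just described, the sketch below computes the first-order eyebrow parametrization and the mouth-opening distances with NumPy. The function names and the use of the Euclidean distance between lip markers are illustrative assumptions; the thesis does not prescribe a particular implementation.

```python
import numpy as np

def eyebrow_line_fit(xz_markers):
    """Fit Z = A*X + B to one frame of eyebrow marker positions.

    xz_markers: array of shape (n_markers, 2) with (x, z) coordinates of one
    eyebrow's markers. Returns the slope A (tilt) and intercept B (raise).
    """
    x, z = xz_markers[:, 0], xz_markers[:, 1]
    A, B = np.polyfit(x, z, deg=1)      # first-order polynomial, as in Section 2.3.1
    return A, B

def mouth_opening(mou3, mou7, mou1, mou5):
    """Mouth opening distances from the four lip markers of Figure 2.1:
    M_z between Mou3 and Mou7, M_x between Mou1 and Mou5
    (Euclidean distance is assumed here)."""
    M_z = np.linalg.norm(np.asarray(mou3) - np.asarray(mou7))
    M_x = np.linalg.norm(np.asarray(mou1) - np.asarray(mou5))
    return M_z, M_x

# Per-frame values would be stacked at 60 fps over the one-second window
# preceding the interrupter's speech, with deltas appended.
```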
2.4 Statistical Modeling Framework

2.4.1 Review of Hidden Conditional Random Field

Details of the HCRF are described in [101]. An HCRF models the conditional probability of a class label y given a set of observation vectors x as in Equation 2.1,

P(y \mid x; \theta) = \sum_{s} P(y, s \mid x; \theta) = \frac{\sum_{s} e^{\Psi(y, s, x; \theta)}}{\sum_{y' \in \mathcal{Y},\, s \in \mathcal{S}^m} e^{\Psi(y', s, x; \theta)}}    (2.1)

where s corresponds to the hidden states in the model, which capture the underlying structure of each class label, and the potential function \Psi(y, s, x; \theta), parameterized by \theta, is a measure of compatibility between a label y, a set of observations x, and a configuration of hidden states s. The following objective function is used in [115] to train the parameters of the model using a hill-climbing optimization technique, the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method,

L(\theta) = \sum_{i=1}^{n} \log P(y_i \mid x_i; \theta) - \frac{1}{2\sigma^2} \lVert\theta\rVert^2    (2.2)

where n is the number of training sequences. The first term is the log-likelihood of the data, and the second term corresponds to a Gaussian prior on the parameters \theta with regularization factor \sigma^2. The optimal parameter is obtained as \theta^* = \arg\max_{\theta} L(\theta). At the testing stage, for a new sequence x, given the optimal parameters \theta^* obtained from the training data, we can assign the label of the sequence using Equation 2.3 through standard belief propagation techniques,

\arg\max_{y \in \mathcal{Y}} P(y \mid x; \theta^*)    (2.3)
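To make Equation 2.1 concrete, the toy sketch below evaluates the class posterior of an HCRF by brute-force enumeration of hidden-state configurations for a very short sequence. The particular decomposition of the potential Ψ into observation, transition, and label-state terms is a common choice assumed here for illustration; it is not necessarily the exact parameterization of [101, 115], and real implementations use belief propagation rather than enumeration.

```python
import numpy as np
from itertools import product

def hcrf_posterior(x, theta_obs, theta_trans, theta_label, n_labels, n_hidden):
    """Brute-force evaluation of Eq. 2.1 for a short observation sequence x of shape (T, D).

    theta_obs:   (n_hidden, D) weights tying hidden states to observations.
    theta_trans: (n_hidden, n_hidden) weights on hidden-state transitions.
    theta_label: (n_labels, n_hidden) weights tying the class label to hidden states.
    Returns P(y | x; theta) for every class label y.
    """
    T = len(x)
    scores = np.zeros(n_labels)
    for y in range(n_labels):
        for s in product(range(n_hidden), repeat=T):          # all hidden configurations
            psi = sum(theta_obs[s[t]] @ x[t] for t in range(T))           # observation terms
            psi += sum(theta_trans[s[t - 1], s[t]] for t in range(1, T))  # transition terms
            psi += sum(theta_label[y, s[t]] for t in range(T))            # label-state terms
            scores[y] += np.exp(psi)
    return scores / scores.sum()        # normalization = denominator of Eq. 2.1

# Toy usage: 2 labels (interruption vs. smooth), 4 hidden states, 3 frames of 2-D features.
rng = np.random.default_rng(1)
x = rng.normal(size=(3, 2))
post = hcrf_posterior(x, rng.normal(size=(4, 2)), rng.normal(size=(4, 4)),
                      rng.normal(size=(2, 4)), n_labels=2, n_hidden=4)
print(post)   # sums to 1 over the two labels
```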
Further optimization of prediction performance through feature selection • Experiment II: Compare interrupter-only model, interruptee-only model, and the optimized model For both experiments, we performed z-normalization with respect to speaker iden- tity, and this normalization makes our feature vectors across speakers comparable. We also performed a six-fold (leave-one subject out) cross validation to evalute the per- formance. The label that annotates whether the turn-transition utterance is an inter- ruption or smooth transition served as ground truth for computing dierent prediction metrics. Since the database is skewed toward smooth transitions, several dierent met- rics other than accuracy percentage, such as unweighted accuracy, F-measure, precision and recall, are reported below. F-measure is our primary measure to assess the per- formances of our prediction model. Training and testing were both done using HCRF library [3]. 2.5.1 Experiment I: Results In Experiment I, three prediction models were trained. We rst trained a baseline model using logistic regression because it can be seen as a static version of discriminant model. 19 Table 2.1: Predicting Interruptions: Summary of Experiment I Model F-Measure Accuracy Unweighted Precision Recall Chance N/A 65.96% 50.00% N/A N/A Logistic Regression 0.39 68.06% 58.85% 0.56 0.30 HCRF w/o Feature Selection 0.48 64.66% 60.37% 0.48 0.47 HCRF w/ Feature Selection 0.54 70.68% 66.05% 0.57 0.51 The baseline model was trained with the full 22-dimensional feature vector on every frame of the training sequences given the class label. At testing, the decision was made with majority vote over the frames. The second model was obtained by training an HCRF model with the full 22-dimensional feature vector, and the number of hidden states and regularization factor were set to be 4 and 1 empirically. Lastly, forward feature selection performed through an inner ve-fold cross validation for each of the six fold validation. We selected features that optimize the accuracy percentage on the inner ve-fold cross validation for every given fold. The third model was trained using the nal feature set, which was the union of the features selected in each of the six folds, with the number of hidden states set to 4, and regularization factor set to 1. Results are shown in Table 7.3. The best performing model is HCRF with feature selection, which obtains an F- measure of 0.54 with 70.68% accuracy and 66.05% unweighted accuracy, and . The results indicates that dynamic modeling improves prediction accuracy. Specically, HCRF without the Feature Selection model obtains a 23.1% relative improvement in F-measure over the Logistic Regression model. 2.5.2 Experiment II: Results Experiment II was performed by training interrupter-only and interruptee-only HCRF models to compare with the best performing model - a combination of features from both speakers after feature selection. The interrupter-only model used a 18-dimensional 20 Table 2.2: Predicting Interruption: Summary of Experiment II Model F-Measure Accuracy Unweighted Chance N/A 65.96% 50.00% Interrupter-only 0.41 64.66% 57.57% Interruptee-only 0.45 68.59% 61.11% Optimized 0.54 70.68% 66.05% feature vector correspond to the interrupter's gestural cues, and the interruptee-only model used 4-dimensional feature vector correspond to the interruptee's acoustic cues. The number of hidden states was set to 4 with regularization factor being 0.1 in the interrupter-only model and 1 in the interruptee-only model. 
Table 6.3 shows a summary of results from Experiment II. As results indicate in Table 6.3, the best performing model in terms of F-measure is the one that uses models both speakers' behaviors. In particular, the combination of models improves 20% and 31% relatively in F-measure compared with interruptee- only and interrupter-only model, respectively. Combination model with the full 22- dimensional feature vector listed in Table 7.3 also has a relative 6.7% and 17.1% higher F-measure compared with interruptee-only and interrupter-only model, respectively. 2.5.3 Feature Selection Discussion We can gain some insights by examining the features selected along with the performance summary. Table 3.4 shows the feature selected for each fold and the feature set used to generate the nal prediction model. The rst thing to notice in Table 3.4 is that the energy-related features from the interruptee is always selected as one of the features. This is not surprising because the abrupt jump-in during interruptee's speech correlates highly with what people perceive as an interruption, while a smooth transition often accompanies with pause between 21 speaker turns. Indeed, if we look at Table 6.3, using interruptee-only acoustic features alone shows improvement in unweighted accuracy compared to chance. The more interesting phenomenon is that the feature selection process also selected some of the intuitive interrupter's gestural features, such as mouth-opening and head rigid movement. In fact, examining along with Table 6.3 shows that by using interrupter- only cues, we still obtain an improvement in unweighted accuracy compared to chance. This implies that listener's behaviors provide information on his/her own intention of interrupting. Table 2.3: Predicting Interruption: Features Selected Fold Interruptee Interrupter One Energy, Energy Slope Right Eyebrow Two Energy Roll Three Energy Slope Right Eyebrow Four Energy Mouth Open z Five Energy Yaw Six Energy Mouth Open z, Translation x Final E, E M z ; M z ;A r ;R;Y;T x In summary, the best prediction model is obtained through a combination of inter- rupter and interruptee features with F-measure of 0.54, 70.68% accuracy, and 66.05% unweighted accuracy. The result shows that interruption usually happens when the interrupter jumps in during interruptee's speaking turn. It also shows that interrupter's gestural behaviors provide information on the intention of his/her interruption. While the prediction work is limited because the assumption of time boundary availability, the experimental results still show encouraging results in predicting interruptions by monitoring speaker's interaction in a dialog. 22 2.6 Conclusions and Future Work Interruptions in dialogs often provide essential information on changes in the conver- sation ow. Prediction of such event before it happens can be of great use in human- machine dialog interface. This work investigated the usage of HCRF as the prediction model and obtained promising prediction accuracy by monitoring both interlocutor's behaviors before a turn change occurs. The results reinforce our hypothesis that speak- ers' multimodal behaviors can be a good predicting indicator of the upcoming speech intention; in particular, listener's behaviors before turn taking is shown to indicate his/her intention of interruption. Future work will extend the prediction modeling to predict occurrences of interrup- tion without knowledge of exact turn change boundaries with dierent fusion to model interlocutors behaviors. 
Further inclusion of other features, such as lexical content and dialog acts, should be investigated as they can also provide information on the intention of speakers. Accurately modeling of interruption in a spoken interaction can bring in- sights into the design of natural dialog system in terms of dierences in behaviors under various turn-taking structures. This could also provide improved insights into the study of human-human conversations. 23 Chapter 3: Recognizing Emotion with Single Speaker Modeling 3.1 Introduction and Motivation Emotion recognition is an integral part of quantitative studies of human behavior. The emerging areas of human behavioral signal processing and behavioral informatics oer new analytical tools to support a variety of applications, including the design of natural human-machine interfaces (HMI). Emotionally-cognizant human-computer and human- robot interfaces promise a more responsive and adaptive user experience. In real life settings, behavioral computing must reconcile information in the context of a situated interaction [15]. This is also true of human-machine interactions where the ability to sustain interactions may be hampered by an interacting agent's inability to recognize, track and respond appropriately to the interacting partners [97]. Many applications can benet from an accurate emotion recognizer. For example, customer care interactions (with a human or an automated agent) can use emotion recognition systems to assess customer satisfaction and quality of service (e.g., lack of 24 frustration) [47, 74]. Other tasks that rely on observational coding of human interac- tion, such as in therapeutic settings [9, 68, 107] can also benet from robust emotion recognition. Increasingly, interactive educational systems are becoming commercially available [55, 56]. These systems must be able to accurately identify a child's emotional state to foster interactions and positive evaluations [14, 100, 121, 122]. Understanding a child's certainty in a problem solving and learning task can help scaold the interaction in a context appropriate way [7]. All these applications can benet from the design of a robust emotion recognition scheme, which should also be easily adaptable to dierent interaction scenarios. The computational emotion recognition framework we describe in this paper is loosely motivated by the Appraisal Theory [66] of emotions. Appraisal Theory states that emotion perception is a multi-stage conscious and unconscious process. The ap- praisal process can be thought of as a series of decisions (e.g., how positive is the stimulus, how novel is the stimulus, what is the cause of the stimulus, etc.). At each stage, an individual appraises the situation, reacts, and reappraises, inducing dierent emotions in the process (e.g., fear, surprise, and then joy). The proposed framework is inspired by the Appraisal Theory in its approximation of the appraisal and reappraisal processes. We do not, however, propose a direct interpretation or implementation of this theory; rather, we propose a simplied computational model in the form of a hierarchi- cal binary decision tree. The framework splits a single multi-class emotion classication problem into stages of binary emotion classication tasks capturing the idea of appraisal and reappraisal. The key idea of the proposed framework is the recognition, early in the tree, of the most distinguishable emotional classes. The ambiguous emotional classes are recognized at the bottom of the tree, mitigating error propagation. 
The key idea behind the proposed emotion recognition framework is the use of binary classifiers in a hierarchical tree structure. There are many well-established, state-of-the-art classifiers that can readily be applied to binary classification problems, e.g., logistic regression, support vector machines, and Fisher discriminant analysis. The system also benefits from its unweighted recall optimization criterion. In many real life interactions, the neutral emotion class is both the most dominant and the most ambiguous emotion class. If the system is optimized on the conventional accuracy measure (number of accurately classified samples divided by the total number of tested samples), it will likely be biased toward recognizing only the dominant state accurately [65]. This bias is not desirable in many applications. The average unweighted recall (the average, across emotion classes, of the percentage of accurately recalled utterances per class) provides a way to assess the performance of the proposed classifier on emotionally biased datasets. The hierarchical structure, along with the optimized decision threshold, can effectively mitigate the inherent class-imbalance problem and achieve a good average unweighted recall.

Several other emotion recognition works [2, 43, 80, 118] have also utilized hierarchical tree structures for emotion recognition tasks. The two most similar approaches are the DDAGSVM [80] proposed by Mao et al. and the hierarchical structure [118] proposed by Xiao et al. In both papers, the hierarchical structures are designed to operate on easier binary classification tasks in the first layer and on relatively ambiguous tasks in the last layer of the tree. Our classification framework, proposed independently, shares the same design principle. However, in our framework, we do not restrict each node to classifying between pairs of emotion classes; each node is free to classify between mixtures of emotion classes. The design framework proposed in this chapter can easily be extended to additional emotional corpora even when the emotion class distributions differ, and it can effectively cope with class bias.

The presented emotion recognition framework was first evaluated in the Interspeech 2009 Emotion Challenge using the AIBO database. The evaluation metric was the average unweighted recall across emotion classes, and the database has two different splits: a training and an evaluation dataset. The database consists of affective speech collected from fifty-one children interacting with an AIBO robot dog [108]. The five emotion classes of interest are: angry, emphatic, neutral, positive, and rest. The class of neutrality is over-represented in this database. We demonstrated the flexibility of this classification framework by applying it to a different emotion database, the USC IEMOCAP database. This database consists of natural dyadic affective spoken interactions of professional actors. It includes both scripted plays and spontaneous dialogs. The four emotion classes of interest are: angry, happy, sad, and neutral. Both databases contain natural affective interactions, instead of utterance-by-utterance acted emotional speech. In the AIBO database experiment, we achieve an average unweighted recall of 48.37% using leave-one-speaker-out (26-fold) cross validation on the training dataset.
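For reference, the unweighted (UA) and weighted (WA) recall figures quoted throughout this chapter can be computed from a confusion matrix as in the short sketch below; this is a generic computation, not tied to any particular toolkit, and the function names are ours.

```python
import numpy as np

def unweighted_recall(confusion):
    """Average of per-class recalls (UA). confusion[i, j] counts samples whose
    true class is i and whose predicted class is j, as in the tables below."""
    C = np.asarray(confusion, dtype=float)
    return float(np.mean(np.diag(C) / C.sum(axis=1)))

def weighted_recall(confusion):
    """Overall accuracy (WA): correctly classified samples over all samples."""
    C = np.asarray(confusion, dtype=float)
    return float(np.trace(C) / C.sum())
```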
We obtain a 41.57% unweighted recall on the evaluation dataset, a 3.37% absolute (8.82% relative) improvement over the best baseline result presented in the Emotion Challenge and summarized in Schuller et al. [108]. In the USC IEMOCAP database experiment, we achieve an average unweighted recall of 58.46% using 10-fold (leave-one-speaker-out) cross validation, which is a 7.44% absolute (14.58% relative) improvement over the Support Vector Machine (SVM) based baseline.

The chapter is organized as follows. The two emotional databases used in our study are described in Section 2. The hierarchical classifier framework is presented in Section 3. The experimental results and discussion are provided in Section 4. Conclusions and future work are given in Section 5.

3.2 Emotional Databases

3.2.1 AIBO Database

The AIBO database [111] consists of 51 children interacting with a Sony toy robot, AIBO, using a Wizard-of-Oz technique. The data collection was designed to provoke emotional reactions from the children. The robot dog was programmed a priori and did not respond to the children's commands. The children were led to believe that the dog would respond, thereby making the Sony dog seem disobedient and inducing emotional speech. The database was collected at two schools (26 and 25 subjects, respectively). The data from one of the schools is used for training, while the other is used for testing. The audio was recorded wirelessly at 16 bits and 48 kHz, and was further downsampled to 16 kHz.

The database was segmented into turns by splitting the audio with a silence threshold of one second. Five advanced linguistics students labeled the emotional content of the database at the word level; the annotators were asked to rate the following emotion classes: joyful, surprised, emphatic, helpless, irritated, angry, motherese, bored, reprimanding, rest, and neutral. The weighted Kappa for the five annotators is 0.56, indicating fair agreement among evaluators. The database description [111] includes other metrics for computing inter-evaluator agreement, all of which indicate fair, though not perfect, agreement, as expected for spontaneous dialogs. The words were combined into longer chunks, manually defined using syntactic-prosodic criteria [111]. The labels of these chunks were based on a majority vote over the merged words. In this study, we were provided with five emotion classes (a subset of the whole AIBO database): Angry (includes angry, irritated, reprimanding), Emphatic, Neutral, Positive (includes motherese and joyful), and Rest. Detailed descriptions of the AIBO database collection, the annotation process, and the merging of emotion classes can be found in the cited references [108, 111]. A summary of the emotion class distribution used in this work is listed in Table 3.1. The class of Neutral represents about 80% of the database. In the testing split, the class distribution differs from the training split, as is evident for the Neutral and Positive classes.

Table 3.1: AIBO Database: Table of Emotion Utterances

          Angry   Emphatic   Neutral   Positive   Rest   Total
  train     881       2093      5590        674    721    9959
  test      611       1508      5377        215    546    8257

3.2.2 USC IEMOCAP Database

The USC IEMOCAP database [17] was collected for studying multimodal expressive dyadic interactions. The design of the database assumed that by exploiting the context of dyadic interactions between actors, a more natural and richer emotional display would be elicited than in speech read by a single subject.
Furthermore, the use of scripted and emotionally targeted improvisational scenarios allowed us to collect an affectively varied and balanced database. The database was collected using motion capture and audio/video recording (approximately 12 hours in total) over five dyadic sessions with 10 subjects. Each session consists of a different male-female dyad of actors performing scripted plays and engaging in spontaneous improvised dialogs elicited through affective scenario prompts. At least three naïve humans annotated each utterance in the database with categorical emotion labels chosen from the set: happy, sad, neutral, angry, surprised, excited, frustration, disgust, fear, and other. In this work, we consider only the utterances with majority agreement (i.e., at least two out of three annotators labeled the same emotion) over the emotion classes Angry, Happy, Sad, and Neutral. These classes represent the majority of the emotion categories in this database. This annotation scheme had an inter-evaluator agreement of 0.40 (Fleiss' Kappa), which can be considered fair agreement between evaluators. A detailed description of the USC IEMOCAP database is given in [17]. A summary of the emotion class distribution can be found in Table 3.2.

Table 3.2: USC IEMOCAP Database: Number of Emotion Utterances per Category

  Angry   Happy   Sad    Neutral   Total
   1083    1630   1083      1683    5480

3.2.3 Acoustic Feature Extraction

Table 3.3 presents the acoustic features used in this work. We used the same features for the experiments on both databases to provide a common setting in which to evaluate the effectiveness of the proposed classification framework. This acoustic feature set is largely based on the findings of Schuller et al. [107]. We extracted these features using the OpenSmile toolbox [30]. The feature set includes 16 low-level descriptors consisting of the prosodic, spectral envelope, and voice quality features listed in Table 3.3: zero crossing rate, root mean square energy, pitch, harmonics-to-noise ratio, and 12 mel-frequency cepstral coefficients, together with their deltas. Twelve statistical functionals were then computed for every low-level descriptor per utterance in the USC IEMOCAP database and per chunk in the AIBO database: mean, standard deviation, kurtosis, skewness, minimum, maximum, relative position, range, two linear regression coefficients, and their respective mean square error. This results in a collection of 384 acoustic features.

Table 3.3: Acoustic Features Extracted (16 × 2 × 12 = 384)

  Raw Acoustic Features + Deltas                     Statistical Functionals
  pitch (f0)                                         mean, standard deviation, kurtosis,
  root mean square energy (rms)                      skewness, minimum, maximum,
  zero crossing rate (zcr)                           relative position, range,
  harmonic-to-noise ratio (hnr)                      two linear regression coefficients,
  mel-frequency cepstral coefficients (1-12 mfcc)    mean square error of linear regression

3.2.4 Feature Selection and Normalization

We normalized features using z-normalization with respect to the neutral utterances in the training dataset for both databases. The process rests on the assumption that the average characteristics of neutral utterances do not vary extensively across speakers; therefore, the testing examples' features are z-normalized with respect to the mean, $\mu$, and variance, $\sigma^2$, of the neutral utterances from the training data. The normalization allows us to use acoustic features across multiple different speakers and to eliminate the effect of variations in individual speakers' speaking characteristics.
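A minimal sketch of this neutral-referenced z-normalization is given below; array shapes and variable names are assumptions for illustration.

```python
import numpy as np

def neutral_znorm_params(train_X, train_is_neutral):
    """Mean and standard deviation of the neutral training utterances, used to
    z-normalize every utterance (train and test) as described above.
    train_X          : (n_utterances, n_features) feature matrix
    train_is_neutral : boolean mask marking the neutral utterances"""
    mu = train_X[train_is_neutral].mean(axis=0)
    sigma = train_X[train_is_neutral].std(axis=0)
    sigma[sigma == 0.0] = 1.0          # guard against constant features
    return mu, sigma

def apply_znorm(X, mu, sigma):
    return (X - mu) / sigma

# usage (names are placeholders):
# mu, sigma = neutral_znorm_params(train_X, train_labels == "neutral")
# train_Xn, test_Xn = apply_znorm(train_X, mu, sigma), apply_znorm(test_X, mu, sigma)
```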
We perform feature selection on the 384 features using the standard statistical software SPSS to obtain a reduced feature set. We used binary logistic regression in SPSS with step-wise forward selection. The stopping criterion was based on conditional likelihood: forward selection was terminated when the inclusion of an additional feature no longer produced a statistically significant increase in the conditional likelihood of the model. This feature selection process resulted in a range of 40-60 features for each binary classifier per cross-validation fold. While many other feature selection algorithms exist, we utilized binary logistic regression because it is standard and has proven effective as a feature selection method [48]. This feature selection algorithm was used in each experimental setup in this work to show the effectiveness of the proposed framework for the multi-class classification task. The purpose of this work is not to demonstrate the efficacy of the specific feature selection method, but rather to show how it can be incorporated in the proposed system.

3.3 Emotion Classification Framework

3.3.1 Building the hierarchical decision tree

Our goal is to optimize the unweighted recall percentage (the average of per-class accuracies) in the classification framework. This is arguably a more useful metric for assessing emotional content in natural interactions when the distribution of classes is non-uniform or dominantly non-emotional. The two key points in our design of an emotion classification framework are listed below:

• The use of a combination of binary classifiers instead of a single multi-class classifier.

• The use of a hierarchical tree, where the top-level classification is performed on the easiest emotion recognition task.

The structure of the framework is shown in Figure 3.1. The proposed classification scheme splits the multi-class problem into a series of two-class problems, starting with the relatively easy classification task at the top level and leaving the harder tasks for the end. The order of classification is essential in this framework. The goal is to ensure maximum separation between any two chosen classes at each level. As depicted in Figure 3.1, Classifier 1 operates on the easiest binary classification task, and the classifiers in the final stage (Classifier Stage M) operate on the sets of binary classifications that are most ambiguous given the acoustic features.

A key aspect of the proposed framework is investigating the separability of the emotion classes given the feature streams; this information determines the order of the tree. We propose the following two criteria:

• Prior knowledge: several previous emotion recognition studies have shown the effectiveness of different feature streams in discriminating between specific emotion classes. For example, we know that acoustic features can accurately discriminate between high-activation and low-activation emotion classes [?]. Therefore, a first-level classification task on any two sets of emotion classes that have distinct activation levels can provide an initial split.

• Empirical testing: each emotional dataset may include different definitions and categories of emotion classes. Due to the complex combinatorial nature of finding the most distinguishable pair of emotion classes, we can rely on results obtained from a series of simple empirical studies.
For example, classification based on Gaussian Mixture Models, Linear Discriminant Analysis, a multi-class support vector machine, and/or any other scheme can easily be trained as a preliminary step. While each classifier may obtain different accuracies, the discriminability between the emotion classes can be observed from the resulting confusion matrix, and the hierarchical structure can then be determined.

This approach to designing the classification tree has the potential to propagate fewer classification errors down the tree than the conventional, intuitive approach of classifying non-emotional vs. emotional classes as the first step and then splitting the broad emotional classes. Further, we can obtain a balanced recall percentage per emotion class by conveniently optimizing the decision threshold while performing each binary classification task.

[Figure 3.1: Proposed Classification Framework: a hierarchical binary decision tree with the easiest task as the first stage and the most ambiguous task as the last stage.]

Each classifier box shown in Figure 3.1 is a binary classifier. At each level, the hard output label of the test sample is fed into the next level of classifiers to perform another set of binary classifications. This sequence of binary classifications allows us to take advantage of the variability inherent in the data by creating initial classifications with high recall and identifying classification tasks with a high level of discriminability.

3.3.2 Building the hierarchical decision tree for the AIBO database and the IEMOCAP database

Figure 3.2 presents the proposed trees for the AIBO (left) and IEMOCAP (right) databases. The realization for each database differs but follows the structure illustrated in Figure 3.1. Both frameworks were determined through a combination of the criteria mentioned above. For the AIBO database, the classes considered are: Angry, Emphatic, Positive, Neutral, and Rest. We placed A/E vs. P at the first classification stage because multiple iterations of preliminary classification tasks using acoustic features demonstrated a high level of discrimination between these two groups of classes. We delay the decision between N and R until the end, again based on the empirical observation of the high level of similarity and ambiguity between the N and R classes in this database.

[Figure 3.2: Left: Proposed hierarchical structure for the AIBO database (A&E vs. P, followed by A vs. E, A vs. N&R, E vs. N&R, P vs. N&R, and N vs. R). Right: Hierarchical structure for the USC IEMOCAP database.]

We trained a total of six classifiers, listed as follows (each classifier was trained using all the data from the training set with class labels relevant to the task):

• Angry/Emphatic vs. Positive (A&E vs. P)

• Angry vs. Emphatic (A vs. E)

• Angry vs. Neutral/Rest (A vs. N&R)

• Emphatic vs. Neutral/Rest (E vs. N&R)

• Positive vs. Neutral/Rest (P vs. N&R)

• Neutral vs. Rest (N vs. R)

The same design process was applied to the USC IEMOCAP database. The right panel in Figure 3.2 shows the decision sequence order. The emotion classes of interest in this task are: Angry, Happy, Sad, and Neutral. We placed A/H vs. S as the first classification step. In this experiment, we use the same set of acoustic features, hypothesizing that they can accurately discriminate between these two groups of emotion classes.
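Both trees were fixed using the empirical-testing criterion described earlier. One simple way to operationalize that criterion is to run a quick multi-class classifier and read the class-pair confusability off its confusion matrix, roughly as in the sketch below; this is a rough heuristic illustration, not the exact procedure used to fix the trees in Figure 3.2.

```python
import numpy as np

def pairwise_confusability(confusion):
    """Symmetric confusability between class pairs from a preliminary
    multi-class confusion matrix (rows = true class, columns = predicted).
    Low values suggest candidates for an early, 'easy' split; high values
    suggest ambiguous pairs best left for the bottom of the tree."""
    C = np.asarray(confusion, dtype=float)
    rates = C / C.sum(axis=1, keepdims=True)       # row-normalized confusions
    pairs = {}
    n = C.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            pairs[(i, j)] = rates[i, j] + rates[j, i]
    return pairs

# e.g. sorted(pairwise_confusability(C).items(), key=lambda kv: kv[1]) ranks
# class pairs from the most separable to the most ambiguous.
```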
The neutral class is delayed until the last stage due to the difficulty of recognizing the neutral class [88]. A total of five binary classifiers for the USC IEMOCAP database were trained; they are listed below:

• Angry/Happy vs. Sad (A&H vs. S)

• Angry vs. Happy (A vs. H)

• Angry vs. Neutral (A vs. N)

• Happy vs. Neutral (H vs. N)

• Sad vs. Neutral (S vs. N)

3.3.3 Classifier for Binary Classification Tasks

Under this hierarchical framework, the specific binary classifier can be tailored to the problem domain. Many different binary classifiers have shown promising results; for example, both Bayesian Logistic Regression (BLR) [31] and the Support Vector Machine (SVM) [112] have been shown to be effective in classification tasks. Logistic regression provides a discriminative model to be used as a classifier [1], and the Bayesian version prevents overfitting by placing a prior centered at zero on the weights of the model. The SVM is a maximum-margin classifier that finds the largest separation between two classes.

In our participation in the 2009 Emotion Challenge [73], two different classifier types were used: Bayesian Logistic Regression and Support Vector Machines. Bayesian Logistic Regression obtained the best accuracy, though it was not statistically significantly better than the Support Vector Machine. As a result, we decided to employ only Bayesian Logistic Regression for each of the binary classifier boxes. The feature selection algorithm presented in Section 2.4 is based on logistic regression, so it is well suited to Bayesian Logistic Regression, since the two share many properties. However, the specific choice of binary classifier can be made along with the feature selection method to obtain performance optimized for the specific task.

Single-class bias is an issue in this emotion recognition task, as it may bias the results towards the over-represented class, in this case neutral. Prior work has shown the effectiveness of the Synthetic Minority Oversampling Technique (SMOTE) [22] in dealing with the over-representation of a single class. However, in this chapter, instead of generating artificial data samples to balance the classes, we exploit our prior knowledge about the class distribution of the training splits in the two databases to adjust the decision threshold of the Bayesian Logistic Regression and obtain a balanced recall across the emotion classes of interest.

[Figure 3.3: Left: Conventional hierarchical structure for the AIBO database (N&R vs. A&E&P as the first stage, followed by N vs. R, A&E vs. P, and A vs. E). Right: Conventional hierarchical structure for the USC IEMOCAP database (N vs. A&H&S as the first stage, followed by A&H vs. S and A vs. H).]

3.3.3.1 Bayesian Logistic Regression

A general binary logistic regression model is a discriminative model of the form shown in Equation 3.1,

p(y = 1 \mid \beta, x) = \sigma(\beta^T x)    (3.1)

where y is the class label (+1, -1), x is the input feature vector, the \beta's are the model parameters, and \sigma is the logistic function defined in Equation 3.2:

\sigma(z) = \frac{\exp(z)}{1 + \exp(z)}    (3.2)

In Bayesian Logistic Regression (BLR), we place a Gaussian prior with mean \mu = 0 and covariance \sigma^2 I on the model parameters \beta, as shown in Equation 3.3, and perform a maximum a posteriori estimation of the model parameters to prevent overfitting on the training data. This prior on the model parameters has the same effect as ridge logistic regression, where the $L_2$ norms of the model parameters are constrained.
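As a didactic illustration of this MAP estimation with the Gaussian prior (equivalent to the L2/ridge-penalized form just described), a plain gradient-ascent sketch might look as follows. This is not the software package used for the experiments in this chapter, and the learning rate and iteration count are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_map_logistic(X, y, sigma2=1.0, lr=0.1, n_iter=1000):
    """MAP estimate of logistic-regression weights under a N(0, sigma2*I)
    prior, by gradient ascent on the log-posterior.
    X : (n, d) features, y : binary labels in {0, 1}."""
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        grad = X.T @ (y - p) - beta / sigma2   # d/dbeta [log-likelihood + log-prior]
        beta += lr * grad / n
    return beta

# The decision threshold on sigmoid(X_test @ beta) can then be tuned away from
# 0.5 to balance per-class recall on a skewed training set, as done in this work.
```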
Another possible prior is a Laplacian prior, which has the same effect as lasso logistic regression. In this work, the Gaussian prior is used, since it offered better accuracy than a Laplacian prior in our empirical testing.

p(\beta_j \mid \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{\beta_j^2}{2\sigma^2}\right)    (3.3)

The BBR software [31] was used for Bayesian Logistic Regression model training and threshold tuning.

3.4 Experiment Setup and Results

The effectiveness of the proposed hierarchical classification method was evaluated on the two different databases introduced in Section 2: the AIBO and the USC IEMOCAP databases. The first set of experiments utilizes the AIBO database and follows the guidelines used in the 2009 Interspeech Emotion Challenge [73]. A training dataset with emotion labels was used to develop our algorithm. The labels for the testing dataset were unknown (the performance metrics were provided through a website interface). To show that the algorithm can be easily applied to another database, we also applied the proposed classification framework to the USC IEMOCAP database.

3.4.1 AIBO Database

Two predefined subsets of the AIBO database were available for this task: a training dataset and an unlabeled evaluation dataset. Two different experiments were designed based on this structure. In Experiment I, we analyzed our hierarchical structures using only the training subset. Leave-one-speaker-out (26-fold) cross-validation was used to estimate the classification performance (average unweighted recall). This cross-validation method was used to simulate the scenario in which the unlabeled evaluation dataset consists of a disjoint speaker set. Experiment I serves as the development phase. In Experiment II, the framework was trained on the entirety of the training dataset and tested on the evaluation dataset.

• Experiment I: Leave-one-speaker-out (26-fold) cross-validation on the training dataset (AIBO database)

• Experiment II: Evaluate performance on the unlabeled evaluation dataset (AIBO database)

3.4.1.1 Results of Experiment I on the AIBO database

The unweighted recall for Bayesian Logistic Regression was 48.27% (Table 3.4). The columns of the confusion matrices in Table 3.4 represent the hypothesized class labels and the rows the annotated ground-truth class labels. The conventional framework used for comparison is based on classifying non-emotional vs. emotional classes as the first step; Figure 3.3 shows this conventional structure for the AIBO database in Experiment I.

Table 3.4: Experiment I: Summary of Results

  Bayesian Logistic Regression (BLR): Proposed
  Unweighted Recall (UA): 48.27%    Weighted Recall (WA): 48.82%

             Angry   Emphatic   Neutral   Positive   Rest
  Angry        504        145       126         53     53
  Emphatic     395       1078       412        101    107
  Neutral      506       1020      2703        776    585
  Positive      21         31       121        439     62
  Rest          97        130       185        171    138

  Bayesian Logistic Regression (BLR): Conventional
  Unweighted Recall (UA): 38.42%    Weighted Recall (WA): 48.66%

             Angry   Emphatic   Neutral   Positive   Rest
  Angry        420        129       193         30    109
  Emphatic     342        916       596         58    181
  Neutral      395        923      3207        317    748
  Positive      18         30       248        132    246
  Rest         106        133       217         94    171

Several observations can be made from examining the results. While the conventional hierarchical structure obtains approximately the same weighted accuracy as the proposed framework, the proposed method outperforms the conventional method in unweighted accuracy. The confusion matrices show that the largest improvement is in the emotion class Positive, which is confused mostly with Neutral and Rest.
Tackling the classification problem involving the Positive emotion as the first step is essential, as evidenced by the increase in the recognition accuracy of this class. The recall for A/E vs. P at the first step of the proposed method is 94.82%; the first-stage binary classifier is able to separate these two emotion groups with high accuracy. Therefore, compared with the conventional method, we retain the majority of the members of these two groups of emotion classes by placing this classification task at the first step of the proposed structure.

We classify the emotion class Rest at about chance level. This is expected because this class is not as strictly defined as the other emotion classes. Rest is misclassified more often as Neutral or Positive than as Angry or Emphatic (Table 3.4), indicating that Rest is acoustically similar to Positive and Neutral in this database. We obtain good recall for four of the emotion classes (Angry: 57.2%, Emphatic: 51.5%, Positive: 65.1%, Neutral: 48.4%), with the exception of Rest (19.1%). This indicates that the structure of our framework is able to handle a highly skewed database and obtain a more balanced retrieval rate over the emotion classes. This is essential in emotion recognition since, in natural human interaction, Neutral often constitutes the majority of expressed emotions. Balancing the recognition accuracy using the proposed structure is advantageous because it allows several other less frequently expressed but informative emotion classes to be identified.

3.4.1.2 Results of Experiment II on the AIBO database

In Experiment II, we evaluated our framework on the evaluation dataset, which was the actual task of the 2009 Emotion Challenge. The six classifiers were trained on the entirety of the training dataset. The unweighted recall using Bayesian Logistic Regression was 41.57%. A summary of the results is shown in Table 3.5. An HMM baseline is also presented [108] because HMMs are generally effective at mitigating the effect of class bias. The SVM-based baseline is used as our baseline since it obtains the highest accuracy.

Our proposed framework using Bayesian Logistic Regression achieved the highest average unweighted recall. It improves on the baseline model (a multi-class SVM with SMOTE class balancing) presented in [108] by 3.37% absolute (8.82% relative). The average unweighted recall on the three emotional classes (Angry, Emphatic, and Positive) is about 52%, whereas the average unweighted recall on the non-emotional classes (Neutral and Rest) is only about 25%. This result demonstrates that our proposed framework is capable of retrieving the emotional utterances even though some of these emotion classes constitute only a small portion of the database. This characteristic of the proposed framework is advantageous in real-world applications, where the majority of expressions are often Neutral. Furthermore, we also notice a large discrepancy between Experiments I and II for this database. We speculate that, aside from the fact that the two datasets were recorded at different places with different subjects, the primary reason for this discrepancy is that the evaluation dataset may be more unbalanced (the class of Neutral accounts for almost 65% of the dataset).
It would be interesting to investigate whether improved classification accuracy could be obtained by incorporating knowledge of the emotion class distribution of this portion of the database.

In summary, our proposed framework, which treats five-class emotion recognition as a sequence of binary classification tasks, improves the unweighted recall by 3.37% absolute (8.82% relative) on the unlabeled evaluation dataset compared with the Support Vector Machine with SMOTE baseline provided by the 2009 Emotion Challenge. Since the AIBO database contains realistic and spontaneous interactions, it is encouraging to see that the framework has the potential to overcome the class imbalance problem in the database and to achieve a good recall percentage, especially on the emotional classes.

Table 3.5: Experiment II: Summary of Results

  Bayesian Logistic Regression (BLR)
                  Unweighted Recall (UA)   Weighted Recall (WA)
  SVM Baseline                     38.2%                  39.2%
  HMM Baseline                     35.9%                  37.2%
  BLR                             41.57%                 39.87%

             Angry   Emphatic   Neutral   Positive   Rest
  Angry        290        171        65         63     22
  Emphatic     210        752       325        136     85
  Neutral      748       1094      2057       1109    369
  Positive      23         13        39        131      9
  Rest          95         58       134        197     62

3.4.2 USC IEMOCAP Database

To show that the framework can easily be applied to another emotional database, the USC IEMOCAP database was used with a leave-one-speaker-out cross validation evaluation scheme. The leave-one-speaker-out setup was used to emulate the AIBO evaluation condition, in which the testing data consists of a speaker set disjoint from that of the training set (the USC IEMOCAP database does not specify training and testing splits, so we utilize a different experimental setup for this database). We used this evaluation scheme to make the experiment comparable to the AIBO database experimental setup. In each fold, we used nine speakers as the training dataset and one speaker as the testing dataset.

3.4.2.1 Experiment Result of the USC IEMOCAP Database

A summary of the classification accuracy is shown in Table 3.6. The average unweighted recall is 58.46%, which is a 7.77% absolute (15.16% relative) improvement over a recently published result [88] on the same four emotion classes using Hidden Markov Models trained with acoustic features. In the current work, a multi-class SVM is presented as an additional baseline. Our proposed method obtains a 7.44% absolute (14.58% relative) improvement over the SVM baseline. The conventional hierarchical structure classifies Neutral vs. the others as the first step (Figure 3.3).

Table 3.6: Summary of the USC IEMOCAP Database Classification Results

  Bayesian Logistic Regression (BLR): Proposed
                  Unweighted Recall (UA)   Weighted Recall (WA)
  HMM Baseline                    50.69%                    N/A
  SVM Baseline                    51.02%                 42.41%
  BLR                             58.46%                 56.38%

             Angry   Happy   Sad   Neutral
  Angry        720     168    30       183
  Happy        319     680   205       426
  Sad           24      42   782       235
  Neutral      116     256   394       918

  Bayesian Logistic Regression (BLR): Conventional
  Unweighted Recall (UA): 53.55%    Weighted Recall (WA): 53.47%

             Angry   Happy   Sad   Neutral
  Angry        683     155    32       231
  Happy        290     585   183       572
  Sad           36      42   516       489
  Neutral      103     190   245      1146

Several observations can be made from the results. Examining the confusion matrices of both the conventional and the proposed structures, we observe the same trend as in the AIBO database. The recognition accuracy of Sad and Happy increased significantly; these two classes are mostly confused with the class of Neutral.
This confusion is alleviated by making this assessment at the first stage and leaving the assignment of Neutral to a later step; this structure improves the recognition accuracy (Table 3.6). There is very little confusion between the combined angry/happy class and the sad class: the recall percentages at the first-stage classification (A/H vs. S) are 85% and 87.5%, respectively. Most of the angry/happy vs. sad emotional utterances were successfully recalled and split at the first binary classification stage. The recognition accuracy of the emotion class Happy is lower (41.7%) than that of Angry (65.4%) and Sad (47.64%). This is in accordance with the trend found in previous work on this database [88] and likely results from the reliance on acoustic features alone (the emotion evaluation included audiovisual stimuli). Previous work has demonstrated that happiness can be more accurately modeled by incorporating facial expression features [85].

One of the most noticeable results in this experiment is that the recall percentage for the neutral class is 54.54%. This is an encouraging outcome considering the highly ambiguous nature of the neutral class, which is evident in previous results (35.23%) on the same database [85]. While the conventional hierarchical approach obtains a higher recognition accuracy on the neutral class, the proposed framework improves the recognition rates of the other three emotional classes without losing much recognition rate on neutral. This is likely because the neutral assignment is made at the last step: we have multiple binary classifiers to separate the different emotional classes from neutral instead of a single multi-class classifier to identify the neutral class. This approach takes into account the fuzziness in the definition of the neutral emotion class; it can be an emotion class in itself or a way to describe a user state that is not emotional. Overall, the result improved by 7.77% absolute compared with recently published results on the same set of emotion classes on the same database [88]. The experimental conditions differed between the two works with respect to the features used. While not directly comparable, it is still encouraging to see that, without exhaustive tuning and optimization, the proposed framework can provide significant improvement in the overall emotion recognition accuracy.

3.5 Conclusions

Accurate emotion recognition systems are essential for the advancement of human behavioral informatics and for the design of effective human-machine interaction systems. Such systems can help promote the efficient and robust processing of human behavioral data as well as facilitate natural communication. In this work, a multi-level binary decision tree structure was proposed to perform multi-class emotion classification. The framework was designed through empirical guidance and experimentation. The easiest subset of classification problems was placed at the top level to reduce the accumulation of error. This classification framework was first introduced in the Interspeech 2009 Emotion Challenge (where it placed first on the classifier sub-challenge task) and has since been tested on another emotional database, as reported in this chapter. The results show encouraging recognition rates that are competitive with the state of the art. Many future modifications can be integrated within this framework.
Instead of outputting hard labels at every level, a soft label, such as a probability measure or even a profile-based representation [90], can be used to enhance the modeling power of the proposed framework. Also, since the choice of binary classifier is flexible and largely dependent on the feature selection technique, the framework can be further improved by optimizing the choice of binary classifier along with the appropriate feature selection method at each classification stage. The major limitation of the approach described here is the empirical nature of the proposed hierarchical structure. While the proposed method has the advantage of being intuitive and efficient to design, it does not ensure an optimal solution. In future work we plan to investigate an automatic procedure for generating the hierarchical structure, which can minimize the need for several iterations of empirical testing. A specific related question for future work concerns the derivation of a hierarchical structure that not only optimally balances performance accuracy and combinatorial complexity but also yields results that are intuitively interpretable in light of psychological theories of emotion.

Chapter 4: Recognizing Emotion with Joint Interlocutors Modeling

4.1 Introduction and Motivation

In dyadic human-human conversation, the interactions between the two participants have been shown to exhibit varying degrees and patterns of mutual influence along several aspects, such as talking style/prosody, gestural behavior, engagement level, emotion, and many other types of user states [16]. This mutual influence guides the dynamic flow of the conversation and often plays an important role in shaping the overall tone of the interaction. In fact, we can view a dyadic conversation as two interacting dynamical state systems, such that the evolution of a speaker's user state depends not only on its own history but also on the interacting partner's history. This modeling will not only allow us to capture interactants' user states more reliably, but could also provide a higher-level description of interaction details, such as talking in sync, avoidance, or arguing.
Our approach contrasts with most of the previous emotion classication schemes that have primarily focused on utterance level recognition of categorical labels [87] or emotion attributes [42]. Others, such as proposed in [78] have used features that encode contextual in- formation to perform emotion recognition. However, most of these works have neither considered decoding dynamic emotions through the dialog, nor have they incorporated the mutual in uence exhibited between interactants in their models. Because of its ability to model conditional dependency between variables within and across time, we utilize the Dynamic Bayesian Network (DBN) framework to model the 49 mutual in uence and temporal dependency of speakers' emotional states in a conversa- tion. The experiment of this Chapter used the IEMOCAP database [17] since it provides a rich corpus of expressive dyadic spoken interaction. Also, detailed annotation of emo- tion is available for every utterance in the corpus. We hypothesize that by including cross speaker dependency and modeling the temporal dynamics of the emotion states in a dialog, we can obtain better emotion recognition performance and bring improved insights into mutual in uence behaviors in dyadic interaction. 4.2 Database and Annotation 4.2.1 IEMOCAP Database We use the IEMOCAP database [17] for the present study. The database was collected for the purpose of studying expressive dyadic interaction from a multimodal perspective. The designing of the database assumed that by exploiting dyadic interactions between actors, a more natural and richer emotional display would be elicited than in speech read by a single subject [19]. This data allows us to investigate our hypothesis about the mutual in uence between speakers during spoken interaction. The database was motion captured and audio recorded in ve dyadic sessions with 10 subjects, where each session consists of a dierent pair of male-female actors both acting out scripted plays and engaging in spontaneous dialogs. The analysis in this Chapter utilizes the recorded speech data from both subjects in every dialog available with speech transcriptions and emotional annotations. Three human annotations on categorical emotion labels, such as happy, sad, neutral, angry, etc, and two human evaluation of the three emotion attributes (Valence, Activation, Dominance) are available for every utterance in the database. Each dimension is labeled on a scale of 1 to 5 indicating dierent levels of expressiveness. 50 A B T-1 T Turn_A1 Turn_A2 Turn_B1 Turn_B2 Figure 4.1: Recognizing Emotion: Example of Analysis Windows. The database was originally manually segmented into utterances. But, to ensure that we have both speakers' acoustic information for a given analysis window in our dynamic modeling, we dene a turn change, T , as one analysis window. Each T consist of two turns. Each turn is dened as the portion of speech belonging to a single speaker before he/she nishes speaking, and may consists of multiple original segmented utterances. Figure 4.1 shows an example that explains our denition. The example has two speakers, A and B, and a total of two analysis windows, T- 1 and T, segmented. Speaker A is dened as the rst person to speak in a dialog, and is always the starting point of any analysis window. A speaker can speak multiple utterances in a given turn as shown in Figure 4.1 of Turn A2. Two turns - one from each speaker, denotes a turn change, which is dened as our one analysis window. 
Annotators were asked to provide a label for every utterance in the database. Since our basic unit is a turn, an emotion label is assigned to every turn, as described in the following section.

4.2.2 Emotion Annotation

In this work, we focus on the Valence-Activation dimensions of the emotion representation, since the combination of these two dimensions can be intuitively thought of as covering most of the conventional categorical emotions [42]. The dimension values for each turn are obtained by averaging the two annotated values. In order to reduce the number of possible emotional state values (5^2 = 25), we cluster the two dimensions' values. Based on our empirical observations, we decided to group the two dimension values into five clusters using the K-Means clustering algorithm. Figure 4.2 shows the clustering output.

[Figure 4.2: Recognizing Emotion: K-Means clustering output of Valence-Activation (both axes on the 1-5 scale).]

Table 4.1: Recognizing Emotion: Emotion Label Clustering (k = 5)

  Emotion Cluster   Number of Turns   Cluster Centroid (V, A)
  Class 1                      1254              (2.19, 3.29)
  Class 2                      1954              (3.15, 3.14)
  Class 3                      2027              (4.06, 2.21)
  Class 4                      1092              (1.89, 2.25)
  Class 5                      2016              (3.84, 3.55)

Although this averaging may create quantization noise, Figure 4.2 shows that the process provides reasonably interpretable clusters. For example, cluster 3, represented by diamond-shaped markers, could be thought of as corresponding to angry because of its concentration at lower values of valence with higher levels of activation; in fact, about 70% of all angry utterances in the database on which at least two annotators agreed reside in cluster 3. Cluster 2, represented by point-shaped markers and centered at about the mid-range of the Valence-Activation levels, could be thought of as the neutral emotion, and about 51% of the neutral utterances in the database reside in this cluster. A total of 5 pairs of subjects in 151 dialogs, comprising 8343 turns, are used in this Chapter. The distribution of turns for each emotion cluster and its centroid is given in Table 4.1.

4.3 Dynamic Bayesian Network Model

A Dynamic Bayesian Network (DBN) is a statistical graphical modeling framework in which each node is a random variable and the connecting arrows represent conditional dependencies between random variables. Since we want to capture the time dependency and mutual influence between speakers' emotion states, we propose to use the DBN structure shown in Figure 4.3.

[Figure 4.3: Recognizing Emotion: Proposed Dynamic Bayesian Network structure, with emotion-state nodes EMO_A, EMO_B and observation nodes F_A, F_B across windows T-1 and T.]

In Figure 4.3, the EMO_A and EMO_B nodes represent the emotion class labels of speakers A and B in the dialog, and the F_A and F_B nodes represent the corresponding observed acoustic information, modeled by a mixture of Gaussian distributions; the black rectangle represents the hidden mixture weights of the GMM. The proposed network tries to model two aspects of emotion evolution in an interaction. One is the time dependency of the emotion evolution, where a person's emotion state is conditionally dependent on his/her previous emotion state, modeled as a first-order Markov process. Second, the model incorporates the mutual influence between the two speakers in the dyadic interaction, where one speaker's emotion state is affected by the interacting partner's emotion.
The joint probability of the emotion states E_{A_t}, E_{B_t} and the feature vectors Y_{A_t}, Y_{B_t} for a dialog under this model can be factored as shown in Equation 4.1:

P(\{E_{A_t}, Y_{A_t}\}, \{E_{B_t}, Y_{B_t}\}) = P(E_{A_1}) P(Y_{A_1} \mid E_{A_1}) P(E_{B_1} \mid E_{A_1}) P(Y_{B_1} \mid E_{B_1})
    \prod_{t=2}^{T} P(E_{B_t} \mid E_{B_{t-1}}) P(E_{B_t} \mid E_{A_t}) P(Y_{B_t} \mid E_{B_t})
    \prod_{t=2}^{T} P(E_{A_t} \mid E_{A_{t-1}}) P(E_{A_t} \mid E_{B_{t-1}}) P(Y_{A_t} \mid E_{A_t})    (4.1)

4.4 Experimental Results and Discussion

4.4.1 Feature Extraction

We focused on acoustic cues for the modeling study in this Chapter. All features except speech rate were extracted using the Praat Toolkit [12], while speech rate was estimated as the number of phonemes per second obtained from the ASR forced-alignment output detailed in [17]. The following is the list of features extracted at the turn level, as previously defined.

• F0 Frequency: Mean, Standard Deviation, Minimum, Maximum, 25% Quantile, 75% Quantile, Range, InterQuantile Range, Median, Kurtosis, Skewness

• Harmonic to Noise Ratio (HNR): Mean, Standard Deviation, Minimum, Maximum, 25% Quantile, 75% Quantile, Range, InterQuantile Range, Median, Kurtosis, Skewness

• Intensity/Energy: Mean, Standard Deviation, Minimum, Maximum, 25% Quantile, 75% Quantile, Range, InterQuantile Range, Median, Kurtosis, Skewness

• Speech Rate: Mean, Maximum, Minimum

• 13 MFCC Coefficients: Mean, Standard Deviation

• 27 Mel Frequency Bank Filter Output: Mean, Standard Deviation

This resulted in a 116-dimension feature vector. Furthermore, feature normalization was performed by z-normalizing the feature vectors with respect to each individual speaker's neutral utterances. The rationale behind this normalization is that, while individuals may express emotions differently, normalizing with respect to neutral utterances should make speaker-dependent emotional modulation more comparable across speakers.

4.4.2 Experiment Setup

• Experiment I: Recognize the 5 emotion classes

• Experiment II: Recognize only the Activation and Valence dimensions (each with 3 classes) separately, using the same proposed structure

Experiment II was performed to help us identify which emotion dimension is more likely to be affected by mutual influence in an interaction. Here, each dimension was clustered again into 3 classes (High, Medium, Low) using the K-Means algorithm. Table 4.2 summarizes the data distribution and class centroids for Experiment II.

Table 4.2: Recognizing Emotion: Valence & Activation Clustering (k = 3)

                    Valence                    Activation
           No. of Turns   Centroid    No. of Turns   Centroid
  Low              2355       2.05            3096       2.21
  Medium           3271       3.29            2525       2.97
  High             2717       4.18            2722       3.69

For both experiments, forward feature selection was performed, with accuracy percentage as the stopping criterion, to reduce the number of features. We then analyzed four different structures representing different aspects of emotional state evolution in a dialog; the four structures considered are shown in Figure 4.4. The first structure (1) is our baseline model, which does not incorporate any time or mutual-influence dependency and therefore recognizes each turn separately with a trained GMM using just the acoustic cues. Structure (2) incorporates the time dependency of an individual speaker's emotion without mutual influence from the interacting partner. Structure (3) models only the mutual influence between speakers, and Structure (4) is our proposed complete model, which combines both time and cross-speaker dependencies.
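For concreteness, the sketch below evaluates the log of the factorization in Equation 4.1 (i.e., the proposed Structure (4)) for one dialog, given pre-trained conditional probability tables and per-turn observation log-likelihoods. All variable names and table layouts are hypothetical stand-ins for quantities learned during training, not the toolbox implementation described next.

```python
def log_joint(e_A, e_B, ll_A, ll_B, prior_A, trans_A, trans_B, cross_AB, cross_BA):
    """Log of Equation 4.1 for one dialog of T turn-change windows.
    e_A, e_B  : emotion-state index sequences for speakers A and B (length T)
    ll_A[t]   : log P(Y_At | E_At = e_A[t]); ll_B likewise for speaker B
    prior_A   : log P(E_A1) (1-D array)
    trans_A   : log P(E_At | E_At-1); trans_B : log P(E_Bt | E_Bt-1)
    cross_AB  : log P(E_Bt | E_At) (same window)
    cross_BA  : log P(E_At | E_Bt-1) (previous window)
    Tables are array-like objects supporting [i, j] indexing (e.g. numpy arrays)."""
    T = len(e_A)
    lp = prior_A[e_A[0]] + ll_A[0] + cross_AB[e_A[0], e_B[0]] + ll_B[0]
    for t in range(1, T):
        lp += trans_B[e_B[t - 1], e_B[t]] + cross_AB[e_A[t], e_B[t]] + ll_B[t]
        lp += trans_A[e_A[t - 1], e_A[t]] + cross_BA[e_B[t - 1], e_A[t]] + ll_A[t]
    return lp
```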
We tied the GMM parameters of both speakers' observation feature vectors in both experiments to maximize the use of the training data. Each trained baseline GMM's parameters were passed on to the three other structures to ensure that any change in classification accuracy is due to the change in emotion dependency structure. The model was implemented and tested using the Bayes Net Toolbox [92]. All experiments were done with 15-fold cross validation, where 140 dialogs were selected for training and about 10 dialogs were used for testing. The number of mixture components for the GMM was determined empirically to be four. At training time, emotion labels and feature vectors were provided to learn the mixture weights and the conditional dependencies between emotion states using the EM algorithm with Junction Tree inference. At testing time, the trained network decoded both speakers' emotion labels by computing the most likely path of emotion state evolution throughout the dialog given the sequence of observations.

[Figure 4.4: Recognizing Emotion: Structures of emotion state evolution: (1) Baseline, (2) Individual Time-Dependency, (3) Cross-Speaker Dependency, (4) Proposed Structure - Both Dependencies.]

4.4.3 Results and Discussion

The results of both experiments are summarized in Table 4.3. The performance measure used is the number of accurately classified turns divided by the total number of turns tested. Two different results are shown for Experiment II: the "Same" columns in Table 4.3 mean that the experiment was carried out using the same feature set obtained from the feature selection output of Experiment I, and the "Optimized" columns mean that forward feature selection was performed on the Activation and Valence experiments separately.

Table 4.3: Recognizing Emotion: Summary of Experiment Accuracy Percentage

  DBN Structure           I: 5 Emotion Classes     II: 3-Activation        II: 3-Valence
                                                   Same      Optimized     Same      Optimized
  Chance                                24.29%     37.11%    37.11%        39.21%    39.21%
  Baseline - GMM (1)                    51.53%     62.30%    63.45%        56.59%    59.89%
  Time Dependency (2)                   52.68%     62.02%    61.92%        59.78%    63.40%
  Mutual Influence (3)                  53.37%     62.52%    62.30%        59.60%    62.67%
  Proposed Model (4)                    55.20%     62.35%    62.49%        61.26%    65.02%

In Experiment I, the results show that it is beneficial to incorporate both time dependency and mutual influence on the emotion state, since both Structure (2) and Structure (3) improve the classification performance. Our proposed DBN model, which combines both dependencies, obtained an absolute 3.67% increase in accuracy (a relative 7.12% improvement) over our baseline model. To see where the improvement comes from, we can examine the results of Experiment II, where classification was performed on the Valence and Activation dimensions separately.

In Experiment II, the first thing to point out is that the classification accuracy of the baseline GMM on Valence and Activation separately shows that, using exclusively speech-related features, the classification accuracy is higher for the Activation dimension than for the Valence dimension by 5.71% absolute (10.01% relative). This agrees with our knowledge about the discriminative power of acoustic features [18] for each of these dimensions. The second observation is that we improved classification accuracy in the Valence dimension by approximately 5% absolute (8% relative) over the baseline. However, the effect is not as observable for the Activation dimension.
It appears that the advantage of this modeling comes primarily from the Valence dimension rather than the Activation dimension. We hypothesize that the mutual influence between interacting partners may be more significant in the Valence dimension; however, further analysis is necessary to verify this claim. In summary, our proposed model, which captures both time dependency and mutual influence between speakers, was able to improve the overall classification accuracy. In spite of the limited amount of interaction data (151 dialogs with 10 subjects) with potentially noisy emotion classes, it is encouraging to see that our model is able to capture these effects and improve the recognition results.

4.5 Conclusions and Future Work

Interpersonal interactions often exhibit mutual influence along different elements of interlocutor behavior. In this Chapter, we utilized a Dynamic Bayesian Network (DBN) to model this effect and better capture the flow of emotion in dialogs. In turn, we used the model to perform emotion recognition in the Valence-Activation dimensions. As shown in Section 3, it is advantageous to model the dynamics and mutual influence of emotion states in a dialog for improving emotion classification.

There are two main limitations to this Chapter. The first arises because we only had two human annotations of the emotion attributes for each utterance. In order to incorporate both annotations as our ground truth, we took the average of the two annotation values for every turn; this created noise in the emotion labels. We plan on acquiring more annotations in the future to alleviate this problem. The other limitation is that we relied only on speech-based features for our modeling; fortunately, the IEMOCAP database has detailed facial and rigid head/hand gesture information, as well as transcriptions providing language information, all of which have been shown to be useful for emotion modeling and could be incorporated within the model in the future.

Several other future directions can be pursued. One immediate extension is to provide a mapping between the decoded Valence-Activation states and more human-interpretable emotion categories, and to extend this framework as a first-stage process for inferring higher-level dialog attributes. Further, mutual influence between speakers can happen at multiple levels. In this Chapter, we examined this effect by recognizing emotion states at the turn level. Prior works have shown mutual influence on lexical structure [93] and on predicting task success [102] at the dialog level; we can analyze this effect at such levels using hierarchical structures. Furthermore, we are in the process of obtaining other forms of interaction databases with both natural and acted interactions. Once we acquire better insights into mutual influence in human interactions, we will not only be able to improve dialog modeling, but may also be able to incorporate such information into the design of robust machine spoken dialog interfaces.

Chapter 5: Quantifying Vocal Entrainment in Dialog

5.1 Introduction and Motivation

Various psychological studies of interpersonal communication (e.g., [4, 16, 116]) conceptualize dyadic human-human interaction as an adaptive and interactive process. This process occurs spontaneously in the progression of human interactions, serving multiple purposes, including achieving efficiency, communicating interest and involvement in the interaction, and increasing mutual understanding through behavioral and affective mechanisms.
This mutual coordination of behaviors between interlocutors, both in timing and in expressive form, is a phenomenon variously referred to as entrainment, accommodation, or interaction synchrony. A systematic and quantitative method for assessing and tracking this notion of behavioral dependency between interlocutors in conversation is essential in characterizing the overall quality and dynamic flow of human communication in general.

Moreover, numerous psychological theories of intimate relationships, such as couples' interactions, consider behavioral dependence to be a defining and core element of the theory. Support for this theoretical notion comes from a very large body of psychological and communication studies linking various forms of behaviorally dependent couples' interactions to individual well-being (e.g., psychological and physical health) and relationship outcomes (e.g., divorce and domestic violence; [26, 51, 53, 91, 105]). The quantitative study of the entrainment phenomenon thus becomes especially important, not only because of its crucial role in analyzing human communication in general but also because of its utility in providing insights into the study of various mental distress and well-being conditions.

Technological advances in capturing human behaviors with increasing ecological validity, together with mathematical capabilities for quantifying interdependent processes, have enabled new computational approaches, referred to by us as behavioral signal processing (BSP), to support behavioral studies of human communication. The goal of BSP is to quantify and recognize subjective and abstract human states of interest (e.g., intention, emotions, distress, atypical behaviors, etc.) for behavioral domain experts in realistic interaction settings and application domains (e.g., married couples' therapy, addiction behaviors, depression, children with autism spectrum disorders) using signal processing and machine learning methods. Several previous BSP works have demonstrated the effectiveness of using machine learning techniques to recognize and analyze distressed couples' behaviors with various types of automatically derived signals [8, 35, 106].

In this work, our aim is to propose a novel computational framework to quantify the degree of entrainment in acoustic signals. Entrainment has been extensively studied for the past twenty years in the psychology literature. While this body of work has offered many insights into human interaction dynamics, methods for assessing and quantifying the degree of behavioral entrainment have received little attention.
Some key related studies focusing on quantifying entrainment in various communicative channels include the following: the investigation of mutual entrainment in vocal activity rhythms [84]; the analysis of entrainment in high-frequency word usage [94]; the computation of entrainment of body movements [103]; and the demonstration of phonetic convergence in conversational settings [98].

In the present work, we focus on quantifying one specific type of entrainment, i.e., vocal entrainment. The degree of vocal entrainment can be intuitively posed as "computing how much people speak/sound like each other as they engage in conversation". Note that we make a distinction between this intuition and the notion of "similarity in word usage", which is a different type of entrainment (i.e., lexical entrainment). In contrast to conventional supervised methods that require direct human annotation of entrainment, the proposed formulation utilizes automatically-derived acoustic features to compute entrainment measures in a completely signal-derived and unsupervised manner. Further, we demonstrate the utility of such signal-derived entrainment measures in potentially providing a tool for human interaction analysis and for recognizing affective states, using data drawn from real married couples' interactions.

The existence of vocal entrainment is well-established in psychology ([38, 39, 40, 41]) and has also been demonstrated in engineering works [67, 75]. The schemes for quantifying the degree of prosodic entrainment in most of these studies rely on classical synchrony measures (e.g., Pearson correlation) computed on functionals of separate streams of acoustic features (e.g., mean pitch value per turn) across a speaker turn change. This approach has been widely adopted across a variety of research domains, e.g., econometrics, neuroscience, and the study of physically coupled systems. Across these research areas, there is a long list of classical synchrony measures, with variants, for quantifying the interdependency between two simultaneously measured time series. A review article summarizes various measures for quantifying synchrony in electroencephalography (EEG) time series signals [27]. These classical synchrony measures can be roughly categorized into the following types: linear correlation, nonlinear correlation, phase coherence, state-based synchrony, and information-theoretic measures; they are all widely used and effective depending on the domain of study.

However, there are some limitations to this quantification approach for studying vocal entrainment, mainly due to the complex nature of human-human conversations. Human conversation has a turn-taking structure, which challenges the requirement of simultaneously measured time series of similar behaviors, notably of vocal activity (visual behavior can co-occur, and be measured, although often one speaker tends to be holding the floor at any given time). Furthermore, the analysis window length for each time series, e.g., the length of each speaking turn, varies across time (progressing through the dialog) and across variables (interlocutors in the dialog). While empirical evidence from psychological studies has shown that each acoustic feature stream carries information about the entrainment process, existing computational methods are not directly extendable to evaluating a combined pattern of multiple acoustic features, e.g., pitch, energy, and speech rate.
In this work, we propose a novel vocal entrainment quantification scheme in which, instead of computing synchrony measures on separate time series of acoustic features between interlocutors, we quantify the degree of vocal entrainment as the similarity between the interlocutors' vocal characteristic representation spaces. The vocal characteristic space is constructed from a set of parametrized raw acoustic feature streams using Principal Component Analysis (PCA).

We first introduced this notion of quantifying vocal entrainment in a PCA framework with a single metric in our previous work [71]. There, we focused only on the directionality aspect of the entrainment process, e.g., how much speaker A in a dyad entrains toward speaker B and vice versa. The measure that we devised was based on the variance preserved when projecting one set of acoustic parameters onto the PCA space of the other. The derived measures were useful for affective state recognition [70, 71]. The method, however, has robustness issues when the lengths of the turns are significantly different; projecting a much longer turn onto the PCA space of a shorter turn biases the measure toward "preserving more variance", since longer turns inherently tend to have larger variations.

We have extended our previous work in two ways: (1) introducing symmetric similarity measures and (2) improving the computational framework for the similarity metric. The symmetric similarity values are computed from the angles between principal components [52, 64] as a direct measure of similarity between two PCA spaces. This process results in similarity values that are symmetric, meaning that they have the same value for each interlocutor. We also propose to measure the degree of directional entrainment by projecting one interlocutor's acoustic parameters into the PCA space of the other interlocutor. Then, instead of measuring the variance preserved, we compute the Kullback-Leibler divergence as a metric of similarity, inspired by the work of [96]. Our proposed entrainment measures can thus be categorized into two types, symmetric and directional, and the proposed scheme yields eight vocal entrainment values in total.

We analyze these entrainment measures on a database, referred to here as the Couple Therapy corpus, of real distressed married couples going through problem-solving spoken interactions as part of their participation in a randomized clinical trial of couple therapy. The corpus not only provides rich data for human communication studies, but also represents an important realm of potentially beneficial contributions by behavioral signal processing. Recent reviews of more than three decades of marital interaction research indicate the importance of behavioral dependency for marriages (e.g., [28] (in press)).

We evaluate the proposed analytics in two ways. The first evaluation focuses on whether these signal-derived measures adequately capture the notion of behavioral dependency conceptualized in psychology studies of human communication. The assumption behind this statistical testing is that there exists a natural cohesiveness in human-human conversations; therefore, the proposed entrainment measures should result in higher values when computed on dialogs of in-conversation dyads compared to randomly generated dialogs of not-in-conversation dyads.

The second evaluation focuses on exploring the usefulness of these measures in behavioral coding.
Specifically, we analyze the relationship between vocal entrainment and the affective states of the spouses in the couple therapy interactions. Results from our analysis indicate that these signal-derived vocal entrainment values are significantly higher in interactions where the spouse was behaviorally coded as having high positive affect compared to high negative affect. This analysis provides some of the first evidence that the vocal entrainment phenomenon captured by the proposed measures offers an indication of positive affect during couple conflict. It is consistent with studies documenting the positive effects of entrainment in other interaction contexts (e.g., [62, 113]). The two results show that our proposed method can indeed be a viable signal-derived approach, not only to capture the general notion of entrainment but also to be used to quantitatively study vocal entrainment in distressed marital interactions.

To further demonstrate the importance of vocal entrainment in characterizing couples' affective states and the predictive power of these measures, we perform affect recognition of the spouse (positive affect vs. negative affect). In our previous work [70], we performed the same affect recognition task on the same database using a multiple instance learning framework. In this work we utilize a temporal modeling technique (the Factorial Hidden Markov Model), and we obtain a classification accuracy of 62.86%, which is an 8.93% absolute (16.56% relative) improvement over our previous result.

The rest of the Chapter is organized as follows: Section 5.2 describes the Couple Therapy corpus; Section 5.3 describes our PCA-based vocal entrainment quantification scheme; Section 5.4 presents the two approaches for analyzing the signal-derived entrainment measures; Section 5.5 describes the affective state classification framework and experimental results; and Section 5.6 presents conclusions and ideas for future work.

5.2 The Couple Therapy Corpus

The Couple Therapy corpus originated from a collaborative project between the psychology departments of the University of California, Los Angeles and the University of Washington [24]. This collaborative project resulted in the largest longitudinal, randomized, behaviorally-based couple therapy clinical trial to date. A total of 134 seriously and chronically distressed couples participated in the study, and they received up to 26 couple therapy sessions over the course of a year. As part of their participation in the study, they engaged in problem-solving interactions in which each spouse picked one distinct topic related to a serious problem in their relationship to discuss and try to resolve. Each topic of a problem-solving interaction lasted about ten minutes. Each interaction was audio-video recorded for observational analysis, and each spouse was coded separately by trained human annotators.

The Couple Therapy corpus consists of audio-video recordings, manual transcriptions, and behavioral codings of each couple's problem-solving interactions. The interactions that we consider were recorded at three different points in time: pre-therapy, the 26-week assessment, and the two-year post-therapy assessment. The recorded audio-video data includes a split-screen video and a single-channel far-field audio recorded from the video camera microphone. The recording conditions, e.g., microphone and camera positions, background noise level, and lighting conditions, varied from session to session.
Manual word transcriptions were carried out to aid the analysis of the couples' language use. The resulting word-level transcriptions were chronological, and the speaker identity was explicitly labeled in the transcript. The transcriptions, however, did not contain explicit timing information on the speakers' turn-taking.

For each interaction, multiple evaluators (ranging from 2 to 12) rated each spouse on 33 different behavioral codes based on two established coding manuals, the Social Support Interaction Rating System (SSIRS) [54] and the Couples Interaction Rating System (CIRS) [45]. The SSIRS consists of 20 codes assessing the emotional content and the topic of the conversation, corresponding to four different categories: affect, dominance/submission, features of the interaction, and topic definition. The CIRS consists of 13 codes that were specifically designed for coding problem-solving discussions. Each code was evaluated on an integer scale from 1 (none/not at all) to 9 (a lot). All evaluators went through a training process to standardize the coding, and they were instructed to make their judgments after observing the whole interaction. For each problem-solving interaction, each spouse was assigned one global value for each of the 33 behavioral codes. After eliminating sessions with missing codes, the Couple Therapy corpus consists of 569 problem-solving interactions, totaling 95.8 hours of data from 117 unique couples.

5.2.1 Pre-processing / Audio Feature Extraction

Since the original collection of the Couple Therapy corpus was not optimized for automatic analysis, the database was further processed to support various behavioral signal processing studies. Due to the varying noise conditions across sessions, the first pre-processing step was to identify a subset of the 569 sessions that could be robustly analyzed with automatically-derived acoustic features. This was done with two criteria: an average signal-to-noise ratio (SNR) estimate based on voice activity detection (VAD) [34], and a speech-text alignment algorithm using SailAlign [59]. The VAD was designed to detect non-speech segments longer than 300 ms. The SNR was then estimated using Equation 5.1:

SNR (dB) = 10 \log_{10} \frac{\frac{1}{|\{i \in S\}|} \sum_{i \in S} A_i^2}{\frac{1}{|\{i \notin S\}|} \sum_{i \notin S} A_i^2}   (5.1)

where \{A_i\}_{i \in S} is the set of amplitudes in the VAD-detected speech regions, and \{A_i\}_{i \notin S} is the set of amplitudes in the non-speech regions, again based on the VAD. We empirically chose an SNR of 5 dB as the cutoff for determining which sessions to include in the automatic analysis. This procedure and the chosen SNR criterion eliminated 154 sessions from the current study, resulting in a total of 415 sessions out of the original 569.
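As a concrete illustration, the following is a minimal sketch of the SNR screening in Equation 5.1, assuming a VAD has already produced a boolean speech/non-speech mask over the audio samples; the function and variable names are illustrative and not part of the actual pre-processing pipeline.

```python
import numpy as np

def estimate_snr_db(samples: np.ndarray, speech_mask: np.ndarray) -> float:
    """Estimate SNR (Eq. 5.1) as the ratio of the mean squared amplitude in
    VAD-detected speech regions to that in the remaining non-speech regions."""
    speech_power = np.mean(samples[speech_mask] ** 2)
    noise_power = np.mean(samples[~speech_mask] ** 2)
    return 10.0 * np.log10(speech_power / noise_power)

# Hypothetical usage: keep only sessions whose estimated SNR exceeds 5 dB.
# snr = estimate_snr_db(audio, vad_mask)
# keep_session = snr >= 5.0
```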
As is common in dyadic conversation studies, the spoken analysis unit adopted is the speaking turn. The Couple Therapy corpus does not contain explicit timing for each speaking turn. Instead of manually segmenting the speaking turns of each spouse for all sessions, we utilized a "hybrid" manual/automatic speaker segmentation, given the availability of the manual word transcriptions. We implemented a recursive Automatic Speech Recognition (ASR)-based procedure to align the transcription with the audio data using an open-source tool, SailAlign. As a result of this speech-text alignment, we obtained timing information for each aligned word along with an approximate speaking-turn segmentation.

We used these turn estimates (referred to as speaking turns, or just turns, in the rest of the Chapter) as an approximation of the actual speaking turns for each spouse in each session. Due to the nature of the alignment process and the non-ideal audio quality, not every word in the transcription could be reliably aligned. We further eliminated sessions where the algorithm failed to align 55% or more of the words in the transcripts. This resulted in a final dataset of 372 interaction sessions for the current study (each session has two separate sets of codes describing the behaviors of each spouse), totaling 62.8 hours of data from 104 unique couples.

After the pre-processing steps, we extracted various speech-related features for the 372 sessions whose audio quality was considered adequate for reliable feature extraction, following our previous work [9]. In this work we utilized the following subset of acoustic features: mel-frequency cepstral coefficients (MFCCs), pitch (f_0), intensity (int), and speech rate. The 15 MFCCs were computed using 25 ms windows with a 10 ms shift using the OpenSMILE toolbox [30]. The MFCCs were normalized using cepstral mean subtraction as shown in Equation 5.2:

MFCC_n[i] = MFCC[i] - \mu_{MFCC[i]}, \quad i = 0, \ldots, 14   (5.2)

where \mu_{MFCC[i]} is the mean of the i-th MFCC coefficient of the speaker across the whole session. Fundamental frequency and intensity were both extracted using the Praat toolbox [12]. Intensity values at each frame, n, were normalized using Equation 5.3:

int_n = \frac{int}{\mu_{int}}   (5.3)

where \mu_{int} is the mean intensity of speech during the active speaker regions, computed across the whole session. We further carried out several pitch-cleaning procedures to ensure that the raw pitch extraction was reasonably accurate. The raw pitch values extracted with an autocorrelation-based method often suffered from doubling and halving errors. We attempted to mitigate this issue by passing the raw pitch signals through an algorithm that detects large differences in F0 values between consecutive frames. The pitch values were forced to zero in regions where the VAD algorithm detected a non-speech portion. We interpolated over unvoiced regions with durations less than 300 ms using piecewise-cubic Hermite interpolation. Finally, a median filter of length five was applied to eliminate spurious noise. F0 values were normalized using Equation 5.4:

f_{0log} = \log_2 \frac{f_0}{\mu_{f_0}}   (5.4)

where \mu_{f_0} was computed across the whole session using the speaker segmentation results. Finally, we computed the mean syllable speaking rate for each aligned word directly from the automatic word alignment results with the help of a syllabified pronunciation dictionary (http://www.haskins.yale.edu/tada_download/index.php).
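The session-level normalizations in Equations 5.2-5.4 could be sketched as below. This is an illustrative NumPy rendering rather than the original implementation: the ratio form used for the intensity normalization mirrors Equation 5.3 as reconstructed above (an assumption), and the function name and array layouts are hypothetical.

```python
import numpy as np

def normalize_features(mfcc, intensity, f0):
    """Session-level normalization of frame-wise features (sketch).

    mfcc:      (T, 15) MFCC matrix for one speaker's frames in a session
    intensity: (T,) intensity values over the same frames
    f0:        (T,) pitch values (0 for unvoiced frames)
    """
    # Eq. 5.2: cepstral mean subtraction per coefficient.
    mfcc_n = mfcc - mfcc.mean(axis=0)

    # Eq. 5.3: intensity scaled by the session-level mean (assumed ratio form).
    int_n = intensity / intensity.mean()

    # Eq. 5.4: pitch in octaves relative to the speaker's session mean,
    # computed over voiced frames only.
    voiced = f0 > 0
    f0_log = np.zeros_like(f0, dtype=float)
    f0_log[voiced] = np.log2(f0[voiced] / f0[voiced].mean())
    return mfcc_n, int_n, f0_log
```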
5.2.2 Behavioral Codes of Interest

There are numerous psychology studies [37, 51, 113] describing and indicating various degrees of relation between affective states and behavioral dependencies in couples' interactions. The two behavioral codes of interest in this work are "global positive affect" and "global negative affect" from the SSIRS coding manual. The mean inter-evaluator agreements, computed using intraclass correlation [110], are 0.831 and 0.867 respectively, indicating reasonably high agreement between evaluators for these two behavioral codes. We analyze these two codes to understand whether the proposed entrainment measures can be used to analyze entrainment dynamics in relation to a spouse's affective state. The following coding instructions are quoted directly from the SSIRS manual for both codes.

"Global Positive: An overall rating of the positive affect the target spouse showed during the interaction. Examples of positive behavior include overt expressions of warmth, support, acceptance, affection, positive negotiation, and compromise. Positivity can also be expressed through facial and bodily expressions, such as smiling and looking happy, talking easily, looking comfortable and relaxed, and showing interest in the conversation."

"Global Negative: An overall rating of the negative affect the target spouse shows during the interaction. Examples of negative behavior include overt expressions of rejection, defensiveness, blaming, and anger. It can also include facial and bodily expressions of negativity such as scowling, crying, crossing arms, turning away from the spouse, or showing a lack of interest in the conversation. Also factor in the degree of negativity based on severity (e.g., a higher score for contempt than apathy)."

For the present work, we define two emotional classes, positive and negative, for each spouse with respect to the ratings of the above two behavioral codes. In order to mitigate the ambiguity in defining positive affect and negative affect, the specific subset of ratings used in this work comes from the extreme ratings of these affect codes. We chose the interactions in which either spouse was rated in the top 20% on either of the codes (high rating of global positive or high rating of global negative) to serve as the positive-affect and negative-affect samples.

While the coding manual instructs annotators to treat the two codes as independent of each other, the spouses that we define as having a positive affective state have a mean rating of 7.00 on "Global Positive", which is much higher than their mean rating of 2.15 on "Global Negative". The spouses that we define as having a negative affective state have a mean rating of 6.25 on "Global Negative" and a much lower mean rating of 2.08 on "Global Positive". No spouse in this dataset has mean evaluator scores in the top 20% for both "Global Positive" and "Global Negative". This dataset was used for our evaluation (Section 5.4.2) and affect recognition (Section 5.5). The resulting dataset consists of interactions from 81 unique couples with 280 ratings: 140 high-positive ratings (70 of husbands, 70 of wives) and 140 high-negative ratings (70 of husbands, 70 of wives).

5.3 Signal-derived Vocal Entrainment Quantification

Our proposed signal-derived vocal entrainment quantification is based on the core idea of computing similarity measures between the interlocutors' vocal characteristic spaces (represented as PCA spaces). The framework computes vocal entrainment values at the level of speaking turns for each interlocutor in the interaction. It involves two steps: the first is to obtain an adequate set of acoustic feature parameters to represent the speaking characteristics; the second is to represent these acoustic parameters in the PCA space, from which we compute various similarity measures. In this section, we first describe four general similarity measures, given two PCA representations of two sets of time series observations.
Then we discuss the parametrization of the acoustic features that serve as descriptors of vocal characteristics, and, lastly, we describe how to apply the method to extract a total of eight features indicating the degree of vocal entrainment for each spouse in the couples' interactions.

5.3.1 PCA-based Similarity Measures

Principal component analysis (PCA) is a well-known statistical method for analyzing multivariate time series. PCA performs an orthogonal transformation of a set of observed variables onto a set of uncorrelated variables called principal components. The first component accounts for the maximum variance of the observed data, and each succeeding component has the highest variance under the constraint that it be orthogonal to the preceding components. The mathematical formulation of PCA is given in Equation 5.5:

Y^T = X^T W = V \Sigma   (5.5)

where X is the zero-mean data matrix, W is the matrix of eigenvectors of XX^T, Y is the representation of X after PCA, V is the matrix of eigenvectors of X^T X, and \Sigma is a diagonal matrix containing the variance associated with each principal component.

Assume we are given two sets of multivariate time series observations (e.g., from the two individuals in a dyadic interaction), X_1 and X_2, each comprised of the same n time series signals but possibly of different lengths. We can then compute the two sets of principal components, W_1 and W_2, and the two associated diagonal variance matrices, \Sigma_1 and \Sigma_2. We propose two types of similarity measures based on these representations:

• Symmetric: similarity between the two PCA representations, W_1 and W_2
• Directional: similarity when representing one set of observations, e.g., X_1, in the other PCA space, e.g., W_2

5.3.1.1 Symmetric Similarity Measures

From Equation 5.5, PCA is essentially a process of rotating the original data matrix into a new coordinate system under the optimization criterion of maximizing explained variance. The general procedure for computing symmetric similarity measures with PCA is as follows:

1. Obtain principal components for each time series separately: Y_1 = X_1^T W_1, Y_2 = X_2^T W_2
2. Retain k components of each time series, k = max(k_1, k_2) with k_1 < n and k_2 < n, where each k_i explains a fixed fraction (here, 95%) of the variance
3. Compute measures of similarity based on the angles between the k retained components (Equations 5.6 and 5.7)

The first similarity value was proposed in the work of Krzanowski [64] and is given in Equation 5.6:

ssim_u(X_1, X_2) = trace(W_{1L}^T W_{2L} W_{2L}^T W_{1L}) = \sum_{i=1}^{k} \sum_{j=1}^{k} \cos^2(\theta_{ij})   (5.6)

where \theta_{ij} is the angle between the i-th component from X_1 and the j-th component from X_2, and W_{1L} and W_{2L} contain the reduced number of principal components, i.e., k components. ssim_u(X_1, X_2) ranges between 1 and k.

Another similarity measure, proposed by Johannesmeyer [52], extends the previous measure by weighting the angles with their corresponding variances (Equation 5.7). The former can be thought of as an unweighted symmetric measure and the latter as a weighted symmetric measure:

ssim_w(X_1, X_2) = \frac{\sum_{i=1}^{k} \sum_{j=1}^{k} \lambda_{X_1,i} \lambda_{X_2,j} \cos^2(\theta_{ij})}{\sum_{i=1}^{k} \lambda_{X_1,i} \lambda_{X_2,i}}   (5.7)

where \lambda_{X_1,i} and \lambda_{X_2,j} are the diagonal elements of \Sigma_1 and \Sigma_2. The interpretation of both measures, ssim_u and ssim_w, is that if two sets of observations are similar to each other, the angles between the two sets of principal components will be close to zero and, hence, cos(\theta) will be large. Note that these two measures are symmetric (i.e., ssim(X_1, X_2) = ssim(X_2, X_1)).
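A minimal sketch of the symmetric measures in Equations 5.6 and 5.7 is given below, assuming each input is a (samples x parameters) matrix with at least as many samples as parameters (as guaranteed later by the turn-merging step). The PCA is computed via the SVD, and the function names are illustrative.

```python
import numpy as np

def pca_components(X, var_threshold=0.95):
    """Principal directions (columns), per-component variances, and the number
    of components needed to explain var_threshold of the variance.
    X is (T, n): T samples (frames or words) by n acoustic parameters."""
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    var = (s ** 2) / (X.shape[0] - 1)
    k = int(np.searchsorted(np.cumsum(var) / var.sum(), var_threshold) + 1)
    return Vt.T, var, k

def symmetric_similarities(X1, X2):
    """Unweighted (Eq. 5.6) and variance-weighted (Eq. 5.7) PCA similarity."""
    W1, v1, k1 = pca_components(X1)
    W2, v2, k2 = pca_components(X2)
    k = max(k1, k2)
    W1L, W2L = W1[:, :k], W2[:, :k]
    cos2 = (W1L.T @ W2L) ** 2            # cos^2 of angles between unit components
    ssim_u = np.trace(W1L.T @ W2L @ W2L.T @ W1L)
    ssim_w = (v1[:k, None] * v2[None, :k] * cos2).sum() / (v1[:k] * v2[:k]).sum()
    return ssim_u, ssim_w
```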
5.3.1.2 Directional Similarity Measures

The entrainment process inherently carries notions of directionality: a given process can be entraining toward, or getting entrained from, another interacting process, or a combination of both. We propose to quantify each of these directionality aspects within the same PCA framework. The idea is to compute similarity when one time series is represented in the PCA space of the other time series.

For each process, X_1, there are two directions of entrainment. We can compute the degree to which it is entraining toward the other process, X_2, denoted as dsim_{X_1}^{to}, as the similarity between X_1 and X_2 when representing X_1 in the PCA space of X_2. The degree to which it is getting entrained from the other process, dsim_{X_1}^{fr}, is computed as the similarity between X_1 and X_2 when representing X_2 in the PCA space of X_1.

We first compute four normalized variance vectors (\tilde{\sigma}^n_{1to2}, \tilde{\sigma}^n_{2}, \tilde{\sigma}^n_{2to1}, \tilde{\sigma}^n_{1}). \tilde{\sigma}^n_{1to2} and \tilde{\sigma}^n_{2} are used for computing dsim_{X_1}^{to}, and \tilde{\sigma}^n_{2to1} and \tilde{\sigma}^n_{1} are used for computing dsim_{X_1}^{fr}. The steps are listed below.

• Compute \tilde{\sigma}^n_{1to2} and \tilde{\sigma}^n_{2}
1. Project X_1 using W_2: Y_{1to2} = X_1^T W_2
2. Compute the variance vector: \tilde{\sigma}_{1to2} = var(Y_{1to2})
3. Normalize the variance vector: \tilde{\sigma}^n_{1to2} = \tilde{\sigma}_{1to2} / \sum_i \sigma_{1to2,i}
4. Project X_2 using W_2: Y_2 = X_2^T W_2
5. Compute the variance vector: \tilde{\sigma}_2 = var(Y_2)
6. Normalize the variance vector: \tilde{\sigma}^n_2 = \tilde{\sigma}_2 / \sum_i \sigma_{2,i}

• Compute \tilde{\sigma}^n_{2to1} and \tilde{\sigma}^n_{1}
1. Project X_2 using W_1: Y_{2to1} = X_2^T W_1
2. Compute the variance vector: \tilde{\sigma}_{2to1} = var(Y_{2to1})
3. Normalize the variance vector: \tilde{\sigma}^n_{2to1} = \tilde{\sigma}_{2to1} / \sum_i \sigma_{2to1,i}
4. Project X_1 using W_1: Y_1 = X_1^T W_1
5. Compute the variance vector: \tilde{\sigma}_1 = var(Y_1)
6. Normalize the variance vector: \tilde{\sigma}^n_1 = \tilde{\sigma}_1 / \sum_i \sigma_{1,i}

Each normalized variance vector characterizes the proportion of variance explained when the time series projections are represented in each of the principal components. If we retain all components, the entries sum to one. We can then treat them as random variables, V_2, V_{1to2}, V_1, V_{2to1}, with probability mass functions given by the elements of the normalized variance vectors, as described in Equations 5.8 and 5.9:

P_2 = P(V_2 = i) = \sigma^n_{2,i},  P_{1to2} = P(V_{1to2} = i) = \sigma^n_{1to2,i}   (5.8)
P_1 = P(V_1 = i) = \sigma^n_{1,i},  P_{2to1} = P(V_{2to1} = i) = \sigma^n_{2to1,i}   (5.9)

The similarity between variance vectors can then be posed as the similarity between two probability distributions. We employ the symmetric Kullback-Leibler divergence (KLD) to quantify the difference (and hence the similarity) between two probability distributions; if two sets of observations are more similar to each other, the symmetric KLD results in a lower numerical value. This method of quantifying similarity is inspired by [96], but Otey's method was applied to two completely different datasets and does not carry the same notion as our proposed method (projecting one time series onto the PCA space of another).

dsim_{X_1}^{to} = \frac{1}{2} \left( D_{KL}(P_2 \| P_{1to2}) + D_{KL}(P_{1to2} \| P_2) \right)   (5.10)
dsim_{X_1}^{fr} = \frac{1}{2} \left( D_{KL}(P_1 \| P_{2to1}) + D_{KL}(P_{2to1} \| P_1) \right)   (5.11)
D_{KL}(P \| Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}   (5.12)

where dsim_{X_1}^{to} and dsim_{X_1}^{fr} represent how much the time series observation X_1 is entraining toward, and getting entrained from, its interacting process X_2, respectively. The same procedure is carried out to calculate dsim_{X_2}^{to} and dsim_{X_2}^{fr}, representing how much the time series observation X_2 is entraining toward, and getting entrained from, its interacting process X_1. Note that while this computation results in the same numerical values for dsim_{X_2}^{to} and dsim_{X_1}^{fr} (and likewise for dsim_{X_1}^{to} and dsim_{X_2}^{fr}), the underlying interpretation of the directionality is different. It can be intuitively interpreted as a vector with two components, direction and magnitude: while the numerical values, i.e., the magnitudes, are the same, the direction of the entrainment process, toward versus from, is different.
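The directional measures in Equations 5.8-5.12 could be sketched as below. The small epsilon added to the projected variances is an implementation convenience that keeps the KLD finite when a component carries no variance (it is not part of the original formulation), and the helper names are illustrative.

```python
import numpy as np

def _pca_basis(X):
    """All principal directions (columns) of the centered data matrix X (T, n)."""
    _, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    return Vt.T

def _normalized_variance(X, W, eps=1e-12):
    """Project X onto the components W and return the normalized variance vector."""
    Y = (X - X.mean(axis=0)) @ W
    v = Y.var(axis=0) + eps          # eps keeps the KLD finite (added assumption)
    return v / v.sum()

def _sym_kld(p, q):
    """Symmetric Kullback-Leibler divergence (Eqs. 5.10-5.12)."""
    return 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def directional_similarities(X1, X2):
    """dsim^to and dsim^fr for speaker 1 with respect to speaker 2 (sketch)."""
    W1, W2 = _pca_basis(X1), _pca_basis(X2)
    p2, p1to2 = _normalized_variance(X2, W2), _normalized_variance(X1, W2)
    p1, p2to1 = _normalized_variance(X1, W1), _normalized_variance(X2, W1)
    dsim_1_to = _sym_kld(p2, p1to2)   # X1 entraining toward X2 (Eq. 5.10)
    dsim_1_fr = _sym_kld(p1, p2to1)   # X1 getting entrained from X2 (Eq. 5.11)
    return dsim_1_to, dsim_1_fr
```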
5.3.2 Representative Vocal Features

Vocal entrainment can be intuitively thought of as a phenomenon that reflects "how people sound alike when they speak to each other". In order to quantify the degree of entrainment using the method proposed in Section 5.3.1, we need to capture this speaking style with acoustic vocal features. We utilized the four acoustic feature streams described in Section 5.2.1: pitch, intensity, speech rate, and MFCCs. Prosodic cues, e.g., pitch, intensity, and speech rate, can often describe more explicit speaking style characteristics, e.g., intonation patterns, loudness, and rate of speaking. Spectral cues, e.g., MFCCs, usually correspond to more implicit speaking style characteristics (consider, for example, their success in emotion recognition and speaker identification tasks).

We carried out further parametrization of these features because the extracted 10 ms frame-by-frame values are too fine-grained given the longer-span nature of entrainment. To adequately capture the inherent dynamic variations of the acoustic features when characterizing speaking styles at the turn level, we parametrized the raw acoustic features at the word level using statistical functionals and contour fitting. Both contour-based and statistical functional analyses are common for pitch and intensity: contour-based methods capture the temporal variation, while statistical functionals describe the overall statistical properties. We therefore parametrized these two feature streams using both methods. We used a least-squares fit of the pitch values with a third-order polynomial (Equation 5.13) and a first-order polynomial fit of the intensity values (Equation 5.14) at the word level. This least-squares polynomial parametrization is the same as in our previous work on analyzing entrainment of individual prosodic feature streams [67]:

f_{0log}(t) = \alpha_3 t^3 + \alpha_2 t^2 + \alpha_1 t + \alpha_0   (5.13)
int_n(t) = \beta_1 t + \beta_0   (5.14)

To further capture the statistical properties, we computed the mean (\mu_{f_{0w}}, \mu_{int_w}) and variance (\sigma^2_{f_{0w}}, \sigma^2_{int_w}) of both pitch and intensity at the word level. We only used \alpha_3, \alpha_2, \alpha_1 for the pitch values and \beta_1 for the intensity values to characterize the pattern of pitch and intensity dynamics; the intercept terms of the least-squares contour fits capture approximately the same information as the means because they tend to be highly positively correlated with them. The speech rate feature is one-dimensional and is based on the average syllable rate. We computed the mean and variance of the 13 MFCCs, resulting in 26 MFCC-related parameters per word.
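A sketch of the word-level contour parametrization (Equations 5.13 and 5.14) and the accompanying statistics is shown below; it assumes the word-level pitch and intensity segments have already been normalized and contain enough frames for the fits, and the function name is illustrative.

```python
import numpy as np

def parametrize_word(f0_log_word, int_n_word):
    """Word-level parametrization of pitch and intensity contours (sketch).
    Assumes at least 4 pitch frames and 2 intensity frames in the word."""
    t_f0 = np.linspace(0.0, 1.0, len(f0_log_word))
    t_in = np.linspace(0.0, 1.0, len(int_n_word))

    # Eq. 5.13: third-order least-squares fit of the log-pitch contour.
    a3, a2, a1, _a0 = np.polyfit(t_f0, f0_log_word, deg=3)
    # Eq. 5.14: first-order least-squares fit of the normalized intensity contour.
    b1, _b0 = np.polyfit(t_in, int_n_word, deg=1)

    # Word-level statistical functionals (means and variances).
    stats = [f0_log_word.mean(), f0_log_word.var(),
             int_n_word.mean(), int_n_word.var()]
    # Intercept terms are dropped, as in the text; the full 35-dim vector also
    # includes the syllable rate and 26 MFCC means/variances per word.
    return [a3, a2, a1, b1] + stats
```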
The following is the final list of acoustic feature parameters calculated for each word:

• Pitch parameters (5): \alpha_1, \alpha_2, \alpha_3, \mu_{f_{0w}}, \sigma^2_{f_{0w}}
• Intensity parameters (3): \beta_1, \mu_{int_w}, \sigma^2_{int_w}
• Speech rate (1): sylb
• MFCCs (26): \mu_{MFCC_w[i]}, \sigma^2_{MFCC_w[i]}, i = 0, ..., 12

Figure 5.1: Quantifying Vocal Entrainment: Example of computing measures quantifying vocal entrainment for turns H_t in a dialog

This parametrization results in a 35-dimensional vocal characteristic parameter vector derived from the raw acoustic low-level descriptors of each word. Voice quality features (e.g., shimmer, jitter, harmonics-to-noise ratio) also convey information about vocal characteristics; however, they are computed from the estimated fundamental frequencies, and since it is challenging to robustly estimate them in noisy conditions, we did not include them in the present work.

5.3.3 Vocal Entrainment Measures in Dialogs

Section 5.3.1 described a general framework for computing similarity between two multivariate time series using PCA, and Section 5.3.2 described the acoustic parameters used to represent vocal characteristics. In this section we describe the complete process of quantifying vocal entrainment in human-human conversation.

There are two variants of each of the similarity values proposed in Section 5.3.1, depending on the manner in which the PCA vocal characteristic space is computed. Since the PCA is meant to represent vocal characteristics, we computed PCA for each speaker both at the speaking-turn level (locally-computed PCA) and at the subject level (globally-computed PCA). The locally-computed vocal characteristic space was specified by performing PCA on every single turn; if a turn did not have at least 35 words, it was merged with nearby turns to ensure a unique PCA representation, given that the dimension of the acoustic parameters is 35. The globally-computed vocal characteristic space was specified by first aggregating all the turns of a single subject from all the sessions he/she participated in and then performing PCA. The locally-computed PCA captures the moment-by-moment changes in the vocal characteristics of an individual speaker, while the globally-computed PCA captures an individual's overall vocal properties.

In the case of the locally-computed PCA, the computation procedure listed in Section 5.3.1 can be implemented directly, resulting in four vocal entrainment values for each speaker at each speaking turn, denoted as dsim_{Lto}, dsim_{Lfr}, ssim_{Lu}, ssim_{Lw}.

In the case of the globally-computed PCA, we substitute the locally-computed turn-level PCA components with the globally-computed subject-level PCA components. All of the projections and variance vector computations remain the same and use the turn-level acoustic parameters. Because the projections are done using turn-level acoustic parameters, the resulting computation can be interpreted as finding similarities of the local representations derived from the global vocal characteristics of the interlocutors. This method gives four additional vocal entrainment values, denoted as dsim_{Gto}, dsim_{Gfr}, ssim_{Gu}, ssim_{Gw}.

The complete procedure for computing vocal entrainment is illustrated in the example depicted in Figure 5.1.
For each speaker (husband, wife) at each of their speaking turns (H_t, W_t) in the Couple Therapy database, we compute the eight similarity measures (Section 5.3.1) between H_t and W_t, using z-normalized acoustic parameters (Section 5.3.2) as the multivariate time series observations, with the two ways of computing the PCA described above. These values serve as quantitative descriptors of vocal entrainment for that speaker at that moment. This process is repeated for every speaking turn across all of the interaction sessions and for every speaker in the database. Table 5.1 summarizes the eight entrainment measures and the associated computation methods.

Table 5.1: Quantifying Vocal Entrainment: Summary of the methods used to compute the eight vocal entrainment measures

Measure    | Symmetric              | Directional      | PCA type
           | unweighted | weighted  | toward | from    | global | local
ssim_Gu    |     X      |           |        |         |   X    |
ssim_Gw    |            |    X      |        |         |   X    |
ssim_Lu    |     X      |           |        |         |        |   X
ssim_Lw    |            |    X      |        |         |        |   X
dsim_Gto   |            |           |   X    |         |   X    |
dsim_Gfr   |            |           |        |   X     |   X    |
dsim_Lto   |            |           |   X    |         |        |   X
dsim_Lfr   |            |           |        |   X     |        |   X

5.4 Analysis of Vocal Entrainment Measures

Section 5.3 described a framework for measuring vocal entrainment in a completely signal-derived and unsupervised manner. In order to understand whether the resulting measures carry meaningful insight into the entrainment phenomenon we would like to capture, we devised a statistical hypothesis test to investigate whether these signal-derived measures capture the natural cohesiveness expected in dyadic conversations. We further analyzed the phenomenon of vocal entrainment in relation to the affective state of the spouse (rated by expert coders using the SSIRS). We carried out our analysis on the Couple Therapy corpus described in Section 5.2.

We utilized two different statistical testing techniques. We first carried out the commonly-used Student's t-test, given the large number of samples in our database. The histograms show that the directional measures are skewed slightly to the left and the symmetric measures are skewed slightly to the right; we therefore took the square root of the symmetric measures and the square of the directional measures to transform the distributions to be more nearly normal before carrying out the Student's t-test. We also utilized the Mann-Whitney U-test, a non-parametric counterpart of the Student's t-test, in case the strict normality assumption were violated.

5.4.1 Natural Cohesiveness of Dialogs

In dyadic interactions, it has been shown that interlocutors exert mutual influences on each other's behavior [4, 16, 116]; in this context we refer to this well-known, intuitive property of dialogs as natural cohesiveness. We analyzed whether the proposed vocal entrainment measures capture the existence of this natural cohesiveness between interlocutors. We first computed the eight vocal entrainment measures for each spouse in the Couple Therapy database at every speaking turn in every interaction. Then, we compared the mean value of each of the eight measures to the same measures computed on "randomly-generated" dialogs. The following one-sided hypothesis test was carried out for each individual entrainment measure:

H_o: \mu_{entrain,dialog} = \mu_{entrain,rand}
H_a: \mu_{entrain,dialog} > \mu_{entrain,rand}

The hypothesis states that the measures computed on dialogs where spouses were engaging in real conversations should show higher degrees of entrainment than measures computed on artificially-generated dialogs from two randomly-selected spouses that were not interacting. Note that since the measures dsim_{Lto}, dsim_{Lfr}, dsim_{Gto}, and dsim_{Gfr} are computed from the KLD, for these measures lower numerical values indicate a higher level of entrainment (similarity).
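The transform-then-test comparison described above could be sketched as below, assuming SciPy (version 1.6 or later for the one-sided alternative argument). The direction of the alternative is flipped for the KLD-based directional measures, for which lower values mean more entrainment; the function and argument names are illustrative.

```python
import numpy as np
from scipy import stats

def compare_measure(real_vals, artificial_vals, kind):
    """One-sided comparison of one entrainment measure between real and
    artificially paired dialogs (sketch). kind is 'symmetric' or 'directional'."""
    a = np.asarray(real_vals, dtype=float)
    b = np.asarray(artificial_vals, dtype=float)
    if kind == "symmetric":
        # Square-root transform toward normality; higher values = more entrainment.
        a, b, alt = np.sqrt(a), np.sqrt(b), "greater"
    else:
        # Square transform; KLD-based measures: lower values = more entrainment.
        a, b, alt = np.square(a), np.square(b), "less"
    t_p = stats.ttest_ind(a, b, alternative=alt).pvalue
    u_p = stats.mannwhitneyu(a, b, alternative=alt).pvalue
    return t_p, u_p
```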
We followed these steps to generate "artificial dialogs" for the hypothesis testing:

1. For a given subject in the Couple Therapy corpus, in each of his/her sessions, randomly select another subject from another session in the database, with the constraints that the two subjects are not a couple and that the randomly-selected subject is of the opposite gender
2. Gather the speaking turns of these two "randomly-selected", non-interacting spouses to form an artificial dialog
3. Compute the eight entrainment values (Section 5.3.3) for this subject in this artificial dialog
4. Repeat steps 1-3 for every subject in the corpus
5. Repeat step 4 for 1000 times

The purpose of step 5 is to sample a large number of "artificial" dialogs; a sketch of steps 1 and 2 is given below.
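This is a minimal sketch under stated assumptions: the flat list-of-dicts session table and its field names are hypothetical stand-ins for however the corpus metadata is actually stored, and only the pairing step is shown.

```python
import random

def artificial_partner(sessions, subject_id, session_id):
    """Pick a pseudo-partner for one subject in one session (steps 1-2, sketch).

    sessions: list of dicts such as
        {"session": s, "subject": i, "spouse": j, "gender": "F" or "M", "turns": [...]}
    """
    me = next(x for x in sessions
              if x["subject"] == subject_id and x["session"] == session_id)
    candidates = [x for x in sessions
                  if x["session"] != session_id        # from another session
                  and x["subject"] != subject_id
                  and x["subject"] != me["spouse"]     # not the real partner
                  and x["gender"] != me["gender"]]     # opposite gender
    other = random.choice(candidates)
    # The two speakers' turns are then treated as an artificial dialog, and the
    # eight entrainment values are computed exactly as for a real dialog.
    return me["turns"], other["turns"]
```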
Table 5.2: Quantifying Vocal Entrainment: Analyzing the vocal entrainment measures for natural cohesiveness (1000 runs): percentage of runs rejecting H_o at the \alpha = 0.05 level

Measure     Student's t-test    Mann-Whitney U-test
ssim_Gu        100.00%              100.00%
ssim_Gw        100.00%              100.00%
ssim_Lu        100.00%               99.70%
ssim_Lw        100.00%              100.00%
dsim_Lto       100.00%              100.00%
dsim_Lfr       100.00%              100.00%
dsim_Gto       100.00%              100.00%
dsim_Gfr       100.00%              100.00%

Table 5.2 shows the percentage of runs (out of 1000) in which each measure passed the hypothesis test, indicating a statistically-significant higher degree of entrainment captured by that specific quantitative descriptor. We observe that both the symmetric and the directional measures of vocal entrainment almost always indicate a statistically-significantly higher degree of vocal entrainment in real conversations compared to artificial conversations. While the audio recording conditions of the Couple Therapy corpus varied from session to session, the improved robustness of the audio feature extraction (Section 5.2.1) is also evident in the result of this hypothesis test. This result provides one validation that our vocal entrainment measures, computed with signal processing techniques using audio-only features, carry meaningful information about the nature of the interaction.

Another point to make is that this test of the "natural cohesiveness" of the dialog is conceptually non-directional. A psychology study on interpersonal communication [16] describes the following phenomenon in dyadic human-human interaction: for a given attribute of interest, e.g., engagement level, when the direction of influence is introduced (rather than concentrating solely on the absolute degree of influence between the dyad), the dynamic interplay between these influences can be roughly described by three categories (Figure 5.2), depending on who exerts the stronger force of influence. It has been conceptualized that the evolution of this dynamic interplay characterizes the essential flow of the dialog.

Figure 5.2: Quantifying Vocal Entrainment: Categories of conceptualization of the dynamic interplay of directionality of influences in dyadic interactions

While our proposed signal-derived directional entrainment measures carry this notion of dynamic interplay between the influences of the dyad, it is a challenging task to systematically validate the psychological significance of the directional measures in the context of this comparison between real and artificial dialogs. However, it is encouraging to see that our proposed vocal entrainment measures demonstrate their efficacy in capturing the natural cohesiveness expected to occur in spontaneous human-human conversations.

5.4.2 Entrainment in Affective Interactions

There is extensive psychology literature studying the nature of the affective states of distressed married couples in conflict. One very important finding from the large body of work on negative conflict processes in distressed couples is that behavioral rigidity is characteristic of relationship dysfunction [29]: very dissatisfied couples tend to be very negative, and only negative, during conflict, and this "reinforcing" of negative behaviors between the spouses is common and problematic. While most studies have concentrated on the negative processes of distressed married couples, positive processes have received surprisingly little attention in the psychology literature in general and in work with distressed couples specifically.

It remains unknown what processes are associated with greater flexibility, such as increased levels of positivity, during couples' conflict. We hypothesize that entrainment is likely to be one such process because it is a precursor to empathy (e.g., [113]). Numerous theoretical models of relationship functioning suggest that empathy plays a crucial role in helping couples successfully negotiate and resolve conflict [5], perhaps by allowing them to be more flexible and to express greater positive emotion during conflict. Based on this idea, the following one-sided hypothesis test was carried out on the subset of the database described in Section 5.2.2:

H_o: \mu_{entrain,pos} = \mu_{entrain,neg}
H_a: \mu_{entrain,pos} > \mu_{entrain,neg} for the symmetric measures, and \mu_{entrain,pos} < \mu_{entrain,neg} for the KLD-based directional measures

This hypothesis states that each of the vocal entrainment measures should indicate a higher degree of similarity for spouses in sessions rated with positive affect compared to spouses in sessions rated with negative affect.

Table 5.3: Quantifying Vocal Entrainment: Analyzing the vocal entrainment measures for affective interactions (positive affect vs. negative affect): one-sided p-values

Measure     Student's t-test    Mann-Whitney U-test
ssim_Lu        0.047                0.047
ssim_Lw        0.012                0.003
ssim_Gu        0.073                0.045
ssim_Gw        0.005                0.406
dsim_Lto       0.098                0.078
dsim_Lfr       0.002                0.002
dsim_Gto       0.016                0.004
dsim_Gfr       < 0.0001             < 0.0001

Table 5.3 shows the results of the hypothesis testing; p-values below 0.05 indicate statistical significance at the 5% level. We observe that most of the measures (especially the directional measures) indicate a statistically-significantly higher degree of vocal entrainment for spouses rated with positive affect. One observation about the symmetric measures is that each one indicates slightly different results. For example, the unweighted symmetric measure ssim_Lu is at the borderline of the significance level, and the Student's t-test and the Mann-Whitney U-test indicate vastly different p-values for ssim_Gw. In fact, ssim_Gu indicates that the degree of entrainment is higher for the spouses rated with negative affect. For the directional measures, the degrees of vocal entrainment are all statistically-significantly higher (except dsim_Lto). This suggests that, although all of the entrainment measures are based on metrics of similarity, the directional measures carry somewhat distinct information from the symmetric measures.
While most psychological studies have shown that the "behavioral dependency" of the negative process often occurs in distressed married couples' interactions, we have demonstrated initial empirical evidence that there can be a higher degree of vocal entrainment for spouses rated with a positive affective state when the directional influences of the vocal entrainment phenomenon are considered. More detailed analyses are needed for the entrainment process as measured by the symmetric measures. Furthermore, although there has been previous work [99] analyzing the notion of directionality of influence in affective marital interactions, knowledge of the dynamics of this interplay remains limited. The analyses presented here can be viewed as a first attempt to bring insights into the positive processes of distressed couples' interactions.

In summary, we tested two psychology-inspired hypotheses (Sections 5.4.1 and 5.4.2). The purpose was to demonstrate that, although our entrainment measures are completely signal-derived and focus only on acoustic vocal properties, the resulting quantitative descriptors can capture vocal entrainment between interlocutors. These tests, however, only examined the overall pattern of entrainment and not the dynamic flow of the interplay between the interlocutors' mutual influences; this was due to the challenge of grounding the results within the complex nature of these pattern variations. It is, nevertheless, encouraging to see that these computational measures are capable of numerically describing several important aspects of human interactions. The approach also presents itself as a potentially viable method for helping psychologists quantitatively study entrainment in interpersonal communication and, more importantly, in various mental health applications, e.g., distressed marital interactions, where knowledge remains limited.

Figure 5.3: Quantifying Vocal Entrainment: Examples of vocal entrainment measures dsim_Lto, dsim_Lfr, ssim_Lu, ssim_Lw computed for one couple in different affective interactions; (a) and (b) correspond to positive affect, (c) and (d) correspond to negative affect

5.5 Affect Classification using Entrainment Measures

We performed affect recognition to examine the predictive power of these audio signal-derived measures. The results further demonstrate that the entrainment process is key for understanding the nature of affective interactions. The affective state was rated based on the judgments of psychology-trained observers for each spouse at the session level in the Couple Therapy corpus described in Section 5.2.2. We first briefly discuss the background of the statistical modeling framework, the Factorial Hidden Markov Model, that we used to perform affect recognition. We then describe the classification setup, and finally we present the results and discussion.

5.5.1 Classification Framework

The entrainment phenomenon is a complex temporal evolution of the interplay between the directions of influence of the interlocutors (see the conceptualization shown in Figure 5.2). As an example illustrating the complex dynamics of our proposed vocal entrainment values, Figure 5.3 shows four vocal entrainment measures for a specific couple under two different affective states (two sessions per emotion class). There is no easily-observable pattern of evolution in any individual entrainment measure for the two emotion classes throughout the dialog, and each measure seems to indicate a slightly different degree of vocal entrainment at the same time point.
In order to better model and capture this complex dynamic in an interaction and to perform affect recognition, we employ temporal statistical modeling techniques. Because the entrainment measures (our observed features) were computed at each speaking turn while the affective state rating was assigned at the interaction-session level, this method of performing recognition was deemed appropriate in this context.

5.5.1.1 Factorial Hidden Markov Model

The Factorial Hidden Markov Model (FHMM, [33]) is a generalization and extension of the Hidden Markov Model (HMM). HMMs are discrete state-space probabilistic generative models and are often used to describe the statistical generative process of a time series. HMMs have been used successfully in various recognition applications, notably automatic speech recognition. Given a time series, O = {O_t : t = 1 ... T}, an HMM assumes that the observation at each time index is generated from one of K discrete states. These K states are not observable, hence the name hidden. Each state can be thought of as representing a particular pattern in the observed signals. The probability that an observation sequence, O, is generated by a particular HMM, \lambda_i, is expressed in Equation 5.15:

p(O | \lambda_i) = \sum_{S} \pi(S_1) p(O_1 | S_1) \prod_{t=2}^{T} p(S_t | S_{t-1}) p(O_t | S_t)   (5.15)

where:
• O = sequence of observation vectors {O_t : t = 1 ... T}
• S = sequence of discrete states {S_t : t = 1 ... T}
• p(S_t | S_{t-1}) = transition probability from state S_{t-1} to S_t
• \pi(S_1) = initial state probability
• p(O_t | S_t) = probability of the observation vector O_t given the state S_t
• K = number of states in the model, i.e., S_t \in {1, ..., K}

HMMs can easily be represented with a directed acyclic graphical structure, since an HMM is a special case of a Dynamic Bayesian Network (DBN). The DBN representation of an HMM is shown in Figure 5.4a, where shaded nodes are hidden states. The hidden state, S_t, is represented as a multinomial random variable taking K discrete values. The assumption of a temporally first-order Markov process is encoded in the graph, along with the conditional independence assumption on the observation variable, O_t. The continuous observation node, O_t, is often modeled with a Gaussian distribution or a mixture of Gaussians.

As seen in Figure 5.4a, an HMM essentially models a single hidden process that generates a set of observable features probabilistically. A natural extension of this framework is to introduce layers of hidden processes consisting of multiple hidden variables [33]. Instead of a single state variable, S_t, we obtain a composite state variable

S_t = (S_t^1, S_t^2, ..., S_t^M)

where M represents the number of layers, and each S_t^m can take on K_m states (we simplify this by having each S_t^m take the same number of possible states, K). Without restricting the transitions between these states, we are back to a regular HMM. Hence, Ghahramani introduced one intuitive constraint on the state transitions: each state variable evolves according to its own previous dynamics and is a priori decoupled from the other state variables (Equation 5.16). This specific structure is termed the FHMM:

p(S_t | S_{t-1}) = \prod_{m=1}^{M} p(S_t^m | S_{t-1}^m)   (5.16)

An example DBN representation of an FHMM with two layers is shown in Figure 5.4b. The transitions of all the state variables can be parametrized by M distinct K \times K matrices. While the state transitions are not coupled, the state variables are coupled at the observation node.
At any time step, t, the observation, O_t, depends on every hidden process. One simple form of such a dependency is linear Gaussian: the continuous observation O_t is a Gaussian random vector (of dimension N x 1) whose mean is a linear function of the states. The observation probability is given in Equation 5.17:

p(O_t | S_t^1, ..., S_t^M) = |C|^{-1/2} (2\pi)^{-N/2} \exp\left( -\frac{1}{2} (O_t - \mu_t)' C^{-1} (O_t - \mu_t) \right), \quad \mu_t = \sum_{m=1}^{M} W^{(m)} S_t^m   (5.17)

Each W^{(m)} is an N x K matrix whose columns give the contribution to the mean for each setting of S_t^m, and C is the covariance matrix.

Figure 5.4: Quantifying Vocal Entrainment: Dynamic Bayesian Network representation of (a) an HMM and (b) an FHMM

The use of an FHMM in our context of affect recognition, given the vocal entrainment measures as observations, is intuitively appealing. The essence of the FHMM is the modeling of multiple loosely-coupled hidden processes, with the generation of the observations depending on all of the hidden processes. Since the observable feature vectors used in this recognition task were computed from both interlocutors, we satisfy this assumption. We designed the FHMM with two layers, intuitively modeling the two interacting processes (husband and wife). Furthermore, the loosely-coupled nature of the FHMM qualitatively matches the subtle nature of the entrainment phenomenon.

Since both the FHMM and the HMM can be represented as DBNs, we implemented them using the Bayes Net Toolbox (BNT) [92] with the standard junction tree algorithm as the exact inference method; an FHMM with two hidden processes is tractable for the junction tree algorithm. Expectation-maximization (EM) was carried out to estimate the model parameters. We used a mixture of Gaussians to model the observations, which was done by simply adding another discrete node in the construction of the DBNs. The classification rule was based on the standard maximum a posteriori probability shown in Equation 5.18:

i^* = \arg\max_i P(\lambda_i | O), \quad i \in \{positive affect, negative affect\}   (5.18)
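For illustration, the log form of the linear-Gaussian observation density in Equation 5.17 could be computed as below. The thesis implementation used the Bayes Net Toolbox in MATLAB; this NumPy sketch (with a single Gaussian rather than a mixture) is only meant to make the factorial mean structure concrete, and the function name is illustrative.

```python
import numpy as np

def fhmm_log_obs_prob(o_t, states, W_list, C):
    """log p(O_t | S_t^1, ..., S_t^M) for the linear-Gaussian FHMM of Eq. 5.17.

    o_t:     (N,) observation vector (e.g., the eight entrainment measures)
    states:  list of M state indices, one per hidden chain
    W_list:  list of M contribution matrices, each of shape (N, K)
    C:       (N, N) shared covariance matrix
    """
    # mu_t = sum_m W^(m) column selected by the m-th chain's state.
    mu_t = sum(W[:, s] for W, s in zip(W_list, states))
    diff = o_t - mu_t
    n_dim = len(o_t)
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (logdet + n_dim * np.log(2 * np.pi)
                   + diff @ np.linalg.solve(C, diff))
```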
5.5.2 Classification Setup

We used a subset of the Couple Therapy corpus to carry out the classification task (Section 5.2.2). The recognition was a binary classification task, classifying each spouse's affective state (positive affect vs. negative affect) in a given interaction session. There are a total of 280 samples, split equally between the two emotion classes.

In addition to the eight vocal entrainment measures, we computed five more similarity measures, which we denote as "self vocal similarity" quantitative descriptors. These measure the self-similarity of a speaker's vocal characteristics within an interaction and can be interpreted, approximately, as the degree to which a single speaker's speaking style stays consistent over the course of the dialog. We computed them using the same PCA framework (Section 5.3); since these measures describe self-similarity, the "other interacting process" was not the other speaker's acoustic parameters but the same speaker's acoustic parameters from his/her immediately following speaking turn. Moreover, because these measures were computed on two turns from the same subject, many of the similarity measures using globally-computed PCAs were not applicable (they would result in the same value for all turns). We therefore used five of the eight measures (dsim^s_Lto, dsim^s_Lfr, ssim^s_Lu, ssim^s_Lw, dsim^s_Gto).

We trained and evaluated a total of five different models. We trained the combined model using a feature-level fusion of the entrainment measures and the self-similarity measures, resulting in a feature vector of length 13. We did not explicitly train an FHMM using only the self-similarity measures because the computation of these measures inherently assumes a single hidden process (they are computed from a single spouse in the dialog). We performed model evaluation using leave-one-couple-out cross-validation (81 folds), with the percentage of accurately classified emotion states as our metric. Parameters such as the number of states and the number of mixture components in the mixture of Gaussians were optimized for each test fold through a further five-fold cross-validation carried out only on that fold's training data, choosing the parameter setting with the highest accuracy. The number of states ranged from three to seven, and the number of mixture components from one to four. We also z-normalized the similarity measures to obtain better numerical properties.

5.5.3 Classification Results and Discussion

Table 5.4: Quantifying Vocal Entrainment: Results of binary affective state (positive vs. negative) recognition: percentage accurately classified (%)

Model                    Accuracy    Positive    Negative
Chance                   50.00%      N/A         N/A
HMM (Entrainment)        55.75%      49.30%      62.14%
HMM (Self Similarity)    47.50%      51.43%      43.57%
HMM (Combined)           55.72%      52.86%      58.57%
FHMM (Entrainment)       62.86%      65.71%      60.00%
FHMM (Combined)          54.29%      55.00%      53.57%

Table 5.4 shows the overall affective state classification accuracy as well as the class-wise accuracies for the five models described above. Several observations can be made. The best-performing model is the FHMM trained with the vocal entrainment features only, which obtained an overall accuracy of 62.86%. We used a one-sided McNemar's test to assess the statistical significance of this classification result, and this model (FHMM with entrainment) outperforms all four other models at the \alpha = 0.05 level. The proposed quantification method with the FHMM improves the affective state recognition accuracy by 8.93% absolute (16.56% relative) compared to using multiple instance learning with "variance preserved" as the only measure of entrainment on the same dataset. It is encouraging to see that the temporal dynamics of these quantitative descriptors of vocal entrainment possess discriminative power for classifying a spouse's affective state.

As the comparison between the FHMM and the HMM shows, using only the entrainment measures as features, the FHMM is 7.11% absolute (12.75% relative) more accurate than the HMM. This statistically-significant improvement in affect recognition emphasizes the importance of adequately capturing the interaction dynamics between interlocutors when using these entrainment measures, which are themselves derived from both spouses in the interaction.

Another interesting point concerns the comparison between the entrainment measures and the self-similarity measures. Our results indicate that merely modeling the self-similarity of speaking style in the dialog does not carry information about the affective state of the spouse: the accuracy of the HMM with self-similarity measures is even below chance, and combining these features with the entrainment measures is detrimental to the overall recognition accuracy compared to using the entrainment measures alone.
It is possible that this PCA entrainment framework is not appropriate to quantify the \self similarity" or that these measures simply do not carry information about this specic attribute of interest: aective state. However, we hypothesize the reason that we observe this classication result is that it is the interaction, the dynamic interplay 97 between spouses, that is at the core of characterizing and shaping the essence of each interlocutor's behaviors and mental states. In summary, we performed aective state recognition in married couples' interactions to investigate the predictive power of the signal-derived vocal entrainment measures. Us- ing only eight features and by utilizing FHMM, which implicitly modeled the interaction dynamics between the spouses, we obtained an accuracy of 62.86%. It is promising to see that these signal-derived measures not only can be used to quantitatively describe aspects of entrainment between interlocutors, but they also can be incorporated in a statistical modeling framework to carry out behavioral prediction tasks. 5.6 Conclusion and Future Work The degree of interpersonal behavioral dependency is a critical component in under- standing human-human communication, and it also plays a crucial role for psycholo- gists in their study of various intimate and distressed relationships. In this work, we focused on the phenomenon of entrainment. While the knowledge of this behavioral dependency, entrainment, is vast across various domains of human interaction studies, its subtle nature and often qualitative aspect has likely hindered advances in quan- tifying it computationally. In this work, we proposed a signal-derived framework for computing numerical values indicating the degree of a specic aspect of entrainment, vocal entrainment. The quantication framework is unsupervised, and the idea is cen- tered on computing various similarity measures between PCA-based representations of automatically-extracted acoustic parameters of interlocutors engaged in a dialog. We demonstrated in this work that these quantitative descriptors can capture aspects of entrainment and bring insights into distressed married couples' interactions using a well-established corpus of spontaneous aective interactions from real married couples. Furthermore, we obtained an 62.86% accuracy using just these entrainment measures in 98 a binary aective state recognition task quantitatively corroborating hypotheses about the relation between entrainment and aective behavior. There are many future directions in the work of computationally analyzing the phe- nomenon of entrainment. This work demonstrates the relation between vocal entrain- ment and aect. Although the two are related, they are not identical, and hence the upper bound of estimation of aect through vocal entrainment is unknown. To improve understanding of the role of vocal entrainment in characterizing human communication, one of the immediate directions is to extend the classication work in the Chapter to an- alyze other behavioral codes of interest in this richly-annotated corpus of real distressed married couples' interactions. We would like to examine in detail both the predictive power of these vocal entrainment measures for various behavioral attributes and the potential upper bound of classication accuracy using entrainment in the context of couples' interactions. 
Furthermore, we would also like to analyze the vocal entrainment at the session level for each couple given that the Couple Therapy database is longi- tudinal in nature. It would be insightful to have a quantitative monitoring of spouses' behaviors through a longer time span. Another important line of work is in this broad term of \entrainment". This term is a general term that includes various aspects, such as vocal entrainment, lexical entrain- ment, gestural entrainment, turn-taking entrainment, mental states entrainment, etc. In fact, this phenomenon spans multiple communicative channels and multiple levels in human communication, and often an interacting eect from all these dependencies between interlocutors characterizes the felt-sense or quality of a given interaction. Var- ious studies in mental health have pointed out the crucial aspect of this quality of an interaction in understanding dierent scenarios of interactions. One of our future works is to continuously develop computational frameworks in examining entrainment through other modalities to both see what is the relationship of same-subject, across-modality 99 signaling coherence and also to obtain richer information regarding the interlocutor mirroring behaviors. Finally, one of our long-term goals of behavioral signal processing, BSP, is to pro- vide psychologists and other domain experts automatic and meaningful engineering tools that can generate quantitative descriptors to be used in analyzing dierent domains of interpersonal communication. In fact, some of these measures have the potential of cap- turing aspects of interaction that are inherently dicult to annotate even for experts; one example is the dynamic interplay of directional in uences between dyads at various levels (acoustic, prosodic, lexical, gestural, etc). This requires a tight collaboration be- tween psychologists and engineers to develop computational methods that are grounded in psychologically meaningful questions and theory. 100 Chapter 6: Analyzing Vocal Entrainment in Marital Con ict 6.1 Introduction and Motivation Behavioral entrainment, which occurs naturally in interpersonal interactions, is a com- plex and subtly coordinated behavioral dynamic between the interacting dyad. The phenomenon has received much attention across research disciplines in human behav- ioral science. Numerous psychological literatures have stated the importance of under- standing the patterns and variations of behavioral entrainment to bring insights into, and even form theories about, higher-level human interaction dynamics and abstract human internal states. For example, Berneiri et al. demonstrated the existence of be- havioral entrainment when an interlocutor is signaling his/her explicit communicative intent for continuing engagement in the interactions [6]; likewise, Marinetti et al. de- scribed a theoretical framework of emotion processes in social interactions incorporating the phenomenon of behavioral entrainment as a crucial component [82]. The behavioral 101 entrainment phenomenon has, in general, been mentioned as a mechanism of achiev- ing eciency, increasing mutual understanding, and regulating emotions between the interacting dyads [6, 21]. Behavioral entrainment - due to its subtle nature - is a dicult behavioral dynamic to measure objectively using human observational coding approaches despite its crucial role for better analyzing the underlying aective and cognitive processes. 
Hence, a computational tool that is capable of calculating quantitative descriptors of behavioral entrainment directly using signals is a promising avenue for enriching the study of this complex phenomenon. Recently, we introduced a computational framework for quantifying entrainment re ected in one communicative channel, vocal entrainment, using acoustic signals [71]. This approach circumvents several shortcomings of existing quantication methods, such as the issues of handling the asynchronous structure of turn taking, incorporating the multivariate properties of acoustic cues, and introducing the notion of directionality in the entrainment process; it is also applicable to quantifying vocal entrainment in spontaneous natural language dialogs. With the availability of our proposed computational tool for quantifying vocal en- trainment, we carry out three analyses to bring quantitative insights into this complex and multi-faceted phenomenon in the context of studying con ictual interactions of distressed couple. Our ndings indicate that, 1. The computational measures of vocal entrainment we proposed are indicative of couples' extreme positive and negative aective states 2. Vocal entrainment encompasses a range of behaviors, and its quantitative descrip- tors correlate with multiple couple-specic behavioral dimensions that are related to relationship satisfaction outcomes 102 3. Vocal entrainment descriptors can provide a detailed picture of couple's vocal coordination as re ected in their withdrawal behavior patterns We demonstrated through the use of our computational method in analyzing aective states (nding 1 above) that in general, a higher degree of vocal entrainment is associated with a more positive aect [71]. The result signied that vocal entrainment is indicative of a positive behavioral process during the couples' interactions. In this work, we perform two further analyses to understand the role of vocal en- trainment in characterizing couple-specic behaviors with an aim at potentially shade lights into the role of vocal entrainment in assessing the overall eectiveness of cou- ple therapy. First, we examine whether this complex behavioral dynamic bears a statistically-signicant relationship with the four major high-level behavioral dimen- sions/categories, withdrawal, problem solving, positivity, negativity (commonly-used in psychology as means for monitoring changes in couples' behavior in order to study the eectiveness of a couple therapy [109, 117]). Each of these high-level behavioral dimen- sions is derived from a combination of multiple human-rated behavioral codes derived from codied manuals. The result indicates that with simple session-level statistics of our proposed computational measures, the phenomenon of vocal entrainment, are signicantly-correlated with these four major behavioral dimensions to varying degrees and in diering directions. Second, we carry out a canonical correlation analysis of vocal entrainment measures with the withdrawal behavior - the behavioral dimension that carries the most infor- mation regarding vocal entrainment as shown in our correlation analysis with the four major behavioral dimensions. Our results indicate that the most indicative behavioral code out of this behavioral dimension is the code, discussion, which can be intuitively thought of as measuring of the engagement level. We also quantitatively characterize 103 in uences between the spouses in their vocal coordination as re ected in the dynamics of their withdrawal behavior. 
6.2 The Couple Therapy Corpus We use the Couple Therapy Corpus for the present work [24]. The corpus was collected as a part of the largest longitudinal, randomized control trial of psychotherapy for severely and stably distressed couples as they engaged in problem-solving interactions. Sevier et al. [109] assessed the eectiveness of this psychotherapy through analyzing couples' interactions. They showed that there were four major behavioral dimensions, negativity, withdrawal, positivity, and problem solving, related to the outcome variable of couples' relationship satisfaction. These dimensions were derived through a combination of principal component analysis and parallel analysis, from the 32 manually-coded be- havioral measures (derived from the Social Support Interaction Rating System (SSIRS) [? ] and the Couples Interaction Rating System (CIRS) [45]). We use the following categorization of the behavioral codes in each dimension of couples behaviors based on work done by Sevier et al. [109]: • negativity: belligerence, contempt, anger, blame, defensiveness, pressure for change • withdrawal: discussion (reversed), withdraws, denes problem (reversed), avoid- ance • positivity: aection, emotion support oered, humor • problem solving: negotiate, make agreement, oers solution, instrumental support 104 Literature [109, 117] has shown that an increase in positivity and problem solving cor- respond to an increase in relationship satisfaction and a decrease for the other two dimensions. We use a subset of the original 569 sessions of problem-solving interactions due to the varying noise conditions and variable audio quality for the recorded sessions. This resulted in a subset of 371 sessions (103 unique couples) with a total of 741 ratings available for the present work. Description of the data selection criterion, extraction of various low-level acoustic features (e.g., pitch, energy, MFCCs) and the automatic speaking turn segmentation algorithm, which were performed as preprocessing prior to the computation of vocal entrainment measures, are detailed in our previous work [10]. 6.3 PCA-based Vocal Entrainment Measures In order to quantify the degree of vocal entrainment, the notion of computing similarity between two interlocutors' speaking characteristics, using principal component analysis (PCA) in an unsupervised way was introduced in our previous work [71]. We have since extended the method by introducing the use of symmetric similarity measures and improving the similarity metric computation. The detailed computational formulation and psychology-inspired verication of these signal-derived vocal entrainment measures can be found in our previous work [69]. Given two sets of multivariate time series observations, X 1 and X 2 , we can compute two sets of principal components, W 1 and W 2 , and two associated diagonal variance matrices, 1 and 2 . Two types of similarity measures, symmetric and directional, are computed based on these representations. 105 6.3.1 Symmetric Entrainment Measures The idea behind symmetric entrainment measures is centered on computing cosine an- gles between the principal components. 
We compute two different symmetric measures, depending on whether or not each component is weighted by its associated variance:

$$ssim_{u}(X_1,X_2)=\operatorname{trace}\!\left(W_{1L}^{T}W_{2L}W_{2L}^{T}W_{1L}\right)=\sum_{i=1}^{k}\sum_{j=1}^{k}\cos^{2}(\theta_{ij}) \tag{6.1}$$

where θ_ij is the angle between the i-th component from X_1 and the j-th component from X_2, and W_{1L} and W_{2L} contain the reduced number of principal components.

$$ssim_{w}(X_1,X_2)=\frac{\sum_{i=1}^{k}\sum_{j=1}^{k}\lambda_{X_1,i}\,\lambda_{X_2,j}\cos^{2}(\theta_{ij})}{\sum_{i=1}^{k}\lambda_{X_1,i}\,\lambda_{X_2,i}} \tag{6.2}$$

where λ_{X_1,i} and λ_{X_2,i} are the diagonal elements of Λ_1 and Λ_2.

6.3.2 Directional Entrainment Measures

The entrainment process is inherently directional: how a person entrains toward the other need not be symmetric. We quantify this inherent directionality by computing similarity after representing one time series in the PCA space of the other. The degree to which X_1 is entraining toward the other time series, X_2, denoted dsim_to, is the similarity between X_1 and X_2 when X_1 is represented in the PCA space of X_2. The degree to which X_1 is getting entrained from the other process, denoted dsim_fr, is the similarity between X_1 and X_2 when X_2 is represented in the PCA space of X_1.

The procedure for computing dsim retains all of the principal components after projecting one time series onto the PCA space of the other. The variances associated with all the components, after normalization, sum to one and can be treated as a probability distribution. We thus obtain two discrete probability distributions, on which we compute the Kullback-Leibler divergence (KLD) as a measure of dissimilarity (which can readily be interpreted, reversely, as similarity).

6.3.2.1 Acoustic Features

The multivariate feature time series used for computing vocal entrainment comprise word-level vocal parameters calculated at each speaking turn. Low-level acoustic descriptors are extracted at a frame rate of 10 ms, and contour parametrization and statistical functionals are used to generate vocal parameters for every word. A total of 35 vocal parameters are computed for each word, as summarized in the following list:

• Pitch parameters (5): three pitch-contour parametrization coefficients, plus the word-level mean and variance of f0
• Intensity parameters (3): one intensity-contour parametrization coefficient, plus the word-level mean and variance of intensity
• Syllabic speech rate (1)
• MFCCs (26): word-level mean and variance of MFCC[i], i = 0 ... 12

In this work, we compute a total of four measures of vocal entrainment based on locally-computed PCA, denoted here as dsim_fr, dsim_to, ssim_u, ssim_w, for every speaking turn change.
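To make the computation concrete, the following NumPy sketch implements Equations 6.1 and 6.2 and one directional measure. It is an illustrative re-implementation rather than the code used in this work; the number of retained components k is an arbitrary choice, and the specific pair of variance profiles compared by the KLD in dsim_to (the profile of X1 projected into X2's PCA space versus X1's own profile) is our reading of the description above.

import numpy as np

def pca(X):
    # Principal directions (columns of W) and component variances of a data
    # matrix X with shape (n_samples, n_features).
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt.T, (s ** 2) / (len(X) - 1)

def ssim_u(W1, W2, k):
    # Unweighted symmetric similarity (Eq. 6.1): sum of squared cosines
    # between the first k principal directions of the two speakers.
    A = W1[:, :k].T @ W2[:, :k]            # A[i, j] = cos(theta_ij)
    return np.trace(A @ A.T)

def ssim_w(W1, var1, W2, var2, k):
    # Variance-weighted symmetric similarity (Eq. 6.2).
    A = (W1[:, :k].T @ W2[:, :k]) ** 2
    return (var1[:k] @ A @ var2[:k]) / np.dot(var1[:k], var2[:k])

def dsim_to(X1, X2, eps=1e-12):
    # Directional measure: project X1 into the PCA space of X2, keep all
    # components, and compare the normalized variance profiles via KLD
    # (lower KLD can be read, reversely, as higher similarity toward X2).
    W2, _ = pca(X2)
    proj_var = np.var((X1 - X1.mean(axis=0)) @ W2, axis=0)
    _, var1 = pca(X1)
    p = proj_var / proj_var.sum()
    q = var1 / var1.sum()
    return np.sum(p * np.log((p + eps) / (q + eps)))

# Toy usage: 40 words x 35 vocal parameters per speaker in one turn exchange.
rng = np.random.default_rng(1)
X1, X2 = rng.normal(size=(40, 35)), rng.normal(size=(40, 35))
W1, v1 = pca(X1)
W2, v2 = pca(X2)
print(ssim_u(W1, W2, k=5), ssim_w(W1, v1, W2, v2, k=5), dsim_to(X1, X2))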
6.3.3 Canonical Correlation Analysis

Canonical correlation analysis (CCA), introduced by Hotelling [49], is a multivariate statistical technique that subsumes most of the common parametric univariate and multivariate statistical analyses. It is a framework for assessing the linear relationship between two multidimensional variables, X and Y, which may have different dimensionalities. The idea behind CCA is to find two basis vectors, w_x and w_y, such that the correlation between the corresponding projections of the two sets of variables is maximized. The mathematical formulation involves solving the following eigenvalue equation (the C's denote covariance matrices, ρ corresponds to the canonical correlation, and μ is a vector of mean values):

$$B^{-1}A\hat{w}=\rho\,\hat{w} \tag{6.3}$$

where

$$A=\begin{bmatrix}0 & C_{xy}\\ C_{yx} & 0\end{bmatrix},\qquad B=\begin{bmatrix}C_{xx} & 0\\ 0 & C_{yy}\end{bmatrix},\qquad \hat{w}=\begin{bmatrix}\mu_{x}\hat{w}_{x}\\ \mu_{y}\hat{w}_{y}\end{bmatrix}$$

Canonical correlation analysis, in essence, performs standard parametric statistical analysis after finding the canonical variables/functions through linear projections of each set of variables into spaces with maximal joint correlation. In this work, we treat the vocal entrainment measures as X and the behavioral codes of interest as Y when carrying out the canonical correlation analysis.

6.4 Analyses Results and Discussions

In this section, we present two analyses as described above and in Table 6.2: first, we analyze whether there is a significant correlation between individual vocal entrainment measures and the four behavioral dimensions; second, we perform a canonical correlation analysis between the vocal entrainment measures and the specific behavioral dimension, withdrawal. Behavioral code ratings are represented by the mean ratings of the human annotators. In this study, we examine the session-level statistics of the vocal entrainment measures, computed by taking the mean, maximum, and minimum (resulting in 4 measures x 3 functionals = 12 features) at the session level.

Table 6.1: Summary of correlation analyses between vocal entrainment quantitative descriptors and the four behavioral dimensions. A 'p' refers to a statistically significant positive correlation and an 'n' to a negative correlation (** indicates p-value < 0.01 and * indicates p-value < 0.05).

                      dsim_to           dsim_fr           ssim_u            ssim_w          effect size
                   mean  max  min    mean  max  min    mean  max  min    mean  max  min     (average)
  negativity        -     -    n      p     p    p      n     -    n      n     -    n        0.18
  withdrawal        p     p    p      n     n    n      -     -    -      -     -    -        0.20
  positivity        -     n    -      -     n    -      -     -    -      -     -    -        0.08
  problem solving   p     p    p      -     -    -      n     n    n      n     -    n        0.16

6.4.1 Correlation Analysis: the Four Behavioral Dimensions

We represent the score of each behavioral dimension as the average of the different code ratings within that dimension, to be consistent with the study in [109]. Spearman correlation is computed between all 12 individual measures of vocal entrainment and the scores of the four behavioral dimensions.

Table 6.2: Summary of the three analyses and the analyzed behaviors
  Past work [71]: positive and negative affective states.
  Sec. 6.4.1: high-level behavioral dimensions: withdrawal, problem solving, positivity, negativity.
  Sec. 6.4.2: behavioral codes that comprise the dimension of withdrawal: discussion, withdraws, defines problem, and avoidance.

Table 6.1 shows a summary of the results. We indicate with an 'n' that a vocal entrainment measure has a statistically significant negative correlation with the specific behavioral dimension, and with a 'p' a positive correlation; the effect size, in this work, is computed by averaging the absolute correlations of the features that are statistically significant. The first thing to note is that, with just these session-level statistics of vocal entrainment, and even though the correlation strength is relatively 'weak', many of these purely signal-derived measures are statistically significantly correlated with the high-level behavioral dimensions. This supports the view that vocal entrainment can be seen as a subtle process underlying many of these categorical, manually-coded behavioral constructs.
A second interesting observation is that measures of directional entrainment can complement the conventional interpretation of behavioral entrainment - often `sym- metric', where two processes are becoming similar without considering which one is driving this phenomenon. For example, an increase in the symmetric entrainment mea- sures seems to be negatively associated with both behavioral dimensions, negativity and problem solving, but these two dimensions are related to the couple therapy outcome in an opposite direction. However, by examining the directional entrainment measures, we can see a clear distinction between these two dimensions. This nding indicates a crucial need for quantifying the directionality of the entrainment process as it can carry signicant information regarding the interaction. We believe such a quantitative understanding of behavioral entrainment can potentially strengthen our knowledge in supporting various theoretical frameworks of human social interactions. Lastly, the ef- fect size is dierent across the four dierent dimensions, which shows that vocal entrain- ment possesses varying degrees of explanatory power in characterizing these abstract human-annotated behaviors for study of marital con ict. 110 Table 6.3: Summary of canonical correlation analysis results of `withdrawal' behavioral dimension with vocal entrainment Test of Dimensionality Dimension Wilk's() F - Value D.F. Sig. Corr. Sq.Corr. 1 to 4 0.652 13.841 24 < 0:001 0.569 (1 st ) 0.323 (1 st ) 2 to 4 0.964 1.794 15 0:03 0.142 (2 nd ) 0.021 (2 nd ) 1 st Dimension Canonical Variate/Function Variable St. Canonical Coe. Structure Coe. (r s ) Sq. Structure Coe. (r 2 s ) dsim to -0.701 -0.554 0.307 dsim fr 0.389 0.778 0.605 dsim tomax 0.428 -0.198 0.04 dsim frmax 0.128 0.595 0.354 dsim to min -0.243 -0.695 0.483 dsim fr min 0.219 0.672 0.452 discussion -1.115 -0.950 0.902 denes problem 0.079 -0.436 0.190 avoidance 0.301 -0.031 < 0:001 withdraws 0.032 -0.482 0.232 6.4.2 Canonical Correlation Analysis: Withdrawal As seen in Table 7.1, vocal entrainment measures best explain the withdrawal dimension if represented using the session-level statistics. We carry out further detailed analysis using canonical correlation analysis between the vocal entrainment and the withdrawal dimension, which consists of discussion, withdraws, denes problem, and avoidance be- havioral codes. Canonical correlation analysis is conducted on two set of variables: the set of behavioral codes dening withdrawal, and the set of six signicant vocal entrain- ment measures (mean, maximum, and minimum of dsim to and dsim fr ). Table 6.3 summarizes various statistics from the analysis. At an overall level, there exists a strong statistical signicant (p-value < 0:001) relationship between vocal en- trainment measures and the withdrawal behavior; the Wilks' lambda () is 0.652 indi- cating that the model explains a signicant portion of the variance, 34.8% (1-), across 111 the two sets of variables. We only present analysis results of the rst dimension in Table 6.3 because it captures most of the relevant information between the two sets of vari- ables (32.3% of the variances). We only interpret the variables that have ajr s j> 0:40 as the threshold in determining the importance of each variable in forming the canonical variate. The rst interesting point to note here is that, Sevier et al. 
demonstrated through the use of principal component analysis on the manually-rated behavioral codes that the discussion code, out of the four behaviors in dening withdrawal pattern, has the largest loading factor. In our analysis, we can interpret the rst canonical dimension as the most informative `common' component between the phenomenon of vocal entrainment (as derived from observed `signals') and the behavioral marking of withdrawal. From Table 6.3, we see that the most signicant determining factor in the rst canonical variate from the behavioral codes side is also discussion. This behavioral code of discussion can be intuitively thought as a measure of engagement in the interaction, making the result intuitively consistent and in accordance with the knowledge in the psychology literature. Our second observation is that the proposed computational framework provides further insights into the directional in uences between the spouses. From the results shown in Table 6.3, we can interpret that, in general, an increase in the degree of vocal entrainment (vocal characteristic matching) toward the other person is associated with a decrease in the withdrawal dimensions (especially signicant for discussion, denes problem and withdraws) - potentially related to better relationship satisfaction. The phenomenon holds for measures of entraining from the other spouse except it associates with an increase of the withdrawal (note that the interpretation on directional measures, discussion and denes problem ratings are all reversed). The analysis provides empirical evidence that these spouses' vocal entrainment (computed at the entire session-level) is 112 not only statistically related to this expert-coded withdrawal construct (often used in associating with the outcome of the couple therapy) but provides a quantitative picture of the coupling behaviors in the interacting spouses' vocal channels. In summary, our canonical correlation analysis results show that there exists a statistically-signicant relationship between vocal entrainment and an important be- havioral dimension of distressed couples, withdrawal. This relationship can be explained in part as the variation on the overall vocal engagement level (captured by these signal- derived measures and explained by the manually-coded behavior, discussion) of the spouses in their problem-solving interactions. The results further demonstrate the po- tential of this `signal-based' computational framework in bringing an objective view of behavioral entrainment phenomenon in the domain of couples interactions; it oers an objective grounding for further exploration of its in uencing dynamics in studies of interactions of marital con icts. 6.5 Lessons Learnt from Correlation Analysis With the availability of a computational framework for quantifying a subtle yet im- portant aspect of interpersonal interaction dynamics, vocal entrainment, we examine two hypotheses in quantitatively relating vocal entrainment to expert-coded behaviors, closely related to the couple therapy outcome, in couples interactions. Our analyses re- sults not only provide further insights and understanding about vocal entrainment with these expert-derived behavioral codes to the eld of psychology, but also oer an objec- tive view of analyzing this abstract process, which has repeatedly to be been implicated as an essential mechanism underlying behavioral dependency in human interactions, through the use of signal processing methodologies. There are many potential future directions. 
One of the main limitations of this work is that we examined vocal entrainment at the session-level. Some of the most 113 pressing challenges of analyzing behavioral entrainment, despite the recurring emphasis of its importance, not only involves its subtlety (this can be addressed in a signal-based approach) but also its dynamical function (it is often viewed as a `process' instead of a `xed behavior'). We need to further develop a proper understanding of behavioral entrainment in a dynamic fashion, and even more importantly, we need to devise an appropriate way of grounding these novel dynamic signal-based methods in theory of psychological signicance. We also plan to develop computational tools in quantifying other aspects of entrainment. An integrative understanding of multiple communicative channels of entrainment can further enhance and bring insights into the quantitative studies of human communication. 6.6 Vocal Entrainment and Demand and Withdraw in Couple Con ict 6.6.1 Demand and Withdraw The demand/withdraw interaction pattern is a frequently occurring, dysfunctional cycle of behavior that consists of one partner coercively pursuing change through pleading, nagging and criticism and the other partner stridently resisting change by quickly termi- nating discussions of change, altering the topic being discussed, or avoiding discussion of change altogether [23]. Higher levels of demand/withdraw behavior are associated with higher levels of relationship distress, and high levels of demand/withdraw behavior are commonly observed in treatment seeking couples as well as couples with a history of intimate partner violence [5]. Demand/withdraw behavior typically occurs in a highly polarized fashion in the in- teractions of distressed couples. Polarization refers to one partner taking on a strong demanding role (i.e., exhibiting signicantly higher levels of demanding behavior than 114 the other) and the other partner taking on a strong withdrawing role (i.e., exhibiting signicantly higher levels of withdrawing behavior than the other). Greater polariza- tion is indicated by greater dierences in the relative distributions of demanding and withdraw behaviors between spouses. The most well replicated nding related to polar- ization of demand/withdraw behavior is that the specic direction of polarization (i.e., which spouse is engaging in signicantly higher levels of demanding or withdrawing behavior) is strongly related to the topic of discussion. When discussing a topic identi- ed by the wife in a heterosexual couple, couples engage in signicantly higher levels of wife demand/husband withdraw than husband demand/wide withdraw. Likewise, when discussing a topic identied by the husband in a heterosexual couple, couples typically engage in signicantly higher levels of husband demand/wife withdraw behavior than the converse. A number of theories have been advanced to account for this pattern of behavioral polarization [5, 25, 46], and the strongest and most consistent empirical support has emerged for explanations that are dyadic in nature. For example, the con ict structure model suggests that polarization in demand/withdraw behavior occurs in large part because the change that the demander is seeking is contingent on the willingness of the withdrawer to grant it [25, 46]. In contrast, models focused on individual dierences and the impact of socio-cultural variables have received mixed empirical support. 
One promising but relatively new line of inquiry focuses on the ways that partners attempt to in uence one another during conversations about desired change. For example, be- havioral polarization is signicantly less likely to occur when responsibility for making a change in the relationship is placed on the couple relative to when it is placed either individual spouse, and greater demand/withdraw polarization is linked to higher levels of pressuring tactics [5]. 115 6.6.2 Behavioral In uence and Polarization of Demand and Withdraw We build on this nascent body of work by proposing that vocal entrainment represents a form of behavioral in uence between spouses. Behavioral in uence occurs when one spouse's vocal behaviors become signicantly more like the other spouse's vocal be- havior over time. This argument has considerable conceptual overlap with Gottman's dynamical systems models of couple behavior [36] and Boker's coupled oscillator mod- els of couple behavior [13]. All three model share two assumptions about the nature of behavior and in uence: 1) left to their own devices, spouses would show predictable variation in behavior over time, and 2) in uence occurs when variation in one spouse's behavior is associated with signicant deviations from what is expected in the other partner's behavior. The most important distinction between these three conceptual viewpoints is seen in their treatment of time. Our method for computing vocal entrain- ment does not specify a time lag over which in uence must occur. Rather, in uence is allowed to occur over multiple time lags that can vary between couples. Conversely, the other two models require that the timing of in uence be specied and set to the same interval for all couples (i.e., treated as a xed eect). Given the absence of theory or empirical evidence suggesting that in uence occurs over a particular time interval, the exibility of our approach is benecial in that it reduces the likelihood that ndings are biased by misspecication of time lags or by dierences between couples in the time lag over which in uence occurs. Though it is possible for behavioral in uence to occur in numerous ways, it is likely that directional in uence is most strongly related to demand/withdraw polarization. In a polarized interaction, the demander is pushing for a particular kind of change. Namely, the demander is requesting that the withdrawer deviate from his or her nor- mal behavior and act more in accord with the demander's wishes. Additionally, the demander is also unlikely to be willing to alter his or her behavior to become more 116 like the withdrawer's given that the withdrawer's behavior is causing the demander dis- tress. This logic suggests that withdrawer's behavior is likely to become more like the demander's and that demander's behavior is unlikely to become more like withdrawer's. Strongly polarized demand/withdraw behavior is therefore likely to be associated with these particular forms of directional vocal entrainment. Based on the theory and empirical evidence discussed, we hypothesize that greater demand/withdraw polarization will be signicantly associated with higher levels of di- rectional vocal entrainment in both spouses. More specically, higher levels of entrain- ment towards the spouse (i.e., partner a's behavior becoming more like partner b's) will be signicantly associated with being in a withdrawing role and lower levels of en- trainment towards the spouse will be signicantly associated with being in a demanding role. 
6.6.3 Data Analysis Study hypotheses were tested using multilevel models (MLM's) run in HLM Version 7.00. MLM's adjust for the multiple dependencies that arise from measuring two spouses interacting at multiple time points and allow for diering numbers of data points be- tween couples. Because it is unknown whether vocal entrainment is a property of a given interaction (i.e., a within-partner eect), a general style of interacting (i.e., a between- partner eect), or an individual dierence that is not conveyed by behavior during interaction (i.e., a contextual eect), within-partner, between-partner, and contextual eects were simultaneously estimated in all models using the following equation: Level-1: Y (wife demand/husband withdraw) tij = 0jk + 1jk (Topic ijk )+ 2jk (Person- centered husband to wife directional entrainment ijk + 3jk (Person-centered wife to husband directional entrainment ijk ) + 4jk (Person-centered husband demand wife withdraw) + e ijk 117 Level-2: 0jk = 00k + 01k (Pre to Post change jk ) + 02k (Post to Follow up change jk ) + 03k (Grand mean-centered husband to wife directional entrainment jk ) + 04k (Grand mean-centered wife to husband directional entrainment jk ) + 05k (Grand mean-centered husband demand wife withdraw jk ) + 0jk 1jk ; 2jk ; 3jk ; 4jk = i0k Level-3: 00k = 000 + 001 *(Therapy k ) +u 00k 01k , 02k , 03k , 04k , 05k , 10k , 20k , 30k , 40k = ij0 where i indexes interactions, j indexes assessment points, and k indexes couples. Demand/withdraw polarization was modeled using a residualized change strategy by regressing wife demand/husband withdraw onto husband demand/wife withdraw. In this model, the remaining variance in wife demand/husband withdraw represents the magnitude of polarization. Person mean-centered values of vocal entrainment were entered at level-1 to estimate within-person eects; grand mean-centered values of vocal entrainment were entered at level-2 to estimate between-person eects; and, post-hoc contrasts of dierences between within-person and between-person eects were used to estimate contextual eects. The model also includes topic (-.5 = husband topic, .5 = wife topic) and treatment condition (-.5 = TBCT, .5 = IBCT) as eect coded covariates and changes over time (Pre-to-post change [-1 = pre, 0 = post], post-to-follow-up 2 change [0 = post, 1 = follow-up 2]) as dummy coded covariates. Because of the dummy coding used to represent change in demand/withdraw polarization over time, the intercept in the model represents demand/withdraw polarization at the post-treatment assessment. Standardized coecients were calculated by multiplying each unstandardized coecient by the standard deviation of the predictor divided by the total standard deviation of vocal entrainment range. 118 Table 6.4: Summary of MLM results of demand and withdraw with directional vocal entrainment Fixed Eect B SE Intercept, 000 3:68 0.12 Therapy, 001 -0.13 0.18 Pre-post change, 010 0:49 0.14 Post-follow-up 2 change, 020 -0.17 0.19 Husb. to wife entrainment betweenpartner , 030 8:61 3.13 Wife to Husb. entrainment betweenpartner , 040 10:53 2.55 Husb. demander wife withdrawer betweenpartner , 050 0:09 0.08 Topic, 100 0:78 0.13 Husb. to wife entrainment withinpartner , 200 10:28 3.04 Wife to Husb. entrainment withinpartner , 300 2.54 2.42 Husb. 
demander wife withdrawer withinpartner , 400 0:15 0.09 ***: p< 0:001, **: p< 0:01 6.6.4 Results Consistent with hypotheses, within-partner eects emerged for husband to wife direc- tional entrainment and between-partner eects emerged for both husband to wife and wife to husband directional entrainment (see Table Y). At the within-partner level, higher levels of husband to wife directional entrainment were signicantly associated with higher levels of wife demander/husband withdrawer polarization (B = 10:28, p < :001). At the between-partner level, higher levels of husband to wife directional entrainment were signicantly associated with higher levels of wife demander/husband withdrawer polarization (B = 8:61, p = :007) and lower levels of wife to husband directional entrainment were signicantly associated with higher levels of wife deman- der/husband withdrawer polarization (B = 10:53, p < :001). These associations indicate that withdrawer's behavior tends to become signicantly more like demander's behavior over the course of a given interaction and that withdrawer's behavior tends to be more like demande'rs behavior and demander's behavior tends to become less like 119 withdrawers behavior as a general style of interaction that is not tied to any particu- lar interaction. Contrary to hypotheses, the within-partner eect for wife to husband directional entrainment was non-signicant (B = 2:54, p = :30). In addition to these associations involving vocal entrainment, signicant eects emerged for topic (B = 0:78, p<:001) and pre-to-post assessment change (B = 0:49, p<:001). These associations indicate that wife demander/husband withdrawer polarization was signicantly higher during topics selected by wives and that demand/withdraw polarization decreased from the pre-treatment assessment to the post-treatment assessment. No other main eects or interactions emerged as signicant. 120 Chapter 7: Modeling Human Annotation Perception 7.1 Introduction Humans are capable of combining information from multiple perceived local events that span over a given time interval to come up with an overall, global, description/judgment of often abstract attributes of interest through a complex and integrative internal per- ception mechanism. This powerful human mechanism has played a major role in aiding research for numerous scientic communities, and is especially relevant in behavioral sciences where human evaluation is repeatedly used as the core methodology for pro- viding grounding evidence in carrying out various analyses. Trained annotators are considered as objective observers providing consistent global perceptual ratings on ab- stract behavioral attributes of interest for the domain experts, after they observe the entire interaction session of the recorded behavioral data [81]. In psychology and psychiatry, studies of comparing behaviors over short, local, time scales (a speaking turn or a complete thought unit) versus long, global, time scales 121 (an interaction session or a complete clinical trial) have focused largely on the design of an appropriate `unit' for annotators to carry out behavioral observational coding. The emphasis has been placed mostly on understanding the distinction (pros and cons) between micro-analytic and macro-analytic behavioral coding standards [77]. In the domain of human perception studies, the Gestalt Principle Theory of Perception [50] - a perception theory for the human visual process - is one such theory linking local structure to global attribute. 
It states that the human visual perception is holistic (global) in nature and is governed by dierent principles relating to the structures of local events. There has not been explicit research work done in analyzing human perception process in the context of high-level behavioral observational coding. In this work, our aim is to bring insights into this local-global process using machine learning algorithms within a behavioral detection framework. We carry out analyses using Multiple Instance Learning (MIL) and Sequential Prob- ability Ratio Test (SPRT) examining the question of how human annotators give an overall rating of abstract behavioral attributes at an interaction session level. Could it be based on: 1. isolated saliency - judging globally based on locally isolated highly informative events ? 2. causal integration - judging globally based on integrating information over time in the annotation process ? MIL presents a probabilistic formulation for extracting highly-salient local events related to the global rating, and SPRT is a statistical formulation that carries the notion of continuously monitoring and aggregating required information for making a decision in a sequential manner. 122 Most research relies on controlled lab experimentation for studying human percep- tion. In the present paper, we formulate our analysis of human perception as a machine learning and binary detection/classication framework on a large corpus of spontaneous dialogs with multiple human annotations. The analysis is a two-step process that in- volves studying the extreme human behaviors that are rated consistently by trained annotator (e.g., detection of a high or low degree of blame) in the context of distressed couples' interactions. The two-step process is based on the following: 1. identication of prototypical local behavioral patterns that are highly-informative about the global human-perception-based ratings 2. detection of extreme global behaviors with derived prototypical local behavioral patterns to infer insights about the annotators' perceptual process We utilize MIL in the rst step to discover prototypical local behavioral patterns. We assume that these prototypical local behavioral patterns derived from MIL can be perceptually meaningful because of their ability in performing detection of the extreme global behaviors. We then carry out the second step using SPRT-based and saliency- based detection frameworks with these prototypical local behaviors. The assumption behind the second step is that the human annotators are trained in a way such that they learn to retain a repertoire of a set of prototypical local behaviors, which they utilize internally to decide whether the particular behavior of interest falls into the categories of high or low degree rating. This second step is formulated to understand the decision mechanism of human annotators as they execute their internal functions of mapping local events to global ratings. We present analysis results with respect to the six dierent globally-rated behavioral codes (blame, acceptance, negative, positive, humor and sadness) designed for the purpose of measuring behaviors in con ictual marital interactions. We represent the behaviors at local speaking-turn-level with lexical 123 information computed using term frequency-inverse document frequency (tdf). We present our analyses and discussions using the proposed framework to bring initial insights into the mechanism underlying the annotation process. 
The rest of this chapter is organized as follows: Sections 7.2-7.4 describe the research methodology, including the analysis database, the MIL and SPRT frameworks, and the lexical feature representation; Sections 7.5 and 7.6 present the analyses and discussions of our perception experiments; and Section 7.7 describes our conclusions and future work.

7.2 Corpus Description

We carry out our perceptual analysis on the Couple Therapy Corpus [24]. The corpus consists of audio-video recordings and manual word transcripts of severely-distressed couples as they engaged in problem-solving interactions. As is standard practice in behavioral studies, each spouse's behaviors were rated by multiple trained human annotators using expert-designed behavioral coding manuals. Each annotator was instructed to rate each behavioral code (at the global session level), after observing the whole interaction session, on an integer scale of 1-9, where a higher rating indicates that the spouse displayed more of that behavior.

We carry out our perceptual analysis on the extreme ratings (25% and 40% of all the available ratings in the corpus: 186 and 280 samples of ratings, respectively) of the six different global codes (blame, acceptance, negative, positive, humor and sadness). Each of the behavioral codes is categorized into high and low degrees of rating. The rationale behind this selection is twofold. The first reason is to be consistent with the work done by Katsamanis et al. [?], who utilized an MIL framework, optimized for classification accuracy on the same corpus, to perform a binary classification task on the same set of six codes and obtained high accuracies. The second reason is that the human inter-evaluator agreements are satisfactorily high for these six codes (0.78, 0.75, 0.80, 0.74, 0.76, 0.72) [10], especially for the set of extreme ratings (very high and very low ratings of these codes). This signifies not only that the annotators' internalization of the code descriptions is consistent; it also reduces the confound of rater variability, which better justifies our assumption that the extracted prototypical local behaviors carry perceptually-meaningful information about the global behaviors.

7.3 Computational Framework

7.3.1 Multiple Instance Learning

Multiple Instance Learning (MIL) is a semi-supervised learning framework in which a label, y, is assigned to a bag consisting of multiple unlabeled instances, instead of a label being associated with every training instance. The original idea of MIL is formulated for a binary classification task (y ∈ {(+1), (−1)}). A bag is labeled a (+1)-bag if at least one instance in that bag is (+1), and a (−1)-bag only if all of its instances are (−1). A general way of solving the MIL problem is through maximization of the Diverse Density (DD) function for a feature vector x, defined as [83]:

$$DD(x)=\prod_{i=1}^{M}\left[\frac{1+y_{i}}{2}-y_{i}\prod_{j=1}^{N_{i}}\left(1-e^{-\lVert B_{ij}-x\rVert^{2}}\right)\right]$$

where B_ij is the j-th instance (the features of one speaking turn) of the i-th bag (session), N_i corresponds to the number of instances of the i-th bag, and M is the total number of bags. Maximizing the DD(x) function can be posed as finding a point in the feature space (often termed a concept point, denoted t) that is as close as possible to at least one instance from every (+1)-bag and as far away as possible from the instances in the (−1)-bags. The maximization of the DD function with respect to t can be solved with an expectation-maximization approach (EMDD [123]); we adopt the EMDD approach for solving the MIL problem in this work.
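As a concrete illustration of the Diverse Density objective just defined, the following Python sketch evaluates log DD(x) for a candidate concept point over a set of labeled bags, using the common initialization of searching over the instances of positive bags. It is a minimal re-implementation of the formula above, not the EMDD routine from the MIL toolbox used in this work, and it omits the per-feature scaling vector for brevity.

import numpy as np

def log_diverse_density(x, bags, labels, eps=1e-300):
    # Log of DD(x) over all bags.
    #   x      : (d,) candidate concept point
    #   bags   : list of arrays, bags[i] has shape (N_i, d) -- one row per speaking turn
    #   labels : list of +1 / -1 bag labels (session-level extreme ratings)
    log_dd = 0.0
    for B, y in zip(bags, labels):
        inst = np.exp(-np.sum((B - x) ** 2, axis=1))   # per-instance closeness to x
        none_hit = np.prod(1.0 - inst)                 # noisy-OR: prob. no instance is close
        p_bag = (1 + y) / 2 - y * none_hit             # 1 - none_hit for (+1)-bags, none_hit for (-1)-bags
        log_dd += np.log(max(p_bag, eps))
    return log_dd

# Toy usage: initialize the concept point search from instances of positive bags.
rng = np.random.default_rng(2)
bags = [rng.normal(size=(rng.integers(5, 15), 8)) for _ in range(6)]
labels = [1, 1, 1, -1, -1, -1]
candidates = np.vstack([b for b, y in zip(bags, labels) if y == 1])
best = max(candidates, key=lambda c: log_diverse_density(c, bags, labels))
print(log_diverse_density(best, bags, labels))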
Using MIL in the DD formulation is intuitively meaningful for the first step of our analysis framework. We use this process to discover two densities (i.e., two concept points) of prototypical local behavioral patterns with respect to the global extreme ratings of behaviors (denoted t^{(+1)} and t^{(−1)} for the very-high-rating (+1)-bags and very-low-rating (−1)-bags of a behavior, respectively). For each instance, we can compute P(t^θ | B_ij), where θ ∈ {(+1), (−1)}, as a measure of whether that instance is close to the (+1) or the (−1) concept point:

$$P(t^{\theta}\mid B_{ij})=\exp\!\left(-\sum_{k}s_{k}^{2}\,(B_{ijk}-t_{k}^{\theta})^{2}\right) \tag{7.1}$$

where k indexes the k-th feature and s is the per-feature scaling vector. Each individual instance's probability is then combined at the bag level, and the conventional classification decision in EMDD follows the original formulation of MIL.

7.3.2 Sequential Probability Ratio Test

The Sequential Probability Ratio Test (SPRT) was originally developed in [114] and has since been widely used for on-line manufacturing quality control and computerized classification testing. SPRT, in the context of this chapter, can be used to represent the human decision-making process during annotation: the annotator makes a decision over time about whether the behavior each spouse is exhibiting falls on the high or the low side of the extreme behaviors, as soon as the annotator becomes 'confident' enough after observing a sequence of interaction data. The SPRT sequential decision strategy S_{i,m} (where m indicates the m-th speaking turn in the i-th interaction session), given the two possible classes/hypotheses {(+1), (−1)}, can be written as:

$$S_{i,m}=\begin{cases}(+1) & \text{if } LR_{i,m}\geq U_{+}\\ (-1) & \text{if } LR_{i,m}\leq L_{-}\\ \text{continue} & \text{if } L_{-}<LR_{i,m}<U_{+}\end{cases}$$

where {(+1), (−1)} in this work indicate the extreme high and low ratings of the behavioral codes, and LR_{i,m} is the likelihood ratio defined below (assuming i.i.d. samples):

$$LR_{i,m}=\prod_{j=1}^{m}\frac{P(B_{ij}\mid t^{(+1)})}{P(B_{ij}\mid t^{(-1)})}=\prod_{j=1}^{m}\frac{P(t^{(+1)}\mid B_{ij})\,P(t^{(-1)})}{P(t^{(-1)}\mid B_{ij})\,P(t^{(+1)})}$$

and, by assuming a uniform prior, P(t^{(−1)}) = P(t^{(+1)}),

$$LR_{i,m}=\prod_{j=1}^{m}\frac{P(t^{(+1)}\mid B_{ij})}{P(t^{(-1)}\mid B_{ij})} \tag{7.2}$$

where P(t^θ | B_ij) is given in Equation (7.1), and U_+ and L_− correspond to the upper-bound and lower-bound confidence thresholds computed from a user-defined α (Type I error) and β (Type II error). The thresholds are set based on the following guideline, according to Wald [114], to simultaneously control for α and β:

$$U_{+}=\frac{1-\beta}{\alpha},\qquad L_{-}=\frac{\beta}{1-\alpha} \tag{7.3}$$

Here we define α = β = 0.05. This SPRT formulation can be used to represent a possible 'causal-integration' annotation decision-making process of the human annotators.
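The sequential decision rule of Equations 7.1-7.3 can be sketched in a few lines of Python. This is an illustrative implementation of the standard SPRT scheme (with a fallback to the cumulative likelihood ratio when no threshold is crossed, as in the SPRT_st scheme defined later), assuming the concept points t^{(+1)}, t^{(−1)} and the scaling vector s have already been estimated by EMDD; all variable names are ours.

import numpy as np

def concept_prob(B_turn, t, s):
    # Equation 7.1: closeness of one turn's feature vector to a concept point.
    return np.exp(-np.sum((s ** 2) * (B_turn - t) ** 2))

def sprt_session(turns, t_pos, t_neg, s, alpha=0.05, beta=0.05):
    # Run the SPRT over the speaking turns of one session; return (label, turns used).
    U_plus = (1 - beta) / alpha          # Wald's thresholds, Equation 7.3
    L_minus = beta / (1 - alpha)
    log_lr = 0.0
    for m, B_turn in enumerate(turns, start=1):
        p_pos = concept_prob(B_turn, t_pos, s)
        p_neg = concept_prob(B_turn, t_neg, s)
        log_lr += np.log(p_pos + 1e-300) - np.log(p_neg + 1e-300)   # Equation 7.2, in log form
        if log_lr >= np.log(U_plus):
            return +1, m
        if log_lr <= np.log(L_minus):
            return -1, m
    # No threshold crossed by the last turn: decide from the cumulative LR.
    return (+1 if log_lr > 0 else -1), len(turns)

# Toy usage: 30 turns of 50-dimensional lexical features; concept points and
# scaling vector are assumed to come from EMDD.
rng = np.random.default_rng(3)
t_pos, t_neg, s = rng.random(50), rng.random(50), np.ones(50)
turns = rng.random((30, 50))
print(sprt_session(turns, t_pos, t_neg, s))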
7.4 Analysis Setup

7.4.1 Lexical Feature Extraction

We use the same lexical feature extraction as Katsamanis et al. [?] to represent the behavioral information of each instance (speaking turn) for training the multiple instance learning model described in Section 7.3.1, because lexical information has been shown to be useful in behavioral prediction tasks [32, ?]. Lexical information, in this work, is represented by a vector of normalized products of term/word frequencies (for the selected set of terms/words) with inverse document frequencies (tfidf_n), defined as follows:

$$tfidf_{n}(t_{k}\mid d_{j})=\frac{tfidf(t_{k}\mid d_{j})}{\sqrt{\sum_{s=1}^{W}tfidf(t_{s}\mid d_{j})^{2}}}$$

where W is the number of terms in an instance. The term tfidf(t_k | d_j) is computed by counting the number of appearances, n, of every selected term t_k in the document d_j, where the term appears in D_{t_k} out of the total of D documents:

$$tfidf(t_{k}\mid d_{j})=\begin{cases}n\cdot\log\dfrac{D-D_{t_{k}}}{D_{t_{k}}} & \text{if } D_{t_{k}}\neq D\\[6pt] 0 & \text{if } D_{t_{k}}=D\end{cases}$$

The selection criterion for terms is based on information gain computed on the training set for each cross-validation fold. We choose the terms appearing in the top 0.5% for the 25% task (186 samples) and the top 1% for the 40% task (280 samples), sorted in descending order of information gain.

7.4.2 Classification Setup

In this work, our goal is to study the possible perceptual process of human annotators as they engage in behavioral observational coding. The analyses are based on a detection/binary classification framework of extreme behaviors corresponding to the six behavioral codes (blame, acceptance, negative, positive, humor and sadness). The detection evaluation is measured using 10-fold cross validation (the same couple was restricted to appear in either the training set or the test set only). Lexical feature selection is performed on the training set only, and EMDD is trained using the MIL toolbox [120].

We employ multiple detection schemes emulating different possible annotation decision-making processes. They can be grouped into two major categories: isolated-saliency and causal-integration. The following is the list of detection schemes and their associated descriptions:

Isolated-Saliency:
The saliency likelihood is estimated by computing each instance's likelihood ratio separately, for each instance m, without taking the product, using the following form in place of Equation 7.2:

$$LR^{s}_{i,m}=\frac{P(t^{(+1)}\mid B_{im})}{P(t^{(-1)}\mid B_{im})}$$

• Salient_max: assign the label (+1) to the i-th session (with a total of l instances) if the maximum of LR^s_{i,1:l} is greater than 1, and vice versa
• Salient_nMax: the same as Salient_max, except that a majority vote is performed over the n largest |LR^s_{i,j}|; n is chosen to be 3 in this work, empirically

Causal-Integration:
Causal-integration is based on the SPRT decision framework described in Section 7.3.2.

• SPRT_st: standard SPRT as described in Section 7.3.2; if the algorithm does not terminate before it reaches the end of the session, the label is decided based on the cumulative LR at the last step

Table 7.1: Summary of detection results (percentage of accurately detected sessions); numbers in bold indicate the highest-performing decision framework for that specific task.

  25% Task (186 samples of ratings)
               Salient_max  Salient_nMax  SPRT_st  SPRT_con  SPRT_res  SPRT_hyb
  blame           75.3         79.0        72.6     72.6      72.0      73.1
  acceptance      66.7         74.2        71.5     68.8      67.7      70.4
  negative        65.1         73.6        72.6     73.1      74.2      72.6
  positive        69.9         74.7        65.1     65.1      62.9      66.1
  humor           51.6         51.1        50.5     50.5      51.1      50.5
  sadness         47.3         47.9        48.4     48.4      47.3      47.8

Table 7.2: Summary of detection results (percentage of accurately detected sessions); numbers in bold indicate the highest-performing decision framework for that specific task.

  40% Task (280 samples of ratings)
               Salient_max  Salient_nMax  SPRT_st  SPRT_con  SPRT_res  SPRT_hyb
  blame           78.2         77.1        74.6     74.6      73.2      75.7
  acceptance      68.9         70.7        72.1     72.9      76.1      72.1
  negative        65.7         69.6        66.1     66.8      68.2      65.4
  positive        70.4         76.1        67.9     69.3      69.3      69.6
  humor           53.8         51.1        50.0     49.6      49.3      49.3
  sadness         54.2         50.7        49.6     50.0      50.4      50.4

• SPRT_con: the same as SPRT_st, except that the detection algorithm does not terminate and continues running through the entire session.
A majority vote is performed for the instances that surpass the threshold (U + ;L dened in Equation 7.3) to decide a label for the session • SPRT res : the detection algorithm reset cumulativeLR = 0 whenever it surpasses the pre-dened threshold, and a nal majority vote is carried out to decide a nal label • SPRT hyb : the same for SPRT st , except that if the algorithm does not terminate before it reaches the end of the session, the label is decided based on the detection scheme, Salient nMax 130 7.5 Detection Results and Discussions Table 7.1 7.2 summarizes the detection accuracies for the six dierent detection schemes (Section 3.1) for the six dierent globally-rated behavioral codes with two dierent subsets of the data using lexical features (Section 2.4). We present results on both sets of data (25% total, and 40% total), each with equal splits between the two classes. Kastamanis et al. performed the classication with the 25% total. In this work, we focus on interpreting 40% total task as it includes more data, potentially better generalizable interpretations, but present results on both tasks. The rst thing to note is that the overall accuracies reported here are lower compared to [? ] despite the fact many of the setups are similar. We think the main dierences could be caused by two major reasons: the rst is that Katsamanis et al. utilized a variant of MIL framework, which included estimation of multiple modes in DD function (instead of one concept point) along with the second layer of support vector machine. That formulation is benecial in boosting the overall accuracy although it is rather dicult to apply in sequential decision framework. We decide to use the standard EMDD framework as a starting point in this work because the original formulation provides a more straightforward interpretation and is applicable to be used in SPRT. The second reason is that Katsamanis et al. optimized parameters for the classication accuracies on the test set for the purpose demonstrating an upper-bound of accuracy of the algorithm. Since our assumptions rely on these prototypical local patterns that can carry signicant information about the global ratings, we will be focusing our discussion on only the codes for which we obtain a reasonably high accuracy: blame, acceptance, negative, and positive. The second observation is that for the isolated-saliency detection scheme, taking majority vote on multiple large values of likelihood ratio shows a better and more robust detection scheme compared to taking one single maximum (except for the code, blame). 131 Table 7.3: SPRT st : median and 75% quantile of decision time, measured as number of turns required divided by the total number of turns of each session for the 40% task accurately-classied mis-classied median 75% quantile median 75% quantile blame 0.20 0.36 0.22 0.43 acceptance 0.15 0.30 0.20 0.40 negative 0.18 0.33 0.21 0.42 positive 0.20 0.43 0.31 0.62 humor 0.25 0.50 0.22 0.47 sadness 0.17 0.29 0.13 0.26 Various SPRT-based detection schemes that we employ while all demonstrate reasonable accuracies, however, do not show observable trends of dierences when comparing among themselves. The third point to make is that on the overall level, isolated-saliency methods seem to obtain a higher accuracy, which may signify, in general, that the salient events can be more informative in triggering the global ratings from the annotators. 
However, one of the major strengths of the SPRT-based methods is that the decision is made in an online fashion: a decision can be made earlier in the session while maintaining fairly high accuracy. Table 7.3 summarizes the efficiency of the SPRT_st algorithm.

Table 7.3: SPRT_st: median and 75% quantile of decision time, measured as the number of turns required divided by the total number of turns of each session, for the 40% task.

Code          Accurately classified        Misclassified
              median    75% quantile       median    75% quantile
blame          0.20         0.36            0.22         0.43
acceptance     0.15         0.30            0.20         0.40
negative       0.18         0.33            0.21         0.42
positive       0.20         0.43            0.31         0.62
humor          0.25         0.50            0.22         0.47
sadness        0.17         0.29            0.13         0.26

We see that for the behavioral codes blame, acceptance, negative, and positive, SPRT only requires monitoring about the first 15 - 20% of the speaking turns in 50% of the sessions (median) and about 30 - 43% in 75% of the sessions (75% quantile) before it is confident enough to make a correct prediction. When the algorithm makes an error, it tends to take longer (20 - 31% and 40 - 62% of the speaking turns at the median and 75% quantile, respectively), reflecting uncertainty or ambiguity in the information available for forming the decision. These results demonstrate the potential of SPRT_st to be used in a real-time monitoring system that gives therapists an early signal of whether an interaction requires immediate intervention.
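The decision-time summaries reported in Table 7.3 could be computed from the SPRT stopping indices with a few lines of code. The sketch below is illustrative only: the array names and the way stopping turns are collected are assumptions, not the original evaluation script.

```python
import numpy as np

def decision_time_summary(stop_turns, total_turns, correct):
    """Summarize SPRT decision time (fraction of turns observed before stopping),
    split by whether the session was classified correctly.
    stop_turns, total_turns, correct: per-session arrays (hypothetical names)."""
    frac = (np.asarray(stop_turns) + 1) / np.asarray(total_turns, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    summary = {}
    for name, mask in (("accurately classified", correct), ("misclassified", ~correct)):
        summary[name] = (np.median(frac[mask]), np.percentile(frac[mask], 75))
    return summary

# Hypothetical example: five sessions with their stopping turn, session length,
# and whether the SPRT label matched the annotators' rating.
print(decision_time_summary(stop_turns=[12, 8, 30, 5, 22],
                            total_turns=[60, 55, 70, 40, 50],
                            correct=[True, True, False, True, False]))
```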
7.6 Isolated-Saliency vs. Causal-Integration

In this section, we compare the accuracies of the isolated-saliency and causal-integration methods for blame, acceptance, negative, and positive, to offer possible insights into the question about the human annotation process posed in Section 1.

First, we examine the sessions where both SPRT_st and Salient_max make a correct prediction. These sessions constitute 84.5%, 90.1%, 84.2%, and 82.3% (blame, acceptance, positive, and negative, respectively) of all the correct decisions made using SPRT_st. In this overlapping set of sessions, the 75% quantile of the time it takes SPRT_st to make a correct prediction is (0.33, 0.37, 0.42, and 0.40). This set of sessions corresponds to about 41% - 50% of the total data, depending on the behavioral code, indicating that for a significant portion of the database the information relevant for judging global behavioral attributes can be found in subparts of the session (or, as SPRT_st intends, at its beginning).

Furthermore, from Table 7.2, the behavioral codes that show major differences between the two categories in the 40% task are positive, acceptance, and blame: isolated-saliency is better for positive and blame, and causal-integration is better for acceptance. While there can be confounds because the features (or classifiers) are not optimal, we attempt an interpretation based on the trends seen in Table 7.2. For the code blame, a single salient instance can be highly indicative (evident in the maximum accuracy obtained by Salient_max), which could mean that the use of a lexical term at a specific time affects the overall perceptual evaluation of the behavioral code. For behavioral codes designed to measure a more positive attitude (e.g., positive and acceptance), the information that affects the annotators' decision seems to be more distributed across the session, considering that Salient_nMax and SPRT_res are the most successful detection mechanisms. This is also in accordance with the established psychological knowledge that a negative impression carries more power (saliency) than a positive impression [?].

7.7 Conclusion and Future Works

In this work, through the design of our analysis framework, we show that it is possible to begin exploring questions about the human annotation process: whether the final decision is triggered by isolated saliency or is based on the causal integration of information. Examining the results, we have demonstrated that not all behaviors can be judged on thin "slices" (i.e., small amounts of data). In cases where behaviors can be robustly judged with thin slices, these slices need to be contextually appropriate (salient regions). We further reinforce the idea that data-dependent modeling of annotators' behavior is crucial for automating behavioral coding, in line with the works [57, 58]. While the results in this work need to be investigated and verified in further detail, it is promising that some of the initial results corroborate knowledge in psychology.

There are many future directions. One limitation of this work is that lexical features computed by tdf carry only partial information. We should also investigate further the assumption that the notion of perceptually-meaningful local behavioral patterns can be derived from MIL. We plan on incorporating cues from other communicative channels and refining the classifiers within the same conceptual framework to bring further insight into the different attentive processes operating on behavioral cues. Lastly, we would like to continue designing analysis frameworks for other perceptual experiments and to collaborate with domain experts to further enhance the quantitative aspects of understanding human judgment of behavior.

Chapter 8: Conclusions and Future Work

The domain of BSP, and the interaction modeling within it, is broad, as it spans multiple research disciplines. In the research presented above, I utilized different computational approaches to tackle the issues of modeling and quantifying interaction dynamics in the study of dyadic human interactions. The essence of BSP is to make a tangible impact on both fronts: human-centric systems and the human behavioral sciences. I would like to continue pursuing these two goals within the realm of interaction modeling in behavioral signal processing.

8.1 Future Research Directions

8.1.1 Algorithmic Development - Turn Taking and Affective Dynamics

As humans engage in spontaneous dialogs, the turn-taking phenomenon is inherently coupled with the affective dynamics of the interlocutors. To facilitate a better and improved design of human-centric systems, a combined statistical modeling framework could be used to jointly model turn-taking (cognitive planning) and expressive behaviors (affective dynamics).

8.1.2 Application Domains - Applying Vocal Entrainment in Analysis

The proposed computational framework provides an opportunity to quantitatively analyze the variations in vocal entrainment patterns associated with different sets of behavioral constructs in each domain of human interaction. We would like to concentrate on two domains: distressed couples' interactions and psychologists' interactions with children on the autism spectrum. By closely collaborating with psychologists and domain experts, we can provide tools for computing various vocal entrainment measures to perform detailed analyses of interaction dynamics, which are often atypical in nature in these interaction scenarios.

8.1.3 Transferring Computational Models - Quantifying Other Aspects of Entrainment

The entrainment process between the interacting dyad occurs at multiple levels across multiple communicative channels. We have demonstrated that the idea of representing vocal entrainment as "the similarity between two vocal characteristics spaces" is indeed a viable method for quantifying vocal entrainment. We would like to further examine the same conceptual computational framework for quantifying other aspects of entrainment, such as gestural entrainment, lexical entrainment, turn-taking entrainment, etc. The core idea is that each interlocutor's features can be projected into a common abstract space in order to compute the similarity within the interacting dyad.
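To make the idea of a shared projection space a bit more concrete, here is a minimal sketch of one way such a similarity could be computed: project one interlocutor's frame-level features onto a PCA subspace estimated from the other interlocutor's features and measure how much variance survives. The function name, the inputs (feat_a, feat_b), and the variance-ratio score are all illustrative assumptions; this is not the specific entrainment measure developed in the thesis.

```python
import numpy as np

def subspace_similarity(feat_a, feat_b, n_components=3):
    """Project speaker B's features onto the top principal directions of
    speaker A's feature space and report the fraction of B's variance preserved.
    feat_a, feat_b: (num_frames, num_features) arrays (hypothetical inputs)."""
    a = feat_a - feat_a.mean(axis=0)
    b = feat_b - feat_b.mean(axis=0)
    # Principal directions of speaker A's characteristics space
    _, _, vt = np.linalg.svd(a, full_matrices=False)
    basis = vt[:n_components]                      # (n_components, num_features)
    # Fraction of B's variance captured by A's subspace; closer to 1.0 = more similar
    b_proj = b @ basis.T @ basis
    return float(np.sum(b_proj ** 2) / np.sum(b ** 2))

# Hypothetical usage with random stand-in features for the two interlocutors
rng = np.random.default_rng(0)
feat_a = rng.normal(size=(200, 6))
feat_b = rng.normal(size=(180, 6))
print(subspace_similarity(feat_a, feat_b))
```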
8.1.4 CreativeIT: Synthesizing Actors' Improvisation Interaction

Another line of research concerns a different domain of dyadic interaction, one that focuses more on task-oriented interactions. We have collected a database, the USC CreativeIT database [86], using a full-body motion capture system, of two actors engaging in theatrical improvisations. The actors learned an acting technique termed "Active Analysis". The idea behind Active Analysis is that the two actors in a scene play forces and counterforces; as these two forces interact, they create a natural flow in the improvised interaction. Often, the director examines the quality of the improvisation based on the degree of "focus of attention" between the interacting actors as they act out the scene using this technique. Given the availability of motion capture data, we would like to quantify this idea of focus of attention and to synthesize different movements between the actors along this scale of focus of attention, providing an automatic assessment of the quality of the improvisations.

8.1.5 Data-driven Human Behavioral Science Study

Utilizing computational methods, i.e., BSP, to carry out experiments in the domain of behavioral science is promising on multiple fronts. Computational methods can work with large amounts of spontaneous data without the explicit design of controlled experiments. Behavioral signals are objective and can serve as grounding for studies of either human behavior perception or production. Given these opportunities, I would like to further explore the potential of such methods in expanding the existing scientific research paradigm of human behavioral science.

Bibliography

[1] Alan Agresti. Categorical data analysis. Wiley Series in Probability and Mathematical Statistics. New York: Wiley, 1990.
[2] E. M. Albornoz, D. H. Milone, and H. L. Rufiner. Spoken emotion recognition using hierarchical classifiers. Computer Speech and Language, in press, 2010.
[3] Louis-Philippe Morency. Hidden-state conditional random field library.
[4] Peter A. Andersen and Janis F. Andersen. The exchange of nonverbal intimacy: A critical review of dyadic models. Journal of Nonverbal Behavior, 8(4):328-349, 1984.
[5] B. R. Baucom and D. C. Atkins. Polarization in marriage. To appear in M. Fine and F. Fincham (Eds.), Family Theories: A Content-based Approach. New York, NY: Routledge, (in press).
[6] Frank J. Bernieri, J. Steven Reznick, and Robert Rosenthal. Synchrony, pseudosynchrony, and dissynchrony: Measuring the entrainment process in mother-infant interactions. Journal of Personality and Social Psychology, 54(2):243-253, 1988.
[7] M. Black, J. Chang, and S. Narayanan. An empirical analysis of user uncertainty in problem-solving child-machine interactions: Linguistic analysis of spontaneous children speech. In Proceedings of the Workshop on Child, Computer and Interaction, Chania, Greece, October 2008.
[8] M. P. Black, P. G. Georgiou, A. Katsamanis, B. R. Baucom, and S. S. Narayanan. "You made me do it": Classification of blame in married couples'
interactions by fusing automatically derived speech and language information. In Proceedings of Interspeech, pages 89{92, 2011. [9] Matthew Black, Athanasios Katsamanis, Chi-Chun Lee, Adam Lammert, Brian R. Baucom, Andrew Christensen, Panayiotis G. Georgiou, and Shrikanth Narayanan. Automatic classication of married couples' behavior using audio features. In Proceedings of Interspeech, pages 2030{2033, 2010. 138 [10] Matthew P. Black, Athanasios Katsamanis, Brian R. Baucom, Chi-Chun Lee, Adam C. Lammert, Andrew Christensen, Panayiotis G. Georgiou, and Shrikanth S. Narayanan. Toward automating a human behavioral coding sys- tem for married couples? interactions using speech acoustic features. Speech Communication, page (in press), 2011. [11] Matthew P. Black, Abe Kazemzadeh, Joseph Tepperman, and Shrikanth S. Narayanan. Automatically assessing the abcs: Verication of children's spoken letter-names and letter-sounds. ACM Transactions on Speech and Language Pro- cessing, 7:4:15:1 { 15:17, 2011. [12] Paul Boersma and David Weenink. Praat: doing phonetics by computer (version 5.1.03) [Computer program]. March 2009. [13] Steven M. Boker and Jean-Philippe Laurenceau. Dynamical systems modeling: An application to the regulation of intimacy and disclosure in marriage. Models for intense longitudinal data, pages 195{218, 2006. [14] S. Brave, C. Nass, and K. Hutchinson. Computers that care: investigating the eects of orientation of emotion exhibited by an embodied computer agent. In- ternational journal of human-computer studies, 62(2):161{178, 2005. [15] L.R. Brody and J.A. Hall. Gender and emotion in context. The Guilford Press, 2008. [16] Judee K. Burgoon, Lesa A. Stern, and Leesa Dillman. Interpersonal Adaptation: Dyadic Interaction Patterns. Cambridge University Press, 1995. [17] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette Chang, Sungbok Lee, and Shrikanth Narayanan. IEMO- CAP: Interactive emotional dyadic motion capture database. Journal of Language Resources and Evaluation, 42:335{359, 2008. [18] Carlos Busso, Zhigang Deng, Serdar Yildirim, Murtaza Bulut, Chul Min Lee, Abe Kazemzadeh, Sungbok Lee, and Shrikanth Narayanan. Analysis of emotion recognition using facial expressions, speech and multimodal information. In Proc. of the Int'l Conf. on Multimodal Interfaces, October 2004. [19] Carlos Busso and Shrikanth Narayanan. Recording audio-visual emotional database from actors: a closer look. In Second Intl. Workship on Emotion: Cor- pora for Research on Emotion and Aect, Int'l conference on Language Resources and Evaluation, pages 17{22, May 2008. [20] Carlos Busso and Shrikanth S. Narayanan. Interrelation between speech and facial gestures in emotional utterances: A single subject study. IEEE Transactions on Audio, Speech, and Language Processing, 15:8:2331{2347, 2007. 139 [21] Tanya L. Chartrand and John A. Bargh. The chameleon eect: the perception- behavior link and social interaction. Journal of Personality and Social Psychology, 76(6):893{910, 1999. [22] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. Smote: Synthetic minority over-sampling technique. Journal of Articial Intelligence Research, 16:321{357, 2002. [23] A. Christensen. Detection of con ict patterns in couples, chapter Understanding major mental disorder: The contribution of family interaction research, pages 250{265. New York: Family Process Press., 1987. [24] A. Christensen, D.C. Atkins, S. Berns, J. Wheeler, D. H. Baucom, and L.E. Simp- son. 
Traditional versus integrative behavioral couple therapy for signicantly and chronically distressed married couples. J. of Consulting and Clinical Psychology, 72:176{191, 2004. [25] A. Christensen and C. L. Heavey. Gender and social structure in the de- mand/withdraw pattern of marital con ict. Journal of Personality and Social Psychology, 59(1):73{81, 1990. [26] James V. Cordova, Neil S. Jacobson, John M. Gottman, Regina Rushe, and Gary Cox. Negative reciprocity and communication in couples with a violent husband. Journal of Abnormal Psychology, 102(4):559{564, 1994. [27] J. Dauwels, F. Vialatte, and A. Cichocki. Diagnosis of alzheimers disease from eeg signals: Where are we standing? Current Alzheimer's Research (Invited Paper), 7(6):487{505, 2010. [28] K. Eldridge and B. Baucom. Positive pathways for couples and families: Meeting the challenges of relationships, chapter (in press) Couples and consequences of the demand-withdraw interaction pattern. WileyBlackwell. [29] K. A. Eldridge, M. Sevier, J. Jones, D. C. Atkins, and A. Christensen. Demand- withdraw communication in severely distressed, moderately distressed, and nondistressed couples: Rigidity and polarity during relationship and personal problem discussions. Journal of Family Psychology, 21:218{226, 2007. [30] F. Eyben, M. W ollmer, and B. Schuller. OpenSMILE - The Munich versatile and fast open-source audio feature extractor. In ACM Multimedia, pages 1459{1462, Firenze, Italy, 2010. [31] Alexendar Genkin, David D Lewis, and David Madigan. Large-scale bayesian logistic regression for text categorization. Technometrics, 49(3):291{304, 2007. 140 [32] P.G. Georgiou, M.P. Black, A.C. Lammert, B.R. Baucom, and S.S. Narayanan. `That's aggravating, very aggravating': Is it possible to classify behaviors in couple interactions using automatically derived lexical features? In Aective Computing and Intelligent Interaction, pages 87{96, 2011. [33] Zoubin Ghahramani and Michael I. Jordan. Factorial hidden markov models. Machine Learning, 29:245{274, 1997. [34] Prasanta K. Ghosh, Andrea Tsiartas, and Shrikanth S. Narayanan. Robust voice activity detection using long-term signal variability. IEEE Trans. Audio, Speech, and Language Processing, 19(3):600{613, 2010. [35] J. Gibson, A. Katsamanis, M.P. Black, and S.S. Narayanan. Automatic identica- tion of salient acoustic instances in couples? behavioral interactions using diverse density support vector machines. In Proceedings of Interspeech, pages 1561{1564, 2011. [36] John Gottman, Catherine Swanson, and Kristin Swanson. A general system the- ory of marriage: Nonlinear dierence equation modeling of marital interaction. Personality and Social Psychology Review, 6(4):326{340, 2002. [37] John M. Gottman. The roles of con ict engagement, escalation, and avoidance in marital interaction: A longitudinal view of ve types of couples. Journal of Consulting and Clinical Psychology, 61(1):6{15, 1993. [38] S.W. Gregory, K. Dagan, and S. Webster. Evaluating the relation between vocal accommodation in conversational partners' fundamental frequencies to percptions of communication quality. Journal of Nonverbal Behavior, 21:23{43, 1997. [39] S.W. Gregory and B.R. Hoyt. Conversation partner mutual adaptation as demon- strated by fourier series analysis. Journal of Psycholinguistic Research, 11:35{46, 1982. [40] S.W. Gregory and S. Webster. A nonverbal signal in voices of interview partners eectively predicts communication accommodation and social status perceptions. 
Journal of Personality and Social Psychology, 70:1231{1240, 1996. [41] S.W. Gregory, S. Webster, and G. Huang. Voice pitch and amplitude convergence as a metric of quality in dyadic interviews. Language and Communication, 13:195{ 217, 1993. [42] Michael Grimm, Emily Mower, Kristian Kroschel, and Shrikanth Narayanan. Primitives based estimation and evaluation of emotions in speech. Speech Com- munication, 49:787{800, November 2007. 141 [43] Ali Hassan and Robert I. Damper. Multi-class and hierarchical svms for emotion recognition. In Proceedings of Interspeech, pages 2354{2357, 2010. [44] D. Haylan. Challenges ahead. head movements and other social acts in conversa- tion. AISB, 2005. [45] C. Heavey, D. Gill, and A. Christensen. Couples interaction rating system 2 (CIRS2). University of California, Los Angeles, 2002. [46] C. L. Heavey, C. Layne, and A. Christensen. Gender and con ict structure in mar- ital interaction: A replication and extension. Journal of Consulting and Clinical Psychology, 62:16{27, 1993. [47] Ota Herm, Alexandra Schmitt, and Jackson Liscombe. When calls go wrong: How to detect problematic calls based on log-les and emotions. In Proceedings of Interspeech, 2008. [48] David Hosmer and Stanley Lemeshow. Applied Logistic Regression. Wiley Series in Probability and Statistics, second edition edition, 2000. [49] H. Hotelling. The most predictable criterion. Journal of Experimental Psychology, 26:139{142, 1935. [50] G. Humhprey. The psychology of the gestalt. Journal of Educational Psychology, 15(7):401{412, 1924. [51] Neil S. Jacobson, John M. Gottman, Jennifer Waltz, Regina Rushe, Julia Babcock, and Amy Holtzworth-Munroe. Aect, verbal content, and psychophysiology in the arguments of couples with a violent husband. Journal of Consulting and Clinical Psychology, 62(5):982{988, 1994. [52] M.C. Johnannesmeyer. Abonormal situation analysis using pattern recognition techniques and historical data. Master's thesis, UCSB, Santan Barbara, CA, 1999. [53] Sheri L. Johnson and Theodore Jacob. Sequential interactions in marital com- munication of depressed men and women. Journal of Consulting and Clinical Psychology, 68(1):4{12, 2000. [54] J. Jones and A. Christensen. Couples interaction study: Social support interaction rating system. University of California, Los Angeles, 1998. [55] T. Kanda, T. Hirano, D. Eaton, and H. Ishiguro. Interactive robots as social partners and peer tutors for children: A eld trial. Human-Computer Interaction, 19(1):61{84, 2004. 142 [56] A. Kapoor and R.W. Picard. Multimodal aect recognition in learning envi- ronments. In Proceedings of the 13th annual ACM international conference on Multimedia, pages 677{682. ACM New York, NY, USA, 2005. [57] A. Kartik and S. S. Narayanan. Data-dependent evaluator modeling and its ap- plication to emotional valence classication from speech. In Proceedings of Inter- speech, pages 2366 { 2369, 2010. [58] A. Kartik and S. S. Narayanan. Emotion classication from speech using evaluator reliability-weighted combination of ranked lists. In ICASSP, pages 4956 { 4959, 2011. [59] A. Katsamanis, M. P. Black, P. G. Georgiou, L. Goldstein, and S. S. Narayanan. SailAlign: Robust long speech-text alignment. In Very-Large-Scale Phonetics Workshop, Jan. 2011. [60] A. Katsamanis, J. Gibson, M.P. Black, and S.S. Narayanan. Multiple instance learning for classication of human behavior observations. In Aective Computing and Intelligent Interaction, 2011. [61] P.K. Kerig and D.H. (Eds.) Baucom. Couple Observational Coding Systems. 
Lawrence Erlbaum Associates, Mahwah, NJ, USA., 2004. [62] Masanori Kimura and Ikuo Daibo. Interactional synchrony in conversations about emotional episodes: A measurement by 'the between-participants pseudosyn- chrony experimental paradigm'. Journal of Nonverbal Behavior, 30:115{126, 2006. [63] M. Kipp. Anvil - a generic annotation tool for multimodal dialogue. In Eurospeech, pages 1367{1370, 2001. [64] W. Krzanowski. Between-groups comparison of principal components. Journal of the American Statistical Association, 74(367):703{707, 1979. [65] Wagner H. L. Measuring performance in category judgment studies on nonverbal behavior. Journal of Nonverbal Behavior, 17(1):3{28, 1993. [66] R.S. Lazarus. Relational meaning and discrete emotions. Appraisal processes in emotion: Theory, methods, research, pages 37{67, 2001. [67] C.-C. Lee, M. P. Black, A. Katsamanis, A. C. Lammert, B. R. Baucom, A. Chris- tensen, P. G. Georgiou, and S. S. Narayanan. Quantication of prosodic en- trainment in aective spontaneous spoken interactions of married couples. In Proceedings of Interspeech, pages 793{796, 2010. 143 [68] Chi-Chun Lee, Matthew Black, Athanasios Katsamanis, Adam Lammert, Brian Baucom, Andrew Christensen, Panayiotis G. Georgiou, and Shrikanth Narayanan. Quantication of prosodic entrainment in aective spontaneous spoken interac- tions of married couples. In Proceedings of Interspeech, Makuhari, Japan, 2010. [69] Chi-Chun Lee, Athanasios Katsamanis, Matthew P. Black, Brian R. Baucom, Andrew Christensen, Panayiotis G. Georgiou, and Shrikanth S. Narayanan. Com- puting vocal entrainment: A signal-derived pca-based quantication scheme with application to aect analysis in married couple interactions. Computer Speech and Language, page (in press), 2012. [70] Chi-Chun Lee, Athanasios Katsamanis, Matthew P. Black, Brian R. Baucom, Panayiotis G. Georgiou, and Shrikanth S. Narayanan. Aective state recognition in married couples' interactions using pca-based vocal entrainment measures with multiple instance learning. In Aective Computing and Intelligent Interaction, pages 31{41, 2011. [71] Chi-Chun Lee, Athanasios Katsamanis, Matthew P. Black, Brian R. Baucom, Panayiotis G. Georgiou, and Shrikanth S. Narayanan. An analysis of pca-based vocal entrainment measures in married couples' aective spoken interactions. In Proceedings of Interspeech, pages 3101{3104, 2011. [72] Chi-Chun Lee, Sungbok Lee, and Shrikanth Narayanan. An analysis of multimodal cues of interruption in dyadic spoken interactions. In Interspeech, Brisbane, Aus- tralia, 2008. [73] Chi-Chun Lee, Emily Mower, Carlos Busso, Sungbok Lee, and Shrikanth Narayanan. Emotion recognition using a hierarchical binary decision tree ap- proach. In Proceedings of Interspeech, 2009. [74] Chul Min Lee and Shrikanth S. Narayanan. Toward detecting emotions in spoken dialogs. IEEE Transactions on Speech and Audio Processing, 13:2:293{303, 2005. [75] R. Levitan and J. Hirschberg. Measuring acoustic-prosodic entrainment with re- spect to multiple levels and dimensions. In Proceedings of Interspeech, pages 3081{3084, 2011. [76] Li-Chiung Liang. Visualizing spoken discourse: prosodic form and discourse func- tion of interruptions. In Second SIGdial Workship on Discourse and Dialog, vol- ume 16, pages 1{10, 2001. [77] Kristin M. Lindahl. Methodological issues in family observational research, chap- ter 2, pages 23{31. Erlbaum, 2001. 144 [78] Jackson Liscombe, Giuseppe Riccardi, and Dilek Hakkani-Tur. Using context to improve emotion detection in spoken dialog systems. 
In Interspeech, pages 1845{ 1848, 2005. [79] D. MacNeill. Hand and Minds: What Gestures Reveal about Thoughts. University of Chicago Press, Chicago, IL, 1992. [80] Qi-Rong Mao and Yong-Zhao Zhan. A novel hierarchical speech emotion recog- nition method based on improved ddagsvm. Computer Science and Information Systems, 7(1):211{222, 2010. [81] G. Margolin, P.H. Oliver, E.B. Gordis, H.G. O'Hearn, A.M. Medina, C.M. Ghosh, and L. Morland. The nuts and bolts of behavioral observation of marital and family interaction. Clinical Child and Family Psychology Review, 1(4):195{213, 1998. [82] Claudia Marinetti, Penny Moore, Pablo Lucas, and Brian Parkinson. Emotions in Social Interactions: Unfolding Emotion Experience, chapter 1-3, pages 31{46. Springer-Verlag Berlin Heidelberg, 2011. [83] O. Maron and T. Lozano-P erez. A framework for multiple-instance learning. In Advances in neural information processing systems, pages 570{576, 1998. [84] Andrew R. McGarva and Rebecca M. Warner. Attraction and social coordina- tion: Mutual entrainment of vocal activity rhymes. Journal of Psycholinguistic Research, 32(3):335{354, 2003. [85] Angeliki Metallinou, Carlos Busso, Sungbok Lee, and Shrikanth Narayanan. Vi- sual emotion recognition using compact facial representation and viseme informa- tion. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2010. [86] Angeliki Metallinou, Chi-Chun Lee, Carlos Busso, Sharon Canicke, and Shrikanth S. Narayanan. The usc creativeit database: A multimodal database of theatrical improvisation. In Multimodal Corpora: Advances in Capturing, Cod- ing and Analyzing Multimodality (MMC), 2010. [87] Angeliki Metallinou, Sungbok Lee, and Shrikanth Narayanan. Audio-visual emo- tion recognition using gaussian mixture models for face and voice. In In Proceed- ings of IEEE International Symposium of Multimedia, Berkeley, CA, December 2008. [88] Angeliki Metallinou, Sungbok Lee, and Shrikanth Narayanan. Decision level com- bination of multiple modalities for recognition and analysis of emotion expression. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, 2010. 145 [89] Louis-Philippe Morency, Iwan de Kok, and Jonathan Gratch. Predicting listener backchannels: A probabilistic multimodal approach. In Lecture Notes in Computer Science, volume 5208, 2008. [90] Emily Mower, Maja J. Mataric, and Shrikanth S. Narayanan. A framework for au- tomatic human emotion classication using emotional proles. IEEE Transactions on Audio, Speech and Language Processing, 19:5:1057{1070, 2011. [91] Christopher M. Murphy and Timothy J. O'Farrell. Couple communication pat- terns of maritally aggressive and nonaggressive male alcoholics. Journal of Studies on Alcohol, 58:83{90, 1997. [92] Kevin P. Murphy. The bayes net toolbox for matlab. Computing Science and Statistics, 2001. [93] Ani Nenkova, Agustin Gravano, and Julia Hirschberg. High frequency word en- trainment in spoken dialogue. In Proceedings of ACL-08: HLT, volume Compan- ion, 2008. [94] Ani Nenkova, Agustin Gravano, and Julia Hirschberg. High frequency word en- trainment in spoken dialogue. In Proceedings of ACL-08: HLT, Short Papers, pages 169{172, Columbus, Ohio, June 2008. [95] Dina G. Okamoto, LS Rashotte, and L Smith-Lovin. Measuring interruptions: Syntactic and contextual method of coding conversation. Social Psychology Quar- terly, 65(1):38{55, 2002. [96] M. E. Otey and S. Parthasarathy. 
A dissimilarity measure for comparing subsets of data: application to multivariate time series. In Proceedings of ICDM Workshop on Temporal Data Mining, Houston, TX, 2005. [97] M. Pantic, N. Sebe, J.F. Cohn, and T. Huang. Aective multimodal human- computer interaction. In Proceedings of the 13th annual ACM international con- ference on Multimedia, pages 669{676. ACM New York, NY, USA, 2005. [98] Jennifer S. Pardo. On phonetic convergence during conversational interaction. Journal of Acoustical Society of America, 119:2382{2393, 2006. [99] Lauri A. Pasch, Thomas N. Bradbury, and Joanne Davila. Gender, negative aectivity, and observed social support behavior in marital interaction. Personal Relationships, 4:361{278, 1997. [100] H. Prendinger, J. Mori, and M. Ishizuka. Using human physiology to evaluate subtle expressivity of a virtual quizmaster in a mathematical game. International journal of human-computer studies, 62(2):231{245, 2005. 146 [101] A. Quattoni, M. Collins, and T. Darrell. Conditional random eld for object recognition. NIPS, (17), 2004. [102] David Reitter and Johanna D. Moore. Predicting success in dialogue. In Proceed- ings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 808{815, 2007. [103] Michael J. Richardson, Kerry L. Marsh, and R.C. Schmit. Eects of visual and verbal interaction on unintentional interpersonal coordination. Journal of Exper- imental Psychology: Human Perception and Performance, 31(1):62{79, 2005. [104] Kehrein Roland. The prosody of authentic emotions. In Speech Prosody Confer- ence, pages 423{426, 2002. [105] Joan M. Romano, Judith A.Turner, Larry S. Friedman, Richard A. Bulcroft, Mark P. Jensen, Hyman Hops, and Steven F. Wright. Sequential analysis of chronic pain behaviors and spouse responses. Journal of Consulting and Clinical Psychology, 60(5):777{782, 1992. [106] Viktor Rozgi c, Bo Xiao, Athanasios Katsamanis, Brian Baucom, Panayiotis G. Georgiou, and Shrikanth S. Narayanan. Estimation of ordinal approach-avoidance labels in dyadic interactions: Ordinal logistic regression approach. In ICASSP, pages 2368 { 2371, 2011. [107] Bjoern Schuller, Anton Batliner, Dino Seppi, Stefan Steidl, Thurid Vogt, Johannes Wagner, Laurence Devillers, Laurence Vidrascu, Noam Amir, Loic Kessous, and Vered Aharonson. The relevance of feature type for the automatic classication of emotional user states: low level descriptors and functionals. In Proceedings of Interspeech, August 2007. [108] Bjorn Schuller, Stephan Steidl, and Anton Batliner. The Interspeech 2009 Emotion Challenge. In Proceedings of Interspeech, Brighton, UK, 2009. [109] Mia Sevier, Kathleen Eldridge, Janice Jones, Brian D. Doss, and Andrew Chris- tensen. Observed communication and associations with satisfaction during tradi- tional integrative behavioral couple therapy. Behavior Therapy, 39:137{150, 2008. [110] Patrick E. Shrout and Joseph L. Fleiss. Intraclass correlation: Uses in assessing rater relibaility. Psychological Bulletin, 86(2):420{428, 1979. [111] Stefan Steidl. Automatic classication of emotion-related user states in sponta- neous children's speech. Logos, Verlag, Berlin, 2009. [112] Vladimir Naumovich Vapnik. The nature of Statistical Learning theory. New York : Springer, 1995. 147 [113] Lesley L. Verhofstadt, Ann Buysse, William Ickes, Mark Davis, and Inge Devoldre. Support provision in marriage: The role of emotional similarity and empathic accuracy. Emotion, 8(6):792{802, 2008. [114] A. Wald. Sequential test of statistical hypotheses. 
The annals of mathematical statistics, 16(2):117{186, 1945. [115] Sy Bor Wang, Aridadna Quattoni, Louis-Philippe Morency, David Demirdjian, and Trevor Darrell. Hidden conditional random elds for gesture recognition. In IEEE Computer Society Conference and Computer Vision and Pattern Recogni- tion, volume 2, pages 1521{1527, 2006. [116] James H. Watt and C. Arthur VanLearn, editors. Dynamic Patterns in Commu- nication Processes. Sage Publication, Inc, 1996. [117] Erica M. Woodin. A two-dimensional approach to relationship con ict: Meta- analytic ndings. Family Psychology, 25:325{335, 2011. [118] Zhongzhe Xiao, Emmanuel Dellandrea, Weibei Dou, and Liming Chen. Automatic hierarchical classication of emotional speech. In Proceedings of ISMW, pages 291{296, 2007. [119] F. Yang and P.A. Heeman. Avoiding and resolving initiative con icts in dialog. In NAACL HLT, Rochester, NY, Apirl 2007. [120] Jun Yang. Mill : A multiple instance learning library. online. [121] S. Yildirim, C. M. Lee, S. Lee, A. Potamianos, and S. Narayanan. Detecting politeness and frustration state of a child in a detecting politeness and frustration state of a child in a conversational computer game. In In Proc. Eurospeech, Lisbon, Portugal, October 2005. [122] Serdar Yildirim and Shrikanth S. Narayanan. Automatic detection of dis u- ency boundaries in spontaneous speech of children using audio-visual information. IEEE Transactions on Audio, Speech, and Language Processing, 17:1:2{12, 2009. [123] Q. Zhang and S.A. Goldman. Em-dd: An improved multiple-instance learning technique. Advances in neural information processing systems, 2:1073{1080, 2002. 148
Abstract
Behavioral Signal Processing (BSP) is an emerging interdisciplinary research domain, operationally defined as computational methods that model human behavior signals, with a goal of enhancing the capabilities of domain experts in facilitating better decision making in terms of both scientific discovery in human behavioral sciences and human-centered system designs. Quantitative understanding of human behavior, both typical and atypical, and mathematical modeling of interaction dynamics are core elements in BSP. This thesis focuses on computational approaches for modeling and quantifying interaction dynamics in dyadic human interactions.

The study of interaction dynamics has long been at the center of multiple research disciplines in human behavioral sciences (e.g., psychology). Exemplary scientific questions addressed range from studying scenarios of interpersonal communication (verbal interaction modeling; human affective state generation, display, and perception mechanisms), to modeling domain-specific interactions (such as assessment of the quality of theatrical acting or children's reading ability), to analyzing atypical interactions (for example, models of distressed married couples' behavior and response to therapeutic interventions, quantitative diagnostics and treatment tracking of children with Autism, and people with psychopathologies such as addiction and depression). In engineering, a metaphorical analogy and framework for this notion in behavioral science is based on the idea of conceptualizing a dyadic interaction as a coupled dynamical system: an interlocutor is viewed as a dynamical system whose state evolution is based not only on its own past history but also on the other interlocutor's state. However, the evolution of these "coupled states" is often hidden by nature