COMPUTATIONAL METHODS FOR MODELING NONVERBAL COMMUNICATION IN HUMAN INTERACTION

by

Rahul Gupta

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

December 2016

Copyright 2016 Rahul Gupta

Dedicated to my parents Poonam Gupta and Om Prakash Gupta.

Contents

Dedication
Contents
List of Tables
List of Figures
Acknowledgements
Abstract

1 Introduction
  1.1 Background
  1.2 Dissertation overview
    1.2.1 Detection of non-verbal cues in human interaction
    1.2.2 Understanding the encoding of behavioral states into non-verbal cues
    1.2.3 Modeling the diversity in perception of non-verbal cues
  1.3 Modeling framework

2 Related work
  2.1 Detection of non-verbal cues
  2.2 Understanding the encoding of behavioral states into non-verbal cues
  2.3 Modeling diversity in perception of non-verbal cues

I Detecting nonverbal cues in human interaction

3 Detecting paralinguistic events in audio stream using context in features and probabilistic decisions
  3.1 Database
  3.2 Feature extraction
  3.3 Event detection scheme
    3.3.1 Predicting event probabilities based on the speech features
    3.3.2 Incorporating context in probabilistic frame-wise outputs
    3.3.3 Masking highly probable/improbable events
  3.4 Analysis of features
    3.4.1 Sensitivity analysis
    3.4.2 Feature performance analysis
    3.4.3 Relation between feature performance and output probability dynamic range
  3.5 Conclusion

4 Variable span disfluency detection in ASR transcripts
  4.1 Training and Evaluation Data
    4.1.1 Training Corpus
    4.1.2 Evaluation Corpus
  4.2 Disfluency detection
    4.2.1 Training methodology and features
    4.2.2 Inference methodology
  4.3 Results and discussion
    4.3.1 Discussion
  4.4 Conclusion

II Estimation of latent states using nonverbal cues

5 Predicting affective dimensions based on self assessed depression severity
  5.1 Database
  5.2 Experiments
    5.2.1 Investigating relationship between affective dimensions and depression severity
    5.2.2 Predicting affective dimensions
  5.3 Conclusion

6 Affect prediction in music using boosted ensemble of filters
  6.1 Database
  6.2 Affect prediction
  6.3 Experiments and discussion
    6.3.1 Discussion
  6.4 Conclusion

7 Predicting client's inclination towards target behavior change in Motivational Interviewing and investigating the role of laughter
  7.1 Database
  7.2 Experiments
    7.2.1 Predicting valence of Client Change Talk utterance
    7.2.2 Prediction incorporating laughters
    7.2.3 Laughters and their prosodic differences
  7.3 Conclusion

III Modeling diversity in perception of nonverbal cues

8 Modeling multiple time series annotations based on ground truth inference and distortion
  8.1 Background
  8.2 Distortion based multiple annotator time series modeling
    8.2.1 Choices for the feature mapping function and the distortion function
  8.3 Training Methodology
    8.3.1 EM algorithm implementation
    8.3.2 Evaluation criteria
  8.4 Experimental evaluation
    8.4.1 Feature set
    8.4.2 Baseline models
    8.4.3 Results
    8.4.4 Discussion
  8.5 Conclusion

9 Inferring object rankings based on noisy pairwise comparisons from multiple annotators
  9.1 Previous work
  9.2 Methodology
    9.2.1 Majority Vote (MV)
    9.2.2 Independent Annotator Modeling (IAM)
    9.2.3 Joint Annotator Modeling (JAM)
    9.2.4 Variable Reliability Joint Annotator Modeling (VRJAM)
  9.3 Experimental Results
    9.3.1 Data sets with synthetic annotations
    9.3.2 Data set with machine/human annotations
  9.4 Conclusion

10 Conclusions

Reference List

A Derivation of the Expectation Maximization algorithm stated in Chapter 8
B Derivation of equations for Chapter 9

List of Tables

3.1 Statistics of laughter and filler annotations in the SVC corpus.
3.2 Statistics of data splits used as training, development and testing set.
3.3 Set of features extracted per frame.
3.4 AUC for prediction using context independent features with HMM, linear and non-linear classifiers.
3.5 AUC for classification using contextual features obtained by appending window-wise feature statistics and dimensionality reduction.
3.6 AUC after temporally filtering the time series U_E using MA and MMSE based filters. MMSE provides the best AUC, slightly better than the MA filter.
3.7 AUC using the MMSE based filter and probability encoding.
3.8 Detection results before and after masking the probability time series.
3.9 Statistics of linear model for predicting AUC with dynamic range of the output probability as the regressor.
4.1 Examples of disfluencies
4.2 Training and evaluation datasets
4.3 Example training instances of variable span lengths.
4.4 Results obtained on the reference and the ASR transcripts (English side of DARPA TRANSTAC data).
5.1 Correlation coefficient between a subject's BDI-II score and statistical functionals computed over the affective dimensions for his session. Significance of ρ ≠ 0 is shown in bold.
5.2 Mean of the correlation coefficients, ρ (and per affective dimension: valence: val., arousal: aro., dominance: dom.) between the ground truth and system prediction. Best performing system for each data is shown in bold. Best systems are significantly better than the baseline at the 5% level using the Student's t-statistics test (number of samples = number of frames).
6.1 SNR values for affect rating prediction using the baseline and the proposed BESiF models.
7.1 Example conversation with Counselor (T) behavior annotation and client (C) ChangeTalk valence annotation.
7.2 List of various counselor behavior codes and ChangeTalk codes and corresponding count of utterances.
7.3 Results (in %) for predicting ChangeTalk codes.
7.4 Results (in %) for predicting counselor behavior codes.
7.5 Results (%) & relative improvements (%) over previous model for predicting ChangeTalk codes w/ laughter.
7.6 Results (%) & relative improvements (%) over previous model for predicting counselor behavior codes w/ laughter.
7.7 Prosodic features used in laughter classification.
7.8 Results for classification of client laughters over ChangeTalk codes using prosody.
8.1 Correlation coefficient between the estimated ground truth and the predictions from the feature mapping function. A higher ρ implies that the estimated ground truth is better estimated using the low level features. The improvement over the closest baseline using the proposed model is significant based on the Fisher z-transformation test [1] (p-value < .001, z-value = 6.1, number of samples equals the number of analysis frames: 37k).
8.2 Correlation coefficient between the estimated ground truth and the predictions from the feature mapping function after removing annotators 27 and 28 from training.
The proposed model is significantly better than the closest baseline model (baseline 2) based on the Fisher z-transformation test [1], considering the value at each frame to be a sample (p-value < 0.001, z-value = 7.7).
9.1 Values of r_k & mean(R_k) obtained on the red wine data set.
9.2 Accuracy in inferring z_ij in the synthetic data sets.
9.3 Ratio of pairwise comparisons in which a classifier ranks the image containing greater value higher than the other image in the pair. (KNN: KNN classifier, LR: Logistic Regression, NB: Naive Bayes classifier, RF: Random Forests classifier and Perc.: Perceptron).
9.4 Performance of the fusion schemes on pairwise comparisons z_ij^k, as obtained from the machine annotators.
9.5 Comparison of expressiveness/naturalness between TD and HFA kids. TD kids are expected to be more expressive/natural.

List of Figures

1.1 This figure represents communication as an encoding-decoding process. The person on the left encodes information which is received by the audience on the right and decoded.
3.1 Block diagram representing each processing step performed during detection. Contents in gray boxes show experimental methods used, with the best method (as determined during the system development) marked in red. I retain the best method at each step for the subsequent processing step.
3.2 An audio segment containing a laughter segment at the end and the corresponding output laughter probabilities. The laughter segment in this clip (as annotated) occurs in two bursts. Notice the low laughter probability assigned to the silence frames in between the two bursts.
3.3 Architecture for frame-wise classification based on incorporating feature context and dimensionality reduction. Feature values over a window are concatenated and projected on a lower dimensional space using PCA. f_FC represents the discriminative classifier trained with feature context.
3.4 Architecture for frame-wise classification based on incorporating context from feature statistics.
3.5 Estimated and target probability values for (a) laughter and (b) fillers, using contextual features on sample test files.
3.6 Filter coefficients for the linear filters operating over the time series U_E for (a) laughters (b) fillers.
3.7 Frequency response for the linear filters operating over the time series U_E for (a) laughters (b) fillers.
3.8 Obtained and target probability values for (a) laughters and (b) fillers, after probability smoothing on sample test files.
3.9 Obtained and target probability values for (a) laughters and (b) fillers, after masking.
3.10 Plot of event probability output by the DNN on varying the z-normalized MFCCs over one standard deviation (σ = 1). Note that while varying a single feature, all other features are set to 0. Plots 1a/2a: Plot over MFCC 1-6 for laughters/fillers. Plots 1b/2b: Plot over MFCC 7-12 for laughters/fillers.
3.11 Plot of assigned probabilities on varying the z-normalized prosodic features over one standard deviation.
(a/b: Plot over prosodic features for laughters/fillers.)
3.12 AUC (in %) obtained based on a single feature at a time. (a): Plot for MFCCs (b): Plot for prosodic features.
3.13 Plot representing AUC and dynamic range values obtained per feature. The line in red represents the best fit using linear regression. The 17 datapoints correspond to each feature.
4.1 ROC curves for disfluency detection in reference transcript
4.2 ROC curves for disfluency detection in ASR transcript
5.1 Baseline system with audio-video features as inputs.
5.2 Proposed system with a feature transformation layer appended to the baseline system.
6.1 Average arousal-valence ratings in each music genre
6.2 RMSE (σ_error) of the BESiF model on the test set against the number of base classifiers. The red point denotes the count of base filters chosen based on the development set.
6.3 (a) Target arousal values, (b) filtered feature values and (c) raw feature values for a selected file in the testing set.
7.1 Proposed model to represent the dyadic interaction and the annotation process.
7.2 Counts and empirical distribution of laughters over (a) ChangeTalk codes (b) Counselor behavior codes.
8.1 A figure providing the intuition of the proposed model, inspired from Raykar et al. [2]
8.2 Graphical models for schemes previously proposed to model discrete label problems. (1a) Maximum likelihood estimation of observer error-rates using the EM algorithm [3] (1b) Supervised learning from multiple annotators/experts [2] (1c) Globally variant locally constant model [4].
8.3 Graphical model for the proposed framework. X_s represents the features, a_s represents the ground truth. θ and <D_1, .., D_N> are the sets of parameters for the feature mapping function and the distortion functions, respectively.
8.4 Correlation coefficient (ρ) between every pair of annotators represented as an image matrix. The colorbar on the right indicates the value of the correlation coefficient. Due to indexing based on agreement with annotator 1, annotators with lower indices have a higher ρ with annotator 1. Annotators 27 and 28 have a very low agreement with several of the annotators.
8.5 Facial landmark points tracked on the children's face during interaction.
8.6 Correlation coefficients between the true and predicted annotator ratings. A higher ρ implies that the model is better able to model the dependencies between low level features and the annotator ratings. The ρ values of the proposed model that are significantly better (at least at the 5% level using the Fisher z-transformation test) than all the baselines are marked. Annotators 3, 16 and 18 are significant only at the 10% level, and annotators 1, 5, 14, 17, 24, 25, 27 and 28 are either not significantly better or worse than at least one of the baselines.
8.7 Filter coefficients estimated by the proposed EM algorithm for each of the annotators. The filters are plotted as d(-(W-1)), .., d(-1), d(0), used during convolution as a_n^s(t) = sum_{w=0}^{W-1} a^s(t-w) d_n(w). A higher value for the coefficients towards the left in the figure implies a higher emphasis on the past samples.
8.8 Annotator bias d_n^b estimated using the proposed model.
8.9 Annotator delays estimated using the baseline 2 proposed in Mariooryad et al. [5].
8.10 Ground truth a_s as estimated by various baseline and proposed models on an arbitrary section of the data.
8.11 Correlation coefficients between the true and predicted annotator ratings based on the model trained after removing annotators 27 and 28. A higher ρ implies that the model is better able to model the dependencies between low level features and the annotator ratings. For the correlation coefficients obtained using the proposed model, the markers indicate a significant improvement with p-value < 5%, and a significant improvement with p-value < 10% but greater than 5%, respectively.
9.1 Graphical models for (a) Majority vote (MV) (b) Independent Annotator Model (IAM) (c) Joint Annotator Model (JAM) and, (d) Variable Reliability Joint Annotator Model (VRJAM) schemes.
9.2 Model performances with increasing annotator noises.
9.3 Model performances with increasing number of annotators.
B.1 Plot comparing the values of the negative hinge loss function (1 - [1 - <w, x_i - x_j>]_+) and the log of the logistic loss function (log p(z_ij | x_i, x_j, w)).

Acknowledgements

The completion of this thesis could not have been possible without the assistance and participation of several people. It may not be possible to enumerate all the names, but I take this opportunity to thank my family, friends and colleagues, who could not be thanked enough for their support and encouragement.

First of all, I would like to thank my adviser Dr. Shrikanth S. Narayanan for his excellent mentorship. His support extended beyond academics and helped shape the six years of my life as a doctoral student. In particular, he encouraged me to explore new research domains, and this thesis would not have been possible without the research flexibility that he provided. I would also like to thank Dr. Sungbok Lee and Dr. Panayiotis Georgiou for the numerous occasions during which they guided my research and provided valuable suggestions. I am also thankful to my dissertation committee, Dr. Antonio Ortega and Dr. Yan Liu, for their insightful comments and suggestions.

I express my gratitude to other researchers I have collaborated with over the years. I thank Dr. Agata Rozga, Dr. David Atkins, Dr. Sankarnarayanan Ananthakrishnan and many other researchers who have guided me from time to time. They often helped me with the interdisciplinary studies that I participated in, by providing me advice related to their areas of expertise. I am also very fortunate to have amazing colleagues and friends in the Signal Analysis and Interpretation Lab who have helped and supported me during my doctoral studies. Specifically, I would like to thank Dr. Chi-Chun Lee and Dr. Kartik Audhkhasi for their excellent mentorship during the initial year of my doctoral studies, which helped to shape my research aptitude.

Finally, I dedicate this thesis to my parents (Mrs. Poonam Gupta and Mr. O. P. Gupta).
They have been a constant source of support and strength throughout my life. I would also like to thank my wife, Taruna Agrawal, for providing me a new perspective and for her relentless support in all my endeavors. I am also thankful to my brother Navneet Gupta for his constant encouragement during my studies.

Abstract

Nonverbal communication is an integral part of human communication and involves sending wordless cues. Development of nonverbal communication starts at an early stage in life, even before the development of verbal communication skills, and constitutes a major part of communication. In the literature, nonverbal communication has been described as an encoding-decoding process. During encoding, a person embeds his internal state, comprising emotions, mental well being and sentiment, into a multimodal set of nonverbal cues including nonverbal vocalizations, body gestures and facial expressions. After receiving these encoded cues, another person decodes these nonverbal cues based on his perceptions of the cues. In this thesis, I aim to facilitate the understanding of this encoding-decoding process. Specifically, I conduct experiments with respect to three different target scenarios: (a) detection of nonverbal cues in human interaction, (b) estimation of latent states using nonverbal cues and (c) modeling diversity in perception of nonverbal cues.

The first part of this thesis involves detection of nonverbal cues in time-continuous human signals such as speech and body language. Accurate detection of nonverbal cues can help us understand the embedding of nonverbal cues in human signals and can also aid the downstream analysis of nonverbal cues, such as estimation of latent states and modeling diversity in perceptions of nonverbal cues. I develop two schemes: identification of nonverbal vocalizations (laughter and fillers) in telephonic speech, and automatic detection of disfluencies in automatic speech recognition transcripts. Both these experiments exploit the temporal characteristics of nonverbal cues and reflect the importance of context in nonverbal detection problems.

In the second part, I focus on latent behavioral state estimation using nonverbal cues. These experiments are designed to understand the encoding process during nonverbal communication. Although the encoding process is fairly complex, these experiments establish the relation between latent human behavioral states and nonverbal cues as a first step. I conduct experiments to establish the relation between behavioral constructs such as empathy, emotion and depression, and nonverbal cues such as facial expressions, prosody and nonverbal vocalizations. My focus in these experiments is not only to develop a mapping function between nonverbal cues and human behavior but also to build interpretable models to understand the mapping.

In the last part of the thesis, I develop models to understand diversity in perception of nonverbal cues. These experiments are a step towards understanding the decoding step in nonverbal communication. Decoding is a complex and person-specific process, and I develop models to capture and quantify the variability among people's perceptions of nonverbal cues. I conduct two experiments: one on modeling multiple annotator behavior in a study involving rating smile strength in dyadic human interaction, and another on modeling perceived pairwise noisy rankings by multiple annotators.
I develop these models with parameters which can capture the variability in perceptions among people. Furthermore, such parameters can be used to quantify the variability.

This thesis takes a holistic approach to the encoding-decoding process during nonverbal communication and proposes models towards a better understanding of the nonverbal communication phenomenon. As establishing these relationships is not trivial, I use fusion of experts in all my models to enhance the generalizability and interpretability of my results. For each of the parts stated above, I use either a set of sequential models, generalized stacking or an ensemble of experts as the means of modeling. These models are geared towards providing insights into the encoding-decoding phenomenon instead of using a black box approach. In summary, this thesis contributes towards the understanding of nonverbal communication while using novel methods with applicability to a more general class of problems. I take a multidisciplinary approach to understanding the phenomenon of nonverbal communication with novel designs inspired by state of the art practices in machine learning.

Chapter 1
Introduction

"The most important thing in communication is hearing what isn't said." - Peter F. Drucker

1.1 Background

Communication is a purposeful exchange of information involving a common set of symbols and semiotic rules between two parties. The communication process can be viewed as a process in which a sender encodes the information s/he wants to convey into signals that are received and decoded by another person. The art of communication has been evolving over time and is deemed crucial for the survival of mankind [6]. The functions of communication include (but are not limited to) interaction, expression of affect, incipient mentation and bonding [7].

Based on the usage of worded cues, communication can be broadly classified into two categories: verbal and nonverbal communication [8, 9]. Verbal communication makes use of words to communicate information and therefore is reliant on the concept of a common language between the parties involved in communication [10]. On the other hand, nonverbal communication involves the use of wordless cues in conveying information and can extend beyond language and other cultural barriers. Nonverbal cues are often used with verbal communication to repeat, conflict, complement, substitute, regulate and accent the verbal messages, and can therefore leave worded messages open to different interpretations. The encyclopedia of communication theory categorizes the study of nonverbal communication into four parts [11]: (i) kinesics [12], the study of body language, (ii) proxemics [13], the study of physical distance, (iii) paralanguage [14], the study of nonverbal vocalizations, and (iv) haptics [15], the study of touch. The expanse of the study of nonverbal communication emphasizes both the importance and the intricacy of understanding the discipline. With a sole focus on establishing the relationship between human behavioral constructs and nonverbal communication, several existing works explore the relation of human emotions [16], well being [17] and even childhood development [18] to aspects of nonverbal communication.

Nonverbal communication is the focus of this thesis and has been hypothesized to represent about two-thirds of all communication between humans [19].
This fact emphasizes the importance of nonverbal communication, and Avram [20] lays special focus on understanding nonverbal communication as an encoding-decoding process. I adopt this encoding-decoding approach, further investigated in [20-22], and attempt to understand the encoding and decoding mechanisms separately. Figure 1.1 presents a pictorial representation of communication as an encoding-decoding process, also applicable to non-verbal communication. Using this scheme of encoding-decoding, I divide this thesis into three parts addressing the aspects of detecting non-verbal cues, understanding the encoding of human behavioral constructs into non-verbal cues, and modeling the diversity in perception of non-verbal cues. I introduce these three parts of the thesis in the next section.

Figure 1.1: This figure represents communication as an encoding-decoding process. The person on the left encodes information which is received by the audience on the right and decoded.

1.2 Dissertation overview

Based on the encoding-decoding approach, this thesis is divided into three parts titled: (i) detection of non-verbal cues in human interaction, (ii) understanding the encoding of human behavioral states into non-verbal cues and, (iii) modeling the diversity in perception of non-verbal cues. I discuss these three topics, the challenges and the contributions made below.

1.2.1 Detection of non-verbal cues in human interaction

In the first part of the thesis, I address the problem of detecting non-verbal cues. Non-verbal cues can be subtle and are often conveyed using multi-modal expressions. Therefore, there exists a need to develop models which can capture the complex embedding of non-verbal cues during communication, while accounting for the subtle nature of many non-verbal cues. As it is well known that the expression of several non-verbal cues is contingent on the context of the conversation, I present novel approaches to structurally quantify the contextual information and study its relation to the expressed non-verbal cue. I present two case studies involving detection of non-verbal vocalizations (laughters and fillers) and detection of disfluencies during conversation. In these studies, I design models to reliably and robustly capture the signature of non-verbal cues during expression. These models are trained on raw descriptors extracted from human signals such as speech and also use the contextual information in detecting non-verbal cues.

1.2.2 Understanding the encoding of behavioral states into non-verbal cues

During the encoding process in communication, a person's latent behavioral states (e.g. emotions, mental health) are reflected in communication, including non-verbal cues. The challenges in modeling the latent behavioral states lie in defining them, followed by designing latent state estimation models to track the behavioral states. I address these challenges by borrowing definitions of human behavioral states from the psychology literature, followed by a focus on the design of latent state estimation models. I design a model to predict the latent behavioral states based on non-verbal cues extracted from speech and facial expressions. Through these models, I establish the relationship between non-verbal cues and latent behavioral states.
I describe a set of three case studies, namely: (i) predicting affective dimensions based on self assessed depression severity, (ii) affect prediction in music using a boosted ensemble of filters and, (iii) prediction of a client's inclination towards target behavior change in motivational interviewing and investigating the role of laughter. The choice of models in these studies is driven by the interpretability of the model parameters, with a goal of informing back the discipline of psychology. The contributions in this section include the extension of existing machine learning methodologies, which model the link between non-verbal communication and behavioral states, to new domains, aided with model interpretations.

1.2.3 Modeling the diversity in perception of non-verbal cues

In the last part of this dissertation, I address one specific aspect of decoding the non-verbal cues: the diversity in their perception. Once information is encoded in the nonverbal cues and the signal is transmitted, their perception is contingent upon the receiver of the cues. I model this diversity in perception of nonverbal cues using multiple annotator models, widely studied in applications relating to classification using crowd-sourced annotations. This work offers one of the first steps towards the quantification of the diversity in perception of nonverbal cues. I develop multiple annotator models with applications to the following two case studies: (i) modeling ratings from multiple annotators on smile strength and, (ii) inferring ranked lists from noisy pairwise comparisons. These models extend multiple annotator models in classification to temporal regression and ranking problems. I further interpret the parameters of these models to quantify the diversity in perception of the audience.

Given the inherent complexity in modeling the aforementioned aspects of non-verbal communication, I adopt a common thread of modeling them through a fusion of experts framework. In the next section, I describe and justify my choice of modeling framework.

1.3 Modeling framework

Given the subtlety of nonverbal cues, differences in their perception and their innate relation with human behavior, the general computational approach that I adopt is a fusion of experts framework. I use either experts arranged in sequence, stacked generalization [23] or an ensemble of experts [24]. The motivation behind using these schemes is to improve the generalizability and interpretability of my models. Several works [25,26] show that a fusion of experts leads to a lower generalization error, and this fact is important in several applications such as nonverbal cue detection. Additionally, fusion of experts has traditionally been used in several multi-modal fusion experiments [27,28]. This is critical in the study of nonverbal cues (both in terms of modeling and understanding) as they are often multi-modal. Finally, in modeling perception of nonverbal cues I employ an ensemble of experts with special emphasis on the interpretability of the model parameters. In summary, my general goal during the development of algorithms to understand nonverbal communication has been to develop interpretable models using various fusion schemes as opposed to a black-box approach.

This document is arranged as follows. I present a summary of the previous work in the study of non-verbal communication in Chapter 2. In the next part of the thesis, I present the motivation and experiments for detection of nonverbal cues during human interaction in chapters 3 and 4.
I discuss experiments on inferring latent human behavioral states from nonverbal cues in chapters 5, 6 and 7. Chapters 8 and 9 present my models for modeling diversity in perceptions of nonverbal cues in the next part. I present my conclusions and ongoing work in Chapter 10.

Chapter 2
Related work

The earliest work adopting a scientific approach to the study of non-verbal communication was led by Charles Darwin in his book "The Expression of the Emotions in Man and Animals" [29]. Darwin argued that the form of emotional expression (e.g. using facial expressions) in man and animals is an outcome of evolution over time, serving functions related to survival and communication. After the exploration by Darwin, there was a long hiatus in the study of non-verbal communication, restarting only in the middle of the twentieth century. Scientists such as Adam Kendon and Ray Birdwhistell pioneered the approach of context analysis to understand non-verbal communication [30, 31]. A real boost in the study of non-verbal communication was observed after 1960, when several psychologists and scientists conducted inaugural research in domains related to non-verbal communication. Examples include Argyle [32], Fodor [33], Sommer [34] and Rosenthal [35]. Following all this development, the Journal of Environmental Psychology and Nonverbal Behavior commenced in 1978.

Since the inception of the detailed study of non-verbal communication, several researchers in the disciplines of machine learning, signal processing as well as psychology have further conducted experiments to understand the function of non-verbal cues. For instance, Halberstadt [36] describes the relation between family socialization of emotional expression and nonverbal communication styles. Salovey et al. [37] investigate the concept of emotional intelligence and relate it to aspects of nonverbal receiving ability. Similarly, other literature examines the relation between emotional attributes and cues such as body language [38,39], prosody [40] and facial expression [41,42]. In terms of understanding the association of nonverbal cues and human well being, research has focused on understanding depression [43,44], relationship well-being [45] and mental health [46]. Within the domain of mental health, researchers have focused on specific domains such as motivational interviewing [47,48], interpersonal psychotherapy [49] and other manual based psychotherapy protocols [50]. The study of childhood development is another significant area of research, with considerable focus on understanding the evolution of nonverbal communication during childhood. For instance, Mundy et al. [51], Lord et al. [52] and Stone et al. [53] have focused on the relation between the developmental disorder of autism and nonverbal communication. Similarly, there exist studies correlating nonverbal behavior and joint attention [54], behavioral problems [55] and control over affective expressions [56]. In addition to the above, researchers have also studied the association between nonverbal behavior and aspects such as self-presentation [57], social competence [58], gender differences [59] and childhood depression [60]. Apart from studies focused solely on understanding non-verbal communication, existing literature delves further into investigating its complementarity and interaction with verbal messages [8,61].
In the following sections, I present a brief summary of the contributions that researchers have made, categorized as per the three parts of this thesis: (i) detection of non-verbal cues, (ii) understanding the encoding of behavioral states into non-verbal cues and, (iii) modeling diversity in perception of non-verbal cues.

2.1 Detection of non-verbal cues

Event detection is a classical problem relevant to several domains such as quality assurance [62], medicine [63] and psychology [64]. One of the earlier attempts at detection of non-verbal cues was made by Gangawar in her thesis titled "Arousal detection using EEG signal" [65]. Along these lines, researchers have worked on detection and understanding of facial expressions [66] and non-verbal vocal characteristics such as stuttering [67]. More recently, researchers have focused on identifying nonverbal cues related to anxiety [68], lies and deceit [69,70] and commitment [71]. Research studies have also explicitly focused on detecting nonverbal cues such as smiles [72, 73], laughters [74, 75] and cries [76, 77]. The approaches taken to detect these cues are conditioned on the subtlety of non-verbal cues as well as the form of their expression (such as modalities and intensity). I further discuss the related work on the two case studies presented in this dissertation in chapters 3 and 4.

2.2 Understanding the encoding of behavioral states into non-verbal cues

Researchers have identified and studied several forms of human behavioral states in relation to non-verbal cues. Knapp et al. [78] provide a comprehensive study of non-verbal communication in human interactions and the reflections of human behavioral states in it. Non-verbal communication has been studied in relation to deception [70], self-presentation [57] and commitment [71]. The same has been investigated in relation to human-computer interaction, with applications including efficiency and robustness [79] and machine understanding [80]. In this work, I study the encoding of human behavioral states such as affect, depression and motivation into non-verbal cues including laughter and sighs. I further present some of the related works on these applications in chapters 5-7.

2.3 Modeling diversity in perception of non-verbal cues

Due to the subtle nature of nonverbal cues [81,82], different people may have different perceptions of nonverbal cues transmitted by another communicator. Existing literature has explored differences in the perceptions of nonverbal cues such as in lying [83], likability [84] and other discrepant nonverbal cues [85]. The work in this thesis is particularly inspired by the work of Raykar et al. [2], who proposed a model to infer the latent ground truth given perceptions from multiple annotators. Since this is the problem I approach with regard to decoding of non-verbal cues, I develop variations of their model and apply them to studies involving perception of smile strength, expressivity and naturalness. Further details of this work are presented in chapters 8 and 9.

Part I
Detecting nonverbal cues in human interaction

In this part of the thesis, I focus on the detection of non-verbal cues in human interaction. As described before, non-verbal communication cues can often be subtle and may convey ambiguous messages. Therefore, automatic detection of nonverbal cues can aid their analysis.
This is particularly relevant to my work since the later parts of this thesis deal with understanding the encoding and decoding of nonverbal cues, which may not be possible without knowing the location of the nonverbal cues under investigation.

The two specific case studies I discuss in this part are: (i) detecting paralinguistic events in audio stream using context in features and probabilistic decisions and, (ii) variable span disfluency detection in Automatic Speech Recognition (ASR) transcripts. The first case study involves detection of paralinguistic events in speech segments, and the second study concerns the detection of disfluencies in speech once it has been (inaccurately) transcribed using an ASR system. In both these case studies, I exploit the fact that nonverbal cues always carry a context. I design my models to quantify the context and use it in detection of the cues. In the first case study, involving the detection of paralinguistic events, the context is quantified as features or event probabilities surrounding the frame in question. In the second case study, for disfluency detection, I look at the syntactic structure of the sentence to identify disfluencies. I discuss these case studies in chapters 3 and 4, respectively.

Chapter 3
Detecting paralinguistic events in audio stream using context in features and probabilistic decisions

My focus in this chapter is on non-verbal vocalizations (NVVs) in speech. Previous research links various forms of non-verbal vocalizations such as laughters, sighs and cries to emotion [86, 87], relief [88, 89] and evolution [90]. The importance of each of these non-verbal vocalizations is highlighted by the role they play in human expression. Therefore, a quantitative understanding of their production and perception can have a significant impact on both behavioral analysis and behavioral technology development. In this chapter, I aim to contribute to the analysis of these non-verbal vocalizations by developing a system for detection of non-verbal events in spontaneous speech.

Several previous works have proposed detection methods for NVVs. Kennedy et al. [74] demonstrated the efficacy of using window-wise low level descriptors from speech [91] in detecting laughters in meetings. Truong et al. [92] investigated perceptual linear prediction (PLP) and acoustic prosodic features for NVV detection using Gaussian mixture models. Várallyay et al. [93] performed acoustic analysis of infant cries for early detection of hearing disorders. Schuller et al. [94] presented a static and dynamic modeling approach for recognition of non-verbal events such as breathing and laughter in conversational speech. In particular, the Interspeech 2013 Social Signals Sub-challenge [95] led to several investigations [96-100] on frame-wise detection of two specific non-verbal events: laughters and fillers. Building upon my efforts [101] on the same challenge dataset [102] (that was the winning entry in the challenge), in this chapter I perform further analysis and experiments. Previous works in this research field have primarily focused on local characteristics, and my approach investigates the benefits of considering context during the frame-wise prediction. My methods are inspired by the fact that the non-verbal events occur over longer segments (and hence analysis frames). The temporal characteristics of these events have been investigated in studies such as [103-105].
These studies reveal interesting patterns, such as a positive correlation between the duration of laughter and the number of intensity peaks [103], and similarity in the duration of fillers across languages [105]. Bachorowski et al. [104] went further into the details of laughter types (e.g. voiced vs unvoiced) and their relation to laughter durations. More studies on laughter and filler durations and their relation to acoustic structure can be found in [106-108]. As statistics on my database of interest (presented later in Table 3.1) also show that laughters and fillers exist over multiple analysis frames, I hypothesize that information from neighboring frames can be utilized to reduce the uncertainty associated with the current frame.

Given that the target events are temporally contiguous, one can use many of the available sequential classifiers for the problem of interest. Potential techniques include Markov models [109], recurrent neural networks [110] and linear chain conditional random fields [111]. For instance, Cai et al. [112] used Hidden Markov Models (HMM) for detection of sound effects like laughter and applause from audio signals. This approach is similar to methods used in continuous speech recognition [109] with a generative model (the HMM), which may not be as optimal as other discriminative methods in event detection problems [113]. Brueckner et al. [114] used a hierarchical neural network for detecting audio events. This model initially provides a frame-level prediction using low level features, and then another neural network is used to combine predictions from multiple frames to provide a final frame-level prediction. Other studies [96,115] have also applied ad-hoc filtering methods (median filtering, Gaussian smoothing) over the predictions to incorporate context in similar sequence classification problems. Most of these works have focused on performance driven approaches towards the design of the detection systems but fail to provide a thorough model and feature analysis.

In this work, as a continuation of my previous effort [101], I explore and analyze new architectures to incorporate context at the two levels of (i) frame-wise acoustic features and, (ii) frame-wise probabilistic decisions obtained from the features. Through my analysis in this chapter, I aim to understand the relation of laughters and fillers to the low level features as well as the temporal characteristics of these events. I focus on aspects such as using contextual features during classification and incorporating context in the frame-wise outputs (termed 'smoothing' and 'masking' operations). My final system achieves an area under the receiver operating characteristics (ROC) curve value of 95.3% for laughters and 90.4% for fillers on a held out test set. I also present an analysis of the role of each feature used during detection and its impact on the final outcome.

The chapter is organized as follows: Section 3.1 provides the database description and statistics, and Section 3.2 lists the set of used features. I present my core NVV detection scheme, inclusive of smoothing and masking, in Section 3.3. Section 3.4 provides feature analysis, and the conclusion is presented in Section 3.5.

Table 3.1: Statistics of laughter and filler annotations in the SVC corpus.

  Event      Total number of segments   Segment length statistics (in milliseconds)
                                        Mean   Standard deviation   Range
  Laughter   1158                       943    703                  2-5080
  Filler     2988                       502    262                  1-5570

3.1 Database

I use the SSPNet Vocalization corpus (SVC) [102] for the experiments in this chapter.
This data was used as the benchmark during the Interspeech challenge and provides a platform for comparison of various algorithmic methods [96-100]. The dataset consists of 2763 audio clips, each 11 seconds long. Each of these clips has at least one filler or laughter event between 1.5 seconds and 9.5 seconds. These clips are extracted from 60 phone calls involving 120 subjects (57 male, 63 female) containing spontaneous conversation. The pair of participants in each call perform a winter survival task, which involves identifying an entry from a predefined list consisting of objects useful in a polar environment. The conversation was recorded on cellular phones (model Nokia N900) at the two ends. There was no overlap in speech as the recordings were made separately for the speakers involved. The audio files are manually annotated (single annotator) for laughter and filler. I list the statistics for laughter and filler events over the entire database in Table 3.1. For more details on the dataset, please refer to [95,102].

For modeling and evaluating the frame-wise detection, I use the training, development and testing splits as defined during the Interspeech challenge [95]. The speaker information per clip is not available, but the training, development and testing sets contain non-overlapping sets of speakers (training: speakers 1-70, development: speakers 71-90, testing: speakers 91-120). The non-overlapping sets of speakers allow for speaker-independent evaluation. Table 3.2 shows the counts of clips, laughter segments and filler segments in each data split. In the next sections, I list the set of features extracted per audio clip, followed by my NVV detection scheme.

Table 3.2: Statistics of data splits used as training, development and testing sets.

  Count               Training   Development   Testing   Total
  Clips               1583       500           680       2763
  Laughter segments   649        225           284       1158
  Filler segments     1710       556           722       2988

3.2 Feature extraction

I use an assembly of prosodic features and mel-frequency cepstral coefficients (MFCCs) extracted at a frame rate of 10 milliseconds using OpenSMILE [116]. This set of features is the same as the one used in the 2013 Interspeech challenge [95] and has been previously used in several other classification and detection experiments involving emotion, depression and child behavior characterization [117-121]. Studies [74,92] have shown the relation of speech prosody and spectral characteristics to similar non-verbal events. Note that this list of features, while large, is by no means exhaustive, and new features for laughter and filler detection have been proposed in several other works [100, 122]. As I focus on the system development aspect of event detection in this chapter, I work with this smaller representative set of features provided during the Interspeech challenge. For further improvement and specific feature analysis, the proposed system can be augmented with new sets of features in the future. The features used in this chapter are listed in Table 3.3. I z-normalize these features per file before subsequent system training. The feature means and variances for normalization are calculated over the entire duration of the file.

Table 3.3: Set of features extracted per frame.

  Prosodic features   Voicing probability, harmonic to noise ratio, fundamental frequency (F0), zero crossing rate, log intensity
  MFCC                12 coefficients
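The per-file z-normalization described above is simple to reproduce; the following is a minimal sketch, assuming the 17 frame-wise features of one clip are loaded as a NumPy array (the function and variable names are mine, not from the thesis).

```python
import numpy as np

def znormalize_per_file(features: np.ndarray) -> np.ndarray:
    """Z-normalize a (num_frames x num_features) matrix of frame-wise features
    using the mean and standard deviation computed over the whole clip."""
    mean = features.mean(axis=0)    # per-feature mean over all frames of the file
    std = features.std(axis=0)      # per-feature standard deviation over the file
    std[std == 0.0] = 1.0           # guard against constant features
    return (features - mean) / std

# Example: one 11-second clip at a 10 ms frame rate gives ~1100 frames x 17 features
clip_features = np.random.randn(1100, 17)
normalized = znormalize_per_file(clip_features)
```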
3.3 Event detection scheme

I use the aforementioned feature set to train my frame-wise detection scheme. I test the effect of incorporating context at various stages in detecting an event E in {laughter, filler}. As these events usually last over multiple frames, I hypothesize that proximal information may help in the detection of these events. Given the rare occurrence of these events, the Interspeech challenge [95] adopted the area under the Receiver Operating Characteristics (ROC) curve as the evaluation metric for laughter and filler detection. I use the same metric and explore several prediction architectures to maximize it. I develop a three step sequential algorithm which involves:

(i) Predicting event probabilities based on the speech features. I investigate the effect of incorporating contextual features and compare it with a context independent baseline.
(ii) Incorporating context in the probabilistic frame-wise outputs obtained from the previous step.
(iii) 'Masking' the smoothed probabilities based on heuristic rules.

Figure 3.1 provides a block diagram representation of my methods and constituent experiments. I describe each of these steps below; a code-level sketch of the overall pipeline follows Figure 3.1.

Figure 3.1: Block diagram representing each processing step performed during detection. Contents in gray boxes show experimental methods used, with the best method (as determined during the system development) marked in red. I retain the best method at each step for the subsequent processing step.
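The following is a minimal, code-level sketch of the three-step pipeline. The step functions are placeholders standing in for the specific methods compared in the rest of this section (frame-wise classifier, smoothing filter and masking rules); they are illustrative assumptions, not implementations from the thesis.

```python
import numpy as np

def detect_event_probabilities(frame_features, frame_classifier, smooth, mask):
    """Three-step detection pipeline for one event type (laughter or filler).

    frame_features   : (num_frames x num_features) array for one clip
    frame_classifier : step (i)   - maps features to frame-wise event probabilities
    smooth           : step (ii)  - incorporates context into the probability time series
    mask             : step (iii) - suppresses/boosts highly improbable/probable regions
    """
    u_e = frame_classifier(frame_features)   # step (i): frame-wise probabilities u_E(n)
    u_e = smooth(u_e)                         # step (ii): temporal smoothing
    u_e = mask(u_e)                           # step (iii): heuristic masking
    return u_e

# Example with trivial placeholder steps
probs = detect_event_probabilities(
    np.random.randn(1100, 17),
    frame_classifier=lambda x: 1.0 / (1.0 + np.exp(-x[:, 0])),     # dummy classifier
    smooth=lambda u: np.convolve(u, np.ones(5) / 5, mode="same"),  # moving-average filter
    mask=lambda u: np.where(u < 0.05, 0.0, u),                     # simple threshold mask
)
```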
3.3.1 Predicting event probabilities based on the speech features

I initiate my system training procedure with a model that uses the speech features and outputs the frame-wise probabilities of an event E. First, I present a model that takes no context into account and serves as a baseline. Next, I augment the feature set by introducing certain contextual features. I next describe the baseline scheme, followed by the incorporation of contextual features.

Baseline: Frame-wise classification with context independent features

In this classification scheme, I obtain the event probability exclusively based on the 17 features extracted per frame, as listed in Table 3.3. Note that these features represent the acoustic characteristics of only the analysis frame under consideration and contain no information about the feature values from neighboring frames or the feature dynamics. I train models to assign each audio frame as belonging to a target event or as a garbage frame based on the acoustic features. I experiment with several classification architectures and model my class boundaries using (i) a generative Hidden Markov Model [109], (ii) a linear discriminative classifier (used as the baseline in [95]) and (iii) a non-linear discriminative classifier, to assess the nature of separability of the event classes in the feature space. I represent the column vector of features for the n-th frame as x_n and the corresponding probability obtained for an event E as u_E(n). I describe the training methodology for each of the models below.

(i) Classification with Hidden Markov Model (HMM): In this scheme, I train an HMM model using the Kaldi toolkit [123]. I train monophone models using the Viterbi-EM algorithm [124] for each of the laughter, filler and garbage events from the training set. I then decode the development and the test sets using a unigram language model (LM), assuming equal LM weights for the three events (to account for class imbalance in the training set, as the majority of frames belong to the garbage class). During decoding, I obtain the state occupancy probabilities of each frame belonging to the laughter, filler or garbage HMM. I use these probabilities from each of these HMMs as my laughter, filler and garbage probabilities.

(ii) Classification with a linear discriminative classifier: I determine linear class boundaries using the Support Vector Machine (SVM) classifier (this model was also used in the Interspeech challenge [95]). I train a multi-class SVM classifier over the three classes with pair-wise boundaries. I obtain the class probabilities for each frame by fitting logistic regression models to data-point distances from the decision boundaries. In order to prevent class bias due to unbalanced representative data from each class, I downsample the 'garbage' frames (frames not belonging to either of the events) by a factor of 20 during training, as suggested in the challenge paper [95]. I use a linear kernel, and the slack term weight is tuned on the development set. The predicted probabilities are computed using Hastie and Tibshirani's pairwise coupling method [125].

(iii) Classification with a non-linear discriminative classifier: Finally, I test a discriminative classifier with non-linear boundaries. I expect better results with this classifier due to a higher degree of freedom in modeling the class boundaries. On comparing results to the previous classifier, I get a sense of the deviation of the non-linear class boundaries from the SVM based linear boundaries. I chose a Deep Neural Network (DNN) [126] with sigmoidal activation as my non-linear classifier. DNNs have been used in several pattern recognition tasks such as speech recognition [127, 128] and have provided state of the art results. I train a DNN with two hidden layers, and the output layer consists of three nodes with sigmoidal activation. Each output node emits a probability value for one of the three classes: laughter, filler and garbage. I perform pre-training [128] before determining the DNN weights. The number of hidden layers and the number of neurons in each hidden layer were tuned on the development set.
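As a concrete illustration of such a three-class frame classifier, the sketch below uses scikit-learn's MLPClassifier as a stand-in for the pre-trained sigmoidal DNN. This is only an assumption-laden sketch: scikit-learn's MLP omits the unsupervised pre-training step, and the layer sizes, the garbage-class downsampling factor and the synthetic data are illustrative, not values from the thesis.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def downsample_garbage(X, y, garbage_label=2, factor=20, seed=0):
    """Keep every event frame and roughly 1-in-`factor` garbage frames."""
    rng = np.random.default_rng(seed)
    keep = (y != garbage_label) | (rng.random(len(y)) < 1.0 / factor)
    return X[keep], y[keep]

# X_train: (num_frames x 17) z-normalized features; y_train in {0: laughter, 1: filler, 2: garbage}
X_train = np.random.randn(50000, 17)
y_train = np.random.randint(0, 3, size=50000)
X_bal, y_bal = downsample_garbage(X_train, y_train)

# Two hidden layers with sigmoidal (logistic) activations; sizes are hypothetical.
dnn = MLPClassifier(hidden_layer_sizes=(64, 64), activation="logistic", max_iter=50)
dnn.fit(X_bal, y_bal)

# Frame-wise class probabilities for one test clip (columns ordered as labels 0, 1, 2)
frame_probs = dnn.predict_proba(np.random.randn(1100, 17))
```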
Results and discussion: I list the results using the three models in Table 3.4.

Table 3.4: AUC (in %) for prediction using context independent features with HMM, linear and non-linear classifiers.
                                          Development set        Testing set
System used                               Laughter   Filler      Laughter   Filler
Chance                                    50.0       50.0        50.0       50.0
Hidden Markov Models                      74.2       79.3        71.3       78.0
Linear discriminative classifier          77.0       81.2        73.8       79.1
Non-linear discriminative classifier      83.7       83.8        80.9       82.3

The "chance" Area Under the Curve (AUC) is determined based on random assignment of 0 and 1 probability values per frame for each event. I observe that the acoustic features considered carry distinct information about the non-verbal vocal events which distinguishes them from the rest of the speech. Decoding the development and test sets using the HMM framework often outputs the garbage label for laughter and filler frames, as a large portion of the training set consists of frames with the garbage label. Although a higher weight on the events in the language model leads to better output for laughter and filler frames, this comes at the expense of a higher false alarm rate.

Between the discriminative models, I obtain better results using a non-linear boundary as compared to linear SVM boundaries, given the higher degree of freedom in modeling the class distributions. The gain in the case of detecting laughters is higher than that for fillers. This indicates relatively less deviation of the non-linear boundary from the SVM boundary in the case of fillers. However, in the case of laughters the greater performance boost obtained using a DNN suggests that the true boundary may not be well approximated by a hyper-plane. I also observe that my classifiers obtain a higher AUC for fillers. This indicates that, given the context independent features, fillers are comparatively more distinguishable than laughters. This happens due to inherent differences in the acoustic structure of the two events. Fillers are contiguous sounds (e.g., um, em, eh), while laughters typically involve bursts of sound with silence in between. This heterogeneous acoustic property of laughter makes inference more difficult. For instance, assigning a silence frame to a laughter event is difficult in the absence of any context (Figure 3.2). I expect better detection after incorporating contextual features, as presented in the next section.

Figure 3.2: An audio segment containing a laughter segment at the end and the corresponding output laughter probabilities. The laughter segment in this clip (as annotated) occurs in two bursts. Notice the low laughter probability assigned to the silence frames in between the two bursts.

Frame-wise detection with contextual features

Non-verbal events such as laughters and fillers occur contiguously over long segments spanning multiple short term analysis frames. Hence the inclusion of neighboring context from surrounding frames may assist prediction, as proximal acoustic properties may help resolve conflicts (e.g., in the case of silence frames within laughter). I extend the previous best system, i.e., the DNN classifier, and make modifications to include feature context. I test two methods to incorporate context from the features of neighboring frames: (i) feature concatenation over a window followed by dimensionality reduction, and (ii) appending statistical functionals calculated over a window to the frame-wise features. I discuss each of these below.

(i) Window-wise feature concatenation and dimensionality reduction: In order to make a decision for the n-th frame, I consider a window extending to M_x frames before and after the n-th frame (window length: 2M_x + 1). I concatenate features from all the frames over the window, leading to (2M_x + 1) x 17 features. I determine the outcome for the n-th frame based on these values. An increase in the number of features leads to data sparsity in the feature space.
Therefore, I perform dimensionality reduction before training my classifier. I use Principal Component Analysis (PCA) [129] to retain the maximum variance after linearly projecting the 17 x (2M_x + 1) features. I train a DNN to obtain target event probabilities based on these projected features. I again downsample the garbage class before training and tune M_x, the number of principal components used and the DNN parameters on the development set. I show the classification schematic in Figure 3.3; f_FC indicates the classifier trained with feature context.

Figure 3.3: Architecture for frame-wise classification based on incorporating feature context and dimensionality reduction. Feature values over a window are concatenated and projected onto a lower dimensional space using PCA. f_FC represents the discriminative classifier trained with feature context.

(ii) Appending window-wise feature statistics: This scheme relies on appending the features from the current frame with statistical functionals of features computed over a longer temporal window in its neighborhood. This set of features, along with a linear SVM classifier, was used as a baseline during the Interspeech challenge [95]. In this scheme I incorporate context by appending velocity (Delta) and acceleration (Delta^2) values calculated over the (2M_x + 1)-frame-long window to the features in the current frame. This leads to 17 x 3 feature values (feature + Delta + Delta^2). The practice of incorporating these contextual features (Delta and Delta^2) is widely used in applications such as speech recognition [130] and language identification [131]. I further calculate means and variances over these 51 features to obtain my final feature set. The further addition of statistical functionals (apart from Delta and Delta^2) provides additional temporal characterization and is inspired by various previous works [132, 133]. I train a DNN on this set of features and also report the challenge baseline results obtained using the SVM classifier for comparison. I tune M_x and the DNN parameters on the development set. Figure 3.4 outlines the adopted classification scheme.

Figure 3.4: Architecture for frame-wise classification based on incorporating context from feature statistics.
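A minimal sketch of this feature construction is given below. It assumes an (n_frames x 17) matrix of z-normalized frame features; the delta computation shown (np.gradient) is a simple stand-in for the windowed velocity/acceleration estimates used in practice.

    # Sketch of scheme (ii): append delta and delta-delta trajectories, then
    # window-level means and variances. `x` is (n_frames x 17); M_x is the
    # half-window length (a tuning parameter).
    import numpy as np

    def contextual_features(x, M_x=10):
        delta = np.gradient(x, axis=0)          # first-order difference proxy for velocity
        delta2 = np.gradient(delta, axis=0)     # second-order proxy for acceleration
        base = np.hstack([x, delta, delta2])    # 17 * 3 = 51 values per frame

        n, d = base.shape
        out = np.zeros((n, d * 3))              # frame values + window mean + window variance
        for i in range(n):
            lo, hi = max(0, i - M_x), min(n, i + M_x + 1)
            win = base[lo:hi]
            out[i] = np.concatenate([base[i], win.mean(axis=0), win.var(axis=0)])
        return out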
Results and discussion: I list the results using the above two classification architectures in Table 3.5. I also list the result from the previous best context independent classifier and the Interspeech challenge baseline results [95] for comparison.

Table 3.5: AUC (in %) for classification using contextual features obtained by appending window-wise feature statistics and by dimensionality reduction.
                                                          Development set        Testing set
System used                                               Laughter   Filler      Laughter   Filler
Context independent DNN                                   83.7       83.8        80.9       82.3
Appending features + dimensionality reduction (DNN)       86.0       88.5        84.3       84.9
Window-wise feature statistics + DNN                      88.6       91.9        86.4       86.5
Window-wise feature statistics + SVM
(Interspeech challenge baseline [95])                     86.2       89.0        82.9       83.6

From the results, I observe that I obtain higher AUC values for both events, thus validating that context helps in improving prediction. I obtain higher AUCs using the statistical functionals compared to feature concatenation. This may be due to the fact that the dimensionality reduction leads to a loss of information. Also, the approach with feature statistical functionals retains the features pertaining to the frame at hand; these features are otherwise projected onto a lower dimensional space in the approach involving dimensionality reduction. Given the better performance using the feature statistical functionals, I proceed with this scheme for further system development.

In this section, I explored several methodologies to extract information from a set of vocal features. Overall, non-linear boundaries on frame-wise features appended with neighboring frame information provide the best results. Figure 3.5 shows the outputs from two separate audio clips, each containing laughter and filler events.

Figure 3.5: Estimated and target probability values for (a) laughter and (b) fillers, using contextual features on sample test files.

From the plots I observe that these files still contain several `garbage' frames with high laughter/filler probabilities. In spite of incorporating contextual features, the estimated probabilities do not evolve smoothly. Therefore, there is potential to further improve my results by including context in the output probabilities. I address this possibility in the next section, where I account for context in the decisions obtained from the current system.

3.3.2 Incorporating context in probabilistic frame-wise outputs

I propose methods to improve the previous model by incorporating context into the sequence of frame-wise probabilities (block (b) in Figure 3.1). I concatenate the output frame-wise probabilities u_E(n) (n = 1...N) from the previous classification into a time series U_E = {u_E(1), ..., u_E(N)}. I perform a "smoothing" operation on U_E consisting of two steps: (i) linear filtering, and (ii) probability encoding. These steps assist in incorporating context from neighboring frame decisions, as discussed below.

Linear filtering

I observe that the outcomes from the above systems tend to be noisy, consisting of sharp rises and falls. I therefore design a low pass FIR filter to reduce the spikes in U_E, as it is unlikely that these events last for only a few frames. I determine the filtered probability v_E(n) at the n-th frame for an event E as shown in (3.1), using a window of length 2M_u + 1 centered at n, where a_{m_u} is the filter coefficient applied to the frame output at a distance of m_u from the current frame. I determine the filter coefficients using two approaches: (i) a moving average filter and (ii) an FIR filter with coefficients determined using the Minimum Mean Squared Error (MMSE) criterion. I explain these two approaches and list the results below.

v_E(n) = \sum_{m_u = -M_u}^{M_u} a_{m_u} u_E(n + m_u)    (3.1)

(i) Moving Average (MA) filter: A moving average filter assigns equal values to all the coefficients, as shown in (3.2). For each frame, this scheme gives equal importance to the frame and its neighbors. I tune the window length parameter M_u on the development set.

a_{m_u} = \frac{1}{2M_u + 1}, \quad m_u = -M_u, \ldots, M_u    (3.2)

(ii) Minimum Mean Squared Error (MMSE) based filter: In this filter design scheme, I find the optimal set of filter coefficients by minimizing the Mean Squared Error (MSE, equation (3.3)) between the desired probability values and the probability values obtained after filtering. The MSE is a convex function with respect to (w.r.t.) the coefficients, with a global minimum. Each a_{m_u} is obtained analytically by setting the derivative of the MSE w.r.t. a_{m_u} to zero (equation (3.4)). I tune M_u for the MMSE based filter on the development set.
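As an illustration, both filter designs can be sketched as follows. The sketch assumes a 1-D array `u` of frame-wise event probabilities and a 0/1 target array `t` on the training portion; it solves the MMSE design as an ordinary least-squares problem rather than reproducing the exact implementation used here.

    # Sketch of the two linear smoothers over the probability time series U_E.
    import numpy as np

    def moving_average(u, M_u):
        a = np.full(2 * M_u + 1, 1.0 / (2 * M_u + 1))   # equal coefficients, eq. (3.2)
        return np.convolve(u, a, mode="same")

    def mmse_coefficients(u_train, t_train, M_u):
        # Each row is the window [u(n-M_u), ..., u(n+M_u)]; solve least squares
        # for the coefficients a_{m_u} that best predict the target t_n.
        n = len(u_train)
        rows = [u_train[i - M_u:i + M_u + 1] for i in range(M_u, n - M_u)]
        X = np.asarray(rows)
        y = t_train[M_u:n - M_u]
        a, *_ = np.linalg.lstsq(X, y, rcond=None)
        return a

    def mmse_filter(u, a):
        # Reversing the kernel turns convolution into the correlation of eq. (3.1).
        return np.convolve(u, a[::-1], mode="same")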
The MSE over the training set and the resulting coefficients are given by:

MSE = \frac{1}{\text{Size of the training set}} \sum_{n \in \text{Training set}} \Big( t_n - \sum_{m_u = -M_u}^{M_u} a_{m_u} u_E(n + m_u) \Big)^2    (3.3)

a_{m_u} = \arg\min_{a_{m_u}} (\text{MSE}), \quad \text{obtained at } \frac{\partial \text{MSE}}{\partial a_{m_u}} = 0    (3.4)

Results and discussion: I present the results in Table 3.6. I list the previous best results using contextual features for comparison. I observe a similar increase in the AUCs using the two filtering schemes.

Table 3.6: AUC (in %) after temporally filtering the time series U_E using MA and MMSE based filters. The MMSE filter provides the best AUC, slightly better than the MA filter.
                                              Development set        Testing set
System used                                   Laughter   Filler      Laughter   Filler
DNN using feature statistical functionals     88.6       91.9        86.4       86.5
MA filter                                     97.0       95.5        92.5       89.8
MMSE filter                                   97.3       95.5        94.2       89.9

I plot the filter coefficients in Figure 3.6 and the frequency response (FFT based, equation (3.5)) of the filters in Figure 3.7. A_{p_u} represents the discrete Fourier transform at the index p_u. Although the coefficients for the MA and MMSE filters are different, the similarity in performance of the two filters can be explained by the similarity in their frequency responses. Both filters attenuate high frequency components in the time series U_E. However, the MMSE filter has a slightly higher cut-off frequency and admits more high frequency components than the MA filter. M_u was 50 for fillers and 100 for laughters. This suggests that the context lasts longer for laughters than for fillers. It may follow from the fact that the mean laughter length is greater than the mean filler length, as observed in Table 3.1.

Figure 3.6: Filter coefficients for the linear filters operating over the time series U_E for (a) laughters and (b) fillers.

A_{p_u} = \sum_{m_u = -M_u}^{M_u} a_{m_u} \exp\Big( \frac{-i\, 2\pi\, p_u (m_u + M_u)}{2M_u + 1} \Big), \quad p_u = 0, \ldots, 2M_u + 1    (3.5)

In the next section, I describe the probability encoding scheme.

Figure 3.7: Frequency response for the linear filters operating over the time series U_E for (a) laughters and (b) fillers.

Probability encoding

After processing the data through the above scheme, I pass the outputs v_E(n) through an autoencoder [134]. The goal here is to capture any non-linear contextual dependency which the linear filters fail to capture. I define a new time series V_E = {v_E(1), ..., v_E(n), ..., v_E(N)} consisting of the filtered outputs. The auto-encoder f_enc is a feed-forward neural network trained to reconstruct the target values from V_E. I use an autoencoder with a single hidden layer and sigmoidal activation on the output node. This operation reconstructs a window of inputs {v_E(n - M_V), ..., v_E(n), ..., v_E(n + M_V)} to produce an output of the same length. The parameter M_V and the number of neurons in the hidden layer are tuned on the development set. The autoencoder accounts for non-linear dependence and maps to multiple target output values (unlike linear filtering). I train a neural network encoder on a window of length 2M_V + 1 centered at v_E(n) to obtain the predictions on the same window, as shown in equation (3.6) below.

{w_E(n - M_V), ..., w_E(n), ..., w_E(n + M_V)} = f_enc(v_E(n - M_V), ..., v_E(n), ..., v_E(n + M_V))    (3.6)

Results and discussion: I list the results for auto-encoding in Table 3.7. I train the auto-encoder on the outputs of the MMSE filter.

Table 3.7: AUC (in %) using the MMSE based filter and probability encoding.
                          Development set        Testing set
System used               Laughter   Filler      Laughter   Filler
MMSE filter               97.3       95.5        94.2       89.9
Probability encoding      97.6       96.1        95.3       90.2

Auto-encoding leads to different degrees of improvement in capturing the two events. Also, the performance is inconsistent across the data splits.
On the development set, I obtain a greater increase for fillers, and the pattern reverses on the test set. This is indicative of some degree of mismatch between the development and test splits. The very high AUC values on the development set indicate performance saturation and data distribution similarity between the development and the training set. A 1% absolute increase in AUC for laughters suggests that these events benefit more from the non-linear encoding than fillers do. Overall, I observe that incorporating context from output decisions leads to a greater detection improvement for laughters than for fillers. This shows the importance of context, particularly for events with heterogeneous spatio-temporal characteristics. Figure 3.8 shows the smoothed probabilities for the same set of files as shown previously in Figure 3.5. I see that even though false alarms still exist for the sample files, I obtain near perfect detection in the case of true positives. Also, spurious isolated false alarms are largely non-existent. The false alarms are still a concern and mainly arise from similar acoustic properties between certain verbal sounds and the non-verbal events (e.g., the sound "um" in umpire is similar to the filler "um"). I address this problem partially in the next section by using a `masking' technique.

Figure 3.8: Obtained and target probability values for (a) laughters and (b) fillers, after probability smoothing on sample test files.

3.3.3 Masking highly probable/improbable events

As the final step, I make use of inherent properties of the probability time series to further improve my results (block (c) in Figure 3.1). I develop the masking scheme based on two heuristics: (i) the existence of low event probability values for an extended period of time implies the absence of any event, and (ii) similarly, contiguous high event probability values imply the presence of an event. I implement this strategy by developing binary masks as described below.

(i) Zeros-mask design: If there is a contiguous run of probability values below a threshold T_0 for at least a set number of K_0 frames, I mask all such probabilities by zero.

(ii) Ones-mask design: Similarly, if probability values are contiguously over a threshold T_1 for at least K_1 frames, I mask all such probabilities by one.

The overall operation implementing the zeros and ones masks is shown in (3.7). I tune T_0, T_1 and K_0, K_1 on the development set.

y_E(n) = \begin{cases} 0, & \text{if } \exists\, \bar{n} \text{ such that } w_E(n) < T_0 \ \forall\, n \in \{\bar{n}, \bar{n}+1, \ldots, \bar{n}+K_0\} \\ 1, & \text{if } \exists\, \bar{n} \text{ such that } w_E(n) > T_1 \ \forall\, n \in \{\bar{n}, \bar{n}+1, \ldots, \bar{n}+K_1\} \\ w_E(n), & \text{otherwise} \end{cases}    (3.7)
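Operationally, the masking rule in (3.7) can be sketched as below. The array name `w` (the smoothed probability time series) and the exact run-boundary handling are illustrative.

    # Sketch of the zeros/ones masking heuristic: runs of at least K_0 frames
    # below T_0 are forced to 0, and runs of at least K_1 frames above T_1 are
    # forced to 1. All other values are left unchanged.
    import numpy as np

    def mask_probabilities(w, T0, K0, T1, K1):
        y = w.copy()
        n = len(w)
        start = 0
        while start < n:
            if w[start] < T0 or w[start] > T1:
                low = w[start] < T0
                end = start
                while end < n and ((w[end] < T0) if low else (w[end] > T1)):
                    end += 1
                run = end - start
                if low and run >= K0:
                    y[start:end] = 0.0
                elif not low and run >= K1:
                    y[start:end] = 1.0
                start = end
            else:
                start += 1
        return y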
Results and discussion: I show the results before and after masking in Table 3.8. I observe a slight increase in AUC for fillers and none for laughters. In the case of fillers, I obtain T_0 = 0.02 and T_1 = 0.98. The performance for laughters saturated in the previous step, and any T_0 > 0 and T_1 < 1 led to a reduction in AUC.

Table 3.8: Detection results (AUC in %) before and after masking the probability time series.
                          Development set        Testing set
System used               Laughter   Filler      Laughter   Filler
Probability encoding      97.6       96.1        95.3       90.2
Masking                   97.6       96.3        95.3       90.4

These threshold values suggest that the previous step involving smoothing accounted for most of the information during detection, leading to marginal gains when using additional masking. I plot the output probabilities for the chosen sample files in Figure 3.9. The only visible impact is for the file containing fillers, where event probabilities between frames 20 to 500 are set to zero. I perform an analysis of the results obtained after masking, and after each previous operation, in the next section.

Figure 3.9: Obtained and target probability values for (a) laughters and (b) fillers, after masking.

3.4 Analysis of features

In the experiments so far, all features are used together to make an event prediction. This makes it difficult to evaluate the contribution of individual features toward the final laughter and filler prediction outcomes. In this section, I perform two sets of experiments to address this issue. The goal of these experiments is to understand the relation between a feature's discriminative power and its impact on the final outcome. In the first experiment, I look at the sensitivity of the final outcome to perturbations in each single feature, followed by computing the AUCs using a single feature at a time. Finally, I investigate the outcomes of these two experiments and find a significant correlation between them, with further implications for the system design.

3.4.1 Sensitivity analysis

In this experiment, I study the effect of each feature on the final probability outcome. During frame-wise prediction, the neurons in the DNN are activated by all feature inputs from an audio frame. However, all the features may not have similar activation patterns. Through the sensitivity analysis, I investigate the DNN activation using one feature at a time. I expect differences in activation patterns from each feature, which may correlate with the feature discriminability (investigated in the next section). I perform separate analyses for the prosodic and the spectral features using my baseline DNN model trained on the 17 z-normalized features. I activate the DNN trained in Section 3.3.1 (iii) using one feature at a time, while keeping all other features at 0. I vary the chosen feature over one standard deviation (-1 to +1, as it is z-normalized) around the mean (0 for a z-normalized feature) and observe the probability outcome from the DNN. I plot the output probabilities for each of the features in Figures 3.10 and 3.11. I discuss my results separately for the prosodic features and the MFCC coefficients below.

Figure 3.10: Plot of the event probability output by the DNN on varying the z-normalized MFCCs over one standard deviation (sigma = 1). Note that while varying a single feature, all other features are set to 0. Plots 1a/2a: MFCC 1-6 for laughters/fillers. Plots 1b/2b: MFCC 7-12 for laughters/fillers.

MFCC features

I show the results for the MFCC features in Figure 3.10.
I split the results for the 12 MFCCs into two figures for readability. From the figures, I observe that the first few MFCC coefficients show the most dynamic range in output probability, and the variation is low in the second half of the coefficients. The first MFCC coefficient shows the most dynamic range for both laughters and fillers, indicating high sensitivity. For all coefficients except the first, a monotonic change in value either favors laughters or fillers. The output probability trends are opposite for the two events, wherein a positive slope for laughters corresponds to a negative slope for fillers. This indicates that a monotonic increase/decrease in feature values favors one event while reducing the probability of the other. Note that all the curves intersect at 0, as this corresponds to a value where the neural network is not activated at all.

Figure 3.11: Plot of assigned probabilities on varying the z-normalized prosodic features over one standard deviation. (a/b: prosodic features for laughters/fillers.)

Prosodic features

The plot of assigned outputs versus variation in prosodic features is shown in Figure 3.11. In the case of the prosodic features, I observe the most variation in the case of log energy and zero-crossing rate. Apart from log energy, I observe opposite trends in the slope of the curves for fillers and laughters. I observe that the curves corresponding to log energy are not monotonic, indicating a more complicated boundary for this feature. I observe patterns such as the probability of the outcome increasing with higher intensity (more sharply for laughters than fillers), suggesting that a louder voice implies a higher laughter probability. Similar comments can be made by observing the outcome patterns with an increase/decrease in a prosodic feature.

From the output patterns of the features, I observe that the trained DNN has different activation patterns for each feature. This implies that variations in features have a disproportionate impact on the final probability prediction. Moreover, for several features, the output probability trends are opposite for filler and laughter events. This is expected in a discriminative model, as an increase in probability output for given feature values should translate to a lower probability for the other. In the next experiment, I use the feature values from the dataset and predict the event labels.

3.4.2 Feature performance analysis

I analyze the performance of each feature on the test set using the same baseline DNN on 17 features (note that this is unlike the previous experiment, where outputs were obtained by synthetically varying feature values). In order to predict the event probabilities, I use the values from a single feature, while setting all other features to zero. I show the corresponding AUCs for laughters and fillers in Figure 3.12.
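A minimal sketch of this single-feature evaluation follows. It assumes a trained frame-wise classifier `clf` exposing predict_proba (as in the earlier sketch), a test feature matrix `features` and a binary frame-level label vector `is_laughter`; names and the column index are illustrative.

    # Keep one feature column, zero the rest, and score the resulting
    # probabilities with AUC, one feature at a time.
    import numpy as np
    from sklearn.metrics import roc_auc_score

    def single_feature_aucs(clf, features, is_laughter, laughter_column=1):
        aucs = []
        for j in range(features.shape[1]):
            masked = np.zeros_like(features)
            masked[:, j] = features[:, j]
            scores = clf.predict_proba(masked)[:, laughter_column]
            aucs.append(roc_auc_score(is_laughter, scores))
        return aucs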
From the figures, I observe a difference in AUC patterns across the MFCC features for the two events. This shows that each frequency band carries a different amount of discriminative information for each event type. MFCC-2, 6 and 10 perform the best for laughters, and MFCC-1, 11 and 12 for fillers. In the case of the prosodic features, log energy provides the highest AUCs for both events. The prosodic features offer a higher degree of discriminability, particularly for fillers, as the best two prosodic features achieve an AUC greater than 70% by themselves. I speculate that the results from the sensitivity analysis and the feature performance analysis are related, as the DNN model should tune more to better features. In the next section, I investigate the relationship between the two.

Figure 3.12: AUC (in %) obtained based on a single feature at a time. (a): MFCCs. (b): Prosodic features.

3.4.3 Relation between feature performance and output probability dynamic range

The relation between activation patterns in neural networks and their performance has been a subject of study in various other tasks such as face and object recognition [135, 136]. I hypothesize that the DNN model is more sensitive to features which offer a higher discriminatory power. I test my hypothesis by performing regression analysis [137] between the dynamic range of the outputs obtained in Section 3.4.1 and the AUC values obtained in Section 3.4.2. The dynamic range obtained in the sensitivity analysis is used as a proxy for the degree of activation. I fit a linear model (equation (3.8)) with the dynamic range as the predictor variable for the AUC values. The linear regression analysis helps us understand the general trend between the two variables (although the relation between the two variables may not be linear). The outcome of the linear fit is shown in Figure 3.13. Each of the 17 data points in the figure corresponds to a feature, with the x-axis representing the output dynamic range obtained during the sensitivity analysis of that feature and the y-axis the AUC obtained using that feature only. Table 3.9 shows the statistics of the relation between the two variables; rho represents the correlation coefficient between the AUC and the dynamic range, computed over the seventeen features.

AUC = \beta_0 \times \text{Dynamic range} + \beta_1    (3.8)

Figure 3.13: Plot representing the AUC and dynamic range values obtained per feature. The line in red represents the best fit using linear regression. The 17 data points correspond to the individual features.

Table 3.9: Statistics of the linear model for predicting AUC with the dynamic range of the output probability as the regressor.
            Statistics w.r.t. the regressor
Event       beta_0    Standard error    F-statistic vs constant model    p-value    rho
Laughter    18.1      9.64              3.51                             0.08       0.44
Filler      194.4     81.9              5.63                             0.03       0.53

The F-statistic shows that the regressor is significant at the 5% level in predicting the AUC in the case of fillers and at the 10% level for laughters. This provides evidence of a relation between a feature's discriminative power and the output's sensitivity to that feature. It also shows that the DNN model is more sensitive to more discriminative features. In the future, I can use this observation during DNN training by discarding nodes corresponding to features with low sensitivity. This observation may be particularly useful in the case of the increased feature dimensionality introduced when including the contextual features (as in Section 3.3.1).
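For reference, the fit of equation (3.8) can be reproduced in outline as follows. The per-feature arrays below are synthetic placeholders, not the measured values behind Table 3.9.

    # Sketch of the regression analysis: AUC regressed on the per-feature
    # output dynamic range (17 values each).
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    dynamic_range = rng.uniform(0.0, 0.8, 17)               # placeholder per-feature ranges
    auc = 50 + 20 * dynamic_range + rng.normal(0, 3, 17)    # placeholder per-feature AUCs

    fit = stats.linregress(dynamic_range, auc)
    print("slope (beta_0):", fit.slope)
    print("intercept (beta_1):", fit.intercept)
    print("correlation rho:", fit.rvalue)
    print("p-value:", fit.pvalue)  # for a single regressor this matches the F-test vs. a constant model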
42 3.5 Conclusion Non-verbal vocalizations are inherent constituents of natural conversations and their interpretation can inform understanding of the communication process. I present a system for robust detection of two non-verbal events, laughter and ller, with a goal of gaining insights into the role they play during interpersonal interac- tion. This specic task was originally proposed as a part of the Interspeech 2013 challenge event where the initial ideas of the present study were presented [101]. My system sequentially employs multiple signal processing methods, primarily accounting for context during the frame-wise detection as well as utilizing some inherent properties of the signal at hand. I establish an acoustic feature based context independent baseline, followed by the introduction of contextual features. Next, I incorporate context in the output decisions themselves using a \smoothing" technique. I use a linear lter and an auto-encoder for that purpose and observe performance increments. I note that I still obtain several false alarms and partially address this problem using \masking" techniques that reduce false alarms at a very low operating threshold. I observe that the performance of my system increases with each subsequent step with the contextual \smoothing" providing the most gain. I further perform a sensitivity and performance analysis on each feature. I observe that the constituent features oer varying degree of sensitivity to pertur- bation and stand alone performances. It is interesting to observe that several of the sensitivity patterns follow an opposite trend when compared across the two non-verbal target events considered in this work. Also, the stand alone perfor- mance provides us a sense of which features carry the most amount of information during inference. I establish a relation between feature sensitivity and stand alone 43 performance. I observe a positive correlation between the two and conclude that the output of the DNN model is more sensitive to more discriminative features. My detection scheme serves as a rst step toward a robust analysis of non-verbal behaviors in vocal communication. Non-verbal vocalizations are often associated with several aspects of human behavior such as depression [138,139], emotions [140, 141] as well as a marker of the overall quality of interaction [142, 143]. Detection of non-verbal events can facilitate further investigations such as studying their patterning with respect to specic communication settings and goals. This can be particularly benecial in diagnostic settings related with depression, autism and other such disorders [139, 144, 145]. Furthermore, my current system can be further expanded to incorporate other non-verbal event types. The proposed framework is general and event-specic alterations can be introduced. Even though my system successfully infer a majority of `garbage' frames as not belonging to an event, false alarms pose challenges given a lower frequency of occurrence of the target events. Additional measures can be introduced to address this issue and one suggested approach is to create a complementary modeling of the `garbage' frame themselves, which are composed of speech and silence. A long term goal may also include characterizing the distribution of these events with adapting the output probabilities of my current system based on the distribution prior. 
Chapter 4
Variable span disfluency detection in ASR transcripts

Disfluencies such as revisions, word repetitions, phrase repetitions and fillers (e.g., ah, um, eh) are common in natural spoken language. Previous work on disfluency includes studies on structures in disfluency production [146-148], acoustic characteristics [149] and their effect on human language comprehension [150]. Disfluencies can be deemed extraneous units in an utterance. The removal of these disfluencies yields the fluent ancillary (the intended utterance), often desired by many spoken language technology applications including speech understanding and translation. There are specific types of disfluency, such as word and phrase repetitions and revisions, which are lexically well formed (and hence can be captured by an automatic speech recognition system). However, their presence can still cause problems for downstream processing applications such as speech understanding and translation. The focus of this chapter is the detection of word-level disfluencies in ASR transcripts using context based lexical and other ASR-derived features, with a view toward aiding downstream machine translation.

Johnson [151] categorizes speech disfluencies into eight categories, which include nonverbal vocalizations (e.g., fillers like um, ah), word fragments and full lexical items. Of these, I focus on detecting word repetitions, phrase repetitions, incomplete phrases and revisions in this chapter. My experiments are done using both oracle reference transcriptions and noisy ASR transcriptions. Comparison of results over these two sources helps us gauge the usefulness of the designed features, as those in one case may not be as informative as in the other. A few examples of such disfluencies are shown in Table 4.1.

Table 4.1: Examples of disfluencies
Disfluency type       Example
Word repetition       Where Where are you going?
Phrase repetition     So can you can you tell about their location?
Revision              I think there is bad ah the security is really bad
Incomplete phrase     His name where does he live?

As can be seen, the removal of these disfluencies does not alter the intended information conveyed, but instead helps clean up the surface text to be conducive to further automated processing. Note that this work on disfluency detection from text does not focus on fillers such as ah and um, since they are trivial to mark automatically once detected from the audio stream. Further, partly spoken and broken words are removed from the database for the purpose of training the ASR system in my experiments, and hence they do not occur in either the reference or the ASR transcripts by design.

Previous studies have aimed at detecting disfluencies in human conversation given true reference transcripts as well as ASR transcripts. Researchers [152-154] have focused on the removal of disfluencies for improving spoken language translation on true transcripts using word n-gram features. Prosodic features have also been used along with word level features for disfluency detection [155-157]. However, most of these schemes tag words individually using a local tagger (maximum entropy model) or a sequential classifier (linear chain conditional random field), which only takes immediate context into account. Moreover, prosodic features are obtained from oracle sources or from systems independent of the ASR transcription. Lease et al. [158] use features based on parsing trees to identify filled pauses, discourse markers and explicit editing terms, taking syntactic structure into account.
They provide results on ASR transcripts, however do not make use of prosodic information readily available from the ASR output. I train a word span based classication system that tags spans of words instead of a single word, in contrast to previous models. As seen in the examples, some dis uencies do not occur as isolated words. This model helps us to tag a word span as dis uency at the same time incorporating longer context into account which is not possible with simple sequential modeling schemes. I then derive the word level predictions based on the outputs on such spans. My baseline system to obtain the nal predictions is based on a few simple lexical features. I further augment my system with features based on typed dependencies amongst words as well as the length of silence before each word to incorporate syntactic structure and prosodic information respectively. I compare the usefulness of these features across the two transcription sources and observe dissimilar performances. My results suggest that whereas the performance in the case of ASR transcripts degrades considerably when just using the baseline features as compared to that of oracle transcripts, the additional features provide a greater discrimination. I obtain an area under ROC curve for detection of dis uencies at word level of .93 and .87 with reference and ASR transcripts, respectively. 47 4.1 Training and Evaluation Data For the experiments of this chapter, I use the English side of the DARPA TransTac two-way spoken dialog collections covering various domains, including force protec- tion (e.g., checkpoint, reconnaissance, and patrol), medical diagnosis, aid, main- tenance, infrastructure, and others [160]. I test my system on both the reference transcripts and ASR transcripts of this data. I use the BBN Byblos ASR system to transcribe the speech automatically. The system uses a multi-pass decoding strategy in which models of increasing complexity are used in successive passes to obtain gradually rened recognition hypotheses [161]. The acoustic model (AM) was trained on approximately 200 hours of manually transcribed conversational English speech, which consisted of 129K utterances segmented by sentence bound- aries. I trained the language model (LM) on 6M sentences with 60M words, drawn from both the TransTac domain and other out-of-domain sources. 4.1.1 Training Corpus I manually annotated a subset of the true reference transcriptions (11501 utter- ances) of the 200 hours of English speech with dis uency labels. Each of these utterances necessarily contains at least one dis uency. I train my dis uency detec- tion models on this subset instead of the entire training set to maintain the class balance while training. The rest of the training data contains very few dis uen- cies, and therefore I do not include it in my training set. I train my model on the features derived from reference transcript as well as the length of silence before each word as output by the ASR system. I employed 10-fold jack-kning technique to decode the training corpus to obtain the silence timings before each word. I divided the 200 hours of English speech with reference transcriptions into ten equal 48 partitions. Each partition was decoded with an LM that left out transcriptions for that partition (ten dierent LMs were trained, one for each partition). I do this so that I do not over-t the LM to obtain the silence features for the training set. The global baseline AM was used for decoding all partitions. 
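The jack-knifing split itself is straightforward; a minimal sketch is given below. The function name and the (utterance_id, transcript) pairing are illustrative, and the LM training and decoding calls are outside the scope of the sketch.

    # Sketch of the 10-fold jack-knifing split: each partition is decoded with a
    # language model trained on the transcriptions of the other nine partitions.
    import random

    def jackknife_partitions(utterances, n_folds=10, seed=0):
        utts = list(utterances)
        random.Random(seed).shuffle(utts)
        folds = [utts[i::n_folds] for i in range(n_folds)]
        for k, held_out in enumerate(folds):
            lm_training = [u for i, f in enumerate(folds) if i != k for u in f]
            yield held_out, lm_training  # decode `held_out` with an LM built on `lm_training`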
I list the specifications of the training and evaluation sets in Table 4.2. I observe a considerably higher word error rate (WER) for the training set. This corroborates my earlier note that the presence of disfluencies causes more frequent errors in ASR systems. This is intuitive given the fact that the LMs are trained mainly on fluent data, thus reducing the likelihood of an ASR hypothesis that contains disfluencies. Owing to the high WER, I train my models only on the reference transcripts and not on the ASR transcripts.

Table 4.2: Training and evaluation datasets
Corpus                        Size (words)    Size (disfluent words)    WER
Training                      209k            21k                       39.2
Dev. (ref. transcript)        17k             334                       -
Dev. (ASR transcript)         17k             314                       13.5
Testing (ref. transcript)     16k             347                       -
Testing (ASR transcript)      16k             331                       11.1

4.1.2 Evaluation Corpus

I use 1354 utterances as the development set and 1450 as the testing set. Again, I manually annotated these sets for disfluencies to obtain the ground truth labels. These sets have a more natural distribution of disfluencies because I did not restrict them to only those utterances that contained disfluencies, as I did in training. After decoding both of these sets, I obtain a better WER when compared to the training set, as only 175 of the development set utterances and 171 of the testing set utterances contain disfluencies. I evaluate my disfluency detector both on the reference transcripts as well as the transcripts obtained from the ASR system. In order to obtain the ground truth labels for the ASR transcripts, I map the labels from the reference transcripts using alignment. The labels for all the correctly aligned and replaced words in the reference transcripts are directly mapped to the corresponding words in the ASR hypotheses. All the insertions in the ASR transcripts were marked as not being a disfluency. As seen in Table 4.2, the number of disfluent words in the development set and the test set in the case of the ASR transcripts is smaller than in the reference transcripts (the difference equals the number of deleted disfluent words).

4.2 Disfluency detection

Following the SPANMAXENT-ASR approach for named entity detection in [162], I use a variable span scheme for disfluency detection. As disfluencies can range over several words, instead of tagging each word individually, it makes intuitive sense to tag an entire span of words as a disfluency. Also, as disfluencies disrupt the canonical syntactic structure of an utterance, local contextual information can help identify irregularities corresponding to a disfluency. With this model, I am able to better capture contextual information in the utterance by treating spans as atomic units for labeling disfluencies. I next describe my training methodology using features on word spans, followed by the inference methodology to obtain word level decisions.

4.2.1 Training methodology and features

In order to obtain the training instances, I use each individual span corresponding to a disfluent or non-disfluent region separately. All of my features are obtained from within the span, with each word span containing words either from the disfluent or the fluent region only.
As an example, for the utterance shown in Table 4.1, "I think there is bad ah the security is really bad", there is one region corresponding to a disfluency and two regions which are not disfluent. I find all spans of consecutive words for each region, up to a specified maximum length, and assign to each span the label of the containing region. The spans up to a length of 3 for this example are listed in Table 4.3. As mentioned in the introduction, in this work I do not focus on detecting fillers such as "ah".

Table 4.3: Example training instances of variable span lengths.
Span      Not-Dis.      Dis.                 Not-Dis.
length    (I think)     (there is bad)       (ah the security is really bad)
1         I, think      there, is, bad       the, security, is, really, bad
2         I think       there is, is bad     the security, security is, is really, ...
3         NA            there is bad         the security is, security is really, is really bad

I train the model using features for each span. Additionally, I use context from the words immediately to the left/right of the span. For example, all span-internal words are considered as word identity features for the current span, whereas the left and right words are added as lexical context. In the given example, if "there is bad" is the current span, (there, is, bad) are used as internal word identity features for the span while (think) and (the) are used as left and right context, respectively. Hence, for a span length of 3 and using one right/left word of context, we get features for the span "there is bad" along with the features for the words "think" and "the". Again, if the right or left context is a filler, I ignore it when obtaining contextual features, as fillers do not carry any lexical meaning, and instead use the next closest neighbor.

I set my baseline based on simple lexical features that are then augmented with typed dependency and timing features. The baseline features are solely designed to capture the repetitive nature of disfluencies as well as the presence of specific words (such as fillers) around disfluencies. The additional features capture both the structure of the whole utterance and the prosodic information about pauses detected by the ASR system. I trained a maxent classifier on features from each span length with the L-BFGS [163] algorithm, and the Gaussian priors were tuned empirically to 0.05 on the development set. Note that I do not use sequential models (e.g., a linear chain conditional random field), as it is not trivial to train a sequential model on multi-word spans. Also, simple sequence modeling schemes only account for immediate neighbors instead of a longer context.

Baseline features

Current word identity: I use the words in the current span as features by themselves. This in itself is like training separate LMs for the fluent and disfluent regions. In the case of a multi-word span, this feature helps the model determine whether this conjunction of words belongs to a disfluency or not.

Filler in span: This feature indicates the presence of a filler (ah, um, etc.) in the current span of words. It is motivated by the observation that word-level disfluencies are often accompanied by fillers.

Filler before/after the span: This feature indicates the presence of fillers right before or after the span of words. This is similar to the context features, but this feature just indicates their presence; I do not extract any other features for fillers in context.

Word level string edit distance between the current span and the following span: I calculate the string edit distance between the current span and the following words, considering each word to be an element of the string. I compare the current span of length L to the next L - 1, L and L + 1 words. This feature is expected to capture word repetitions, phrase repetitions and also revisions with some degree of reuse of words. I do not include fillers while calculating string edit distances.
Character level string edit distance between current span and the following span: I perform the same comparison as above, but instead consider each character to be an element of the string while calculating the string edit distance. Apart from repetitions, this feature is expected to capture revisions with false starts and rephrasing the utterance. Indicator representing meaningful use of a discourse marker: Certain phrases such as \you know", \like" help beginning/keeping a turn or serving an acknowledgment. However, I observe that they get marked as dis uency even when they are the part of intended utterance (e.g., do you know him?). I prepared a list of n-grams for the discourse markers corresponding to their meaningful usage. If I observe an n-gram around a discourse marker that is in the list, I set this feature to 1. 53 Additional features Dis uencies with any kind of lexical variability, particularly revisions, are dicult to capture using above features. Additionally, errors may be further aggravated in case of ASR errors involving word replacements. Furthermore all word and phrase repetitions are marked as a dis uency using the above scheme of features even if they are actually not (e.g., \He said that that belongs to him" does not contain a dis uency). Also lexically similar words appearing close to each other tend to be marked as dis uency. Therefore I use the features discussed below to capture the structure of the sentence and the pausing behavior before each word for a better detection of dis uencies in case of higher lexical variability and errors introduced during ASR transcriptions. Number of incoming and outgoing typed dependencies for each span: I parse each utterance using the Stanford parser [164] and obtain typed dependen- cies between all the words in an utterance. I count the number of incoming and outgoing dependencies from each word in the span. I observe that the number of dependencies for dis uencies is smaller than the non-dis uent words. This is also intuitive as dis uencies are extraneous words in the utterance and do not t the canonical syntactic ow. This feature is expected to work well even in the case of word replacements in ASR as they should have lower number of dependencies. Length of silence before each word: I also use the length of silence before each word as a feature since dis uencies are often accompanied by a pause around them. I obtain the length of silence from the ASR transcription. Since I evaluate the performance both on the reference and the ASR transcript, I map the silence phonemes as follows: True transcript: I align the reference and the ASR transcripts. For each aligned word (correct or replacement) in the reference transcript, I map the silence length 54 before the aligned ASR word as the corresponding silence length. For the remain- ing words, this feature is not available. ASR transcript: I directly use the silence length before each hypothesized word. I uniformly bin the silence lengths (bin length 0.1 seconds) to discretize this feature. I do not use other ASR based prosodic features because in case of an erroneous hypothesis, such features correspond to the erroneous word instead of the refer- ence word. The use of this silence feature is motivated towards designing features readily available from ASR transcriptions. Such ASR based features are expected to be a better match for detection as compared to features from other sources. 
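A few of the lexical features above can be sketched as follows. The tokenization, feature names and the simple Levenshtein distance are illustrative stand-ins rather than the exact feature extraction used in this chapter; parsing-based and silence features are omitted.

    # Sketch of a handful of span features for a tokenized utterance.
    FILLERS = {"ah", "um", "eh", "uh"}

    def edit_distance(a, b):
        # Standard Levenshtein distance over sequences (words or characters).
        d = list(range(len(b) + 1))
        for i, x in enumerate(a, 1):
            prev, d[0] = d[0], i
            for j, y in enumerate(b, 1):
                prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (x != y))
        return d[-1]

    def span_features(tokens, start, end):
        span = tokens[start:end]
        following = [t for t in tokens[end:] if t not in FILLERS]
        feats = {
            "words": "_".join(span),
            "filler_in_span": int(any(t in FILLERS for t in span)),
            "filler_before": int(start > 0 and tokens[start - 1] in FILLERS),
            "filler_after": int(end < len(tokens) and tokens[end] in FILLERS),
        }
        L = len(span)
        for k in (L - 1, L, L + 1):              # compare to the next L-1, L and L+1 words
            if k > 0:
                nxt = following[:k]
                feats[f"word_edit_{k - L}"] = edit_distance(span, nxt)
                feats[f"char_edit_{k - L}"] = edit_distance(" ".join(span), " ".join(nxt))
        return feats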
4.2.2 Inference methodology

During inference, I do not have the labeled regions on the development and the test sets. Therefore, I iteratively tag each span with the maxent classifier, up to a maximum span length, and then aggregate the results. In the first pass, I tag all single-word spans; in the second pass, I tag all two-word spans; and so on. I obtain a posterior probability for each span being a disfluency during each pass. In the next step I calculate the probability of a word being a disfluency from these span probabilities. In order to decide if a word corresponds to a disfluency, I perform a weighted combination of the classifier posteriors for all spans covering that candidate hypothesis word. For instance, the final score for the word "there" in the example, for a model trained with maximum span length 3, is computed as shown below:

S_dis(there) = w_1 p_dis(there) + w_2 {p_dis(think there) + p_dis(there is)} + w_3 {p_dis(I think there) + p_dis(think there is) + p_dis(there is bad)}    (4.1)

I tune the maximum span length and w_1, w_2 and w_3 on the development set. For both the reference and the ASR transcripts, I find that the optimal span length (as tuned on my development data) is 3, and w_1, w_2 and w_3 are similarly tuned to (0.6, 0.1, 0.3) for the reference transcripts and (0.6, 0.2, 0.2) for the ASR transcripts. I apply these scaling factors to the posteriors estimated on test-set spans to obtain word-level decisions. The number of context words for each span is tuned to 2. Hence I use the two closest left and right neighboring words to obtain contextual features, barring the fillers.

4.3 Results and discussion

I plot the ROC curves for word level detection of disfluencies for the baseline feature set, as well as after adding each additional feature consecutively, on the reference transcripts (Figure 4.1) and the ASR transcripts (Figure 4.2). The addition of the dependency features improves the ROC curves in both cases. I observe that the results on the reference transcripts are better than those on the ASR transcripts in all cases. Also, while the addition of the silence features does not improve the performance in the case of the reference transcripts, it improves the ROC curve for the noisy ASR transcripts. I list the area under the ROC curve in Table 4.4, along with the detection rate at a low false alarm rate of 2%, given the rare occurrence of disfluencies.

Figure 4.1: ROC curves for disfluency detection in the reference transcripts.

Table 4.4: Results obtained on the reference and the ASR transcripts (English side of the DARPA TransTac data).
                          True transcript            ASR transcript
Feature set               AUC / Detection rate       AUC / Detection rate
Chance                    .50 / 2.0%                 .50 / 2.0%
Baseline features         .86 / 43.4%                .72 / 11.0%
+ Dependency features     .93 / 62.1%                .82 / 45.2%
All features              .93 / 62.7%                .87 / 47.8%

4.3.1 Discussion

The distribution of the weights w_1, w_2, w_3 over the three span lengths suggests that tagging spans is more effective than tagging each word individually. As the number of context words is tuned to 2, this suggests the importance of context in tagging disfluencies.

From the results I observe that the baseline features perform well above the chance model. This can be attributed to the fact that a majority of word repetitions, phrase repetitions and revisions involving part-repetitions can be well captured by the baseline features. However, the baseline features do not work as well in the case of ASR transcripts. As the LM is trained on a database comprising a majority of fluent utterances, the likelihood of a disfluency showing up in the ASR transcript is low.
In case of an erroneous ASR hypothesis corresponding to a dis uency, the baseline features are likely to fail. However, I observe that the addition of dependency based features gives a greater improvement in case of ASR transcripts over the baseline features. Unlike the baseline features, these features are expected to capture dis uencies erred as a dierent word during ASR transcription. Also, these features capture the syntactic ow and are more eective in capturing revisions. When it comes to the silence feature, I observe that I hardly improve upon the baseline + dependency based features in case of reference transcripts. This indicates that in the case where oracle transcripts are available, the additional information of pausing behavior is not as useful. However, I observe a gain in case of ASR transcripts. The erroneous phrasal boundaries cause the ASR system to rely on the prosodic information to achieve better results. As I am utilizing outputs from ASR to compute silence length, I have a direct correspondence of silence lengths to the ASR hypotheses words. This helps us to train a model with a greater match towards ASR transcriptions. This variable performance of features over the two cases advocates for a more comprehensive feature design in case of ASR transcripts. 4.4 Conclusion In this chapter, I use a word span based technique to identify dis uencies in spo- ken utterances. I show that features capturing syntactic ow and pausing behavior help us improve dis uency detection over simple lexical features, notably in noisy 58 Figure 4.2: ROC curves for dis uency detection in ASR transcript transcripts obtained from an ASR. The ultimate goal of this work aims at identi- fying and removing dis uencies is in its application to spoken language technology such as spoken language translation (SLT). As a future work, I would like to study more features that can further improve my capability to identify dis uencies specically those introduced in the ASR tran- scripts. In particular, I intend to apply the dis uency detection system to an SLT system and analysis of the behavior of an ASR system around such dis uencies that can strongly boost the SLT performance. One can also look at new schemes to cluster words while predicting dis uencies [165]. Finally, one can also look at prosodic nature of dis uencies to pre-process the utterances before they are fed to an ASR system [117,166]. 59 Part II Estimation of latent states using nonverbal cues 60 Several studies have investigated the relation between latent behavioral states such as emotions, mood and aect and nonverbal cues. In this part of the thesis, I establish a link between the latent behavioral states and nonverbal cues in a few emerging domains such as Motivational Interviewing (MI) and child-psychologist interactions. These new domains involve a semi-structured conversation between two parties, and the conversation often involves nonverbal cues in the form of laugh- ters, facial expressions and gestures. I also focus on designing models to predict the latent behavioral states from nonverbal cues. The model design is performed keeping in mind that the system interpretability is important in these domains. Instead of a black box design, I focus on designing models whose parameters carry interpretability. 
The three specic case studies I focus on this thesis are titled: (i) predicting aective dimensions based on self assessed depression severity, (ii) aect prediction in music using boosted ensemble of lters and, (iii) predicting clients inclination towards target behavior change in Motivational Interviewing and investigating the role of laughter. In the rst study, I link cues from paralanguage and facial expressions with self-assessed depression severity to predict the emotional state of the person. The key contribution in this study is designing a model that operates conditionally based on user meta-data (depression severity). In the next study, I investigate the aective stimulation resulting from music. The goal of this study is to use a few features in determining the arousal and valence resulting from listening to music. I design a model that uses a handful of features (leading to higher interpretability) in a model designed using gradient boosting. Finally, in the third topic, I extend the study of nonverbal communication to a new domain of Motivational Interviewing (MI). I design models to predict target behavior change 61 in clients along with studying the importance of laughters in such a prediction. I discuss these three studies in detail in chapters 5-7. 62 Chapter 5 Predicting aective dimensions based on self assessed depression severity Depression is a clinical condition characterized by a state of low mood, despon- dency and dejection [167]. Depression can impact a patient's mood leading to symptoms such as sadness, anxiety and restlessness [168]. The National Institute of Mental Health identies various forms of depression (such as major depressive disorder, dysthemic disorder and psychotic depression) and links their impact to the patient's life including his personal relationships, professional life as well as daily habits such as eating and sleeping [169]. Depression also has a direct impact on the patients aective expression and their association has been widely stud- ied in relation to depression therapy [170], genetic analysis of depression [171] and perception of emotions [172]. In this work, I address the problem of relating aective expression to the severity of a patient's depressive disorder. Tracking aective dimensions (valence, arousal and dominance) is a classic problem in the study of emotions [173] and I incorporate the severity of depression quantized by a self-assessed metric, the Beck Depression Inventory-II (BDI-II) index [174] in pre- diction of the aective dimensions. Through my proposed aect prediction system, I aim to exploit the impact that depression has on a patient's emotional states with an overarching goal of assisting the analysis and treatment of depression disorders. 63 Several previous works have analyzed the relationship between depression and emotions including various cross-cultural studies [175], in neurology [176] and psychology [177]. Greenberg et al. [170] describe an emotion focused therapy for depression in their book and Izard [178] and Blumberg et al. [179] analyzed patterns of emotions with respect to depression. Considering the application of machine learning to the analysis of depression, researchers have investigated the relation between depression and various audio-visual cues using Canonical Cor- relation Analysis (CCA) [180], i-vectors [181] and acoustic volume analysis [182]. Tracking aect is another problem that is widely studied in emotion research. For instance, Metallinou et al. 
[183] incorporated body language and prosodic cues in tracking continuous emotion, and Nicolaou et al. [173] proposed an output-associative relevance vector machine regression for continuous emotion prediction. Ringeval et al. [184] utilized physiological data in predicting emotions, and Gunes et al. [185] presented an analysis of trends and future directions in affect analysis. The Audio Visual Emotion Challenges (AVEC) [186, 187] led to particular interest in the study of depression disorder and emotions. Several interesting approaches were presented as part of the challenge for predicting depression and tracking affective dimensions. Proposed methods for tracking emotions include ensemble CCA [188], regression-based multi-modal fusion [189] and the use of application-dependent meta knowledge [190]. Methods proposed for rating depression include the use of vocal and facial biomarkers [191], facial expression dynamics [192] and Fisher vector encoding [193].

Despite the progress in the study of relating depression and emotions, existing models do not take depression severity into account while tracking affect. The challenge lies in incorporating a single patient-specific depression assessment value into the models for tracking affect. I address this challenge in this work by performing feature transformation based on the self-assessed depression severity. I perform experiments on two datasets obtained from the Audio-Visual Depressive language (AViD) corpus: the Freeform and Northwind datasets [187]. Both of these datasets contain sessions involving human-computer interaction, where the patients either discuss a given question freely (Freeform) or read aloud an excerpt (Northwind). In order to establish the relationship between depression severity and affect in the datasets, I initially perform a preliminary correlation analysis between the patients' BDI-II index and statistical functionals computed over their affective dimensions. This is followed by the design of the affect prediction system, where I first develop a baseline system based on audio-visual features only. I then extend the model to incorporate a feature transformation based on the depression severity (as quantified by the BDI-II index) of the specific patient in the session. The motivation behind adding the feature transformation is to train a joint model, incorporating the scalar depression severity value within the audio-visual prediction system. I test several feature transformation schemes, and my best models obtain mean correlation coefficient values of .33 and .52 (baseline system performance: .24 and .48) computed over the affective dimensions (valence, arousal and dominance) in the Freeform and Northwind datasets, respectively. Finally, I discuss the feature transformations applied, interpret the results for the two datasets and propose a few future directions.

5.1 Database

I use a subset of the AViD (Audio-Visual Depressive language) corpus in this work, also used in the Audio-Visual Emotion Challenges (AVEC), 2013-14 [186, 187]. The corpus includes microphone and webcam recordings of subjects performing a human-computer interaction task. A single recording session contains only one subject. The subset of the corpus I use is divided into two parts: the Freeform and Northwind datasets. Both datasets contain the same set of subjects, with 100 session recordings.
In the Freeform dataset, the participants respond freely to a question, while the Northwind dataset is more structured in the sense that the participants read aloud a given excerpt. 3-5 naive annotators rate every session on three affective dimensions: valence, arousal and dominance, at a rate of 30 Frames Per Second (FPS). The final affect annotations are obtained as the mean over all the annotator ratings, computed per frame. The subjects participating in the sessions also complete the standardized self-assessment based Beck Depression Inventory-II (BDI-II) questionnaire [174]. The BDI-II index is a single score between 0-63 determined from a set of 21 questions, with a higher score implying more severe depression. For more details regarding the corpus, please refer to [186].

5.2 Experiments

I divide my experiments into two parts: (i) investigating the relationship between affective dimensions and depression severity using correlation analysis and, (ii) affect prediction incorporating self-assessed depression severity. I describe my experiments in detail below.

5.2.1 Investigating relationship between affective dimensions and depression severity

As discussed previously, existing literature offers an in-depth exploration of the relationship between affect and depression and suggests several links [171, 172]. In this experiment, I perform an analysis to validate the relationship between affective dimensions and depression severity on the Freeform and Northwind datasets. I compute session-level statistics (mean, variance, range and median) over the time series of affective dimensions (valence, arousal and dominance) and look for any significant correlation with the BDI-II index. Table 5.1 lists the values of the correlation coefficient between each of these statistics and the BDI-II score for the two datasets. Significance of the correlation coefficient is computed using the Student's t-distribution test at the 5% level against a null hypothesis of no correlation. Since I am performing multiple hypothesis tests, I apply the Bonferroni correction [194]. I limit myself to a few statistical functionals, as the Bonferroni correction is likely to give more false negatives with an increasing number of significance tests.

From Table 5.1, I observe that the severity of depression correlates significantly with several statistics of the affective dimensions for both datasets. In particular, the mean and median statistics correlate well with the BDI-II score for both datasets. The variance and range statistics correlate with the BDI-II score only for the Northwind dataset sessions. As the set of subjects is the same across the two datasets, this difference in correlation suggests that affective expression may be affected by the nature of the task (dataset collection). Spontaneous versus read elicitation exercises different aspects of the neurocognitive system, and hence the resulting affective vocal/visual behavior can be differently affected by depression. Nevertheless, the significance of several correlation coefficients validates the relationship between emotions and depression and motivates my next experiment in predicting affective dimensions based on the depression severity assessment.

Table 5.1: Correlation coefficient between a subject's BDI-II score and statistical functionals computed over the affective dimensions of the subject's session. Significant correlations (coefficient different from zero) are shown in bold.

              Freeform                  Northwind
           Val.   Aro.   Dom.        Val.   Aro.   Dom.
Mean      -.40   -.46   -.35        -.34   -.50   -.20
Median    -.39   -.46   -.35        -.33   -.51   -.20
Variance  -.09   -.08    .08        -.23   -.28   -.09
Range     -.03   -.08    .07        -.31   -.39   -.13
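This correlation analysis can be expressed compactly in code. The sketch below is illustrative only: it assumes hypothetical per-session affect traces and BDI-II scores as inputs, uses scipy.stats.pearsonr for the correlation coefficient and its t-distribution based p-value, and applies a plain Bonferroni threshold; it is not the exact script behind Table 5.1.

import numpy as np
from scipy.stats import pearsonr

def session_statistics(affect_trace):
    """Session-level functionals of one affective dimension (e.g., valence)."""
    return {
        "mean": np.mean(affect_trace),
        "median": np.median(affect_trace),
        "variance": np.var(affect_trace),
        "range": np.ptp(affect_trace),
    }

def correlate_with_bdi(affect_traces, bdi_scores, n_tests, alpha=0.05):
    """affect_traces: list of 1-D arrays (one per session); bdi_scores: array of BDI-II values."""
    stats = [session_statistics(trace) for trace in affect_traces]
    results = {}
    for name in ["mean", "median", "variance", "range"]:
        values = np.array([s[name] for s in stats])
        r, p = pearsonr(values, bdi_scores)
        # Bonferroni correction: compare p against alpha divided by the total number of tests.
        results[name] = (r, p, p < alpha / n_tests)
    return results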
5.2.2 Predicting affective dimensions

In this section, I propose a model to predict the frame-wise affective dimension ratings conditioned on the depression severity. Initially, I develop a multi-modal system for predicting affective dimensions and use it as a baseline. I then extend the baseline model to incorporate the BDI-II index as a parameter in affect prediction. I describe these models below.

Baseline: Multi-modal affective dimension prediction

I initially develop a system for affect prediction based on audio-visual cues. I perform a frame-wise extraction of several audio-visual features and develop a multi-layered system for affect prediction. Below, I describe the audio-visual cues used in prediction, followed by the model description.

Multi-modal cues: I use a similar set of audio-visual cues as was used in the AVEC 2014 challenge [187]. A brief description of the audio-visual features is given below.

a) Audio features: I adopt the set of audio features proposed in the AVEC 2014 challenge baseline paper [187]. The set of features includes low-level descriptors such as Mel Frequency Cepstral Coefficients, loudness, jitter and shimmer. For a complete list of features used in the audio model, please refer to Table 1 in [187]. The list consists of 32 energy and spectral features and 6 voicing-related features. I further append delta features and window-wise statistical functionals to these 38 features, as described in the AVEC 2014 baseline paper [187]. Note that the audio features are obtained at a sample rate five times the annotation frame rate (30 FPS). Thus, I downsample the audio features by sampling every fifth frame, as is suggested in [187]. I represent the audio features for the t-th frame in session s as the row vector x_a^s(t).

b) Video features: The set of video features used in the baseline system is also borrowed from the AVEC 2014 challenge [187]. The proposed set of frame-wise Local Binary Pattern (LBP) features is well known for describing facial expressions. The LBP descriptor computed for a pixel compares the pixel's intensity to its neighbors. After computing the descriptors per pixel, a histogram feature is computed with each bin corresponding to a different binary pattern. For a complete description of the LBP features, please refer to section 4.2 in [187]. I represent the video features for the t-th frame in session s as the row vector x_v^s(t).

Affect prediction system: My baseline affect prediction system uses the aforementioned audio-visual cues for frame-wise prediction of the affective dimensions. A schematic of the baseline system is shown in Figure 5.1. I describe the various components of the system below.

Figure 5.1: Baseline system with audio-video features as inputs (input audio/video features, audio/video outputs, modality fusion and low-pass filtering).

a) Input audio/video features: The bottom-most layer of the system in Figure 5.1 serves as the input for the audio/video feature values. Note that I have separate inputs for the audio and video features. In session s, to predict the affective dimensions for the t-th frame, the system uses a window of audio/video features centered at the t-th frame. That is, for the t-th frame, the audio (video) features used are the concatenated set of vectors [x_a^s(t-n), ..., x_a^s(t), ..., x_a^s(t+n)] ([x_v^s(t-n), ..., x_v^s(t), ..., x_v^s(t+n)]), where the window length is given by 2n + 1.
b) Audio/video outputs: The audio/video outputs are the output values obtained from the respective modalities. The motivation for such an untied system is to allow an independent evaluation of each modality. I represent the audio (video) output for the t-th frame in session s as y_a^s(t) (y_v^s(t)). The dimensionality of y_a^s(t) and y_v^s(t) equals the number of affective dimensions, as represented by the 3 nodes in the audio/video output layer of Figure 5.1. I chose a linear system to obtain y_a^s(t) and y_v^s(t) from the window of audio and video features, as shown in equations (5.1) and (5.2). w_a and w_v represent the weight vectors multiplied with the audio and video features, respectively, and b_a and b_v are the bias terms. The strategy for training the system and obtaining the parameters w_a, w_v, b_a and b_v is described below.

y_a^s(t) = w_a [x_a^s(t-n), ..., x_a^s(t), ..., x_a^s(t+n)]^T + b_a    (5.1)
y_v^s(t) = w_v [x_v^s(t-n), ..., x_v^s(t), ..., x_v^s(t+n)]^T + b_v    (5.2)

c) Modality fusion: Modality fusion performs a weighted combination of the outputs y_a^s(t) and y_v^s(t) to provide the fused output y_f^s(t) for the t-th frame in session s. The fusion is again chosen to be linear, and the output y_f^s(t) is obtained as shown in equation (5.3). w_f and b_f represent the weight and bias vectors used for fusion, respectively.

y_f^s(t) = w_f [y_a^s(t), y_v^s(t)]^T + b_f    (5.3)

One could choose one of several strategies for training the model shown in Figure 5.1. For instance, the bottom three layers in Figure 5.1 represent a neural network and can be trained using the standard back-propagation algorithm [195]. However, I chose to train each layer independently using data bootstrapping [196]. That is, the audio and video system parameters (w_a, b_a and w_v, b_v) are optimized independently on randomly sampled portions of the training set to predict the affective dimensions, using the Minimum Mean Squared Error (MMSE) criterion [197]. The fusion parameters (w_f, b_f) are then obtained to predict the affective dimensions on another independently sampled subset of the training data by fusing the audio and video outputs (y_a and y_v), again using the MMSE criterion. I randomly sample 80% of the training data for each optimization. I chose this training strategy for the following reasons: (i) it allows for independent evaluation of the audio and video systems, as well as their fusion, and (ii) my preliminary experiments suggested that while data bootstrapping and back-propagation give comparable results, the former is faster to perform. Next, I describe the final low-pass filtering step used to obtain the final predictions.

d) Final prediction after filtering: In the final step of the baseline system, I low-pass filter the time series of predicted affective dimensions. Note that this is a post-processing step performed after predictions for each analysis frame have been obtained. This step is motivated by the fact that affective dimensions evolve smoothly over time without abrupt changes, as is also observed in several other works [189, 198]. In my experiments, I use a moving average filter (length: k) as the low-pass filter. A minimal code sketch of this pipeline is given below.
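The sketch assumes frame-aligned audio/video feature matrices Xa, Xv and a (frames x 3) affect matrix Y; the window length n and moving-average length k follow the text, the 80% bootstrapped subsets are drawn with a simple random choice, and all function names are illustrative rather than the code used in the experiments.

import numpy as np

def stack_window(features, n):
    """Concatenate features over a window of 2n+1 frames centered at each frame (edges padded)."""
    T = features.shape[0]
    padded = np.pad(features, ((n, n), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + T] for i in range(2 * n + 1)])

def fit_linear(X, Y):
    """MMSE (least-squares) fit of Y ~ X plus bias; returns weights and bias."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    W, *_ = np.linalg.lstsq(Xb, Y, rcond=None)
    return W[:-1], W[-1]

def predict_linear(X, W, b):
    return X @ W + b

def moving_average(pred, k):
    """Low-pass filter each affective dimension with a length-k moving average."""
    kernel = np.ones(k) / k
    return np.column_stack([np.convolve(pred[:, d], kernel, mode="same")
                            for d in range(pred.shape[1])])

def train_baseline(Xa, Xv, Y, n=2, k=15, subsample=0.8, seed=0):
    rng = np.random.default_rng(seed)
    Xa_w, Xv_w = stack_window(Xa, n), stack_window(Xv, n)
    idx = rng.choice(len(Y), int(subsample * len(Y)), replace=False)
    Wa, ba = fit_linear(Xa_w[idx], Y[idx])                     # audio layer
    idx = rng.choice(len(Y), int(subsample * len(Y)), replace=False)
    Wv, bv = fit_linear(Xv_w[idx], Y[idx])                     # video layer
    Ya, Yv = predict_linear(Xa_w, Wa, ba), predict_linear(Xv_w, Wv, bv)
    idx = rng.choice(len(Y), int(subsample * len(Y)), replace=False)
    Wf, bf = fit_linear(np.hstack([Ya, Yv])[idx], Y[idx])      # fusion layer
    fused = predict_linear(np.hstack([Ya, Yv]), Wf, bf)
    return moving_average(fused, k)                            # final low-pass filtered prediction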
In the next section, I extend the current system to use the depression severity in predicting the affective dimensions.

Affective dimension prediction incorporating depression severity

In this section, I propose an extension to the baseline model by incorporating the BDI-II depression index within the affect prediction system. The motivation for the system design is to perform joint learning on the self-assessed depression severity and the audio-visual cues to predict the affective dimensions. Since the BDI-II index is a single value associated with every subject, the challenge lies in using the index in frame-wise affective dimension prediction. I propose the inclusion of the subject-specific BDI-II score as a model parameter in predicting the affective dimensions for that subject. Using the same set of audio-visual cues as in the baseline system, I incorporate the BDI-II score in the model as described below.

Affect prediction system: The proposed system transforms the audio-visual features based on the subject's BDI-II score. Figure 5.2 shows a schematic of the proposed model, and below I describe each component of the model.

Figure 5.2: Proposed system with a feature transformation layer appended to the baseline system.

a) Input audio/video features: The feature input scheme is the same as in the baseline system. The system takes in a window of audio/video features and transforms them based on the BDI-II score as discussed next.

b) Feature transformation based on depression score: In this layer, I transform the features for a session based on the corresponding subject's BDI-II score. Although there are several feature transformations that one could apply, I test three transformations in this work, which modify the feature means and/or variances in a session based on the corresponding subject's depression severity. I discuss these transformations below (a short code sketch follows the evaluation description).

1. Feature shifting: In this transformation, I shift the feature values for a session by adding the BDI-II score of the corresponding subject. This transformation alters the feature means for a session based on the subject's depression severity. The shifting transformation T_1 for the audio features of session s, with the corresponding subject's BDI-II index d_s, can be represented as shown in equation (5.4) (the same transformation holds for the video features). 1 represents a vector of ones of the same dimensionality as the input feature.
T_1[x_a^s(t-n), ..., x_a^s(t), ..., x_a^s(t+n)] = [x_a^s(t-n), ..., x_a^s(t), ..., x_a^s(t+n)] + d_s 1    (5.4)

2. Feature scaling: In this transformation, I scale the frame-wise feature values for each session by the corresponding subject's BDI-II score. This transformation alters the feature variances for a session based on the subject's depression severity. The scaling transformation T_2 is represented in equation (5.5). ⊙ represents element-wise multiplication and d_s is the BDI-II index of the subject in session s.

T_2[x_a^s(t-n), ..., x_a^s(t), ..., x_a^s(t+n)] = [x_a^s(t-n), ..., x_a^s(t), ..., x_a^s(t+n)] ⊙ d_s 1    (5.5)

3. Feature scaling and shifting: This transformation both scales and shifts the feature values, thereby affecting both the means and the variances of the features. This transformation T_3 is shown below.

T_3[x_a^s(t-n), ..., x_a^s(t), ..., x_a^s(t+n)] = [x_a^s(t-n), ..., x_a^s(t), ..., x_a^s(t+n)] ⊙ d_s 1 + d_s 1    (5.6)

c) Audio/video outputs: Following the feature transformation, I obtain the audio/video outputs using the same linear models as in the baseline model (equations (5.1), (5.2)). However, instead of the explicit audio/video features, I use one of the feature transformations.

d) Modality fusion: The modality fusion strategy is again the same as in the baseline system. I perform training using data bootstrapping as discussed in section 3.2.1(c).

e) Final prediction after filtering: The low-pass filtering step is also the same as in the baseline system, to avoid abrupt changes in tracking the affective dimensions.

In the next section, I discuss the evaluation scheme and present my results.

Evaluation

I perform independent evaluations on the Freeform and Northwind datasets. For the 100 sessions in each dataset, I use a 10-fold cross-validation, with 8 partitions as the training set, and 1 each as the development and testing sets. For all my experiments, the features in the training set are normalized to zero mean and unit variance. During testing, I normalize the testing set features using the feature means and variances computed on the training set. The BDI-II scores are also normalized to a range of 0-1, and during feature transformation they scale and shift the feature values accordingly. I use the mean correlation coefficient over the three affective dimensions, computed over all sessions, as the evaluation metric, as was also used in the AVEC 2014 challenge [187]. I tune the feature window length n and the length k of the moving average filter on the development set. Table 5.2 shows the results for each dataset using the baseline and the various feature transformations in the proposed system.

Table 5.2: Mean of the correlation coefficients (and per affective dimension: valence: val., arousal: aro., dominance: dom.) between the ground truth and the system prediction. The best performing system for each dataset is shown in bold. The best systems are significantly better than the baseline at the 5% level using the Student's t-statistic test (number of samples = number of frames).

Freeform dataset
Modality   Baseline             Proposed: T_1        Proposed: T_2         Proposed: T_3
Audio      .12 (.10/.25/.01)    .25 (.28/.35/.11)    .05 (.04/.20/-.09)    .19 (.23/.30/.04)
Video      .21 (.21/.19/.22)    .24 (.26/.23/.25)    .13 (.16/.20/.05)     .21 (.21/.27/.15)
Fused      .19 (.18/.22/.18)    .28 (.27/.33/.25)    .12 (.11/.24/.02)     .21 (.22/.30/.11)
Filtered   .24 (.22/.28/.23)    .33 (.31/.37/.31)    .16 (.15/.31/.03)     .27 (.25/.35/.24)

Northwind dataset
Modality   Baseline             Proposed: T_1        Proposed: T_2         Proposed: T_3
Audio      .19 (.12/.26/.18)    .36 (.32/.43/.33)    .19 (.08/.32/.17)     .37 (.35/.45/.30)
Video      .36 (.37/.37/.33)    .43 (.41/.46/.43)    .38 (.35/.51/.29)     .46 (.38/.55/.45)
Fused      .38 (.38/.41/.36)    .45 (.42/.49/.45)    .39 (.36/.52/.30)     .48 (.41/.57/.46)
Filtered   .48 (.47/.50/.47)    .50 (.46/.53/.52)    .45 (.41/.59/.36)     .52 (.45/.61/.51)
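To make the transformations concrete, the sketch below applies T_1 (shifting), T_2 (scaling) and T_3 (scaling and shifting) of equations (5.4)-(5.6) to a windowed feature block, given a subject's BDI-II score normalized to [0, 1]; the function name and the assumption that features arrive as a NumPy array are illustrative.

import numpy as np

def transform_features(window_feats, d_s, mode="T1"):
    """window_feats: NumPy array holding the windowed audio or video features;
    d_s: the subject's BDI-II score normalized to [0, 1]."""
    window_feats = np.asarray(window_feats, dtype=float)
    if mode == "T1":      # feature shifting (alters means), eq. (5.4)
        return window_feats + d_s
    if mode == "T2":      # feature scaling (alters variances), eq. (5.5)
        return window_feats * d_s
    if mode == "T3":      # scaling and shifting, eq. (5.6)
        return window_feats * d_s + d_s
    raise ValueError("mode must be one of T1, T2, T3")

The transformed block simply replaces the raw window at the input of the audio/video layers in Figure 5.2; training then proceeds exactly as in the baseline.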
Discussion

In my first experiment in section 5.2.1, I observed that depression severity is correlated with affect and therefore provides a complementary source of information for affect prediction. The results in Table 5.2 vary across the two datasets, with better prediction on the Northwind dataset. This may be due to the difference in the structural formats of the two datasets, with each dataset calling for a different cognitive planning mechanism [199, 200]. The Freeform dataset involves the exercise of lexical planning to form an answer, whereas the Northwind dataset involves sensory input and sentence reproduction. I also notice that the best feature transformation scheme varies between the two datasets. Scaling the features alone (transformation T_2) does not perform well, implying that changing feature variance based on the depression severity does not help, particularly in the case of the Freeform dataset. This can be attributed to the lack of correlation between depression severity and the variances of the affective dimensions, as seen in Table 5.1. Since changing the feature variances has a direct impact on the output affective dimension prediction due to the affine projections during prediction (Figure 5.2), scaling feature values based on depression acts as a noisy operation. However, I observe that changing the feature means via the shifting transformation (T_1) helps in both cases. For the Northwind dataset, the best results are obtained after applying the shifting and scaling transformation (T_3). Apart from these observations, I also notice that modality fusion performs better, suggesting that the audio and video modalities carry complementary information. Also, the low-pass filtering improves performance by removing high-frequency components from the affective dimension prediction.

5.3 Conclusion

Researchers have investigated the impact of depressive disorders on emotion and discovered several patterns [178, 179]. In this work, I develop an affect prediction model with the subject's depression severity incorporated as a model parameter. I use two datasets for this purpose and initially test the relationship between depression severity and emotions using correlation analysis. I then develop an audio-visual feature based baseline model to predict affect. I modify the model to use the BDI-II depression index to perform session-wise feature transformations which shift and/or scale the feature values for a session based on the subject's depression severity. I test my model on the two datasets and observe that the best performing transformation varies between them.

In the future, I will investigate other models for incorporating depression severity in predicting affective dimensions. As of now, the depression severity only affects the first layer of the prediction system. The BDI-II index could also be incorporated as a parameter in other layers of the model, and the optimization problem could be framed accordingly. I also aim to apply the model to other problem domains involving time-series prediction with an accompanying static label, e.g., tracking engagement based on autism severity [201]. Finally, the model could also be extended to datasets with ratings available at multiple temporal granularities.

Chapter 6
Affect prediction in music using boosted ensemble of filters

In recent years, a considerable amount of research has gone into improving automatic understanding and indexing of music signals. This effort has been driven partly by the data deluge in digital music and partly by the large number of new applications in multimedia such as information retrieval, automatic transcription and music fingerprinting. A majority of these applications require classifying songs into meaningful categories as a first step. Typically, grouping is done on the basis of genre or melody. However, in the context of a music recommendation system, it is desirable for these categories to be aligned with the listener's music preference or mood. Thus, studying the affective component of a music signal is as important as studying its structural aspects.

Music has been shown to possess the ability to influence the emotional state of its listeners [202-204]. For example, consider the elaborate use of background soundtracks in movies to support the narrative being carried through speech and video. In movies, music plays a complementary role to the cinematography and dialog delivery [205].
In fact, previous studies have found that music in movies often plays a more important role in conveying emotion compared to other modalities [206]. Owing to its positive emotional influence, music has also been used for therapeutic purposes [207, 208].

The ability of music to affect a human's emotional state has led to considerable interest in the study of affective features and prediction models for music emotion recognition. Previous studies have focused on predicting both static and dynamic emotion labels from music [209-211]. These emotion labels are typically measured along the affective dimensions of arousal and valence, and collected from multiple human annotators [212, 213]. Predicting dynamic emotion ratings in music is considerably more difficult (as opposed to predicting a static overall rating) as it involves accounting for the temporal evolution of emotion with the music signal. The Emotion in Music Challenge at the 2014 Mediaeval Workshop [214] led to several investigations towards capturing the emotional content in music using low-level frame-wise features. Some of the successful schemes involved using a recurrent neural network [215], multi-level regression systems [216] as well as state-space models [217]. These methods perform well in predicting the emotional dimensions from low-level features. However, they fail to explain the temporal evolution of emotion and its relation to the features. Moreover, all the above systems use a large number of features to predict the affective ratings, rendering feature analysis difficult. To overcome these shortcomings, I propose a gradient boosting based [218] Boosted Ensemble of Single-feature Filters (BESiF) method. Given a set of frame-wise features, the BESiF model sequentially learns filters on a selected set of features. The model then performs a weighted combination of the filtered feature values to provide the prediction for the affect ratings. I obtain a Signal to Noise Ratio (SNR) improvement by a factor of 1.92 and 1.06 for arousal and valence prediction, respectively, when compared to the next best baseline algorithm. The BESiF model uses a small set of 14/16 features for arousal/valence prediction, compared to the 6000+ features available to the baseline algorithms. I analyze the output from one of the filters for arousal prediction and interpret the transformation of feature values that contributes towards the final prediction.

In the next section I describe the database used in the experiments, followed by the details of affect prediction in Section 3. Section 4 presents the results, and conclusions are presented in Section 5.

6.1 Database

I use the music dataset provided in the Emotion in Music Challenge at the 2014 Mediaeval Workshop [214] to evaluate the BESiF algorithm. The dataset consists of 1744 songs from different musical genres. 45-second clips were selected from each song in the dataset and assigned emotion labels by at least 10 annotators (at a rate of 2 frames/second). More details about the dataset and annotation process can be found in [219].

To further motivate my study of emotion in music, I analyze the relation between genre and the emotional variability present in this database. I plot the average emotion ratings per genre in Figure 6.1. I observe that the annotated emotion ratings follow intuitive trends along high-level music categories such as genres. As an example, notice how rock music displays high arousal, while country music is high in valence. Classical music, on the other hand, has both low arousal and low valence.
This suggests that a relation exists between the style of music and its emotional content. This further encourages automatic prediction of affect in music, with the potential to impact music design, recommendation and the understanding of music perception.

Figure 6.1: Average arousal-valence ratings in each music genre (Blues, Classical, Country, Electronic, Folk, Jazz, Pop, Rock).

For the purpose of prediction, I use the mean dynamic annotations for arousal-valence as the ground truth, in accordance with the challenge task [214]. Moreover, the first 15 seconds of annotations were excluded from consideration, to allow the dynamic annotations to stabilize. I use the set of 6000+ Opensmile [116] features supplied during the challenge [219]. These features are functionals of various spectral and frequency properties of the music signal (Mel Frequency Bank, fundamental frequency, etc.), extracted at the rate of 2 frames/second (the same as the annotation frame rate). Out of the 1744 songs in the dataset, I use a split of 744, 300 and 700 songs as the train, development and test sets, respectively. The 744 files for training are as provided during the challenge. The development and testing sets are randomly selected from a separate set of 1000 files. In the next section, I describe my training methodology for predicting the affective ratings using the provided feature set.

6.2 Affect prediction

Through my experiments, I aim not only to maximize prediction quality, but also to understand the relation between the low-level signal features and the affective dimensions. In this work, I focus on minimizing the mean squared error between the predicted and true affect ratings.

I denote the true affect ratings (arousal/valence) for a file f with N frames as the row vector t(f) = [t_1^f, ..., t_n^f, ..., t_N^f] and the corresponding time series of feature vectors as X(f) = [x_1^f, ..., x_n^f, ..., x_N^f], where x_n^f is a D-dimensional feature vector. The d-th row of X(f) represents the values over time of the d-th feature, and I represent it as x_d(f) = [x_{d,1}^f, ..., x_{d,N}^f]. A function M(x_n^f) maps the feature vector x_n^f to the one-dimensional affect rating. I represent the time series vector [M(x_1^f), ..., M(x_N^f)] obtained after mapping as M_X(f). The mean squared error L_f between the mapped and true ratings for file f is shown in equation (6.1) (||.||_2 represents the L2 norm).

L_f = || t(f) - M_X(f) ||_2^2 = \sum_{n=1}^{N} ( t_n^f - M(x_n^f) )^2    (6.1)

Given the set of files in the training set, I learn the function M by optimizing the sum of squared error losses, L, as defined below.

L = \sum_{f ∈ Training set} (1/2) L_f = \sum_{f ∈ Training set} (1/2) || t(f) - M_X(f) ||_2^2    (6.2)

One can assume any functional form for M before optimizing the cost function L. For the problem of interest, several schemes were proposed during the Mediaeval challenge [214]. However, these schemes are either too simple to capture the complex relationship between the acoustic features and the abstract affective space (e.g., linear regression) or are difficult to interpret (e.g., recurrent neural networks). In this work, I present a new gradient boosting [218] based Boosted Ensemble of Single-feature Filters, which overcomes the shortcomings of both these categories of models. The BESiF model is an ensemble of filters trained sequentially on one feature at a time, and the final prediction is given as the weighted sum of the filter outputs.
I provide the BESiF model training algorithm below, along with a brief description of the gradient boosting method.

Boosted Ensemble of Single-feature Filters (BESiF)

The BESiF model consists of an ensemble of filters operating over the feature time series, learnt using gradient boosting. Gradient boosting is a general technique for learning an ensemble of weak learners, applicable to arbitrary differentiable loss functions (e.g., the mean squared error loss). I represent an ensemble of K weak learners {h̃_1, h̃_2, ..., h̃_K} as M_K, where the prediction M_K(X(f)) is given as shown below.

M_K(X(f)) = \sum_{k=0}^{K} h̃_k(f)    (6.3)

The base learners {h̃_1, h̃_2, ..., h̃_K} are learnt sequentially. The first base learner (h̃_0) is initialized to be a constant model obtained by solving the optimization problem shown in (6.4). 1 represents a vector of ones of the same size as the target affect variable t(f).

h̃_0 = γ_0 = arg min_{γ_0} \sum_{f ∈ training set} || t(f) - γ_0 1 ||_2^2    (6.4)

Subsequently, new regressors h̃_k are added by solving the following optimization.

h̃_k = arg min_{h_k} \sum_{f ∈ training set} || t(f) - M_k(X(f)) ||_2^2
    = arg min_{h_k} \sum_{f ∈ training set} || t(f) - ( M_{k-1}(X(f)) + h_k(f) ) ||_2^2    (6.5)

However, the optimization problem in equation (6.5) is not easy to solve, and in practice the optimization is performed iteratively using the steepest descent method [220]. This is equivalent to fitting base learners to a set of pseudo-residuals (defined for the current problem in equation (6.6)) and learning the weights of the base learners using a one-dimensional optimization. For more details on gradient boosting please refer to [218]. In my case, I chose the set of base learners to be Finite Impulse Response (FIR) filters operating on a single feature. I choose this feature probabilistically, with the probability of selection proportional to its absolute correlation with the pseudo-residuals. I summarize the training algorithm for the BESiF model below.

Training algorithm for BESiF models:
Initialize M_0 with a constant model h̃_0 (equation (6.4)).
For k = 1 to K
- Computing the pseudo-residuals r_k(f) = [r_{k,1}^f, ..., r_{k,N}^f] for each file in the training set:

r_k(f) = - ∂[ (1/2) || t(f) - M_X(f) ||_2^2 ] / ∂ M_X(f), evaluated at M(X(f)) = M_{k-1}(X(f))
       = t(f) - M_{k-1}(X(f))    (6.6)

- Randomly selecting a feature: In the next step, I randomly select one of the D features. The probability of selecting a feature is proportional to the absolute correlation of the feature with r_k(f). Let d be the index of the selected feature, with values for file f represented as x_d(f) = [x_{d,1}^f, ..., x_{d,N}^f].
- Learning a filter to predict the pseudo-residuals using the selected feature: Given the filter length L, I learn the filter coefficients w_k = {w_k^1, ..., w_k^L} to predict the residuals. These filter coefficients are convolved with the selected feature, with the residuals as the target outputs. The coefficients w_k are obtained by solving the optimization problem in equation (6.8). w_k ∗ x_d(f) represents the convolution of the selected feature with the filter coefficients and is denoted by h_k(f).

h_k(f) = w_k ∗ x_d(f)    (6.7)

w_k = arg min_{w} \sum_{f ∈ training set} || r_k(f) - w ∗ x_d(f) ||_2^2    (6.8)

- Computing the weights of the base learners: After obtaining the filter coefficients, I compute the scalar α_k to weigh the filter outputs h_k(f). I solve the following one-dimensional problem to obtain α_k using the backtracking algorithm [221].
α_k = arg min_{α} \sum_{f ∈ training set} || t(f) - ( M_{k-1}(X(f)) + α h_k(f) ) ||_2^2    (6.9)

- Updating the model: After obtaining w_k and α_k, the predicted outputs for a file f are obtained as

M_k(X(f)) = M_{k-1}(X(f)) + h̃_k(f) = M_{k-1}(X(f)) + α_k h_k(f)    (6.10)

End For
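The training loop above can be summarized in code. The sketch below is an illustrative rendering only: it assumes each song is stored as a (D x N_f) feature matrix with an aligned target series, fits the FIR filter by least squares on a lagged design matrix, draws the feature index with probability proportional to its absolute correlation with the pseudo-residuals, and finds the weight alpha_k with scipy.optimize.minimize_scalar instead of the backtracking search of [221]; it is not the exact implementation used in the experiments.

import numpy as np
from scipy.optimize import minimize_scalar

def lagged_design(x, L):
    """Build an (N, L) matrix of lagged copies of feature series x (causal FIR filtering)."""
    N = len(x)
    X = np.zeros((N, L))
    for l in range(L):
        X[l:, l] = x[:N - l]
    return X

def fit_besif(files, targets, K=15, L=8, seed=0):
    """files: list of (D, N_f) feature matrices; targets: list of length-N_f affect series."""
    rng = np.random.default_rng(seed)
    h0 = np.mean(np.concatenate(targets))                     # constant base learner, eq. (6.4)
    preds = [np.full(len(t), h0) for t in targets]
    ensemble = [("const", None, h0)]
    D = files[0].shape[0]
    for k in range(K):
        residuals = [t - p for t, p in zip(targets, preds)]   # pseudo-residuals, eq. (6.6)
        r_all = np.concatenate(residuals)
        corr = np.zeros(D)
        for d in range(D):
            x_all = np.concatenate([f[d] for f in files])
            corr[d] = abs(np.corrcoef(x_all, r_all)[0, 1])
        corr = np.nan_to_num(corr)
        probs = corr / corr.sum() if corr.sum() > 0 else np.full(D, 1.0 / D)
        d = rng.choice(D, p=probs)                            # probabilistic feature selection
        X = np.vstack([lagged_design(f[d], L) for f in files])
        w, *_ = np.linalg.lstsq(X, r_all, rcond=None)         # filter coefficients, eq. (6.8)
        h = [lagged_design(f[d], L) @ w for f in files]       # filter outputs h_k(f), eq. (6.7)
        loss = lambda a: sum(np.sum((t - (p + a * hk)) ** 2)
                             for t, p, hk in zip(targets, preds, h))
        alpha = minimize_scalar(loss).x                       # learner weight alpha_k, eq. (6.9)
        preds = [p + alpha * hk for p, hk in zip(preds, h)]   # model update, eq. (6.10)
        ensemble.append((d, w, alpha))
    return ensemble, preds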
6.3 Experiments and discussion

I use the proposed BESiF model to predict the arousal and valence ratings from the low-level signal features. As the function M_X(f) can assume several functional forms, I chose three other methods as baseline models for comparison. The first baseline method performs linear regression followed by a smoothing operation, as was proposed in my previous work [219]. The other two baselines involve techniques such as sequential selection of features and boosting, like the BESiF model. I describe these methods in detail below.

1. Linear regressor + smoothing: In this method, I use linear regression on the entire feature set followed by a smoothing operation to predict the affective ratings. My analysis in past work [219] showed that the affective signals evolve rather smoothly. The linear regressor computes the affective rating using the provided features, and smoothing is used to incorporate local temporal context. The smoothing operation is fundamentally a moving average operation in which the output at a frame is recomputed as an average of predictions over a local window. This operation also helps to remove any high-frequency noise added during regression.

2. Greedy linear regressor + smoothing: This method is the same as the previous baseline method, except for a greedy selection of a few features for regression. Similar to the BESiF training algorithm, features are added sequentially based on their correlation with the residual at each iteration. However, note that after the addition of every new feature, the algorithm re-optimizes the regression coefficients for each selected feature. This may lead to the problems associated with the curse of dimensionality and a high computation cost. The total number of features added is tuned on the development set. The BESiF model does not suffer from this problem, as filter coefficients are determined only for a single feature at a time. The final outputs after regression are again smoothed using a moving average filter.

3. Least squares boost + smoothing: Least squares boost [222] is another class of boosting algorithm used to optimize squared error loss functions. However, this algorithm uses all the features at every iteration to predict the residuals. I use a regression tree [223] as my base learner in this case. Note that unlike BESiF, the regression trees cannot account for the temporal relationship between the residuals and the feature time series. I again smooth the outputs using a moving average filter to account for smooth temporal evolution.

I present the Signal to Noise Ratio (SNR) for affective rating prediction using the baseline methods and the BESiF model in Table 6.1. SNR is computed as the ratio of the energy in the true arousal/valence signal (σ²_signal) to the energy in the prediction error (σ²_error). The length L of the FIR filters for the BESiF model and the length of the moving average filters for the baseline methods are tuned on the development set. The number of base classifiers for the BESiF and least squares boost models is also tuned on the development set.

Table 6.1: SNR values for affect rating prediction using the baseline and the proposed BESiF models.

Model                                    SNR (σ²_signal / σ²_error)
                                         Arousal    Valence
Linear Regression + smoothing            1.39       1.37
Greedy Linear Regression + smoothing     1.27       1.21
Least squares boost + smoothing          1.47       1.44
BESiF                                    2.83       1.53
Energy in signal (σ²_signal)             0.11       0.06

6.3.1 Discussion

From the results in Table 6.1, I observe that a substantial gain in arousal prediction using the BESiF model is achieved over all the baselines. In my previous work [219], I showed that the smoothing operation added context from neighboring windows, thus improving the prediction. However, the regression design was decoupled from the smoothing, and the choice of filter during smoothing was ad hoc. In the BESiF model, I overcome this drawback by incorporating the filter design within the regression framework. My base learners, i.e., the single-feature filters, not only learn the mapping from the features to the affective dimension, but also take the temporal context into account during prediction. I observe that the performance is particularly poor for greedy linear regression. I performed further investigation into this system and observed that even a backward feature selection (sequential removal of features starting from all features [224]) leads to degradation in performance. This suggests that the removal of any feature leads to a degradation in the performance of linear regression. The least squares boost algorithm is closest to BESiF in terms of performance. In general, boosting algorithms lead to strong regressors, hence the better performance. However, the decoupling of regressor design and smoothing again leads to poorer performance for least squares boost when compared to the BESiF model.

I plot the Root Mean Square Error (RMSE, σ_error) of the BESiF model on the test set against the number of base filters in Figure 6.2. I observe that the performance of the BESiF model saturates within approximately 15 single-feature filters for both valence and arousal. This observation is particularly interesting from the point of view of understanding the relation between the low-level features and the affective dimensions. I observed that 15 (out of 16) and 13 (out of 14) of the features selected for arousal and valence, respectively, are spectral features (Mel Frequency Cepstral Coefficients, Mel Filter Bank energies [116]), with a fundamental frequency (F0) statistic feature appearing once in both cases. This reflects that most of the emotional information in music is associated with the evolution of the spectral characteristics of the music. Since the final prediction is a weighted sum of the filtered feature values, one can also analyze a feature of interest and its contribution to the final prediction. For instance, I plot the target arousal rating, a selected feature (mean of absolute values of an MFB coefficient), and the filtered feature values in Figure 6.3. I observe that the filtered feature values follow the same trend as the target values (despite the scales being different).

Figure 6.2: RMSE (σ_error) of the BESiF model on the test set against the number of base classifiers. The red point denotes the count of base filters chosen based on the development set.

Figure 6.3: (a) Target arousal values, (b) filtered feature values and (c) raw feature values for a selected file in the testing set.
Moreover, the filtered values are smoother than the feature itself, indicating removal of high-frequency noise by the filtering operation. This is consistent with the premise of smooth evolution of affective ratings.

6.4 Conclusion

Music signals have been shown to carry emotional information. In this work, I present a novel BESiF scheme to predict the affective dimensions of arousal and valence from low-level audio signal features in music. This scheme is designed not only to better predict the affective ratings, but also to add insight and interpretability to the prediction process. I show that the BESiF system beats several comparable baseline methods using only a handful of features. I interpret patterns observed in a feature time series after filtering and compare them to the target values.

In the future, I aim to enhance the system by jointly predicting the affective dimensions, to allow an understanding of the joint dynamics between valence and arousal. With the availability of more data, one could also analyze the relation of system parameters (e.g., filters, base learner weights) with music categories such as genres. Further improvement in prediction is also possible by using better feature selection techniques (e.g., using dynamic programming [225]) and filter design methods.

Chapter 7
Predicting client's inclination towards target behavior change in Motivational Interviewing and investigating the role of laughter

Rollnick et al. [48] define Motivational Interviewing (MI) as "a directive, client-centered counseling style for eliciting behavior change by helping clients to explore and resolve ambivalence". The MI setting is extensively used in addiction-related problems (alcohol, drugs, etc.) [226, 227]. MI helps addiction patients perceive both the benefit (e.g., the high) and the harm (e.g., health, personal problems) and helps them resolve the ambivalence in a dyadic spoken interaction with a therapist towards a positive change. With the emerging evidence base and popularity of MI [228], a significant challenge lies in ensuring a high quality of treatment, which calls for a standard metric to assess the quality of such interactions. The Motivational Interviewing Skill Code (MISC) [229] has emerged as a standard observational method for evaluating the quality of MI interactions. MISC is a behavioral coding system in which human coders are trained to annotate video or audio tapes over several parameters such as global counselor ratings, empathy and behavior categories. The uses of MISC range from providing detailed session feedback to counselors in the process of learning MI to predicting treatment outcome from psychotherapy measures.

MISC provides a systematized tool to assess the quality of MI interactions. However, such manual coding systems are not scalable to real-world use due to time, labor and economic constraints [230]. As part of my ongoing efforts towards Behavioral Signal Processing [231, 232], in this chapter I present a model towards the automation of MISC annotation. Specifically, I focus on the "Target Behavior Change" (Target) aspect of the client behavior, which specifies a target client behavior (smoking, drinking, medication) and a direction of change (stopping, increasing, etc.). I conduct three sets of experiments to (a) predict a client's attitude towards a Target, (b) investigate the effect of laughter during the interaction on Target and (c) look for prosodic differences in laughters with respect to different behaviors.
(a) Predicting the client's attitude towards Target: A turn in client speech is annotated with a positive/negative valence if it shows an inclination towards/away from the Target. Such client turns are termed "Client Change Talk" (ChangeTalk) utterances. I design a lexical model for predicting positive-valence vs. negative-valence ChangeTalk vs. no ChangeTalk given a client utterance. Additionally, I study the effect of including counselor behavior in my prediction model. Each counselor utterance is MISC-annotated with a behavior code (reflect, support, etc.). As the counselors are required to carefully attend to client language related to the target behavior, their behavior may carry indicators of the client's attitude towards the Target. My best prediction model achieves an unweighted accuracy (UA) of 50.8% (chance: 33.3%) for the three-way classification.

(b) Investigating the role of laughter during the interaction: Laughters are linked to human behavior and are hypothesized to carry out several social functions [233, 234]. In this experiment, I examine whether the mere occurrence of a laughter during the interaction carries information with regards to the ChangeTalk valence. I observe that including information regarding their presence improves the UA of my previous model to 51.4%.

(c) Prosodic differences amongst laughters: Several studies [235-237] suggest differences amongst laughters contingent upon the context in which they occur. I look for prosodic differences amongst client laughters belonging to utterances in the three ChangeTalk classes. I observe some discriminatory power in their prosody and achieve a UA of 40.5% in classifying laughters belonging to utterances from the three classes.

7.1 Database

My experiments pertain to an MI-based intervention study on drug abuse problems involving patients at a public, safety-net hospital [238-240]. I use data from 49 subjects coded by a single annotator as per the MISC manual. I present an excerpt from one of the sessions and the corresponding ChangeTalk and counselor behavior codes in Table 7.1. Utterances with positive valence during ChangeTalk are listed as Change+, those with negative valence as Change-, and those with no change talk as Change0. Table 7.2 lists the count of utterances belonging to the various ChangeTalk and counselor behavior codes.

Table 7.1: Example conversation with counselor (T) behavior annotation and client (C) ChangeTalk valence annotation.

Utterance                          Counselor behavior code    ChangeTalk
T: Did you come here on own?       Question
C: Yes, I was sure about this.                                Change+
T: I am here to help you.          Support
C: I appreciate that.                                         Change0
C: I really have to stop drugs,                               Change+
C: but I just don't want to.                                  Change-

Table 7.2: List of various counselor behavior codes and ChangeTalk codes and the corresponding count of utterances.

Counselor behavior codes:
Advice (AD) (101)               Raise Concern (RC) (3)
Affirm (AF) (427)               Filler (FI) (29)
Confront (CF) (4)               Reflect (RE) (2310)
Direct (DI) (2)                 Reframe (RF) (0)
Emphasize control (EC) (36)     Support (SU) (260)
Facilitate (FA) (4175)          Structure (ST) (215)
Giving information (GI) (1475)  Warn (WA) (5)
Question (QU) (1665)

Client change talk utterances:
Positive valence (1749)    Negative valence (1253)    No ChangeTalk in utterance (9060)

7.2 Experiments

7.2.1 Predicting the valence of Client Change Talk utterances

Although modeling the MISC annotation of an MI session may be very complex, in this work I propose a simplified scheme to address this problem.
I represent each counselor utterance during the interaction using the set of variables {U_T1, ..., U_TN} and each client utterance as {U_C1, ..., U_CM}. These utterances are the observed outcomes determined by complex and time-varying internal mental processes of each participant, involving a mix of several affective, attitudinal and emotional states. Block (a) in Figure 7.1 lays out this representation of the dyadic interaction. During annotation, the coder observes each of the client and counselor utterances and assigns a code. The codes that I focus on are intended to capture specific behavior without regard to the overall impression of MI. Coders are made aware of the target behavior in detail so that they can reliably discriminate it from all other topics. In the following sections, I describe my baseline methodology for predicting ChangeTalk valences and the additions I make to evaluate the effect of including counselor behavior codes.

Figure 7.1: Proposed model to represent the dyadic interaction and the annotation process (U_Tn: n-th counselor utterance; U_Cm: m-th client utterance; B_Tn: counselor behavior code for U_Tn; C_Cm: client ChangeTalk code for U_Cm).

Baseline system

Since the coder assigns the ChangeTalk codes after individually observing each of the client utterances, a simplistic model may assume that the assigned code is determined solely by the utterance in question. I design my baseline system based on this assumption. In this baseline model, I focus on the lexical content of the client utterances and design an n-gram based classification system. I describe my feature extraction, selection and classification scheme below. I perform a leave-one-session-out cross-validation in all my experiments.

(1) Feature extraction: Given the unbalanced class distribution, I initially downsample instances from the majority classes (Change0, Change+), so that each class has the same number of instances as the least represented class, Change-. I extract all the unigrams and bigrams from each client utterance as potential features for learning the corresponding ChangeTalk code. However, this leads to a large feature space containing several features that may not be relevant to the classification or that are rare in occurrence. To overcome this problem, I use a feature selection algorithm as described next.

(2) Feature selection: I select a given n-gram (ng_k) based on the entropy (E(ng_k), equation (7.1)) of its empirical distribution (equation (7.2)) over the three classes and its minimum count (#ng_k) on the downsampled data, as shown in Algorithm 1. The maximum entropy threshold (T_E) and the minimum word count threshold (C_min) in the algorithm are tuned on a subset of the training set.

Algorithm 1: N-gram selection for classifier training.
1: Define: G = {ng_1, ..., ng_k, ..., ng_K}: set of n-grams in the training set
2: G_sel: set of selected n-grams for classification
3: Initialize: G_sel = ∅
4: for k = 1 .. K do
5:   if (E(ng_k) < T_E) and (#ng_k > C_min) then
6:     G_sel = G_sel ∪ ng_k
7:   end if
8: end for

E(ng_k) = - \sum_{ChangeTalk_val ∈ {Change+, Change-, Change0}} P_{k,ChangeTalk_val} log(P_{k,ChangeTalk_val})    (7.1)

where

P_{k,ChangeTalk_val} = P(ChangeTalk_val | ng_k) = (#ng_k in training utterances coded as ChangeTalk_val) / (#ng_k in the training set)    (7.2)

(3) Classifier training: I train a conditional maximum entropy model [241] on
I perform the parameter estimation using L- BFGS [242]. My baseline system can be viewed as only utilizing the solid line connecting U Cm to C Cm in Figure 1, i.e. ChangeTalk code depends only on what client says. Equation 7.3 shows the decision rule for assigning class the ChangeTalk class C Cm to the utterance U Cm given a set of observed features O Cm . In my baseline model, I set O Cm to G sel;Cm ( G sel ), the set of n-grams extracted from the considered utterance U Cm . C Cm = arg max ChangeTalk val 2fChange+,Change-,Change0g P (ChangeTalk val =O Cm ) (7.3) Classication system incorporating counselor behavior In my next experiment, I account for the counselor behavioral codes while inferring theChangeTalk codes, i.e. the client's utterance and the counselor's immediate past behavior both contribute towards identifying the code. Hence, in this scheme I also utilize the dotted link connecting the preceding counselor behavior code B T<m1> in addition to features from U Cm to infer C Cm . I perform two sets of experiments incorporating oracle and inferred counselor behavior codes as follows. Oracle counselor behavior: In this experiment, I use the oracle counselor behav- ior code (B orc T<m1> ) precedingC Cm as annotated by the coder. This model utilizes fG sel;Cm ;B orc T<m1> g as the evidence O Cm in equation 7.3. Note this is not ideally possible in a real system as I am using the true counselor codes for the test set which likely won't be available in the real world scenario. Inferred counselor behavior: As the use of oracle values for oracle counselor behavior is impractical, I develop a system to infer them. I implement the same framework as described in the baseline system. However as I have extremely low number of training instances for a few codes, I merge several minority classes before 98 training my prediction system. I retain questions (QU), giving information (GI), re ection (RE) and facilitate (FA) and merge all the other classes into a fth class; others (OT ). Initially, I gauge the eect of merging couselor codes on my previous prediction system involving oracle counselor codes. O Cm in equation 7.3 is set to fG sel;Cm ;B orc,mer T<m1> g, whereB orc,mer T<m1> are the oracle counselor codes obtained after merging. In order to predict the merged counselor codes, I initially downsample the data to remove class bias over the 5 classes. I perform feature selection and classier training as described in the baseline system for inferring the counselor codes using the n-grams from counselor utterances (U Tn ). Equation 7.4 shows the rule for inferring the counselor code B inf,mer T<m1> from the set of features O T<m1> . In this model, O T<m1> is set to the selected set of n-grams G sel;T<m1> extracted from counselor utteranceU T<m1> precedingU Cm . I use this inferred counselor code as the observed evidence O Cm in equation 7.3 (O Cm =fG sel;Cm ;B inf,mer T<m1> g). B inf,mer T<m1> = arg max B val 2 fRE;GI;QU;FA;OTg P (B val =G sel;T<m1> ) (7.4) Results and discussion I use unweighted accuracy (UA) as my evaluation metric and also report the F- measure for Change+ and Change- given their low occurrence frequency relative to Change0. The results for inferring ChangeTalk codes using the baseline system and after incorporating B T<m1> ;B orc,mer T<m1> andB inf,mer T<m1> are shown in Table 7.3. I also show the results for inferring counselor behavior codes in Table 7.4. 
Discussion: From the results, I observe that my model performs well above chance, indicating that lexical content alone during the conversation can inform us about change talk behavior. The improvement in the results after including the counselor behavior code indicates that the coder does take the context of the conversation into account while assigning the ChangeTalk codes. Also, I do not observe any significant difference between using all the behavior codes versus using the merged oracle codes. This suggests that training the model on a few codes with a sufficient number of samples is as good as training on all the codes with few samples. However, I do observe a decrease when I use the inferred counselor codes instead of the oracle codes. This stems from the imperfect prediction of the counselor codes themselves. I observe that a few classes of counselor codes are predicted more accurately than others in spite of using a balanced number of instances during training. This indicates that some classes are better distinguishable with lexical features, while other classes may be more distinguishable in other modalities. Particularly in the case of facilitate (FA), as most of the utterances are simple, functioning as "keep going" acknowledgments such as "Mm Hmm", "OK", etc., I observe almost perfect prediction.

7.2.2 Prediction incorporating laughters

Several studies suggest the importance of laughter in discourse [233, 234]. In this section, I perform a preliminary analysis of laughters and study their effect on my previous system. I hypothesize that the mere occurrence of laughter events may provide some information regarding a client's attitude towards the Target. I add a simple binary feature indicating the presence of laughter, L_Cm (available through the transcripts), in the client utterance U_Cm to my previous models and reproduce the results. I set O_Cm to {G_sel,Cm, L_Cm} for the baseline model, and a similar addition is made to the models incorporating the counselor behavior. While inferring B^inf,mer_T<m-1>, I use a binary feature L_T<m-1> indicating the presence of counselor laughter in U_T<m-1> (O_T<m-1> = {G_sel,T<m-1>, L_T<m-1>} in equation (7.4)). I show the results for predicting ChangeTalk codes using laughters, and the relative improvements over the counterparts from the previous model, in Table 7.5. Results and the corresponding relative improvements for counselor behavior code prediction are shown in Table 7.6.

Table 7.5: Results (%) and relative improvements (%, in parentheses) over the previous model for predicting ChangeTalk codes with laughter.

                                  Change+                               Change-
Model              UA           F-meas.      Acc./Prec.                F-meas.      Acc./Prec.
U_Cm               50.1 (2.2)   32.1 (5.9)   45.8/24.7 (2.2/7.7)       28.6 (0.0)   49.2/20.1 (0.0/0.0)
+B^orc_T<m-1>      51.4 (1.2)   33.4 (2.1)   45.4/26.4 (1.3/2.3)       29.6 (1.7)   49.5/21.1 (0.8/1.4)
+B^orc,mer_T<m-1>  51.3 (1.4)   33.4 (2.5)   45.3/26.4 (1.3/3.1)       29.5 (1.4)   49.5/21.0 (1.4/1.4)
+B^inf,mer_T<m-1>  50.8 (1.2)   32.8 (3.1)   44.1/26.1 (1.1/4.4)       29.3 (0.6)   48.9/20.9 (0.8/0.4)

Table 7.6: Results (%) and relative improvements (%, in parentheses) over the previous model for predicting counselor behavior codes with laughter.

                              Class Accuracies
Model     UA           RE           GI           QU           FA            OT
U_Tn      71.3 (2.4)   71.4 (1.0)   56.0 (0.5)   77.2 (0.0)   97.0 (-0.1)   55.1 (16.7)
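The laughter cue enters these models as a single extra binary feature alongside the selected n-grams. A minimal illustration of the augmented evidence O_Cm = {G_sel,Cm, L_Cm}, reusing the hypothetical feature encoding from the earlier sketch:

def utterance_features(ngrams, selected, has_laughter):
    """Selected n-grams plus a laughter indicator (L_Cm, read from the transcript)."""
    feats = {ng: 1 for ng in ngrams if ng in selected}
    feats["__laughter__"] = int(has_laughter)
    return feats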
101 Model UA Class Accuracies RE GI QU FA OT U Tn 71.3 71.4 56.0 77.2 97.0 55.1 (2.4) (1.0) (0.5) (0.0) (-0.1) (16.7) Table 7.6: Results (%) & relative improvements (%) over previous model for pre- dicting counselor behavior codes w/ laughter. Discussion: From the results, I observe that I get a consistent gain in both the client and the counselor results. I list the empirical distribution of client laughters over the three ChangeTalk codes in Figure 2(a) and the counselor laughters over the merged counselor behavior codes in Figure 2(b). The occurrence probability of laughters is not uniform over codes, thereby providing additional information to the previous model. I observe that in the case of ChangeTalk codes, an utterance labeled Change+ is most likely to contain laughters. This indicates that presence of laughters shows a positive attitude. The occurrence probability of laughters is more skewed in the case of counselor codes. I observe that the utterance labeled OT are most likely to contain laughter. Due to this, fact I observe largest increase in the classication accuracy for the OT class. Laughters are rarely present in utterances labeled FA and hence I barely see any change in its class accuracy. 7.2.3 Laughters and their prosodic dierences Several studies show that there are inherent dierences in laughters and the con- text in which they occurs [235{237]. I investigate the dierences amongst client laughters in the context of the three ChangeTalk classes. I design a prosody based classication system to distinguish amongst laughters that occur in utterances coded as Change+, Change- or Change0. I hypothesize that if indeed dierences exist amongst the laughters, this can further aid my ChangeTalk valence prediction system. I describe the prosodic features and the classication system below. 102 Change+ Change- Change0 0 0.01 0.02 0.03 0.04 FA QU RE GI OT 0 0.02 0.04 0.06 0.08 0.1 61/ 1749 26/ 1253 284/ 9060 34/ 2310 28/ 1665 122/ 1475 (a) (b) Empirical probability 15/ 4175 7/ 1665 Empirical distribution of laughters over various client and counselor codes P(laughter/code) Figure 7.2: Counts and empirical distribution of laughters over (a) ChangeTalk codes (b) Counselor behavior codes. Feature Statisticals Pitch, intensity, voicing probability, Mean, variance, harmonic to noise ratio range Table 7.7: Prosodic features used in laughter classication. Features: I manually annotate all the 371 client laughters (Change+:61, Change-:26, Change0: 284) in the 49 sessions marking their start and end posi- tions. Given a small number of samples, I limit my experiment to a few low level prosodic cues and compute their global statistics shown in Table 7.7. I mean normalize these features per speaker. Classier: I use a linear kernel support vector machine classier. Given the unbalanced class distribution (counts shown in Figure 2), I downsample the sam- ples from Change0 and Change+ classes so that each class has equal number of samples. I perform leave one session out cross-validation on the laughters from 49 sessions. I list the classication accuracy in Table 7.8. Discussion: I observe that the use of a few low level prosodic cues does provide us with some discriminatory power in between client laughters belonging to the three classes. Poor classication accuracy stems from extremely small number of training instances. 
Laughters from the class Change0 are most poorly classied 103 Model Unweighted Class accuracies accuracy Change+ Change- Change0 Chance 33.3 33.3 33.3 33.3 Prosody 40.5 47.5 47.2 26.9 Table 7.8: Results for classication of client laughters over ChangeTalk codes using prosody. as they have a high downsampling factor for maintaining the class balance. This introduces a sampling bias. Due to the same reason, I could not carry out an experiment to investigate dierences in counselor laughters as some classes have too few samples (e.g. 7 for QU). Because of the weak discrimination, this infor- mation does not help my previous ChangeTalk valence prediction model as of now. However, given that prosodic dierences in laughters do exist, I am encouraged that I improve the proposed model with the availability of more training data in future. 7.3 Conclusion In this chapter, I present a scheme towards automatically obtaining MISC codes in MI settings. I design a lexical based scheme to automatically identify client utterances with positive or negative valence that indicate their attitude towards a targeted behavior change. I show that incorporating the counselor's behavior into account during the interaction helps improve the prediction, thus validating the importance of the counselor towards positive outcomes. I proceed with a preliminary analysis on incorporating laughters as additional information source, augmenting the previous system. Finally, I analyze the type of laughters based on their prosodic cues. I observe some discriminatory power in the prosody of 104 laughters with respect to the ChangeTalk valences, however due to limited data and poor accuracy I could not exploit this towards change talk classication. I presented my results on one aspect of MISC code. However, the MISC anno- tations provide other global measures such as empathy, motivational interviewing spirit etc. that furnish more indicators regarding the success of a session. I aim at building upon my current system to incorporate these measures. Studies link acoustic [120,243], visual cues [244,245] etc. to human behavior and one can incor- porate such cues to supplement the system. Also, I aim to further investigate other aspects of laughters (e.g. shared laughters) and the role they play in MI. One may further extend this work to other non-verbal vocalizations as llers, sighs etc.. 105 Part III Modeling diversity in perception of nonverbal cues 106 Perception of nonverbal cues is contingent upon the audience, with every per- son in the audience decoding the nonverbal cues uniquely. In this part, I quantify these dierences in perceptions by building upon the existing multiple annotator models. Specically, I borrow the classication model proposed by Raykar et al. [2] and extended it to case of time-continuous regression and ranking. Raykar's model assumes that an object can be labeled conditioned upon it's attributes/features and various annotators present their noisy perceptions of the latent ground truth. I modify this model to investigate two case studies involving: (i) modeling multiple time series annotations based on ground truth inference and distortion and, (ii) inferring object rankings based on noisy pairwise comparisons from multiple anno- tators. In the rst case study, I develop a time-continuous regression model that models the dierences in perception amongst the annotators as an Linear Time Invariant (LTI) lter. I analyze annotator specic traits such as annotator bias and lags based on this model. 
In the second study, I extend Raykar's model to a ranking problem. I perform experiments on several datasets along with a dataset regarding the perception of expressivity and naturalness of the speech from par- alanguage. I interpret the annotator quality based on how often they ip the order in pairwise comparisons. I discuss both these experiments in detail in chapter 8 and 9. 107 Chapter 8 Modeling multiple time series annotations based on ground truth inference and distortion Tracking the evolution of a time series over a continuous variable is a problem of interest in several domains such as social sciences [246, 247], economics [248, 249] and medicine [250,251]. However, often times the variable of interest may not be directly observable (such as in behavioral time series of psychological states) and judgments from multiple annotators are pooled to estimate the target variable. A classic example is tracking aective dimensions in the study of emotions [187, 189,252] where ratings from multiple annotators are used to determine the hidden aective state of a person from audio-visual data of emotional expressions. The general practice in these behavioral domains is to infer the hidden variable by using human annotation. These studies often use heuristic metrics such as mean over the annotator ratings or select annotators based on condence intervals for the true estimate (the ground truth) of the unobserved variable. However, these metrics may not provide an accurate representation for the ground truth. Apart from assuming a denite relation between the ground truth and the annotator ratings, several factors such as individual dierences between the annotators and annotator reliability are not accounted for. Recent research has addressed a few of these problems. For instance, Nicolaou 108 et al. [253] assume that there is a latent space shared by annotator ratings and iden- tify it using dynamic probabilistic Canonical Correlation Analysis (CCA) model with time warping. Another model proposed by Mariooryad et al. [5] aligns the annotator ratings by adjusting delays identied using mutual information between features and every annotator's ratings. Along the lines of the proposals by Nico- laou et al. [253] and Mariooryad et al. [5], I present a new model which assumes that the ground truth can be computed using a set of low level features based on a \feature mapping function". Furthermore, the annotators process this (latent) ground truth based on annotator specic \distortion functions" to provide their ratings. My model is inspired from multiple annotator modeling proposed by Raykar et al. [2], and Figure 8.1 provides an intuitive summary of the model. Sim- ilar to Mariooryad et al. [5], my model relies on both annotator ratings as well as features to identify the latent ground truth and is, in fact, a generalization of their model. This design assumption is inspired from the classic channel transfer function estimation in communication theory [254,255] wherein the channel (anno- tator) corrupts the true signal based on a transfer function (distortion function). These annotator specic distortion functions, apart from allowing model evalu- ation on annotator ratings themselves, also provide a window to an annotator's hidden perceptual and cognitive processes. The proposed model specically targets the class of problems where the ground truth can not be observed, but judgments from multiple annotators are obtain- able/available. 
I approach this problem using an Expectation Maximization (EM) [256] class of algorithms, a framework widely used under similar circumstances involving an unobserved/hidden variable. I assume specic structures for the fea- ture mapping function and the distortion functions and present an EM algorithm involving iterative execution of an expectation step (E-step) and a maximization 109 Based on the observed cues, a latent ground truth of the variable of interest (e.g. strength of smile) exists Ground truth variable being tracked Time Expressed Cues (e.g. facial expressions) Each annotator provides his perception of the ground truth (e.g. strength of smile), corrupted with bias, scaling and other noises. Annotation for variable of interest Time Annotation for variable of interest Time Time Annotation for variable of interest Annotator 1 Annotator 2 Annotator K Figure 8.1: A gure providing the intuition of proposed model, inspired from Raykar et al. [2] step (M-step). The E-step estimates the ground truth based on the values of model parameters at hand and the M-step recomputes the model parameters based on the ground truth obtained in the E-step. I demonstrate the eectiveness of the proposed algorithm in a study involving prediction of time continuous condence ratings of smile intensity in a video dataset involving toddlers engaging in a brief play interaction with an adult. A set of 28 annotators provide their condence ratings of the child's smile by looking at a video of the face recorded during the interaction. I present a brief data description and statistics on annotator rat- ings followed by experimental details of testing various baselines and the proposed model on this dataset. My results show that my model outperforms baseline mod- els that assume ground truth to be the mean of all annotator ratings as well as the model proposed by Mariooryad et al. [5]. I present my analysis on the distor- tion functions and compare the structural patterns in the estimated ground truth, annotator ratings and the mean over all annotator ratings. Finally, I also observe 110 a * a 1 a n a N p(a * ∣X) =f(X,θ) X a * a 1 a n a N X p(a n ∣a * ,z)=A n z (a n ,a * ) z p(X∣z)=N(X,μ z ,Σ z ) (b) (c) a * a 1 a n a N p(a * )=π a * p(a n ∣a * )=A n (a n ,a * ) (a) p(a n ∣a * )=A n (a n ,a * ) p(a * ∣X)=f(X,θ) Figure 8.2: Graphical models for schemes previously proposed to model discrete label problems. (1a) Maximum likelihood estimation of observer error-rates using the EM algorithm [3] (1b) Supervised learning from multiple annotators/experts [2] (1c) Globally variant locally constant model [4]. the impact of removing a few annotators and record performance changes over each annotator by the proposed and the baseline models. To summarize, the major contributions of this chapter include: (i) designing a system to jointly model time-continuous annotations from multiple annotators (ii) proposing an EM based algorithm to train the system and, (iii) applying and interpreting of the system on a specic case study involving estimating condence ratings of smile intensity. 8.1 Background Several previous works have addressed a range of multiple annotator problems involving discrete class labels. Figure 8.2 shows a few schemes for the discrete class modeling problem, each with a specic set of assumptions. Dawid et al. [3] provided one of the earlier models for the problem as shown in Figure 8.2(a). 
a 111 represents an unobserved reference label for a given training example, drawn from a probability distribution such that P (a ) = . Given a set of N annotators, the n th annotator provide his judgment of the example based on a reliability matrix A n . Raykar et al. [2] extended the above model to train a discriminative classier as shown in Figure 8.2(b). The model rst estimates the probability of reference label given a set of featuresX based on a functionf(X;) ( is the set of function parameters). Each of the annotators provides his/her judgment assuming a similar strategy as the rst model. Audhkhasi et al. [4] presented a further modication assuming variable feature reliability as shown in Figure 8.2(c). The data is assumed to be generated based on the parameterz, which also aects the judgment of each annotator. The probability of a is obtained based on the featuresX, through a discriminative maximum entropy model. Similar multiple annotator models have also been proposed by Bachrach et al. [257], Yan et al. [258] and Welinder et al. [259]. However these models have not been generalized to continuous time series annotations, despite covering a range of multiple annotator problems. Apart from multiple annotator models, other schemes that handle noisy distortion of data include matrix factorization techniques [260,261], wavelet based methods [262] and other matrix recovery methods [263]. On the other hand, several studies have also focused on modeling time series data. A classic example is modeling emotional dimensions (e.g. valence, dom- inance, arousal) during human interaction [252], human-computer interaction [187, 264] as well as in music [265, 266]. These studies use multiple annotators to derive the ground truth reference and use heuristic metrics over the annota- tor ratings as a proxy for the latent emotional dimension. For instance, all the studies listed above use mean over annotator ratings as the ground truth. Other human interaction modeling examples that represent time series of discrete events 112 capturing a hidden internal human state include characterizations of client and counselor behaviors during psychotherapy [251, 267], couples therapy [268] and human-machine spoken dialogs [269]. These studies either substitute ground truth using annotations from a single annotator or use majority voting over multiple annotator ratings at every sample. These approximations of the ground truth are rather crude as they do not account for annotator specic traits such as their prociency, subjective references as well as motor and cognitive delays in task performance. Recent research studies have addressed a few of these problems in aggregat- ing annotator ratings using novel methods to account for annotator disparities. For instance, Nicolaou et al. [253] assume that each annotator's ratings could be factored into individual factors and a warped shared latent space representation. They perform this factorization using a Dynamic Probabilistic CCA (DPCCA) model. In later versions of their model [270], they proposed further extensions where features from the data are assumed to be generated conditioned on the latent shared space (Supervised-Generative DPCCA) as well as a discriminative model where the features determine the latent shared space (Supervised-Discriminative DPCCA). In its formulation, the Supervised-Discriminative DPCCA is similar to the proposed model. 
The model uses CCA and dynamic time warping to address the fact that Raykar's model [2] does not account for temporal correspondences between annotation samples. On the other hand, my model uses a distortion function which operates on the latent ground truth to provide annotator ratings. The distortion function provides proxies for biases and delays estimated for each annotator, which I further interpret in the experiment of my interest (Nicolaou et al. [270] provide other interpretations such as ranking and ltering annotations). 113 Also, Nicolaou et al. [270] evaluate model performance based on how well the fea- tures predict the latent ground truth. Although this evaluation is appropriate, the model should also be evaluated on predicting the observed data (i.e., the annota- tor rating themselves), which is not trivial to obtain using this model. Mariooryad et al. [5] proposed another approach where they rst identify annotator specic delays based on mutual information between the annotator ratings and the data stream. The nal aggregation is computed as a frame-wise mean of annotator rat- ings after accounting for delays. Note that this model uses the data feature stream in computing the annotator delays and it is possible to compute (and hence eval- uate on) the individual annotator ratings from the ground truth by reintroducing those delays. My model is an extension to the model proposed by Marioordad et al. [5] wherein instead of only estimating a constant delay, I estimate a more general Finite Impulse Response (FIR) lter which can not only account for delays but also scaling and bias introduction in annotator ratings. Generally, my work is inspired from the models on discrete class labels and is modied to be applicable on continuous annotations. In the next section, I rst describe the general framework for my model. I then describe the data set used for evaluating my model and also discuss the baseline models in comparison to the proposed model. Finally, I interpret the model parameters obtained on the data set and analyze the ndings. 8.2 Distortion based multiple annotator time series modeling I propose a distortion-based modeling scheme similar in structure to Raykar et al. [2] to model time series annotations from multiple annotators. Given a session 114 s drawn from a set of sessions S, I assume that the ground truth is conditioned on the session featuresX s . Furthermore the annotator ratings are assumed to be noisy modications of the hidden ground truth, determined by annotator specic functions. I describe these two assumptions behind my model in detail below. (i) First, I assume that the ground truth ratings for the session s, a s = [a s (1);::;a s (t);::;a s (T s )] T are conditioned on a set of session features X s = [x s (1);::;x s (t);::;x(T s )]. T s is the number of data frames ins,a s (t) is the ground truth value at the frame indext andx s (t) is aK-dimensional column feature vec- tor also at the frame index t. a s is a column vector representation of the time seriesfa s (1);::;a s (T s )g. Equation (8.1) shows the relation between the ground truth time seriesa s andX s based on a feature mapping function g. represents the set of mapping parameters for the function g. a s = g X s ; (8.1) (ii) Next, I assume that the ratings provided by each annotator are distortions of the ground truth. For the sessions, ratings from then th annotator are represented as a column vectora s n = [a s n (1);::;a s n (t);::;a s n (T s )] T , a s n (t) being the rating at the t th frame. 
I obtaina s n based on a distortion function h operating ona as shown in (8.2). For the n th annotator,D n represents the set of parameters for h. a s n = h(a s ;D n ); n = 1; 2;::;N (8.2) Figure 8.3 shows the Bayesian network for the proposed scheme. All session specic variables are located inside the plate. The conditional dependencies (direc- tion of edges) are determined based on the equations (8.1) and (8.2). a s can be 115 X s θ a * s a 1 s a n s a N s D N D 1 D n s∈S Figure 8.3: Graphical model for the proposed framework. X s represents the fea- tures,a s represents the ground truth. andhD 1 ;::;D N i are the set of parameters for feature mapping function and distortion functions, respectively. determined based on andX s , hence the two variables are set to be the parents ofa s . Similarly,D n anda s are parents ofa s n . 8.2.1 Choices for the feature mapping function and the distortion function In this work, I chose linear functions with additive noise terms as the rep- resentations for the functions g and h. Linear representations lead to better interpretability and easier parameter learning but the model can be extended to more complicated representations. The additive noise terms account for factors that can not be captured by linear modeling and is a commonly used component in various regression and classier learning schemes [271]. I describe my choices in detail below. Feature mapping function: I choose a linear mapping between the featuresX s anda s as shown below. 116 a s = g X s ; = 2 4 X s 1 3 5 T + s (8.3) In the equation above, is a K + 1 dimensional vector, s = [ s (1);::; s (t);::; s (T s )] T is a random noise vector with noise variable s (t) added at the t th frame. 1 represents a vector of ones and appends a bias term to feature vector at each frame. In eect, ground truth at frame t, a s (t) is obtained from (8.3) as a s (t) = 2 4 x s (t) 1 3 5 T + s (t) (8.4) I assume the noise vector s N (0; I T s) 1 . Given the ane transformation in (8.3),a s follows the distribution given by a s N 2 4 X s 1 3 5 T ; I T s (8.5) Similar assumptions on noise distribution are made in several regression and classication models [272, 273]. The Gaussian noise distribution allows for easy computation, however, can be replaced with other noise distributions as done is several previous works [274,275]. Distortion function: An annotator may modify the ground truth based on his/her perception. I aim to capture this annotator specic modication using 1 I use the notationN(;) to represent a Gaussian distribution with mean and covariance matrix. InN(0; I T s),0representsazeromeanvectorand I T s isadiagonalcovariance matrix with all entries equal to . In this case, the operator implies multiplication of a scalar value to all entries of a matrix/vector. 117 a distortion function operating on the ground truth. I assume that the n th anno- tator's ratingsa s n for the session s are obtained after distorting the ground truth based on a linear time invariant (LTI) lter with additive bias and noise terms. Although a linear operation, LTI lters can account for scaling and time delays introduced by the annotators. I assume a lter of length W with coecients d n = [d n (0);::;d n (W 1)] along with an additive bias term d b n . The noise random vector is represented by s n = [ n (1);::; n (t);::; n (T s )] T where n (t) is noise ran- dom variable for t th frame. 
The set of parameters D n for the distortion function h as represented in (8.2) are the lter coecients d 1 ;::;d N and the bias terms d b 1 ;::;d b N . Based on the lter coecients, the bias term and the noise vector,a s n is given as shown in (8.6). a s n = h(a s ;d n ) = (d n a s ) + (d b n 1 s ) + s n (8.6) In (8.6), 1 s represents a vector of ones with as many entries as the number of frames in the session s. The operator represents the convolution operation between the time series a s and annotator specic ltersd n . Further, I assume n to be a zero mean Gaussian noise with a covariance matrix of the form ( I T s), where I T s represents an identity matrix with dimensions (T s ;T s ). Since n N (0; I T s), I can state the following given the ane transformation in (8.6) p(a s n ja s ;d n )N (d n a s ) + (d b n 1 s ); I T s (8.7) 8.3 Training Methodology I use data log-likelihood maximization technique for training the proposed model. Based on the denitions of the functions h and g, I maximize the likelihood of 118 the observed data (i.e., the annotator ratings) to obtain the parameters d n ;d b n and . Also note that in the multiple annotator experiments under considera- tion, the ground truth a s is not directly observable. Therefore the Expectation- Maximization (EM) algorithm [256] is a suitable candidate for maximum like- lihood estimation. The data log-likelihoodL is dened on the observed anno- tator ratings ha s 1 ;::;a s N i given the feature values X s and model parameters =hd 1 ;::;d N ;d b 1 ;::;d b N ;i over all the sessions s2S as shown below. L = X s2S logp(a s 1 ;::;a s N j;X s ) (8.8) The above expression is equivalent to the marginalized log-likelihood over the hidden ground truth variablea s as given below. L = X s2S log Z a s p(a s 1 ;::;a s N ;a s j;X s )@a s (8.9) A complete derivation of the EM algorithm for the model in Figure 8.3 based on the structural assumptions for the distortion and feature mapping functions is given in Appendix 1. Below, I brie y summarize the model training using the EM algorithm and the criteria to evaluate the model. 8.3.1 EM algorithm implementation Initialize lter coecientshd 1 ;:::;d N i, bias termshd b 1 ;::;d b N i and mapping func- tion parameter. While the data-log likelihood converges, perform: - E-step: In this step, I obtain the ground truth estimate a s . Based on the 119 Gaussian distribution functions dened in (8.5) and (8.7), I arrive at the optimiza- tion problem shown in (8.10).jj:jj 2 represents the L 2 vector norm. a s = arg min a s N X n=1 (a s n ) (d n a s +d b n 1 s ) 2 2 + (a s ) 2 4 X s 1 3 5 2 2 ; 8s2S (8.10) - M-step: In the M-step, I estimate the model parameters based on the Gaussian distribution functions dened in (8.5) and (8.7). A detailed derivation of this estimation is shown in Appendix 1 and it turns out that I can estimate lter coecientshd 1 ;::;d N i, the bias termshd b 1 ;::;d b N i and parameter by operating separately on the two constituent terms. The optimization problem to obtain the distortion function parameters is given below. d n ;d b n = arg min dn;d b n X s2S N X n=1 (a s n ) (d n a s +d b n 1 s ) 2 2 (8.11) The above optimization to obtaind n andd b n can be carried out jointly by using a matrix formulation. Optimization problem to obtain is stated below. = arg min X s2S ( a s ) 2 4 X s 1 3 5 2 2 (8.12) End while In the next section, I describe my evaluation criteria on a given test set after training the model using the EM algorithm. 
120 8.3.2 Evaluation criteria I chose two evaluation criteria for my model: (i) accuracy of the feature mapping function in predicting the ground truth, and (ii) accuracy in prediction of annotator ratings themselves. I discuss these two criteria below. Eval1: Accuracy of the feature mapping function in predicting the ground truth In my rst criterion, I estimate the latent ground truth a ^ s;true for a test session ^ s using the annotator ratings only based on the optimization problem stated in (8.13). Then, I make ground truth predictions a ^ s;pred from the feature mapping function as shown in (8.14). The Eval1 criterion is given as the correlation between the estimated (a ^ s;true ) and predicted (a ^ s;pred ) ground truths. This evaluation cri- terion was also adopted by Nicolaou et al. [253] where they compute the ground truth based on annotator ratings and use features to predict the estimated ground truth. They motivate this evaluation criteria by arguing that a better ground truth can be better predicted using the low level features. Similarly, Mariooryad et al. [5] rst compute the ground truth after accounting for lags from annotator ratings and later use features from the data to predict sucient statistics of the estimated ground truth such as its mean. a ^ s;true = arg min a ^ s N X n=1 (a ^ s n ) (d n a ^ s +d b n 1 s ) 2 2 (8.13) a ^ s;pred = 2 4 X ^ s 1 3 5 (8.14) 121 Eval2: Accuracy in predicting the annotator ratings Since the ground truth is a latent variable in the problems of interest, I also evaluate my model directly on the observed data, i.e., the annotator ratings themselves. An accurate prediction of observed ratings would imply that the model is able to capture the inherent relationship between the features, ground truth and annotator ratings. I report the correlation coecient () between the true and predicted ratings per annotator which also allows for observing the performance for each annotator separately. The annotator ratings are obtained using the following two steps: (i) I rst predict the ground truth a ^ s on a test session ^ s using the feature mapping function as stated in (8.14) (ii) next, I compute a ^ s 1 (t);::;a ^ s N (t) froma ^ s andd 1 ;::;d N using the operation shown below. a ^ s n =d n a ^ s +d b n 1 s (8.15) Note that these estimates ofa ^ s anda ^ s n are the means of Gaussian probability distribution functions stated in (8.5) and (8.7), hence also the maximum likelihood estimates. In the next section, I describe the experimental evaluation and my dataset of choice. 8.4 Experimental evaluation I evaluate the proposed framework on ten sessions of a dyadic child-clinician inter- action dataset, the Rapid-ABC dataset [201,276,277] focusing on perceived ratings of the strength of a child's smile. The data were collected to computationally inves- tigate behavioral markers of psychological and cognitive health conditions such as Autism Spectrum Disorders; the patterns of smiles are hypothesized to be an important cue [278]. Each session is approximately three minutes long and involves 122 natural interaction between an adult and a child between the ages of 15 and 30 months. The interaction elicits verbal as well as non-verbal behaviors (e.g, smile, laughter, grins). The overarching goal of this data collection was to understand various aspects of child-adult interaction including social response, joint attention and child engagement. 
For the purpose of my study, a set of 28 annotators later independently viewed a video from each session that captured the child's face during the interaction. They provided ratings on the strength of a child's smile (using a joystick arrangement), recorded at a frame rate of 30 samples/second over a dynamic range of 0-500. The corresponding audio included both psychologist and child speech. The annotators underwent an extensive initial training in rating the smile condences. During this training, the annotators would rate a le and their ratings were discussed with the data collectors. The discussion points included disagreements with the data collectors and other annotators, the oset and onset of smile condence annotations and other factors such as the annotator's consistency. After multiple rounds of this training procedure, they were assigned the 10 sessions used in this study to code by themselves with no feedback. I show the inter-rater agreements using the correlation coecient () between every pair of annotators as the metric in Figure 8.4. These values are computed over frames from all the 10 sessions. The annotator indices are assigned based on agreement with the rst annotator; where the last index is assigned to the annotator having least agreement with annotator 1. From the gure, I observe that the values are in the range of 0.35 to 0.80 for most of the annotator pairs. However the values of annotator 27 and 28 with other annotators are particularly low. This is indicative of a lower quality of ratings from these two annotators. Therefore, apart from initially testing my models by 123 Annotator ids Annotator ids 5 10 15 20 25 5 10 15 20 25 0 0.2 0.4 0.6 0.8 1 Figure 8.4: Correlation coecient () between every pair of annotators represented as an image matrix. Colorbar on the right indicates the value of the correlation coecient. Due to indexing based on agreement with annotator 1, annotators with lower indices have a higher with annotator 1. Annotator 27 and 28 have a very low agreement with several of the annotators. including all the annotators, I also conduct a follow up evaluation after removing these two annotators and analyze the results. Evaluation including annotators 27 and 28 helps us to interpret their impact on the model by analyzing the parameters corresponding to these annotators. On the other hand, evaluation without annota- tors 27 and 28 provides an insight into the impact of removing noisy annotator on the predictive capability of the model. In order to evaluate my model, I perform a 10 fold cross-validation, where 8 sessions are used for training, 1 as development set and 1 for testing. In the next section, I describe the featuresX s used in this work. 8.4.1 Feature set Smile is a visual phenomenon and previous research has used several visual features for smile detection [72,279] and analysis [280]. I use a set of similar video based features in my study. The video features are computed per video frame (30 124 Figure 8.5: Facial landmark points tracked on the children's face during interaction. frames/second) and are synchronized with the annotator ratings. I describe the features below. Facial landmarks: I use the CSIRO Face analysis SDK [281] to track facial landmarks on the child's face. I t 66 landmark points to the face at every frame. Figure 8.5 shows a video frame from the database with landmark points marked on the face. 
Based on these landmark points, I compute two sets of features: (1) velocity of the head based on the nose-tip landmark point, and (2) distance and velocity of all other landmark points with respect to the nose tip landmark point. Local binary patterns (LBP) based features: LBP features [282] are well known for describing facial expressions. During the computation of this feature, every pixel's intensity is compared to its neighbors and a binary vector is returned. LBP descriptor is a histogram over these binary patterns. I combine the facial landmark features and the LBP features to obtain a feature vector with dimensionality K = 387 for every video frame. For more details on the features, please refer to [281,282]. In the next section, I provide a description of the baseline models. 125 8.4.2 Baseline models I use two baseline models to compare against the proposed model. In the rst baseline model the ground truth is assumed to be a frame-wise mean over all the annotator ratings and the second baseline is borrowed from the work by Mariooryad et al. [5]. I discuss these baselines below. Baseline 1: Frame-wise mean of annotator ratings I use a baseline model, where the ground truth at a given frame is assumed to be the mean over ratings from all the annotators at that frame. Several previous works [186,187,252] have used this assumption in obtaining the ground truth from multiple annotators on similar time series modeling problems. This scheme assigns equal weight to each annotator and does not account for individual dierences. In the baseline case, the relation between the ground truth and the annotator ratings is presumed before hand and can be represented by the following operation in (8.16). I T s represents an identity matrix of dimensionality (T s ;T s ). a s = 1 N I T sjI T sj:::jI T s | {z } N-times 2 6 6 6 4 a s 1 . . . a s N 3 7 7 7 5 (8.16) I incorporate the assumption in (8.16) in the framework of my model. I obtain the mapping parameter based on a s (obtained as in (8.16)) using the MMSE criteria in (8.12). However, instead of obtaining lter coecients using EM algorithm, they have to be computed based on equation (8.16). I use two dierent methods to compute the lter coecients using the hard coded ground trutha s from (8.16) as listed below. 126 Baseline 1(a): In the rst baseline model, the lter coecients are computed using the MoorePenrose pseudoinverse (Pinv) [283] operation on the set of identity matri- ces in (8.16) as shown in (8.17). As per (8.17), the multiplication ofI T s to a s to obtain a s n implies that the lters are inferred to be unit impulse response lters with no delay. Hence the lter coecient d n is a unit Kronecker delta function. The bias terms d b n are all estimated to be 0. 2 6 6 6 4 a s 1 . . . a s N 3 7 7 7 5 = Pinv 1 N I T sjI T sj:::jI T s | {z } N-times a s = 2 6 6 6 4 I T s . . . I T s 3 7 7 7 5 a s (8.17) Baseline 1(b): In this case, I seta s to the value shown in (8.16). Then, I compute the lter coecientsd n and the bias terms d b n using the MMSE criteria listed in (8.11). The lter length parameter W is tuned on the development set. Baseline 2: Lag compensated aggregation of annotator ratings My second baseline is borrowed from the work by Mariooryad et al. [5] where I rst estimate the lags per annotator with respect to the features obtained from the data stream. The lags per annotator are computed by introducing a delay in the ratings per annotator till his/her ratings have the maximum mutual information with the frame-wise features. 
Note that this formulation is a special case of the proposed model when the distortion function is constrained to be a unit impulse response lter with a constant delay (d n in (8.6) is set to a Kronecker delta function with the delay corresponding to the n th annotator). The bias terms d b n are set to 0 in this formulation. After compensating for the annotator delays calculated on the training set, a s for every data partition is computed as the frame-wise mean of the aligned annotator ratings (also the solution to the optimization in (8.13)). I 127 obtain the mapping parameter from the computeda s using the MMSE criteria in (8.12). In order to compute back the annotator ratings for the Eval2 criterion, individual annotator ratings on the test set are computed as per the convolution stated in (8.15). In essence, the convolution operation reintroduces the estimated delays in the ground truth to compute each annotator's ratings. For more details regarding this baseline, please refer to section 4 in [5]. 8.4.3 Results Using the stated cross validation split, I train the baseline and proposed models. For the proposed model and the baseline model 1(b), the lter length parameter W is tuned on the development set. Note that W is tuned globally over all the annotators, as tuning a W for each annotator is computationally expensive and the lter characteristics are expected to be robust to small changes in the length W . Table 8.1 shows the correlation coecient of feature mapping prediction with the estimated ground truth (Eval1 criterion). Note that results are the same for baselines 1(a) and 1(b) due to the common ground truth computation criteria, i.e., frame-wise means of annotator ratings. The Eval1 criterion correlation of the proposed model is better than the baseline using the Fisher z-transformation test [1] considering value at each frame to be a sample. Figure 8.6 shows the in predicting the observed annotator ratings (Eval2 criterion). For the Eval2 criterion, the proposed model is signicantly better than all the baselines for 20 annotators (Fisher z-transformation test, p-value< 10%, number of samples is the number of analysis frames:37k). This excludes the noisy annotators 27 and 28 as observed in Figure 8.4. The Cohen's D [284] comparing the proposed model against each baseline yields a values of .31 (baseline 1a), .11 (baseline 1b) and .33 (baseline 2). The Cohen's D is computed using correlation coecients for each 128 Eval1 criteria, Baseline Baseline Proposed correlation coecient 1a/1b 2 Model with the ground truth 0.28 0.30 0.34 Table 8.1: Correlation coecient between the estimated ground truth and the predictions from the feature mapping function. A higher implies that the esti- mated ground truth is better estimated using the low lever features. The improve- ment over the closest baseline using the proposed model is signicant based on the Fisher z-transformation test [1] (p-value < .001, z-value = 6.1, number of samples equals the number of analysis frames:37k). 0 5 10 15 20 25 30 0 0.1 0.2 0.3 0.4 Correlation coefficient 0 5 10 15 20 25 30 −0.4 −0.35 −0.3 Annotator ids Baseline 1(a) Baseline 1(b) Baseline 2 Proposed model Figure 8.6: Correlation coecients between the true and predicted annotator ratings. A higher implies that the model is better able to model the dependencies between low level features and the annotator ratings. The values of proposed model signicantly better (at least at 5% level using Fisher z-transformation test) than all the baseline are marked with. 
Annotators 3,16 and 18 are signicant only at 10% level (marked with a) and annotator 1, 5, 14, 17, 24, 25, 17 and 28 are either not signicantly better or worse than at least one of the baselines. annotator as the sample values. These values indicate a small improvement eect over baseline 1b and medium improvement eect over baselines 1a and 2. 129 −45 −40 −35 −30 −25 −20 −15 −10 −5 0 −0.02 −0.01 0 0.01 0.02 0.03 0.04 0.05 Value of the coefficients Filter coefficient index Annotator1 Annotator2 Annotator3 Annotator4 Annotator5 Annotator6 Annotator7 −45 −40 −35 −30 −25 −20 −15 −10 −5 0 −0.02 −0.01 0 0.01 0.02 0.03 0.04 0.05 Value of the coefficients Filter coefficient index Annotator8 Annotator9 Annotator10 Annotator11 Annotator12 Annotator13 Annotator14 −45 −40 −35 −30 −25 −20 −15 −10 −5 0 −0.02 −0.01 0 0.01 0.02 0.03 0.04 0.05 Value of the coefficients Filter coefficient index Annotator15 Annotator16 Annotator17 Annotator18 Annotator19 Annotator20 Annotator21 −45 −40 −35 −30 −25 −20 −15 −10 −5 0 −0.02 −0.01 0 0.01 0.02 0.03 0.04 0.05 Value of the coefficients Filter coefficient index Annotator22 Annotator23 Annotator24 Annotator25 Annotator26 Annotator27 Annotator28 Figure 8.7: Filter coecients estimated by the proposed EM algorithm for each of the annotators. The lter are plotted as d((W 1));::;d(1);d(0) used dur- ing convolution as: a s n (t) = P W1 w=0 a s (tw)d n (w). A higher value for the coecients towards the left in the gure implies a higher emphasis on the past samples. 8.4.4 Discussion The performance results in Table 8.1 are in the expected order. The naive base- line of computing the ground truth as frame-wise mean of the annotator ratings could not be well modeled by the features at hand and thus performs the worst. Adjusting for annotator specic delays and then aggregating the annotator rating performs better than baselines 1(a) and 1(b). However factors such as dierences 130 in annotator biases, range of annotation and context in annotation can not be modeled by imposing a constant delay assumption on the distortion functions. These factors are accounted for in the proposed model by allowing the distortion function to be an LTI lter, thereby providing the best performance. For the Eval2 criterion in predicting the annotator ratings, the proposed model performs the best for most of the annotators. Performance is particularly low in predicting the rat- ings for annotator 27 and 28. This indicates that these annotators are noisy and hard to model, an observation consistent with the inter-rater correlations shown in Figure 8.4. The performances of baseline 1(a) and 2 are comparable for the Eval2 criterion. This stems from the fact that the distortion functions for both these baselines are constrained to be unit response lters (with additional delay allowed for baseline 2), and thus carry low modeling strength in predicting back the annotator ratings. Baseline 1(b) still allows for the distortion function to be an LTI lter which can account for a longer temporal context in predicting annotator ratings from the ground truth (even though the ground truth is a naive frame-wise mean of annotator ratings). In the following section, I make a few more obser- vations regarding the model parameters, the inferred ground truth and eect of removing a few annotators. I note that the interpretation of these parameters only oers a window to the complex cognitive factors. 
Interpreting the distortion function parameters In this section, I plot and interpret various parameters of the distortion function. Figure 8.7 shows the LTI lter coecient values for the 28 annotators, obtained using model training over all the 10 sessions. The bias term in the lter is shown as a stem plot in Figure 8.8. From the lter coecients in Figure 8.7, I can make several observations to compare an annotator with others. For instance, the lter 131 coecients of Annotator 1 are such that the a s samples in the past are weighted higher in convolution to obtain a s 1 . The opposite is true for annotator 6 as a s samples closer to the current frame carry higher weight than the samples in the past. A phase delay analysis of lters from these two annotators suggests that the lter from annotator 1 introduces a greater delay in the ratings than that of annotator 6. Another observation is that the lter coecients for annotator 27 and 28 have lower absolute values. Thus, the ground truth ratings are attenuated to obtain annotations for the annotator. On the other hand, ratings for annotator 15 is obtained after amplication of the ground truth. Overall, the shape of LTI lter co-ecients varies across annotators (e.g. annotators 4 and 10 have a U- shaped lter and annotator 17 and 20 have a more at lter shape). I note that these lters coecients are obtained in a data driven fashion and their phase and magnitude responses provide an ad-hoc quantication of the complex annotation behavior. From the bias terms shown in Figure 8.8, I observe that annotator 14 and 28 have a high positive annotation bias term and annotator 10 has a high negative bias term. These terms are added to the ground truth to obtain the respective annotator ratings. The group of annotators 6, 7, 11 and 24 have a relatively low bias term. I also plot the annotator delays estimated using the baseline 2 in Figure 8.9. Annotators 1, 14, 18 and 28 are estimated to have the longest delays. This observation is fairly consistent with the lter coecient estimates shown in Figure 8.7, where the lter coecients in the past are estimated to carry higher value thereby introducing a larger phase delay. I note that interpretation of these parameters only oers a window to the complex cognitive factors during annotation. The parameters of annotator bias, delay and distortion are estimates obtained as per the model assumptions. They 132 0 5 10 15 20 25 30 −80 −60 −40 −20 0 20 40 60 80 100 Estimated bias term Annotator Id Figure 8.8: Annotator bias d b n estimated using the proposed model. 0 5 10 15 20 25 30 0 20 40 60 80 100 120 140 Estimated annotator lags Annotator Id Figure 8.9: Annotator delays estimated using the baseline 2 proposed in Mari- ooryad et al. [5]. are further in uenced by other factors such as the overall interaction dynamics between the child and the psychologist as well as other latent annotator states (such as their mood and the environment). These factors are not accounted for by my model and can be the subject of a future study. 133 0 20 40 60 80 100 120 140 160 180 200 0 20 40 60 80 100 120 Frame number Estimated ground truth Baseline 1(a)/1(b) Baseline 2 Proposed model Figure 8.10: Ground trutha s as estimated by various baseline and proposed models on an arbitrary section of the data. Inferred ground truth from the annotator ratings I compare the estimated ground truth for an arbitrary segment of the data, from the various baselines and the proposed model in Figure 8.10. 
As expected, I observe that the ground truth estimate from baseline 2 has a phase lead over that estimated from the baseline 1(a)/(b) (compare the peaks in the plot). For the proposed model, a lead is again observed when compared to baseline 1(a)/(b), but not as large as baseline 2. Also, the dynamic range for the segment is higher for the baseline estimated from the proposed model. This results from the capability of the proposed model to be able to account for annotator bias as well as amplify- ing/attenuating their ratings, as discussed in the previous sections. Furthermore, high frequency components in the features get added during the ground truth com- putation using the proposed model (equation (A.16)). The features are otherwise not used during framewise aggregation in the baseline models. 134 Eval1 criteria, Baseline Baseline Proposed correlation with 1a/1b 2 Model the ground truth 0.29 0.31 0.36 Table 8.2: Correlation coecient between the estimated ground truth and the predictions from the feature mapping function after removing annotators 27 and 28 from training. The proposed model is signicantly better than the closest baseline model (baseline 2) based on the Fisher z-transformation test [1], considering value at each frame to be a sample (p-value < 0.001, z-value = 7.7). Performance after removing annotators 27 and 28 Finally, I observe the impact on the performance of the model after removing annotators 27 and 28. I observed that annotators 27 and 28 had the lowest inter- rater correlation with annotators in the Figure 8.4. I remove these annotators during model training and testing. The correlation coecient of feature mapping prediction with the estimated ground truth (Eval1 criterion) is shown in Table 8.2. From the results for Eval1 criterion in Table 8.2, I observe that the performances of all the models are better after removing annotators 27 and 28. Also, the increase in absolute performance is the highest for the proposed model. This indicates that the ground truth estimation in case of the proposed model benets the most after removing noisy annotators. Figure 8.11 show the between the predicted and true annotator ratings (Eval2 criterion). After removing the annotators 27 and 28, the proposed model performs signicantly better than all the other baselines for 21 out of 26 annotators (at p-value < 10% level). In this experiment, I obtain Cohen's D values of .95, .30 and .97 when comparing the correlation coecient samples obtained using the proposed method against baseline 1a, 1b and 2, respectively. This indicates a medium improvement eect size over baseline 1b and strong improvement eect sizes over baselines 1a and 2. The improvement in Cohen's D is primarily obtained due to discounting of annotators 27 and 28, which otherwise lead to an increase in 135 0 5 10 15 20 25 30 0 0.1 0.2 0.3 0.4 Annotator ids Correlation coefficient Baseline 1(a) Baseline 1(b) Baseline 2 Proposed model Figure 8.11: Correlation coecients between the true and predicted annotator ratings based on model trained after removing annotators 27 and 28. A higher implies that the model is better able to model the dependencies between low level features and the annotator ratings. For the correlation coecients obtained using proposed model, indicates a signicant improvement with p-value < 5%, indicates signicant improvement with p-value < 10% but greater than 5%. standard deviation of obtained correlation coecients, as presented in Figure 8.6. 
Note that the annotators 27 and 28 were poorly modeled by the proposed model as seen in Figure 8.6 and therefore their removal helps the proposed model in both, estimating a better ground truth as well as modeling other annotators better. 8.5 Conclusion Several studies employ multiple annotators to model time series over a continuous hidden (unobserved) variable. The ground truth is often substituted by heuristic measures over the available ratings, which are later used for training and eval- uating the model. In this work, I present a novel scheme to model the ratings from multiple annotators using an EM algorithm. My algorithm infers the hidden ground truth based on a feature mapping function and learns a distortion function for each annotator. This distortion function is used by the annotator to provide his perception of the ground truth. Evaluation on smile condence ratings from 136 28 annotators on the Rapid-ABC dataset demonstrates that the proposed model outperforms the baseline cases that substitute ground truth by computing means over annotator ratings or only compensate for delays in the annotator ratings. I further analyze the model parameters and identify annotator specic traits such as annotator bias and delay. My model can be further improved by using schemes similar to those proposed in multiple annotator modeling problems over discrete labels [4, 257, 258]. In this work, I have assumed a specic structure for the feature mapping and distortion functions but other formulations can be tested. The distortion functions from each annotator can also be investigated to study factors such as annotator similarity and reliability. Similarly investigations on feature mapping functions may reveal features best suited for the study. Furthermore, as I pointed out previously, there are several other complex factors that determine factors such as annotator bias and delay (e.g. interaction dynamics in the dyadic conversation, environmental settings). My model does not account for such factors and they can be a subject for future studies to further understand the dynamics of annotation. Finally, this study may be extended to cases involving multidimensional time series, involving joint modeling over each dimension. 137 Chapter 9 Inferring object rankings based on noisy pairwise comparisons from multiple annotators Given a set of items, ranking involves establishing a partial order over the items. This ordering allows comparison between two items, in which the rst is either ranked higher, lower or equal to the second [285]. This is commonly termed as a pairwise approach and has been investigated in relation to information retrieval [286], ranking web pages [287] and even analysis of human behavioral constructs such as emotions [288]. Within the problem of modeling preferences using pairwise comparisons, inferring the true order given comparisons from noisy annotators [289] is very relevant. Often, due to the unavailability of the ground truth, experimenters resort to accumulating judgments from multiple annotators and performing a fusion of their collective knowledge. This trend has existed beyond learning to rank and has also been observed in classication and regression tasks [258]. Particularly within the domain of classication, several researchers have proposed novel ways of jointly modeling the annotators in inferring the latent ground truth [2,4]. 
Although prior research has addressed similar problems within ranking, the methods enforce a specic structure (e.g., Borda count method, Nan- son method [290,291]) on annotator judgments in inferring the latent ground truth. In this work, I present Expectation-Maximization (EM) [292] based algorithms 138 inspired from work in classication problems to infer the latent ground truth in ranking objects. Through these algorithms, I not only aim to relax the ad-hoc constraints imposed in ground truth computation of preferences but also open up possibilities to integrate the existing approaches within ranking and classication addressing similar problems. Given noisy pairwise preferences from multiple annotators, the proposed algo- rithms target to infer a single ground truth ranking while also computing a relia- bility metric for each of the annotators. I assume the ground truth to be a latent variable that can be inferred not only based on the noisy pairwise comparisons from multiple annotators, but also the distribution of a set of attributes/features corresponding to the pair of items being compared. I approach this problem using the Expectation-Maximization (EM) framework [271] and develop a Joint Annotator Modeling (JAM) scheme, inspired from existing literature in model- ing multiple annotators [2, 4]. The JAM schemes assume that, given the set of attributes/features for a pair of objects, there exists a latent true preference order. Furthermore, the annotators either retain or ip this preference order based on an annotator-specic reliability metric, the \probability of ipping". The JAM scheme initially learns the relationship between the attributes of the object pair and the latent ground truth as well as each annotator's \probability of ipping". The nal inference on the preference ground truth is made jointly taking into account the model's belief based on the object attributes and the annotators' pref- erences. I further modify the JAM scheme to allow for non-constant \probability of ipping" based on the pair of objects at hand, termed as Variable Reliability Joint Annotator Modeling (VRJAM) scheme. I compare the JAM and VRJAM schemes to existing methods such as majority voting and fusion after Independent Annota- tor Modeling (IAM) (similar to Borda count method [290]). I evaluate my models 139 on two data sets with synthetic annotations to investigate the impact of annota- tor quality and quantity on my models. I also evaluate my models on two other data sets with annotations from machines (ground truths available) and humans (ground truths not available). I interpret the outcomes of the models based on the data characteristics and suggest a few future directions. In the next section, I provide a background of the relevant work, followed by the description of vari- ous methodologies for inferring latent true preference order from noisy annotator preferences. 9.1 Previous work Several researchers have addressed the problem of learning to rank from pairwise comparisons with applications to a variety of domains. In particular, works by H ullermeier and F urnkranz et al. [285, 293] provide a comprehensive background on preference learning using the pairwise approach. Considering consolidation of other machine learning topics within the framework of ranking, Brinker et al. [294] and Long et al. [295] integrated active learning in ranking problems, Chu et al. [296] provided an extension of Gaussian processes for ranking and He et al. [297] used manifold based ranking for image retrieval. 
Other notable works proposing novel methods and applications for ranking include learning to rank using non-smooth cost functions [298], the McRank algorithm [299] and learning to rank with partially labeled data [300]. Whereas several existing works have addressed other interesting flavors of learning to rank [301], rank aggregation [302] is possibly one of the most well studied fields under this domain. A prominent setting under rank aggregation is learning a probability distribution centered around a single global ranking or a mixture of global rankings. Several works [303-305] present algorithms for rank aggregation using non-negative matrix factorization, nuclear norm minimization and sparse decomposition techniques. A different problem setting under learning to rank is inferring a ground truth ranking from a set of pairwise preferences available from multiple annotators. Chen et al. [306] address this problem and present an active learning framework that selects a pair of objects as well as the annotator to be queried while training a ranking model. Along similar lines, Kumar et al. [307] investigated algorithms to fuse ranking models trained using a noisy crowd.

The formulation of inferring a latent ground truth from noisy annotations is particularly well studied in classification and regression problems. Dawid et al. [3] presented one of the earlier works on fusing annotator beliefs, followed by more recent models by Raykar et al. [2] and Zhou et al. [308]. Audhkhasi et al. [4] further extended the model to account for diversity in the reliability of annotators over the feature space. My algorithm carries goals similar to those of Chen et al. [306] and Kumar et al. [307] in fusing preferences from multiple annotators, with modeling schemes inspired by the proposals of Raykar et al. [2] and Audhkhasi et al. [4]. In the next section, I discuss the algorithms designed for the fusion of noisy pairwise comparisons from multiple annotators along with a few other baseline methods.

9.2 Methodology

Given a set of N items $O = \{O_1, O_2, \ldots, O_N\}$ and K annotators, I represent the k-th annotator's preference of $O_i$ over $O_j$ as $O^k_i \succ O^k_j$. My goal is to infer the latent ground truth, denoted by $O_i \succ O_j$, indicating that $O_i$ is ranked higher than $O_j$. I also assume the availability of attribute/feature values $X = \{x_1, x_2, \ldots, x_N\}$ for each of the N objects, where $x_i$ is a vector of attributes for the item $O_i$. I define the variables $z^k_{ij}$ (k = 1..K) and $z_{ij}$ to represent the preferences of the annotators and the ground truth as follows.

\[ z_{ij} = \begin{cases} 1 & \text{if } O_i \succ O_j \\ 0 & \text{if } O_j \succ O_i \end{cases} \quad \text{and} \quad z^k_{ij} = \begin{cases} 1 & \text{if } O^k_i \succ O^k_j \\ 0 & \text{if } O^k_j \succ O^k_i \end{cases}, \; k = 1..K \qquad (9.1) \]

Below I describe four methods to obtain the ground truth given the noisy pairwise comparisons between items. The first two methods, majority vote and the Independent Annotator Model, serve as baselines. Although fusion with these methods is easy to perform, they assume that each annotator is equally reliable in inferring the ground truth, which may not always be the case. The next two methods, the Joint Annotator Modeling and Variable Reliability Joint Annotator Modeling schemes, learn a reliability metric for each annotator. The final decision is made based on the available annotations as well as the attributes of the pair of objects at hand.

9.2.1 Majority Vote (MV)

Majority voting is one of the most popular methods for merging decisions from multiple annotators and has been consistently used in various classification experiments [87, 309] as well as in ranking [307].
In this method, I say that the inferred preference is $O_i \succ O_j$ if a majority of the annotators say so. In case of a tie among annotators, a random decision is taken between $z_{ij} = 1$ and $z_{ij} = 0$. Note that this model does not use the object attributes X in inferring $z_{ij}$ and relies solely on $z^k_{ij}$, as shown in the graphical model in Figure 9.1(a). Also, each annotator is weighted equally in deciding the majority.

9.2.2 Independent Annotator Modeling (IAM)

In this scheme, I initially train annotator-specific ranking models to capture the relation between object attributes and each annotator's preference rankings. The ranking model for the k-th annotator returns a score $f_k(x_i)$ for every object $O_i$ based on the attributes $x_i$. Finally, the inferred ground truth value for $z_{ij}$ is given by comparing the sums of the scores $f_k(x_i)$ and $f_k(x_j)$ over all the annotators (k = 1..K). This method is analogous to the Bradley-Terry model [310] (extended by Chen et al. [306]) and the Borda count method [290] used for aggregating decisions from multiple annotators. In the case of the Bradley-Terry model, the preference between two objects is determined based on their relevance scores, which in the current IAM scheme are computed as the sum of $f_k(\cdot)$ over all the annotators. Similarly, in the Borda count method, each annotator scores every object and $z_{ij}$ is inferred by comparing the sum of scores across all the annotators. In this scheme, my substitute for the Borda count score for $O_i$, as given by the annotator k, is the value $f_k(x_i)$. I describe the model training and ground truth inference in detail below.

Training annotator specific models: Given the k-th annotator's pairwise preferences $z^k_{ij}$, I train an annotator-specific Support Vector Ranker (SVR) [311] as the function $f_k$. My goal is to learn $f_k$ for every annotator k, such that the following holds.

\[ O^k_i \succ O^k_j \iff z^k_{ij} = 1 \iff f_k(x_i) > f_k(x_j) \qquad (9.2) \]

In this work, I chose $f_k$ to be a linear function characterized by a weight vector $w_k$ such that $f_k(x_i) = \langle w_k, x_i \rangle$, where $\langle w_k, x_i \rangle$ represents the dot product between $w_k$ and $x_i$. An SVR targeting the problem in (9.2) performs the following optimization on the cost function $\mathcal{M}_k$ [311].

\[ w_k = \arg\min_{w_k} \mathcal{M}_k = \arg\min_{w_k} \sum_{\text{all pairs } x_i, x_j} z^k_{ij}\,[1 - \langle w_k, \{x_i - x_j\} \rangle]_+ + (1 - z^k_{ij})\,[1 - \langle w_k, \{x_j - x_i\} \rangle]_+ \qquad (9.3) \]

In the equation above, $\{x_i - x_j\}$ depicts a notion of a difference operator between $x_i$ and $x_j$, and $[\,\cdot\,]_+$ represents the standard hinge loss function [312]. In this work, I use $\{x_i - x_j\}$ as a simple element-wise subtraction between the attribute vectors $x_i$ and $x_j$. I learn $w_k$ ($\forall k = 1..K$) using the standard gradient descent algorithm [313]. Since $\mathcal{M}_k$ is non-differentiable, I use the approximation suggested by Rennie et al. [314] for the hinge loss function.

Fusing annotator models: After obtaining $f_k$ for each of the annotators, I say $z_{ij} = 1$ if:

\[ \sum_{k=1}^{K} f_k(x_i) > \sum_{k=1}^{K} f_k(x_j) \qquad (9.4) \]

A graphical model representing this scheme is shown in Figure 9.1(b). Note that, in order to obtain $z_{ij}$, an unweighted combination of the $f_k$ outputs is enforced.
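The IAM training and fusion just described can be summarized in a short sketch. This is not the implementation used in this work; it is a minimal illustration of the pairwise hinge loss in (9.3) and the unweighted fusion in (9.4), assuming plain NumPy and a basic subgradient descent (the smooth approximation of Rennie et al. [314] is omitted), with all function and variable names hypothetical.

```python
import numpy as np

def train_pairwise_svr(X, pairs, labels, lr=0.01, epochs=100):
    """Fit a linear scoring function f(x) = <w, x> from one annotator's
    pairwise preferences by subgradient descent on the hinge loss in (9.3).

    X      : (N, d) array of object attributes.
    pairs  : list of (i, j) index pairs the annotator compared.
    labels : list of z^k_ij values aligned with pairs (1 if O_i preferred).
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for (i, j), z in zip(pairs, labels):
            # Difference vector oriented so that a correct ranking gives <w, d> > 0.
            d = X[i] - X[j] if z == 1 else X[j] - X[i]
            # Hinge loss [1 - <w, d>]_+ has subgradient -d when the margin < 1.
            if 1.0 - w @ d > 0.0:
                w += lr * d
    return w

def iam_fuse(weight_vectors, X, i, j):
    """Unweighted IAM fusion (9.4): z_ij = 1 when the summed annotator
    scores prefer O_i over O_j."""
    total = sum(w @ X[i] - w @ X[j] for w in weight_vectors)
    return int(total > 0.0)
```

In such a sketch, train_pairwise_svr would be called once per annotator and the resulting K weight vectors passed to iam_fuse for every object pair.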
9.2.3 Joint Annotator Modeling (JAM)

In this section, I propose an Expectation-Maximization (EM) algorithm [292] to infer the ground truth by jointly modeling the noisy comparisons. My algorithm is inspired by similar works [2, 4] in the domain of classification problems. A graphical model for this scheme is shown in Figure 9.1(c). I assume the ground truth $z_{ij}$ to be a latent variable that can be inferred using the object attributes $x_i$ and $x_j$.

Figure 9.1: Graphical models for (a) the Majority Vote (MV), (b) the Independent Annotator Model (IAM), (c) the Joint Annotator Model (JAM) and (d) the Variable Reliability Joint Annotator Model (VRJAM) schemes.

My choice for inferring $z_{ij}$ based on $x_i, x_j$ is again an SVR model with a weight vector $w$. Furthermore, I assume that $z^k_{ij}$ is obtained by flipping the binary variable $z_{ij}$ with a probability $r_k$. In summary, this model assumes that there is an inherent true preference given the attributes of the two objects, and the annotators flip it based on annotator-specific probabilities ($r_k$, k = 1..K). Consequently, the probability $r_k$ also provides a measure of annotator quality, as a higher $r_k$ implies higher chances of an annotator committing an error. I infer the latent ground truth $z_{ij}$ using an EM algorithm described in the next section.

Expectation-Maximization algorithm

The EM algorithm maximizes the log-likelihood $\mathcal{L}$ of the observed data, that is, the annotator preferences given the object attributes and the model parameters. In my case, $\mathcal{L}$ is given as shown in (9.5). Notice the introduction of the latent ground truth $z_{ij}$ into $\mathcal{L}$ in (9.6).

\[ \mathcal{L} = \sum_{\text{all pairs } x_i, x_j} \log p(z^1_{ij},\ldots,z^K_{ij} \mid x_i, x_j, w, r_1,\ldots,r_K) \qquad (9.5) \]
\[ \phantom{\mathcal{L}} = \sum_{\text{all pairs } x_i, x_j} \log \sum_{z_{ij}} p(z_{ij}, z^1_{ij},\ldots,z^K_{ij} \mid x_i, x_j, w, r_1,\ldots,r_K) \qquad (9.6) \]

Following the EM derivation procedure in section 9.4 of [271], I introduce a distribution over the latent ground truth $z_{ij}$: $q(z_{ij})$. Consequently, $\mathcal{L}$ can be written as a sum of two terms, a Kullback-Leibler (KL) divergence term $\mathrm{KL}(q\,\|\,p)$ and another log-likelihood term $\mathcal{M}$, as shown in (9.7).

\[ \mathcal{L} = \mathcal{M} + \mathrm{KL}(q\,\|\,p) \qquad (9.7) \]

where

\[ \mathcal{M} = \sum_{\text{all pairs } x_i, x_j} \sum_{z_{ij}} q(z_{ij}) \log \left\{ \frac{p(z_{ij}, z^1_{ij},\ldots,z^K_{ij} \mid x_i, x_j, w, r_1,\ldots,r_K)}{q(z_{ij})} \right\} \qquad (9.8) \]
\[ \mathrm{KL}(q\,\|\,p) = -\sum_{\text{all pairs } x_i, x_j} \sum_{z_{ij}} q(z_{ij}) \log \left\{ \frac{p(z_{ij} \mid z^1_{ij},\ldots,z^K_{ij}, x_i, x_j, w, r_1,\ldots,r_K)}{q(z_{ij})} \right\} \qquad (9.9) \]

The EM algorithm consists of two steps: the E and M steps. In the E-step, $\mathcal{M}$ is maximized with respect to $q(z_{ij})$ while holding the other parameters constant. The solution is equivalent to the posterior distribution $p(z_{ij} \mid z^1_{ij},\ldots,z^K_{ij}, x_i, x_j, w, r_1,\ldots,r_K)$. In the M-step, $\mathcal{M}$ is maximized with respect to the model parameters while holding the estimated distribution $q(z_{ij})$ constant. I describe the parameter initialization followed by the E and M steps below.

Initialization: I randomly initialize the SVR weight vector $w$ and the probabilities of flipping $r_k$ (k = 1..K).

While $w, r_1,\ldots,r_K$ have not converged, perform the E and M-steps, where:

E-step: In the E-step, I set the probability distribution $q(z_{ij})$ equal to $p(z_{ij} \mid z^1_{ij},\ldots,z^K_{ij}, x_i, x_j, w, r_1,\ldots,r_K)$. This quantity can be represented as shown in (9.10). A detailed derivation for this quantity can be seen in Appendix 1.

\[ q(z_{ij}) = p(z_{ij} \mid z^1_{ij},\ldots,z^K_{ij}, x_i, x_j, w, r_1,\ldots,r_K) = \frac{p(z_{ij} \mid x_i, x_j, w) \prod_{k=1}^{K} p(z^k_{ij} \mid z_{ij}, r_k)}{p(z^1_{ij},\ldots,z^K_{ij})} \qquad (9.10) \]

Note that the first term $p(z_{ij} \mid x_i, x_j, w)$ in (9.10) is conditioned only on the SVR model parameters $w$ and the object attributes $x_i$ and $x_j$. Since the SVR is not a probabilistic model, I apply a trick commonly used with support vector machine classifiers to obtain class probabilities.
The trick involves fitting logistic models to the distance from the decision hyperplane to obtain the probabilities of preference decisions [315] (a comparison of the hinge loss function and the logistic loss function is made in Appendix 3). Equations (9.11) and (9.12) show the computation of $p(z_{ij} \mid x_i, x_j, w)$ using the logistic model.

\[ p(z_{ij} = 1 \mid x_i, x_j, w) = \frac{\exp \langle w, \{x_i - x_j\} \rangle}{1 + \exp \langle w, \{x_i - x_j\} \rangle} \qquad (9.11) \]
\[ p(z_{ij} = 0 \mid x_i, x_j, w) = 1 - p(z_{ij} = 1 \mid x_i, x_j, w) \qquad (9.12) \]

The second term $p(z^k_{ij} \mid z_{ij}, r_k)$ in (9.10) is $r_k$ if $z^k_{ij}$ and $z_{ij}$ are in disagreement and $1 - r_k$ otherwise, as shown below.

\[ p(z^k_{ij} \mid z_{ij}, r_k) = \begin{cases} r_k & \text{if } z^k_{ij} \neq z_{ij} \\ 1 - r_k & \text{if } z^k_{ij} = z_{ij} \end{cases} \qquad (9.13) \]

Replacing the values in (9.10) using (9.11) and (9.13), I can represent $q(z_{ij} = 1)$ as shown in (9.14); $q(z_{ij} = 0)$ can be computed accordingly. In the product term, $r_k$ (respectively $1 - r_k$) is multiplied in when $z^k_{ij} = 0$ (respectively $z^k_{ij} = 1$).

\[ q(z_{ij} = 1) = \frac{\exp \langle w, \{x_i - x_j\} \rangle}{1 + \exp \langle w, \{x_i - x_j\} \rangle} \prod_{k=1}^{K} (r_k)^{(1 - z^k_{ij})} (1 - r_k)^{z^k_{ij}} \Big/ \, p(z^1_{ij},\ldots,z^K_{ij}) \qquad (9.14) \]

Note that the denominator $p(z^1_{ij},\ldots,z^K_{ij})$ is common to $q(z_{ij} = 1)$ and $q(z_{ij} = 0)$ and need not be computed: I can simply compute the numerator in (9.10) for $q(z_{ij} = 1)$ and $q(z_{ij} = 0)$ and normalize these probabilities to sum to one. Next, I discuss the M-step.

M-step: In this step, I estimate the model parameters $w, r_k$ (k = 1..K) based on the estimated distribution $q(z_{ij})$. These parameters are estimated by maximizing $\mathcal{M}$ after substituting the $q(z_{ij})$ estimated in the E-step. In my case, $\mathcal{M}$ can be written as shown in (9.15). $H(q(z_{ij}))$ is the entropy of $q(z_{ij})$ and is constant with respect to the model parameters $w, r_1,\ldots,r_K$; I disregard the entropy term in the further M-step derivations.

\[ \mathcal{M} = \sum_{z_{ij}} q(z_{ij}) \log p(z_{ij}, z^1_{ij},\ldots,z^K_{ij} \mid x_i, x_j, w, r_1,\ldots,r_K) + H(q(z_{ij})) \qquad (9.15) \]

I can rewrite $\mathcal{M}$ as shown in (9.16); for a detailed derivation, please see Appendix 2. Note that each parameter $w, r_1,\ldots,r_K$ appears in a separate term within the summation in (9.16), and thus I only need to consider the corresponding term while optimizing for a parameter. I discuss the optimization for the SVR parameters $w$ and the flipping probabilities $r_k$ below.

\[ \mathcal{M} = \sum_{z_{ij}} q(z_{ij}) \Big[ \sum_{k=1}^{K} \log p(z^k_{ij} \mid z_{ij}, r_k) + \log p(z_{ij} \mid x_i, x_j, w) \Big] \qquad (9.16) \]

Obtaining the SVR weight vector w: I only need to consider the following term $\mathcal{M}_w$ within $\mathcal{M}$ to optimize for $w$.

\[ \mathcal{M}_w = \sum_{z_{ij}} q(z_{ij}) \log p(z_{ij} \mid x_i, x_j, w) \qquad (9.17) \]

In the EM algorithm, $\log p(z_{ij} \mid x_i, x_j, w)$ would be obtained from a probabilistic model inferring $z_{ij}$ conditioned on $x_i, x_j, w$. However, since my model of choice is a non-probabilistic SVR, I instead solve the optimization in (9.18) to obtain $w$. I would like to point out that this is an approximation I use in the EM algorithm. Appendix 3 shows the probability distribution corresponding to the logistic model used in (9.11) and its relation to the following optimization.

\[ w = \arg\min_{w} \mathcal{M}'_w = \arg\min_{w} \; q(z_{ij} = 1)\,[1 - \langle w, \{x_i - x_j\} \rangle]_+ + q(z_{ij} = 0)\,[1 - \langle w, \{x_j - x_i\} \rangle]_+ \qquad (9.18) \]

Note that $\mathcal{M}'_w$ in (9.18) is similar to the cost function $\mathcal{M}_k$ defined in (9.3) for training the annotator-specific models. However, instead of being trained on binary decision values (e.g., the $z^k_{ij}$ used in $\mathcal{M}_k$), $\mathcal{M}'_w$ is defined over the soft estimate $q(z_{ij})$.
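For concreteness, a minimal sketch of the E-step posterior in (9.14) and of a single subgradient step on the soft-weighted hinge loss in (9.18) is given below. It assumes NumPy, treats a single object pair at a time and uses hypothetical names; it illustrates the update equations rather than reproducing the exact implementation used here.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def e_step_q1(w, x_i, x_j, z_k, r):
    """Posterior q(z_ij = 1) from (9.14) for a single object pair.

    z_k : length-K array of annotator labels z^k_ij (0/1).
    r   : length-K array of flipping probabilities r_k.
    """
    p1 = sigmoid(w @ (x_i - x_j))                     # logistic surrogate (9.11)
    z_k, r = np.asarray(z_k), np.asarray(r)
    # (9.13): an annotator contributes r_k when its label disagrees with z_ij.
    like1 = np.prod(np.where(z_k == 1, 1.0 - r, r))   # likelihood if z_ij = 1
    like0 = np.prod(np.where(z_k == 0, 1.0 - r, r))   # likelihood if z_ij = 0
    num1, num0 = p1 * like1, (1.0 - p1) * like0
    return num1 / (num1 + num0)                       # normalization step

def m_step_w_update(w, x_i, x_j, q1, lr=0.01):
    """One subgradient step on the soft-weighted hinge loss in (9.18)."""
    d = x_i - x_j
    w = w.copy()
    if 1.0 - w @ d > 0.0:        # term weighted by q(z_ij = 1)
        w += lr * q1 * d
    if 1.0 + w @ d > 0.0:        # term weighted by q(z_ij = 0); reversed orientation
        w -= lr * (1.0 - q1) * d
    return w
```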
Next, I discuss the optimization problem to obtain $r_1,\ldots,r_K$.

Obtaining the probability of flipping $r_1,\ldots,r_K$: In order to obtain $r_k$, I need to optimize the following term within $\mathcal{M}$.

\[ r_k = \arg\max_{r_k} \mathcal{M}^k_r = \arg\max_{r_k} \sum_{\text{all pairs } x_i, x_j} \sum_{z_{ij}} q(z_{ij}) \log p(z^k_{ij} \mid z_{ij}, r_k) \qquad (9.19) \]

$p(z^k_{ij} \mid z_{ij}, r_k)$ is replaced in the above equation as shown in (9.13), and the term can then be optimized to obtain $r_k$. I obtain the final inference for $z_{ij}$ as discussed below.

Final inference: After convergence, I make the final inference on $z_{ij}$ based on the obtained distribution $p(z_{ij} \mid z^1_{ij},\ldots,z^K_{ij}, x_i, x_j, w, r_1,\ldots,r_K)$, as was derived in (9.10)-(9.14). $z_{ij}$ is inferred to be 1 or 0 as per the following equation.

\[ p(z_{ij} = 1 \mid z^1_{ij},\ldots,z^K_{ij}, x_i, x_j, w, r_1,\ldots,r_K) \; \underset{0}{\overset{1}{\gtrless}} \; p(z_{ij} = 0 \mid z^1_{ij},\ldots,z^K_{ij}, x_i, x_j, w, r_1,\ldots,r_K) \qquad (9.20) \]

Next, I propose a modification to this scheme that considers the annotators' probability of flipping to be variable.

9.2.4 Variable Reliability Joint Annotator Modeling (VRJAM)

This scheme is similar to the joint annotator model proposed in the previous section except that the probability of flipping $r_k$ is variable. The motivation behind this scheme is that annotators may have variable reliability depending upon the pair of objects $O_i$ and $O_j$ at hand (a similar assumption is made in the model proposed by Audhkhasi et al. [4]). Therefore, instead of a constant $r_k$ for the annotator k, I determine a vector $R_k = [r^1_k,\ldots,r^D_k]$, where, based on the difference vector $\{x_i - x_j\}$, one of the values $r^d_k$ (d = 1..D) is chosen as the probability of flipping. I retain the assumption that $z_{ij}$ is a latent variable conditioned on $x_i, x_j$ and the SVR weight vector $w$. I again train this model using an EM algorithm described below. The algorithm is similar to the EM algorithm in Section 9.2.3, and I borrow several steps for the sake of brevity.

Expectation-Maximization algorithm

For the purpose of my experiments, I divide the space spanned by the difference vectors $\{x_i - x_j\}$ into D clusters. For the k-th annotator, a distinct probability of flipping $r^d_k$ (d = 1..D) is computed in each cluster. I obtain the clusters by performing standard K-means clustering [316] on the values $\{x_i - x_j\}$ obtained over all pairs $x_i, x_j \in X$. The membership of $\{x_i - x_j\}$ in a cluster is denoted by a 1-in-D encoding vector $m_{ij} = [m^1_{ij},\ldots,m^D_{ij}]$, where $m^d_{ij} = 1$ indicates that $\{x_i - x_j\}$ belongs to the d-th cluster. The overall graphical model for this scheme is represented in Figure 9.1(d). The graphical model is very similar to the one in Figure 9.1(c), except that $m_{ij}$ now determines the flipping probability. The data log-likelihood term $\mathcal{L}$ in (9.5) changes slightly to incorporate $R_1,\ldots,R_K$ and $m_{ij}$ (instead of the scalar values $r_1,\ldots,r_K$), as represented by $\mathcal{L}'$ in (9.21). I perform the initialization, the E and M-steps and the final inference as discussed in the next section.

\[ \mathcal{L}' = \sum_{\text{all pairs } x_i, x_j} \log p(z^1_{ij},\ldots,z^K_{ij} \mid x_i, x_j, w, R_1,\ldots,R_K, m_{ij}) \qquad (9.21) \]

Initialization: I randomly initialize the SVR weight vector $w$ and the vectors $R_k$ for all the annotators. I perform K-means clustering to segment the space spanned by $\{x_i - x_j\}, \forall x_i, x_j \in X$. The number of clusters D is set empirically by gradually increasing D until the distance between two cluster centroids falls below a threshold (compared to the distances to other centroids).
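A small sketch of the cluster-membership encoding $m_{ij}$ described above follows. It assumes scikit-learn's KMeans and, purely for illustration, fixes D instead of selecting it with the centroid-distance heuristic; the names are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_difference_clusters(X, pairs, n_clusters=4, seed=0):
    """Cluster the difference vectors {x_i - x_j} over all compared pairs."""
    diffs = np.array([X[i] - X[j] for i, j in pairs])
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(diffs)

def membership_vector(kmeans, x_i, x_j):
    """1-in-D encoding m_ij: m_ij[d] = 1 iff x_i - x_j falls in cluster d,
    so that <R_k, m_ij> picks annotator k's cluster-specific flipping probability."""
    d = int(kmeans.predict((x_i - x_j).reshape(1, -1))[0])
    m = np.zeros(kmeans.n_clusters)
    m[d] = 1.0
    return m
```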
While $w, R_1,\ldots,R_K$ have not converged, perform the E and M-steps, where:

E-step: The E-step is the same as the E-step in Section 9.2.3. The only difference is that $p(z^k_{ij} \mid z_{ij}, r_k)$ in (9.10) is replaced by $p(z^k_{ij} \mid z_{ij}, R_k, m_{ij})$, which equals the quantity in (9.22). Here, $\langle R_k, m_{ij} \rangle$ represents a dot product between $R_k$ and $m_{ij}$ that selects an entry in $R_k$ based on the cluster index corresponding to $\{x_i - x_j\}$.

\[ p(z^k_{ij} \mid z_{ij}, R_k, m_{ij}) = \begin{cases} \langle R_k, m_{ij} \rangle & \text{if } z^k_{ij} \neq z_{ij} \\ 1 - \langle R_k, m_{ij} \rangle & \text{if } z^k_{ij} = z_{ij} \end{cases} \qquad (9.22) \]

Consequently, $q(z_{ij} = 1)$ is computed as shown in (9.23), where $\langle R_k, m_{ij} \rangle$ (respectively $1 - \langle R_k, m_{ij} \rangle$) is multiplied in when $z^k_{ij} = 0$ (respectively $z^k_{ij} = 1$). After estimating $q(z_{ij})$, I estimate the model parameters as discussed next.

\[ q(z_{ij} = 1) = \frac{\exp \langle w, \{x_i - x_j\} \rangle}{1 + \exp \langle w, \{x_i - x_j\} \rangle} \prod_{k=1}^{K} \langle R_k, m_{ij} \rangle^{(1 - z^k_{ij})} (1 - \langle R_k, m_{ij} \rangle)^{z^k_{ij}} \Big/ \, p(z^1_{ij},\ldots,z^K_{ij}) \qquad (9.23) \]

M-step: In the M-step, I re-estimate the parameter $w$ and the vectors $R_k$. The value of $\mathcal{M}$ also changes in this formulation to incorporate $R_1,\ldots,R_K$ and $m_{ij}$: $p(z^k_{ij} \mid z_{ij}, r_k)$ in (9.16) is replaced by $p(z^k_{ij} \mid z_{ij}, R_k, m_{ij})$. This has no impact on the estimation of $w$, which remains the same as in Section 9.2.3. I describe the estimation of the vector $R_k$ below.

Obtaining the probability of flipping entries in $R_k$: The optimization framework to obtain $R_k$ is shown below.

\[ R_k = \arg\max_{R_k} \sum_{\text{all pairs } x_i, x_j} \sum_{z_{ij}} q(z_{ij}) \log p(z^k_{ij} \mid z_{ij}, R_k, m_{ij}) \qquad (9.24) \]

The above optimization over the vector $R_k$ can easily be broken down into a scalar optimization over each of its entries after replacing $p(z^k_{ij} \mid z_{ij}, R_k, m_{ij})$ as shown in (9.22). I next discuss the final step of inferring $z_{ij}$.

Final inference: The final inference on $z_{ij}$ is made based on the following likelihood comparison once the model converges. This inference is similar to the one in the JAM scheme.

\[ p(z_{ij} = 1 \mid z^1_{ij},\ldots,z^K_{ij}, x_i, x_j, w, R_1,\ldots,R_K, m_{ij}) \; \underset{0}{\overset{1}{\gtrless}} \; p(z_{ij} = 0 \mid z^1_{ij},\ldots,z^K_{ij}, x_i, x_j, w, R_1,\ldots,R_K, m_{ij}) \qquad (9.25) \]

In the next section, I evaluate the various fusion schemes on several datasets with synthetic annotations as well as annotations obtained from machines and humans.

9.3 Experimental Results

I test the discussed ranking algorithms on two synthetically created data sets and two real world data sets, as discussed next.

9.3.1 Data sets with synthetic annotations

I use the two wine quality data sets (red and white wine) [317] available in the UCI data repository [318]. Each data set provides 11 attributes for each entry and a quality score between 0 and 10 (10 being the best). In a pairwise comparison between two entries $O_i$ and $O_j$, I say that the ground truth is $z_{ij} = 1$ if $O_i$ has a higher quality score than $O_j$. Below I provide a short description of the synthetic creation of noisy annotator labels from this data set, followed by a set of three experiments investigating the reliability inference for each annotator and the effect of the quality and number of annotators.

Creating synthetic noisy annotations: Given the number of annotators K, I create synthetic noisy annotations for the k-th annotator by flipping the ground truth $z_{ij}$ based on a Bernoulli variable. The parameter of the Bernoulli variable for annotator k is denoted by $b_k$, and a higher $b_k$ implies higher chances of $z_{ij}$ being flipped. In the first experiment, presented in the next section, I investigate the relation between the $b_k$ used for each annotator and the probability of flipping $r_k$ determined by my joint annotator models.

Relationship between probability of flipping and annotator noise

In this experiment, I use a set of 6 noisy annotators with $b_k = k/20$. That is, the first annotator is the best annotator with only a 5% chance of flipping, whereas the sixth annotator has a 30% chance of flipping.
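A minimal sketch of this synthetic-annotation procedure (assuming NumPy; names are hypothetical):

```python
import numpy as np

def make_synthetic_annotations(z_true, num_annotators=6, seed=0):
    """Annotator k flips each ground-truth preference z_ij independently
    with probability b_k = k / 20 (k = 1..num_annotators)."""
    rng = np.random.default_rng(seed)
    z_true = np.asarray(z_true)          # flat array of 0/1 ground-truth preferences
    annotations = []
    for k in range(1, num_annotators + 1):
        flips = rng.random(z_true.shape) < k / 20.0
        annotations.append(np.where(flips, 1 - z_true, z_true))
    return annotations
```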
I train the Joint Annotator Model (JAM) and the Variable Reliability Joint Annotator Model (VRJAM). Table 9.1 shows the values of $r_k$ estimated using JAM and the mean value of the vector $R_k$ estimated using VRJAM on the red wine data set (similar patterns are observed for the white wine data set). Higher values of $r_k$ and of the mean of $R_k$ imply that the annotator k is inferred to be noisier. I also show the model accuracy in inferring the ground truth $z_{ij}$ over all pairs of objects in the data set in Table 9.2.

Model    Parameter    Values for k = 1..6, b_k = k/20
JAM      r_k          {.032, .086, .176, .196, .246, .273}
VRJAM    mean(R_k)    {.033, .085, .175, .196, .245, .273}
Table 9.1: Values of r_k and mean(R_k) obtained on the red wine data set.

Data set     MV     IAM    JAM    VRJAM
Red wine     95.9   55.2   97.9   98.0
White wine   96.1   55.3   97.9   98.1
Table 9.2: Accuracy in inferring z_ij in the synthetic data sets.

From Table 9.1, I observe that as the noise increases over annotators, my model successfully infers a higher probability of flipping. The values $r_k$ and the mean of the vector $R_k$ are fairly close to each other, indicating that the JAM and VRJAM models are very similar in inferring the probability of flipping. This is expected, as VRJAM differs from JAM only in determining cluster-wise probabilities, and their average should be fairly close to $r_k$.

From Table 9.2, I observe that the proposed models outperform Majority Vote (MV) and Independent Annotator Modeling (IAM). The difference in performance between JAM and VRJAM is not significant. This stems from the choice of synthetic annotation generation, as the noise added to the annotations is uniform and does not change based on the pair of objects at hand. Therefore VRJAM has no particular modeling advantage over the JAM scheme. Also, the performance of IAM is particularly low. My investigation reveals that the performances of the individual annotator SVR models ($f_k$ in Section 9.2.2) were very low (e.g., they varied between 53.0% and 64.4% on the red wine data set). Since IAM performs a sum of $f_k$ over these fairly weak models, the final performance is poor. This shows that the IAM performance is contingent upon the model choice and can improve with a better choice of $f_k$. However, an interesting point to note here is that the IAM performance (e.g., 55.2% for the red wine data set) lies between the performance of the best annotator (64.4% for the red wine data set) and the worst annotator (53.0% for the red wine data set). This reflects the fact that IAM is susceptible to performing below the collective knowledge of the crowd and can perform worse than the best available annotator.

Relationship between model performances and annotator noise

In this section, I perform multiple experiments similar to the one mentioned in the previous section. I choose a set of 6 annotators and in each experiment I increase the parameter $b_k$. Within an experiment, $b_k$ for the annotator k is set at $\alpha \cdot k/20$, where the multiplier $\alpha$ starts at 1 and is increased by 10% over consecutive experiments. I plot the accuracy of the MV, IAM, JAM and VRJAM algorithms in inferring the ground truth $z_{ij}$ with increasing $\alpha$ in Figure 9.2. From the figure, I note that the model performance drops as the annotator noise increases. The performances of the VRJAM and JAM schemes are again similar, for the reasons stated in the previous section. Another interesting observation is that the performances of MV, JAM and VRJAM converge as the annotator noise increases. This indicates that the joint models are likely to perform better than MV with better quality annotators.
The IAM performances are again low, attributable to the weak annotator modeling by the SVRs.

Figure 9.2: Model performances with increasing annotator noise (accuracy in predicting the ground truth vs. the noise multiplier α; panels: (a) red wine data set, (b) white wine data set; MV, JAM, VRJAM and IAM shown).
Figure 9.3: Model performances with increasing number of annotators (accuracy in predicting the ground truth vs. K; panels: (a) red wine data set, (b) white wine data set; MV, JAM, VRJAM and IAM shown).

Relationship between model performances and number of annotators

In this section, I perform multiple experiments by varying the total count of annotators K. The parameter $b_k$ for the annotator k is kept constant at $k/20$. Figure 9.3 shows the plots of model performance as K is varied from 3 to 9. In this case, I observe that, except for IAM, the performance of all models increases with an increase in the number of annotators. This indicates that the addition of more noisy annotators (as $b_k < b_{k+1}$) tends to decrease the IAM performance. Also, the JAM and VRJAM models provide a greater improvement over MV with the addition of more annotators. The performances of the MV, JAM and VRJAM models are the same at K = 3, and the absolute improvement of the joint models over MV increases as I add more annotators. VRJAM and JAM again perform at similar levels. As stated, I attribute this to the nature of my synthetic label creation, where noisy annotators flip $z_{ij}$ solely based on $b_k$ and not based on the object attributes $x_i, x_j$. In the next section, I test my algorithms on data sets with machine/human annotations and analyze the results.

9.3.2 Data set with machine/human annotations

I show the results for two real world data sets, one annotated by machine experts and the other by naive mechanical turk workers. I discuss the results for these two datasets below.

Digit ranking dataset: Machine annotation

I use a subset of the pen-based recognition of handwritten digits dataset [319] to rank images based on the digit value contained (for instance, an image with the digit 9 is ranked higher than an image containing any other digit). The dataset contains 1k samples of images with 16 features, leading to 370k possible comparisons (I do not consider comparisons between images containing the same values). I initially annotate the dataset using a set of five classifiers as machine annotators: K-Nearest Neighbors (KNN), Logistic Regression, Naive Bayes, Random Forest and Perceptron [271]. These annotations are obtained using a 10-fold cross-validation framework. Each classifier is trained on a subset of 3-4 features (out of 16) on 90% of the data and results are obtained on the remaining 10%. This process is repeated until I annotate the entire data set using the classifiers. Note that in this dataset I have access to the ground truth, which may not always be the case (as with the dataset in the next section).

Classifier     KNN    LR     NB     RF     Perc.
Performance    67.8   69.1   69.0   72.0   59.9
Table 9.3: Ratio of pairwise comparisons in which a classifier ranks the image containing the greater value higher than the other image in the pair (KNN: KNN classifier, LR: Logistic Regression, NB: Naive Bayes classifier, RF: Random Forests classifier, Perc.: Perceptron).
Fusion scheme   MV     IAM    JAM    VRJAM
Performance     78.0   65.9   78.1   79.7
Table 9.4: Performance of the fusion schemes on the pairwise comparisons z^k_ij obtained from the machine annotators.

Table 9.3 shows the performance of each classifier as a machine annotator in pairwise comparisons between images. Table 9.4 shows the performance of the fusion schemes operating over the machine annotations thus obtained. I use the entire set of 16 features in the JAM and VRJAM fusion schemes. From Table 9.3, I see that the machine annotators perform in the range of 59% to 72% on the metric of pairwise comparison accuracy. The results in Table 9.4 indicate that the MV, JAM and VRJAM schemes outperform the best machine annotator, i.e., random forests. Whereas the performances of MV and JAM are not significantly different, VRJAM performs significantly better than both the MV and JAM schemes (McNemar's test [320], significance level: 5%, computed over the 370k comparison samples). This indicates that assigning a flipping probability conditioned on the pair of images at hand is essential in this data set. The IAM scheme again fails to beat the best annotator and performs at a value within the range of the best and the worst annotators. This indicates that an unweighted fusion of experts may perform below the collective knowledge of the crowd and that weighting annotators based on individual performances may help. In the next section, I test the fusion schemes on another real data set, with human annotators.

Attribute        MV     IAM    JAM    VRJAM
Expressiveness   64.3   61.5   64.3   65.4
Naturalness      55.7   52.7   55.9   57.7
Table 9.5: Ratio of times TD kids are inferred to have a higher rank than HFA kids, per fusion scheme. TD kids are expected to be more expressive/natural.

Safari Bob dataset

In this section, I test my algorithms on the Safari Bob data set [321]. This data set involves two populations, High Functioning Autism (HFA) and Typically Developing (TD) individuals, retelling a story based on a video stimulus. The recordings of the story retelling are later rated by naive Mechanical Turk (MTurk) raters for expressiveness and naturalness on a scale from 0 to 4 (4 being the best). I use a set of 40 TD kids and 65 HFA kids rated by 5 annotators and infer the ground truth expressiveness and naturalness from the available ratings. The attributes $x_i$ I use to train the models are statistical functionals extracted on prosodic and spectral features from the kid's speech (mean and variance of pitch, intensity, Mel filter banks and Cepstral Coefficients), as are also used in [87, 321]. Since I do not have the ground truth available for evaluation, I analyze the association of the inferred expressiveness and naturalness with the population attributes of HFA and TD. Although the relationship between autism and expressiveness/naturalness is fairly complex and undergoing extensive investigation [322], TD kids are expected to be ranked higher in expressiveness/naturalness than HFA kids [321]. I infer the latent ground truth for expressiveness/naturalness using my set of models and show (Table 9.5) the proportion of times the models infer TD kids to have a higher expressiveness/naturalness than HFA kids. From the results, I observe that a TD kid is more often inferred to have a
higher expressiveness/naturalness than an HFA kid. Whereas the outputs for MV and JAM are fairly close to each other, the output from VRJAM has the highest proportion of times that a TD kid is inferred to be more expressive/natural than an HFA kid. This trend is encouraging, although the relation between speech expressiveness/naturalness and autism may not be this straightforward. Due to the unavailability of the ground truth, this experiment cannot be used to support the efficacy of the proposed algorithms. However, the observed results motivate the application of the proposed algorithms to data sets where the ground truth is unobserved. Overall, the experiments on synthetic, machine and human annotations in this section provide an understanding of the proposed algorithms with respect to annotator reliability, quality and number of annotators. Although the performance of VRJAM is not significantly better in the case of synthetic annotations, the results on the machine and human annotations indicate the importance of accounting for differences in the reliability of annotators based on the pair of objects at hand. I conclude my work in the next section and present a few future directions.

9.4 Conclusion

In this chapter, I address the problem of inferring the hidden ground truth preference given noisy annotations from multiple annotators. I propose an EM-algorithm-based Joint Annotator Modeling (JAM) scheme, considering the latent ground truth preference to be a hidden variable and inferring it based on the available annotations and object attributes. Given a pair of objects, the JAM scheme infers the latent true preference order based on a set of object attributes as well as the noisy annotator preferences. The model assumes that annotators flip the true preference order based on a Bernoulli random variable and estimates an annotator-specific "probability of flipping". I further extend the model to estimate a non-constant "probability of flipping" conditioned on the pair of objects at hand in the Variable Reliability Joint Annotator Model (VRJAM). I test the JAM and VRJAM schemes against the majority voting and Independent Annotator Modeling schemes on data sets with annotations obtained synthetically, from machines as well as from human annotators. Using the data sets with synthetic annotations, I test the impact of annotator quality and quantity on my models. The results on the data set with machine annotations depict the importance of having a variable reliability per annotator based on the pair of objects at hand. Finally, in the Safari Bob data with human annotators, I interpret the results based on the expected trends of expressiveness/naturalness in TD and HFA kids.

In the future, I aim to extend the presented algorithms by integrating other existing work in the ranking domain (e.g., active learning). Other work in rank aggregation that infers a rank order probability distribution can also be integrated into the proposed EM framework. Also, within classification there are further extensions of multiple annotator models which can be incorporated into the current EM framework. I also aim to apply the designed algorithms to other data sets, such as the Safari Bob data set, to understand the diversity in perception of various psychological constructs (e.g., naturalness) by the human annotators and their relation to a target variable (e.g., autism severity).

Chapter 10
Conclusions

In this thesis, I adopt the encoding-decoding view of nonverbal communication.
I try to answer questions pertaining to the detection of nonverbal cues, understanding the encoding process and modeling the diversity in the perception of nonverbal cues. Although a holistic modeling of the encoding-decoding process can be very involved, I begin with understanding the relation between nonverbal cues and latent behavioral states and with quantifying the diversity in the perception of nonverbal cues.

With respect to the detection of nonverbal cues, I present two experiments on the detection of paralinguistic events (laughter and fillers) and disfluencies in speech. My observations suggest that context is important in the detection of these events, and my models are geared towards capturing context in detecting these nonverbal cues. For detecting paralinguistic cues, I develop a sequential model to capture context. On the other hand, for detecting disfluencies in speech, I develop a joint model that incorporates context by training multiple models over different context lengths.

To understand the encoding process in nonverbal communication, I begin with understanding the relation between nonverbal cues and latent behavioral states. I present three experiments: tracking affective states based on audio and video, tracking affect in music, and investigating the role of laughter in Motivational Interviewing. In the first experiment, I use the audio and video modalities to track valence, activation and dominance in human-computer interaction. I establish the importance of using cues from multiple modalities in this case and use a stacked generalization framework to capture the relationship between nonverbal cues and latent behavior. Furthermore, the model was personalized based on the depression severity of the person in the video recordings. In my second experiment, I perform a similar experiment on tracking affect in music. Although music is not a form of nonverbal communication, I aimed at developing a model that is interpretable and can be easily transferred to the previous problem of understanding affective evolution. I design a model based on gradient boosting which not only outperforms other regression-based methods, but also relies on only a handful of features in doing so. This leads to a highly interpretable model, which is crucial in understanding nonverbal communication. Finally, in the third experiment, I extend the study of nonverbal communication to the new domain of Motivational Interviewing to track behavior change in patients suffering from substance abuse. I investigated the role of laughter and empirically observed that it is essential in tracking the behavior changes.

In the last section of this thesis, I present models to understand the decoding process during nonverbal communication. I design models to quantify the diversity in perceptions of nonverbal cues. I present two experiments, the first on jointly modeling the smile strength ratings from multiple annotators and the second on inferring ranked lists from noisy annotations. My goal in both experiments was to capture the variability in the perception of nonverbal cues. In the first experiment, involving ratings of smile strength, I develop a model to infer the ground truth given multiple ratings. I challenge the established practice of using means of annotator ratings and show that my scheme better models the data. As an outcome of this system, I also quantify the distortion introduced by each annotator to the ground truth. The transfer function parameter helps us to interpret
the differences among annotators and provides a window into their thought processes. In my second experiment, on inferring ranked lists from noisy pairwise rankings, I extend existing multiple annotator models to the problem of ranking. I test the model on several datasets and, with regard to nonverbal communication, test the model on ranking expressivity and naturalness using paralanguage.

Overall, in this thesis, I have attempted to understand the encoding-decoding process during nonverbal communication by breaking the process into three parts: detection of cues, behavioral encoding of cues and finally decoding of cues. The questions I address are geared towards exploring specific aspects of nonverbal communication, and my future experiments will be formulated to further the understanding of nonverbal communication under the same umbrella of the encoding-decoding approach. Specifically, with respect to the detection of nonverbal cues, I aim to understand the relation between prosody and disfluencies. I intend to design and analyze a system for detecting disfluencies based purely on prosody. The motivation behind this experiment is also to understand the interconnection between the behavioral states leading to disfluencies and the control of prosody. With respect to understanding the relation between nonverbal cues and human behavioral states, I intend to conduct a case study to understand the relation between the valence of a laughter, what is said around the laughter (the transcript) and the prosody of the utterance. This experiment will also help to establish the interplay between verbal and nonverbal communication, as I am using both words and paralanguage in inferring the internal behavioral state. In understanding the decoding process during nonverbal communication, I aim to perform two more experiments. In the first experiment, I aim to understand the differences in perception with respect to ordered ranking. Given the preference lists from multiple annotators (on constructs such as likability, awkwardness, etc.), I aim to develop a system that can provide a ground truth preference list. This can help us explain the reasons for differences in perceptions among people, while at the same time providing ground truths for the perception of behavioral states. In the second experiment, I aim to make theoretical contributions towards the understanding of nonverbal communication. Error bounds are a well-researched topic, particularly in relation to the fusion of experts. I aim to apply similar techniques to my models to enhance the understanding of nonverbal communication. An overarching future goal is to make both theoretical and computational contributions which connect the multidisciplinary study I have undertaken. I aim to present novel solutions to the complex problem of understanding human behavior, borrowing established knowledge from the field of machine learning.

Reference List

[1] G. S. Mudholkar, "Fisher's z-transformation," Encyclopedia of Statistical Sciences, 1983.
[2] V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and L. Moy, "Learning from crowds," The Journal of Machine Learning Research, vol. 11, pp. 1297-1322, 2010.
[3] A. P. Dawid and A. M. Skene, "Maximum likelihood estimation of observer error-rates using the EM algorithm," Applied Statistics, pp. 20-28, 1979.
[4] K. Audhkhasi and S.
Narayanan, \A globally-variant locally-constant model for fusion of labels from multiple diverse experts without using reference labels," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 35, no. 4, pp. 769{783, 2013. [5] S. Mariooryad and C. Busso, \Correcting time-continuous emotional labels by modeling the reaction lag of evaluators," Aective Computing, IEEE Transactions on, vol. 6, no. 2, pp. 97{108, 2015. [6] C. Fraser, S. Restrepo-Estrada et al., Communicating for development: human change for survival. IB Tauris and Co Ltd, 1998. [7] F. E. Dance and C. Larson, \The functions of human communication," Infor- mation and behavior, vol. 1, pp. 62{75, 1985. [8] M. R. Key, The relationship of verbal and nonverbal communication. Walter de Gruyter, 1980, vol. 25. [9] R. P. Harrison, \Nonverbal communication," Human Communication As a Field of Study: Selected Contemporary Views, vol. 113, 1989. [10] S. H. Ng and J. J. Bradac, Power in language: Verbal communication and social in uence. Sage Publications, Inc, 1993. 167 [11] S. W. Littlejohn and K. A. Foss, Encyclopedia of communication theory. Sage, 2009, vol. 1. [12] R. L. Birdwhistell, Kinesics and context: Essays on body motion communi- cation. University of Pennsylvania press, 2010. [13] E. T. Hall, R. L. Birdwhistell, B. Bock, P. Bohannan, A. R. Diebold Jr, M. Durbin, M. S. Edmonson, J. Fischer, D. Hymes, S. T. Kimball et al., \Proxemics [and comments and replies]," Current anthropology, pp. 83{108, 1968. [14] D. Abercrombie, \Paralanguage," International Journal of Language & Communication Disorders, vol. 3, no. 1, pp. 55{59, 1968. [15] M. A. Srinivasan and C. Basdogan, \Haptics in virtual environments: Taxon- omy, research status, and challenges," Computers & Graphics, vol. 21, no. 4, pp. 393{404, 1997. [16] J. K. Burgoon, L. K. Guerrero, and K. Floyd, Nonverbal communication. Allyn & Bacon Boston, MA, 2010. [17] A. Mehrabian, Nonverbal communication. Transaction Publishers, 1977. [18] P. Mundy, C. Kasari, M. Sigman, and E. Ruskin, \Nonverbal communica- tion and early language acquisition in children with down syndrome and in normally developing children," Journal of Speech, Language, and Hearing Research, vol. 38, no. 1, pp. 157{167, 1995. [19] K. Hogan, Can't Get Through: Eight Barriers to Communication. Pelican Publishing, 2003. [20] S. AVRAM, \Coding and decoding nonverbal communication," ANALELE UNIVERSIT AT II DIN CRAIOVA, p. 22. [21] M. Zuckerman, M. S. Lipets, J. H. Koivumaki, and R. Rosenthal, \Encoding and decoding nonverbal cues of emotion." Journal of Personality and Social Psychology, vol. 32, no. 6, p. 1068, 1975. [22] J. T. Lanzetta and R. E. Kleck, \Encoding and decoding of nonverbal aect in humans." Journal of Personality and Social Psychology, vol. 16, no. 1, p. 12, 1970. [23] D. H. Wolpert, \Stacked generalization," Neural networks, vol. 5, no. 2, pp. 241{259, 1992. 168 [24] L. Rokach, \Ensemble-based classiers," Articial Intelligence Review, vol. 33, no. 1-2, pp. 1{39, 2010. [25] G. Zenobi and P. Cunningham, \Using diversity in preparing ensembles of classiers based on dierent feature subsets to minimize generalization error," in Machine Learning: ECML 2001. Springer, 2001, pp. 576{587. [26] K. Tumer and J. Ghosh, \Error correlation and error reduction in ensemble classiers," Connection science, vol. 8, no. 3-4, pp. 385{404, 1996. [27] P. K. Atrey, M. A. Hossain, A. El Saddik, and M. S. Kankanhalli, \Multi- modal fusion for multimedia analysis: a survey," Multimedia systems, vol. 16, no. 6, pp. 
345{379, 2010. [28] J. Fi errez-Aguilar, J. Ortega-Garcia, and J. Gonzalez-Rodriguez, \Fusion strategies in multimodal biometric verication," in Multimedia and Expo, 2003. ICME'03. Proceedings. 2003 International Conference on, vol. 3. IEEE, 2003, pp. III{5. [29] C. Darwin, The expression of the emotions in man and animals. Courier Corporation, 2013. [30] T. McGowan, \Abstract deictic gestures-in-interaction: A barometer of intersubjective knowledge development in small-group discussion," Working Papers in Educational Linguistics, vol. 79, p. 2010, 1955. [31] R. L. Birdwhistell, \Background to kinesics," ETC: A Review of General Semantics, pp. 10{18, 1955. [32] M. Argyle and J. Dean, \Eye-contact, distance and aliation," Sociometry, pp. 289{304, 1965. [33] J. D. Fodor, \Formal linguistics and formal logic," in The Formal Complexity of Natural Language. Springer, 1970, pp. 24{40. [34] R. Sommer, \Personal space. the behavioral basis of design." 1969. [35] R. Rosenthal and L. Jacobson, Pygmalion in the classroom: Teacher expecta- tion and pupils' intellectual development. Holt, Rinehart & Winston, 1968. [36] A. G. Halberstadt, \Family socialization of emotional expression and non- verbal communication styles and skills." Journal of personality and social psychology, vol. 51, no. 4, p. 827, 1986. [37] P. Salovey and J. D. Mayer, \Emotional intelligence," Imagination, cognition and personality, vol. 9, no. 3, pp. 185{211, 1990. 169 [38] S. K. DMello and A. Graesser, \Multimodal semi-automated aect detection from conversational cues, gross body language, and facial features," User Modeling and User-Adapted Interaction, vol. 20, no. 2, pp. 147{187, 2010. [39] S. D'Mello and A. Graesser, \Automatic detection of learner's aect from gross body language," Applied Articial Intelligence, vol. 23, no. 2, pp. 123{ 150, 2009. [40] S. Mozziconacci, \Prosody and emotions," in Speech Prosody 2002, Interna- tional Conference, 2002. [41] P. Ekman, \Facial expression and emotion." American psychologist, vol. 48, no. 4, p. 384, 1993. [42] D. Keltner, P. Ekman, G. C. Gonzaga, and J. Beer, \Facial expression of emotion." 2003. [43] J. A. Hall, J. A. Harrigan, and R. Rosenthal, \Nonverbal behavior in clin- icianpatient interaction," Applied and Preventive Psychology, vol. 4, no. 1, pp. 21{37, 1996. [44] H.-U. Fisch, S. Frey, and H.-P. Hirsbrunner, \Analyzing nonverbal behavior in depression." Journal of abnormal psychology, vol. 92, no. 3, p. 307, 1983. [45] J. S. Carton, E. A. Kessler, and C. L. Pape, \Nonverbal decoding skills and relationship well-being in adults," Journal of Nonverbal Behavior, vol. 23, no. 1, pp. 91{100, 1999. [46] T. Crook, R. T. Bartus, S. H. Ferris, P. Whitehouse, G. D. Cohen, and S. Gershon, \Age-associated memory impairment: Proposed diagnostic cri- teria and measures of clinical changereport of a national institute of mental health work group," 1986. [47] W. R. Miller and S. Rollnick, Motivational interviewing: Helping people change. Guilford press, 2012. [48] S. Rollnick and W. R. Miller, \What is motivational interviewing?" Behavioural and cognitive Psychotherapy, vol. 23, no. 04, pp. 325{334, 1995. [49] M. F. de Mello, J. de Jesus Mari, J. Bacaltchuk, H. Verdeli, and R. Neuge- bauer, \A systematic review of research ndings on the ecacy of interper- sonal therapy for depressive disorders," European archives of psychiatry and clinical neuroscience, vol. 255, no. 2, pp. 75{82, 2005. 170 [50] K. 
Strosahl, \The dissemination of manual-based psychotherapies in man- aged care: Promises, problems, and prospects," Clinical Psychology: Science and Practice, vol. 5, no. 3, pp. 382{386, 1998. [51] P. Mundy, M. Sigman, J. Ungerer, and T. Sherman, \Dening the social decits of autism: The contribution of non-verbal communication measures," Journal of child psychology and psychiatry, vol. 27, no. 5, pp. 657{669, 1986. [52] C. Lord, M. Rutter, and A. Le Couteur, \Autism diagnostic interview- revised: a revised version of a diagnostic interview for caregivers of indi- viduals with possible pervasive developmental disorders," Journal of autism and developmental disorders, vol. 24, no. 5, pp. 659{685, 1994. [53] W. L. Stone, O. Y. Ousley, P. J. Yoder, K. L. Hogan, and S. L. Hepburn, \Nonverbal communication in two-and three-year-old children with autism," Journal of autism and developmental disorders, vol. 27, no. 6, pp. 677{696, 1997. [54] P. Mundy, M. Sigman, and C. Kasari, \A longitudinal study of joint atten- tion and language development in autistic children," Journal of Autism and developmental Disorders, vol. 20, no. 1, pp. 115{128, 1990. [55] E. G. Carr and V. M. Durand, \Reducing behavior problems through functional communication training," Journal of Applied Behavior Analysis, vol. 18, no. 2, pp. 111{126, 1985. [56] W. A. Shennum and D. B. Bugental, \The development of control over aec- tive expression in nonverbal behavior," in Development of nonverbal behavior in children. Springer, 1982, pp. 101{121. [57] B. M. DePaulo, \Nonverbal behavior and self-presentation." Psychological bulletin, vol. 111, no. 2, p. 203, 1992. [58] R. S. Feldman, P. Philippot, and R. J. Custrini, \Social competence and nonverbal behavior," Fundamentals of nonverbal behavior, vol. 329, 1991. [59] A. Feingold, \Gender dierences in personality: a meta-analysis." Psycho- logical bulletin, vol. 116, no. 3, p. 429, 1994. [60] A. E. Kazdin, R. B. Sherick, K. Esveldt-Dawson, and M. D. Rancurello, \Nonverbal behavior and childhood depression," Journal of the American Academy of Child Psychiatry, vol. 24, no. 3, pp. 303{309, 1985. [61] D. T. Tepper and R. F. Haase, \Verbal and nonverbal communication of facilitative conditions." Journal of Counseling Psychology, vol. 25, no. 1, p. 35, 1978. 171 [62] C. Hudson, \Selection of vehicles for public utility service," SAE Technical Paper, Tech. Rep., 1960. [63] J. Hitch, \Trends in detection and measurement of radioisotopes for medical purposes," Electrical Engineering, vol. 72, no. 6, pp. 484{489, 1953. [64] A. Rosen, \Detection of suicidal patients: an example of some limitations in the prediction of infrequent events." Journal of consulting psychology, vol. 18, no. 6, p. 397, 1954. [65] S. Gangwar, \Arousal detection using eeg signal," Ph.D. dissertation, Thapar University, 1956. [66] S. J. Fields, \Discrimination of facial expression and its relation to personal adjustment," The Journal of Social Psychology, vol. 38, no. 1, pp. 63{71, 1953. [67] E. F. Hahn, Stuttering: signicant theories and therapies. Stanford Univer- sity Press, 1956. [68] T. S. Gregersen, \Nonverbal cues: Clues to the detection of foreign language anxiety," Foreign Language Annals, vol. 38, no. 3, pp. 388{400, 2005. [69] A. Vrij, Detecting lies and deceit: The psychology of lying and implications for professional practice. Wiley, 2000. [70] B. M. DePaulo, J. J. Lindsay, B. E. Malone, L. Muhlenbruck, K. Charlton, and H. Cooper, \Cues to deception." Psychological bulletin, vol. 129, no. 1, p. 74, 2003. 
[71] W. M. Brown, \Are there nonverbal cues to commitment? an exploratory study using the zero-acquaintance video presentation paradigm," 2003. [72] J. Whitehill, G. Littlewort, I. Fasel, M. Bartlett, and J. Movellan, \Toward practical smile detection," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 31, no. 11, pp. 2106{2111, 2009. [73] C. Shan, \Smile detection by boosting pixel dierences," Image Processing, IEEE Transactions on, vol. 21, no. 1, pp. 431{436, 2012. [74] L. S. Kennedy and D. P. Ellis, \Laughter detection in meetings," in NIST ICASSP 2004 Meeting Recognition Workshop, Montreal. National Institute of Standards and Technology, 2004, pp. 118{121. [75] M. T. Knox and N. Mirghafori, \Automatic laughter detection using neural networks." in INTERSPEECH, 2007, pp. 2973{2976. 172 [76] P. Ruvolo and J. Movellan, \Automatic cry detection in early childhood education settings," in IEEE International Conference on Development and Learning, vol. 7, 2008, pp. 204{208. [77] G. V arallyay, A. Ill enyi, and Z. Beny o, \Automatic infant cry detection." in MAVEBA, 2009, pp. 11{14. [78] M. Knapp, J. Hall, and T. Horgan, Nonverbal communication in human interaction. Cengage Learning, 2013. [79] C. Breazeal, C. D. Kidd, A. L. Thomaz, G. Homan, and M. Berlin, \Eects of nonverbal communication on eciency and robustness in human-robot teamwork," in Intelligent Robots and Systems, 2005.(IROS 2005). 2005 IEEE/RSJ International Conference on. IEEE, 2005, pp. 708{713. [80] M. Pantic, A. Pentland, A. Nijholt, and T. S. Huang, \Human computing and machine understanding of human behavior: a survey," in Artical Intel- ligence for Human Computing. Springer, 2007, pp. 47{71. [81] A. Abbey and C. Melby, \The eects of nonverbal cues on gender dierences in perceptions of sexual intent," Sex Roles, vol. 15, no. 5-6, pp. 283{298, 1986. [82] J. A. Hall and A. G. Halberstadt, \subordination and sensitivity to nonverbal cues: A study of married working women," Sex Roles, vol. 31, no. 3-4, pp. 149{165, 1994. [83] R. E. Kraut, \Verbal and nonverbal cues in the perception of lying." Journal of personality and social psychology, vol. 36, no. 4, p. 380, 1978. [84] K. Byron and D. C. Baldridge, \E-mail recipients' impressions of senders' likability the interactive eect of nonverbal cues and recipients' personality," Journal of Business Communication, vol. 44, no. 2, pp. 137{160, 2007. [85] B. M. DePaulo, \Decoding discrepant nonverbal cues." Journal of Personal- ity and Social Psychology, vol. 36, no. 3, p. 313, 1978. [86] J. Goodwin, J. M. Jasper, and F. Polletta, Passionate politics: Emotions and social movements. University of Chicago Press, 2009. [87] R. Gupta, C.-C. Lee, and S. Narayanan, \Classication of emotional content of sighs in dyadic human interactions," in Acoustics, Speech and Signal Pro- cessing (ICASSP), 2012 IEEE International Conference on. IEEE, 2012, pp. 2265{2268. 173 [88] S. Soltysik and P. Jelen, \In rats, sighs correlate with relief," Physiology & Behavior, vol. 85, no. 5, pp. 598{602, 2005. [89] E. Vlemincx, J. Taelman, I. Van Diest, and O. Van den Bergh, \Take a deep breath: The relief eect of spontaneous and instructed sighs," Physiology & behavior, vol. 101, no. 1, pp. 67{73, 2010. [90] F. B. Furlow, \Human neonatal cry quality as an honest signal of tness," Evolution and Human Behavior, vol. 18, no. 3, pp. 175{193, 1997. [91] C. Cortes and V. Vapnik, \Support vector machine," Machine learning, vol. 20, no. 3, pp. 273{297, 1995. [92] K. P. Truong and D. A. 
Van Leeuwen, \Automatic detection of laughter." in INTERSPEECH, 2005, pp. 485{488. [93] G. V arallyay Jr, Z. Beny o, A. Ill enyi, Z. Farkas, and L. Kov acs, \Acoustic analysis of the infant cry: classical and new methods," in Engineering in Medicine and Biology Society, 2004. IEMBS'04. 26th Annual International Conference of the IEEE, vol. 1. IEEE, 2004, pp. 313{316. [94] B. Schuller, F. Eyben, and G. Rigoll, \Static and dynamic modelling for the recognition of non-verbal vocalisations in conversational speech," in Percep- tion in multimodal dialogue systems. Springer, 2008, pp. 99{110. [95] B. Schuller, S. Steidl, A. Batliner, A. Vinciarelli, K. Scherer, F. Ringeval, M. Chetouani, F. Weninger, F. Eyben, E. Marchi, M. Mortillaro, H. Salamin, A. Polychroniou, F. Valente, and S. Kim, \The INTERSPEECH 2013 computational paralinguistics challenge: social signals, con ict, emotion, autism," in Proceedings of Interspeech, 2013. [96] H. Kaya, A. M. Er cetin, A. A. Salah, and F. G urgen, \Random forests for laughter detection," in Proceedings of Workshop on Aective Social Speech Signals-in conjunction with the INTERSPEECH, 2013. [97] S. Pammi and M. Chetouani, \Detection of social speech signals using adap- tation of segmental hmms," Workshop on aective social speech signals, Grenoble, 2013, 2013. [98] T. F. Krikke and K. P. Truong, \Detection of nonverbal vocalizations using gaussian mixture models: looking for llers and laughter in conversational speech," Proc. of Interspeech, Lyon, France, pp. 163{167, 2013. [99] R. Brueckner and B. Schulter, \Social signal classication using deep blstm recurrent neural networks," in Acoustics, Speech and Signal Processing 174 (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 4823{ 4827. [100] G. An, D.-G. Brizan, and A. Rosenberg, \Detecting laughter and lled pauses using syllable-based features." in INTERSPEECH, 2013, pp. 178{181. [101] R. Gupta, K. Audhkhasi, S. Lee, and S. Narayanan, \Paralinguistic event detection from speech using probabilistic time-series smoothing and mask- ing," Proc. of Interspeech, Lyon, France, pp. 173{177, 2013. [102] H. Salamin, A. Polychroniou, and A. Vinciarelli, \Automatic detection of laughter and llers in spontaneous mobile phone conversations," in Sys- tems, Man, and Cybernetics (SMC), 2013 IEEE International Conference on. IEEE, 2013, pp. 4282{4287. [103] D. E. Mowrer, L. L. LaPointe, and J. Case, \Analysis of ve acoustic corre- lates of laughter," Journal of Nonverbal Behavior, vol. 11, no. 3, pp. 191{199, 1987. [104] J.-A. Bachorowski, M. J. Smoski, and M. J. Owren, \The acoustic features of human laughter," The Journal of the Acoustical Society of America, vol. 110, no. 3, pp. 1581{1597, 2001. [105] M. Candea, I. Vasilescu, M. Adda-Decker et al., \Inter-and intra-language acoustic analysis of autonomous llers," in Proceedings of DISS 05, Dis u- ency in spontaneous speech workshop, 2005, pp. 47{52. [106] J. Vettin and D. Todt, \Laughter in conversation: Features of occurrence and acoustic structure," Journal of Nonverbal Behavior, vol. 28, no. 2, pp. 93{115, 2004. [107] S. Sundaram and S. Narayanan, \Automatic acoustic synthesis of human- like laughtera)," The Journal of the Acoustical Society of America, vol. 121, no. 1, pp. 527{535, 2007. [108] I. Vasilescu, M. Candea, M. Adda-Decker et al., \Perceptual salience of language-specic acoustic dierences in autonomous llers across eight lan- guages," in proceedings of Interspeech, 2005. [109] L. Rabiner and B.-H. 
Juang, \An introduction to hidden markov models," ASSP Magazine, IEEE, vol. 3, no. 1, pp. 4{16, 1986. [110] K.-i. Funahashi and Y. Nakamura, \Approximation of dynamical systems by continuous time recurrent neural networks," Neural networks, vol. 6, no. 6, pp. 801{806, 1993. 175 [111] J. Laerty, A. McCallum, and F. C. Pereira, \Conditional random elds: Probabilistic models for segmenting and labeling sequence data," 2001. [112] R. Cai, L. Lu, H.-J. Zhang, and L.-H. Cai, \Highlight sound eects detection in audio stream," in Multimedia and Expo, 2003. ICME'03. Proceedings. 2003 International Conference on, vol. 3. IEEE, 2003, pp. III{37. [113] Z. Tu, \Probabilistic boosting-tree: Learning discriminative models for clas- sication, recognition, and clustering," in Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, vol. 2. IEEE, 2005, pp. 1589{1596. [114] R. Brueckner and B. Schuller, \Hierarchical neural networks and enhanced class posteriors for social signal classication," in Automatic Speech Recog- nition and Understanding (ASRU), 2013 IEEE Workshop on. IEEE, 2013, pp. 362{367. [115] M. Piccardi, \Background subtraction techniques: a review," in Systems, man and cybernetics, 2004 IEEE international conference on, vol. 4. IEEE, 2004, pp. 3099{3104. [116] F. Eyben, M. W ollmer, and B. Schuller, \Opensmile: the munich versatile and fast open-source audio feature extractor," in Proceedings of the interna- tional conference on Multimedia. ACM, 2010, pp. 1459{1462. [117] Y. Liu, E. Shriberg, A. Stolcke, D. Hillard, M. Ostendorf, and M. Harper, \Enriching speech recognition with automatic detection of sentence bound- aries and dis uencies," Audio, Speech, and Language Processing, IEEE Transactions on, vol. 14, no. 5, pp. 1526{1540, 2006. [118] O. Pierre-Yves, \The production and recognition of emotions in speech: fea- tures and algorithms," International Journal of Human-Computer Studies, vol. 59, no. 1, pp. 157{183, 2003. [119] D. Bone, C.-C. Lee, M. P. Black, M. E. Williams, S. Lee, P. Levitt, and S. Narayanan, \The psychologist as an interlocutor in autism spectrum dis- order assessment: Insights from a study of spontaneous prosody," Journal of Speech, Language, and Hearing Research, 2014. [120] D. Bone, C.-C. Lee, T. Chaspari, M. P. Black, M. E. Williams, S. Lee, P. Levitt, and S. Narayanan, \Acoustic-prosodic, turn-taking, and language cues in child-psychologist interactions for varying social demand." in INTER- SPEECH, 2013. 176 [121] R. Gupta, C.-C. Lee, D. Bone, A. Rozga, S. Lee, and S. Narayanan, \Acous- tical analysis of engagement behavior in children," in Workshop on child computer interaction, 2012. [122] J. Wagner, F. Lingenfelser, and E. Andr e, \Using phonetic patterns for detecting social cues in natural conversations." in INTERSPEECH, 2013, pp. 168{172. [123] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motl cek, Y. Qian, P. Schwarz et al., \The kaldi speech recognition toolkit," 2011. [124] G. D. Forney Jr, \The viterbi algorithm," Proceedings of the IEEE, vol. 61, no. 3, pp. 268{278, 1973. [125] T. Hastie and R. Tibshirani, \Classication by pairwise coupling," in Advances in Neural Information Processing Systems, M. I. Jordan, M. J. Kearns, and S. A. Solla, Eds., vol. 10. MIT Press, 1998. [126] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. 
Sainath et al., \Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," Signal Processing Magazine, IEEE, vol. 29, no. 6, pp. 82{97, 2012. [127] F. Seide, G. Li, and D. Yu, \Conversational speech transcription using context-dependent deep neural networks." in Interspeech, 2011, pp. 437{440. [128] G. E. Dahl, D. Yu, L. Deng, and A. Acero, \Context-dependent pre- trained deep neural networks for large-vocabulary speech recognition," Audio, Speech, and Language Processing, IEEE Transactions on, vol. 20, no. 1, pp. 30{42, 2012. [129] I. Jollie, Principal component analysis. Wiley Online Library, 2005. [130] B. E. Kingsbury, N. Morgan, and S. Greenberg, \Robust speech recognition using the modulation spectrogram," Speech communication, vol. 25, no. 1, pp. 117{132, 1998. [131] P. A. Torres-Carrasquillo, E. Singer, M. A. Kohler, R. J. Greene, D. A. Reynolds, and J. R. Deller Jr, \Approaches to language identication using gaussian mixture models and shifted delta cepstral features." in INTER- SPEECH, 2002. [132] B. Schuller, S. Steidl, A. Batliner, F. Burkhardt, L. Devillers, C. A. M uller, and S. S. Narayanan, \The interspeech 2010 paralinguistic challenge." in INTERSPEECH, 2010, pp. 2794{2797. 177 [133] B. Schuller, S. Steidl, A. Batliner, E. N oth, A. Vinciarelli, F. Burkhardt, R. Van Son, F. Weninger, F. Eyben, T. Bocklet et al., \The interspeech 2012 speaker trait challenge." in INTERSPEECH, 2012. [134] P. Baldi, \Autoencoders, unsupervised learning, and deep architectures." in ICML Unsupervised and Transfer Learning, 2012, pp. 37{50. [135] S. Lawrence, C. L. Giles, A. C. Tsoi, and A. D. Back, \Face recognition: A convolutional neural-network approach," Neural Networks, IEEE Transac- tions on, vol. 8, no. 1, pp. 98{113, 1997. [136] Q. V. Le, \Building high-level features using large scale unsupervised learn- ing," in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 8595{8598. [137] N. R. Draper, H. Smith, and E. Pownell, Applied regression analysis. Wiley New York, 1966, vol. 3. [138] H. Ellgring, Non-verbal communication in depression. Cambridge University Press, 2007. [139] E. Geerts and N. Bouhuys, \Multi-level prediction of short-term outcome of depression: non-verbal interpersonal processes, cognitions and personality traits," Psychiatry Research, vol. 79, no. 1, pp. 59{72, 1998. [140] P. Salovey and J. D. Mayer, \Emotional intelligence," Imagination, cognition and personality, vol. 9, no. 3, pp. 185{211, 1989. [141] M. Fabri, D. J. Moore, and D. J. Hobbs, \The emotional avatar: non- verbal communication between inhabitants of collaborative virtual environ- ments," in Gesture-Based Communication in Human-Computer Interaction. Springer, 1999, pp. 269{273. [142] J. K. Burgoon, J. A. Bonito, A. Ramirez, N. E. Dunbar, K. Kam, and J. Fis- cher, \Testing the interactivity principle: Eects of mediation, propinquity, and verbal and nonverbal modalities in interpersonal interaction," Journal of communication, vol. 52, no. 3, pp. 657{677, 2002. [143] M. Gabbott and G. Hogg, \An empirical investigation of the impact of non- verbal communication on service evaluation," European Journal of Market- ing, vol. 34, no. 3/4, pp. 384{398, 2000. [144] G. Ulrich and K. Harms, \A video analysis of the non-verbal behaviour of depressed patients before and after treatment," Journal of aective disorders, vol. 9, no. 1, pp. 63{67, 1985. 178 [145] K. Gotham, S. Risi, A. Pickles, and C. 
Lord, \The autism diagnostic observa- tion schedule: revised algorithms for improved diagnostic validity," Journal of autism and developmental disorders, vol. 37, no. 4, pp. 613{627, 2007. [146] E. E. Shriberg, \Preliminaries to a theory of speech dis uencies," University of California, 1994. [147] D. R. Little, R. Oehmen, J. Dunn, K. Hird, and K. Kirsner, \Fluency prol- ing system: An automated system for analyzing the temporal properties of speech," Behavior research methods, pp. 1{12, 2012. [148] R. Eklund, \Dis uency in swedish human{human and human{machine travel booking dialogues," Link oping, 2004. [149] E. Shriberg, \To errrr'is human: ecology and acoustics of speech dis uen- cies," Journal of the International Phonetic Association, vol. 31, no. 1, pp. 153{169, 2001. [150] R. Ferreira and K. Bailey, \Dis uencies and human language comprehen- sion," Trends in cognitive sciences, vol. 8, no. 5, pp. 231{237, 2004. [151] W. Johnson, \Measurements of oral reading and speaking rate and dis uency of adult male and female stutterers and nonstutterers." The Journal of speech and hearing disorders, p. 1, 1961. [152] W. Wang, G. Tur, J. Zheng, and N. Ayan, \Automatic dis uency removal for improving spoken language translation," in Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on. IEEE, 2010, pp. 5214{5217. [153] K. Georgila, N. Wang, and J. Gratch, \Cross-domain speech dis uency detec- tion," in Proceedings of the 11th Annual Meeting of the Special Interest Group on Discourse and Dialogue. Association for Computational Linguistics, 2010, pp. 237{240. [154] S. Rao, I. Lane, and T. Schultz, \Improving spoken language translation by automatic dis uency removal: Evidence from conversational speech tran- scripts," Training, vol. 6370, no. 46300, pp. 6{50, 2007. [155] W. Wang, A. Stolcke, J. Yuan, and M. Liberman, \A cross-language study on automatic speech dis uency detection," in Proceedings of NAACL-HLT, 2013, pp. 703{708. [156] Y. Liu, E. Shriberg, A. Stolcke, and M. P. Harper, \Comparing HMM, max- imum entropy, and conditional random elds for dis uency detection." in INTERSPEECH. Citeseer, 2005, pp. 3313{3316. 179 [157] V. Rangarajan and S. Narayanan, \Analysis of dis uent repetitions in spon- taneous speech recognition," Proc. EUSIPCO 2006, 2006. [158] M. Lease, M. Johnson, and E. Charniak, \Recognizing dis uencies in con- versational speech," Audio, Speech, and Language Processing, IEEE Trans- actions on, vol. 14, no. 5, pp. 1566{1573, 2006. [159] M. Honal and T. Schultz, \Automatic dis uency removal on recognized spon- taneous speech-rapid adaptation to speaker dependent dis uencies," in Proc. of ICASSP, vol. 1, 2005, pp. 969{972. [160] D. Stallard, R. Prasad, P. Natarajan, F. Choi, S. Saleem, R. Meermeier, K. Krstovski, S. Ananthakrishnan, and J. Devlin, \The BBN transtalk speech-to-speech translation system," Speech and Language Technologies, InTech, pp. 31{52, 2011. [161] L. Nguyen and R. Schwartz, \Ecient 2-pass n-best decoder," in Fifth Euro- pean Conference on Speech Communication and Technology, 1997. [162] W. Chen, S. Ananthakrishnan, R. Prasad, and N. P., \Variable-span out- of-vocabulary named entity detection," in Fourteenth Annual Conference of the International Speech Communication Association. IEEE, 2013. [163] J. Nocedal, \Updating quasi-newton matrices with limited storage," Mathe- matics of computation, vol. 35, no. 151, pp. 773{782, 1980. [164] S. P. Version, \1.6 (2008)," Chicago Ill.: SPSS Inc. [165] W. Schuler, S. AbdelRahman, T. 
Miller, and L. Schwartz, \Broad-coverage parsing using human-like memory constraints," Computational Linguistics, vol. 36, no. 1, pp. 1{30, 2010. [166] M. Kaushik, M. Trinkle, and A. Hashemi-Sakhtsari, \Automatic detection and removal of dis uencies from spontaneous speech," in Proc. 13th Aus- tralasian Int. Conf. on Speech Science and Technology Melbourne, 2010, pp. 98{101. [167] S. Sandra, \Depression: Questions you have-answers you need," Peoples Medical Society, 1997. [168] American Psychiatric Association, \Diagnostic and statistical manual of mental disorders," 1980. [169] National Institute of Mental Health, \Depression," http://www.nimh. nih.gov/health/publications/depression-what-you-need-to-know-12-2015/ depression-what-you-need-to-know-pdf 151827.pdf. 180 [170] L. S. Greenberg and J. C. Watson, Emotion-focused therapy for depression. American Psychological Association, 2006. [171] I. Myin-Germeys, N. Jacobs, F. Peeters, G. Kenis, C. Derom, R. Vlietinck, P. Delespaul, J. Van Os et al., \Evidence that moment-to-moment varia- tion in positive emotions buer genetic risk for depression: a momentary assessment twin study," Acta Psychiatrica Scandinavica, vol. 115, no. 6, pp. 451{457, 2007. [172] A. L. Bouhuys, E. Geerts, P. P. A. Mersch, and J. A. Jenner, \Nonverbal interpersonal sensitivity and persistence of depression: perception of emo- tions in schematic faces," Psychiatry research, vol. 64, no. 3, pp. 193{203, 1996. [173] M. A. Nicolaou, H. Gunes, and M. Pantic, \Output-associative rvm regres- sion for dimensional and continuous emotion prediction," Image and Vision Computing, vol. 30, no. 3, pp. 186{196, 2012. [174] A. T. Beck, R. A. Steer, G. K. Brown et al., \Manual for the beck depression inventory-ii," 1996. [175] M. Brandt and J. D. Boucher, \Concepts of depression in emotion lexicons of eight cultures," International Journal of Intercultural Relations, vol. 10, no. 3, 1986. [176] W. Heller, N. S. Koven, and G. A. Miller, \Regional brain activity in anxi- ety and depression, cognition/emotion interaction, and emotion regulation." 2003. [177] J. J. Gross and R. F. Mu~ noz, \Emotion regulation and mental health," Clin- ical psychology: Science and practice, vol. 2, no. 2, pp. 151{164, 1995. [178] C. E. Izard, Patterns of emotions: A new analysis of anxiety and depression. Academic Press, 2013. [179] S. H. Blumberg and C. E. Izard, \Discriminating patterns of emotions in 10-and 11-yr-old children's anxiety and depression." Journal of personality and social psychology, vol. 51, no. 4, p. 852, 1986. [180] H. Kaya, F. Eyben, A. A. Salah, and B. Schuller, \Cca based feature selection with application to continuous depression recognition from acoustic speech features," in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014. 181 [181] N. Cummins, J. Epps, V. Sethu, and J. Krajewski, \Variability compensation in small data: Oversampled extraction of i-vectors for the classication of depressed speech," in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE. [182] N. Cummins, V. Sethu, J. Epps, and J. Krajewski, \Probabilistic acoustic volume analysis for speech aected by depression," in Annual Conference of the International Speech Communication Association, 2014. [183] A. Metallinou, A. Katsamanis, Y. Wang, and S. 
Narayanan, \Tracking changes in continuous emotion states using body language and prosodic cues," in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on. IEEE, 2011. [184] F. Ringeval, F. Eyben, E. Kroupi, A. Yuce, J.-P. Thiran, T. Ebrahimi, D. Lalanne, and B. Schuller, \Prediction of asynchronous dimensional emo- tion ratings from audiovisual and physiological data," Pattern Recognition Letters, 2014. [185] H. Gunes and B. Schuller, \Categorical and dimensional aect analysis in continuous input: Current trends and future directions," Image and Vision Computing, vol. 31, no. 2, pp. 120{136, 2013. [186] M. Valstar, B. Schuller, K. Smith, F. Eyben, B. Jiang, S. Bilakhia, S. Schnieder, R. Cowie, and M. Pantic, \Avec 2013: the continuous audio/visual emotion and depression recognition challenge," in Proceedings of the 3rd ACM international workshop on Audio/visual emotion challenge. ACM, 2013, pp. 3{10. [187] M. Valstar, B. Schuller, K. Smith, T. Almaev, F. Eyben, J. Krajewski, R. Cowie, and M. Pantic, \Avec 2014{3d dimensional aect and depression recognition challenge," 2013. [188] H. Kaya, F. C illi, and A. A. Salah, \Ensemble cca for continuous emotion pre- diction," in Proceedings of the 4th International Workshop on Audio/Visual Emotion Challenge. ACM, 2014, pp. 19{26. [189] R. Gupta, N. Malandrakis, B. Xiao, T. Guha, M. Van Segbroeck, M. Black, A. Potamianos, and S. Narayanan, \Multimodal prediction of aective dimensions and depression in human-computer interactions," in Proceed- ings of the 4th International Workshop on Audio/Visual Emotion Challenge. ACM, 2014, pp. 33{40. 182 [190] M. K achele, M. Schels, and F. Schwenker, \Inferring depression and aect from application dependent meta knowledge," in Proceedings of the 4th Inter- national Workshop on Audio/Visual Emotion Challenge. ACM, 2014. [191] J. R. Williamson, T. F. Quatieri, B. S. Helfer, G. Ciccarelli, and D. D. Mehta, \Vocal and facial biomarkers of depression based on motor incoor- dination and timing," in Proceedings of the 4th International Workshop on Audio/Visual Emotion Challenge. ACM, 2014, pp. 65{72. [192] A. Jan, H. Meng, Y. F. A. Gaus, F. Zhang, and S. Turabzadeh, \Automatic depression scale prediction using facial expression dynamics and regression," in Proceedings of the 4th International Workshop on Audio/Visual Emotion Challenge. ACM, 2014, pp. 73{80. [193] V. Jain, J. L. Crowley, A. K. Dey, and A. Lux, \Depression estimation using audiovisual features and sher vector encoding," in Proceedings of the 4th International Workshop on Audio/Visual Emotion Challenge. ACM, 2014. [194] R. J. Cabin and R. J. Mitchell, \To bonferroni or not to bonferroni: when and how are the questions," Bulletin of the Ecological Society of America, pp. 246{248, 2000. [195] R. Rojas, Neural networks: a systematic introduction. Springer Science & Business Media, 2013. [196] A. C. Davison and D. V. Hinkley, Bootstrap methods and their application. Cambridge university press, 1997. [197] L. L. Scharf, Statistical signal processing. Addison-Wesley Reading, MA, 1991, vol. 98. [198] R. Gupta, N. Kumar, and S. Narayanan, \Aect prediction in music using boosted ensemble of lters," in The 2015 European Signal Processing Con- ference, Nice, 2015. [199] J. O. Greene and J. N. Cappella, \Cognition and talk: The relationship of semantic units to temporal patterns of uency in spontaneous speech," Language and Speech, vol. 29, no. 2, pp. 141{157, 1986. [200] G. Beattie and A. 
Ellis, The psychology of language and communication. Psychology Press, 2014. [201] R. Gupta, D. Bone, S. Lee, and S. Narayanan, \Analysis of engagement behavior in children during dyadic interactions using prosodic cues," Com- puter Speech & Language, vol. 37, pp. 47{66, 2016. 183 [202] L.-O. Lundqvist, F. Carlsson, P. Hilmersson, and P. Juslin, \Emotional responses to music: experience, expression, and physiology," Psychology of Music, 2008. [203] P. N. Juslin and J. A. Sloboda, Music and emotion: Theory and research. Oxford University Press, 2001. [204] J. Panksepp, \The emotional sources of" chills" induced by music," Music perception, pp. 171{207, 1995. [205] T. Eerola and J. K. Vuoskoski, \A comparison of the discrete and dimensional models of emotion in music," Psychology of Music, 2010. [206] N. Malandrakis, A. Potamianos, G. Evangelopoulos, and A. Zlatintsi, \A supervised approach to movie emotion tracking," in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on. IEEE, 2011, pp. 2376{2379. [207] S. Hallam and J. Price, \Research section: can the use of background music improve the behaviour and academic performance of children with emotional and behavioural diculties?" British Journal of Special Education, vol. 25, no. 2, pp. 88{91, 1998. [208] M. K. Hul, L. Dube, and J.-C. Chebat, \The impact of music on consumers' reactions to waiting for services," Journal of Retailing, vol. 73, no. 1, pp. 87{104, 1997. [209] E. M. Schmidt and Y. E. Kim, \Modeling musical emotion dynamics with conditional random elds." in ISMIR, 2011, pp. 777{782. [210] Y.-H. Yang and H. H. Chen, \Machine recognition of music emotion: A review," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 3, no. 3, p. 40, 2012. [211] Y. E. Kim, E. M. Schmidt, R. Migneco, B. G. Morton, P. Richardson, J. Scott, J. A. Speck, and D. Turnbull, \Music emotion recognition: A state of the art review," in Proc. ISMIR. Citeseer, 2010, pp. 255{266. [212] J. Speck, E. M. Schmidt, B. G. Morton, and Y. E. Kim, \A comparative study of collaborative vs. traditional annotation methods," ISMIR, Miami, Florida, 2011. [213] M. Soleymani, M. N. Caro, E. M. Schmidt, C.-Y. Sha, and Y.-H. Yang, \1000 songs for emotional analysis of music," in Proceedings of the 2nd ACM international workshop on Crowdsourcing for multimedia. ACM, 2013, pp. 1{6. 184 [214] A. Aljanaki, Y. Yang, and M. Soleymani, \Emotion in music task at medi- aeval 2014," in Mediaeval 2014 Workshop, Barcelona, Spain, October 16-17, 2014. [215] E. Coutinho, F. Weninger, B. Schuller, and K. R. Scherer, \The munich LSTM-RNN approach to the mediaeval 2014 emotion in music task," in Mediaeval 2014 Workshop, Barcelona, Spain, October 16-17, 2014. [216] Y. Fan and M. Xu, \Mediaeval 2014: Thu-hcsil approach to emotion in music task using multi-level regression," in Mediaeval 2014 Workshop, Barcelona, Spain, October 16-17, 2014. [217] K. Markov and T. Matsui, \Dynamic music emotion recognition using state- space models," 2014. [218] J. H. Friedman, \Greedy function approximation: a gradient boosting machine," Annals of statistics, pp. 1189{1232, 2001. [219] N. Kumar, R. Gupta, T. Guha, C. Vaz, M. V. Segbroeck, J. Kim, and S. Narayanan, \Aective feature design and predicting continuous aective dimensions from music," in Mediaeval Workshop, Barcelona, Spain, 2014. [220] J. Snyman, Practical mathematical optimization: an introduction to basic optimization theory and classical and new gradient-based algorithms. 
Springer Science & Business Media, 2005, vol. 97. [221] L. Armijo et al., \Minimization of functions having lipschitz continuous rst partial derivatives," Pacic Journal of mathematics, vol. 16, no. 1, pp. 1{3, 1966. [222] J. H. Friedman, \Stochastic gradient boosting," Computational Statistics & Data Analysis, vol. 38, no. 4, pp. 367{378, 2002. [223] R. J. Lewis, \An introduction to classication and regression tree (cart) anal- ysis," in Annual Meeting of the Society for Academic Emergency Medicine in San Francisco, California, 2000, pp. 1{14. [224] I. Guyon and A. Elissee, \An introduction to variable and feature selection," The Journal of Machine Learning Research, vol. 3, pp. 1157{1182, 2003. [225] D. P. Bertsekas, D. P. Bertsekas, D. P. Bertsekas, and D. P. Bertsekas, Dynamic programming and optimal control. Athena Scientic Belmont, MA, 1995, vol. 1, no. 2. [226] W. R. Miller and S. Rollnick, Motivational interviewing: preparing people for change. New York [etc.]: The Guilford Press, 2002. 185 [227] J. S. Baer, D. B. Rosengren, C. W. Dunn, E. A. Wells, R. L. Ogle, and B. Hartzler, \An evaluation of workshop training in motivational interview- ing for addiction and mental health clinicians," Drug and alcohol dependence, vol. 73, no. 1, pp. 99{106, 2004. [228] B. L. Burke, C. W. Dunn, D. C. Atkins, and J. S. Phelps, \The emerging evidence base for motivational interviewing: a meta-analytic and qualitative inquiry," Journal of Cognitive Psychotherapy, vol. 18, no. 4, pp. 309{322, 2004. [229] W. R. Miller, T. B. Moyers, D. Ernst, and P. Amrhein, \Manual for the motivational interviewing skill code (misc)," Unpublished manuscript. Albu- querque: Center on Alcoholism, Substance Abuse and Addictions, University of New Mexico, 2003. [230] E. Proctor, H. Silmere, R. Raghavan, P. Hovmand, G. Aarons, A. Bunger, R. Griey, and M. Hensley, \Outcomes for implementation research: con- ceptual distinctions, measurement challenges, and research agenda," Admin- istration and Policy in Mental Health and Mental Health Services Research, vol. 38, no. 2, pp. 65{76, 2011. [231] S. S. Narayanan and P. G. Georgiou, \Behavioral signal processing: Deriving human behavioral informatics from speech and language," Proceedings of the IEEE, vol. 101, no. 5, pp. 1203{1233, 2013. [232] P. G. Georgiou, M. P. Black, A. C. Lammert, B. R. Baucom, and S. S. Narayanan, \thats aggravating, very aggravating: Is it possible to clas- sify behaviors in couple interactions using automatically derived lexical fea- tures?" in Aective Computing and Intelligent Interaction. Springer, 2011, pp. 87{96. [233] R. L. Coser, \Some social functions of laughter." Human relations, 1959. [234] H. Plessner, Laughing and crying: a study of the limits of human behavior. Northwestern University Press, 1970. [235] V. Adelsw ard, \Laughter and dialogue: The social signicance of laughter in institutional discourse," Nordic Journal of Linguistics, vol. 12, no. 2, pp. 107{136, 1989. [236] P. J. Glenn, Laughter in interaction. Cambridge University Press Cam- bridge, 2003. [237] T. Chaspari, E. M. Provost, A. Katsamanis, and S. S. Narayanan, \An acous- tic analysis of shared enjoyment in eca interactions of children with autism," 186 in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE Interna- tional Conference on. IEEE, 2012, pp. 4485{4488. [238] A. Krupski, J. M. Joesch, C. Dunn, D. Donovan, K. Bumgardner, S. P. Lord, R. Ries, P. 
Roy-Byrne et al., \Testing the eects of brief intervention in primary care for problem drug use in a randomized controlled trial: rationale, design, and methods," Addiction science & clinical practice, vol. 7, no. 1, p. 27, 2012. [239] B. Xiao, P. G. Georgiou, and S. S. Narayanan, \Analyzing the language of therapist empathy in motivational interview based psychotherapy," Proc Asia Pacic Signal Inf Process Assoc, pp. 1{4, 2012. [240] D. Can, P. G. Georgiou, D. C. Atkins, and S. S. Narayanan, \A case study: Detecting counselor re ections in psychotherapy for addictions using linguis- tic features," in Thirteenth Annual Conference of the International Speech Communication Association, 2012. [241] Z. Le, \Maximum entropy modeling toolkit for python and c++," Natural Language Processing Lab, Northeastern University, China, 2004. [242] D. C. Liu and J. Nocedal, \On the limited memory bfgs method for large scale optimization," Mathematical programming, vol. 45, no. 1-3, pp. 503{ 528, 1989. [243] M. P. Black, A. Katsamanis, B. R. Baucom, C.-C. Lee, A. C. Lammert, A. Christensen, P. G. Georgiou, and S. S. Narayanan, \Toward automating a human behavioral coding system for married couples interactions using speech acoustic features," Speech Communication, vol. 55, no. 1, pp. 1{21, 2013. [244] A. Frischen, A. P. Bayliss, and S. P. Tipper, \Gaze cueing of attention: visual attention, social cognition, and individual dierences." Psychological bulletin, vol. 133, no. 4, p. 694, 2007. [245] B. Xiao, P. G. Georgiou, B. R. Baucom, and S. S. Narayanan, \Power- spectral analysis of head motion signal for behavioral modeling in human interaction," in Proceedings of IEEE International Conference on Audio, Speech and Signal Processing (ICASSP), May 2014. [246] R. McCleary, R. A. Hay, E. E. Meidinger, and D. McDowall, Applied time series analysis for the social sciences. Sage Publications Beverly Hills, CA, 1980. 187 [247] D. K. Simonton, \Sociocultural context of individual creativity: a transhis- torical time-series analysis." Journal of personality and social psychology, vol. 32, no. 6, p. 1119, 1975. [248] M. Baxter and R. G. King, \Measuring business cycles: approximate band- pass lters for economic time series," Review of economics and statistics, vol. 81, no. 4, pp. 575{593, 1999. [249] S. J. Taylor, \Modelling nancial time series," 2007. [250] N. K. Rathlev, J. Chessare, J. Olshaker, D. Obendorfer, S. D. Mehta, T. Rothenhaus, S. Crespo, B. Magauran, K. Davidson, R. Shemin et al., \Time series analysis of variables associated with daily mean emergency department length of stay," Annals of emergency medicine, vol. 49, no. 3, pp. 265{271, 2007. [251] R. Gupta, P. G. Georgiou, D. C. Atkins, and S. S. Narayanan, \Predicting clients inclination towards target behavior change in motivational interview- ing and investigating the role of laughter," in Fifteenth Annual Conference of the International Speech Communication Association, 2014. [252] A. Metallinou, A. Katsamanis, and S. S. Narayanan, \Tracking continuous emotional trends of participants during aective dyadic interactions using body language and speech information," Image and Vision Computing, vol. 31, no. 2, pp. 137{152, Feb. 2013. [Online]. Available: www.sciencedirect.com/science/article/pii/S0262885612001710 [253] M. A. Nicolaou, V. Pavlovic, and M. Pantic, \Dynamic probabilistic cca for analysis of aective behaviour," in Computer Vision{ECCV 2012. Springer, 2012, pp. 98{111. [254] D. G. Manolakis, V. K. Ingle, and S. M. 
Kogon, Statistical and adaptive signal processing: spectral estimation, signal modeling, adaptive ltering, and array processing. Artech House Norwood, 2005, vol. 46. [255] B. Widrow and S. D. Stearns, \Adaptive signal processing," Englewood Clis, NJ, Prentice-Hall, Inc., 1985, 491 p., 1985. [256] A. P. Dempster, N. M. Laird, and D. B. Rubin, \Maximum likelihood from incomplete data via the em algorithm," Journal of the Royal Statistical Soci- ety. Series B (Methodological), 1977. [257] Y. Bachrach, T. Graepel, T. Minka, and J. Guiver, \How to grade a test with- out knowing the answers|a bayesian graphical model for adaptive crowd- sourcing and aptitude testing," arXiv preprint arXiv:1206.6386, 2012. 188 [258] Y. Yan, R. Rosales, G. Fung, M. W. Schmidt, G. H. Valadez, L. Bogoni, L. Moy, and J. G. Dy, \Modeling annotator expertise: Learning when every- body knows a bit of something," in International conference on articial intelligence and statistics, 2010, pp. 932{939. [259] P. Welinder, S. Branson, P. Perona, and S. J. Belongie, \The multidimen- sional wisdom of crowds," in Advances in neural information processing sys- tems, 2010, pp. 2424{2432. [260] Q. Zhao, D. Meng, Z. Xu, W. Zuo, and Y. Yan, \lf1g-norm low-rank matrix factorization by variational bayesian method," Neural Networks and Learning Systems, IEEE Transactions on, vol. 26, no. 4, pp. 825{839, 2015. [261] A. Eriksson and A. Van Den Hengel, \Ecient computation of robust low- rank matrix approximations in the presence of missing data using the l 1 norm," 2010. [262] R. Zhang and T. J. Ulrych, \Physical wavelet frame denoising," Geophysics, vol. 68, no. 1, pp. 225{231, 2003. [263] Z. Lin, M. Chen, and Y. Ma, \The augmented lagrange multiplier method for exact recovery of corrupted low-rank matrices," arXiv preprint arXiv:1009.5055, 2010. [264] L. Zhang, D. Tjondronegoro, and V. Chandran, \Representation of facial expression categories in continuous arousal-valence space: Feature and cor- relation," Image and Vision Computing, 2014. [265] M. Soleymani, A. Aljanaki, Y.-H. Yang, M. N. Caro, F. Eyben, K. Markov, B. Schuller, R. Veltkamp, and F. W. F. Weninger, \Emotional analysis of music: A comparison of methods," in Proceedings of ACM International Conference on Multimedia-MM 2014, November 3-7, Orlando, Florida, USA, 2014. [266] N. Kumar, R. Gupta, T. Guha, C. Vaz, M. V. Segbroeck, J. Kim, and S. Narayanan, \Aective feature design and predicting continuous aective dimensions from music," in MediaEval 2014 Multimedia Benchmark Work- shop, Barcelona, 2014. [267] D. Can, P. Georgiou, D. Atkins, and S. S. Narayanan, \A case study: Detect- ing counselor re ections in psychotherapy for addictions using linguistic fea- tures," in Proceedings of InterSpeech, Sep. 2012. 189 [268] M. Black, A. Katsamanis, C.-C. Lee, A. Lammert, B. Baucom, A. Chris- tensen, P. Georgiou, and S. S. Narayanan, \Automatic classication of mar- ried couples' behavior using audio features," in In Proceedings of InterSpeech, Makuhari, Japan, Sep. 2010. [269] J. Liscombe, G. Riccardi, and D. Hakkani-T ur, \Using context to improve emotion detection in spoken dialog systems," in Ninth European Conference on Speech Communication and Technology, 2005. [270] M. A. Nicolaou, V. Pavlovic, and M. Pantic, \Dynamic probabilistic cca for analysis of aective behavior and fusion of continuous annotations," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 36, no. 7, pp. 1299{1311, 2014. [271] C. M. Bishop et al., Pattern recognition and machine learning. 
springer New York, 2006, vol. 1. [272] J. Neter, W. Wasserman, and M. H. Kutner, \Applied linear regression mod- els," 1989. [273] M. A. Figueiredo, \Lecture notes on bayesian estimation and classication," 2004. [274] M. Yamada and M. Sugiyama, \Dependence minimizing regression with model selection for non-linear causal inference under non-gaussian noise." in AAAI, 2010. [275] C. Gu, \Adaptive spline smoothing in non-gaussian regression models," Jour- nal of the American Statistical Association, vol. 85, no. 411, pp. 801{807, 1990. [276] O. Y. Ousley, R. I. Arriaga, M. J. Morrier, J. B. Mathys, M. D. Allen, and G. D. Abowd, \Beyond parental report: Findings from the rapid-abc, a new 4-minute interactive autism," Technical report series: report number 100 (http://www.cbi.gatech.edu/techreports), Center for Behavior Imaging, Georgia Institute of Technology, 2013. [277] R. Gupta, C.-C. Lee, L. Sungbok, and S. Narayanan, \Assessment of a child's engagement using sequence model based features," in Workshop on Aective Social Speech Signals, Grenoble, 2013. [278] D. Messinger and A. Fogel, \The interactive development of social smiling," Advances in child development and behaviour, vol. 35, pp. 328{366, 2007. 190 [279] Y.-H. Huang and C.-S. Fuh, \Face detection and smile detection," in Pro- ceedings of IPPR Conference on Computer Vision, Graphics and Image Por- cessing, Shitou, Taiwan, A5-6, 2009, p. 108. [280] T. S en echal, J. Turcot, and R. El Kaliouby, \Smile or smirk? automatic detection of spontaneous asymmetric smiles to understand viewer experi- ence," in Automatic Face and Gesture Recognition (FG), 2013 10th IEEE International Conference and Workshops on. IEEE, 2013, pp. 1{8. [281] M. Cox, J. Nuevo-Chiquero, J. Saragih, and S. Lucey, \Csiro face analysis sdk," Brisbane, Australia, 2013. [282] T. Ojala, M. Pietikainen, and T. Maenpaa, \Multiresolution gray-scale and rotation invariant texture classication with local binary patterns," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 24, no. 7, pp. 971{987, 2002. [283] A. Albert, Regression and the Moore-Penrose pseudoinverse. Elsevier, 1972. [284] M. E. Rice and G. T. Harris, \Comparing eect sizes in follow-up studies: Roc area, cohen's d, and r." Law and human behavior, vol. 29, no. 5, p. 615, 2005. [285] E. H ullermeier, J. F urnkranz, W. Cheng, and K. Brinker, \Label ranking by learning pairwise preferences," Articial Intelligence, vol. 172, no. 16, 2008. [286] T.-Y. Liu, \Learning to rank for information retrieval," Foundations and Trends in Information Retrieval, vol. 3, 2009. [287] T. H. Haveliwala, \Topic-sensitive pagerank," in Proceedings of the 11th international conference on World Wide Web. ACM, 2002. [288] K. H.-Y. Lin and H.-H. Chen, \Ranking reader emotions using pairwise loss minimization and emotional distribution regression," in Proceedings of the conference on empirical methods in natural language processing. Association for Computational Linguistics, 2008. [289] O. Wu, W. Hu, and J. Gao, \Learning to rank under multiple annotators," in IJCAI Proceedings-International Joint Conference on Articial Intelligence, vol. 22, no. 1, 2011. [290] M. Van Erp and L. Schomaker, \Variants of the borda count method for com- bining ranked classier hypotheses," in in the seventh international workshop on frontiers in handwriting recognition. Citeseer, 2000. [291] E. M. Niou, \A note on nanson's rule," Public Choice, vol. 54, 1987. 191 [292] T. K. 
Moon, \The expectation-maximization algorithm," Signal processing magazine, IEEE, vol. 13, no. 6, 1996. [293] J. F urnkranz and E. H ullermeier, \Pairwise preference learning and ranking," in Machine Learning: ECML. Springer, 2003. [294] K. Brinker, \Active learning of label ranking functions," in Proceedings of the twenty-rst international conference on Machine learning. ACM, 2004. [295] B. Long, O. Chapelle, Y. Zhang, Y. Chang, Z. Zheng, and B. Tseng, \Active learning for ranking through expected loss optimization," in Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval. ACM, 2010. [296] W. Chu and Z. Ghahramani, \Extensions of gaussian processes for ranking: semisupervised and active learning," in Proceedings of the NIPS Workshop on Learning to Rank. MIT, 2005. [297] J. He, M. Li, H.-J. Zhang, H. Tong, and C. Zhang, \Manifold-ranking based image retrieval," in Proceedings of the 12th annual ACM international con- ference on Multimedia, 2004. [298] C. Quoc and V. Le, \Learning to rank with nonsmooth cost functions," Pro- ceedings of the Advances in Neural Information Processing Systems, vol. 19, 2007. [299] P. Li, Q. Wu, and C. J. Burges, \Mcrank: Learning to rank using multi- ple classication and gradient boosting," in Advances in neural information processing systems, 2007. [300] K. Duh and K. Kirchho, \Learning to rank with partially-labeled data," in Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 2008. [301] Z. Cao, T. Qin, T.-Y. Liu, M.-F. Tsai, and H. Li, \Learning to rank: from pairwise approach to listwise approach," in Proceedings of the 24th interna- tional conference on Machine learning. ACM, 2007. [302] C. Dwork, R. Kumar, M. Naor, and D. Sivakumar, \Rank aggregation meth- ods for the web," in Proceedings of the 10th international conference on World Wide Web. ACM, 2001. [303] W. Ding, P. Ishwar, and V. Saligrama, \Learning shared rankings from mix- tures of noisy pairwise comparisons," in Acoustics, Speech and Signal Pro- cessing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015. 192 [304] D. F. Gleich and L.-h. Lim, \Rank aggregation via nuclear norm minimiza- tion," in Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2011. [305] Y. Pan, H. Lai, C. Liu, Y. Tang, and S. Yan, \Rank aggregation via low-rank and structured-sparse decomposition." in AAAI, 2013. [306] X. Chen, P. N. Bennett, K. Collins-Thompson, and E. Horvitz, \Pairwise ranking aggregation in a crowdsourced setting," in Proceedings of the sixth ACM international conference on Web search and data mining. ACM, 2013. [307] A. Kumar and M. Lease, \Learning to rank from a noisy crowd," in Pro- ceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval. ACM, 2011. [308] D. Zhou, S. Basu, Y. Mao, and J. C. Platt, \Learning from the wisdom of crowds by minimax entropy," in Advances in Neural Information Processing Systems, 2012, pp. 2195{2203. [309] E. Mower, M. J. Matari c, and S. Narayanan, \A framework for automatic human emotion classication using emotion proles," Audio, Speech, and Language Processing, IEEE Transactions on, vol. 19, no. 5, 2011. [310] P. Rao and L. L. Kupper, \Ties in paired-comparison experiments: A gen- eralization of the bradley-terry model," Journal of the American Statistical Association, vol. 62, no. 317, 1967. [311] P. Donmez and J. G. 
Carbonell, \Optimizing estimated loss reduction for active sampling in rank learning," in Proceedings of the 25th international conference on Machine learning. ACM, 2008. [312] C. Gentile and M. K. Warmuth, \Linear hinge loss and average margin," in NIPS, vol. 11, 1998. [313] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender, \Learning to rank using gradient descent," in Proceedings of the 22nd international conference on Machine learning. ACM, 2005. [314] J. D. Rennie and N. Srebro, \Loss functions for preference levels: Regression with discrete ordered labels," in Proceedings of the IJCAI multidisciplinary workshop on advances in preference handling. Kluwer Norwell, MA, 2005. [315] T. Hastie, R. Tibshirani et al., \Classication by pairwise coupling," The annals of statistics, vol. 26, no. 2, 1998. 193 [316] J. A. Hartigan and M. A. Wong, \Algorithm as 136: A k-means clustering algorithm," Applied statistics, 1979. [317] P. Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis, \Modeling wine preferences by data mining from physicochemical properties," Decision Sup- port Systems, vol. 47, 2009. [318] A. Asuncion and D. Newman, \UCI machine learning repository," 2007. [319] F. Alimoglu, D. Doc, E. Alpaydin, and Y. Denizhan, \Combining multiple classiers for pen-based handwritten digit recognition," 1996. [320] A. Trajman and R. Luiz, \Mcnemar2 test revisited: comparing sensitivity and specicity of diagnostic examinations," Scandinavian journal of clinical and laboratory investigation, vol. 68, no. 1, pp. 77{80, 2008. [321] D. Bone, M. P. Black, A. Ramakrishna, R. Grossman, and S. S. Narayanan, \Acoustic-prosodic correlates of awkward prosody in story retellings from adolescents with autism," in Proceedings of Interspeech, Sep. 2015. [322] R. B. Grossman, L. R. Edelson, and H. Tager-Flusberg, \Emotional facial and vocal expressions during story retelling by children and adolescents with high-functioning autism," Journal of Speech, Language, and Hearing Research, vol. 56, 2013. [323] S. Roweis and Z. Ghahramani, \A unifying review of linear gaussian models," Neural computation, vol. 11, no. 2, pp. 305{345, 1999. [324] J. M. Joyce, \Kullback-leibler divergence," in International Encyclopedia of Statistical Science. Springer, 2011, pp. 720{722. [325] A. H. Ang and W. H. Tang, \Probability concepts in engineering," Planning, vol. 1, no. 4, pp. 1{3, 2004. [326] M. Kearns, Y. Mansour, and A. Y. Ng, \An information-theoretic analysis of hard and soft assignment methods for clustering," in Learning in graphical models. Springer, 1998. [327] F. Jelinek, \Speech recognition by statistical methods," Proceedings of the IEEE, vol. 64, pp. 532{556, 1976. [328] D. Koller and N. Friedman, Probabilistic graphical models: principles and techniques. MIT press, 2009. 194 Appendix A Derivation of the Expectation Maximization algorithm stated in Chapter 8 We use an Expectation Maximization (EM) framework [256, 323] to estimate the model parameters and introduce a distribution q(a s ) dened over the hidden ground truth. Following up from the data log likelihood formulation in (8.9), the following decomposition holds for any choice of q(a s ) (please refer to section 9.4 in [271]). L =M(q; ;X s ) + KL(qjjp) (A.1) Where M and KL(qjjp) are functionals of q(a s ) [271] and KL(qjjp) specically refers to the Kullback-Leibler divergence [324] between q(a s ) and p(a s ja s 1 ;::;a s N ; ;X s ) as shown below. 
M(q; ;X s ) = X s2S Z a s q(a s ) log p(a s 1 ;::;a s N ;a s j;X s ) q(a s ) @a s (A.2) KL(qjjp) = X s2S Z a s q(a s ) log p(a s ja s 1 ;::;a s N ; ;X s ) q(a s ) @a s (A.3) 195 An EM algorithm iteratively performs an Expectation step (E-step) and a Max- imization step (M-step). In the E-stepM is maximized with respect toq(a s ) while holding the parameters constant. The solution is equivalent to the posterior dis- tributionp(a s ja s 1 ;::;a s N ; ;X s ), when KL(qjjp) vanishes [271]. During the M-step, we maximizeM to update the model parameters =hd 1 ;::;d N ;d b 1 ;::;d b N ;i. We can simplify the expression in (A.2) based on the graphical model shown in Figure 8.3. Applying the multiplication theorem on probability [325], we can express the joint probability betweenha s 1 ;::;a s N i anda s in equation (A.2) conditioned on the model parameters as shown in equation A.4. A detailed derivation of the relation in (A.4) is shown in Appendix 2. p(a s 1 ;::;a s N ;a s j;X s ) = p(a s 1 ;::;a s N ;a s jd 1 ;::;d N ;d b 1 ;::;d b N ;;X s ) = N Y n=1 p(a s n ja s ;d n ;d b n )p(a s j;X s ) (A.4) Based on the above equation, we can rewriteM in equation (A.2) as: M(q; ;X s ) = X s2S Z a s q(a s ) logq(a s )@a s | {z } Entropy(q(a s )) + X s2S Z a s q(a s ) log N Y n=1 p(a s n ja s ;d n ;d b n )p(a s j;X s ) @a s | {z } M 1 : Term containing model parameters (A.5) Note thatM in (A.5) can be completely dened given q(a s ), the distortion functionh and the feature mapping function g. In this work, we approximateq(a s ) with a point estimate ofa s . Based onh, we can compute the conditional probability P (a s n ja s ;d n ) and P (a s j;X s ) can be computed based on g. The distribution 196 entropy term on right hand side in (A.5) does not depend on model parameters and we only need to deal with termM 1 during the M-step. Next, we restate the formulation for the functions g and h in (A.6)-(A.9) as introduced in Section 8.2.1 along with their probability distributions. Furthermore, an approximation on q(a s ) in introduced as discussed below. a s = g X s ; = 2 4 X s 1 3 5 T + s (A.6) a s N 2 4 X s 1 3 5 T ; I T s (A.7) a s n = h(a s ;d n ) =d n a s + (d b n 1 s ) + s n (A.8) p(a s n ja s ;d n )N d n a s + (d b n 1 s ); I T s (A.9) In the E-step, q(a s ) is set equal to the distribution p(a s ja s 1 ;::;a s N ; ;X s ). Instead of exactly computing p(a s ja s 1 ;::;a s N ; ;X s ), we sample a point a s from this distribution. The distribution q(a s ) is then set equal to this point estimate from p(a s ja s 1 ;::;a s N ; ;X s ) as shown in (A.10) ( is the Dirac-delta function). Several popular algorithms such as K-means [326] and Viterbi EM for Hidden Markov Models [327] make this approximation. q(a s ) =(a s a s ) (A.10) We obtain the point estimate a s as Maximum Log-likelihood Estimate (MLE) 197 ofa s based onp(a s ja s 1 ;::;a s N ; ;X s ), as shown in (A.11). Subsequently, we rewrite the expression using multiplication theorem in (A.12). a s = arg max a s logp(a s ja s 1 ;::;a s N ; ;X s ) (A.11) = arg max a s log p(a s ;a s 1 ;::;a s N j;X s ) p(a s 1 ;::;a s N j;X s ) (A.12) The denominator in (A.12) does not containa s and can be disregarded during maximization. 
The MLE can be computed from the equivalent problem: a s = arg max a s logp(a s ;a s 1 ;::;a s N j;X s ) (A.13) Using equations (A.20)-(A.23), we can write (A.13) as: a s = arg max a s log N Y n=1 p(a s n ja s ;d n ;d b n )p(a s j;X s ) = arg max a s N X n=1 logp(a s n ja s ;d n ;d b n ) + logp(a s j;X s ) (A.14) During the M-step, we optimizeM 1 dened in (A.5). Substituting the assumed distribution for q(a s ) as stated in (A.10),M 1 is reduced to M 1 = X s2S N X n=1 logp(a s n j a s ;d n ;d b n ) | {z } term containing lter coecientsdn + X s2S logp( a s j;X s ) | {z } term containing mapping function parameter (A.15) Note that the lter parameters (d n ;d b n ) and appear in separate terms in the above equation (A.15). Hence in the M-step,hd 1 ;:::;d N ;d b 1 ;::;d b N i can be obtained by maximizing the rst term and by maximizing the second term alone. For the 198 sake of completion, we restate the stepwise EM algorithm implementation below. EM algorithm implementation Initialize lter coecientshd 1 ;:::;d N i, bias termshd b 1 ;::;d b N i and mapping func- tion parameter. While the data-log likelihood converges, perform: - E-step: In this step, we obtain the ground truth estimate a s as shown in A.14. Substituting Gaussian distribution functions dened in 8.7 and 8.5, we arrive at the equivalent optimization problem shown in (A.16).jj:jj 2 represents the L 2 vector norm. a s = arg min a s N X n=1 (a s n ) (d n (a s ) +d b n 1 s ) 2 2 + (a s ) 2 4 X s 1 3 5 2 2 ; 8s2S (A.16) - M-step: In the M-step, we maximizeM 1 in A.15. As stated, we can estimate lter coecientshd 1 ;::;d N i and parameter by operating separately on the two constituent terms. Substituting the Gaussian distributions stated in (8.7) and (8.5) in (A.15), we obtain the following optimization problems d n ;d b n = arg max dn;d b n X s2S N X n=1 logp(a s n j a s ;d n ;d b n ); n = 1;::;N = arg min dn;d b n X s2S N X n=1 (a s n ) (d n a s +d b n 1 s ) 2 2 (A.17) We would like to point out that it turns out that the joint optimization over the parameters d n and d b n can be carried out in one step by reformulating the 199 convolution and summation as a joint matrix multiplication. This optimization, however, involves a matrix inversion. This is a slow operation and can be replaced by faster methods such as gradient descent. The parameter is obtained as follows. = arg max X s2S logp( a s j;X s ) = arg min X s2S (a s ) 2 4 X s 1 3 5 2 2 (A.18) Appendix 2: Proof for equation (A.4) To prove: p(a s 1 ;::;a s N ;a s j;X s ) = p(a s 1 ;::;a s N ;a s jd 1 ;::;d N ;d b 1 ;::;d b N ;;X s ) = N Y n=1 p(a s n ja s ;d n ;d b n )p(a s j;X s ) (A.19) Proof: Using Bayes theorem, we can write: p(a s 1 ;::;a s N ;a s j;X s ) = p(a s 1 ;::;a s N ja s ; ;X s ) p(a s j;X s ) (A.20) Now we simplify the joint probability in equation (A.20) using D-separation properties in Bayesian networks [328]. Claim 1:ha s 1 ;::;a s n ;::;a s N i are mutually independent, givena s . Therefore: p(a s 1 ;::;a s N ja s ;;X s ) = N Y n=1 p(a s n ja s ; ;X s ) (A.21) 200 Proof: By the denition of conditional probability, p(a s 1 ;::;a s N ja s ; ;X s ) implies that a s is given in determining the joint probability betweenha s 1 ;::;a s N i. We apply \common cause" clause (dened in Section 3.3.1 in [328]) to the graphical model in Figure 8.3. Asa s is given, the clause implies thatha s 1 ;::;a s n ;::;a s N i are mutually independent. Claim 2: a s n is independent of alld n 0, n 0 6=n andh;X s i, givena s . 
Therefore: p(a s n ja s ; ;X) =p(a s n ja s ;d 1 ;::;d N ;d s 1 ;::;d s N ;;X) =p(a s n ja s ;d n ;d b n ) (A.22) Proof: The variablea s is given in determiningp(a s n ja s ; ;X). We apply \indirect evidential eect" clause [328] to the graphical model in Figure 8.3 to show that a s n is independent ofh;X s i. Similarly, we apply the \common cause" clause to show thata s n is independent of all distortion function parametersd n 0;d b n 0, n 0 6=n not directly connected toa s n . Claim 3: a s is independent of the lter parametershd 1 ;::;d N ;d b 1 ;::;d b N i, in a prob- ability distribution not conditioned onha s 1 ;::;a s N i. Therefore: p(a s j;X s ) =p(a s jd 1 ;::;d N ;d b 1 ;::;d b N ;;X s ) =p(a s j;X s ) (A.23) Proof: In case of the conditional probability p(a s jd 1 ;::;d N ;d b 1 ;::;d b N ;;X s ), the variablesha s 1 ;::;a s N i are not given. We apply \common eect" clause [328] to the graphical model in Figure 8.3. The clause implies that a s is independent of the parametershd 1 ;::;d N ;d b 1 ;::;d b N i. 201 Appendix B Derivation of equations for Chapter 9 Appendix 1: Proof for equation (9.10) To prove: q(z ij ) =p(z ij jz 1 ij ;::;z K ij ;x i ;x j ;w;r 1 ;::;r K ) = p(z ij jx i ;x j ;w) K Y k=1 p(z k ij jz ij ;r k ) =p(z 1 ij ;::;z K ij ) Proof: q(z ij ) =p(z ij jz 1 ij ;::;z K ij ;x i ;x j ;w;r 1 ;::;r K ) =p(z ij ;z 1 ij ;::;z K ij jx i ;x j ;w;r 1 ;::;r K )=p(z 1 ij ;::;z K ij ) (B.1) By Bayes theorem: p(z ij ;z 1 ij ;::;z K ij jx i ;x j ;w;r 1 ;::;r K )=p(z 1 ij ;::;z K ij ) =p(z 1 ij ;::;z K ij jz ij ;x i ;x j ;w;r 1 ;::;r K ) p(z ij jx i ;x j ;w;r 1 ;::;r K )=p(z 1 ij ;::;z K ij ) (B.2) Based on the graphical model in Figure 9.1(c), we can say that z 1 ij ;::;z K ij are independent of the attributes x i ;x j and SVR vector w, using the \indirect evi- dential eect" clause in [328]. 202 p(z 1 ij ;::;z K ij jz ij ;x i ;x j ;w;r 1 ;::;r K ) = p(z 1 ij ;::;z K ij jz ij ;r 1 ;::;r K ) (B.3) Next, applying the \common clause" eect [328] to the graphical model in Figure 9.1(c), we can say that z 1 ij ;::;z K ij are mutually independent given z ij . Con- sequentially, z k ij is also independent of all r k 0 for all k 0 6= k due to the \common clause" eect. Therefore: p(z 1 ij ;::;z K ij jz ij ;r 1 ;::;r K ) = K Y k=1 p(z k ij jz ij ;r k ) (B.4) We can also say thatz ij is independent ofr 1 ;::;r K when the probability distri- bution is not conditioned onz 1 ij ;::;z K ij again using the \common clause" eect [328]. p(z ij jx i ;x j ;w;r 1 ;::;r K ) =p(z ij jx i ;x j ;w) (B.5) Replacing (B.4) and (B.5) into (B.2), we obtain q(z ij ) =p(z ij jz 1 ij ;::;z K ij ;x i ;x j ;w;r 1 ;::;r K ) (B.6) =p(z ij jx i ;x j ;w) K Y k=1 p(z k ij jz ij ;r k )=p(z 1 ij ;::;z K ij ) (B.7) Appendix 2: Proof for equation (9.16) To prove: logp(z ij ;z 1 ij ;::;z K ij jx i ;x j ;w;r 1 ;::;r K ) = K X k=1 logp(z k ij jz ij ;r k ) + logp(z ij jx i ;x j ;w) (B.8) Proof: 203 Application of (B.2)-(B.5) to the left hand side of (B.8) yields the desired result. Appendix 3: Probability distribution for optimization in equation (9.18) The goal in the M-step of the EM algorithm in order to obtainw was to perform the following optimization. w = arg max w M w = arg max w X z ij 2f0;1g q(z ij ) logp(z ij jx i ;x j ;w) (B.9) Where p(z ij = 1jx i ;x j ;w) = exphw;fx i x j gi 1 + exphw;fx i x j gi (B.10) p(z ij = 0jx i ;x j ;w) = 1p(z ij = 1jx i ;x j ;w) (B.11) Instead, we performed the optimization in (9.18), restated below. 
w = arg min w M w = arg min w q(z ij = 1)[1hw;fx i x j gi] + +q(z ij = 0)[1hw;fx j x i gi] + (B.12) Above optimization can be rewritten as shown in (B.13). w = arg max w q(z ij = 1)(1 [1hw;fx i x j gi] + ) +q(z ij = 0)(1 [1hw;fx j x i gi] + ) (B.13) We compare the negative hinge loss function (1 [1hw;fx i x j gi] + ) and the log of the logistic loss function (logp(z ij jx i ;x j ;w)) stated in (B.9). Figure B.1 shows the values that these function take with respect to the input wfx i x j g. 204 20 15 10 5 0 5 10 15 20 Input: w{x i −x j } 25 20 15 10 5 0 Function values Hinge loss function Logistic loss function Figure B.1: Plot comparing the values of the negative hinge loss function (1 [1hw;fx i x j gi] + ) and the log of logistic loss function (logp(z ij jx i ;x j ;w)). The plots indicate that the values taken by the two functions are very close to each other. One dierence is around an input value of 0, where the hinge loss function is not dierentiable but the logistic loss function is. More importantly, the slopes of the two functions are same for a large range of input and therefore, for all practical purposes, the gradient descent algorithm should provide similar results after replacing the logistic loss function with hinge loss function in the M- step of the EM algorithm. However, we were unable to theoretically prove that the algorithm still falls under the paradigm of generalized EM algorithm, and therefore is an approximation in the EM algorithm. 205
Abstract (if available)
Abstract
Nonverbal communication is an integral part of human communication and involves sending wordless cues. Development of nonverbal communication starts at an early stage in life, even before the development of verbal communication skills and constitutes a major part of communication. In literature, nonverbal communication has been described as an encoding-decoding process. During encoding, a person embeds his internal state comprising of emotions, mental well being and sentiment into a multimodal set of nonverbal cues including nonverbal vocalizations, body gestures and facial expressions. After receiving these encoded cues, another person decodes these nonverbal cues based on his perceptions of the cues. In this thesis, I aim to facilitate the understanding of this encoding-decoding process. Specifically, I conduct experiments with respect to three different target scenarios including: (a) detection of nonverbal cues in human interaction, (b) estimation of latent states embedding using nonverbal cues and, (c) modeling diversity in perception of non-verbal cues. ❧ The first part of this thesis involves detection of nonverbal cues in time continuous human signals such as speech and body language. Accurate detection of nonverbal cues can help us understand the embedding of nonverbal cues in human signals and can also aid the downstream analysis of nonverbal cues such as estimation of latent states and modeling diversity in perceptions of nonverbal cues. I develop two schemes on identification of nonverbal vocalizations (laughter and fillers) in telephonic speech and automatic detection of disfluencies in automatic speech recognition. Both these experiments exploit the temporal characteristics of nonverbal cues and reflect the importance of context in nonverbal detection problems. ❧ In the second part, I focus on latent behavioral state estimation using nonverbal cues. These experiments are designed to understand the encoding process during nonverbal communication. Although the encoding process is fairly complex, these experiments establish the relation between latent human behavioral states and nonverbal cues as the first step. I conduct experiments to establish the relation between behavioral constructs such as empathy, emotion, depression and other behavioral constructs with nonverbal cues such as facial expressions, prosody and nonverbal vocalizations. My focus in these experiments is not only to develop a mapping function between nonverbal cues and human behavior but also build interpretable models to understand the mapping. ❧ In the last part of the thesis, I develop models to understand diversity in perception of nonverbal cues. These experiments are a step towards understanding the decoding step in nonverbal communication. Decoding is a complex and person specific process and I develop models to capture and quantify the variability among people’s perception of nonverbal cues. I conduct two experiments on modeling multiple annotator behavior in a study involving rating smile strength in dyadic human interaction and another experiment on modeling perceived pairwise noisy rankings by multiple annotators. I develop these models with parameters which can capture the variability in perceptions among people. Furthermore, such parameters can be used to quantify the variability. ❧ This thesis takes a holistic approaches the encoding-decoding process during nonverbal communication and proposes models towards better understanding of the nonverbal communication phenomenon. 
As establishing these relationships is not trivial, I use a fusion of experts in all my models to enhance the generalizability and interpretability of my results. For each of the parts stated above, I use either a set of sequential models, generalized stacking or an ensemble of experts as the means of modeling. These models are geared towards providing insights into the encoding-decoding phenomenon rather than relying on a black-box approach. In summary, this thesis contributes towards the understanding of nonverbal communication while using novel methods that apply to a more general class of problems. I take a multidisciplinary approach to understanding the phenomenon of nonverbal communication, with novel designs inspired by state-of-the-art practices in machine learning.
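As a rough illustration of the fusion-of-experts idea mentioned above, the snippet below stacks two base classifiers under a logistic-regression meta-learner using scikit-learn. The choice of base learners, the meta-learner and the synthetic data are assumptions made purely for demonstration; they are not the models or corpora used in this work.

# Minimal generalized-stacking ("fusion of experts") sketch with scikit-learn.
# Base learners, meta-learner and synthetic data are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each "expert" is trained on the same features; the logistic-regression
# meta-learner fuses their cross-validated predictions, and its weights stay
# inspectable, which is one way stacking can remain interpretable.
experts = [("rf", RandomForestClassifier(random_state=0)),
           ("svm", SVC(probability=True, random_state=0))]
fusion = StackingClassifier(estimators=experts,
                            final_estimator=LogisticRegression())
fusion.fit(X_tr, y_tr)
print("held-out accuracy:", fusion.score(X_te, y_te))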
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Towards social virtual listeners: computational models of human nonverbal behaviors
A computational framework for diversity in ensembles of humans and machine systems
Modeling expert assessment of empathy through multimodal signal cues
Behavioral signal processing: computational approaches for modeling and quantifying interaction dynamics in dyadic human interactions
Computational modeling of behavioral attributes in conversational dyadic interactions
Computational modeling of human behavior in negotiation and persuasion: the challenges of micro-level behavior annotations and multimodal modeling
Computational modeling of human interaction behavior towards clinical translation in autism spectrum disorder
Emotional speech production: from data to computational models and applications
Human behavior understanding from language through unsupervised modeling
Automatic quantification and prediction of human subjective judgments in behavioral signal processing
Latent space dynamics for interpretation, monitoring, and prediction in industrial systems
Nonverbal communication for non-humanoid robots
Context-aware models for understanding and supporting spoken interactions with children
Parasocial consensus sampling: modeling human nonverbal behaviors from multiple perspectives
Establishing cross-modal correspondences for media understanding.
Modeling dyadic synchrony with heterogeneous data: validation in infant-mother and infant-robot interactions
Generative foundation model assisted privacy-enhancing computing in human-centered machine intelligence
Toward understanding speech planning by observing its execution—representations, modeling and analysis
Knowledge-driven representations of physiological signals: developing measurable indices of non-observable behavior
Improving modeling of human experience and behavior: methodologies for enhancing the quality of human-produced data and annotations of subjective constructs
Asset Metadata
Creator
Gupta, Rahul (author)
Core Title
Computational methods for modeling nonverbal communication in human interaction
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Electrical Engineering
Publication Date
10/11/2016
Defense Date
05/23/2016
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
computational methods, decoding in communication, encoding in communication, non-verbal communication, OAI-PMH Harvest
Format
application/pdf (imt)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Narayanan, Shrikanth (committee chair), Liu, Yan (committee member), Ortega, Antonio (committee member)
Creator Email
guptarah@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c40-315233
Unique identifier
UC11214610
Identifier
etd-GuptaRahul-4879.pdf (filename), usctheses-c40-315233 (legacy record id)
Legacy Identifier
etd-GuptaRahul-4879.pdf
Dmrecord
315233
Document Type
Dissertation
Rights
Gupta, Rahul
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA