Computational Modeling of Human Behavior in Negotiation and Persuasion: The Challenges of Micro-Level Behavior Annotations and Multimodal Modeling

by Sunghyun Park

Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

May 2016

Copyright 2016 Sunghyun Park

Table of Contents

List of Tables
List of Figures
Abstract

Chapter 1 Introduction
1.1 Motivation
1.2 Research Contexts and Challenges
1.2.1 Large-Scale Human Behavior Annotations
1.2.2 Computational Modeling in Face-to-Face Negotiation
1.2.3 Computational Modeling in Online Persuasion
1.3 Contributions
1.4 Outline

Chapter 2 Related Work and Theoretical Backgrounds
2.1 Crowdsourcing Micro-Level Human Behavior Annotations
2.1.1 Quality Control
2.1.2 Crowdsourcing Video-Related Tasks
2.1.3 Inter-Coder Agreement / Reliability
2.2 Multimodal Behavior in Negotiation
2.2.1 Nonverbal Factors in Face-to-Face Negotiation
2.2.1.1 Proposer's and Respondent's Behavior
2.2.1.2 Mutual Behavior
2.2.1.3 History
2.3 Multimodal Behavior in Persuasion
2.3.1 Modality Influence and Human Perception
2.3.2 Acoustic Perspective
2.3.3 Verbal Perspective
2.3.4 Para-Verbal Perspective
2.3.5 Visual Perspective
2.3.6 High-Level Attributes Related to Persuasion
2.3.7 Thin Slice Prediction

Chapter 3 Crowdsourcing Micro-Level Human Behavior Annotations
3.1 Introduction
3.2 Online Crowdsourcing Tool for Annotations of Behavior (OCTAB)
3.2.1 Annotation Module (Micro-Level Behavior Annotations)
3.2.1.1 Precision
3.2.1.2 Integrability
3.2.1.3 Usability
3.2.2 Training Module (Training Crowd Workers Online)
3.2.2.1 Observational Study of Experienced Local Coders' Training
3.2.2.2 Training Module's Design
3.3 Procedure for Crowdsourcing Micro-Level Behavior Annotations in Videos
3.3.1 Obtaining Coding Schemes / Ground-Truth Annotations
3.3.2 Recruiting / Screening Crowd Workers
3.3.3 Training Crowd Workers Online
3.3.4 Unique vs. Repeated Annotations
3.4 Experiments
3.4.1 Evaluation Methods
3.4.1.1 Time-Slice Krippendorff's Alpha
3.4.1.2 Disagreement Type Analysis
3.4.2 Datasets
3.4.3 Annotated Behavioral Cues
3.4.4 Experimental Design
3.4.4.1 Experienced Local Coders
3.4.4.2 Untrained Crowd Workers
3.4.4.3 Trained Crowd Workers
3.4.4.4 Repeated Annotations
3.4.5 Annotation Strategies
3.5 Results and Discussions
3.5.1 User Experience Ratings of OCTAB
3.5.2 Performance of Trained Crowd Workers
3.5.3 Performance of Untrained Crowd Workers
3.5.4 Trained vs. Untrained Crowd Workers
3.5.5 Time-Slice Krippendorff's Alpha
3.6 Conclusions

Chapter 4 Computational Modeling of Human Behavior in Face-to-Face Negotiation
4.1 Introduction
4.2 Computational Descriptors
4.2.1 Proposer's Behavior
4.2.2 Respondent's Behavior
4.2.3 Mutual Behavior
4.2.3.1 Acoustic Mutual Behavior
4.2.3.2 Visual Mutual Behavior
4.2.4 Negotiation History
4.3 Experiments
4.3.1 Dyadic Negotiation Dataset
4.3.2 Annotations
4.3.3 Prediction Models and Methodology
4.4 Results
4.4.1 Predicting the Respondent Reactions (H1)
4.4.2 Benefit of More Sources of Information
4.4.3 Mutual Behavior and Classification of Cooperative vs. Competitive Interactions (H2)
4.4.4 Top Performing Individual Descriptors
4.5 Discussions
4.5.1 Limitations
4.5.2 Predicting the Respondent Reactions (H1)
4.5.3 Benefit of More Sources of Information
4.5.4 Mutual Behavior and Classification of Cooperative vs. Competitive Interactions (H2)
4.5.5 Top Performing Individual Descriptors
4.6 Conclusions

Chapter 5 Computational Modeling of Persuasive Behavior in Online Social Multimedia
5.1 Introduction
5.2 Research Hypotheses
5.2.1 Computational Descriptors (Unimodal vs. Multimodal Prediction)
5.2.2 Attribute-Based Multimodal Approach
5.2.3 Effect of Opinion Polarity
5.2.4 Effect of Gender
5.2.5 Thin Slice Prediction
5.3 Persuasive Opinion Multimedia (POM) Corpus
5.3.1 Subjective Annotations
5.3.1.1 Persuasiveness and High-Level Attributes
5.3.1.2 Analysis
5.3.1.3 Transcriptions
5.4 Computational Descriptors
5.4.1 Acoustic Descriptors
5.4.2 Verbal Descriptors
5.4.3 Para-Verbal Descriptors
5.4.4 Visual Descriptors
5.5 Experiments
5.5.1 Persuasiveness Labels
5.5.2 Experimental Conditions
5.5.3 Methodology
5.6 Results and Discussions
5.6.1 Unimodal vs. Multimodal (H1)
5.6.2 Attribute-Based Multimodal Approach (H2)
5.6.3 Effect of Opinion Polarity (H3)
5.6.4 Effect of Gender (H4)
5.6.5 Thin Slice Prediction (H5)
5.6.6 Descriptor Analysis
5.7 Conclusions

Chapter 6 Conclusions and Future Directions
6.1 Conclusions
6.2 Future Directions

Bibliography

List of Tables

4.1 Proposal-response events distribution in the face-to-face dyadic negotiation dataset.
4.2 The prediction performance of both the early-fusion and late-fusion approaches with different combinations of information sources.
4.3 Top descriptors according to their prediction performance when used alone in a single-descriptor predictor.
5.1 Krippendorff's alpha agreement values for the annotations of persuasiveness and other related high-level attributes, including the Big Five personality dimensions.
5.2 An overview of the computational multimodal descriptors.
5.3 The multimodal prediction results using the computational descriptors in all combinations of modalities.
5.4 Top computational descriptors in each modality for predicting between strongly and weakly persuasive speakers.
List of Figures

1.1 An overview of the thesis on multimodal behavior analysis and modeling in the contexts of face-to-face dyadic negotiation and persuasion in online social multimedia content, specifically addressing the technical challenges on (1) obtaining large-scale human behavior annotations, (2) building computational representations of individual and interpersonal human behavior, (3) performing temporal analysis and real-time prediction, and (4) fusing multimodal information for effective computational models.
3.1 An overview of the crowdsourcing approach for obtaining micro-level behavior annotations in videos, with a focus on the new web interface called OCTAB that includes a module specifically designed to train crowd workers online. The approach is generalizable, and the training effect transfers to annotating new independent video corpora.
3.2 The first component of OCTAB (Online Crowdsourcing Tool for Annotations of Behavior) is a web annotation module that allows crowd workers to make precise micro-level annotations of human behavior or events in videos.
3.3 The second module of OCTAB to effectively train crowd workers online by giving them a quick overall visualization of disagreement (top) and the ability to review both ground-truth and their attempted annotations side-by-side (bottom).
3.4 Definition of the event and segmentation agreement metrics with examples.
3.5 The user experience ratings of the OCTAB interface.
3.6 The performance of the trained crowd workers on the YouTube dataset (top) and the Semaine dataset (bottom). The dotted lines indicate the agreement threshold point at 0.67 measured with the Time-Slice Krippendorff's alpha.
3.7 The performance of the untrained crowd workers on the YouTube dataset. The dotted line indicates the agreement alpha threshold at 0.67.
3.8 The performance comparison between the untrained and trained crowd workers on the YouTube dataset (t-tests showed statistically significant difference at p* < 0.01 and p** < 0.001).
3.9 The sensitivity analysis of the Time-Slice Krippendorff's alpha across different frame sampling rates.
4.1 An overview of this work's approach to predict the respondent reactions (acceptances or rejections) to negotiation offers using predictive computational descriptors from various sources of information.
4.2 An illustration of two proposal-response events and two different types of time windows where the computational descriptors were extracted.
4.3 An illustration of how audio-visual mutual behavior of symmetry and asymmetry was encoded as three computational descriptors for each type of acoustic and visual behavioral cue, as short-term and also as long-term cues. For short-term cues, only the behavior from the beginning of the proposal until the beginning of the corresponding response was considered.
4.4 The mean accuracies for predicting the respondent reactions to negotiation offers using the computational descriptors from each source of information.
The two right-most red bars show the early-fusion performance of combining all sources together at the feature-level and the late-fusion performance of combining them at the decision-level. The error bars show 1 standard error in both directions, and the paired-samples t-tests showed statistical significance in performance at p** < 0.001.
4.5 The mean prediction accuracies for combining multiple sources of information together. The graph shows the results of early-fusion at the feature-level.
5.1 An overview of the chapter with a newly created multimedia corpus and the multimodal approach for predicting persuasiveness with acoustic, verbal, para-verbal, and visual computational descriptors.
5.2 Pearson's correlation coefficients between persuasiveness and high-level and personality attributes (after taking the mean of 3 repeated annotations). The two horizontal dotted lines indicate critical values at p < 0.001 for two-tailed probabilities, and the vertical dotted line visually divides the personality dimensions from other attributes.
5.3 An overview of the attribute-based multimodal prediction approach in which the high-level attributes are used in the middle layer before predicting a speaker's level of persuasiveness.
5.4 The persuasiveness prediction results for the multimodal and unimodal models with the regression results on the left and the classification results on the right (p* < 0.05 and p** < 0.01). The error bars show 1 standard error.
5.5 The persuasiveness prediction results for two different multimodal approaches, one combining all the descriptors at the feature-level and the other using the attribute-based fusion. The error bars show 1 standard error.
5.6 The persuasiveness prediction results for the multimodal models when made opinion polarity-dependent and gender-dependent. The error bars show 1 standard error.
5.7 The persuasiveness prediction results for various thin slices. The left graph shows the thin-slice results of using computational descriptors encoded from the length of only 1/10th of each review session, and the right graph shows the results for cumulative thin-slice windows (i.e., first 5% of the session, first 10%, first 15%, etc.). The dotted line in each graph indicates the prediction level for the multimodal approach in H1 using computational descriptors from all the modalities and the whole 100% session.

Abstract

Having a deeper understanding of human communication and modeling it computationally has substantial implications for our lives due to its potential synergistic impact with ever advancing technologies. It is an important step for a technology to be accepted as having effective artificial intelligence. However, human communication is a complicated phenomenon that can take an in-depth multimodal analysis of human behavior to understand, across all of the verbal, vocal, and visual channels. The challenge of multimodality is further complicated by many behavioral cues that are subtle and ambiguous.

The work described in this thesis primarily revolves around computational modeling of human behavior, approaching it largely from the affective and social perspectives.
This thesis explores computational behavior analysis and modeling in terms of two important contexts of human communication, one in face-to-face interaction and the other in online telemediated interaction. Firstly, this thesis explores human communication in the context of face-to-face dyadic negotiation to better understand and model interpersonal dynamics that occur during close negotiation interaction. Secondly, this thesis explores human communication in the context of online persuasion, to obtain a deeper understanding of persuasive behavior and explore its computational models with online social multimedia content.

In studying human communication in these two contexts of face-to-face negotiation and online persuasion, this thesis addresses four significant research challenges: large-scale annotations, behavior representations, temporal modeling, and multimodal fusion. Firstly, this thesis addresses the challenge of obtaining annotations of human behavior on a large scale, which provide the basis from which computational models can be built. Secondly, this thesis addresses the challenge of making computational representations of multimodal human behavior, in terms of individual behavior and also interpersonal behavior for capturing the dynamics during face-to-face interaction. Thirdly, this thesis addresses the challenge of modeling human behavior with a temporal aspect, specifically for the purpose of making real-time analysis and prediction. Lastly, this thesis explores multimodal fusion techniques in building computational models of human behavior.

Chapter 1 Introduction

1.1 Motivation

In the midst of this new era with an unprecedented rate of scientific and technological advances, one notable trend is burgeoning across many disciplines: a growing awareness of the importance of understanding human communication and behavior. From smartphones to autonomous vehicles, once a new technology matures enough to satisfy its functional requirements, an inevitable next step is making it more convenient and effective from the perspective of us, humans. Trying to understand human communication in various contexts is important because it enables us to understand the hows and whys of our behavior. As technological advances start to permeate every fabric of our lives, however, studying human communication and behavior has an added implication for its synergistic impact with technology. In many contexts that involve human interaction, including business (Graham, Unruh, & Jennings, 1991; Morand, 2001), marketing and services (Gabbott & Hogg, 2000; Sundaram & Webster, 2000), education (Skinner & Belmont, 1993), and medicine (Beck, Daughtridge, & Sloane, 2002; Ong, de Haes, Hoos, & Lammes, 1995), obtaining a deeper insight into and building computational models of human behavior has the potential to strongly reshape the landscape of human lives. Currently in the discipline of computer science, such research efforts are especially prominent in the subfields related to high-level artificial intelligence (AI), notably in human-computer interaction (Jaimes & Sebe, 2007; Pantic, Pentland, Nijholt, & Huang, 2006), human-robot interaction (Fong, Nourbakhsh, & Dautenhahn, 2003), affective computing (Picard, 2000), and virtual humans (Cassell, Sullivan, Prevost, & Churchill, 2000).
Moreover, studying human communication is a largely multidisciplinary effort with researchers from a broad spectrum of backgrounds outside of computer science, including psychology, sociology, cognitive science, linguistics, and signal processing, among others.

Human communication is an intricate mechanism. From birth, we receive and process a large amount of data to learn and develop our communication skills. We need a great amount of reading and speaking to learn a language, and much experience teaches us many explicit and implicit behavioral signals and their meanings. For instance, we learn when and how to shake hands, and we also learn that a smile might not always mean a happy signal (Ekman & Friesen, 1982). Over time, we can infer a close friend's behavior with only a few seconds of interaction, allowing us to know if he/she is in a good mood or not. We also learn that our communicative behavior involves a dynamic interplay with other interlocutors, such as behavior synchrony and asynchrony. What makes learning human communication even more difficult is that it all takes place in multiple modalities, including all of the verbal, visual, and acoustic channels (Knapp, Hall, & Horgan, 2013). Some researchers go as far as to claim that our perception of an individual is determined mostly by his/her vocal and visual cues rather than verbal content (Mehrabian, 1971). Even though such claims are arguable to accept at face value, common sense and experience tell us that how we say something is also very important compared to what we actually say.

For this challenging problem of understanding multimodal human behavior, many researchers approach it from the affective perspective and try to interpret human behavior in light of the human experience of feeling, such as emotion or mood (Ekman & Davidson, 1994; Jaimes & Sebe, 2007; Pantic et al., 2006). The approach is based on the assumption that human affect is more or less commonly present across race, gender, and culture.

Figure 1.1: An overview of the thesis on multimodal behavior analysis and modeling in the contexts of face-to-face dyadic negotiation and persuasion in online social multimedia content, specifically addressing the technical challenges on (1) obtaining large-scale human behavior annotations, (2) building computational representations of individual and interpersonal human behavior, (3) performing temporal analysis and real-time prediction, and (4) fusing multimodal information for effective computational models.

A similar line of approach that has recently emerged is to interpret human behavior in terms of social signals (Vinciarelli, Pantic, & Bourlard, 2009). The work described in this thesis also uses the affective and social intuitions in modeling human behavior, especially in encoding affective and social cues as computational descriptors.
For instance, computationally encoding emotional signals such as smiles and also social signals such as displaying similar behavior can provide meaningful information about the atmosphere of an ongoing interaction. Such intuitions can help in designing a computational model with affective and social intelligence that can be applied and accepted as natural and human-friendly AI by the general public.

1.2 Research Contexts and Challenges

The research work described in this thesis primarily revolves around computational modeling of human behavior, approaching it largely from the affective and social perspectives (Figure 1.1). Human communication occurs in the face-to-face environment involving physical interaction, and it also increasingly occurs online with the Internet as the medium of communication. Given these two environments of human communication, this thesis explores computational behavior analysis and modeling in terms of two important contexts of human communication, one in face-to-face negotiation and the other in online persuasion:

• Face-to-face negotiation: This thesis explores human communication in the context of face-to-face dyadic negotiation to better understand and model individual and interpersonal dynamics that occur during close negotiation interaction.

• Online persuasion: This thesis explores human communication in the online environment, specifically in the context of persuasion, to obtain a deeper understanding of persuasive behavior and explore its computational models with online social multimedia content.

In studying human communication in these two contexts of face-to-face negotiation and online persuasion, this thesis addresses four significant research challenges:

1. Human behavior annotations: The challenge of obtaining detailed and micro-level annotations of human behavior on a large scale, which provide the basis from which computational models can be built. Micro-level behavioral annotations refer to those that identify the precise start and end time points of a behavioral cue in a given audio or video sequence of human behavior.

2. Computational representations of behavior: The challenge of making condensed and meaningful computational representations of human behavior, in terms of individual behavior as well as interpersonal behavior for capturing the behavioral dynamics between interlocutors (e.g., behavior mirroring) during face-to-face interaction.

3. Temporal analysis: The challenge of modeling human behavior with a temporal aspect, specifically for the purpose of making real-time analysis and prediction.

4. Multimodal fusion: The challenge of fusing multimodal information from verbal, visual, and acoustic human behavior and designing effective computational models.

This thesis addresses the challenges across three connected research topics that converge to a coherent theme of understanding human communication and building computational models of human behavior. The first topic is crowdsourcing large-scale micro-level annotations of human behavior in videos to obtain the data necessary for behavior analysis and computational modeling. This research topic directly addresses the first research challenge of behavior annotations. The other two are building computational models of human behavior in the context of face-to-face dyadic negotiation, which addresses the challenge of behavior representations, and in the context of online persuasion, which addresses the challenge of temporal analysis.
The challenge of multimodal fusion is addressed in both research contexts. The following three subsections each describe a research topic that makes up a separate chapter in this manuscript.

1.2.1 Large-Scale Human Behavior Annotations

In order to build an effective computational model of human behavior, one of the initial challenges is the need for enough data to analyze and train a model, especially for statistical models and supervised learning approaches. It requires a large number of micro-level behavior annotations in video and audio sequences, in which the precise start and end time segments of relevant behavioral cues need to be identified. In the past, most of the annotation effort was carried out manually with locally recruited and trained annotators, making it very expensive and time-consuming. Although recent progress in computer vision and audio signal processing technologies (Degottex, Kane, Drugman, Raitio, & Scherer, 2014; Morency, Whitehill, & Movellan, 2008) enables automatic extraction of various visual and acoustic behavioral cues to some extent, these are mostly low-level signals that need context-specific interpretations to extract high-level information, and researchers still largely cannot avoid the manual effort. In fact, such extraction technologies themselves require manual annotations before their models can be built.

Fortunately, recent advancements in Internet technology and a growing interest in the crowdsourcing paradigm are changing the landscape of obtaining various types of annotations, including human perception data. Crowdsourcing is the idea of distributing to people at large small tasks at which humans perform well but computers cannot (Ross, Irani, Silberman, Zaldivar, & Tomlinson, 2010; Yuen, King, & Leung, 2011). It is analogous to each person completing a small piece of a gigantic jigsaw puzzle, with all the pieces combining at the end to complete a demanding task in a short amount of time. Using the Internet as the medium, and with a growing number of online crowdsourcing platforms, crowdsourcing has emerged as a very effective tool for research in many fields from psychology to computer science (Mason & Suri, 2012; Nowak & Ruger, 2010).

1.2.2 Computational Modeling in Face-to-Face Negotiation

Negotiation is a complicated process of two or more parties, often having different preferences or intentions, reaching mutual agreement. It is a component deeply ingrained in our lives, not just in business but even in our daily interaction with our friends and family. It is such a basic and common element of our lives that we often engage in the act without even being consciously aware of it. Modeling human behavior in the context of negotiation and building a system that can automatically analyze and predict the respondent reactions to negotiation offers will have substantial implications. With such a technology, we could imagine having it as a decision support tool to aid and facilitate a negotiation process, or it could be applied as a training system to teach us how to be good negotiators.

There are many challenges associated with analyzing human behavior in the negotiation context and building an effective computational model. One key challenge among others is the need for effective computational representations of human behavior that provide condensed and meaningful information, such as summarizing the smiles a person portrayed (a minimal sketch of such a descriptor is given below).
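As a concrete illustration of this kind of condensed representation, a descriptor can be computed directly from micro-level annotations of a behavioral cue. The short Python sketch below (the function and field names are illustrative choices for this example, not the thesis's actual implementation) summarizes one person's smiles within an observation window as three simple descriptors: how often the person smiled, for how long in total, and what fraction of the window was spent smiling.

```python
def summarize_smiles(smile_segments, window_start, window_end):
    """Condense micro-level smile annotations (start/end times in seconds)
    that fall inside [window_start, window_end] into simple descriptors."""
    duration = window_end - window_start
    # Clip each annotated smile to the observation window.
    clipped = [(max(s, window_start), min(e, window_end))
               for s, e in smile_segments if e > window_start and s < window_end]
    total = sum(e - s for s, e in clipped)
    return {
        "smile_count": len(clipped),       # how often the person smiled
        "smile_total_seconds": total,      # total time spent smiling
        "smile_ratio": total / duration if duration > 0 else 0.0,
    }

# Example: smiles annotated at 2.0-3.5 s and 10.2-11.0 s within a 30-second window.
print(summarize_smiles([(2.0, 3.5), (10.2, 11.0)], 0.0, 30.0))
```

The same pattern applies to any annotated cue (head nods, gaze away, pause fillers, and so on), yielding a small fixed-length feature vector per person and time window.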
Effectively encoding individual behavior is important, but especially in face-to-face interaction, much information can also reside in the interpersonal dynamics, such as mutual smiling. Once individual and interpersonal behavior is effectively encoded, a follow-up key challenge is fusing the various multimodal behavioral cues together in a computational model. That is, how can multimodal information be fused together to effectively model human behavior and give the best performance? The work described in Chapter 4 addresses these challenges of making computational representations and modeling multimodal human behavior in the context of face-to-face negotiation.

1.2.3 Computational Modeling in Online Persuasion

Very much related to negotiation, persuasion is also an essential part of our daily lives, probably at a level even more fundamental than negotiation. Being able to persuade others is a powerful ability, and our lives are influenced by persuasion in many aspects, including business, education, and medicine, just to name a few. With the advent of Internet technology, we are spending more and more time in the online environment, where the communication modality was mostly text in the past. However, with an explosive growth of the Internet's capabilities, coupled with online social networking services and video sharing websites, online communication is increasingly occurring in the form of videos, making it more important and useful to understand persuasion in the context of online social multimedia content. When people post videos of themselves talking about a social issue or a new movie, what is it that distinguishes certain people as more persuasive and influential while others are ignored? We can hope to answer this question and enable many useful real-life applications by computationally modeling human behavior in this context of online persuasion.

The work described in Chapter 5 addresses the key challenges already discussed in the face-to-face negotiation context, designing effective computational representations of behavior and fusing multimodal information in computational models, but in the different problem of online persuasion. For this research problem, the thesis further emphasizes and addresses the challenge of multimodal fusion techniques, exploring novel and effective ways to combine multimodal information to model persuasive behavior. Another challenge addressed in the chapter is providing a temporal aspect for computational models. For instance, a model that needs all the data from the full length of an interaction before it can provide meaningful information would not be very useful for real-time analysis and prediction. The chapter addresses this challenge of providing temporal analysis that enables more useful applications of computational models.

1.3 Contributions

This section gives a brief overview of the significant research contributions of this thesis.

• Online crowdsourcing tool, training procedure, and evaluation metrics: This thesis introduces a novel crowdsourcing interface called OCTAB (Online Crowdsourcing Tool for Annotations of Behavior), a web tool specifically designed for obtaining micro-level human behavior annotations in videos on a large scale. Along with the web tool, this thesis introduces an effective training procedure and various evaluation metrics for measuring annotation quality and disagreement that can be used to train naive crowd workers to be accurate and efficient online annotators.
With an extensive set of experiments, this thesis shows that the crowdsourcing approach can be used to obtain annotations of comparable quality to the ones obtained with locally trained expert annotators. The experiments also show that the approach is scalable and generalizable, with the effect of training crowd workers transferable across different datasets. The work described in Chapter 3 addresses the challenge of obtaining large-scale micro-level behavior annotations of multimedia data through crowdsourcing for the purpose of building computational models of human behavior. For this work, two papers have been published (Park, Mohammadi, Artstein, & Morency, 2012; Park, Shoemark, & Morency, 2014), with one of them recognized as a finalist for the Best Paper Award at the ACM International Conference on Intelligent User Interfaces (Park, Shoemark, & Morency, 2014).

• Computational modeling of human behavior in negotiation (mutual behavior and interpersonal dynamics): This thesis explores multimodal computational models of human behavior in the context of face-to-face dyadic negotiation. Using a face-to-face negotiation dataset consisting of 42 dyadic interactions in a simulated negotiation setting, a large amount of annotations was created on negotiation behavior. This thesis specifically introduces an approach of computationally representing mutual behavior for capturing the interpersonal dynamics between two people during interaction. The approach uses both behavioral symmetry and asymmetry for encoding mutual behavior, and its usefulness is shown by applying it to the problem of modeling human behavior during negotiation. An extensive set of experiments shows that negotiation behavior can be effectively modeled with four different sources of information, namely (1) the nonverbal behavior of the proposer, (2) that of the respondent, (3) the mutual behavior between the negotiators related to behavioral symmetry and asymmetry, and (4) the past negotiation history between the negotiators. For combining multimodal information in the computational models, the experiments show an early-fusion technique at the feature level and also a late-fusion technique at the decision level. For this work, four papers have been published (Nouri et al., 2013; Park, Gratch, & Morency, 2012; Park et al., 2013, 2015), with one paper in an international journal (Park et al., 2015) and another recognized as a finalist for the Best Student Paper Award at the International Conference on Affective Computing and Intelligent Interaction (Park et al., 2013).

• POM (Persuasive Opinion Multimedia) Corpus: This thesis introduces the Persuasive Opinion Multimedia (POM) corpus, consisting of 1,000 online movie review videos, for the purpose of providing the research community with a comprehensive dataset to study persuasion in online social multimedia content. The dataset provides full verbal transcriptions, many automatically extracted multimodal features, and extensive human perception annotations not just on persuasion but also on related high-level attributes such as credibility, passion, and personality traits.

• Computational modeling of human behavior in persuasion (high-level attributes and temporal analysis): This thesis explores multimodal computational models of human behavior in the context of online persuasion.
Using the POM dataset, an extensive set of experiments shows that multimodal computational models derived from verbal and nonverbal behavior can be used to effectively model persuasive behavior. This work introduces a novel attribute-based fusion approach for combining multimodal information and modeling human behavior using various high-level attributes related to a specific behavioral context, such as credibility or expertise in the context of persuasion. Lastly, this thesis introduces a useful approach of adding a temporal aspect to computational models of behavior. In particular, this thesis uses the concept of a thin slice, or a short time window of behavior, to enable computational models to be more meaningful in terms of real-time behavioral analysis and prediction. For this work, four papers have been published (Chatterjee, Park, Shim, Sagae, & Morency, 2014; Mohammadi, Park, Sagae, Vinciarelli, & Morency, 2013; Park, Shim, Chatterjee, Sagae, & Morency, 2014; Shim et al., 2015).

1.4 Outline

The rest of the thesis is organized as follows.

• Chapter 2 presents related work and relevant theoretical backgrounds on crowdsourcing behavior annotations, multimodal modeling of human behavior in the context of negotiation, and that in the context of persuasion.
• Chapter 3 describes the online crowdsourcing approach to obtain micro-level human behavior annotations, showing a new interface, an effective training technique, evaluation metrics, and the generalization of the approach.
• Chapter 4 describes the work on computationally modeling human behavior in the context of negotiation, specifically for the problem of predicting the respondent reactions during face-to-face dyadic negotiation sessions.
• Chapter 5 describes the work on computationally modeling human behavior in the context of persuasion, specifically for the problem of predicting a speaker's level of persuasiveness in online social multimedia content.
• Chapter 6 gives concluding remarks and discusses interesting and promising future research directions.

Chapter 2 Related Work and Theoretical Backgrounds

This chapter highlights relevant work in the literature as well as theoretical background that motivates this thesis. Section 2.1 covers related work on the topic of crowdsourcing annotations. Section 2.2 covers related work on negotiation, including theoretical background on negotiation behavior from traditional psychological research. Lastly, Section 2.3 covers past work on persuasion and the theoretical background most relevant to this thesis.

2.1 Crowdsourcing Micro-Level Human Behavior Annotations

Crowdsourcing has gained much attention lately, and a survey paper by Yuen et al. (2011) and another by Quinn and Bederson (2011) present a general overview of crowdsourcing and human computation. Many interesting applications (Bernstein, Brandt, Miller, & Karger, 2011; Kim et al., 2014; Zeng, Pantic, Roisman, & Huang, 2008) have recently appeared that take advantage of the new paradigm. Regarding Amazon Mechanical Turk, which is a popular online crowdsourcing platform, Mason and Suri (2012) provided detailed explanations on using the platform for conducting behavioral research, and Ross et al. (2010) showed the changing demographics of the crowd workers using the platform.

2.1.1 Quality Control

Quality control is a critical issue with crowdsourcing.
Some researchers showed the benefit of a screening/qualification process and a training procedure (Downs, Holbrook, Sheng, & Cranor, 2010; Le, Edmonds, Hester, & Biewald, 2010; Rashtchian, Young, Hodosh, & Hockenmaier, 2010). Some also explored repeated labeling of data for more reliability (Sheng, Provost, & Ipeirotis, 2008). By comparing annotations (none of them on videos) obtained with crowdsourcing and those with expert annotators, many researchers have reported across different domains that they could obtain quality annotations through crowdsourcing (Gao & Vogel, 2010; Hsueh, Melville, & Sindhwani, 2009; Marge, Banerjee, & Rudnicky, 2010; Nowak & Ruger, 2010; Rashtchian et al., 2010; Snow, O'Connor, Jurafsky, & Ng, 2008). The work described in this thesis incorporates most of these quality control measures.

2.1.2 Crowdsourcing Video-Related Tasks

As for crowdsourcing video-related tasks, researchers have worked on obtaining video summarizations (Wu, Thawonmas, & Chen, 2011), macro-labeling impressions of vloggers in videos (Biel & Gatica-Perez, 2012), and macro-labeling social contexts in video scenes (Riek, O'Connor, & Robinson, 2011). However, none of them was concerned with micro-level annotations within a video, focusing instead on labeling at the whole-video level. Probably the most relevant pieces of work in terms of the web interface introduced in this thesis were done by Vondrick, Ramanan, and Patterson (2010) and Spiro, Taylor, Williams, and Bregler (2010), whose interfaces allow micro-level motion tracking and are also used with Amazon Mechanical Turk. However, their interfaces only put an emphasis on motion tracking, while the crowdsourcing interface in this thesis is concerned with identifying and segmenting behavioral events in videos. Although there are quite a number of software tools for making complicated annotations on videos (Dasiopoulou, Giannakidou, Litos, Malasioti, & Kompatsiaris, 2011), such full-fledged tools are not suitable for crowdsourcing due to a relatively steep learning curve and the difficulty of incorporating them into web-based crowdsourcing platforms.

2.1.3 Inter-Coder Agreement / Reliability

Krippendorff's alpha has previously been used to measure inter-rater reliability of video annotations both at a macro-level (Riek et al., 2011) and at a micro-level (Kang et al., 2012). This thesis follows the approach taken by Kang et al. (2012) at a micro-level, while further exploring the stability and reliability of the alpha at different temporal resolutions. A novel approach is also taken in this thesis to supplement the alpha with disagreement analysis, because the alpha cannot show the types of disagreement between annotators, which can be critical information for effectively training crowd workers.

2.2 Multimodal Behavior in Negotiation

Negotiation has long been and still is an active topic of research, and a paper by Pruitt (2012) gives a brief history of the psychological studies related to negotiation. For researchers endorsing a traditional cognitive view, negotiation is essentially a decision-making process, the people involved dispassionate negotiators, and the outcome a result of the dynamics governed by rational strategies. There are also researchers who put more emphasis on the affective aspect (Druckman & Olekalns, 2008).
Some have tried to understand the general role of affect in different stages of negotiation (Barry & Oliver, 1996) while others have investigated the influence of mood (Baron, 1990; Carnevale & Isen, 1986), emotion (Allred, Mallozzi, Matsui, & Raia, 1997; Van Kleef, De Dreu, & Manstead, 2004), and personality (Barry & Friedman, 1998). In addition, researchers focusing on social contexts further deepen our understanding of negotiation dynamics (Greenhalgh & Chapman, 1998; Kramer & Messick, 1995).

The affective and social perspectives of negotiation give intuition that nonverbal behavior can provide clues to the ongoing state of a negotiation process. Although negotiation research abounds in the literature, there has been limited work on investigating nonverbal behavior in the context of negotiation, let alone computational models. Probably the research problem most analogous to the line of research in this thesis was explored in the work by Curhan and Pentland (2007) and that by Nguyen, Frauendorfer, Mast, and Gatica-Perez (2014). In both works, the authors simulated employment negotiation and interview scenarios and found that certain nonverbal behavioral cues were predictive of the overall outcome in the end. Curhan and Pentland (2007) mainly explored behavioral features in speech and showed that four speech features, including activity, conversational engagement, prosodic emphasis, and vocal mirroring during the first 5 minutes of interaction, predicted 30% of the variance in individual negotiation outcomes on the terms of employment. One noteworthy aspect of this research lay in its focus on thin slices, the idea that observing only a narrow window of behavior is highly predictive of subsequent evaluations. Additionally, all of the speech features were extracted and encoded automatically. Nguyen et al. (2014) explored features from both speech and visual behavior, looking not only at the behavior of the interviewees but also at that of the interviewers, and found that ridge regression explained about 36% of the variance in predicting hirability scores. This work used a combination of automatically and manually coded features.

2.2.1 Nonverbal Factors in Face-to-Face Negotiation

This subsection introduces relevant research on nonverbal behavior in dyadic interaction and highlights four potential sources from which behavioral cues can be extracted for understanding the state of a negotiation process: the nonverbal behavior of the proposer, that of the respondent, the mutual behavior between the two negotiators, and the past negotiation history.

2.2.1.1 Proposer's and Respondent's Behavior

In a business negotiation setting, Niemeier (1997) investigated various nonverbal cues including proxemics, body postures, gestures, facial expressions, and para-language, arguing that they could hint at the emotional attitude of the negotiators. In a study of cooperativeness and competitiveness during negotiation, Johnson (1971) and Johnson, McCarty, and Allen (1976) similarly found that cooperativeness is expressed through "warm" behavior including soft tones of voice, smiles, interested facial expressions, direct eye contact, open gestures, close spatial distance, and occasional soft touching, while competitiveness is expressed through "cold" behavior including tense postures, avoidance of eye contact, closed gestures, distant spatial distance, and avoidance of touching. Head movements can also provide rich information.
For instance, the proposer could show eagerness by nodding his head while staring at the respondent, placing more emotional burden on the respondent if the offer is not accepted. Similarly, the respondent could shake his head while listening to the proposal or tilt his head in confusion. Another interesting behavioral cue in the context of negotiation is self-touching, which Ekman and Friesen (1969) call a type of adaptor. According to Harrigan, Kues, and Weber (1986), the overall consensus is that negative affect, such as anxiety or discomfort, triggers self-touching behavior.

2.2.1.2 Mutual Behavior

Extensive research shows that we have a tendency to match our behavior to our interacting partners in various ways (Louwerse, Dale, Bard, & Jeuniaux, 2012), and it is described with many terms in the literature, including behavior matching, imitation, mimicry, synchrony, or the chameleon effect. The changes in our behavior often occur unconsciously and in many different channels of communication, from facial expressions to speech patterns (Chartrand & Bargh, 1999; Chartrand, Maddux, & Lakin, 2006; Louwerse et al., 2012). Such behavioral characteristics are part of what is broadly referred to as mutual behavior in this thesis. Mutual behavior is not limited to behavioral symmetry but spans more broadly to also include any nonverbal characteristics that occur due to interactional influence, including behavioral asymmetry.

Mutual behavior is important in the context of negotiation because much evidence exists that it is related to social rapport. In general, people simply seem to get along better when their behavior is well coordinated (Bernieri & Rosenthal, 1991), and it is shown that displaying similar behavior helps with the smoothness of interaction and also builds a feeling of liking or positivity among interactional partners (Chartrand & Bargh, 1999; Chartrand et al., 2006). The phenomenon is so prevalent that even computer agents that mimic human partners are seen with a more positive feeling than non-mimicking agents (Bailenson & Yee, 2005). Moreover, studies (Bernieri, Gillis, Davis, & Grahe, 1996; Grahe & Bernieri, 1999) show that observable nonverbal cues can be indicative of rapport, suggesting that it is possible to detect and gauge rapport among interactional partners, which in turn can be used to assess the status of a negotiation process.

More specifically, Bernieri et al. (1996) studied observable nonverbal cues indicative of rapport in two different contexts of adversarial and cooperative settings, and the list of behavior included gestures, posture shifts, proximity, back-channel responses, eye contact, and forward leaning. Tickle-Degnen and Rosenthal (1990), who describe rapport in terms of three components of mutual attentiveness, positivity, and coordination, also studied several nonverbal cues associated with rapport that included a similar set of behavior.

Mutual behavior that hints at rapport can also reside at the speech level. People are known to imitate various acoustic characteristics of interactional partners in terms of accent, pause, speech rate, and tone of voice (Louwerse et al., 2012). Some researchers focus more on the smoothness of turn taking, which is usually measured with simultaneous speech, mutual silence, and interruption (Bernieri & Rosenthal, 1991).
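To make those turn-taking measures concrete, the minimal Python sketch below computes the fraction of an interaction spent in simultaneous speech and in mutual silence from two speakers' voice-activity sequences sampled at a fixed rate. The function name and the fixed-rate sampling assumption are illustrative choices for this example, not the measurement procedure used in the studies cited above or in this thesis.

```python
def turn_taking_measures(speaking_a, speaking_b):
    """Given two equally long boolean sequences marking whether each speaker
    is talking in each time slice, return simple dyadic turn-taking measures."""
    assert len(speaking_a) == len(speaking_b) and len(speaking_a) > 0
    n = len(speaking_a)
    simultaneous = sum(1 for a, b in zip(speaking_a, speaking_b) if a and b)
    mutual_silence = sum(1 for a, b in zip(speaking_a, speaking_b) if not a and not b)
    return {
        "simultaneous_speech_ratio": simultaneous / n,
        "mutual_silence_ratio": mutual_silence / n,
    }

# Example with 10 time slices (e.g., one slice = 100 ms of audio).
a = [1, 1, 1, 0, 0, 0, 1, 1, 0, 0]
b = [0, 0, 1, 1, 1, 0, 0, 0, 0, 0]
print(turn_taking_measures(a, b))
```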
Many researchers also investigate synchronization or accommodation in prosody and various vocal qualities to capture the interpersonal dynamics in social interaction (De Looze, Scherer, Vaughan, & Campbell, 2014; Scherer, Hammal, Yang, Morency, & Cohn, 2014).

2.2.1.3 History

The history information can be thought of as capturing the ongoing relationship between the negotiators. For instance, in the absence of other contexts, if the respondent has mostly rejected the proposer's offers in the past, it would mean something quite different from the opposite case. Moreover, reciprocity can be a good predictor of negotiation outcomes in mixed-motive settings (Dufwenberg & Kirchsteiger, 2004).

2.3 Multimodal Behavior in Persuasion

Persuasion in human communication has been a very active topic of research over the past decades due to its substantial implications and wide applicability, and there is a plethora of sources in the literature that cover the topic in much breadth and depth. This section gives a brief review of only the past research that is immediately relevant to the work in this thesis. For an overview and history of persuasion research, interested readers are referred to other recent comprehensive texts (Crano & Prislin, 2006; O'Keefe, 2002; Perloff, 2010).

In social psychology, dual process models of persuasion (Chaiken, Liberman, & Eagly, 1989; Petty & Cacioppo, 1986) have gained much attention and wide acceptance over the past decades. According to these models, there are two different routes we take when processing information that can influence our attitudes. One route is based on cognition that is more systematic and effortful, while the other is based on peripheral or heuristic cues such as the credibility or attractiveness of the message source. The work in this thesis can be seen in light of the dual process models, with a focus on the peripheral route of information processing.

2.3.1 Modality Influence and Human Perception

Human communication comprises multiple modalities, including verbal, acoustic, and visual channels, and it is apparent that each modality has its own separate influence on human perception. Mehrabian (1971) even goes as far as to claim that our perception of an individual is determined 7% by his/her verbal content, 38% by his/her tone of voice, and 55% by his/her facial and bodily cues. Although this claim is arguable in the research context of this thesis, it is obvious that multimodal analysis is an inevitable step toward a better understanding of human behavior and perception. In particular, Chaiken and Eagly (1976) showed different influences on persuasion and comprehension when a message was delivered through the written, audiotaped, or videotaped modality. Worchel, Andreoli, and Eason (1975) also studied the effects on persuasion of different types of media, communicators, and message positions.

2.3.2 Acoustic Perspective

Showing the importance of acoustic cues in human speech, Stern, Mullennix, and Wilson (2002) reported that natural speech was more persuasive and taken more favorably than computer-synthesized speech.
In addition, Mehrabian and Williams (1969) reported that more intonation and a higher speech volume contributed to perceived persuasiveness, Pittam (1990) studied the relationship between nasality and perceived persuasiveness with a group of Australian speakers, Burgoon, Birk, and Pfau (1990) found a positive correlation between vocal pleasantness and perceived persuasiveness, and Pearce and Brommel (1972) reported different effects of vocalic cues from conversational and dynamic speech styles on the perception of credibility and persuasiveness depending on the listener's preconceived notion of the speaker.

2.3.3 Verbal Perspective

There are many components in the verbal channel that are closely related to persuasiveness (Hosman, 2002; Young, Martell, Anand, Ortiz, & Gilbert, 2011). However, the work in this thesis is not concerned with the validity or quality of argumentation in the textual data but focuses only on finding key words that are informative in differentiating between strongly and weakly persuasive speakers.

2.3.4 Para-Verbal Perspective

Para-verbal cues are consistently found by many researchers to be closely related to our perception of persuasiveness. For instance, some reported that a higher speech rate and less halting speech contributed to perceived persuasiveness (Mehrabian & Williams, 1969), a rapid speech rate positively influenced persuasion (Miller, Maruyama, Beaber, & Valone, 1976), and dynamic and conversational styles (with varying characteristics in pitch, volume, and use of pauses) had different effects on the perception of credibility and persuasiveness (Pearce & Brommel, 1972).

2.3.5 Visual Perspective

Independent of text and voice, our facial expressions and bodily gestures convey much information as well. In relation to persuasion research, Mehrabian and Williams (1969) found that more eye contact, smaller reclining angles, more head nodding, more gesticulation, and more facial activity yielded significant effects for increasing perceived persuasiveness. LaCrosse (1975) also found a similar set of nonverbal behavior related to persuasiveness that he calls affiliative nonverbal behavior. Moreover, Burgoon et al. (1990) found that greater perceived persuasiveness correlated with kinesic and proxemic immediacy, facial expressiveness, and kinesic relaxation. Rosenfeld (1966) found that the level of persuasiveness was positively correlated with positive head nods and negatively correlated with self-manipulations.

2.3.6 High-Level Attributes Related to Persuasion

Researchers investigating persuasion have long known that it is a complex phenomenon involving multiple dimensions, or high-level attributes of a speaker, such as his/her level of credibility or confidence. For instance, many researchers identified that a message's persuasiveness partially depends on its source, which comprises multiple dimensions such as credibility, a high-level attribute that is particularly known to be similar across cultures in its relationship with persuasiveness. Interested readers can find a review of persuasiveness and source credibility by Pornpitakpan (2004). Similarly, multiple attributes have been studied in relation to persuasiveness, such as attractiveness, likability, confidence, expertise, and message vividness (Carli, LaFleur, & Loeber, 1995; Chaiken, 1979; Frey & Eagly, 1993; Inglis & Mejia-Ramos, 2009; LaCrosse, 1975; Maddux & Rogers, 1980; Maslow, Yoselson, & London, 2011).
2.3.7 Thin Slice Prediction

Ambady and Rosenthal (1992) showed that much inference is possible just by observing nonverbal behavior within a short time window called a "thin slice." Curhan and Pentland (2007) applied the idea in a simulated employment negotiation scenario and found that certain speech features within the first five minutes of a negotiation session were predictive of the overall negotiation outcome in the end. It is quite likely that the same idea can apply in the context of persuasiveness perception.

Chapter 3 Crowdsourcing Micro-Level Human Behavior Annotations

3.1 Introduction

Annotating multimedia content is an important part of many recent research problems, including multimedia event recognition (Poppe, 2010), video retrieval and classification (Lew, Sebe, Djeraba, & Jain, 2006), and human behavior analysis (Pantic et al., 2006). Supervised learning approaches applied to these research problems usually require a large number of annotated video sequences. While some of these algorithms are applied at the video or scene level, requiring macro-level annotations, many problems need micro-level annotations that identify precise start and end times of an event or behavior. These annotation efforts, which are usually carried out with experienced local coders, are very costly both in terms of budget and time.

In recent years, there has been an explosive growth in the research and use of crowdsourcing, fueled by convenient online crowdsourcing environments like Amazon Mechanical Turk. In the research community, crowdsourcing is already being actively used for many types of tasks, including image labeling (Nowak & Ruger, 2010) and linguistic annotations (Novotney & Callison-Burch, 2010). When crowdsourcing micro-level human behavior annotations in videos, three main challenges emerge: the interface, training crowd workers online, and generalization. Firstly, there is a need for a web interface that allows crowd workers to accurately and efficiently annotate micro-level behavioral events while keeping the interface simple and intuitive. Secondly, there should be an effective web interface and procedure for training crowd workers online that can simulate the environment experienced local coders use when discussing and reaching agreement. Lastly, the training of online workers should generalize across different datasets for the approach to be widely scalable and applicable.

Figure 3.1: An overview of the crowdsourcing approach for obtaining micro-level behavior annotations in videos, with a focus on the new web interface called OCTAB that includes a module specifically designed to train crowd workers online. The approach is generalizable, and the training effect transfers to annotating new independent video corpora.

This chapter presents OCTAB (Online Crowdsourcing Tool for Annotations of Behavior), a web-based annotation tool that allows precise and convenient behavior annotations in videos, directly portable to popular crowdsourcing platforms such as Amazon Mechanical Turk (Figure 3.1).
In addition, this chapter introduces a training module with specialized visualizations and an iterative procedure for effectively training crowd workers online, inspired by an observational study of experienced local coders reaching agreement. Finally, this chapter presents an extensive set of experiments that evaluates the feasibility of the crowdsourcing approach for obtaining micro-level behavior annotations in videos, showing improvements in the annotation reliability when properly training online crowd workers. This chapter also shows the generalization of the training approach with a new independent corpus.

To the author's knowledge, this work is the first to introduce effective interfaces with specialized visualizations to train crowd workers online, extensively showing the feasibility of training crowd workers to obtain micro-level behavior annotations in videos and demonstrating the generalizability of training across different video corpora. A novel metric is also introduced for disagreement analysis to supplement Krippendorff's alpha in measuring inter-coder reliability, because the alpha cannot show the types of disagreement between coders, which can be critical information for effectively training crowd workers.

The next section (Section 3.2) introduces a novel web tool called OCTAB for crowdsourcing micro-level annotations and training crowd workers, and Section 3.3 describes an effective procedure for training crowd workers online. Section 3.4 gives detail on the new annotation evaluation metrics and the design of the experiments. The experimental results are described and discussed in Section 3.5, and the chapter concludes with Section 3.6.

3.2 Online Crowdsourcing Tool for Annotations of Behavior (OCTAB)

Online Crowdsourcing Tool for Annotations of Behavior (OCTAB) is a web-based annotation tool developed for the purpose of making convenient and precise micro-level annotations in videos. The tool consists of two main modules. The first module is an HTML-based web interface that allows an annotator to conveniently navigate in a video to annotate micro-level human behavior or events. The second module was designed for training crowd workers online, inspired by observing how experienced local coders train themselves to reach agreement.

Figure 3.2: The first component of OCTAB (Online Crowdsourcing Tool for Annotations of Behavior) is a web annotation module that allows crowd workers to make precise micro-level annotations of human behavior or events in videos.

3.2.1 Annotation Module (Micro-Level Behavior Annotations)

OCTAB is intended for annotating a single behavior on a single video at a time, and it is based on HTML5 and JavaScript, providing all the basic functionalities of a web video player (HTML5 supports the three video types of MP4, WebM, and Ogg). The following three main aspects were considered in the design of OCTAB's annotation module (Figure 3.2).

3.2.1.1 Precision

For accurate micro-level annotations on videos, annotators need to have frame-level precision in identifying the start and end time points of an event. To address this requirement, the interface provides an annotator with four buttons for moving 1 second backward/forward and 1 frame backward/forward from the current time in the video, as well as a slider bar that offers frame-level navigation in the range from -3 to +3 seconds.
Once the annotator identifies a behavioral cue or event to annotate, he/she can use the navigation control buttons or the slider bar to pinpoint and select the behavioral cue's or event's precise start and end times. Then, he/she can play the selection to verify it and press a button to save the selection as a valid annotation. Although intended for annotating a single behavior on a single video at a time, it should be noted that this interface also allows annotations of multiple behavior tiers or intensities with a simple addition of radio buttons, and it can even be configured to support arbitrary annotation tasks with additional radio buttons, sliders, text boxes, etc.

3.2.1.2 Integrability

Popular annotation software applications like ELAN or ANVIL (Dasiopoulou et al., 2011) allow annotators to make sophisticated annotations on video and audio files, but they are not suitable for the purpose of crowdsourcing. They have a relatively steep learning curve and cannot be used with online crowdsourcing platforms like Amazon Mechanical Turk. OCTAB is written directly in HTML so that it can easily be used to create a template task page when using online crowdsourcing platforms.

3.2.1.3 Usability

Annotating videos often involves moving around in a video to check, re-evaluate, and edit previously made annotations. A special section in the annotation module displays a list of all saved annotations, and annotators can always go back and work on previously made annotations by replaying, editing, or deleting them. For convenience and speed in making annotations, most controls in the interface have hotkeys associated with them, and the interface's functionalities are kept to a minimal level with an intuitive layout to minimize confusion.

3.2.2 Training Module (Training Crowd Workers Online)

The challenge of training crowd workers for annotation tasks arises mainly from the lack of physical interaction that local coders enjoy when training themselves in person according to a coding scheme. In order to arrive at an effective design for this training module, an observation was first made of how experienced local coders work together to reach agreement. Then, the needed visualizations and a training procedure were created to translate the observed findings into effectively training crowd workers online.

3.2.2.1 Observational Study of Experienced Local Coders' Training

As a preliminary step, an observational study was performed with two experienced local coders reaching agreement on behavior annotations for five short YouTube videos of people giving movie reviews. The coders annotated a total of four behavioral cues, the same ones used in our experiments: gaze away, pause filler, frown, and headshake (see Section 3.4). According to the observation, the experienced local coders sat together to devise a coding scheme, or a precise description of an annotation task. Then, they individually tried annotating a training video according to the coding scheme. After computing their agreement, they again sat head-to-head to review their annotations together, replayed all of their annotations multiple times side-by-side, engaged in discussions, and made appropriate modifications to the coding scheme as needed. This process was iterated with more training videos, one after another, until the agreement consistently reached a satisfactory level determined by the researchers.
From the observational study, it became apparent that the online training module should concentrate on two key functionalities in order to simulate how local coders train themselves. Firstly, crowd workers should have an overall visualization that enables them to quickly compare their annotations with each other (or with ground-truth annotations). Secondly, crowd workers should also be able to efficiently review (play the video and see all instances of) both ground-truth and their attempted annotations side-by-side.

Figure 3.3: The second module of OCTAB to effectively train crowd workers online by giving them a quick overall visualization of disagreement (top, the overall bar-graph visualization component) and the ability to review both ground-truth and their attempted annotations side-by-side (bottom, the side-by-side review component).

3.2.2.2 Training Module's Design

The first necessary functionality noted during the observational study is reflected in the training module with an overall bar-graph visualization on a timeline that informs crowd workers with an overall picture of their mistakes not only in the identification of behavior but also in its segmentation (see the Overall Bar-Graph Visualization Component in Figure 3.3). The second functionality is reflected with a modified version of the behavior annotation module in which crowd workers can review both ground-truth annotations and their attempted annotations side-by-side by repeatedly playing any of those annotation instances in the video (see the Side-by-Side Review Component in Figure 3.3). This training module is generated automatically with scripts.

3.3 Procedure for Crowdsourcing Micro-Level Behavior Annotations in Videos

Given the interactive web interface for training crowd workers and annotating micro-level behavior in videos, four main steps are suggested to successfully train new crowd workers: obtaining coding schemes and ground-truth annotations, recruiting and screening workers, training the workers online, and obtaining repeated annotations if necessary.

3.3.1 Obtaining Coding Schemes / Ground-Truth Annotations

If no trained online workers are available, the first step is to work with experienced local coders to create a coding scheme and annotate a small set of training videos. As will be shown, this step of creating a coding scheme with annotated training examples is only necessary if the behavioral cue to annotate is new. During this step, the local coders train themselves on the training videos until their agreement reaches a satisfactory level (see Section 3.4 for more detail on agreement measurements during their training sessions). The resulting annotations from these training videos can be used as ground-truth annotations for training crowd workers. If trained online workers are available for the desired behavioral cue or if a coding scheme and an annotated training set already exist, this step can be skipped.

3.3.2 Recruiting / Screening Crowd Workers

In recruiting crowd workers, it is suggested to first try recruiting from a forum such as www.mturk.com, where many serious crowd workers reside.
It is also beneficial to use a relatively unambiguous annotation task that still requires close attention to detail at the frame level to check whether a crowd worker is able to annotate with frame-level precision. For example, the gaze away behavior is a relatively easy behavior to identify with unambiguous start and end times, but it requires one to pay attention at the frame level. Measuring agreement performance on this type of task can be a good threshold point for screening crowd workers.

3.3.3 Training Crowd Workers Online

For training crowd workers, an iterative procedure is suggested in which workers first annotate a video with OCTAB's annotation module and then receive feedback with the training module. This gives them a chance to learn and improve with each training video using the overall bar-graph visualization and side-by-side review components. Once crowd workers consistently perform at an agreement level on par with the agreement between local coders, they are tagged as properly trained. For the study described in this chapter, the Time-Slice Krippendorff's alpha (described in Section 3.4) was used to measure agreement, with the minimum satisfactory alpha level set at 0.80 for relatively clear behavioral cues and 0.70 for harder ones.

3.3.4 Unique vs. Repeated Annotations

When annotators are trained to strongly agree among themselves (or with ground-truth annotations), future annotations can be obtained with one annotator per video. With properly trained crowd workers, it could be the case that having only one worker annotate per video is sufficient to obtain quality annotations. However, for behavioral cues that are relatively harder to annotate, it may be necessary to make repeated annotations with multiple workers per video and take a majority vote approach. In fact, it could be possible to take this approach even with untrained crowd workers and obtain annotations of satisfactory quality. This chapter shows the effect of training and of having repeated annotations with an extensive set of experiments.

3.4 Experiments

The experiments were designed to evaluate the performance and user experience of the new OCTAB interface for online crowd annotations. This work particularly focused on the effect of training crowd workers and also tested the generalization of the training procedure by training workers on one dataset and testing them on another independent dataset.

3.4.1 Evaluation Methods

In this work, the Time-Slice Krippendorff's alpha (Kang et al., 2012) was used as the main evaluation metric for measuring the inter-rater reliability of micro-level behavior annotations in videos. Krippendorff's alpha is particularly suited for crowdsourcing because it can handle multiple annotators at the same time and can also account for missing data. This chapter also introduces two new supplementary metrics, which were used in the experiments to analyze the types of disagreement between coders. The supplementary metrics can be very helpful in determining whether coder disagreement stems from inaccurate identification of a behavioral cue or from imprecise segmentation.

3.4.1.1 Time-Slice Krippendorff's Alpha

Krippendorff's alpha (Krippendorff, 2012) is a generalized chance-corrected agreement coefficient that can be calculated between two or more annotators.
The general formula for the alpha is the following:

alpha = 1 - D_o / D_e    (3.1)

where D_o, or observed disagreement, is the amount of pairwise disagreement observed between the annotators, and D_e, or expected disagreement, is the level of disagreement expected by chance as calculated from the data. The coefficient alpha itself is a measure of agreement ranging from -1 to 1, where 1 is perfect agreement (zero observed disagreement), 0 is chance-level agreement, and values lower than 0 indicate systematic disagreement.

The alpha works by looking separately at the agreement on individual annotation instances. For micro-level annotations, each time slice (e.g., 1 frame per slice) is treated as a separate annotation instance, with a binary annotation indicating the presence or absence of a specific behavioral cue such as smile. While it is the case that adjacent frames tend to have similar annotations, the experiments in this work show that the alpha is not very sensitive to the sampling rate of the time slices. The agreement is calculated separately for each annotated behavioral cue.

Applying the alpha to individual time slices means that the measure can only assess whether the annotators agree that a behavioral cue takes place at a certain time point, not whether they agree about its segmentation or individuation (whether a certain time span contains one or two instances of smiling). This drawback has been pointed out by Krippendorff (1995). To supplement the alpha, this work introduces two new metrics which are intended to capture agreement on the behavior segmentation.

Figure 3.4: Definition of the event and segmentation agreement metrics with examples. In the illustrated example, Coder A identifies 2 event instances and Coder B identifies 5; from Coder A's reference point 2 instances are agreed and from Coder B's reference point 4 are agreed, so Event Agreement = (total number of agreed events) / (total number of identified events) = 6 / (2 + 5) = 85.7%, and Segmentation Agreement = (total number of agreed slices within agreed events) / (total number of slices within agreed events) = 3 / 9 = 33.3%.

3.4.1.2 Disagreement Type Analysis

As mentioned in the previous section, the Time-Slice Krippendorff's alpha does not differentiate between disagreement caused by misalignment of the annotations and disagreement caused by direct event disagreement. To better understand these annotation differences, two new metrics (Figure 3.4) were used, which provide valuable information when deciding whether crowd workers' training should concentrate on better behavior identification, better segmentation, or both.

• Event Agreement Metric
An agreed event is defined as an overlap of identified events in two annotations. In other words, agreed events are those that both annotators jointly identified. Depending on which annotation is taken as the reference point, however, the number of agreed events could be different (Figure 3.4). For this reason, this work computes the percentage of agreed behavior events between the two annotations by dividing the total number of agreed events from both reference points by the total number of identified events from both reference points.
• Segmentation Agreement Metric
Another informative measure in gauging the agreement between two annotators is to see how precisely they segmented the boundary of the same annotation event. To compute the segmentation precision, this work looked at the time windows of agreed behavior events from both reference points combined and computed agreement within those time windows only (Figure 3.4). The percentage is computed by dividing the number of agreed time slices by the total number of time slices within the time windows of agreed events. (A minimal computational sketch of both metrics is given below.)

3.4.2 Datasets

From YouTube, a video-sharing website where users upload and share videos, about 360 videos of people giving movie reviews were collected. Each video was annotated by two coders to determine the sentiment of the reviews (negative, neutral, or positive). From those videos, 20 videos that were both gender-balanced and sentiment-balanced were selected for this study so as to have a wide range of expressions. Additionally, 5 more videos were randomly selected and used for training purposes. Each video showed a frontal, upper-body shot of a different person talking. Since all of the videos appeared to have been recorded using a webcam, it should be noted that the overall quality of the videos was not ideal but still fair enough to discern various facial expressions and eye gaze movements. For the 20 videos that were used in the actual experiments, the frame rate was 30 frames per second and the video length ranged from 60 to 180 seconds, averaging 138 seconds. The 5 training videos had the same frame rate, averaging 106 seconds in length.

To show the generalization of the suggested training procedure, a second dataset was created with 10 clipped videos from the Semaine corpus (McKeown, Valstar, Cowie, & Pantic, 2010), which is a well-known video corpus in the research communities focusing on emotion, affective computing, and human behavior analysis. The purpose of this second dataset was to investigate whether the effect of training crowd workers on one dataset can be transferred to another dataset for annotating human behavior. These videos also showed a frontal, upper-body shot of a person speaking, and the frame rate was also 30 frames per second, averaging 150 seconds in length.

3.4.3 Annotated Behavioral Cues

From behavioral cues that were relatively common and frequent in all the videos, four different types of behavior were selected to annotate based on their variety (one for eye movements, one for facial expressions, one for head movements, and one for verbal cues) and difficulty. These behavioral cues are all very frequently annotated for research involving human behavior analysis. The descriptions of the behavioral cues in this work's coding schemes were adapted from the MUMIN multimodal coding scheme (Allwood et al., 2005).

• Gaze away: Eye gaze is directed away from the camera.
• Pause filler: The person says "um..." or "uh..."
• Frown: The eyebrows contract and move toward the nose.
• Headshake: A repeated rotation of the head from one side to the other.

3.4.4 Experimental Design

Amazon Mechanical Turk (AMT) was used for the experiments, which is arguably the most well-known and widely used platform for crowdsourcing. The main idea behind AMT is to distribute small tasks at which humans are proficient and computers are still incompetent to a crowd of online workers. Using AMT's web interface, the "requesters" can design and publish tasks online, which are called Human Intelligence Tasks (HITs).
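To make the two disagreement-type metrics concrete, the following is a minimal Python sketch of how they might be computed from binary frame-level annotation sequences. The function names and the exact handling of event boundaries are illustrative assumptions, not the implementation used in this work.

def events(labels):
    # Return (start, end) index pairs of contiguous runs of 1s (identified events);
    # the end index is exclusive, and a trailing 0 sentinel closes a final run.
    evs, start = [], None
    for i, v in enumerate(list(labels) + [0]):
        if v == 1 and start is None:
            start = i
        elif v != 1 and start is not None:
            evs.append((start, i))
            start = None
    return evs

def overlaps(a, b):
    # True if two (start, end) intervals share at least one time slice.
    return a[0] < b[1] and b[0] < a[1]

def event_agreement(coder_a, coder_b):
    # Fraction of identified events, counted from both reference points, that
    # overlap an event of the other coder (Section 3.4.1.2).
    ev_a, ev_b = events(coder_a), events(coder_b)
    agreed_a = [e for e in ev_a if any(overlaps(e, f) for f in ev_b)]
    agreed_b = [f for f in ev_b if any(overlaps(f, e) for e in ev_a)]
    total = len(ev_a) + len(ev_b)
    return (len(agreed_a) + len(agreed_b)) / total if total else 1.0

def segmentation_agreement(coder_a, coder_b):
    # Within the union of the time windows of agreed events (from both reference
    # points), the fraction of time slices on which the two annotations carry the
    # same label; this is one plausible reading of the metric's boundary handling.
    ev_a, ev_b = events(coder_a), events(coder_b)
    agreed = [e for e in ev_a if any(overlaps(e, f) for f in ev_b)] + \
             [f for f in ev_b if any(overlaps(f, e) for e in ev_a)]
    window = {i for (s, e) in agreed for i in range(s, e)}
    if not window:
        return 1.0
    same = sum(1 for i in window if coder_a[i] == coder_b[i])
    return same / len(window)

# Example with two hypothetical coders annotating 12 time slices of one cue.
a = [0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0]
b = [0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1]
print(event_agreement(a, b), segmentation_agreement(a, b))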
In designing HITs, the requesters can set various options to restrict access to specific kinds of workers, set the number of unique workers to work on them, and set the amount of monetary reward. Moreover, a HIT template can be created, and one can define variables that will vary from HIT to HIT, which becomes very useful in creating a batch of similar HITs with different videos. A HIT template was created with the OCTAB annotation interface integrated, and all of the HITs were batch-created with the videos in our YouTube and Semaine datasets. A total of 19 workers participated in the experiments, who worked for an effective hourly wage between $4 and $6 as compensation. For more detail, Mason and Suri's (2012) paper gives a thorough explanation of using AMT.

3.4.4.1 Experienced Local Coders

For this study, two experienced local coders were recruited, and the agreement between them after training was considered the gold standard in the experiments. They devised the coding schemes for the four behavioral cues to annotate, and they trained themselves to reach agreement on the 5 YouTube videos set aside for training purposes only. They trained on one video at a time until agreement (measured with the Time-Slice Krippendorff's alpha) reached a threshold of 0.80 (or very close) for all behavioral cues, with the exception of the headshake behavior, because the local coders could not reach 0.80 for a few training videos even after three trials. However, the average alpha level for the headshake behavior across the 5 training videos still reached 0.80. A more detailed analysis was performed on the types of errors in the experiments (see Section 3.5) to better understand this challenge with the headshake behavior (see Figure 3.7, bottom part).

After training, each local coder used the same environment as the crowd workers to annotate all the videos from the YouTube and Semaine datasets across all the behavioral cues. Since agreement between the local coders was high, the final annotations from one of the local coders during training were used as ground-truth annotations to train crowd workers online.

3.4.4.2 Untrained Crowd Workers

To compare the training approach introduced in this work with a scenario where crowd workers are untrained, a total of 12 workers were selected to participate as untrained crowd workers. As mentioned earlier, they were screened using an annotation test for the gaze away behavior. This brief screening process was only to ensure that they could pay attention to frame-level detail, and no training sessions were given. They were provided with the coding schemes drafted by the two local coders, and they made a combined effort to annotate all the videos from only the YouTube dataset across all the behavioral cues.

3.4.4.3 Trained Crowd Workers

A total of 7 workers, who were not involved as untrained crowd workers, participated as trained crowd workers. They were trained with the same 5 YouTube videos and the coding schemes that the local coders used for training. After each training video, the workers received e-mail feedback with the OCTAB training module, generated automatically with scripts. Workers were considered trained when they reached the same alpha thresholds used for the experienced local coders. The training process involved at most one trial per training video for the gaze away and pause filler behavioral cues.
For the frown behavior, each worker took mostly one trial per video on average across all training videos to reach the alpha threshold, and it took about two to three trials per training video for the headshake behavior. The trained workers then annotated all the videos across all the behavioral cues from the YouTube dataset first. Then, they similarly annotated the Semaine dataset to investigate whether the effect of training crowd workers for annotating human behavior on one dataset can be transferred to annotating a different and independent dataset. The crowd workers were not informed that these videos were from a different dataset.

On average, the trained workers spent about 11 minutes to annotate about 13 instances of the gaze away behavior per minute of video, 8 minutes to annotate 4 instances of the pause filler behavior, 5 minutes to annotate 2 instances of the frown behavior, and 8 minutes to annotate 4 instances of the headshake behavior.

3.4.4.4 Repeated Annotations

For both of the above-mentioned conditions with the untrained and trained crowd workers, three repeated annotations were obtained to investigate the benefit of taking a majority vote approach.

3.4.5 Annotation Strategies

For each dataset, the agreement performance was compared across three annotation approaches: experts, crowdsourced unique, and crowdsourced majority.

• Experts: The two experienced local coders each produced a complete set of annotations for all behavioral cues for all videos in each dataset. The agreement between them was considered the gold standard in the experiments. These sets are referred to as experts throughout the chapter.

• Crowdsourced unique: From the crowd workers, three repeated annotation sets were obtained from different workers per behavior per video. The three repeated annotation sets were obtained both from the trained worker group and from the untrained group. By randomly permuting the order of the three annotation sets in each group, three complete sets of trained and untrained crowdsourced annotations were made for each dataset, which we refer to as crowdsourced unique.

• Crowdsourced majority: The three complete sets of crowdsourced annotations can be combined to make another complete set using majority voting, where a time slice (or frame) is judged annotated if at least two out of three workers agreed. This set is referred to as crowdsourced majority for each dataset.

The annotation agreement was compared across three different combinations: (1) within experts, to provide a baseline; (2) experts vs. crowdsourced unique, to see if having one worker annotate per video is sufficient; and (3) experts vs. crowdsourced majority, to see the benefit of having repeated annotations and performing a majority vote. The agreement comparison was performed for the YouTube dataset with the untrained crowd workers, the YouTube dataset with the trained crowd workers, and the Semaine dataset with only the trained crowd workers.

3.5 Results and Discussions

This section highlights five main research problems studied with the experiments: the user experience ratings of the OCTAB interface, the performance when training crowd workers, the performance without training crowd workers, the analysis of the types of disagreement, and the sensitivity analysis of the Time-Slice Krippendorff's alpha to test its stability and reliability.
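Before discussing the individual results, the following minimal Python sketch shows one way the Time-Slice Krippendorff's alpha for binary frame-level annotations (Section 3.4.1.1) and the majority-vote combination of three repeated annotations (Section 3.4.5) might be computed. It is an illustrative simplification under the stated assumptions, not the exact implementation used to produce the results reported below.

from collections import Counter

def time_slice_alpha(annotations):
    # Krippendorff's alpha for nominal (here binary 0/1) labels, where every time
    # slice is one unit and each inner list is one annotator's label sequence.
    # Missing labels can be marked with None and are skipped for that slice.
    o = Counter()      # coincidence counts o[(c, k)]
    n_c = Counter()    # marginal totals per label
    n_total = 0
    for t in range(len(annotations[0])):
        labels = [a[t] for a in annotations if a[t] is not None]
        m = len(labels)
        if m < 2:
            continue  # a slice labeled by fewer than two coders contributes no pairs
        counts = Counter(labels)
        for c in counts:
            for k in counts:
                o[(c, k)] += counts[c] * (counts[k] - (1 if c == k else 0)) / (m - 1)
            n_c[c] += counts[c]
        n_total += m
    if n_total == 0:
        return 1.0
    # Observed and expected disagreement for nominal data (difference weight 1 when
    # c != k), giving alpha = 1 - D_o / D_e as in Equation (3.1).
    d_o = sum(v for (c, k), v in o.items() if c != k) / n_total
    d_e = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n_total * (n_total - 1))
    return 1.0 - d_o / d_e if d_e > 0 else 1.0

def majority_vote(a1, a2, a3):
    # Combine three repeated binary annotations: a time slice is judged annotated
    # if at least two of the three workers marked it (crowdsourced majority).
    return [1 if (x + y + z) >= 2 else 0 for x, y, z in zip(a1, a2, a3)]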
It should be noted that researchers in social sciences usually consider macro-level annotation data with a Krippendorff's alpha value equal to or above 0.80 as reliable and in high agreement, and they consider data with an alpha value equal to or above 0.67 but lower than 0.80 as reliable only for drawing tentative conclusions (Krippendorff, 2012). These threshold points, however, are somewhat arbitrary, and it is controversial whether the same standard is fair to hold for judging the reliability and quality of micro-level (frame-level) behavior annotations. Keeping this in mind, the 0.67 threshold was still used as the standard of quality in the remainder of this section.

Figure 3.5: The user experience ratings of the OCTAB interface.

3.5.1 User Experience Ratings of OCTAB

The 19 crowd workers who participated in the experiments completed a survey to evaluate the OCTAB annotation and training modules and the behavior annotation tasks (Figure 3.5). On a 7-point Likert scale rating the OCTAB annotation module's convenience (from very inconvenient at 1 to very convenient at 7) and intuitiveness (from very unintuitive at 1 to very intuitive at 7), the mean score was 6.37 (n = 19, sd = 0.74) for convenience and 5.89 (n = 19, sd = 0.91) for intuitiveness. For the OCTAB training module, the mean score on usefulness (from very useless at 1 to very useful at 7) was 6.33 (n = 6, sd = 0.47) for the bar-graph visualization and 5.67 (n = 6, sd = 1.97) for the side-by-side review component. These evaluation results show the high usability of the OCTAB interface.

The crowd workers also evaluated the difficulty of each behavior to annotate (from very difficult at 1 to very easy at 7), and the mean score was 6.42 (n = 19, sd = 0.82) for the gaze away behavior, 5.71 (n = 14, sd = 1.33) for the pause filler behavior, 3.94 (n = 16, sd = 2.05) for the frown behavior, and 3.64 (n = 14, sd = 1.59) for the headshake behavior. Not surprisingly, the reported difficulty level correlated with the general agreement performance of each behavior.

Figure 3.6: The performance of the trained crowd workers on the YouTube dataset (top) and the Semaine dataset (bottom). The dotted lines indicate the agreement threshold point at 0.67 measured with the Time-Slice Krippendorff's alpha.

3.5.2 Performance of Trained Crowd Workers

For the YouTube dataset, on which the crowd workers were trained to perform the annotation tasks, the performance of crowdsourced majority was striking. For all behavioral cues, the average agreement between individual experienced local coders and crowdsourced majority was higher than between the two local coders themselves (Figure 3.6).
The average alpha between experts and crowdsourced majority reached above the 0.67 threshold for all behavioral cues, specifically 0.87 for the gaze away behavior, 0.82 for the pause filler behavior, 0.70 for the frown behavior, and 0.67 for the headshake behavior. These results show that crowdsourcing can be a very effective tool for researchers in obtaining high-quality behavior annotations, provided that proper training sessions were given and three repeated annotations were obtained to take a majority vote approach. For relatively unambiguous behavioral cues, such as the gaze away and pause filler behaviors, the results indicate that repeated annotations are actually unnecessary and that having one worker annotate per behavior and video is sufficient to obtain high-quality annotations.

When the crowd workers, who were trained on the YouTube dataset, performed the same annotation tasks on the different videos in the Semaine dataset, the results showed that the effect of training was in fact transferable. The agreement between experts and crowdsourced majority was almost equal to or higher than between the experts themselves, except for the gaze away behavior. This exception was most likely due to the speakers in the Semaine videos not talking directly toward the camera as was the case in the YouTube dataset. The speakers in the Semaine dataset talk to an interlocutor who is not visible in the videos, and this difference most likely caused much confusion in deciding what makes a valid instance of the gaze away behavior in the changed setting, because the coding scheme was the same for both datasets. Nevertheless, the average alpha between experts and crowdsourced majority was still high at 0.79 for the gaze away, 0.77 for the pause filler, 0.84 for the frown, and 0.70 for the headshake behavior. A similar trend also emerged indicating that having only one worker annotate per video is sufficient to obtain high-quality annotations for the gaze away and pause filler behaviors.

Figure 3.7: The performance of the untrained crowd workers on the YouTube dataset, shown for the Time-Slice Krippendorff's alpha, the event agreement metric, and the segmentation agreement metric. The dotted line indicates the agreement alpha threshold at 0.67.

3.5.3 Performance of Untrained Crowd Workers

The performance of the untrained crowd workers on the YouTube dataset showed that both crowdsourced unique and crowdsourced majority reached the agreement alpha threshold of 0.67 for the gaze away behavior (Figure 3.7). The agreement between experts and crowdsourced majority reached very close to the 0.67 threshold for the pause filler and frown behaviors, and it should be noted that it is not uncommon for an alpha value of 0.60 to correspond to well over 85% frame-level agreement without chance correction, which is by no means low agreement.

The results also showed the benefit of disagreement analysis with the event and segmentation agreement metrics. For instance, the disagreement analysis revealed that the source of the low alpha value for the crowd workers in annotating the pause filler behavior was not in behavior identification but in segmentation.
In other words, the untrained crowd workers were just as proficient as the experienced local coders in identifying instances of the pause filler behavior. However, for the headshake behavior, the disagreement analysis showed that the untrained crowd workers had problems both in identifying and in segmenting the behavior correctly compared to the experienced local coders. This analysis was aligned with the previous observation that the experienced local coders also had trouble agreeing at an alpha threshold of 0.80.

3.5.4 Trained vs. Untrained Crowd Workers

The effect of training was quite notable, as shown in Figure 3.8, which shows the average agreement alpha values between experts and trained crowdsourced unique and also between experts and untrained crowdsourced unique. By training crowd workers, their agreement performance on the YouTube dataset improved with statistical significance at p < 0.01 for the gaze away and pause filler behaviors and at p < 0.001 for the headshake behavior (statistical significance computed with t-tests).

Figure 3.8: Trained vs. untrained crowd workers (YouTube dataset).

3.5.5 Time-Slice Krippendorff's Alpha

For all the behavioral cues, the Time-Slice Krippendorff's alpha was shown to be a stable measure that stayed consistent across different sizes of time slices, and the results for the gaze away and frown behaviors on the YouTube dataset are shown in Figure 3.9. For this experiment, annotation sets created at a lower frame rate were up-sampled using a majority vote technique, where each time slice was considered annotated if at least 50% of the slice was annotated.

Figure 3.9: The sensitivity analysis of the Time-Slice Krippendorff's alpha across different frame sampling rates (30, 15, 7.5, 3.75, and 1.875 Hz) for the gaze away and frown behaviors on the YouTube dataset.

3.6 Conclusions

This chapter presented a novel web interface and training procedure for crowdsourcing micro-level behavior annotations in videos and showed that such annotations can achieve a quality comparable to those done by experienced local coders. Specifically, this chapter presented an effective web tool called OCTAB for crowdsourcing micro-level behavior annotations online, which consists of a convenient and precise annotation module and a training module that gives crowd workers the ability to quickly get trained by first seeing an overall view of their errors and then performing a side-by-side review of their annotations against ground-truth annotations. The results from an extensive set of
experiments showed the feasibility of the suggested crowdsourcing approach for obtaining micro-level behavior annotations in videos, showing improvements in the annotation reliability when properly training online crowd workers. This chapter also investigated the generalization of the training approach with a new video corpus, showing that the training effect is transferable across different independent video corpora.

Chapter 4
Computational Modeling of Human Behavior in Face-to-Face Negotiation

4.1 Introduction

Negotiation is a complex and dynamic process in which two or more parties, often having non-identical preferences or agendas, attempt to reach agreement. Be it in our workplace or with our family and friends, negotiation comprises such a fundamental fabric of our everyday lives that we sometimes engage in the act without even being consciously aware of it. A real-time system that can automatically analyze human behavior in terms of negotiation and predict the respondent reactions to negotiation offers has the potential to help us in our daily lives. For instance, such a system could function as a real-time decision support tool, especially in the online environment, to directly help a person during a negotiation process by providing an automatic analysis of the other person's behavior while teleconferencing. Computational analysis and modeling of behavior during negotiation could also be useful in training a person to be a better negotiator. For instance, the models could be applied to create virtual characters for training and simulating negotiation scenarios.

Automatically predicting the respondent reactions to offers made during negotiation, that is, whether the respondent will accept or reject an offer, is a challenging problem. Despite a long history of research on negotiation (Pruitt, 2012), much work is still needed in order to fully understand how people display various nonverbal behavior in the context of negotiation. There has been very limited work investigating nonverbal behavior with computational approaches, but recent progress in computer vision and audio signal processing technologies enables automatic extraction of various visual and acoustic behavioral cues without having to depend on costly and time-consuming manual annotations. In this work, automatic feature extraction tools are used for many low-level behavioral cues such as head displacements and rotations, but manual annotations are also used for several high-level behavioral cues, such as head nods, that cannot yet be reliably extracted automatically.

Figure 4.1: An overview of this work's approach to predict the respondent reactions (acceptances or rejections) to negotiation offers using predictive computational descriptors from various sources of information.

The work in this chapter presents a computational analysis of face-to-face dyadic negotiation sessions to investigate multiple behavioral factors predictive of the respondent reactions to negotiation offers (Figure 4.1). For this challenging prediction problem, analyzing the nonverbal behavior of the respondent would intuitively be the first to consider, but ample predictive information can reside in other sources as well.
Specifically, the nonverbal behavior of the proposer might hint at the status of an ongoing negotiation process, and the past negotiation history between the two negotiators could shed light on their current relationship, making the respondent more likely to act in a reciprocal manner to a given negotiation offer. Additionally, this work explores mutual behavior, which is defined as a set of nonverbal characteristics that occurs due to interactional influence, in terms of behavioral symmetry and asymmetry between the two negotiators. It is hypothesized that mutual behavior is important in the context of negotiation because people unconsciously engage in constant adaptation to others' behavior during face-to-face interaction. The degree of behavioral matching or mismatching could then show the overall atmosphere or rapport of the participants in the interaction.

With a face-to-face negotiation dataset consisting of 42 dyadic interactions, this work presents an extensive set of experimental results to show that such nonverbal cues in various sources of information can be encoded as computational descriptors for a statistical model to automatically predict the respondent's immediate reaction to a negotiation offer. Whereas past related works in the literature mainly focused on predicting overall negotiation outcomes in the end, this work's focus is on making immediate predictions of the respondent reactions (acceptances or rejections) to individual proposals made during negotiation. In particular, this work examines the following four sources of information: the nonverbal behavior of the proposer, that of the respondent, the mutual behavior of symmetry and asymmetry between the two negotiators, and the past negotiation history. In addition to demonstrating the advantage of considering more sources of information to achieve a higher prediction accuracy, this work also concentrates on the mutual behavioral cues that can be extracted automatically, to explore the possibility of building an automatic system for the prediction task.

The next section (Section 4.2) describes the representations of the computational descriptors in detail, and Section 4.3 describes the dataset and experiments. The results and discussions are provided in Section 4.4 and Section 4.5, and the chapter concludes with Section 4.6.

Figure 4.2: An illustration of two proposal-response events and the two different types of time windows where the computational descriptors were extracted.

4.2 Computational Descriptors

In creating the computational descriptors for predicting the respondent's reactions in a dyadic negotiation session, the following four different sources of information were identified in which predictive cues could reside: the nonverbal behavior of the proposer, that of the respondent, the mutual behavior between the two negotiators, and their past negotiation history.

Another factor considered in creating the computational descriptors was time dependency (Figure 4.2). Since negotiation is an ongoing process in which participants constantly adapt themselves to each other, assessing both short-term and long-term cues could provide a deeper understanding of the current state of negotiation on which to base predictions of future actions.
For this purpose, a proposal-response event was defined as a time window in which the proposer made an utterance with a clear negotiation offer followed by the respondent's clear verbal utterance of acceptance or rejection. In each proposal-response event, short-term cues were explored only within the time boundary from the start of the proposal until the start of the response. For long-term cues, the cumulative history of behavioral cues was explored from the start of the interaction until the start of the response. It is noted that no information was used from the response part of a proposal-response event, even when there was an overlap between the proposal and the response.

1. Long-term cues: These descriptors were designed to model social engagement and rapport created over a longer period of time. For example, a continuous mutual gaze is often correlated with high rapport, which in turn can be correlated with successful collaboration.

2. Short-term cues: These descriptors were designed to model the recent momentum in negotiation. For example, the negotiation momentum could change rapidly because of cheating or mockery, and the short-term descriptors were designed to adapt quickly.

4.2.1 Proposer's Behavior

For each proposal-response event, the following nonverbal behavioral cues displayed by the proposer were explored as potential short-term cues:

• Head nod: a vertical downward (or repeated upward and downward) movement of the head.
• Head shake: a repeated horizontal left and right movement of the head.
• Head tilt: a rotation of the head to the left or to the right (rotation around the z-axis with a frontal view of the face in 3D coordinates).
• Gaze: gaze direction toward the other party, the table, or somewhere else.
• Smile: presence of smiling.
• Self-touch: touching his/her own body with his/her hands (e.g., touching the face with the hand). Only the upper portion of the body was visible in the videos.

The proposer's behavioral cues were manually annotated within the time window of each proposal-response event and were encoded as binary descriptors (except for the gaze behavior, which had three different states) at the event level. For example, the proposer's smile descriptor depended on whether the proposer portrayed a smile from the start of the proposal until the start of the response in each proposal-response event. In summary, from this source of information, a total of 6 computational descriptors were encoded as short-term cues.

4.2.2 Respondent's Behavior

In creating the computational descriptors of the respondent's behavior in each proposal-response event, the same set of behavioral cues was used and the same approach was followed as described for creating the descriptors of the proposer's behavior. In addition to the event-level binary descriptors of the respondent's behavior, another descriptor called the binary response time was added, which encoded the respondent's behavior in terms of his/her response time to the proposal.

• Binary response time: For each proposal-response event, the response time was computed as the time when the respondent started uttering the acceptance or rejection minus the time when the proposer finished uttering the proposal. After taking the mean of the response times for all accepted cases and that of all rejected cases, the midpoint of the two means was found and used as a threshold, which was 1.37 seconds in the experiments.
It is noted that recomputing this threshold yielded a very similar value, with a mean of 1.50 seconds and a standard deviation of 0.22 seconds, when using only the training and validation folds in the 12 experiments from 3 randomly balanced sets with 4-fold cross-validation (see Section 4.3.3 for experimental detail). Using this threshold, the response time in each proposal-response event was converted into a binary descriptor. In summary, a total of 7 computational descriptors were encoded as short-term cues from this information source.

4.2.3 Mutual Behavior

In creating the computational descriptors of mutual behavior, the following three main aspects were considered: behavioral symmetry / asymmetry, automatic extraction, and multi-modality. Although past research principally focused on symmetric mutual behavior, such as social rapport and behavior matching, it is noted that much information can also reside in asymmetric mutual behavior, such as opposite postures, in the context of this negotiation problem. For behavioral symmetry, the descriptors were designed to capture the similarity of the behavioral patterns of the two negotiators. For behavioral asymmetry, the descriptors were designed to capture the behavioral patterns of one negotiator that contrasted with those of the other negotiator.

• Behavioral symmetry: This behavioral characteristic describes the similarity and synchrony in the negotiators' behavior. For example, a mutual gaze or a reciprocal smile can show a general feeling of rapport and connection. Such behavior is more expected to appear in cooperative settings.

• Behavioral asymmetry: This behavioral characteristic describes unilateral behavior or behavioral patterns that contrast between the negotiators. For example, if only one of the two negotiators is smiling or if they show opposite body postures, these are possible signs of disengagement and competition.

Figure 4.3: An illustration of how audio-visual mutual behavior of symmetry and asymmetry was encoded as three computational descriptors (correlation, difference in the means, and difference in the standard deviations) for each type of acoustic and visual behavioral cue, and also as long-term cues. For short-term cues, only the behavior from the beginning of the proposal until the beginning of the corresponding response was considered.

Such behavioral characteristics of symmetry and asymmetry were captured with the following three computational descriptors that were derived from each type of behavioral cue (Figure 4.3). For instance, a continuous visual signal such as smile or an acoustic signal such as pitch was extracted for both negotiators in each dyad (more detail about the specific behavioral cues is given in the following subsections). Then the symmetric and asymmetric characteristics were summarized as follows:

• Correlation: Pearson's correlation coefficient was computed for each behavioral cue between the two negotiators in a dyad.
The higher the correlation, the more symmetric the behavior in the specific behavioral dimension. A correlation coefficient of -1 would mean perfect asymmetry.

• Difference in the means: For each negotiator in a dyad, the mean value was computed for each behavioral cue, and the absolute difference between the two mean values was computed. A higher difference value signifies more asymmetry between the two negotiators' behavior.

• Difference in the standard deviations: As in computing the difference in the means, the same approach was taken to compute the difference with respect to the standard deviation values.

(An illustrative computational sketch of these three descriptors appears later in this chapter.)

The mutual descriptors involved only the nonverbal behavior that could be automatically extracted and that mutually occurred between the proposer and the respondent. That is, in extracting the automatic mutual behavior descriptors, each one was derived by jointly considering the nonverbal behavior of both the proposer and the respondent together, and none of these descriptors were derived from the nonverbal behavior of just one party in the interaction. Such symmetric and asymmetric mutual behavior descriptors were explored in the two different modalities of acoustic and visual channels.

In this work, the two types of short-term and long-term windows were used to encode the mutual behavior descriptors, but it is noted that there could be arbitrarily many sizes of time window with which to encode the descriptors. For instance, the descriptors could be separately encoded in each time window of 1 second, 5 seconds, and 10 seconds before the response, or the best time window could be empirically determined (Narayanan & Georgiou, 2013). Additionally, mutual behavior can be portrayed in the form of entrainment (Lee, Katsamanis, Black, Baucom, Christensen, Georgiou, & Narayanan, 2014). Exploring various ways of using the time windows and entrainment as mutual behavior in the problem context of negotiation would be an interesting future direction.

4.2.3.1 Acoustic Mutual Behavior

Using publicly available software for speech analysis called Covarep (Degottex et al., 2014), the following acoustic descriptors were extracted at 100 Hz for each participant per proposal-response event. The descriptors were extracted only within the long-term time windows, since the amount of time was often too short to compute meaningful descriptors within the short-term time windows:

• Voice quality peak slope: Used to indicate breathiness or tenseness of the voice. Values closer to 0 are considered more tense (Kane & Gobl, 2011; Scherer, Kane, Gobl, & Schwenker, 2013).

• Voice quality normalized amplitude quotient (NAQ): Another feature for the tenseness of the voice (Scherer et al., 2013).

• Pitch (f0): The base frequency of the speech signal; the frequency at which the vocal folds vibrate during voiced speech segments. The method introduced in Drugman and Alwan's work (2011) was used.

• Energy: Used to indicate the loudness and intensity of the voice.

• Energy slope: Extracted as the absolute value of the first derivative of the energy. Higher slope values indicate stronger changes in the energy, and lower values indicate higher monotonicity of the energy.

• Spectral stationarity: A measure that captures the fluctuations and changes in the voice signal. Higher values indicate a more stable vocal tract and little change in the speech (e.g.,
during hesitation or sustained elongated vowels), indicating higher monotonicity (Finkelstein, Scherer, Ogan, Morency, & Cassell, 2012; Scherer, Weibel, Morency, & Oviatt, 2012).

For each proposal-response event, the acoustic descriptors extracted for each participant within the long-term time windows were processed with a linear filter, specifically a time-aligned moving average (sliding window) technique (Vaughan, 2011) with a time window of 10 seconds. This step was taken since, unlike visual signals, acoustic signals usually do not have overlapping regions from which to compute meaningful mutual behavior descriptors. Then, the symmetric and asymmetric mutual behavior descriptors were computed for each acoustic behavioral cue by taking the correlation and the differences in the means and in the standard deviation values. In summary, a total of 18 computational descriptors were encoded as long-term mutual behavior descriptors from this source of information (3 types of mutual behavior descriptors, namely correlation and the differences in the means and in the standard deviations, multiplied by 6 types of acoustic behavioral cues).

4.2.3.2 Visual Mutual Behavior

In order to automatically extract the visual mutual behavior descriptors, commercial software (OKAO Vision) was used that detects a person's face from frame to frame in a video and outputs various low-level and high-level facial features. Below is a list of the visual descriptors that were extracted as potential predictive cues for each participant per negotiation session. Each visual descriptor listed below was smoothed with a linear filter, and each descriptor, except for smile, was converted into a binary descriptor at each frame using an empirically determined threshold point.

• Smile: Used to indicate if the person is displaying positive affect with a smile. The smile intensity value ranges on a scale from 0, which means no smile, up to 100.

• Leaning posture: Used to indicate if the person is showing a forward or a backward lean / posture, approximated with the face length and face size. The face length and face size values were z-normalized, and the threshold points of 0.00, 0.25, 0.50, 0.75, and 1.00 were used to convert them to binary values at the frame level. With the 5 different thresholded versions of the descriptor, prediction performance was measured with each of them used in a single-feature predictor.
For each proposal-response event, the visual descriptors above were extracted from two different time windows: within the short-term time window (from the start of the proposal until the start of the response) and within the long-term time window (from the start of the interaction until the start of the response) as shown in Figure 4.2. Then, for each time window, the symmetric and asymmetric mutual behavior descriptors were computed with Pearson's correlation coefficient for each descriptor between the two par ticipants in each dyadic session. The difference in the mean values and the difference in the standard deviation values were also computed. In summary, a total of 24 com putational descriptors were encoded as the short-term and long-term mutual behavior descriptors from this source of information ( 4 types of visual descriptors multiplied by 3 types of mutual behavior descriptors of correlation and differences multiplied by 2 types of short-term and long-term windows). Unlike audio signals, visual signals tend to occur more simultaneously. For instance, when two people smile or have an eye contact, even if their behavior is not perfectly synchronized, there tends to be an overlapping period when both interactants display 58 the behavior at the same time, which is captured with the correlation coefficients. It would be interesting to take into account time delays using a similar time-aligned moving average technique that was used for the acoustic descriptors, but it is left as future work. Additionally, time delay in behavior is not relevant for the descriptors of the difference in the means and in the standard deviations because they are already summary statistics over a time period. 4.2.4 Negotiation History To capture useful predictive cues from the negotiation history, this work explored the following descriptors from the long-term time windows: • Net negotiation history: The total net response history of the respondent at the time of the proposal-response event ( + 1 and -1 for each previous acceptance and rejection respectively). • Last negotiation history: The result of the proposal-response event ( + 1 for acceptance and -1 for rejection) immediately prior to the current one. • Response time history: The mean of all the previous response times of the respondent at the time of the proposal-response event. This descriptor could help better understand the binary response time descriptor by providing the general response time characteristic / habit of each negotiator. In summary, a total of 3 computational descriptors were encoded from this source of information. 4.3 Experiments The experiments were designed and performed to address the following primary hypoth esis to investigate the degree of benefit that can be gained by considering more sources of information for potential predictive cues of the respondent reactions: 59 Hypothesis 1 (Hl). For predicting the respondent reactions during dyadic negotia tion, other sources of information (the proposer's nonverbal behavior, the mutual behav ior between the negotiators, and the negotiation history) can yield comparable prediction performance to looking at the nonverbal behavior of the respondent, and combining all sources together yields higher performance than using a single source of information. In addition to the primary research question, we also tested a secondary hypothe sis regarding mutual behavior. 
If Hl is true and the mutual behavior descriptors are predictive of the respondent reactions during negotiation, it may be due to the descrip tors capturing the very nature of the interaction itself, whether it is cooperative or competitive. Hypothesis 2 (H2). The computational descriptors of mutual behavior that are predictive of the respondent reactions are also useful for determining whether the nego tiation interaction is cooperative or competitive. 4.3.1 Dyadic Negotiation Dataset A dataset of dyadic negotiation sessions was collected in order to understand how people negotiate with various incentive scenarios. In total, 84 undergraduate business major students ( 40 males and 44 females) participated in 42 dyadic negotiation sessions, of which 1 dyad was discarded because the participants deviated from the experimental procedure. Each dyadic session involved two same-gender participants to control for the influence of gender. In addition, the negotiators in each dyad were instructed to adopt only one of three motivational orientations that derived from the monetary incentive associated with the negotiation task: cooperative (maximize joint outcomes), individu alistic (maximize own outcomes), and competitive (maximize own outcomes relative to the other's outcomes). Out of 42 sessions, 13 were cooperative, 13 individualistic, and 16 competitive. The negotiators in each dyad received the same motivational instruction and were aware that the other was so instructed. A total of three cameras were placed 60 Total# of sessions Total# of samples (proposal-response events) Accept samples across sessions - mean - standard dev. Reject samples across sessions - mean - standard dev. Total samples across sessions - mean 41 dyads 253 4.63 2.38 1.54 2.37 6.17 - standard dev. 2.97 Table 4.1: Proposal-response events distribution in the face-to-face dyadic negotiation dataset. unobtrusively to record a near-frontal view of each negotiator, as well as an overall side view of the interaction. In each session, two participants sat face-to-face across each other at the opposite ends of a table, on which several types of plastic fruits and vegetables were placed. The participants were randomly assigned to represent one of two different restaurants, which had different pay-off matrices associated with the items on the table. Each participant knew only the pay-off matrix of his/her assigned restaurant, and the participants had 12 minutes to negotiate on how to distribute the items on the table. As an incentive, each participant could receive up to $50 depending on the final points earned for his/her restaurant (Figure 4 .1). 4.3.2 Annotations For each negotiation session, all the proposal-response events were identified by two expert coders with each coder annotating half of the dataset, and the inter-coder re liability on 4 randomly selected sessions (about 10% of the dataset) measured with Krippendorff's alpha was at 0.67 (following the approach of measuring the Time-Slice Krippendorff's alpha described in the previous chapter with seconds as the time-slice granularity). A proposal was defined as an utterance made with a clear offer related to 61 negotiating the items on the table, and if it was followed by a clear verbal utterance of acceptance or rejection, the start of the proposal until the end of the matching response was identified as a proposal-response event. A total of 253 events were identified, out of which 190 were accepted proposals and 63 were rejected proposals (see Table 4.1). 
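As a rough illustration of how the per-second inter-coder agreement reported above could be computed, the sketch below converts two coders' proposal-response events into binary time slices and scores them with Krippendorff's alpha for nominal data. The event lists, session length, and function names are hypothetical, and the computation is a simplified two-coder, no-missing-data case rather than the full procedure used for the dataset.

```python
import numpy as np

def events_to_slices(events, duration_s):
    """Binary per-second vector: 1 if the second falls inside any annotated event."""
    slices = np.zeros(int(duration_s), dtype=int)
    for start, end in events:                      # events given in seconds
        slices[int(start):int(np.ceil(end))] = 1
    return slices

def krippendorff_alpha_nominal(coder1, coder2):
    """Krippendorff's alpha for two coders, nominal (here binary) data, no missing values."""
    values = np.unique(np.concatenate([coder1, coder2]))
    index = {v: i for i, v in enumerate(values)}
    # Coincidence matrix: each unit contributes both orderings of its pair of values.
    o = np.zeros((len(values), len(values)))
    for a, b in zip(coder1, coder2):
        o[index[a], index[b]] += 1
        o[index[b], index[a]] += 1
    n = o.sum()                                    # total pairable values (= 2 * units)
    n_c = o.sum(axis=1)                            # marginal totals per value
    d_observed = (n - np.trace(o)) / n
    d_expected = (n ** 2 - (n_c ** 2).sum()) / (n * (n - 1))
    return 1.0 - d_observed / d_expected

# Hypothetical example: two coders' proposal-response events on a 720-second session
coder1_events = [(12, 25), (80, 95), (300, 330)]
coder2_events = [(13, 24), (78, 96), (305, 331)]
c1 = events_to_slices(coder1_events, 720)
c2 = events_to_slices(coder2_events, 720)
print(round(krippendorff_alpha_nominal(c1, c2), 2))
```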
For each proposal-response event, a subset of the nonverbal behavior (see Section 4.2.1) of the proposer and the respondent were annotated. For the purpose of extracting the acoustic descriptors, speaker diarization was also performed with the annotations, but this step could have been done automatically with close-talk microphones equipped for both participants. All the annotations were performed using ELAN software (Brugman & Russel, 2004). 4.3.3 Prediction Models and Methodology For the prediction models, support vector machine (SVM) classifiers with a radial-basis kernel were trained and tested (Chang & Lin, 2011). In all of the prediction exper iments, 4-fold cross-validation was performed with hold-out testing and also hold-out validation to find the optimal parameters (gamma and C) using a grid-search technique. An exhaustive feature selection looking at all possible combinations of features was per formed in each of the four sources of information. For making predictions with combined sources of information, two different approaches of early and late fusions were explored. For the late-fusion approach, each source of information outputted its best prediction results in the form of log-likelihoods, which were combined altogether and used as input features for a final predictor. For the early, or feature-level, fusion approach, the same feature selection approach was performed after combining the features at the feature level using the best subset of features that was automatically determined in each source of information. In order to make balanced sample sets for predictor training and testing, all of the 63 samples of the rejected proposal-response events were combined with 63 randomly selected samples of the accepted events, making the baseline prediction at 50%, and 3 62 such randomly balanced sets were created. Each randomly balanced set was again ran domly separated into 4 folds with almost an equal number of acceptance and rejection samples. All the prediction results were averaged over 12 test results (3 randomly bal anced sets 4-fold cross-validation). It should be noted that none of the folds contained samples from the same negotiation session for better generalizability. In other words, the 4 folds were created such that they were all session-independent to one another. To test for the second hypothesis, this work also investigated to what extent the prediction accuracies were due to the mutual behavior descriptors' capturing the dif ferent conditions of the negotiation sessions, specifically between the cooperative and competitive conditions. Using the same final descriptor set determined for the mutual behavior group predictor, another classifier was trained and tested in order to clas sify each negotiation session between the cooperative and competitive conditions. The samples were also randomly balanced with 13 cooperative sessions and 13 competitive sessions making the baseline classification at 50%, and a similar feature selection tech nique and 13-fold cross validation were performed. In each cross-validation experiment, 1 hold-out fold was used for testing, 8 folds for training, and 4 folds for validating the model parameters. 4.4 Results All of the experimental results have the baseline prediction rate of 50% since all the samples were trained and tested using randomly balanced sets. 
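Before turning to the individual results, the following is a minimal sketch of the training and evaluation setup described in Section 4.3.3: an RBF-kernel SVM per information source, session-independent folds, a grid search over C and gamma, and both feature-level (early) and decision-level (late) fusion. The data, parameter grids, and the use of decision values in place of the log-likelihood outputs for late fusion are simplifying assumptions, not the exact configuration used in the experiments.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, GridSearchCV
from sklearn.svm import SVC

# Hypothetical inputs: one row per proposal-response event, accept (1) / reject (0)
# labels, and a session id per event so that no fold mixes events from the same session.
rng = np.random.default_rng(0)
n_events, n_sessions = 126, 41
X_sources = {"proposer": rng.normal(size=(n_events, 3)),
             "respondent": rng.normal(size=(n_events, 4)),
             "mutual": rng.normal(size=(n_events, 6)),
             "history": rng.normal(size=(n_events, 3))}
y = rng.integers(0, 2, size=n_events)
sessions = rng.integers(0, n_sessions, size=n_events)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1, 1]}

def grid_svm(X_train, y_train, groups_train):
    """RBF-kernel SVM with grid-searched C and gamma on session-independent inner folds."""
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=GroupKFold(n_splits=3))
    search.fit(X_train, y_train, groups=groups_train)
    return search.best_estimator_

X_early = np.hstack(list(X_sources.values()))      # early fusion: feature-level concatenation
accuracies = []
for train_idx, test_idx in GroupKFold(n_splits=4).split(X_early, y, groups=sessions):
    # Early-fusion predictor over the concatenated descriptors
    early = grid_svm(X_early[train_idx], y[train_idx], sessions[train_idx])
    acc_early = early.score(X_early[test_idx], y[test_idx])

    # Late fusion: per-source decision values feed a final combining classifier
    train_scores, test_scores = [], []
    for X in X_sources.values():
        clf = grid_svm(X[train_idx], y[train_idx], sessions[train_idx])
        train_scores.append(clf.decision_function(X[train_idx]))
        test_scores.append(clf.decision_function(X[test_idx]))
    combiner = SVC(kernel="rbf").fit(np.column_stack(train_scores), y[train_idx])
    acc_late = combiner.score(np.column_stack(test_scores), y[test_idx])
    accuracies.append((acc_early, acc_late))

print("mean accuracy (early, late):", np.mean(accuracies, axis=0))
```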
4.4.1 Predicting the Respondent Reactions (H1)

For predicting the respondent reactions to negotiation offers, combining all the computational descriptors at the feature level in an early-fusion approach yielded a mean prediction accuracy of 75.8%, and the late-fusion approach at the decision level yielded an accuracy of 74.5% (Figure 4.4). The prediction accuracy was 56.8% when using the descriptors from only the proposer's behavior, 72.7% when using those from only the respondent's behavior, 66.9% when using those from only the history information, and 68.8% when using those from only the mutual behavior. Paired-samples t-tests showed the early-fusion predictor's performance to be better with statistical significance at p < 0.001 compared to both the predictor using only the proposer's behavior and that using only the history information.

Figure 4.4: The mean accuracies for predicting the respondent reactions to negotiation offers using the computational descriptors from each source of information. The two right-most red bars show the early-fusion performance of combining all sources together at the feature level and the late-fusion performance of combining them at the decision level. The error bars show 1 standard error in both directions, and the paired-samples t-tests showed statistical significance in performance at p** < 0.001.

4.4.2 Benefit of More Sources of Information

As shown in Figure 4.5, the predictors on average performed at 66.3% when using only 1 source of information (mean of 4 different predictors, 1 predictor for each source of information). For the early-fusion approach, the predictors on average performed at 70.7% when using 2 sources (mean of 6 different predictors possible by choosing 2 out of 4 sources), 72.9% when using 3 sources (mean of 4 different predictors possible by choosing 3 out of 4 sources), and 75.8% when using all 4 sources together. The late-fusion approach yielded very similar results, and the predictors on average performed at 70.9% when using 2 sources, 72.5% when using 3 sources, and 74.5% when using all 4 sources together. The prediction performance of using all possible combinations of sources is summarized in Table 4.2.

Figure 4.5: The mean prediction accuracies for combining multiple sources of information together, for early fusion at the feature level and late fusion at the decision level.

4.4.3 Mutual Behavior and Classification of Cooperative vs. Competitive Interactions (H2)
Using the same final descriptors selected for the information source of mutual behavior, 4 descriptors were relevant for the problem of interaction condition classification because they were from the long-term time windows, making it sensible to compute them over the entire length of each negotiation session. The descriptors were the correlation for smile, the correlation for head gaze, the difference in the means for eye gaze, and the difference in the means for pitch. Using the 4 descriptors, the best classification rate that could be achieved for classifying the negotiation sessions between cooperative and competitive conditions was 65.4%.

Information sources (• signifies inclusion)                     Accuracy (%)
Proposer   Respondent   History   Mutual Behavior      Early Fusion   Late Fusion
   •           •           •            •                  75.8           74.5
   •           •           •                               71.1           72.1
   •           •                        •                  74.7           72.7
   •                       •            •                  69.5           71.4
               •           •            •                  76.0           73.7
   •           •                                           69.8           72.4
   •                       •                               66.9           67.2
   •                                    •                  65.1           69.0
               •           •                               73.7           72.1
               •                        •                  74.5           74.0
                           •            •                  74.0           70.6
   •                                                       56.8
               •                                           72.7
                           •                               66.9
                                        •                  68.8

Table 4.2: The prediction performance of both the early-fusion and late-fusion approaches with different combinations of information sources.

4.4.4 Top Performing Individual Descriptors

Table 4.3 summarizes the performance of the top computational descriptors from all sources of information. The prediction accuracies in the table show the performance when a single-descriptor predictor was trained and tested using each individual descriptor alone. In other words, the prediction accuracies suggest how much discriminative power each computational descriptor showed when it was considered alone in the prediction problem. From the proposer's nonverbal behavior, head tilt information was the only descriptor that performed above 55%, at 56.8%. From the respondent's nonverbal behavior, 3 descriptors (the binary response time, head nod, and head shake) performed above 55% when used individually, with prediction accuracies of 63.8%, 60.4%, and 59.6% respectively. From the mutual behavior descriptors, the difference in the means for head gaze and eye gaze performed at 59.6% and 57.3% respectively, and the correlation in head gaze and smile both performed at 59.1%. Lastly, from the history information, the last negotiation history and net negotiation history performed at 66.9% and 58.6% respectively.

Source              Descriptor                              Accuracy (%)   Time dependency
Proposer            Head tilt                                   56.8
Respondent          Binary response time                        63.8
                    Head nod                                    60.4        Short-term
                    Head shake                                  59.6
Mutual Behavior     Leaning posture (correlation)               54.2
(Symmetry /         Head gaze (diff. in means)                  59.6
Asymmetry)          Head gaze (correlation)                     59.1
                    Smile (correlation)                         59.1
                    Eye gaze (diff. in means)                   57.3
                    Eye gaze (correlation)                      54.9
                    NAQ (diff. in standard deviations)          54.2        Long-term
                    Pitch (diff. in means)                      54.2
History             Last negotiation history                    66.9
                    Net negotiation history                     58.6

Table 4.3: Top descriptors according to their prediction performance when used alone in a single-descriptor predictor.

4.5 Discussions

Overall, the results showed partial support for the two hypotheses outlined in Section 4.3. This section elaborates on the results with discussions in light of the hypotheses.

4.5.1 Limitations

It is noted that the samples in the experiments were randomly force-balanced to have a majority baseline, or chance-level, prediction of 50%. Also, the threshold points for the visual mutual descriptors were determined using all the training and testing samples.
It is also noted that the main focus of this study was to investigate whether different sources of information would also yield comparable prediction performance to that of looking at the nonverbal behavior of the respondent. The focus was not on making best possible predictions, and the same thresholding advantage was given to all multimodal predictors that included the mutual behavior. 4.5.2 Predicting the Respondent Reactions (Hl) As shown in Figure 4.4, other sources of information aside from the respondent's non verbal behavior also displayed comparable predictive power, especially the history and mutual behavior sources showing slightly lower but still comparable performance, which confirmed the first half of the first hypothesis. However, when com paring the prediction accuracy results between the 4-source predictors versus each of the single-source pre dictors, not all of them showed statistical significance, not completely confirming our first hypothesis that the 4-source predictors are always better. It is not surprising that the nonverbal behavior of the respondent was the most predictive source of information when predicting the respondent reactions. After all, if one wishes to predict the future actions of a person, it is only natural and intuitive to observe that specific person's behavior more than anything else. The mutual behavior between the negotiators also proved useful, probably due to the descriptors having captured the level of rapport be tween the negotiators and the overall atmosphere of cooperation or competition. The computational descriptors from the history information were also quite useful in the prediction, most likely by directly capturing whether the interaction was a cooperative or competitive condition. The nonverbal behavior of the proposer, although the least predictive of the 4 sources, still showed some predictive power. 68 4.5.3 Benefit of More Sources of Information For the early-fusion results, there was a slight trend of adding more sources of infor mation leading to better prediction performance in general. In the graphs shown in Figure 4.5, each bar shows the average prediction accuracy by using information from a certain number of sources in all possible combinations. On average, using information from all 4 sources of information yielded the best performance for the early-fusion at 75.8% and late-fusion at 74.5%. However, this tendency is only in terms of general per formance, and when looking at the performance of individual predictors with different combinations of sources, the tendency is not always true and conclusive. For instance, the 3-source predictor (using the respondent's nonverbal behavior, the negotiation his tory, and the mutual behavior) actually outperformed the 4-source predictor for the early-fusion case. 4.5.4 Mutual Behavior and Classification of Cooperative vs. Competitive Interactions (H2) The condition classification performance of 65.4% is relatively lower compared to the early-fusion prediction accuracy of 75.8% but still by far higher than the chance level at 50%. Considering that the engineering and selection of the computational descriptors were not completely focused for the purpose of the interaction condition classification, the classification performance suggests more meaning and implications. This result in dicates that the descriptors that were useful for the respondent reaction prediction were also helpful in determining the type of negotiation sessions, moderately confirming the second hypothesis (H2). 
The result also suggests that the performance of the respondent reaction predictors could be partially due to the mutual behavior descriptors' having captured the nature of the negotiation sessions, or the overall atmosphere of cooperation or competitiveness. 69 4.5.5 Top Performing Individual Descriptors As shown in Table 4.3, from the proposer's behavior, head tilt was slightly predictive of the respondent reactions, possibly because the behavior often showed lack of confidence from the proposer with the proposal, which was more likely to be rejected than ac cepted. From the respondent's nonverbal behavior, the binary response time, head nod, and head shake individually yielded the best prediction accuracy. It is noted that head nods and head shakes from the respondent's behavior individually only performed at about 60% prediction accuracy and were not determinant factors. Often in a dyadic ses sion, the respondent gave head nods as a form of backchannel response to the proposer's speech, and their presence were somewhat related to the final respondent reaction to the offer but not to a great extent. From the history information, the last negotiation history descriptor performed best at 66.9%, which was the best performance among all descriptors and all sources of information. This was most likely due to the descriptor having captured the degree of cooperation or competition in the interaction, as well as the tendency of acceptances or rejections to occur in closer tern poral proximity ( ac ceptances tended to happen in blocks and so were rejections). Also, there were several mutual behavior descriptors that performed very close to about 60% prediction accuracy even when used alone in a single-feature predictor. When designing a real-time system, a customized feature selection could be tried based on this individual performance if it is known which descriptors must be included due to domain knowledge or convenience. 4.6 Conclusions This work presented extensive experimental results showing that it is possible to pre dict the respondent reactions to negotiation offers (whether the respondent will accept or reject) with reasonable accuracy using computational descriptors from four different sources of information: the nonverbal behavior of the proposer, the nonverbal behavior 70 of the respondent, the mutual behavior between the negotiators, and the negotiation his tory. Furthermore, this chapter presented the results with both an early-fusion approach of fusing information at the feature-level and a late-fusion approach of fusing informa tion at the decision-level. The results also showed a qualitative observation that adding more sources of information generally improves the prediction performance. Specifically, this work moderately confirmed the first hypothesis that other sources of information aside from the nonverbal behavior of the respondent can also be useful in the prediction problem. Furthermore, this work moderately confirmed the second hypothesis that the computational descriptors from mutual behavior are useful in the prediction problem due to their capturing the very nature of the interaction itself between the cooperative and competitive atmosphere. For future work, more behavioral cues and feature engi neering can be explored, especially by taking into account time delays in the behavior between the negotiators. It would also be interesting to investigate how humans perform for the same prediction problem for comparison. 
71 Chapter 5 Computational Modeling of Persuasive Behavior in Online Social Multimedia 5.1 Introduction Our daily lives are heavily influenced by persuasive communication. Making a con vincing case in the courtroom (Voss, 2005), seeking patients' compliance to medical advice (O'Keefe & Jensen, 2007), advertising and selling products in business (Meyers Levy & Malaviya, 1999), and even interacting with our friends and family all have persuasion at the core of the interaction. With the advent of the Internet and a recent growth of social networking sites, more and more of our daily interaction is taking place in the online domain. Whereas the communication modality used online was predominantly text in the past, there is now an explosion of online content in the form of videos, making it more important and useful to understand persuasiveness in the context of online social multimedia content. What makes some people persuasive in online multimedia and influential in shaping other people's opinions and attitudes while others are ignored? This is the key question that this chapter explores. This research topic has many practical implications from the human-computer in teraction perspective. For one, an automatic technology that can analyze multimodal signals from a human user in real-time and predict his/her level of persuasiveness from 72 behavioral and verbal indications can be useful as a training system. Such a system can help a speaker to behave as a more persuasive speaker and probably a better negotiator in daily interactions. Furthermore, such a system can be used as a filtering tool and aid a person with real-time analysis of online video and audio content. While there has been a considerable amount of research on persuasion from the standpoints of psychology and social science, there has been very limited work inves tigating persuasion from the computational perspective and from the context of social multimedia. Fortunately, recent progress in computer vision and audio signal process ing technologies (Degottex et al., 2014; Lao & Kawade, 2005; Littlewort et al., 2011; Morency et al., 2008) enables automatic extractions of various visual and acoustic behav ioral cues without having to depend on costly and time-consuming manual annotations, making it more feasible to tackle the problem from a computational standpoint. This chapter introduces a newly created dataset called Persuasive Opinion Multime dia (POM) corpus consisting of 1,000 movie review videos with subjective annotations of persuasiveness as well as high-level related characteristics or attributes such as con fidence (Figure 5.1). The experimental analysis revolves around the following five main research hypotheses. Firstly, this work studies if the computational descriptors derived from verbal and nonverbal behavior can be predictive of persuasiveness. This work fur ther explores combining descriptors from multiple communication modalities (acoustic, verbal, para-verbal, and visual) for predicting persuasiveness and compare with using a single modality alone. Secondly, this work investigates how certain high-level attributes, such as credibility or expertise, are related to persuasiveness and how the information can be used in modeling and predicting persuasiveness. Thirdly, this work investigates the differences when speakers are expressing a positive or negative opinion and also if the opinion polarity has any influence in the persuasiveness prediction. 
Fourthly, this work further studies if gender has any influence in the prediction performance. Lastly, this work tests if it is possible to make comparable predictions of persuasiveness by only looking at thin slices (i.e., shorter time windows) of a speaker's behavior. 73 Persuasive Opinion Multimodal (POM) Corpus r Positive opinions (5-star ratings) Acoustic Descriptors ._,.~,.,II. \..Pitch, energy, M F<Cs, voice quality, ett:_.i Para-Verbal Descriptors It's a(uhb) 70 (umm) minute movie (uhh) so so (pause) it's it's (stutter) not especially longlonglong (stutte11 but it is pretty(uhb) cool (pause) and (ubh) lots of cool... Articulation rate, pause, pause-fillers, et<. r Negative opinions (1or2-star ratings) Verbal Descriptors It's a 70 minute movie so it's not especially long but it is pretty coo 1 sn.d lots of cools cenes and stuffs o !really I don't know! really like tllis movie ... U nigrams and bigrams Visual Descriptors r ..., Multimodal Prediction • Strongly persuasive? vs. • Weakly persuasive? t Figure 5.1: An overview of the chapter with a newly created multimedia corpus and the multimodal approach for predicting persuasiveness with acoustic, verbal, para-verbal, and visual computational descriptors. To the author's knowledge, this new corpus is the first multimedia dataset created with extensive annotations for studying persuasiveness in online social multimedia. Fur- thermore, another main novelty of this work lies in investigating computational models of persuasiveness that take advantage of several natural multimodal communicative modalities encompassing acoustic, verbal, para-verbal, and visual channels. In addition to providing an extensive set of experiments for computationally modeling persuasive- ness, this chapter also introduces a novel attribute-based multimodal fusion approach which uses various high-level attributes related to persuasion in the middle layer for predicting persuasiveness. 74 The next section (Section 5.2) outlines the main research hypotheses, and Section 5.3 introduces the novel multimedia dataset designed for investigating persuasiveness in social multimedia. The chapter gives explanations on the design of the computational descriptors in Section 5.4 and experiments in Section 5.5. The experimental results are described in Section 5.6 with discussions, and the chapter concludes with Section 5. 7. 5.2 Research Hypotheses Motivated by findings from past research outlined in the past related works (Section 2.3), the research work presented here was designed to specifically address the following five main hypotheses. 5.2.1 Computational Descriptors (Unimodal vs. Multimodal Prediction) Past research works in the literature point to various cues in verbal and nonverbal be havior that influence human perception of persuasiveness. It is hypothesized that it is possible to capture such indicators of persuasiveness through computational descrip tors to predict whether a speaker in social multimedia is strongly persuasive or weakly persuasive. In particular, it is hypothesized that combining computational descriptors derived from multiple communication modalities can make more accurate predictions compared to using those from a single modality alone from the acoustic, verbal, para verbal, or visual channel. Hypothesis 1 (Hl): Multimodal computational descriptors of verbal and nonverbal behavior perform better than unimodal descriptors in predicting a speaker's persuasive ness in social multimedia. 
5.2.2 Attribute-Based Multimodal Approach Past research findings and intuition both tell us that several high-level attributes, such as credibility and expertise, are very likely to have close relevance to persuasion. And 75 there can be a handful of key high-level attributes, each of which is a critical and distinct component in shaping a speaker's persuasiveness. It is hypothesized that it is possible to achieve better performance in predicting the level of a speaker's persuasiveness by first using multimodal computational descriptors to predict the levels of such high-level attributes in the middle layer and subsequently predicting the level of persuasiveness from the refined, higher-level information. Hypothesis 2 (H2): Using multimodal computational descriptors of verbal and nonverbal behavior to predict the levels of key high-level attributes related to persuasive ness and then subsequently using the intermediate information to predict a speaker's persuasiveness yield better performance compared to directly predicting persuasiveness from the computational descriptors. 5.2.3 Effect of Opinion Polarity Persuasion can happen in a variety of contexts, and it is likely that we change our behavior depending on the context in our persuasion attempt. For instance, we might nod our head more when we try to persuade someone to go watch a particular movie, while we shake our head more in the opposite case. It is hypothesized that if it is known in advance whether a speaker is trying to persuade one in favor of or against something, computational models can better capture the difference between persuasive and unpersuasive contents to make a more informed and better prediction. Hypothesis 3 (H3): Opinion polarity {sentiment) dependent models perform better in predicting a speaker's level of persuasiveness compared to those that are polarity independent. 5.2.4 Effect of Gender Gender can have an influence on how a speaker behaves in his/her persuasion endeavor. For instance, female speakers might be more verbally descriptive while male speakers are less expressive. It is hypothesized that same gender speakers have more similarity 76 in their behavior, allowing gender-dependent computational models to better capture the difference between strongly persuasive and weakly persuasive speakers. Hypothesis 4 (H4): Gender-dependent models perform better in predicting a speaker's level of persuasiveness compared to those that are gender independent. 5.2.5 Thin Slice Prediction In trying to persuade others, we may convey varying degrees of information in different stages of our persuasion attempt. For instance, we may tend to put more emphasis in the very beginning or we may typically want to close our speech with more impact close to the end. Combined with the idea of thin slices (see Subsection 2.3. 7), it is hypothesized that by looking at the verbal and nonverbal behavior at specific shorter time periods, it is possible to make comparable predictions of persuasiveness of a speaker in social multimedia compared to making predictions based on the entire length of the speaker's behavior. Hypothesis 5 (H5): Computational descriptors derived from a thin slice time period can make comparable predictions of a speaker's persuasiveness compared to those derived from the entire length of his/her video .. 
5.3 Persuasive Opinion Multimedia (POM) Corpus Since there is currently no suitable corpus in the research community to study persua siveness in the context of online social multimedia, a website called ExpoTV.com was used as a source to create a new corpus for the research topic. To the author's knowl edge, currently the most relevant dataset to the POM dataset is a dataset of online conversational videos of vloggers by Biel et al. (2012) . It was not completely suit able for the research purpose of this chapter's work because it was created for studying personality and the topics were too broad. ExpoTV.com is a popular website housing videos of product reviews. Each product review has a video of a speaker talking about a particular product, as well as the speaker's direct rating of the product on an integral 77 scale from 1 star (for most negative review) to 5 stars (for most positive review). This direct rating is useful for the purpose of this work because the star rating has close re lationship with the direction of persuasion. For instance, the speaker in a 5-star movie review video would most likely try to persuade the audience in favor of the movie while the speaker in a 1-star movie review video would argue against watching the movie. Our corpus includes only movie review videos for the consistency of context. Since the primary interest of this work lay in exploring the difference in behavior between the cases when a speaker is trying to persuade the audience positively and negatively, a total of 1,000 movie review videos were collected as follows: • Positive Reviews: 500 movie review videos with a 5-star rating (306 males and 194 females). • Negative Reviews: 500 movie review videos with a 1 or 2-star rating, consisting of 208 1-star videos (145 males and 63 females) and 292 2-star videos (218 males and 74 females). 2-star videos were included due to a lack of 1-star videos on the website. Each video in the corpus has a frontal view of one person talking about a particular movie, and the average length of the videos is about 93 seconds with the standard deviation of about 31 seconds. The corpus contains 352 unique speakers and 610 unique movie titles, including all types of common movie genres. 5.3.1 Subjective Annotations Amazon Mechanical Turk (AMT) (Mason & Suri, 2012), which is a popular online crowdsourcing platform, was used to obtain subjective evaluations of the speaker in each video. Each video received 3 repeated annotations from 3 different workers, making the total number of complete annotations 3,000 instances for 1,000 videos in the corpus. All the annotations were obtained from native English-speaking workers based in the 78 United States. To minimize gender influence, all the annotations were distributed such that the workers only evaluated the speakers of the same gender. To ensure that no prior knowledge of the movies biased how the annotators rated each speaker's level of persuasiveness and other high-level characteristics, the annota tions were obtained in two separate phases. In the first phase, a total of 49 workers participated in the evaluation process online, and the task was evenly distributed among them. During this first phase, the workers were asked if they had previously seen the movie being reviewed for each video. This information was then used to filter out all such annotations that were made with prior knowledge of the movie under review (about one third of all the annotations). 
In the second phase, a total of 38 workers participated, first indicating which movies they had seen or not seen from a list of movies that we needed to re-annotate. Then, the annotation task was distributed as evenly as possible among the workers such that each worker only annotated those videos that discuss movies that he/she had not seen before. In summary, a total of 87 workers participated in the annotation process, with each worker annotating about 35 videos (M = 34.5, SD = 13.4).

The second goal of the POM dataset was to better understand other high-level attributes that could be related to persuasiveness, such as personality traits. The following subsections describe the detail of these attribute annotations. These extra annotations are expected to make the corpus more widely applicable for other related research topics such as personality trait modeling.

5.3.1.1 Persuasiveness and High-Level Attributes

For each video in the corpus, 3 repeated annotations were obtained on the level of persuasiveness of the speaker by asking the workers to give a direct rating of the speaker's persuasiveness on a Likert scale from 1 (very unpersuasive) to 7 (very persuasive). In addition to persuasiveness, the annotations also included evaluations of various high-level attributes, many of which past research suggests have a close relationship with our perception of persuasiveness. The high-level attributes were evaluated similarly to persuasiveness, on a 7-point Likert scale with 1 being the least descriptive of the attribute and 7 being the most descriptive. For evaluating personality, a 10-item version of the Big Five Inventory (Rammstedt & John, 2007) was used to assess the personality of the speaker in each video. It is noted that the annotations also include the self-assessed personality of the workers who performed the evaluations, so that a future analysis is possible by investigating the relationship between the personality of the perceiver and that of the perceived.

• High-Level Attributes: confident, credible, dominant, entertaining, expert, humorous, passionate, physically attractive, professional-looking, vivid, and voice pleasant.
• Personality Dimensions (Big Five Model): agreeableness, conscientiousness, extraversion, openness, and neuroticism.

Attribute           Kripp. alpha    Attribute               Kripp. alpha
Confident           0.74            Passionate              0.76
Credible            0.69            Professional-looking    0.71
Dominant            0.69            Vivid                   0.68
Entertaining        0.68            Voice pleasant          0.69
Expert              0.69            Phys. attractive        0.73
Humorous            0.67            Persuasive              0.69
Agreeableness       0.65            Openness                0.67
Conscientiousness   0.71            Neuroticism             0.64
Extraversion        0.75

Table 5.1: Krippendorff's alpha agreement values for the annotations of persuasiveness and other related high-level attributes, including the Big Five personality dimensions.

5.3.1.2 Analysis

Due to the variability in human perception and judgment, taking the mean of repeated evaluations was chosen as a sensible method of obtaining final labels that most closely reflect the perception of the general public. For this work, the mean score of 3 repeated Likert-scale evaluations was used as the final measure for each video.

Figure 5.2: Pearson's correlation coefficients between persuasiveness and the high-level and personality attributes (after taking the mean of 3 repeated annotations). The two horizontal dotted lines indicate critical values at p < 0.001 for two-tailed probabilities, and the vertical dotted line visually divides the personality dimensions from the other attributes.
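As a small illustration of this label construction, the sketch below averages the three repeated Likert ratings per video into a final score and then correlates each attribute with persuasiveness. The ratings here are randomly generated stand-ins and the attribute list is truncated; the real analysis uses the corpus annotations described above.

```python
import numpy as np

# Hypothetical ratings: 1000 videos x 3 repeated 7-point Likert annotations per attribute.
rng = np.random.default_rng(0)
attributes = ["persuasive", "confident", "credible", "expert", "passionate"]
ratings = {a: rng.integers(1, 8, size=(1000, 3)) for a in attributes}

# Final label per video: mean of the 3 repeated annotations.
scores = {a: r.mean(axis=1) for a, r in ratings.items()}

# Pearson correlation of each attribute with the persuasiveness score.
for a in attributes[1:]:
    r = np.corrcoef(scores["persuasive"], scores[a])[0, 1]
    print(f"{a}: r = {r:.2f}")
```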
Table 5.1 sum- marizes the mean agreement measured with Krippendorff's alpha (Krippendorff, 2012) between this final measure and each coder. The agreement was generally high around 0.70. Figure 5.2 shows the correlations between persuasiveness and other attributes when using the final measures after taking the mean, and many of the high-level attributes suggest a strong correlation with persuasiveness, which is consistent with past research findings in the literature (Crano & Prislin, 2006; O'Keefe, 2002; Perloff, 2010). It is particularly interesting to see which traits are not correlated or inversely correlated. The fact that physical attractiveness is only weakly correlated is most likely due to the 81 study design of the same-gender evaluation. Neuroticism is inversely correlated. Some of the most strongly correlated traits are credibility, confidence, and expertise. To validate the persuasiveness measure, the annotation tasks included two ques tions related to the annotators' interest in watching the reviewed movies. For the first question, the annotators were shown general information on the movie, including the synopsis and cast. Then, they were asked, "How interested are you in watching this movie?" This first question was answered before watching each review video. Then after watching the review, the annotators were asked the following second question, "After seeing this movie review, how interested are you in watching this movie?", with a scale ranging from -3 (much less interested than before) to +3 (much more interested than before). Out of 3,000 annotation instances (1,000 movies multiplied by 3 for repeated annotations), a validity analysis indicates a strong correlation between the persuasive ness score rating and the annotators' interest after watching the movie reviews, 0.71 for positive reviews and -0.56 for negative reviews (since viewers are discouraged to watch the movies for negative reviews). 5.3.1.3 Transcriptions Using AMT and 17 participants from the same worker pool for the subjective evalua tions, verbatim transcriptions were obtained, including pause-fillers and stutters. Each transcription was reviewed and edited by in-house expert transcribers for accuracy. 5.4 Computational Descriptors This section gives detail on the extractions and computational encodings of multimodal descriptors as potential candidates for capturing persuasiveness. The verbal and para verbal descriptors were computed automatically from the manual transcriptions. The acoustic and visual descriptors were also extracted automatically, directly from the audio and video streams. Table 5.2 summarizes all of the computational descriptors used in this study. 
82 5.4.1 Acoustic Verbal Formants: Fl~ F5 Mel frequency cepstral coefficients: MFCC 1 ~ 24 Pitch /Fundamental frequency (FO) Voice qualities: normalized amplitude quotient (NAQ), parabolic spectral parameter (PSP), maxima dispersion quotient (MDQ), quasi-open quotient (QOQ), difference between the first two harmonics (Hl-H2), and peak-slope Unigrams Bi grams Para-Verbal Visual Verbal fluency qualities: articulation rate, pause, pause-filler, speech disturbance ratio, and stutter Emotions: anger, contempt, disgust, fear, joy, sadness, and surprise Valence: negative, neutral, and positive Facial Action Units: AUl, AU2, AU4, AU5, AU6, AU7, AU9, AUlO, AU12, AU14, AU15, AUl 7, AU18, AU20, AU23, AU24, AU25, AU26, and AU28 Eye gaze movements: displacement in x and y axes Head movements: displacement and rotation in x, y and z axes Approximated posture: displacement in the z-axis Statistical Functionals (acoustic and visual descriptors only) mean, median, percentiles (IOlli, 25lli, 75lli, and 9Qlli), ranges (between min and max, I Olli and 9Qlli percentiles, and z5lli and 75lli percentiles) skewness, standard deviation Table 5.2: An overview of the computational multimodal descriptors. Acoustic Descriptors Following the common approaches for conducting automatic speech analysis (Schuller, Steidl, Batliner, Schiel, & Krajewski, 2011), various speech features were extracted re lated to pitch, formants, voice qualities, and mel-frequency cepstral coefficients (MFCCs) using publicly available software called Covarep (Degottex et al., 2014). The raw fea- ture values were then used to compute common statistical descriptors including the means, medians, percentiles, ranges, skewness, and standard deviation. The encoded descriptors were then used to explore their feasibility in capturing persuasiveness in the acoustic signals of speech. 83 • Formants: The information of acoustic resonance of the human vocal track, called formant, is commonly used for speech recognition and emotion recognition. This work explored formants Fl through F5. • Mel frequency cepstral coefficients (MFCC): Also widely used for speech and emotion recognition are MFCCs, and this work explored MFCC 1 24. • Pitch (FO): It is also referred to as the fundamental frequency and is closely tied to the affective aspect of speech (Busso, Lee, & Narayanan, 2009). • Voice Qualities: Many studies show a strong relation between voice quality fea tures and perceived emotion (Gobi & Ni Chasaide, 2003), and it is widely used for emotion recognition in speech. This work used various voice quality descriptors including normalized amplitude quotient (NAQ), parabolic spectral parameter (PSP), maxima dispersion quotient (MDQ), quasi-open quotient (QOQ), the dif ference between the first two harmonics (Hl-H2), and peak-slope. For more detail, readers are referred to other works more focused on acoustic analysis (Scherer et al., 2013; Kane, Scherer, Aylett, Morency, & Gobi, 2013). 5.4.2 Verbal Descriptors From the verbatim transcriptions of the dataset, all standard unigram and bigram fea tures were extracted that are commonly used in natural language processing (Rosenfeld, 2000), with the only difference in that the term frequencies were normalized by the video length. 5.4.3 Para-Verbal Descriptors In addition, a set of frequent para-verbal cues were identified that could be associated with the level of persuasiveness. 84 • Articulation rate: Articulation rate is the rate of speaking in which all pauses are excluded from calculation. 
It was computed by taking the ratio of the number of spoken words in each video to the actual time spent speaking. • Pause: This descriptor was computed by counting all the instances of silence during speech that were greater than 0.5 seconds in length, normalized by the total length of the video. FaceFX software was used to automatically extract and encode this descriptor. • Pause-filler: Pause-fillers are sounds that are used to fill the pause in speech, such as "um" or "uh." This descriptor was computed by counting all the instances of pause-fillers, normalized by the total number of words spoken in each video. • Speech disturbance ratio: Pause-fillers and stuttering can be considered as the same category of speech disturbance (Mahl, 1956). The speech disturbance ratio was computed by counting the number of speech disturbance instances (pause fillers and stutter), normalized by the total number of words spoken in each video. • Stutter: This descriptor was computed by counting all the instances of stuttering in each video, normalized by the number of words spoken in the video. 5.4.4 Visual Descriptors Using readily available visual tracking technologies (Lao & Kawade, 2005; Littleworth et al., 2011; Morency et al., 2008), various raw features were extracted from the face and the head movement of the speaker in each video. Similar to the acoustic descriptors, the same statistical descriptors were computed to explore their usefulness in indicating persuasiveness. • Discrete emotions: The level of anger, contempt, disgust, fear, joy, sadness, and surprise. 85 • Valence: The level of high-level valence, including negative, neutral, and positive valence. • Facial Action Units: The level of movement in various facial areas as codified by Facial Action Coding System (FACS) (Ekman & Rosenberg, 1997) including AUl, AU2, AU4, AU5, AU6, AU7, AU9, AUlO, AU12, AU14, AU15, AU17, AU18, AU20, AU23, AU24, AU25, AU26, and AU28. • Eye gaze movements: The gaze movement in the x and y axes. • Head movements: The head displacement and rotation in the x, y, and z axes. • Approximated posture: The movement in the z axis (toward or away from the camera). 5.5 Experiments This section gives detail on the experimental methodology, particularly on the predic tion models and the experimental conditions designed to test the research hypotheses outlined in Section 5. 2). 5.5.1 Persuasiveness Labels The experiments were designed to explore two types of labels discrete and continuous persuasiveness ratings. For the discrete labels, classifiers were used for testing discrete labels and regressors were used for testing continuous labels. For the regression exper iments, the ground-truth scores on all 1,000 videos were computed by averaging the 3 repeated annotations. For the classification experiments, the ground-truth persuasive ness scores of equal to or greater than 5.5 were taken as strongly persuasive speakers and the scores of equal to or less than 2.5 weakly persuasive speakers. It is noted that this dataset trimming process was done so that the study could primarily focus on inves tigating the behavioral differences between strongly persuasive and weakly persuasive 86 videos. Taking this discretization step left a total of 253 videos for the experiments. In terms of the opinion polarity, the final sample set comprised of 137 videos of positive reviews (63 strongly persuasive and 74 weakly persuasive) and 116 videos of negative reviews (61 strongly persuasive and 55 weakly persuasive). 
In terms of gender, the final sample set comprised 152 videos of male reviewers (75 strongly persuasive and 77 weakly persuasive) and 101 videos of female reviewers (49 strongly persuasive and 52 weakly persuasive).

5.5.2 Experimental Conditions

For the first hypothesis (H1), this work explored both types of discrete and continuous persuasiveness labels. For both kinds of labels, the performance of the multimodal approach of combining all the descriptors at the feature level was compared with the performance of using the descriptors from only a single modality. In addition to investigating whether the multimodal models perform better than any unimodal ones, all possible combinations of the modality groups were tested for further analysis. The following summarizes the experimental conditions of the prediction models designed to test H1:

• Acoustic descriptors only (see Section 5.4.1).
• Verbal descriptors only (see Section 5.4.2).
• Para-verbal descriptors only (see Section 5.4.3).
• Visual descriptors only (see Section 5.4.4).
• Multimodal descriptors: All computational descriptors concatenated together at the feature level for an early-fusion approach.

The second hypothesis (H2) was tested with a new approach to fusing multimodal information in relation to several high-level attributes and persuasiveness (Figure 5.3). The approach specifically used those attributes that showed an absolute correlation of at least 0.5 with persuasiveness, which came out to be 7 speaker attributes: credible, expert, confident, vivid, passionate, entertaining, and dominant (personality traits remain as future work). A regressor was first trained for each attribute, and the predicted regression level was then subsequently used to classify samples into strongly and weakly persuasive speakers (a brief illustrative sketch of this two-stage setup is given at the end of Section 5.5.3). The performance of this new attribute-based approach was compared with that of the multimodal predictor just described above.

Figure 5.3: An overview of the attribute-based multimodal prediction approach in which the high-level attributes are used in the middle layer before predicting a speaker's level of persuasiveness.

To address the third hypothesis (H3), new classification experiments were performed using all the multimodal descriptors and grouping the dataset in three different ways depending on the opinion polarity expressed in the videos (as explained in Subsection 5.5.1):

• Positive reviews only (sentiment-dependent): Multimodal models were explored using only the 137 positive reviews from the trimmed samples of 253 videos.
• Negative reviews only (sentiment-dependent): Multimodal models were explored using only the 116 negative reviews from the trimmed samples of 253 videos.
• Both review types combined (sentiment-independent): The same classification models as the multimodal models in H1, using all 253 trimmed samples.
To address the fourth hypothesis (H4), new multimodal classification experiments were performed using three different groups depending on the gender of the reviewers:

• Male reviewers only (gender-dependent): Multimodal models were explored using only the 152 reviews by male speakers from the trimmed samples of 253 videos.
• Female reviewers only (gender-dependent): Multimodal models were explored using only the 101 reviews by female speakers from the trimmed samples of 253 videos.
• Both gender reviewers combined (gender-independent): The same classification models as the multimodal models in H1, using all 253 trimmed samples.

To address the last hypothesis (H5), additional classification experiments were performed using all the multimodal descriptors computed separately within different thin slices. More specifically, each review video was divided into 10 equal-length thin slices (the first 10%, 10% to 20%, 20% to 30%, etc.), and the same classification experiments were repeated within each thin-slice window. Furthermore, additional experiments were carried out to test the performance on progressive cumulative thin slices (the first 5%, first 10%, first 15%, etc.) to find out how soon the performance reaches that of using the whole 100% of the sessions. For the verbal descriptors, time was estimated using word count.

5.5.3 Methodology

For all the experiments, support vector machines (SVMs for the classification and SVRs for the regression experiments) were used with the radial basis function kernel as the prediction models (Chang & Lin, 2011). The experiments were performed with a 20-fold cross-validation (CV). Each CV experiment had 1-fold testing and 3-fold validation (among the 19 training folds) for the automatic selection of hyperparameters using a grid-search method, as recommended in Chang and Lin (2011). It is worth emphasizing that the folds were created such that no 2 folds contained samples from the same speaker. These restrictions assured speaker-independent experiments for better generalizability of the prediction models and results. The evaluation metric reported here is the averaged Pearson's correlation for the regression experiments and the averaged accuracy for the classification experiments over all 20 testing folds.

For feature selection, the absolute correlations were used for the regressions and the Information Gain (IG) metric was used for the classifications (Yang & Pedersen, 1997). It is noted that feature selections were performed only using the training samples from each cross-validation experiment. None of the test samples were used for feature selection. That is, 20 separate vocabulary buildings of n-grams and feature selections were performed using only the training samples for the 20 iterations of the cross-validation testing. The feature space was always limited to roughly 1/10th of the sample size.
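As a concrete illustration of the attribute-based condition (H2) from Section 5.5.2 under the SVR/SVC setup just described, the sketch below trains one RBF-kernel regressor per high-level attribute and feeds the predicted attribute levels into a final persuasiveness classifier. The data, dimensionalities, and variable names are hypothetical, and the grid search and feature selection steps described above are omitted; this is a minimal sketch rather than the exact pipeline used in the experiments.

```python
import numpy as np
from sklearn.svm import SVR, SVC

ATTRIBUTES = ["credible", "expert", "confident", "vivid",
              "passionate", "entertaining", "dominant"]

def fit_attribute_layer(X_train, attr_scores_train):
    """One RBF-kernel regressor per high-level attribute (the middle layer)."""
    return {a: SVR(kernel="rbf").fit(X_train, attr_scores_train[a]) for a in ATTRIBUTES}

def attribute_features(regressors, X):
    """Predicted attribute levels become the input space of the final classifier."""
    return np.column_stack([regressors[a].predict(X) for a in ATTRIBUTES])

# Hypothetical data: multimodal descriptor matrices, per-attribute mean ratings, and
# binary strongly (1) / weakly (0) persuasive labels for the trimmed sample set.
rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(200, 50)), rng.normal(size=(53, 50))
attr_scores_train = {a: rng.uniform(1, 7, size=200) for a in ATTRIBUTES}
y_train, y_test = rng.integers(0, 2, size=200), rng.integers(0, 2, size=53)

middle = fit_attribute_layer(X_train, attr_scores_train)
clf = SVC(kernel="rbf").fit(attribute_features(middle, X_train), y_train)
print("accuracy:", clf.score(attribute_features(middle, X_test), y_test))
```

In the actual experiments, the per-attribute targets came from the annotations described in Section 5.3.1; the random values here merely stand in so the sketch runs end to end.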
5.6 Results and Discussions

This section reports and discusses the experimental results centered around the five main research hypotheses described in Section 5.2. These results are followed by an analysis of the multimodal descriptors.

Figure 5.4: The persuasiveness prediction results for the multimodal and unimodal models (acoustic only, verbal only, para-verbal only, visual only, and all modalities combined), with the regression results on the left and the classification results on the right (p* < 0.05 and p** < 0.01). The error bars show 1 standard error.

5.6.1 Unimodal vs. Multimodal (H1)

The left graph in Figure 5.4 shows the regression results of predicting the continuous persuasiveness labels by each unimodal model and the multimodal models. The multimodal models predicted the level of persuasiveness with a mean Pearson's correlation of 0.34, the acoustic descriptors only models with 0.18, the verbal descriptors only models with 0.26, the para-verbal descriptors only models with 0.30, and the visual descriptors only models with 0.24. Paired-samples t-tests showed that the performance of the multimodal models was better with statistical significance compared with that of the acoustic descriptors only models (p < 0.01), the verbal descriptors only models (p < 0.01), and the visual descriptors only models (p < 0.05).

The right graph in Figure 5.4 shows the classification results of predicting between the strongly and weakly persuasive speakers by each unimodal model and the multimodal models. The multimodal models predicted between the strongly and weakly persuasive speakers with a mean accuracy of 70.34%, the acoustic descriptors only models with 62.21%, the verbal descriptors only models with 69.98%, the para-verbal descriptors only models with 67.85%, and the visual descriptors only models with 61.94%. Paired-samples t-tests showed that the performance of the multimodal models was better with statistical significance compared to that of the acoustic descriptors only models (p < 0.05) and the visual descriptors only models (p < 0.01). The majority baseline for the classification experiments was 55.02%.

For both the regression and classification results, the first hypothesis was partially confirmed: multimodal information improves the prediction performance compared to using unimodal information, with statistical significance especially over the acoustic-only and visual-only information. However, for the regression results, there was no statistically significant difference between the multimodal models and the para-verbal-only models. For the classification results, the multimodal models also performed better but did not show statistical significance compared to the verbal-only models and the para-verbal-only models. The results suggest that the para-verbal behavioral cues in particular, captured in the form of computational descriptors, are powerful in predicting persuasiveness.
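The modality-combination results summarized in Table 5.3 below come from early fusion over every non-empty subset of the four descriptor groups. A minimal sketch of how such combinations could be enumerated is shown here; the dictionary keys and array shapes are hypothetical, not the thesis code.

```python
# A minimal sketch of enumerating every non-empty combination of modality
# groups and concatenating their descriptors at the feature level (early fusion).
from itertools import combinations
import numpy as np

def fused_feature_sets(modality_features):
    """modality_features: dict such as {"acoustic": (n, d_a) array, "verbal": ...}.
    Yields (combination_name, (n, d) fused matrix) pairs."""
    names = list(modality_features)
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            fused = np.hstack([modality_features[m] for m in subset])
            yield "+".join(subset), fused
```

Each fused matrix would then be passed through the same speaker-independent cross-validation procedure sketched above.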
Table 5.3 summarizes both the regression and classification results in all possible combinations of the modality groups for a more detailed analysis of multimodal information fusion. It is noted that combining all four modalities was not necessarily better than using a subset of them, especially since the para-verbal descriptors were very powerful.

Modality combination (early fusion)   Regression (Pearson's r)   Classification accuracy (%)
All four modalities                   0.34                       70.85
Three-modality combinations           0.31                       70.45
                                      0.32                       69.77
                                      0.31                       71.27
                                      0.34                       70.34
Two-modality combinations             0.27                       67.26
                                      0.26                       67.49
                                      0.25                       65.40
                                      0.32                       71.05
                                      0.28                       66.83
                                      0.31                       68.56
Acoustic only                         0.18                       62.21
Verbal only                           0.26                       69.98
Para-verbal only                      0.30                       67.85
Visual only                           0.24                       61.94

Table 5.3: The multimodal prediction results using the computational descriptors in all combinations of modalities, grouped here by the number of fused modality groups.

5.6.2 Attribute-Based Multimodal Approach (H2)

Figure 5.5: The persuasiveness prediction results for two different multimodal approaches, one combining all the descriptors at the feature level (early fusion) and the other using the attribute-based fusion, shown against the majority baseline. The error bars show 1 standard error.

Figure 5.5 shows the classification results of the attribute-based multimodal models, which performed at 76.03%. A paired-samples t-test showed a marginal statistical significance at p < 0.1 between the performance of the attribute-based approach and that of the early-fusion approach at 70.85%.

The results partially confirmed the second hypothesis and suggest the benefit of using several key high-level attributes that are highly related to persuasion in predicting the level of persuasiveness. The attribute-based approach can also give more detail by breaking down a speaker's persuasiveness into several dimensions. For instance, a speaker may be persuasive particularly based on his or her level of credibility or passion, and the attribute-based approach can provide a deeper understanding of why he or she is more or less persuasive. The better performance of the attribute-based approach is most likely due to the computational descriptors being able to predict the levels of some of the attributes more easily than predicting persuasiveness directly, which overall improves the performance of persuasiveness prediction.

5.6.3 Effect of Opinion Polarity (H3)

Figure 5.6: The persuasiveness prediction results for the multimodal models when made opinion polarity-dependent and gender-dependent. The error bars show 1 standard error.

Figure 5.6 shows the classification results of the multimodal predictors across different conditions of positive reviews only, negative reviews only, and all reviews combined (the left-most bar labeled "all combined"). Compared to the predictors using all the reviews at 70.85%, the predictors trained and tested using only the positive reviews performed at 64.91% and those trained and tested using only the negative reviews performed at 68.65%.

The experiments did not support the third hypothesis, and the opinion polarity-dependent classifiers did not show any improvement in performance. However, none of the differences were statistically significant, so no conclusions could be drawn from the results. The reduced sample sizes available for training the opinion-dependent models could have caused the relatively reduced performance. It is also possible that the behavior change across polarities is not pronounced enough for opinion-dependent modeling to give an advantage.
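The statistical comparisons reported in these subsections are paired-samples t-tests over the 20 matched cross-validation test folds. The snippet below is purely illustrative: the per-fold accuracies are randomly generated stand-ins, not the actual experimental values.

```python
# An illustrative paired-samples t-test between two models evaluated on the
# same 20 test folds (the accuracy values here are synthetic placeholders).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
acc_model_a = rng.normal(loc=0.70, scale=0.05, size=20)  # e.g., multimodal
acc_model_b = rng.normal(loc=0.62, scale=0.05, size=20)  # e.g., acoustic only

t_stat, p_value = stats.ttest_rel(acc_model_a, acc_model_b)
print(f"paired t = {t_stat:.2f}, p = {p_value:.4f}")
```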
5.6.4 Effect of Gender (H4)

Figure 5.6 shows the classification results of the multimodal predictors across different gender conditions of male reviewers only, female reviewers only, and all reviewers combined (the same left-most bar labeled "all combined"). Compared to the predictors using all the reviewers at 70.85%, the predictors trained and tested using only the male reviewers performed at 77.05% and those trained and tested using only the female reviewers performed at 63.01%.

The experiments did not conclusively support the fourth hypothesis, and the gender-dependent classifiers did not necessarily show any improvement in performance. Although the male-only classifiers did show some improved performance compared to the all-reviewers classifiers, the results did not show any statistical significance and no conclusions could be drawn from them. One possible cause of the female-reviewers classifiers performing relatively poorly compared to the male-reviewers classifiers is the difference in sample size: the male-reviewers classifiers had a sample size roughly 50% greater than that available for training the female-reviewers classifiers, and such a small sample could have resulted in models that were not general enough.

5.6.5 Thin Slice Prediction (H5)

Figure 5.7: The persuasiveness prediction results for various thin slices. The left graph shows the thin-slice results of using computational descriptors encoded from only 1/10th of each review session, and the right graph shows the results for cumulative thin-slice windows (i.e., the first 5% of the session, first 10%, first 15%, etc.). The dotted line in each graph indicates the prediction level for the multimodal approach in H1 using computational descriptors from all the modalities and the whole 100% session.

Figure 5.7 shows the classification results of the all-modalities predictors across different thin slices. Compared with the prediction accuracy of 70.85% when using the whole length of each review, using 1/10th of the session mostly yielded between 60% and 70% prediction accuracy, with the highest prediction in the 50% to 60% session thin slice, which performed at 70.02% prediction accuracy. The average performance across all 10 thin slices was 64.00%. In a separate experiment not shown in the figure, when the 1/10th slice was chosen at random in each 20-fold cross-validation experiment, the performance was very close at 64.15%; that is, there was no performance difference between slices chosen at random and the 10% slices averaged across the whole session. The cumulative thin-slice results show that the prediction performance reached that of using the whole session when the cumulative thin slice was taken up to 40% of the session from the beginning, performing at 70.33% prediction accuracy. The results are a typical demonstration of the idea of thin slices and suggest that it is still possible to make much inference about a speaker's persuasiveness just by looking at a small window of behavior. It is particularly interesting that only looking at 1/10th of a movie review, especially toward the middle, seems to be enough to reasonably predict the speaker's level of persuasiveness.
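To make the thin-slice protocol concrete, the sketch below shows one plausible way to cut a review's per-frame descriptor sequence into the ten non-overlapping 10% windows and the progressive cumulative windows used above; the array shapes are assumptions, and the descriptors would be re-encoded within each slice before prediction.

```python
# A minimal sketch of thin-slice windowing over a (T, d) per-frame feature
# sequence: ten equal non-overlapping slices and progressive cumulative slices.
import numpy as np

def separate_slices(frames, n_slices=10):
    """Return n_slices arrays covering 0-10%, 10-20%, ..., 90-100% of the session."""
    return np.array_split(frames, n_slices, axis=0)

def cumulative_slices(frames, step=0.05):
    """Return windows covering the first 5%, 10%, 15%, ..., 100% of the session."""
    T = frames.shape[0]
    fractions = np.arange(step, 1.0 + 1e-9, step)
    return [frames[: max(1, int(round(f * T)))] for f in fractions]
```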
5.6.6 Descriptor Analysis

Table 5.4 highlights several top descriptors that were particularly discriminative in separating strongly persuasive and weakly persuasive speakers. The verbal modality was not included in this analysis because the bag-of-words descriptors are useful collectively rather than individually.

Descriptor                                                      Info Gain
Acoustic
  F2: range (min - max)                                         0.09
  Peak Slope: range (25th - 75th percentile)                    0.08
  MFCC4: 25th percentile                                        0.07
  MFCC2: range (10th - 90th percentile)                         0.07
  MFCC4: mean                                                   0.06
Para-verbal
  Pause                                                         0.20
Visual
  Gaze movement (up / down): range (25th - 75th percentile)     0.11
  Gaze movement (up / down): range (10th - 90th percentile)     0.09
  Gaze movement (up / down): 25th percentile                    0.08
  Surprise: range (min - max)                                   0.06
  AU20: 75th percentile                                         0.06

Table 5.4: Top computational descriptors in each modality for predicting between strongly and weakly persuasive speakers.

From the acoustic modality, the ranges of the second formant and of the peak-slope voice quality measure were particularly useful in the classification experiments. MFCC descriptors in the low-frequency regions also stood out for predicting persuasive speakers; these were expected to perform better than those in high-frequency regions due to their denser resolution and greater robustness to noise. Consistent with the literature described in Section 2.3, the para-verbal descriptor of pause showed much discriminative power in separating speakers perceived as strongly persuasive from those perceived as weakly persuasive. Among all the descriptors from all the modalities combined, this descriptor was the single most predictive cue. From the visual modality, the descriptors from gaze were predominant, followed by those from the discrete emotion of surprise and AU20 (lip stretcher).

5.7 Conclusions

This chapter introduced a novel multimedia corpus specifically designed to study persuasiveness in the context of social multimedia. The chapter presented various computational approaches to using verbal and nonverbal behavior from multiple channels of communication to predict a speaker's persuasiveness in online social multimedia content. The chapter also presented a novel approach of using the high-level attributes related to persuasion in predicting the level of persuasiveness. Furthermore, it was demonstrated that the idea of thin slices can be used to observe only a short window of a speaker's behavior and still achieve prediction comparable to observing the entire length of the video.

Interesting future directions include investigating more ways of computationally capturing various indicators of persuasiveness and exploring other algorithmic methods of fusing information from multiple modalities. The results in this chapter provide a baseline for future studies using this new corpus to carry out a deeper analysis of the relationship between persuasiveness and relevant high-level attributes, including personality.

Chapter 6 Conclusions and Future Directions

6.1 Conclusions

This thesis described three connected research topics that converge to a coherent theme of understanding human communication and building computational models of human behavior. The first topic was on crowdsourcing large-scale micro-level annotations of human behavior in videos to obtain the data necessary for behavior analysis and computational modeling.
The other two were building computational models of human behavior in the contexts of face-to-face dyadic negotiation and online persuasion. Among other contributions, this thesis specifically addressed four research challenges in modeling human behavior, restated below:

1. Human behavior annotations
The challenge of obtaining detailed, micro-level annotations of human behavior on a large scale, which provide the basis from which computational models can be built.
For this challenge, the thesis introduced a novel web interface, evaluation metrics, and a procedure for training crowd workers and effectively obtaining micro-level annotations of human behavior through online crowdsourcing. An extensive set of experiments showed that the crowdsourcing approach can be used to obtain annotations of comparable quality to those obtained with locally trained in-house coders. The approach has also been shown to be generalizable, with the effect of training crowd workers transferring across different datasets. The greatest promise of the crowdsourcing approach lies in its scalability, making it an effective solution for the challenge of obtaining large-scale annotations in a very short amount of time.

2. Computational representations of behavior
The challenge of making condensed and meaningful computational representations of human behavior, in terms of individual behavior as well as interpersonal behavior for capturing the behavioral dynamics between interlocutors (e.g., behavior mirroring) during face-to-face interaction.
For this challenge, the thesis explored how individual behavior can be computationally encoded in both the negotiation and persuasion contexts. For the negotiation context, various multimodal behavioral cues were encoded from the nonverbal behavior of the proposer and also from the respondent during the negotiation interaction, and these were shown to be effective representations for modeling negotiation behavior and predicting intermediate negotiation outcomes during the interaction. In particular, the work showed that mutual behavioral cues can capture the interpersonal dynamics between the negotiators by encoding behavior symmetry and asymmetry, which can also hint at the overall atmosphere of the negotiation process, whether it is more on the cooperative or the competitive side. In the persuasion context, the thesis showed effective computational descriptors of an individual speaker for capturing persuasive behavior in the acoustic, verbal, para-verbal, and visual modalities.

3. Temporal analysis
The challenge of modeling human behavior with a temporal aspect, specifically for the purpose of making real-time analysis and prediction.
For this challenge, the work on persuasion showed how thin-slice modeling makes the computational models more powerful by enabling them to provide real-time analysis and prediction of persuasive behavior. Thin-slice modeling, both in non-overlapping separate time windows and in progressive cumulative thin slices, was shown to provide meaningful information that the computational models otherwise cannot provide.

4. Multimodal fusion
The challenge of fusing multimodal information from verbal, visual, and acoustic human behavior and designing effective computational models.
For this challenge, the work on negotiation explored both feature-level and modality-level information fusion techniques in designing the computational models.
The work on persuasion explored computational models with a novel attribute-based multimodal fusion approach, using various high-level attributes related to persuasion in the middle layer for predicting persuasiveness.

6.2 Future Directions

This thesis concludes with a list of possible future directions that would be interesting to take as follow-up research.

Algorithms and Computational Descriptors

• An interesting follow-up would be to explore recent progress in representation learning for modeling individual and interpersonal behavior. For instance, recent neural network models, under the name of deep learning, have been showing much promise and have become state-of-the-art models in many different research problems, from image recognition to affective modeling (Martinez, Bengio, & Yannakakis, 2013). It would be interesting to apply deep neural networks to modeling human behavior and compare their performance with the models already explored in this thesis. The features learned by deep neural networks could also be used within the models already explored. Apart from exploring completely different types of models, a relevant direction for exploring state-of-the-art computational models would be to focus more deeply on the multimodal information fusion aspect. This thesis explored early fusion at the feature level, late fusion at the modality level, and a novel attribute-based fusion technique for modeling persuasive behavior with relevant high-level attributes such as credibility and expertise in the middle layer. It would be interesting to explore other ways to fuse multimodal information in computational models, including hybrid fusion techniques (Atrey, Hossain, El Saddik, & Kankanhalli, 2010).

• Another interesting direction would be to explore the structure of negotiation and the behavioral paths that lead to negotiation success or failure. For instance, a negotiation session could be broken down into multiple common intermediate steps. Perhaps a good level of rapport is a preliminary step that is usually reached before the negotiators start to become more open and accommodating, leading to a successful negotiation. This thesis already explored having high-level attributes as an intermediate step during multimodal fusion for the problem of predicting persuasiveness. The same idea could be applied and expanded to the problem of negotiation: the high-level attributes related to successful negotiation could be identified and the negotiation structure defined to capture that information in the computational models. Moreover, meaningful behavioral paths could be found by exploring the temporal dynamics of the attributes and the behavioral changes across multiple intermediate steps of negotiation.

• In terms of computational descriptors, this thesis explored computationally encoding mostly nonverbal multimodal behavior. Verbal behavior was encoded at the lexical level, and while this is effective for many problems involving natural language processing, more sophisticated modeling of verbal behavior is possible at the syntactic and semantic levels. The performance of the models explored in this thesis would most likely increase if the verbal content and its logical structures could be captured and used in computational models in both the negotiation and persuasion contexts.

Operationalizing Persuasiveness

• For the work on persuasion, an extended theoretical and psychological study could provide deeper insight into what persuasion really means.
More specifically, future work could involve operationalizing persuasiveness in a different way by identifying the key dimensions that make up human perception of persuasiveness. Past research has shown extensively that persuasion is closely related to many other attributes such as credibility (Burgoon et al., 1990; Pornpitakpan, 2004). This thesis used such findings in designing a novel computational model with the high-level attributes. Yet more complex dynamics between persuasion and the high-level attributes remain to be found and verified, and this thesis opens up many new research questions. For instance, to what degree do credibility or humor factor in when we judge a speaker to be persuasive or not persuasive? What is the influence of the message source, the message receiver, and the medium? Such theoretical research results would be very useful for future research efforts related to persuasion.

Other Contexts

• Both research contexts of negotiation and persuasion allow for natural extensions with follow-up studies. Specifically, all the studies in both works were on same-gender interaction, but studying cross-gender influence is equally important and meaningful. Having both same-gender and cross-gender results would give a fuller picture of the research implications. Furthermore, the research results could be corroborated by taking into account other contexts, such as cultural or age differences. Both works could also be extended to see whether they generalize to other contexts, such as public speaking for persuasion. For instance, do the same behavioral cues that were useful for predicting online persuasiveness transfer to face-to-face interaction and public-speaking presentations? This would be an interesting follow-up research question to pursue.

Applications and Human-Computer Interaction

• It would be useful to explore the models of negotiation and persuasion from the application standpoint. The models of persuasive behavior can be applied in a system that performs automatic video analysis and filtering, providing users with valuable real-time information. For negotiation, a system could also perform real-time analysis of the negotiation process and provide decision-support information. Furthermore, an automatic system in both contexts would make a great training tool. For instance, a training system could provide real-time analysis of the users, such as their level of eye contact, voice tone, or any key behavioral cues found to be important in the negotiation or persuasion context, and train the users to be better negotiators or more persuasive speakers. To be more human-friendly and effective, such training systems could be implemented with virtual humans on a computer screen or with robots that have real-life embodiment.

• Such applications of the models naturally lead to many research problems related to human-computer interaction. For instance, for both negotiation and persuasion training systems, when and how they display real-time analysis and feedback to the users would have a big impact on their usability and effectiveness.

Bibliography

Allred, K. G., Mallozzi, J. S., Matsui, F., & Raia, C. P. (1997). The influence of anger and compassion on negotiation performance. Organizational Behavior and Human Decision Processes, 70(3), 175-187. http://dx.doi.org/10.1006/obhd.1997.2705

Allwood, J., Cerrato, L., Dybkaer, L., Jokinen, K., Navarretta, C., & Paggio, P. (2005). The MUMIN multimodal coding scheme. NorFA yearbook, 2005, 129-157.
Retrieved from http://www.ling.helsinki.fi/kit /2006k/ clt31 Ommod/MUMIN-coding scheme-V3.3. pdf Ambady, N., & Rosenthal, R. (1992). Thin slices of expressive behavior as predictors of interpersonal consequences: A meta-analysis. Psychological Bulletin, 111(2), 256-274. http:/ /dx.doi.org/10.1037 /0033-2909.111.2.256 Atrey, P. K., Hossain, M. A., El Saddik, A., & Kankanhalli, M. S. (2010). Multi modal fusion for multimedia analysis: A survey. Multimedia Systems, 16(6), 345-379. http:/ /dx.doi.org/10.1007 /s00530-010-0182-0 Bailenson, J. N., & Yee, N. (2005). Digital chameleons: Automatic assimilation of nonverbal gestures in immersive virtual environments. Psychological Science, 16(10), 814-819. http:/ /dx.doi.org/10.1111/j .1467-9280.2005.01619.x Baron, R. A. (1990). Environmentally induced positive affect: Its impact on self-efficacy, task performance, negotiation, and conflict. Journal of Applied Social Psychology, 20(5), 368-384. http:/ /dx.doi.org/10.1111 /j .1559-1816.1990.tb00417 .x Barry, B., & Friedman R. A. (1998). Bargainer characteristics in distributive and in tegrative negotiation. Journal of Personality and Social Psychology, 74(2), 345-359. http:/ /dx.doi.org/10.1037 /0022-3514.74.2.345 Barry, B., & Oliver, R. L. (1996). Affect in dyadic negotiation: A model and propo sitions. Organizational Behavior and Human Decision Processes, 67(2), 127-143. http:/ /dx.doi.org/10.1006 /obhd.1996.0069 Beck, R. S., Daughtridge, R., & Sloane P. D. (2002). Physician-patient communication in the primary care office: A systematic review. Journal of the American Board of Family Medicine, 15(1), 25-38. Retrieved from http://www.jabfrn.org/content/15/l/25.short 105 Bernieri, F. J., Gillis, J. S., Davis, J. M., & Grahe, J. E. (1996). Dyad rapport and the accuracy of its judgment across situations: A lens model analysis. Journal of Personality and Social Psychology, 71(1), 110-129. http://dx.doi.org/10.1037 /0022- 3514. 71.1.110 Bernieri, F. J., & Rosenthal, R. (1991). Interpersonal coordination: Behavior matching and interactional synchrony. In R. Feldman & B. Rime (Eds.), Fundamentals of Nonverbal Behavior (pp. 401-432). New York, NY: Cambridge University Press. Bernstein, M. S., Brandt, J., Miller, R. C., & Karger, D. R. (2011). Crowds in two seconds: Enabling realtime crowd-powered interfaces. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology (pp. 33-42). http:/ /dx.doi.org/10.1145 /204 7196.204 7201 Biel, J.-1., & Gatica-Perez, D. (2012). The good, the bad, and the angry: An alyzing crowdsourced impressions of vloggers. In Proceedings of the 6th Interna tional AAA! Conference on Weblogs and Social Media (pp. 407-410). Retrieved from http:/ /infoscience.epfl.ch/record/192339 Biel, J.-1., Teijeiro-Mosquera, L., & Gatica-Perez, D. (2012). FaceTube: Predicting personality from facial expression of emotion in online conversational video. In Pro ceedings of the 14th ACM International Conference on Multimodal Interaction (pp. 53-56). http:/ /dx.doi.org/10.1145 /2388676.2388689 Brugman, H., & Russel, A. (2004). Annotating multi-media / multi-modal re sources with ELAN. In Proceedings of the 4th International Conference on Lan guage Resources and Evaluation (pp. 2065-2068). Retrieved from http://www.lrec conf.org/proceedings /lrec2004/ pdf / 480. pdf Burgoon, J. K., Birk, T., & Pfau, M. (1990). Nonverbal behaviors, per- suasion, and credibility. Human Communication Research, 17(1), 140-169. 
http://dx.doi.org/10.1111 /j .1468-2958.1990.t b00229 .x Busso, C., Lee, S., & Narayanan, S. (2009). Analysis of emotionally salient aspects of fundamental frequency for emotion detection. IEEE Transactions on Audio, Speech, and Language Processing, 17(4), 582-596. http:/ /dx.doi.org/10.1109 /TASL.2008.2009578 Carli, L. L., LaFleur, S. J., & Loeber, C. C. (1995). Nonverbal behavior, gen der, and influence. Journal of Personality and Social Psychology, 68(6), 1030-1041. http:/ /dx.doi.org/10.1037 /0022-3514.68.6.1030 Carnevale, P. J ., & !sen, A. M. (1986). The influence of positive affect and visual access on the discovery of integrative solutions in bilateral negotiation. Organizational Be havior and Human Decision Processes, 37(1), 1-13. http://dx.doi.org/10.1016/0749- 5978(86)90041-5 106 Cassell, J., Sullivan, J., Prevost, S., & Churchill, E. (Eds.). (2000). Embodied conversa tional agents .. Cambridge, MA: The MIT Press. Chaiken, S. (1979). Communicator physical attractiveness and persuasion. Journal of Personality and Social Psychology, 37(8), 1387-1397. http://dx.doi.org/10.1037 /0022- 3514.37.8.1387 Chaiken, S., & Eagly, A. H. (1976). Communication modality as a determinant of message persuasiveness and message comprehensibility. Journal of Personality and Social Psychology, 34(4), 605-614. http://dx.doi.org/10.1037 /0022-3514.34.4.605 Chaiken, S., Liberman, A., & Eagly, A.H. (1989). Heuristic and systematic information processing within and beyond the persuasion context. In J. Uleman & J. Bargh (Eds.), Unintended Thought: Limits of Awareness, Intention, and Control (pp. 212-252). New York, NY: Guildford Press. Chang, C.-C., & Lin, C.-J. (2011). LIBSVM: A library for support vector ma chines. ACM Transactions on Intelligent Systems and Technology, 2(3), 27:1-27:25. http:/ /dx.doi.org/10.1145 /1961189.1961199 Chartrand, T. L., & Bargh, J. A. (1999). The chameleon effect: The perception-behavior link and social interaction. Journal of Personality and Social Psychology, 76(6), 893- 910. http:/ /dx.doi.org/10.1037 /0022-3514.76.6.893 Chartrand, T. L., Maddux, W. W., & Lakin, J. L. (2006). Beyond the perception behavior link: The ubiquitous utility and motivational moderators of nonconscious mimicry. In R. Hassin, J. Uleman, & J. Bargh (Eds.), The New Unconscious (pp. 334-361). New York, NY: Oxford University Press. Chatterjee, M., Park, S., Shim, H. S., Sagae, K., & Morency, L.-P. (2014). Verbal behaviors and persuasiveness in online multimedia content. In Proceedings of the 2nd Workshop on Natural Language Processing for Social Media (pp. 50-58). Retrieved from http://www.aclweb.org/old_anthology /W /W14/W14-59.pdf#page=60 Crano, W. D., & Prislin, R. (2006). Attitudes and persuasion. Annual Review of Psychology, 57, 345-374. http://dx.doi.org/10.1146/annurev.psych.57.102904.190034 Curhan, J. R., & Pentland, A. (2007). Thin slices of negotiation: Predicting outcomes from conversational dynamics within the first 5 minutes. Journal of Applied Psychol ogy, 92(3), 802-811. http://dx.doi.org/10.1037 /0021-9010.92.3.802 Dasiopoulou, S., Giannakidou, E., Litos, G., Malasioti, P., & Kompatsiaris, Y. (2011). A survey of semantic image and video annotation tools. In G. Paliouras, C. Spyropou los, & G. Tsatsaronis (Eds.), Knowledge-Driven Multimedia Information Extraction and Ontology Evolution: Building the Semantic Gap (pp. 196-239). Springer Berlin Heidelberg. 107 Degottex, G., Kane, J., Drugman, T., Raitio, T. & Scherer, S. (2014). 
COVAREP - A collaborative voice analysis repository for speech technologies. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 960-964). http:/ /dx.doi.org/10.1109 /ICASSP.2014.6853739 De Looze, C., Scherer, S., Vaughan, B., & Campbell, N. (2014). Investigating automatic measurements of prosodic accommodation and its dynamics in social interaction. Speech Communication, 58, 11-34. http://dx.doi.org/10.1016/j.specom.2013.10.002 Downs, J. S., Holbrook, M. B., Sheng, S., & Cranor, L. F. (2010). Are your partic ipants gaming the system?: Screening mechanical turk workers. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 2399-2402). http:/ /dx.doi.org/10.1145 /1753326.1753688 Druckman, D., & Olekalns, M. (2008). Emotions in negotiation. Group Decision and Negotiation, 17(1), 1-11. http://dx.doi.org/10.1007 /s10726-007-9091-9 Drugman, T., & Alwan, A. (2011). Joint robust voicing detection and pitch estima tion based on residual harmonics. In Proceedings of the 12th Annual Conference of the International Speech Communication Association (pp. 1973-1976). Retrieved from http://citeseerx.ist.psu.edu/viewdoc/summary? doi= 10.1.1. 228.1866 Dufwenberg, M., & Kirchsteiger, G. (2004). A theory of sequential reciprocity. Games and Economic Behavior, 47(2), 268-298. http://dx.doi.org/10.1016/j.geb.2003.06.003 Ekman, P., & Davidson, R. J. (Eds.). (1994). The nature of emotion: Fundamental questions. New York, NY: Oxford University Press. Ekman, P., & Friesen, W. V. (1969). The repertoire of nonverbal behavior. Semiotica, 1(1), 49-98. http://dx.doi.org/10.1515/semi.1969.1.1.49 Ekman, P., & Friesen, W. V. (1982). Felt, false, and miserable smiles. Journal of Nonverbal Behavior, 6(4), 238-252. http://dx.doi.org/10.1007 /BF00987191 Ekman, P., & Rosenberg, E. L. (Eds.). (1997). What the face reveals: Basic and applied studies of spontaneous expression using the Facial Action Coding System {FAGS). New York, NY: Oxford University Press. FaceFX [Computer software]. Ocean Grove, NJ: OC3 Entertainment. Finkelstein, S., Scherer, S., Ogan, A., Morency, L.-P., & Cassell, J. (2012). Investigating the influence of virtual peers as dialect models on students' prosodic inventory. In Proceedings of the Workshop on Child, Computer, and Interaction. Retrieved from http://repository.cmu.edu/ cgi/viewcontent .cgi? article=1268&context=hcii Fong, T., Nourbakhsh, I., & Dautenhahn, K. (2003). A survey of so- cially interactive robots. Robotics and Autonomous Systems, 42(3-4), 143-166. http:/ /dx.doi.org/10.1016 /S0921-8890(02)00372-X 108 Frey, K. P. & Eagly, A. H. (1993). Vividness can undermine the persuasive ness of messages. Journal of Personality and Social Psychology, 65(1), 32-44. http:/ /dx.doi.org/10.1037 /0022-3514.65.1.32 Gabbott, M., & Hogg, G. (2000). An empirical investigation of the impact of non verbal communication on service evaluation. European Journal of Marketing, 34(3-4), 384-398. http:/ /dx.doi.org/10.1108/03090560010311911 Gao, Q., & Vogel, S. (2010). Consensus versus expertise: A case study of word alignment with Mechanical Turk. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk (pp. 30-34). Retrieved from http://dl.acm.org/citation.cfrn?id=l866700 Gobi, C., & Ni Chasaide, A. (2003). The role of voice quality in communi- cation emotion, mood and attitude. Speech Communication, 40(1-2), 189-212. http:/ /dx.doi.org/10.1016 /S0167-6393(02)00082-1 Graham, G. 
H., Unruh, J., & Jennings, P. (1991). The Impact of nonverbal communi cation in organizations: A survey of perceptions. International Journal of Business Communication, 28(1), 45-62. http://dx.doi.org/10.1177 /002194369102800104 Grahe, J. E., & Bernieri, F. J. (1999). The importance of nonverbal cues in judging rapport. Journal of Nonverbal Behavior, 23(4), 253-269. http:/ /dx.doi.org/10.1023/A:1021698725361 Greenhalgh, L., & Chapman, D. (1998). Negotiator relationships: Con- struct measurement, and demonstration of their impact on the process and outcomes of negotiation. Group Decision and Negotiation, 7(6), 465-489. http:/ /dx.doi.org/10.1023/A:1008694307035 Harrigan, J. A., Kues, J. R., & Weber, J. G. (1986). Impressions of hand move ments: Self-touching and gestures. Perceptual and Motor Skills, 63(2), 503-516. http:/ /dx.doi.org/10.2466 /pms.1986.63.2.503 Hosman, L.A. (2002). Language and persuasion. In J. Dillard and M. Pfau (Eds.), The persuasion handbook: Developments in theory and practice (pp. 371-390). Thousand Oaks, CA: Sage Publications. Hsueh, P.-Y. , Melville, P., & Sindhwani, V. (2009). Data quality from crowdsourcing: A study of annotation selection criteria. In Proceedings of the NAACL HLT 2009 Workshop on Active Learning for Natural Language Processing (pp. 27-35). Retrieved from http://dl.acm.org/citation.cfrn?id=l564137 Inglis, M. & Mejia-Ramos, J. P. (2009). The effect of authority on the persua- siveness of mathematical arguments. Cognition and Instruction, 27(1), 25-50. http:/ /dx.doi.org/10.1080 /07370000802584513 109 Jaimes, A., & Se be, N. (2007). Multimodal human-computer interaction: A survey. Computer Vision and Image Understanding, 108(1-2), 116-134. http:/ /dx.doi.org/10.1016 /j .cviu.2006.10.019 Johnson, D. W. (1971). Effects of warmth of interaction, accuracy of understanding, and the proposal of compromises on listener's behavior. Journal of Counseling Psychology, 18(3), 207-216. http://dx.doi.org/10.1037 /h0030841 Johnson, D. W., McCarty, K., & Allen, T. (1976). Congruent and contradictory verbal and nonverbal communications of cooperativeness and competitiveness in negotiations. Communication Research, 3(3), 275-292. http:/ /dx.doi.org/10.1177 /009365027600300303 Kane, J., & Gobi, C. (2011). Identifying regions of non-modal phonation using fea tures of the wavelet transform. In Proceedings of the 12th Annual Conference of the International Speech Communication Association (pp. 177-180). Retrieved from http://citeseerx.ist.psu.edu/viewdoc/summary? doi= 10.1.1. 228.3985 Kane, J., Scherer, S., Aylett, M., Morency, L.-P., & Gobi, C. (2013). Speaker and language independent voice quality classification applied to unlabelled cor pora of expressive speech. In Proceedings of the 2013 IEEE International Con ference on Acoustics, Speech and Signal Processing {ICASSP) (pp. 7982-7986). http:/ /dx.doi.org/10.1109 /ICASSP.2013.6639219 Kang, S.-H., Gratch, J., Sidner, C., Artstein, R., Huang, L., Morency, L.-P. (2012). Towards building a virtual counselor: Modeling nonverbal behavior dur ing intimate self-disclosure. In Proceedings of the 11th International Confer ence on Autonomous Agents and Multiagent Systems (pp. 63-70). Retrieved from http:/ /dl.acm.org/citation.cfm?id=2343585 Kim, J., Nguyen, P. T., Weir, S., Guo, P. J., Miller, R. C., & Gajos, K. Z. (2014). Crowdsourcing step-by-step information extraction to enhance existing how-to videos. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 4017-4026). 
http://dx.doi.org/10.1145/2556288.2556986 Knapp, M., Hall, J., & Horgan, T. G. (2013). Nonverbal communication in human interaction (8th ed.). Boston, MA: Cengage Learning. Kramer, R. M. & Messick, D. M. (Eds.). (1995). Negotiation as a social process. Thou sand Oaks, CA: Sage Publications. Krippendorff, K. (1995). On the reliability of unitizing continuous data. Sociological Methodology, 25, 47-76. Krippendorff, K. (2012). Content Analysis: An Introduction to Its Methodology (3rd ed.). Thousand Oaks, CA: Sage Publications. 110 LaCrosse, M. B. (1975). Nonverbal behavior and perceived counselor attrac- tiveness and persuasiveness. Journal of Counseling Psychology, 22(6), 563-566. http:/ /dx.doi.org/10.1037 /0022-0167.22.6.563 Lao, S. & Kawade, M. (2005). Vision-based face understanding technologies and their applications. In S. Li, J. Lai, T. Tan, G. Feng, & Y. Wang (Eds.), Advances in Biometric Person Authentication (pp. 339-348). Springer Berlin Heidelberg. Le, J., Edmonds, A., Hester, V., & Biewald, L. (2010). Ensuring quality in crowd sourced search relevance evaluation: The effects of training question distribution. In Proceedings of the SIGIR 2010 Workshop on Crowdsourcing for Search Evaluation (pp. 17-20). Retrieved from http://ir.ischool.utexas.edu/cse2010/materials/leetal.pdf Lee, C.-C., Katsamanis, A., Black, M. P., Baucom, B. R., Christensen, A., Georgiou, P. G., Narayanan, S. S. (2014). Content-based multimedia information retrieval: State of the art and challenges. ACM Transactions on Multimedia Computing, Communi cations, and Applications, 2(1), 1-19. http://dx.doi.org/10.1145/1126004.1126005 Lew, M. S., Sebe, N., Djeraba, C., & Jain, R. (2006). Content-based multi media information retrieval: State of the art and challenges. ACM Transac tions on Multimedia Computing, Communications, and Applications, 2(1), 1-19. http:/ /dx.doi.org/10.1145 /1126004.1126005 Littlewort, G., Whitehill, J ., Wu, T., Fasel, I., Frank, M., Movellan, J ., & Barlett, M. (2011). The computer expression recognition toolbox (CERT). In Proceedings of the 2011 IEEE International Conference on Automatic Face f3 Gesture Recognition and Workshops (pp. 298-305). http://dx.doi.org/10.1109/FG.2011.5771414 Louwerse, M. M., Dale, R., Bard, E. G., & Jeuniaux, P. (2012). Behavior matching in multimodal communication is synchronized. Cognitive Science, 36(8), 1404-1426. http:/ /dx.doi.org/10.1111 /j .1551-6709.2012.01269.x Maddux, J.E. & Rogers, R. W. (1980). Effects of source expertness, physical attractive ness, and supporting arguments on persuasion: A case of brains over beauty. Journal of Personality and Social Psychology, 39(2), 235-244. http://dx.doi.org/10.1037 /0022- 3514.39.2.235 Mahl, G. F. (1956). Disturbances and silences in the patient's speech in psychotherapy. Journal of Abnormal and Social Psychology, 53(1), 1-15. http:/ /dx.doi.org/10.1037 /h0047552 Marge, M., Banerjee, S., & Rudnicky, A. I. (2010). Using the Amazon Me- chanical Turk for transcription of spoken language. 2010 IEEE Interna- tional Conference on Acoustics Speech and Signal Processing (pp. 5270-5273). http:/ /dx.doi.org/10.1109 /ICASSP.2010.5494979 111 Martinez, H. P., Bengio, Y., & Yannakakis, G. N. (2013). Learning deep physio logical models of affect. IEEE Computational Intelligence Magazine, 8(2), 20-33. http:/ /dx.doi.org/10.1109 /MCI.2013.224 7823 Maslow, C., Yoselson, K., & London, H. (2011). Persuasiveness of confidence expressed via language and body language. British Journal of Social and Clinical Psychology, 10(3), 234-240. 
http:/ /dx.doi.org/10.1111 /j .2044-8260.1971.tb007 42.x Mason, W., & Suri, S. (2012). Conducting behavioral research on Amazon's Mechanical Turk. Behavior Research Methods, 44(1), 1-23. http://dx.doi.org/10.3758/s13428-011- 0124-6 McKeown, G., Valstar, M. F., Cowie, R., & Pantie, M. (2010). The SEMAINE corpus of emotionally coloured character interactions. 2010 IEEE International Conference on Multimedia and Expo (pp. 1079-1084). http:/ /dx.doi.org/10.1109 /ICME.2010.5583006 Mehrabian, A. (1971). Silent messages. Belmont, CA: Wadsworth Publishing Company. Mehrabian, A., & Williams, M. (1969). Nonverbal concomitants of perceived and in tended persuasiveness. Journal of Personality and Social Psychology, 13(1), 37-58. http:/ /dx.doi.org/10.1037 /h0027993 Meyers-Levy, J., & Malaviya, P. (1999). Consumers' processing of persuasive advertise ments: An integrative framework of persuasion theories. Journal of Marketing, 63, 45-60. http://dx.doi.org/10.2307 /1252100 Miller, N., Maruyama, G., Beaber, R. J., & Valone, K. (1976). Speed of speech and persuasion. Journal of Personality and Social Psychology, 34( 4), 615-624. http:/ /dx.doi.org/10.1037 /0022-3514.34.4.615 Mohammadi, G., Park, S., Sagae, K., Vinciarelli, A., & Morency, L.-P. (2013). Who is persuasive? The role of perceived personality and communication modality in social multimedia. In Proceedings of the 15th ACM International Conference on Multimodal Interaction (pp. 19-26). http://dx.doi.org/10.1145/2522848.2522857 Morand, D. A. (2001). The emotional intelligence of managers: Assessing the construct validity of a nonverbal measure of"people skills." Journal of Business and Psychology, 16(1), 21-33. http://dx.doi.org/10.1023/A:1007831603825 Morency, L.-P., Whitehill, J., & Movellan, J. (2008). Generalized adaptive view-based appearance model: Integrated framework for monocular head pose estimation. In Proceedings of the 8th International Conference on Automatic Face f3 Gesture Recog nition (pp. 1-8). http://dx.doi.org/10.1109/AFGR.2008.4813429 112 Narayanan, S., & Georgiou, P. G. (2013). Behavioral signal processing: Deriving human behavioral informatics from speech and language. Proceedings of the IEEE, 101(5), 1203-1233. http://dx.doi.org/10.1109 /JPROC.2012.2236291 Nguyen, L. S., Frauendorfer, D., Mast, M. S., & Gatica-Perez, D. (2014). Hire me: Computational inference of hirability in employment interviews based on nonverbal behavior. IEEE Transactions on Multimedia, 16(4), 1018-1031. http:/ /dx.doi.org/10.1109 /TMM.2014.2307169 Niemeier, S. (1997). Nonverbal expressions of emotions in a business negotiation. In S. Niemeier & R. Dirven (Eds.), The Language of Emotions (pp. 277-306). Philadephia, PA: John Benjamins Publishing Company. Nouri, E., Park, S., Scherer, S., Gratch, J., Carnevale, P., Morency, L.-P., & Traum, D. (2013). Prediction of strategy and outcome as negotiation unfolds by using basic verbal and behavioral features. In Proceedings of the 14th Annual Conference of the International Speech Communication Association (pp. 1458-1461). Retrieved from http:/ /ict.usc.edu/pubs/Prediction Novotney, S., & Callison-Burch, C. (2010). Cheap, fast and good enough: Automatic speech recognition with non-expert transcription. In Proceedings of the HLT '10 Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (pp. 207-215). Retrieved from http://dl.acm.org/citation.cfrn?id=l858023 Nowak, S., & Ruger, S. (2010). 
How reliable are annotations via crowdsourcing: A study about inter-annotator agreement for multi-label image annotation. In Proceedings of the International Conference on Multimedia Information Retrieval (pp. 557-566). http:/ /dx.doi.org/10.1145 /17 43384.17 434 78 OKAO Vision [Computer software]. OMRON Global. O'Keefe, D. J. (2002). Persuasion: Theory and research {2nd ed.). Thousand Oaks, CA: Sage Publications. O'Keefe, D. J., & Jensen, J. D. (2007). The relative persuasiveness of gain-framed loss-framed messages for encouraging disease prevention behaviors: A meta-analytic review. Journal of Health Communication, 12(7), 623-644. http:/ /dx.doi.org/10.1080 /10810730701615198 Ong, L. M., de Haes, J. C., Hoos, A. M., & Lammes, F. B. (1995). Doctor-patient communication: A review of the literature. Social Science f3 Medicine, 40(7), 903- 918. http:/ /dx.doi.org/10.1016 /0277-9536(94)00155-M 113 Pantie, M., Pentland, A., Nijholt, A., & Huang, T. S. (2006). Human com puting and machine understanding of human behavior: A survey. In Proceed ings in the 8th International Conference on Multimodal Interfaces (pp. 239-248). http:/ /dx.doi.org/10.1145 /1180995.1181044 Park, S., Gratch, J., & Morency, L.-P. (2012). I already know your answer: Using nonverbal behaviors to predict immediate outcomes in a dyadic negotiation. In Pro ceedings of the 14th ACM International Conference on Multimodal Interaction (pp. 19-22). http:/ /dx.doi.org/10.1145 /2388676.2388682 Park, S., Mohammadi, G., Artstein, R., & Morency, L.-P. (2012). Crowdsourcing micro level multimedia annotations: The challenge of evaluation and interface. In Proceed ings of the ACM Multimedia 2012 Workshop on Crowdsourcing for Multimedia (pp. 29-34). http:/ /dx.doi.org/10.1145 /2390803.2390816 Park, S., Scherer, S., Gratch, J., Carnevale, P., & Morency, L.-P. (2013). Mutual behaviors during dyadic negotiation: Automatic prediction of respondent reactions. In Proceedings of the 2013 Humaine Association on Affective Computing and Intelligent Interaction (pp. 423-428). http://dx.doi.org/10.1109/ACII.2013.76 Park, S., Scherer, S., Gratch, J., Carnevale, P., & Morency, L.-P. (2015). I can already guess your answer: Predicting respondent reactions during dyadic negotiation. IEEE Transactions on Affective Computing, 6(2), 86-96. http:/ /dx.doi.org/10.1109 /TAFFC.2015.2396079 Park, S., Shim, H. S., Chatterjee, M., Sagae, K., & Morency, L.-P. (2014). Compu tational analysis of persuasiveness in social multimedia: A novel dataset and multi modal prediction approach. In Proceedings of the 16th ACM International Conference on Multimodal Interaction (pp. 50-57). http://dx.doi.org/10.1145/2663204.2663260 Park, S., Shoemark, P., & Morency, L.-P. (2014). Toward crowdsourcing micro-level behavior annotations: The challenges of interface, training, and generalization. In Proceedings of the 19th International Conference on Intelligent User Interfaces (pp. 37-46). http:/ /dx.doi.org/10.1145 /2557500.255 7512 Pearce, W. B., & Brommel, B. J. (1972). Vocalic commun1ca- tion m persuasion. Quarterly Journal of Speech, 58(3), 298-306. http:/ /dx.doi.org/10.1080 /00335637209383126 Perloff, R. M. (2010). The dynamics of persuasion: Communication and attitudes in the 21st century (4th ed.). New York, NY: Routledge. Petty, R. E., & Cacioppo, J. T. (Eds.). (1986). Communication and persuasion: Central and peripheral routes to attitude change. New York, NY: Springer-Verlag New York. Picard, R. W. (2000). Affective computing. Cambridge, MA: MIT Press. 114 Pittam, J. (1990). 
The relationship between perceived persuasiveness of nasality and source characteristics for Australian and American listeners. Journal of Social Psy chology, 130(1), 81-87. http://dx.doi.org/10.1080/00224545.1990.9922937 Poppe, R. (2010). A survey on vision-based human action recognition. Image and Vision Computing, 28(6), 976-990. http:/ /dx.doi.org/10.1016 /j .imavis.2009.11.014 Pornpitakpan, C. (2004). The persuasiveness of source credibility: A critical review of five decades' evidence. Journal of Applied Social Psychology, 34(2), 243-281. http:/ /dx.doi.org/10.1111 /j .1559-1816.2004.tb0254 7.x Pruitt, D. G. (2012). A history of social conflict and negotiation research. In A. Kruglanski & W. Stroebe (Eds.), Handbook of the History of Social Psychology (pp. 431-452). New York, NY: Psychology Press. Quinn, A. J., & Bederson, B. B. (2011). Human computation: A survey and taxonomy of a growing field. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 1403-1412). http:/ /dx.doi.org/10.1145/1978942.1979148 Rammstedt, B., & John, 0. P. (2007). Measuring personality in one minute or less: A 10-item short version of the Big Five Inventory in English and German. Journal of Research in Personality, 41(1), 203-212. http://dx.doi.org/10.1016/j.jrp.2006.02.001 Rashtchian, C., Young, P., Hodosh, M., & Hockenmaier, J. (2010). Collecting image annotations using Amazon's Mechanical Turk. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk (pp. 139-147). Retrieved from http://dl.acm.org/citation.cfm?id=l866717 Riek, L. D., O'Connor, M. F., & Robinson, P. (2011). Guess what? A game for affective annotation of video using crowd sourcing. In Proceedings of the 4th International Con ference on Affective Computing and Intelligent Interaction (pp. 277-285). Retrieved from http://dl.acm.org/citation.cfm?id=2062813 Rosenfeld, H. M. (1966). Approval-seeking and approval-inducing functions of verbal and nonverbal responses in the dyad. Journal of Personality and Social Psychology, 4(6), 597-605. http:/ /dx.doi.org/10.1037 /h0023996 Rosenfeld, R. (2000). Two decades of statistical language modeling: Where do we go from here? Proceedings of the IEEE, 88(8), 1270-1278. http:/ /dx.doi.org/10.1109 /5.880083 Ross, J., Irani, L., Silberman, M. S., Zaldivar, A., & Tomlinson, B. (2010). Who are the crowdworkers?: Shifting demographics in Mechanical Turk. In Proceedings of the CHI '10 Extended Abstracts on Human Factors in Computing Systems (pp. 2863-2872). http:/ /dx.doi.org/10.1145 /1753846.1753873 115 Scherer, S., Hammal, Z., Yang, Y., Morency, L.-P., & Cohn, J. F. (2014). Dyadic behavior analysis in depression severity assessment interviews. In Proceedings of the 16th International Conference on Multimodal Interaction (pp. 112-119). http:/ /dx.doi.org/10.1145 /2663204.2663238 Scherer, S., Kane, J ., Gobi, C., & Schwenker, F. (2013). Investigating fuzzy-input fuzzy-output support vector machines for robust voice quality classification. Computer Speech f3 Language, 27(1), 263-287. http://dx.doi.org/10.1016/j.csl.2012.06.001 Scherer, S., Weibel, N., Morency, L.-P., & Oviatt, S. (2012). Multi- modal prediction of expertise and leadership in learning groups. In Pro- ceedings of the 1st International Workshop on Multimodal Leaming Analytics. http:/ /dx.doi.org/10.1145 /2389268.2389269 Schuller, B., Steidl, S., Batliner, A., Schiel, F., & Krajewski, J. (2011). The Inter speech 2011 speaker state challenge. 
In Proceedings of the 12th Annual Conference of the International Speech Communication Association (pp. 3201-3204). Retrieved from http:/ /mediatum.ub.tum.de/doc/1107300 /1107300.pdf Sheng, V. S., Provost, F., & lpeirotis, P. G. (2008). Get another label? Improving data quality and data mining using multiple, noisy labelers. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 614-622). http://dx.doi.org/10.1145/1401890.1401965 Shim, H. S., Park, S., Chatterjee, M., Scherer, S., Sagae, K., & Morency, L.-P. (2015). Acoustic and para-verbal indicators of persuasiveness in social multimedia. In Pro ceedings of the 2015 IEEE International Conference on Acoustics, Speech, and Signal Processing (pp. 2239-2243). http:/ /dx.doi.org/10.1109 /ICASSP.2015.7178369 Skinner, E. A., & Belmont, M. J. (1993). Motivation in the classroom: Reciprocal effects of teacher behavior and student engagement across the school year. Journal of Edu cational Psychology, 85(4), 571-581. http://dx.doi.org/10.1037 /0022-0663.85.4.571 Snow, R., O'Connor, B., Jurafsky, D. & Ng, A. (2008). Cheap and fast but is it good?: Evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 254- 263). Retrieved from http://dl.acm.org/citation.cfm?id=l613751 Spiro, I., Taylor, G., Williams, G., & Bregler, C. (2010). Hands by hand: Crowd sourced motion tracking for gesture annotation. 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (pp. 17-24). http:/ /dx.doi.org/10.1109 /CVPRW.2010.5543191 Stern, S. E., Mullennix, J. W., & Wilson, S. J. (2002). Effects of perceived disability on persuasiveness of computer-synthesized speech. Journal of Applied Psychology, 87(2), 411-417. http://dx.doi.org/10.1037 /0021-9010.87.2.411 116 Sundaram, D. S., & Webster, C. (2000). The role of nonverbal communi- cation in service encounters. Journal of Services Marketing, 14(5), 378-391. http://dx.doi.org/10.1108 /08876040010341008 Tickle-Degnen, L., & Rosenthal, R. (1990). and its nonverbal correlates. Psychological http:/ /dx.doi.org/10.1207 /s15327965pli0104_1 The nature of Inquiry, 1 ( 4 ), rapport 285-293. Van Kleef, G.,A., De Dreu, C. K., & Manstead, A. S. (2004). The interpersonal effects of emotions in negotiations: A motivated information processing approach. Journal of Personality and Social Psychology, 87(4), 510-528. http://dx.doi.org/10.1037 /0022- 3514.87.4.510 Vaughan, B. (2011). Prosodic synchrony in co-operative task-based dialogues: A mea sure of agreement and disagreement. In Proceedings of the 12th Annual Conference of the International Speech Communication Association (pp. 1865-1868). Retrieved from http:/ /arrow.dit.ie/dmccon/113/ Vinciarelli, A., Pantie, M., & Bourlard, H. (2009). Social signal processing: Sur vey of an emerging domain. Image and Vision Computing, 27(12), 1743-1759. http:/ /dx.doi.org/10.1016 /j .imavis.2008.11.007 Vondrick, C., Ramanan, D., & Patterson, D. (2010). Efficiently scaling up video annotation with crowdsourced marketplaces. In Proceedings of the 11th European Conference on Computer Vision (pp. 610-623). Retrieved from http:/ /dl.acm.org/citation.cfm?id=l888136 Voss, J. (2005). The science of persuasion: An exploration of ad- vocacy and the science behind the art of persuas10n m the court room. Law and Psychology Review, 29, 301. 
Retrieved from http:/ /heinonline.org/HOL/LandingPage?handle=hein.journals/lpsyr29&div=16&id=&page= Worchel, S., Andreoli, V., & Eason, J. (1975). Is the medium the message? A study of the effects of media, communicator, and message characteristics on attitude change. Journal of Applied Social Psychology, 5(2), 157-172. http:/ /dx.doi.org/10.1111/j.1559- 1816.1975.tb01305.x Wu, S.-Y., Thawonmas, R., & Chen, K.-T. (2011). Video summarization via crowd sourcing. In Proceedings of the CHI '11 Extended Abstracts on Human Factors in Computing Systems (pp. 1531-1536). http:/ /dx.doi.org/10.1145/1979742.1979803 Yang, Y., & Pedersen, J. 0. (1997). A comparative study on feature selection in text categorization. In Proceedings of the 14th Interna tional Conference on Machine Learning (pp. 412-420). Retrieved from http://citeseerx.ist.psu.edu/viewdoc/ download? doi= 10.1.1.32. 9956&rep=rep l&type=pdf 117 Young, J., Martell, C., Anand, P., Ortiz, P., & Gilbert, H. T. (2011). A microtext corpus for persuasion detection in dialog. In Proceedings of the AAAI-11 Workshop on Analyzing Microtext (pp. 80-85). Retrieved from http:/ /www.aaai.org/ocs/index.php /WS / AAAIWll/paper /view /3896 Yuen, M.-C., King, I., & Leung, K.-S. (2011). A survey of crowdsourcing systems. In 2011 IEEE 3rd International Conference on Social Computing (pp. 766-773). http:/ /dx.doi.org/10.1109 /PASS AT /SocialCom.2011.203 Zeng, Z., Pantie, M., Roisman, G. I., & Huang, T. S. (2008). A sur- vey of affect recognition methods: Audio, visual, and spontaneous expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(1), 39-58. http:/ /dx.doi.org/10.1109 /TPAMI.2008.52 118
Abstract
Understanding human communication more deeply and modeling it computationally has substantial implications for our lives, given its potential synergy with ever-advancing technologies, and it is an important step toward technology that people accept as genuinely intelligent. Human communication, however, is a complicated phenomenon that can require in-depth multimodal analysis of behavior across the verbal, vocal, and visual channels, and the challenge of multimodality is further complicated by the fact that many behavioral cues are subtle and ambiguous.

The work described in this thesis centers on computational modeling of human behavior, approached largely from affective and social perspectives. It explores behavior analysis and modeling in two important contexts of human communication: face-to-face interaction and online telemediated interaction. First, the thesis examines face-to-face dyadic negotiation to better understand and model the interpersonal dynamics that occur during close negotiation interaction. Second, it examines online persuasion to gain a deeper understanding of persuasive behavior and to explore computational models of it built from online social multimedia content.

In studying these two contexts of face-to-face negotiation and online persuasion, the thesis addresses four significant research challenges: large-scale annotations, behavior representations, temporal modeling, and multimodal fusion. First, it addresses the challenge of obtaining annotations of human behavior on a large scale, which provide the basis on which computational models can be built. Second, it addresses the challenge of constructing computational representations of multimodal human behavior, both for individual behavior and for interpersonal behavior that captures the dynamics of face-to-face interaction. Third, it addresses the challenge of modeling human behavior over time, specifically for real-time analysis and prediction. Finally, it explores multimodal fusion techniques for building computational models of human behavior.
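As a purely illustrative aside (not code or data from the thesis), the contrast between two common multimodal fusion strategies mentioned above can be sketched in a few lines of Python; the feature names, dimensions, and scores below are hypothetical placeholders.

    # Illustrative sketch only; not from the thesis. Hypothetical per-modality
    # features for one video, used to contrast two common fusion strategies.
    import numpy as np

    acoustic = np.array([0.42, 0.13, 0.77])    # e.g., pitch, energy, speaking rate
    visual   = np.array([0.05, 0.91])          # e.g., smile intensity, gaze ratio
    verbal   = np.array([1.0, 0.0, 0.0, 1.0])  # e.g., word-category indicators

    # Early fusion: concatenate modality features into one joint vector,
    # then train a single model on that representation.
    early_fused = np.concatenate([acoustic, visual, verbal])  # shape (9,)

    # Late fusion: score each modality separately (dummy scores standing in
    # for per-modality classifier outputs), then combine the decisions.
    modality_scores = {"acoustic": 0.6, "visual": 0.8, "verbal": 0.4}
    late_fused_score = float(np.mean(list(modality_scores.values())))

    print(early_fused.shape, late_fused_score)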
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Towards social virtual listeners: computational models of human nonverbal behaviors
Parasocial consensus sampling: modeling human nonverbal behaviors from multiple perspectives
Computational models for multidimensional annotations of affect
Multimodal representation learning of affective behavior
Improving modeling of human experience and behavior: methodologies for enhancing the quality of human-produced data and annotations of subjective constructs
Automated negotiation with humans
Generative foundation model assisted privacy-enhancing computing in human-centered machine intelligence
Improving language understanding and summarization by leveraging auxiliary information through self-supervised or unsupervised learning
Behavioral signal processing: computational approaches for modeling and quantifying interaction dynamics in dyadic human interactions
Multimodality, context and continuous dynamics for recognition and analysis of emotional states, and applications in healthcare
Heterogeneous graphs versus multimodal content: modeling, mining, and analysis of social network data
Decoding situational perspective: incorporating contextual influences into facial expression perception modeling
Situated proxemics and multimodal communication: space, speech, and gesture in human-robot interaction
Computational modeling of behavioral attributes in conversational dyadic interactions
Human adversaries in security games: integrating models of bounded rationality and fast algorithms
Computational methods for modeling nonverbal communication in human interaction
Modeling dyadic synchrony with heterogeneous data: validation in infant-mother and infant-robot interactions
Modeling expert assessment of empathy through multimodal signal cues
A framework for research in human-agent negotiation
Automatic quantification and prediction of human subjective judgments in behavioral signal processing
Asset Metadata
Creator
Park, Sunghyun (author)
Core Title
Computational modeling of human behavior in negotiation and persuasion: the challenges of micro-level behavior annotations and multimodal modeling
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science
Publication Date
03/01/2016
Defense Date
01/08/2016
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
behavior annotations, behavior models, computational models, crowdsourcing, multimodal behavior, multimodal fusion, multimodal models, negotiation, OAI-PMH Harvest, persuasion, persuasiveness, video annotations
Format
application/pdf (imt)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Morency, Louis-Philippe (committee chair), Gratch, Jonathan (committee member), Nakano, Aiichiro (committee member), Narayanan, Shrikanth (committee member)
Creator Email
sunghyup@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c40-217298
Unique identifier
UC11277326
Identifier
etd-ParkSunghy-4172.pdf (filename), usctheses-c40-217298 (legacy record id)
Legacy Identifier
etd-ParkSunghy-4172.pdf
Dmrecord
217298
Document Type
Dissertation
Rights
Park, Sunghyun
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA