Learning Multi-Annotator Subjective Label Embeddings

by

Karel Bogomir Mundnich Batic

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Electrical Engineering)

August 2021

Copyright 2021 Karel Bogomir Mundnich Batic

To my mother, who has always encouraged me to be the best version of myself.

Acknowledgements

First and foremost, I would like to thank Shri for taking the time to listen to my ideas while I was still applying to PhD programs, and later accepting me to be part of SAIL. Thank you for your trust in me and your guidance. You enabled growth and learning opportunities that will always be with me.

At SAIL, I found many valuable friends with whom I shared classwork and screening preparations, as well as laughs, conference trips, and a good dose of sarcasm (at times). Thank you Doğan and Pavlos for all the conversations about life and your willingness to share your experiences. Thank you Krishna for showing me in countless opportunities the power of kindness and compassion. Thank you Theodora for always being there, and for the shared love for sweets. Thank you Tiantian, Amrutha, and Brandon for pushing through the challenging times in the best possible way. Tim, you're an amazing hugger.

To my Chilean friends in LA, thank you Rodrigo and Eduardo for being there and bringing that feeling of home to LA. Thank you George and Valentina for being amazing friends. You will all be missed.

To my professors back in Chile, thank you Jorge and Marcos. Your love and passion for education, as well as your guidance throughout the years, are among the things that allowed me to successfully complete this process. I will always be grateful for your support.

To my Chilean friends in other time zones (Alfonso, César, Paola), thank you for the endless phone conversations and for sharing this process called growing up. I know you will always be there.

To my California family, Kate, Grant, Tristan, and Liam, I will always be grateful for your openness, kindness, invitations, and apparently infinite supply of food (even through mail! Thank you Kate). You have certainly made me feel at home away from home.

I'd finally like to thank my mother, Jasna, and Isabel for their unconditional love and support. Thank you mom for fully and completely trusting all of my decisions, and encouraging me all those times when I needed it.

Contents

Acknowledgements
List of Tables
List of Figures
Abstract

1 Introduction
  1.1 Literature Review
    1.1.1 Combining Multiple Annotations from Absolute Ratings Acquired in Real-Time
    1.1.2 Combining Multiple Annotations Beyond Ratings in Time
    1.1.3 Finding Subjective Labels with no Time Indexing
  1.2 Contributions
  1.3 Organization

2 Background
  2.1 Introduction
  2.2 Definitions
  2.3 Ordinal Embeddings
    2.3.1 Estimating an Embedding from Noisy Triplet Comparisons
  2.4 Algorithms to Estimate Ordinal Embeddings
    2.4.1 General Case
    2.4.2 Other Methods Proposed in the Literature

I Learning Fused Annotations

3 Fusing Annotations in Time
  3.1 Introduction
  3.2 Algorithm: Majority Voting Triplet Embeddings
  3.3 Experiments
    3.3.1 Sanity Check: Green Intensity Videos
    3.3.2 AVEC: RECOLA dataset
  3.4 Results and Analysis
    3.4.1 Green Videos
    3.4.2 AVEC: RECOLA dataset
  3.5 Discussion
  3.6 Conclusions

4 Combining Likert-scale Ratings through Rank Aggregation
  4.1 Introduction
  4.2 Algorithm: Rank Aggregation by Adapting Copeland's Method
    4.2.1 Modified Copeland's Counting Method
    4.2.2 Rankings to Ratings
    4.2.3 Assessment of the Algorithm
  4.3 Results and Analysis
    4.3.1 Stability from data corruption
    4.3.2 Leave-one-out average agreement
    4.3.3 Leave-one-out ensemble agreement
  4.4 Discussion
  4.5 Conclusions

5 Combining Likert-scale Ratings through Triplet Comparisons
  5.1 Introduction
    5.1.1 Algorithm: Combining Likert-scale Ratings through Triplet Comparisons
    5.1.2 Assessment of the Algorithm
  5.2 Results
    5.2.1 Perturbation Analysis
    5.2.2 Learning the Labels
  5.3 Analysis
    5.3.1 Stability Analysis
    5.3.2 Learning the Labels
  5.4 Conclusions

II Re-Thinking Annotations and their Evaluation

6 A New Sampling Approach
  6.1 Introduction
  6.2 Annotating Triplets with Multiple Annotators
    6.2.1 Annotation fusion
    6.2.2 Triplet violations and annotation agreements
    6.2.3 Unbiased Triplet Embedding
  6.3 Experiments
    6.3.1 Synthetic triplet annotations
    6.3.2 Mechanical Turk triplet annotations
    6.3.3 Error measure
    6.3.4 Comparison to other methods
  6.4 Results and Analysis
    6.4.1 Synthetic annotations
    6.4.2 Mechanical Turk triplet annotations
  6.5 Discussion
  6.6 Conclusions

7 Conclusions
  7.1 Future Work
  7.2 Code Availability

Bibliography

List of Tables

3.1 Agreement measures between different fusion methods and the objective truth signal in two tasks from the color intensity data set [1].
3.2 Challenge results for arousal and valence. Only the best values of CCC are displayed for each approach. We do not include results for time alignment based on GTW, since it systematically performed worse than DTW.
4.1 Inter-annotator agreement for all behavioral codes. α: Krippendorff's α for the original annotations; mean α, Copeland α: average Krippendorff's α computed by leaving one annotator out and replacing it with the n − 1 annotators' ensemble († indicates statistically significantly higher than mean α, p < 0.05).
4.2 Krippendorff's α for ensemble matrices E designed based on mean and Copeland ensembles in a leave-one-annotator-out setup (Equation 4.4).
5.1 MSE results for a fixed SVR model trained using labels from different annotation fusion approaches (scaling method in parentheses). Results are over 5-fold cross-validation.
6.1 MSE and Pearson's correlation for the proposed method and state-of-the-art continuous-time fusion techniques against ground truth. For our method, the percentage is with respect to the total number of triplets.
6.2 Triplet violations v for the Mechanical Turk experiment. Percentages correspond to the percentage of total triplets observed. We include the fraction of triplet violations as computed by the labels generated with EvalDep. The results are consistent with those presented in Table 6.1.

List of Figures
1.1 Problem description. The annotation task requires an interpretation of the observations, followed by the annotation of this interpretation of the target construct.
1.2 A closeup snapshot of the user interface at different times during an annotation task of green color intensity. Annotators adjusted the slider in sync with changes in the green video [1]. UI developed by Brandon Booth.
1.3 Annotation experiment with several annotators and a known ground truth (thick black line). Note the artifacts from the annotation process.
2.1 Two different link functions f for the decision model described in Eq. 2.9.
3.1 AVEC GES 2018. The GES sub-challenge consisted in creating an algorithm to combine different annotations obtained in real-time such that, when used to train a regression model, they maximize the concordance correlation coefficient (CCC) between the labels and the predictions in a held-out set.
3.2 Frames of a subset of training videos of the RECOLA [2] data set. Audio and video were recorded, while physiological signals (ECG, EDA) were collected during 5-minute-long one-on-one conversations. Only one of the subjects in each conversation was recorded.
3.3 Algorithm: we can find a 1-dimensional embedding where each item is a value of a signal in time.
3.4 Example of labels of valence from the RECOLA dataset. In our approach, we use time triplets (i, j, k) to decide whether $\{D^a_{ij} \lessgtr D^a_{ik}\}_{a \in \mathcal{A}}$, where $D^a_{ij} = |Y^a_i - Y^a_j|$.
3.5 Agreement between annotators for two example subjects from the RECOLA emotion dataset in two different annotation tasks: arousal (left) and valence (right). Agreement is measured using CCC. As the shared scale shows, the overall agreement for valence is higher than the overall agreement for arousal.
3.6 Plots for task A and task B from the color intensity annotation dataset. The true color intensity signal is shown (black) alongside the unweighted average of individual annotations (purple) and the label produced using an unweighted version of our majority vote triplet embedding algorithm (cyan). This shows that the proposed method produces labels that are sensible and qualitatively similar to the average signal.
3.7 Plots showing the percentages of triplet violations after tSTE convergence. The bottom row in the legend shows the use of Algorithm 1 without previous time alignment.
3.8 Comparison between the ordinal embedding approach and the baseline (simple average). Note the regions of difference and the high-frequency components of the annotated signal.
4.1 (Top row) Data corruption stability analysis. Each rating has been corrupted by noise, where p represents the probability of the original rating being increased/decreased by 1. The plotted values are the inter-rater agreement between the original ratings' ensemble and the corrupted ratings' ensemble, averaged over 10 iterations. The error bars correspond to one standard deviation.
Note: no noise corruption implies a Krippendorff's α of 1 for every behavioral code. (Bottom row) Rating distributions for each annotated behavioral code. The y-axis represents relative counting frequency, while the x-axis represents behavioral code ratings (scale is 1 to 9). Histogram colors refer to behavioral codes as defined in the legend of Figure 4.1. Legend: acceptance of the other, blame, responsibility for self, solicits perspective, states external origins, discussion, clearly defines problem, offers solutions, negotiates, makes agreements, pressures for change, withdraws, avoidance.
5.1 Data corruption stability analysis. Each rating has been corrupted by noise, where p represents the probability of the original rating being increased/decreased by 1. The plotted values are the inter-rater agreement between the original ratings' ensemble and the corrupted ratings' ensemble, averaged over 10 iterations. The error bars correspond to one standard deviation. Note: no noise corruption implies a Krippendorff's α of 1 for every behavioral code. Legend: acceptance of the other, blame, responsibility for self, solicits perspective, states external origins, discussion, clearly defines problem, offers solutions, negotiates, makes agreements, pressures for change, withdraws, avoidance.
5.2 Data corruption stability analysis. Each rating has been corrupted by noise, where p represents the probability of the original rating being increased/decreased by 1. The plotted values are the inter-rater agreement between the original ratings' ensemble and the corrupted ratings' ensemble, averaged over 10 iterations. The error bars correspond to one standard deviation. Note: no noise corruption implies a Krippendorff's α of 1 for every code. Legend: C1, O, A, C2, H.
5.3 Rating distributions for each annotated COACH code. The y-axis represents relative counting frequency, while the x-axis represents behavioral code ratings (scale is 1 to 9). Histogram colors refer to behavioral codes as defined in the legend of Figure 4.1.
5.4 Fusion results for mean and triplet embedding approaches.
5.5 Annotation matrices for both the Couples' Therapy behavioral codes and the COACH rating system. The purple color reflects missing entries in these matrices, which are absent annotations. The scales for both rating systems are quantized.
6.1 Question design for queries in Mechanical Turk.
6.2 MSE as a function of the percentage of observed triplets |S|/|T| × 100 with constant and logistic noise in triplet labels. Each point in the plots represents the mean over 30 random trials, while the shaded areas represent one standard deviation from the average MSE values.
6.3 Probabilities of success $\hat{f}_{I^a}(|D^*_{ij} - D^*_{ik}|)$ as a function of the distance from the reference i to frames j and k. Only the top annotators have been included.
6.4 Results for Mechanical Turk annotations. The computed embeddings have been scaled to fit the true labels Z (Eq. 6.17). The embedding in task A uses 0.5% (47,052) of all possible triplet comparisons |T|. The embedding in task B uses 0.5% (13,862) of all possible triplet comparisons |T|. In both tasks, the estimated green intensity is sometimes less than zero due to scaling.
Abstract

In the setting of supervised learning, labels are used as examples to train models. When the objective is to learn a subjective construct (such as dimensions of affect), the labeling process oftentimes includes multiple annotators conveying their opinion about the construct for each sample in the form of annotations, which are later processed to obtain labels. This work explores the use of ordinal comparisons both to find consensus from annotations and to generate labels of subjective constructs.

In the first part, we study how to find consensus from annotations collected as absolute or Likert-scale ratings, over time or at session level. In this setting, we propose to look at the problem of generating labels as finding a 1-dimensional embedding that faithfully represents the latent construct we want to learn. The difficulty of the problem lies in how to process these annotations, which are usually noisy due to subjectivity, and because the process of annotation can lead to errors. Assuming that people internally use ordinal comparisons to annotate even in the case of absolute ratings, we collect triplets and set up an optimization problem to estimate the underlying embedding. Since the ground truth or latent construct is unknown to us, we evaluate the methods by training a set of supervised models in two different scenarios, affect (arousal, valence) and behavioral codes, yielding models with lower validation or test errors for valence and the behavioral codes.

In the second part, we reconsider two aspects of the previous process: (1) the annotation process, and (2) the evaluation of the methods. We propose the use of a specifically-designed dataset to evaluate annotations collected in real-time, and further propose using only triplet comparisons to estimate latent constructs that vary over time. The dataset allows us to study the noise and artifacts generated when collecting annotations in real-time, while the novel sampling scheme based on triplet comparisons allows us to better estimate latent signals.

Chapter 1
Introduction

Many problems in human-centered engineering and machine learning require learning target constructs that are subjective. Examples of these problems include medical imaging [3], affective computing [4], and applications in healthcare (for example, automatically understanding behavior from a mental-health perspective [5, 6, 7, 8]).

The main driver behind these applications is supervised learning, where we are interested in finding a mapping f between a feature space $\mathcal{X}$ and a label space $\mathcal{Y}$, such that $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ is a dataset with $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$, for $i \in \{1, \dots, N\}$:

$$f : \mathcal{X} \to \mathcal{Y}, \tag{1.1}$$
$$x \mapsto f(x) = y. \tag{1.2}$$

We want such a function f to generalize well to previously unseen data, assuming that it has the same distribution as $\mathcal{D}$. In many applications of supervised learning, the labels in $\mathcal{Y}$ are objective and easily queried from the data (for example, in computer vision, deciding whether an image contains a cat or a dog is usually easy). However, in the aforementioned human-centered applications, querying labels is not always a straightforward task. For example, if we want to learn a model to estimate the level of affect expressed by an individual, we are required to train the model with labels of affect, which might be subject to interpretation or context. Even more, a ground truth may not even exist.
In practice, we query several annotations from a set of experts or annotators in order to find consensus to describe the variable of interest (i.e., obtain a label to estimate f).¹

Figure 1.1: Problem description. The annotation task requires an interpretation of the observations, followed by the annotation of this interpretation of the target construct.

We usually find the mapping f in Eq. 1.1 by solving an optimization problem, where f is a function parameterized by $\theta$, and we have access to n independent and identically distributed (i.i.d.) data samples $x_i$ with corresponding labels $y_i$:

$$\min_\theta \sum_{i=1}^{n} \ell\left(y_i, f(x_i; \theta)\right), \tag{1.3}$$

where $\ell$ is a loss function chosen according to the task we are interested in and the properties of the problem. These properties are closely related to the label space $\mathcal{Y}$ (the space to which f maps), which depends on the task at hand, and can be categorized mainly into:

- Regression problems, where we are interested in a continuous variable, so $\mathcal{Y} \subseteq \mathbb{R}^n$. The labels may be restricted to a subset of $\mathbb{R}^n$, such as $[0, 1]^n$.
- Classification problems, where we are interested in a discrete number of states or classes that a variable can take. We usually assume that these classes are categorical (i.e., there is no order), such that $\mathcal{Y} = \{0, \dots, c\}$. However, the classes may have an order (i.e., lie in a Likert scale [9]), such that $0 < 1 < \dots < c$.

If the space $\mathcal{Y}$ describes a subjective construct, two main difficulties arise when trying to find a mapping f that generalizes well to unseen data (from a label-acquisition viewpoint). The first is collecting high-quality labels from human annotators or experts, which can be divided into two coupled sub-problems (Figure 1.1):

1. The interpretation of the subjective construct (i.e., the mental model a given annotator has of the construct) [3].
2. The sampling (i.e., how the annotator uses an interface to describe his/her interpretation of the construct) [10]. This is subject to the following sub-problems as well, which produce inconsistencies in the annotated values [11]:
   (a) Random human errors in the annotation process,
   (b) The effect of the sampling mechanism:
       - the role of inertia in annotations when using a joystick or a user interface for a continuous variable,
       - the decision on a given value in a discrete scale (3 vs. 4 stars in a 5-star scale),
   (c) The difficulty of the task,
   (d) Adversarial annotations.

The second difficulty lies in combining the different annotations obtained from different annotators into a label that faithfully describes the latent construct.

The interpretation problem is usually tackled by asking experts in a field to annotate, by training the annotators, or by iterating over the set of instructions given to annotators. As we will see, the sampling problem has been less explored, and most works rely on learning from noisy and biased absolute ratings.

¹ Throughout this manuscript, we will use "annotation" for the input given by an annotator, and "label" for the value used to train a supervised model.
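To make the training objective in Eq. 1.3 concrete, the following minimal Julia sketch fits a linear model by gradient descent on the empirical risk. The linear model $f(x;\theta) = \theta^T x$, the squared loss, and the synthetic data are illustrative assumptions only; the thesis leaves both f and $\ell$ task-dependent.

```julia
using LinearAlgebra, Random

# Fit f(x; θ) = θᵀx by gradient descent on (1/n) Σᵢ ℓ(yᵢ, f(xᵢ; θ))
# with the squared loss ℓ(y, ŷ) = (y - ŷ)².
function fit_linear(X::Matrix, y::Vector; η = 0.05, iters = 500)
    θ = zeros(size(X, 1))
    for _ in 1:iters
        residuals = X' * θ .- y                      # f(xᵢ; θ) - yᵢ for all i
        θ -= η .* (2 / length(y)) .* (X * residuals) # gradient of the empirical risk
    end
    return θ
end

Random.seed!(0)
X = randn(5, 200)                       # columns are samples xᵢ ∈ ℝ⁵
y = X' * randn(5) .+ 0.1 .* randn(200)  # noisy labels yᵢ
θhat = fit_linear(X, y)
```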
To illustrate the importance of the sampling stage and its challenges, we created a regression dataset for which the ground truth labels exist and are known to us [12]. In this dataset, the construct is objective (i.e., not subject to interpretation). The setup is as follows: annotators are shown a user interface (UI) where a video with only green frames is displayed (see Figure 1.2). The intensity of green (the G channel in Red-Green-Blue, or RGB) varies in time, and the annotators are asked to follow the trend of green intensity as accurately as possible by moving a slider. The results of this annotation process are shown in Figure 1.3. In this figure, we observe that annotators overshoot, do not stay constant in constant intervals, and filter the underlying construct as if they were using a low-pass filter.

Figure 1.2: A closeup snapshot of the user interface at different times during an annotation task of green color intensity. Annotators adjusted the slider in sync with changes in the green video [1]. UI developed by Brandon Booth.

Figure 1.3: Annotation experiment with several annotators and a known ground truth (thick black line). Note the artifacts from the annotation process. Task A: https://youtu.be/hOPfInpDD9E; Task B: https://youtu.be/o3o5cUPBPAg.

As we will discuss in the literature review, different works usually tackle a subset of these difficulties or inconsistencies.

1.1 Literature Review

The problem of learning from human annotations of subjective constructs has been studied by different communities within engineering. For example, the affective computing and related communities have widely studied learning subjective constructs in the setting of regression. Other communities related to computer vision and music understanding/retrieval have studied aspects of learning subjective constructs from human annotations in the classification setting. We discuss these works in the following subsections.

1.1.1 Combining Multiple Annotations from Absolute Ratings Acquired in Real-Time

Oftentimes, to learn a construct over time, annotations are acquired in the form of absolute ratings in real-time (see Figure 1.3). These kinds of annotations suffer from the lag introduced by reaction times, as well as from biases introduced by the sampling process. The following subsections describe works that address the problems related to these kinds of annotations.

Aligning annotations and features in time

EvalDep. In terms of reaction-time corrections, [13] studies and models the reaction lag of annotators by finding the maximum mutual information (MMI) between features from the feature space and the lagged annotations:

$$\hat{\tau}_a = \operatorname*{argmax}_{\tau} \; I\left[x; y^a_\tau\right], \tag{1.4}$$

where $y^a$ is an annotation coming from a single annotator. Since Eq. 1.4 is intractable in the general case, the assumption is that x and $y^a$ are jointly Gaussian, which allows for a closed-form solution:

$$I(X; Y) = \frac{1}{2} \log \frac{\det(\Sigma_{xx}) \det(\Sigma_{y^a y^a})}{\det(\Sigma)}, \tag{1.5}$$

$$\Sigma = \begin{bmatrix} \Sigma_{xx} & \Sigma_{x y^a} \\ \Sigma_{y^a x} & \Sigma_{y^a y^a} \end{bmatrix}, \tag{1.6}$$

where $\Sigma_{xx}$ and $\Sigma_{y^a y^a}$ are the covariance matrices of x and $y^a$ respectively, and $\Sigma$ is the joint covariance matrix. They then perform a simple average to combine the lag-corrected annotations, thus creating a unique label. This method estimates a constant reaction time for a given annotation. In their work, they study how pre-processing the annotations by shifting the signal to account for reaction times improves the learnability of the labels.

Dynamic Time Warping (DTW) [14] is another popular time-alignment method to warp the annotations in time with respect to a feature, and it is usually combined with the averaging of signals. In DTW, the lag does not need to be constant, but rather is found for all times. The warping in time is found by minimizing:

$$J_{\text{DTW}}(P) = \sum_{t=1}^{P} \left\| x_{p^x_t} - y^a_{p^{y^a}_t} \right\|, \tag{1.7}$$

where P is a matrix showing the time mapping between x and $y^a$.
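As a minimal sketch of the lag-estimation idea behind Eq. 1.4: under the joint-Gaussianity assumption, the mutual information between a scalar feature and a shifted annotation is a monotone function of their squared correlation, so the MMI lag can be found by scanning candidate shifts and maximizing the absolute correlation. This is a simplified 1-D proxy for EvalDep, not its full implementation.

```julia
using Statistics

# Scan integer lags τ and return the one maximizing |corr(x[t], yᵃ[t + τ])|.
function estimate_lag(x::Vector, ya::Vector; maxlag::Int = 100)
    best_lag, best_corr = 0, -Inf
    for τ in 0:maxlag
        # compare the feature with the annotation delayed by τ samples
        c = abs(cor(x[1:end-τ], ya[1+τ:end]))
        if c > best_corr
            best_corr, best_lag = c, τ
        end
    end
    return best_lag
end
```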
In both of the aforementioned methods, the difficulty lies in finding features that correlate well with the annotations, such that a proper model for the reaction times can be derived.

Canonical Time Warping (CTW) [15] proposes to combine ideas from canonical correlation analysis (CCA) [16] and DTW to time-align time series of human behavior. The idea behind this method is to extend CCA to time series, and to use DTW to perform the time alignment in a weighted fashion. This method is commonly used to align annotations with a single feature. In [17], the authors extend this approach to perform time alignment of a single time series with multiple features at the same time. Moreover, [18] extends this approach to capture non-linear relationships between the different time series.

All of the aforementioned techniques require the use of features that are many times only weakly correlated with the annotations [19].

Combined Time Alignment and Fusion

Dynamic Probabilistic CCA. Motivated by CCA and DTW, [20] proposes Dynamic Probabilistic CCA to discover dependencies between the signals. Moreover, the authors extend their approach to include time alignments, yielding a method that time-aligns and combines multiple annotations by using features.

Long Short-Term Memory networks (LSTMs). Combining reaction-time and other adjustments, [21] proposes the use of Long Short-Term Memory networks [22, 23] to combine asynchronous annotations and features and predict labels.

Modeling annotator distortions

With respect to scale adjustments only, [24] presents a method for modeling multiple annotations over a continuous variable, and computes a single label from the annotations by modeling annotator-specific distortions as filters whose parameters can be estimated jointly using Expectation-Maximization (EM). The proposed model is:

$$\bar{y}^a = g(x; \theta_g) = \begin{bmatrix} x \\ 1 \end{bmatrix}^T \theta_g + \epsilon_s, \tag{1.8}$$

$$y^a = h(\bar{y}^a; \theta_h) = (\theta_a * \bar{y}^a) + b_a \mathbf{1} + \epsilon_a, \tag{1.9}$$

where $\bar{y}^a$ is the latent variable to be estimated and $y^a$ are the distorted observations through a linear filter. Hence, the latent variable is assumed to be a linear transformation of the data plus random (Gaussian) noise, and the observed variable is modeled as a filtered version of the latent variable, plus a bias and Gaussian noise. This model relies on strong assumptions made for mathematical tractability that do not necessarily reflect how annotators behave (for example, the noise induced by human error is not necessarily Gaussian, as can be seen in Figure 1.3).

1.1.2 Combining Multiple Annotations Beyond Ratings in Time

A different set of works considers sampling methodologies different from acquiring ratings in real-time. We discuss these in this section.

Ordinal relations-based methods

The work in [25] proposes the tool RankTrace to annotate emotions in real-time by looking at the ordinal relations annotated between time frames, instead of using absolute ratings. The authors do not perform annotation fusion, since annotators annotated their own experience by watching their own videos, but they do perform time alignment with the features of interest.

In a similar fashion, and intending to exploit the ordinal characteristics of annotations, in [26] we propose a framework based on ordinal embeddings to correct a continuous-time label generated by one of the aforementioned algorithms. This approach warps the scale of the label by selecting specific windows of it in time to collect extra information from human annotators through triplet comparisons.
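A generative sketch of the distortion model in Eqs. 1.8-1.9 above: the observed annotation is a smoothed (filtered) version of the latent signal, plus an annotator-specific gain, bias, and Gaussian noise. The first-order low-pass filter and all parameter values here are illustrative assumptions, not estimates from [24].

```julia
using Random

# Simulate an annotator's distorted observation yᵃ of a latent signal ȳᵃ.
function distort(y_latent::Vector; gain = 0.8, bias = 0.1, α = 0.9, σ = 0.05)
    rng = MersenneTwister(1)
    y_obs = similar(y_latent)
    state = y_latent[1]
    for t in eachindex(y_latent)
        # first-order low-pass filter stands in for annotator inertia
        state = α * state + (1 - α) * y_latent[t]
        y_obs[t] = gain * state + bias + σ * randn(rng)
    end
    return y_obs
end
```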
1.1.3 Finding Subjective Labels with no Time Indexing

Some references in classification problems

One of the first works related to learning from subjective annotations in the case where $\mathcal{Y} = \{0, 1, \dots, c\}$ is [27]. In this work, the authors are interested in detecting the presence of volcanoes (and their classes) in low-quality images. Encoding the presence of some volcano as v, the true label (or target) as y, and the annotations as a, they pose the problem as one of estimating the conditional probability of having a volcano given the annotations:

$$p(v \mid a) = \sum_{y=0}^{c} p(v \mid y)\, p(y \mid a). \tag{1.10}$$

In this scenario, they assign a priori probabilities to features indicating the likelihood of having a given class of volcano, which requires expert knowledge. The estimation is done using the Expectation-Maximization (EM) algorithm.

In [3], the authors are interested in training supervised models for detection in medical images. Their key contributions are: (1) estimating the labels $y_i$ from multiple annotators without knowing a priori how much to trust them, which differs from majority voting because it does not assume equal weights for each annotator; and (2) training a classifier by jointly learning the class labels and the supervised model (that is, treating the labels as parameters to be learned), instead of estimating labels and training models as two independent processes.

Other methods with different ideas have been proposed as well. For example, in [28], a noise and bias model for each annotator is proposed, such that the observed ratings are a noisy and biased version of the underlying rating. By estimating the model parameters, annotators can be taken into a joint annotator space where their ratings are comparable. In [29], an Expectation-Maximization approach is used to find the latent ratings, while in [30], a similar approach is used to jointly model the latent ordinal labels and annotator reliability. All these methods work with absolute ratings and assume we observe noisy and biased representations of the underlying ratings.

Subjective Labels of Multimodal Data

An ordinal embedding approach to learn metrics from multi-modal data was first proposed in [31]. In this work, the authors develop an algorithm to account for noisy triplet comparisons (the notion of noisy annotations in this setting was initially observed by [32] in music applications). In [31], the authors use their proposed algorithm to embed artists based on their (subjective) similarities. In [33], the authors introduce the idea of using ranking information extracted from metric-learning approaches for the comparison of music, applied to recommender systems.
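The marginalization in Eq. 1.10 above is a single matrix-vector product once the conditional tables are known. The probability tables below are made-up placeholders; in [27] they come from expert priors and EM estimation.

```julia
# p(v | y): rows index v ∈ {0, 1}; columns index y ∈ {0, 1, 2}.
p_v_given_y = [0.95 0.30 0.10;
               0.05 0.70 0.90]
# p(y | a) for one observed annotation a.
p_y_given_a = [0.6, 0.3, 0.1]

p_v_given_a = p_v_given_y * p_y_given_a   # Eq. 1.10 as a matrix-vector product
@show p_v_given_a                         # ≈ [0.67, 0.33]
```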
1.2 Contributions

In this thesis, we propose the use of ordinal embeddings to represent labels of subjective constructs for either classification (with ordinal relations) or regression, by sampling the information through triplet comparisons. We argue that collecting these labels is a learning problem in itself (albeit unsupervised), and that the collection of these labels is an inverse problem [34]. As [35] beautifully states, "When solving an inverse problem, we need to understand what is recoverable and what is forever lost in the forward problem." In the estimation of labels for regression and (ordinal) classification of subjective constructs, it seems to be the case that the scale of the labels is forever lost, and we only have access to a scaled version of pairwise distances between sample labels.

We study the performance of new methodologies to generate labels for regression and (ordinal) classification in the cases when:

1. The annotations are already given,
2. We can choose how to sample the annotations.

In both scenarios, we sample the information by asking annotators questions of the form "is the signal for item i more similar to the signal for item j or item k?" to build 1-dimensional embeddings in Euclidean space, where t = (i, j, k) forms a triplet.

We motivate these approaches with four key observations: (1) psychology and machine learning/signal processing studies have shown that people are better at comparing items than at rating them [36, 37, 38, 39], so this sampling mechanism is easier for annotators than requesting absolute ratings in real-time; (2) intuitively, it seems that annotators always annotate by comparing the current item to references, even when the annotation involves giving a rating; (3) the use of triplet embeddings naturally solves the annotation fusion problem, since fusion is done by taking the union of sets (details in Chapter 6); (4) triplet embeddings offer a simple way to verify the agreement of the annotations, given by the number of triplet violations in the computed embedding.

To understand the effectiveness of the proposed approaches, we study our algorithms in the decoupled problem (Figure 1.1). Throughout this work, we empirically show that it is possible to reconstruct the hidden green intensity signal (i.e., recover the metric information between samples) of tasks A and B in Figure 1.3 under different synthetic and real-world noise scenarios in the triplet labeling stage. As we will see in Chapter 6, these reconstructions may be accurate only up to a scaling and bias factor, but they do not suffer from artifacts such as the time lags present in real-time annotations. Moreover, we test our algorithms on real-world datasets, gathering triplet comparisons from annotations sampled in the traditional fashion, and show that we produce labels that are more learnable (i.e., for which we can find a mapping f that lowers the test error). This leads to the following thesis statement:

Thesis Statement. We can leverage the use of ordinal information through triplet comparisons to generate noise-robust and learnable labels for subjective constructs.

1.3 Organization

The organization of this thesis is as follows. Chapter 2 describes the mathematics of ordinal embeddings, which will be needed in the subsequent chapters as the basis for our methods; we perform a literature review of current approaches and include some results found along the way. In Chapter 3, we describe an algorithm to combine annotations for regression acquired in real-time. We test our algorithm on the green intensity experiment, as well as on the RECOLA dataset [2] (the paper that introduced this algorithm was the winner of the Audiovisual Emotion Challenge 2018). These two chapters lead to the ideas presented in Chapter 4, where we explore the use of pairwise ordinal comparisons to combine Likert-scale annotations from multiple annotators. In Chapter 5, we propose a method using triplet comparisons to combine Likert-scale annotations, and compare it to using pairwise comparisons. In Chapter 6, we propose a new sampling scheme to generate labels for regression. In Chapter 7, we conclude.
Chapter 2
Background

2.1 Introduction

In this chapter, we discuss the main results associated with ordinal embeddings, which are the backbone over which we develop new methodologies to generate labels for subjective constructs. We introduce relevant definitions, state the ordinal embedding problem, and present results that will be useful later on in our experiments and analysis.

2.2 Definitions

Consider a collection of n points in m-dimensional Euclidean space. We refer to the matrix $Y \in \mathbb{R}^{m \times n}$ as an embedding, where each column is a vector representing a point in Euclidean space. In this work, we will consider centered embeddings, that is:

$$Y = YV, \tag{2.1}$$

where $V = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$ is a centering matrix, so that:

$$\sum_{i=1}^{n} (YV)_{:,i} = 0. \tag{2.2}$$

Let $\bar{y}$ be the mean of the embedding [35]:

$$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i = \frac{1}{n} Y \mathbf{1}. \tag{2.3}$$

Then,

$$Y - \bar{y}\mathbf{1}^T = Y - \frac{1}{n} Y \mathbf{1}\mathbf{1}^T = Y\left(I - \frac{1}{n}\mathbf{1}\mathbf{1}^T\right) = YV. \tag{2.4}$$

Hence, if Y is centered, Y = YV. We may define the Gram matrix of an embedding:

$$G = Y^T Y, \tag{2.5}$$

whose entries contain all the dot products between points of the embedding. By construction, and because the dot product is symmetric, any Gram matrix G is symmetric and positive semi-definite. The Euclidean Distance Matrix (EDM) D associated with an embedding can be defined entry-wise:

$$D_{ij} = \|y_i - y_j\|_2^2. \tag{2.6}$$

Note that if the embedding is centered, then rank(G) = m, the dimension of the embedding. Moreover, we always have that rank(D) ≤ m + 2.

The definitions of embedding, Gram matrix, and Euclidean distance matrix are now used to introduce the problem of ordinal embedding.

2.3 Ordinal Embeddings

In the problem of ordinal embeddings, we want to find a representation in Euclidean space of n items that have only been compared through ordinal comparisons of their pairwise distances. Formally, let there be n items that we want to represent through points $y_1, \dots, y_n \in \mathbb{R}^d$, respectively, with $[y_1 | \cdots | y_n] = Y \in \mathbb{R}^{d \times n}$. We assume that the n items lie in a metric space, and the true Euclidean distances between them are given by $D^*_{ij} = \|y_i - y_j\|_2^2$. We also assume that we have access to noisy distance comparisons, denoted by $d(y_i, y_j)$. These noisy distances may come from perceptual comparisons, such as those coming from human annotators. We use these noisy distances to examine comparisons of the form:

$$d(y_i, y_j) \lessgtr d(y_i, y_k) \tag{2.7}$$

to find the embedding Y. The comparisons displayed in Eq. 2.7 are called triplet comparisons (one for each (i, j, k)). Let $\mathcal{T}$ be the set of all possible unique triplets for n items:

$$\mathcal{T} = \{(i, j, k) \mid i \neq j < k \neq i,\ 1 \leq i, j, k \leq n\}. \tag{2.8}$$

Note that $|\mathcal{T}| = n\binom{n-1}{2} = O(n^3)$, which may be a very large set.

In the context of recovering an ordinal embedding from noisy observations, let us assume we observe a set of triplets $\mathcal{S} \subseteq \mathcal{T}$, and corresponding realizations of the random variables $w_t$, where $t = (i, j, k) \in \mathcal{S}$, such that:

$$w_t = \begin{cases} -1, & \text{w.p. } f(D^*_{ij} - D^*_{ik}) \\ +1, & \text{w.p. } 1 - f(D^*_{ij} - D^*_{ik}). \end{cases} \tag{2.9}$$

Here, $f : \mathbb{R} \to [0, 1]$ is a function that behaves as a cumulative distribution function [40] (sometimes called a link function [41]), and therefore has the property that $f(-x) = 1 - f(x)$. Hence, the $w_t$'s indicate whether i is closer to j than to k, with a probability depending on the difference of the distances $D^*_{ij} - D^*_{ik}$ (which is a way to measure how difficult the decision is).

Eq. 2.9 describes a class of models that is not new; in fact, these models come from the decision theory literature.
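The following is a small numerical companion to the definitions above and to the decision model in Eq. 2.9, using an arbitrary random embedding. The two CDF-like choices of f implemented here are the logistic and Gaussian-CDF links discussed next (Figure 2.1); SpecialFunctions.jl is assumed for erf.

```julia
using LinearAlgebra
using SpecialFunctions: erf

m, n = 2, 6
Y = randn(m, n)                        # columns yᵢ ∈ ℝᵐ
V = I - ones(n, n) ./ n                # centering matrix (Eq. 2.1)
Yc = Y * V                             # centered embedding: columns sum to zero
G = Yc' * Yc                           # Gram matrix (Eq. 2.5), symmetric PSD
D = [norm(Yc[:, i] - Yc[:, j])^2 for i in 1:n, j in 1:n]  # EDM (Eq. 2.6)

@assert maximum(abs, sum(Yc, dims = 2)) < 1e-10   # Eq. 2.2
@show rank(G)      # = m for a generic centered embedding
@show rank(D)      # ≤ m + 2 (Section 2.2)

# Two common link functions for Eq. 2.9:
btl(x; σ = 1.0)       = 1 / (1 + exp(-σ * x))        # logistic (Bradley-Terry-Luce)
thurstone(x; σ = 1.0) = (1 + erf(x / sqrt(σ))) / 2   # Gaussian CDF (Thurstone)

# Sampling a triplet label w_t for t = (i, j, k) under Eq. 2.9:
sample_wt(f, i, j, k) = rand() < f(D[i, j] - D[i, k]) ? -1 : +1
@show sample_wt(btl, 1, 2, 3)
```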
When f is chosen to be a logistic function, this model is called the Bradley-Terry-Luce model [42, 43], and when f is chosen as a Gaussian CDF, it is known as the Thurstone model [44]. Figure 2.1 shows example functions for these two models.

Figure 2.1: Two different link functions f for the decision model described in Eq. 2.9: the Bradley-Terry-Luce link $1/\left(1 + \exp(-\sigma(D^*_{ij} - D^*_{ik}))\right)$ and the Thurstone link $\frac{1}{2}\left(1 + \operatorname{erf}\left((D^*_{ij} - D^*_{ik})/\sqrt{\sigma}\right)\right)$.

Newer permutation-based models have been proposed to make decisions when learning from people¹ [45]. However, these models are suited to ranking items, where the original model (in which w represents some real number and F is an increasing function):

$$P(i \text{ beats } j) = F(w_i - w_j) \tag{2.10}$$

is replaced by the model:

$$P(i \text{ beats } l) \geq P(j \text{ beats } l), \tag{2.11}$$

which is non-parametric and generalizes the previous models. These have been studied for the problem of ranking and, to the best of our knowledge, 1-dimensional non-parametric ordinal embeddings have not been studied. Using these models for the general m-dimensional ordinal embedding case would become prohibitively expensive, as we would need to compare quadruplets of items instead.

¹ Term coined by Nihar B. Shah in his doctoral thesis.

2.3.1 Estimating an Embedding from Noisy Triplet Comparisons

The problem of learning an ordinal embedding from noisy data is described by [40] as follows:

Problem 2.3.1 (Ordinal Embedding from Noisy Data). Consider n points in $\mathbb{R}^m$. Let $\mathcal{S}$ denote the set of observed triplets, and for each $t = (i, j, k) \in \mathcal{S}$ we observe independent random variables $w_t$ whose distributions are given by a corresponding known link function $f : \mathbb{R} \to [0, 1]$. Estimate Y from $\mathcal{S}$, $\{w_t\}$, and f.

Note that Problem 2.3.1 assumes that the noise model f is known (see Remark 2.3.3), and that $D^*$ exists. Moreover, the next assumption is important:

Assumption 2.3.1 (Sampling of triplets). The triplets in $\mathcal{S}$ have been sampled uniformly with replacement.

Under Assumption 2.3.1, the problem of estimating the embedding can be solved within the framework of Empirical Risk Minimization (ERM). Using the Gram matrix G of Y, the risk can be written as:

$$R(G) = \mathbb{E}\left[\ell\left(w_t \langle L_t, G \rangle_F\right)\right], \tag{2.12}$$

where $\ell$ is a margin-based (classification) loss function and $L_t$, for $t = (i, j, k)$, is defined by the sub-matrix

$$L_t = \begin{bmatrix} & i & j & k \\ i & 0 & -1 & 1 \\ j & -1 & 1 & 0 \\ k & 1 & 0 & -1 \end{bmatrix} \tag{2.13}$$

with zeros everywhere else, so that the Frobenius inner product between the matrices is:

$$\langle L_t, G \rangle_F = G_{jj} - 2G_{ij} - G_{kk} + 2G_{ik} \tag{2.14}$$
$$= (Y^TY)_{jj} - 2(Y^TY)_{ij} - (Y^TY)_{kk} + 2(Y^TY)_{ik} \tag{2.15}$$
$$= y_j^T y_j - 2 y_i^T y_j - y_k^T y_k + 2 y_i^T y_k \tag{2.16}$$
$$= \|y_i - y_j\|_2^2 - \|y_i - y_k\|_2^2 \tag{2.17}$$

(and therefore $w_t$ contributes only a sign). Note that this is the idea behind the triplet loss, first introduced in [46] as a regularization term and later used as an objective in [47], where $\ell$ was chosen to be the hinge loss. We can estimate R(G) (and hence G) by using the empirical risk:

$$\hat{R}_{\mathcal{S}}(G) = \frac{1}{|\mathcal{S}|} \sum_{t \in \mathcal{S}} \ell\left(w_t \langle L_t, G \rangle_F\right). \tag{2.18}$$

There are a few observations to make under this setup:

Remark 2.3.1. The construction of the risk in Eq. 2.12 makes the loss function be evaluated on a linear function of G, allowing easier manipulation of the variable being optimized. Moreover, if $\ell$ is a convex function, then the objective function in Eq. 2.18 is convex with respect to G [48], although it is generally non-convex with respect to Y.
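A numerical check of Eqs. 2.13-2.18, assuming random data: we build $L_t$ for a triplet, verify the identity in Eq. 2.17, and evaluate the empirical risk with the logistic loss (one admissible margin-based choice of $\ell$; see Section 2.4.1).

```julia
using LinearAlgebra

# L_t for t = (i, j, k), Eq. 2.13: zeros outside the i, j, k sub-matrix.
function Lt(n, i, j, k)
    L = zeros(n, n)
    L[i, j] = L[j, i] = -1.0
    L[i, k] = L[k, i] = +1.0
    L[j, j] = +1.0
    L[k, k] = -1.0
    return L
end

n = 8
Y = randn(2, n)
G = Y' * Y
i, j, k = 1, 2, 3
lhs = dot(Lt(n, i, j, k), G)   # Frobenius inner product ⟨L_t, G⟩_F
rhs = norm(Y[:, i] - Y[:, j])^2 - norm(Y[:, i] - Y[:, k])^2
@assert isapprox(lhs, rhs)     # Eq. 2.17

# Empirical risk (Eq. 2.18) with the logistic loss ℓ(x) = log(1 + exp(-x)).
logistic_loss(x) = log(1 + exp(-x))
S = [(1, 2, 3), (4, 5, 6), (2, 7, 8)]   # toy observed triplets
w = [-1, +1, -1]                        # their labels w_t
risk = sum(logistic_loss(w[t] * dot(Lt(n, s...), G)) for (t, s) in enumerate(S)) / length(S)
@show risk
```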
Remark 2.3.2. Eq. 2.18 suggests that the problem of finding an embedding can be posed as a triplet classification problem, where the observed triplets in $\mathcal{S}$ are the train set and the unseen triplets $\mathcal{S}^c = \mathcal{T} \setminus \mathcal{S}$ are the test set. The hypothesis set for G is given by the set of all positive semi-definite matrices $\{G \in \mathbb{R}^{n \times n} : G \succeq 0\}$.

Remark 2.3.3. If f is known, then $\hat{R}(G)$ is an unbiased estimator of R(G), since f will induce a loss $\ell_f$ under a maximum likelihood framework [40]. However, if we do not know f a priori, and f does not match the loss induced when setting the problem up as maximum likelihood estimation, then the empirical risk is indeed biased [49]. We will empirically observe this in Chapter 6.

2.4 Algorithms to Estimate Ordinal Embeddings

2.4.1 General Case

The estimation of an ordinal embedding can be done by posing a constrained optimization problem over the empirical risk:

Problem 2.4.1 (Constrained Empirical Risk Minimization [40]). To find the embedding, we can minimize the empirical risk:

$$\begin{aligned} \min_G \quad & \hat{R}_{\mathcal{S}}(G) \\ \text{s.t.} \quad & G \succeq 0, \\ & \|G\|_* \leq \lambda, \quad \|G\|_\infty \leq \gamma, \end{aligned}$$

where $\lambda$ and $\gamma$ are parameters that keep the values of the embedding constrained.

The recovery of Y from G can be achieved up to a rigid transformation using the Singular Value Decomposition (SVD)². In [40], the authors use several techniques based on projected gradient descent (PGD) to solve Problem 2.4.1. In particular, in this work we are interested in low-dimensional embeddings (typically the case where m = 1). In this setting, we have empirically observed that minimizing:

$$\hat{R}_{\mathcal{S}}(X) = \frac{1}{|\mathcal{S}|} \sum_{t \in \mathcal{S}} \ell\left(w_t \langle L_t, X^T X \rangle_F\right), \tag{2.19}$$

where $X \in \mathbb{R}^{m \times n}$, allows for a better estimation in low dimensions, while also being more memory-efficient, since the number of variables to estimate is mn (the number of entries in X) rather than n(n + 1)/2 (the number of free variables in the Gram matrix), while also being able to enforce the rank constraint without the need to clip the number of dimensions in the SVD operation.

² Note that the scale is lost. If we consider the problem of ordinal embedding as an inverse problem where we estimate an embedding from triplet comparisons, we can recover pairwise distances between embedded items up to a factor.

A Maximum Likelihood Interpretation

In a maximum likelihood framework, $\ell$ is induced by our choice of f [40], assuming that the random variables $w_t$ encoding the decisions for each triplet are independent. The likelihood for a single triplet may be defined as:

$$\ell_f\left(w_t \langle L_t, G \rangle\right) = \mathbb{1}_{\{w_t = -1\}} \log \frac{1}{f\left(\langle L_t, G \rangle\right)} + \mathbb{1}_{\{w_t = 1\}} \log \frac{1}{1 - f\left(\langle L_t, G \rangle\right)}. \tag{2.20}$$

If we solve Eq. 2.20 for a given f, we obtain the log-likelihood, whose negation can be used as a loss function and minimized to estimate G; therefore, the choice of f induces a given loss function $\ell$. For example, if f is the logistic function $f(x) = 1/(1 + \exp(-x))$, the induced loss is the logistic loss $\ell(x) = \log(1 + \exp(-x))$ [40]³. This example (using the logistic loss) is equivalent to Stochastic Triplet Embeddings [50], since minimizing the logistic loss is equivalent to maximizing the soft-max.

³ The actual loss depends on how we choose the class labels. See https://stats.stackexchange.com/questions/250937/which-loss-function-is-correct-for-logistic-regression/279698#279698
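Two building blocks of Problem 2.4.1 above, sketched under the assumption that a Gram matrix estimate is already at hand: the projection onto the PSD cone used inside PGD solvers, and the spectral recovery of an m-dimensional Y from G (up to a rigid transformation, and hence distances up to a factor, per footnote 2).

```julia
using LinearAlgebra

# Projection of a symmetric matrix onto the PSD cone:
# keep the eigenvectors, clip negative eigenvalues at zero.
function project_psd(G::AbstractMatrix)
    E = eigen(Symmetric(Matrix(G)))
    return E.vectors * Diagonal(max.(E.values, 0)) * E.vectors'
end

# Recover an m-dimensional embedding from a Gram matrix: since G = YᵀY,
# the top-m eigenpairs of G give Y up to a rigid transformation.
function recover_embedding(G::AbstractMatrix, m::Int)
    E = eigen(Symmetric(Matrix(G)))
    order = sortperm(E.values; rev = true)[1:m]
    return Diagonal(sqrt.(max.(E.values[order], 0))) * E.vectors[:, order]'
end

Y = randn(1, 10)                       # a 1-D "ground truth" embedding
Yhat = recover_embedding(Y' * Y, 1)    # equals Y up to sign in this 1-D case
@show maximum(abs, abs.(Yhat) .- abs.(Y))   # ≈ 0
```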
Finite sample prediction

In [40], the authors prove that the error $R(\hat{G}) - R(G^*)$ (where $G^*$ is the true underlying Gram matrix associated with $D^*$) is bounded with high probability if $|\mathcal{S}| = O(mn \log n)$, provided a link function f exists. Consequently, they show that $\|\hat{D} - D^*\|_F$ is also bounded. As a result, the number of triplets that we need to query is $O(mn \log n)$ instead of $O(n^3)$. This becomes important later in our analysis.

2.4.2 Other Methods Proposed in the Literature

Before the statistical study and interpretation of ordinal embeddings proposed in [40], several contributions were made, mostly based on posing different optimization problems to find embeddings from triplet comparisons. Here we give a brief review of these methods, which we later use in our experiments. In each of the subsequent loss functions, triplets are queried such that, if i is considered to be closer to j than to k, then (i, j, k) is added to $\mathcal{S}$; otherwise, (i, k, j) is added to $\mathcal{S}$. That is, the decision is implicitly encoded in the order of the triplets in $\mathcal{S}$, rather than through explicit random variables $w_t$. Moreover, all the methods are described in terms of the Gram matrix G.

Generalized Non-metric Multidimensional Scaling

This method is presented in [51], and poses the problem as a trace minimization with slack variables that encode each one of the triplet comparisons:

$$\begin{aligned} \min_G \quad & \operatorname{Tr}(G) + \lambda \sum_{(i,j,k) \in \mathcal{S}} \xi_{ijk} \\ \text{s.t.} \quad & G_{jj} - 2G_{ij} - G_{kk} + 2G_{ik} \leq -1 + \xi_{ijk}, \\ & \xi_{ijk} \geq 0, \quad G \succeq 0. \end{aligned}$$

Crowd Kernel Learning

This method is presented in [52]. "Probabilities" are defined as:

$$p_{ijk} = \frac{G_{ii} + G_{jj} - 2G_{ij} + \mu}{(G_{ii} + G_{jj} - 2G_{ij}) + (G_{ii} + G_{kk} - 2G_{ik}) + 2\mu}, \tag{2.21}$$

where $\mu$ is a scalar. Then, the negative sum of the log-probabilities is minimized:

$$\begin{aligned} \min_G \quad & -\sum_{(i,j,k) \in \mathcal{S}} \log(p_{ikj}) \\ \text{s.t.} \quad & G_{ii} = 1, \quad G \succeq 0. \end{aligned}$$

Stochastic Triplet Embeddings

The authors propose to define "probabilities" for each triplet using a two-class softmax:

$$p_{ijk} = \frac{\exp\left(-\frac{\|y_i - y_j\|_2^2}{2\sigma^2}\right)}{\exp\left(-\frac{\|y_i - y_j\|_2^2}{2\sigma^2}\right) + \exp\left(-\frac{\|y_i - y_k\|_2^2}{2\sigma^2}\right)}. \tag{2.22}$$

Then, the cost function is simply the sum over all the log-probabilities:

$$\max_Y \sum_{(i,j,k) \in \mathcal{S}} \log(p_{ijk}). \tag{2.23}$$

Moreover, they propose a new cost function where the Gaussian kernel of the softmax function is replaced by a t-Student kernel, arguing that the heavier tails of this kernel aid the estimation of the embedding by avoiding vanishing gradients:

$$p_{ijk} = \frac{\left(1 + \frac{\|y_i - y_j\|_2^2}{\alpha}\right)^{-\frac{\alpha+1}{2}}}{\left(1 + \frac{\|y_i - y_j\|_2^2}{\alpha}\right)^{-\frac{\alpha+1}{2}} + \left(1 + \frac{\|y_i - y_k\|_2^2}{\alpha}\right)^{-\frac{\alpha+1}{2}}}, \tag{2.24}$$

where $\alpha$ is the number of degrees of freedom of the t-Student kernel.

Claim 2.4.1 (Relationship between STE and tSTE). As $\alpha \to \infty$, the cost function of tSTE (Eq. 2.24) is equivalent to the cost function of STE (Eq. 2.23) when $\sigma = 1$.

Proof: Taking the limit, we have:

$$\lim_{\alpha \to \infty} \frac{\left(1 + \frac{\|y_i - y_j\|_2^2}{\alpha}\right)^{-\frac{\alpha+1}{2}}}{\left(1 + \frac{\|y_i - y_j\|_2^2}{\alpha}\right)^{-\frac{\alpha+1}{2}} + \left(1 + \frac{\|y_i - y_k\|_2^2}{\alpha}\right)^{-\frac{\alpha+1}{2}}} = \frac{\exp\left(-\frac{\|y_i - y_j\|_2^2}{2}\right)}{\exp\left(-\frac{\|y_i - y_j\|_2^2}{2}\right) + \exp\left(-\frac{\|y_i - y_k\|_2^2}{2}\right)}. \quad \blacksquare$$

We have experimentally observed that the cost function of tSTE is a good replacement for the cost function of STE at high values of $\alpha$ in terms of numerical stability. In the deep learning community, it is well known that the softmax is numerically unstable due to vanishing gradients [53].
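The STE and tSTE probabilities, and the convergence in Claim 2.4.1, can be checked numerically in a few lines; the example distances below are arbitrary.

```julia
# STE probability (Eq. 2.22 with σ = 1) and tSTE probability (Eq. 2.24),
# written in terms of the squared distances dij = ||yᵢ-yⱼ||², dik = ||yᵢ-yₖ||².
p_ste(dij, dik)     = exp(-dij / 2) / (exp(-dij / 2) + exp(-dik / 2))
kern(d, α)          = (1 + d / α)^(-(α + 1) / 2)          # t-Student kernel
p_tste(dij, dik, α) = kern(dij, α) / (kern(dij, α) + kern(dik, α))

dij, dik = 0.7, 1.9
@show p_ste(dij, dik)            # ≈ 0.646
@show p_tste(dij, dik, 30)       # close to the STE value
@show p_tste(dij, dik, 10_000)   # closer still, as Claim 2.4.1 predicts
```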
Part I
Learning Fused Annotations

Chapter 3
Fusing Annotations in Time

3.1 Introduction

In this chapter, we present an algorithm to combine annotations that have been acquired in real-time for a construct that changes over time [12]. We apply this algorithm to annotations of two dimensions of affect: arousal and valence. We validate this algorithm on our green intensity experiment and on the Remote Collaborative and Affective Interactions (RECOLA) dataset [2], as part of the 2018 Audio/Visual Emotion Recognition Challenge Gold-standard Emotion Sub-challenge (AVEC GES 2018) [54], where we obtained first place.

Problem Statement

Let $\mathcal{A}$ be a set of annotators. Let us assume that we have annotations $\{Y^a\}_{a \in \mathcal{A}}$ obtained in real-time for constructs for which we do not have a ground truth. In this case, the constructs are two dimensions of affect (arousal and valence). We also have access to features of audio, video, and physiological signals, and a system to train regression models over single labels $Y \in \mathbb{R}^n$. The problem is to create an algorithm that combines the annotations $\{Y^a\}_{a \in \mathcal{A}}$ such that we minimize the test error between input labels Y and predictions $\hat{Y}$, for a fixed set of possible regression models $\mathcal{M}$. Figure 3.1 shows a summary of the problem.

Figure 3.1: AVEC GES 2018. The GES sub-challenge consisted in creating an algorithm to combine different annotations obtained in real-time such that, when used to train a regression model, they maximize the concordance correlation coefficient (CCC) between the labels and the predictions on a held-out set.

Data

The Remote Collaborative and Affective Interactions (RECOLA) dataset was used in this task [55]. The data set consists of 27 sessions, each 5 minutes long. In each session, a different subject is recorded during a video conference; the subjects are French native speakers. The data set contains audio, video, electrocardiogram (ECG), and electrodermal activity (EDA) signals, all synchronously recorded. The data set is split into nine training sessions, nine validation sessions, and nine test sessions, where each partition corresponds to 45 minutes in total. The generation of the features using supervised, semi-supervised, and unsupervised methods is described in [54]. Figure 3.2 shows example frames of the video for three different training sessions.

Annotations

The annotations are acquired in real-time from six annotators using a joystick while watching the videos. The annotators are gender-balanced (three females and three males) and are all French native speakers. The two dimensions of affect (arousal, valence) have been annotated independently on a continuous scale in the range [−1, +1]. We have access to nine training and nine validation sets of annotations; the test annotations are not seen by us. The method to generate the baseline labels is described in [56, 57], and involves a time-alignment step plus a weighted average. Figure 3.4 shows an example of annotations for valence.

Figure 3.2: Frames of a subset of training videos of the RECOLA [2] data set. Audio and video were recorded, while physiological signals (ECG, EDA) were collected during 5-minute-long one-on-one conversations. Only one of the subjects in each conversation was recorded.

Evaluation

The metric for evaluation is the Concordance Correlation Coefficient (CCC):

Definition 3.1.1 (Concordance Correlation Coefficient (CCC)). Let x and y be two time series. The CCC is defined as:

$$\rho_c = \frac{2\rho\sigma_x\sigma_y}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2} \in [-1, 1], \tag{3.1}$$

where $\rho$ is the Pearson correlation between x and y, $\sigma_x^2$ and $\sigma_y^2$ are their variances, and $\mu_x$ and $\mu_y$ are their means.

Note that the CCC measures the correlation between two variables but penalizes differences in their means, unlike the Pearson correlation coefficient. Moreover, a constant input will yield a CCC equal to 0.
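Definition 3.1.1 translates directly into code. The sketch below computes the CCC from its definition and illustrates the mean-offset penalty that distinguishes it from the Pearson correlation.

```julia
using Statistics

# Concordance Correlation Coefficient (Definition 3.1.1).
function ccc(x::Vector, y::Vector)
    μx, μy = mean(x), mean(y)
    vx, vy = var(x; corrected = false), var(y; corrected = false)
    σxy = mean((x .- μx) .* (y .- μy))   # covariance = ρ σx σy
    return 2σxy / (vx + vy + (μx - μy)^2)
end

x = collect(0.0:0.1:1.0)
@show ccc(x, x)          # 1.0: perfect agreement
@show ccc(x, x .+ 0.5)   # ≈ 0.44, despite a Pearson correlation of 1
```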
The evaluation of the performance is done by computing the highest CCC between the generated label Y in the test set (using our developed algorithm) and the best prediction $\hat{Y}_m$ in the test set among all the pre-trained regression models $m \in \mathcal{M}$, for each assessed dimension of affect:

$$\rho_c^{\text{best, dimension}} = \max_{m \in \mathcal{M}} \rho_c\left(Y, \hat{Y}_m\right). \tag{3.2}$$

Figure 3.3: Algorithm: we can find a 1-dimensional embedding where each item is a value of a signal in time.

The final score is computed as the average of $\rho_c^{\text{best, dimension}}$ over the two dimensions of affect, that is:

$$\text{score} = \frac{\rho_c^{\text{best, arousal}} + \rho_c^{\text{best, valence}}}{2}. \tag{3.3}$$

3.2 Algorithm: Majority Voting Triplet Embeddings

We now discuss our approach to obtain a unique label Y from a set of annotations $\{Y^a\}_{a \in \mathcal{A}}$. In the design of this algorithm, we exploit two ideas: (1) a 1-dimensional embedding $Y \in \mathbb{R}^{1 \times n}$ indexed by time can be seen as a time series (Figure 3.3), and (2) although the values provided during continuous real-time annotations may not be directly reliable [1], their relative pairwise distances are reliable enough throughout an annotation scheme for the majority of the annotators:

Assumption 3.2.1 (Informal). Let $D^*_{ij}$ be the true distance between values of a construct at times i and j. Then, we assume that most annotators will be consistent; that is, for times i, j, and k, if the true signal has $D^*_{ij} < D^*_{ik}$, then annotator a will annotate $D^a_{ij} < D^a_{ik}$ with some probability p > 1/2.

Figure 3.4: Example of labels of valence from the RECOLA dataset. In our approach, we use time triplets (i, j, k) to decide whether $\{D^a_{ij} \lessgtr D^a_{ik}\}_{a \in \mathcal{A}}$, where $D^a_{ij} = |Y^a_i - Y^a_j|$.

Assumption 3.2.2 (Formal). Let $D^*_{ij}$ be the true distance between values of a construct at times i and j. Then, we assume that annotator behavior can be modeled using a decision function of the form:

$$w^a_t = \begin{cases} -1, & \text{w.p. } f^a(D^*_{ij} - D^*_{ik}) \\ +1, & \text{w.p. } 1 - f^a(D^*_{ij} - D^*_{ik}). \end{cases} \tag{3.4}$$

Therefore, assuming that $D^*$ exists (so there is an existing and unique latent variable, up to a scaling factor), we hypothesize that we can obtain the correct direction of the triplet relation:

$$D^*_{ij} \lessgtr D^*_{ik} \tag{3.5}$$

with some probability greater than or equal to 1/2, by taking all annotators' opinions for any $(i, j, k) \in \mathcal{T}$ (the set of all possible triplets). Here, $D^*_{ij}$ represents the unknown true distance of the construct being annotated between time i and time j (note that i and j are not necessarily instants in time, but may refer to short time windows). This leads to a natural weighted majority vote construction to decide which direction in Eq. 3.5 is correct.
Each annotator's weight is assigned beforehand using one of several techniques described in the following subsections; intuitively, each weight is proportional to the trust (or reliability) that we assign to the corresponding annotator. Then, if

    w_t = \operatorname{sign}\Big(\sum_{a \in \mathcal{A}} r_a w_t^a\Big) \neq 0,

we add (i, j, k) to S, the set of observed triplets. (This strategy might produce a set of observed triplets S that has not been sampled uniformly at random. In practice, this is not an issue, since we perform an exhaustive sampling of triplets.)

Algorithm 1 shows the implementation of the triplet generation through weighted majority voting. The implementation takes a matrix A ∈ R^{n×|A|}, where each column represents an annotation time series from one of the |A| annotators and the rows represent time frames. It also takes a weight vector r ∈ R^{|A|}, representing the reliability of each annotator. The implementation is flexible enough that setting r_a = 0 removes annotator a from the decision process (leave-one-annotator-out), while a constant r_a for all annotators recovers a simple majority vote.

After the labels for the triplets are obtained, we choose a loss function to estimate the embedding. An interesting feature of this majority vote embedding approach to annotation fusion is that once the embedding is estimated, the number of triplets in S that violate the distances in the embedding may be computed, leading to a measure of agreement in the construction of the embedding itself. We revisit this idea in our discussion in Section 3.4.

Algorithm 1: Generate set of triplets S using annotator weights and a majority vote strategy.
  Data: A ∈ R^{n×|A|}: annotations matrix
  Input: r ∈ R^{|A|}: annotator weights
  Result: set of triplets S
  S ← {}
  for a ← 1 to |A| do
      // Compute all pairwise distances between values of each column of A.
      // Each D[a] is a distance matrix between all points in A[:, a].
      D[a] ← distances(A[:, a])
  end
  for k ← 1 to n do
      for j ← 1 to k − 1 do
          for i ← 1 to n do
              // Iterate over unique triplets t = (i, j, k)
              if i ≠ j and i ≠ k then
                  w_t ← 0
                  for a ← 1 to |A| do
                      // Accumulate each annotator's decision for triplet t = (i, j, k)
                      if D[a][i, j] < D[a][i, k] then
                          w_t ← w_t − r_a
                      else if D[a][i, j] > D[a][i, k] then
                          w_t ← w_t + r_a
                      end
                  end
                  // Add the triplet (i, j, k) unless there is a tie.
                  if w_t < 0 then
                      S ← S ∪ {(i, j, k)}
                  else if w_t > 0 then
                      S ← S ∪ {(i, k, j)}
                  end
              end
          end
      end
  end

3.3 Experiments

3.3.1 Sanity Check: Green Intensity Videos

To check the performance of our algorithm in a setting where the ground truth is known, we apply our majority vote triplet embedding to the individual annotations in both annotation tasks from the green intensity videos dataset to produce labels.

Time Alignment

For this task, we do not time-align, and we set all annotator weights to the same value.

Triplet Sampling and Annotation

Since the number of triplets necessary to fully specify each annotation's vote scales as O(n³) with the number of frames, we downsample the annotations from 30 Hz to 1 Hz. We then sample the triplets according to Algorithm 1 and estimate the embedding using tSTE with α = 30 (we use this value of α because the resulting loss is a convex function whose gradient does not vanish).

Annotator Weights

We consider all annotator weights to be the same.

3.3.2 AVEC: RECOLA dataset

Time Alignment

We use two temporal alignment methods to remove the lags of the annotations with respect to the features: dynamic time warping (DTW) [14] and generalized time warping (GTW) [17].
DTW adjusts the sequence of temporal indices of one signal given another reference signal to achieve an optimal correlation between the two. Since the true emotional signal is unknown, we elect to use a single feature as a reference signal and then apply DTW to each individual annotation to align it with the reference feature. We perform an exhaustive search of all corresponding video, audio, and physiological features provided in the RECOLA dataset to find the single feature with the highest average Pearson correlation with each corresponding annotation, which turns out to be the geometric video-based feature "geometric feature 245 amean".

GTW is an enhanced version of canonical time warping which attempts to learn a monotonic temporal warping function and a feature projection function that together maximize the correlation between the projected features and the temporally warped signal. We use the implementation provided in [58] and perform a grid search to tune d, the CCA energy threshold hyperparameter, by maximizing the CCC obtained when evaluating with the AVEC GES evaluation metric. We find that the performance is not impacted significantly for d ∈ {0.6, 0.7, 0.8, 0.9} and that d = 0.7 is slightly better.

Triplet Sampling and Annotation

Since the number of triplets necessary to fully specify each annotation's vote scales as O(n³) with the number of frames, we first downsample the annotations from 25 Hz to 1 Hz. This leads to 13,365,300 unique triplets for a signal with 300 samples. Once the optimal embedding is obtained, we upsample to the original sampling rate. This upsampling step is done in Julia using the DSP.jl package. Furthermore, both DTW and GTW require that the features and annotations be temporally aligned. Thus, when applying these two methods, we downsample the annotations by a factor of 10 to match the sampling rate of the features. After running the time warping method, we linearly interpolate the computed label Y to provide time indices at the original resolution. We estimate the embeddings using tSTE with α = 30.

Annotator Weights

We test three approaches to assigning weights to annotators: unweighted, weighted, and weighted leave-worst-out. In the unweighted scenario, each annotator is given an equal weight w_a, so each triplet is given an equal vote. In the weighted scenario, we use the CCC to assess the similarity between two annotators. We obtain a matrix of correlations S such as in Fig. 3.5 and then set w_a = Σ_i S_ai − 1. Thus, annotations that agree with each other are given higher weights, and annotations in disagreement with the majority (potentially adversarial) are given less weight. Lastly, we adopt a weighted leave-worst-out strategy where the annotator with the lowest weight is left out entirely and the embedding depends only on the remaining annotations.

Figure 3.5: Agreement between annotators for two example subjects from the RECOLA emotion dataset in two different annotation tasks: arousal (left) and valence (right). Agreement is measured using CCC. As the shared scale shows, the overall agreement for valence is higher than the overall agreement for arousal.

Scaling

Since the distances of the embedding Y are computed up to a scale (including a sign in this 1-dimensional case), we use the simple average of the annotations {Y^a}_{a∈A} over time to infer the scale of the embedding, by minimizing the MSE between the point-wise (in time) average Ȳ
and Y:

    \min_{s, b}\ \frac{1}{n} \lVert sY - b\mathbf{1} - \bar{Y} \rVert_2^2,    (3.6)

where n is the length of the embeddings, and s and b are scale and bias variables. Then, we can compute a scaled version of Y.

3.4 Results and Analysis

3.4.1 Green Videos

We present the results in Table 3.1 and Figure 3.6. The figure shows that both solutions are very similar. However, in the section between 80 s and 150 s, our approach based on triplet embeddings seems better able to recover the constant values.

An interesting result that is visible in both tasks is the lost scale of both methods, not only of the ordinal embedding approach (where it was expected). Annotators seem to annotate trends instead of actual values even when they are asked to rate, so that the correlation between the signals is high but the scale is lost, making the mean squared error (MSE) between the labels Y and the ground truth potentially large and subject to the sampling mechanism.

Another interesting point is the fact that the annotations seem to act as a low-pass filtered version of the underlying construct. This can be seen especially for task A, due to its higher-frequency components compared to task B, which is a smooth function.

Table 3.1: Agreement measures between different fusion methods and the objective truth signal in two tasks from the color intensity dataset [1].

          Method               RMSE     Pearson   CCC
  Task A  Unweighted Average   0.1916   0.7756    0.6392
          Triplet Embedding    0.1907   0.7762    0.6410
  Task B  Unweighted Average   0.1057   0.9523    0.9172
          Triplet Embedding    0.1005   0.9594    0.9248

Figure 3.6: Plots for task A and task B from the color intensity annotation dataset. The true color intensity signal is shown (black) alongside the unweighted average of individual annotations (purple) and the label produced using an unweighted version of our majority vote triplet embedding algorithm (cyan). This shows that the proposed method produces sensible labels, qualitatively similar to the average signal.

3.4.2 AVEC: RECOLA dataset

Figure 3.5 shows the annotator agreement matrices based on CCC for single sessions, used to compute the weights as described in Section 3.3.2. For these particular sessions, we can observe that the overall agreement for valence is higher than for arousal. Moreover, we can see that there is high agreement between male annotators in the arousal dimension, while agreement between female annotators is low.

The combined annotations serving as proposed labels Y for the RECOLA emotion annotations are evaluated using CCC as described in Section 3.1. Table 3.2 shows the CCCs of our methods compared to the AVEC challenge baseline algorithm (for details, see [54]). We only include the results for DTW as the time-alignment tool, since its results were systematically the best. Our method did not improve on the baseline for the arousal task, but did improve for the valence task. Overall, the score for our approach is higher than the baseline. We also notice that the weighting of annotators has little effect when using DTW, showing that the majority voting usually selects
the appropriate triplet directions before learning the embedding.

Table 3.2: Challenge results for arousal and valence. Only the best values of CCC are displayed for each approach. We do not include results for time alignment based on GTW, since it systematically performed worse than DTW. The final row reports the challenge score (Eq. 3.3) for the baseline and for our best submission.

  Dimension   Set    Baseline   DTW unweighted   DTW weighted   DTW LWO
  Arousal     Val    0.775      0.682            0.655          0.686
              Test   0.657      0.580            0.567          0.626
  Valence     Val    0.570      0.645            0.610          0.641
              Test   0.515      0.621            0.629          0.617
  Score              0.586      0.628

To better understand the impact of time alignment on the quality of the estimated embeddings, we looked at the number of triplet violations for each embedding, for the different time-alignment approaches. Figure 3.7 shows the percentage of triplet violations remaining after running the majority vote triplet embedding algorithm to convergence (smaller is better). Since we initialize the optimization routines with a signal very close to the average of the individual annotations, and since we use a convex loss function, the resulting embedding is extremely stable over multiple trials. Thus, we only plot the triplet violations for one trial run per annotation task. Interestingly, DTW produces the embeddings with the smallest percentage of triplet violations across all annotator weighting schemes, with triplet violation percentages below 10% for all train and validation sessions. This can be interpreted as DTW finding the time lags that most accurately align each independent annotation, if we consider that Assumption 3.2.2 holds.

3.5 Discussion

The results obtained for arousal using the ordinal embedding approach are worse than the baseline. Interestingly enough, the annotator agreements for this task are also lower than for valence. If we look at Table 3.2, we may observe that leaving one annotator out gave the best results for arousal, indicating that leaving more annotators out for this task could be a next step.

The results in Figure 3.7 suggest that time alignment is a critical step in finding embeddings that are learnable. However, are there any other properties of the signals that could give us clues as to how to improve the sampling of the underlying constructs?

Figure 3.7: Plots showing the percentages of triplet violations after tSTE convergence, for arousal (left) and valence (right) over all train and validation sessions. The legend covers DTW and GTW time alignment, each in unweighted, weighted, and weighted LWO variants; the bottom row of the legend shows the use of Algorithm 1 without previous time alignment.

Figure 3.8 shows the labels found by the baseline and by the triplet embedding approach for a valence task.

Figure 3.8: Comparison between the ordinal embedding approach and the baseline (simple average). Note the regions of difference and the high-frequency components of the annotated signal.

There are not only visible differences between the signals in certain time windows; we also see a structure similar to that of task A in the green intensity experiments: there are high-frequency components in the signal. In this setting, it might be that all real-time annotations low-pass filter the underlying construct, giving representations that do not faithfully represent the underlying signals. This poses the question: can we sample the underlying construct differently?

3.6 Conclusions

In this chapter, we proposed a new algorithm to combine labels acquired in real-time for constructs that vary over time.
The algorithm was used to participate in the AVEC GES 2018 challenge to train regression models for arousal and valence. Our algorithm generated the most learnable labels among all competitors and the baseline, giving the highest score among all participants.

Chapter 4

Combining Likert-scale Ratings through Rank Aggregation

4.1 Introduction

In the previous chapter, we studied the problem of combining annotations for regression problems. However, other applications in the social sciences involve session-level annotations, where a scalar describes a particular variable for a given item. For example, in clinical psychology and psychiatry, the assessment of attributes describing human behavior is crucial for improving the quality of therapy. Often known as behavioral codes, these behavioral descriptors are usually annotated at the session level by human experts based on the observation of audio-visual expressions or interactions/interviews of the subjects. In this chapter, we study the problem of combining behavioral code annotations by adapting Copeland's method [59].

Problem statement

We are given a partially observed matrix A describing the annotations of a single behavioral code. The rows of A correspond to items (or sessions) and the columns correspond to annotators. We want to find a vector (or ensemble) Y such that each session is described by a scalar capturing the information of the annotations. In this setting, we have access to the description of the variable being annotated, but there are no corresponding features.

Data description

Our experiments in this chapter are done on the ratings derived from the Couples' Therapy corpus of dyadic spoken interactions of a husband and wife in marital counseling [60], as defined by the Couples Interaction Rating System (CIRS) [61, 62]. It has 13 different behavioral codes (denoted by the set C and listed in Table 4.1), and every code is rated on a scale from 1 to 9.

Annotations

The dataset contains a total of 5367 annotations that correspond to 1538 unique sessions from 186 couples, rated by 17 trained annotators. Ratings include the behavioral assessment of husband and wife along the 13 behavioral codes. Each session is rated by a subset of the annotators ranging from 2 to 9 annotators, for a total of 5367 out of 26,146 possible ratings.

Evaluation

There is no direct way to assess the quality of Y. Therefore, we evaluate our algorithm in the following indirect ways: (1) by replacing the worst annotator (in terms of Krippendorff's α inter-rater agreement [63, 64]) with our ensemble Y and studying how the inter-rater agreement changes, and (2) by studying how the estimated ensemble Y changes if we compute it from the matrix A plus noise. We discuss these in more detail later in the chapter.

4.2 Algorithm: Rank Aggregation by Adapting Copeland's Method

Our algorithm involves two steps to create a mapping from our rating space (1–9) to a ranking space (the relative ranking between objects) and back: (1) using a modified version of Copeland's counting method to create a ranking of all the objects (sessions) being evaluated, and (2) creating a mapping from the ranking space back into the original rating space. This section describes both steps and the experiments executed to assess the performance of the algorithm.

4.2.1 Modified Copeland's Counting Method

Copeland's counting method is an algorithm to compute the underlying ranking of choices among different voters, each of whom contributes one vote. Each voter ranks objects by preference.
For each pair of objects j and j', if j is ranked higher than j', then j gets assigned one vote; we repeat this for all voters. If object j gets more votes than object j', then j gets assigned one point. We repeat this for all remaining pairs of objects, and rank the objects according to the number of points (the score) obtained. For more details, please refer to [65], pages 44–47.

We introduce a variation of this method to combine annotator ratings. Let A ∈ R^{n×|A|} be an annotations matrix, where n is the number of sessions/items and |A| is the number of annotators. We consider that annotator i prefers session j over session j' if the rating of session j is greater than the rating of session j'. We then proceed in the same manner as Copeland's counting method (Algorithm 2). The assumptions that we consider are the following:

Assumption 4.2.1 (Copeland's modified method, informal)
1. Annotators rate high behavioral code scores with high ratings and low behavioral code scores with low ratings most of the time;
2. Each behavioral code is treated independently;
3. Annotators do not use values above or below the maximum or minimum values of the given rating scale.

The proposed algorithm is designed for mostly complete annotation matrices with a small number of missing values. Given that our ratings matrices are sparse, we perform the analysis by imputing the missing values with the average rating of each rated object. This prevents spurious high scores coming from annotators with low numbers of annotations.

Algorithm 2: Modified Copeland's counting method
  Input: A ∈ R^{n×|A|}: annotations matrix; n, |A|: number of sessions and annotators
  Output: scores vector ∈ R^n
  scores ← zeros(n, 1)
  for j ← 1 to n − 1 do
      for j' ← j + 1 to n do
          votes_j ← 0; votes_j' ← 0
          for i ← 1 to |A| do
              rating_j ← A[j, i]; rating_j' ← A[j', i]
              if rating_j > rating_j' then
                  votes_j ← votes_j + 1
              else if rating_j < rating_j' then
                  votes_j' ← votes_j' + 1
              end
          end
          if votes_j > votes_j' then
              scores[j] ← scores[j] + 1
          else if votes_j < votes_j' then
              scores[j'] ← scores[j'] + 1
          else
              scores[j] ← scores[j] + 1/2; scores[j'] ← scores[j'] + 1/2
          end
      end
  end

4.2.2 Rankings to Ratings

Algorithm 2 outputs a vector containing the scores assigned to each object (session) being rated. A direct mapping from scores into ratings from 1–9 can be achieved through an affine transformation between the ranking space and the rating space. Nevertheless, this transformation creates a distribution of ratings that does not represent the structure of behavioral code scores for the set of analyzed sessions. To work around this issue, we estimate the density of ratings of the annotations matrix A by computing the histogram of the ratings. This computation preserves the distribution of ratings from matrix A, allowing a mapping between the ranking space and the rating space that also preserves these statistics.

The output of Algorithm 2 is used to rank sessions by their scores. To create the mapping between ranking space and rating space, we compute the histogram counts c_1, ..., c_9 of the ratings matrix A and the proportions π_1, ..., π_9 for each possible value in the scale, where π_l = c_l / Σ_{l'} c_{l'}. In the sorted scores vector, we assign (from lowest to highest ranked) a value of 1 to the first m_1 elements, a value of 2 to the following m_2 elements, and so on, where m_l is the number of sessions allotted to rating l in proportion to π_l. We proceed until we have assigned rating values to all the ranked sessions. As previously mentioned, this mapping preserves the statistics of the collected ratings, while taking into account the underlying preferences among all raters.
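A minimal Julia sketch of this rank-to-rating mapping follows (the names and the rounding scheme are ours; it assumes a complete scores vector and that ratings pools all observed ratings):

    # Map Copeland scores back to the 1–9 scale while preserving the empirical
    # rating distribution of the annotation matrix (Section 4.2.2).
    function scores_to_ratings(scores::AbstractVector, ratings::AbstractVector{<:Integer})
        n = length(scores)
        counts = [count(==(l), ratings) for l in 1:9]   # histogram counts c₁,…,c₉
        m = round.(Int, n .* counts ./ sum(counts))     # mₗ: sessions given rating l
        m[end] += n - sum(m)                            # absorb rounding residue
        order = sortperm(scores)                        # lowest score → lowest rating
        out = zeros(Int, n)
        pos = 1
        for l in 1:9
            out[order[pos:pos + m[l] - 1]] .= l
            pos += m[l]
        end
        return out
    end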
4.2.3 Assessment of the Algorithm

There is an intrinsic difficulty in assessing the performance of the algorithm, as there is no ground truth within our dataset to compare the results to. Moreover, since our algorithm goes from a rating space into a ranking space and back into a rating space, assessment depends on both stages, and not only on the output (ensemble). To work around these difficulties, we propose three different experiments to assess its performance: (1) corrupt the initial ratings with noise, and compare the original ensemble with the ensembles from the corrupted data; (2) perform cross-validation by leaving one annotator out, and compute Krippendorff's α [63, 64] over the ratings matrix including the ensemble; and (3) compute the consistency of the output ensembles when leaving one annotator out. In all three experiments, we compare the results of the proposed modified Copeland's ensembles with the ensembles obtained from the average ratings.

Corruption of the ratings with noise

To investigate the stability of the algorithm against noisy inputs, we corrupt all the ratings in A with the following rule, treating each behavioral code independently (a short sketch of this rule is given below, after Figure 4.1):

    A' = A + W,    (4.1)

where

    W_{ij} = \begin{cases} -1, & \text{w.p. } p, \\ 0, & \text{w.p. } 1 - 2p, \\ +1, & \text{w.p. } p, \end{cases}    (4.2)

and i ∈ {1, ..., |S|} (S is the set of sessions), j ∈ {1, ..., |A|}, and p ∈ {1/10, ..., 1/3}.

The comparisons are performed in the following manner: for each behavioral code, its corresponding annotation matrix A is passed through our algorithm, obtaining an ensemble vector Y. Then, the annotation matrix is corrupted with noise, obtaining A' and the corresponding ensemble vector Y'. Krippendorff's α is computed between Y and Y'. The corruption process is repeated over 10 iterations, and the average and standard deviation of Krippendorff's α are computed. This process is repeated for each value of p and for each behavioral code.

Leave-one-out average agreement

For each behavioral code c, we leave one annotator i out and produce an annotation ensemble Y^c_{A\{i}}. We replace the left-out annotator's ratings with the ensemble ratings, and compute the inter-annotator agreement of this replaced-leave-one-out matrix:

    A^c_{\mathcal{A}\setminus\{i\}} = [A_1 \,\|\, \cdots \,\|\, Y^c_{\mathcal{A}\setminus\{i\}} \,\|\, \cdots \,\|\, A_{|\mathcal{A}|}].    (4.3)

We do this for each of the |A| = 17 annotators and compute the mean ᾱ. This mean agreement gives us an idea of how much agreement an ensemble creates within the annotations matrix, but does not tell us how close we are to the ground truth.

Figure 4.1: (Top row) Data corruption stability analysis for the proposed method (left) and the baseline (right). Each rating has been corrupted by noise, where p represents the probability of the original rating being increased/decreased by 1. The plotted values are the inter-rater agreement between the original ratings' ensemble and the corrupted ratings' ensemble, averaged over 10 iterations. The error bars correspond to one standard deviation. Note: no noise corruption implies a Krippendorff's α of 1 for every behavioral code. (Bottom row) Rating distributions for each annotated behavioral code. The y-axis represents relative counting frequency, while the x-axis represents behavioral code ratings (scale is 1 to 9). Histograms' colors refer to behavioral codes as follows: acceptance of the other, blame, responsibility for self, solicits perspective, states external origins, discussion, clearly defines problem, offers solutions, negotiates, makes agreements, pressures for change, withdraws, avoidance.
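As a concrete reference, here is a Julia sketch of the corruption rule in Eqs. (4.1)–(4.2) (our own illustration; it assumes a complete matrix, so in our experiments it would be applied after imputation):

    # Corrupt a ratings matrix following Eqs. (4.1)–(4.2): each entry moves down
    # or up by 1, each with probability p, and is left unchanged otherwise.
    function corrupt(A::AbstractMatrix, p::Real)
        noise(x) = (u = rand(); u < p ? x - 1 : u < 2p ? x + 1 : x)
        return map(noise, A)
    end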
Leave-one-out ensemble agreement

In our last assessment routine, we consider all the ensembles Y^c_{A\{i}} calculated in the previous experiment to build an ensemble matrix Y:

    Y = [\,Y^c_{\mathcal{A}\setminus\{1\}} \,\|\, \cdots \,\|\, Y^c_{\mathcal{A}\setminus\{i\}} \,\|\, \cdots \,\|\, Y^c_{\mathcal{A}\setminus\{|\mathcal{A}|\}}\,].    (4.4)

To understand how robust these ensembles are with respect to leaving one annotator out, we compute Krippendorff's α for this ensemble matrix.

4.3 Results and Analysis

4.3.1 Stability from data corruption

Figure 4.1 shows the mean of Krippendorff's α between Y and Y' for each behavioral code, over 10 iterations of each corruption process, together with the standard deviation. A horizontal black dashed line is plotted for reference, as α ≥ 0.8 is considered good inter-annotator agreement [64]. The top-left subfigure shows the stability analysis under noise for our proposed method. We observe two different clusters of behavioral codes: (1) the top six behavioral codes, for which inter-annotator agreement is not strongly affected by noise corruption and mostly stays above α = 0.9; and (2) a group of behavioral codes that is more affected by noise, for which most values drop below α = 0.8 at p = 1/3 (Eq. 4.2).

The top-right subfigure shows the same results for the average ensembles. We observe a similar trend for the top behavioral codes, which even achieve higher values of α. Nevertheless, overall performance decays faster for each behavioral code, with curves going below α = 0.8 from p = 1/7 onward; for the lower cluster of behavioral codes, all of them go below α = 0.8.

To investigate possible reasons for the differences in performance, we have plotted the histograms of ratings for each behavioral code in the bottom row of Figure 4.1. Behavioral codes whose values are highly skewed towards a rating of 1 (such as "solicits partner's perspective", "clearly defines problem", "offers solutions", "pressures for change", and "withdraws") are not stable under slight variations of the ratings, while behavioral codes with good relative frequency for each rating in the scale excel under perturbations. The relatively higher stability across behavioral codes suggests that our proposed algorithm is more robust against perturbations, meaning that it can handle noisy inputs from annotators and better find the underlying structure of the data.

4.3.2 Leave-one-out average agreement

Table 4.1 shows the results for the leave-one-out cross-validation experiment. The results show that both methods perform very similarly. For 5 out of 13 behavioral codes, ᾱ_Copeland shows a statistically significant increase (p < 0.05) over ᾱ_mean, while for the remaining 8 out of 13 we cannot reject that ᾱ_mean and ᾱ_Copeland are the same. This implies that replacing each annotator with our ensemble
produces more inter-annotator agreement than human annotators alone, and does at least as well as taking the mean of the ratings.

Table 4.1: Inter-annotator agreement for all behavioral codes. α: Krippendorff's α for the original annotations; ᾱ_mean, ᾱ_Copeland: average Krippendorff's α computed by leaving one annotator out and replacing it with the n − 1 annotators' ensemble († indicates statistically significantly higher than ᾱ_mean, p < 0.05).

  Behavioral code            α        ᾱ_mean    ᾱ_Copeland
  acceptance of the other    0.6279   0.7056    0.7103
  blame                      0.6364   0.7106    0.7189
  responsibility for self    0.4785   0.5745    0.5855†
  solicits perspective       0.3497   0.4635    0.4646
  states external origins    0.3764   0.4928    0.5092†
  discussion                 0.6207   0.6943    0.7050†
  clearly defines problem    0.4229   0.5239    0.5267
  offers solutions           0.5474   0.6280    0.6293
  negotiates                 0.7010   0.7612    0.7659
  makes agreements           0.5941   0.6744    0.6860†
  pressures for change       0.3939   0.4978    0.4996
  withdraws                  0.4072   0.5068    0.5093
  avoidance                  0.4121   0.5198    0.5325†

4.3.3 Leave-one-out ensemble agreement

Results for the ensemble robustness experiment with one annotator left out are shown in Table 4.2. Our proposed algorithm has greater ensemble agreement for 12 of the 13 rated behavioral codes. This means that our algorithm is able to extract more information from the group's overall agreement than the mean of the ratings is able to capture, implying lower variability coming from each individual annotator.

Table 4.2: Krippendorff's α for the ensemble matrices Y built from the mean and Copeland ensembles in a leave-one-annotator-out setup (Equation 4.4).

  Behavioral code            ᾱ_mean    ᾱ_Copeland
  acceptance of the other    0.9730    0.9773
  blame                      0.9800    0.9857
  responsibility for self    0.9618    0.9795
  solicits perspective       0.9498    0.9631
  states external origins    0.9435    0.9737
  discussion                 0.9796    0.9875
  clearly defines problem    0.9569    0.9626
  offers solutions           0.9699    0.9643
  negotiates                 0.9836    0.9866
  makes agreements           0.9791    0.9862
  pressures for change       0.9317    0.9633
  withdraws                  0.9394    0.9641
  avoidance                  0.9368    0.9694

4.4 Discussion

The results suggest that the proposed algorithm is capable of finding finer underlying structure within the data than an annotator ensemble based on averaging ratings. Although exact assessment of the algorithm is challenging due to the high quantity of missing values in the present dataset and the lack of a ground truth, the experiments suggest that, compared to the average ensemble, the proposed algorithm is more robust to locally changing values due to annotation noise/interpretation and more robust with respect to missing annotators.

Poor results for 6 out of 13 behavioral codes in the stability experiment suggest intrinsic problems in the distribution of behavioral codes. Observing all subfigures of Figure 4.1 in parallel shows that low robustness to noise is directly related to highly skewed rating distributions, where high values are poorly represented. When this occurs, both algorithms lack robustness; nevertheless, this is more prominent in the average ensemble. Our algorithm suffers from this condition because existing low ratings are counted as wins over missing values, which is why the missing values must be imputed. This issue suggests the exploration of other methods to impute missing data.

The proposed method forces a mapping from the ranking space (led by the points from Copeland's method) into the rating space. This lets us assess our algorithm against an average baseline while keeping the output in the original rating space. Nevertheless, this mapping forces a change in the distribution of scores, which is often close to uniform (figure not included). Our mapping is a simple solution to this complex problem, but it needs revision, as it may be sub-optimal and has the potential to affect the estimation of the underlying ground truth.
From a classification/regression perspective, working in the ranking space offers more resolution and a close-to-uniform distribution of points, which may lead to improvements in finding audio-visual features that are highly correlated with each one of the behavioral codes, as well as improvements in the overall classification/regression performance.

Finally, our algorithm partially answers the question: how can we compare ratings from different annotators? While directly comparing the ratings of different annotators is a highly complex task, comparing their preferences and weighing them is a natural approach for annotator fusion. This permits fusion across different scale uses (or scale misuses) among annotators, while providing higher-resolution results in the ranking space.

4.5 Conclusions

The present chapter presents and explores an algorithm to combine annotations from several annotators based on an adapted version of Copeland's counting method, reinterpreting voters' choices as preferences. This algorithm is presented to support the use of audio-visual machine learning techniques in the quantification of human behavior and behavioral expressions, as a means to support clinicians and psychology researchers in the study of latent and subjective constructs.

The proposed algorithm has been implemented and its performance tested on a real-world couples' therapy dataset, where annotators rated 13 different behavioral codes pertinent to couples' therapy, such as acceptance of the other and blame, among others. The results suggest that the method is able to recover finer structure from the underlying latent state of the behavioral code, by being more robust to noise and by better capturing the group consensus. Nevertheless, the algorithm needs data imputation to succeed, an area that needs further exploration in the ratings domain.

Chapter 5

Combining Likert-scale Ratings through Triplet Comparisons

5.1 Introduction

In previous chapters, we have described different methods to aggregate annotations using both triplet comparisons (in the case of aggregation over time) and pairwise comparisons (in the case of Likert-scale annotations). The assumptions made in Chapter 3 do not involve a dependency on the time domain; therefore, the algorithm can be naturally extended to session-level aggregation of annotations. In this chapter, we propose the use of triplet comparisons to learn an embedding of session-level annotations, even in the case where there are absent annotations.

Problem statement

We are given a partially observed matrix A describing the annotations of a single behavioral code. The rows of A correspond to items (or sessions) and the columns correspond to annotators. We want to find a vector (or ensemble) Y such that each session is described by a scalar capturing the information of the annotations. In this setting, we have access to the description of the variable being annotated, but there are no corresponding features.

Data description

Our experiments in this chapter are done over two different datasets.

Couples' Therapy Behavioral Codes

In part of the experiments, we use the ratings derived from the Couples' Therapy corpus of dyadic spoken interactions of a husband and wife in marital counseling [60], as defined by the Couples Interaction Rating System (CIRS) [61, 62]. It has thirteen different behavioral codes (listed in Table 4.1 in Chapter 4), and every code is rated on a scale from 1 to 9.
COACH Fidelity Rating System

In a second part of the experiments, we use the COACH Fidelity Rating System, which is used to assess the extent to which a caregiving provider displays adherence to two linked intervention protocols: the Family Check-Up [66] and the Everyday Parenting program [67]. This rating system is composed of five codes (C1, O, A, C2, H), each on a scale from 1 to 9 (with half points).

Annotations

Couples' Therapy Behavioral Codes: The dataset contains a total of 5367 annotations that correspond to 1538 unique sessions from 186 couples, rated by seventeen trained annotators. Ratings include the behavioral assessment of husband and wife along the thirteen behavioral codes. Each session is rated by a subset of the annotators ranging from 2 to 9 annotators, for a total of 5367 out of 26,146 possible ratings.

COACH codes: The dataset contains a total of 246 annotations that correspond to 197 unique sessions from 111 families, rated by four trained annotators. For each session, we have the transcripts of the session, which may be used as features to train a supervised model. The ratings include the behavioral assessment of the provider along the five codes. Each session is rated by a subset of the annotators ranging from one to four annotators, for a total of 246 out of 788 possible ratings.

Evaluation

Couples' Therapy Behavioral Codes: As discussed in Chapter 4, there is no direct way to assess the quality of Y in this dataset. Therefore, we evaluate our algorithm indirectly, by studying how the estimated ensemble Y changes if we compute it from the matrix A plus noise, to allow a comparison with the results in Chapter 4.

COACH codes: This data allows us to train a model to predict the COACH codes based on the transcripts of each session. Therefore, we assess the algorithm in two ways: (1) by studying how the estimated ensemble Y changes if we compute it from the matrix A plus noise, to allow a comparison with the results in Chapter 4, and (2) by training a supervised model on each set of labels Y and checking which set of labels can be better learned (or predicted) by the model.

5.1.1 Algorithm: Combining Likert-scale Ratings through Triplet Comparisons

Combining Multiple Annotations

In Chapter 3, we proposed a method to aggregate annotations using triplet comparisons through time. However, we did not assume any specific properties of the signal over time. Algorithm 3 shows a generalized version of the aforementioned method, where absent (or missing) entries are handled by comparing only triplets that exist. Algorithm 3 builds triplets by looking only at the entries that are present in the annotation matrix A, without needing imputation, unlike the modified Copeland's method (Algorithm 2) presented in Chapter 4.

Scaling: Rankings to Ratings

The triplet embedding approach produces a 1-dimensional embedding where the relative distances between items (in this case, sessions) are scaled with respect to those of the underlying construct. Even though this embedding could be interpreted as a ranking of sessions (as in Chapter 4), in this case we take the approach introduced in Chapter 3, where we use the mean value of the ratings as a reference to scale to.

5.1.2 Assessment of the Algorithm

Stability Analysis

We assess the algorithm using the Couples' Therapy corpus as well as the COACH corpus. For both datasets, we conduct the stability analysis introduced in Chapter 4, as a means to directly compare the new method with the previously proposed alternatives.
Algorithm 3: Generate set of triplets S from a partially complete matrix.
  Data: A ∈ R^{n×|A|}: annotations matrix
  Input: r ∈ R^{|A|}: annotator weights
  Result: set of sampled triplets S
  T ← {(i, j, k) | i ≠ j < k ≠ i, 1 ≤ i, j, k ≤ n}
  S ← {}
  for a ← 1 to |A| do
      // Compute all pairwise distances between values of each column of A.
      // Each D_a is a distance matrix between all points in A[:, a].
      D_a ← distances(A[:, a])
  end
  for t = (i, j, k) ∈ T do
      w_t ← 0
      for a ← 1 to |A| do
          // Accumulate each annotator's decision for the unique triplet (i, j, k),
          // using only the entries that are present.
          if D_a[i, j] exists and D_a[i, k] exists then
              if D_a[i, j] < D_a[i, k] then
                  w_t ← w_t − r_a
              else if D_a[i, j] > D_a[i, k] then
                  w_t ← w_t + r_a
              end
          end
      end
      // Keep the triplet unless there is a tie.
      w_t ← sign(w_t)
      if w_t ≠ 0 then
          S ← S ∪ {(i, j, k)}
      end
  end

Learning the Labels through a Supervised Model

Moreover, for the COACH dataset we train a support vector regression (SVR) model (kindly provided by Dillon Knox). This model uses the transcripts of the sessions to predict the labels generated by each method. Since the number of sessions is small (197), we use a 5-fold cross-validation approach to train the model, and report the average mean squared error (MSE) over all folds (a sketch of this protocol is given at the end of this section).

5.2 Results

5.2.1 Perturbation Analysis

Couples' Therapy Behavioral Codes: Figure 5.1 shows the results for the perturbation and stability analysis. Specifically, Figure 5.1c shows the results of the triplet embedding approach, where we observe that Krippendorff's α between the originally estimated ensemble Y and its corrupted version Y' is typically lower than when using the mean or the rank aggregation method. We do observe, however, trends comparable with the two previously discussed methods: the performance of the triplet embedding approach in this experiment depends on the distribution of the ratings (for example, codes with highly skewed histograms perform worse). In the case of triplet embeddings, since we only collect triplets where D^a_ij ≠ D^a_ik, if there are too many entries with the same values the algorithm will not collect triplets associated with those values, breaking the assumption that triplets are collected uniformly with replacement.

COACH Codes: A somewhat similar trend is observed for the COACH codes (Figure 5.2). In this experiment, taking the mean performs best among the three methods, followed by Copeland's modified method. In this dataset, however, we do not observe many differences across codes: this can be seen in the histograms of the codes (Figure 5.3), where most of the codes are centered around the middle of the rating scale.

Table 5.1: MSE results for a fixed SVR model trained using labels from different annotation fusion approaches (scaling method in parentheses). Results are over 5-fold cross-validation.

  Code   Mean (no scaling)   Copeland's [68] (distribution)   Triplet Embedding (Procrustes)
  C1     0.664               0.895                            0.643
  O      0.674               0.824                            0.657
  A      0.754               0.959                            0.649
  C2     0.811               1.176                            0.745
  H      0.907               1.021                            0.769

5.2.2 Learning the Labels

We report the 5-fold cross-validation results of the SVR model trained over 197 sessions in Table 5.1, where each column represents the mean error over the labels generated by the corresponding algorithm, with the scaling method employed shown in parentheses. The table shows that training over triplet embedding labels generates better results across all codes, and that the difference in error grows as the difficulty of predicting the code increases.
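The evaluation protocol behind Table 5.1 can be sketched in Julia as follows; fit and predict are hypothetical stand-ins for the SVR training and inference routines, which we do not reproduce here:

    using Random, Statistics

    # 5-fold cross-validated MSE for labels y given features X (one row per session).
    function cv_mse(X::AbstractMatrix, y::AbstractVector, fit, predict; k::Int = 5)
        n = length(y)
        folds = collect(Iterators.partition(shuffle(1:n), cld(n, k)))
        mses = Float64[]
        for test in folds
            train = setdiff(1:n, test)
            model = fit(X[train, :], y[train])    # train on the remaining folds
            ŷ = predict(model, X[test, :])        # predict the held-out fold
            push!(mses, mean((ŷ .- y[test]) .^ 2))
        end
        return mean(mses)                         # average MSE over the k folds
    end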
5.3 Analysis

5.3.1 Stability Analysis

In Figure 5.1 and Figure 5.2, we have shown how the different algorithms behave under perturbations of their input ratings, measuring performance using Krippendorff's α. In the social sciences, we are interested in the reliability of the algorithms against changes in the ratings given by annotators. However, from a machine learning perspective, if we are trying to learn the metric information instead of mapping back to the original rating space, Krippendorff's α might not be a good performance measure. Moreover, as observed in Figure 5.4, when using global information rather than only local information, certain ratings are adjusted; this would be penalized in the case of triplet embeddings but not in the case of the mean or rank aggregation, which map back to the original rating values.

In Figure 5.5a and Figure 5.5b, we show the annotation matrices for both rating systems. Here, we observe that there are many missing entries in both matrices, with a clear structure in the annotations: in both cases, annotator behavior generates clusters of annotations within the matrices, showing systematic missing values. These absent values and their distribution over the matrix may explain why the triplet embedding approach does poorly in the corruption and stability analysis: if we have annotations from only a small group of annotators for a given triplet of sessions, changing a few values can more easily flip the comparison for that triplet.

5.3.2 Learning the Labels

The results we have presented on learning the COACH codes raise the question: why are labels learned using a triplet embedding more learnable than those of the other two approaches? In Figure 5.4, we plot the labels used to train one of the codes (the continuous lines are included for visual purposes only, since there is no specific order of the sessions, but they make it easier to compare the ratings). We observe that outliers obtained by taking the mean are regressed back towards the mean by the triplet embedding approach, while a few new outliers stand out in the triplet embedding approach. Moreover, since we do not restrict the embedding Y produced by the triplet comparisons to values in the rating scale, there are several small corrections. We recall that the triplet embedding approach allows us to learn the metric information from the ratings (observations), rather than imposing values over the rating scale.

5.4 Conclusions

In this chapter, we have presented an algorithm based on triplet comparisons to find embeddings Y that summarize the information annotated by all raters. A key difference between this algorithm and those studied in Chapter 4 is that we learn the metric information present in the annotations provided by raters, rather than simply staying in the rating space. Moreover, the algorithm uses not only local information about the ratings for each session, but both local and global information, to learn an embedding that resembles the annotation behavior. This property of the learned embedding allows it to correct outliers or extreme values in the COACH dataset that we have studied, while also providing labels that are more learnable when training a regression model over them, compared to training the same model over the mean of the annotations or over the aggregated rank provided by the modified Copeland's method introduced in Chapter 4.
Figure 5.1: Data corruption stability analysis for the Couples' Therapy behavioral codes. Panels: (a) mean (baseline); (b) modified Copeland's method [68]; (c) triplet embeddings. Each rating has been corrupted by noise, where p represents the probability of the original rating being increased/decreased by 1. The plotted values are the inter-rater agreement between the original ratings' ensemble and the corrupted ratings' ensemble, averaged over 10 iterations. The error bars correspond to one standard deviation. Note: no noise corruption implies a Krippendorff's α of 1 for every behavioral code. Legend: acceptance of the other, blame, responsibility for self, solicits perspective, states external origins, discussion, clearly defines problem, offers solutions, negotiates, makes agreements, pressures for change, withdraws, avoidance.

Figure 5.2: Data corruption stability analysis for the COACH codes. Panels: (a) mean (baseline); (b) modified Copeland's method [68]; (c) triplet embeddings. Each rating has been corrupted by noise, where p represents the probability of the original rating being increased/decreased by 1. The plotted values are the inter-rater agreement between the original ratings' ensemble and the corrupted ratings' ensemble, averaged over 10 iterations. The error bars correspond to one standard deviation. Note: no noise corruption implies a Krippendorff's α of 1 for every code. Legend: C1, O, A, C2, H.

Figure 5.3: Rating distributions for each annotated COACH code. The y-axis represents relative counting frequency, while the x-axis represents code ratings (scale is 1 to 9). Histograms' colors refer to the codes as defined in the legend of Figure 5.2.

Figure 5.4: Fusion results for the mean and triplet embedding approaches, plotted by session ID.

Figure 5.5: Annotation matrices for (a) the Couples' Therapy behavioral codes and (b) the COACH rating system. The purple color reflects missing entries in these matrices, which are absent annotations. The scales of both rating systems are quantized.

Part II

Re-Thinking Annotations and their Evaluation

Chapter 6

A New Sampling Approach

6.1 Introduction

In the previous chapters, we proposed an algorithm to combine annotations acquired in real-time. When plotting the output of our algorithm against a known label, we observed that annotations performed in real-time suffer from a low-pass filtering effect, and that there are local biases associated with a short-term memory of the construct being annotated.

In this chapter, we study the performance of a new methodology to acquire annotations and create a single label for regression by changing the sampling procedure of the latent construct.
We sample this information by asking annotators questions of the form "Is the signal in time frame i more similar to the signal in time frame j or in time frame k?" to build a 1-dimensional embedding Y. Figure 6.1 shows an example of a query in the proposed sampling method, where the comparison is based on the perceived shade (intensity) of the color.

Problem Statement

Let Y* be a latent construct that varies in time. We want to train a regression model m ∈ M such that we can estimate this construct from unseen data. In this setting, we have access to features X ∈ X, where X is the feature space. The problem is to generate labels Y that faithfully resemble Y*.

Data

To study the properties and performance of our algorithms, we use the green intensity experiment presented in Chapter 1.

Annotations

Formally, we propose that annotators perform the following mapping:

    g_a : \mathcal{X} \times \mathcal{X} \times \mathcal{X} \to \{-1, +1\},    (6.1)
    (x_i, x_j, x_k) \mapsto y_t^a,    (6.2)

where x_i is the video frame at time i. We use a set of queried triplets S and the corresponding annotations {y_t^a}_{a∈A} to estimate the embedding Y that resembles the underlying construct.

As previously discussed, we motivate this approach using three key observations. First, psychology and machine learning/signal processing studies have shown that people are better at comparing items than at rating them [36, 37, 38, 39], so this sampling mechanism is easier for annotators than requesting absolute ratings in real-time. Second, the use of triplet embeddings naturally solves the problem of combining the annotations, since this is done by taking the union of sets (details in Section 6.2). Third, triplet embeddings offer a simple way of verifying the agreement of the annotations, given by the number of triplet violations in the estimated embedding.

Evaluation

We evaluate our algorithm using Eq. 3.6:

    \min_{s, b}\ \frac{1}{n} \lVert sY - b\mathbf{1} - Y^* \rVert_2^2,

so that the scaling and bias of the embedding and of the baselines are not factors that we consider. Moreover, we can estimate the proportion of triplet violations that each algorithm (the proposed one and the baselines) incurs.

6.2 Annotating Triplets with Multiple Annotators

Eq. 2.9 shows a way to encode the decision of a single annotator when queried for a decision as in Eq. 2.7. However, for multiple annotators we need to extend this model. Let A be a set of annotators. We define S_a as the set of triplets annotated by annotator a ∈ A, so we observe a random variable y_t^a for each t ∈ S_a. The annotations are defined as:

    y_t^a = \begin{cases} -1, & \text{w.p. } f_a(D_{ij} - D_{ik}) \\ +1, & \text{w.p. } 1 - f_a(D_{ij} - D_{ik}), \end{cases}    (6.3)

where f_a is the noise model that drives the probabilities for each annotator.

6.2.1 Annotation fusion

Due to annotation costs, we choose the sets S_a to be disjoint:

    \mathcal{S} = \bigcup_{a \in \mathcal{A}} \mathcal{S}_a \quad \text{and} \quad \bigcap_{a \in \mathcal{A}} \mathcal{S}_a = \varnothing,    (6.4)

so that all queries are unique and any annotated triplet (i, j, k) is labeled by at most one annotator. As seen in Chapter 3, this is not necessary, but it helps us minimize annotation costs. Note that the fusion process occurs in this step: annotation fusion in a triplet embedding approach is performed by taking the union of all the individually generated sets S_a to generate a single set of triplets S, and using all the corresponding triplet annotations y_t^a, defined for each annotator and each corresponding triplet t ∈ S.

One difficulty of this multi-annotator model is that the distribution of y_t^a depends on the annotators through f_a, and hence the induced loss function in a maximum likelihood setting is annotator-dependent. Fortunately, in our experiments we can assume f_a = f, as we show experimentally in Figure 6.3 (the price we pay is some bias in the empirical risk minimization; see Remark 2.3.3).
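For intuition, here is a small Julia sketch of the decision model in Eq. (6.3), using the logistic link of Eq. (6.16) as one possible choice of f (all names are ours):

    # Logistic noise model (cf. Eq. 6.16); σ controls the steepness near D_ij ≈ D_ik.
    f(d; σ = 20.0) = 1 / (1 + exp(σ * d))

    # One annotator's answer for the triplet (i, j, k), drawn as in Eq. (6.3):
    # −1 w.p. f(D_ij − D_ik), i.e. "j is more similar to i than k is".
    function annotate(y::AbstractVector, i::Int, j::Int, k::Int)
        d = abs(y[i] - y[j]) - abs(y[i] - y[k])
        return rand() < f(d) ? -1 : +1
    end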
6.2.2 Triplet violations and annotation agreements

A triplet violation occurs when a given triplet t = (i, j, k) ∈ S does not follow the estimated embedding Y, that is:

    \lVert y_i - y_k \rVert_2 < \lVert y_i - y_j \rVert_2, \quad (i, j, k) \in \mathcal{S}.    (6.5)

Therefore, we can count the fraction of triplet violations using:

    \epsilon_v = \frac{1}{|\mathcal{S}|} \sum_{(i,j,k) \in \mathcal{S}} \big[\, \lVert y_i - y_k \rVert_2 < \lVert y_i - y_j \rVert_2 \,\big],    (6.6)

where [·] is the indicator bracket (1 if the condition is true, 0 otherwise).

To compute the expected number of correctly annotated triplets in S, we can derive another random variable that models the correct annotation of triplet t = (i, j, k) based on f:

    c_t = \begin{cases} 0, & \text{w.p. } 1 - f(|D_{ij} - D_{ik}|) \\ 1, & \text{w.p. } f(|D_{ij} - D_{ik}|), \end{cases}    (6.7)

where f(|D_ij − D_ik|) is the probability of successfully annotating triplet t = (i, j, k). Using Eq. 6.7, we can model the number of correctly labeled triplets as a Poisson binomial (PB) random variable C:

    C = \sum_{t = (i,j,k) \in \mathcal{S}} c_t \sim \mathrm{PB}\big(f(|D_{ij} - D_{ik}|),\ |\mathcal{S}|\big).    (6.8)

Its expected value is the sum of the success probabilities:

    \mathbb{E}[C] = \sum_{t = (i,j,k) \in \mathcal{S}} f(|D_{ij} - D_{ik}|).    (6.9)

After computing Y from S, and assuming that the optimization routine has found the best possible embedding Y for S, the fraction of triplet violations ε_v in Y is linearly related to C by:

    \epsilon_v = 1 - C / |\mathcal{S}|, \quad \text{or} \quad \mathbb{E}[\epsilon_v] = 1 - \mathbb{E}[C] / |\mathcal{S}|.    (6.10)

ε_v ∈ [0, 1] is a measure of disagreement between all triplets used to compute the embedding Y. ε_v = 0 means that all used triplets agree with the computed embedding Y, meaning that all triplet annotations agree with each other.

6.2.3 Unbiased Triplet Embedding

As we discussed in Chapter 2, in a maximum likelihood framework the loss ℓ in the empirical risk defined in Eq. 2.18 is induced by the link function f in the noise model (Eq. 2.9) [40]. For example, if f is the logistic function, then ℓ is the logistic loss.

In a real-world setting, however, it is unlikely that we know or can estimate the noise-generating function f, since the true embedding is unknown and only estimated. Using a loss ℓ not induced by the noise-generating function f results in a bias of the empirical risk, which may be observed as non-linear noise in estimated 1-dimensional embeddings (see Figure 6.4). We show the bias of the empirical risk empirically in Figure 6.2 when using a non-paired function f and loss ℓ.

In [49], the authors describe a simple approach to debias the empirical risk in the presence of noisy labels for the general problem of classification. Let ρ₊ and ρ₋ be the proportions of noisy labels for the positive and negative classes, respectively. By Lemma 1 of [49], we can construct an unbiased estimator of the loss ℓ from noisy labels w̃_t by defining the following surrogate loss:

    \tilde{\ell}(z, \tilde{w}_t) := \frac{(1 - \rho_{-\tilde{w}_t})\, \ell(z, \tilde{w}_t) - \rho_{\tilde{w}_t}\, \ell(z, -\tilde{w}_t)}{1 - \rho_+ - \rho_-},    (6.11)

where ℓ is a margin loss. Inspection of this surrogate loss shows that it is simply a weighted combination of the loss evaluated at the positive and the negative label, weighted according to the expected noise rate of each class. Then, it is possible to show that:

    \mathbb{E}_{\tilde{w}_t}\big[\tilde{\ell}(z, \tilde{w}_t)\big] = \ell(z, w_t).    (6.12)

A property of the surrogate loss ℓ̃ is that it is convex if ℓ is convex, twice differentiable, and has the following symmetry property (Lemma 4 of [49]):

    \forall z \in \mathbb{R}, \quad \ell''(z, w) = \ell''(z, -w).    (6.13)

An example of such a loss is the logistic loss.
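A Julia sketch of this surrogate for the logistic loss follows (our own illustration; ρp and ρm stand for the assumed class noise rates ρ₊ and ρ₋):

    # Logistic margin loss and its noise-corrected surrogate (Eq. 6.11).
    loss(z, w) = log(1 + exp(-w * z))

    function surrogate(z, w; ρp = 0.15, ρm = 0.15)
        ρw  = w == 1 ? ρp : ρm        # ρ_{w̃}: rate of the observed label's class
        ρnw = w == 1 ? ρm : ρp        # ρ_{−w̃}: rate of the opposite class
        return ((1 - ρnw) * loss(z, w) - ρw * loss(z, -w)) / (1 - ρp - ρm)
    end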
The Triplet Embedding Scenario

In the case of triplet embeddings, we can use the previous formulation to obtain an unbiased estimator of the risk:

    \hat{R}_{\tilde{\ell}}(G) = \frac{1}{|\mathcal{S}|} \sum_{t \in \mathcal{S}} \tilde{\ell}\big(\tilde{w}_t, \langle L_t, G \rangle_F\big),    (6.14)

which, set up as an empirical risk minimization problem, is a novel formulation of the triplet embedding problem.

Choosing the noise rate

When sampling triplets t = (i, j, k), the outcome of the annotation process depends on how (i, j, k) are presented to the annotator (see Figure 3.4). Therefore, we will assume the following claim is true for the debiased triplet embedding risk:

Claim 6.2.1 (Informal) In the triplet embedding scenario, class noise is symmetric with high probability, i.e.,

    \rho_+ \approx \rho_-.    (6.15)

6.3 Experiments

We conduct two simulation experiments and one human annotation experiment using Amazon Mechanical Turk (https://www.mturk.com) to verify the efficacy of our approach. We use the synthetic green intensity dataset proposed in [26], for which the values of Y* are known. We use this data because the reconstruction errors can be computed and we can assess the quality of the resulting labels, in contrast to experiments with affect, where the underlying signal is unknown.

To construct our triplet problem, we first downsample the videos to 1 Hz, so that the number of frames n equals the length of the video in seconds, to reduce the number of unique triplets. We also set the dimension d to 1, since we want to find a 1-dimensional embedding Y that represents the intensity of green color over time.

6.3.1 Synthetic triplet annotations

We simulate the annotation procedure by comparing the scalar green intensity values of frames of the video using the absolute value of the difference between points. Hence, the dissimilarity for Eq. 2.7 is s(y_i, y_j) = |y_i − y_j|, where i and j are time indices. We generate a list of noisy triplets S by randomly and uniformly selecting each triplet (i, j, k) from the pool of all possible unique triplets. Each triplet t = (i, j, k) is correctly annotated by y_t with probability f(|D_ij − D_ik|).

We test eight different fractions of the total possible number of triplets |T| using logarithmic increments, such that |S| = {0.0005, ..., 0.1077}·|T|, which goes from 0.05% to 10.77% of |T|. We use a logarithmic scale to have more resolution for smaller percentages of the total number of possible unique triplets. Note that for 267 frames (task A), the total number of unique triplets is 9,410,415. The queried triplets are randomly and uniformly sampled from all possible unique triplets, since there is no evidence of better performance for active sampling algorithms in this problem [52, 69].

We use various algorithms available in the literature to solve the triplet embedding problem: Stochastic Triplet Embedding (STE) [70] (with σ = 1/√2), t-Student Stochastic Triplet Embedding (tSTE) [70] (with α ∈ {2, 10}), Generalized Non-metric Multidimensional Scaling (GNMDS) [71] (parameter-free) with hinge loss, and Crowd Kernel Learning (CKL) [72] (with μ ∈ {2, 10}). We use gradient descent to optimize all the loss functions proposed by these algorithms. Note that STE and GNMDS pose convex problems, while tSTE and CKL pose non-convex problems; we therefore perform 30 different random starts for each set of parameters.

We now describe the three experimental settings we use to validate our approach.

Simulation 1: Constant success probabilities

We choose f(|D_ij − D_ik|) to be approximately constant, such that the probability is f(|D_ij − D_ik|) = μ + ε, where ε ~ N(0, σ²) with σ = 0.01 (to add small variations). We run three different experiments for μ ∈ {0.7, 0.8, 0.9}. Picking the values of f(|D_ij − D_ik|) randomly affects our calculation of E[C] (Eq. 6.9), but we will assume that these have been fixed a priori, meaning that the annotation process has a fixed probability of labeling any triplet (i, j, k).

Figure 6.1: Question design for queries in Mechanical Turk. The annotator is shown a reference shade and two options A and B, with the instruction: "Between option A and option B, select the shade that most closely resembles the color shade of the reference."

Simulation 2: Logistic probabilities

A more realistic simulation is given by labeling the triplets in S according to the Bradley–Terry–Luce model:

    f(D_{ij} - D_{ik}) = \frac{1}{1 + \exp\big(\sigma (D_{ij} - D_{ik})\big)}.    (6.16)

We use different values σ ∈ {2, 6, 20}, which drive the steepness of the logistic function, mostly in the vicinity of D_ij ≈ D_ik. Intuitively, the triplets with smaller differences between D_ij and D_ik should be harder to annotate, so this is a more realistic noise model than constant errors independent of the difficulty of the task. Note that this noise model induces the logistic loss used in STE.

6.3.2 Mechanical Turk triplet annotations

Using the list of images generated earlier, we sample 0.5% of the total number of triplets of images randomly and uniformly. In this setting, we sample approximately Kn log(n) triplets, with K = 31.5 for task A and K = 15 for task B. To compute the embedding, we use STE with parameter σ = 1/√2, as well as the newly proposed surrogate loss with ρ = 0.15. To obtain the list of annotated triplets, we show the annotators options A and B against a reference, with instructions as in Figure 6.1. We do not provide further instructions for the case where D_{Reference,A} ≈ D_{Reference,B}. For this task, we paid the annotators $0.02 per answered query. An example query is shown in Figure 6.1.

Figure 6.2: MSE as a function of the percentage of observed triplets |S|/|T| × 100 with constant and logistic noise in the triplet labels, for tasks A and B and for GNMDS, CKL (μ = 2, 10), tSTE (α = 2, 10), and STE. Each point in the plots represents the mean over 30 random trials, while the shaded areas represent one standard deviation from the average MSE values.

6.3.3 Error measure

We use the error measure proposed in [73], and compute the error by first solving the following optimization problem:

    \mathrm{MSE} = \inf_{s, b}\ \frac{1}{n} \lVert sY - b\mathbf{1} - Y^* \rVert_2^2,    (6.17)

where s, b ∈ R are the scaling and bias factors, and n is the length of Y (a short sketch of this computation is given at the end of this section). We use this MSE and not a naive MSE between the ground truth Y* and the reconstructed label Y because the embeddings are optimal only up to scaling and bias factors. Hence, this approach yields a fairer assessment of the quality of the embeddings. We also report Pearson's correlation between the ground truth and the estimated embedding, to compare our method with other proposed algorithms in a scale-free manner.

6.3.4 Comparison to other methods

We compare the proposed annotation and fusion framework with two different approaches that use real-time annotations: EvalDep [74] and the EM-based approach (after time alignment using EvalDep's method) from [24], with window lengths of 4, 8, 16, and 32.
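The error measure of Eq. (6.17) reduces to a least-squares fit of s and b followed by an MSE; here is a Julia sketch (our own, using only base linear algebra and Statistics):

    using Statistics

    # Scale- and bias-invariant MSE (Eq. 6.17): align the embedding y to the
    # ground truth ystar with the least-squares optimal s and b, then score it.
    function aligned_mse(y::AbstractVector, ystar::AbstractVector)
        A = [y ones(length(y))]        # design matrix for s·y + b
        s, b = A \ ystar               # least-squares solution for (s, b)
        return mean((s .* y .+ b .- ystar) .^ 2)
    end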
Simulation 1: Constant success probabilities

We choose $f(|D_{ij} - D_{ik}|)$ to be approximately constant, such that $f(|D_{ij} - D_{ik}|) = \mu + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma^2)$ with $\sigma = 0.01$ (to add small variations). We run three different experiments for $\mu \in \{0.7, 0.8, 0.9\}$. Picking the values of $f(|D_{ij} - D_{ik}|)$ randomly affects our calculation of $\mathbb{E}[C]$ (Eq. 6.9), but we will assume that these values have been fixed a priori, meaning that the annotation process has a fixed probability of correctly labeling any triplet $(i, j, k)$.

Simulation 2: Logistic probabilities

A more realistic simulation is given by labeling the triplets in $\mathcal{S}$ according to the Bradley-Terry-Luce model:

$$ f(D_{ij} - D_{ik}) = \frac{1}{1 + \exp\left( \sigma (D_{ij} - D_{ik}) \right)}. \tag{6.16} $$

We use different values $\sigma \in \{2, 6, 20\}$, which drive the steepness of the logistic function, mostly in the vicinity of $D_{ij} \approx D_{ik}$. Intuitively, triplets with smaller differences between $D_{ij}$ and $D_{ik}$ should be harder to annotate, so this is a more realistic noise model than constant errors independent of the difficulty of the task. Note that this noise model induces the logistic loss used in STE.

6.3.2 Mechanical Turk triplet annotations

Using the list of images generated earlier, we sample 0.5% of the total number of triplets of images randomly and uniformly. In this setting, we sample approximately $K n \log(n)$ triplets, with $K = 31.5$ for task A and $K = 15$ for task B. To compute the embedding we use STE with parameter $\sigma = 1/\sqrt{2}$, as well as the newly proposed surrogate loss with $\rho = 0.15$.

To obtain the list of annotated triplets, we show the annotators options A and B against a reference, with instructions as in Figure 6.1. We do not provide further instructions for the case where $D_{\text{Reference},A} = D_{\text{Reference},B}$. For this task, we paid the annotators $0.02 per answered query. An example query is shown in Figure 6.1.

[Figure 6.1: Question design for queries in Mechanical Turk. Annotators see a reference shade and two options, A and B, with the instruction "Between option A and option B, select the shade that most closely resembles the color shade of the reference."]

[Figure 6.2: MSE as a function of the percentage of observed triplets $|\mathcal{S}|/|\mathcal{T}| \times 100$ with constant ($\mu \in \{0.7, 0.8, 0.9\}$) and logistic ($\sigma \in \{2, 6, 20\}$) noise in triplet labels, for tasks A and B, comparing GNMDS, CKL ($\mu \in \{2, 10\}$), tSTE ($\alpha \in \{2, 10\}$), and STE. Each point in the plots represents the mean over 30 random trials, while the shaded areas represent one standard deviation from the average MSE values.]

6.3.3 Error measure

We use the error measure proposed in [73], computing the error by first solving the following optimization problem:

$$ \mathrm{MSE} = \inf_{s, b \in \mathbb{R}} \frac{1}{n} \left\| s\hat{Y} - b\mathbf{1} - Y \right\|_2^2, \tag{6.17} $$

where $s, b \in \mathbb{R}$ are the scaling and bias factors, and $n$ is the length of $Y$. We use this MSE and not a naive MSE between the ground truth $Y$ and the reconstructed label $\hat{Y}$ because the embeddings are optimal only up to scaling and bias factors. Hence, this approach yields a fairer assessment of the quality of the embeddings (a short code sketch follows at the end of this section). We also report Pearson's correlation between the ground truth and the estimated embedding, to compare our method with other proposed algorithms in a scale-free manner.

6.3.4 Comparison to other methods

We compare the proposed annotation and fusion framework with two different approaches based on real-time annotations: EvalDep [74] and the EM-based approach (after time-alignment using EvalDep's method) from [24], with window lengths of 4, 8, 16, and 32.
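Because Eq. 6.17 is a two-parameter least-squares problem, the optimal scale and bias have a closed form. A minimal sketch (Julia; the function name is ours for illustration):

    using Statistics

    # Scale- and bias-invariant MSE of Eq. 6.17, plus Pearson's correlation.
    # Yhat is the estimated 1-D embedding; Ytrue is the ground truth.
    function scaled_mse(Yhat::AbstractVector, Ytrue::AbstractVector)
        n = length(Ytrue)
        A = [Yhat ones(n)]            # design matrix [Ŷ 1]
        coeffs = A \ Ytrue            # least-squares estimates of the scale and bias
        mse = sum(abs2, A * coeffs - Ytrue) / n
        return mse, cor(Yhat, Ytrue)
    end

The same fit produces the scaled embeddings plotted against the ground truth in Figure 6.4.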
6.4 Results and Analysis

6.4.1 Synthetic annotations

Figure 6.2 shows the MSEs as a function of $|\mathcal{S}|/|\mathcal{T}| \times 100$ for both synthetic experiments. For both constant and logistic noise in tasks A and B, we generally obtain better performance as the amount of noise in the triplet annotation process is reduced (larger $\mu$ or $\sigma$). This is not always true for the algorithms that pose non-convex loss functions (tSTE, CKL), where more noise sometimes generates better embeddings. We hypothesize that these algorithms sometimes find better local minima under noisier conditions.

The MSE in Figure 6.2 typically becomes smaller as $|\mathcal{S}|$ increases. This is generally true for tSTE, STE, and CKL; GNMDS does not always produce a better embedding as the number of triplets employed increases. We also note that the embedding in task B is easier to compute than that of task A. We see two possible reasons for this: (1) task A has constant intervals while task B has none, and constant regions may be harder to compute in noisy conditions; and (2) the extreme values in task A seem harder to estimate, since they occur over very short intervals of time that are less likely to be sampled.

Overall, STE is the best-performing algorithm independent of noise or task. We note that tSTE with $\alpha = 10$ approaches STE in many of the presented scenarios. In fact, tSTE becomes STE with $\sigma \to 1$ as $\alpha \to \infty$, so these results are expected (see Claim 2.4.1).

6.4.2 Mechanical Turk triplet annotations

Annotator noise

In the Mechanical Turk experiments, 170 annotators annotated triplets in task A, and 153 in task B. To understand the difficulty of the tasks and the noise distributions of the annotators, we estimate the probabilities of success $f(|D^*_{ij} - D^*_{ik}|)$ for both tasks, using the top three annotators in terms of the number of annotated triplets.

To estimate $f(|D^*_{ij} - D^*_{ik}|)$, we partition the triplets based on $|D^*_{ij} - D^*_{ik}|$ into intervals containing the same number of triplets. For each interval, we compute the average distance of its triplets. For each triplet $(i, j, k) \in I$, we know the outcome (realization) of the random variable $y^a_t$, since we know the hidden construct $Y$. We assume that the success probability $f_a(|D^*_{ij} - D^*_{ik}|)$ is constant within the interval, so that the number of correct answers in the interval follows a Binomial distribution with success probability $f_a(|D^*_{ij} - D^*_{ik}|)$. Finally, we use the maximum likelihood estimator of the success probability for each interval:

$$ \hat{f}^{I}_a(|D^*_{ij} - D^*_{ik}|) = \frac{1}{|I_a|} \sum_{(i,j,k) \in I_a} c^a_{ijk}. \tag{6.18} $$

In Figure 6.3, we show the function $\hat{f}_a(|D^*_{ij} - D^*_{ik}|)$ for each of the top three annotators (those with the most answered queries) and compare it to the logistic function with $\sigma = 20$. The comparison between the estimated probabilities of success and the logistic function shows that the latter is a very good noise model for this annotation task, and also tells us that we should expect the best results from STE when computing the embedding from the crowd-sourced triplet annotations. Notably, our initial assumption of an annotator-independent noise model is verified.
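A sketch of this binned estimator (Julia; equal-count bins, with names chosen for illustration): deltas[t] holds $|D^*_{ij} - D^*_{ik}|$ for the t-th triplet answered by annotator $a$, and correct[t] ∈ {0, 1} records whether the answer matched the hidden construct.

    using Statistics

    # Binned maximum-likelihood estimate of an annotator's success probability (Eq. 6.18).
    function success_curve(deltas::AbstractVector, correct::AbstractVector, nbins::Integer)
        p = sortperm(deltas)                                  # order triplets by difficulty
        edges = round.(Int, range(0, length(p); length=nbins + 1))
        centers, fhat = Float64[], Float64[]
        for b in 1:nbins
            idx = p[edges[b]+1:edges[b+1]]                    # triplets in the b-th bin
            push!(centers, mean(deltas[idx]))                 # average distance in the bin
            push!(fhat, mean(correct[idx]))                   # MLE of the success probability
        end
        return centers, fhat
    end

Plotting fhat against centers for each annotator gives the empirical curves of Figure 6.3.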
Mechanical Turk embedding

We present in Figure 6.4 the results for the reconstructed embeddings using triplets generated by annotators via Mechanical Turk. We show the reconstructed embeddings obtained using 0.5% of the total number of triplets $|\mathcal{T}|$ for each task. Although there is some visible error, we are able to capture the trends and overall shape of the underlying construct with only 0.5% of the triplets. We also plot a scaled version (according to Eq. 6.17) of the fused annotation obtained using the EvalDep method on the continuous-time annotations from the green intensity experiment (see Figure 3.4). In this figure, we observe that methods based on continuous-time annotations are not able to de-bias the annotations in windows of time that are biased. Our proposed method does not suffer from this issue. Moreover, our method does not rely on a time-alignment step, which, as we saw in Chapter 3, can have a big impact on the estimated embedding.

We show in Table 6.1 the MSE for each task, where percentages again represent the number of triplets employed. We have also included the MSE and Pearson's correlation for the embedding produced with 0.25% of the triplets (not included in Figure 6.4 due to the high overlap between the 0.25% and 0.5% embeddings). We observe that the MSE is lower when a larger number of labeled triplets is used. This is expected: there is more information (in the form of triplet constraints) about the embedding as we increase the number of triplets fed into the optimization routine, therefore producing a higher-quality embedding. We also show a scale-free comparison through Pearson's correlation, which captures how the trends of the signals vary over time while neglecting differences in scale and bias.

In task A, our approach improves upon previous work by a large margin, being able to capture constant regions of the signal as well as regions with high-frequency components. Interestingly, the debiased loss does not achieve the same results, as its error approximately doubles that of the standard loss. This can be explained as follows: the noise model does not consider triplets where there is no winner (i.e., $D_{ij} = D_{ik}$), as happens in the constant regions of the embedding. This produces averaging over two incorrect evaluations of the loss, increasing the error.

In task B, our approach performs comparably to the EM-based method. Our understanding suggests that the EM-based algorithm benefits from a smooth ground truth, given its filter-modeling approach over a given window size. Here, however, the debiased triplet embedding loss indeed does better at learning the metric information.

Triplet violations and annotator agreement

Table 6.2 displays the number of triplet violations for each task. We record the true percentage of triplet violations according to our ground truth (generated using the distances $D$, as in Eq. 6.5) and compare them to the annotation responses. We also display the number of triplet violations according to the computed embeddings $\hat{Y}$. We see that the percentage of triplet violations according to our ground truth and the triplet violations calculated from the embeddings $\hat{Y}$ are not the same, the latter being overestimated in task A and underestimated in task B. We also observe that even though the number of violations increases in task A, the MSE is reduced with a larger number of triplets. This happens because a higher number of triplet constraints defines an embedding more easily.

6.5 Discussion

Section 6.4 shows that it is possible to use Triplet Embeddings to find a 1-dimensional embedding that resembles the true underlying construct $Y$ up to scaling and bias factors. There are several additional considerations for our proposed method.
Table 6.1: MSE and Pearson's correlation for the proposed method and state-of-the-art continuous-time fusion techniques against the ground truth. For our method, percentages are with respect to the total number of triplets.

    Task  Fusion technique                        MSE      Correlation
    A     EvalDep [74]                            0.00489  0.906
    A     EM [24] (best)                          0.00494  0.903
    A     Proposed (0.25% of |T|)                 0.00145  0.973
    A     Proposed (0.50% of |T|)                 0.00132  0.975
    A     Proposed (ρ = 0.15 with 0.50% of |T|)   0.00262  0.951
    B     EvalDep                                 0.00304  0.969
    B     EM (best)                               0.00241  0.975
    B     Proposed (0.25% of |T|)                 0.00305  0.969
    B     Proposed (0.50% of |T|)                 0.00285  0.971
    B     Proposed (ρ = 0.15 with 0.50% of |T|)   0.00274  0.972

Table 6.2: Triplet violations $\epsilon_v$ for the Mechanical Turk experiment. Percentages correspond to the percentage of total triplets observed. We include the fraction of triplet violations as computed from the labels generated with EvalDep. The results are consistent with those presented in Table 6.1.

    Task  MTurk  Ŷ (0.25%)  Ŷ (0.5%)  EvalDep [74]  EM [24]
    A     0.122  0.159      0.161     0.262         0.259
    B     0.179  0.146      0.129     0.139         0.124

[Figure 6.3: Probabilities of success $\hat{f}^{I}_a(|D^*_{ij} - D^*_{ik}|)$ as a function of the distance from the reference $i$ to frames $j$ and $k$, compared to logistic noise with $\sigma = 20$. Only the top annotators have been included (task A: 1384, 1273, and 1212 answered triplets; task B: 1411, 1091, and 1050 answered triplets).]

Annotation costs

One of the challenging aspects of using triplet embeddings is the $O(n^3)$ growth of the number of unique triplets for $n$ objects or frames. As mentioned earlier, however, the results of [75] suggest that the number of triplets needed in theory scales as $O(d\, n \log(n))$. In our experiments, we use $K n \log(n)$ triplets with $K = 31.5$ for task A and $K = 15$ for task B to achieve equivalent or better approximations of the underlying ground truth compared to the state of the art. This means that with our approach, we pay an extra $\log(n)$ factor in annotation costs compared to annotations obtained in real-time, where the time complexity is $O(n)$.

[Figure 6.4: Results for Mechanical Turk annotations (green intensity over time for tasks A and B: true label, EvalDep, EM, and proposed with 0.5% of triplets). The computed embeddings have been scaled to fit the true labels (Eq. 6.17). The embedding in task A uses 0.5% (47,052) of all possible triplet comparisons $|\mathcal{T}|$; the embedding in task B uses 0.5% (13,862) of all possible triplet comparisons $|\mathcal{T}|$. In both tasks, the estimated green intensity sometimes falls below zero due to scaling.]

Computational costs

Triplet embeddings are computationally cheap in comparison to the other methods employed in this work, since they can be efficiently estimated using gradient-based methods to minimize the loss function. Moreover, in our current implementation the calculation of the gradient is parallelized, and the average time to estimate the embeddings for tasks A and B using 0.5% of $|\mathcal{T}|$ in Figure 6.4 is 125 ms and 44 ms, respectively, over 100 trials using 10 threads on a laptop with an Intel i7-8850H processor and 32 GB of RAM.

The evaluation of the gradient for a fixed embedding dimension $d$ scales linearly with the number of triplets employed (and the number of triplets needed is $O(n \log n)$). This is computationally cheap in comparison to the other methods considered: for example, time-alignment (needed in all continuous-time fusion approaches) is an expensive operation.
For a signal of length $n$, alignment using Dynamic Time Warping is $O(n^2)$, and EvalDep [76, 74] needs to compute the determinants of two $n \times n$ matrices and one $2n \times 2n$ matrix to estimate the mutual information, each being at least $O(n^{2.373})$ (by using fast matrix multiplication [77]). The EM-based approach also requires inverting an $n \times n$ matrix. As examples, EvalDep takes 8 s in task A and 6 s in task B, while EM takes on average 20 min for both tasks and the different window lengths.

Embedding quality

The reconstructed embeddings are more accurate than those of the method proposed in [74]. Moreover, no time-alignment is needed, since the annotation process does not suffer from reaction times. It is also important to note that sharp edges (high-frequency regions of the construct) are appropriately represented and do not get smoothed out, as happens with averaging-based annotation fusion techniques (where annotation devices such as mice or joysticks, together with the user interfaces, perform low-pass filtering). This is an understudied topic, but it might be important in the modeling of affect, as seen in Chapter 3.

In terms of reconstruction, the scaling factor is an open challenge. We see two possible ways to work with the differences in scaling when the underlying construct is unknown: (1) include the scaling as a parameter to learn in a machine learning pipeline that uses these labels to create statistical models of the hidden construct, or (2) normalize the embedding $\hat{Y}$ such that $\mu_{\hat{Y}} = 0$ and $\sigma_{\hat{Y}} = 1$, and train the models using either these labels or the derivatives $d\hat{Y}/dt$. However, we note that continuous-time annotations suffer from the same loss of scaling and bias, since both techniques are trying to solve an inverse problem where the scale is not accessible.

Feature sub-sampling for triplet comparisons

In the experiments of this work, we sub-sample the videos to 1 Hz so that we have a manageable number of frames $n$. Down-sampling is possible due to the nature of the synthetic experiment we created, but it may not be suitable for other constructs such as affect in real-world data, where annotating single frames might lose important contextual information. In these scenarios, further investigation is needed to understand how to properly sub-sample more complex annotation tasks, as well as how to deal with the increasing annotation costs.

6.6 Conclusions

In this chapter, we present a new sampling methodology based on triplet comparisons to produce continuous-time labels of hidden constructs. To study the proposed methodology, we use two experiments previously proposed in [26] and show that it is possible to recover the structure of the underlying hidden signals, both in simulation studies and using human annotators to perform the triplet comparisons. The resulting labels for the hidden signals are accurate up to scaling and bias factors.

Our method performs annotator fusion seamlessly as a union of the sets of queried triplets $\mathcal{S}_a$, which greatly simplifies the fusion approach compared to existing approaches that directly combine real-time signals. Moreover, our approach does not need post-processing such as time-alignment or averaging.

Some challenges for the proposed method include dealing with the annotation costs, given the number of triplets that needs to be sampled, and learning the unknown scaling and bias factors.

As future directions, we are interested in several paths.
We believe it is necessary to further study the proposed method for labeling constructs where the ground truth cannot be validated, as is the case for human emotions, and to contrast the effects of using triplet comparisons to annotate individual frames versus annotating over frame sequences.

Chapter 7

Conclusions

In this work, we introduce the use of Ordinal Embedding as a framework to learn subjective constructs from human annotators. Leveraging the evidence in the literature, we propose that several annotation tasks can be considered as a scenario where annotators are asked questions of the form "is item $i$ closer to item $j$ or to item $k$?" We study the performance of these ideas in different applications and evaluation methods:

In Chapter 3, we show that we can use absolute ratings from annotators to annotate triplets that are later used to estimate an embedding representing an unknown underlying construct varying in time, which allows us to estimate labels that are more learnable in the case of valence (one dimension of affect).

In Chapter 4, we show that using pairwise ordinal comparisons of the form "does $i$ beat $j$?" over absolute ratings has advantages in terms of the combination of annotations with respect to taking the mean of the absolute ratings. However, this method suffers from missing data and could be posed as an ordinal embedding problem.

Further extending the work in Chapter 4, in Chapter 5 we show that we can use triplet comparisons derived from ordinal ratings to estimate an embedding, so that we can create an embedding from quantized (ordinal) observations. This method suffers from structured missing data, but shows advantages when learning with these labels.

Finally, in Chapter 6 we show that certain aspects of the estimation process that suffer in the algorithm proposed in Chapter 3 can be solved by directly querying annotators for triplet comparisons, again to estimate constructs varying in time.

The experiments across several applications suggest that both assuming that absolute and Likert-scale ratings are produced by annotators through underlying ordinal comparisons, and asking annotators binary questions directly, can help us find complex structure that is otherwise hard to extract from absolute or ordinal ratings. Moreover, the noise in these questions can be modeled, and the resulting methods are robust to some sources of noise and to missing entries, widening the range of applications where they can be employed.

Even though the proposed methods show advantages in terms of estimating underlying signals and learning from them, they have some limitations. For example, needing to query $O(mn\log(n))$ triplets may become too expensive, as compared to collecting absolute ratings and finding an embedding for those. Another problem is the fact that a reference is needed when computing an embedding, since the scale is lost. Finally, the proposed methods use information from the whole time window or from all sessions, which may not be appropriate for all tasks.

7.1 Future Work

There are different avenues that one could follow from this work, as there are practical and theoretical problems that could be addressed.

Non-uniform sampling of triplets

The applications in behavioral coding show that the sampling of triplets is non-uniform, given the structured nature of the missing entries in the annotation matrices.
To better understand the limitations of our methods, a possibility would be to derive finite-sample prediction error bounds for Ordinal Embedding in the case of non-uniform triplet sampling, similar to the work in [78], where the authors explore the use of masks with random graphs as structures for the missing entries. The difficulty here lies in defining the mask for a tensor-like structure in the case of missing triplets.

Learning annotator weights

As we previously stated, to estimate an embedding we can pose a risk minimization problem:

$$ \hat{R}_\ell(G) = \frac{1}{|\mathcal{S}|} \sum_{t \in \mathcal{S}} \ell\left( w_t, \langle L_t, G \rangle_F \right). \tag{7.1} $$

However, this definition treats each triplet annotation as a single value, which we resolved in Chapter 3 by using majority voting over a subset of annotators. We could instead propose to solve the following objective:

$$ \hat{R}_\ell(G, \mathbf{w}) = \frac{1}{|\mathcal{S}|} \sum_{t \in \mathcal{S}} \ell\left( \operatorname{sign}\left( \mathbf{w}^\top \mathbf{y}_t \right), \langle L_t, G \rangle_F \right), \tag{7.2} $$

where $\mathbf{w} \in [0, 1]^{|\mathcal{A}|}$ are the annotator weights (or reliabilities) that we want to learn jointly with the embedding. Since Eq. 7.2 is not differentiable, a relaxation could be used:

$$ \hat{R}_\ell(G, \mathbf{w}) = \frac{1}{|\mathcal{S}|} \sum_{t \in \mathcal{S}} \ell\left( \sigma\left( \mathbf{w}^\top \mathbf{y}_t \right), \langle L_t, G \rangle_F \right), \tag{7.3} $$

where $\sigma(\cdot)$ is a non-linearity such as $\tanh(\cdot)$ approximating the $\operatorname{sign}(\cdot)$ function. Minimizing Eq. 7.3 would yield an estimate of the annotator reliabilities as well as of the embedding.
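A hypothetical sketch of the relaxed objective in Eq. 7.3 (Julia; this is future work, so everything below is an illustrative assumption rather than part of the released packages). For a triplet $t = (i, j, k)$, the Frobenius inner product $\langle L_t, G \rangle_F = \|y_i - y_k\|^2 - \|y_i - y_j\|^2$ expands in terms of the entries of the Gram matrix $G$:

    using LinearAlgebra

    logistic_loss(z) = log1p(exp(-z))                 # a convenient margin loss ℓ

    # ⟨L_t, G⟩_F for t = (i, j, k), written in entries of the Gram matrix G.
    margin(G, i, j, k) = G[k, k] - G[j, j] - 2G[i, k] + 2G[i, j]

    # Relaxed joint risk of Eq. 7.3: w holds the per-annotator weights, and
    # responses[t] is the vector of ±1 answers for triplet t (0 if an annotator
    # did not label triplet t).
    function relaxed_risk(G::AbstractMatrix, w::AbstractVector, S, responses)
        total = 0.0
        for (t, (i, j, k)) in enumerate(S)
            soft_label = tanh(dot(w, responses[t]))   # σ(wᵀyₜ) ∈ (−1, 1)
            total += logistic_loss(soft_label * margin(G, i, j, k))
        end
        return total / length(S)
    end

Both arguments are differentiable, so $G$ and $\mathbf{w}$ could be updated jointly with any gradient-based optimizer, projecting $\mathbf{w}$ back onto $[0, 1]^{|\mathcal{A}|}$ after each step.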
7.2 Code Availability

The code to estimate Triplet Embeddings and to fuse annotations from different annotators can be found at the following links:

https://github.com/usc-sail/TripletEmbeddings.jl
https://github.com/usc-sail/AnnotatorFusion.jl

Bibliography

[1] Brandon M. Booth, Karel Mundnich, and Shrikanth S. Narayanan. A novel method for human bias correction of continuous-time annotations. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3091–3095. IEEE, 2018.
[2] Fabien Ringeval, Andreas Sonderegger, Juergen Sauer, and Denis Lalanne. Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions. In 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), pages 1–8. IEEE, 2013.
[3] Vikas C. Raykar, Shipeng Yu, Linda H. Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, and Linda Moy. Learning from crowds. Journal of Machine Learning Research, 11(Apr):1297–1322, 2010.
[4] Angeliki Metallinou and Shrikanth Narayanan. Annotation and processing of continuous emotional attributes: Challenges and opportunities. In 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), pages 1–8. IEEE, 2013.
[5] Armen C. Arevian, Daniel Bone, Nikolaos Malandrakis, Victor R. Martinez, Kenneth B. Wells, David J. Miklowitz, and Shrikanth Narayanan. Clinical state tracking in serious mental illness through computational analysis of speech. PLoS ONE, 15(1):e0225695, 2020.
[6] Matthew Black, Athanasios Katsamanis, Chi-Chun Lee, Adam C. Lammert, Brian R. Baucom, Andrew Christensen, Panayiotis G. Georgiou, and Shrikanth S. Narayanan. Automatic classification of married couples' behavior using audio features. In INTERSPEECH, pages 2030–2033, 2010.
[7] Bo Xiao, Chewei Huang, Zac E. Imel, David C. Atkins, Panayiotis Georgiou, and Shrikanth S. Narayanan. A technology prototype system for rating therapist empathy from audio recordings in addiction counseling. PeerJ Computer Science, 2:e59, 2016.
[8] Daniel Bone, Chi-Chun Lee, Matthew P. Black, Marian E. Williams, Sungbok Lee, Pat Levitt, and Shrikanth Narayanan. The psychologist as an interlocutor in autism spectrum disorder assessment: Insights from a study of spontaneous prosody. Journal of Speech, Language, and Hearing Research, 57(4):1162–1177, 2014.
[9] Rensis Likert. A technique for the measurement of attitudes. Archives of Psychology, 140(22):1–55, 1932.
[10] Karel Mundnich, Brandon M. Booth, Benjamin Girault, and Shrikanth Narayanan. Generating labels for regression of subjective constructs using triplet embeddings. Pattern Recognition Letters, 128:385–392, 2019.
[11] Amos Tversky. Intransitivity of preferences. Psychological Review, 76(1):31, 1969.
[12] Brandon M. Booth, Karel Mundnich, and Shrikanth Narayanan. Fusing annotations with majority vote triplet embeddings. In Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop, pages 83–89. ACM, 2018.
[13] Soroosh Mariooryad and Carlos Busso. Correcting time-continuous emotional labels by modeling the reaction lag of evaluators. IEEE Transactions on Affective Computing, 6(2):97–108, 2015.
[14] Meinard Müller. Dynamic time warping. Information Retrieval for Music and Motion, pages 69–84, 2007.
[15] Feng Zhou and Fernando De la Torre. Canonical time warping for alignment of human behavior. In Advances in Neural Information Processing Systems, pages 2286–2294, 2009.
[16] Harold Hotelling. Relations between two sets of variates. Biometrika, 28(3/4):321–377, 1936.
[17] Feng Zhou and Fernando De la Torre. Generalized time warping for multi-modal alignment of human motion. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 1282–1289, 2012.
[18] George Trigeorgis, Mihalis A. Nicolaou, Björn W. Schuller, and Stefanos Zafeiriou. Deep canonical time warping for simultaneous alignment and representation learning of sequences. IEEE Transactions on Pattern Analysis & Machine Intelligence, 40(5):1128–1138, 2018.
[19] Mihalis A. Nicolaou, Stefanos Zafeiriou, and Maja Pantic. Correlated-spaces regression for learning continuous emotion dimensions. In Proceedings of the 21st ACM International Conference on Multimedia, pages 773–776. ACM, 2013.
[20] Mihalis A. Nicolaou, Vladimir Pavlovic, and Maja Pantic. Dynamic probabilistic CCA for analysis of affective behavior and fusion of continuous annotations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1299–1311, 2014.
[21] Fabien Ringeval, Florian Eyben, Eleni Kroupi, Anil Yuce, Jean-Philippe Thiran, Touradj Ebrahimi, Denis Lalanne, and Björn W. Schuller. Prediction of asynchronous dimensional emotion ratings from audiovisual and physiological data. Pattern Recognition Letters, 66:22–30, 2015.
[22] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[23] Alex Graves and Jürgen Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5-6):602–610, 2005.
[24] Rahul Gupta, Kartik Audhkhasi, Zach Jacokes, Agata Rozga, and Shrikanth Narayanan. Modeling multiple time series annotations as noisy distortions of the ground truth: An expectation-maximization approach. IEEE Transactions on Affective Computing, 9(1):76–89, 2018.
[25] Phil Lopes, Georgios N. Yannakakis, and Antonios Liapis. RankTrace: Relative and unbounded affect annotation. In 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII), pages 158–163. IEEE, 2017.
[26] Brandon M. Booth, Karel Mundnich, and Shrikanth Narayanan. A novel method for human bias correction of continuous-time annotations. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3091–3095. IEEE, 2018.
[27] Padhraic Smyth, Usama M. Fayyad, Michael C. Burl, Pietro Perona, and Pierre Baldi. Inferring ground truth from subjective labelling of Venus images. In Advances in Neural Information Processing Systems, pages 1085–1092, 1995.
[28] Md Nasir, Brian Baucom, Panayiotis Georgiou, and Shrikanth Narayanan. Redundancy analysis of behavioral coding for couples therapy and improved estimation of behavior from noisy annotations. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pages 1886–1890, 2015.
[29] Anil Ramakrishna, Rahul Gupta, Ruth B. Grossman, and Shrikanth S. Narayanan. An expectation maximization approach to joint modeling of multidimensional ratings derived from multiple annotators. In Interspeech 2016, pages 1555–1559, 2016.
[30] Kartik Audhkhasi and Shrikanth Narayanan. A globally-variant locally-constant model for fusion of labels from multiple diverse experts without using reference labels. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(4):769–783, 2013.
[31] Brian McFee and Gert Lanckriet. Learning multi-modal similarity. Journal of Machine Learning Research, 12(Feb):491–523, 2011.
[32] Daniel P. W. Ellis, Brian Whitman, Adam Berenzweig, and Steve Lawrence. The quest for ground truth in musical artist similarity. In Proceedings of the International Symposium on Music Information Retrieval (ISMIR 2002), pages 170–177, October 2002.
[33] Brian McFee, Luke Barrington, and Gert Lanckriet. Learning content similarity for music recommendation. IEEE Transactions on Audio, Speech, and Language Processing, 20(8):2207–2218, 2012.
[34] Ernesto De Vito, Lorenzo Rosasco, Andrea Caponnetto, Umberto De Giovannini, and Francesca Odone. Learning from examples as an inverse problem. Journal of Machine Learning Research, 6(May):883–904, 2005.
[35] Ivan Dokmanic, Reza Parhizkar, Juri Ranieri, and Martin Vetterli. Euclidean distance matrices: Essential theory, algorithms, and applications. IEEE Signal Processing Magazine, 32(6):12–30, 2015.
[36] Neil Stewart, Gordon D. A. Brown, and Nick Chater. Absolute identification by relative judgment. Psychological Review, 112(4):881–911, 2005.
[37] Georgios N. Yannakakis and John Hallam. Ranking vs. preference: A comparative study of self-reporting. Affective Computing and Intelligent Interaction, pages 437–446, 2011.
[38] Angeliki Metallinou and Shrikanth Narayanan. Annotation and processing of continuous emotional attributes: Challenges and opportunities. In Automatic Face and Gesture Recognition (FG), 2013 10th IEEE International Conference and Workshops on, pages 1–8. IEEE, 2013.
[39] Georgios N. Yannakakis and Héctor P. Martínez. Ratings are overrated! Frontiers in ICT, 2:13, 2015.
[40] Lalit Jain, Kevin G. Jamieson, and Rob Nowak. Finite sample prediction and recovery bounds for ordinal embedding. In Advances in Neural Information Processing Systems, pages 2711–2719, 2016.
[41] Mark A. Davenport, Yaniv Plan, Ewout van den Berg, and Mary Wootters. 1-bit matrix completion. Information and Inference, 3(3), 2014.
[42] Ralph Allan Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
[43] R. Duncan Luce. Individual choice behavior, a theoretical analysis. Bulletin of the American Mathematical Society, 66(4):259–260, 1960.
[44] Louis L. Thurstone. A law of comparative judgment. Psychological Review, 34(4):273, 1927.
[45] Nihar B. Shah. Learning from People. PhD thesis, UC Berkeley, 2017.
[46] Kilian Q. Weinberger, John Blitzer, and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. In Advances in Neural Information Processing Systems, pages 1473–1480, 2006.
[47] Kilian Q. Weinberger and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10(Feb):207–244, 2009.
[48] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[49] Nagarajan Natarajan, Inderjit S. Dhillon, Pradeep K. Ravikumar, and Ambuj Tewari. Learning with noisy labels. In Advances in Neural Information Processing Systems, pages 1196–1204, 2013.
[50] Laurens van der Maaten and Kilian Weinberger. Stochastic triplet embedding. In Machine Learning for Signal Processing (MLSP), 2012 IEEE International Workshop on, pages 1–6, 2012.
[51] Sameer Agarwal, Josh Wills, Lawrence Cayton, Gert Lanckriet, David Kriegman, and Serge Belongie. Generalized non-metric multidimensional scaling. In Artificial Intelligence and Statistics, pages 11–18, 2007.
[52] Omer Tamuz, Ce Liu, Serge Belongie, Ohad Shamir, and Adam Tauman Kalai. Adaptively learning the crowd kernel. In Proceedings of the 28th International Conference on Machine Learning, ICML'11, pages 673–680, Madison, WI, USA, 2011. Omnipress.
[53] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
[54] Fabien Ringeval, Björn Schuller, Michel Valstar, Roddy Cowie, Heysem Kaya, Maximilian Schmitt, Shahin Amiriparian, Nicholas Cummins, Denis Lalanne, Adrien Michaud, et al. AVEC 2018 workshop and challenge: Bipolar disorder and cross-cultural affect recognition. In Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop, pages 3–13. ACM, 2018.
[55] Fabien Ringeval, Andreas Sonderegger, Juergen Sauer, and Denis Lalanne. Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions. In 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), pages 1–8. IEEE, 2013.
[56] Fabien Ringeval, Björn Schuller, Michel Valstar, Roddy Cowie, and Maja Pantic. AVEC 2015: The 5th international audio/visual emotion challenge and workshop. In Proceedings of the 23rd ACM International Conference on Multimedia, pages 1335–1336, 2015.
[57] Michel Valstar, Jonathan Gratch, Björn Schuller, Fabien Ringeval, Denis Lalanne, Mercedes Torres Torres, Stefan Scherer, Giota Stratou, Roddy Cowie, and Maja Pantic. AVEC 2016: Depression, mood, and emotion recognition workshop and challenge. In Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, pages 3–10, 2016.
[58] Feng Zhou and Fernando De la Torre. Generalized canonical time warping. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):279–294, 2016.
[59] H. Nurmi. Voting procedures: A summary analysis. British Journal of Political Science, 2:181–208, 1983.
[60] Andrew Christensen, David C. Atkins, Sara Berns, Jennifer Wheeler, Donald H. Baucom, and Lorelei E. Simpson. Traditional versus integrative behavioral couple therapy for significantly and chronically distressed married couples. Journal of Consulting and Clinical Psychology, 72(2):176, 2004.
[61] J. Jones and A. Christensen.
Couples interaction study: Social support interaction rating system. Technical report, University of California, Los Angeles, 1998.
[62] C. Heavey, D. Gill, and A. Christensen. Couples Interaction Rating System 2 (CIRS2). Technical report, University of California, Los Angeles, 2002.
[63] Klaus Krippendorff. Estimating the reliability, systematic error and random error of interval data. Educational and Psychological Measurement, 30(1):61–70, 1970.
[64] Klaus Krippendorff. Content Analysis: An Introduction to Its Methodology. Sage, Thousand Oaks, CA, 2004.
[65] David Lippman. Math in Society. Pierce College Ft Steilacoom, 2013.
[66] Thomas J. Dishion and Kate Kavanagh. Intervening in Adolescent Problem Behavior: A Family-Centered Approach. Guilford Press, 2003.
[67] Elizabeth A. Stormshak, Arin M. Connell, Marie-Hélène Véronneau, Michael W. Myers, Thomas J. Dishion, Kathryn Kavanagh, and Allison S. Caruthers. An ecological approach to promoting early adolescent mental health and social adaptation: Family-centered intervention in public middle schools. Child Development, 82(1):209–225, 2011.
[68] Karel Mundnich, Md Nasir, Panayiotis G. Georgiou, and Shrikanth S. Narayanan. Exploiting intra-annotator rating consistency through Copeland's method for estimation of ground truth labels in couples' therapy. In INTERSPEECH, pages 3167–3171, 2017.
[69] Kevin G. Jamieson, Lalit Jain, Chris Fernandez, Nicholas J. Glattard, and Robert D. Nowak. NEXT: A system for real-world development, evaluation, and application of active learning. In Advances in Neural Information Processing Systems, pages 2656–2664, 2015.
[70] Laurens van der Maaten and Kilian Weinberger. Stochastic triplet embedding. In Machine Learning for Signal Processing (MLSP), 2012 IEEE International Workshop on, pages 1–6. IEEE, 2012.
[71] Sameer Agarwal, Josh Wills, Lawrence Cayton, Gert Lanckriet, David Kriegman, and Serge Belongie. Generalized non-metric multidimensional scaling. In Artificial Intelligence and Statistics, pages 11–18, 2007.
[72] Omer Tamuz, Ce Liu, Serge Belongie, Ohad Shamir, and Adam Tauman Kalai. Adaptively learning the crowd kernel. In Proceedings of the 28th International Conference on Machine Learning, ICML'11, pages 673–680, 2011.
[73] Yoshikazu Terada and Ulrike von Luxburg. Local ordinal embedding. In International Conference on Machine Learning, pages 847–855, 2014.
[74] Soroosh Mariooryad and Carlos Busso. Correcting time-continuous emotional labels by modeling the reaction lag of evaluators. IEEE Transactions on Affective Computing, 6(2):97–108, 2015.
[75] Lalit Jain, Kevin G. Jamieson, and Robert D. Nowak. Finite sample prediction and recovery bounds for ordinal embedding. In Advances in Neural Information Processing Systems, pages 2711–2719, 2016.
[76] Soroosh Mariooryad and Carlos Busso. Analysis and compensation of the reaction lag of evaluators in continuous emotional annotations. In Affective Computing and Intelligent Interaction (ACII), 2013 Humaine Association Conference on, pages 85–90. IEEE, 2013.
[77] Alfred V. Aho, John E. Hopcroft, and Jeffrey D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley, 1974.
[78] Srinadh Bhojanapalli and Prateek Jain. Universal matrix completion. In International Conference on Machine Learning, pages 1881–1889. PMLR, 2014.
Abstract
In the setting of supervised learning, labels are used as examples to train models. When the objective is to learn a subjective construct (such as dimensions of affect), the labeling process oftentimes includes multiple annotators conveying their opinion about the construct for each sample in the form of annotations, which are later processed to obtain labels. This work explores the use of ordinal comparisons both to find consensus from annotations and to generate labels of subjective constructs.

In the first part, we study how to find consensus from annotations collected as absolute or Likert-scale ratings, over time or at the session level. In this setting, we propose to look at the problem of generating labels as finding a 1-dimensional embedding that faithfully represents the latent construct we want to learn. The difficulty of the problem lies in how to process these annotations, which are usually noisy due to subjectivity and because the process of annotation can lead to errors. Assuming that people internally use ordinal comparisons to annotate even in the case of absolute ratings, we collect triplets and set up an optimization problem to estimate the underlying embedding. Since the ground truth or latent construct is unknown to us, we evaluate the methods by training a set of supervised models in two different scenarios: affect (arousal, valence) and behavioral codes, yielding models with lower validation or test errors for valence and the behavioral codes.

In the second part, we reconsider two aspects of the previous process: (1) the annotation process, and (2) the evaluation of the methods. We propose the use of a specifically-designed dataset to evaluate annotations collected in real-time, and further propose using only triplet comparisons to estimate latent constructs that vary over time. The dataset allows us to study the noise and artifacts generated when collecting annotations in real-time, while the novel sampling scheme based on triplet comparisons allows us to better estimate latent signals.