IMPROVING MODELING OF HUMAN EXPERIENCE AND BEHAVIOR: METHODOLOGIES FOR ENHANCING THE QUALITY OF HUMAN-PRODUCED DATA AND ANNOTATIONS OF SUBJECTIVE CONSTRUCTS by Brandon M. Booth A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulllment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE) December 2020 Copyright 2020 Brandon M. Booth When someone does something wrong, don't forget all the things they did right. Unknown ii I dedicate this dissertation to the countless number of educators, family members, colleagues, friends, neighbors, and strangers who have, at one time or another, helped instill in me a fascination with the world and a love for learning. iii Acknowledgements Throughout my journey in this PhD program and in the writing of this disserta- tion, I have received a great deal of assistance. First, I would like to thank my advisor, Dr. Shrikanth Narayanan, whose resolute support for my research interests enabled me to passionately pursue this research topic. I would like to acknowledge my colleagues in the SAIL lab for their numerous and thoughtful discussions and reviews of the various research ideas and pub- lications contained in this dissertation. I want to particularly thank my SAIL co-worker, Karel Mundnich, for several lengthy and astute discussions about an- notation fusion. Furthermore, I would like to thank Dr. Jonathan Gratch for his considerate feedback on an earlier version of this manuscript, which provided helpful guidance. I also wish to thank my parents for their sympathetic ears and emotional support, especially throughout the doctoral process. Finally, and denitely not least, I want to thank my wife for listening to my research ramblings, asking insightful questions, and for reviewing this dissertation. iv TABLE OF CONTENTS Epigraph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxi 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Objectives and Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3 Overview of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2 Human Behavior Modeling: Background and Challenges . . . . . . . . . . . . . . 6 2.1 Human Behavior Modeling Background . . . . . . . . . . . . . . . . . . . . . 6 2.2 Human Behavior Modeling Challenges . . . . . . . . . . . . . . . . . . . . . 10 2.3 Case Study: Distance Learner Engagement Assessment . . . . . . . . . . . . 12 2.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.3.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.3.3 Data Acquisition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.3.4 Annotation Protocol . . . . . . . . . . . . . . . 
. . . . . . . . . . . . 16 2.3.5 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.3.6 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.3.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 2.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3 Improving the Quality of Human-produced Data and Annotations . . . . . . . . . 33 3.1 Improving the Quality of Data Collected In Situ from Humans . . . . . . . . 34 3.1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.1.2 Challenges with Studies in the Wild . . . . . . . . . . . . . . . . . . . 37 3.1.3 Data Acquisition and Flow Framework . . . . . . . . . . . . . . . . . 41 3.1.4 Considerations and Criteria for Sensor Selection . . . . . . . . . . . . 44 3.1.5 Case Study: TILES Data Set . . . . . . . . . . . . . . . . . . . . . . 60 3.1.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 3.1.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 3.2 Improving Quality of Real-time Continuous-scale Mental State Annotations . 72 v 3.2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 3.2.2 Green Intensity Annotation Experiment . . . . . . . . . . . . . . . . 77 3.2.2.1 Experimental Procedure . . . . . . . . . . . . . . . . . . . . 77 3.2.2.2 Observations . . . . . . . . . . . . . . . . . . . . . . . . . . 78 3.2.3 Ordinal Triplet Embeddings and Notation . . . . . . . . . . . . . . . 79 3.2.4 Ground Truth Estimation via Frame-wise Ordinal Triplet Embedding 81 3.2.4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 3.2.4.2 Majority-vote Triplet Embeddings . . . . . . . . . . . . . . 82 3.2.4.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 85 3.2.4.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 3.2.4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 3.2.5 A Framework for Ground Truth Estimation via Perceptual Similarity Warping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 3.2.5.1 Perceptual Similarity Warping Framework . . . . . . . . . . 91 3.2.5.2 Experiments and Results . . . . . . . . . . . . . . . . . . . . 96 3.2.5.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 3.2.6 Trapezoidal Segmented Regression: An Algorithm for Identifying Per- ceptual Similarity in Annotations . . . . . . . . . . . . . . . . . . . . 98 3.2.6.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 3.2.6.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 3.2.6.3 Benets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 3.2.6.4 Experiments and Results . . . . . . . . . . . . . . . . . . . . 106 3.2.6.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 3.2.7 Trapezoidal Segment Sequencing: A Perceptual Similarity-based An- notation Fusion Method . . . . . . . . . . . . . . . . . . . . . . . . . 110 3.2.7.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 3.2.7.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 3.2.7.3 Experiment and Results . . . . . . . . . . . . . . . . . . . . 113 3.2.7.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 3.2.8 Signed Dierential Agreement: An Agreement Measure for Real-time Continuous-scale Annotations . . . . . . . . . . . . . . . . . . . . . 
. 115 3.2.8.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 3.2.8.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 3.2.8.3 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 3.2.8.4 Agreement Measures . . . . . . . . . . . . . . . . . . . . . . 125 3.2.8.5 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 3.2.8.6 Agreement Comparison . . . . . . . . . . . . . . . . . . . . 129 3.2.8.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 3.2.9 Validation of Framework and Enhancements for Ground Truth Esti- mation via Perceptual Similarity . . . . . . . . . . . . . . . . . . . . 134 3.2.9.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 3.2.9.2 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 3.2.9.3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144 3.2.9.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144 3.2.9.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 vi 4 Discussion & Future Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . 149 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151 Appendix A: Sensing Checklist for Studies in the Wild . . . . . . . . . . . . . . . . . 166 vii LIST OF TABLES 2.1 Cross-subject engagement prediction F1 scores using dierent learning algo- rithms, engagement label representations, and video feature sets. Reported F1 scores are weighted averages from leave-one-subject-out cross validation tests. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.2 F1 scores for engagement prediction for three binary classiers trained on dierent subsets of the data based on the supplementary discrete labels. . . . 24 2.3 Average intra-subject engagement prediction F1-scores . . . . . . . . . . . . 26 2.4 Engagement prediction F1 scores for dierent learning models and features. . 28 3.1 Considerations and criteria for sensor selection . . . . . . . . . . . . . . . . . 45 3.2 Signals of interest in the case study that were measurable using consumer sensors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 3.3 Selected sensors for the case study and their expected uses . . . . . . . . . . 62 3.4 Compliance rates for participant-tracking sensors (n=212) and environment sensors (n=244) in the case study. Compliance is computed as the presence of data exceeding half of the measurement period per day among the participants who opted in for each sensor. . . . . . . . . . . . . . . . . . . . . . . . . . . 67 3.5 Agreement measures between dierent fusion methods and the true signal in two tasks from the color intensity data set [1]. . . . . . . . . . . . . . . . . . 88 3.6 CCC values for various proposed gold-standard annotation fusions using the 2018 AVEC emotion sub-challenge evaluation metric explained in Section 3.2.4.3 89 3.7 Agreement measures for baseline and proposed warped fused annotation ap- proaches. All warped results use a complete set of ordinal comparisons from an oracle. NMI = normalized mutual information. . . . . . . . . . . . . . . . 97 viii 3.8 Agreement measures between the true signal and various ground truth esti- mates using the perceptual similarity warping method. All warped results use a complete set of ordinal comparisons from the oracle. NMI = normalized mutual information. . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . 108 3.9 Several agreement measures computed for various ground truth techniques showing the agreement between each method's ground truth and the true green intensity. Some values vary from Section 3.2.6 due to down-sampling. PSW = perceptual similarity warping framework. NMI = normalized mutual information. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 3.10 Pairwise and group-level agreement measures applied to the simulated anno- tation example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 3.11 Movie clip cut times for the selected lms. . . . . . . . . . . . . . . . . . . . 137 3.12 Number of movie clip excerpts extracted from each movie based on the number of constant intervals observed in each movie's clips' fused TSS. . . . . . . . . 143 3.13 Rank-based measures of agreement between the warped signals resulting from using additional batches of pairwise comparisons. Batches of size 5000 (unique comparisons) are added to all previous batches and the agreement between the resulting PSW-warped signal is measured using the PSW-warped signal from the previous iteration. . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 ix LIST OF FIGURES 2.1 One example sequence of events of interest to human behavior researchers. Some external change, once observed by a human observer, produces a new or altered mental state as a direct result of some mental process. . . . . . . . 7 2.2 A broader view of the factors in uencing unobservable mental states and internal experiences. Arrows point in the direction of in uence and represent active human processes. Examples of these processes appear next to each arrow in black text. Jagged arrows represent processes which are vulnerable to various forms of bias, also referred to as unwanted noise. Since hidden mental states cannot be directly observed or interpreted, annotations of those mental states help researchers learn more about how the observable world in uences them. This diagram illustrates how annotations can only be obtained via noisy channel processes subjected to various biases. . . . . . . . . . . . . . . 8 2.3 A common framework for human behavior modeling. Data is measured from an individual's surroundings to form a vector x. Self-reports or external ob- servations are made of the individual's experiences to produce a vector of mental state annotationsy. Machine learning is used to nd a functionf and parameter set such thatL[yf(x;)] is minimized for some loss functionL. 9 2.4 An expanded common framework for human behavior modeling. The data x andy used to train a model are shown along with the original sources of that data. The overall quality of the data (their tness for the purpose of training the behavior model) is impacted by measurement noise, cognitive biases. and human error. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 x 2.5 The annotation interface used in this study. Annotators could play/pause the video and were instructed to move the engagement slider in real-time to annotate their perception of the student's engagement level. (This image is reproduced with the subject's permission.) . . . . . . . . . . . . . . . . . . . 17 2.6 Late fusion ensemble for engagement prediction . . . . . . . . . . . . . . . . 25 2.7 An example engagement proxy from Gevins et al plotted over time [2]. This metric uses the ratio of where is averaged across the F3, F4, FC5, and FC6 sites and across the P7 and P8 channels. . . . . . . . . . . . . . . . . 
27 2.8 Plots of average correlation between dierent EEG engagement metrics and the ground truth as a function of sliding window size. Each line color repre- sents a unique subject. Left: working memory load [2], Middle: engagement [3], Right: concentration [4] . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.9 Confusion matrices for engagement prediction for the best-performing learn- ing models trained on video-based features. Rows correspond to true labels, columns to predicted labels, and element values are normalized across the rows. The k-means trinary labels are used in each case. Left: best cross- subject classier (KNN), Middle: best cross-subject ensemble (mixed), Right: average intra-subject classier (RF) . . . . . . . . . . . . . . . . . . . . . . . 29 3.1 An overview of the general scientic process for human research studies in- volving sensing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 xi 3.2 A framework for studies of human behavior in the wild, showing common potential information pathways for data produced by sensors (e.g., physio- logic and activity), destined to be stored on a single research server. This type of data ow paradigm enables centralized data monitoring and facili- tates immediate automatic participant feedback regarding data quality and compliance via the participants smartphones. RFID: Radio-Frequency Iden- tication; NFC: Near-Field Communication; USB: Universal Serial Bus; API: application programming interface. . . . . . . . . . . . . . . . . . . . . . . . 42 3.3 Setup of the TILES Audio Recorder [5] . . . . . . . . . . . . . . . . . . . . . 64 3.4 Histograms of the total number of hours of recorded sensor data per day, across all participants. These plots only show data from days where data was logged. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 3.5 Simple Annotation Processing Pipeline for Ground Truth Estimation . . . . 73 3.6 A closeup snapshot of the user interface at dierent times during the green intensity annotation task. Annotators only adjusted the slider in sync with changes in the green video. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 3.7 Plots of the objective truths (bold) and annotations of green channel intensity from ten annotators in two separate tasks. . . . . . . . . . . . . . . . . . . . 79 3.8 Agreement between annotators for two example subjects from the RECOLA emotion dataset in two dierent annotation tasks: arousal and valence. Agree- ment is measured using CCC. The overall agreement in valence is higher than the overall agreement for arousal. . . . . . . . . . . . . . . . . . . . . . . . . 85 xii 3.9 Plots for TaskA (left) and TaskB (right) from the color intensity annota- tion dataset. The true color intensity signal is shown (black) alongside the unweighted average of individual annotations (purple) and a gold-standard produced using an unweighted version of the proposed triplet embedding al- gorithm (light blue). This shows the proposed method is sensible and quali- tatively similar to the average signal. . . . . . . . . . . . . . . . . . . . . . . 87 3.10 An example plot of an individual annotation (from FM1) of arousal from the RECOLA data set [6] for subject "train 5" and the corresponding feature with the highest average Pearson correlation with all annotations (geomet- ric feature 245 amean). The two signals are shown in (a) and the DTW- warped annotation is shown in (b). . . . . . . . . . . . . . . . . . . . . . . . 
88 3.11 Plots showing the percentages of triplet violations after t-STE convergence . 89 3.12 Overview of perceptual similarity warping algorithm for accurate ground truth generation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 3.13 Plot of the objective truth signal, time-shifted average annotation signal (EvalDep), warped signal (proposed), and the 1-D embedding for extracted constant intervals for Task A. The spatially warped signal better approximates the structure of the objective truth and also achieves greater self-consistency over the entire annotation duration. . . . . . . . . . . . . . . . . . . . . . . . 95 3.14 Equations for the rank-based spatial warping method. Lett2f1; 2;:::;Tg be a time index,y t denote the fused annotation signal,y 0 t denote the warped signal value, and let C be the ordered sequence of non-overlapping time intervals corresponding to the extracted constant intervals. E is dened as the sequence of embedding values inR d corresponding to the time interval sequenceC. The sequenceI is used instead ofC to handle edge cases. For notational simplicity, a new sequence S is used whose i th element is the dierence between interval i's average value and the corresponding embedding value. . . . . . . . . . . . 96 xiii 3.15 Illustration of the curvature produced by TVD when approximating nearly constant regions of a signal. The proposed trapezoidal segmented regression (TSR) algorithm approximates these regions with constant segments making the extraction of near-constant intervals easier. . . . . . . . . . . . . . . . . . 99 3.16 Two trapezoidal signal examples. The left shows a prototypical trapezoidal signal shape. The right contains the optimum four-segment trapezoidal signal t to the sample points. A broader denition of a trapezoidal signal is used: continuous alternating sloped and constant line segments. Each line type is colored dierently for clarity. . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 3.17 Optimum trapezoidal segmented regression . . . . . . . . . . . . . . . . . . . 103 3.18 Plots of the EvalDep [7] fused annotations for TaskA and TaskB in the continuous-scale real-time annotation of green color intensity. The frame-level average of unaligned annotations is shown for reference. . . . . . . . . . . . . 107 3.19 Plots of the TSR and TVD signal approximations of the EvalDep fusion in both green intensity annotation tasks. . . . . . . . . . . . . . . . . . . . . . . 107 3.20 Dierences between TSR and TVD algorithms in two experimental tasks when used for signal approximation as part of the PSW ground truth framework. The number of constant intervals extracted during the segmentation stage of the pipeline are shown at the bottom as functions of each algorithm's tunable parameter (T or). Kendall's tau correlations between the nal warped signal and the true target signal are shown at the top. . . . . . . . . . . . . . . . . 109 3.21 Perceptual similarity warping algorithm presented in Section 3.2.5 [1]. This section proposes a novel annotation fusion method that merges the preliminary fusion and signal approximation steps in this pipeline. . . . . . . . . . . . . . 112 3.22 Plots of the true target signal and ground truth signals produced by the baseline EvalDep fusion method and the proposed TSS when using the (PSW) ground truth framework. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
114 xiv 3.23 Plot of Task A annotation signals (yellow and bold blue) and the fused TSS sequence (vertical bands of green or red). Red bands denote no change in the TSS (zero value) and green bands represent some change in the fused TSS (1).116 3.24 Scatter plots of the true green intensity values against the annotated values for each annotator in both tasks. The dashed line show where the points would lie for a perfect annotator. The orange curves are cubic polynomial regressions of the scattered points. . . . . . . . . . . . . . . . . . . . . . . . . 120 3.25 Scatter plots of the forward dierence of the true green intensity values against the forward dierence of annotated values for each annotator in both tasks. The dashed line shows where the points would lie for an idealized annotator. Four quadrants are color coded according to the percentage of each plot's points falling within them. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 3.26 Plots of the aligned annotations over time. Blue points and line segments cor- respond to samples where the forward dierence matches the sign of the true target signal's forward dierence. Red points and line segments correspond to mismatches. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 3.27 Example annotations showing similar structure but negative Spearman and Kendall's correlations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 3.28 Hypothetical annotations of the same stimulus. . . . . . . . . . . . . . . . . 130 3.29 Pairwise agreement measures for Task A in the green intensity annotation experiment; rows permuted via agglomerative clustering. . . . . . . . . . . . 131 3.30 Example annotations of emotional arousal from the RECOLA data set. An- notators FF1 and FF2 (bold solid lines) structurally disagree about half the time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 xv 3.31 Expanded general pipeline for generating ground truth representations from real-time continuous-scale human annotations of a stimulus. Annotation fu- sion shows two possible paths based on contributions in this manuscript, though other methods exist. Methods proposed in this dissertation are col- ored beige and demarcated as follows: Signed Dierential Agreement (x3.2.8), y Frame-wise Ordinal Triplet Embedding (x3.2.4), z Perceptual Similarity Warp- ing Framework (x3.2.5), yy Trapezoidal Segmented Regression (x3.2.6), zz Trapezoidal Segment Sequencing (x3.2.7). . . . . . . . . . . . . . . . . . . . . . . . . . . 135 3.32 Landing page for Mechanical Turk workers interested in our continuous an- notation experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 3.33 Landing page for Mechanical Turk workers electing to participate in the movie violence annotation experiment . . . . . . . . . . . . . . . . . . . . . . . . . 139 3.34 Raw crowd-sourced annotations of portrayed violence over time in cut 4 from The Hustle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 3.35 Example output from the TSR optimization procedure for a single annotation. The left gure shows the two considered agreement measures during optimiza- tion for a range of trapezoidal regressions. The dotted vertical line denotes the number of segments selected by the heuristic-based optimization. The right gure plots the annotation and the selected TSR based on the optimization heuristic. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
3.36 Plots of the baseline averaged annotations and the PSW-warped annotations 145
3.37 Spearman correlation between the CSM ratings and various functionals of the baseline and warped annotations . . . . . . . . . . . . . . . . . . . . 146

ABBREVIATIONS

API Application Programming Interface
AUC Area Under Curve
AVEC Audio/Visual Emotion Challenge
CCA Canonical Correlation Analysis
CCC Concordance Correlation Coefficient
CORE Connected and Open Research Ethics
CSM Common Sense Media
DIY Do-It-Yourself
DTW Dynamic Time Warping
ECG Electrocardiography
EDA Electrodermal Activity
EEG Electroencephalography
FACS Facial Action Coding System
FIR Finite Impulse Response
GPS Global Positioning System
GTW Generalized Time Warping
HIPAA Health Insurance Portability and Accountability Act
HIT Human Input Task
ICC Intra-class Correlation Coefficient
IMU Inertial Measurement Unit
IRB Institutional Review Board
JMIR Journal of Medical Internet Research
KNN K-Nearest Neighbors
MCC Matthews Correlation Coefficient
MFCC Mel-Frequency Cepstral Coefficients
MOOC Massive Open Online Course
MOSAIC Multimodal Objective Sensing to Assess Individuals with Context
MSE Mean Squared Error
NaN Not a Number
NFC Near-Field Communication
NMI Normalized Mutual Information
PII Personally Identifiable Information
PPG Photoplethysmography
PSD Power Spectral Density
PSW Perceptual Similarity Warping
QP Quadratic Program
RBF Radial Basis Function
REST Representational State Transfer
RF Random Forest
RFID Radio-Frequency Identification
RGB Red, Green, Blue
ROI Region of Interest
SAGR Signed Agreement
SDA Signed Differential Agreement
SDK Software Development Kit
SNR Signal-to-Noise Ratio
SR Segmented Regression
SVC Support Vector Classifier
SVM Support Vector Machine
t-STE t-Stochastic Triplet Embedding
TAR TILES Audio Recorder
TILES Tracking IndividuaL performancE with Sensors
TSR Trapezoidal Segmented Regression
TSS Trapezoidal Segment Sequence
TVD Total Variation Denoising
USB Universal Serial Bus
USC University of Southern California
USD United States Dollar

Abstract

Mathematical and algorithmic modeling of human behaviors and experiences is a challenging domain of fundamental interest quite simply because we wish to better understand ourselves. Modern advancements in machine learning technologies and the big data revolution have bred insights across many areas of knowledge, yet progress in the broad field of human behavior research remains slow and has not benefited from this emerging machinery in the same manner.
This dissertation focuses on certain problems within the human behavior and experience modeling field and offers new strategies and methods for enhancing the quality of data gathered from people, ultimately to aid in understanding human behavior and experiences in various contexts. Human behavior modeling is taken not simply as a classic data-in, inference-out machine learning problem, but as an end-to-end whole encompassing also the collection of human data and experience annotations to serve as data labels. This manuscript argues and provides evidence that strong advancement in the field of human behavior research requires two conditions.

The first is having access to a large amount of high quality data gathered from many individuals and in a natural setting, as opposed to a laboratory environment, to facilitate accurate intra-subject and cross-subject modeling. The main barriers to scaling up participation in research in the wild are long-term participant compliance and difficulties related to the creation and maintenance of machine-assisted infrastructural support to capture, monitor, and ensure the quality of gathered data. A guide for conducting natural-setting research at scale is detailed, which gives researchers a framework for mindfully creating protocols and crafting data collection and monitoring systems which can improve the likelihood of obtaining high quality data.

The second condition is having a collection of labels of participant behaviors or experiences that is accurate and consistent. For dynamic real-time stimuli, these labels are often produced by participants or observers, or retrospectively by external annotators, and due to basic human differences, they contain artifacts and are influenced by personal biases. Identifying and removing these biases and artifacts to generate a gold-standard collection of labels is difficult and requires careful consideration of the human annotation process itself. This dissertation proposes novel algorithms for generating accurate gold-standard continuous numeric labels of real-time stimuli that leverage people's ability to more accurately compare than appraise.

The techniques in this manuscript enhance state-of-the-art practices for human behavior learning and improve the ability of researchers to obtain more accurate human behavior and experience models.

Chapter 1
INTRODUCTION

Human behavior and experience modeling is a challenging research domain that seeks to reveal fundamental features of personal experiences and shared human nature to help us live better and more informed lives. The study of human experience has grown rapidly over the last few decades, following technological milestones and spawning several sub-fields: neuro-ergonomics, human-computer interaction, human-robot interaction, and virtual human assistant projects, to name a few. More recently, both academic and industry interest in the field has exploded following the emergence of portable and pervasive sensing technologies, personal computerized devices, big data technology, and other machine learning techniques. Today we have the capacity to digitally collect an enormous amount of information about humans including biological signals, brain waves, the surrounding environment, physical activities, and other contextual cues.
In spite of this potential, advancements in human behavior and experience understanding have not readily benefited from this emerging machinery, due in part to the quantity and quality of representations and measurements of human behaviors and subjective experiences that are collected during human behavior studies.

1.1 Motivations

To properly illustrate the enormous complexity inherent in human experience modeling, especially for subjective mental constructs, let us consider one particular example problem: how do we measure and quantify fun? The pursuit of fun has been a fundamental endeavor of human leisure time, but though the experience is of academic interest to philosophers, sociologists, psychologists, and the (now burgeoning) gaming industry, little research has been conducted until very recently on the nature of this experience, perhaps due to the difficulty of finding a suitable, generalizable definition of the word. Some researchers suggest fun is a state achieved when the people pursuing it enter a psychological state of flow [8] where the mind is highly focused on the activity at hand. Others believe it describes engagement in an intrinsically motivated activity [9], and yet others postulate the term fun itself is a sub-cultural construct that has no consistent meaning in a broader context.

Any scientific endeavor intending to glean insights into this mental construct we call fun needs to begin with a clear goal. One approach avoids the use of the word fun, replacing it with a specific interpretation of its meaning, such as flow. Even this approach proves difficult; a prerequisite for achieving a flow state is one's complete engagement in an activity, and there are many ways to define and measure engagement: visual attention, motivated interest, the continuous allocation of mental resources to a task, among other possibilities [10]. For an experiential state as socially ubiquitous and subjective as fun, the risk of providing this kind of specificity is that the actual experience being measured no longer resembles each individual's notion of fun. An alternative approach avoids defining fun and simply asks research participants to decide when and how much fun they have during or after performing an activity. This method limits the assumptions researchers need to make about the mental construct itself but creates further complications. If fun is indeed a mental state involving some form of flow or engagement and participants are interrupted or asked to report when they are having fun, their full engagement in the desired activity is disrupted and the risk is again that the gathered data may not reflect the actual experience.

There are many ways to set up scientific studies to learn about fun, but eventually all of them require human involvement at various stages of the experiment: (1) human subjects to be studied, (2) human observers to provide labels for the moment-to-moment experiences (e.g., fun or not fun), and (3) people to interpret the results. Sources of bias, whether incidental or deliberate, are introduced any time people are involved, and human behavior studies require people at every stage. In order to empower researchers to gain insights into these sorts of subjective mental constructs, which often have a socially shared and intuitive meaning but are difficult to accurately define or measure for an individual, we must pay careful attention to the quality of data measured from subjects and their associated labels of the intensity of experiences.
1.2 Objectives and Contributions

This dissertation provides methodologies and algorithms that contribute to improving the quality of the human behavior and subjective experience data often used to train behavior inference models. This manuscript seeks to ameliorate the predictive power of trained behavior models by focusing on the quality of the data used as input to these learning models. The human component of the human-produced training data for these models is closely examined from two perspectives: (1) how human factors influence the quality of the data collected by sensors or via observation of human participants (typically used as training samples in machine learning), and (2) how human errors and biases influence the quality of annotations of subjective experiences (typically used as "ground truth" labels in supervised machine learning). In the former case, it argues that some human mental states and experiences, such as fun or engagement, cannot be modeled well across individuals or from data collected in contrived settings. Thus, a methodology is outlined for researchers needing to collect large quantities of high-quality human data unobtrusively in realistic and natural settings. In the second case, this dissertation proposes a suite of novel methods for generating ground truth labels over time of dynamic subjective experiences which are more consistent and better representations of the intended mental constructs.

The term quality is used loosely throughout the dissertation because, when used to describe human-produced data, it too has several interpretations. Kahn et al. [11] offer a high-level taxonomy of some of the different aspects of data quality as it relates to data gathered in real-world (ecologically valid) studies. Their categorization deconstructs data quality into various features including conformance, completeness, plausibility, verification, and validation, which together determine the suitability of a collection of data for a certain type of analysis. In this manuscript, the term quality should be interpreted as a measure of "fitness for a purpose." This means that the quality of data cannot be entirely determined by direct inspection alone but is instead dependent upon the environment, context, or manner in which it is collected, how it is processed and handled when transforming and extracting its latent information, and how it is used during analysis to achieve a research goal. There is no single "silver bullet" metric or formalization that captures all of the desired properties of high quality human data, so this dissertation carefully dissects and examines modern approaches to human data collection, identifies certain limitations, and proposes protocols and methodologies for enhancing the overall utility of the data.

One particular aspect highlighted throughout this work is the human need for support when either consciously or passively producing data. A helpful analogy for understanding how the methods and approaches presented here enhance the quality of human data is a foot race. In one scenario, asking participants to allow researchers to gather observations and measurements of themselves and their surroundings is akin to asking them to run a marathon without proper training; it is possible, but difficult, and it can become much easier when a crowd of excited observers cheers them on.
In the other scenario, asking people to produce quantified representations of subjective experiences is similar to asking them to run a race while jumping over hurdles; most people struggle, but if we change the rules by allowing observers to adjust the hurdle heights, then reaching the finish line is much easier. In both cases, when foot race runners receive support, either from a cheering crowd or from observers adjusting hurdles, the race becomes easier to complete successfully. This dissertation argues that by implementing systematic, methodological, and infrastructural support facilitating the collection of human-produced data, the overall quality of that data can be substantially enhanced, thereby enabling greater understanding of human behaviors and experiences.

1.3 Overview of the Thesis

Chapter 2 provides background information for typical human behavior and experience studies, highlights certain shortcomings, and examines their impact in a case study. That chapter identifies and describes the two sources of human data used when modeling cause-effect relationships for human understanding. Problems and methods for improving the quality of human data from these two sources are closely examined in Chapter 3: (1) pertaining to measurements and observations of people and their environments, and (2) regarding the representations of experiences and perceptions elicited during the annotation process. Chapter 4 discusses implications of this work and highlights promising avenues for related future research.

Chapter 2
HUMAN BEHAVIOR MODELING: BACKGROUND AND CHALLENGES

This chapter presents and elaborates on some of the barriers preventing current human behavior research and modeling efforts from achieving deeper insights about human experiences. First, an overview of the standard formula for conducting human behavior research is described. Then the chapter examines modern approaches employed in human behavior studies and highlights some of the pitfalls and barriers to obtaining the highest quality data from people. Finally, a case study is presented, following modern best practices, attempting to glean a meaningful relationship between observed human behaviors and perceptions of engagement in an activity. The results of this study, combined with a reflection on its procedures and those of other prior works, suggest that more informative, longitudinal, and higher quality data should be obtained to improve the process of learning about behavior and experience relationships.

2.1 Human Behavior Modeling Background

One of the primary relationships of interest in human behavior research is the cause-effect relationship, aiming to understand how various stimuli and exogenous changes impact a person's mental state and future behavior. Figure 2.1 shows a simplified depiction of three of the numerous factors that may be of research interest. In this instance, "exogenous changes" may be defined as the external events or stimuli that either intentionally or unintentionally attract the attention of a human observer. The "mental processes" transition represents an observer's active cognitive processing of the changes, for example, perception, attentiveness, or fight-or-flight responses. This mental processing may result in some change to the observer's internal mental state, which quite broadly and inclusively refers to emotional states, desires, goals, stressors, and other factors.
Figure 2.1: One example sequence of events of interest to human behavior researchers. Some external change, once observed by a human observer, produces a new or altered mental state as a direct result of some mental process.

One concrete example of this cause-effect relationship may be a familiar one: a stack of paperwork arrives in the inbox of an office worker (exogenous change), who observes its arrival (mental process), and responds by feeling more anxious or stressed (mental state).

Of course, the picture becomes more complicated when considering more of the factors that influence mental states and experiences, as shown in Figure 2.2. This diagram illustrates how various factors influence and are influenced by a person's mental state and internal experiences. The arrows represent the direction of influence in a cause-effect relationship, and some example processes that facilitate each influence appear in black text next to the arrows. Three factors (exogenous changes, bodily sensations, and expressed behaviors) are labeled as being directly observable because measurement instruments (e.g., physiological sensors, gesture and facial recognition cameras, neuroimagers, etc.) can capture informative representations of their conditions (although these measurements can be noisy, as we will discuss shortly). The mental states and experiences box is labeled as hidden because no measurement tools today are able to directly capture representations of fundamental experiences. At present, the primary research tools available for accessing these mental states and experiences are self-assessments or observer-based assessments, which are also sometimes referred to as annotations (pictured in the bottom right corner). The key purpose of this figure is to illustrate how all of the various pathways or channels connecting the hidden mental states to these annotations are vulnerable to personal or observer biases. These biases can ultimately skew the resulting annotations of mental states and thereby degrade the informativeness and quality of the annotations.

Figure 2.2: A broader view of the factors influencing unobservable mental states and internal experiences. Arrows point in the direction of influence and represent active human processes. Examples of these processes appear next to each arrow in black text. Jagged arrows represent processes which are vulnerable to various forms of bias, also referred to as unwanted noise. Since hidden mental states cannot be directly observed or interpreted, annotations of those mental states help researchers learn more about how the observable world influences them. This diagram illustrates how annotations can only be obtained via noisy channel processes subjected to various biases.

Figure 2.3: A common framework for human behavior modeling. Data is measured from an individual's surroundings to form a vector x. Self-reports or external observations are made of the individual's experiences to produce a vector of mental state annotations y. Machine learning is used to find a function f and parameter set Θ such that L[y, f(x; Θ)] is minimized for some loss function L.

As mentioned previously, a common modeling goal in human behavior research is to learn a mapping function that represents the aggregate effect of influential factors on a person's hidden mental states. For subjective mental constructs, however, the model can only be derived from measurements of the observable factors and annotations of mental state.
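Written out, this mapping goal takes the form of a loss minimization over the measured data and the corresponding annotations. The notation follows Figure 2.3; expressing the objective as an average over N paired samples is an illustrative assumption added here:

    \Theta^{*} = \arg\min_{\Theta} \; \frac{1}{N} \sum_{i=1}^{N} L\big[\, y_i,\; f(x_i; \Theta) \,\big]

where x_i is the vector of observable measurements for the i-th sample, y_i is the corresponding mental state annotation, f is the candidate behavior model, and L is the chosen loss function.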
Figure 2.3 depicts this modeling goal, posed as a minimization problem. The input x refers to the collection of measurements of the observable factors, which may include physiological, contextual, and environmental information. The other input y refers to annotations of the otherwise inaccessible mental states. A classic example of this modeling setup occurs when a customer service call center asks someone to rate her/his level of satisfaction after speaking to a representative. The reported level of satisfaction becomes y, various measurements and features of the representative's phone interaction become x (e.g., tone of voice, time on phone, etc.), and the goal is to learn a model that can predict customer satisfaction based on call center representative behavior.

A substantial amount of research effort, especially in machine learning, has been applied to this basic modeling paradigm for human behavior studies and focused on finding the right type of function f, the right loss function L, and the right parameterization Θ. With the same end goal in mind of improving the model, this dissertation instead focuses on improving the quality of the data x and y used to train and evaluate the model. Figure 2.4 shows the same basic modeling framework but with the observable and non-observable influential factors added in. This manuscript focuses entirely on protocols and methods for improving the overall quality of the data used as inputs during behavior inference modeling.

2.2 Human Behavior Modeling Challenges

Recent advances in neural network techniques have shown an unprecedented ability to discover more complex and more accurate models than ever before, and they continue to improve each year. These improvements, however, come at a cost: the models need a lot more data. Though many publicly available human behavior data sets exist and have been used to train these advanced machine learning algorithms, human behavior modeling of subjective mental states has not yet realized the same benefit. Though there may be many reasonable explanations, there at least seems to be a very large degree of variability both within this data and between people, which hinders the ability of the learning models to discover meaningful patterns. This dissertation argues that more data, richer data, and higher quality data from a wide variety of contexts, environments, and people are needed to solve this problem. Human behavior experiments, unfortunately, require many person-hours to plan and conduct and can thus be expensive to perform and manage at scale. There is a sore need for scientific human behavior studies to scale more efficiently to large populations of thousands or hundreds of thousands of people while also improving the quality of the data gathered, so the gains from emerging modeling technologies can be realized. This poses a major challenge for the future of human behavior research and is one of the topics addressed in this dissertation.

A second major challenge is ensuring that the annotated representations of mental states used as labels in these learning models are as consistent and accurate as possible. As mentioned, there are many potential sources for unintentional bias and noise to be introduced into these annotations. Some types of annotation styles, such as discrete Likert-scale surveys, have been very heavily used and studied, and many shortcomings related specifically to how humans provide information are fairly well documented and understood [12, 13].
A certain type of annotation process, real-time continuous-scale annotation, which is the focus in this dissertation, is relatively new but an indispensable tool for understanding the dynamics of subjective mental constructs. The primary challenge regarding these annotations is that the manner in which people provide the annotations, and the capacity or limitations of people to provide accurate information, have not yet been thoroughly studied. This dissertation dives deeply into problems emerging from these types of annotations and proposes methodologies for addressing them.

Figure 2.4: An expanded common framework for human behavior modeling. The data x and y used to train a model are shown along with the original sources of that data. The overall quality of the data (their fitness for the purpose of training the behavior model) is impacted by measurement noise, cognitive biases, and human error.

2.3 Case Study: Distance Learner Engagement Assessment

This section presents two experiments and a data set which together aim to model the engagement level of students viewing lecture videos as a function of physiological data. The experiments and results demonstrate that a link exists between student engagement and the measured physiological factors, but they also underscore the challenges in human behavior modeling previously discussed. Following a description and analysis of this experiment, a discussion considers some steps to help ameliorate the difficulty of these challenges and improve human behavior modeling research efforts. The following experiments and results are reproduced from published works [14, 15].

2.3.1 Introduction

Participation in massive open online courses (MOOCs) has steadily increased over the last decade [16] as more educational content has become available to distance learners. The attrition rate of students enrolled in these online courses, however, continues to be substantially higher than that of equivalent university classes taught in person. While some students start these online courses with no intention of completing them, others drop out due to a lack of support and course difficulty [17].

Modern pedagogy proclaims that the way students are monitored, behave, and learn affects the manner in which they approach studying the content. This section adopts the hypothesis that enabling educators and MOOC platforms to monitor and respond to distracted and disengaged students in distance learning settings will reduce attrition rates and improve learning outcomes.

Measurements of engagement in distance education contexts have mainly focused on behavioral engagement [18, 19] and thus emphasized the use of learning management system interactions such as content views, forum views/posts, or language usage as predictor variables. Although these measurements have proven to help identify students on the path to educational success, they are scarce measures and may not distinguish well between inactive and at-risk students. A more active and observant engagement metric would be beneficial for monitoring student focus during lecture videos while also helping lecturers improve their presentation and flow.

This case study focuses on engagement as it pertains to students viewing course material via a MOOC platform. In the classroom, teachers are able to judge the engagement levels of students in real time, which provides passive feedback that enables these educators to adjust the structure of lectures and potentially identify at-risk students.
This type of feedback is absent in an online setting where students may be multi-tasking and at greater risk of distraction [20].

This study captures human behavior during online learning sessions in order to understand distance student engagement dynamics. Video data is collected and used for engagement assessment, as in prior work, and this study also examines consumer-grade electroencephalographic (EEG) signals, as they provide an affordable insight directly into the observable mental state at a high temporal resolution and may offer the best awareness of engagement compared to other physiologic signals [21]. EEG signals have been used to study engagement in several contexts outside of distance learning, and their predictive power is tested in this experiment.

This experimental design differs from other studies in that it allows students to behave naturally during the lecture session. Participants are allowed full freedom to eat, drink, take notes, and surf the web, and are encouraged to do whatever else they like while viewing the lecture. Afterwards, human annotators rate the students' engagement levels on a continuous scale while monitoring the camera recordings. Many annotation schemes are possible, but this approach is chosen for three reasons: (1) it affords more subtle variations in the ratings, (2) it allows the full temporal context to be integrated into rating decisions, and (3) it is more natural and similar to what lecturers do while teaching.

The aim of the work in this section is threefold. First, it analyzes the predictive power of state-of-the-art video and EEG features for student engagement prediction during mostly unstructured learning sessions. Second, it tests several machine learning models on these features using different ground truth representations of the engagement labels to assess the prediction performance end-to-end. Third, it compares results on personalized and cross-subject models, quantifying the performance gap for this data set, and reveals that good performance can be achieved per subject with very few labels.

2.3.2 Background

Efforts to automate student engagement estimation are fairly recent. Researchers have explored using camera recordings of students in the classroom and shown that computer vision techniques that estimate head poses for each student can predict engagement with some success [22]. Motion-based features have also been used on classroom videos to assess engagement based on the observation that students are more attentive and more engaged when they are moving less [23]. Some efforts have explored using affect recognition and motion features for predicting engagement for individual students in front of computers [24, 25].

Prior research in real-time EEG-based cognitive engagement assessment has focused on the relationship between the power spectral density (PSD) bands of voltage readings at different scalp sites and mental engagement [21, 26-28]. Foundational research conducted by Pope et al. [27] examines subjects interacting with a flight control system in real time and uses EEG signals with a feedback loop to adjust the demands and challenge of the interaction. They offer four potential metrics for cognitive engagement using different combinations of alpha, beta, and theta frequency bands at various EEG channels and find the index β/(α + θ), measured from the Cz, Pz, P3, and P4 sites, to be the most successful for maintaining operator engagement over a period of time. Later studies validate this engagement formula in similar kinds of vigilance tasks.
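For reference, this engagement index is simply the ratio of beta band power to the combined alpha and theta band powers, with each band power averaged over the listed electrode sites:

    E = β / (α + θ)

Higher values of the index are generally interpreted as greater task engagement, since beta activity is associated with alertness and active processing while alpha and theta activity are associated with relaxed or drowsy states.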
Chandra et al. [3] find engagement measurements using this equation are most accurate at different electrode sites: AF3, AF4, F7, F8. Other studies reveal the metric's success when cognitive engagement is measured by self-assessment and participant behavioral changes [26, 29, 30]. McMahan et al. [31] demonstrate this metric correlates with cognitive engagement during video game play, thus verifying its utility for non-vigilance tasks. Experiments from Mikulka et al. [32] and Freeman et al. [30] examine the measure's behavior in closed human-computer systems under negative feedback, noting that increasing the task difficulty when engagement is low (due to boredom), and vice versa, helps maintain engagement.

Other metrics for quantifying constructs related to engagement have been proposed. Gevins et al. [2] observe frontal theta band increases and occipital alpha wave decreases during tasks where the mental workload or cognitive demand changes, and they propose θ/α as a mental workload metric. Yamada [4] observes participants engaging in different visual tasks such as video game play and animation viewing and reports that the midline frontal theta wave is a consistent barometer for concentration on the given task. Though these measures do not claim to assess engagement directly, both mental workload and focused concentration are constructs of interest in its evaluation.

2.3.3 Data Acquisition

Data from 12 students was collected in the Experience Lab at the University of Southern California (USC). The lab is a multi-sensor bio- and psycho-physiologic data collection facility designed for capturing human behaviors during interactive computer tasks in a time-synchronous fashion. In this experiment, the lab was used to record front-facing video of students watching lecture videos. A Microsoft Kinect v2 sensor collected RGB frames at 30 Hz of the face and upper body of each student seated at a desk in front of the computer screen. This arrangement is similar to typical camera setups in laptops or on personal computers at home. An Emotiv EPOC+ EEG headset was also used to capture voltage fluctuations at 128 Hz in 14 scalp locations.

Students at USC enrolled in a graduate-level computer science course were enlisted for the study. Student recruitment occurred weeks prior to the final exam, and all students in the class were encouraged to participate in the study by reviewing recorded lectures from the class in the Experience Lab. All interested volunteers from the class were accepted, and the subject pool contained a mix of sexes and ethnicities of students in their mid-twenties.

During the study, participants were provided a seat at an empty desk and privacy in a sound-resistant room. Drinks and snacks were provided, and students were encouraged to eat, drink, and otherwise act naturally. The aim was to capture behaviors similar to those that students might exhibit when distance learning in the wild. Participants were allowed to select the lecture they were most interested in reviewing and encouraged to watch as much or as little of the video as desired past a minimum duration of 20 minutes. Observed behaviors included: pausing the lecture, eating/drinking, taking notes, interacting with personal devices, and surfing the web for related information while learning.

2.3.4 Annotation Protocol

Ratings of engagement were provided by human annotators.
USC undergraduate and graduate students were recruited to provide engagement ratings on a continuous [0, 1] scale in real time as they viewed the front-facing video of participants. Annotation was performed using a mouse, a standard slider user interface widget, and a viewport showing the video from the front-facing camera. For each trial, annotators were instructed to move the slider in real time according to their perception of the student's engagement. No clarifications or indications of how to interpret the term engagement were provided. Since many of the annotation tasks required at least an hour of time, annotators were allowed to pause and resume after taking a short break. Figure 2.5 shows the annotation interface.

Figure 2.5: The annotation interface used in this study. Annotators could play/pause the video and were instructed to move the engagement slider in real time to annotate their perception of the student's engagement level. (This image is reproduced with the subject's permission.)

Each session received approximately nine annotations. After finishing all annotation tasks, annotators completed an exit survey containing questions about their confidence in annotation accuracy and about any distractions that may have affected their labeling efforts. All annotators reported feeling confident both in their ability to assess engagement and in their actual labeling of it.

2.3.5 Feature Extraction

Several machine learning experiments are conducted to ascertain whether annotated engagement can be predicted using state-of-the-art visual features extracted from the front-facing videos and EEG engagement features. Different machine learning models are tested against various ground truth engagement representations derived from the annotations to try to find the best end-to-end system for distance learner engagement prediction. Details for the extracted features, the various ground truth methods employed, and the prediction results are presented next.

Video Feature Extraction

The following features are extracted from students' faces and upper bodies in each video frame:
- Facial landmarks
- Facial geometric features
- Facial action coding system (FACS) action units
- FACS eye movement codes
- Probability of emotional expressions
- Mean and median average optical flow velocities and average direction vector
- Head size and pose

First, a region of interest for the face in each frame is extracted using a local binary patterns histogram cascade implemented in the OpenCV library [33]. Once the face is located and cropped within each frame, 68 facial landmarks are computed denoting the eyes, nose, mouth, and face perimeter using the method described in [34]. Given the cropped face image and facial landmarks in each frame, a smaller region of interest (ROI) of the face is extracted, linearly skewed such that the eyes are horizontally aligned, and then resized to 32x32 pixels. A bank of 40 Gabor filters is applied to the square region, producing a set of 40 textures corresponding to facial gradients at various orientations, spatial frequencies, and scales.

From the facial landmarks, geometric features are extracted corresponding to pairwise 2D distances between landmarks per the method in [35]. FACS action units [36] are also extracted using an array of pre-trained binary linear support vector classifiers. Each classifier detects a unique action unit and is pre-trained on three FACS data sets from [37-39]. The average AUC across all 33 FACS classifiers is 89% in a leave-one-out cross-validation test.
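To make the crop-and-filter stage above concrete, the following is a minimal sketch of the face ROI extraction and Gabor texture bank using OpenCV. The cascade file path, kernel size, and the 8-orientation-by-5-scale split of the 40-filter bank are assumptions (the text does not specify the exact Gabor parameterization), and the landmark-based eye alignment step is omitted for brevity.

```python
# Minimal sketch: face ROI extraction and a 40-filter Gabor texture bank (OpenCV).
# Assumptions: local copy of OpenCV's LBP face cascade; 8 orientations x 5 scales;
# kernel size, sigma, wavelengths, and gamma are illustrative, not the thesis values.
import cv2
import numpy as np

face_cascade = cv2.CascadeClassifier("lbpcascade_frontalface.xml")  # assumed local path

def gabor_bank(n_orientations=8, n_scales=5):
    kernels = []
    for s in range(n_scales):
        for o in range(n_orientations):
            theta = np.pi * o / n_orientations   # filter orientation
            lambd = 4.0 + 2.0 * s                # spatial wavelength (scale)
            kernels.append(cv2.getGaborKernel((15, 15), 3.0, theta, lambd, 0.5, 0))
    return kernels

def face_textures(frame_bgr, kernels):
    """Detect the largest face, crop and resize it to 32x32, and return Gabor responses.
    Returns None when no face is found (treated as a missing-feature frame)."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])
    roi = cv2.resize(gray[y:y + h, x:x + w], (32, 32)).astype(np.float32)
    return [cv2.filter2D(roi, cv2.CV_32F, k) for k in kernels]
```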
The eye movement codes, also including the centered eye gaze, are extracted from the Gabor-filtered ROIs corresponding to FACS action units 61-64. A small 5x10 pixel region containing the eyes is pulled directly from the square ROIs and fed to a pre-trained RBF-kernel SVM. The SVM model is trained and tested on 1,888 images from the Columbia Gaze Database [40] and the RaFD database [41]. The prediction accuracy using this method is estimated to be 77%.

Emotional expression probabilities are derived using another classifier. The distance of the Gabor ROIs to each FACS action unit classifier's separating hyperplane is interpreted as a measure of predictive certainty. These certainties are used to train an RBF-kernel SVM to predict one of the eight basic emotional expressions (neutral, anger, contempt, disgust, fear, happiness, sadness, surprise) using four labeled data sets [37, 41-43] including both elicited and genuine expressions. In total, 3,365 images are used with a leave-one-subject-out cross-validation paradigm, resulting in an average accuracy of 80%.

Optical flow is generated using OpenCV's [33] dense optical flow library based on [44]. Since the subjects in this study are confined to a room, no motion other than the subjects' occurs. Mean and median flow velocities are computed by averaging over all pixels in the dense flow image. The average flow direction is generated by aggregating the 2D flow directions over all pixels and normalizing the result.

Finally, the head size and pose are extracted using the fR Face Recognition SDK [45]. This toolbox is pre-trained on millions of faces in the wild and reliably extracts head pose and a binary mask for pixels belonging to the head. The head size is estimated as the sum of head pixels, and the three Euler angles reported by the software are used as a pose estimate per frame.

Functionals of all of these features within 10-second, overlapping, sliding time windows are used as supplementary features for machine learning. Windows of this size are chosen based on results from Whitehill et al. [24] suggesting that average engagement can be summarized well over 10-second intervals. A standard collection of functionals is used, including: mean, variance, quartiles, min, max, and range. The final set includes 476 features in total, and with the functionals applied, the count increases to 3,808.

EEG Feature Extraction

The following metrics, proposed in prior literature and based on frequency bands, are evaluated for their utility in engagement assessment for distance learning:

1. β/(α + θ) at AF3, AF4, F7, and F8 [3]
2. θ/α, with θ from F3, F4, FC5, and FC6, and α from P7 and P8 [2]
3. θ at AF3, AF4, F3, and F4 [4]

Metric one, first proposed by Pope et al. [27] for measuring engagement, was adapted by Chandra et al. [3] for the EEG channels available in the headset. The last two metrics have been used to measure working memory load and concentration, respectively, in game-related interactive experiments, and they are employed here as proxies for engagement. The methods used to compute the power spectral density bands needed by these metrics are described next.

Raw data from the EEG headband is preprocessed by the headset using standard P3/P4 references. Utilizing the MEG and EEG Analysis and Visualization toolkit (MNE) [46, 47], the signals are zero-phase band-pass filtered from 0.5 Hz to 31 Hz, and then the power spectral density (PSD) is computed using a multi-taper approach for each electrode with overlapping one-second windows and 0.1-second spacing.
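A minimal sketch of this band-power pipeline using MNE-Python is shown below, followed by the three engagement proxies listed above. The channel naming and the exact per-window normalization are assumptions drawn from the surrounding text; the windowing parameters follow the text (one-second windows, 0.1-second hop, 0.5-31 Hz band-pass).

```python
# Minimal sketch of the EEG band-power pipeline and the three engagement proxies.
# Assumptions: Emotiv-style channel names are supplied by the caller, and the PSD is
# normalized by the total power in each window (per the per-subject scaling in the text).
import numpy as np
import mne

BANDS = {"theta": (4.0, 7.0), "alpha": (8.0, 15.0), "beta": (16.0, 31.0)}

def band_powers(eeg, sfreq=128.0, win=1.0, hop=0.1):
    """eeg: array (n_channels, n_samples). Returns dict band -> (n_windows, n_channels)."""
    eeg = mne.filter.filter_data(eeg.astype(np.float64), sfreq, 0.5, 31.0,
                                 phase="zero", verbose=False)
    n_win, n_hop = int(win * sfreq), int(hop * sfreq)
    out = {b: [] for b in BANDS}
    for start in range(0, eeg.shape[1] - n_win + 1, n_hop):
        psd, freqs = mne.time_frequency.psd_array_multitaper(
            eeg[:, start:start + n_win], sfreq, fmin=0.5, fmax=31.0, verbose=False)
        psd = psd / psd.sum()  # normalize by total power across channels in this window
        for b, (lo, hi) in BANDS.items():
            mask = (freqs >= lo) & (freqs <= hi)
            out[b].append(psd[:, mask].mean(axis=1))
    return {b: np.vstack(v) for b, v in out.items()}

def engagement_proxies(powers, ch_names):
    """Compute the three literature indices; the site lists follow the text above."""
    idx = {c: i for i, c in enumerate(ch_names)}
    mean_at = lambda band, sites: powers[band][:, [idx[s] for s in sites]].mean(axis=1)
    frontal = ["AF3", "AF4", "F7", "F8"]
    pope = mean_at("beta", frontal) / (mean_at("alpha", frontal) + mean_at("theta", frontal))
    gevins = mean_at("theta", ["F3", "F4", "FC5", "FC6"]) / mean_at("alpha", ["P7", "P8"])
    yamada = mean_at("theta", ["AF3", "AF4", "F3", "F4"])
    return pope, gevins, yamada
```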
The power spectrum is binned into standard brain wave ranges: delta (0.5-4 Hz), theta (4-7 Hz), alpha (8-15 Hz), and beta (16-31 Hz), and then normalized per subject by dividing by the total power across all channels within each window. Low-frequency artifacts such as those caused by blinks are not removed explicitly and may impact the signal-to-noise ratio of the delta band, but this band is not used by any of the metrics described above.

For the machine learning experiments, three functions of the PSD bands are computed for each channel within their respective frequency bands (alpha, beta, theta): proportional PSD features (e.g., P_AF3(θ)), inverse PSD features (e.g., 1/P_AF3(θ)), and pairwise differences between the same frequency band in different channels (e.g., P_AF3(θ) - P_AF4(θ)). The proportional and inverse features allow the machine learning models to capture proportional and inverse relationships between engagement and the power of the frequency bands at various channels [48, 49]. The difference features are simple estimates of the co-occurrences of differentials in the average power per frequency band at distinct sites in the brain.

Ground Truth Representation

Several methods for establishing ground truth engagement labels are tested in order to help find the best end-to-end prediction scheme. All tested ground truth labels are listed in the Labels column in Table 2.1. First, the separate annotations pertaining to each student are fused into a single continuous time series using the optimal lag-corrected annotation fusion method described in [7], dubbed the continuous set of labels. All other ground truth labels are computed from this lag-corrected and fused signal. Inter-rater reliability is measured by computing the two-way mixed, consistency, average-measures intraclass correlation (ICC(3,k)) separately for each subject and then averaging, yielding an ICC value above 0.6, which is "good" according to [50].

The following descriptions refer to label names in Table 2.1. The binary labels are generated directly from the continuous labels using the midpoint (0.5) as a threshold. This interpretation is designed to test whether annotators label engaged versus disengaged behaviors with respect to the midpoint of the annotation scale. The trinary and quintary labels are produced from the continuous labels by binning the data across study participants so that each bin contains the same number of samples. Though less interpretable, this scheme avoids having bins with too few samples, which can hinder machine learning. The delta-sign signal is created from the continuous signal by computing the sign of the forward differences, which corresponds to momentary engagement trends.

A unique approach is also used to generate ground truth labels where the continuous signal is binned using thresholds obtained by clustering the extrema and plateaus. This method leverages a recurring observation [51-53] that human annotators more successfully capture trends and less accurately represent exact ratings. Thus, though the ratings of individual peaks and valleys may not be precise or accurate, the hypothesis is that binning of these temporal clusters of points better captures the intended distribution of the data. To detect these extrema and plateaus, total variation denoising (TVD) is used to approximate the continuous signal. TVD has been successfully used to remove salt-and-pepper noise from images while simultaneously preserving signal edges [54].
The TFOCS Matlab library [55] is used to find a new sequence y_t that approximates the continuous annotation sequence x_t and minimizes:

\min_{y_t} \Big[ \sum_t \| x_t - y_t \|_{\ell_2}^2 + \lambda \sum_t \| y_{t+1} - y_t \|_{\ell_1} \Big]

The parameter λ controls the influence of the temporal variation term and the degree to which y_t is approximately piecewise-constant. This parameter needs tuning to produce a desirable sequence and for this study is hand-tuned to a value of 0.05. This produces a visually and approximately piecewise-constant version of the continuous engagement signal.

Nearly constant regions of the TVD signal are extracted and correspond to the peaks, valleys, and plateaus in the original continuous engagement signal. For each constant region, the average engagement value is recorded and included as a point in a 1D k-means clustering method with k = 3 to generate three categories: high engagement, engagement, and disengagement. Finally, the continuous signal is converted to k-means trinary labels by thresholding on the midpoints between cluster means.

2.3.6 Experiments and Results

Several machine learning experiments are performed using either the video or EEG features. Each experiment is described in the following paragraphs.

Engagement Prediction with Video Features

Feature imputation and selection are employed to decrease training time and improve the overall performance of the trained models. First, missing features (no face detected) are imputed using the mean feature values. Each feature with a Pearson correlation of less than 0.4 with the engagement labels is discarded. This correlation threshold is chosen using a grid search over the range [0.1, 0.9] with 0.05 spacing. Among the selected features, one of any pair of features with an absolute mutual correlation greater than 0.95 is removed, and finally all features are Z-normalized. The final feature count is reduced by over 50 percent on average, and the retained features include: eye gaze, facial landmarks, FACS action units (1, 2, 4, 5, 8, 14, 16, 18, 20, 22, 23, 29, 32, 33), emotion probabilities for neutral and fear, average optical flow magnitude and direction, geometric facial features, and head size and pose. Using this method, at least some features from each category (e.g., facial expression, optical flow, etc.) are incidentally preserved.

Various machine learning algorithms are tested on different ground truth representations for cross-subject engagement prediction, and the results are presented in Table 2.1. Sklearn's Python-based machine learning library [56] is utilized for all models, and the hyperparameters are optimized using a grid search. The table reports results from the optimal configuration for each model.

Table 2.1: Cross-subject engagement prediction F1 scores using different learning algorithms, engagement label representations, and video feature sets. Reported F1 scores are weighted averages from leave-one-subject-out cross-validation tests.

Labels          | Baseline               | Video features           | Functionals of video features
Continuous      | Mean value: 0.006      | SVR: 0.0566, LSTM: 0.11  | SVR: 0.016
Binary          | Majority class: 0.480  | SVC: 0.425, KNN: 0.470   | SVC: 0.408, KNN: 0.455, RF: 0.479
Delta-sign      | Majority class: 0.23   | SVC: 0.25, KNN: 0.27     | SVC: 0.23, KNN: 0.23, RF: 0.25
Trinary         | Majority class: 0.239  | SVC: 0.223, KNN: 0.222   | SVC: 0.193, KNN: 0.249, RF: 0.207
Quintary        | Majority class: 0.107  | SVC: 0.083, KNN: 0.087   | SVC: 0.099, KNN: 0.111, RF: 0.113
K-means trinary | Majority class: 0.249  | SVC: 0.247, KNN: 0.278   | SVC: 0.275, KNN: 0.232, RF: 0.213

For the continuous labels, all models use a mean squared error performance metric, and for all discrete labels a weighted zero-one loss function is employed, where the weights are assigned per subject proportional to each session's percentage of total samples.
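To illustrate the label-construction procedure described in this subsection, here is a minimal sketch that uses cvxpy in place of the TFOCS Matlab library and scikit-learn's KMeans for the 1D clustering. The run-splitting tolerance eps is an assumption; λ follows the hand-tuned value of 0.05 from the text.

```python
# Minimal sketch: total variation denoising of the fused annotation signal, then
# k-means trinary labels from the near-constant regions. cvxpy replaces the TFOCS
# Matlab library used in the text; the eps run-splitting tolerance is an assumption.
import cvxpy as cp
import numpy as np
from sklearn.cluster import KMeans

def tvd_denoise(x, lam=0.05):
    """Solve min_y sum_t (x_t - y_t)^2 + lam * sum_t |y_{t+1} - y_t|."""
    y = cp.Variable(len(x))
    objective = cp.Minimize(cp.sum_squares(x - y) + lam * cp.norm1(cp.diff(y)))
    cp.Problem(objective).solve()
    return np.asarray(y.value)

def kmeans_trinary_labels(x, lam=0.05, eps=1e-3):
    x = np.asarray(x, dtype=float)
    y = tvd_denoise(x, lam)
    # Split the approximately piecewise-constant signal into near-constant runs.
    boundaries = np.flatnonzero(np.abs(np.diff(y)) > eps) + 1
    region_means = np.array([seg.mean() for seg in np.split(y, boundaries)]).reshape(-1, 1)
    # Cluster the region means into three engagement levels, then threshold the
    # original continuous signal at the midpoints between cluster centers.
    centers = np.sort(KMeans(n_clusters=3, n_init=10).fit(region_means).cluster_centers_.ravel())
    thresholds = (centers[:-1] + centers[1:]) / 2.0
    return np.digitize(x, thresholds)  # 0 = disengaged, 1 = engaged, 2 = highly engaged
```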
Leave-one-subject-out cross-validation is used to test and optimize each model, and the results table shows weighted average scores across all subjects. The best-performing model (the KNN classifier trained on the video features with the k-means trinary labels) uses 201 neighboring points for classification.

Engagement Prediction via Ensemble with Video Features

Separately from the continuous annotations of engagement, discrete frame-wise annotations are obtained from subsets of the videos and used to augment engagement prediction accuracy. Representative segments from some of the videos are selected and labeled frame-by-frame via majority voting by a panel of human annotators. These supplementary labels correspond to four states: no face visible (invalid frame), disengaged, engaged (attentive), and highly engaged (focused).

Three binary classifiers are trained on subsets of the same video feature set to perform different tasks. The no-face classifier is trained to predict whether a face is not visible or obstructed. The attention classifier is trained only on frames for which a face is visible and predicts whether the student is visually engaged or disengaged. The focus classifier is trained only on frames with visible faces where the student is not disengaged and predicts whether the student is simply engaged or highly engaged. Average F1 scores for each binary classifier in a leave-one-out test are shown in Table 2.2.

Table 2.2: F1 scores for engagement prediction for three binary classifiers trained on different subsets of the data based on the supplementary discrete labels.

Labels    | Baseline               | Video features          | Functionals of video features
No face   | Majority class: 0.586  | SVC: 0.397, KNN: 0.611  | SVC: 0.395, KNN: 0.607, RF: 0.576
Attention | Majority class: 0.642  | SVC: 0.419, KNN: 0.591  | SVC: 0.533, KNN: 0.551, RF: 0.546
Focus     | Majority class: 0.791  | SVC: 0.492, KNN: 0.652  | SVC: 0.431, KNN: 0.745, RF: 0.791

A late-fusion ensemble classifier is assembled using these binary predictors together with the best KNN model for engagement prediction (Table 2.1) and the most easily predicted k-means trinary engagement labels. The late-fusion logic is displayed in Figure 2.6. This ensemble achieves an F1 score of 0.369, an improvement over the 0.278 from the best KNN classifier alone.

Figure 2.6: Late fusion ensemble for engagement prediction.

Personalized Engagement Prediction

This final set of experiments tests the utility of several learning models for predicting engagement for individual subjects using the most easily predicted ground truth labels (k-means trinary). Table 2.3 shows engagement prediction F1 scores for each individual, averaged across all subjects, when using five-fold cross-validation. To help mitigate learning deficiencies due to label imbalances, each fold retains a proportional fraction of each unique label value. This strategy avoids validating on held-out sets containing labels missing from the training sets.

Table 2.3: Average intra-subject engagement prediction F1 scores.

Labels          | Baseline               | Video features          | Functionals of video features
K-means trinary | Majority class: 0.254  | SVC: 0.357, KNN: 0.321  | SVC: 0.376, KNN: 0.477, RF: 0.615

Engagement Prediction with EEG Features

In this experiment, three machine learning models are employed for engagement prediction: support vector classifiers (SVC), K-nearest neighbors (KNN), and random forests (RF). Subject-agnostic models are trained and tuned on all available data with one subject at a time left out for testing. The hyperparameter C for SVC is optimized using a log-scale grid search between 10^-8 and 10^1 to obtain C = 10^-4.
A value of k = 3 is selected for KNN after optimizing over k ∈ {1, 2, ..., 30}, and 150 nodes are chosen for RF after testing node counts in {10, 20, ..., 300}. These same hyperparameters are used to train subject-specific models that use data only from a particular subject and a randomized five-fold split for training and testing. Average test results for each case are reported in Table 2.4.

Prior to training, temporally local statistics of the EEG features are extracted, including the mean, quartiles, range, min, max, and variance. Research has shown that perceived engagement varies slowly and can accurately be represented by statistics over windows as large as ten seconds [14, 24]. After separately testing window sizes of 5, 10, 20, and 30 seconds, five-second windows are determined to be optimal, and only these results are reported.

Figure 2.7 shows the EEG-based proxy measure of engagement from Gevins et al. plotted alongside the continuous ground truth for a single subject over the entire session. Plots of other engagement metrics for other subjects are qualitatively similar. To assess the effectiveness of these EEG-based engagement metrics in comparison to prior work, the Pearson correlation measure is used. Since some prior literature reports predictions over shorter sessions, the correlation is calculated and averaged over differently sized windows ranging from one second to the entire length of a session. Calculating the average correlation over sliding windows in this manner gives each metric more opportunities to demonstrate high correlation for all subjects. Figure 2.8 plots the correlation for the three tested engagement metrics over various window sizes.

Figure 2.7: An example engagement proxy from Gevins et al. [2] plotted over time. This metric uses the ratio θ/α, where θ is averaged across the F3, F4, FC5, and FC6 sites and α across the P7 and P8 channels.

Figure 2.8: Plots of average correlation between different EEG engagement metrics and the ground truth as a function of sliding window size. Each line color represents a unique subject. Left: working memory load [2], Middle: engagement [3], Right: concentration [4].

To ascertain whether there is meaningful information in the EEG data for predicting engagement, machine learning is employed using the proportional, inverse, and difference features and the k-means trinary ground truth labels, which are shown to be the most predictable ground truth labels in the video-based experiments. A majority-label classifier provides a baseline for comparison, where the most prevalent engagement label across the session is predicted for each frame. Model performance for the SVC, KNN, and RF classifiers is measured using the F1 score.

Table 2.4: Engagement prediction F1 scores for different learning models and features.

Data set              | Baseline               | EEG features                       | EEG feature functionals
Cross-subject         | Majority class: 0.249  | SVC: 0.311, KNN: 0.271, RF: 0.293  | SVC: 0.317, KNN: 0.288, RF: 0.307
Average intra-subject | Majority class: 0.254  | SVC: 0.419, KNN: 0.466, RF: 0.493  | SVC: 0.375, KNN: 0.533, RF: 0.625

Table 2.4 shows the results when these models are trained on the EEG per-frame feature functions and separately on the aggregates of these features over five-second windows. The best engagement predictor by a substantial margin is the random forest model trained on individual subjects, whereas the best cross-subject model performs only modestly better than the baseline. The sensitivity of the highest performing model to differences in window size is evaluated using the F1 scores for window sizes of 5, 10, 20, and 30 seconds.
All F1 score differences are within 0.01, indicating that the choice of window size is relatively insignificant for this data set.

Figure 2.9 shows confusion matrices for the best-performing models using video features for each of the three tested approaches: cross-subject classification with engagement labels, cross-subject ensemble classification with augmented engagement labels, and intra-subject classification with engagement labels.

Figure 2.9: Confusion matrices for engagement prediction for the best-performing learning models trained on video-based features. Rows correspond to true labels, columns to predicted labels, and element values are normalized across the rows. The k-means trinary labels are used in each case. Left: best cross-subject classifier (KNN), Middle: best cross-subject ensemble (mixed), Right: average intra-subject classifier (RF).

2.3.7 Discussion

Figure 2.8 portrays a large amount of variability in the EEG-based engagement proxy measures across subjects in this experiment. While some metrics achieve correlation magnitudes near 0.6 for a few subjects, there is no agreement on the optimal window size where this occurs. These plots also display a large degree of variability in maximum correlation between subjects for any window size, and the signs of the correlations are inconsistent. The disagreement in both magnitude and sign of the correlation shows that none of these engagement proxies exhibit convergent validity in this study, and they are therefore unreliable for distance learner engagement prediction across subjects.

The engagement prediction performance in all cross-subject experiments is unfortunately difficult to compare to the results in prior literature due to differences in tasks, data collection protocols, and prediction metrics. For EEG-based engagement prediction, Chandra et al. [3] report around 60% accuracy in cross-subject tests, which seems better than the 0.317 F1 score attained by the SVC classifier in this experiment (assuming a similar balance among labels). The other studies using EEG features for engagement-related measurement [2, 4] cannot be fairly compared directly.

Using video-based features for engagement assessment, Bidwell reports around 80% accuracy for detection of highly engaged states [22] but only around 38% for identifying attentive (engaged) states when using eye gaze features. Bosch et al. achieve a prediction accuracy around 64% when classifying facial expressions as engaged or not [25]. Using the k-means trinary labels and collapsing the engaged and highly engaged classes into a single state, the results reported in this section are similar. Bosch et al. also remark that the temporal window size for aggregating features does not seem to matter when measuring engaged facial expressions, which is confirmed in this work. Whitehill et al. demonstrate an ability to discriminate between four different levels of engagement using one-vs-many SVM classifiers trained on Haar-like features extracted from the face, achieving an average accuracy around 0.72 when training and testing on the same data set [24]. When chance-corrected using Cohen's κ, however, the value drops to around 0.31. To summarize, these works demonstrate that accurate engagement prediction across subjects is possible, but capturing the dynamics of engagement and the transitions between disengaged and engaged states is very difficult. These results agree with both experiments performed in this section.
Though the cross-subject ensemble conditioned on binary behavior features outperforms the other cross-subject models, its improvement is not strong enough to encourage further exploration. The best personalized intra-subject models in both the video-based and EEG-based experiments, however, perform significantly better on average and with little variation between subjects. The large performance gap between the cross-subject and intra-subject models (Figure 2.9) suggests that the video features used in this study are beneficial for student engagement prediction but do not generalize across subjects. To gain insight into the factors contributing to the improvement of the intra-subject models, a test of the features used in the best RF model is performed to determine which hold the most discriminative information. Twenty trials are conducted in which the data is shuffled uniformly before cross-validation training, and then the top 20 features with the highest decision weights in the RF are considered. For any given subject, the same top features appear in this list across trials, but the set of top features varies from one participant to another. This too affirms that the engagement measures and derived features are not as informative for generalized distance learner engagement prediction as prior literature suggests.

In order for the potential gains of intra-subject engagement models to be realized, it may seem necessary to efficiently scale up both the number of subjects and the amount of data collected in these types of experiments. Remarkably, if the best intra-subject model (RF) is trained on only a tiny random fraction of the data from a single subject, instead of using five-fold cross-validation but still ensuring that all engagement labels are represented, it can achieve nearly identical F1 scores. This result is consistent with an observation from Whitehill et al. [24] indicating a strong intra-subject similarity across frames. Thus, a smaller number of annotations is necessary. In this data set, an average of only 25 annotated frames spanning the three categories of engagement levels is needed to achieve a similar F1 score when training individualized models.

As a final observation, the novel k-means trinary method for generating discrete ground truth labels yields only modest performance benefits but still offers the best prediction performance overall. Therefore, some gains may be obtained by binning the less precise continuous ratings, which encourages exploration of similarly unique methods for ground truth generation.

2.4 Conclusions

The information in this chapter paints a clear picture of why advancements in human behavior understanding have not followed in stride with advancements in machine learning. Namely, there are many sources of measurement noise, personal biases, and human errors, which interfere with the quality of data gathered from human subjects and observers. On top of this, the presented case study elucidates two other issues: (1) high quality data for intra-subject modeling is much needed, and (2) ground truth labels obtained from annotations require closer inspection.

Collecting data for intra-subject modeling and validation requires the involvement of study subjects for extended durations. Ensuring the quality of the data and the applicability of the behavior models derived from them for prediction in natural settings also requires that the data be collected in realistic, non-contrived settings. These factors alone pose an enormous challenge to human behavior modeling research.
The entirety of Section 3.1 is dedicated to this problem and provides a comprehensive list of factors which researchers should consider in order to conduct long-term human data collections in natural environments at scale.

The other lesson of this chapter is that obtaining an accurate representation of a subjective mental construct is important for building accurate human experience models. The distance learner engagement case study demonstrates how different interpretations of the engagement annotations (i.e., different discretization strategies) lead to very different prediction accuracies. This raises the question: which interpretation, if any, yields the most accurate representation of the true perceived level of engagement of the subjects, and how can we be sure? Section 3.2 dives into various facets of this problem and proposes a collection of methods which can improve the consistency and accuracy of ground truth representations of subjective constructs.

Chapter 3

IMPROVING THE QUALITY OF HUMAN-PRODUCED DATA AND ANNOTATIONS

This chapter is a collection of various published works which together comprise a suite of techniques for improving the quality of human-produced data. The chapter is split into two main sections, the first of which addresses the quality of data gathered via sensing and observation of people in an experiment, and the second of which examines the quality of real-time continuous-scale annotations of mental states.

Section 3.1 highlights common problems observed in typical in situ data collections involving human subjects and introduces methods for ensuring the data gathered from observable factors (recall Figure 2.3) are as good as possible, given the particular constraints of the study and environment. This portion of the chapter presents a comprehensive guide for selecting and managing sensors, participant needs and compliance, and desired research outcomes for in situ research studies. Referring back to the foot race analogy from Section 1.1, the journey of a participant in this type of research study is similar to that of an unprepared runner in a marathon; each participant is more likely to complete the race when constantly cheered onward by a supportive crowd.

Section 3.2 digs deep into one burgeoning and popular method for collecting labels for dynamic and unobservable human experiences (see Figure 2.3): real-time and continuous-scale annotation. This kind of approach is often employed by researchers to obtain information about the intensity of a subjective experience over time. As you will see, human-produced annotations are not immune to various sources of human error, but there are ways of correcting for these mistakes to improve the annotations' consistency. Harking back to our running metaphor in Section 1.1, the real-time human annotation process is not unlike running a race with hurdles, which are potential sources of human error. By changing the rules of the race and adding in additional observers to support the runner by adjusting the hurdle heights, the race is much easier to complete.

3.1 Improving the Quality of Data Collected In Situ from Humans

This section outlines a framework and best practices researchers should consider for improving the quantity and quality of observable human data when conducting large-scale studies of populations in natural settings over long periods of time. This information is reproduced from a paper published in the Journal of Medical Internet Research (JMIR) [57].
3.1.1 Overview

Recent advances in portable consumer technologies have led to a surge in the development of electronic devices [58] for monitoring and tracking human activity, wellness, and behavior. Aided by the ubiquity of personal smartphones, Bluetooth, and Wi-Fi, many devices currently on the market can discreetly collect physiologic and behavioral signals and upload the information to remote servers. Because of the growing support for distributed and personalized sensing, diverse research communities are taking a keen interest in this field, empowering the coordination of research studies of populations outside the laboratory and in natural home or work environments (also known as studies in the wild) [59]. For research into everyday human behavior, such as daily routines, studies conducted in natural settings can yield more relevant and insightful data than those performed in the laboratory [60-65].

Several factors need to be considered for the collection of data in natural human settings using sensing devices. Different sensors have different sampling rates, power restrictions, and communication capabilities. Participants also have their own habits and daily routines into which the sensors and the data collection procedures need to be embedded. A data collection framework designed to operate in the wild should therefore be flexible enough to accommodate different data communication channels and be capable of capturing information from different people with different needs at different times. These factors and a host of other challenges mentioned in this work complicate the data collection process and ultimately affect the quality of data available for analysis.

Background

Several previous studies have described some of these challenges [66-68] and offered strategies for mitigating them [69-72]. Other works offer data collection plans for particular fields of study that address the unique concerns of their research areas [73, 74]. This paper subsumes many of the challenges and suggestions from these other works and aims to provide a comprehensive collection of methods and suggestions that help researchers address the challenges related to sensor selection and management in research studies. It specifically focuses on longitudinal studies aiming to unobtrusively capture and assess aspects of human experience and natural behavior; thus, it assumes a participatory study framework instead of a provocative approach [59]. Some examples of unobtrusive human behavior studies are StudentLife [63], AffectiveROAD [75], and a dataset on emotion recognition from wearable physiological sensing [76].

Objectives

Figure 3.1 illustrates a sequence of research program states at various stages for these types of studies. The scope of this section covers the pre-planning stages pertaining to sensor selection and the stages during a study related to sensor and data management. The key assumptions are as follows: (1) researchers already have a clear research objective in mind and have examined previous literature to develop a sense of the types of physiological and behavioral signals that may be helpful in achieving the goals, and (2) researchers have surveyed the landscape of sensing technologies and are beginning to design a study protocol and select the appropriate sensors.

The following guide is based on a survey of related work [66-72] and the authors' experiences in designing a multi-week research study.
It describes the main challenges that differentiate longitudinal and unobtrusive [77] studies in the wild from studies conducted in controlled laboratory settings. It also provides an overview of modern portable sensing capabilities and information workflows and outlines a general data collection framework that leverages an internet-enabled infrastructure for real-time data collection and feedback. Furthermore, it enumerates several criteria (or dimensions) that researchers should consider when designing a data collection protocol using portable sensors for a known participant population, and it discusses the manner in which these dimensions can affect human subjects concerns and data quality. This case study employs all the criteria and methods discussed, evaluating their effectiveness with respect to participant compliance and the number of notable unplanned events occurring during data collection.

Figure 3.1: An overview of the general scientific process for human research studies involving sensing.

3.1.2 Challenges with Studies in the Wild

This section discusses the challenges involved in designing protocols and using sensors to collect data about human behavior in the wild. It presents a general-purpose framework that researchers can use for envisioning and orchestrating sensor data flow, which subsumes the most common information flows provided by modern sensing technologies. This section also presents an exposition of the various criteria and dimensions against which all sensors should be evaluated before the beginning of the data collection period. Appendix A provides a concise checklist of the challenges in this section with the hope that researchers will use it in their discussions and planning regarding protocol design to help account for the numerous sensing challenges.

Studies conducted outside of controlled laboratory settings are of interest to researchers, as participants can be examined in their day-to-day environments where natural behaviors occur. Nevertheless, in the wild, many potentially confounding variables cannot be fully controlled, yielding unpredictable sources of variability alongside logistical difficulties. Some challenges in this kind of data collection are mitigated through careful planning and effective communication before the study begins. Other challenges are predictable, but they occur spontaneously, and they must be managed reactively with the aid of semiautomated systems. The following paragraphs describe the primary difficulties that are unique to studies in the wild and suggest strategies to help overcome them.

Sensor Logistics, Deployment, and Maintenance

One of the foremost difficulties is the logistical burden of deploying and maintaining sensors. As research teams have limited direct control over the environment for in situ studies, they should be aware of the different potential sensor failure modes and have a plan for quickly detecting and recovering from them. Sensor failure is often inevitable, especially for studies conducted at scale, and an effective solution is to simply replace the devices, preplanning to streamline this process. For example, arranging to have trained personnel available to meet with participants in their environment to swap defective devices can minimize data losses because of downtime. For sensors deployed in the environment itself (as opposed to wearable sensors), devising a mounting scheme that will allow for easy replacement may also help.
To aid in tracking the status of all sensors in a large study, planning upfront to create semiautomated tools that monitor the state of each sensor as often as possible can help identify failures quickly, report them to personnel for maintenance, and further decrease data collection downtime [78]. Moreover, the use of automated tools may become a necessity if the number of participants, sensors, or hours of recording becomes large. Data-driven approaches for detecting and identifying anomalous sensor data streams have been recently proposed in the literature [79, 80].

Implementing a strategy for automated ongoing maintenance of the deployed sensors is much easier once the research team has direct access to recent sensor data. A data flow framework (presented later in Section 3.1.3) outlines and describes the communication channels that carry sensor data to the data servers (the collection of systems where the data are collected and securely stored for later processing). Researchers can use this framework to plan communication paths for each sensor and then set up a script to run on the research server which monitors these data streams and notifies assistants when sensors malfunction. For example, automatic programs can be used to assess the quality of electrocardiography (ECG) signals and give feedback to the research support staff about potential fitting and usage problems [81].

Specific logistics and deployment strategies will be unique to each research study, and they will largely be influenced by the restrictions and constraints imposed by the research environment. For example, some hospitals require all equipment to be powered using three-prong plugs; therefore, all sensor chargers are required to be used through three-prong adapters. Other restrictions may include Wi-Fi availability, permission to mount sensors on the walls, availability of charging ports for sensors, and space for sensor storage, to name a few. Permissions for the research personnel to access all areas in which the study takes place should also be considered.

Data Loss

Data loss may occur for several reasons, including sensor or data pipeline malfunctions, poor participant compliance, and attrition, among others. For example, sensors may fail to deliver data as they run out of battery power or break, or they may fail to deliver when network outages interfere with data transfers. Subjects may also neglect the data collection protocol (including forgetting to wear the sensor or wearing the sensors without following instructions), forget to recharge a worn device, or fail to upload data at the end of each session; for example, the Hexoskin garment [82] requires manual data upload via universal serial bus (USB). In more extreme cases, subjects may become frustrated with the study and elect to drop out, thereby reducing the total amount of available data.

The key to mitigating these various sources of data loss is being aware of where in the data stream pipeline the losses occur. Section 3.1.3 enumerates the communication paths that help carry sensor data to their destination on a research server. Once researchers have decided on a sensor suite, and once they know which paths are required, small scripts or monitoring systems can be instrumented to test or infer the status of each communication channel and report failures to the research team.
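As an illustration of the kind of lightweight monitoring just described, the sketch below checks, for each sensor stream, when the most recent sample reached the research server and alerts staff when a stream goes stale. The SQLite table layout, the per-sensor staleness tolerances, and the notification hook are all assumptions; any storage backend or alerting channel could be substituted.

```python
# Minimal sketch of a sensor stream monitor. Assumes a 'samples' table on the research
# server with (sensor_id, timestamp) columns; tolerances and alerting are illustrative.
import sqlite3
import time

STALENESS_LIMIT_S = {"eeg_headset": 15 * 60, "wrist_ppg": 60 * 60, "room_temp": 6 * 3600}

def notify(message):
    print(f"[ALERT] {message}")  # placeholder: send email/SMS/Slack message to research staff

def check_streams(db_path="sensor_streams.db"):
    """Flag any sensor whose most recent sample is older than its staleness tolerance."""
    con = sqlite3.connect(db_path)
    now = time.time()
    for sensor_id, limit in STALENESS_LIMIT_S.items():
        row = con.execute(
            "SELECT MAX(timestamp) FROM samples WHERE sensor_id = ?", (sensor_id,)
        ).fetchone()
        last_seen = row[0] if row and row[0] is not None else 0
        if now - last_seen > limit:
            notify(f"{sensor_id}: no data for {(now - last_seen) / 60:.0f} minutes")
    con.close()

if __name__ == "__main__":
    while True:
        check_streams()
        time.sleep(300)  # re-check every five minutes
```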
For cases where data loss occurs at the source (i.e., the participants), Section 3.1.3 also describes a mechanism for sending feedback to the participants to notify them of the data loss and encourage them to remedy it.

Data Signal Quality and Unintentional Variability

Related to data loss, the signal quality of sensor data is a concern that presents a substantial challenge for research in the wild. The term signal quality used here refers to the ability of each sensor to measure its signal(s) of interest. Poor data quality may occur when sensors are not properly worn or maintained, such as when a wristband photoplethysmography (PPG) sensor, used to measure heart rate, is worn too loosely or when a microphone, perhaps recording ambient audio, is obscured. Instances of improper or inconsistent sensor usage are inevitable in large studies in the wild, and they can lead to an unintentionally higher degree of variability in the data captured across all participants, which may consequently skew the resulting statistical analyses.

Early steps should be taken to ensure that participants receive proper training for using the adopted sensors before the study begins and that clear and accessible instructions are made available to serve as a reminder. Making plans to monitor the quality of sensor data streams so that appropriate actions can be taken to rectify problems is also highly beneficial, especially for long-term studies. Once a process is in place to determine the quality rating of recent data, different intervening actions may be appropriate, depending on the participant population, study environment, and the goals of the research project. Some example interventions for improving data quality include the following: retraining participants in sensor usage, adjusting sensor fit, improving the network infrastructure to reduce downtime, or simply sending reminders to participants (e.g., smartphone push notifications) to remind them to wear their devices and upload the data. Quickly responding to rectify data quality drops can help preserve the value of the data and minimize data loss. If low-quality data persist despite these measures, automated signal enhancement methods may still be employed to algorithmically improve data quality. Section 3.1.3 illustrates how data from the sensors can be aggregated on a research server to facilitate quality monitoring.

Privacy and Security

Among the opportunities to generate scientific knowledge are significant challenges to the ethical conduct of research on human subjects [83]. Threats to privacy and data security constitute the greatest risk to participants of behavioral research in the wild. As sensing technologies become ubiquitous and data science advances, it is possible that passively collected digital data will be used to identify and predict a surprising range of human behaviors with increasing accuracy [84]. Participants are often unaware that when they consent to share data from their fitness tracker, they may be allowing researchers to infer information about their alcohol consumption, sexual activity, and mental health symptoms. The accidental or malicious release of this information could cause significant social, occupational, and psychological consequences to participants. An informed consent process must provide clear and transparent communication about what data are collected, how the data will be used by researchers, how data are anonymized or kept confidential, and how data are securely transferred, stored, and destroyed.
Researchers must stay up-to-date on evolving privacy and security concerns and best practices for mitigating risk.

Another significant challenge when conducting studies in the wild is respecting and protecting the privacy of nonparticipants coincidentally present in the research environment. For research scenarios in which raw audiovisual data are collected, extra steps must be taken to ensure that either no personally identifiable information (PII) is recorded about nonparticipants or that they are informed that they may be recorded, where appropriate and depending on municipal or state regulations and institutional review board (IRB) approval. A tactic for avoiding the collection of PII, even accidentally, is to immediately transform the collected raw data streams, such as audio or video, into anonymized features (intonation, mel-frequency cepstral coefficients (MFCCs), gestures, and posture) and record these instead [5]. Another important step toward maintaining privacy is to ensure secure transmission of sensor data to the research server with as few transfers to intermediate nodes as possible. Section 3.1.4 discusses methods for securely transmitting information across a network, and the information in Section 3.1.3 can help researchers plan secure communication paths.

3.1.3 Data Acquisition and Flow Framework

State-of-the-art electronics and sensing technologies offer a wide variety of communication protocols for sending information among devices. Selection of the appropriate sensors for a research project depends on many factors, which are discussed in more detail in Section 3.1.4. A crucial step toward evaluating each sensor is to understand the ways in which its data can be transmitted through different communication channels and how its data flow may be affected by the choice of other sensors and data hubs. The proposed general sensing framework, deemed suitable for studies in the wild, depicts common transmission paths for data flowing from multiple sensor streams through disparate network paths and arriving on a secure server that is accessible only by the research team. The framework aggregates data in a single place, allowing for simpler implementations of automatic stream monitoring and participant feedback systems.

Information Flow Layers

Figure 3.2 depicts potential information pathways through different communication channels for passing data obtained from sensors (in the left column) to data servers (right column), where all the information passes through an intermediate data hub layer (middle column). These intermediate hubs are any devices that act as bridges to facilitate the aggregation and delivery of transient sensor data into long-term storage. Most of the available sensors on the market support a data flow matching some combination of paths in this figure. The primary aim of this framework is the aggregation of all sensor data onto a single research server where additional processing, monitoring, and feedback can be performed. The following paragraphs describe each of these layers (columns) in detail.

Figure 3.2: A framework for studies of human behavior in the wild, showing common potential information pathways for data produced by sensors (e.g., physiologic and activity), destined to be stored on a single research server. This type of data flow paradigm enables centralized data monitoring and facilitates immediate automatic participant feedback regarding data quality and compliance via the participants' smartphones.
RFID: radio-frequency identification; NFC: near-field communication; USB: universal serial bus; API: application programming interface.

Sensors

Sensors for studies in the wild can broadly be grouped into three categories: environmental sensors, non-wearable (human) trackers, and wearable sensors.

Environmental sensors: These devices passively capture information about their surroundings. Some examples of data captured by these types of sensors include temperature, humidity, and CO2 levels, inertial measurements (e.g., from accelerometers, gyroscopes, or magnetometers), and acoustics. These devices often perpetually broadcast information about their surroundings using low-energy Bluetooth or radio-frequency identification (RFID) signals. Sampling rates below 1 Hz or event-based sampling techniques are typical, as environmental data usually change slowly (at least compared with physiological signals).

Non-wearable trackers: These devices are placed in the environment, and they capture information about subjects and their behaviors indirectly or in a passive way. A few instances of these types of devices include RFID scanners, Doppler-effect and under-the-mattress sleep trackers, infrared gaze trackers, and video cameras. These sensors often operate on wall power and may include network capabilities for simplifying data transmission to long-term storage on the data servers. They also often include companion websites or smartphone apps for visualizing metrics extracted from the sensor data.

Wearable sensors: These types of sensors encompass the set of custom-built or consumer products that are worn or carried on the subject's body to collect physiologic or contextual data, or features extracted from data (e.g., heart rate from electrocardiogram data), for behavior and activity tracking. Devices such as smart watches and wristbands, smart undergarments (underwear, T-shirts, and bras that collect data), smart rings, voice activity detectors, and smart shoe soles are some examples. Many of these devices can be recharged for long-term use over multiple sessions, and they generally communicate either via Bluetooth with companion apps installed on users' smartphones or via USB connections with personal computers. The companion apps tend to provide visualizations of the received data and upload functionality for long-term storage on third-party servers. Interestingly, when running certain tracking apps in the background, smartphones themselves can also serve as wearable sensors, collecting information about user movement and smartphone usage patterns.

Data Hubs

Data hubs are devices dedicated to collecting, aggregating, and transmitting sensor information to data servers. Transient data sources, such as many environmental sensors, have little memory, and they need to have their data collected continuously and retransmitted to a data server for long-term storage. Wi-Fi-enabled data hub devices with Bluetooth capabilities can serve as conduits for these types of data streams, whereas personal computers can act as data hubs for USB-only sensors. Battery-powered sensors collecting data at a high rate usually communicate via USB, as the bandwidth and transmission speed are higher and as wireless data transmission drains more power. Battery-powered sensors can afford to send a smaller amount of data through Bluetooth, whereas smartphones can often serve as both data hubs and data visualizers.
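To make the hub-to-server path concrete, the following is a minimal sketch of a data-hub relay that batches locally collected samples and forwards them to the research server over HTTPS. The endpoint URL, authentication token, payload schema, and the read_samples() stand-in are all hypothetical.

```python
# Minimal sketch of a data-hub relay in the spirit of the framework above. Samples
# gathered locally (e.g., over Bluetooth) are batched and forwarded to the research
# server. The endpoint, token, and payload format are hypothetical placeholders.
import json
import time
import requests

SERVER_URL = "https://research-server.example.org/api/v1/samples"  # hypothetical endpoint
API_TOKEN = "REPLACE_WITH_STUDY_TOKEN"

def upload_batch(sensor_id, samples):
    """Send a batch of (timestamp, value) samples over HTTPS; return True on success."""
    payload = {"sensor_id": sensor_id, "samples": samples}
    try:
        resp = requests.post(SERVER_URL, data=json.dumps(payload), timeout=10,
                             headers={"Authorization": f"Bearer {API_TOKEN}",
                                      "Content-Type": "application/json"})
        return resp.status_code == 200
    except requests.RequestException:
        return False  # keep the batch buffered locally and retry later

def relay_loop(read_samples, sensor_id, period_s=60):
    """read_samples() is a stand-in for whatever Bluetooth/USB read the hub performs."""
    buffer = []
    while True:
        buffer.extend(read_samples())
        if buffer and upload_batch(sensor_id, buffer):
            buffer = []  # only clear the local buffer after a confirmed upload
        time.sleep(period_s)
```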
Data Servers

The term data servers refers to the collection of machines in which all the sensor data are stored. A typical consumer off-the-shelf sensing device will provide some pipeline for getting data off the sensor and into a data store in the cloud, usually owned by the sensor product's company. These companies often provide an application programming interface (API) for accessing the data, using automated tools that transmit the data securely to protect subject privacy. Eventually, all the sensor information needs to be aggregated in a single place, the research team's own server, so that the team has permanent and easy access to it. The aggregation of all sensor data on this server continuously throughout the data collection process enables monitoring and feedback systems to help manage some of the challenges mentioned in Section 3.1.2.

3.1.4 Considerations and Criteria for Sensor Selection

Minimizing participant risk and burden while maximizing the amount and quality of data is of primary importance. The set of sensors used plays a major role in a study's outcome, as data quality is inherently constrained by the sensors' characteristics and the participants' interactions with those sensors. Selecting the appropriate sensors to employ in a research study can be complicated, as the market provides many options and each device has unique qualities and capabilities.

This section establishes a comprehensive list of the different criteria that should be considered before data collection begins. In practice, researchers must strike a balance between meeting their research objectives and ensuring a smooth participant experience to minimize attrition and data loss. Both needs are constrained by the properties of the sensors that are available or can be produced. The criteria are partitioned according to whether each criterion is a characteristic of the sensor or whether it primarily concerns either the researchers or the participants. It is important for researchers to carefully review each one of these criteria, as they are highly connected, and each choice affects the outcome and experiences for both researchers and participants.

Table 3.1 lists the different sensor criteria grouped according to whether they primarily concern research objectives and logistics, sensor characteristics, participant experience, or human subject protection during the study. The categorization is not perfect, as some of the criteria pertain to more than one group, but it helps emphasize the different perspectives researchers should examine when selecting sensors. Key criteria are included in this table, which are expected to remain relevant as technology changes; however, there may be other factors worth considering, depending on the specific needs of a research project.

Table 3.1: Considerations and criteria for sensor selection.

Research objectives and logistics: signals of interest; data properties and quality; data access logistics; sensor synergy; additional experiment setup costs; on-site infrastructure requirements
Sensor characteristics: sensor customizability; cost; battery life; operating system support; robustness; provider support
Participant experience: cohort and individual suitability; burden to participants; sensor acceptance among target population
Human subject protection: access and usability; privacy; data security

The following paragraphs describe each criterion in detail.
Criteria Related to Research Objectives and Logistics

The criteria in the following paragraphs pertain to the logistical implications of the selected sensors and the ways in which the selection affects the final outcomes and goals of a research project.

Signals of interest. The following criteria relate to the signals and how they are measured.

Target signals: Before data collection begins, researchers need to consider what types of signals they want to measure from the participants or the environment. Varying amounts of potentially relevant information can be obtained from signals collected from different sources, such as physiology (e.g., heart rate, breathing rate, and electrodermal activity (EDA)), behavior (e.g., time spent speaking, sleeping duration and stage progression, number of steps per time interval, social interactions, and surveys), and the environment (e.g., temperature, humidity, and CO2 levels). The utility and overall quality of the chosen signals depend on the sensors' measurement mechanisms.

Measurement mechanism: The physical mechanism through which a signal is acquired affects its quality and overall utility for future analysis. As an example, human location and kinematic data can either be reconstructed from a series of global positioning system (GPS) coordinates or inferred from an inertial measurement unit (IMU), such as an accelerometer and gyroscope. GPS data tend to produce more accurate location measurements and less accurate kinematic ones; IMU location accuracy drifts over time but yields better kinematic figures, whereas GPS data can be used to aid in calibrating step count from an IMU [85]. Another example is heart rate, which can be inferred through PPG or ECG, each of which yields significantly different signals and properties. The measurement mechanism may be constrained by a sensor's form factor requirements (wristband vs. garment), which may limit the quality of data that can be obtained.

Data properties and quality. These criteria are important for assessing the quality and potentially undesirable aspects of gathered data.

Sampling rate: For most consumer sensing technologies, the sampling rate is fixed by hardware design and power constraints, and it cannot be altered. It is always possible to decrease the number of samples considered for analysis purposes by downsampling data originally collected at a higher rate. However, upsampling data collected at a lower rate introduces distortions into the signal [86], and that may impact its utility for later analysis. The sampling rate of any sensing device should be at least twice the maximum frequency of the desired underlying signal for the recording to provide reasonable fidelity (per the Nyquist-Shannon sampling theorem). The human voice, for example, can be characterized by pitch and formants (among many other features), which require sampling rates at least twice the maximum vocal frequency (typically greater than 8 kHz) for adequate analysis. However, tracking the position of a person inside a building can be sampled around once per second with meter-level accuracy, based on average indoor walking speeds [30]. Researchers should be aware of the analytical power of the target signals and choose sensors capable of capturing data at a frequency where meaningful information can be extracted.

Signal-to-noise ratio: Measured data will only be useful if the signal-to-noise ratio (SNR) of the measurements is higher than a certain threshold.
Noise in this case refers to any unwanted alterations to a signal during the measurement process, and it can appear for many reasons. If the noise is too high, it might not be possible to extract the relevant information from the measurements. For example, ECG-based heart pulse measurements may be subjected to noise when a participant moves or when the electrodes attached to the skin brie y detach during physical activity. Audio recordings of people socializing may also include unwanted background sounds. As unexpected sources of noise can occur in a research study, test runs with a small cohort should be conducted for sensors under consideration and then inspected to determine whether the SNR is ade- quate to extract meaningful information. Researchers may be able to improve a sensor's SNR by understanding where noise is introduced into the measurements and taking steps to reduce it. Accuracy and precision: Accuracy refers to the bias of the measurements, 47 and precision is a representation of the variance of the measurements over time. High-accuracy (low bias) and high-precision (low variance) sensors are the most desirable. Published scientic validation studies pertaining to the accuracy and precision of measurements are available for some commercial and research-grade sensors. In situations where no previous validation work exists for a device, researchers should consider performing their own validation tests, using state-of- the-art, gold-standard sensors as the basis for comparison. As an example, in a study examining the measurement accuracy and precision of wrist-worn PPG devices (e.g., Fitbit sensors) among a diverse group of participants performing various physical activities, heart rate measurements were accurate to within 5% of clinical-grade devices, and the measured number of step counts varied within 15% of the actual number [87]. Drift: Measurement drift is a natural phenomenon that can occur in any sensor, caused by unintentional modications to the device or object being measured [88]. When all other factors are held constant, measurements of a signal may drift up or down because of, for example, temperature or humidity shifts, changes in electrode impedance, or physical movement of the body. In many cases, drift is caused by physiological or environmental factors that cannot be controlled in the wild, but there are many common techniques for removing drift eects, including high-pass lters [89], adaptive lters [90], and time-variant lters [91]. In other cases, drift can be caused by sensor wear or material corrosion; therefore, it is important for research teams to consider the impact that normal usage and time will have on the sensors, and it is important to consider how this may cause a drift in the measurements. Data access at various stages of processing: In some applications, it is im- portant to be able to access the raw (unprocessed and unltered) signals. This is most relevant for research involving the denoising of signals, artifact removal, feature extraction, or even the estimation of other data streams from correlated signals [92]. Many consumer sensor devices provide preprocessed signals with 48 artifacts already removed and which have been transformed into higher-level fea- tures, such as step count, heart rate, sleep quality, or physical readiness. Some sensor product companies elect to keep their preprocessing techniques unpub- lished; therefore, it can be dicult for researchers to understand exactly what each feature represents. 
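To make the drift-removal point above concrete, the following is a minimal sketch of baseline-drift attenuation using a zero-phase high-pass filter, one of the common techniques cited above. It assumes access to a raw, uniformly sampled signal; the sampling rate and cutoff frequency are illustrative and would need to be chosen per signal.

```python
# Minimal sketch: attenuating slow measurement drift with a zero-phase high-pass filter.
# Assumes a raw, uniformly sampled signal; the 4 Hz rate and 0.05 Hz cutoff are illustrative.
import numpy as np
from scipy.signal import butter, filtfilt

def remove_drift(signal, fs_hz=4.0, cutoff_hz=0.05, order=2):
    """Return the signal with low-frequency drift attenuated (zero-phase Butterworth high-pass)."""
    b, a = butter(order, cutoff_hz / (fs_hz / 2.0), btype="highpass")
    return filtfilt(b, a, signal)

# Example: a slow linear drift superimposed on a faster physiological-like oscillation.
t = np.arange(0, 600, 1 / 4.0)                  # 10 minutes sampled at 4 Hz
raw = np.sin(2 * np.pi * 0.3 * t) + 0.002 * t   # oscillation plus drift
detrended = remove_drift(raw)
```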
These ready-made features can be useful for analysis, but researchers should be cautious when using features with no published methodol- ogy unless the features have been previously validated in scientic experiments. In cases where a provided feature cannot be trusted or is proven unhelpful in analysis, having access to the raw data to extract more meaningful features may be benecial. Data access logistics These criteria concern the ease with which data are stored and accessed by researchers. Data upload procedure: How and when data are transferred from sensor de- vices through the network to a data server can have a profound impact on a research project. As far as data upload procedures are concerned, there are two types of sensors: the ones that require manual interaction and the ones that auto- matically and transparently upload data once congured. Manual interaction is often required for devices that collect a large amount of data and need to transfer it in bulk (e.g., a Hexoskin ECG sensor uploads to a personal computer via USB). Automatic uploading is typically available for sensors that can stream informa- tion transparently to a data hub or smartphone over either Wi-Fi or Bluetooth (e.g., an OMsignal ECG sensor uploads data wirelessly to a smartphone app). Both researchers and participants usually benet from the automated paradigm, as there is less work involved for both parties, and data becomes available sooner, but the researchers need to consider its impact on smartphone battery drain and network bandwidth contention. Ease of data access: Once the data have been successfully transferred from sensors to the data servers, data need to be stored on a research server that is easily accessible to the research team. Some sensors may be congured to upload 49 information to the research server directly. For example, some companies supply a website where researchers can log in, visualize, and download participant data. Some companies track uploaded sensor data separately per user, in which case the research team would be responsible for creating and managing the participant accounts. Companies may provide tools to facilitate the download of data, such as Web-based (e.g., representational state transfer - REST) interfaces or APIs. The existence of well-documented guidebooks or a responsive technical support sta for these tools should be considered when selecting sensors. Sensor synergy These criteria concern potential symbioses among sensors and signals. Redundancy of signals: There are situations in which measuring the same underlying signals using dierent measurement devices might be advantageous to a research eort. One such circumstance is when a sensors accuracy and precision are unknown, but it is otherwise an appropriate pick for research. For example, if this device is a PPG-based wrist-worn sensor for heart rate tracking, then collecting heart information in parallel (perhaps on a subset of the participants), using an ECG sensor that has been validated against a gold standard, can enable researchers to infer the measurement quality of the PPG sensor. In a dierent scenario, researchers may decide that a certain signal is so important to capture in its full delity that using a single sensor that may occasionally fail or experience higher noise levels is not adequate. Using multiple sensors to capture the same signal adds fault tolerance to the measurement of that target signal, and this may also help reduce systemic measurement errors (e.g., by averaging). 
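As a concrete illustration of signal redundancy, the following minimal sketch compares a wrist-worn PPG heart-rate stream against a validated ECG-derived reference and fuses the two streams by simple averaging. The values and the choice of error metrics are illustrative only.

```python
# Minimal sketch: using a validated reference stream to assess a second sensor and
# fusing redundant measurements by averaging. All values are illustrative placeholders.
import numpy as np

ecg_hr = np.array([72.0, 75.0, 80.0, 78.0, 74.0])    # reference ECG-derived heart rate (bpm)
ppg_hr = np.array([70.0, 77.0, 83.0, 75.0, 73.0])    # wrist-worn PPG-derived heart rate (bpm)

bias = np.mean(ppg_hr - ecg_hr)                       # accuracy: systematic offset of the PPG stream
mape = np.mean(np.abs(ppg_hr - ecg_hr) / ecg_hr)      # mean absolute percentage error
fused = np.mean(np.vstack([ecg_hr, ppg_hr]), axis=0)  # simple fault-tolerant fusion by averaging

print(f"bias={bias:.2f} bpm, MAPE={100 * mape:.1f}%")
```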
Sensor versatility: Using a sensor that can adequately serve multiple purposes may be preferable to using multiple sensors instead. There are many reasons why this may be benecial, such as cost, reductions in participant and research sta burdens, and simplicity. For example, it is possible to program a smartphone to gather human-produced audio and record participant proximity to known loca- tions within a building by exploiting its Bluetooth or Wi-Fi connectivity. This approach uses a single sensor to achieve both goals instead of using two separate 50 devices to capture each signal. Additional experiment setup costs These criteria describe the (perhaps hidden) extra time and nancial costs associated with setting up sensors for experiments. Installation and maintenance costs: Once purchased, sensors require instal- lation and maintenance throughout a research study to ensure measurement con- sistency and minimize data loss. Some sensors, such as Bluetooth beacons, may come packaged with installation tools that interfere with maintenance objectives (e.g., double-stick tape for wall mounting). Using alternative installation devices (e.g., adhesive hook-and-fastener strips) in anticipation of device malfunctions or required battery replacements can help expedite repairing or replacing these devices when necessary. This may add a small additional per-unit cost to some of the chosen sensors, but this can save time and may help save money in other ways. Participant training and support: Participants who will wear sensors through- out a study should be trained to use these devices according to study rules and objectives. Generally, support stang may be required, as the complexity and number of sensors increases or the sensors robustness decreases. Service costs: Some companies, such as those producing sensors targeted for research rather than consumer use, may oer additional services for some cost. These services may include data aggregation and storage, data visualization, more convenient data access, or real-time monitoring and quality tracking for incoming data. Researchers should identify which services, if any, may be necessary. Sensor acceptance among target population Regardless of every desirable quality a sensor may possess for the research team and objectives, it cannot be benecial if participants recruited from the target population will not accept or use it. There are many reasons why participants may reject any specic sensor, such as discomfort, obtrusiveness, complexity, or fashionability. These objections cannot be anticipated fully; thus, researchers should assess beforehand whether 51 the target population would be generally willing to engage with the potential sensor set selected for use. On-site infrastructure requirements Studies conducted in the wild, which use sensors, depend on the study site infrastructure. As researchers converge on a set of desired sensors for a specic study, the infrastructural resources necessary at the study site(s), which can satisfy the sensor requirements, will emerge. In some cases, the existing infrastructure may not provide the resources required, but it can sometimes be augmented (e.g., with additional wireless data hubs or power extension cables) to suit the needs of the research project. Supplement- ing the infrastructure may not be possible in other situations because of costs or prohibitive regulations, and researchers may have to settle for less desirable sensors with fewer requirements. 
Some other examples of the infrastructural considerations that should be accounted for include the following: the total net- work bandwidth usage for all participants, the availability of power and network outlets, and access to a secured network for sensitive or private data transfer. Criteria related to the evaluations of sensor characteristics The criteria in the fol- lowing paragraphs describe various ways to evaluate sensors compared with other potential sensor choices. Each choice poses a certain set of constraints on the study, which can aect the research team, the study objectives, and the participants; thus, this merits vigilant con- sideration. Sensor Customizability These criteria address the alterability of sensor functionality. Hardware design: Presently, most commercial sensors are not designed with extensibility or hardware-level customization in mind. Therefore, it is dicult to alter the sampling rate, storage capacity, or battery life to suit the needs of a research study. There exist customizable do-it-yourself (DIY) hardware platforms (eg, Arduinos or Raspberry Pis) that researchers may want to consider in cases where no existing ready-made option is sucient. 52 Software customization: Many sensors on the market, which stream data to a smartphone, have a companion phone app, typically providing data visualization, high-level data summaries, or some types of behavioral interventions (e.g., advice like stand up and stretch or get extra sleep tonight). Some devices, such as smartwatches, contain their own displays for visualizing data and haptic feedback for alerts and interventions. These features can be useful to participants, but they may misalign or interfere with the goals of a research study; therefore, customized versions may be desired. Certain sensors oer software development kits, enabling researchers to build their own software for collecting, visualizing, and storing sensor data. Other devices, such as the Apple Watch or Wear OS-enabled gear, support software extensions installed on the device, giving researchers more control over the visual and haptic feedback to suit the needs of a study. Cost The total monetary cost of a sensor device itself is an important factor for researchers to consider, and it may impact the total number of participants who can be recruited and sup- ported throughout a study. Sensor prices can sometimes be negotiated with their providers, depending on the quantity of devices desired. Battery life Sensor battery lives vary greatly and depend on the device types and their functionality. On the basis of the survey of devices available today, wearable sensor battery life spans range from several hours to nearly a week. Most devices are rechargeable in just a few hours, but researchers should oer suggestions to participants about when to recharge to maximize the analytical utility of the data. Some strategies for minimizing the impact of data loss caused by recharging are as follows: staggering the recharge periods for dierent participants (so at least some data are always present) and choosing recharge times that coincide with periods where the devices could not normally be worn anyway (e.g., while sleeping or taking a shower). It is inevitable that participants will at times forget to recharge their sensors, and researchers should have a plan for handling these situations as well. Other devices, such as many tiny and portable environmental sensors, consume a small amount 53 of power, and they can operate continuously for over a year. 
These devices are often not designed for recharging, and they may need to be replaced throughout the research study. Operating system support Some wearable sensors designed to stream data to a smart- phone companion app may only support phones running on a particular operating system (e.g., iOS or Android), which can create diculties for the research team. Researchers could elect to recruit only those participants with compatible smartphones, but this will introduce a selection bias that may impact the generalizability of the research ndings or may reduce the number of potential participants. If researchers determine that a sensor with partial smartphone support is necessary, these negative eects could be mitigated by providing the interested participants using incompatible smartphones with a temporary and inexpensive compatible smartphone for use during the study. Robustness These criteria concern the ability of sensors to endure repeated use and prone- ness to failure. Physical design: Dierent sensors have distinct physical characteristics that make them more or less suitable for reliable operations over an extended period of time. Some properties worth evaluating are as follows: whether a device is sturdy and can handle mild physical wear, how easily a worn device may fall o, whether its buttons and other inputs function well after prolonged use, how well it stays in place without shifting, and how quickly it resumes operation after being reattached. Researchers should consider performing a pilot study to fully understand and evaluate the sensors beforehand. Firmware: The reliability of a sensor's rmware is important, as any failure may lead to loss of data. A few probing questions worth answering are as fol- lows: is the rmware code stable or does it crash? Can it handle a barrage of unexpected inputs and continue to function? If the device sleeps, does it resume data collection once awakened? Researchers should stress-test sensor rmware before committing to any device to ensure they understand the possible failure modes and recovery procedures. 54 Companion software: Some sensors require a companion app running on a second device, such as a smartphone or computer, to facilitate data collection and long-term storage. This software needs to be resilient to minimize data loss. Ideally, it functions consistently, showing no signs of glitches or crashing. Its ability to receive data from the sensor and either cache or upload data to a data server should be seamless and fault tolerant. Research teams should stress-test this software to understand when and how it fails, so that the support sta will be prepared to help participants. A few tests worth performing are as follows: (1) disconnecting the sensor from the app or removing the network uplink during a data transfer to see if sensor data are lost and (2) switching foreground apps or providing random inputs to see if the app crashes. Once the failure modes of the companion software are understood, steps can be taken to remedy them or at least to alert participants. Provider support Some companies are interested in building a scientic reputation for their sensor products; therefore, they are concerned with supporting research studies. This support comes in a few forms, and the criteria below pertain to the benecial impact this support can have during and on the outcome of data collection. 
Pre-study support: Before data collection, it is essential for researchers to fully understand the properties and unique characteristics of each sensor under consideration to make the most informed choices. Virtually all sensor providers oer documentation and a communication channel for answering specic techni- cal questions. Some of these providers may oer additional services for research teams, including direct communication to key technical or support personnel and free samples for testing. Logistics: Research teams should seek any available logistical aid, oered by the sensor product companies, that may help the study function more smoothly. Teams should ensure that sensors can be provided on time and that there is a backup plan for any sensors that need replacement. It is advisable to seek help from the product companies to train the research sta for proper tting 55 of the sensors, especially for those requiring specialized knowledge. Other kinds of logistical help may include preconguration of sensors (e.g., to specic Wi-Fi networks), custom delivery options (packaging, rush shipping), tailored ttings, or an emergency contact. Moreover, some sensor providers oer ongoing assis- tance, ranging from providing quality metrics and statistical reports of the study participants to ensuring APIs support the types of data monitoring and quality assessment metrics researchers desire. Criteria related to participant experience These criteria pertain to how sensors aect the participants perception of a research experiment and willingness to engage with a study throughout its duration. Cohort and individual suitability These criteria relate to the ability of sensors to meet the needs of members of a cohort. Sizing and t: Garments and sensors that match each participants unique physical characteristics are best equipped to provide usable data. Devices that are too large or too small can cause discomfort, possibly leading to side eects, such as blistering or reductions in data quality. Technological literacy: Each sensor provides a unique interface for operating with its hardware and companion app software. Researchers should ensure inter- faces are simple enough for all potential participants in the target population. In cases where the interface is unfamiliar, researchers will need to provide instruc- tions, describing not only how to operate and interact with the devices but also how to check that they are in a proper state and performing the desired function at any time. Fashionability: The selected suite of sensors should comply with dress codes of the environment in which they will be worn. Moreover, the design and appeal to wear sensors should be considered by the research team to ensure that all participants are comfortable wearing the sensors from an aesthetic perspective. 56 Burden to participants These criteria address the physical and mental burdens sensors impose on research participants. Physical interference: Obtrusive sensors may physically interfere with normal activities, causing frustration or eventually leading participants to avoid wearing these sensors, or worse, drop out of the study. For example, undergarment or chest strap sensors may become uncomfortable after a few hours or produce skin rashes, preventing participants from using them further. Another example is desk-mounted, infrared eye-tracking devices that require participants to keep their heads in view, which may incidentally encourage poor posture. 
Other job- specic scenarios should be considered, such as the use of smart rings in hospital settings, where they may interfere with hygiene requirements. Sensors that can adequately collect the intended signals without interfering or causing discomfort will improve the participants' acceptance of the devices, potentially minimizing attrition. Time investment: Studies conducted in the wild, which ask participants to wear or interact with sensors over an extended period, inherently push more responsibility onto the participants to manage and operate the sensors. Daily upkeep, such as cleaning and charging the devices and verifying that they are functioning as intended, requires a time investment that burdens participants and can cause frustration if the demands are too high. Choosing sensors with low upkeep and low training costs will reduce this burden and can improve compliance and overall data quality [93]. Cognitive load: An implicit stipulation in any study is that the participants understand they are responsible for adhering to the study protocol. This re- quires that participants remain mindful of the study throughout its duration. Researchers should aim to choose sensors and an overall study design that re- quires a small or occasional investment of the participants' time and mental energy. For example, helping participants with reminders to charge their sensors every night, and supporting them with a charging hub may increase sensor usage. 57 Criteria related to protection of human subjects Research investigating human be- havior, using sensing technology, is subject to review by IRBs, which evaluate the risks and benets to human participants and ensure that the study adheres to ethical principles de- tailed in the Belmont Report [94]. Researchers must consider how the passive collection of behavioral data will respect participants' autonomy and privacy, how it will maximize the benets of the research while minimizing risks to participants, and how it will ensure that benets and risks are equitably distributed. Some of the most relevant themes are reviewed here, but it is important to be aware of ethical guidelines that apply to specic populations or data types. The Connected and Open Research Ethics (CORE) provides a checklist to guide researchers in deciding which technologies are appropriate for a study, with respect to protecting human subjects [95]. Access and usability: Researchers are responsible for ensuring that potential benets of a study are likely to apply to all members of the population under investigation. This means that sensor selection must not inadvertently exclude members of the study population from participation or result in poorer data quality because of individual dierences. For example, wearable sensors may be aected by factors related to body shape and size, skin tone, body hair, or tattoos. It would violate the ethical principle of justice to exclude individuals as study participants on the basis of these factors, solely as the sensors selected did not perform well on them. Researchers should aim to select sensors that have demonstrated validity across diverse participants (e.g., heart rate monitors that rely on ECG instead of optical technologies), can be adapted to individual dierences (e.g., respiration monitors that can be worn on a bra or belt), and employ inclusive design features (e.g., accessibility settings to accommodate those with visual impairments) to ensure equitable representation and data quality. 
Privacy: Privacy refers to peoples' right to control what information about them is shared, with whom it is shared, and how the information is used. The most common privacy protection is to separate information that could identify the participant from the data collected about the participant, but some passively 58 collected behavioral data are inherently identiable and sensitive. For example, GPS features can predict depression symptom severity [96], and 95% of indi- viduals can be identied with as few as four GPS data points [97]. Participants electing to engage in a study that requires the collection of sensitive and personal data need assurances that researchers will take steps to mitigate the risk that their behaviors can be linked to their identities. Given the array of data types available through passive sensing technologies and the low cost of collecting data unobtrusively, it is tempting to collect as much data as possible. However, researchers are ethically obligated to only collect data that are pertinent to specic research questions. When possible, researchers should disable sensors that are irrelevant and securely dispose of data that are not specically related to study aims. In addition, participants should be able to select which data they are willing to share, with whom, for what duration, and for what purpose. Ideally, sensors should allow participants to deny or revoke access to particular data types. If these user controls are not permitted by third-party providers, researchers should consider providing additional data management tools that help participants exercise their right to privacy. Many sensors on the market today require participants to register their own accounts, using their own personal information, which creates a link between potentially sensitive data and each identiable participant. Studies needing to access these data while guaranteeing participant privacy have a few options. Re- searchers could register dummy accounts, allowing the participants to remain anonymous, or they may alternatively acquire data directly from each partici- pant's personal prole (e.g., by using an API) and then immediately remove PII. In the latter case, researchers should also check that both the network channels from the sensors to data servers and the network channel for researchers to access the data are encrypted and secured to avoid any privacy breaches. Data security: Proper protection of the PII sensor data gathered from partic- ipants requires all communication channels for the data streams to be secured 59 (refer to Figure 3.2), and it requires protected long-term data storage with lim- ited accessibility. Information sent over a Bluetooth link is naturally secure, as only paired Bluetooth devices can communicate. Similarly, USB transfers are secured between the two devices at either end of the USB cable. Data sent over Ethernet or Wi-Fi require an extra encryption layer (e.g., https or secure le transfer protocol) to ensure the information cannot be intercepted. RFID and NFC are generally not considered safe, but sensors are typically using these chan- nels to infer general information (e.g., about the movement of people indoors) rather than transferring PII directly. Stored data are typically secured by limit- ing physical access to the storage device itself, but encryption of the data is also possible. 
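As one concrete illustration of the de-identification and storage-protection practices described above, the following minimal sketch replaces a direct participant identifier with a salted hash and drops PII fields before a record is written to long-term storage. The field names and the salt-handling scheme are illustrative only; a real study would manage the salt as a protected secret.

```python
# Minimal sketch: pseudonymize records retrieved from a sensor provider's API before
# long-term storage. Field names and the salt-handling scheme are illustrative only.
import hashlib

PII_FIELDS = {"name", "email", "phone"}

def pseudonymize(record, salt):
    """Replace the participant identifier with a salted hash and drop PII fields."""
    clean = {key: value for key, value in record.items() if key not in PII_FIELDS}
    token = hashlib.sha256((salt + record["participant_id"]).encode()).hexdigest()[:16]
    clean["participant_id"] = token
    return clean

raw_record = {"participant_id": "P042", "name": "Jane Doe", "email": "jd@example.com",
              "heart_rate": 71, "timestamp": "2018-03-01T09:00:00"}
stored = pseudonymize(raw_record, salt="study-specific-secret")
```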
Access to the sensitive stored data should be limited to select members of the research team, and it is usually controlled through credential-based authentication (e.g., usernames and passwords). Unfortunately, there are many other ways for attackers to obtain PII today (e.g., malware, spyware, and cyberattacks), and research teams and participants may wish for every precaution to be implemented. Readers are referred to the study by Filkins et al [98] for more information about protecting private data in a mobile sensing landscape.

3.1.5 Case Study: TILES Data Set

In early 2018, a research team and I began preparations for an in situ study at the University of Southern California's Keck Hospital, as part of the MOSAIC program [99], using sensors to track nurse and hospital staff behavior in the workplace and at home. The project was called TILES (Tracking IndividuaL performancE with Sensors) and aimed to understand how physiological dynamics and behavior, both at work and at home, are associated with personality, well-being, and work performance. This section shares results from the application of the methods previously described. The team's experiences and rationale for selecting sensors to help achieve the research objectives are discussed, as well as how compliance was monitored and encouraged during the study. Metrics for attrition and compliance rates are provided. For a more detailed overview of the data collection itself, including IRB information, readers are referred to the retrospective study by L'Hommedieu et al [100]. A full description of the dataset and collection methodology will appear in a future publication [101]; this section focuses on study goals, aspects related to sensing, and data flow.

Study goals and constraints The primary goal of MOSAIC was to use information gathered through commercially available sensors to study the predictive power of these types of sensors for assessing personality traits, as well as work-related behaviors and mental states over time. Owing to the complex trade space encompassing consumer sensors, creating a data collection protocol that met the project goals and was satisfactory to the participants and the hospital environment required many iterations and challenging decisions. These deliberations, and the data collection protocol that resulted, led to a study of over 200 hospital staff participants over a 10-week period, with a low attrition (dropout) rate of 4% (primarily because of vacation conflicts).

Signals of interest Previous literature, related studies, and the team's experience revealed many physiological signals of interest for capturing data potentially related to work behaviors and mental states. Some of these signals, such as EDA and brain waves, were not possible to capture accurately in the wild over extended periods using consumer sensors. The research team initially reduced the list of potential signals of interest to those that could be captured with unobtrusive sensors, based on a survey of existing technologies (see Figure 3 in [57]). Table 3.2 shows these signals and a short explanation of the expected utility of each in meeting the research objectives.

Sensor Selection Rationale As the study required continuous data collection over several months, one of the top priorities was to minimize the burden on participants in order to achieve a high compliance rate and capture representations of behavior in the wild.
As previously described in Section 3.1.4, the study took a holistic approach to assessing the participants' responsibilities and duties, including their time invested in compliance, physical disruption, cognitive load, and interference with their daily activities. With these burdens in mind, each paragraph below describes how sensors were chosen to capture each signal of interest, and Table 3.3 lists the sensors that were selected alongside their intended purpose in the study.

Table 3.2: Signals of interest in the case study that were measurable using consumer sensors.
  Cardiac: connection to exercise, fitness level, and stress levels [102, 103]
  Physical activity: linked with stress [104]
  Sleep: physical and emotional health [105, 106]
  Speech: contains information about emotional expression [107] and about social interaction
  Breath: calmness, stress, anxiety, and speech activity detection [108, 109]
  Environment and distractions: connection with workplace performance, anxiety, and stress [110]
  Locality: captures workplace behavior and job role dynamics [111] and context for the job types of interest

Table 3.3: Selected sensors for the case study and their expected uses.
  Fitbit Charge 2: PPG-based heart rate, step count, and sleep; worn 24 hours per day
  OMsignal garments: ECG-based heartbeat, breath, and body motion; worn at work (12-hour shifts)
  Unihertz Jelly Pro: audio features and Bluetooth-based localization; worn at work (12-hour shifts)
  reelyActive Owl-in-One: Bluetooth-based localization and data hub for environmental sensors; installed on site (24 hours per day)
  Minew E6, E8, S1: light, motion, temperature, and humidity; installed on site (24 hours per day)

Cardiac and physical activity: Several form-fitting garments with ECG sensors were tested, and many provided the desired data quality (see Figure 3 in [57] for the list). Chest strap sensors were found to be uncomfortable for daylong use (as they are designed for exercise sessions), but the existence of different form factors of ECG garments (e.g., shirts, bras) made it possible to gather high-quality data across genders. Some of these garments continuously collected high-quality data throughout the day, but they required that the recording unit hidden inside the garment be connected to a computer via USB each day for data transfer. This step seemed cumbersome for participants; therefore, another similar garment that could stream the data to the subjects' personal smartphones was selected. The caveat with this second device was its companion app, which required the data recording process to be started and stopped manually. The research team elected to have subjects wear these garments only during work hours to avoid potential discomfort associated with wearing them all day. Participants were also assisted in setting location-based reminders on their personal phones to start and stop the recordings. Heart-related information and other physical activity outside of work were also tracked by asking participants to continuously wear a wristband.

Sleep: Many unobtrusive sensors were capable of capturing information about sleep duration and sleep stages. Some sensors required a one-time installation on or near the bed and would then automatically detect and monitor participants while they slept. Nurse focus groups had privacy concerns; therefore, wearable sensors were deemed more appropriate.
To mini- mize cost and the burden of wearing multiple sensors, a wristband sensor was chosen, which was capable of capturing sleep and the cardiac and physical activ- ity signals mentioned previously. Participants were asked to wear the band every day, including during their sleeping periods. Speech: At the time of the study, no portable consumer devices were avail- able for automatically sampling only human-produced audio. The research team programmed a smartphone app to automatically start, run in the background, and collect audio samples of ambient human utterances [5]. To address Health Insurance Portability and Accountability Act (HIPAA) concerns about hospital patient and nonparticipant privacy, relevant information about the emotional content of the voice signal was computed by the device, and the raw audio signal 63 was immediately discarded. Moreover, participants could disable the recording process for intervals of half an hour, by pressing a button in the app, after which the recording was resumed. Collecting low-noise audio required the smartphone's microphone to be placed near the mouth, and the research team wished to avoid using external microphones to reduce further participant burdens. Research sta met with representatives from the potential participant pool to discuss unobtru- sive solutions and discovered that hospital personnel were already accustomed to wearing hospital badges on their lapels. Credit cardsized smartphones were acquired to run the custom software, and then these were attached to the par- ticipants' shirts, with a clip to get the microphone closer to the mouth [5], as shown in Figure 3.3. Although this solution may have been unacceptable for some subject populations, it was appropriate for the hospital workers in this study. Figure 3.3: Setup of the TILES Audio Recorder [5] Breath: Commercially available portable breath sensors measured the expan- sions and contractions of the chest. Some of these sensors were stand-alone devices attached to the waist or chest, and some were integrated into other mul- tipurpose sensing garments. Once the research team decided on a comfortable device for capturing ECG, they found that breathing rate information was al- ready available; therefore, the same device was used. Environment and distractions: Environmental sensors for capturing temper- ature, humidity, and door motion were used. Statistics about social media and general phone usage were acquired with the participants' permission and with the help of smartphone apps running in the background on their personal phones, requiring little power and no interaction after the initial setup. 64 Locality: Precise localization of subjects inside the hospital was deemed pro- hibitively expensive and would have required several months of installation time; therefore, approximate measurements of location by proximity to known locations were used instead. As described in [57], using a dense hub network and wearable consumer sensors, there were two general ways to achieve this: tracking partici- pants' smartphones or tracking other worn wireless communication devices. The latter option was chosen using the audio recording phones for tracking to avoid any power draw from participants' personal phones. Data Flow Figure 3.2 depicts a general ow of information for measurements obtained through sensors. In this study, all three kinds of sensors (in the left column) were used: environmental sensors, non-wearable, and wearable. 
All of the sampled data flowed through two different intermediate types of data hubs: Bluetooth data hubs connected to Wi-Fi, and personal smartphones. Personal computers were not used to retrieve any data, in an effort to reduce the time participants spent uploading data to different servers.

Wireless passive sensors, which transmitted information over Bluetooth, captured light levels, temperature, and humidity. In addition, the participants wore Jelly Pro phones that were programmed to send Bluetooth pings with unique identifiers. The Owl-in-Ones received these data; they were connected to the hospital's public Wi-Fi network and transmitted the data over this network to reelyActive's servers, from which the data were retrieved in real time using a provided API. Audio data recorded by the Jelly Pro phones were uploaded directly to the research server using hospital or home Wi-Fi networks. Wi-Fi was necessary because of the size of the files: approximately 8 GB per day. Data transfer took place from the Fitbit Charge 2 devices to participants' smartphones over Bluetooth, followed by data upload to Fitbit's servers through the smartphones' internet connections. The research server then retrieved these data using Fitbit's API. The same flow was employed by the OMsignal garments, using OMsignal's API. Feedback to participants was delivered through a custom app (the TILES app) via push notifications. This app sent surveys to participants and gave them notifications about sensor usage and the quality of their previously received data when necessary.

Monitoring and Encouraging Compliance Minimizing participant frustration in a study can help improve compliance and overall data quality [71]. This was one of the top priorities in this case study, and it was achieved by reducing cognitive burdens on participants, offering monetary incentives and consistent feedback for compliance, and providing convenient help whenever participants encountered difficulties. A custom smartphone app was developed for the participants, and it served as the primary resource for all aspects of the study. This app provided progress and monetary reward tracking, information about the study and protocol, and direct contact links for requesting help, and it also distributed questionnaires and reminders. Participants were rewarded for uploading their data daily, per the study protocol, which allowed the research team to monitor compliance and data quality every night. Each morning, the app provided feedback to the participants by letting them know whether their previous day's data had been received and whether the quality was sufficient. If the data were missing or of poor quality, the app reminded participants to double-check their sensors or seek help from the research team. On-site assistants were always available during work hours to help participants who encountered difficulties during the study. Participants were able to drop in for help, or they could request that assistants visit them and provide in-person support. These assistants actively engaged with participants who had recently uploaded poor-quality data to help make sure their devices were worn and functioning properly.

Metrics Table 3.4 shows the average data compliance rates across the different 10-week waves of this study for the different sensors. The attrition rate was under 4% across all participants, and most of the participants who dropped out did so because of vacation time conflicting with the study's participant inclusion criteria.
More details about the study, including information about poststudy surveys on user experience, are presented by Hasan et al [100].

Table 3.4: Compliance rates for participant-tracking sensors (n=212) and environment sensors (n=244) in the case study. Compliance is computed as the presence of data exceeding half of the measurement period per day among the participants who opted in for each sensor.
  Participant-tracking sensors:
    Cardio, sleep, and steps (Fitbit): 208 participants opted in (98.1%); 236,725 total hours; compliance rate 152 (73.1%), defined as the average fraction of days per participant with >12 hours of data
    Cardio, breath, and motion (OMsignal): 208 opted in (98.1%); 44,240 total hours; compliance rate 125 (60.1%), defined as the average fraction of work days per participant with >6 hours of data
    Audio (Jelly): 184 opted in (86.8%); 37,065 total hours; compliance rate 131 (61.8%), defined as the average fraction of work days per participant with >6 hours of data
    Locality (Jelly + Owl-in-One): 184 opted in (86.8%); 37,065 total hours; compliance rate 131 (61.8%), defined as the average fraction of work days per participant with >6 hours of data
  Environment sensors:
    Temperature, humidity, and motion (Minews): compliance rate 239 (98.0%), defined as the uptime of the sensor network

Figure 3.4 shows a histogram of the number of hours each sensor was used per day across all participants, where days with no logged data are not shown. This figure illustrates that, on average, the Fitbit was used about twice as long as the other sensors, which was in line with expectations. Moreover, although both the Jelly Pro and the OMsignal garments were designed to be used by participants at work, there was a noticeable difference in usage. This is partly explained by participants starting the recording of their OMsignal garments at home rather than at work. It can also be explained by the fact that the Jelly Pro recording was activated only when participants or nearby persons were speaking.

Figure 3.4: Histograms of the total number of hours of recorded sensor data per day, across all participants. These plots only show data from days on which data were logged.

The Fitbit usage of the subject cohort is in line with other studies [112], which report 70% to 90% compliance with wristband sensors. Compliance rates for the OMsignal garments and the Jelly Pro were expected to be lower, and indeed were, as these devices were only used during work shifts (typically 12 hours) and required more attention from participants. For the Fitbit, the mean usage among all days with logged data is 17.8 hours, with a standard deviation of 4.0 hours. For the OMsignal garments, the mean is 10.6 hours, with a standard deviation of 1.8 hours, and for the Jelly Pro audio recorder and localizer, the mean is 8.4 hours, with a standard deviation of 2.1 hours.

3.1.6 Discussion

The methods previously presented are evaluated with respect to the case study outcomes, using measures of participant compliance as well as the number and duration of unexpected challenges that emerged during the data collection period. Details follow below, but in summary, participant compliance is competitive with similar, smaller-scale, long-term sensing studies conducted in natural settings. Furthermore, the unexpected challenges were manageable, having either short durations or isolated impacts on the study and data quality. The case study's results according to these metrics illustrate that the methods and mitigation strategies presented here serve as a helpful guide for sensor selection and management during longitudinal human behavior studies in the wild.
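For reference, compliance rates of the kind reported in Table 3.4 (and revisited below) can be computed from per-day wear-time logs as the average, over participants, of the fraction of days whose recorded hours exceed a threshold. The following minimal sketch illustrates this; the threshold, table layout, and values are illustrative placeholders rather than the study's data.

```python
# Minimal sketch: compliance as the average fraction of days per participant whose
# recorded wear time exceeds a threshold. The table layout and values are illustrative.
import pandas as pd

wear_log = pd.DataFrame({
    "participant": ["P1", "P1", "P1", "P2", "P2", "P2"],
    "date": ["03-01", "03-02", "03-03", "03-01", "03-02", "03-03"],
    "hours_recorded": [17.5, 4.0, 20.1, 13.2, 12.5, 0.0],
})

def compliance_rate(log, threshold_hours=12.0):
    """Average, over participants, of the fraction of logged days above the threshold."""
    per_participant = (
        log.assign(compliant=log["hours_recorded"] > threshold_hours)
           .groupby("participant")["compliant"].mean()
    )
    return per_participant.mean()

print(f"compliance: {100 * compliance_rate(wear_log):.1f}%")
```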
Participant Compliance With a compliance threshold of one-third of the measurement period (instead of one-half from Table 3.4), the compliance rate across all wearable sensors is above 84% [101]. This compliance rate is comparable or higher than similar kinds of sensor- based pilot studies conducted in the wild [112{115], and this rate is achieved for over 200 TILES participants compared to the maximum of 36 participants in these other studies. This compliance rate is also the same as the average sensing compliance rate in the StudentLife dataset [63], which collected data only from its 48 participants' smartphones over 10 weeks. The competitive compliance rates achieved in the longitudinal TILES study, using wearable sensing at scale, demonstrate the eectiveness of the proposed protocol design strategies and data collection framework. Unexpected Challenges During the Case Study This section recounts the unantic- ipated challenges encountered during the case study despite eorts to avoid them during study planning. Unexpected challenges are dened as the events not considered a priori or deemed unlikely to happen and whose occurrences negatively aect the project budget, schedule, participants, or data. Each of the occurrences below were either isolated incidents, aecting only a narrow piece of the research project, or were short-lived, as the research sta was able to address them quickly. Potential strategies for mitigating each of these events in future studies are discussed. 69 Shipping Dependencies and Customs: Some bundled sensor shipments were delayed because of product dependencies on secondary companies with limited shipping capacities. Urgent sensor package shipments from other countries were sometimes held up by the customs authority. In future studies, it would be best to be aware of the shipping capabilities of each product company and any potential shipping delays when preparing a study schedule. Installation Time: The research sta underestimated the time required to in- stall on-site sensors at the hospital. Although oor plans were used heavily for placement planning, they did not include locations of the electrical outlets. Sev- eral iterations and supplemental cabling were needed to install sensors across 16 dierent nursing units with similar layouts but dierent electrical circuit restric- tions. Moreover, as most sensors were installed in patients' rooms, more trips to the data collection site were needed than expected to accommodate patient needs. Starting the installation process early can help researchers identify this problem in advance and budget time accordingly. Battery Life: The Jelly Pro devices running the custom TAR app ran out of power for some of the participants early on during data collection. The param- eters of the TAR app were tuned on the basis of the data collected during a pilot study from a subset of the nal participant pool, but this subset did not re ect the worst-case scenario for power consumption. The battery life in this case depended on how many times vocal audio recording was triggered by the automatic voice activity detector, and the hospital sta in highly social environ- ments triggered it more often than the worst case in the pilot study. The research team responded by recollecting the Jelly devices and modifying the parameters overnight. 
A possible strategy for mitigating this issue would be to design a pilot study that includes more participants at the expected extremes of the measure- ment spectra, but this may negatively aect the expected average-case ndings. Perhaps a better strategy would be to implement tools to remotely or more easily update the parameters for all participants in anticipation of this type of issue. 70 Sensor Synergy: As the Jelly Pro devices served two functions in this study (collecting vocalized audio and proximity detection), when the power consump- tion exceeded expectations, two data streams were aected instead of one. For sensors serving multiple purposes, there is greater risk to the data quality when they fail; therefore, proper stress testing and tooling should be prepared before the main study. Sensor Discomfort: Some participants acquired rashes caused by skin friction with the wrist-worn or undergarment sensors. This occurred because the sensors these participants used were improperly tted or sized, and the discomfort they produced led to a short-term loss of data while the participants recovered. The pilot study helped the research sta identify and mitigate some tting concerns, but it was not enough to handle all the cases during the main study. The team reached out to the product companies for these sensors to get help with proper tting procedures, and with their guidance, they were able to nd proper ts for each aected participant. Better approaches for mitigating the risk of data loss here would be to solicit help with tting and sizing from the product companies earlier and then incorporate that wisdom into the study as well as consider dif- ferent options for materials that are in contact with the skin (e.g., Fitbit oers wristbands made of dierent materials). Data Pipeline Failure: Months into the main data collection, two site-wide disconnections of the environmental and proximity sensors occurred. These de- vices were all connected to the existing hospital Wi-Fi network, and the research servers data monitoring processes identied this event immediately. Within 24 hours, research sta was dispatched to manually power cycle the devices and en- sure they reported gathered data upstream. Although these sensors were stress- tested during the pilot study and determined to be robust to power and network outages, they did not all recover automatically in these two instances. Having a separate backup system in place (e.g., an extra rmware layer to perform a soft reboot) may help improve robustness in these unexpected situations, but the data 71 monitoring processes enabled researchers to respond quickly in this instance. 3.1.7 Conclusion The experiences and viewpoints expressed in Section 3.1 highlight and enumerate many of the research challenges faced during studies conducted in the wild, when using sensors for unobtrusively capturing human activity and behavior. A general-purpose information ow for data is presented along with an explanation of the roles of dierent computerized devices for data collection, transmission, and storage, and is further accompanied by an example adaptation of the ow framework to a case study. Finally, this section provides a comprehensive list of criteria that researchers should carefully consider when conducting their own studies in natural settings, including explanations of trade-os among them. 
Though the study involved more wearable sensors and included more participants than other similar longitudinal sensor-based human studies, it still achieved a comparable compliance rate. The minimal loss of data, through quick recovery from unexpected events and constant participant feedback, demonstrates how this comprehensive list of considerations and criteria, when put into action, can help improve the quality and quantity of data obtained from humans in longitudinal studies.

3.2 Improving Quality of Real-time Continuous-scale Mental State Annotations

This section of the chapter explores methods for improving the quality of human-produced data used in place of otherwise unobservable human experiences. The methodologies contained in this section exclusively concern human-produced annotations of experiences or behaviors that are generated in real time and reported using a continuous set of values. Referring back to the hypothetical research question of modeling and understanding the notion of fun, first discussed in Section 1.1, this kind of annotation is relevant in an experiment where, for example, a participant is asked to play a video game while an observer rates how much fun the participant is experiencing over time. These ratings of perceived mental states are invaluable for modeling individual responses to evolving stimuli and in domains involving subjective human experiences where there is some social consensus or agreement about their dynamics over time. Such annotations have been heavily used to study human affect [35, 116-119], facial expression intensity [120], and student engagement [14], among other perceptual and mental constructs.

The procedure of collecting and processing annotations of various mental states in order to generate a single best representation of specific unobservable human experiences occurs over several stages. Each stage requires modeling decisions that ultimately impact the resulting best representation of a mental state, also commonly referred to as the ground truth in machine learning contexts.

Figure 3.5: Simple Annotation Processing Pipeline for Ground Truth Estimation

Figure 3.5 provides an overview of the typical stages of annotation processing that convert the raw annotations provided by human annotators into the ground truth. In the first stage, the unaltered raw annotations are cleaned, which may include steps such as removing invalid data or interpolating missing data. Temporal alignment is then optionally performed to correct for annotator perceptuomotor lag and maximize the mutual correspondence of the annotations' dynamics. Next, the annotations are inspected to ensure they are in sufficient agreement about the general dynamics of the target construct; annotations that deviate too much from consensus are typically removed from the group. Lastly, the remaining annotations are fused together to form a single time series, which is adopted as a ground truth representation of the changes in the annotated construct over time. This represents one possible, but common, annotation processing pipeline for ground truth generation; one simple instantiation is sketched below. Each of the sections in this chapter proposes solutions to one or more of the problems that arise throughout these annotation processing stages. Each section draws its insights from one common set of observations about the types of errors people tend to make when providing real-time continuous-scale annotations, which are discussed in Section 3.2.2.2.
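The following is a minimal sketch of one possible instantiation of the basic pipeline in Figure 3.5, assuming the annotations are equal-length, uniformly sampled arrays on a [0, 1] scale. The cross-correlation alignment, the Pearson agreement filter, and averaging as the fusion step are illustrative baseline choices, not the methods proposed later in this chapter.

```python
# Minimal sketch of one instantiation of the Figure 3.5 pipeline: clean, align, filter
# by agreement, and fuse by averaging. Thresholds, the lag-search bound, and the use of
# simple averaging are illustrative choices, not the methods proposed in this chapter.
import numpy as np
import pandas as pd

def clean(annotation):
    """Linearly interpolate interior gaps, pad the ends, and clip to the [0, 1] range."""
    s = pd.Series(annotation).interpolate().ffill().bfill()
    return s.clip(0.0, 1.0).to_numpy()

def align(annotation, reference, max_lag=40):
    """Shift the annotation earlier by the lag that maximizes correlation with the reference."""
    scores = [np.corrcoef(annotation[lag:], reference[:len(reference) - lag])[0, 1]
              for lag in range(max_lag + 1)]
    lag = int(np.argmax(scores))
    shifted = np.roll(annotation, -lag)
    shifted[len(annotation) - lag:] = annotation[-1]   # hold the last value over the shifted-in tail
    return shifted

def fuse(raw_annotations, agreement_threshold=0.5):
    cleaned = [clean(a) for a in raw_annotations]
    reference = np.mean(cleaned, axis=0)               # provisional consensus for alignment
    aligned = [align(a, reference) for a in cleaned]
    consensus = np.mean(aligned, axis=0)
    kept = [a for a in aligned if np.corrcoef(a, consensus)[0, 1] >= agreement_threshold]
    return np.mean(kept, axis=0)                       # fused "ground truth" estimate
```

The remainder of this chapter examines where baseline choices of this kind break down and how they can be improved.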
Because they share this foundation, the methods proposed in each section are complementary, and the last section in this chapter examines their utility when combined and applied in a crowd-sourcing environment, where the validity of any single annotation is more questionable due to the fast-paced and anonymous nature of that medium.

Background information about state-of-the-art methods for processing and fusing annotations into a ground truth representation is discussed in the next section. Subsequent sections dive into particular stages of the basic annotation processing pipeline, highlighting certain pitfalls and proposing unique solutions. By the end of this chapter, readers will have a deeper understanding of the common pitfalls in state-of-the-art techniques for ground truth generation using annotations, and they will emerge with a set of tools for dealing with the variety of artifacts and errors that humans either intentionally or coincidentally introduce during the annotation process. Much of the following work is reproduced from published works [1, 121-124].

3.2.1 Background

Human-produced annotations of mental and subjective constructs have been used in many existing human behavior data sets as a representation of a genuine experience or behavior. Many data sets, such as SEMAINE [125], MAHNOB-HCI Tagging [118], DEAM [126], DEAP [117], IEMOCAP [127], RECOLA [6], SEWA DB [116], and others [119, 128], represent individuals' emotional responses to stimuli using annotations provided by humans. The utility of annotations is not limited to emotional labeling; it extends to student engagement assessment [14], sincerity perception [129], and movie emotion portrayal [130], among many others. Many different schemes for eliciting annotations have proven useful as well, differing in their discretization of time and space: continuous-scale labels for discrete data [131], discrete labels for sampled real-time data [117, 118, 127], and continuous-scale labels for real-time data [6, 132]. This chapter focuses on better understanding the continuous-scale, real-time annotation strategy. The term real-time refers to setups where annotators provide ratings simultaneously while viewing the stimulus. Continuous-scale refers to the continuous range of values annotators can provide at any given time, as opposed to discrete Likert-scale annotations, for instance. A typical example of a continuous-scale and real-time annotation scheme is the use of a user interface slider widget, ranging continuously from zero to one, to annotate the emotional valence of a movie as it is being watched. This annotation scheme is of primary interest for three reasons: (1) it is the most natural way to present real-time data to annotators and avoids any temporal aliasing of the stimulus via discretization, (2) it allows subtle variations to be annotated, and (3) it enables annotators to fully absorb and incorporate the temporal context of the stimulus into the annotated value at any time.

In order to generate a single set of labels for use as ground truth in behavior modeling, several annotations from different annotators are typically gathered and aggregated to help reduce the impact of unique biases that individuals may introduce. Many methods for the fusion of continuous-scale annotations have been proposed.
The naive choice is averaging, which is seldom used to establish ground truth because it does not account for the dierences in response times or lag times between annotators with respect to events of interest within the stimulus. Time-shift and average [6, 7] and dynamic time warping (DTW) [133] methods attempt to correct varying perception-response temporal lags introduced by each annotator before combining the annotations, but these approaches by themselves do not oer any means to correct annotator perception or valuation biases and artifacts. Other approaches such as correlated spaces regression [134] and canonical correlation analysis [135] (and variants [136, 137]) attempt to produce a ground truth via fusion of the annotations by minimizing the distance between some linear combination of annotations and a combination of features extracted from the associated stimulus (e.g., facial expression features). These approaches can improve the ability of a model to map features to ground truth labels, but they do so at the risk of losing pertinent information in the annotations that may not be representable with the available feature set. Some eorts combine the benets of both approaches [138, 139] but still require the available features to jointly contain information about the construct being annotated. This dissertation contends that approaches of this sort are unlikely to lead to accurate human behavior models because neither the correct features nor the true labels are known in advance. Furthermore, in situations where fatigue or external distractors interfere with annotators, artifacts can be introduced that may not be representable with 75 the feature set which can diminish the accuracy of the resulting ground truth. It is therefore presumed preferable for any method producing ground truth labels to do so independently of the collected data. More recent work explores alternative strategies for computing a gold-standard. Lopes et al [140] show that in some cases the gradient in an annotation is more informative than the annotation value, which can be exploited to produce a better ground truth. The use of comparisons between dierent frames of a continuous annotation appears in [141] show- ing that a fused continuous annotation signal can be reconstructed quite accurately using ordinal embedding and triplet-based comparisons alone. This focus on using comparative information to construct an accurate ground truth is very recent in the literature and forms the basis of the methods presented in this chapter. Various works have observed that errors in human annotation are not random [1, 51, 53]. Curiously, however, no prior research has closely examined or validated the capabilities of human annotators to produce accurate real-time continuous-scale annotations of stimuli. Section 3.2.2 introduces an annotation experiment where the true target construct value over time is known in advance, which enables the quality of the human annotation process to be probed. A brief list of common annotator mistakes in this controlled continuous annotation experiment are observed, one of which suggests that annotators capture trends more reliably than they accurately assign values. This observation is partially supported by philosophical and psychological perspectives on the fundamental nature of human experience [51, 53]. 
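The claim above, that annotators capture trends more reliably than exact values, can be probed with a simple check: compare a value-based agreement measure against agreement on the sign of the local change. The snippet below is a toy illustration with synthetic "annotations"; the distortion applied to the second annotator is an assumption chosen for demonstration, not data from the cited studies.

import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0, 2 * np.pi, 100)
truth = np.sin(t)

# Two synthetic annotators: both follow the trend, but the second applies a
# compressed, offset value scale -- a common real-time annotation artifact.
ann_a = truth + 0.01 * rng.standard_normal(t.size)
ann_b = 0.4 * truth + 0.3 + 0.01 * rng.standard_normal(t.size)

# Value agreement: mean absolute difference between the two ratings.
value_gap = np.mean(np.abs(ann_a - ann_b))

# Trend agreement: do the two annotations move in the same direction
# (computed over a 5-frame spacing to suppress sample-to-sample noise)?
spacing = 5
trend_match = np.mean(np.sign(ann_a[spacing:] - ann_a[:-spacing])
                      == np.sign(ann_b[spacing:] - ann_b[:-spacing]))

print(f"mean absolute value gap : {value_gap:.2f}")   # large despite shared dynamics
print(f"gradient-sign agreement : {trend_match:.2f}")  # high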
Yannakakis et al [142] provide a summary and exposition of research and arguments in favor of the underlying ordinal character of emotional experience and how it may impact human- produced continuous annotations. Some researchers [1, 51, 143] have found it benecial to treat annotations as ordinal information, but so far the arguments for doing so have been its utility and potential correspondence with the underpinnings of human experience. Section 3.2 aims to help build a foundation for this assumption through the analysis of continuous human annotations in an experiment where the true target signal is known. Subsequent sections build upon this foundation introducing methods and algorithms for enhancing the quality of ground truth representations. 76 3.2.2 Green Intensity Annotation Experiment This section recites an experiment rst performed and published in [1] where the true target construct value over time is known in advance. This experiment lays a foundation for the rest of the chapter and demonstrates there is a certain structure to the types of errors human annotators make during the annotation process. 3.2.2.1 Experimental Procedure This experiment uses a simple but perceptually challenging annotation task where the objec- tive truth is known. Ten annotators were asked to separately rate the intensity (luminance) of the color green in two videos in real-time and on a continuous scale by adjusting a standard user-interface slider widget. The same physical computer, monitor, and lighting conditions were used for all ten annotators. The videos were less than ve minutes in length, 864x480 resolution, and comprised entirely of solid color frames of green at varying green channel intensities in RGB color space. In Task A's video, the green intensity changed at dierent speeds and times while avoiding discontinuous jumps and was designed to test annotator rating accuracy. Task B's video featured a perturbed slow oscillation of the green intensity and was chosen to test consistency in annotation over time. The annotation process was de- vised to be mechanically undemanding with a simple responsive interface to help ensure the main annotation challenge lay in the translation of perceived green intensity to annotation rating. A picture of the annotation interface is shown in Figure 3.6. Figure 3.6: A closeup snapshot of the user interface at dierent times during the green intensity annotation task. Annotators only adjusted the slider in sync with changes in the green video. 77 Figure 3.7 shows a plot of all ten annotations alongside the objective truth for these two annotation tasks. Intra-class correlation measures were computed to estimate annotator agreement per the guidelines in [144] and achieved approximately 0.97 at a 95% condence interval for each task earning an excellent agreement rating according to [50]. The ICC values were calculated using the psych package version 1.6.9 in R using a mean-rating (k=3), consistency, two-way mixed eects model. 3.2.2.2 Observations Although the annotator agreement measure is very high and Figure 3.7 shows that anno- tators were generally quite good at capturing large-scale changes and trends, they still had diculties in other areas. First, annotators tended to over-shoot the target value when anno- tating increases or decreases in value over a period of time such as in Figure 3.7a between 200 and 250 seconds. This indicates they were perhaps xated on annotating the rate of change rather than the actual rating. 
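For reference, the mean-rating, consistency, two-way intra-class correlation reported above (ICC(C,k) in the McGraw and Wong taxonomy, reported as ICC3k by the R psych package) can be reproduced from a two-way ANOVA decomposition. The sketch below is a from-scratch Python equivalent under that formulation; the simulated annotation matrix is only for demonstration.

import numpy as np

def icc_consistency_avg(X):
    """ICC(C,k): two-way model, consistency definition, average of k raters.
    X has shape (n_frames, k_raters)."""
    n, k = X.shape
    grand = X.mean()
    ss_total = ((X - grand) ** 2).sum()
    ss_rows = k * ((X.mean(axis=1) - grand) ** 2).sum()  # between time frames
    ss_cols = n * ((X.mean(axis=0) - grand) ** 2).sum()  # between annotators
    ms_rows = ss_rows / (n - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / ms_rows

# Ten simulated annotators tracking the same latent signal with noise and bias.
rng = np.random.default_rng(2)
latent = np.sin(np.linspace(0, 8, 400))
X = np.column_stack([latent + 0.1 * rng.standard_normal(latent.size) + 0.05 * a
                     for a in range(10)])
print(f"ICC(C,k) = {icc_consistency_avg(X):.3f}")  # high agreement (close to 1)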
Secondly, approximately half of the annotators struggled to capture the lack of change in green intensity especially during the 100 to 150-second time interval in Figure 3.7a. One possible explanation is that the longer duration of this constant segment gave annotators time to realize their current intensity ratings did not match their perception and then adjust the value to match in spite of what was (not) occurring in the video. Lastly, similar green intensities were annotated inconsistently over time. In particu- lar, there was a signicant dierence in average annotation value per and within annotators at dierent time intervals where the green intensity was actually at a constant 0.5 value (see Figure 3.7a). This last observation implies that even for this relatively simple annotation task, annotators struggled to accurately capture the trends while preserving self-consistency over time. 78 0 50 100 150 200 250 0 0.2 0.4 0.6 0.8 1 Time [s] Green intensity value (a) Task A: Objective truth and real-time annotations 0 50 100 150 0 0.2 0.4 0.6 0.8 1 Time [s] (b) Task B: Objective truth and real-time annotations Figure 3.7: Plots of the objective truths (bold) and annotations of green channel intensity from ten annotators in two separate tasks. 3.2.3 Ordinal Triplet Embeddings and Notation In preparation for the next subsections which introduce new methods for improving ground truth representations, ordinal embeddings are introduced here. An ordinal embedding is a representation ofn objects as a collection ofn points, where the pairwise distances between points satisfy a set of similarity relationships among the objects. These similarity relation- ships are generally expressed among 4-tuples of objects [145, 146], though in this work we will consider relations on 3-tuples or triplets because they are simpler and similarly eective. In this dissertation, ordinal embedding techniques will be used to convert a collection of relationships between objects (samples from annotations) into one-dimensional embeddings (collections of points in 1D space). When the annotated values exist in a well-ordered space (e.g., when it can be determined that the green intensity at time j is greater than the intensity at time i), then a ranking (1D embedding) can be generated from a collection of pairwise comparisons instead. However, it is presumptuous to assume that all subjective constructs can be annotated naturally on a well-ordered scale, so we will instead consider triplet comparisons. Triplet comparisons, formalized below, do not need an ordered space and only require that the similarity between two objects can be determined. Mathematical notation for the triplet embedding problem and one approach to computing an embedding are introduced below. Triplet Embeddings Letz 1 ;:::;z n be the items that we want to represent with points x 1 ;:::;x n 2R m , respectively. Thez i items do not necessarily lie in a metric space, but we assume some form of a dissimilarity measured(z i ;z j ) exists among them. This dissimilarity 79 can be a perceptual model that may not be mathematically dened, such as the dissimilarity of items' emotional arousal or valence as perceived by an annotator. The input is a setT of possibly noisy and incomplete triplet comparisons such that T =f(i;j;k)ji6=j6=k6=i; d(z i ;z j )<d(z i ;z k )g (3.1) which can be used to construct an embeddingX2R nm . Each rowx T l ofX represents a point inm-dimensional Euclidean space for eachz l , respectively. 
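As a small worked example of the triplet set in Equation 3.1, the snippet below enumerates triplets from a known one-dimensional signal using absolute value differences as the dissimilarity. Note the cubic growth in the number of candidate comparisons, which motivates the triplet subsampling discussed later.

import itertools
import numpy as np

def triplets_from_signal(x, tol=0.0):
    """Enumerate triplets (i, j, k) meaning d(z_i, z_j) < d(z_i, z_k),
    with d taken here as the absolute difference of the signal values."""
    x = np.asarray(x, dtype=float)
    T = []
    for i, j, k in itertools.permutations(range(len(x)), 3):
        if j < k:  # (i, j, k) and (i, k, j) encode the same unordered comparison
            dij, dik = abs(x[i] - x[j]), abs(x[i] - x[k])
            if dij + tol < dik:
                T.append((i, j, k))
            elif dik + tol < dij:
                T.append((i, k, j))
    return T

z = [0.10, 0.50, 0.45, 0.90]
print(triplets_from_signal(z))
# A signal with n frames yields up to n * (n - 1) * (n - 2) / 2 such comparisons.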
The indices $l = \{1, \ldots, n\}$ are defined as frame offsets in a regularly sampled time series, so that $x_l$ is the value that the signal takes at time index $l$. Various different techniques have been proposed to find the embedding $X$; for an extensive list, refer to [147].

Stochastic Triplet Embeddings Stochastic triplet embeddings, using the t-Student distribution (t-STE), have been used for solving (non-metric) ordinal embedding problems [148] and proven suitable for recovering 1-dimensional embeddings. As the authors highlight in [149], t-STE aggregates similar points and repels dissimilar ones, leading to simpler embedding solutions (Occam's razor). The t-STE approach defines the probability $p_{ijk}$ that a certain triplet $(i,j,k) \in T$ is satisfied under a stochastic selection rule:

$$p_{ijk} = \frac{\left(1 + \frac{\lVert x_i - x_j \rVert_2^2}{\alpha}\right)^{-\frac{\alpha+1}{2}}}{\left(1 + \frac{\lVert x_i - x_j \rVert_2^2}{\alpha}\right)^{-\frac{\alpha+1}{2}} + \left(1 + \frac{\lVert x_i - x_k \rVert_2^2}{\alpha}\right)^{-\frac{\alpha+1}{2}}}, \qquad (3.2)$$

where $\alpha$ regulates the thickness of the tails of the t-Student kernels in $p_{ijk}$. The goal is to maximize the log-probabilities over all the triplets in $T$ to find the embedding $X$:

$$\max_X \sum_{(i,j,k) \in T} \log p_{ijk} \qquad (3.3)$$

The difficulty of solving this problem lies in the fact that using t-Student kernels produces an objective function that is non-convex (a sum of quasi-concave functions $\log(p_{ijk})$). In t-STE, this optimization problem is solved using gradient descent with random initializations. Attempts to produce 1-dimensional embeddings in [141] show that artifacts can appear in the final solution for many random initializations, but disappear when the optimization is initialized to an educated guess (i.e., a signal similar to the desired embedding $X$). Each obtained embedding $X$ is invariant to monotonic transformations of the relative metric distances between points (i.e., rotations and scaling) [147, 150]. In the case of a one-dimensional embedding ($m = 1$), only the shift and scale need to be corrected (using an affine transformation). In scenarios where the target signal is known a priori, isotonic regression can be employed to learn a monotonic transformation which optimally maps the embedding to the target.

3.2.4 Ground Truth Estimation via Frame-wise Ordinal Triplet Embedding

This section presents results from a previous publication [121].

3.2.4.1 Overview

This section presents and explores a triplet embedding algorithm for annotation fusion in which the annotations produced by distinct annotators vote on the similarity of the target construct between each unique triplet of frames. The model assumes that comparisons between annotation values at distinct times are meaningful. Thus it presumes that annotators are able to continuously rate the stimulus in real time with enough consistency that approximately the same value is assigned whenever the same assessment is made at two distinct points in time, after time-alignment. The observations in Section 3.2.2 suggest that consistency may not be preserved during continuous real-time annotation, but this method aims to show that making this simplifying assumption still produces higher quality annotation fusions than approaches based on averaging. First, this idea is tested on the green intensity data set presented in Section 3.2.2 to show that this majority vote triplet embedding approach yields a sensible fused annotation. Then the method is applied to the RECOLA dimensional emotion data set [6] to produce better gold-standard labels for dimensional emotion. The findings suggest that these proposed gold-standard labels can improve emotional valence prediction from the associated features while having minimal effect on arousal prediction.
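A minimal numerical sketch of the t-STE objective in Equations 3.2 and 3.3 is given below, including the warm start from an "educated guess" that the text recommends for one-dimensional embeddings. Using a derivative-free scipy optimizer instead of the gradient descent of the original t-STE implementation is a simplification for illustration.

import numpy as np
from scipy.optimize import minimize

def tste_loss(x, triplets, alpha=1.0):
    """Negative log-likelihood of triplets (i, j, k) -- 'i is closer to j than
    to k' -- for a 1-D embedding x under the t-STE selection rule (Eq. 3.2)."""
    loss = 0.0
    for i, j, k in triplets:
        kij = (1.0 + (x[i] - x[j]) ** 2 / alpha) ** (-(alpha + 1) / 2)
        kik = (1.0 + (x[i] - x[k]) ** 2 / alpha) ** (-(alpha + 1) / 2)
        loss -= np.log(kij / (kij + kik))
    return loss

# Build triplets from a small known signal, then recover a 1-D embedding.
z = np.array([0.1, 0.3, 0.35, 0.8, 0.75, 0.2])
triplets = [(i, j, k)
            for i in range(len(z)) for j in range(len(z)) for k in range(len(z))
            if len({i, j, k}) == 3 and abs(z[i] - z[j]) < abs(z[i] - z[k])]

x0 = z + 0.1 * np.random.default_rng(4).standard_normal(len(z))  # educated guess
x_hat = minimize(lambda x: tste_loss(x, triplets), x0, method="Nelder-Mead").x

# The embedding is only defined up to shift, scale, and reflection, so compare ranks.
print(np.argsort(z), np.argsort(x_hat))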
3.2.4.2 Majority-vote Triplet Embeddings Although the values provided during continuous real-time annotation may not be directly comparable, their relative pairwise distances are presumed to be a more relevant and reliable source of information. Therefore, for all annotators and arbitrary frames (i;j;k)2 T, the following relationship is of interest: d(z i ;z j ) ? 7d(z i ;z k ) (3.4) Here, d(z i ;z j ) represents the unknown true dissimilarity of the construct being annotated between time i and time j (note that i and j are not necessarily instants in time, but may refer to short time windows). The direction of the relation in Equation 3.4 can be determined for each annotation separately by applying the dissimilarity function over the three frames' values, which leads to a natural weighted majority vote scheme to decide which direction in Equation 3.4 is correct overall. All possible unique triplets (i;j;k) are considered and weights assigned to each individual annotator response: Decision ofA 1 : d 1 (z i ;z j )<d 1 (z i ;z k )2T)w < 1 Decision ofA 2 : d 2 (z i ;z j )<d 2 (z i ;z k )2T)w < 2 . . . Decision ofA r1 : d r1 (z i ;z j )>d r1 (z i ;z k ) = 2T)w > r1 Decision ofA r : d r (z i ;z j )<d r (z i ;z k )2T)w < r Thew a (a2f1; 2;:::;rg) variable denotes the weight of annotatora relative to other annota- tors, and the dissimilarity functions d a (;) are indexed using subscripts as a reminder that each annotator perceives events dierently. Each annotator's weight is assigned beforehand 82 using one of many techniques described in the next subsection, but intuitively each weight is proportional to an amount of trust assigned to the corresponding annotator. For any given triplet, if X r w < a > X r w > a ; (3.5) then the method concludes (i;j;k)2T. Algorithm 1 Generate set of tripletsT using annotator weights and a majority vote strategy. Data: A2R nr : Annotations matrix Input: weights2R r : Annotator weights Result: Set of triplets T triplets [] for a 1 to r do /* Compute all pairwise distances between values of each column of A. Each D[a] is a distances matrix between all points in each A[:;a]. */ D[a] = distances(A[:;a]) end for k 1 to n do for j 1 to k1 do for i 1 to n do // Iterate over unique triplets (i;j;k) if i6= j and i6= k then w < = 0 w > = 0 for a 1 to r do // Compute each annotator decision // for each unique triplet (i;j;k) if D[a][i;j]<D[a][i;k] then w < += weights[a] else if D[a][i;j]<D[a][i;k] then w > += weights[a] end end if w < >w > then triplets.append((i;j;k)) else if w < <w > then triplets.append((i;k;j)) end end end end end Algorithm 1 shows the implementation for the triplet generation through weighted ma- 83 jority voting. The implementation takes a matrix A2R nr , where each column represents an annotation time series from one of r annotators, and the rows represent time frames. The implementation takes a weight vectorw2R r , representing the trust in each annotator. The implementation is exible enough that settingw a = 0 will remove annotatora from the decision process (leave-one-annotator-out), while recovering a simple majority vote if w a is constant for all annotators. An interesting feature of this majority vote embedding approach is that onceX is com- puted, the number of triplets inT that violate the distances inX may be computed, leading to a measure of agreement across all annotators and in the construction of the embedding itself. We revisit this idea in our discussion in Section 3.2.4.5. 
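A compact executable rendering of Algorithm 1 is sketched below. One reading note: the second inequality in the printed pseudocode appears to duplicate the first; the natural interpretation, used here, is that the second weight accumulates when D[a][i, j] > D[a][i, k].

import numpy as np

def majority_vote_triplets(A, weights):
    """A: (n_frames, r_annotators) annotation matrix; weights: length-r trust
    values. Returns triplets (i, j, k) meaning the weighted majority judged
    frame i closer in value to frame j than to frame k."""
    n, r = A.shape
    # Per-annotator pairwise distance matrices D[a][i, j] = |A[i, a] - A[j, a]|
    D = [np.abs(A[:, a, None] - A[None, :, a]) for a in range(r)]
    triplets = []
    for k in range(n):
        for j in range(k):
            for i in range(n):
                if i in (j, k):
                    continue
                w_lt = sum(weights[a] for a in range(r) if D[a][i, j] < D[a][i, k])
                w_gt = sum(weights[a] for a in range(r) if D[a][i, j] > D[a][i, k])
                if w_lt > w_gt:
                    triplets.append((i, j, k))
                elif w_gt > w_lt:
                    triplets.append((i, k, j))
    return triplets

# Setting a weight to zero removes that annotator (leave-one-annotator-out);
# equal weights recover a simple majority vote.
A = np.array([[0.1, 0.2, 0.1], [0.5, 0.6, 0.4], [0.9, 0.8, 0.9], [0.4, 0.5, 0.5]])
print(majority_vote_triplets(A, weights=np.ones(3)))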
Annotator Weights Three approaches are tested for assigning weights to annotators: unweighted, weighted, and weighted leave-one-out. In the unweighted scenario, each annotator is given an equal weight $w_a$, so each triplet comparison receives an equal vote. In the weighted scenario, the concordance correlation coefficient (CCC) is used to assess the similarity between two annotators, which is defined as:

$$\rho_c = \frac{2\rho\sigma_x\sigma_y}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2} \qquad (3.6)$$

A matrix of correlations $S$ is obtained, such as in Figure 3.8, and then row-wise normalized so that $w_a = \sum_i S(a,i) - 1$. Thus annotations that agree with each other are given higher weights, and annotations in disagreement with the majority are given less weight. Lastly, a weighted leave-one-out strategy is adopted where the annotator with the lowest weight is left out entirely and the embedding depends only on the remaining annotations.

Figure 3.8: Agreement between annotators (FM1-FM3, FF1-FF3) for two example subjects from the RECOLA emotion dataset in two different annotation tasks: (a) AVEC arousal, dev 7, and (b) AVEC valence, train 2. Agreement is measured using CCC. The overall agreement in valence is higher than the overall agreement for arousal.

3.2.4.3 Experiments

The majority vote triplet embedding method is tested in two scenarios. First, the method is used to fuse annotations in the green intensity annotation experiment described in Section 3.2.2. This test helps show that the embedding algorithm produces intuitive gold-standard labels qualitatively similar to the average of annotations. Then the method is applied to the continuous annotations from the RECOLA data set using the previously described weighting schemes, and also using well-established time series warping methods, to try to produce a gold-standard annotation fusion that best approximates the levels of emotional arousal and valence.

Green Intensity Annotations This majority vote triplet embedding algorithm is applied to the individual annotations in both Task A and Task B from the green intensity data set (Section 3.2.2) to produce gold-standard labels, which are evaluated in Section 3.2.4.4.

RECOLA Emotion Annotations The RECOLA data set contains real-time human annotations of dimensional affect (valence and arousal) on a continuous scale. Each emotional dimension is separately annotated, so the triplet embedding method is employed for each emotional construct separately. Since the true valence and arousal signals are unknown, different methods for comparing the proposed gold-standard annotation fusions are used. In this section, an assertion is made in order to establish an evaluation criterion: ideal gold-standard labels minimize the unexplained variance among the individual annotations and are also easier to learn from the features using simple models. A variety of unimodal and multi-modal linear regression models with different regularizers are trained on all available features in the RECOLA data set (physiologic, video-based, and audio-based) with the gold-standard serving as labels; each model then attempts to reconstruct the gold-standard as accurately as possible. The gold-standard is compared to its projection into each model's representation space using CCC, which provides a measure of the "quality" of the gold-standard. This evaluation method follows the AVEC 2018 gold-standard emotion sub-challenge guidelines, and more information can be found in [151].
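Returning to the annotator-weighting scheme defined above, the sketch below computes the CCC of Equation 3.6 and the row-wise weights $w_a = \sum_i S(a,i) - 1$ derived from the pairwise agreement matrix $S$; the final division by the weight total is an added convenience for illustration, not part of the original formulation.

import numpy as np

def ccc(x, y):
    """Concordance correlation coefficient (Equation 3.6)."""
    mx, my = x.mean(), y.mean()
    covariance = np.mean((x - mx) * (y - my))  # equals rho * sigma_x * sigma_y
    return 2 * covariance / (x.var() + y.var() + (mx - my) ** 2)

def annotator_weights(A):
    """Row-normalized pairwise-CCC weights for an (n_frames, r) annotation matrix."""
    r = A.shape[1]
    S = np.array([[ccc(A[:, a], A[:, b]) for b in range(r)] for a in range(r)])
    w = S.sum(axis=1) - 1.0   # subtract the self-agreement on the diagonal
    return w / w.sum()        # optional: scale weights to sum to one

rng = np.random.default_rng(5)
latent = np.sin(np.linspace(0, 6, 200))
A = np.column_stack([latent + 0.1 * rng.standard_normal(200),
                     latent + 0.1 * rng.standard_normal(200),
                     0.2 * latent + 0.5])  # a low-agreement annotator
print(annotator_weights(A))  # the third annotator receives the smallest weight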
The majority vote triplet embedding approach results in non-linear spatial warping of the annotations, but makes no adjustments in time. Therefore, it is tested using two state- of-the-art temporal warping methods to achieve better time alignment of annotations and features: dynamic time warping (DTW) [133] and generalized time warping (GTW) [152]. Dynamic time warping adjusts the sequence of temporal indices of one signal given an- other reference signal to achieve an optimal correlation between the two. Since the true emotional signal is unknown, a single feature is used as a reference signal and then DTW ap- plied to each individual annotation to align them with the reference feature. An exhaustive search of all corresponding video, audio, and physiologic features provided in the RECOLA data set is performed to nd the single feature with the highest average Pearson correlation with each corresponding annotation, which turns out to be the geometric video-based feature \geometric feature 245 amean". Figure 3.10 shows an example annotation of arousal before and after using DTW for temporal alignment to this reference feature. Python's dtw package is utilized for this. Generalized time warping is an enhanced version of canonical time warping which at- tempts to learn a monotonic temporal warping function and feature projection function that together maximize the correlation between the projected features and the temporally warped signal. Matlab code provided by [138] is utilized to implement GTW and then a grid search for tuning d, the canonical correlation analysis (CCA) energy threshold hyperparameter, is performed by maximizing the evaluation metric, CCC. The performance is not impacted signicantly for d2f0:6; 0:7; 0:8; 0:9; 0:95g but d = 0:7 is slightly better. 86 0 50 100 150 200 250 0 0:5 1 [s] Green intensity (a) Task A 0 50 100 150 0 0:5 1 [s] Ground truth Average Triplets (b) Task B Figure 3.9: Plots for TaskA (left) and TaskB (right) from the color intensity annotation dataset. The true color intensity signal is shown (black) alongside the unweighted average of individual anno- tations (purple) and a gold-standard produced using an unweighted version of the proposed triplet embedding algorithm (light blue). This shows the proposed method is sensible and qualitatively similar to the average signal. Since the number of triplets necessary to fully specify each annotation's vote scalesO(n 3 ) with the number of frames, rst the annotations are downsampled from 25Hz to 1Hz using a polyphase FIR lter. This leads to 13,365,300 unique triplets for a signal with 300 samples. Once the optimal embedding is obtained, it is upsampled to the original sampling rate. These resampling steps are performed in Julia using the dsp package. Furthermore, both DTW and GTW require that the features and annotations are temporally aligned. Thus, when applying these two methods, the annotations are downsampled by a factor of 10 to match the sampling rate of the features. After running the time warping method, the monotonic temporal map is piecewise linearly interpolated to provide time indices at the original resolution. 3.2.4.4 Results The results from the green color intensity validation experiment and application of this fusion technique to the RECOLA data set are presented. Green Intensity Validation Figure 3.9 shows the resulting proposed gold standard from the majority vote triplet embedding applied to the green color intensity data. 
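Before turning to the results, here is a from-scratch sketch of the DTW step described above: each annotation is aligned to a reference feature sequence and then re-indexed onto the reference timeline. The chapter's experiments used the Python dtw package; the textbook implementation below is a stand-in so the alignment logic is visible, and z-normalizing both signals before alignment is an assumption rather than a detail stated in the text.

import numpy as np

def znorm(x):
    return (x - x.mean()) / (x.std() + 1e-12)

def dtw_path(a, b):
    """Classic O(len(a) * len(b)) dynamic-time-warping alignment path."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    path, i, j = [], n, m
    while i > 0 and j > 0:  # backtrack the optimal path
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        i, j = (i - 1, j - 1) if step == 0 else (i - 1, j) if step == 1 else (i, j - 1)
    return path[::-1]

def warp_annotation_to_feature(annotation, feature):
    """For each feature frame, average the annotation samples matched to it."""
    path = dtw_path(znorm(annotation), znorm(feature))
    warped = np.zeros(len(feature))
    counts = np.zeros(len(feature))
    for ia, jf in path:
        warped[jf] += annotation[ia]
        counts[jf] += 1
    return warped / np.maximum(counts, 1)

t = np.linspace(0, 10, 200)
feature = np.sin(t)
annotation = 0.5 * np.sin(t - 0.8) + 0.5   # lagged, rescaled rating
aligned = warp_annotation_to_feature(annotation, feature)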
Table 3.5 shows the correlations between this approach and the average of the individual annotations without time alignment.

Task  Method              RMSE    Pearson  CCC
A     Unweighted Average  0.1916  0.7756   0.6392
A     Triplet Embedding   0.1907  0.7762   0.6410
B     Unweighted Average  0.1057  0.9523   0.9172
B     Triplet Embedding   0.1005  0.9594   0.9248

Table 3.5: Agreement measures between different fusion methods and the true signal in two tasks from the color intensity data set [1].

Figure 3.10: An example plot of an individual annotation (from FM1) of arousal from the RECOLA data set [6] for subject "train 5" and the corresponding feature with the highest average Pearson correlation with all annotations (geometric feature 245 amean). The two signals are shown in (a) and the DTW-warped annotation is shown in (b).

RECOLA Emotion Annotations The fused annotations serving as proposed gold-standards for the RECOLA emotion annotations are evaluated using CCC as described in Section 3.2.4.3. Table 3.6 shows the resulting correlations compared to the AVEC challenge baseline algorithm (for details see [151]). Figure 3.8 shows example annotator agreement matrices used to compute the weights as described in Section 3.2.4.2. Figure 3.10a shows an example of one annotation plotted alongside the feature having the highest correlation, and Figure 3.10b plots the same annotation after applying DTW. Figure 3.11 plots the percentage of triplet violations remaining after running the majority vote triplet embedding algorithm to convergence. Since the triplet optimization subroutine is initialized with a signal very close to the average of individual annotations, the resulting embedding is extremely stable over multiple trials. Triplet violations for only one trial run per annotation task are plotted.
Figure 3.11: Percentages of triplet violations after t-STE convergence, shown per session (dev 1 through dev 9, train 1 through train 9) of the AVEC/RECOLA data set, separately for arousal and valence, and for each scheme (individual, DTW, and GTW; each unweighted, weighted over all annotators, and weighted leave-one-out).

Scheme / Model              Baseline       Triplet Emb.   Triplet Emb. (DTW)  Triplet Emb. (GTW)
                            Aro.   Val.    Aro.   Val.    Aro.    Val.        Aro.    Val.
Unweighted
  Unimodal                  N/A    N/A     0.751  0.561   0.469   0.559       0.751   0.561
  Multimodal-Multi Rep      N/A    N/A     0.751  0.693   0.679   0.687       0.751   0.696
  Multi-Rep                 N/A    N/A     0.751  0.676   0.669   0.674       0.751   0.676
  Multimodal Hierarchic     N/A    N/A     0.751  0.685   0.674   0.677       0.751   0.683
Weighted
  Unimodal                  0.760  0.506   0.753  0.561   0.576   0.561       0.751   0.561
  Multimodal-Multi Rep      0.760  0.506   0.753  0.691   0.687   0.693       0.751   0.692
  Multi-Rep                 0.772  0.506   0.753  0.671   0.673   0.678       0.751   0.676
  Multimodal Hierarchic     0.775  0.570   0.753  0.685   0.684   0.677       0.751   0.686
Weighted Leave One Out
  Unimodal                  N/A    N/A     0.753  0.561   0.753   0.561       0.751   0.561
  Multimodal-Multi Rep      N/A    N/A     0.753  0.692   0.753   0.691       0.751   0.690
  Multi-Rep                 N/A    N/A     0.753  0.676   0.753   0.676       0.751   0.676
  Multimodal Hierarchic     N/A    N/A     0.753  0.685   0.753   0.685       0.751   0.686

Table 3.6: CCC values for various proposed gold-standard annotation fusions using the 2018 AVEC emotion sub-challenge evaluation metric explained in Section 3.2.4.3 (Aro. = arousal, Val. = valence).

3.2.4.5 Discussion

It is clear from the results of the validation experiment in Table 3.5 that this method produces gold-standard labels comparable to simple unweighted averages. This empirically helps illustrate that, even though this majority vote triplet embedding technique is more complicated, it produces sensible annotation fusions. The results from Table 3.6 comparing different gold-standard proposals on the RECOLA emotion data set are somewhat surprising. None of the triplet embeddings are able to surpass the baseline's arousal score, even the approaches using time warping to improve annotation alignment. The drop in CCC for arousal is quite small in most cases, however, and is dominated by the improved valence correlations. Even when no time warping is performed, the triplet embedding improves over the baseline method's CCC score for valence substantially. Additionally, when the most adversarial annotator is left out (there is usually only one in the RECOLA data), DTW performs quite well. This indicates that the embedding approach applied to DTW-warped annotations is very sensitive to annotation outliers, which may in part be due to the choice to use a single feature as a reference signal for the time warping. GTW can accommodate all features and works well in these tests, and so it is considered to be a more robust approach when time warping is used. A couple of unexpected results come from the experiments on the RECOLA data set. First, the performance difference between using and not using GTW when applying majority vote triplet embeddings is negligible in spite of the visually noticeable difference in the temporal warping shown in Figure 3.10.
Secondly, the number of violated triplets at convergence shown in Figure 3.11 is less in the weighted and unweighted DTW cases, but the CCC values associated with them is often worse than the other methods. So, the relationship between the quality of a gold-standard and the quality of an embedding cannot be easily inferred from the number of triplet violations. Overall, the results in this section demonstrate the potential for ordinal information extracted from annotations to produce higher quality ground truth representations than state-of-the-art methods. We investigate this idea further in the next section. 3.2.5 A Framework for Ground Truth Estimation via Perceptual Similarity Warping Building o the observation that triplet comparisons obtained from the annotations can be used to generate a reasonable ground truth representation, this section presents a two- stage annotation method for correcting various global inconsistencies, artifacts, and errors introduced by annotators during the real-time continuous-scale annotation process. This 90 content is replicated from a previous publication [1]. 3.2.5.1 Perceptual Similarity Warping Framework The method presented in this section leverages a recurring observation that annotators more successfully capture trends and less accurately represent exact ratings [51{53]. In this ap- proach, additional information in the form of similarity comparisons between unique seg- ments in time of the video must be collected from annotators after the continuous annotation task. The structure of the initial fusion of annotations is leveraged to identify peaks, valleys, and spans of time where the target construct does not appear to change and comparisons corresponding to these segments are obtained. This approach is summarized as a sequence of steps and also depicted in Figure 3.12: 1. Preliminary annotation fusion 2. Total variation denoising (TVD) 3. Constant interval extraction 4. Triplet comparison collection 5. Ordinal embedding 6. Fused annotation warping The rst step fuses the raw annotations together to form a single time series, and, in prin- ciple, nearly any existing annotation fusion method could be used at this stage (see Section 3.2.1 for an incomplete list). Total variation denoising (TVD) is then used to approximate the fused signal as a piecewise-constant step function in order to facilitate the identication of segments in time where the target construct does not change noticeably (i.e., peaks, valley, and plateaus). Nearly constant intervals of the fused signal are extracted yielding these time segments and then additional rank information is procured from annotators to re-evaluate the proper sorting of these constant intervals with respect to the target construct. We collect comparison results among unique triplets of these constant intervals and employ an ordinal embedding technique to re-rank them. Finally, the fused signal is warped piecewise-linearly so the corresponding constant intervals align with the embedding. These steps and their assumptions are described in detail in their corresponding sections below and Figure 3.13 91 shows an example result after applying this technique. Evaluation of the method is later performed using the green intensity experiment (see Section 3.2.2). Raw Annotations Fused Annotations Signal Approximation Constant Interval Segmentation Triplet Ordinal Embedding Human-based Triplet Comparisons (e.g. 
crowdsourcing), triplet comparisons of the form $\lVert x_i - x_j \rVert < \lVert x_i - x_k \rVert$, and finally the warped fused signal adopted as the ground truth.

Figure 3.12: Overview of the perceptual similarity warping algorithm for accurate ground truth generation.

Annotation Fusion The first step involves fusing the annotations into a single representative signal. Many methods have been proposed for this [7, 133, 136, 137, 139, 153, 154], and in principle any choice works for this step. This experiment uses a simple per-annotator time shift (EvalDep) proposed by Mariooryad et al. [7]. This method requires some feature sequences to be extracted from the video for alignment, so it is provided with the true green intensity and its forward difference per frame. After shifting each annotation by its own lag estimate (approximately 1.6 seconds each), the trailing frames are truncated so all annotations are equal length, and the annotations are then averaged in time.

Total Variation Denoising Total variation denoising (TVD) has been successfully used to remove salt-and-pepper noise from images while simultaneously preserving signal edges [54]. In this context, it is used to identify the set of peaks and valleys where the annotation rating may be inaccurate, and also to find the set of nearly constant regions of the fused annotation signal corresponding to a lack of noticeable change in the target construct. The TFOCS Matlab library [55] is used to find a new vector $y^*$ that approximates the fused annotation $x$ by minimizing:

$$y^* = \arg\min_y \Big[ \sum_t \lVert x_t - y_t \rVert_{\ell_2}^2 + \lambda \sum_t \lVert y_{t+1} - y_t \rVert_{\ell_1} \Big] \qquad (3.7)$$

The parameter $\lambda$ controls the influence of the temporal variation term and the degree to which $y$ is approximately piecewise-constant. For this study, $\lambda$ is coarsely hand-tuned: starting from a very tiny number (e.g., $\lambda = 10^{-9}$), it is increased by multiples of ten until the signal first starts appearing piecewise constant. Then $\lambda$ is finely tuned using a grid search over multiples of that coarse value between $[1, 10]$. For this green intensity experiment, a value of $\lambda = 0.05$ is used. In theory, this parameter could be automatically selected based on other criteria and heuristics, such as "elbow plot" optimization.

Constant Interval Domain Extraction A simple heuristic is used to extract nearly constant intervals from the TVD signal. In this step, the TVD signal is scanned to find the smallest set of (largest) intervals where each interval satisfies two criteria: (1) the total height does not exceed a tunable threshold $h$, and (2) the frame length of the interval is at least $T$ frames. The domains of the constant intervals are recorded to produce a collection of non-overlapping closed intervals. A height threshold $h = 0.003$ and a minimum frame count threshold $T = 17$ are used when applying this method to the green annotation experiment (see Section 3.2.5.2). The $h$ value is chosen to be quite small relative to the annotation scale, and it is observed that for well-denoised signals this constant interval extraction step is not very sensitive to this parameter. The $T$ parameter is selected to be the smallest value greater than 10 frames (an average human reaction time at 30 Hz) that produces a manageable number of intervals. The number of triplets for $n$ intervals grows $O(n^3)$, so the interval count should be minimized to ease future computation.

Triplet Comparisons In this step, annotators are asked to examine three video segments at a time and make a judgement about their relative similarity. A collection of video segments is produced by extracting a different video clip from the initial green intensity video for each unique constant interval domain in the previous step. For each unique triplet of video segments, an annotator is asked to perform a triplet annotation task. In this type of task, one video segment serves as a reference and the other two as test candidates, and the annotator is instructed to select which of the two candidate video segments is most similar to the
A collection of video segments are produced by extracting a dierent video clip from the initial green intensity video for each unique constant interval domain in the previous step. For each unique triplet of video segments, an annotator is asked to perform a triplet annotation task. In this type of task, one video segment serves as a reference and the other two as test candidates, and the annotator is instructed to select which of the two candidate video segments is most similar to the 93 reference. In this section, the comparison results are simulated using the true average green intensity over each movie segment (recall that the intensity value is approximately constant within each movie segment), but collecting triplet comparisons from humans has been studied and proven useful in other works [155]. Ordinal Embedding Ordinal embedding problems attempt to learn a (typically lower dimension) embedding that preserves a similarity relationship between subsets of data points. For this application and for reasons described in Section 3.2.3, the comparisons are obtained in triplet form. Using the notation from that section, given a set of video segments Z = fz 1 ;:::;z n g with each z2R and a set of triplet comparisons overZ of the form d(z i ;z j ) < d(z i ;z k ) (for some unknown perceptual dissimilarity function d(;)), wherefi;j;kg is a 3- subset off1; 2;:::;ng, the goal is to nd a setX =fx 1 ;:::;x n g with each x2R such that: kx i x j k<kx i x k k()d(z i ;z j )<d(z i ;z k ) for some norm onX. In the perceptual similarity warping method (see Figure 3.12), ordinal embedding is used to reevaluate the constant intervals and sort them so their ranks are more aligned to the true signal. Many ordinal embedding solvers over triplets have been proposed [155, 156]. The t-stochastic triplet embedding (t-STE) approach [155] is employed because, as the authors highlight, it aggregates similar points and repels dissimilar ones leading to simpler solutions. Figure 3.13 shows the embedding results for the extracted constant intervals on Task A from the green intensity experiment, which have been rescaled to the proper [0; 1] range and computed using all possible triplet comparisons obtained by an oracle. Note that the embedding only preserves the relative similarity relationships, so the resulting values are expected to be misaligned with the true signal by a (unknown) monotonic transformation of the objective truth. 94 0 20 40 60 80 100 120 140 160 180 200 220 240 260 0 0.2 0.4 0.6 0.8 1 Time [s] Green intensity value Objective truth Average signal Embedded intervals Warped signal Figure 3.13: Plot of the objective truth signal, time-shifted average annotation signal (EvalDep), warped signal (proposed), and the 1-D embedding for extracted constant intervals for Task A. The spatially warped signal better approximates the structure of the objective truth and also achieves greater self-consistency over the entire annotation duration. Spatial Warping In the nal step, the fused annotation is spatially warped to rectify inconsistencies using the ordinal embedding results. Within the time span of each interval, the fused annotation is shifted so its average over the interval is equal to its correspond- ing embedding value. The fused annotation between each constant interval is oset and linearly scaled to align with its neighboring repositioned constant intervals. A linear inter- interval warping function is utilized because it minimally distorts the signal. The formalized denition is given in Figure 3.14. 
$$I_i = \begin{cases} \{\, t : \min(C_i) \le t \le \max(C_i) \,\} & i \in \{1, 2, \ldots, |C|\} \\ \{0\} & i = 0 \\ \{T\} & i = |C| + 1 \end{cases}$$

$$S_i = \begin{cases} E_i - \frac{1}{|I_i|} \sum_{t \in I_i} y_t & i \in \{1, 2, \ldots, |C|\} \\ 0 & \text{otherwise} \end{cases}$$

$$y'_t = \begin{cases} y_t + S_i & \exists\, I_i : t \in I_i \\ y_t + \dfrac{y_t - y_a}{y_b - y_a} S_{i+1} + \dfrac{y_b - y_t}{y_b - y_a} S_i & \exists\, i : a \le t \le b, \ \text{where } a = \max(I_i),\ b = \min(I_{i+1}) \end{cases}$$

Figure 3.14: Equations for the rank-based spatial warping method. Let $t \in \{1, 2, \ldots, T\}$ be a time index, $y_t$ denote the fused annotation signal, $y'_t$ denote the warped signal value, and let $C$ be the ordered sequence of non-overlapping time intervals corresponding to the extracted constant intervals. $E$ is defined as the sequence of embedding values in $\mathbb{R}^d$ corresponding to the time interval sequence $C$. The sequence $I$ is used instead of $C$ to handle edge cases. For notational simplicity, a new sequence $S$ is used whose $i$-th element is the difference between interval $i$'s average value and the corresponding embedding value.

3.2.5.2 Experiments and Results

Table 3.7 shows various agreement measures comparing the true green intensity signal to both the rank-based warped signal and a baseline ground truth obtained from the EvalDep method [7]. The proposed method improves the accuracy of the ground truth estimate in both tasks, and Figure 3.13 shows that it produces a signal with valuations that are more consistent over large periods of time.

Task  Signal Type      Pearson  Spearman  Kendall's Tau  NMI
A     EvalDep Average  0.906    0.946     0.830          0.484
A     Warped EvalDep   0.967    0.939     0.835          0.562
B     EvalDep Average  0.969    0.969     0.855          0.774
B     Warped EvalDep   0.988    0.987     0.906          0.862

Table 3.7: Agreement measures for baseline and proposed warped fused annotation approaches. All warped results use a complete set of ordinal comparisons from an oracle. NMI = normalized mutual information.

Although one would expect the proposed rank-based perceptual similarity warping procedure to improve rank-based correlation metrics, the Spearman correlation decreases slightly, primarily due to frame-level rank disagreements over the warped constant intervals rather than disagreements at a large scale due to the ordinal embedding. In other words, this disparity is largely due to the "jitter" present in the preliminary fused annotation which is not present in the true green intensity signal. More generally, a decrease in rank-based correlation can occur when any non-injective function is piecewise linearly warped, and thus it is not a particular artifact of this method.

3.2.5.3 Discussion

Clearly, the proposed rank-based perceptual similarity warping procedure improves the accuracy of the ground truth compared to the "time-shift and average" style baseline algorithm. The method achieves this because it permits more accurate comparative information gathered from humans to correct mistakes introduced during real-time annotation. The total variation denoising and constant interval extraction procedures, unfortunately, require careful tuning to achieve desirable results. These hyperparameters are eliminated using an alternative signal approximation technique presented in the next section.

3.2.6 Trapezoidal Segmented Regression: An Algorithm for Identifying Perceptual Similarity in Annotations

This section introduces a novel method for approximating human-produced annotations, which better facilitates the perceptual similarity warping procedure discussed in the last section. Content from previously published work [122] is reproduced here.
3.2.6.1 Overview As mentioned in the previous section, Figure 3.12 gives an overview of the sequence of steps in the perceptual similarity warping (PSW) algorithm. While this procedure is shown to be eective at reducing human bias and artifact noise and also improving ground truth accu- racy in the green intensity annotation experiment, its eectiveness depends on the quantity and average quality of triplet comparisons and the successful tuning of the constant interval segmentation method. The former concern is mitigated by the redundancy in the infor- mation content of a full set of triplet comparisons. Based on some exploratory work, also corroborated by [141], only around 5-10% of the total number of triplets with a 90% triplet comparison accuracy rate is required to obtain a good ordinal embedding. The latter issue regarding the successful tuning of the signal approximation and constant interval segmenta- tion method is the focus here. In the previous rank-based PSW procedure, TVD is used to obtain a piecewise-constant signal approximation ^ w of an input signal v by optimizing the following: ^ w = arg min w h X t kv t w t k 2 ` 2 + X t kw t+1 w t k ` 1 i The coecient is tunable and weighs the relative importance of the total variation term (` 1 -norm), which makes the solution more piecewise constant, against the signal error term (` 2 -norm). Given a particular signal v, needs to be adjusted so the optimization produces a signal approximation ^ w similar tov but with nearly constant regions at the extrema. This process can be tedious, requiring either many heuristics or several iterations of hand tuning with a human examining the result each time. Some smooth functions with large peaks and shallow bump features are highly sensitive to perturbations of and more dicult to tune. 98 Furthermore, needs to be congured separately for every signal, even for those with similar structure but varied scales. In practice it is common for TVD to produce piecewise signals where the segments have at least a slight curvature as illustrated in Fig. 3.15, which hampers the constant interval segmentation stage in the PSW method. For these reasons, using TVD to approximate the fused annotation complicates the PSW procedure and hinders its ability to provide an accurate ground truth. 90 100 110 120 130 140 0:65 0:7 0:75 Time (sec) Signal Value Signal TSR TVD Figure 3.15: Illustration of the curvature produced by TVD when approximating nearly constant regions of a signal. The proposed trapezoidal segmented regression (TSR) algorithm approximates these regions with constant segments making the extraction of near-constant intervals easier. 3.2.6.2 Method A new algorithm called trapezoidal segmented regression (TSR) is proposed, which can be used for signal approximation to better enable constant interval segmentation compared to TVD as was previously used. First, prior work on segmented regression is explored and then a formal denition of TSR is provided. Segmented Regression Background and Denitions In general, segmented regres- sion (SR) involves the approximation of a sampled signal by means of tting (typically) simpler functions over T segments of the domain. When the segment domain endpoints are known a priori, this problem reduces toT local best-t optimization problems. SR becomes harder when the segments are not known in advance and the locations of the break points, also known as knots, need to be calculated as well. 
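As a small illustration of the remark above that segmented regression with known knot locations reduces to independent local fits, the sketch below fits one least-squares line per segment of a toy signal. Unlike the TSR formulation developed next, this version neither enforces continuity at the knots nor alternates sloped and constant segments; it only demonstrates the "knots known a priori" special case.

import numpy as np

def piecewise_fit(x, y, knots):
    """Least-squares line fit per segment when the knot locations are known."""
    bounds = [x.min()] + sorted(knots) + [x.max()]
    fits = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        mask = (x >= lo) & (x <= hi)
        slope, intercept = np.polyfit(x[mask], y[mask], deg=1)
        fits.append({"domain": (lo, hi), "slope": slope, "intercept": intercept})
    return fits

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 200)
y = np.piecewise(x, [x < 4, (x >= 4) & (x < 7), x >= 7],
                 [lambda v: 0.5 * v, 2.0, lambda v: 2.0 - 0.8 * (v - 7)])
y = y + 0.05 * rng.standard_normal(x.size)
for segment in piecewise_fit(x, y, knots=[4, 7]):
    print(segment)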
Generalized SR for points in a variable d-dimensional space with a xed number of T segments (T > 1) is a NP-complete problem 99 as shown in [157]. For one-dimensional signals, however, the problem is solvable in polyno- mial time. Variants of SR limit the domain for knot point selection to the set of sampled points to reduce computational complexity. In this section, the primary aim will be nding the optimum segmented regression for a xed T where segment boundaries are allowed to occur at arbitrary real-valued locations in time. The problem space is further constrained to continuous signal approximations, so SR solutions are only considered where adjacent trapezoidal line segments intersect at their shared boundaries. Trapezoidal functions are fundamental and widely used in signal processing and electrical engineering [158]. Figure 3.16 shows a prototypical trapezoidal function on the left containing two sloped line segments and a constant line segment connecting them. A slightly broader denition of trapezoidal signals is adopted in this formulation: piecewise linear signals with every other line segment having a slope of exactly zero. On the right, Figure 3.16 also displays an example optimum four-segment trapezoidal signal approximation of the sample points using this relaxed denition. 0 1 2 3 4 5 0:2 0:4 0:6 0:8 0 0:2 0:4 0:6 0:8 0:2 0:4 0:6 0:8 Figure 3.16: Two trapezoidal signal examples. The left shows a prototypical trapezoidal signal shape. The right contains the optimum four-segment trapezoidal signal t to the sample points. A broader denition of a trapezoidal signal is used: continuous alternating sloped and constant line segments. Each line type is colored dierently for clarity. Though trapezoidal signals are ubiquitous in some domains, trapezoidal signal regression has not been studied in the literature. Piecewise linear signal approximations have been studied a great deal, however, and form the basis of the approach in this section. 100 Universal Signal Approximation Below is a formal proof that trapezoidal functions can approximate any human-produced annotation to arbitrary precision, represented as a continuous function on a compact domain. Note that in practice these annotations are sam- pled, so when T exceeds the number of data points, there is no residual error. Theorem: Let f : [a;b] ! R be a continuous function with compact support. There exists a sequence of trapezoidal functions ( n ) n2N : [a;b]!R such that n !f uniformly. Proof: Given some > 0 and x;y2 [a;b],9 > 0: jxyj < =) jf(x)f(y)j < becausef is continuous. Let 0 be one such and letn2N. Partition the domain [a;b] into open intervals (x i ;x i+1 ) as follows:8i2f0; 1;:::;ng :x i =a +i ba n : Let n be a sequence of trapezoidal functions such that8i : n (x i ) = f(x i ). Within each interval (x i ;x i+1 ), let s =x i +x i+1 and let: n (x) = 8 > < > : f(x i ) x i <x s 2 f(x i ) + (f(x i+1 )f(x i ))(2xs) x i+1 x i s 2 <x<x i+1 n (x) is a piece-wise continuous trapezoidal function and consists of two segments per in- terval, thus alternating constant and linear segments. By construction, 8x 2 (x i ;x i+1 ) either: f(x i ) n (x) < f(x i+1 ) or f(x i ) n (x) > f(x i+1 ). From the intermediate value theorem, 9x 0 2 (x i ;x i+1 ) : f(x 0 ) = n (x) since n (x) is continuous in the same interval and shares the same boundary values. Pick some N 2 N such that N > ba 0 . 8nN :jx i+1 x i j< 0 , thereforejf(x i+1 )f(x i )j<. Given any x2 [a;b] and nN, we considerjf(x) n (x)j for two cases. 
If x = x i for some i, thenjf(x) n (x)j = 0. If x i <x<x i+1 for some i, thenjf(x) n (x)j =jf(x)f(x 0 )j for some x 0 2 (x i ;x i+1 ). So, 8n>N,jxx 0 jjx i+1 x i j< 0 =) jf(x)f(x 0 )j =jf(x) n (x)j<. Trapezoidal Segmented Regression The optimum TSR algorithm is formulated lever- aging the work of Camponogara and Narzari [159] which provides algorithms for a variety of piecewise-linear SR problems. The optimum TSR algorithm solves the following constrained 101 minimization problem: arg min a 1:T ;b 1:T ;k 1:T+1 min c2f0;1g T X t=1 X ktik t+1 (a c;t x i +b t y i ) 2 s.t.: minfxg =k 1 k 2 :::k T k T+1 = maxfxg 8t2f1;:::;Tg : a c;t k t+1 +b t =a c;t+1 k t+1 +b t+1 a c;t = 8 > < > : a t c =t mod 2 0 o.w. The x and y parameters are vectors whose elements contain the sample point data (x i ;y i ) for i2f1;:::;ng. In our annotation fusion application, the x values are sampled times and they values are sampled ratings. T is the desired number of segments andc2f0; 1g denotes whether the rst segment of the regression should be constant (zero slope) or not (non-zero slope). The parametersa andb are coecients of the line segments between knot points and each k 1:T+1 are the real-valued knot boundaries along the domain. Algorithm 3.17 contains the pseudo-code for optimum continuous TSR also borrowing notation from [159]. The optimum solution is produced by calling TrapezoidalSegReg twice, once with c = 0 and once with c = 1, then keeping the result with the smallest total cost (F n;T ). 102 Figure 3.17: Optimum trapezoidal segmented regression 103 The algorithm proceeds as follows assuming the sample pointsf(x 1 ;y 1 );:::; (x n ;y n )g are ordered according to the x values: x (1) x (2) :::x (n) Start with the point (x 1 ;y 1 ) and iteratively add the next point x j (on the rst iteration j = 2) maintaining a set of linear function coecients A j;t and B j;t of the right-most line segment that minimizes the total squared error cost (F j;t ) of tting t2f1; 2;:::;Tg line segments over all pointsf(x i ;y i )g j i=1 . Matrix K j;t holds the exact real-valued knot point in the domain (x values) and matrix I j;t holds the indices of the sample points to the left of each knot (used for book-keeping). Once the last point has been considered, reconstruct the signal approximation for the minimum cost set of T line segments and return it. This dynamic programming approach reduces the overall run-time complexity. The FitLine function returns the coecients of a line (y =ax +b) that minimizes the sum squared error of thex andy sample points passed in. FitConst returns the coecient b that minimizes the sum squared error over a constant line (y = b). FitLineX and Fit- ConstX return the same linear coecients that minimize the sum squared error subject to the constraint that the new line/constant segment tted to points (x i:j ;y i:j ) intersects the right-most line segment from the set of best-t coecients for points (x 1:i1 ;y 1:i1 ). In order to enforce the continuity of the piecewise trapezoidal function, the new line must in- tersect y =A i;t1 x +B i;t1 somewhere between x i1 and x i+1 (for 2in 1). Both of these methods can be expressed as convex quadratic programs (QP) and solved optimally in polynomial time. 
FitConstX should return the solution to the following convex QP: min b i;j j X k=i (b i;j y k ) 2 s.t.: b i;j B i;t1 +x maxf1;i1g A i;t1 b i;j B i;t1 +x minfi+1;ng A i;t1 (3.1) 104 FitLineX should return the solution to this convex QP: min a i;j ;b i;j j X k=i (a i;j x k +b i;j y k ) 2 s.t.: b i;j +x maxf1;i1g a i;j B i;t1 +x maxf1;i1g A i;t1 b i;j +x minfi+1;ng a i;j B i;t1 +x minfi+1;ng A i;t1 a i;j >A i;t1 (3.2) In either of the two minimization formulations, it is possible the inequality constraints needs be ipped to achieve a smaller minimum (e.g. a i;j < A i;t1 for the FitLineX procedure) which still satises the intersection point restriction. For each optimization problem both constraint cases need to be handled, thus two QPs must be solved for eachFitConstX and FitLineX call corresponding to the normal and ipped constraints and the optimum taken as the minimum. This algorithm has the same run-time complexity as its linear segmented regression coun- terpart from [159]; it requiresO(n 2 Tg) operations assuming T <n and where g is an upper bound on the steps required for the convex QP to converge (for reference, TVD has a prac- tical runtime of O(n) [160]). The TSR algorithm also requires O(n 2 ) memory. A Python implementation of TSR is publicly available on GitHub [161]. 3.2.6.3 Benets Once the trapezoidal signal approximation is computed, the constant interval segmentation step in the next stage of the PSW method becomes simpler: the constant interval set is taken directly from the constant line segments. More importantly, TSR has one parameter T required for tuning which should be roughly proportional to the number of extrema in the fused annotation signal. Given thatT expresses the desired number of segments, the resulting signal approximation is unaected by the scale of the fused signal. Humans can easily inspect the fused annotation to count the number of extrema and provide a good initial guess for the value T , unlike TVD which requires several human-in-the-loop iterations to obtain a reasonable initial guess for . Though this approach does not completely eliminate human 105 oversight from the pipeline, it succeeds in removing much of the ambiguity. Additionally, the choice of T in TSR is less sensitive to under- and overestimation than in TVD. Underestimating T causes the resulting trapezoidal approximation to fail to capture only the most subtle extrema but still provides a good approximation of the overall structure of the signal. Overestimating T creates more constant intervals than necessary but still achieves an accurate signal approximation. More details on this observation are presented in the next section. 3.2.6.4 Experiments and Results Next we will evaluate TSR as a replacement for TVD in the rank-based perceptual similarity warping method for ground truth creation by comparing the two on the green intensity annotation experiment described in Section 3.2.2 where the true underlying signal is known a priori. The EvalDep fusion method [7] is once again used to perform the preliminary fusion of annotations. These fused results and the underlying true signals are shown in Figure 3.18 along with the unaligned annotation average for reference. Figure 3.19 shows the best resulting hand-tuned signal approximations using both methods. The parameter in TVD is tuned using a log-scale grid search between 1 4 and 1 3 and TSR is tuned over a gridfT 2;T 1;:::;T + 2g where T is chosen by counting the number of peaks, valleys, and plateaus in the signal and doubling it. 
In order to compare the impact of using either TVD or TSR as part of the PSW algorithm in Fig. 3.19, an oracle is used to generate triplet comparisons for the extracted constant intervals and produce the final warped ground truth signals. Table 3.8 displays different correlation measures between these final warped signals and the corresponding true signals.

Figure 3.18: Plots of the EvalDep [7] fused annotations for Task A and Task B in the continuous-scale real-time annotation of green color intensity. The frame-level average of unaligned annotations is shown for reference. (a) Task A EvalDep fusion and true signal plots. (b) Task B EvalDep fusion and true signal plots.

Figure 3.19: Plots of the TSR and TVD signal approximations of the EvalDep fusion in both green intensity annotation tasks. (a) Signal approximations for Task A: T = 51 for TSR and λ = 0.001 for TVD. (b) Signal approximations for Task B: T = 75 for TSR and λ = 0.001 for TVD.

Task  Ground Truth Technique        Pearson  Spearman  Kendall's Tau  NMI
A     EvalDep Average               0.906    0.946     0.830          0.807
      Warped EvalDep with TVD       0.967    0.939     0.835          0.818
      Warped EvalDep with TSR       0.974    0.939     0.831          0.818
B     EvalDep Average               0.969    0.969     0.855          0.947
      Warped EvalDep with TVD       0.981    0.980     0.881          0.987
      Warped EvalDep with TSR       0.990    0.988     0.911          0.989

Table 3.8: Agreement measures between the true signal and various ground truth estimates using the perceptual similarity warping method. All warped results use a complete set of ordinal comparisons from the oracle. NMI = normalized mutual information.

The robustness of TSR and TVD is evaluated by comparing Kendall's tau correlations of their resulting warped signals to the true signals for different T and λ parameter settings. In both cases, extracted constant segments with durations shorter than 200 ms are ignored because they are shorter than average human perception-touch response times [162]. Figure 3.20 plots these results for both tasks and also displays the corresponding number of constant segments produced as the tunable parameters vary for each of the two methods.

Figure 3.20: Differences between the TSR and TVD algorithms in two experimental tasks when used for signal approximation as part of the PSW ground truth framework. The number of constant intervals extracted during the segmentation stage of the pipeline is shown at the bottom as a function of each algorithm's tunable parameter (T or λ). Kendall's tau correlations between the final warped signal and the true target signal are shown at the top.

3.2.6.5 Discussion

Results from Table 3.8 suggest that the ground truth signals produced using the fused annotation warping method employing TSR for signal approximation are similar to, and perhaps better than, those produced using TVD. Figure 3.19 indeed shows the two methods produce very similar approximations to the fused annotation signals in both Task A and Task B. The TSR method does, however, have two other advantages over TVD.
The plots of ground truth correlations over a range of parameter settings for TSR and TVD in Figure 3.20 reveal that the final warped signal quality is more stable for variations of T (TSR) near and above the optimum setting compared to the variance in correlation as λ (TVD) varies near its optimum. The interpretability of T enables humans to provide a good initial estimate, and the results in these two task experiments suggest that this estimate will perform comparably to the optimum as long as it does not severely underestimate it. The sharp peaks near λ = 1 in Task A and λ = 0.001 in Task B imply that finding the optimum λ when using TVD requires an exhaustive search over a fine grid.

Another benefit of the TSR method is the predictability of the number of constant intervals, as shown in Figure 3.20. The number produced using TSR is nearly linear in T and independent of the signal being approximated, whereas the number generated by TVD is highly non-linear and dependent on the structure of the signal. This offers a level of control to users of the PSW method who may want to minimize the number of required supplemental triplet comparisons by reducing the number of constant intervals. The only downside to using TSR seems to be its computational complexity, though the extra time required is worth the cost when it produces a more robust and accurate ground truth.

3.2.7 Trapezoidal Segment Sequencing: A Perceptual Similarity-based Annotation Fusion Method

Section 3.2.5 describes a suite of methods called the perceptual similarity warping framework which can improve the quality of ground truth signals derived from continuous-time and continuous-space human annotations of mental states or other perceived constructs. The first stage of that pipeline requires raw annotations from multiple human annotators to be preliminarily fused together to form a single time series. This section explores a novel method for performing this initial fusion which exploits the observations about human annotation errors first mentioned in Section 3.2.2. The content in this section is duplicated from a published work [123].

3.2.7.1 Overview

Like previous sections, this one focuses on one annotation scheme common to many data sets, and customary for dimensional affect annotation: continuous-scale, value-based, real-time annotation. An example of this style of annotation appears in the SEMAINE [125] database, which contains real-valued annotations of valence and arousal levels collected as annotators viewed video clips. Many methods have been proposed for fusing these kinds of annotations, which can be characterized either as value projection and approximation methods or temporal correction methods. Correlated spaces regression [134], canonical correlation analysis [135], and neural network variants [137] are examples of value projection and approximation methods. These algorithms assume that the true construct value exists in a subspace spanned by features extracted from the stimulus (e.g., facial expressions, eye gaze, heart rate). Other methods such as dynamic time warping [133] and evaluator-dependent time shifting [7] attempt to correct for varying annotation delays due to human reaction time. Fusion is then sometimes achieved via frame-wise averaging after temporal alignment. More recent fusion methods combine both value projection and time correction, such as generalized canonical time warping [138] and neural network variants [139].
Though each of these methods can improve the ground truth representation in some manner, none of them can account for annotation artifacts produced by distractions, perceptual uncertainty, or inconsistency in valuation over time. Section 3.2.5 proposes a perceptual similarity warping procedure capable of correcting many kinds of artifacts, enhancing the valuation consistency, and improving the accuracy of the ground truth. The pipeline achieves this by leveraging a recurring observation that humans more reliably annotate perceived changes to the construct and are less capable of assigning accurate values at any given time [1, 51-53]. Figure 3.21 shows a flow diagram of the different pipeline stages proposed in Section 3.2.5 and also includes the improved signal approximation method from Section 3.2.6. The main shortcoming of this pipeline, which this section will address, is that it requires annotations to be preliminarily fused first, using any existing fusion technique, before it corrects valuation and artifact errors. Operating on fused annotations means that important per-annotator information may be lost.

This section proposes a novel fusion method where each individual annotation is approximated by a trapezoidal signal and then converted to a sequence of values representing the direction of change in the construct per time frame. The trapezoidal signal representation allows each annotation to be segmented into regions of perceived change and no apparent change. This segmentation strategy is arguably the most natural approach, given the amount of evidence that annotators more reliably capture changes or the lack thereof. A fusion compatible with the PSW pipeline is then achieved by merging these sequences of construct changes.

Figure 3.21: Perceptual similarity warping algorithm presented in Section 3.2.5 [1] (pipeline stages: raw annotations, preliminary fusion, signal approximation (TSR), constant interval segmentation, human-based triplet comparisons (e.g., crowdsourcing), triplet ordinal embedding, and the warped fused signal serving as ground truth). This section proposes a novel annotation fusion method (TSS) that merges the preliminary fusion and signal approximation steps in this pipeline.

This section presents a novel fusion method using trapezoidal signal approximations and validates the proposed fusion method using data from the green intensity annotation experiment (Section 3.2.2). Further benefits and implications of the proposed fusion technique are discussed. Open-source code implementing the proposed method and PSW framework is made available at https://github.com/brandon-m-booth/2019_continuous_annotations.

3.2.7.2 Methods

The proposed method for fusing annotations combines the first two steps from the PSW pipeline in Figure 3.21. First, the annotations are aligned to the stimulus using any time-alignment method. Each annotation is then approximated by a trapezoidal signal [163], and a trapezoidal segment sequence (TSS) is generated for each annotation from the regression. Finally, the sequences are merged to produce a single TSS that represents the fused annotations and can be easily consumed by the next stage in the PSW pipeline (constant interval segmentation). Each of these steps is outlined in detail below.

Temporal Alignment In order to remove delays introduced into the annotations by lag in human reaction time, the annotations are time-adjusted to better align them with the stimulus.
This step is not strictly necessary, but since the ground truth that is ultimately produced may be used for human behavior or experience modeling, it is valuable to perform this step. This has the added benefit of aligning the annotations to each other before performing fusion, which may improve the quality of the final ground truth. Some existing methods that serve this function are the mutual information-based alignment technique from the EvalDep algorithm [7] and DTW [133].

Trapezoidal Signal Approximation The TSR method from Section 3.2.6 [163] is applied to each aligned annotation to produce a set of trapezoidal signal approximations. TSR requires an additional input T, which is the desired number of segments in the regression. As observed in Section 3.2.6.4, when T is large enough to capture the basic structure of any particular input signal, further increases to T yield quickly diminishing returns in approximation accuracy, which makes estimation of the T parameter straightforward. For the experiments in this section, T is roughly estimated by a human observer and then set to a value 1.2 times higher to ensure it avoids underestimation. This parameter could be more precisely tuned using an elbow optimization technique, for example, but as Section 3.2.7.3 will show, this rough method works well and requires little time.

Trapezoidal Segment Sequencing Each annotation's trapezoidal signal approximation is converted to a new time series where each frame is assigned a value from {−1, 0, 1} based on whether the approximation shows a negative change (decreasing linear segment), no change (constant segment), or a positive change (increasing linear segment), respectively. This new time series is called the trapezoidal segment sequence (TSS).

Fusion The TSS sequences from the annotators are pooled and majority voting is used per frame to create a single fused TSS. Ties involving constant segments are assigned a zero value; otherwise, the ties are resolved arbitrarily. In the next stage of the PSW pipeline (see Figure 3.21), constant intervals can easily be extracted from the fused TSS by finding all sub-sequences of contiguous zeros.
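A compact sketch of the last two steps is shown below, under the assumption that each trapezoidal approximation is available as a sampled time series on a common time grid; this is an illustrative rendering rather than the released implementation.

```python
import numpy as np

def to_tss(trapezoid, tol=1e-8):
    """Convert a sampled trapezoidal approximation to a TSS in {-1, 0, +1}
    using the sign of the forward difference (the last frame repeats)."""
    diff = np.diff(np.asarray(trapezoid, dtype=float))
    signs = np.sign(np.where(np.abs(diff) < tol, 0.0, diff)).astype(int)
    return np.append(signs, signs[-1])

def fuse_tss(tss_list):
    """Per-frame majority vote across annotators. Ties involving a constant
    segment (0) are assigned 0; other ties are resolved arbitrarily."""
    stacked = np.vstack(tss_list)
    fused = np.zeros(stacked.shape[1], dtype=int)
    for t in range(stacked.shape[1]):
        counts = {v: int(np.sum(stacked[:, t] == v)) for v in (-1, 0, 1)}
        best = max(counts.values())
        winners = [v for v, c in counts.items() if c == best]
        fused[t] = winners[0] if len(winners) == 1 else (0 if 0 in winners else winners[0])
    return fused

def constant_intervals(fused_tss):
    """Contiguous runs of zeros in the fused TSS as (start, end) frame indices."""
    intervals, start = [], None
    for t, v in enumerate(fused_tss):
        if v == 0 and start is None:
            start = t
        elif v != 0 and start is not None:
            intervals.append((start, t - 1))
            start = None
    if start is not None:
        intervals.append((start, len(fused_tss) - 1))
    return intervals
```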
3.2.7.3 Experiment and Results

The proposed method is tested using the green intensity annotation data (Section 3.2.2) because the true green intensity is completely known a priori and thus allows for proper validation. For both annotation tasks, Task A and Task B, in the green intensity data set, the annotations are down-sampled to 1 Hz from their native 30 Hz to expedite the trapezoidal signal approximation step. A baseline ground truth is generated from these annotations using the PSW pipeline and the EvalDep method [7] as a preliminary fusion method. A competing ground truth is generated using the same PSW procedure with the fusion and signal approximation steps replaced by the proposed TSS method. Figure 3.22 plots the different ground truth representations and the true construct signal. Table 3.9 displays several agreement measures comparing each ground truth method to the true signal in both tasks.

Figure 3.22: Plots of the true target signal and ground truth signals produced by the baseline EvalDep fusion method and the proposed TSS when using the PSW ground truth framework.

Task  Method               Pearson  Spearman  Kendall's Tau  NMI
A     EvalDep (baseline)   0.90     0.93      0.80           0.88
      TSS+PSW (proposed)   0.96     0.92      0.80           0.88
      Fusion-first PSW     0.97     0.94      0.83           0.82
B     EvalDep (baseline)   0.96     0.96      0.83           1.0
      TSS+PSW (proposed)   0.96     0.96      0.83           1.0
      Fusion-first PSW     0.99     0.99      0.91           0.99

Table 3.9: Several agreement measures computed for various ground truth techniques showing the agreement between each method's ground truth and the true green intensity. Some values vary from Section 3.2.6 due to down-sampling. PSW = perceptual similarity warping framework. NMI = normalized mutual information.

3.2.7.4 Discussion

Both of the PSW methods in Table 3.9 achieve a similar or significantly higher Pearson correlation with the true green intensity signal than the baseline method. Aside from the mutual information score, the other measures show little to moderate improvements for these methods over the baseline, which is consistent with [1] and [163]. These experiments indicate that perceptual similarity warping with TSS fusion is competitive with other methods.

The true potential benefits of TSS fusion lie in its rank-based encoding of annotations. It treats each annotation as a sequence of denoised construct value changes and thus allows the annotations to be fused or compared with respect to their local value differences rather than the values themselves. This means the TSS sequences are invariant to the potentially non-uniform construct scale differences between different annotators. Furthermore, the TSS representation enables researchers to study agreement between annotators over interpretable partitions in annotation space. For example, Figure 3.23 shows annotations from Task A plotted with vertical bands of color for each 1 Hz segment representing the fused TSS. Red bands correspond to zero TSS values (constant regions) and green ones denote non-zeros (increasing or decreasing regions). Within the time domain of the first two contiguous subsequences of green, all annotations either decrease or increase, so their individual TSS representations would be identical within these intervals. The green interval between 155 s and 160 s shows that two annotators (highlighted with bold blue lines) observed a phantom peak before a true peak near the 159 s frame. Though this peak does not exist in Task A, the fact that two annotators separately observed it may reveal something about perception and may warrant further investigation. The TSS representation proposed in this section enables this type of interpretable local inquiry and analysis.

Figure 3.23: Plot of Task A annotation signals (yellow and bold blue) and the fused TSS sequence (vertical bands of green or red). Red bands denote no change in the TSS (zero value) and green bands represent some change in the fused TSS (±1).

3.2.8 Signed Differential Agreement: An Agreement Measure for Real-time Continuous-scale Annotations

The previous sections in this chapter have proposed a perceptual similarity warping pipeline and supporting methods for improving the quality of ground truth measures generated from human-produced real-time continuous-scale annotations. Each of the methods presented thus far presumes that the set of human-made annotations used as input to the pipeline are all suitable representations of the construct of interest. Sometimes, especially in crowd-sourcing environments, the annotations are substantially different and cannot be used together in the pipeline.
Finding a subset of the annotations which are in general agreement regarding the dynamics of the construct is therefore an important step that must be taken before using the PSW pipeline. This section explores and proposes a new measure of agreement, again exploiting the observations about human annotator behavior discussed in Section 3.2.2. The information in this section has been published previously [124].

3.2.8.1 Overview

Assessing reliability of the annotations, or the soundness and reproducibility of an annotation process across studies [164], is a key step during analysis. For subjective constructs where no notion of the true dynamics exists, reliability is often presumed when the annotations are collected independently and exhibit sufficient agreement [165], for example using Cohen's κ measure. Agreement measures are also employed for identifying and removing outlying or potentially adversarial annotations when researchers are interested in the majority consensus about a construct's dynamics. Quite often, the same agreement measurement tools are used for annotation selection and reliability assessment, and when annotations are selected for analysis based on their consensus, a selection bias is introduced that may reduce the generalizability of the results to other studies. Sometimes similar measures are used to assess the validity of a collection of annotations, which is a measure of the representativeness of the annotations to the true construct dynamics. This section will focus on inter-rater agreement measures as a tool for finding consensus among a subset of human-produced interval-scale continuous annotations for the purposes of annotation curation prior to analysis. One of the aims is to highlight shortcomings of common existing agreement metrics and to present a new agreement measure, based on differential analysis, which offers unique complementary information that is potentially beneficial for curation.

Various works have observed that errors in human annotation are not random [1, 51, 53]. Section 3.2.2.2 includes a brief list of observed common mistakes made by annotators in a controlled interval-scale continuous annotation experiment, including one observation that annotators capture trends more reliably than they accurately assign values. This statement is partially supported by philosophical and psychological perspectives on the fundamental nature of human experience [51, 53]. Yannakakis et al. [142] provide a summary and exposition of research and arguments in favor of the underlying ordinal character of emotional experience and how it may impact human-produced continuous annotations. Some researchers [1, 51, 143] have found it beneficial to treat interval annotations as ordinal ones, but so far the arguments for doing so have been its utility and potential correspondence with the underpinnings of human experience. This section aims to help build a foundation for this assumption through the analysis of interval-scale continuous human annotations in an experiment where the true target signal is known.¹ This section provides:

1. Analysis of the accuracy of interval-scale continuous annotation output and its differentials

2. Manifested evidence that annotator perception is modeled by a noisy monotonic distortion function
3. Analysis of existing agreement measures and proposal of a new metric offering complementary information

¹ Source code for all analysis in this work is available from https://github.com/brandon-m-booth/2020_annotator_agreement

3.2.8.2 Background

A variety of techniques have been used to assess agreement among interval-scale continuous human annotations. Several works [51, 125-127] use Cronbach's α to assess inter-annotator agreement, and the authors of [126] employ it in a crowd-sourced annotation scenario to remove annotations which disagree with the majority cluster. Other databases for emotion and affect recognition [51, 119, 166, 167] use Pearson correlation along with different thresholding schemes to remove low agreement annotations and try to assess reliability. Yannakakis et al. employ Krippendorff's α to measure inter-annotator agreement for both ordinal and interval human emotion labels [143]. Cohen's κ coefficient thresholding is used in [168] to form a consensus among annotators and discard outliers. Nicolaou et al. introduce a metric called signed agreement (SAGR) for assessing the accordance of emotional valence annotations in the SAL-DB database [128]. The RECOLA data set employs mean squared error (MSE), Pearson correlation, Cronbach's α, and Cohen's κ for assessing inter-annotator agreement but does not use them to exclude outliers [6].

Each of these methods is included in the analysis in Section 3.2.8.4, in addition to Kendall's τ and Spearman's correlation to examine rank-based measures. Though correlation-based methods should not be used to assess inter-annotator agreement [165], they continue to see use in practice. This is the first work closely examining annotator behavior in interval (i.e., value-based) continuous-time annotation tasks in order to evaluate existing agreement measures. Development of ordinal interpretations of annotations has been conducted [51, 142, 143], but not for the purpose of establishing a foundation for measuring agreement.

3.2.8.3 Experiment

Before examining the assumptions, benefits, and drawbacks of different agreement measures, the intra-annotator variability and noise present in continuous human annotations is first characterized. To avoid making assumptions about the relative correctness of annotations of ambiguous stimuli, the green intensity annotation experiment from Section 3.2.2 is utilized. An analysis of the annotator accuracy and consistency using the down-sampled 1 Hz annotations is performed, which will demonstrate that annotators asked to perform interval (value-based) annotations incidentally provide reliable ordinal information.

Qualitative Analysis To facilitate fair comparison between the samples comprising each annotation, time alignment is attempted using two methods: an evaluator-dependent constant temporal shift (EvalDep) [7] and dynamic time warping (DTW) using a symmetric Sakoe-Chiba step pattern to limit the warping distortion [133, 169]. Both methods are effective at correcting the annotations for human lag time, but DTW's allowance for repeated values when locally stretching a signal leads to a distortion in its structure (i.e., derivatives). The EvalDep method is therefore employed in this case for temporal alignment, individually shifting each annotation an average of 1.6 seconds in both tasks. In other experimental scenarios with unknown true signals, this temporal adjustment could be based on mutual alignment of annotations to each other or to features extracted from the stimulus (see [138] for one example). Figure 3.24 shows scatter plots of the true intensity of the green color per frame against the annotated values for each annotator.
An idealized annotator with no delay, no perceptual bias, and no incidentally or motorically added noise would produce a scatter plot matching the straight dashed line. It is apparent from the scatter patterns in these figures that each annotator deviates from this ideal in a unique manner. The vertical spread of scattered points at any true green intensity value represents the precision with which each target value is annotated. Most annotations show an increased spread in the mid-to-upper range ([0.5-1]), indicating that this range of values is annotated more inconsistently than, for example, values near zero (black color).

The general shape of the scattered points for each annotator forms either an arc or S-curve, so a cubic polynomial is chosen to model the underlying structure of the distribution. These regressions vary in exact shape and location for each annotator and seem to characterize their individual perceptual biases. Interestingly, though cubic regressions are capable of representing non-monotonic S-curves, all of the fitted cubic regressions lie in their monotonic parametric regime.

Figure 3.24: Scatter plots of the true green intensity values against the annotated values for each annotator (ann1-ann10) in both tasks: (a) Task A, (b) Task B. The dashed line shows where the points would lie for a perfect annotator. The orange curves are cubic polynomial regressions of the scattered points.

Figure 3.25 shows scatter plots of the forward difference in the true intensity of the green color per frame against the forward difference of the annotated values per annotator. As before, an idealized annotator would have a scatter plot precisely following the dashed line. To help examine whether annotators locally preserve ordinal relationships, as mentioned in [1, 119, 142, 143, 170], the points are partitioned about zero to compare frame-wise agreement in trend direction. The plots have been partitioned into quadrants and colored according to the percentage of total points within each quadrant (numbered counter-clockwise starting from the upper right section). Points in quadrants I/III correspond to samples where the true signal derivative increases/decreases and the annotation slope also increases/decreases. Quadrants II and IV contain samples where an annotation's trend differs from the true trend. Together, these quadrants can be viewed as a confusion matrix, and when the points are binned in this manner, it is clear that the trend error rate is relatively low. There does not appear to be a similar partitioning scheme of the scatter plots showing annotation values in Figure 3.24 which can bin the data across all annotators in a meaningful way.

Finally, Figure 3.26 plots the annotations over time, where blue points and line segments correspond to samples whose forward difference matches the sign of the true target signal's forward difference and red points and segments represent samples which do not match.
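The per-annotator analyses behind Figures 3.24 and 3.25 can be reproduced with a few lines of NumPy. The sketch below is an illustrative re-implementation (not the released analysis code); true_vals and annotated_vals are assumed to be equal-length, time-aligned 1 Hz arrays.

```python
import numpy as np

def cubic_perceptual_fit(true_vals, annotated_vals):
    """Cubic polynomial regression of annotated values onto true values
    (the orange curves in Figure 3.24); returns polynomial coefficients."""
    return np.polyfit(true_vals, annotated_vals, deg=3)

def trend_confusion(true_vals, annotated_vals):
    """Quadrant counts of forward-difference signs (Figure 3.25): how often
    the annotated trend direction matches the true trend direction."""
    dt = np.sign(np.diff(np.asarray(true_vals, dtype=float)))
    da = np.sign(np.diff(np.asarray(annotated_vals, dtype=float)))
    mask = (dt != 0) & (da != 0)            # restrict to frames with a clear trend
    agree = int(np.sum(dt[mask] == da[mask]))
    disagree = int(np.sum(dt[mask] != da[mask]))
    total = int(np.sum(mask))
    return {"agree": agree, "disagree": disagree,
            "trend_error_rate": disagree / total if total else 0.0}
```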
Most of the mismatches occur near transitions where the derivative of the signal changes (i.e., near the peaks, valleys, and plateaus). Unless annotator reaction times are perfectly consistent, one would naturally expect a few samples to disagree near these transitions. Section 3.2.8.6 later claims that an agreement measure should not heavily penalize these types of annotation errors.

In summary, through the value-based and differential frame-wise comparison of annotations in this experiment, two key aspects of continuous human interval annotation can be observed: (1) annotators approximately preserve rank ordering when annotating values (as shown by the monotonicity of the cubic regressions), and (2) annotators reliably capture the direction (increasing or decreasing) of trends, but are less accurate in both annotation valuation and trend magnitude.

Figure 3.25: Scatter plots of the forward difference of the true green intensity values against the forward difference of annotated values for each annotator in both tasks: (a) Task A, (b) Task B. The dashed line shows where the points would lie for an idealized annotator. Four quadrants are color coded according to the percentage of each plot's points falling within them.

Figure 3.26: Plots of the aligned annotations over time for both tasks: (a) Task A, (b) Task B. Blue points and line segments correspond to samples where the forward difference matches the sign of the true target signal's forward difference. Red points and line segments correspond to mismatches.

Quantitative Analysis Next, the observations about continuous human annotation quality are formalized. Let T(t) represent a time series of values corresponding to the true construct target at time t. Let P_i(z) be a unique noisy perceptual distortion function for annotator i for an observed stimulus value z (e.g., a shade of green). The two key observations about human annotator behavior, summarized above, can be formally expressed as follows:

1. Annotators preserve rank ordering when annotating values:
\[
\frac{dP_i}{dz} \gtrsim 0 \tag{3.3}
\]

2. Annotators capture trend directions reliably:
\[
\operatorname{sgn}\!\left[\frac{dT}{dt}\right] \approx \operatorname{sgn}\!\left[\frac{d}{dt} P_i[T(t)]\right] \tag{3.4}
\]

Using the chain rule on Equation 3.4 and the notation T′(t) = dT/dt, it can be rewritten as:
\[
\operatorname{sgn}[T'(t)] \approx \operatorname{sgn}\!\left[P_i'[T(t)]\, T'(t)\right] = \operatorname{sgn}\!\left[P_i'[T(t)]\right] \operatorname{sgn}[T'(t)]
\]
Adding back Equation 3.3, it can be deduced that these two terms are equal since sgn(P_i′[T(t)]) = 1, which demonstrates that these two observations are mathematically consistent and complementary. Each observation also suggests an avenue for measuring agreement.
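A small simulation makes this consistency argument concrete: two different noiseless, strictly monotonic perceptual distortions of the same underlying signal produce forward differences with matching signs at every frame. The distortion functions below are arbitrary assumptions chosen only for illustration.

```python
import numpy as np

# A shared underlying construct signal T(t).
t = np.linspace(0.0, 10.0, 200)
true_signal = 0.5 + 0.4 * np.sin(t)

# Two different strictly increasing "perceptual distortion" functions P_i, P_j.
p_i = lambda z: z ** 1.5            # compresses low values
p_j = lambda z: np.tanh(3.0 * z)    # saturates high values

ann_i, ann_j = p_i(true_signal), p_j(true_signal)

# The signs of the forward differences agree frame-by-frame.
match = np.sign(np.diff(ann_i)) == np.sign(np.diff(ann_j))
print("fraction of frames with matching trend sign:", match.mean())  # 1.0 here
```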
If one could somehow measure the degree to which an annotator preserves monotonicity (Equation 3.3), it could be utilized as a measure of agreement. Unfortunately, T(t) is unknown for most problems of interest. Instead, the approximation in Equation 3.4 can be exploited for two independent annotators i and j to state:
\[
\operatorname{sgn}\!\left[P_i'[T(t)]\, T'(t)\right] \approx \operatorname{sgn}\!\left[T'(t)\right] \approx \operatorname{sgn}\!\left[P_j'[T(t)]\, T'(t)\right]
\;\Longrightarrow\;
\operatorname{sgn}\!\left[P_i'[T(t)]\right] \approx \operatorname{sgn}\!\left[P_j'[T(t)]\right]
\]
With one further assumption that T(t) is a continuous function, P_i′[T(t)] can be approximated using a forward difference:
\[
\operatorname{sgn}\!\left[P_i[T(t + \epsilon)] - P_i[T(t)]\right] \approx \operatorname{sgn}\!\left[P_j[T(t + \epsilon)] - P_j[T(t)]\right] \tag{3.5}
\]
for some small ε > 0. In practice, ε will be the distance between two adjacent annotation samples in time, which may not be very small, but the approximation is valid provided that the sampling rate is high enough for the construct dynamics. This key relationship is exploited in the next section to define a new agreement measure.

3.2.8.4 Agreement Measures

Let us examine the metrics mentioned in Section 3.2.8.2 that have been used for measuring inter-annotator agreement and then evaluate the benefits and drawbacks of each one based on the observations about human annotation and errors in the shades of green experiment.

Pearson correlation is not agnostic to the effects of a monotonic transformation of the sample data. As a simple example, let us take the true target signal from Task A and apply a distortion mimicking one of the "best" annotators. Annotator 7 has the lowest differential error rate on that task, with an F1 score of 0.96 and a Matthews correlation coefficient (MCC) of 0.92. The cubic regression from Figure 3.24a for ann7 is applied to the true signal and then its Pearson correlation with the unwarped true signal is measured, yielding a value of 0.90. Thus, assuming the cubic regression models this annotator's perceptual bias, the best possible Pearson correlation score that can be achieved is 0.9, even though ann7 arguably captures the shape and structure of the true annotation better than this (using the F1 score and MCC for rough comparison). A second drawback of this metric is that it requires some variance to be present in the annotation. It cannot be used to measure agreement over subsets of the annotation in time where no variability is present, or in situations where a legitimate annotation for a stimulus would contain no fluctuations.

Concordance correlation coefficient (CCC) is defined as:
\[
\mathrm{CCC} = \frac{2 \rho \sigma_x \sigma_y}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2}
\]
for annotations x and y. It is based on the Pearson correlation measure and further punishes annotations which have sufficiently different mean values or standard deviations. Again, borrowing the perceptual distortion from ann7 and comparing the true signal to itself after a simulated distortion, a CCC value of 0.43 is obtained. Different perceptual distortions would cause pairwise agreement values obtained using CCC to be difficult to compare.

Both Spearman correlation and Kendall's τ coefficients measure correlation from the rank order of a pair of annotations. As long as the perceptual distortion function (approximated by a cubic regression in Figure 3.24) is monotonic, these metrics will remain unchanged. However, one major drawback is their sensitivity to prolonged errors in ranking. Figure 3.27 shows two example annotations with very similar structure but a difference in the constant annotated value past 13 s. Note the difference in value between the second constant segments in each annotation.
This minor difference substantially affects the rankings, and the Spearman and Kendall's τ correlations of these two signals are -0.119 and -0.194, respectively.

Figure 3.27: Example annotations ("Example Annotation" and "Similar Annotation") showing similar structure but negative Spearman and Kendall's τ correlations.

Mean squared error (MSE) measures metric deviation between two annotations and thus is naturally impacted by any monotonic perceptual distortion. Any differences in annotator valuations are further magnified by the square of the error, meaning this metric is highly unforgiving of value differences in spite of structural similarity.

Signed agreement (SAGR) is a unique measure of similarity initially proposed by Nicolaou et al. [128] for assessing the correspondence of dimensional emotional valence annotations. The measure is intended for signals on a [-1,1] scale where the origin is a conceptual reference point for the annotations (e.g., a neutral expression with no positive or negative valence). This measure discards the magnitude of the value assigned to each frame and captures how often the annotators agree on the positive or negative nature of an emotional expression in each frame. For constructs where there exists a natural mid-point, this type of agreement measure over the annotated values is interesting and warrants further investigation. However, for many annotation tasks, such as the intensity of the shade of green or emotional arousal, there is no middle reference point and this metric is not interpretable or directly applicable.

Cohen's κ is computed by
\[
\kappa = 1 - \frac{1 - p_0}{1 - p_e}
\]
where p_0 is the observed agreement among all annotators and p_e is a measure of the random chance of agreement. In order to estimate the probabilities of agreement between continuous interval annotations, probabilistic models of annotators must be assumed or estimated empirically (e.g., via binning). Either approach requires a supposition about the annotation process which may be difficult to validate in small-scale experiments. The correction this metric utilizes, however, to measure the relative agreement between two annotations above random chance is useful and will be revisited when introducing a new agreement technique below.

Cronbach's α, the intra-class correlation coefficient (ICC), and Krippendorff's α are group-level metrics which are not intended to be applied to subsets of the annotations (such as pairs). These metrics are useful for measuring the agreement amongst entire collections of annotations, but are less useful for identifying adversarial ones (e.g., from crowd sourcing) during curation.

3.2.8.5 Method

Based on the observations about annotator behavior in Section 3.2.8.3, it seems that an ideal measure of continuous human annotation agreement for data set reliability assessment and curation purposes should be invariant to unique perceptual biases, and it should focus on capturing consensus in signal structure and shape rather than valuation. In order to achieve this, an ordinal agreement measure called Signed Differential Agreement (SDA) is proposed for measuring agreement over construct dynamics in ordinal-scale continuous annotations, and this section claims it also functions for interval-scale annotations because of the high degree of ordinal consensus shown in Figure 3.25 and captured by Equation 3.5.
For notational brevity, let x, y ∈ R^N represent the two annotation time series P_i[T(t)] and P_j[T(t)], respectively, where t ∈ {1, ..., N}. Equation 3.5 becomes:
\[
\operatorname{sgn}(x_t - x_{t-1}) \approx \operatorname{sgn}(y_t - y_{t-1})
\]
At this point, an agreement measure could be defined that is agnostic to human perceptual effects by using any loss function with this equation. A sum-of-delta approach is chosen and SDA is defined as:
\[
\mathrm{SDA} = \frac{1}{N-1} \sum_{t=2}^{N} \delta\!\left[\operatorname{sgn}(x_t - x_{t-1}),\, \operatorname{sgn}(y_t - y_{t-1})\right],
\qquad
\delta(p, q) =
\begin{cases}
1 & p = q \\
-1 & p \ne q
\end{cases}
\tag{3.6}
\]
This measure gives an equal weight to every annotation sample and thereby avoids heavily penalizing short periods of disagreement. This function's range is [-1,1], like many other agreement measures, and it is similarly interpretable. A value of 1 indicates complete agreement, -1 indicates complete disagreement, and 0 means the two annotations are mutually uncorrelated.

Many human annotation data sets contain unaligned annotations due to annotator lag or varying temporal resolution due to asynchronous input. If resampling is infeasible, [143] suggests that aggregation over small (three to five-second) windows may be appropriate. In this scenario, majority voting could be used to determine the prevailing sign of the differentials within each time window before applying the delta function when computing SDA.

Additionally, with this agreement defined as a similarity measure over annotation samples, it is possible to add a correction for chance agreement for cross-corpora comparison [171], similar to Cohen's kappa measure:
\[
\mathrm{SDA}_{\kappa} = 1 - \frac{1 - p_o}{1 - p_e}
\]
where the observed probability of agreement p_o is measured using SDA with its δ replaced by a Kronecker delta function (δ_K) and where the probability of chance agreement is approximated empirically:
\[
p_e = \frac{1}{(N-1)^2} \sum_{i \in \{-1, 0, 1\}} \left[ \sum_{t=2}^{N} \delta_K\!\left[\operatorname{sgn}(x_t - x_{t-1}),\, i\right] \sum_{t'=2}^{N} \delta_K\!\left[\operatorname{sgn}(y_{t'} - y_{t'-1}),\, i\right] \right]
\]
This is one possible formulation of a chance-corrected SDA measure, with other formulations involving alternative estimation algorithms for p_o and p_e. As indicated in the analysis of existing agreement measures above, this corrected SDA formula is undefined for signals with no variability. Therefore, this style of chance correction may be undesirable in domains where annotations have a fair likelihood of being completely constant.
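A direct implementation of Equation 3.6 and the chance-corrected variant follows. This is a minimal sketch consistent with the definitions above; the released code referenced earlier may differ in details such as NaN handling and windowed aggregation.

```python
import numpy as np

def sda(x, y):
    """Signed Differential Agreement (Eq. 3.6) between two aligned annotations."""
    sx = np.sign(np.diff(np.asarray(x, dtype=float)))
    sy = np.sign(np.diff(np.asarray(y, dtype=float)))
    return float(np.mean(np.where(sx == sy, 1.0, -1.0)))

def sda_kappa(x, y):
    """Chance-corrected SDA: 1 - (1 - p_o) / (1 - p_e), with p_e estimated
    empirically from the marginal sign frequencies of each annotation.
    Undefined (division by zero) when both signals are completely constant."""
    sx = np.sign(np.diff(np.asarray(x, dtype=float)))
    sy = np.sign(np.diff(np.asarray(y, dtype=float)))
    p_o = float(np.mean(sx == sy))
    p_e = sum(float(np.mean(sx == s)) * float(np.mean(sy == s)) for s in (-1.0, 0.0, 1.0))
    return 1.0 - (1.0 - p_o) / (1.0 - p_e)
```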
3.2.8.6 Agreement Comparison

To demonstrate the advantages of using the proposed SDA agreement metric, a few example cases are considered. First, SDA is employed in a simulated annotation setup to demonstrate that it captures structural agreement while other existing measures do not. Second, SDA is applied to real human annotations to show that the metric offers complementary information in practice and that it can be used to help identify potentially poor or adversarial annotations.

Simulated Task Comparison Figure 3.28 shows two hypothetical annotations of the same stimulus.

Figure 3.28: Hypothetical annotations of the same stimulus.

In both cases, these annotations capture a spike event near 40 s followed by a decrease in value. The primary difference between the two is the rate of decrease near the end. In a real human annotation scenario, this disparity is realistic and could be due to perceptual differences or, more simply, an input error during the annotation recording process. Either way, this difference, which accounts for less than 10% of the total annotation duration, has a profound impact on existing agreement measures. Table 3.10 shows pairwise and global agreement measures for this example. SAGR is excluded because there is no presumed conceptual reference at the midpoint, and Cohen's κ is computed using 10 equally-spaced bins.

Agreement Measure    Simulated Task Value
Pearson              -0.10
Spearman             -0.36
Kendall's τ          -0.38
CCC                  -0.08
Cohen's κ            0.61
SDA                  1.0
Cronbach's α         -0.18
ICC(k)               -0.15
Krippendorff's α     -0.50

Table 3.10: Pairwise and group-level agreement measures applied to the simulated annotation example.

These annotations are sufficiently similar, and if they were part of a larger collection of annotations containing outliers, they should be deemed equally valid (or invalid) during annotation curation. Though a contrived example, these annotations are representative of the types of human errors observed in Section 3.2.2.2, and the results indicate that SDA is the only measure that outputs a number close to 1.0, virtually guaranteeing that the two annotations would be treated equally (e.g., by a clustering algorithm). The large disparity in agreement measures also confirms the prior analysis concluding that other metrics do not easily or directly capture the structural similarity in annotations.

Comparison using Human Annotations Next, SDA is compared to existing measures in real human annotation experiments, starting with the green intensity annotation study. Figure 3.29 shows the Pearson correlation and SDA pairwise agreements for the temporally aligned Task A annotations. Notably, SDA yields a substantially smaller value for unique pairs of annotations compared to its perfect self-similarity on the main diagonal. Since this metric is sensitive to temporal alignment and annotator lag, even with lag correction, it captures short durations of disagreement when the trend changes (refer to Figure 3.26). What is most important about the SDA measure in this case is the uniformity of the values for unique annotators rather than the exact agreement value [164, 172], which indicates a similar level of structural agreement for all annotations.

Figure 3.29: Pairwise agreement measures (Pearson and SDA) for Task A in the green intensity annotation experiment; rows permuted via agglomerative clustering.

To show how SDA performs on other real data, we examine annotations from the RECOLA data set [6]. RECOLA contains interval-scale continuous human annotations of spontaneous dyadic affective interactions between French speakers, and it has been highly utilized by researchers, in particular for semi-annual emotion prediction challenges (AVEC) [151]. Each interaction is rated continuously by a fixed set of six annotators for emotional arousal and valence on a bounded [-1, 1] scale. The authors employ Pearson correlation, MSE, and the percentage of positive samples (similar to SAGR) to assess annotator agreement. For reasons mentioned in Section 3.2.8.4, MSE is excluded and the analysis focuses on the benefits of SDA in relation to the other types of agreement. Because SDA is particularly sensitive to misalignment, DTW is applied to temporally warp the annotations using a symmetric Sakoe-Chiba step pattern to constrain the maximum temporal distortion to three seconds. Figure 3.30a shows pairwise agreement matrices for Pearson, SAGR, and SDA across all six annotators for their annotations of emotional arousal in one dyadic interaction from the data set. The corresponding annotations are plotted in Figure 3.30b with the two traces from annotators FF1 and FF2 drawn in solid bold.
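The pairwise agreement matrices in Figures 3.29 and 3.30a can be assembled and reordered as sketched below. This is an illustrative reconstruction that assumes SciPy's hierarchical clustering for the row permutation; measure can be the sda function defined above or any other pairwise metric.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import squareform

def pairwise_agreement(annotations, measure):
    """Square, symmetric matrix of pairwise agreement values."""
    n = len(annotations)
    mat = np.ones((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            mat[i, j] = mat[j, i] = measure(annotations[i], annotations[j])
    return mat

def cluster_order(agreement_matrix):
    """Row/column permutation from agglomerative clustering of disagreement."""
    dist = 1.0 - agreement_matrix          # convert agreement to a dissimilarity
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)
    return leaves_list(linkage(condensed, method="average"))
```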
The magnitude and uniformity of both the Pearson and SAGR agreement matrices suggest that the annotations co-vary similarly. The SDA measure, however, highlights more subtle differences between the annotations and indicates that annotations FF1 and FF2 are structurally uncorrelated. This is apparent from the two annotations shown in Figure 3.30b, which agree on the trend in arousal sometimes, but disagree in structure about half of the time. This example demonstrates that SDA offers unique information when measuring human annotation agreement and can be helpful when curating data sets.

Figure 3.30: Example annotations of emotional arousal from the RECOLA data set: (a) pairwise agreement measures (Pearson, SAGR, and SDA) in matrix form, rows permuted via agglomerative clustering; (b) DTW-shifted annotations. Annotators FF1 and FF2 (bold solid lines) structurally disagree about half the time.

3.2.8.7 Discussion

The examples and analysis in Section 3.2.8.6 show that the proposed Signed Differential Agreement and chance-corrected SDA measures yield interpretable and unique information about human annotation agreement. When combined with other agreement metrics, this additional information can help mitigate the risk of improper inclusion of adversarial annotations or exclusion of sensible ones. Thus SDA and SDA_κ, or other measures derived from the approximate equality of signed derivatives across annotators (Equation 3.5), are useful tools offering supplemental information to facilitate annotation selection. Since SDA effectively ignores some of the structured biases present in independent annotation (e.g., overshooting values as observed in Section 3.2.2.2), it may also be an effective reliability measure for reproducibility and machine learning purposes [172].

The results pertaining to the green intensity annotation experiment support the notion that annotators incidentally output reliable ordinal annotations when asked to perform interval annotations while also providing approximately rank-consistent valuations. This suggests that there is more valuable information to be obtained from interval annotations than simply an ordinal interpretation. In part, this idea conflicts with views expressed by Yannakakis et al. [142], suggesting that both interval and ordinal reliability suffer during interval annotation tasks due to cognitive loading. Admittedly, the green intensity annotation experiment is cognitively less demanding compared to other tasks probing expressed human states and traits, for example, annotation of emotional expressions, which may explain why both interpretations are meaningful.

3.2.9 Validation of Framework and Enhancements for Ground Truth Estimation via Perceptual Similarity

This chapter has presented a variety of techniques to be used at different points in time when soliciting, collecting, inspecting, correcting, and analyzing human-produced annotations. Figure 3.31 depicts an expanded version of the basic annotation processing pipeline (from Figure 3.5) with additional details revealed for certain stages according to the contributions in this chapter. Beige-colored steps represent different methods from Section 3.2.

Figure 3.31: Expanded general pipeline for generating ground truth representations from real-time continuous-scale human annotations of a stimulus.
Annotation fusion shows two possible paths based on contributions in this manuscript, though other methods exist. Methods proposed in this dissertation are colored beige and demarcated as follows: ∗ Signed Differential Agreement (§3.2.8), † Frame-wise Ordinal Triplet Embedding (§3.2.4), ‡ Perceptual Similarity Warping Framework (§3.2.5), †† Trapezoidal Segmented Regression (§3.2.6), ‡‡ Trapezoidal Segment Sequencing (§3.2.7).

Each of these contributions has demonstrated some improvement over current state-of-the-art methods, but a few questions remain: (1) do the methods pertaining to the perceptual similarity warping framework produce an overall benefit when used together, and (2) do the methods generalize to other annotation scenarios less controlled than the green intensity experiment? This section introduces a validation study aimed at addressing these concerns.

3.2.9.1 Overview

All of the analysis and methodologies in Chapter 3 thus far aim to improve the consistency of ground truth representations, especially for some of the more difficult learning tasks like subjective mental constructs. Part of the difficulty in learning from these kinds of annotation tasks is accounting for the often large variety of opinions, which can be exaggerated in crowd-sourcing contexts in which adversarial annotations may also be present. Another confounding factor is the difficulty in validating annotations because no prior notion of a true construct signal exists. This section is designed to test this chapter's prior contributions in a more realistic setting by conducting an annotation experiment in a crowd-sourced environment. In order to be able to assess both the external validity and the ability of the various proposed methods to improve the ground truth representations, a subjective construct is selected for which a rating authority also exists. In this section, real-time continuous-scale interval annotations are solicited from the crowd about its perception of portrayed violence in recent Hollywood movies, and then violence ratings from a film rating authority are utilized to assess the accuracy of the ground truth resulting from the PSW framework and enhancements presented in this chapter.

3.2.9.2 Experiment

Five recent Hollywood films are selected based on two criteria: violence rating and total running time. Ratings of violence are obtained from the Common Sense Media (CSM) movie rating authority. CSM offers violence ratings as integers between 1-5, so each movie is selected to have a unique violence level. Within each of the five violence categories, the shortest full-length feature film is chosen from all top-grossing Hollywood films within the last two years. In ascending order of violence ratings, these movies are:

The Hustle (2019)
Good Boys (2019)
The Peanut Butter Falcon (2019)
The Possession of Hannah Grace (2018)
Rambo: Last Blood (2019)

In order to promote a consistent level of attentiveness during the annotation process, each movie is partitioned into approximately 10-minute segments. The exact time of each cut is determined manually and chosen such that it aligns with the nearest scene transition to the end of the 10-minute duration. This process ensures that scenes are entirely contained within one clip and helps provide annotators with context for understanding the events in a scene. Any advertisements, previews, title screens, or segments not related to the movie occurring at the beginning and the end of the film are trimmed before segmentation.
Table 3.11 lists the cut times used to partition each film into movie clips.

                                            Cut Times (s)
Movie Name                        Start  C1   C2    C3    C4    C5    C6    C7    C8    C9    C10   C11
The Hustle                        15     537  1247  1844  2486  3106  3630  4177  4872  5332  5521  5612
Good Boys                         54     598  1281  1886  2588  3121  3729  4405  5121
The Peanut Butter Falcon          53     620  1255  1964  2554  3264  3842  4548  5116  5469
The Possession of Hannah Grace    34     596  1192  1795  2414  3014  3713  4261  4901
Rambo: Last Blood                 51     659  1265  1884  2468  3008  3608  4206  4773  4959

Table 3.11: Movie clip cut times for the selected films.

Continuous Annotation Procedure Annotations of violence are obtained from Mechanical Turk, a crowd-sourcing platform. Annotators electing to participate in our Human Input Task (HIT) are presented with a description of the task and a warning about the amount of violence that might be present, as shown in Figure 3.32. Mechanical Turk participants electing to proceed are directed to a second landing page, as shown in Figure 3.33. The PAGAN [170] annotation framework is used to collect the continuous-scale annotations in real-time as participants view a randomly selected movie clip. A bounded interval annotation format is implemented within PAGAN, which prevents participants from moving the annotation label trace outside of the minimum and maximum boundary lines. This results in decimal-valued interval annotations recorded as time series with values between [-100,100]. The numbers on the scale are hidden from annotators within the PAGAN interface. Each movie clip is annotated by 10 separate Mechanical Turk participants, and workers are compensated 1 USD.

Figure 3.32: Landing page for Mechanical Turk workers interested in our continuous annotation experiment.

Figure 3.33: Landing page for Mechanical Turk workers electing to participate in the movie violence annotation experiment.

Alignment In order to compensate for variance in human perception time and input lag, the ten annotations for each clip are first temporally aligned. The SDA metric (see Section 3.2.8) is used to measure agreement between each pair of annotations, where frames containing Not a Number (NaN) values are ignored. Each annotation's pairwise agreement with the other annotations is averaged, and the "most agreeable" annotation with the highest mean SDA score is selected as a reference for time alignment. Each of the nine other annotations is aligned to the chosen reference using dynamic time warping (DTW) [133] with a symmetric Sakoe-Chiba step pattern to constrain the maximum temporal distortion to five seconds. Though this application of DTW distorts the structure of each annotation, as mentioned in Section 3.2.8.3, it is necessary to achieve reasonable alignment of the crowd-sourced annotations compared to the EvalDep method previously employed in a more controlled annotation setting.
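A sketch of this reference-selection and alignment step is shown below. The DTW here is a minimal textbook implementation with a Sakoe-Chiba band (a band width in frames standing in for the five-second constraint); the actual pipeline uses an existing DTW implementation with a symmetric step pattern, so this is only illustrative. The SDA measure from Section 3.2.8 is repeated for self-containment.

```python
import numpy as np

def sda(x, y):
    """SDA from Section 3.2.8 (repeated here for self-containment)."""
    sx, sy = np.sign(np.diff(x)), np.sign(np.diff(y))
    return float(np.mean(np.where(sx == sy, 1.0, -1.0)))

def pick_reference(annotations):
    """Index of the annotation with the highest mean pairwise SDA."""
    n = len(annotations)
    means = [np.mean([sda(annotations[i], annotations[j])
                      for j in range(n) if j != i]) for i in range(n)]
    return int(np.argmax(means))

def banded_dtw_path(ref, query, band):
    """Classic DTW with a Sakoe-Chiba band of +/- `band` frames; returns the path."""
    n, m = len(ref), len(query)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - band), min(m, i + band) + 1):
            d = (ref[i - 1] - query[j - 1]) ** 2
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    path, i, j = [], n, m
    while i > 0 and j > 0:  # backtrack to recover the warping path
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def align_to_reference(ref, query, band):
    """Resample `query` onto the reference timeline via the DTW path."""
    ref, query = np.asarray(ref, dtype=float), np.asarray(query, dtype=float)
    aligned = np.full(len(ref), np.nan)
    for i, j in banded_dtw_path(ref, query, band):
        aligned[i] = query[j]
    return aligned
```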
Figure 3.34: Raw crowd-sourced annotations of portrayed violence over time in cut 4 from The Hustle: (a) all raw crowd-sourced annotations, (b) inlier annotations resulting from spectral clustering.

Outlier Removal Many artifacts and potentially adversarial behaviors are observed in each set of 10 annotations per movie clip. Figure 3.34a plots the raw annotations for cut #4 of The Hustle and demonstrates a typical result. Notice the large amount of variability across annotators regarding both the magnitude of portrayed violence and also the temporal location of large changes in violence. In this type of crowd-sourced experiment, it is difficult to distinguish adversarial annotations from the genuine ones which happen to disagree with other annotations in the collection. Thus, to ensure that the adversarial annotations are at least removed, we must accept that some genuine annotations may be removed as well. In future experiments of this sort, more annotations could be collected with the aim of identifying adversarial annotations by their insufficient agreement with all other annotations within the collection.

Outliers are identified by performing a binary clustering of the annotations for each movie clip. The SDA metric is used to compare every pair of annotations per clip. In order to give each annotation the best chance at agreeing with other annotations, DTW (with a 5-second Sakoe-Chiba step pattern) is used to align every pair of annotations before computing agreement using SDA. The SDA values are assembled into a similarity matrix and linearly normalized so all values fall between [0, 1]. Spectral clustering is performed using this matrix as a precomputed affinity matrix to produce two clusters. The cluster with the highest average SDA value is selected as the inlier group, while the other annotations are ignored for the remainder of the experimental analysis. Figure 3.34b plots the three inlier annotations selected using this approach for cut #4 of The Hustle using bold lines.

Trapezoidal Regression After the annotations are aligned, trapezoidal segmented regression (TSR) is used to approximate each annotation to facilitate the separation of trends and constant regions. The parameter for the number of segments, T_i, for TSR is approximated via human inspection for each annotation i, and a separate series of regressions is performed for each annotation for all integers T ∈ {⌊0.8 T_i⌋, ..., ⌈1.2 T_i⌉} ⊂ ℕ. Optimizing over this parameter T requires balancing two goals: (1) minimizing the regression error and (2) minimizing T (the complexity of the regression). A heuristic is introduced to help balance these two goals and enable the selection of an appropriate T_i for each annotation and movie clip. The heuristic is defined over two measures, SDA and Kendall's τ, which are used to compute an agreement score between each trapezoidal segmented regression and its original individual annotation. First, SDA is examined as a function of T to find the value t in the domain which corresponds to the largest jump in SDA value from its previous value at t − 1. Then the τ values are examined as a function of T over the domain of values larger than t to find the value t′ which produces the largest jump in τ from its value at t′ − 1. An example is shown in Figure 3.35. The SDA and Kendall's τ values tend to plateau after a noticeable increase, so this heuristic is crafted to locate the smallest value of T which is larger than the values in the domain corresponding to the SDA and τ jumps. Many other optimization strategies using different heuristics are possible, but as stated in Section 3.2.6.4, once T is large enough to capture the structure of an annotation, its agreement with the original annotation becomes substantially less sensitive to further increases in T.
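A minimal rendering of this T-selection heuristic is given below, assuming the TSR fits and the two agreement curves (SDA and Kendall's τ against the original annotation) have already been computed for the candidate grid of T values; tie-breaking and edge-case details in the actual implementation may differ.

```python
import numpy as np

def select_num_segments(T_grid, sda_scores, tau_scores):
    """Pick T via the heuristic: largest jump in SDA over the grid, then the
    largest subsequent jump in Kendall's tau, returning the smallest candidate
    T at or beyond both jump locations."""
    T_grid = np.asarray(T_grid)
    sda_scores = np.asarray(sda_scores, dtype=float)
    tau_scores = np.asarray(tau_scores, dtype=float)

    # T value at which SDA jumps the most relative to the previous grid point.
    t_sda = T_grid[1:][int(np.argmax(np.diff(sda_scores)))]

    # Among larger T values, find the largest jump in Kendall's tau.
    later = T_grid[1:] > t_sda
    tau_jumps = np.diff(tau_scores)[later]
    t_tau = T_grid[1:][later][int(np.argmax(tau_jumps))] if tau_jumps.size else t_sda

    # Smallest grid value at or beyond both jumps.
    return int(T_grid[T_grid >= max(t_sda, t_tau)][0])
```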
Figure 3.35: Example output from the TSR optimization procedure for a single annotation. The left figure shows the two considered agreement measures (SDA and Kendall's tau) as a function of the number of TSR segments; the dotted vertical line denotes the number of segments selected by the heuristic-based optimization. The right figure plots the annotation over time and the selected TSR based on the optimization heuristic.

Constant Interval Extraction. The trapezoidal segment sequence (TSS) is computed for each of the ten annotations of each movie clip. Fusion of the TSS sequences is performed according to Section 3.2.7, resulting in a single TSS sequence. Constant intervals are extracted by finding the contiguous sequences of zero values. For each constant interval, a movie excerpt is extracted from the corresponding movie clip, with the total number of excerpts extracted per movie shown in Table 3.12.

Pairwise Violence Comparison. All excerpts from all movie clips from each of the five movies are collected and included in a second crowd-sourcing experiment. Again, Mechanical Turk is employed and participants are presented with the following instructions:

    Two video clips of different lengths and from various Hollywood films are displayed: video A, video B. View the entirety of both clips, one at a time, then select the one that portrays the least amount of violence. Though some of the clips may be taken from the same movie, or though you may even be familiar with the films, you should focus only on the amount of violence displayed within each clip. Some clips may be extremely short.

                                    Movie Clips
Movie Name                          C1   C2   C3   C4   C5   C6   C7   C8   C9   C10  C11  Total
The Hustle                          7    8    3    11   1    7    2    4    3    1    4    51
Good Boys                           5    6    19   11   6    6    4    4                   61
The Peanut Butter Falcon            13   6    13   2    3    3    5    5    8              58
The Possession of Hannah Grace      6    9    9    3    8    6    3    7                   51
Rambo: Last Blood                   3    3    11   14   16   13   10   19   17             106

Table 3.12: Number of movie clip excerpts extracted from each movie based on the number of constant intervals observed in each movie's clips' fused TSS.

In this experiment, annotators are asked to compare two movie excerpts and select the one portraying the least amount of violence. This protocol differs from the previous triplet-based setup, which will be discussed further in Section 3.2.9.5. Participants are paid 0.08 USD per comparison. To help minimize total annotation costs, pairwise comparisons are collected in batches of 5000 until the resulting embedding (described next) does not change significantly from its previous iteration. Due to redundancy in the mutual information within the set of all possible pairs, only O[log(n)] would be required if selected adaptively [173], though in this study we proceed with a uniformly random selection. Of the 53,301 possible pairwise comparisons, 20,000 are collected in this phase of the experiment, with each pair being annotated exactly once.

Embedding. The pairwise comparisons are used to generate an ordinal embedding using t-STE as prescribed in Section 3.2.5.1. To make the pairs compatible with the triplet-based embedding algorithm, a dummy movie clip acting as the reference object is created, which serves as an imaginary least violent movie clip. Thus, if option A is deemed less violent than option B, then option A is selected as the most similar to the dummy reference. This reference point is discarded once the embedding is computed.
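The conversion from pairwise judgments to triplets can be illustrated with a few lines of code. The snippet below is a sketch with assumed variable names; the t-STE solver itself (Section 3.2.5.1) is treated as an external routine rather than reproduced here.

```python
import numpy as np

def pairs_to_triplets(pairwise_judgments, num_excerpts):
    """Convert 'A is less violent than B' pairs into triplet constraints.

    pairwise_judgments: iterable of (less_violent_idx, more_violent_idx) pairs
    num_excerpts:       number of real movie excerpts; index `num_excerpts`
                        is reserved for the dummy least-violent reference clip
    Each triplet (ref, a, b) asserts that excerpt a is closer to ref than b is.
    """
    dummy = num_excerpts
    triplets = [(dummy, a, b) for a, b in pairwise_judgments]
    return np.asarray(triplets, dtype=int)

# A one-dimensional t-STE embedding is then computed over num_excerpts + 1
# points; the row for the dummy reference is dropped, and the remaining
# coordinates provide the ordinal violence value assigned to each excerpt.
```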
Signal Warping. A trapezoidal signal is constructed for each movie clip using the fused TSS and its corresponding ordinal embedding. The domain intervals corresponding to constant interval regions in the TSS are assigned constant functions with values equal to their ordinal embeddings. These constant intervals are joined end-to-end with line segments to form a trapezoidal signal for each movie clip. Finally, the signals for each clip are joined together end-to-end to form a single ground truth time series for each of the five movies.
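A minimal sketch of this construction is given below, assuming the fused TSS constant intervals are available as sorted, non-overlapping frame ranges and that each has already been assigned its embedding value. The function and argument names are illustrative, and edge handling before the first and after the last constant interval is simplified to holding the nearest value.

```python
import numpy as np

def build_warped_clip_signal(constant_intervals, embedding_values, num_frames):
    """Construct the trapezoidal ground-truth signal for one movie clip.

    constant_intervals: list of (start_frame, end_frame) constant regions of the fused TSS
    embedding_values:   ordinal embedding value assigned to each constant interval
    num_frames:         number of frames in the clip's annotation time series
    """
    anchor_frames, anchor_values = [], []
    for (start, end), value in zip(constant_intervals, embedding_values):
        anchor_frames.extend([start, end])   # hold the embedded value across the interval
        anchor_values.extend([value, value])
    # Linear interpolation between consecutive intervals supplies the connecting
    # ramps, producing the trapezoidal shape described above.
    return np.interp(np.arange(num_frames), anchor_frames, anchor_values)

# The per-clip signals are then concatenated in clip order to form the
# ground-truth time series for the full movie.
```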
3.2.9.3 Methods

Comparing annotations to scalar ratings of movie violence requires a many-to-one dimension reduction. Intuitively, movie violence ratings are produced with the intention of providing information about the peak levels of violence in films, but it is unclear which function rating authorities use to summarize the data. Therefore, several aggregation functions are tested: min, max, mean, median, and sum. For evaluation, a baseline continuous annotation is established from the aligned inlier annotations obtained from spectral clustering. These inlier annotations are averaged within each movie clip, and then the corresponding clips are stitched together end-to-end for each movie. Each aggregation function is applied to both the baseline and PSW annotations, and then their rankings are used for assessment. Rank-based correlation measures are used to score the aggregated ratings against the CSM violence ratings.

3.2.9.4 Results

Figure 3.36 plots the baseline annotations alongside the warped annotations for each movie. The scales have been normalized between [0, 1] and, though they have been plotted together, it is important to note that the values cannot be compared directly. Section 3.2.9.5 elaborates on this point.

Figure 3.36: Plots of the baseline averaged annotations and the PSW-warped annotations for (a) Good Boys, (b) The Hustle, (c) Rambo: Last Blood, (d) The Possession of Hannah Grace, and (e) The Peanut Butter Falcon.

As mentioned in Section 3.2.9.2, batches of 5000 unique pairwise comparisons are collected at a time. After each batch, the PSW pipeline is used to generate warped annotations for each movie using data from all available batches previously collected. The resulting warped annotations for each movie are compared between subsequent batches to measure rank agreement, which also serves as a stopping criterion for the collection of future batches. Table 3.13 contains Spearman and Kendall's τ rank-based correlation measures of agreement between the warped annotations produced by the PSW framework when using increasing amounts of the movie clip pairwise comparisons. After the fourth batch is collected, the Spearman correlation for nearly every film is above 0.9, suggesting that adding more batches would not significantly alter the warped annotations. Though the correlation measures decrease after the fourth batch for The Peanut Butter Falcon, the agreement is subjectively determined to be large enough for the other movies, and the pairwise comparison batch collections are halted. Finally, Figure 3.37 shows the Spearman correlation between the CSM violence ratings for each movie and the values obtained by applying different aggregation functions to both the baseline and warped annotations.

                                    Batches 1 vs. 1-2        Batches 1-2 vs. 1-3      Batches 1-3 vs. 1-4
Movie Name                          Spearman   Kendall's τ   Spearman   Kendall's τ   Spearman   Kendall's τ
Good Boys                           0.71       0.53          0.79       0.62          0.93       0.78
The Hustle                          0.74       0.57          0.78       0.59          0.93       0.77
Rambo: Last Blood                   0.84       0.66          0.89       0.74          0.95       0.82
The Possession of Hannah Grace      0.83       0.65          0.94       0.79          0.92       0.78
The Peanut Butter Falcon            0.80       0.64          0.90       0.76          0.81       0.64

Table 3.13: Rank-based measures of agreement between the warped signals resulting from using additional batches of pairwise comparisons. Batches of 5000 unique comparisons are added to all previous batches, and the agreement between the resulting PSW-warped signal and the PSW-warped signal from the previous iteration is measured.

Figure 3.37: Spearman correlation between the CSM ratings and various functionals (min, max, mean, median, sum) of the baseline and warped annotations.

3.2.9.5 Discussion

The results in Figure 3.37 demonstrate the effectiveness of the perceptual similarity warping ground truth framework over the baseline averaging technique. Regardless of which aggregation function best models the CSM rating authority's summarization of the dynamics of movie violence, the PSW-warped annotation achieves a higher level of agreement with the ratings than the baseline annotations. It is interesting to note that the mean, median, and sum aggregation functions achieve similar agreement with the CSM ratings for both the baseline and warped annotations. This suggests that movies with higher violence ratings tend to contain scenes with at least some violence more often than movies with lower violence ratings.

The working hypothesis for this experiment predicted the max function would be the best aggregation function because the primary purpose of the rating authority is to inform people, parents especially, about the worst amount of violence in films. Indeed, the max aggregate of the PSW-warped annotations correlates with the CSM ratings more than the others, corroborating the hypothesis. If max is the correct mapping function, then the results demonstrate a significant advantage over the baseline annotation fusion method, which exhibits very poor correlation under this aggregate transform. Looking back at Figure 3.36, it appears that annotators have a tendency to annotate maximum levels of violence at some point in every film, in spite of the instructions which state that maximum values should correspond to the most "extreme" amount of violence.

It is important to note that the values of the two annotations in Figure 3.36 in theory should not be compared directly. The baseline annotation is produced via averaging, and thus it exists on an interval scale (i.e., in metric space). The PSW-warped annotations are generated to fit results from an ordinal embedding procedure, therefore the values of the warped annotations exist in an ordinal space. It should be possible to learn a monotonic mapping function between the ordinal space of the PSW-warped annotations and the original metric space. An additional data collection could be designed to anchor the constant interval regions of the annotations to values in metric space, and then a monotonic function could be fitted to these anchor points (similar to the approach in Figure 3.24). Anchoring segments with zero violence would be a good place to start, but it would be more difficult to find objective ways of anchoring segments with some violence to certain values in the metric scale. This question of "anchoring" the warped ground truth to a metric scale is an interesting avenue for future research, but it is unnecessary for the purpose of establishing a consistent real-time and continuous-scale annotation for use as ground truth. The ordinal representation is well-suited for preference-based machine learning algorithms, but the results from previous sections in this chapter still demonstrate that it functions well enough if interpreted as interval data.
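As one possible realization of this future-work idea, the monotonic mapping could be fitted with isotonic regression once anchor points are collected, since isotonic regression is monotonic by construction. The sketch below is illustrative only; the names are assumed, and this step is not part of the PSW pipeline evaluated here.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def map_ordinal_to_metric(warped_values, anchor_ordinal, anchor_metric):
    """Map PSW-warped ordinal values onto a metric scale using anchored segments.

    anchor_ordinal: ordinal values of segments with known metric violence levels
                    (e.g., zero-violence segments anchored to 0)
    anchor_metric:  the corresponding metric-scale values for those segments
    """
    mapper = IsotonicRegression(increasing=True, out_of_bounds="clip")
    mapper.fit(np.asarray(anchor_ordinal, dtype=float),
               np.asarray(anchor_metric, dtype=float))
    return mapper.predict(np.asarray(warped_values, dtype=float))
```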
Lastly, let us consider the use of pairwise comparisons in this experiment and what it implies for future research. The original proposal in Section 3.2.5 calls for crowd-sourcing the triplet comparisons, where the main benefit is that triplets generalize to all constructs and contexts in which an ordered scale may not be apparent to people. Movie violence, however, does have a preconceived notion of an ordered scale, so using pairwise comparisons is possible and also requires far fewer comparisons than triplets.

Chapter 4

DISCUSSION & FUTURE CONSIDERATIONS

This dissertation has attempted to make a strong case for the need to rethink the way in which data for human behavior studies is collected and analyzed. The evidence presented in Chapter 2 suggests that high quality data gathered from human subjects over longer periods of time and in natural settings is needed to enable intra-subject learning and analysis. It also demonstrates that a new differential-style interpretation of human annotations of subjective constructs, combined with supplemental help from the crowd, can better align the resulting ground truth with the true underlying experiences.

The data collection protocol and strategies outlined in Section 3.1 offer advice for managing longitudinal studies in the wild. The explicit checklist provided as a summary in Appendix A should be helpful for researchers when transitioning pilot studies and well-controlled studies to the real world, where ecologically valid data can be gathered from more people. Transitioning more research studies to realistic environments where data from a larger number of individuals can be collected should alone benefit human behavior research, but it is also becoming more important as data science and machine learning continue to advance. These fields are expected to usher in a new kind of era for scientific advancement, following continued advancements in big data and deep learning, where information can be logically induced from large collections of data instead of deduced from individual experiments. Measuring and understanding the reliability and limitations of collected human data will be essential for these systems to accurately learn. The specific strategies and comprehensive checklist presented in this dissertation provide a starting point along this front, but much more research effort will be needed to fully categorize and quantify all of the relevant and important aspects of human data collection.

The PSW pipeline and suite of methods presented in Section 3.2 show that the real-time continuous-scale interval annotation process does not, by itself, provide consistent annotation signals which are appropriate as ground truth representations for subjective constructs. The analysis of annotation quality in the green intensity experiment verifies what some recent research has observed: people reliably annotate changes but do not capture values.
The evidence from the studies of annotator behavior on both the green intensity and movie violence rating data suggests that the correct way to interpret these annotations is in a differential space. Though some machine learning models are capable of learning information directly from these temporally sequential comparisons, the proposed PSW pipeline is capable of re-aligning the annotations in space, decreasing the signal noise and thus effectively increasing the accessible information, such that the labeled values can be used in any standard machine learning framework. These gains are primarily achieved through careful signal processing and supplemental annotation from a second round of annotators. The perceptual similarity warping pipeline and the various algorithms discussed in this manuscript consistently produce desirable results, but more research is needed to verify and further validate the findings. Understanding the subtle differences between rank-based annotation and the interval annotation style used exclusively in this work will be essential for learning more about annotator behavior and therefore annotation accuracy. One question of particular research interest that remains unexplored is understanding when using pairwise comparisons instead of triplet comparisons is appropriate for the second wave of annotators in the PSW pipeline. The work on improving annotation quality in this dissertation only scratches the surface, but it has definitively shown that this is a fruitful direction for ground truth estimation.

Overall, the experiments, evidence, and discussions in this dissertation have demonstrated that gathering quality data from people for human behavior and experience studies is quite difficult. For both data collection and human annotation, acceptance of human biases, errors, and mistakes seems to be the best approach. Therefore, we should not put too much trust into the data or annotations that humans would produce in isolation, and we should instead provide procedural support and give participants and annotators opportunities to correct mistakes.

REFERENCES

[1] Brandon M Booth, Karel Mundnich, and Shrikanth S Narayanan. A novel method for human bias correction of continuous-time annotations. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3091–3095. IEEE, 2018. viii, xiv, 74, 76, 77, 88, 91, 111, 112, 115, 117, 121

[2] Alan Gevins, Michael E Smith, Linda McEvoy, and Daphne Yu. High-resolution EEG mapping of cortical activation related to working memory: effects of task difficulty, type of processing, and practice. Cerebral Cortex, 7(4):374–385, 1997. xi, 15, 20, 27, 29

[3] Sushil Chandra, Kundan Lal Verma, Greeshma Sharma, Alok Mittal, and Devendra Jha. EEG based cognitive workload classification during NASA MATB-II multitasking. International Journal of Cognitive Research in Science, Engineering and Education (IJCRSEE), 3(1):35–41, 2015. xi, 14, 19, 20, 27, 29

[4] Fumio Yamada. Frontal midline theta rhythm and eye blinking activity during a VDT task and a video game: useful tools for psychophysiology in ergonomics. Ergonomics, 41(5):678–688, 1998. xi, 15, 20, 27, 29

[5] Tiantian Feng, Amrutha Nadarajan, Colin Vaz, Brandon Booth, and Shrikanth Narayanan. Tiles audio recorder: an unobtrusive wearable solution to track audio activity. In Proceedings of the 4th ACM Workshop on Wearable Systems and Applications, pages 33–38, 2018. xii, 41, 63, 64

[6] Fabien Ringeval, Andreas Sonderegger, Juergen Sauer, and Denis Lalanne.
Introducing the recola multimodal corpus of remote collaborative and aective interactions. In Au- tomatic Face and Gesture Recognition (FG), 2013 10th IEEE International Conference and Workshops on, pages 1{8. IEEE, 2013. xiii, 74, 75, 81, 88, 118, 132 [7] Soroosh Mariooryad and Carlos Busso. Correcting time-continuous emotional labels by modeling the reaction lag of evaluators. IEEE Transactions on Aective Computing, 6(2):97{108, 2015. xiv, 21, 75, 92, 96, 106, 107, 111, 113, 114, 119 [8] Mihaly Csikszentmihalyi. Toward a psychology of optimal experience. In Flow and the foundations of positive psychology, pages 209{226. Springer, 2014. 2 [9] Thomas W Malone. What makes things fun to learn? a study of intrinsically motivating computer games. 1981. 2 [10] Harry J Witchel. Engagement: the inputs and the outputs: conference overview. In Proceedings of the 2013 Inputs-Outputs Conference: An Interdisciplinary Conference on Engagement in HCI and Performance, page 1. ACM, 2013. 2 [11] Michael G Kahn, Tiany J Callahan, Juliana Barnard, Alan E Bauck, Je Brown, Bruce N Davidson, Hossein Estiri, Carsten Goerg, Erin Holve, Steven G Johnson, et al. 151 A harmonized data quality assessment terminology and framework for the secondary use of electronic health record data. Egems, 4(1), 2016. 3 [12] Phillip A Bishop and Robert L Herron. Use and misuse of the likert item responses and other ordinal measures. International journal of exercise science, 8(3):297, 2015. 10 [13] Robert C MacCallum, Shaobo Zhang, Kristopher J Preacher, and Derek D Rucker. On the practice of dichotomization of quantitative variables. Psychological methods, 7 (1):19, 2002. 10 [14] Brandon M Booth, Asem M Ali, Shrikanth S Narayanan, Ian Bennett, and Aly A Farag. Toward active and unobtrusive engagement assessment of distance learners. In Aective Computing and Intelligent Interaction (ACII), 2017 Seventh International Conference on, pages 470{476. IEEE, 2017. 12, 26, 73, 74 [15] Brandon M Booth, Taylor J Seamans, and Shrikanth S Narayanan. An evaluation of eeg-based metrics for engagement assessment of distance learners. In 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pages 307{310. IEEE, 2018. 12 [16] Dhawal Shah. Monetization over Massiveness: A Review of MOOC Stats and Trends in 2016, December 29, 2016 (accessed April 21, 2017). https://www.class- central.com/report/moocs-stats-and-trends-2016. 12 [17] Daniel F.O. Onah, Jane Sinclair, and Russell Boyatt. Dropout rates of massive open online courses: behavioural patterns. EDULEARN14 Proceedings, pages 5825{5834, 2014. 12 [18] Rebecca Ferguson, Doug Clow, Russell Beale, Alison J Cooper, Neil Morris, Si^ an Bayne, and Amy Woodgate. Moving through MOOCs: Pedagogy, learning design and patterns of engagement. In Design for Teaching and Learning in a Networked World, pages 70{84. Springer, 2015. 13 [19] Arti Ramesh, Dan Goldwasser, Bert Huang, Hal Daume III, and Lise Getoor. Uncov- ering hidden engagement patterns for predicting learner performance in MOOCs. In Proceedings of the First ACM Conference on Learning at Scale, pages 157{158. ACM, 2014. 13 [20] Berit Ostlund. Stress, disruption and community-adult learners' experiences of obsta- cles and opportunities in distance education. European Journal of Open, Distance and E-learning, 8(1), 2005. 13 [21] Chris Berka, Daniel J Levendowski, Michelle N Lumicao, Alan Yau, Gene Davis, Vladimir T Zivkovic, Richard E Olmstead, Patrice D Tremoulet, and Patrick L Craven. 
EEG correlates of task engagement and mental workload in vigilance, learning, and memory tasks. Aviation, space, and environmental medicine, 78(5):B231{B244, 2007. 13, 14 152 [22] Jonathan Bidwell and Henry Fuchs. Classroom analytics: measuring student engage- ment with automated gaze tracking. Behavior Research Methods, 49:113, 2011. 14, 29 [23] Mirko Raca and Pierre Dillenbourg. System for assessing classroom attention. In Pro- ceedings of the Third International Conference on Learning Analytics and Knowledge, pages 265{269. ACM, 2013. 14 [24] Jacob Whitehill, Zewelanji Serpell, Yi-Ching Lin, Aysha Foster, and Javier R Movellan. The faces of engagement: Automatic recognition of student engagement from facial expressions. IEEE Transactions on Aective Computing, 5(1):86{98, 2014. 14, 19, 26, 30, 31 [25] Nigel Bosch, Sidney D'Mello, Ryan Baker, Jaclyn Ocumpaugh, Valerie Shute, Matthew Ventura, Lubin Wang, and Weinan Zhao. Automatic detection of learning-centered aective states in the wild. In Proceedings of the 20th international conference on intelligent user interfaces, pages 379{388. ACM, 2015. 14, 30 [26] Lawrence J Prinzel III, Mark W Scerbo, Frederick G Freeman, and Peter J Mikulka. A bio-cybernetic system for adaptive automation. In Proceedings of the human factors and ergonomics society annual meeting, volume 39. 14, 15 [27] Alan T Pope, Edward H Bogart, and Debbie S Bartolome. Biocybernetic system evaluates indices of operator engagement in automated task. Biological psychology, 40 (1):187{195, 1995. 14, 20 [28] Lawrence J Prinzel III, Frederick G Freeman, Mark W Scerbo, Peter J Mikulka, and Alan T Pope. Eects of a psychophysiological system for adaptive automation on per- formance, workload, and the event-related potential P300 component. Human factors, 45(4):601{614, 2003. 14 [29] Nathan R Bailey, Mark W Scerbo, Frederick G Freeman, Peter J Mikulka, and Lorissa A Scott. Comparison of a brain-based adaptive system and a manual adaptable system for invoking automation. Human Factors: The Journal of the Human Factors and Ergonomics Society, 48(4):693{709, 2006. 15 [30] Frederick G Freeman, Peter J Mikulka, Mark W Scerbo, and Lorissa Scott. An eval- uation of an adaptive automation system using a cognitive vigilance task. Biological psychology, 67(3):283{297, 2004. 15 [31] Timothy McMahan, Ian Parberry, and Thomas D Parsons. Evaluating player task engagement and arousal using electroencephalography. Procedia Manufacturing, 3: 2303{2310, 2015. 15 [32] Peter J Mikulka, Mark W Scerbo, and Frederick G Freeman. Eects of a biocybernetic system on vigilance performance. Human Factors, 44(4):654{664, 2002. 15 [33] Gary Bradski. The OpenCV library. Dr. Dobb's Journal of Software Tools, 2000. 18, 19 153 [34] Eslam Mostafa, Asem A Ali, Ahmed Shalaby, and Aly Farag. A facial features detector integrating holistic facial information and part-based model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 93{ 99, 2015. 18 [35] Michel Valstar, Jonathan Gratch, Bj orn Schuller, Fabien Ringeval, Dennis Lalanne, Mercedes Torres Torres, Stefan Scherer, Giota Stratou, Roddy Cowie, and Maja Pantic. Avec 2016: Depression, mood, and emotion recognition workshop and challenge. In Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, pages 3{10. ACM, 2016. 18, 73 [36] P Ekman and W Friesen. Facial action coding system: a technique for the measurement of facial movement. Consulting Psychologists, 1978. 
18 [37] Patrick Lucey, Jerey F Cohn, Takeo Kanade, Jason Saragih, Zara Ambadar, and Iain Matthews. The extended cohn-kanade dataset (ck+): A complete dataset for action unit and emotion-specied expression. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on, pages 94{101. IEEE, 2010. 18, 19 [38] Yan Tong, Wenhui Liao, and Qiang Ji. Facial action unit recognition by exploiting their dynamic and semantic relationships. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(10), 2007. [39] Martin Breidt, Douglas W. Cunningham, Christian Wallraven, and Heinrich H. Bltho. Face video database of the mpi for biological cybernetics. 2003. 18 [40] Brian A Smith, Qi Yin, Steven K Feiner, and Shree K Nayar. Gaze locking: passive eye contact detection for human-object interaction. In Proceedings of the 26th annual ACM symposium on User interface software and technology, pages 271{280. ACM, 2013. 19 [41] Oliver Langner, Ron Dotsch, Gijsbert Bijlstra, Daniel HJ Wigboldus, Skyler T Hawk, and Ad van Knippenberg. Presentation and validation of the radboud faces database. Cognition and emotion, 24(8):1377{1388, 2010. 19 [42] Michael Lyons, Shigeru Akamatsu, Miyuki Kamachi, and Jiro Gyoba. Coding facial expressions with gabor wavelets. In Automatic Face and Gesture Recognition, 1998. Proceedings. Third IEEE International Conference on, pages 200{205. IEEE, 1998. [43] Abhinav Dhall, Roland Goecke, Simon Lucey, and Tom Gedeon. Static facial expression analysis in tough conditions: Data, evaluation protocol and benchmark. In Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, pages 2106{2112. IEEE, 2011. 19 [44] Gunnar Farneb ack. Two-frame motion estimation based on polynomial expansion. Image analysis, pages 363{370, 2003. 19 154 [45] fr TM face recognition sdk. Private academic license via Google, 2016. 19 [46] Alexandre Gramfort, Martin Luessi, Eric Larson, Denis A. Engemann, Daniel Strohmeier, Christian Brodbeck, Lauri Parkkonen, and Matti H am al ainen. MNE soft- ware for processing MEG and EEG data. NeuroImage, 86:446{460, February 2014. ISSN 1053-8119. 20 [47] Alexandre Gramfort, Martin Luessi, Eric Larson, Denis A. Engemann, Daniel Strohmeier, Christian Brodbeck, Roman Goj, Mainak Jas, Teon Brooks, Lauri Parkko- nen, and Matti H am al ainen. MEG and EEG data analysis with MNE-Python. Fron- tiers in Neuroscience, 7, December 2013. ISSN 1662-453X. 20 [48] Paul L Nunez and Ramesh Srinivasan. Electric elds of the brain: the neurophysics of EEG. Oxford University Press, USA, 2006. 20 [49] Laura Astol, F De Vico Fallani, Febo Cincotti, D Mattia, MG Marciani, S Bufalari, S Salinari, Alfredo Colosimo, L Ding, JC Edgar, et al. Imaging functional brain connec- tivity patterns from high-resolution EEG and fMRI via graph theory. Psychophysiology, 44(6):880{893, 2007. 20 [50] Domenic V Cicchetti. Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology. Psychological assessment, 6 (4):284, 1994. 21, 78 [51] Angeliki Metallinou and Shrikanth Narayanan. Annotation and processing of contin- uous emotional attributes: Challenges and opportunities. In 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, pages 1{8. IEEE, 2013. 21, 76, 91, 111, 117, 118 [52] Georgios N. Yannakakis and John Hallam. Ranking vs. preference: A comparative study of self-reporting. 
In Aective Computing and Intelligent Interaction: 4th Inter- national Conference, pages 437{446. Springer, 2011. [53] Georgios N Yannakakis and H ector P Mart nez. Ratings are overrated! Frontiers in ICT, 2:13, 2015. 21, 76, 91, 111, 117 [54] Leonid I Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1-4):259{268, 1992. 21, 92 [55] Stephen R Becker, Emmanuel J Cand es, and Michael C Grant. Templates for convex cone problems with applications to sparse signal recovery. Mathematical programming computation, 3(3):165, 2011. 21, 92 [56] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blon- del, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825{2830, 2011. 22 155 [57] Brandon M Booth, Karel Mundnich, Tiantian Feng, Amrutha Nadarajan, Tiago H Falk, Jennifer L Villatte, Emilio Ferrara, and Shrikanth Narayanan. Multimodal human and environmental sensing for longitudinal behavioral studies in naturalistic settings: Framework for sensor selection, deployment, and management. J Med In- ternet Res, 21(8):e12832, Aug 2019. ISSN 1438-8871. doi: 10:2196/12832. URL http://www:jmir:org/2019/8/e12832/. 34, 61, 62, 65 [58] Natasha Lomas. Global wearables market to grow 17% in 2017, 310M devices sold, $30.5BN revenue: Gartner, 2017. URL https://techcrunch:com/2017/08/24/ global-wearables-market-to-grow-17-in-2017-310m-devices-sold-30-5bn- revenue-gartner. 34 [59] Yvonne Rogers and Paul Marshall. Research in the wild synthesis lectures on human- centered informatics. Morgan and Claypool Publishers, 2017. 34, 35 [60] Feng-Tso Sun, Yi-Ting Yeh, Heng-Tze Cheng, Cynthia Kuo, Martin L Griss, et al. Nonparametric discovery of human routines from sensor data. In PerCom, pages 11{ 19. Citeseer, 2014. 34 [61] Nikola Banovic, To Buzali, Fanny Chevalier, Jennifer Manko, and Anind K Dey. Modeling and understanding human routine behavior. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pages 248{260, 2016. [62] Emma Pierson, Tim Altho, and Jure Leskovec. Modeling individual cyclic variation in human behavior. In Proceedings of the 2018 World Wide Web Conference, pages 107{116, 2018. [63] Rui Wang, Fanglin Chen, Zhenyu Chen, Tianxing Li, Gabriella Harari, Stefanie Tig- nor, Xia Zhou, Dror Ben-Zeev, and Andrew T Campbell. Studentlife: assessing mental health, academic performance and behavioral trends of college students using smart- phones. In In Proceedings of the 2014 ACM International Joint Conference on Perva- sive and Ubiquitous Computing, pages 1{12. ACM, 2014. 35, 69 [64] Yvonne Rogers, Kay Connelly, Lenore Tedesco, William Hazlewood, Andrew Kurtz, Robert E Hall, Josh Hursey, and Tammy Toscos. Why its worth the hassle: The value of in-situ studies when designing ubicomp. In International Conference on Ubiquitous Computing, pages 336{353. Springer, 2007. [65] Elizabeth Bonsignore, Alexander J Quinn, Allison Druin, and Benjamin B Bederson. Sharing stories in the wild a mobile storytelling case study using storykit. ACM Transactions on Computer-Human Interaction (TOCHI), 20(3):1{38, 2013. 34 [66] Steve Benford, Chris Greenhalgh, Andy Crabtree, Martin Flintham, Brendan Walker, Joe Marshall, Boriana Koleva, Stefan Rennick Egglestone, Gabriella Giannachi, Matt Adams, et al. Performance-led research in the wild. 
ACM Transactions on Computer- Human Interaction (TOCHI), 20(3):1{22, 2013. 35 156 [67] Anne Adams, Elizabeth Fitzgerald, and Gary Priestnall. Of catwalk technologies and boundary creatures. ACM Transactions on Computer-Human Interaction (TOCHI), 20(3):1{34, 2013. [68] Elena S Izmailova, John A Wagner, and Eric D Perakslis. Wearable devices in clinical trials: hype and hypothesis. Clinical Pharmacology & Therapeutics, 104(1):42{52, 2018. 35 [69] Eiman Kanjo, Luluah Al-Husain, and Alan Chamberlain. Emotions in context: ex- amining pervasive aective sensing systems, applications, and analyses. Personal and Ubiquitous Computing, 19(7):1197{1212, 2015. 35 [70] Stefan Rennick-Egglestone, Sarah Knowles, Gill Toms, Penny Bee, Karina Lovell, and Peter Bower. Health technologies' in the wild' experiences of engagement with comput- erised cbt. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pages 2124{2135, 2016. [71] John M Carroll and Mary Beth Rosson. Wild at home: The neighborhood as a living laboratory for hci. ACM Transactions on Computer-Human Interaction (TOCHI), 20 (3):1{28, 2013. 66 [72] Nemanja Memarovic, Marc Langheinrich, Keith Cheverst, Nick Taylor, and Flo- rian Alt. P-layers{a layered framework addressing the multifaceted issues facing community-supporting public display deployments. ACM Transactions on Computer- Human Interaction (TOCHI), 20(3):1{34, 2013. 35 [73] Young-Hye Sung, Hae-Young Kim, Ho-Hyun Son, and Juhea Chang. How to design in situ studies: an evaluation of experimental protocols. Restorative dentistry & en- dodontics, 39(3):164{171, 2014. 35 [74] Robin Whittemore and Gail D'Eramo Melkus. Designing a research study. The Dia- betes Educator, 34(2):201{216, 2008. 35 [75] Neska El Haouij, Jean-Michel Poggi, Sylvie Sevestre-Ghalila, Raja Ghozi, and M eriem Ja dane. Aectiveroad system and database to assess drivers attention. In Pro- ceedings of the 33rd Annual ACM Symposium on Applied Computing, SAC 18, page 800803, New York, NY, USA, 2018. Association for Computing Machinery. ISBN 9781450351911. doi: 10:1145/3167132:3167395. URL https://doi:org/10:1145/ 3167132:3167395. 35 [76] Jennifer Anne Healey. Wearable and automotive systems for aect recognition from physiology. PhD thesis, Massachusetts Institute of Technology, 2000. 35 [77] Ya-Li Zheng, Xiao-Rong Ding, Carmen Chung Yan Poon, Benny Ping Lai Lo, Heye Zhang, Xiao-Lin Zhou, Guang-Zhong Yang, Ni Zhao, and Yuan-Ting Zhang. Unob- trusive sensing and wearable devices for health informatics. IEEE Transactions on Biomedical Engineering, 61(5):1538{1554, 2014. 35 157 [78] Mostafa Haghi, Kerstin Thurow, and Regina Stoll. Wearable devices in medical internet of things: scientic research and commercially available devices. Healthcare informatics research, 23(1):4{15, 2017. 38 [79] Haemwaan Sivaraks and Chotirat Ann Ratanamahatana. Robust and accurate anomaly detection in ecg artifacts using time series motif discovery. Computational and mathematical methods in medicine, 2015, 2015. 38 [80] Shaofeng Zou, Yingbin Liang, H Vincent Poor, and Xinghua Shi. Data-driven ap- proaches for detecting and identifying anomalous data streams. In Signal Processing and Machine Learning for Biomedical Big Data, pages 75{90. CRC Press, 2018. 38 [81] Udit Satija, Barathram Ramkumar, and M Sabarimalai Manikandan. An automated ecg signal quality assessment method for unsupervised diagnostic systems. Biocyber- netics and Biomedical Engineering, 38(1):54{70, 2018. 38 [82] Carre Technologies inc (Hexoskin). 
Hexoskin smart shirts - cardiac, respiratory, sleep & activity metrics. URL https://www:hexoskin:com/. 39 [83] Camille Nebeker, John Harlow, Rebeca Espinoza Giacinto, Rubi Orozco-Linares, Cin- namon S Bloss, and Nadir Weibel. Ethical and regulatory challenges of research using pervasive sensing and other emerging technologies: Irb perspectives. AJOB empirical bioethics, 8(4):266{276, 2017. 40 [84] Tim Altho, Eric Horvitz, Ryen W White, and Jamie Zeitzer. Harnessing the web for population-scale physiological sensing: A case study of sleep and performance. In Proceedings of the 26th international conference on World Wide Web, pages 113{122, 2017. 40 [85] P Aggarwal, Zainab Syed, Xiaoji Niu, and Naser El-Sheimy. A standard testing and calibration procedure for low cost mems inertial sensors and units. The Journal of Navigation, 61(2):323, 2008. 46 [86] Abdul J Jerri. The shannon sampling theorem{its various extensions and applications: A tutorial review. Proceedings of the IEEE, 65(11):1565{1596, 1977. 47 [87] Anna Shcherbina, C Mikael Mattsson, Daryl Waggott, Heidi Salisbury, Jerey W Christle, Trevor Hastie, Matthew T Wheeler, and Euan A Ashley. Accuracy in wrist- worn, sensor-based measurements of heart rate and energy expenditure in a diverse cohort. Journal of personalized medicine, 7(2):3, 2017. 48 [88] JF Enderle, JD Bronzino, and Y Mendelson. Sensitivity Drift. Burlington: Academic Press, 3rd edition, 2012. 48 [89] JA Van Alste, W Van Eck, and OE Herrmann. Ecg baseline wander reduction using linear phase lters. Computers and Biomedical Research, 19(5):417{427, 1986. 48 158 [90] Raimon Jane, Pablo Laguna, Nitish V Thakor, and Pere Caminal. Adaptive base- line wander removal in the ecg: Comparative analysis with cubic spline technique. Computers in Cardiology, pages 143{143, 1992. 48 [91] L S ornmo. Time-varying digital ltering of ecg baseline wander. Medical and Biological Engineering and Computing, 31(5):503, 1993. 48 [92] Raymundo Cassani, Shrikanth Narayanan, and Tiago H Falk. Respiration rate esti- mation from noisy electrocardiograms based on modulation spectral analysis. CMBES Proceedings, 41, 2018. 48 [93] Gabriella M Harari, Weichen Wang, Sandrine R M uller, Rui Wang, and Andrew T Campbell. Participants' compliance and experiences with self-tracking using a smart- phone sensing app. In Proceedings of the 2017 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2017 ACM International Symposium on Wearable Computers, pages 57{60, 2017. 57 [94] United States. National Commission for the Protection of Human Subjects of Biomed- ical and Behavioral Research. The Belmont report: ethical principles and guidelines for the protection of human subjects of research, volume 2. The Commission, 1978. 58 [95] Camille Nebeker, Rebecca J Bartlett Ellis, and John Torous. Development of a decision-making checklist tool to support technology selection in digital health re- search. Translational Behavioral Medicine, 2019. 58 [96] Sohrab Saeb, Emily G Lattie, Stephen M Schueller, Konrad P Kording, and David C Mohr. The relationship between mobile phone location sensor data and depressive symptom severity. PeerJ, 4:e2537, 2016. 59 [97] Yves-Alexandre De Montjoye, C esar A Hidalgo, Michel Verleysen, and Vincent D Blondel. Unique in the crowd: The privacy bounds of human mobility. Scientic reports, 3:1376, 2013. 
59 [98] Barbara L Filkins, Ju Young Kim, Bruce Roberts, Winston Armstrong, Mark A Miller, Michael L Hultner, Anthony P Castillo, Jean-Christophe Ducom, Eric J Topol, and Steven R Steinhubl. Privacy and security in the era of digital health: what should translational researchers know and do about it? American journal of translational research, 8(3):1560, 2016. 60 [99] The intelligence advanced research projects activity. multimodal objective sensing to assess individuals with context (mosaic), 2016. URL https://www:iarpa:gov/ index:php/research-programs/mosaic. 60 [100] Michelle L'Hommedieu, Justin L'Hommedieu, Cynthia Begay, Alison Schenone, Lida Dimitropoulou, Gayla Margolin, Tiago Falk, Emilio Ferrara, Kristina Lerman, and Shrikanth Narayanan. Lessons learned: Recommendations for implementing a longi- tudinal study using wearable and environmental sensors in a health care organization. JMIR mHealth and uHealth, 7(12):e13305, 2019. 61, 66 159 [101] Karel Mundnich, Brandon M. Booth, Michelle L'Hommedieu, Tiantian Feng, Ben- jamin Girault, Justin L'Hommedieu, Mackenzie Wildman, Sophia Skaaden, Amrutha Nadarajan, Jennifer L. Villatte, Tiago H. Falk, Kristina Lerman, Emilio Ferrara, and Shrikanth Narayanan. Tiles-2018: A longitudinal physiologic and behavioral data set of hospital workers, 2020. 61, 69 [102] Hye-Geum Kim, Eun-Jin Cheon, Dai-Seg Bai, Young Hwan Lee, and Bon-Hoon Koo. Stress and heart rate variability: a meta-analysis and review of the literature. Psychi- atry investigation, 15(3):235, 2018. 62 [103] Ju-Mi Lee, Hyeon Chang Kim, Jee In Kang, and Il Suh. Association between stressful life events and resting heart rate. BMC psychology, 2(1):29, 2014. 62 [104] Elizabeth H Anderson and Geetha Shivakumar. Eects of exercise and physical activity on anxiety. Frontiers in psychiatry, 4:27, 2013. 62 [105] Andrea N Goldstein and Matthew P Walker. The role of sleep in emotional brain function. Annual review of clinical psychology, 10:679{708, 2014. 62 [106] Tara W Strine and Daniel P Chapman. Associations of frequent sleep insuciency with health-related quality of life and health behaviors. Sleep medicine, 6(1):23{27, 2005. 62 [107] Bj orn W Schuller. Speech emotion recognition: Two decades in a nutshell, benchmarks, and ongoing trends. Communications of the ACM, 61(5):90{99, 2018. 62 [108] Robert G Loudon, Linda Lee, and Barbara J Holcomb. Volumes and breathing patterns during speech in healthy and asthmatic subjects. Journal of Speech, Language, and Hearing Research, 31(2):219{227, 1988. 62 [109] Ansgar Conrad, Anett M uller, Sigrun Doberenz, Sunyoung Kim, Alicia E Meuret, Eileen Wollburg, and Walton T Roth. Psychophysiological eects of breathing in- structions for stress management. Applied psychophysiology and biofeedback, 32(2): 89{98, 2007. 62 [110] Kathy O Roper and Parminder Juneja. Distractions in the workplace revisited. Journal of Facilities management, 2008. 62 [111] Petros Saravakos and Georgios Ch Sirakoulis. Modeling employees behavior in work- place dynamics. Journal of Computational Science, 5(5):821{833, 2014. 62 [112] Juho Merilahti, Juha P arkk a, Kari Antila, Paula Paavilainen, Elina Mattila, Esko- Juhani Malm, Ari Saarinen, and Ilkka Korhonen. Compliance and technical feasibility of long-term health monitoring with wearable and ambient technologies. Journal of telemedicine and telecare, 15(6):302{309, 2009. 68, 69 [113] Merryn J Mathie, Adelle CF Coster, Nigel H Lovell, Branko G Celler, Stephen R Lord, and Anne Tiedemann. 
A pilot study of long-term monitoring of human movements in the home using accelerometry. Journal of telemedicine and telecare, 10(3):144{151, 2004. 160 [114] D Scherr, R Zweiker, A Kollmann, P Kastner, G Schreier, and FM Fruhwald. Mobile phone-based surveillance of cardiac patients at home. Journal of telemedicine and telecare, 12(5):255{261, 2006. [115] Simon de Lusignan, Sally Wells, Paul Johnson, Karen Meredith, and Edward Leatham. Compliance and eectiveness of 1 year's home telemonitoring. the report of a pilot study of patients with chronic heart failure. European journal of heart failure, 3(6): 723{730, 2001. 69 [116] Jean Kossai, Robert Walecki, Yannis Panagakis, Jie Shen, Maximilian Schmitt, Fa- bien Ringeval, Jing Han, Vedhas Pandit, Antoine Toisoul, Bjoern W Schuller, et al. Sewa db: A rich database for audio-visual emotion and sentiment research in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019. 73, 74 [117] Sander Koelstra, Christian Muhl, Mohammad Soleymani, Jong-Seok Lee, Ashkan Yaz- dani, Touradj Ebrahimi, Thierry Pun, Anton Nijholt, and Ioannis Patras. Deap: A database for emotion analysis; using physiological signals. IEEE transactions on af- fective computing, 3(1):18{31, 2012. 74 [118] Mohammad Soleymani, Jeroen Lichtenauer, Thierry Pun, and Maja Pantic. A mul- timodal database for aect recognition and implicit tagging. IEEE Transactions on Aective Computing, 3(1):42{55, 2012. 74 [119] Michel Valstar, Bj orn Schuller, Kirsty Smith, Timur Almaev, Florian Eyben, Jarek Krajewski, Roddy Cowie, and Maja Pantic. Avec 2014: 3d dimensional aect and depression recognition challenge. In Proceedings of the 4th international workshop on audio/visual emotion challenge, pages 3{10, 2014. 73, 74, 118, 121 [120] Rahul Gupta. Computational Methods for Modeling Nonverbal Communication in Hu- man Interaction. PhD thesis, University of Southern California, 2016. 73 [121] Brandon M Booth, Karel Mundnich, and Shrikanth Narayanan. Fusing annotations with majority vote triplet embeddings. In Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop, pages 83{89. ACM, 2018. 74, 81 [122] Brandon Booth and Shrikanth Narayanan. Improving continuous human annotation fusion: A novel signal approximation algorithm. In Aective Computing and Intelligent Interaction (ACII), 2019 Seventh International Conference on. IEEE, 2019. 98 [123] Brandon M Booth and Shrikanth S Narayanan. Trapezoidal segment sequencing: A novel approach for fusion of human-produced continuous annotations. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4512{4516. IEEE, 2020. 110 [124] Brandon M Booth and Shrikanth S Narayanan. Fifty shades of green: Towards a robust measure of inter-annotator agreement for continuous signals. In 2020 International Conference on Multimodal Interaction. 74, 116 161 [125] Gary McKeown, Michel Valstar, Roddy Cowie, Maja Pantic, and Marc Schroder. The semaine database: Annotated multimodal records of emotionally colored conversations between a person and a limited agent. IEEE Transactions on Aective Computing, 3 (1):5{17, 2011. 74, 110, 118 [126] Anna Aljanaki, Yi-Hsuan Yang, and Mohammad Soleymani. Developing a benchmark for emotional analysis of music. PloS one, 12(3):e0173392, 2017. 74, 118 [127] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. 
Iemocap: Inter- active emotional dyadic motion capture database. Language resources and evaluation, 42(4):335, 2008. 74, 118 [128] Mihalis A Nicolaou, Hatice Gunes, and Maja Pantic. Automatic segmentation of spontaneous data using dimensional labels from multiple coders. In Proc. of LREC Int. Workshop on Multimodal Corpora: Advances in Capturing, Coding and Analyzing Multimodality, pages 43{48. Citeseer, 2010. 74, 118, 127 [129] Bj orn Schuller, Stefan Steidl, Anton Batliner, Julia Hirschberg, Judee K Burgoon, Alice Baird, Aaron Elkins, Yue Zhang, Eduardo Coutinho, Keelan Evanini, et al. The interspeech 2016 computational paralinguistics challenge: Deception, sincerity & native language. In 17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5, pages 2001{2005, 2016. 74 [130] Nikos Malandrakis, Alexandros Potamianos, Georgios Evangelopoulos, and Athanasia Zlatintsi. A supervised approach to movie emotion tracking. In 2011 IEEE interna- tional conference on acoustics, speech and signal processing (ICASSP), pages 2376{ 2379. IEEE, 2011. 74 [131] Benedek Kurdi, Shayn Lozano, and Mahzarin R Banaji. Introducing the open aective standardized image set (oasis). Behavior research methods, 49(2):457{470, 2017. 74 [132] Ellen Douglas-Cowie, Cate Cox, Jean-Claude Martin, Laurence Devillers, Roddy Cowie, Ian Sneddon, Margaret McRorie, Catherine Pelachaud, Christopher Peters, Orla Lowry, et al. The humaine database. In Emotion-Oriented Systems, pages 243{ 284. Springer, 2011. 74 [133] Meinard M uller. Dynamic time warping. Information retrieval for music and motion, pages 69{84, 2007. 75, 86, 92, 111, 113, 119, 138 [134] Mihalis A Nicolaou, Stefanos Zafeiriou, and Maja Pantic. Correlated-spaces regression for learning continuous emotion dimensions. In Proceedings of the 21st ACM interna- tional conference on Multimedia, pages 773{776. ACM, 2013. 75, 110 [135] Harold Hotelling. Relations between two sets of variates. Biometrika, 28(3/4):321{377, 1936. 75, 111 162 [136] Mihalis A Nicolaou, Vladimir Pavlovic, and Maja Pantic. Dynamic probabilistic cca for analysis of aective behavior and fusion of continuous annotations. IEEE transactions on pattern analysis and machine intelligence, 36(7):1299{1311, 2014. 75, 92 [137] Galen Andrew, Raman Arora, Je A Bilmes, and Karen Livescu. Deep canonical cor- relation analysis. In Proceedings of the International Conference on Machine Learning, pages 1247{1255, 2013. 75, 92, 111 [138] Feng Zhou and Fernando De la Torre. Generalized canonical time warping. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):279{294, 2016. 75, 86, 111, 119 [139] George Trigeorgis, Mihalis A Nicolaou, Stefanos Zafeiriou, and Bjorn W Schuller. Deep canonical time warping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5110{5118, 2016. 75, 92, 111 [140] Phil Lopes, Georgios N Yannakakis, and Antonios Liapis. Ranktrace: Relative and un- bounded aect annotation. In Aective Computing and Intelligent Interaction (ACII), 2017 Seventh International Conference on, pages 158{163. IEEE, 2017. 76 [141] Karel Mundnich, Brandon M Booth, Benjamin Girault, and Shrikanth Narayanan. Generating labels for regression of subjective constructs using triplet embeddings. Pat- tern Recognition Letters, 128:385{392, 2019. 76, 81, 98 [142] Georgios N Yannakakis, Roddy Cowie, and Carlos Busso. The ordinal nature of emo- tions: An emerging approach. IEEE Transactions on Aective Computing, 2018. 
76, 117, 118, 121, 134 [143] Georgios N Yannakakis and Hector P Martinez. Grounding truth via ordinal annota- tion. In 2015 international conference on aective computing and intelligent interaction (ACII), pages 574{580. IEEE, 2015. 76, 117, 118, 121, 128 [144] Terry K Koo and Mae Y Li. A guideline of selecting and reporting intraclass correlation coecients for reliability research. Journal of chiropractic medicine, 15(2):155{163, 2016. 78 [145] Roger N Shepard. The analysis of proximities: multidimensional scaling with an un- known distance function. i. Psychometrika, 27(2):125{140, 1962. 79 [146] Joseph B Kruskal. Nonmetric multidimensional scaling: a numerical method. Psy- chometrika, 29(2):115{129, 1964. 79 [147] Matth aus Kleindessner and Ulrike von Luxburg. Uniqueness of ordinal embedding. In COLT, pages 40{67, 2014. 80, 81 [148] Kevin G Jamieson, Lalit Jain, Chris Fernandez, Nicholas J Glattard, and Rob Nowak. Next: A system for real-world development, evaluation, and application of active learn- ing. In Advances in Neural Information Processing Systems, pages 2656{2664, 2015. 80 163 [149] Laurens Van Der Maaten and Kilian Weinberger. Stochastic triplet embedding. In Machine Learning for Signal Processing (MLSP), 2012 IEEE International Workshop on, pages 1{6. IEEE, 2012. 80 [150] Lalit Jain, Kevin G Jamieson, and Rob Nowak. Finite sample prediction and recovery bounds for ordinal embedding. In Advances in Neural Information Processing Systems 29, pages 2711{2719. Curran Associates, Inc., 2016. 81 [151] Fabien Ringeval, Bj orn Schuller, Michel Valstar, Roddy Cowie, Heysem Kaya, Max- imilian Schmitt, Shahin Amiriparian, Nicholas Cummins, Denis Lalanne, Adrien Michaud, et al. Avec 2018 workshop and challenge: Bipolar disorder and cross-cultural aect recognition. In Fabien Ringeval, Bj orn Schuller, Michel Valstar, Roddy Cowie, and Maja Pantic, editors, Proceedings of the 2018 on audio/visual emotion challenge and workshop, pages 3{13, Seoul, Korea, October 2018. ACM. 86, 88, 132 [152] Feng Zhou and Fernando De la Torre. Generalized time warping for multi-modal alignment of human motion. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 1282{1289. IEEE, 2012. 86 [153] Fabien Ringeval, Florian Eyben, Eleni Kroupi, Anil Yuce, Jean-Philippe Thiran, Touradj Ebrahimi, Denis Lalanne, and Bj orn Schuller. Prediction of asynchronous dimensional emotion ratings from audiovisual and physiological data. Pattern Recog- nition Letters, 66:22{30, 2015. 92 [154] Feng Zhou and Fernando Torre. Canonical time warping for alignment of human behavior. In Advances in neural information processing systems, pages 2286{2294, 2009. 92 [155] Laurens Van Der Maaten and Kilian Weinberger. Stochastic triplet embedding. In IEEE International Workshop on Machine Learning for Signal Processing (MLSP), pages 1{6. IEEE, 2012. 94 [156] Omer Tamuz, Ce Liu, Ohad Shamir, Adam Kalai, and Serge J. Belongie. Adaptively learning the crowd kernel. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 673{680. ACM, 2011. 94 [157] Fabien Lauer. On the complexity of piecewise ane system identication. Automatica, 62:148{153, 2015. 100 [158] M Govind and TN Ruckmongathan. Trapezoidal and triangular waveform proles for reducing power dissipation in liquid crystal displays. Journal of Display Technology, 4 (2):166{172, 2008. 100 [159] Eduardo Camponogara and Luiz Fernando Nazari. Models and algorithms for optimal piecewise-linear function approximation. 
Mathematical Problems in Engineering, 2015, 2015. 101, 102, 105 [160] Laurent Condat. A direct algorithm for 1-d total variation denoising. IEEE Signal Processing Letters, 20(11):1054{1057, 2013. 105 164 [161] Brandon M. Booth. 2018 continuous annotations. https://github:com/brandon-m- booth/2018 continuous annotations, 2019. 105 [162] Aditya Jain, Ramta Bansal, Avnish Kumar, and KD Singh. A comparative study of visual and auditory reaction times on the basis of gender and physical activity levels of medical rst year students. International Journal of Applied and Basic Medical Research, 5(2):124, 2015. 108 [163] Brandon Booth and Shrikanth Narayanan. Trapezoidal segmented regression: A novel continuous-scale real-time annotation approximation algorithm. In In proceedings of Proceedings of the 8th International Conference on Aective Computing & Intelligent Interaction, September 2019. 112, 113, 115 [164] Ron Artstein and Massimo Poesio. Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4):555{596, 2008. 116, 131 [165] Klaus Krippendor. Reliability in content analysis: Some common misconceptions and recommendations. Human communication research, 30(3):411{433, 2004. 116, 118 [166] Jean Kossai, Georgios Tzimiropoulos, Sinisa Todorovic, and Maja Pantic. Afew-va database for valence and arousal estimation in-the-wild. Image and Vision Computing, 65:23{36, 2017. 118 [167] Stefanos Zafeiriou, Dimitrios Kollias, Mihalis A Nicolaou, Athanasios Papaioannou, Guoying Zhao, and Irene Kotsia. A-wild: Valence and arousal'in-the-wild'challenge. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 34{41, 2017. 118 [168] Laurence Devillers, Roddy Cowie, Jean-Claude Martin, Ellen Douglas-Cowie, Sarkis Abrilian, and Margaret McRorie. Real life emotions in french and english tv video clips: an integrated annotation protocol combining continuous and discrete approaches. In LREC, pages 1105{1110, 2006. 118 [169] Toni Giorgino et al. Computing and visualizing dynamic time warping alignments in r: the dtw package. Journal of statistical Software, 31(7):1{24, 2009. 119 [170] David Melhart, Antonios Liapis, and Georgios N Yannakakis. Pagan: Video aect annotation made easy. In 2019 8th International Conference on Aective Computing and Intelligent Interaction (ACII), pages 130{136. IEEE, 2019. 121, 137 [171] Jean Carletta. Assessing agreement on classication tasks: the kappa statistic. arXiv preprint cmp-lg/9602004, 1996. 129 [172] Dennis Reidsma and Jean Carletta. Reliability measurement without limits. Compu- tational Linguistics, 34(3):319{326, 2008. 131, 134 [173] Kevin G Jamieson and Robert Nowak. Active ranking using pairwise comparisons. In Advances in Neural Information Processing Systems, pages 2240{2248, 2011. 143 165 APPENDIX A 166
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Learning multi-annotator subjective label embeddings
Computational models for multidimensional annotations of affect
Computational modeling of human behavior in negotiation and persuasion: the challenges of micro-level behavior annotations and multimodal modeling
Human behavior understanding from language through unsupervised modeling
Machine learning paradigms for behavioral coding
Generative foundation model assisted privacy-enhancing computing in human-centered machine intelligence
Automatic quantification and prediction of human subjective judgments in behavioral signal processing
Behavioral signal processing: computational approaches for modeling and quantifying interaction dynamics in dyadic human interactions
Parasocial consensus sampling: modeling human nonverbal behaviors from multiple perspectives
Towards social virtual listeners: computational models of human nonverbal behaviors
Knowledge-driven representations of physiological signals: developing measurable indices of non-observable behavior
Beyond parallel data: decipherment for better quality machine translation
Improving language understanding and summarization by leveraging auxiliary information through self-supervised or unsupervised learning
Computational methods for modeling nonverbal communication in human interaction
Behavior understanding from speech under constrained conditions: exploring sparse networks, transfer and unsupervised learning
A computational framework for diversity in ensembles of humans and machine systems
Complete human digitization for sparse inputs
Deep learning techniques for supervised pedestrian detection and critically-supervised object detection
Speech recognition error modeling for robust speech processing and natural language understanding applications
Human appearance analysis and synthesis using deep learning
Asset Metadata
Creator
Booth, Brandon Michael (author)
Core Title
Improving modeling of human experience and behavior: methodologies for enhancing the quality of human-produced data and annotations of subjective constructs
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science
Publication Date
09/20/2020
Defense Date
08/21/2020
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
human annotation, in-situ sensing, movie violence, OAI-PMH Harvest, perceptual similarity warping, subjective constructs, trapezoidal signals, triplet ordinal embedding
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Narayanan, Shrikanth (committee chair), Margolin, Gayla (committee member), Nakano, Aiichiro (committee member)
Creator Email
bbooth@usc.edu, brandon.m.booth@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c89-370569
Unique identifier
UC11665834
Identifier
etd-BoothBrand-8988.pdf (filename), usctheses-c89-370569 (legacy record id)
Legacy Identifier
etd-BoothBrand-8988.pdf
Dmrecord
370569
Document Type
Dissertation
Rights
Booth, Brandon Michael
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA