Modeling and Regulating Human Interaction with Control Affine Dynamical Systems

by Victor Ardulov

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, in Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

August 2022

Copyright 2022 Victor Ardulov

For my parents, Yuri and Maria, for everything you have done to help me here.

Acknowledgements

This dissertation could never have been possible without the constant support and guidance of my advisor, Shrikanth Narayanan. From our first conversations, and throughout the past many years, thank you for creating a space in which I could learn to be a better researcher and explore the topics that interested me.

To the many members of the USC Computer Science community: Professors Maja Mataric, Jonathan Gratch, David Traum, Paul Rosenbloom, Fei Sha, and Gale Lucas, thank you for your input and feedback on my research and dissertation throughout my qualifying exams and thesis proposal process. To Lizsl and Tanya, thank you for all of your support navigating the many different parts of the PhD process, for helping make sure I was always on top of everything, and for all of the kind conversations we had.

I am especially grateful to Professor Tom Lyon and all of the members of the Child Forensic Interviewing Lab, who have been critical to my understanding and the motivation behind much of the work presented here. Thank you to Drs. Somer Bishop and Shuting Zheng for all of your patience and constant feedback as I familiarized myself with the realm of Autism and Autism diagnostics. Finally, thank you to Drs. Dave Atkins, Torrey Creed, and Zac Imel for your continued input throughout the DEPTH and CBT projects.

To my parents, Yuri and Maria, thank you for instilling in me a passion for learning and working tirelessly to give me the opportunities to reach this point.
Life placed you in incredible circumstances and you had to make difficult decisions, but your love and wisdom are a constant source of inspiration for me. To my sister Olga, and her family Nathan, Astrid, and Odin, thank you for your perpetual kindness, thoughtfulness, and the tremendous amount of happiness you have brought into my life. Without you all, this work would not have been possible.

This work would not be possible without the constant input from all of my SAIL labmates: Jimmy, Victor Martinez, Krishna, Nikos, Manoj, Amrutha, Rimita, Digbalay, and Rajat. Whether it was witty banter or serious discussion, I am grateful for all that I have learned from you. Zane, Madelyn, and Neha, thank you for being exceptional undergraduate collaborators and students, and for inspiring me to learn together. To all the friends who have supported me and endured the many long expositions, Danya, Anastasiya, Xochitl, Carlos, Brianna, Bryan, Maryam, Megan, Marcello, Stephen, Juliette, and Danny, thanks for lending so many ears throughout the years.

Finally, an acknowledgement to my grandparents, Victor Gavrilovich, Berta, Vitaly, and Tamara, and my friends Mark Ginzburg and Alex Rosen, who instilled in me so much of what made me capable of this work and are not here to see it: thank you for believing in me and being so instrumental in setting me on my course.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Prior Work
    1.1.1 Behavioral Signal Processing
    1.1.2 Dialogue Management

Part I: Control Affine Dynamics of Behavioral Systems

Chapter 2: Interpersonal Coordination as an Indicator of Child Truthfulness
  2.1 Broken Toy Interviews
    2.1.1 Deception Detection
  2.2 Methods
    2.2.1 Data
      2.2.1.1 Psycholinguistic Norms
      2.2.1.2 Acoustic Features
    2.2.2 Classification Task and Baselines
      2.2.2.1 Truth-telling Task
      2.2.2.2 Disclosure Task
    2.2.3 Multi-modal Aggregate Features
    2.2.4 Modelling Cross-modal Dynamics
  2.3 Results
    2.3.1 Truth-telling Task Results
    2.3.2 Disclosure Task Results
    2.3.3 Discussion

Chapter 3: A Control Theoretic Perspective of Child Forensic Interviews
  3.1 Child Forensic Interviewing Evaluation
  3.2 Methods
    3.2.1 Data
    3.2.2 Data Notation
    3.2.3 Verbal Productivity
      3.2.3.1 Agenda
      3.2.3.2 Responsiveness
    3.2.4 Psycholinguistic Norms
    3.2.5 Acoustic Features
    3.2.6 Dynamical System Models
      3.2.6.1 Linear Mixed Effects Models
      3.2.6.2 Dynamic Mode Decomposition with Control
  3.3 Results
    3.3.1 Verbal Productivity
      3.3.1.1 Agenda and Responsiveness
      3.3.1.2 Signal Sparsity
      3.3.1.3 Reducing Age Correlation
    3.3.2 Modeling Productivity with Linear Mixed Effects Models
    3.3.3 Modeling Dynamics Using DMDc Framework
  3.4 Discussion
    3.4.1 Verbal Productivity Measures
    3.4.2 Dynamical System Models

Chapter 4: Local Dynamic Modes of Cognitive Behavioral Therapy
  4.1 Prior Work
    4.1.1 Automated Therapy Evaluation
  4.2 Methods
    4.2.1 Data
    4.2.2 Tasks and Baselines
    4.2.3 Windowed Dynamic Mode Decomposition
    4.2.4 Models
      4.2.4.1 Aggregating Global Scores from Local Predictions
  4.3 Results
    4.3.1 Local Predictions
    4.3.2 Session-Level Predictions
  4.4 Discussion

Part II: Optimizing Interaction Strategies

Chapter 5: Optimizing Interactions for Neuro-developmental Diagnostics
  5.1 Data-Driven Diagnostics
  5.2 Autism Diagnostic Interview - Revised
  5.3 Reinforcement Learning
  5.4 Methods
    5.4.1 Data
    5.4.2 Baselines
    5.4.3 Q-Learning
    5.4.4 Experiments
  5.5 Results
    5.5.1 Policy
  5.6 Conclusion

Chapter 6: Conclusions and Future Work
  6.1 Child Truthfulness and Disclosure
  6.2 Child Forensic Interviewing
  6.3 Cognitive Behavioral Therapy
  6.4 Optimizing Interaction Strategy

References

Appendices
  A Child Forensic Interviewing
  B Robust Diagnostic Q-Learning
  C Single-lag and Multi-lag GCA Features
  D Q-Learning Algorithm
List of Tables

2.1 Number of session transcripts in the dataset for each transgression and disclosure condition.
2.2 Human and probabilistic sampling baselines for both tasks. F1-score, accuracy, precision, and false negative rate (FNR) are considered for each baseline. The t and d values represent thresholds of significant results for the truth-telling and disclosure tasks, respectively. h0, h1, and h2 represent the 50th, 84th, and 97.5th percentiles of simulated human performance for the truth-telling task, respectively.
2.3 Truth-telling task performance using session-level aggregated features and a feature significance threshold of p < 0.30. * indicates performance better than the randomized bootstrap t. ** indicates performance better than the human simulation h2. Bold values indicate the best performance in their respective columns. Pos. Acc. and Neg. Acc. indicate the accuracy for the positive and negative classes, respectively. The best performing L-SVM hyperparameters are C = 1 along with balanced class weights. The decision tree classifier used entropy as the splitting criterion, while the random forest classifier used Gini impurity.
2.4 Disclosure task performance using session-level aggregated features and a feature significance threshold of p < 0.20. * indicates performance better than the randomized bootstrap d. Bold values indicate the best performance in their respective columns. Pos. Acc. and Neg. Acc. indicate the accuracy for the positive and negative classes, respectively. The best performing L-SVM hyperparameters are C = 0.1 along with balanced class weights. The decision tree classifier used entropy as the splitting criterion, while the random forest classifier used Gini impurity.
2.5 Truth-telling task performance using Granger causal features and a feature significance threshold of p < 0.15. * indicates performance better than the randomized bootstrap t. ** indicates performance better than the human simulation h2. Bold values indicate the best performance in their respective columns. Pos. Acc. and Neg. Acc. indicate the accuracy for the positive and negative classes, respectively. The best performing L-SVM hyperparameters are C = 0.01 along with balanced class weights. The decision tree classifier used entropy as the splitting criterion, while the random forest classifier used Gini impurity.
2.6 Disclosure task performance using Granger causal features and a feature significance threshold of p < 0.10. * indicates performance better than the randomized bootstrap d. Bold values indicate the best performance in their respective columns. Pos. Acc. and Neg. Acc. indicate the accuracy for the positive and negative classes, respectively. The best performing L-SVM hyperparameters are C = 0.001 along with balanced class weights. The decision tree classifier used Gini impurity as the splitting criterion, while the random forest classifier used entropy.
2.7 Average cross-fold feature importances assigned by the L-SVM models for disclosure.
3.1 Top 10 weighted words from 5 agendas.
3.2 Comparison of productivity scores produced by different methods and their relative rank (1 being the highest rank), demonstrating the ability to score substantive information from utterances. For Responsiveness+Agenda, the hyperparameters are chosen to be 0.5.
3.3 Pearson correlation (r) indicating the strength of the relationship between different productivity metrics and the age of the child. All scores were significant at p < 0.001.
3.4 LME analysis studying the effect of (a) prosodic features and (b) lexical features on verbal productivity. t-statistics for statistically significant features are reported here (p < 0.05; p < 0.01).
4.1 Abbreviations of sub-scores.
4.2 Bootstrapped baselines by window size. Each value represents 3σ significance, or three standard deviations above the mean.
4.3 Baselines for session-level codes.
4.4 Best performing models on local samples of segments by window size, with models of identical performance per a given window size.
4.5 Models performing the best on the global task.
5.1 Demographic distributions. The p-values for t-tests suggest that we cannot reject the null hypothesis that the groups are drawn from the same population according to these features.
5.2 Items used, their descriptions, and the Student's t-test statistic. All p-values < 1.06 × 10⁻³, indicating significance with Bonferroni correction.
5.3 F1-score initially when all data is available, and subsequently when only one feature is available.
5.4 Variance and range of feature importance for each of the classifiers.
6.1 Definitions of and limits on variables.
6.2 Truth-telling task performance using Granger causal features from the multi-lag feature set and a feature significance threshold of p < 0.30. * indicates performance better than the randomized bootstrap t. ** indicates performance better than the human simulation h2. Bold values indicate the best performance in their respective columns. Pos. Acc. and Neg. Acc. indicate the accuracy for the positive and negative classes, respectively.
6.3 Disclosure task performance using Granger causal features from the multi-lag feature set and a feature significance threshold of p < 0.10. * indicates performance better than the randomized bootstrap d. Bold values indicate the best performance in their respective columns. Pos. Acc. and Neg. Acc. indicate the accuracy for the positive and negative classes, respectively.

List of Figures

2.1 An overview of possible interview outcomes depending on transgression (toy-breaking) and disclosure conditions.
2.2 The distribution of F_GCA comparing child imageability in response to adult imageability for the truth-telling task. Kernel density smoothly approximates the probability density of the two distributions. As shown in Figure 2.3, the imageability-imageability F_GCA is the most predictive causality score for the truth-telling task.
2.3 A visualization of the feature set that obtained the model with the highest F1 score for the truth-telling task. The F-statistic measures which combinations of signals are most differentiating for the truth-telling task according to a one-way ANOVA. The annotation indicates which lag had the strongest measured causal effect. All F-statistics are reported with p < 0.15. The distribution of the strongest predictor, imageability-imageability, can be seen in Figure 2.2.
2.4 A visualization of the feature set that obtained the model with the highest F1 score for the disclosure task.
The F-statistic measures which combinations of signals are most differentiating for the disclosure task according to a one-way ANOVA. The annotation indicates which lag had the strongest measured causal effect. All F-statistics are reported with p < 0.20.
3.1 Expressiveness (in terms of average turn-level word count) of children across age groups, collected over 527 forensic interview transcripts.
3.2 Age distribution across 527 sessions.
3.3 Various proposed productivity scores over the course of a single session: agenda (top left), responsiveness (top right), combined (bottom left), and word count (bottom right). Each score is normalized with respect to its maximum.
3.4 Distribution of scores and variances by age using only the Agenda criterion.
3.5 Distribution of scores and variances by age using only the Responsiveness criterion.
3.6 Distribution of scores and variances by age using the combined Agenda/Responsiveness scoring method.
3.7 Dominant eigenvalues associated with the transition matrix of child state evolution, indicative of the dynamics that are dominantly present in the observed time-series.
3.8 Distribution of dominant singular values, with "All" representing all age groups together, and EC and LC representing "Early Childhood" and "Late Childhood" respectively. Larger dominant singular values indicate a larger contribution to the output from the control input.
3.9 Relative scale of influence on productivity across age groups and features. The average absolute value of each row is used to calculate the contribution to productivity. The contributing components from the derived "transition" (left) show how the child's own state contributes to their future state, while the "controller" (right) indicates which components of interviewer speech contribute most to the child's productivity. μ_i and σ_i correspond to the mean and standard deviation of the therapist's prosodic intensity, while μ_p and σ_p correspond to the pitch. v, a, and p correspond to the valence, arousal, and pleasantness of the interviewer's utterance. [L_1 ... L_5] correspond to the five most important "agenda" terms and their appearance in the previous utterances.
4.1 Distribution of the different CTRS subscores and the total CTRS scores in the data.
4.2 Each talk turn is converted into its respective vector representation via the DistilBERT transformer model, and then used to construct the X and Y matrices, which are in turn conditioned on the window size w.
4.3 Number of non-zero eigenvalues for dynamics computed at varying window sizes, for both the transition (T) and controller (C).
4.4 With the windows extracted using the pipeline outlined in Figure 4.2, the windowed DMDc is applied in order to find the corresponding transition and controller matrices [A; B]_t and their respective eigenvalues. These are used to make local predictions, which are then accumulated and passed into the aggregate model for a global prediction.
4.5 A sample trajectory of CTRS scores accumulating over the course of two sessions, one of high quality and one of low quality.
5.1 The ADI-R administration process. The parent is interviewed by a clinician. The clinician asks open-ended questions that are tied to an item and listens to the responses from the parent. Typically the clinician is listening and asking about specific examples of the child's behavior in relation to the item at hand. The clinician records a rating based on the presented information and can leave notes for themselves. After the interview is complete, the clinician uses their recorded ratings to complete the ADI-R algorithm, computing whether the child meets the instrument's cut-off thresholds for ASD.
5.2 Distribution of demographic information (age, FSIQ, and VIQ) across different diagnostic conditions.
5.3 Process demonstrating how a single example is converted into masked examples. The 0s represent values that are unavailable to the classifier a priori and will potentially be imputed. The notation C(m, n) ("m choose n") represents the number of examples generated by masking n items.
5.4 F1-score degradation as more features are masked from the inputs.
5.5 An example of how a policy updates with all possible responses to an inquiry. The top row captures the initial "empty" state of the policy, while the branches represent all of the possible state updates that could occur depending on the observation made following the action taken. The column vector represents the state of the policy, or the items that the policy has information about so far. The horizontal bar chart captures the relative Q-value of each action (actions are equivalent to querying an item or making a prediction). As ADI 45 has the highest Q-value, it is the first item queried by the policy. The arrows capture the possible responses, or observations, that the policy can receive, which in turn are used to update the state. The vertical bar chart captures the current state's predicted probabilities of ADHD and ASD, respectively (Belief).
5.6 Importance of different items relative to each other according to different model types.

Abstract

Human interaction is a vital component of a person's development and well-being. These interactions enable us to overcome obstacles and find resolutions that an individual alone might not be able to. This subject is particularly well studied in the domains of human psychology, where human behavior is diagnostically categorized and the interaction can be utilized in order to improve somebody's health. Prior work has explored the use of computational models of human behavior to aid in the diagnostic assessment of behavioral patterns. Most recently, novel machine learning methods and increased access to data have invited the study of the dynamics of human interaction at a more granular time-resolution. These dynamics have been used to identify specific moments during interactions that are relevant to the overall assessment of an individual's behavior with respect to their interlocutor. By reformulating this system from the perspective of an operator that can be controlled, it becomes possible to predict how an individual would react to a specific input from their partner, which in turn lends the opportunity to plan interventions and probes more effectively. This dissertation presents a formulation of human interaction through a systems theoretic paradigm with a control affine element, and demonstrates how these frameworks can be utilized to gain insight into improving desired outcomes and approaches towards optimizing interaction strategies. In support of the thesis, we present the application of these techniques to the domains of forensic interviewing, psychotherapy, and neurodevelopmental diagnostics.

Chapter 1: Introduction

Consider a teacher being introduced to a student for the first time. The pair is interested in reaching an objective together, perhaps to teach the student a particular concept.
The teacher likely uses exercises as a tool to guide the student towards achieving the desired outcome: learning. However, the exercise also serves as a probe, offering insight into the student's current state of understanding. In this setting, the teacher has decisions to make, leveraging resources (e.g. class time, emotional capacity, and support systems) to get as close as they can to the class objectives. They are making decisions trying to optimize positive outcomes within the constraints set before them. Meanwhile, the student is also taking actions of their own, based on the context, objectives, and constraints. As they engage with the instruction, classroom, and teacher, they are integrating these inputs into their current understanding of the material and taking the steps necessary to learn it. These types of interactions are encountered in many different domains of human interaction: therapists engaging with their clients, patients with their doctors, investigators with witnesses; people observe and guide each other towards objectives, constrained by the resources available.

Considering the interactions above within the context of computational frameworks, the work outlined in this thesis reimagines the interaction as a control affine dynamical system. More concretely, the guidance of an interlocutor and their conversational partner's state during an interaction is studied as a multimodal system that would evolve autonomously, but is influenced by the signals produced. This allows us to utilize well established techniques from the domain of Control Theory and introduce methods capturing directly interpretable representations of the dynamics. Consequently, it introduces opportunities to consider feedback control optimization methods to help explore and identify strategies that lead to the
The central focus of the work presented drives to demonstrate that control theoretic analysis of dynamical systems describe human interactions under a holistic framework to consider and optimize engagement policies with respect to social outcomes and constraints. In support of this thesis, we will rst discuss the space of behavioral signal processing, aec- tive computing, and natural language dialogue systems (x1.1) and the many building blocks that lay a foundation to build our approaches upon. Chapter 2 demonstrates how consider- ing in uence to measure the coordination between speakers leads to insights in determining when statements made during an interview are truthful or if the interviewee is prepared to disclose. Following that, a method for measuring verbal productivity is introduced and demonstrates the eectiveness of dynamical systems models to holistically characterize the interactions of a forensic interview (Chapter 3) and cognitive therapy (Chapter 4). Finally in Chapter 5 we will demonstrate an approach to utilize control optimization in the context of diagnosis of neurodevelopmental disorders. 1.1 Prior Work 1.1.1 Behavioral Signal Processing In order to broach the problem of managing human interactions, it is critical to rst under- stand and adequately model the linguistic and paralinguistic aspects of conversations. Be- havioral Signal Processing (BSP) is a eld which outlines the application of signal processing 2 techniques developed for problem spaces such as localization, navigation, and communica- tion, to gain insights into observed behaviors and underlying models of behavior expression Narayanan and Georgiou [2013]. Many of these techniques utilize statistical methods for the estimation of latent states and the probability that certain behavioral classes are met. These approaches typically integrate multimodal streams of information such as acoustic, visual, and lexical. 
Relevant to this thesis, BSP provides a framework within which models conditioned on the latent states can be leveraged to predict the evolution of an interlocutor's behavior throughout a conversation. Our methods build on this to more explicitly model controller operators that describe the way in which a guided party involved in the dyad integrates the input from the guiding party. BSP naturally extends itself to applications in the domain of aective computing due to the inherently intertwined nature of aect and behavior. Behavioral expression can be moti- vated by aect, and aect in turn is critical in the internalization and regulation of behavior. Therefore when considering the approaches for the representation of guided human-human interaction it is fundamental to account for the aective content of the interaction we will often rely on explicit aective constructs rather than relying solely on the observable behav- ioral signals. These will largely be realized in our work through the use of psycho-linguistics which are representations, such as EmotiWord, which capture the associated aective content of language Malandrakis et al. [2011]. Some concrete examples of BSP methods could support automated psychotherapy eval- uation was introduced by Gibson et al. [2016] which used deep recurrent neural networks (RNN) to predict session level empathy ratings for therapy session practicing Motivational Interviewing (MI) from dialogue transcripts. Gibson et al. [2017] built on this work by introducing attention mechanisms on top of RNNs for automatically identifying MI skills on an utterance level. These results demonstrate how behavioral signal data, acoustic and linguistic time-series, could be evaluated to construct representation of nuanced dialogue 3 acts. 
By predicting local actions taken by the therapist, these models gave way for evaluat- ing strategies on a more generalizable level, for example allowing the evaluation of dierent question types on therapeutic outcomes. Most recently Flemotomos et al. [2021b] described a complete pipeline which is able to translate from raw audio to full transcripts, with ut- terance and session level predictions for MI behavioral skills and assessment ratings. These pipelines demonstrate the power of providing feedback in a timely manner, and open a natural opportunity for continued learning and renement. In contrast to previous time-series methodologies, the work outlined in this thesis will take the perspective that dynamics of the behavior during an interaction can be explicitly studied using control theoretic approaches, which account for the input from the therapist in a more specic manner. 1.1.2 Dialogue Management Natural Language Dialogue Systems (NLDS) are methods outlined for understanding and generating many of the phenomena associated with dialogues, and improving computational models capable of anticipating and responding adequately. Beginning with simple textual understanding and transactional interactions that were predominantly rule based, new av- enues in statistical language modeling, BSP, AC, and increases in the abundance of data, have given way to incredibly powerful often emotionally capable agents. Many techniques presented further in this thesis are inspired by approaches, particularly those in dialogue understanding, to gain insight into these interactions. The objective of this thesis is to demonstrate to utilize the natural ability of a interlocutor and explore how the specic engagement to circumvent the various and dicult challenges associated with adequately modelling aect-aware spoken dialogue agents. 
Doing so allows us to consider interactions unencumbered by the limitations of modern affective speech synthesis and dialogue modeling, instead investigating how the human-in-the-loop might be leveraged to guide the dialogue to its desired state.

Part I: Control Affine Dynamics of Behavioral Systems

Chapter 2: Interpersonal Coordination as an Indicator of Child Truthfulness

Being able to discern between truthful and deceptive statements during interactions is a vital conversational skill. Deception can take on many forms, from outright misinformation, to deception via omission or deflection, including multi-layered combinations of these techniques. Being able to discern and identify moments of deception can be critical for extracting the context and accurate information necessary to navigate high-stakes situations. One such scenario is the recording of legally viable testimony from a child. The task is made harder by the often nascent developmental state of the child, as their behavior is significantly more variable across individuals and ages, and their communication abilities and understanding of the circumstances may be incomplete.

The following chapter explores different methods of learning to predict whether a child's statements are truthful. During an interview pertaining to an experimentally set-up transgression, it is possible to accurately evaluate whether a particular event occurred and whether the child disclosed it. The methods presented explore how different behavioral signals compare as indicators of the underlying condition of a child's statements, and then highlight that it is the interaction and coordination of these behavioral signals that provide the stronger indicators, rather than just the content of the signals.

Our work defines a controllable system that can be studied using Granger causal analysis to capture information about the coordination between the child's and interviewer's behavioral signals.
Our results demonstrate that these features are not only powerful predictors of truthfulness, but also indicate with high fidelity whether or not a child will disclose. Doing so allows us to characterize how child-interviewer interactions differ among these conditions, enabling interviewers to create new strategies for adapting their speech and language to better identify truthful statements or to determine whether a child is prepared to disclose that a transgression occurred.

2.1 Broken Toy Interviews

The physical and psychological immaturity of children makes them vulnerable to abuse, affecting their mental health into adolescence and adulthood [Fergusson et al., 1996, Kaplow et al., 2008, Hillberg et al., 2011]. Since children are most often victims of or witnesses to crimes committed by their legal guardians or caretakers, eliciting accurate and truthful testimony can remove them from dangerous living environments. However, this same context can create a conflicting desire to testify for the child, or children can be coached and coerced to falsely admit or omit when asked about the crimes [Lamb et al., 2003, Radford et al., 2011, Talwar et al., 2018]. This challenge is further compounded because the same elements of a child's cognitive development and language ability that contribute to their vulnerability also influence their ability to consistently recount incidents in courtroom settings. This makes their testimony subject to coercion and intimidation, often leading to its dismissal in court [Lyon et al., 2017]. Further still, child testimony is often necessary since child victims are typically the sole witnesses to the crimes against them. To combat the many challenges associated with child testimony, for legal proceedings and investigations alike, Child Forensic Interviews (CFIs) are administered to elicit testimony from a child in a controlled environment, an approach which has been shown to reliably obtain information [Lyon, 2005, Brown and Lamb, 2015].
The CFI relies on a two-phase approach: rapport building and incident recall. During rapport building, the interviewer asks about innocuous topics to get the child comfortable narrating and responding to open-ended questions. Once the interviewer feels that the child is in a state where they are reliably responding, they begin the recall section of the interview and ask questions more directly pertinent to the investigation. Since the questions are open-ended, the child is not pressured to disclose specific details, and thus the testimony can be legally admissible in future proceedings.

In order to study the effectiveness of strategies and approaches, child psychologists and legal scholars constructed the Broken Toy experiments. These experiments were designed to induce a low-stakes transgression that would create a situation in which a child might experience a conflict in disclosing. The experiments in the present study are an interaction between a child and two adult experimenters: a confederate and an interviewer. A session begins with the confederate and the child playing in a room full of toys. The confederate is informed prior to the session as to whether or not one of the toys is designed to break during play. If a toy breaks, then a transgression has occurred, and the confederate tells the child that an interviewer will enter the room to ask some questions. The confederate then asks the child to promise not to tell the interviewer about the toy breaking because they "could get in trouble". The confederate exits the room, and the interviewer enters. Blind as to whether or not a transgression occurred, the interviewer follows a modified CFI protocol beginning with the rapport building phase, adhering to a semi-structured script that does not discuss the toys. After rapport building, the interviewer enters the recall phase, and asks the child to name each toy and describe what happened with it.
The interviewer asks the child about each toy exactly once, and only repeats a question if the child indicates that they did not understand. If the child admits to a toy being broken, the child is said to have disclosed. As Figure 2.1 illustrates, the process can end in 4 possible outcomes depending on the transgression and disclosure conditions.

Figure 2.1: An overview of possible interview outcomes depending on transgression (toy-breaking) and disclosure conditions.

If the transgression condition matches the disclosure condition, we consider it a moment of truthfulness; otherwise it is a moment of deception.

2.1.1 Deception Detection

When presented with statements made by children during forensic interviews, adults performed just slightly better than random guessing, yielding an average accuracy of 54% [Gongola et al., 2017]. This work further suggests that relevant professional backgrounds did not significantly impact an adult's ability to predict whether a child was telling the truth, and that adults performed better when identifying true non-disclosure statements (59% accurate) than false non-disclosures (49% accurate). More recent studies examined the effects of specific interview instructions and their impact on adults' abilities to detect deception in interviews [Gongola et al., 2018, 2020]. These studies showed that adults tend to be biased towards believing that a child's statements are truthful [Gongola et al., 2020], and that additional specific interview instructions only slightly improved adult performance [Gongola et al., 2018].

Automated deception detection in courtroom settings has been largely confined to adult subjects, utilizing multi-modal streams of features including video, audio, and text [Mihalcea and Strapparava, 2009, Pérez-Rosas et al., 2015, Mathur and Matarić, 2020].
Mathur and Matarić extended this work to demonstrate how low-stakes deception data can be utilized to learn representations of high-stakes deception within sparse data environments. Meanwhile, work on automated deception detection in child broken toy experiments used syntactic linguistic features [Yancheva and Rudzicz, 2013]. Related to the effort outlined in this chapter, the work completed in Ardulov et al. gave us an initial understanding of how emotional content and language compare as indicators of child truthfulness, which in turn informs the discussion surrounding modality and feature extraction in §2.2.1.

In contrast, the work presented here utilizes novel acoustic features and considers the coordination and relative causal dependencies between child and interviewer turn-level features to better understand the dynamics of child truth-telling within the interview. Additionally, a new task is introduced which evaluates the rapport-building phase to see if there are causal relationships that indicate whether or not a child will disclose later during recall, under the assumption that a transgression occurred.

2.2 Methods

2.2.1 Data

The data consist of 200 interactions between school-aged children and experimenters. Each session is associated with a unique child, while the same adults may appear in multiple sessions. Specific demographic information about each child and experimenter is unknown; however, the children included in the study are generally of elementary school age (< 12 years old) and the experimenters are legal and psychology domain experts trained in the CFI protocol. The sessions are transcribed and annotated as to whether the child indicated that one of the toys broke. The frequency of each transgression and disclosure condition in our dataset is shown in Table 2.1.

                    Transgression   No Transgression
    Disclosure           40                 2
    Non-Disclosure      109                49

Table 2.1: Number of session transcripts in the dataset for each transgression and disclosure condition.
2.2.1.1 Psycholinguistic Norms

Psycholinguistic norms are a numerical representation of a word's general perceived alignment with certain affective and cognitive measures. Malandrakis et al. [2011] developed EmotiWord, a dictionary that maps words to psycholinguistic norms constructed via crowd-sourced perception. Each psycholinguistic norm lies on a bounded and continuous domain of [−1, 1]. For example, a word's pleasantness measures its degree of pleasant feelings: "cookies" has a pleasantness of +1, while "bedbug" has a pleasantness of −1.

More specifically, this study uses the psycholinguistic norms of valence, arousal, pleasantness, and age of acquisition, as these affective and cognitive signals have been shown to be predictive of child deception in Ardulov et al. [2020a]. Child interviewers have also observed that children utilize vague language as an attempt to avoid admitting to a transgression [Clemens et al., 2010, Gongola et al., 2021], so concreteness and imageability, which correspond to a word's descriptiveness and clarity, are also used as psycholinguistic indicators of deception.

Psycholinguistic norms for each word spoken by the child and the interviewer (excluding backchannels) are averaged along each conversational turn. The constructed time-series signal captures an observable representation of how the child responds to the language of the interviewer. In the event that a turn has no words, such as in cases of non-verbal or exclusively back-channeled responses, the value is coded as neutral and represented by 0.

2.2.1.2 Acoustic Features

Annotated time boundaries for the rapport building and recall sections of the interview are used to generate segments from which features are extracted from the interview audio.
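The turn-level averaging described above can be sketched as follows. This is illustrative only: the `NORMS` lexicon here is a two-word stand-in for EmotiWord (which is not reproduced), and the function names are hypothetical.

```python
# Sketch of turn-level psycholinguistic feature extraction, assuming a
# hypothetical NORMS lexicon mapping words to norm values in [-1, 1].
NORMS = {
    "cookies": {"pleasantness": 1.0, "valence": 0.9, "arousal": 0.4},
    "bedbug": {"pleasantness": -1.0, "valence": -0.8, "arousal": 0.6},
}

def turn_norms(words, key):
    """Average one psycholinguistic norm over a conversational turn.

    Turns with no in-lexicon words are coded as neutral (0.0)."""
    scores = [NORMS[w][key] for w in words if w in NORMS]
    return sum(scores) / len(scores) if scores else 0.0

# One time-series value per turn, e.g. for pleasantness:
turns = [["i", "like", "cookies"], [], ["a", "bedbug"]]
series = [turn_norms(t, "pleasantness") for t in turns]
print(series)  # [1.0, 0.0, -1.0]
```

Applying this per speaker and per norm yields one time-series channel per (speaker, norm) pair, which is the representation used in the analyses that follow.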
Each segment is then processed using an off-the-shelf model [1] to align the audio with the text transcriptions using forced alignment [Moreno et al., 1998], in which each word in the transcript is aligned with a timestamp irrespective of the confidence of the alignment. Turn-level features for speaking rate and latency are then calculated for both the child and the interviewer.

2.2.2 Classification Task and Baselines

For each binary classification task described below, a set of models is trained to classify a given interview session. Specifically, Gaussian Naive Bayes (GNB), Decision Tree (DT), Random Forest (RF), and Linear Support Vector Machine (L-SVM) models are trained using stratified 5-fold cross-validation (CV). Results and baselines are reported as the average CV F1-score, accuracy (Acc.), precision (Prec.), and false negative rate (FNR). For the DT and RF models, the splitting criterion was tuned, while for the L-SVM the penalty parameter and the use of class-balanced loss penalties were tested. In our analyses, we report the performance and associated hyperparameters with the highest average CV F1-score. Model implementations were used from scikit-learn [2].

[1] https://github.com/lowerquality/gentle
[2] https://github.com/scikit-learn/scikit-learn

2.2.2.1 Truth-telling Task

The truth-telling task is an evaluation of the interactions that resulted in non-disclosure. A true non-disclosure corresponds to when the toy did not break and the child did not disclose (n = 49). In contrast, a false non-disclosure is an instance in which the toy breaks, but the child does not disclose that the toy broke (n = 109). Turn-level features from both the rapport building and free recall sections of the interview are used for the analysis. The truthful (true non-disclosure) and deceptive (false non-disclosure) interviews are labeled as the positive and negative classes, respectively.
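The cross-validation setup described above can be sketched with scikit-learn. This is a minimal sketch under stated assumptions: the interview features are not public, so synthetic data stand in for them, and the feature dimensionality shown is arbitrary.

```python
# Minimal sketch of stratified 5-fold CV over the four model families,
# using synthetic data in place of the (non-public) interview features.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(158, 12))       # e.g. 158 sessions x 12 features (assumed)
y = rng.integers(0, 2, size=158)     # truthful (1) vs deceptive (0)

models = {
    "GNB": GaussianNB(),
    "DT": DecisionTreeClassifier(criterion="entropy"),
    "RF": RandomForestClassifier(),
    "L-SVM": LinearSVC(C=1.0, class_weight="balanced"),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv,
                            scoring=["f1", "accuracy", "precision"])
    print(name, round(scores["test_f1"].mean(), 3))
```

In the actual experiments, the hyperparameters named in the text (splitting criterion, penalty parameter C, class weighting) would be swept and the configuration with the highest mean CV F1 reported.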
Due to the relatively poor performance of adults on child deception detection, which is only 54% accurate on average [Gongola et al., 2017], two baselines are constructed using bootstrapping: a simulated human baseline (h_2 in Table 2.2) and a probabilistic sampling from the training distribution (t). After 10,000 simulations, the baselines are set at two standard deviations above the mean, corresponding to the 97.5th percentile of the 10,000 simulations. See Table 2.2.

           F1      Acc.    Prec.   FNR
    d      0.462   0.653   0.482   0.257
    t      0.475   0.633   0.493   0.277
    h_0    0.448   0.542   0.388   0.312
    h_1    0.505   0.591   0.440   0.264
    h_2    0.561   0.639   0.494   0.217

Table 2.2: Human and probabilistic sampling baselines for both tasks. F1-score, accuracy, precision, and false negative rate (FNR) are reported for each baseline. The t and d values represent thresholds of significant results for the truth-telling and disclosure tasks, respectively. h_0, h_1, and h_2 represent the 50th, 84th, and 97.5th percentiles of simulated human performance for the truth-telling task, respectively.

2.2.2.2 Disclosure Task

To discover indicators during rapport building that may signal to an interviewer that the child is ready to disclose, the disclosure task evaluates the turn-level features extracted from the rapport building phase to predict whether a child will disclose during the recall phase. While studying false disclosure is of general interest, pertaining to how strategies might coerce disclosure of events that did not occur, its occurrence in our data is sparse, and our objective is to recognize indicators which might lead to disclosure under the condition that a transgression did occur. Consequently, only the case where a transgression occurred is considered for this classification task; thus, the two outcomes of interest are true disclosure (n = 40) and false non-disclosure (n = 109). Similar to the truth-telling task, the truthful (true disclosure) and deceptive (false non-disclosure) interviews are labeled as the positive and negative classes, respectively. Similarly, since the event we are trying to predict occurs during recall, the features are aggregated exclusively from the rapport building phase of the relevant interviews. Since no known human baseline exists for this task, only the probabilistic sampling baseline, similar to the one described for truth-telling, is evaluated against. For the disclosure task, this baseline is denoted d in Table 2.2.

2.2.3 Multi-modal Aggregate Features

Our study seeks to demonstrate that it is the dynamics of the child's behavior that are indicative of their willingness to disclose or of the truthfulness of their statements. To better determine the effectiveness of using these features, a set of baseline models is additionally trained. Rather than using measures of behavioral coordination elicited from a dynamical system, these models take as input aggregate quantities over the available signals. In this way, our experiments illustrate the importance of the temporal context within which spoken content occurs, rather than just an indicator of its occurrence.

Gaussian Naive Bayes (GNB), Decision Tree (DT), Random Forest (RF), and Linear Support Vector Machine (L-SVM) models are trained on child and adult acoustic and psycholinguistic features, similar to the approach in Ardulov et al. [2020a]. A one-way ANOVA is also used for feature selection, where 10 feature sets are created per task, consisting of the session-level features that yielded p < P_max, where P_max ∈ {0.05, 0.10, 0.15, ..., 0.50}. Tables 2.3 and 2.4 show the cross-validation performance of models trained on the session-level feature sets that produced the highest F1 cross-validation scores for the truth-telling and disclosure tasks, respectively.

    Model    F1       Acc.     Prec.    FNR      Pos. Acc.  Neg. Acc.
    DT       0.540*   0.662**  0.587**  0.268*   0.525      0.737
    RF       0.324    0.633    0.517**  0.328    0.250      0.840
    GNB      0.378    0.536    0.370    0.354    0.393      0.613
    L-SVM    0.640**  0.719**  0.584**  0.165**  0.725      0.718

Table 2.3: Truth-telling task performance using session-level aggregated features and a feature significance threshold of p < 0.30. * indicates performance better than the randomized bootstrap t. ** indicates performance better than the human simulation h_2. Bold values indicate the best performance in their respective columns. Pos. Acc. and Neg. Acc. indicate the accuracy for the positive and negative classes, respectively. The best performing L-SVM hyperparameters are C = 1 along with balanced class weights. The decision tree classifier used entropy as the splitting criterion, while the random forest classifier used Gini impurity.

    Model    F1      Acc.    Prec.   FNR     Pos. Acc.  Neg. Acc.
    DT       0.452   0.641   0.451   0.259   0.471      0.725
    RF       0.229   0.633   0.327   0.316   0.186      0.849
    GNB      0.480*  0.603   0.434   0.256   0.557      0.622
    L-SVM    0.609*  0.703*  0.531*  0.161*  0.719      0.697

Table 2.4: Disclosure task performance using session-level aggregated features and a feature significance threshold of p < 0.20. * indicates performance better than the randomized bootstrap d. Bold values indicate the best performance in their respective columns. Pos. Acc. and Neg. Acc. indicate the accuracy for the positive and negative classes, respectively. The best performing L-SVM hyperparameters are C = 0.1 along with balanced class weights. The decision tree classifier used entropy as the splitting criterion, while the random forest classifier used Gini impurity.

2.2.4 Modelling Cross-modal Dynamics

To capture information surrounding the influence of the interviewer on the child's behavior, we utilize a measure of dynamic coordination known as Granger causality analysis (GCA). Although GCA was originally developed for econometric forecasting [Granger, 1969], it has been shown to significantly measure coordination and synchrony between interlocutors [Kalimeri et al., 2011, 2012, Bone et al., 2014].
Thus, the strength of temporal causality between the interviewer's and child's speech signals can be interpreted as a measure of coordination between the two interlocutors.

Explicitly, for two given signals Y_[0:T] = [y_0, y_1, ..., y_T] and X_[0:T] = [x_0, x_1, ..., x_T], X is said to "Granger cause" Y if the error of the auto-regressive model, ε_α,t, is significantly larger than the error of the influence model, ε_β,t, according to an F-test. GCA implies that the inclusion of signal X in Eq. 2.1 with a lag L better explains the observation of Y than the auto-regressive model shown in Eq. 2.2:

    y_t = a · Y_[0:t] + b · X_[0:t−L] + ε_β,t    (2.1)

    y_t = a · Y_[0:t] + ε_α,t                    (2.2)

The GCA yields an F-statistic, F_GCA, and a corresponding p value, which can be interpreted as the strength of the influence and a measure of how likely the relationship is to occur by chance.

Given an interview session, a pair of turn-level child and adult speech signals, X_[0:T] and Y_[0:T] respectively, are extracted. A causality score, F_GCA, is computed for each combination of available speech signals (both acoustic and psycholinguistic) by applying the GCA with a maximum lag of 5.

To determine which causality scores are most predictive, the F_GCA distributions are evaluated using a one-way analysis of variance (ANOVA). An example of how the distribution of causality scores varies based on the underlying transgression and disclosure conditions can be seen in Figure 2.2, which shows the distribution of the child-adult imageability-imageability F_GCA for the truth-telling task. These ANOVA results were used for feature selection by constructing 10 feature sets per task, consisting of the causality scores for which the ANOVA significance yielded p <

    Λ(v_j) = ⟨valence_j, arousal_j, pleasantness_j⟩   if v_j ∈ V_e
             ⟨0, 0, 0⟩                                 otherwise

Then let S_i represent either Q_i or V_i. The session intensity is computed by:

    ψ(S_i) = (1/m) Σ_{t ∈ [0,T]} Σ_{v ∈ s_t} |Λ(v)|

where m represents the count of words for which Λ(v) ≠ ⟨0, 0, 0⟩.
Then for a given utterance s_t ∈ S_i, the ui-isi is defined as:

    φ(s_t) = (log(T) / log(ψ(S_i))) · Σ_{v ∈ s_t} Λ(v)

where the log(·) and the division are performed element-wise across the vector.

Sections 3.2.6.1 and 3.2.6.2 describe modeling strategies which use different representations of affect norms per utterance to model the interactions between affect and productivity. The details and decisions are described in more detail below.

3.2.5 Acoustic Features

Prosody refers to the rhythm, rate, and intonation associated with speech, and is concerned with how something is spoken, rather than what is spoken. Speech prosody has been hypothesized as a primary feature for understanding emotion from speech [Pollermann, 2002], and prosodic features have been used extensively to study affect from speech [Ringeval et al., 2017].

The fundamental frequency (intonation) and intensity (loudness) are used as the features representative of vocal prosody. Log-pitch and intensity contours are extracted at frame level using Praat [Boersma, 2006], and mean-normalized per speaker and per session. The median and standard deviation of each feature are computed across an utterance, resulting in 4 prosodic features.

3.2.6 Dynamical System Models

3.2.6.1 Linear Mixed Effects Models

Linear Mixed Effects Models (LMEs) provide a framework within which it is possible to analyze common dynamical features across different subjects. LMEs provide insight into the relationships between the extracted features of both child and interviewer and verbal productivity. LME models fit the present problem due to their ability to model individual sessions. In two different model constructions, the fixed effects are set as either the median lexical affective features or the acoustic prosodic features, and the session IDs are set as the random effect.
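A fixed-plus-random-effects fit of this kind can be sketched with statsmodels' MixedLM. This is a sketch under stated assumptions: the data are synthetic, the column names (`child_prev`, `adult_prev`, `productivity`, `session`) are hypothetical stand-ins for the lagged features and session IDs, and only a random intercept per session is modeled.

```python
# Sketch of an LME: productivity regressed on lagged child/interviewer
# features (fixed effects) with a per-session random intercept.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "session": rng.integers(0, 20, size=n),   # random-effect grouping (session ID)
    "child_prev": rng.normal(size=n),         # lagged child feature
    "adult_prev": rng.normal(size=n),         # lagged interviewer feature
})
# Synthetic outcome with known fixed-effect weights 0.5 and 0.3.
df["productivity"] = (0.5 * df["child_prev"] + 0.3 * df["adult_prev"]
                      + rng.normal(scale=0.5, size=n))

model = smf.mixedlm("productivity ~ child_prev + adult_prev",
                    df, groups=df["session"])
fit = model.fit()
print(fit.params["child_prev"], fit.params["adult_prev"])
```

The fitted fixed-effect coefficients play the role of W in the formulation that follows, while the per-session intercepts capture the session-level random effect.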
Given a particular interview Θ_i, define c_t as the features, either lexical or acoustic, associated with the child at time step t; respectively, u_t represents the observations made of the interviewer. For each time step we define s_t = [c_{t−1}, u_{t−1}] as the concatenation of those two vectors, capturing the preceding child and interviewer features. The LME formulation then follows as:

    ŷ_t(Θ_i) = μ + μ_i + W s_t + ε_t

where μ represents the mean productivity over all sessions and μ_i is the mean productivity of session Θ_i. W contains the coefficients computed by the LME and ε_t is the residual, defining a linear approximation which considers only the most recent preceding utterances.

3.2.6.2 Dynamic Mode Decomposition with Control

An alternative approach estimates the dynamics by applying a dynamical system model to each interview independently, and then evaluating the dominant dynamic modes to identify similarities across the interviews. Following the notation presented in Section 3.2.6.1, let us define the observation of a child as x_t and the interviewer input as u_t at a given time point t ∈ [0, T]. With this construct, the dynamical model we outline is the following:

    x_{t+1} = F x_t + G u_t

where F is referred to as the transition matrix, describing the autonomous evolution of the child's observations, and G, henceforth referred to as the controller, indicates the influence that the input signal has on the evolution of the observations.

When analyzed on a per-session basis, the transition matrix and controller capture an individual's interaction dynamics, illuminating the nature of how different children respond to input from their interviewers. By utilizing DMDc, the time-series observations of the child's productivity, the interviewer's prosodic features, and both interlocutors' lexical affective norms can be used to reconstruct the interaction dynamics.
The construction leverages the fact that, for X_[1:t+1] = [x_{t+1}, x_t, ..., x_1], X_[0:t] = [x_t, x_{t−1}, ..., x_0], and Υ_[0:t] = [u_t, u_{t−1}, ..., u_0]:

    X_[1:t+1] = F X_[0:t] + G Υ_[0:t]

Since the controller is not known a priori, the problem is re-framed as:

    X_[1:t+1] = [F G] [X_[0:t]; Υ_[0:t]] = H Ω_[0:t]

To solve for H = [F G], it is required that rank(Ω) = |x_t| + |u_t|, where |x_t| and |u_t| represent the dimensionality of the observations and inputs, respectively. With this assumption upheld, it follows that:

    H = X_[1:t+1] Ω_[0:t]^†

where (·)^† refers to the Moore-Penrose inverse.

To directly study the behavior of the dynamical model as outlined by these matrices, the eigenvalues λ ∈ C can be extracted, which in turn can be interpreted as the Z-transformed open-loop poles, which are the modes of the system [Ragazzini and Zadeh, 1952]. The Z-transform is a discrete-time analogue to the Laplace transform (L); the eigenvalues exist in the spectral space z ∈ C and correspond to the time-domain exponential functions that govern the system. Through this lens, these eigenvalues can be interpreted as the complex exponential powers of the differential equations that describe the behavior observed in the temporal domain.

Briefly, the larger the magnitude of an eigenvalue, the more "dominant" the related mode is, implying that the related dynamics persist for longer, since the associated time-domain exponential is larger. Extending this notion, modes with |λ| = 0 dampen immediately and do not continue to influence the evolution of the model. Using the imaginary component to compute the angle of deflection from the real axis, the angle ω is related to the frequency of the oscillatory response described by the mode. By including the eigenvalues of the transition matrix, it is possible to model the influence of the child's previous statements on their future statements.
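The least-squares estimation above can be sketched in a few lines of NumPy. This is a sketch under stated assumptions: the system here is synthetic, with a known transition matrix and controller, so that the recovered H = [F G] can be checked against ground truth; the real interview signals are of course not available here.

```python
# Sketch of DMDc: stack snapshots, solve H = X' @ pinv(Omega) for
# H = [F G], then inspect eigenvalues of F and singular values of G.
import numpy as np

rng = np.random.default_rng(0)
F_true = np.array([[0.9, 0.1], [0.0, 0.8]])   # known transition matrix
G_true = np.array([[0.5], [0.2]])             # known controller

T, x = 100, np.zeros(2)
X, U = [], []
for _ in range(T):
    u = rng.normal(size=1)
    X.append(x); U.append(u)
    x = F_true @ x + G_true @ u + 1e-3 * rng.normal(size=2)

X = np.array(X).T                             # states,  shape (2, T)
U = np.array(U).T                             # inputs,  shape (1, T)
Omega = np.vstack([X[:, :-1], U[:, :-1]])     # stacked [X; Upsilon]
H = X[:, 1:] @ np.linalg.pinv(Omega)          # H = [F G] via Moore-Penrose inverse
F_hat, G_hat = H[:, :2], H[:, 2:]

eigvals = np.linalg.eigvals(F_hat)                   # modes of the transition matrix
svals = np.linalg.svd(G_hat, compute_uv=False)       # controller gains
print(np.sort(np.abs(eigvals)), svals)
```

With the small process noise used here, the estimated eigenvalue magnitudes land near the true poles (0.8 and 0.9); in the analyses below, these eigenvalues and the controller's singular values are the per-session features.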
Meanwhile, since the controller matrix is not square, the singular values are used instead, as they contain information pertaining to the influence of statements made by the interviewer on the future statements of the child.

3.3 Results

3.3.1 Verbal Productivity

To validate our methods, samples are randomly selected from the dataset that highlight the introduced mechanism's ability to identify topics. The analysis below evaluates the differences observed across the use of 4 different scoring metrics: word count, agenda (g) scoring, responsiveness scoring, and a combined scoring. Specifically, the analysis looks at the utterances that are most highly rated by these methodologies, the difference in the time-series signal, and the difference in correlation with the child's age.

Table 3.1: Top 10 weighted words from 5 agendas

    1           2         3        4         5
    bathroom    mister    cousin   gonna     really
    gonna       pinched   al       mommy     understand
    outside     kids      cousin   al        aunt
    gonna       touched   thank    touched   touched
    sometimes   garage    gonna    private   touching
    times       uncle     mom      talking   michael
    important   sitting   wrong    doors     clothes
    clothes     clothes   kid      outside   peepee
    kids        fixed     gonna    pants     mom
    mom         really    legs     grandma   touched

3.3.1.1 Agenda and Responsiveness

Table 3.1 shows examples of agendas that have been constructed using our methods with simple stop-word filtering. The extraction clearly identifies important concepts that are relevant to potential episodes of abuse that the subject experienced or witnessed. We note in particular that these words were extracted purely from the lexicon of the interviewer, and do not directly model the responses of the child. This is justified by the assumption that, in an interview setting, a concept that is repeated frequently in the questions is indicative of a broader interest in that topic.

    Utterance Excerpt                                Word Count   Agenda (g)   Responsive   Combined
    "I like to play with my friends. I have a
    new student ..."                                 35 (1)       0.00 (128)   0.00 (129)   0.00 (128)
    "Outside in the bathroom..."
    (child reveals details)                          9 (26)       27.00 (1)    0.16 (3)     0.23 (2)
    "Over my skirt clothes, but outside he got
    me under my clothes."                            12 (17)      12.00 (3)    0.37 (1)     0.40 (1)

Table 3.2: Comparison of productivity scores produced by different methods and their relative rank (1 being the highest rank), demonstrating the ability to score substantive information from utterances. For Responsiveness+Agenda, both hyper-parameters are chosen to be 0.5.

A closer exploration of utterance-level interactions and their scoring can be found in Table 3.2, which exhibits the differences and similarities between the scoring metrics. Comparing the rating of each method against the others' top-rated utterances provides insight into which features each criterion is sensitive to. The word count criterion suffered from assigning relatively low relevance to informative utterances during disclosure, in favor of long narrative utterances during the rapport-building phase. In the example highlighted in Table 3.2, a response to the interviewer asking the child to talk about their experience in school is rated highly, despite not being substantive to the overall interview. While responses with higher word counts can be seen as a demonstration of trust, a high word count does not guarantee that the response contains substantive information. In contrast to word count, the productivity measures used in this work require that important topics be addressed in an utterance for it to receive a high rating. This result does not suggest that inquiring about a child's schooling is not productive, but rather that the information observed in that specific utterance is not indicative of the overall interview session's effectiveness.

3.3.1.2 Signal Sparsity

Comparing the differences between the signals of the different metrics, as shown in Figure 3.3, it can be seen that the signal produced by the productivity metrics is sparser [4] than the word count.
This resulting sparsity can be interpreted as an information-relevance filter convolved against the lexical information provided by the child. While this might imply there is less information from which to construct dynamics, it also suggests that there should be fewer confounding factors that could produce erroneous results.

[4] While many interpretations of sparsity can be considered, in this context "sparsity" is taken to be the proportion of non-zero values in the signal.

Figure 3.3: Various proposed productivity scores over the course of a single session: agenda (top, left), responsiveness (top, right), combined (bottom, left), and word count (bottom, right). Each score is normalized with respect to its maximum.

3.3.1.3 Reducing Age Correlation

Figures 3.4, 3.5, and 3.6 are compared to Figure 3.1 to demonstrate the resulting correlation between productivity scores and child age. These figures and Table 3.3 reveal the differences in Pearson correlations to be statistically significant, suggesting that the weaker correlation between age and productivity is not an erroneous result.

Table 3.3: Pearson correlation (r) indicating the strength of the relationship between different productivity metrics and the age of the child. All scores were significant at p < 0.001.

    Metric              r
    Word Count          0.46
    Agenda (g)          0.26
    Responsive          0.24
    Responsive/Agenda   0.25

The weaker correspondence to age further suggests that the productivity metrics are also more robust to a specific child's language ability and development.

Figure 3.4: Distribution of scores and variances by age using only the Agenda criterion.

Figure 3.5: Distribution of scores and variances by age using only the Responsiveness criterion.

3.3.2 Modeling Productivity with Linear Mixed Effects Models

Arousal, as measured through the psycholinguistic norms of both child and interviewer, is significantly associated with productivity.
Some of the emotions that encompass high arousal, especially anger and fear, have been reported during forensic interviews [Paine and Hansen, 2002, Berliner and Conte, 1995]; hence this result shows that entities salient to the alleged crime, including names, places, and objects, which were shown to feature in the most productive responses, tend to be associated with, and possibly evoke, strong emotions in the child during CFI. The corresponding valence norm from the interviewer questions was negatively associated, although not significantly (p ≈ 0.14).

Figure 3.6: Distribution of scores and variances by age using the combined Agenda/Responsiveness scoring method.

3.3.3 Modeling Dynamics Using the DMDc Framework

Performing DMDc and evaluating the single most dominant mode via eigenvalue decomposition of the transition matrix helps identify the dynamical characteristics associated with the child, as well as visualize common behavioral types across children. Figure 3.7 shows a trend in which many children in the EC age group have eigenvalues that have a larger magnitude and fall on the positive end of the number line. Values with a magnitude closer to 1 are consistent with slowly decaying and dampened oscillation. Meanwhile, older children generally lie on the other side of the unit circle, implying high degrees of oscillatory behavior, but generally also having smaller eigenvalue magnitudes, representing faster decay rates.

Table 3.4: LME analysis for studying the effect of (a) prosodic features and (b) lexical features on verbal productivity. t-statistics for statistically significant features are reported here. * (p < 0.05); ** (p < 0.01)

(a) Prosodic Features
Feature    All (df=895)    EC (df=497)    LC (df=389)
ad in st   -2.66**         -2.77**        -1.55
ch in st    2.36*           1.60           2.30*

(b) Lexical Features
Feature    All (df=35343)  EC (df=19162)  LC (df=15726)
ad ar       3.21**          3.07**         2.67**
ch ar       2.41*           1.75*          2.07*
ch va       2.09*           1.41           2.06*
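The dominant-mode analysis described in §3.3.3, extracting the largest-magnitude eigenvalue of the transition matrix and the dominant singular value of the controller, can be sketched as follows (the function and variable names are ours; A and B stand for matrices fitted by DMDc):

```python
import numpy as np

def dominant_modes(A, B):
    """Return the dominant eigenvalue of the transition matrix A and the
    dominant singular value of the controller B.

    An eigenvalue magnitude near 1 implies slowly decaying behavior, while
    negative or complex eigenvalues imply oscillation. Larger singular
    values imply a larger gain applied to the interviewer's inputs.
    """
    eigvals = np.linalg.eigvals(A)
    dom_eig = eigvals[np.argmax(np.abs(eigvals))]
    dom_sv = np.linalg.svd(B, compute_uv=False)[0]  # sorted descending
    return dom_eig, dom_sv
```

Plotting the dominant eigenvalues on the complex plane, relative to the unit circle, gives figures of the kind discussed here.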
Figure 3.8 demonstrates the distributions of singular values rounded to their closest integer values. Large singular values correspond to a large gain applied to the inputs. Children in EC generally have smaller singular values, implying that they respond more slowly to the inputs of the interviewer than the children in the LC group, whose singular values are generally more spread out, implying a varying degree of responses across the group.

Similar to LME, DMDc enables the analysis of model parameters to understand dominant contributors in the interactions. Figure 3.9 shows that, in general, the child's preceding productivity is the most indicative component in predicting their future state. Similarly, the preceding "rolling agenda" observations are most indicative of the child's productivity.

Figure 3.7: Dominant eigenvalues associated with the transition matrix of child state evolution. These are indicative of the dynamics that are dominantly present in the observation time-series.

3.4 Discussion

3.4.1 Verbal Productivity Measures

Our experiments and evaluations demonstrate the usefulness of topic modeling for productivity assessment. Our work outlines a promising methodology towards automating the detection of substantive information in child interview responses, while also leveraging and attending to a child's engagement with the interviewer. Further, these new metrics are demonstrated to facilitate novel methods for utilizing automated productivity measures for the purpose of eliciting interaction dynamics through behavioral modeling. Furthermore, by automating the interview evaluation, this approach alleviates some of the need for human involvement in data creation and labeling, making the process more scalable, faster, and less prone to label agreement errors.
By introducing hyper-parameters to the proposed productivity measure, the system maintains flexibility for different criteria without sacrificing consistency of scoring across examples. While the work presented outlines the need for constructing an agenda, these techniques also allow interviewers to construct "gold standard" agendas prior to the interview. Using these prepared agendas will allow interviewers to use situational hypotheses and prior knowledge to improve and inform the specific policy or strategy they employ for productivity evaluation. Further still, the outlined metrics can be used to model similar goal-oriented dialogues, such as negotiation and question answering, in an effort to make information discovery a larger part of the dialogue process.

Figure 3.8: Distribution of dominant singular values. "All" represents all age groups together; EC and LC represent "Early Childhood" and "Late Childhood" respectively. Larger dominant singular values indicate a larger contribution to the output from the control input.

The presented results on agenda and responsiveness also demonstrate less correlation to the subject's age. The weaker age correlation suggests that these productivity metrics are less likely to be confounded with the overall language ability of the child. While it is reasonable to assume that the child's age is still going to be tied to their ability to produce narrative and informative responses to the interviewer's prompts, decreasing the correspondence between the two suggests that these metrics will generalize better to children with lower language resources.

3.4.2 Dynamical System Models

This work presents two methods for utilizing dynamical systems models to gain insight into multi-modal interaction dynamics.
The results produced by LMEs can be leveraged to suggest that there are non-erroneous examples of prosodic and affective effects of the interviewer on the child, and they supported the inclusion of these features into the higher-order dynamical models. Particularly surprising was the finding that valence from the child's preceding response was positively associated with productivity, suggesting that narratives by children expressing positive but strong emotion tend to facilitate productivity in the following responses.

Subsequently, DMDc provided meaningful visualizations of and insights into the intrinsic dynamics and influences predictive of child behavior. This modeling approach highlights the temporal relationships present between the rapport-building and disclosure phases of the interview. In Figure 3.7, the relative positions of the EC and LC samples on the unit circle imply varying rates of decaying exponential functions. This can imply that children who are older generally express behaviors that are more sporadic and closer to impulse characteristics, while younger children have behaviors that are more constant. Meanwhile, Figure 3.8 demonstrated the relative spread of singular values between EC and LC children, which suggests that younger children respond slowly to the inputs of the interviewer, while the older children had a less consistent response behavior.

Figure 3.9 also suggests that affective lexical features make up a larger part of the interviewer's influence in EC children than in LC children. This suggests that the content of the words being said and the questions being asked of the child become more important as the child gets older and, presumably, develops better language skills. The resulting relative importance is expected, since productivity is itself a product of the child's alignment and responsiveness to the interviewer's word usage.
Of particular interest, though, is the observation of the remaining features: the variability of the interviewer's prosodic intensity is the next strongest indicator of productivity in the child's response. This supports the results found in Table 3.4, which indicated a statistically significant correlation between the two variables.

To the best of our knowledge, this is the first work to computationally model the interaction dynamics of CFI, and we believe this is a strong starting point towards developing novel BSP-driven interview strategies.

The ensuing chapter will build on the presently described application of DMDc to identify local dynamic characteristics in the context of identifying interaction strategies in psychotherapy. The application looks at windows of behavioral dynamics, rather than the session in its entirety, to observe whether a therapist is adherent to a particular form of therapy. By extending the successes presented in these works, it is possible to identify characteristics in a more momentary fashion, enabling more fine-tuned feedback and insight in an inherently less stationary interaction context.

Figure 3.9: Relative scale of influence on productivity across age groups and features. The average absolute value of each row is used to calculate the contribution to productivity. Contributing components from the derived "transition" (left) show how the child's own state contributes to their future state, and the "controller" (right) indicates which components of interviewer speech contribute most to the child's productivity. μ_i and σ_i correspond to the mean and standard deviation of the interviewer's prosodic intensity, while μ_p and σ_p correspond to the pitch. v, a, and p correspond to the valence, arousal, and pleasantness of the interviewer's utterance. [L_1 ... L_5] correspond to the 5 most important "agenda" items and their appearance in the previous utterances.
Chapter 4

Local Dynamic Modes of Cognitive Behavioral Therapy

The previous chapters highlighted the importance of influence-aware systems models for eliciting important interview information pertaining to a variety of metrics. However, these methods aimed to describe the dynamics of a session in its entirety. In the following chapter, our attention will shift to using a localized variation of DMDc, in an effort to demonstrate how these methods might be useful in capturing smaller sections of behavioral strategies.

As mental health awareness and diagnosis become more pervasive, so does the need for access to high-quality counseling and psychotherapy. As of 2019, about 1 in 5 adults in the United States (19%) reported being diagnosed with a mental, behavioral, or emotional disorder, and half as many, 9.5% of adults in the US, report receiving counseling [Abuse, 2019, Terlizzi and Zablotsky, 2020]. In order to sustainably and equitably provide access to care, it is imperative that therapists and clinicians are equipped with technologies and tools that best support their skills.

Recent efforts to increase equitable access to effective mental health care have resulted in policy shifts towards treatments with consistent evidence of strong outcomes in clinical research. Initiatives such as those described in Creed et al. [2016] demonstrate how these treatments, in this case Cognitive Behavioral Therapy (CBT), can be successfully delivered outside of clinical research environments, including community mental health care settings. In order to support high-quality care and positive clinical outcomes, researchers and clinicians developed the Cognitive Therapy Rating Scale (CTRS) to assess a therapist's competence in the core components of CBT [Young and Beck, 1980]. During periods of evaluation and training, the use of the CTRS provides therapists with concise and targeted feedback suggesting adjustments to improve their CBT delivery.
This feedback in turn has been shown to improve the consistency of clinician competence in the therapy they deliver to clients [Creed et al., 2016]. A critical limiting factor in scaling these initiatives is the need for expert evaluations to assess therapists via the CTRS, as these evaluations are quite time- and resource-intensive. By increasing access to timely and concise feedback, therapists would be able to strengthen their clinical care delivery. This in turn leads to a higher bandwidth of support for clients, by enabling more professionals to be available, and improves the overall quality of care.

Towards this end, previous works in automated psychotherapy assessment and feedback have demonstrated success in data-driven evaluation of therapist competence [Gibson et al., 2019, Chen et al., 2021, Flemotomos et al., 2018, 2021a]. These works utilize acoustic and language elements of speech to predict behaviors and outcomes of the clients and the therapists' utilization of therapeutic skills.

The presented work outlines an alternative paradigm through which to study the interaction: by re-imagining the interaction as a dynamical system, it is possible to construct a framework under which the flow of conversation can be studied as a control-affine system. This paradigm would enable the optimization of therapeutic strategies, as well as having directly interpretable model parameters. The approach presented will construct and fit local dynamical systems models over short windows of interaction, extract the dynamic modes via eigenvalue decomposition, and demonstrate that these modes carry information that is pertinent to competence in the tenets of CBT. Our results suggest that there is a natural interpretation of these modes as momentary indicators of strategies that are informative for assessing therapists' competence.
4.1 Prior Work

4.1.1 Automated Therapy Evaluation

The process of identifying patterns indicative of therapy quality and client outcome requires domain expertise, and is cognitively straining and time-consuming. To help overcome these limitations and support practitioners, behavioral signal processing (BSP) has been introduced as a context in which to utilize data towards automating many of the involved tasks [Narayanan and Georgiou, 2013].

In an early demonstration of how data-driven methods could support automated psychotherapy evaluation, Gibson et al. [2016] introduced the processing of dialogue transcripts with deep recurrent neural networks (RNNs) to predict session-level empathy ratings for therapists practicing Motivational Interviewing (MI). Gibson et al. [2017] built on this work by introducing attention mechanisms on top of RNNs for automatically identifying MI skills on an utterance level. These results demonstrate how behavioral signal data, acoustic and linguistic time-series, can be evaluated to construct representations of nuanced dialogue acts. By predicting local actions taken by the therapist, these models gave way to evaluating strategies on a more generalizable level, for example allowing the evaluation of different question types on therapeutic outcomes. Most recently, Flemotomos et al. [2021b] described a complete pipeline which is able to translate from raw audio to full transcripts, with utterance- and session-level predictions for MI behavioral skills and assessment ratings. These pipelines demonstrate the power of providing feedback in a timely manner, and open a natural opportunity for continued learning and refinement.

Specifically in the domain of CBT evaluation, Flemotomos et al. [2018] first compared different language models for evaluating CTRS scores, while Gibson et al. [2019] utilized multi-task learning to improve automatic CTRS assessment by leveraging experience from other therapy styles such as MI. Chen et al.
[2021] evaluated the value of enriching language features with behavioral skill codes and dialogue acts. Recently, Flemotomos et al. [2021a] demonstrated the use of fine-tuned large language models to improve the discrimination of high- and low-competence CBT strategies.

Prior work has demonstrated how the analysis of speech signals and the contents of clients' and therapists' language can be powerful indicators of what strategies therapists are using and of the overall quality of the session with respect to clinically relevant metrics. In contrast to previous works, the presented study specifically evaluates the dynamics of therapeutic dialogue as observed from language. The methodology outlined constructs a more general framework for the study of interpersonal interactions and demonstrates how these models capture pertinent dynamical modes which can be translated into meaningful CBT strategies.

In contrast to the aforementioned deep neural methods, our approach looks more explicitly at control-affine dynamical system models. This model interprets the observed data as a system in which the output of the client is the result of a previous state that is undergoing a self-regulating process (transition) and that is influenced by the observed data from the therapist (control) at a given time point. This coupled pair of transition and control is referred to as the dynamics of the system, as they describe the evolution of the interaction rather than its static content. Rather than fitting a single model to all of the data available, each interaction is fit separately, and the parameters of the fit model can then be evaluated directly via methods such as eigenvalue decomposition.
The methods, outlined in more detail in §4.2.3, have been used to extract behavioral dynamics to study Child Forensic Interviews [Ardulov et al., 2018, Durante et al., 2022] and infant-mother interaction modeling [Klein et al., 2021], demonstrating that interpersonal interactions can be holistically evaluated with these considerations.

4.2 Methods

CBT is focused on addressing mental health problems through the conscious exercise of cognitive change strategies [Beck, 2011, Creed et al., 2016], outlining techniques that the therapist conversationally guides their client through. The therapist engages the client in a series of activities to help them build the necessary skills to shift their patterns of thinking and reacting to situations. The procedure highlights 11 elements that a therapist should be integrating throughout the session, such as setting an agenda, providing feedback, or utilizing homework, which are then summarily scored together, as described in Table 4.1.

Table 4.1: Abbreviations of sub-scores

Score                       Abbreviation
agenda                      ag
application of technique    ap
collaboration               co
feedback                    fb
guided discovery            gd
homework                    hw
interpersonal               ip
key cognition behavior      cb
pacing and timing           pt
strategy for change         sc
understanding               un
total CTRS                  ctrs

The present work is focused on the identification of types of dialogue flows that demonstrate competence in CBT practices. Our approach evaluates the likelihood that the extracted behavioral dynamics occur in a session where the therapist's skills are delivered effectively. Under the control-affine system paradigm, the therapist's utterances are considered an input signal to the system described by the present client, and the client's utterances are viewed as the observable output from the system, similar to how a sensor might be used to estimate the state of a system. This perspective explicitly models the client as the central view of the dynamics.
More concretely, the dynamics analyzed in this methodology are a combination of the observable evolution of the client's behaviors, as reflected by their utterances, in response to the therapist's input.

§4.2.1 describes the data and the steps taken to process the data into the format necessary to conduct the study. Next, §4.2.2 outlines the two classification tasks that are considered and the baselines against which the methods are compared, while §4.2.3 presents the models used for the tasks. Finally, §4.3 demonstrates the results, highlighting the effectiveness of the approach in identifying local interactions that are indicative of high session quality, and discusses the conclusions which can be drawn.

Figure 4.1: Distribution of the different CTRS sub-scores and the total CTRS scores in the data.

4.2.1 Data

The present study is conducted by analyzing the transcripts of 292 recorded CBT sessions with 175 unique clients. The transcripts are annotated by professionally trained coders for each CTRS sub-score, which in turn are aggregated to compute a total score. The individual sub-scores, named explicitly in Table 4.1, are each scored on a scale between 0 and 6. Having a skill rated as a 4 or higher is considered demonstrating high competence, and subsequently a total CTRS score of 40 is the threshold necessary for a session to be considered high-competence [Vallis et al., 1986]. These thresholds, outlined by the CTRS coding manual, were in turn converted into binary labels where 0 and 1 represent low- and high-competence respectively. Figure 4.1 shows the distribution of scores for each of the sub-scores and the summary CTRS score.

For each of the transcripts, the talk turns were converted into embeddings using a transformer-based language model such as those described in Devlin et al. [2018]. In this way, these embeddings can be considered the inputs and observables of the dynamical system model, where each talk turn is considered a time-step.
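The thresholding rule described above can be sketched concretely as follows (the dictionary keys and helper name are illustrative, not from the original pipeline):

```python
# Hedged sketch of the CTRS binarization rule: a sub-score >= 4 marks
# skill-level competence, and a total score >= 40 marks the session
# itself as high-competence.

SKILL_THRESHOLD = 4
TOTAL_THRESHOLD = 40

def binarize_ctrs(subscores):
    """Map raw CTRS sub-scores (each 0-6) to binary competence labels."""
    labels = {name: int(score >= SKILL_THRESHOLD)
              for name, score in subscores.items()}
    labels["ctrs"] = int(sum(subscores.values()) >= TOTAL_THRESHOLD)
    return labels
```

For example, a session with all 11 sub-scores at 4 sums to 44 and is labeled high-competence, while all 3s sums to 33 and is not.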
An illustration of the process taken to convert talk turns into aligned embeddings can be found in Figure 4.2, which also demonstrates how the data were grouped into windows, which are the inputs to the models described in §4.2.3.

Figure 4.2: Each talk turn is converted into its respective vector representation via the DistilBERT transformer model, and then used to construct the X and Y matrices, which are in turn windowed according to the window size w.

1. Embeddings were extracted using the DistilBERT model found at https://huggingface.co/distilbert-base-uncased

4.2.2 Tasks and Baselines

Using the labels outlined in §4.2.1, our approach is interested in associating windows of dialogue with whether or not they belong to a CBT-competent session. To accomplish this, two classification tasks are considered. The first, referred to as the local task, takes a window of talk turns and predicts the global binary label given to the session as a whole. Presumably, there will be similar windows of dialogue that occur during both high- and low-competence sessions, since low-competence sessions are not typically completely absent of strategies that are also employed in high-competence settings; rather, low-competence sessions typically under-deliver these strategies. For this reason, the model's output is considered to be the probability that the window came from a high-competence session, consequently providing a discriminator that can be aggregated over the session. Since the underlying hypothesis is that a high-competence session will have more windows of high-competence behavior than a low-competence session, the second, global classification task takes aggregate predictions over the entire session and predicts the final binary label using the combined score.

For each of the local and global tasks, bootstrapped simulated baselines were constructed.
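One way such a bootstrapped chance baseline can be computed, sketched here under the assumption that random guesses are scored against the observed label distribution and the threshold is taken as mean plus three standard deviations of the resulting F1 scores (this is an illustration, not the exact procedure used in the study):

```python
import numpy as np

def bootstrap_f1_threshold(labels, n_boot=1000, seed=0):
    """Estimate a chance-level F1 threshold (mean + 3 std) by repeatedly
    scoring random guesses against the true labels."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    p_pos = labels.mean()          # guess positives at the observed rate
    scores = []
    for _ in range(n_boot):
        guess = rng.random(labels.size) < p_pos
        tp = np.sum(guess & (labels == 1))
        fp = np.sum(guess & (labels == 0))
        fn = np.sum(~guess & (labels == 1))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    scores = np.array(scores)
    return scores.mean() + 3 * scores.std()
```

A model is then only retained if its F1 exceeds this simulated threshold, as in the 3σ criteria of Tables 4.2 and 4.3.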
Since the size of the window impacts the number of samples available to predict over, separate baselines were computed for each size in the local scenario. The local 3σ performance was used as a threshold for considering a model for the global task. Bootstrap F1 results can be found in Tables 4.2 and 4.3, which display the thresholds of significance against which the models are evaluated.

Table 4.2: Bootstrapped baselines by window size. Each value represents 3σ significance, i.e., 3 standard deviations above the mean.

Score  Window  F1        Score  Window  F1
ag     3       0.5115    ip     3       0.5119
ag     5       0.5117    ip     5       0.5121
ag     8       0.5118    ip     8       0.5123
ap     3       0.5122    cb     3       0.5121
ap     5       0.5124    cb     5       0.5122
ap     8       0.5126    cb     8       0.5124
co     3       0.5121    pt     3       0.5059
co     5       0.5122    pt     5       0.5059
co     8       0.5123    pt     8       0.5062
fb     3       0.5082    sc     3       0.5121
fb     5       0.5083    sc     5       0.5122
fb     8       0.5084    sc     8       0.5124
gd     3       0.5128    un     3       0.5075
gd     5       0.5129    un     5       0.5075
gd     8       0.5131    un     8       0.5076
hw     3       0.5101    ctrs   3       0.5077
hw     5       0.5102    ctrs   5       0.5078
hw     8       0.5104    ctrs   8       0.5080

4.2.3 Windowed Dynamic Mode Decomposition

Similar to the approach presented in §3.2.6.2, each session is defined as a conversational interaction between the therapist and client, modeled as a control-affine dynamical system. Specifically, treating the language of the client as an observation of a hidden state, influenced by an input signal consisting of the therapist's language, allows us to represent the dynamics as a matrix couple consisting of the transitions and controls.

Building atop the DMDc method for approximating a control-affine linear dynamical system model over an entire interview session, our methodology trivially extends the method Zhang et al. [2019] introduced for windowed and online implementations of DMD, which in
turn more adequately accounts for non-stationary processes [Proctor et al., 2016, Zhang et al., 2019].

Table 4.3: Baselines for session-level codes, at 2σ and 3σ significance.

Score   2σ       3σ
ag      0.6268   0.6929
ap      0.6255   0.6914
co      0.6277   0.6935
fb      0.6209   0.6861
gd      0.6288   0.6950
hw      0.6256   0.6911
ip      0.6269   0.6925
cb      0.6315   0.6983
pt      0.6195   0.6848
sc      0.6273   0.6933
un      0.6246   0.6904
ctrs    0.6217   0.6871

More explicitly, the extracted turn-level sentence embeddings outlined in §4.2.1 are assembled into arrays, Y = [y_0, y_1, ..., y_T] and X = [x_0, x_1, ..., x_T], for the client and therapist respectively, over a session of length T. The presented method models a window of turns Y_{w,t} = [y_t, y_{t+1}, ..., y_{t+w}] and X_{w,t} = [x_t, x_{t+1}, ..., x_{t+w}] using windowed DMDc, represented by:

Y_{t+1,w} = A_{w,t} Y_{w,t} + B_{w,t} X_{w,t} = [A_{w,t}  B_{w,t}] [Y_{w,t} ; X_{w,t}]    (4.1)

where [· ; ·] denotes vertical stacking. The dynamics over the window are captured by A_{w,t} and B_{w,t} in Eq. 4.1, which can be reconstructed by applying the pseudoinverse:

[A_{w,t}  B_{w,t}] = Y_{t+1,w} [Y_{w,t} ; X_{w,t}]^†    (4.2)

Once again, the transition matrix A_{w,t} and the controller B_{w,t} represent how the previous observations and the input signal approximately model the next time step. While the dynamics are typically considered for evaluating stable systems, where the eigenvalues are contained by the unit circle and subsequently correspond to a dampened oscillator, in the local dynamics extracted there were isolated moments when the system experiences instability. However, in those cases the faster-rising modes will still be dominant, so the relationship between magnitude and dominance remains. For this application, the eigenvalues of the transition matrix, denoted λ_T, model the local influence of the client's previous statements on their future statements, while the controller matrix eigenvalues², λ_C, contain information pertaining to the influence of statements made by the therapist on the future statements of the client.
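Equations 4.1 and 4.2 amount to a least-squares fit via the Moore-Penrose pseudoinverse. A minimal NumPy sketch (the function name is ours; in the actual pipeline Y and X hold DistilBERT turn embeddings):

```python
import numpy as np

def windowed_dmdc(Y, X, t, w):
    """Fit local dynamics [A, B] over a window of w turns starting at t.

    Y, X: (d, T) arrays of client observations and therapist inputs.
    Solves Y_{t+1,w} = [A B] @ [Y_{w,t}; X_{w,t}] via the pseudoinverse,
    as in Eq. 4.2.
    """
    d = Y.shape[0]
    Yw = Y[:, t:t + w]           # client turns in the window
    Yn = Y[:, t + 1:t + w + 1]   # same window shifted one turn ahead
    Xw = X[:, t:t + w]           # therapist turns in the window
    AB = Yn @ np.linalg.pinv(np.vstack([Yw, Xw]))
    A, B = AB[:, :d], AB[:, d:]
    return A, B
```

Sliding t over the session yields the sequence of local [A, B] pairs whose eigenvalues are used as features below.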
In the experiments, we study the information contained by λ_T and λ_C in isolation (conditions T and C), as well as combined (T+C).

2. Due to the equivalent dimensionality, the controller can be considered under eigenvalue decomposition rather than SVD.

Figure 4.3: Number of non-zero eigenvalues for dynamics computed at varying window sizes, for both transition (T) and controller (C).

Furthermore, our experiments compare the use of different window sizes and include varying numbers of dominant eigenvalues extracted from the matrices. Specifically, window sizes w ∈ {3, 5, 8} were evaluated.³ Similar evaluations were conducted to choose the number of eigenvalues to include. The experiments reflected in Figure 4.3 enabled us to choose varying numbers of eigenvalues, denoted by n ∈ {1, 3, 5, 7}.

3. A minimum of 3 was chosen because 2 or fewer time-steps is too short a window to meaningfully capture the dynamics, and 8 was the maximum, to account for a session that was 9 turns long.

4.2.4 Models

With the eigenvalues for each window size computed, a set of machine-learning models with varying parameters and underlying assumptions was trained to classify each window for the local prediction task.

Figure 4.4: With the windows extracted using the pipeline outlined in Figure 4.2, windowed DMDc is applied to find the corresponding transition and controller matrices [A, B]_t and their respective [λ_T, λ_C]. These are used to make local predictions, which are then accumulated and passed into the aggregate model for a global prediction.

The models trained included:

- Gaussian Naive Bayes (GNB)
- Logistic Regression (LR)
- Linear Support Vector Machine, C ∈ {0.01, 0.1, 1, 10, 100} (L-SVM_C)
- k-Nearest Neighbors, k ∈ {1, 3, 5, 10, 30, 50, 100} (KNN_k)

Since our objective is to study whether these modes capture meaningful information about CBT competence, the models are kept relatively simple, and a wide array is chosen to explore how different underlying assumptions about the data can lead to better local and global performances.
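Treating the eigenvalues as fixed-length features, the model sweep can be sketched with scikit-learn. The feature construction below, sorting eigenvalues by magnitude and stacking real and imaginary parts, is our assumption about how complex eigenvalues would be fed to real-valued classifiers, not a detail taken from the study:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier

def eig_features(A, n):
    """Top-n eigenvalues of a dynamics matrix, as a real feature vector."""
    ev = np.linalg.eigvals(A)
    ev = ev[np.argsort(-np.abs(ev))][:n]        # dominant modes first
    ev = np.pad(ev, (0, n - len(ev)))           # zero-pad if fewer than n
    return np.concatenate([ev.real, ev.imag])   # 2n real-valued features

models = {
    "GNB": GaussianNB(),
    "LR": LogisticRegression(max_iter=1000),
    **{f"L-SVM_{c}": LinearSVC(C=c) for c in [0.01, 0.1, 1, 10, 100]},
    **{f"KNN_{k}": KNeighborsClassifier(n_neighbors=k)
       for k in [1, 3, 5, 10, 30, 50, 100]},
}
```

Each model is then fit on the per-window feature vectors against the session-level binary label for the local task.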
4.2.4.1 Aggregating Global Scores from Local Predictions

The models listed above are trained to make a prediction P(score = 1 | λ)_t, estimating the likelihood that the local dynamic modes belong to a CBT session that is high-competence. Since the data originate from a CBT competence score that is non-binary, it stands to reason that low-competence sessions (CTRS < 40) will have moments that correspond to competence, but not frequently enough to reach the threshold. For this reason, if the local predictions are more generally informative, accumulating them should yield a signal that is useful for predicting the overall competence of the session, P(score = 1)_session.

Our analysis tests 2 different accumulation methods: sum and average. Sum simply adds up all of the instances of competence over each of the extracted windows for a given session, while average divides this value by the number of windows that have been extracted from the session. Similarly, two different models for classification are used: training mean (TM) and logistic regression (LR). TM computes the mean over the training split and assigns class 1 if the score for a test session is greater than or equal to that mean. LR uses the training split scores to fit a simple linear model that predicts over the accumulated scores for the test sessions.

4.3 Results

All results below are averaged across a 5-fold cross-validation. The splits are maintained for both the local and global model training and accumulation. This splitting step helps account for erroneously good results that occur due to spuriously lucky splits of the data. The data splits are conditioned on the labels, maintaining a relatively identical distribution of high- and low-competence sessions, as well as accounting for the client, so that all sessions from the same client are kept together in a split, allowing us to know that the model is not over-fitting to anything about a specific client and their dialogue dynamics.
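The two accumulation methods and the TM aggregator can be sketched as follows (a simplified illustration of the description above, not the exact implementation):

```python
import numpy as np

def accumulate(window_probs, method="sum"):
    """Combine per-window P(high-competence) values into one session score."""
    window_probs = np.asarray(window_probs, dtype=float)
    total = window_probs.sum()
    return total / window_probs.size if method == "avg" else total

def training_mean_classifier(train_scores):
    """TM aggregator: label a session 1 if its accumulated score is at
    least the mean accumulated score seen on the training split."""
    threshold = float(np.mean(train_scores))
    return lambda score: int(score >= threshold)
```

The LR aggregator would instead fit a one-feature logistic regression on the training sessions' accumulated scores.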
4.3.1 Local Predictions

Using the reference scores in Table 4.2, Table 4.4 displays the models that succeeded in achieving a significant (3σ) improvement over random chance. These models were the selected candidates that were later applied to the global task. Notably, the best-performing models for each of the tasks use eigenvalues from both the transition and controller (T+C), which suggests that there is latent information in the dynamics of both the client and the effect of the therapist that meaningfully maps to CBT competence.

GNB models generally did the best, with w = 8 and n = 7, which are the largest values possible. There are two components that likely contribute to this: 1. the more windows there are, the fewer samples there are for the model to make mistakes on; 2. the longer-range dynamics capture more meaningful representations with regard to the task of predicting CBT competence than those in smaller windows. This suggests that the time-varying component of the dynamics meaningfully changes over 8 talk turns or more.

Among the 3 scores which were the most effective locally, agenda (ag, 0.5749), key cognitive behavior (cb, 0.5608), and strategy for change (sc, 0.5624) stand out, suggesting that there is a more strongly observed signal in the extracted dynamics pertinent to those scores than to others. An interpretation that could account for this is that effective presentation of an agenda and strategy for change is expected to occur consistently throughout the session, similar to the dynamics of invoking key cognitive behavior.

The results in Table 4.4 suggest that these extracted eigenvalues contain a meaningful signal for the determination of CTRS scores and the competence of the session globally. Fundamentally, these models are not expected to perform particularly well, as they are trying to predict a global label from only local information.
Furthermore, the models never observe the data directly, only a representation of the interaction of these signals over a small window sampled from the session. It stands to reason that high-competence moments exist in sessions that are scored overall as low-competence. Further still, it is likely that the majority of moments are entirely neutral moments that the classifier would predict ambiguously (close to 0.5).

4.3.2 Session-Level Predictions

Table 4.5 lists the best performing models for each of the scores. When compared to the bootstrap scores in Table 4.3, it is clear that most of the models outperform the 2σ baselines but ultimately fall short of the 3σ ones. The homework (hw, 0.6191) and interpersonal (ip, 0.5995) models do not globally perform better than the 2σ baseline.

The result of the interpersonal models does not come as much of a surprise, since those local models were among the worst performing in Table 4.4 as well. Observing in Tables 4.2 and 4.3 that both the local and global baselines are each lower than many of the other scores suggests that this is generally a more difficult task, and that this method of accumulating and aggregating the scores from local dynamic windows is not adequate. Interestingly, a separate Pearson correlation study showed that the model's interpersonal predictions correlated rather well (r = 0.7595, p = 1.903 × 10⁻³⁹) on the training set but generalized extremely poorly to the test set (r = 0.1303, p = 0.4024). By comparison, the test-set correlation for ctrs was r = 0.4647 with an associated p = 0.0008.⁴

Notably, pacing and timing (pt, 0.6868) is the only score where the 3σ baseline is effectively met, with agenda (ag, 0.6835) and total CTRS (ctrs, 0.6767) coming close. According to the CTRS coding manual, the pacing and timing of a session evaluates the use of time via the induced structure and control the therapist exercises over the session.
It is likely the case that setting an agenda at the beginning helps with the pacing and timing, which in turn strongly correlates with the overall CTRS competence. This result is particularly interesting since it implies that the local dynamics extracted using the Windowed DMDc method and BERT embeddings capture a meaningful interpretation of the interaction between the therapist and the client. Since the T+C models were found to be the most effective in predicting session-level competence, it indicates that both the client's own internal transition and the way in which they interpret the therapist's words are the most informative for predicting the global scores. Most of these models also used the largest window size (w = 8) and number of eigenvalues (n = 7) with the GNB model, which is likely also a consequence of the results in Table 4.4, which down-sampled the models that were used in the global predictions.

⁴ Traditionally, correlations with |r| > 0.3 and p < 0.05 are considered significant.

Results were split fairly evenly between sum and average (avg) as the accumulation method, but the LR aggregator was clearly stronger at separating the differences than the simple training mean (TM).

4.4 Discussion

Figure 4.5: A sample trajectory of CTRS scores accumulating over the course of 2 sessions, one high quality and one low quality.

Overall the results correspond positively, suggesting there is significant value in extracting these dynamic modes and in their ability to adequately capture the desired behaviors of the therapist for assessment and feedback. This is particularly interesting when considering how these systems might be used for tracking and real-time feedback to the therapist. As seen in Figure 4.5, the resulting models can be used to track the relative trajectory of the session as it is happening. With these models it would be possible to evaluate and even plan the future control signals.
The fact that these systems perform well under a linear dynamical system assumption suggests that the computation and optimization steps would be relatively computationally inexpensive compared to alternative non-linear models.

Score  Model       Input Type  w  n  F1
ag     L-SVM 1     T+C         3  3  0.5267
ag     L-SVM 100   C           5  5  0.5395
ag     GNB         T+C         8  7  0.5749
ap     L-SVM 10    C           3  5  0.5163
ap     L-SVM 10    C           5  7  0.5301
ap     GNB         T+C         8  7  0.5467
co     KNN 50      T           3  3  0.5169
co     KNN 100     T           5  3  0.5310
co     GNB         T+C         8  7  0.5547
fb     L-SVM 10    C           3  7  0.5154
fb     L-SVM 10    C           5  7  0.5222
fb     GNB         T+C         8  7  0.5401
gd     L-SVM 10    C           3  7  0.5153
gd     L-SVM 10    C           5  5  0.5292
gd     GNB         T+C         8  7  0.5355
hw     L-SVM 10    C           5  5  0.5134
hw     GNB         T+C         8  7  0.5303
ip     L-SVM 10    C           3  1  0.5173
ip     KNN 1       T           5  5  0.5179
ip     KNN 1       T+C         8  7  0.5187
cb     L-SVM 10    T+C         3  7  0.5249
cb     GNB         T+C         5  5  0.5400
cb     GNB         T+C         8  7  0.5608
pt     L-SVM 10    C           3  3  0.5210
pt     L-SVM 10    C           5  5  0.5295
pt     GNB         T+C         8  7  0.5514
sc     L-SVM 1     C           3  3  0.5234
sc     KNN 100     T           5  5  0.5322
sc     GNB         T+C         8  7  0.5624
un     L-SVM 0.01  C           3  3  0.5075
un     L-SVM 10    T+C         5  5  0.5203
un     L-SVM 10    T+C         8  7  0.5338
ctrs   L-SVM 10    C           3  5  0.5203
ctrs   L-SVM 10    T+C         5  5  0.5315
ctrs   GNB         T+C         8  7  0.5514

Table 4.4: Best performing models on local samples of segments by window size.
Models with identical performance per a given window size.

Score  Model     n  Input Type  w  Accumulator  Aggregator  F1
ag     GNB       7  T+C         8  sum          LR          0.6835
ap     GNB       7  T+C         8  sum          LR          0.6574
co     GNB       7  T+C         8  avg          LR          0.6470
fb     GNB       7  T+C         8  sum          LR          0.6580
gd     GNB       7  T+C         8  sum          LR          0.6577
hw     GNB       7  T+C         8  sum          LR          0.6191
ip     KNN 1     7  T+C         8  avg          TM          0.5995
cb     GNB       5  T+C         5  avg          LR          0.6588
pt     GNB       7  T+C         8  avg          LR          0.6868
sc     GNB       7  T+C         8  avg          LR          0.6361
un     L-SVM 10  5  T+C         5  avg          LR          0.6635
ctrs   GNB       7  T+C         8  sum          LR          0.6767

Table 4.5: Models performing the best on the global task.

Part II: Optimizing Interaction Strategies

Chapter 5

Optimizing Interactions for Neuro-developmental Diagnostics

Previous chapters explored the effectiveness of regulated dynamical systems as a paradigm for predicting the evolution of behavior during an interpersonal exchange. We explored how the extracted dynamics and controller matrices can inform the way a conversational driver (e.g., an interviewer or therapist) might adapt their approach to increase the likelihood of a positive outcome. In the following chapter we turn our attention to the methods by which these policies might be optimized.

5.1 Data-Driven Diagnostics

Clinical diagnoses represent a time-sensitive, high-stakes domain in which experts are called upon to seek and assess presenting symptoms, test results, and histories against a plethora of clinical conditions to characterize, and categorize, the patient's condition. Due to the wide variety in clinical conditions and presentations, and the complex relationships between them, data-driven models have become increasingly adopted to coalesce these multimodal streams of information. A number of works have demonstrated the promise of machine learning (ML) models in these domains, addressing many clinical challenges such as patient privacy and medical data-mining [Kononenko, 2001, Bellazzi and Zupan, 2008, Kourou et al., 2015, Papernot et al., 2016].
However, many of the computational approaches operate under the assumption that the sources of data for these ML models are all available at the same time. Clinical practice, especially in complex decision making, typically involves a progressive multi-step approach, starting with the collection of a history and symptoms on admission. With this information a practitioner can form initial hypotheses, which are then supported or rejected by further tests and data gathering, which can then either be followed up on or used to establish a diagnosis. Following this, a natural extension to the adoption of ML models is to reformulate the diagnostic procedure as a decision process, rather than a detection and classification problem: an intelligent agent (the practitioner) takes actions that have a cost but progressively reveal information about the underlying condition they are attempting to characterize.

To demonstrate the effectiveness of this task formulation, we ground our work in addressing challenges associated with determining a child's risk for a neurodevelopmental disorder. In particular, with the latest Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5) introducing joint diagnosis of Autism Spectrum Disorder (ASD) and Attention Deficit-Hyperactivity Disorder (ADHD), clinicians are posed with the challenge of differentiating between the two, despite their overlapping presentations. Specifically, certain behaviors and subsequent social skills presented in children with ADHD have a well documented overlap with those associated with ASD [Sinzig et al., 2009, Levy et al., 2010, Kern et al., 2015, Salley et al., 2015, Ng et al., 2019]. This results in children being characterized as at-risk for conditions they may not have, potentially delaying the correct evaluation, diagnosis, and subsequent interventions.
Within the domain of ASD diagnosis, prior work has evaluated the sensitivity of ASD diagnostic instruments at differentiating the two conditions in children known to have non-comorbid ADHD and those with a known ASD diagnosis [Grzadzinski et al., 2016, Ardulov et al., 2020b]. Currently, when a child is considered at-risk for ASD, they and their caregivers are routed through a series of extensive diagnostic procedures composed of observational items, both caregiver reported and clinician observed. These items rate the behaviors and developmental abilities exhibited by a child. Diagnostic algorithms are then used to aggregate the scores to inform clinicians and allow them to more methodically determine diagnoses or to triage children to further appropriate screening. The rigor and thoroughness of these instruments, while time-intensive, are designed with consideration of the high stakes and consequences of misdiagnosis.

Machine learning (ML) models have been presented as analytical tools to develop a stronger understanding of the differentiating factors of ASD and ADHD from the data available. However, due to the significance of the domain, it is important that diagnostic models adhere to two critical characteristics. The first, robustness, promotes the ability of the model to predict effectively with less information or with missing and noisy information. The second characteristic, interpretability, promotes a traceable decision path, enabling a human-readable decision flow that determines how a model came to a conclusion for a particular data input. Prior work on clinical diagnostic ML approaches has largely utilized decision tree or forest based models in order to select predictive and reduced sets of items [Wall et al., 2012, Kosmicki et al., 2015, Duda et al., 2016].
While capable of generating decision paths which accommodate the second desideratum presented above, concerns over the choice of data and the generalization of these results underscore the difficulty of validating the robustness of these models [Bone et al., 2015, 2016]. By selecting a more appropriate dataset and more thoroughly outlining the notion of a robust classifier, our work presents a novel application of a reinforcement learning (RL) framework toward constructing a simultaneously robust and interpretable classifier.

The work demonstrates the limits of tree and forest based models, illustrating their sensitivity to data omission and corruption. The methodology outlines an approach to build adaptive strategies able to guide and recommend which item to inquire about next as more information about the child becomes available. This reformulation establishes the diagnostic setting as a decision-making process, for which an optimal policy can be found through the policy optimization method known as Q-learning. Through simulation of diagnostic processes, the policy jointly maximizes diagnostic accuracy while minimizing the number of items needed to assign a diagnosis, thus eliciting a policy that is adaptive to the available information and less dependent on any single feature to make a diagnosis. In the end, the policy can be interrogated and interpreted similarly to a tree based model, but with clearer advantages in robustness. This work highlights the advantages of this reformulation and presents an illustrative evaluation of Q-learning as applied to the ASD-ADHD diagnostic domain.

5.2 Autism Diagnostic Interview - Revised

ASD is a neurodevelopmental disorder (NDD) that primarily manifests itself in an individual's ability to conduct and regulate behaviors associated with social communication [Lord, 2010].
In an effort to more consistently assess children for ASD symptoms, a number of different instruments consisting of a variety of clinically relevant items have been developed that enable diagnostic algorithms. While these instruments and associated algorithms are not used to diagnose, they are tools that the clinician can rely on in determining a child's risk and need for further diagnostic observation. Many different instruments measure similar behaviors, but differ in who the observer is and how the observations are made, varying between parent/caregiver reported and direct clinician observation, leading to varying reliability and accuracy for certain observations.

One such widely used instrument is the Autism Diagnostic Interview - Revised (ADI-R), a clinician-led diagnostic interview during which the child's caregivers are asked a series of questions associated with evaluating the child's communication abilities and social behaviors [Lord et al., 1994]. The clinician assesses responses to these questions to determine whether the child's behavior fits within expected behavior given the child's background.

Figure 5.1: ADI-R administration process. The parent is interviewed by a clinician. The clinician asks open-ended questions that are tied to an item and listens to the responses from the parent. Typically the clinician is listening and asking about specific examples of the child's behavior in relation to the item at hand. The clinician records a rating based on the presented information and can leave notes to themselves. After the interview is complete, the clinician uses their recorded ratings to complete the ADI-R algorithm, computing whether the child meets the instrument's cut-off thresholds for ASD.
If, through the course of the interview, the clinician can determine that the child should not be considered at-risk for ASD, either because they are typically developing or have a different NDD, it could prevent further unnecessary testing and can direct the child to a more appropriate diagnostic procedure or treatment option. A more thorough description of the interview process is given in Figure 5.1.

Recent works evaluating the differentiation of ADHD and ASD in at-risk children have demonstrated the statistical sensitivity of identifying the symptoms and items which most significantly distinguish the two conditions [Grzadzinski et al., 2016, Ardulov et al., 2020b]. These have evaluated different instruments, including the ADI-R, in verbal, school-aged, clinically referred children. Similar studies have explored the use of machine learning (ML) models, specifically Decision Tree based models, to differentiate and identify specific features which carry diagnostically relevant information [Wall et al., 2012, Bone et al., 2016, 2015, Kosmicki et al., 2015]. Tree models are preferred for their interpretability; however, our results demonstrate that they are typically poorer classifiers than the policies learned by Q-learning when predicting from corrupted data, whether by omission or noise.

5.3 Reinforcement Learning

Reinforcement learning (RL) is a paradigm for identifying strategies or algorithms that solve complex, often sequential, decision-making problems, optimizing the total reward gained over the length of an interaction [Kaelbling et al., 1996]. By formulating a problem as a sequence of actions made by an agent which is rewarded or penalized, it is possible to learn long-term strategies that exhibit complex behaviors such as local sacrifices of reward for long-term pay-out. This realm is closely related to the domains of game theory and optimal control theory [Kiumarsi et al., 2017], in which rewards and errors are used to define the utility of a policy.
In both domains the definition of a policy can be generalized as a function π: S → A, where S represents the state-space of an agent or system being controlled and A represents an action space. A reward function r: S → ℝ is a function that informs the agent about the utility of reaching a specific state. Given a discrete state-space and a discrete action-space, the decision-making process is a graph with nodes and edges representing states and actions respectively. There are three types of states: initial, passive, and terminal. An initial state is one in which the agent can find themselves initialized, and a terminal state is one that has no outgoing edges, thus terminating the decision-making process. A path p = [s_0, s_1, ..., s_{n−1}, s_n] represents the order in which the states are visited: a list beginning with the initial state s_0, followed by passive states s_1, ..., s_{n−1}, finishing with the terminal s_n, where some state s_{j+1} was reached from s_j by taking some action a_j. An optimal policy π* is one where the sequence of states visited under the actions selected according to the policy composes a path p* such that:

∑_{s_i ∈ p*} r(s_i) = max_{p ∈ P} ∑_{s_j ∈ p} r(s_j)

When state transitions are stochastic, the problem is known as a Markov Decision Process (MDP), implying that performing the same action in a particular state may not always yield the same transition. If the transition probabilities of an MDP are not known a priori, a common approach to finding the optimal policy is to learn through simulation. One such learning method, Q-learning, attempts to approximate the Q-function mapping state-action pairs onto a real value representing the "quality".

The quality of taking an action a_j given non-terminal state s_j is the reward received at the current state plus the expected sum of γ-discounted future rewards received from all states following action a_j.
More explicitly:

Q(s_j, a_j) = r(s_j) + E[γ r(s_{j+1}) + γ² r(s_{j+2}) + ... + γ^{n−j} r(s_n)]
            = r(s_j) + γ max_{a′ ∈ A} Q(s_{j+1}, a′)

The Q-learning paradigm and its variations have demonstrated recent success, achieving super-human performance in complex games, namely playing Atari games when using deep architectures [Mnih et al., 2015] and defeating the reigning human champion of Go when paired with Monte-Carlo Tree Search [Silver et al., 2017]. This work adapts these principles, leveraging their long-term planning capabilities for the differentiation of NDDs in children based on behaviors that are observed by their guardians. The key to this formulation is to consider the diagnostic instrument as an MDP in which the clinician acts as the agent of a policy which suggests the items to follow up on, so as to reach a desired end state where their prediction, based on observed responses to actions, matches the underlying true label for the child.

5.4 Methods

5.4.1 Data

Querying a dataset of children clinically referred as at-risk for ASD, an initial sample of 119 children with best clinical estimates of ADHD was assigned by experts. Then the Full Scale IQ (FSIQ) and age of referral were used to restrict a query of children with best clinical estimates of ASD. These restrictions allow the analysis to focus on groups of children that would present most similarly in terms of developmental ability. When FSIQ was not available, a Verbal IQ (VIQ) was used to determine whether a child with ASD belonged in the sample.

Figure 5.2: Distribution of demographic information: age, FSIQ and VIQ across different diagnostic conditions.

Variable  Average             t-test p-value
Age       99.06  (σ = 29.34)  0.5639
FSIQ      102.15 (σ = 16.54)  0.2615
VIQ       102.21 (σ = 17.15)  0.9396

Table 5.1: Demographic distributions. The p-values for the t-tests suggest that we cannot reject the null hypothesis that the groups are drawn from the same population with respect to these features.
The final data sample consisted of 463 children, 344 of them with clinical estimates of ASD. Figure 5.2 visualizes the distributions by clinical estimate, and Table 5.1 shows the averages, standard deviations, and p-values associated with evaluating Student's t-test. The yielded p-values suggest that we cannot reject the null hypothesis that the two groups share the same distributions on these features.

Further still, although sex was not used in sample selection, similar sex distributions were observed (ASD contained 80% male to 20% female, while ADHD was 74% male to 26% female), so sex is not conflated with the diagnostic features; this is consistent with globally observed statistics across clinical referrals for ASD [Halladay et al., 2015].

Our experiments focus on the use of ADI-R items to predict the correct ASD/ADHD clinical estimate. Each item is scored on a scale [0, 2], where 0 implies typical behavior and 2 represents significantly atypical behavior. A subset of the ADI-R items is chosen using Student's t-test to identify the 10 items that are maximally differentiated across the two diagnostic conditions, comparing each item's capacity to distinguish ASD from ADHD.

Item    Description                                          t-test Statistic
ADI 35  Current: reciprocal conversation                     9.2263
ADI 34  Current: social vocalization/chat                    7.1570
ADI 68  Current: circumscribed interests                     6.5214
ADI 51  Current: social smiling                              6.3941
ADI 33  Current: stereotyped utterances and delayed ech...   6.3521
ADI 42  Current: pointing to express interest                6.2030
ADI 59  Current: appropriateness of social responses         6.1215
ADI 45  Current: conventional/instrumental gestures          5.7257
ADI 57  Current: range of facial expression used to com...   4.9405
ADI 72  Current: undue general sensitivity to noise          4.7258

Table 5.2: Items used, their descriptions, and the Student's t-test statistic. All p-values < 1.06 × 10⁻³, indicating significance with Bonferroni correction.

This was done to account for the way in which the states will be encoded in the policy and the
exponential growth of the state-space needed to accommodate each new item. This limitation is described in more detail in Section 5.4.3. Table 5.2 explicitly states the 10 items used as well as the associated t-statistics.

In order to evaluate the methods and models trained more effectively, our experiments utilize a 10-fold stratified cross-validation (CV). This method allows us to validate that the results of model performance were not produced by a single lucky train-test split.

5.4.2 Baselines

In diagnostic modeling, it is key to interpret not only the output of the model, but the path to that decision [Ahmad et al., 2018]. For this reason, prior work on autism diagnosis has largely explored the use of tree and forest based models. With these characteristics in mind, we outline a benchmark consisting of Decision Tree (DT) and Random Forest (RF) models that are trained with stratified 10-fold cross validation (CV) to better approximate performance on evaluation and the expected generalization of the learned models.

Two types of benchmark models are trained: the first with no data augmentation, learning from the original training samples as is, and the second trained using a batch of masked data. In this case the masking operation selects a number of items and drops them from the input. In order to allow these models to still make predictions, the missing data is interpolated with the median value for the missing column from the training split. This configuration allows us to synthesize more examples for training the model. Furthermore, these new data can potentially introduce counterfactual examples to those observed in the original sample, requiring the model to learn a more robust classifier which is less likely to fail with imperfect inputs.

In order to confirm whether the models are learning significant representations of the data, a bootstrapped baseline is generated by randomly sampling a label from the same distribution as the original dataset.
Simulating this process 10000 times produces a distribution of expected performance to compare against. The 95th percentile of the F1 distribution, 0.4428, is taken for the comparisons. If a model's average CV F1-score is above this threshold, then the model generally performs significantly better than random guessing, suggesting that the relationships learned by the model are likely significant.

5.4.3 Q-Learning

The implementation of our policy relies on two major elements: the first is a Naive Bayes classifier (G) that is used to make the final prediction, and the second is the policy, which utilizes a table to discretely approximate the Q-function.

The formulation begins by reconsidering the ADI-R scores as observations made by an agent after taking an action. In particular, we consider an action as "asking" a question and receiving the score for that item as the observation. We use 0 to represent missing information and then shift the true severity score by adding one to represent the observed score. Now, by evaluating the observation as a 4-bit number, we can convert the observation into a state that can be used as a key in the policy table to look up the Q-values of each action when in that state.

The Naive Bayes classifier is trained on a masked set of training inputs, similar to those outlined for the ML model baselines. This function G maps observations to a 2-dimensional
Equations 5.1 and 5.2 outline the local and nal rewards used to train the policy more explicitly. r local (s;G;y) = 2(G(s)[y] 1 2 ) (5.1) r final (s;G;y;h;l) = 8 > > < > > : +C(1 h l ) y = arg max G(s) 1 else (5.2) During each iteration we simulate the process of starting with no information and allow the agent follow its current policy, observing the reward and state transition and updating their Q-function at each interaction. Additionally, during training a parameter 2 [0; 1) is used as the exploration probability. This is the probability that during training the agent will choose a random action (explore) over following its current policy (exploit). For the purposes of these experiments an = 0:2 is used. Further still, the discount parameter is set to 0.99, and the learning rate is set to 0.01. The policy cannot repeat actions, once one has already been taken. A more thorough process for training the policy can be found in Algorithm 1 in Appendix D. Each fold trained the policy on the same set of training examples as those used in the ML models so that the comparisons across folds could account for splits in the data. During training, the policy is evaluated on a withheld development set. While there is an upper threshold for how many iterations the policy can learn for, the reward received when 76 Figure 5.3: Process demonstrates how a single example is converted into masked examples. The 0s represent values that are unavailable to the classier a priori and will be potentially imputed. The notationC m n (m choosen) represents the number of examples generated by masking n items. evaluating on the development set is also used as an early stopping condition in the event the model stop improving its development reward for more than 30 iterations. 
5.4.4 Experiments

The performance and robustness of the Decision Tree (DT) and Random Forest (RF) models and the reinforcement learning (RL) model are evaluated through prediction of diagnostic labels in a 10-fold stratified cross validation. For each fold, 10 percent of the data is withheld from training and combinatorially masked. This means that for each sample in the test set a "corrupted" sample is created by omitting a subset of the features. For the DT models this omission is replaced with that item's median value from the training split, in the same manner used to train the robust models. For the RL model, missing data remains encoded as 0. This omission process produces every combination of possible masks as applied to the 10 items in any given sample. An example of the masking process on a single example can be seen in Figure 5.3.

The experiment with masking demonstrates how the models perform with incomplete knowledge and allows us to determine whether they are learning robust representations. A more robust model should recover a correct diagnosis and perform better than the bootstrapped baseline. A classifier that is able to successfully predict despite missing data demonstrates that the policy could generalize correctly to an instrument with fewer items.

5.5 Results

Items Masked  DT      RF      DT robust  RF robust  Q-Learning (π)
0             0.4131  0.4753  0.5044     0.5064     0.5620
9             0.0307  0.0000  0.1211     0.1211     0.4312

Table 5.3: F1-score initially when all data is available, and subsequently when only one feature is available.

Table 5.3 shows how the different models performed, first on the completely available dataset, and then when information was missing. The table shows that the DT is the only model that initially does not outperform the randomized baseline set forth in Section 5.4.2. Notably, the learned policy outperforms all of the ML models and is the only model to maintain a competitive F1 score even when 9 features are inaccessible.
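The combinatorial masking used in these experiments can be sketched as follows (a hypothetical 10-item sample; 0 marks a masked entry, matching the policy's missing-data encoding):

```python
from itertools import combinations

def mask_all(sample):
    """Yield every combinatorially masked copy of a sample (0 = masked)."""
    n = len(sample)
    for k in range(n + 1):                     # mask 0, 1, ..., n items
        for idx in combinations(range(n), k):  # every subset of positions
            masked = list(sample)
            for i in idx:
                masked[i] = 0
            yield tuple(masked)

sample = (1, 2, 3, 1, 2, 3, 1, 2, 3, 1)   # shifted item scores in {1, 2, 3}
masks = list(mask_all(sample))            # 2**10 = 1024 masked variants
```

Each test sample thus expands into 2¹⁰ = 1024 variants, from the fully observed vector down to the all-masked one, which is what drives the degradation curves discussed here.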
These results suggest that the policy is learning to correlate each item independently with the diagnostic label while also learning the relationships between the features. These results can also be seen more gradually in Figure 5.4, where the degradation of the ML models' performance against the baseline is seen with higher granularity. The DT robust and RF robust models degrade much more steeply than the policy, suggesting that the underlying states and structures learned by the policy are more robust than the representations learned by the ML models.

Figure 5.4: F1-score degradation as more features are masked from the inputs.

5.5.1 Policy

In line with our goal of maintaining the interpretability of the policy, Figure 5.5 demonstrates how the learned policy for one of the folds adapts depending on which response is returned.

Figure 5.5: An example of how a policy updates with all possible responses to an inquiry. The top row captures the initial "empty" state of the policy, while the branches represent all of the possible state updates that could occur depending on the observation made following the action taken. The column vector represents the state of the policy, or the items that the policy has information about so far. The horizontal bar chart captures the relative Q-value of each action (actions are equivalent to querying an item or making a prediction). As ADI 45 has the highest Q-value, it is the first item that is queried by the policy. The arrows capture possible responses, or observations, that the policy can have, which in turn are used to update the state. The vertical bar chart captures the current state's predicted probabilities of ADHD and ASD respectively (Belief).

The figure shows how it is possible to trace back from a diagnosis, see what the
diagnostic belief state is according to the Naive Bayes classifier trained for the RL model, and how in certain circumstances the policy learns to avoid predicting too early, when too little information is available.

Another way to better interrogate the representations learned by the ML models and the policy is to examine the feature importance for each of the models. For the policy, the feature importance can be thought of as the Q-value of any action given a state, and it is adaptive throughout the session, as seen in Figure 5.5. A more global representation of feature importance is the output of Q(0, ·), as this is the priority of items to ask when no other information is presently available, although it is important to note that in any state the policy may produce a different importance for future decisions.

Model      Variance    Range
DT robust  8.8 × 10⁻⁴  0.1089
RF robust  2.7 × 10⁻⁴  0.0635
π          6.3 × 10⁻⁵  0.0273

Table 5.4: Variance and range of feature importance for each of the classifiers.

Figure 5.6 shows some of the underlying differences between the ML models and the policy. Namely, it seems that the policy learns to consider the features more evenly than the tree based models. This is supported further by the variance and range of the feature importance for each of the classifiers, shown in Table 5.4. The difference is likely a byproduct of the exploration procedure allowing the model to learn better partial representations. Similarly, since the quality of an action depends on all future states and rewards, the Q-function for a given state-action pair aggregates the expected results of future decisions.

5.6 Conclusion

By training a policy through simulation, our models are able to learn best clinical estimates from missing or corrupted data.
This method suggests that the models are not overly reliant on an individual item's correlation with a target but may instead discover more nuanced representations that capture the interactions between items. This enables the interpretation of possibly conflicting information from various items, with the ability to follow up further into an interaction. Furthermore, the adaptive nature of this policy and the joint objective it optimizes imply that when a case clearly points towards one outcome, the model will reach it with fewer interactions, while suggesting the next most informative action in more complex scenarios.

Figure 5.6: Importance of different items relative to each other according to different model types.

The result is a policy that considers all future possible outcomes before taking the next action and ingesting more information; it is thus possible to maximize robustness while preserving interpretability. An additional advantage is that while in the experiments the policy was allowed to choose and execute the next highest-quality action according to the reward, this method is also adaptable to spontaneous actions being taken. More concretely, the structure of the graph and the policy's algorithm will always recommend the next best action, in terms of assigning the correct label, from any state, regardless of how the agent reached that state.

This work demonstrated that data augmentation and Q-learning can be utilized in low-resource conditions to learn a more diagnostically robust representation of clinical data. By reformulating the process of diagnostic classification as an MDP, this work further outlines a computational framework for designing policies and utilizing the benefits of machine learning for adaptive diagnostic testing in uncertain data environments.
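The "recommend the next best action from any state" property amounts to a greedy argmax over the actions not yet taken. A minimal sketch (the Q-table, states, and actions are toy placeholders for the partial-information states and item-query/predict actions):

```python
import numpy as np

# Toy Q-table over (state, action) pairs; values are random placeholders.
rng = np.random.default_rng(1)
n_states, n_actions = 8, 5
Q = rng.uniform(size=(n_states, n_actions))

def next_best_action(Q, state, used):
    """Greedy recommendation from an arbitrary state.

    Regardless of how the agent reached `state` (including spontaneous
    actions taken outside the policy, e.g. by a clinician), the
    recommendation is simply the highest-Q action not yet taken.
    """
    candidates = [a for a in range(Q.shape[1]) if a not in used]
    return max(candidates, key=lambda a: Q[state, a])

# e.g. actions 0 and 2 were already taken spontaneously
a = next_best_action(Q, state=3, used={0, 2})
assert a not in {0, 2}
```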
The method for identifying and optimizing this policy may be useful in designing future diagnostic methods that follow adaptive procedures by leveraging data available on current instruments. Specifically, by structuring future methods in a way that enables clinical practitioners to interact with a policy while they are administering diagnostic tests, a learned policy would be able to recommend the prompts and items to explore that would be most powerful in solidifying a particular diagnosis, while also accounting for potentially spontaneous observations that the clinician might make.

Chapter 6: Conclusions and Future Work

Over the course of this thesis, the work presented built up a procedure for demonstrating the value of considering interactions under a control-affine paradigm. In particular, the present work begins by demonstrating that the relative coordination an individual exhibits is reflective of an underlying state with reference to a desired outcome. Next, the modeling components presented in Chapters 3 and 4 demonstrate how these behavioral signals can be adapted to the well-established framework of control theory. Finally, concrete strategies are demonstrated for obtaining a more favorable, adaptive, and robust interaction policy with the help of Q-learning.

By explicitly considering the specific ways in which dyadic interactions evolve as a function of momentary decisions and statements made by one of the interlocutors, it is possible to explicitly represent and evaluate strategies. Furthermore, these strategies can be designed in a manner that is adaptive to individualized desired outcomes, with acknowledgement of explicit limits, faculties, and resources. Additionally, the framework could be further utilized to more effectively identify adversarial or negative strategies, potentially creating an opportunity to protect individuals from manipulation or coercion.
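For concreteness, the control-affine paradigm referenced throughout can be written in a generic discrete-time form (a sketch; $f$, $g$, $A$, and $B$ are generic placeholders rather than quantities estimated in any particular chapter):

```latex
x_{t+1} = f(x_t) + g(x_t)\,u_t,
\qquad \text{with linear (DMDc) special case } x_{t+1} = A x_t + B u_t,
```

where $x_t$ collects one interlocutor's observed behavioral signals and $u_t$ the inputs of the other; the dynamics are affine in the control $u_t$.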
The following is a more thorough consideration of each of the presented works' conclusions and takeaways, as well as considerations for further development.

6.1 Child Truthfulness and Disclosure

The improvement in automated child deception detection achieved by introducing a control-affine paradigm and Granger-causal features to capture child-adult interaction dynamics shows that these features are significantly better indicators than static aggregation methods. This suggests that a child's response to interviewers is more informative than the specific language they use throughout the interview. Additionally, our results demonstrated the ability of the dynamical systems to effectively capitalize on cross-modal relationships between audio features and language.

Furthermore, the significant performance in both the truth-telling and disclosure tasks suggests the existence of speech cues that can inform interviewers of whether a child is prepared to disclose and whether their statements during disclosure are truthful. These insights can be used to evaluate CFI strategies and inform improvements to existing protocols.

6.2 Child Forensic Interviewing

By first developing a more nuanced metric of verbal productivity, our analysis explored the use of control-affine models to study the similarities and dissimilarities in child behavior. The power of the DMDc approach is that it allows individualized models to be fit to each child; this opens up the possibility of studying the similarities between children while still accommodating their individual characteristics and dynamics.

6.3 Cognitive Behavioral Therapy

In a similar fashion to the analysis done for CFI dynamics, here the use of locally windowed dynamics enables insight into the types of local behavioral patterns that are associated with adherent therapeutic strategies.
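The DMDc fitting that underlies these individualized models reduces to a single least-squares solve (Proctor et al., 2016). A minimal numpy sketch on synthetic data (the dimensions, `A_true`, and `B_true` are illustrative only, not estimates from any dataset in this work):

```python
import numpy as np

# Synthetic dyadic time series: x_t (one interlocutor's signals) driven by
# u_t (the partner's inputs). The true dynamics are made up for illustration.
rng = np.random.default_rng(2)
A_true = np.array([[0.9, 0.1],
                   [0.0, 0.8]])
B_true = np.array([[0.5],
                   [0.2]])
T = 200
X = np.zeros((2, T))
X[:, 0] = rng.normal(size=2)
U = rng.normal(size=(1, T - 1))
for t in range(T - 1):
    X[:, t + 1] = A_true @ X[:, t] + B_true @ U[:, t] + 0.01 * rng.normal(size=2)

def fit_dmdc(X, U):
    """DMDc: solve X' ~ A X + B U in one least-squares step."""
    Xp, Xm = X[:, 1:], X[:, :-1]
    Omega = np.vstack([Xm, U])       # stacked state + input snapshots
    G = Xp @ np.linalg.pinv(Omega)   # recovers [A B] jointly
    n = X.shape[0]
    return G[:, :n], G[:, n:]

A_hat, B_hat = fit_dmdc(X, U)        # close to A_true, B_true
```

Fitting one such pair (A_hat, B_hat) per child is what permits comparing children through their individual dynamics.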
The extension of this approach would allow for the consideration of subtle modifications to psychologists' interaction policies, so as to strengthen adherent moments and support therapists in making more impactful decisions throughout the session.

This also suggests an important distinction: a person's interpretation of their conversational partner's input might vary over time, but the results suggest that the information captured by the dynamics provides a strong basis upon which local strategy might be effectively studied.

Our study demonstrates the value of the extracted dynamics; however, the methods outlined for downstream classification are relatively simple and could likely be improved by the use of more sophisticated sequence-classification systems to make more complex predictions over time. Training with a non-binary target might also increase the separation of dynamic modes. Furthermore, it might be possible to improve overall prediction results by learning in a multi-task setting, as the sub-scores together may contain overlapping information that would improve individual sub-score predictions.

As demonstrated in Chapter 2, measuring the error produced by the autonomous model and the controlled model can be used to interpret the coordination between the interacting individuals, which was found to be useful for predicting global measures about the interaction. In the future, it would be possible to conduct a similar evaluation for this data and interpret the results in a similar way to the work of Martinez et al. [2019], which looked at the alignment of narrative structures between clients and therapists to estimate the alliance between the dyad, a measure related to several of the scores of interest in the CTRS.

Expanding the capabilities of these methods and building increasingly efficient automatic feedback mechanisms would allow for the development of concrete tools that therapists would be able to leverage to aid their clients.
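The autonomous-versus-controlled error comparison used as a coordination measure can be sketched as follows; the series below is synthetic, constructed so that u genuinely drives x:

```python
import numpy as np

# Coordination proxy: compare one-step prediction error of an autonomous
# fit (x' ~ A x) with a controlled fit (x' ~ A x + B u). A large error
# reduction when u is included suggests the partner's input is driving
# the dynamics. All data below are synthetic.
rng = np.random.default_rng(3)
T = 300
A = np.array([[0.7, 0.2],
              [0.1, 0.6]])
B = np.array([[1.0],
              [0.4]])
X = np.zeros((2, T))
X[:, 0] = rng.normal(size=2)
U = rng.normal(size=(1, T - 1))
for t in range(T - 1):
    X[:, t + 1] = A @ X[:, t] + B @ U[:, t] + 0.05 * rng.normal(size=2)

Xp, Xm = X[:, 1:], X[:, :-1]
A_auto = Xp @ np.linalg.pinv(Xm)                # autonomous model
G = Xp @ np.linalg.pinv(np.vstack([Xm, U]))     # controlled model [A B]
err_auto = np.linalg.norm(Xp - A_auto @ Xm)
err_ctrl = np.linalg.norm(Xp - G @ np.vstack([Xm, U]))
# err_ctrl is markedly smaller here, since u truly drives x
```

Error-based measures of this kind are one candidate signal that automatic feedback tools could surface to a therapist.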
Developing such tools is an exceptionally timely issue, as the environments and platforms for delivering mental health care have been especially challenged following the impacts of the recent global pandemic [Youn et al., 2020, Aknin et al., 2021], and these challenges fall most heavily upon the most vulnerable systems and least-supported communities.

6.4 Optimizing Interaction Strategy

While in the previous chapters we explicitly studied the power of dynamical systems to describe the ways in which interaction structure might be utilized most effectively to support a conversational guide, in this work we explore the explicit manner in which the input of the guide might be optimized to reach a desired state most effectively. Again, by constructing a function that jointly captures efficiency and accuracy, we are able to provide not only a more accurate model for diagnostics but also a more robust one, which more appropriately handles missing information. By extending these methods with more efficient information compression and state representation, it would be possible to consider more questions and explore a potentially more nuanced strategy.

Bibliography

Substance Abuse and Mental Health Services Administration. Key substance use and mental health indicators in the United States: Results from the 2018 National Survey on Drug Use and Health (HHS Publication No. PEP19-5068, NSDUH Series H-54). Rockville, MD: Center for Behavioral Health Statistics and Quality, 2019. Retrieved from https://www.samhsa.gov/data.

E. C. Ahern, S. N. Stolzenberg, and T. D. Lyon. Do prosecutors use interview instructions or build rapport with child witnesses? Behavioral Sciences & the Law, 33(4):476–492, 2015.

E. C. Ahern, S. J. Andrews, S. N. Stolzenberg, and T. D. Lyon. The productivity of wh-prompts in child forensic interviews. Journal of Interpersonal Violence, 33(13):2007–2015, 2018.

M. A. Ahmad, C. Eckert, and A. Teredesai.
Interpretable machine learning in healthcare. In Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, pages 559–560, 2018.

L. B. Aknin, J.-E. De Neve, E. W. Dunn, D. E. Fancourt, E. Goldberg, J. F. Helliwell, S. P. Jones, E. Karam, R. Layard, S. Lyubomirsky, et al. Mental health during the first year of the COVID-19 pandemic: A review and recommendations for moving forward. Perspectives on Psychological Science, page 17456916211029964, 2021.

G. D. Anderson, J. N. Anderson, and J. F. Gilgun. The influence of narrative practice techniques on child behaviors in forensic interviews. Journal of Child Sexual Abuse, 23(6):615–634, 2014.

V. Ardulov, M. Mendlen, M. Kumar, N. Anand, S. Williams, T. Lyon, and S. Narayanan. Multimodal interaction modeling of child forensic interviewing. In Proceedings of the 20th ACM International Conference on Multimodal Interaction, pages 179–185, 2018.

V. Ardulov, Z. Durante, S. Williams, T. D. Lyon, and S. Narayanan. Identifying truthful language in child interviews. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8074–8078. IEEE, 2020a.

V. Ardulov, K. Somandepalli, N. Anand, S. Zheng, E. E. Salzman, S. L. Bishop, C. Lord, and S. Narayanan. Identifying measured characteristics on ADOS, ADI-R and SRS differentiating ASD from ADHD. In INSAR 2020 Virtual Meeting. INSAR, 2020b.

J. S. Beck. Cognitive Behavior Therapy: Basics and Beyond. Guilford Press, 2011.

R. Bellazzi and B. Zupan. Predictive data mining in clinical medicine: current issues and guidelines. International Journal of Medical Informatics, 77(2):81–97, 2008.

L. Berliner and J. R. Conte. The effects of disclosure and intervention on sexually abused children. Child Abuse & Neglect, 19(3):371–384, 1995.

P. Boersma. Praat: doing phonetics by computer. http://www.praat.org/, 2006.

D. Bone, C.-C. Lee, A. Potamianos, and S. Narayanan.
An investigation of vocal arousal dynamics in child-psychologist interactions using synchrony measures and a conversation-based model. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.

D. Bone, M. S. Goodwin, M. P. Black, C.-C. Lee, K. Audhkhasi, and S. Narayanan. Applying machine learning to facilitate autism diagnostics: Pitfalls and promises. Journal of Autism and Developmental Disorders, 45(5):1121–1136, May 2015. ISSN 1573-3432. doi: 10.1007/s10803-014-2268-6. URL https://doi.org/10.1007/s10803-014-2268-6.

D. Bone, S. L. Bishop, M. P. Black, M. S. Goodwin, C. Lord, and S. S. Narayanan. Use of machine learning to improve autism screening and diagnostic instruments: effectiveness, efficiency, and multi-instrument fusion. Journal of Child Psychology and Psychiatry, 57(8):927–937, 2016.

M. S. Brady, D. A. Poole, A. R. Warren, and H. R. Jones. Young children's responses to yes-no questions: Patterns and problems. Applied Developmental Science, 3(1):47–57, 1999.

S. E. Brennan. Lexical entrainment in spontaneous dialog. Proceedings of ISSD, 96:41–44, 1996.

D. A. Brown and M. E. Lamb. Can children be useful witnesses? It depends how they are questioned. Child Development Perspectives, 9(4):250–255, 2015.

Z. Chen, N. Flemotomos, V. Ardulov, T. A. Creed, Z. E. Imel, D. C. Atkins, and S. Narayanan. Feature fusion strategies for end-to-end evaluation of cognitive behavior therapy sessions. In 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), pages 1836–1839. IEEE, 2021.

F. Clemens, P. A. Granhag, L. A. Strömwall, A. Vrij, S. Landström, E. R. af Hjelmsäter, and M. Hartwig. Skulking around the dinosaur: Eliciting cues to children's deception via strategic disclosure of evidence. Applied Cognitive Psychology, 24(7):925–940, 2010.

R. Collins, R. Lincoln, and M. G. Frank. The effect of rapport in forensic interviewing. Psychiatry, Psychology and Law, 9(1):69–78, 2002.

T. A.
Creed, S. A. Frankel, R. E. German, K. L. Green, S. Jager-Hyman, K. P. Taylor, A. D. Adler, C. B. Wolk, S. W. Stirman, S. H. Waltman, et al. Implementation of transdiagnostic cognitive therapy in community behavioral health: The Beck Community Initiative. Journal of Consulting and Clinical Psychology, 84(12):1116, 2016.

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

M. Duda, R. Ma, N. Haber, and D. Wall. Use of machine learning for behavioral distinction of autism and ADHD. Translational Psychiatry, 6(2):e732, 2016.

Z. Durante, V. Ardulov, M. Kumar, J. Gongola, T. Lyon, and S. Narayanan. Causal indicators for assessing the truthfulness of child speech in forensic interviews. Computer Speech & Language, 71:101263, 2022.

D. M. Fergusson, L. J. Horwood, and M. T. Lynskey. Childhood sexual abuse and psychiatric disorder in young adulthood: II. Psychiatric outcomes of childhood sexual abuse. Journal of the American Academy of Child & Adolescent Psychiatry, 35(10):1365–1374, 1996.

N. Flemotomos, V. R. Martinez, J. Gibson, D. C. Atkins, T. Creed, and S. S. Narayanan. Language features for automated evaluation of cognitive behavior psychotherapy sessions. In INTERSPEECH, pages 1908–1912, 2018.

N. Flemotomos, V. R. Martinez, Z. Chen, T. A. Creed, D. C. Atkins, and S. Narayanan. Automated quality assessment of cognitive behavioral therapy sessions through highly contextualized language representations. PLoS ONE, 16(10):e0258639, 2021a.

N. Flemotomos, V. R. Martinez, Z. Chen, K. Singla, V. Ardulov, R. Peri, D. D. Caperton, J. Gibson, M. J. Tanana, P. Georgiou, et al. Automated evaluation of psychotherapy skills using speech and language technologies. Behavior Research Methods, pages 1–22, 2021b.

J. Gibson, D. Can, B. Xiao, Z. E. Imel, D. C. Atkins, P. Georgiou, and S. S. Narayanan.
A deep learning approach to modeling empathy in addiction counseling. Interspeech 2016, pages 1447–1451, 2016.

J. Gibson, D. Can, P. G. Georgiou, D. C. Atkins, and S. S. Narayanan. Attention networks for modeling behaviors in addiction counseling. In INTERSPEECH, pages 3251–3255, 2017.

J. Gibson, D. Atkins, T. Creed, Z. Imel, P. Georgiou, and S. Narayanan. Multi-label multi-task deep learning for behavioral coding. IEEE Transactions on Affective Computing, 2019.

J. Gongola, N. Scurich, and J. A. Quas. Detecting deception in children: A meta-analysis. Law and Human Behavior, 41(1):44–54, 2017. ISSN 0147-7307.

J. Gongola, N. Scurich, and T. D. Lyon. Effects of the putative confession instruction on perceptions of children's true and false statements. Applied Cognitive Psychology, May:1–7, 2018. ISSN 1099-0720.

J. Gongola, J. Quas, S. E. Clark, and T. D. Lyon. Adults' difficulties in identifying concealment among children interviewed with the putative confession instructions, 2020.

J. Gongola, S. Williams, and T. D. Lyon. Children's under-informative responding is associated with concealment of a transgression. Applied Cognitive Psychology, 35(4):1065–1074, 2021.

C. W. J. Granger. Investigating causal relations by econometric models and cross-spectral methods. Econometrica: Journal of the Econometric Society, pages 424–438, 1969.

R. Grzadzinski, C. Dick, C. Lord, and S. Bishop. Parent-reported and clinician-observed autism spectrum disorder (ASD) symptoms in children with attention deficit/hyperactivity disorder (ADHD): implications for practice under DSM-5. Molecular Autism, 7(1):7, 2016.

A. K. Halladay, S. Bishop, J. N. Constantino, A. M. Daniels, K. Koenig, K. Palmer, D. Messinger, K. Pelphrey, S. J. Sanders, A. T. Singer, J. L. Taylor, and P. Szatmari. Sex and gender differences in autism spectrum disorder: summarizing evidence gaps and identifying emerging areas of priority. Molecular Autism, 6:36, June 2015. doi: 10.1186/s13229-015-0019-y.
URL https://pubmed.ncbi.nlm.nih.gov/26075049.

T. Hillberg, C. Hamilton-Giachritsis, and L. Dixon. Review of meta-analyses on the association between child sexual abuse and adult mental health difficulties: A systematic approach. Trauma, Violence, & Abuse, 12(1):38–49, 2011.

L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.

K. Kalimeri, B. Lepri, T. Kim, F. Pianesi, and A. S. Pentland. Automatic modeling of dominance effects using Granger causality. In International Workshop on Human Behavior Understanding, pages 124–133. Springer, 2011.

K. Kalimeri, B. Lepri, O. Aran, D. B. Jayagopi, D. Gatica-Perez, and F. Pianesi. Modeling dominance effects on nonverbal behaviors using Granger causality. In Proceedings of the 14th ACM International Conference on Multimodal Interaction, pages 23–26, 2012.

J. B. Kaplow, E. Hall, K. C. Koenen, K. A. Dodge, and L. Amaya-Jackson. Dissociation predicts later attention problems in sexually abused children. Child Abuse & Neglect, 32(2):261–275, 2008.

J. K. Kern, D. A. Geier, L. K. Sykes, M. R. Geier, and R. C. Deth. Are ASD and ADHD a continuum? A comparison of pathophysiological similarities between the disorders. Journal of Attention Disorders, 19(9):805–827, 2015.

B. Kiumarsi, K. G. Vamvoudakis, H. Modares, and F. L. Lewis. Optimal and autonomous control using reinforcement learning: A survey. IEEE Transactions on Neural Networks and Learning Systems, 29(6):2042–2062, 2017.

L. Klein, V. Ardulov, A. Gharib, B. Thompson, P. Levitt, and M. Matarić. Dynamic mode decomposition with control as a model of multimodal behavioral coordination. In Proceedings of the 2021 International Conference on Multimodal Interaction, pages 25–33, 2021.

I. Kononenko. Machine learning for medical diagnosis: history, state of the art and perspective. Artificial Intelligence in Medicine, 23(1):89–109, 2001.

J. Kosmicki, V. Sochat, M. Duda, and D. Wall.
Searching for a minimal set of behaviors for autism detection through feature selection-based machine learning. Translational Psychiatry, 5(2):e514, 2015.

K. Kourou, T. P. Exarchos, K. P. Exarchos, M. V. Karamouzis, and D. I. Fotiadis. Machine learning applications in cancer prognosis and prediction. Computational and Structural Biotechnology Journal, 13:8–17, 2015.

M. E. Lamb. Effects of investigative utterance types on Israeli children's responses. International Journal of Behavioral Development, 19(3):627–638, 1996.

M. E. Lamb and A. Fauchier. The effects of question type on self-contradictions by children in the course of forensic interviews. Applied Cognitive Psychology: The Official Journal of the Society for Applied Research in Memory and Cognition, 15(5):483–491, 2001.

M. E. Lamb, K. J. Sternberg, Y. Orbach, P. W. Esplin, H. Stewart, and S. Mitchell. Age differences in young children's responses to open-ended invitations in the course of forensic interviews. Journal of Consulting and Clinical Psychology, 71(5):926, 2003.

M. E. Lamb, Y. Orbach, I. Hershkowitz, P. W. Esplin, and D. Horowitz. A structured forensic interview protocol improves the quality and informativeness of investigative interviews with children: A review of research using the NICHD investigative interview protocol. Child Abuse & Neglect, 31(11-12):1201–1231, 2007.

S. E. Levy, E. Giarelli, L.-C. Lee, L. A. Schieve, R. S. Kirby, C. Cunniff, J. Nicholas, J. Reaven, and C. E. Rice. Autism spectrum disorder and co-occurring developmental, psychiatric, and medical conditions among children in multiple populations of the United States. Journal of Developmental & Behavioral Pediatrics, 31(4):267–275, 2010.

C. Lord, M. Rutter, and A. Le Couteur. Autism Diagnostic Interview-Revised: a revised version of a diagnostic interview for caregivers of individuals with possible pervasive developmental disorders. Journal of Autism and Developmental Disorders, 24(5):659–685, 1994.

C. E. Lord.
Autism: From research to practice. American Psychologist, 65(8):815, 2010.

T. D. Lyon. Ten Step Investigative Interview. Los Angeles, CA: Author, 2005.

T. D. Lyon, L. C. Malloy, J. A. Quas, and V. A. Talwar. Coaching, truth induction, and young maltreated children's false allegations and false denials. Child Development, 79(4):914–929, 2008.

T. D. Lyon, S. N. Stolzenberg, and K. McWilliams. Wrongful acquittals of sexual abuse. Journal of Interpersonal Violence, 32(6):805–825, 2017.

N. Malandrakis, A. Potamianos, E. Iosif, and S. Narayanan. EmotiWord: Affective lexicon creation with application to interaction and multimedia data. In MUSCLE, 2011.

V. R. Martinez, N. Flemotomos, V. Ardulov, K. Somandepalli, S. B. Goldberg, Z. E. Imel, D. C. Atkins, and S. Narayanan. Identifying therapist and client personae for therapeutic alliance estimation. In INTERSPEECH, pages 1901–1905, 2019.

L. Mathur and M. J. Matarić. Unsupervised audio-visual subspace alignment for high-stakes deception detection. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2255–2259. IEEE, 2021.

L. Mathur and M. J. Matarić. Introducing representations of facial affect in automated multimodal deception detection. In Proceedings of the 2020 International Conference on Multimodal Interaction, ICMI '20, pages 305–314, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450375818.

R. Mihalcea and C. Strapparava. The lie detector: Explorations in the automatic recognition of deceptive language. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, ACLShort '09, pages 309–312, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics.

V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

P. J. Moreno, C. Joerg, J.-M. V. Thong, and O.
Glickman. A recursive algorithm for the forced alignment of very long audio segments. In Fifth International Conference on Spoken Language Processing, 1998.

S. Narayanan and P. G. Georgiou. Behavioral signal processing: Deriving human behavioral informatics from speech and language. Proceedings of the IEEE, 101(5):1203–1233, 2013.

R. Ng, K. Heinrich, and E. Hodges. Associations between ADHD subtype symptomatology and social functioning in children with ADHD, autism spectrum disorder, and comorbid diagnosis: Utility of diagnostic tools in treatment considerations. Journal of Attention Disorders, page 1087054719855680, 2019.

M. L. Paine and D. J. Hansen. Factors influencing children to self-disclose sexual abuse. Clinical Psychology Review, 22(2):271–295, 2002.

N. Papernot, M. Abadi, U. Erlingsson, I. Goodfellow, and K. Talwar. Semi-supervised knowledge transfer for deep learning from private training data. arXiv preprint arXiv:1610.05755, 2016.

V. Pérez-Rosas, M. Abouelenien, R. Mihalcea, and M. Burzo. Deception detection using real-life trial data. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, pages 59–66. ACM, 2015.

B. Z. Pollermann. A place for prosody in a unified model of cognition and emotion. In Speech Prosody 2002, International Conference, 2002.

J. L. Proctor, S. L. Brunton, and J. N. Kutz. Dynamic mode decomposition with control. SIAM Journal on Applied Dynamical Systems, 15(1):142–161, 2016.

L. Radford, S. Corral, C. Bradley, H. Fisher, C. Bassett, N. Howat, and S. Collishaw. Child Abuse and Neglect in the UK Today. London: NSPCC, 2011.

J. R. Ragazzini and L. A. Zadeh. The analysis of sampled-data systems. Transactions of the American Institute of Electrical Engineers, Part II: Applications and Industry, 71(5):225–234, 1952. doi: 10.1109/TAI.1952.6371274.

F. Ringeval, B. Schuller, M. Valstar, J. Gratch, R. Cowie, S. Scherer, S. Mozgai, N. Cummins, M. Schmitt, and M. Pantic.
AVEC 2017: Real-life depression, and affect recognition workshop and challenge. In Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge, pages 3–9. ACM, 2017.

B. Salley, J. Gabrielli, C. M. Smith, and M. Braun. Do communication and social interaction skills differ across youth diagnosed with autism spectrum disorder, attention-deficit/hyperactivity disorder, or dual diagnosis? Research in Autism Spectrum Disorders, 20:58–66, 2015.

D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.

J. Sinzig, D. Walter, and M. Doepfner. Attention deficit/hyperactivity disorder in children and adolescents with autism spectrum disorder: Symptom or syndrome? Journal of Attention Disorders, 13(2):117–126, 2009.

V. A. Talwar, K. Hubbard, C. Saykaly, K. Lee, R. C. L. Lindsay, and N. Bala. Does parental coaching affect children's false reports? Comparing verbal markers of deception. Behavioral Sciences & the Law, 36(1):84–97, 2018.

E. P. Terlizzi and B. Zablotsky. Mental health treatment among adults: United States, 2019. NCHS Data Brief, 380:1–8, 2020.

T. M. Vallis, B. F. Shaw, and K. S. Dobson. The Cognitive Therapy Scale: psychometric properties. Journal of Consulting and Clinical Psychology, 54(3):381, 1986.

D. P. Wall, J. Kosmicki, T. Deluca, E. Harstad, and V. A. Fusaro. Use of machine learning to shorten observation-based screening and diagnosis of autism. Translational Psychiatry, 2(4):e100, 2012.

M. Yancheva and F. Rudzicz. Automatic detection of deception in child-produced speech using syntactic complexity features. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 944–953, 2013.

S. J. Youn, T. A. Creed, S. W. Stirman, and L. Marques. Hidden inequalities: COVID-19's impact on our mental health workforce.
Anxiety & Depression Association of America, 2020.

J. Young and A. T. Beck. Cognitive Therapy Scale. Unpublished manuscript, University of Pennsylvania, 1980.

H. Zhang, C. W. Rowley, E. A. Deem, and L. N. Cattafesta. Online dynamic mode decomposition for time-varying systems. SIAM Journal on Applied Dynamical Systems, 18(3):1586–1609, 2019.

Appendices

A Child Forensic Interviewing

Variable   Definition
           Interview, an ordered set of question-response pairs
V          Vocabulary, an unordered set of words
Q_i        set of questions from interview i
R_i        set of responses from interview i
t          a time step
T          the terminal time step for an interview
q_t        the question at time step t
r_t        the response at time step t
           function that maps an interview onto a topic importance vector
g          verbal productivity evaluated against an agenda
           verbal productivity evaluated against responsiveness
           decay component that discounts past utterances when computing the responsiveness score
           verbal productivity that unifies the agenda and responsiveness scores
           constant balancing between responsiveness and agenda productivity
X_i        child acoustic, lexical, and affective observations for interview i
           adult acoustic, lexical, and affective inputs for interview i
           utterance affective intensity
           logarithmically smoothed utterance affective intensity
           vector term-frequency of an utterance

B Robust Diagnostic Q-Learning

Variable   Bounds                     Definition
i          i ∈ I                      a specific item from the set of items I being analyzed
s          s ∈ S = Z⁺, s < 4^|I|      represents the state space used by the Q-table
a          a ∈ A = Z⁺, a < |I| + 1    actions to be taken by the agent
α          α ∈ R⁺                     learning rate for the policy
γ          γ ∈ [0, 1)                 future reward discount
ε          ε ∈ [0, 1]                 exploration rate for training the policy
G          G: S → D                   Naive Bayes classifier evaluating intermediate states
           ∈ R⁺                       class bonus to alleviate class imbalance
C          C ∈ R⁺                     reward bonus for using fewer states
l          l ∈ Z⁺                     maximum allowable length of a session
h          h ∈ Z⁺, h ≤ l              length of the session on exit
π          π: S → A                   function that maps from the state space into the action space

Table 6.1: Definitions and
limits on variables.

C Single-lag and Multi-lag GCA Features

During GCA analysis, for a given speech-signal pair, multiple lag values may sufficiently separate the positive and negative classes such that p < P_max. However, using multiple significant lag values resulted in reduced model performance compared to only using the feature with the lowest p-value. Models trained on the multi-lag feature sets obtained the highest F1 score when using ANOVA significance thresholds of p < 0.30 and p < 0.20 for the truth-telling and disclosure tasks, respectively. Tables 6.2 and 6.3 show the performance of models using these feature sets.

Model   F1      Acc.    Prec.   FNR     Pos. Acc.   Neg. Acc.
DT      0.433   0.546   0.421   0.345   0.471       0.589
RF      0.332   0.659*  0.430   0.307   0.271       0.865
GNB     0.657*  0.729*  0.599*  0.153*  0.746       0.716
L-SVM   0.627*  0.719*  0.564*  0.129*  0.743       0.701

Table 6.2: Truth-telling task performance using Granger-causal features from the multi-lag feature set and a feature significance threshold of p < 0.30. * indicates performance better than the randomized bootstrap baseline. ** indicates performance better than the human simulation baseline. Bold values indicate the best performance in their respective columns. Pos. Acc. and Neg. Acc. indicate the accuracy for the positive and negative classes, respectively.

Model   F1      Acc.    Prec.   FNR     Pos. Acc.   Neg. Acc.
DT      0.458   0.654*  0.452   0.247*  0.481       0.743
RF      0.150   0.683*  0.500*  0.312   0.090       0.969
GNB     0.579*  0.663*  0.494*  0.170*  0.719       0.637
L-SVM   0.624*  0.737*  0.600*  0.154*  0.695       0.759

Table 6.3: Disclosure task performance using Granger-causal features from the multi-lag feature set and a feature significance threshold of p < 0.10. * indicates performance better than the randomized bootstrap baseline. Bold values indicate the best performance in their respective columns. Pos. Acc. and Neg. Acc. indicate the accuracy for the positive and negative classes, respectively.
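The single-lag selection described in this appendix (retain only the lag whose Granger test yields the smallest p-value) can be sketched with an OLS-based test. The series, coefficients, and `max_lag` below are illustrative and not the exact feature pipeline used in the experiments:

```python
import numpy as np
from scipy import stats

def granger_pvalue(y, x, lag):
    """p-value of an OLS Granger test of x -> y at a single lag."""
    n = len(y) - lag
    Y = y[lag:]
    # lagged-y regressors (restricted model) plus lagged-x (unrestricted)
    Ly = np.column_stack([y[lag - 1 - k: len(y) - 1 - k] for k in range(lag)])
    Lx = np.column_stack([x[lag - 1 - k: len(x) - 1 - k] for k in range(lag)])
    ones = np.ones((n, 1))
    Xr = np.hstack([ones, Ly])          # restricted design
    Xu = np.hstack([ones, Ly, Lx])      # unrestricted design

    def rss(D):
        beta = np.linalg.lstsq(D, Y, rcond=None)[0]
        return float(np.sum((Y - D @ beta) ** 2))

    dof = n - Xu.shape[1]
    F = ((rss(Xr) - rss(Xu)) / lag) / (rss(Xu) / dof)
    return float(stats.f.sf(F, lag, dof))

def best_single_lag(y, x, max_lag=5):
    """Keep only the lag with the smallest p-value (single-lag feature)."""
    ps = {L: granger_pvalue(y, x, L) for L in range(1, max_lag + 1)}
    return min(ps, key=ps.get), ps

# Synthetic pair in which x drives y at lag 2.
rng = np.random.default_rng(4)
x = rng.normal(size=500)
y = np.zeros(500)
for t in range(2, 500):
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 2] + 0.1 * rng.normal()
lag, ps = best_single_lag(y, x)
```

On this construction, lags that include the true driving term separate cleanly from lag 1, which does not.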
D Q-Learning Algorithm

Algorithm 1 Policy Training(X, Y, G, I, T, α, γ, ε)
  A ← I ∪ {PREDICT}
  for t ∈ [1, T] do
    shuffle(X)
    for j ∈ [1, |X|] do
      x ← X[j]
      y ← Y[j]
      s ← 0_{1×|I|}
      Â ← [1, |A|]
      h ← 0
      d ← 0
      while d ≠ 1 do
        u ∼ U([0, 1])
        if u ≤ ε then
          a ∼ U(Â)
        else
          a ← argmax_{a′ ∈ Â} Q(s, a′)
        end if
        Â ← Â \ {a}
        h ← h + 1
        if A[a] = PREDICT then
          p ← 1[A[a] = P_ADHD]
          r ← r_final(p, y, h, |A|)
          Q(s, a) ← Q(s, a) + α(r − Q(s, a))
          d ← 1
        else
          s′ ← s
          s′[a] ← x[a]
          r ← r_local(s′, G, y)
          Q(s, a) ← Q(s, a) + α((r + γ max_{a′ ∈ Â} Q(s′, a′)) − Q(s, a))
          s ← s′
        end if
      end while
    end for
  end for

In Algorithm 1, U(·) refers to a uniform probability density over the set being sampled.
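A condensed, runnable rendering of the tabular update in Algorithm 1. The dataset, the zero local reward, and the ±1 terminal reward here are toy simplifications standing in for the belief model G, r_local, r_final, and the class and length bonuses used in the dissertation:

```python
import numpy as np

# Tabular Q-learning of an item-querying policy, condensed from Algorithm 1.
rng = np.random.default_rng(5)
n_items = 3
X = rng.integers(0, 2, size=(200, n_items))   # binary item responses
Y = (X[:, 0] == 1).astype(int)                # toy label, determined by item 0

PREDICT_0, PREDICT_1 = n_items, n_items + 1   # terminal predict actions
n_actions = n_items + 2
Q = {}                                        # state tuple -> action values

def q(s):
    return Q.setdefault(s, np.zeros(n_actions))

alpha, gamma, eps = 0.1, 0.9, 0.2
for _ in range(50):                           # training epochs
    for x, y in zip(X, Y):
        s = (-1,) * n_items                   # -1 marks "not yet observed"
        avail = set(range(n_actions))
        done = False
        while not done:
            if rng.random() < eps:            # epsilon-greedy exploration
                a = int(rng.choice(sorted(avail)))
            else:
                a = max(avail, key=lambda a_: q(s)[a_])
            avail.discard(a)
            if a >= n_items:                  # PREDICT ends the episode
                r = 1.0 if (a - n_items) == y else -1.0
                q(s)[a] += alpha * (r - q(s)[a])
                done = True
            else:                             # query item a, observe x[a]
                s2 = tuple(int(x[i]) if i == a else v for i, v in enumerate(s))
                best_next = max(q(s2)[a_] for a_ in avail)
                # zero local reward here; Algorithm 1 uses r_local(s', G, y)
                q(s)[a] += alpha * (gamma * best_next - q(s)[a])
                s = s2

def greedy_predict(x):
    """Roll out the greedy policy from the empty state."""
    s, avail = (-1,) * n_items, set(range(n_actions))
    while True:
        a = max(avail, key=lambda a_: q(s)[a_])
        avail.discard(a)
        if a >= n_items:
            return a - n_items
        s = tuple(int(x[i]) if i == a else v for i, v in enumerate(s))

acc = float(np.mean([greedy_predict(x) == y for x, y in zip(X, Y)]))
```

On this toy task the trained greedy policy learns to query the informative item first and then predict, so its accuracy approaches 1.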
Abstract
Human interaction is a vital component of a person's development and well-being. These interactions enable us to overcome obstacles and find resolutions that an individual might not be able to reach alone. This subject is particularly well studied in the domain of human psychology, where human behavior is diagnostically categorized and the interaction can be utilized to improve a person's health.
Prior work has explored the use of computational models of human behavior to aid in the diagnostic assessment of behavioral patterns. Most recently, novel machine learning methods and increased access to data have invited the study of the dynamics of human interaction at a more granular time resolution. These dynamics have been used to identify specific moments during interactions that are relevant to the overall assessment of an individual's behavior with respect to their interlocutor. By reformulating this system from the perspective of an operator that can be controlled, it becomes possible to predict how an individual would react to a specific input from their partner, which in turn lends the opportunity to plan interventions and probes more effectively.
This dissertation presents a formulation of human interaction through a systems-theoretic paradigm with a control-affine element and demonstrates how these frameworks can be utilized to gain insight into improving desired outcomes and approaches toward optimizing interaction strategies. In support of the thesis, we present applications of these techniques to the domains of forensic interviewing, psychotherapy, and neurodevelopmental diagnostics.
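For concreteness, a control-affine system takes the following standard form (this is the general textbook definition, not a claim about the dissertation's specific model):

```latex
\[
  \dot{x}(t) = f\big(x(t)\big) + \sum_{i=1}^{m} g_i\big(x(t)\big)\, u_i(t)
\]
```

Here x(t) is the state (e.g., an interactant's behavioral state), f describes the drift dynamics, and the inputs u_i(t) (e.g., a partner's probes) enter linearly through the control vector fields g_i; the dynamics may be nonlinear in x but are affine in u, a property commonly exploited when optimizing control inputs.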
Asset Metadata

Creator: Ardulov, Victor (author)
Core Title: Modeling and regulating human interaction with control affine dynamical systems
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science
Degree Conferral Date: 2022-08
Publication Date: 05/27/2022
Defense Date: 05/04/2022
Publisher: University of Southern California. Libraries (digital)
Tag: behavioral signal processing, human centered machine learning, human interaction modelling, OAI-PMH Harvest
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Lyon, Thomas (committee chair), Narayanan, Shrikanth (committee chair), Mataric, Maja (committee member)
Creator Email: ardulov@usc.edu, victor.ardulov@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-oUC111336164
Unique identifier: UC111336164
Identifier: etd-ArdulovVic-10731.pdf (filename)
Legacy Identifier: etd-ArdulovVic-10731
Document Type: Dissertation
Rights: Ardulov, Victor
Internet Media Type: application/pdf
Type: texts
Source: 20220527-usctheses-batch-944 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email: cisadmin@lib.usc.edu