An Investigation of Fully Interactive Multi-Role Dialogue Agents

by

Eli Pincus

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements of the Degree
DOCTOR OF PHILOSOPHY
(Computer Science)

December 2020

Copyright 2020 Eli Pincus

To my dad. For picking me up the many times I fell when I was younger in the best way you knew how. It is a great source of sadness for me that you didn't live to see this finished.

To my mom and brother. Thank you for all of your support during this process.

To Dr. Bearne. You introduced me to a wise person who said, "We usually think there are two ways of handling emotions. Act them out, which generates bad karma, or bottle them up, where they eventually turn into cancer." Thanks for pointing me in the direction of a third way.

To all beings everywhere without exception. May any merit that was generated during this activity go to the benefit of all beings everywhere without exception. May we all find the causes for true happiness, live with ease, and be free from suffering.

Acknowledgements

No dissertation is completed solely by one individual. Many people contributed to the work presented here. My advisor, David Traum, provided funding and feedback for this project, without which the work could not have been completed. Thank you for taking a chance on me as a student.

Several interns and cadets at the USC Institute for Creative Technologies (ICT) chose this project as their focus, which helped move this work forward. Thanks to Usman Sohail, Anna Lou, Vaibhav Desai, Adriana Camcho, Aiden McCarthy, and William Smith.

Several professors and research staff at ICT also helped with various aspects of this project. Thanks to Anton Leuski for providing significant technical advice on the automatic speech recognizers and NPCEditor technologies leveraged by the test-bed agent. Thanks to Ed Fast for valuable and patient help on any issues I faced with ICT's Virtual Human Toolkit. Thanks to Gale Lucas and Jon Gratch, who helped design and fund the Game-Framing/Regulatory-Fit Evaluation. Thanks to David DeVault for early design advice on the test-bed agent. Thanks to Ron Artstein for helpful annotation and statistical feedback.

This dissertation built on work that was started before I ever arrived at ICT. Thanks to Maike Paetzel and David Nicolas Racca, who collected the Rapid Dialogue Corpus used for the human-human analysis.

Some of my peers provided helpful technical advice and code for certain parts of this project. Ramesh Manuvinakurike provided helpful debugging advice. Chris Weinberg provided the initial script for training the PMI model used to create the automatic clue filter.

Several people helped run experiments described in this dissertation. Thanks to Su Lei, Sharon Mozgai, Jill Boberg, and Sohail Alavi for being generous with their time and running these experiments.

Many USC and ICT staff helped create a supportive environment for my studies. Thanks to Alesia Gainer, Anabel Franco-Huerta, Kevin Watley, and Lizsl De Leon.

Several ICT professors were very encouraging at different moments during my time at ICT. Thanks to Kallirroi Georgila, Stefan Scherer, and Ari Shapiro.

Finally, many of my fellow "comrades in arms" (aka fellow graduate students) were a great source of support and good distraction during these years.
Thanks to Ramesh Manuvinakurike, Maike Paetzel, Su Lei, Rens Hoegen, Setareh Nasihati Gilani, Chloe Legendre, Koki Nagano, Jacqueline Brixey, Chris Weinberg, Melissa Roemmele, Zahra Nazari, Sayan Ghosh, and Lixing Liu for helping me create some positive memories. Catherine Neubauer and Mathieu Chollet also fall in this category (although you both were technically post-docs when we first met).

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
1 Introduction
  1.1 Thesis Overview
  1.2 Thesis Contributions
2 Related Work
  2.1 Organization of Dialogue Agents According to Their Roles
    2.1.1 Single-Role Chatbots
    2.1.2 Single-Role Task-Based Agents
    2.1.3 Single-Role Dialogic Gaming Agents
    2.1.4 Situating Fully Interactive Multi-Role Dialogue Agent Research
  2.2 Content Sourcing
    2.2.1 Comparison of Content Sourcing Policies
    2.2.2 Situating Multi-Role Enabled Content Sourcing
3 Activity Analysis
  3.1 Corpus
  3.2 Activity Structure
  3.3 Annotation Scheme
  3.4 Annotation Scheme Method and Evaluation
  3.5 Annotation Results
    3.5.1 Clue & Guess Type Breakdown
    3.5.2 Clue Packaging
    3.5.3 Non-Clues & Non-Guesses
  3.6 Performance Metrics
    3.6.1 Baseline Human-Human Game Performance
  3.7 Implications for Test-Bed Agent
  3.8 Summary
4 Architecture for Enabling Interactivity
  4.1 Architecture
    4.1.1 Dialogue Manager
    4.1.2 Auxiliary Graphical User-Interfaces (GUI's)
    4.1.3 Agent Platforms
  4.2 Virtual Human
    4.2.1 Clue-Giving Role Dialogue Management
    4.2.2 Guessing Role Dialogue Management
  4.3 Robot
    4.3.1 Robot Dialogue Manager
  4.4 Non-Embodied Web
    4.4.1 DialPort Front-End for Real Users
    4.4.2 Web Clue-Giving Role Dialogue Manager
  4.5 Parameter Instantiation Decisions
  4.6 Summary
5 Clue-Giving Role Content Generation
  5.1 Scalable Content Sourcing
  5.2 Automatic Clue Filtering
    5.2.1 Data Collection
    5.2.2 Method
    5.2.3 Results
  5.3 Online Evaluation of Machine Learning Clue Filter
    5.3.1 Metrics for Clue Sequences with 2 Types of Clues
    5.3.2 Human-Agent Baseline Clue-Giving Measures
  5.4 Conclusion
6 Design Decision Comparative Experiments
  6.1 Off-Line Text-To-Speech Evaluation
    6.1.1 Data & Materials
    6.1.2 Method
    6.1.3 Results
    6.1.4 Test-Bed Agent Implications
    6.1.5 Summary
  6.2 On-Line Embodiment and Incrementality Evaluation
    6.2.1 Independent Variables
    6.2.2 Experimental Design & Method
    6.2.3 Hypotheses
    6.2.4 Results
    6.2.5 Implications for Test-bed Agent & Summary
  6.3 On-line Game Framing & Feedback Evaluation
    6.3.1 Method
    6.3.2 Results
    6.3.3 Test-Bed Agent Implications
    6.3.4 Summary
  6.4 Summary
7 Guessing Role Content Generation
  7.1 Machine Resources for Guessing
    7.1.1 Resources & Methods for Guessing
    7.1.2 Guess Ranking Metrics
    7.1.3 Offline Evaluation of Guessing Resources
    7.1.4 Human Guessing Performance Data Collection
    7.1.5 Human-Human and Human-Agent Measurement Comparisons
    7.1.6 Conclusions
  7.2 Offline Evaluation of Question Answering Guessing Approach
    7.2.1 Data
    7.2.2 Results
  7.3 Conclusion
8 Multi-Role Enabled Content Sourcing
  8.1 Content Sourcing Methodology
    8.1.1 Human Clues from Prior Interactions
    8.1.2 Machine Generated Clues
  8.2 Test-Bed Agent & User Recruitment
  8.3 Experimental Design & Method
    8.3.1 Materials and Independent Variables
    8.3.2 Dependent Variables and Hypotheses
    8.3.3 Statistical Analysis
  8.4 Results & Discussion
    8.4.1 Aggregate User-Agent Interaction Statistics
    8.4.2 Evaluation Results
  8.5 Eliminating Obvious Confounds
    8.5.1 Clue Length
    8.5.2 Interaction Modality
  8.6 Summary
9 Multi-Role Agent Evaluations
  9.1 Pilot Robot Multi-Role Data Collection
    9.1.1 Method
    9.1.2 Results
    9.1.3 Summary
  9.2 Multi/Single-Role Agent Comparative Evaluation
    9.2.1 Data
    9.2.2 Experimental Design & Method
    9.2.3 Dependent Variables
    9.2.4 Statistical Analysis
    9.2.5 Results
    9.2.6 Summary
  9.3 Conclusion
10 Conclusion
  10.1 Scoping of Methods & Findings
  10.2 Future Directions
Bibliography

List of Tables

1.1 Summary of Experiments/Data Collection Efforts
1.2 Datasets
2.1 Common Content Sourcing Policies
3.1 Clue Types & Examples
3.2 Clue-Packaging Types
3.3 Inter-Annotator Agreement Statistics
3.4 Giver and Guesser Move Frequencies
3.5 Clue Type Relative Frequencies
3.6 Clue Packaging Relative Frequencies
3.7 Human Clue-Giving Baseline Metric Results
4.1 Game Dialogue Moves
4.2 Test-Bed Agent Clue-Giving Role Dialogue Policy
4.3 Test-Bed Agent Guessing Role Dialogue Policy
5.1 Clue Frequency by Source
5.2 Clue Type
5.3 Experiment Corpus Clue Type Freq. Info.
5.4 Guess Annotation Examples
5.5 Features Used for Clue Quality Selection
5.6 Baseline & Automatic Method Results
5.7 Filt./Un-Filt. Clue Metrics Comparison
5.8 Filtered vs. Un-Filtered Clue Effectiveness
6.1 Example Clues
6.2 Clue Type Frequency
6.3 Experiments & Obtained Measures
6.4 S/S & E/C Naturalness Means
6.5 S/S & E/C Likability Means
6.6 Objective Measure Means
6.7 First vs. Last Naturalness Scores
6.8 Guessability Correlations
6.9 Intelligibility Correlations
6.10 Experiment Conditions
6.11 Participant Interaction Statistics
6.12 Feedback Realization
6.13 Experimental Design
7.1 Guessing Resources
7.2 Relationship of Guessing Resource to Clue Type
7.3 Guess Re-Ranking Metrics
7.4 Example Clues
7.5 Individual Resource Guessing Performance
7.6 Ensemble Methods Guessing Performances
7.7 Human Top5 Scores
7.8 NPCEditor Guessing Ability
8.1 Example Clues
8.2 Dependent Variables
8.3 Shapiro-Wilk Normality Test Results
8.4 Mean Subset User Statistics
8.5 Main Results
8.6 Average Points Per Clue Sequence (APCS) Results
8.7 User Interaction Modality Breakdown
8.8 Mobile Text Strata Statistics
8.9 AS Strata Statistics
8.10 AI Strata Statistics
9.1 Multi/Single-Role Agent Comparative Evaluation Perception Dependent Variables
9.2 Multi/Single-Role Agent Comparative Evaluation Participant Demographics
9.3 Perception Variable Results
9.4 Non-Inferiority Test Results
9.5 Team Guessing Ability Comparison

List of Figures

2.1 Classical Dialogue Agent Architecture
2.2 Example Frames
3.1 RDG-Phrase Corpus Screenshot
3.2 Sample Human-Human Dialogue
3.3 Sample Human-Human Dialogue 2
4.1 Agent Architecture
4.2 Virtual Human Agent Architecture
4.3 Test-Bed Agent Screenshot [Game Player (right) and Game Judge (left)]
4.4 Robot Agent Architecture
4.5 Non-Embodied Web Agent Architecture
4.6 Dialport Game-Agent Personal Computer Web Interface
4.7 Facebook Messenger Game-Agent Interface
5.1 Sample Dialogue
6.1 Post-Survey
6.2 Sample Round Dialogues
6.3 Post-Survey
7.1 Example Result Set
8.1 FaceBook Experiment Advertisement
8.2 Box Plots of Dependent Variables with Significant Differences *(p<0.05), **(p<0.01), ***(p<0.001)
9.1 Short Survey
9.2 Long Post Survey
9.3 Multi/Single-Role Agent Comparative Evaluation Pre-Survey
9.4 Multi/Single-Role Agent Comparative Evaluation Post-Survey
9.5 Possible Outcomes of Non-Inferiority Tests/Trials
9.6 Yelp User Reviews
9.7 Average Round Score
9.8 Average Skips per Round

Abstract

In the course of their lives humans perform multiple roles, such as work and social roles.
However, current research in human-computer dialogue has focused on dialogue agents that perform only one role of an interaction. For example, Apple's Siri acts mainly as an assistant. In this thesis we help fill the gap in multi-role dialogue agent research. We describe an architecture that endows a test-bed agent with core dialogue management capabilities for both roles of a word-guessing game and that can be adapted for different embodiments, including a virtual human, a robot, and a non-embodied web platform that enables use of the test-bed agent in "in the wild" experiments. We incrementally evaluate design decisions for the test-bed agent that decrease the chance that our later experiments, which more directly evaluate the agent's multi-role capabilities, fail to find effects due to confounds stemming from poor design decisions. We establish that multi-role agents, when compared to single-role versions of the same agent, are able to elicit enjoyment from users without negatively impacting users' perceptions. We also use an "in the wild" experiment to show that a multi-role content sourcing strategy can be superior to other scalable content sourcing strategies.

Chapter 1 Introduction

"It may seem that we have multiple realities - because we're capable of playing many different roles in our life."
Art Hochberg

In the course of their everyday lives humans perform multiple roles, which specify the "characteristic behaviors, parts to be played, and scripts of behavior (for an actor)" [Biddle, 1986]. For example, humans typically engage in social roles such as parent, child, or friend and professional roles such as boss, employee, or colleague. Humans also frequently engage in asymmetric interactions, where interlocutors have different "competences / obligations & rights to act" [Allwood, 2000]. For instance, professional interactions can be asymmetric, such as ones between a boss and an employee; it makes sense for an employee to request a raise from their boss, but it makes less sense if this action occurs in the other direction. Another example can be found in certain social interactions, such as when two humans play a game together and the players have different responsibilities.

Until recently, most dialogue agents were built to fulfill only one role of an asymmetric interaction. For example, the first well-known dialogue agent, which was built in the mid-1960s using the ELIZA framework [Weizenbaum, 1966], performed best when users were told to interact with it as they would a Rogerian therapist. Up until the last few years, the human-computer dialogue field has been focused on agents that perform the service provider role of a restricted asymmetric task-based activity, such as acting as an "assistant" like Apple's Siri, providing access to e-mail over the phone [Walker et al., 1998], acting as a tutor [Graesser et al., 2004a], or providing bus scheduling information [Raux et al., 2005]. Another type of interactive system, the type we focus on in this work, is an agent designed to engage a user in a game. While there have been some examples of automated interactive dialogue agents that play a game with a human user [Burgener, 2006b, Marti and Emnet, 2018, Sawaki et al., 2008a, Bohus and Horvitz, 2009b, Ferrucci, 2010, Paetzel et al., 2015, Adrian et al., 2016, Leite et al., 2016a], they have typically only performed one role of an asymmetric interaction.
For example, [Burgener, 2006b] is a patent that describes a method for implementing an automated agent that only performs the guessing role of the popular 20 questions game. These game agents have generally been designed to study specific human-computer dialogue research problems such as multi-party dialogue in an open-world setting [Bohus and Horvitz, 2009b] or content authoring for repeated interaction [Leite et al., 2016a]. While there are a couple of examples of agents that have been designed with the more general intent of satisfying user goals like enjoyment [Burgener, 2006b, Sawaki et al., 2008a], unlike the test-bed agent used in this work, they did not perform more than one role of the game.

There are several reasons that motivate investigating multi-role interactive agents. First, if one is interested in building agents that emulate human behavior, then the ability of humans to fluidly perform multiple roles in everyday life must also be emulated.

Second, if one desires to investigate intelligence, there are reasons to believe the ability to perform multiple roles is at least one aspect of intelligence. For example, consider the Turing Test, a widely accepted test of intelligence in the field of artificial intelligence for the last 70 years [Turing, 1950]. It is clear from most recent reports of interactive agents that have "beaten" the Turing Test (see, e.g., https://www.zdnet.com/article/google-duplex-beat-the-turing-test-are-we-doomed/ and https://www.bbc.com/news/technology-27762088) that a sufficient number of judges were fooled by the agents mainly because of the judges' expectations of the agent's role (e.g., a 13-year-old boy with little world experience or an assistant with the one task of scheduling an appointment). If the judges in those tests expected their conversational partners to be able to discuss topics associated with multiple roles (e.g., expecting the 13-year-old boy to perform the role of a student and learn something new, or the assistant to perform the role of a daughter and describe her parents), the judges would likely no longer have trouble differentiating the automated agent from the human interlocutor.

Third, there is evidence that the ability to perform multiple roles has learning benefits. Consider the widely held belief in the mental health field about the benefits that occur when a mental health professional seeks his own therapy (thereby performing the role of the patient) and as a result becomes a better provider to his own patients [Pope and Tabachnick, 1994, Geller et al., 2005]. In fact, Geller et al. report on a study in which 78% of the therapists interviewed reported that "personal therapy had a positive influence on their own professional development."

Fourth and finally, the ability of agents to perform multiple roles provides opportunities for the agent to observe humans performing a role it will be expected to perform in the future with different humans. This opportunity, which can be considered a form of learning, provides multi-role agents the ability to organically source interaction data for the observed roles. This interaction data can be used in two different ways to benefit the agent's performance later when it performs the observed role. It can be directly reused in later interaction with different users when the agent is performing the observed role. Also, analysis of the interaction data can be used to create statistics of human interaction behaviors in the observed role that can be used to inform parameter setting in the automated agent model for the observed role.

This thesis is intended to be a first step towards the scientific study of multi-role agents that encourages future research in this area. It is meant to help fill the gap in multi-role dialogue agent research by investigating the following research questions.

RQ1: "Can an automated multi-role agent that leverages a full stack of interactive technologies elicit positive perceptions and behaviors from users?"
RQ2: "Does an agent's performance of more than one role of an interaction negatively impact the elicited perceptions?"
RQ3: "Are there capabilities possessed by multi-role agents not shared by their single role counterparts that create positive interaction effects for their users?"

We investigate these research questions in the context of a test-bed dialogue agent that can play both roles of an asymmetric interaction: both clue-giver and guesser in a collaborative word-guessing game. There are several desiderata that motivate the choice of a word-guessing game as a test-bed domain to investigate our research questions.

First, we are interested in evaluating a multi-role dialogue agent in an interaction that is as natural as possible, where users who interact with the agent are intrinsically motivated, i.e., "moved to doing something because it is inherently interesting or enjoyable" [Ryan and Deci, 2000]. A game is an intuitive choice as a test-bed, since people naturally play games for fun, as opposed to a less natural alternative activity more common in the literature, where people pretend to do another task involving an asymmetric interaction (e.g., requesting a restaurant recommendation for a distant city when one has no intention of actually dining in that city).

Second, a word-guessing game in particular, in contrast to more haptic-based games like Pong or Tetris, features a large amount of fast-paced multi-modal dialogue between interlocutors. Thus word-guessing game agents can naturally employ a full stack of dialogue technologies including automatic speech recognition, text-to-speech, natural language understanding, natural language generation, and embodiment with non-verbal behavior generation. Third, the game also offers simple and intuitive ways to evaluate the interaction, such as points (in the case of our agent, the number of correct guesses) and the number of rounds played by choice. Fourth and finally, since the game is collaborative in nature, an investigation is possible into whether multi-role agents, who can alternate roles with their users, are able to engender interaction benefits related to team cohesion to a greater extent than single-role agents, who do not have this ability.

1.1 Thesis Overview

Here we provide a map for this thesis. In the next section we summarize the contributions presented in this thesis. In Chapter 2 we provide a review of the literature in this area that covers two main topics. First, we trace the history of the human-computer dialogue field, pointing out its focus on single-role agents, including prior work on dialogic gaming agents. Second, we compare and contrast several content sourcing strategies commonly employed by dialogue agents. An initial scalable content sourcing strategy is described in Chapter 5.
We review methods that re-use observed interaction content, and speculate that multi-role enabled content sourcing can be a capability of the kind referred to in RQ3. This is demonstrated in Chapter 8, which shows the superiority of this method over the more traditional content sourcing method described in Chapter 5.

The work described in the next five chapters demonstrates that a well-principled approach to agent design was taken and provides support that this work makes use of a suitable test-bed agent for investigating our three research questions on multi-role dialogue agents. Without a well-designed test-bed, our later experiments that more directly investigate the impact of endowing the agent with the ability to perform multiple roles (described in the last two chapters) might have suffered from confounds due to poor design decisions. For example, benefits stemming from the agent's multi-role capability might have been obscured due to users focusing on a disconcerting embodiment choice or content generation methods that produce low quality content.

In Chapter 3 we begin our exploration via an activity analysis that investigates a corpus of audio and video recordings of pairs of human players engaged in a word-guessing game activity. We identify important aspects of the game, game roles (clue-giving and guessing), and interaction between humans performing the game roles that helped inform agent design decisions such as dialogue management incrementality level or embodiment type. Further, we introduce evaluation metrics that can be used to evaluate the performance of humans or agents performing the game roles and apply those metrics to the human-human data, providing us with strong human-human baselines of performance for both roles of the game activity that can later be used to help contextualize agent performance.

In Chapter 4 we introduce an architecture that enables the interactivity of an automated agent so that it may perform both roles of the game activity in a manner that emulates important human behaviors identified by the activity analysis. In Chapter 5 we present and evaluate content sourcing/generation methods for the clue-giving role which enable an automated agent to perform the clue-giving role of the game activity with performance levels in line with that of an average human player. In Chapter 6 we discuss off-line and on-line comparative experiments that begin to test the impact on user interaction of different means of incorporating (or not incorporating) some of the aspects pointed out in the activity analysis in this agent architecture. This chapter also begins to investigate RQ1 by discussing online evaluations of the test-bed agent when it performs the clue-giving role leveraging a full stack of dialogue technologies. In Chapter 7 we present and evaluate content sourcing/generation methods for guessing which enable an automated agent to perform the guessing role of the game activity with performance levels in line with that of an average human player.

In Chapter 8 we extend our investigation of content sourcing/generation methods for the clue-giving role and present results from an "in the wild" experiment that demonstrate that multi-role agents can make use of a scalable content sourcing strategy, multi-role enabled content sourcing, that produces relatively high quality content that is not available to single-role agents intended for asymmetric interaction.
These findings begin to address RQ3 by identifying and demonstrating a capability of multi-role agents not shared by their single-role counterparts intended for asymmetric interaction.

In Chapter 9, we present the results of two multi-role evaluation experiments with the test-bed agent that address RQ1 and RQ2. For RQ1, the results provide evidence that the multi-role agent is able to elicit positive perceptions and behaviors from users. In terms of RQ2, the results provide evidence that the multi-role ability of the test-bed agent does not negatively impact the elicited perceptions. Finally, in Chapter 10 we conclude.

1.2 Thesis Contributions

Here we summarize the contributions presented in this thesis. We begin by presenting a summary of the contributions presented in each thesis chapter. After the summary of chapter contributions we present Table 1.1 and Table 1.2. Table 1.1 summarizes all of the experiments/data collection efforts discussed in this thesis. In addition to the contributions that cover the experimental results and systems built, the work in this thesis also contributes to the human-computer dialogue field by creating many resources for word-guessing games that can be used by other researchers. Table 1.2 provides details on the datasets generated from the experiments/data collection efforts described in this thesis.

Chapter 3 Activity Analysis

In this chapter we provide a formal activity analysis of the word-guessing game test-bed domain. This analysis involved analyzing a corpus of audio and video recordings of pairs of human players playing a word-guessing game. This chapter identifies and formalizes, in the form of an annotation scheme, important aspects of the game including the game roles and common dialogue moves and interactive behaviors performed by people in the game roles. This process helped raise important design considerations for the test-bed agent around embodiment, content generation/sourcing method, and synthetic voice selection, as well as incremental language processing.

This chapter also quantifies the relative frequency of some of the important interactive behaviors of game interlocutors and evaluates the annotation scheme. Finally, this chapter defines metrics for evaluating a player's game performance in both roles and applies these metrics to the human-human data to create baselines that later help to contextualize agent performance.

Chapter 4 Architecture for Enabling Interactivity

This chapter presents an architecture that enables an automated agent to perform both roles of the word-guessing game activity. The architecture is intricate enough to allow an agent to emulate some of the more sophisticated interactive behaviors performed by game players (e.g., incremental processing) but still flexible enough to allow for testing of several of the design questions raised in the activity analysis. The architecture was specialized for three different platforms with different embodiment types: virtual human, robot, and non-embodied web.

Chapter 5 Clue-Giving Role Content Generation

This chapter introduces a scalable automated method for clue sourcing/generation that allows an agent performing the clue-giving role to produce large quantities of diverse clues for arbitrary target-words that are similar to the types of clues given by human clue-givers.
The chapter also presents a supervised machine learning classifier that uses linguistic features of clues and their associated target-word to filter a corpus of clues for clues more likely to elicit correct guesses. The chapter also introduces a metric for assigning credit to clues in a clue sequence that results in a correct guess when the sequence contains two different types of clues. This metric is applied to human-agent data from the embodiment experiment (see Chapter 6) and to the human-human data. The results show that this clue generation method is able to come in line with human clue-giving performance.

Chapter 6 Design Decision Comparative Experiments

In this chapter we discuss three experiments that tested the importance of incorporating or not incorporating some of the design aspects pointed out in the activity analysis. Two experiments were on-line evaluations of the test-bed agent that begin our investigation of RQ1. They evaluate a fully interactive version of the test-bed agent that leverages a full stack of dialogue technologies and performs the clue-giving role. One of these experiments demonstrates that endowing an agent with embodiment and incremental language processing capabilities had positive effects on user interactions. The second online experiment provides evidence that we chose an advantageous game framing for the game activity. The third experiment was an offline evaluation conducted over the web that shows that using a high quality synthetic voice, as opposed to a recorded human voice, did not have significant impacts on the objective behaviors, including game performance, of human users.

Chapter 7 Guessing Role Content Generation

In this chapter we discuss two automated methods for guess generation that allow an agent performing the guessing role to make guesses. The first method queries pre-existing machine resources, including popular online search engines, encyclopedias, dictionaries, and pre-compiled lexical databases, to produce guesses. We evaluate this method in several contexts by calculating its performance in different game structures as well as its performance when presented single clues or clue sequences. We contextualize this method's performance by comparing it with human guessing performance in similar conditions. The second method relies on a classification tool that leverages cross-language information retrieval techniques and is commonly used for question answering systems. We evaluate this second method and demonstrate its strong performance.

Chapter 8 Multi-Role Enabled Content Sourcing

This chapter addresses RQ3 by identifying and demonstrating a capability of multi-role agents not shared by their single-role counterparts intended for asymmetric interaction: multi-role enabled content sourcing. Multi-role agents have the opportunity to observe human interlocutors performing a role the agent will be expected to perform in the future. This provides multi-role agents an opportunity to organically source interaction data for the observed roles that they can later re-use with different human interlocutors when the agent is performing the observed role. Through an "in the wild" comparative evaluation conducted over the web we show the benefits of this content sourcing policy over the scalable content sourcing/generation policy (see Chapter 5) that is shown to perform well.
The results of the comparative evaluation demonstrated that re-use of previous interaction data significantly improved users' game performance, perceptions of content naturalness, and overall interaction enjoyment. This evaluation also helps establish benchmark statistics (e.g., user drop-off rate & post-survey response rate) for dialogue agents deployed "in the wild" that are intended to satisfy users' intrinsic motivations.

Chapter 9 Multi-Role Agent Evaluations

In this chapter we present two evaluations of the multi-role test-bed agent. The first pilot evaluation featured a version of the agent that performed both roles of the game on a robot platform. In this evaluation the guessing capabilities of the test-bed agent were "wizarded". The evaluation demonstrated that users will cooperate with the multi-role agent's performance of both roles, that the agent can elicit useful data, and that users enjoyed the multi-role interaction.

The second evaluation used the fully automated version of the test-bed agent on the virtual human platform. The results address RQ1 by providing evidence that automated multi-role agents are able to elicit positive perceptions and behaviors from users. In terms of RQ2, the results provide evidence that the multi-role ability of the test-bed agent does not negatively impact the positivity of these elicited perceptions. We also compare online human-agent team performance with human-human team performance.

As mentioned, Table 1.1 summarizes all of the major experiments discussed in this thesis. The first column of this table provides the name of the experiment and the section where that experiment is presented in more detail, the second column lists the main hypotheses/research questions investigated in the experiment, the third column lists the system/agent implementation used in the experiment (see Chapter 4 for details on agent implementations), the fourth column lists the number of experimental participants, and the fifth column lists the main findings of the experiment.

Table 1.1: Summary of Experiments/Data Collection Efforts

Machine Clue Corpus Collection (Section 5.1)
  Main Hypotheses/Research Questions: Can web-based resources & pre-compiled lexical databases generate a diverse set of clues similar to those of human game players?
  System/Agent Implementation: Web framework that scrapes the web & queries databases / linguistic normalizer
  # of Participants: N/A
  Main Findings: Yes. This method can be used to generate a large set of diverse clues similar to those given by human clue-givers during game-play.

Clue Filtering Experiment (Section 5.2)
  Main Hypotheses/Research Questions: Can machine learning be used to prune a machine clue corpus to produce a clue corpus with average clue quality good enough for game-play?
  System/Agent Implementation: Web framework that streams a synthesized clue and records spoken guesses over the web (offline); Virtual Human (online)
  # of Participants: 15 (offline), 52 (online)
  Main Findings: Yes. Off-line and on-line evaluations of average clue quality in the filtered corpus demonstrated average clue quality in line with human measures of clue-giving ability.

TTS Evaluation (Section 6.1)
  Main Hypotheses/Research Questions: Is a high quality synthetic voice comprehensible enough for a reasonable game interaction?
  System/Agent Implementation: Web framework that streams a synthesized clue, records spoken guesses over the web, and allows users to submit audio transcriptions
  # of Participants: 60
  Main Findings: Evidence that a high quality synthetic voice is good enough. No significant difference in objective evaluation measures between the highest quality synthetic voice and a human voice.
Incremental/Embodied Evaluation (Section 6.2)
  Main Hypotheses/Research Questions: 1. The number of user-initiative barge-in dialogue moves (correct guesses and skips) recognized by the agent would be higher in the barge-in condition vs. the non-barge-in condition. 2. Subjective evaluations of the game in terms of user enjoyment would be higher in the barge-in condition vs. the non-barge-in condition. 3. Subjective evaluations in terms of naturalness for the voice would be higher in the embodied condition vs. the non-embodied condition. 4. Subjective evaluations in terms of naturalness for the clue-giver would be higher in the barge-in/embodied condition than the non-embodied/non-barge-in condition.
  System/Agent Implementation: Virtual Human
  # of Participants: 52
  Main Findings: The main take-away is that there are advantages to using the incremental, embodied version of the agent. 1. More correct guesses in the barge-in condition (p=0.064); significantly more skips in the barge-in condition. 2. No evidence. 3. Participants found the voice significantly more natural in the embodied condition. 4. Participants found the clue-giver more natural in the embodied condition than the non-embodied condition (p=0.07).

Game-Framing/Regulatory-Fit Evaluation (Section 6.3)
  Main Hypotheses/Research Questions: 1. Can an automated agent personalize feedback based on regulatory focus theory to improve elicited behaviors and perceptions from users? 2. Does a gain or loss framing of the game elicit more positive behaviors/perceptions from users?
  System/Agent Implementation: Virtual Human
  # of Participants: 59
  Main Findings: 1. Evidence not found. Post-hoc analysis found evidence that general-purpose design guidelines (gain-framing and early success) enhanced player enjoyment and motivation. 2. Gain framing of the game elicited more enjoyment (p=0.076) than loss framing.

Online Machine Resource Guessing Evaluation (Section 7.1)
  Main Hypotheses/Research Questions: Can web-based resources & pre-compiled lexical databases generate high quality guesses?
  System/Agent Implementation: Web framework that scrapes the web & queries databases / linguistic normalizer
  # of Participants: N/A
  Main Findings: Yes. Ensemble methods significantly outperform individual methods. For variants of the game where the set of possible guesses is filtered, ensemble methods come in line with human performance. For variants of the game that only allow one clue per target-word, ensemble methods can significantly out-perform a human guesser.

NPCEditor Guessing Evaluation (Section 7.2)
  Main Hypotheses/Research Questions: Can a classification technology that uses information retrieval techniques commonly used in question-answering dialogue character systems be used to produce high quality guesses?
  System/Agent Implementation: NPCEditor
  # of Participants: N/A
  Main Findings: Yes. The method could produce a correct guess in its top 5 guesses almost half the time for a variant of the game where the set of possible guesses was filtered.

Comparative Content Sourcing Evaluation (Chapter 8)
  Main Hypotheses/Research Questions: RQ3: Are there capabilities possessed by multi-role agents not shared by their single-role counterparts that create positive interaction effects for their users? 1. Users' behaviors are most positive when interacting with the version of the agent that re-uses clues from previous interaction vs. the version that uses clues from machine resources. 2. Users' perceptions are most positive when interacting with the version of the agent that re-uses clues from previous interaction vs. the version that uses clues from machine resources.
  System/Agent Implementation: DialPort
  # of Participants: 106 for behavioral analysis, 26 for perception analysis
  Main Findings: RQ3: Yes, multi-role agents can observe a human performing a role they will perform in the future and re-use the observed content when the agent performs the observed role. 1. Significantly more (and better) guesses were made by users interacting with the version of the agent that re-uses interaction content. 2. Significantly higher perceived clue naturalness and overall game enjoyment for users interacting with the version of the agent that re-uses interaction content.

"Wizarded" Multi-Role Agent Evaluation (Section 9.1)
  Main Hypotheses/Research Questions: 1. Will users cooperate with an agent that performs more than one role of the interaction? 2. Can a multi-role agent elicit useful data? RQ1: Can a multi-role agent elicit positive user perceptions?
  System/Agent Implementation: Robot
  # of Participants: 16
  Main Findings: 1. Yes, users cooperated. 2. Yes, the agent elicited behavioral and user perception data. The perception data was used to investigate RQ1. The agent also elicited clues from human users interacting with the agent in the game context; these clues were used for the Comparative Content Sourcing Evaluation above. RQ1: Yes. The agent elicited average user enjoyment of 4.2 on a 5-point scale. Players chose to play on average an additional 2.1 optional rounds with the agent.

Multi/Single-Role Agent Comparative Evaluation (Section 9.2)
  Main Hypotheses/Research Questions: RQ1: Can an automated multi-role agent that leverages a full stack of interactive technologies elicit positive perceptions and behaviors from users? RQ2: Does an agent's performance of more than one role of an interaction negatively impact the elicited perceptions?
  System/Agent Implementation: Virtual Human
  # of Participants: 68
  Main Findings: RQ1: Yes. The agent elicited average user enjoyment of 4.1. 9 of 12 optional user comments in the multi-role condition had something positive to say; in comparison, this statistic was 4 of 13 for participants in the clue-giving condition and 7 of 18 for participants in the agent-guessing condition. Players chose to play on average an additional 1.9 optional rounds with the agent. RQ2: No. Non-inferiority statistical tests showed that the perceptions of users interacting with the multi-role agent were significantly non-inferior on the enjoyment, rapport, (agent) perceived intelligence, & team cohesion dimensions compared to the perceptions of users who interacted with single-role versions of the agent.

Also as noted, Table 1.2 provides details on the datasets generated from the experiments (or data collection efforts) described in this thesis. The first column lists the name of the dataset and the section where the experiment or data collection effort that generated that data is described. If a dataset was generated from experiment(s), the second column specifies the name of the experiment(s) where the data was generated (see Table 1.1). The third column describes the data generated. Finally, the fourth column specifies the data types in the dataset.

Table 1.2: Datasets

Machine Clue Corpus (Section 5.1)
  Source: Machine Clue Corpus Collection
  Description: Clues for target-words generated from machine resources including online dictionaries, ontologies, and lexical databases.
  Type: Text

Special Purpose Machine Clue Corpus (Section 5.2)
  Source: Clue Filtering Experiment
  Description: Subset of clues from the Machine Clue Corpus used in the Clue Filtering Experiment. These are the clues associated with the Clue Effectiveness Guess Set described in the next entry.
  Type: Text

Clue Effectiveness Guess Set (Section 5.2)
  Source: Clue Filtering Experiment
  Description: Crowd-sourced audio recordings of guesses elicited from crowd-workers after they listened to clues for target-words spoken by a synthesized or human voice. Annotated for whether the guess was correct.
  Type: Audio Recordings

Agent Clue-Human Guess Sequence Corpora (Chapters 6 & 9)
  Source: Incremental/Embodied Evaluation, Game-Framing/Regulatory-Fit Evaluation, "Wizarded" Multi-Role Agent Evaluation, Multi/Single-Role Agent Comparative Evaluation
  Description: Sequences of agent clues/human guesses for target-words recorded during agent-human game-play.
  Type: Audio Recordings, Text ASR Hypotheses

Turk Guess Set (Section 7.1.4)
  Source: Mechanical Turk Data Collection Effort
  Description: Crowd-sourced guesses provided in text format in response to seeing clues in text format for target-words.
  Type: Text

Turk Clue Set 1 and 2 (Sections 7.2.1 & 9.2.1)
  Source: Mechanical Turk Data Collection Effort
  Description: Crowd-sourced clues provided in text format for target-words that were also provided in text format.
  Type: Text

"Wizarded" Agent Guess-Human Clue Sequence Corpus (Section 9.1)
  Source: "Wizarded" Multi-Role Agent Evaluation
  Description: Sequences of agent guesses/human clues recorded during game-play where the agent used the robot embodiment and the guessing role was "wizarded".
  Type: Audio Recordings, Text ASR Hypotheses

"Wizarded" Agent Elicited Human Clue Dataset (Section 9.1)
  Source: "Wizarded" Multi-Role Agent Evaluation
  Description: A subset of the "Wizarded" Agent Guess-Human Clue Sequence Corpus containing all of the human-generated clues. Non-clues were filtered out & partial clues were manually concatenated.
  Type: Audio Recordings, Text ASR Hypotheses

"Wizarded" Robot Social Chat + Game Dataset (Section 9.1)
  Source: "Wizarded" Multi-Role Agent Evaluation
  Description: Audio recordings of free-form dialogue & MultiSense video recordings of a human user and a "wizarded" robot engaging in an ice-breaking social chat activity and both roles of the word-guessing game activity. Example questions in the social chat activity include ones asking a user about their hobbies, interests, and favorite movies.
  Type: Audio Recordings, Text ASR Hypotheses, MultiSense Videos, Survey Data with demographics, perception, & dominance personality user information

"Automated" Agent Guess-Human Clue Sequence Corpus (Section 9.2)
  Source: Multi/Single-Role Agent Comparative Evaluation
  Description: Sequences of agent guesses/human clues recorded during game-play where the agent used the virtual embodiment and both roles of the agent were fully automated.
  Type: Audio Recordings, Text ASR Hypotheses

Chapter 2 Related Work

"If you want the present to be different from the past, study the past."
Baruch Spinoza

In this chapter we provide an overview of the history of the human-computer dialogue field, pointing out its focus, until recently, on single-role agents. The first well-known dialogue agents were ELIZA [Weizenbaum, 1966] and Parry [Colby et al., 1971, Colby, 1981a], which were single-role agents capable of asymmetric, narrow, specific-domain chats. Beginning in the late 1980s and through the 1990s, dialogue agent research, catalyzed by The DARPA Spoken Language Systems Program [Sears, 1988], focused on practical single-role asymmetric agents intended to execute a user-desired action [Hemphill et al., 1990, Price, 1990, Zue et al., 1992, Dahl et al., 1994, Walker et al., 1998, Rudnicky et al., 1999]. During this time, there were also a couple of examples of single-role gaming agents [Burgener, 2005, Burgener, 2006a, Marti and Emnet, 2018]. (Burgener invented the 20Q game cited here in 1988, even though the patent and talk cited here are from 2005 and 2006, respectively.)

A controversial annual contest called the Loebner Prize Competition (http://aisb.org.uk/events/loebner-prize) was first held in 1991, where teams could submit programs that could compete to pass a slightly modified version of the Turing Test.
The competition's impact on the progress of the field of artificial intelligence research is controversial [Shieber, 1994], with critics maintaining that the contest rewards deception rather than genuine technological advances. In the mid-1990s a notable chatbot called A.L.I.C.E [Wallace, 2009] was created that went on to win the Loebner Prize three times.

At the turn of the millennium, the field continued to work on advancing asymmetric single-role task-based agents [Bohus and Rudnicky, 2003, Raux et al., 2005, Williams and Young, 2007, Nakano et al., 2011, Wang et al., 2011, Tur and Deng, 2011] but also began to pay attention to the social dimension of an interaction as well as the effects of endowing an agent with embodiment [Cassell et al., 1999, Bickmore and Cassell, 2000, Bickmore and Cassell, 2001, Bickmore and Casselle, 2005, Bickmore and Picard, 2005]. These agents were also some of the first to interleave social dialogue moves with task dialogue moves. Additionally, since 2000 there has been significant work on single-role gaming agents capable of asymmetric interaction [Von Ahn et al., 2006, Higashinaka et al., 2007, Sawaki et al., 2008b, Von Ahn and Dabbish, 2008, Jung and Graf, 2008, Bohus and Horvitz, 2009a, Bohus and Horvitz, 2009c, Adrian et al., 2016].

Starting in 2011, when Apple released Siri [Apple, 2018] in the iPhone 4S (Siri was a spin-off project of the DARPA-funded CALO project at SRI International Artificial Intelligence Center), many well-known technology companies deployed intelligent personal assistants (IPAs), which are generally considered single-role task-based agents; they assist users with scheduling, phone messaging/calling, controlling home appliances, finding restaurants, and looking up information or making transactions on the web. Recently, there has been a surge of interest in single-role chatbots capable of symmetric interaction with a user and designed with a goal of extended and/or repeated "small talk" [Yu et al., 2015, Yu et al., 2016a, Yu et al., 2016b, Carpenter, 2018, MicroSoft, 2018, Amazon, 2018b].

There has also been some work on agents capable of participating in multiple roles/activities (but not multiple roles of the same activity) [Nakano et al., 2011, Fasola and Mataric, 2012, Leite et al., 2016b, Artstein et al., 2016, Kennedy et al., 2017, Artstein et al., 2017a, Lucas et al., 2017, Lucas et al., 2018] and more recent work on agents that interleave social dialogue moves with task dialogue moves in an attempt to improve task outcomes or users' subjective perception of the agent [Tapus and Mataric, 2008, Tapus et al., 2008, Fasola and Mataric, 2012, Yu et al., 2017, Amazon, 2018a, Apple, 2018, Zhao et al., 2018]. While agents such as Siri and the main Alexa agent (mostly responsible for Internet of Things (IoT) tasks, scheduling tasks, and providing information such as weather and news) are capable of some social dialogue, their facility to be social is rather limited and their primary role is that of assistant. During the course of writing this thesis, agent platforms that loosely couple a set of agents capable of performing specific roles in different activities [Lee et al., 2017, Amazon, 2018b] have sprung up.

In the rest of this chapter we trace this history in more detail. In Section 2.1 we organize classes of agents according to the roles they perform.
In Section 2.2 of this chapter we compare and contrast common methods used by dialogue agents to source content for interaction in order to set the stage for our discussion of multi-role enabled content sourcing. In Chapter 8 we demonstrate that multi-role enabled content sourcing is a capability available to multi-role agents but not to single-role agents meant for asymmetric interaction.

2.1 Organization of Dialogue Agents According to Their Roles

Here we organize important classes of dialogue agents in terms of the roles they perform. We discuss representative example agents from each category. We also summarize some of the more common methods leveraged by the more common classes of agents to facilitate dialogic interaction, which helps contextualize the architecture we present for the test-bed agent (see Chapter 4). Section 2.1.1 presents the most common frameworks for single-role chatbot agents and provides details on some major chatbot agents. Section 2.1.2 discusses the history of single-role task-based agent research. Section 2.1.3 explores prior work that has been done on single-role gaming agents. Finally, Section 2.1.4 situates the type of multi-role dialogue research pursued in this thesis within this larger context of the field.

2.1.1 Single-Role Chatbots

In this section we discuss notable agents that fall in the class of single-role chatbots. The earliest chatbots were focused on specific narrow domains and were capable only of asymmetric interaction. They were successful in maintaining the illusion of human-like behavior mostly due to users having a preconception of who they were talking to, i.e., the agent had an expected role. This assumption allowed the agents to naturally conceal their lack of knowledge about the real world but still be proactive in the conversation without making salient inappropriate dialogue moves. Earlier chatbots were mostly rule-based and relied mostly on hand-authored content and templates that were sometimes instantiated with segments of previous user utterances.

More modern chatbots intend to engage a user in more open-ended and symmetric colloquial conversation without any explicit goals motivating the interaction. These newer variants are generally corpus-based, making use of documents containing human-human and human-machine dialogue (such as movie scripts, call logs, or prior human-agent interactions) or sometimes documents containing non-conversational text (such as online encyclopedias like Wikipedia or news sites like the New York Times) to generate responses. Typically, these bots keep track of a small amount of discourse context and fall into one of two categories based on how they retrieve their content: information-retrieval (IR) agents and sequence-to-sequence (seq2seq) agents.

Information-retrieval chatbots typically choose a response line from their corpus by calculating the similarity between each line in their corpus and a given user utterance and then return either the line in their corpus that is most similar to the user utterance or the line in the corpus that followed the line with maximum similarity to the given user utterance. Similarity between corpus lines and user utterances can be calculated in different ways; one common measure is the cosine similarity between vector representations (either tf-idf or word embeddings) of the lines and user utterances.
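To make the retrieval step concrete, the following is a minimal sketch (in Python) of an information-retrieval response policy; it is not drawn from any particular system, and the toy corpus, the scikit-learn tooling, and the choice to return the line that follows the best match are illustrative assumptions.

# Minimal sketch of an information-retrieval chatbot response policy.
# The toy corpus and the "return the line after the best match" strategy
# are illustrative assumptions, not a description of a specific system.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "hello there",
    "hi, how are you doing today?",
    "i am fine, thanks for asking",
    "what do you like to do for fun?",
    "i enjoy playing word-guessing games",
]

vectorizer = TfidfVectorizer()
corpus_vectors = vectorizer.fit_transform(corpus)

def respond(user_utterance):
    # Vectorize the user utterance and score it against every corpus line.
    user_vector = vectorizer.transform([user_utterance])
    scores = cosine_similarity(user_vector, corpus_vectors)[0]
    best = scores.argmax()
    # Return the line that followed the most similar line, falling back to
    # the most similar line itself when it is the last line in the corpus.
    return corpus[best + 1] if best + 1 < len(corpus) else corpus[best]

print(respond("how are you?"))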
Sequence-to-sequence chatbots generate a response to a user utterance using machine translation techniques to map from the input utterance to the output utterance. A sequence-to-sequence model typically consists of two recurrent neural network models. The encoder model is used to output a thought vector composed of features that capture important information in an input sequence (sentence) but that also have minimal redundancy with each other. The encoder processes one word at each time-step in successive hidden layers, layers of neurons (or nodes) that are not the output layer but rather connected to other neuron layers. The decoder model takes the thought vector as input and generates an output sentence word by word. At each time-step the decoder takes into account the thought vector and all the words it has already output.

There are two main strengths of IR agents compared to seq2seq agents. First, IR agent responses are typically grammatically correct (dependent on the corpus used). Second, IR agents do not require a lot of training data to train intricate machine learning models. IR agent disadvantages include their lack of robustness at handling unseen input if no appropriate response exists in the corpus and their inability to make use of context, such as being able to refer back to named entities that appeared in a user utterance.

There are two main advantages of sequence-to-sequence agents over their information-retrieval counterparts. First, they have a better ability to appropriately handle unseen input. Second, they are more capable of referring back to prior context in an utterance, such as named entities that appeared in a given user utterance. Disadvantages of sequence-to-sequence agents include a higher frequency of ungrammatical output (since response utterances are generated word by word and they need to learn syntax on their own) and their reliance on a large amount of training data for their more intricate machine learning models.

In the rest of this section we discuss notable examples of chatbots, first earlier ones that engage in asymmetric interaction and then examples that engage in symmetric interaction.

Asymmetric Chatbots

ELIZA

One of the first well known programs capable of conversing in natural language was the single-role asymmetric text-based agent ELIZA [Weizenbaum, 1966], built by the computer scientist Joseph Weizenbaum at MIT in 1966, who was motivated to show how shallow the interaction between humans and machines was. ELIZA was really a framework for building these types of agents, but most people know ELIZA because of the discussion around the DOCTOR script, an instantiation of the ELIZA framework which was intended to simulate a Rogerian psychotherapist. (From now on, ELIZA will refer specifically to the DOCTOR script.)

ELIZA performed best when users were told to "talk to it just as one would talk to a psychiatrist" [Weizenbaum, 1966], a well-specified agent role. This expectation lessened the burden on the agent to output dialogue moves that demonstrated real-world knowledge. In other words, when properly prepped for the interaction, users gave ELIZA the benefit of the doubt.
For example, repetition of user phrases according to simple rules is more natural in this interaction (as opposed to seeming repetitive) because Rogerian psychotherapists commonly use an intervention called restatement, where the therapist repeats or rephrases what the client says in order to provide an opportunity for the client to clarify anything that might have been misinterpreted or to delve into something deeper.

Under the hood, ELIZA worked by identifying key words in a user utterance according to a partially ordered list of key words, applying a transformation rule (decomposition of text based on a pre-defined pattern) associated with the identified keyword, and then generating a response to the user according to a re-assembly rule. An example decomposition and re-assembly rule, taken from the original paper, can be seen in Equations 2.1 and 2.2. Given a user utterance containing the words YOU and ME, such as "It seems that you hate me", ELIZA uses the decomposition rule to decompose the utterance into 4 arguments: 1 (IT SEEMS THAT), 2 (YOU), 3 (HATE), 4 (ME). ELIZA then feeds the arguments as input into the re-assembly rule specified for that decomposition rule and responds with the utterance "what makes you think I hate you".

Decomposition Rule: (0 YOU 0 ME)    (2.1)
Re-assembly Rule: (WHAT MAKES YOU THINK I 3 YOU)    (2.2)

ELIZA also made use of a MEMORY queue and thus was the first known agent to leverage the hierarchical structure inherent in dialogue. If the keyword "my" was the highest priority keyword in a user utterance, parts of the user utterance would be stored on the queue and used later. For example, if ELIZA didn't find any keywords in a future user utterance, ELIZA would pop the top of the queue and use it in an utterance like "EARLIER YOU SAID YOUR <QUEUE ITEM>".
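The decomposition/re-assembly mechanism in Equations 2.1 and 2.2 can be illustrated with a short sketch; the regular-expression encoding below is an illustrative Python rendering, not Weizenbaum's original notation.

# Minimal sketch of an ELIZA-style decomposition/re-assembly rule, following
# the (0 YOU 0 ME) example in Equations 2.1 and 2.2.  The regular expression
# stands in for the "0 YOU 0 ME" decomposition pattern.
import re

DECOMPOSITION = re.compile(r"^(.*)\bYOU\b(.*)\bME\b", re.IGNORECASE)

def reassemble(user_utterance):
    match = DECOMPOSITION.match(user_utterance)
    if match is None:
        return None  # rule does not apply; ELIZA would try other keywords
    # Argument 3 of the decomposition is the text between YOU and ME.
    argument_3 = match.group(2).strip()
    # Re-assembly rule: (WHAT MAKES YOU THINK I 3 YOU)
    return "WHAT MAKES YOU THINK I {} YOU".format(argument_3.upper())

print(reassemble("It seems that you hate me"))
# -> WHAT MAKES YOU THINK I HATE YOU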
Parry

The psychiatrist Kenneth Colby created a single-role asymmetric chatbot designed to emulate a paranoid person, Parry [Colby et al., 1971, Colby, 1981a], in 1972. Clinical psychiatrists were given transcripts of other psychiatrists interacting with Parry as well as with other paranoid patients and were not able to identify the interactions involving Parry above chance levels. With Parry, Colby introduced the idea of the virtual patient; the idea that students of a medical discipline could practice their skills on a simulated version of a patient with appropriate issues before being "let loose" on real patients. Again, Parry's human-like fidelity is mostly owed to the expected agent role held by the people, clinicians, who interacted with him.

Parry worked similarly to ELIZA, as both were rule-based agents highly dependent on keyword identification. Parry had a slightly more sophisticated operating procedure, as the agent kept track of its affective state (e.g., fear, anger, mistrust), which was Parry's conceptualization of context. Parry mapped input to one of several input categories such as input contains mention to self in sensitive area or question. Parry updated the activation level of his affect variables, aka Parry's emotions, based on update rules that were triggered by identifying combinations of key words (that appeared in input or output) that were linked to sensitive topics. An example of an update rule for Parry's Anger and Fear affect variables can be seen in Equation 2.3, where the level of the affect variable at the (i)th input-output pair is computed as a function of its level at the (i-1)th input-output pair plus a percentage, defined by the RISE variable, of the difference between the variable's maximum value (20) and the variable's level at the (i-1)th input-output pair.

VAR_i = VAR_{i-1} + RISE_var * (20 - VAR_{i-1})    (2.3)

Parry chose what to say next based on which category an input was mapped to and the current activation level of his affect variables. The agent could be run in a weakly paranoid mode or a strongly paranoid mode. In the weakly paranoid mode, the initial values of the affect variables are set low and the variables' rise is slower than in the strongly paranoid mode. In the strongly paranoid mode, the affect variables could be set low or high and there is an additional delusional complex built into the mental model that is elicitable if certain key words appear in the dialogue. For example, the word "mafia" in the input "Are you connected to the mafia?" is mapped to the category input contains initial mention of delusional topic in the strongly paranoid version, which triggers the response "You know they know me" (assuming affect variables are at a certain level).

Symmetric Chatbots

ALICE

One of the most notable chatbots that participated in the Loebner competition was ALICE [Wallace, 2009], originally designed by computer scientist Richard Wallace in 1995. ALICE won the Loebner Prize contest in the years 2000, 2001, and 2004. The bot is built using AIML, an XML-based language for crafting stimulus-response chatbots (also originally created by Wallace). ALICE's construction was based on the original ELIZA program but has evolved over the years through the efforts of Wallace and over 500 developers. ALICE, except for a few cases, operates under the assumption that context is not important in an informal conversation and only takes into account the user's last utterance each time it generates a response. The corpus ALICE uses to select dialogue lines from has over 41,000 categories (ELIZA, by comparison, made use of a corpus with only 200 categories), which are composed of a question and an answer (stimulus/response or pattern/template) and an optional context. The answer or template can be composed of a natural language sentence and/or additional AIML tags allowing for conditional answers, recursive calls to other categories, and calling other programs that might transform the template's content. The categories in ALICE's corpus are organized into topics. The categories are stored in a tree-structure object that defines efficient pattern storage and matching algorithms.

ALICE learns via a designated person, the bot-master (or wizard), who updates ALICE's corpus of AIML content in one of two ways. First, the wizard can suggest alternative ways of phrasing a given user stimulus (or utterance) as well as use an automatic tool, the knowledge wizard, that automatically suggests alternate phrasings for a given stimulus. Second, the wizard can look through ALICE's log-files for inappropriate replies to a stimulus and update the category of the original stimulus so that the response is more appropriate in the future.

Real-World Chatbots

Chatbots deployed in the real world include Microsoft's Xiao-ice [MicroSoft, 2018], dubbed the girlfriend app (https://www.nytimes.com/2015/08/04/science/for-sympathetic-ear-more-chinese-turn-to-smartphone-program.html), and its American and Japanese counterparts Tay (later reincarnated as Zo) and Rinna.
Soon after its initial release in 2016, Tay began to output offensive posts due to malicious users (http://www.businessinsider.com/microsoft-deletes-racist-genocidal-tweets-from-ai-chatbot-tay-2016-3) and was taken offline. Another real-world chatbot example is the rule-based cleverBot [Carpenter, 2018], which crowdsources its content by recycling content from previous interactions (keeping track of which prompts triggered which response). AI scientist Rollo Carpenter designed cleverBot, which grew out of his original project jabberWacky (http://www.jabberwacky.com/). cleverBot's website states that the application is meant as "an entertainment - not made to be logical, give advice, or be useful" and its inventor has said it can be considered a "conversational wikipedia" (https://www.livescience.com/15940-cleverbot-computer-chats-human.html).

2.1.2 Single-Role Task-Based Agents

Historically, dialogue agent research has been more practically minded, with a focus on agents designed for asymmetric single-role interactions intended to help a user with a specific task. Beginning in the late 1980's and early 1990's, with improvements in automatic speech recognition and text-to-speech technologies, the field began to work on agents with speech interfaces.

The most common architecture for a task-based dialogue agent can be found in Figure 2.1. An automatic speech recognizer (ASR) maps the waveform associated with a user's utterance to text, an ASR hypothesis, using an acoustic and language model. The natural language understanding (NLU) module then typically parses the text to a machine-readable semantic representation of the text's surface form. The dialogue manager (DM) is then responsible for generating the next move in the interaction based on the current input and current discourse context. This might involve executing a database query if needed to get user-requested information and/or updating context variables in the discourse context. Over the years task-based agents have used the following approaches to dialogue management: finite-state, form-filling, information-state-update, and plan-based approaches [Bohus and Rudnicky, 2003]. We briefly summarize the first two in the following paragraphs as they are the most common.

The DM then sends a semantic representation of the next dialogue move to the natural language generation (NLG) module, which is responsible for mapping the semantic representation to a surface form. The NLG module then constructs a surface form, for example by using a pre-defined template instantiated with values taken from the information sent by the DM, and then forwards the surface form of the next utterance to the text-to-speech (TTS) module. The TTS module maps the text to a sequence of phonemes, atomic units of sound, or diphones, adjacent half-phonemes, and outputs the constructed synthetic sound-wave.

Figure 2.1: Classical Dialogue Agent Architecture
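The following is a minimal sketch of this pipeline with every component stubbed out; the function names and toy outputs are illustrative assumptions, and a real system would replace each stub with an actual ASR engine, NLU model, dialogue manager, NLG component, and TTS engine.

# Minimal sketch of the classical pipeline in Figure 2.1 with stubbed
# components; all names and outputs here are illustrative.

def asr(audio):
    # Map the user's waveform to an ASR hypothesis (text).
    return "i want to rent a car"  # stubbed hypothesis

def nlu(text):
    # Parse the hypothesis into a machine-readable semantic representation.
    return {"intent": "rent_car", "slots": {}}

def dialogue_manager(semantics, context):
    # Decide the next dialogue move given the input and discourse context.
    context["active_frame"] = semantics["intent"]
    return {"move": "request_slot", "slot": "RENTAL CITY"}

def nlg(move):
    # Map the semantic representation of the move to a surface form,
    # here by instantiating a pre-defined template.
    templates = {"request_slot": "What {slot} would you like?"}
    return templates[move["move"]].format(slot=move["slot"].lower())

def tts(text):
    # Map the surface form to synthesized speech (stubbed as a print).
    print("AGENT:", text)

context = {}
tts(nlg(dialogue_manager(nlu(asr(b"...")), context)))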
In the following two subsections we discuss two of the most common approaches taken by task-based agents to dialogue management and some examples of agents and toolkits that fall under these categories.

Finite-State Agents

A finite-state DM represents the entire interaction as a finite-state automaton and relies heavily on the assumption that a scripted dialogue is sufficient for the interaction. Each user input utterance is associated with a state transition and each state has associated agent actions. This representation works well for simple agent-initiative interactions, where only the agent can cause a conversation topic shift. However, this representation becomes difficult to manage as the interaction domain increases in complexity and/or there is a need for mixed-initiative interaction, where the user or agent can shift topics, as the number of states and transitions becomes unwieldy. [Sutton et al., 1996] is a toolkit for building finite-state dialogue agents and [Graesser et al., 2004b] is an example of an automatic tutoring agent that uses finite-state dialogue management.

Frame-Based Agents

Much of the research on task-based agents has focused on the frame-based architecture first introduced in [Bobrow et al., 1977], which assumes a well defined domain ontology. A frame, a collection of information, has a set of slots that are instantiated by possible values and possibly other slots and/or frames. For example, a simple frame architecture for an airline reservation and car rental booking agent can be seen in Figure 2.2. Here, the FLIGHT RESERVATION FRAME has an ARRIVAL CITY slot with 3 possible values: New York, Los Angeles, or Chicago. The DEPARTURE TIME slot is composed of an HOUR slot and a MINUTE slot, which can take on the integers between 1-24 and 1-60 respectively (the agent assumes a user is specifying military time). Machine learning classifiers are trained and/or rules written that map user utterances to certain frames, e.g., "I want to rent a car" activates the CAR RENTAL FRAME. Classifiers and rules are also used to map words in user utterances to certain values for certain slots of an activated frame. Example frame-based architecture research agents include ones that make a flight reservation [Hemphill et al., 1990, Price, 1990, Dahl et al., 1994]. Almost all currently commercially deployed task-based agents, including Siri, Alexa [Amazon, 2018a], Microsoft Cortana, Facebook M, and Google Home, have a frame-based architecture [Shum et al., 2018, Pan, 2018].

Figure 2.2: Example Frames

FLIGHT RESERVATION FRAME
ARRIVAL CITY: [New York, Los Angeles, Chicago]
DEPARTURE CITY: [New York, Los Angeles, Chicago]
DEPARTURE TIME: [ HOUR: [1-24] MINUTE: [1-60] ]
DEPARTURE DATE: [ MONTH: [1-12] DAY: [1-31] ]

CAR RENTAL FRAME
RENTAL CITY: [New York, Los Angeles, Chicago]
PICK-UP DATE: [ MONTH: [1-12] DAY: [1-31] ]
PICK-UP TIME: [ HOUR: [1-24] MINUTE: [1-60] ]
TYPE: [Suv, Convertible, Sedan]
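As a minimal sketch of how such a frame might be represented and filled, the following encodes a simplified subset of the CAR RENTAL FRAME from Figure 2.2 as a Python dictionary of slots with their permissible values; the keyword-matching rule used to fill slots is purely illustrative and much simpler than the trained classifiers used by real frame-based agents.

# Minimal sketch of a frame as a set of slots with permissible values.
# The toy keyword rule below fills any slot whose permissible value is
# mentioned verbatim in the utterance; real agents use classifiers/rules.
CAR_RENTAL_FRAME = {
    "RENTAL CITY": ["New York", "Los Angeles", "Chicago"],
    "TYPE": ["Suv", "Convertible", "Sedan"],
}

filled = {}

def update_frame(utterance):
    for slot, values in CAR_RENTAL_FRAME.items():
        for value in values:
            if value.lower() in utterance.lower():
                filled[slot] = value

def next_unfilled_slot():
    for slot in CAR_RENTAL_FRAME:
        if slot not in filled:
            return slot
    return None  # frame is complete; the agent can execute the booking

update_frame("I want to rent a car in Chicago")
print(filled)                # {'RENTAL CITY': 'Chicago'}
print(next_unfilled_slot())  # TYPE -> the DM would ask for the car type next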
2.1.3 Single-Role Dialogic Gaming Agents

There have been many agents built that can play asymmetric dialogic games with a human user. Some of these agents were mainly meant for artificial intelligence and human-computer interaction research, some were meant for pedagogical purposes, and others were built with the goal of satisfying a user's intrinsic motivations like enjoyment. Typically, these agents were only able to play one role in an asymmetric game.

Gaming agents are more similar to task-based agents than to chat agents in that the game structure constrains the set of dialogue moves the agent needs to expect at given times, and designers generally assume that players will be cooperative and adhere to the game structure during the interaction. On the other hand, gaming agents are more similar to chatbots in that users are typically interacting with them to satisfy intrinsic motivations and thus subjective perceptions of the agent are more important.

A variety of architectures for dialogic gaming agents have been used over the years and no clear standard architecture exists, but rule-based agents have been the most common. The rules generally map user utterances to corresponding game actions based on the expected game action at the given time. There have been some more advanced agents that leverage machine learning models to optimize which game move to make or the best content to select for a given game move [Burgener, 2005, Ferrucci, 2010]. In the rest of this section we provide notable examples of gaming agents from the literature, organizing them into two categories. In the first category are dialogic game agents designed primarily to satisfy users' intrinsic motivations. In the second category are dialogic game agents designed primarily to investigate a specific research question.

Games Intended to Satisfy Intrinsic Motivations

20 Questions

One of the first well known interactive dialogic gaming systems was built for the 20 Questions guessing game: [Burgener, 2005] patented an algorithm for a neural network designed to play the game, which powers a website that can be used to play it (http://www.20q.net/). Besides its lack of multi-role capabilities, the 20 Questions game differs from the game played by the test-bed agent in that the 20 Questions game is adversarial in nature, with the two players having opposite goals. The game played by the test-bed agent in this work is cooperative, allowing for an investigation of user team cohesion perceptions. In the case of the 20 Questions agent, the human's goal is to think of something the algorithm (the other player) cannot guess correctly.

Who Is This

[Sawaki et al., 2008b] discuss an agent that generates hints to a user who is tasked with associating the hints with a particular famous person and making an appropriate guess. The agent was designed to have users "want to be near it for a long time". Although this is a collaborative game, the agent can only perform the game's hinting role.

[Higashinaka et al., 2007] conducted a study to learn an optimal ranking of hints for this agent. Their ranking model used information retrieval features, positional features, bag-of-words features, and semantic features. The information retrieval features were generated using two different corpora and included term-frequency inverse-document-frequency (tf-idf) based on different definitions of a document unit (e.g., sentence or section) and PMI features for the content words in a hint and the famous person's name. Positional features were generated based on where a content word appeared in a hint. Bag-of-words features were generated using a morphological/dependency tool. Finally, semantic features were generated using a semantic analyzer that mapped words in the hints to semantic categories.
The ranking model, which beat other baselines, found the point-wise-mutual information feature to be the most useful for identifying hints more likely to elicit a correct guess.

The interactive agent used automatic speech recognition and synthetic speech and was embodied in a stuffed toy as well as a computer graphic display. The agent gave feedback to users based on how close their guess was to the target famous person. The distance between people was calculated using a person network which contained information about how often two people's names co-occurred on a subset of internet pages. The interactive agent used only the PMI feature from the ranker just described for ordering the hints. The agent was able to automatically generate quizzes for people from different categories like sports or politics.

A post-evaluation showed that users did not always find easy hints "interesting", indicating the most interesting game experience is not necessarily one that maximizes the score but one that also provides a bit of a challenge. The post-evaluation also provided evidence that users with a high correlation between their interest level and knowledge of famous people in their favorite categories generally had higher interest, and users with a low correlation between their interest level and knowledge of famous people in their favorite quiz category generally had lower interest. Importantly, the evaluation showed that quiz category was sometimes more important than hint difficulty in keeping a user motivated.

Games Meant for Research Purposes

Daboo

An early single-role gaming agent was the text-based Daboo [Marti and Emnet, 2018], which also used the WordNet [Miller, 1995] database as a source of clues and served as a clue-giver for the popular word-guessing game Taboo™ (similar to the agent in this work). More specifically, that work discussed methods for adapting to a user's "strongest area of knowledge"; the most interesting of which attempts to model the semantic relationship between the target-word and the last user guess by leveraging the trees available in WordNet that model how one word can be related to another word via a "kind of" relation (in linguistics called a hyponym). This algorithm allowed the Daboo agent to find the closest entity in WordNet which both the last guess and target-word are "a kind of" (a branching point) as well as the two children of that branching point, which provide some information on how the target-word and guess differ. The Daboo agent is similar to the test-bed agent from this work in that it is intended to be a collaborative team-mate in a word-guessing game, but it can only play the clue-giving role of the asymmetric game interaction.

GWAP

Gaming agents have also been used previously in artificial intelligence, some in the form of games with a purpose (GWAP) in "which people, as a side effect of playing, perform tasks computers are unable to perform" [Von Ahn and Dabbish, 2008]. One example is Verbosity [Von Ahn et al., 2006], which is an activity used to build up a computer's common-sense model of the world. It involves the computer playing an adapted version of Taboo over the web with one player providing hints using pre-constructed sentence templates (as opposed to natural language) and the other attempting to guess the target-word.

The human-like robot head Furhat [Al Moubayed et al., 2014] has been used to play a simple multiple-choice quiz game (https://www.youtube.com/watch?v=v84e6HMFbyc).
[Bohus and Horvitz, 2009c, Bohus and Horvitz, 2009a] discuss an agent where an avatar head is displayed on a screen that also plays a multiple-choice quiz game, but the purpose of this agent was to study multi-party engagement in an open-world setting. Besides their lack of multi-role capabilities, the Furhat agent and the multi-party engagement agent differ from the agent in this work in that the former agents are not structured to support repeated play, nor have the questions asked during the game been studied to determine which ones are more effective at eliciting correct answers.

Watson

Another single-role gaming system was IBM's Watson, which was used in the semi-adversarial clue-guessing game Jeopardy. The agent uses a myriad of sophisticated natural language processing techniques to answer questions in real time [Ferrucci, 2010]. The future work section in that paper specifically discusses how Watson is constrained by the nature of the interaction in Jeopardy insofar as it cannot participate in a dialogue during the game to resolve ambiguities, which does not hold true for the game played by our agent.

[Adrian et al., 2016] work on the opposite role, the guesser, for a specific instance of the Taboo™ game: the Location Taboo (LT) challenge, which uses only cities as target terms and serves as a competition for artificial guessing agents. Human hints are used as input for the artificial guessing agents. In [Adrian et al., 2016] a two-tier semantic architecture is discussed where different semantic distance metrics are used to first relate a hint to a country and then the hint to cities in that country. The semantic distance metrics include one based on the hierarchical relationships that can be found in WordNet (somewhat similar to Daboo's algorithm). Other semantic distance metrics this guesser uses are based on a Wikipedia corpus and use the number of pages returned when the corpus is queried with a given hint and geographic location in isolation relative to how many pages are returned that reference both the hint and geographic location. The highest accuracy in this work is achieved using one of the metrics used on the Wikipedia corpus, point-wise-mutual information.
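As a concrete illustration of this last metric, the following is a minimal sketch of a PMI computation over document counts; the counts are illustrative stand-ins for, e.g., the number of Wikipedia pages that mention a hint, a candidate city, and both together.

# Minimal sketch of point-wise mutual information over document counts:
# PMI(x, y) = log( P(x, y) / (P(x) * P(y)) ), with probabilities estimated
# from page counts.  The counts below are illustrative, not real data.
import math

def pmi(count_x, count_y, count_xy, total_docs):
    p_x = count_x / total_docs
    p_y = count_y / total_docs
    p_xy = count_xy / total_docs
    return math.log(p_xy / (p_x * p_y))

# Assumed counts: pages mentioning the hint, the city, and both together.
print(pmi(count_x=5000, count_y=800, count_xy=120, total_docs=1_000_000))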
2.1.4 Situating Fully Interactive Multi-Role Dialogue Agent Research

In this section we traced the history of human-computer dialogue research, pointing out its emphasis, until recently, on single-role agents. We organized important classes of dialogue agents in terms of the roles they perform. Previous human-computer dialogue research has concentrated on single-role chatbot agents, single-role task-based agents, single-role gaming agents, agents that formally interleave task and non-task social content, agent portals that serve as a front-end for dialogue agents that perform different activities/roles, and agents that perform more than one role but in different activities. None of the agents we discovered while conducting the literature review performed more than one role in the same activity. Moreover, while each agent discussed leveraged different automated interactive technologies, rarely did they tie a full stack of interactive technologies together, including automatic speech recognition, text-to-speech, natural language understanding, automatic content generation, incremental dialogue management, embodiment, and non-verbal behavior generation.

In contrast, the test-bed agent in this work performs more than one role (both clue-giving and guessing) in an asymmetric game activity. Further, the test-bed agent leverages a fully automated architecture (see Section 4.2) that endows the test-bed agent with all of the interactive capabilities just mentioned.

2.2 Content Sourcing

One major challenge in developing a dialogue agent is endowing the agent with the ability to output utterances that are appropriate given the context of the current conversation. Outputting utterances can be broken down into two sub-problems: a content sourcing/generation policy that determines what content is available, and a content selection policy that chooses what to say at the current moment. In this section we focus on common methods used for content sourcing. This discussion helps contextualize our content sourcing methods for the test-bed agent's clue-giving role (see Chapter 5) and guessing role (see Chapter 7). This discussion also helps set the stage for our comparative content sourcing evaluation (see Chapter 8) that demonstrates the benefits of a content sourcing method available to multi-role agents, multi-role enabled content sourcing, that is not available to single-role agents intended for asymmetric interaction. Chapter 8 therefore serves to address RQ3: "Are there capabilities possessed by multi-role agents not shared by their single role counterparts that create positive interaction effects for their users?".

Common ways to source/generate content for a dialogue agent include authoring (e.g., by experts [Georgila et al., 2018] or crowd-workers [Leite et al., 2016c]), or machine generation/extraction approaches using corpora [Oh and Rudnicky, 2000, Traum et al., 2015] or knowledge resources [Fang et al., 2017, Pincus et al., 2018, Ruan et al., 2019].

A less common way of sourcing content is observing human interlocutors engaged in interaction similar to the expected agent interaction using a data elicitation tool [Manuvinakurike and DeVault, 2015, Von Ahn and Dabbish, 2008], which we call re-use of observed content. Another way to source content is to directly reuse previous user interaction content between the human and agent. Multi-role agents have the opportunity to observe human interlocutors performing a role the agent will be expected to perform in the future. This provides multi-role agents an opportunity to source interaction data for the observed roles that they can later re-use with different human interlocutors when the agent is performing the observed role. We call this content sourcing method multi-role enabled content sourcing.

Table 2.1: Common Content Sourcing Policies

              | Expert Authoring | Crowd-Sourced Authoring | Machine Generation/Extraction | Re-use of Observed Content | Multi-Role Enabled Content Sourcing
Time          | Higher           | Higher                  | Lower                         | Medium                     | Lower
Cost          | More Expensive   | More Expensive          | Cheaper                       | More Expensive             | Cheaper
Scalability   | Lower            | Lower                   | Higher                        | Lower                      | Higher
Situatedness  | Medium           | Medium                  | Lower                         | Higher                     | Fully-Situated
Quality       | Human            | Human                   | Machine / Human               | Human                      | Human

2.2.1 Comparison of Content Sourcing Policies

Table 2.1 compares and contrasts the common content sourcing strategies on several important measures (seen in the first column). The more positive measure values are shown in green and the less positive measure values are shown in red. Time refers to the amount of time it takes to source content using the method; generally, less time is better.
Cost refers to the amount of money it takes to source content using the method; generally, less expensive is better. Scalability refers to the ease of scaling the method for generation of large amounts of content for many users (which is a function of cost and time). Higher scalability is preferred for an agent intended to be used to satisfy intrinsic motivations and/or extended or repeated interaction.

Situatedness is a measure that seeks to capture the similarity between the interaction context the content was originally sourced from and the interaction context where the content is expected to be used. Higher situatedness is preferred as it is expected to produce content more appropriate/relevant (on average) for the expected agent interaction.

Finally, quality is a measure that seeks to capture how positively users will perceive the content as well as how well the content can elicit positive behaviors from users during the interaction. Machine quality can be less positively perceived (and elicit less positive behaviors) if the content has grammatical issues or is irrelevant for the interaction context. Generally, there is a trade-off between scalability and quality. The less scalable content sourcing methods (e.g., expert authoring) have higher quality content on average than the more scalable content sourcing methods (e.g., machine generation).

In the first two columns of Table 2.1 are expert and crowd-worker authoring. These methods can take more time relative to other methods since authors need to be found and vetted. Also, depending on the domain, it can be difficult to think of new content as more and more content is needed. Further, appropriate summary representations of the domain need to be generated for these authors, which can involve increased work time for the agent designers. Cost is on the higher side since authors need to be compensated. Since time and cost are higher, this implies scalability is lower. Even with a detailed summary representation of the context of the conversation, there will inevitably be information that is lost, resulting in content that is only semi-situated and might not be useful for unseen situations. Authored content will be of human quality.

On the other hand, as seen in the third column of Table 2.1, machine generation/extraction policies can be more scalable since these approaches can produce large amounts of content relatively quickly and cheaply. However, machine generation/extraction source material is generally less situated since it was possibly created for different purposes than the expected agent interaction. In the case of generation, this method produces machine-quality content, while in the case of extraction it produces human-quality content (although possibly relatively unsituated).

In the fourth column of Table 2.1 is re-use of observed content. Similar to authoring methods, this method can be time-consuming and expensive since two or more people generally need to be artificially matched to engage in the desired type of interaction and compensated for their time. Further, while the content is likely more situated than content produced via authoring and machine generation/extraction methods, it is still only semi-situated, as situated information (e.g., dialogue agent embodiment type) can result in material differences in produced content. Re-use of observed content is of human quality.
Finally, in the fifth column we see multi-role enabled content sourcing, which ranks best according to our measures compared to the other content sourcing methods. Cost and time are lower assuming people are motivated to interact with the agent for task or entertainment purposes. This allows content to grow organically, which suggests scalability should be higher. This type of content is also fully-situated since it is generated in a context as similar as possible to the context of the dialogue agent's expected interaction and thus should be able to capture contextual idiosyncrasies of the interaction that content sourced from other places might miss. Multi-role enabled content sourcing is also of human quality.

2.2.2 Situating Multi-Role Enabled Content Sourcing

In this section we discuss how the human-computer dialogue field has made use of content sourcing policies that re-use observed content from previous interactions up to now. This helps situate where multi-role enabled content sourcing fits in the literature as a content sourcing method.

Dialogue agents that source content observed during previous user interactions for use in the current interaction with the observed user have been around since the beginning of the human-computer dialogue field [Weizenbaum, 1966]. It has been common practice in the field for dialogue agents to use either rule-based or data-driven methods to instantiate pre-defined templates with observed content from the current interaction for response generation. Some agents have done so in order to attempt to maintain topic coherency [Weizenbaum, 1966, Colby, 1981b, Bowden et al., 2018] while others use this technique for purposes of attempting to make more natural confirmation or clarification dialogue moves which aim to ensure or help agent understanding [Oh and Rudnicky, 2000, Boyce and Gorin, 1996, Pincus et al., 2013].

However, there are only a few examples of dialogue agents that re-use observed content in future interactions with different non-observed users as a content sourcing method [Carpenter, 2018, Leite et al., 2016a, Bright, 2016, Yu et al., 2016c], and all of these agents engage in a symmetric interaction, social chat. Further, none of the previous agents that source content in this manner have been empirically evaluated in a comparative content evaluation experiment with "real" users "in the wild". Also, only one of those agents [Leite et al., 2016a] has been evaluated in a setting where the agent employed a full stack of dialogue technologies including automatic speech recognition and text-to-speech.

This is likely mostly due to the field's focus, until recently, on task-based agents that generally only performed the service provider role of a task (e.g., providing access to e-mail [Walker et al., 1998], acting as a tutor [Graesser et al., 2004a], or providing bus scheduling information [Raux et al., 2005]). (Barring a couple of exceptions that performed the service requester role: Google Duplex (https://www.youtube.com/watch?v=fBVCFcEBKLM) was demoed at the Google I/O developer festival in May 2018 and can schedule appointments for people; BigBlueBot [Weisz et al., 2019] performs the role of a bank support customer when a human plays the role of a bank support agent, and was built with the goal of improving the ability of human users to interact with chat-bots.) Re-using previous interaction content for service providers is a less viable strategy. These agents engage in asymmetric interactions, where participating interlocutors generally have different competences/obligations & rights to act [Allwood, 2000], and do not have the opportunity to observe a human performing the role (and making the same kind of acts) that the agent will perform in the future. This thesis addresses this gap in the literature by demonstrating that this is a viable and even preferred content sourcing policy available to multi-role agents.
The rest of this section provides more details on agents from the literature that have sourced content from previous user interactions.

Eliza

One of the first well known interactive dialogue agents, ELIZA [Weizenbaum, 1966], re-used observed content in the current interaction. As mentioned, ELIZA is a rule-based dialogue agent framework most well known for its use in a script that sought to emulate the role of a Rogerian therapist. Agents built using this framework generate responses via a set of re-assembly rules (or templates), many of which are instantiated with parts of prior user utterances. The framework also makes use of a memory queue so that if the agent did not identify any key words in a user utterance to base a response on, it popped the most recent user utterance off the memory queue, pre-pending it with the words "earlier you said your <QUEUE ITEM>...".

Since ELIZA, it has been a common practice for dialogue agents to use either rule-based or data-driven methods to instantiate pre-defined templates with observed content from the current interaction for response generation. Example agents that employ this technique include agents, similar to ELIZA, that do so in order to attempt to maintain topic coherency [Colby, 1981b, Bowden et al., 2018]. Other agents use this technique for purposes of attempting to make more natural confirmation or clarification dialogue moves which aim to ensure or help agent understanding [Oh and Rudnicky, 2000, Boyce and Gorin, 1996, Pincus et al., 2013]. More directly related to the current work are agents that re-use observed user content for future interaction with different users.

Cleverbot

The real-world rule-based chat-bot Cleverbot [Carpenter, 2018] has been available to interact with users on the world-wide-web since 1997. Cleverbot sources all of its content from previous interactions (keeping track of which prompts triggered which response). The bot matches the last user utterance to all instances of that utterance in its database and considers which instance is most appropriate given past words and phrases that have also come up earlier in the current conversation [Wolchover, 2011]. Some evaluations of Cleverbot have been conducted (namely in the context of a modified Turing Test [Gehl, 2014, Torrey et al., 2016]) which provide some evidence that this method can confuse human judges into believing the bot's responses appear more human-like (than a human) in limited contexts some of the time.

Tay

Microsoft's chatbot Tay was created to mimic a 19-year-old female on certain social media/chat services and had a "repeat after me" capability that could be used to expand Tay's available content. This capability was later exploited by malicious users to have the chat-bot output offensive messages, leading to Tay being taken offline [Bright, 2016]. Tay serves as a good case-study of what can go wrong for a dialogue agent that re-uses content from previous agent interaction.

TickTock

[Yu et al., 2016c] describes a comparative experiment that evaluated two versions of a chatbot, TickTock.
One version of TickTock sourced its responses from a corpus solely composed of transcripts from news interviews. The other version of TickTock sourced its responses from an augmented corpus that contained the interview transcripts, responses elicited in previous text-based interactions between TickTock and crowd-workers that had been crowd-filtered for appropriateness, and crowd-corrected responses from previous text-based interactions between TickTock and crowd-workers that did not pass the appropriateness crowd-filter. The results showed that the version of TickTock that relied on the augmented corpus was rated significantly higher in terms of response-level appropriateness and user engagement.

However, in contrast to the evaluation done here, that work provided little evidence that the version of TickTock that relied solely on news interview transcripts was a strong baseline. Also, in contrast to the recycled clues used in this evaluation, the recycled utterances that augmented TickTock's corpus were recorded during text-based interaction as opposed to speech-based interaction. Therefore, TickTock's recycled utterances do not necessarily reflect the quality of utterances that can be collected automatically when a dialogue agent employs automatic speech recognition (ASR), where ASR error can introduce noise.

PIP

Finally, [Leite et al., 2016a] describes a framework that can generate a "persistent interactive personality" that stores content in a dialogue graph that is expanded via crowd-sourced content authored by crowd-workers as well as previous agent interaction content. The crowd-workers are shown a summary of the context of the expected agent interaction and/or actual lines of conversational history and asked to author appropriate conversation continuations. The framework was instantiated in an agent intended to engage a user in social chat, and the evaluation experiment provided evidence that a content sourcing method that combines crowd-sourced content and re-use of previous interaction content can be a viable approach to content sourcing when an appropriate pre-existing large corpus is not available to dialogue agent designers.

While the prior agents described here sourced observed content, all of them that did so for future interaction engaged in symmetric interaction, social chat, where interlocutors have the same competencies/obligations & rights to act. In contrast, the test-bed agent evaluated in this work is engaged in an asymmetric interaction. Further, it does not appear that the agents that have sourced content from prior interactions were ever used in a comparative content evaluation experiment with "real" users and a strong baseline condition, as in the comparative content sourcing evaluation described in this work (see Chapter 8).

Additionally, only the PIP agent has been evaluated in a setting where the agent employed a full stack of dialogue technologies including automatic speech recognition and text-to-speech [Leite et al., 2016a]. Moreover, that evaluation involved compensated users who were told to interact multiple times with the PIP agent, which is in contrast to the uncompensated users given no prior instructions on how to interact with the test-bed agent here. This could make the PIP evaluation results less representative of the results one should expect from users interacting with an agent deployed "in the wild" than the results presented in this work for the comparative sourcing evaluation.
Chapter 3

Activity Analysis

"Life is more fun if you play games."
Roald Dahl

In this chapter we provide a formal activity analysis of the word-guessing game domain. We begin in Section 3.1 by introducing a corpus of audio and video recordings of pairs of human players engaged in the word-guessing game domain that we used to investigate the domain. Section 3.2 begins to formalize the game structure by specifying game participant roles and their associated actions and highlights different behaviors and phenomena seen in gameplay that raised important automated agent player design considerations. Section 3.3 presents an annotation scheme that defines a formal taxonomy of verbal moves for both the clue-giver and guessing roles of the game. Section 3.4 describes a small inter-annotator agreement evaluation we conducted to demonstrate the efficacy of the annotation scheme. Section 3.5 presents the results of applying the annotation scheme to the human-human corpus, providing statistics on the relative frequency of the various verbal moves performed by players in both roles of the game. Section 3.6 defines and applies metrics for human game-play performance in both roles, establishing human-human baseline performances for the test-bed domain. Section 3.7 discusses implications of issues raised in the activity analysis for the test-bed agent. Finally, Section 3.8 summarizes the material presented in this chapter.

3.1 Corpus

We begin our analysis of the test-bed domain by examining a corpus comprised of audio and video recordings of pairs of human players playing a word-guessing game. The Rapid Dialogue Game Corpus [Paetzel et al., 2014] contains around 11 hours of audio and video recordings of 32 pairs of human players playing multiple timed rounds of a word-guessing game called the RDG-Phrase game. Each pair played 6 timed rounds of the game and alternated between performing the clue-giving and guessing roles of the game each round. Each round was 70 seconds long and players had the opportunity to give clues/guesses for up to 10 target-words. The first 2 rounds were considered training rounds.

Figure 3.1 contains a screenshot of one of the videos in the Rapid Dialogue Game Corpus showing a team playing the RDG-Phrase game. In the top left corner of the figure you see a human clue-giver. In the top right corner of the figure you see a screenshot of a monitor the clue-giver is currently viewing. On the bottom of the clue-giver's screenshot you see the current target-word "alley" and on the top of that screenshot you see the current round score and the time left in the current round. In the bottom left corner of the figure you see a human guesser sitting across the table from the clue-giver and in the bottom right corner of the figure you see a screenshot of a monitor viewed by the guesser with the same information as the clue-giver's screen except for the current target-word.

Figure 3.1: RDG-Phrase Corpus Screenshot

Each round's target-words were associated with a particular theme. Each set of 10 target-words was constructed by querying the lexical database WordNet [Miller, 1995] with a seed word from a common list of nouns such as "car" or "television". Thus the returned words were all semantically related to each other. A human judge disqualified a target-word if the clue-giver used any form of the target-word in a clue. Either player could choose to skip to the next target-word at any time during game-play.
If the tenth target-word was skipped or guessed correctly, the next un-guessed target-word would be circled back to if time remained.

3.2 Activity Structure

According to [Allwood, 2000] an activity can be described in terms of 4 of its parameters. First, an activity has a "type, purpose, function: procedures". Second, people engaged in an activity typically perform certain roles which have particular "competence/obligations/rights (to act)" (also known as social norms). Third, the activity involves "instruments, machine/media". Fourth, the activity takes place in a "physical environment".

The type of activity investigated in this work is that of a game whose main purpose is to engender enjoyment in the people playing the game (ideally because they are intrinsically motivated to play). This particular game, a word-guessing game, involves two required roles, clue-giver and guesser, and one optional role, a judge. The clue-giver is responsible for coming up with several clues (that have the ability to elicit correct guesses from the guesser) for arbitrary target-words quickly on-the-fly and for selecting them in a manner that steers the guesser to the target-word as quickly as possible, which sometimes involves taking prior guesses into account. The guesser is responsible for making correct guesses as quickly as possible based on the clues output by the clue-giver. The judge is responsible for alerting the players when the round time is up, disqualifying a word if the clue-giver uses the target-word (or part of the target-word) in a clue, and helping the players transition when starting a new round.

As far as "instruments, machine/media" required by word-guessing games, at a bare minimum the game requires a set of target-words and a means to display these words to the clue-giver. It also requires a means to display the amount of time left in a round to both players. The game activity in the Rapid Dialogue Corpus made use of computers and monitors for that purpose. When you replace one of the players in the game with an agent there are additional considerations, such as which agent platform to use for the agent role. This choice can be made based on the platform's affordances (something we investigate and experiment with in Section 6.2 and Chapter 4).

Finally, since we are interested in the game as a means to elicit enjoyment from intrinsically motivated players, the ideal physical environment has two qualities. First, the environment should be as natural as possible. For example, in terms of naturalness, a lab that people come to as a paid participant is less natural than a person's home where they play the game with a friend or on their personal computer because they saw the game advertised on their social media. Second, the environment should not create unnecessary interferences/burdens on activity participants' ability
The game activity in the Rapid Dialogue Corpus took place in a lab environment which has the benefit of reducing outside in- terferences but is a less natural setting to play a game. Several of our experiments (see Chapters 6 and 9) were done in a lab setting. However, the comparative content sourcing experiment dis- cussed in Chapter 8 was conducted in a more natural environment (which might or might not have had more interferences). A sample dialogue from the corpus can be seen in Figure 3.2. Lines 2,3,4,5,7,8, and 9 all show the clue-giver and guesser fulfilling the responsibilities of their roles by outputting clues and guesses for the associated target-words. However, examination of the dialogue also shows both players engaged in verbal moves not strictly required by their role but clearly important to 50 (Current Target-Word: haircare) 1: Giver: um its what a salon person does. 2: Guesser: serves service, okay 3: Giver: skip, um skip (Current Target-Word: babysit) 4: Giver: um you would hire someone to do this to look after your children 5: Guesser: babysit (Current Target-Word: petsit) 6: Giver: um you would hire someone to do this look after your dog 7: Guesser: dogsit 8: Giver: or your cat or your snake or your lizard 9: Guesser: petsit 10: Giver: okay Figure 3.3: Sample Human-Human Dialogue 2 creating a positive interaction for both participants. For example, feedback is given by the clue- giver in lines 1 and lines 6 that seem to be attempts at encouraging the guesser. Also lines 6, 8 (first part “okay”), and 10 show the clue giver outputting confirmation moves that inform the guesser he made a correct guess and that the next clue will be for a new target-word. Another sample dialogue can be seen in Figure 3.3. This dialogue brings to attention two additional important features of game-play. First, in this dialogue we see that the clue-giver opts to skip the target-word haircare (line 3), a move that alerts the other player the current target-word will be updated and the next clue will be for that word. The clue-giver likely skips because he deemed the previous guess in line 2 to be too far off from haircare or he had trouble thinking of new clues for that word. We observed the use of skips by players was an important move that helped forward momentum in the game and helped avoid player frustration resulting from becoming stuck on a single target-word. 51 Second, this dialogue brings to attention that although sometimes a single clue in isolation is good enough to elicit a correct-guess (as in lines 4-5 in Figure 3.3) other times more than one clue (a clue sequence) is used to elicit a correct guess (as in lines 6-9 in Figure 3.3 or the example in Table ?? in Section 3.6). The clues in a clue sequence generally build off of each other and take into account previous guesses. The dialogues in Figures 3.2 and 3.3 demonstrate that clue-givers use different clue types and clue givers use different linguistic packaging for clues. An example of different clue types can be seen by comparing the clues in lines 2 and 8 in 3.2. The clue in line 2 for the target-word haircare describes a place where the action represented by the target-word is generally carried out. The clue in line 8 for the target-word taxi is more of a definition. An example of different types of clue packaging can be seen by comparing lines 4 and 8 in Figure 3.2 and lines 1,4,6 in Figure 3.3 with lines 2 and 8 in 3.3. 
The former set of clues is given in complete sentences while the latter two clues are packaged in a more fragmentary manner.

In order to develop agent dialogue management and content generation policies that were reflective of these observed human interaction patterns, we constructed a formal annotation scheme to define the verbal moves made by human players performing both roles of the game. We describe this scheme in the next section.

Further observations of human interaction in the RDG-Phrase game showed that there was frequent speech overlap between players. We examined recordings of 4 different pairs of players who played 24 rounds of the RDG-Phrase game. We calculated that 1 minute of speech out of a total of 27 minutes was overlapping (3.7%). While at first glance this seems like a relatively small amount of speech, subjective analysis indicated that the overlapping speech came at critical points that helped forward progress in the game. For example, a correct guess made by the guesser would cause the clue-giver to interrupt themselves if they were currently giving a clue and output a confirmation move, saving time. Other examples include when the clue-giver gave a new clue while the guesser was making guesses. When this occurred the guesser would frequently interrupt their current guessing associated with the previous clue and make guesses associated with the new clue. A final example is when either player made a skip move. This would cause the player who didn't make the skip move to interrupt whatever move they were currently making.

Our analysis also made clear that non-verbal behaviors were frequently employed by the clue-giver (and sometimes the guesser). One other observation we noted from our analysis was the importance of easily comprehensible speech. It was clear that in order to perform well at the game players had to easily understand each other's speech. The time-constrained nature of this activity makes this aspect even more important compared to a more typical non-time-constrained interaction.

3.3 Annotation Scheme

As mentioned, in order to help the test-bed agent's dialogue management and content generation policies reflect observed human interaction patterns in the RDG-Phrase data, we constructed a formal verbal annotation scheme for clue-giver and guesser dialogue moves [Pincus and Traum, 2014]. The taxonomy seeks to capture strategies and typical behavior of both givers and guessers and serves as a helpful tool for formally specifying important moves a test-bed agent in this domain should be able to perform. The taxonomy defines categories for different types of clues and different types of guesses, as well as more generic actions including feedback. It also defines several attributes that these actions can possess. Several of the non-clue/non-guess verbal moves are not domain dependent, and we relate them to core dimensions defined in an ISO standard domain-independent annotation scheme for dialogue [Bunt et al., 2010].

The annotation scheme divides actions that occur during word-guessing gameplay into three categories: two according to role, clue-giver or clue-receiver, and one category of moves shared by both roles. Both giver and receiver actions come in verbal and non-verbal form, but as mentioned we focus on verbal moves.

Clue-Giver Specific Verbal Moves

The clue-giver's primary responsibility is to give clues, utterances intending to elicit a target-word from a guesser.
Observations of human players giving clues illustrated that clue-giving can be decomposed into two operations: selecting a clue-type and packaging the clue-type. Selecting the clue-type involves identifying the type of clue to be given. A sample of the common types of clues defined in this scheme can be seen in Table 3.1. This table defines clue-types that composed at least 6% of the clues seen in the sample data from the corpus we analyzed (see Table 3.5 in Section 3.5.1). Packaging the clue-type refers to the process of wrapping the content words of a clue-type with structural/functional words to form a full clue. For example, with a target-word of good, an antonym clue's content word(s) might be evil, and a Complete-type packaging of the clue could be "This is a word that means the opposite of evil." The different types of packaging along with their definitions can be found in Table 3.2.

The clue-giver also made non-clue verbal moves. The giver can state a Confirmation in order to convey to the receiver that he has made a correct or partially correct guess. Confirmation can be viewed as lying in the auto-feedback dimension defined in [Bunt et al., 2010].

Table 3.1: Clue Types & Examples
Clue Type | Specification | Example Instance (Target) | Packaging Type
Action Description | Describes an action (generally the target-word) in a way not resembling a dictionary definition. | "sometimes if you're scared you might want to do this to your eyes." (cover) | Complete
Description Definition | Describes or defines the target-word in a manner similar to a dictionary definition. | "The pathway behind a building is called a" (alley) | Leading Question
Cite Past | Refers to words said in previous turns. | "You mentioned it before" (today) | Complete
Partial Phrase | Uses words that commonly co-occur with the target-word. | "Abraham Lincoln lives in a log" (cabin) | Fill-In-Blank
Disabuse | Indicates the previous guess was "semantically" far from the target-word. | <after guess of prius> "nope" (electric) | Fragment
Hyponym | Uses target-word hyponyms. | "cadillac" (gas guzzler) | Fragment
Antonym Contrast | Uses target-word antonym(s) or contrasting word(s). | "Not audio" (video) | Fragment

Table 3.2: Clue-Packaging Types
Type | Specification
Complete | Complete sentence that could not be classified as one of the other packaging types.
Fragment | Incomplete sentence, or complete sentence where no structural/functional words wrap the main content words.
Fill-In-Blank | Sentence with a missing word that is intended to be the target-word.
Leading-Question | Sentence expressed in the form of a question whose answer is intended to be the target-word.

Guesser Specific Verbal Moves

The guesser's primary responsibility is to give guesses, utterances intending to say the current target-word. Guesses can be broken down into 4 types: Correct (guess contains the target-word), Partial Correct (contains the target-word within a larger word or phrase), Abbreviated (contains an abbreviated form of the target-word or part of the target-word), and Incorrect (all other guesses).

The guesser also makes non-guess moves. The guesser Rejects by communicating his lack of knowledge of the target based on current information, or RequestRepeats by asking the giver to repeat his last clue. In terms of the ISO standard, RequestRepeat can be viewed as lying in the auto-feedback dimension defined in [Bunt et al., 2010].

Common Giver and Guesser Verbal Moves

Giver and guesser non-clue and non-guess actions have several categories in common.
Both players can state an Acknowledgement, indicating understanding of what the other player has said, or a Clarification, indicating that the player requires additional information about what was just said. Alternatively, either player can state a Delay, a filler utterance said while a player is thinking about his next action. In terms of the ISO standard, Acknowledgement and Clarification lie in the auto-feedback dimension, and Delay has the communicative function stalling in the time-management dimension. In addition, either player can utter an Encouragement in an attempt to boost the other player's morale or request to Skip to the next target. Either player can also Evaluate their performance by expressing thoughts on current game-play or emit Laughter. Evaluate, Skip, and Encouragement lie in the task core dimension defined in [Bunt et al., 2010]. Note that we only consider Laughter and Delay tags if none of the other tags seem appropriate.

3.4 Annotation Scheme Method and Evaluation

In [Pincus and Traum, 2014] we describe how we performed a small inter-annotator agreement study to evaluate the efficacy of a slightly earlier version of the annotation scheme, using the multi-modal annotation tool Anvil [Kipp, 2012] to annotate RDG-Phrase videos. Speech in the transcriptions of the RDG-Phrase videos was segmented if it was separated by 300 milliseconds of silence or more. We automatically converted these segmented utterances into utterance block elements in Anvil. Each speaker's utterance blocks are assigned their own "track" in Anvil. Each utterance block is labeled with its type in a corresponding block in either the giver track or the guesser track, and appropriate attributes are selected.

The inter-annotator agreement study was based on annotations completed by 2 annotators for 4 sequential seventy-second RDG-Phrase rounds played by one pair (team). This includes 90 giver and 57 guesser utterances. Table 3.3 contains Cohen's Kappa statistics [Cohen, 1960] and absolute agreement statistics for each of the major verbal categories in the scheme and demonstrates that this scheme has reasonable inter-annotator agreement.

Table 3.3: Inter-Annotator Agreement Statistics
Category | Cohen's Kappa | Absolute
Clue/Non-Clue | 76.2% | 88.9%
Guess/Non-Guess | 75.6% | 89.5%
Clue-Type | 59.0% | 64.4%
Guess Type | 75.0% | 80.7%
Packaging Type | 53.0% | 64.7%

The tags causing the most disagreement for utterances both annotators label as a clue are Description Definition and Associated Action. This type of disagreement accounts for 3 of the 10 clue-type disagreements (30%). One example of this disagreement occurs with the giver utterance "yeah and then this one is on the ocean", where the target had been "beach house" and the guesser had just correctly guessed country house. This clue seems to fit in both categories, as it describes the target like a Description Definition, but in some sense it also answers the question "what is it used for?" like an Associated Action. The most common disagreement for the clue packaging method attribute occurs when one annotator feels the packaging method is not clear and therefore chooses the None value. This scenario accounts for 7 of the 18 tags that did not match, close to 40%. None of the other packaging method disagreements account for more than 3 of the delivery method tags that do not match.
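The agreement figures in Table 3.3 pair absolute (percentage) agreement with Cohen's Kappa, which corrects for chance agreement given each annotator's label distribution. The following is a minimal sketch of that calculation; the two label lists are hypothetical examples, not data from the corpus, and this is not the script used in the study.

from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same set of utterances."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

if __name__ == "__main__":
    a = ["clue", "clue", "non-clue", "clue", "non-clue"]
    b = ["clue", "non-clue", "non-clue", "clue", "non-clue"]
    print(f"absolute agreement: {sum(x == y for x, y in zip(a, b)) / len(a):.1%}")
    print(f"Cohen's kappa: {cohens_kappa(a, b):.3f}")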
3.5 Annotation Results

In order to help develop a picture of the relative frequency of the various verbal dialogue moves output by both clue-givers and receivers, we annotated 3 sets of 8 70-second RDG-Phrase rounds played by three different pairs of people. The speech was segmented into 762 utterances. 439 (58%) of the total utterances were said by the giver while 323 (42%) were said by the guesser. See Table 3.4 for the clue/non-clue and guess/non-guess move breakdown for the clue-giver and guesser roles, respectively. The next two sections break down each of these categories further into the types defined in the beginning of this section.

Table 3.4: Giver and Guesser Move Frequencies
Role | Move | # of Utterances
Clue Giver | Clues | 277 (63%)
Clue Giver | Non-Clues | 162 (37%)
Guesser | Guesses | 224 (69%)
Guesser | Non-Guesses | 99 (31%)

3.5.1 Clue & Guess Type Breakdown

In this sample data human clue-givers give on average 4.1 clues per target-word before a correct guess or a skip. Table 3.5 shows the relative frequency of clue types that we found in the annotated rounds. The two most common clue types are Associated Action clues (28%) and Description Definition clues (16%). One possibility is that this indicates that the definition of Associated Action captures important properties of the most common conceptual model for a noun or noun-phrase (all targets fall into one of these two syntactic categories). These statistics also imply that givers find word-relations (a category most of the other clue-types fall under) either more difficult to construct or consider them a less effective way of eliciting the target.

Table 3.5: Clue Type Relative Frequencies
Clue Type | % of Clues
Associated Action | 28.3%
Description Definition | 15.8%
Cite Past | 9.7%
Partial Phrase | 7.7%
Disabuse | 7.3%
Hyponym | 6.9%
Antonym Contrast | 6.9%
Shared Hypernym | 4.1%
Synonym | 2.4%
Hypernym | 2.4%
General Context | 2.4%
Narrow | 2.4%
Widen | 1.2%
Analogy | 0.04%

We calculate that 24% of the total guesses are correct and 55% contain at least part of the target or an abbreviated version of part of the target. More specifically, the breakdown of guesses is as follows: Correct (23.7%), Partial Correct (2.7%), Abbreviated (29%), Incorrect (44.6%). In this sample data players made 4 or fewer incorrect guesses for a given clue before a new clue was said by the clue-giver, either player skipped, or the guesser made a correct guess. This observation helped motivate our initial setting for the test-bed agent parameter that dictated how many guesses the agent would make when performing the guessing role for each human clue it heard (see Section 4.5), as well as our choice of an evaluation metric for agent guessing methods (see Section 7.1.3).

3.5.2 Clue Packaging

Table 3.6 shows clue packaging statistics. The Fragment (39%) and Complete (28%) packaging methods were the most common for clues. This indicates that human givers find non-complete sentences the most efficient manner to package a clue and likely consider structuring a grammatically correct sentence a task that does not contribute a significant amount of value. This also implies human givers use Fill-In-Blank and Leading Question packaging methods less often, possibly due to the time needed to construct clues in these forms.

Table 3.6: Clue Packaging Relative Frequencies
Delivery Method | % of Clues
Fragment | 39%
Complete | 28%
Fill-in-the-Blank | 17%
None | 9%
Leading Question | 7%
The None packaging method refers to clues whose packaging does not cleanly fall into one of the categories defined in Table 3.2.

3.5.3 Non-Clues & Non-Guesses

We tag 133 utterances (17% of all utterances, 51% of non-clue/non-guess verbal utterances) of either the giver or the receiver as Delay. 74 of these delays were said by the giver and 59 by the guesser. One third of all non-clues said by the giver were Confirmations. 18% of the guesser's non-guesses were Acknowledgements. The other non-clue categories and the other non-guess categories each comprised a small relative percentage of all non-clue and non-guess utterances, all together totaling 21% and 22% respectively.

3.6 Performance Metrics

Human-Human Sample Clue Sequence Synergy Dialogue (Target-Word: hour)
Player | Utterance
Giver | "now"
Receiver | "time now"
Giver | "not minutes"
Receiver | "seconds"
Giver | "not seconds"
Receiver | "hours"
Giver | "okay"

In order to help us contextualize the test-bed agent's ability to perform both roles of the game, it is important to calculate human-human measures of performance for both the guessing and clue-giving roles. A relatively simple statistic to approximate average guess quality is to take the number of correct guesses and divide it by the total number of guesses. We approximate average human guessing performance in the timed game with two metrics. The first metric, average correct guesses per target, takes the number of correct guesses and divides it by the total number of target-words. The second metric, average correct guesses per clues heard, takes the number of correct guesses and divides it by the total number of clues.

Choosing human performance measures for clue-giving is less straightforward. Evaluating a given clue's quality involves considering the quality of the clue in isolation as well as its quality given its presence in a clue sequence and any previous guesses that might have been output. Assigning credit for a clue sequence that leads to a correct guess can be considered a case of the classical credit assignment problem [Minsky, 1961] in Computer Science: how should one assign credit to individual clues in a clue sequence that leads to a correct guess? Here we introduce metrics for measuring human-human and human-agent clue-giving ability that provide a means for assigning effectiveness scores to clues in a clue sequence, thereby giving us a means to estimate human clue-giving performance. More specifically, we constructed an upper-bound measure, a lower-bound measure, and an expected guessability measure that assign different weights to a clue that appears in a successful sequence based on sequence length and sequence position. These metrics can also be viewed as an initial way to assign credit to clues in a clue sequence leading to a correct guess. We introduce some general notation in Equation 3.1 before discussing these metrics.

N = total # of clues (given or in corpus)
c = total # of single clues leading to a correct guess     (3.1)

An upper-bound score for a clue's effectiveness can be arrived at if each clue in a clue sequence leading to a correct or partially correct guess is considered effective and given a score of 1. Implicit in this upper bound is that each clue could elicit a correct guess from a receiver on its own. If a target-word is skipped or time runs out before a correct guess is made, each clue in that sequence is considered ineffective and receives a score of 0.
Using these optimistic assumptions, the chance of a clue-giver generating an effective clue can be calculated by Equation (3.2). (We note that our analysis of the human-human data shows this to be a very generous, unrealistic assumption: we see many examples of clue sequences where clues build off of one another, creating synergistic clue sequence effects; see the sample clue sequence synergy dialogue in Section 3.6.)

(# of clues that are part of a correct clue sequence) / N     (3.2)

A lower-bound score for each clue's effectiveness can be computed by giving only clues that elicited correct guesses without being preceded by additional clues a score of 1 and all clues that were in a sequence of more than one clue a score of 0. This assumes that a clue sequence's effectiveness can be totally attributed to synergies from the combination of clues in the sequence rather than to any single clue's effectiveness (unless, of course, the clue sequence is of length one). These pessimistic assumptions provide yet another way, shown in Equation (3.3), to compute the likelihood that a clue-giver's next clue is effective.

c / N     (3.3)

As a compromise between these extremes, we define an expected guessability score for each clue in a sequence leading to a correct or partially correct guess, where partial credit (between the above extremes of 0 and 1) is given for each clue in the sequence. For simplicity, for sequences larger than 1, we define the expected guessability of each clue to be 1/(t + 1), where t represents the total number of clues in the sequence. For single clues, we assign a value of 1, as in both of the above measures. The intuition behind this measure is that it distributes the credit equally between each clue and the synergistic combination of clues. If a target-word is skipped or time runs out before a correct guess is made, each clue in that sequence is considered ineffective and receives a score of 0. Taking the weighted average of all clues' effectiveness scores then provides an alternative measure of how likely a clue-giver is to generate an effective clue; this calculation can be found in Equation (3.4).

( Σ_{m ∈ S} length(m) / (length(m) + 1) + c ) / N,
where S = {all clue sequences leading to a correct guess : length(sequence) > 1}     (3.4)

3.6.1 Baseline Human-Human Game Performance

We calculated point estimates for the guess quality and guess performance metrics discussed in the last section from the annotated data discussed in Section 3.5. First, we calculated a human baseline for average guess quality of 26.4% (summing partial correct and correct guesses). A little over a quarter of those guesses contained the target-word. Second, we calculated human baselines of average guessing performance in the timed game condition. The human baseline for average correct guesses per target-word was 72.2% and the human baseline for average correct guesses per clues heard was 23.2%.

Table 3.7 shows the results of applying the clue effectiveness metrics from the last section to the same 8 rounds of human-human annotation data discussed in Section 3.5.1, providing human-human baselines for clue-giving performance.

Table 3.7: Human Clue-Giving Baseline Metric Results
Methodology | % of Effective Clues
Human (Lower-Bound) | 20.0%
Human (Expected-Guessability) | 31.8%
Human (Upper-Bound) | 79.1%
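The three clue-effectiveness measures can be computed directly from per-target clue-sequence counts. The sketch below is our own illustration of Equations (3.2)-(3.4), assuming each target-word attempt is summarized as a pair (number of clues given, whether a correct or partially correct guess was reached); it is not the code used to produce Table 3.7.

def clue_effectiveness(sequences):
    """sequences: list of (num_clues, ended_correct) pairs, one per target-word attempt."""
    N = sum(n for n, _ in sequences)                      # total clues given
    c = sum(1 for n, ok in sequences if ok and n == 1)    # single clues that elicited a correct guess
    in_correct = sum(n for n, ok in sequences if ok)      # clues belonging to any successful sequence

    upper = in_correct / N                                 # Eq. (3.2): every clue in a successful sequence counts
    lower = c / N                                          # Eq. (3.3): only stand-alone successful clues count
    # Eq. (3.4): each successful sequence of length t > 1 contributes t/(t+1); single successes contribute 1.
    expected = (sum(n / (n + 1) for n, ok in sequences if ok and n > 1) + c) / N
    return upper, lower, expected

if __name__ == "__main__":
    # e.g. one 3-clue successful sequence, one single-clue success, one skipped 2-clue sequence
    print(clue_effectiveness([(3, True), (1, True), (2, False)]))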
3.7 Implications for Test-Bed Agent

Here we discuss implications arising from our activity analysis for the test-bed agent.

Clue Sequencing. As noted, human clue-givers sometimes were able to elicit a correct guess with a single clue. Other times clue-givers employed clue sequences that took into account previous clues and guesses to elicit correct guesses. The architecture described in Chapter 4 endows the agent with the more interactive ability to output clue sequences for a given target-word, as opposed to a less interactive version of the game that only supports a response to a single guess per target. However, there is an empirical question as to whether or not clue sequences need to take into account prior clues and guesses in order to be effective. The results from our evaluations described in Sections 6.2 and 6.3 and Chapter 9 indicate that clue sequences can be effective without taking into account prior clues and guesses.

Skipping. Human players frequently chose to skip target-words and move on in the game. This was likely due to the player's belief that they would not be able to score a point for that target-word. The architecture described in Chapter 4 endows the agent with the ability to recognize user skip moves as well as (if appropriate) the ability to perform its own skip moves according to a skip policy. The results from the evaluations described in Chapter 9 indicate that endowing the test-bed agent with the ability to recognize skips in both roles, while not implementing an agent skipping policy for most implementations, was sufficient to create a test-bed agent that people enjoyed interacting with.

Feedback. The activity analysis suggested feedback was an important move performed by human players, possibly increasing subjective perceptions of the interaction and/or positive objective behaviors like correct guesses. The architecture described in Chapter 4 endows the agent with the ability to provide different types of feedback at different frequencies as desired. In Chapter 6 we use the test-bed agent to evaluate the impact of using different types of feedback in the clue-giving role in the context of two different game framings. The results from this study support our ultimate choice of feedback and game framing for our final multi-role evaluations described in Chapter 9.

Embodiment/Non-Verbal Behavior. Human players frequently performed non-verbal moves which co-occurred with their verbal moves. This suggested it would be important to endow the test-bed agent with embodiment and non-verbal behavior generation. In Chapter 4 we discuss how we specialized the test-bed agent's architecture for different platforms that gave the test-bed agent different embodiments. We did not go as far as to make custom gestures for the test-bed agent, as there were challenges to making custom non-verbal gestures similar to those seen in the human-human corpus. For example, many of the non-verbal behaviors were multi-functional [Duncan, 2008], and so gesture disambiguation was difficult when using common gesture annotation schemes such as [McNeill, 1992]. Further, as pointed out by Duncan, gesture segmentation was an issue, with gestures commonly repeating and blending into each other.

Without a formal annotation scheme for non-verbal behavior that creates a clear mapping between clue-types and specific gesture types, we felt custom gestures might end up being disconcerting to users relative to gestures output by an off-the-shelf non-verbal behavior generator. An off-the-shelf non-verbal behavior generator enables the agent to perform simpler gestures that more easily map to less domain-specific types of dialogue moves (that still occurred in the game).
For example, gestures intended to provide positive or negative feedback in the game, such as head nodding/shaking, frequently occurred, and the non-verbal behavior generators we ended up using for the test-bed agent (see Chapter 4) provided the agent with the facility for this behavior. We discuss a study in Chapter 6 that demonstrates that these capabilities/behaviors are also important for the test-bed agent. The study shows that when the agent, performing the clue-giving role, is endowed with virtual human embodiment, off-the-shelf non-verbal behavior generation, and incremental language processing capabilities, significantly more positive behaviors and perceptions are elicited from human players. Moreover, our final evaluation studies described in Chapter 9 provide evidence that our decisions around these design aspects were good enough that the test-bed agent is able to elicit positive user behaviors and perceptions.

Incremental Language Processing. As mentioned, players frequently interrupted each other at important points in game-play. This led us to test the impact of endowing the agent with incremental language processing capabilities. In Chapter 4 we discuss how we endowed the agent with incremental language processing capabilities so that it could emulate this sophisticated human turn-taking behavior. In Section 6.2 we discuss an online evaluation of the test-bed agent that demonstrates the positive impacts of endowing the agent with this capability.

Agent Speech. As noted, easily comprehensible speech was an important feature of successful game-play. Given the amount of content required (especially in the clue-giving role, which requires a large number of clues for arbitrary target-words), it was not feasible to expect to use human recordings for the test-bed agent's speech. We felt it was important to empirically validate that the test-bed agent could make use of a synthetic voice without negatively impacting the interaction to such an extent that it would thwart our ability to investigate our main research questions around the impact of endowing the agent with multi-role capabilities. We discuss an off-line study in Chapter 6 that provides evidence that the interaction isn't significantly negatively impacted by the use of a high-quality synthetic voice.

3.8 Summary

In this chapter we presented a formal activity analysis of the word-guessing game test-bed domain. We investigated a human-human corpus of audio and video recordings of human players engaged in a word-guessing game and discussed observations that raised important design considerations for an automated agent player around embodiment, content generation/sourcing method, synthetic voice selection, and incremental language processing. We defined, evaluated, and applied an annotation scheme that defines a formal taxonomy of verbal moves for both the clue-giver and guessing roles of the game. We also defined and applied metrics for human game-play performance in both roles, establishing human-human baseline performances for the test-bed domain that we can use to help contextualize agent performance in this domain. Finally, we discussed the implications of the activity analysis for designing a test-bed agent.

Chapter 4
Architecture for Enabling Interactivity

"True interactivity is not about clicking on icons or downloading files, it's about encouraging communication."
Edwin Schlossberg

In this chapter we present an architecture for building a fully automated test-bed agent that can perform both roles of a word-guessing game. The architecture enables agent interactivity by leveraging a full stack of dialogue technologies including automatic speech recognition (ASR), text-to-speech (TTS), natural language understanding (NLU), content/natural language generation (NLG), continuous dialogue management, and non-verbal behavior generation. The architecture is complex enough to enable an agent to emulate important interactive behaviors observed in the activity analysis (e.g., incremental language processing) but flexible enough to support switching solutions for many of the leveraged technologies (e.g., automatic speech recognizers or text-to-speech services).

This architecture provides a relatively ecologically valid test-bed for different types of experiments compared to test-bed agent architectures more common in the literature that don't employ a full stack of dialogue technologies. More commonly, experiments involve agents that either aren't fully automated (i.e., certain agent capabilities are "wizarded": the agent appears autonomous to a human user, but those capabilities are actually puppeted by a human operator) or are not fully interactive (e.g., users evaluate recorded interactions post-hoc, or interaction is conducted via text rather than speech). An example of the type of experiment enabled by this full-stack architecture that would likely have been considered less novel if conducted in a less interactive manner is the game-framing/regulatory-fit experiment described in Chapter 6.

The architecture was specialized for 3 different platforms, affording the agent 3 different embodiments: virtual human, robot, and non-embodied web. The rest of this chapter is organized as follows. The next section introduces the architecture at its most abstract level. Sections 4.2, 4.3, and 4.4 discuss the virtual human, robot, and non-embodied web implementations of the architecture respectively.

4.1 Architecture

Figure 4.1: Agent Architecture

The architecture was designed to enable an agent to perform both the clue-giving and guessing roles of a word-guessing game activity similar to the RDG-Phrase game analyzed in Chapter 3. The architecture also enables an agent platform to perform the role of game judge. This role can be explicitly represented if the platform supports multiple agents (e.g., the Virtual Human Toolkit described in Section 4.2), or the role's responsibilities can be implicitly subsumed by the main game-playing agent (as is the case for the robot implementation described in Section 4.3 and the non-embodied web implementation discussed in Section 4.4). The game is organized into multiple timed rounds with a set of target-words associated with each round. The number of rounds, total round time, and total number of target-words per round are all tunable parameters that can be chosen based on the agent use case.

Figure 4.1 depicts the modular architecture. The architecture is composed of several custom modules (seen in green) specifically built to facilitate the game interaction, and it leverages various agent platforms (seen in yellow in the top right of the figure) that afford the agent the ability to execute multi-modal behavioral output. The architecture extends the model-view-controller (MVC) software design pattern [Krasner et al., 1988] into a pattern we label the model-view-controller-performer-generator pattern.
In the traditional MVC pattern, the model components are responsible for defining and storing the central structures for an application domain, the view components are responsible for displaying aspects of the model to the user, and the controller components are responsible for sending messages to user interface devices and serving as the interface between the view and model components by passing messages between them. This architecture also has performer and generator components. The performer component is responsible for executing actions sent by the controller components, based on the current model states, using various multi-modal output channels. The generator components are responsible for outputting content upon requests from the controller component, which can then use the content in messages to the model, view, or performer components.

The dialogue manager module (seen in the center of the figure) acts as the controller component. The dialogue manager is responsible for managing communication between the other modules of the architecture as well as updating the other modules' states. The dialogue manager sends action commands to the agent platform, which acts as the performer component and executes the action command. We describe specific implementation details of the dialogue manager in more detail in Section 4.1.1.

The agent platform sends callback messages back to the dialogue manager that indicate the current status of a sent action command. The agent platform generally also contains an ASR module, which captures user speech and sends ASR hypothesis messages to the dialogue manager. We describe the various components generally offered by agent platforms in Section 4.1.3.

The dialogue manager contains the logic for mapping the callback messages and ASR hypotheses to agent and game state updates. The dialogue manager sends state update messages to the game and agent models (seen in the middle left of the figure), which act as the model components in the MVC pattern. The game and agent models keep track of the current game state (e.g., current target-word, in-round or out-round) and agent state (e.g., agent speaking or not speaking, past agent actions) respectively. When the dialogue manager needs to know the current game or agent state in order to process a message from the agent platform, it can make a state request to the game or agent model as appropriate, which sends back this information in a state info message.

The dialogue manager is also responsible for sending messages corresponding to important game events (e.g., a correct guess, time left in round) to a set of auxiliary graphical user interfaces (GUIs) (seen in the top-left of the figure), which can be considered the view components of the extended MVC pattern. The set of auxiliary GUIs displays information such as the current score and time left in the round on a monitor seen by the user. We describe these GUIs in more detail in Section 4.1.2.
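As an illustration of how these five responsibilities divide up, the minimal sketch below mirrors the message flow just described (state updates to the model, game events to the views, action commands to the performer, content requests to the generators). The class and method names are our own, not the architecture's actual interfaces.

class GameModel:                          # model: stores the central game/agent state
    def __init__(self):
        self.state = {"round": 1, "score": 0, "in_round": True, "target": "taxi"}
    def update(self, **changes):          # receives state update messages
        self.state.update(changes)

class GameInfoView:                       # view: displays aspects of the model to the user
    def show(self, state):
        print(f"Round {state['round']} | Score {state['score']}")

class AgentPlatform:                      # performer: executes action commands (speech, gesture)
    def perform(self, action):
        print(f"[agent] {action}")

class ClueGenerator:                      # generator: returns content on request
    def clues_for(self, target):
        return ["a car you hire for short trips around town"]  # placeholder content

class DialogueManager:                    # controller: routes messages between all components
    def __init__(self, model, view, performer, generator):
        self.model, self.view = model, view
        self.performer, self.generator = performer, generator
    def on_correct_guess(self, next_target):
        self.model.update(score=self.model.state["score"] + 1, target=next_target)  # model update
        self.view.show(self.model.state)                                            # refresh the view
        clue = self.generator.clues_for(next_target)[0]                             # request content
        self.performer.perform("Correct! Next one: " + clue)                        # action command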
During a game round, as depicted in the two green modules shown on the right side of the figure, the dialogue manager leverages two other modules to perform the two roles of the game, a clue generator and a guess generator, which are the generator components. When the agent is performing the clue-giving role, the dialogue manager requests clues for each round from the clue generator. For the experiments described in this thesis, the clue generator took the form of pre-compiled text files that organize target-words and clues for each round. We investigated several methods for automatically generating clues (see Chapters 5 and 8) to construct these text files. It should be noted that the architecture supports using the clue generator in real time (as was the case in the demo described in [Pincus et al., 2014]). Using the clue generator in real time could potentially be interesting for agent use cases that attempt to tailor the types of clues given to players in real time based on user attributes or behaviors. These attributes or behaviors would be stored in a user model module that would communicate with the dialogue manager in a similar manner to the game and agent model modules. When the agent is performing the guessing role of the game, the dialogue manager queries the guess generator with a clue in real time to produce a ranked list of guesses for the given clue. See Chapter 7 for details on the guess generation methods we investigated.

Finally, the dialogue manager also updates a structured-query-language (SQL; https://www.oracle.com/database/technologies/appdev/sql.html) database (seen in the bottom middle of the figure) with all game and interaction events in order to log each user interaction.

4.1.1 Dialogue Manager

Here we provide some more details on the dialogue manager component of the architecture. The dialogue manager in its most basic form leverages a finite-state model of dialogue to guide the user through a 2-phase interaction. Note that this 2-phase interaction was extended to different 3-phase interactions for both the robot and non-embodied web implementations. The first phase is a greeting/instruction phase made up of 1 or 2 states where the user is introduced to the agent and given game instructions (in the 1-state version) or given the option to hear game instructions (in the 2-state version). The second phase is the game phase and is composed of 2 states, in-round and out-round. The in-round state occurs when a round starts. In this state the agent and user are giving clues and guesses associated with the current target-word. The out-round state is entered when a round has just ended (or before the first round has started). In this state, depending on the implementation, a user can request to start a new round, specify what role they want to play, and/or choose to quit.

The dialogue manager subsumes the natural language understanding task and is responsible for mapping received ASR hypotheses to dialogue moves which correspond to user game actions. The dialogue moves recognized by the dialogue manager can be seen in Table 4.1 and are organized based on when in the activity they can be recognized. The moves recognized in-round have been defined in the annotation scheme presented in Section 3.3. As seen in the table, the first two rows contain the dialogue moves recognized by the agent when it is performing the clue-giving and guessing roles respectively. The third row, containing the skip dialogue move, applies when the agent is performing either of the two roles and is in the in-round state. The fourth row contains the dialogue moves recognized by the agent when in the out-round state.

Table 4.1: Game Dialogue Moves Recognized
State | Dialogue Move
Agent performing clue-giving role | Correct Guess, Incorrect Guess
Agent performing guessing role | Clue
In-Round | Skip
Out-Round | Start, Non-Start, Role-Specifier, Non Role-Specifier, Quit
In the out-round state, since the agent either requests users to specify what role they want to play or, if this is pre-determined, to say "start", the dialogue manager recognizes role-specifiers and non-role-specifiers as well as starts and non-starts. Role-specifiers are utterances that request to play the clue-giving or guessing role, and non-role-specifiers are utterances that do not make such a request. Starts are utterances that request to start a new round, and non-starts are utterances that do not make such a request. Finally, if an evaluation featured optional rounds, the dialogue manager has the functionality to enable the game judge to inform the user that they could exit the game. Therefore, the dialogue manager also recognizes quit dialogue moves, utterances that request to quit the game.

Keyword and phrase matching was used to map utterances to the various dialogue moves. Common phrasings for skips, starts, role-specifiers, and quits, culled from log files of pilot tests of the test-bed agent, were stored and matched to user utterances at run time in order to create a more robust interaction where users did not have to state an exact phrase in order to move the interaction forward.

Natural language generation is hand-authored (except for clue and guess generation, which are described in the next two chapters). The agent's dialogue manager has feedback parameters that dictate the frequency and type of feedback given after various game events such as a correct guess or the end of a round. In general, for the experiments described in this work, the agent leveraged arrays of different surface realizations of positive and negative feedback observed in the activity analysis (see Chapter 3) that are randomly chosen at run time in order to decrease the chance the agent will be perceived as repetitive. The dialogue manager has an out-game silence threshold parameter, s_ng, so that if the user is silent for s_ng seconds between rounds the agent requests the user to start a round or make a role specification.
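The following is a minimal sketch of the kind of keyword- and phrase-matching NLU described above, keyed on the game state as in Table 4.1. The phrase lists are hypothetical stand-ins for the phrasings culled from pilot-test logs, and the function names are our own.

SKIP_PHRASES = ("skip", "pass", "next word")
START_PHRASES = ("start", "let's play", "begin")
QUIT_PHRASES = ("quit", "stop playing", "i'm done")
ROLE_PHRASES = {"clue-giver": ("i'll give clues", "let me give clues"),
                "guesser": ("i'll guess", "let me guess")}

def classify_out_round(utterance):
    """Map an out-round ASR hypothesis to a dialogue move from Table 4.1."""
    u = utterance.lower()
    for role, phrases in ROLE_PHRASES.items():
        if any(p in u for p in phrases):
            return ("role-specifier", role)
    if any(p in u for p in START_PHRASES):
        return ("start", None)
    if any(p in u for p in QUIT_PHRASES):
        return ("quit", None)
    return ("non-start", None)

def classify_in_round(utterance, target, agent_is_clue_giver):
    """Map an in-round ASR hypothesis to skip, correct/incorrect guess, or clue."""
    u = utterance.lower()
    if any(p in u for p in SKIP_PHRASES):
        return "skip"
    if agent_is_clue_giver:
        return "correct guess" if target.lower() in u else "incorrect guess"
    return "clue"   # when the agent guesses, any non-skip user utterance is treated as a clue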
We now provide some specific dialogue manager implementation details that allow the dialogue manager to support communication between the architecture's various modules, which process different types of messages at different rates, as well as to afford the test-bed agent incremental language processing capabilities. Incremental language processing capabilities were observed in the activity analysis to be important interactive capabilities that allowed human players to interrupt themselves at important points in the game (see Chapter 3).

The dialogue manager concurrently manages between 3 and 4 threads. Each thread can enqueue new events (associated with the given thread) into a concurrent queue data structure. The first thread, which can be considered the main thread, is responsible for processing new events as they are enqueued to the queue. The main thread can be in 2 states, active or inactive. In the active state the thread continuously loops and checks the queue for new events, and upon finding a new event it contains the logic to decide how to process that event. Processing the event involves updating the game and agent state by sending state update messages to the game and agent model modules and, depending on the event, sending action commands to the agent platform and/or updating the auxiliary GUIs with current game state information by sending the GUIs game event messages. This thread is made inactive when an action command is sent to the agent platform; it waits to be woken up either by a callback message sent back to the dialogue manager by the agent platform indicating action completion, or by one of the other threads enqueuing a pre-designated "interrupt" event, which activates the main thread to interrupt the agent platform mid-action so that it is available to process an appropriate response to the newly detected event.

The second thread is responsible for sending the game event messages (e.g., messages containing the current score or time left in round) to the set of auxiliary GUIs displayed to the user.

The third/fourth thread is an action thread. For the virtual human and non-embodied web implementations this is one thread that is responsible for receiving partial and final ASR hypotheses of user speech, sending the agent action commands (e.g., speech to output via the text-to-speech service), and receiving agent platform callback messages which indicate the completion status of a particular action command. For the robot implementation the action thread is really 2 separate threads: one that manages ASR processing and one that manages all the agent action commands. The messaging protocols used by each implementation of the architecture (see Sections 4.2, 4.3, and 4.4) differed in certain ways and are described in their associated sections.

As noted, the multi-threaded, concurrent nature of the dialogue manager endows an agent with incremental language processing capabilities. However, it is important to note that the sophistication of these capabilities is dependent on the granularity of the callback messages available on whatever agent platform is selected. For example, the Virtual Human Toolkit [Hartholt et al., 2013] (see Section 4.2) sends callback messages for each phoneme output by the agent when it is speaking, and therefore the current agent state can be tracked more precisely and interrupted when necessary. On the other hand, the DialPort web platform [Lee et al., 2017] (see Section 4.4) does not currently have a callback mechanism, and therefore the agent state cannot be interrupted and the dialogue manager can only send action commands to the DialPort agent platform in response to received user ASR hypotheses.

As mentioned in our activity analysis (see Section 3.1), there were 3 dialogue moves (correct guesses, clues, and skips) that caused human players to interrupt themselves. In order to allow the test-bed agent to emulate this human ability, which was important for game-play, these 3 dialogue moves were selected as user-initiative barge-in dialogue moves. In addition, when out of round, start, role-specifier, and quit dialogue moves are user-initiative barge-in actions, causing the agent to respond immediately (e.g., starting the game round). These dialogue moves all trigger the test-bed agent to halt all verbal and non-verbal behaviors (assuming he is talking) and take the appropriate next action. Note also that the end-of-round signal (a signal triggered when round time is over) is a barge-in for both user and system speech. This signal can be considered a domain-initiative barge-in.

The dialogue manager's ability to perform this incremental language processing can be turned off by flipping a binary parameter, the barge-in parameter, coded into the dialogue manager. When this parameter is turned off, all user speech is flushed when the agent is speaking. This functionality allowed us to test the impact of endowing the agent with incremental language processing capabilities (see Section 6.2).
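A minimal sketch of the concurrent-queue processing and barge-in handling just described follows. The event names, the barge-in set, and the platform interface are illustrative simplifications, and the real dialogue manager's callback handling is considerably more detailed.

import queue
import threading

USER_BARGE_IN = {"correct guess", "clue", "skip", "start", "role-specifier", "quit"}

class MainThread:
    def __init__(self, platform, barge_in_enabled=True):
        self.events = queue.Queue()            # shared concurrent queue; other threads enqueue here
        self.platform = platform               # receives action commands / interrupt requests
        self.barge_in_enabled = barge_in_enabled
        self.agent_acting = threading.Event()  # set while an action command is outstanding

    def enqueue(self, event):                  # called by the ASR, GUI, and callback threads
        self.events.put(event)

    def run(self):
        while True:
            event = self.events.get()          # blocks (the "inactive" state) until a new event arrives
            if event == "action-complete":     # platform callback wakes the main thread
                self.agent_acting.clear()
                continue
            if self.agent_acting.is_set():
                if self.barge_in_enabled and (event in USER_BARGE_IN or event == "end-of-round"):
                    self.platform.interrupt()  # halt speech and non-verbal behavior mid-action
                    self.agent_acting.clear()
                else:
                    continue                   # with barge-in off, user speech is flushed while acting
            self.platform.send_action(event)   # respond to the event; a callback will clear agent_acting
            self.agent_acting.set()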
4.1.2 Auxiliary Graphical User-Interfaces (GUIs)

Here we discuss the set of 4 auxiliary GUIs the architecture can make use of in order to display game event information to the user. The Game-Info GUI (shown in the top right of Figure 4.3) displays game information to the user, including round number, time left, total score, and round score, and is updated in real time as the game progresses. The Target GUI (shown in the bottom right of Figure 4.3) displays the current target-word when the agent is performing the guessing role of the game. This GUI displays the text "You are the Guesser" when the agent is performing the clue-giving role of the game. The ASR Hypothesis GUI (shown in the top-left of Figure 4.3) displays the speech recognition results in real time to the player, allowing the player to plan his next dialogue move accordingly. This GUI is meant to inform the user what the system has understood in order to attempt to mitigate negative effects of ASR errors on the interaction. The Instructions GUI (shown in the bottom-left of Figure 4.3) displays reminders about game instructions, including that the player can skip the current target-word and should repeat themselves in case of speech mis-recognition.

4.1.3 Agent Platforms

The architecture is flexible enough that it can make use of different agent platforms relatively easily. Agent platforms generally offer services for ASR, TTS, embodiment, and non-verbal behavior generation. The architecture also supports swapping out these individual components. For example, the virtual human implementation of the architecture used one ASR for an early evaluation described in Section 6.2, which was swapped out for a better-performing ASR for later experiments described in Section 6.3 and Chapter 9.

4.2 Virtual Human

Figure 4.2: Virtual Human Agent Architecture

The virtual human implementation affords the agent the ability to perform both the clue-giving and guessing roles of the game in the 2-phase game interaction described in the last section. For this implementation we leveraged the Virtual Human Toolkit [Hartholt et al., 2013], commonly used in human-computer interaction research, as the agent platform. This implementation makes full use of the architecture's incremental language processing capabilities and was used to evaluate design questions raised in the activity analysis (e.g., the impact of embodiment/incrementality affordances, as described in Section 6.2). Figure 4.2 shows this implementation of the architecture with the virtual human agent platform instantiating the agent platform module of the general architecture.

Figure 4.3: Test-Bed Agent Screenshot [Game Player (right) and Game Judge (left)]

Figure 4.3 contains a screenshot that displays what a user interacting with the virtual human test-bed agent implementation would see on a monitor. In the center of the screen are the virtual humans. Either virtual human can be assigned the role of game player or game judge by setting a parameter in the dialogue manager.

The Virtual Human Toolkit components include text-to-speech services, automatic speech recognition, virtual human rendering, and a nonverbal behavior generator [Lee and Marsella, 2006], which all communicate using the toolkit's ActiveMQ messaging server.
The server is responsible for managing communication between all system components, which communicate using the Virtual Human (VH) Messaging Protocol, a variant of ActiveMQ messaging (http://activemq.apache.org/). The toolkit's modular architecture allows for the use of different automatic speech recognizers and text-to-speech services. The ASR module is responsible for sending partial and final automatic speech recognition hypotheses to the toolkit's ActiveMQ server, which relays them to the dialogue manager. The choice of which partial or final automatic speech recognition hypothesis the dialogue manager ultimately reacts to is a dialogue manager parameter that can be tuned. Choosing to respond to earlier partial messages decreases system response latency but also increases the risk that the system reacts to a lower-confidence ASR hypothesis, resulting in a system misunderstanding.

For the experiments in this thesis we made use of both the Apple OS X El Capitan dictation (http://www.apple.com/osx/whatsnew/) and Google Chrome (https://www.google.com/intl/en/chrome/demos/speech.html) automatic speech recognizers. We made use of pre-existing wrapper code that converts their native messaging into a format that conforms to VH Messaging. The wrapper code for the Google Chrome ASR web application was adapted for this architecture so that it would continuously listen for user speech as opposed to running on a push-to-talk model. The toolkit's Tts Relay module allows for the use of different text-to-speech services.

The non-verbal behavior of the virtual human implementation of the test-bed agent uses the off-the-shelf non-verbal behavior provided by the Virtual Human Toolkit's nonverbal behavior generator [Lee and Marsella, 2006]. This off-the-shelf behavior produces behaviors such as head nodding for affirmative utterances (e.g., "yes"), head-shaking for "no", pointing to self when self-referencing (e.g., "I"), and pointing to the user when referencing the user (e.g., "you"). When the test-bed agent's non-verbal behavior is interrupted, all animation stops and he immediately returns to his neutral pose.
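The following is a minimal sketch of the trade-off behind that tunable parameter: react early to a partial hypothesis only when it looks reliable enough, and otherwise wait for the final hypothesis. The message fields and thresholds are our own assumptions for illustration, not the toolkit's actual ASR message format.

def should_react(hypothesis, min_confidence=0.75, min_words=2):
    """hypothesis: dict with 'text', 'is_final', and 'confidence' from an ASR hypothesis message."""
    if hypothesis["is_final"]:
        return True                              # final hypotheses are always acted on
    # Reacting to partials lowers response latency but risks acting on a misrecognition.
    return (hypothesis["confidence"] >= min_confidence
            and len(hypothesis["text"].split()) >= min_words)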
4.2.1 Clue-Giving Role Dialogue Management

In this section we discuss the dialogue management policy for the virtual human implementation of the test-bed agent when it is performing the clue-giving role. An enumeration of the dialogue policy for the dialogue manager in the clue-giving role can be found in Table 4.2. The first column indicates which speaking state the test-bed agent is in (i.e., whether or not it is speaking). The second column indicates the game state the test-bed agent is in (i.e., either in-round or out-round). The third column lists possible utterances the user might say in the corresponding states. The last column shows how the test-bed agent would respond given the corresponding system state and user action. As mentioned, when in-round, user utterances are classified as a correct guess, a skip, or an incorrect guess. An interrupt response by the agent (which occurs when the agent is in a speaking state and a user makes a user-initiative barge-in move) causes the agent to stop all speech and non-verbal behavior generation and make an appropriate next move.

Table 4.2: Test-Bed Agent Clue-Giving Role Dialogue Policy
Agent State | Game State | User Move | Agent Response
Agent speaking | In-Round | incorrect guess | If at least a threshold percentage (i) of the current clue's words have been said by the game player: [negative feedback], then give next clue; else: user utterance flushed
Agent speaking | In-Round | correct guess | Interrupts, confirmation, gives next clue
Agent speaking | In-Round | skip | Interrupts, game player says "new target", gives next clue
Agent speaking | Out-Round | start / role-specifier | Interrupts, gives clue for 1st target
Agent speaking | Out-Round | non-start / non role-specifier | User utterance flushed
Agent not speaking | In-Round | incorrect guess | [Negative feedback], then give next clue
Agent not speaking | In-Round | correct guess | Confirmation, then give next clue
Agent not speaking | In-Round | skip | Game player says "new target", gives clue for new target
Agent not speaking | In-Round | <SILENCE> | Once silence reaches the threshold (s_g seconds), give next clue
Agent not speaking | Out-Round | Interaction initiation signal | Judge gives instructions
Agent not speaking | Out-Round | start / role-specifier | Game player gives clue
Agent not speaking | Out-Round | non-start / non role-specifier | Judge asks user to say "start" or to pick a role
Agent not speaking | Out-Round | <SILENCE> | Once silence reaches the threshold (s_ng seconds), Judge asks user to say "start" or to pick a role

The dialogue manager has 2 parameters, shown in Table 4.2, that affect the rate at which the test-bed agent provides new clues to a user. The s_g parameter is an in-game silence threshold: if a user says nothing for s_g seconds after the test-bed agent finishes a clue, the test-bed agent gives the next clue. The i parameter is a give-next-clue-immediately threshold: if a user makes at least one incorrect guess after i percent of the current clue has been said, and no correct guess is made before the end of the current clue, the test-bed agent gives the next clue for the current target-word immediately.

4.2.2 Guessing Role Dialogue Management

In this section we discuss the dialogue management policy for the virtual human implementation of the test-bed agent when it is performing the guessing role. Similar to Table 4.2 for the clue-giving role, an enumeration of the dialogue policy for the dialogue manager when the agent is performing the guessing role can be found in Table 4.3. Again, the first column indicates which speaking state the test-bed agent is in (i.e., whether or not it is speaking), the second column indicates the game state (either in-round or out-round), the third column lists possible utterances the user might say in the corresponding states, and the last column shows how the test-bed agent would respond given the corresponding system state and user action. As mentioned, when in-round, user utterances are classified as a clue or a skip.

Table 4.3: Test-Bed Agent Guessing Role Dialogue Policy
Agent State | Game State | User Move | Agent Response
Agent speaking | In-Round | clue | If the clue does not contain the target-word: interrupts, game player makes g guesses; if a guess is correct, judge interrupts confirming the correct guess. Else (clue contains the target-word): judge interrupts, informs user the target-word is disqualified, and requests the user to give clues for the next target-word
Agent speaking | In-Round | skip | Interrupts, game player says "new target"
Agent speaking | Out-Round | start / role-specifier | Interrupts, game player asks for clue
Agent speaking | Out-Round | non-start / non role-specifier | User utterance flushed
Agent not speaking | In-Round | clue | If the clue does not contain the target-word: game player makes g guesses; if a guess is correct, judge interrupts confirming the correct guess. Else: judge informs user the target-word is disqualified and requests the user to give clues for the next target-word
Agent not speaking | In-Round | skip | Game player says "ok skip, new target"
Agent not speaking | In-Round | <SILENCE> | Once silence reaches the threshold (r_g seconds), game player asks the user for a clue
Agent not speaking | Out-Round | Interaction initiation signal | Judge gives instructions
Agent not speaking | Out-Round | start / role-specifier | Game player asks for clue
Agent not speaking | Out-Round | non-start / non role-specifier | Judge asks user to say "start" or to pick a role
Agent not speaking | Out-Round | <SILENCE> | Once silence reaches the threshold (s_ng seconds), Judge asks user to say "start" or to pick a role
The dialogue manager has 2 parameters relevant to its management of the guessing role. The r_g parameter is an in-game silence threshold: if a user says nothing for r_g seconds after the test-bed agent finishes guessing, the test-bed agent requests a clue from the user. The g parameter indicates how many guesses the agent should make for a particular clue. If a correct guess is made or the user skips while the agent is guessing, the current guessing thread is interrupted. For a given target-word the agent keeps track of the guesses it has made and filters the ranked guess list returned by the guess generator so that it does not make the same guess for a given target-word more than once.
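The following is a minimal sketch of the guessing-role bookkeeping just described: query the guess generator with the clue, filter out guesses already made for the current target-word, and output at most g of them. The class and method names are illustrative, not the implementation's actual code.

class GuesserPolicy:
    def __init__(self, guess_generator, g=3):
        self.generate = guess_generator   # callable: clue text -> ranked list of guesses
        self.g = g                        # maximum number of guesses spoken per clue
        self.already_guessed = set()      # guesses made so far for the current target-word

    def new_target(self):
        """Called on a correct guess, a skip, or disqualification of the target-word."""
        self.already_guessed.clear()

    def guesses_for(self, clue):
        ranked = self.generate(clue)
        fresh = [w for w in ranked if w not in self.already_guessed][: self.g]
        self.already_guessed.update(fresh)
        return fresh                      # the dialogue manager speaks these in order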
The robot’s dialogue manager queues user initiative barge-in moves as opposed to interrupting the agent speech. If the robot is speaking when a user initiative barge- in move is made then robot responds to the user initiative barge-in move once the current robot utterance is complete. If a move other than a user initiative barge-in move is made (e.g. - incorrect guess) while the robot is speaking then the user speech is flushed. The third difference is that the dialogue manager can recognize an additional dialogue move during game play when in the in-round state of the game-phase, repeat, which causes the robot to repeat the clue it just spoke or is currently speaking. This design decision was made because pilot testing with the robot indicated that the Nao robot’s TTS service might be less comprehen- sible on average than the NeoSpeech or Google TTS services used for the virtual human and non-embodied (see Section 4.4). This observation was based on the observation made during robot pilot testing that users seemed to request that clues be repeated frequently. Similar to other dialogue moves, the NLU recognizes the repeat dialogue move by phrase and keyword matching. Again, common phrasings for this move were culled from log files of pilot tests with the robot. 90 The robot’s dialogue manager during the initial social chat phase and when performing the guessing role in the game-phase of the interaction acts as a conduit for a human wizard’s speech to the listening robot that outputs the wizard speech once a final ASR message is received. In the social chat phase, human wizards might ask ice-breaking or “get to know you” questions or respond to similar questions from a user. When performing the guessing role in the game-phase, a human wizard is expected to make guesses based on participants clues and provide feedback as deemed appropriate. 4.4 Non-Embodied Web 4.4.1 DialPort Front-End for Real Users Figure 4.5: Non-Embodied Web Agent Architecture In order to use the test-bed agent in experiments with “real” users “in the wild” we imple- mented a version of this architecture (for the clue-giving role) for the DialPort web-platform [Lee et al., 2017]. Figure 4.5 shows this implementation of the architecture with the DialPort agent platform instantiating the agent platform module of the general architecture. Note that the guess generator is hollowed out as we have not yet hooked up a guess generator to this implementation; 91 Figure 4.6: Dialport Game-Agent Personal Computer Web Interface although this could be done with a relatively small engineering effort. Also note the auxiliary gui module is also hollowed out in this implementation as the information associated with these views are sent from the dialogue manager to the DialPort agent platform for display. Dialogue agents can connect via an API with DialPort which serves as the agents’ front- end interface providing ASR and text-to-speech services as well as recruiting users via social media advertising. The platform currently relies on the Google ASR and TTS services. Users interact with DialPort via speech or text. If a user fails an initial microphone test administered by DialPort’s front-end agent, responsible for navigating users to a desired dialogue agent, the user is directed to interact with DialPort via a text input box. Currently, if the front-end agent receives a user utterance containing the word “game” the user is sent to the game agent. 
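Both the repeat-move recognition described above and this keyword-based routing amount to simple phrase and keyword matching over ASR hypotheses. A minimal sketch of this style of matching is shown below; the phrase lists and function names are illustrative assumptions rather than the actual lists culled from pilot log files.

# Illustrative keyword/phrase matcher for dialogue moves and front-end routing.
# The phrase sets are assumptions; the deployed lists came from pilot-test logs.
REPEAT_PHRASES = {"repeat", "say that again", "one more time", "what was that"}
SKIP_PHRASES = {"skip", "pass", "next word", "new target"}

def classify_move(asr_hypothesis: str) -> str:
    """Map an ASR hypothesis to a coarse dialogue move via keyword matching."""
    text = asr_hypothesis.lower().strip()
    if any(p in text for p in REPEAT_PHRASES):
        return "repeat"
    if any(p in text for p in SKIP_PHRASES):
        return "skip"
    return "other"  # e.g. treated as a guess (clue-giving role) or a clue (guessing role)

def route_user(front_end_utterance: str) -> str:
    """DialPort-style routing: send the user to the game agent if they mention 'game'."""
    return "game_agent" if "game" in front_end_utterance.lower() else "front_end_agent"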
A screen-shot example of the front-end interface displayed to a user via a web browser on their personal computer can be seen in Figure 4.6. The current score, round #, and time-left in round can be seen in the top-left corner and are updated with each new message from the game agent. The text of the message currently being spoken by the game agent appears in the white rectangle at the bottom of the figure. The green oval towards the bottom left of the screen is a 92 Figure 4.7: Facebook Messenger Game-Agent Interface status message indicating if the agent is currently listening or speaking. The white oval above the green oval is a button a user can click at the end of the interaction to provide a rating of their overall interaction out of 5 stars. The longer white oval to the right of the green oval advertises agents available to the user for interaction when talking to the front-end DialPort agent. DialPort also interfaces with the Facebook Messenger PlatForm API so mobile users can also interact with DialPort agents via FaceBook Messenger. Figure 4.7 is an interface screen- shot shown to a user interacting with the game-agent via FaceBook Messenger on their mobile phone. User messages appear in blue and agent messages in grey. The score, round, and timer information is pre-pended to every new message sent by the game-agent. In this example the user requests to skip the current target-word and is then given a clue for a new target. The user then makes a correct guess “hot”, and the game-agent responds with an updated score and a clue for a new target-word “judge”. https://developers.facebook.com/docs/messenger-platform 93 The web-based dialogue manager communicates with DialPort via an HTTP server that con- verts HTTP messages (expected in JSON format) to VH messages. Since DialPort has multiple users in parallel, the dialogue manager launches a new agent instance for each new HTTP session (user) that is directed to the game from the main DialPort system. 4.4.2 Web Clue-Giving Role Dialogue Manager Here we discuss the dialogue management policy for the web implementation of the test-bed agent. As mentioned, this implementation currently only performs the clue-giving role of the game. The web dialogue manager also has similar policies to the virtual human dialogue manager in clue-giving role except for 4 differences. First, like the robot implementation, the game-role of game-judge is implicitly subsumed by the game player. Second, this version of the agent leverages a finite-state model of dialogue to guide a user through a 3-phase interaction (as opposed to 2). The 3-phases for this implementation are the basic greeting/instruction and game phase already discussed but there is also a post-survey phase which collects subjective feedback ratings from users. The post-survey phase is composed of 3 states. Each state corresponds to an evaluation statement. The agent asks for ratings on a 1-5 scale with 1 meaning strongly disagree and 5 meaning strongly agree. After the user played the game, the agent in the 1st state asked for ratings for the statement: “My clues were effective during the game”, the agent in the 2nd state asks for ratings for the statement: “My clues were natural during the game”, and the agent in the 3rd state asks for ratings for the statement: “You enjoyed playing the game”. The dialogue manager repeats a request for a given statement rating if the subsequent user utterance did not contain the #’s 1 to 5. 
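A minimal sketch of this rating-collection check is shown below; the regular expression and the spoken-number map are assumptions rather than the deployed implementation.

import re
from typing import Optional

# Spoken numbers may arrive as words from the ASR; this mapping is an assumption.
WORD_TO_DIGIT = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}

def extract_rating(utterance: str) -> Optional[int]:
    """Return a 1-5 rating found in the utterance, or None if no valid rating is present."""
    text = utterance.lower()
    match = re.search(r"\b([1-5])\b", text)
    if match:
        return int(match.group(1))
    for word, digit in WORD_TO_DIGIT.items():
        if re.search(rf"\b{word}\b", text):
            return digit
    return None

def survey_response(utterance: str, statement: str) -> str:
    """Repeat the rating request when no 1-5 rating is recognized."""
    rating = extract_rating(utterance)
    if rating is None:
        return f"Please rate the statement '{statement}' on a scale from 1 to 5."
    return f"Thanks, I recorded a {rating}."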
After providing a rating for the 3rd statement the agent transitions to its final state where it gives a farewell message and returns the user to the front-end DialPort agent. 94 The third difference in this implementation is that the dialogue manager has no incremental language process capabilities so all user speech is flushed when the agent is speaking. As men- tioned, DialPort does not currently have callback messages to update the game player agent with action status so there was no way to make use of this capability. The fourth difference in this implementation is that the dialogue manager provides the clue- giving agent a skipping policy so that if the agent cycles through all of the clues for a particular target-word twice the agent skips (informing the user) and begins to give clues for a new target- word. This was done for this implementation to avoid a repetitive interaction. The clue lists that were used in the experiments involving this implementation had less clues on average than the clue lists used for the experiments that used the virtual human and robot implementations of the architecture. 4.5 Parameter Instantiation Decisions Here we detail and motivate parameter instantiation decisions not specified in prior sections . Automatic Speech Recognition Initially, for the early evaluation described in Section 6.2 the Apple OS X El Capitan’s dictation ASR was used as this recognizer was already integrated to work with the Virtual Human Toolkit. Based on early testing and the evaluation we found that the Apple ASR, which is intended for dictation, confused certain words with dictation commands resulting in speech mis-recognitions. Due to this issue we decided to swap the Apple ASR for the Google Chrome ASR y which was used for the game framing & feedback evaluation described in Section 6.3 and the multi-role evaluations described in Chapter 9. http://www.apple.com/osx/whatsnew/ y https://www.google.com/intl/en/chrome/ demos/speech.html 95 Text-to-Speech Based on the evaluation described in Section 6.1 we used NeoSpeech James text-to-speech for all the agent evaluations except for the multi-role robot evaluation described in Section 9.1. As mentioned the Nao agent platform has its own off-the-shelf TTS service and there would have been additional integration work required to use the NeoSpeech TTS for this implementation. Agent Role For the virtual human platform we arbitrarily chose to make the male virtual human perform the game player role and the female virtual human perform the game judge role. This was held constant for all evaluations that used the virtual human implementation. The robot had only one agent embodiment/text-to-speech service and DialPort had only one text-to-speech service and so the game judge role was implicitly subsumed by the game player agent. Embodiment/Non-verbal Behavior Generation As mentioned in Chapter 3 we observed players frequently employed non-verbal behaviors indicating embodiment and non-verbal behav- ior generation was likely an important aspect of the activity. However, we felt it might be discon- certing to author custom non-verbal behaviors without a formal non-verbal annotation scheme. The virtual human and robot agent platforms were partially chosen because they had similar off- the-shelf non-verbal behavior generators that enable the agent to perform simpler gestures that map to less domain specific types of dialogue moves but still occur in the game. 
We felt this would be sufficient to partially capture the positive benefits in player perception that stem from this aspect of game-play. The results in Chapter 9 support this hypothesis. User Initiative Barge-in Dialogue Moves/Incrementality As noted, for the virtual human platform, the 3 dialogue moves: Correct Guess, Skip, and Clue were selected as user initiative barge-in moves based on the activity analysis that showed human players frequently interrupted https://neospeech.com 96 themselves if they had the floor when one of these moves occurred. It seemed intuitive to also make the dialogue moves start and role specifiers user initiative barge-in moves as players making them likely did not want to hear the full set of instructions between each round which were repetitive. The virtual human platform with the barge-in parameter flipped on was used for the final multi-role evaluation described in Section 9.2 based on the results of the incrementality comparative evaluation described in Section 6.2. Similarly, for the robot platform, Correct Guess and Skip were queued and responded to after the current move was completed based on the granularity of the callback mechanisms (which are sent upon action completion) offered by that platform. In the “wizarded guessing role for this platform the “queuing decisions” were made by the human “wizard”. In practice the wizard generally waited for the robot to finish speaking before making a new move. Since the DialPort platform does not support callback messages this version of the agent is non-incremental and all user speech is flushed when the agent is speaking. For the the virtual Human clue-giving role the algorithm that dictates which partial/final ASR hypothesis message the dialogue manager will respond to when managing the clue-giving role is as follows: If the ASR hypothesis is a word that represents a user initiative barge-in action the dialogue manager processes that partial message immediately otherwise it waits for a final ASR hypothesis message before taking its next action. The assumption in this policy decision was that if a user initiative barge-in action is recognized by the ASR (even as a partial message) game performance & perception is improved more by assuming that the partial message reflects the user’s intended dialogue move as opposed to classifying the dialogue move based on a higher confidence final ASR hypothesis. For the the virtual Human guessing role the algorithm that dictates which partial/final ASR hypothesis message the dialogue manager will respond to when managing the guessing role is as 97 follows: If the ASR hypothesis is a word that represents a user initiative barge-in action other than a clue the dialogue manager processes that partial message immediately otherwise it waits for a final ASR hypothesis message before taking its next action. A clue was not processed immediately because we deemed it more important to send the guess generator a clue composed of higher confidence words than slight decreasing response time by processing the first partial ASR hypothesis message received. Other Dialogue Manager Parameters Feedback frequency parameters for the the clue- giving role for all implementations and the virtual human guessing role were set based off of subjective observations from the human-human analysis in Chapter 3. Positive feedback (e.g. -“that’s right”, “yes”) is prepended to an agent’s next clue after a correct guess . Negative feed- back (e.g. 
- “wrong”, “no”) is randomly (half the time) prepended to the agent’s next clue after an incorrect guess. Since the robot’s guessing role was “wizarded” feedback decisions were left to the human “wizard” when the robot performed this role. The s ng silence threshold parameter triggered when a user doesn’t say anything when the agent is in the out-round state was set to 15 seconds based on designer testing. The s g silence threshold parameter triggered when a user doesn’t saying anything causing the agent, when it is performing the clue-giving role. to provide a new clue was set to 6 seconds based on observations of the timing of these behaviors when performed by human clue-givers from our human-human analysis describe in Chapter 3. The i parameter for the virtual human clue-giving role which triggers the clue-giver to give a new clue immediately if i percent of the current clue has been said and no correct guess is made before the end of the current clue was set to 60% based on observations of the timing of these 98 behaviors when performed by human clue-givers from our human-human analysis describe in Chapter 3. For the virtual human guessing role the r g silence threshold parameter, which is triggered if a user hasn’t given a clue when the agent is performing the guessing role, was initially set to 6 seconds. User feedback from pilot tests of the test-bed agent led us to change r g to 12 seconds for the final multi-role evaluation described in Section 9.2. Also for the virtual human guessing role, the g parameter, which indicates how many guesses to give for each clue, was originally set to 5. After pilot testing for the final multi-role evaluation in Section 9.2 this was changed to 3 based on user feedback indicating the agent saying five guesses for a single clue was irritating. In addition the pilot testing indicated the frequency that a correct guess was in the top 5 guesses but not the top 3 guesses was rare. 4.6 Summary In this chapter we present an architecture for building a fully automated test-bed agent that can perform the roles commonly associated with a word-guessing game activity. The architecture can serve as a map for building a fully automated interactive word-guessing game agent that is a relatively ecologically valid test-bed for different types of experiments. We detail 3 different implementations of the architecture for 3 different agent platforms affording the agent with 3 different embodiments: virtual human, robot, and non-embodied web. The virtual human implementation of the architecture was used in the comparative embodi- ment/incrementality experiment presented in Section 6.2, the comparative game framing experi- ment discussed in Section 6.3, and the final comparative multi/single-role evaluation described in 99 Section 9.2. The robot implementation of the architecture was used in the pilot multi-role eval- uation described in Section 9.1. The non-embodied web implementation of the architecture was used in the comparative content sourcing experiment described in Chapter 8 that demonstrates advantages to multi-role enabled content sourcing with real users in the wild. 100 Chapter 5 Clue-Giving Role Content Generation “And the internet has made it so easy for people to ask for clues. ” Shigeru Miyamoto In this chapter we discuss automated methods that produces scalable content generation for the clue generator of an interactive agent performing the clue-giving role of the game such as the one that appears in the agent architecture described in Chapter 4. 
Section 5.1 presents an initial scalable clue generation method that leverages pre-existing web-based and lexical database resources and a clue corpus created using this method. Section 5.2 discusses the construction and evaluation of supervised machine learning classifier that can be used to filter for clues in a clue corpus that are more likely to elicit a correct guess. In Section 5.3, in order to provide more evidence of the efficacy of the automatic filtering method in a fully interactive context, we demonstrate that filtered clues are able to elicit more correct guesses from a human user in a pilot evaluation of the test-bed agent’s clue-giving role. 101 5.1 Scalable Content Sourcing As noted in Chapter 3 word-guessing games require a clue-giver to generate many clues for pre- viously unseen target-words. So one challenge in developing an automated agent with this ca- pability is developing a scalable clue generation method for arbitrary target-words. To this end, we investigated a method that generates clues by scraping web-based resources and querying pre- compiled lexical databases. To test the efficacy of this method on a list of arbitrary target-words we took list of common nouns found on the internet and scraped the first sentence of the Wikipedia page associated with a given target-word, the definition and example sentences found on the Dic- tionary.com web-page associated with a target-word, and queried the WordNet database [Miller, 1995] with the target-word to create a machine corpus of clues, the Machine Clue Corpus first presented in Table 1.2 in Chapter 1. The Machine Clue Corpus contains 213,228 clues for 1,312 unique target-words. The break- down by source of the Machine Clue Corpus can be seen in Table 5.1. Many different types of clues could be found from the dictionary.com webpage and WordNet query results for a given target-word. These types are generally easily mappable to the taxonomy described in Section 3.3. For example, syn clues are scraped from the synonyms section of the target word’s Dictionary.com page. and map to the Synonym clue type defined in Table 3.1. Another example is the wiki clue type, composed of the text found in the first sentence of a Wikipedia page associated with a target word, which generally map to the Description Definition clue type defined in Table 3.1. The frequency of the different types of clues in the Machine Clue Corpus can be found in Table 5.2 . If a type of clue starts with wn this indicates the clue was extracted from WordNet using a Java wrapper script y . WordNet clue type names are composed of a POS and some form of Clue types only appear in this table if at least 1,000 clues of that type were present in the corpus y WordNet was queried via a java wrapper found at http://lyle.smu.edu/ tspell/jaws/ 102 Table 5.1: Clue Frequency by Source Source # of Clues Wikipedia 1,106 (0.5%) Dictionary.com 79,683 (37.4%) WordNet 132,439 (62.1%) the other words that were used in the original query to WordNet. For example, wnNounPartMero indicates that the original query to WordNet used to generate this clue requested meronyms of the Noun WordNet senses of the target word. If the name of a type of clue does not start with wn and is not wiki, which refers to clues composed of the first sentence of the target word’s Wikipedia page, then the clue was extracted from scraping the target word’s Dictionary.com page. In this case the type name is generally an abbreviation for the section of the Dictionary.com that the clue was scraped from. 
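As an illustration of this style of per-source extraction, the sketch below gathers candidate clues for a target-word from Wikipedia and WordNet in Python; it is a rough analogue (using NLTK's WordNet interface and Wikipedia's REST summary endpoint) rather than the original pipeline, which scraped the pages directly and queried WordNet through the Java wrapper noted above.

# Rough analogue of the clue-sourcing step (not the original implementation).
# Assumes `pip install nltk requests` and `nltk.download('wordnet')`.
import requests
from nltk.corpus import wordnet as wn

def wikipedia_clue(target: str) -> list[str]:
    """First sentence of the target's Wikipedia summary (wiki clue type)."""
    url = f"https://en.wikipedia.org/api/rest_v1/page/summary/{target}"
    resp = requests.get(url, timeout=10)
    if resp.status_code != 200:
        return []
    summary = resp.json().get("extract", "")
    return [summary.split(". ")[0]] if summary else []

def wordnet_clues(target: str) -> dict[str, list[str]]:
    """Definition, synonym, hypernym, and hyponym clues pulled from WordNet."""
    clues = {"wnDef": [], "wnSyn": [], "wnHyper": [], "wnHypo": []}
    for synset in wn.synsets(target):
        clues["wnDef"].append(synset.definition())
        clues["wnSyn"] += [l for l in synset.lemma_names() if l.lower() != target.lower()]
        clues["wnHyper"] += [h.lemma_names()[0] for h in synset.hypernyms()]
        clues["wnHypo"] += [h.lemma_names()[0] for h in synset.hyponyms()]
    return clues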
All of the raw clues obtained in the Machine Clue Corpus are preprocessed in three ways. First, simple punctuation based rules are used to split clues obtained from WordNet and Dictio- nary.com into multiple clues. Second, the utility Lexical Variants Generation (Lvg) is utilized to replace the target word and any of its inflected forms in the clue text with the word “blank”. Finally third, if the clue type could be broadly categorized as hypernym, hyponym, or antonym text was either prepended or appended to the raw clue in order to make the clue more explicitly lead a receiver to the target word. In the case of hypernym “a type of” was prepended to the clue. For example, a hypernym clue for the target word ”dog” was “domestic animal” which became “a type of domestic animal”. The text “is one type of it” was appended to hyponym clues. For instance, for the target word “dog” the hyponym clue “dalmation” became “dalmation is one type of it. ” Finally for antonym http://nlm.nih.gov/research/umls/new users/online learning/LEX 004.htm 103 Table 5.2: Clue Type Type # of Clues (% of clue corpus) def 50,622 (23.5%) wnNounHypo 44,973 (20.9%) wnNounHyper 13,597 (6.3%) wnVerbTropos 13,528 (6.3%) syn 11,861 (5.5%) exampSent 11,527 (5.4%) wnVerbHyper 9,173 (4.3%) wnNounSyn 8,744 (4.1%) wnNounDef 6,541 (3.0%) wnVerbSyn 5,414 (2.5%) wnNounUsageExample 5,223 (2.4%) wnVerbUsageExample 4,682 (2.2%) wnNounPartMero 3,880 (1.8%) wnVerbDef 3,521 (1.6%) wnNounTopicMembers 2,924 (1.4%) idiomPhrase 2,403 (1.1%) ant 2,327 (1.1%) wnNounPartHolo 2,191 (1.0%) human 2,121 (1.0%) wnVerbVerbGroups 1,535 (<1%) wnNounTopics 1,110 (<1%) wiki 1,106 (<1%) synStudy 1,026 (<1%) 104 clues the text “the opposite of” was prepended to the clue text. An antonym clue “bottom” for the target word “top” thus became “the opposite of top”. Based on the total number of clues from various sources as shown in Table 5.1 as well as the diversity of clue types as shown in Table 5.2 we are able to show that this method is able to generate large quantities of different types of clue that in many cases are analogous to types of clues given by human clue-givers. As mentioned in Section 3.5 human clues-givers give on average 4.1 clues per target-word before either a correct guess or one of the players skipping. The Machine Clue Corpus has an average 162 clues per target-word which is well above the average number of clues output by human clue-givers and so would be sufficient in quantity even if players return to play the game more than once. However, we can not yet make any claims as whether the quality of these clues are good enough for game-play. We begin to address this quality issue in the next section. 5.2 Automatic Clue Filtering During an early evaluation of the test-bed agent in the clue-giving role, described in Chapter 6.2, it became clear that many of the clues in corpus mach were of low quality and unlikely to elicit a correct guess from a human guesser effecting game playability and user’s subjective perceptions of the agent. Many users left comments that clues were too difficult to understand and negatively impacted their experience. For example, in response to the question “Anything you would suggest to improve the game?” two representative responses were “The clues were very unhelpful in many cases” and “The clues were no fun”. Other examples were found in response to the question “Anything you particularly disliked about the game? 
why?” where two representative responses were “The clues were very unhelpful in many cases” and “Technical jargon that made me sleepy. ”. 105 Moreover, the average score given to the question “How effective did you find the clues in the last round” asked in this evaluation only ranged from 2.9-3.3 on a 7 point scale indicating there was relatively large room for improvement in overall clue quality. In order to mitigate this issue we developed a machine learning classifier that uses linguistic features from a given clue to automatically filter the clues in corpus mach for higher-quality clues more likely to elicit a correct guess [Pincus and Traum, 2016]. This method can be used to filter a clue corpus to create a clue corpus with a higher ratio of clues likely to elicit a correct guess. If the corpus is automatically generated the results of the evaluation of the machine learning filter we describe below provide evidence that the filtered corpus should have clues in line with a corpus of human generated clues in terms of an average guessability. The rest of this section is organized as follows. In the next subsection we discuss how we collected data for training and testing the machine learning clue filter. In Section 5.2.2 we discuss the classification experiments. In Section 5.2.3 we present the results. 5.2.1 Data Collection We created the Special Purpose Machine Clue Corpus first presented in Table 1.2 in Chapter 1, composed of a random subset of 317 clues for 87 target-words from the Machine Clue Corpus for these experiments.The frequencies of the different clue types represented in this subset of corpus mach can be seen in Table 5.3. The target-words for these clues included words such as bomb, ornament, fowl, and breakfast. In order to obtain information on the ability of the clues in the Special Purpose Machine Clue Corpus to elicit correct guesses we designed a crowd-sourced data collection experiment. We designed this experiment to provide effectiveness information for a single clue in isolation, rather than the effectiveness of a clue that takes into account possible clue sequences the given clue 106 Table 5.3: Experiment Corpus Clue Type Freq. Info. Type # of clues (% of experiment clue corpus) wnNounSyn 31 (9.8%) wnNounDef 30 (9.5%) def 30 (9.5%) wnNounHyper 27 (8.5%) exampSent 27 (8.5%) wnNounHypo 26 (8.2%) wnVerbDef 26 (8.2%) wnVerbSyn 23 (7.3%) wnVerbHyper 23 (7.2%) wnVerbUsageExample 20 (6.3%) wnNounUsageExample 19 (6.0%) syn 15 (4.7%) idiomPhrase 10 (3.2%) wnNounAnt 7 (2.2%) wiki 3 (0.9%) 107 could appear in, in order to avoid the exponential increase in the amount of data that would be required to solve the latter problem. The crowd-sourced experiment was run on Amazon’s Mechanical Turk platform where we recruited participants (Turkers) to interact with a web application we developed that elicited spo- ken guesses. Turker’s were required to be native english speakers, have 92% HIT approval ratings or higher, and have completed at least 100 prior HITs. The web application required a Turker to click on a play button which streamed a recording of a clue spoken by the text-to-speech system NeoSpeech’s James . The Turker was instructed to make as many guesses for the clue as possible once the clue recording started. The recording of guesses for each clue ended 6 seconds after the audio containing the spoken clue stopped playing and a pop-up window appeared informing the Turker of the clue’s target-word. 
Each HIT contained a sequence of 30 clues (all for different target-words). Each hit also con- tained one final play button which played a final recording asking for a test-task to be completed (“say the word strawberry”) to ensure the Turker was making a best effort. If this final record- ing was empty or contained audio other than the word strawberry; we did not use that Turker’s recordings in our analysis. For unknown reasons many participants began the experiment but did not finish. Incomplete sets of recordings were used in our data analysis only if a subset of the incomplete set passed a manual spot check testing if the recorded guesses seemed to be a best effort. Multiple clues for the same target were played to different Turkers in order to ensure data analysis would be able to differentiate clue-effectiveness from target difficulty. In total 317 dif- ferent clues were played to different Turkers over the web (some of which were heard by multiple Turkers). In total 457 recordings of Turkers making guesses were recorded. We annotated the http://www.neospeech.com 108 Table 5.4: Guess Annotation Examples Target Clue Guess Guess Annotation Code Bomb “An explosive device fused to explode under specific conditions.” “bomb, pressure plate, ...” 1 Bomb “H-blank is one type of it” <Silence> 0 Tendency “a blank to talk too much” “Tendency” 1 Tendency “the trend of the stock market” “up down” 0 guess recordings, labeling each recording with a 1 if a correct guess was made and 0 if not. A recording was considered to contain a correct guess even if it was only partially correct (e.g. - a guess of “paper” for the target newspaper). Table 5.4 has sample data from the experiment, including one effective clue and one ineffective clue for each of two different targets. 5.2.2 Method We conducted machine learning experiments in order to determine the predictive value of simple textual features for determining a clue’s effectiveness (capability to elicit a correct guess). We use the Weka Machine Learning Library’s Naive Bayes classifier in our experiments [Hall et al., 2009]. We performed 10-fold cross validation with our folds stratified across classes. Feature Selection We carried out feature selection using Weka’s attribute selection method ChiSquaredAttributeEval which ranks the attributes based on computing an attribute’s chi-square statistic with respect to the class. We then used a greedy approach where we started with all attributes and remove the lowest remaining ranked attribute from the ChiSquaredAttributeEval one by one as long as effective clue classification precision is increasing. We are most concerned with maximizing the precision (as opposed to recall) of classification for effective clues because successful game play for an arbitrary target-word usually only requires a few effective clues. The 109 Table 5.5: Features Used for Clue Quality Selection Features Clue Source Type of clue + (e.g - wnNounHyp, exampleSentence) Clue originally contained target-word + (replaced by “blank” during pre-processing) # of words in clue + Average PMI information + Max PMI measure + precision results reflect the likelihood of the automatic method selecting an effective clue from the clue corpus. Features We extracted some simple textual features from the clues utilized in the mechanical turk experiment. These features are listed in Table 5.5. A + indicates that this feature is part of the optimal feature set found by our feature selection method. 
The features include: the clue source (WordNet, Wikipedia, or Dictionary.com), the clue type as discussed in Section 3.3, a binary feature with value 1 if the original clue contained the target word and value 0 otherwise, as well as point-wise mutual information (PMI) features computed between the words in the clue and the clue's target word. The model used to calculate the PMI features was built on a corpus containing millions of web blog entries, a subset of the spinn3r dataset discussed in [Burton et al., 2009].

(PMI(clueWord, target) + PMI(target, clueWord)) / 2    (5.1)

The point-wise mutual information features for a clue were calculated in two ways. An average PMI for each clue was calculated by averaging the value of equation (5.1) over all of the clue's constituent words, and a max PMI for each clue was calculated by taking the maximum value of equation (5.1) over all the constituent clue-words. The optimal feature set includes every feature but clue source. Although the feature set we used in these experiments does not satisfy the assumption of conditional independence made by the Naive Bayes classifier, previous work has shown that the NB classifier has yielded promising results in other text classification tasks even when the features utilized were not completely independent of one another [Dumais et al., 1998]. The results from these experiments are also consistent with this observation.

Baseline We used random clue selection as a first baseline for the effective clue prediction task. Random selection here represents a completely naive clue-giving agent that only has the ability to randomly select a clue from the population of automatically generated clues for a given target word. In order to compute the likelihood of random selection selecting an effective clue we simply computed Equation (5.2), where N is the total number of clues in the experiment corpus.

(# of clues that elicited a correct guess from Turkers) / N    (5.2)

5.2.3 Results

The results of the machine learning experiment can be found in Table 5.6. Since the mechanical turk experiment collected data for 317 unique clues and 45 of those clues were able to elicit a correct guess, the likelihood that random selection, the baseline method, generates an effective clue is 45/317 (14.2%).

Table 5.6: Baseline & Automatic Method Results
Methodology        % of Effective Clues
Baseline           14.2%
Automatic Method   34.6%*

The results seen in Table 5.6 demonstrate that the likelihood of selecting an effective clue is significantly higher (chi-square test *: p = 0.029) for the automatic method than if the baseline random clue selection method is used.

The results for human clue-giving ability (see Section 3.6) show that the likelihood of selecting an effective clue using the automatic method falls within the expected guess-ability and upper bounds of the human likelihood of generating an effective clue. Thus a corpus pruned using the automatic method with the exhaustive feature selection algorithm should be a much more promising set of clues for examining human-machine gameplay. We provide further evidence of this in the next section.

5.3 Online Evaluation of Machine Learning Clue Filter

This section provides further evidence that an automated interactive agent that leverages the machine learning filter while performing the clue-giving role of a word-guessing game is able to output clues with effectiveness levels that come in line with human guessing ability.
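To make the filter's PMI features concrete before examining their effect in live game-play, the sketch below computes the average and max PMI values of equation (5.1) against a hypothetical co-occurrence model; the count tables, smoothing choice, and function names are assumptions and do not reproduce the spinn3r-trained model used in the experiments.

import math

def pmi(w1: str, w2: str, pair_counts, word_counts, total: int) -> float:
    """Point-wise mutual information from (hypothetical) corpus count tables."""
    joint = pair_counts.get((w1, w2), 0) + 1          # add-one smoothing (assumption)
    p_joint = joint / total
    p_w1 = (word_counts.get(w1, 0) + 1) / total
    p_w2 = (word_counts.get(w2, 0) + 1) / total
    return math.log(p_joint / (p_w1 * p_w2))

def clue_pmi_features(clue: str, target: str, pair_counts, word_counts, total: int):
    """Average and max of equation (5.1) over the clue's constituent words."""
    scores = []
    for word in clue.lower().split():
        symmetric = (pmi(word, target, pair_counts, word_counts, total)
                     + pmi(target, word, pair_counts, word_counts, total)) / 2.0
        scores.append(symmetric)
    if not scores:
        return 0.0, 0.0
    return sum(scores) / len(scores), max(scores)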
In Section 5.3.1 we introduce metrics that allow us to compare the effectiveness of filtered and unfiltered clues. Section 5.3.2 applies these metrics to data from the online evaluation of a fully interactive version of the test-bed agent performing the clue-giving role described in Section 6.2. 112 5.3.1 Metrics for Clue Sequences with 2 Types of Clues Although we already presented metrics for assigning credit to clues in a clue sequence leading to a correct guess (see Section 3.6) these metrics cannot distinguish between the effectiveness of different clue types in the clue sequence (e.g. -filtered and unfiltered). In order to be able to investigate the effectiveness of filtered clues versus unfiltered clues we developed additional metrics for comparing the effectiveness of 2 different types of clues in a clue sequence leading to a correct guess. We developed three different metrics, each simplifying the credit-assignment problem for a correct guess to a single clue, but making different assumptions about whether to look only at relatively rare high-precision data, and whether to focus on the first or most recent clue. These three sets can be found in Table 5.7. For each subset of individual clues defined by each metric, a score of 1 is given to those clues that resulted in a correct guess and a score of 0 to those that did not result in a ccorrect guess. In cases where the same clue might have been used in more than one interaction the clue’s scores from each instance it was used are averaged. Figue 5.1 contains a sample dialogue between a user and the test-bed agent from the evaluation described in Section 6.2, the first column of the figure contains the utterances and the second column contains the utterance’s corresponding game action. The game actions are described in more detail in Table 4.2 in Section 4.2. We will refer to this sample dialogue in order to provide examples that help in understanding the three metrics. The first measure, Singleton Clues, considers only clues in sequences of length 1 which implies the clue was either said before a correct guess, a skip, or the end of the round signal. The shortcoming in this metric is that it ignores most of the data while the strength is that there is no doubt that credit for a correct guess is attributed accurately as there is only one clue in the 113 Table 5.7: Filt./Un-Filt. Clue Metrics Comparison Individual Clue Types Clue Considered Effective if Clue sequences of length 1 (Singleton Clues) Correct guess said. First clue in every clue seq. (Truncated Sequence Clues) Correct guess said without a 2nd clue being given. Immediately previous clue only (Prior Clues) Correct guess said before next clue given. Figure 5.1: Sample Dialogue 114 sequence. For example, using this measure and referring to the sample dialogue in Figure 5.1 the clue given in line 1 would be given a score of 0 since a skip occurred right after and all other clues would be ignored because they all appear in clue sequences of length at least 2. However, the clue in line 4 of Figure 3.3 in Section 3.2 would be a given a score of 1 since it is a clue in clue sequence of length 1 which precedes a correct guess (line 5). The second measure, Truncated Sequence Clues, pulls in more data by considering the first clue in every clue sequence, but biases against clues for target-words that deserve partial credit for a correct guess but do not receive any credit because the target-word took longer to figure out (more than one clue had to be heard). 
For example, again referring to the sample dialogue in Figure 5.1 the clues on lines 1,3, and 9 would all be given a score of 0. The clue on line 1 because a skip occurred right after, the clue on line 3 because a correct guess was said after the second clue in the sequence started, and the clue on line 9 because we know there is at least one more clue in its clue sequence. The clues on lines 5 and 11 would be ignored by this measure since they are not the first clue in their clue sequences. On the other hand, the clue in line 4 of Figure 3.3 in Section 3.2 would again be given a score of 1 as it is the first clue in its clue sequence (even though the clue sequence is only of length 1) and precedes a correct guess. Finally, the third measure, Prior Clues, also considers every clue-sequence but attributes all the credit for a correct guess to the clue immediately prior to the correct guess which in many cases is not a fair attribution of credit as the correct guess was achieved through the synergies of the clues up to and including the last clue. This issue is very apparent in the sample dialogue in Figure 5.1 when we apply the third measure to the clues that appear in the sample dialogue. The clue on line 5 would be given a score of 1 and attributed all the credit for the correct guess on line 6 but it is likely a fairer attribution would give the bulk of the credit for the correct guess on line 6 to the clue on line 3 because the clue on line 5 is cut off very early on and the clue on line 3 115 Table 5.8: Filtered vs. Un-Filtered Clue Effectiveness Measure Total # of clues of type given (# of filtered clues given) Average Un-Filtered Clue Effectiveness Average Filtered Clue Effectiveness Singleton Clues 347 (80) 51.1% 62.1% Truncated Sequence Clues 800 (165) 19.0% 25.4% Prior Clues 2,178 (309) 13.7% 21.2%*** provided the majority of the information to the human receiver about the target-word. Using the third measure, the clue on line 1 would be given a score of 0 as its the only clue in its sequence and the user performs a skip right after while the clues on lines 3 and 9 would be ignored as they are not the last clues in their clue sequences. The clue on line 11 depends on the subsequent dialogue and would be ignored if another clue began, given a score of 0 if a skip occurred or the end of round took place, or given a score of 1 if a correct guess was said before the next clue began being spoken. Also, once again, using this measure the clue in line 4 of Figure 3.3 in Section 3.2 would be given a score of 1 as it is the last clue in its clue sequence (even though the clue sequence is only of length 1) and precedes a correct guess. In this case the Prior Clues measure is fair attribution of credit for the correct guess as there are no other candidate clues that could have contributed to the guesser making a correct guess. We acknowledge a shortcoming in the previous three metrics is the fact that they all reduce the credit assignment of a correct guess to a single clue, which as mentioned in Section 3.6, is likely not a fair attribution of credit since clues clearly build off of one another (as well as the receiver guesses) in human-human game interactions. 116 5.3.2 Human-Agent Baseline Clue-Giving Measures Here we apply the metrics in Table 5.7 to data from the on-line pilot evaluation of the test-bed agent described in Section 6.2. Table 5.8 shows the number of clues of each type given in the 52 interactions as well as the average effectiveness for both types of clues. 
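As a brief aside, the three credit-assignment measures defined in Table 5.7 can be made concrete with a small scoring routine; the sketch below assumes a simplified log format (a list of clues per sequence plus the sequence outcome), which is an assumption and not the dissertation's actual data representation.

# Each logged clue sequence: {"clues": [...], "outcome": "correct" | "skip" | "timeout"}.
# Per-clue scores for the same clue across interactions are averaged afterward.
def score_sequences(sequences):
    singleton, truncated, prior = [], [], []
    for seq in sequences:
        clues, outcome = seq["clues"], seq["outcome"]
        correct = outcome == "correct"
        if len(clues) == 1:                       # Singleton Clues: sequences of length 1
            singleton.append((clues[0], 1 if correct else 0))
        # Truncated Sequence Clues: first clue gets credit only if no 2nd clue was needed
        truncated.append((clues[0], 1 if correct and len(clues) == 1 else 0))
        # Prior Clues: the clue immediately before the correct guess gets all the credit
        prior.append((clues[-1], 1 if correct else 0))
    return singleton, truncated, prior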
In all cases the filtered clues have higher average effectiveness than the un-filtered clues. The *** for the Prior Clues category indicates that the difference in average effectiveness between filtered Prior Clues and un-filtered Prior Clues is highly statistically significant (p=.0003). Moreover, the differences between the average effectiveness for filtered clues vs. un-filtered clues in the first two categories, Singleton Clues and Truncated Sequence Clues, are both approaching significance with a smaller sample size (p=.083 (singleton); p=.058 (truncated)). These results provide more evidence that the automatic method for clue filtering is capable of creating a corpus containing a higher ratio of effective clues. They also show that the filter identifies clues that are more effective not only in an offline context but also when output by a fully interactive agent. This indicates that the off-line evaluation of our automatic clue filtering method was a "good" proxy for the evaluation of the method when it is used in an agent in a real-time experiment that is closer to an "in the wild" interaction.

5.4 Conclusion

We introduced automated clue generation methods that leverage pre-existing web-based and lexical database resources to produce clues. We demonstrated that these methods can produce large quantities of clues, many of which are analogous to the types of clues output by human clue-givers. We presented a machine learning filter that uses linguistic features from the clues to filter for clues more likely to elicit a correct guess from a human user. We showed in an offline evaluation that this filter can be used to prune a pre-existing clue corpus to create one that is comparable to human-generated clues in terms of average guess-ability. We also demonstrated that the filter identifies clues that are more effective when output in actual game-play by a fully automated interactive agent.

This clue generation method is scalable: it requires no manual intervention and produces large quantities of clues relatively cheaply and easily for any target-word, which allows it to support game-play even if users come back to play the game more than once.

In Chapter 8 we demonstrate an alternative scalable content sourcing method, available to multi-role agents intended for asymmetric interaction, that produces human-generated clues. Although the results from the machine learning filter evaluation demonstrated that the average quality of filtered machine-generated clues was comparable to human-generated clues, there still remains an empirical question as to whether the filtered machine-generated clues would be as effective as human-generated clues when output by an automated clue-giving agent in live game-play with a human guesser. Moreover, a subjective comparison of the machine-generated clues with the human-generated clues analyzed in the activity analysis in Chapter 3 indicated that machine-filtered clues and human-generated clues might still differ in the dimension of perceived naturalness. We investigate these questions through the comparative evaluation discussed in Chapter 8.

Chapter 6 Design Decision Comparative Experiments

"Damn these extraneous variables!"
Unknown

In this chapter we discuss three comparative evaluations we conducted based on observations pointed out in our activity analysis in Chapter 3.
These evaluations were conducted in order to decrease the chances that these design aspects would obscure our ability to investigate our main research questions around multi-role dialogue agents. If one of these design choices was particularly disconcerting to a user, our final multi-role evaluations described in Chapter 9, might have been less able to serve as an adequate investigation of our main research questions around multi-role dialogue agents due to a user focusing on an aspect of the system associated with one of these design choices. In Section 6.1 we discuss an experiment designed in order to perform a comparative text-to- speech/human voice evaluation involving users listening to the various voices speaking human and machine generated clues. In Section’s 6.2 we discuss an evaluation that compares embodied/non- embodied and incremental/non-incremental versions of the virtual human implementation of the 119 test-bed agent (see Section 4.2) performing the clue-giving role. Finally, 6.3 presents an experi- ment that explores different game framings and associated feedback types also using the virtual human implementation of the test-bed agent performing the clue-giving role. 6.1 Off-Line Text-To-Speech Evaluation As pointed out in the activity analysis (see Chapter 3), we observed comprehensible speech was important for successful game-play. Generally, dialogue agent designers choose between using pre-recordings of a human voice actor or one of many text-to-speech solutions in order to endow the agent with an output voice. Text-to-speech services can be high quality commercial solutions that might require a paid license, lower quality off-the-shelf solutions that are free, or toolkits that with some degree of effort allow one to customize their own synthetic voice. Choosing the right solution for voice output also generally faces a trade-off between scalabil- ity and quality like the one faced when making content sourcing decisions (see Section 2.2). If the set of prompts that an agent is expected to say is fixed and small, one can use a human voice actor. However, as is the case with the test-bed agent here, if a wider variety and/or dynamic utterances are needed, then text-to-speech synthesis (TTS) is a better solution due to time/cost constraints. While many of the TTS services are getting better, none are completely natural, especially when it comes to emotional and conversational speech. Given that the amount of content an agent is responsible for saying in this domain (especially in the clue-giving role) led us to determine we would need to rely on a text-to-speech system for speech output. However, we felt it was important to empirically evaluate the impact of using a high quality text-to-speech system on agent interaction to ensure that the use of such a system did not cause users to focus on the agent speech as opposed to the agent’s multi-role capabilities. 120 We conducted a comparative evaluation of several natural and synthetic voices using several different criteria, including subjective ratings and objective task measures. In particular, we com- pare the relationship of a voice’s evocative function potential, a measure of the voice’s ability to evoke an intended reaction from the listener, to the voice’s intelligibility and to the listener’s perception of the voice’s naturalness and likability. Our first hypothesis is that voice quality is a multi-dimensional construct, and that the best voice for some purposes may not be the best for all purposes. 
There may be different aspects that govern subjective perceptions of a voice and objective task performance, and different aspects may facilitate different tasks. For example, a neutral highly intelligible voice may be perfect for a system that provides information but very unpleasant for a story-telling system that is trying to express strong emotion. Our second hypothesis is that naturalness and likability perceptions of a voice may depend on whether or not the user’s exposure to a voice is extended and continuous vs. short-term and sporadic (interleaved with other voices). The current practice in speech synthesis evaluation is to ask human raters to rate isolated audio clips, usually in terms of naturalness and intelligibility [Fraser and King, 2007, Karaiskos et al., 2008], without extended exposure to a voice. This approach can certainly inform us about the general quality of a synthetic voice; but it cannot necessarily provide any insight about the appropriateness of this voice for a task that requires that the listener be exposed to that voice for a considerable amount of time. Furthermore, as the environments where these dialogue systems are deployed become increasingly immersive involving multiple agents, e.g., virtual and augmented reality environments, it becomes critical to determine how subjective perceptions of a voice change if voice exposure is sporadic and interleaved with other voices. From now on, we will assume that sporadic voice exposure implies that the user is exposed to multiple voices interleaved. 121 Noting that it is not always feasible to evaluate a voice in the context of a full dialogue task we seek to determine whether results from standard voice evaluation experiments can act as a valid proxy for results from experiments that feature voice evaluation in a manner that more closely approximates the full dialogue task. Taking this idea one step further, we explore whether or not standard TTS evaluation tests such as transcription tasks (designed to assess the intelligibility of a voice) can be fully automated by using automatic speech recognition (ASR) output rather than manual transcriptions. If automating these standard evaluations provides highly correlated results to those obtained via manual transcriptions carried out by humans there will be support for automating these types of tasks in order to decrease the costs and time of evaluations. If results based on automatic transcriptions are highly correlated to results based on manual transcriptions carried out by humans, then there will be support for automating these types of tasks in order to decrease the cost and time of evaluations. We performed 5 experiments using 4 synthetic voices (covering a range of speech synthesis techniques) and 1 human voice. Each experiment is defined by a unique set of stimuli, subjects, and measures. The stimuli for all experiments are machine generated and human generated clues from actual game-play. In the first two experiments, we perform standard speech synthesis eval- uation, i.e., human raters rate isolated audio clips with regard to naturalness in one experiment and likability in the other experiment (each rater has short-term sporadic exposure to the voices). Experiments 3 and 4 are intelligibility experiments; in one, participants transcribe the utterances that they hear; in the other, we send audio files through an ASR engine. 
In order to compare the voices in a context more similar to the interactive game the fifth experiment is conducted over the web with clues being streamed to participants who are tasked with guessing the target-word asso- ciated with the given clue. In the fifth experiment, participants listen to many consecutive clues uttered with the same voice (extended continuous exposure). We used the Amazon Mechanical 122 Turk (AMT) service to recruit participants (AMT calls them Turkers) in the same fashion as in [Wolters et al., 2010, Georgila et al., 2012]. The rest of this section is organized as follows. In Section 6.1.1 we present the voices that we use as well as meta-data about the clues that the voices spoke. In Section 6.1.2 we describe the experiment methodology, and in Section 6.1.3 we report the results of our experiments and some inferences we can draw from them. Section 6.1.4 briefly discusses implications of this experiment for the test-bed agent. Finally, Section 6.1.5 summarizes our findings. 6.1.1 Data & Materials Our experiments use 4 different synthetic voices and 1 human voice, all male, with standard American accents. Human voice (HUM): The audio clips were recorded by the first author using a high- quality microphone with noise cancellation features. The resulting audio clips were very clear, almost studio-quality. Commercial voice 1 (US1): This is a high-quality commercial stylized voice based on Unit-Selection [Hunt and Black, 1996, Black and Taylor, 1997]. Commercial voice 2 (US2): This is a high-quality commercial customized Unit-Selection voice developed specifically for our institute. Hidden Markov model -based voice (HMM): This voice is based on HMM synthesis [Zen et al., 2009], in particular, speaker-adaptive HMM-based speech synthesis [Yamagishi et al., 2009]. First an average voice was built using the CMU ARCTIC speech databases y . Then https://www.mturk.com 123 Table 6.1: Example Clues Clue Type Source Target Word “an explosive device fused to explode under specific conditions” Definition WordNet Bomb “a blank to talk too much” Example Usage Dictionary.com Tendency “taxi” Word Relation Human Cab “a mixture containing two or more blank elements or blank and nonblank elements usually fused together or dissolving into each other when molten” Definition WordNet Metal “elephants may look alike to you and me, but the shapes of their blank flaps and their tusks set them apart” Example Usage Dictionary.com Ear “um not video but” Word Relation Human Audio 124 this average voice was adapted to the voice characteristics of a speaker using approx. 15 minutes of speech from that speaker (studio-quality recordings). We built this voice using the HTS toolkit with its standard vocoder [Zen et al., 2007]. Lower quality voice (SAM): We used Microsoft Sam. We measure a voice’s EVP for the guessing task by providing clues for listeners to guess a specific target word. We used 54 clues taken from both the machine corpus described in Sec- tion 5.1 and human clues from human-human gameplay from the Rapid Dialogue Game Corpus described in Section 3.1. The clues from the machine corpus came from two sources: WordNet [Miller, 1995] and the Dictionary.com pages associated with the target word. We only used clues that were able to elicit at least one correct guess in the study designed to measure clue effective- ness described in Section 5.2. Some example clues used in this experiment, their source, their type, and the target word they intend to evoke can be found in Table 6.1. 
Each of the 54 clues was synthesized in each of the voices. We categorized the 54 clues into 3 main clue types based on the annotation scheme in Section 3.3: a definition type which provided a definition of the target word, an example usage type which is generally a commonly used sentence that contains the word, and a word relation type which refers to clue types such as synonyms, hyponyms, hypernyms, antonyms, etc. of the target word. For our analysis we looked at cumulative statistics for the full set of clues as well as statistics for two different partitions of the clue corpus; by type and by length (> 5 words and 5 words). The relative frequency for each type of clue can be found in Table 6.2; 24% or 13/54 of the clues are composed of 5 or fewer words while 76% (41/54) of the clues are composed of more than 5 y http://www.festvox.org/cmu arctic/ 125 words. The average clue length is 10.75 words and the standard deviation of clue lengths is 7.86 words. Table 6.2: Clue Type Frequency Clue Type Relative Frequency (absolute # / 54) Definition 63% (34) Example Usage 24% (13) Word Relation 13% (7) 6.1.2 Method A summary of the 5 experiments conducted in this study, introduced in the beginning of this sec- tion, and the measures obtained from each experiment can be found in Table 6.3. The standard naturalness, likability and intelligibility experiments featured short-term sporadic exposure to the 5 voices and were designed using the online survey software Qualtrics . In these experiments all participating Turkers listened to 20 audio recordings (human or synthetic speech) of clues randomly selected from the 54 clues described previously. Each set of 20 audio recordings was balanced so that the participant would listen to 4 clips per voice. The order of clues and voices was randomized, i.e., there was constant switching from one voice to another (short-term spo- radic exposure to a voice). Generally, each participant never heard a clue more than once. Turkers were instructed to listen to an audio file only once in these experiments in order to more accu- rately model a normal spoken language situation such as transcribing a lecture or simultaneous interpretation. http://www.qualtrics.com/ 126 Table 6.3: Experiments & Obtained Measures Experiment Obtained Measures 1. Standard Naturalness 1. Short-Term/Sporadic (S/S) Naturalness 2. Standard Likability 1. Short-Term/Sporadic (S/S) Likability 3. Standard Intelligibility 1. Human Wrd. Err. Rate 2. Human Miss. Word % 4. ASR Intelligibility 1. ASR Wrd. Err. Rate 2. ASR Miss. Word % 5. Guessability 1. Extended/Continuous (E/C) Naturalness 2. Extended/Continuous (E/C) Likability 3. Guessability 127 54 different Turkers participated in the standard naturalness experiment. After listening to an audio file a Turker answered the following question: “For the utterance you just heard, how did the voice sound?” (1=very unnatural, 2=somewhat unnatural, 3=neither natural nor unnatural, 4=somewhat natural, 5=very natural). We will call this a Turker’s short-term/sporadic (S/S) naturalness measure. 54 different Turkers participated in the likability experiment. After listening to an audio file a Turker answered the following question: “Would you like to have a conversation with this speaker?” (1=definitely not, 2=maybe not, 3=cannot decide, 4=maybe yes, 5=definitely yes). We will call this a Turker’s short-term/sporadic (S/S) likability measure. The standard intelligibility experiment was designed as a transcription task. 
55 Turkers listened to audio recordings of the clues described previously and then wrote into a text box what they heard. 6 of the 55 Turkers' transcription results were discarded; 2 Turkers did not appear to make a best effort and 4 misread the instructions and provided guesses for the clues they heard rather than transcribing the audio. We compared the transcriptions with the actual text of the clue that was synthesized or recorded (the reference). In order to compare the results of this intelligibility experiment with the results from an automatic test of intelligibility (the ASR intelligibility experiment) we sent the audio recordings of all 54 clues, for each voice, through the Google Chrome ASR (https://www.google.com/intl/en/chrome/demos/speech.html). For both standard and ASR intelligibility, we calculated word error rate (WER) (Equation 6.1) and the percentage of words contained in the reference but not in the transcription (missing word %).

WER = (Substitutions + Deletions + Insertions) / (# of words in reference)    (6.1)

A web application was developed for the guess-ability experiment, and Turkers were redirected to this application from the AMT site to participate in the experiment. Each Turker in the guessing experiment had extended continuous exposure to 3 of the 5 voices, listening to 18 clues in each voice, for a total of 54 clues. We collected a full set of 54 recordings from 59 different Turkers and almost a full set (53/54 recordings) from a 60th Turker (who failed to make a guess for the last clue). Note that many more Turkers attempted the experiment but failed to finish for unknown reasons. We do not consider this partially collected data except for the 60th Turker's data just mentioned. Turkers heard only one instance of each clue. The order of voices was balanced (there are 60 possible permutations of the voices with our experimental set up, so each Turker heard 3 voices in a unique order), but clues were presented in a fixed order. Each Turker, when listening to a clue, was instructed to make as many guesses as he could before a pop-up alert appeared (six seconds later), indicating that recording had ended and revealing the target word. After each clue the Turker was asked to rate the naturalness of the voice he had just heard on a Likert scale as in the previously described experiments, except that the word "clue" replaced the word "utterance" in the question. The average of these 18 naturalness scores for each Turker will be called a Turker's extended/continuous (E/C) naturalness score. After each set of 18 clues with the same voice, the Turker was asked whether or not he would like to have a conversation with the speaker the Turker had just been exposed to for the last 18 clues (the same question as in the previously described likability experiment). We will call this a Turker's extended/continuous (E/C) likability score.

We annotated the 60 sets of audio recordings (3,239 audio files) of Turkers' guesses for whether or not the recording contained a correct guess. An audio recording was annotated as correct if it contained a guess composed of the target word, or an inflected form of the target word, for the previously spoken clue. We define a guess-ability score for a voice as the percentage of correctly guessed clues out of the total number of clues played to participants with that voice.

All the likability and naturalness measures we categorize as subjective measures, while the intelligibility and guess-ability measures we categorize as objective measures.
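As an illustration of how the two intelligibility metrics just defined could be computed for a single reference/transcription pair, the following is a minimal Python sketch. It is not the code used in the study, and the example strings are invented.

# Minimal sketch (not the study's actual code) of the two intelligibility metrics:
# WER (Equation 6.1) and missing word %, for one reference/hypothesis pair.

def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / # words in reference,
    computed with standard word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits turning the first i reference words into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                              # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                              # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub_cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,           # deletion
                          d[i][j - 1] + 1,           # insertion
                          d[i - 1][j - 1] + sub_cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

def missing_word_pct(reference, hypothesis):
    """Percentage of reference words that do not appear anywhere in the hypothesis."""
    ref, hyp_set = reference.split(), set(hypothesis.split())
    missing = [w for w in ref if w not in hyp_set]
    return 100.0 * len(missing) / len(ref)

if __name__ == "__main__":
    ref = "an explosive device fused to explode under specific conditions"
    hyp = "an explosive device used to explode under conditions"
    print(round(word_error_rate(ref, hyp), 3), round(missing_word_pct(ref, hyp), 1))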
6.1.3 Results

This section contains the results of our experiments, including the S/S and E/C naturalness ratings in Table 6.4, the S/S and E/C likability ratings in Table 6.5, and all the objective measures in Table 6.6. The general ranking of the voices across the various subjective and objective dimensions measured was (starting with the highest ranking voice and proceeding in decreasing order): human (HUM), commercial (US1), commercial (US2), hidden Markov model (HMM), lower quality voice (SAM). We will refer to this as the standard order. The existence of a standard order indicates that we did not find good evidence to support hypothesis 1. At first glance any measure is a good proxy for another measure; however, there are some exceptions. If there is a statistically significant exception we will explicitly mention it.

A marking of "***" by a measure in one of the three tables indicates that the difference between that measure and the measure for the next ranked voice is highly significant (p < .001). A marking of "**" indicates that the difference between that measure and the measure for the next ranked voice is significant (p < .01). Finally, a marking of "#" indicates that the difference between that measure and the measure for the voice ranked 2 below is significant (p < .01). Statistical tests conducted were paired or unpaired t-tests (based on the relationship of the data sets tested) with the use (if needed) of the Holm-Bonferroni method to counteract the issue of multiple comparisons.

Subjective & Objective Measures

Table 6.4: S/S & E/C Naturalness Means

Voice | S/S Naturalness Avg. | E/C Naturalness Avg.
HUM | 4.15*** | 4.59***
US1 | 3.93*** | 3.48***
US2 | 2.92*** | 2.04***
HMM | 2.04*** | 1.83***
SAM | 1.81 | 1.57

Table 6.5: S/S & E/C Likability Means

Voice | S/S Likability Avg. | E/C Likability Avg.
HUM | 3.78# | 4.17**
US1 | 3.63*** | 3.36***
US2 | 2.66*** | 1.69
HMM | 1.81 | 1.53
SAM | 1.72 | 1.35

The voices follow the standard order for both S/S and E/C mean naturalness, and all pairwise comparisons for both S/S and E/C show differences in means that were highly statistically significant. This indicates that synthetic voices, at least the ones tested, have still not reached human-level naturalness. There were no significant violations of this pattern in the various subsets of clues tested.

The S/S and E/C likability scores for all clues can be found in Table 6.5. Again, both measures follow the standard order. It is interesting that the US1 and HUM voices do not have a significant difference in their S/S likability but do for their E/C likability (p = 0.008). In terms of naturalness and likability we believe the HMM voice scored low due to the fact that it was not trained on a large amount of data (only 15 minutes of speech was used for adaptation) and also the fact that it did not use a more advanced vocoder such as STRAIGHT (Speech Transformation and Representation using Adaptive Interpolation of weiGHTed spectrum) [Kawahara, 1997].
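As a concrete, hypothetical illustration of the testing scheme behind the significance markers in Tables 6.4-6.6, the sketch below shows how pairwise t-tests between adjacently ranked voices might be combined with a Holm-Bonferroni correction. It assumes SciPy is available, uses small placeholder rating lists rather than the study's data, and is not the analysis code actually used.

# Illustrative sketch (not the study's analysis code) of pairwise t-tests between
# adjacent-ranked voices with a Holm-Bonferroni correction over the comparison family.
from scipy import stats

def holm_bonferroni(pvalues, alpha=0.05):
    """Return a reject/accept decision for each p-value under Holm's step-down procedure."""
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    reject = [False] * len(pvalues)
    for k, idx in enumerate(order):                      # k = 0 for the smallest p-value
        if pvalues[idx] <= alpha / (len(pvalues) - k):
            reject[idx] = True
        else:
            break                                        # once one test fails, larger p-values also fail
    return reject

def pairwise_adjacent_tests(scores_by_voice, ranked_voices, paired=False):
    """t-test between each voice and the next-ranked voice (e.g. HUM vs US1, US1 vs US2, ...)."""
    pvals = []
    for a, b in zip(ranked_voices, ranked_voices[1:]):
        if paired:    # same raters rated both voices
            _, p = stats.ttest_rel(scores_by_voice[a], scores_by_voice[b])
        else:         # independent samples; Welch's correction for unequal variances
            _, p = stats.ttest_ind(scores_by_voice[a], scores_by_voice[b], equal_var=False)
        pvals.append(p)
    return pvals

if __name__ == "__main__":
    # placeholder per-rater scores, for illustration only
    dummy = {"HUM": [5, 4, 5, 4], "US1": [4, 4, 3, 4], "US2": [3, 2, 3, 3],
             "HMM": [2, 2, 1, 2], "SAM": [2, 1, 2, 1]}
    pvals = pairwise_adjacent_tests(dummy, ["HUM", "US1", "US2", "HMM", "SAM"])
    print(holm_bonferroni(pvals))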
Overall, this data suggests that synthetic voices are catching up to HUM voices faster in the likability dimension than in the naturalness dimension, although an experiment with more human voices is needed for more evidence of this trend.

Table 6.6: Objective Measure Means

Voice | Guessability | Human Word Error Rate | Human Missing Word % | ASR Word Error Rate | ASR Missing Word %
HUM | 57.10%** | 18.35%# | 15.64%# | 5.41%# | 5.24%**
US1 | 59.72%# | 23.31%*** | 20.53%*** | 6.11%** | 4.54%#
US2 | 50.39%# | 29.65%*** | 25.18%# | 21.82%** | 18.5%**
HMM | 46.45% | 29.32%# | 25.44%*** | 13.26%# | 10.3%#
SAM | 42.44% | 35.43% | 32.36% | 28.27% | 24.78%

For the standard intelligibility results the standard order is followed for both WER and missing word %. The HUM voice performs best, although its advantage over US1 is not significant, demonstrating that synthetic voices are able to match human voices on intelligibility measures. We see from Table 6.6 that the overall intelligibility of US2 and HMM is comparable. However, the HMM voice significantly outperformed US2 (WER: p = 0.002; missing word %: p = 0.017) on example usage clues. Noting that the HMM voice extended the pronunciation of the word "blank" (which appeared in almost all of the example usage clues), this could provide some support for a hypothesis that unnatural sounding words remain in listeners' short-term memory more readily. However, further experiments are needed to verify whether or not this is just an aberration.

For the ASR intelligibility results the standard order was violated (HMM outperformed US2 for both WER and missing word %, and US1 outperformed HUM for missing word %), but these deviations were not significant. Overall, the intelligibility results indicate that Google Chrome ASR is much better than real-time Turkers at the transcription task (where Turkers have only a single opportunity to hear the audio).

In the guessability dimension the standard order is violated because US1 outperformed HUM, but we draw no conclusions from this as the difference is not statistically significant. The performance of US1 for guess-ability is significantly (p = 0.001) better than that of US2 but comparable to the HUM voice, indicating that synthetic voices have reached an EVP approaching human level for the clue guessing task. One hypothesis on why US2 has significantly worse guess-ability than US1 and HUM is that although US2 is a high-quality voice, more effort has been put into making this voice expressive than into making sure that all phonetic units are fully covered in all possible contexts. In terms of the guessability of the various sub-groups of clues, it appears all voices perform much better on long clues, except for HUM, which has similar performance for both long and short clues. SAM is particularly bad for short clues, with a guess-ability of 33.3% (compared to 45.3% for long clues).

These results indicate that if one is concerned with the subjective perception of the system carrying out the task, or with its intelligibility, rather than only with the task performance measure, then HUM is the undeniable best voice. However, if one is only concerned with maximizing the EVP of a dialogue system then US1 might be the preferred choice, as it eliminates the need for human recordings.

Time/Continuity-Exposure

In order to determine whether time/continuity of voice exposure is an important variable in determining people's subjective evaluations of a voice (note that hypothesis 2 was that this is an important variable), we consider the differences between 3 pairs of statistics for each voice over all clues. The first pair of statistics we compare are the average S/S likability scores and the average E/C likability scores. These statistics are found in Table 6.5.
We see that the likability scores decreased for all the synthetic voices (the decrease in US2's likability scores is highly statistically significant: p = 3.6e-05) but increased for the human voice (p = 0.04).

The second pair of statistics we compare are the S/S naturalness scores and the E/C naturalness scores. These statistics are given in Table 6.4. We see the same pattern with S/S and E/C naturalness scores that we saw with S/S and E/C likability scores for the 5 voices: increasing naturalness scores for the HUM voice and decreasing naturalness scores for the synthetic voices. Moreover, every difference is highly significant here (HUM: p = 3.08e-16; US1: p = 1.01e-12; US2: p = 6.72e-33; HMM: p = 0.06e-2; SAM: p = 6.53e-05).

An attempt to examine whether or not time exposure alone has an effect on subjective evaluation of a voice leads us to examine a third pair of statistics: comparing the average of the first three naturalness scores from a Turker in the guessability experiment to the average of the last three naturalness scores (of 18 total) for the same voice (first voice heard only). This comparison provides evidence that the pattern we are discussing is not simply due to the difference in the types of tasks participants were asked to perform. These scores can be found in Table 6.7. A "*" in the last column indicates that the corresponding increase or decrease is statistically significant (HUM: p = 0.017; US2: p = 0.013). Although US1's and HMM's naturalness averages increase, these increases are not significant. One issue to point out here is that the order of clues was fixed, so the synthetic voices might have had worse performance on the last clues vs. the first clues.

Table 6.7: First vs. Last Naturalness Scores

Voice | First Three Naturalness Avg. | Last Three Naturalness Avg.
HUM | 4.25 | 4.81*
US1 | 3.42 | 3.52
US2 | 2.58 | 1.833*
HMM | 1.69 | 1.78
SAM | 1.67 | 1.31

We now note that this study has results from two experiments where synthetic voices show a statistically significant decrease, and a human voice shows a statistically significant increase, in subjective evaluation ratings when comparing the ratings from people who had S/S vs. E/C exposure to the voices. These findings provide support for hypothesis 2, indicating that extended/continuous exposure to a synthetic voice negatively affects subjective perception of that voice. Furthermore, this study has shown results from one experiment which suggest that people's subjective perceptions of synthetic voices degrade over time while their subjective perceptions of human voices improve over time. Additional experiments with more human voices and a balanced order of clues could be conducted to provide further support for this phenomenon.

Correlation Analysis

Table 6.8 presents the results of a correlation analysis between guessability and the other dimensions previously discussed. In the tables below, r_s denotes Spearman's rank-order correlation coefficient and r denotes the Pearson product-moment correlation coefficient. The correlation results for guessability and the two naturalness scores do not lead us to any clear conclusions. The only statistically significant correlation is between guessability and E/C naturalness, whose ratings were collected after a participant had received feedback on the correctness of their guess (which could affect the rating).

Table 6.8: Guessability Correlations

Categories | r_s | P-Value
Guessability & S/S Natural. | 0.122 | 0.051
Guessability & E/C Natural. | 0.31 | 0.002e-4
Guessability & S/S Likability | 0.108 | 0.085
Guessability & Stand. Word Error Rate | -0.108 | 0.081
Guessability & Stand. Missing Word % | -0.129 | 0.035

We find weak negative correlations between guessability and both of the measures from the standard intelligibility experiments. Note that only the correlation between missing word % and guess-ability is statistically significant. This indicates that while the intelligibility measures of a voice could be useful information when evaluating a voice's EVP, the correlation is not strong enough to suggest that they are valid proxy measures for a voice's EVP. Furthermore, performing voice evaluation in an experiment that features the full context of the system being evaluated might still be required for precise voice evaluation results for a dialogue system.

Table 6.9 shows the correlations, for each voice, between the ASR intelligibility experiment results and the standard intelligibility experiment results. For almost all of the synthetic voices there is a strong or somewhat strong positive correlation between the ASR intelligibility results and the standard intelligibility results, with high statistical significance. The one exception to this is SAM's ASR WER, which shows no significant relationship with the human transcriptions' WER. It is also interesting that for the HUM voice the ASR intelligibility results show basically no correlation with the standard intelligibility results. Overall though, it appears that for synthetic voices intelligibility results can be obtained automatically by sending recordings of the voice to a well-trained ASR engine such as Google Chrome ASR, and these should be able to predict the results from a standard (human participant) intelligibility test.

Table 6.9: Intelligibility Correlations

Voice | Word Error Rate Standard-ASR Corr. (r) (p-val) | Missing Word % Standard-ASR Corr. (r) (p-val)
HUM | 0.06 (0.37) | 0.07 (0.29)
US1 | 0.27 (1.66e-36) | 0.26 (3.97e-05)
US2 | 0.55 (1.37e-05) | 0.58 (5.21e-23)
HMM | 0.78 (7.17e-52) | 0.74 (2.52e-43)
SAM | 0.07 (0.29) | 0.17 (0.007)

6.1.4 Test-Bed Agent Implications

The results from this study led us to choose the highest performing synthetic voice (US1) from these evaluations for the comparative role evaluation described in Section 9.2, the incremental/embodiment evaluation discussed in the next section, and the game framing/feedback evaluation presented in Section 6.3. This evidence-based synthetic voice selection for the test-bed agent helps decrease the chance that the results from these evaluations are biased by the selection of a low quality synthetic voice rather than reflecting the specific aspects of the agent intended for investigation in each evaluation.

6.1.5 Summary

We presented the results of an evaluation of 4 synthetic voices and 1 human voice that featured collection of data for subjective perception measures as well as for objective task measures of the voices. We demonstrated that synthetic voices do not always have significantly lower EVP than a human voice (US1 is similar), although they do significantly differ in the subjective ratings assigned to them by listeners. For this reason, we would choose a human voice for a dialogue system designed to evoke an intended reaction from a listener only if subjective perceptions were important enough to the system designer to warrant the extra cost and time of making human audio recordings.
We showed, via comparison of measures of the voices' EVP with measures of subjective perceptions and intelligibility, that while standard measures of synthetic voice evaluation cannot always be used as a proxy for a voice's effectiveness at a new task, the results from standard tests can provide useful information. Some of our data led us to suggest that perceptions of synthetic voices' likability and naturalness degrade with increasing time/continuity of exposure, while perceptions of human voices' likability and naturalness improve with increasing time/continuity. Finally, we provided evidence that the automatic method of sending synthetic voice audio recordings through an ASR engine can serve as an adequate substitute for standard (human participant) intelligibility experimental results, and that the automatic method even outperforms Turkers' transcription ability (when Turkers hear the audio only once).

6.2 On-Line Embodiment and Incrementality Evaluation

As pointed out in the activity analysis (see Chapter 3), players frequently employed non-verbal behavior and interrupted each other. This indicated that endowing the agent with embodiment and incremental language processing capabilities could potentially have a large impact on the interaction. In order to investigate the impact on users' perceptions and behaviors when the test-bed agent was afforded these capabilities, we designed a 2x2 factorial experiment [Pincus and Traum, 2017] that used the virtual human implementation of the test-bed agent performing the clue-giving role (see Section 4.2). One factor was virtual human embodiment and one factor was barge-in. Overall, the results were mixed for this pilot evaluation, but positive interaction effects were shown for the embodied and barge-in versions of the agent. Further, user comments from this pilot evaluation led us to iterate on our automatic clue generation methods (see Section 5.2), which significantly improved the agent's ability to elicit correct guesses from human users.

The rest of this section is organized as follows. In Section 6.2.1 we discuss details of the independent variables investigated in this evaluation. Section 6.2.2 presents the experimental design and method. Section 6.2.3 lists the experimental hypotheses. Section 6.2.4 reports the results from this evaluation. Finally, Section 6.2.5 briefly discusses implications of this experiment for the test-bed agent and summarizes our findings.

6.2.1 Independent Variables

We examined two main independent variables:

Embodied: whether or not participants could see the test-bed agent and its non-verbal behavior, including lip synch, gaze and some other gestures.

Barge-in: whether or not the test-bed agent allowed user initiative barge-ins.

In the embodied condition users saw a screenshot similar to the one seen in Figure 4.3 in Section 4.2, with the virtual humans displayed in the center of the screen. In this experiment only the Game-Info GUI was displayed, centered above the screen displaying the virtual humans. In the non-embodied condition, participants only saw the Game-Info GUI and no screen displaying virtual humans. In the barge-in condition the barge-in parameter in the virtual human dialogue manager was turned on (see Section 4.2) so users' barge-in dialogue moves would be processed. In the non-barge-in condition this parameter was turned off and all user speech was flushed while the test-bed agent was speaking.
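To make the barge-in manipulation concrete, here is a highly simplified, hypothetical sketch of the switch just described. The class and method names are invented for illustration and do not reflect the actual Virtual Human Toolkit dialogue manager used by the test-bed agent.

# Hypothetical sketch of the barge-in parameter described above; names are invented
# and this is not the actual dialogue manager implementation.

class ClueGiverDialogueManager:
    def __init__(self, allow_barge_in: bool):
        self.allow_barge_in = allow_barge_in
        self.agent_is_speaking = False

    def on_user_speech(self, dialogue_move: str):
        """Handle a recognized user dialogue move (e.g. a guess or a skip request)."""
        if self.agent_is_speaking and not self.allow_barge_in:
            return  # non-barge-in condition: user speech during agent speech is flushed
        if dialogue_move == "skip":
            self.interrupt_agent()
            self.give_clue_for_next_target()
        elif self.is_correct_guess(dialogue_move):
            self.interrupt_agent()
            self.award_point_and_advance()
        # otherwise: incorrect guess, keep giving clues for the current target

    def interrupt_agent(self):
        # in barge-in mode the agent stops its own speech mid-clue
        self.agent_is_speaking = False

    # the methods below stand in for game logic described elsewhere in the dissertation
    def is_correct_guess(self, move): ...
    def give_clue_for_next_target(self): ...
    def award_point_and_advance(self): ...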
6.2.2 Experimental Design & Method We used a between subjects method for the two variables, since we felt it would be disconcerting to change the user interface without changing other aspects of the agent. The 4 conditions of the 2x2 experimental design for the experiment can be seen in Table 6.10. 140 Table 6.10: Experiment Conditions Incremental Language Processing Embodiment Barge-in Non barge-in Virtual Human User sees virtual human agent with off-the-shelf non-verbal generation. User can interrupt agent speech with correct guesses and skips. User sees virtual human agent with off-the-shelf non-verbal generation. User speech flushed when agent speaking. Non-Embodied User sees only Game-Info GUI. User can interrupt agent speech with correct guesses and skips. User sees only Game-Info GUI. User speech flushed when agent speaking. 141 1. How likely would you be to recommend the game to a friend? 2. How natural was the clue giver? 3. How natural was the voice? 4. If you had opportunity to switch to a new unknown partner would you do that or keep the same one? 5. If you had the opportunity to play a game right now would you rather play this one again or another? 6. How much did you enjoy the game? 7. Anything you particularly enjoyed about the game? 8. Anything you particularly disliked about the game? why? 9. Would you play the game again for fun? 10. Anything you would suggest to improve the game? Figure 6.1: Post-Survey Participants were recruited via Craig’s List and paid $25 for their time. The players were required to play 8 150-second rounds of the word-guessing game with the test-bed agent. Fol- lowing the game, participants reported their subjective experience on a post survey seen in Figure 6.1. The dialogue manager was updated so that the judge would request ratings on 1-7 scale from the user for the following 2 statements: “How effective did you find the clues in the last round?” & “Other than the clues he gave, how do you think (the test bed agent) compares to a human clue-giver?” after each round. Participants spoke into a wireless Sennheiser microphone. Audio files were recorded and all relevant game actions stored in the agent’s database. 6.2.3 Hypotheses We had 4 main hypotheses for this experiment. 142 Hypothesis 1: The number of user initiative barge-in dialogue moves (i.e. - correct guesses and skips) recognized by the agent would be higher in the barge-in condition vs the non-barge- in condition. Note if evidence is found for hypothesis 1, in particular if there are more correct guesses utterances recognized by the system for players in the barge-in condition then players will have higher scores in that condition since each correct guess earns players 1-point. Motivation for hypothesis 1 was found in the fact that in barge-in mode the test-bed agent is able to interrupt himself when he is speaking for user initiative barge-in dialogue moves and thus has more time recognize those dialogue moves than when in non-barge-in mode. Hypothesis 2: Subjective evaluations of the game in terms of user enjoyment (as measured by responses to question 6 in Figure 6.1) would be higher in the barge-in condition vs the non-barge- in condition. We believed players would be frustrated when their speech was ignored during agent speech decreasing their enjoyment of the game. Hypothesis 3: Subjective evaluations in terms of naturalness for the voice (as measured by responses to question 3 in Figure 6.1) would be higher in the embodied condition vs the non- embodied condition. 
We thought a disembodied voice would be disconcerting and non-human like for players. Hypothesis 4: Subjective evaluations in terms of naturalness for the clue-giver (as measured by responses to question 2 in Figure 6.1) would be higher in the barge-in/embodied condition than the non-embodied/non-barge-in condition. We believed the barge-in/embodied condition comes closest to simulating playing the game with a human clue-giver and therefore would be perceived as more natural by players. 143 Table 6.11: Participant Interaction Statistics Condition # of Participants Avg. # of Rounds Played in Condition Embodied Barge-In 17 5.9 Embodied Non-Barge-In 15 5.9 Non-Embodied Barge-In 14 6.3 Non-Embodied Non-Barge-In 6 8 6.2.4 Results A summary of the results from this evaluation follow. All statistical tests discussed in this section are two-tailed un-paired independent t-tests. Participant Interaction Statistics We collected data from 52 participants. Due to technical issues not all participants completed the 8 required rounds. Table 6.11 shows number of people who interacted with the system in each of the 4 evaluation conditions and the average number of rounds played by participants in that condition. Behavorial Results We find evidence to support hypothesis 1, that the number of user- initiative barge-in moves recognized by the agent would be higher in the barge-in condition vs the non-barge-in condition, for both correct and skip dialogue moves. We found a (trending) sig- nificant difference for the number of correct guess utterances recognized in the barge-in (n=31) 144 ( M=1.4, SD=0.92) vs the non-barge-in (n=21) (M=1.0, SD=0.65) conditions (approaching sig- nificance (t(49.88)=-1.89, p= .064) We found a significant difference between average # of skip utterances recognized by the test-bed agent per round for the barge-in (M=1.28,SD=1.22) vs the non-barge-in (M=0.58,SD=0.73) conditions (t(49.39)=2.32, p= .024). We note there are two possible reasons more correct guesses were recognized in the barge- in condition. First, as mentioned, correct guess is a user initiative barge-in and therefore the test-bed agent can recognize it while speaking in barge-in mode. Second, players in the barge-in condition skipped or “moved on” significantly more than players in the non-barge-in condition. It is likely this “moving on” came at times players felt they could not make a correct guess in a reasonable amount of time which afforded the test-bed agent more time in which to give clues for new target-words that players might have had a better chance of guessing correctly quicker. Subjective Perception Results Subjective results were mainly calculated based on answers to a post-survey filled out by participants. We acknowledge a possible confound to these results is the different number of rounds played by the participants. However, since our main finding related to hypothesis 3 (see below) compares the embodied and non-embodied conditions where the number of average rounds played by participants in each condition differed by less than one this should not be too significant an issue. We did not find evidence for hypothesis 2, that user enjoyment would be higher in the barge-in condition vs the non-barge-in condition. The user enjoyment for the barge-in condition (M=2.52, SD=1.29) was almost the same as the user enjoyment for the non-barge-in condition (M=2.52, SD=1.43). 
The overall average enjoyment for all participants for this experiment was also ba- sically the same as for these two conditions (M=2.52, SD=1.32) indicating the test-bed agent performing the clue-giving role was not yet able to elicit enjoyment in this initial evaluation from users. Note this was similar to the overall average enjoyment for the subset of participants who 145 completed the full 8 rounds (M=2.63, SD=1.40) providing evidence that the technical failures did not play a large issue in perceived enjoyment. We did find evidence to support hypothesis 3, that the perceived naturalness of the agent voice would be higher in the embodied condition vs the non-embodied condition. A significant difference was found between the embodied and non-embodied conditions for the question “How natural did you find the voice?” on the post survey for the embodied condition (n=32) (M= 2.8, SD=1.36) vs the non-embodied condition (n= 20)(M=2.0 SD=1.3 )(t(41.87)=2.09, p= .04). This provides evidence that synthetic voices are found to be more natural when spoken by a realistic human avatar (with some basic non-verbal behavior) compared to a disembodied voice in the game context. We did not find strong evidence to support hypothesis 4, that the perceived naturalness of the clue-giver would be higher in the barge-in/embodied condition than the non barge-in/non embodied condition. However, the difference for participants in the embodied condition (M= 2.3, SD=1.21) compared to ones in the non-embodied condition(M= 1.75 , SD=1.30) in response to the question “How natural did you find the clue-giver?” approached significance (t(42.69)=1.80, p= 0.07) indicating embodiment might be more important than barge-in for designing a game that felt more similar to the experience from a human-human game, though this would require further investigation. 6.2.5 Implications for Test-bed Agent & Summary This evaluation provides evidence of synergies when the agent is endowed with both embodiment and incremental language processing capabilities with each capability seeming to add some value 146 and help towards creating a better overall experience. This supports our decision to use the incre- mental virtual human embodiment implementation of the architecture described in Chapter 4 for the final multi-role evaluation described in Section 9.2. 6.3 On-line Game Framing & Feedback Evaluation The activity analysis (see Chapter 3) indicated that feedback was frequently employed by players which likely positively influenced player’s subjective perceptions of their partner and the interac- tion. In this section we discuss a comparative evaluation of gain and loss framings (and associated feedback types) for the word-guessing game. The original intent of the study was to see if the game framing and associated feedback could be tailored to a particular user attribute, their reg- ulatory orientation. Regulatory focus theory (RFT) asserts that people differ in how they pursue goals (promotion focus individuals attend to gains, whereas prevention focus individuals are more impacted by losses). The theory also maintains that subjective perceptions, motivation and per- formance are enhanced when there is a match (or “fit”) between this goal-pursuit orientation and the nature of the interaction. Ultimately, the study demonstrates that holding the game framing constant (gain frame) for all users has positive interaction effects. 
This study also serves as an ex- ample of an experiment whose novelty is partially achieved by leveraging the full-stack interactive architecture described in Chapter 4. Recently, there has been some work involving RFT and non-embodied=non-speech capable artificial Game-Agents [Faur et al., 2015a, Faur et al., 2015b]. However, to the best of our knowl- edge, no one has explored if knowledge of a player’s regulatory focus (RF) can be leveraged in an automatic game that employs state-of-the art embodiment and speech technologies to create regulatory fit. 147 In the study we described here, participants were pre-screened into promotion-focused and prevention-focused categories based on their answers to the RF Questionnaire [Higgins et al., 2001]. Participants then play the game in one of 2 distinct frames; a gain frame where participants earn a point for each correct guess and a loss frame where participants do not lose a point for each correct guess. Our results, consistent with past work [Freitas and Higgins, 2002], showed higher perceived task-success when participants play the game in a ”fit” condition. Inconsistent with previous work on RFT, we did not find greater task-enjoyment nor improved (actual) task-success in the participants in the “fit” conditions. Players who played a gain version of the game had marginally significant higher enjoyment, regardless of their regulatory focus. We operationalized motivation by number of optional rounds played but failed to find a “fit” effect. We did find higher motivation in players who achieved early success (scoring more points in initial rounds) which could have overridden expected motivation “fit” effects. Early success was significantly correlated to number of optional rounds played. This finding is consistent with common game design principles [Lovato, 2017] and findings from the study in [Marsella et al., 2009]. Our results suggest the inherent framing and nature of a task should be considered carefully in design as it could override otherwise expected benefits to designing a game based on RFT. 6.3.1 Method This study also used the incremental virtual human embodiment implementation of the archi- tecture described in Section 4.2 performing the clue-giving role. The dialogue manager of this implementation was modified so that the agent could support both gain and loss framings for the game’s clue-giving role. Participants saw a monitor displaying a similar screenshot to the one shown in Figure 4.3 in Section 4.2. However, only the Game-Info GUI and ASR Hypothesis GUI 148 were shown for this experiment. Participants were required to play 4 rounds and then given the option to play up to 4 more rounds. Each round lasted 75 seconds and included clues for up to 5 target-words. Each game consisted of clues given in the same order for the same target-words. The clues were selected from corpus mach described in Section 5.1. A player was able to skip a target-word if they were unable to guess correctly within the time limit but could not return to a target-word if they skipped. Feedback Participants played 1 of 2 versions of the game: the gain frame where players started with no points and increased their score by 1 point every time they guessed correctly, and the loss frame where players started with 20 points (for the 4 required 5 target-word rounds) and their score decreased by one point every time they skipped or failed to guess a target-word correctly before time was up. 
In the loss frame players were given an additional 5 points in the beginning of any optional round they elected to play. We modified the agent’s dialogue manager so that it would trigger the agent and game-judge to provide feedback to players in order to re- enforce the gain-loss frame. There were 2 times feedback was given in the game, online feedback was given during game- play and offline summary feedback at the end of each round. Online feedback was given when the participant provided a correct guess or if the participant decided to skip. Feedback was com- municated in 2 ways to the participant; via speech through Mr. Clue and the GJ, as well as visually, via the window containing game information. The specific description of how each feedback type was realized at different times and via different modalities can be found in Table 6.12. Example clues and feedback can be seen in the sample gain-frame interaction and sample loss-frame interaction shown in Figure 6.2. In the gain-frame interaction (bottom-half of figure) the participant gave two wrong guesses for the target-word “iron” after hearing one clue. The participant then heard a 2nd clue for ”iron” and then decided to skip at which point the test-bed 149 Table 6.12: Feedback Realization Feedback Trigger Speech Realization Visual Realization Gain Frame Online: Correct Guess GA: “You have guessed correctly and gained 1 point.” Round Score incremented on GUI Online: Skip GA: “You skipped and will not gain 1 point for this target.” None Offline: Summary (at end of round) For each target guessed correctly: GJ: “Target number<X> was <TargetWord>. You guessed correctly and gained a point.” For each target not guessed correctly: GJ: “Target number<X> was <TargetWord>. You did not guess correctly and did not gain a point.” For each correct guess Total Score incremented on GUI Loss Frame Online: Correct Guess GA: “You guessed correctly and will not lose a point.” None Online: Skip GA: “You skipped and lose 1 point for this target” Round Score decremented on GUI Offline: Summary (at end of round) For each target guessed correctly: GJ: “Target number<X> was <TargetWord>. You guessed correctly and did not lose a point.” For each target not guessed correctly: GJ: “Target number<X> was <TargetWord>. You did not guess correctly and lose 1 point.” For each target not guessed correctly Total Score decremented on GUI 150 Figure 6.2: Sample Round Dialogues agent gave gain feedback and a clue for a new target-word, “icicle”. In the loss-frame interaction (top-half of figure) the participant was able to guess the first target-word, “wire”, after hearing two clues from the test-bed agent. Note the <Silence> participant line indicates a time when the participant was silent after the agent was finished speaking for a threshold time set via the s g parameter, the in-game silence threshold that triggers the test-bed agent to give a new clue for the same target-word after s seconds of silence (discussed in Section 4.2.1). The agent then gave loss feedback for the correct guess and a clue for a new target-word (“bead”). The participant was silent for the threshold time once again triggering the agent to give another clue for the target-word “bead”. The participant then made an incorrect guess (“crystal”) and the game agent provided a third clue for “bead” before the round ended. Experimental Design We conducted a between subject 2x2 factorial experimental design as seen in Table 6.13. 
Participants were first labeled as either a prevention person or a promotion 151 Table 6.13: Experimental Design Regulatory Fit Feedback Type Promotion Prevention Gain Promotion player receiving gain feedback (Fit) Prevention player receiving gain feedback (No-Fit) Loss Promotion player receiving loss feedback (No-Fit) Prevention player receiving loss feedback (Fit) person based on a pre-survey [Higgins et al., 2001]. Participants were eligible if they showed a large score difference between promotion and prevention sub-scales. A total of 1513 responses were received; participants scored average 2.9 on prevention sub-scale and 3.9 on promotion sub- scale, indicating a skew towards promotion focus. Accordingly, a difference score of 1 served as the cut-off for inclusion among promotion-focused individuals, whereas a difference score of 0.5 served as the cut-off among prevention-focused individuals. A total of 76 participants qualified and completed the experiment. After exclusion due to technical failures, 59 participants (30 males and 29 females) remained. Thirty participants were assigned to the fit condition (16 promotion-focused in gain-frame, 14 prevention-focused in loss- frame), and 29 in the non-fit condition (17 promotion-focused in loss-frame, 12 prevention- focused in gain-frame). Each group of people played one version of the game receiving either gain feedback or loss feedback. Based on [Higgins et al., 1997] promotion people in gain frame and prevention people in loss frame were in regulatory-fit. Besides the questions from [Higgins et al., 2001], that were used to assess a participants reg- ulatory focus, the pre-survey also contained one additional question asking how confident the participant was on a 1-5 scale in their word-guessing game ability. After participants finished 152 1. I am embarrassed about my game-performance 2. I am a good guesser when playing word-guessing games. 3. I feel more confident in my word-guessing game ability after playing the game. 4. The<game agent> is good at giving clues. 5. If I did not guess a word it’s more so because of the quality of the clues than the quality of my guesses. 6. I wouldn’t care if people saw a video of me playing this game. 7. I feel smarter than I did before playing the game. 8. I enjoyed the game. 9. I would recommend the game to a friend. 10. If I played the game again I would rather play with this <game agent> than an unknown partner. 11. If I had to play another game I would choose this game rather than an unknown game. 12. I would play the game again for fun. Figure 6.3: Post-Survey filling out the pre-survey, the experimenter read instructions. The instructions emphasized the point structure of the appropriate version of the game. The instructions also highlighted the func- tionality of the automatic speech recognizer (ASR) hypothesis GUI and that participants should repeat themselves in cases of mis-recognition. Participants sat in a standard rolling chair in a room with no background noise. Seated participants were placed approximately 2 feet from a widescreen monitor displaying the virtual humans and auxiliary guis. Participants spoke into a wireless Sennheiser microphone. Audio files were recorded and all relevant game actions stored in a database. Similar to the embodiment and incremental processing evaluation (see Section 6.2), following the game participants reported their subjective experience on scales from 1 (strongly disagree) to 5 (strongly agree). The post survey given can be seen in Figure 6.3. 
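For illustration, the following sketch shows the screening and fit-assignment logic implied by the description above. Only the cut-offs (a promotion minus prevention difference of at least 1, or a prevention minus promotion difference of at least 0.5) and the definition of regulatory fit come from the text; the function names and the example values are invented.

# Hypothetical sketch of participant screening and fit assignment; not the
# actual study code, and the questionnaire scoring details are placeholders.

def classify_regulatory_focus(promotion_score: float, prevention_score: float):
    """Label a participant from their RF Questionnaire sub-scale means, or None if ineligible."""
    if promotion_score - prevention_score >= 1.0:
        return "promotion"
    if prevention_score - promotion_score >= 0.5:
        return "prevention"
    return None  # difference too small: participant not eligible for the study

def is_regulatory_fit(focus: str, frame: str) -> bool:
    """Fit = promotion-focused player in the gain frame, or prevention-focused player in the loss frame."""
    return (focus == "promotion" and frame == "gain") or \
           (focus == "prevention" and frame == "loss")

# example: a participant averaging 4.2 on promotion and 2.9 on prevention,
# assigned to the loss frame, would be a promotion-focused "No-Fit" player
print(is_regulatory_fit(classify_regulatory_focus(4.2, 2.9), "loss"))  # False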
Besides the subjective measures from the pre- and post-surveys, we also examined objective game-play measures from recorded game behaviors. We discuss two measures in particular in our results: how well participants did (performance), operationalized by the average number of correct guesses per round, and how long they played for (motivation), operationalized by the number of rounds played.

Hypotheses

Based on previous work on regulatory-fit, we had 4 main hypotheses for this experiment.

H1: Players in "Fit" conditions would have enhanced performance (i.e., higher task-success when operationalized by average round scores), as was shown in the tasks described in [Shah et al., 1998].

H2: Players in "Fit" conditions would play more optional rounds, as their motivation and willingness to continue would be higher than that of players in "No-Fit" conditions, as was the case for participants in [Freitas and Higgins, 2002] and for employees surveyed about their turnover intention in [Hamstra et al., 2011].

H3: Players in "Fit" conditions would have higher perceived task-success as reported on the post-survey than players in "No-Fit" conditions, as was the case in the essay task in [Freitas and Higgins, 2002]. We use question 3 on the post-survey shown in Figure 6.3 as a proxy for perceived task-success to investigate this hypothesis.

H4: Players in "Fit" conditions would report higher task-enjoyment on the post-survey (question 8 as seen in Figure 6.3), as was the case in [Freitas and Higgins, 2002].

6.3.2 Results

With the exception of perceived task-success, most of the results did not corroborate previous work on regulatory-fit. We conducted a post-hoc analysis that provided evidence that the gain version of the task and achieving early success might have overridden the impact of regulatory-fit.

The average number of points per round for the 33 players in the "Fit" condition was 0.96. The same measure for the 26 players in the "No-Fit" condition was 0.88. Although this is consistent with H1, the difference was not significant and thus we did not find strong support for H1. The average number of rounds played in the "Fit" condition was actually lower than the average number of rounds played in the "No-Fit" condition (6.9 and 7.5 respectively, not significantly different). Thus, we did not find support for H2.

We did find support for H3. There was a significant difference in the scores for the question "I feel more confident in my word-guessing game ability after playing the game" between the "Fit" (M=2.55, SD=1.01) and "No-Fit" (M=1.92, SD=0.73) conditions, two-tailed t-test (t(56.6)=2.58, p=.012). This finding is consistent with results from previous studies showing that perceived success is higher when there is regulatory-fit. Finally, we were not able to find support for H4. Scores given on the post-survey for "I enjoyed the game" by "Fit" players (M=3.18) and "No-Fit" players (M=3.5) actually showed the opposite trend from what we expected (not significant).

Since some of our findings did not support previous regulatory-fit work, we conducted an exploratory analysis, discussed next, that presents hypotheses as to why this study might have failed to replicate some previous findings on regulatory fit.

Exploratory Analysis

The results from our exploratory analysis suggest that the inherent framing and the impact of achieving early success might have overridden the expected regulatory-fit effects.
When participants were given points in the loss frame to start with no story was provided that explained why they started with the given points. It seemed reasonable that the naturalness experienced by players in the gain frame, the inherent framing of the game, might override any expected effects we would normally expect for the “Fit” conditions. Indeed, when we looked at enjoyment scores reported by players on the post-survey; the 25 players in the gain frame (M=3.64, SD=1.05) had marginally significant (t(55.86)=1.81, p= .076) greater enjoyment scores than the 34 players in the prevention frame M=3.08, SD=2.20) providing some support for this reasoning as to why we did not find support for H4. On the other hand, we did not find a significant difference when using a two-tailed independent t-test (p=.119) for the 155 average number of points scored per round for participants in the gain frame vs the loss frame which does not explain why we did not find stronger support for H1. We next discuss early success and its relation to H2. Consistent with common game design principles [Lovato, 2017] and results from the study in [Marsella et al., 2009] it seems likely that people who have early success at a task will be more willing to continue the task longer. Our study design afforded us an opportunity to investigate this idea since the design ensures that every participant received the same stimuli (same clues in the same order for the same target-words) and every participant had the option of playing up to 4 optional rounds. We first look at whether or not the average points scored in the first 4 required rounds are predictive of how many optional rounds a player elected to play (our motivation operationalization). We find a significant moderate positive correlation (r=.32, p=.01) (r s =.30,p=.02) between the 2 variables indeed suggesting that the scores in the first 4 rounds were somewhat predictive of how many optional rounds would be played. In order to see whether it was really “early” success that was the predictor, we tested whether or not there was significant correlation between the average scores of participants in the first 2 rounds (as well as rounds 3 and 4) and the number of rounds they elected to play. We found that indeed the average score of the first 2 rounds was significantly moderately correlated (r=.36, p=.005) (r s =.38,p=.003) to the number of optional rounds played (and the average of the 3rd and 4th round scores was not). Another possible confound is different ASR error rates, if people who had early success were more likely to be recognized correctly by the ASR. However, in this case we would also expect higher success in the later rounds, which was not found. Moreover, we transcribed a random sub- sample (16 participants) whose data reflected a significant correlation between average number 156 of points scored in first 4 rounds and number of total rounds played and calculated those partic- ipant’s Word Error Rates (WER). We then ran a partial correlation (controlling for WER) and found the average number of points scored for the first 4 rounds as well as the first 2 rounds for this sub-sample remained marginally moderately positive; (p=.058) and (p=.033) respectively. 6.3.3 Test-Bed Agent Implications This experiment provides support that our choice of game framing with gain feedback was a supe- rior choice to loss framing with loss feedback for the game in the multi-role evaluations described in Chapter 9. 
This experiment also serves as an example of a type of experiment enabled by the full-stack interactive architecture described in Chapter 4. Much of the novelty of this experiment is owed to its use of a relatively ecologically valid interaction test-bed. Finally, we note that the average enjoyment for all participants in this experiment (M=3.32) was significantly higher (t=3.32, p=.0012) than the average enjoyment for participants in the incrementality/embodiment evaluation described in the last section. This indicates that while the test-bed agent's clue-giving role did not yet elicit strong enjoyment from users, there was improvement over the first evaluation.

6.3.4 Summary

We explored whether knowledge of a player's chronic regulatory focus could be leveraged by a fully automated virtual human in a game to create regulatory-fit effects. Our results, consistent with past work, showed higher perceived task-success when participants play the game in a "fit" condition. We provided evidence in a post-hoc analysis that general-purpose design guidelines (gain frame and early success) enhanced enjoyment and motivation for players. Personalization design guidelines based on regulatory focus theory did not seem to result in enhancements of enjoyment, performance, or motivation. Our results suggest the inherent framing and nature of a task should be considered carefully in design, as it might override expected benefits of designing a game based on regulatory focus theory.

6.4 Summary

This chapter presented 3 comparative evaluations that support design decisions made for the test-bed agent and reduce the chance that poor design decisions around these dimensions obscure our ability to investigate our main research questions around multi-role dialogue agents. The first experiment provides motivation and support for our choice of synthetic voice.

The second and third experiments featured live evaluations of the test-bed agent leveraging the architecture for enabling interactivity described in Chapter 4. The second experiment showed that there are synergistic interaction benefits to endowing the agent with incremental language capabilities and virtual human embodiment. The third experiment provides evidence that the gain framing with gain feedback version of the game fosters a more positive interaction than the loss framing with loss feedback version of the game for the average user.

The third experiment is an example of an experiment whose novelty is at least partially owed to using the fully interactive architecture. It also demonstrates that the overall interaction enjoyment experienced by users interacting with the test-bed agent performing the clue-giving role increased compared to the earlier embodiment/incrementality evaluation, and that our choices around clue generation and text-to-speech are not negatively impacting the interaction to an extent that causes participants to give negative enjoyment evaluations of the interaction.

Chapter 7

Guessing Role Content Generation

"Guessing before proving! Need I remind you that it is so that all important discoveries have been made?" - Henri Poincaré

In this chapter we explore automated methods that can be used to power a guessing generator for an interactive agent performing the guessing role of a word-guessing game, such as the one that appears in the agent architecture described in Chapter 4. The first set of methods we explore uses a variety of online machine resources.
There are many existing online machine resources that approximately model the relationship between many of the types of clues, such as the ones defined in Table 3.1 in Section 3.3, and the clues' associated target-words. For example, lexical databases like WordNet [Miller, 1995] organize words into networks which a user can query with a word in order to obtain a list of words that have a certain linguistic relation with the given word. These linguistic relations generally map to the common clue types used by clue-givers in word-guessing games (e.g., antonyms to antonym contrast and hyponyms to hyponym). Another example is online encyclopedias like Wikipedia, which are comprised of webpages whose first sentence usually defines the webpage's associated word. These sentences can generally be mapped to the Description Definition clue type. For this reason, it seemed worthwhile to explore leveraging these resources to power a guessing generator for an automated agent.

Although the machine resource methods showed some promise for successfully producing high quality guesses, a combination of engineering challenges and concern over performance in the fully interactive game (see Section 7.1.6) led us to explore an alternative guessing method. This second method more directly models the relationship between a clue and a given target-word. The method relies on a classification technology that uses cross-language information retrieval techniques [Lavrenko et al., 2002] and is used in many question-answering dialogue character systems [Leuski and Traum, 2010] to map the best available response to a new line of dialogue. This second method was ultimately used by the automated test-bed agent in the multi-role evaluation described in Chapter 9.

The rest of this chapter is organized as follows. Section 7.1 presents more details on the online machine resource methods, including an off-line evaluation of their performance. Section 7.2 discusses the classification technology method in more detail, including an offline evaluation of its performance.

7.1 Machine Resources for Guessing

We present the first set of guessing methods in Section 7.1.1, which query online resources including popular online search engines, encyclopedias, dictionaries, and pre-compiled lexical databases to produce guesses. Section 7.1.2 describes information retrieval ranking metrics we employed in order to re-rank the guesses returned by these resources when they are used individually, as well as to create ensemble guessing techniques that make use of the resources in combination. Section 7.1.3 discusses an offline evaluation of these guessing techniques. Section 7.1.4 summarizes a small data collection effort we conducted in order to have measures of human guessing performance that we could directly compare to the results of this off-line evaluation. Section 7.1.5 compares the performance of the machine resource guessing methods with human guessing performance. Finally, Section 7.1.6 presents conclusions and implications for the test-bed agent based on this analysis.

7.1.1 Resources & Methods for Guessing

Here we provide an overview of the online machine resources (and the methods used to query them). In order to extract more meaningful semantic content from clues (content that helps to distinguish clues from one another), the clues undergo a normalization process. In order to extract words from the text returned from a resource that are more likely to be a correct guess, the resource returned text also undergoes a similar normalization process.
For example, stop-words are removed from both clues and resource returned text. They are removed from clues because stop-words typically don’t help distinguish clues from one another and likely wouldn’t help the resource associate a clue with correct guess. They are removed from resource returned text because these types of words are not considered possible correct guesses based on the type of target-words used in the game. Normalization for both clues and text returned by a resource involve removing punctuation and stop-words . Also, all words are converted to lower case and stemmed y . For example, after normalization the raw clue “it’s a storage container for coca-cola” becomes “storage container coca cola. ”. Normalization for the raw text returned by a resource also involves removing any words that appeared in the query from the result (consistent with the word-guessing game rule Stop-words from python package NLTK y Porter Stemmer from python package NLTK 161 that a clue should not contain any form of the target-word). The process that enables resources to produce guesses for a given clue can be summarized in 3-stages. First, a given clue in textual form is normalized. Second, the normalized clue is used to query the resource. Depending on the resource this is either one query consisting of the whole clue or several queries with each query consisting of a constituent clue word. Third, the text returned by the resource is normalized which results in a ranked guess list (ranked in the same order the word appeared in the results). Individual Resource Guessing Methods Table 7.1: Guessing Resources Resource Description Method Link Google Popular online search engine. Scraped top 5 result titles returned from querying with whole clue. https://www.google.com/ Google Autocomplete Google’s deprecated search query autocomplete API. Scraped all returned results when queried with first 100 characters of the clue. http://suggestqueries.google.com/ complete/search?client=firefox&q=query WordNet Lexical database that groups words according to conceptual- semantic and lexical relations. Used NLTK Python library to query for synonyms of each clue word, and returned concatenated list. https://wordnet.princeton.edu/ Never Ending Language Learner (NELL) Online knowledge base that continuously mines the world wide web to create structured information from unstructured webpages. Queried API with each clue word and returned concatenated list. http://rtw.ml.cmu.edu/rtw/json0doc Word Association Online database that groups words according to “psychological perception, sense, and meaning.” Scraped all returned results when queried with each clue word, and returned concatenated list. https://wordassociations.net/en DuckDuckGo Online search engine that protects user’s privacy and as a result returns non-personalized results. Scraped top 5 result titles returned from querying with whole clue. https://duckduckgo.com/ 162 DuckDuckGo Previews DuckDuckGo’s Instant Answers API. Queried with whole clue and returned first 5 article previews. https://api.duckduckgo.com/ ?q=query&format=json Bing Online search engine created by Microsoft. Scraped top 5 result titles returned from querying with whole clue. https://www.bing.com/ Wikipedia Online encyclopedia created and edited by volunteers. Scraped top 5 result titles returned from querying with whole clue. 
https://www.wikipedia.org/ Contextual Web Online search engine that reportedly uses “the most efficient indexing method which mimics the way the human brain indexes memories.” Queried with whole clue and returned search engine results. https://contextualwebsearch.com/ Contextual Web Related Words Contextual Web’s “Related Words” feature. Queried with whole clue and returned all related words. https://contextualwebsearch.com/ Contextual Web Autocomplete Contextual Web’s autocomplete feature. Queried with whole clue and returned autocomplete results. https://contextualwebsearch.com/ The individual resources leveraged include popular online search engines, online encyclopedias, lexical databases, and knowledge-bases. Sometimes we made use of special functions the resource had available as an alternative guessing method. A full list of the resources used can be found in Table 7.1. The first column of this table provides the name of the resource, the second column provides a brief description of the resource, the third column details how the raw content (either word lists returned via an API or word lists obtained by scraping the content from a web-page) returned from the resource is processed to produce a final candidate list of guesses, and the fourth column provides a URL where the resource can be found. We define some vocabulary to help us provide an overview of these individual methods. We refer to the example result set in Figure 7.1 for purposes of explanation. This result set is the out- put of querying the resources Wikipedia and Word Association with the normalized clue “storage container coca cola” (target: bottle). These results have undergone the normalization process just described. Sometimes a resource was better suited for being queried with a whole clue, as is 163 Normalized Query Clue: “storage container coca cola” Target: bottle Wikipedia Results (title #: normalized title) 1: amatil 2: plastic container 3: breakmate 4: container deposit legislation australia 5: list bottles types brands companies Word Association Results (query word: query result) storage: retrieval gb disk locker reservoir warehouse container: shipping terminal pallet cargo packaging freight cocoa: pepsi coke cocaine colon beverage bottle grower cola: pepsi coke beverage soda bottle nike syrup sprite Figure 7.1: Example Result Set the case for the resource Wikipedia whose result set is shown in the top half Figure 7.1 below the query clue. Other times a resource was better suited for being queried multiple times with each constituent clue word as is the case for the resource Word Association thats result set is shown in the bottom half of the Figure. Some resources returned multiple results in a ranked order (e.g. -as be seen in the Figure Wikipedia returned a ranked list of titles) for an individual query. We define sub-lists as a list of words returned that are associated with an individual query or scraping action. For Wikipedia each returned title is a sub-list. For word association each constituent clue word used to query Word Association returns a sub-list. A resource list is a list of all sub-lists returned by a given resource for a given clue (which might involve multiple 164 query/scraping actions made). For Wikipedia the resource-list is the list of titles (wikipedia’s sub- lists). For word association the resource list is the list of returned query lists (word association’s sub-lists). 
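To make the normalization step and the sub-list/resource-list structure just described concrete, here is a minimal sketch in Python. It assumes the NLTK stop-word list and Porter stemmer mentioned in the footnotes; the helper name and the example inputs are illustrative rather than the actual implementation.

```python
import string
from nltk.corpus import stopwords       # requires nltk.download("stopwords")
from nltk.stem import PorterStemmer

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def normalize(text, query_words=frozenset()):
    """Lower-case, replace punctuation with spaces, drop stop-words, stem,
    and finally remove any (stemmed) words that appeared in the query."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokens = text.lower().translate(table).split()
    stems = [STEMMER.stem(w) for w in tokens if w not in STOP_WORDS]
    return [w for w in stems if w not in query_words]

# A clue is normalized before it is used to query a resource
# (stemming may shorten forms, e.g. "storage" -> "storag").
clue_words = normalize("it's a storage container for coca-cola")

# Each returned title or word list is normalized the same way to form a sub-list,
# additionally removing any words that appeared in the query itself.
sub_list = normalize("List of bottle types, brands and companies",
                     query_words=set(clue_words))

# A resource-list is simply the list of sub-lists one resource returns for the clue.
resource_list = [sub_list]
```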
The resource list has an implicit ranking of words corresponding to the rank of the sub-list it appears in as well as the position of the word in the sub-list. We distinguish this raw ranking from the ranked-lists output by the ranking algorithms we describe in the next subsection. In the following text we provide a more detailed description of these resources/methods. We organized these resources and functions by type when there was more than one of that particular type/function otherwise we describe the resource or function individually. Online Search Engines are commonly used by people to locate web-pages with information on a certain topic. When queried, search engine results include a ranked list of web-page titles and an associated text summary, as well as text associated with relevant images or advertisements. Depending on the form of the query and search engine there can be highlighted results that can serve purposes such as defining key query terms. We constructed resource-lists from the returned text from querying the engine with the whole normalized clue; only using words in the returned text that remained after normalizing the top 5 returned result titles (sub-lists). Google, Bing, Duck Duck Go and Contextual Web resources fall under this category. Search Engine Auto Completion Feature are a convenient feature of many search engines that predict in real time how to finish a partial query written by a user. We queried this feature with whole clues for some of the search engines just mentioned. These services returned a ranked list of suggested text that they expect the user is about to type to finish the query. We normalized this list of suggested completions to construct a resource list. Google Autocomplete and Contextual Web AutoComplete fall in this category. 165 Lexical Databases are commonly used by researchers in the human-computer dialogue, lin- guistic, and artificial intelligence communities as research tools/resources. They generally con- tain a large set of words and meta-information about the words (e.g. - definitions, lists of words linguistically related to them such as synonyms or antonyms, and example sentences where the words are used). These resources are queried with constituent clue words and the returned sub- lists are concatenated to make resource lists . Included in this category are WordNet and Word Association. Duck Duck Go Previews is a feature offered by the Duck Duck Go search Engine that allows users to see an excerpt of the returned webpages that is most relevant to the user’s query. The raw text for each of the top 5 previews returned when the feature was queried with the whole clue was normalized to produce a resource-list. Contextual Web Related Words is feature offered by the Contextual Web Search Engine API that allows users to enter a query and see a list of words related to query words. We queried this feature with the whole normalized clue and and the returned sub-lists are normalized to produce a resource-list Wikipedia is an online encyclopedia created and edited by an army of global volunteers. In general one queries Wikipedia in order to get more information on a topic. We queried Wikipedia with the whole clue and normalized the top 5 titles of Wikipedia pages (sub-lists) returned to produce a resource-list. Never Ending Language Learner (NELL) is a knowledge base that contains structured information extracted from unstructured web-pages. 
The resource has been continuously running since 2010 using machine learning to identify relationships between words that it can learn by reading the World Wide Web. The original intent of this project was to enable machines to provide answers to common questions poised in natural language by humans. NELL’s API is queried with 166 Table 7.2: Relationship of Guessing Resource to Clue Type Resource Type How Resource Approximately Models Relationship between Clue Types and Target-Word Likely Useful For Clue Type Online Search Engines 1. Titles usually contain words that commonly co-occcur with given word 1. Partial Phrase Search Engine Auto Completion 1. Results contain words that typically follow given word in common sentences. 1. Partial Phrase Lexical Databases 1. Organizes words based on their linguistic relationship to one another. 2. Contains definitions of given word. 3. Example sentences containing given word. 1. Synonym, Antonym Contrast, Hyponym 2. Description Definition 3. Partial Phrase Duck Duck Go Previews 1. First results usually text with definition. 2. Other results usually text with commonly co-occurring words. 1. Description Definition, 2. Partial Phrase Contextual Web Related Words 1. Returns words “related” to given word. 1. Synonym, Antonym Contrast, Hyponym, Partial Phrase Wikipedia 1. First sentence of page for given word usually a definition of given word. 1. Description Definition Never Ending Language Learner (NELL) 1. Contains facts containing given word 1. Action Description, Description Definition, Synonym, Antonym Contrast, Hyponym constituent clue words and the returned sub-lists are normalized and concatenated to produce a resource-list . Table 7.2 provides a mapping between these resource categories and clue types defined in Section 3.3. The first column of this table lists a resource category, the second column lists a fea- ture or aspect of the organization of the resource that that approximately models the relationship between certain clue types (listed in the third column) and their associated target-words. 167 7.1.2 Guess Ranking Metrics Table 7.3: Guess Re-Ranking Metrics Frequency-Based Metric Specification Metric Applied to Example Result Set Absolute Frequency # of times word appears in any sub-list bottle: ensemble score=3 bottle: wikipedia score=1 bottle: word association score=2 deposit: ensemble score=1 deposit: wikipedia score=1 deposit: word association score=0 etc. Resource Frequency # of resources that produce a resource list with the guess word bottle: ensemble score=2 amatil: ensemble score=1 etc. Rank-Based Metric Specification Metric Applied to Example Result Set Simple Rank summed rank (1 for last word in the list, 2 for 2nd-to-last, etc.) for each sub-list bottle: ensemble score=10 bottle: wikipedia score=4 bottle: word association score=6 pepsi: ensemble score=15 pepsi: wikipedia score=0 pepsi: word association score: 15 Rank Size rank on each sub-list divided by size of the sub-list summed over all sublists bottle: ensemble score=1.59 bottle: wikipedia score=0.8 bottle: word association score=0.79 pepsi: ensemble score=2.00 pepsi: wikipedia score=0 pepsi word association score=2.00 Mean Reciprocal Rank (MRR) summed MRR (1/1 for first word in the sub-list, 1/2 for second, etc.) 
for each sub-list bottle: ensemble score=0.87 bottle: wikipedia score= 0.5 bottle: word association score=0.37 pepsi: ensemble score=2.00 pepsi: wikipedia score=0 pepsi: word association score=2.00 In an attempt to improve the rank of a correct guess on a final ranked-list we used common information retrieval ranking methods to re-rank the resource-lists of individual resources as well 168 as a means to create ensemble techniques that produced a final ranked-list that leverages all of the resource’s individual resource-lists. These ranking algorithms take as input one or more resource- lists and output a numeric score which can be used to produce a final ranked-list with higher scoring words being given higher rank on the final ranked-list. These ranking algorithms and their definition can be found in Table 7.3. Frequency-based metrics appear in the top-half of the table and rank-based metrics appear in the bottom-half of the table. The first column of this table provides the name of the ranking algorithm, the second column explains the algorithm, and the third column provides the results of applying the ranking algorithm to the results from the example result set shown in Figure 7.1 for the normalized clue “storage container coca cola” (target: bottle) for a couple example returned words. The ranking calculations presented in Table 7.3 are done assuming that the only two resources queried are Wikipedia and Word Association. Note, the scores in the 3rd column of the table are both ensemble scores and individual resource scores. The ensemble scores apply the metrics to all sub-lists for all the resource-lists which allows production of an aggregated final ranked-list that makes use of all of the resource-lists which is useful for quantifying the performance of a guessing generator that leverages all of available resources.. The individual scores apply the metrics to the sub-lists for each individual resource-list which allows for production of final ranked-list for each individual resource which is useful for comparing the ability of an individual resource to produce good guesses. Note individual resource scores are not applicable in the case of the Resource Frequency metric. 7.1.3 Offline Evaluation of Guessing Resources Here we discuss an offline evaluation of guessing methods that employ these resources and re- ranking algorithms. 169 Clues for Evaluation In order to evaluate the ability of these machine resources to produce correct guesses we needed clues to query them, ideally ones that come from an actual game situation where a human provided clues to a guessing agent. To this end we used the “Wizarded” Agent Elicited Human Clue Dataset (see Table 1.2 in Chapter 1 ).Note that some of these clues appeared in a clue sequence for a particular target-word allowing us to evaluate whether querying resources with concatenated clues that appeared in a sequence led to a better performing guessing method. Even though the test-bed agent was “wizarded” during the evaluation where these clues were generated, the user’s ASR hypotheses were recorded using an automatic speech recognizer, and thus are indicative of the clues that would be heard by a guessing agent in automated gameplay. However, as noted in Table 1.2 we did perform a minimal amount of manual intervention on these clues. We filtered out non-clue dialogue move (such as acknowledgements) according to the taxonomy described in Section 3.3. We also concatenated segmented ASR hypotheses that appeared to be partial clues. 
These pre-processing steps should be able to be automated without too much additional work by creating a machine learning classifier that identifies clues/non-clues and implementing a segmentation model that is trained on a corpus of similar clues annotated with labels identifying partial clues. In the end, we had a list of 1,032 clues for 419 different target-words to use for this evaluation. Example clues and their associated target (in parentheses) can be found in Table 7.4.
Table 7.4: Example Clues
Clue (target)
You use these to get strong (weights)
Not the mind but the blank (body)
sparkly (jewelry)
Evaluation Metric We evaluate both individual and ensemble methods based on their average ability to produce the target-word in their top 5 ranked guesses when queried with the previously described set of clues, and call this the top5 score. We looked at only the top 5 guesses for 2 reasons. First, observations of the Rapid Dialogue Game Corpus (see Section 3.5.1) indicated that players made 4 or fewer incorrect guesses for a given clue before a new clue was said by the clue giver, either player skipped, or the guesser made a correct guess. Second, we felt 5 was a reasonable upper limit of incorrect guesses that an agent could output for a single clue without irritating the user. Statistical Analysis All statistical tests in this section were done as follows unless otherwise noted. We first performed Shapiro-Wilk (W) normality tests [Shapiro and Wilk, 1965] on all variables. The results of the Shapiro-Wilk tests yielded low p-values (< 0.01) for all of the variables, which led us to reject the null hypothesis that the variables are normally distributed. The results of the Shapiro-Wilk tests led us to use non-parametric statistical analysis. We used the omnibus Kruskal-Wallis (H) test [Kruskal and Wallis, 1952] to see if there are overall differences between the 3 experimental conditions for all variables. When the null hypothesis was rejected, we then proceeded to conduct post-hoc pairwise group comparisons via the commonly used Dunn test [Dunn, 1964] to determine the sources of difference. [Zar, 2013] states that the Dunn test works for unequal sample sizes. For all tables the following notation is used. A number followed by a * indicates p<.05, followed by ** indicates p<.01, and followed by *** indicates p<.001. Individual Resource Evaluation & Results Our baseline guessing method simply queries the individual resources with all 1,032 clues just described. We calculated the top5 score with no re-ranking (using the raw ranking from the resource-list). These results can be seen in the second column of Table 7.5 for the top5 score for each individual resource (listed in the first column). Overall, this baseline method performs poorly at being able to produce a correct guess for a given clue within 5 guesses, with the best performing resource (Bing search engine) only being able to do so for slightly less than 7% of clues. In an attempt to improve upon this baseline we re-ranked the individual resource-lists for each clue to produce ranked-lists for each metric presented in Table 7.3. The third column of Table 7.5 lists the top5 score for the ranked-lists corresponding to the best re-ranking metric. The best re-ranking metric is listed in the fourth column. In most cases, the use of re-ranking increased the top5 score for an individual resource (only decreasing performance for the WordNet resource).
In the case of Duck Duck Go Search and Word Association the increase in the top5 score was highly statistically significant. While different metrics for different resources seemed to result in the best ranked-lists, absolute frequency was the most commonly highest performing metric and worked best for all traditional search engine resources. Overall, however, even with re-ranking the best performing resource (Duck Duck Go search engine) is still only able to produce a correct guess in 5 guesses for about 10% of clues.
Table 7.5: Individual Resource Guessing Performance
Resource | Top5 Score on Resource Lists | Top5 Score on Best Ranked List | Best Ranked List Metric
Bing | 6.8% | 7.9% | Absolute Frequency
WordNet | 6.3% | 6.1% | MRR
Word Association | 6.3% | 9.9%*** | Simple Rank
Google Search | 6.1% | 8.6% | Absolute Frequency
Duck Duck Go Search | 5.6% | 10.0%*** | Absolute Frequency
Wikipedia | 3.9% | 4.1% | Absolute Frequency
Context Web Search | 3.4% | 3.2% | MRR
Context Web Related Words | 3.3% | 3.4% | MRR
Google Auto Complete | 2.1% | 2.2% | Absolute Frequency
Duck Duck Go Preview | 1.0% | 1.0% | Absolute Frequency
Context Web Auto Complete | 1.0% | 1.0% | Simple Rank
Nell | 0.5% | 0.8% | Rank Size
Table 7.6: Ensemble Methods Guessing Performances
Metric | Top5 Score Individual Clues | Top5 Score Clue Sequences | Top5 Score Filtered Individual Clues | Top5 Score Filtered + Clue Sequences
Absolute Frequency | 14.6% | 18.6% | 25.3% | 29.0%
Resource Frequency | 13.5% | 18.2% | 23.8% | 28.1%
Simple Rank | 10.5% | 12.9% | 21.1% | 24.3%
Rank Size | 14.3% | 17.8% | 24.9% | 28.9%
MRR | 13.1% | 16.9% | 24.1% | 27.9%
Ensemble Evaluation & Results We investigated whether ensemble guessing methods that leveraged all available resources by combining individual resource-lists into final ranked-lists using the metrics from Table 7.3 would outperform individual resources. We evaluated this ensemble method using 4 different methodologies. For the first methodology, in order to directly compare the performance of an ensemble method with the performance of the individual resources we just reported, we simply calculated the top5 score when querying all resources with the full set of 1,032 clues that we used to evaluate the individual resources. The results of the evaluation for methodology 1 can be found in the second column of Table 7.6 for all 5 ranking metrics. Note the highest performing ranking metric (absolute frequency) for the ensemble method was 14.6%, which was significantly higher (Mann-Whitney U test, U=385410.5, p=0.001) than the highest performing individual resource (Duck Duck Go search engine), which had a 10.0% top5 score when its highest performing ranking metric (also absolute frequency) was employed. This provides evidence that simply combining the resources via an ensemble method can significantly increase the performance of a guessing generator. For the second methodology, we evaluated how the ensemble methods performed when queried with a clue sequence as opposed to a single clue, where a clue sequence is the set of clues that were given for a particular target word (possibly by different users) from the set of 1,032 clues. The third column contains the top5 score for this methodology. All of these scores are higher for all ranking metrics than the scores associated with using clues in isolation. Only the resource frequency top5 score is significantly higher for this second methodology. This finding provides evidence that querying these resources with a sequence of clues should elicit more correct guesses on average.
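To make the re-ranking metrics and ensemble scoring concrete, the sketch below shows how the absolute frequency, MRR, and simple rank metrics from Table 7.3 could be applied across resource-lists to produce a final ranked-list, and how a top5 score could then be computed over a clue set. The function and data-structure names (e.g., resource_lists_for_clue) are illustrative, not the actual evaluation code.

```python
from collections import defaultdict

def ensemble_rank(resource_lists, metric="absolute_frequency", top_n=5):
    """Score every candidate word over all sub-lists of all resource-lists and
    return the top_n words as a final ranked-list (highest score first)."""
    scores = defaultdict(float)
    for resource_list in resource_lists:        # one resource-list per resource
        for sub_list in resource_list:          # one sub-list per query/scrape
            for position, word in enumerate(sub_list):
                if metric == "absolute_frequency":
                    scores[word] += 1                         # count every appearance
                elif metric == "mrr":
                    scores[word] += 1.0 / (position + 1)      # 1 for 1st, 1/2 for 2nd, ...
                elif metric == "simple_rank":
                    scores[word] += len(sub_list) - position  # last word in a sub-list scores 1
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

def top5_score(clues, targets, resource_lists_for_clue, metric="absolute_frequency"):
    """Fraction of clues whose target-word appears in the top 5 ranked guesses."""
    hits = 0
    for clue, target in zip(clues, targets):
        guesses = ensemble_rank(resource_lists_for_clue[clue], metric)
        hits += target in guesses
    return hits / len(clues)
```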
Since some word-guessing games involve picking a target word from a set of possible words, we wanted to examine whether scoping the set of possible guesses would significantly improve performance. We note this is a somewhat easier task, as the space of guesses is significantly reduced from the set of all English nouns; however, it does give a measure of how these types of resources would perform if used by a guessing agent in this alternative type of game. To this end, for the third methodology, we filtered the resource-lists according to a set of target-words. The set of target-words used for filtering guesses was created in a two step process. First, a common list of nouns was found on the internet in order to ensure the space of guesses had a representative sample of commonly used words that did not have the hidden biases that might appear in a specially curated list. Second, since it was clearly important that the target-words (correct guesses) appeared on the list, we augmented the list from step 1 to include target-words associated with the clues used for this evaluation described in Section 7.1.3 that did not appear on the original list. The final list of target-words used for filtering had a total of 1,220 target-words. The fourth column contains the top5 score for this methodology. As we expected, constraining the set of possible guesses led to significantly higher top5 scores for each ranking metric. For the fourth methodology, we investigated the performance of the ensemble methods when the resources were queried with clue sequences and the resource-lists were filtered through the pre-determined set of target-words. The fifth column contains the top5 score for this methodology. Note these scores are all higher (though not significantly) than the scores associated with using single clues but filtering the possible guesses. These scores are significantly higher than the scores associated with using clue sequences but not filtering the guesses, as well as the scores for using single clues but not filtering the guesses, for each ranking metric. We also note that the best performing ranking metric for all of these ensemble methods was absolute frequency, suggesting that this would be the best re-ranking algorithm were one to employ these types of resources for an interactive guessing agent. It is important to put these results into context with measures of human guessing ability that are comparable to these measures. We make these comparisons in the next section. These performances should make for a strong baseline against which to compare automated guessers that employ more sophisticated natural language processing and machine learning techniques. 7.1.4 Human Guessing Performance Data Collection In order to understand how the guessing ability of the machine resource methods just discussed compares to human guessing performance we need a comparable measure of human guessing ability. Calculating the human-human measures of guessing performance from Section 3.6.1 on the machine resource data, or alternatively calculating the top5 score on the human-human data, would not result in directly comparable point estimates of guessing performance. Guess evaluation metrics calculated on the human-human data conflate guess generation with dialogue management gameplay decisions about how many possible guesses to actually realize. On the other hand, guess evaluation metrics calculated on the machine resource data don't conflate guess generation with dialogue management gameplay.
For example, simply calculating Average guess quality for the machine resources would not be useful (this value would be close to 0), as an interactive guessing agent that used all available guesses and kept guessing for a given clue would keep outputting guesses for a single clue for the entire round. Clearly, this would not make for an engaging game experience for a human player. Another reason none of the point estimates for these measures are directly comparable is that a different set of clues was involved in the human-human interaction and the machine resource evaluation, and guessing ability is dependent on the quality of the clues provided. This issue can be overcome with a large enough set of clues that are representative of average clue quality. In order to have directly comparable point estimates of measures of performance for both the machine resource guessing methods and human guessing ability, we performed a data collection effort that queried humans in a similar manner to the machine resource guessing method evaluation just discussed. The dataset generated from this effort is called the Turk Guess Set in Table 1.2 in Chapter 1. This allows us to calculate a top5 score (see Section 7.1.3 for motivation for the choice of the top5 score metric) for human guessing ability that was directly comparable to the top5 scores for the machine resource guessing methods reported in the last section. This data collection effort was conducted on Amazon's Mechanical Turk (AMT). The AMT task, called a Human Intelligence Task (HIT) by AMT, requested that Turkers provide up to five guesses for a target word after reading a clue or clue sequence for a target word. We required Turkers to have completed at least 100 previous AMT HITs and have an approval rating greater than or equal to 90%. We requested Turkers to only complete the HIT if they were native English speakers from North America, but we had no way of verifying whether this request was honored. We collected data in textual format. We used the same set of clues and clue sequences from the machine resource evaluation discussed in Section 7.1.3 and asked Turkers to provide up to five guesses for a single clue or a clue sequence. The clues were randomly selected. There were two types of HITs associated with providing Turkers either an individual clue or a clue sequence. Each HIT was composed of 30 clues or clue sequences (for 30 different target words). We included 1 test question in each task asking Turkers to enter a word spelled backwards to allow for automated detection of completed HITs that should be reviewed before being approved. If a Turker failed to answer the test question correctly we manually checked the result to make a determination on whether the Turker had made a "best effort" at the HIT. We instructed Turkers not to go back to previous clues once they had moved on to a new clue and not to consult any outside resources while completing the HIT. In total, 31 different instances of these HITs for individual clues and 7 different instances of these HITs for clue sequences were constructed, producing guess data for a random subset of 930 of the 1,032 clues used in the guessing generator evaluation. Each HIT instance was completed by 3 Turkers, resulting in a total request of 114 (31×3=93 + 7×3=21) assignments. In total 32 HIT instances were rejected before the 114 HIT instances were completed. 7.1.5 Human-Human and Human-Agent Measurement Comparisons We calculated the top5 scores for the "human" resource.
Table 7.7 contains these results, with the second column showing the top5 score for humans when they were exposed to a single clue and the third column showing the analogous measure for when humans were shown a clue sequence for a given target word.
Table 7.7: Human Top5 Scores
Top5 Score: Individual Clues 12.6%, Clue Sequences 36.1%***
The "***" in the third column indicates that, as would be expected, humans had significantly higher top5 scores (Mann-Whitney U test, U=13943, p=4.22*10^-8) when they were shown a clue sequence as opposed to a single clue. We now compare human guessing performance to the performance of the better performing ensemble machine resource methods seen in Table 7.6. The machine resource methods are able to significantly outperform human guessing performance when only a single clue is provided, even when the set of possible guesses is not filtered (second column of Table 7.6), except when the simple rank re-ranking method is used, where the opposite pattern is true. The ensemble methods are also able to outperform (though not significantly) human performance when a single clue is provided to humans but a clue sequence is provided to the machine resources (column three of Table 7.6). For all re-ranking methods except for Simple Rank, the ensemble methods are able to significantly outperform humans when a single clue is provided to the machine resources but the set of guesses is filtered (column four of Table 7.6). The ensemble methods are able to significantly outperform human performance when a clue sequence is provided to the machine resources and the set of guesses is filtered (column five of Table 7.6). On the other hand, humans outperformed all of the ensemble methods when provided with a full clue sequence for a given target-word. This is a somewhat intuitive finding, as one would expect more sophisticated processing is needed when combining information from multiple clues in a clue sequence compared to trying to associate a word with a single clue. For the ensemble methods, when the machine resources are provided a single clue (second column of Table 7.6), a clue sequence (third column of Table 7.6), or a single clue but the machine resources' guesses are filtered (fourth column of Table 7.6), this outperformance is always significant. However, when the machine resources are provided a clue sequence and their guesses are filtered (fifth column of Table 7.6), this outperformance is no longer significant (except for simple rank), suggesting that these ensemble methods are coming closer to being in line with human guessing performance even when humans are provided a clue sequence for a given target-word. These findings suggest that a fully interactive agent performing the guessing role of the game that leveraged an ensemble method to generate guesses, used a re-ranking method other than simple rank, and scoped the possible set of guesses through a pre-determined set of target-words might be able to come close to matching human guessing performance. For variants of the word-guessing game, such as a version of the game where a person was only allowed to provide one clue, these results provide evidence that several of the ensemble methods could outperform a human guesser. 7.1.6 Conclusions In this section we presented individual and ensemble methods that query machine resources to produce guesses. We demonstrated that some of the ensemble methods were able to significantly outperform the individual resource methods.
We showed that for a variant of the word-guessing game where the set of possible guesses is filtered, the ensemble methods come close to matching human performance. We also provided evidence that for a variant of the word-guessing game that only allowed one clue per target-word, the ensemble methods would be able to significantly outperform a human guesser. Therefore, for designers of word-guessing game agents interested in building agents that play these variants of the game, these methods have some promise. However, there are two issues with these guessing methods that led us to investigate an alternative guessing method (see the next section) that we ultimately used to power the guessing generator of the test-bed agent. First, there are certain engineering/cost challenges that exist if one were to try to use these guessing methods to power a guess generator for a fully automated interactive agent. The main engineering challenge is handling the processing of non-static web-resources whose HTML periodically updates. The main cost challenge involves paying for resources that institute pay-walls for frequent use. Second, since this thesis is focused on building as unconstrained and fully interactive an agent as possible, we did not want to limit the game played by the agent to only allow players to provide one clue for a given target-word. Since ensemble guessing methods showed weaker performance in versions of the word-guessing game where players were allowed to provide clue sequences for a target-word, we had less confidence in using the ensemble methods to power the test-bed agent's guessing generator. 7.2 Offline Evaluation of Question Answering Guessing Approach Here we discuss an offline evaluation of the second guessing method, which relies on a classification tool, NPCEditor, that leverages cross-language information retrieval techniques and is commonly used for question-answering dialogue character systems. For the purposes of using this technology as a guessing method one can consider the clues as questions and the target-words as answers. In contrast to traditional question-answering methods, which only use information from prior questions to map a new question to an answer, the NPCEditor uses information from both the questions (clues) and answers (target-words) to map an unseen clue to a target-word. The method leverages a probability model trained on data comprised of clues linked to their associated target-words, which allows the model to be used to calculate a conditional probability that each target-word (that the model is trained on) is a correct guess given a new unseen clue. The rest of this section is organized as follows. We discuss the training and test data we used for the evaluation of this method in Section 7.2.1. Section 7.2.2 presents the results of this evaluation. 7.2.1 Data We also wanted to test this guessing method on clues from actual game play, so we needed a new set of clues for the same target-words to use as a training set. We also collected new clues for training since the original clue and target-word set from the machine resource evaluation had only 2.1 clues on average for each target-word, and we thought it would be difficult for the model to generalize well to unseen clues if it learned from such a small sample of clues for each target-word. In order to collect a larger set of clues for the same set of target-words used for the machine resource evaluation, we conducted a small data collection on Amazon's Mechanical Turk.
The dataset generated from this effort is called the Turk Clue Set in Table 1.2 in Chapter 1. All Turkers who completed the task were required to have a 90% approval rating or higher and have at least 100 approved HITs. The AMT task (HITs) involved providing up to five clues each for a list of 20 target-words. We collected data in textual format for 419 target-words that were used in the pilot evaluation as well as the evaluation of the machine resources described in the last section. We included 1 test question in each task asking Turkers to enter a word spelled backwards to allow for automated detection of completed HITs that should be reviewed before being approved. If a Turker failed to answer the test question correctly we manually checked the result to make a determination on whether the Turker had made a “best effort” at the HIT. Note 182 Table 7.8: NPCEditor Guessing Ability Top5 Top3 Top 45.5% 38.9% 27.3% that no Turker participated in more than one of these tasks. We posted 21 HITs and asked for 5 different Turkers to complete an assignment of a single HIT. We ended up rejecting 76 assignments. Interestingly, it seemed many of these rejections were due to attempts by automated bots to complete the assignment as evidenced by the unnaturalness or nonsensical nature of the clues provided in the assignment. We approved 94 HITs which provided us with 6,403 clues; an average of 15 clues for each of the 419 target-words. These 6,403 clue,target-word pairs were used as the training data for the NPCEditor. We used 203 clues for 97 of the target-words from the “Wizarded” Agent Elicited Human Clue Dataset (see Table 1.2 in Chapter 1) as our test set. These 97 target-words were selected based on the fact that we already had collected filtered machine clues for them (see Chapter 5) and were also the target- words used in the comparative content sourcing evaluation (see Chapter 8). We thought having this type of data for the target-words we were testing this method on might end up being useful later on for further analysis. Like the larger original set of clues and target-words this random subset also had an average of 2.1 clues per target-word. 7.2.2 Results We calculated the top5 score, a top3 score (average ability to produce the target-word in top 3 ranked guesses) , and a top score (average ability to produce the target-word as a first guess) of the NPCeditor method. These results can be found in Table 7.8. 183 We note that the results presented here are not directly comparable to the results just reported for the guessing ability of humans and machine resources as this method has the advantage of guessing from a smaller space of possible guesses (419) than the results reported for any guessing methodology that made use of the machine resources. None the less, examination of the perfor- mance of this method which could output a correct guess in its top 5 guesses for almost half of the unseen clues it was evaluated on provided us with confidence that this method would perform sufficiently well for game-play. We also made this decision based on considering the engineering challenges that would exist if we tried to integrate the machine resources into a fully interactive guessing agent which included dealing with non static web-resources whose html was periodically updated as well as resources that instituted pay-walls for frequent use. 
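NPCEditor itself is a separate tool, but the general clue-to-target-word mapping it performs can be illustrated with a much simpler stand-in: represent each target-word by the concatenation of its training clues, and rank target-words by TF-IDF cosine similarity to a new clue. The sketch below (using scikit-learn) is only meant to convey the flavor of this approach and of the top5/top3/top1 evaluation; it is not the model actually used by the test-bed agent, and the example clue is hypothetical.

```python
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class RetrievalGuesser:
    """Toy clue-to-target-word guesser: one pseudo-document per target-word."""

    def __init__(self, clue_target_pairs):
        docs = defaultdict(list)
        for clue, target in clue_target_pairs:   # e.g. ("you drink soda from it", "bottle")
            docs[target].append(clue)
        self.targets = list(docs)
        self.vectorizer = TfidfVectorizer()
        self.doc_matrix = self.vectorizer.fit_transform(
            [" ".join(docs[t]) for t in self.targets])

    def guess(self, clue, k=5):
        """Return the k target-words whose training clues are most similar to the new clue."""
        sims = cosine_similarity(self.vectorizer.transform([clue]), self.doc_matrix)[0]
        ranked = sorted(zip(self.targets, sims), key=lambda pair: pair[1], reverse=True)
        return [target for target, _ in ranked[:k]]

def top_k_score(guesser, test_pairs, k):
    """Average ability to produce the target-word within the top k guesses."""
    hits = sum(target in guesser.guess(clue, k) for clue, target in test_pairs)
    return hits / len(test_pairs)
```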
We also note that offline evaluations of guessing ability do not fully capture the circumstances of a live evaluation involving a human clue-giver, where the guessing agent's performance would ultimately be affected by the agent's skipping policy and the effect of the agent's responses on a human player's clue-giving and skipping strategies. This issue is somewhat mitigated by the fact that these offline evaluations tested these guessing methods on clues that were actual final ASR hypotheses from interactive game-play (with a small amount of manual filtering or concatenation) and so closely reflect the actual clues that would be received by an automated guessing agent in a live evaluation. 7.3 Conclusion We investigated different automated methods that can be leveraged by an agent performing the guessing role of a word-guessing game to generate guesses. One set of methods involved querying online machine resources. We identified better-performing resources and applied traditional information retrieval ranking algorithms to the raw results returned, showing that when ranking is applied certain resources perform significantly better at the guessing task. We demonstrated that ensemble methods that combine the results from all of the resources via the ranking algorithms significantly outperform a guessing method that simply relies on ranking individual resource results. We also showed that some of the ensemble techniques can outperform or come close to matching human guessing performance for certain variants of word-guessing games. The second method leveraged a classification technology commonly used by question-answering systems and was shown to have strong performance. Due to the second method's stronger performance for the variant of the game played by the test-bed agent and the engineering/cost challenges around using the first method in real time with an interactive agent, we chose to use the second method to power the guessing generator for the final multi-role evaluation of the test-bed agent described in Chapter 9. Chapter 8 Multi-Role Enabled Content Sourcing "People seldom improve when they have no other model but themselves to copy after." Oliver Goldsmith In this chapter we begin to investigate our third research question around multi-role dialogue agents (RQ3): "Are there capabilities possessed by multi-role agents not shared by their single role counterparts that create positive interaction effects for their users?". We do so by identifying and demonstrating a capability, multi-role enabled content sourcing, which is available to multi-role agents and is not shared by their single role counterparts intended for asymmetric interaction. We discussed the theoretical benefits of multi-role enabled content sourcing compared to other content sourcing methods in Section 2.2. As noted, compared to other common content sourcing methods, multi-role enabled content sourcing has several desiderata. These desiderata include that multi-role enabled content sourcing takes relatively less time, is lower cost, has higher scalability, is fully-situated, and is of human quality. Here we empirically demonstrate that these desiderata translate to positive effects on users' interaction behaviors and perceptions in the test-bed agent's domain, a word-guessing game.
186 Through an “in the wild” comparative evaluation conducted over the web we show the bene- fits of this content sourcing policy over the scalable machine extraction content sourcing/generation policy that was already shown to have strong performance (see Chapter 5). The evaluation made use of the non-embodied web implementation of the test-bed agent architecture (see Section 4.4.2). The results of the comparative evaluation demonstrate that multi-role enabled content sourcing significantly improved user’s game performance, perceptions of content naturalness, and overall interaction enjoyment. To the best of our knowledge; this is the first empirical evaluation of an agent that re-uses content from prior interactions where the agent performed a different role from the evaluated role. The rest of this chapter is organized as follows. In the next section we discuss the specifics of sourcing clues for the test-bed agent’s clue-giving role using the machine extraction and multi- role enabled content sourcing/generation methods. In Section 8.2 we discuss how we used the DialPort to recruit users to interact with the non-embodied web implementation of the test-bed agent. Section 8.3 presents the experimental design and method of the evaluation. Section 8.4 describes the results of the evaluation. Section 8.5 discusses how we eliminated two confounds that had potential to bias the main results. Finally, section 8.6 concludes. 8.1 Content Sourcing Methodology As noted the clues given by the agent in the experiment were sourced using 2 different methods. One set of clues were output by human clue-givers playing with the game agent performing the guessing role and the other set of clues were generated by machine from web-based resources. Examples of each are shown in Table 8.1, and each content sourcing method is described in more detail below. We did not include separate conditions that used other common content sourcing 187 Table 8.1: Example Clues Target-Word Human Clue Machine Clue Lamp “this is something you turned on in a room to bring light” “fluorescent blank is one type of it” Spoon “you have a fork you have a knife” “a blank is a utensil consisting of a small shallow bowl, oval or round, at the end of a handle, ” Government “people that run the country all the” “administration” methods (see Section 2.2) because we were focused on showing the benefits of multi-role enabled content sourcing over the only other content sourcing method, machine generation/extraction, which is also relatively scalable. 8.1.1 Human Clues from Prior Interactions We sourced observed clues generated by a human interacting with the test-bed agent from the “Wizarded” Agent Elicited Human Clue Dataset first presented in Table 1.2 in Chapter 1 from the “Wizarded” Multi-Role Agent Evaluation (see Section 9.1). Example clues from this dataset that were used in this evaluation appear in the 2nd column of Table 8.1 (the clue’s corresponding target-word is in column 1). Note we used the raw ASR hypotheses of clues from this dataset as opposed to expert tran- scriptions of the clues as we wanted to evaluate whether a version of the test-bed agent that used clues from prior interactions (with a minimal amount of pre-processing) could have positive ef- fects on future agent interactions. Our evaluation results (presented in Section 8.4.2) therefore reflect more closely the positive effects of automatically reusing clues from prior interactions de- spite the noise introduced by depending on ASR. 
The minimal pre-processing of clues we did 188 undertake should be able to be automated without too much additional work by creating a ma- chine learning classifier that identifies clues/non-clues and implementing a segmentation model that is trained on a corpus of similar clues annotated with labels identifying partial clues. 8.1.2 Machine Generated Clues For the evaluation we also used clues from the Machine Clue Corpus first presented in Table 1.2 in Chapter 1 (also see Section 5.1). In order to have a strong baseline to demonstrate the benefits of re-using clues from prior interaction we only used clues from this corpus that passed the supervised machine learning filter trained to identify clues more likely to elicit a correct guess described in Section 5.2. Examples of filtered clues that were used in the evaluation can be seen in the 3rd column of Table 8.1 and their corresponding target-word in column 1. It is important to note, in order to support the claim of a strong baseline, that the machine filter used was already shown to be able to prune the Machine Clue Corpus for clues that on average elicited correct guesses from human guessers at rates in line with clues output by human clue-givers. Despite that finding, however, our results (see Section 8.4) show that clues from the “Wizarded” Agent Elicited Human Clue Dataset elicit more correct guesses on average than filtered clues from the Machine Clue Corpus. 8.2 Test-Bed Agent & User Recruitment As noted we leveraged the non-embodied web implementation of the test-bed agent (see Section 4.4.2). Since this implementation was connected to the DialPort platform we were able to recruit real users who interacted with the agent from their personal laptop or mobile phone. For the 189 Figure 8.1: FaceBook Experiment Advertisement evaluation described here, a $4,200 FaceBook advertising promotion was purchased that ran on both the FaceBook app on mobile phones and the FaceBook website on personal computers for 1 month (03/18/19 - 04/18/19). Users interacted with the agent over the web either via speech or text and over their mobile phone via text on FaceBook Messenger. This recruitment method allowed for natural interaction where participants were free to leave the interaction at any time. The positive implication of this type of recruitment is that results from this experiment should be particularly representative of agents deployed “in the wild”. The negative implication of this is that we weren’t guaranteed to collect the same amount/types of data (e.g. - responses to the post-survey) from every user who played the game. https://www.facebook.com/business/ads 190 8.3 Experimental Design & Method In this section we describe the experimental design of the evaluation and justification for our statistical analysis. 8.3.1 Materials and Independent Variables For the comparative content evaluation we compiled game rounds that featured target-words that appeared in both the “Wizarded” Agent Elicited Human Clue Dataset and the Machine Clue Corpus. We identified 97 common target-words shared between the corpora. We created clue lists with these common target-words that served to provide the game agent with a list of target- words and its associated clues for a given round. Since we had 97 target-words we created 9 clue lists that were composed of 10 target-words and 1 round of 7 target-words. 
The 1st condition (cond hum ) was associated with clue lists that contained only clues sourced from the “Wizarded” Agent Elicited Human Clue Dataset (described in Section 8.1.1). The 2nd condition (cond mach ) was associated with clue lists that contained only filtered clues from Machine Clue Corpus (see Section 8.1.2). The 3rd condition (cond mix ) was associated with clue lists that contained all the clues available from “Wizarded” Agent Elicited Human Clue Dataset and all filtered clues available from the Machine Clue Corpus . In order to ensure players in cond hum and cond mach did not receive more clues for a particular target-word in one of these conditions, each target-word on all clue lists had total # of clues = min(# of clues for target-word in corpus hum , # of filtered clues for target-word in corpus mach ). For the clues in the corpus that had more clues for a particular target-word, a random subset was used. We used all available clues for cond mix as we wanted to investigate if a version of the agent that used all available content could outperform a version of the agent that relied on only clues from 191 one source (even if the average clue quality was higher for clues from that particular source). This also allows us to investigate the benefits of sourcing content mostly from machine resources but augmenting with some data from previous user interaction. The average # of clues per target-word for cond hum and cond mach was 2.1 and for cond mix was 9.7. All clue lists had the same target-words in the same order. In order to ensure there were not clue ordering effects each time a round was played the order of the clues output by the game agent for each target-word for the given clue list was randomized. In terms of dialogue management, this means that the content selection policy of the game agent was constant for each condition. Choosing a random content selection policy for each condition ensures our results are not biased due to choosing a content selection policy that performs better on average when using one of the evaluated content types. 8.3.2 Dependent Variables and Hypotheses We list all dependent variables investigated in our main analysis in Table 8.2. We organize the dependent variables into 2 categories; behavioral and perception. Behavioral variables, investi- gated in our behavioral analysis, are calculated from user game behaviors including how long they chose to play as well as the frequency users output the 3 dialogue moves (correct guesses, incorrect guesses, skips) recognized by the agent when in-round. Perception variables, investi- gated in our perception analysis, are calculated from user provided responses to the post game survey. We have 2 main hypotheses for this evaluation as follows. User’s behaviors would be most positive when interacting with the game agent in cond hum (HYP ob j a ) and least positive when interacting with the game agent in cond mach (HYP ob j b ). 192 Table 8.2: Dependent Variables Variable Definition Behavioral TR Total rounds started TCR Total completed rounds AS Avg score (correct guesses) per completed round AI Avg incorrect guesses per completed round ASK Avg skips per completed round Perception PE Perceived Clue Effectiveness PN Perceived Clue Naturalness OE Overall Game Enjoyment Users would have the most positive perceptions of interactions with the test bed agent in cond hum (HYP sub j a ) and lowest in cond mach (HYP sub j b ). 
The main reasoning for both hypotheses is based on the expectation that clues generated by users in previous agent interaction reflect the context of future agent interactions more closely than clues extracted from pre-existing resources that were authored for a different purpose (e.g.- in the case of dictionary.com to serve as a definition as opposed to a clue). Both main hypotheses can be broken down into sub hypotheses for each dependent variable. We hypothesized that TR, TCR, and AS would be highest in cond hum and lowest in cond mach . We also hypothesized that ASK would be lowest in cond hum and highest in cond mach as machine clues would frustrate users into moving to new target-words more frequently on average as we expected them to be less likely to elicit correct guesses on average. 193 Further, we hypothesized that AI would be highest for users in cond hum and lowest for users in cond mach for 2 reasons. First, we expected the machine clues might stump users more on average leading users to have less candidate guesses on average to say. Second, since we expected users to skip the least in cond hum there would be more time for the user to make guesses in this condition. In terms of HYP sub j a and HYP sub j b , we predicted that that PE, PN, and OE would all be highest in cond hum and lowest in cond mach . We carried out the behavioral analysis on all users who completed at least 1 round. We chose this cutoff in order to normalize the behavioral dependent variables for interaction time. For example, we did not want to compare the # of correct guesses made by a user who played 1/2 a round with the # of correct guesses made by a user who played 1 full round. As noted, since this evaluation involved natural interaction we were not guaranteed to collect post-survey data from all uses who played the game. We carried out the perception analysis on the subset of users who provided at least 1 numerical response to the 3 statements the agent output during the post-survey stage of the interaction. We carried out the perception analysis on all users who answered at least one question as opposed to users who completed at least 1 round since we wanted to avoid not counting responses from users who might have left early due to negative perceptions of the agent in a particular condition. 8.3.3 Statistical Analysis A Shapiro-Wilk (W) normality test [Shapiro and Wilk, 1965] on the dependent variables listed in Table 8.2 (as well as additional dependent variables we investigated to eliminate confounds in Section 8.5) yielded low p-values (0.01) (see Table 8.3)which led us to reject the null hypothesis that the variables are normally distributed. Therefore, in order to conduct inferential statistics we employed non-parametric statistical tests. For the main results discussed in Section 8.4.2 we 194 Table 8.3: Shapiro-Wilk Normality Test Results Variable W p TR 0.61 9.1*10 -29 TCR 0.49 3.9*10 -32 AS 0.94 2.0*10 -4 AI 0.96 4.0*10 -3 ASK 0.73 1.7*10 -12 PE 0.89 1.0*10 -2 PN 0.87 3.0*10 -3 OE 0.78 1.0*10 -4 ACL 0.79 3.9*10 -42 APCS 0.67 2.9*10 -44 195 performed the omnibus Kruskal-Wallis (H) test [Kruskal and Wallis, 1952] to identify overall differences between the 3 experimental conditions for each of the dependent variables and the Dunn test [Dunn, 1964] was used for post-hoc pairwise group comparisons to determine the sources of difference. 
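A minimal sketch of this testing pipeline is shown below, assuming SciPy for the Shapiro-Wilk and Kruskal-Wallis tests and the scikit-posthocs package for the Dunn test; the data layout and function name are illustrative, not the analysis script actually used.

```python
from scipy import stats
import scikit_posthocs as sp

def analyze_variable(groups, alpha=0.05):
    """groups: dict mapping condition name -> list of values of one dependent variable."""
    # 1. Shapiro-Wilk normality test on each condition's sample; a low p-value
    #    argues against normality and for non-parametric tests.
    shapiro_p = {name: stats.shapiro(values)[1] for name, values in groups.items()}

    # 2. Omnibus Kruskal-Wallis test across the conditions.
    h_stat, p_omnibus = stats.kruskal(*groups.values())

    # 3. Dunn post-hoc pairwise comparisons, only when the omnibus test is significant.
    pairwise = None
    if p_omnibus < alpha:
        pairwise = sp.posthoc_dunn(list(groups.values()))  # matrix of pairwise p-values

    return {"shapiro_p": shapiro_p, "H": h_stat, "p": p_omnibus, "pairwise": pairwise}
```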
The confound analysis sometimes only required comparing 2 conditions, in which case we performed Wilcoxon-Mann-Whitney tests [Mann and Whitney, 1947] for evidence that the 2 groups differ on the given sample statistic. We report effect sizes for the Kruskal-Wallis and Wilcoxon-Mann-Whitney tests in the form of η² [Tomczak and Tomczak, 2014] in order to provide an idea of the magnitude of the differences in the reported means. η² (when multiplied by 100) is the % of variance in the dependent variable explained by the independent variable. Calculating η² via [Lenhard, 2016] makes use of the assumption that the significance of non-parametric tests is usually calculated by approximating the given test statistics using a normal z distribution, and therefore the corresponding z-value can be used to calculate an effect size [Fritz et al., 2012]. Note η² is independent of sample size. According to [Cohen, 2013], an η² generally indicates a large effect if its value is ≥ 0.14, an intermediate effect if its value is ≥ 0.06 and < 0.14, a small effect if its value is ≥ 0.01 and < 0.06, and no effect if its value is < 0.003. 8.4 Results & Discussion Here we first discuss aggregate user-system interaction statistics and then the main evaluation results. 8.4.1 Aggregate User-Agent Interaction Statistics During the period between March 18th, 2019 and April 22, 2019 (users continued to use DialPort for a few days after the advertising promotion finished), DialPort handled 1,949 web sessions. 948 of these sessions were via FaceBook Messenger on mobile while 1,001 sessions were via the DialPort website. 407 of these 1,949 sessions were routed to the test-bed agent. A little over half (206) of these users continued interacting with the agent past the first stage of the interaction and made at least one guess during a round. We could not find relevant statistics in the literature for dialogue systems intended to satisfy users' intrinsic motivations and also deployed "in the wild", but conjecture that these results would be in line with other types of similarly designed/deployed systems. The agent's drop-off rate does outperform one potential benchmark statistic from the advertising industry. The average conversion rate is the number of conversions (people who actually purchase something) after clicking on an advertisement divided by the total number of people who clicked on the advertisement. According to one eCommerce agency (https://transaction.agency/ecommerce-statistics/the-average-conversion-rate-of-a-facebook-ad-is-9-21/), the average conversion rate for FaceBook ads across all industries is 9.21%. That means that about 91% of people who click on a FaceBook advertisement don't end up purchasing the advertised item. Although it's likely harder to convert people to paying customers than to players in a free game, the advertising drop-off rate derived from the average conversion rate can serve as a floor that we should expect the agent's drop-off rate to outperform. 15 users returned to the agent after having interacted with the agent in an initial session, and 4 of those users returned more than once. We were also not able to find relevant statistics in the literature that show average return rates for dialogue systems intended to satisfy users' intrinsic motivations and that are also deployed "in the wild". 106 users completed at least 1 game round and are included in the behavioral analysis. The breakdown of these 106 users per condition is cond hum: 34, cond mach: 34, cond mix: 38.
26 users who made at least 1 guess provided a response to at least the 1st statement in the post-survey and are included in the perception analysis. The # of users who responded to each statement are: 26 to the 1st statement on clue effectiveness, 25/26 to the 2nd statement on clue naturalness, and 24/26 users to the 3rd statement on game enjoyment. 14 (of the 26) users who responded to the 1 statement did not complete a full round. The breakdown of these 14 users for each condition is cond hum : 6 (of 9), cond mach : 4 (of 8), cond mix : 4 (of 9). Note the distribution of the # of users for the perception analysis for each condition is close to equal (frequencies did not differ by more than 2) and thus does not appear to be biased to any one condition. There is some support that this response rate, 12.6% (26/206), is in line with average external post-survey response rates (10%-15%) . We acknowledge the potential for the results of subjective surveys in particular to be biased in favor of people who are doing well to complete more; but since the 3 conditions have a balanced # of responders then this should not impact the comparison. A comparison of behavioral data for the 106 users included in the behavioral analysis to the behavioral data for the 26 users examined in the perception analysis is provided in Table 8.4. The value for the variables TR, TCR, and AI were significantly higher for the 106 users associated with the behavioral analysis. 8.4.2 Evaluation Results Table 8.5 shows the main results of the evaluation, the sample mean ( ¯ x) and sample standard de- viation (s), for each of the dependent variables listed in Table 8.2 (best value for each variable in https://www.surveygizmo.com/resources/blog/survey-response-rates/ 198 Table 8.4: Mean Subset User Statistics Behavioral Analysis User Mean N=106 Perception Analysis User Mean N=26 Total rounds (TR) 2.30 1.85 Total completed rounds (TCR) 1.91 0.85 Average score per completed round (AS) 2.01 1.15 Average incorrect guesses per completed round (AI) 6.02 4.13 Average skips per completed round (ASK) 0.70 0.42 199 Table 8.5: Main Results cond hum cond mach cond mix Variable ¯ x s ¯ x s ¯ x s TR 3.11 2.08 2.65 1.07 3.50 2.52 TCR 1.79 1.74 1.71 1.19 2.18 2.23 AS 2.76 1.36 1.39 0.98 1.98 1.55 AI 6.64 3.04 5.18 2.08 6.42 2.45 ASK 0.76 0.92 0.96 1.24 0.43 0.77 PE 3.11 2.13 2.61 1.36 2.13 1.36 PN 3.89 1.17 2.00 0.82 2.56 1.74 OE 4.75 0.46 2.86 1.78 2.78 1.56 Figure 8.2: Box Plots of Dependent Variables with Significant Differences *(p<0.05), **(p<0.01), ***(p<0.001) 200 bold). Overall we found support for our 2 main hypotheses. The values of the dependent vari- ables AS, AI, PN, and OE had significant differences consistent with our hypothesized directions discussed in Section 8.3. Figure 8.2 shows box-plots of the sample data for each of these depen- dent variables. We chose to present these results via box-plots since our data was not normally distributed and simple bar graphs of the means of the sample data could be visually misleading with respect to sample data variations. As seen in the top left of Figure 8.2 we found strong support for HYP ob j a for the variable AS. The AS for users in cond hum was significantly higher ( H = 17:17;d f = 2; p= 0:00) than the AS for both users in cond mach ( p= 0:00) and users in cond mix (p= 0:01) with a large effect size (h 2 = 0:147). In terms of HYP ob j b for AS, the AS for users in cond mix was higher than the AS for users in cond mach (although this difference was not significant). 
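The effect sizes reported here can be reproduced from the test statistics using the conversions described in Section 8.3.3. The sketch below uses the Kruskal-Wallis conversion from [Tomczak and Tomczak, 2014] and the z-based approximation from [Fritz et al., 2012]; treat the exact conventions (in particular the r² step for the z-based case) as assumptions to be checked against those papers rather than the dissertation's own code.

```python
# Sketch of the effect-size conversions described in Section 8.3.3.
import math

def eta_squared_kruskal(h: float, k: int, n: int) -> float:
    """eta^2 for a Kruskal-Wallis test: (H - k + 1) / (n - k),
    with k groups and n total observations."""
    return (h - k + 1) / (n - k)

def eta_squared_from_z(z: float, n: int) -> float:
    """Approximate eta^2 for a z-approximated test (e.g. Mann-Whitney):
    r = z / sqrt(n), eta^2 ~ r^2."""
    r = z / math.sqrt(n)
    return r * r

# Example: H = 17.17 with k = 3 conditions and n = 106 users gives ~0.147,
# matching the large effect size reported above for AS.
print(round(eta_squared_kruskal(17.17, k=3, n=106), 3))  # 0.147
```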
Also, as seen in Table 8.5, although cond_mix's TR, TCR, and ASK values were best, these differences were not significant across any of the conditions so we do not draw any conclusions from those results.

Additionally, as seen in the top right of Figure 8.2, we found support for HYP_obj_a and HYP_obj_b for the variable AI. The AI for users in cond_hum was significantly higher (H = 6.63, df = 2, p = 0.04) than the AI for users in cond_mach (p = 0.03) with a small effect size (η² = 0.045). Moreover, the AI for users in cond_mix was significantly higher (p = 0.02) than the AI for users in cond_mach.

In terms of the perception analysis, the results yielded strong support for HYP_subj_a for the variables PN and OE. As seen in the bottom left of Figure 8.2, the PN for users in cond_hum was significantly higher (H = 6.90, df = 2, p = 0.03) than the PN for both users in cond_mach (p = 0.01) and users in cond_mix (p = 0.04) with a large effect size (η² = 0.223). Similarly, as seen in the bottom right of Figure 8.2, the OE for cond_hum was significantly higher (H = 5.27, df = 2, p = 0.01) than the OE for both users in cond_mach (p = 0.01) and users in cond_mix (p = 0.00) with a large effect size (η² = 0.357). We did not find support for HYP_subj_b for the variables PN and OE. Although we did not find significant differences for HYP_subj_a and HYP_subj_b for the variable PE across the 3 conditions, the results are consistent with our hypothesized pattern.

In order to provide some intuition for these results, we calculated the Jaccard Similarity Coefficient (J) over constituent clue words for clues output by the game agent in cond_hum and cond_mach for each of the 97 target-words used in the study. J gives a measure of the similarity between two finite sample sets and is calculated by dividing the size of the intersection of the two sets by the size of their union. Thus a J closer to 1 indicates high similarity and a J closer to 0 indicates low similarity. If this analysis had yielded high Js it would indicate that the differences in elicited perceptions and behaviors across conditions were not due to differences in word usage for each content type. This might prompt an investigation into whether the differences could be attributed to low-level linguistic feature differences between the content types (e.g., syntactic or semantic features). Before calculating J we stripped stop words from each clue (using the stop-word list at https://gist.github.com/sebleier/554280) and stemmed each word using a Porter Stemmer (https://www.nltk.org/_modules/nltk/stem/porter.html). The calculation of the 97 Js indicated there was almost no overlap between the words that composed each set of clues for each target-word, indicating word usage is likely an important factor in the results. 73 out of 97 target-words had J = 0. The maximum J = 0.20. The average J = 0.021.

8.5 Eliminating Obvious Confounds

We examined whether differences in outcomes across conditions were due to confounding factors such as differences in average clue length of human vs machine clues or different distributions of modality (web speech, web text, or FaceBook text). Further statistical tests and stratified analysis showed that these factors do not account for the significant differences between human and machine conditions. We provide more details of this analysis in this section.

Table 8.6: Average Points Per Clue Sequence (APCS) Results
         cond_simmach       cond_mach        cond_hum
         x̄      s           x̄      s         x̄        s
APCS     0.20   0.35        0.22   0.35      0.31***  0.41
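Before turning to the individual confounds, here is a minimal sketch of the per-target-word Jaccard computation used in Section 8.4.2 above. It assumes NLTK's English stop-word list and Porter stemmer stand in for the resources cited there, and simple whitespace tokenization stands in for whatever preprocessing the original analysis used; the clue lists are hypothetical inputs.

```python
# Minimal sketch of the per-target-word Jaccard computation from Section 8.4.2.
# Assumes NLTK (run nltk.download('stopwords') once if needed).
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOP = set(stopwords.words("english"))
stem = PorterStemmer().stem

def clue_word_set(clues):
    """Lower-case, drop stop words, and stem every word in a list of clue strings."""
    words = (w for clue in clues for w in clue.lower().split())
    return {stem(w) for w in words if w not in STOP}

def jaccard(human_clues, machine_clues):
    a, b = clue_word_set(human_clues), clue_word_set(machine_clues)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# e.g. scores = [jaccard(human_clues[w], machine_clues[w]) for w in target_words]
# then inspect how many of the 97 values are 0, the maximum, and the mean.
```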
8.5.1 Clue Length The average clue length of machine clues were over twice as long as the average clue length of human clues. In order to rule out the hypothesis that cond hum outperformed cond mach in the variable AS due to cond hum having shorter average clue length’s (providing more time for the user to make guesses) we created an artificial condition, cond simmach . cond simmach was more similar to cond hum in terms of average clue length. cond simmach contained clue sequences from the original cond mach that contained clues with an average clue length less than or equal to the maximum average clue length found in any clue sequence in cond hum . Thus cond simmach had an average clue length ( ¯ x= 7:5;s= 8:2) closer to that of the average clue length in cond hum . Since we were examining a subset of the clue sequences that appeared in cond mach in cond simmach , a comparison of cond simmach ’s AS with cond hum ’s AS would not have been equivalent. Instead, we compared the average points per clue sequence (APCS) for cond simmach and cond hum . If length was a dominant confounding factor one would expect for cond simmach ’s APCS to be higher than cond mach ’s APCS and not be significantly less than cond hum ’s APCS. However, as seen in Ta- ble 8.6, cond simmach ’s APCS was lower than the original cond mach ’s APCS and significantly less 203 Table 8.7: User Interaction Modality Breakdown Mobile Text Web Text Web Speech Behavioral Perception Behavioral Perception Behavioral Perception cond hum 21 6 4 2 9 1 cond mach 24 8 3 0 7 0 cond mix 24 6 4 3 10 0 (H = 21:6;d f = 2; p kruskal = 0:00; p dunn = 0:00) than cond hum ’s APCS with a large effect size (h 2 = 0:190). This still leaves the possibility that the shorter clues in cond hum allowed time for more clues to be heard on average by users in that condition compared to users in cond mach and that contributed to the higher AS in cond hum . In order to eliminate this as a possible confound we calculated the average # of clues per successful clue sequence (ANCS) (clue sequences that resulted in a correct guess from a user) for both cond hum and cond mach . If the # of clues heard by users in each condition was a confound you would expect ANCS to be higher in cond hum than in cond mach . However we found the opposite pattern to be true. The ANCS for cond hum ( ¯ x = 1:48;s = 1:03) was significantly (U = 7136:5; p= 0:04) lower than the ANCS for cond mach ( ¯ x= 1:59;s= 1:0). 8.5.2 Interaction Modality Another obvious possible confounding factor we investigated was interaction modality (mobile text (MT), web text (WT), or web speech (WS)). We wanted to rule out the possibility that our results could be attributed to differences that arose in a particular condition because of a dispro- portionate # of users in that condition using a particular interaction modality. For example, if 204 users in cond hum interacted with the game agent via text significantly more than in the machine condition, our results could be attributed to be an artifact of ASR error. The user number breakdown by interaction modality for each condition can be found in Ta- ble 8.7. TThe #’s in the Behavioral columns are the # of users associated with the behavioral analysis that used the corresponding interaction modality (a subset of the 106 users). The #’s in the Perception columns are the # of users associated with the perception analysis that used the corresponding interaction modality (a subset of the 26 users). 
We note that the distribution of interaction modality in this evaluation reflects the distribution of modality chosen (or available) to users “in the wild”. Thus, these results are representative of the results one would expect when deploying a dialogue agent under similar conditions (namely with the same interaction modality options and advertised using similar methods to the ones described in Section 8.3). Ideally, to rule out this confound, we would conduct statistical tests that provide evidence that the distributions of user interaction modality is equal across conditions. However, the con- ventional hypothesis tests that investigate population proportions (e.g. - chi-square test of inde- pendence of variables [Wickens, 2014] or, in the case of a small sample size, Freeman-Halton extension of Fisher’s Exact Test [Freeman and Halton, 1951]) have null hypotheses that the pop- ulation proportions are equal and so one can only find strong evidence that population proportion distributions are unequal if these tests yield a significant result; not strong evidence that popula- tion proportions are equal (our aim). Instead, in order to investigate interaction modality as a confound, we conducted stratified analysis with each modality defining a strata. In general in this strata-level analysis, when the strata had reasonable sample size, we found patterns for each dependent variable which had sig- nificant differences in our main analysis (See Section 8.4) that supported HYP ob j and HYP sub j although the differences between groups did not typically remain significantly different. 205 Table 8.8: Mobile Text Strata Statistics cond humMT cond machMT cond mixMT Variable ¯ x s ¯ x s ¯ x s AS 2.32 1.18 1.26 0.96 1.72 1.37 AI 5.61 2.40 5.22 1.73 5.57 1.84 PN 3.50 0.84 2.13 1.36 2.17 1.60 OE 4.20 1.30 2.00 1.15 3.17 1.83 For ease of reference we refer to users who interacted in a given condition via a given modality by hyphenating the modality to the condition subscript (e.g.- cond hum users who interacted via mobile text are referenced as cond humMT ). We focus most closely on the MT strata as this strata makes up the largest fraction of users whose behaviors/responses we analyzed. This modality was used by 69 of the 106 users (65%) who completed at least 1 round and 20 of the 26 users (77%) who responded to at least 1 statement in the post-survey. As seen in Table 8.8, the variables AS, AI, PN OE for the mobile text strata all followed our hypothesized pattern with users in cond humMT having higher values than users in cond mixMT who had higher values than users in cond machMT . Note the AS for users in cond humMT was significantly higher ( H = 8:55;d f = 2; p kruskal = 0:01; p dunn = 0:00) than the AS for users in cond machMT with an intermediate effect size (h 2 = 0:099). The other differences were not strictly significantly different. The AS and AI for each condition for users in the WT and WS strata also followed our hypoth- esized pattern with the sole exception of cond mixWS ’s AI being slightly (and non-significantly) higher than cond humWS ’s AI. 
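The strata-level comparisons reported above can be produced with a loop of the following form. This is only a sketch: the DataFrame layout and the column names (modality, condition, AS, AI, PN, OE) are assumptions for illustration, not the dissertation's actual analysis code.

```python
# Sketch of the strata-level analysis described above: split users by
# interaction modality and re-run the omnibus test within each stratum.
import pandas as pd
from scipy import stats

def stratified_kruskal(df: pd.DataFrame, variables=("AS", "AI", "PN", "OE")):
    rows = []
    for modality, stratum in df.groupby("modality"):   # MT / WT / WS
        for var in variables:
            groups = [g[var].dropna().values
                      for _, g in stratum.groupby("condition")]
            if min(len(g) for g in groups) < 2:
                continue  # skip strata with too few users to test
            h, p = stats.kruskal(*groups)
            rows.append({"modality": modality, "variable": var,
                         "H": round(h, 2), "p": round(p, 3)})
    return pd.DataFrame(rows)
```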
For purposes of full reporting the sample mean ( ¯ x) and sample stan- dard deviation (s) for the WT and WS strata’s AS and AI for all 3 conditions can be found in Table 206 Table 8.9: AS Strata Statistics AS Web-Text AS Web-Speech Condition ¯ x s ¯ x s cond hum 3.75 0.50 3.38 1.61 cond mix 2.21 1.31 2.50 1.98 cond mach 1.67 1.15 1.75 1.05 Table 8.10: AI Strata Statistics AI Web-Text AI Web-Speech Condition ¯ x s ¯ x s cond hum 9.25 2.99 7.90 3.50 cond mix 6.88 2.25 8.27 2.92 cond mach 4.33 3.51 5.39 2.80 8.9 and Table 8.10 respectively (best values bolded). In terms of PN and OE for the WT and WS strata we do not have data for a sufficient # of users to report meaningful statistics. 8.6 Summary We investigated the benefits of re-using previous user interaction content in an automated dialogue agent over versions of the agent that sourced content from pre-compiled knowledge resources. The evaluation was conducted via a platform that recruited uncompensated users through social media advertising who interacted with the agent via their personal computers or mobile phones and thus should be particularly indicative for dialogue agents deployed “in the wild”. The re- sults demonstrate that the agent’s response is more contextually appropriate (operationalized by 207 objective and subjective interaction measures) on average when the agent selects from a corpus that contains higher ratios of content evoked during previous user interaction. Thus, if a dialogue agent designer is designing an agent with the goal of eliciting certain desired user responses or to engender positive user perceptions of the agent (and would also like the agent deploy the agent at scale) then there are benefits to sourcing content in this manner. This finding addresses RQ3 as it identifies and demonstrates a capability of multi-role dia- logue agents not shared by their single role counterparts that are intended for asymmetric inter- action. If dialogue agent designers want to source content from previous user interaction, and the dialogue agent is intended for an asymmetric interaction, this finding points to the importance of endowing the agent with the ability to perform both roles of the interaction assuming users can be found to interact with the agent in both roles. This study also helps establish benchmark statistics (e.g. user drop-off rate & post-survey response rate) for dialogue agents deployed “in the wild” intended to satisfy user’s intrinsic mo- tivations. 208 Chapter 9 Multi-Role Agent Evaluations “Every role that you play comes with its own set of challenges. ” Mireille Enos In this chapter we help to fill in a gap in the human-computer dialogue literature by empir- ically evaluating an agent that can perform more than one role in the same activity. In Chapter 8 we showed one benefit to building multi-role dialogue agents is their ability to source superior content for the interaction which provides motivation for why one would want to build multi- role agents. However, there still remain questions as to whether or not users will even cooperate with multi-role agents, whether automated multi-role agents that leverage a full stack of inter- active technologies can elicit positive perceptions and behaviors from users (RQ1), and whether an agent’s performance of more than one role of an interaction negatively impacts these elicited perceptions (RQ2). First, multi-role agents seems to make sense for some activities (e.g. - a word guessing game), but seem less natural for other activities (e.g. 
- task based activities like booking a flight). Second, noting that the various methods leveraged by artificial agents to support automated interaction with human users generally do not perform at human level in many subjective and 209 objective dimensions of interaction (e.g. - perceived naturalness of synthetic voice quality [Pincus et al., 2015]) it is not completely clear the effect that extending the use of these methods to support the agent’s performance of more than one role would have on an interaction. For example, it is plausible that the sub-human quality of each sub-interaction associated with each role would negatively impact perceptions of the overall interaction. In order to investigate these questions we run two different experiments using versions of the test-bed agent that perform both roles of the game. In the first experiment, the “Wizarded” Multi-Role Agent Evaluation, we make use of the robot implementation of the test-bed agent (see Section 4.3) to investigate whether users will cooperate with a version of the agent that performs more than one role of the interaction (with the guessing role “wizarded”), whether we can elicit useful data with a multi-role agent. and whether users have positive perceptions of this interaction. This first experiment showed that users will cooperate with a multi-role agent. The experiment also yielded useful data. First, it yielded user’s perception and behavioral data which were used to investigate RQ1. Second, it provided an opportunity to elicit human generated clues in the game context. The clues elicited from this experiment became the Agent Elicited Human Clue Dataset used for the comparative content sourcing evaluation in Chapter 8. The findings from this experiment address RQ1 by demonstrating that a multi-role version of the agent is able to elicit positive perceptions and behaviors from users (at least when one of the roles is performed by a human operator). In the second experiment, the Multi/Single-Role Agent Comparative Evaluation, we make use of the fully automated virtual human implementation of the test-bed agent (see Section 4.2) to compare versions of the test-bed agent that perform both roles of the game versus versions of the test-bed agent that perform only one of the roles of the game. The findings from this experiment 210 address RQ1 more completely by demonstrating that a fully automated multi-role agent can elicit positive perceptions and behaviors from users. The findings from the second experiment also address RQ2 by showing that the test-bed agent’s performance of both roles does not negatively impact the elicited perceptions. The findings from this final evaluation, similar to the game framing study (see Section 6.3), should be relatively ecologically valid compared to many agent evaluations in the literature (e.g.- [Bickmore and Picard, 2005, Rovatsos et al., 2018] which use text-based interaction and/or no embodiment or [Artstein et al., 2017b, Lucas et al., 2019] which wizard certain agent capabilities and/or don’t have incremental dialogue capabilities). This ecological validity is accomplished by making use of the fully automated and interactive virtual human implementation of the test-bed agent. The rest of this chapter is organized as follows. In Section 9.1 we describe the first multi- role evaluation that made use of the robot implementation of the test-bed agent. 
In Section 9.2 we discuss the single-role/multi-role comparative evaluation of the test-bed agent that made use of the virtual human implementation of the test-bed agent. Finally, Section 9.3 presents our conclusions on these multi-role evaluations. 9.1 Pilot Robot Multi-Role Data Collection In this section we provide details of the “wizarded” multi-role robot pilot experiment. In the next subsection we discuss the experiment method. In Section 9.1.2 we present aggregate results for the experiment and in Section 9.1.3 we summarize. 211 9.1.1 Method We ran a pilot experiment with 3 conditions each associated with a specific game role ordering. In order to get a sense of whether or not participants were intrinsically motivated to play the game we required participants to play 8 rounds and gave the option to participants to play up to 8 more optional rounds (16 total). The first condition had participants serve as the guesser in the game for the first four rounds and then as the clue-giver for the next four rounds. The second condition expected the participants to act as clue-giver for the first four rounds and then the guesser for the next four rounds. The third condition interleaved the roles for each of the first 8 rounds with half of the participants for that condition starting off as guesser and the other half starting off as clue-giver. Prior to playing the game, participants engaged in an ice-breaking social chat with the agent. For the initial social chat the human wizard was given a list of questions to ask the user but was free to use spontaneous unconstrained speech when performing them as well as ask additional questions not on the list. Questions on the list included ones such as “what are your interests?” and “where are you from?”. The wizard was free to choose when to end the ice-breaking chat activity. Audio and video data were recorded using a wireless Sennheiser microphone which did not pick up speech being output by the computer speakers and CMU’s MultiSense software . All relevant game actions were stored in a SQL database. A short survey (See Figure 9.1) after each set of 4 rounds during the first required 8 rounds was given to the user to collect user’s perceptions of the agent. A longer survey (See Figure 9.2) after all rounds were played was given to participants to collect participant’s perceptions of the agent, participant’s self-assessments on http://multicomp.cs.cmu.edu/ 212 Figure 9.1: Short Survey 1. I could understand my partner 2. I think my partner could understand me 3. We worked together well as a team 4. I was a good clue-giver in this last set of rounds 5. I was a good guesser in this last set of rounds 6. The agent was a good guesser in this last set of rounds 7. The agent was a good clue-giver in this last set of rounds 8. I feel more confident in my word guessing game ability in the last set of rounds. their game ability, and some demographic information. Giving surveys at multiple points in the activity allowed us to see if certain measures changed as more time was spent with the agent. In addition to demographic information the survey collected ratings from participants on a 1-5 scale indicating their level of agreement (1 meaning strongly disagree and 5 meaning strongly agree) with a given statement. 9.1.2 Results We ran 16 participants through this setup. Since the number of participants was relatively small for this pilot evaluation and we did not find significant differences between conditions, we report some aggregate interaction statistics. 
Enjoyment
The average score provided on the post-survey for the statement "I enjoyed the game" was 4.2, providing evidence that people enjoyed playing the multi-role version of the game with the robot.

Figure 9.2: Long Post Survey
1. What is your gender?
2. What is your race?
3. In which industry are you employed? (U.S. Census)
4. I could understand my partner
5. I think my partner could understand me
6. We worked together well as a team
7. I performed my role well
8. The agent performed his role well
9. I am embarrassed about my game performance
10. I am a good guesser
11. I am a good clue-giver
12. I feel more confident in my word-guessing game ability after playing the game
13. The agent is a good guesser
14. The agent is a good clue-giver
15. If I did not guess a word it is more so because of the quality of the clues than the quality of my guesses
16. If the agent did not guess a word it is more so because of the quality of the clues than the quality of the agent's guesses
17. I wouldn't care if people saw a video of me playing this game
18. I feel smarter than I did before playing the game
19. I enjoyed the game
20. I would recommend the game to a friend
21. If I played the game again I would rather play with the agent than an unknown partner
22. If I had to play another game I would choose this game rather than an unknown game
23. I would play the game again for fun

Optional Rounds
Evidence of user enjoyment can also be found in the fact that participants elected to play optional rounds. The average number of optional rounds played by participants was 2.1. Thirteen out of sixteen of the participants played optional rounds.

Scores
Unsurprisingly, as the agent's clue-giving role was automated and the agent's guessing role was "wizarded", the average round score when the player was performing the guessing role of the game (1.02) was significantly lower according to a Wilcoxon-Mann-Whitney test (U = 1179.5, p = 0.00) than when the player was performing the clue-giving role of the game (3.2).

Perceived Agent Performance
Also unsurprisingly, people perceived the "wizarded" agent as better at performing the guessing role than the automated agent performing the clue-giving role of the game. The average score given for the statement "The agent is a good guesser" was 4.1, which was significantly higher according to a Wilcoxon-Mann-Whitney test (U = 30.5, p = 0.00) than the average score given for the statement "The agent is a good clue-giver" (2.5).

Perceived Clue Effectiveness
People seemed somewhat ambivalent as to the perceived effectiveness of the automated clues, giving an average score of 3.4 for the statement "If I did not guess a word it is more so because of the quality of the clues than the quality of my guesses." on the post-survey. Interestingly, the same average score was given for the statement "If the agent did not guess a word it's more so because of the quality of the clues than the quality of the agent's guesses.", indicating this ambivalence extended to the perceived quality of participants' own clues.

Team Cohesion
The average score provided on the post-survey for the statement "We worked well together as a team" was 3.25, indicating people felt somewhat ambivalent about how much team cohesion they felt between themselves and the agent. This value was relatively stable across the shorter surveys given after the first 4 rounds (3.25) and the first 8 rounds (2.9), indicating that time spent with the agent did not seem to have a large effect on this dimension.
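The role-based score comparison reported under Scores above can be reproduced with a test of the following form. This is a sketch only; the per-round score lists are hypothetical inputs, not the study data.

```python
# Sketch of the Wilcoxon-Mann-Whitney comparison of round scores by role.
from statistics import fmean
from scipy import stats

def compare_role_scores(scores_when_guessing, scores_when_giving):
    u, p = stats.mannwhitneyu(scores_when_guessing, scores_when_giving,
                              alternative="two-sided")
    print(f"U = {u:.1f}, p = {p:.3f}, "
          f"mean(guessing) = {fmean(scores_when_guessing):.2f}, "
          f"mean(clue-giving) = {fmean(scores_when_giving):.2f}")
```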
Early Success
Consistent with the early-success finding reported in Section 6.3, when we calculated the non-parametric Spearman's ρ to investigate whether early success was correlated with participants' decision to play additional rounds we found a similar pattern. We found a medium-strong positive correlation (ρ = 0.56, p = 0.03) between a user's average score in the first two rounds and the number of optional rounds played, which did not appear when we calculated ρ for a user's average score over their first four and first eight rounds.

9.1.3 Summary

In this section we discussed a pilot multi-role agent evaluation, the "Wizarded" Multi-Role Agent Evaluation, that used the robot implementation of the test-bed agent performing both roles of the word-guessing game and an initial social chat activity. The evaluation showed that users will cooperate with a multi-role agent. This experiment also demonstrated that multi-role agents can yield useful data. Besides user behavioral and perception data, this experiment was the source of the Agent Elicited Human Clue Dataset used for the comparative content sourcing evaluation in Chapter 8.

The automated performance of the agent in the clue-giving role underperformed the "wizarded" performance of the agent in the guessing role in terms of average score per round. As expected, the better actual performance was also perceived as better performance, as indicated by the scores participants provided to the relevant questions on the post-survey. In this experiment, we did not find strong evidence of team cohesion between participants and the agent (and this measure held relatively constant even as participants' time with the agent increased). The evaluation also began to address RQ1 by providing evidence that people enjoyed the interaction based on their post-survey scores as well as their decision to play optional rounds.

9.2 Multi/Single-Role Agent Comparative Evaluation

In this section we discuss the Multi/Single-Role Agent Comparative Evaluation that directly investigated the effects of endowing the test-bed agent with the ability to perform multiple roles, using the fully automated virtual human implementation of the test-bed agent. Section 9.2.1 describes the data used by the test-bed agent's clue and guess generators for this evaluation. In Section 9.2.2 we provide details on the experimental design and method of the evaluation. Section 9.2.3 discusses the dependent variables and experimental hypotheses. Section 9.2.4 explains and justifies our choice of statistical methods used for analyzing the evaluation results, which are presented in Section 9.2.5. Finally, Section 9.2.6 summarizes the findings from this evaluation.

9.2.1 Data

Here we outline the data used by the clue and guess generators of the test-bed agent architecture, endowing the agent with the ability to output clues and guesses. As was done for previous agent evaluations, the agent relied on a text-file that organized target-words and their associated clues into rounds. We began by randomly selecting 160 target-words from the 419 target-words contained in the NPC Editor Training Dataset (see Section 7.2). We then randomly created subsets of size 10 of these target-words and associated each set of 10 target-words with a game round. This provided target-words for a total of 16 game rounds.
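A minimal sketch of the round construction just described appears below: sample 160 target-words from the 419-word NPC Editor training vocabulary and chunk them into 16 rounds of 10. The list name `all_target_words` and the fixed random seed are illustrative assumptions, not the dissertation's actual preparation code.

```python
# Sketch of the target-word sampling and round construction described above.
import random

def build_rounds(all_target_words, n_words=160, words_per_round=10, seed=0):
    rng = random.Random(seed)
    sampled = rng.sample(all_target_words, n_words)  # 160 distinct words
    return [sampled[i:i + words_per_round]
            for i in range(0, n_words, words_per_round)]

# rounds = build_rounds(all_target_words)   # 16 lists of 10 target-words each
```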
When the agent was performing the guessing role in this evaluation the agent’s guess generator leveraged the NPC Editor trained model (see Section 7.2) to make a guess from a guess space comprised of the full set of 419 target-words. 217 Clue-Giving Role Data Here we describe the clues used to populate the text-files leveraged by the test-bed agent’s clue generator for this evaluation. We began by selecting all the clues from the “wizarded” Agent Elicited Human Clue Dataset for the 160 target-words just described. This dataset contained clues shown to have strong performance in game-play when output by the test-bed agent (see Chapter 8). However, the average number of clues per target-word for this dataset was 2.1 with many target-words selected only having one clue per target-word. As noted in the activity analysis (see Section 3.5) human clue-givers gave on average 4.1 clues per target-word before a correct guess or skip indicating we would need more clues to enable the agent to more closely emulate human clue-givers’ content generation abilities. In order to get sufficient data for the test-bed agent’s clue-giving role for this evaluation we conducted another small data collection effort using the same crowdsourcing method as described in Section 7.2.1 which resulted in the Turk Clue Set 2 dataset (see Table 1.2 in Chapter 1). We constructed a new HIT on Amazon’s Mechanical Turk that asked Turkers to provide clues in textual format for subsets of the 419 target-words from the NPC Editor Training Dataset. In terms of content sourcing methods, these clues can be considered human authored as opposed to human generated from prior interaction with the agent. In total, 21 different instances of this HIT were constructed; producing additional clues for the 419 target-words. We ended up approving 94 completed HIT instances (after rejecting 76 due to failing the test question) which gave us a total of 6,403 new human authored clues for the 419 target-words; an average of 15.2 clues per target-word. The final set of clues used for this evaluation consisted of all the clues generated from the AMT data collection just described for the set of 160 target-words already selected (2,633 clues) 218 as well as all the clues for these words from the Agent Elicited Human Clue Dataset (723 clues). Each of the 160 target-words therefore had an average of 20.4 clues. The order of the clues was randomized for each target-word in the final text-file used by the clue generator. 9.2.2 Experimental Design & Method The main purpose of this evaluation was to investigate the impact of endowing an agent with the ability to perform multiple roles in the same activity. To this end, we designed a 3-condition experiment with each condition defined by the role or roles performed by the test-bed agent. We used the virtual human implementation of the test-bed agent (see Section 4.2) for this evaluation. The first 2 conditions (single-role conditions) featured single role versions of the test-bed agent performing one of the roles of the game activity and the 3rd condition (a multi-role condition) featured the agent performing both roles of the game. In the agent clue giving condition users interacted with a version of the test-bed agent that performed only the clue giving role. In the agent guessing condition users interacted with a version of the agent that performed only the guessing role. In the multi-role condition users interacted with a version of the agent that performed both the clue-giving and guessing roles. 
Users were randomly assigned to one of the conditions. Users were required to play eight 120 second rounds with the agent in all 3 conditions and then given the option to play up to eight more optional rounds with the agent. For the eight required rounds in the multi-role condition, users alternated with the agent which role they played each round. In order to ensure there were no role ordering effects, half the users in the multi-role condition were randomly assigned the clue-giving role (and the other half the guessing role) in the first round. All users heard the same target-words (and if the user was performing the guessing role the same clues) in the same order. 219 Figure 9.3: Multi/Single-Role Agent Comparative Evaluation Pre-Survey 1. What is your gender? 2. What is the highest level of education you have completed? 3. What is your race? 4. In which industry are you employed? (U.S. Census) 5. I feel confident in my word-guessing game ability. We gave users the option to continue playing more rounds (but kept compensation the same) as opposed to requiring them to play more rounds since we wanted to determine if users would be intrinsically motivated to play with the agent even without increased compensation. In the single-role conditions users only had the option to continue playing the role they played in the first required eight rounds. In the multi-role condition users were able to choose which role they would play for each optional round. Before interacting with the agent, participants were given a consent form to fill out by the experimenter. After signing the consent form a pre-survey was given to participants (seen in Figure 9.3) to collect demographic information as well as get a baseline measure of how confident people were in their word-guessing game ability prior to interacting with the agent. After finishing the pre-survey, an experimenter re-entered the room and gave instructions on game-play and what to expect in the interaction according to which condition the participant had been assigned. The instructions emphasized that the agent and participant were on the same team and should try to score as many points as possible. Participants sat in a standard rolling chair in a room with no background noise. Seated par- ticipants were placed approximately 4 feet from a widescreen monitor that displayed the virtual 220 Figure 9.4: Multi/Single-Role Agent Comparative Evaluation Post-Survey 1. My partner created a sense of closeness or camaraderie between us. 2. My partner create a sense of distance between us. 3. I think my partner and I understood each other. 4. My partner communicated coldness rather than warmth. 5. My partner was warm and caring. 6. I wanted to maintain a sense of distance between us. 7. I felt I had a connection with my partner. 8. My partner was respectful to me. 9. I felt I had no connection with my partner. 10. I tried to create a sense of closeness or camaraderie between us. 11. I tried to communicate coldness rather than warmth. 12. I enjoyed playing the game. 13. I feel confident in my word-guessing game ability. 14. My experience playing the game with my partner is similar to the experience I would have had playing the game with a human partner. 15. My partner was a good clue-giver. 16. My partner was a good guesser. 17. My partner’s clues were natural during the game. 18. My partner was intelligent. 19. I felt like I was on the same team as my partner. 20. The final score was mostly due to my abilities rather my partner’s performance. 21. 
The final score was mostly due to my partners performance rather than my own perfor- mance. 22. The final score was mostly a team effort with equal contributions from myself and my partner. 23. Please add any other comments on your experience that you would like to share. 221 humans and auxiliary graphical user interfaces shown in the screenshot seen in Figure 4.3 in Sec- tion 4.2. Participants spoke into a wireless Sennheiser microphone. Audio files were recorded and all relevant game actions stored in a database. Following the game, participants reported their perceptions on on a post-survey. Figure 9.4 is the post-survey. All participants were asked to provide a rating on a 1-5 scale indicating their level of agreement (1 meaning strongly disagree and 5 meaning strongly agree) with each statement on this survey (except for the final question asking for comments in free-form text). The first eleven questions on the post-survey were de- signed to measure rapport based on a component rapport structure discussed in [Tickle-Degnen and Rosenthal, 1990]. 9.2.3 Dependent Variables Here we discuss the dependent variables investigated in this evaluation. As was done for the con- tent sourcing evaluation described in Chapter 8 we investigate perception variables in a perception analysis and behavioral variables in a behavioral analysis. We list all the dependent variables investigated in our perception analysis in Table 9.1. The first column names the variables with the dimension of perception it represents and the second column indicates which questions on the surveys were used to operationalize the variable. Note except for the variable Change in confidence, which uses one question from the pre-survey in Figure 9.3, all variables were operationalized using questions from the post-survey which appear in Figure 9.4. The behavioral variables were the number of optional rounds played and the average number of correct guesses (round score) and skips per round made by human players when performing either role, the average number of incorrect guesses per round made by human players when performing the guessing role, and the average number of clues per round made by human players 222 Table 9.1: Multi/Single-Role Agent Comparative Evaluation Perception Dependent Variables Variable Question used to Operationalize Variable Enjoyment Q12 Rapport Rapport Scale (Q1-Q11) Change in Confidence post-survey Q13 - pre-survey score Q5 Agent Ability Q15 (when human guesser)j Q16 (when human clue-giver) Agent Naturalness Q14 Perceived Intelligence Q18 Team Cohesion Q19 Self-Attributed Performance Q20 Agent-Attributed Performance Q21 Team-Attributed Performance Q22 223 when performing the clue-giving role. We also examined role-choice for the multi-role condition to determine if either the guessing role or clue-giving role was chosen more frequently during the optional rounds. 9.2.4 Statistical Analysis In order to compare most of the results across conditions we employed similar statistical tech- niques as described in Sections 8.3.3 and 7.1.3. We first performed Shapiro-Wilk (W) normality tests [Shapiro and Wilk, 1965] on all variables. In almost all cases the Shapiro-Wilk test yielded low p-values (0.01) for all of the variables which led us to reject the null hypothesis that the variables are normally distributed. 
For all of these variables we again used the omnibus Kruskal- Wallis (H) test to see if there are overall differences between the 3 experimental conditions and if the null hypothesis was rejected, we then proceeded to conduct post-hoc pairwise group compar- isons via the Dunn test. The two exceptions to this were for the variables average round score and role-choice. For average round score the Shapiro-Wilk test results indicated that the variable was normally dis- tributed so we used a one-way ANOV A omnibus test followed by Tukey’s honestly signficantly different (HSD) post-hoc test [Abdi and Williams, 2010]. For the variable role-choice we used the non-parametric chi-square goodness of fit test to determine if the distribution of participant’s clue-giving or guessing role choices in optional rounds was significantly different than 50/50. In order to investigate RQ2 we are interested in analyzing the study data in such a way that we provide evidence that the perceptions elicited by the multi-role version of the agent are not worse than the perceptions elicited by a single role version of the agent. The typical type of statistical test used to demonstrate this type of claim is called a non-inferiority test [Wellek, 2010] which is used to show one condition did not underperform another condition more than “a minimum amount that 224 Figure 9.5: Possible Outcomes of Non-Inferiority Tests/Trials has practical significance” termed a non-inferiority margin (D). More formally, a non-inferiority test has a null hypothesis that one condition (in our case the multi-role condition) performed worse than another condition (in our case the single-role conditions) by a margin larger thanD. A non- inferiority test has an alternative hypothesis that the differences between conditions is not more thanD. This is in contrast to the more common type of statistical test, a superiority tests (used through- out this thesis), which aim to demonstrate that one condition out performs another condition. Su- periority tests have a null hypothesis that there is no true difference between two conditions and an alternative hypothesis that there is a difference between the two conditions. It is often the case that a non-significant result of a superiority test is incorrectly used to provide evidence of equivalence between conditions. A non-inferiority test/trial has five possible outcomes which are depicted in Figure 9.5 and summarized in [Schumi and Wittes, 2011]. The horizontal lines in this figure represent confidence intervals around estimated effect sizes which appear as dots in the center of the intervals. The solid and dashed vertical bars represent 0 and the non-inferiority margin (D), respectively. The first interval at the top of the figure has a range that is totally above 0 indicating that the test condition 225 is superior to the control condition (which implies it is also non-inferior). The second interval down has a range that is partially above 0 and partially below 0, but is still totally aboveD indicating the test condition is non-inferior but not superior. The third interval has a range that stretches to both sides ofD and 0 indicating that the test condition is not shown be non-inferior to the control condition. The fourth interval has a range that isD but also totally below 0. This indicates that the test condition is shown to be non-inferior to the control condition but also inferior to the control condition (just to a smaller threshold thanD. 
In this case, one would use the control unless there was a compelling side benefit to using the test. Finally, the fifth interval has a range that lies totally belowD indicating the test is inferior to the control and is not shown to be non-inferior to the control. One good way of determining -D is to use a value for this parameter specified in comparable prior studies. However, we were not able to find studies that specified -D and were directly comparable to our study where perceptions of one thing vs another was specified on a 5-pt Likert scale. One way of thinking about a minimum practical difference on this type of data would be to consider the minimum difference in the number of stars given for the average restaurant review on a restaurant rating site such as Yelp that would lead a person to pick one restaurant over another (especially considering there are other factors that might influence ones final choice (e.g.- price or distance). As seen in Figure 9.6 Yelp user comments suggest that a 0.5 - 1-star difference may not make a large difference in a user’s final choice of restaurant. Motivated by these considerations we ran non-inferiority tests for all of the measures listed in Table 9.1 withD =1 as well as withD =0.5. D =0.5 was run in case our use of 1 as a minimum https://spoonuniversity.com/how-to/the-ultimate-yelp-guide-for-just-about-everyones y https://www.quora.com/Should-you-trust-a-restaurant-that-only-has-3-stars-on-Yelp z https://blog.yelp.com/2018/09/restaurant-ratings-on-yelp-are-remarkably-consistent-no-matter-whos-writing- them-when-and-where 226 Figure 9.6: Yelp User Reviews 1. “I try to pick from those with 4+ stars if I want to have a meal that will be almost guaran- teed to be satisfactory. However, don’t discriminate against those with 3-4 stars because there could be a hidden gem lingering in those search results” 2. “I have no problems going to a three star joint. Example; I just returned to my hotel after a long and tiring business day. I feel like pizza and a beer and I want to consume it in a dining room. I don’t feel like take out. I whip out my trusty Yelp app and hit the filters. I have numerous 3, 4 and even 5 star choices. A three star choice is directly across the street. I wouldn’t even need to drive. Would I trust it? YES!” y 3. “If every business you’re considering is between a 4.5 and a 5, the differences shrink and it becomes hard to distinguish two businesses.” z practical difference was too generous. To run these tests we used the non-inferiority t-tests offered by the equiv.test function in the R Statistical package. Since we were running a non-inferiority test the alternative parameter was set to “greater” and since our data was unpaired the paired parameter was set to “False”. We note Likert data is typically not normally distributed which is an assumption for t-tests. However, simulation studies that compared t-tests to Mann-Whitney tests (typically used on data that is not normally distributed) showed there is little difference between the false positive rates and the statistical power of both these tests when applied to sample data taken from 14 different representative population distributions of Likert data [De Winter and Dodou, 2010]. If we used different statistical tests than those just describe for a particular result we report the specific statistical test used along with the result in the text. For all graphs/tables the following notation is used: * indicates p<.05, ** indicates p<.01, and *** indicates p<.001. 
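The dissertation ran these non-inferiority tests with the equiv.test function in R; the following is a hedged Python analogue rather than the original code. It uses the standard device of shifting the single-role sample by the margin Δ so that a one-sided two-sample t-test of the shifted difference implements the non-inferiority null hypothesis; the use of Welch's correction (equal_var=False) is an assumption.

```python
# Hedged Python analogue of the non-inferiority t-tests described above.
# H0: mean(multi) - mean(single) <= -delta   vs   H1: difference > -delta.
import numpy as np
from scipy import stats

def noninferiority_ttest(multi_scores, single_scores, delta=1.0):
    multi = np.asarray(multi_scores, dtype=float)
    shifted_single = np.asarray(single_scores, dtype=float) - delta
    t, p = stats.ttest_ind(multi, shifted_single,
                           equal_var=False, alternative="greater")
    return t, p

# Run once with delta=1.0 and once with delta=0.5, as in Table 9.4, e.g.:
# t, p = noninferiority_ttest(multi_enjoyment, single_enjoyment, delta=0.5)
```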
227 Table 9.2: Multi/Single-Role Agent Comparative Evaluation Participant Demographics Total Participants N=68 Gender Male (52.9%), Female (45.6%), Non-Specified (1.5%) Education High School/GED (7.4%), Some College (27.9%), 2 Year Degree (13.2%), 4 Year Degree (35.3%), Master’s Degree (14.7%), Doctoral Degree (1.5%) Race African American (30.9%), Asian (5.9%), White/Caucasian (52.9%), Hispanic (2.9%), Other (7.4%) Occupation Forestry, fishing, hunting or agricultural support (69.1%), Mining (13.2%), Construction (1.5%), Manufacturing (4.4%), Wholesale Trade (2.9%), Retail Trade (5.9%), Transportation or Warehousing (3.0%) 9.2.5 Results Participants In this section we present participant demographic information. In total 68 participants ran through the experimental protocol. Participants were randomly assigned to one of the three con- ditions. 22 participants were assigned to the guessing condition, 23 participants were assigned to the clue-giving condition, and 23 participants were assigned to the multi-role condition. The demographic information for these 68 participants can be found in Table 9.2. Perception Analysis In this section we discuss the results of the perception analysis. The last three columns in Table 9.3 show the mean results for each of the perception dependent variables listed in the first column 228 Table 9.3: Perception Variable Results Variable Agent Clue-Giving x Score Agent Guessing x Score Multi-Role x Score Enjoyment 4.14 4.04 4.08 Rapport 3.19 3.02 3.08 Change in Confidence -0.55 -0.26 -0.43 Agent Ability 4.00 4.04 4.30j* 4.35 Agent Naturalness 2.73 2.87 2.52 Perceived Intelligence 3.72 3.52 3.57 Team Cohesion 3.18 3.18 3.56 Self-Attributed Performance 3.50 3.60 3.00 Agent-Attributed Performance 2.68 2.70 2.83 Team-Attributed Performance 3.32 3.52 3.74 * For Agent Ability the first value reflects the mean response value to Q15 on the post-survey, My partner was a good guesser. The second value reflects the mean response value to Q16 on the post-survey, My partner was a good clue-giver. 229 of the table that we investigated in the perception analysis for the three experiment conditions. None of the variables had significant differences between conditions. However, examining the variable enjoyment allows us investigate RQ1 again and provides evidence that people enjoyed playing the game with the multi-role agent. The mean score for this variable in the multi-role condition was 4.1 on a 5-pt scale providing direct evidence that people enjoyed interacting with the fully automated multi-role version of the agent that relied on a full stack of interactive dialogue technologies. The mean score for this variable for the single-role conditions were virtually the same; 4.14 for the agent clue-giving condition and 4.04 for the agent guessing condition. It is also worth noting that the only other variable besides enjoyment that had a mean score of 4 or higher for any condition was Agent Ability. Agent Ability had a mean score greater than or equal to 4 for all three conditions. Further, an analysis of the comments left by users in the multi-role condition in response to the final post-survey question showed that 9/12 of the user who provided optional comments had something positive to say about their experience with the agent. In comparison this statistic for participants in the agent clue-giving condition was 4/13 and for participants in the agent guessing condition was 7/18. 
Some representative comments for users in the multi-role condition provided post-interaction with the agent include “Very fun and interesting experience. ”, “surprisngly ad- vanced a.i. ”, “I want to start playing (the agent) in general. ”, and “It was a fun experience. I think in the future if Mr. clue knew my name it would help in the team building/closeness aspect”. There is one difference in scores found between conditions which warrants further investi- gation. A Kruskal-Wallis omnibus test showed that the differences between the mean score for at least two of the conditions for the variable self-attributed performance (eigth row of Table 9.3) was approaching significance (p= 0:06). Noting participants in the multi-role condition had lower scores for this variable compared to participant’s scores from the single-role conditions 230 (with a relatively larger difference in these scores than between the two scores for the single role conditions) leads us to consider the following. It is possible that playing both roles allows people to overcome a cognitive bias which inflates ones own contributions to an activity and minimizes their partners contributions to the activity. Perhaps performing both roles of the game gives a more realistic impression to someone of the contributions required to perform both roles. There is some support for this second explanation if we look at the pattern of results for the variables agent-attributed performance (ninth row of Table 9.3) and team-attributed performance (tenth row of Table 9.3). The scores for participants in the multi-role condition for both of these variables were higher than the analogous scores from participants in the single-role conditions. Non-Inferiority The results presented in Table 9.4 address RQ2. They provide evidence that the multi-role version of the agent did not negatively impact the perceptions of human players compared to the single role versions of the agent in the enjoyment, rapport, perceived intelligence, and team cohesion perception dimensions. The first column of this table lists the perception variable. The second and third columns of this table give the mean scores for the associated variable for the single role and multi-role conditions respectively. The fourth column gives the mean difference between the mean scores for both conditions. The fifth and sixth column of this table show the p-values for the non-inferiority tests (with -D= 1 andD= 0:5 respectively) that compared the values for the perception variables between the single-role and multi-role conditions. Both the more liberal non-inferiority tests with -D = 1 and the less liberal non-inferiority tests with -D = 0:5 yielded significant results for all four perception variables. These results provide evidence that the multi-role version of the agent did not negatively impact user’s elicited perceptions of enjoyment, rapport, perceived intelligence, and team cohesion when compared to 231 Table 9.4: Non-Inferiority Test Results Variable Single-Role ¯ x Score Multi-Role ¯ x Score ¯ x Difference Non-Inferiority Test w/ -D=1 p-value Non-Inferiority Test w/ -D=0.5 p-value Enjoyment 4.089 4.087 0.001 <0.01 0.03 Rapport 3.105 3.075 0.030 <0.01 0.02 Perceived Intelligence 3.489 3.565 0.076 <0.01 0.01 Team Cohesion 3.378 3.217 0.160 <0.01 <0.01 the elicited perceptions of users who interacted with the single-role version of the agent. 
Noting that superiority tests on these four variables did not show significant differences across conditions implies the outcomes of these non-inferiority tests correspond to the second interval in Figure 9.5, showing non-inferiority but not superiority.

Behavioral Analysis

Here we discuss the results of the behavioral analysis. The behavioral analysis yielded more evidence that people enjoyed playing the game with the multi-role agent, which addresses RQ1. For the variable optional rounds, participants chose to play on average 1.9 additional optional rounds with the test-bed agent in the multi-role condition, indicating that they were intrinsically motivated to keep playing with the multi-role version of the agent as no extra compensation was provided for this extra time. Participants in the agent clue-giving condition chose to play on average an additional 3.2 rounds and participants in the agent guessing condition chose to play on average 2.3 additional rounds. The differences for this statistic were not significant.

The variable average round score had significant differences across conditions. Figure 9.7 (which shows standard error bars for each condition) shows participants in the agent guessing condition scored significantly (p_anova = 0.00, p_tukey = 0.00) more points than participants in the agent clue-giving condition. Also, participants in the multi-role condition scored significantly (p_tukey = 0.02) more points than participants in the agent clue-giving condition. This result demonstrates the agent was better at performing the guessing role than the clue-giving role.

Figure 9.7: Average Round Score

In terms of role-choice, in the multi-role condition a chi-square goodness of fit test did not show that the distribution of participants' clue-giving or guessing role choices in optional rounds was significantly different from 50/50. However, players more frequently chose to perform the clue-giving role (chosen 26 times) than the guessing role (chosen 19 times).

The variables skips and incorrect guesses per round had significant differences between conditions. Figure 9.8 (displayed as a box-plot since this variable was not normally distributed) shows that human participants made significantly (p_kruskal = 0.00, p_dunn = 0.00) more skips on average per round in the agent guessing condition than in the agent clue-giving condition. Also, participants in the multi-role condition made significantly (p_dunn = 0.03) more skips on average per round than participants in the agent clue-giving condition.

Figure 9.8: Average Skips per Round

The agent also made significantly more incorrect guesses on average per round than a human guesser, which led the guessing condition (x̄ = 25.0) and the multi-role condition (x̄ = 17.4) to have a higher average number of incorrect guesses per round than the clue-giving condition (x̄ = 11.9). We don't draw conclusions from this as it is most likely a result of the dialogue management parameter instantiation decision we made for the agent's g parameter, which indicates how many guesses the agent should make for a particular clue (see Section 4.5). It is also a result of how quickly the guess generator is able to output a correct guess on average.
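The role-choice goodness-of-fit test reported above can be reproduced directly from the counts given in the text; the sketch below is illustrative rather than the original analysis script.

```python
# Chi-square goodness-of-fit test on role choice: 26 clue-giving vs. 19
# guessing optional-round choices against an expected 50/50 split.
from scipy import stats

observed = [26, 19]                      # clue-giving, guessing
expected = [sum(observed) / 2] * 2       # 50/50 split
chi2, p = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")  # not significant at alpha = 0.05
```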
The average number of clues output per round by the agent and by humans when performing the clue-giving role was virtually the same (20.2 by the agent in the agent clue-giving condition, 19.5 by human participants in the agent guessing condition, and 19.3 by both the agent and human participants in the multi-role condition). This provides evidence that our parameter instantiation decisions for the s_g parameter (the silence threshold that dictates when the agent should give a new clue) and the i parameter (which triggers the clue-giver to give a new clue immediately if i percent of the current clue has been said and no correct guess is made before the end of the current clue) were good enough to enable the agent to emulate the frequency with which a human clue-giver outputs clues.

Agent-Guess/Hum-Giver & Hum-Guess/Agent-Giver Team Performance Comparison

In order to compare human-guesser/agent-giver team performance with agent-guesser/human-giver team performance we applied the average guess quality and average guessing performance metrics from Section 3.6 to the data from this experiment. Table 9.5 shows these results.

Table 9.5: Team Guessing Ability Comparison

Metric                                 | Team                      | Value
Guess Quality                          | Human-guesser/Agent-giver | 27.7%
Guess Quality                          | Agent-guesser/Human-giver | 17.2%
Average Correct Guesses per Target     | Human-guesser/Agent-giver | 52.7%
Average Correct Guesses per Target     | Agent-guesser/Human-giver | 60.0%
Average Correct Guesses per Clue Heard | Human-guesser/Agent-giver | 14.2%
Average Correct Guesses per Clue Heard | Agent-guesser/Human-giver | 21.1%

The agent-guesser/human-giver teams underperformed the human-guesser/agent-giver teams on the average guess quality metric (the Guess Quality rows in Table 9.5). On the other hand, the agent-guesser/human-giver teams outperformed the human-guesser/agent-giver teams on the two metrics measuring average guessing performance in the timed game (the rows below the Guess Quality rows in Table 9.5). As the statistics for the variables round score and incorrect guesses showed, this indicates that the agent output more correct guesses on average than a human player, but it also indicates that the agent output more incorrect guesses on average than a human player. (A chi-square test of independence shows the proportion of correct to incorrect guesses output by humans was significantly (χ² = 174.3, p = 0.00) higher than the same proportion for the agent.)

These results bring up questions around design decisions for this type of agent. Is it better for the agent to maximize an average guess quality metric or the average guessing performance metrics? The answer to these questions would help guide parameter instantiation decisions for parameters such as the ones that determine how many incorrect guesses an agent should make for a given clue or when an agent should give a new clue (see Section 4.5). The fact that the agent was able to elicit enjoyment from users (see Section 9.2.5) provides evidence that our instantiation decisions around these parameters were good enough to make for an enjoyable experience.

Human-Agent/Human-Human Performance Comparison

Here we compare human-agent performance in the version of the game played in the Multi/Single-Role Agent Comparative Evaluation with human-human performance in the RDG-Phrase game. We note that one issue with this comparison is that the average target-word difficulty might be different in the two games.
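Both the team comparison above and the human-human comparison that follows rest on the same ratios (guess quality, correct guesses per target, and correct guesses per clue heard) together with a chi-square test of independence on correct/incorrect guess counts. The sketch below shows how these quantities could be computed from raw counts; the counts themselves are hypothetical placeholders, and the exact counting conventions are assumptions rather than the thesis's definitions.

# Sketch of the team-level metrics in Table 9.5 and the chi-square test of
# independence on correct vs. incorrect guess counts. All counts below are
# hypothetical placeholders, not the experiment's data.
from scipy import stats

def guess_quality(correct, incorrect):
    """Fraction of all guesses that were correct."""
    return correct / (correct + incorrect)

def correct_per_target(correct, targets):
    """Average number of correct guesses per target word attempted."""
    return correct / targets

def correct_per_clue_heard(correct, clues_heard):
    """Average number of correct guesses per clue the guesser heard."""
    return correct / clues_heard

# Hypothetical guess counts for the two team types.
human_guesser = {"correct": 300, "incorrect": 780}   # human-guesser/agent-giver
agent_guesser = {"correct": 350, "incorrect": 1680}  # agent-guesser/human-giver

print(guess_quality(human_guesser["correct"], human_guesser["incorrect"]))
print(guess_quality(agent_guesser["correct"], agent_guesser["incorrect"]))
print(correct_per_target(human_guesser["correct"], targets=570))
print(correct_per_clue_heard(human_guesser["correct"], clues_heard=2110))

# 2x2 contingency table of correct/incorrect guesses by guesser type.
table = [[human_guesser["correct"], human_guesser["incorrect"]],
         [agent_guesser["correct"], agent_guesser["incorrect"]]]
chi2, p, dof, expected = stats.chi2_contingency(table)
print(chi2, p)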
We compare the numbers in Table 9.5 to the guess quality/performance measures reported in Section 3.6.1 in order to compare human-guesser/agent-giver and agent-guesser/human-giver team performance to human-human team performance in the two games. This comparison shows that human-human average guess quality (26.4%) was roughly equal to human-guesser/agent-giver average guess quality (27.7%). On the other hand, agent-guesser/human-giver average guess quality (17.2%) was lower than human-human average guess quality.

The guessing performance measures show that human-human teams performed better than both human-guesser/agent-giver and agent-guesser/human-giver teams. The average correct guesses per target and average correct guesses per clue heard for human-human teams were 72.2% and 32.2% respectively. This is compared to average correct guesses per target and average correct guesses per clue heard of 52.7% and 14.2% for human-guesser/agent-giver teams and 60.0% and 21.1% for agent-guesser/human-giver teams.

9.2.6 Summary

In this section we presented the results from the Multi/Single-Role Agent Comparative Evaluation that compared the perceptions and behaviors of users who interacted with a version of the agent that performed both roles of the game with those of users who interacted with versions of the agent that performed only one of the roles.

The results showed the agent was better at performing the guessing role of the game than the clue-giving role of the game. The results also showed our parameter instantiation decisions that determine when to give a new clue were good enough that, when the agent performed the clue-giving role, it output clues at a frequency similar to that of a human clue-giver.

We presented a comparison of agent-guesser/human-giver and human-guesser/agent-giver team performance. These results demonstrated that the former underperformed the latter according to the average guess quality metric but the former outperformed the latter according to the guessing performance metrics. We also presented statistics that compared human-human team performance in the RDG-Phrase game with agent-guesser/human-giver and human-guesser/agent-giver team performance in the game played by the test-bed agent.

More generally, the results from this evaluation demonstrate that a fully automated agent that leverages a full stack of interactive technologies is capable of eliciting enjoyment from its users, which addresses RQ1. The results also demonstrate that the multi-role capabilities of the agent did not negatively impact users' perceptions (in the dimensions of enjoyment, rapport, perceived intelligence, and team cohesion) when compared to the perceptions of users who interacted with single-role versions of the agent, which addresses RQ2.

9.3 Conclusion

In this chapter we presented two evaluations of the test-bed agent performing both roles of the game using two different implementations of the architecture presented in Chapter 4.

The first evaluation, the “Wizarded” Multi-Role Agent Evaluation, made use of the robot implementation of the test-bed agent (see Section 4.3) with the guessing role of the game “wizarded”. This first evaluation demonstrates three main capabilities of multi-role agents. First, it shows that multi-role agents are capable of eliciting cooperation from their users. Second, it shows that a multi-role agent can elicit useful data from users.
Besides the behavioral and perception data reported in Section 9.1.2, this evaluation also yielded the agent-elicited clue dataset, a set of clues generated by humans interacting with the test-bed agent in the game context, which was used for the comparative content sourcing evaluation described in Chapter 8. Third, the evaluation begins to address RQ1 by demonstrating that a multi-role agent can elicit enjoyment from users (at least when one of the roles is “wizarded” by a human operator).

The second evaluation, the Multi/Single-Role Agent Comparative Evaluation, made use of the virtual human implementation of the test-bed agent (see Section 4.2). This evaluation addresses RQ1 more completely by demonstrating that a fully automated multi-role agent that leverages a full stack of interactive technologies can elicit enjoyment from users. This evaluation also addressed RQ2 by demonstrating that a multi-role agent does not necessarily negatively impact users' perceptions (at least in the dimensions of enjoyment, rapport, perceived intelligence, and team cohesion) when compared to the perceptions of users who interact with single-role versions of the same agent.

Chapter 10

Conclusion

“Computers are useless. They can only give you answers.”
Pablo Picasso

This thesis is a first step in the scientific study of multi-role agents that should encourage future research in this area. It helps fill the gap in multi-role dialogue agent research in five main ways.

First, it presents and evaluates a test-bed multi-role agent that performs more than one role in the same activity. We describe an architecture in Chapter 4 that endows a test-bed agent with core dialogue management capabilities for both roles of a word-guessing game but can be adapted for different embodiments, including a virtual human, a robot, and a non-embodied web platform that enables use of the test-bed agent in “in the wild” experiments. We also evaluated important test-bed agent design decisions identified by an activity analysis (see Chapter 3) around synthetic voice quality, dialogue incrementality/embodiment, and game framing/feedback in comparative experiments (see Chapter 6). We also evaluated content sourcing/generation methods for the test-bed agent's clue-giving and guessing roles (see Chapters 5 and 7 respectively). The results from these experiments and evaluations informed design decisions for the test-bed agent that decrease the chance that our later experiments, which more directly evaluate the agent's multi-role capabilities, would fail to find effects due to confounds stemming from poor design decisions.

Second, this thesis establishes that multi-role agents are able to elicit enjoyment from users by investigating RQ1, “Can an automated multi-role agent that leverages a full stack of interactive technologies elicit positive perceptions and behaviors from users?”. In two different experiments using different versions of the agent we demonstrate that a multi-role agent can elicit enjoyment from users (see Chapter 9). The first experiment, the “Wizarded” Multi-Role Agent Evaluation, demonstrates that users will cooperate with a multi-role agent, that a multi-role agent can elicit useful data, and that users can enjoy interacting with a multi-role agent. The second experiment, the Multi/Single-Role Agent Comparative Evaluation, shows that a fully automated interactive multi-role dialogue agent that leverages a full stack of dialogue technologies is able to elicit enjoyment from users.
Third, this thesis shows that multi-role agents do not necessarily negatively impact users' perceptions when compared to the perceptions of users who interact with single-role versions of the same agent by investigating RQ2, “Does an agent's performance of more than one role of an interaction negatively impact the elicited perceptions?” Non-inferiority statistical tests run on data from the Multi/Single-Role Agent Comparative Evaluation showed that the enjoyment, rapport, perceived intelligence, and team cohesion perceptions of users who interacted with the multi-role version of the test-bed agent were non-inferior to the analogous perceptions of users who interacted with single-role versions of the test-bed agent (see Section 9.2).

Fourth, this thesis establishes that there is a benefit to building multi-role agents by investigating RQ3, “Are there capabilities possessed by multi-role agents not shared by their single role counterparts that create positive interaction effects for their users?”. We demonstrate through the “in the wild” comparative content sourcing evaluation (see Chapter 8) that multi-role agents are able to leverage a superior content sourcing strategy, multi-role enabled content sourcing, not available to single-role agents intended for asymmetric interaction. A version of the test-bed agent that leveraged multi-role enabled content sourcing was able to elicit superior behaviors (more and better guesses) and superior perceptions (higher perceived content naturalness and overall game enjoyment) from users than a version of the agent that relied on a more common scalable content sourcing strategy that leveraged machine resources.

Fifth and finally, the experiments, evaluations, and data collection efforts carried out during this thesis produced datasets (see Table 1.2 in Chapter 1) that should be useful to researchers for conducting future multi-role dialogue agent research. These datasets should also be useful to dialogue agent researchers interested in investigating users' behaviors and perceptions when interacting with a dialogue agent intended to satisfy a user's intrinsic motivations.

10.1 Scoping of Methods & Findings

Here we briefly scope some of the methods we used to investigate multi-role dialogue agents as well as some of our findings on multi-role dialogue agents. First, we specify more precisely what we mean by the term “in the wild” as it is used to describe the comparative content sourcing evaluation (see Chapter 8). Second, we discuss the class of dialogue agents that our main findings on multi-role dialogue agents likely cover.

“In the wild” is a term currently used loosely in the artificial intelligence and human-computer interaction communities. Here we say a study involving an agent is “in the wild” if three conditions are met.

The first condition stipulates that the agent being investigated must be deployed in the environment where the agent is expected to be used in the real world and that environment is not altered in any special way for the purpose of the study. For example, if a study is investigating the perceptions and behaviors elicited by a robot waiter, an “in the wild” study for this type of agent would involve the agent being deployed in a restaurant that had normal everyday operations taking place.

The second condition stipulates that the “participant” users must be motivated to interact with the agent due to some agent ability (e.g.,
the ability to elicit enjoyment or the ability to perform some task for the user) as opposed to some reward independent of the interaction (e.g., compensation). For the robot waiter example, the users would have to be normal restaurant customers who had come to the restaurant in order to enjoy a meal.

The third condition stipulates that the “participant” users must not be primed in any special way on how they should interact with the agent beyond how users would be primed in an everyday scenario involving the agent. For the robot waiter example, a study would not qualify as “in the wild” if the customers were told to only order off a portion of the menu when the robot waiter was expected to take orders for the whole menu once deployed in the future.

Based on this definition the comparative content sourcing evaluation qualifies as an “in the wild” study. This study satisfies the first condition as people frequently use the internet via their personal laptops and mobile phones to play games for entertainment purposes. This study satisfies the second condition since users were motivated to interact with the agent for entertainment purposes and were not given compensation for playing the game. This study satisfies the third condition since users were not primed in any way on how to interact with the test-bed agent, which is consistent with how the test-bed agent would be deployed in real life.

We now briefly scope our main findings related to multi-role dialogue agents. This thesis demonstrates three truths for multi-role dialogue agents. First, it shows a multi-role dialogue agent is capable of eliciting positive perceptions and behaviors from users (see Chapter 9). Second, it demonstrates that one can build a multi-role agent that does not elicit worse perceptions than a single-role version of the same agent (see Section 9.2.5). Third, it proves that multi-role enabled content sourcing can produce higher quality content than other scalable content sourcing policies (see Chapter 8).

Given that these findings were demonstrated using a test-bed agent that performs an activity in one specific domain, it is important to consider what types of multi-role dialogue agents these findings likely cover. For the first and second truths, as pointed out in Chapter 9, building a multi-role dialogue agent makes sense in some domains (like the test-bed word-guessing game domain investigated here) but less so in others (e.g., the travel domain; it seems unlikely people would enjoy booking a flight for an automated agent). Therefore, the first and second truths likely do not apply to all multi-role dialogue agents. At most, these truths likely apply to multi-role dialogue agents that engage in an activity where enabling the agent to perform multiple roles of the activity benefits the user in some manner (e.g., provides more services, diversity of experience, or enjoyment to the user).

For the third truth on multi-role enabled content sourcing, this likely only applies to multi-role dialogue agents intended to satisfy users' intrinsic motivations. Since one goal of these types of agents is to engage users for extended and repeated periods, a large amount of diverse content is needed (i.e., high scalability is needed). Without a large and diverse content base this type of agent will likely become repetitive for users who interact with the agent for extended periods or repeatedly.
Since users are intrinsically motivated to interact with these types of agents, leveraging multi-role enabled content sourcing provides a method of organically growing these agents' content bases that takes less time, is low cost, is fully situated, and is of human quality.

On the other hand, a large and diverse content base is less necessary for agents designed to satisfy a user's extrinsic motivations, a user's desire to achieve a separable outcome [Ryan and Deci, 2000] (e.g., task-based agents). In these cases, users are motivated to interact with the agents mainly to achieve separable outcomes, which serve to motivate users to return and continue interacting with the agent even if the agent produces content that is repetitive. Repetitive content might actually be viewed as a positive attribute of these types of agents as it could possibly help reduce the cognitive load of repeat users.

Since less content is needed for agents intended to satisfy users' extrinsic motivations, authoring content sourcing methods are likely a superior content sourcing strategy for this class of agents. As noted in Section 2.2, authoring content is a less scalable content sourcing method but can produce relatively situated content that is also of human quality. The time and cost of paying authors to generate a limited amount of content is likely less than the time and cost of building out the other role of the agent and finding users (who would also likely need compensation as motivation to interact with the agent performing the other role). Therefore, at most, the superiority of multi-role enabled content sourcing likely only extends to multi-role dialogue agents intended to satisfy users' intrinsic motivations.

10.2 Future Directions

Here we discuss directions for future work opened up by this thesis. We organize these directions into two categories: work related to multi-role agent research in general and work that the test-bed agent is a well-suited platform to investigate.

We discuss two areas of multi-role agent research that warrant further investigation. First, it is important to verify whether the superior performance of multi-role enabled content sourcing
First, in terms of clue sequencing, although the test-bed agent policy (random selection) was good enough to elicit enjoyment and reasonable average scores from users it fails to emulate certain human clue-giver clue sequencing abilities. For example, as pointed out in Table ?? in Section 3.6, human clue- givers often take into account prior clues and guesses when constructing a clue sequence. It would be worthwhile investigating whether a clue-giving agent that was able to take into account prior clues and guesses had positive effects on the interaction. Second, the literature review (see Chapter 2) also reveals a lack of empirical evaluations of agents that perform multiple activities/roles and use information learned about the user in one activity/role in another activity/role. The test-bed agent architecture used in this work is a well suited platform to run experiments that investigate the benefits of at least one type of information 245 transfer. Since the test-bed agent architecture (see Chapter 4) has already been extended to fa- cilitate an ice-breaking social chat activity (see Section 4.3) the agent could potentially leverage information learned about a user during an initial ice-breaking social-chat activity in the game activity. For example, a personality inference could be inferred based on how a user engages in the social chat and feedback during the game could be personalized based on the personality inference made. Alternatively, inferences about the user’s knowledge could be made based on responses they give to questions during the social chat. When the agent performs the clue-giving role it would then be able to select clues expected to be likely to elicit a correct guess from a user with particular knowledge. 246 BIBLIOGRAPHY [Abdi and Williams, 2010] Abdi, H. and Williams, L. J. (2010). Tukey’s honestly significant difference (hsd) test. Encyclopedia of research design, 3:583–585. [Adrian et al., 2016] Adrian, K., Bilgin, A., and Van Eecke, P. (2016). A semantic distance based architecture for a guesser agent in essence’s location taboo challenge. DIVERSITY @ ECAI 2016, page 33. [Al Moubayed et al., 2014] Al Moubayed, S., Beskow, J., and Skantze, G. (2014). Sponta- neous spoken dialogues with the furhat human-like robot head. In Proceedings of the 2014 ACM/IEEE international conference on Human-robot interaction, pages 326–326. ACM. [Allwood, 2000] Allwood, J. (2000). An activity based approach to pragmatics. Abduction, belief and context in dialogue: Studies in computational pragmatics, pages 47–80. [Amazon, 2018a] Amazon (2018a). Amazon alexa. https://developer.amazon.com/ alexa/. [Online; accessed 18-May-2018]. [Amazon, 2018b] Amazon (2018b). The Alexa Prize. https://developer.amazon.com/ alexaprize. [Online; accessed 18-May-2018]. [Apple, 2018] Apple (2018). Siri. https://www.apple.com/ios/siri/. [Online; accessed 18-May-2018]. [Artstein et al., 2016] Artstein, R., Traum, D., Boberg, J., Gainer, A., Gratch, J., Johnson, E., Leuski, A., and Nakano, M. (2016). Niki and Julie: A Robot and Virtual Human for Studying Multimodal Social Interaction. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, pages 402–403, Tokyo, Japan. ACM Press. [Artstein et al., 2017a] Artstein, R., Traum, D., Boberg, J., Gainer, A., Gratch, J., Johnson, E., Leuski, A., and Nakano, M. (2017a). Listen to My Body: Does Making Friends Help Influence People? 
In Proceedings of the 30th International Florida Artificial Intelligence Research Society Conference (FLAIRS-30), Marco Island, Florida. AAAI. [Artstein et al., 2017b] Artstein, R., Traum, D., Boberg, J., Gainer, A., Gratch, J., Johnson, E., Leuski, A., and Nakano, M. (2017b). Listen to my body: Does making friends help influence people? In The Thirtieth International Flairs Conference. 247 [Bickmore and Cassell, 2000] Bickmore, T. and Cassell, J. (2000). how about this weather?? so- cial dialogue with embodied conversational agents. In Proc. AAAI Fall Symposium on Socially Intelligent Agents. [Bickmore and Cassell, 2001] Bickmore, T. and Cassell, J. (2001). Relational agents: a model and implementation of building user trust. In Proceedings of the SIGCHI conference on Human factors in computing systems, pages 396–403. ACM. [Bickmore and Casselle, 2005] Bickmore, T. and Casselle, J. (2005). Social dialogue with em- bodied conversational agents. Advances in natural multimodal dialogue systems, 30:23–54. [Bickmore and Picard, 2005] Bickmore, T. W. and Picard, R. W. (2005). Establishing and main- taining long-term human-computer relationships. ACM Transactions on Computer-Human Interaction (TOCHI), 12(2):293–327. [Biddle, 1986] Biddle, B. J. (1986). Recent developments in role theory. Annual review of soci- ology, 12(1):67–92. [Black and Taylor, 1997] Black, A. W. and Taylor, P. (1997). Automatically clustering similar units for unit selection in speech synthesis. In Proc. of the European Conference on Speech Communication and Technology (Eurospeech), Rhodes, Greece. [Bobrow et al., 1977] Bobrow, D. G., Kaplan, R. M., Kay, M., Norman, D. A., Thompson, H., and Winograd, T. (1977). Gus, a frame-driven dialog system. Artif. Intell., 8(2):155–173. [Bohus and Horvitz, 2009a] Bohus, D. and Horvitz, E. (2009a). Dialog in the open world: plat- form and applications. In Proceedings of the 2009 international conference on Multimodal interfaces, pages 31–38. ACM. [Bohus and Horvitz, 2009b] Bohus, D. and Horvitz, E. (2009b). Models for multiparty engage- ment in open-world dialog. In Proceedings of the SIGDIAL 2009 Conference: The 10th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 225–234. Association for Computational Linguistics. [Bohus and Horvitz, 2009c] Bohus, D. and Horvitz, E. (2009c). Models for multiparty engage- ment in open-world dialog. In Proceedings of the SIGDIAL 2009 Conference: The 10th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 225–234. Association for Computational Linguistics. [Bohus and Rudnicky, 2003] Bohus, D. and Rudnicky, A. I. (2003). Ravenclaw: Dialog manage- ment using hierarchical task decomposition and an expectation agenda. In Eighth European Conference on Speech Communication and Technology. [Bowden et al., 2018] Bowden, K. K., Wu, J., Oraby, S., Misra, A., and Walker, M. (2018). Slugnerds: A named entity recognition tool for open domain dialogue systems. In Proceedings of the Language Resources and Evaluation Conference (LREC), Miyazaki, Japan. European Language Resources Association. 248 [Boyce and Gorin, 1996] Boyce, S. and Gorin, A. (1996). User interface issues for natural spoken dialog systems. Proc. ISSD, 96(1996):65–68. [Bright, 2016] Bright, P. (2016). Tay, the neo-nazi millennial chatbot, gets autopsied. [Bunt et al., 2010] Bunt, H., Alexandersson, J., Carletta, J., Choe, J.-W., Fang, A. C., Hasida, K., Lee, K., Petukhova, V ., Popescu-Belis, A., Romary, L., et al. (2010). 
Towards an iso standard for dialogue act annotation. In Seventh conference on International Language Resources and Evaluation (LREC’10). [Burgener, 2005] Burgener, R. (2005). Artificial neural network guessing method and game. US Patent App. 11/102,105. [Burgener, 2006a] Burgener, R. (2006a). 20Q: the Neural Network Mind Reader. https://web.archive.org/web/20130216091945/http://ecolloq.gsfc.nasa. gov/archive/2006-Spring/announce.burgener.html. [Online; accessed 18-May- 2018]. [Burgener, 2006b] Burgener, R. (2006b). Artificial neural network guessing method and game. US Patent App. 11/102,105. [Burton et al., 2009] Burton, K., Java, A., and Soboroff, I. (2009). The icwsm 2009 spinn3r dataset. In Proceedings of the Third Annual Conference on Weblogs and Social Media (ICWSM 2009). [Carpenter, 2018] Carpenter, R. (2018). cleverbot. http://www.cleverbot.com/. [Online; accessed 18-May-2018]. [Cassell et al., 1999] Cassell, J., Bickmore, T., Billinghurst, M., Campbell, L., Chang, K., Vilhj´ almsson, H., and Yan, H. (1999). Embodiment in conversational interfaces: Rea. In Proceedings of the SIGCHI conference on Human Factors in Computing Systems, pages 520– 527. ACM. [Cohen, 1960] Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and psychological measurement, 20(1):37–46. [Cohen, 2013] Cohen, J. (2013). Statistical power analysis for the behavioral sciences. Rout- ledge. [Colby, 1981a] Colby, K. M. (1981a). Modeling a paranoid mind. Behavioral and Brain Sci- ences, 4(4):515–534. [Colby, 1981b] Colby, K. M. (1981b). Modeling a paranoid mind. Behavioral and Brain Sci- ences, 4(4):515–534. [Colby et al., 1971] Colby, K. M., Weber, S., and Hilf, F. D. (1971). Artificial paranoia. Artificial Intelligence, 2(1):1 – 25. 249 [Dahl et al., 1994] Dahl, D. A., Bates, M., Brown, M., Fisher, W., Hunicke-Smith, K., Pallett, D., Pao, C., Rudnicky, A., and Shriberg, E. (1994). Expanding the scope of the atis task: The atis-3 corpus. In Proceedings of the workshop on Human Language Technology, pages 43–48. Association for Computational Linguistics. [De Winter and Dodou, 2010] De Winter, J. and Dodou, D. (2010). Five-point likert items: t test versus mann-whitney-wilcoxon (addendum added october 2012). Practical Assessment, Research, and Evaluation, 15(1):11. [Dumais et al., 1998] Dumais, S., Platt, J., Heckerman, D., and Sahami, M. (1998). Inductive learning algorithms and representation for text categorization. In Proceedings of the seventh international conference on Information and knowledge management. ACM. [Duncan, 2008] Duncan, S. (2008). ANNOTATIVE PRACTICE . http://mcneilllab. uchicago.edu/pdfs/susan_duncan/Annotative_practice_REV-08.pdf. [Online; ac- cessed 26-September-2018]. [Dunn, 1964] Dunn, O. J. (1964). Multiple comparisons using rank sums. Technometrics, 6(3):241–252. [Fang et al., 2017] Fang, H., Cheng, H., Clark, E., Holtzman, A., Sap, M., Ostendorf, M., Choi, Y ., and Smith, N. A. (2017). Sounding board–university of washington’s alexa prize submis- sion. Alexa Prize Proceedings. [Fasola and Mataric, 2012] Fasola, J. and Mataric, M. J. (2012). Using socially assistive human– robot interaction to motivate physical exercise for older adults. Proceedings of the IEEE, 100(8):2512–2526. [Faur et al., 2015a] Faur, C., Caillou, P., Martin, J.-C., and Clavel, C. (2015a). A socio-cognitive approach to personality: Machine-learned game strategies as cues of regulatory focus. 
In Af- fective Computing and Intelligent Interaction (ACII), 2015 International Conference on, pages 581–587. IEEE. [Faur et al., 2015b] Faur, C., Martin, J.-C., and Clavel, C. (2015b). Matching artificial agents’ and users’ personalities: designing agents with regulatory-focus and testing the regulatory fit effect. In CogSci. [Ferrucci, 2010] Ferrucci, D. (2010). Build watson: an overview of deepqa for the jeopardy! challenge. In 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 1–1. IEEE. [Fraser and King, 2007] Fraser, M. and King, S. (2007). The Blizzard Challenge 2007. In Proc. of the ISCA Workshop on Speech Synthesis, Bonn, Germany. [Freeman and Halton, 1951] Freeman, G. and Halton, J. H. (1951). Note on an exact treatment of contingency, goodness of fit and other problems of significance. Biometrika, 38(1/2):141–149. 250 [Freitas and Higgins, 2002] Freitas, A. L. and Higgins, E. T. (2002). Enjoying goal-directed action: The role of regulatory fit. Psychological science, 13(1):1–6. [Fritz et al., 2012] Fritz, C. O., Morris, P. E., and Richler, J. J. (2012). Effect size estimates: current use, calculations, and interpretation. Journal of experimental psychology: General, 141(1):2. [Gehl, 2014] Gehl, R. W. (2014). Teaching to the turing test with cleverbot. Transformations: The Journal of Inclusive Scholarship and Pedagogy, 24(1-2):56–66. [Geller et al., 2005] Geller, J. D., Norcross, J. C., and Orlinsky, D. E. (2005). The psychothera- pist’s own psychotherapy: Patient and clinician perspectives. Oxford University Press. [Georgila et al., 2012] Georgila, K., Black, A. W., Sagae, K., and Traum, D. (2012). Practical evaluation of human and synthesized speech for virtual human dialogue systems. In Proc. of the International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey. [Georgila et al., 2018] Georgila, K., Gordon, C., Choi, H., Boberg, J., Jeon, H., and Traum, D. (2018). Toward low-cost automated evaluation metrics for internet of things dialogues. In Proceedings of the International Workshop on Spoken Dialogue Systems Technology (IWSDS). [Graesser et al., 2004a] Graesser, A. C., Lu, S., Jackson, G. T., Mitchell, H. H., Ventura, M., Olney, A., and Louwerse, M. M. (2004a). Autotutor: A tutor with dialogue in natural language. Behavior Research Methods, Instruments, & Computers, 36(2):180–192. [Graesser et al., 2004b] Graesser, A. C., Lu, S., Jackson, G. T., Mitchell, H. H., Ventura, M., Olney, A., and Louwerse, M. M. (2004b). Autotutor: A tutor with dialogue in natural language. Behavior Research Methods, Instruments, & Computers, 36(2):180–192. [Hall et al., 2009] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., and Witten, I. H. (2009). The weka data mining software: An update. In SIGKDD Explorations, volume 11. [Hamstra et al., 2011] Hamstra, M. R., Van Yperen, N. W., Wisse, B., and Sassenberg, K. (2011). Transformational-transactional leadership styles and followers? regulatory focus. Journal of Personnel Psychology. [Hartholt et al., 2013] Hartholt, A., Traum, D., Marsella, S. C., Shapiro, A., Stratou, G., Leuski, A., Morency, L.-P., and Gratch, J. (2013). All together now. In International Workshop on Intelligent Virtual Agents, pages 368–381. Springer. [Hemphill et al., 1990] Hemphill, C. T., Godfrey, J. J., and Doddington, G. R. (1990). The atis spoken language systems pilot corpus. 
In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990. [Higashinaka et al., 2007] Higashinaka, R., Dohsaka, K., and Isozaki, H. (2007). Learning to rank definitions to generate quizzes for interactive information presentation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 117–120. Association for Computational Linguistics. 251 [Higgins et al., 2001] Higgins, E. T., Friedman, R. S., Harlow, R. E., Idson, L. C., Ayduk, O. N., and Taylor, A. (2001). Achievement orientations from subjective histories of success: Promo- tion pride versus prevention pride. European Journal of Social Psychology, 31(1):3–23. [Higgins et al., 1997] Higgins, E. T., Shah, J., and Friedman, R. (1997). Emotional responses to goal attainment: strength of regulatory focus as moderator. Journal of personality and social psychology, 72(3):515. [Hunt and Black, 1996] Hunt, A. J. and Black, A. W. (1996). Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Atlanta, GA, USA. [Jung and Graf, 2008] Jung, J. and Graf, S. (2008). An approach for personalized web-based vocabulary learning through word association games. In Applications and the Internet, 2008. SAINT 2008. International Symposium on, pages 325–328. IEEE. [Karaiskos et al., 2008] Karaiskos, V ., King, S., Clark, R. A. J., and Mayo, C. (2008). The Bliz- zard Challenge 2008. In Proc. of the Blizzard Challenge, Brisbane, Australia. [Kawahara, 1997] Kawahara, H. (1997). Speech representation and transformation using adap- tive interpolation of weighted spectrum: vocoder revisited. In IEEE International Conference On Acoustics, Speech, And Signal Processing, Munich, Germany. Acoustics, Speech, and Sig- nal Processing. [Kennedy et al., 2017] Kennedy, J., Leite, I., Pereira, A., Sun, M., Li, B., Jain, R., Cheng, R., Pincus, E., Carter, E. J., and Lehman, J. F. (2017). Learning and reusing dialog for repeated interactions with a situated social agent. In Intelligent Virtual Agents, pages 192–204. Springer International Publishing. [Kipp, 2012] Kipp, M. (2012). Anvil: A universal video research tool. handbook of corpus phonology. [Krasner et al., 1988] Krasner, G. E., Pope, S. T., et al. (1988). A description of the model- view-controller user interface paradigm in the smalltalk-80 system. Journal of object oriented programming, 1(3):26–49. [Kruskal and Wallis, 1952] Kruskal, W. H. and Wallis, W. A. (1952). Use of ranks in one- criterion variance analysis. Journal of the American statistical Association, 47(260):583–621. [Lavrenko et al., 2002] Lavrenko, V ., Choquette, M., and Croft, W. B. (2002). Cross-lingual relevance models. In Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, pages 175–182. [Lee and Marsella, 2006] Lee, J. and Marsella, S. (2006). Nonverbal behavior generator for em- bodied conversational agents. In International Workshop on Intelligent Virtual Agents, pages 243–255. Springer. 252 [Lee et al., 2017] Lee, K., Zhao, T., Du, Y ., Cai, E., Lu, A., Pincus, E., Traum, D., Ultes, S., Barahona, L. M. R., Gasic, M., Young, S., and Eskenazi, M. (2017). Dialport, gone live: an update after a year of development. In Proceedings of the 18th annual SIGdial meeting on discourse and dialogue, pages 170–173. 
[Leite et al., 2016a] Leite, I., Pereira, A., Funkhouser, A., Li, B., and Lehman, J. F. (2016a). Semi-situated learning of verbal and nonverbal content for repeated human-robot interaction. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, pages 13–20. ACM. [Leite et al., 2016b] Leite, I., Pereira, A., Funkhouser, A., Li, B., and Lehman, J. F. (2016b). Semi-situated learning of verbal and nonverbal content for repeated human-robot interaction. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, pages 13–20. ACM. [Leite et al., 2016c] Leite, I., Pereira, A., Funkhouser, A., Li, B., and Lehman, J. F. (2016c). Semi-situated learning of verbal and nonverbal content for repeated human-robot interaction. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, pages 13–20. ACM. [Lenhard, 2016] Lenhard, W. & Lenhard, A. (2016). Calculation of effect sizes. [Leuski and Traum, 2010] Leuski, A. and Traum, D. R. (2010). Practical language processing for virtual humans. In IAAI. [Lovato, 2017] Lovato, N. (2017). 16 Things Game Developers Should do to Improve Player Retention. http://www.gamedonia.com/blog/ 16-things-game-developers-should-do-to-improve-player-retention. [Online; accessed 22-October-2017]. [Lucas et al., 2017] Lucas, G. M., Boberg, J., Traum, D., Artstein, R., Gratch, J., Gainer, A., Johnson, E., Leuski, A., and Nakano, M. (2017). The role of social dialogue and errors in robots. In Proceedings of the 5th International Conference on Human Agent Interaction, HAI ’17, pages 431–433, New York, NY , USA. ACM. [Lucas et al., 2018] Lucas, G. M., Boberg, J., Traum, D., Artstein, R., Gratch, J., Gainer, A., Johnson, E., Leuski, A., and Nakano, M. (2018). Getting to know each other: The role of social dialogue in recovery from errors in social robots. In Proceedings of the 2018 ACM/IEEE International Conference on Human-Robot Interaction, HRI ’18, pages 344–351, New York, NY , USA. ACM. [Lucas et al., 2019] Lucas, G. M., Lehr, J., Kr¨ amer, N., and Gratch, J. (2019). The effectiveness of social influence tactics when used by a virtual agent. In Proceedings of the 19th ACM International Conference on Intelligent Virtual Agents, pages 22–29. 253 [Mann and Whitney, 1947] Mann, H. B. and Whitney, D. R. (1947). On a test of whether one of two random variables is stochastically larger than the other. The annals of mathematical statistics, pages 50–60. [Manuvinakurike and DeVault, 2015] Manuvinakurike, R. and DeVault, D. (2015). Pair me up: A web framework for crowd-sourced spoken dialogue collection. In Natural Language Dialog Systems and Intelligent Assistants, pages 189–201. Springer. [Marsella et al., 2009] Marsella, S., Gratch, J., Wang, N., and Stankovic, B. (2009). Assessing the validity of a computational model of emotional coping. In Affective Computing and Intelligent Interaction and Workshops, 2009. ACII 2009. 3rd International Conference on, pages 1–8. IEEE. [Marti and Emnet, 2018] Marti, S. and Emnet, K. (2018). Daboo - An interactive system to make a user guess a word as fast as possible (without using taboo words). http://alumni.media. mit.edu. [Online; accessed 09-May-2018]. [McNeill, 1992] McNeill, D. (1992). Hand and mind: What gestures reveal about thought. Uni- versity of Chicago press. [MicroSoft, 2018] MicroSoft (2018). XiaoIce. https://www.msxiaoice.com/. [Online; ac- cessed 18-May-2018]. [Miller, 1995] Miller, G. A. (1995). Wordnet: a lexical database for english. 
Communications of the ACM, 38(11):39–41. [Minsky, 1961] Minsky, M. (1961). Steps toward artificial intelligence. Proceedings of the IRE, 49(1):8–30. [Nakano et al., 2011] Nakano, M., Sato, S., Komatani, K., Matsuyama, K., Funakoshi, K., and Okuno, H. G. (2011). A two-stage domain selection framework for extensible multi-domain spoken dialogue systems. In Proceedings of the SIGDIAL 2011 Conference, pages 18–29. Association for Computational Linguistics. [Oh and Rudnicky, 2000] Oh, A. H. and Rudnicky, A. I. (2000). Stochastic language generation for spoken dialogue systems. In ANLP-NAACL 2000 Workshop: Conversational Systems. [Paetzel et al., 2015] Paetzel, M., Manuvinakurike, R., and DeVault, D. (2015). “so, which one is it?” the effect of alternative incremental architectures in a high-performance game-playing agent. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 77–86. [Paetzel et al., 2014] Paetzel, M., Racca, D. N., and DeVault, D. (2014). A multimodal corpus of rapid dialogue games. In LREC, pages 4189–4195. [Pan, 2018] Pan, Y .-h. (2018). 2018 special issue on artificial intelligence 2.0: theories and ap- plications. 254 [Pincus et al., 2014] Pincus, E., DeVault, D., and Traum, D. (2014). Mr. Clue - A virtual agent that can play word-guessing games. In Proc. of the 3rd Workshop on Games and NLP (GAMNLP), Raleigh, North Carolina, USA. [Pincus et al., 2015] Pincus, E., Georgila, K., and Traum, D. R. (2015). Which synthetic voice should i choose for an evocative task? In SIGDIAL Conference, pages 105–113. [Pincus et al., 2018] Pincus, E., Lei, S., Lucas, G., Johnson, E., Tsang, M., Gratch, J., and Traum, D. (2018). The importance of regulatory fit & early success in a human-machine game. In Proceedings of the Technology, Mind, and Society, page 31. ACM. [Pincus et al., 2013] Pincus, E., Stoyanchev, S., and Hirschberg, J. (2013). Exploring features for localized detection of speech recognition errors. In Proceedings of the SIGDIAL 2013 Conference, pages 132–136. [Pincus and Traum, 2014] Pincus, E. and Traum, D. (2014). Towards a multimodal taxonomy of dialogue moves for word-guessing games. In Proc. of the 10th Workshop on Multimodal Corpora (MMC), Reykjavik, Iceland. [Pincus and Traum, 2016] Pincus, E. and Traum, D. (2016). Towards automatic identification of effective clues for team word-guessing games. In Proceedings of the Language Resources and Evaluation Conference (LREC), pages 2741–2747, Portoro, Slovenia. European Language Resources Association. [Pincus and Traum, 2017] Pincus, E. and Traum, D. R. (2017). An incremental response policy in an automatic word-game. In WCIHAI@ IVA, pages 1–8. [Pope and Tabachnick, 1994] Pope, K. S. and Tabachnick, B. G. (1994). Therapists as patients: A national survey of psychologists’ experiences, problems, and beliefs. Professional Psychology: Research and Practice, 25(3):247. [Price, 1990] Price, P. J. (1990). Evaluation of spoken language systems: The atis domain. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylva- nia, June 24-27, 1990. [Raux et al., 2005] Raux, A., Langner, B., Bohus, D., Black, A. W., and Eskenazi, M. (2005). Let’s go public! taking a spoken dialog system to the real world. In Ninth European conference on speech communication and technology. [Robotics, 2018] Robotics, S. (2018). Nao. https://www.ald.softbankrobotics.com/en/ robots/nao. [Online; accessed 28-June-2018]. 
[Rovatsos et al., 2018] Rovatsos, M., Gromann, D., and Bella, G. (2018). The taboo challenge competition. AI Magazine, 39(1). 255 [Ruan et al., 2019] Ruan, S., Jiang, L., Xu, J., Tham, B. J.-K., Qiu, Z., Zhu, Y ., Murnane, E. L., Brunskill, E., and Landay, J. A. (2019). Quizbot: A dialogue-based adaptive learning system for factual knowledge. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, page 357. ACM. [Rudnicky et al., 1999] Rudnicky, A. I., Thayer, E., Constantinides, P., Tchou, C., Shern, R., Lenzo, K., Xu, W., and Oh, A. (1999). Creating natural dialogs in the carnegie mellon com- municator system. In Sixth European Conference on Speech Communication and Technology. [Ryan and Deci, 2000] Ryan, R. M. and Deci, E. L. (2000). Intrinsic and extrinsic motivations: Classic definitions and new directions. Contemporary educational psychology, 25(1):54–67. [Sawaki et al., 2008a] Sawaki, M., Minami, Y ., Higashinaka, R., Dohsaka, K., and Maeda, E. (2008a). “who is this” quiz dialogue system and users’ evaluation. In 2008 IEEE Spoken Language Technology Workshop, pages 149–152. IEEE. [Sawaki et al., 2008b] Sawaki, M., Minami, Y ., Higashinaka, R., Dohsaka, K., and Maeda, E. (2008b). ?who is this? quiz dialogue system and users’ evaluation. In Spoken Language Technology Workshop, 2008. SLT 2008. IEEE, pages 149–152. IEEE. [Schumi and Wittes, 2011] Schumi, J. and Wittes, J. T. (2011). Through the looking glass: un- derstanding non-inferiority. Trials, 12(1):1–12. [Sears, 1988] Sears, J. A. (1988). The darpa spoken language systems program: Past, present, and future. The Journal of the Acoustical Society of America, 84(S1):S188–S188. [Shah et al., 1998] Shah, J., Higgins, T., and Friedman, R. S. (1998). Performance incentives and means: how regulatory focus influences goal attainment. Journal of personality and social psychology, 74(2):285. [Shapiro and Wilk, 1965] Shapiro, S. S. and Wilk, M. B. (1965). An analysis of variance test for normality (complete samples). Biometrika, 52(3/4):591–611. [Shieber, 1994] Shieber, S. M. (1994). Lessons from a restricted turing test. arXiv preprint cmp-lg/9404002. [Shum et al., 2018] Shum, H., He, X., and Li, D. (2018). From eliza to xiaoice: Challenges and opportunities with social chatbots. CoRR, abs/1801.0195 7. [Sutton et al., 1996] Sutton, S., Novick, D. G., Cole, R., Vermeulen, P., de Villiers, J., Schalkwyk, J., and Fanty, M. (1996). Building 10,000 spoken dialogue systems. In Spoken Language, 1996. ICSLP 96. Proceedings., Fourth International Conference on, volume 2, pages 709–712 vol.2. [Tapus and Mataric, 2008] Tapus, A. and Mataric, M. J. (2008). Socially assistive robots: The link between personality, empathy, physiological signals, and task performance. In AAAI spring symposium: emotion, personality, and social behavior, pages 133–140. 256 [Tapus et al., 2008] Tapus, A., T ¸ ˘ apus ¸, C., and Matari´ c, M. J. (2008). User-robot personality matching and assistive robot behavior adaptation for post-stroke rehabilitation therapy. Intelli- gent Service Robotics, 1(2):169. [Tickle-Degnen and Rosenthal, 1990] Tickle-Degnen, L. and Rosenthal, R. (1990). The nature of rapport and its nonverbal correlates. Psychological inquiry, 1(4):285–293. [Tomczak and Tomczak, 2014] Tomczak, M. and Tomczak, E. (2014). The need to report effect size estimates revisited. an overview of some recommended measures of effect size. Trends in Sport Sciences, 21(1). 
[Torrey et al., 2016] Torrey, L., Johnson, K., Sondergard, S., Ponce, P., and Desmond, L. (2016). The turing test in the classroom. In Thirtieth AAAI Conference on Artificial Intelligence. [Traum et al., 2015] Traum, D., Georgila, K., Artstein, R., and Leuski, A. (2015). Evaluating spoken dialogue processing for time-offset interaction. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 199–208. [Tur and Deng, 2011] Tur, G. and Deng, L. (2011). Intent determination and spoken utterance classification. Spoken Language Understanding: Systems for Extracting Semantic Information from Speech, pages 93–118. [Turing, 1950] Turing, A. M. (1950). Computing machinery and intelligence. In Mind 49, vol- ume 49, pages 433–460. Mind. [V on Ahn and Dabbish, 2008] V on Ahn, L. and Dabbish, L. (2008). Designing games with a purpose. Communications of the ACM, 51(8):58–67. [V on Ahn et al., 2006] V on Ahn, L., Kedia, M., and Blum, M. (2006). Verbosity: a game for collecting common-sense facts. In Proceedings of the SIGCHI conference on Human Factors in computing systems, pages 75–78. ACM. [Walker et al., 1998] Walker, M. A., Fromer, J. C., and Narayanan, S. (1998). Learning optimal dialogue strategies: A case study of a spoken dialogue agent for email. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th Interna- tional Conference on Computational Linguistics-Volume 2, pages 1345–1351. Association for Computational Linguistics. [Wallace, 2009] Wallace, R. S. (2009). The Anatomy of A.L.I.C.E., pages 181–210. Springer Netherlands, Dordrecht. [Wang et al., 2011] Wang, Y .-Y ., Deng, L., and Acero, A. (2011). Semantic frame-based spoken language understanding. Spoken Language Understanding: Systems for Extracting Semantic Information from Speech, pages 41–91. [Weisz et al., 2019] Weisz, J. D., Jain, M., Joshi, N. N., Johnson, J., and Lange, I. (2019). Big- bluebot: teaching strategies for successful human-agent interactions. In IUI, pages 448–459. 257 [Weizenbaum, 1966] Weizenbaum, J. (1966). Eliza—a computer program for the study of natural language communication between man and machine. Communications of the ACM, 9(1):36– 45. [Wellek, 2010] Wellek, S. (2010). Testing statistical hypotheses of equivalence and noninferior- ity. CRC Press. [Wickens, 2014] Wickens, T. D. (2014). Multiway contingency tables analysis for the social sciences. Psychology Press. [Williams and Young, 2007] Williams, J. D. and Young, S. (2007). Partially observable markov decision processes for spoken dialog systems. Computer Speech & Language, 21(2):393–422. [Wolchover, 2011] Wolchover, N. (2011). How the cleverbot computer chats like a human. Live- Science. [Wolters et al., 2010] Wolters, M. K., Issac, K. B., and Renals, S. (2010). Evaluating speech synthesis intelligibility using Amazon Mechanical Turk. In Proc. of the ISCA Workshop on Speech Synthesis, Kyoto, Japan. [Yamagishi et al., 2009] Yamagishi, J., Nose, T., Zen, H., Ling, Z.-H., Toda, T., Tokuda, K., King, S., and Renals, S. (2009). Robust speaker-adaptive HMM-based text-to-speech synthe- sis. IEEE Transactions on Audio, Speech, and Language Processing, 17(6):1208–1230. [Yu et al., 2017] Yu, Z., Black, A. W., and Rudnicky, A. I. (2017). Learning conversational systems that interleave task and non-task content. In IJCAI. [Yu et al., 2016a] Yu, Z., Nicolich-Henkin, L., Black, A. W., and Rudnicky, A. (2016a). 
A wizard-of-oz study on a non-task-oriented dialog systems that reacts to user engagement. In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dia- logue, pages 55–63. [Yu et al., 2015] Yu, Z., Papangelis, A., and Rudnicky, A. (2015). Ticktock: A non-goal-oriented multimodal dialog system with engagement awareness. In Proceedings of the AAAI Spring Symposium, volume 100. [Yu et al., 2016b] Yu, Z., Xu, Z., Black, A., and Rudnicky, A. (2016b). Chatbot evaluation and database expansion via crowdsourcing. In Proceedings of the chatbot workshop of LREC, volume 63, page 102. [Yu et al., 2016c] Yu, Z., Xu, Z., Black, A. W., and Rudnicky, A. (2016c). Chatbot evaluation and database expansion via crowdsourcing. In Proceedings of the chatbot workshop of LREC, volume 63, page 102. [Zar, 2013] Zar, J. H. (2013). Biostatistical analysis: Pearson new international edition. Pearson Higher Ed. 258 [Zen et al., 2007] Zen, H., Nose, T., Yamagishi, J., Sako, S., Masuko, T., Black, A. W., and Tokuda, K. (2007). The HMM-based speech synthesis system (HTS) version 2.0. In Proc. of the ISCA Workshop on Speech Synthesis, Bonn, Germany. [Zen et al., 2009] Zen, H., Tokuda, K., and Black, A. W. (2009). Statistical parametric speech synthesis. Speech Communication, 51(11):1039–1064. [Zhao et al., 2018] Zhao, R., Romero, O. J., and Rudnicky, A. (2018). Sogo: A social intelligent negotiation dialogue system. [Zue et al., 1992] Zue, V ., Glass, J., Goodine, D., Leung, H., Phillips, M., Polifroni, J., and Sen- eff, S. (1992). The voyager speech understanding system: a progress report. In Speech Recog- nition and Understanding, pages 415–424. Springer. 259
Abstract
In the course of their lives humans perform multiple roles, such as work and social roles. However, current research in human-computer dialogue has focused on dialogue agents that perform only one role of an interaction. For example, Apple's Siri acts mainly as an assistant. In this thesis we help fill the gap in multi-role dialogue agent research.

We describe an architecture that endows a test-bed agent with core dialogue management capabilities for both roles of a word-guessing game but can be adapted for different embodiments, including a virtual human, a robot, and a non-embodied web platform that enables use of the test-bed agent in “in the wild” experiments. We incrementally evaluate design decisions for the test-bed agent that decrease the chance that our later experiments, which more directly evaluate the agent's multi-role capabilities, would fail to find effects due to confounds stemming from poor design decisions. We establish that multi-role agents, when compared to single-role versions of the same agent, are able to elicit enjoyment from users without negatively impacting users' perceptions. We also use an “in the wild” experiment to prove that a multi-role content sourcing strategy can be superior to other scalable content sourcing strategies.