Understanding and Generating Multimodal Feedback in Human-Machine Story-Telling
by
Setareh Nasihati Gilani
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
December 2023
Copyright 2024 Setareh Nasihati Gilani
Dedication
In loving memory of my dear father, whose unwavering belief in my abilities ignited the spark of ambition
within me. Though he is no longer with us, his influence and love continue to guide and motivate me every
day.
To my incredible mother, your enduring love, boundless support, and unwavering resilience have been
the pillars of strength throughout this arduous academic journey. You both have shaped my life and this
achievement is as much yours as it is mine. Thank you for everything.
Acknowledgements
It feels good to have an end to this chapter of my life. While the completion of this Ph.D. thesis marks a
significant milestone, it is the journey that matters in the end. Of course, as life happens, there have been
tons of ups and downs, there have been moments on the mountaintops and moments in deep valleys of
despair, and it was this journey itself that truly shaped me both personally and professionally.
First and foremost, I express my deepest gratitude to my advisor, Dr. David Traum, whose unwavering
guidance and academic support were invaluable throughout this doctoral endeavor. Beyond scholarly
mentorship, I am profoundly grateful for his exceptional kindness and unwavering support during some
of the most challenging moments I faced during my Ph.D. years. His understanding was a beacon of light
that guided me through the darkest times. I extend my appreciation to my esteemed committee members:
Dr. Maja Mataric, Dr. Peter Kim, Dr. Kallirroi Georgila and Dr. Mohammad Soleymani. Their insightful
feedback, expert guidance, and belief in the potential of this research were instrumental in shaping this
thesis. Their collective wisdom propelled me forward and enriched the depth of my work.
To my collaborative partners, whose expertise and collaboration were indispensable to the success of
my research, I extend my heartfelt gratitude. Working alongside them was a truly enjoyable and enriching
experience, and I am grateful for their collaborative spirit and dedication to our shared goals. Special
thanks to Ron Artstein, Kallirroi Georgila, and Anton Leuski for their invaluable insights, guidance, and
contributions, which significantly enhanced the quality and depth of this research. I extend heartfelt
appreciation to Alesia Gainer, our dedicated project manager, whose exceptional organizational skills and
tireless efforts were instrumental in seamlessly overseeing every aspect of our research endeavors.
With sincere appreciation, I acknowledge the support from the U.S. Army (ARMY) and the National
Science Foundation (NSF) for their invaluable financial support. Their funding was instrumental in the
successful execution of the experiments and projects presented in this thesis, enhancing our understanding
of human-agent interactions.
On a personal note, I would like to express my deepest gratitude to my husband, Ehsan. I am incredibly
fortunate to have met him during this academic journey, and he has been my unwavering pillar of support
ever since. Ehsan, you are my best friend, confidant, and constant source of encouragement. Your love and
belief in me have been the fuel that has propelled me through each day.
In the loving memory of my father, whose presence and unwavering support were a beacon of strength
throughout my academic life, I express my heartfelt gratitude. Your love and guidance will forever resonate
within me, propelling me to reach for the stars. To my beloved mother, your unconditional love, understanding, and unwavering support have been my steadfast foundation. I am eternally grateful for your
sacrifices and unwavering belief in my abilities.
This journey would not have been possible without the support, love, and encouragement of all the
individuals mentioned above, as well as countless others who have touched my life along the way. Thank
you all for being a part of this remarkable journey.
Table of Contents
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvi
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Story-telling in Interactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.2 Utilizing Feedback in Story-telling Context . . . . . . . . . . . . . . . . . . . . . . 5
1.1.2.1 Audience Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.2.2 StoryTellers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.1.2.3 Active third-party Story-Listeners . . . . . . . . . . . . . . . . . . . . . . 7
1.1.3 Exploring Thesis Objectives and Goals . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4 Thesis Statement and Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4.1 Thesis Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4.2 Thesis Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Chapter 2: Background and Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1 Story-Telling Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2 Story-Listener Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3 Storyteller and Story-listener Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Chapter 3: Story-Telling Agent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3 Story-swapping Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.4 Experimental Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4.1 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.2 Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Chapter 4: Identifying Opportunities for Empathetic Responses . . . . . . . . . . . . . . . . 39
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.3 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.4 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.4.1 Multimodal Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.4.1.1 Textual Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.4.1.2 Audio Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.4.1.3 Visual Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.4.2 Ground-Truth Labels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.4.2.1 Wizard judgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.4.2.2 Mechanical Turk Ratings . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.4.3 Behavior Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.4.4 Model Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.4.4.1 Unimodal models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.4.4.2 Static fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.4.4.3 RNN Fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.4.5 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.5 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Chapter 5: RAVE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.2 Background and Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.3 Overall System Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.3.1 Version 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.3.2 Version 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.3.3 Version 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.4 Robot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.4.1 Design Guideline 1: Attractive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.4.2 Design Guideline 2: Simplified . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.4.3 Design Guideline 3: Identifiable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.4.4 Design Guideline 4: Agentic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.4.5 Design Guideline 5: Safe . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.5 Avatar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.5.1 3D model construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.5.2 Motion capture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.5.3 Realtime animation and control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.5.4 Avatar behaviors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.5.5 Design of Nursery Rhyme Behaviors . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.6 Perception Modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.6.1 Thermal Camera and Thermal Infrared Imaging . . . . . . . . . . . . . . . . . . . . 80
5.6.2 Eye-Tracker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.6.3 Baby Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.7 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.8 Dialogue Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.8.1 Input signals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.8.2 Output commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.8.2.1 Primitive Behaviors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.8.2.2 Action Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.8.3 Interaction Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.9 Experiment Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.9.1 Greeting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.9.2 Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.9.3 Familiarization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.9.4 Interaction Session . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.10 Research Questions and Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.11 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.11.1 Qualitative Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.11.1.1 Case Study 1: Free-play, Wizard-of-Oz . . . . . . . . . . . . . . . . . . . 98
5.11.1.2 Case Study 2: Freeplay, introducing limited autonomy and perception . . 100
5.11.1.3 Case Study 3: Enhanced behavior and perception with tetradic interaction 102
5.11.1.4 Comprehensive study on 8 infants . . . . . . . . . . . . . . . . . . . . . . 103
5.11.2 Quantitative Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.11.2.1 Baby and Avatar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.11.2.2 Parent’s Intervention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.12 Design Lessons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.13 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
Chapter 6: Human-Swarm Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.2.1 Emotional Expression and Performance . . . . . . . . . . . . . . . . . . . . . . . . 120
6.2.2 Multimodal Affective Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.3 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
6.3.1 Scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.3.2 Residents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
6.3.3 Resident Interactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
6.3.4 Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
6.4 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6.4.1 Multimodal Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6.4.1.1 Visual Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6.4.1.2 Linguistic features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
6.4.1.3 Auditory Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
6.4.2 Proposed Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
6.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
6.5.1 Behavioral Responses to Different Personalities . . . . . . . . . . . . . . . . . . . . 133
6.5.1.1 Participant Emotions and Task Performance . . . . . . . . . . . . . . . . 133
6.5.1.2 Emotional Response to Different Characters . . . . . . . . . . . . . . . . 134
6.5.1.3 Participant Personality and Facial Expressions . . . . . . . . . . . . . . . 135
6.5.2 Performance Prediction Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
6.5.3 Intervening the Sessions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
6.6 Conclusion & Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
Chapter 7: Conclusions and Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
7.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
7.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
7.2.1 Perceiving Human Emotions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
7.2.2 Tailoring Stories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
7.2.3 Prospective Paths of Further Exploration . . . . . . . . . . . . . . . . . . . . . . . . 149
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
List of Tables
1.1 Agent Role vs Feedback Direction. Each cell has a sample scenario (A for Agent and H for
Human) for a better illustration of the condition in gray. . . . . . . . . . . . . . . . . . . . 8
4.1 Human-Agent dialogue excerpts with different empathy responses. . . . . . . . . . . . . . 45
4.2 Distribution of classes for two sets of labels . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.3 F1-scores for three-class classification. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.4 Confusion matrices (RNN fusion). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.5 Instances of RNN Fusion model’s correct/incorrect predictions on MTurk labels (Positive,
Negative, None). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.1 Robot and Avatar Primitive Behaviors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.2 A sequence of snapshots drawn from a sample experiment showing different stages of the
interaction. (Participant is a hearing male with no sign exposure and is 12 months+1 day old) 90
6.1 Residents’ Description and Difficulty Level . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
6.2 An interaction example drawn from the data: Interaction between a stubborn person
(Difficulty 3) with the operator. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.3 Statistics on operator’s interaction with different categories of residents . . . . . . . . . . . 126
6.4 Mapping of Action Units (AU)s to Emotions . . . . . . . . . . . . . . . . . . . . . . . . . . 133
6.5 Multiple linear regression statistics (DV: Operator’s emotions during interactions with
residents, IV: Operator’s performance (number of saved residents)) . . . . . . . . . . . . 134
6.6 Mean cross-validation accuracy and F1-Macro metrics for different modalities on 3-class
classification task. Models with the dagger (†) sign are affective models and others are general. 137
List of Figures
2.1 The uncanny valley diagram [190] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.1 Virtual Human Arron and Arnold; Human Arron and Arnold . . . . . . . . . . . . . . . . . 28
3.2 System Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3 Example of exchange between Aaron (3rd person VH-VH) and participant . . . . . . . . . 32
3.4 Interaction of perspective and agent on the Rapport Scale. . . . . . . . . . . . . . . . . . . 33
3.5 Interaction of perspective and agent on personality characteristics. . . . . . . . . . . . . . 34
3.6 Interaction of perspective and order on personality characteristics. . . . . . . . . . . . . . 35
3.7 Ordering effects. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.1 A participant and the virtual agent, Ellie. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2 Multimodal static fusion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.3 Box plots of verbal and nonverbal behavior with significant differences among different
classes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.4 Multimodal RNN fusion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.5 F1-scores of VADER sentiment analysis with different thresholds. . . . . . . . . . . . . . . 54
5.1 Physical deployment of system components from the front view . . . . . . . . . . . . . . . 62
5.2 Multiparty interaction between Avatar, Robot, and infant from multiple viewpoints . . . . 71
5.3 The robot used in this study (shown in three figures to the right) is based on the open-source
Maki platform from Hello Robo (shown at left). Modifications have been made to accentuate
infant-like features, to reduce distractions, to increase the saliency of the directionality of
head positions, and to support judgments of animacy and agency. . . . . . . . . . . . . . . 72
5.4 The generation of the 3D avatar model. Left: The participant inside a photogrammetry cage.
Right top: The 3D avatar model generated from photogrammetry capture. Right bottom:
Light Stage facial scans used for reference for avatar construction. . . . . . . . . . . . . . . 76
5.5 Frames of Avatar doing the BOAT Nursery Rhyme. The four frames were selected (in order
from left to right) from a fluid video clip of avatar signing where each frame represents a silent
sign-phonetic-syllabic contrastive unit as produced with the hands in the ASL visual nursery
rhyme “BOAT-ON-WAVE”. In formal linguistic analyses, these contrastive phonetic-syllabic units
are notated as follows: 5.5a /B/+low; 5.5b /B/+modulation+high; 5.5c /5/+modulation+high; 5.5d
/5/+modulation+low. These phonetic-syllabic linguistic units are not produced in isolation like a list.
Instead, they are bound into fluid movements that form rule-governed, grammatical clausal, phrasal,
and syntactic constructions in all natural languages, here ASL. . . . . . . . . . . . . . . . . . . . 78
5.6 AOI regions from infant’s perspective. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.7 Observer interface for baby’s behaviors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.8 Logical overview of system components . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.9 Summarized decision tree based on system variables . . . . . . . . . . . . . . . . . . . . . . 88
5.10 Experimental Setup (Side View) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.11 Active display area from the infant’s view and the calibration points . . . . . . . . . . . . . 93
5.12 Assistant using the magic wand to gain the baby’s attention for calibration process . . . . 94
5.13 Case study 1 infant “Albert” pointing. Here the infant, seated on his mother’s lap, as seen
from the system’s point of view. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.14 Composite image of case study 2 infant “Bella”. Here the infant can be seen from multiple
angles (bottom left and right) interacting with the autonomous system (top) . . . . . . . . 102
5.15 Composite image of case study 3 infant “Celia”. Here the infant (right and bottom left) can
be seen producing a waving-arm gesture towards the system (top left). . . . . . . . . . . . 103
5.16 Annotation example using ELAN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.17 Frequency of Avatar’s Behaviors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.18 Frequency of Baby’s Behaviors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.19 Babies’ response rate to different avatar behaviors . . . . . . . . . . . . . . . . . . . . . . . 110
5.20 Babies’ categorical responses to different avatar behaviors . . . . . . . . . . . . . . . . . . 110
5.21 Frequency of baby response types in absence (condition 1) or presence (condition 2) of
parental involvement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.1 Overview of the simulation environment, the spokesperson, and the operator (aka the user) 119
6.2 Overall Architecture of Our Multi-Modal Information Infusion. . . . . . . . . . . . . . . . 127
6.3 Happiness Expressions by Character Type . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
6.4 Confusion matrix of the partial-inference model . . . . . . . . . . . . . . . . . . . . . . . . 139
7.1 Sample Prompt to chatGPT regarding perceiving the emotional state . . . . . . . . . . . . 147
7.2 ChatGPT’s response of characterizing the emotional state of the people in the given
conversation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
7.3 Follow up conversation with ChatGPT about tailoring the following utterance based on the
previous context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
Abstract
People use feedback (verbal or nonverbal) from their interlocutors to guide their own behavior and alter
the flow of conversation. In this thesis, we focus on human-machine interactions that involve storytelling
and investigate the role of understanding and providing feedback from the machine’s perspective. We
explored the characteristics of stories that machines should use to increase rapport. We developed machine
storytellers and listeners that can provide feedback and adapt their stories based on perceived multimodal
feedback from their users. Finally, we investigated how machines can use real-time predictions based on user
feedback to further adapt the dialogue management policies of the system for better overall performance.
Chapter 1
Introduction
1.1 Overview
People engaged in conversation use feedback (verbal or non-verbal) from their interlocutors to guide
their own behavior and alter the flow of the conversation, where advisable [27]. For example, one may
choose to cut the conversation short upon sensing the other person's apathy; others might try to regain the lost engagement by adding humor to the conversation or making a sudden noise such as clearing their throat; and another person might choose to change the topic of the conversation, temporarily or entirely, based on the received feedback [60]. Computers are not nearly as good as people at understanding and generating multimodal feedback in their interactions with humans. Often, they are programmed to complete a specific task and do not incorporate feedback signals into their agenda. In this thesis, we aim
to address this problem. Narrative is often thought of as a monologic activity, with a single speaker who
tells stories to an audience. However, stories told in a dialogue setting often involve feedback from the
audience, the storyteller, and even third parties. It is thus a good starting point for examining some of the
roles of feedback in dialogue. To this end, in this thesis, we will focus on human-machine interactions
that involve some sort of story-telling. We will focus on the role of computers both as story-tellers and
story-listeners and will discuss how they can use multimodal feedback in each role.
It’s crucial to emphasize that feedback within a conversation encompasses more than just verbal
expressions. Feedback manifests in diverse forms, including both verbal and non-verbal cues. These cues
can come in various modalities, including visual (e.g., facial expressions, body language), acoustic (e.g., tone
of voice), and even physiological (e.g., changes in heart rate).
In the context of employing feedback in human-machine interactions, one should think about two
different dimensions from the machine’s perspective; one is for it to understand and interpret the human’s
feedback and the other is for it to generate appropriate and meaningful feedback in response to the human
in the course of the interaction. Not all human-machine interactions encompass both dimensions. For
example, when you press a button on a remote control, the machine (the remote control) understands your
action and acts accordingly, and you can consider the volume change or channel change as the machine’s
feedback to your input, but it does not understand further feedback from the user about whether it did the
right job or not. Also, these two dimensions are not mutually exclusive; rather, they depend on each other if one wants to reach a deep sense of conversational flow and achieve common ground. In fact, in a fluent conversation, one cannot be fully effective in one aspect without being competent in the other. Perhaps an example makes this clearer:
imagine a simple scenario of two people sitting across the dinner table. The interaction goes as follows:
Person A: “please pass the butter”
Person B: raises eyebrows
Person A: points to the butter
Person B: passes the butter
In this scenario, Person A understands B's feedback (raising eyebrows) and interprets this gesture as a sign that B has not understood what A wants. Therefore, A makes the request clear by pointing to the butter. Then, B comprehends what A meant by picking up on the provided feedback again. It is clear that both understanding and providing feedback played an important role in the course of the interaction
and completing the task. As such, a machine should be able to understand the human's feedback as well as produce appropriate feedback and make good decisions in the flow of the conversation.
In this thesis, we will focus on interactions that include both dimensions. Both of these areas are active
research areas in the affective computing field. We will go over each of these dimensions briefly:
1. Comprehending human feedback: A lot of information can be gained by parsing the verbal information given as feedback in the interaction. Examples of verbal feedback include phrases like "aha" and "go on" as positive affirmations, and phrases like "what?" and "say that again?" (or the absence of positive feedback when it is expected) as negative feedback. Most dialogue management systems rely on verbal information as one of the main input sources (most often the only input source) for determining the next state of the dialogue. But research has shown that making use of multimodal input such as acoustic cues or facial expressions can improve the system's ability to build rapport with the user [120, 176, 263]; a minimal illustration of such a multimodal state update is sketched after this list. This comes as no surprise, since humans also frequently rely on multimodal cues when managing their interactions with each other. Multimodal behavioral processing technology is a key technique for developing an empathetic dialogue system that can adapt to the behavior of a human user.
2. Generating feedback: In order to have a smooth and natural conversation, the machine should also
be able to produce appropriate and timely feedback to the user. Abubshait and Wiese [1] argue that
seeing mind in others is not exclusive to human agents, but mind can also be ascribed to non-human
agents like robots, as long as their appearance and/or behavior allows them to be perceived as
intentional beings. Research has shown that in order to trigger mind perception, non-human entities
need to display signs of intentionality via appearance [136, 164] and/or behavior [280].
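As a concrete (if simplified) illustration of how such multimodal cues could enter a dialogue manager's state, consider the following Python sketch. It is purely hypothetical: the cue names, weights, and thresholds are illustrative and do not correspond to any system described in this thesis.

    from dataclasses import dataclass

    @dataclass
    class DialogueState:
        last_user_act: str = ""    # verbal feedback, e.g. "acknowledge" or "request_repeat"
        smile_score: float = 0.0   # visual cue in [0, 1] from a facial-expression detector
        vocal_energy: float = 0.0  # acoustic cue, normalized loudness in [0, 1]
        engagement: float = 0.5    # running estimate fused from all modalities

    def update_state(state: DialogueState, act: str, smile: float, energy: float) -> DialogueState:
        # Fuse the newest multimodal observations into the engagement estimate.
        state.last_user_act, state.smile_score, state.vocal_energy = act, smile, energy
        evidence = 0.5 * smile + 0.3 * energy + (0.2 if act == "acknowledge" else 0.0)
        state.engagement = 0.8 * state.engagement + 0.2 * evidence  # exponential smoothing
        return state

    def choose_action(state: DialogueState) -> str:
        # Pick the next system move based on the fused state rather than words alone.
        if state.engagement < 0.3:
            return "change_topic"  # listener seems disengaged
        if state.last_user_act == "request_repeat":
            return "rephrase"
        return "continue_story"

    state = update_state(DialogueState(), act="acknowledge", smile=0.7, energy=0.4)
    print(choose_action(state))  # -> continue_story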
1.1.1 Story-telling in Interactions
Story-telling is a particular domain in which comprehending and generating feedback plays a crucial role [90]. The art of storytelling, an ancient and fundamental form of oral communication, traces its roots
back to prehistoric times. Remarkably, storytelling remains a long-lasting human practice, serving as a
multifaceted tool for both acculturation and the transmission of human history [60, 233]. This long-lasting
tradition finds its place in the intricate realm of human interaction, exerting a profound influence on our
social dynamics, identity formation, and the preservation of cultural heritage.
In the realm of human discourse, stories permeate conversations, enriching our interactions with depth
and nuance [200]. They are not merely anecdotes but potent carriers of meaning and shared experiences.
Stories often function as a mechanism for individuals to establish and express their identities [21, 94]. In
narrating personal anecdotes, individuals craft a narrative that embodies their unique life experiences and
perspectives. Furthermore, the power of storytelling extends beyond the individual; it plays a pivotal role
in the intergenerational transfer of cultural heritage [216]. Through storytelling, societies preserve and
transmit their traditions, beliefs, and values to successive generations. The tales passed down through
generations serve as a living archive of a community’s collective memory, ensuring the continuity of its
cultural identity. Moreover, storytelling serves as a potent tool for building rapport and strengthening social
bonds [258]. The act of sharing stories fosters empathy and connection among individuals. By sharing
personal narratives, we open a window into our inner worlds, inviting others to relate to our experiences
and emotions. This shared vulnerability fosters mutual understanding and forges deeper connections
among people.
Beyond its social and cultural significance, storytelling takes on a cognitive dimension. Human beings
often employ narratives, whether their own or those of others, as a means to elucidate their perspectives
and argue their viewpoints on various issues. Stories become a form of evidence to support and articulate
our mental frameworks [117] and explain our decisions [26]. This mode of argumentation transcends mere
logical deduction, as it appeals to our emotions and intuitive understanding, making it a compelling and
persuasive tool in discourse.
Traditionally, human storytelling has been one of the main means of conveying knowledge from
generation to generation, but nowadays new technologies have also been used in this knowledge-sharing
process [60, 51]. Conversational storytelling provides a communication channel that an agent can use to
maintain engagement with the human user [24].
In summation, storytelling, as a timeless mode of communication, continues to play a pivotal role
in human interaction. From shaping personal identities to preserving cultural heritage, from building
connections to facilitating persuasive arguments, stories are indispensable in our journey to understand
ourselves and our place in the world. In the words of Robert McKee, American screenwriter and story
consultant: “Storytelling is the most powerful way to put ideas into the world today” [177].
1.1.2 Utilizing Feedback in Story-telling Context
1.1.2.1 Audience Design
In a natural human-human conversation, the storyteller often tailors the utterances/stories for their
addressees, and this has been labeled audience design [27]. The teller might choose to omit specific details or elaborate more or less on something. For example, a speaker might tell a different version of a story to a
child than they would to an adult. They might also tell a different story to a group of close friends than
they would to a group of strangers. Audience design is not just about the addressee. Speakers also consider
the feedback they receive from their audience, as well as the lack of feedback from disengaged audience
members [149]. This helps speakers to adjust their communication style and ensure that their message is
being understood.
In addition to the speaker and addressees which are standard roles of any interaction [72], there can be
other participants involved in the interaction. The speaker and the addressee often swap roles during the
course of the interaction, but there are also other ratified participants who can become more active and
can also play a role in audience design [94]. For example, in a debate, a candidate may be speaking to the
moderator or interviewer, but they are really trying to reach the audience who will vote for them or give
them money.
1.1.2.2 StoryTellers
A contemporary illustration of the dynamic interplay between storytelling and audience design can be
found in the work of Lima, Feijo, and Furtado [159]. Their research introduces an innovative adaptive
storytelling system that tailors its narrative delivery based on individual personality traits and preferences.
This system represents a departure from traditional, one-size-fits-all storytelling, offering users distinct
story versions carefully crafted to resonate with their unique characteristics. It is noteworthy that their tailoring process is not executed on the fly during the storytelling interaction but relies on
personality traits and preferences gleaned from users at the outset of the study. The system leverages this
initial profiling to provide a tailored storytelling experience that aligns with each user’s distinct tastes and
sensibilities.
In sum, audience design is an important aspect of human communication, wherein storytellers craft
their narratives based on their audience’s needs, engagement, and responsiveness. This dynamic process
underscores the nuanced interplay between storytellers and their audiences. Moreover, the advent of
adaptive storytelling systems, exemplified by the work of Lima, Feijo, and Furtado [159], highlights the burgeoning
intersection of communication and technology. This intersection invites further exploration and research
into the audience-tailored storytelling phenomena.
1.1.2.3 Active third-party Story-Listeners
In the context of dialogue, an active third-party story-listener is an external observer who assumes the role
of an attentive spectator. Unlike a passive observer, the third-party actively engages with the interaction,
discerning cues and feedback from both sides. Based on these observations, the third party may elect to
intervene or take action, thus exerting a significant influence on the unfolding discourse. For example,
consider a scenario in which an uncle is telling a story to his niece, and the child’s mother is present. The
mother, as an active third-party story-listener, may decide to intervene if she notices that the child is having
difficulty understanding some words, or if she perceives that the child might be scared by the direction the
story is taking. She may intervene by providing clues to the uncle or taking the floor herself.
This dynamic scenario exemplifies the intricate nature of human communication, where the active
third-party story-listener serves as a pivotal character in the narrative and their potential intervention,
contributes to the evolving storyline. By discerning cues and feedback from both sides, they can gain
insights into the unfolding discourse that may not be apparent to the primary participants. This information
can be used to intervene or take action in a way that promotes effective communication and understanding.
1.1.3 Exploring Thesis Objectives and Goals
In this thesis, our primary objective is to explore the role of feedback in human-agent storytelling. Specifically, our research focuses on the domain of dialogue management systems, which are designed to analyze,
comprehend, and generate multimodal feedback within the context of human-agent storytelling interactions.
These systems can be categorized based on their unique roles in storytelling scenarios, falling into three
key categories:
1. StoryTelling agents: These systems are proficient in narrating stories to human interlocutors.
2. Story Elicitation Agents: These systems excel at prompting and coaxing narratives from human
participants.
Feedback Direction (columns): Neither; Recognizing; Providing; Recognizing & Providing.
Agent role (rows): Neither (not in story-context); Storyteller; Story-listener; Third-party Story-Listeners.

Agent role: Neither (not in story-context)
1 (Neither): Agents who are oblivious to the interlocutor's feedback.
   Example: automated ticket-buying systems. Example systems: [163]
2 (Recognizing): Agents who are able to monitor their interlocutors and recognize their feedback.
   Example: A: Don't forget your umbrella, it's going to be rainy today. H: What? I'm not in Seattle anymore. A: Don't forget your sunscreen. It's sunny in LA. Example systems: [246]
3 (Providing): Agents who can provide some feedback in the course of the interaction to their interlocutors.
   Example: an agent smiling at random times in the conversation.
4 (Recognizing & Providing): Agents who use their interlocutor's feedback to provide appropriate feedback.
   Example: A: You have a meeting in 10 minutes. H: I'm on a sick leave today. A: Oh okay, sorry to hear that. Feel better soon. Example systems: [97]

Agent role: Storyteller
5 (Neither): Storyteller agents that can neither recognize nor provide feedback.
   Example: text-to-speech systems that read a story from a written text.
6 (Recognizing): Storyteller agents who can recognize the listener's feedback; these systems are "aware" of their interlocutor and monitor them.
   Example: A: There once was a little girl called Snow-white living in a forest. H: Oh no, not this story again. I hate it! A: There once was a little girl with a red riding hood living in a forest. Example systems: [275]
7 (Providing): Systems that provide feedback without being aware of the content of the interaction.
   Example: a system making eye-gaze to different people once in a while during the storytelling. Example systems: [270, 159]
8 (Recognizing & Providing): Storyteller agents who can provide appropriate feedback based on the perceived feedback.
   Example: A: Let me tell you a scary story tonight. H: (frowns) I'm not in the mood for that. A: I understand. Let me tell you about a funny thing that happened today instead. Example systems: [46, 236]

Agent role: Story-listener
9 (Neither): Story-listener agents who are oblivious to the interlocutor's speaker cues.
   Example: agents who just stay mute when they are told a story.
10 (Recognizing): Story-listener agents who can recognize the storyteller's feedback.
   Example: systems that do sentiment analysis on stories told by others. Example systems: [56]
11 (Providing): Story-listener agents who provide feedback unrelated to the content.
   Example: an agent nodding while listening to a story from the interlocutor. Example systems: [102]
12 (Recognizing & Providing): Systems that either elicit or listen to stories from their interlocutors and are able to provide appropriate feedback based on their interlocutor's behaviors.
   Example: H: I lost my partner in a tragic accident last year. It was a brutal time for me. (sighs) A: I'm sorry to hear that. It should have been very hard. Example systems: [66, 202]

Agent role: Third-party Story-Listeners
13 (Neither): Third-party story-listeners who are oblivious to the interlocutor's feedback.
   Example systems: [91]
14 (Recognizing): Third-party story-listeners who are able to monitor the interlocutors for recognizing feedback.
   Example systems: [195]
15 (Providing): Third-party story-listeners who can provide some feedback in the course of the interaction to the interlocutors.
16 (Recognizing & Providing): Third-party story-listeners who use their interlocutor's feedback to provide appropriate feedback or intervene.

Table 1.1: Agent Role vs Feedback Direction. Each cell has a sample scenario (A for Agent and H for Human) for a better illustration of the condition in gray.
3. Active Third-Party Story Listeners: These systems adeptly observe and engage with ongoing
interactions between the interlocutors and provide feedback or interventions as warranted.
Our comprehensive framework, depicted in Table 1.1, outlines the diverse agents capable of either
recognizing, providing, or both recognizing and providing feedback during interactive storytelling encounters. The table organizes all the different kinds of use of feedback in storytelling systems and shows the
state of the art and the gaps. In this table, we offer a snapshot of these agents, accompanied by examples of
State-of-the-Art (SOTA) systems that align with each category (where applicable). A detailed exploration of
the SOTA and related systems will be presented in Chapter 2.
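To make the organization of Table 1.1 concrete, the short Python sketch below (purely illustrative, not part of any system in this thesis) encodes the two axes and maps a role and a feedback direction to the corresponding cell number:

    from enum import Enum

    class Role(Enum):
        NON_STORY = 0       # not in a story context
        STORYTELLER = 1
        STORY_LISTENER = 2
        THIRD_PARTY = 3     # active third-party story-listener

    class FeedbackUse(Enum):
        NEITHER = 0
        RECOGNIZING = 1
        PROVIDING = 2
        RECOGNIZING_AND_PROVIDING = 3

    def table_cell(role: Role, feedback: FeedbackUse) -> int:
        # Cells are numbered 1-16, row by row, exactly as in Table 1.1.
        return role.value * 4 + feedback.value + 1

    # A storyteller that both recognizes and provides feedback occupies cell 8;
    # a third-party story-listener with the same ability occupies cell 16.
    assert table_cell(Role.STORYTELLER, FeedbackUse.RECOGNIZING_AND_PROVIDING) == 8
    assert table_cell(Role.THIRD_PARTY, FeedbackUse.RECOGNIZING_AND_PROVIDING) == 16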
Limitations of current systems:
1. Most systems that fit in the second column of Table 1.1 (i.e., cells 2, 6, 10, and 14) recognize and analyze the user's feedback in offline analysis. For example, although the system described in [275] uses multimodal signals from the user, it is unable to make online adjustments to the course of the interaction or to use these signals in real time.
2. There are not many storyteller agents that fit in cell 7 of the table. The systems that provide feedback usually rely on a pre-determined pattern of feedback, called envelope feedback [46], or on information obtained from the user before the interaction, such as their personality type. For example,
the system described in [159] is successful in tailoring the story and selecting a specific version of a
story to tell based on the preferences of users, which they model in terms of the Big Five factors.
3. To our knowledge, there are no systems that can both tell stories and listen to them while recognizing and providing feedback at the same time using multimodal channels.
Our overarching research objective is to equip agents, regardless of their specific roles, with the capacity
to both recognize and provide feedback effectively. In essence, our aspiration is to cultivate agents capable
of seamlessly fitting into the fourth column of Table 1.1, a column that is currently sparsely populated.
Existing systems that attempt to occupy this space often fall short, either lacking the requisite competency
to recognize diverse modalities of input feedback or failing to provide nuanced, contextually appropriate
feedback in response to perceived cues. For instance, consider the system introduced in [236], which
primarily relies on audio thresholds to determine when to offer feedback in the form of non-verbal cues like
’uh-huh’ nods. However, this system’s feedback repertoire remains limited, primarily focusing on auditory
cues and failing to offer rich multimodal feedback to users.
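As an illustration of how such audio-threshold backchannel triggers generally work, the sketch below uses a simple pause-and-energy rule; it is a generic approximation, not a reimplementation of the system in [236], and its thresholds are invented for the example.

    def backchannel_trigger(energy_frames, pause_ms, energy_threshold=0.15, min_pause_ms=400):
        # Emit a backchannel cue only when the user was just speaking and then
        # paused briefly, which is where listeners typically slot in acknowledgements.
        recently_speaking = any(e > energy_threshold for e in energy_frames)
        if recently_speaking and pause_ms >= min_pause_ms:
            return "uh-huh"
        return None

    print(backchannel_trigger([0.4, 0.3, 0.2], pause_ms=500))  # -> uh-huh
    print(backchannel_trigger([0.05, 0.02], pause_ms=500))     # -> None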
To achieve our research goals, we adopt an incremental approach. Initially, we construct agents with
capabilities that align with other cells in Table 1.1, gradually progressing towards agents that can adeptly
fit into the fourth column. Our approach, given in Section 1.3, outlines the evolution of these agents,
fostering their ability to recognize and provide diverse and contextually relevant feedback in the landscape
of human-agent storytelling interactions.
1.2 Research Questions
The focus of this thesis is on dialogue management systems that (1) detect and understand and (2) generate
multimodal feedback in a human-agent storytelling interaction. Also, we investigate the type of stories that
virtual humans should tell their human interlocutors. We focus on the following more specific questions:
1. What kinds of stories should virtual story-tellers provide?
2. What kinds of multimodal feedback can agents provide to people with the goal of improving the
interaction?
3. What kinds of multimodal feedback can agents recognize and understand and how can they be used
to adapt the dialogue management policies of the agent? For example, how can dialogue managers
include different modalities of the sensory input in their internal state? These inputs can be visual,
acoustic, or physiological.
4. How can agents use real-time predictions based on user feedback to further adapt the dialogue
management policies of the agent?
Question 1 is related to stories that agents that fall into cells 5, 6, 7, and 8 should use. There are different aspects
to stories that a virtual human uses, such as the identity of the protagonist, the perspective of the narrator,
etc.
Question 2 is related to agents that fit the third column of the same table.
Question 3 is directly related to agents that fit the fourth column of Table 1.1. It concerns understanding feedback and then using it to adapt the interaction and make changes accordingly.
Question 4 can be addressed by having agents that would fit the last column of the table. These agents
have the ability to tailor their stories in addition to giving feedback to the user. They can also anticipate the outcome of an ongoing interaction, drawing on real-time situational awareness and reading the multimodal feedback from the interlocutors and recent events. Moreover, these perceptive agents
are equipped to align their actions with predefined system goals and predictions, strategically intervening
when deemed necessary.
1.3 Approach
In this section, we describe the work that has been done as part of this thesis and how each piece contributes to addressing one or more of the questions and sub-questions described in Section 1.2.
In order to demonstrate the necessity of feedback in interaction, and to directly answer our question 1, our first experiment was designed around a virtual human in a story-telling scenario with a
user. In this experiment, described in chapter 3, the agent tells stories to the user and in return elicits stories
from the user, acting as both story-teller and story-listener. Chapter 3 is based on our two published papers
on this project: [91] and [193]. This experiment explores different dimensions such as the identity and the
embodiment of the agent in telling stories and analyzes the effect of each one on user rapport. The agent in this experiment has a simple dialogue manager; although it did not use any user-generated feedback nor produce any feedback in the interaction, this experiment lays the groundwork for answering the second and third research questions. The system in this experiment would fall into cell 5 of Table 1.1.
To investigate question 2, we did a data analysis on a story-listening agent. The experiment, described in chapter 4, analyzes behavioral cues in interactions with an interactive virtual agent to identify opportunities to provide an empathetic response. This chapter has been written based on our paper [262]. In this project, a multimodal deep neural network was developed for identifying opportunities when the agent should express positive or negative empathetic responses. The model uses verbal, acoustic, and visual modalities. The system is able to provide feedback, so it would fit cell 11 of Table 1.1.
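As a schematic of the kind of late fusion involved in such a model, the PyTorch sketch below encodes each modality separately and classifies the concatenation into three classes (positive, negative, none). The layer sizes and feature dimensions are placeholders, not the values used in Chapter 4.

    import torch
    import torch.nn as nn

    class LateFusionEmpathy(nn.Module):
        # One small encoder per modality, concatenated, then a 3-way classifier.
        def __init__(self, text_dim=300, audio_dim=88, visual_dim=35, hidden=64, n_classes=3):
            super().__init__()
            self.text = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU())
            self.audio = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
            self.visual = nn.Sequential(nn.Linear(visual_dim, hidden), nn.ReLU())
            self.head = nn.Linear(3 * hidden, n_classes)

        def forward(self, t, a, v):
            fused = torch.cat([self.text(t), self.audio(a), self.visual(v)], dim=-1)
            return self.head(fused)  # logits over {positive, negative, none}

    model = LateFusionEmpathy()
    logits = model(torch.randn(1, 300), torch.randn(1, 88), torch.randn(1, 35))
    print(logits.shape)  # torch.Size([1, 3])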
To investigate question 3, we designed a multimodal system called RAVE, described in chapter 5. This
chapter has been written based on our published papers on this project [240, 195, 197, 196, 210].
RAVE is a system that engages in a multimodal and multiparty interaction with deaf and hard-of-hearing
infants. The goal of the system is to provide a tool to facilitate teaching sign language to these babies. To
this end, the system uses a virtual human and an embodied robot to build a three-way interaction with the
baby. The system uses multiple modalities for feedback of different sorts and can provide sign language
patterns of “nursery rhymes” to the baby, which we consider a form of story in this interaction. Thus, the agent in this scenario is a story-teller agent who can provide appropriate feedback and therefore fits cell 8 of Table 1.1.
Addressing our fourth research question, we designed and developed a model capable of predicting user performance based on multimodal behavioral cues in a high-stress human-agent interaction scenario. More specifically, we focus on cell 16 of Table 1.1. Utilizing visual, audio, and text data from
a multimodal corpus, we demonstrated the model’s ability to forecast user performance, enabling timely
agent interventions for performance enhancement. This approach highlights the potential of predictive
models and strategic agent interventions in elevating the quality of human-agent interactions. This work
has been discussed in chapter 6. This chapter has been written based on our published papers on this
project [192, 49, 264, 194, 48].
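The intervention logic behind this idea can be pictured as a loop that predicts the likely outcome from the cues observed so far and intervenes once the prediction drops below a threshold. The Python sketch below is purely illustrative; the threshold, timing, and feature names are hypothetical, not values from the study in Chapter 6.

    def should_intervene(predicted_success, elapsed_s, threshold=0.4, min_elapsed_s=60.0):
        # Intervene only after enough evidence has accumulated and the
        # predicted task success has dropped below the threshold.
        return elapsed_s >= min_elapsed_s and predicted_success < threshold

    def run_session(stream, predictor):
        # stream yields (elapsed_s, features); predictor maps features to a success probability.
        for elapsed_s, features in stream:
            p = predictor(features)
            action = "intervene" if should_intervene(p, elapsed_s) else "observe"
            yield action, elapsed_s, p

    # Toy usage with a scripted stream and a dummy predictor.
    events = [(30.0, {"stress": 0.25}), (90.0, {"stress": 0.75})]
    dummy_predictor = lambda f: 1.0 - f["stress"]  # high stress -> low predicted success
    for decision in run_session(iter(events), dummy_predictor):
        print(decision)  # ('observe', 30.0, 0.75) then ('intervene', 90.0, 0.25)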
These four chapters have successfully addressed our initial research questions. However, it’s important
to recognize that the research landscape offers numerous unexplored questions and potential avenues for
further investigation. We will delve into possible future directions of this thesis in chapter 7.
1.4 Thesis Statement and Contributions
1.4.1 Thesis Statement
I define my thesis statement as follows:
By understanding and generating multimodal feedback, virtual agents can improve the quality of
human-machine storytelling interactions by providing engaging, personalized, and interactive experiences for users.
1.4.2 Thesis Contributions
Based on the approach to my research questions explained in Section 1.3, the contributions of this thesis are
as follows:
1. Investigating the type of stories virtual agents should use
We explored the role of virtual agents in storytelling scenarios and the impact of different story
types on user interaction. We conducted experiments, such as the one described in Chapter 3, where
virtual agents acted as both storytellers and listeners. By investigating the type of stories that virtual
agents should use, we aimed to better understand how story content influences user engagement and
rapport.
2. Investigating the role of feedback in human-machine interactions
We demonstrate the importance of feedback in human-machine interactions. The first experiment in
Chapter 3 serves as a foundational study, even though it doesn’t directly incorporate user-generated
feedback. This experiment laid the groundwork for addressing our second and third research questions,
highlighting the essential role of feedback in enhancing the quality of interactions between virtual
agents and users.
3. Design and development of the dialogue manager for the RAVE system that takes multimodal feedback into account
The RAVE system, described in Chapter 5, is a significant contribution to my thesis. This system
engages in multimodal and multiparty interactions with deaf and hard-of-hearing infants, aiming
to facilitate the teaching of sign language. My contribution involves designing and developing the
dialogue manager for RAVE, which considers various forms of feedback and incorporates them into
the interaction. This demonstrates a practical application of the research findings in the real world
and in an educational context.
4. Investigating the types of signals agents can use to adapt to in the course of the dialogue
The work in this thesis explores how virtual agents can adapt to user interactions by analyzing different
types of signals. For instance, in the experiment described in Chapter 4, we focus on identifying
opportunities for providing empathetic responses based on behavioral cues, encompassing verbal,
acoustic, and visual modalities. Investigating these adaptation signals is vital for improving the
responsiveness and effectiveness of virtual agents in human-agent interactions.
5. Creating a model for task performance prediction using multimodal feedback in the interaction
This work extends to creating a predictive model that can forecast user performance based on multimodal behavioral cues in high-stress human-agent interactions. This model, discussed in Chapter 6,
has the potential to enable timely agent interventions for performance enhancement. By developing
this predictive model, we highlight the practical applications of this research in improving the quality
of human-agent interactions, particularly in high-stress scenarios.
6. Creating a framework for organizing feedback-usage-type with story-telling-role-type for
virtual agents.
We have contributed to the field by developing a framework that organizes feedback usage types
and story-telling role types for virtual agents (summarized in Table 1.1). This framework provides
a structured approach to understanding the interplay between feedback and storytelling roles in
human-agent interactions, as demonstrated in the various experiments and projects described in this
thesis. We claim this table as one of the contributions of this thesis: it organizes all the different kinds of use of feedback in storytelling systems and shows the state of the art and the gaps. In this table, we
offer a snapshot of the agents, accompanied by examples of State-of-the-Art (SOTA) systems that
align with each category (where applicable).
Chapter 2
Background and Related Work
In recent years, dialogue systems have been used in practice in a wide range of applications such as virtual assistants and conversational bots [73]. Perhaps the most famous of these dialogue systems are voice assistants like Siri [250], Alexa [165], Google Assistant [161], and to some extent ChatGPT [39], which are widely used by millions of people around the world [73]. This industry is rapidly expanding, with market revenues estimated at 15.6 billion U.S. dollars by 2020 and projected to reach 35.5 billion by 2025 [274]. For example, Google Assistant is available in more than 90 countries and in over 30 languages and is used by more than 500 million users monthly [38]. According to Amazon Alexa statistics, 53.6 million Amazon Echo speakers were sold in 2020, and the number grew to 65 million in 2021 [255]. These systems are very competent at responding to simple verbal commands like setting an alarm or a reminder, or fetching information like the weather. They help people shop online,
send instant messages, and make phone calls, all through voice commands [53]. In general, these virtual
assistant applications run as a background process that can be activated using a reserved voice command
(e.g., “Hey Siri” or “OK Google”).
At the heart of any dialogue system, there is a dialogue management component. In the case of virtual assistants, this component determines the system's action based on the given command. Some examples of these actions are performing a specific task (e.g., setting an alarm or making an entry in the calendar), providing requested information (e.g., the weather), or asking follow-up questions for clarification. In the first example, after performing the requested action, the system confirms fulfillment of the task to the user. This confirmation, which is considered a "feedback" signal in the interaction, assures the user that their requested task has been completed. Providing and seeking feedback plays an important role in any interaction. In fact, one of the central claims in many theories of conversation is that dialogue partners seek and provide evidence about the success of their interaction [37]. That is, interlocutors (persons who take part in a dialogue or conversation) tend to follow a proof procedure to check whether their utterances were understood correctly or not, and constantly exchange specific forms of feedback that can be affirmative or negative [99].
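To make this pattern concrete, the following minimal sketch (in Python) shows a toy dialogue manager that selects an action for a recognized command and returns a confirmation as feedback, or a clarification request when understanding fails. The intents and handlers are hypothetical and purely illustrative; they do not correspond to any particular assistant's API.

# Toy sketch of a dialogue manager: map a recognized intent to an action,
# then return a confirmation ("feedback") or a clarification request.
# Intents, slots, and handlers are hypothetical, for exposition only.

def set_alarm(time: str) -> str:
    # ... perform the action, then confirm completion to the user ...
    return f"Okay, your alarm is set for {time}."

def get_weather(city: str) -> str:
    # ... fetch the information, then report it back ...
    return f"The weather in {city} is sunny and 72 degrees."

HANDLERS = {"set_alarm": set_alarm, "get_weather": get_weather}

def dialogue_manager(intent: str, slots: dict) -> str:
    """Select the system action and produce feedback about its outcome."""
    handler = HANDLERS.get(intent)
    if handler is None:
        # Negative feedback: signal non-understanding and ask for clarification.
        return "Sorry, I didn't understand that. Could you rephrase?"
    return handler(**slots)

if __name__ == "__main__":
    print(dialogue_manager("set_alarm", {"time": "7:00 am"}))
    print(dialogue_manager("buy_tickets", {}))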
Understanding an interlocutor's feedback in human-human interaction comes naturally to most people with normal perceptual abilities, while an inability to perceive these feedback signals is considered a disability [70]. For example, deaf people cannot perceive verbal feedback, which may impact their ability to start or hold a conversation and to pick up verbal cues (e.g., tone) that give context to what is being said beyond the words. In this thesis, except where explicitly mentioned (for example, in Chapter 5), we will focus on people with normal perceptual abilities.
There are two aspects to consider when discussing how feedback should be provided: the first concerns human-like qualities, the other concerns efficiency. Whether or not this feedback should be human-like is not a trivial question. People have contrasting preferences about how a robot should behave and how affective its behavior should be. While some designers believe that computers should be super tools and active appliances, designed just to fulfill our demands, others believe that these latest AI systems can be collaborators [279] and that having human-like experiences with a robot increases the rapport built between them and leads to a more satisfying interaction.
This debate has a long history in the field. In 1997, Pattie Maes and Ben Shneiderman had a famous debate about whether smart agency (i.e., AI) or direct manipulation would dominate user interaction design [247].
We summarize some of the arguments for each side and mention some recent research concerned with
this debate.
• Human-like feedback:
There is an argument that at least some feedback and communicative behavior should be very human-like, because humans know how to understand it quickly and easily: our neural systems are hard-wired through evolution to be sensitive to certain kinds of signals as opposed to others, and through social practices we are used to receiving and interpreting specific kinds of signals. For example, we know instantly what a head nod or a smile means, whereas we cannot say the same if we encounter a set of blinking lights; those would need to be defined properly for us to understand them. Another argument is that when a computer system becomes “smart”, human users tend to mindlessly apply social rules and expectations, or even assign personalities to the system. This is known as “Computers Are Social Actors” (CASA) or anthropomorphism [198]. Thus, humans tend to treat AI as a collaborative partner instead of a tool when they are working together towards a common goal [279, 45].
• Non-human-like feedback:
Ben Shneiderman is one of the biggest names advocating against human-like machines. He believed that machines should be “tools of visualization” for humans to complete their tasks, and that the future of computing lies in the direction of “information visualization” rather than “depending on “agents” that learn their likes and dislikes and act on a user’s behalf” [247]. Another argument in this direction is the concept of the “uncanny valley”, which was first introduced by Mori [190]. The uncanny valley is “used in reference to the phenomenon whereby a computer-generated figure or humanoid robot bearing a near-identical resemblance to a human being arouses a sense of unease or revulsion in the person viewing it” (definition per Oxford Languages). Based on the diagram depicted in Figure 2.1, familiarity ratings fall into a “valley” when the machines become too human-like, which leads to an increase in feelings of eeriness, unease, or repulsion in observers [178]. It is noteworthy that this is a controversial hypothesis [291], with some studies providing empirical evidence for its existence [170, 119, 244, 187] and others failing to find the effect at all [109, 169, 23]. In 2012, a Bayesian explanation of the uncanny valley was proposed by Moore [189]. Moore explains that “differential perceptual distortion arising from stimuli containing conflicting cues can give rise to a perceptual tension at category boundaries that could account for these phenomena” [189].

Figure 2.1: The uncanny valley diagram [190]
This thesis is not about settling this debate or proving one side over the other. Our focus is on the quality of the interaction and the effects that feedback has on rapport with the user. There is a large body of work concerning agents (storytellers, story-listeners, neither, or both) with different embodiments that are capable of providing some sort of feedback in the interaction, using some or no input from their
interlocutors. These agents can be virtual agents or robots with physical embodiment. Our focus is mainly
on the story-teller and/or story-listener systems. Here we mention some of the relevant work that has been
done in this area.
2.1 Story-Telling Agents
Val-Calvo et al. [275] describes a system in which a story-telling agent interacts with humans in a controlled setting. The system narrates stories (in the form of scenes displayed to the user) in order to evoke emotional responses. It collects different signals from the user: EEG, galvanic skin response (GSR), blood volume pulse (BVP), and facial expressions. They develop a model based on the collected data to predict the user’s evoked emotions and later compare the predictions with the user’s self-reported emotions in an offline analysis. This system does not provide any feedback itself, and although it recognizes user feedback in offline mode, it is not “aware” of it during the interaction; thus it would fall into cell 5 of Table 1.1.
The New Dimensions of Testimony (NDT) system [270] is designed with the goal of digitally preserving Holocaust survivors’ stories in an interactive story-telling setting. Users talk to a persistent representation of a Holocaust survivor on a screen, while a dialogue agent selects appropriate responses to user utterances from a set of pre-recorded video statements, simulating a live conversation [14]. This is a good example of a story-teller system that provides but does not recognize user feedback. The system is capable of producing some non-verbal behaviors during the course of the interaction (like head nods), but they are not related to feedback from the audience.
Lima, Feijo, and Furtado [159] describe an adaptive storytelling system that provides different versions
of the story to the user based on personality and preference modeling. This selection is not done online; rather, it uses personality traits obtained at the beginning of the study with the Big Five metric.
Silva, Raimundo, and Paiva [248] present “Papous”, a virtual storyteller that is capable of displaying
affective speech, facial and body expressions. “Papous” is a storyteller designed to work for children and
it uses a specific story format. The story is created in levels, with each level having multiple “StoryBits”.
Levels refer to the progressive flow of the story, while StoryBits correspond to the different versions that can happen at each level. The storyteller must choose which StoryBit to tell at each level as it progresses through the story timeline. They elicit feedback (or input) from the users by having them choose cards from a set of illustrated physical cards in their possession and insert them into the system. The cards correspond to different properties the user wants in the story, such as “being told in a scarier manner”. The system then chooses the StoryBit that best corresponds to the aggregation of all the user’s inputs.
2.2 Story-Listener Agents
Gratch et al. [102] describes a system where the user retells some previously observed series of events to a
graphical character. The agent (which they call the Rapport agent) generates feedback by real-time analysis
of acoustic properties of speech and speaker gestures. They investigate the impact of feedback contingency
(as opposed to feedback frequency) on user rapport. They compare the rapport measures of interactions
with a “responsive” agent, a “mediated” agent, and a “non-contingent” agent along with a controlled version
of “face-to-face” interaction with a human. The responsive agent synthesizes head gestures and posture
shifts in response to features of the speaker’s speech and movements. The “mediated” agent reproduces the
actual head movements and posture shifts of a real human listener while the “non-contingent” agent uses a
pre-recorded behavior sequence taken from another session of the “responsive” condition. Thus the behaviors
are identical to the responsive condition in terms of their frequency and dynamics, but not contingent
on the behaviors of the current speaker. Their results indicated that the responsive agent was as good as
human listeners in creating rapport.
Al Moubayed et al. [7] describes two story-listener agents (one robot and one avatar) that are capable of
providing feedback based on perceived acoustic and visual features.
Park et al. [202] investigates the effects of an attentive listening robot on a child’s storytelling. They
develop a backchannel prediction model based on observed nonverbal behaviors and speaker cues of 4–6
year-old children. They test their model by having two robots interact with the child at the same time: one using contingent and the other using non-contingent backchanneling strategies. They show that
children prefer to converse with the more attentive and contingent robot.
2.3 Storyteller and Story-listener Agents
Perhaps the first system that allowed a kind of story-swapping with a virtual agent was [236], in which a
child character Sam would alternate telling and listening to stories with children. The Sam system has two
parts: the character Sam, an embodied conversational agent who is designed to look like a child around age
6, and a toy castle with a figurine. Sam is projected on a screen behind the castle, and can both listen to a
child’s stories and tell her own while playing with the figurine and the castle. When the child is playing
with the toys and narrating, the system uses audio threshold detection to determine when to give feedback
(backchannels such as ‘uh-huh’ nods, and explicit prompts such as ‘and then what happens?’). Ryokai,
Vaucelle, and Cassell [236] showed how children who interact with the Sam character both increase their
stories’ complexity and occasionally coach Sam.
Chapter 3
Story-Telling Agent
Telling stories is an important way for virtual agents to interact with people socially. However, it is not
clear what kinds of stories should be told, particularly with respect to the relation of the story to the identity
of the virtual agent. To answer our research question 1 ("Types of stories from the virtual story-tellers") and
to lay the groundwork for answering question 2 ("Types of multimodal feedback provided by agents"), we describe
an experiment designed to investigate the impact of the identity, presentation form, and perspective of a
virtual storyteller on a human user who engages in a story-swapping activity with two virtual characters.
For each interaction, the user was given 10 “ice-breaker” questions to ask a virtual character and respond to
the character’s reciprocal request. Participants also filled out a post-interaction survey, measuring rapport
with the character and impressions of the character’s personality. Results generally show that participants
prefer characters who tell first-person stories; however, there were some interactions with presentation
order. No significant preferences were established for the form or identity variables.
3.1 Introduction
We begin this chapter by defining what we consider to be a “story” in this thesis. A story or narrative
is a connected series of events told through words (written or spoken), imagery (still and moving), body
language, performance, music, or any other form of communication. Stories are pervasive in conversation
between people [200]. They are often used to establish identity [93, 20], pass on cultural heritage [216], and
build rapport [258]. Often stories are “swapped” when one conversational participant will reply to a story
with another story. Indeed, [58] found that almost 1/4 of stories in casual conversation were presented
in response to stories told by the other participant. Also, according to Dunbar [75], 65% of our daily
conversations consist of gossip, which we consider a form of story based on our definition.
Stories have also been incorporated in virtual human systems (see Section 3.2). In creating or mining
stories for a virtual human to tell, there are several considerations about what kinds of stories should
be told, particularly considering the goals of building long-term rapport and a desire for people to keep
interacting with the systems. We focus on issues such as how the story connects to the identity of the
virtual human and presentation style. It is unclear how best to address these issues, as there are multiple,
and occasionally conflicting desiderata proposed in the literature. We distill some of those desiderata into
the following five principles:
1. Be human: Virtual humans should be as much like humans as possible, and thus should project a
fully human identity and tell human-centric stories.
2. Talk about yourself: Tell first person stories, because they are more intimate, help the listener get
to know the teller, and can act as self-disclosure. [293] describe how self-disclosure can play multiple
roles in rapport management, including negative self-disclosure to boost interlocutors’ face, inviting
reciprocal self-disclosures, and revealing openness to being seen by the other.
3. Be Real: Stories should be authentic, or at least believable, or else they might trigger a backfire effect,
where the teller is seen as inauthentic and untrustworthy. False stories about the self might make
the listener think the teller is claiming credit that is not deserved. An obviously artificial agent might
fall into this problem if it tells stories about human experiences.
4. Be Interesting: Novel and unusual stories are more exciting than everyday occurrences. So stories
from a non-human perspective might be more interesting than standard human experiences.
5. Don’t Gossip: Third person stories might seem like gossiping about someone else if the stories are too personal, or might come across as name dropping.
These principles may lead to conflicting ideas of optimal stories for a virtual human to tell. Principles
1 and 2 combine to say that a virtual human should tell first person stories with a human self-identity.
On the other hand, Principle 3 gives reason to think it may be dangerous to rapport to tell such stories.
Dropping principle 2 but keeping 1 could lead to a preference for third person stories about a real human
that the virtual human knows. This might be contradicted by principle 5. On the other hand, keeping
principle 2 and dropping principle 1 could lead to an agent telling first person stories about an identity as
an artificial character, which might also be reinforced by principle 4. We review some prior work exploring
these principles in the next section.
In order to explore these principles, we designed a set of virtual human agents who can engage in a
simple form of story-swapping. Each of the agents can engage in simple interactions such as greetings and
closings and can respond to a set of “ice-breaker” questions, that might be used on a first date or similar “get
to know you” encounter. For these questions the agent’s answer includes a story. We created four character
response sets, to have all combinations of identity (human or artificial) and protagonist (first person or third
person). We also considered embodiment type as either Human (video recording of a real person telling the
stories) or Virtual-Human (animated character telling the story); however, we only recorded the human
identity stories in video, yielding six different story-swapping system types. More details about the agents
can be found in Section 3.3.
We also designed an experiment to try to explore the collective impact of the above principles on people
who interact with the characters. Participants interact with two of the above systems in a “get to know
you” scenario. We investigate the degree of reciprocal story-telling, and test the rapport participants feel
toward the characters as well as their impressions of the character’s personality. The experimental design
is described in Section 3.4. Results are presented in Section 3.5. We conclude in Section 3.6, with some
thoughts and next steps.
3.2 Related work
Several virtual agent systems have told, elicited or swapped stories. The SimSensei system [67] elicits
extended narratives from users, in an attempt to recognize whether the user is suffering from psychological
distress. Many systems feature agents that tell stories as part of an interaction establishing information
about the character. Some of these have agents playing the roles of historical characters (e.g., [28] had
an August Strindberg character, while [29] had a character portraying Hans Christian Andersen). Others
include fictional characters from literature (e.g., [108]), or new characters (e.g., [140, 232]). Stories have
also been told as part of establishing a long-term relationship and influencing users to adopt behavioral
change [174].
Perhaps the first system that allowed a kind of story-swapping with a virtual agent was [236], in which
a child character Sam would alternate telling and listening to stories with children. Ryokai, Vaucelle,
and Cassell [236] showed how children who interact with the Sam character both increase their stories’
complexity and occasionally coach Sam. The analysis made a sharp contrast between conversation and
storytelling as distinct activities, rather than telling stories within a conversation. The stories also tended
to be “made up” rather than personal narratives.
Robinson et al. [232] analyzed a corpus of interactions between museum visitors and a question-answering virtual human, Sergeant Blackwell, whose answers included some narrative responses. The analysis showed that a large percentage of questions to Blackwell included biographical and personal preference questions: almost 97% of the questions took a human-centered view compared to only 3% of
questions about the technology. Likewise, [140] noted that more than 1/3 of user questions in a museum
pre-suppose treating the Max agent as human. These findings lend support for principle (1), and to some
extent (2).
[4] also reported on interactions between museum visitors and virtual humans, and noted a lot of
human-oriented questions, such as preferences and biography. These characters had artificial backstories,
however, making jokes about their non-human characteristics, such as (computer) chips being their favorite
food. The popularity of this system [268] might lend support for principle (4), as well as possibly (2) and (3).
Bickmore, Schulman, and Yin [31] performed an experiment contrasting first vs third person stories in a
health-care application, where an agent told inspiring stories about weight loss, either about the agent (first
person) or about someone else (third person). This study thus directly tested the contrasting principles (2)
and (3). Participants in each condition answered questions about how much they “enjoy the stories that the
counselor tells”, “look forward to talking to the counselor”, and “feel that the counselor is dishonest”. [31]
found that first person participants were more likely to talk to the agent and reported greater enjoyment;
however there were no significant differences between the groups in the extent to which they looked
forward to talking with the agent or felt the agent was dishonest. Moreover, newly recruited subjects
used the system significantly more than participants who were already using the agent for weight-loss
counseling before the study. These findings support principle (2) but fail to support principle (3). This
work did not use speech input to trigger virtual agent responses, did not elicit user stories, and did not
explore the artificial identity option. [259] compare fictional (traveller from another planet) vs realistic
(artificial robot) identities for a Nao robot exercise coach. The identities were expressed as backstory that
the coach would reveal about itself at various points in the interaction. [259] found no differences in ratings
or activity levels between these two conditions, suggesting that principle (3) may not be so critical (or at
least might be balanced by principle 4).
Figure 3.1: Virtual Human Arron and Arnold; Human Arron and Arnold
Concerning embodiment type, [133] compared video to animated characters in social interaction, and
found that video avatars led to more co-presence than animated character avatars, though had no impact
on satisfaction.
3.3 Story-swapping Agents
As mentioned above, we created six versions of simple story-swapping agents. All were designed to engage
with users in a simple “get to know you” dialogue, including reciprocal question answering. Four different
sets of character dialogue were created, each being able to answer 20 “ice-breaker” questions, such as
“Do you play sports?” We created two different characters, named Arnold and Arron, that differ in their
perspective of the stories they tell. Arnold tells first person stories, while Arron tells third person stories
about an acquaintance. Arnold and Arron were given similar ages and appearances.
For each character, there are two versions of the stories, one in which the character is portrayed as
human (VH-Human), and having fully human experiences, and another (VH-VH) in which the character
talks about an artificial identity, emphasizing being a computer generated character and unable to do things
like eat or drink, but having experiences in a virtual world. For the human stories only (not the artificial
identity), we also have video-recordings of people from the same demographic group playing Arron and
Arnold (Human-Human). Figure 3.1 shows the four different embodiments. We thus have six different
agents, considering character/perspective (Arnold/1st or Arron/3rd), identity (Human or Virtual Human)
and embodiment presentation (Human Video or Virtual Human - but only for Human identity).
The following shows two different versions of Arnold’s response to the question of whether or not he has met a celebrity before. An example of one of Arron’s stories is shown in Figure 3.3.
VH-VH: I have talked to some celebrities, but unfortunately I can’t really get an autograph or a picture
to show my friends. I do have all of the conversation logs though, even from my first time ever talking to a
celebrity, when I talked to Hines Ward, a former football player for the Pittsburgh Steelers. After hearing and
reading so much about how shallow and elitist some celebrities are, I was pleasantly surprised to find Ward
very down to earth and easy to talk to. He even missed his flight so he could talk longer. I guess he found it just
as cool to be talking to a virtual human for the first time as I did talking to a celebrity for the first time.
VH-Human or Human-Human: I’m not huge into celebrities, but one time in college I saw a flyer that
said that one of my childhood sports heroes, Hines Ward, was coming to do a signing on campus. I went and
bought a football from the local sports store and headed over to the signing. There was a huge line, and as time
went on I started to get worried that I wasn’t going to make it to him in time. Just as I was about to be up in
line, I saw his agent come over and tell him they need to leave for the airport now, or they would miss his flight.
I heard him say “There will be another flight, these people have been waiting for hours”. I went up and he not
only signed my ball, but we were able to talk for a couple minutes since I was the last person in line. I was so
impressed by what a genuine person he was, and for that reason I will never forget that day.
Our agents were built using the Virtual Human Toolkit [110]. The architecture for the four virtual
human embodiment agents is shown in Figure 3.2a, while the architecture for the two video versions is
shown in Figure 3.2b. The natural language understanding and spoken answer parts are identical, the only
differences being the way the embodied aspects were presented (Human video vs Virtual Human).
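As a rough illustration of the question-answering behavior (the actual agents rely on the Virtual Human Toolkit's language understanding components, so this is only a generic sketch under that caveat), a retrieval-style mapping from a user's ice-breaker question to the closest scripted answer could look like the following; the questions and answers are abbreviated placeholders.

# Minimal sketch of retrieval-style question answering for the story-swapping
# agents: match the user's ice-breaker question to the closest scripted
# response. Questions and answers below are illustrative placeholders only.

RESPONSE_SET = {
    "do you play sports": "I play a bit of basketball on weekends ...",
    "have you ever met a celebrity": "I have talked to some celebrities ...",
    "what was your favorite vacation": "My favorite trip was to the Tetons ...",
}

def _tokens(text: str) -> set:
    return set(text.lower().strip("?!. ").split())

def answer(user_question: str) -> str:
    """Return the scripted answer whose key overlaps most with the question."""
    q = _tokens(user_question)
    best_key = max(RESPONSE_SET, key=lambda k: len(q & _tokens(k)))
    if not q & _tokens(best_key):
        return "Hmm, I'm not sure about that one. What about you?"
    return RESPONSE_SET[best_key]

print(answer("Do you play any sports?"))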
3.4 Experimental Design
In order to shed light on the best design choices for virtual human stories in story-swapping dialogue, we
recruited experimental participants to engage in dialogue with the six agents described in the previous
(a) Virtual Human Architecture (b) Video Architecture
Figure 3.2: System Architectures
section. 60 participants (38 males, 22 females) were recruited via Craigslist. All participants who met the requirements (i.e., age greater than 18 and adequate eyesight and hearing) were accepted. One participant failed to complete the study and was excluded; the other 59 (38 males, 21 females) were included in the analysis.
We examined independent variables of perspective (1st vs 3rd person), identity (human or virtual
human), and presentation (human video or virtual human). We use a partial within-subjects design, where
each participant talks to two virtual humans. In order to reduce redundancy, we make sure that participants
have a different set of questions for each virtual agent they talk with, so the set of 20 ice-breaker questions
were divided into two sets of 10. We decided to look at perspective (1st vs 3rd) within subjects, and to
keep the identity and presentation variables the same for that subject. Thus each subject will have one
conversation with Arron involving 10 questions, and one with Arnold with a different 10, in one of the
three identity-presentation combinations (Human-Human, VH-Human, or VH-VH). To control for order
effects, half of the participants first talked to Arnold, the 1st person character, while the other half first
talked to Arron, the 3rd person character.
3.4.1 Metrics
The experiment consisted of a 3 (agent: VH-VH vs. VH-Human vs. Human-Human) × 2 (order: 1st then
3rd vs. 3rd then 1st) design. Our 60 participants were randomly assigned to one of these 6 cells (10 per cell).
We measured the following dependent variables for each interaction between a subject and virtual
agent:
• a 9-item rapport scale
• a 6-item ancillary rapport scale
• two items on subjective sharing of personal information
• a set of 30 personality characteristics
• length of participant responses
• number of participant responses containing stories
We used the rapport scale from [101]. Specifically, participants answered 9 questions about the rapport
they felt with the agent on scales ranging from 1 (not at all) to 5 (very much) such as “I felt I had a connection
with my partner” and “I felt that my partner was interested in what I was saying.” As an ancillary measure of
rapport, participants also answered other questions on the same scale, such as: “How much rapport did you
feel with your interaction partner?” Or “How close did you feel to your interaction partner?” Participants
were also asked two items about the extent to which they shared personal information using the same
5-point scale: “How personal were your answers?” and “How much did you reveal about yourself in
your answers?” Finally, for exploratory purposes, participants were asked to rate their interaction partner
on 30 personality characteristics using a bipolar scale from 1 to 5, with each endpoint reflecting the extreme
of that personality characteristic (e.g., Compassionate to Aloof, Friendly to Unfriendly, etc.).
Participant: What do you and your friends like to do?
Agent: Well me and the other virtual humans here are pretty busy most of the time talking to people like
you, so we don’t have time to do much else or hang out with each other. The humans we work with
seem to do some pretty cool things with their friends, though. One of our researchers just got back from
a back packing trip in the Tetons, and was telling me about how he got to watch the sun rise through the
peaks each morning. He claimed it was one of the most beautiful things he have ever seen.
Agent: What about you?
Participant: What do me and my friends like to do? oh my goodness, well lets see ... well since these
days you know funs are not exactly as high as they used to be... most of the time me and my friends just
come over to my house and play some old Nintendo.. That’s right you heard me right! The old classic
Nintendo! I got over fifteen games it’s been a lot of fun. You know, booze ... stuff like that ... umm ...
but on a regular day when there is a little bit of money in the pocket we tend to go to movies or we go
camping ... lets see the other day we had a little barbecue at the beach, Venice beach. But sometimes you
just gotta do what you can with what you have, right? Either way, whenever you have good company
it’s gonna be a good time. That’s what I always say.
Figure 3.3: Example of exchange between Arron (3rd person VH-VH) and participant
3.4.2 Protocol
After receiving a general explanation of the system and the procedure of the study, participants gave
consent. As explained in the previous section, each participant was assigned to one of the six different
conditions. Prior to starting the interaction with the agent, participants were given the set of questions that
were supposed to be the topics of their conversation and 5 minutes to plan their answers. After this period,
they engaged in their first conversation with their assigned agent.
After the greetings, each episode began with an ice-breaker question from the participant, such as “What was your favorite vacation?”, followed by the agent’s answer. The agent responded to the question and
told stories according to its assigned perspective and identity. After his answer, the agent prompted the
user to answer the same question about themselves, with phrases like “What about you?” or “Yourself?”. After
answering the question, the participant moved on to the next question and began the next episode. Each
conversation consisted of a sequence of such sub-dialogue episodes, like that shown in Figure 3.3, which
shows one participant interacting with the VH-VH version of Arron.
After going through all the questions, participants were asked to fill out a questionnaire with the
subjective items about their experience with the agent. Then, participants went through the same procedure
with the second character.
3.5 Results
We analyzed these dependent variables: the Rapport Scale, our ancillary rapport scale, disclosure of personal
information, and finally each of the 30 personality characteristics items. On each of these measures, we
conducted a 2 (perspective: 1st person vs. 3rd person) × 2 (order: 1st then 3rd vs. 3rd then 1st) × 3 (agent:
human-human, VH-VH, vs. VH-human) mixed ANOVA with order and agent as between-subject factors,
and perspective as a within-subjects factor. Considering the Rapport Scale, there was only a marginally
significant main effect of perspective, F(1, 53) = 3.21, p = .08, such that users experienced greater rapport
with the 1st person agent (M = 3.61, SE = 0.09) than with the 3rd person agent (M = 3.42, SE = 0.09). However,
this effect was qualified in an interaction with “agent” condition (F(2, 53) = 3.76, p = .03). As can be seen
in Figure 3.4, the effect of users experiencing greater rapport with the 1st person agent than with the 3rd
person agent only appears when the agent has a human backstory (VH-human and human-human). No
other effects or interactions approached significance (F s < 1.12, ps > .33).
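For readers who wish to reproduce this style of analysis, the snippet below sketches an approximately equivalent model fit as a linear mixed-effects regression with a random intercept per participant, assuming a long-format data file with hypothetical column names (participant, perspective, order, agent, rapport). It is not the exact ANOVA procedure used to produce the statistics reported here.

# Sketch of an (approximately) equivalent analysis to the 2x2x3 mixed ANOVA,
# fit as a linear mixed-effects model with a random intercept per participant.
# The file and column names are hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("rapport_long.csv")   # one row per participant x character

model = smf.mixedlm(
    "rapport ~ C(perspective) * C(order) * C(agent)",  # fixed effects
    data=df,
    groups=df["participant"],                          # random intercept
)
result = model.fit()
print(result.summary())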
Figure 3.4: Interaction of perspective and agent on the Rapport Scale.
For our ancillary rapport measure, we also found a main effect of perspective, F(1, 53) = 4.44, p = .04,
again such that users experienced greater rapport with the 1st person agent (M = 3.41, SE = 0.12) than
with the 3rd person agent (M = 3.10, SE = 0.12). However, the interaction with agent condition did
not approach significance F(1, 53) = 1.95, p = .15, nor did any of the other effects or interactions reach
statistical significance (F s < 1.52, ps > .23). Considering the disclosure of personal information, there
was a significant main effect of perspective, F(1, 53) = 6.88, p = .01, such that participants share more
personal information with the 1st person agent (M = 3.92, SE = 0.12) than with the 3rd person agent
(M = 3.65, SE = 0.14). No other effects or interactions approached significance (F s < 0.79, ps > .45).
Turning to exploratory analysis of personality characteristic items, several were found to have significant
or marginal effects. First, for rating of the personality characteristic of “rude”, there was only a marginally
significant main effect of perspective, F(1, 53) = 3.33, p = .07, such that users rate the agent as less rude
with the 1st person agent (M = 1.66, SE = 0.10) than with the 3rd person agent (M = 1.89, SE = 0.11).
However, this effect was qualified in a marginal interaction with “agent” condition (F(2, 53) = 2.94, p =
.06). As can be seen in Figure 3.5a, again the effect only appears when the agent has a human backstory (VH-human and human-human). No other effects or interactions approached significance (F s < 1.08, ps > .31).
(a) Personality characteristic “rude” (b) personality characteristic “aloof”
Figure 3.5: Interaction of perspective and agent on personality characteristics.
Second, for rating of the personality characteristic of “aloof”, there was a significant main effect of
perspective, F(1, 53) = 5.58, p = .02, such that users rate the agent as less aloof with the 1st person
agent (M = 2.56, SE = 0.11) than with the 3rd person agent (M = 2.84, SE = 0.12). However, this
effect was qualified in a marginal interaction with “agent” condition (F(2, 53) = 2.57, p = .086). As
can be seen in Figure 3.5b, it appears that this time, the effect only appears with a VH (VH-human and a
trend for VH-VH, but not with human-human). No other effects or interactions approached significance
(F s < 1.57, ps > .22).
Third, for rating of the personality characteristic of “non-threatening”, there was only a marginally
significant interaction between perspective and order (F(1, 53) = 3.65, p = .06). As can be seen in Figure
3.6a, it appears that the 3rd person agent is only perceived as more threatening when users interact with the
1st person agent beforehand. No other effects or interactions approached significance (F s < 1.38, ps >
.25).
(a) “non-threatening” (b) “cheerful”
Figure 3.6: Interaction of perspective and order on personality characteristics.
Likewise, for rating of the personality characteristic of “cheerful”, while there was a significant main
effect of perspective, F(1, 53) = 4.28, p = .04, such that users experience the 1st person agent as more
cheerful (M = 3.94, SE = 0.12) than the 3rd person agent (M = 3.64, SE = 0.12), there was also a
significant interaction with order (F(1, 53) = 5.30, p = .03). As can be seen in Figure 3.6b, it appears that
the 1st person agent is only perceived as more cheerful when they interact with it before the 3rd person
agent. No other effects or interactions approached significance (F s < 0.57, ps > .46).
Next, for rating of the personality characteristic of “trustworthy”, there was only a significant main
effect of perspective, F(1, 52) = 5.67, p = .02, such that users experience the 1st person agent as more
trustworthy (M = 3.95, SE = 0.13) than the 3rd person agent (M = 3.55, SE = 0.16). No other effects
or interactions approached significance (F s < 1.61, ps > .21).
For rating of the personality characteristic of “passive”, there was only a marginally significant main
effect of agent, F(1, 53) = 3.33, p = .087, such that users rate the agent as less passive when it has a human
backstory (human-human M = 2.13, SE = 0.17 and VH-human M = 2.05, SE = 0.17) compared to
when it has an artificial backstory (M = 2.55, SE = 0.17). No other effects or interactions approached
significance (F s < 0.58, ps > .56).
For rating of the personality characteristic of “unsympathetic”, there was a significant interaction
of agent by order, F(2, 53) = 3.22, p = .048. As depicted in Figure 3.7a, in the VH-human condition,
participants overall rated both agents as more unsympathetic when they interact with the 3rd person
agent before the 1st person agent. As the three-way interaction with perspective (3rd vs. 1st person)
did not reach significance (F(2, 53) = 2.63, p = .11), this figure displays means collapsed across 1st
person agent and 3rd person agent. Indeed, no other effects or interactions were statistically significant
(F s < 2.63, ps > .11).
(a) Interaction of agent and order on characteristic “unsympathetic” (b) Interaction of perspective and order on the length participants talked
Figure 3.7: Ordering effects.
No other effects or interactions approached significance for the other personality
characteristic items (F s < 2.72, ps > .11).
Additionally, we conducted a 2 (perspective: 1st person vs. 3rd person) × 2 (order: 1st then 3rd
vs. 3rd then 1st) × 3 (agent: human-human, VH-VH, vs. VH-human) mixed ANOVA on the length of
time participants talked to the agent. There was only a significant interaction of perspective and order,
F(1, 45) = 4.02, p = .05. As can be seen in Figure 3.7b, participants talked longer with whatever agent
they spoke to second. No other effects or interactions approached significance (F s < 0.66, ps > .52).
Finally, we conducted chi-square tests to determine if agent condition impacted whether or not participants were more likely to tell a story to the agent in response to his question. Across responses to all questions, agent condition never had a significant effect on the likelihood of telling a story (χ²s < 7.86, ps < .10).
3.6 Conclusions
In summary, like [31], we see a general preference for first person over third person stories, even though
there were differences in the type of interaction (spoken rather than typed) and activity (story-swapping
rather than stories motivating exercise), and considering also variations in presentation. Agents who told
1st-person stories led to users reporting that they felt greater rapport, that they shared more information,
and saw the agent as less rude, less aloof, less threatening, more cheerful, and more trustworthy than the
agent who told third person stories. Some of these results appeared only in the VH-Human condition (which
is most similar to the agent in [31]). Given that the agents, subjects, dialogue genre, and measures were all
different from the previous study, we see this as reinforcement of principles (1) and (2): that human-like
first person stories should be told by a virtual human, in order to engage human users. Likewise, we fail to
find any support for Principle (3), that would discourage human-like first person stories as deceitful.
On the other hand, we do not see differences in objective measures of user reactions to the stories
in dialogue, and many of the findings occur in only some of the conditions, so it may also be fine to tell
third person stories or have a non-human backstory identity, as long as the stories are interesting and
approachable.
It is also interesting that users talked more with the second agent, regardless of whether it was a first
or third person perspective. This seems to indicate that users are “warming up” to this style of interaction,
and not yet bored with it after the first batch of 10 questions.
Recognizing and providing feedback in the course of this interaction was not the main focus of this study, but a post-hoc qualitative analysis of the users’ responses revealed that participants tended to tell considerably shorter stories toward the end of each session. A possible explanation is that by then they had figured out that the agent did not understand them and that its non-verbal behaviors were not contingent on their story, so they were reluctant to tell long stories. In fact, many participants asked the experimenter about this after the study. Furthermore, there were a number of instances where an agent capable of giving some content-related feedback would have improved the interaction. For instance, in one of the interactions, the user was talking about a funny tradition from college in which he had to run around the campus in his underwear. The user laughed hard as he retold the story, and it would have been very natural for his interlocutor (our agent) to react to his story and laughter with some feedback. Thus, in addition to answering our research question 4, this chapter lays the groundwork for answering our first and second questions: in Chapter 5, we use the user’s behaviors in deciding the timing for giving stories (research question 1), and in Chapter 4 the agent is able to provide appropriate feedback at the right places during the stories (research question 2).
Chapter 4
Identifying Opportunities for Empathetic Responses
Part of being able to provide appropriate feedback in the course of the interaction is knowing the best "type
of feedback" to provide. In exploring the answer to our research questions concerning the use of users’
feedback and non-verbal behaviors in a story-listening context (question 2 on types of multimodal feedback
provided by agents and question 3 on types of multimodal feedback perceived by agents), we conducted an offline analysis of a multimodal dataset of interactions with a virtual human.
In this chapter, we study and analyze behavioral cues that indicate an opportunity to provide an
empathetic response. This is one type of feedback that systems need to provide in a story-listening context, and doing it appropriately plays a crucial role in stimulating full engagement. Not providing such responses results in losing rapport, while providing them inappropriately (whether with the wrong timing or the wrong emotion) causes uneasiness or unlikability. Emotional tone in language, together with facial expressions, is a strong indicator of dramatic sentiment in conversation that warrants an empathetic response. To automatically recognize such instances, we develop a multimodal deep neural network for identifying opportunities when the agent should express positive or negative empathetic responses. We train and evaluate our model using audio, video, and language from human-agent interactions in a Wizard-of-Oz setting, using the wizard’s empathetic responses and annotations collected on Amazon Mechanical Turk as ground-truth labels. Our model outperforms a text-based baseline, achieving an F1-score of 0.71 on a
three-class classification. We further investigate the results and evaluate the capability of such a model to
be deployed for real-world human-agent interactions.
4.1 Introduction
Emotionally intelligent and embodied interactive agents are showing great promise for effectively augmenting human resources in different domains including health-care and education. To create a realistic and
engaging experience, it is necessary for the agents to be receptive and responsive to the users’ emotional
needs. There has been a large body of work in multimodal recognition of sentiment and human emotions
from online videos or interactive experiences [218, 43, 42]. Existing work has made notable progress
towards sentiment recognition from vast online datasets. Nonetheless, despite the increasing attention
towards emotionally intelligent and empathetic interactive companions, recognition of empathy has not
been extensively explored due to the limited amount of data and the complexity of defining ground-truth labels.
Empathy is defined as the ability to recognize, understand and react to emotions, attitudes and beliefs
of others [8]. Automatic recognition of empathy, although similar to sentiment, requires different and
more complex modeling. Recognition of opportunities for empathetic responses should include subjectivity
while also accounting for the intensity of the sentiment to elicit empathetic responses. The threshold for
expressing empathetic responses can vary from person to person and is also affected by inter-personal
relationships and the context of the conversation. “I am concerned about global warming.” and “I lost my
mother to cancer.” are expected to elicit different responses in terms of empathy.
Multimodal sentiment analysis relies on the perception of proxies of sentiment or affect from different
views including verbal content or spoken words, emotional tone of speech and facial expressions. In this
work, building upon the work on multimodal sentiment analysis, we propose a multimodal machine learning
framework for identifying opportunities for empathetic responses during human-agent conversations. To
this end, we analyzed interactions between an agent and a user during a semi-structured interview probing
symptoms of mental health disorders such as depression. During the interview, the agent asks a set of
questions, where each question is possibly followed by shorter follow-up questions with respect to the
user’s previous responses. Our developed model determines when the agent needs to express empathy and
with what polarity, i.e., a positive or negative empathetic response. We focus on the prediction of empathy
in an uncontrolled environment with real-world users, throughout the human-agent dialogue interaction.
The problem is therefore formulated as a three-class classification of positive, negative or no response
using verbal, acoustic and visual modalities. Each modality is mapped to a representation which is used
to recognize the classes. We evaluated the unimodal and multimodal recognition results with two sets of
labels, one consisting of the real-time judgments of the experimenters and one according to the judgments
of the independent observers. Identifying such moments will enable the agent to provide an empathetic
response such as “I’m sorry” or “that’s great” when necessary.
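One way to state this formulation compactly (the notation below is ours, introduced only to summarize the preceding paragraph): for a segment with verbal input x_t, acoustic input x_a, and visual input x_v, each mapped to a representation by f_t, f_a, and f_v respectively, the model predicts

\[
\hat{y} \;=\; \operatorname*{arg\,max}_{y \,\in\, \{\text{positive},\;\text{negative},\;\text{none}\}} \; P\bigl(y \mid f_t(x_t),\, f_a(x_a),\, f_v(x_v)\bigr).
\]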
The major contributions of this work include:
• An analysis of verbal and nonverbal behaviors prompting empathetic responses.
• Providing a machine learning framework for identifying empathetic opportunities in an uncontrolled
dyadic interaction with real-world users.
• An analysis of different strategies for creating ground-truth labels for empathetic responses.
4.2 Related Work
The development of emotionally intelligent and empathetic agents has been a long-standing goal of AI.
Bickmore [30] showed how embodied agents can employ empathy to form better social relationships. Brave
et al. [36] shows that empathetic emotions lead to greater likeability and trustworthiness of the agent.
Existing work has mostly examined empathetic interactions through game-playing contexts [155, 36, 25].
Others have looked at prediction of counselors’ empathy measures in domains like motivational
interviewing [287, 286]. They have used ratings of empathy as a means of evaluating psychotherapy sessions and counselor performance. Clavel and Callejas [57] surveyed sentiment analysis and its applications to human-agent interaction. They found that existing sentiment analysis methods deployed in human-agent interactions are not designed for socio-affective interactions. Hence, they recommend building
systems that can support socio-affective interactions in addition to enhancing engagement and agent
likability.
Sentiment analysis usually focuses on recognizing the polarity of sentiment expressed towards an entity
[252]. Learning empathetic opportunities in interactive systems requires more than mere recognition of
polarity, since empathetic responses are in response to personal misfortunes or successes and not just any
emotionally charged utterance.
Recent multimodal sentiment analysis approaches use deep neural networks trained and evaluated
on social media videos to detect sentiment. Zadeh et al. [290] used a Tensor Fusion Network to model
intra-modality and inter-modality dynamics in multimodal sentiment analysis. Their tensor fusion network
consists of modality embedding sub-networks, a tensor fusion layer modeling the unimodal, bimodal and
trimodal interactions using a three-fold Cartesian product from modality embeddings along with a final
sentiment inference sub-network conditioned on the tensor fusion layer.
Hazarika et al. [112] propose a conversational memory network for emotion recognition in dyadic
interactions, considering emotion dynamics. They use Gated Recurrent Units (GRUs) to model past
utterances of each speaker into memories to leverage contextual information from the conversation history.
Majumder et al. [171] models emotions in conversations by distinguishing individual parties throughout
the conversation flow. They consider three major aspects in dialogue by modeling individual party states,
context from the preceding utterances as well as the emotion of the preceding utterance by employing
three GRUs. Their network feeds incoming utterances into two GRUs called Global GRU and party GRU to
update the context and party states respectively. The global GRU encodes corresponding party information while encoding an utterance. By attending over the global GRU, the model represents information from all previous utterances and the speaker state. Depending on the context, information is updated and fed into the emotion GRU for emotion representation.

Figure 4.1: A participant and the virtual agent, Ellie.
Existing work mostly leverages online datasets that benefit from large amounts of data [52, 205, 219, 217], or uses highly curated offline datasets that employ professional actors for predefined and highly expressive scenarios [219, 284, 186]. In this chapter, we focus on real-world data obtained from people talking with a virtual agent in a semi-structured interview imitating a therapy session. This is an inherently challenging domain due to the limited amount of real-world data, relatively lower expressiveness, and unstructured spoken dialogue.
4.3 Data
We use a portion of the Distress Analysis Interview Corpus - Wizard-of-Oz (DAIC-WOZ) for training and
evaluating our method. DAIC-WOZ is a subset of DAIC that contains semi-structured interviews designed
to support the assessment of psychological distress conditions such as depression and post-traumatic stress
disorder (PTSD) [100]. The interviews were collected as part of an effort to create a virtual agent that
conducts semi-structured interviews to identify verbal and nonverbal indicators of mental illness. The
subset of the corpus examined in this work includes the Wizard-of-Oz interviews conducted by a virtual agent controlled by two trained human wizards in a separate room. In this two-wizard arrangement, one wizard controlled the agent’s verbal behavior while the other handled her nonverbal behavior. The
interview was structured to start with a set of general rapport-building questions and continue to query
potential symptoms of mental health conditions, such as quality of sleep. In this setup, a fixed set of top-level questions
were provided to the wizard to be asked during the interview. In addition to asking the top-level questions,
the wizard was provided with a finite repertoire of response options to act as a good listener by providing
back-channels, empathy and continuation prompts [66] (see Figure 4.1).
Verbal and nonverbal behaviors of participants were captured by a front-facing camera and a head-worn
microphone. In this work, we extract segments eliciting empathetic responses from the experiments by
looking at the agent’s expressions of empathy such as “I’m sorry to hear that.” or “That sounds like a
great situation”. In the segmented data, each instance consists of the participants’ verbal and non-verbal
(audiovisual) responses to each main question and the follow-up questions. Follow-up questions such as
“Can you tell me more about that?” were asked to elicit further disclosure and encourage more elaborate
responses. Example dialogue excerpts are shown in Table 4.1.
Due to the nature of the predefined semi-structured interview, the dialogue turns take minimal influence
from the dialogue history and are therefore considered independently. The data is segmented into small
time-windows consisting of the users’ transcribed text, video and audio that have resulted in either positive,
negative or no empathetic responses from the virtual agent.
Overall, we had 2185 data points extracted from conversations of 186 participants. The average length
of the dialogue excerpts was 30.6 seconds, while the average number of turns per data point was 3.2 turns.
Table 4.1: Human-Agent dialogue excerpts with different empathy responses.

Negative:
A: How have you been feeling lately?
H: Um kind of uh I guess sorta sorta depressed generally
A: Tell me more about that
H: Uh just uh feeling tired and sluggish and um less less motivated and less interested in things
A: I’m sorry to hear that.

Positive:
A: What are you most proud of in your life?
H: Uh I’m proud that I’ve come a long way from when I first moved out here I’m uh a lot more disciplined um I read a lot uh I do crosswords and I think I’ve I think I know what’s important in life now and I’m more focused and going after what I want
A: That’s so good to hear.

None:
A: What are somethings you wish you could change about yourself?
H: Um I wish I could be taller I wish I could be more inclined to play basketball so I then become go to the NBA and be a millionaire I know that’s all unrealistic but just answering honestly.
Figure 4.2: Multimodal static fusion: ResNet-based video features and MFCC audio features are each processed by a GRU, the BERT text representation is passed through a fully connected layer, and the three streams are combined in a fusion (fully connected) layer.
4.4 Method
4.4.1 Multimodal Feature Extraction
4.4.1.1 Textual Features
For text input, we use a pre-trained language representation model called Bidirectional Encoder Representations from Transformers (BERT) [69]. BERT has substantially advanced the state-of-the-art in a
number of natural language processing (NLP) tasks including sentiment analysis and question answering,
which also makes it suitable for this task. We therefore used BERT as our text embedding model using
only the participants’ utterances from the dialogue excerpts. We avoid using the agent’s utterances in the
classification because of the unfair advantage it may provide to the recognition model. We used uncased BERT-Base to obtain a single 768-dimensional vector representation of the transcribed text per data entry [288]. BERT encodes the whole text sequence into a fixed-size vector, and unlike the audiovisual modalities,
the temporal dimension is latent in the text representation.
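As a concrete illustration, the following is a minimal sketch (not the exact pipeline used in this work) of obtaining a single 768-dimensional representation of an excerpt's participant utterances with uncased BERT-Base via the Hugging Face transformers library; pooling over the [CLS] token is an assumption, since the exact pooling strategy is not specified above.

import torch
from transformers import BertModel, BertTokenizer

# Load uncased BERT-Base; the [CLS] pooling below is an assumed choice.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()

def embed_excerpt(participant_turns):
    """Map the participant's utterances in one excerpt to a single 768-d vector."""
    text = " ".join(participant_turns)
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state[:, 0, :].squeeze(0)  # [CLS] token embedding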
4.4.1.2 Audio Features
Two types of feature-sets were extracted for the representation of speech prosody: (i) the extended Geneva
Minimalistic Acoustic Parameter Set (eGeMAPS) and (ii) Mel-frequency cepstral coefficients (MFCC),
extracted using OpenSMILE [83].
eGeMAPS provides a set of acoustic features hand-selected by experts for their potential to detect
affect in speech, and has been widely used in the literature due to their performance, as well as theoretical
significance [81]. This feature-set consists of 23 features such as fundamental frequency and loudness.
The MFCC feature-set consists of 13-band mel-frequency cepstral coefficients computed from 25 ms audio frames. MFCCs and their first- and second-order derivatives were extracted [83, 82] to obtain a temporal T × 39 representation per data entry.
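For readers without openSMILE, the same T × 39 layout (13 MFCCs plus first- and second-order deltas) can be approximated with librosa, as in the sketch below; the 25 ms frame length follows the description above, while the 10 ms hop size is an assumption.

import librosa
import numpy as np

def mfcc_matrix(wav_path, sr=16000):
    """Return a (T, 39) matrix of MFCCs with first- and second-order deltas."""
    y, sr = librosa.load(wav_path, sr=sr)
    n_fft = int(0.025 * sr)        # 25 ms analysis frames
    hop_length = int(0.010 * sr)   # 10 ms hop (assumed)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft, hop_length=hop_length)
    d1 = librosa.feature.delta(mfcc)            # first-order derivatives
    d2 = librosa.feature.delta(mfcc, order=2)   # second-order derivatives
    return np.vstack([mfcc, d1, d2]).T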
4.4.1.3 Visual Features
For the visual representation, we experimented with two different feature-sets: (i) 17 action units and 6
head pose features were extracted per frame using OpenFace [19] and (ii) face embedding obtained from a
ResNet pre-trained model [115]. OpenFace is used to extract the intensity of facial action units, representing
17 action units based on the Facial Action Coding System (FACS) [76] along with head pose variations
per frame, therefore providing a T × 23 representation. For the face embedding, we extracted masked
and aligned faces per frame using OpenFace [18] and fed it to ResNet-50, a convolutional neural network
pre-trained on ImageNet [65], and extracted the representation from the penultimate layer, to obtain a
T × 2048 representation.
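A minimal sketch of the ResNet feature extraction follows, assuming the masked and aligned face crops have already been produced by OpenFace and are supplied as PIL images; the preprocessing values are the standard ImageNet ones, and the torchvision weights enum is an assumption about the library version.

import torch
import torch.nn as nn
from torchvision import models, transforms

# ImageNet-pretrained ResNet-50 with the classification head removed, so the
# forward pass returns the 2048-d penultimate-layer representation per frame.
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
resnet.fc = nn.Identity()
resnet.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def face_embeddings(frames):
    """frames: list of aligned face crops (PIL images) -> tensor of shape (T, 2048)."""
    batch = torch.stack([preprocess(f) for f in frames])
    with torch.no_grad():
        return resnet(batch)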
4.4.2 Ground-Truth Labels
4.4.2.1 Wizard judgments
We extracted the ground-truth labels from the empathetic and non-empathetic responses of the human-controlled virtual agent. The agent’s responses are divided into three classes: negative empathy, positive empathy, or no empathy. Negative empathetic responses include utterances such as "That sounds really hard" and "I’m sorry to hear that", positive empathy includes utterances like "That’s so good to hear" and "That sounds like a great situation", and no empathy indicates that the agent moved on to the next question or expressed fillers or back-channels without sentiment. Using these key phrases, we extracted
the ground-truth labels for the three classes.
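A sketch of this key-phrase mapping is shown below; only the example phrases quoted above are listed, so the phrase inventory is illustrative rather than the full set used for labeling.

# Illustrative key-phrase lists; the full inventory used for labeling is larger.
NEGATIVE_PHRASES = ["i'm sorry to hear that", "that sounds really hard"]
POSITIVE_PHRASES = ["that's so good to hear", "that sounds like a great situation"]

def wizard_label(agent_response):
    text = agent_response.lower()
    if any(p in text for p in NEGATIVE_PHRASES):
        return "negative"
    if any(p in text for p in POSITIVE_PHRASES):
        return "positive"
    return "none"  # the agent moved on or used a neutral filler/back-channel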
4.4.2.2 Mechanical Turk Ratings
To validate the wizard’s empathetic responses, we collected labels via Amazon Mechanical Turk (MTurk).
We recruited five raters per instance (257 unique participants), all from the United States to avoid language
barriers. For each data point, the users were given the textual data, i.e., the dialogue sequence and they were
Figure 4.3: Box plots of verbal and nonverbal behavior with significant differences among different classes (LIWC categories: tone, affect, posemo, negemo, social, cogproc, informal; facial action units: AU06, AU12, AU15).
asked to select the proper categorical response toward the user at the end of each conversation. For further
clarification, we provided example responses belonging to each category. Each assignment consisted of
20 tasks (data points) plus two control questions (with obvious responses) to eliminate raters that did not
pay attention to the task and provided random answers. One control question contained an obviously
devastating story about the participant’s mother passing away while the other control question involved
a very happy and inspiring story about the participant. To obtain valid ratings, we repeated the annotation for data points whose raters answered either control question incorrectly. We additionally eliminated the
instances where there was no majority vote among raters (7% of the data).
Fleiss’ kappa was calculated to measure inter-annotator agreement across the five raters for the entire dataset, showing fair agreement with κ = 0.33. A comparison between the majority vote of the MTurk raters and the wizard’s responses shows 58% agreement. Further analysis indicates that the difference is
mainly caused by MTurk raters annotating certain entries as either positive or negative where there was in
fact no empathetic response by the wizard. This is likely the result of the raters looking at data entries
independently and not as part of an entire dialogue. Therefore the wizard may not have expressed empathy
Table 4.2: Distribution of classes for two sets of labels
Negative Positive None
Wizard 20.6% 40.6% 38.8%
MTurk 24.9% 46.0% 29.1%
Figure 4.4: Multimodal RNN fusion (video: ResNet features + GRU; audio: MFCC features + GRU; text: BERT + FC; the audio and video embeddings are combined by a fusion GRU and then fused with the text embedding in an FC layer).
where it would have fit, to avoid redundancy of such expressions throughout the interaction. The low inter-rater agreement of the MTurk annotations demonstrates the intrinsic complexity of the task, which speaks to the nature of empathy as a social construct and its dependence on the empathy level of the person expressing it. Furthermore, the task becomes more difficult due to individual differences across the annotators with respect to their own personal experiences and self-identification with the user.
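The agreement statistics above can be reproduced along the following lines, assuming a ratings array of shape (instances x 5 raters) with labels in {negative, positive, none}; the strict-majority rule is an assumption about how ties were handled.

import numpy as np
from collections import Counter
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

def agreement_stats(ratings):
    """ratings: (n_instances, n_raters) array of categorical labels."""
    table, _ = aggregate_raters(np.asarray(ratings))   # per-instance category counts
    kappa = fleiss_kappa(table, method="fleiss")
    majority = []
    for row in ratings:
        label, count = Counter(row).most_common(1)[0]
        majority.append(label if count > len(row) / 2 else None)  # None = no majority
    return kappa, majority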
Table 4.2 shows the distribution of data across the different classes. To account for this difference between the two label sources, we evaluate and report results for both sets of labels throughout the experimentation.
4.4.3 Behavior Analysis
To study the verbal and nonverbal indicators associated with behavior that elicits empathetic responses, we used interpretable features from each modality to investigate such associations. For vision, we used facial action units; for speech, we opted for eGeMAPS features; and for language, we used LIWC to gain a better understanding of the social signals predictive of empathy. Linguistic Inquiry and Word Count (LIWC) is a dictionary-based tool that generates scores along different dimensions, including linguistic variables such as the number of conjunctions and pronouns, as well as affective and cognitive constructs [203].
After selecting a set of features, we ran one-way analysis of variance (ANOVA) and visually inspected the box plots of significant results (p < 1e−5). The behavioral features that stood out are shown in Figure 4.3. We could not observe any visible differences among the audio features. The sentiment of language, reflected in the LIWC tone, positive emotion (posemo), and negative emotion (negemo) scores, is a strong indicator of sentiment for recognizing empathetic response opportunities. The language used in describing less pleasant situations is more formal, which might indicate that participants were less comfortable sharing them. Social processes, including mentions of family members, were higher during the description of negative experiences, pointing toward interpersonal issues. Cognitive processes (cogproc), which involve describing causation, certainty, and insight, were lower for positive instances, suggesting that expressions of positive experiences were in simpler language.
Action units associated with positive expressions, AU06 (cheek raiser) and AU12 (lip corner puller), are strong indicators of positive sentiment. AU15 (lip corner depressor), which is associated with sadness, also showed stronger activation during negative instances. This suggests that visual features, in addition to verbal behavior, might be able to assist the recognition of sentiment for providing empathetic responses.
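The per-feature ANOVA described above can be sketched as follows, assuming a pandas DataFrame with one row per instance, a "label" column holding the three classes, and one column per interpretable feature (LIWC category, eGeMAPS statistic, or action unit intensity); the exact data layout is an assumption.

from scipy import stats

def significant_features(df, feature_cols, alpha=1e-5):
    """Run a one-way ANOVA per feature across the three classes; return the significant ones."""
    hits = []
    for col in feature_cols:
        groups = [df.loc[df["label"] == c, col].dropna()
                  for c in ("positive", "negative", "none")]
        f_stat, p_val = stats.f_oneway(*groups)
        if p_val < alpha:
            hits.append((col, f_stat, p_val))
    return hits  # candidates for box-plot inspection, e.g., tone, posemo, AU06, AU12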
4.4.4 Model Architecture
4.4.4.1 Unimodal models
For every modality, an encoder maps its input representations to a fixed-size vector or embedding. In unimodal classification, each of these encoders is then followed by a softmax layer for three-class classification. Language information is encoded with instance-based encoders, which consist of a single fully connected (FC) layer of a fixed size. Sequences of audio and visual features were fed to a single-layer gated
recurrent unit (GRU) that maps the vision and speech representations to a fixed-size embedding, keeping
only the last state. The obtained representations from unimodal encoders are followed by a softmax layer
for classification. Additionally, we developed multimodal models that fuse the aforementioned encoders, as described below.
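A minimal PyTorch sketch of one such unimodal classifier is given below; the 128-d hidden size follows the setup described later in this chapter, and the model returns logits on which the softmax classifier operates.

import torch
import torch.nn as nn

class UnimodalGRUClassifier(nn.Module):
    """Single-layer GRU encoder (last state only) followed by a linear + softmax head."""
    def __init__(self, input_dim, hidden_dim=128, num_classes=3):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden_dim, num_layers=1, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):            # x: (batch, T, input_dim)
        _, h_last = self.gru(x)      # h_last: (1, batch, hidden_dim)
        logits = self.head(h_last.squeeze(0))
        return logits                # softmax is applied by the loss / at inference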
4.4.4.2 Static fusion
In this architecture, features from the different modalities are initially passed through unimodal encoders, and their resulting embeddings are concatenated and fed into a fully-connected fusion layer followed by a
softmax classifier. The structure of this static fusion network is illustrated in Figure 4.2.
4.4.4.3 RNN Fusion
Similar to the static fusion model, the RNN fusion architecture initially produces unimodal embeddings for
each modality. However, in the case of vision and audio, which use RNN encoders, the temporal embeddings learned through the single-layer GRUs are concatenated and fed into an RNN fusion layer consisting of a single-layer
GRU. The text embedding is then concatenated with the output from the last state of the RNN fusion layer
and fed to a single fully-connected layer for final fusion (static). The output is finally passed through a
softmax classifier for three class classification. The RNN fusion network structure is shown in Figure 4.4.
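The sketch below illustrates the RNN fusion architecture under assumed feature dimensions (2048-d ResNet video frames, 39-d MFCC audio frames, 768-d BERT text vectors) and assumes the audio and video sequences have been resampled to the same number of time steps; it is an approximation of Figure 4.4, not the exact implementation.

import torch
import torch.nn as nn

class RNNFusion(nn.Module):
    def __init__(self, video_dim=2048, audio_dim=39, text_dim=768,
                 hidden=128, num_classes=3, dropout=0.2):
        super().__init__()
        self.video_gru = nn.GRU(video_dim, hidden, batch_first=True)
        self.video_fc = nn.Linear(hidden, hidden)   # extra FC only for the high-dimensional ResNet input
        self.audio_gru = nn.GRU(audio_dim, hidden, batch_first=True)
        self.text_fc = nn.Linear(text_dim, hidden)
        self.fusion_gru = nn.GRU(2 * hidden, hidden, batch_first=True)
        self.final_fc = nn.Sequential(nn.Dropout(dropout),
                                      nn.Linear(2 * hidden, num_classes))

    def forward(self, video, audio, text):
        # video: (B, T, 2048), audio: (B, T, 39), text: (B, 768)
        v_seq, _ = self.video_gru(video)
        v_seq = self.video_fc(v_seq)
        a_seq, _ = self.audio_gru(audio)
        _, h_fused = self.fusion_gru(torch.cat([v_seq, a_seq], dim=-1))
        fused = torch.cat([h_fused.squeeze(0), self.text_fc(text)], dim=-1)
        return self.final_fc(fused)  # logits; softmax applied by the loss / at inference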
4.4.5 Experimental Setup
In this work, we evaluate our methods on a dataset of 2185 instances of conversation excerpts from 186
participants. Given the size of the dataset at hand, we opted for a simpler neural network architecture that
can capture the patterns associated with empathetic responses while generalizing well. The model takes
temporal audio and video input features per data entry and a single representation vector for text. We
discard all data shorter than 1.5 seconds and, during training, apply random cropping of a 90-second window for long video and audio inputs (the average length of the data is 90 seconds). During evaluation, a middle segment with a maximum duration of 90 seconds is extracted.
For each modality, we designed an encoder network mapping the input feature space to a 128-d
embedding space. In both architectures, video and audio inputs are fed separately into two 1-layer GRUs
to obtain individual embeddings for both modalities. Only for the ResNet features, due to the higher dimensionality of the original space, did we add a 128-d fully connected layer after the GRU. For textual data, the BERT
vector representation is fed into a fully-connected layer to obtain a compact representation, reducing the
feature dimensions from 768 to 128. The embeddings from all modalities are consistent across the two
fusion networks. The two models employ different fusion architectures: (i) the static fusion model uses the concatenation of the three embeddings and feeds the multimodal representation vector to a fully-connected layer, with a dropout value of 0.2, to obtain a final vector of size three, containing the probabilities among the three classes; a softmax classifier is then adopted to perform the classification. (ii) The RNN fusion model
initially fuses the temporal video and audio sequences using a GRU of size 128 and then concatenates the
bimodal representation with the text embedding. Similar to the static fusion network, the multimodal
representation is fed to the fully-connected layer, with a dropout value of 0.2, obtaining the final probability
vector on which a softmax classifier performs the classification. A cross-entropy loss with a weight vector, learned from the training set, is used to account for the data imbalance, and the evaluation results are computed using micro F1-score. A 10-fold cross-validation is used for training and evaluation of the dataset. We optimize the network using Adam, with a batch size of 32 and a learning rate of 10−4. In each iteration, 20% of the training data is held out for validation, and the best performing model on the validation set is selected. In the case of the multimodal models, the encoders and fusion layers are all trained jointly for 100 epochs.
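A sketch of this training configuration is shown below; train_loader, val_loader, the label array, and the evaluate_micro_f1 helper are hypothetical placeholders, and inverse-frequency class weights are an assumed choice for the weight vector learned from the training set.

import copy
import numpy as np
import torch
import torch.nn as nn

def class_weights(train_labels, num_classes=3):
    """Inverse-frequency weights (assumed scheme) for the weighted cross-entropy loss."""
    counts = np.bincount(train_labels, minlength=num_classes)
    return torch.tensor(counts.sum() / (num_classes * counts), dtype=torch.float)

def train(model, train_loader, val_loader, train_labels, epochs=100):
    criterion = nn.CrossEntropyLoss(weight=class_weights(train_labels))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # batch size 32 is set in the loader
    best_f1, best_state = -1.0, None
    for _ in range(epochs):
        model.train()
        for video, audio, text, y in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(video, audio, text), y)
            loss.backward()
            optimizer.step()
        f1 = evaluate_micro_f1(model, val_loader)               # hypothetical helper
        if f1 > best_f1:
            best_f1, best_state = f1, copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return model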
Since there is no prior work whose results are directly comparable with our work, we compare our
results against a text-based sentiment analysis method, given the similarities between our problem and
classical sentiment analysis. For our text baseline, we use the Valence Aware Dictionary and Sentiment Reasoner (VADER), a lexicon- and rule-based sentiment analysis tool [124].
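A sketch of this baseline, using the vaderSentiment package and the recommended 0.05 threshold on the compound score:

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def vader_prediction(text, threshold=0.05):
    """Threshold the compound score to predict positive, negative, or no empathy."""
    compound = analyzer.polarity_scores(text)["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "none"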
4.5 Results and Discussion
To inform our design decisions for the multimodal networks, we initially trained and evaluated unimodal
classifiers using different feature-sets. The results from unimodal classification, evaluated by micro F1-scores
are shown in Table 4.3.
Table 4.3: F1-scores for three-class classification (reported as MTurk / Wizard labels).
Audio, MFCC: 0.38 / 0.36
Audio, eGeMAPS: 0.37 / 0.35
Video, AU+Pose: 0.38 / 0.35
Video, ResNet: 0.46 / 0.43
Text, BERT: 0.64 / 0.61
Multimodal, Static Fusion: 0.69 / 0.61
Multimodal, RNN Fusion: 0.71 / 0.61
Baseline, VADER (text): 0.58 / 0.44
Unimodal classification results demonstrate the superiority of text in content representation and
predictive power, exceeding performance from visual and audio modalities. This result is consistent with
prior work on multimodal sentiment analysis [290, 217], and is likely accentuated by the real-world setting and low expressiveness of this interactive scenario.
The multimodal networks are trained on the best performing feature-sets from each modality, meaning
ResNet for video representation, MFCCs for audio and BERT for language. The audio representations had
low predictive power for both MFCC and eGeMAPS in unimodal classification, which may be the result of the audio quality and recording conditions. When training the models with MTurk annotations, the results from the
multimodal networks show an increase in performance using the RNN fusion model, which speaks to the
existing temporal inter-dynamics of audio and video captured by this network. The multimodal networks
gain an overall advantage over the textual unimodal network which is the highest performing unimodal
classifier in this task (see Table 4.3).
Our unimodal text classifier outperforms the text sentiment baseline. Using the recommended threshold
on compound sentiment score, i.e., 0.05 for VADER, a text-based sentiment analysis achieves F1 = 0.58
for MTurk labels and F1 = 0.44 for wizard labels. We also tested the sensitivity of the threshold value
and found that the best possible results are only slightly different (see Figure 4.5). Hence, our text-based
method using BERT comfortably outperforms VADER, which further validates our approach.
Figure 4.5: F1-scores of VADER sentiment analysis with different thresholds (MTurk and Wizard labels).
The results demonstrate that model performance is higher when trained on MTurk labels, for both multimodal and unimodal classification. The aggregation of labels from five annotators provides higher reliability and potentially lower between-person variability. Additionally, the wizard has an understanding of the conversation context and may experience different inter-personal connections to the story or person, which would affect the empathetic responses in ways beyond the ability of our model to capture.
The column-wise-normalized confusion matrices for the RNN fusion model across wizard and MTurk ratings are shown in Table 4.4. The results show similar patterns for both label sets and indicate that false predictions are mainly confusions of positive or negative responses with no empathy, i.e., predicting positive/negative responses where none was necessary, or predicting no empathy where positive empathy would have been a better response. To deploy such a system in real interactions, high precision in detection is necessary, as confusion of positive and negative responses would disrupt the interaction. Examples of the model’s predictions on MTurk labels are shown in Table 4.5. Instances like the second entry are
dependent on the personalities and inter-personal relationships of the interlocutors. However, instances
like the third entry can be disruptive to the interaction and require further attention.
Table 4.4: Confusion matrices (RNN fusion); rows are ground-truth labels, columns are predictions, normalized column-wise.
MTurk labels:
            Negative   Positive   None
Negative    72.12%     4.15%      11.43%
Positive    9.85%      78.44%     30.60%
None        18.03%     17.41%     57.97%
Wizard labels:
            Negative   Positive   None
Negative    49.65%     1.59%      12.02%
Positive    14.24%     74.87%     30.80%
None        36.11%     23.54%     57.18%
4.6 Conclusions
In this chapter, we reported on our efforts in automatic recognition of opportunities for providing empathetic
responses. To this end, we labeled and analyzed a dataset of human-agent interactions in the context of
a semi-structured interview. Our analysis demonstrated that facial expressions of emotions and verbal
content are the important channels for recognizing such opportunities.
Table 4.5: Instances of the RNN Fusion model’s correct/incorrect predictions on MTurk labels (Positive, Negative, None).
Prediction/Label: Neg/Neg
A: What got you to seek help?
H: My mood was just not right I was always feeling down and depressed and lack of energy always wanting to sleep um lack of interest
Prediction/Label: Non/Neg
A: What’s your dream job?
H: Designing for the movie industry
A: How hard is that?
H: Extremely so I never really pursued it
Prediction/Label: Neg/Pos
A: What do you do when you’re annoyed?
H: When I’m annoyed you know I really don’t get annoyed that much I just let it go it’s not worth the pain and problems they could cause if I can’t straighten out a problem let it go
We developed and evaluated a deep neural network capable of multimodal learning of such opportunities.
The best unimodal result was achieved by encoding language with a Transformer network (BERT) pre-trained on a large amount of data and performing classification on top of it. Fusing the verbal channel with facial expressions, our recurrent neural network fusion provided the best result of F1 = 0.71, which is comparable to recent work on multimodal sentiment analysis [290].
Analysis of two sets of ground-truth labels, from the experiments and from independent observers, showed
that empathy, similar to other social constructs, may suffer from indistinct boundaries that can be affected
by inter-personal relationships and individuals’ personalities.
Embodied virtual agents and social robots that can emotionally engage their users have a huge potential
in multiple domains including healthcare and education [64, 240]. With this work, we provide a blueprint
for developing empathetic machines.
This project investigates questions 1.2, 2.2 and 3.2 of this thesis and provides a solution for them. First, regarding 1.2, the agent looks for the user’s verbal and non-verbal cues in three modalities: audio, visual, and text. Subsequently, it processes them to find the best type of feedback to provide in an offline setup. Regarding 2.2, although the study was done in a WoZ setting, it was able to provide useful empathetic feedback in response to the user’s stories. Finally, the project makes use of a prediction model that is a means for deriving the agent’s feedback from the perceived input signals.
Chapter 5
RAVE
To investigate question 3 (Types of multimodal feedback perceived by agents), we designed a multimodal
system called RAVE. RAVE is a system that engages in a multimodal and multiparty interaction with deaf
and hard-of-hearing infants. The goal of the system is to provide a tool to facilitate teaching sign language
to these babies. To this end, the system uses a virtual human and an embodied robot to build a three-way
interaction with the baby. This scenario is a good example of why we need different kinds of modalities
since we have specific motivations for each of them. The system uses visual input (eye tracking as well as
body movements) and not audio, since the babies cannot hear. We want to have an embodied robot because we are not sure about the maturity and potential of the visual system, i.e., whether or not babies will pay attention to the avatar on the screen. We also use physiological data (thermal-infrared-based
technology) because of the lack of well-developed communicative abilities in the babies. These signals
are called “honest signals” [204]: unconscious social signals that are subtle patterns of human interaction
that reveal our attitudes. The system uses these inputs in order to provide appropriate feedback and to
maintain a coherent interaction. The virtual human is capable of providing sign language patterns of
“nursery rhymes” to the baby, which we consider a form of story in this interaction. Thus, the agent in this scenario is a story-teller agent who can provide appropriate feedback and therefore fits into cell 3 of Table 1.1. Although the experiment is designed in a scenario-specific context, it undertakes to provide
answers to research question 3.
5.1 Introduction
Many AI systems have been designed for facilitating language learning by adults, and to a lesser extent,
children [256, 96, 128]. However, there is a significant paucity of work on AI systems for young infants
despite the widely understood critical importance that this developmental period has for healthy language
and cognitive growth, and related reading and academic success [214]. Children have proven to be a challenging population for which to design language learning technology, and some technologies designed for children have been shown not to facilitate language learning. For example, Krcmar [144], Richert, Robb, and Smith [228], and Krcmar, Grela, and Lin [145] have described studies showing that children who receive linguistic stimuli from television do not learn as much as those who receive the same linguistic stimuli from live adults.
Children have a tremendous ability to acquire language from observation of and engagement in social
interactions. This ability is particularly salient during the first year of life, when infants begin to display the
capacity to both understand and generate language through vocal or manual babbling [211, 206, 207, 208].
Our particular interest is young babies who lack the necessary language exposure in early life. Based
on discoveries of brain-based “critical periods” of human development for language (e.g., ages 6-12 months;
[213]), a growing body of neuroscience research has identified the potentially devastating impact that
minimal language exposure during this particular period of child development can have on children’s
linguistic, cognitive, and social skills, be they hearing or deaf infants [212]. These effects might even last
for years [237].
While children may have minimal language input for a variety of reasons, children born deaf or with
severe hearing loss are particularly at risk and often not reachable by traditional interventions [212]. More than
90% of deaf infants are born to hearing parents with little sign language experience; these parents face
the challenge of learning to sign at a rate that matches or exceeds the rate at which their child requires
language input [116]. While there are some technology-based interventions that can help restore or support
hearing in some children (such as cochlear implants), these technologies often cannot be deployed until
after 18-24 months of age and can result in continued language deficits following more than a year of
minimal language input [199]. As such, technology that can help fill a language-exposure gap can have a
tremendous impact for social good in these populations, especially in young deaf babies who can experience
dramatically reduced or no access to usable linguistic input during this period.
In this chapter, we describe the design of a novel technology-based intervention for language learning
for deaf infants. The system, called RAVE (Robot, AVatar, thermal Enhanced language learning tool), focuses
on multiparty social interaction and learning elements of a visual sign language, American Sign Language
(ASL) [139]. RAVE consists of two agents: a virtual human (provides language and socially contingent
interactions) and an embodied robot (provides socially engaging physical cues to babies and directs babies’
attention to the virtual human). The use of a virtual human on a screen provides the benefit of having
an expressive agent (both in facial expressions and posture) that can produce a natural signed language
as linguistic input. RAVE brings together science from multiple disciplines to explore the potential for
technologies such as functional Near Infrared Spectroscopy (fNIRS) brain imaging (measures the baby’s
higher cognition), thermal IR imaging (measures the baby’s emotional engagement), robotics, and virtual
humans in an attempt to positively influence the learning process.
RAVE was designed specifically for babies within the age range of 6-12 months. We base our selection of
both the age range and the specific language stimuli on foundational research conducted as a part of the
present study using fNIRS of infants across different early-life developmental periods while processing
different types of language stimuli [212, 214]. These studies yielded evidence that select brain sites and
systems underlying human language learning are sensitive to specific rhythmic temporal patterns central
to phonological structure in all world languages (spoken and signed). Further, these neural systems are
most negatively impacted by minimal language input specifically during the ages 6 to 12 months [214,
257]. Crucially, an infant’s exposure to language stimuli during this period with select phonetic-syllabic
rhythmic temporal patterning (frequencies) most powerfully engages these neural sites and systems to
support later healthy language, phonological, reading and cognitive growth [214]. We then built American
Sign Language (ASL) nursery rhymes with these specific phonetic-syllabic rhythmic temporal patterns and
used them as our linguistic stimuli to match infants’ biological sensitivity within this precise developmental
period.
Providing linguistic stimuli to 6-12 month old deaf infants presents a unique set of design challenges.
Our system must be capable of dealing with the developing perceptual, cognitive, and social responsiveness
of 6-12 month old infants. Most of the standard application design methodologies, which involve posing
detailed questions to the user population, have extremely limited usefulness with this population. The
system must also not rely on any type of auditory cueing that is typical for language-based applications.
Because social interaction and context are essential to language learning in children [148, 251, 260],
social agents are a natural starting point for our design. The choice of social agent, however, exposes the
central design challenge for this work. We might choose a virtual agent, a character on a screen, to provide
linguistic input to the infant. Virtual characters have both the manual dexterity and the expressiveness
of posture and facial expression to produce sign language at a limited, but reasonable fidelity [242, 138].
However, even exceptionally well-designed educational material presented to infants on a screen have
resulted in only minimal learning gains by those infants [61]. Phoneme distinction [148], word referent
mapping [145, 144], and lexical category retention [249] all showed only minimal gains when using a
screen-based intervention.
We could instead choose to use a physically-present robotic agent. Infants as young as six months of
age have been shown to respond socially to physically-present robots, selectively following the robot’s
gaze and engaging the robot with social overtures [180, 13]. However, these robots lack both the physical
dexterity and expressiveness to mimic even a limited set of signs. Typically, robots often have only 2-3
degrees of freedom in their hands because of space and cost requirements, far less than is required for most
signed languages [142]. Expressiveness displayed through posture and facial expression is similarly limited,
with only a few systems approaching anything close to what visual languages require [6]. Even robots
that have been designed specifically to produce signed language [266, 162] or to act as tutors for signed
languages [141, 273] often fail to have the full range of manual dexterity and expressiveness required. Most
significantly, all of the robots developed to date for signing lack the temporal patterning and fluidity of
motion that is a critical part of infant sensitivities and none were designed to be appropriate for infants due
to their size and/or appearance.
Figure 5.1: Physical deployment of system components from the front view
The solution that we present here seeks to use both a screen-based virtual human and a physically-present robot to engage infants in triadic interactions (see Figure 5.1). Our hypothesis is that the pair of
agents will provide both the dexterous, expressive language production (via the virtual human) and the
socially-engaging physical cues that trigger learning (via the robot).
While RAVE’s design draws on the best available research, there is still a question
of whether it really could be used to facilitate babies’ language learning. Is there evidence that the babies’
behaviors are influenced and/or facilitated by the avatar’s behaviors? Is there a principled and predictable
relationship? We will perform both a qualitative and a quantitative evaluation of the RAVE system with
babies. In particular, we focus on the baby’s interaction with the virtual human avatar, which provides
multiple kinds of social and linguistic behavior.
In addition to the thesis research questions that this project is investigating (investigating the means
for agents to understand and provide feedback in a story-telling context), we develop some chapter-specific
research questions as follows:
1. Do babies attend to the avatar and respond to its communicative behaviors?
2. Can babies with little or no exposure to ASL distinguish among the avatar’s different conversational
roles (or modes: Linguistic Nursery Rhymes; Social Gestures; Idle/nonlinguistic postures; 3rd person
observer); particularly as they appear on a flat TV monitor?
3. If they can distinguish between the avatar’s different roles, how do they react to each role? Can an
avatar stimulate babies’ production of socially contingent responses, and crucially, nascent language
responses?
4. What, if anything, is the impact of the presence or absence of parents’ participation in the conversational interaction?
The following section reviews the related work on infant-technology interactions, on socially-assistive
robots for child tutoring, and the use of virtual avatars for language instruction. We then describe the
system design and implementation of different components in the next sections. Next, we will focus on
the dialogue manager of the system. Finally, we will describe our experimental protocol, and we will
present both a preliminary evaluation (via three case studies) and an extensive evaluation of the system.
5.2 Background and Related Work
Language is the principal system of expression and communication for humans and arguably the most
prominent cognitive and cultural tool that distinguishes human beings from other species. Acquiring
language commences from birth aided by multiple factors, including brain-based sensitivities to aspects of the
specific rhythmic patterning of human languages, observation, and engagement in social interactions with
the outside world [86]. Language exposure plays an important role in infants’ early development of linguistic
abilities. Ages 6-12 months is widely recognized as a critical developmental period for language [213, 214,
254]. It is during this period that babies acquire essential phonetic-syllabic segments unique to their native
language, which make possible their ability to acquire their native language’s vocabulary, discern their
language’s distributional and syntactic regularities, and crucially, to engage in letter-to-phonetic segment
mapping in early successful reading [146, 214]. In early life, select brain sites participate in early human
language learning (such as the Planum Temporale in the Superior Temporal Gyrus), which are sensitive to
specific rhythmic temporal patterns at the nucleus of phonological structure found in all world languages
(spoken and signed) [213, 209]. Exposure to these patterns is crucial for the development of this brain sites
and systems to support later healthy language, phonological, reading, and cognitive growth [214]. Children
deprived of this early exposure specifically during the ages of 6 to 12 months may face dire consequences
such as delays in cognitive, linguistic, reading, and social skills which may last for years [237, 257] with
accompanying devastating lifelong impact of reading and academic success [214].
Intriguingly, the same developmental brain sensitivities to the rhythmic temporal patterns of human
language phonology also exist in deaf babies learning a natural signed language, and they develop on the identical maturational timetable as in hearing babies [208, 206, 207]. This universal brain sensitivity provides young sign-exposed babies with the early-life language input that permits them to build a sign phonological
system vital to letter-to-sign-phonetic segment mapping in successful reading acquisition [214]. All babies
who miss exposure to the patterns of their natural language in early life (be it a signed or a spoken language)
are rendered at risk for language and reading delays spanning life [213, 214].
91.7% of the deaf individuals in the National Census of the Deaf Population (NCDP) came from families
where both parents were identified as hearing [118]. In these families, quickly learning a new signed
language can become a challenge for the parents. There are some speech-based interventions such as
cochlear implants designed to make spoken language available to the young deaf baby [121, 281]. However,
most of these tools cannot be deployed until the ages of 18-24 months. While efforts have begun to
implant children at younger ages (from ∼8 months), precise adjustments, tuning of the device, as well as
intensive speech training, still typically begins after ages 18-24 months and proceeds for months into years
thereafter [199]. Thus, this is well past the early critical period for learning phonological units, phonological
segmentation, categorization and mapping, and sequencing distributions - all vital to optimal, healthy
language learning and reading. As such, there is a pressing opportunity for AI technology that can provide
signed language input in the critical period of 6-12 months [240, 254].
Prior work explored the pedagogical and social functions of screen-based and physically-present
technologies for infants. Anderson and Pempek [11] posited that infants and young children learn less well
from a demonstration on a television or video program than from similar real-life experiences. This “video
deficit effect” has been widely replicated and found to exist in several aspects of language development
including phoneme distinction [147], word referent mapping [145, 144], and lexical category retention [249].
However, DeLoache et al. [62] found that young children (age 30 months) were better able to learn new
words from watching a video when its content was of two people’s social interaction than when a single
person appeared to directly address the child, emphasizing the important role social interaction plays in
infant language development.
In terms of physically-present technologies, Meltzoff et al. [180] found that toddlers (at approximately
18 months of age) were more likely to follow the gaze of a robot that had been seen interacting in socially
communicative ways with an adult. Similarly, Arita et al. [13], using a looking-time paradigm to gauge
expectation, found that infants expected robots to be spoken to like a human, but only if the robot had
been seen previously to interact with a human interlocutor. Together these findings suggest that infants
can attribute social agency to robots.
Robots that provide social and cognitive support, rather than engage in direct physical manipulation, are
called socially assistive robots [85]. This type of robot is often a desktop-mounted system that has a small
number of degrees of freedom for expressive movement but is incapable of locomotion or manipulation.
These robots have been used in a variety of tutoring and educational domains, including providing emotional
support to children undergoing blood draws [105], teaching children how to deal with bullies [154], tutoring
basic mathematical problems [224], and supporting social skills tutoring for children with autism spectrum
disorder [239]. Language instruction has been particularly successful with older children (typically 4-6 year
olds) with these robotic systems; Spanish-speaking children struggling to learn English in public schools
showed 2 standard deviations of improvement following a 5-session intervention with a robot tutor [156].
The present work unites a multidisciplinary team to extend beyond previous work by considering both a
much younger population and a specific population that has not been previously addressed (deaf infants).
Virtual Humans [230] combine an animated humanoid avatar body with a behavior control mechanism,
to create a human-like animation interface to computing. Virtual humans have been successfully used
for a number of purposes including education [229, 98, 130], and practicing human interaction activities
[231, 89, 68]. There have been several recent efforts to create signing virtual humans (e.g., [215, 295, 126]),
however, so far they have focused only on manual sign-shape, rather than the full complement of necessary
synchronized behaviors for grammatical production, including facial expressions [138, 132]. Virtual humans
have been successfully used with young children for educational purposes [261, 135, 269], however, to our
knowledge, not previously with deaf infants.
5.3 Overall System Design
As we knew of no other interactive technologies developed for this population, we followed an integrated,
iterative design methodology. We tested prototypes and partially-operational systems with infants as we
iteratively improved our hardware designs, software systems, and interaction goals.
The overarching aim of the joint avatar-robot system is to provide language exposure during a critical
period for language development, thereby mitigating developmental delays due to minimal language
experience in deaf infants. To this end, the design of the system was guided by two goals: 1) to provide
socially contextual language through the interactions of a robot-avatar-child triad, and 2) to maintain child
attention for reasonable age-appropriate periods of time. The system behavior should be contingent on
the behavioral and arousal state of the infant. Appropriate social interaction scaffolds the development of
social [107] and higher cognitive functions [175], and social contingency in particular is hypothesized to
be critical to language acquisition [212]. Maintaining attention is important because if the infants are not
sufficiently engaged then language learning is impossible to accomplish.
We iterated over several designs and selections of components for the system. Here, we will go through the three main versions of the system.
5.3.1 Version 1
In the first integration of robot and avatar output, the perceptual components were not yet trained on appropriate data or available for real-time use. We thus used a “Wizard of Oz” interface to allow human controllers to select specific behaviors for the robot and avatar, following a rough protocol based on the above guidelines. This process allowed full human perceptual abilities and decision making to guide the selection of behaviors, while still allowing evaluation of whether the pair of agents and set of behaviors could be used to engage the infant as desired. The main goal of this design was to determine the ability of the system to engage the infants in an interactive and engaging social routine that would ultimately lead to visual language episodes.
5.3.2 Version 2
Observing the success of system version 1, we moved toward a fully automated version of the system
including a dialogue management script written in Python (refer to section 5.8 for more detail) that was
informed by the thermal imaging and eye-tracking perception, as well as physical states of the robot and
avatar. Each combination of information states led to behaviors by robot and avatar (sometimes just waiting,
sometimes complex, synchronized behaviors, such as the robot directing attention to the virtual human
who then starts the rhyme). As observed in the case studies, infant behavior, not just attentional focus and arousal, is important in establishing contingency; in particular, communicative and social behaviors toward the agents must be reacted to. This represents a challenge, because the state of the art in recognizing infants’ posture and communicative behaviors is not as well developed as that for adults, and most ready-to-use perception systems are not trained on infant data. The experiments done with system version 2 revealed that the system often failed to respond to the baby’s social and communicative gestures. Since contingent interaction was crucial to the ultimate goal of the system, this was considered a major defect in our design.
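For illustration only, the following sketch shows the flavor of an information-state policy of this kind; it is not the actual RAVE dialogue manager (described in section 5.8), and all state fields and behavior names are hypothetical placeholders.

from dataclasses import dataclass

@dataclass
class InfantState:
    attending_to: str   # e.g., "robot", "avatar", "elsewhere" (from eye tracking)
    arousal: str        # e.g., "low", "ok", "high" (from thermal imaging)

def select_behaviors(state):
    """Map a perceived infant state to coordinated robot/avatar behaviors (illustrative only)."""
    if state.arousal == "high":
        return [("robot", "idle"), ("avatar", "idle")]             # back off, let the infant settle
    if state.attending_to == "elsewhere":
        return [("robot", "attention_getter")]                     # re-engage the infant
    if state.attending_to == "robot":
        return [("robot", "look_at_avatar"), ("avatar", "greet")]  # hand attention to the avatar
    if state.attending_to == "avatar":
        return [("avatar", "nursery_rhyme")]                       # deliver linguistic stimulus
    return [("robot", "idle")]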
5.3.3 Version 3
In order to address the problem observed with Version 2 of the system, we added additional visual
perception components, initially to record data, with a goal of later customizing recognition components
based on this data. For the third and final version, we introduced a semi-automated behavioral policy that
relies on a human observer to signal specific complex perceptions (beyond arousal and attention), but uses
the automated controller to make avatar and robot behavior decisions based on the updated state.
Our hypothesis was that to capture the attention of infants, we should provide a linguistic stimulus
that matched the rhythmic temporal patterning found in all natural languages, including natural signed
languages, such as ASL. Petitto et al. [206] found that babies are sensitive to specific rhythmic temporal
frequencies of language in early life. Specific rhythmic temporal patterning binds phonetic-syllabic segments
into words, phrases, and clauses in spoken languages, with identical processes occurring in signed languages,
whereupon specific rhythmic temporal patterning binds sign-phonetic units into signs, signs into sign
phrases and clauses, and signed sentences [214]. Grammatical information is also communicated in ASL via
systematic (rule-governed) patterned changes in handshape, eye gaze, grammatical modulations of space
and movement, and grammatical body shifting and crucially, facial expressions [227, 138, 242, 215, 295,
126]. Thus, our use of a virtual character on a screen was a key design feature that provides the benefit of
having an expressive agent that produces the optimal rhythmic temporal patterning vital to acquire the
phonological building blocks of language, manual dexterity, and obligatory facial expressiveness to produce
signed language samples as linguistic input.
We use a robot since it is a physically-embodied agent, which provided a mechanism to engage the
baby, a locus for facilitating attention to the virtual human, and a means to introduce a more natural social
conversational setting whereupon agents and baby can occupy varying conversational roles. It has been
found that a physical robot can evoke interest and social responses from young babies (Meltzoff et al. [180]
and Arita et al. [13]), but even robots that have been designed specifically to act as signed language tutors
[141, 273, 12, 143], often cannot support the full range of manual dexterity, fluidity, rhythmic temporal
language patterning, and facial expressiveness required [142, 266, 157]. Hence our reasoning to use a
physical robot to act as an initial target for infant attention. We also predicted that contingent interactions
between the robot and the virtual human help establish each agent as a socially-interacting conversational partner rather than an object. Our design to use a combined robot+avatar was further supported by previous studies that found that exclusively video-recording and playing back visual language is unlikely to
work [145, 228, 144]. Kuhl, Tsao, and Liu [148] found that exposing American infants to human speakers of
Mandarin Chinese reversed a decline in perception of Mandarin phonetic segments, but exposure only to
audio or to audio-visual recorded stimuli did not.
Figure 5.1 shows the physical deployment of the hardware components from the front view. Multiple
webcams were used to record the experiment from different angles. A snapshot of the experiment is shown
in Figure 5.2. As seen in Figure 5.2, the infant was sitting on a parent’s lap, facing the avatar’s monitor and
the robot. In the next sections, we will go through different components and perception modules of the
system.
5.4 Robot
The robot’s primary purpose is to capture and direct the infant’s attention to the avatar. The robot used
here is a modified version of Maki v1.4 by Hello Robo [172], which is a 3D printable, open source robot
(Figure 5.3). Maki’s compatibility with the design guidelines we identified as relevant for infant-robot
interaction made it a strong candidate as our base robotic platform. Described below, these guidelines were
drawn from research fields including animation, developmental psychology, and human-robot interaction.
Figure 5.2: Multiparty interaction between Avatar, Robot, and infant from multiple viewpoints
5.4.1 Design Guideline 1: Attractive
The robot’s appearance needs to be salient enough to capture a deaf infant’s attention without impeding
its ability to transfer the infant’s attention to the avatar when appropriate. To support this, we chose an
anthropomorphic appearance with accentuated infantile characteristics. The robot stands 13.5 inches high on a small torso; we enlarged the eyes and minimized the other facial features. Infant-like appearance has been
posited to be particularly salient to all humans [166], and has been empirically demonstrated to be highly
salient to children as young as 3 years old [34].
The robot relies on the infant’s ability to follow gaze. Infants as young as 6 months are capable of
following gaze direction, although this skill is more accurate in infants aged 12 months and older [35].
Through joint attention episodes, the robot influences the infant’s attention. The robot has an
articulated head (pan left/right, tilt up/down), articulated eyes (pan left/right, tilt up/down) and eyelids
(open/close) to enable naturalistic directional shifts of attention. Once the robot had captured the infant’s
Figure 5.3: The robot used in this study (shown in three figures to the right) is based on the open-source
Maki platform from Hello Robo (shown at left). Modifications have been made to accentuate infant-like
features, to reduce distractions, to increase the saliency of the directionality of head positions, and to
support judgments of animacy and agency.
attention, it turned its head and eyes to look at the avatar. We believe that the robot having physical eyes
and eyelids (as opposed to screen-animated facial features) is important for infants to consider it as a
social and communicative agent.
5.4.2 Design Guideline 2: Simplified
As the primary purpose of the robot is to direct attention to the avatar, we seek to minimize morphological
features and behaviors that might distract the infant. We chose to hide the large, dark circular “ears” of the
robot as these high contrast variations were particularly distracting to younger infants. By eliminating
the color contrast difference, the robot’s eyes remained the most salient feature, especially when the head
turned toward the avatar. While we also created a version of the robot that added a pair of 2-degree-of-freedom eyebrows, we chose not to use the added expressiveness that these features provided because
of the additional salience that they added. It is imperative for the robot to act as a supporting actor to
the avatar, like a sidekick to a hero. (This relationship in human-robot interaction was suggested initially
by Vázquez [277].) We limited the behavioral repertoire of the robot to a few core behaviors: directing
attention through joint head and eye movement, responding to overtures from infant and avatar, showing
surprise, and engaging the infant with a game of “peek-a-boo”. To maintain novelty, we opted to provide
small variations to these core behaviors rather than attempting to provide a richer set of behavioral options.
5.4.3 Design Guideline 3: Identifiable
As eye gaze was the robot’s primary method of interaction and communication with the infant, it was
imperative that the robot’s gaze direction and head orientation be easily identifiable to the infant
at all times. While human-robot gaze has been studied extensively (see Admoni and Scassellati [3] for a
review), none of this work specifically has targeted infants. In human-to-human interactions, infant gaze
has been studied extensively. Prior to age 14 months, infants largely look at an adult’s head orientation as
an attention following mechanism [59]. Wilson et al. [282] suggested that the change in profile and nose
angle from center might be cues that humans use to determine head orientation [150].
We made three primary modifications to the base Maki robot to accentuate head orientation and gaze
direction. First, we increased the color contrast of the eyes in order to render them more visible. Second,
we added a strip of faux-fur material to the top of the robot’s head to give the appearance of a stripe of
hair. As the robot’s head was nearly spherical and the other directional features (nose and mouth) were
only minimally represented, this feature provided a key mechanism for discriminating head orientation.
Third, inspired by principles of studio animation, we exaggerated the robot’s movements and behaviors to
exploit the full range of physical motion that the robot was capable of performing. While we did not test
the effects of these modifications individually with infants, each provided our own (adult) staff a noticeable
increase in performance in estimating head orientation and gaze direction.
5.4.4 Design Guideline 4: Agentic
In order to sustain the idea that the robot is a social and communicative agent, the robot must further
demonstrate intention, respond appropriately and in a timely manner to interactions towards it for an
appropriate duration, and demonstrate passive life-like movement when resting in between interaction
periods. We constructed a simple behavioral repertoire for the robot that supported the illusion of agency.
This included both goal-directed actions (e.g., looking toward a target location), expressive actions (e.g.,
showing “surprise” by opening the eyelids wide and drawing the head slightly back), and idle behaviors
that were active at random intervals whenever other activities were not being performed (e.g., blinking and
small shifts in gaze position). The length of all of these behaviors was limited to no more than a few
seconds both to accommodate the relatively attenuated attention span of infants and to allow more rapid
responsiveness to actions by the infant or avatar.
Finally, we found during the course of our design and evaluation that transitivity was maintained
with the robot. As Brooks and Meltzoff noted, when known social agents (human experimenters) treated
the robot as a social agent, then toddlers would tend to attribute agency to the robot [180]. We used a
familiarization protocol during which the researcher was instructed to act as if the robot and the virtual
agent were social agents by greeting them individually, making eye contact, responding to social gaze from
the system, and engaging the system in a simple social exchange.
5.4.5 Design Guideline 5: Safe
While the robot was generally kept out of reach of the infant, we considered physical safety during the
design of the system. We eliminated potential pinch points in the robot hardware, used a physical design
that was large enough to be difficult for an infant to grasp and lift while still maintaining a lightweight
frame, and limited the speed of movement to limit possible accidental contact. Electronics are maintained
out of reach and the robot can be wiped down with disinfectant wipes between children.
5.5 Avatar
The virtual human’s primary purpose is to provide visual language stimulus to the infant, and, along with
the robot, to engage the infant in a contingent social interaction. Thus, some of the design guidelines
are similar to those for the robot, while others are complementary. Like the robot, the avatar should be
attractive and agentic. Rather than striving for a simplified appearance, the goal of providing high fidelity
sign language stimulus pushed in the opposite direction - a very human-like avatar that is capable of
performing fluent visual sign language samples. Safety is not a concern, given the agent is in the virtual
world (the concern lies with the conventional display hardware rather than the avatar itself).
To meet the goals of attracting the infant and being able to perform realistic human-like behavior, we
decided to base the avatar on scans of a young deaf female native signer of American Sign Language (Figure
5.4). This choice also allows use of motion capture of the same individual, which increases the ability to
target the motion to the avatar. We briefly describe the process for creating the avatar model, behaviors,
animation, and control of the avatar.
As described above, specific rhythmic temporal patterns underlying phonetic-syllabic organization
in all languages served as the structural template on which the specific ASL nursery rhymes used in the
present studies were built [212, 206, 207, 214].
5.5.1 3D model construction
The avatar was constructed by capturing a native signer inside a photogrammetry cage using 25 megapixel
DSLR (Digital Single Lens Reflex) cameras (Figure 5.4). The 3D body model was then reconstructed using
photogrammetry software [5]. A virtual skeleton was then added to the 3D
model to allow for articulation and deformation using linear blend skinning. A set of facial scans was also
captured using a light stage [61] and used as reference imagery. A set of joints were added to the avatar’s
3D face to allow for deformation and movement, as well as 3D models for the eyes, teeth and tongue.
Figure 5.4: The generation of the 3D avatar model. Left: The participant inside a photogrammetry cage.
Right top: The 3D avatar model generated from photogrammetry capture. Right bottom: Light Stage facial
scans used for reference for avatar construction.
5.5.2 Motion capture
Motion data were produced through full-body capture via a camera-based motion capture system (VICON), with the collaborative resources of Gallaudet University’s Motion Capture Laboratory (M. Malzkuhn and J. Lamberton) and the University of Southern California’s Institute for Creative Technologies (A. Shapiro). The raw motion data were then re-targeted [92] onto a skeleton that matched the topology of the
avatar model. Facial animation was manually keyframed by an animator. A variety of nursery rhymes,
conversational fillers (e.g., “yes”), short utterances (“What’s that?”), and idling poses were captured and
processed.
5.5.3 Realtime animation and control
The 3D avatar was animated and controlled through the use of a real-time character animation system
[245]. The animation system includes control mechanisms for playing prerecorded animations, gazing, and
head movements (such as nodding and shaking).
5.5.4 Avatar behaviors
For the purposes of evaluating the ability of the system to engage in socially contingent interaction with
the baby, we focus the analysis on the avatar’s different conversational modes, including categories for
noncommunicative behavior, social dyadic and triadic (including the robot) behaviors, and those that
contain developmentally appropriate linguistic features. The categories used are as follows:
1. Idle behaviors (“Idle”) are nonlinguistic/nonsigning, non-socially-communicative neutral bodily
postures (e.g., arms at side with typical slight body shifting). This behavior typically occurred when
the robot had the floor and was engaging with the baby, while the avatar looked at the robot or the baby
as a 3rd-party conversationalist.
2. Nursery Rhymes (“NR”) are linguistic stimuli such as the “BOAT-ON-WAVE” nursery rhyme in
ASL, with specific rhythmic temporal patterns at the core of all languages' linguistic phonological
structure.∗ While the ASL NR is unique to the ASL language and Deaf culture, a rough semantic
neighbor in English would be “Row-Row-Row-Your-Boat”, a simple repetitive rhythmic rhyme with
approximate versions in many languages around the world.
∗The formal linguistic notation of natural signed languages, such as ASL, uses glosses showing approximate English translations
in capital letters.
3. Social Gestures (“Social”) include universal social routines (e.g., BYE-BYE, HI), conversational
fillers (e.g., Affirmative Head Nod), and/or short lexical phrases such as YES! or THAT! (i.e., English's
“right!”).
4. 3-Way behaviors (“3-Way”) are the avatar's communicative interactions directed to both the baby
and the robot, such as “LOOK-AT-ME” (grammatically inflected in ASL to include both the baby in
the second-person role and the robot in the third-person role).
5.5.5 Design of Nursery Rhyme Behaviors
Figure 5.5: Frames of Avatar doing the BOAT Nursery Rhyme.
The four frames were selected (in order from left to right) from a fluid video clip of avatar signing where each frame
represents a silent sign-phonetic-syllabic contrastive unit as produced with the hands in the ASL visual nursery rhyme
“BOAT-ON-WAVE”. In formal linguistic analyses, these contrastive phonetic-syllabic units are notated as follows: 5.5a
/B/+low; 5.5b /B/+modulation+high; 5.5c /5/+modulation+high; 5.5d /5/+modulation+low. These phonetic-syllabic
linguistic units are not produced in isolation like a list. Instead, they are bound into fluid movements that form
rule-governed, grammatical clausal, phrasal, and syntactic constructions in all natural languages, here ASL.
Linguistic Patterns provide the vital linguistic stimuli for the baby. Nursery Rhymes were constructed
with the identical rhythmic temporal patterning that matched the infant brain’s specific neural sensitivity
to that rhythmic temporal patterning [206, 214]. All Nursery Rhymes were built with the maximally-contrasting rhythmic temporal patterning in 1.5 Hz alternations [207, 206]. Specific phonetic-syllabic
contrasts that infants first begin to perceive and produce in language development (ages 6-12 months) were
used. These include 3 maximally-contrasting phonetic hand primes in ASL: /5/, /B/, /G/ with contrastive
transitions /B/⇒/5/, /5/⇒/F/, /G/⇒/F/, plus allophonic variants. Below we provide some examples of the
Nursery Rhymes as per formal linguistic analyses for ASL, each of which had baby-appropriate lexical
meaning with its respective action-patterned sequence:
• BOAT†
(Phonetic-Syllabic units /B/, /5/ )
1. BOAT (/B/, double bounce=noun; palms in/+ low center)
2. BOAT-on-WATER (/B/+modulation, palms in/+ high center)
3. WAVE (ROLLING) (/5/+SAME modulation, palms out/+ high center)
4. WAVE (ROLLING) (/5/+SAME modulation, palms down/+ low center)
• PIG (Phonetic-Syllabic unit: /5/)
1. PIG (/5/, Chin)
2. PET (/5/, called “center space” in Linguistic sign notation)
3. HAPPY (/5/ + double-handed, Chest)
• FISH (Phonetic-Syllabic unit: /B/ (allophonic))
1. FISH (/B/, “center space”)
2. FINS (/B/+double-handed, Head)
3. SWIMS (away) (/B/, Cross-Body)
• CAT (Phonetic-Syllabic units: /5/; /G/ (allophonic); /BENT5/; /F/)
1. Grandma has red cat [/5/⇒/G/] and [/G/⇒/F/]
2. Grandma has white cat [/5/⇒/BENT5/] and [/BENT5/⇒/F/]
† Sequence of frames of this Nursery Rhyme is depicted in Figure 5.5.
5.6 Perception Modules
In order to enable an ecological interaction between the infant and the artificial agents, a perceptual
system was constructed based on thermal infrared (IR) imaging, eye tracking, and vision. We include
here a sufficient description of each of these systems such that the evaluation of the overall system can
be understood. The perceptual system is able to detect an infant’s key behaviors and physiological states,
classify them on the basis of an integrated theoretical model, and trigger robot and avatar behavior in
response to infant behavior.
5.6.1 Thermal Camera and Thermal Infrared Imaging
Thermal IR imaging allows the system to detect subtle changes in the infant’s internal state, which are
significant for discriminating when the infant is engaged with the interaction. Facial thermal patterns
depend on subcutaneous vessels transporting blood heat, and these vessels regulate blood flow via local
vascular resistance (vasodilation and vasoconstriction) and arterial pressure [10, 184]. Therefore, by
recording the dynamics of the facial cutaneous temperature, it is possible to assess autonomic nervous
system activity and infer the subject’s emotional state [185, 125, 173, 184].
We choose the nasal tip as the salient region of interest (ROI) for assessing the psychophysiological
activity of the infants, because of its strict neurovascular relationship with adrenergic activity associated
with the expression of emotional states [125, 173]. The nose tip's average temperature was extracted from
each frame, thus obtaining a temperature signal in real time. The dynamics of the temperature were used
to classify the arousal state of the infants, assuming that a decrease in temperature is linked with a
sympathetic-like response (associated with distress and disengagement) whereas an increase is due to a
parasympathetic prevalence in the subject's autonomic state (related to interest and social engagement)
[173].
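For illustration, the mapping from the nose-tip temperature signal to a coarse arousal label could be sketched as follows in Python; the window length and slope threshold are illustrative assumptions rather than the values used in the deployed classifier, which additionally distinguished sustained from non-sustained changes.

import numpy as np

def classify_arousal(nose_temps, fps=50, window_s=10.0, slope_thresh=0.002):
    """Map a recent window of nose-tip temperatures (deg C) to a coarse arousal label.

    A sustained temperature decrease is treated as a sympathetic-like response
    (distress/disengagement); a sustained increase as parasympathetic prevalence
    (interest/engagement). Window length and threshold are illustrative only.
    """
    n = int(fps * window_s)
    window = np.asarray(nose_temps[-n:], dtype=float)
    if len(window) < n:
        return "None"                       # not enough reliable data yet
    t = np.arange(len(window)) / fps
    slope = np.polyfit(t, window, 1)[0]     # deg C per second, least-squares fit
    if slope <= -slope_thresh:
        return "negative"                   # sympathetic-like (temperature falling)
    if slope >= slope_thresh:
        return "positive"                   # parasympathetic-like (temperature rising)
    return "None"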
Thermal IR imaging is performed by means of a digital IR thermal camera FLIR A655sc (640 x 480
microbolometer FPA, NETD: < 30 mK @ 30 °C, sampling rate: 50 Hz). In order to preserve the ecology of
the recording in challenging situations like experiments with infants and toddlers, the facial ROIs need
to be automatically recognized and tracked in all the frames of the thermal video. The original solution
developed includes three different processes: i) automatic recognition of facial landmarks in the visible
domain using the Open Face library [9]; ii) frame-to-frame tracking of the ROIs in the visible domain by
referencing them with respect to the facial landmarks; iii) co-registration of the visible ROIs with their
corresponding ROIs in the thermal videos.
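Step (iii) can be approximated, for illustration, by a planar homography estimated once from a few calibration correspondences visible to both cameras; the sketch below (using OpenCV) is a simplification of the actual co-registration and assumes the nose-tip landmark is already provided by the visible-light face tracker and that the thermal frames are radiometric (pixel values in degrees Celsius).

import cv2
import numpy as np

def build_visible_to_thermal_map(vis_pts, thermal_pts):
    """Estimate a homography from calibration correspondences (visible -> thermal).

    vis_pts / thermal_pts: N x 2 arrays of matching points (N >= 4) seen by both cameras.
    A homography is only exact for a planar scene, so this is an approximation.
    """
    H, _ = cv2.findHomography(np.float32(vis_pts), np.float32(thermal_pts), cv2.RANSAC)
    return H

def nose_tip_roi_in_thermal(nose_tip_xy, H, half_size=5):
    """Project the nose-tip landmark (from the visible-light tracker) into
    thermal-image coordinates and return a small square ROI around it."""
    pt = np.float32([[nose_tip_xy]])                 # shape (1, 1, 2) as cv2 expects
    tx, ty = cv2.perspectiveTransform(pt, H)[0, 0]
    return (int(tx) - half_size, int(ty) - half_size,
            2 * half_size + 1, 2 * half_size + 1)    # (x, y, w, h)

def mean_roi_temperature(thermal_frame, roi):
    """Average temperature over the ROI of a radiometric thermal frame (deg C)."""
    x, y, w, h = roi
    return float(np.mean(thermal_frame[y:y + h, x:x + w]))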
5.6.2 Eye-Tracker
Infant gaze was also used as an assessment measure for triggering behavioral responses from the robot and
avatar. A Tobii Pro X3-120 eye tracker measured the eye gaze and duration as an indicator of an infant’s
focus of interest. Eye movement data were collected every 8.3 ms (for an effective sampling rate of 120 Hz),
and identified by a time stamp and (x,y) coordinates. The eye tracker was controlled by a customized Python
script that tracked the position of the infant's gaze within the experimental setting. Four different areas of
interest (AOIs) were defined: Robot, Avatar, In-Between and Outside. AOI coordinates were defined in relation
to the infant's point of view, as shown in Figure 5.6. We took into account the AOIs as well as fixation on
the target as indicators of the baby's focus of interest. We performed a majority vote every half second
(60 data points) to determine the current area of interest. A calibration process was done at the beginning
of the experiment to adapt the program to the physical setup and the relative coordinates of the baby's eyes,
the targets and the tracker.
Figure 5.6: AOI regions from infant’s perspective.
5.6.3 Baby Behavior
It is crucial to have a visual perception component to capture the communicative and social behaviors of
the infant, such as hand clapping, pointing, reaching, etc. Current tracking systems such as Kinect [292]
use models that are trained on adult anatomy and do not work properly on infants due to fundamental
differences in posture and in the relative proportions of body parts. Furthermore, the baby is sitting on a
parent's lap, which introduces additional complications for tracking systems that are mostly trained on
full-body postures. To address these issues, we collect Kinect data for future analysis, with the hope of
eventually collecting enough data to customize models to our specific needs with respect to the experimental
setup. As a first step, we adopted an interface, shown in Figure 5.7, that is used by an expert observer to
indicate relevant infant behaviors to the rest of the system in real time.
Figure 5.7: Observer interface for baby's behaviors. Buttons on the interface include, among others: babbling, protowords, words, crying/fussing, protosigns, signs, manual babbling, pointing, waving, flapping arms/hands, rhythmic hand activity, universal gestures (“hug”, “no”, “yes”, “give me”), peekaboo, hello, goodbye, kiss, clapping hands, social referencing, copying avatar, copying robot, reaching, object in hand, attention focus, and vegetative manual/body actions, grouped into categories such as vocalization, social communicative gestures, social routines, social manual actions, and imitation/mirroring.
5.7 System Architecture
Controllers for the hardware components were running on three separate machines, using multiple programming languages. We used a publisher-subscriber model [223] to facilitate communication between
the components where ActiveMQ [2] was used as a message passing server. As shown in Figure 5.8, the
perceptual components (eye-tracker, thermal imaging and behavior recognizer) send their messages to the
server (publishers). These messages are subscribed to by the dialogue controller to update the information
state and send messages to the Robot and Avatar (subscribers), directing them to perform communicative
behaviors.
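For illustration, the publish-subscribe exchange could be sketched in Python with the third-party stomp.py client (ActiveMQ exposes STOMP, commonly on port 61613); the topic name and message payload below are illustrative, not the actual message protocol used by the system's controllers.

import json
import time
import stomp  # third-party STOMP client; broker must have a STOMP connector enabled

TOPIC = "/topic/rave.perception"   # illustrative topic name

class PerceptionListener(stomp.ConnectionListener):
    """Dialogue-manager side: consume perception messages as they arrive."""
    def on_message(self, frame):             # recent stomp.py versions pass a single Frame
        event = json.loads(frame.body)
        print("update information state with:", event)

def connect(host="localhost", port=61613):
    conn = stomp.Connection([(host, port)])
    conn.connect(wait=True)                  # credentials omitted; assumes an open broker
    return conn

if __name__ == "__main__":
    conn = connect()
    conn.set_listener("", PerceptionListener())
    conn.subscribe(destination=TOPIC, id=1, ack="auto")
    # Publisher side (e.g., the eye-tracker script) sends a JSON payload:
    conn.send(destination=TOPIC, body=json.dumps({"source": "eye-tracker", "aoi": "Avatar"}))
    time.sleep(1)                            # give the listener a moment to receive it
    conn.disconnect()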
5.8 Dialogue Management
The dialogue manager has three main goals:
1. To engage the baby by participating in interactive dialogues.
2. To maintain the engagement (sustained engagement).
3. To promote engagement-rendered responses from the baby.

Figure 5.8: Logical overview of system components

Agent   Category                        Behaviors
Avatar  Conversational Fillers          Nod; Gaze forward/right/left; Head Shake; Contemplate; Think; Toss
Avatar  Social Behaviors                Wave; Hello; Peekaboo; Go Away / Come Back
Avatar  Question Solicitation           What?; What's Wrong?; What's That?; Ready? (To Robot/Baby)
Avatar  Linguistic Patterns             Good Morning; Look at Me! (To Robot/Baby); Boat; Pig; Fish; Cat
Robot   Fillers and Social Behaviors    Nod; Hide/Unhide; Gaze Forward/Right/Left; Startle; Blink; Sleep; Wake Up
Table 5.1: Robot and Avatar Primitive Behaviors
It uses input signals from the perception modules to update its information state [271] and choose new
actions. Here, we give more details on the multimodal perception signals, the different output behaviors for
the agents, and finally the interaction protocol.
5.8.1 Input signals
These are the input signals received from the perceptual components and the agents, as well as internal
signals from the dialogue manager itself; a minimal sketch of how they might be represented together is
given after the list below. Note that in our design, we have no direct perceptual monitoring of the parent,
and no Avatar/Robot actions are contingent directly on the parent.
• Area of Interest (AOI) is the signal received from the eye-tracker component with discrete values
for 4 different areas of the baby’s eye gaze: Robot, Avatar, Between and Outside.
• Readiness-To-Learn is the signal received from the Thermal Imaging system with 5 discrete values:
very negative (sustained decrease in attention, sympathetic), negative (non-sustained decrease in
attention, sympathetic), very positive (sustained increase in attention, parasympathetic), positive
(non-sustained increase in attention, parasympathetic), and None, indicating that no reliable signal
was detected from the baby [44].
• Baby-Behavior (BB) is the signal received from the human observer interface about the baby’s
social and communicative behaviors. As seen in Figure 5.7 the input signals are classified into several
categories such as vocalization, social communicative gestures, social routines and social manual
actions. There are a total of 23 distinct states for this variable.
• Component State Signals come from the Avatar and Robot indicating their state of action: when a
requested behavior has started, ended or if there were any errors or exceptions during the execution
of a specific behavior.
• Timing Signals are initiated from the dialogue manager itself. The DM tracks when events of
different types have happened and sets up automatic signals that can change behaviors, e.g. if nothing
interesting has happened recently.
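As a rough illustration of how these signals might be collected into a single typed information state, consider the sketch below; the field names, enumeration values, and update logic are illustrative only and are not the actual implementation.

from dataclasses import dataclass, field
from enum import Enum
from typing import Optional
import time

class AOI(Enum):
    ROBOT = "Robot"
    AVATAR = "Avatar"
    BETWEEN = "Between"
    OUTSIDE = "Outside"

class Readiness(Enum):
    VERY_NEGATIVE = "very negative"
    NEGATIVE = "negative"
    POSITIVE = "positive"
    VERY_POSITIVE = "very positive"
    NONE = "none"

@dataclass
class InformationState:
    aoi: AOI = AOI.OUTSIDE
    readiness: Readiness = Readiness.NONE
    baby_behavior: Optional[str] = None      # one of the 23 observer-interface labels
    agent_busy: dict = field(default_factory=lambda: {"robot": False, "avatar": False})
    last_event_time: float = field(default_factory=time.time)

    def update(self, source: str, value) -> None:
        """Fold a perception or component-state message into the state."""
        if source == "eye-tracker":
            self.aoi = AOI(value)
        elif source == "thermal":
            self.readiness = Readiness(value)
        elif source == "observer":
            self.baby_behavior = value
        elif source in ("robot", "avatar"):
            self.agent_busy[source] = (value == "started")
        self.last_event_time = time.time()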
5.8.2 Output commands
There are two different control levels of actions for the agents, described below: Primitive Behaviors and
Action Sequences.
5.8.2.1 Primitive Behaviors
These are defined as atomic actions of the agents: single behaviors that cannot be interrupted, such as
nodding or a single nursery rhyme. Table 5.1 shows the list of primitive behaviors for the Avatar and
Robot.
The virtual human’s different language samples comprised different conversational/communicative
social functions that are commonly used in Infant-Adult conversations. These functions include nursery
rhymes, social routines, questions, conversational fillers, soothing responses, social affirmations and
negation, solicitations and conversationally neutral idling. These were grouped as follows:
1. Conversational Fillers and Social Behaviors. Conversational Fillers are short lexical items or
phrases that assure the addressee that the conversational partner is attending and still “in” the
conversation. They are like social punctuation, e.g., YES! or THAT!, which are full lexical items in
ASL. Social behaviors (or social routines) are standard gestures that are widely used with infants,
such as PEEKABOO.
2. Question Solicitations such as the ASL signs WHAT? or WHAT'S THAT? are used when the infant is
in a sympathetic state.
3. Linguistic Patterns provide the vital linguistic stimuli for the baby. All Nursery Rhymes were
constructed with the identical rhythmic temporal patterning that matched the infant brain's specific
neural sensitivity to that patterning [206, 214]: maximally-contrasting rhythmic temporal patterning
in 1.5 Hz alternations [207, 206]. Inside this temporal envelope were specific phonetic-syllabic
contrasts, including 3 maximally-contrasting phonetic hand primes in ASL that human infants first
begin to perceive and produce in language development: /5/, /B/, /G/ with contrastive transitions
/B/⇒/5/, /5/⇒/F/, /G/⇒/F/, plus allophonic variants. The Nursery Rhyme patterns were produced
such that each had baby-appropriate lexical meanings. Below we provide an example of one of the
four Nursery Rhymes; each was designed with the same canonical structure. The other nursery
rhymes are described in section 5.5.5.
BOAT (Phonetic-Syllabic units /B/, /5/ )
(a) BOAT (/B/, palm in)
(b) BOAT-on-WATER (/B/+ movement modulation, palm up)
(c) WAVE (/5/+same movement modulation, palm down)
5.8.2.2 Action Sequences
We define an action sequence as a plan for a timed sequence of primitive actions by the agents that will be
executed in order as planned. An example of an action sequence is the triad social greetings between the
agents: (1) Avatar turns toward the Robot. (2) Robot turns toward the Avatar. (3) Avatar and Robot both
nod to each other. (4) Avatar signs LOOK-AT-ME to both baby and Robot. (5) Avatar signs READY? to both
baby and Robot. (6) Avatar turns back and looks at baby.
Another example is the familiarization sequence which is executed at the beginning of the experiment
and will be described in detail in section 5.8.3.
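For illustration, the triadic greeting above could be represented as an ordered plan of primitive actions roughly as follows; the behavior names and callback signatures are hypothetical, not the system's actual command set.

from typing import NamedTuple, List

class Step(NamedTuple):
    agent: str                    # "avatar" or "robot"
    behavior: str                 # a primitive behavior (cf. Table 5.1)
    wait_for_end: bool = True     # block until the agent reports the behavior ended

# The triadic greeting described above, as an ordered plan of primitive actions.
GREETING_SEQUENCE: List[Step] = [
    Step("avatar", "gaze_robot"),
    Step("robot", "gaze_avatar"),
    Step("avatar", "nod", wait_for_end=False),   # both agents nod together
    Step("robot", "nod"),
    Step("avatar", "look_at_me"),                # LOOK-AT-ME to baby and robot
    Step("avatar", "ready"),                     # READY? to baby and robot
    Step("avatar", "gaze_baby"),
]

def execute(sequence: List[Step], send_command, wait_until_done):
    """Run a planned sequence step by step; send_command and wait_until_done are
    callbacks into the messaging layer (hypothetical signatures)."""
    for step in sequence:
        send_command(step.agent, step.behavior)
        if step.wait_for_end:
            wait_until_done(step.agent, step.behavior)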
5.8.3 Interaction Protocol
The interaction design follows the Information State approach to dialogue management [271], in which a
set of typed information variables are monitored and updated and a policy is used to decide on behaviors
given the current values. Unlike most interactive dialogue systems, the states do not refer to aspects of
natural language dialogue context, but rather to the perceptually informed status of the infant.

Figure 5.9: Summarized decision tree based on system variables
At each point in the 3-way robot-avatar-baby interaction, the system has a sequence of actions as the
current plan for the agents to execute. These are designed with the assumption that the baby will behave
accordingly, but if the baby acts differently, the planned actions may be removed or replaced by a completely
different plan, in order to maintain a socially contingent interaction. The only signal that causes
completely different plan, in order to maintain a socially contingent interaction. The only signal that causes
an interruption in the execution of currently planned actions is the input signal coming from the baby
behavior interface. In this case, the policy overrides the current plan with a new plan according to the new
state of the baby.
Each set of input combinations leads to a sequence of actions from the Avatar and Robot. Theoretically
speaking, considering only the 3 input variables coming from the perceptual components, we are looking
at an information state space of 4 ∗ 5 ∗ 23 = 460 possible combinations. However, not every combination
is possible or likely to occur; nevertheless, to build a completely reliable system, all combinations should be
considered. We used a rule-based policy that triggers specific sequences of behaviors based on predefined
combinations of variables. Figure 5.9 shows a highly abstract decision tree used as part of the
policy in which many branches are aggregated with each other. Each branch consists of more fine-grained
branches based on different input values for the baby-behavior, fixation of gaze, former executed plans and
other state variables.
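A toy fragment of such a rule-based policy, written against the information-state sketch from section 5.8.1, might look as follows; the specific rules and behavior names are illustrative only and cover far fewer cases than the deployed policy.

def choose_plan(state):
    """Return a new plan as a list of (agent, behavior) pairs, or [] to keep the
    current plan. Uses the AOI and Readiness enumerations from the earlier sketch."""
    # Baby-behavior signals take priority and can override the current plan.
    if state.baby_behavior in ("crying", "fussing"):
        return [("avatar", "whats_wrong")]
    if state.baby_behavior == "copying_avatar":
        return [("avatar", "nod"), ("avatar", "nursery_rhyme_boat")]

    if state.aoi == AOI.AVATAR and state.readiness in (Readiness.POSITIVE,
                                                       Readiness.VERY_POSITIVE):
        return [("avatar", "look_at_me"), ("avatar", "nursery_rhyme_boat")]
    if state.aoi == AOI.ROBOT:
        # Use the robot to redirect attention toward the avatar.
        return [("robot", "nod"), ("robot", "gaze_avatar")]
    if state.aoi == AOI.OUTSIDE:
        return [("avatar", "wave"), ("robot", "startle")]
    return []   # no change: keep executing the current plan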
In order to make the baby familiar and comfortable with its surroundings, we begin the experiment
with a familiarization episode. The goal of this period is to introduce the agents as conversational partners
and make the baby feel involved in this multiparty interaction. A trace of what happens in the
familiarization episode is as follows. At the start of the experiment, both the Avatar and the Robot are in
their idle and neutral forms: the Robot's head is down with its eyes closed and the Avatar is standing still,
looking forward. The Robot wakes up from its sleeping position, sees the baby and nods as an indication that it has acknowledged
the baby's presence, and then turns to the Avatar. The Avatar looks at the robot, acknowledges it by nodding,
and then turns to the baby. The Avatar then tries to gain the baby's attention by waving. The Avatar signs
GOOD MORNING toward both the Baby and the Robot to begin the interaction.
Table 5.2 shows a sequence of snapshots drawn from one interaction with a baby, along with different
state variable values and an explanation of the state of the system at each point. This triadic interaction
between Avatar, Robot and Baby consists of the agents greeting each other (as participants in the conversation)
and then the Avatar taking the floor and signing to the baby. The Robot nods to the Avatar occasionally to
establish its role as a 3rd-party conversationalist. We call this sequence a “Nursery Rhyme episode”.

No.  AOI      Thermal  System Description
1    Between  +        Baby is focusing and paying attention to the system, looking somewhere in between the avatar and the robot. The goal of the system is to shift his gaze toward the Avatar.
2    Avatar   ++       Avatar tries to gain attention by signing LOOK-AT-ME. The thermal signal indicates that the baby is in an engaged prosocial (parasympathetic) state (or “ready to learn”), so the system transitions to a nursery rhyme episode.
3    Robot    ++       The agents nod after turning to each other. The goal is for the Avatar to acknowledge the robot as a 3rd-party conversationalist in the interaction before she takes the floor and begins signing.
4    Avatar   ++       Avatar begins a Nursery Rhyme. Robot turns to her in the middle and nods toward her. Baby is copying the Avatar and producing signs/proto-signs in response to the Avatar's linguistic input.
5    Neither  None     Baby turns to look at his mom, exhibiting the classic social referencing behavior. Avatar signs ATTEND-TO-ME at the baby and tries to regain his attention.
Table 5.2: A sequence of snapshots drawn from a sample experiment showing different stages of the interaction. (Participant is a hearing male with no sign exposure, aged 12 months + 1 day.)
5.9 Experiment Protocol
To address the questions about the impact of the avatar behaviors on babies, we designed an experiment
in which babies interacted with the system in a controlled setting. While previous investigations were
conducted with over 40 babies focusing on the RAVE system's functionality [240, 195], the present study
provides a first-time evaluation focusing specifically on the babies and the relationship between their
behaviors and the avatar’s behaviors.
Four babies (ages approximately 6-13 months) participated in an intensive case study with a focus on the
nature of the babies' behaviors relative to the avatar's behaviors. One of the babies had been exposed to
ASL (sign-exposed) and three had no previous exposure to a signed language (non-sign-exposed). Given our
hypothesis regarding the universal nature of the rhythmic temporal patterning underlying human language
phonological organization, which was specifically built into the avatar’s linguistic stimuli, a key design
feature of the present study was that non-sign-exposed babies would constitute a powerful test of this
hypothesis. While it would have been ideal to have a larger sample size, it has been well established among
scientists in this field (those studying signing/non-signing deaf/hearing children) that the vulnerability and
rarity of this population renders traditional sample sizes unrealistic. Because of their theoretical power,
smaller sample sizes have routinely appeared in prominent publications involving the rare sign-exposed
infants (e.g., [207]; T=3 sign-exposed infants).

Figure 5.10: Experimental Setup (Side View)
Babies were seated on their parent’s lap facing the system (Figure 5.10). Multiple cameras were used
to record the baby (and the parent) from different angles. Each baby’s experimental session lasted until
the baby became distracted or entered a fussy state, in which case we immediately ceased the session. The
experiment consisted of several steps: upon arrival, the baby and the parent were greeted and introduced
to (and greeted by) the robot and then the avatar; this introduction period has been shown to be useful by
Meltzoff et al. [180]. Next came a calibration process (a technical requirement of the thermal IR imaging and
Tobii eye-tracking systems), followed by the interaction session.
5.9.1 Greeting
Babies came with their parent to the experiment room. To get the baby comfortable with the environment
and familiar with the agents, a trained assistant fluent in signed language greeted the baby and the parent
upon arrival. The assistant then introduced the robot to the baby in a greeting manner, allowing the baby
to touch the robot briefly; this introduction period has been shown to be useful by Meltzoff et al. [180].
After this greeting session, the parent sat on the chair with the baby on the parent's lap.
5.9.2 Calibration
A calibration process was necessary to align the eye-tracker software with the relative coordinates of the
baby's eyes, the targets, and the tracker. The conventional calibration method (in which the tracker is
placed on or below a monitor) involves having the subject look at multiple points on the monitor in a
specific order. Our experimental setup was quite different, however, due to the presence of both the avatar
and the robot as possible targets, with the eye tracker between them, and because our subject was a very
young infant for whom looking at specific points in a determined order and in a timely manner is a
demanding task. To accommodate our specific experimental circumstances, we considered a fictitious
monitor as an active display area in front of the baby which contained both the avatar and the robot
(Figure 5.11). The red stars in the figure represent our two calibration points.
Figure 5.11: Active display area from the infant’s view and the calibration points
We also made some adjustments to the eye tracker's default calibration process. The process was done
with the help of the assistant, who held a magic wand with colorful lights at the end (Figure 5.12) that was
used to attract the baby's attention to specific points (here, the targets). The assistant pointed the wand at
a calibration point to make the infant track the light and look at that target, at which point the assistant
pressed a button on a remote controller to let the program capture the eye gaze at that specific moment.
During the calibration process, the robot and the avatar were in their idle forms: the robot was sleeping
with its head down and the avatar was standing still, looking forward.
Figure 5.12: Assistant using the magic wand to gain the baby’s attention for calibration process
5.9.3 Familiarization
To make the baby feel comfortable and involved in this multiparty socially contingent interaction, and to
introduce the agents as conversational partners, we begin the experiment with a familiarization episode
between the baby, robot and avatar. This process is done with the help of an experimental assistant who
interacts with the agents. At the beginning, the assistant talks and signs to the robot to wake it up. The
robot wakes up, lifts its head, blinks, sees the baby and nods as an acknowledgment of the baby's presence.
It then turns toward the avatar. The avatar sees the robot, turns to it, nods, then turns back and waves to
the baby. The avatar takes the floor and signs HELLO and GOOD MORNING to the baby to begin the
interaction. At this point the assistant signs GOODBYE to the baby and then to the agents, and departs
from the experiment room, leaving the baby to interact with the system.
5.9.4 Interaction Session
The avatar’s socially contingent interaction session with the baby began after the assistant left the room
(Condition 1). At approximately 2.5 minutes into the experimental session, parents were permitted to
interact as per their natural inclination (Condition 2). Throughout the interaction, parents wore sunglasses,
an intentional design feature meant to prevent the system from recording eye-tracking artifacts from the
parent's eyes. To investigate one of our research questions, concerning the role of the parent in the
interaction, the session was divided into two parts. In the beginning of the session (Condition 1), the baby
sat on the parent's lap, with the parent staying still and not participating or
interacting at all. In condition 2, parents were asked to act as they normally do, and to participate in
the interaction, acknowledging and responding to the baby’s social referencing. Note that none of the
perceptual components were monitoring the parent, so none of the agent’s behaviors were contingent
directly on the parent [195].
We define 5 types of interaction partnerships in the study:
1. Avatar-Robot interaction mostly consists of robot looking at the avatar and nodding occasionally
to acknowledge her when the Avatar has the floor and is signing.
2. Avatar-Baby interaction is when the Avatar is behaving according to the baby's social responses,
mostly trying to engage the baby and gain his/her attention.
3. Robot-Baby interaction is mainly the robot trying to engage the baby by nodding/startling and
eventually shifting the baby's attention toward the Avatar by turning to her.
4. Avatar-Robot-Baby is one of the vital triad interactions. In these triadic interactions, the three
interlocutors are present and can actively participate in the interaction, with the avatar and baby
presenting the possibility for real language productions and the robot rudimentary communication.
5. Avatar-Baby-Parent is another important triadic interaction, in which all three interlocutors are
present, can actively participate, and can communicate with one another responsively.
5.10 Research Questions and Evaluation Metrics
We can operationalize the main research questions raised in section 5.1. Regarding the first question (do
babies attend to the avatar and respond to its communicative behaviors?), one possibility is that babies do
not see the avatar, or the agents collectively, as interesting social interlocutors, or do not respond to them
at all. Another possible outcome is that the infants may enter an agitated mode upon confronting an
unknown (or “strange”) situation such as the RAVE system [106]. We use the percentage of the babies'
responses to the avatar as a metric to evaluate the overall system's impact and performance in terms of
engaging the babies.
The second question asked whether babies can differentiate among the different avatar conversational
modes even though it’s unlikely that these young babies understand the semantic content of the ASL
language productions (i.e., vocabulary meanings, syntax, etc.). If so, this would corroborate the now-classic
studies in infant language processing that demonstrate infants' ability to discriminate categorically among
classes of linguistic units (such as phonetic-syllabic units) in different languages based on their contrastive
patterning (peaked between ages 6-12 months; [213, 207, 17]). Here, we examined the baby’s response rate
to the avatar’s different conversational modes.
The third question we asked is of particular scientific interest concerning the mechanisms that drive
early language learning: do the avatar's specifically linguistic productions garner the baby's attention,
and in particular, do they garner linguistic responses from the babies? We hypothesize that it is the
linguistic patterning that is important in the avatar's productions, not its modality of language production
and reception (here, signed; [213, 206]). Specifically, we claim that if we are correctly hitting on just the
right temporal patterning in the avatar's productions, then all babies would be engaged by the avatar's
language productions over other social and communicative conversational modes, indeed even babies who
were never exposed to a signed language. We hypothesized that they would
react with more linguistic content when the avatar was in this category, as compared to when the avatar
was in its other conversational roles.
Finally, the fourth question concerns whether having the parent intervene in the conversational
interaction is beneficial in terms of facilitating the system’s overall language learning goals, or would
it have an adverse effect? Perhaps babies would feel more comfortable when they find themselves in a
familiar and natural situation in which their parent is part of the interaction and acknowledges their social
referencing, rather than standing still and not reacting to any of their behaviors (which is definitely not
routine for parents). On the other hand, the intervention from the parent might be distracting for the
baby and steal the attention from the avatar; as a result, babies may turn to parents for interaction instead
of engaging with the system. The first metric to assess this is the overall response rate across conditions.
Furthermore, studying the distribution of baby responses across conditions would give us detailed insight
into the parent's impact on this social interaction.
5.11 Evaluation
The development of this system took place over three years. We tested the system, or parts of the system,
with infants in three pilot sessions designed to evaluate the overall efficacy of the system, as well as in
additional experiments focused on specific scientific investigations that are not reported here. Overall, we
have seen more than 68 infants with some variant of this system for this and related studies, but report
here on three case studies drawn from a total of 36 infants recorded during our system integration and
evaluation sessions. The design was revised incrementally after each integration session before the next
testing phase. Our design choices were focused on ensuring that the system was usable by deaf infants; no
audio cues were used to generate or direct attention or to create engagement. Part of our evaluation tested
the system with deaf infants to ensure that incidental audio (motor noise) was not a factor. We also at times
evaluated the system with hearing infants (with both signing and non-signing parents), as we believe that
this system can potentially benefit a more general population of infants. Qualitative analysis in the form of
case studies was done on the first two versions of the system. A more in-depth quantitative evaluation was
done on the final version of the system.
5.11.1 Qualitative Evaluation
Preliminary evaluation was done on different versions of the system. Here, we report on 3 case studies,
drawn from different points in time, that demonstrate how the system evolved. These 3 case studies were
done on the first two versions of the system. We then present a deeper qualitative analysis of the final
version.
5.11.1.1 Case Study 1: Free-play, Wizard-of-Oz
The first case study was drawn from a sample of infants recorded during August of 2016. Following a
controlled experimental protocol that focused on the ability of the robot to guide infant attention to the
avatar’s screen, a free play session was introduced. The experimental protocol used pre-scripted behavioral
sequences from the robot and virtual agent. Our goal with the free play session was to understand if infants
found the system to be engaging when both the robot and the avatar performed in socially contingent
manners towards the infant and also towards each other, even though the system could not yet achieve this
result autonomously.
Infant “Albert” (Figure 5.13), a deaf sign-exposed male, aged 13 months, was the first deaf infant to
interact with this system. During the free play session, we controlled the robot and avatar using a “Wizard
of Oz”-based open-loop controller. The behaviors for the robot and the avatar were operated by two human
“wizards,” and a third human “conductor” coordinated the timing and execution of the robot and the avatar’s
behaviors together. It was the conductor that created the illusion of social contingency based on the infant’s
direction of attention and emotional state. The conductor was responsible for providing additional brief
periods of rest if the infant became overstimulated or agitated during the free play session, and ultimately
controlled the duration of the free play session and its conclusion.
Albert was very engaged with both the robot and the avatar. The longest period of uninterrupted
engagement (based on gaze toward robot or avatar) lasted roughly one minute. In between periods of
engagement with the system, Albert turned to his mother (deaf) in several social referencing episodes; he
tapped his mother's hand and signed “mother” at one point to direct her attention to the avatar and/or
the robot. He never showed signs of distress during the procedure. During the engagement period, Albert
pointed at the robot on two separate occasions.
Our experience with Albert demonstrated that the language generated by the avatar was effective for
children at the upper limit of our age range who already had exposure to sign language. We also saw
successful engagement by both robot and avatar, successful direction of attention by the robot to the avatar,
and attempts by Albert to share the robot with his mother (via declarative pointing). This interaction
demonstrated to our team that this system could be successful even with older infants (typically known for
showing briefer tolerance in controlled experimental paradigms) and that we may continue to expect only
short interactions. This also demonstrated that the inadvertent noises made by the robot (motor noise)
were not critical to the ability of the system to engage an infant, and supported the use of design guidelines
drawn from hearing infants. Based on this case, and others obtained during this same testing sample, we
opted to begin to incorporate autonomous responses and eliminate our use of the wizards.

Figure 5.13: Case study 1 infant “Albert” pointing. Here the infant is seated on his mother's lap, as seen
from the system's point of view.
5.11.1.2 Case Study 2: Freeplay, introducing limited autonomy and perception
Moving toward version 2 of the system, we had integrated limited parts of the perception system to allow
for an interaction guided by perception, but still triggered by a hidden human operator. This iteration of
the system used arousal and gaze information collected from a thermal camera and eye tracking camera,
respectively, to trigger specific behaviors in the avatar and robot. The behavioral repertoire of the robot and
avatar were relatively simple. The robot had one behavior designed to engage the infant (a “peek-a-boo”
style engagement in which the robot showed a startle response when the infant looked at it) and one
behavior designed to direct attention toward the avatar. The avatar would engage in sign-language nursery
rhymes following the robot gaze-direction behavior and would be quietly idle at other times.
Infant “Bella” (Figure 5.14), a hearing, sign-exposed female aged 8 months, was among the first infants
to interact with this autonomous system. The entire session lasted approximately 4 minutes and 20
seconds before Bella completely disengaged. The longest period of uninterrupted engagement lasted
roughly three minutes, three times longer than the most engaged infant from the freeplay sessions. While
Bella visually tracked both the robot and the avatar, she displayed a strong preference for the robot and
produced numerous robot-directed pointing gestures. Remarkably, at one point she appeared to copy the
robot’s ’startle’ behavior – pitching her head down with closed eyes, followed by rapid upward pitch while
concurrently opening eyes in a startle response – directly after it had been produced by the robot.
This interaction was highly successful in many ways; the semi-autonomous system enabled even longer
engagement and interaction than we had been able to achieve with a strictly human-controlled system,
the coordination between the perception system, robot, and avatar allowed for smooth transitions of the
infant’s attention from one target to the other, and we even saw an example of robot-to-infant copying.
However, this interaction also pointed out two significant deficits of our system. First, there were numerous
attempts by Bella to engage the system in ways that we had not anticipated, and the system therefore failed
to respond to them. Bella at times clapped her hands together in excitement to try to generate some kind of
response or engagement with the robot, but it failed to respond. Her copying of the robot’s startle behavior
was also a missed opportunity for our system. Second, when Bella was not engaged with the system, both
the avatar and the robot were still and passive. This resulted in a break in the illusion of agency for both
agents, as both appeared less social while they sat still and unmoving.
Figure 5.14: Composite image of case study 2 infant “Bella”. Here the infant can be seen from multiple
angles (bottom left and right) interacting with the autonomous system (top)
5.11.1.3 Case Study 3: Enhanced behavior and perception with tetradic interaction
To address the deficits that we saw in the interactions with Bella, we made substantial improvements to
both the interaction design (to account for these “still” periods) and to the perception system. During
periods where the infant is distracted or otherwise not attending to the system, we trigger interactions
between the robot and the virtual human. This maintains the animacy of both agents while providing social
interaction exemplars to the infant that might serve as points of engagement.
Infant “Celia” (Figure 5.15), a hearing, sign-exposed female, aged 11 months, was one of a cohort of 23
pilot-study infants that interacted with this updated system during June, 2017. For the first time, we also
allowed the parent to participate in the infant-robot-avatar interaction halfway through the session in order
to observe how this system might be used in vivo. The session lasted approximately five and a half minutes,
with the longest period of uninterrupted engagement lasting roughly a minute and a half. Celia visually
tracked both the robot and the avatar, and made a number of robot-directed manual productions including
pointing gestures and attempts to produce parts of what the virtual agent had signed (as interpreted by
sign-fluent observers). During the parent-engaged interactive phase of the session, Celia's engagement appeared
to increase, producing many instances of socially communicative gestures, including hand-clapping, waving
and pointing. Celia was not unique in reproducing the virtual human's signing, and this was observed
even in some non-sign-exposed infants. While Celia did not attempt to copy the robot's behavior, other
members of the cohort did.

Figure 5.15: Composite image of case study 3 infant “Celia”. Here the infant (right and bottom left) can be
seen producing a waving-arm gesture towards the system (top left).
Perhaps the most interesting thing that we observed with Celia was her reproduction of the virtual
avatar's signs. This attempt was notable not only because it occurred in an interaction that lasted only a
few minutes, but also because it was directed not back at the virtual agent but at the (non-signing) robot.
5.11.1.4 Comprehensive study on 8 infants
The final dialogue manager routines described in the previous section were tested in interaction with 8
babies: 2 females and 6 males with an average age of 9 months and 20 days (range 7-13 months), of whom
5 were hearing participants who were not sign-exposed, 2 were hearing and sign-exposed, and 1 was deaf
and sign-exposed. The experimental protocol included the calibration and introduction routine with an
experimenter described in section 5.9.
We first asked the important scientific question as to whether RAVE can engage the infants’ attention.
Perhaps these young babies would not see the agents as interesting social interlocutors as predicted, but
as possibly boring objects, or worse, a source of agitation. When confronted with unknown humans and/or
situations of novelty, babies at this age are prone to stranger anxiety [106]. Given that our 8 babies (age
range 7-13 months) were within the onset period of “stranger anxiety” (onset range 6-12 months), crying
and fussing could have occurred. Thus, upon the babies’ first contact with the Robot-Avatar system, the
babies could have been interested in the agents, but they also could have fussed, become distracted, etc.,
at which point the interactional session would have been immediately ceased. None of these distracted or
fussy behaviors was observed; instead, all 8 of the tested babies exhibited positive engagement behaviors,
including: (1) Immediate visual engagement (locked attention) with the agents, (2) Sustained attention
(persisting over time) and (3) Visually tracked (gaze following) the Avatar and Robot as they interacted
with each other and the baby.
All 8 babies exhibited sustained engagement with RAVE lasting up to nearly 5 minutes (average 3m40s;
range 1m33s-4m56s). This is an atypical attention engagement window for very young infants. There was
only one baby with a low engagement time (1m33s; more than 2 standard deviations from the mean), but
this baby entered the room very fussy. Although fussy on entering, she changed to riveted attention upon
sight of the agents, and then slipped into a fussy state again, at which point we terminated the session. If
we remove this outlier, the average sustained engagement time for the remaining babies is nearly 4 minutes
(3m58s). Interestingly, this baby was an outlier for another reason. Her age (13 months, 16 days) fell
outside of our predicted window of peak infant engagement for RAVE (age range 6-12 months). Her
performance thus provided preliminary support for our prediction that RAVE is most optimal for babies
within the developmental period when they have peak sensitivity to the rhythmic patterning of language,
ages 6-12 months. The fact that the presence of the agents impacted all babies' preexisting emotional and/or
attentional states for such durations is in itself remarkable, and invites us to understand why this was so.
Even the 2 babies who came into the experiment fussy were able to engage in sustained attention and
participate for over 2 minutes (2m13s, 3m25s). They changed to riveted attention upon sight of the agents,
and then slipped into a fussy state again, at which point we terminated the session. That the presence of
the agents halted these babies' preexisting emotional agitation for several minutes further underscores this
point.
We observed such sustained engagement in all babies, even hearing babies with no prior exposure
to signed language, meaning that something about the avatar’s productions was engaging to the babies
even though they could not understand the meanings of the signs, with interesting group differences. For
example, we found that our one baby with early bilingual language exposure (i.e., early ASL and early
English exposure) had the greatest combined positive impact on its engagement span (longest experiment
run time of 4m56s). This finding corroborates our earlier studies showing significant processing advantages
afforded to babies and children with early bilingual language exposure [213].
The second condition, where parents were permitted to join in as they would naturally, allows for
baby’s social referencing to be acknowledged and responded to. In fact, we also observed instances where
(nonsigning) parents copied the Avatar’s signs and encouraged the baby to react and interact with the
Avatar (only that will be picked up by the Avatar to continue its cycles).
Our second question was whether the artificial agents can elicit socially interactive and socially contingent conversation with an infant, above and beyond babies’ production of prosocial emotional engagement/sustained visual attention. In an intensive analysis of 4 of the 8 babies (as analyses involve hundreds of
hours of frame-by-frame video transcription with trained experts, behavioral coding, and reliability checks),
all four babies produced social interactions and/or solicitations to the agents (e.g., waving HI, pointing,
reaching, etc.) and attempted to copy the avatar, either through attempts to copy the avatar’s signs (and
components of signs) or matching the avatar’s rhythmic movements at the nucleus of its sign productions.
This novel finding is noteworthy because most babies (3 of 4) were never exposed to signed language yet
attempted to copy the Avatar’s linguistic signed language productions, and as noted above, they did so
without understanding the meaning of the avatar’s signs. Crucially, the babies’ powerful engagement with
the avatar occurred even though the avatar is an artificial agent on a flat TV monitor. We also performed a
detailed case-study with one of the babies (a 7 month-old hearing baby boy who was exposed to signed
language/ASL). In particular, we examined:
1. Whether the baby performed age-appropriate proto-linguistic behavior.
2. Whether this behavior was produced in a socially contingent sequence, as solicited by the Avatar's
linguistic behavior.
In pursuit of these questions, we first coded the videos of conversational interactions with respect
to Avatar and baby behaviors, followed by reliability checks. The rigorous coding was done by trained
coders in the field of child language. Every video was coded for the categories of social conversational turns
and content, along with time markings and the total length of the coded segments. Regarding question (1),
we see linguistic behavior from the baby in both conditions (with and without the parent joining in). The
baby waved and produced proto-signs related to two distinct Nursery Rhymes. Regarding
question (2), the sign-productions in all cases appeared as socially contingent reactions to the Avatar. Baby
proto-signs were produced within a few seconds of the Avatar producing the relevant Nursery Rhymes.
Baby social behaviors, such as waving, were produced as a response to social routines such as the signs for
HELLO or GOODBYE. Thus, we see that the agents performing dialogue routines, in reaction to continuous
multimodal sensory updates, were successful in soliciting socially contingent conversation from the infant.
This would suggest the potential viability for using this kind of system for language learning in young
infants.
5.11.2 Quantitative Evaluation
We categorize the baby’s spontaneous behavioral responses to the Avatar conversational modes as follows:
1. Linguistic Responses (“ling”) include manual babbling, the production of manual proto sign-phonetic
units, proto-signs, and imitations of signs (i.e., the baby imitates or copies what it sees the Avatar
producing);
2. Social/Gestural Responses (“S/G”) include pointing, waving, clapping hands or attempts to copy
the agents’ behaviors, or social referencing to the parent;
3. Sustained Visual Attention (“SVA”) indicates the baby being visually transfixed on the agents for
atypically extended periods for infants, defined as greater than one second for this study.
Note that these categories are not mutually exclusive. A baby can exhibit SVA, that is, be visually transfixed
on the avatar, and simultaneously be producing social/gestural or linguistic responses. Producing visually
transfixed attention, social gestures and especially linguistic behaviors is an indication that the system is
successful at soliciting the babies' interaction. Frequency analyses of the babies' behaviors throughout the
experiments provided us with good insight into the babies' behavioral patterns. Analysis was done based
on the occurrences of specific behaviors, as is the convention in the child developmental sciences.
The video-recorded socially contingent interaction sessions were coded for conversational turns. The
coding was done by trained experts in the field of developmental cognitive neuroscience, child development,
linguistics and sign-linguistics with reliability checks. ELAN [40] was used for annotating the baby’s
behavior and marking the times of avatar and robot’s behaviors. ELAN is a professional tool to manually
and semi-automatically annotate and transcribe audio or video recordings with a tier-based data model
that supports multi-level, multi-participant annotation. A screenshot of the tool along with different tiers is
shown in Figure 5.16.
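For illustration, once the ELAN tiers are exported (here assumed as a simplified tab-delimited file with tier name, start, end, and label columns), per-mode response rates of the kind reported below can be computed by checking whether any baby behavior begins within a short window of each avatar turn; the export format, tier names, and window length are assumptions, not the exact analysis pipeline used.

import csv
from collections import defaultdict

RESPONSE_WINDOW_S = 3.0   # illustrative: a baby behavior counts as a response if it
                          # starts within this window of the avatar turn

def load_tier(path, tier_name):
    """Read (start, end, label) intervals for one tier from a tab-delimited export.
    Assumes columns: tier, start (s), end (s), label; real exports may differ."""
    rows = []
    with open(path, newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) >= 4 and row[0] == tier_name:
                rows.append((float(row[1]), float(row[2]), row[3]))
    return rows

def response_rates(avatar_turns, baby_events):
    """Fraction of avatar turns of each mode (NR, Social, 3-Way, Idle) that received
    at least one baby behavior starting within the response window."""
    responded = defaultdict(int)
    total = defaultdict(int)
    for a_start, a_end, mode in avatar_turns:
        total[mode] += 1
        if any(a_start <= b_start <= a_end + RESPONSE_WINDOW_S
               for b_start, _, _ in baby_events):
            responded[mode] += 1
    return {mode: responded[mode] / total[mode] for mode in total}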
Here, we present the results of our analyses in two parts. First, we show the interactions between the
baby and the avatar, the babies’ specific categories of spontaneous behavioral responses, and their relative
107
Figure 5.16: Annotation example using ELAN
frequencies. Second, we show the corresponding analyses regarding the parents, and the impact of parent’s
intervention on baby’s behaviors toward the system.
5.11.2.1 Baby and Avatar
In answer to questions 1 & 2 above (do babies attend to the avatar? do they differentiate among its
conversational modes?), a frequency analysis of responses to avatar behaviors was conducted using the
categories of behavioral responses of babies stated in section 5.10 (linguistic; sustained visual attention;
social/gestural responses). Overall, babies responded to more than 60% of the avatar's behaviors (M =
61.8, SD = 6.9). Figure 5.18 shows a Venn diagram of the four discrete categories of baby behavioral
responses to the avatar as well as their relative frequencies. The overlapping portions show cases where the
baby responded in more than one way to the same avatar behavior. We can see that the babies’ transfixed
sustained visual attention (SVA) constitutes the biggest portion of babies’ behavioral responses to the avatar
(48% overall). Next, we studied the relationship between the Avatar’s behaviors and the baby’s response rate.
Results, depicted in Figure 5.19, show that babies' responses were not equally distributed across different
types of Avatar’s behaviors. Note that the distribution of avatar behaviors was also not uniform: 13% of the
avatar’s behaviors were NR, 13% 3-way, 36% were social, and the remaining 38% were idle.
Figure 5.17: Frequency of Avatar’s Behaviors
Figure 5.18: Frequency of Baby’s Behaviors
Following from question 3 above (does the Avatar’s linguistic behavior impact linguistic productions in
the baby?), as a first step in our analyses, we observed that the babies produced their greatest percentage of
spontaneous responses to the Avatar when the avatar was producing linguistic Nursery Rhymes. Babies
produced spontaneous behavioral responses to 85% of the Avatar’s Linguistic Nursery Rhymes, 84% of the
3-Way conversational turns, and 75% of the Avatar productions in the Social Gesturing conversational turn,
but responded only 37% of the time when the Avatar was idle. The babies' responses to the Avatar's actions
(NR, Social, 3-Way) were significantly more frequent than when the avatar was in its idle mode (t = 3.35, p = 0.01).
Thus, the babies do appear to attend to and to respond to the Avatar’s different conversational modes, with
the babies’ greatest percentage of responses being when the Avatar was producing Linguistic Nursery
Rhymes.
Figure 5.19: Babies’ response rate to different avatar behaviors
Figure 5.20: Babies’ categorical responses to different avatar behaviors
Further to question 3 above, we conducted a frequency analysis of the different baby behaviors in
response to the avatar’s behaviors. Figure 5.20 shows the rate of each baby behavior in response to each
category of avatar’s production. Note that the bars in each category do not necessarily need to add up to 1,
because sometimes the baby responds with multiple response types, as shown in Figure 5.18.
As shown in Figure 5.20, the babies responded differently when the avatar was in the Linguistic Nursery
Rhyme conversational mode versus other modes (Social, Idle, 3-Way). The babies produced the largest
percentage of linguistic responses to the avatar’s Linguistic Nursery Rhymes (31% to Nursery Rhymes
vs 10% to Idle, 19% to Social Gestures, and 20% to 3-Way). Further, the babies' responses to the avatar's
Linguistic Nursery Rhymes (over the avatar's other conversational turn types) largely involved them being
riveted into a state of fixed and Sustained Visual Attention (77%). Of theoretical significance, there appears
to have been a principled relationship between the avatar’s socially contingent communicative turn types
and the babies’ specific responses. This relationship implies that the avatar was indeed having a linguistic
impact on the baby.
Figure 5.21: Frequency of baby response types in absence (condition 1) or presence (condition 2) of parental
involvement
5.11.2.2 Parent’s Intervention
To address question 4 (impact of parental intervention),
we analyzed the different baby behaviors across the two conditions. Babies responded to 80% of avatar’s
behaviors in Condition 1 versus 60% in Condition 2 (t = 2.22, p < 0.05). This decrease is mainly due to a
significant decrease in the babies’ sustained visual attention, SVA (t = 4.3, p < 0.005). This finding makes
sense since in condition 2, parents were acknowledging and interacting with the baby, so babies would
naturally look more at the parent thereby exhibiting fewer instances of sustained attention toward the
avatar. Figure 5.21 shows the distribution of baby responses across the two conditions. As shown, there is
a significant increase in the percentage of babies’ linguistic behaviors from Condition 1 to Condition 2
(t = 2.4, p < 0.05). This is very interesting, as it indicates that parents’ interactions may have the potential
to augment the language learning impact of RAVE. Apart from parental impact, the present pattern of
change from Condition 1 to Condition 2 may imply that the infant is evidencing aspects of learning.
5.12 Design Lessons
While our efforts in this project were focused on building a specific tool for supplementing language for deaf
infants, some of our design process has touched on issues that are more generally applicable. We highlight
three design lessons that emerged from this work and discuss their broader application to child-agent
design and the design of agents as pedagogical tools.
First, our work supports and affirms the importance of physical embodiment. While our protocols were
not designed to explicitly compare the impact of an embodied system to a screen-only system, we based
our design on the considerable evidence showing that infants do not learn from screen-based technologies
[145, 55]. Infants in our case studies were able to successfully engage with our paired robot and avatar
system and some of the infants displayed instances of copying (a limited form of social learning). While
anecdotal, we view our results as being aligned with prior studies that demonstrate the benefits of physically
embodied systems for teaching language to older children [156], as well as the overall benefits of physical
embodiment for improving learning gains in adults [157], enhancing social performance and perception
[278], and ensuring compliance with challenging pedagogical and therapeutic tasks [16]. Our observations
are encouraging, but future efforts must focus on repeated exposures over longer periods of time which
would possibly allow for the demonstration of language learning.
Second, our work supports the potential benefits of incorporating parents into agent-infant pedagogical
interactions. The introduction of a parent-engaged interactive session with our third case study supports
multiple results in robot-child interaction that demonstrate the usefulness of involving parents in pedagogical engagements. In particular, this work draws from observations in producing therapeutic agents
that teach cognitive and social skills to children with autism spectrum disorder [239]. In clinical domains,
designing interactions that merely provide enhanced agent-child interaction has limited value, as teaching a
child to interact with a robot may or may not generalize to human-human interactions. Instead, modern
agent-based interventions focus on using the agent to explicitly support child-adult interactions [137, 220].
In the future, allowing more direct support between infant and parent may provide a more rapid and
generalizable method for supporting language learning.
Finally, our work expands and reinforces the use of paired agents to provide controlled, semi-scripted
interactions for the benefit of a child observer. Having two agents allowed our system to produce scripted
call-and-response interactions between agents, to demonstrate positive social responses toward the social
overtures of the other agent, to use social mechanisms to signal to the child that both avatar and robot
were social agents, and to capitalize on capabilities that only one of the agents possessed. This technique
was used previously in interactions involving groups of 2 robots and 1-3 older children (4-6 years old)
by Scassellati’s group [154, 153], and our use of it here closely resembles the framing used by Vázquez et al.
[277]. The computational methods for supporting these multi-agent pedagogical designs and the types of
interactions that can be achieved merit further exploration.
5.13 Conclusion
The driving theoretical question of this chapter was to understand whether an artificial agent (the ASL
signing avatar) had the potential to facilitate language learning in young babies. To address this, we
studied the impact that a signing avatar agent (with its richly varied communicative and socially contingent
conversational modes) had on young babies’ spontaneous behavioral responses; in particular, we asked
whether the avatar’s linguistic productions in signed language would spontaneously trigger linguistic
responses from the babies (be they babies exposed to a signed language or babies exposed to a spoken
language). We were especially interested in whether a very young baby would even detect the avatar’s different
conversational modes, as the avatar was projected onto a flat screen.
We indeed found that babies spontaneously distinguished among avatar conversational modes. Babies
produced different categories of behavioral responses to the avatar, and, further, their different behavioral
responses were socially contingent on (i.e., related to) the avatar’s different conversational modes. To be sure, the
results indicate that the babies were indeed able to detect the avatar’s different conversational modes even
though all appeared on a flat screen.
One interesting finding was that babies produced the greatest percentage of linguistic responses to
the avatar’s linguistic Nursery Rhymes versus other avatar conversational modes even though most of
them did not understand ASL and thus could not understand the meanings of the language that they
were observing. The babies’ linguistic responses included linguistic ASL-related productions (spanning
manual babbling, production of proto-sign phonetic units and proto-signs, linguistic sign-phonetic and
sign imitations). Research has demonstrated that mimicking is present from birth [181, 182], and is one
component of learning [146] and social relations [179]. In spoken/signed languages, language learning
includes both mimicking and novel language productions. Our results revealed both, demonstrating novel
baby-initiated responses.
Herein lies one of the important findings of the present study concerning the nature of the brain-based
mechanisms that govern human language acquisition. Rather than being attracted to the meanings of the
language before them, we hypothesized that all babies (deaf and hearing) were differentiating among the
avatar’s conversational modes based on differences in their +/- relation to the rhythmicity of language
phonetic-syllabic (phonological) structure [206, 213]. ASL is a real language and possesses the phonological
structure universal to all world languages, and the avatar was producing the rhythmic temporal patterns
(that underlie phonological structure in ASL), thus babies demonstrated riveted attention to this category
of avatar productions. This finding is much like those showing that babies demonstrate riveted attention to
the phonological patterns in their native language as well as in a foreign (non-native) language over other
patterns of acoustic stimuli [214, 131, 146]. These findings provide support for the view that infants are born
with a sensitivity to specific rhythmic temporal patterns in language and that the avatar had hit squarely
on those patterns.
The present findings also suggest that the dialogue management had achieved a level of verisimilitude
to social contingency found in natural parent-baby discourse. The babies’ interactive engagement with
this artificial agent provides evidence that social contingency is a vital component of healthy language
learning. Beyond the importance of social interactions, the role of social contingency in early language
acquisition will also be pursued in our future work along with work analyzing the robot’s role in the
system. Nonetheless, all of the babies appeared to be captivated by the avatar and exhibited spontaneous
engagement with it, a striking observation given that the avatar is an
inanimate artificial agent on a flat screen.
This project directly addresses questions 2.1 (Types of multimodal feedback provided by agents) and
3.1 (Types of multimodal feedback perceived by agents) of this thesis. First regarding 2.1, the storyteller
agent (the virtual avatar) provides non-verbal feedback, such as nodding, and linguistic feedback, such as
the aha sign. Regarding 3.1, the agent chooses appropriate nursery rhymes based on the baby’s feedback
and its perceived emotional state. In addition, this project also provides a partial answer to question 4
(real-time predictions based on user feedback), by altering the planned story scenario either by shortening
or completely abandoning it based on the baby’s behavioral feedback.
Chapter 6
Human-Swarm Interaction
Addressing our fourth research question (How can agents use real-time predictions based on user feedback
to further adapt the dialogue management policies of the agent?), we designed and developed an agent
capable of predicting user performance based on the user’s feedback and multimodal behavioral cues. In this
experiment, our agent acts as an active third-party story listener, observing the user interact with a story-telling
system. By observing the interaction and the user’s feedback on the stories, the agent uses a
model to predict the user’s performance and decides to intervene when it seems necessary, thus changing
the flow of the dialogue and improving the overall performance of the system. For this experiment,
we used a high-stress human-agent interaction scenario in which decisions must be made, communicated,
and agreed upon in a short amount of time to avoid dire consequences. We chose this scenario since such
interactions can be a source of different multimodal signals indicating participant cognitive and emotional
states, which can vary with factors such as the difficulty of the interaction and participant personalities.
We studied the correlation between individuals’ behavioral cues and task performance while interacting
with several synthetic agents with different personalities and difficulties in this scenario.
Utilizing visual, audio, and text data from a multimodal corpus, we demonstrated the model’s ability to
predict user performance with a reasonable accuracy of 72%. This predictive model empowers the system
to strategically intervene in interactions, specifically when it anticipates user failure, thereby orchestrating
a significant enhancement in overall performance. This approach highlights the potential of predictive
models and strategic agent interventions in elevating the quality of human-agent interactions.
6.1 Introduction
The personality of interlocutors can play a profound role in shaping the structure and dynamics of conversation. Numerous studies have delved into the intricate relationship between personality and character
traits in dialogue systems, and how tailoring the behavior of one interlocutor based on the personality
profile of the conversant can result in superior outcomes [289, 243, 188, 235]. However, the interplay
between the personality profiles of both interlocutors and its effect on the flow of the interaction remains
an understudied area with significant potential for further exploration.
The majority of interactions examined in prior research have been conducted in low-stress contexts,
such as during a casual conversation or while engaging in specific tasks without the imposition of time
constraints. To build upon this existing knowledge and explore the impact of these emotional connections in
more demanding scenarios, this chapter investigates the dynamics of human-virtual character interactions
within a high-stress environment. Specifically, the study focuses on the ability of participants to perform a
critical task, namely rescuing residents in a simulated fire-rescue scenario, while simultaneously developing and
maintaining rapport with a virtual character. In this scenario, the system is capable of observing the human
user, predicting their behavior, and deciding to intervene when it sees fit. In doing so, it uses the feedback
the user provides while listening to a story as part of the interaction.
The objective of this chapter is to address the following tasks, which collectively provide
a comprehensive answer to our fourth research question:
1. Delving into the behavioral responses and feedback of individuals when exposed to various synthetic
personalities in a time-sensitive and high-risk simulated scenario.
2. Developing a multimodal deep neural network prediction model for predicting the outcome (user’s
performance) of these interactions using audio, video, and language modalities.
3. Demonstrating the potential benefits of integrating this predictive model into the system, showcasing
how it enables timely interventions with the aim of improving overall performance.
Our first objective is to investigate the association between individuals’ behavioral responses to different
synthetic personalities and their task performance in high-risk interactions. Our analysis centers on the
use of emotional expressions as an indicator of users’ internal state, exploring the link between exhibited
emotions and task performance. Additionally, we examine the users’ emotions during interactions with
various synthetic personalities to discern whether meaningful differences exist in how individuals respond
to different personality characteristics. This approach will provide insights into the complex interplay
between emotions, behavioral responses, and task performance in high-risk interactions.
Our second objective is to develop a predictive model for the user’s performance in time-sensitive
scenarios, utilizing multimodal signals acquired during interaction windows. This model will serve as a
valuable tool for evaluating a user’s ability to act efficiently and influence others under stressful conditions
and time constraints. The results will provide insights into the complex interplay between multimodal
signals and performance in time-sensitive scenarios.
The third objective of this chapter is to empirically illustrate the prospective performance enhancement
achievable by the system through the utilization of the predictive model. Specifically, we investigate the
system’s capacity to strategically intervene during interactions, guided by predictions signaling potential
interaction failure or long interaction duration to reach a successful conclusion.
We analyzed data from Chaffey et al. [49], in which human participants in a disaster relief scenario act
as Operator of a robot swarm and seek out and engage with diverse synthetic individuals (residents), to
convince them to evacuate, in an attempt to save them from an approaching fire (Figure 6.1).
Figure 6.1: Overview of the simulation environment, the spokesperson, and the operator (aka the user)
Our investigation into the first research question revealed that several emotional expressions are indeed
significantly correlated with Operator performance (Section 6.5.1.1). Additionally, our study sheds light on
the complex dynamics of human-synthetic resident interactions. Specifically, we found that users exhibit
differing emotional responses to different resident personalities (Section 6.5.1.2). Finally, our investigation
revealed a statistically significant correlation between neurotic personality traits and exhibited emotions
during these interactions (Section 6.5.1.3). Building upon these insights, we addressed the second research
question by proposing and developing a novel multimodal prediction model that effectively leverages
behavioral cues to classify Operator performance (Sections 6.4.2 and 6.5.2). Finally, we show that equipping
the system with this model has the potential for performance enhancement in Section 6.5.3.
6.2 Related Work
Studies have shed light on the intriguing phenomenon of humans forming emotional connections and
building rapport in interaction with virtual humans [47, 122]. In fact, research conducted by Gratch and
colleagues demonstrated that humans are capable of responding emotionally to virtual characters, with
subjects reporting feeling empathy and even attachment to them [103].
Building on this insight, further studies have focused on managing the complex dynamics that arise
from these emotional connections. Zhao and colleagues have conducted research aimed at enhancing
rapport-building between humans and virtual characters by investigating various factors that can influence
the development of these connections [294]. Other studies have examined the role of specific nonverbal
cues, such as facial expressions, in establishing conversational rapport with virtual agents. Wong and
colleagues, for instance, have investigated the effects of different facial expressions on the development
of rapport between humans and virtual characters [285]. In addition, the effects of perceived agency and
message tone on rapport-building have been examined by Lucas et al. [167]. They have explored
the impact of these factors on the development of rapport with a virtual personal trainer.
The face is an essential channel for expressing and perceiving emotional states. Emotions are complex
and vary in physical and mental features across different situations, people, and cultures [22]. However,
according to Paul Ekman [78], there is a core facial configuration that can be used to diagnose a person’s
emotional state in the same way a fingerprint is used to recognize a person. Emotions are a process of
automatic appraisal influenced by evolutionary and personal experiences that trigger physiological changes
and emotional behaviors to deal with situations. As an example, Dudzik and Hung [74] focus on the
interesting task of "Detection of Spontaneous Recollections during Video-viewing" using facial behavior
analysis.
6.2.1 Emotional Expression and Performance
A person’s emotional state can influence their performance in a variety of task domains, including contexts
as diverse as team resilience [183], athletic performance [226], software development [104], academic
achievement [50], and driving safety [127]. Furthermore, emotion can be an indicator of task performance,
as people often experience more positive affect when they know that they are performing well [87]. Karacora
et al. [134] have examined how rapport-building between humans and virtual characters can affect task
performance. Because emotional expressions can in many cases predict or indicate performance, it can be a
useful modality through which an automated system can gain information about the user.
6.2.2 Multimodal Affective Analysis
Multimodal behavioral analysis has been used in multiple domains. In the cognitive science domain,
research has shown that sequence learning and multimodal behavioral analysis can be used as tools to
assess human behavior in terms of learning ability, temporary memory, and attention [15, 222, 123]. For
an individual to perform well in a cognitive task, paying attention and being engaged in the task are
crucial. Previous research has investigated the relationship between attention, task engagement, and
human emotions expressed through bodily and facial expressions [32, 88, 276]. In [15], authors propose a
multimodal approach for cognitive task performance prediction from body postures, facial expressions, and
EEG signals.
Many applications include tasks of sentiment analysis and sentiment prediction of a user. These
include studies that have combined visual and audio features [272, 63], speech content [191, 262], and
even physiological signals to recognize emotions [225]. In [151], authors have focused on “Comfortability",
an “internal state that focuses on the person’s desire to maintain or withdraw from an interaction". They
have proposed several multimodal classifiers to recognize the comfortability of humans in a Human-Robot
interaction. They use various facial and upper-body movements (i.e., Action Units, Head Pose, Upper-body
Pose and Gaze) as input to their Machine Learning models. In [262], a multimodal deep neural classifier was
proposed to predict the best times for an agent’s empathetic response in a human-agent interaction. They
argue that emotional tone in language in addition to facial expressions are strong indicators of dramatic
sentiment in conversation that warrant an empathetic response, and therefore they are using visual, audio
and language modalities in their prediction model.
Table 6.1: Residents’ Description and Difficulty Level
Resident: Cooperative Couple | Category: Cooperative | Difficulty: 1
Description: The couple is willing and able to evacuate but needs to be informed of the actual danger of the fire. Once they receive a warning from either the operator or virtual spokesperson, they will evacuate.
Resident: Babysitter | Category: Cooperative | Difficulty: 1
Description: A person with two children is willing and able to evacuate, but is hesitant to leave without the consent of the parents. However, she can be convinced to evacuate with the assistance of either the operator or the virtual spokesperson.
Resident: Van Driver | Category: Compliant | Difficulty: 2
Description: A man who is responsible for transporting a group of elderly patients is stranded with them after an accident on the road. The group depends on the vehicle to evacuate safely.
Resident: Stubborn Man | Category: Stubborn | Difficulty: 3
Description: A resident who is unwilling to evacuate due to the reluctance to leave their work behind. The resident can only be persuaded to leave with the help of the operator.
Resident: Stubborn Couple | Category: Stubborn | Difficulty: 3
Description: A couple who believes their house preparations are adequate to withstand the wildfire. They will only evacuate if convinced by the operator.
Techniques for combining data from different modalities can be classified as either early or late fusion.
Early fusion involves projecting features from different modalities into the same joint feature space before
feeding them into the classifier [234], while late fusion involves making classifications on each modality
and merging their decisions or predictions later on [41]. Some studies have even implemented a hybrid
fusion strategy that combines the benefits of both early and late fusion [217].
6.3 Data
The data used is from an experiment first designed and introduced by Chaffey et al. [48]. The study included
31 participants recruited through Craigslist (age range 22-49 with an average age of 29; 19 male and 12
female; from a range of ethnicities). We briefly describe the overall scenario, the residents who must be
found, convinced, and helped to evacuate, our recorded data, and our performance metrics.
Table 6.2: An interaction example drawn from the data: an interaction between a stubborn person (Difficulty
3) and the operator.
Speaker Dialogue Utterance
Resident Hello?
Operator Hey, what’s going on? Are you okay? I need you to evacuate right now. Immediately. You’re in
danger.
Resident I’m not leaving my home. I have too much work here to just leave it all behind.
Operator No, no, no. You’re going to leave right now. There’s a fire. Do you understand? You need to leave
right now.
Resident I’ve spent years on my collection, and there isn’t time to take it all.
Operator I understand that. Listen! You have to go, sir. You have to. Please listen. Your life is in danger.
Resident You really think it’s that bad?
Operator It’s very bad. It’s spreading really fast.
Resident Okay, I’m not stupid, just let me grab my bag and I’ll head out.
Operator Okay, thank you, sir, The emergency vehicle is, I think, on the far end. So you need to hurry up and
get there. Or we’ll try to get an emergency vehicle for you. Can we get an emergency vehicle for
them, Julie?
Spokesperson Yes.
Operator Send the emergency vehicle to his house right now.
Spokesperson I’m on it.
6.3.1 Scenario
The simulation presents a dynamic scenario where human participants (operator) are tasked with rescuing
residents from a small town threatened by an imminent wildfire. During the simulation, the operator is
under severe time constraints, as each resident must be evacuated before the fire engulfs their location.
To evacuate each resident, the operator must first convince them to either follow a drone to safety or be
evacuated by an evacuation vehicle. The task requires the operator to quickly locate the residents, convince
them to evacuate, and assist those in need physically. However, this operation poses various challenges
such as uncooperative residents, lack of resources like evacuation vehicles, and a rapidly decreasing time
window.
As part of the resources, the operator controls a swarm of drones. The drones have several capabilities, including monitoring the spread of the fire, searching for residents, playing pre-recorded warnings,
establishing audio communication with residents, following the movement of residents, and guiding them
toward safety. To facilitate controlling the swarm and reduce the cognitive load associated with managing
them, the operator also has access to a virtual assistant (spokesperson). The spokesperson can translate the
high-level instructions from the operator (in natural language) into step-by-step commands to the drones
and the evacuation vehicles. For example, a command from the operator such as “Drones 3 to 8 search area
between A16 and Z28” is translated by the spokesperson to optimized detailed instructions such as “drone 3
to A16”, “drone 4 to B17”, etc. The spokesperson can also be instructed to communicate directly with the
located residents and it can potentially convince some of them to evacuate without direct intervention by
the operator. Alternatively, the operator can take control and directly try to convince a resident (with the
expectation that people are more likely to be persuaded to leave their homes by a human rather than a virtual
agent). Finding the right tasks to delegate to the spokesperson is a crucial part of a successful evacuation,
as an operator is unlikely to have the time and cognitive resources needed to do everything themselves.
For example, when multiple residents are found in quick succession, the operator must work with the
spokesperson to decide who will handle each case.
To stay informed about the ongoing situation, the operator can also view a real-time map of the town
that shows the location of the drones and evacuation vehicle, the areas that have already been searched, and
any discovered information like the fire’s location and the whereabouts of any rescued residents. Figure 6.1
illustrates an overview of the simulation environment, the spokesperson, and the operator.
6.3.2 Residents
There are five residents who need to be evacuated. Their location is randomly set at the start of the
simulation and their evacuation requires a unique set of actions. After locating the residents, convincing
them in a timely manner is the most crucial challenge in their safe evacuation. Each resident has a unique
personality profile which requires the operator to have a broad set of social and communication skills to be
able to convince them to evacuate. The operator is unaware of the residents they will encounter or their
personality profiles beforehand. Therefore, they need to determine the necessary resources and actions
through communication with the residents and in collaboration with the spokesperson. Among the five
residents, the spokesperson is able to convince only 2 on its own and the other 3 can only be convinced to
leave by the operator.
We categorized the different behavior/personality types of the resident characters in the simulation.
Residents who needed only a little information to be convinced to evacuate and who required no additional
help to flee were classified as cooperative. Residents who needed no convincing but who nonetheless
required focused or special intervention to evacuate were called compliant. Residents who were hostile or
who could not be convinced to evacuate without verbal confrontation with the operator were classified
as stubborn. See table 6.1 for detailed information about all residents, along with their description and
assigned category and difficulty level. These categories are also representative of the anticipated effort/time
that the operator has to dedicate in order to convince them to leave.
6.3.3 Resident Interactions
Thirty-one participants took part in the study. The recorded information from their interactions contains a complete
log of simulation events, including the operator’s timestamped actions and instructions to the spokesperson,
residents’ dialogues, the Operator’s performance in terms of the number of rescued residents, their frontal
video recording while interacting with the system, and the screen recording of the operator’s simulation
environment. Microsoft Platform for Situated Intelligence (PSI) [33] was used as the infrastructure for data
collection and storage, and all streams of data were gathered through a PSI script and stored in PSI Store
format. Participants were asked to complete the mini-IPIP [71] questionnaire before the interaction, which gives
us access to the Operator’s personality traits.
We extracted the windows from the data in which the operator is conversing with the residents directly.
Table 6.2 shows an example from a resident in the “Stubborn” Category interacting with the operator.
There were 104 interaction windows from all the subjects interacting with the simulation environment.
Table 6.3: Statistics on operator’s interaction with different categories of residents
Resident Category | No. of Interactions | Avg. Length (seconds) | No. of Segments | Success Rate
Cooperative | 21 | 25.32 | 436 | 100%
Compliant | 29 | 38.15 | 614 | 93%
Stubborn | 54 | 36.46 | 1091 | 74%
All | 104 | 34.68 | 2141 | 85%
We divide the interaction period into several segments based on the speaker tag. Segments are either
utterances or silent periods. The two main speakers in the interaction windows are the operator (34% of
segments) and the resident (27% of segments), but occasionally we also have the spokesperson jump into
the conversation (13% of segments). A total number of 2141 segments were extracted from the interactions
(including the silent segments). Table 6.3 shows the data statistics, including the number of instances, the
average interaction length, the number of segments, and the success rate for each of the resident categories.
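To make the segmentation procedure concrete, the following minimal Python sketch (not the exact pipeline used in this work) splits an interaction window into speaker-tagged segments and inserts explicit silent segments between turns; the record fields (speaker, start, end) and the minimum-gap threshold are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str   # "operator", "resident", "spokesperson", or "silence"
    start: float   # seconds from the start of the interaction window
    end: float

def split_into_segments(utterances, window_start, window_end, min_gap=0.5):
    """Split one interaction window into speaker-tagged segments, adding
    explicit 'silence' segments for gaps longer than min_gap seconds.
    `utterances` is a list of dicts with keys: speaker, start, end."""
    segments = []
    cursor = window_start
    for utt in sorted(utterances, key=lambda u: u["start"]):
        if utt["start"] - cursor >= min_gap:
            segments.append(Segment("silence", cursor, utt["start"]))
        segments.append(Segment(utt["speaker"], utt["start"], utt["end"]))
        cursor = max(cursor, utt["end"])
    if window_end - cursor >= min_gap:
        segments.append(Segment("silence", cursor, window_end))
    return segments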
6.3.4 Performance Metrics
In this study, we introduce two quantitative performance metrics.
1. Number of Saved Residents: (0-5). It is important to note that we only consider residents who were
successfully rescued and not those who were convinced but unable to leave due to time constraints.
2. Interaction Length: We use interaction length as a performance metric since our experiment is
designed to assess the operator’s ability to efficiently and quickly convince the residents. The lengths of
the successful interactions (resident convinced) followed a bimodal (double Gaussian) distribution that
separated the data into two major parts, whereas the lengths of the unsuccessful interactions (resident
not convinced) followed an approximately normal distribution. We therefore divide the successful
interactions into those that take more or less than the median successful interaction length, and classify
the unsuccessful instances as a third, separate class.
Figure 6.2: Overall Architecture of Our Multi-Modal Information Infusion.
The “successful-short” and “successful-long” classes each make up
42.5% of the data, and the “unsuccessful” class makes up 15% of the data.
The first metric is defined for each user based on the overall simulation, while the second metric is defined
for each interaction window, measuring the operator’s performance in that specific interaction.
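A minimal sketch of the length-based labeling described above is given below; it assumes each interaction carries a boolean flag indicating whether the resident was convinced and a duration in seconds, with the median computed over the successful interactions only (the exact thresholding used in our implementation may differ in detail).

import numpy as np

def label_interactions(durations, convinced):
    """Assign one of three labels per interaction: 'unsuccessful' when the
    resident was not convinced, otherwise 'successful-short'/'successful-long'
    split at the median duration of the successful interactions."""
    durations = np.asarray(durations, dtype=float)
    convinced = np.asarray(convinced, dtype=bool)
    median_len = np.median(durations[convinced])
    labels = []
    for d, ok in zip(durations, convinced):
        if not ok:
            labels.append("unsuccessful")
        elif d <= median_len:
            labels.append("successful-short")
        else:
            labels.append("successful-long")
    return labels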
6.4 Method
6.4.1 Multimodal Feature Extraction
In this section, we present our methodology for multimodal feature extraction, which involves processing
three distinct modalities of recorded data: audio, video, and text. To accomplish this, we conducted
experiments utilizing two state-of-the-art models and methods for each modality.
6.4.1.1 Visual Features
For the visual representation, we experimented with two different models:
OpenFace [19] We use the OpenFace toolkit to extract raw features per frame from the operator’s video.
The extracted 32-dimensional feature vectors include the estimated 3D eye-gaze direction vector, head
pose, and the intensities of 17 Facial Action Units (AUs) [80], which indicate facial muscle movements. These visual
descriptors have been shown to be strong indicators of human emotions and sentiments [253]. Therefore,
we also extracted values for six universal emotions [79] {anger, happiness, sadness, fear, disgust, surprise}
based on [201].
VideoMAE The VideoMAE model was recently proposed by Tong et al. [265] as a data-efficient learner for self-supervised video pre-training (SSVP). Inspired by the success of ImageMAE [114], it extends
masked autoencoders (MAE) to video, achieving state-of-the-art performance on several video classification
benchmarks while remaining data efficient. We use a pre-trained VideoMAE model from huggingface [283]
to extract dense representations from the videos. By applying this model to the operators’ videos, we obtained
1536-dimensional vectors for each frame.
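As an illustration, the following sketch extracts VideoMAE encoder features for a short clip using the Hugging Face transformers library; the checkpoint name (MCG-NJU/videomae-base), the 16-frame clip length, and the resulting 768-dimensional token embeddings are assumptions tied to the base checkpoint and may differ from the exact model variant and pooling that produced the 1536-dimensional vectors reported above.

import numpy as np
import torch
from transformers import VideoMAEImageProcessor, VideoMAEModel

processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base")
model = VideoMAEModel.from_pretrained("MCG-NJU/videomae-base").eval()

def videomae_features(frames):
    """frames: list of 16 RGB frames (HxWx3 uint8 arrays) sampled from a
    segment. Returns the encoder's patch-token embeddings as a numpy array."""
    inputs = processor(list(frames), return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.last_hidden_state.squeeze(0).numpy()  # approx. (1568, 768) for the base model

# Example with dummy frames (a real pipeline would sample frames from the video)
clip = [np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8) for _ in range(16)]
features = videomae_features(clip)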
6.4.1.2 Linguistic features
Audio recordings of the interactions were first transcribed using the OpenAI open-sourced Whisper model
[221]. Afterward, the transcriptions were manually checked and corrected. The text from the segments is
then fed into two linguistic models: RoBERTa [160] and its fine-tuned version, RoBERTa-empathy. For the
silent segments, an empty string was fed into the models.
RoBERTa RoBERTa [160] is a cutting-edge English language model based on the transformer architecture,
which was first introduced in BERT[69]. It has been pre-trained on a large corpus of English data using the
Masked Language Modeling (MLM) objective, which allows it to learn a bidirectional representation of the
sentence. This makes its representations of language useful for many downstream tasks such as sequence
classification, token classification, or question answering. We used the pre-trained roberta-base model
weights from huggingface [283], which outputs a 768-dimensional contextualized vector embedding
for each word or token in the input sentences. We then use mean-pooling to obtain the sentence-level
representation from the contextualized embeddings.
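A minimal sketch of this step is shown below: each (possibly empty) segment transcript is encoded with roberta-base and mean-pooled over tokens using the attention mask; the truncation length is an assumption.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base").eval()

def sentence_embedding(text):
    """Return a mean-pooled 768-dimensional embedding of `text`
    (an empty string is used for silent segments)."""
    enc = tokenizer(text if text else "", return_tensors="pt",
                    truncation=True, max_length=128)
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state        # (1, tokens, 768)
    mask = enc["attention_mask"].unsqueeze(-1)         # (1, tokens, 1)
    pooled = (hidden * mask).sum(1) / mask.sum(1)      # masked mean over tokens
    return pooled.squeeze(0).numpy()

vector = sentence_embedding("I need you to evacuate right now.")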
RoBERTa-empathy The Hartmann [111] model is a fine-tuned checkpoint of the DistilRoBERTa-base
model [238]. This particular model has been fine-tuned on a balanced subset of 6 datasets, which
focuses on empathy and distress and can predict Ekman’s 6 basic emotions[77], plus a neutral class. The
embeddings generated by the RoBERTa-empathy model are specifically tailored towards emotional content,
rather than semantic information, making it a suitable candidate for our application, which also centers on
emotion analysis. The pre-trained model weights are available on huggingface[283].
6.4.1.3 Auditory Features
Similar to the visual and linguistic modalities, for the auditory features we focused on two models: first,
wave2vec, a state-of-the-art audio embedding model that has yielded great performance boosts in a wide
range of auditory applications, and second, OpenSmile [84], a model geared toward extracting emotional
cues from the audio.
wave2vec The Schneider et al. [241] model explores unsupervised pre-training for speech recognition, seeking
to learn representations of raw audio. It is trained on large amounts of unlabeled audio data, and the
resulting representations are used to enhance acoustic model training. The model is based on a multilayer convolutional neural network that is optimized via a noise contrastive binary classification task.
Experiments on the WSJ dataset have shown that this approach can reduce word error rates (WER) by up
to 36% compared to previous state-of-the-art models, even when only a few hours of transcribed data was
available. We use wave2vec to extract auditory features from the segments that go beyond the language
itself. We use the pre-trained model weights from huggingface[283] to obtain the frame-by-frame auditory
embeddings.
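The following sketch illustrates this extraction step using the Wav2Vec2 checkpoint available on Hugging Face (facebook/wav2vec2-base); the specific checkpoint and the use of torchaudio for resampling are assumptions and may differ from the exact wave2vec variant used in our pipeline.

import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

def wav2vec_features(path):
    """Load an audio segment, downmix to mono, resample to 16 kHz, and
    return frame-level embeddings from the pre-trained encoder."""
    wav, sr = torchaudio.load(path)                    # (channels, samples)
    wav = wav.mean(dim=0)                              # mono
    if sr != 16000:
        wav = torchaudio.functional.resample(wav, sr, 16000)
    inputs = extractor(wav.numpy(), sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state     # (1, frames, 768)
    return hidden.squeeze(0).numpy()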
OpenSMILE Eyben, Wöllmer, and Schuller [84] is an open-source software toolkit that enables the
extraction of features from audio signals (speech and music). It is commonly used in the classification of
speech and music signals and in the area of automatic emotion recognition for affective computing. OpenSMILE can recognize the characteristics of a given speech or music segment, such as a speaker’s emotion,
age, gender, personality, depression, intoxication, or vocal pathological disorders. We use OpenSMILE to
generate an auditory representation of the conversation between operator and residents with a focus on
affective (emotional) aspects of the conversation.
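For illustration, the sketch below uses the opensmile Python package; the choice of the ComParE_2016 functional feature set is an assumption, although its single 6373-dimensional output per file matches the dimensionality reported later in this chapter.

import opensmile

# Functionals of the ComParE_2016 set: one fixed-length vector per audio file.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.Functionals,
)

def opensmile_features(path):
    """Return a single acoustic/affective feature vector for an audio segment."""
    df = smile.process_file(path)    # pandas DataFrame with one row
    return df.to_numpy().squeeze()   # shape (6373,)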
6.4.2 Proposed Architecture
In this study, we set out to investigate whether the behavioral cues displayed by Operators can serve as a
dependable indicator of their performance in high-pressure, time-critical situations (our Research Question
2). To address this research question, we developed a novel multimodal deep neural network architecture.
The model consists of a combination of pre-trained models, which are then fine-tuned on our dataset to
optimize its performance for the specific task at hand. In the following section, we elaborate on the details
of our proposed architecture.
We present a novel multimodal model that integrates linguistic, auditory, and visual features, as described
in Section 6.4.1. The architectural design of our proposed model is depicted in Figure 6.2, serving as a
visual representation of its structure and components. Drawing inspiration from established conventional
architectures such as the work by Deng et al. [63], our model follows a similar approach by processing
each modality separately. Specifically, we treat linguistic, auditory, and visual information as distinct
channels, applying separate processing to each channel. Following this individual processing, we integrate
the information from different channels into a unified space before forwarding it to a classification layer. A
notable and distinguishing characteristic of our proposed architecture lies in the incorporation of temporal
information at two levels: segment level and interaction level. To achieve this, we employ two layers of
Long Short-Term Memory (LSTM) networks. This temporal modeling is specifically applied to the visual
and auditory modalities, enabling the capturing of temporal dependencies within individual segments as
well as the interaction dynamics across multiple segments. It should be noted that this particular temporal
modeling scheme is not applied to the linguistic modality, as its temporal nature is not as prominent in this
context.
Video For each individual segment, the accompanying visual frames undergo processing by both the
VideoMAE and OpenFace models, resulting in feature vectors of dimensions 1536 and 32 respectively. In
order to compress and diminish the dimensionality of the information, a fully connected layer is employed
to transform the output of VideoMAE into a 16-dimensional representation. Subsequently, the visual frame
embeddings obtained from both sources are concatenated and provided as input to an LSTM layer with
a hidden layer size of 16, enabling the derivation of visual segment-level embeddings. To streamline the
procedure, sequences are prepared by either padding or truncating them to consist of a maximum of 30
segments per video, with each segment containing at most 64 frames (equal to 32 seconds).
Audio In the auditory modality, we adopt a two-step approach to encode the information into dense
representations. Initially, pre-trained models are employed to obtain the embeddings. To obtain feature
vectors for individual segments, we utilize both wave2vec and OpenSmile models. For silent segments, an
empty audio clip was fed into the models. In the case of wave2vec, the audio files are resampled to 16 kHz from
their original 44.1 kHz, as suggested by the model’s documentation. Consequently, each auditory frame is
transformed into a 32-dimensional vector. To streamline the process, sequence preparation involves either
padding or truncation to ensure a maximum of 30 segments per video, with each segment containing a
maximum of 32 seconds. Following a similar approach as the video component, an LSTM layer with a
hidden size of 8 is employed to amalgamate frame-level information and generate a dense representation of
the segment. Conversely, OpenSmile generates a single 6373-dimensional vector per input file, eliminating
the need for pooling to acquire segment-level representations. In this scenario, a fully connected layer is
utilized to compress and reduce the dimensionality of the information, thereby converting the output of
OpenSmile into an 8-dimensional representation.
Text To process the linguistic features, we also use two pre-trained models to encode the information into
dense representations. First, each transcribed sentence from each segment is encoded using RoBERTa-base
and RoBERTa-empathy models. For silent segments, an empty string was given to the models as input. Then
we use mean pooling (instead of an LSTM) to amalgamate sentence-level information into segment-level
representations. In the linguistic part of the model, we used mean pooling for two reasons: first, it reduces
the overall count of trainable parameters and second, it is a widely adopted technique in NLP[267, 152]. For
each set of encodings, we use a fully connected layer to compress and reduce the dimensionality of the
information into an 8-dimensional representation.
Multi-Modal Fusion After obtaining the multi-modal segment level representations, in the last part of
the model, we first concatenate information from all modalities into one vector and add the speaker ID
to respective segments. Then we feed the resulting vector into an LSTM layer to pool the segment-level
information into a 16-dimensional representation of the whole interaction. Finally, we feed this interaction
representation into a fully connected decision layer to predict the length-based class of the interaction.
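To make the fusion stage concrete, the following Keras sketch implements only the interaction-level part of the model, assuming the per-segment embeddings from the three modalities (plus a speaker-ID code) have already been computed and concatenated into one fixed-length vector per segment; the dimensions follow the description above and Figure 6.2, but the masking scheme and other details are assumptions rather than the exact implementation.

from tensorflow.keras import layers, models

MAX_SEGMENTS = 30                        # interactions padded/truncated to 30 segments
SEG_FEAT_DIM = 16 + 8 + 8 + 8 + 8 + 1    # visual + two audio + two text embeddings + speaker ID

def build_fusion_model(num_classes=3):
    """Interaction-level classifier: a masked LSTM pools the sequence of
    segment embeddings into a 16-dimensional interaction representation,
    followed by a dense decision layer."""
    seg_inputs = layers.Input(shape=(MAX_SEGMENTS, SEG_FEAT_DIM))
    x = layers.Masking(mask_value=0.0)(seg_inputs)     # ignore padded segments
    x = layers.LSTM(16)(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    model = models.Model(seg_inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_fusion_model()
model.summary()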
6.5 Experiments
6.5.1 Behavioral Responses to Different Personalities
6.5.1.1 Participant Emotions and Task Performance
Experimental Setup We investigate the relationship between the operator’s emotional response to the
residents and their task performance, by extracting the emotional response of the operators during the
interaction windows. We used the number of saved residents as the performance metric for this analysis.
Facial Action Units were obtained from the OpenFace software. We then extracted the emotional
state of the participant by aggregating the action units as shown in Table 6.4:
Table 6.4: Mapping of Action Units (AUs) to Emotions
Emotion | Action Units | Description
Disgust | 9 + 15 + 16 | Nose Wrinkler, Lip Corner Depressor, Lower Lip Depressor
Fear | 1 + 2 + 4 + 5 + 7 + 20 + 26 | Inner Brow Raiser, Outer Brow Raiser, Brow Lowerer, Upper Lid Raiser, Lid Tightener, Lip Stretcher, Jaw Drop
Happiness | 6 + 12 | Cheek Raiser, Lip Corner Puller
Sadness | 1 + 4 + 15 | Inner Brow Raiser, Brow Lowerer, Lip Corner Depressor
Surprise | 1 + 2 + 5 + 26 | Inner Brow Raiser, Outer Brow Raiser, Upper Lid Raiser, Jaw Drop
Anger | 4 + 5 + 7 + 23 | Brow Lowerer, Upper Lid Raiser, Lid Tightener, Lip Tightener
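To illustrate the aggregation in Table 6.4, the sketch below sums per-frame OpenFace AU intensity columns into emotion scores and averages them over a window; the column names follow OpenFace's CSV naming convention, the simple sum-then-mean scheme is an assumption, and any AU not produced by a given OpenFace build (e.g., AU16 in some versions) is skipped.

import pandas as pd

# Emotion -> OpenFace intensity columns (AUxx_r), following Table 6.4.
EMOTION_AUS = {
    "disgust":   ["AU09_r", "AU15_r", "AU16_r"],
    "fear":      ["AU01_r", "AU02_r", "AU04_r", "AU05_r", "AU07_r", "AU20_r", "AU26_r"],
    "happiness": ["AU06_r", "AU12_r"],
    "sadness":   ["AU01_r", "AU04_r", "AU15_r"],
    "surprise":  ["AU01_r", "AU02_r", "AU05_r", "AU26_r"],
    "anger":     ["AU04_r", "AU05_r", "AU07_r", "AU23_r"],
}

def emotion_scores(openface_csv):
    """Aggregate per-frame AU intensities from an OpenFace output CSV into
    one mean score per emotion over the interaction window."""
    df = pd.read_csv(openface_csv)
    df.columns = df.columns.str.strip()      # OpenFace pads column names with spaces
    scores = {}
    for emotion, columns in EMOTION_AUS.items():
        present = [c for c in columns if c in df.columns]
        scores[emotion] = df[present].sum(axis=1).mean()
    return scores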
Results A multiple linear regression was used to test whether the operator’s expressed emotions significantly predicted performance. The overall regression was statistically significant (R² = 0.735, F(6, 15) =
6.945, p = 0.001). Table 6.5 shows the coefficients (β) and t-statistics for each emotion and the significance of
their contribution to the model. As seen in this table, all of the emotions (except fear) significantly predict
performance (p < 0.035).
Table 6.5: Multiple linear regression statistics (DV: Operator’s performance (number of saved residents);
IVs: Operator’s expressed emotions during interactions with residents)
Emotion | Beta | t | p-value
Disgust | 0.458 | 3.210 | 0.006
Fear | -0.295 | -1.450 | 0.168
Happiness | -1.160 | -5.316 | <0.001
Sadness | -0.494 | -2.838 | 0.012
Surprise | -0.329 | -2.328 | 0.034
Anger | -0.827 | -3.989 | 0.001
6.5.1.2 Emotional Response to Different Characters
Experimental Setup To test whether different character types elicited different emotional responses
from the operators, multiple regression was used. Participant ID (operator ID) was included as a random
factor to control for individual participant effects, and main-effects models were constructed for each
emotion category.
Results Regression controlling for participant ID revealed a significant effect of character type on operator
happiness expressions (F(2, 22) = 3.74, p = .028). Post-hoc pairwise comparisons of estimated marginal
means suggested that this effect was due to decreased happiness expressed when interacting with the
compliant character type vs. the other character types (compliant vs. stubborn p = .011; compliant vs.
cooperative p = .042; Figure 6.3).
After controlling for participant differences, the only significant effect of character behavior type on
emotion expression was with regard to happiness. Furthermore, we found that this was due to significantly
lower levels of happiness expressions being displayed by operators when interacting with the compliant
character type. Our interpretation of this finding is as follows: The compliant character type was represented
by one character in our study, the van driver, and his story could have been viewed as a particularly unhappy
one. The van driver was agreeable, he was responsible for the most lives, he was helping town residents
who were elderly, and he ran into unforeseen trouble that was not the result of any character’s stubbornness
Figure 6.3: Happiness Expressions by Character Type
or disagreeable attitude. It is thus not surprising that such a poignant scenario might lead to reduced
expression of operator happiness.
6.5.1.3 Participant Personality and Facial Expressions
Experimental Setup A series of correlations were performed to determine whether any participant
personality traits (as measured by the mini-IPIP) might predict the pattern of facial expressions observed
during the study.
Results Participants’ scores on the Neuroticism scale of the mini-IPIP correlated with the prevalence
of sadness in their facial expressions during interactions with residents (r(29) = .43, p = .043) and during
the experiment as a whole (r(29) = .45, p = .011). The personality trait of neuroticism is associated with
a tendency toward feelings of anxiety, sadness, pessimism, or insecurity [95, 129], and this appears to
have been reflected in the amount of sad facial expressions displayed by participants during our study.
Interestingly, neuroticism, which is also associated with anxious or nervous tendencies, was not found to
relate to fear facial expressions in our study. However, fear was rarely observed on participants’ faces during
this study, so there may not have been enough variation available to detect an effect. Other dimensions of
personality were not significantly related to any category of emotional expression.
6.5.2 Performance Prediction Model
Experimental Setup We implemented the model explained in Section 6.4.2 using the Keras library [54].
To train our model, we utilized the interaction length performance metric as our data label. As a result, the
problem was transformed into a 3-class classification task.
The sparse categorical cross-entropy was chosen as the loss function, while the Adam optimizer was
employed to update the weights during training iteratively. The model underwent 100 epochs of training,
and to prevent overfitting, an early stopping mechanism was incorporated with a patience of 10 epochs.
Early stopping was triggered if the validation loss failed to improve beyond the threshold of 1e-4. To assess
the model’s performance, 4-fold cross-validation with random shuffling on a “per-participant” basis
was performed, so no participants’ interaction data were included in both train and test sets for the same
experiment. Within each fold, the validation accuracy and validation f1-macro score were calculated, and
the average values were reported. We also conducted an ablation study to evaluate the effectiveness of each of
our modalities, as well as combinations of them, in our proposed prediction model, where we separately trained
the classification model using one or a subset of the embedding sources. In addition, we tested two variations
of our model: 1) affective, which only uses the affective embedding sources (OpenFace, RoBERTa-empathy,
and OpenSMILE), and 2) general, which only uses the general embedding sources (VideoMAE, RoBERTa,
wave2vec).
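A minimal sketch of this training and evaluation loop is given below; it assumes build_model returns a compiled Keras model (such as the fusion sketch in Section 6.4.2), uses scikit-learn's GroupKFold to stand in for the per-participant split, and (as a simplification) uses the held-out fold as the validation set for early stopping.

import numpy as np
from sklearn.model_selection import GroupKFold
from tensorflow.keras.callbacks import EarlyStopping

def cross_validate(X, y, participant_ids, build_model, epochs=100):
    """4-fold per-participant cross-validation: GroupKFold keeps every
    interaction of a participant in a single fold. Early stopping monitors
    validation loss with a minimum improvement of 1e-4 and a patience of 10."""
    accuracies = []
    for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups=participant_ids):
        model = build_model()
        stopper = EarlyStopping(monitor="val_loss", patience=10,
                                min_delta=1e-4, restore_best_weights=True)
        model.fit(X[train_idx], y[train_idx],
                  validation_data=(X[test_idx], y[test_idx]),
                  epochs=epochs, callbacks=[stopper], verbose=0)
        _, acc = model.evaluate(X[test_idx], y[test_idx], verbose=0)
        accuracies.append(acc)
    return float(np.mean(accuracies))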
Results Table 6.6 summarizes our proposed model’s accuracy and f1 score results and their different
variations. Our best model achieved an accuracy of 73.08% and f1-macro score of 63.96% on the classification
task. Our findings demonstrate that the linguistic modality contributes the most to the model’s performance,
surpassing the auditory and visual modalities. This observation is consistent with the results reported in
Table 6.6: Mean cross-validation accuracy and F1-Macro metrics for different modalities on the 3-class
classification task. Models with the dagger (†) sign are affective models and the others are general.
Modality | Model | Accuracy (%) | F1-Macro (%)
Visual | VideoMAE | 45.19 | 41.17
Visual | OpenFace† | 39.48 | 27.32
Visual | Both | 47.71 | 42.01
Linguistic | RoBERTa | 56.73 | 53.86
Linguistic | RoBERTa-empathy† | 50.00 | 41.37
Linguistic | Both | 56.91 | 53.98
Auditory | wave2vec | 40.38 | 29.92
Auditory | OpenSMILE† | 33.65 | 16.77
Auditory | Both | 43.19 | 33.58
Multimodal | Visual & Linguistic | 65.51 | 61.66
Multimodal | Visual & Auditory | 50.41 | 44.29
Multimodal | Linguistic & Auditory | 59.09 | 57.11
Multimodal | generals | 57.69 | 50.58
Multimodal | affectives† | 55.56 | 46.46
Multimodal | Full | 73.08 | 63.96
— | Random | 33.65 | 16.78
other studies [113, 158, 168, 262]. We can see that out of all the dual-modality combinations, the visual-linguistic combination has higher accuracy compared to others. This suggests that these two modalities
contained the most distinct information embedded in them compared to other pairs. Furthermore, our
results indicate that the combination of general models outperforms the combination of more specialized
models trained on affective datasets. This finding suggests that, despite not being specifically fine-tuned for
this particular task, the large-scale training data of general models provides them with enough implicit
knowledge to perform adequately on a behavior analysis-based task. These results offer possibilities for
future research into the optimal combination of modalities for behavior analysis and suggest that general
models may provide an effective starting point for this investigation.
6.5.3 Intervening in Sessions
Experimental Setup As detailed in Section 6.4.2, our multimodal predictive model is able to predict user
performance with a reasonable accuracy of 72%. As a next step, we investigate the potential of integrating
this prediction model into the system’s repertoire and measure the potential performance boost of the
system. More specifically, we investigate the potential of strategic agent interventions in improving the
number of saved residents. The goal is to estimate the effect of the “spokesperson” intervening during
sessions in which we predict the “operator” will not convince the “resident” or will take a long time to do
so; i.e., when the predicted interaction label is either “unsuccessful” or “successful-long”.
We use the model that achieved the highest accuracy described in Section 6.5.2, which incorporates
all three modalities along with affective and general features specific to each modality. To simulate the
intervention process, we employ a fully fine-tuned model. During the inference (test) phase, we feed
only the first quarter of the interaction, which is approximately 7.5 seconds, to the model. We use
4-fold cross-validation, using 75% of the data for training and 25% for testing. This approach is
referred to as “partial-inference”. We record instances where the model confidently predicts a session
will result in a failure, specifically when the model indicates that the “resident” will not be convinced to evacuate.
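The following sketch simulates the partial-inference procedure; the label-index ordering, the representation of hidden segments as zero padding, and the per-interaction segment counts are illustrative assumptions.

import numpy as np

def simulate_partial_inference(model, X_test, segment_counts, fraction=0.25):
    """Keep only the first `fraction` of each interaction's segments (about
    the first quarter of the interaction), zero out the rest so they are
    treated as padding, and flag interactions whose predicted class suggests
    a spokesperson intervention."""
    X_partial = np.zeros_like(X_test)
    for i, n_segments in enumerate(segment_counts):
        keep = max(1, int(np.ceil(fraction * n_segments)))
        X_partial[i, :keep] = X_test[i, :keep]
    predictions = model.predict(X_partial).argmax(axis=1)
    # Assumed label order: 0 = unsuccessful, 1 = successful-short, 2 = successful-long.
    intervene = np.isin(predictions, [0, 2])
    return predictions, intervene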
Results Figure 6.4 shows the confusion matrix of the prediction of our best model on the 3-class classification task. As a reminder, our three labels were "unsuccessful", “successful-short” and “successful-long”
(section 6.3.4).
The findings of our study demonstrate that the “partial-inference” model achieves a 49.04% prediction
accuracy in forecasting interaction outcomes, in contrast to the baseline random guessing rate of 33%.
Delving deeper into our analysis, the examination of the confusion matrix reveals that the model interestingly
labels all (100%) of the successful-long interactions (third column) as either unsuccessful or successful-long,
with an exact match of 87.18%. In these circumstances, the spokesperson might intervene in the interaction,
Figure 6.4: Confusion matrix of the partial-inference model
thus taking the floor away from the operator and saving the operator valuable time that they can spend
rescuing other residents, thus improving the overall performance. This evidence underscores the profound
advantages associated with the incorporation of predictive models within the agent framework, facilitating
timely interventions during sessions.
6.6 Conclusion & Future Directions
Virtual characters are used in a variety of task simulations and training contexts. The personalities of
these characters can influence users’ experiences, including their behavioral responses as well as their
performance on the simulated task. We explore these effects in a high-stress simulated context wherein users
(operators) must rescue virtual characters (residents) from a simulated wildfire that is rapidly approaching
a simulated town. Operators must engage in dialogue with the residents to convince them to flee and
to arrange their escape methods. Participants interacted with characters of different personality types
and were video recorded during the interaction. From the video recording, we extracted multimodal
data including their frontal video, speech transcripts, and voice recordings, and used these to examine
relationships between character personalities, operator personalities, operator-expressed emotions, and
task performance.
We found that operators did vary their facial expressions depending upon what character personalities
they were interacting with. Specifically, participants expressed less happiness when interacting with the
compliant character as opposed to the cooperative or stubborn characters. The compliant character’s story
was particularly stressful and heartbreaking, and it appears to have successfully reduced the expression
of happiness of the participants when interacting with this character. Future work can add additional
personality types for characters, including ones that are optimistic, overly helpful, or actively hostile, to
allow exploration of how these would influence user emotional expression.
It is important to keep in mind that individual user differences can influence behavior and emotion in
various contexts. For this reason, we examined whether the personality traits of the participants influenced
their facial expressions during the study. We found that the Big Five personality factor of Neuroticism
predicted the amount of sad facial expressions displayed. Future iterations of the simulation can incorporate
more advanced agents that respond to the user’s facial expressions, and knowing the effects of participant
personality on facial expression can allow the agents to better calibrate the appropriate response.
Our study found that the operator’s performance was negatively related to the facial expression of
several different emotions. This might suggest that both positive valence emotions (like happiness) and
negative valence emotions (like fear and sadness) generally impeded the performance of the task. With a
high-stress, time-sensitive task, it may be the case that even a small amount of cognitive or attentional
burden imposed by emotion was enough to negatively impact performance. Future modifications to the
task, like increasing or decreasing the time pressure, may yield a different relationship between expressed
emotion and performance.
In addition, we developed a deep neural network model to predict the user’s performance using three modalities: video, audio, and text. User performance in this case is categorized into one of three classes: “successful-short”, “successful-long”, and “unsuccessful”. To this end, we split the interaction window into several segments based on the speaker tag. The model processes the interaction at two levels, each with an LSTM layer: the first level encodes the temporal dimension of each segment within the interaction, and the second level incorporates the interaction dynamics across multiple segments. Our best model achieved an accuracy of 73.08% on our dataset. Future directions include exploring other models and architectures, such as late-fusion methods, and comparing them with our early-fusion approach.
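To make the two-level design more concrete, the following is a minimal PyTorch sketch of a hierarchical segment- and interaction-level LSTM of this kind. The number of classes (three) follows the description above, but the fused feature dimension, hidden sizes, and other details are illustrative assumptions rather than the trained model.

import torch
import torch.nn as nn

class HierarchicalPerformancePredictor(nn.Module):
    # Two-level model: a segment-level LSTM encodes the frames of each speaker
    # segment, and an interaction-level LSTM models dynamics across segments.
    def __init__(self, fused_dim=128, seg_hidden=64, inter_hidden=64, n_classes=3):
        super().__init__()
        self.segment_lstm = nn.LSTM(fused_dim, seg_hidden, batch_first=True)
        self.interaction_lstm = nn.LSTM(seg_hidden, inter_hidden, batch_first=True)
        self.classifier = nn.Linear(inter_hidden, n_classes)

    def forward(self, segments):
        # segments: list of tensors, each (num_frames, fused_dim), holding the
        # early-fused video/audio/text features of one speaker segment.
        seg_encodings = []
        for seg in segments:
            _, (h_n, _) = self.segment_lstm(seg.unsqueeze(0))  # add batch dimension
            seg_encodings.append(h_n[-1])                      # (1, seg_hidden)
        sequence = torch.stack(seg_encodings, dim=1)           # (1, n_segments, seg_hidden)
        _, (h_n, _) = self.interaction_lstm(sequence)
        return self.classifier(h_n[-1])                        # (1, n_classes) logits

# Example with dummy data: five segments of varying length, 128-d fused features.
model = HierarchicalPerformancePredictor()
dummy_segments = [torch.randn(torch.randint(10, 40, (1,)).item(), 128) for _ in range(5)]
print(model(dummy_segments).shape)  # torch.Size([1, 3])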
Chapter 7
Conclusions and Future Directions
In this chapter, we summarize the main contributions of this dissertation, and discuss potential directions
for future work.
7.1 Summary
In this thesis, we delved into the intricate domain of human-agent interaction, with a particular focus on
the utilization of feedback within the context of story-telling. As a central guiding principle, we formulated
the thesis statement as follows:
By understanding and generating multimodal feedback, virtual agents can improve the quality of
human-machine storytelling interactions by providing engaging, personalized, and interactive experiences for users.
Throughout our research journey, we posed four pivotal research questions, each serving as a compass
guiding our exploration. Through a series of designed experiments, we provided answers to these inquiries.
Within the framework of our investigations, we have uncovered nuanced responses to each of these
questions. While we acknowledge that the specific answers we’ve provided may not be universally applicable,
we confidently assert that the methodological approach we’ve developed holds the key to extracting answers
in a broader spectrum of domains and contexts. The methodology is a tool that can be used by researchers
seeking solutions to similar questions, regardless of the specific subject matter or context of the inquiry.
Below, we review the list of the four research questions (section 1.2) that have guided our journey:
1. What kinds of stories should virtual story-tellers provide?
2. What kinds of multimodal feedback can agents provide to people with the goal of improving the
interaction?
3. What kinds of multimodal feedback can agents recognize and understand and how can they be used
to adapt the dialogue management policies of the agent? For example, how can dialogue managers
include different modalities of the sensory input in their internal state? These inputs can be visual,
acoustic, or physiological.
4. How can agents use real-time predictions based on user feedback to further adapt the dialogue
management policies of the agent?
Chapter 3 was dedicated to addressing Research Question 1. In this chapter, we outlined our experiment
setup involving a virtual character engaging in storytelling with a user. This virtual character assumed the
dual roles of both storyteller and listener. We aimed to understand how certain factors, specifically the
character’s identity, how the story is presented, and the character’s point of view impact the interactions
between humans and these virtual characters. Broadly speaking, participants exhibited a distinct preference
for characters who conveyed their stories from a first-person perspective. Nevertheless, our analysis also unveiled subtle interaction effects involving the presentation-order variable, while the embodiment and identity variables did not establish significant preferences among participants.
Our Research Question 2 concerned the different kinds of multimodal feedback that agents should provide. Part of being able to provide appropriate feedback in the course of an interaction is knowing the best type of feedback to provide at a given time. To explore this question, we conducted an offline analysis of a multimodal dataset of interactions with a virtual human. The experiment, described in Chapter 4, analyzed behavioral cues in interactions with a virtual agent that indicate an opportunity to provide an empathetic response. To automatically recognize such instances, we developed a multimodal deep neural network for identifying opportunities when the agent should express positive or negative empathetic responses. The model uses verbal, acoustic, and visual modalities. We trained and evaluated our model using audio, video, and language from human-agent interactions in a Wizard-of-Oz setting, using the wizard’s empathetic responses and annotations collected on Amazon Mechanical Turk as ground-truth labels. Our model outperforms a text-based baseline, achieving an F1-score of 0.71 on a three-class classification task. We further investigated the results and evaluated the model’s suitability for deployment in real-world human-agent interactions.
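As an illustration of how the three modalities could be fused in such a classifier, the sketch below combines per-utterance text, acoustic, and facial feature vectors in a simple early-fusion network over three classes (positive, negative, or no empathetic response). The feature dimensions and layer sizes are assumptions chosen for illustration and do not reproduce the exact architecture described in Chapter 4.

import torch
import torch.nn as nn

class EmpathyOpportunityClassifier(nn.Module):
    # Early fusion: concatenate modality features, then classify into three classes.
    def __init__(self, text_dim=768, audio_dim=88, video_dim=35, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim + audio_dim + video_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden, 3),  # positive / negative / no empathetic response
        )

    def forward(self, text_feat, audio_feat, video_feat):
        fused = torch.cat([text_feat, audio_feat, video_feat], dim=-1)
        return self.net(fused)

# Example with dummy feature vectors for a single utterance.
clf = EmpathyOpportunityClassifier()
logits = clf(torch.randn(1, 768), torch.randn(1, 88), torch.randn(1, 35))
print(logits.argmax(dim=-1))  # index of the predicted class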
Research Question 3, in its generality, poses a challenge in that its answer is context-dependent,
necessitating a focused lens to decipher its intricacies. The scope of potential contexts wherein these
questions might arise is expansive, spanning diverse scenarios. While most dialogue systems primarily cater
to verbal communication with hearing adults, we thought it would be illuminating to explore an outlier
scenario, one at the boundary of these contexts—a scenario involving non-verbal infants with hearing
impairments, with whom we aspire to engage through storytelling. To further refine this context and
elucidate our investigative journey, we embarked on an extensive, multi-year research project, extensively
documented in Chapter 5. In this effort, we designed and implemented a multimodal system called RAVE,
engineered to facilitate interactions with hearing-impaired infants who lack both verbal and acoustic
modalities for communication. The RAVE system specializes in orchestrating multimodal, multiparty
engagements tailored specifically for deaf and hard-of-hearing infants. The overarching objective of this
platform is to serve as a formidable tool for facilitating the teaching of sign language to this unique
demographic. At the heart of RAVE’s architecture lies a dynamic interplay between a virtual human and an
embodied robot, creating a three-way interaction with the infant. This choreography integrates a multitude
of sensory modalities to convey diverse forms of feedback, including the presentation of sign language
patterns in the form of "nursery rhymes." It is worth noting that while our experimental design is rooted in
a scenario-specific context, it aspires to yield valuable insights and responses to the enigmatic challenges
posed by Research Question 3.
We addressed our fourth research question by designing and developing a multimodal deep neural
network, which is capable of predicting the user’s behavior based on multimodal feedback cues. For this
experiment, we used a high-stress human-agent interaction scenario in which decisions must be made,
communicated, and agreed upon in a short amount of time to avoid dire consequences. We studied the correlation between individuals’ behavioral cues and task performance while interacting with and listening to the stories of several synthetic agents with different personalities and difficulty levels in this scenario. The agent in this scenario acts as an active third-party story-listener, observing a story-telling interaction between the user and the system (which is in fact another agent). Utilizing visual, audio, and text data from a multimodal corpus, we demonstrated the model’s ability to predict user performance with a reasonable accuracy of 72% and provided a formulation for the agent to intervene when it seems necessary, thus changing the flow of the dialogue and resulting in higher performance of the overall system. We showed that equipping the system with this predictive model and having the agent intervene when it predicts user failure would boost the overall system’s performance.
7.2 Future Directions
Recent advances in the area of large language models (such as OpenAI ChatGPT (chat.openai.com) or Google Bard (https://bard.google.com/chat)) have lifted the boundaries of the Natural Language Generation field beyond what was imaginable with conventional transformer-based models. These models stand out for their ability to operate across diverse domains with minimal supervision and in a zero-shot setting. They are specifically tuned on human instructions and are incentivized to align their responses with human judgments.
massive datasets of text and code, and can be used for a variety of tasks, including story generation,
translation, and summarization. Importantly, their training approach ensures that their responses align
with human preferences, covering both informativeness and safety considerations. They have the potential
to revolutionize the way stories are generated and told by making it possible to generate more creative,
personalized, and accessible stories. For example, LLMs can be used to generate stories based on a user’s
preferences, or to create stories in different languages. LLMs can also be used to make stories more
interactive and engaging.
Multimodal LLMs (such as GPT-4) are a type of LLM that can generate and understand text, as well as
other forms of media, such as images, audio, and video. These models have the potential to create even
more immersive and engaging storytelling experiences. For example, a multimodal LLM could be used
to generate a virtual reality story experience, or to create a personalized bedtime story for a child that
includes images and audio.
Within the scope of my thesis, large language models can play a dual role: first, in perceiving human feedback, and second, in tailoring stories. We will go through each of these roles in more detail below.
7.2.1 Perceiving Human Emotions
Large language models have shown exceptional abilities to perceive a wide range of human emotions.
Additionally, the introduction of multimodal large language models like GPT-4 or Violet extends this
functionality into a multimodal setting, allowing simultaneous processing of video and text prompts. This
expansion broadens the range of potential applications and possibilities.
Figure 7.1 shows a screenshot of a sample prompt that was given to ChatGPT. The prompt includes a simple conversation between two people (person A and person B) talking about person B’s dissertation defense. Person A is essentially asking person B how the defense went. We ask ChatGPT to list the most significant mental or emotional states of each person involved in this conversation.
Figure 7.1: Sample prompt to ChatGPT regarding perceiving the emotional state
In Figure 7.2, the response of ChatGPT to the given prompt is depicted. Notably, ChatGPT accurately identifies curiosity, empathy, and sympathy in relation to person A, while also recognizing nervousness, relief, and resilience in the context of person B. Despite the simplicity of this example, it serves to illustrate such models’ proficiency in comprehending and classifying emotions with minimal supervision.
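For readers who prefer to reproduce this kind of probe programmatically rather than through the web interface, the sketch below shows how a comparable prompt could be issued with the OpenAI Python client (openai version 1.x). The abridged conversation text and the model name are stand-ins for the exact prompt shown in Figure 7.1.

from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

# Abridged stand-in for the dialogue shown in Figure 7.1.
conversation = (
    "Person A: How did your dissertation defense go today?\n"
    "Person B: I was so nervous beforehand, but it went well and I passed!"
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": "For the conversation below, list the most significant mental "
                   "or emotional states of each person.\n\n" + conversation,
    }],
)
print(response.choices[0].message.content)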
7.2.2 Tailoring Stories
One of the most promising areas of future work is the use of multimodal LLMs to incorporate human
feedback into the storytelling process. This could be done in a variety of ways. For example, a multimodal
LLM could be used to generate a story based on a set of prompts provided by a human user. The user could
then provide feedback on the generated story, and the LLM could use this feedback to generate a new
version of the story that is more in line with the user’s preferences.
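A minimal sketch of such a feedback loop is shown below, assuming a chat-style LLM interface (here the OpenAI client with an illustrative model name); the prompts, revision wording, and stopping rule are placeholders rather than a finished system.

from openai import OpenAI

client = OpenAI()

def tell_story_with_feedback(initial_prompt, max_revisions=3):
    # Draft a story, then repeatedly fold the user's feedback back into the
    # conversation so each new draft is conditioned on the previous draft
    # and the user's reaction to it.
    messages = [{"role": "user", "content": initial_prompt}]
    story = ""
    for _ in range(max_revisions + 1):
        reply = client.chat.completions.create(model="gpt-4", messages=messages)
        story = reply.choices[0].message.content
        print(story, "\n---")
        feedback = input("Your feedback (leave blank to accept): ").strip()
        if not feedback:
            break
        messages.append({"role": "assistant", "content": story})
        messages.append({"role": "user", "content": "Please revise the story: " + feedback})
    return story

# Example usage:
# tell_story_with_feedback("Tell a short bedtime story about a curious robot.")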
Figure 7.2: ChatGPT’s response characterizing the emotional state of the people in the given conversation
Figure 7.3: Follow-up conversation with ChatGPT about tailoring the following utterance based on the previous context
Another promising area of future work is the use of LLMs to create personalized storytelling experiences.
For example, an LLM could be used to create a bedtime story for a child based on the child’s name, interests,
and preferences. The child could then provide feedback on the story, and the LLM could use this feedback
to generate a new story that is more tailored to the child’s individual needs.
Continuing the sample conversation discussed in Section 7.2.1, Figure 7.3 presents a subsequent prompt directed to ChatGPT along with ChatGPT’s response. This prompt instructs ChatGPT to generate a suitable utterance for person A, who intends to convey unwelcome news to person B about their paper not being accepted for publication. As observed, ChatGPT effectively leverages information from the preceding dialogue, referencing the defense that took place that day, and adeptly delivers the unfavorable news in a manner that aligns with the mental state of person B.
7.2.3 Prospective Paths of Further Exploration
Overall, the use of multimodal LLMs and human feedback has the potential to make storytelling more
engaging, immersive, and accessible to people of all ages and abilities.
In addition to the specific examples mentioned above, here are some other potential future directions
for research in this area:
• Developing new methods for multimodal LLMs to generate and understand different types of stories,
such as interactive stories, non-fiction stories, and stories in different languages.
• Developing new methods for multimodal LLMs to incorporate human feedback into the storytelling
process in real time.
• Developing new methods for multimodal LLMs to learn from and adapt to the preferences of individual
users.
• Conducting user studies to evaluate the effectiveness of different methods for using multimodal LLMs
and human feedback in storytelling.
I believe that research in this area has the potential to lead to significant advances in the field of
storytelling. Multimodal LLMs and human feedback have the potential to create new and innovative forms
of storytelling experiences that are more engaging, immersive, and accessible to everyone.
Alan C Evans. “Speech-like cerebral activity in profoundly deaf people processing signed languages:
implications for the neural basis of human language”. In: Proceedings of the National Academy of
Sciences 97.25 (2000), pp. 13961–13966.
[210] Laura-Ann Petitto. Hearing Babies Respond to Language’s Patterning and Socially-Contingent
Interactions with a Signing Avatar: Insights into Human Language Acquisition.
[211] Laura-Ann Petitto. “How the brain begets language”. In: The cambridge companion to chomsky
(2005), pp. 84–101.
[212] Laura-Ann Petitto. “The Impact of Minimal Language Experience on Children During Sensitive
Periods of Brain and Early Language Development: Myths Debunked and New Policy Implications”.
In: retrieved from http://petitto.net/wp-content/uploads/2014/04/Petitto_Minimal-LanguageExperience_Final_Oct-6-2017.pdf.
[213] Laura-Ann Petitto, Melody S Berens, Ioulia Kovelman, Matt H Dubins, K Jasinska, and M Shalinsky.
“The “Perceptual Wedge Hypothesis” as the basis for bilingual babies’ phonetic processing
advantage: New insights from fNIRS brain imaging”. In: Brain and language 121.2 (2012),
pp. 130–143.
[214] Laura-Ann Petitto, Clifton Langdon, Adam Stone, Diana Andriola, Geo Kartheiser, and
Casey Cochran. “Visual sign phonology: Insights into human reading and language from a natural
soundless phonology”. In: Wiley Interdisciplinary Reviews: Cognitive Science 7.6 (2016), pp. 366–381.
[215] Farzad Pezeshkpour, Ian Marshall, Ralph Elliott, and J Andrew Bangham. “Development of a legible
deaf-signing virtual human”. In: Multimedia Computing and Systems, 1999. IEEE International
Conference on. Vol. 1. IEEE. 1999, pp. 333–338.
[216] L. Polanyi. Telling the American story: a structural and cultural analysis of conversational storytelling.
Language and being. Ablex Publishers, 1985. isbn: 9780893910419.
167
[217] Soujanya Poria, Erik Cambria, and Alexander Gelbukh. “Deep convolutional neural network textual
features and multiple kernel learning for utterance-level multimodal sentiment analysis”. In:
Proceedings of the 2015 conference on empirical methods in natural language processing. 2015,
pp. 2539–2544.
[218] Soujanya Poria, Erik Cambria, Newton Howard, Guang-Bin Huang, and Amir Hussain. “Fusing
audio, visual and textual clues for sentiment analysis from multimodal content”. In:
Neurocomputing 174 (2016), pp. 50–59.
[219] Soujanya Poria, Iti Chaturvedi, Erik Cambria, and Amir Hussain. “Convolutional MKL based
multimodal emotion recognition and sentiment analysis”. In: 2016 IEEE 16th international conference
on data mining (ICDM). IEEE. 2016, pp. 439–448.
[220] Sarah M Rabbitt, Alan E Kazdin, and Brian Scassellati. “Integrating socially assistive robotics into
mental healthcare interventions: Applications and recommendations for expanded use”. In: Clinical
psychology review 35 (2015), pp. 35–46.
[221] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever.
“Robust speech recognition via large-scale weak supervision”. In: arXiv preprint arXiv:2212.04356
(2022).
[222] Akilesh Rajavenkatanarayanan, Ashwin Ramesh Babu, Konstantinos Tsiakas, and Fillia Makedon.
“Monitoring task engagement using facial expressions and body postures”. In: Proceedings of the 3rd
International Workshop on Interactive and Spatial Computing. 2018, pp. 103–108.
[223] Ragunathan Rajkumar, Michael Gagliardi, and Lui Sha. “The real-time publisher/subscriber
inter-process communication model for distributed real-time systems: design and implementation”.
In: Real-Time Technology and Applications Symposium, 1995. Proceedings. IEEE. 1995, pp. 66–75.
[224] Aditi Ramachandran, Alexandru Litoiu, and Brian Scassellati. “Shaping productive help-seeking
behavior during robot-child tutoring interactions”. In: 2016 11th ACM/IEEE International Conference
on Human-Robot Interaction (HRI). IEEE. 2016, pp. 247–254.
[225] Hiranmayi Ranganathan, Shayok Chakraborty, and Sethuraman Panchanathan. “Multimodal
emotion recognition using deep learning architectures”. In: 2016 IEEE winter conference on
applications of computer vision (WACV). IEEE. 2016, pp. 1–9.
[226] Marco Rathschlag and Daniel Memmert. “Self-generated emotions and their influence on sprint
performance: An investigation of happiness and anxiety”. In: Journal of Applied Sport Psychology
27.2 (2015), pp. 186–199.
[227] Judy Snitzer Reilly, Marina McIntire, and Ursula Bellugi. “The acquisition of conditionals in
Language: Grammaticized facial expressions”. In: Applied Psycholinguistics 11.4 (1990), pp. 369–392.
[228] Rebekah A Richert, Michael B Robb, and Erin I Smith. “Media as social partners: The social nature
of young children’s learning from screen media”. In: Child Development 82.1 (2011), pp. 82–95.
168
[229] Jeff Rickel and W Lewis Johnson. “Animated agents for procedural training in virtual reality:
Perception, cognition, and motor control”. In: Applied artificial intelligence 13.4-5 (1999),
pp. 343–382.
[230] Jeff Rickel and W Lewis Johnson. “Virtual humans for team training in virtual reality”. In:
Proceedings of the ninth international conference on artificial intelligence in education. Vol. 578.
Citeseer. 1999, p. 585.
[231] Jeff Rickel, Stacy Marsella, Jonathan Gratch, Randall Hill, David Traum, and William Swartout.
“Toward a new generation of virtual humans for interactive experiences”. In: IEEE Intelligent
Systems 17.4 (2002), pp. 32–38.
[232] Susan Robinson, David Traum, Midhun Ittycheriah, and Joe Henderer. “What Would You Ask a
Conversational Agent? Observations of Human-Agent Dialogues in a Museum Setting”. In:
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC).
Marrakech, Morocco, 2008.
[233] R Craig Roney. “Back to the basics with storytelling”. In: The Reading Teacher 42.7 (1989),
pp. 520–523.
[234] Verónica Pérez Rosas, Rada Mihalcea, and Louis-Philippe Morency. “Multimodal sentiment analysis
of spanish online videos”. In: IEEE intelligent systems 28.3 (2013), pp. 38–45.
[235] Krystyna Rymarczyk, Cezary Biele, Anna Grabowska, and Henryk Majczynski. “EMG activity in
response to static and dynamic facial expressions”. In: International Journal of Psychophysiology
79.2 (2011), pp. 330–333.
[236] K. Ryokai, C. Vaucelle, and J. Cassell. “Virtual peers as partners in storytelling and literacy
learning”. In: Journal of Computer Assisted Learning 19.2 (2003), pp. 195–208. issn: 1365-2729. doi:
10.1046/j.0266-4909.2003.00020.x.
[237] Jenny R Saffran, Ann Senghas, and John C Trueswell. “The acquisition of language by children”. In:
Proceedings of the National Academy of Sciences 98.23 (2001), pp. 12874–12875.
[238] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. “DistilBERT, a distilled version
of BERT: smaller, faster, cheaper and lighter”. In: ArXiv abs/1910.01108 (2019).
[239] Brian Scassellati, Henny Admoni, and Maja Matarić. “Robots for use in autism research”. In: Annual
review of biomedical engineering 14 (2012), pp. 275–294.
[240] Brian Scassellati, Jake Brawer, Katherine Tsui, Setareh Nasihati Gilani, Melissa Malzkuhn,
Barbara Manini, Adam Stone, Geo Kartheiser, Arcangelo Merla, Ari Shapiro, et al. “Teaching
language to deaf infants with a robot and a virtual human”. In: Proceedings of the 2018 CHI
Conference on Human Factors in Computing Systems. ACM. 2018, p. 553.
[241] Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. “wav2vec: Unsupervised
pre-training for speech recognition”. In: arXiv preprint arXiv:1904.05862 (2019).
169
[242] Jerry Schnepp, Rosalee Wolfe, John McDonald, and Jorge Toro. “Generating co-occurring facial
nonmanual signals in synthesized American sign language”. In: (2013).
[243] Beate Seibt, Andreas Mühlberger, Katja U Likowski, and Peter Weyers. “Facial mimicry in its social
setting”. In: Frontiers in psychology 6 (2015), p. 1122.
[244] Jun’ichiro Seyama and Ruth S Nagayama. “The uncanny valley: Effect of realism on the impression
of artificial human faces”. In: Presence 16.4 (2007), pp. 337–351.
[245] Ari Shapiro. “Building a character animation system”. In: INTERNATIONAL Conference on Motion in
Games. Springer. 2011, pp. 98–109.
[246] Yoshiaki Shirai and Hirochika Inoue. “Guiding a robot by visual feedback in assembling tasks”. In:
Pattern recognition 5.2 (1973), pp. 99–108.
[247] Ben Shneiderman and Pattie Maes. “Direct manipulation vs. interface agents”. In: interactions 4.6
(1997), pp. 42–61.
[248] André Silva, Guilherme Raimundo, and Ana Paiva. “Tell me that bit again... bringing interactivity to
a virtual storyteller”. In: International conference on virtual storytelling. Springer. 2003, pp. 146–154.
[249] Clare E Sims and Eliana Colunga. “Language Development in the Age of Baby Media: What we
Know and What Needs to be Done”. In: Cascadilla Press. Retrieved on March 20 (2013), p. 2016.
[250] Apple Inc. Siri Team. Hey Siri: An On-device DNN-powered Voice Trigger for Apple’s Personal
Assistant. 2017. url: https://machinelearning.apple.com/research/hey-siri.
[251] Jeffrey L Sokolov. “A local contingency analysis of the fine-tuning hypothesis.” In: Developmental
psychology 29.6 (1993), p. 1008.
[252] Mohammad Soleymani, David Garcia, Brendan Jou, Björn Schuller, Shih-Fu Chang, and Maja Pantic.
“A survey of multimodal sentiment analysis”. In: Image and Vision Computing 65 (2017), pp. 3–14.
[253] Mohammad Soleymani, Maja Pantic, and Thierry Pun. “Multimodal emotion recognition in
response to videos”. In: IEEE transactions on affective computing 3.2 (2011), pp. 211–223.
[254] Arielle Spellun and Poorna Kushalnagar. “Sign language for deaf infants: A key intervention for a
developmental emergency”. In: Clinical pediatrics 57.14 (2018), pp. 1613–1615.
[255] statista. Smart speaker unit shipments.
https://www.statista.com/statistics/792598/worldwide-smart-speaker-unit-shipment/. Accessed:
2022-01-25. Jan. 2022.
[256] Glenn Stockwell. “A review of technology choice for teaching language skills and areas in the
CALL literature”. In: ReCALL 19.2 (2007), pp. 105–120.
[257] Adam Stone, Laura-Ann Petitto, and Rain Bosworth. “Visual sonority modulates infants’ attraction
to sign language”. In: Language Learning and Development 14.2 (2018), pp. 130–148.
170
[258] George Stricker and Martin Fisher. Self-disclosure in the therapeutic relationship. Springer, 1990.
[259] Katelyn Swift-Spong, Cheng K. Fred Wen, Donna Spruijt-Metz, and Maja J. Matarić. “Comparing
Backstories of a Socially Assistive Robot Exercise Buddy for Adolescent Youth”. In: 25th IEEE
International Symposium on Robot and Human Interactive Communication. New York, NY, Aug. 2016.
url: http://robotics.usc.edu/publications/930/.
[260] Catherine S Tamis-LeMonda, Marc H Bornstein, and Lisa Baumwell. “Maternal responsiveness and
children’s achievement of language milestones”. In: Child development 72.3 (2001), pp. 748–767.
[261] Andrea Tartaro and Justine Cassell. “Authorable virtual peers for autism spectrum disorders”. In:
Proceedings of the Combined workshop on Language-Enabled Educational Technology and
Development and Evaluation for Robust Spoken Dialogue Systems at the 17th European Conference on
Artificial Intellegence. Vol. 28. 2006.
[262] Leili Tavabi, Kalin Stefanov, Setareh Nasihati Gilani, David Traum, and Mohammad Soleymani.
“Multimodal learning for identifying opportunities for empathetic responses”. In: 2019 International
Conference on Multimodal Interaction. 2019, pp. 95–104.
[263] Jean-Philippe Thiran, Ferran Marques, and Hervé Bourlard. Multimodal Signal Processing: Theory
and applications for human-computer interaction. Academic Press, 2009.
[264] Claire To, Setareh Nasihati Gilani, and David Traum. “Common Strategy Patterns of Persuasion in a
Mission Critical and Time Sensitive Task”. In: Proceedings of the 27th Workshop on the Semantics and
Pragmatics of Dialogue. 2023.
[265] Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. “Videomae: Masked autoencoders are
data-efficient learners for self-supervised video pre-training”. In: arXiv preprint arXiv:2203.12602
(2022).
[266] Michelle Starr. Toshiba’s new robot can speak in sign language.
https://www.cnet.com/news/toshibas-new-robot-can-speak-in-sign-language/. 2014. (Visited on
10/08/2014).
[267] Shubham Toshniwal, Haoyue Shi, Bowen Shi, Lingyu Gao, Karen Livescu, and Kevin Gimpel. “A
Cross-Task Analysis of Text Span Representations”. In: Proceedings of the 5th Workshop on
Representation Learning for NLP. Online: Association for Computational Linguistics, July 2020,
pp. 166–176. doi: 10.18653/v1/2020.repl4nlp-1.20.
[268] David Traum, Priti Aggarwal, Ron Artstein, Susan Foutz, Jillian Gerten, Athanasios Katsamanis,
Anton Leuski, Dan Noren, and William Swartout. “Ada and Grace: Direct Interaction with Museum
Visitors”. In: The 12th International Conference on Intelligent Virtual Agents (IVA). Santa Cruz, CA,
Sept. 2012. url: http://ict.usc.edu/pubs/Ada%20and%20Grace-
%20Direct%20Interaction%20with%20Museum%20Visitors.pdf.
[269] David Traum, Priti Aggarwal, Ron Artstein, Susan Foutz, Jillian Gerten, Athanasios Katsamanis,
Anton Leuski, Dan Noren, and William Swartout. “Ada and Grace: Direct interaction with museum
visitors”. In: International conference on intelligent virtual agents. Springer. 2012, pp. 245–251.
171
[270] David Traum, Andrew Jones, Kia Hays, Heather Maio, Oleg Alexander, Ron Artstein, Paul Debevec,
Alesia Gainer, Kallirroi Georgila, Kathleen Haase, et al. “New Dimensions in Testimony: Digitally
preserving a Holocaust survivor’s interactive storytelling”. In: International Conference on
Interactive Digital Storytelling. Springer. 2015, pp. 269–281.
[271] David Traum and Staffan Larsson. “The Information State Approach to Dialogue Management”. In:
Current and New Directions in Discourse and Dialogue. Ed. by Jan van Kuppevelt and Ronnie Smith.
Kluwer, 2003, pp. 325–353.
[272] Panagiotis Tzirakis, George Trigeorgis, Mihalis A Nicolaou, Björn W Schuller, and
Stefanos Zafeiriou. “End-to-end multimodal emotion recognition using deep neural networks”. In:
IEEE Journal of selected topics in signal processing 11.8 (2017), pp. 1301–1309.
[273] Pınar Uluer, Neziha Akalın, and Hatice Köse. “A new robotic platform for sign language tutoring”.
In: International Journal of Social Robotics 7.5 (2015), pp. 571–585.
[274] Lionel Sujay Vailshery. Smart speakers - Statistics & Facts.
https://www.statista.com/topics/4748/smart-speakers/. Accessed: 2022-01-24. Aug. 2021.
[275] Mikel Val-Calvo, José Ramón Álvarez-Sánchez, José Manuel Ferrández-Vicente, and
Eduardo Fernández. “Affective robot story-telling human-robot interaction: Exploratory real-time
emotion estimation analysis using facial expressions and physiological signals”. In: IEEE Access 8
(2020), pp. 134051–134066.
[276] Lotte F Van Dillen, Dirk J Heslenfeld, and Sander L Koole. “Tuning down the emotional brain: an
fMRI study of the effects of cognitive load on the processing of affective images”. In: Neuroimage
45.4 (2009), pp. 1212–1219.
[277] Marynel Vázquez, Aaron Steinfeld, Scott E Hudson, and Jodi Forlizzi. “Spatial and other social
engagement cues in a child-robot interaction: Effects of a sidekick”. In: Proceedings of the 2014
ACM/IEEE international conference on Human-robot interaction. 2014, pp. 391–398.
[278] Joshua Wainer, David J Feil-Seifer, Dylan A Shell, and Maja J Mataric. “The role of physical
embodiment in human-robot interaction”. In: ROMAN 2006-The 15th IEEE International Symposium
on Robot and Human Interactive Communication. IEEE. 2006, pp. 117–122.
[279] Dakuo Wang, Pattie Maes, Xiangshi Ren, Ben Shneiderman, Yuanchun Shi, and Qianying Wang.
“Designing AI to Work with or for People?” In: Extended Abstracts of the 2021 CHI Conference on
Human Factors in Computing Systems. 2021, pp. 1–5.
[280] Eva Wiese, Agnieszka Wykowska, and Hermann J Müller. “What we observe is biased by what
other people tell us: Beliefs about the reliability of gaze behavior modulate attentional orienting to
gaze cues”. In: PloS one 9.4 (2014), e94529.
[281] Blake S Wilson, Charles C Finley, Dewey T Lawson, Robert D Wolford, Donald K Eddington, and
William M Rabinowitz. “Better speech recognition with cochlear implants”. In: Nature 352.6332
(1991), pp. 236–238.
172
[282] Hugh R Wilson, Frances Wilkinson, Li-Ming Lin, and Maja Castillo. “Perception of head
orientation”. In: Vision research 40.5 (2000), pp. 459–472.
[283] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi,
Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer,
Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger,
Mariama Drame, Quentin Lhoest, and Alexander Rush. “Transformers: State-of-the-Art Natural
Language Processing”. In: EMNLP 2020: System Demonstrations. Online: Association for
Computational Linguistics, 2020, pp. 38–45. doi: 10.18653/v1/2020.emnlp-demos.6.
[284] Martin Wöllmer, Angeliki Metallinou, Florian Eyben, Björn Schuller, and Shrikanth Narayanan.
“Context-sensitive multimodal emotion recognition from speech and facial expression using
bidirectional lstm modeling”. In: Proc. INTERSPEECH 2010, Makuhari, Japan. 2010, pp. 2362–2365.
[285] Joshua Wei-Ern Wong and Kevin McGee. “Frown more, talk more: Effects of facial expressions in
establishing conversational rapport with virtual agents”. In: Intelligent Virtual Agents: 12th
International Conference, IVA 2012, Santa Cruz, CA, USA, September, 12-14, 2012. Proceedings 12.
Springer. 2012, pp. 419–425.
[286] Bo Xiao, Dogan Can, Panayiotis G Georgiou, David Atkins, and Shrikanth S Narayanan. “Analyzing
the language of therapist empathy in motivational interview based psychotherapy”. In: Proceedings
of The 2012 Asia Pacific Signal and Information Processing Association Annual Summit and
Conference. IEEE. 2012, pp. 1–4.
[287] Bo Xiao, Zac E Imel, Panayiotis G Georgiou, David C Atkins, and Shrikanth S Narayanan. “" Rate
My Therapist": Automated Detection of Empathy in Drug and Alcohol Counseling via Speech and
Language Processing”. In: PloS one 10.12 (2015), e0143055.
[288] Han Xiao. bert-as-service. https://github.com/hanxiao/bert-as-service. 2018.
[289] Runzhe Yang, Jingxiao Chen, and Karthik Narasimhan. “Improving dialog systems for negotiation
with personality modeling”. In: arXiv preprint arXiv:2010.09954 (2020).
[290] Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. “Tensor
fusion network for multimodal sentiment analysis”. In: arXiv preprint arXiv:1707.07250 (2017).
[291] Jie Zhang, Shuo Li, Jing-Yu Zhang, Feng Du, Yue Qi, and Xun Liu. “A literature review of the
research on the uncanny valley”. In: International Conference on Human-Computer Interaction.
Springer. 2020, pp. 255–268.
[292] Zhengyou Zhang. “Microsoft kinect sensor and its effect”. In: IEEE multimedia 19.2 (2012), pp. 4–10.
[293] Ran Zhao, Alexandros Papangelis, and Justine Cassell. “Towards a Dyadic Computational Model of
Rapport Management for Human-Virtual Agent Interaction”. English. In: Intelligent Virtual Agents.
Ed. by Timothy Bickmore, Stacy Marsella, and Candace Sidner. Vol. 8637. Lecture Notes in
Computer Science. Springer, 2014, pp. 514–527. isbn: 978-3-319-09766-4. doi:
10.1007/978-3-319-09767-1_62.
173
[294] Ran Zhao, Alexandros Papangelis, and Justine Cassell. “Towards a dyadic computational model of
rapport management for human-virtual agent interaction”. In: Intelligent Virtual Agents: 14th
International Conference, IVA 2014, Boston, MA, USA, August 27-29, 2014. Proceedings 14. Springer.
2014, pp. 514–527.
[295] Lynette van Zijl and Jaco Fourie. “The development of a generic signing avatar”. In: Proceedings of
the IASTED International Conference on Graphics and Visualization in Engineering, GVE. Vol. 7. 2007,
pp. 95–100.
174
Abstract
People use verbal and nonverbal feedback from their interlocutors to guide their own behavior and alter the flow of conversation. This thesis focuses on human-machine interactions that involve storytelling and investigates, from the machine's perspective, how such feedback can be understood and generated. We first explored which characteristics of stories machines should use to increase rapport. We then developed machine storytellers and listeners that provide feedback and adapt their stories based on perceived multimodal feedback from their users. Finally, we investigated how machines can use real-time predictions based on user feedback to adapt the system's dialogue management policies for better overall performance.
Asset Metadata
Creator
Nasihati Gilani, Setareh (author)
Core Title
Understanding and generating multimodal feedback in human-machine story-telling
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science
Degree Conferral Date
2023-12
Publication Date
01/17/2024
Defense Date
09/26/2023
Publisher
Los Angeles, California (original), University of Southern California (original), University of Southern California. Libraries (digital)
Tag
dialogue management, human agent interaction, human computer interaction, multimodal feedback generation, multimodal interaction, multimodal neural networks, multiparty interactions, OAI-PMH Harvest, virtual storytelling
Format
theses (aat)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Traum, David (committee chair), Georgila, Kallirroi (committee member), Kim, Peter (committee member), Mataric, Maja (committee member), Soleymani, Mohammad (committee member)
Creator Email
nasihati.s@gmail.com,nasihati@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC113804861
Unique identifier
UC113804861
Identifier
etd-NasihatiGi-12607.pdf (filename)
Legacy Identifier
etd-NasihatiGi-12607
Document Type
Dissertation
Rights
Nasihati Gilani, Setareh
Internet Media Type
application/pdf
Type
texts
Source
20240118-usctheses-batch-1120 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu