Multiparty Human-Robot Interaction:
Methods For Facilitating Social Support
by
Christopher M. Birmingham
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
May 2023
Copyright 2023 Christopher M. Birmingham
This dissertation is dedicated to my partner in everything, my dearest, my love, Libby.
Acknowledgements
My Ph.D. has been made possible through the help of more people than I could possibly acknowledge; however, I would like to take a moment to thank my advisor, my thesis committee, my collaborators and co-authors, the NSF, and my own support network.
I would like to acknowledge the incredible support of Maja, my advisor and mentor. Maja has
guided and encouraged me from the beginning of my Ph.D., and has stood up for me whenever
challenges arose. Maja has an incredible gift for inspiring others, and it has been an honor to learn
by her side.
I would also like to offer my thanks to my mentors and collaborators. Professors Mohammad
Soleymani and Lynn Miller have played a huge role in shaping and improving my research and
I feel incredibly fortunate to have had the opportunity to write grants and do research with them.
I would also like to thank my collaborator, Professor Kalin Stefanov at Monash University, who
has been an amazing mentor and collaborator on my turn-taking research. My research would also
not have been possible without the support of several amazing undergraduate students, including
Zijian Hu, Kartik Mahajan, Eli Reber, and Ashley Perez.
I would be remiss if I did not thank the National Science Foundation and the University of
Southern California for funding my research. Thank you for making this work possible.
Finally, I would like to thank my family, friends, and labmates, who have helped me through the trials and triumphs of the Ph.D. To my parents Mike and Margaret, for their unwavering support and belief that I would make it through. To my friends, especially David Millard, Isabel Rayas, and Tyler and Travis Laferriere-Holloway, for hosting me and helping me through some of the darkest times. To my labmates, especially Thomas Groechel, whose dark humor has gotten me through several grant proposals, and Lauren Klein, whose sparkling personality has made every day in the lab so much brighter; but also to Nathan Dennler, Zhonghao Shi, Mina Kian, and Amy O'Connell for making the Interaction Lab the wondrous place that it is.
To everyone who has contributed to my Ph.D., I offer my sincere gratitude and appreciation.
Thank you.
Table of Contents
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgements. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
1: Chapter One: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Situated Group Interaction Dynamics . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Evaluation Domain: Support Groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Support Groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.2 Robot as Facilitator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Motivation and Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Dissertation Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2: Chapter Two: Background and Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1 Multiparty Human-Robot Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.1 Turn-Taking Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.2 Active Speaker Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Socially Assistive Robotics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.1 Trust in Socially Assistive Robotics and Support Groups . . . . . . . . . . . . . 11
2.2.2 Measuring Empathy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.3 Empathy in Socially Assistive Robotics . . . . . . . . . . . . . . . . . . . . . . 13
2.2.4 Mediation/Facilitation in Socially Assistive Robotics . . . . . . . . . . . . . . . 14
2.3 Evaluation Domain: Support Groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.1 Support Groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.2 Support Group Facilitation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3: Chapter Three: Modeling Turn-Taking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.1 Approach to Turn-Taking Using Group Attention . . . . . . . . . . . . . . . . . . . . . . 18
3.2 Evaluation of the Turn-Taking Approach . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2.1 Active Speaker Detection Methodology . . . . . . . . . . . . . . . . . . . . . . 20
3.2.2 Turn-taking Prediction Methodology . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4: Chapter Four: Modeling Changes in Trust. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.1 Approach to Measuring Trust Change in Robot-Facilitated Support Groups . . . . . . . . 36
4.2 Evaluation of the Trust Change Measure . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5: Chapter Five: Modeling Social Support Through Perceptions of Empathy. . . . . . . . . . . . . . . . . . . 48
5.1 Approach to Studying Perceptions of Robot Empathy . . . . . . . . . . . . . . . . . . . . 48
5.2 Evaluating Perceptions of Robot Empathy . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.2.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6: Chapter Six: Support Group Facilitation Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.1 Approach to Directing and Role Modeling for Facilitation . . . . . . . . . . . . . . . . . 65
6.2 Evaluation of Directing and Role Modeling for Facilitation . . . . . . . . . . . . . . . . . 68
6.2.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
6.2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
7: Chapter Seven: Summary and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
7.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
7.2 Future Direction of Facilitating Social Support for SAR . . . . . . . . . . . . . . . . . . 75
7.3 Final Words . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
List of Tables
3.1 ASD Evaluation: performance of the synchronizer and detectors. Mean mAP and
standard deviation are shown in parentheses. . . . . . . . . . . . . . . . . . . . . . 25
3.2 TTP Validation Results: UAR performance of models augmented by Binary
and Continuous VisualAttention features on the validation set provided by the
MultiMediate competition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3 TTP Test Results: UAR performance of competitor models and our synchronizer
model augmented by Binary and Continuous VisualAttention features on the
held-out test set provided by the MultiMediate competition. . . . . . . . . . . . . . 31
4.1 Example questions and disclosures spoken by the robot, indicating sensitivity. A
total of 16 questions and 6 disclosures were available; an average of 12 questions
and 3 disclosures were made by the robot in each session. . . . . . . . . . . . . . . 39
4.2 Academic Support Group: correlation table; values with p<0.05 are bold . . . . . . . . 41
4.3 Overall Trust Test Statistics: overall change in trust given to each corresponding
entity by the session participants . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.1 Empathy Study Examples: actor disclosures were structured in the format of: “I
feel X because Y” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.2 Empathy Study Belief Measurement: the participant belief and the corresponding
statement used to gauge participant belief in the interaction. The participants rated
their agreement with each statement on a five-point Likert scale. . . . . . . . . . . 54
5.3 Empathy Study Agreement: mean rating of agreement with each belief statement
for each condition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
6.1 Facilitator Action Space: examples of the SAR facilitator speech action space
designed for facilitating through role modeling and eliciting empathy and disclosure. 67
6.2 Pre-Post RoSaS: mean ratings associating social robots with the Warmth,
Competence, and Discomfort subscales of the RoSaS measure before and after the
interaction, standard deviation in parentheses. The scale is from 1 to 9, with higher
values indicating stronger association. . . . . . . . . . . . . . . . . . . . . . . . . 70
6.3 Pre-Post Group Cohesion: mean ratings of the Group Cohesion measure before
and after the interaction, standard deviation in parentheses. The scale is from 0 to
6 with higher numbers indicating greater perceived group cohesion. . . . . . . . . 70
6.4 Facilitator Ratings: mean ratings of the desire to use QT as a facilitator in the
future, standard deviation in parentheses. The scale is from −3 to 3 with higher
numbers indicating greater desire for QT as their support group facilitator. . . . . . 71
List of Figures
3.1 FOVA and RFSG datasets' spatial configuration for the sensors and participants . . 20
3.2 TTP Architecture: the augmented VisualAttention synchronizer consists of a
speech activity confidence score generated by the PerfectMatch model for the
candidate speaker and the group attention score for each group member except the
candidate speaker from the latest available frame. The figure shows the method
for generating continuous rather than binary VisualAttention features. Group
attention and speech activity are combined to produce a label estimate for future
speech for the candidate speaker. . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3 ASD Results: Performance of all cross-dataset speaker-independent models. The
Y -axis is the models’ mAP for different head poses in the range[− 60,60] degrees.
(a) and (c) show the performance of the original detectors and synchronizer on
each dataset. (b) and (d) show the improved performance of the synchronizer
augmented with VisualAttention features. . . . . . . . . . . . . . . . . . . . . . . 30
4.1 Volunteer demonstration of the academic support group study setup; participants
could see one another and the robot but not each other’s computer screens. . . . . . 38
4.2 Overall Trust Box Plot: participants’ trust pre- and post-interaction, relative to the
other group members and the robot. Medians are shown in orange. . . . . . . . . 41
4.3 Overall Trust Box Plot: the change in trust participants felt in group members and
in the robot, illustrating the effect the group interaction had on trust. Medians are
shown in orange. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.4 Factor Analysis Scree Plot: principal components in the composite trust survey
for group members and the robot, before and after the interaction. All plots show
a strong elbow at the second factor, indicating one factor explains most of the
variance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.1 Empathy Study Setup: camera views used to record the interaction: the view
capturing the front of the actor (left) and the view capturing the front of the robot. . 52
5.2 Empathy Study Flow Chart: steps participants completed during the study.
Participants were considered to have completed the full study if they correctly
answered all questions in the Attention Check Questionnaire. . . . . . . . . . . . . 55
5.3 Empathy Study Conditions: Differences between the cognitive and affective
conditions in the rated empathy and belief. Significant differences are indicated
with asterisks: *=p<.05, ****=p<.0001. . . . . . . . . . . . . . . . . . . . . . . 57
5.4 NARS vs Empathy and Belief: the dependent measures plotted against the
NARS scores, including the linear regression line and the shaded 95% confidence
interval. Empathy is plotted in the top row, belief in the bottom. Ratings from the
cognitive condition are on the left, the affective condition in the center, and the
differences between the two are plotted on the right. . . . . . . . . . . . . . . . . . 58
5.5 Belief vs Empathy: correlation between rated belief in the interaction and
perceived empathy of the robot, including the linear regression line and the shaded
95% confidence interval. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.6 Order Effects: difference in participant ratings of dependent measures between
the cognitive first and affective first ordering. Significant differences are indicated
with asterisks: ns=p>.05, *=p<.05, **=p<.01. . . . . . . . . . . . . . . . . . . 60
6.1 Social Support Methods: Visualization of the two proposed social support
methods for SAR facilitators of support groups . . . . . . . . . . . . . . . . . . . 66
6.2 Online Support Group Setup: volunteer demonstration of a robot-facilitated
support group over the video conferencing platform Zoom with a QTrobot. . . . . 67
Abstract
Within the field of Human-Robot Interaction (HRI), Socially Assistive Robotics (SAR) has the
potential to create robots that can facilitate and enhance social interaction among groups of people.
These robots can help to connect individuals into more cohesive and supportive groups. This,
however, is a difficult task as it involves sensing individual attitudes, recognizing group dynamics
and behaving in a socially appropriate way to achieve a goal. This dissertation makes progress on
each of these tasks in order to unlock the potential for SAR to facilitate social support.
This dissertation presents computational models for understanding individual attitudes, such as trust, and group dynamics, such as turn-taking. It also presents a facilitation framework
to provide social support based on empathy and disclosure. These models and frameworks are
validated in the context of a support group.
This dissertation provides a review of related work in multiparty HRI and SAR, and relevant
background on the application domain of support groups. It examines the complexities of modeling
turn-taking, which include detecting the active speaker and predicting the next speaker based on
group attention. Additionally, it investigates the challenges of modeling changes in trust through
an academic support group. Finally, it models social support through perceptions of empathy and
disclosure made by a robot. All of these models form the basis for a support group facilitation
framework, which is evaluated in the context of a cancer support group. Together, these contribu-
tions form the foundation for enabling SAR to interact with groups in real-time and improve the
social dynamics between the group members.
Chapter 1
Chapter One: Introduction
This chapter provides an introduction to robot facilitation of support groups. The
defining problem and goal of this dissertation is motivated by the challenge of under-
standing group dynamics and facilitating social support in these groups. The chapter
concludes with an outline of the rest of the dissertation and a list of primary and sec-
ondary contributions of the dissertation.
1.1 Overview
This dissertation delves into the use of technology to provide a novel type of social support, one
that is designed to bring people together in meaningful, in-person ways, as opposed to the online
connections fostered through social media. Digital connections are not equivalent to real-world
social interactions, and so the goal of this dissertation is to create embodied tools that can support
multiparty social connections. To that end, this dissertation seeks to take an early step towards
creating the necessary tools that can be used to facilitate social support and promote meaningful
connections between individuals.
1.1.1 Situated Group Interaction Dynamics
One of the biggest and most difficult challenges for SAR when interacting with groups of people
is the complexity of the group dynamics in situated interactions. Not only do the robots need to be
able to understand multiple modes of communication, such as verbal and nonverbal cues, but they
also need to be able to understand the complexities of real-time multi-person communication. The
dynamics of groups are complex and can be difficult to comprehend and participate in, even for the
most socially adept humans. SAR must be able to understand and react quickly to the changing
dynamics of the situation in order to effectively participate with a group.
1.2 Evaluation Domain: Support Groups
1.2.1 Support Groups
A support group is a type of social gathering where people come together to share their experiences
and feelings, provide encouragement and advice, offer support to each other, and build a sense of
community (Hu 2017). Support groups can be beneficial for people who are facing a range of dif-
ficult situations, such as mental health issues, physical disabilities, life transitions, or bereavement
(Davison et al. 2000; Gillis and Parish 2019; Hong et al. 2012; Rice et al. 2014). They can provide
a safe and supportive space for people to talk about their struggles and be heard without judgment.
They are often free or low cost and can help individuals feel less alone and more connected to
a community. Additionally, support groups can provide a sense of hope and optimism as people
share their stories and receive encouragement from others. This can help build a sense of belong-
ing and connection, creating a space where people can feel comfortable and supported (Posluszny
et al. 2002; Viswanathan et al. 2020).
Support groups are often facilitated by a professional, such as a trained therapist, psychologist,
social worker, or a volunteer, and can be organized around specific topics. The facilitator typically
encourages members to share their experiences, listen to one another, and offer support in a safe and
non-judgmental environment. The facilitator can also provide valuable resources and information
to help the members gain insight and understanding into their situations, and to help them find
positive solutions to whatever challenges they may be facing (Davison et al. 2000).
1.2.2 Robot as Facilitator
Given the challenges associated with SARs participating in situated groups and the importance of
a support group facilitator, it may be difficult to imagine why it would be advantageous for SARs
to fill the role of support group facilitator. The scarcity of professional and volunteer facilitators
compared to the need for support groups (Banbury et al. 2018) is one of the primary reasons why
SARs may be beneficial in this role. Moreover, SARs have some unique advantages compared
to a human facilitator. Most notably, a robot has much greater computational bandwidth than a
human, allowing it to track the verbal and nonverbal participation of group members throughout
the entire interaction, without introducing bias. Furthermore, because the SAR is not human,
the support group members may feel more comfortable expressing themselves. Finally, the role
of a facilitator is constrained; the primary task of the facilitator is to enable the development of
supportive relationships between the group members. As such, the SAR can provide an effective
and reliable way of achieving this goal.
1.3 Motivation and Problem Statement
In a world where modern information technology has increasingly pulled users into a digital
ecosystem and isolated them in their local environments, socially interactive robots present an
opportunity to engage people in situated prosocial ways. By their embodied nature, robots inher-
ently invite people to come together and interact in physical spaces, which creates opportunities
to assist and improve human-human interaction. Shaping and improving one-on-one and group
interaction dynamics are both specific goals of Socially Assistive Robotics (SAR) (Feil-Seifer and
Matarić 2005; Matarić and Scassellati 2016).
A key research challenge of Human-Robot Interaction (HRI) in general and SAR in particular is
understanding the complex interpersonal dynamics in group settings. In such settings, individuals
interact with one another through verbal and nonverbal signals that are contextual and dynamic. To
successfully interface with and mediate group interactions, robots must be able to recognize human
signals in real time, understand what they mean in a given context, and then choose appropriate
actions to achieve predetermined goals. These goals may involve improving cohesion, communi-
cation, engagement, or trust within the group. Sensing and improving these properties of group
dynamics is challenging because they involve the interaction of many, often subtle, multimodal
signals (Mana et al. 2007; Short et al. 2016).
This dissertation work introduces HRI and SAR into the novel context of support group facilita-
tion. In support group meetings, individuals with a common problem or challenge provide support
to one another, typically with the help of a mediator (Jacobs et al. 2011). Trust is crucial for proper
functioning of support groups, because it is only when participants feel they are in a trustworthy
setting that they are willing to share and receive support (Johnson and Noonan 1972). In most sup-
port groups, the level of trust changes over time as participants make disclosures and experience
supportive responses from others. Group participants regularly evaluate and update their trust in
one another as the session progresses (Corey et al. 2013). Since trust among group members in
a support group setting can change relatively quickly and significantly (Ball et al. 2009), the con-
text is both challenging and well suited for capturing data for training robots to learn the signals
and dynamics associated with changes in trust. Although trust between participants in a support
group typically grows over time (Ball et al. 2009), the skill of the mediator plays an important
role in group success. Corey et al. (2013) emphasize that facilitators must have leadership skills
such as genuineness, caring, openness, self-awareness, active listening, confronting, supporting,
and modeling in order to lead a group effectively.
1.4 Dissertation Contributions
The main contribution of this dissertation is to define and demonstrate how SAR can be lever-
aged in multiparty settings to facilitate social support. This dissertation contributes to the field
of HRI, bringing together SAR and multiparty interaction for the purpose of facilitating social
support.
The following are the primary contributions of this dissertation:
1. Methods for measuring social support, including computational models of empathy and trust.
2. An integrated facilitation framework of elicitation and role modeling for autonomous support
group facilitation.
3. Methods for modeling aspects of turn-taking, including active speaker detection and turn-
taking prediction.
The following are secondary contributions of this dissertation:
1. User studies
• A study of changes in trust in robot-facilitated support groups.
• A study of human acceptance of different types of empathetic responses made by robots
to human disclosures.
• A study of user perceptions of different strategies for facilitation of online support
groups.
2. Implemented and validated open-source software systems. All code for studies throughout
this dissertation is open-sourced and publicly available, including:
• HARMONI – the platform for Human And Robot Modular and OpeN Interactions; a tool
for creating and controlling human-robot interactions meant to speed up development,
collaboration, and experimentation in the HRI community.
• HCI-FACE – the Human-Computer Interaction Facial Animation and Conversation
Engine; HCI-FACE is an open source, fully customizable, full stack interaction en-
gine. It combines a React front end with a Python back end to power an animated face
capable of real-time voice-based interaction.
1.5 Outline
The remainder of this document is organized as follows:
• Chapter 2 defines key terms and provides relevant background and related work in the areas
of multiparty human-robot interaction and socially assistive robotics, as well as the domain
of support groups.
• Chapter 3 describes the development of a model of turn-taking based on group dynamics.
• Chapter 4 presents the development and validation of a model of trust in robot-facilitated
support groups.
• Chapter 5 describes the study of user perceptions of a robot’s empathetic responses to dis-
closures made by a third party.
• Chapter 6 outlines two different methods for facilitating social support: Role Modeling, in which the robot role models the desired behaviors, and Directing, in which the robot encourages the desired behavior of support group members.
• Chapter 7 provides a summary and concluding statements as well as potential open problems
and extensions to the dissertation.
Nota bene: This dissertation includes contributions from multiple researchers, including un-
dergraduate students supervised and mentored by the dissertation author. The contributors
are named in chapters and sections of the dissertation that cover their work. Specifically, a
“Contributors” box is included at the beginning of each chapter or section that includes con-
tributions by other researchers.
Chapter 2
Chapter Two: Background and Related Work
This chapter presents relevant background and related work. It first discusses mul-
tiparty HRI, a rapidly growing area of HRI as robots leave the lab and enter group
spaces. The chapter then describes SAR, focusing on key multiparty aspects such as
trust, empathy, disclosure, and mediation/facilitation. This chapter aims to provide
context for the subsequent chapters, which will focus on how to model those aspects
for allowing a robot to facilitate social support.
2.1 Multiparty Human-Robot Interaction
Multiparty HRI involves a robot interacting with a group of individuals. This is a challenging
domain in that it takes all of the problems of HRI on an individual level and scales them across
multiple people, while adding the complexity of group dynamics that do not exist in one-on-one
interactions. Within HRI, there is a growing body of work developing socially interactive robots
for multiparty interactions.
Socially interactive robots have been used in a wide variety of multiparty contexts, but the most
popular are teaching, entertaining, serving, and mediation. In the teaching context, robots have
been used for tutoring (Mussakhojayeva et al. 2016), for exhibit guiding (Salam and Chetouani
2015), and for dispensing information (Bohus et al. 2014). In the entertainment context, robots
have been used to play games (Fraune et al. 2017), give presentations (Sugiyama et al. 2015), and
participate in conversations (Šabanović et al. 2013). In the service context, robots have been studied as waiters serving drinks in open settings (Vázquez et al. 2015) and behind the bar (Kirchner
et al. 2011). In the mediation context, robots have primarily been used for moderating game play
(Short and Mataric 2017; Short et al. 2017; Jung et al. 2015) and mediating conversations. The
work presented in this dissertation most closely aligns with the prior work on mediating conversations, but with the goal of promoting social support rather than moderating game play or mediating conflict in conversation.
2.1.1 Turn-Taking Prediction
Turn-taking in spoken dialogue systems is the coordination of system speech with the person or
persons with whom the system is speaking (Skantze 2021). Turn-taking modeling can be formu-
lated as the process of estimating whether or not a given group member will be speaking at a future
point in time, also known as next speaker prediction. Turn-taking is commonly defined as having
four cases: hold, when a person is talking and continues to do so; yield, when a person is talking
and is about to stop; take, when a person is not talking but will start to talk; and listen, when a per-
son is not talking and will continue to not talk (Skantze 2021). Prior turn-taking decision modeling
typically addressed the yielding and holding cases. These models estimated whether a person is
yielding or holding when a short pause occurs during their speech. Continuous turn-taking makes
a prediction about future speech in order to address all four cases. Skantze (2021) provides a
comprehensive review of the history and state-of-the-art in turn-taking methods.
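As a concrete illustration of the four-case formulation above, the following sketch (not drawn from the dissertation's code) maps a group member's current and predicted future speech activity to a turn-taking label:

```python
def turn_taking_label(speaking_now: bool, speaking_future: bool) -> str:
    """Map current and predicted future speech activity to the four
    turn-taking cases (hold, yield, take, listen) described above."""
    if speaking_now and speaking_future:
        return "hold"    # talking and continues to talk
    if speaking_now and not speaking_future:
        return "yield"   # talking and about to stop
    if not speaking_now and speaking_future:
        return "take"    # not talking but about to start
    return "listen"      # not talking and remains silent

# Example: a group member who is silent now but predicted to speak soon
# is labeled as taking the turn.
assert turn_taking_label(False, True) == "take"
```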
The first model of continuous turn-taking was introduced in Skantze et al. (2015), built to
support an autonomous multiparty robot in an interactive museum. The approach predicted a
vector of speech activity 60 frames or 3 seconds into the future using a multimodal Long Short-
Term Memory network (LSTM). The model incorporated acoustic and part-of-speech features, but
it was found that it performed almost as well with only the acoustic features. That seminal approach
has been extended in Roddy et al. (2018a) and Roddy et al. (2018b) through the use of a multiscale
Recurrent Neural Network (RNN) architecture, in which different modalities were modeled in
individual sub-network Long Short-Term Memory (LSTM) networks that operated at their own
independent timescales, with a separate LSTM that fused the modalities to form predictions at
a regular rate. Ward et al. (2019) also extended the work in Skantze (2017) through the use of
an improved multilayer LSTM utilizing parametric rectified linear units and by testing the model
across multiple languages and conversational genres. Masumura et al. (2019) developed a novel
model for end-of-turn detection utilizing a cross-modal representation trained with a punctuated
text dataset.
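As a hedged illustration of this general style of continuous prediction (the feature dimensions, layer sizes, and frame rate below are assumptions for the sketch, not the parameters of the cited models), a single LSTM can map a window of per-frame acoustic features to speech-activity probabilities for the next 60 frames:

```python
import torch
import torch.nn as nn

class ContinuousTurnTakingLSTM(nn.Module):
    """Predict speech activity for the next `horizon` frames (e.g., 60 frames,
    about 3 seconds at 20 fps) from a sequence of per-frame acoustic features."""

    def __init__(self, feature_dim=40, hidden_dim=128, horizon=60):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, horizon)  # one logit per future frame

    def forward(self, features):
        # features: (batch, time, feature_dim)
        _, (h_n, _) = self.lstm(features)
        # h_n[-1] summarizes the observed context for each sequence in the batch
        return torch.sigmoid(self.head(h_n[-1]))    # (batch, horizon) probabilities

# Usage with a random ten-second context (200 frames of 40-dim features):
model = ContinuousTurnTakingLSTM()
future_speech = model(torch.randn(1, 200, 40))      # probabilities for the next 60 frames
```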
In face-to-face interactions, eye gaze is one of the best studied and strongest visual cues for
coordinating turn-taking and managing attention by dialogue partners (Oertel et al. 2013). Within
the fields of HRI and Human-Computer Interaction (HCI), eye gaze detection has already been
utilized for addressee and backchannel detection (e.g., Admoni and Scassellati (2017)). Eye gaze
control has been successfully deployed for turn signaling and control by the agent (Joosse 2017).
Despite the theoretical importance of eye gaze in group interactions and the use of eye gaze in
HRI and HCI, at the time of this dissertation, eye gaze remains an underutilized feature for active
speaker detection and turn-taking prediction.
2.1.2 Active Speaker Detection
Active speaker detection (ASD) is the task of determining if a certain speaker is active at any point
in time. In clean acoustic conditions, and with a single speaker, the acoustic information is funda-
mental for the ASD task, and methods for audio-only ASD have been extensively studied. Anguera
et al. (2012) and Tranter and Reynolds (2006) offer comprehensive reviews of the research in this
field. Audio-only ASD systems usually suffer from noisy environments, far-field microphones,
and speakers that overlap in time. Additionally, audio-only approaches are limited in multiparty
interactions, where it is important to assign the detection to speakers that might be physically close.
Video-only methods attempt to directly model the face, e.g., (Ahmad et al. 2013; Stefanov et al.
2016), or some aspects of the face (e.g., lip movements (Siatras et al. 2009)). The drawbacks of
these methods are related to motions, e.g., facial expressions, that can be misinterpreted as speak-
ing.
Audio-visual methods combine information from both the audio and visual modalities; com-
plementing the audio approach with its video counterpart generally produces better performance
due to increased robustness (Minotto et al. 2014; Cutler and Davis 2000; Chakravarty et al. 2015;
Chakravarty and Tuytelaars 2016; Ren et al. 2016). Recently, researchers have employed Artificial
Neural Networks for ASD from audio-visual input. A multimodal LSTM model that learns shared
weights between modalities was proposed in (Ren et al. 2016). A combination of a pre-trained
Convolutional Neural Network (CNN) model used as image encoder and an LSTM model used as
classifier was presented in (Stefanov et al. 2017). Stefanov et al. (2020) proposed a self-supervised
method in the context of language acquisition. Hu et al. (2015) proposed a CNN model that learns
the fusion of face and audio information.
Other approaches to ASD include a general pattern recognition framework used by Besson and
Kunt (2008). Visual activity (the amount of movement) and the focus of visual attention were used
as inputs by Hung and Ba (2009). Stefanov et al. (2016) used facial action units as inputs to Hidden
Markov Models and Vajaria et al. (2008) demonstrated that information from body movements can
improve detection performance.
2.2 Socially Assistive Robotics
2.2.1 Trust in Socially Assistive Robotics and Support Groups
Trust is challenging to define and measure, yet it is crucial to the proper functioning of socially
assistive robots. Trust can have different meanings depending on the context; however, its definition in the literature has consolidated around three main factors: ability, benevolence, and integrity (Mayer et al. 1995). This work utilizes the definitions from Williams (2001), where
ability is defined as “a set of skills that allow an individual to perform in some area”, benevolence
as “the other-oriented desire to care for the protection of another”, and integrity as “the belief that
another adheres to a set of principles that one finds acceptable”.
In support groups (and other therapeutic groups), Johnson and Noonan (1972) identified benev-
olence and integrity as the more pertinent aspects of trust. Unfortunately, no validated measures
have been developed for change in trust of group members over a single-session support group
interaction. Two relevant measures for individual trust are the Dyadic Trust Scale (Larzelere and
Huston 1980) and the Specific Interpersonal Trust Scale (SITS) (Johnson-George and Swap 1982).
The Dyadic Trust Scale is a uni-dimensional scale that focuses on measuring trust in close personal
relationships. The SITS also measures trust in close personal relationships, but has been broken
out to include factors of reliableness, emotional trust, and general trust. In support groups, the
relevant factors are measured by the emotional trust and general trust subscales. Because it is un-
clear what measures would capture the short-term change in trust, this dissertation work used both
measures and custom questions based on the established antecedents of trust: ability, benevolence,
and integrity.
2.2.2 Measuring Empathy
Empathy involves an individual experiencing a change in their own thoughts or feelings in response
to observations of another individual. Measuring this internal phenomenon in psychology has
taken many forms, including the first person perspective (the trait of empathy), the second person
perspective (the recipient’s rating), and the third person perspective (expert rating of empathetic
actions) (Sanchez et al. 2019).
Measuring the trait of empathy has evolved. In the past, it was done by separately measuring the two dimensions of empathy: cognitive empathy through the Hogan Empathy Scale (Hogan 1969) and affective empathy through the Interpersonal Reactivity Index (Davis 1983).
Subsequently, increasingly comprehensive measures have been applied, such as the Toronto Em-
pathy Scale (Spreng et al. 2009) and the Basic Empathy Scale (Jolliffe and Farrington 2006).
More recent work, such as the Empathy Assessment Index (Lietz et al. 2011), has introduced
subscales for measuring tendencies towards empathic action in order to incorporate a pro-social
dimension into the measurement of empathy.
Measuring perceptions of empathy is common in clinical practice, typically through patient ratings of the empathy of their doctor or therapist, such as the Consultation and Relational Empathy (CARE) measure (Mercer et al. 2004) and the Jefferson Scale of Patient Perceptions of Physician Empathy
(JSPPPE) (Kane et al. 2007). Most common in healthcare settings are third person scales com-
pleted by an observer of the interaction between two individuals, typically a healthcare provider
and a patient. Examples include the Empathic Communication Coding System (ECCS) (Bylund
and Makoul 2005) and the Global Rating Scale (Tavares et al. 2013).
2.2.3 Empathy in Socially Assistive Robotics
Prior work on measuring empathy or perceived empathy of robots is limited. One approach to
measuring the success of an empathetic robot is the use of the McGill Friendship Questionnaire
along with the CARE measure (Johanson et al. n.d.; Leite et al. 2010). More recently, Charrier et al. (Charrier et al. 2019; Charrier et al. 2018) introduced the Robot's Perceived Empathy (RoPE) scale
to provide a means of directly measuring the perceived empathy of a robot. This scale provides a
measure of the second person perspective of the robot. The RoPE scale has been used to measure
the perceived empathy of an empathetic chatbot response for medical assistance (Daher et al. 2020)
and of a virtual robot used for genetic counseling (Reghunath 2021).
Two distinct research areas have addressed empathy in social agents (Paiva 2011). In the first,
the human interaction partner is the observer of the robot and the robot is the target of the human’s
empathy. In the second, the robot is the observer of the human and is displaying empathy to the
human. There has been significant progress in both areas in recent years (Paiva et al. 2017); this
dissertation focuses on the second area. The first model of empathy in socially assistive robotics
was proposed by Tapus and Mataric (2007) and included the following features: recognizing,
understanding and interpreting other’s emotional state, processing and expressing emotions across
modalities, communicating with others, and perspective taking.
Empathy in HRI is particularly challenging because accurately perceiving a user’s affective
states in an open domain or environment has not been technologically possible until recently. Early
SAR studies in empathy used mimicry to approximate empathy, either by mirroring the emotion
inferred through speech intonation (Hegel et al. 2006) or through copying the user’s mouth and
head movements (Riek et al. 2010). Other work has explored how empathy during game play
affects children, such as in two studies where children played chess against an iCat robot (Leite
et al. 2014; Cramer et al. 2010). In one study the robot utilized an SVM-based Affect Recognition
system to return probabilities of the children’s valence and displayed social supportive behaviors
if the child’s affective state was negative (Leite et al. 2014). In the other, the robot analyzed the
game from the child’s position to determine how they might be feeling, and reacted according to
the assumed feelings of the child. Cramer et al. (2010) examined how a robot's displayed empathy, and in particular its accuracy (in this case, when an iCat robot was playing a cooperative game with an actor), affected people's attitudes towards the robot. Results showed that inaccurate empathy led to a
significant decrease in user trust in the robot.
2.2.4 Mediation/Facilitation in Socially Assistive Robotics
This dissertation follows prior work in which the robot acted as both a mediator and facilitator
in order to improve the quality and balance of a conversation. Tahir et al. (2018) focused on
different methods for improving conversational quality using a robot mediator to deliver feedback
about a conversation between two individuals. In that study, two individuals acted out a scripted
conversation and then the robot delivered feedback created by a ‘sociofeedback system’ which
analyzed the interest, dominance, and agreement displayed in the conversation. Although the paper
reported on the sociofeedback system training, the focus of the study was on users' perceptions of the way the feedback was delivered via the robot; the authors found that participants liked receiving feedback from the robot. Recent work has explored using a robot as a counselor
in couples therapy to improve the quality of communication (Zuckerman and Hoffman 2015),
and promote positive communication (Utami et al. 2017) and collaborative responses (Utami and
Bickmore 2019).
Other work has focused on improving the balance and flow of conversations. One of the earlier
efforts to explore conversational balance utilized a facilitation robot to obtain the conversational
initiative and regulate imbalance (Matsuyama et al. 2015). The work of Short et al. (2016) eval-
uated user perception of a robot mediator in a controlled study in which participants completed a
group storytelling task. In Ohshima et al. (2017), the authors tested robot behaviors for helping
a group recover from an awkward silence. In their study, a robot led a conversation with three
participants, taking actions and asking questions to encourage conversation.
Because of the difficulty of understanding and interacting with natural language, most prior
SAR work on conversational mediation had constrained the role of the robot and the action space
in which the robot can participate. For instance, Hoffman et al. (2015) documented the design and
evaluation of a nonverbal conversational companion that attempts to encourage empathy between
the individuals having the conversation. Other work had constrained the interaction problem by ig-
noring the control challenges through a Wizard of Oz (WoZ) framework in which a hidden human
controls the behavior of the robot. This paradigm can introduce confounds and bias into the inter-
action; however, it can be appropriate if reported correctly and used as part of an iterative approach
to developing technology (Riek 2012). For example, in the HRI work of Nigam and Riek (2015),
the interest was solely in collecting data for building classification models of the environment and
therefore using WoZ was appropriate. In Vázquez et al. (2017), the authors were studying human perception of robot gaze and orientation behaviors, and so chose WoZ to control the timing and choice of actions the robot took.
2.3 Evaluation Domain: Support Groups
2.3.1 Support Groups
Social support groups, consisting of people with shared experiences and concerns, provide im-
portant emotional and moral support (Hu 2017), and can play a significant role in improving the
ability to cope with life stressors. Social support can buffer physical and/or psychological chal-
lenges (Davison et al. 2000; Gillis and Parish 2019; Hong et al. 2012; Rice et al. 2014), caregiver
challenges (e.g., Chien et al. (2011)), and challenges in training for, transitioning into, and within
work environments including during COVID-19 (Viswanathan et al. 2020). Widely available in-
person in local communities, support groups have been shown to reduce stress and isolation and
enhance coping and well-being (Gottlieb and Wachala 2007), to provide safe peer environments to
acknowledge/vent difficult or negative emotions and challenges, to achieve consensual validation
(e.g., Viswanathan et al. (2020)), and to learn better coping strategies from peers (Posluszny et al.
2002; Viswanathan et al. 2020). Group cohesion, the level of connectedness within a group, has
been used to measure the level of social support in the support group network (Martin et al. 2017)
and is predictive of staying in the group (Dyaram and Kamalanabhan 2005). Individuals’ levels of
social engagement, involvement, and time spent in support groups (e.g., Dyaram and Kamalanabhan (2005)) predict improvements in health outcomes. Research increasingly suggests that online
support groups are promising for improving well-being while having cost-effective broad reach
(e.g., Banbury et al. (2018), Griffiths et al. (2009), Hong et al. (2012), Rains and Young (2009),
and Viswanathan et al. (2020)).
2.3.2 Support Group Facilitation
For both in-person and online support groups, group facilitators are key. Even with varying lev-
els of professional training, they help groups to establish goals and manage group dynamics and
relationships (Borek and Abraham 2018; Garcia et al. 2011). Effective facilitators can create a
psychologically safe group climate (e.g., indicated by members' social engagement, trust, and empathy/warmth) that fosters cohesive and supportive groups, often by using subtle communicative
behaviors (Davison et al. 2000). For example, in a COVID-19 online frontline personnel support
group (Viswanathan et al. 2020), facilitators guided responses to emotions of medical staff (e.g.,
fear and anxiety over situational uncertainty) and used communicative behaviors to acknowledge
these reactions as normal and/or shared, and afforded insight, hope, interpersonal learning from
other strategies, and peer support for available work-related strategies. Many facilitator techniques
for encouraging active listening, speaking, responding, and expressions of empathy/sympathy to
group member statements are common across support groups (e.g., Viswanathan et al. (2020)).
An accepting and positive group climate nudges the members to increase their engagement and
involvement, through self-disclosure and receiving feedback (Borek and Abraham 2018). An ef-
fective facilitator brings about a safe environment and facilitates the establishment of a positive
group climate.
2.4 Summary
This chapter presented relevant background information and related work. It first discussed mul-
tiparty HRI, a rapidly growing area of HRI as robots leave the lab and enter group spaces. The
chapter then described SAR, focusing on key multiparty aspects such as trust, empathy, disclosure,
and mediation/facilitation. This chapter provides context for the subsequent chapters, which will
focus on how to model these aspects for allowing a robot to facilitate social support.
Chapter 3
Chapter Three: Modeling Turn-Taking
This chapter presents work that uses group attention to increase the accuracy and ro-
bustness of models of active speaker detection and turn-taking prediction. A key component of this chapter is an analysis of the conditions under which active speaker detection methods fail and how these failures can be overcome. Additionally, this chapter shows how the turn-taking prediction method developed here achieves state-of-the-art results on the MultiMediate Grand Challenge.
3.1 Approach to Turn-Taking Using Group Attention
Contributors: Chapter 3 is based on Birmingham et al. (2021a) and Birmingham et al. (2021b), written with Kalin Stefanov and Maja Matarić.
Understanding and predicting the changes in the roles on the conversational floor (i.e., speaker, addressee, bystander), known as footing (Goffman 1974; Goffman 1981), is a prerequisite for natural and effective human-machine interaction. To successfully and fluently participate in a situated, multiparty conversation, a system must understand who is speaking, known as active speaker detection, and when the speaker will relinquish their turn, known as turn-taking prediction. Active
speaker detection is the task of identifying the current speaker (if any) from a set of candidate
speakers. It is necessary for recognizing who is talking and for attributing any thoughts, ideas, and
opinions to the speaker. Turn-taking prediction has been formulated in many ways, and requires
estimating which member of the group, if any, will be speaking at a future point in time (Skantze
2021). Turn-taking prediction is needed in order for a machine to interact with an individual or a
group without excessive pauses or interrupting another group member’s turn.
Although most humans can perform active speaker detection and turn-taking prediction with
relative ease, computers struggle to accurately do either. This is because both tasks are inher-
ently multimodal, requiring an accurate synthesis of and reasoning about visual, auditory, and
linguistic information. In physically situated interactions this challenge is amplified by sensor
limitations, e.g., monocular cameras and far-field microphones. Additionally, the ground truth of
these problems can be noisy, with overlaps, cut-ins, and backchannels that can blur the distinction
between the active speaker and other group members.
Due to these challenges, it is helpful to incorporate information beyond the target individual’s
own visual and auditory data. Such information can include objects of interest in the environment
or, in the case of multiparty interactions, it can include information from other group members,
such as their focus of visual attention.
The work in this chapter utilizes group members’ focus of visual attention to improve the per-
formance of state-of-the-art active speaker detection and turn-taking prediction models. The main contributions of this chapter are:
• An evaluation of state-of-the-art models employed for the task of active speaker detection
and turn-taking prediction in situated multiparty interactions.
• An analysis of some of the conditions under which these models fail when used in situated
multiparty interactions.
• An introduction of novel models that utilize group members’ focus of visual attention in
order to address the shortcomings identified in this analysis.
Figure 3.1: FOVA and RFSG datasets' spatial configuration for the sensors and participants. (a) Robot-Facilitated Support Group. (b) Focus of Visual Attention.
3.2 Evaluation of the Turn-Taking Approach
3.2.1 Active Speaker Detection Methodology
Given a number of candidate speakers, Active Speaker Detection (ASD) is the task of determining, at any point in time, which speakers are active, using information from that point in time. This is a binary classification problem (per candidate speaker) where the input is the part
of the image capturing the candidate speaker and the associated audio. The backbone model used
in this work is the state-of-the-art audio-video synchronizer (Chung and Zisserman 2016; Son
and Zisserman 2017; Chung et al. 2019). Following the state-of-the-art in ASD (Chung 2019),
the synchronizer is turned into detectors by training a Temporal Convolution Network (TCN) and
a Bidirectional Long Short-Term Memory (BLSTM) network. We introduce novel methods to augment the synchronizer with information about the group members' focus of visual attention for improved cross-dataset ASD.
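As a rough sketch of how per-frame synchronizer embeddings can be turned into a per-candidate binary detector (the embedding size, hyperparameters, and the concatenation-based fusion of a group-attention score shown here are illustrative assumptions, not necessarily the design used in this work):

```python
import torch
import torch.nn as nn

class BLSTMSpeakerDetector(nn.Module):
    """Per-frame binary active-speaker classifier over audio-visual embeddings,
    optionally augmented with a scalar group-attention score per frame."""

    def __init__(self, embed_dim=512, hidden_dim=128, use_attention=True):
        super().__init__()
        in_dim = embed_dim + (1 if use_attention else 0)
        self.blstm = nn.LSTM(in_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, 1)

    def forward(self, embeddings, attention=None):
        # embeddings: (batch, time, embed_dim); attention: (batch, time, 1) in [0, 1]
        x = torch.cat([embeddings, attention], dim=-1) if attention is not None else embeddings
        out, _ = self.blstm(x)
        return torch.sigmoid(self.classifier(out)).squeeze(-1)  # (batch, time) speaking probability
```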
ASD Datasets
The introduction of the AVA-ActiveSpeaker dataset (Roth et al. 2019), consisting of nearly 40 hours of video data from movies, has allowed for benchmarking different methods for ASD.
However, the use of data from movies does not support the development of ASD methods for
physically situated interactions. To address this issue, we used two private multiparty interaction
datasets described next.
Robot-Facilitated Support Group Dataset (RFSG) The robot-facilitated support group dataset,
first published in Birmingham et al. (2020) and described in detail in Chapter 4, consists of 27
multiparty interactions between three students and a robot. The robot leads the support group by
asking questions and making disclosures to encourage the human members of the group to share
and receive support, using the setup shown in Figure 3.1a. The robot’s utterances were selected
from a predefined set of questions and statements by a human “wizard”. The total number of
participants in the dataset is 81. The average duration of the interactions is 20 minutes, resulting in
a total of 10 hours of data per recording device. The active speaker labels were obtained by manual
annotations. In this work we consider the color video stream generated by the camera pointed at
each participant and the audio stream generated by a single microphone in the middle of the table.
Focus of Visual Attention Dataset (FOVA) We also analyzed the multimodal multiparty dataset
described in Stefanov and Beskow (2016). Each interaction consisted of three participants: one
moderator and two interactants, using the spatial configuration shown in Figure 3.1b. A total of
15 sessions were recorded, each lasting approximately 30 minutes, resulting in 7.5 hours of data
per recording device. The moderator was the same in all interactions, while the other participants
varied, totalling 24 unique participants. The active speaker labels were obtained by manual anno-
tations. In this work we consider the color video stream generated by the Kinect RGB-D camera
pointed at each participant and the audio stream generated by the participants’ close-talking mi-
crophone.
ASD Experimental Setup
We evaluated the performance of the ASD methods with three experiments: within-dataset speaker-dependent (10-fold cross-validation), within-dataset speaker-independent (leave-6-out
cross-validation), and cross-dataset speaker-independent (train on one dataset and test on the
other). The within-dataset speaker-dependent experiment trained models with data for all par-
ticipants and evaluated them on independent data from all participants in the same dataset. The
within-dataset speaker-independent experiment trained models with data for a subset of partici-
pants and evaluated them on the left-out participants in the same dataset. This experiment tested
the transferability (generalization capabilities) of the models to unseen participants from the same
physical context. The cross-dataset speaker-independent experiment used all data from one dataset
to train the models, and all data from the other dataset to evaluate them. This experiment tested the
transferability (generalization capabilities) of the models to both unseen participants and physical
contexts. The experiments directly demonstrated the contribution of the work: an error analysis of the ASD methods and proposed strategies to address their identified shortcomings.
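For concreteness, the three protocols could be set up as in the following sketch (not the code used in this work), which assumes per-frame feature matrices X, labels y, and a speaker_ids array giving the participant for each frame; all names and the exact fold construction are illustrative.

# Illustrative sketch of the three ASD evaluation protocols.
import numpy as np
from sklearn.model_selection import KFold, GroupKFold

def speaker_dependent_splits(X, y, n_folds=10):
    # Within-dataset, speaker-dependent: frames from every participant can
    # appear in both the training and test folds.
    return KFold(n_splits=n_folds, shuffle=True, random_state=0).split(X, y)

def speaker_independent_splits(X, y, speaker_ids, held_out=6):
    # Within-dataset, speaker-independent: whole participants are held out;
    # leave-6-out corresponds to roughly n_speakers / 6 grouped folds.
    n_folds = len(np.unique(speaker_ids)) // held_out
    return GroupKFold(n_splits=n_folds).split(X, y, groups=speaker_ids)

def cross_dataset_split(train_dataset, test_dataset):
    # Cross-dataset, speaker-independent: train on all of one dataset and
    # evaluate on all of the other (e.g., RFSG -> FOVA or FOVA -> RFSG).
    return train_dataset, test_dataset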
ASD Features
We present experiments with two sets of features termed PerfectMatch and VisualAttention.
PerfectMatch – in Chung and Zisserman (2016), the authors trained a Convolutional Neural
Network termed SyncNet for the purpose of synchronizing audio and video tracks of individuals
talking. The network consisted of 6 convolutional layers followed by 2 fully connected layers for
both audio and video, separately. The model takes as input 5 video frames and the corresponding
audio samples. This model was shown to be effective for ASD by comparing the magnitude of the
difference between the final audio and video features. In Chung et al. (2019), the authors employed
a different strategy for training the original SyncNet model. The new model termed PerfectMatch
was shown to outperform the original SyncNet model. In our experiments, we used the features of
the final convolutional layer from the PerfectMatch model that has been trained on the VoxCeleb
dataset (Nagrani et al. 2017).
VisualAttention – we used visual attention features inspired by Stefanov et al. (2019). We im-
plemented both binary and continuous representations of whether any of the other group members
are looking at the candidate speaker. To measure the direction of the visual attention of each group
member, we used the 3D position and orientation of the head. For the FOVA dataset, an RGB-D
Kinect camera with calibrated position and orientation was used to acquire those measures. For the
RFSG dataset, the measures were approximated through OpenFace (Baltrusaitis et al. 2016) from
cameras with calibrated positions and orientations. For both datasets, the position and orientation
of the participants’ heads was recreated in the same 3D space for each video frame, and the head position and orientation were used to create a vector of visual attention. The binary representation
creates a cylinder around the candidate speaker’s head and judges a group member to be looking at
that person if the head pose vector of the group member intersects with the cylinder. We used 5× the average male head size for the dimensions of the cylinder. The continuous representation mea-
sures the angle between the group member’s head pose and the vector from the group member’s
head to the candidate speaker’s head. The measured angle is mapped to a range between 0 and 1,
in which an angle of 0 degrees (the group member is looking directly at the candidate speaker) is 1
and the angle of 75 degrees (the group member is looking away from the candidate speaker) is 0.
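A minimal sketch of the continuous feature, assuming 3D head positions and head-pose direction vectors expressed in a shared coordinate frame, is shown below; the variable names are illustrative and the 75-degree cutoff follows the description above.

# Sketch of the continuous visual attention feature: 0 degrees maps to 1,
# 75 degrees or more maps to 0.
import numpy as np

MAX_ANGLE_DEG = 75.0

def continuous_attention(member_head_pos, member_head_dir, candidate_head_pos):
    # Unit vector from the group member's head to the candidate speaker's head.
    to_candidate = candidate_head_pos - member_head_pos
    to_candidate = to_candidate / np.linalg.norm(to_candidate)
    head_dir = member_head_dir / np.linalg.norm(member_head_dir)
    # Angle between the member's head-pose vector and the member-to-candidate vector.
    cos_angle = np.clip(np.dot(head_dir, to_candidate), -1.0, 1.0)
    angle_deg = np.degrees(np.arccos(cos_angle))
    # Map [0, 75] degrees onto [1, 0]; larger angles are clipped to 0.
    return float(np.clip(1.0 - angle_deg / MAX_ANGLE_DEG, 0.0, 1.0))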
Augmentation – we combined the VisualAttention features for a candidate speaker by averaging
the features produced by each of the other group members. This produced a single value for both
the binary and continuous features. In both cases, the closer this value was to 1, the more likely
the candidate speaker was the focus of attention; conversely, the closer the value was to 0, the
less likely it was that they were the focus of attention. To augment the detection of a given model,
which is also between 0 and 1, we implemented three preliminary methods: 1) we combined the
VisualAttention feature with the model detection in an unweighted average, 2) we combined them
in a weighted average skewed towards the VisualAttention feature, and 3) we multiplied the model
detection value with a VisualAttention feature shifted from the range [0, 1] to the range [0.5, 1.5] in
order to allow the feature to modify the model detection.
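The three preliminary combination rules can be sketched as follows; the weight for the skewed average is an assumption, since its exact value is not stated here. In all three cases, detection and attention are values in [0, 1] for the candidate speaker.

# Sketch of the three augmentation methods (the skew weight is illustrative).
def augment_unweighted(detection, attention):
    return (detection + attention) / 2.0

def augment_weighted(detection, attention, attention_weight=0.7):
    return attention_weight * attention + (1.0 - attention_weight) * detection

def augment_multiplicative(detection, attention):
    # Shift the attention feature from [0, 1] to [0.5, 1.5] so that it can
    # scale the detection up or down rather than zeroing it out.
    return detection * (attention + 0.5)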
ASD Models
For each experiment, we independently evaluated the synchronizer, detectors, and augmented
synchronizer in order to compare the models and provide evidence for some of the situations in
which they fail.
The synchronizer was implemented as described in Chung et al. (2019). This included an input
of 5 video frames and the corresponding audio, resulting in 1024D audio and video features. These
features were cross-correlated to find the minimum difference (the matching synchronization),
then the difference between the features was median-filtered and normalized for each video. This
produced a single value in the range [0, 1] that was used as the detection of whether the candidate
speaker was speaking. For the synchronizer we used the state-of-the-art pre-trained model.
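The per-video detection procedure could be approximated as in the sketch below, assuming per-frame audio and video embeddings of equal dimensionality; the offset window and median-filter size are assumptions rather than the values used in the original implementation.

# Rough sketch of turning per-frame embeddings into a detection score in [0, 1].
import numpy as np
from scipy.signal import medfilt

def synchronizer_score(video_feats, audio_feats, max_offset=15, kernel=9):
    # video_feats, audio_feats: arrays of shape (T, D), one row per frame.
    T = len(video_feats)
    dists = np.zeros(T)
    for t in range(T):
        # Distance between the video feature and audio features at nearby
        # temporal offsets; the minimum marks the best synchronization.
        lo, hi = max(0, t - max_offset), min(T, t + max_offset + 1)
        dists[t] = np.linalg.norm(audio_feats[lo:hi] - video_feats[t], axis=1).min()
    smoothed = medfilt(dists, kernel_size=kernel)
    # Normalize over the whole video and invert so 1 means "likely speaking".
    norm = (smoothed - smoothed.min()) / (smoothed.max() - smoothed.min() + 1e-8)
    return 1.0 - norm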
The detectors were implemented as described in Chung (2019). A window of 5 of the 512D
audio and video features was used as input to a 2-layer time series model before being combined in
a fully connected layer. The fully connected layer was then connected to a softmax layer to produce
the final output probabilities. As in Chung (2019), we tested both a Temporal Convolution Network
and a Bidirectional Long Short-Term Memory as the time series model. The detectors were trained
as described in Chung (2019). The input features from the synchronizer were held constant and
the models were trained on the FOVA and RFSG datasets individually for each experiment.
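A minimal PyTorch sketch of the BLSTM detector head is given below; the hidden size is an assumption, and the softmax described above is applied to the returned logits to obtain the final probabilities (the TCN variant would replace the recurrent layers with temporal convolutions).

# Minimal sketch of the BLSTM detector head (hidden size is assumed).
import torch
import torch.nn as nn

class BLSTMDetector(nn.Module):
    def __init__(self, feat_dim=512, hidden=128, n_classes=2):
        super().__init__()
        # Audio and video features are concatenated per frame.
        self.lstm = nn.LSTM(input_size=2 * feat_dim, hidden_size=hidden,
                            num_layers=2, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, audio_feats, video_feats):
        # audio_feats, video_feats: (batch, 5, feat_dim) windows of 5 frames.
        x = torch.cat([audio_feats, video_feats], dim=-1)
        out, _ = self.lstm(x)
        return self.fc(out[:, -1])  # logits; softmax gives speaking probabilities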
The augmented models were implemented by combining the detection of the synchronizer
with the VisualAttention features using the three different augmentation methods.
ASD Evaluation
Each of the models was evaluated on an unseen test set. In the case of the synchronizer, the
model was trained on the VoxCeleb dataset and was not fine-tuned. We report the result of using the model as-is on the RFSG and FOVA datasets. In the case of the detectors, we report the result of
within dataset speaker-dependent, within dataset speaker-independent, and cross-dataset speaker-
independent models separately. We follow the reporting requirements for the AVA-ActiveSpeaker dataset in reporting the mean average precision (mAP) scores for each model. Average precision (AP) is the average of precision scores calculated for each recall threshold,

AP = Σ_n (R_n − R_{n−1}) P_n,

where R_n is the recall and P_n is the precision at threshold n. The results are reported in terms of frame-by-frame mAP, mAP = (AP0 + AP1)/2, where AP0 and AP1 are the AP scores of the negative and positive class, respectively.
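This metric can be computed with scikit-learn, whose average_precision_score implements the summation above; the sketch below assumes per-frame binary labels and per-frame confidence scores for the candidate speaker.

# Sketch of the frame-by-frame mAP metric.
import numpy as np
from sklearn.metrics import average_precision_score

def frame_map(y_true, speaking_scores):
    y_true = np.asarray(y_true)           # 1 if the candidate is speaking
    scores = np.asarray(speaking_scores)  # model confidence of speaking
    ap_pos = average_precision_score(y_true, scores)          # AP1, positive class
    ap_neg = average_precision_score(1 - y_true, 1 - scores)  # AP0, negative class
    return (ap_pos + ap_neg) / 2.0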
3.2.2 Turn-taking Prediction Methodology
The goal of the methods described in this section is to detect the speaking state (i.e., speaking or
not speaking) of all persons in a physically situated multiparty interaction at some point in the near
future (i.e., turn-taking prediction).
Model          Speaker-dependent                Speaker-independent              Cross-dataset
               RFSG            FOVA             RFSG            FOVA             RFSG            FOVA
BLSTM          0.973 (0.006)   0.987 (0.002)    0.894 (0.028)   0.975 (0.004)    0.663 (0.022)   0.685 (0.012)
TCN            0.966 (0.004)   0.986 (0.002)    0.891 (0.038)   0.976 (0.002)    0.638 (0.024)   0.702 (0.011)
PerfectMatch   -               -                -               -                0.807           0.677

Table 3.1: ASD Evaluation: performance of the synchronizer and detectors. Mean mAP and standard deviation are shown in parentheses.
TTP Task Definitions
Given a fixed context window of sensor data pertaining to a number of potential speakers,
turn-taking prediction is the task of determining which speakers will be active at some
fixed point in the future, by using information from the current point in time. The MultiMediate
Challenge formulates the problem of predicting the state of each potential speaker (i.e., speaking
or not speaking) as a binary classification task, with independent labels for each potential speaker.
TTP Models
The backbone model used in this work is the SOTA audio-video synchronizer termed Per-
fectMatch (Chung et al. 2019). Following the state-of-the-art in active speaker detection (Chung
2019), the synchronizers are further turned into predictors by adding a Temporal Convolution Net-
work (TCN) and a Bidirectional Long Short-Term Memory (BLSTM). Here we propose models
that augment the synchronizers and predictors with information about the group members’ focus of visual attention.
TTP Features
Our work combined two sets of features: PerfectMatch and VisualAttention features. Next, we
describe how those features were computed.
PerfectMatch – Chung and Zisserman (2016) trained a Convolutional Neural Network
(i.e., SyncNet) to produce audio and video embeddings for the purpose of synchronizing audio
and video tracks of individuals talking. The network consisted of 6 convolutional layers followed
by 2 fully connected layers for separate audio and video. The model took as input 5 video frames
and the corresponding audio samples. This model has been shown to be effective for active speaker
detection by comparing the magnitude of the difference between the final audio and video features
and smoothing with a median filter. Chung et al. (2019) re-trained the original SyncNet model with a different training strategy. The new model
(i.e., PerfectMatch) was shown to outperform the original SyncNet model. The output of the final
convolutional layer from the PerfectMatch model that had been pre-trained on VoxCeleb (Nagrani
et al. 2017) was directly used to produce 512D audio and 512D video features for each frame.
VisualAttention – The visual attention features are inspired by Stefanov et al. (2019) and
include continuous and binary representations of whether or not a group member is looking at
the candidate speaker. The binary representation considers the group member to be looking at the
candidate speaker if the group member’s head pose is pointed closer to the candidate speaker than
the head pose of any of the other group members. Looking at the candidate speaker is represented
by a 1, otherwise a 0. For the continuous representation, the angle between the head pose of the
group member and the vector between the head of the group member to the head of the candidate
speaker is then used as a measure for how far away they are looking from the candidate speaker.
This angle is normalized using:
f = 1 − θ/q (3.1)
where f is the feature from that group member to the candidate speaker, θ is the angle between the
group member’s head pose and the candidate speaker-group member vector, and q is a normaliza-
tion factor based on the field of view of the group member.
In both the continuous and binary cases we created the mean VisualAttention feature for the
candidate speaker by averaging the features produced by the other group members with respect to
that candidate speaker. This produced a single value for the average of the continuous and binary
features. The closer this value was to 1, the more likely the individual was to be the focus of
attention of the group; conversely, the closer the value was to 0, the more likely it was that the
individual was not the focus.
TTP Model Architecture
The synchronizer model architecture is described in Chung and Zisserman (2016); it includes
a window of 5 frames of video and the corresponding audio which produces 512D audio and video
embeddings. These embeddings are cross-correlated to find the minimum difference (the matching
synchronization), then the difference between the embeddings is median filtered and normalized
for each video. This produces a single value between 0 and 1 which is used as the confidence
that the candidate speaker is speaking. The synchronizer utilizes the entire video as context for
determining the median, and thus is not suited for real-time use.
The predictor model architectures are described in Chung (2019). A window of 5 of the 512D
audio and video embeddings from each frame are used as input to a 2 layer time series model
before being combined into a fully connected layer. The fully connected layer is then connected
to a softmax layer to produce the final output probabilities. As in Chung (2019), we used both a
Temporal Convolution Network (TCN) and a Bidirectional Long Short-Term Memory (BLSTM)
as the time series models. These models do not require the full video context and can be used in
real-time applications.
The VisualAttention augmented models combined the output of the synchronizer and predic-
tors described above with the mean VisualAttention feature described in Section 3.2.2. To augment
the predictions of the synchronizer and predictor models, which were also between 0 and 1, we
multiplied the confidence value with the VisualAttention feature, which was shifted from [0,1] to
[0.5,1.5] to allow the feature to adjust the model confidence. The full architecture for the Visu-
alAttention augmented synchronizer can be seen in Figure 3.2.
TTP Datasets
The 2021 MultiMediate Challenge used the published MPIIGroupInteraction dataset (Müller et al. 2018). The dataset consists of 22 German-language conversations among groups of three to four people,
each with an approximate length of 20 minutes. Participants in each conversation were instructed
to discuss a controversial topic and were recorded by 8 frame-synchronized video cameras and
4 microphones. The challenge provides the recording from all cameras (one from behind each
[Figure 3.2 diagram: audio MFCCs and face-detected video frames are passed to the PerfectMatch model to produce a speech activity confidence score for the candidate speaker; OpenFace head-pose angles (θ1, θ2, θ3) of the other group members produce visual attention scores that are averaged into a group attention score.]

Figure 3.2: TTP Architecture: the augmented VisualAttention synchronizer consists of a speech activity confidence score generated by the PerfectMatch model for the candidate speaker and the group attention score for each group member except the candidate speaker from the latest available frame. The figure shows the method for generating continuous rather than binary VisualAttention features. Group attention and speech activity are combined to produce a label estimate for future speech for the candidate speaker.
participant) and one of the microphones for each session. Every frame of each recording is labeled
with a binary representation of who is speaking.
TTP Experimental Setup
The synchronizer was implemented with a SOTA pre-trained model. The model was pre-
trained on VoxCeleb (Nagrani et al. 2017), which may contain significantly different data distribu-
tions than the challenge datasets.
The predictors were implemented and trained as described in Chung (2019). The input fea-
tures from the synchronizer were held constant and the TCN and BLSTM models were trained
on the MPIIGroupInteraction dataset for each experiment. The training was implemented in Py-
Torch (Paszke et al. 2017). The models were trained for 25 epochs, with a batch size of 64. The Adam (Kingma and Ba 2014) optimizer was used with the default parameters (α = 0.001, β1 = 0.9, β2 = 0.999, and ε = 10^-8) and a fixed learning rate of 0.001. The loss was calculated with the cross-entropy loss function.
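A minimal sketch of this training configuration follows, assuming a model such as the BLSTM predictor above that returns pre-softmax logits, and a DataLoader named train_loader (an illustrative name) yielding batches of audio windows, video windows, and integer labels.

# Sketch of the training configuration described above.
import torch
import torch.nn as nn

optimizer = torch.optim.Adam(model.parameters(), lr=0.001,
                             betas=(0.9, 0.999), eps=1e-8)
criterion = nn.CrossEntropyLoss()  # cross-entropy on the output logits

for epoch in range(25):
    for audio, video, labels in train_loader:  # batch size 64
        optimizer.zero_grad()
        loss = criterion(model(audio, video), labels)
        loss.backward()
        optimizer.step()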
The VisualAttention augmented models required additional processing to generate the fea-
tures for this dataset. In the MPIIGroupInteraction dataset, the position of the cameras was not
publicly available, so the positions of the group members relative to each other were estimated. The
cameras were not centered exactly for each group member but were consistent across all sessions,
so the estimates were created using the distributions in the training set of head poses for each
group member when the other group members were speaking. For the binary representation, this
estimation was accomplished by dividing the head yaw rotation of all participants in a given seat
into three equal quantiles, where one quantile was assigned to the person to their left, one across,
and one to their right. When the direction a participant was facing was within a given quantile, the
participant was considered to be looking at the seat in that quantile and not at the other two.
Creating a continuous representation requires a more exact method for estimating the location of
a given participant relative to each of the others. This was accomplished by taking the mean of the
distribution of head yaw rotations when each of the other group members was speaking and using
it as a proxy for the location of that group member.
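The quantile-based binary estimate could be implemented as in the sketch below, given an array of training-set head yaws for a particular seat; the seat names and the mapping from yaw ranges to seats are illustrative.

# Sketch of the yaw-quantile binary attention estimate for one seat.
import numpy as np

def fit_yaw_boundaries(train_yaws):
    # Split the seat's yaw distribution into three equal quantiles,
    # one per other seat (left, across, right).
    return np.quantile(train_yaws, [1 / 3, 2 / 3])

def binary_attention(yaw, boundaries, seats=("left", "across", "right")):
    # Returns a {seat: 0/1} indicator of which seat the participant is
    # judged to be looking at in the current frame.
    if yaw < boundaries[0]:
        target = seats[0]
    elif yaw < boundaries[1]:
        target = seats[1]
    else:
        target = seats[2]
    return {seat: int(seat == target) for seat in seats}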
3.2.3 Results
ASD Results
The results from all experiments are reported in Table 3.1, with the mean and standard deviation
of the models’ mAP. The results from the speaker-dependent experiment are based on 10-fold
cross-validation. The best performing detector was the BLSTM. This detector achieved a score of
0.973 on the RFSG dataset and 0.987 on the FOVA dataset. Both detectors performed similarly well on both datasets. The results from the within-dataset speaker-independent experiment are based on leave-6-out cross-validation. The BLSTM detector performed better on the RFSG dataset, reaching a score of 0.894, while the TCN detector performed better on the FOVA dataset, with a score of 0.976. Both detectors performed similarly within each dataset; however, there was a large difference between datasets, with performance on the FOVA dataset approximately 0.08 higher. The results from
the cross-dataset speaker-independent experiment are based on 10-fold cross-validation for the
Figure 3.3: ASD Results: performance of all cross-dataset speaker-independent models on (a, b) RFSG and (c, d) FOVA. The Y-axis is the models’ mAP for different head poses in the range [−60, 60] degrees. Panels (a) and (c) show the performance of the original detectors and synchronizer on each dataset; panels (b) and (d) show the improved performance of the synchronizer augmented with VisualAttention features.
detectors and include the performance of the synchronizer on each dataset, reported as a single
value. The PerfectMatch synchronizer performed best on the RFSG dataset, with a score of 0.807.
The detectors that had been fine-tuned on FOVA performed significantly worse, with a mean score of 0.651. On the FOVA dataset, the TCN detector performed best, with a score of 0.702.
TTP Results
Each of the models was evaluated on the validation set provided by the 2021 MultiMediate
Challenge organizers. The VisualAttention Augmented Synchronizer was evaluated on the hidden
test set, as the intention of this work is to evaluate the model that was found to generalize best to
new datasets in our prior work.
We follow the reporting requirements for the 2021 MultiMediate Challenge in reporting the
unweighted average recall scores (UAR). Here, metrics are calculated for each candidate speaker
and their unweighted mean is found, where the unweighted recall score for each candidate speaker
is defined as:
R = tp / (tp + fn) (3.2)
where R is the recall, tp is the number of true positives, and fn is the number of false negatives.
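A sketch of this metric, computed exactly as defined above (per-speaker recall averaged without weighting), is shown below; the input names are illustrative.

# Sketch of the unweighted average recall (UAR) metric.
import numpy as np

def speaker_recall(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0

def unweighted_average_recall(per_speaker_labels, per_speaker_preds):
    # One (labels, predictions) pair of binary arrays per candidate speaker.
    recalls = [speaker_recall(np.asarray(y), np.asarray(p))
               for y, p in zip(per_speaker_labels, per_speaker_preds)]
    return float(np.mean(recalls))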
3.3 Results
This section reports the results of the experiments described in Section 3.2.2.
Challenge Validation Set
Model             Binary Aug.   Continuous Aug.
BLSTM Predictor   0.719         0.747
TCN Predictor     0.572         0.721
Synchronizer      0.692         0.715

Table 3.2: TTP Validation Results: UAR performance of models augmented by Binary and Continuous VisualAttention features on the validation set provided by the MultiMediate competition.
The results of the experiment on the 2021 MultiMediate validation set are reported in Table
3.2. The best performing model was the BLSTM. The synchronizer augmented with continuous
VisualAttention features performed only slightly worse than the SOTA BLSTM and TCN predic-
tors augmented with VisualAttention, by −0.032 and −0.006, respectively, both of which have
been explicitly fine-tuned on the 2021 MultiMediate training set.
Challenge Test Set
Authors                           UAR
Ours (Cont. Aug. Synchronizer)    0.632
Ours (Binary Aug. Synchronizer)   0.628
HNU VPAI                          0.57
Jiangeng                          0.53
MM Baseline                       0.51

Table 3.3: TTP Test Results: UAR performance of competitor models and our synchronizer model augmented by Binary and Continuous VisualAttention features on the held-out test set provided by the MultiMediate competition.
For the competition test set we submitted the synchronizer augmented with continuous and
binary VisualAttention features. The binary features performed worse than the continuous features, with scores of 0.628 and 0.632, respectively. The provided baseline was 0.51. Both submissions
achieved SOTA performance, outperforming all other competitors.
3.3.1 Discussion
ASD Error Analysis – In order to investigate the decrease in performance on the challenging task
of cross-dataset speaker-independent detection, we evaluated the cross-dataset models for different
head poses of the candidate speaker. In Figures 3.3a and 3.3c we report the performance of the
models across different head poses for both datasets. The candidate speaker’s head pose was binned on a per-frame basis into 6 buckets spanning 20 degrees each. The performance of each model was then calculated for all frames in each bucket. Across all models, performance decreased with the head pose’s distance from the center bucket; that is, performance tended to get worse as the candidate speaker looked further away from the camera (i.e., profile
faces). Although not always the case, this relationship can be seen from the general concave
shapes in the figures. This observation further motivates the introduction of information that is
independent from the candidate speaker for the task of ASD. In the next section we present the
relative improvement when information for the group-level focus of visual attention is incorporated
into the detection result.
ASD Augmentation Improvement – On the RFSG dataset, the best performing augmented
model resulted from combining the synchronizer detection with the continuous focus of visual
attention feature through multiplication. This yielded a mAP score of 0.850. For each of the detec-
tors on this dataset, the multiplication with the continuous focus of visual attention feature yielded
an improvement. For the FOVA dataset, the best performing augmented model involved combining
the synchronizer detection with the continuous focus of visual attention feature through weighted
averaging. This yielded a mAP score of 0.861. There was no improvement when augmentation
was applied to the detectors.
Figures 3.3b and 3.3d illustrate how each of the models performed across the different head
poses. We included the original synchronizer, TCN and BLSTM detectors, and the augmented
synchronizer. The results show that the augmented model performed better across the full range
of head poses, and partially managed to correct for the typical deterioration of performance that
occurs when the head pose is at an extreme angle with respect to the camera plane.
As can be seen in Table 3.1, the state-of-the-art detectors performed well in speaker-dependent
10-fold and speaker-independent leave-6-out cross-validation experiments on both datasets. Both
datasets provide well-posed problems, with cameras pointed at participants’ faces and clear audio
during a normal conversation. However, when these fine-tuned detectors are applied to a new set-
ting (cross-dataset), we observe a significant decrease in performance. Even though both datasets
consist of seated conversations around a table, the detectors appear to learn the distribution specific
to the context present in the dataset used for fine-tuning. Although this is a well understood limi-
tation of machine learning, it presents a significant challenge for creating accurate ASD methods
that can generalize across environments and physical contexts.
Given prior work on spatial bias (Stefanov et al. 2021) in vision-based voice activity detection,
we investigated how spatial bias hampered the performance of these detectors when transferred to
new physical contexts. As expected, we found that the models often performed worse when the
faces were pointed away from the camera. This occurs because detection is not as good and fewer
training examples exist for faces seen in profile.
To improve the performance of the models, we utilized a feature that is independent from the
candidate speaker’s head pose. We found that the focus of visual attention of the other group
members could be used to augment the output of the synchronizer and detectors, improving the
cross-dataset performance in almost all cases. Furthermore, we found a significant improvement
of the models’ performance when combined with the focus of visual attention of the other group
members across the entire spectrum of head poses. Notably, the increase is larger when the head is turned away from the camera than when the candidate speaker is facing the camera.
Despite their similarities, the RFSG and FOVA datasets have significant differences due to their setups and physical arrangement. The distribution of head poses for the FOVA dataset is bimodal,
caused by the seating of 3 people evenly around a circular table. The RFSG distribution is a steep
unimodal curve, likely caused by the shape of the table and the location of the robot on that table.
The fine-tuned detectors are well suited to the distributions of the respective datasets, leading to
a poor fit when faced with the challenge of transferring to a new dataset. However, this type of
distribution shift is expected every time the ASD models are employed in new interactions.
TTP Performance – In this work we show how simple focus of visual attention features can
improve the performance of general purpose synchronizers to be competitive with SOTA methods
on the 2021 MultiMediate Next Speaker Prediction Challenge. While the calculation of VisualAt-
tention features requires an understanding of the physical relationship between individuals in a
scene, it does not require the retraining or fine-tuning of another model. These models do not
outperform SOTA models that have been fine-tuned for a specific setting, as can be seen in the val-
idation results, but our prior work (under review) has shown they can outperform such models when transferred
to new scenes without an existing dataset and where fine-tuning is not possible.
It is important to note that the PerfectMatch synchronizer models used for this competition were
not trained for next speaker prediction but for active speaker detection. The augmentation with
visual focus of attention features may help for the turn-taking and turn yielding cases, but this
requires further investigation. Additionally, because the 2021 MultiMediate Challenge utilized
multiple cameras without providing their relative location to one another, in this work we utilized
crude, approximate measures of visual attention.
3.4 Summary
This chapter addressed the problem of active speaker detection in physically situated multiparty
interactions. This challenge required a robust solution that can perform effectively across a wide
range of speakers and physical contexts. Current state-of-the-art active speaker detection ap-
proaches rely on machine learning methods that do not generalize well to new physical settings.
We found that these methods do not transfer well even between similar datasets. We introduced
the use of group-level focus of visual attention in combination with a general audio-video synchro-
nizer method for improved active speaker detection across speakers and physical contexts. Our
dataset-independent experiments demonstrated that the proposed approach outperforms state-of-
the-art methods trained specifically for the task of active speaker detection. Additionally, in this
chapter we addressed the Next Speaker Prediction sub challenge of the ACM 2021 MultiMediate
Grand Challenge. This challenge posed the problem of turn-taking prediction in physically situ-
ated multiparty interaction. Solving this problem remains essential for enabling fluent real-time
multiparty human-machine interaction. This problem was made more difficult by the need for a
robust solution that can perform effectively across a wide variety of settings and contexts. Prior
work has shown that current state-of-the-art methods rely on machine learning approaches that
do not generalize well to new settings and feature distributions. To address this problem, we in-
troduced the use of group-level focus of visual attention as additional information. We showed
that a simple combination of group-level focus of visual attention features and publicly available
audio-video synchronizer models is competitive with state-of-the-art methods fine-tuned for the
challenge dataset.
In Chapter 4, we move from modeling the concrete dynamics of turn-taking to modeling the
internal dynamics of interpersonal trust.
Chapter 4
Chapter Four: Modeling Changes in Trust
This chapter presents work to explore the potential for SAR to improve group dynamics
when interacting with people in social settings. Specifically, this chapter presents the
user study of a robot facilitated academic support group. This study contributes a first
example of a robot facilitator and a novel measurement of dyadic trust. The results of
this study show a significant increase in interpersonal trust.
4.1 Approach to Measuring Trust Change in Robot-Facilitated
Support Groups
Contributors: Chapter 4 is based on Birmingham et al. (2020), written with Zijian Hu, Kartik Mahajan, Eli Reber, and Maja Matarić.
A key research challenge of human-robot interaction (HRI) in general and SAR in particular is
understanding interpersonal dynamics in group settings. In such settings, individuals interact with
one another through complex verbal and nonverbal signals that are contextual and change over
time. To successfully interact with and mediate group interactions, robots must be able to recog-
nize human signals in real time, understand what they mean in a given context, and then choose
appropriate actions to achieve group goals that may involve improving cohesion, communication,
engagement, or trust. Sensing and improving these properties of group dynamics is challenging
because they involve the interaction of many, often subtle, multimodal signals (Mana et al. 2007).
The work of this dissertation chapter introduces HRI and SAR into the novel context of support
group mediation. Support groups are meetings in which individuals with a common problem or
challenge provide support to one another, typically with the help of a mediator (Jacobs et al. 2011).
Trust is crucial for proper functioning of support groups, because it is only when participants feel
that they are in a trustworthy place that they are willing to share and receive support (Johnson
and Noonan 1972). In most support groups, the level of trust changes over time as participants
make disclosures and experience supportive responses from others. Group participants regularly
evaluate and update their trust in one another as the session progresses (Corey et al. 2013). Since
trust in a support group setting can change relatively quickly and significantly, the context is both
challenging and well suited for capturing data for training robots to learn the signals and dynamics
associated with changes in trust. Although trust between participants in a support group typically
grows over time, the skill of the mediator plays an important role in group success. Corey et al.
(2013) emphasize that mediators must have leadership skills such as genuineness, caring, openness,
self-awareness, active listening, confronting, supporting, and modeling in order to lead a group
effectively. They also point out how trust can be gained or lost by how the mediator copes with
conflict or the initial expression of negative reactions.
As an early step toward effective robot support group mediators, this exploratory work devel-
oped a novel framework for selecting robot mediator questions and disclosures, as well as a dyadic
trust scale for measuring interpersonal trust. The framework was validated on a dataset collected
in semi-structured interactions of 27 three-person robot-mediated support groups for academic
stress. The study was conducted just prior to academic year-end final exams and involved gen-
uinely supportive interactions among the participating students. The dyadic trust scale was found
to be uni-factorial, validating it as an appropriate general measure of trust. The results show that
robot-mediated interactions significantly increased dyadic trust between participants as well as be-
tween participants and the robot, and participants genuinely shared with one another under the
guidance of the robot mediator.
4.2 Evaluation of the Trust Change Measure
4.2.1 Methodology
Participants – A total of 81 university students who self-identified as stressed participated in the
study in groups of three; each group met once and the study produced a total of 27 recorded group
sessions. For the purposes of the study, stress was defined as all forms of academic stress, includ-
ing all concerns pertaining to class work and performance. All participants provided consent to
participate in the study, which was approved under USC IRB UP-19-00084. The participant demo-
graphics were as follows: gender: 48% female, 49% male, 3% preferred not to specify; ethnicity:
74% Asian, 11% Hispanic/Latinx, 13% Caucasian, 3% African American; degree being pursued:
45% undergraduate, 32% master’s, and 24% PhD.
Figure 4.1: Volunteer demonstration of the academic support group study setup; participants could
see one another and the robot but not each other’s computer screens.
Physical Setup – In each session, the three participants were seated around the end of a table
with a seated Nao robot on it, as seen in Figure 4.1. The Nao is a humanoid robot, 22.6” tall, with a total of 25 degrees of freedom.

Sensitivity | Question | Disclosure
Low | What do you like about school? | When I feel stressed, I think my circuits might overload. Does anyone else feel the same way?
Medium | What are some of the hardest parts of school for you? | Sometimes I worry I am inadequate for this school. Does anyone else sometimes feel that too?
High | What will happen if you don’t succeed in school? | Sometimes I worry about whether I belong here. Does anyone else feel the same way?

Table 4.1: Example questions and disclosures spoken by the robot, indicating sensitivity. A total of 16 questions and 6 disclosures were available; an average of 12 questions and 3 disclosures were made by the robot in each session.

The robot was positioned as a member of the group, and served as
the group moderator. Between the robot and the participants, a 360-degree microphone recorded
audio data. At the base of the microphone, three HD webcams were arranged, one facing each
of the participants; the webcams recorded participant body pose and facial expressions. Behind
the robot, an RGB-D camera was mounted on a tripod and recorded the interaction of all four
members of the group. The robot controller (Wizard) was seated behind a one-way mirror, out of
the participants’ sight.
Interaction Framework – The robot’s role as a mediator consisted of initiating and encour-
aging the process of sharing and supporting within the group, by asking questions that encouraged
participants to share with one another. The Wizard controlled the robot’s head direction (and there-
fore its gaze direction) to look at the speaking member of the group, and controlled the timing of
the robot’s speech (a question or disclosure) based on when the group finished the discussion of the
previous topics/question. The robot’s questions and disclosures were open-ended and specifically
designed to encourage the participants to share with one another, in order to support and learn from
one another. The content ranged from low sensitivity, such as, “What do you like about school?”
to high sensitivity, such as, “Sometimes I worry about if I belong here, does anyone else feel the
same way?” In the study, sensitivity was defined by how personal or invasive a question was or
how uneasy a question might make a participant feel (Kaplan and Yu 2015).
During the interaction, the robot said questions and disclosures according to a simple algorithm.
The questions and disclosures were grouped into low, medium, and high sensitivity as illustrated in
Table 4.1, and the robot started with low sensitivity questions before moving onto medium and then
high sensitivity. Before transitioning to the next highest level, the robot would make a disclosure.
This pattern of questions and disclosures starting at low sensitivity and moving to high sensitivity
allowed participants to become comfortable sharing and start trusting each other at each level of
sensitivity. The robot alternated between questions and disclosures and rotated which group member each question was first posed to, balancing the number of questions directed at each participant. The robot maintained a neutral affect, with no facial expressions and a neutral tone of voice.
Procedure – After participants consented to take part in the study, they were seated as shown
in Figure 4.1, answered a few demographic questions, and completed the pre-study trust survey
consisting of 30 Likert scale questions (described in Section 4.2.1). Participants’ familiarity with
each other was not measured. The robot then began the group interaction by explaining that the
purpose of the session was for them to talk about their academic stress and help one another. The
robot then asked the participants to introduce themselves. After the introductions, the Wizard
took control of the robot for the remaining 20 minutes of the session and chose questions and
disclosures according to the framework described above. At the end of the session, the robot asked
the group to conclude the group session by sharing what they felt they had learned.
After the group interaction was complete, the participants completed the trust survey again.
Participants were then invited to take part in an open-ended group interview, in which they had
the opportunity to provide feedback about their experience. Finally, the participants completed
a custom survey assessing their baseline trust (“Would you say that most people can be trusted or that you can never be too careful with people?”), the Negative Attitudes towards Robots Survey (NARS) (Nomura et al. 2004), the Big Five (short) Inventory (Rammstedt and John 2007), and the Empathy
Inventory (Davis et al. 1980).
Trust Surveys – A battery of trust surveys was administered to evaluate the pre- and post-study
levels of participant trust. Three validated surveys were used: 1) the Dyadic Trust Scale (Larzelere and Huston 1980), and the 2) Overall and 3) Emotional subscales of the Specific Interpersonal Trust Scale (Johnson-George and Swap 1982). Additionally, a customized study-specific scale was
administered, consisting of six questions based on the antecedents of trust: benevolence, integrity, and ability (see (Colquitt et al. 2007) for a meta-review of their significance). The complete combined battery of surveys consisted of 30 Likert scale questions; each participant completed three copies of the identical battery of surveys before and after the group session: one survey was about the robot and the other two were about the other two study participants.

                Agreeableness   Conscientiousness   Extroversion   Neuroticism   Openness   Emp Overall   NARS Overall   Trust Baseline
Robot Before    0.25            -0.13               0.11           -0.18         0.09       0.01          -0.37          0.27
Robot Change    -0.11           0.34                -0.02          0.17          0.01       -0.10         -0.12          0.03
Group Before    0.08            -0.07               -0.04          0.20          0.10       0.07          -0.20          -0.06
Group Change    0.21            -0.07               -0.01          -0.13         -0.05      0.18          0.06           -0.04

Table 4.2: Academic Support Group: correlation table, p < 0.05 are bold.
4.2.2 Results
This section presents the quantitative and qualitative results of the study. These results are based
on the scores of 71 participants; the scores of 10 participants who did not complete the surveys or answered the reverse-coded questions inconsistently were removed, resulting in n=71. The survey scores from each battery of surveys were combined to produce three numeric scores on a scale from -3 to 3 indicating the level of trust the participant felt towards each of the other participants,
and the robot.
Figure 4.2: Overall Trust Box Plot: participants’ trust pre- and post-interaction, relative to the
other group members and the robot. Medians are shown in orange.
Overall Trust – To determine significance, a t-test for comparing paired samples on the values
of trust was conducted before and after the support group session. The effect size was calculated
by
r = Z / √N
where Z was the standardized test statistic from a Wilcoxon Signed Rank Test and N was the size
of the corresponding population. For this analysis it is assumed that each participant’s rating of
trust in the other two participants was independent, doubling the population (n=142) as compared
to the robot (n=71). As shown in Figure 4.2, there was a significant increase in trust participants
felt in one another and in the robot; the effect size of the increase in trust was large for both groups.
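A sketch of this analysis using SciPy is shown below; it assumes paired arrays of pre- and post-session trust scores, and it derives Z from the standard normal approximation of the Wilcoxon signed-rank statistic (without tie correction), which may differ slightly from the exact procedure used here.

# Sketch of the paired t-test and the effect size r = Z / sqrt(N).
import numpy as np
from scipy import stats

def trust_change_stats(pre, post):
    pre, post = np.asarray(pre, float), np.asarray(post, float)
    t_stat, p_value = stats.ttest_rel(pre, post)   # paired-samples t-test
    diffs = post - pre
    diffs = diffs[diffs != 0]                      # Wilcoxon drops zero differences
    n = len(diffs)
    w, _ = stats.wilcoxon(diffs)
    # Normal approximation of the signed-rank statistic to obtain Z.
    mean_w = n * (n + 1) / 4.0
    sd_w = np.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w - mean_w) / sd_w
    effect_size_r = abs(z) / np.sqrt(len(pre))     # N = number of ratings
    return t_stat, p_value, effect_size_r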
          T-value          P-value   Effect Size   95% CI for Δ Mean
Group     t(141) = -9.46   <0.001    0.66          (0.24, 0.73)
Robot     t(70) = -6.95    <0.001    0.68          (0.43, 0.77)

Table 4.3: Overall Trust Test Statistics: overall change in trust given to each corresponding entity by the session participants.
Figure 4.3: Overall Trust Box Plot: the change in trust participants felt in group members and in
the robot, illustrating the effect the group interaction had on trust. Medians are shown in orange.
Figure 4.2 shows that the distributions of the group and robot trust contained similar changes in
the mean (increases of approximately 0.65 and 0.50, respectively), but different changes in standard
deviation (increases of 0.01 and 0.13, respectively). This highlights the growing variability of the
participants’ trust for the robot, as compared to their responses to the other participants. It can also
be seen from Figure 4.3 that the differences in the robot trust (SD = 0.58) were less variable than
those of group trust (SD = 0.75). This was most likely due to the larger number of outliers in the
group differences distribution than the robot differences distribution (13 and 5, respectively). Aside
from these, there are no other statistically significant differences between the two distributions (p
= 0.14).
The trust levels for the robot also provide support for its use in the support group context.
The large effect size and the significant increase in trust point towards the robot’s potential as an
effective mediator in support group conversations.
Demographics – After using a Bonferroni correction for multiple comparisons, none of the
tested demographics (age, gender, ethnicity, and degree sought) had a statistically significant im-
pact on the trust in the robot or other participants in the session. As can be seen in Table 4.2,
there were significant correlations between trust in the robot before and Agreeableness and Trust
Baseline (.25) as well as a strong negative correlation with NARS Overall (-.36). Interestingly, the
only correlation with the Robot Change was Conscientiousness, possibly due to a sense of duty to
rate the robot higher after the group interaction. There were no significant correlations in any of
the surveys with Group Before or Group Change.
Factor Analysis – A factor analysis was performed to understand the latent variables affecting the results for n = 71 participants. The Root Mean Squared Error of Approximation
and Root Mean Square of Residuals were used as an adequacy test to determine if a number of
factors was sufficient (Preacher et al. 2013). One dominant factor sufficient to meet the adequacy
test standards (RMSEA=(0.04, 0.06), RMSR=(0.04, 0.06)) was found. Removing the participants
who did not take the survey seriously resulted in higher inter-correlations among the questions,
thus allowing more questions to be loaded into the factor. After cleaning the results, the survey
data showed a total Cronbach’s alpha of over 0.9. The robot and group trust subscales also showed
Cronbach’s alphas of over 0.9, thus validating the internal consistency of the participants’ responses.
Figure 4.4: Factor Analysis Scree Plot: principal components in the composite trust survey for group members and the robot, before and after the interaction. All plots show a strong elbow at the second factor, indicating one factor explains most of the variance.

The questions that were loaded into the factor held an overarching theme of general trust. The single trust factor became more dominant after the session for the robot trust questions (eigenvalue
went from 17.03 to 22.87), but became less dominant for the group trust questions (eigenvalue went
from 24.60 to 17.48). This suggests that the participants’ conceptions of trust in the robot became
more monolithic (meaning uniform and indivisible) after the interaction, while conceptions of trust
in the other participants became less monolithic. All but two of the survey questions were loaded
into the one factor. In looking at the unloaded questions, it can be seen that they also measure
general trust, but include elements of the ability aspect of trust that may have deterred participants
from having similar answers to other questions. For example, the question “I would trust the robot
to take me to the airport” had very low correlations with the other questions, even though it is
asking about the participants’ trust in the robot because it seems infeasible that the small robot
could drive. After performing the factor analysis, it was decided that two questions that were
poorly correlated with the rest of the survey (less than 0.3) would be thrown out. The single factor
solution suggests that participants had a vague, monolithic notion of trust in one another that they
used to answer the survey. This may be because even after the group interaction, they had only
known one another for a short time, and had yet to solidify more distinct facets of trust in one
another and the robot.
Qualitative Results – At the conclusion of the support group setting the participants were
asked by the robot, “What is something each of you learned today?” Many participants expressed
sentiments that they were “not alone in feeling stressed” and “everyone is in the same boat.” Sev-
eral expressed that they felt they had “learned new tips and strategies for dealing with stress.” Not
all participants described feeling that way. Several chose to focus more on the robot, for example
saying “I am much more comfortable talking to the robot than I thought I would be” and “I am
shocked by machine and human interaction, the robot can talk and kind of understand feelings.”
In their optional feedback, many participants expressed that they enjoyed the group interaction.
Contrary to expectations, almost all participants expressed that stress was not a sensitive topic, and
that they “talk about academic stress all the time” with family and friends. When asked how the
session differed from everyday conversations about school stress, some participants focused on
how the discussion with the robot was mechanical and not a free flowing discussion, while others
focused on how they felt talking with a robot and strangers allowed them to “say things I could
not share in other situations.” Although most participants expressed that it was easy to talk about
academic stress, even with strangers and a robot, almost all participants said they felt they grew
closer with one another through the group interaction, and that trust grew as they shared with one
another. Supported by the survey data, this validates the hope that robot-mediated support group
interaction increases participant trust and helps to alleviate academic stress.
In the group interview, participants also offered feedback on the limitations of the interaction
and suggestions for improvement. A commonly-discussed limitation was the “lack of humanity”
of the Nao robot as a mediator. This sentiment was explained by several participants as being
related to the Nao robot’s simple, inexpressive face. Although the robot employed gestures such
as shrugging and head scratching, participants felt that it could not display empathy and that the
pauses and gestures were ‘awkward’. The robot turned its head to look at the participant who was
speaking; one participant felt “it was always watching me” while others described it as ‘lifeless’.
Another limitation participants identified in the design was the robot sound. Participants felt that
the noise of the robot’s motors interrupted the conversation flow and reminded them that the medi-
ator was a robot. Non-native English speakers also had trouble understanding the robotic voice and
often asked for the questions and disclosures to be repeated. When discussing suggestions for im-
provement, two common themes emerged. The first was the participants’ wish for the robot to do
more than ask questions and make disclosures: they suggested that the robot should acknowledge
or follow up on what was said before moving on with the conversation. The second theme was
the desire for the robot to either share more of its own backstory and be able to answer questions
about itself, or to state at the beginning that its purpose was solely to mediate and that it would not
answer questions.
4.2.3 Discussion
The work of this dissertation chapter reported on a user study evaluating trust in a socially assis-
tive robot-mediated academic support group for stressed university students. A group mediation
framework was developed and validated. To measure trust, a dyadic trust scale was implemented
and found to be uni-factorial, validating it as an appropriate general measure of trust. The support
group interaction had a strong effect on participants’ trust in one another and on their trust in the
robot. Participants were willing to learn from and share with one another under the guidance of
the robot mediator. The work of this dissertation chapter validated that the robot did function as
an effective mediator of the support group interaction, opening the door to future work with robot
support group mediators.
4.3 Summary
Socially assistive robots have the potential to improve group dynamics when interacting with
groups of people in social settings. This chapter presented work that contributes to the under-
standing of those dynamics through a user study of trust dynamics in the novel context of a robot
mediated support group. For the study presented here, a novel framework for robot mediation of
a support group was developed and validated. To evaluate interpersonal trust in the multiparty
setting, a dyadic trust scale was implemented and found to be uni-factorial, validating it as an ap-
propriate measure of general trust. The results of this study demonstrated a significant increase
in average interpersonal trust after the group interaction session, and qualitative post-session in-
terview data showed that participants found the interaction helpful and successfully supported and learned from one another. The results of the study validated that a robot-mediated support group
can improve trust among strangers and allow them to share and receive support for their academic
stress.
In Chapter 5 the focus changes from modeling trust dynamics with measurements of ratings of
interpersonal trust to modeling social support through measurement of perceptions of empathy.
Chapter 5
Chapter Five: Modeling Social Support Through Perceptions of
Empathy
This chapter presents a novel Mechanical Turk (MTurk) study of the differences in
user perception of affective and cognitive empathetic statements made by a robot. This
chapter gives an overview of empathy and perceptions of empathy in socially assistive robots, as well as the design and implementation of a study to test user perceptions of different types of empathetic statements made by a robot. Next, this chapter presents
the significant results from the study and a discussion of the study correlations shown
in those results.
5.1 Approach to Studying Perceptions of Robot Empathy
Contributors: Chapter 5 is based on Birmingham et al. (2022) written with Ashley Perez and
Maja Matarić.
Empathy is an individual’s capacity to respond to the emotions of others (Eisenberg et al.
2014), an integral part of human social communication. People use empathy to build connections
to one another by understanding and responding to one another’s thoughts and feelings. As robots
and conversational agents begin to have complex interactions with humans, they will also need
the capacity to understand and respond with empathy to the thoughts and feelings of humans.
Computational empathy (Yalçın 2019) can support the development of social relationships between
humans and robots, essential for socially assistive robots and social robots, and helpful for most
human-robot interaction. As with human empathy, computational empathy must be communicated
clearly and perceived as empathetic in order to be effective.
While the form empathy takes can vary greatly depending on the context, empathy can gener-
ally be understood through two broad categories or dimensions: cognitive empathy and affective
empathy. Cognitive empathy is the ability to understand another person’s thoughts and feelings.
Affective empathy is the ability to have an affective response that is more appropriate to someone
else’s situation than one’s own (Hoffman 2001). Humans naturally employ cognitive empathy, affective empathy, or both in various social interactions, but it is not yet clear how a
social or socially assistive robot should choose between these types of empathy when conveying
an empathetic response.
A robot’s ability to understand another’s feelings (cognitive empathy) and to feel in response
(affective empathy) is dramatically different from human abilities. This may lead the human target
of a robot’s empathy to perceive the robot’s attempt at empathy as insincere, causing unintended
consequences. Although there is no prior work exploring perceptions of cognitive and affective
empathy, we hypothesized that people are more likely to believe that a robot is able to identify
and understand an emotion rather than genuinely feel an emotion and would therefore rate a robot
using cognitive empathy as more believable and empathetic than a robot using affective empathy.
In the work of this dissertation chapter we explored the perceptions of a robot’s cognitive and
affective empathetic responses to a human’s disclosure through a within-subject study with coun-
terbalancing (n=111). We asked participants to rate their perceptions of a robot’s empathy based on
short videos of the robot giving cognitive and affective empathetic responses to disclosures about
stresses related to the COVID-19 pandemic made by an actor. To rate the interaction, participants
filled out a short survey after watching two videos demonstrating each condition. We analyzed the
relationship between the participants’ attitudes towards robots and their rating of how genuine they
felt the interaction was and their rating of the robot’s empathy in each condition.
Our surprising results showed that the robot that responded with affective empathy was per-
ceived as more empathetic. Further, we found that participants’ attitudes towards robots were
correlated with their ratings of the robot’s empathy and their belief that the interaction between
the robot and the actors were believable, natural, and genuine, and that negative attitudes towards
robots were associated with higher ratings of cognitive empathy over affective empathy. These
results serve to inform work into human-centered HRI and into developing robots that will be per-
ceived to be appropriately empathetic and could personalize their empathetic responses to each
user.
5.2 Evaluating Perceptions of Robot Empathy
5.2.1 Methodology
Hypotheses
H1: Participants will rate the empathy of a robot that makes cognitive-based empathy statements higher than a robot that makes affective-based empathy statements.

H2: More negative attitudes towards robots will be correlated with higher ratings in the cognitive condition compared to the affective condition for both:

H2A: empathy, and
H2B: belief in the interaction.
Study Design
We utilized a within-subject design with counterbalancing. The same participants evaluated
both the cognitive and affective conditions in a random order. This introduced less variation in
participant scores and required fewer participants than a between-subjects design.
Prior to running the study we chose a desired significance level of 0.05, power of 0.8, and a medium effect size: Cohen’s d = 0.4. To determine an appropriate sample size we needed an estimate
of the standard deviation, which we obtained by running a pilot study on Amazon Mechanical Turk
(AMT) with 15 participants under conditions identical to the main study. We found the standard
50
deviation in scored empathy in the pilot study to be approximately 0.7. To calculate the necessary
sample size, we utilized Kadam (Kadam and Bhalerao 2010):
n= 2(Z
a
+ Z
1− β
)
2
∗ σ
2
/δ
2
where Z
a
is a constant based on a 5% α error in a two-sided effect, Z
1− β
is a constant set
according to 80% power,σ is the standard deviation andδ is the estimated effect size. From this,
the minimum study population required to test H
1
was determined to be 48.
For evaluating H
2
, we utilized the commonly used heuristic for testing individual predictors
proposed in Green (Green 1991): N > 104+ m where m is the number of independent variables.
From this we concluded that the minimum study population should be increased to 106.
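For concreteness, the following minimal Python sketch reproduces this calculation (the function name is ours and the normal quantiles come from the Python standard library; it is illustrative rather than the code used for the study):

    from statistics import NormalDist

    def minimum_sample_size(sigma, delta, alpha=0.05, power=0.80):
        """n = 2 * (Z_a + Z_{1-beta})^2 * sigma^2 / delta^2 (Kadam and Bhalerao 2010)."""
        z_a = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for a 5% two-sided error
        z_b = NormalDist().inv_cdf(power)          # about 0.84 for 80% power
        return 2 * (z_a + z_b) ** 2 * sigma ** 2 / delta ** 2

    # Pilot standard deviation (0.7) and targeted difference (0.4) reported above
    print(round(minimum_sample_size(sigma=0.7, delta=0.4)))  # 48, the minimum for H1
    # Green's heuristic N > 104 + m then raises the minimum to 106 (m = 2 predictors).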
Participants – Participants were recruited through AMT. Inclusion criteria were: being from
the US, having an approval rate greater than 98%, having at least 1,000 prior approved tasks, and
not having taken this task before. Exclusion criteria were: age under 18, not fluent in English, and
being visually impaired (because of the visual nature of the study stimuli). In total, 181 participants
accepted the task and 111 completed the full survey. Only those who completed the full survey
were included in our analysis.
Participants who completed the full survey identified as: 75 Male, 33 Female, 3 Prefer not to
specify; 76 White, 15 Asian, 14 Hispanic, 8 African American, 2 American Indian or Alaskan
Native, and 1 Native Hawaiian or Other Pacific Islander; the ages ranged from 20 to 64, with a
mean of 36; the highest level of education completed was: 12 High School Equivalent or less,
26 Associate's or some college, 56 Bachelor's, 14 Master's, 3 Professional degree. All participants
consented to be a part of the study and were paid for their time.
Although the cognitive and affective conditions were balanced due to the nature of the within-
subject design, the condition order assignment was random and therefore was not balanced by
participant gender, age, or education.
Recording Setup – The recording space was chosen and arranged to give the impression
of a therapy session between the robot and the actor. The robot was a tabletop humanoid
QTRobot (LuxAI - award winning Social robots for autism and special needs education 2021),
set up across the table from the actor. The recording equipment was structured to capture an interview between the robot and the actor. One camera recorded the front of the robot over the shoulder of the actor, and another recorded the front of the actor over the shoulder of the robot, as shown in Figure 5.1. A microphone was placed between the robot and the actor to capture the speech of both. The robot was controlled with the open-source HARMONI software (Interaction-Lab n.d.[a]) that utilized CoRDial (Interaction-Lab n.d.[b]) for generating the robot's facial expressions and Google text-to-speech (Text-to-speech: Lifelike speech synthesis — google cloud n.d.) for generating the robot's speech. All software necessary for reproducing this study can be found at: https://github.com/interaction-lab/HARMONI/releases/tag/Empathy-Study.

Figure 5.1: Empathy Study Setup: camera views used to record the interaction with (a) the female actor and (b) the male actor; each pair of views captures the front of the actor (left) and the front of the robot (right).
The dialog in each video consisted of three parts. As the actor sat down, the robot greeted them
and asked, “How are you doing today?” The actor then made a disclosure related to the COVID-19
pandemic. The disclosures were structured as follows: “To be honest, I am not doing so well, I
feel X because Y” where X is a negative emotion and Y is a difficulty related to the pandemic.
Negative Emotion (X) | Difficulty Related to the Pandemic (Y)
Depressed | I don't have the energy to do anything anymore.
Overwhelmed | I haven't been able to go outside and do anything in so long.

Table 5.1: Empathy Study Examples: actor disclosures were structured in the format of "I feel X because Y".
The robot then made an empathetic response that consisted of three parts: 1) an acknowledgement
of the actor sharing information: “Thank you for sharing that”; 2) an affective or cognitive empa-
thetic statement (depending on the condition, described below); and 3) an invitation to share more
information: “Would you like to expand on that? We can talk more about why you are feeling this
way.”
The affective and cognitive empathetic statements were designed to convey the difference be-
tween the two types of empathy while maintaining a balanced length and structure of the utterance.
In the cognitive empathy statement, the robot communicated that it understood the actor by saying:
“It sounds like you are having a hard time feeling X right now”, where X is the emotion that the
actor stated they were feeling. In the affective empathy statement, the robot communicated that
it related to the emotion identified by the actor by saying: “I can totally relate to what you are
feeling, I also feel X sometimes”, where X is the emotion that the actor stated they were feeling.
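The structure of the two conditions can be made concrete with a simple template; the following sketch (the function name and control flow are ours) uses the exact wording described above:

    def empathetic_response(emotion: str, condition: str) -> str:
        """Assemble the robot's three-part response for the given condition."""
        acknowledgement = "Thank you for sharing that."
        if condition == "cognitive":
            statement = f"It sounds like you are having a hard time feeling {emotion} right now."
        elif condition == "affective":
            statement = f"I can totally relate to what you are feeling, I also feel {emotion} sometimes."
        else:
            raise ValueError("condition must be 'cognitive' or 'affective'")
        invitation = ("Would you like to expand on that? "
                      "We can talk more about why you are feeling this way.")
        return " ".join([acknowledgement, statement, invitation])

    print(empathetic_response("overwhelmed", "affective"))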
In addition to speaking, the robot used a set of three gestures. The first gesture was a two-
arm wave; the robot used it while it was greeting the actor, to express welcoming and excitement.
The second gesture involved the robot bringing its right hand to its chest in a gesture of sincerity
and thanks; the robot used it when it said “Thank you for sharing” and also when it invited the
participant to talk more. In the third gesture, the robot brought its hands to its hips to emphasize
the point it was making whenever it made the affective or cognitive empathetic statements.
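The mapping from dialog moments to gestures can be summarized as a small configuration; the sketch below is hypothetical (the gesture identifiers are illustrative and are not HARMONI's actual action names):

    # Illustrative mapping of dialog moments to the three gestures described above.
    GESTURE_FOR_DIALOG_ACT = {
        "greeting": "two_arm_wave",           # welcoming and excitement
        "acknowledgement": "hand_to_chest",   # "Thank you for sharing"
        "invitation": "hand_to_chest",        # inviting the participant to talk more
        "empathy_statement": "hands_to_hips", # emphasis on the empathetic statement
    }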
In both study conditions (affective and cognitive empathy statements) the robot maintained a
neutral expression and tone of voice, and used the same gestures and the same turn-taking timing
(when it started to speak after the actor stopped speaking). The video shown to the study participants did not include a response by the actors to avoid biasing the participants. A video of the full interaction is available here: https://youtu.be/nUf3SDUtbZ8.

Belief In | Statement
Robot Response | When the robot responded to Brandon and Isabel the response was believable.
Natural Interaction | The interaction between Brandon and Isabel and the robot appears natural.
Actor Distress | Brandon and Isabel appear genuinely distressed.

Table 5.2: Empathy Study Belief Measurement: the participant belief and the corresponding statement used to gauge participant belief in the interaction. The participants rated their agreement with each statement on a five-point Likert scale.
Measures – We collected quantitative and qualitative data to measure the participants' perception of the robot's empathy and to explore what factors affected that perception.
Quantitative Data Participants’ attitudes toward robots were measured with the Negative At-
titudes Towards Robots Scale (NARS) (Nomura et al. 2006) that was shortened to include only
questions relevant to socially assistive robots. The participants’ perceptions of the robot’s empathy
to the actor were measured with the Robot’s Perceived Empathy (RoPE) scale, modified from a
first person questionnaire to a third person questionnaire to account for the fact that participants
were watching an actor instead of interacting themselves with the robot. Specifically, we replaced
the first person pronouns “I” and “me” with the third person reference to the actors’ names, and
modified the questions to ask if the robot “appears to” meet the criteria of each question. Three
additional questions were included to assess the participants’ belief in the interaction scenario, as
shown in Table 5.2. All three used a five-point Likert scale ranging from ’Strongly Disagree’ to
’Strongly Agree’.
Qualitative Data Participants were asked if they would like to interact with the robot and to
explain why or why not. Participants were also asked to describe the differences between the robot
responses in the two conditions in their own words and explain which felt more empathetic. The
participants provided answers to these write-in questions in a blank text box.
Figure 5.2: Empathy Study Flow Chart: steps participants completed during the study. Participants
were considered to have completed the full study if they correctly answered all questions in the
Attention Check Questionnaire.
Procedure
This study was approved by the University’s Institutional Review Board (IRB). Participants
were recruited through AMT. As seen in the flowchart in Figure 5.2, participants first completed
a demographic questionnaire and the NARS. They then watched a video of the short interaction
between the robot and an actor and completed an attention check questionnaire that asked three
simple multiple choice questions about the video they had just watched. If they answered any of
the questions incorrectly, they were directed to the end of the survey. If they answered the questions
correctly, they were randomly assigned to a condition and either watched the affective or cognitive
condition videos first. Each condition involved watching two videos of actors interacting with the
robot, one male actor and one female actor. After they completed both videos, they were directed
to complete a short questionnaire about the robot they had just watched, including an attention
check question, the modified RoPE scale questions, the three belief questions, and the short-form
question about whether they would like to interact with the robot. Participants then completed the
other condition of the study. After completing both conditions, participants were asked to describe
the differences between the robots and explain which robot was more empathetic.
Participants were paid proportionally to the time spent on the task. Those who failed to com-
plete the first attention check spent about 4 minutes on the study and were paid US$1 for their time.
Those who completed the full survey, which took approximately 24 minutes, were paid US$6.
Analyses – This study measured participants’ attitudes toward the two types of robot-expressed
empathy. The dependent variables were empathy (the RoPE scale score) and belief (how the
participants rated their belief in the interaction scenario, shown in Table 5.2). The independent
variables were the condition viewed (affective or cognitive), participants’ attitudes towards robots
(the NARS score), participants’ demographics, and the ordering of conditions.
The five-point Likert scale questions were coded from −2 to 2, and the net score was calculated as the average of the individual question scores, with the appropriate questions reverse-coded. The dependent variables were tested with the Shapiro-Wilk test; they were found to not be normally distributed, so non-parametric tests were used. To test the difference between the affective and cognitive conditions, the Wilcoxon signed-rank test was used. To establish the relationship between the dependent and independent variables, a two-sided linear least-squares regression was used, with the correlation and significance determined with Spearman's rank correlation.
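The analysis pipeline can be sketched with SciPy as follows; this is an illustrative reconstruction using synthetic placeholder data, not the study's actual analysis code:

    import numpy as np
    from scipy import stats

    # Synthetic per-participant net scores shaped like the reported means and SDs;
    # in the study these came from the Likert items coded -2..2 with reverse-coding.
    rng = np.random.default_rng(0)
    affective = rng.normal(0.62, 0.67, 111)
    cognitive = rng.normal(0.38, 0.71, 111)
    nars = rng.normal(0.0, 0.5, 111)

    print(stats.shapiro(affective), stats.shapiro(cognitive))  # normality checks
    print(stats.wilcoxon(affective, cognitive))                # paired comparison of conditions
    print(stats.spearmanr(nars, affective - cognitive))        # NARS vs. rating difference
    print(stats.linregress(nars, affective - cognitive))       # two-sided least-squares fit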
5.2.2 Results
Cognitive vs. Affective Conditions
Empathy – The perceived empathy was higher in the affective condition (M= 0.62, SD= 0.67)
than in the cognitive condition (M= 0.38, SD= 0.71), as can be seen in Figure 5.3. The affective
condition was rated significantly higher than the cognitive condition (p < .001), showing a small-to-medium effect (Cohen's d = 0.35, effect-size r = 0.17). Thus, H1 is not supported.
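The reported effect-size r values are consistent with the common conversion from Cohen's d, r = d / sqrt(d^2 + 4); this is our observation rather than a formula stated in the text:

    from math import sqrt

    for d in (0.35, 0.16):                     # Cohen's d for the empathy and belief comparisons
        print(round(d / sqrt(d ** 2 + 4), 2))  # 0.17 and 0.08, matching the reported r values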
Belief – The perceived net belief was higher in the affective condition (M= 0.69, SD= 0.82)
than in the cognitive condition (M = 0.56, SD= 0.82), as shown in Figure 5.3. The affective
condition was rated significantly higher than the cognitive condition (p < .05), showing a small
effect (Cohen's d = 0.16, effect-size r = 0.08). Table 5.3 provides the data demonstrating that the affective condition was rated as more believable on all questions. The largest difference was found in the question asking if the interaction appears natural, where the mean participant rating for the affective condition was 0.18 greater than the mean for the cognitive condition.

Figure 5.3: Empathy Study Conditions: differences between the cognitive and affective conditions in the rated empathy and belief. Significant differences are indicated with asterisks: *=p < .05, ****=p < .0001.

Belief In | Affective Condition | Cognitive Condition
Robot Response | 0.76 | 0.60
Natural Interaction | 0.54 | 0.36
Actor Distress | 0.78 | 0.72

Table 5.3: Empathy Study Agreement: mean rating of agreement with each belief statement for each condition.
Interest In Future Interaction
After both conditions, participants were asked if they would be interested in interacting with
the robot in the future. In the affective condition, 47% of participants said they would like to
interact with a robot like the one shown in the video, while 32% said they would not, and 22%
were unsure. In the cognitive condition, 38% said they would be likely to, 36% said they would
not, and 26% were unsure. We found no statistical test for significance of these findings given their
nominal scale (yes-no-maybe). While this is not a direct test of H1, it provides further evidence that H1 is not supported.
Figure 5.4: NARS vs Empathy and Belief: the dependent measures plotted against the NARS
scores, including the linear regression line and the shaded 95% confidence interval. Empathy is
plotted in the top row, belief in the bottom. Ratings from the cognitive condition are on the left,
the affective condition in the center, and the differences between the two are plotted on the right.
Attitude Towards Robots vs. Empathy and Belief
As can be seen in Figure 5.4, the NARS scores were negatively correlated with the empathy (RoPE) and belief measures, although they were positively correlated with the difference between the cognitive and affective conditions for both measures. In the affective condition, the correlation between NARS and empathy (r(111) = −.23, p = .016) and between NARS and belief (r(111) = −.30, p = .002) was significant. In the cognitive condition, NARS was significantly correlated with the belief measure (r(111) = −.20, p = .037) but not with the empathy measure (r(111) = −.11, p = .244). For the difference between the two conditions (cognitive and affective), the correlation with the empathy measure was significant (r(111) = .23, p = .015) but the correlation with the belief measure was not (r(111) = .11, p = .251). Therefore, H2A is supported and H2B is not supported.
Belief vs. Empathy As can be seen in Figure 5.5, there was a positive correlation between
the participants’ rating of belief in the interaction and their ratings of the robot’s empathy. This
correlation was significant for both the cognitive (r(111) = .68, p < .001) and affective (r(111) = .67, p < .001) conditions.

Figure 5.5: Belief vs Empathy: correlation between rated belief in the interaction and perceived empathy of the robot, including the linear regression line and the shaded 95% confidence interval.
Order Effects – There were no significant differences in participant ratings of the robot's empathy between the affective-first and the cognitive-first orderings, as shown in Figure 5.6. Participant ratings of their belief in the interaction were significantly higher in the cognitive-first ordering for both the cognitive (p = 0.033) and affective (p = 0.002) conditions.
5.2.3 Discussion
The work of this dissertation chapter examined participant ratings of robot empathy after watching
a socially assistive robot deliver two types of empathetic statements (cognitive and affective) to an
actor. We found that the participants rated the affective statements higher than the cognitive ones. The participants' prior attitudes toward robots were correlated with their ratings of the robot's empathy and with the difference between the two conditions.
Participants rated affective empathy higher – We expected participants to find the idea that
a robot could feel an emotion to be less believable than the idea that a robot could understand the
feelings of a human, which we hypothesized (H1) would lead participants to rate the cognitive condition higher than the affective condition. Contrary to this expectation, we found that participants rated the robot that made affective empathetic statements as significantly more empathetic than the robot that made cognitive empathetic statements across all measures. Participants also expressed more interest in interacting with the robot seen in the affective condition. These results contradict H1.

Figure 5.6: Order Effects: difference in participant ratings of dependent measures between the cognitive-first and affective-first ordering. Significant differences are indicated with asterisks: ns=p > .05, *=p < .05, **=p < .01.
One possible explanation for these surprising results can be found in the breakdown of our
expectations. As shown in Figure 5.5, the expectation that belief would be strongly correlated with perceived empathy was correct; however, the expectation that belief in the cognitive condition would be higher was incorrect, as shown in Figure 5.3. Although this may explain why H1 was not supported, it does not explain why the affective empathy condition was rated higher.
The higher rating of the affective condition may be an indication that affective empathy (ex-
pressing feeling similarly) is generally perceived as more powerful than cognitive empathy (ex-
pressing understanding); however, we found no evidence for this in the literature on empathy and
therefore this requires further investigation. Alternatively, the fact that the affective condition was
rated higher may be explained by the correlation between the negative attitudes toward robots and
the difference between the ratings of empathy in the cognitive and affective condition. Participants
in our study generally scored low on the NARS, which correlated with higher ratings of empathy
in the affective condition compared to the cognitive condition.
Participant interest in a future interaction with the robot matched their ratings of empathy
and belief in the two conditions. In the affective condition, almost half of the participants said
they would be interested in interacting with the robot shown in the video, while about a third said they would not and the remainder were unsure. This is higher than in the cognitive condition, where only 38% of
participants were interested in interacting with the robot and about a quarter described themselves
as unsure.
More negative attitudes toward robots were correlated with empathy and belief ratings
– As presented in Section 5.2.2, participants’ prior attitude toward robots was correlated with the
ratings of belief and empathy in the affective condition and with belief in the cognitive condition.
This correlation shows how a more negative attitude toward robots predisposes participants to be
less likely to believe in the interaction scenario and to rate the robot as less empathetic, regardless
of condition. This correlation is stronger for ratings in the affective condition than in the cognitive condition, and there is a positive correlation between the NARS measure and the difference between the empathy ratings in the two conditions, as shown in Figure 5.4. This means that participants with more negative views of robots rated the empathy of the robot in the cognitive condition relatively higher, as expected. Therefore, H2A is supported. While there is a small positive slope to the regression between NARS and the difference in ratings of belief between conditions, it is not significant, so H2B is not supported by the data. That H2A is supported while H2B is not supported suggests that participants' prior attitude towards robots was more important than the specific type of empathy
the robot was trying to convey in determining their belief in the interaction. This is intuitive, in
that the type of empathy conveyed was only one factor among many that participants may have
incorporated into their belief rating, while it was likely the dominant factor in their rating of the
robot’s empathy.
Ethics and trade-offs of deception in HRI – Having robots express empathy inherently in-
volves some deception through anthropomorphization, which has been identified as a potential ethical hazard in the world's first explicitly ethical standard in robotics: BS8611-2016 Guide to
the ethical design and application of robots and robotic systems (BSI-2016 2016). In Winkle et al.
(2021), the authors establish that this anthropomorphization may be a necessary component for ef-
fective HRI, but also explore the use of strategies to mitigate the potential hazard. One strategy is
to “minimize the use of affective social interaction that suggests robots have ‘emotional states’ or
social agency”. This strategy mirrors the difference between the cognitive and affective conditions
in our study, so it is unsurprising that Winkle et al. also found that reducing anthropomorphism
(and thus the ethical risk) would make HRI more effective for some users and less effective for oth-
ers. We built on this result by describing how prior negative attitudes towards robots can explain
some of the differences in this effectiveness.
Limitations – There are several limitations to this study. The work of this dissertation chapter
was conducted during the COVID-19 pandemic, making in-person data collection at scale difficult.
A natural limitation of studying interaction preferences for a robot via AMT is that participants
were not able to interact with the robot directly. The resulting reported preferences may differ from
those that would be given as a result of in-person interactions. Additionally, the demographics
of the participants who selected the task on AMT are not balanced, with over two-thirds identifying as White and over two-thirds as male, preventing generalization to more diverse user
populations.
Another limitation of the study was the short and constrained nature of the dialog shown to
participants. In order to balance the gender of the actor (one male and one female), two videos
were shown to each participant for each condition. This meant each participant watched a total of
four videos, which had to be kept short in order to ensure participants would stay engaged with
each video and complete the study in a reasonable amount of time. To avoid creating confounds
in the study dialog, the structure of what the robot said was kept constant, with only the relevant
emotion and the type of empathy expressed changing in each video. This constrained interaction
structure was necessarily repeated in each video and may also have affected the participants’ views
of the robot.
5.3 Summary
This chapter presents work that contributes a novel study of the differences in user perception of
affective and cognitive empathetic statements made by a robot to an actor’s disclosure. We report
several findings, including that participants rated affective empathetic statements as both more
empathetic and more believable than cognitive empathetic statements, that the participants' belief in the interaction was highly correlated with the perceived empathy of the robot in both the affective and cognitive conditions, and that more negative attitudes towards robots were correlated with a higher rating of empathy in the cognitive condition than in the affective condition. We hope that the work of this dissertation chapter will inspire continued development of ethical, human-centered HRI.
Communicating empathy is important for building relationships in numerous contexts. Conse-
quently, the failure of robots to be perceived as empathetic by human users could be detrimental
to developing effective human-robot interaction. Work on computational models of empathy has
been growing rapidly, reflecting the importance of this ability for machines. Despite this growing body of work, there remain unanswered questions about how users perceive different forms of empathetic
expression by robots and how attitudes towards robots may mediate perceptions of robot empathy.
Do people really believe that robots can feel or understand emotions? The work of this disserta-
tion chapter studied the difference in viewers’ perceptions of cognitive and affective empathetic
statements made by a robot in response to human disclosure. In a within-subjects study, partici-
pants (n=111) watched videos in which a human disclosed negative emotions around COVID-19,
and a robot responded with either affective or cognitive empathetic responses. Using an adapted
version of the Robot’s Perceived Empathy (RoPE) scale, participants rated their perceptions of the
robot’s empathy in both cases. We found that participants perceived the robot that made affec-
tive empathetic statements as being more empathetic than the robot that made cognitive empathetic
statements; we also found that participants with more negative attitudes toward robots were more
likely to rate the cognitive condition as more empathetic than the affective condition. These results
inform HRI in general and future work into developing robots that will be perceived as empathetic
and could personalize empathetic responses to each user.
In Chapter 6 the model presented in this chapter and those introduced in prior chapters are
incorporated into a framework for facilitating social support.
Chapter 6
Chapter Six: Support Group Facilitation Framework
This chapter describes the development and validation of a framework for support
group facilitation. The framework consists of two approaches: role modeling and
directing as a facilitator. The framework was developed into a semi-autonomous fa-
cilitator that was evaluated on two support groups made up of three cancer survivors
and the robot facilitator.
6.1 Approach to Directing and Role Modeling for Facilitation
The Human Robot Group IPO Framework, first detailed by McGrath (1964) and then further re-
fined by Hackman and Morris (1975), structures a multiparty human-robot interaction into inputs,
process, and outputs (IPO). Inputs include the robot’s behavior, role, and appearance, group type
and composition, and environment setting, task, stress level, and reward structure. Process covers
the interactions among group members and the interaction between group members and the robot.
Output includes the performance outcomes and other outcomes affecting the group members (Sebo et al. 2020). The work of this dissertation chapter proposes methods relating to the process, in
which the robot acts as a facilitator to shape the social support among group members.
As previously described in Chapter 2, the primary mechanism by which a support group func-
tions is through empathy in response to disclosure. This chapter presents two methods for the
facilitation of social support in a support group: role modeling and elicitation of social support. A visual representation of these two methods is shown in Figure 6.1.

Figure 6.1: Social Support Methods: visualization of the two proposed social support methods for SAR facilitators of support groups: (a) role modeling social support; (b) eliciting social support.
To act as a role model, a SAR facilitator must make relevant disclosures and empathetic re-
sponses within the context of a particular conversation in a particular support group. Generating
empathetic responses to group members' disclosures requires that the robot recognize the valence of the disclosure and the corresponding emotion, and use that information to formulate an empathetic response. Role modeling disclosure is complex, but it is possible to utilize the support group topic to generate custom disclosures a priori and select the most appropriate disclosure to use during the conversation. Disclosures made by the robot should appear sincere and reflect vulnerability in order to effectively encourage other members of the group to make their own disclosures.
To achieve elicitation, a SAR facilitator must direct group members to perform their own dis-
closures and respond with empathy to each other. This task is more straightforward than role
modeling, as it does not require correctly understanding group member disclosures or generat-
ing customized disclosures for the robot. This implies a lower likelihood of mistakes and therefore improved trust in the robot facilitator.
Action | Role Modeling Support | Eliciting Support
Disclosure | During the pandemic I have also been struggling with isolation. | Would someone be willing to share how they are doing?
Empathy | Thank you for sharing, that sounds challenging to deal with! | Does anyone relate to what was just shared?

Table 6.1: Facilitator Action Space: examples of the SAR facilitator speech action space designed for facilitating through role modeling and eliciting empathy and disclosure.
The full action space of the SAR facilitator includes the following four actions: role modeling
disclosure, role modeling empathy, eliciting disclosure, and eliciting empathy; examples of each
of these actions are detailed in Table 6.1. Of the four actions the SAR facilitator can take, the
most challenging and sensitive is role modeling empathy in response to participant disclosure. To
accomplish this, we build on the work of Tavabi et al. (2019) on "identifying opportunities for empathetic response" and use a multimodal model for disclosure detection. This model has
been trained to recognize positive, negative, and neutral disclosure, allowing the robot to formulate
an empathic response, structured as described in Birmingham et al. (2022).
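As a rough illustration of how these four actions could be chosen from the detected disclosure valence, consider the sketch below; the utterances are taken from Table 6.1, while the branching logic is ours and is only one of many possible policies:

    def select_facilitator_action(disclosure_valence: str, role_model: bool) -> str:
        """Pick one of the four facilitator actions given the detected valence."""
        if disclosure_valence == "neutral":
            # No substantive disclosure yet: model a disclosure or elicit one.
            if role_model:
                return "During the pandemic I have also been struggling with isolation."
            return "Would someone be willing to share how they are doing?"
        # A disclosure was detected: model empathy or elicit it from the group.
        if role_model:
            # Wording for a negative disclosure; a positive one would get a
            # similarly structured, positively valenced acknowledgement.
            return "Thank you for sharing, that sounds challenging to deal with!"
        return "Does anyone relate to what was just shared?"

    print(select_facilitator_action("negative", role_model=True))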
Figure 6.2: Online Support Group Setup: volunteer demonstration of a robot-facilitated support
group over the video conferencing platform Zoom with a QTrobot.
6.2 Evaluation of Directing and Role Modeling for Facilitation
The goal of this study was to collect participant feedback about their perceptions of the robot facilitator and of its performance as a director or as a role model facilitator. This pilot study
explores online facilitation of a support group by an autonomous agent. The work of this disserta-
tion chapter collected detailed, higher quality feedback from a small set of participants, but was not
powered to quantitatively test specific hypotheses. Instead, this study contributes detailed feedback
from participants with prior support group experience, which will inform the further development
of an online support group facilitator.
6.2.1 Methodology
Study Design To evaluate the role modeling and directing conditions we utilized a within-subject
design with counterbalancing. The same participants evaluated both the role model and directing
conditions in a random order.
Measures Quantitative measurements were collected through the Group Cohesion measure
(Krogel et al. 2013) and the RoSaS measure (Carpinella et al. 2017). The Group Cohesion measure
consists of 19 statements about the other group members rated on a seven point scale from “not
true at all” to “very true”. The RoSaS measure consists of 18 adjectives participants were asked to
rate “how closely are the words below associated with the category social robots” on a nine point
scale from "definitely not associated" to "definitely associated". Additionally, participants were asked to rate the facilitator (the QTrobot, which was named "QT") with the following questions de-
signed specifically for this study: “Please rate your agreement or disagreement with the following
statement: I would like to have QT as my support group facilitator”; “What do you think of QT so
far?” and “What feedback would you give QT to help them be a better support group facilitator?”
Finally, participants were asked about their experience through open ended qualitative questions in
the survey and through a group discussion at the end of the study session.
Procedure This study was approved by the University’s Institutional Review Board (IRB). In
the study, the SAR facilitator leads a support group over Zoom, as illustrated in Figure 6.2. The
study was a within-subjects study in which the SAR facilitator engaged in role modeling and
eliciting social support methods for each group. Each of the two conditions was run for twenty-
five minutes, yielding a total of fifty minutes of interaction, which is typical for many support
groups.
Prior to the start of the study, participants were recruited with a flier on social media. Partici-
pants contacted the study coordinator, who answered any questions, consented them for the study,
and scheduled them with the other two participants. At the scheduled time, the study began with
the study coordinator reviewing the consent form with them and asking them to complete the first
set of survey questions, which included the demographic measures and the first round of the Group
Cohesion and RoSaS questionnaires. Then the robot facilitator, QT, gave an introduction to itself,
and asked the participants to fill out the first of three measurements rating the facilitator. After
that was completed, QT began the support group with an introduction icebreaker, and the first of
the two randomly assigned conditions. When the allotted time was up, QT directed participants to
complete the second rating of itself as the facilitator. Once participants completed that section of
the survey QT led the second round of the support group, which was the other randomly assigned
condition. After that was completed QT asked participants to complete the third and final round of
rating it as the facilitator. At this stage, the interaction was finished and the participants completed
the last section of open ended questions and then the study coordinator held a group debriefing
discussion.
Participants Participants for this study were recruited from the Adolescent and Young Adult
(AYA) cancer survivor population in the United States. This population includes cancer sur-
vivors ages 18 to 39. This population was chosen because their age range and experience in sup-
port groups enabled them to communicate effectively in online videoconferencing and to provide
constructive feedback about QT’s facilitation. Participants were recruited through social media.
Through an unplanned coincidence, members of one of the two groups identified as male and
members of the other group identified as female. The female group all identified as White, ages 32, 33, and 34; the male group was ages 25, 29, and 33, and identified as Hispanic, White/Mexican, and Asian, respectively. As measured by the abbreviated Big Five questionnaire (Rammstedt and John 2007), the participants rated themselves as less extroverted (M = −1, SD = 1.7) and more conscientious (M = 1.8, SD = 1.7). None of the group members had prior experience with robots, and all had prior experience as part of a support group, ranging from two of the women who had attended fewer than five group meetings to the others, who had each attended more than five.
Participants were given a US$25 Amazon gift card honorarium as thanks for their participation.
6.2.2 Results
Pre-Post RoSaS and Group Cohesion
 | Pre | Post
Warmth | 6.22 (0.84) | 5.67 (0.62)
Competence | 7.00 (0.88) | 7.03 (1.19)
Discomfort | 4.58 (1.61) | 3.08 (2.37)

Table 6.2: Pre-Post RoSaS: mean ratings associating social robots with the Warmth, Competence, and Discomfort subscales of the RoSaS measure before and after the interaction, standard deviation in parentheses. The scale is from 1 to 9 with higher being more associated.
In this pilot study, we observed changes in the RoSaS scores that are worthy of further investigation. As seen in Table 6.2, we found that the association with the Warmth subscale went slightly down, from M = 6.22 before the interaction to M = 5.67 after the interaction. The rating of Competence stayed high (M = 7.00) and Discomfort decreased from M = 4.58 to M = 3.08.
 | Pre | Post
Cohesion | 2.79 (0.81) | 3.06 (1.02)

Table 6.3: Pre-Post Group Cohesion: mean ratings of the Group Cohesion measure before and after the interaction, standard deviation in parentheses. The scale is from 0 to 6 with higher numbers indicating greater perceived group cohesion.
Group cohesion, as measured by the Group Cohesion Questionnaire, increased slightly from
the beginning (M = 2.79) to the end (M = 3.06). As can be seen in Table 6.3, this change is
considerably smaller than the standard deviation, and as such will require a higher-powered study to be properly evaluated.
Facilitator Feedback Feedback on the facilitator was measured after QT’s introduction and
again after the two rounds of facilitated support group interaction. The mean ratings and standard deviations of participants' desire to have QT as a facilitator can be seen in Table 6.4.

 | Initial | Director | Role Model
Desire for QT Facilitator | −0.17 (0.75) | 0.33 (1.51) | 0.017 (0.98)

Table 6.4: Facilitator Ratings: mean ratings of the desire to use QT as a facilitator in the future, standard deviation in parentheses. The scale is from −3 to 3 with higher numbers indicating greater desire for QT as their support group facilitator.
Group Discussion Analysis of the participants' responses to the open-ended feedback questions yielded several interesting results. First, participants mentioned several common elements that they
liked about the interaction. Two participants volunteered that they liked that the facilitator was
a robot, as it made it feel like it was “more of a safe environment”, and they felt they “were not
judged" by the robot. Other participants volunteered that they liked the questions, stating that they were "deep" and "had good order and structure". One participant expressed that they "liked
that other people were there”, because they felt more comfortable interacting with the robot with
others. However, multiple participants felt that QT was not interactive enough and would have liked it to be more interactive.
When asked if they felt QT understood them, participants expressed that they either somewhat
agreed or agreed that QT had understood them. Other questions were more universally agreed
upon. When asked what they had learned during the sessions, participants all agreed that they had
learned some form of "I am not alone in my feelings". When asked if they felt less stressed after
the interaction compared to when they had started, all participants said they were less stressed.
6.2.3 Discussion
The results of this study confirm that there is potential for SAR to effectively facilitate support
groups while also pointing out areas that should be improved for a larger study. In particular, the
enthusiastic engagement and positive feedback from participants was very encouraging. Partici-
pants’ feedback revealed that they saw QT as understanding them and asking good questions, two
qualities that are critical for a successful facilitator. Participant feedback also revealed that the in-
teraction was successful in its role as a support group, in that participants learned about each other and were supported by learning that they were not alone in the challenges they faced. Partic-
ipants also gave QT high scores for perceived competence and there was a decrease in ratings for
discomfort during the interaction. Finally, although the differences were small, participants' desire for QT as a facilitator increased from the baseline introduction after each condition, which seems to indicate that participants preferred the Director condition to the Role Model condition.
The small differences expressed, combined with the high variance, mean it is impossible to draw
conclusions from the limited study data, but do suggest hypotheses for future studies.
Several areas for improvement emerged. Participants expected more interactivity from QT,
wanting it to be more responsive to their questions. As a role model, QT made disclosures in
order to encourage others. These disclosures were occasionally followed by questions from the
group directed at QT, which it was unable to properly address. This might be solved by the use
of a large language model, though it would be imperative to make sure appropriate guardrails
were implemented to avoid having the robot say something inappropriate. One participant who
expressed a desire for more interactivity from QT said that, “it felt as though QT faded into the
background after a while”. This is challenging feedback, because when the conversation is going
well, the facilitator should take a back seat role and let the participants drive the conversation, only
intervening when the conversation veers off course or requires input to get it going again. This may
be one area where participants’ expectations for human and robot facilitators differ, as they might
see interacting with the robot as game-like in desired frequency, whereas they might not have the
same expectations for a human facilitator.
Turn taking remained a challenge for support group facilitation, even in a WoZ system. The
line between helpful silence and awkward pauses is blurry, and awkward pauses occur even in support groups led by experienced and trained facilitators. Alternative solutions for standardizing turn taking, such as requiring a physical action by both participants and facilitators, may be necessary, even at the cost
of slowing down the interaction.
The proposed methods are limited in scope to good-faith interactions and do not include meth-
ods for handling error correction or strategies for addressing rejection by group members or poten-
tial bad actors in the group, both of which are required in real-world applications.
6.3 Summary
This chapter presented the development and validation of a framework for support group facili-
tation. SAR facilitators have the potential to improve access to social support in various support
group contexts, including groups for health and behavioral conditions, parenting, students, and
many more. SAR facilitation methods may also apply to everyday SAR interactions in group con-
texts in the home, in schools, and in the workplace. Through this pilot study we have begun to
show the feasibility of a SAR facilitator and presented preliminary findings on the potential impact
of a SAR facilitator in a support group context. The work presented in this chapter has allowed us
to gain insights into the facilitation methods before engaging in larger experiments that will test
the empirical effectiveness of the SAR facilitator.
Chapter 7
Chapter Seven: Summary and Conclusions
This chapter first summarizes the contributions of this dissertation on leveraging mul-
tiparty HRI for SAR facilitation. Next, future directions for SAR within the Facilitation
framework are discussed.
7.1 Contributions
The importance of support groups cannot be overstated, as they provide a space where individuals
can come together and share their experiences, offer support, and build a sense of community.
However, the dynamics of group interactions can be complex and difficult to navigate, even for
socially adept humans. This is where SAR can play a crucial role in facilitating support groups,
by providing a neutral and non-judgmental presence that can help guide the conversation and keep
the focus on building supportive relationships between the group members.
One of the key advantages of using SAR as a support group facilitator is that it has computa-
tional capabilities that allow it to track the verbal and nonverbal participation of group members
throughout the entire interaction. Additionally, because the SAR is not human, the support group
members may feel more comfortable sharing with the robot as it is impossible for the robot to
judge what they are sharing. The ability of SAR to understand the complexities of group dynamics
in real-time, and provide valuable insights and feedback to support group members, can lead to
more effective and productive interactions within the group.
This dissertation contains an extensive review of related work in multiparty HRI and SAR, as
well as relevant background information on the application domain of support groups. To enable
SAR to engage in more complex model-driven turn-taking, it contributes models for detecting
the active speaker and predicting the next speaker based on group attention. It also explores the
challenges of modeling changes in trust, a crucial factor in support groups. Furthermore, the
dissertation presents a model of social support through perceptions of empathy and disclosure
made by a robot. Together, these models provide the foundation for a comprehensive support
group facilitation framework, which is evaluated in the context of a cancer support group.
This dissertation examines the implications of these contributions and how they can be used
to enable SAR to interact with groups in real-time, fostering better communication and improving
the social dynamics between the group members. By understanding the intricacies of modeling
turn-taking, changes in trust, and social support, SAR can effectively facilitate support groups and
provide the necessary tools for group members to interact with one another more effectively.
7.2 Future Direction of Facilitating Social Support for SAR
The work of this dissertation provides an early example of the potential for SAR to facilitate
multiparty social support. Progress on each of the problem domains presented in the dissertation aims to further enable SAR to reach more people more effectively. In the turn-taking domain,
progress requires more accurate and robust sensing of individuals’ gaze and pose, combined with
increased computational power for utilizing larger computational models at sufficient speeds to
allow them to be used in real-time applications. In the domain of trust modeling, there is a need
for larger and longer duration studies that can increase the study power and effect size. This
will allow future work to explore the nuances and intricacies of trust with greater fidelity. In the
domain of modeling social support through empathy and disclosure, one very interesting avenue
of future research is the exploration of the differences between social support provided through
text mediums and the same support made in person. The different modalities may change the
perception of the support being offered in interesting ways. Finally, there are many directions
of open research that can be explored to build upon the facilitation framework introduced in this
dissertation. One direction is the involvement of different modalities for detecting the emotional
state of individuals within a group and addressing the negative or positive states explicitly, even
when not volunteered by a group member. Another interesting avenue is to improve the multimodal
expression of the SAR facilitator in order to allow it to behave as a better role model actor while it
is facilitating.
7.3 Final Words
In conclusion, this dissertation presents a comprehensive framework for using SAR to facilitate
support groups. By understanding the complexities of group dynamics and the importance of social
support, SAR can provide a valuable resource for individuals who are facing difficult situations.
The potential for technology to bridge the gap between online and offline interactions, and to create
meaningful connections between individuals and within communities, is vast. Through continued
research and development, SAR can play a crucial role in improving the social dynamics among
group members and fostering stronger connections between individuals.
Bibliography
Hu, Amanda (2017). “Reflections: the value of patient support groups”. In: Otolaryngology–Head
and Neck Surgery 156.4, pp. 587–588.
Davison, Kathryn P, James W Pennebaker, and Sally S Dickerson (2000). “Who talks? The social
psychology of illness support groups.” In: American Psychologist 55.2, p. 205.
Gillis, Brenna D and Abby L Parish (2019). “Group-based interventions for postpartum depres-
sion: An integrative review and conceptual model”. In: Archives of Psychiatric Nursing 33.3,
pp. 290–298.
Hong, Yan, Ninfa C Pena-Purcell, and Marcia G Ory (2012). “Outcomes of online support and
resources for cancer survivors: a systematic literature review”. In: Patient education and coun-
seling 86.3, pp. 288–296.
Rice, Simon M, Joanne Goodall, Sarah E Hetrick, Alexandra G Parker, Tamsyn Gilbertson, G
Paul Amminger, Christopher G Davey, Patrick D McGorry, John Gleeson, and Mario Alvarez-
Jimenez (2014). “Online and social networking interventions for the treatment of depression in
young people: a systematic review”. In: Journal of medical Internet research 16.9, e206.
Posluszny, Donna M, Kelly B Hyman, and Andrew Baum (2002). “Group interventions in cancer:
The benefits of social support and education on patient adjustment”. In: Theory and research
on small groups, pp. 87–105.
Viswanathan, Ramaswamy, Michael F Myers, and Ayman H Fanous (2020). “Support groups and
individual mental health care via video conferencing for frontline clinicians during the COVID-
19 pandemic”. In: Psychosomatics 61.5, pp. 538–543.
Banbury, Annie, Susan Nancarrow, Jared Dart, Leonard Gray, and Lynne Parkinson (2018). “Tele-
health interventions delivering home-based support group videoconferencing: systematic re-
view”. In: Journal of medical Internet research 20.2, e25.
Feil-Seifer, David and Maja J Matarić (2005). "Defining socially assistive robotics". In: 9th Inter-
national Conference on Rehabilitation Robotics, 2005. ICORR 2005. IEEE, pp. 465–468.
Matarić, Maja J and Brian Scassellati (2016). "Socially assistive robotics". In: Springer handbook
of robotics, pp. 1973–1994.
Mana, Nadia, Bruno Lepri, Paul Chippendale, Alessandro Cappelletti, Fabio Pianesi, Piergiorgio
Svaizer, and Massimo Zancanaro (2007). “Multimodal corpus of multi-party meetings for au-
tomatic social behavior analysis and personality traits detection”. In: Proceedings of the 2007
workshop on Tagging, mining and retrieval of human related activity information. ACM, pp. 9–
14.
Short, Elaine, Katherine Sittig-Boyd, and Maja J Mataric (2016). “Modeling moderation for multi-
party socially assistive robotics”. In: IEEE Int. Symp. Robot Hum. Interact. Commun.(RO-MAN
2016). New York, NY: IEEE.
Jacobs, Ed E, Robert LL Masson, Riley L Harvill, and Christine J Schimmel (2011). Group coun-
seling: Strategies and skills. Cengage learning.
Johnson, David W and M Patricia Noonan (1972). “Effects of acceptance and reciprocation of self-
disclosures on the development of trust.” In: Journal of Counseling Psychology 19.5, p. 411.
Corey, Marianne Schneider, Gerald Corey, and Cindy Corey (2013). Groups: Process and practice.
Cengage Learning.
Ball, Barbara, Patricia K Kerig, and Barri Rosenbluth (2009). “Like a family but better because
you can actually trust each other”. In: Health Promotion Practice 10.1 suppl, 45S–58S.
Mussakhojayeva, Saida, Madi Zhanbyrtayev, Yerlik Agzhanov, and Anara Sandygulova (2016).
“Who should robots adapt to within a multi-party interaction in a public space?” In: The
Eleventh ACM/IEEE International Conference on Human Robot Interaction. IEEE Press,
pp. 483–484.
Salam, Hanan and Mohamed Chetouani (2015). “Engagement detection based on mutli-party cues
for human robot interaction”. In: 2015 International Conference on Affective Computing and
Intelligent Interaction (ACII). IEEE, pp. 341–347.
Bohus, Dan, Chit W Saw, and Eric Horvitz (2014). “Directions robot: in-the-wild experiences
and lessons learned”. In: Proceedings of the 2014 international conference on Autonomous
agents and multi-agent systems. International Foundation for Autonomous Agents and Multia-
gent Systems, pp. 637–644.
Fraune, Marlena R, Selma Šabanović, and Eliot R Smith (2017). "Teammates first: Favoring ingroup robots over outgroup humans". In: 2017 26th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN). IEEE, pp. 1432–1437.
Sugiyama, Takaaki, Kotaro Funakoshi, Mikio Nakano, and Kazunori Komatani (2015). “Estimat-
ing response obligation in multi-party human-robot dialogues”. In: 2015 IEEE-RAS 15th Inter-
national Conference on Humanoid Robots (Humanoids). IEEE, pp. 166–172.
Šabanović, Selma, Casey C Bennett, Wan-Ling Chang, and Lesa Huber (2013). "PARO robot affects diverse interaction modalities in group sensory therapy for older adults with dementia". In: 2013 IEEE 13th international conference on rehabilitation robotics (ICORR). IEEE, pp. 1–6.
Vázquez, Marynel, Aaron Steinfeld, and Scott E Hudson (2015). "Parallel detection of conver-
sational groups of free-standing people and tracking of their lower-body orientation”. In:
2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE,
pp. 3010–3017.
Kirchner, Nathan, Alen Alempijevic, and Gamini Dissanayake (2011). “Nonverbal robot-group
interaction using an imitated gaze cue”. In: Proceedings of the 6th international conference on
Human-robot interaction. ACM, pp. 497–504.
Short, Elaine and Maja J Mataric (2017). “Robot moderation of a collaborative game: Towards
socially assistive robotics in group interactions”. In: 2017 26th IEEE International Symposium
on Robot and Human Interactive Communication (RO-MAN). IEEE, pp. 385–390.
Short, Elaine Schaertl, Katelyn Swift-Spong, Hyunju Shim, Kristi M Wisniewski, Deanah Kim
Zak, Shinyi Wu, Elizabeth Zelinski, and Maja J Matarić (2017). "Understanding social inter-
actions with socially assistive robotics in intergenerational family groups”. In: 2017 26th IEEE
International Symposium on Robot and Human Interactive Communication (RO-MAN). IEEE,
pp. 236–241.
Jung, Malte F, Nikolas Martelaro, and Pamela J Hinds (2015). “Using robots to moderate team
conflict: the case of repairing violations”. In: Proceedings of the tenth annual ACM/IEEE in-
ternational conference on human-robot interaction. ACM, pp. 229–236.
Skantze, G. (2021). “Turn-taking in Conversational Systems and Human-Robot Interaction: A Re-
view”. In: Computer Speech & Language 67, pp. 101–178.
Skantze, G., M. Johansson, and J. Beskow (2015). “Exploring Turn-Taking Cues in Multi-Party
Human-Robot Discussions About Objects”. In: Proceedings of the ACM International Confer-
ence on Multimodal Interaction, pp. 67–74.
Roddy, M., G. Skantze, and N. Harte (2018a). “Investigating Speech Features for Continuous Turn-
Taking Prediction Using LSTMs”. In: Proceedings of the Annual Conference of the Interna-
tional Speech Communication Association, pp. 586–590.
Roddy, M., G. Skantze, and N. Harte (2018b). “Multimodal Continuous Turn-Taking Prediction
Using Multiscale RNNs”. In: Proceedings of the ACM International Conference on Multimodal
Interaction, pp. 186–190.
Ward, N. G., D. Aguirre, G. Cervantes, and O. Fuentes (2019). “Turn-Taking Predictions Across
Languages and Genres Using an LSTM Recurrent Neural Network”. In: Proceedings of the
IEEE Spoken Language Technology Workshop, pp. 831–837.
Skantze, G. (2017). “Towards a General, Continuous Model of Turn-Taking in Spoken Dialogue
Using LSTM Recurrent Neural Networks”. In: Proceedings of the Annual Meeting of the Spe-
cial Interest Group on Discourse and Dialogue, pp. 220–230.
Masumura, R., M. Ihori, T. Tanaka, A. Ando, R. Ishii, T. Oba, and R. Higashinaka (2019). “Im-
proving Speech-Based End-of-Turn Detection Via Cross-Modal Representation Learning with
Punctuated Text Data”. In: Proceedings of the IEEE Automatic Speech Recognition and Un-
derstanding Workshop, pp. 1062–1069.
Oertel, C., M. Wlodarczak, J. Edlund, P. Wagner, and J. Gustafson (2013). “Gaze Patterns in Turn-
Taking”. In: Proceedings of the Annual Conference of the International Speech Communication
Association.
Admoni, H. and B. Scassellati (2017). “Social Eye Gaze in Human-Robot Interaction: A Review”.
In: Journal of Human-Robot Interaction 6.1, pp. 25–63.
Joosse, Michiel Pieter (2017). “Investigating positioning and gaze behaviors of social robots: peo-
ple's preferences, perceptions, and behaviors".
Anguera, X., S. Bozonnet, N. Evans, C. Fredouille, G. Friedland, and O. Vinyals (2012).
“Speaker Diarization: A Review of Recent Research”. In: IEEE Transactions on Audio, Speech,
and Language Processing 20.2, pp. 356–370.
Tranter, S. E. and D. A. Reynolds (2006). “An Overview of Automatic Speaker Diarization Sys-
tems”. In: IEEE Transactions on Audio, Speech, and Language Processing 14.5, pp. 1557–
1565.
Ahmad, R., S. P. Raza, and H. Malik (2013). “Visual Speech Detection Using an Unsupervised
Learning Framework”. In: Proceedings of the International Conference on Machine Learning
and Applications. Vol. 2, pp. 525–528.
Stefanov, K., A. Sugimoto, and J. Beskow (2016). “Look Who’s Talking: Visual Identification of
the Active Speaker in Multi-Party Human-Robot Interaction”. In: Proceedings of the Advance-
ments in Social Signal Processing for Multimodal Interaction, pp. 22–27.
Siatras, S., N. Nikolaidis, M. Krinidis, and I. Pitas (2009). “Visual Lip Activity Detection and
Speaker Detection Using Mouth Region Intensities”. In: IEEE Transactions on Circuits and
Systems for Video Technology 19.1, pp. 133–137.
Minotto, V. P., C. R. Jung, and B. Lee (2014). “Simultaneous-Speaker Voice Activity Detection and
Localization Using Mid-Fusion of SVM and HMMs”. In: IEEE Transactions on Multimedia
16.4, pp. 1032–1044.
Cutler, R. and L. Davis (2000). “Look Who’s Talking: Speaker Detection Using Video and Audio
Correlation”. In: Proceedings of the IEEE International Conference on Multimedia and Expo.
Vol. 3, pp. 1589–1592.
Chakravarty, P., S. Mirzaei, T. Tuytelaars, and H. Van Hamme (2015). “Who’s Speaking? Audio-
Supervised Classification of Active Speakers in Video”. In: Proceedings of the ACM on Inter-
national Conference on Multimodal Interaction, pp. 87–90.
Chakravarty, P. and T. Tuytelaars (2016). “Cross-Modal Supervision for Learning Active Speaker
Detection in Video”. In: Proceedings of the European Conference on Computer Vision,
pp. 285–301.
Ren, J., Y. Hu, Y.-W. Tai, C. Wang, L. Xu, W. Sun, and Q. Yan (2016). “Look, Listen and Learn -
A Multimodal LSTM for Speaker Identification”. In: Proceedings of the AAAI Conference on
Artificial Intelligence, pp. 3581–3587.
Stefanov, K., J. Beskow, and G. Salvi (2017). “Vision-Based Active Speaker Detection in Multi-
party Interaction”. In: Proceedings of the Grounding Language Understanding, pp. 47–51.
Stefanov, K., J. Beskow, and G. Salvi (2020). “Self-Supervised Vision-Based Detection of the
Active Speaker as Support for Socially-Aware Language Acquisition”. In: IEEE Transactions
on Cognitive and Developmental Systems 12.2, pp. 250–259.
Hu, Y., J. Ren, J. Dai, C. Yuan, L. Xu, and W. Wang (2015). “Deep Multimodal Speaker Naming”.
In: Proceedings of the ACM International Conference on Multimedia, pp. 1107–1110.
Besson, P. and M. Kunt (2008). “Hypothesis Testing for Evaluating a Multimodal Pattern Recog-
nition Framework Applied to Speaker Detection”. In: Journal of NeuroEngineering and Reha-
bilitation 5.1, p. 11.
Hung, H. and S. O. Ba (2009). Speech/Non-Speech Detection in Meetings From Automatically
Extracted Low Resolution Visual Features. Tech. rep. Idiap.
Vajaria, H., S. Sarkar, and R. Kasturi (2008). “Exploring Co-Occurrence Between Speech and
Body Movement for Audio-Guided Video Localization”. In: IEEE Transactions on Circuits
and Systems for Video Technology 18.11, pp. 1608–1617.
Mayer, Roger C, James H Davis, and F David Schoorman (1995). “An integrative model of orga-
nizational trust”. In: Academy of management review 20.3, pp. 709–734.
Williams, Michele (2001). “In whom we trust: Group membership as an affective context for trust
development”. In: Academy of management review 26.3, pp. 377–396.
Larzelere, Robert E and Ted L Huston (1980). “The dyadic trust scale: Toward understanding
interpersonal trust in close relationships”. In: Journal of Marriage and the Family, pp. 595–
604.
Johnson-George, Cynthia and Walter C Swap (1982). “Measurement of specific interpersonal trust:
Construction and validation of a scale to assess trust in a specific other.” In: Journal of person-
ality and Social Psychology 43.6, p. 1306.
Sanchez, Gabriel, Melissa Ward Peterson, Erica D Musser, Igor Galynker, Simran Sandhu, and
Adriana E Foster (2019). “Measuring Empathy in Health Care”. In: Teaching Empathy in
Healthcare. Springer, pp. 63–82.
Hogan, Robert (1969). “Development of an empathy scale.” In: Journal of Consulting and Clinical
Psychology 33.3, pp. 307–316.
Davis, Mark H (1983). “Measuring individual differences in empathy: Evidence for a multidimen-
sional approach.” In: Journal of personality and social psychology 44.1, p. 113.
Spreng*, R Nathan, Margaret C McKinnon*, Raymond A Mar, and Brian Levine (2009). “The
Toronto Empathy Questionnaire: Scale development and initial validation of a factor-analytic
solution to multiple empathy measures”. In: Journal of personality assessment 91.1, pp. 62–71.
Jolliffe, Darrick and David P Farrington (2006). “Development and validation of the Basic Empa-
thy Scale”. In: Journal of adolescence 29.4, pp. 589–611.
Lietz, Cynthia A, Karen E Gerdes, Fei Sun, Jennifer Mullins Geiger, M Alex Wagaman, and Eliz-
abeth A Segal (2011). “The Empathy Assessment Index (EAI): A confirmatory factor analysis
of a multidimensional model of empathy”. In: Journal of the Society for Social Work and Re-
search 2.2, pp. 104–124.
Mercer, Stewart W, Margaret Maxwell, David Heaney, and Graham Watt (2004). “The consultation
and relational empathy (CARE) measure: development and preliminary validation and reliabil-
ity of an empathy-based consultation process measure”. In: Family practice 21.6, pp. 699–705.
Kane, Gregory C, Joanne L Gotto, Susan West, Mohammadreza Hojat, and Salvatore Mangione
(2007). “Jefferson Scale of Patient’s Perceptions of Physician Empathy: preliminary psycho-
metric data”. In: Croatian medical journal 48.1, pp. 81–86.
Bylund, Carma L and Gregory Makoul (2005). “Examining empathy in medical encounters: an
observational study using the empathic communication coding system”. In: Health communi-
cation 18.2, pp. 123–140.
Tavares, Walter, Sylvain Boet, Rob Theriault, Tony Mallette, and Kevin W Eva (2013). “Global
rating scale for the assessment of paramedic clinical competence”. In: Prehospital emergency
care 17.1, pp. 57–67.
Johanson, Deborah L, Ho Seok Ahn, Bruce A MacDonald, Byeong Kyu Ahn, Jong Yoon Lim,
Eddie Hwang, Craig J Sutherland, and Elizabeth Broadbent (n.d.). “Pay Attention! The Effect
of Attentional Behaviours by a Robotic Receptionist on User Perceptions and Behaviors”. In:
().
Leite, Iolanda, Samuel Mascarenhas, André Pereira, Carlos Martinho, Rui Prada, and Ana Paiva
(2010). “‘Why can’t we be friends?’ An empathic game companion for long-term interaction”.
In: International Conference on Intelligent Virtual Agents. Springer, pp. 315–321.
Charrier, Laurianne, Alisa Rieger, Alexandre Galdeano, Amélie Cordier, Mathieu Lefort, and Sal-
ima Hassas (2019). “The rope scale: a measure of how empathic a robot is perceived”. In: 2019
14th ACM/IEEE International Conference on Human-Robot Interaction (HRI). IEEE, pp. 656–
657.
Charrier, Laurianne, Alexandre Galdeano, Amélie Cordier, and Mathieu Lefort (2018). “Empathy
display influence on human-robot interactions: a pilot study”. In: Workshop on Towards Intel-
ligent Social Robots: From Naive Robots to Robot Sapiens at the 2018 IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS 2018), p. 7.
Daher, Karl, Jacky Casas, Omar Abou Khaled, and Elena Mugellini (2020). “Empathic chatbot
response for medical assistance”. In: Proceedings of the 20th ACM International Conference
on Intelligent Virtual Agents, pp. 1–3.
Reghunath, Anagha (2021). Expression of Empathy in Social Virtual Bots used for Genetic Coun-
seling.
Paiva, Ana (2011). “Empathy in social agents”. In: International Journal of Virtual Reality 10.1,
pp. 1–4.
Paiva, Ana, Iolanda Leite, Hana Boukricha, and Ipke Wachsmuth (2017). “Empathy in virtual
agents and robots: A survey”. In: ACM Transactions on Interactive Intelligent Systems (TiiS)
7.3, pp. 1–40.
Tapus, Adriana and Maja J Mataric (2007). “Emulating Empathy in Socially Assistive Robotics.”
In: AAAI spring symposium: multidisciplinary collaboration for socially assistive robotics,
pp. 93–96.
Hegel, Frank, Torsten Spexard, Britta Wrede, Gernot Horstmann, and Thurid Vogt (2006). “Playing
a different imitation game: Interaction with an Empathic Android Robot”. In: 2006 6th IEEE-
RAS International Conference on Humanoid Robots. IEEE, pp. 56–61.
Riek, Laurel D, Philip C Paul, and Peter Robinson (2010). “When my robot smiles at me: Enabling
human-robot rapport via real-time head gesture mimicry”. In: Journal on Multimodal User
Interfaces 3.1-2, pp. 99–108.
Leite, Iolanda, Ginevra Castellano, André Pereira, Carlos Martinho, and Ana Paiva (2014). “Em-
pathic robots for long-term interaction”. In: International Journal of Social Robotics 6.3,
pp. 329–341.
Cramer, Henriette, Jorrit Goddijn, Bob Wielinga, and Vanessa Evers (2010). “Effects of (in) ac-
curate empathy and situational valence on attitudes towards robots”. In: 2010 5th ACM/IEEE
International Conference on Human-Robot Interaction (HRI). IEEE, pp. 141–142.
Tahir, Yasir, Justin Dauwels, Daniel Thalmann, and Nadia Magnenat Thalmann (2018). “A user
study of a humanoid robot as a social mediator for two-person conversations”. In: International
Journal of Social Robotics, pp. 1–14.
Zuckerman, Oren and Guy Hoffman (2015). “Empathy objects: Robotic devices as conversation
companions”. In: Proceedings of the ninth international conference on tangible, embedded,
and embodied interaction. ACM, pp. 593–598.
Utami, Dina, Timothy W Bickmore, and Louis J Kruger (2017). “A robotic couples counselor for
promoting positive communication”. In: 2017 26th IEEE International Symposium on Robot
and Human Interactive Communication (RO-MAN). IEEE, pp. 248–255.
Utami, Dina and Timothy Bickmore (2019). “Collaborative user responses in multiparty interac-
tion with a couples counselor robot”. In: 2019 14th ACM/IEEE International Conference on
Human-Robot Interaction (HRI). IEEE, pp. 294–303.
Matsuyama, Yoichi, Iwao Akiba, Shinya Fujie, and Tetsunori Kobayashi (2015). “Four-participant
group conversation: A facilitation robot controlling engagement density as the fourth partici-
pant”. In: Computer Speech & Language 33.1, pp. 1–24.
Ohshima, Naoki, Ryo Fujimori, Hiroko Tokunaga, Hiroshi Kaneko, and Naoki Mukawa (2017).
“Neut: Design and evaluation of speaker designation behaviors for communication support
robot to encourage conversations”. In: 2017 26th IEEE International Symposium on Robot and
Human Interactive Communication (RO-MAN). IEEE, pp. 1387–1393.
Hoffman, Guy, Oren Zuckerman, Gilad Hirschberger, Michal Luria, and Tal Shani Sherman
(2015). “Design and evaluation of a peripheral robotic conversation companion”. In: Proceed-
ings of the Tenth Annual ACM/IEEE International Conference on Human-Robot Interaction.
ACM, pp. 3–10.
Riek, Laurel D (2012). “Wizard of oz studies in hri: a systematic review and new reporting guide-
lines”. In: Journal of Human-Robot Interaction 1.1, pp. 119–136.
Nigam, Aastha and Laurel D Riek (2015). “Social context perception for mobile robots”. In: 2015
IEEE/RSJ international conference on intelligent robots and systems (IROS). IEEE, pp. 3621–
3627.
Vázquez, Marynel, Elizabeth J Carter, Braden McDorman, Jodi Forlizzi, Aaron Steinfeld, and
Scott E Hudson (2017). “Towards robot autonomy in group conversations: Understanding the
effects of body orientation and gaze”. In: Proceedings of the 2017 ACM/IEEE International
Conference on Human-Robot Interaction. ACM, pp. 42–52.
Chien, Ling-Yu, Hsin Chu, Jong-Long Guo, Yuan-Mei Liao, Lu-I Chang, Chiung-Hua Chen, and
Kuei-Ru Chou (2011). “Caregiver support groups in patients with dementia: a meta-analysis”.
In: International journal of geriatric psychiatry 26.10, pp. 1089–1098.
Gottlieb, Benjamin H and Elizabeth D Wachala (2007). “Cancer support groups: a critical review
of empirical studies”. In: Psycho-oncology 16.5, pp. 379–400.
Martin, Joel, Mireia Bolibar, and Carlos Lozares (2017). “Network cohesion and social support”.
In: Social networks 48, pp. 192–201.
Dyaram, Lata and TJ Kamalanabhan (2005). “Unearthed: the other side of group cohesiveness”.
In: Journal of Social Sciences 10.3, pp. 185–190.
Griffiths, Kathleen Margaret, Alison L Calear, and Michelle Banfield (2009). “Systematic review
on Internet Support Groups (ISGs) and depression (1): Do ISGs reduce depressive symptoms?”
In: Journal of medical Internet research 11.3, e1270.
Rains, Stephen A and Valerie Young (2009). “A meta-analysis of research on formal computer-
mediated support groups: Examining group characteristics and health outcomes”. In: Human
communication research 35.3, pp. 309–336.
Borek, Aleksandra J and Charles Abraham (2018). “How do small groups promote behaviour
change? An integrative conceptual review of explanatory mechanisms”. In: Applied Psychol-
ogy: Health and Well-Being 10.1, pp. 30–61.
Garcia, Carolyn, Sandi Lindgren, and Jessie Kemmick Pintor (2011). “Knowledge, skills, and qual-
ities for effectively facilitating an adolescent girls’ group”. In: The journal of school nursing
27.6, pp. 424–433.
Birmingham, Chris, Kalin Stefanov, and Maja J Mataric (2021a). “Group-level focus of visual
attention for improved next speaker prediction”. In: Proceedings of the 29th ACM International
Conference on Multimedia, pp. 4838–4842.
Birmingham, Christopher, Maja Mataric, and Kalin Stefanov (2021b). “Group-Level Focus of Vi-
sual Attention for Improved Active Speaker Detection”. In: Companion Publication of the 2021
International Conference on Multimodal Interaction, pp. 37–42.
Goffman, E. (1974). Frame Analysis: An Essay on the Organization of Experience. Harvard Uni-
versity Press.
Goffman, E. (1981). Forms of Talk. University of Pennsylvania Press.
Chung, J. S. and A. Zisserman (2016). “Out of Time: Automated Lip Sync in the Wild”. In: Pro-
ceedings of the Workshop on Multi-view Lip-reading.
Chung, J. S. and A. Zisserman (2017). “Lip Reading in Profile”. In: Proceedings of the British Ma-
chine Vision Conference, pp. 1–11.
Chung, S.-W., J. S. Chung, and H.-G. Kang (2019). “PerfectMatch: Improved Cross-modal Em-
beddings for Audio-visual Synchronisation”. In: Proceedings of the IEEE International Con-
ference on Acoustics, Speech and Signal Processing, pp. 3965–3969.
Chung, J. S. (2019). “Naver at ActivityNet Challenge 2019 – Task B Active Speaker Detection
(AVA)”. In: arXiv preprint arXiv:1906.10555.
Roth, J., S. Chaudhuri, O. Klejch, R. Marvin, A. Gallagher, L. Kaver, S. Ramaswamy, A. Stopczyn-
ski, C. Schmid, Z. Xi, and C. Pantofaru (2019). “AVA-ActiveSpeaker: An Audio-Visual Dataset
for Active Speaker Detection”. In: arXiv preprint arXiv:1901.01342.
Birmingham, C., Z. Hu, K. Mahajan, E. Reber, and M. J. Matarić (2020). “Can I Trust You? A
User Study of Robot Mediation of a Support Group”. In: Proceedings of the IEEE International
Conference on Robotics and Automation, pp. 8019–8026.
Stefanov, K. and J. Beskow (2016). “A Multi-Party Multi-Modal Dataset for Focus of Visual At-
tention in Human-Human and Human-Robot Interaction”. In: Proceedings of the International
Conference on Language Resources and Evaluation.
Nagrani, A., J. S. Chung, and A. Zisserman (2017). “VoxCeleb: A Large-Scale Speaker Identifica-
tion Dataset”. In: Proceedings of the Annual Conference of the International Speech Commu-
nication Association.
Stefanov, K., G. Salvi, D. Kontogiorgos, H. Kjellström, and J. Beskow (2019). “Modeling of Hu-
man Visual Attention in Multiparty Open-World Dialogues”. In: ACM Transactions on Human-
Robot Interaction 8.2, p. 21.
Baltrusaitis, T., P. Robinson, and L. P. Morency (2016). “OpenFace: An Open Source Facial Be-
havior Analysis Toolkit”. In: Proceedings of the IEEE Winter Conference on Applications of
Computer Vision, pp. 1–10.
Müller, Philipp, Michael Xuelin Huang, and Andreas Bulling (2018). “Detecting Low Rapport
During Natural Interactions in Small Groups from Non-Verbal Behavior”. In: Proc. ACM In-
ternational Conference on Intelligent User Interfaces (IUI), pp. 153–164. DOI:
10.1145/3172944.3172969.
Paszke, A., S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga,
and A. Lerer (2017). “Automatic Differentiation in PyTorch”. In: Proceedings of the NeurIPS
Autodiff Workshop.
Kingma, D. P. and J. Ba (2014). “Adam: A Method for Stochastic Optimization”. In: Computing
Research Repository abs/1412.6980.
Stefanov, K., M. Adiban, and G. Salvi (2021). “Spatial Bias in Vision-Based Voice Activity De-
tection”. In: Proceedings of the International Conference on Pattern Recognition, pp. 10433–
10440.
Kaplan, Robin and Erica Yu (2015). “Measuring Question Sensitivity”. In: American Association
for Public Opinion Research, pp. 4107–4121.
Nomura, Tatsuya, Takayuki Kanda, Tomohiro Suzuki, and Kennsuke Kato (2004). “Psychology
in human-robot communication: An attempt through investigation of negative attitudes and
anxiety toward robots”. In: RO-MAN 2004. 13th IEEE International Workshop on Robot and
Human Interactive Communication (IEEE Catalog No. 04TH8759). IEEE, pp. 35–40.
Rammstedt, Beatrice and Oliver P John (2007). “Measuring personality in one minute or less: A
10-item short version of the Big Five Inventory in English and German”. In: Journal of research
in Personality 41.1, pp. 203–212.
Davis, Mark H et al. (1980). “A multidimensional approach to individual differences in empathy”.
In.
Colquitt, Jason A, Brent A Scott, and Jeffery A LePine (2007). “Trust, trustworthiness, and trust
propensity: A meta-analytic test of their unique relationships with risk taking and job perfor-
mance.” In: Journal of applied psychology 92.4, p. 909.
Preacher, Kristopher J., Guangjian Zhang, Cheongtag Kim, and Gerhard Mels (2013). “Choos-
ing the Optimal Number of Factors in Exploratory Factor Analysis: A Model Selection Per-
spective”. In: Multivariate Behavioral Research 48.1. PMID: 26789208, pp. 28–56. DOI:
10.1080/00273171.2012.710386.
Birmingham, Chris, Ashley Perez, and Maja J Mataric (2022). “Perceptions of Cognitive and Af-
fective Empathetic Statements by Socially Assistive Robots”. In: 2022 17th ACM/IEEE Inter-
national Conference on Human-Robot Interaction (HRI). IEEE.
Eisenberg, Nancy, Cindy L Shea, Gustavo Carlo, and George P Knight (2014). “Empathy-related
responding and cognition: A “chicken and the egg” dilemma”. In: Handbook of moral behavior
and development. Psychology Press, pp. 85–110.
Yalçın, Özge Nilay (2019). “Evaluating empathy in artificial agents”. In: arXiv preprint
arXiv:1908.05341.
Hoffman, Martin L (2001). Empathy and moral development: Implications for caring and justice.
Cambridge University Press.
Kadam, Prashant and Supriya Bhalerao (2010). “Sample size calculation”. In: International jour-
nal of Ayurveda research 1.1, p. 55.
Green, Samuel B (1991). “How many subjects does it take to do a regression analysis”. In: Multi-
variate behavioral research 26.3, pp. 499–510.
LuxAI - award winning Social robots for autism and special needs education (2021).
Interaction-Lab (n.d.[a]). interaction-lab/HARMONI: Controller code for human and Robot mod-
ular open interactions.
Interaction-Lab (n.d.[b]). Interaction-Lab/Cordial-Public: Robot control for speech and synchro-
nized behaviors, phone-based javascript face, and tablet interface.
Text-to-speech: Lifelike speech synthesis — Google Cloud (n.d.).
Nomura, Tatsuya, Takayuki Kanda, and Tomohiro Suzuki (2006). “Experimental investigation into
influence of negative attitudes toward robots on human–robot interaction”. In: AI & Society
20.2, pp. 138–150.
BSI-2016 (2016). BS 8611: 2016 Robots and Robotic Devices: Guide to the Ethical Design and
Application of Robots and Robotic Systems.
Winkle, Katie, Praminda Caleb-Solly, Ute Leonards, Ailie Turton, and Paul Bremner (2021). “As-
sessing and Addressing Ethical Risk from Anthropomorphism and Deception in Socially As-
sistive Robots”. In: Proceedings of the 2021 ACM/IEEE International Conference on Human-
Robot Interaction, pp. 101–109.
McGrath, Joseph Edward (1964). Social psychology: A brief introduction. Holt, Rinehart and Win-
ston.
Hackman, J Richard and Charles G Morris (1975). “Group tasks, group interaction process, and
group performance effectiveness: A review and proposed integration”. In: Advances in experi-
mental social psychology 8, pp. 45–99.
Sebo, Sarah, Brett Stoll, Brian Scassellati, and Malte F Jung (2020). “Robots in groups and teams:
a literature review”. In: Proceedings of the ACM on Human-Computer Interaction 4.CSCW2,
pp. 1–36.
Tavabi, Leili, Kalin Stefanov, Setareh Nasihati Gilani, David Traum, and Mohammad Soleymani
(2019). “Multimodal learning for identifying opportunities for empathetic responses”. In: 2019
International Conference on Multimodal Interaction, pp. 95–104.
Krogel, Julieann, Gary Burlingame, Chris Chapman, Tyler Renshaw, Robert Gleave, Mark
Beecher, and Rebecca MacNair-Semands (2013). “The Group Questionnaire: A clinical and
empirically derived measure of group relationship”. In: Psychotherapy Research 23.3, pp. 344–
354.
Carpinella, Colleen M, Alisa B Wyman, Michael A Perez, and Steven J Stroessner (2017). “The
robotic social attributes scale (RoSAS): development and validation”. In: Proceedings of the 2017
ACM/IEEE International Conference on human-robot interaction, pp. 254–262.
Abstract
Within the field of Human-Robot Interaction (HRI), Socially Assistive Robotics (SAR) has the potential to create robots that can facilitate and enhance social interaction among groups of people. These robots can help to connect individuals into more cohesive and supportive groups. This, however, is a difficult task, as it involves sensing individual attitudes, recognizing group dynamics, and behaving in a socially appropriate way to achieve a goal. This dissertation makes progress on each of these tasks in order to unlock the potential for SAR to facilitate social support.
This dissertation presents computational models for understanding individual attitudes, such as trust, and group dynamics, such as turn-taking. It also presents a facilitation framework for providing social support based on empathy and disclosure. These models and the facilitation framework are validated in the context of a support group.
This dissertation provides a review of related work in multiparty HRI and SAR, and relevant background on the application domain of support groups. It examines the complexities of modeling turn-taking, which include detecting the active speaker and predicting the next speaker based on group attention. Additionally, it investigates the challenges of modeling changes in trust through an academic support group. Finally, it models social support through perceptions of empathy and disclosure made by a robot. All of these models form the basis for a support group facilitation framework, which is evaluated in the context of a cancer support group. Together, these contributions form the foundation for enabling SAR to interact with groups in real-time and improve the social dynamics between the group members.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
On virtual, augmented, and mixed reality for socially assistive robotics
Managing multi-party social dynamics for socially assistive robotics
Towards socially assistive robot support methods for physical activity behavior change
Modeling dyadic synchrony with heterogeneous data: validation in infant-mother and infant-robot interactions
Situated proxemics and multimodal communication: space, speech, and gesture in human-robot interaction
Nonverbal communication for non-humanoid robots
Socially assistive and service robotics for older adults: methodologies for motivating exercise and following spatial language instructions in discourse
Coordinating social communication in human-robot task collaborations
Towards generalizable expression and emotion recognition
Efficiently learning human preferences for proactive robot assistance in assembly tasks
Computational modeling of mental health therapy sessions
Understanding and generating multimodal feedback in human-machine story-telling
Modeling and regulating human interaction with control affine dynamical systems
Automated alert generation to improve decision-making in human robot teams
Active sensing in robotic deployments
Robot, my companion: children with autism take part in robotic experiments
The task matrix: a robot-independent framework for programming humanoids
Robot life-long task learning from human demonstrations: a Bayesian approach
Behavioral signal processing: computational approaches for modeling and quantifying interaction dynamics in dyadic human interactions
Socially-informed content analysis of online human behavior
Asset Metadata
Creator
Birmingham, Christopher Michael
(author)
Core Title
Multiparty human-robot interaction: methods for facilitating social support
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science
Degree Conferral Date
2023-05
Publication Date
04/21/2023
Defense Date
03/22/2023
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
human-robot interaction,OAI-PMH Harvest,socially assistive robotics
Format
theses
(aat)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Mataric, Maja (committee chair), Miller, Lynn (committee member), Soleymani, Mohammad (committee member)
Creator Email
cbirming@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC113077791
Unique identifier
UC113077791
Identifier
etd-Birmingham-11690.pdf (filename)
Legacy Identifier
etd-Birmingham-11690
Document Type
Dissertation
Format
theses (aat)
Rights
Birmingham, Christopher Michael
Internet Media Type
application/pdf
Type
texts
Source
20230424-usctheses-batch-1029
(batch),
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu