Modeling Dyadic Synchrony with Heterogeneous Data:
Validation in Infant-Mother and Infant-Robot Interactions
by
Lauren Rebecca Klein
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
May 2023
Copyright 2023 Lauren Rebecca Klein
Acknowledgements
I am profoundly thankful for the support system that has made this dissertation possible.
Dr. Maja Matarić and her impactful research program are the reasons I decided to pursue a PhD.
Those who have been fortunate enough to work with Maja know she is one of the fiercest advocates
for students, both in the Interaction Lab and across USC. Maja’s support, both academic and
emotional at times, was instrumental in my progress through the PhD program. I am undoubtedly
a better researcher, writer, speaker, and mentor because of her.
I am thankful for the collaborations and mentorship provided by my dissertation committee
members.
Dr. Pat Levitt helped me to find my footing as an interdisciplinary researcher. His guidance
over the past years has inspired me to expand into new research directions, and has challenged me
to better communicate results to stakeholders across research domains. I am always thankful that
he welcomed me as an honorary part of his lab; the talented and inclusive environment that he
fosters has been a second academic home for me during my PhD program.
Dr. Mohammad Soleymani’s collaboration and feedback on numerous projects pushed me to
think more critically about my research. His class, Multimodal Probabilistic Learning of Human
Communication, provided me with learning opportunities that would become essential to my dis-
sertation research.
For years, I have been impressed and thankful for the support and collaboration provided by
Dr. Shri Narayanan’s students. The supportive community created by his lab members ensured
that I always had someone to go to with questions about behavioral signal processing.
Since Dr. Thomason joined USC, I have been excited about the collaborations and enthusiasm
that he brings to the 4th floor RTH community. His advice on career steps and mentorship goals
have helped me to progress through the final year of my PhD.
Countless other researchers across USC have been critical to my success. Dr. Beth A. Smith,
along with her students Marcelo R. Rosales and Wayne Deng, provided invaluable advice and
support. This team oversaw my first foray into interdisciplinary collaborative research, and I am
grateful for their help and expertise. I am thankful to Sahana Nagabhushan Kalburgi and Alma
Gharib for their friendship and mentorship in the Levitt Lab. Levitt Lab research assistants includ-
ing Aimé Ozuna, Liam North, and Dianna Guerrero Jimenez provided essential support during
usability testing work. Finally, I owe a debt of gratitude to Victor Ardulov. I first met Victor when
I asked for his advice after reading a paper of his in 2019. He has been a close collaborator and
friend ever since, and his advice has been essential to my research progress and learning.
I would also like to thank my peers and friends in the Interaction Lab. I may not have made it
through without Tom Groechel’s consistent willingness to listen to my research or life concerns for
long periods of time and then respond, “I think it will be fine.” Chris Birmingham’s friendship
and commitment to checking in on the well-being of his labmates have been a constant source of
support. I am thankful to have been able to rely on, and sometimes commiserate with, Zhonghao
Shi to discuss roadblocks that came up along my research journey. Katrin Fischer showed me
the ropes on usability testing, and I am thankful for her patience and willingness to share her
expertise. Roxanna Pakkar was the first person I met at USC during my PhD journey, and a true
friend throughout. Nathan Dennler’s creativity has been a constant source of inspiration over the
past few years. Finally, I want to thank Amy O’Connell, Mina Kian, and Anna-Maria Velentza for
forming a fun and supportive culture in the Interaction Lab.
I am grateful to have been able to mentor many talented students over the past 5 years. Allen
Chang has consistently impressed me with his drive and persistence in seeing research projects
through to completion. The software development prowess of Sahithi Ramaraju, Abiola Johnson,
Sneha Bandi, and Sairam Bandi allowed me to explore new research directions that would have
been infeasible without such a team. Other students I am thankful to have worked with include
Zijian Hu, Audrey Roberts, Youngseok Joung, Gabriella Margarino, Kiely Green, Michelle Kim,
Parimal Mehta, Padmanabhan Krishnamurthy, Kate Hu, Vicky Yu, Sarah Etter, Justin Lenderman,
and Jack Zhang.
I am thankful to the National Science Foundation, which funded my work in infant-robot interaction
and infant behavioral analysis via a grant for “Infant-Robot Interaction as an Early Intervention
Strategy.” I am also grateful to the Levitt Lab and the California Initiative to Advance Precision
Medicine for funding my research related to remote data collection and infant affect recognition. I
would also like to thank the Annenberg Fellowship program, the WiSE Qualcomm Fellowship Pro-
gram, Schmidt Futures, and the JPB Foundation for funding my work and the prior data collections
that enabled me to do meaningful research.
Finally, I would like to thank the Klein and Dubin/Zimmerman families for their support
throughout my PhD program. To my parents Amy and Steven, thank you for teaching me to
value my education and to advocate for myself, and for the pep talks that have pushed me through
to this point. To Greg and Lori, you are the best brother and sister-in-law I could ask for. To
Celine, an honorary Klein, thank you for your willingness to provide TLC at a moment’s notice. To
my grandparents Papa Al, Grandma DeeDee, Papa Barry, and Grandma Omi, thank you for your
unconditional interest and enthusiasm in all of my endeavors, and for sending me highlighted and
annotated newspaper articles in the mail related to my research.
The hospitality provided by the Dubin and Zimmerman families made my transition to L.A.
warm and welcoming. I am thankful to Cindy and Mark for making sure I always have a home
for Shabbat dinners, and for their sportsmanship. Fred and Marlene provided a much-needed
outlet for discussing arts and science. To Josh, I am beyond fortunate for the unwavering support,
thoughtfulness, patience, and fun you bring to each day, and this dissertation would not have been
possible without you.
For this entire team of rockstars, I am deeply grateful.
Table of Contents
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
1.1 The Role of Social Synchrony in Embodied Interactions
1.2 Defining Social Synchrony: Key Terms and Definitions
1.2.1 Joint Attention
1.2.2 Shared Affective States
1.2.3 Temporal Behavior Adaptation
1.3 Problem Statement and Methods
1.4 Contributions
1.5 Dissertation Outline
Chapter 2: Background and Related Work
2.1 Social Synchrony in Parent-Child Interactions
2.1.1 Influence on Child Development
2.1.2 Observation Contexts
2.1.3 Manual Evaluation Approaches
2.1.4 Computational Modeling Approaches
2.1.5 Automated Affect and Behavior Recognition
2.2 Social Synchrony in Human-Robot Interaction
2.2.1 Human-Robot Collaboration
2.2.2 Socially Assistive Robotics (SAR)
2.2.3 Joint Attention in SAR
2.2.4 Affective Behavior Recognition in SAR
2.3 Summary
Chapter 3: Datasets
3.1 Infant-Mother Interactions: Face-to-Face Still-Face Paradigm
3.1.1 Social Interaction Paradigm
3.1.2 Experimental Procedure
3.1.3 Participants
3.1.4 Implementation
3.1.5 Labels
3.1.6 Inclusion and Exclusion Criteria
3.2 Infant-Robot Interactions: A Contingent Learning Paradigm
3.2.1 Interaction Paradigm
3.2.1.1 Contingent Learning User Study 1
3.2.1.2 Contingent Learning User Study 2
3.2.1.3 Contingent Learning User Study 3
3.2.2 Labels
3.3 Acknowledgements
Chapter 4: Evaluating Temporal Behavior Adaptation Across Heterogeneous Behaviors with Windowed Cross-Correlation
4.1 Technical Approach
4.1.1 Feature Extraction
4.1.1.1 Video Features
4.1.1.2 Audio Features
4.1.2 Infant Responses to Still-Face Paradigm
4.1.3 Infant-Mother Temporal Behavior Adaptation
4.1.3.1 Windowed Cross-Correlation with Peak-Picking
4.1.3.2 Parameter Selection
4.1.3.3 Model Evaluation
4.2 Results and Analysis
4.2.1 Infant Responses to Still-Face Paradigm
4.2.1.1 Infant Head Angle
4.2.1.2 Infant Arm Angle
4.2.1.3 Infant Vocalizations
4.2.2 Infant-Mother Temporal Behavior Adaptation
4.2.2.1 Lag Variance
4.3 Discussion and Summary
Chapter 5: Integrating Multiple Heterogeneous Behaviors into a Model of Social Synchrony with Dynamic Mode Decomposition with Control
5.1 Dyadic Interaction as a Dynamical System
5.2 Technical Approach
5.2.1 Dataset
5.2.1.1 Person Tracking
5.2.1.2 Pose Evaluation
5.2.2 Dynamical System Modeling
5.2.2.1 Window Selection
5.2.2.2 Control Parameters
5.2.2.3 Dynamic Mode Decomposition with Control
5.2.3 Analysis
5.2.3.1 Model Validation
5.2.3.2 Exploring Trends in Interaction Dynamics
5.3 Results and Discussion
5.3.1 Model Validation
5.3.1.1 Observing the Still-Face Instructions
5.3.1.2 Leading-Following Relationship
5.3.2 Trends in Interaction Dynamics
5.3.2.1 Infant Responses to the Still-Face Stage
5.3.2.2 Trends Across Infant Age
5.3.3 Incorporating Audio Data
5.4 Discussion and Summary
Chapter 6: ICHIRP: An Infant-Caregiver Home Interaction Recording Platform
6.1 Preliminary Design Considerations
6.2 Application Prototype Design and Implementation
6.2.1 Recording Support
6.2.2 Registration, Uploading, and Scheduling
6.2.2.1 Registration and Home Screen
6.2.2.2 Uploading
6.2.2.3 Scheduling
6.3 Pilot Usability Study
6.3.1 Study Setup
6.3.2 Procedure
6.3.3 Participants
6.4 Results and Analysis
6.5 Discussion and Summary
Chapter 7: Predicting Visual Attention During Infant-Robot Interaction
7.1 Bayesian Surprise Model
7.2 Visual Surprise in the Interaction Environment
7.2.1 Methodology
7.2.2 Results
7.3 Evaluating Robot Behaviors that Initiate Joint Attention
7.3.1 Methodology
7.3.2 Results
7.4 Discussion and Summary
Chapter 8: Real-Time, Continuous Affect Prediction in Socially Assistive Robotics
8.1 Technical Approach
8.1.1 Feature Extraction
8.1.2 Feature Preprocessing
8.1.3 Train-Test Split
8.1.4 Temporal Feature Aggregation and Windowing
8.1.5 Feature Selection
8.1.6 Affect Classification
8.1.6.1 Unimodal Classifiers
8.1.6.2 Multimodal Classifiers
8.1.6.3 Trends in Model Performance Over Time
8.1.6.4 Classifier Accuracy Versus Time Since Infant Affect Transition
8.1.6.5 Classifier Accuracy Versus Time Since Classifier Prediction Transition
8.2 Results and Analysis
8.2.1 Model Accuracy Versus Input Window Length
8.2.2 Trends in Model Performance Over Time
8.2.3 Accuracy Versus Time Since Affect Transition
8.2.4 Accuracy Versus Time Since Prediction Transition
8.3 Discussion and Summary
Chapter 9: Dissertation Summary and Conclusions
Bibliography
List of Tables
5.1 t-statistic between $R_I^{play}$ and $R_I^{still-face}$
5.2 Correlations between $R_I$ and $R_M$
5.3 Correlation between ($R_I^{play}$ - $R_I^{sf}$) and $\lambda_{A,I}^{sf}$
5.4 Trends in $\lambda_{A,M}$ across age
5.5 Trends in $R_I$ Across Infant Age
5.6 Correlations between $R_I$ and $R_M$
5.7 Trends in $R_I$ across Infant Age
6.1 Usability Issues by App Area
7.1 Percent of Infant Gaze Locations in Regions with Higher than Average Surprise Value
List of Figures
1.1 Left: components of social synchrony; right: relationships between the components of social synchrony during an embodied dyadic social interaction
3.1 A mother and infant participating in the Face-to-Face Still-Face procedure. The infant is sitting in the lap of a researcher.
3.2 An infant participating in the Socially Assistive Robot Contingent Learning Paradigm.
4.1 Arm and head angles measured using pose features extracted with OpenPose.
4.2 Number of interactions in each dataset. $D_{video}$ is the set of interactions with reliable video data. $D_{audio}$ is the set of interactions with reliable audio data. $D_{video}$ consists of data from 54 dyads, and $D_{audio}$ consists of data from 39 dyads.
4.3 Illustration of windowed cross-correlation with peak-picking, evaluated for a single window on example signals. Top: a given time window is shifted across multiple lag values, changing the range of infant angle values considered. Bottom: the correlation values between windowed signals are plotted, and a peak lag value is identified.
4.4 Trends in infant behavior across FFSF stage and age. Bottom left: percent of time spent vocalizing; top right: variance in arm pitch; bottom right: variance in head pitch. Bars and asterisks represent significant results for Student's t tests between individual FFSF stages, with * p < 0.025 and ** p < 0.001.
4.5 Left: Student's t-test statistic between lag variance distributions during Play and Reunion. Negative values indicate a higher mean lag variance during Reunion compared to Play. The key at the left indicates the behavioral signals which were input into the windowed cross-correlation model. Significance is reported with * p < 0.025 and ** p < 0.001. Right: median lag variance across age, with 95% confidence intervals.
5.1 Computational modeling pipeline. Left: pose landmarks are extracted from a video frame using OpenPose and distances between features are calculated; vocal fundamental frequency is extracted with Praat and speaker identification is performed via manual annotation; middle: dynamical system model with arrays of infant and mother features from two consecutive frames; right: matrices of infant and mother features. In this example, DMDc is applied to multimodal data sampled at 30 Hz from a 3-second window of an interaction, and the infant's features are used as the control input.
5.2 $\lambda_{A,M}$ across infant age, calculated using head pose data (left), arm pose data (center), and both (right). Results include $\lambda_{A,M}$ values from each of the three stages (play, still-face, reunion) of the FFSF protocol.
6.1 Registration and Home Screens of the ICHIRP application. Left and middle: registration screens; right: application home screen and application tour.
6.2 Registration and Home Screens of the ICHIRP application. Left and middle: registration screens; right: application home screen and application tour.
6.3 Study Setup
7.1 Experimental setup of the SAR leg movement study; the infant is seated across from the Nao robot and is wearing leg and arm motion trackers and an eye tracker. This dissertation uses data from the study to explore Bayesian surprise as a potential predictor of infant visual attention.
7.2 Left: the surprise values of each 16x16 patch of pixels. Lighter pixels indicate higher surprise values. Right: study environment from the infant's point of view with overlaid target to show infant gaze location during a robot kicking behavior. The circles indicate 2, 4, and 8 degrees from the estimated gaze location.
7.3 Histograms and corresponding KL divergence values of infant and random gaze distributions. From each frame in an infant's gaze tracking video, we extracted the surprise value at the infant's gaze location and at a random location in the frame. This process was repeated 100 times to find the KL divergence for each infant (p < 0.0001 for each infant on a one-tailed t-test of KL > 0).
7.4 The robot behavior signal and log surprise signal, $\zeta = 0.98$, for a 1.5-minute time interval. A robot behavior signal value of 1 indicates that the robot is kicking its leg, while a value of 0 indicates the robot is still.
7.5 Infants 1, 2, 3, and 5: looking behavior with robot behavior and surprise signal. The dotted line represents the robot behavior. Values 1, 2, and 3 indicate robot kicking, robot kicking and lights, or robot kicking and laughing, respectively. For Percent of Behaviors Looked At, a value of 1 on the y-axis corresponds to 100%.
7.6 Left: the regression line and data from infant 1. The regression line is defined by the equation RBL = 83.49 + 4.58 LAS, showing a trend that infant 1 looked at a higher percentage of the robot behaviors during minutes with higher log average surprise value. Right: the regression line and data from infant 2. The regression line is defined by the equation RBL = 41.89 + 3.42 LAS, showing a trend that infant 2 looked at a higher percentage of the robot behaviors during minutes with higher log average surprise value.
8.1 Overview of the modeling framework. FC: Fully Connected (Dense) Layer.
8.2 Scatter plot and marginal density distributions of the first 2 principal components of face and body NN embeddings. The distributions visualized were produced from embeddings that represent the first test group of infants.
8.3 Mean AUC scores across the 5 test trials for each set of long window lengths for facial and body features.
8.4 Model accuracy along seconds since infants transitioned into a given affect. The mean accuracy across predictions is illustrated by the solid line, and the 95% confidence interval for model accuracy across infant-robot interaction sessions is shaded around the mean accuracy. Only data samples with a sample size greater than 3 are visualized.
8.5 Model accuracy along seconds for which the classifier produced consecutive predictions of the same affect. The mean accuracy across infants is illustrated by the solid line, while the 95% confidence interval for model accuracy across infant-robot interaction sessions is shaded around the mean accuracy. Only data samples with a sample size greater than 3 are visualized.
Abstract
Our health and well-being are intricately tied to the dynamics of our social interactions. During
infancy and early childhood, these interactions shape our cognitive development, behavioral devel-
opment, and even brain architecture. In turn, our health and behavioral patterns influence the way
we engage with others. The continuous feedback loop between our social environment and overall
well-being has motivated researchers across behavioral sciences and computing to develop models
that describe the dynamics of our social interactions, or social synchrony.
The key components of social synchrony during embodied interactions are temporal behav-
ior adaptation, joint attention, and shared affective states. In order to communicate successfully,
partners must be attentive and responsive to each other’s behaviors. During social interactions,
appropriate responses should take into account the affective state being displayed by one’s partner.
To create comprehensive representations of nuanced social interactions, computational models of
social synchrony must account for each of these components.
The goal of this dissertation is to develop and evaluate approaches for modeling social syn-
chrony during embodied dyadic interactions. We present computational models of social syn-
chrony during dyadic, embodied interactions in two contexts. First, we explore human-human
social interactions, where attention and affective states must be inferred through behavioral ob-
servations. During embodied interactions, social partners communicate using a diverse range of
behaviors; therefore, this work develops approaches for modeling temporal behavior adaptation
using heterogeneous data, or data representing multiple behavior types. Next, we explore social
synchrony in the context of human-robot interaction. Robots must be equipped with perception
modules in order to establish joint attention and shared affective states based on information about
their partners’ behaviors. To address this need, we develop and evaluate models for attention and
affective state recognition.
Given the central role of communication in cognitive and social development, this dissertation
focuses on interactions that occur during infancy and early childhood. Specifically, we investigate
infant-mother, infant-robot, and child-robot interactions. As infants and their mothers have very
different communication skills, these interactions serve as an ideal test-bed for models of temporal
behavior adaptation across heterogeneous behaviors. We explore social synchrony in this context
by modeling the infant and mother dyad as a dynamical system. First, we assess the temporal re-
lationships between heterogeneous pairs of infant and mother behavioral signals using windowed
cross-correlation with peak-picking. To account for the relationships between multiple heteroge-
neous behavioral signals, we use Dynamic Mode Decomposition with control to analyze the modes
of the dynamical system. Using this approach, we propose a new metric of responsiveness that cap-
tures the degree to which heterogeneous behavioral signals of one partner predict the behaviors of
the other, allowing for a more holistic understanding of temporal behavior adaptation.
Next, we explore social synchrony in the context of Socially Assistive Robotics (SAR) inter-
actions. Specifically, we analyze infant-robot and child-robot interactions where a SAR system
administered therapeutic learning activities. Using eye-tracking data, we apply a model of Bayesian
surprise to predict infant visual attention based on the temporal and spatial visual saliency of
the interaction environment. Finally, we evaluate temporal patterns in affect recognition perfor-
mance, toward understanding how perception modules can inform appropriate temporal behavior
adaptation during SAR interactions.
The work presented in this dissertation for evaluating and supporting social synchrony opens
the door for new opportunities in computing and behavioral sciences. Future work will bring
together the modeling approaches for temporal behavior adaptation, joint attention, and affective
state recognition described in this dissertation. The consolidation of these approaches will inform
the classification of joint interaction states rather than individual partners’ affective states. In turn,
this can promote interpretable analysis of social interactions and inform the relationships between
individual behaviors, joint interaction states, and developmental and health outcomes.
Chapter 1
Introduction
This chapter provides an introduction to the concept of social synchrony and its role
in shaping and evaluating social interactions. The introduction reviews the compu-
tational approaches presented in this dissertation for evaluating and supporting so-
cial synchrony during embodied dyadic interactions. The chapter concludes with an
overview of the contributions of this work and an outline of this dissertation.
1.1 The Role of Social Synchrony in Embodied Interactions
The quality of embodied social interactions is shaped by our ability to engage responsively with
our communication partners. In order to generate appropriate responses, we must both attend to
our partners’ behavioral cues and account for the underlying affective context of the interaction.
This process of responsive communication is known as social synchrony, and shapes the way we
engage with both people and technology.
A central driver of research into social synchrony is its relationship to child development.
Infants and young children rely on supportive communication with their caregivers for healthy
cognitive and emotional development (Developing Child 2012). These early interactions influence
developing brain architecture and outcomes across social, emotional, and cognitive development
(Developing Child 2004), and more synchronous interactions are associated with more positive
outcomes (Leclère et al. 2014a). Motivated by this relationship, researchers and therapists ob-
serve child-caregiver interactions to assess child development and identify appropriate interven-
tions (Rogers et al. 2014). Social synchrony continues to reflect and support relationship building
and mental health outcomes throughout adulthood (Delaherche et al. 2012).
The importance of social synchrony to communication has inspired its integration into the
design of social agents. Past work to assess synchrony in human-agent interactions has found
that better adaptation of behaviors between the human and agent increases the human participants’
feelings of rapport (Gratch et al. 2007) and the agents’ ability to learn from the human (Prepin and
Gaussier 2010). Applications of Socially Assistive Robotics (SAR) (Feil-Seifer and Matarić 2005)
to deliver therapeutic interventions for children with or at risk for developmental disabilities have
leveraged the benefits of social synchrony to optimize interaction outcomes; for example, during
therapeutic game-based activities with children with Autism Spectrum Disorder, SAR systems
that monitored child behavior and adapted their feedback accordingly led to increases in child
imitation behaviors (Greczek et al. 2014). However, recent work (Shi et al. 2021) points out that
SAR systems for children often focus on attention and behavior adaptation without addressing
child affect; this gap must be addressed to enable SAR systems to establish shared affective states
toward improved social synchrony.
1.2 Defining Social Synchrony: Key Terms and Definitions
Surveys of literature on interpersonal or social synchrony define the concept as “individuals’ tem-
poral coordination during social interactions... requiring the perception and integration of multi-
modal communicative signals” (Delaherche et al. 2012) or “a dynamic and reciprocal adaptation
of the temporal structure of behaviors and shared affect between interactive partners” (Leclère et
al. 2014a). These definitions inspire the three key concepts considered in this dissertation: joint
attention, shared affect, and temporal behavior adaptation. Figure 1.1 illustrates the components of
social synchrony and the relationships between them during embodied dyadic social interactions.
Figure 1.1: Left: components of social synchrony; right: relationships between the components of
social synchrony during an embodied dyadic social interaction
1.2.1 Joint Attention
Joint attention occurs when communication partners share a common focus of attention, and it
comprises two main concepts. Initiating joint attention involves the use of gestures to guide a
partner’s attention; this includes making eye contact (Mundy and Newell 2007). Responding to joint
attention involves one partner following the other’s gaze or gestures in order to establish the same
object of focus. Joint attention is necessary in order to appropriately respond to our communication
partners: in order to respond to a behavior, we must first observe and process that behavior.
In describing responsive parenting behaviors, Fisher et al. (2016) present sharing the child’s
focus as a precursor to providing supportive or encouraging responses to a child’s point of focus,
actions, or displays of emotion. While a person’s attention cannot be directly measured, it can be
inferred by observing gaze behavior (Franchak et al. 2011; Yamamoto et al. 2019; Niedźwiecka
et al. 2018).
1.2.2 Shared Affective States
The presence of shared affective states distinguishes social synchrony from the physical synchrony
that is required during physical collaborative tasks. While attending to our partners, evaluating the
affective content of their behaviors and estimating how they are feeling (their internal affective
state) enables us to moderate our own internal state and generate responses consistent with the
tone of the interaction. This coregulation of affective states helps to support more positive affect
during interactions (Feldman 2003).
1.2.3 Temporal Behavior Adaptation
Temporal behavior adaptation involves the initiation or change of a behavior in response to one’s
partner. Delaherche et al. (2012) describe how the concept of temporal behavior adaptation distin-
guishes synchrony from mirroring or mimicry, in that “the important element is the timing, rather
than the nature of the behaviors.” Partners do not need to replicate each other’s behaviors in order
to achieve synchrony; for example, a parent may exclaim “that is a ball!” in response to their child
holding up a ball and smiling, supporting a synchronous interaction without copying the child’s
actions. In this dissertation, we refer to observations of different types of social behaviors as het-
erogeneous behavioral signals.
1.3 Problem Statement and Methods
The goal of this dissertation is to develop and evaluate approaches for modeling social synchrony
during embodied dyadic interactions. Models of social synchrony support different applications
across interaction types. Observing communication between caregivers and their infants or chil-
dren through the lens of social synchrony serves to advance our understanding of human behavior
and child development. In some cases, the components of synchrony inform interventions to sup-
port healthy child development. In human-robot interaction, computational models of synchrony
are needed to inform perception and action selection modules that support positive interaction
outcomes.
A central research challenge is integrating the multiple social behaviors and components of
synchrony towards more comprehensive modeling of embodied social communication. As infants
are still developing their motor, language, and social skills, they often differ from their mothers in
the behaviors they use to communicate. Previous work has shown that infant-mother coordination,
within a single behavior, varies significantly with age and interaction quality. However, existing
approaches to modeling social synchrony do not yet capture the temporal adaptation that occurs
across multiple heterogeneous behaviors at once. Within socially assistive human-robot interac-
tion, robots must monitor their partners’ affective states continuously and adapt their behaviors
in real-time. Affect classification models cannot be evaluated in isolation; instead, we must also
consider how these models can inform temporal behavior adaptation via an action selection pol-
icy. The modeling approaches presented in this dissertation address these challenges by leveraging
heterogeneous data and directly addressing the relationships between the components of social
synchrony.
In the context of human-human interactions, we expand on previous models of temporal be-
havior adaptation that focus on individual behaviors by establishing approaches that account for
the multiple heterogeneous signals people use to communicate. During human-robot interaction,
we establish and test models of visual attention and affective state recognition and analyze the
influence of these concepts on moment-to-moment interaction outcomes. Due to the importance
of social synchrony to assessing and supporting healthy child development outcomes, we evalu-
ate our approaches in the context of interactions with infants. Specifically, we leverage datasets
of recordings from 1) the Face-to-Face Still-Face procedure, a validated infant-mother interaction
paradigm, and 2) a therapeutic leg movement activity for infants delivered by a humanoid robot.
The results presented in this dissertation support a holistic approach to evaluating and supporting
social synchrony during embodied dyadic interactions.
1.4 Contributions
This dissertation presents novel approaches for modeling social synchrony during both human-
human and human-robot interactions. Primary contributions include:
1. Analyses of temporal behavior adaptation during an infant-mother interaction paradigm. We
demonstrate that experimental changes in mothers’ interaction patterns are accompanied by
observed changes in social synchrony both within and across types of behavioral signals.
2. A modeling approach for evaluating temporal behavior adaptation across multiple hetero-
geneous behaviors. We use dynamical systems models that integrate multiple behavioral
signals to develop novel metrics for evaluating social synchrony.
3. Analyses of approaches for predicting joint attention during a Socially Assistive Robotics
interaction. We evaluate factors including visual saliency and the robot’s behavior policy
that influence infants’ visual and affective responses.
4. A novel evaluation approach for affective state recognition models for real-time, continuous
application.
Secondary contributions of this dissertation include:
1. The Infant-Caregiver in-Home Interaction Recording Platform (ICHIRP), a data collection
platform designed to scale research in infant-caregiver interaction by supporting caregiver-
led data collection, and a pilot usability study for ICHIRP.
2. A dataset of infant and mother behavioral signals including head pose, arm pose, and fun-
damental frequency collected from recordings of infant-mother interactions.
1.5 Dissertation Outline
The remainder of this dissertation document includes the following:
• Chapter 2 reviews background and prior work related to social synchrony in human-human
and human-robot interaction.
• Chapter 3 reviews the previously existing datasets, contributed by other research teams, that
were used in this dissertation.
• Chapter 4 describes our analysis of temporal adaptation of heterogeneous behaviors between
infants and their caregivers using windowed cross-correlation with peak-picking.
• Chapter 5 demonstrates a dynamical systems approach for integrating multiple heteroge-
neous signals into a single model of temporal behavior adaptation, and presents a new metric
of responsiveness.
• Chapter 6 describes the design of the ICHIRP application and a pilot usability study con-
ducted to evaluate challenges to caregiver-recorded infant-caregiver interactions.
• Chapter 7 evaluates a model of Bayesian surprise as a predictor of infant visual attention
during a robot-led therapeutic leg movement activity.
• Chapter 8 presents an analysis of temporal patterns in affect classification performance and
describes how these patterns can inform models of temporal behavior adaptation.
• Chapter 9 summarizes and concludes the dissertation.
Nota bene: This dissertation includes contributions from collaborative, interdisciplinary
projects involving multiple researchers. A “Contributors” box at the beginning of each chap-
ter or section describes the contributions of fellow USC students and lists the co-authors of
published papers resulting from this dissertation work.
Chapter 2
Background and Related Work
This chapter reviews existing literature on manual and computational approaches
to evaluating social synchrony during embodied dyadic interactions. Additionally,
this chapter discusses the application domains studied in this dissertation, including
infant-caregiver interaction and socially assistive robotics for children, where social
synchrony plays a central role.
While joint attention and shared affective states are vital components of social synchrony, they
cannot be measured directly. Rather, they must be interpreted through the behaviors displayed
by interacting partners. Therefore, approaches to evaluating social synchrony typically involve
the recognition of the individual positions, behaviors, or affective expressions of each partner,
followed by an analysis of the relationships between these factors. Additionally, definitions of
synchrony cannot always be disentangled from the social contexts that inform them. As the study
of social synchrony originated in developmental science, this chapter first presents the impact of
social synchrony and approaches for its evaluation in child development. Next, we discuss past
work in social synchrony in human-robot interaction and SAR.
2.1 Social Synchrony in Parent-Child Interactions
2.1.1 Influence on Child Development
Supportive interactions with primary caregivers are essential for infant social and cognitive devel-
opment. Dyads often engage in a “serve-and-return” pattern, where one partner reaches out with
a gesture or vocalization and the other responds with their own action (Developing Child 2012).
Sensory input from caregiver feedback during these “serve-and-return” moments helps to shape
the architecture of the infant’s brain (Dawson and Fischer 1994). The importance of consistent,
reciprocal social interactions is highlighted by the adverse consequences of their absence. De-
velopmental delays (Wan et al. 2013) and stressors such as economic hardship, social isolation,
and caregiver mental health impairments can lead to a breakdown in interaction quality, resulting
in severe lifelong consequences for infants including cognitive delays and impairments in health
and learning outcomes (Developing Child 2012). Interventions that promote quality interactions
with primary caregivers during infancy have been shown to improve infant cognitive and execu-
tive function and attention (Lieberman et al. 1991) and reduce autism symptom severity in at-risk
infants (Rogers et al. 2014).
Recent work has explored how infant risk for adverse health outcomes can be predicted from
interaction scales evaluated on observations of infant-caregiver interactions in play or perturbed
settings. Wan et al. (2013) found infant interactive behavior with caregivers and dyadic mutual-
ity at ages 12-15 months to predict autism diagnosis at 3 years. Work by Leclère et al. (2016a)
found that infant-mother turn-taking dynamics predict risk for child neglect, which can have se-
vere consequences on child development. The relationship between early communication and
child development outcomes has established the importance of evaluating social synchrony both
for fundamental research into child development and for early screening tools.
2.1.2 Observation Contexts
Infant-caregiver interactions are typically observed in either natural play settings or perturbed play
interactions, such as the Face-to-Face Still-Face (FFSF) interaction paradigm (Tronick et al. 1978;
Adamson and Frick 2003a). The FFSF consists of a two-minute play stage of normal interaction
between caregiver and infant, followed by a two-minute still-face stage where the caregiver is asked
not to emote or respond to the infant’s behaviors, followed by a two-minute “reunion” stage, where
the caregiver resumes normal play with the infant. FFSF and its variations are used to study early
social and emotional development and infant emotion regulation, including how dyadic behaviors
change in the presence of atypical development or adverse childhood experiences (ACEs). For
example, during a variation of the FFSF where mothers touched their infants during the still-face
stage, infants of depressed mothers showed different gaze and affective behaviors than infants of
non-depressed mothers (Peláez-Nogueras et al. 1996).
2.1.3 Manual Evaluation Approaches
Existing instruments for evaluating infant-caregiver interaction can be classified into three cate-
gories: global interaction scales, synchrony scales, and micro-coded time-series analysis measures.
Global interaction scales such as the Coding Interactive Behavior (CIB) scale (Feldman 1998), or
the Manchester Assessment of Caregiver-Infant Interaction (MACI) (Wan et al. 2013) address
dyadic processes such as adaptation or affective mutuality, as well as individual behaviors of each
interaction partner including praising, exhibiting an affective state appropriate to that of their part-
ner’s, or acknowledging one’s partner. Synchrony scales such as Bernieri’s Scale (Bernieri et al.
1988) and the Synchrony Global Coding System (Skuban et al. 2006) are similar to global scales
but exclude analysis of individual partners’ behaviors. Micro-coded time-series analysis methods
measure correlations between infant and caregiver behavior over sliding time windows (Leclère
et al. 2014a).
2.1.4 Computational Modeling Approaches
Given the importance of supportive dyadic interactions to healthy infant development, multiple
computational approaches have emerged to address the time and cost associated with data an-
notation and to enable detailed temporal analysis of dyadic processes. Common modeling tech-
niques largely fall within micro-coded time series analysis, and are based on statistical approaches
(Leclère et al. 2014a) or dynamical systems models (Gates and Liu 2016). These approaches typ-
ically involve generating a time series of behaviors or states for each member of the dyad through
either automated feature extraction or manual data annotation, and evaluating the relationships be-
tween the two time series. We highlight six common modeling approaches for evaluating dyadic
coordination, and describe how they have been used to study infant-caregiver interactions, inform-
ing our work.
Correlational Analysis: In correlational analysis, a sliding window is applied to each signal,
and time-lagged cross-correlation is applied to each window. The correlation between the two
participants’ time series is then interpreted as a metric of coordination or behavioral synchrony. A
peak-picking algorithm is often used to determine the time lag at which partners are most coordi-
nated (Boker et al. 2002). Prior work by Hammal et al. (2015a) has used this method to identify
how infant-mother interaction dynamics evolve across stages of the FFSF procedure.
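As a concrete illustration, the following Python/NumPy sketch implements the core of this approach for two equal-length, pre-aligned one-dimensional behavioral signals. The window length, step, and maximum lag shown are placeholder values rather than the parameters selected in Chapter 4, and the global maximum over the lag range stands in for the full peak-picking heuristic of Boker et al. (2002).

import numpy as np

def windowed_xcorr_peak_lags(x, y, win=90, step=30, max_lag=45):
    """Windowed cross-correlation with a simplified peak-picking step.

    x, y    : equal-length 1-D behavioral signals (e.g., head angle series)
    win     : window length in samples
    step    : hop between successive windows
    max_lag : largest lead/lag considered, in samples (both directions)

    Returns the lag of maximum correlation for each window; the variance of
    these per-window lags is one common coordination metric (the lag variance
    examined in Chapter 4).
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    lags = np.arange(-max_lag, max_lag + 1)
    peak_lags = []
    for start in range(0, len(x) - win - 2 * max_lag + 1, step):
        xw = x[start + max_lag : start + max_lag + win]
        corrs = [np.corrcoef(xw, y[start + max_lag + lag :
                                   start + max_lag + lag + win])[0, 1]
                 for lag in lags]
        peak_lags.append(lags[int(np.nanargmax(corrs))])  # lag with highest correlation
    return np.array(peak_lags)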
Recurrence Analysis: This approach compares the feature vectors of each partner at every time
point, and the pairs of vectors that meet a similarity criterion are marked as recurrent points. The
percent of recurrent points, or the entropy of the recurrence matrix, is used to evaluate attunement, or
to assess how often the partners visit similar states. For example, López Pérez et al. (2017) used the
percent of recurrent points to demonstrate asymmetric interaction dynamics during infant-mother
play interactions.
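A minimal sketch of this idea is shown below, under the assumptions that both partners' feature vectors have already been standardized and that a single fixed radius defines similarity; the entropy measure mentioned above would be derived from the structure of the same matrix.

import numpy as np

def cross_recurrence_rate(infant_feats, mother_feats, radius=1.0):
    """Fraction of recurrent points between two partners' feature series.

    infant_feats : array of shape (T1, d), one feature vector per frame
    mother_feats : array of shape (T2, d)
    radius       : similarity threshold; a pair of time points is recurrent
                   when the two feature vectors are closer than this distance
    """
    diffs = infant_feats[:, None, :] - mother_feats[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)   # (T1, T2) pairwise distance matrix
    return float((dists < radius).mean())    # fraction of recurrent points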
Analysis of Delayed and Overlapped Behaviors: This approach tracks the onset and offset of
behaviors for each partner, and monitors how they overlap or delay their behaviors. It has been
used to monitor synchrony of infant and mother vocalizations (Gratier 2003) and to relate infant-
mother movement coordination to interaction quality defined by expert-rated CIB scores (Leclère
et al. 2016a).
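The sketch below gives illustrative, frame-level counterparts of such overlap, pause, and delayed-response ratios for two binary activity signals; these are simplified stand-ins for, not reimplementations of, the published operationalizations.

import numpy as np

def overlap_and_delay_ratios(infant_active, mother_active, max_delay=30):
    """Simple ratios describing overlapped and delayed behaviors.

    infant_active, mother_active : boolean arrays (same length), True where the
                                   partner is moving or vocalizing in that frame
    max_delay                    : tolerance, in frames, for a delayed response
    """
    infant = np.asarray(infant_active, dtype=bool)
    mother = np.asarray(mother_active, dtype=bool)
    overlap_ratio = float(np.mean(infant & mother))   # both partners active at once
    pause_ratio = float(np.mean(~infant & ~mother))   # neither partner active
    # Fraction of infant-active frames followed by mother activity within max_delay frames
    responded = [mother[t : t + max_delay + 1].any() for t in np.flatnonzero(infant)]
    delayed_response_ratio = float(np.mean(responded)) if responded else 0.0
    return overlap_ratio, pause_ratio, delayed_response_ratio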
Granger Causality: Granger causality (Granger 1969) is a statistical approach for testing
whether one time series is useful for predicting another. The approach has been used in infant-
mother interactions to analyze what is likely leading the interaction (Hoch et al. 2021) and in
child-therapist interactions (Seth 2005) to study how metrics of dyadic interaction relate to autism
symptom severity.
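In its simplest form, the test compares an autoregressive model of one partner's signal against a model that also includes lagged values of the other partner's signal. The sketch below implements that comparison with ordinary least squares; the lag order is an arbitrary placeholder, and in practice an established implementation (with stationarity checks) would be preferable to this bare-bones version.

import numpy as np

def granger_f_statistic(x, y, order=5):
    """F statistic for whether past values of y help predict x (Granger's sense).

    x, y  : equal-length 1-D time series
    order : number of lags included in both autoregressive models
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    target = x[order:]
    x_lags = np.column_stack([x[order - k - 1 : n - k - 1] for k in range(order)])
    y_lags = np.column_stack([y[order - k - 1 : n - k - 1] for k in range(order)])
    intercept = np.ones((len(target), 1))

    def rss(design):  # residual sum of squares of a least-squares fit
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ beta
        return float(np.sum(resid ** 2))

    rss_restricted = rss(np.hstack([intercept, x_lags]))       # past x only
    rss_full = rss(np.hstack([intercept, x_lags, y_lags]))     # past x and past y
    df_denom = len(target) - 2 * order - 1
    return ((rss_restricted - rss_full) / order) / (rss_full / df_denom)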
Dynamical Systems Modeling: Dynamical Systems Modeling approaches leverage the depen-
dence of current behaviors on the previous state of the interaction. Common dynamical system
approaches applied to infant-caregiver interaction are Markov Models; they have been used to
identify transition dynamics between infant and mother smiles (Messinger et al. 2010a) and affec-
tive states (Cohn and Tronick 1987). Identifying some higher-level behaviors requires manual data
annotation; Ardulov et al. (2018) demonstrated how dynamic mode decomposition with control
could be used to identify interaction dynamics in a continuous state space, using the eigenvalues
of the transition and control matrices to evaluate how child speech was influenced by adult input
during forensic interviews.
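As context for Chapter 5, the sketch below shows the core least-squares step of dynamic mode decomposition with control (DMDc) for a single interaction window, with one partner's features as the state and the other's as the control input (in the pipeline of Figure 5.1, the infant's features serve as the control). The reduced-rank formulation, window selection, and the specific eigenvalue- and responsiveness-based metrics used in this dissertation are defined in Chapter 5; this is only the bare decomposition.

import numpy as np

def dmdc(states, controls):
    """Fit x_{t+1} ~= A x_t + B u_t over one window by least squares.

    states   : array of shape (d_x, T), e.g., one partner's features per frame
    controls : array of shape (d_u, T), the other partner's features (control input)

    Returns the transition matrix A and control matrix B; the eigenvalues of A
    summarize the window's intrinsic dynamics, and the influence of the control
    input is reflected in B.
    """
    X, X_next = states[:, :-1], states[:, 1:]
    U = controls[:, :-1]
    omega = np.vstack([X, U])             # stacked state and control snapshots
    AB = X_next @ np.linalg.pinv(omega)   # least-squares solution for [A B]
    d_x = states.shape[0]
    return AB[:, :d_x], AB[:, d_x:]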
Each of the modeling approaches described above outputs a set of metrics that quantify the re-
lationships between these time series, such as entropy in recurrence analysis. Differences in metric
distributions between conditions (e.g., dyads with healthy versus depressed parents, play versus re-
union stages of the FFSF) enable researchers to evaluate the relationship between infant-caregiver
interaction dynamics and relevant developmental phenomena (Leclère et al. 2014a); therefore, re-
searchers must select an appropriate modeling approach in order to generate metrics that capture
the phenomenon of interest. Survey papers (Gates and Liu 2016; Delaherche et al. 2012) provide
guidelines for selecting models of interaction dynamics based on relevant research questions, and
Leclère et al. (2014a) provides specific examples in the context of infant-caregiver interaction. Ex-
isting approaches have assessed synchrony in temporal behavior adaptation by identifying patterns in
the coordination of infant and mother behavioral signals over time, but often evaluate the adapta-
tion of a single type of behavior. This dissertation develops approaches that address the complex
interplay of various types of behavior that co-occur during infant-caregiver interaction.
2.1.5 Automated Affect and Behavior Recognition
While micro-coded time series used by computational modeling approaches most directly align
with temporal behavior adaptation, the types of behaviors analyzed support the inference of joint
attention or shared affective states. Therefore, past work in computing has focused on automated
approaches to recognize the presence of joint attention or specific affective states during
interactions with infants.
Towards studying joint attention during infant-caregiver interactions, eye-tracking has been
used to automatically follow the gaze behaviors of infants and their parents. Head-mounted eye
tracking for infants was first introduced by Franchak et al. (2011), when it was used to study
infant gaze behavior during naturalistic interactions with their mothers. Leveraging this approach,
Yamamoto et al. (2019) observed longitudinal measurements of eye gaze in the home environment
from 10 to 15 months of infant age, finding interpersonal distance to be a moderator of joint
attention. Using screen-mounted eye tracking, Niedźwiecka et al. (2018) studied mutual gaze
toward visual stimuli displayed on the screen, finding that mutual gaze at 5 months predicted
infant disengagement behaviors at 11 months.
To support affect recognition for infants, Cohen and Lavner (2012) used voice activity detec-
tion and k-nearest neighbors to detect infant cries. Model performance was evaluated based on
the fraction of 1-second and 10-second segments that were accurately classified. A review by
Saraswathy et al. (2012) describes methods for automatic analysis of pre-detected infant crying
sounds used to scan for signs of adverse health outcomes. Lysenko et al. (2020) and Messinger
et al. (2009) used facial landmarks to predict expressions of infant affect from individual video
frames. Models evaluated on pre-defined windows of time have identified the potential of affect
recognition in analyzing infant well-being. When infant and caregiver behaviors are measured or
interpreted selectively, temporal behavior adaptation can offer valuable insights into joint attention
and shared affective states.
2.2 Social Synchrony in Human-Robot Interaction
The concept of social synchrony between humans has served as inspiration to enhance efficiency
and satisfaction during human-robot interaction (HRI). In HRI, perception models and action selec-
tion policies must be established to realize the joint attention, shared affect, and temporal behavior
adaptation criteria that are vital to establishing social synchrony. In turn, these principles can pro-
mote more satisfactory and productive interaction outcomes. This section reviews how the core
components of synchrony have informed the field of HRI.
2.2.1 Human-Robot Collaboration
In collaborative human-robot interaction, a subfield of HRI, past work has found that ac-
counting for the timing and expectations of human participants can support improved interaction
outcomes. Nikolaidis et al. (2017) used a game-theoretic approach to model human participants’
behavioral adaptation to a robot during a table-clearing task and choose the robot’s actions accord-
ingly, leading to an increase in team performance. The concept of human-robot fluency during
collaborative tasks was formalized by Hoffman (2019), who used the percentage of concurrent
activity, the human’s idle time, the robot’s functional delay, and the robot’s idle time as objective
fluency metrics. User study participants reported that the interactions progressed more fluently
when human idle time and robot functional delay decreased. By adapting the robot’s behavior
based on anticipated human behavior, Hoffman and Breazeal (2007) were able to decrease human
idle time and, in turn, increase participants’ perception of the robot’s contribution to team fluency
and success. These results bear striking similarity to the metrics used by Leclère et al. (2016a)
to characterize child-mother interaction, including synchrony ratio (concurrent activity offset by a
delay), overlap ratio (concurrent motion activity), and pause ratio (lack of motion by either party),
which were found to correlate significantly with CIB item ratings such as maternal sensitivity and
dyadic reciprocity. These studies demonstrate the importance of temporal behavior adaptation
across contexts in both child-caregiver and human-robot interaction.
2.2.2 Socially Assistive Robotics (SAR)
In contrast to robots that provide physical assistance, Socially Assistive Robotics (SAR) supports
users via social, cognitive, and emotional cues (Matarić and Scassellati 2016). Past work has ex-
plored applications of SAR to support learning gains in children with Autism Spectrum Disorder
(ASD) (Clabaugh et al. 2019), stroke rehabilitation (Matarić et al. 2007), and support group media-
tion (Birmingham et al. 2020), among others. During SAR interactions, the robot’s action selection
policy must account for both the goals of the joint activity and the social context of the interac-
tion. Matarić and Scassellati (2016) note that responding to a human user too quickly, slowly, or
repetitively can break the social dynamic of the interaction. This task is made more complex by
the need to coordinate robot behaviors across movement, facial expressions, and speech, each of
which must be appropriate for the user’s emotional state and intent. The following subsections
discuss how the concepts of joint attention and shared affective states have been operationalized
toward achieving social synchrony in SAR.
2.2.3 Joint Attention in SAR
Research in SAR often addresses joint attention through the concept of user engagement. Assum-
ing a robot continuously senses the state of a joint task, the user’s engagement in that task can be
taken to satisfy the conditions for establishing joint attention. Understanding and leveraging engagement
serves both to sustain a user’s attention during SAR interactions and to support the interaction’s
behavioral and educational goals (Matarić and Scassellati 2016). Integrating estimated participant
engagement into a reinforcement learning framework, Tsiakas et al. (2018) developed personalized
training strategies during a cognitive training exercise with computer science undergraduate and
graduate students.
As with human partners, eye gaze plays a key role in human-robot joint attention. The esti-
mated gaze direction of human users can support SAR platforms in recognizing user engagement
during in-the-wild social interactions (Jain et al. 2020). Additionally, displaying gaze cues can
help humanoid robots to initiate joint attention. For example, a robot may facilitate en-
gagement from human users by redirecting its gaze without breaking from the social interaction
(Holroyd et al. 2011). Huang and Mutlu (2013) used eye gaze to address the need to meet both
social expectations and task goals during human-robot communication. The multifaceted role of
user engagement and joint attention in promoting successful SAR interactions highlights both the
importance and the technical challenges of enabling robots to promote social synchrony.
2.2.4 Affective Behavior Recognition in SAR
As with joint attention, the role of affect recognition in SAR is twofold: it is necessary to develop
socially appropriate responses and to drive positive interaction outcomes. To intervene when hu-
man partners become emotionally unavailable to interact, robots must monitor affect continuously
and recognize changes quickly. To meet the requirements of an embodied social interaction, affect
recognition approaches must operate in real-time. This challenge is made more difficult by the
limited training data available for novel and challenging populations.
Past work has contributed multiple approaches to affect recognition using child behavioral
signals recorded during SAR interactions or human-computer interaction. As our affective states
are often manifested through our facial expressions and body gestures, past research in affect
recognition for SAR (Filntisis et al. 2019; Jain et al. 2020; Shi et al. 2021; Mathur et al. 2021) has
leveraged models that perform facial action unit detection or body pose estimation on video footage from
a participant-facing camera. Other approaches have leveraged sentiment analysis of child speech
(Abbasi et al. 2023) or audio signals from nonverbal utterances (Narain et al. 2020) to estimate
child well-being or affective states. Past literature on these approaches emphasizes the importance
of affective state recognition for crafting appropriate behavioral responses to the child.
2.3 Summary
This chapter reviewed the past literature in social synchrony that informs the contributions made
by this dissertation. Our work aims to push the boundaries of synchrony modeling by developing
and evaluating approaches for integrating heterogeneous behavioral signals and for extending the
evaluation of social synchrony to SAR interactions.
Chapter 3
Datasets
This chapter describes the previously collected datasets used in this dissertation. The
datasets were collected by other research teams across the University of Southern Cal-
ifornia (USC) and Children’s Hospital Los Angeles (CHLA), and include recordings
of diverse, developmentally relevant dyadic and embodied social interactions with in-
fants. The breadth of interactions included in these datasets enables the research
presented in this dissertation to establish approaches for modeling social synchrony
that can be applied across social settings.
This dissertation relies on previously existing datasets of social interactions collected by col-
laborators across Dr. Maja Matarić’s Interaction Lab at USC, Dr. Pat Levitt’s Levitt Lab at CHLA,
and Dr. Beth Smith’s Infant Neuromotor Control Lab (INCLab) at CHLA. This chapter reviews
the components of the datasets that are integral to the work in this dissertation.
3.1 Infant-Mother Interactions: Face-to-Face Still-Face
Paradigm
As part of a larger project to study maternal stress, recorded interactions between infants and their
mothers were collected by Dr. Pat Levitt’s Levitt Lab at CHLA. This project was funded by the JPB
Foundation through a grant to The JPB Research Network on Toxic Stress: A Project of the Center
on the Developing Child at Harvard University. The following subsections describe the infant-
mother interaction dataset in greater detail. While other data including surveys and biological
samples were collected as a part of the larger project on maternal stress, they are not analyzed in
this dissertation and therefore are not described in this chapter.
3.1.1 Social Interaction Paradigm
The Face-to-Face Still-Face (FFSF) procedure (Tronick et al. 1978) is one of the most widely used
experimental paradigms for observing infant-caregiver interaction in a research setting (Provenzi
et al. 2018; Leclère et al. 2014b). Past work has used the FFSF paradigm to evaluate the effects of
infant age, developmental disorders, and maternal depression on early communication (Adamson
and Frick 2003b). Given the vast documentation of interaction patterns that typically occur during
the FFSF procedure, it offers an ideal environment for testing and refining models of social syn-
chrony. In the scope of this dissertation, this dataset is used to develop and test models of temporal
behavior adaptation.
3.1.2 Experimental Procedure
The FFSF procedure involves three stages: play, still-face, and reunion. Each stage lasts approx-
imately 2-3 minutes. During the play stage, infant and caregiver play together in a face-to-face
interaction. During the still-face stage, the caregiver maintains eye contact with the infant, but
expresses a flat facial expression and does not respond to the infant’s bids for attention. During the
reunion stage, the caregiver resumes play with their infant.
3.1.3 Participants
57 infant-mother dyads were recruited from the Los Angeles community. The study involved
mother caregivers only because the data collection was part of a larger research effort to study
maternal stress. All data were collected at Children’s Hospital Los Angeles, under IRB protocol
CHLA-15-00267. Each dyad was invited to participate at 2 months, 6 months, 9 months, 12
months, and 18 months after the birth of the infant to complete the FFSF procedure. A total of 229
interactions were completed.
3.1.4 Implementation
Figure 3.1: A mother and infant participating in the Face-to-Face Still-Face procedure. The infant
is sitting in the lap of a researcher.
Infants and mothers sat 2-3 feet apart, as shown in Figure 3.1, with the infant sitting on a
researcher’s lap or worn on the front of a researcher in a baby carrier. Each of the three FFSF
stages lasted 2 minutes. A timer beeped at the start and end of each stage to mark the transitions. If
the infant was fussy and crying for a continuous period of thirty seconds, the Still-Face stage was
terminated early. All FFSF procedures were recorded at 30 frames per second from a profile view
with a Sony HDR-CX240 video camera. The microphone on the video camera was used to record
the audio data.
3.1.5 Labels
The start and end times of each FFSF stage were marked by a timer within the recordings, and
sections of the audio-video data were labeled with the corresponding stage of the FFSF
procedure. Moment-to-moment labels are not included in the Infant-Mother Interaction dataset.
3.1.6 Inclusion and Exclusion Criteria
Interactions were excluded from analysis in this dissertation if the mother broke the interaction
protocol for any reason, if the infant was too fussy to reach the final stage of the FFSF protocol,
or if the infant was asleep at any time during the procedure. Additionally, videos were excluded
if the camera was paused or replaced during the interaction or if the infant or mother were too
occluded to collect reliable pose data. After applying exclusion criteria, 200 videos were included
for analysis in this dissertation.
3.2 Infant-Robot Interactions: A Contingent Learning
Paradigm
The Infant-Robot Interaction dataset was collected as part of an interdisciplinary research project
that explored the use of SAR to encourage exploratory leg movement in infants. The project was
supported by the National Science Foundation under grant NSF CBET-1706964. The interaction
paradigm and SAR system were developed in collaboration by researchers in the INCLab and
Interaction Lab. To inform a closed-loop interaction design, Funke et al. (2018) analyzed infant
reactions to robot movement and speech. Next, a contingent learning paradigm was implemented
where the robot delivered visual and auditory stimuli as a reward for infant leg movement (Fitter
et al. 2019). Later projects (Pulido et al. 2019; Deng et al. 2021) made changes to the robot reward
policy in order to encourage different types of infant leg movements. The following subsections
describe three of the user studies conducted by the Interaction Lab and INCLab and the resulting
interaction data. We will refer to these studies as Contingent Learning User Studies 1, 2, and 3.
While a fourth user study involving the contingent learning paradigm was collected more recently,
it is not analyzed in this dissertation.
3.2.1 Interaction Paradigm
Each study included the Nao robot seated across from the infant, with the setup of the first user
study illustrated in Figure 3.2. The Nao robot was chosen for its humanoid form and similarity
in size to an infant. These factors enabled the robot to provide a demonstration of the desired
infant leg movement. Infants wore accelerometers on their arms and legs that were used to identify
whether infants had produced a sufficient leg movement to trigger the robot feedback. Prior work
found that wearing those sensors has a negligible effect on infant leg movement frequency and that
the sensors provide accurate infant movement data (Jiang et al. 2018). The contingent learning
activity lasted for 8 minutes. At the beginning and end of the activity, the robot sat motionless for 2 minutes to assess
the baseline movement level of the infant. During each study, a parent was seated next to the infant
at all times, and the infant was removed from the interaction if they became too fussy or upset.
This study procedure was approved by the University of Southern California Institutional Review
Board under protocol #HS-14-00911.
3.2.1.1 Contingent Learning User Study 1
During this user study, shown in Figure 3.2, an infant leg movement of 3 m/s² was required to
activate the robot. In front of the infant and the robot, a pink toy ball with a bell was suspended at
a height that was reachable by kicking. Each infant wore a head-mounted eye tracker and inertial
sensors within bands on their wrists and ankles. The ball was found to be distracting to the infant
and was removed during subsequent user studies.
13 infants participated in this user study. Infants between 4 and 9 months of age from the Los
Angeles area were recruited to participate in the study. Infants from multiple births, infants with
Figure 3.2: An infant participating in the Socially Assistive Robot Contingent Learning Paradigm.
a history of gestation less than 37 weeks, infants experiencing complications during birth, and
infants with any known visual, orthopedic or neurologic impairment at the time of testing were
excluded. Infants also needed to score above the 5th percentile for their age on the Alberta Infant
Motor Scale (Piper and Darrah 1994) for inclusion. This study is described in further detail by
Fitter et al. (2019).
3.2.1.2 Contingent Learning User Study 2
During the second user study, a subset of 7 participants from user study 1 visited the study site
at the INCLab to participate a second time. The study setup was similar to Contingent Learning
User Study 1, but did not include the head-mounted eye-tracker or the pink ball. For each infant, a
reinforcement learning policy was learned to iteratively update the leg acceleration threshold of the
robot. The goal of these policies was to personalize the interaction to each infant and encourage
infants to increase the peak acceleration of their kicks over time. This work is described in greater
detail by Pulido et al. (2019).
3.2.1.3 Contingent Learning User Study 3
12 infants between 6 and 8 months of age participated in Contingent Learning User Study 3. A banded
threshold was applied: infants needed to produce a leg acceleration above 9 m/s² and below 20
m/s² with an angular velocity greater than 2 rad/s. This study represented a more challenging
learning task. The threshold was chosen to explore whether infants could learn not only to increase
the frequency of desired leg movements, but also to produce more leg movements of higher
acceleration. Deng et al. (2021) describe the work in more detail.
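To make the reward criteria concrete, the following minimal Python sketch checks a candidate leg movement against the banded threshold described above. The function name, the synthetic sensor values, and the assumption that per-movement acceleration and angular velocity magnitudes have already been segmented from the wearable sensor streams are illustrative and not part of the published systems.

```python
import numpy as np

def movement_triggers_reward(accel_mag, ang_vel_mag,
                             accel_min=9.0, accel_max=20.0, ang_vel_min=2.0):
    """Return True if a candidate leg movement meets the banded threshold used in
    User Study 3: peak acceleration between accel_min and accel_max (m/s^2) and
    peak angular velocity above ang_vel_min (rad/s). Setting accel_min=3.0,
    accel_max=np.inf, and ang_vel_min=0.0 approximates the simple threshold
    from User Study 1."""
    peak_accel = float(np.max(accel_mag))
    peak_ang_vel = float(np.max(ang_vel_mag))
    return (accel_min <= peak_accel <= accel_max) and (peak_ang_vel >= ang_vel_min)

# Synthetic per-sample magnitudes for one segmented candidate movement
accel = np.array([4.2, 11.7, 15.3, 8.9])    # m/s^2
ang_vel = np.array([0.8, 2.4, 3.1, 1.2])    # rad/s
print(movement_triggers_reward(accel, ang_vel))  # True
```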
3.2.2 Labels
Infant affect was manually coded using a five-point arousal scale consistent with previous infant
research (Sargent et al. 2014; Lester et al. 2004). Annotators labeled each video frame as either
alert, fussy, crying, drowsy, or sleeping. Annotators achieved over 85% label agreement. 84.3%
of the data were labeled as alert, 13.4% as fussy, 2.3% as crying, and 0% as drowsy or sleeping.
Given the limited data labeled as crying and similarities between fussy and crying, these labels
were combined as fussy in our analysis.
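For illustration, the short Python sketch below shows how frame-level labels from two annotators might be collapsed and compared; the annotator arrays are hypothetical, and the published work reports raw percent agreement rather than a chance-corrected statistic.

```python
import numpy as np

# Hypothetical frame-level labels from two annotators (not real study data)
annotator_1 = np.array(["alert", "alert", "fussy", "crying", "alert"])
annotator_2 = np.array(["alert", "fussy", "fussy", "crying", "alert"])

def collapse_labels(labels):
    """Combine 'crying' into 'fussy', as done in the analysis, given the limited
    crying data and its similarity to fussing."""
    return np.where(labels == "crying", "fussy", labels)

a1, a2 = collapse_labels(annotator_1), collapse_labels(annotator_2)
percent_agreement = np.mean(a1 == a2)  # fraction of frames with matching labels
print(f"Frame-level agreement: {percent_agreement:.1%}")
```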
3.3 Acknowledgements
The research presented in this dissertation was possible because of the datasets described in this
chapter, which were collected by other research teams across USC and CHLA. We are thankful to
the researchers and funding organizations who supported and collected these datasets, designed
the corresponding SAR systems, and made these datasets available for the work described in this
dissertation.
Chapter 4
Evaluating Temporal Behavior Adaptation Across
Heterogeneous Behaviors with Windowed Cross-Correlation
Contributors: Chapter 4 is based on Klein et al. (2020). Additional authors of the pub-
lished work include Victor Ardulov, Yuhua Hu, Mohammad Soleymani, Alma Gharib, Barbara
Thompson, Pat Levitt, and Maja J. Matarić.
Interactions between infants and their caregivers can provide meaningful insight into
the dyad’s health and well-being. As researchers cannot directly observe partners’
affective states or attention, models of infant-parent synchrony typically involve ob-
servations of behavior adaptation, which are used in turn to infer internal states.
Previous work has shown that infant-caregiver behavior adaptation, within a single
modality, varies significantly with age and interaction quality. However, as infants
are still developing their motor, language, and social skills, they may differ from their
caregivers in the behaviors they use to communicate. This chapter examines how the
temporal adaptation of heterogeneous behaviors between infants and their caregivers
can expand researchers’ abilities to observe meaningful trends in infant-caregiver in-
teractions.
As described in Chapter 3.1, the dynamics of infant-caregiver interactions offer an important
vantage point into child development. Typically developing infants attempt to engage their parents
through motor or vocal babbling, and the parents reciprocate with their own attention, gestures, or
vocalizations. This pattern of temporal behavior adaptation is known as “serve-and-return”, and is
characteristic of healthy infant-caregiver relationships (Developing Child 2012).
Recent advances in automated feature extraction and signal processing have enabled computa-
tional analyses of temporal behavior adaptation between infants and their caregivers using recorded
behavioral signals (Hammal et al. 2015b; Leclère et al. 2014b; Messinger et al. 2010b; Leclère et al.
2016b; Mahdhaoui et al. 2011; Tang et al. 2018). These automated approaches have been proposed
to evaluate features that influence infant-caregiver communication across multiple time scales, and
to support quantitative measurements of dyadic processes for clinical observation. Pose, body
movement, facial expressions, and vocal prosodic features have each been used to inform models
of dyadic processes and characterize infant-caregiver interactions.
Since infant and caregiver turns may incorporate vocalizations, gestures, or both, the ability to
include multiple types of behavioral signals is essential to analyzing interaction dynamics (Leclère
et al. 2014b; Developing Child 2012). Additionally, the first year of life is characterized by rapid
growth in infant social skills as well as differences between infant and caregiver communication
behaviors. For example, a younger infant may respond to their mother’s vocalizations by moving
their body, but as their own vocalization skills develop, they may mirror their mother’s behavior
(Leclère et al. 2014b). These differences are not addressed by models that incorporate a single
behavioral signal for each partner. Yet, existing research in this area has focused largely on com-
putational models evaluated on the same interaction behaviors across partners. Building on this
observation, this chapter discusses a method for evaluating changes in temporal behavior adaptation across
age or experimental conditions.
Leveraging video and audio data from the infant-mother interaction dataset described in Chap-
ter 3, the work presented in this chapter discusses advantages of measuring temporal adaptation
of heterogeneous behaviors when tracking changes in interaction dynamics. Using the windowed
cross-correlation with peak-picking (WCCPP) method introduced by Boker et al. (2002) and uti-
lized by Hammal et al. (2015b), we evaluated temporal behavior adaptation across three signals,
namely, head pose, arm angle, and vocal prosody, across infant age and stages of the FFSF proce-
dure.
Results demonstrated that infant behavior changes during the FFSF procedure measured with
head pose, arm pose, and vocalization signals each had unique trends across age. Significant lev-
els of coordination between infant and mother behavioral signals were found not only between
the same behavioral signals, but also across heterogeneous signals. Moreover, the metrics eval-
uated across heterogeneous signals identified trends in infant-mother coordination beyond those
identified by metrics evaluated using the same behavioral signals for each partner. These results
are consistent with prior observations that infants rely on communication across types of behav-
iors (Leclère et al. 2014b), supporting the value of heterogeneous infant-mother temporal behavior
adaptation metrics upon which a higher-level understanding of social synchrony can be built. As a
secondary contribution of this dissertation, a dataset of de-identified behavioral signals was made
publicly available at https://github.com/LaurenKlein/mother-infant-interactions.
4.1 Technical Approach
This section describes our approach to evaluating temporal behavior adaptation across heteroge-
neous behaviors. We measured both the infants’ individual behavior changes, as well as changes
in infant-mother behavior adaptation, across age and stages of the FFSF procedure. As analysis of
the infants’ behavior changes is typical when conducting the FFSF procedure (Adamson and Frick
2003b), this approach allowed us to confirm whether our results were consistent with previous
research findings and developmentally-relevant phenomena.
4.1.1 Feature Extraction
Serve-and-return interaction can involve visual attention, gestures, and vocalizations, driving our
choice of feature extraction methods.
4.1.1.1 Video Features
Consistent with prior work in this area (Hammal et al. 2015b; Adamson and Frick 2003b), head
and arm positions of the dyad were measured throughout the interactions. Head orientation was
one of the original social signals measured during the FFSF procedure (Tronick et al. 1978), and
Stiefelhagen and Zhu (2002) demonstrated head orientation to be a good indicator of visual atten-
tion.
Hand and arm movements are involved in the manipulation of toys shared by the mother and infant;
Tronick et al. (1978) identified arm positions and movements as behaviors of interest during the
FFSF procedure. Our work focused on the position of the upper-arm closest to the camera, as these
were consistently identifiable within the video frame. As the forearms and hands of the infant and
mother were less frequently visible in the video frame, their positions were not considered in this
analysis.
Pose features were extracted from the videos using the open-source software OpenPose (Cao et
al. 2018). In order to maintain invariance to the position of the dyad within the frame, the size of the
infant, occasional readjusting of the camera, or slight repositioning of the infant by the researcher,
we measured the angles between joints, rather than individual joint positions. As the videos were
filmed in profile, the pitch of the infant’s and mother’s head and upper-arm were measured as
proxies for head and arm positions. Head angle, or pitch, was approximated as the angle between
the horizontal and the line connecting the detected nose position and ear position. Upper-arm pitch
was measured as the angle between the horizontal and the line between the detected elbow and
shoulder joints. These angles are shown in Figure 4.1.
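The angle computation is straightforward; the sketch below shows one way to derive head and upper-arm pitch from OpenPose keypoints, assuming image coordinates with the y-axis pointing downward. The keypoint values are hypothetical.

```python
import numpy as np

def pitch_angle(p_front, p_back):
    """Angle (degrees) between the horizontal and the line from p_back to p_front,
    with points given as (x, y) image coordinates (y grows downward, so the sign
    of dy is flipped to make an upward tilt positive)."""
    dx = p_front[0] - p_back[0]
    dy = p_back[1] - p_front[1]
    return np.degrees(np.arctan2(dy, dx))

# Hypothetical OpenPose keypoints (pixel coordinates) for one frame
nose, ear = (412.0, 215.0), (380.0, 228.0)
shoulder, elbow = (371.0, 300.0), (398.0, 352.0)

head_pitch = pitch_angle(nose, ear)       # horizontal vs. the nose-ear line
arm_pitch = pitch_angle(elbow, shoulder)  # horizontal vs. the shoulder-elbow line
print(head_pitch, arm_pitch)
```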
In 10 videos, either the mother or infant could not be reliably tracked due to significant occlu-
sions, or due to family members sitting too close to the dyad and preventing confident discrimina-
tion between the detected features of each person. The remaining dataset of 196 videos (D_video)
includes interaction videos from 54 of the 57 infant-mother dyads who participated in the study.
Figure 4.1: Arm and head angles measured using pose features extracted with OpenPose.
4.1.1.2 Audio Features
Due to its connection to arousal and infant developmental status (Juslin and Scherer 2005; Kappas
et al. 1991), vocal fundamental frequency (F0) was extracted for both infant and mother, using
Praat (Boersma and Weenink 2002), an open-source software package for speech processing. Praat default
settings are based on adult speech; the range for detecting infant F0 was adjusted to between
250 and 800 Hz based on settings suggested by Gabrieli et al. (2019) for analyzing infant vocal
fundamental frequency. F0 was sampled at 100 Hz and subsampled to 30 Hz when relating F0
to pose.
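As a concrete illustration, the following sketch shows one way to reproduce these F0 settings through parselmouth, a Python interface to Praat; the file name and the simple decimation toward the video frame rate are assumptions for illustration, not the original extraction scripts.

```python
import numpy as np
import parselmouth  # Python interface to Praat

sound = parselmouth.Sound("dyad_interaction.wav")  # hypothetical file name
# Infant F0 range adjusted to 250-800 Hz (Praat defaults target adult speech);
# a 0.01 s time step gives the 100 Hz F0 sampling rate used in this work.
pitch = sound.to_pitch(time_step=0.01, pitch_floor=250.0, pitch_ceiling=800.0)
f0 = pitch.selected_array["frequency"]  # 0.0 wherever no F0 was detected
f0[f0 == 0] = np.nan                    # mark unvoiced frames as missing
# Crude decimation toward the 30 Hz video rate; exact alignment to video frames
# would require interpolation against the frame timestamps.
f0_30hz = f0[::3]
```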
As the infants, mothers, and researchers all spoke or vocalized during Play and Reunion stages,
speaker diarization was performed by two trained annotators. 10% of the videos were processed
by both annotators, and a Cohen’s Kappa value of 0.83 was achieved, indicating high agreement.
Vocalizations were labeled as ‘mother and infant’, ‘infant’, or ‘other’, with remaining vocalizations
attributed to the mother. Timer sounds indicating the beginning or end of a stage, and occasional
instances of a researcher or passerby speaking, were annotated as ‘other’ and were not considered
in later analysis. At some points, infants and their mothers vocalized at the same time. Mothers
tended to vocalize more often and more loudly (except when the infants were crying); therefore,
the vocal fundamental frequencies measured during these times were attributed to the mothers.
The amount of time with both an F0 value and an annotation of ‘infant’ was recorded as the
total duration of vocalization for each infant. Only vocalizations with F0 were considered for this
measurement, to prevent the detection of unvoiced sounds such as the infant’s breathing.
Some of the toys in the experiment played loud music throughout the interaction, interfer-
ing with the F0 measurement; therefore, interactions where music was played were excluded
from audio-based analysis. After excluding videos with music, 68 interactions remained for au-
dio feature analysis. This subset of the data (D_audio) includes data from 39 of the original 57
infant-mother dyads. Dyads who did not play music typically included 2- or 6-month-old infants.
As the musical toys were commonly used, most dyads are represented in D_audio once or twice;
only five infants are represented at more than two ages. A breakdown of each dataset by age
is shown in Figure 4.2. De-identified features from both datasets can be found on GitHub at
https://github.com/LaurenKlein/mother-infant-interactions.
Figure 4.2: Number of interactions in each dataset. D_video is the set of interactions with reliable
video data. D_audio is the set of interactions with reliable audio data. D_video consists of data from
54 dyads, and D_audio consists of data from 39 dyads.
4.1.2 Infant Responses to Still-Face Paradigm
To investigate infant behavior change, infant behavioral signals were aggregated by age and stage
of the FFSF procedure. Using these values, we compared how infant responses to the Still-Face
paradigm differed across age, and compared our results with expected infant behavior. The vari-
ances of the head and arm angles were calculated to approximate the total amount of infant head
and arm movement during each stage of the FFSF procedure. This process was conducted sep-
arately for each infant at each age. To compare vocal behavior across stages, we calculated the
percent of time during which the infant was vocalizing in each stage.
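A minimal sketch of this aggregation is shown below, assuming a hypothetical per-frame table with one row per video frame; the column names and values are illustrative.

```python
import pandas as pd

# Hypothetical per-frame table for one interaction: FFSF stage, infant head and
# arm pitch (degrees), and a boolean flag for diarized infant vocalization.
frames = pd.DataFrame({
    "stage": ["play"] * 4 + ["still_face"] * 4 + ["reunion"] * 4,
    "head_pitch": [10, 12, 9, 11, 25, 5, 30, 2, 15, 14, 16, 13],
    "arm_pitch": [40, 42, 41, 39, 45, 20, 50, 22, 44, 43, 42, 45],
    "infant_voiced": [0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0],
})

per_stage = frames.groupby("stage").agg(
    head_pitch_variance=("head_pitch", "var"),        # proxy for head movement
    arm_pitch_variance=("arm_pitch", "var"),          # proxy for arm movement
    percent_time_vocalizing=("infant_voiced", "mean"),
)
print(per_stage)
```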
4.1.3 Infant-Mother Temporal Behavior Adaptation
4.1.3.1 Windowed Cross-Correlation with Peak-Picking
In this work, we modeled temporal behavior adaptation using the WCCPP approach described by
Boker et al. (2002). This approach has been used by Hammal et al. (2015b) in the context of the
FFSF to evaluate changes in head-pose coordination between 4-month-old infants and their moth-
ers from Play to Reunion. WCCPP is characterized by its ability to track changes in the temporal
dynamics of an interaction (Boker et al. 2002), making it suitable for modeling the fluctuating
patterns of serve-and-return interaction.
Windowed cross-correlation estimates the peak strength and time lag of correlations between
two signals at successive windows. A sliding window of length w_max was applied to the two signals
at steady increments w_i. To account for varying temporal relationships between signals, Pearson
correlation coefficients (R) were calculated across multiple lags, with a maximum lag value t_max. To
evaluate the lagged correlation values between a mother’s and infant’s behavioral signals, the win-
dow of the infant’s behavioral signal was shifted between -t_max and +t_max. For an infant’s signal I
and mother’s signal M, the pair of windows W_I and W_M considered after k window increments and
at lag t were selected as shown in Equations 4.1 and 4.2, below:

\[
W_I(k, t) = [\, I_{k w_i + t},\ I_{k w_i + 1 + t},\ \ldots,\ I_{k w_i + w_{max} - 1 + t} \,] \qquad (4.1)
\]

\[
W_M(k) = [\, M_{k w_i},\ M_{k w_i + 1},\ \ldots,\ M_{k w_i + w_{max} - 1} \,] \qquad (4.2)
\]

Pearson correlation coefficients were calculated at each lag, producing a series R of correlation
values shown in Equation 4.3:

\[
R = [\, r(W_I(k, -t_{max}), W_M(k)),\ r(W_I(k, -t_{max} + 1), W_M(k)),\ \ldots,\ r(W_I(k, t_{max}), W_M(k)) \,] \qquad (4.3)
\]

where r(W_I, W_M) calculates the Pearson correlation coefficient between the two windowed signals
W_I and W_M.
The plot of correlation value as a function of lag was smoothed to reduce noise using a quadratic
Savitzky-Golay filter with a moving window of 5 samples. The lag at which a peak correlation
occurs was identified and considered for later analysis. This process is further illustrated in Figure
4.3.
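The sketch below is a simplified Python implementation of this procedure under the parameter choices discussed in the next subsection (90-sample windows at 30 Hz, maximum lag of 10 samples); the window step, the peak rule, and the synthetic signals are illustrative, and the peak-validity criteria applied in the full analysis are omitted.

```python
import numpy as np
from scipy.signal import savgol_filter
from scipy.stats import pearsonr

def wccpp_peak_lags(infant, mother, w_max=90, w_step=30, t_max=10):
    """Windowed cross-correlation with peak-picking (after Boker et al. 2002).
    For each window of the mother's signal, correlate lagged windows of the
    infant's signal, smooth the lag-correlation curve with a quadratic
    Savitzky-Golay filter (5-sample window), and record the lag of the
    largest-magnitude correlation. Returns one peak lag per window."""
    peak_lags = []
    lags = np.arange(-t_max, t_max + 1)
    n = min(len(infant), len(mother))
    for start in range(t_max, n - w_max - t_max, w_step):
        w_mother = mother[start:start + w_max]
        corrs = np.array([pearsonr(infant[start + t:start + t + w_max], w_mother)[0]
                          for t in lags])
        smoothed = savgol_filter(corrs, window_length=5, polyorder=2)
        peak_lags.append(lags[np.argmax(np.abs(smoothed))])
    return np.array(peak_lags)

# Synthetic example: an "infant" signal that trails the "mother" signal by 5 samples
rng = np.random.default_rng(0)
mother = rng.standard_normal(900).cumsum()
infant = np.roll(mother, 5) + 0.1 * rng.standard_normal(900)
peak_lags = wccpp_peak_lags(infant, mother)
print(np.var(peak_lags))  # lag variance, later used as a coordination metric
```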
The appropriate window size, window step size, maximum lag, and lag step size depend on the
dataset. Additionally, establishing criteria for identifying peaks can help to reduce the number of
erroneous peaks caused by noise. The next subsection discusses the selection of these parameters
and criteria.
4.1.3.2 Parameter Selection
An inherent challenge of windowed cross-correlation across two different behavioral signals stems
from the fact that the appropriate window size may be different for the two behaviors. Window
sizes of 3-4 seconds are typical when analyzing motor movements. Hammal et al. (2015b) found 3
seconds to be an appropriate window size for analyzing infant and mother head pose coordination,
while Boker et al. (2002) used window sizes of 4 seconds to analyze head and arm gestures during
Figure 4.3: Illustration of windowed cross-correlation with peak-picking, evaluated for a single
window on example signals. Top: a given time window is shifted across multiple lag values,
changing the range of infant angle values considered. Bottom: The correlation values between
windowed signals are plotted, and a peak lag value is identified.
dyadic conversations to account for the typical amount of time needed to produce and perceive
these gestures. Based on these prior studies and our initial analysis, we used a window size of 3
seconds (90 samples) when analyzing arm and head pose coordination.
In contrast to head and arm movements, vocalizations occurred in much smaller windows
of time. After interpolating between vocalizations that occurred less than 0.25 seconds apart,
the average length of vocalizations by the mother was found to be between 0.5 and 1 seconds
(µ = 0.61 s, σ = 0.39 s). As each instance of windowed cross-correlation requires a single w_max,
the difference in time scales was resolved by selecting the smaller of the two windows when mod-
eling temporal behavior adaptation between vocal and pose signals. Since continuous 3-second
windows of vocalization cannot be measured reliably, we selected a 1-second window for calcula-
tions involving vocalization. The correlation between signals was calculated provided the infant
or mother vocalized for over 50% of the window period; otherwise, no correlation value or peak
was reported. As this resulted in smaller window sizes for a portion of the data and therefore
increased noise, we selectively considered only peaks corresponding to R values that were statisti-
cally significant at p< 0.05. Based on our window size and initial data analysis, a maximum lag
value of 10 was selected.
4.1.3.3 Model Evaluation
Results of the cross-correlation analysis were evaluated in two ways. First, we compared the
number of peaks found between the true behavioral signals with the number of peaks found when
each signal was randomly reordered. Pairs of signals that had significantly more correlation peaks
than their randomized counterparts were considered for further analysis.
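One simplified way to carry out this comparison is sketched below: a stand-in peak count is computed for the true signal pair and for surrogates obtained by randomly reordering one signal. The counting rule and the synthetic signals are illustrative and do not reproduce the exact peak criteria used in the analysis.

```python
import numpy as np
from scipy.stats import pearsonr

def count_significant_windows(x, y, w_max=90, w_step=30, t_max=10, alpha=0.05):
    """Count windows in which at least one lagged correlation between x and y is
    significant at level alpha. A simplified stand-in for the peak counts used
    in the analysis."""
    count = 0
    for start in range(t_max, min(len(x), len(y)) - w_max - t_max, w_step):
        w_y = y[start:start + w_max]
        p_values = [pearsonr(x[start + t:start + t + w_max], w_y)[1]
                    for t in range(-t_max, t_max + 1)]
        count += int(min(p_values) < alpha)
    return count

# Synthetic example: an "infant" signal that partially tracks the "mother" signal
rng = np.random.default_rng(1)
mother = rng.standard_normal(900).cumsum()
infant = 0.7 * np.roll(mother, 3) + 0.3 * rng.standard_normal(900).cumsum()

observed = count_significant_windows(infant, mother)
# Surrogate baseline: randomly reorder one signal and recount, many times
surrogates = [count_significant_windows(rng.permutation(infant), mother)
              for _ in range(20)]
print(observed, np.mean(surrogates))
```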
To investigate how temporal behavior adaptation changed with FFSF stage and across ages,
we calculated the variance of peak lag values for each infant, stage, and age. Boker et al. (2002)
noted that higher lag variance corresponded with interactions that are further from synchrony,
and found that lag variance increased significantly following an interruption or communication
barrier (in their case, the presence of ongoing loud noise). As a higher lag variance indicates the
presence of lag values that are further from 0 and less stability in the interaction dynamics, a higher
lag variance may also indicate less coordinated behavior in the dyadic interaction. We therefore
used lag variance as an approximation of the level of temporal adaptation between two behavioral
signals.
4.2 Results and Analysis
This work evaluates differences in behavior across both age and stage of the FFSF; we therefore
considered statistical significance at α < 0.025, using the Bonferroni correction for multiple com-
parisons.
4.2.1 Infant Responses to Still-Face Paradigm
4.2.1.1 Infant Head Angle
Repeated measures ANOVAs revealed significant differences in the amount of head pitch variance
across stages of the FFSF at ages 6 and 12 months, p < 0.005. Student’s t tests were then used to
identify differences between individual stages, with results reported in Figure 4.4. There was an
increase in variance during Still-Face at 9 months as well, but this was not found to be statistically
significant.
4.2.1.2 Infant Arm Angle
Figure 4.4 shows the distribution of arm angle variance across age and FFSF stage. Differences in
variances between stages were not found to be significant; however, a linear mixed effects model
with infant ID as a fixed effect found that the average amount of arm angle variance across all
three stages increased with age, p< 0.025. This indicates an increase in the amount of infant arm
movement with age.
4.2.1.3 Infant Vocalizations
Repeated measures ANOVAs found significant differences in the amount of infant vocalization
across the FFSF stages at 2 and 6 months of age, with p < 0.005. We also note increases in the
amount of vocalization at 9 and 12 months, though these were not significant with the Bonferroni
correction (p= 0.048, 0.035, respectively). Student’s t tests were used to evaluate differences
between individual stages at these ages. Infants at 6, 9, and 12 months of age vocalized more often
Figure 4.4: Trends in infant behavior across FFSF stage and age. Bottom left: percent of time
spent vocalizing; top right: variance in arm pitch; bottom right: variance in head pitch. Bars and
asterisks represent significant results for Student’s t tests between individual FFSF stages, with *
p< 0.025, and ** p< 0.001.
during the Still-Face stage, while 2-month-old infants vocalized most during the Reunion stage.
These trends are illustrated in Figure 4.4. No significant relationship was found between age and
amount of vocalization. The amount of infant vocalization was the only metric with significant
differences across stages of the FFSF procedure for 2-month-olds.
4.2.2 Infant-Mother Temporal Behavior Adaptation
Each measure of cross-correlation incorporating two pose signals produced significantly more
peaks than randomly reordered signals (p< 0.001), indicating significance with Bonferroni correc-
tion. Measures of cross-correlation incorporating the mother’s vocal signals produced more peaks
than random (p < 0.005) at every age except for 18 months, which had a sample size of only 5
dyads. The number of peaks was not significantly greater than random for measures incorporating
the infant’s vocalization signals. Pairs of signals showing significantly more peaks than random
suggest mother and infant adapted the corresponding behaviors to each other over time.
4.2.2.1 Lag Variance
For pairs of signals that demonstrated significant coordination, we compared the peak lag variance
across FFSF stage and across age to identify trends in infant behavior across time. Student’s t
tests identified whether differences in lag variance between the Play and Reunion stages were
significant, indicating a shift in the level of coordination across the two stages. These results are
illustrated in Figure 4.5. Each pair of signals had a unique range of ages for which the lag variance
could significantly distinguish between Play and Reunion. The (mother head angle, infant arm
angle) pair was the only signal pair to show significant differences in lag variance across stages
for all age groups; however, each age group had at least two signal pairs for which lag variance
differed significantly across stages. We note that intermodal signal pairs, specifically the (mother
head angle, infant arm angle) and (mother F0, infant head angle), were the only pairs of signals for
which significant differences in lag variance were found between Play and Reunion for 2-month-
old infants. Notably, all significant differences across stages showed an increase in lag variance
from Play to Reunion, representing less consistent dynamics in the Reunion stage compared to
Play.
A linear mixed model analysis with infant ID as a random effect indicated a significant, inverse
relationship between age and lag variance for the (mother head angle, infant head angle), (mother
arm angle, infant arm angle) and (mother head angle, infant arm angle) pairs (p< 0.025). These
trends are illustrated in Figure 4.5. Significant differences across age were not seen for the remain-
ing pairs. While the max lag t_max was chosen empirically, the direction of change in lag variance
remained negative across age and positive across experimental stage for t_max values between 8 and
14 inclusive, for the ages and signal pairs that yielded significant results.
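A sketch of this kind of analysis using statsmodels is shown below; the data frame is synthetic, and treating infant ID as the random-effect grouping variable follows the description above rather than the original analysis scripts.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic long-format data: one row per (infant, age) with a mean lag variance
# value for a single signal pair (values are illustrative, not from the dataset).
rng = np.random.default_rng(2)
infants = [f"infant_{i}" for i in range(8)]
offsets = {infant: rng.normal(0.0, 1.0) for infant in infants}
rows = [{"infant_id": infant,
         "age_months": age,
         "lag_variance": 15.0 - 0.4 * age + offsets[infant] + rng.normal(0.0, 1.5)}
        for infant in infants for age in (2, 6, 9, 12)]
df = pd.DataFrame(rows)

# Lag variance regressed on age, with infant ID as the random-effect grouping
# variable; the coefficient on age_months estimates the trend across age.
model = smf.mixedlm("lag_variance ~ age_months", df, groups=df["infant_id"])
print(model.fit().summary())
```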
Figure 4.5: Left: Student’s t-test statistic between lag variance distributions during Play and Re-
union. Negative values indicate a higher mean lag variance during Reunion compared to Play.
The key at the left indicates the behavioral signals which were input into the windowed cross-
correlation model. Significance is reported with * p< 0.025, and ** p< 0.001. Right: Median
lag variance across age, with 95% confidence intervals.
4.3 Discussion and Summary
The results presented in this chapter support the value of heterogeneous behavioral signals toward
a more holistic understanding of infant-mother behavior adaptation. Prior to this work, research in
computational modeling of infant-caregiver interaction has been mainly focused on temporal adap-
tation of homogeneous behaviors. In this work, the temporal dynamics evaluated using windowed
cross correlation with peak-picking with heterogeneous signal pairs identified meaningful trends
across conditions, showing significantly more variance after the stressful Still-Face stage, and at
younger infant ages. These trends mirrored similar analyses evaluated on homogeneous signals,
but extended our ability to distinguish between stages of the FFSF at certain ages. For example,
only heterogeneous signal pairs showed significant changes in lag variance from Play to Reunion
at 2 months of age. Moreover, the (mother head angle, infant arm angle) pair demonstrated a
greater decrease in lag variance across age than any other pair of signals.
The perspective offered by heterogeneous behavioral signals was also evident from the infants’
individual behaviors, with each type of behavioral signal demonstrating a different trend. Increases
in head movement followed an almost parabolic trend, with the largest changes at 6, 9, and 12
months. Our finding that infants reacted more strongly to the Still-Face at certain ages based on
this metric is consistent with prior work using the FFSF procedure: as infants grow older, they
gain motor skills and therefore their reaction to the Still-Face stage along certain metrics becomes
more observable (Adamson and Frick 2003b). As the 2-month-old infants often needed to have
their heads supported by the researcher, their ability to adjust their head movements may have been
limited. Adamson and Frick (2003b) also noted that after 9 months of age, infants are better able
to distract themselves and therefore are less affected by their mother’s ignoring behavior during
the FFSF procedure. This was reflected in the smaller increase in head movement at 18 months
as compared to 6, 9, and 12 months. The significance of the change in head pose variance at 12
months may be due to the lack of toys during the Still-Face stage. In many of the procedures,
the mothers removed the toys from the infant’s grasp during the Still-Face stage, often causing the
infants to become fussy, resulting in more head movement.
Conversely, infants demonstrated increased arm movement with age, but not across FFSF
stages. As infants used their arms during the Play and Reunion stages to touch and manipulate toys,
the variance remained high during these stages. The amount of vocalization varied across stages at
all ages except 18 months, and was the only feature to distinguish infant behavior between stages
at 2 months. Given the evolution of communication behaviors with age, tracking behavior using
multiple modalities is necessary to fully represent infant responses to stressful situations such as
the FFSF procedure.
The measured WCCPP outputs also highlight the differences in communication abilities be-
tween infants and their mothers, and how these differences influence social synchrony. While
some intermodal signals were coordinated and showed meaningful changes in coordination across
age, this did not necessarily imply that the reversed pair of signals demonstrated the same trend.
For example, while the pairs including the mothers’ vocalization signals were significantly coordi-
nated, pairs including the infants’ vocalization signals were not. This was likely because mothers
vocalized and spoke more often, while infant vocalizations were relatively infrequent compared to
the length of the interaction. While the (mother head angle, infant arm angle) pair showed sig-
nificant trends across age, the (mother arm angle, infant head angle) pair did not. Incorporating
heterogeneous behavioral signals can therefore support opportunities to evaluate social synchrony
in domains where communication differences can pose a challenge, such as in human-robot in-
teraction or during interactions with individuals with disabilities impacting their communication
behaviors.
Incorporating heterogeneous infant and mother behavioral signals supported a more detailed
view of temporal behavior adaptation at each age. However, incorporating these metrics increases
the number of features available when evaluating coordination between individuals. Since correla-
tion supports the association of only two signals at a time, new approaches are needed to integrate
multiple signals when modeling temporal behavior adaptation. The upcoming chapter details our
approach to this challenge using Dynamic Mode Decomposition with control.
Chapter 5
Integrating Multiple Heterogeneous Behaviors into a Model of
Social Synchrony with Dynamic Mode Decomposition with
Control
Contributors: Chapter 5 is based on Klein et al. (2021). Additional authors of the published
work include Victor Ardulov, Alma Gharib, Barbara Thompson, Pat Levitt, and Maja J. Matari´ c.
This chapter explores Dynamic Mode Decomposition with control (DMDc) as an ap-
proach to integrating multiple signals from each communicating partner into a model
of temporal behavior adaptation. Evaluated on the infant-mother interaction dataset
described in Chapter 3.1, results support the use of DMDc in modeling social syn-
chrony. Based on these results, a new metric is proposed for quantifying temporal
behavior adaptation across heterogeneous signals.
5.1 Dyadic Interaction as a Dynamical System
While prior work has modeled temporal behavior adaptation using diverse behavioral signals, it is
currently unclear how automated methods may integrate these signals to capture a more holistic
perspective. In the case of infant-mother interaction, partners may communicate using gestures,
gaze, vocalizations, and shared affective states. Accounting for infant and mother communication
across multiple behaviors without manually annotating higher-level behaviors (or training classi-
fiers on large amounts of data to recognize higher-level behaviors) is an unsolved problem.
This chapter explores Dynamic Mode Decomposition with control (DMDc) as a model for mul-
timodal interpersonal coordination during dyadic interactions. This research builds on past work
by Ardulov et al. (2018), who presented the first use of DMDc to model social interactions, specifi-
cally using prosody and lexical features from child forensic interviews. We used the infant-mother
interaction dataset described in Chapter 3.1 along with the corresponding extracted pose and vo-
cal fundamental frequency signals described in Chapter 4. Our approach models the interaction
between infant and mother as a dynamical system, while enabling multiple continuous behavioral
signals as inputs. We analyzed the eigenvalues of the dynamical systems models and introduce
a new metric of temporal behavior adaptation based on the models’ coefficients. We evaluated
our approach by assessing the ability of the resulting metrics to identify trends in interaction dy-
namics during the experimental procedure and across infant age, comparing results evaluated on
homogeneous and heterogeneous behavioral data.
Results demonstrate the ability of our approach to integrate multiple behavioral signals into a
single model of temporal behavior adaptation, even when leveraging different sets of behavioral
signals for each interacting partner. Model output representing dyadic processes followed known
trends in mothers’ behavior and infant-mother interaction across stages of the FFSF procedure,
demonstrating the ability of DMDc to capture relevant social information. Effect sizes were larger
when models were fit on both head pose and arm pose data, supporting the importance of integrat-
ing heterogeneous behavioral signals. Additional exploration showed that for both homogeneous
and heterogeneous data, model output identified relationships between changes in interaction dy-
namics from play to still-face and infants’ movement behavior, and trends in both individual and
dyadic behaviors across infant age. These results support the use of our modeling approach in
generating descriptive metrics of temporal behavior adaptation.
5.2 Technical Approach
This section describes our approach for automated analysis of infant-mother coordination. Section
5.2.1 describes the feature extraction process, and Section 5.2.2 discusses the application of DMDc
to multimodal behavioral signals. Figure 5.1 details the full automated pipeline.
Figure 5.1: Computational modeling pipeline. Left: pose landmarks are extracted from a video
frame using OpenPose and distances between features are calculated; vocal fundamental frequency
is extracted with Praat and speaker identification is performed via manual annotation; middle:
Dynamical system model with arrays of infant and mother features from two consecutive frames;
right: matrices of infant and mother features. In this example, DMDc is applied to multimodal
data sampled at 30 Hz from a 3-second window of an interaction, and the infant’s features are used
as the control input.
5.2.1 Dataset
We used a dataset of audio-video recordings of infants and their mothers participating in the Face-
to-Face Still-Face (FFSF) procedure described in Chapter 3.1. To track behaviors of both the
mothers and infants, we leveraged the extracted video features including arm pose and head pose,
and audio features including vocal fundamental frequency (F0) described in Chapter 4 subsection
4.1.1. Additional measures to identify mother and infant from their corresponding OpenPose land-
marks are described in Subsection 5.2.1.1. As infants in this dataset vocalized infrequently (Klein
et al. 2020), only the mothers’ F0 values were used for further analysis. The interaction recordings
comprising both datasets are detailed in Chapter 4, Figure 4.2. The dataset of video features is
labeled as D_video and the dataset of audio features as D_audio.
5.2.1.1 Person Tracking
The pose of each participant was identified in each frame using the open-source OpenPose software
(Cao et al. 2018). Leveraging the relatively constant, seated positions of participants within the
frame throughout each recording, a clustering method was used to assign each detected set of pose
landmarks to the correct person and avoid tracking erroneously detected pose landmarks. First,
cluster centroids were initiated using the head positions of the three most confidently detected
people, representing the infant, mother, and researcher. Only head landmarks such as the nose or
neck were used for tracking, as these had the most consistent positions within each video frame.
Participants moved their heads while interacting; however, the overall location of each person’s
nose and neck landmarks changed less frequently than landmarks on the limbs, resulting in more
reliable discrimination between people. If an additional visitor accompanying the dyad was present
throughout the entire video, a fourth cluster centroid was initiated. In each consecutive frame, the
head position of each detected person was calculated and assigned to a cluster so as to minimize
the total distance between each cluster centroid and its new assigned point. The centroid of each
cluster was updated after each addition. After the clustering algorithm terminated, each cluster
was assigned as the mother, infant, researcher, or visitor based on the known seating positions of
each individual. Only the mothers’ and infants’ data were used for further analysis.
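The assignment step can be sketched as a small matching problem, as below; this uses SciPy's linear_sum_assignment to find the minimum-total-distance assignment and a fixed-rate centroid update, which simplifies the running-mean update described above. The landmark values are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_detections(centroids, head_points):
    """Match each running cluster centroid (K x 2) to the detected head position
    (N x 2) that minimizes the total centroid-to-detection distance in a frame.
    Returns a dict mapping cluster index -> detection index."""
    dists = np.linalg.norm(centroids[:, None, :] - head_points[None, :, :], axis=2)
    cluster_idx, detection_idx = linear_sum_assignment(dists)
    return {int(c): int(d) for c, d in zip(cluster_idx, detection_idx)}

# Hypothetical frame: running centroids for infant, mother, and researcher,
# plus three newly detected head positions (pixel coordinates).
centroids = np.array([[320.0, 240.0], [540.0, 230.0], [310.0, 180.0]])
detections = np.array([[536.0, 228.0], [322.0, 243.0], [315.0, 177.0]])
matches = assign_detections(centroids, detections)
print(matches)  # cluster -> detection, e.g., {0: 1, 1: 0, 2: 2}

# Simple running update of each matched centroid (the original procedure
# recomputed cluster means after every assignment; a fixed rate is used here).
alpha = 0.1
for cluster, detection in matches.items():
    centroids[cluster] = (1 - alpha) * centroids[cluster] + alpha * detections[detection]
```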
5.2.1.2 Pose Evaluation
After the pose landmarks for both mother and infant were identified, head and arm pose were mon-
itored within each frame. Head and arm pose were measured using the detected pose landmarks
from each frame, as illustrated in Figure 5.1. Head pose was monitored using the horizontal and
vertical distances between the nose and neck landmarks. Arm pose was monitored using the hor-
izontal and vertical distances between the neck and elbow landmarks. As the videos were filmed
in profile view, only the arm which was more consistently detected throughout the recording was
considered. Since DMDc does not restrict model input to a single signal from each participant,
both the horizontal and vertical pose measurements were used as inputs. This removed the need
to model pose using angles, which can introduce erroneous variance during fluctuations between
0° and 359°. To normalize by the size of participants or closeness to the camera, distances were
divided by the size of each participant’s head, measured as the average distance between the nose
and ear landmarks.
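A minimal sketch of this feature computation is given below; the landmark coordinates are hypothetical, and the nose-ear normalizer is computed from a single frame here, whereas the analysis uses the average distance over the recording.

```python
import numpy as np

def normalized_pose_features(nose, neck, elbow, ear):
    """Head- and arm-pose features for one frame: horizontal and vertical
    nose-neck and neck-elbow distances, divided by head size (nose-ear
    distance)."""
    nose, neck, elbow, ear = map(np.asarray, (nose, neck, elbow, ear))
    head_size = np.linalg.norm(nose - ear)
    head_dx, head_dy = (nose - neck) / head_size
    arm_dx, arm_dy = (elbow - neck) / head_size
    return np.array([head_dx, head_dy, arm_dx, arm_dy])

# Hypothetical OpenPose landmarks (pixels) for one frame of the mother
features = normalized_pose_features(nose=(412.0, 215.0), neck=(405.0, 290.0),
                                    elbow=(398.0, 352.0), ear=(380.0, 228.0))
print(features)
```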
5.2.2 Dynamical System Modeling
For each recorded interaction, we fit windowed DMDc models to the extracted data. Separate mod-
els were fit for each modality (arm pose, head pose, F0) and for each combination of modalities.
This allowed direct analysis of the ability of DMDc to identify trends by comparing results across
models.
5.2.2.1 Window Selection
As described in Section 2.1, infant-mother interactions are characterized by time-varying dynam-
ics; throughout typical play, the individual leading the interaction can change over time. Addition-
ally, even for infants with typical development, dyads will likely shift between states of coordi-
nated communication and mismatched or asynchronous communication. To capture these chang-
ing dynamics, data from each interaction were split into non-overlapping intervals of 3 seconds.
This interval was selected based on past work (Klein et al. 2020; Hammal et al. 2015a) that used
windowed-cross correlation to explore temporal adaptation of infant and mother pose during the
FFSF protocol. As the infants’ and mothers’ arms were occasionally out of frame, windows with
more than 10% of missing infant or mother arm pose data were excluded from further analysis.
For audio analysis, windows without vocalizations, and therefore without F0 data, were excluded.
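The windowing and exclusion rule can be sketched as follows; the feature matrix is synthetic, and the missing-data check is applied to all columns here, whereas the analysis above applies it to the arm pose signals.

```python
import numpy as np

def split_windows(signals, fps=30, window_sec=3, max_missing=0.10):
    """Split a (T x d) array of per-frame behavioral features into
    non-overlapping 3-second windows, dropping windows where more than 10%
    of frames contain missing (NaN) values. Returns a list of (w x d) arrays."""
    w = fps * window_sec
    windows = []
    for start in range(0, len(signals) - w + 1, w):
        window = signals[start:start + w]
        missing_fraction = np.isnan(window).any(axis=1).mean()
        if missing_fraction <= max_missing:
            windows.append(window)
    return windows

# Hypothetical feature matrix: 20 seconds of 4 pose features with a missing gap
rng = np.random.default_rng(3)
feats = rng.standard_normal((600, 4))
feats[200:260] = np.nan  # e.g., the infant's arm left the camera frame
print(len(split_windows(feats)))  # the window overlapping the gap is discarded
```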
5.2.2.2 Control Parameters
To appropriately capture interaction dynamics such as the “serve-and-return” pattern (described in
Chapter 2), it is necessary for the model to incorporate changes in the leader of the interaction.
Various moments may be characterized by the infant reaching out and the mother responding or
vice versa, or by an uncoordinated, mismatched state. Windowed cross-correlation addresses this
by varying the lag between correlated signals. In this work, we address changes in leader-follower
dynamics by fitting two types of models: one DMDc model describes the continuous evolution of
the mother’s behavioral signals and integrates the infant’s behavior as the control signals, while the
other models the infant’s behavior over time and uses the mother’s behavior as the control.
5.2.2.3 Dynamic Mode Decomposition with Control
To model temporal behavior adaptation between partners during the still-face interaction, our work
leverages Dynamic Mode Decomposition with control (DMDc), which assumes the dynamical system model

\[
c_{t+1} = A c_t + B u_t \qquad (5.1)
\]

where c_t and c_{t+1} represent the observations of the mother’s behavior at times t and t + 1,
respectively, while u_t represents the child’s signal. The relationship described in Eq. 5.1 reflects
the underlying assumption that the observed signal for a mother is a combination of the signals
from mother and infant at the previous time step. DMDc is an algorithm for estimating the
transition matrix, A, and controller matrix, B. This is accomplished by recognizing that, given an
observational window C_t = [c_t, c_{t-1}, ..., c_{t-w}] and U_t = [u_t, u_{t-1}, ..., u_{t-w}], it holds that:

\[
C_{t+1} = A C_t + B U_t = [\, A \;\; B \,] \begin{bmatrix} C_t \\ U_t \end{bmatrix} \qquad (5.2)
\]

From this form it is possible to solve for A and B:

\[
[\, A \;\; B \,] = C_{t+1} \begin{bmatrix} C_t \\ U_t \end{bmatrix}^{\dagger} \qquad (5.3)
\]

where † represents the Moore-Penrose pseudo-inverse.
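A minimal NumPy sketch of this estimation step is shown below; it computes the full pseudo-inverse solution of Eq. 5.3 for one window, with randomly generated stand-ins for the mother's and infant's feature matrices (implementation details of the original pipeline are not restated here).

```python
import numpy as np

def fit_dmdc(C, U):
    """Estimate the transition matrix A and control matrix B of the model
    c_{t+1} = A c_t + B u_t from state snapshots C (d_c x T) and control
    inputs U (d_u x T) via the Moore-Penrose pseudo-inverse (Eq. 5.3)."""
    C_now, C_next = C[:, :-1], C[:, 1:]
    U_now = U[:, :-1]
    AB = C_next @ np.linalg.pinv(np.vstack([C_now, U_now]))
    d_c = C.shape[0]
    return AB[:, :d_c], AB[:, d_c:]  # A (d_c x d_c), B (d_c x d_u)

# Hypothetical 3-second window (90 frames at 30 Hz): the mother's 4 pose
# features as the state and the infant's 4 pose features as the control input.
rng = np.random.default_rng(4)
mother = rng.standard_normal((4, 90))
infant = rng.standard_normal((4, 90))
A, B = fit_dmdc(mother, infant)
print(A.shape, B.shape)  # (4, 4) (4, 4)
```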
Since the signal c representing the mother's behavior has a consistent dimension across time steps, A is a square matrix. Therefore, we can study the dynamic response (or mode) of the mother's behavior by examining the dominant eigenvalues of the transition matrix. Given that this model expresses a discrete-time dynamical system, the eigenvalues correspond with the frequency responses; the eigenvalue with the largest complex magnitude represents the "dominant" dynamics. Accordingly, the magnitude of the dominant eigenvalue expresses an exponential decay, while the angle to the real axis reflects an oscillatory response. To distinguish between models, we denote the eigenvalues as λ_{A,I} or λ_{A,M}. The subscript I or M indicates the control input to the model as the behavioral signals of the infant or mother, respectively.
Coefficient matrices A and B give us an estimate of the influence that prior infant and mother
behaviors have on the mother’s behavior at the next time step. By comparing A and B, we can
estimate the extent to which the mother adapts her behavior in response to her infant versus as a
function of her own past behavior. Therefore, we introduce the measure relative influence (R):
R = ||B||_F / ||A||_F    (5.4)
While the concept of temporal behavior adaptation is multifaceted, we use relative influence
R as a way to measure a component of an infant’s or mother’s responsiveness or adaptation to
their interactive partner during a given window. Larger values of R represent instances when the
mother is estimated to be more responsive to her infant’s behavior. To distinguish between infant-
controlled models and mother-controlled models, we denote the relative influence of the infant on
the mother as R_I, and the relative influence of the mother on the infant as R_M.
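The per-window metrics described above can be computed directly from the fitted matrices, as in the sketch below; the function name is illustrative, and A and B are assumed to come from a windowed DMDc fit such as fit_dmdc above.

```python
import numpy as np

def window_metrics(A, B):
    """Dominant eigenvalue of A and relative influence R = ||B||_F / ||A||_F.

    Returns the dominant eigenvalue's magnitude, its angle to the real axis,
    and the relative influence R for one window."""
    eigvals = np.linalg.eigvals(A)
    lam = eigvals[np.argmax(np.abs(eigvals))]      # largest-magnitude ("dominant") eigenvalue
    R = np.linalg.norm(B, 'fro') / np.linalg.norm(A, 'fro')
    return np.abs(lam), np.angle(lam), R
```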
5.2.3 Analysis
A successful model of interaction dynamics must be able to recognize changes in behavior and
dyadic processes, such as those caused by the stressful still-face stage or communication patterns
that change as the infant develops. Therefore, we evaluated how the metrics detailed in Section
5.2.2.3 evolved across stages of the FFSF protocol and across infant age. The mean values of
these metrics were aggregated across 3-second windows occurring in the same experimental stage.
Identified trends were compared across models fit on each modality or combination of modalities
to assess the ability of DMDc to evaluate multimodal behavioral coordination.
Given the larger size of D_video compared to D_audio, we conducted our analyses in two stages.
First, we evaluated our approach on the dataset of head pose and arm pose features. Analyzing
our approach on this larger dataset allowed for a robust analysis of DMDc’s ability to incorporate
multiple behavioral signals. Next, we repeated our analysis using the interactions and FFSF stages
with available F0 data. A separate analysis of coordination metrics calculated using the mothers’
F0 data enabled us to directly evaluate the model’s ability to incorporate different sets of behavioral
signals for each partner in a dyad, and to address the challenges associated with less frequent
behaviors (i.e., vocalizations).
5.2.3.1 Model Validation
We first evaluated our model’s ability to identify known trends inherent to the FFSF protocol. As
mothers were instructed not to respond to their infants during the still-face stage, it follows that the
relative influence of an infant on the mother should be lower during the still-face stage. Therefore,
we anticipated a decrease in R_I from the play to still-face stages. This effect was tested with two-
tailed Student’s t-tests for each set of DMDc models. Since mothers were instructed not to interact
(or vocalize) during the still-face stage, audio data were excluded from this portion of the analysis.
Additionally, representative models must capture the lead-follow or “serve-and-return” pat-
tern of infant-mother interactions. Given the finite length of each interaction, there is an inherent
trade-off between the time spent leading vs. following; if a larger part of an interaction stage is
characterized by the infant responding to the mother’s cues, there is less time remaining when
the mother is responding to the infant. Therefore, we anticipated that stages with higher mother-
to-infant influence should show lower infant-to-mother influence. We tested this hypothesis by
evaluating the Pearson correlation coefficient between R_I and R_M for both the play and reunion
stages for each set of DMDc models. As the mothers were instructed not to respond at all to their
infants during the still-face stage, we did not conduct this analysis for the still-face stage.
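A sketch of these two validation tests is shown below. The input names are hypothetical, and whether the original analysis paired observations across stages is not specified here, so the sketch uses an independent-samples t-test purely for illustration.

```python
import numpy as np
from scipy import stats

def validation_tests(r_i_play, r_i_sf, r_m_play):
    """Illustrative versions of the validation tests in Section 5.2.3.1.

    r_i_play, r_i_sf: aggregated R_I values for the play and still-face stages.
    r_m_play: aggregated R_M values for the play stage (paired with r_i_play).
    """
    # Two-tailed t-test for the anticipated drop in R_I from play to still-face.
    t_stat, t_p = stats.ttest_ind(np.asarray(r_i_play), np.asarray(r_i_sf))
    # Pearson correlation capturing the leading-following trade-off within a stage.
    r_val, r_p = stats.pearsonr(np.asarray(r_i_play), np.asarray(r_m_play))
    return (t_stat, t_p), (r_val, r_p)
```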
5.2.3.2 Exploring Trends in Interaction Dynamics
Beyond monitoring known behavior patterns, a goal of modeling infant-mother coordination is to
explore trends that emerge with changes in interaction quality or as part of infant development.
For example, studies have conducted the FFSF procedure with varying levels of maternal unre-
sponsiveness, sometimes allowing different forms of touch between mother and infant during the
still-face stage (Adamson and Frick 2003a). We explored the effect of mothers’ changing behav-
ior across FFSF stages by evaluating the relationship between the decreased influence of infants
on their mothers from play to still-face stages, or decreased responsiveness from mothers, and in-
fant behavior during the still-face stage. We measured the difference in infants’ influence on their
mothers across stages as R_I^play − R_I^still-face, capturing the difference in the relative influence metrics from play to still-face stages for models that used the infants' behavioral signals as control inputs. To explore how this value related to infants' behaviors during the still-face stage, we monitored λ_{A,M}^still-face, the dominant eigenvalue of the infant's transition matrix. This allowed for a direct
comparison between changes in a mother's feedback across stages and the dynamics of infant be-
havior during the stressful still-face stage. As mothers did not vocalize during the still-face stage,
only head pose and arm pose were evaluated for this analysis.
Next, we analyzed how coordination metrics extracted by each model evolved with infant age.
As infants grow older, their motor and social skills develop, impacting their interactive capabilities
and in turn influencing their mothers’ responses. To evaluate trends in behavior across infant
age, we conducted a linear mixed models (LMM) analysis of both individual parameters (λ_A) and dyadic parameters (R), with the ID of each dyad as a random effect. LMM coefficients were
evaluated to assess the direction and strength of relationships between each metric and age. Given
the difference in scale between variables (for example, infant age ranges between 2 and 18 months
while eigenvalues remain close to the unit circle), values were normalized between 0 and 1 prior
to statistical testing.
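A minimal sketch of this age-trend analysis, assuming the metrics are organized in a pandas DataFrame with hypothetical column names ('age', 'dyad_id', and a metric column), is shown below; it uses the statsmodels mixed linear model interface, which may differ from the software used in the original analysis.

```python
import pandas as pd
import statsmodels.formula.api as smf

def fit_age_trend(df: pd.DataFrame, metric_col: str = "metric"):
    """Fit a linear mixed model of one coordination metric against infant age,
    with dyad ID as a random effect. Metric and age are min-max normalized to [0, 1]."""
    data = df.copy()
    for col in (metric_col, "age"):
        col_min, col_max = data[col].min(), data[col].max()
        data[col] = (data[col] - col_min) / (col_max - col_min)
    model = smf.mixedlm(f"{metric_col} ~ age", data, groups=data["dyad_id"])
    return model.fit()
```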
5.3 Results and Discussion
We first report results with models evaluated on D_video to assess the ability of our approach to integrate multimodal data, including multiple behavioral signals. These results are reported and discussed in Sections 5.3.1 and 5.3.2. Next, results from analyses including audio data, evaluated on the subset of interactions included in both D_video and D_audio, are reported in Section 5.3.3. We explore relationships between infant and mother behavior, trends across experimental stage, and trends across age; therefore, we consider statistical significance at α < .017 using the Bonferroni correction for multiple comparisons.
5.3.1 Model Validation
5.3.1.1 Observing the Still-Face Instructions
Consistent with the instructions of the FFSF procedure, results showed a decrease in mothers’
measured responsiveness to infants’ behavior during the still-face stage. As shown in Table 5.1,
t-tests demonstrated a significant decrease in R_I from the play to still-face stages, representing
a decreased influence of infants’ behavior on mothers’ behavior, or decreased responsiveness of
mothers to their infants. This result was strongest when measuring relative influence using both
head pose and arm pose, indicating that changes in responsiveness may be best observed by inte-
grating multiple behaviors. As mothers were instructed not to interact with their infants during the
still-face stage, they did not vocalize and therefore did not produce F0 data; consequently, audio
data were excluded from this analysis.
Table 5.1: t-statistic between R_I^play and R_I^still-face
Modalities                 t
Head Pose                  3.838**
Arm Pose                   3.069*
Head Pose & Arm Pose       4.374**
*p<.017 **p<.001
5.3.1.2 Leading-Following Relationship
Results indicated a significant negative correlation between R_I and R_M, shown in Table 5.2. This
finding is consistent with anticipated interaction dynamics; if a mother is responding to the infant’s
behavior for a large portion of the interaction, there remains less time for the infant to respond to the
mother’s behavior. For a given stage, this inverse correlation is strongest for models trained on both
head pose and arm pose. This result demonstrates the potential of DMDc to leverage multimodal
signals in tracking the leading-following relationship inherent to infant-mother interaction.
Table 5.2: Correlations between R_I and R_M
Modalities                 Stage      R
Head Pose Play -.402 **
Arm Pose Play -.348 **
Head Pose & Arm Pose Play -.430 **
Head Pose Reunion -.285 **
Arm Pose Reunion -.347 **
Head Pose & Arm Pose Reunion -.375 **
*p<.017 **p<.001
5.3.2 Trends in Interaction Dynamics
5.3.2.1 Infant Responses to the Still-Face Stage
Comparing metrics between play and still-face stages, results showed an inverse correlation be-
tween R_I^play − R_I^still-face and the magnitude of λ_{A,M}. These results are shown in Table 5.3. Describing
the autonomous evolution of infant motor behavior across frames, smaller eigenvalues represent a
more dampened dynamic system, or an infant’s faster return to baseline behavior after a change in
pose. This indicates that a more drastic change across stages in a mother’s responsiveness, or in an
infant’s ability to influence the mother’s behaviors, was followed by more sporadic infant behavior
during the still-face stage. This may have reflected how decreased responsiveness by mothers influenced emotion regulation behaviors demonstrated by infants. The inverse correlation was strongest for models trained on both head pose and arm pose data;
this result supports the potential of multimodal metrics of behavioral coordination for evaluating
changes in interaction dynamics.
Table 5.3: Correlation between (R_I^play − R_I^still-face) and λ_{A,M}^still-face
Modalities                 r
Head Pose                  -.183*
Arm Pose                   -.225*
Head Pose & Arm Pose       -.266**
*p<.017 **p<.001
5.3.2.2 Trends Across Infant Age
The linear mixed models analysis (LMM) demonstrated changes in infants’ transition dynamics,
and in infants’ influence on their mothers’ behavior, across age. The magnitudes of eigenvalues
λ_{A,M} increased with infant age for all DMDc models. The LMM coefficients and standard errors (SE) are reported in Table 5.4, and results are visualized in Figure 5.2. Increasing values of λ_{A,M}
indicate that older infants, with eigenvalues closer to 1, produced more stable and less sporadic
movements, perhaps due to improved motor abilities. Similar results for models evaluated on both
modalities compared to single modalities indicate that not only did individual movements become
more stable with age, but behaviors that involved multiple modalities (e.g., hand-eye coordination)
became more stable as well. However, parameters from multimodal DMDc models also had larger
standard errors and wider confidence intervals, indicating more variance between infants. Signif-
icant trends were not found in the angle of λ_{A,M} to the real axis, or in the oscillatory response of infant behavior. As observed in Figure 5.2, we also note that variance in λ_{A,M} between infants
decreased with infant age, indicating more similar behaviors among older infants. This pattern was
consistent during the play, still-face, and reunion stages. Conversely, significant trends were not
found in the mothers’ transition dynamics. This is unsurprising, as mothers are likely no longer
developing motor skills.
Table 5.4: Trends in λ_{A,M} across age
Stage Modalities Coefficient SE
Play Head Pose .043* .016
Play Arm Pose .156** .027
Play Head Pose & Arm Pose .133** .025
Still-Face Head Pose .069 .031
Still-Face Arm Pose .187** .038
Still-Face Head Pose & Arm Pose .142** .035
Reunion Head Pose .093** .026
Reunion Arm Pose .285** .044
Reunion Head Pose & Arm Pose .155** .025
*p<.017 **p<.001
Meanwhile, infants' influence on their mothers' behaviors, measured by R_I, decreased with
infant age. Given that only two modalities are considered in this analysis, it is possible that changes
in mothers’ responses to their infants reflected different uses of individual modalities; mothers
may have become less likely to respond to their infants using consistent motor movements, but
remained attentive overall. It is also possible that the decrease in R_I reflects changes in the timing
of responses. While this work evaluates frame-to-frame observations within 3-second windows,
future work will explore additional time scales. Results of the LMM analysis are shown in Table
5.5. We note that, across modalities for a given metric, effect sizes are smaller during the reunion
stage. This may reflect between-dyad differences in infant-mother interaction repair strategies or
infant emotion regulation behaviors demonstrated while recovering from the stressful effects of the
still-face stage; between-subject variance in these reunion-stage behaviors may inhibit the ability
to observe trends that emerge with age. Significant effects were not found for LMMs evaluated on
R_M, or the infants' responsiveness to their mothers. While mothers were instructed to interact with
their infants and typically remained engaged over time, infants often became distracted by toys or the environment; therefore, variance in infant behavior may have masked changes in moment-to-moment responsiveness. Evaluating multiple time scales may address this challenge.

Figure 5.2: λ_{A,M} across infant age, calculated using head pose data (left), arm pose data (center), and both (right). Results include λ_{A,M} values from each of the three stages (play, still-face, reunion) of the FFSF protocol.
Table 5.5: Trends in R_I across infant age
Stage Modalities Coefficient SE
Play Head Pose -.132** .032
Play Arm Pose -.164** .038
Play Head Pose & Arm Pose -.130** .033
Reunion Head Pose -.064* .019
Reunion Arm Pose -.057 .029
Reunion Head Pose & Arm Pose -.061* .022
*p<.017 **p<.001
5.3.3 Incorporating Audio Data
Consistent with the results of Section 5.3.1.2, negative correlations were found between R_I and R_M
for all combinations of modalities. These results are shown in Table 5.6. However, we note that,
for a given set of pose signals, effect sizes became smaller when F0 was included as an input. This
is likely due to the sparsity of F0 data compared to head pose and arm pose data; while pose can
be monitored continuously, F0 values can only be collected when vocalizations are made. During
the play and reunion stages, mothers were not speaking to their infants continuously; rather, there
were often periods of silence. As a result, an average of 18 three-second windows included F0 data and supported the fit of a DMDc model. Given the richness of mother-infant interaction, exchanges
that occur in just a few 3-second windows may not be sufficient to fully reflect a dyad’s interaction
dynamics; rather, our results suggest that longer or additional interaction recordings may be needed
to evaluate coordination using data generated from sparse behaviors.
Table 5.6: Correlations between R_I and R_M
Modalities                 Stage      R
Head Pose Play -.436 **
Arm Pose Play -.378 *
Head & Arm Pose Play -.434 **
Head Pose & F0 Play -.427 **
Arm Pose & F0 Play -.345 *
Head & Arm Pose & F0 Play -.432 **
Head Pose Reunion -.315*
Arm Pose Reunion -.304*
Head & Arm Pose Reunion -.342 *
Head Pose & F0 Reunion -.119
Arm Pose & F0 Reunion -.233
Head & Arm Pose & F0 Reunion -.241
*p<.017 **p<.001
Similar to models evaluated on D_video, results showed a negative trend in R_I values across infant age for all models, as shown in Table 5.7. Consistent with results reported in Section 5.3.2.2,
effect sizes were smaller during the reunion stage and when including F0 data. However, similar
trends for models evaluated with mothers’ F0 data demonstrate the ability of the DMDc approach
to leverage different sets of behaviors between interacting partners. This feature is necessary for
integrating behaviors that are used by one partner but not the other, such as lexical features during
interactions between verbal and non-verbal partners.
Table 5.7: Trends in R_I across infant age
Stage Modalities Coefficient SE
Play Head Pose -.272* .085
Play Arm Pose -.316* .099
Play Head & Arm Pose -.254* .083
Play Head Pose & F0 -.170* .062
Play Arm Pose & F0 -.227* .077
Play Head & Arm Pose & F0 -.182* .069
Reunion Head Pose -.140 .059
Reunion Arm Pose -.183* .073
Reunion Head & Arm Pose -.149* .06
Reunion Head Pose & F0 -.105 .052
Reunion Arm Pose & F0 -.063 .033
Reunion Head & Arm Pose & F0 -.105 .053
*p<.017 **p<.001
5.4 Discussion and Summary
This chapter presented DMDc as a method for evaluating temporal behavior adaptation, specifi-
cally during infant-mother interaction. Significant changes in model output from the play to still-
face stages of the FFSF protocol validated the ability of our approach to capture known trends
in a developmentally relevant interaction paradigm. Stronger effect sizes across FFSF stages for
models evaluated on heterogeneous rather than homogeneous data highlight the importance of
accounting for multiple types of behaviors when characterizing how partners adapt their behaviors
to each other. The ability of this approach to integrate separate sets of behavioral signals for each
participant makes DMDc appropriate for evaluating early communication, as infants are still de-
veloping many of the communication modalities available to adults. Finally, transition dynamics in
infant behavior, and the observed influence of infants’ behaviors on their mothers’ behaviors, both
showed significant changes across age, illustrating the ability of our approach to capture longitudinal changes in interaction dynamics across developmental stages. This research highlights the value of
heterogeneous behavioral signals in quantifying temporal behavior adaptation and demonstrates
the ability of DMDc to provide such metrics.
Chapter 6
ICHIRP: An Infant-Caregiver Home Interaction Recording
Platform
Contributors: Chapter 6 describes a usability study designed by Lauren Klein, Katrin Fischer,
and collaborators from the Levitt Lab at Children’s Hospital Los Angeles. Master’s students
Sahithi Ramaraju, Abiola Johnson, Sneha Bandi, and Sairam Bandi developed the software for
the mobile application prototype described in this chapter.
This chapter describes a pilot usability study to evaluate the feasibility of remote,
caregiver-led data collection of infant-caregiver interaction recordings. Specifically,
the pilot study described in this chapter included 5 infant-caregiver dyads and as-
sessed challenges to recording and participating in social interactions simultaneously.
To support the self-recording process, we developed the Infant-Caregiver in-Home In-
teraction Recording Platform (ICHIRP). This chapter reviews design considerations
for the functional application prototype and insights from the pilot study.
The cost and time constraints associated with standardized screening challenge the ability to
collect infant-caregiver interaction observations at the scale required to identify ranges of typical
behavior. Moreover, nationally, only 33% of children between 9 and 36 months are screened by a
pediatrician for any type of behavioral developmental issues (Hirai et al. 2018). Studies that evalu-
ate interactions typically occur in lab settings, requiring participating families to repeatedly travel
to a lab or clinic for observation. Numerous factors negatively impact data collection, including
when a family is unavailable to attend an appointment or the infant falls asleep or is too fussy, resulting in data from a particular developmental stage being missing or excluded.
Enabling families to record infant-caregiver interactions in the home greatly expands accessi-
bility and data availability. Past work (Shin et al. 2021) has explored remote data collection during
video chat with a researcher, or self-recorded behavioral data involving a single participant. Sapiro
et al. (2019) discuss applications of computer vision to measure eye gaze patterns, motor behavior,
and affective expressions in the home or school environment even without in-situ data collection
support from a researcher. They specify the need for future tools that expand these technologies to
interactive rather than individual behavior, motivating the ICHIRP application and usability study
described in this chapter. This platform and design insights serve as a secondary contribution of
this dissertation. This work was funded by the California Initiative to Advance Precision Medicine
via a grant for Scalable Measurement and Clinical Deployment of Mitochondrial Biomarkers of
Toxic Stress.
6.1 Preliminary Design Considerations
Remote collection of identifiable data raises privacy concerns and may impact adoption and con-
sistency of data quality, highlighting the need to engage stakeholders in the process of designing
solutions. These challenges motivate the need for an accessible and secure stakeholder-informed
platform that supports in-home data collection of consistent, high quality interaction data. To-
ward that goal, all preliminary design considerations for ICHIRP were informed by conversations
with child development and neuroscience stakeholders at CHLA, CHLA security stakeholders,
and caregivers providing volunteer feedback. To support analysis by domain experts, the appli-
cation needed to enable caregivers to record a 5-minute play interaction with both the mother’s
and infant’s faces, and preferably arm gestures, visible in the interactions. All application design,
implementation, and usability interaction procedures were discussed with the information secu-
rity team at CHLA throughout the design process prior to IRB review to ensure compliance with
CHLA’s standards for privacy and confidentiality.
To gain an initial understanding of usability concerns for parents, we asked two friends and
collaborators to try recording a 10-minute video of a free-play interaction with their infants and
upload the data to a secure Box location. Caregiver feedback identified uploading the video and
keeping both participants in-frame as main challenges to recording and sharing the interaction
videos from home.
6.2 Application Prototype Design and Implementation
Based on the design considerations described in Section 6.1, we developed a functional proto-
type of the ICHIRP application. The prototype was implemented for Android phones using React
Native. This section describes the components and functionality of the platform.
6.2.1 Recording Support
The main purpose of the usability study discussed in this chapter is to assess the feasibility of
caregiver-led interaction recordings. Recording and participating in a social interaction simulta-
neously can be a challenging task, especially while caring for an infant; this involves setting the
position of the recording phone, positioning oneself and the infant so both are visible in the camera,
and pressing a record button prior to returning to a seated position. To support these tasks, the ap-
plication first provides instructions and illustrations regarding the recording process. An example
of these instructions is illustrated in Figure 6.1. Upon starting to record, the application displays a
10-second countdown to allow the caregiver to return to their seat. If two faces can be seen within
the camera frame, the recording begins; otherwise, the caregiver is asked to adjust and restart the
recording. To account for errors in face detection, the caregiver is given the option to override the
suggestion and begin the recording anyway. Throughout the recording, the application plays a beeping sound and displays a reminder to move back into the frame if fewer than 2 faces are visible for 15 consecutive seconds. This threshold was selected to support quality video recordings without frequently interrupting the interaction. A timer tracks the length of the interaction. After 5 minutes, recording stops and the application beeps and displays a message to let the caregiver know the interaction has been completed.
Figure 6.1: Example recording instructions displayed by the ICHIRP application.
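The ICHIRP prototype itself was built in React Native; the Python/OpenCV sketch below is only an illustration of the kind of face-count check used to decide whether recording may start, using OpenCV's bundled Haar cascade as an assumed detector rather than the prototype's actual face detection module.

```python
import cv2

# Illustration only: counts detected faces in one camera frame.
FACE_CASCADE = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def enough_faces_visible(frame_bgr, required=2):
    """Return True if at least `required` faces are detected in a BGR frame."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = FACE_CASCADE.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return len(faces) >= required
```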
6.2.2 Registration, Uploading, and Scheduling
In order to support in-home data collection, the application must include additional functionality
to support data sharing with the research team and adherence during longitudinal data collections.
This subsection describes the components of the application prototype designed to support these
use cases for at-home application use.
6.2.2.1 Registration and Home Screen
The application features a standard registration screen asking for the caregiver’s name, email, and
password. This screen is illustrated in Figure 6.2. Contact information is collected so that par-
ticipants may receive reminders to record after scheduling the reminders in the application. Once
participants register or log into the app, they have an option to take a brief tour of the application.
This tour or ‘walkthrough’ was implemented to minimize confusion about the navigation of the
application.
Figure 6.2: Registration and home screens of the ICHIRP application. Left and middle: registration screens; right: application home screen and application tour.
6.2.2.2 Uploading
Caregivers can upload the interaction videos in two ways. After completing an interaction, caregivers
are directed to a screen that enables uploading of the interaction recording. Alternatively, a separate
uploading tab allows users to select a video from the mobile phone’s gallery.
6.2.2.3 Scheduling
To support adherence during longitudinal data collections, the application allows participants to
set a reminder to record an interaction at a later date.
6.3 Pilot Usability Study
This section describes the pilot study designed to evaluate the feasibility of caregiver-led interac-
tion recording and the usability of the ICHIRP application prototype. The usability study protocol
was approved by the CHLA IRB under CHLA-15-00267.
6.3.1 Study Setup
The usability study took place at the Levitt Lab using the study setup shown in Figure 6.3. Par-
ticipants sat to the left of the researcher, with the infant seated in between the caregiver and the
researcher. A height chair was provided, although caregivers could elect to keep their infant seated
in their stroller if they had brought one to the lab. Common household items including a plant,
mug, and books were placed on the table so that caregivers could set the phone upright for the
recording. A small tripod was also left on the table that parents could elect to use.
An Android phone with the application pre-installed was provided to caregivers during the
interaction. The phone screen was mirrored to the researcher’s laptop and was screen-recorded
so that the researcher could view the caregiver’s actions in the app during and after the session
without looking over the caregiver’s shoulder. The interaction was also recorded by two Noldus
cameras capturing the participants at two angles, as shown in Figure 6.3.

Figure 6.3: Study Setup
6.3.2 Procedure
After completing the consent form and demographics survey, participants were asked to pro-
vide feedback on the application’s registration screen, recording functionality, and uploading and
scheduling interfaces. Participants were asked to navigate through the application as if they were
at home and the researcher was not present; minimal instruction was given on this process to
understand the usability of the application itself. For the purposes of the in-person testing, the
registration, uploading, and reminders scheduling interfaces were interactive but not functional.
Caregivers were prompted to step through the registration, recording, uploading, and scheduling
functionalities. After each step, caregivers were asked to rate their confidence in completing the
step, perceived level of difficulty of the task, and level of frustration with the task on a scale from
1 (not at all confident, not at all difficult, not at all frustrating) to 7 (very confident, very difficult,
very frustrating). They were also invited to provide any additional feedback on these steps. At
the end of the prototype testing, participants were asked open-ended interview questions regarding
their overall impression of the activity.
6.3.3 Participants
Participants were 5 infant-mother dyads from the Los Angeles area, with infant ages ranging be-
tween 9 and 12 months. Inclusion criteria included the mother's fluency in English, as the usability research team could not support translation to other languages, and a pregnancy lasting 32 to 42 weeks, as this study was part of a larger research effort to study
adverse childhood experiences. All 5 participants were iPhone users, although 2 were familiar with
Android. Participants were compensated with a $25 gift card and an age-appropriate toy for their
child.
6.4 Results and Analysis
Feedback from caregivers was transcribed and grouped into categories. Table 6.1 lists each of
the usability issues described by one or more study participants. As expected, the majority of
usability challenges were associated with the recording task. In some cases, participants expressed
uncertainty as to how they should be positioned during the interaction, and some were surprised
to receive feedback when they were out of frame during the interaction. While this information
was presented in the instructions shown before recording, reading through them is likely made more
challenging while caring for an infant. These results suggest the need for clearer messaging either
within the application itself or during onboarding communication with the research team.
Additional recording usability issues addressed the challenges of recording a video and re-
sponding to feedback while participating in the infant-caregiver interaction. Caregivers reported
confusion over how to respond to or fix the error messages prompting them to switch positions
during the recording.
Across the other areas of the ICHIRP application prototype, users expressed that the function-
ality was straightforward to use but noted that scheduled reminders should appear and be editable
within the scheduling section of the application. All other usability issues listed in Table 6.1 were
identified by at most one participant. While these application areas were not the primary subject
of this usability study, the usability issues identified by caregivers provide useful insight into the
design of mobile applications for remote data collection.
6.5 Discussion and Summary
The usability study presented in this chapter highlights the tradeoffs between remote and in-person
data collection, and between data quality and user experience. While remote data collection en-
ables caregivers to flexibly adjust data collection sessions around their schedule and the needs of
their infants, the difficulties associated with recording and participating in interactions pose a chal-
lenge, especially when infants are fussy while the parents are trying to read instructions. Future
work in this area should explore the roles of instructions during study onboarding sessions versus
within the application, and how these roles change in the context of longitudinal data collections.
Table 6.1: Usability Issues by App Area
App Area Individual Usability Issues
Registration Not clear what the app is about upon first impression (opening)
Registration Not clear what ICHIRP is or stands for
Not clear why scheduling is necessary
Registration Not clear what email will be used for
Home Look and feel are inconsistent with the context- could be more comfort-
ing
Home Mismatch - buttons on home screen and icons in tab bar - should be
consistent
Walkthrough Next button is expected on the right-hand side
Recording Unclear where both partners should be at the beginning of the video
Recording Unclear how to fix positioning error before the recording starts
Recording Unclear how to fix positioning error as recording is on (message stays
on screen even as mom and baby are in view) - unclear what should be
in frame, e.g. face/body
Recording Even after choosing to ”record anyway”, error messages keep appearing
(which is unexpected and annoying)
Recording Error message sometimes disappears before it can be read (if the faces
are back in the frame)
Recording Unclear whether it’s still recording when an error message appears
Recording Cannot see timer during recording
Recording Record button = round, end record has rounded edges/not 100% square
= not visually distinct
Recording Dim screen makes it hard to see whether both are in frame/not clear if
dim screen means everything is ok
Screen did not respond to being tapped by lighting up (participant
expected screen to brighten when touched)
Recording Tapping the dimmed screen does interact with interface even though it
is not visible: Expectation = first tap would undim
Recording Gives error messages when both participants are sideways and facing
each other (expectation: both people are considered ”in the frame” even
when in profile view)
Recording Not clear what user and baby should be doing during recording
Upload No slider on the preview- does not enable skipping
Upload The square around camera icon is not responsive to touch (causing user
to think there is no other action possible here)
Upload Upload page could be mistaken for an empty gallery (preview expected)
Upload Sharing video from local gallery to app is not possible
Upload Hesitates to go to upload in the app, considers going to phone gallery
Schedule Can’t see the scheduled session when done (either in app or phone’s
calendar)
Schedule You cannot change the time after you set it the first time (nothing hap-
pens when you click ”Set Date and Time”)
Chapter 7
Predicting Visual Attention During Infant-Robot Interaction
Contributors: Chapter 7 is based on Klein et al. (2019). Additional authors of the published
work include Laurent Itti, Beth A. Smith, Marcelo Rosales, Stefanos Nikolaidis, and Maja J.
Matari´ c.
This chapter describes a model of Bayesian surprise as a predictor of visual attention
during a therapeutic infant-robot leg movement activity. Bayesian surprise measures
the saliency of a stimulus over both space and time. This chapter describes the benefits
and limitations of using visual salience as a predictor of infant visual attention.
Early intervention to address developmental disability in infants has the potential to promote
improved outcomes in neurodevelopmental structure and function (Holt and Mikati 2011). Re-
searchers are starting to explore Socially Assistive Robotics (SAR) as a tool for delivering early
interventions that are synergistic with and enhance human-administered therapy. For SAR to be
effective in this context, the robot must be able to present salient stimuli that consistently engage
the infant in the desired activity. While the robot can continuously monitor sensor input to attend
to or estimate the state of the interaction, attracting the infant’s attention is more nuanced. Ini-
tiating joint attention requires an understanding of the appropriate type and timing of stimuli to
present or actions to perform. According to an accepted mental model presented by Cohen (1973),
infants fixate longer on stimuli they do not understand, or that take longer to fit to their mental
model. Therefore, it is possible that surprising stimuli may be more difficult for infants to model
and could support increased visual attention.
Figure 7.1: Experimental setup of the SAR leg movement study; the infant is seated across from
the Nao robot and is wearing leg and arm motion trackers and an eye tracker. This dissertation uses
data from the study to explore Bayesian surprise as a potential predictor of infant visual attention.
This chapter explores whether a model of Bayesian surprise can be used to predict infant visual
attention during a SAR leg movement activity for infants. We present an analysis of eye gaze
tracking data from five 6-8 month old infants from the infant-robot interaction dataset described
in Chapter 3.2. Specifically, we use the model introduced by Itti and Baldi (2005) to model
the “surprise” introduced by the surrounding environment as seen from the infants’ head-mounted
camera and by the pattern of robot kicking behaviors. The model (described in Section 7.1) was
originally tested and validated with adults watching videos. In this analysis, our goals were: 1)
to determine the extent to which that model can be used to predict infant gaze behaviors; and 2)
to identify areas of improvement for generalizing that model to infants during SAR interactions,
in order to inform future work employing surprising or novel robotic stimuli to encourage infant
attention to the robot.
When evaluating the surprise model on the head-mounted video footage of the interaction
environment, over 67% of infant gaze locations were in areas that the model evaluated as having
higher than average surprise values. Meanwhile, the output of the surprise model evaluated on
the timing of robot’s kicking behaviors was predictive of the gaze behaviors of 2 out of 5 infants.
These results indicate the potential for using surprise to inform robot behaviors that attract infant
attention, but also illustrate that additional context is needed to design action selection policies
capable of initiating joint attention with infants.
7.1 Bayesian Surprise Model
The Bayesian surprise model, developed by Itti and Baldi (2005), provides a method for computing
the amount of low-level surprise generated by incoming data over both space and time. We present
here a summary of the previously developed model. The model computes probability P(M) rep-
resenting the extent to which an observer believes in a given hypothesis or model, M, in a model
spaceM . As new data observation D is introduced, the belief in model M changes to P(M|D).
Surprise is defined as the distance between posterior distribution and prior distribution of be-
liefs over models. This distance, and therefore the amount of surprise, is calculated using the
Kullback-Leibler (KL) divergence:
S(D, M) = KL(P(M|D), P(M)) = ∫_ℳ P(M|D) log [P(M|D) / P(M)] dM    (7.1)
Incoming data are modeled using Poisson distributions M(λ), as these model the firing patterns of
neurons in the brain with firing rate λ. In order to keep P(M) and P(M|D) in the same functional
form for a Poisson-distributed D, P(M) is calculated using the Gamma probability density:
P(M(λ)) = γ(λ; α, β) = β^α λ^(α−1) e^(−βλ) / Γ(α)    (7.2)
with shape α > 0, inverse scale β > 0, and Euler Gamma function Γ. To calculate the posterior Gamma density γ(λ; α′, β′), the shape and inverse scale are updated as:

α′ = ζα + λ̄    (7.3)

β′ = ζβ + 1    (7.4)

where ζ, the "forgetting factor", limits the extent of the belief in the prior by preserving its mean α/β but increasing its variance α/β². This defines the time scale of the model. Itti and Baldi (2005) used a ζ value of 0.7 to evaluate the surprise of video data during user studies. Source code to
evaluate the Bayesian surprise model on video data and on 1-dimensional signals can be found at
http://ilab.usc.edu/toolkit/.
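As an independent illustration (not the official toolkit linked above), the sketch below applies the Gamma-Poisson update of Eqs. 7.3-7.4 to a 1-D signal and computes surprise as the closed-form KL divergence between the posterior and prior Gamma densities; the default parameter values are illustrative.

```python
import numpy as np
from scipy.special import digamma, gammaln

def gamma_kl(a_post, b_post, a_prior, b_prior):
    """KL divergence from Gamma(a_post, b_post) to Gamma(a_prior, b_prior),
    using the shape/rate parameterization."""
    return ((a_post - a_prior) * digamma(a_post)
            - gammaln(a_post) + gammaln(a_prior)
            + a_prior * (np.log(b_post) - np.log(b_prior))
            + a_post * (b_prior - b_post) / b_post)

def surprise_1d(signal, alpha=1.0, beta=1.0, zeta=0.98):
    """Surprise of each sample in a 1-D signal via the update in Eqs. 7.3-7.4."""
    out = []
    for x in signal:
        a_new, b_new = zeta * alpha + x, zeta * beta + 1.0
        out.append(gamma_kl(a_new, b_new, alpha, beta))
        alpha, beta = a_new, b_new
    return np.array(out)
```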
The Bayesian surprise model was tested in a user study with adults aged 23-32, with normal
vision (Itti and Baldi 2006). Each participant watched 25 minutes of video footage. Eye movement
traces, or saccades, were recorded with a gaze tracker and analyzed. The values of pixel patches in
each video frame were passed into the surprise model to calculate a matrix of surprise values for
each frame.
Measuring the distance between histograms of the participants’ actual saccade endpoints and
histograms of randomly generated saccade endpoints produced a KL divergence of approximately
0.241 (Itti and Baldi 2006). The distribution of human saccade endpoints was shifted toward more
surprising values than the random distribution, indicating that adults gazed toward locations that
were more surprising than randomly selected locations. Further analysis showed that over 72%
of saccades were targeted toward areas of the video that were more surprising than average, sug-
gesting that adults are attracted to surprising locations of video footage (Itti and Baldi 2006). The
success of this model in predicting adult gaze location for video footage motivated our exploration
of surprise as a predictor of infant visual attention during a SAR interaction.
7.2 Visual Surprise in the Interaction Environment
7.2.1 Methodology
We analyzed video from the head-mounted camera in the SAR leg movement study to explore the
predictive power of Bayesian surprise on infant visual attention. While previous work analyzed
adult gaze toward prerecorded video, this work examined infant gaze toward real physical stimuli
in the infant’s environment. Video data from the infant’s head-mounted camera were used to deter-
mine the surprise values of different areas of the infant’s point of view over time. The same visual
features from work by Itti and Baldi (2006)–color, intensity, movement, temporal onset/offset, and
orientation–were used to determine these surprise values. Eye tracking software provided the co-
ordinates of the infant’s gaze within the video. Fig. 7.2 shows a video frame of the environment
from the infant’s head-mounted camera viewpoint and the infant’s gaze location, as well as the
corresponding surprise values of that frame.
Figure 7.2: Left: The surprise values of each 16x16 patch of pixels. Lighter pixels indicate higher
surprise values. Right: Study environment from the infant’s point of view with overlaid target to
show infant gaze location during a robot kicking behavior. The circles indicate 2, 4, and 8 degrees
from the estimated gaze location.
Each infant’s gaze locations were compared against randomly generated gaze locations to de-
termine the extent to which infants looked toward more surprising locations. Gaze tracking data
were available for 5 infants from the SAR leg movement study and formed the basis for our anal-
ysis. Infants were excluded if they shifted their gaze trackers during the study, if their eye was
not visible by the camera, or if a technical issue occurred that prevented calibrated gaze tracking.
A total of 162,113 frames and gaze locations were analyzed. For each video frame, we extracted
the surprise value of the infant’s gaze location and the surprise value of a randomly selected gaze
location. Histograms of the surprise values at infant gaze locations and at random locations were
compared using KL divergence.
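A minimal sketch of this histogram comparison is shown below; the bin count and smoothing constant are illustrative choices rather than values reported in the original analysis.

```python
import numpy as np

def histogram_kl(gaze_surprise, random_surprise, bins=50, eps=1e-9):
    """KL divergence between histograms of surprise values at infant gaze
    locations and at randomly selected locations."""
    gaze_surprise = np.asarray(gaze_surprise, dtype=float)
    random_surprise = np.asarray(random_surprise, dtype=float)
    lo = min(gaze_surprise.min(), random_surprise.min())
    hi = max(gaze_surprise.max(), random_surprise.max())
    p, _ = np.histogram(gaze_surprise, bins=bins, range=(lo, hi))
    q, _ = np.histogram(random_surprise, bins=bins, range=(lo, hi))
    p = p.astype(float) + eps
    q = q.astype(float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))
```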
Data were analyzed for the duration of the time that the infants wore the gaze tracker; as this
part of the analysis was not dependent solely on the robot behavior but rather on the infants’ general
environment, data generated during the minutes before and after the interaction were analyzed as
well as the interaction itself. Some infants wore the gaze tracker longer than others, and therefore
contributed more gaze locations to our analysis. We account for the difference in total number of
video frames for each infant when determining the average percent of gaze locations in areas with
higher than average surprise value.
7.2.2 Results
The results suggest that infants were more likely to look at surprising stimuli. Fig. 7.3 displays
histograms comparing the infants’ gaze distributions to randomly generated gaze distributions. For
each infant, the distribution of infant gaze locations is shifted toward more surprising areas than
a randomly generated distribution. While the distance between the distributions is small, the KL divergence is of the same order of magnitude as that found in work by Itti and Baldi (2006). We calculated 100 random distributions to compute KL divergences and used a one-tailed t-test to test the hypothesis KL divergence > 0 for each infant, p < 0.0001. Over 67% of infant gaze locations
were in areas of the video which were more surprising than average. This number is comparable
to the 72% found by Itti and Baldi (2006) in their study with adult participants observing video
scenes. Values for individual infants are reported in Table 7.1.
Figure 7.3: Histograms and corresponding KL divergence values of infant and random gaze distri-
butions. From each frame in an infant’s gaze tracking video, we extracted the surprise value at the
infant’s gaze location and at a random location in the frame. This process was repeated to find the
KL divergence for each infant 100 times (p < 0.0001 for each infant on a one-tailed t-test to test KL > 0).
The realistic nature of the data and the age of the population involved in this work introduce
challenges to using the Bayesian surprise model as a predictor of visual attention. The video data
from the SAR leg movement study are not a prerecorded set of videos, but rather footage filmed
from a head-mounted camera. As such, the camera moves significantly more than typical video
footage. This may cause surprise values to be higher than those of pre-recorded videos in certain
areas. In addition, the infants were sometimes fussy or distracted by people in the room. While
people may generate their own low-level surprise in the video data, infants may also be looking at
humans for social purposes.
7.3 Evaluating Robot Behaviors that Initiate Joint Attention
7.3.1 Methodology
After validating the surprise model with infants, we aimed to explore how the amount of Bayesian
surprise generated by the robot’s behaviors could predict infant visual attention. Since the robot
is constantly monitoring the infant’s behavior, attracting the infant’s attention to the robot can be
considered a successful initiation of joint attention. Specifically, we were interested in determining
whether the surprise model could be used to predict what percent of robot behaviors infants would
look at during a specific time interval. The robot’s behavior was represented as a 1-dimensional
signal with a frequency of 30Hz. Values 1, 2, and 3 indicate robot kicking, robot kicking and
lights, or robot kicking and sound, respectively. This numbering scheme was chosen to distinguish
behaviors so that a change in behavior type may induce surprise; we also evaluated the model
after reordering which behavior was 1, 2, or 3 to ensure the assignment did not have a significant effect. This signal was input into the surprise model. As the signal was one-dimensional with a high frequency compared to the number of signal value changes, we selected higher ζ values (0.98-0.99) for the forgetting factor. We also divided the behavior signal by 1000 and took the log of the surprise value to prevent the model from producing unreasonably high peaks during robot behavior onset. Fig. 7.4 displays the surprise signal and the robot behavior signal over a 1.5-minute window.

Table 7.1: Percent of Infant Gaze Locations in Regions with Higher than Average Surprise Value
Infant                        1      2      3      4      5      Weighted Average
Percent of Gaze Locations     71.24  70.60  71.97  65.08  60.42  67.97
Number of Frames              54574  25169  24753  44708  34079

Figure 7.4: The robot behavior signal and log surprise signal, ζ = 0.98, for a 1.5-minute time interval. A robot behavior signal value of 1 indicates that the robot is kicking its leg, while a value of 0 indicates the robot is still.
Video data from the head-mounted camera were annotated by two trained student annotators
and one researcher to determine when the infant was looking at the robot. The method of annotation
was conservative so as to minimize false positives classifying that an infant looked at the robot: an
infant was only classified as looking at the robot if part of the robot was within a circle representing
2 degrees from the infant’s predicted gaze location for three or more consecutive frames. 20% of
the data were annotated by all three annotators. Interrater reliability was measured for this 20%
using Fleiss' kappa, and a value of κ = 0.96 was achieved.
To evaluate whether surprise was predictive of the robot’s success in acquiring infant attention,
we compared the log of the average surprise value generated by the robot behavior signal each
minute with the percent of robot behaviors an infant looked at each minute. We labeled an infant
as looking at the robot if the infant looked at any part of the robot during the kick or within one
second after the kick. We chose the 1 minute interval as it was small enough to make predictions
over several time intervals, yet large enough that a looking behavior value would not be drastically
influenced by the infant checking in with a parent. We used linear regression to generate a linear
model of percent of robot behaviors looked at per minute versus log average surprise per minute.
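A sketch of this per-minute regression is given below; the function and variable names are illustrative, and the regression routine is a stand-in rather than the exact software used in the original analysis.

```python
import numpy as np
from scipy import stats

def regress_rbl_on_las(las_per_min, rbl_per_min):
    """Fit RBL = intercept + slope * LAS over per-minute values.

    las_per_min: log of the average surprise value in each minute.
    rbl_per_min: percent of robot behaviors the infant looked at in each minute.
    Returns the slope, intercept, variance explained (R^2), and p-value."""
    result = stats.linregress(np.asarray(las_per_min), np.asarray(rbl_per_min))
    return result.slope, result.intercept, result.rvalue ** 2, result.pvalue
```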
7.3.2 Results
We used an ANOVA to test how well the regression equations fit the data, and whether the model
was predictive for each infant. The log of the average surprise value per minute was significantly
predictive of percent of robot behaviors looked at each minute for 2 of the 5 infants, p < 0.05. The linear regression determined that the log of the average surprise per minute (LAS) was significantly predictive of infant 1's percent of robot behaviors looked at per minute (RBL), F(1,6) = 20.655, p = 0.0039. LAS accounted for 77.5% of the variance in RBL with the regression equation RBL = 83.49 + 4.58 LAS. Fig. 7.5 and Fig. 7.6 show infant 1's looking behavior compared with surprise.
For infant 2, LAS was predictive of RBL with F(1,6) = 8.46 and p = 0.027. LAS accounted for 58.5% of variance in the dependent variable. The regression equation was RBL = 41.89 + 3.42 LAS. Infant 2's looking behavior compared with surprise is displayed in Figures 7.5 and 7.6.
While the regression equations of infants 3, 4, and 5 suggested a positive correlation between
LAS and RBL, the results for these infants were not statistically significant. Infants 4 and 5 looked
less at surprising areas in general (Table 7.1). Infant 5 looked at other areas of the robot for long
time intervals, instead of focusing specifically on the robot’s leg during robot kicking actions.
Infant 5’s gaze behavior with respect to the Bayesian surprise value of the robot’s behavior is
shown in Figure 7.5. Infant 3 appeared fussy significantly more than the other infants in the
study, affecting their gaze behavior. Additionally, during infant 3’s interaction, a long time interval
between successive robot kicks caused a large increase in the surprise value of the Bayesian model. Since the percent of behaviors looked at by the infant is limited while the Bayesian surprise model is unbounded, the infant's looking behavior could not produce a similar spike (Fig. 7.5). Imposing an upper bound on the Bayesian surprise value in future work may help to mitigate this effect. It is also possible that contextual information or external distractions may play more of a role in visual attention for some infants than for others.

Figure 7.5: Infant 1, 2, 3, and 5 looking behavior with robot behavior and surprise signal. The dotted line represents the robot behavior. Values 1, 2, and 3 indicate robot kicking, robot kicking and lights, or robot kicking and laughing, respectively. For Percent of Behaviors Looked At, a value of 1 on the y-axis corresponds to 100%.
Figure 7.6: Left: the regression line and data from infant 1. The regression line is defined by equation RBL = 83.49 + 4.58 LAS, showing a trend that infant 1 looked at a higher percentage of the robot behaviors during minutes with higher log average surprise value; right: the regression line and data from infant 2. The regression line is defined by equation RBL = 41.89 + 3.42 LAS, showing a trend that infant 2 looked at a higher percentage of the robot behaviors during minutes with higher log average surprise value.
7.4 Discussion and Summary
Analysis of the video data and infant gaze locations in Section 7.2 demonstrates that all 5 infants
tended to look at surprising areas of their environment. The Bayesian surprise model performed
similarly with infant data as with adult data analyzed by Itti and Baldi (2006), despite evaluating the
model on moving video data from the infants’ head-mounted camera instead of on pre-recorded
videos used with adults. This suggests that surprise may be useful in designing and evaluating
salient stimuli which attract infant attention.
Several factors may account for the lack of predictive power of LAS on 3 of the infants’ looking
behavior. First, the optimal forgetting factor ζ may be different for each infant. The rate at which
surprise fades may vary more in infants than in adults, and the 8-minute contingency phase from
the SAR leg movement study may not be enough time to learn an optimal time constant for each
infant. However, more notably, modeling surprise generated by the robot’s behavior alone does not
account for the infant’s role in establishing joint attention. Accounting for the infant’s affective and
behavioral states is necessary to understanding when and how best to make bids for the infant’s
attention. The next chapter presents methods for estimating infants' affective states during this SAR contingent learning activity.
Chapter 8
Real-Time, Continuous Affect Prediction in Socially Assistive
Robotics
Contributors: Chapter 8 is based on Chang and Klein et al. (2022) written with co-first author
Allen Chang. Allen Chang led the code development and writing related to the architecture and
training of affect recognition models described in this chapter. Lauren Klein led the literature
review and analysis and discussion of results. Additional authors of the published work include
Marcelo R. Rosales, Weiyang Deng, Beth A. Smith, and Maja J. Matari´ c.
This chapter investigates temporal factors that influence affect recognition perfor-
mance during a SAR interaction paradigm. We evaluate how models perform
during affective state transitions and visual occlusions. Our results demonstrate the
need to account for changes in affect prediction performance over time when using
perception modules to inform temporal behavior adaptation policies, or action selec-
tion policies, for socially assistive robots.
Agents must monitor their partners’ affective state changes continuously in order to adapt their
behavior appropriately during social interactions. Existing methods for evaluating affect recogni-
tion, however, do not account for changes in classification performance that may occur when hu-
man behavior results in occlusion, or during transitions between affective states. Instances where
partners enter an affective or behavioral state for only a short period of time may be difficult to clas-
sify. This chapter addresses temporal changes in affect classification performance in the context
of an infant-robot interaction, where infants’ affective states determined their ability to participate
in a therapeutic leg movement activity. To support robustness to facial occlusions in video record-
ings, we trained infant affect recognition classifiers using both facial and body expressions. Next,
we conducted an in-depth analysis of our best-performing models to evaluate how performance
changed over time as the models encountered missing data and transitioning infant affect. Our
results demonstrate that while a unimodal model trained on facial expressions achieved the highest
performance on time windows without missing data, it was outperformed by a multimodal model
when evaluated across entire interactions. Additionally, model performance was weakest when
predicting an affective state transition, and increased after multiple predictions of the same affec-
tive state. These findings emphasize the benefits of incorporating body expressions in continuous
automated infant affect recognition. Our work highlights the importance of evaluating variability
in model performance both over time and in the presence of missing data when applying affect
recognition to social interactions.
Across applications of automated affect recognition for infants, models may be subject to dif-
ferent requirements with respect to time. As described in Chapter 2, researchers in infant develop-
ment study transitions in observed infant and mother affective states to evaluate relationships between the timing of shared affective states and developmental outcomes (Leclère et al. 2014a).
In this case, a model must identify the time of each change in affect in order to support meaningful
analysis of interaction dynamics. Research exploring Socially Assistive Robotics (SAR) for
infants (Scassellati et al. 2018) has identified affect as a key indicator of an infant’s ability to en-
gage with the robot. In order to tailor its actions appropriately to an infant’s current affect, a SAR
system must maintain a continuous prediction of an infant’s affect, even when data are missing.
Evaluating automated affect recognition approaches in the context of missing data and affective
state transitions is necessary for assessing their ability to support downstream applications.
Past work (Saraswathy et al. 2012; Cohen and Lavner 2012) has used cry detection and analysis
to evaluate infant affect from segments of audio data. Additionally, video-based methods (Lysenko
et al. 2020; Messinger et al. 2009) have extracted facial landmarks to classify infant affect from
individual frames. As facial features were sometimes occluded, classification was performed on
frames without missing data. While these approaches support the potential of affect recognition
for infants, performance metrics were not reported during times when data were missing or during
affect transitions. As occlusions and changes in affect are often unavoidable, further work is needed
to evaluate how automated methods will perform when making predictions continuously in the
context of infant social interactions.
Using a labeled video dataset collected during the infant-robot interaction studies described in
Chapter 3.2, this chapter addresses challenges to continuous infant affect recognition in 2 ways.
First, we explored body expression as an input modality for multimodal affect recognition to eval-
uate how using multiple sources of data support continuous affect prediction in the presence of
missing data from either modality. Next, we explored trends in model performance over time. To
analyze how the infants’ past behaviors informed their current affective state, we repeated model
training for a range of input window lengths. Using the highest performing unimodal and multi-
modal models, we conducted an in-depth analysis of performance with respect to time relative to
actual and predicted transitions in infant affect.
We evaluated our experiments with the area under the receiver operating characteristic
curve (AUC). Our results supported the use of body expressions in infant affect recognition. Uni-
modal models trained on body expression data achieved performance above random chance (0.70
AUC evaluated on data with high feature extraction confidence, 0.67 AUC evaluated on all data).
While the best-performing face models (0.86, 0.67) and multimodal models (0.86, 0.73) had equal
AUC when evaluated on data with high feature extraction confidence, a multimodal model with
joint feature fusion outperformed the face model when features with low extraction confidence
were not excluded. For both unimodal and multimodal models, we observed changes in both the
mean and variance of model accuracy with respect to the amount of time since an actual or pre-
dicted state transition. Classification performance was weaker when first predicting that an infant
had changed affective states and increased after making consecutive predictions of the same state
for several seconds; in the context of a SAR interaction, this result highlights a trade-off between
waiting to achieve a high prediction certainty and missing an opportunity to intervene when infants
become fussy. These results underscore the benefits of body expressions in automated infant affect
recognition, as well as the importance of considering temporal trends in model performance.
8.1 Technical Approach
In order to understand how our modeling approach may support continuous affect recognition,
it is necessary to understand both overall performance and changes in performance across time.
In the context of the SAR interactions described in Section 3.2, these changes may influence a
robot’s level of certainty, and therefore its action selection policy. This section describes our feature
extraction approach, affect classification models, and evaluation methods.
8.1.1 Feature Extraction
We used the open source toolkit OpenFace (Baltrusaitis et al. 2018) to extract facial landmarks
and action units (AUs) from the face-view videos and OpenPose (Cao et al. 2017) to extract body
skeleton landmarks from the body-view videos. Since OpenFace and OpenPose were trained with
adult data, visual inspection was conducted to verify that the accuracy of the model output was sufficient. In cases where OpenFace or OpenPose detected no landmarks, landmark and AU output values were set to 0. As infant leg movements directly controlled the SAR’s activations,
landmarks from the legs were discarded to mitigate differences across interaction difficulty. The
method described by Klein et al. (2020) was used to identify which body skeleton landmarks be-
longed to the infant when researchers or parents were present in the videos. Face-view videos
had variable frame rates while body-view videos had frame rates of 29.97 frames per second. To
mitigate these differences in frame rates, mean landmark and AU values were aggregated across
0.25 second intervals. While OpenFace detects AUs using a sequence of frames, a similar dynamic
representation was not available for body gestures, so we calculated the speed of each body skele-
ton landmark to account for body movement. The speed of each landmark was calculated as the
Euclidean distance it traveled since the previous 0.25 second frame.
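As an illustration of this aggregation step, the sketch below assumes the extracted landmarks have been loaded into a pandas DataFrame with a 'time' column (in seconds) and one column per landmark coordinate (e.g., 'x_0', 'y_0'); these column names are illustrative and do not reflect the toolkits' actual output schema. It bins coordinates into 0.25 second intervals and computes each landmark's speed as the distance traveled since the previous bin.

```python
import numpy as np
import pandas as pd

def bin_landmarks(frames: pd.DataFrame, interval: float = 0.25) -> pd.DataFrame:
    """Average per-frame landmark coordinates within fixed-length time bins.

    `frames` is assumed to hold a 'time' column (seconds) plus one column per
    landmark coordinate, e.g. 'x_0', 'y_0', ...; the schema is illustrative.
    """
    bin_index = (frames["time"] // interval).astype(int)
    return frames.drop(columns="time").groupby(bin_index).mean()

def landmark_speeds(binned: pd.DataFrame) -> pd.DataFrame:
    """Distance each landmark traveled since the previous 0.25 s bin."""
    x_cols = sorted(c for c in binned.columns if c.startswith("x_"))
    y_cols = sorted(c for c in binned.columns if c.startswith("y_"))
    dx = binned[x_cols].to_numpy() - binned[x_cols].shift(1).to_numpy()
    dy = binned[y_cols].to_numpy() - binned[y_cols].shift(1).to_numpy()
    speeds = np.sqrt(dx ** 2 + dy ** 2)  # first row is NaN (no previous bin)
    return pd.DataFrame(speeds, index=binned.index,
                        columns=[c.replace("x_", "speed_") for c in x_cols])
```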
As occlusions were common, we used a threshold of 20% feature extraction confidence to
identify data samples unlikely to represent actual infant behavior. The subset of samples with
confidence values above the threshold for both modalities were used for model training. 58.9%
of OpenFace data and 78.6% of OpenPose data exceeded the confidence threshold. Visual inspection
indicated that facial feature extraction with low confidence often coincided with moments when the
infants leaned away from the camera or turned their heads sideways to look at their parents. This
discrepancy in feature availability supports the integration of multiple data sources in recognizing
infant affect.
As the SAR feedback often included infant cry sounds, and parents or researchers occasion-
ally spoke during the interaction, it was unclear whether extracted audio features would describe
infant behavior. Therefore, audio data from these interactions were not included in the scope of
this dissertation. Future work will explore methods for incorporating audio data in infant affect
recognition while accounting for multiple sources of noise.
8.1.2 Feature Preprocessing
Landmark positions were centered and scaled to address differences in infants’ sizes or positions
relative to the cameras. First, we identified a landmark that was roughly centered in the video and
infrequently occluded, and we subtracted its x and y position from each landmark. The tip of the
nose was selected as the centered landmark for the face and the neck was selected for the body
skeleton. Coordinates were then scaled according to each infant’s size in the video. For facial
features, coordinates were divided by the vertical length of each infant’s face, approximated as the
distance between landmarks at the top and bottom of their head. For body skeleton features, dis-
tances between landmarks were divided by the distance between the neck and the pelvis landmarks,
an approximation for the length of the infant’s torso.

Figure 8.1: Overview of the modeling framework. FC: Fully Connected (Dense) Layer.

As in past work by Messinger et al. (2009) and Lysenko et al. (2020), we included the Euclidean distances between each pair of landmarks as features.
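A minimal sketch of this normalization and distance computation is shown below, assuming the landmarks for a single time step are stored as an (n_landmarks, 2) NumPy array; the reference indices are placeholders for the centering landmark (nose tip or neck) and the scaling segment (head length or torso length) described above.

```python
import numpy as np

def normalize_landmarks(xy: np.ndarray, center_idx: int,
                        ref_a: int, ref_b: int) -> np.ndarray:
    """Center landmarks on a reference point and scale by a body-size proxy.

    `xy` has shape (n_landmarks, 2). `center_idx` is the nose tip (face) or
    neck (body); `ref_a`/`ref_b` define the scaling segment (top/bottom of the
    head for the face, neck/pelvis for the body). Indices are illustrative.
    """
    centered = xy - xy[center_idx]
    scale = np.linalg.norm(xy[ref_a] - xy[ref_b])
    return centered / scale

def pairwise_distances(xy: np.ndarray) -> np.ndarray:
    """Euclidean distance between every pair of landmarks, flattened."""
    diffs = xy[:, None, :] - xy[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    upper = np.triu_indices(len(xy), k=1)  # keep each pair once
    return dists[upper]
```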
8.1.3 Train-Test Split
Given the size of the dataset used in this work, it was not feasible to use a single hold-out group
for testing. Instead, we partitioned the data into 5 groups, and repeated training and testing 5
times. For each trial, 4 data groups were used for training and a fifth was used for testing. To
account for similarities within participants, we grouped the data so that no infant appeared in more
than 1 group; therefore, for each of the 5 trials, models were tested on data for infants who were
not represented in training in order to simulate performance on an unseen infant. Groups were
balanced so that each had a similar distribution of labels. For each model evaluated, we report the
mean AUC across the 5 trials.
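The grouped split can be sketched with scikit-learn's GroupKFold, which enforces the no-shared-infant constraint; the label balancing described above would require an additional step (e.g., StratifiedGroupKFold in recent scikit-learn releases). The build_model callable is a placeholder for the classifiers described in the following sections.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupKFold

def grouped_cv_auc(X, y, infant_ids, build_model):
    """Mean AUC over 5 folds in which no infant appears in both
    the training and testing partitions."""
    aucs = []
    for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=infant_ids):
        model = build_model()  # placeholder for the classifiers below
        model.fit(X[train_idx], y[train_idx])
        scores = model.predict_proba(X[test_idx])[:, 1]
        aucs.append(roc_auc_score(y[test_idx], scores))
    return float(np.mean(aucs))
```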
8.1.4 Temporal Feature Aggregation and Windowing
Based on past work in skeleton-based affect and action recognition (Filntisis et al. 2019; Yang
et al. 2019), we applied time windows to the face and body expression features. Temporal feature
aggregation over a window of previous face and body expressions allows for sequential analysis
of features without the use of recurrent neural networks, which are prone to overfitting on small
datasets. As in Yang et al. (2019), 2 different time windows were applied to capture both short
term and long term trends in behavior. The 2 windows were aligned based on end time, and the
affective state annotation present at the end of both time windows was selected as the label to
facilitate the real-time application of our approach. To reduce the number of parameters required
for model training, we aggregated features over time as in Filntisis et al. (2019) rather than applying
convolutional layers. Specifically, we calculated the maximum, mean, and standard deviation of
features over each window.
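The two-window aggregation can be sketched as follows, assuming the features are stored as a (time steps × features) array sampled every 0.25 seconds; window lengths are given in samples and the function names are illustrative.

```python
import numpy as np

def window_stats(features: np.ndarray, end: int, length: int) -> np.ndarray:
    """Max, mean, and standard deviation of each feature over a window
    ending at index `end` (exclusive) and spanning `length` samples."""
    window = features[max(0, end - length):end]
    return np.concatenate([window.max(axis=0),
                           window.mean(axis=0),
                           window.std(axis=0)])

def short_long_features(features: np.ndarray, end: int,
                        short_len: int, long_len: int) -> np.ndarray:
    """Aggregate features over a short and a long window aligned at `end`,
    mirroring the two-window scheme described above."""
    return np.concatenate([window_stats(features, end, short_len),
                           window_stats(features, end, long_len)])
```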
To capture the most recent infant behavior, we selected 0.5 seconds as the length of the short
term time window. As speed was calculated based on differences in landmark positions between
frames, this time window was chosen to enable speed calculation between the 2 most recent 0.25
second frames. The optimal amount of past infant behavior needed to inform the upcoming affect
prediction was unclear, so we tested values starting at 1 second and increasing by a factor of 2
until a peak in performance was found. We report results across this range to analyze the impact
of windowing on model performance. Window lengths were varied independently for facial and
body features to highlight differences across modalities. To compare models on the same dataset,
only samples that exceeded the 20% feature extraction confidence threshold for 90% of the longest
tested window size were used for model training. These samples are labeled as D_confident, while the entire dataset of samples is labeled as D_total.
8.1.5 Feature Selection
We categorized the temporally aggregated features into 4 distinct sets: facial landmark distances,
facial AUs, body landmark distances, and body landmark speeds. For each of the 5 training groups
and each selection of window sizes, we applied Welch’s unequal variances t-test to determine the
aggregated features with the largest difference in means across labels. For each training group, we
fit our models using the 12 features of each set with the lowest p-values from the t-test to prevent
overfitting while balancing feature representation across modalities.
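A sketch of this selection step using SciPy's Welch's t-test is shown below; it assumes binary labels with 0 for alert and 1 for fussy, which is an illustrative encoding rather than the dataset's annotation scheme.

```python
import numpy as np
from scipy.stats import ttest_ind

def top_k_by_welch(features: np.ndarray, labels: np.ndarray, k: int = 12):
    """Indices of the k features with the lowest Welch's t-test p-values,
    comparing the two affect classes on the training data only."""
    alert = features[labels == 0]
    fussy = features[labels == 1]
    # equal_var=False selects Welch's unequal-variances t-test
    _, p_values = ttest_ind(alert, fussy, axis=0, equal_var=False)
    return np.argsort(p_values)[:k]
```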
8.1.6 Affect Classification
8.1.6.1 Unimodal Classifiers
Unimodal models were trained only with facial expressions or body expressions, and are referred to
in this chapter as the face and body models. For each model, we trained neural networks (NNs) for
each window length. The NN architecture was based on work by Yang et al. (2019) and included
multiple input branches for each set of features. Feature sets were transformed through individual
fully connected layers with the ReLU activation function and then concatenated for final affect
classification. An overview of our modeling framework is illustrated in Fig. 8.1.
Each model was initialized randomly without seeding and trained with binary cross-entropy
loss and the Adam optimizer. Class weights of (1, 9) for the alert and fussy classes, respectively, were
assigned based on the approximate balance of labels in each training set. Models were trained
with 5 epochs to prevent overfitting. To visualize the information embedded by the model, we
conducted principal component analysis with 2 dimensions on the output of the final embedding
layer. The face and body principal components from 1 of the 5 test groups are illustrated in Fig.
8.2.
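The branched architecture can be sketched in Keras as follows. The hidden layer size and the single dense layer per branch are illustrative choices rather than the published hyperparameters; the loss, optimizer, class weights, and epoch count follow the description above.

```python
from tensorflow.keras import layers, Model

def build_branched_classifier(feature_set_dims, hidden_units=16):
    """One dense branch per feature set, concatenated into a single
    sigmoid output for alert-versus-fussy classification."""
    inputs, branches = [], []
    for dim in feature_set_dims:
        inp = layers.Input(shape=(dim,))
        inputs.append(inp)
        branches.append(layers.Dense(hidden_units, activation="relu")(inp))
    merged = branches[0] if len(branches) == 1 else layers.Concatenate()(branches)
    output = layers.Dense(1, activation="sigmoid")(merged)
    model = Model(inputs=inputs, outputs=output)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

# Training with the class weighting and epoch budget described above:
# model = build_branched_classifier([12, 12])  # e.g., facial distances + AUs
# model.fit(train_inputs, train_labels, epochs=5, class_weight={0: 1, 1: 9})
```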
8.1.6.2 Multimodal Classifiers
In addition to training unimodal models, we evaluated approaches for integrating both modalities
into a single model. We leveraged a joint fusion model to account for interactions between body
and facial expressions. The input channels were the same as those from our unimodal models and
were fused at the concatenation layer before producing a final affect classification. To evaluate
fusion at the decision level, we implemented a late fusion model using a soft voting approach that
weighted the face and body models equally. While past work in multimodal affect recognition
(Lingenfelser et al. 2014) has leveraged event detection to recognize modality-specific behaviors
prior to fusion, this approach requires labeled behaviors in each modality, which were not present
in the infant-robot interaction dataset.
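The late fusion step reduces to equal-weight averaging of the unimodal models' predicted probabilities, sketched below; the 0.5 decision threshold is an illustrative default.

```python
import numpy as np

def late_fusion_soft_vote(face_probs: np.ndarray, body_probs: np.ndarray,
                          threshold: float = 0.5) -> np.ndarray:
    """Equal-weight soft voting over the unimodal models' predicted
    probabilities of the fussy class."""
    fused = 0.5 * face_probs + 0.5 * body_probs
    return (fused >= threshold).astype(int)
```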
8.1.6.3 Trends in Model Performance Over Time
After evaluating models of each window length, we conducted an in-depth analysis of the highest
performing models for both modalities and fusion types to assess trends in model performance
over time. To evaluate the models in a continuous setting, models were tested on all groups of test
data (D_total), including windows that did not meet the feature extraction confidence threshold. This
analysis was conducted separately for the body, face, joint fusion, and late fusion models.
Figure 8.2: Scatter plot and marginal density distributions of the first 2 principal components of
face and body NN embeddings. The distributions visualized were produced from embeddings that
represent the first test group of infants.
8.1.6.4 Classifier Accuracy Versus Time Since Infant Affect Transition
As infants settle into an affective state, their behavior may evolve from their initial expressions
during a transitional period. If infants behave differently after first becoming fussy versus after
remaining fussy for several seconds, models evaluated on these evolving behaviors may experi-
ence changes in performance over time. To gain insight into trends in behavior as infants remained
in a given state, we investigated model performance with respect to the amount of time passed
since infants last transitioned between affective states. We abbreviate the time since an infant last
transitioned between affect states as the time since affect transition (TSAT). To identify trends in
performance as infants transitioned to and remained in a given affective state over time, we eval-
uated classification accuracy on groups of samples from the same infant-robot interaction session,
affective label, and with TSAT values within 0.25 seconds of each other.
8.1.6.5 Classifier Accuracy Versus Time Since Classifier Prediction Transition
A SAR system that responds to infant affect must evaluate the certainty of its affect classifications
to inform its action selection policies. If model performance changes over time, this may influence
the decision to intervene. Therefore, we evaluated classifier performance with respect to the time
since the classifier changed its prediction (TSPT). As with methods described in Section 8.1.6.4, we
determined the classification accuracy on groups of samples from the same infant-robot interaction
session, affective label, and with TSPT values within 0.25 seconds of each other.
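Both analyses reduce to computing the elapsed time since the last change in a label sequence and grouping accuracy by session, label, and elapsed-time bin. The sketch below assumes per-session pandas Series and DataFrames sampled every 0.25 seconds with illustrative column names; the same helper applies to annotated affect (TSAT) and model predictions (TSPT).

```python
import pandas as pd

def time_since_transition(labels: pd.Series, interval: float = 0.25) -> pd.Series:
    """Seconds elapsed since the label sequence last changed value.

    `labels` is the per-0.25 s sequence of affect annotations (for TSAT) or
    classifier predictions (for TSPT) within one session.
    """
    run_id = labels.ne(labels.shift()).cumsum()       # id of each constant run
    position_in_run = labels.groupby(run_id).cumcount()  # 0, 1, 2, ... within run
    return position_in_run * interval

def accuracy_by_elapsed_time(df: pd.DataFrame) -> pd.DataFrame:
    """Mean accuracy per (session, label, elapsed-time bin); `df` is assumed
    to hold 'session', 'label', 'correct', and 'elapsed' columns."""
    bins = (df["elapsed"] / 0.25).round().astype(int)
    return (df.assign(bin=bins)
              .groupby(["session", "label", "bin"])["correct"]
              .mean()
              .reset_index())
```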
8.2 Results and Analysis
8.2.1 Model Accuracy Versus Input Window Length
Trends in performance across long window lengths indicated a longer optimal window for facial
expressions compared to body expressions. The difference suggests that infants in the infant-
robot interaction dataset displayed their affect along different time scales for each modality. As
illustrated in Fig. 8.3, mean AUC for the face and multimodal models increased until the long input
window length for facial expressions reached 32 seconds. Meanwhile, the body and joint fusion
models had higher mean AUC scores for shorter rather than longer body expression windows,
though this trend was less consistent in the late fusion models.
Given the use of both short and long input windows, it is possible that the longer optimal win-
dow for facial expressions provides important context to compare against short term observations.
In contrast, a longer window did not provide consistent benefits for the body expression modality,
which may indicate that infants’ body positions or movements from several seconds ago were less
relevant to their current affective state. This trend may serve as an advantage for models trained
on body expressions. Given their reliance on less data, these models may be less affected by
occlusions that occurred several seconds ago.
Figure 8.3: Mean AUC scores across the 5 test trials for each set of long window lengths for facial and body features.

We note that AUC scores were higher for the face models compared to body models, and were similar between face models and the highest performing multimodal models when evaluated on non-occluded data. These scores indicate that when both sources of data were readily available
for the past 64 seconds, facial features were more informative than body features or a combined
feature set, which is consistent with the illustration in Fig. 8.2, showing less overlap in the first
two principal components of the facial embeddings compared to the body embeddings. The dif-
ference in ability to discriminate between alert and fussy affect may be related to the differences
in available features. While facial AUs provide context to an infant’s emotional state, we did not
incorporate indicators of specific poses or gestures that may be more common to one affect than another. However, body models outperformed random chance with a mean AUC score of 0.70 across the tested window lengths. This result motivates the inclusion of body expressions in
contexts prone to occlusions where continuous affect monitoring is needed.
8.2.2 Trends in Model Performance Over Time
As described in Section 8.1.6.3, we analyzed how the accuracy of our highest scoring models
evolved over time in order to understand how our automated affect recognition approach may
perform in a continuous setting. Specifically, we further evaluated the models leveraging long
window lengths of 32 seconds for facial features and 2 seconds for body features, which produced
the highest mean AUC across groups of data.
8.2.3 Accuracy Versus Time Since Affect Transition
During times when infants were labeled as alert, we observed an increase in accuracy for face
and multimodal models with respect to TSAT, as illustrated in Fig. 8.4. This suggests that as
the infants in the infant-robot interaction dataset transitioned from fussy to alert, their changes in
behavior occurred gradually over several seconds rather than immediately. Conversely, an increase
in accuracy was not observed for the body models, which may reflect differences between models
that leverage different window lengths. Additionally, we note that during times when infants were
labeled as fussy, the 95% confidence interval expanded over time for models across all modalities,
while the 95% confidence interval remained the same across TSAT when infants were labeled as alert. The trends in the confidence intervals support that infants behaved similarly after first becoming fussy, whereas once they remained fussy for long periods of time, the way in which they expressed their frustration became more varied.

Figure 8.4: Model accuracy along seconds since infants transitioned into a given affect. The mean accuracy across predictions is illustrated by the solid line, and the 95% confidence interval for model accuracy across infant-robot interaction sessions is shaded around the mean accuracy. Only data samples with a sample size greater than 3 are visualized.
For both the face and late fusion models, accuracy was higher during times when infants were
labeled as alert. This may have reflected class imbalance, as the majority of data in the dataset
were labeled as alert. In contrast, the body models were more accurate when infants were labeled
as fussy as compared to alert, and the joint fusion model performed similarly across affective
labels. These differences may have reflected a trade-off between over- and under-predicting fussy
affect.
In contexts such as SAR, evaluating the trade-off between overall model accuracy and recall of
negative affective states is essential when forming an action selection policy. For example, a robot
that overpredicts a label of fussy and errs toward unnecessary soothing behaviors may support
more successful interactions compared to a robot that errs toward not intervening when infants are
fussy.
8.2.4 Accuracy Versus Time Since Prediction Transition
Across all models, accuracy first increased with the number of consecutive predictions of the same
affect, followed by a decrease after several seconds. These trends are illustrated in Fig. 8.5. This
pattern may reflect gradual rather than sudden transitions as affect shifted between states. For sam-
ples labeled as alert, the decrease in mean accuracy did not occur until after 30 seconds for any
model. In data labeled as fussy, this decrease occurred at different times across models. Unimodal
models experienced a decrease in performance after predicting a label of fussy for approximately
10 consecutive seconds. Meanwhile, when multimodal models predicted a label of fussy, peak ac-
curacy was reached after 15 seconds before experiencing a decrease in performance, and achieved
greater peak accuracy compared to unimodal models. As these results were generated across
all samples, including those with low confidence features, these trends demonstrate the benefit of
multimodal approaches for continuous affect recognition in the presence of missing data.
Future work should address the relationship between the number of consecutive predictions
and affect recognition performance. Whether considering infants or older populations, changes
in model performance over time impact a system’s ability to respond quickly to affective state
transitions and to assess the duration of a given state. For example, a SAR that intervenes when
a subject is upset should balance the rising certainty over time (as illustrated in Fig. 8.5) with the
ramifications of waiting too long to provide support.
Figure 8.5: Model accuracy along seconds for which the classifier produced consecutive predic-
tions of the same affect. The mean accuracy across infants is illustrated by the solid line, while
the 95% confidence interval for model accuracy across infant-robot interaction sessions is shaded
around the mean accuracy. Only data samples with a sample size greater than 3 are visualized.
Additionally, we note that the classifiers demonstrated a lower mean and higher variance in
performance when predicting a label of fussy rather than alert. This may be caused by class
imbalance, as alert labels comprised 84.3% of the infant-robot interaction dataset. In addition to
temporal trends in accuracy, the impact of class imbalance should be considered when applying
affect recognition to downstream tasks such as the automatic evaluation of social synchrony in
infant-mother interactions or the design of action selection policies during SAR interactions.
8.3 Discussion and Summary
This chapter presents an analysis of temporal trends in unimodal and multimodal infant affect
recognition performance using facial and body features. In addition to demonstrating body fea-
tures as a viable modality for predicting infant affect, we explored how changes in the length of
time included in input windows and the duration of an actual or predicted affective state were as-
sociated with classification performance. Longer optimal window lengths of facial features com-
pared to body features suggest that infants in the infant-robot interaction dataset displayed their
affect along different time scales across modalities. Multimodal models outperformed unimodal
models when evaluated on the entire dataset D_total but not when evaluated on D_confident. The mean
and variance in classification accuracy changed with the duration of actual and predicted affective
states. These results highlight the importance of testing models on continuous interaction data and
assessing patterns in performance over time when applying affect recognition to real-world social
interactions. Future work should evaluate how these results generalize across infant datasets and
additional modalities. Taken together, the results presented in this chapter inform guidelines for
the application of automated infant affect recognition in the context of social interaction.
Chapter 9
Dissertation Summary and Conclusions
This chapter summarizes the contributions made by this dissertation and describes
future directions enabled by this research.
This dissertation presents approaches for modeling social synchrony during embodied dyadic
interactions. Social synchrony was formalized as the unification of temporal behavior adaptation,
joint attention, and shared affective states. As embodied social interactions encompass multiple
types of communication patterns at once, this work focused on modeling synchrony with hetero-
geneous data, or data describing multiple types of behaviors. Given the impact of supportive,
reciprocal social interactions on child development, the approaches in this dissertation were de-
veloped and evaluated in the context of infant-mother, infant-robot, and child-robot interactions.
The analyses presented in this dissertation assert the interplay between the components of social
synchrony as the foundation of supportive, reciprocal social interactions in these contexts.
Computational models of temporal behavior adaptation were investigated in the context of
infant-mother interactions. As individuals’ focus of attention and affective states cannot be ob-
served directly, computational research in social synchrony has typically assessed relationships
between partners’ behavioral signals to analyze communication dynamics and emotion regulation
patterns during these interactions (Leclère et al. 2014a; Delaherche et al. 2012). We expanded op-
portunities to evaluate synchrony in this context by demonstrating how changes in infant-mother in-
teraction dynamics are manifested through the changing relationships between both homogeneous
and heterogeneous infant and mother behavior signals. Using the windowed cross-correlation with peak-picking approach of Boker et al. (2002), we quantified how specific pairs of behavioral
signals were associated in different ways across different stages of the Face-to-Face Still-Face pro-
cedure (FFSF). The dataset of behavioral signals extracted as part of this work was made publicly
available to support ongoing research in this area. Next, we modeled the mother-infant dyad as a
dynamical system and showcased how Dynamic Mode Decomposition with control could be used
to assess temporal behavior adaptation across multiple heterogeneous signals. Through this work,
we introduced a novel metric of responsiveness to quantify temporal behavior adaptation across
these multiple signals.
Joint attention was analyzed in the context of a therapeutic leg movement activity for infants
administered by a Socially Assistive Robot. We used a model of Bayesian surprise (Itti and Baldi
2005) to demonstrate the predictive power of spatial and temporal saliency in the interaction envi-
ronment on infants’ gaze location. However, the ‘surprise’ generated by the robot’s actions was
not sufficient to predict its success in initiating or sustaining joint attention with the infant. As
this estimator of attention accounted only for the robot behavior and not the infant’s behavioral
or affective state, this analysis highlighted the inability to model components of social synchrony
without accounting for their counterparts. Therefore, we next addressed methods for classifying
infant affective state and child engagement, which both served as indicators of participants’ ability
to take part in the interactions.
Finally, our analysis of temporal patterns in affect classification performance addressed how
methods of evaluating classifier performance can inform models of temporal behavior adaptation.
We found that the choice of optimal classifiers shifted when accounting for occlusions. Within in-
teractions, model performance was lower during times surrounding affective state transitions and
increased as affective states remained consistent. In SAR, these findings can inform the certainty
of the robot’s perception modules, and in turn, its ability to adapt its behavior appropriately to
achieve shared affective states. In retrospective analysis of social interactions, evaluating tempo-
ral patterns in classification performance can help determine how best to incorporate automated
behavior recognition into models of temporal behavior adaptation.
The contributions of this dissertation support new opportunities for computational models that
synthesize the components of social synchrony toward more holistic representations of embodied
dyadic communication. Describing challenges facing the integration of machine learning into qual-
itative coding, Chen et al. (2016) describe the need for cohesion between automated approaches
and human expert coding workflows. In the context of assessing developmentally relevant social
interactions, approaches that explicitly model each component of social synchrony, the relation-
ships between components, and the multiple behaviors that determine interaction dynamics can
bring existing approaches closer to this goal. A more comprehensive, interpretable mapping be-
tween computational and expert-labeled coding approaches would help meet the need for more scalable screening procedures, behavioral science studies, and assessments of intervention efficacy. In
the context of SAR, integrating computational models of social synchrony may support improved
outcomes through the development of naturalistic, reciprocal closed-loop interactions.
Overall, the approaches and insights contributed by this dissertation serve as a basis for more
holistic computational representations of the interactions that are foundational to our overall health
and well-being.
Bibliography
Developing Child, National Scientific Council on the (2012). The science of neglect: The persistent
absence of responsive care disrupts the developing brain.
Developing Child, National Scientific Council on the (2004). Young children develop in an envi-
ronment of relationships. Harvard University, Center on the Developing Child.
Leclère, Chloé, Sylvie Viaux, Marie Avril, Catherine Achard, Mohamed Chetouani, Sylvain Missonnier, and David Cohen (2014a). “Why synchrony matters during mother-child interactions: a systematic review”. In: PloS one 9.12, e113571.
Rogers, Sally J, L Vismara, AL Wagner, C McCormick, G Young, and S Ozonoff (2014). “Autism
treatment in the first year of life: a pilot study of infant start, a parent-implemented intervention
for symptomatic infants”. In: Journal of autism and developmental disorders 44.12, pp. 2981–
2995.
Delaherche, Emilie, Mohamed Chetouani, Ammar Mahdhaoui, Catherine Saint-Georges, Sylvie
Viaux, and David Cohen (2012). “Interpersonal synchrony: A survey of evaluation methods
across disciplines”. In: IEEE Transactions on Affective Computing 3.3, pp. 349–365.
Gratch, Jonathan, Ning Wang, Jillian Gerten, Edward Fast, and Robin Duffy (2007). “Creating
rapport with virtual agents”. In: Intelligent Virtual Agents: 7th International Conference, IVA
2007 Paris, France, September 17-19, 2007 Proceedings 7. Springer, pp. 125–138.
Prepin, Ken and Philippe Gaussier (2010). “How an agent can detect and use synchrony param-
eter of its own interaction with a human?” In: Development of Multimodal Interfaces: Active
Listening and Synchrony: Second COST 2102 International Training School, Dublin, Ireland,
March 23-27, 2009, Revised Selected Papers, pp. 50–65.
Feil-Seifer, David and Maja J Matarić (2005). “Defining socially assistive robotics”. In: 9th Inter-
national Conference on Rehabilitation Robotics, 2005. ICORR 2005. IEEE, pp. 465–468.
Greczek, Jillian, Edward Kaszubski, Amin Atrash, and Maja Matarić (2014). “Graded cueing feed-
back in robot-mediated imitation practice for children with autism spectrum disorders”. In: The
23rd IEEE international symposium on robot and human interactive communication. IEEE,
pp. 561–566.
Shi, Zhonghao, Manwei Cao, Sophia Pei, Xiaoyang Qiao, Thomas R Groechel, and Maja J Matarić
(2021). “Personalized Affect-Aware Socially Assistive Robot Tutors Aimed at Fostering Social
Grit in Children with Autism”. In: arXiv preprint arXiv:2103.15256.
Mundy, Peter and Lisa Newell (2007). “Attention, joint attention, and social cognition”. In: Current
directions in psychological science 16.5, pp. 269–274.
Fisher, Philip A, Tahl I Frenkel, Laura K Noll, Melanie Berry, and Melissa Yockelson (2016).
“Promoting healthy child development via a two-generation translational neuroscience frame-
work: the Filming Interactions to Nurture Development video coaching program”. In: Child
Development Perspectives 10.4, pp. 251–256.
Franchak, John M, Kari S Kretch, Kasey C Soska, and Karen E Adolph (2011). “Head-mounted
eye tracking: A new method to describe infant looking”. In: Child development 82.6, pp. 1738–
1750.
Yamamoto, Hiroki, Atsushi Sato, and Shoji Itakura (2019). “Eye tracking in an everyday envi-
ronment reveals the interpersonal distance that affords infant-parent gaze communication”. In:
Scientific reports 9.1, pp. 1–9.
Niedźwiecka, Alicja, Sonia Ramotowska, and Przemysław Tomalski (2018). “Mutual gaze during
early mother–infant interactions promotes attention control development”. In: Child Develop-
ment 89.6, pp. 2230–2244.
Feldman, Ruth (2003). “Infant–mother and infant–father synchrony: The coregulation of positive
arousal”. In: Infant Mental Health Journal: Official Publication of The World Association for
Infant Mental Health 24.1, pp. 1–23.
Dawson, Geraldine Ed and Kurt W Fischer (1994). Human behavior and the developing brain. The
Guilford Press.
Wan, Ming Wai, Jonathan Green, Mayada Elsabbagh, Mark Johnson, Tony Charman, Faye Plum-
mer, and Basis Team (2013). “Quality of interaction between at-risk infants and caregiver at
12–15 months is associated with 3-year autism outcome”. In: Journal of Child Psychology and
Psychiatry 54.7, pp. 763–771.
Lieberman, Alicia F, Donna R Weston, and Jeree H Pawl (1991). “Preventive intervention and
outcome with anxiously attached dyads”. In: Child development 62.1, pp. 199–209.
Leclère, Chloé, Marie Avril, S Viaux-Savelon, N Bodeau, Catherine Achard, Sylvain Missonnier,
Miri Keren, R Feldman, M Chetouani, and David Cohen (2016a). “Interaction and behaviour
imaging: a novel method to measure mother–infant interaction using video 3D reconstruction”.
In: Translational Psychiatry 6.5, e816–e816.
Tronick, Edward, Heidelise Als, Lauren Adamson, Susan Wise, and T Berry Brazelton (1978).
“The infant’s response to entrapment between contradictory messages in face-to-face interac-
tion”. In: Journal of the American Academy of Child psychiatry 17.1, pp. 1–13.
Adamson, Lauren B and Janet E Frick (2003a). “The still face: A history of a shared experimental
paradigm”. In: Infancy 4.4, pp. 451–473.
Peláez-Nogueras, Martha, Tiffany M Field, Ziarat Hossain, and Jeffrey Pickens (1996). “Depressed
mothers’ touching increases infants’ positive affect and attention in still-face interactions”. In:
Child development 67.4, pp. 1780–1792.
Feldman, R (1998). “Coding interactive behavior (CIB)”. In: Unpublished manual, Department of
Psychology, Bar-Ilan University, Ramat-Gan, Israel.
Bernieri, Frank J, J Steven Reznick, and Robert Rosenthal (1988). “Synchrony, pseudosynchrony,
and dissynchrony: measuring the entrainment process in mother-infant interactions.” In: Jour-
nal of personality and social psychology 54.2, p. 243.
Skuban, Emily Moye, Daniel S Shaw, Frances Gardner, Lauren H Supplee, and Sara R Nichols
(2006). “The correlates of dyadic synchrony in high-risk, low-income toddler boys”. In: Infant
Behavior and Development 29.3, pp. 423–434.
Gates, Kathleen M and Siwei Liu (2016). “Methods for quantifying patterns of dynamic interac-
tions in dyads”. In: Assessment 23.4, pp. 459–471.
Boker, Steven M, Jennifer L Rotondo, Minquan Xu, and Kadijah King (2002). “Windowed cross-
correlation and peak picking for the analysis of variability in the association between behavioral
time series.” In: Psychological methods 7.3, p. 338.
Hammal, Zakia, Jeffrey F Cohn, and Daniel S Messinger (2015a). “Head movement dynamics dur-
ing play and perturbed mother-infant interaction”. In: IEEE transactions on affective computing
6.4, pp. 361–370.
López Pérez, David, Giuseppe Leonardi, Alicja Niedźwiecka, Alicja Radkowska, Joanna
Raczaszek-Leonardi, and Przemysław Tomalski (2017). “Combining recurrence analysis and
automatic movement extraction from video recordings to study behavioral coupling in face-to-
face parent-child interactions”. In: Frontiers in psychology 8, p. 2228.
Gratier, Maya (2003). “Expressive timing and interactional synchrony between mothers and in-
fants: Cultural similarities, cultural differences, and the immigration experience”. In: Cognitive
development 18.4, pp. 533–554.
Granger, Clive WJ (1969). “Investigating causal relations by econometric models and cross-
spectral methods”. In: Econometrica: journal of the Econometric Society, pp. 424–438.
Hoch, Justine E, Ori Ossmy, Whitney G Cole, Shohan Hasan, and Karen E Adolph (2021). ““Danc-
ing” Together: Infant–Mother Locomotor Synchrony”. In: Child development 92.4, pp. 1337–
1353.
Seth, Anil K (2005). “Causal connectivity of evolved neural networks during behavior”. In: Net-
work: Computation in Neural Systems 16.1, pp. 35–54.
Messinger, Daniel M, Paul Ruvolo, Naomi V Ekas, and Alan Fogel (2010a). “Applying machine
learning to infant interaction: The development is in the details”. In: Neural Networks 23.8-9,
pp. 1004–1016.
Cohn, Jeffrey F and Edward Z Tronick (1987). “Mother–infant face-to-face interaction: The se-
quence of dyadic states at 3, 6, and 9 months.” In: Developmental psychology 23.1, p. 68.
Ardulov, Victor, Madelyn Mendlen, Manoj Kumar, Neha Anand, Shanna Williams, Thomas Lyon,
and Shrikanth Narayanan (2018). “Multimodal interaction modeling of child forensic inter-
viewing”. In: Proceedings of the 20th ACM International Conference on Multimodal Interac-
tion, pp. 179–185.
Cohen, Rami and Yizhar Lavner (2012). “Infant cry analysis and detection”. In: 2012 IEEE 27th
Convention of Electrical and Electronics Engineers in Israel. IEEE, pp. 1–5.
Saraswathy, J, M Hariharan, Sazali Yaacob, and Wan Khairunizam (2012). “Automatic classifica-
tion of infant cry: A review”. In: 2012 International Conference on Biomedical Engineering
(ICoBE). IEEE, pp. 543–548.
Lysenko, Sofiya, Nidhi Seethapathi, Laura Prosser, Konrad Kording, and Michelle J Johnson
(2020). “Towards Automated Emotion Classification of Atypically and Typically Developing
Infants”. In: 2020 8th IEEE RAS/EMBS International Conference for Biomedical Robotics and
Biomechatronics (BioRob). IEEE, pp. 503–508.
Messinger, Daniel S, Mohammad H Mahoor, Sy-Miin Chow, and Jeffrey F Cohn (2009). “Au-
tomated measurement of facial expression in infant–mother interaction: A pilot study”. In:
Infancy 14.3, pp. 285–305.
Nikolaidis, Stefanos, Swaprava Nath, Ariel D Procaccia, and Siddhartha Srinivasa (2017). “Game-
theoretic modeling of human adaptation in human-robot collaboration”. In: Proceedings of the
2017 ACM/IEEE international conference on human-robot interaction, pp. 323–331.
Hoffman, Guy (2019). “Evaluating fluency in human–robot collaboration”. In: IEEE Transactions
on Human-Machine Systems 49.3, pp. 209–218.
Hoffman, Guy and Cynthia Breazeal (2007). “Effects of anticipatory action on human-robot team-
work efficiency, fluency, and perception of team”. In: Proceedings of the ACM/IEEE interna-
tional conference on Human-robot interaction, pp. 1–8.
Matarić, Maja J and Brian Scassellati (2016). “Socially assistive robotics”. In: Springer handbook
of robotics, pp. 1973–1994.
Clabaugh, Caitlyn, Kartik Mahajan, Shomik Jain, Roxanna Pakkar, David Becerra, Zhonghao Shi,
Eric Deng, Rhianna Lee, Gisele Ragusa, and Maja Matarić (2019). “Long-term personaliza-
tion of an in-home socially assistive robot for children with autism spectrum disorders”. In:
Frontiers in Robotics and AI 6, p. 110.
Matarić, Maja J, Jon Eriksson, David J Feil-Seifer, and Carolee J Winstein (2007). “Socially assis-
tive robotics for post-stroke rehabilitation”. In: Journal of neuroengineering and rehabilitation
4, pp. 1–9.
Birmingham, Chris, Zijian Hu, Kartik Mahajan, Eli Reber, and Maja J Matarić (2020). “Can I
trust you? A user study of robot mediation of a support group”. In: 2020 IEEE International
Conference on Robotics and Automation (ICRA). IEEE, pp. 8019–8026.
Tsiakas, Konstantinos, Maher Abujelala, and Fillia Makedon (2018). “Task engagement as per-
sonalization feedback for socially-assistive robots and cognitive training”. In: Technologies
6.2, p. 49.
Jain, Shomik, Balasubramanian Thiagarajan, Zhonghao Shi, Caitlyn Clabaugh, and Maja J Matarić
(2020). “Modeling engagement in long-term, in-home socially assistive robot interventions for
children with autism spectrum disorders”. In: Science Robotics 5.39, eaaz3791.
Holroyd, Aaron, Charles Rich, Candace L Sidner, and Brett Ponsler (2011). “Generating connec-
tion events for human-robot collaboration”. In: 2011 RO-MAN. IEEE, pp. 241–246.
Huang, Chien-Ming and Bilge Mutlu (2013). “The repertoire of robot behavior: Enabling robots
to achieve interaction goals through social behavior”. In: Journal of Human-Robot Interaction
2.2, pp. 80–102.
Filntisis, Panagiotis Paraskevas, Niki Efthymiou, Petros Koutras, Gerasimos Potamianos, and Pet-
ros Maragos (2019). “Fusing body posture with facial expressions for joint recognition of affect
in child–robot interaction”. In: IEEE Robotics and Automation Letters 4.4, pp. 4011–4018.
Mathur, Leena, Micol Spitale, Hao Xi, Jieyun Li, and Maja J Matarić (2021). “Modeling user
empathy elicited by a robot storyteller”. In: 2021 9th International Conference on Affective
Computing and Intelligent Interaction (ACII). IEEE, pp. 1–8.
Abbasi, Nida Itrat, Micol Spitale, Joanna Anderson, Tamsin Ford, Peter B Jones, and Hatice Gunes
(2023). “Computational Audio Modelling for Robot-Assisted Assessment of Children’s Mental
Wellbeing”. In: Social Robotics: 14th International Conference, ICSR 2022, Florence, Italy,
December 13–16, 2022, Proceedings, Part II. Springer, pp. 23–35.
Narain, Jaya, Kristina T Johnson, Craig Ferguson, Amanda O’Brien, Tanya Talkar, Yue Zhang
Weninger, Peter Wofford, Thomas Quatieri, Rosalind Picard, and Pattie Maes (2020). “Person-
alized modeling of real-world vocalizations from nonverbal individuals”. In: Proceedings of
the 2020 International Conference on Multimodal Interaction, pp. 665–669.
Provenzi, Livio, Giunia Scotto di Minico, Lorenzo Giusti, Elena Guida, and Mitho Müller (2018).
“Disentangling the dyadic dance: theoretical, methodological and outcomes systematic review
of mother-infant dyadic processes”. In: Frontiers in psychology 9, p. 348.
Leclère, Chloé, Sylvie Viaux, Marie Avril, Catherine Achard, Mohamed Chetouani, Sylvain Mis-
sonnier, and David Cohen (2014b). “Why synchrony matters during mother-child interactions:
a systematic review”. In: PloS one 9.12.
Adamson, Lauren B and Janet E Frick (2003b). “The still face: A history of a shared experimental
paradigm”. In: Infancy 4.4, pp. 451–473.
Funke, Rebecca, Naomi T Fitter, Joyce T de Armendi, Nina S Bradley, Barbara Sargent, Maja J
Matarić, and Beth A Smith (2018). “A data collection of infants’ visual, physical, and behav-
ioral reactions to a small humanoid robot”. In: 2018 IEEE Workshop on Advanced Robotics
and Its Social Impacts (ARSO). IEEE, pp. 99–104.
Fitter, Naomi T, Rebecca Funke, José Carlos Pulido, Lauren E Eisenman, Weiyang Deng, Marcelo
R Rosales, Nina S Bradley, Barbara Sargent, Beth A Smith, and Maja J Matarić (2019). “So-
cially assistive infant-robot interaction: Using robots to encourage infant leg-motion training”.
In: IEEE Robotics & Automation Magazine 26.2, pp. 12–23.
Pulido, José Carlos, Rebecca Funke, Javier García, Beth A Smith, and Maja Matarić (2019).
“Adaptation of the difficulty level in an infant-robot movement contingency study”. In: Ad-
vances in Physical Agents: Proceedings of the 19th International Workshop of Physical Agents
(WAF 2018), November 22-23, 2018, Madrid, Spain. Springer, pp. 70–83.
Deng, Weiyang, Barbara Sargent, Nina S Bradley, Lauren Klein, Marcelo Rosales, José Carlos
Pulido, Maja J Matarić, and Beth A Smith (2021). “Using Socially Assistive Robot Feedback to
Reinforce Infant Leg Movement Acceleration”. In: 2021 30th IEEE International Conference
on Robot & Human Interactive Communication (RO-MAN). IEEE, pp. 749–756.
Jiang, Crystal, Christianne J Lane, Emily Perkins, Derek Schiesel, and Beth A Smith (2018). “De-
termining if wearable sensors affect infant leg movement frequency”. In: Developmental neu-
rorehabilitation 21.2, pp. 133–136.
Piper, Martha C and Johanna Darrah (1994). Motor assessment of the developing infant. WB Saun-
ders Company.
Sargent, Barbara, Nicolas Schweighofer, Masayoshi Kubo, and Linda Fetters (2014). “Infant ex-
ploratory learning: influence on leg joint coordination”. In: PloS one 9.3, e91500.
Lester, Barry M, Edward Z Tronick, and MD T. Berry Brazelton (2004). “The neonatal inten-
sive care unit network neurobehavioral scale procedures”. In: Pediatrics 113.Supplement 2,
pp. 641–667.
Klein, Lauren, Victor Ardulov, Yuhua Hu, Mohammad Soleymani, Alma Gharib, Barbara Thomp-
son, Pat Levitt, and Maja J Matarić (2020). “Incorporating Measures of Intermodal Coordina-
tion in Automated Analysis of Infant-Mother Interaction”. In: Proceedings of the 2020 Inter-
national Conference on Multimodal Interaction, pp. 287–295.
Hammal, Zakia, Jeffrey F Cohn, and Daniel S Messinger (2015b). “Head movement dynamics dur-
ing play and perturbed mother-infant interaction”. In: IEEE transactions on affective computing
6.4, pp. 361–370.
Messinger, Daniel M, Paul Ruvolo, Naomi V Ekas, and Alan Fogel (2010b). “Applying machine
learning to infant interaction: The development is in the details”. In: Neural Networks 23.8-9,
pp. 1004–1016.
Leclère, C, M Avril, S Viaux-Savelon, N Bodeau, Catherine Achard, S Missonnier, M Keren, R
Feldman, M Chetouani, and David Cohen (2016b). “Interaction and behaviour imaging: a novel
method to measure mother–infant interaction using video 3D reconstruction”. In: Translational
Psychiatry 6.5, e816–e816.
Mahdhaoui, Ammar, Mohamed Chetouani, Raquel S Cassel, Catherine Saint-Georges, Erika Par-
lato, Marie Christine Laznik, Fabio Apicella, Filippo Muratori, Sandra Maestro, and David
Cohen (2011). “Computerized home video detection for motherese may help to study impaired
interaction between infants who become autistic and their parents”. In: International Journal
of Methods in Psychiatric Research 20.1, e6–e18.
Tang, Chuangao, Wenming Zheng, Yuan Zong, Zhen Cui, Nana Qiu, Simeng Yan, and Xiaoyan
Ke (2018). “Automatic Smile Detection of Infants in Mother-Infant Interaction via CNN-based
Feature Learning”. In: Proceedings of the Joint Workshop of the 4th Workshop on Affective
Social Multimedia Computing and first Multi-Modal Affective Computing of Large-Scale Mul-
timedia Data, pp. 35–40.
Stiefelhagen, Rainer and Jie Zhu (2002). “Head orientation and gaze direction in meetings”. In:
CHI’02 Extended Abstracts on Human Factors in Computing Systems, pp. 858–859.
Cao, Zhe, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh (2018). “OpenPose:
realtime multi-person 2D pose estimation using Part Affinity Fields”. In: arXiv preprint
arXiv:1812.08008.
Juslin, Patrik N and Klaus R Scherer (2005). “Vocal expression of affect”. In: The new handbook
of methods in nonverbal behavior research, pp. 65–135.
Kappas, A, U Hess, and KR Scherer (1991). “Voice and emotion: Fundamentals of Nonverbal
Behavior”. In: Rim, B., Feldman, RS (eds.), pp. 200–238.
Boersma, P and D Weenink (2002). “Praat 4.0: a system for doing phonetics with the computer
[Computer software]”. In: Amsterdam: Universiteit van Amsterdam.
Gabrieli, Giulio, Wan Qing Leck, Andrea Bizzego, and Gianluca Esposito (2019). “Are Praat’s
default settings optimal for Infant cry analysis”. In: Proceedings of the 2019 CCRMA Linux
Audio Conference, LAC, Stanford, LA, USA, pp. 23–26.
Klein, Lauren, Victor Ardulov, Alma Gharib, Barbara Thompson, Pat Levitt, and Maja Matarić
(2021). “Dynamic mode decomposition with control as a model of multimodal behavioral co-
ordination”. In: Proceedings of the 2021 International Conference on Multimodal Interaction,
pp. 25–33.
Hirai, Ashley H, Michael D Kogan, Veni Kandasamy, Colleen Reuland, and Christina Bethell
(2018). “Prevalence and variation of developmental screening and surveillance in early child-
hood”. In: JAMA pediatrics 172.9, pp. 857–866.
Shin, Eunkyung, Cynthia L Smith, and Brittany R Howell (2021). “Advances in behavioral re-
mote data collection in the home setting: Assessing the mother-infant relationship and infant’s
adaptive behavior via virtual visits”. In: Frontiers in Psychology 12, p. 703822.
Sapiro, Guillermo, Jordan Hashemi, and Geraldine Dawson (2019). “Computer vision and behav-
ioral phenotyping: an autism case study”. In: Current Opinion in Biomedical Engineering 9,
pp. 14–20.
Klein, Lauren, Laurent Itti, Beth A Smith, Marcelo Rosales, Stefanos Nikolaidis, and Maja J
Matarić (2019). “Surprise! predicting infant visual attention in a socially assistive robot contin-
gent learning paradigm”. In: 2019 28th IEEE International Conference on Robot and Human
Interactive Communication (RO-MAN). IEEE, pp. 1–7.
Holt, Rebecca L and Mohamad A Mikati (2011). “Care for child development: basic science ratio-
nale and effects of interventions”. In: Pediatric neurology 44.4, pp. 239–253.
Cohen, Leslie B (1973). “A two process model of infant visual attention”. In: Merrill-Palmer
Quarterly of Behavior and Development 19.3, pp. 157–180.
Itti, Laurent and Pierre Baldi (2005). “A principled approach to detecting surprising events in
video”. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recog-
nition (CVPR’05). Vol. 1. IEEE, pp. 631–637.
Itti, Laurent and Pierre F Baldi (2006). “Bayesian surprise attracts human attention”. In: Advances
in neural information processing systems, pp. 547–554.
Scassellati, Brian, Jake Brawer, Katherine Tsui, Setareh Nasihati Gilani, Melissa Malzkuhn, Bar-
bara Manini, Adam Stone, Geo Kartheiser, Arcangelo Merla, Ari Shapiro, et al. (2018). “Teach-
ing language to deaf infants with a robot and a virtual human”. In: Proceedings of the 2018 CHI
Conference on human Factors in computing systems, pp. 1–13.
Baltrusaitis, Tadas, Amir Zadeh, Yao Chong Lim, and Louis-Philippe Morency (2018). “Open-
face 2.0: Facial behavior analysis toolkit”. In: 2018 13th IEEE international conference on
automatic face & gesture recognition (FG 2018). IEEE, pp. 59–66.
Cao, Zhe, Tomas Simon, Shih-En Wei, and Yaser Sheikh (2017). “Realtime multi-person 2d pose
estimation using part affinity fields”. In: Proceedings of the IEEE conference on computer
vision and pattern recognition, pp. 7291–7299.
Yang, Fan, Yang Wu, Sakriani Sakti, and Satoshi Nakamura (2019). “Make skeleton-based action
recognition model smaller, faster and better”. In: Proceedings of the ACM multimedia asia,
pp. 1–6.
Lingenfelser, Florian, Johannes Wagner, Elisabeth André, Gary McKeown, and Will Curran
(2014). “An event driven fusion approach for enjoyment recognition in real-time”. In: Pro-
ceedings of the 22nd ACM international conference on Multimedia, pp. 377–386.
Chen, Nan-chen, Rafal Kocielnik, Margaret Drouhard, Vanessa Peña-Araya, Jina Suh, Keting Cen,
Xiangyi Zheng, and Cecilia R Aragon (2016). “Challenges of applying machine learning to
qualitative coding”. In: ACM SIGCHI Workshop on Human-Centered Machine Learning.