Understanding Interactional Dynamics from Diverse
Child-inclusive Interactions
by
Rimita Lahiri
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
August 2024
Copyright 2024 Rimita Lahiri
I dedicate this thesis to my parents and my family
for their support throughout.
Acknowledgements
First, I would like to express my sincere gratitude to my advisor, Prof. Shrikanth (Shri) Narayanan. The doctoral journey started with both Shri and Panayiotis (Panos) Georgiou; Panos' guidance and feedback during the initial days, and Shri's vision and perspective towards the broader objective, have played an integral role in my growth as a researcher. Shri's passion and critical thinking have inspired me at every step. His motivation, support and patience have shaped my PhD journey, and I feel honored and extremely proud to have had the opportunity to work under his guidance.
I would also like to thank my dissertation committee members, Prof. Paul Bogdan and Prof.
Maja Mataric, and my qualifying exam committee members, Prof. Keith Jenkins, and Prof. C.C.
Jay Kuo. Their generous feedback and comments have helped me to refine and improve this
dissertation, and I am grateful for their contributions.
I am humbled and deeply grateful for the support and encouragement I have received from the USC Hearing and Communication Neuroscience Department during my two-year doctoral traineeship. Especially, I want to extend my gratitude to the advisory committee members, Prof. Maja Mataric and Prof. Assal Habibi, for their continued support and help. I would like to thank Prof. Christopher Shera for all his help and his sincere efforts toward successfully completing the traineeship. I also want to thank our amazing collaborators, Prof. So Hyun Kim, Prof. Catherine Lord and Prof. Somer Bishop, for their guidance and help throughout my PhD journey. Specifically, I want to thank Sophy (Prof. So Hyun Kim) for always helping with my questions and research ideas, and for mentoring me throughout.
I also express my gratitude towards past and present SAIL members who extended their help and support at every step, through intellectual and academic discussions as well as guidance and friendship. I want to thank Nasir, Taejin and Manoj for mentoring me during the early days of my PhD journey. In particular, I spent a lot of quality time with Nasir and received constant help and support from him whenever I got stuck. Special thanks to Naveen, Krishna, Sandeep, Rahul, Victor, Raghu, Tiantian, Rajat, Saby, Digbalay, Kleanthis, Yiorgos and Anfeng for all the fun times and cherished memories from the lab. I would also like to mention the constant support of Amrutha and Xuan, which kept me strong and motivated, and of course the SAIL girl chitchats.
I want to thank my friends in Los Angeles who made my journey so memorable; if it were not for them, I could not have finished my PhD. Specifically, I thank Sulagna, without whom I do not think I could have finished my thesis, and Samprita, for being extremely kind and patient through all the odds. Special thanks to Avik, Sohini, Chandani and Souvik for all the get-togethers, random chitchats and great food. I also want to express my heartfelt thanks to Shivalee, Jhelum, Anwesha, Anupriya, Baishali, Arnab, Dibyendu, Madhubony, Erum, Anamika, Shilpa, Agnimitra and Pratyusha for all the good times.
Finally, I would like to thank my parents and my extended family for their constant support at every step. For my parents, letting their only child move away from home to fulfill her dreams was not easy; even during difficult times they encouraged me to follow my heart, sometimes at the cost of their own happiness and comfort. I would like to mention my aunt, Irin Lahiri, my paternal uncle, Tapas Kumar Ghoshal, and my uncle, the late Kallol Kolay, for their tremendous support; they taught me to believe in myself and to keep dreaming big even during unfavourable times. I am forever indebted to my teachers from various phases of life, particularly Mrs. Shubhra Mukherjee, Ms. Anjana Basu and Mr. Suman Chakravarty, for fostering in me a passion to know the unknown during my childhood; I believe that eventually motivated me to pursue doctoral studies. My journey would be incomplete without my best friend Barsha: right from our school days, her protection and guidance have helped me to dodge difficulties with ease, and she has stood strong with me through thick and thin. My heartfelt thanks and gratitude to all my cousins for their love and support. I would like to thank my husband, Dr. Anish Dasgupta, for his unwavering love and support; I cannot thank him enough for continuously pushing me to fulfill my dreams. A special mention to Anish's family for showering me with love and affection and for welcoming me to the family with such warmth.
Table of Contents
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.1 Representation Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.2 Modeling Interpersonal Synchrony . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Chapter 2: Child-adult classification using adversarial learning . . . . . . . . . . . . . . . . 9
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.1 Speaker Diarization in Autism Diagnosis Sessions . . . . . . . . . . . . . . 11
2.2.2 Domain Adversarial Learning . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Domain Adversarial Learning for Speaker classification . . . . . . . . . . . . . . . 12
2.4 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4.2 Features and neural network architecture . . . . . . . . . . . . . . . . . . . 15
2.4.3 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4.4 Cross-Domain Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.5 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Chapter 3: Child-adult classification using self-supervised learning . . . . . . . . . . . . . 20
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.1 Self-supervision in speech . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.2 Child-adult classification in the ASD domain . . . . . . . . . . . . . . . . 23
vi
3.3 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.4 System Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.4.1 Pre-training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.4.2 Downstream Classifier Architectures . . . . . . . . . . . . . . . . . . . . . 26
3.5 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.5.1 Child Adult Classification . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.5.2 ADOSMod3 Experiments on Demographics . . . . . . . . . . . . . . . . . 28
3.5.3 Experimental details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.6 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.6.1 Classification Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.6.2 Result Evaluation based on Demographics . . . . . . . . . . . . . . . . . . 31
3.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Chapter 4: Analyzing short term dynamic speech features for understanding behavioral
traits of children . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.2 Conversational Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.3 Experimental Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.3.1 Acoustic-Prosodic and Turn-Taking Feature Analysis . . . . . . . . . . . . 38
4.3.2 Correlation Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.3.3 Classification and Feature Selection . . . . . . . . . . . . . . . . . . . . . 39
4.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.4.1 Correlation Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.4.2 Classification Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.4.3 Summary of observations . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Chapter 5: Modeling interpersonal synchrony across vocal and lexical modalities in child-inclusive interactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.2 Dataset Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.3 Quantification of Interpersonal Synchrony . . . . . . . . . . . . . . . . . . . . . . 47
5.3.1 Acoustic spectral and vocal prosodic features . . . . . . . . . . . . . . . . 48
5.3.1.1 DTW distance measure (DTWD) . . . . . . . . . . . . . . . . . 48
5.3.1.2 Squared Cosine Distance of Complexity Measure (SCDC) . . . . 49
5.3.2 Lexical Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.3.2.1 Word Mover’s Distance (WMD) . . . . . . . . . . . . . . . . . . 51
5.4 Empirical validation of the proposed synchrony measures . . . . . . . . . . . . . . 52
5.5 Experimental Results on ASD Interaction Datasets . . . . . . . . . . . . . . . . . 53
5.5.1 Classification experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.5.2 ANOVA analysis based on age-group and gender . . . . . . . . . . . . . . 55
5.5.3 Comparison of the distribution of the proposed measures across different
subtasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.5.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
vii
Chapter 6: A context-aware computational approach to quantify vocal entrainment in dyadic
interactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.2 Computing context-aware entrainment measure . . . . . . . . . . . . . . . . . . . 61
6.2.1 Unsupervised model training and CED computation . . . . . . . . . . . . . 61
6.2.2 Model architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.3.1 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.3.1.1 Feature extraction . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.3.1.2 Parameters and implementation details . . . . . . . . . . . . . . 64
6.3.2 Experimental validation of CED . . . . . . . . . . . . . . . . . . . . . . . 65
6.3.3 Experimental evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.4 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Chapter 7: Understanding entrainment patterns under contrastive supervision . . . . . . . . 70
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
7.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
7.3 Contrastive Learning for Understanding Entrainment . . . . . . . . . . . . . . . . 75
7.3.1 Encoding entrainment under contrastive supervision . . . . . . . . . . . . . 76
7.3.2 Uni-modal and Cross-modal Design . . . . . . . . . . . . . . . . . . . . . 78
7.4 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
7.4.1 Pretraining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
7.4.2 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
7.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
7.5.1 Preprocessing, feature extraction and baselines . . . . . . . . . . . . . . . 84
7.5.2 Implementation details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
7.5.3 Verification experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
7.5.4 Correlation analyses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
7.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
7.6.1 Quantitative analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
7.6.2 Qualitative analysis on demographics . . . . . . . . . . . . . . . . . . . . 89
7.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Chapter 8: A Summary of Deep Unsupervised Modeling Strategies for Understanding Vocal Entrainment Patterns in Child-Inclusive Interactions . . . . . . . . . . . . . . . . . . 92
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
8.2 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
8.2.1 Fisher Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
8.2.2 ADOSMod3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
8.2.3 Remote-NLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
8.3 Modeling Entrainment using Deep Unsupervised Learning . . . . . . . . . . . . . 97
8.3.1 Conversational turn modeling for quantifying entrainment . . . . . . . . . 97
8.3.2 Contrastive Learning Based Training . . . . . . . . . . . . . . . . . . . . . 99
8.3.3 Modeling information flow across conversational turns . . . . . . . . . . . 101
viii
8.3.4 Modeling entrainment with reversed gradient reconstruction . . . . . . . . 102
8.3.5 Joint Interlocutor modeling with shared encoder and decoder . . . . . . . . 103
8.4 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
8.4.1 Preprocessing, feature extraction and baselines . . . . . . . . . . . . . . . 103
8.4.2 Implementation details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
8.4.3 Verification experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
8.4.4 Correlation analyses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
8.5 Experimental results and discussion . . . . . . . . . . . . . . . . . . . . . . . . . 105
8.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Chapter 9: Conclusion and future directions . . . . . . . . . . . . . . . . . . . . . . . . . . 109
9.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
9.2 Future directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
List of Tables
2.1 Demographic details of ADOS dataset . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 Mean F1-score (%) treating child age as domain shift . . . . . . . . . . . . . . . . 17
2.3 Mean F1-score (%) treating collection center as domain shift . . . . . . . . . . . . 18
3.1 Session-level statistics of child-adult corpora. . . . . . . . . . . . . . . . . . . . . 24
3.2 Number of trainable parameters for the pre-training experiments based on unfrozen
transformer layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3 Child-adult classification F1 score using W2V2. (PT corresponds to pre-training
and the following number represents the number of layers used for pre-training.) . . 27
3.4 Child-adult classification F1 score using WavLM pre-training. (PT corresponds
to pre-training and the following number represents the number of layers used for
pre-training.) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.1 Demographic details of ADOS dataset . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2 Top 5 features based on absolute correlation values for static functionals of different feature categories with CSS (the indices for MFCCs and MFBs are shown in
parentheses) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.3 Top 5 features based on absolute correlation values for dynamic functionals of
different feature categories with CSS (the indices for MFCCs and MFBs are shown
in parentheses) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.1 Demographic details of ADOS dataset . . . . . . . . . . . . . . . . . . . . . . . . 46
5.2 F1-score for ASD diagnosis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.1 Demographic details of ADOSMod3 dataset . . . . . . . . . . . . . . . . . . . . . 64
6.2 Classification experiment for real vs fake sessions . . . . . . . . . . . . . . . . . . 64
6.3 Correlation experiment between CED and clinical scores relevant to ASD (bold
figures imply statistical significance, p < 0.05 ) (CP: child to psychologist, PC: psychologist to child) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
x
7.1 ADOSMod3 dataset demographic details . . . . . . . . . . . . . . . . . . . . . . 79
7.2 Classification accuracy (%) for real vs fake sessions for the unimodal speech validation
experiment (averaged over 25 runs, standard deviation shown in parenthesis). . . . 82
7.3 Classification accuracy (%) for real vs fake sessions for the unimodal text validation
experiment (averaged over 25 runs, standard deviation shown in parenthesis). . . . 83
7.4 Classification accuracy (%) for real vs fake sessions for the cross-modal validation experiment (averaged over 25 runs, standard deviation shown in parenthesis). . . . . 83
7.5 Correlation experiment between CLED and clinical scores relevant to ASD with
p values in parenthesis (bold figures imply statistical significance after correction,
p < 0.006) (CP: child to psychologist, PC: psychologist to child) . . . . . . . . . 84
8.1 Classification accuracy (%) for real vs fake sessions (averaged over 25 runs, standard deviation shown in parenthesis). . . . . . . . . . . . . . . . . . . . . . . . . . 106
8.2 Correlation experiment between variants of CLED and clinical scores relevant to
ASD with p values in parenthesis (bold figures imply statistical significance after
correction, p < 0.006) in the ADOSMod3 dataset (CP: child to psychologist, PC: psychologist to child) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
8.3 Correlation experiment between variants of CLED and clinical scores relevant to
ASD with p values in parenthesis (bold figures imply statistical significance after
correction, p < 0.01) in the Remote-NLS dataset (CP: child to parent, PC: parent to
child) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
List of Figures
1.1 An overview of the interaction analysis pipeline . . . . . . . . . . . . . . . . . . . 4
2.1 Speech processing pipeline for feature extraction . . . . . . . . . . . . . . . . . . 10
2.2 Training (top) and Testing (bottom) Network Architecture . . . . . . . . . . . . 12
2.3 t-SNE plots of the most discriminative 2 components of the generator output corresponding to the classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1 Schematic overview of the proposed two-step recipe for child-adult classification . 21
3.2 t-SNE plots of the most discriminative 2 components of the embedding space corresponding to the classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3 Gender based Child-adult classification F1 scores. . . . . . . . . . . . . . . . . . . 29
3.4 Age based Child-adult classification F1 scores. . . . . . . . . . . . . . . . . . . . 31
4.1 Static and dynamic functionals . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.2 Classification experiment results: macro-averaged F1 scores vs. different orders of
dynamic functionals used . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.1 F1-score for ASD diagnosis with fused features . . . . . . . . . . . . . . . . . . . 54
5.2 Two way analysis of variance of F1 scores across age-groups and gender . . . . . . 55
5.3 Comparison of different coordination measures across subtasks . . . . . . . . . . . 56
6.1 Architecture for CED extraction. . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.2 Absolute values of CED across age and gender from ADOSMod3 . . . . . . . . . 67
6.3 Attention activations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
7.1 Three main ideas are employed to address the challenges in modelling entrainment
in child-inclusive dyadic interactions: a) Understanding entrainment in conversational exchanges, b) Quantifying entrainment using unsupervised, contrastive design, c) Validation by studying the relevance of the proposed entrainment measure
with respect to clinically meaningful behavioral scores. . . . . . . . . . . . . . . . 71
7.2 Schematic Diagram for computing CLED: Pretraining phase (Left), Evaluation
phase (Right) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
7.3 Variation of CLED across gender . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
7.4 Variation of CLED across children with and without Autism diagnosis . . . . . . . 90
8.1 Schematic diagram for computing entrainment measure from conversational turns . 98
8.2 Contrastive pretraining diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
8.3 Modeling information flow across conversational turns . . . . . . . . . . . . . . 101
8.4 Modeling information flow across conversational turns with gradient reversed reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
8.5 Joint interlocutor modeling with shared encoder and decoder . . . . . . . . . . . . 103
Abstract
Objective understanding of the dynamics of child-inclusive interactions requires machines to discern who, when, how and where someone is talking, and also to analyze the content of the interactions automatically. Children's speech differs significantly from that of adults, making automatic understanding of child speech a notably more challenging task than for adult speech. An additional layer of complexity arises when the interactions relate to clinical or mental health applications for children, as the underlying conditions may give rise to speech or language abnormalities. Robust behavioral representation learning from the rich multimodal content of child-inclusive interactions can help in extracting meaningful insights towards addressing these problems. However, this is particularly challenging due to the vast heterogeneity and contextual variability present in child-inclusive interactions, and also due to the scarcity of reliably labelled datasets. In this thesis, I develop methods for automated understanding of child speech, which can be broadly divided into several subtasks: detecting child speech, classifying speakers, converting child speech to text, and automatically inferring behavioral constructs using low-level vocal and language-based feature descriptors. Specifically, I focus on the child-adult speaker classification task and on modeling interactional dynamics through quantifying interpersonal synchrony from child-inclusive interactions. The major challenges in child-adult speaker classification arise from two main sources: first, large within-class variability due to age, gender and symptom severity, and second, the lack of reliable training data to address these variabilities. I first adopted an adversarial learning based approach to resolve the issues related to variability from age and signal collection site. To avoid the need for manual annotation of speaker identities in vocal frames, I relied on self-supervised learning to leverage unlabelled child speech to improve child-adult classification performance. For modeling interpersonal synchrony in child-inclusive interactions, I first utilized knowledge-driven, heuristics-based time-series analysis to extract meaningful information for understanding behavioral traits of children with and without an autism diagnosis. I extended the experiments across multiple modalities to evaluate whether synchrony metrics can capture complementary information that may be present across modalities. Furthermore, I reported and analyzed the possible limitations of the proposed metrics and finally introduced enhanced data-driven neural network based approaches to address the issues with the previous studies. For each of these use cases, I evaluate the proposed framework in the relevant child-inclusive interaction domain, report the results by comparing scores under different conditions, and investigate and analyze the results for a holistic understanding.
Chapter 1
Introduction
1.1 Motivation
Computational investigation of human behavior has emerged as one of the popular research topics in sociology and psychology [1], [2]. The expression and interpretation of human behavior play an integral role in shaping social behavior and interactions. Due to its significance and broad scope, human behavior modeling using computational approaches is gradually transforming into an interdisciplinary research area involving expertise in machine learning, signal processing and data analytics. This emerging domain encompasses computational techniques designed to measure, analyze, and model human behavior and interactions. Human interactions are intricate processes characterized by the exchange of information among participants; modeling these interactions aims to dissect the complex expressions of behavior across various modes of human communication, such as speech [3], [4], language [2], [5], and visual cues [6], often represented by real-world signals, and to correlate them with abstract and intricate behavioral constructs relevant to areas of interest. These technologies and algorithms facilitate the extraction of behavioral insights applicable across diverse domains, from healthcare to commerce.
Humans are highly receptive to the information cues conveyed during an interaction and often tend to adjust their responses based on these cues in an attempt to facilitate a socially meaningful conversation. As a result, the behaviors of each participant are influenced by continuous feedback mechanisms, whether arising from interactions between participants or in response to external environmental factors. Understanding these expressed and perceived behaviors in human interactions can be helpful in the identification and quantification of various typical and atypical behaviors, supporting informed decision-making. Prior research has shown promising potential for the application of computational behavior modeling in clinical domains, including couples therapy [3], [4], [7], autism spectrum disorder [8]–[10] and addiction counseling [11], [12], through the creation of automated support systems that provide psychotherapists with objective measures for diagnostics, intervention assessment and long-term monitoring. This further underscores the need for developing robust computational frameworks to process and analyze the nuances of interpersonal interactions for efficient behavioral analysis.
Although decades of significant effort in interaction analysis have yielded notable technological innovations, the research community has achieved limited success in developing technologies for child-inclusive interactions [13]. Processing signals from children is significantly harder than processing other human interactions due to the wide variability associated with the former. In the recent past, researchers have shown growing interest in understanding child speech, considering the commercial potential of developmental health and entertainment applications such as digital assistants and interactive services. The rapid progress of machine learning based computational methodologies has shown promising potential to address these issues; in fact, a number of such challenges have been put forward to be addressed from multiple perspectives, including speech production, machine learning, signal availability and annotation reliability. Previous works on analyzing child-inclusive interactions have mostly focused on data collected from child interactions with animated agents or search engines for reading assignments and training applications. Due to the constrained nature of the above mentioned interactions, there is limited scope for understanding and deriving meaningful insights from child speech in those cases.
In this thesis, I develop speech and language processing frameworks for naturalistic conversations between a child and an adult, usually a clinician or a parent [14], [15]. These interactions are usually goal driven (for a clinical diagnosis) but span multiple topics and activities which are likely to evoke spontaneous reactions from the child, and are thus very useful for understanding the latent state of the child involved. Specifically, I consider child-inclusive clinical interactions under three circumstances: first, interactions from autism diagnostic sessions; second, interactions designed to track change in a child's behavior; and finally, interactions designed to assess the child's language development skills. These dyadic conversations are rich with lexical information ("what is being said"), vocal patterns ("how it is being said") and other nonverbal cues. I have particularly focused on vocal speech patterns and spoken language to extract meaningful feature descriptors for interaction modeling.
An overview of the steps associated with computational behavior modeling in child-inclusive clinical interactions is provided in Fig. 1.1. Previous works on computational interaction processing have mostly focused on two areas: first, core speech and language processing frameworks capable of extracting speaker boundaries, transcripts and paralinguistic markers from raw signals [16]–[18], and second, leveraging the transcripts and speaker information derived in the former to study the behavioral characteristics [19], [20] of the individuals engaged in the conversations. While the latter research area is more directly aligned with the final objective of interaction analysis, namely extracting meaningful information for informed decision making, it is important to note that conducting behavioral analysis requires the computation of feature descriptors involving multiple signal processing modules, and poor performance of any of these modules may degrade the overall interaction processing performance, leading to inaccurate conclusions.
Computational methods to analyze interpersonal interactions involve end-to-end speech and language processing pipelines that go from raw audio features to clinically meaningful behavioral descriptors. An important component of this pipeline is speaker diarization [21], [22], which answers the question of "who spoke when?". In the context of child-inclusive clinical interactions, I formulate the speaker diarization task in a supervised setup as child-adult speaker classification. First, I train a child-adult classification system to address two primary sources of variability, arising from age-related developmental aspects of children and from varying background conditions, often influenced by where and how the data were collected. Further, I address the issue of reliable labelled data availability by leveraging unlabelled child speech through additional pre-training, and the results reveal that even domain-specific pre-training with unlabelled child speech can be helpful in improving speaker classification performance.

Figure 1.1: An overview of the interaction analysis pipeline (block diagram: a child-inclusive interaction, carrying vocal patterns, lexical cues and visual content, is processed by speech/language signal processing modules answering "who spoke when?", "what is being said?" and "how is it being said?" to yield paralinguistic markers, transcripts and speaker boundaries; these feed computational behavior modeling of the latent state, covering diagnosis, interaction synchrony, verbal productivity and interaction outcome)
Quantifying behavioral synchrony [23] can inform clinical diagnosis, long-term monitoring, and individualized interventions in neurodevelopmental disorders characterized by deficits in communication and social interaction, such as autism spectrum disorder. Prior work related to behavioral synchrony in the ASD domain is somewhat limited, focusing largely on individual modalities, such as vocal prosody or facial movements. To understand multimodal synchrony patterns, it is also important to consider the coordination and interplay between communication modalities within an individual, in addition to across individuals. For modeling interpersonal synchrony, I first investigate three distinct measures of behavioral synchrony in the speech and language patterns of interactions. Specifically, I focus on quantifying synchrony across different information modalities related to voice, articulation, and language through the joint consideration of prosody, acoustic spectral features, and language patterns. Finally, prompted by the limitations of the above approaches, I develop data-driven approaches to quantify synchrony. I hypothesize that both local and global context play a key role in modeling synchrony information and use a transformer based network to validate this hypothesis. I also explore different strategies for employing contrastive learning in an unsupervised manner to model dyadic interactions and quantify interpersonal entrainment across speech and language modalities.
1.2 Contribution
1.2.1 Representation Learning
With regard to behavioral feature extraction and analysis of child-inclusive dyadic interactions, prior works have primarily relied on human-annotated data segmentation, which is expensive and time-consuming to obtain, especially for large corpora. Analysis of child speech is more challenging than that of adult speech because of the wide variability and idiosyncrasies associated with children. Computational methods to analyze social interaction sessions require an end-to-end speech and language processing pipeline that goes from raw audio to clinically meaningful behavioral features. An important component of this pipeline is the ability to automatically detect who is speaking when, i.e., to perform child-adult speaker classification. First, I apply domain adversarial training to enhance child-adult speaker classification performance in autism diagnosis sessions. I use two different methods (GAN and GR) for learning domain-invariant features, and show that domain adversarial training improves speaker classification performance by a significant margin. Further, I leverage unlabelled child speech in pre-training for developing speaker-discriminative embeddings, motivated especially by the vast inherent heterogeneity in the data arising from developmental differences. I experimentally substantiate the effectiveness of this method for downstream child-adult speaker classification using W2V2 and WavLM, and report over 13% and 9% relative improvement over the base models in terms of F1 scores on two datasets, respectively.
1.2.2 Modeling Interpersonal Synchrony
Interpersonal synchrony [24], [25] in interactions can be broadly viewed as an individual's reciprocity, coordination or adaptation to the other participant(s) in, and during, the interaction. Since interpersonal synchrony in dyadic conversations provides insights toward understanding behavioral dynamics, it can potentially aid scientific and clinical studies of interactions in the domain of ASD, which is characterized by differences in social communication and interaction. First, I develop three distinct knowledge-driven measures for quantifying synchrony across different information modalities related to voice, articulation and language, through the joint consideration of prosody, acoustic spectral features and language patterns. Further, I employ more direct data-driven strategies to extract entrainment related information from raw speech features; these are formulated in a way that inherently considers both short-term and long-term context. I develop a context-aware model for computing entrainment, training the model to learn the influence of the speakers on each other. I also develop contrastive learning based approaches across vocal and language modalities to quantify entrainment in child-inclusive dyadic interactions.
1.3 Thesis Outline
Here is an outline of the remainder of the thesis.
Chapter 2: In this chapter, I train a child-adult classification system using domain adversarial training to address the sources of variability arising from the age of the child and the data collection location. I use two methods, generative adversarial training with inverted label loss and a gradient reversal layer, to learn speaker embeddings invariant to the above sources of variability, and analyze the conditions under which the proposed techniques improve over conventional learning methods.
Chapter 3: In this chapter, I address the problem of detecting who spoke when in child-inclusive spoken interactions, i.e., automatic child-adult speaker classification. I investigate the impact of additional pre-training with more unlabelled child speech on child-adult classification performance. I pre-train the model on child-inclusive interactions, following two recent self-supervision algorithms, Wav2vec 2.0 and WavLM, with a contrastive loss objective.
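To make the downstream setup concrete, the sketch below mean-pools frame-level embeddings from a pre-trained wav2vec 2.0 backbone and feeds them to a small child/adult classification head. This is a minimal sketch only: the HuggingFace checkpoint name, the pooling strategy and the head sizes are illustrative assumptions, not the exact configuration used in Chapter 3.

    # Minimal sketch: pre-trained wav2vec 2.0 backbone + small child/adult head.
    # Checkpoint name, pooling and head sizes are illustrative assumptions.
    import torch
    import torch.nn as nn
    from transformers import Wav2Vec2Model

    class ChildAdultClassifier(nn.Module):
        def __init__(self, checkpoint="facebook/wav2vec2-base", n_classes=2):
            super().__init__()
            self.backbone = Wav2Vec2Model.from_pretrained(checkpoint)
            hidden = self.backbone.config.hidden_size
            self.head = nn.Sequential(nn.Linear(hidden, 64), nn.ReLU(),
                                      nn.Linear(64, n_classes))

        def forward(self, waveform):  # waveform: (batch, samples) at 16 kHz
            frames = self.backbone(waveform).last_hidden_state  # (batch, T, hidden)
            return self.head(frames.mean(dim=1))                # child/adult logits

    model = ChildAdultClassifier()
    logits = model(torch.randn(2, 16000))  # two 1-second dummy segments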
Chapter 4: In this chapter, I explore short-term dynamic functionals of speech features, both within and across speakers, to understand whether local changes in speech provide information toward phenotyping of ASD. I compare the contributions of static and dynamic functionals representing conversational speech toward the clinical diagnosis state. The results show that predictions obtained from a combination of dynamic and static functionals have comparable or superior performance to predictions obtained from static speech functionals alone. I also analyze the relationship between speech production and ASD diagnosis through correlation analyses between speech functionals and manually derived behavioral codes related to autism severity. The experimental results support the notion that dynamic speech functionals capture complementary information which can facilitate enriched analysis of clinically meaningful behavioral inference tasks.
Chapter 5: In this chapter, three different objective measures of interpersonal synchrony are evaluated across vocal and linguistic communication modalities: for vocal prosodic and spectral features, Dynamic Time Warping Distance (DTWD) and Squared Cosine Distance of (feature-wise) Complexity (SCDC) are used, and for lexical features, Word Mover's Distance (WMD) is applied to capture behavioral synchrony. It is shown that these interpersonal vocal and linguistic synchrony measures capture complementary information that helps in characterizing overall behavioral patterns.
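As a small illustration of the DTWD idea, the sketch below computes a length-normalized dynamic time warping distance between two one-dimensional feature contours (for example, per-turn pitch tracks of the two interlocutors); a smaller distance is read as higher vocal synchrony. It is a plain-NumPy sketch under simplified assumptions and does not reproduce the exact features or normalization used in Chapter 5.

    # Length-normalized DTW distance between two 1-D contours (e.g., F0 tracks),
    # used here only as a rough stand-in for the DTWD synchrony measure.
    import numpy as np

    def dtw_distance(x, y):
        n, m = len(x), len(y)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = abs(x[i - 1] - y[j - 1])
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m] / (n + m)  # normalize by a bound on the warping path length

    child_f0 = np.array([180.0, 185.0, 200.0, 210.0, 190.0])
    adult_f0 = np.array([120.0, 125.0, 140.0, 150.0, 135.0])
    print(dtw_distance(child_f0, adult_f0))  # lower value => more similar contours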
Chapter 6: In this chapter, I propose a context-aware approach for measuring vocal entrainment in dyadic conversations. I use conformers (a combination of convolutional networks and transformers) to capture both short-term and long-term conversational context and to model entrainment patterns in interactions across different domains. Specifically, I use cross-subject attention layers to learn intra- as well as interpersonal signals from dyadic conversations.
Chapter 7: In this chapter, I employ a contrastive learning approach to learn a feature representation capable of encoding entrainment. Since entrainment can be exhibited across multiple modalities, I explore modeling entrainment across speech and language modalities through both uni-modal and cross-modal formulations. Further, I propose measures to quantify entrainment based on the learnt feature embeddings. The proposed measures are validated by classifying real (consistent) versus fake (inconsistent/shuffled) conversational turns. I then demonstrate the application of these measures in a clinical interaction domain to understand the behavioral characteristics of children with autism.
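To make the contrastive formulation concrete, the sketch below applies an InfoNCE-style loss that pulls embeddings of genuinely consecutive conversational turns together while treating turn pairs drawn from other conversations in the batch as negatives. It is a generic, illustrative sketch under assumed choices (embedding dimension, temperature, in-batch negatives) and is not the exact objective used in Chapter 7.

    # Illustrative InfoNCE-style loss over turn pairs: row i of turn_a and turn_b
    # come from the same (real) exchange; other rows act as shuffled negatives.
    import torch
    import torch.nn.functional as F

    def turn_contrastive_loss(turn_a, turn_b, temperature=0.1):
        a = F.normalize(turn_a, dim=-1)        # (batch, dim) turn embeddings
        b = F.normalize(turn_b, dim=-1)
        logits = a @ b.t() / temperature       # pairwise cosine similarities
        targets = torch.arange(a.size(0))      # the matching index is the positive
        return F.cross_entropy(logits, targets)

    loss = turn_contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))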
Chapter 8: In this chapter, I model vocal entrainment in dyadic child-inclusive conversations to understand and analyze behavioral traits of children with and without an autism diagnosis. Specifically, I explore and summarize contrastive learning based unsupervised strategies for learning entrainment-related representations from speech features. I validate the proposed measures by using them to differentiate real conversations from simulated, shuffled ones. Furthermore, I illustrate their utility in modeling various behaviors relevant to autism symptoms through correlation experiments.
Chapter 9: Finally, I share a concluding summary based on the research so far and briefly present plans for future work.
Chapter 2
Child-adult classification using adversarial learning
2.1 Introduction
Autism spectrum disorder (ASD) refers to a group of neurodevelopmental disorders characterized by abnormalities in speech and language [26]–[28], and is often diagnosed in children using semi-structured dyadic interactions with a trained clinician. The reported ASD prevalence among children in the US has been steadily increasing, from 1 in 150 [29] to 1 in 59 [30]. Computational processing of the participants' speech and language during such child-adult interactions has shown potential in recent years for supporting and augmenting human perceptual and decision making capabilities [31]–[33].

However, previous works utilized manual speaker labels and transcripts for behavioral feature computation, which can be expensive and time-consuming to create. Hence, feature extraction at scale depends on a robust speech and language pipeline (Figure 2.1). An important component of the pipeline is speaker diarization, which answers the question "who spoke when?". In the context of ASD diagnostic assessment sessions, diarization can be approached as (supervised) child-adult speaker classification. Training a child-adult classification system is often not straightforward due to multiple sources of variability in the data. Among others, two primary sources of variability arise from developmental aspects of child speech [34] and from varying background conditions, often influenced by where and how the data are collected. In this work, we train a child-adult classification system using domain adversarial training [35], [36] to address these sources of variability.

Figure 2.1: Speech processing pipeline for feature extraction (speech acquisition, voice activity detection into speech/non-speech, speaker classification into child/adult with speaker diarization and role assignment, automatic speech recognition, feature extraction, and computational behavior modeling)
A generative adversarial network (GAN) is composed of two neural networks pitted against each other, termed the generator and the discriminator. These networks play a minimax game, in which the generator aims to create fake samples from a noise vector of some arbitrary distribution in order to confuse the discriminator, while the discriminator tries to distinguish between real and fake samples. Domain adversarial learning can be formulated as a variant of the GAN, where the noise vectors are replaced with target data and the (domain) discriminator network tries to discriminate whether a sample belongs to the source or the target domain. Hence, the generator network learns to extract domain-invariant representations. The speaker classifier is trained on the generator outputs in a multi-task manner. In this work, we use two different methods of domain adversarial training, namely Gradient Reversal (GR) [36] and Generative Adversarial Networks (GAN) [37]. GR tries to learn domain-invariant features by reversing the gradients coming from the domain discriminator, while GAN aims to achieve the same by training with inverted domain labels. The full network configuration, comprising the generator (feature extractor), discriminator and speaker classifier, is shown in Figure 2.2.
2.2 Background
2.2.1 Speaker Diarization in Autism Diagnosis Sessions
Although there exists a significant amount of work on speaker diarization of broadcast news and meetings, interest in spontaneous and real-life conversations has emerged only recently. Diarization solutions for child speech (both child-directed and adult-directed) initially looked at traditional feature representations (MFCCs, PLPs) [38] and speaker segmentation/clustering methods (generalized likelihood ratio, Bayesian information criterion) [21], [39]. In [21], the authors introduced several methods for working with audio collected from children with autism using a wearable device. More recently, approaches based on fixed-dimensional embeddings such as i-vectors [40] and DNN speaker embeddings such as x-vectors [41] have been explored. While some of the above approaches have adapted clustering methods to child speech [41], to the best of our knowledge none of them have taken into account shifts in domain distribution that are likely to adversely impact diarization performance.
2.2.2 Domain Adversarial Learning
Domain adaptation within adversarial learning was first introduced by [36] for computer vision applications. Since then, there has been an emerging trend of using domain adversarial learning to alleviate the mismatch between training and testing data in various speech applications, including ASR and acoustic emotion recognition [42]. In [43], [44] the authors employed domain adversarial training to improve the robustness of speech recognition systems to different noise types and levels. In [45], the authors applied domain adversarial training to address the mismatch between close-talk and single-channel far-field recordings. Our motivation for applying domain adversarial learning is inspired by recent applications ([37], [46]) in speaker verification across multiple languages, where it was shown that adversarial training can be used to learn robust speaker embeddings across different conditions. We extend this concept to the task of child-adult classification from speech, where variabilities in children's linguistic capabilities and recording locations can be viewed as domain shifts that can be modeled using adversarial learning.

Figure 2.2: Training (top) and Testing (bottom) Network Architecture
2.3 Domain Adversarial Learning for Speaker classification
The main aim of this work is to efficiently distinguish between the speakers (namely, a child and an adult interlocutor) in audio recordings of diagnostic sessions from different clinical locations. Besides learning domain-invariant features by confusing the discriminator, the network must also be able to efficiently distinguish between the speakers. In this work, we show that the proposed objective can be accomplished using either a GAN based method or a GR based method.
Consider samples from the source domain $(X_s, Y_s) \in \Omega_s$ and target domain $(X_t, Y_t) \in \Omega_t$ with a common label space $Y$. During training, labels from the target domain are assumed unavailable, and the data distributions of $X_s$ and $X_t$ might differ. The goal of domain adversarial learning is to maximize the target accuracy by jointly maximizing task performance and reducing the domain shift between the source and target domains in the generator output embedding space.
In our work, we begin by training the network with source data and corresponding speaker labels to minimize the task loss. We refer to this as pre-training. Following this, the adversarial game continues: the discriminator is trained with true domain labels, and the generator is trained either with inverted domain labels (GAN) or with reversed gradients (GR), alternately, until convergence is reached.
In both methods, for every batch of data, the training is carried out in three distinct steps. In
the first step, the generator and speaker classification models are trained with true speaker labels
from the source data using the following objective:
$$\min_{G,\,C}\ \mathrm{Loss}_{\mathrm{Spk}}(X_s, Y_s) = \mathbb{E}_{x_s, y_s \sim (X_s, Y_s)} \sum_{k=1}^{2} \mathbb{1}[k = y_s]\, \log\!\big(C(G(x_s))\big) \qquad (2.1)$$
where G(.) and C(.) are the generator and classifier functions, respectively. In the second step, the
embeddings are extracted from the output layer of the generator for both source and target data
using the model trained in the previous step. The domain discriminator is now trained with the
true domain labels. This step ensures that the discriminator is well trained to distinguish between
source and target domain.
$$\min_{D}\ \mathrm{Loss}_{\mathrm{Dom}}(X_s, X_t, G) = \mathbb{E}_{x_s \sim X_s} \log\!\big(D(G(x_s))\big) + \mathbb{E}_{x_t \sim X_t} \log\!\big(1 - D(G(x_t))\big) \qquad (2.2)$$
The first and second steps are the same for both GAN and GR: they differ in the third step. For
GAN, the generator is trained with source and target data but with inverted domain labels:
$$\min_{G}\ \mathrm{Loss}_{\mathrm{Adv}}(X_s, X_t, G) = \mathbb{E}_{x_t \sim X_t} \log\!\big(D(G(x_t))\big) + \mathbb{E}_{x_s \sim X_s} \log\!\big(1 - D(G(x_s))\big) \qquad (2.3)$$
Figure 2.3: t-SNE plots of the most discriminative 2 components of the generator output corresponding to the classes. Panels: (a) Session A before adversarial training, (b) Session A after adversarial training, (c) Session B before adversarial training, (d) Session B after adversarial training.
In the case of GR, the gradients from the domain discriminator are reversed during training. In both cases, the final step ensures that the generator is trained to generate domain-invariant representations. It is important to note that the generator network weights are updated twice during adversarial training, in the first and last steps respectively.
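A minimal PyTorch rendering of these three steps for one training batch is sketched below. It assumes the domain discriminator D ends in a sigmoid, uses binary cross-entropy for the domain losses, and keeps separate optimizers for the generator/classifier, discriminator, and generator-only updates; a gradient-reversal function for the GR variant is included as well. This is an illustrative sketch under those assumptions, not the exact training code used in this chapter.

    # One batch of the three-step adversarial update (Eqs. 2.1-2.3), GAN variant.
    # Assumes D outputs sigmoid probabilities; optimizers cover (G,C), D, and G.
    import torch
    import torch.nn.functional as F

    def train_step_gan(G, C, D, opt_gc, opt_d, opt_g, x_src, y_spk, x_tgt):
        # Step 1: speaker loss on source data (Eq. 2.1).
        opt_gc.zero_grad()
        spk_loss = F.cross_entropy(C(G(x_src)), y_spk)
        spk_loss.backward()
        opt_gc.step()

        # Step 2: discriminator with true domain labels (Eq. 2.2): source=1, target=0.
        opt_d.zero_grad()
        d_src, d_tgt = D(G(x_src).detach()), D(G(x_tgt).detach())
        dom_loss = F.binary_cross_entropy(d_src, torch.ones_like(d_src)) + \
                   F.binary_cross_entropy(d_tgt, torch.zeros_like(d_tgt))
        dom_loss.backward()
        opt_d.step()

        # Step 3: generator with inverted domain labels (Eq. 2.3); D is not updated here.
        opt_g.zero_grad()
        g_src, g_tgt = D(G(x_src)), D(G(x_tgt))
        adv_loss = F.binary_cross_entropy(g_src, torch.zeros_like(g_src)) + \
                   F.binary_cross_entropy(g_tgt, torch.ones_like(g_tgt))
        adv_loss.backward()
        opt_g.step()
        return spk_loss.item(), dom_loss.item(), adv_loss.item()

    class GradReverse(torch.autograd.Function):
        """GR variant: identity in the forward pass, sign-flipped gradient backward."""
        @staticmethod
        def forward(ctx, x):
            return x.view_as(x)
        @staticmethod
        def backward(ctx, grad_output):
            return -grad_output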
2.4 Experimental setup
2.4.1 Dataset
The ADOS-2 dataset is composed of semi-structured activities involving a child and an interlocutor who is trained to examine behaviors related to ASD. A typical ADOS-2 session lasts between 40 and 60 minutes and consists of varying subtasks designed to elicit responses from the child under different social and interactive circumstances. In this work, we look at administrations of Module 3, which are intended for verbally fluent children. Further, we restrict ourselves to the Emotions and Social Difficulties & Annoyance subtasks, since they elicit spontaneous speech from the child under significant cognitive load. In the Emotions subtask, the child is asked to recognize different objects that trigger various emotions within them and to share their perceptions of the same. The Social Difficulties & Annoyance subtask explores the child's thoughts regarding various social problems faced at home or school. The dataset consists of recordings from 165 children (86 ASD, 79 Non-ASD)
collected from two different clinical centers: the University of Michigan Autism and Communication Disorders Center (UMACC) and the Cincinnati Children's Medical Center (CCHMC). Further details are presented in Table 2.1.

Table 2.1: Demographic details of the ADOS dataset
  Age (years):          Range: 3.58-13.17; (mean, std): (8.61, 2.49)
  Gender:               123 male, 42 female
  Non-verbal IQ:        Range: 47-141; (mean, std): (96.01, 18.79)
  Clinical diagnosis:   86 ASD, 42 ADHD (Attention Deficit Hyperactivity Disorder),
                        14 mood/anxiety disorder, 12 language disorder,
                        10 intellectual disability, 1 no diagnosis
  Age distribution:     Cincinnati: ≤5 yrs: 7, 5-10 yrs: 52, ≥10 yrs: 25
                        Michigan:   ≤5 yrs: 11, 5-10 yrs: 42, ≥10 yrs: 28
2.4.2 Features and neural network architecture
In all experiments we used 23-dimensional MFCC features, mean and variance normalized at the session level. The features were extracted using the Kaldi toolkit (https://github.com/kaldi-asr/kaldi) with a frame length of 40 ms and a frame shift of 20 ms. Features were spliced with a context of 15 frames, yielding samples of dimension 31×23. Consecutive samples were chosen at intervals of 15 frames in order to minimize overlap during DNN training.
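The splicing step can be written compactly: each training sample stacks the current frame with 15 context frames on either side, and a new sample is taken every 15 frames to limit overlap. The sketch below assumes a NumPy array of per-frame MFCCs and mirrors only this splicing logic, not the Kaldi extraction itself.

    # Sketch: splice 23-dim MFCC frames with +/-15 frames of context -> (31, 23)
    # samples, taking a new sample every 15 frames, as described above.
    import numpy as np

    def splice_frames(mfcc, context=15, hop=15):
        # mfcc: (num_frames, 23), already mean/variance normalized per session
        samples = [mfcc[c - context : c + context + 1]
                   for c in range(context, len(mfcc) - context, hop)]
        return np.stack(samples)  # (num_samples, 31, 23)

    session = np.random.randn(1000, 23)   # dummy session of 1000 frames
    print(splice_frames(session).shape)   # -> (65, 31, 23)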
The generator G(.) consists of a bidirectional long short-term memory (BLSTM) layer followed by four fully connected (FC) layers of 128, 64, 16 and 16 neurons respectively, as shown in Figure 2.2. Certain settings have less training data than others; hence the number of parameters was kept small to prevent over-fitting. The speaker classifier C(.) consists of two dense layers with 16 neurons each, while the domain discriminator D(.) consists of one dense layer of 16 neurons. Rectified linear unit (ReLU) activations were used for all layers, and both dropout (p = 0.2) and batch normalization were applied to every hidden layer for regularization.
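A minimal PyTorch sketch of this configuration follows: a BLSTM followed by FC layers of 128, 64, 16 and 16 units in the generator, a two-layer 16-unit speaker classifier, and a single 16-unit domain discriminator, with ReLU, dropout (p = 0.2) and batch normalization on the hidden layers. The BLSTM hidden size and the last-timestep pooling are assumptions not fixed by the text.

    # Sketch of the generator / speaker classifier / domain discriminator stack.
    # BLSTM hidden size and last-timestep pooling are illustrative assumptions.
    import torch
    import torch.nn as nn

    def fc_block(d_in, d_out, p=0.2):
        return nn.Sequential(nn.Linear(d_in, d_out), nn.BatchNorm1d(d_out),
                             nn.ReLU(), nn.Dropout(p))

    class Generator(nn.Module):
        def __init__(self, feat_dim=23, lstm_hidden=64):
            super().__init__()
            self.blstm = nn.LSTM(feat_dim, lstm_hidden,
                                 batch_first=True, bidirectional=True)
            self.fc = nn.Sequential(fc_block(2 * lstm_hidden, 128), fc_block(128, 64),
                                    fc_block(64, 16), fc_block(16, 16))

        def forward(self, x):              # x: (batch, 31, 23) spliced MFCCs
            out, _ = self.blstm(x)
            return self.fc(out[:, -1, :])  # 16-dim embedding per sample

    speaker_classifier = nn.Sequential(fc_block(16, 16), fc_block(16, 16),
                                       nn.Linear(16, 2))
    domain_discriminator = nn.Sequential(fc_block(16, 16), nn.Linear(16, 1),
                                         nn.Sigmoid())

    emb = Generator()(torch.randn(4, 31, 23))
    print(speaker_classifier(emb).shape, domain_discriminator(emb).shape)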
2.4.3 Baselines
We compare the performance of our systems with two reference systems. The first system (Pre-Train) is composed of only the feature generator and the speaker classifier blocks. This system is trained on source data and directly tested on target data, the goal being to check whether domain adversarial training provides any improvement over pre-training alone. The second model uses the same architecture, except that the training data is augmented with target domain data. Since target labels are not available during domain adversarial training, this system (Upper-Bound) serves as an upper bound on performance.
2.4.4 Cross-Domain Design
To address the variability resulting from child age and location differences, we designed two sets of experiments. First, we partitioned the data according to age groups and chose the oldest and youngest groups from both locations as the source and target data. In Exp 1, we selected sessions of kids aged ≥10 yrs as the source and sessions of kids aged ≤5 yrs as the target data. In Exp 2, we reversed the source and target data and repeated the same experiment to address domain shift in the other direction.
Second, we divided the sessions based on their locations. To control for variability sources, we further divided the sessions from each location into three age groups (≤5 yrs, 5-10 yrs, ≥10 yrs) and conducted separate experiments within each group. In Exp 3, for each age group we considered recordings from Cincinnati as source data and recordings from Michigan as target data. In Exp 4, we reversed the source and target data and conducted the same experiment.
We check for complementary information in embeddings extracted from GAN and GR using
score fusion and embedding fusion. For the score fusion system, we estimate class distribution
for a test sample by computing posterior means from GAN and GR models. For the embedding
fusion system, we extract embeddings from the output of the generator block for both source and
target data for GAN and GR. We then concatenate GAN and GR embeddings and train a separate
neural network model with similar architecture to the GAN and GR models, using the source data.
Table 2.2: Mean F1-score (%) treating child age as domain shift
Systems Exp 1(%) Exp 2(%)
Pre-Train 73.40 63.69
GAN 78.27 71.21
GR 78.53 72.26
Score Fusion 78.86 71.61
Embed. Fusion 78.38 71.95
Upper-Bound 85.65 86.29
Finally, the fused embeddings of the target data are fed to the trained network to check classification
performance.
For all experiments, we update the model weights using Adam optimizer (lr = 0.001, β1 =
0.9, β2 = 0.999, ε = 1e−8) to minimize categorical cross-entropy loss. Accuracy on a set of
held-out sessions from the source corpora is used for early stopping during both pre-training and
domain adversarial training. During evaluation, we discard the domain discriminator part. The
23-dimensional features from the audio session are fed to the network consisting of the generator
G(.) and the speaker classifier C(.) to estimate speaker labels at sample-level. Since many sessions
in our corpus contain imbalanced class distributions (more samples from adult than child), we
estimate classification performance using the mean (unweighted) F1-score.
2.5 Results and Analysis
From Tables 2.2 and 2.3, we observe that both GAN and GR outperform the baselines in age-based and location-based experiments. In general, GR performs slightly better than GAN in a
majority of settings. Among the age-based experiments, we observe that Exp 2, which uses sessions of kids aged ≥10yrs as target data, shows degraded accuracy for all models. A possible reason is that
older kids with well-developed vocal tract and speaking skills are harder (i.e., more confusable) for
the model to discriminate from adult speakers. Interestingly, domain adaptation returns a greater
relative improvement over pre-training in Exp 2 (13.45%) than Exp 1 (7.43%).
Table 2.3: Mean F1-score (%) treating collection center as domain shift
Systems Exp 3(%) Exp 4(%)
≤5yrs 5-10yrs ≥10yrs ≤5yrs 5-10yrs ≥ 10yrs
Pre-Train 79.55 79.23 67.69 82.12 78.16 72.68
GAN 82.14 80.32 73.32 85.03 82.32 76.72
GR 81.74 80.60 73.57 84.53 82.96 76.61
Score Fusion 82.13 80.64 73.46 85.21 83.20 76.85
Embed. Fusion 82.39 80.31 73.19 82.72 82.87 75.33
Upper-bound 87.72 87.56 86.74 90.67 89.47 87.80
Among the location-based experiments, the age group ≥10 yrs possibly represents the largest
domain shift (on the basis of Pre-Train vs Upper-Bound performances). Similar to the age-based
experiment, domain adversarial learning returns the largest relative improvement for kids ≥10
yrs. Interestingly, improvements in adversarial learning for kids in 5-10yrs age group are different
in Exp 3 and Exp 4. This hints that domain shifts (in this age group) are currently modeled to
different extents by GAN and GR, indicating that different modeling techniques should be explored
to address this issue. Score fusion performs the best among all the proposed methods, suggesting
the presence of complementary information between GAN and GR methods.
As a qualitative analysis, we present t-SNE visualizations of the generator outputs for target
data from two sessions of Exp 4 in Figure 2.3. We plot the embeddings before and after GAN
training. In both cases, it is evident from the plots that pre-trained embeddings exhibit confusion between child and adult classes, while GAN training increases the discriminative information
between them.
2.6 Conclusion
Previous studies have established the potential of adversarial learning for addressing domain mismatch. In this work, we have applied domain adversarial training to enhance the speaker classification performance in autism diagnosis sessions. We have used 2 different methods (GAN and
GR) for learning domain invariant features, and show that domain adversarial training improves the
speaker classification performance by a significant margin. We further improved performance by fusing at the embedding and score levels. While our proposed approaches provide improvements over the baseline, the estimated upper-bound performance implies there is still significant
room for improvement. In the future, we would like to extend adversarial learning to different
GAN variants and tasks in the speech pipeline, for example, child ASR.
Chapter 3
Child-adult classification using self-supervised learning
3.1 Introduction
Autism Spectrum Disorder (ASD) is a neuro-developmental disorder, characterized by deficits in
social and communicative abilities along with restrictive repetitive behavior [28], [47]. Individuals
with ASD tend to show symptoms of anomalies in language, non-verbal comprehension, expressions and vocal prosody patterns [48], [49]. In the United States, the prevalence of ASD in children
has steadily increased from 1 in 150 [29] in 2002 to 1 in 44 in 2022. It is critical to enable early ASD diagnosis so that timely interventions can be provided. One of the most common observation tools supporting ASD diagnostic and intervention efforts is the clinically administered, semi-structured dyadic interaction between the child and a trained clinician [14], [50]. Computational analysis of
such interactions provides evidence-driven opportunities for the support of behavioral stratification
as well as diagnosis and personalized treatment.
However, with regard to behavioral feature extraction and analysis for these dyadic interactions, prior works have primarily relied on human-annotated data segmentation by speaker labels,
which is expensive and time-consuming to obtain, especially for large corpora. Computational
modeling of naturalistic conversations has gained a lot of attention in the past few decades because
of its potential in rich human behavioral phenotyping. Hence, it is desirable to conduct automatic
analysis of these interactions using signal processing and machine learning. Specifically, one fundamental module for supporting automated processing of child-adult interactions is the task of
Figure 3.1: Schematic overview of the proposed two-step recipe for child-adult classification
child-adult speech classification i.e., distinguishing the speech regions of the child from those of
an interacting adult. Analysis of child speech is more challenging than adult speech because of the
wide variability and idiosyncrasies associated with child speech [34], [51], [52]. An additional layer of complexity arises when analyzing speech in the clinical domain, as different clinical conditions may lead to unique patterns in language and speech that are challenging for current computational approaches to capture.
Training a robust child-adult classifier is challenging for two main reasons: the scarcity of reliably labeled datasets containing child speech, and the large within-class variability arising from changes in child speech with demographic factors like age, gender, and developmental status, including any clinical symptom severity [53]. Most recent works addressing the problem of speaker diarization have primarily focused on fine-tuning pre-trained models by optimizing a supervised objective. So far, Self-Supervised Learning (SSL) algorithms remain largely under-explored
for leveraging unlabelled child speech for developing speaker discriminative embeddings, especially in real-world settings such as clinical diagnostic and monitoring sessions. Specifically, there
is a limited understanding of how the performance of these models varies across children with
different demographics, including age and gender.
Contributions of this chapter: We address the above questions by evaluating the impact of including more child speech during pre-training on downstream child-adult speaker classification. We choose
Wav2vec 2.0 (W2V2)-base and WavLM-base+ as the backbone models. The detailed contributions
of this work are summarized as:
• Our work represents one of the first attempts to leverage unlabelled child speech in pre-training for developing speaker-discriminative embeddings, a setting made especially challenging by the vast inherent heterogeneity in the data arising from developmental differences.
• We experimentally substantiate the effectiveness of our method for downstream child-adult
speaker classification tasks using W2V2 and WavLM and report over 13% and 9% relative
improvement over the base models in terms of F1 scores in two datasets, respectively.
• We also illustrate and analyze the performance of the proposed method among different
subgroups of children based on demographic factors.
3.2 Background
3.2.1 Self-supervision in speech
The need for building speech processing frameworks in low/limited resource scenarios has spurred
significant efforts on unsupervised, semi-supervised and weakly supervised learning strategies to
reduce reliance on labeled datasets. The success of SSL [54] in natural language processing, notably due to its generalizability and transferability, has also inspired its adoption within the speech
domain. Early studies explored SSL in speech with generative loss [55], [56], while more recent
ones have focused on discriminative loss [57], [58] and multi-task learning objectives [59], [60].
The current approach in this realm follows a two-step process: first pre-train a model in a self-supervised manner on large amounts of unlabeled data to encode general-purpose knowledge, and
next specialize the model on various downstream tasks through fine-tuning. Past studies have reported the efficacy of SSL algorithms by leveraging the pre-trained embeddings on downstream
tasks including ASR [57], speaker verification [61], speaker identification [62], phoneme classification [63], emotion recognition [64], spoken language understanding [64], and TTS [65].
3.2.2 Child-adult classification in the ASD domain
Child-adult classification is among the more difficult tasks within speaker diarization, due to the
challenges related to "in the wild" child speech in naturalistic conversational settings, including
short speaker turns, varied noise sources and a larger fraction of overlapping speech. Early diarization solutions involving child speech used traditional feature representations (MFCCs, PLPs) [38],
[40]. In [21], the authors introduced several methods for processing audio collected from children with autism using a wearable device. Later, deep speech representations, i-vectors [21] and
x-vectors [22] were studied for this task. A variety of challenges, both from signal processing and
limited data availability, have been identified and addressed. In [53], the authors have proposed an
adversarial training strategy to address the large within- and across-age and gender variability due
to developmental changes in children. Alternatively, in [66], pre-trained x-vectors were fine-tuned
for child/adult speaker diarization using a meta-learning paradigm, namely prototypical networks.
Moreover, the role of the amount of child speech in building deep neural speaker representations
was studied in [67] and their experimental results confirm that including more child data indeed
enhances the task performance in a supervised setup.
3.3 Datasets
Our child-inclusive data come from interactions in a clinical setting, specifically obtained during
the administration of two clinical protocols related to developmental disorders. The first protocol
Table 3.1: Session-level statistics of child-adult corpora.
Dataset Duration (minutes, mean±std) Child-speaking fraction (mean±std)
Pre-training 14.05±2.08 n/a
ADOSMod3 3.23±1.61 0.46±0.18
Simons 19.05±12.86 0.40±0.08
is the gold standard Autism Diagnostic Observation Schedule (ADOS) [14], used for diagnostic
purposes. The second protocol is a recently proposed outcome-measure focused instrument Brief
Observation of Social Communication Change (BOSCC) [50] for tracking changes in social and
communicative skills during the course of treatment. A typical ADOS session lasts 40−60 minutes
and contains multiple (usually 10−15) semi-structured activities for addressing specific symptoms
related to ASD. Usually these interactions aim to elicit spontaneous responses from children under
different circumstances to obtain a diagnostic score for classifying children with and without ASD.
A BOSCC session is usually 12 minutes long, consisting of two 2min conversational talk sessions
and two 4min play sessions where the child plays with a toy.
In our pre-training experiments, we use a dataset consisting of 369 recordings of unlabelled
BOSCC sessions comprising approximately 100K utterances. For the fine-tuning experiments, we
use two different corpora, ADOSMod3 and Simons. The ADOSMod3 corpus was collected across
2 clinical sites. These data are from administrations of the ADOS Module-3 designed for verbally
fluent children, with a focus on the Social Difficulties and Annoyance and Emotional sub-tasks for
this work. The data consist of a total of 346 sessions collected from 165 children (86 ASD, 79 Non-ASD). The Simons corpus used in our study consists of a combination of clinically administered
ADOS (n = 6) and BOSCC (n = 33) sessions collected across 4 sites and these sessions were
labeled by trained annotators to extract speaker timestamps. The details of datasets are reported in
Table 3.1.
3.4 System Description
3.4.1 Pre-training
Our research aims to adapt the existing self-supervised approaches to the child-adult interaction
domain through contrastive learning. Similar to [68], our contrastive learning framework is based
on the assumption that neighboring segments from audio samples are highly likely to contain
identical information. For instance, it is probable that adjacent audio frames are produced by the
same speaker and are expected to contain similar semantic meaning, linguistic content, as well as
acoustic characteristics. To elaborate, we define the dataset of audio samples as N, where each
audio sample is denoted as $x_i$. The corresponding neighboring audio segment is represented as $x'_i$, and is defined as any audio sample that has a time shift of half a second or less from the original sample $x_i$.
As outlined in the previous section, transformer-based models first transform the input speech
sample x to intermediate features z using the CNN-based feature encoder f(·). Subsequently, the transformer encoder g(·) maps the features z to contextualized representations c.
Consequently, we can create similar pairs of contextualized representations $c_i$ and $c'_i$ from the neighboring audio segments $x_i$ and $x'_i$, with the remaining pairs being considered as negative pairs:

Positive pairs: $c_i \approx c'_i$ (3.1)
Negative pairs: $c_i \neq c_k$, $c_i \neq c'_k$, where $i \neq k$ (3.2)
Motivated by SimCLR [69], we apply the NT-Xent contrastive loss [70] as the pre-training objective with the adult-child conversational corpora. Given the temperature value $\tau$, the loss function $\mathcal{L}_{\text{NT-Xent}}$ for the positive audio pair $x_i$ and $x'_i$ within a batch of $B$ input audio samples is:

$$\mathcal{L}_{\text{NT-Xent}} = -\log \frac{\exp(\mathrm{sim}(c_i, c'_i)/\tau)}{\sum_{k=0,\, k \neq i}^{B} \exp(\mathrm{sim}(c_i, c_k)/\tau) + \sum_{k=0}^{B} \exp(\mathrm{sim}(c_i, c'_k)/\tau)} \quad (3.3)$$
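For reference, a minimal PyTorch sketch of this objective is given below. It uses the common symmetric SimCLR-style formulation in which every element of a positive pair serves as an anchor, whereas Eq. (3.3) is written for a single anchor $c_i$; pooling the contextualized representations into one vector per segment is also an assumption, and the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(c, c_prime, tau=0.1):
    """NT-Xent loss over a batch of positive pairs (c_i, c'_i).

    c, c_prime: (B, D) pooled contextualized representations of an anchor
    segment and its neighboring segment (within 0.5 s). All other segments
    in the batch act as negatives for a given anchor.
    """
    B = c.size(0)
    z = torch.cat([c, c_prime], dim=0)                    # (2B, D)
    z = F.normalize(z, dim=1)
    sim = torch.matmul(z, z.t()) / tau                    # scaled cosine similarities
    sim.fill_diagonal_(float('-inf'))                     # exclude self-similarity
    # positive for index i is index i + B (and vice versa)
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)]).to(c.device)
    return F.cross_entropy(sim, targets)
```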
3.4.2 Downstream Classifier Architectures
We use two different neural network models for child-adult speaker classification based on [71].
Both the classifiers include a self-attention based projector module, whereas one of them uses
CNNs to capture speaker characteristics and the other uses Recurrent Neural Networks (RNN) to
model the temporal dependencies present in the signal.
The RNN-based classifier consists of a stacked sequence of a Feed Forward Layer (FFL), a
bidirectional Long Short Term Memory (LSTM) layer, a self-attention based projector layer and an
output layer comprised of 2 FFLs, separated by a non-linear activation. The CNN classifier architecture is comprised of a weighted feature extraction module, followed by a convolutional module
having 3 1D convolutional layers, each with a dropout and a non-linear activation in between, a
self-attention based projector layer and an output layer comprised of 2 FFLs, separated by a nonlinear activation. For all the experiments we use Rectified Linear Unit (ReLU) as the non-linear
activation and a dropout ratio of 0.3.
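A minimal PyTorch sketch of the RNN-based classifier is given below; the hidden dimensions and the exact form of the self-attention projector are assumptions, while the layer ordering, ReLU activations, and 0.3 dropout follow the description above. The class name is hypothetical.

```python
import torch
import torch.nn as nn

class RNNChildAdultClassifier(nn.Module):
    """FFL -> BLSTM -> self-attention pooling -> output head of 2 FFLs."""
    def __init__(self, feat_dim=768, hidden=256, num_classes=2):
        super().__init__()
        self.ffl = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Dropout(0.3))
        self.blstm = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)              # per-frame self-attention scores
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(hidden, num_classes))

    def forward(self, x):                                  # x: (batch, frames, feat_dim)
        h, _ = self.blstm(self.ffl(x))                     # (batch, frames, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)             # (batch, frames, 1)
        pooled = (w * h).sum(dim=1)                        # attention-weighted pooling
        return self.head(pooled)
```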
3.5 Experimental Setup
3.5.1 Child Adult Classification
In this study, we hypothesize that leveraging unlabelled child speech for pre-training can guide models to learn the heterogeneous child speech and interaction patterns, leading to enhanced performance on downstream child-adult speaker classification. Instead of training from scratch, we pre-train the existing W2V2 and WavLM models with additional unlabelled child speech by unfreezing and updating specific transformer layers using the contrastive loss described in Sec. 3.4.1. We report the child-adult classification macro F1-score on the two labeled child-adult interaction corpora described in Section 3.3. We report the results in Table 3.3 and Table 3.4, where the first row denotes downstream child-adult classification performance using the model relying solely on the original pre-trained embeddings. The subsequent rows denote downstream task performance using the models pre-trained
Table 3.2: Number of trainable parameters for the pre-training experiments based on unfrozen
transformer layers
Number of unfrozen transformer layers
1 2 3 4 5
6.2M 13.5M 20.8M 27.1M 33.8M
Table 3.3: Child-adult classification F1 score using W2V2. (PT corresponds to pre-training and
the following number represents the number of layers used for pre-training.)
Model ADOSMod3 Simons
RNN CNN RNN CNN
W2V2 - Base 67.92 70.59 63.41 64.13
W2V2 - PT1 69.31 72.41 65.19 66.28
W2V2 - PT2 71.55 72.95 65.87 65.12
W2V2 - PT3 72.23 74.38 68.81 65.44
W2V2 - PT4 74.01 74.89 67.63 66.79
W2V2 - PT5 72.19 74.05 65.01 65.39
Table 3.4: Child-adult classification F1 score using WavLM pre-training. (PT corresponds to
pre-training and the following number represents the number of layers used for pre-training.)
Model ADOSMod3 Simons
RNN CNN RNN CNN
WavLM-Base 72.73 73.09 71.78 70.25
WavLM - PT1 74.29 74.93 72.64 71.11
WavLM - PT2 76.66 75.81 72.88 72.74
WavLM - PT3 75.95 76.37 72.31 71.09
WavLM - PT4 75.18 75.92 72.01 71.59
WavLM - PT5 75.48 73.17 71.47 70.17
with additional child speech, where the number indicates the number of trainable transformer layers involved in the pre-training task.
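A minimal sketch of the layer-unfreezing step is shown below, assuming the HuggingFace `transformers` implementation of WavLM (an analogous recipe applies to W2V2 via `Wav2Vec2Model`); the checkpoint name and the helper function are illustrative, not taken from the thesis.

```python
from transformers import WavLMModel

def prepare_for_pretraining(model_name="microsoft/wavlm-base-plus", num_unfrozen=2):
    """Load a pre-trained backbone and leave only the last `num_unfrozen`
    transformer layers trainable for the contrastive pre-training step."""
    model = WavLMModel.from_pretrained(model_name)
    for param in model.parameters():
        param.requires_grad = False                     # freeze everything first
    for layer in model.encoder.layers[-num_unfrozen:]:  # unfreeze last k transformer layers
        for param in layer.parameters():
            param.requires_grad = True
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Trainable parameters: {trainable / 1e6:.1f}M")
    return model
```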
Figure 3.2: t-SNE plots of the two most discriminative components of the embedding space for the two classes. Panels: (a) Session A: W2V2-Base, (b) Session A: W2V2-PT4, (c) Session B: WavLM-Base, (d) Session B: WavLM-PT2.
3.5.2 ADOSMod3 Experiments on Demographics
Our study also investigates the model performance across age-groups in the ADOSMod3 corpus.
Prior works [34], [53] have reported age as an important variability factor impacting speech characteristics. Based on this hypothesis, we conduct an experiment by partitioning the ADOSMod3
corpus (3 − 13yrs) into three different age-groups (Age-group 1: 43-90 months, Age-group 2:
91-118 months, Age-group 3: 119-158 months), such that each group contains an equal number of
sessions. For each of these groups, we report the child-adult classification F1 score using the pretrained base models of W2V2 and WavLM and also the best-performing pre-trained models of
those two categories.
Beyond age, analyses of developmental changes in speech have revealed sex ("gender") differences in speech characteristics, especially post puberty [34]. In this work, we also report gender-based child-adult speaker classification performance on the ADOSMod3 dataset, with recordings from 244 male and 84 female individuals. Similar to the age-focused experiments, the dataset is partitioned into male and female subsets, and comparisons are drawn between the base model and the best-performing pre-trained models for both W2V2 and WavLM.
3.5.3 Experimental details
For both the pre-training and fine-tuning experiments, Adam optimizer is used with a batch size of
32 samples and temperature is set to 0.1. The number of tunable parameters for the pre-training
Figure 3.3: Gender based Child-adult classification F1 scores.
experiments is reported in Table 3.2. The initial learning rate is set to 1e-5 and the models are
trained for 30 epochs with an early stopping callback on validation loss, patience being 5 epochs.
For the downstream child-adult classification task, the model is trained to minimize the binary
cross-entropy loss for a maximum of 50 epochs, while the initial learning rate for this experiment
is 2e-4 with a weight decay of 1e-4. For both the datasets, we use 70% for training, 15% for
validation and 15% for testing. We use the model checkpoints from HuggingFace [72]. We pre-train the models using a single NVIDIA GeForce 1080 Ti GPU, and each experiment took less than
two days.
3.6 Results and Discussion
3.6.1 Classification Evaluation
In this subsection, we analyze the experimental results reported in Table 3.3 and Table 3.4 to
address the following questions:
Does pre-training with more child speech improve the classification? The results reveal that
pre-training with additional child speech improves the child-adult classification F1 score over the
base model. This underscores the models’ ability to account for the heterogeneity that is inherent in children’s speech. It can be observed that WavLM-based pre-trained models show better
performance compared to W2V2 across all the experiments. Both the classifiers show comparable performance, with the RNN-based classifier yielding the best score in the majority of the
experiments. Among the datasets, the experimental results reveal better F1 scores in ADOSMod3
compared to the Simons corpus. One possible reason might be related to the session recording
length difference between the datasets. The average duration of sessions in Simons corpus is much
higher than ADOSMod3, resulting in greater potential variability and heterogeneity, which may
have degraded the F1 scores.
Does pre-training with more transformer layers improve the classification? It is interesting to
note that, while in W2V2-based pre-training, the classification F1 keeps improving by tuning more
transformer layers, in the case of WavLM, the performance improvements reach the maximum
with tuning fewer transformer layers. One possible explanation is that the WavLM model is trained
with an objective function to capture speaker related information, helping the model to achieve the
optimum performance with less additional training. However, in both scenarios, the model performance starts to degrade when more than four transformer layers are tuned. As these models are designed to
provide generalized speech representations, tuning larger portions of these pre-trained models on
a relatively smaller dataset might lead to the loss of generalizability, causing the performance to
decrease for the classification task. However, our results provide compelling evidence that it is
beneficial to adapt the last few transformer layers for the adult-child classification.
Qualitative analysis We present t-SNE visualizations of pre-trained embeddings for 2 output
classes from two sessions in Figure 3.2. We plot the embeddings with and without additional
pre-training. In both cases, it is evident from the plots that our method increases the discriminative
information between the two classes.
Figure 3.4: Age based Child-adult classification F1 scores.
3.6.2 Result Evaluation based on Demographics
For the gender-focused experiments, the relative improvements in F1 scores are 6.39% and 3.14% for the male and female subsets, respectively. Possibly due to both inherent speech pattern differences and inherent data distribution biases (see Section 3.5.2), the models yield higher F1 scores for the male population than the female population. For the age-focused experiment, relative improvements of 9.42%, 8.23%, and 4.06% are seen for the three age-groups (youngest to oldest). The results imply that it is intrinsically challenging to model children in Age-groups 1 and 2 due to the developing vocal tract behaviors at these ages. As a consequence, adding more child speech to the training data provides greater benefit to the model in capturing relevant information, resulting in greater improvements in the younger age-groups than in the oldest one.
3.7 Conclusion
Past work has demonstrated the promise of deploying self-supervised algorithms in a variety of
downstream tasks like ASR, speaker diarization, and speaker verification [61], [73]. In this work,
we investigate the utility of additional pre-training with more child speech, even in the presence
of the inherent heterogeneity and variability, to improve child-adult speaker classification in clinical recordings involving interactions with children with autism. The experimental results with
the proposed models support our hypothesis that additional pre-training incorporating child speech is beneficial, across both age and gender dimensions of variability.
In this work, we used the manually-annotated ground truth labels for identifying and evaluating the speech and non-speech regions. In the future, we plan to build a child-adult diarization
framework with an integrated Voice Activity Detection (VAD) system to further reduce the need of
human effort. In addition, we plan to extend this study with an additional emphasis on early vocalization and speech (from toddlers and infants) in the interaction. Unlike verbally fluent children,
toddler speech contains significant amounts of pre-verbal sounds and non-verbal vocalizations,
which pose additional challenges for automated processing.
Chapter 4
Analyzing short term dynamic speech features for
understanding behavioral traits of children
4.1 Introduction
ASD refers to a range of neuro-developmental disorders characterized by an early onset of significant social-communicative challenges along with restrictive, repetitive behaviors and interests.
Recent studies report a continual increase in the prevalence of ASD in children, from 1 in 59 children in 2014 to 1 in 54 children in 2020.1 ASD diagnosis is a complex, challenging and time-consuming
process as it relies on behavior symptoms in the absence of any reliable biological markers or
medical tests.
While there are ongoing efforts to better understand the association between genetic and neurobiological factors and ASD, a significant amount of research has been invested in building computational tools for domain experts and creating objective measures for early diagnostics, intervention
planning and assessment. In particular, different physiological and behavioral signal-based features
are being extensively studied to identify features that capture behavioral traits relevant to existing
diagnostic instruments (ADOS [74], ADI-R [75]) in order to support behavioral phenotyping and
stratification in the context of diagnosis and subsequent intervention.
1https://www.autismspeaks.org/press-release/cdc-estimate-autism-prevalence-increases-nearly-10-percent-1-54-children-us
Previous studies [9], [76] have explored computational approaches for validating behavioral
markers and inferring ASD diagnosis predictions using observable behavioral information obtained
from conversations involving children and interlocutors. For instance, Bone et al. [10] analyzed the
association between objective signal-derived prosodic cues and subjective perceptions of prosodic
awkwardness in settings of story retelling from adolescents with an ASD diagnosis; [33] studied
lexical features to characterize the verbal behavior of children with ASD and non-ASD developmental disorders. However, since most of the literature relies on features computed for each speaker individually based on short-term vocal and lexical cues, they may not capture the full
extent of two interlocutors’ coordination and reciprocity, which is important when characterizing
ASD.
Behavioral patterns in interactions are inherently dynamic in nature, and features derived from
local changes reflect this behavior better when compared to those derived from global changes. In
recent years, multiple works have proposed the use of various forms of conversational speech dynamics as features for downstream inference tasks. For example, emotion recognition has benefited
from the use of temporal dynamics in form of autoregressive methods and spectral moments [77],
hidden Markov models [78], nonlinear dynamics [79], and more recently recurrent neural networks [80], [81]. Curhan et al. [82] showed that measures of vocal activity level, conversational
engagement, prosodic emphasis, and mirroring can help predicting negotiation trends. Deception
detection [83] from vocal cues have been recently shown to improve by capturing the conversational dynamics [84], [85]. In the clinical domain, the dynamics captured by spectral energy
variability have been shown to be an indicator of depression [86], [87]. The work in [88] reported superior
performance in couples therapy outcome prediction using dynamic functionals. Warlaumont et
al. [89] found that measures of conversational dynamics, at both short and long timescales, can vary between populations with and without an ASD diagnosis. Vocal arousal dynamics in child-psychologist interactions were shown to distinguish between high and low ASD severity [23]. In this work, these observations motivate us to capture dynamics based on the aggregated turns of each interlocutor to encode important conversational and behavioral patterns of speech. More specifically, we
aim to understand the contribution of dynamic functionals in characterizing behavioral patterns of
children with an ASD diagnosis.
We analyze the vocal speech patterns of children – both those with and without an ASD diagnosis – engaged in interaction with clinicians. We present a correlation analysis to interpret the
relationship between the extracted features and manually coded clinical ratings related to ASD
diagnosis. We formulate the prediction task as a binary classification problem of differentiating
between children with an ASD diagnosis and those who do not. We compare the predictions using
the features derived from the static and dynamic functionals to better understand the benefit of
explicitly using dynamic functionals for predicting the diagnosis state.
4.2 Conversational Data
Figure 4.1: Static and dynamic functionals. Static functionals (e.g., mean pitch, median intensity) are computed from child turns; first order dynamic functionals are computed across child-child (C→C) and child-psychologist (C→A) turn pairs, and second order dynamic functionals across child-child (C2→C2) and child-psychologist (C2→A2) pairs.
The Autism Diagnostic Observation Schedule (ADOS)-2 [90] instrument refers to a sequence
of semi-structured activities between a child and a clinician to assess behavioral patterns associated
with ASD. A typical ADOS-2 interaction session lasts 40-60 minutes, where a child is engaged in
multiple subtasks to evoke maximum response.
Figure 4.2: Classification experiment results: macro-averaged F1 scores vs. different orders of dynamic functionals used ((a) Logistic Regression, (b) Support Vector Machine, (c) K Nearest Neighbor, (d) Random Forest, (e) Naive Bayes). Each panel compares dynamic functionals, static functionals, and their combination against a majority-classifier baseline, with the order of the dynamic functional(s) used (1st through 1st+2nd+3rd+4th) on the horizontal axis.
Table 4.1: Demographic details of ADOS dataset
Category Statistics
Age(years) Range: 3.58-13.17 (mean, std): (8.61, 2.49)
Gender 123 male, 42 female
Non-verbal IQ Range: 47-141 (mean, std): (96.01, 18.79)
Clinical
Diagnosis
86 ASD
42 ADHD (Attention Deficit Hyperactivity Disorder)
14 mood/anxiety disorder
12 language disorder
10 intellectual disability, 1 no diagnosis
In this work, we choose to focus on the Emotions and Social difficulties & annoyance subtasks
from the Module 3 administration, designed for verbally fluent children. In the Emotions subtask,
the child is asked questions related to the identification of situations and activities that elicit different emotions. During the Social difficulties & annoyance subtask, the child is asked to describe
his/her opinion on different social issues in different circumstances (at home or school) and about
their coping strategies.
For this work, we carry out data standardization for each speaker in each session after aligning
the speaker turns using manually derived transcripts (following SALT transcription guidelines [91]). We exclude sessions having fewer than 25 turns so as to enable reliable computation of 3rd and 4th order dynamic functionals. After preprocessing, our final dataset contains a total of 281 sessions from 165 children (144 sessions from children with an ASD diagnosis, 137 from children without). Almost every child contributed 2 interaction
sessions, corresponding to the 2 subtasks mentioned above.
Table 4.2: Top 5 features based on absolute correlation values for static functionals of different
feature categories with CSS (the indices for MFCCs and MFBs are shown in parentheses)
Prosodic Features Voice Quality Features Acoustic Features
feature func corr feature func corr feature func corr
Pitch Envelope Max. -0.1875 Voicing Min. 0.3053 MFCC(2) Min. 0.1852
Pitch Envelope Mean -0.1561 Voicing Diff Min. 0.1350 MFCC(0) Max. -0.1832
Pitch Diff Mean 0.1467 Dynamic Jitter Diff Median -.1287 MFB(7) Min. -0.1694
Pitch Envelope Min. 0.1308 Jitter Median 0.1263 MFB(4) Max. -0.1620
Loudness Mean 0.1260 Voicing Diff Max. 0.1175 MFCC(6) Min. -0.1603
Table 4.3: Top 5 features based on absolute correlation values for dynamic functionals of different
feature categories with CSS (the indices for MFCCs and MFBs are shown in parentheses)
Child - Child Psych - Psych Child - Psych
order feature func corr order feature func corr order feature func corr
1 MFCC(6) Max. 0.1999 1 Loudness Std. Dev. -0.3190 2 Loudness Max. -0.3026
3 MFCC(6) Min. -0.1893 1 MFB(0) Std. Dev. -0.3014 3 Loudness Max. -0.2977
1 Pitch Median 0.1748 4 MFB(0) Min. 0.2951 4 Loudness Min. 0.2856
1 MFCC(6) Std. Dev. 0.1715 1 MFB(0) Min. 0.2919 1 MFB(0) Std. Dev. -0.2745
3 MFCC(7) Std. Dev. 0.1700 2 Loudness Std. Dev. -0.2799 1 MFB(0) Min. 0.2712
4.3 Experimental Methodology
We conduct two sets of experiments, (i) correlation based analyses between the extracted static
and dynamic functionals and ranked measures of the child’s ASD severity termed as Calibrated
Severity Score (CSS) [90], and (ii) binary classification between children with and without ASD
diagnosis based on different sets of static and dynamic functionals. The former is undertaken to
understand the relationship between the extracted functionals and the clinically-meaningful CSS,
while the latter is used to understand the additional predictive power that the dynamic functionals
present over the static functionals.
4.3.1 Acoustic-Prosodic and Turn-Taking Feature Analysis
All the speech features are extracted using openSMILE [92]. We consider features relating to
acoustics, prosody, and voice quality. All 15 dimensions of Mel Frequency Cepstral Coefficients
(MFCC) and the 8 dimensions of Mel Frequency Band (MFB) features are included in the spectral
set. Loudness, pitch envelope and their first order differences are considered as the prosodic set,
while voicing probability, local jitter, the differential frame-to-frame jitter, local shimmer, and the
first order difference of each of these features made up the voice-quality set. All these features are
computed for every 10ms interval of the audio file.
After extracting the raw features, we calculate five static functionals (mean, standard deviation,
median, minimum, and maximum) across each session in the dataset. For the static functionals, we
only consider the child’s turns. To calculate dynamic functionals, we first average the frames of
each relevant turn, and take the first, second, third, and fourth order differences within turn-pairs as
our dynamic functionals as shown in Figure 4.1. We define turn-pairs as consisting of consecutive
turns either between the same speaker or across different speakers.
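A minimal sketch of how such functionals can be computed is given below, assuming frame-level openSMILE features and turn boundaries are available; the helper names are hypothetical.

```python
import numpy as np

def turn_level_means(frame_feats, turns):
    """Average the frame-level openSMILE features within each turn.
    frame_feats: (num_frames, feat_dim); turns: list of (start_frame, end_frame)."""
    return np.stack([frame_feats[s:e].mean(axis=0) for s, e in turns])

def dynamic_functionals(turn_means, order=1):
    """k-th order differences of turn-averaged features across a sequence of
    turns (e.g., successive child turns, or alternating child/psychologist
    turns), summarized with session-level statistics."""
    diffs = np.diff(turn_means, n=order, axis=0)
    return {
        "mean": diffs.mean(axis=0), "std": diffs.std(axis=0),
        "median": np.median(diffs, axis=0),
        "min": diffs.min(axis=0), "max": diffs.max(axis=0),
    }
```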
4.3.2 Correlation Analysis
The correlational analysis is set up to estimate the association between the functionals (static and
dynamic) and manually coded behavioral ratings (CSS) related to ASD diagnosis. CSS is a metric
quantifying ASD severity with relative independence from individual characteristics such as age
and verbal IQ on a 10 point scale. Because of the ranked nature of CSS, we chose Spearman’s
rho [93] for this analysis over Pearson’s correlation coefficient. The correlation metrics serve as a
knowledge-driven way to select features that are then used to infer ASD diagnosis in the next experiment.
4.3.3 Classification and Feature Selection
To understand the role of dynamic functionals in differentiating between children with and without
ASD diagnosis, we set up a binary classification experiment to predict the output labels as ASD
or non-ASD. We consider five different classifiers for this experiment: logistic regression, Support Vector Machine (SVM), random forest, k-nearest neighbours, and naive Bayes. For each classifier, we carry out 5-fold cross-validation repeated 5 times to reduce the risk of overfitting.
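A minimal scikit-learn sketch of this evaluation protocol is shown below; the feature matrix and labels are placeholders, and default hyperparameters are assumed since none are specified above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))        # placeholder session-level functionals
y = rng.integers(0, 2, size=100)      # placeholder ASD / non-ASD labels

classifiers = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "svm": SVC(),
    "random_forest": RandomForestClassifier(),
    "knn": KNeighborsClassifier(),
    "naive_bayes": GaussianNB(),
}

# 5-fold cross-validation repeated 5 times, scored with macro-averaged F1
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=0)
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=cv, scoring="f1_macro")
    print(f"{name}: macro F1 = {scores.mean():.3f}")
```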
We consider different combinations (feature-level fusion) of static and dynamic functionals in
our classification analysis to investigate the extent of combined predictive power of both static
and dynamic functionals. Moreover, we report the classification F1 score of different order functionals individually and along with static functionals to study the contributions of using dynamic
functionals over static ones.
To reduce the number of features used, we use a feature selection strategy based on correlation-based feature ranking. We calculate Spearman's correlation coefficients for each feature from
each set of functionals (dynamic or static) with respect to the variable of interest (CSS in this case),
and rank them in descending order of correlation. In this process, we exclude the features that are
not statistically significant.
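A minimal sketch of this correlation-based ranking is given below, using SciPy's Spearman correlation and keeping only features significant at p < 0.05; ranking by absolute correlation is assumed, consistent with Tables 4.2 and 4.3.

```python
import numpy as np
from scipy.stats import spearmanr

def rank_features_by_spearman(features, css, alpha=0.05):
    """Rank features by |Spearman rho| with the Calibrated Severity Score,
    keeping only statistically significant correlations (p < alpha).

    features: (num_sessions, num_features) array of functionals
    css: (num_sessions,) array of severity scores
    Returns a list of (feature_index, rho) sorted by decreasing |rho|.
    """
    ranked = []
    for j in range(features.shape[1]):
        rho, p = spearmanr(features[:, j], css)
        if p < alpha and not np.isnan(rho):
            ranked.append((j, rho))
    ranked.sort(key=lambda t: abs(t[1]), reverse=True)
    return ranked
```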
4.4 Results
In this section we report the findings based on the experiments we conduct.
4.4.1 Correlation Analysis
Here, we perform correlation analysis to investigate whether the static and dynamic functionals capture information that can be used to make inferences related to behavioral patterns in
ASD. For this experiment, we compute Spearman’s correlation coefficients between the CSS and
mutually exclusive sets of static and dynamic functionals.
In Table 4.2, we report the five most correlated static functionals for each of the feature categories and in Table 4.3 we report the five most correlated dynamic functionals for each of the
categories involving either same speaker or different speakers. In each case, we consider the significantly correlated functionals only (p < 0.05).
4.4.2 Classification Experiment
The goal of the classification experiment is to investigate the possibility of predicting ASD diagnosis based on static and dynamic functionals (both individually and in combination). As mentioned
in previous sections, the classification experiment is formulated as a 2-class problem of predicting
either ASD or non-ASD.
For all the experiments, the first n ranked statistically significant static and dynamic functionals
are considered as input to classifiers. The value of n is chosen separately for each feature set from 2
to 35 based on the classification performance. We consider only child-child and child-psychologist
dynamic functionals as input to the classifiers, as our primary focus is on analyzing the behavioral
dynamics of the children. It is important to understand that for each order of dynamic functionals,
we also consider the preceding order differences cumulatively. For example, the 3rd order dynamic functional set also includes the 1st and 2nd order dynamic functionals. Since we build each
dynamic functional set in a cumulative way, the resulting selected features do not always include
equal proportions of each contributing dynamic or static functional subset.
For each of the mentioned classifiers, we report the performance considering only static functionals, cumulative dynamic functionals, and also static and cumulative dynamic functionals together in terms of classification F1 score as shown in Figure 4.2. We consider the majority classifier (every sample is assigned to whichever is the majority class in the training set) as the baseline
and report its performance in the same plot to better understand the improvement in classification
performance after incorporating static and dynamic functionals.
4.4.3 Summary of observations
While Table 4.2 shows pitch envelope provides maximum correlation for static functionals, Table
4.3 reveals MFCC and loudness showing greater absolute correlation for dynamic functionals.
Quite interestingly, within-psychologist (i.e., psych-psych) correlations are found to be the highest, followed by child-psych dynamics; this is consistent with the observation that the psychologist
adjusts their dynamics according to the clinical state of the child [94]. Results from Figure 4.2
suggest a combination of static and dynamic functionals offers the best performance in the majority
of the cases. Amongst the dynamic functionals, the 1st order differences work the best, indicating
that higher order functionals viz., 2nd, 3rd and 4th order differences are not contributing as much
to the classification problem; it may also be the case that the feature selection is overfitting or ineffective.
4.5 Conclusion
Speech features are being extensively studied to understand and characterize ASD in the context of
behavioral analysis and phenotyping. In this work, we report the relevance of different combinations
of static and dynamic speech functionals with respect to clinically-determined disorder severity in
terms of the correlation metric. We examine the role of dynamic functionals individually and
in combination with static functionals to predict ASD diagnosis. Furthermore, we show that the top-ranked static and dynamic functionals (based on the correlation metric) carry meaningful information for classifying the behavioral patterns of children with and without an ASD diagnosis.
In the future we plan to extend this work with lexical features in order to gain a comprehensive
understanding about behavioral traits of children during such interactions. We also plan to explore
other feature selection techniques in order to improve classification performance based on static
and dynamic functionals.
Chapter 5
Modeling interpersonal synchrony across vocal and lexical
modalities in child-inclusive interactions
5.1 Introduction
Autism Spectrum Disorder (ASD) [95]1 is a developmental condition primarily characterised by
differences in social communication skills and abilities along with restricted repetitive behavior,
interests and movements. Prevalence of ASD in children in the United States has steadily risen
from 1 in 150 in 2002 to 1 in 44 in 2021.2 ASD is a spectrum disorder with wide individual heterogeneity. Human-centered technological advances offer promise for supporting new evidence-driven possibilities in support of behavioral stratification as well as diagnosis and personalized treatment [96], [97].
In recent years, computational approaches using signal processing and machine learning
have been proposed for both research and clinical translation in mental health [97]. For example,
machine learning algorithms have been proposed in ASD related research to gain insights into (and
even attempt to minimise the manual efforts required for) ASD diagnosis based on expert human
coded behaviors extracted from instruments like Autism Diagnostic Interview-Revised (ADI-R) and
Autism Diagnostic Observation Schedule (ADOS). Computational methods have shown promise
in supporting the diagnostic efforts by identifying the essential nosological modules, eliminating
1https://www.cdc.gov/mmwr/volumes/69/ss/ss6904a1.htm?s_cid=ss6904a1_w
2https://www.cdc.gov/ncbddd/autism/data.html
redundancy without compromising on accuracy [9], [31], [33]. Computational techniques have
also provided tools to further scientific understanding of interaction mechanisms. For instance,
[10] connects objective signal-derived descriptors of vocal prosody to subjective perceptions of
prosodic awkwardness and reports differences in acoustic-prosodic features between ASD and
neurotypical individuals, including demonstrating interaction pattern differences in vocal prosody
coordination that varied in accordance with a child’s ASD symptom severity [94]. Such approaches
build upon previous studies [98], [99] that have reported significant correlation between interlocutor’s prosody and language patterns and subject’s ASD severity, thus underscoring the importance
of including interpersonal coordination in studies of social communication and interaction involving children with ASD. The work in this chapter aims to contribute in this direction by investigating new behavioral measures focused on interaction synchrony.
Synchrony [24], [25] in interactions can be broadly viewed as an individual's reciprocity, coordination or adaptation to the other participant(s) in, and during, the interaction. Typically, synchrony can be exhibited across multiple communication modalities such as through vocal patterns [100], hand and head motions [6], [101], facial expressions [102], etc. In a dyadic interaction setting, it helps in illuminating a speaker's attitude toward the other speaker, and the outcomes
of an interaction. For instance, a greater interaction synchrony is often associated with positive
social outcomes such as in better rapport building [103], efficient tutoring experience [104], more
successful negotiation [105] and so on. Understanding and quantifying behavioral synchrony is a
challenging problem due to its multifaceted dynamic and complex nature [106] and also how it is
affected by differences in the individual’s health state and condition.
Since interpersonal synchrony in dyadic conversations provides insights toward understanding behavioral dynamics, it can potentially aid in scientific and clinical studies of interactions in
the domain of ASD, which is characterized by differences in social communication and interaction [32], [107]. Prior work related to behavioral synchrony in the ASD domain is somewhat limited,
focusing largely on individual modalities such as vocal prosody or facial movements. [23] investigated synchrony in vocal arousal patterns in ASD child-clinician (adult) interactions, and showed
its variation based on the child’s ASD severity levels. To understand multimodal synchrony patterns, it is also important to consider the coordination and interplay between the communication
modalities within an individual, in addition to across individuals. For example, Guha et al. in [102]
have reported the role of localized dynamics between different facial regions and their movements,
and differences therein between typically developing children and children with high functioning
autism.
In this chapter, we investigate three distinct measures of behavioral synchrony in speech and
language patterns in an interaction based on DTW distance [102], cosine distance [106], [108] and
Word Mover’s distance [25]. The primary contribution of this work is in quantifying synchrony
across different information modalities related to voice, articulation and language through the joint
consideration of prosody, acoustic spectral features and language patterns to capture interaction
synchrony. Experiments performed on data from real world clinical interactions show that the proposed measures capture co-ordination in dyadic interactions. Since individuals with ASD exhibit
wide differences in social communication, we believe that these co-ordination features can offer
additional objective measures for behavior characterization and further stratification. Importantly,
we experimentally investigate whether coordination features across the speech and language communication channels can capture complementary information that can be used as an additional
source of information in characterizing ASD individuals, and distinguishing them from those that
have not received an ASD diagnosis.
We analyze differences in the synchrony measures across the children with and without an ASD
determination through post hoc classification experiments. Classification experiments carried out
with the three proposed coordination features reveal their importance in differentiating ASD and
non-ASD groups through improved (F1 score) performance with respect to baseline classifiers.
Furthermore, we analyse the variation of the mean value of the proposed coordination measures
throughout the interactions across two different subtasks for both children diagnosed with ASD
and without ASD. We also examine age-dependency in these results through two-way analysis of variance of the classification F1 scores computed across three different age-groups of young (2.5-
7.5yrs), middle-band (7.5-10yrs) and older children (above 10yrs), as well as male and female
children.
5.2 Dataset Description
The vocal and language behavioral synchrony measures are evaluated in the context of interactions
between a child and a clinician. The data are drawn from two specific domains (Emotions and
Social difficulties and annoyance subtasks) involving behavioral observation.
Table 5.1: Demographic details of ADOS dataset
Category Statistics
Age(years) Range: 3.58-13.17 (mean,std):(8.61,2.49)
Gender 123 male, 42 female
Non-verbal IQ Range: 47-141 (mean,std):(96.01,18.79)
Clinical
Diagnosis
86 ASD,42 ADHD (Attention Deficit Hyperactivity Disorder)
14 mood/anxiety disorder
12 language disorder
10 intellectual disability, 1 no diagnosis
Age distribution (clinic-wise): Cincinnati 84 (2.5-7.5yrs: 28, 7.5-10yrs: 31, ≥10yrs: 25); Michigan 81 (2.5-7.5yrs: 24, 7.5-10yrs: 30, ≥10yrs: 27)
Age distribution (ASD/Non-ASD): ASD 86 (2.5-7.5yrs: 25, 7.5-10yrs: 30, ≥10yrs: 31); Non-ASD 79 (2.5-7.5yrs: 27, 7.5-10yrs: 31, ≥10yrs: 21)
The Autism Diagnostic Observation Schedule (ADOS-2) [14] instrument refers to semi-structured
interactions between a child and a clinician trained to score the different behaviors associated with
ASD. These interactive sessions are typically 40-60 minutes long and broken down into a variety of
subtasks (e.g., construction, joint-interactive play, creating a story, demonstration, etc.) which are
likely to evoke prominent response from a child under different social circumstances. Based on the
child’s response, the clinician provides assessment of ASD symptoms following module-specific
coding and finally all these codes are aggregated to compute an autism severity score[90].
For this study, we focus on a subset of data from the administrations of module-3 meant for
verbally fluent children. Specifically, we choose to work with Emotions and Social difficulties
and annoyance subtasks because of their ability to elicit spontaneous speech from the children
under significant cognitive demand. The dataset consists of recordings from 165 children (86
ASD, 79 Non-ASD), collected from two different clinics: the University of Michigan Autism
and Communication Disorders Center (UMACC) and the Cincinnati Children’s Medical Center
(CCHMC). For our experiments, we have 1 recording for each of the mentioned subtasks from
each participant, resulting in a total of 330 recordings. The demographic details are presented in
Table 7.1. The average duration of each of these sessions is about 3 minutes (192 secs). The lexical
features are extracted based on manual transcriptions following SALT [109] guidelines. Since we
aggregate turn-level coordination measures, sessions with fewer than 10 turns are discarded as
a sufficient number of turns is required to aggregate and average out local irregularities.
5.3 Quantification of Interpersonal Synchrony
In this section, we describe the different signal feature descriptors used and the proposed coordination measures in detail. First, we outline the feature descriptors and based on those features, we
define the coordination measures. For this study, we consider three different sets of feature descriptors: vocal prosodic features, acoustic spectral features and lexical features. We use DTWD and
SCDC for vocal prosodic and acoustic spectral features to quantify interpersonal synchrony. For
lexical features, we apply WMD to capture behavioral synchrony.
To calculate the coordination measures, the interactions are divided into multiple turns for each
speaker. A speaker turn is defined as the time duration for which the active speaker is talking
without interruption from the other speaker. After marking the turn boundaries, the coordination
measures are computed for every consecutive turn-pair and averaged over all consecutive turn-pairs
to obtain a session level measure.
5.3.1 Acoustic spectral and vocal prosodic features
All the features are extracted using the OpenSMILE toolkit [92]. The feature extraction is carried
out using a sliding Hamming window of duration 25ms with an interval of 10ms. We use 15
dimensional Mel Frequency Cepstral Coefficients (MFCC)[106] as acoustic spectral features and
pitch, intensity, jitter and shimmer as vocal prosodic features. The prosodic features are smoothed
and extrapolated over the unvoiced regions.
5.3.1.1 DTW distance measure (DTWD)
We use the classic DTW [110] method to measure the similarity between the acoustic-prosodic
features extracted from two consecutive speaker turns. This method computes the (dis)similarity
between two time sequences, possibly of varying lengths, after aligning them to the maximum
extent in terms of a warping distance. For example, [102] has used this method to compare facial
expression time series in children with and without an autism diagnosis. Herein, we employ the
DTW method to compute the dissimilarity between vocal feature time series obtained from the
child and clinician’s speech turn pairs. We introduce the average warping distance as a measure
for interpersonal synchrony.
For two $m$-dimensional time-series $X$ and $Y$ with lengths $T_x$ and $T_y$ respectively, such that $X \in \mathbb{R}^{m \times T_x}$ and $Y \in \mathbb{R}^{m \times T_y}$, DTW finds the (dis)similarity between these sequences by optimally aligning them. A distance matrix $D \in \mathbb{R}^{T_x \times T_y}$ is calculated, where every element $d(i, j)$ denotes the Euclidean distance between the $i$-th vector of $X$ and the $j$-th vector of $Y$. Based on the distance matrix values, an optimal warping path $W = w_1, w_2, \ldots, w_H$ yielding the overall minimum cost (distance) is found. A warping path is a mapping from $X$ to $Y$ tracing the elements of $W$, where every element is such that $w_h = (i, j)$ with $i \in [1, T_x]$ and $j \in [1, T_y]$, i.e., $X$ can be warped to the same length as $Y$ by mapping the $i$-th element of $X$ to the $j$-th element of $Y$. An optimal path is the one associated with the minimum cost, where the cost is computed as the sum of absolute distances for every matched pair of indices present in the path,

$$d(W) = \sum_{h=1}^{H} D(w_h(1), w_h(2)) \quad (5.1)$$
Since it is a dissimilarity measure, a larger warping distance is deemed to signify lesser coordination or synchrony.
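A minimal NumPy sketch of the turn-pair DTW distance described above is shown below (no path normalization is applied; as in the text, the per-pair distances are later averaged over all consecutive turn-pairs to obtain the session-level DTWD). The function name is illustrative.

```python
import numpy as np

def dtw_distance(X, Y):
    """Classic DTW between two multivariate turn feature sequences.

    X: (Tx, m), Y: (Ty, m) frame-level acoustic-prosodic features of two
    consecutive turns. Returns the minimum cumulative warping cost.
    """
    Tx, Ty = len(X), len(Y)
    D = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)   # pairwise Euclidean costs
    acc = np.full((Tx + 1, Ty + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, Tx + 1):
        for j in range(1, Ty + 1):
            acc[i, j] = D[i - 1, j - 1] + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[Tx, Ty]
```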
5.3.1.2 Squared Cosine Distance of Complexity Measure (SCDC)
Prior work [108] has attempted to measure coordination between speakers in a dyadic conversation from a nonlinear dynamical system approach based on the underlying model’s complexity.
Following a similar approach, we analyze the complexity pattern underlying the signals observed
in dyadic conversations by framing them as arising from a coupled nonlinear dynamical system.
However, while [108] relies on computing coordination in different features separately, we measure the coordination between speakers as a whole based on all the audio features considered.
We capture the difference between the vocal characteristics of the speaker turns by comparing the complexity underlying their prosodic and spectral feature values. We use sample entropy as the complexity measure; it is an information-theoretic measure of complexity which signifies the amount of new information introduced across a temporal sequence. Based on this definition of complexity, we hypothesize that local changes of complexity in a well-coordinated conversation will be smaller than in a less well-coordinated conversation. More specifically, the distance between complexity patterns corresponding to consecutive turns is expected to be lower in a well-coordinated conversation than in a randomly generated conversation. As will be shown in Section 5.4, the experimental findings support this hypothesis and confirm that a larger value of the proposed measure corresponds to lower synchrony.
To calculate the proposed measure, for a time sequence $X = x_1, x_2, \ldots, x_M$ of length $M$, a length-$m$ subsequence is formed as $X_m(i) = x_i, x_{i+1}, \ldots, x_{i+m-1}$. Let $d(X_m(i), X_m(j))$ denote the Chebyshev distance between any two such vectors with $i \neq j$. Now, if $E_m(r)$ denotes the number of vector pairs such that $d(X_m(i), X_m(j)) < r$, where $r$ is a predefined threshold, and $E_{m+1}(r)$ is defined analogously for length-$(m+1)$ subsequences, then the sample entropy $S_e$ is defined as,
$$ S_e = -\ln \frac{E_{m+1}(r)}{E_m(r)} \qquad (5.2) $$

From the definition, $E_{m+1}(r)$ is always less than or equal to $E_m(r)$, so $S_e$ is non-negative. Smaller values of $S_e$ signify greater self-similarity across the values. Here, we consider $m = 2$ and $r = 0.25 \times$ the standard deviation of the time series.
For any two consecutive turns, we first compute the sample entropy values for every feature, yielding two vectors $X_1$ and $X_2$ consisting of the feature-wise complexity values corresponding to the two turns. Next, we calculate the synchrony measure $\sigma$ as,

$$ \sigma = \cos^2 \theta_{12} = \left( \frac{X_1^{T} X_2}{|X_1| \, |X_2|} \right)^{2} \qquad (5.3) $$

We hypothesize that the difference between sample entropy values corresponding to a turn-pair in a well-coordinated conversation will be smaller than in a less well-coordinated conversation. $\sigma$ captures the above information and is therefore introduced as a coordination measure.
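A simplified sketch of this computation is shown below; it is an illustration under the stated settings ($m = 2$, $r = 0.25 \times$ the standard deviation, assumed non-zero) rather than the exact implementation, and the counting of template matches follows a common approximate variant of sample entropy.

```python
import numpy as np

def sample_entropy(x, m=2, r_factor=0.25):
    """Sample entropy of a 1-D sequence with tolerance r = r_factor * std(x)."""
    x = np.asarray(x, dtype=float)
    r = r_factor * np.std(x)            # assumes a non-constant sequence (r > 0)

    def count_matches(length):
        templates = np.array([x[i:i + length] for i in range(len(x) - length + 1)])
        count = 0
        for i in range(len(templates)):
            d = np.max(np.abs(templates - templates[i]), axis=1)  # Chebyshev distance
            count += np.sum(d < r) - 1                            # exclude self-match
        return count

    Em, Em1 = count_matches(m), count_matches(m + 1)
    return -np.log(Em1 / Em) if Em > 0 and Em1 > 0 else np.inf

def scdc(turn1_feats, turn2_feats):
    """Squared cosine similarity of Eq. (5.3) between per-feature complexity vectors.
    turn*_feats: (T, n_features) arrays of frame-level features for one turn."""
    X1 = np.array([sample_entropy(turn1_feats[:, k]) for k in range(turn1_feats.shape[1])])
    X2 = np.array([sample_entropy(turn2_feats[:, k]) for k in range(turn2_feats.shape[1])])
    cos = X1 @ X2 / (np.linalg.norm(X1) * np.linalg.norm(X2))
    return cos ** 2
```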
5.3.2 Lexical Features
A word embedding is a numerical representation of a word in a vector space. Word embeddings make it easier to perform mathematical operations on words and to analyze the results of those operations. In this study, we use 768-dimensional Bidirectional Encoder Representations from Transformers (BERT) [54] embeddings as lexical features. We choose BERT embeddings since they incorporate context, which handles polysemy and nuance better. There exist other word embeddings like word2vec [111] and GloVe [112], which are widely used for different downstream tasks like semantic search and information retrieval. BERT offers an advantage over these because it is designed to create dynamically informed, contextual embeddings by taking both forward and backward context into consideration, unlike the other models, which create context-free (static) embeddings. The potential of BERT has also been demonstrated in terms of superior performance in many downstream language-based inference tasks.
Here, we extract BERT embeddings (BERT BASE model with 12 transformer blocks, 12 attention heads and a hidden layer dimension of 768) for each word to form a feature matrix for every turn, where each row of the matrix is the embedding of one word of the turn. Once the feature matrices for the speaker turns are obtained, WMD is computed between the feature matrices of consecutive turn-pairs.
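For reference, turn-level BERT feature matrices can be extracted with the Hugging Face transformers library roughly as sketched below; this is an illustration that simply takes the contextual embeddings of all subword tokens in a turn, and the exact word-level alignment used in the thesis may differ.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def turn_embedding_matrix(turn_text):
    """Return a (num_tokens, 768) matrix of contextual BERT embeddings for one turn."""
    inputs = tokenizer(turn_text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # last_hidden_state: (1, num_tokens, 768); drop the [CLS] and [SEP] positions
    return outputs.last_hidden_state[0, 1:-1, :].numpy()
```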
5.3.2.1 Word Mover’s Distance (WMD)
Word Mover's Distance (WMD) was introduced by Kusner et al. [113] as a similarity measure between two text documents based on comparing word2vec [111] neural word embeddings. It is calculated as the minimum cumulative distance the embedded words in one document need to travel to reach the embedded words in the other document. It can be considered a special case of the popular Earth Mover's Distance transportation problem [114].
If $b_i$ and $b_j$ are the BERT embeddings corresponding to words $w_i$ and $w_j$ respectively, the distance between these words can be defined as,

$$ d(w_i, w_j) = \lVert b_i - b_j \rVert_2 $$

If $X_1$ and $X_2$ are the matrices corresponding to two turns of Speaker A and Speaker B, respectively, the WMD between the turns can be expressed as,

$$ \mathrm{WMD}(X_1, X_2) = \min_{T \geq 0} \sum_{i=1}^{m} \sum_{j=1}^{n} T_{ij} \, d(w_i, w_j) \qquad (5.4) $$

constrained on $\sum_{j=1}^{n} T_{ij} = \frac{1}{m}$ and $\sum_{i=1}^{m} T_{ij} = \frac{1}{n}$, where $m$ and $n$ are the number of words in turns $X_1$ and $X_2$, respectively, and $T_{ij}$ is the associated transport weight. Since we are working with contextual embeddings, the same word in two different positions will have distinct embeddings, so the word weights are chosen to be uniform for the WMD calculation.
Similar to the previously introduced measures, it is hypothesized that a conversation with
greater synchrony is likely to have a smaller average WMD compared to a less synchronized conversation.
5.4 Empirical validation of the proposed synchrony measures
In this work, the DTWD and SCDC of prosodic and spectral features are introduced as interaction coordination measures. While these methods have been previously employed to capture (dis)similarity between different time series, they have not been used together in the context of quantifying interpersonal synchrony from speech audio. It should be noted that prior work [25] has established the potential of WMD in capturing linguistic coordination based on word embeddings in a dyadic conversation setting. Since the use of WMD as a viable measure of interaction coordination has already been validated in previous work, in this section we only consider validating the usefulness of the measures based on DTWD and SCDC for characterizing synchrony in dyadic interactions.
We use the USC CreativeIT database [115] for these analysis experiments. It is a publicly available multimodal database consisting of dyadic conversations portraying theatrical improvisations. We generate approximately 2500 regular ("real") and random turn pairs from these interactions. Any consecutive pair of turns from the same interaction is considered a regular turn pair, while the random pairs are generated by arbitrarily choosing two turns from two different interaction sessions not involving the same speakers. Our hypothesis is that, for any feature set, the synchrony should be higher in the actual turn pairs than in the randomly shuffled pairs. To validate this hypothesis, we design a paired t-test comparing the coordination measures across the actual and random pairs, with the null hypothesis that the sample means of the two sets are equal.
The paired-sample t-test results show that prosodic feature synchrony is significantly higher in real turn pairs than in random turn pairs, based on both DTWD (test statistic = 4.277, p-value = 0.00001) and SCDC (test statistic = 3.705, p-value = 0.00021). A similar trend is seen for the synchrony of acoustic spectral features in terms of both DTWD (test statistic = 2.515, p-value = 0.0119) and SCDC (test statistic = 3.705, p-value = 0.0002).
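This comparison can be reproduced along the following lines with SciPy, assuming `real_vals` and `random_vals` hold the per-pair coordination measures for matched real and shuffled turn pairs (illustrative variable names).

```python
from scipy.stats import ttest_rel

def compare_real_vs_random(real_vals, random_vals):
    """Paired test of a coordination measure on real turn pairs vs. matched random pairs."""
    stat, p_value = ttest_rel(real_vals, random_vals)
    return stat, p_value

# Example usage (with arrays of equal length, one entry per matched pair):
# stat, p = compare_real_vs_random(real_vals, random_vals)
```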
5.5 Experimental Results on ASD Interaction Datasets
In this section, we report and analyze the results of the different experiments conducted to understand the differences in the proposed synchrony measures in interactions involving children with
and without an ASD diagnosis.
5.5.1 Classification experiment
Table 5.2: F1-score for ASD diagnosis

Classifier             Spectral DTWD   Spectral SCDC   Prosodic DTWD   Prosodic SCDC   Lexical WMD
Majority Classifier    0.3446          0.3446          0.3446          0.3446          0.3446
SVM (RBF)              0.5554          0.5345          0.5150          0.3428          0.5060
SVM (Linear)           0.5315          0.3446          0.5417          0.3448          0.4394
Logistic Regression    0.5424          0.3446          0.4865          0.3524          0.4861
The experiments in this subsection explore whether the proposed measures of interaction synchrony reveal differences between interactions involving children with an ASD diagnosis and those without. This is set up as a series of classification experiments aimed at assessing how well children diagnosed with and without ASD can be distinguished using the proposed synchrony measures. For the experiments, each interaction session is partitioned into child and adult speaker turns. Once the speaker turns are defined, the turns are collected into N segments and the coordination measures are calculated for every such segment. Hence, for each interaction session the classifier is given N features to predict the ASD or non-ASD output label. We split the interaction sessions into three groups: training set, validation set and test set, with 70% of the dataset for training and 15% each for development and testing. We repeat the same classification experiment
Figure 5.1: F1-score for ASD diagnosis with fused features (PS = Prosodic & Spectral, PL = Prosodic & Lexical, SL = Spectral & Lexical, PLS = Prosodic, Lexical & Spectral), shown for the Logistic Regression, SVM Linear and SVM RBF classifiers along with the majority-classifier baseline.
with different values of N and we select N = 5 as the optimal value based on classification performance on the validation dataset. We use speaker-out cross-validation so that the speakers used in
the training set are not used in the test set.
The experiments consider three classifiers, all well established in the literature: i) Support Vector Machine (SVM) with a linear kernel, ii) SVM with a Radial Basis Function (RBF) kernel, and iii) logistic regression. We also consider a classifier predicting the majority class as the baseline. SVM classifiers compute a maximum-margin hyperplane separating the classes, while the logistic regression classifier models the output label as a logistic function of one or more input features. The classification results with individual modalities are tabulated in Table 5.2, while those with fused features are reported in Fig. 5.1. We use early fusion, concatenating the features from the individual modalities before feeding them to the classifier.
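A schematic version of this setup with scikit-learn is shown below. It is a sketch only: the thesis uses a 70/15/15 speaker-out split, whereas here a grouped cross-validation loop stands in for that protocol, and the feature matrix is assumed to hold the N segment-level synchrony values (early-fused across modalities) per session.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GroupKFold
from sklearn.metrics import f1_score

# X: (num_sessions, N * num_measures) early-fused segment-level synchrony features
# y: (num_sessions,) ASD / non-ASD labels; groups: speaker identifiers
def evaluate(X, y, groups, n_splits=5):
    scores = []
    for train_idx, test_idx in GroupKFold(n_splits=n_splits).split(X, y, groups):
        clf = SVC(kernel="rbf", class_weight="balanced")   # RBF-kernel SVM classifier
        clf.fit(X[train_idx], y[train_idx])
        scores.append(f1_score(y[test_idx], clf.predict(X[test_idx]), average="macro"))
    return float(np.mean(scores))
```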
Figure 5.2: Two-way analysis of variance of F1 scores across age groups (AG1: 2.5 to 7.5 yrs, AG2: 7.5 to 10 yrs, AG3: more than 10 yrs) and gender (male, female).
5.5.2 ANOVA analysis based on age-group and gender
In addition to reporting the F1 scores for the ASD/non-ASD classification experiments with the different coordination features, we also carry out a two-way analysis of variance of the classification F1 scores across 3 age groups and gender. We partition the data into 3 different age groups (2.5-7.5 yrs, 7.5-10 yrs, ≥ 10 yrs) to gain insights into the synchrony features across different age groups and gender among children diagnosed with ASD and children who are not. Fig. 5.2 presents a box plot showing the median, maximum, minimum, 75th percentile and 25th percentile values of the F1 scores for each age group and gender.
5.5.3 Comparison of the distribution of the proposed measures across different
subtasks
We also report the mean value of these coordination measures for children diagnosed with and without ASD, across the 2 subtasks. The comparison of the distributions of these values is shown
Figure 5.3: Comparison of different coordination measures across subtasks (panels: prosodic DTWD, prosodic SCDC, spectral DTWD, spectral SCDC and WMD, each plotted across 5 segments for the ASD_Social, Non-ASD_Social, ASD_Emotions and Non-ASD_Emotions groups).
in Fig. 5.3. We calculate the mean value of the coordination measures for the turns collected in 5
segments and plot those values across the corresponding segments.
5.5.4 Discussion
From Table 5.2 we observe that, for both spectral and prosodic features, all classifiers yield better performance with DTWD-based synchrony features than with SCDC-based features. Comparing the results from Table 5.2 and Fig. 5.1, it can be seen that fusing the synchrony features across modalities improves the classification performance over using individual-modality features. Among the fused-feature experiments reported in Fig. 5.1, prosodic and spectral features together show the best performance, which indicates that there is complementary information across these modalities that is helpful for this classification task. While all the classifiers considered provide similar performance levels, the SVM with radial basis function kernel appears to be the most consistent across the experiments.
For the age-group based analysis reported in Fig. 5.2, the p-value of 0.000196 suggests that the null hypothesis can be rejected, implying that age group and gender both significantly affect the classifier F1 scores for differentiating between children with and without an ASD diagnosis based on the proposed coordination measures. Moreover, we also find an improvement in ASD/non-ASD classification performance among female children in the oldest age group as compared to the other age groups. This finding is consistent with the investigation presented in prior work [116] and motivates seeking more insight into why females are more likely than males to go undiagnosed until an older age.
Fig. 5.3 presents the variation of the mean of the proposed measures across the 2 subtasks and ASD status. In the majority of cases, higher mean values of the proposed measures are reported for children with an ASD diagnosis in both the social and emotion subtasks, which indicates that children with an ASD diagnosis exhibit less synchrony in terms of these measures than children without an ASD diagnosis. It is interesting to note that, for all the children, the emotion subtask shows less synchrony for most of the duration of the session, as revealed by all the measures.
5.6 Conclusion
Previous behavioral science research has established the importance of interpersonal synchrony
in understanding behavior patterns in human interaction. In this work, we propose three different
measures of synchrony across different aspects of speech communication (vocal acoustics, prosody
and language use).
To investigate whether these synchrony features offer insights into potential differences in interaction patterns involving children diagnosed with ASD and those who are not, we set up a classification experiment utilizing the synchrony features. Results show that the proposed synchrony features are able to distinguish interactions involving ASD and non-ASD children, indicating the role of coordination as an element of difference in social communication patterns.
Moreover, the analysis shows that the synchrony features across the different information modalities of spoken interactions, captured by spectral features, prosodic features and language patterns, provide complementary information distinguishing the two groups: children with and without an ASD diagnosis.
There are several challenging research directions for extending this work which we would like to explore in the future. We plan to investigate more data-driven approaches to quantify synchrony instead of knowledge-driven approaches. Since neural networks can efficiently learn nonlinear mappings between features and coordination measures, future work can explore the use of deep neural network based models to learn representations related to synchrony.
Chapter 6
A context-aware computational approach to quantify vocal
entrainment in dyadic interactions
6.1 Introduction
Interpersonal human interactions, notably dyadic interactions (interactions involving two people),
are widely studied by social science and human-centered computing researchers alike [1], [117].
Such interactions are characterized by rich information exchange across multiple modalities including speech, language, and visual cues. Over the years, a significant amount of effort has been
invested in developing tools for both conversational data collection and in understanding and modeling the signals extracted from these interactions.
A phenomenon called entrainment [118], [119] has been described as one of the major driving
forces of an interaction [120]. While entrainment can be exhibited within and across different
modalities, vocal entrainment [121] or acoustic-prosodic entrainment [119], [122], [123] is defined
as an interlocutor’s tendency to accommodate or adapt to the vocal patterns of the other interlocutor
over the course of the interaction.
Understanding entrainment [124] can provide meaningful insights to analyze behavioral characteristics of the individual interlocutors and the interaction participants. For example, a higher degree of entrainment is associated with positive behavioral markers like social desirability, smoother interactions, higher rapport, etc. [125], [126]. Entrainment can also serve as a valuable instrument to characterize behaviors in the study and practice of psychiatry and in developmental studies involving distressed couples, children with autism spectrum disorder, addiction, etc. [121], [124].
Due to the complex nature of entrainment and the scarcity of appropriately labeled speech corpora, quantifying entrainment is a challenging task. Most of the early works relied on empirical and knowledge-driven tools such as correlation, recurrence analysis, time-series analysis and spectral methods to measure how much a speaker is entraining to the other speaker [127]. This body of work often relied on the assumption of a linear relationship between the extracted entrainment representations and vocal features, which may not always hold. Moreover, although context during a conversation plays an important part in interpersonal interactions, it has not been incorporated in existing approaches for measuring entrainment. While a recent line of work [121] employs a more direct data-driven strategy to extract entrainment-related information from raw speech features, these methods are formulated in a way that inherently considers only short-term context while overlooking longer-term context.
Recently, context-aware deep learning architectures such as transformers [128] have been proposed to capture richer context by explicitly modeling the temporal dimension, and they have found many applications in natural language processing, speech and vision. In light of their success in modeling rich temporal context, we investigate whether transformers can help capture meaningful information for quantifying entrainment.
In this work, we develop a context-aware model for computing entrainment, addressing the
need for both short and long-range temporal context modeling. For the scope of this work, the
proposed framework incorporates ‘context’ by aiming to train the model to learn the influence of
the speakers on each other. We follow the established strategy of using a distance-based measure
between consecutive turn-pairs in the projected embedding space and introduce the Contextual
Entrainment Distance (CED) metric. The main contributions of this work are twofold: first,
we use a combination of self-attention and convolution to extract both short-term and long-term
Figure 6.1: Architecture for CED extraction. Raw speech from Speaker 1 and Speaker 2 passes through TERA feature extraction, per-speaker self-attention (self encoder) and cross-subject attention (cross encoder) modules; the fused embeddings are fed to dense layers that predict whether a session is real or fake, and at test time the cross-encoder outputs Z1 and Z2 are extracted.
contextual information related to entrainment; and second, we propose a transformer-based cross-subject framework for jointly modeling the interacting speakers to learn the pattern of entrainment. We experimentally evaluate the validity and efficacy of CED in dyadic conversations involving children and study its association with different clinically significant behavioral ratings where the role of entrainment has previously been implicated [124].
6.2 Computing context-aware entrainment measure
6.2.1 Unsupervised model training and CED computation
Prior literature in this domain has relied on computing a distance measure directly between the turn-level speech features X1 and X2 from speaker 1 and speaker 2 respectively [119]. However, these features also capture additional information such as speaker characteristics and ambient
acoustic information, which do not contribute towards learning the target entrainment patterns. The objective is to learn a mapping from the feature space (X1, X2) to an embedding space (Z1, Z2) such that the model learns to recognize turn pairs with high and low levels of entrainment.
Here, we formulate the problem by training the network to classify between interactions having consecutive turn segments (true samples) and interactions having randomly shuffled turn segments (fake samples). We temporally partition the conversational audio sequence into speaker-specific chunks and feed these chunks to the model to predict whether they are part of a real conversation or a fake one.
After the training phase, we use the trained network weights to extract the cross-encoder layer
outputs for both speakers. Next, we calculate CED as the smooth L1 distance [121] between the
embeddings obtained in the previous step.
6.2.2 Model architecture
As shown in Fig. 6.1, we use two main modules to build the model for computing entrainment: first, a self-attention encoder that enhances the extracted features by attending to themselves, and then a cross-attention encoder that allows the features to attend to a different source.
We use conformer [129] layers for the self-attention module to model both short-term and long-term dependencies within an audio sequence in a parameter-efficient way, by incorporating a convolutional module in the transformer layer. The self-attention layer obtains meaningful representations of the long-term interaction, and the convolution layer is used to learn local relations among the interaction-based features.
To extract meaningful information related to entrainment, previous works have mostly relied on individual modeling of the interlocutors involved in a conversation. However, since entrainment is an interpersonal phenomenon, the need for jointly modeling the interlocutors becomes heightened in such scenarios. We address this issue by using a transformer layer for cross-subject attention, allowing the features extracted for each subject to attend to the other's, in order to capture cross influence over the interaction.
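A pared-down PyTorch sketch of this two-branch design is given below. It is not the exact thesis model: standard transformer encoder layers stand in for the conformer blocks, the layer sizes are placeholders, and the class and argument names are illustrative.

```python
import torch
import torch.nn as nn

class CrossSubjectEntrainmentNet(nn.Module):
    """Sketch: per-speaker self-attention, cross-subject attention, fusion, real/fake head."""
    def __init__(self, feat_dim=768, d_model=64, n_heads=4):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        # self-attention encoders (conformer layers in the original model)
        self.self_enc1 = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.self_enc2 = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # cross-subject attention: each speaker's features attend to the other's
        self.cross12 = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross21 = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.ReLU(),
                                  nn.Linear(d_model, 1))

    def forward(self, x1, x2):
        # x1, x2: (batch, time, feat_dim) turn features for speaker 1 and speaker 2
        h1 = self.self_enc1(self.proj(x1))
        h2 = self.self_enc2(self.proj(x2))
        z1, _ = self.cross12(query=h1, key=h2, value=h2)   # speaker 1 attends to speaker 2
        z2, _ = self.cross21(query=h2, key=h1, value=h1)   # speaker 2 attends to speaker 1
        pooled = torch.cat([z1.mean(dim=1), z2.mean(dim=1)], dim=-1)
        return self.head(pooled)                            # logit for real vs fake
```

At test time, the cross-encoder outputs z1 and z2 would be extracted and CED computed as a distance between them, mirroring the procedure described in Sec. 6.2.1.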
6.3 Experiments
We use the following two datasets for our experiments.
The Fisher Corpus English Part 1 (LDC2004S13) [130] consists of spontaneous telephonic conversations between two native English-speaking subjects. There are 5850 conversations of approximately 10 minutes duration. The dataset is accompanied by transcripts along with timestamps marking speaker turn boundaries. The ADOSMod3 corpus consists of recorded conversations from autism diagnostic sessions between a child and a clinician who is trained to observe the behavioral traits of the child related to Autism Spectrum Disorder (ASD). A typical interactive session following the Autism Diagnostic Observation Schedule (ADOS)-2 instrument lasts about 40-60 minutes, and these sessions are composed of a variety of subtasks to evoke spontaneous responses from the children under different social and communicative circumstances. In this work, we consider the administration of Module 3, meant for verbally fluent children and adolescents. Moreover, we focus on the Emotions and Social difficulties and annoyance subtasks, as these are expected to elicit significant spontaneous speech and reactions from the child while answering questions about different emotions and social difficulties. The corpus consists of recordings from 165 children collected across 2 different clinical sites. We use this corpus for evaluation purposes; the demographic details of the dataset are reported in Table 6.1.
6.3.1 Experimental setup
6.3.1.1 Feature extraction
In this work, to compute CED the speech segments of interest are conversational turns from both
speakers. We compute the speaker turn boundaries from the time information available in the
transcripts, excluding the intra-turn pauses to avoid including noisy and redundant signals. For
Table 6.1: Demographic details of ADOSMod3 dataset

Category            Statistics
Age (years)         Range: 3.58-13.17; (mean, std): (8.61, 2.49)
Gender              123 male, 42 female
Non-verbal IQ       Range: 47-141; (mean, std): (96.01, 18.79)
Clinical diagnosis  86 ASD, 42 Attention Deficit Hyperactivity Disorder (ADHD), 14 mood/anxiety disorder, 12 language disorder, 10 intellectual disability, 1 no diagnosis
Age distribution    Cincinnati: ≤5 yrs 7, 5-10 yrs 52, ≥10 yrs 25; Michigan: ≤5 yrs 11, 5-10 yrs 42, ≥10 yrs 28
Table 6.2: Classification accuracy (%) for real vs fake sessions

Measure      Fisher Corpus   ADOSMod3 Corpus
Baseline 1   80.52           82.22
Baseline 2   76.33           70.64
Baseline 3   82.91           85.73
CED          92.13           95.66
every speaker turn, we extract self-supervised TERA embeddings [55] to obtain a 768-dimensional feature vector. We choose TERA embeddings because TERA employs a combination of auxiliary tasks to learn the embedding instead of relying on a single task, and it is therefore expected to learn enhanced features from raw speech signals.
6.3.1.2 Parameters and implementation details
We use 352 and 64 attention units for the conformer and transformer layers, respectively, while 4 attention heads are employed for both. The full architecture, obtained by using a conformer layer followed by a transformer layer, results in 2.1M parameters. The model is trained with a binary cross-entropy-with-logits loss function and the Adam optimizer with an initial learning rate of 1e-5. Early stopping is applied after 10 epochs if no improvement is seen in the validation loss, and a dropout rate of 0.2 is used for every dropout layer in the model.
6.3.2 Experimental validation of CED
We carry out an ad-hoc real/fake classification experiment to validate CED as a metric for measuring entrainment. For every real sample session we synthesize a fake sample session by shuffling
the speaker turns while maintaining the dyadic conversation sequence, an approach that is widely
adopted in the study of temporal aspects of interaction such as synchrony. The hypothesis is more
entrainment is expected to be observed in real sessions as compared to fake sessions resulting in
the real sessions having lesser CED. The classification accuracies are reported in Table 2. The
classification experiment steps are as follows:
• We calculate CED measure for both directions for every proximal turn-pair and calculate the
mean (divide by 2 for direction factor) for the real and fake sample session.
• We compare the average CED distance from all the turn pairs for the real and fake session,
the sample sessions are correctly classified if CED of real session is less than fake session.
• The experiment is repeated 30 times to eliminate any bias introduced while randomly shuffling the speaker turns.
As baselines, we use three distance measures computed between the extracted turn-level pretrained embeddings: smooth L1 distance [121] (Baseline 1) and two measures introduced in [124],
namely, DTWD (Baseline 2), and SCDC (Baseline 3).
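The per-session comparison in the steps above can be expressed compactly as in the following sketch, where `ced` stands for the trained model's directional turn-pair distance function (an assumed interface rather than the thesis code).

```python
import numpy as np

def session_ced(turn_pairs, ced):
    """Mean bidirectional CED over all consecutive turn pairs in a session."""
    vals = [(ced(a, b) + ced(b, a)) / 2.0 for a, b in turn_pairs]
    return float(np.mean(vals))

def is_correctly_classified(real_pairs, fake_pairs, ced):
    # A session counts as correct if the real version shows more entrainment,
    # i.e., lower CED, than its turn-shuffled counterpart.
    return session_ced(real_pairs, ced) < session_ced(fake_pairs, ced)
```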
6.3.3 Experimental evaluation
In this experiment, we calculate the correlation between the proposed CED measure and the clinical scores relevant to ASD, reported in Table 6.3. Since CED is directional in nature, we compute the correlation metric in both directions: child to psychologist and psychologist to child. We report Pearson's correlation coefficient (ρ) along with the corresponding p-value, to test the null hypothesis that there exists no linear association between the proposed measure and the clinical scores.
Amongst the clinical scores, VINELAND scores are designed to measure adaptive behaviour of
Table 6.3: Correlation between CED and clinical scores relevant to ASD, Pearson's ρ with p-values in parentheses (bold figures imply statistical significance, p < 0.05) (CP: child to psychologist, PC: psychologist to child)

Clinical score            CED-PC ρ (p-value)   CED-CP ρ (p-value)
VINELAND ABC              -0.061 (0.237)       0.012 (0.827)
VINELAND Social           -0.021 (0.345)       0.071 (0.073)
VINELAND Communication    -0.158 (0.003)       0.043 (0.428)
CSS                       0.222 (0.004)        0.023 (0.672)
CSS-SA                    0.231 (0.012)        0.03 (0.472)
CSS-RRB                   0.158 (0.055)        0.091 (0.262)
individuals; VINELAND ABC stands for the Adaptive Behaviour Composite score, while VINELAND Social and VINELAND Communication are adaptive behavior scores for the specific skills of socialization and communication. CSS stands for Calibrated Severity Score, which reflects the severity of ASD symptoms in individuals. CSS-SA and CSS-RRB reflect ASD symptom severity along the two domains of Social Affect and Restrictive and Repetitive Behaviours. The details of the clinical scores related to ASD are described in [90], [131], [132].
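These correlations can be computed per score with SciPy as sketched below; `ced_pc` and the clinical score arrays are assumed to be aligned per session, and the names are illustrative.

```python
from scipy.stats import pearsonr

# ced_pc: psychologist-to-child CED per session;
# scores: dict mapping a clinical score name (e.g., "VINELAND Communication", "CSS")
# to an array of per-session values aligned with ced_pc.
def correlate_with_scores(ced_pc, scores):
    results = {}
    for name, values in scores.items():
        rho, p = pearsonr(ced_pc, values)
        results[name] = (rho, p)
    return results
```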
We also report the absolute values of the proposed CED measure (in both directions) for different genders and age groups. We partition the dataset into 3 age groups (Group 1: ≤ 5 yrs, Group 2: > 5 yrs & ≤ 10 yrs, Group 3: > 10 yrs), and for each age group we report the directional CED measure for the male and female subgroups in Fig. 6.2.
6.4 Results and Discussion
The results reported in Table 6.2 reveal that CED achieves better performance than the baseline methods in identifying real and fake sessions, in terms of classification accuracy, on both the Fisher and ADOSMod3 corpora, which validates the use of CED as a proxy metric for measuring entrainment.
Figure 6.2: Absolute values of CED across age and gender from ADOSMod3 (AG1: ≤ 5 yrs, AG2: > 5 yrs & < 10 yrs, AG3: ≥ 10 yrs; CED-PC and CED-CP shown separately for female and male children).

Figure 6.3: Attention activations from (a) cross-encoder 1 and (b) cross-encoder 2.
Results in Table 6.3 show that the VINELAND Communication score is negatively correlated with the psychologist→child CED with a statistically significant statistic, which is consistent with the definition of CED, since higher CED signifies lower entrainment. The CSS and CSS-SA scores are positively correlated with CED. It is interesting to note that while the psychologist→child CED captures signals with meaningful interpretations, no such evidence is found for the child→psychologist CED measures. A possible explanation is that, since the model is trained with dyadic conversations between adults from the Fisher corpus, it is unable to capture the nuances of interactions involving children, which is reflected in these results. It is also worth mentioning that while there exists a significant correlation between CSS, CSS-SA and the psychologist→child CED, CSS-RRB also shows weak evidence of a positive correlation with the psychologist→child CED.
In Fig. 6.2, the distributions of the absolute values of CED are reported across gender and age groups. Both directional CED measures are seen to have lower mean values in females than in males, which reiterates the claim reported in [133] that women are better at disguising autism symptoms than men. Across age groups, the experimental results do not show any discernible trend in CED in either direction for male children; however, the female psychologist→child CED is seen to decrease with increasing age, which also supports the claim presented in [133].
We also investigate the activation weights of the cross-encoder attention layers to understand which parts of the speaker turns are emphasized by the attention heads to extract meaningful signals. The attention activation heatmaps from cross-encoders 1 and 2 reported in Fig. 6.3 show that the attention layers attend to the initial few time frames of the second speaker's turn, which supports the claim in [121] and domain theory that the initial and final interpausal units from the second and first speaker, respectively, are a rich source of signals related to entrainment.
6.5 Conclusion
In this work we introduce a novel context-aware approach (CED) to measure vocal entrainment in dyadic conversations. We use a combination of convolutional neural networks and transformers to capture both short-term and long-term context, and also employ a cross-subject attention module to learn interpersonal entrainment-related information from the other subject in a dyadic conversation. We validate the use of CED as a proxy metric for measuring entrainment by conducting a classification experiment to distinguish between real (consistent) and fake (inconsistent) interaction sessions. We also study the association between CED and clinically relevant scores related to ASD symptoms by computing correlation metrics, and we report the mean absolute value of directional CED across gender and different age groups to understand whether the entrainment pattern of the children varies across gender or age group. In this work, we use a self-supervised embedding for feature extraction; it will be interesting to see whether other context-based pretrained embeddings yield similar performance in capturing entrainment. We also faced difficulties in deploying entrainment embeddings learnt on Fisher to the ADOSMod3 dataset, and we thus plan to investigate domain-specific entrainment embeddings for understanding behavioral traits.
Chapter 7
Understanding entrainment patterns under contrastive
supervision
7.1 Introduction
Understanding and analyzing interpersonal spoken interactions has emerged as a widely studied
theme [1], [117] in the fields of sociology, psychology and human-centered computing. Over the
years, researchers have made significant efforts to collect conversational data, process and model
signals extracted from those data to seek insights about conversational patterns and mechanisms.
Dyadic interactions involve two interlocutors, and these exchanges are rich with linguistic content
(“what is communicated”), vocal nuances (“how it is communicated”), and various verbal, nonverbal vocal and visual signals conveying affect. Additionally, temporal contextual information from both interlocutors contributes to estimating the flow of the conversation, as well as the socio-cognitive characteristics of the interlocutors involved. A social adaptive mechanism termed entrainment has been reported [120] to be one of the catalysts shaping those interactions.
Entrainment can be broadly defined as the adaptive tendency of a speaker to accommodate
or sound similar to the other interlocutor involved, over the course of a conversation. A multitude of prior works have coined different terminologies to describe the phenomenon, including
convergence [120], [134], alignment [135], and entrainment [118]. Studying entrainment patterns
can offer valuable insights towards understanding behavioral traits of the individuals involved in
Figure 7.1: Three main ideas are employed to address the challenges in modelling entrainment in child-inclusive dyadic interactions (its complex definition as a nuanced interpersonal construct, its exhibition across multiple modalities such as vocal cues, prosody and text, nuisance factors such as speaker characteristics and channel information, and the unavailability of datasets with reliable labels): a) understanding entrainment in conversational exchanges through turn-level modeling, b) quantifying entrainment using an unsupervised, contrastive formulation, c) validation by studying the relevance of the proposed entrainment measure with respect to clinically meaningful behavioral scores.
the interaction. Prior works have established a strong positive association between entrainment and different measures of interaction effectiveness, such as social desirability [125], naturalness of conversation [136], rapport building, and smoother interactions. In the context of dyadic conversations, a higher degree of vocal entrainment is found to be positively associated with different interpersonal behavioral traits, including higher empathy of the clinical provider [137], greater emotional bond [108], and increased interpersonal agreement and positive therapy outcomes in couples therapy [138].
Autism Spectrum Disorder (ASD) is a developmental condition characterized by atypical social communication and skills, accompanied by constrained, repetitive behaviors, interests and movements. The prevalence of ASD among children in the United States has increased from 1 in 150 in 2002 to 1 in 36 as per the latest CDC reports in 2023. Prior research has shown that early diagnosis and intervention for ASD yield long-term positive effects. During the formative years the brain is still developing, and due to this plasticity, treatments in children have a better chance of being effective. This underscores the need for enabling detailed behavioral analyses of child-inclusive clinical interactions, specifically in the context of autism research and clinical translation. As interpersonal entrainment in dyadic conversations offers valuable insights into behavioral dynamics, it holds promising potential to assist scientific and clinical investigations focused on interactions within the realm of ASD, which is marked by variations in social communication and interactive skills.
Entrainment can be manifested across multiple modalities including vocal signals, linguistic patterns, facial gestures, etc. Most prior works have focused on modeling entrainment from individual modalities, thus not leveraging the information present across multiple modalities. Recently, with the advancement of multimodal modelling of modalities such as speech, text and images, cross-modal supervision using contrastive learning has shown enormous potential in a diverse set of discriminative and generative tasks including classification, retrieval, question answering and generative modeling. These models are usually pretrained using weakly aligned data, thus removing the need for manual annotations, which satisfies our requirements for modeling entrainment across multiple modalities.
In this work, we propose to model the tendency of the interlocutors to sound similar by leveraging the similarity present across the conversational turns in consecutive turn-pairs in a contrastive
manner. As shown in Fig. 7.1, turn-level conversational information is integral for modeling intra
as well as interpersonal behavior. Concepts of entrainment are strongly tied to the joint characteristics of conversational turns of both the interlocutors involved in the dyadic exchange. We
aim to model the joint space of conversational turns to obtain feature representations encoding
entrainment in both uni- and cross-modal settings. We also apply the introduced measures of entrainment to analyze the behavioral attributes of children with and without ASD engaged in dyadic
interaction.
To this end, we outline the major contributions of this work:
1. We propose a novel Contrastively Learnt Entrainment Distance (CLED) as a proxy measure for quantifying interpersonal entrainment in dyadic scenarios. We compute CLED by
learning a feature representation encoding entrainment information using the joint space of
conversational turns from consecutive turn pairs.
2. We also investigate the cross-modal subspace, with speech and language modality, to learn
the feature representations capturing entrainment.
3. We test the efficacy of the proposed entrainment measures in clinical interactions involving
children with ASD by analyzing the relation with different behavioral attributes related to
ASD.
The rest of the chapter is organized as follows: in Sec. 7.2 we discuss related works on modeling entrainment, its relevance in autism research using computational methodologies, and cross-modal supervision for representation learning. We describe the methodology adopted for computing CLED in Sec. 7.3, outlining both the uni-modal and cross-modal experiments. In Sec. 7.4, we provide details of the datasets used during the pretraining and evaluation phases. In Sec. 7.5 we describe the setup for computing CLED and the different downstream experiments. We analyze the experimental results in Sec. 7.6 and finally summarize our findings in Sec. 7.7.
7.2 Background
Modeling entrainment: Over the years, the dynamical process of entrainment has been subject to conceptual scrutiny, resulting in numerous theories explaining the mechanism. Due to its nuanced nature and multifaceted definition, modeling and quantifying entrainment is a challenging computational task. In early efforts, researchers primarily relied on trained human annotators to assess the extent of entrainment by observing the interaction [139], [140]. However, these human observational methods have limitations: manual annotation efforts are tedious and time-consuming, and, due to their subjective nature, they often lead to poor inter-annotator agreement, resulting in incorrect and inconsistent labels.
Because of these limitations, later efforts in modeling entrainment resorted to knowledge-driven feature analysis, such as recurrence analysis, correlation analysis, and time-series analysis [127]. This direction of research is restricted by the assumption of a linear association between the vocal feature descriptors, which does not always hold. Later, nonlinear dynamical system-based measures [108], [138] were used to model coordination or synchrony, but all of these works computed synchrony measures considering the feature descriptors individually, which has limited utility in assessing the entrainment between the vocal patterns of all interlocutors. A recent line of works [121], [141] employs a more direct data-driven strategy to extract entrainment-related information from raw speech features using neural network approaches, but all of these have investigated entrainment only across vocal patterns. Lahiri et al. [124] attempted to study synchrony patterns involving multiple modalities; however, they also modeled synchrony within each modality separately, rather than utilizing their combined cross-modal subspace.
Computational methodologies supporting autism research: In recent years, computational methodologies from signal processing and machine learning have emerged for both research and clinical applications in mental health [97]. For example, machine learning algorithms have been suggested in autism studies to gain insights into, and possibly streamline, the ASD diagnosis procedure by leveraging expert human-coded behaviors derived from tools like the Autism Diagnostic Interview-Revised (ADI-R) and the Autism Diagnostic Observation Schedule (ADOS). Computational techniques have demonstrated efficacy in facilitating diagnostic procedures by identifying crucial nosological modules, streamlining processes without compromising accuracy [9], [31], [33]. Additionally, these methods have furnished tools to deepen scientific comprehension of communication and interaction mechanisms. For example, [10] correlates objective signal-derived descriptors of vocal prosody with subjective perceptions of prosodic awkwardness, revealing distinctions in acoustic-prosodic features between ASD and neurotypical individuals; notably, it showed differences in vocal prosody coordination during interactions, varying with the severity of ASD symptoms in children. Such approaches extend prior findings demonstrating a significant correlation between interlocutors' prosody and language patterns and the severity of ASD symptoms in subjects [98], [102], emphasizing the necessity of integrating interpersonal coordination into studies of social communication and interaction among children with ASD.
7.3 Contrastive Learning for Understanding Entrainment
We aim to compute a distance measure derived from the feature descriptors to quantify entrainment in dyadic interactions. Earlier works [101], [119] have derived a proxy measure of entrainment by directly computing a correlation or a distance measure from the raw feature descriptors or their dimension-reduced variants. To precisely capture only entrainment-related information and avoid nuisance factors such as speaker characteristics or channel information, we follow a data-driven approach introduced in [121]: we project the raw feature vectors to a representation space using a learnt transformation and compute the entrainment distance from the projected feature descriptors.
For computing the entrainment measure, we train a model to encode the associated entrainment
information based on conversational turns. Assuming $X^i$ and $Y^i$ denote the turn feature descriptors from speakers X and Y corresponding to the $i$-th turn-pair, the distance can be expressed as:

$$ D_{ent}(X^i, Y^i) = d\big(f(X^i), f(Y^i)\big) \qquad (7.1) $$
There are two main components in this approach: first, learning the transformation function that projects the raw feature vectors to the embedding space so as to encode entrainable information while avoiding the nuisance factors, and second, selecting an optimal distance metric to compute the final distance measure in the projected embedding space. Following previous work [121], we adopt contrastive-learning-based pretraining to preserve entrainable information while remaining invariant to the nuisance factors. We also explore different distance metrics to compute the entrainment measure.
7.3.1 Encoding entrainment under contrastive supervision
In this section, we describe how we learn the transformation function f(.) using a contrastive pretraining strategy. Due to the lack of reliably labeled corpora encoding entrainment and its extent, we employ an unsupervised strategy to model entrainment. We hypothesize that consecutive speaker turns in a natural conversation will possess a greater amount of entrainable information than synthetically created (scrambled) conversations constructed with random speaker turns. Based on this assumption, we train the model with a contrastive loss function to discriminate between consecutive turn-pairs and non-consecutive turn-pairs. We subsequently compute the entrainment measure, the Contrastively Learnt Entrainment Distance (CLED), as a distance between the encoded embeddings. Specifically, we attempt to learn a joint space based on the similarity of consecutive turn pairs using contrastive learning.
Self-supervised learning (SSL) algorithms have shown success in capturing meaningful information associated with many downstream tasks including automatic speech recognition, speaker
recognition, and text classification. Since SoTA SSL models are built using transformer architectures that have demonstrated superior ability in capturing longer context information, we employ these pretrained speech embeddings as our feature descriptors for learning entrainment information.
Let $X^i$ and $Y^i$ denote the speech or lexical signals extracted from the consecutive turns of speakers X and Y, corresponding to the $i$-th turn-pair. For both the speech and text modalities, we consider different pretrained models to encode the turn-level signals, yielding $\hat{X}^i_1$ from turn 1 and $\hat{Y}^i_2$ from turn 2 of the $i$-th turn-pair using encoders $g_1(\cdot)$ and $g_2(\cdot)$ respectively, where $\hat{X}^i_1$ and $\hat{Y}^i_2$ are $V$-dimensional vectors:

$$ \hat{X}^i_1 = g_1(X^i_1), \qquad \hat{Y}^i_2 = g_2(Y^i_2) \qquad (7.2) $$
Next, we attempt to learn the transformation functions $f_1(\cdot)$ and $f_2(\cdot)$ that map the feature descriptors for turn 1 and turn 2 of a turn-pair to a joint space based on their similarity. For a batch of $N$ turn-pairs, we compute

$$ E_1 = f_1(\hat{X}_1), \qquad E_2 = f_2(\hat{Y}_2) \qquad (7.3) $$

where $E_1 \in \mathbb{R}^{N \times d}$, $E_2 \in \mathbb{R}^{N \times d}$, $\hat{X}_1, \hat{Y}_2 \in \mathbb{R}^{N \times V}$, and $f_1(\cdot)$ and $f_2(\cdot)$ are the learnt linear transformation functions. After the transformed embeddings for both speaker turns are computed, their similarity can be derived as

$$ C = \tau \cdot (E_1 E_2^{T}) \qquad (7.4) $$

where $\tau$ is a temperature factor to regulate and scale the range of the logits. The computed similarity matrix $C \in \mathbb{R}^{N \times N}$ should equal 1 across the diagonal, denoting the correct consecutive turn pairs in a batch of $N$ samples, whereas the remaining $N^2 - N$ elements can be considered incorrect pairs. The contrastive loss function is hence

$$ L_{con} = 0.5 \cdot \big( l_{turn_1}(C) + l_{turn_2}(C) \big) \qquad (7.5) $$

$$ l_{turn} = \frac{1}{N} \sum_{i=1}^{N} \log\big( [\mathrm{softmax}(C)]_{ii} \big) \qquad (7.6) $$

where $l_{turn}$ is computed by Equation 7.6 across the turns from speakers X and Y, respectively, for all the turn-pairs. We use the symmetric cross-entropy loss over the similarity matrix $C$ for training the linear transformations $f_1(\cdot)$ and $f_2(\cdot)$.
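A condensed PyTorch sketch of this symmetric contrastive objective is given below. It follows the CLIP-style formulation of Eqs. (7.3)-(7.6) but is not the thesis code: embedding normalization and division by the temperature follow the common CLIP convention, and the class name and layer sizes are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TurnPairProjector(nn.Module):
    """Linear projections f1, f2 mapping turn embeddings into a joint space."""
    def __init__(self, in_dim=768, out_dim=128, temperature=0.07):
        super().__init__()
        self.f1 = nn.Linear(in_dim, out_dim)
        self.f2 = nn.Linear(in_dim, out_dim)
        self.tau = temperature

    def forward(self, x1_hat, y2_hat):
        # x1_hat, y2_hat: (N, V) pretrained embeddings for turn 1 and turn 2
        e1 = F.normalize(self.f1(x1_hat), dim=-1)
        e2 = F.normalize(self.f2(y2_hat), dim=-1)
        return e1 @ e2.T / self.tau                 # (N, N) similarity matrix C

def contrastive_loss(logits):
    """Symmetric cross-entropy: diagonal entries are the true consecutive pairs."""
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_turn1 = F.cross_entropy(logits, targets)   # rows: each turn 1 vs all turn 2
    loss_turn2 = F.cross_entropy(logits.T, targets) # columns: each turn 2 vs all turn 1
    return 0.5 * (loss_turn1 + loss_turn2)

# Usage sketch: loss = contrastive_loss(model(x1_hat, y2_hat))
```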
For inference, we use the learnt transformation functions $f_1(\cdot)$ and $f_2(\cdot)$ to map the encoded signals from turn 1 and turn 2 of the $k$-th turn-pair to obtain $E^k_1$ and $E^k_2$ respectively. Finally, we compute the distance metric CLED between $E^k_1$ and $E^k_2$ as a proxy measure of entrainment. We consider 3 different distance metrics: smooth L1 distance, L2 distance and cosine distance, which are given by,

$$ \mathrm{CLED}_{smoothL1}(E^k_1, E^k_2) = \begin{cases} 0.5\,\lVert E^k_1 - E^k_2 \rVert^2 & \text{if } \lVert E^k_1 - E^k_2 \rVert \leq 1 \\ \lVert E^k_1 - E^k_2 \rVert - 0.5 & \text{otherwise} \end{cases} \qquad (7.7) $$

$$ \mathrm{CLED}_{L2}(E^k_1, E^k_2) = \lVert E^k_1 - E^k_2 \rVert_2 \qquad (7.8) $$

$$ \mathrm{CLED}_{cosine}(E^k_1, E^k_2) = \frac{E^k_1 \cdot E^k_2}{\lVert E^k_1 \rVert \, \lVert E^k_2 \rVert} \qquad (7.9) $$
It is important to note that, while contrastive pretraining and its related loss function are discussed
in the context of turns from Speaker X and Y, the choice and explanation of variable names are
tailored for clarity and ease of understanding. Furthermore, the proposed approach isn’t designed
to capture generic speaker-specific information; instead, its focus lies on encoding entrainable
information within consecutive turns of a turn pair.
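The three distance variants of Eqs. (7.7)-(7.9) can be computed directly from the projected embeddings, for example as in the following sketch; the smooth L1 case is treated here as a Huber-type function of the embedding distance, consistent with Eq. (7.7) as written.

```python
import numpy as np

def cled_smooth_l1(e1, e2):
    diff = np.linalg.norm(e1 - e2)
    return 0.5 * diff ** 2 if diff <= 1.0 else diff - 0.5

def cled_l2(e1, e2):
    return float(np.linalg.norm(e1 - e2))

def cled_cosine(e1, e2):
    return float(e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2)))
```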
7.3.2 Uni-modal and Cross-modal Design
Since entrainment can be exhibited across multiple modalities, we are interested in learning to what
extent contrastive pretraining can help in capturing entrainment across lexical and vocal modalities. Prior works [124] have reported complementary information to be present across modalities
Table 7.1: ADOSMod3 dataset demographic details

Category            Statistics
Age (years)         Range: 3.58-13.17; (mean, std): (8.61, 2.49)
Gender              123 male, 42 female
Non-verbal IQ       Range: 47-141; (mean, std): (96.01, 18.79)
Clinical diagnosis  86 ASD, 42 Attention Deficit Hyperactivity Disorder (ADHD), 14 mood/anxiety disorder, 12 language disorder, 10 intellectual disability, 1 no diagnosis
Age distribution    ≤5 yrs 18, 5-10 yrs 94, ≥10 yrs 53
while modeling interpersonal synchrony from dyadic interactions. Motivated by these insights, we
apply the proposed pretraining approach to model vocal entrainment and lexical entrainment. For
vocal entrainment, we consider pretrained speech embeddings for both turns, and aim to learn a
joint subspace to encode the similarity across the speech embeddings. Similarly, to model lexical
entrainment, we repeat the same experiment but with pretrained text embeddings.
Recently, Radford et al. [142] explored the cross-modal space to learn relationships between
images and text and reported superior performance on downstream visual classification tasks.
Later, [143], [144] have followed similar directions to bridge the gap between audio and text by
learning the relationship between audio and language semantics using contrastive pretraining. In
the realm of behavioral signal processing, the approach of learning interpersonal constructs under
cross-modal supervision of speech and text is under explored. One of the major challenges faced
in modeling interpersonal constructs is due to the unavailability of reliably labeled datasets. In
these scenarios cross-modal approaches often offer more flexibility and generalization by leveraging information from multiple modalities. In this work, for investigating the cross-modal space for
entrainment modeling, we formulate the problem as learning a joint subspace based on the similarity of pretrained speech and text embeddings of turn1 and turn2, respectively, and vice versa using
contrastive pretraining.
Figure 7.2: Schematic diagram for computing CLED: pretraining phase (left), where turn encoders 1 and 2 and projection layers 1 and 2 are trained contrastively on consecutive turn-pairs (X1Y1, X2Y2, ...) from speakers X and Y; evaluation phase (right), where the pretrained encoders and projections produce turn embeddings from which the CLED measure is computed.
7.4 Datasets
We use the Fisher dataset [130] for the contrastive pretraining phase and a combination of Fisher
and an in-house child-inclusive (clinical) interaction dataset for evaluation.
7.4.1 Pretraining
For pretraining, we require a dataset consisting of naturalistic conversations that supports our hypothesis of proximal turn-pairs having more entrainable information than non-proximal, random combinations of turns. Besides a naturalistic conversational setting, we also require the dataset to contain a sufficient number of turn-pairs to train our model with an adequate number of samples. We chose the Fisher Corpus English Part 1 (LDC2004S13) for our pretraining experiments as it satisfies both requirements. It is a corpus of telephonic dyadic conversations, consisting of 5850 interactions with an average duration of 10 minutes. Since the Fisher dataset consists of telephonic conversations, it was recorded at an 8 kHz sampling rate. The dataset comes with manual transcriptions, and we use this information to identify speaker turn boundaries for training our model. We choose 80% of the dataset for training and use 10% as validation data for the pretraining experiments.
7.4.2 Evaluation
For evaluation, we use the remaining 10% of Fisher corpus for verification experiments. We also
use an in-house dataset ADOSMod3 containing child-inclusive interactions for both verification
and correlation experiments. The ADOSMod3 corpus is comprised of recorded conversations from
autism diagnostic sessions between a child and a clinician who is trained to observe clinicallyrelevant behavioral aspects of the child related to Autism. A typical interactive session following
the Autism Diagnostic Observation Schedule (ADOS)-2 protocol lasts about 40-60 minutes, and
these sessions consist of a variety of structured subtasks to evoke spontaneous response from the
children under different social and communicative circumstances. In this work, we consider the
administration of Module 3 meant for verbally fluent children and adolescents. Specifically, we focus on the Emotions and Social difficulties and annoyance subtasks as these are expected to extract
significant spontaneous speech and reaction from the child while answering questions about different emotions and social difficulties. The Social difficulties and annoyance task elicits the child’s
perception on different social issues faced at home or school and the Emotions subtask is designed
to extract the child’s emotional states related to different trigger objects by introducing them in
a conversational setup. The corpus consists of recordings from 165 children collected across 2
different clinical sites: the University of Michigan Autism and Communication Disorders Center
(UMACC) and the Cincinnati Children’s Medical Center (CCHMC). The details of the dataset are
reported in Table 7.1. For our experiments, we have 1 recording for each of the mentioned subtasks from each participant, resulting in a total of 330 recordings. The average duration of each of these subtask-based sessions is about 3 minutes (192 secs). The lexical features are extracted from manual transcriptions following SALT [109] guidelines. Since we aggregate turn-level encoded features, sessions with fewer than 10 turns are discarded to avoid local irregularities arising from an insufficient number of conversational turns.
Amongst the clinical scores, the VINELAND scores are designed to measure the adaptive behaviour of individuals; VINELAND ABC stands for the Adaptive Behaviour Composite score, while VINELAND Social and VINELAND Communication are adaptive behavior scores for the specific skills of socialization and communication. CSS stands for Calibrated Severity Score, which reflects the severity of ASD symptoms in individuals. CSS-SA and CSS-RRB reflect ASD symptom severity along the two domains of Social Affect and Restrictive and Repetitive Behaviours. The ADOS scores are related to the diagnostic instrument; the CoreASD, AdvSocial and RRB scores respectively denote scores for core autism symptoms, advanced social skills, and restrictive and repetitive behaviours.
Table 7.2: Classification accuracy (%) for real vs fake sessions for the unimodal speech validation experiment (averaged over 25 runs, standard deviation shown in parentheses).

Measure                       Fisher         ADOSMod3
Baseline1                     72.20 (0.54)   52.98 (0.43)
Baseline2                     72.03 (0.85)   50.61 (0.55)
Baseline3                     74.66 (0.77)   52.73 (0.62)
CLED_smoothL1 (Wav2vec2.0)    86.08 (1.15)   60.15 (0.68)
CLED_L2 (Wav2vec2.0)          84.79 (1.18)   59.36 (0.55)
CLED_cosine (Wav2vec2.0)      89.95 (0.84)   62.77 (0.48)
CLED_smoothL1 (HuBERT)        91.35 (0.63)   60.93 (1.07)
CLED_L2 (HuBERT)              90.23 (0.41)   61.84 (0.62)
CLED_cosine (HuBERT)          93.04 (0.25)   63.07 (0.43)
CLED_smoothL1 (WavLM)         87.15 (0.53)   60.04 (1.06)
CLED_L2 (WavLM)               86.54 (1.02)   58.83 (0.79)
CLED_cosine (WavLM)           88.38 (0.42)   62.08 (0.43)
Table 7.3: Classification accuracy (%) for real vs. fake sessions in the unimodal text validation experiment (averaged over 25 runs, standard deviation shown in parentheses).

Measure                  | Fisher        | ADOSMod3
Baseline1                | 61.01 (0.55)  | 48.74 (0.66)
Baseline2                | 59.86 (0.41)  | 48.04 (0.48)
Baseline3                | 63.53 (0.77)  | 50.03 (0.52)
CLED-smoothL1 (BERT)     | 66.35 (1.15)  | 50.25 (0.73)
CLED-L2 (BERT)           | 64.08 (0.95)  | 50.22 (0.58)
CLED-cosine (BERT)       | 67.13 (0.89)  | 51.14 (0.62)
CLED-smoothL1 (RoBERTa)  | 63.59 (0.74)  | 51.36 (0.98)
CLED-L2 (RoBERTa)        | 62.29 (0.89)  | 51.73 (1.05)
CLED-cosine (RoBERTa)    | 63.93 (0.72)  | 52.05 (1.11)
CLED-smoothL1 (USE)      | 63.78 (0.74)  | 51.03 (0.56)
CLED-L2 (USE)            | 63.06 (0.85)  | 51.15 (1.07)
CLED-cosine (USE)        | 64.57 (0.92)  | 51.77 (0.61)
Table 7.4: Classification accuracy (%) for real vs. fake sessions in the cross-modal validation experiment (averaged over 25 runs, standard deviation shown in parentheses).

Measure                      | Fisher        | ADOSMod3
Baseline1                    | 63.77 (0.52)  | 45.22 (0.34)
Baseline2                    | 62.62 (0.38)  | 48.62 (0.43)
Baseline3                    | 62.51 (0.51)  | 48.13 (0.52)
CLED-smoothL1 (HuBERT-BERT)  | 67.08 (0.43)  | 49.15 (0.25)
CLED-L2 (HuBERT-BERT)        | 68.15 (0.32)  | 50.13 (0.22)
CLED-cosine (HuBERT-BERT)    | 65.23 (0.41)  | 50.06 (0.31)
Table 7.5: Pearson's correlation (ρ) between CLED and clinical scores relevant to ASD, with p-values in parentheses (bold figures in the original imply statistical significance after correction, p < 0.006). CP: child to psychologist, PC: psychologist to child.

Measure              | VINELAND ABC    | VINELAND Social | VINELAND Comm  | CSS Overall    | CSS SA         | CSS RRB         | ADOS CoreASD  | ADOS AdvSocial | ADOS RRB
CLED-speech (CP)     | -0.115 (0.0377) | -0.062 (0.261)  | -0.105 (0.003) | 0.179 (0.004)  | 0.122 (0.0245) | 0.204 (0.0001)  | 0.100 (0.02)  | 0.053 (0.213)  | 0.218 (0.0006)
CLED-speech (PC)     | -0.097 (0.0789) | -0.046 (0.401)  | -0.161 (0.003) | 0.254 (0.0009) | 0.243 (0.0009) | 0.248 (0.00004) | 0.251 (0.005) | 0.151 (0.074)  | 0.260 (0.0003)
CLED-text (CP)       | -0.074 (0.723)  | -0.013 (0.814)  | -0.021 (0.691) | 0.113 (0.059)  | 0.114 (0.040)  | 0.045 (0.381)   | 0.006 (0.054) | 0.072 (0.732)  | 0.042 (0.442)
CLED-text (PC)       | -0.189 (0.182)  | -0.041 (0.312)  | -0.091 (0.231) | 0.191 (0.003)  | 0.176 (0.001)  | 0.162 (0.214)   | 0.108 (0.041) | 0.131 (0.014)  | 0.215 (0.361)
CLED-crossmodal (CP) | -0.049 (0.076)  | -0.125 (0.023)  | -0.021 (0.341) | 0.034 (0.042)  | 0.062 (0.213)  | 0.032 (0.314)   | 0.018 (0.738) | 0.028 (0.610)  | 0.083 (0.076)
CLED-crossmodal (PC) | -0.121 (0.002)  | -0.247 (0.007)  | -0.126 (0.002) | 0.028 (0.598)  | 0.031 (0.257)  | 0.053 (0.213)   | 0.103 (0.048) | 0.107 (0.041)  | 0.096 (0.043)
7.5 Experiments
7.5.1 Preprocessing, feature extraction and baselines
Since the Fisher corpus is a telephonic conversation dataset, it was recorded at 8kHz sampling rate.
In this work, all the audio files from the Fisher Corpus were upsampled to 16kHz to enable speech
foundation model usage. To identify speaker turn boundaries without incurring preprocessing errors, we rely on manual annotations as the gold standard for both the training and evaluation phases. After identifying turn boundaries from these timings, we extract embeddings from pretrained speech and language models for each turn. In all experiments, we compute the mean of the embeddings across frames and use it as the turn-level feature representation for modeling entrainment. For the unimodal speech experiments, we consider the pretrained speech models Wav2vec 2.0 [57], WavLM [62], and HuBERT [145], chosen for their superior performance on a variety of downstream tasks. For the unimodal lexical experiments, we use BERT [54], RoBERTa [146], and USE [147] to extract lexical embeddings. Based on the real vs. fake classification performances reported in Table 7.2 and Table 7.3, respectively, we
consider HuBERT and BERT for the crossmodal experiments. For the unimodal experiments, we
consider embeddings extracted from the same modality (speech or text) for both the turns present
in a consecutive turn-pair, while for crossmodal experiments we compute pretrained speech and
text embeddings for the turns involved to learn a joint crossmodal space across the consecutive
turns.
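As an illustration of this turn-level feature extraction step, the following minimal sketch mean-pools frame-level HuBERT features over a single speaker turn. The Hugging Face checkpoint name, the helper function, and the assumption that turn boundaries are available as (start, end) times in seconds are illustrative choices, not details taken from the original pipeline.

```python
# Minimal sketch: turn-level speech embeddings via mean pooling of HuBERT frames.
# Assumes turn boundaries (start, end) in seconds are already known from manual transcripts.
import torch
import torchaudio
from transformers import AutoFeatureExtractor, HubertModel

MODEL_ID = "facebook/hubert-base-ls960"  # illustrative choice of pretrained checkpoint
extractor = AutoFeatureExtractor.from_pretrained(MODEL_ID)
hubert = HubertModel.from_pretrained(MODEL_ID).eval()

def turn_embedding(wav_path: str, start: float, end: float) -> torch.Tensor:
    """Return a single 768-d vector for one speaker turn by averaging frame embeddings."""
    wav, sr = torchaudio.load(wav_path)
    wav = torchaudio.functional.resample(wav, sr, 16_000)      # resample to 16 kHz (e.g., 8 kHz telephony audio)
    segment = wav[0, int(start * 16_000): int(end * 16_000)]   # crop the turn from the session
    inputs = extractor(segment.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        frames = hubert(**inputs).last_hidden_state            # (1, T, 768) frame-level features
    return frames.mean(dim=1).squeeze(0)                       # mean pooling -> turn-level vector
```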
For all the unimodal and crossmodal verification experiments reported in Table 7.2, Table 7.3, and Table 7.4, we consider three baselines. Baseline 1 is the smooth-L1 distance computed directly from the pretrained embeddings, while Baseline 2 and Baseline 3 are the L2 distance and cosine distance, respectively, also computed directly from the pretrained embeddings.
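Written out concretely, the three baselines amount to simple distances on the raw pretrained embeddings; a minimal sketch, assuming x and y are the mean-pooled embeddings of the two turns in a consecutive turn-pair:

```python
# Minimal sketch of the three baseline distances, computed directly on pretrained embeddings.
import torch
import torch.nn.functional as F

def baseline_distances(x: torch.Tensor, y: torch.Tensor) -> dict:
    """x, y: turn-level embeddings of a consecutive turn pair (same dimensionality)."""
    return {
        "baseline1_smooth_l1": F.smooth_l1_loss(x, y).item(),               # smooth-L1 distance (mean-reduced)
        "baseline2_l2": torch.linalg.norm(x - y).item(),                    # Euclidean (L2) distance
        "baseline3_cosine": (1 - F.cosine_similarity(x, y, dim=0)).item(),  # cosine distance
    }

# Example with random stand-ins for two 768-d turn embeddings:
x, y = torch.randn(768), torch.randn(768)
print(baseline_distances(x, y))
```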
7.5.2 Implementation details
For all the experiments, we map encoded turn embeddings to an intermediate dimension of 256
in the joint space for modeling entrainment related information from consecutive turn pairs using
projection heads. The projection head consists of two fully connected layers that project to the required dimension (768 → 256 → 256) with a GELU activation function and a dropout layer with probability 0.2, along with a residual connection followed by layer normalization. For the pretraining stage, we train for 25 epochs using the AdamW optimizer with an initial learning rate of 1e-3, and we apply early stopping with a patience of 5 epochs if the validation loss does not improve.
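A minimal PyTorch sketch of such a projection head under the configuration described above (768 → 256 → 256, GELU, dropout 0.2, residual connection, layer normalization); the exact placement of the residual connection and dropout is an assumption, since the text does not fully specify it.

```python
# Minimal sketch of the projection head used to map turn embeddings to the joint space.
import torch
import torch.nn as nn

class ProjectionHead(nn.Module):
    def __init__(self, in_dim: int = 768, hidden_dim: int = 256, out_dim: int = 256, p: float = 0.2):
        super().__init__()
        self.proj = nn.Linear(in_dim, hidden_dim)          # 768 -> 256
        self.fc = nn.Sequential(
            nn.GELU(),
            nn.Linear(hidden_dim, out_dim),                # 256 -> 256
            nn.Dropout(p),
        )
        self.norm = nn.LayerNorm(out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        projected = self.proj(x)
        return self.norm(projected + self.fc(projected))   # residual connection, then layer norm

head = ProjectionHead()
print(head(torch.randn(8, 768)).shape)  # torch.Size([8, 256])
```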
7.5.3 Verification experiment
We conduct an ad-hoc real/fake classification experiment to validate CLED as a proxy metric for
entrainment. For every real sample session, we created a synthetic fake counterpart by rearranging
speaker turns while preserving the sequential flow of the conversation, a method commonly used
in exploring temporal dynamics such as synchrony. We hypothesize that real sessions will exhibit
higher levels of entrainment compared to synthetic fake sessions, resulting in lower CLED scores
for real sessions. The classification accuracies are reported in Table 7.2, Table 7.3, and Table 7.4.
The classification experiment steps are as follows:
• We calculate the CLED measure in both directions for every proximal turn-pair and average them (dividing by 2 to account for direction) for the real and the fake sample session.
• We compare the average CLED distance over all turn pairs between the real and the fake session; a sample session is correctly classified if the CLED of the real session is lower than that of its fake counterpart.
• The experiment is repeated 25 times to mitigate any bias introduced by the random shuffling of the speaker turns (a minimal sketch of this procedure is given after the list).
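A minimal sketch of this session-level comparison, assuming a hypothetical cled(turn_a, turn_b) helper that returns the directional entrainment distance for a projected turn pair; permuting one speaker's turns is only one way of realizing the fake counterpart described above.

```python
# Minimal sketch of the real-vs-fake verification step for one session.
# `cled` is a hypothetical helper returning the directional entrainment distance for a turn pair.
import random
from statistics import mean

def session_score(turns_a, turns_b, cled):
    """Average bidirectional CLED over all consecutive turn pairs of a session."""
    return mean((cled(a, b) + cled(b, a)) / 2 for a, b in zip(turns_a, turns_b))

def classify_session(turns_a, turns_b, cled, seed=0):
    """Return True if the real session shows more entrainment (lower CLED) than its fake version."""
    real = session_score(turns_a, turns_b, cled)
    shuffled_b = list(turns_b)                  # build the fake session by rearranging one
    random.Random(seed).shuffle(shuffled_b)     # speaker's turns across the session
    fake = session_score(turns_a, shuffled_b, cled)
    return real < fake
```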
From the results presented in Tables 7.2, 7.3, and 7.4, we can observe that the proposed contrastive learning-based measures outperform the baselines across all distance measures. This
validates our hypothesis that the proposed training strategy helps in capturing entrainment-related
information, thus yielding better performance in distinguishing real from fake sessions.
7.5.4 Correlation analyses
The verification experiment mentioned in the previous subsection is designed as a proof-of-concept
validation for the proposed metrics aimed at quantifying entrainment. In this subsection, we
demonstrate the potential applicability of these metrics concerning specific behavioral constructs
typically linked with entrainment. Specifically, we delve into a case study focusing on modeling
distinct behaviors using real-world conversational datasets involving children. We investigate the
efficacy of these measures to study behavioral attributes of children with and without ASD. These
experiments can be seen as an indirect validation of the ability of the proposed metrics to capture
entrainment, as they showcase their correlation with associated behaviors related to autism.
Figure 7.3: Variation of CLED across gender
7.6 Results
7.6.1 Quantitative analysis
Prior literature in autism research has reported that children with an ASD diagnosis show lower social synchronization compared to their typically developing peers in both spontaneous and intentional interpersonal interactions [148], [149]. Moreover, experimental results reveal a significant association between interpersonal synchrony and clinically relevant behavioral scores from the ADOS instrument in [124], [141]. In this experiment, we compute the Pearson's correlation between the proposed CLED measures and behavioral scores related to autism symptoms. It is important to note that the proposed measures are directional in nature, so we consider both the child → psychologist and psychologist → child directions to compute the entrainment measures. We report Pearson's
correlation coefficients (ρ) for this experiment in Table 7.5 along with their p-values. We test
against the null hypothesis H0 that there is no linear association between behavioral scores and the
proposed entrainment measures.
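A minimal sketch of this correlation test, assuming the per-session CLED values and clinical scores are available as aligned arrays; the 0.006 threshold corresponds to the corrected significance level used in Table 7.5.

```python
# Minimal sketch of the correlation analysis against the corrected threshold (p < 0.006).
import numpy as np
from scipy.stats import pearsonr

def correlate(cled_scores, clinical_scores, alpha_corrected=0.006):
    """Pearson correlation between session-level CLED and a clinical score."""
    rho, p = pearsonr(np.asarray(cled_scores), np.asarray(clinical_scores))
    return {"rho": rho, "p": p, "significant": p < alpha_corrected}

# Example with synthetic stand-in values (not real study data):
rng = np.random.default_rng(0)
print(correlate(rng.normal(size=100), rng.normal(size=100)))
```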
Based on the results of the unimodal and crossmodal verification experiments, we study the
correlation between the best-performing CLED distance measures for each of the experiments and the clinical scores related to autism symptoms. It can be observed from Table 7.5 that the proposed unimodal and cross-modal CLED measures hold negative correlations with the VINELAND scores and are positively associated with the CSS and ADOS scores. This observation is consistent with the definition of the proposed distance-based entrainment measures, where higher values of CLED denote lower levels of entrainment. CLED-speech (CP) shows a significant negative correlation with the VINELAND ABC and Communication scores, and it also holds statistically significant positive associations with all categories of the CSS score (Overall, Social Affect, and RRB) and some categories of the ADOS scores, including CoreASD and RRB. In the opposite direction, CLED-speech (PC) holds a significant negative correlation with VINELAND Communication and significant positive correlations with all categories of the CSS scores and the ADOS CoreASD and RRB scores. It is interesting to note that the CLED measure captures meaningful correlations in most of the experiments only in the unimodal case involving speech embeddings; for the unimodal text and cross-modal experiments, although similar trends are observed, the number of statistically significant correlations is relatively small, and the majority of them occur in the psychologist → child direction. One possible reason is that the conversational turn-pair modeling captures entrainment more readily from vocal patterns than from lexical context, so the model can encode more entrainable information from proximal turn-pair speech embeddings than from the lexical embeddings. This also explains the fewer significant correlations in the crossmodal experiments, where the model performs poorly at encoding the entrainment between the speech and lexical embeddings from proximal turns. Furthermore, averaging across frames to generate turn-level embeddings may cause information loss, particularly affecting the lexical modality. This could potentially result in turn-level embeddings placing greater emphasis on specific words rather than capturing the entirety of the turn, thus leading to inefficient modeling. Moreover, since the model is trained with dyadic
conversations between adults from the Fisher corpus, it is unable to fully capture the nuances of interactions involving children, which is reflected in the smaller number of significant associations in the child → psychologist direction. This also explains the lower classification accuracy in the verification experiment for the ADOSMod3 corpus.
7.6.2 Qualitative analysis on demographics
In this subsection, we investigate the distribution of the CLED measures across various demographic characteristics of the children. As reported in Table 7.5, we did not find meaningful significant correlations for the lexical and crossmodal modeling in the child → psychologist direction, so we only consider the psychologist → child direction for those measures in this experiment. We compare the distribution of the CLED measures between male and female children; it is evident from Fig. 7.3 that in each case female children show lower values, which signifies more entrainment. Additionally, statistical tests yield significant results for each of the measures (CLED-Speech (PC): statistic = 2.887, p-value = 0.004; CLED-Speech (CP): statistic = 2.672, p-value = 0.008; CLED-Text (PC): statistic = 1.637, p-value = 0.006; CLED-Crossmodal (PC): statistic = 0.927, p-value = 0.004). In Fig. 7.4, we plot the distribution of the proposed measures across children with and without an ASD diagnosis. The measures from the speech modality in both directions yield significantly higher values for individuals with ASD (CLED-Speech (PC): statistic = 1.688, p-value = 0.006; CLED-Speech (CP): statistic = 0.987, p-value = 0.008), suggesting less entrainment in those sessions, which is consistent with theories established in autism research identifying communication deficits as one of the traits of individuals diagnosed with ASD. Quite interestingly, the results in Fig. 7.4 also reveal higher values of the CLED measures for the text and cross-modal experiments, but the statistical tests do not yield significant outcomes supporting this observation.
Figure 7.4: Variation of CLED across children with and without Autism diagnosis
7.7 Conclusion
In this chapter, we introduce a data-driven framework for computing entrainment distances between successive speaker turns within a dyadic interaction. The proposed measures are evaluated
within a latent embedded space acquired through contrastive learning from natural conversational
data. To validate our approach, we demonstrate its efficacy in discriminating between real and
fake (shuffled) conversations by accurately capturing entrainment in real interactions. Moreover,
our proposed measures outperform baseline methods in this verification task, underscoring their
effectiveness. Since interpersonal entrainment can be exhibited across multiple modalities, this chapter aims to quantify entrainment across different information modalities related to voice and language patterns, through unimodal and crossmodal experiments involving speech and lexical embeddings. We apply the introduced measures in the clinical domain of dyadic
interactions involving children to study the behavioral patterns of children with and without autism.
We experimentally investigate the relation of the proposed measures across speech and language
modalities with respect to behavioral scores relevant to autism symptoms. We observe that the introduced measures, specifically in the speech modality, hold meaningful correlations with relevant behavioral scores, with statistically significant associations. While the introduced measures capture vocal entrainment effectively, the experimental results did not reveal statistically significant conclusions for modeling lexical and crossmodal entrainment. These results motivate further investigation of entrainment modeling in the language and cross-modal settings. Additionally, we aim to study the potential of leveraging intra- and inter-turn temporal context while modeling dyadic entrainment, and to explore disentangling entrainment information from other nuisance information when modeling dyadic entrainment.
Chapter 8
A Summary of Deep Unsupervised Modeling Strategies for
Understanding Vocal Entrainment Patterns in Child-Inclusive
Interactions
8.1 Introduction
Understanding and analyzing human interactions in the social context has emerged as a prominent area of research in the fields of sociology, psychology and human-centered computing [1],
[117]. Over the years, a considerable body of research has aimed to comprehend and draw insights
from the dynamics of interpersonal interactions by gathering conversational data, processing and
analyzing it by developing relevant models. Dyadic interactions, involving two individuals, are
specifically noteworthy, as they encompass rich linguistic content (pertaining to "what is communicated"), vocal subtleties (relating to "how it is communicated"), and a diverse array of verbal,
nonverbal vocal, and visual signals conveying emotions. Furthermore, contextual cues from both
speakers are utilized to facilitate the flow of conversation. A multifaceted social adaptive process known as entrainment has been conceptually identified as a significant factor driving these
interactions [120], thus it has been extensively studied in numerous psychological and behavioral
works.
Entrainment can be broadly defined as the adaptive inclination of a speaker to adjust to, or come to resemble, the other participant in a conversation over time. Various prior studies have introduced different
terms to describe this phenomenon, such as alignment [135], convergence [134], proximity [150],
synchrony [124] etc. Understanding patterns of entrainment can provide valuable insights into
the behavioral characteristics of individuals engaged in the interaction and interpersonal dynamics exhibited during the interaction. Previous research has highlighted a significant positive correlation between entrainment and various measures of interaction effectiveness, including social
desirability [125], conversational spontaneity [136], rapport formation, and smoother interactions.
In the context of dyadic conversations, a heightened level of vocal entrainment has been positively linked to various interpersonal behavioral traits, such as increased empathy among clinical
providers [137], stronger emotional connection [108], and enhanced interpersonal agreement and
positive outcomes in couples therapy [138].
Autism Spectrum Disorder (ASD) is a developmental condition primarily characterized by
deficit in social communication skills and abilities, alongside restricted and repetitive behaviors,
interests, and movements. ASD represents a spectrum disorder with significant individual diversity. Advances in human-centered technology hold promise for supporting novel evidence-based
approaches in behavioral stratification, as well as in diagnosis and personalized treatment. Prevalence of ASD in children in the United States has steadily increased from 1 in 150 in 2002 to 1
in 36 in 2023. Given the developing state of the brain during the formative years and its plasticity, interventions in children have a higher likelihood of success, which has been experimentally
validated by prior research showing that early diagnosis and intervention for ASD result in long-term
positive outcomes. This highlights the importance of facilitating comprehensive behavioral analyses of clinical interactions involving children, particularly within the realm of autism research and
clinical application.
Studying the entrainment patterns of children in spontaneous conversations can help in interpreting their behavioral traits, and thus can potentially aid clinicians in efforts of autism diagnosis and behavioral phenotyping. Prior works [98], [99] have reported a notable correlation
between the prosody and language patterns of interlocutors and the severity of ASD symptoms in
subjects. This underscores the significance of incorporating interpersonal entrainment into studies
of social communication and interaction among children with ASD. The objective of this chapter
is to contribute to this endeavor by exploring novel behavioral metrics to quantify interpersonal
entrainment.
In this chapter, we explore several deep learning based unsupervised methodologies for learning representations capturing entrainment and employing those for quantifying entrainment from
dyadic interactions involving children with autism. Since turn-level conversational information is essential for building models capable of capturing entrainment, we first develop a framework to learn entrainment-related information by leveraging the joint space of conversational turns using a contrastive formulation. Along with turn-level information content, the flow of information across consecutive turns also plays an integral role in modeling entrainment. To include both the information content and the flow of information across consecutive conversational turns, building upon the previously introduced contrastive loss based formulation, we incorporate an additional reconstruction loss obtained by extracting relevant information from the previous turn to predict the next speaker turn. While encoding entrainment-related information, such a model can also capture information unrelated to entrainment. To address this issue, and aiming to learn more robust entrainment-related information, we integrate an additional loss function that incentivizes learning only entrainment-related information, invariant to the other factors not contributing to entrainment, by attempting to reconstruct the individual turn representations in a turn-pair through a gradient reversal layer. The main objective is to learn a representation capturing entrainment by exploiting the similarity between consecutive turns and modeling the flow of information across those turns, while avoiding intra-turn information not related to entrainment.
8.2 Datasets
We use the Fisher corpus English Part 1 [130] and two internal datasets involving children with autism,
ADOSMod3 and Remote-NLS for different phases of our experiments.
8.2.1 Fisher Dataset
For pretraining, we need a dataset that includes natural conversations confirming our hypothesis
that proximal turn-pairs contain more entrainable information compared to randomly combined
turns. Additionally, the dataset must provide an ample number of turn-pairs to ensure sufficient
sample size for training our model effectively. We choose to use the Fisher corpus English Part 1
(LDC2004S13) for our training experiments because it meets both criteria. This corpus consists of
telephonic dyadic conversations, totaling 5850 interactions with an average duration of 10 minutes
each. Given its telephonic nature, recordings in the Fisher dataset are sampled at 8kHz. The dataset
includes manual transcriptions, which we utilize to determine speaker turn boundaries for training
our model. We allocate 80% of the dataset for training and reserve 10% for validation purposes
during the training experiments. For evaluation, we utilize the remaining 10% of the Fisher corpus
in verification experiments.
8.2.2 ADOSMod3
We use an in-house dataset called ADOSMod3, which includes interactions involving children, for
both verification and correlation experiments. The ADOSMod3 corpus consists of recorded conversations from autism diagnostic sessions between children and clinicians trained to observe clinically relevant behavioral aspects related to Autism. These sessions, following the Autism Diagnostic Observation Schedule (ADOS)-2 protocol, typically last between 40 to 60 minutes and involve
structured subtasks designed to elicit spontaneous responses from children across various social
and communicative scenarios. Specifically, we focus on Module 3, tailored for verbally fluent
children and adolescents. Our emphasis is on subtasks such as Emotions and Social difficulties and
annoyance, which are intended to capture significant spontaneous speech and reactions from children. The Social difficulties and annoyance task explores the child’s perceptions of social issues
encountered at home or school, while the Emotions subtask delves into the child’s emotional responses triggered by specific prompts in a conversational setting. The corpus comprises recordings
from 165 children gathered from two clinical sites: the University of Michigan Autism and Communication Disorders Center (UMACC) and the Cincinnati Children’s Medical Center (CCHMC).
Each participant contributed one recording per subtask, resulting in a total of 330 recordings. The
average duration of each subtask session is approximately 3 minutes (192 seconds). Lexical features are extracted from manual transcriptions following SALT [109] guidelines. To maintain
consistency, sessions with fewer than 10 turns are excluded to prevent local irregularities arising
from insufficient conversational data.
Among the clinical scores, VINELAND scores assess adaptive behavior, with VINELAND ABC
measuring Adaptive Behavior Composite, and VINELAND social and VINELAND communication
focusing on specific socialization and communication skills. CSS (Calibrated Severity Score) indicates the severity of ASD symptoms, with CSS-SA and CSS-RRB reflecting severity in the domains
of Social Affect and Restrictive and Repetitive Behaviors, respectively. ADOS scores pertain to the
diagnostic instrument, where CoreASD, AdvSocial, and RRB scores denote evaluations across core
autism symptoms, advanced social skills, and restrictive and repetitive behaviors, respectively.
8.2.3 Remote-NLS
Additionally, we present evaluations based on another in-house dataset Remote-NLS sourced from
[15], [151]. Natural language sampling (NLS) is a widely used method in both research and clinical settings for assessing a child’s spontaneous spoken language within naturalistic contexts. For
children diagnosed with ASD, NLS has proven to be an especially valuable approach for assessing
their language abilities [15], [152]. This dataset explores remote collection of interactions between
parents and children with ASD at home, particularly when parents are given open-ended elicitation
instructions and use items and materials they have in the home. It comprises 90 videos depicting
15-minute interactive sessions involving children diagnosed with autism and their parents. These
sessions were conducted remotely in a home environment, where parents were instructed to select from a predefined set of activities designed to engage their child’s attention. These activities
encompassed 13 categories such as games, conversations, cooking, and art. Initially, the video
sessions were transcribed using the Systematic Analysis of Language Transcripts (SALT) [109].
Subsequently, domain experts employed SALT to classify the spoken language proficiency of each
child into categories such as pre-verbal communication, first words, and word combinations as
outlined in [15].
For this study, parents also participated in the Vineland Adaptive Behavior Scales-Third Edition (VABS), a semi-structured interview that individually assesses adaptive functioning essential
for diagnosing intellectual and developmental disabilities. The VABS interview was conducted
remotely via Zoom by technicians certified for research purposes. Along with VABS standard
score (VABS-SS), we use core domain standard scores assessing an individual’s overall adaptive
functioning across four major areas: communication, daily living skills, motor skills, and socialization.
8.3 Modeling Entrainment using Deep Unsupervised Learning
In this section we present the deep learning strategies employed to model entrainment from dyadic
interactions in an unsupervised manner.
8.3.1 Conversational turn modeling for quantifying entrainment
The main objective is to compute a distance based measure for entrainment from the feature representations of proximal turnpairs. Instead of relying on correlation measures or direct computation
of entrainment measures by computing differences between raw feature vectors as reported in
[101], [119], we build upon a data-driven strategy introduced in [121]. As shown in Fig. 8.1,
we learn representations related to entrainment by transforming the turn level feature vectors into
Figure 8.1: Schematic diagram for computing the entrainment measure from conversational turns
a joint space capable of encoding only entrainment information while ignoring the factors not
related to entrainment, and finally compute the required distance measure from the projected feature descriptors of consecutive turns during evaluation. If X^i and Y^i denote the turn-level feature descriptors from speakers X and Y corresponding to the i-th turn-pair, the entrainment distance can be expressed as:

D_ent(X^i, Y^i) = d(f(X^i), f(Y^i))    (8.1)
The main contribution of this work lies in exploring different deep-learning-based unsupervised strategies to estimate the transformation function f(·). Due to the unavailability of reliably labeled datasets for entrainment-related experiments, we use an unsupervised strategy to compute an entrainment measure. We hypothesize that there exists some level of entrainment across consecutive turn-pairs in spontaneous conversations, and we aim to leverage this entrainable information by training models to encode more entrainment information from a turn-pair consisting of consecutive turns than from one consisting of synthetically generated, randomly chosen turns.
Figure 8.2: Contrastive pretraining diagram
8.3.2 Contrastive Learning Based Training
In this section, we describe how we learn the transformation function f(.) using a contrastive pretraining strategy. Based on our assumption of consecutive speaker turns in a natural conversation
possessing some amount of entrainable information, we train the model with a contrastive loss
function to discriminate between consecutive turn-pairs and non-consecutive turn-pairs. We subsequently compute the entrainment measure Contrastively Learnt Entrainment Distance (CLED)
as a distance between encoded embeddings. Specifically, we attempt to learn a joint space based
on the similarity of the consecutive turn pairs using contrastive learning.
Since state-of-the-art self-supervised learning (SSL) models are built on transformer architectures that have demonstrated superior ability in capturing longer-range context, we employ these
pretrained speech embeddings as our feature descriptors for learning entrainment information. Let X^i and Y^i denote the pretrained speech embeddings extracted from the consecutive turns of speakers X and Y, corresponding to the i-th turn-pair. For these experiments, we consider different pretrained models to encode the turn-level signals, yielding X^i_1 from turn 1 and Y^i_2 from turn 2 of the i-th turn-pair using pretrained encoders, where X^i_1 and Y^i_2 are V-dimensional vectors. Next, we attempt to learn the transformation functions f_1(·) and f_2(·) that map the feature descriptors for turn 1 and turn 2 of a turn-pair to a joint space based on their similarity. For a batch of N turn-pairs, we compute

E_1 = f_1(X_1),  E_2 = f_2(Y_2)    (8.2)

where E_1 ∈ R^(N×d), E_2 ∈ R^(N×d), X_1, Y_2 ∈ R^(N×V), and f_1(·) and f_2(·) are the learnt linear transformation functions. After the transformed embeddings for both speaker turns are computed, their similarity can be derived as

C = τ · (E_1 E_2^T)    (8.3)

where τ is a temperature factor that regulates and scales the range of the logits. The computed similarity matrix C ∈ R^(N×N) should equal 1 along the diagonal, denoting the correct consecutive turn pairs in a batch of N samples, whereas the remaining N^2 − N elements can be considered incorrect pairs. The contrastive loss function is hence

L_con = 0.5 · (l_turn1(C) + l_turn2(C))    (8.4)

l_turn = (1/N) ∑_{i=0}^{N} log(diag(softmax(C)))_i    (8.5)

where l_turn is computed via Equation 8.5 across the turns from speakers X and Y, respectively, over all the turn-pairs. We use the symmetric cross-entropy loss over the similarity matrix C for training the linear transformations f_1(·) and f_2(·).
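A minimal PyTorch sketch of this symmetric contrastive objective over a batch of consecutive turn-pairs. The projection heads f1 and f2 are stand-ins for the learnt transformations of Eq. 8.2, and the L2 normalization and division by a temperature follow common CLIP-style practice rather than settings confirmed in this work.

```python
# Minimal sketch of the symmetric contrastive loss over a batch of N consecutive turn pairs.
import torch
import torch.nn.functional as F

def contrastive_loss(e1: torch.Tensor, e2: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """e1, e2: (N, d) projected embeddings of turn 1 and turn 2 in each pair."""
    e1 = F.normalize(e1, dim=-1)
    e2 = F.normalize(e2, dim=-1)
    logits = (e1 @ e2.T) / tau                             # (N, N) similarity matrix C
    targets = torch.arange(e1.size(0), device=e1.device)   # diagonal = true consecutive pairs
    loss_turn1 = F.cross_entropy(logits, targets)          # match turn 1 to its turn 2
    loss_turn2 = F.cross_entropy(logits.T, targets)        # and turn 2 back to its turn 1
    return 0.5 * (loss_turn1 + loss_turn2)                 # symmetric cross-entropy (cf. Eq. 8.4)

# Example: random stand-ins for projected turn embeddings (f1(X), f2(Y)).
f1_x, f2_y = torch.randn(16, 256), torch.randn(16, 256)
print(contrastive_loss(f1_x, f2_y).item())
```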
For inference, we use the learnt transformation functions f_1(·) and f_2(·) to map the encoded signals from turn 1 and turn 2 of the k-th turn-pair, obtaining E^k_1 and E^k_2, respectively. Finally, we compute the distance metric CLED between E^k_1 and E^k_2 as a proxy measure of entrainment. It is important to note that, while contrastive pretraining and its related loss function
are discussed in the context of turns from Speaker X and Y, the choice and explanation of variable names are tailored for clarity and ease of understanding. Furthermore, the proposed approach
isn’t designed to capture generic speaker-specific information; instead, its focus lies on encoding
entrainable information within consecutive turns of a turn pair.
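At inference time, the CLED measure reduces to a distance between the two projected turn embeddings of a turn-pair; a minimal sketch, assuming trained projection heads f1 and f2 and the cosine variant of the distance:

```python
# Minimal sketch of CLED computation at inference for one consecutive turn pair.
import torch
import torch.nn.functional as F

def cled_cosine(f1, f2, x_turn: torch.Tensor, y_turn: torch.Tensor) -> float:
    """x_turn, y_turn: pretrained turn-level embeddings; f1, f2: trained projection heads."""
    with torch.no_grad():
        e1, e2 = f1(x_turn), f2(y_turn)                       # map both turns into the joint space
    return (1 - F.cosine_similarity(e1, e2, dim=-1)).item()   # higher value = less entrainment
```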
8.3.3 Modeling information flow across conversational turns
Building upon the contrastive loss based pretraining, we incorporate an additional loss related to the information flow across conversational turns. For this, we follow the encoding approach introduced in [100] to extract relevant entrainable information from the previous turn to predict the next turn. In contrastive learning based pretraining, we attempt to leverage the similarity between the conversational turns to build a joint space capable of encoding this similarity. However, conceptually the turn-level feature descriptors corresponding to the i-th turn-pair, X^i and Y^i, also contain speaker-specific, paralinguistic, and phonetic similarity, which is irrelevant for modeling entrainment. In order to develop a robust estimate of the encoder functions f_1(·) and f_2(·), we train the model with an additional reconstruction loss associated with predicting the second turn within a turn-pair from the first turn. The steps associated with this model are shown in Fig. 8.3.
Figure 8.3: Modeling information flow across conversational turns
For this experiment, we use X^i and Y^i as inputs to the encoders to train in a contrastive manner, resulting in the intermediate embeddings x^i and y^i. We further feed the intermediate embedding corresponding to the first turn, x^i, through a decoder network to predict Y^i, resulting in the reconstruction X̂^i. Next, Y^i and X̂^i are compared to calculate the associated reconstruction loss. During evaluation, similar to the previous step, we compute the intermediate entrainment embeddings x^i and y^i for entrainment distance computation. We hypothesize that training the model in this manner will teach it to capture only entrainable information across conversational turns from consecutive turn-pair samples, so that during evaluation those intermediate embeddings are more effective for modeling entrainment.
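A minimal sketch of how the reconstruction term can be combined with the contrastive objective, with linear layers standing in for the encoders f1, f2 and the decoder of Fig. 8.3; the contrastive_loss callable, the smooth-L1 choice for the reconstruction penalty, and the loss weight alpha are assumptions for illustration, not settings reported in this work.

```python
# Minimal sketch: contrastive loss plus next-turn reconstruction (Section 8.3.3).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TurnPairModel(nn.Module):
    def __init__(self, in_dim=768, joint_dim=256):
        super().__init__()
        self.f1 = nn.Linear(in_dim, joint_dim)   # encoder for turn 1 (stand-in for the projection head)
        self.f2 = nn.Linear(in_dim, joint_dim)   # encoder for turn 2
        self.g1 = nn.Linear(joint_dim, in_dim)   # decoder predicting the next turn from turn 1

    def forward(self, x, y):
        x_i, y_i = self.f1(x), self.f2(y)        # intermediate entrainment embeddings
        y_hat = self.g1(x_i)                     # reconstruct the second turn from the first
        return x_i, y_i, y_hat

def training_loss(model, x, y, contrastive_loss, alpha=1.0):
    """Combine the contrastive term with the next-turn reconstruction term (weight alpha assumed)."""
    x_i, y_i, y_hat = model(x, y)
    # `contrastive_loss` can be the symmetric contrastive loss sketched in Section 8.3.2.
    return contrastive_loss(x_i, y_i) + alpha * F.smooth_l1_loss(y_hat, y)
```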
8.3.4 Modeling entrainment with reversed gradient reconstruction
Figure 8.4: Modeling information flow across conversational turns with gradient-reversed reconstruction
While predicting the next interlocutor's turn from the previous turn can teach the model to capture entrainable information flowing across conversational turns, the reconstructed signal may also contain similarity information with respect to its source signal that does not contribute to entrainment. In order to avoid encoding similarity with the original turn and to ensure that only entrainment information invariant of other factors is modeled, we incorporate another reconstruction loss using the gradient reversal technique. A gradient reversal layer acts as an identity function in the forward pass but reverses the gradient during backpropagation, so instead of minimizing the associated objective, the encoder is pushed in the opposite direction.
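A minimal PyTorch sketch of a gradient reversal layer of the kind described here (identity in the forward pass, negated gradient in the backward pass); attaching it in front of a reconstruction decoder is one way to discourage the encoder from retaining turn-specific, non-entrainment information.

```python
# Minimal sketch of a gradient reversal layer (identity forward, negated gradient backward).
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd: float = 1.0):
        ctx.lambd = lambd
        return x.view_as(x)                      # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None    # reverse (and scale) the gradient

def grad_reverse(x, lambd: float = 1.0):
    return GradReverse.apply(x, lambd)

# Usage: embeddings passed through grad_reverse before a reconstruction decoder cause the
# encoder to be trained *against* that reconstruction objective.
z = torch.randn(4, 256, requires_grad=True)
grad_reverse(z).sum().backward()
print(z.grad[0, :3])  # gradients are -1 instead of +1
```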
Figure 8.5: Joint interlocutor modeling with shared encoder and decoder
8.3.5 Joint Interlocutor modeling with shared encoder and decoder
While modeling interpersonal constructs, it is important to model all the interlocutors together to avoid losing any information related to their influence on each other. Joint interlocutor modeling allows the model to learn intra- as well as interpersonal signals in a dyadic context. We aim to model both interlocutors by concatenating the transformed signals from both turns and jointly processing them, thus capturing the rich nuances of interpersonal influence using a combination of a joint encoder and decoder. Additionally, the resulting signal from the shared encoder-decoder architecture is split into two sections, which are used to reconstruct the individual turn-level signals.
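A minimal sketch of this shared encoder-decoder idea, in which the two projected turn embeddings are concatenated, passed through a joint encoder and decoder, and the output is split back into two halves for reconstructing the individual turns; the layer sizes are illustrative assumptions.

```python
# Minimal sketch of joint interlocutor modeling with a shared encoder and decoder.
import torch
import torch.nn as nn

class JointEncoderDecoder(nn.Module):
    def __init__(self, turn_dim=256, joint_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(2 * turn_dim, joint_dim), nn.GELU())
        self.decoder = nn.Linear(joint_dim, 2 * turn_dim)

    def forward(self, x_i: torch.Tensor, y_i: torch.Tensor):
        joint = self.encoder(torch.cat([x_i, y_i], dim=-1))   # concatenate both interlocutors
        decoded = self.decoder(joint)                          # shared decoding of the dyad
        x_rec, y_rec = decoded.chunk(2, dim=-1)                # split into two turn reconstructions
        return x_rec, y_rec

model = JointEncoderDecoder()
x_rec, y_rec = model(torch.randn(8, 256), torch.randn(8, 256))
print(x_rec.shape, y_rec.shape)  # torch.Size([8, 256]) torch.Size([8, 256])
```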
8.4 Experimental setup
8.4.1 Preprocessing, feature extraction and baselines
As the Fisher corpus consists of telephonic conversations recorded at an 8kHz sampling rate, all
audio files in this study were upsampled to 16kHz to facilitate the use of speech foundation models.
For both the training and evaluation phases, manual annotations are relied upon as the gold standard
to accurately identify speaker turn boundaries and minimize preprocessing errors. After identifying
turn boundaries based on their timings, we extract embeddings from pretrained speech models that
correspond to these boundaries. In all experiments, we calculate the mean of these embeddings
across all frames and utilize this averaged representation as the feature for modeling entrainment.
Specifically, for all the experiments we employ the HuBERT [145] model, chosen for its demonstrated effectiveness across a range of downstream tasks. For the experiments reported in Table 8.1, we consider three baselines. Baseline 1 is the smooth-L1 distance computed directly from the pretrained embeddings, while Baseline 2 and Baseline 3 are the L2 distance and cosine distance, respectively, also computed directly from the pretrained embeddings.
8.4.2 Implementation details
For all the experiments, we map encoded turn embeddings to an intermediate dimension of 256
in the joint space for modeling entrainment related information from consecutive turn pairs using
projection heads. The projection head consists of fully connected layers that project to the required dimension (768 → 512 → 256 → 256) with ReLU activation functions and a dropout layer with probability 0.2, along with a residual connection followed by layer normalization. For the pretraining stage, we train for 25 epochs using the AdamW optimizer with an initial learning rate of 1e-3, and we apply early stopping with a patience of 5 epochs if the validation loss does not improve.
8.4.3 Verification experiment
We conduct an ad-hoc real/fake classification experiment to validate CLED as a proxy metric for
entrainment. Each real session is paired with a synthetic fake counterpart where speaker turns are
rearranged while maintaining the dyadic flow of the conversation, a method commonly employed
to explore temporal dynamics like synchrony. Our hypothesis posits that real sessions will demonstrate higher levels of entrainment compared to synthetic fake sessions, resulting in lower CLED
scores for real sessions. Classification accuracies are detailed in Table 8.1. The experimental
procedure is outlined as follows:
• We compute the CLED measure for both directions of each proximal turn-pair and average
them (dividing by 2 for direction factor) for both real and fake sample sessions.
• We compare the average CLED distances across all turn pairs between the real and fake
sessions. A real session is correctly classified if its CLED is lower than that of the corresponding fake session.
• The experiment is repeated 25 times to mitigate any bias introduced by the random shuffling
of speaker turns.
8.4.4 Correlation analyses
The verification experiment discussed in the preceding subsection serves as a proof-of-concept
validation for the proposed metrics designed to quantify entrainment. In this section, we illustrate
the potential application of these metrics in relation to specific behavioral constructs commonly
associated with entrainment. We present a case study that examines modeling distinct behaviors
using real-world conversational datasets involving children. Our focus is on assessing the effectiveness of these metrics in studying behavioral characteristics among children, both with and
without ASD. These experiments indirectly validate the metrics’ capability to capture entrainment
by demonstrating their correlation with behaviors associated with autism.
8.5 Experimental results and discussion
In this work, all results related to the experiments described in Section 8.3.2 are reported as CLED. The modeling strategy described in Section 8.3.3 is denoted by CLED-mapping, and the strategies described in Sections 8.3.4 and 8.3.5 are referred to as CLED-gradrev and CLED-jointencdec, respectively. It is important to note the directional nature of the proposed measures; thus we compute entrainment in both the child → psychologist (CP) and psychologist → child (PC) directions for the
Table 8.1: Classification accuracy (%) for real vs. fake sessions (averaged over 25 runs, standard deviation shown in parentheses).

Measure           | Fisher        | ADOSMod3      | Remote-NLS
Baseline1         | 72.20 (0.54)  | 52.98 (0.43)  | 51.01 (0.35)
Baseline2         | 72.03 (0.85)  | 50.61 (0.55)  | 50.70 (0.21)
Baseline3         | 74.66 (0.77)  | 52.73 (0.62)  | 50.57 (0.31)
CLED              | 91.04 (0.23)  | 63.06 (0.55)  | 59.31 (0.63)
CLED-mapping      | 93.45 (0.37)  | 63.73 (0.43)  | 62.43 (0.25)
CLED-gradrev      | 86.54 (0.52)  | 62.55 (0.25)  | 59.45 (0.43)
CLED-jointencdec  | 95.66 (0.27)  | 65.77 (0.51)  | 60.04 (0.52)
Table 8.2: Pearson's correlation (ρ) between variants of CLED and clinical scores relevant to ASD in the ADOSMod3 dataset, with p-values in parentheses (bold figures in the original imply statistical significance after correction, p < 0.006). CP: child to psychologist, PC: psychologist to child.

Measure                | VINELAND ABC   | VINELAND Comm   | VINELAND Social | CSS Overall    | CSS SA         | CSS RRB          | ADOS CoreASD   | ADOS AdvSocial | ADOS RRB
CLED (CP)              | -0.089 (0.106) | -0.129 (0.001)  | -0.030 (0.571)  | 0.168 (0.002)  | 0.126 (0.021)  | 0.223 (0.000003) | 0.093 (0.087)  | 0.115 (0.035)  | 0.190 (0.0004)
CLED (PC)              | -0.100 (0.007) | -0.139 (0.001)  | -0.051 (0.317)  | 0.255 (0.0002) | 0.203 (0.0001) | 0.329 (0.00007)  | 0.152 (0.005)  | 0.165 (0.002)  | 0.298 (0.00006)
CLED-mapping (CP)      | -0.013 (0.402) | -0.101 (0.006)  | -0.095 (0.067)  | 0.089 (0.010)  | 0.059 (0.027)  | 0.197 (0.0009)   | 0.055 (0.313)  | 0.020 (0.704)  | 0.151 (0.005)
CLED-mapping (PC)      | -0.049 (0.377) | -0.126 (0.002)  | -0.101 (0.079)  | 0.234 (0.0001) | 0.197 (0.003)  | 0.270 (0.00005)  | 0.192 (0.0004) | 0.144 (0.0008) | 0.203 (0.0001)
CLED-gradrev (CP)      | -0.022 (0.691) | -0.073 (0.877)  | -0.013 (0.772)  | 0.098 (0.007)  | 0.101 (0.063)  | 0.021 (0.696)    | 0.102 (0.021)  | 0.077 (0.159)  | 0.038 (0.488)
CLED-gradrev (PC)      | -0.083 (0.132) | -0.103 (0.006)  | -0.048 (0.312)  | 0.127 (0.002)  | 0.125 (0.021)  | 0.137 (0.012)    | 0.126 (0.002)  | 0.103 (0.060)  | 0.081 (0.138)
CLED-jointencdec (CP)  | -0.100 (0.072) | -0.134 (0.001)  | -0.036 (0.518)  | 0.190 (0.0009) | 0.105 (0.045)  | 0.030 (0.581)    | 0.136 (0.012)  | 0.068 (0.211)  | 0.009 (0.867)
CLED-jointencdec (PC)  | -0.127 (0.002) | -0.196 (0.0003) | -0.051 (0.357)  | 0.203 (0.0001) | 0.188 (0.0005) | 0.202 (0.0002)   | 0.179 (0.001)  | 0.142 (0.0009) | 0.213 (0.004)
Table 8.3: Pearson's correlation (ρ) between variants of CLED and clinical scores relevant to ASD in the Remote-NLS dataset, with p-values in parentheses (bold figures in the original imply statistical significance after correction, p < 0.01). CP: child to parent, PC: parent to child.

Measure                | VABS-SS          | VABS Comm       | VABS Living    | VABS Social    | VABS Motor
CLED (CP)              | 0.040 (0.707)    | 0.045 (0.677)   | 0.019 (0.858)  | 0.046 (0.668)  | 0.052 (0.628)
CLED (PC)              | -0.239 (0.002)   | -0.245 (0.002)  | -0.236 (0.002) | -0.220 (0.003) | -0.117 (0.277)
CLED-mapping (CP)      | 0.064 (0.552)    | 0.081 (0.455)   | 0.072 (0.502)  | 0.051 (0.633)  | 0.047 (0.665)
CLED-mapping (PC)      | -0.292 (0.005)   | -0.326 (0.002)  | -0.259 (0.009) | -0.243 (0.002) | -0.160 (0.132)
CLED-gradrev (CP)      | 0.031 (0.327)    | 0.046 (0.213)   | 0.022 (0.412)  | 0.013 (0.775)  | 0.023 (0.557)
CLED-gradrev (PC)      | -0.158 (0.024)   | -0.195 (0.009)  | -0.136 (0.026) | -0.154 (0.028) | -0.105 (0.245)
CLED-jointencdec (CP)  | 0.061 (0.154)    | 0.029 (0.796)   | 0.032 (0.767)  | 0.014 (0.895)  | 0.032 (0.768)
CLED-jointencdec (PC)  | -0.3493 (0.0009) | -0.370 (0.0004) | -0.270 (0.001) | -0.323 (0.002) | -0.229 (0.003)
ADOSMod3 dataset. Since the Remote-NLS dataset involves parents, we report the Pearson's correlation (ρ) and associated p-values in the child → parent (CP) and parent → child (PC) directions. The results reported in Table 8.1 show that all of our proposed measures achieve higher accuracy than all baselines when evaluated on the Fisher corpus, with CLED-jointencdec being the best performing measure at over 95% accuracy. This shows that the proposed measures successfully distinguish between real and fake conversations by identifying the degree of entrainment. For ADOSMod3 and Remote-NLS, the performance of all the measures degrades, possibly because of the mismatch between the data distribution of the Fisher corpus and the other two datasets. Since the model was trained with Fisher data, which comprises adult speech, it may not be able to capture all the entrainable information present in the child-inclusive interactions. Among all the measures, CLED-jointencdec performed the best consistently across all the experiments, whereas CLED-gradrev did not perform well across the tasks. One possible reason is the gradient reversal layer failing to filter only the entrainable information flowing across conversational turns, thus confusing the model and ultimately leading to degraded modeling.
From the experimental results reported in Table 8.2, we observe that the proposed CLED measure and its variants hold negative correlations with the VINELAND scores and are positively associated with the CSS and ADOS scores. This observation is consistent with the definition of the proposed distance-based entrainment measures, where higher values of CLED and its variants denote lower levels of entrainment. For the ADOSMod3 dataset, we observe more statistically significant associations in the psychologist → child (PC) direction, but the opposite direction also shows statistically significant results in some experiments. This suggests that, although the proposed measures are most effective at encoding entrainment in the psychologist → child (PC) direction, they can still capture some entrainable information in the opposite direction. However, the child → parent (CP) direction for the Remote-NLS dataset does not yield any meaningful insight in any of the experiments. We suspect two factors contribute to the inconclusive results in the child → parent (CP) direction. First, the mismatch in the training data distribution discussed earlier does not allow the model to learn the nuances associated with child speech in
spontaneous interactions. Second, the child participants in the Remote-NLS dataset often spoke very little due to their age and developmental conditions, leaving a smaller amount of signal from the child's side for the model to learn their entrainment patterns. Many of the participants have their language skill level marked as preverbal or first words by the experts, which also suggests the same.
8.6 Conclusion
In this chapter, we present a data-driven framework for computing entrainment distances between
consecutive speaker turns within dyadic interactions. Our approach involves evaluating these measures within a latent embedded space obtained through contrastive learning from natural conversational data. To validate our methodology, we demonstrate its effectiveness in distinguishing
between real and fake (shuffled) conversations by accurately capturing entrainment in genuine interactions. Furthermore, our proposed measures surpass baseline methods in this verification task,
highlighting their efficacy. We apply these measures in the clinical context of dyadic interactions
involving children to investigate behavioral patterns in both autistic and typically developing children. Specifically, we explore the relationship of our proposed measures of vocal entrainment with
behavioral scores relevant to autism symptoms. Our findings reveal that the introduced measures,
particularly within the speech modality, exhibit meaningful correlations with relevant behavioral
scores, providing statistically significant insights.
Chapter 9
Conclusion and future directions
In this chapter, I provide a concise overview of the main concepts, findings, and improvements observed from our research outlined in preceding chapters. Subsequently, I discuss potential avenues
for future research directions.
9.1 Summary
In this dissertation, I explore methods to process, interpret and analyze child speech in spontaneous
interactions with adults. The demand for reliable and automated speech processing is increasing
with the rise of voice-controlled assistants. Processing child-inclusive interactions presents distinctive acoustic and linguistic challenges stemming from various developmental factors. The contributions towards understanding interaction dynamics can be categorized into two major areas: first, improving the child-adult speaker classification task to compute reliable turn boundaries; and second, exploring different strategies for modeling dyadic interactions that can infer robust behavioral constructs, including entrainment/synchrony, from acoustic and language features, capturing meaningful insights into the socio-cognitive characteristics of the children engaged in the interaction.
During the first half of the dissertation, I report my contributions towards improving the child-adult classification module performance. This module is an important part of the speech processing
pipeline responsible for computing speaker-level boundaries and transcripts, which plays an integral role in identifying conversational turns. First, I develop an adversarial learning based framework to address variabilities arising from age and recording sites that impact child-adult classification performance. The experimental results reveal that adversarial training is indeed very helpful in such
scenarios to bridge the gap between source and target data distributions without impacting the task
performance. Further, as discussed in Chapter 3, I leverage unlabelled child data with additional
pretraining to improve child-adult classification performance. I report the impact of including more
child speech using a self-supervised pretraining objective and experimentally substantiate the effectiveness of the proposed framework by developing speaker discriminative embeddings yielding
better task performance.
In the later part of the dissertation, I study the temporal and interpersonal dynamics of the interactions in the context of quantifying entrainment/synchrony. In the first approach, I analyze the temporal dynamics of the vocal speech patterns of engaged children, both with and without an ASD diagnosis. Specifically, I study the relationship between low-level speech feature descriptors and manually coded clinical ratings relevant to autism symptoms. This leads to a promising finding
that dynamic functionals are more effective in predicting autism diagnosis as compared to static
functionals.
My next contribution involves modeling interpersonal synchrony across vocal and lexical modalities using knowledge-driven approaches. I propose three distinct measures of interpersonal synchrony based on DTW distance, cosine distance, and Word Mover's distance. I experimentally validate the proposed measures and then report results on applications of the behavioral synchrony measures to understand the characteristics of children with and without an autism diagnosis. Further, to address the constraints and limitations of the knowledge-driven approaches, I explore various data-driven approaches to dyadic interaction modeling to quantify entrainment across speech and language patterns. First, I develop a context-aware model for computing entrainment, addressing the need for both short- and long-range temporal context modeling. In the final part, I
employ a contrastive learning approach to learn a feature representation capable of encoding entrainment. Since entrainment can be exhibited across multiple modalities, I investigate modeling
entrainment across speech and language modalities through both uni- and cross-modal formulation.
9.2 Future directions
In Chapters 2 and 3, I address different subproblems related to the child-adult speaker classification (two-speaker diarization) task. In this direction, many recent studies have focused on fine-tuning
pretrained speech foundation models for a target evaluation domain. These pre-trained speech
models learn general-purpose speech representations using self-supervised or weakly-supervised
learning objectives from large-scale datasets. Despite the significant advances made in speaker
classification tasks through the use of pre-trained architectures, fine-tuning these large pre-trained models for different datasets requires saving local copies of the entire set of weight parameters, rendering them impractical to deploy in real-world settings. One interesting research direction will be to investigate parameter-efficient fine-tuning strategies for the child-adult speaker classification task and to compare them with traditional fine-tuning.
Our previous approaches involving speech and language modalities for modeling dyadic entrainment motivate further investigation along other modalities such as facial expressions, head motion, and body gestures. Although there are ongoing efforts to study interaction dynamics involving these modalities, most of them are unimodal, which prompts us to turn towards multimodal and cross-modal formulations. Also, understanding the interplay between entrainment patterns across different communication channels, including visual cues, vocal patterns, prosody, and affective constructs, holds high potential.
An important finding towards modeling interpersonal entrainment lies in the need for addressing context in dyadic interactions. While I explore intra- as well as inter-turn conversational context in our experiments, using longer context windows involving multiple turns can capture important information. Another relevant direction is to locate 'salient' events or behaviors associated with autism symptoms by analyzing local entrainment patterns within an interaction. Identifying relevant activities and computing the proposed entrainment measures per activity window can help extract meaningful insights regarding the kinds of tasks that elicit maximal information from the children, and thus help in understanding their behavioral characteristics better.
One of the major challenges in modeling interactions, especially those involving children, is the unavailability of reliably labeled datasets. With the advancement of generative tools, synthesizing
spontaneous child-inclusive interactions can be very helpful towards this goal. Investigating robust
and reliable ways of generating good-quality synthetic interactions, together with careful annotation efforts,
can be very beneficial for this line of interaction analysis.
Abstract
Objective understanding of the dynamics of child-inclusive interactions requires machines to discern who is talking, when, how, and where, and also to analyze the content of the interactions automatically. Children's speech differs significantly from that of adults, making automatic understanding of child speech a notably more challenging task than adult speech. An additional layer of complexity arises when the interactions relate to clinical or mental health applications for children, as the underlying conditions may give rise to speech or language abnormalities. Robust behavioral representation learning from the rich multimodal content of child-inclusive interactions can help extract meaningful insights toward addressing these problems. However, this is particularly challenging due to the vast heterogeneity and contextual variability present in child-inclusive interactions, and also due to the scarcity of reliably labeled datasets. In this thesis, I develop methods for automated understanding of child speech, which can be broadly divided into several subtasks: detecting child speech, classifying speakers, converting child speech to text, and automatically inferring behavioral constructs using low-level vocal and language-based feature descriptors. Specifically, I focus on the child-adult speaker classification task and on modeling interactional dynamics by quantifying interpersonal synchrony in child-inclusive interactions. The major challenges in child-adult speaker classification stem from two main sources: first, large within-class variability due to age, gender, and symptom severity, and second, the lack of reliable training data to address these variabilities. I first adopted an adversarial learning based approach to resolve the issues related to variability in age and signal collection site. To avoid the need for manual annotation of speaker identities in vocal frames, I relied on self-supervised learning to leverage unlabeled child speech and improve child-adult classification performance. For modeling interpersonal synchrony in child-inclusive interactions, I first utilized knowledge-driven, heuristics-based time-series analysis to extract meaningful information for understanding the behavioral traits of children with and without an autism diagnosis. I extended the experiments across multiple modalities to evaluate whether synchrony metrics can capture complementary information that may be present across modalities. Furthermore, I reported and analyzed possible limitations of the proposed metrics and finally introduced enhanced data-driven, neural network based approaches to address the issues identified in the earlier studies. For each use case, I evaluate the proposed framework on the relevant child-inclusive interaction domain, report results by comparing scores under different conditions, and investigate and analyze the outcomes for a holistic understanding.
Asset Metadata
Creator
Lahiri, Rimita (author)
Core Title
Understanding interactional dynamics from diverse child-inclusive interactions
School
Andrew and Erna Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Electrical Engineering
Degree Conferral Date
2024-08
Publication Date
02/06/2025
Defense Date
07/01/2024
Publisher
Los Angeles, California (original), University of Southern California (original), University of Southern California. Libraries (digital)
Tag
child-adult speaker classification,child-inclusive interactions,interpersonal synchrony,Language,OAI-PMH Harvest,Speech,unsupervised learning
Format
theses (aat)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Narayanan, Shrikanth (committee chair), Bogdan, Paul (committee member), Mataric, Maja (committee member)
Creator Email
rimita.lahiri@yahoo.com,rlahiri@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC113998TAP
Unique identifier
UC113998TAP
Identifier
etd-LahiriRimi-13350.pdf (filename)
Legacy Identifier
etd-LahiriRimi-13350
Document Type
Dissertation
Format
theses (aat)
Rights
Lahiri, Rimita
Internet Media Type
application/pdf
Type
texts
Source
20240807-usctheses-batch-1194 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu