COMPUTATIONAL MODELING OF HUMAN INTERACTION BEHAVIOR
TOWARDS CLINICAL TRANSLATION IN AUTISM SPECTRUM DISORDER
by
Daniel Bone
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
August 2016
Copyright 2016 Daniel Bone
Dedicated to my wife, Taryn, for emotionally supporting me through numerous Interspeech deadlines; also to my mother, father, brother, and sister, for their tireless encouragement.
Contents

Dedication
Contents
List of Tables
List of Figures
Acknowledgments
Abstract

1 Introduction
  1.1 Autism Spectrum Disorder
  1.2 Behavioral Signal Processing
  1.3 Autism, BSP, & Speech Prosody
  1.4 Thesis Statement
  1.5 Specific Aims
  1.6 Technical Contributions and Challenges
  1.7 Thesis Outline

2 Computational Characterization of Atypical Prosody
  2.1 Spontaneous Prosody in ADOS Interviews
    2.1.1 Introduction
    2.1.2 The ADOS and ADOS-ASD Severity
    2.1.3 Study Participants
    2.1.4 Acoustic-Prosodic Features
    2.1.5 Statistical Analysis
    2.1.6 Results, Study I: Correlation of Acoustic-Prosodic Descriptors with ASD Severity
    2.1.7 Results, Study II: Correlational Feature Analysis
    2.1.8 Discussion, Study I
    2.1.9 Conclusions, Study I
    2.1.10 Conclusions, Study II
  2.2 Acoustic-Prosodic Correlates of "Awkward" Prosody in Story Retellings from Adolescents with Autism
    2.2.1 Introduction
    2.2.2 Methodology
    2.2.3 Analysis of Perceptual Ratings
    2.2.4 Acoustic-Prosodic Cues of Awkwardness
    2.2.5 Conclusion and Future Work
  2.3 Outlook for the Quantification of Atypical Prosody

3 The Psychologist as an Interlocutor in ASD Assessment
  3.1 Prosody, Turn-taking, and Language of the Psychologist during ADOS Interactions
    3.1.1 Introduction
    3.1.2 Social Demand Rating System for Study III
    3.1.3 Turn-taking and Language Features
    3.1.4 Statistical Analysis and Machine Learning
    3.1.5 Results and Discussion, Study I: Correlation and Predictive Analysis of ASD Severity from Acoustic-Prosody
    3.1.6 Results and Discussion, Study II: Correlation, Prediction, and Classification based on Dynamic Prosody
    3.1.7 Results and Discussion, Study III: Correlation Analysis of ASD Severity
    3.1.8 Results and Discussion, Study III: Prediction over Varying Social Demand
    3.1.9 Conclusions and Future Work
  3.2 Joint Modeling of Child's and Psychologist's Affect
    3.2.1 Introduction
    3.2.2 Experimental Design
    3.2.3 Results and Discussion
    3.2.4 Conclusions and Future Work

4 Machine Learning for Autism Diagnostics and Screening
  4.1 Machine Learning for Improving Autism Screening and Diagnostic Instruments: Effectiveness, Efficiency, & Multi-Instrument Fusion
    4.1.1 Introduction
    4.1.2 Method
    4.1.3 Results, Designing Effective Algorithms
    4.1.4 Results, Designing Efficient Algorithms
    4.1.5 Discussion
    4.1.6 Limitations
    4.1.7 Implications for Future Research and Clinical Translation

5 Conclusions and Future Work
  5.1 Conclusions
  5.2 Open Problems and Future Work

6 References
List of Tables

1.1 Completed Thesis Technical Contributions and Challenges.
2.1 Demographic information of all subjects: mean (stdv.).
2.2 Spearman's rank order correlation coefficients between acoustic-prosody descriptors and ADOS severity. Positive correlations indicate that increasing descriptor values occur with increasing severity. **, *, and † indicate statistical significance at α = 0.01, α = 0.05, and α = 0.10 (marginal) levels, respectively.
2.3 Correlations of features with ADOS severity and best-estimate diagnosis. * indicates p < 0.05; n.s. is non-significant.
2.4 Participant demographics. '*' designates difference at α = 0.05 level by Wilcoxon rank-sum test.
2.5 Top five features correlated (ρ_S) with perceptual ratings and ASD diagnosis (ASD = 1, TD = 0). Bold: p < 0.01; else p < 0.05.
2.6 Spearman's ρ inter-rater reliability (sig. at α = 0.05).
2.7 Correlations between speaker-averaged ratings, demographics, and ASD diagnosis. Bolded implies sig. at α = 0.05.
2.8 Regression and classification of perceptual ratings and ASD diagnosis via acoustic features and demographic variables. Bolded statistics are significant at the α = 0.05 level by one-sided tests. N_ratings = 322, N_Diag = 69.
3.1 Spearman's ρ between durational descriptors and ADOS severity and atypical prosody ratings. * and † indicate statistical significance at α = 0.05 and α = 0.10 (marginal) levels, respectively.
3.2 Stepwise regression with prosodic features and underlying variables.
3.3 Hierarchical stepwise regression in order: i) child's (psychologist's) prosody; ii) psychologist's (child's) prosody and underlying variables.
3.4 Spearman's rank order correlation between predicted severity and rated severity based on acoustic-prosodic descriptors and actual, rated ADOS severity. Note: ** p < 0.001, * p < 0.01.
3.5 Regression and classification of ASD severity and best-estimate diagnosis via acoustic-prosodic and turn-taking features. Bolded statistics are significant at the α = 0.05 level.
3.6 Correlations of session-level features with ADOS severity. Note: [†, *, **] correspond to α = [0.10, 0.05, 0.01].
3.7 Correlations of prosodic and language feature sets' predictions with ADOS severity over varying social demands. Note: [†, *, **, ***] correspond to α = [0.10, 0.05, 0.01, 0.001]. P: psychologist's features; C: child's features.
3.8 Demographic statistics of the 29 recorded children in this study that were administered Module 3 of the ADOS.
3.9 List of words defined as a backchannel in our corpus.
3.10 Model perplexity as a function of feature input.
3.11 Classification performance in UAR. Note: bold indicates significance above chance at p < 0.01.
4.1 Demographic information for all data subsets. Note that Age 10+ SRS and ADI-R+SRS are identical. '*' indicates differences between ASD and DD at α = 0.05.
4.2 List of important groupings from hierarchical clustering. Note that these groupings were identical in Age 10+ and 10-.
4.3 BEC classification with instrument codes, totals, and classifications as features for ADI-R (Ever and Current) and SRS, split at age 10. Results are in terms of UAR. '*' indicates pairwise-difference between Proposed: Codes and Classification at α = 0.05.
4.4 Instrument fusion in joint ADI-R-Current and SRS sample in terms of UAR. '*' indicates pairwise-difference between Proposed: Codes and Totals at α = 0.05.
List of Figures

1.1 Intended communicative signals (conscious or unconscious) are produced by a subject. Experts develop their own perceptions, but also utilize computational tool outputs.
1.2 What is our ground truth? Illustration of targeted hidden construct with observable information.
1.3 Schematic of thesis statement: modeling interaction between child & psychologist.
1.4 Schematic of thesis statement: modeling applies to interaction with psychologist, parents, & siblings (or peers), as well as modeling with various behavioral signals.
2.1 Second-order polynomial representation of the normalized log-pitch contour for the word "drives". Note: [curvature, slope, center] = [0.11, 0.12, 0.28].
2.2 Example intonation contour plotted with Momel target points and Intsint symbolic representation.
2.3 Duration sample versus exemplar for one sentence, where "just" was missing from the sample production. Computation: ρ_S = 0.90; fraction of samples with a feature value = 14/15; score = 0.90 × 14/15 = 0.84.
3.1 Proportions of conversation containing psychologist and/or child speech. Sessions are ordered and labeled by ADOS severity.
3.2 Coordination of vocal intensity slope median between child and psychologist.
3.3 Coordination of median HNR between child and psychologist.
3.4 Coordination of median jitter between child and psychologist after controlling for psychologist ID and signal-to-noise ratio.
3.5 Example vocal arousal streams from child (C) and psychologist (P) with highlighted regions of synchrony.
3.6 Example vocal arousal streams for child (C) and psychologist (P) with dominance (Dom) and backchannels (Bc).
3.7 Conversational model of vocal arousal (psychologist's view shown). Note: p - psychologist; c - child; Ar - vocal arousal; DA - dialogue act; Dm - dominance; bold indicates a vector ending at turn k.
3.8 Correlation between lag (positive when child leads) and ASD severity vs. window length.
3.9 Lag vs. ASD severity for W_s = 30. ρ_S = 0.49 (p < 0.01).
3.10 Correlation between Granger causality parameters and ASD severity vs. window length. F(C>P) is the magnitude of interaction for the child G-causing the psychologist's behavior, and vice-versa for F(P>C). cf(C) is the child's causal flow.
4.1 Flow diagram of ML-based algorithm development.
4.2 Illustration of model training, tuning, and testing through "nested" cross-validation (CV) as used in the Effective Algorithms section.
4.3 Receiver operating characteristic plots. The Equal Error Rate (EER) Line indicates the UAR optimization point, where sensitivity and specificity are weighted equally. Classifiers should perform above the Chance Line, where UAR equals 50%. Note that we plot sensitivity vs. specificity in order to aid interpretation relative to UAR.
4.4 Optimization curves versus number of codes for Age 10- (top) and Age 10+ (bottom) screeners. Optimization is biased towards sensitivity (roughly 2:1). An elbow-point at 95% of maximum performance is marked for both age groups.
Acknowledgments
This dissertation is made possible with the encouragement and help from my fam-
ily, friends, and colleagues.
Firstly, I would like to thank my advisor Dr. Shrikanth Narayanan. The Ph.D.
is a series of intellectual trials. He provides excellent advice on the big picture
impact of work as well as the details of engineering application, drawing from
his wide expertise. He is a constant source of inspiration for his students, who are
motivated by his enthusiasm. I would like to thank Dr. Sungbok Lee for his advice
throughout my seven years in the Ph.D. regarding prosody and emotion. I would
also like to thank my dissertation committee, Dr. Pat Levitt and Dr. Antonio
Ortega for their insightful feedback and time.
I would like to thank Dr. Pat Levitt, Dr. Somer Bishop, Dr. Matthew Goodwin, Dr. Catherine Lord, and Dr. Marian Williams for sharing their knowledge of autism, providing data, and generally supporting our efforts to bring technological innovation to the domain of autism. I'm fortunate to work with these excellent researchers at the peak of their respective fields.
I'm also lucky to have outstanding colleagues at the Signal Analysis and Inter-
pretation Lab (SAIL). Among others, I am thankful to: Matthew Black and Jeremy
(Chi-Chun) Lee, who mentored me in the art of research; Bo Xiao, who helped me
pass EE 562A; Jangwon Kim, who studied with me for the formidable screening
exam; Naveen Kumar, who answered a million Unix questions; Jimmy Gibson,
who was always willing to discuss research over a beer; Nassos Katsamanis, whose laugh made meetings enjoyable; and Theodora Chaspari and Rahul Gupta, who led the CARE group efforts with me.
Finally, I would like to thank my family and friends for their support. My wife,
Taryn, dealt with my sporadic sleep schedule before the screening exam and many
conference deadlines. My parents, Jan and Ken, have a never-ending supply of
love and encouragement, and always supported me in choosing my own path. In
fact, it was a comment from my father that first inspired my journey towards a Ph.D. My older brother, Joey, taught me how to fight adversity through his own
experiences as an adult, and mine as a little brother. My younger sister, Sarah,
taught me patience and positivity. I would also like to thank those who I have
forgotten to mention.
Abstract
This thesis concerns human-centered signal processing and machine learning, with a focus on creating engineering techniques and systems for societal applications in human health and well-being. Specifically, I aim to develop novel computational methods and tools that will support clinicians and researchers in the domain of autism spectrum disorder (ASD), which has a population prevalence of 1 in 68 (Baio, 2014). Computational methods of behavioral characterization can augment the clinician's analytical capabilities in diagnosis, personalized intervention, and long-term monitoring. Computational dimensional descriptors of behavior may be integral to further developments in the biology and neurology of ASD; they offer a scalable quantitative framework to address a scientific and translational need, augmenting current qualitative methods.
A primary target for computational modeling is speech prosody, the rhythm and timing of speech that communicates meaning and affect, which is difficult to precisely specify qualitatively, even for experts. Children with ASD are almost universally delayed in language acquisition, and early language ability is a primary indicator of long-term prognosis. Prosody is a key marker for early diagnosis, but is still not reliably evaluated. My work aims to automatically characterize social prosody from spontaneous speech samples using statistical signal processing.
Moreover, social prosody does not occur in isolation, but during interaction with a communicative partner. This thesis illustrates the need to model both sides of the interaction simultaneously. My research has produced original and important findings about human interaction in ASD, findings which have translational potential. In particular, I have found that the prosodic, turn-taking, and language cues of the psychologist alter during ASD assessment depending on the level of social-communicative impairment the child displays. In other words, the psychologist's behavior is predictive of the child's level of impairment. Such findings may eventually influence the design of novel psychometric instruments which focus on attunements in the psychologist's behavior in order to assess the child's behavior, or be used in the evaluation of new intervention strategies.
Towards the goal of automatic systems, I have created a robust affective measure from speech in computational vocal arousal. This rule-based method is shown to be robust across databases, and competes well with the state of the art in supervised approaches. I have utilized this method to automatically label vocal arousal in child-psychologist interactions. Next, I investigated the synchrony between child and psychologist, finding interesting results; for instance, for sessions with children having low ASD severity, the psychologist leads the affective exchange, but for sessions with high ASD severity, the child becomes less responsive and leads on average.
Additionally, having approached these problems through the lens of Behavioral Signal Processing (BSP), I have discovered certain pitfalls that must be avoided. Through dissemination in peer-reviewed articles, this thesis also contributes to the amassing set of standards of practice in BSP.
Lastly, I have investigated the application of machine learning to autism diagnostics and screening. I have shown that sensitivity and specificity can be readily tuned to an optimal level through careful application of machine learning to this new domain. I have also shown that instruments can be fused to improve performance. The primary outcome is a screener algorithm which achieves 95% of the instrument's performance using only 5% of the available codes.
Chapter 1
Introduction
This thesis focuses on translational application of Behavioral Signal Processing
(BSP) methodologies to the domain of autism spectrum disorders (ASD). Signal
processing and machine learning techniques offer objective computation of behavior and the ability to "let the data speak for itself", which can have a profound impact in observational science arenas. In this section, I'll introduce autism, BSP, and what they can offer to one another.
1.1 Autism Spectrum Disorder
Autism spectrum disorder (ASD) is a neurodevelopmental disorder defined clinically by impaired social reciprocity and communication, jointly referred to as social affect (Gotham et al., 2007), as well as restricted, repetitive behaviors and interests (American-Psychiatric-Association, 2013). Diagnosis is made through a battery of behavioral tests which incorporate parent/teacher knowledge (e.g., the ADI-R (Rutter et al., 2003)) or clinical observation, such as the "gold-standard" Autism Diagnostic Observation Schedule, or ADOS (Lord et al., 2000). However, the autism phenotype is extremely complex, comprising a spectrum with continuous display of various behavioral factors (Wing, 1988). Accordingly, the recently altered clinical standards in DSM-5 (American-Psychiatric-Association, 2013) remove sub-categories of diagnosis (e.g., Asperger's or PDD-NOS) due to the lack of reliability in categorization across sites, instead preferring an ASD/non-ASD classification along with yet-to-be-determined dimensional descriptors of autistic behavior.
ASD is the fastest growing disorder in the United States: recent prevalence estimates put the number of diagnosed cases around 1 in 68 (Baio, 2014). The disorder brings an extreme financial burden: the average cost on a family is around $60,000 a year (http://www.cdc.gov/ncbddd/autism/data.html). Therefore, the cost of a false positive diagnosis is quite high.
Although the disorder is considered to have neurobiological roots, it is currently diagnosed through observation; the biological definition is dependent on the observational definition. This has led to a quandary: better stratification is needed for clinical and etiological studies, and may best be achieved through biological study, but the observational profiles are often too complex to achieve optimal experimental outcomes in those biological studies.
Early intervention can have a profound impact. Many studies demonstrate
positive outcomes with very early intervention (before two years of age when pos-
sible). However, in order to perform early intervention, early diagnosis is critical.
Diagnosis will depend on many cues, but we believe vocal cues can be an important
factor.
One frequent symptom of autism is 'atypical prosody', an over-loaded term. It covers the whole range of possible atypicalities, from lexical stress (placement of stress within a word), to pragmatic prosody (intonation within a sentence), to affective prosody. Prosody is understudied in ASD, and the set of prosodic abnormalities that define 'atypical prosody' in ASD, and their respective prevalences, are not yet defined. The largest drawback in making use of prosody appears to be the qualitative nature of the definition of prosody and atypicality. Objective measures stand to have a profound impact.
1.2 Behavioral Signal Processing
Behavioral Signal Processing (BSP) combines the domains of Behavioral Science
and Engineering Systems (Narayanan and Georgiou, 2013). Traditionally, Behavioral Science has relied on human coders making qualitative judgments about data, judgments which are dependent on past experiences. With many of the most complex behavioral phenomena, humans either simply cannot perceive sufficiently or they cannot reach consistency between raters. Therefore, something more objective would be very beneficial.
Signals and Systems Engineers study how to reliably extract information from
signals and how to objectively learn from data. These skills are critical when con-
sidering the variability of expression that comprises human behavior. Behavioral
Signal Processing combines these two fields in an effort to produce computational
methods and algorithms that model observable and latent human behavior through
low-level behavioral cues.
The output is a set of computational methods, called Behavioral Informatics,
which are used to model human behavioral constructs. The study of human-human
interaction will enhance engineering applications like human-machine interfaces.
BSP will provide feedback to the behavioral expert, but the critical question is "What feedback will BSP provide?". Behavioral science can automate the coding process. This may provide efficiency in tagging, the ability to make an expert's decisions universal, and insights into the decision-making process. However, a more interesting outcome is to augment human perception. Certain behaviors are difficult for humans to perceive, and objective measures inspired by knowledge may operationalize their definition.
1.3 Autism, BSP, & Speech Prosody
Autism research is at an exciting stage, poised for dramatic increases in under-
standing and treatment of this prevalent neuro-developmental disorder. While
the first academic descriptions of autism from Kanner appeared in 1943, interdisciplinary interest has only recently exploded. Researchers are joining forces to learn the complex etiology that leads to the heterogeneous display of symptoms defined as autism spectrum disorder (ASD). Autism is an exemplar for transla-
tional research of a psychiatric disorder; theoretical and empirical contributions
from clinical, genetic, neuroscience, and animal studies will not only explain the
cause of ASD, but be translated into mechanisms that support early diagnosis
and individualized treatments (Levitt and Campbell, 2009). Moreover, the study
of autism will enhance fundamental knowledge of the core human processes that
define the disorder.
Behavioral Signal Processing (BSP) techniques have significant potential to aid discovery in medicine by supporting human analysis with novel computational tools and techniques. Of particular interest, BSP can assist in defining behavioral
components associated with autism spectrum disorders (ASD), supporting genetic
and neurological studies and informing personalized intervention. Our vision is to
augment the psychologist's analytical capabilities with human-in-the-loop tools as
illustrated in Figure 1.1. Our process is distinct from the most basic black-box
engineering approach which blindly treats the data as a classification problem.
Since our goal is to quantify relations that are beyond human perception, a simple
supervised approach is not viable. Instead, we seek to quantify potential causes
of a general perception through computing features that are inspired by domain
knowledge.
Figure 1.1: Intended communicative signals (conscious or unconscious) are produced by a subject. Experts develop their own perceptions, but also utilize computational tool outputs.
Dimensional behavioral descriptors are crucial to major advancements in the
neurology and genetics of ASD since previous studies of genetics have been limited
by categorical diagnosis. Syndromic conditions (e.g., Fragile X Syndrome) can
account for no more than 15-20% of ASD cases (Abrahams and Geschwind, 2010).
The prevailing opinion is that individual variants will not precisely map onto clin-
ical categories of diagnosis. It is suspected that genetic variants are more strongly
associated with intermediate phenotypes (or endophenotypes) that amount to an
ASD diagnosis, than the categorical diagnosis itself. This attitude is paralleled in
neurological studies of ASD, where brain circuits are expected to show poor cor-
respondence to clinically defined disease boundaries (Abrahams and Geschwind,
2010). So autism nosology is now at a critical moment in which the field requires more-detailed characterization of core ASD components for support in finding genetic and neurological etiology and in further stratifying this disorder (Lord and
Jones, 2012). Thus, dimensional behavioral phenotyping is essential. My research
aims to address this crucial need through an informatics-centric approach.
Figure 1.2: What is our ground truth? Illustration of targeted hidden construct with observable information.

Qualitative descriptions of abnormal speech prosody appear throughout the ASD literature, yet contradictory findings are common, and the specific features of prosody measured are not always well defined (McCann and Peppe, 2003), a
testament to both their relevance and the challenges in standardizing prosodic
assessment. Structured laboratory tasks have been used to evaluate prosodic func-
tion in children with ASD. Such studies have shown, for instance, that both sen-
tential stress (Paul et al., 2005b) and contrastive stress (Peppe et al., 2007) dif-
fered in children with ASD versus neuro-typical peers. However, as Peppe (2011)
has remarked, structured assessment "provides no information about aspects of prosody that do not affect communicative function in a concrete way, but may have an impact on social functioning or listenability ... such as speech-rhythm, pitch-range, loudness and speech-rate" (p. 18). Thus, reliable measurement of
spontaneous speech prosody is currently lacking. Methods that rely on human
perception and annotation of each participant's data are time-intensive, limiting
the number of participants that can be studied. Human annotation is also prone to reliability issues, with inadequate reliability found for item-level scoring of certain
prosody voice codes (Shriberg et al., 2001). Therefore, automatic computational
analysis of prosody may be an objective, scalable complement to expert human
annotation.
Complicating matters, as a result of there being no reliable ground truth, the experimental protocol for incorporating objective measures is indirect. The scenario is displayed in Figure 1.2, wherein neither human annotation nor objective measures perfectly capture the desired construct of Atypical Prosody. The typical protocol has been to relate acoustics to autistic prosody, in the form of diagnostic category; but not all individuals with autism have atypical prosody, and those atypicalities may not be specific to autism. Given the limitations of human annotation, the most desirable system may be one that is defined from the signal up. This thesis makes initial steps toward the goal of a signal-derived prosodic profile by defining various acoustic parameters and relating them to human annotations of atypicality, which will correlate with the primary target. This issue is discussed further in the Outlook portion of Chapter 5.
1.4 Thesis Statement
This thesis posits that "Computational Models of Human Interaction Behavior Can Have Translational Impact in Autism". Specifically, I am concentrating on three primary areas: quantification of atypical prosody in ASD; computational behavior modeling of dyadic interactions involving children with and without autism, as depicted in Figure 1.3; and machine learning for autism diagnosis. Machine learning is optimal for certain objectives which were previously approximated through other statistical techniques; we present its use for creating screening instruments that achieve near-optimal performance using a small subset of potential codes. But most of this thesis regards interaction. Atypical prosody has already been discussed at length, but that atypicality is largely viewed in the context of an interaction; thus, we must consider both partners in order to make any judgments.
When people communicate, we can typically only see their expressed behaviors (if we have physiological measurements we can get some internal measures as well). Those behaviors are perceived by the communicative partner, and influence the partner's later decisions. Therefore, the expressed behaviors are mutually dependent. If someone's internal state is altered, such as in ASD, it can alter their perceptions, productions, and the observed behavior. Part of this thesis seeks to infer how expressed behavior relates to ASD severity. The aim is to produce an objective, quantitative, dimensional measure of social prosody. Further, since this is an interaction, and the behaviors are mutually dependent, it is interesting to consider how much information we can obtain about an ASD child's behavior from their interacting partner's behaviors. Such an approach could be used even in the absence of observing the target child's behavior.
Figure 1.3: Schematic of thesis statement: modeling interaction between child &
psychologist.
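To make the idea of analyzing mutually dependent behavior streams concrete, the following is a minimal sketch, and not the method developed later in this thesis, of one such interaction measure: a lead/lag estimate between two behavioral signals, such as child and psychologist vocal arousal, via normalized cross-correlation. The stream contents, frame rate, and maximum lag are illustrative assumptions.

```python
# Illustrative sketch only: lead/lag estimation between two behavioral streams
# via normalized cross-correlation. Positive lag means the child leads.
import numpy as np

def lead_lag(child, psych, max_lag=10):
    # z-normalize both streams so the correlation is scale-invariant
    child = (child - child.mean()) / (child.std() + 1e-8)
    psych = (psych - psych.mean()) / (psych.std() + 1e-8)

    def corr_at(lag):
        # correlate child[t] with psych[t + lag]
        if lag >= 0:
            a, b = child[:child.size - lag], psych[lag:]
        else:
            a, b = child[-lag:], psych[:psych.size + lag]
        return float(np.corrcoef(a, b)[0, 1])

    lags = list(range(-max_lag, max_lag + 1))
    return lags[int(np.argmax([corr_at(l) for l in lags]))]

# Synthetic demo: the psychologist's stream follows the child's by 3 frames.
rng = np.random.default_rng(0)
child = rng.standard_normal(300)
psych = np.roll(child, 3) + 0.1 * rng.standard_normal(300)
print(lead_lag(child, psych))  # prints 3: the child leads
```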
This thesis framework will generalize to other interactions. Various pairs of
communicative partners can be interchanged as shown in Figure 1.4. There are
advantages to modeling parent-child behavior or sibling/peer-child behavior, and this can be executed with the computational methodologies that are proposed in
this thesis. Additionally, in this work we concentrate primarily on speech data,
but there are many other behavioral cues that come from other modalities. In the
future, we plan to look at the behavioral expressions of facial and body gestures,
and internal signals such as heart-rate variability. Most interesting of all, we want
to explore the coordination between these multiple streams as it is their temporal
coordination which may be perceived as awkward.
Figure 1.4: Schematic of thesis statement: modeling applies to interaction with psychologist, parents, & siblings (or peers), as well as modeling with various behavioral signals.
1.5 Specific Aims
This thesis will produce novel computational tools to model human behavior during
interaction in order to precisely locate social-communicative difficulties related to ASD. Specifically, it will produce quantitative, dimensional descriptors of social behavior from observational signals. Enhanced behavioral phenotyping will better inform personalized intervention of this debilitating syndrome, as well as support original biological findings and consequent advances.
Aim 1: Devise computational features, models, and tools through which we can observe the nature of socio-emotional human behavior as it varies over time; then, precisely locate social-communicative difficulties related to ASD.

Challenge: Obtaining higher-level behavioral constructs from raw signal data may not be directly achievable, meaning intermediate behavioral constructs will need to be developed and validated.

Approach: First, signal cues that inform or indicate who, what, when, how, where, and why will be extracted. Second, these behavioral cues will be mapped to behavioral constructs (e.g., synchrony, quality of interaction).

Impact: Resulting quantitative tools will proliferate in the community, aiding efforts in assessment, monitoring, and intervention. These quantitative, objective measures of behavior in interaction will create a surge in the understanding of social abilities of children with autism via large-sample studies.
Aim 2: Measure prosodic impairments from spontaneous speech in interactions, creating a dimensional measure of social prosody.

Challenge: Speech prosody is difficult to detail qualitatively, and is typically studied in non-social settings.

Approach: Working with expert clinical psychologists, we will create and employ robust prosodic measures to capture various potential prosodic abnormalities (e.g., monotone or hyper-nasal voice), observing to what extent each measure accounts for atypicality within individuals, and the prevalence of these measures across individuals.

Impact: Aside from providing currently-unknown population prevalence statistics, instruments will be developed for early diagnosis, personalized intervention, and monitoring of social prosody. Further, computational characterization of social prosody may be integral to stratification and subsequent progress in discovering biological markers of ASD.
Aim 3: Promote effective practices that can lead to transformative outcomes in this interdisciplinary computational-behavioral domain of ASD.

Challenge: Appropriate application of computational techniques requires strong knowledge of both the techniques and the application domain. Otherwise, misleading results can arise.

Approach: Our studies have concentrated on producing innovative, verifiable results, on which further efforts can build. We have published potential pitfalls.

Impact: We have previously described, with concrete examples, ways in which misleading results can occur. This methodological work may have profound impact in both the ASD and engineering research communities. In fact, these efforts have led us to create more accurate and successful works.
1.6 Technical Contributions and Challenges
The completed technical contributions and challenges of this thesis are presented in Table 1.1. (1.) The first contribution is in identifying "what is atypical prosody?". This first contribution to the thesis is partially completed, but will be pursued further in future work. The challenges are in the variability of human expression, the need to understand context, and the lack of a reliable ground truth for standard machine learning approaches. (2.) The second contribution is in using prosody to study interaction. In doing so, we have provided the first objective evidence that the psychologist adapts their behavior in a predictable way depending on the child's ASD severity. Similar challenges exist. (3.) The third contribution is in creating objective, generally-applicable affective measures from speech. We have provided one such measure in vocal arousal. It is more reliable and interpretable than the best supervised approaches. (4.) I have studied the affective exchange between child and psychologist with this objective measure. (5.) Additionally, through conducting empirical studies, I have come across certain issues that are useful to the BSP community. We have cited certain BSP scientific standards of practice that are useful to avoid pitfalls. (6.) I have created a screener algorithm for individuals with autism that is quick to administer. This was created through application of machine learning, and plans are being made to test it in a clinical population.
Table 1.1: Completed Thesis Technical Contributions and Challenges.

Contribution | BSP Category | Challenges
Objective Quantification of Atypical Prosody | Signal Proc. & Modeling | Data variability, context variability, unreliable ground truth
Quantitative Evidence of Psychologist Adaptation | Modeling | Data & sensing limitations, context variability, unreliable baseline
Robust Vocal Arousal | Signal Proc. & Modeling | Supervised approach is data-dependent
Affective Synchrony | Modeling | Choosing the appropriate model
BSP Standards | Sensing, Signal Proc., & Modeling | Avoiding pitfalls
Machine Learning in ASD Diagnostics | Effectiveness, Efficiency, & Instrument Fusion | Limited to human observation of behaviors, avoiding pitfalls
1.7 Thesis Outline
This thesis is organized as follows. First, computational characterization of atypical prosody is discussed in Chapter 2, with spontaneous interviews and relations to ASD diagnosis and severity in Section 2.1, and human perceptions of prosodic awkwardness in Section 2.2. Second, interaction modeling in terms of acoustic-prosodic, turn-taking, and lexical features is discussed in Section 3.1. Specifically, we detail three principal studies that look at the child's and the interacting psychologist's use of spontaneous prosody. We find that the psychologist's cues are at least as informative of the child's ASD severity as the child's cues. We also look at conversational cues, observing that conversational quality degrades for children with higher-severity ASD, as well as examine how the cues vary with social demand. Third, affective interaction is explored in Section 3.2, wherein we quantify the affective exchange between child and psychologist. We previously demonstrated that we can reliably quantify vocal arousal in different datasets (Bone et al., 2014b). Fourth, we focus on the promise of applying machine learning to autism diagnosis and screening in Chapter 4. We conclude and propose future work of this thesis in Chapter 5.
Chapter 2
Computational Characterization
of Atypical Prosody
In this chapter we present experiments in a primary thread of this thesis: efforts to quantify what is perceived as atypical in the speech prosody of children with autism spectrum disorder (ASD). In the first section, we present two experiments that utilize diagnostic category and ASD severity as dependent variables, taking the approach of quantifying "autistic prosody". As mentioned in Section 1.3, there is no perfect ground truth for atypical prosody in ASD, and that is one of the primary motivations for our computational efforts. Accordingly, we also compare some of our measures to human perception of atypicality in Section 2.2; specifically, acoustic measures are correlated with Amazon Mechanical Turkers' perceptions of "awkwardness". The outlook for quantifying atypical prosody is discussed in Section 2.3.
2.1 Spontaneous Prosody in ADOS Interviews
The purpose of the following experiments is to examine relationships between prosodic speech cues and autism spectrum disorder (ASD) severity. We objectively quantified acoustic-prosodic cues of children with ASD during spontaneous interaction, establishing a methodology for future large-sample analysis. Speech acoustic-prosodic features were semi-automatically derived from segments of semi-structured interviews (Autism Diagnostic Observation Schedule; ADOS) with children previously diagnosed with ASD (Study I), as well as individuals with non-ASD neurodevelopmental disorders (Study II). Prosody is quantified in terms of intonation, volume, rate, and voice quality. Research hypotheses are tested via correlation and hierarchical and predictive regression between ADOS severity and prosodic cues (Experiment I), as well as classification and support vector regression (Experiment II). In Chapter 3, we demonstrate how features of the psychologist alter depending on the ASD severity of the interacting child using the same data.
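To illustrate what semi-automatically derived features of this kind can look like in practice, the sketch below computes a few utterance-level intonation and volume descriptors with the open-source librosa toolkit. The file name, pitch bounds, and choice of statistics are illustrative assumptions, not the feature set of these studies; rate and voice-quality measures would require additional tooling (e.g., Praat).

```python
# Hedged sketch: utterance-level intonation and volume descriptors.
# Assumes a mono recording "child_utterance.wav"; not the studies' exact pipeline.
import numpy as np
import librosa

y, sr = librosa.load("child_utterance.wav", sr=16000)

# Intonation: frame-level F0 via pYIN, summarized on a log scale over voiced frames.
f0, voiced, _ = librosa.pyin(y, fmin=75, fmax=600, sr=sr)
log_f0 = np.log(f0[voiced])
slope = np.polyfit(np.arange(log_f0.size), log_f0, 1)[0]  # global pitch trend

# Volume: frame-level RMS energy converted to dB.
rms_db = librosa.amplitude_to_db(librosa.feature.rms(y=y)[0])

features = {
    "logf0_median": float(np.median(log_f0)),
    "logf0_iqr": float(np.percentile(log_f0, 75) - np.percentile(log_f0, 25)),
    "logf0_slope": float(slope),
    "rms_db_median": float(np.median(rms_db)),
    "rms_db_iqr": float(np.percentile(rms_db, 75) - np.percentile(rms_db, 25)),
}
print(features)
```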
The results support the promise of automatically extracted prosodic features of children with ASD that allow for scalable analysis of large corpora. We observe a number of effects in intonation and voice quality in these two experiments.¹
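For readers unfamiliar with the analysis styles named above, the following is a minimal sketch of the two: per-descriptor rank correlation with severity, and severity prediction via support vector regression under leave-one-out cross-validation. The feature matrix, severity scores, and hyper-parameters are placeholders rather than the studies' data or settings.

```python
# Minimal sketch of the analyses named above, on placeholder data.
import numpy as np
from scipy.stats import spearmanr
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.standard_normal((28, 5))        # 28 children x 5 prosodic descriptors
y = np.clip(np.round(5 + 2 * X[:, 0] + rng.standard_normal(28)), 1, 10)

# Correlation analysis: Spearman's rho of each descriptor with ADOS severity.
for j in range(X.shape[1]):
    rho, p = spearmanr(X[:, j], y)
    print(f"descriptor {j}: rho={rho:+.2f}, p={p:.3f}")

# Predictive analysis: support vector regression with leave-one-out
# cross-validation, suited to small samples; predictions are then compared
# to the rated severity with a rank correlation.
model = make_pipeline(StandardScaler(), SVR(kernel="linear", C=1.0))
y_hat = cross_val_predict(model, X, y, cv=LeaveOneOut())
print("predicted vs. rated severity:", spearmanr(y_hat, y))
```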
2.1.1 Introduction
"It's not what you say, but how you say it." This common saying elucidates how critical speech prosody is to effective communication. Speech prosody, which refers to the manner in which a person utters a phrase to convey affect, mark a communicative act, or disambiguate meaning, plays a critical role in social reciprocity. A central role of prosody is to enhance communication of intent, and thus enhance conversational quality and flow. For example, a rising intonation can indicate a request for response, while a falling intonation can indicate finality (Cruttenden, 1997). Prosody can also be used to indicate affect (Juslin and Scherer, 2005) or attitude (Uldall, 1960). Furthermore, speech prosody has been associated with social-communicative behaviors such as eye contact in children (Furrow, 1984).
¹ This work appears in Bone et al. (2014a) & Bone et al. (2016b).
Unfortunately, many verbal individuals with autism spectrum disorder (ASD) have deficits in both discerning a speaker's intent from prosody and producing appropriate prosody (Paul et al., 2005a), which are detrimental to social functioning. Atypical prosody in ASD is considered an understudied, high-impact research area (McCann and Peppe, 2003), one that can have significant translational impact; recent promising findings have shown that visual feedback intervention based on even simple prosodic measures such as vocal intensity improves production (Simmons et al., 2016). We believe that objective computational methods can support advances in the understanding and treatment of atypical prosody.
Atypical prosody is also relevant to certain overarching theories of ASD, for example impaired theory of mind (Baron-Cohen, 1988, Frith, 2001, Frith and Happé, 2005, McCann and Peppe, 2003). Specifically, an inability to gauge the mental state of an interlocutor may be due to impairments in the perception of prosody, which in turn may create challenges for producing appropriate prosodic functions. Many studies have investigated receptive and expressive language skills in autism (e.g., Boucher et al., 2011, Paul et al., 2005b). Tested theories include the speech attunement framework (Shriberg et al., 2011), which decomposes production-perception processes into "tuning in" to learn from the environment, and "tuning up" one's own behavior to a level of social appropriateness, as well as disrupted speech planning and atypical motor system function such as in childhood apraxia of speech (American Speech-Language-Hearing Association, 2007a,b). Given the complexity of developing speech, it is not surprising that the mechanisms through which atypical prosody occurs in children with ASD remain unclear.
ASD Atypical Prosody Descriptions. Qualitative descriptions of prosodic abnormalities appear throughout the ASD literature, but contradictory findings are common, and the specific features of prosody measured are not always well defined (McCann and Peppe, 2003), a testament to both their relevance and the challenges in standardizing prosodic assessment. For example, pitch range has been reported as both exaggerated and monotone in individuals with ASD (Baltaxe, 1977). Characterization of prosody is also incorporated within the widely used diagnostic instruments, the Autism Diagnostic Observation Schedule (ADOS; Lord et al., 2000, 1999) and the Autism Diagnostic Interview, Revised (ADI-R; Rutter et al., 2003). The ADOS considers any of the following to be speech abnormalities associated with ASD: "slow and halting; inappropriately rapid; jerky and irregular in rhythm . . . odd intonation or inappropriate pitch and stress, markedly flat and toneless, . . . consistently abnormal volume" ((Lord et al., 1999), Module 3, p. 6), and the ADI-R prosody item focuses on the parent's report of unusual characteristics of the child's speech, with specific probes regarding volume, rate, rhythm, intonation, or pitch. A variety of markers can contribute to a perceived oddness in prosody, such as differences in pitch slope (Paccia and Curcio, 1982), atypical voice quality (Sheinkopf et al., 2000), and nasality (Shriberg et al., 2001). This inherent variability and subjectivity in characterizing prosodic abnormalities poses measurement challenges.
Structured laboratory tasks have been used to assess prosodic function more precisely in children with ASD. Such studies have shown, for instance, that both sentential stress (Paul et al., 2005b) and contrastive stress (Peppe et al., 2007) differed in children with ASD compared to typical peers. Peppe et al. (2007) developed a structured prosodic screening profile which requires individuals to respond to computerized prompts; observers rate the expressive prosody responses for accuracy in terms of delivering meaning. However, as Peppe (2011) has remarked, the instrument "provides no information about aspects of prosody that do not affect communication function in a concrete way, but may have an impact on social functioning or listenability . . . such as speech-rhythm, pitch-range, loudness and speech-rate" (p. 18).
In order to assess these global aspects of prosody which are thought to differ in individuals with atypical social functioning, qualitative tools have been used to evaluate prosody along dimensions such as phrasing, rate, stress, loudness, pitch, laryngeal quality, and resonance (Shriberg et al., 1997, 2011, 2001). While these methods incorporate acoustic analysis with software in addition to human perception, intricate human annotation is still necessary. Methods that rely on human perception and annotation of each participant's data are time-intensive, limiting the number of participants that can be efficiently studied. Human annotation is also prone to reliability issues, with marginal to inadequate reliability found for item-level scoring of certain prosody voice codes (Shriberg et al., 2001). Therefore, automatic computational analysis of prosody has the potential to be an objective alternative or complement to human annotation that is scalable to large datasets, an appealing proposition given the wealth of spontaneous interaction data already collected by autism researchers.
Study I Goals and Rationale
Because precise characterization of the global aspects of prosody for ASD has not been established (Diehl et al., 2009, Peppe et al., 2007), the first study presents a strategy to obtain a more objective representation of speech prosody through signal processing methods that quantify qualitative perceptions. This approach is in contrast to experimental paradigms of constrained speaking tasks with manual annotation and evaluation of prosody by human coders (Paul et al., 2005b, Peppe et al., 2007). Furthermore, previous studies have been limited primarily to the analysis of speech of children with high-functioning autism (HFA) out of the context in which it was produced (Ploog et al., 2009). While clinical heterogeneity may explain some conflicting reports regarding prosody in the literature, analysis of more natural prosody through acoustic measures of spontaneous speech in interactive communication settings has the potential to contribute to better characterization of prosody in children with ASD.
The first study analyzed speech segments from spontaneous interactions between a child and a psychologist recorded during standardized observational assessment of autism symptoms using the ADOS. The portions of the assessment that were examined represent spontaneous interaction that is constrained by the introspective topics and interview style. Spontaneous speech during the ADOS assessment has been shown to be valid for prosodic analysis (Shriberg et al., 2001). Prosody is characterized in terms of the global dynamics of intonation, volume, rate, and voice quality. Regarding potential acoustically-derived correlates of perceived abnormalities in these speech segments, few studies offer suggestions (Diehl et al., 2009, van Santen et al., 2010), and even fewer have additionally assessed spontaneous speech (Shriberg et al., 2011). As such, the current study proposes a set of acoustic-prosodic features to represent prosody in child-psychologist dialogue.
Some researchers have called for a push toward both dimensional descriptions of behavior and more valid and reliable ways to quantify such behavior dimensions (e.g., Lord and Jones (2012)). This work, part of the emerging field of behavioral signal processing, or BSP (Narayanan and Georgiou, 2013), attempts to address these goals. For instance, such computational approaches have lent quantitative insight into processes such as prosodic entrainment between interacting dyads and affective patterns (Lee et al., 2014). The co-variation between continuous behavioral descriptors of speech prosody and dimensional ratings of social-affective behavior is investigated in the present paper. Given the apparent continuum of phenotypic behavior, correlational analysis utilizing ordinal-scale behavior ratings may prove invaluable toward effective stratification that supports further study (e.g., genetic research).
This first experiment provides a more detailed analysis than was documented in a previous report on spontaneous prosody during the ADOS (Bone et al., 2012b). The overarching goal is to develop a framework for large-sample analysis of prosody, in a dyadic setting, by using semi-automatic computational methods. The validation of the strategy to perform large-scale analysis of natural speech data between clinician and child has the potential to provide greater insight for developing more effective ASD interventions. The specific aims addressed in the present study include: (1) demonstration of the feasibility of semi-automatic computational analysis of specific, perceptually-inspired acoustic-prosodic elements of speech during naturalistic conversational interchange in children with ASD; (2) exploration of the relationship between prosodic features in the speech of the child with ASD and those of the psychologist interlocutor; (3) exploration of the relationship between autism symptom severity and prosodic features of the child; (4) exploration of the relationship between autism symptom severity and the prosodic features of the psychologist during interaction with the child. We hypothesized that both the psychologists' prosody and the child's prosody would vary depending on the level of severity of ASD symptoms of the child.
Study II Goals and Rationale
In the second experiment, we continue towards an automatic prosodic evaluation by
analyzing prosodic display in a large sample of individuals with ASD as well as non-
ASD developmental disorders. Additionally, we introduce a novel feature group
based on the coordination of prosodic modalities, and we investigate goodness
of pronunciation (GOP). One limitation of the second study is that we cannot
examine voice quality given potential recording differences between sites; future
investigations should focus on this critical feature group. Through this study,
we aim to enhance our understanding of signal-derived speech prosody measures,
which are vital to behavioral interaction analyses and the creation of automated
clinical tools.
2.1.2 The ADOS and ADOS-ASD Severity
The Autism Diagnostic Observation Schedule (ADOS; Lord et al., 1999) was administered by one of three psychologists with research certification in the measure. The ADOS is a standardized assessment of autism symptoms conducted through a series of activities designed to elicit a sample of communication, social interaction, play, and other behaviors. The ADOS is designed with different Modules, chosen based primarily on the child's level of expressive language. The present study includes only participants who were administered Module 3, designed for children with fluent speech, defined according to the ADOS manual as speech which includes "a range of flexible sentence types, providing language beyond the immediate context, and describing logical connections within a sentence" ((Lord et al., 1999), p. 5). In order to identify the child's level of verbal fluency, a three-step process was followed. First, the parent answered a series of questions about
the child's language level by telephone prior to the session. Next, the psychologist
interacted with the child in the clinic while the research assistant was obtaining informed consent, to further confirm the child's level of verbal fluency. If the child spoke in complete utterances during this interaction, the psychologist proceeded with administering Module 3. The psychologist then continued to assess the child's verbal fluency during the first 10 minutes of the ADOS session. Following the standard ADOS protocol, the psychologist changed Modules after the first part of the assessment if the child's expressive language did not fit the definition of fluent speech in the ADOS manual required for Module 3. For this study, only partici-
pants who were administered Module 3 were included. Formal language assessment
was not conducted as part of the larger study, so data about the relative language
skills of the participants could not be presented.
Measures: Targeted ADOS activities. The video-recorded speech samples for the studies were obtained from two ADOS interview activities, Emotions and Social Difficulties and Annoyance. These activities were selected because each offers a continuous sampling of conversational speech, rich with emotionally-focused content pertinent to ASD diagnosis. A child with ASD may be less comfortable communicating about these particular topics than typically-developing peers, which
should be noted during interpretation of results. Since the conversational style
of these two subtasks is rather constrained, such apprehension may be implicitly
captured by the automatic measures.
ASD severity. The ADOS Module 3 includes 28 codes scored by the examiner immediately following the assessment. The Diagnostic Algorithm consists of a subset of the codes, used to determine if the child's scores exceed the cutoffs typical of children with autism in the standardization group for the measure. For this analysis, the revised algorithms (Gotham et al., 2007) were used rather than the original ADOS algorithm, since they are based on more extensive research regarding the codes which best differentiate children with ASD from typically-developing children. Algorithm scores were then converted to an autism symptom severity score, following the recommendation of Gotham et al. (2009). The dependent variable in this study was the severity score, which is based on the Social Affect and Restricted, Repetitive Behaviors factors in the revised ADOS diagnostic algorithm and the severity scale that is used for normalization across modules and age (Gotham et al., 2009).
ADOS severity was analyzed instead of the atypical prosody ADOS code, "Speech Abnormalities Associated with Autism," for three reasons: (1) atypical prosody is difficult to describe and relies on subjective interpretation of multiple factors; (2) atypical prosody in the ADOS is coded on a low-resolution 3-point scale; and (3) the atypical prosody ADOS code is highly correlated with overall ADOS severity (in the USC CARE data set, r_s(26) = 0.73, p < 0.001).
2.1.3 Study Participants
Study I Participants
Participants were recruited as part of a larger study of children with autism spectrum disorders, with or without co-occurring medical conditions. The present study included 28 children without a diagnosed or parent-reported medical condition, ranging in age from 5 years, 8 months to 14 years, 7 months (mean = 9.8; SD = 2.5). Twenty-two (79%) were male; 20 (71%) were Hispanic, and 8 (29%) were White, Non-Hispanic. Parents were asked to report the child's primary or
first language. The first languages of the 28 participants were: 15 English (54%), nine Spanish (32%), and four both English and Spanish (14%).
These data are a subset of the USC CARE Corpus (Black et al., 2011a). The behavioral data were collected as a part of a larger genetic study, for which the ADOS was administered to confirm the ASD diagnosis. Age for inclusion was 5 to 17 years, and for this sample, prior diagnosis of an autism spectrum disorder by a professional in the community was required. All verbally fluent children from the larger study were included in this sample, determined based on the decision of the psychologist to administer Module 3 of the ADOS (see below in Measures).

Confirmation of autism diagnosis was established by the psychologist based on ADOS scores, any input provided by the parent during the assessment, and review of available records of the previous diagnosis. In this sample, 17 (61%) of the participants had a confirmed diagnosis of autism on the ADOS, five (18%) had a diagnosis of autism spectrum disorder (ASD) but not full autism, and six (21%) scored below the cutoff for ASD on the ADOS, meaning they were deemed to not have autism spectrum disorder.
Children whose parent(s) spoke primarily Spanish were assessed by a bilingual (Spanish/English) psychologist, and children had the option to respond in or request Spanish interactions if they felt more comfortable conversing in Spanish. This sample includes only children who chose to participate in the assessment in English; one participant was excluded from this analysis due to a primarily Spanish discourse. Another participant was excluded due to nominal vocal activity (verbal or non-verbal) during the assessment, which furthermore was muffled and unintelligible.
In addition to the children, this study includes speech data from the three
licensed psychologists who administered the ADOS for the genetic study. All three
psychologists were women, and all were research-certified in the ADOS and had extensive clinical experience working with children with ASD. Two psychologists were bilingual in English and Spanish; one was a native Spanish speaker who was also fluent in English.
Study II Participants
Experimental data consist of Autism Diagnostic Observation Schedule (ADOS; Lord et al., 2000) Module 3 videos of a child interacting with a psychologist. Module 3 is intended for children who are verbally fluent, and thus speech prosody is a valid analytical target. Data consist of 95 children with autism spectrum disorder (ASD) and 81 subjects with a non-ASD developmental disorder; non-ASD subject diagnoses include attention deficit hyperactivity disorder (ADHD), language disorders, mood/anxiety disorders, and intellectual disability. Participant demographics are presented in Table 2.1, including: ADOS severity, non-verbal IQ, age, and gender. We control for demographic differences during later analyses. ADOS severity is a measure of symptom severity from 1-10, with 10 being most severe. Subject diagnoses are "best-estimate clinical diagnoses" and consider other factors beyond the ADOS, such as parent report.
Data were collected at two sites as part of an IRB-approved study. Video and audio quality varies between sessions and sites. Because diagnosis and ADOS severity were biased by site, we did not feel confident in using voice quality or energy-based measures, which were previously shown to be characteristic of ASD speech (Bone et al., 2014a) but could be affected by site-specific channel differences (Bone et al., 2013a). There were a total of 9 psychologists across sites. Occasionally a second psychologist or a parent was in the room. The second psychologist's actions were attributed to the primary psychologist, while a parent's actions only affected latency calculations.
Table 2.1: Demographic information of all subjects: mean (stdv.)

          N   Severity   Age (yr.)  NVIQ         Female
ASD       95  7.2 (2.1)  8.8 (2.6)  97.2 (20.3)  21.0%
non-ASD   81  2.6 (2.0)  8.3 (2.5)  95.7 (17.9)  30.9%
2.1.4 Acoustic-Prosodic Features
A primary goal of these studies is to capture disordered prosody by direct speech signal processing techniques in such a way that the approach may scale more readily than full manual annotation. In the first experiment, twenty-four features (number of each type denoted parenthetically) were extracted which address four key areas of prosody relevant to ASD: pitch (6), volume (6), rate (4), and voice quality (8). These vocal features were designed through referencing linguistic and engineering perceptual studies in order to capture the qualitatively-described disordered prosody reported in the ASD literature. The features are detailed in the following subsections and listed by feature type in Table 2.2. The signal analysis used here can be considered semi-automatic since it takes advantage of manually derived text transcripts for accurate automatic alignment of the text to the audio. The modified features used in the second experiment are described subsequently; it should be noted that voice quality features were not examined there due to differences in channel conditions between sites.
Text-to-speech alignment
A necessary objective of this study is to appropriately model the interaction with meaningful vocal features for each participant. For many of the acoustic parameters that we extract, it is necessary to understand when each token (word or phoneme) was uttered within the acoustic waveform. For example, detecting the start and end times of words allows for the calculation of syllabic speaking rate, and the detection of vowel regions allows for the computation of voice quality measures. Manual transcription at this fine level is not practical or scalable for such a large corpus, and thus we rely on computer speech-processing technologies. Since a lexical-level transcription was available with the audio (text transcript), we employ the well-established method of automatic forced alignment of text-to-speech to provide the alignment (Katsamanis et al., 2011).
The sessions were first manually transcribed using a protocol adapted from the Systematic Analysis of Language Transcripts (SALT) transcription guidelines (Miller and Iglesias, 2008) and segmented by speaker turn (i.e., the start and end times of each utterance in the acoustic waveform). The enriched transcription includes: partial-words, stuttering, fillers, false-starts, repetitions, nonverbal vocalizations, mispronunciations, and neologisms. Speech that was inaudible due to background noise was marked as such. In this study, speech segments that were unintelligible or contained high background noise were excluded from further acoustic analysis.
With the lexical transcription completed, automatic phonetic forced-alignment to the speech waveform was then performed using the HTK software (Young et al., 1993). Speech processing applications require that speech is represented by a series of acoustic features. Our alignment framework uses the standard Mel-frequency cepstral coefficient (MFCC) feature vector, a popular signal representation derived from the speech spectrum, with standard HTK settings: a 39-dimensional MFCC feature vector (energy of the signal + 12 MFCCs, and first- and second-order temporal derivatives), computed over a 25 ms window with a 10 ms shift. Acoustic models (AMs) are statistical representations of the sounds (phonemes) that make
up words, based on the training data. Adult-speech AMs (for the psychologist's speech) were trained on the Wall Street Journal Corpus (Paul and Baker, 1992), and child-speech AMs (for the subject's speech) were trained on the Colorado University (CU) Children's Audio Speech Corpus (Shobaki et al., 2000). The end result was an estimate of the start and end time of each phoneme (and thus each word) in the acoustic waveform. For Study II, word, syllable, and phonemic forced-alignment were performed using the Kaldi software (Povey et al., 2011).
Study I Features: Pitch and volume
Intonation and volume contours were represented by log-pitch and vocal intensity (short-time acoustic energy) signals which were extracted per-word at turn-end using the Praat software (Boersma, 2001a). Pitch and volume contours were extracted only on turn-end words because intonation is most perceptually salient at phrase boundaries; in this work, we define the turn-end to be the end of a speaker utterance (even if interrupted). In particular, turn-end intonation can indicate pragmatics such as disambiguating interrogatives from imperatives (Cruttenden, 1997), and it can indicate affect since pitch variability is associated with vocal arousal (Busso et al., 2009, Juslin and Scherer, 2005). Turn-taking in interaction can lead to rather intricate prosodic display (Wells and Macfarlane, 1998). In this study we examine multiple parameters of prosodic turn-end dynamics which may shed some light on the functioning of communicative intent. Future work could view complex aspects of prosodic functions through more precise analyses.
In this work, several decisions were made that may affect the resulting pitch contour statistics. Turns are included even if they contain overlapped speech, as long as the speech is intelligible. Thus, overlapped speech presents a potential source of measurement error. However, no significant relation is found between
percentage overlap and ASD severity (Table 3.1), indicating this may not significantly affect results. Furthermore, an additional step was taken to create more robust extraction of pitch. Separate audio files were made that contained only speech from a single speaker (using transcribed turn boundaries); audio that was not from a target speaker's turns was replaced with Gaussian white noise. This was done in an effort to more accurately estimate pitch from the speaker of interest in accordance with Praat's pitch-extraction algorithm. Specifically, Praat uses a post-processing algorithm that finds the cheapest path between pitch samples, which could affect pitch tracking when speaker transitions are short.
We investigate the dynamics of this turn-end intonation since the most interesting social functions of prosody are achieved by relative dynamics. Further, static functionals such as mean pitch and vocal intensity may be influenced by various factors unrelated to any disorder. In particular, mean pitch is affected by age, gender, and height, whereas mean vocal intensity is dependent on the recording environment and a subject's physical positioning. Thus, to factor out variability across sessions and speakers, log-pitch and intensity were normalized by subtracting means per-speaker and per-session. Log-pitch is simply the logarithm of the pitch value estimated by Praat; log-pitch was evaluated rather than linear pitch because pitch is log-normally distributed, and log-pitch is more perceptually relevant (Sonmez et al., 1997). Pitch was extracted with the autocorrelation method in Praat within the range 75-600 Hz using standard settings apart from minor empirically-motivated adjustments (e.g., the octave jump cost was increased to prevent large frequency jumps).
In order to quantify dynamic prosody, a second-order polynomial representation of turn-end pitch and vocal intensity is calculated, which produces a curvature (2nd coefficient), slope (1st coefficient), and center (0th coefficient). Curvature measures rise-fall (negative) or fall-rise (positive) patterns; slope measures increasing (positive) or decreasing (negative) trends; and center roughly measures the signal level or mean. However, all three parameters are simultaneously optimized to reduce mean squared error, and thus are not exactly representative of their associated meaning. First, the time associated with an extracted feature contour is normalized to the range [-1,1] to adjust for word duration. An example parameterization is given in Figure 2.1 for the word "drives". The pitch has a rise-fall pattern (curvature = -0.11), a general negative slope (slope = -0.12), and a positive level (center = 0.28).

Figure 2.1: Second-order polynomial representation of the normalized log-pitch contour for the word "drives". Note: [curvature, slope, center] = [-0.11, -0.12, 0.28].
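As an illustrative sketch of this parameterization (not the exact extraction code; the function name and interface are hypothetical), the three coefficients can be obtained with a least-squares polynomial fit in Python:

    import numpy as np

    def prosody_poly_params(contour):
        # Fit a 2nd-order polynomial to a word-level contour (e.g., log-pitch)
        # over time normalized to [-1, 1]. np.polyfit returns coefficients from
        # highest to lowest order, i.e., curvature, slope, center; all three are
        # jointly optimized for least squared error, as noted above.
        t = np.linspace(-1.0, 1.0, num=len(contour))
        curvature, slope, center = np.polyfit(t, np.asarray(contour, float), deg=2)
        return curvature, slope, center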
Medians and inter-quartile ratios (IQR) of the word-level polynomial coefficients representing pitch and vocal intensity contours were computed, totaling 12 features (2 functionals x 3 coefficients x 2 contours). Median is a robust analogue of mean and IQR is a robust measure of variability; functionals that are robust to outliers are advantageous, given the increased potential for outliers in this automatic computational study.
Study I Features: Rate
Speaking rate was characterized as the median and IQR of the word-level syllabic speaking rate in an utterance (computed separately for the turn-end words) for a total of
four features. Separating turn-end rate from non-turn-end rate enables detection of potential affective or pragmatic cues exhibited at the end of an utterance (for example, the psychologist could prolong the last word in an utterance as part of a strategy to engage the child). Alternatively, if the speaker is interrupted, the turn-end speaking rate may appear to increase, implicitly capturing the interlocutor's behavior.
Study I Features: Voice quality
Perceptual depictions of odd voice quality have been reported in studies of children with autism, having a general effect on the listenability of the children's speech. For example, children with ASD have been observed to have hoarse, harsh, and hypernasal voice quality and resonance (Pronovost et al., 1966). However, inter-rater and intra-rater reliability of voice quality assessment can vary greatly (Gelfer, 1988). Thus, acoustic correlates of atypical voice quality may provide an objective measure that informs the child's ASD severity. Recently, Boucher et al. (2011) found that higher absolute jitter contributed to perceived "Overall Severity" of voice in spontaneous speech samples of children with ASD. In this study, voice
quality is captured by eight signal features: median and inter-quartile ratio (IQR)
of jitter, shimmer, cepstral peak prominence (CPP), and harmonics-to-noise ratio
(HNR).
Jitter and shimmer measure short-term variation in pitch period duration and
amplitude, respectively. Higher values for jitter and shimmer have been linked
to perceptions of breathiness, hoarseness, and roughness (McAllister et al., 1998).
Although speakers can hardly control jitter or shimmer voluntarily, it is possible that spontaneous changes in a speaker's internal state are indirectly responsible for such short-term perturbations of the frequency and amplitude characteristics of the
voice source activity. As reference, jitter and shimmer have been shown to capture
vocal expression of emotion, having demonstrable relations with emotional inten-
sity and type of feedback (Bachorowski and Owren, 1995), as well as stress (Li
et al., 2007). Additionally, while jitter and shimmer are typically only computed
on sustained vowels when assessing dysphonia, jitter and shimmer are often infor-
mative of human behavior (e.g., emotion) in automatic, computational studies
of spontaneous speech; this is evidenced by the fact that jitter and shimmer are
included in the popular speech processing toolkit, openSMILE (Eyben et al., 2010).
In this study, modified variants of jitter and shimmer are computed which do not rely on explicit identification of cycle boundaries. Equation 2.1 shows the standard calculation for relative, local jitter, where T is the pitch period sequence and N is the number of pitch periods; the calculation of shimmer is similar and corresponds to computing the average absolute difference in vocal intensity of consecutive periods. In our study, smoothed, longer-term measures are computed by taking pitch period and amplitude samples every 20 ms (with a 40 ms window); the pitch period at each location is computed from the pitch estimated using the autocorrelation method in Praat. Relative, local jitter and shimmer were calculated on vowels that occur anywhere in an utterance.
jitter_loc,rel = jitter_loc / meanPeriod,   jitter_loc = (1 / (N-1)) * Σ_{j=2}^{N} |T_j - T_{j-1}|,   meanPeriod = (1 / N) * Σ_{j=1}^{N} T_j    (2.1)
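For concreteness, Eq. 2.1 reduces to a few lines of Python once the pitch-period sequence (and, for shimmer, the per-period amplitude sequence) has been sampled as described above; the function names here are illustrative, not the exact implementation:

    import numpy as np

    def relative_local_jitter(periods):
        # Eq. 2.1: mean absolute difference of consecutive pitch periods,
        # normalized by the mean pitch period.
        T = np.asarray(periods, dtype=float)
        return np.abs(np.diff(T)).mean() / T.mean()

    def relative_local_shimmer(amplitudes):
        # Analogous computation on the per-period amplitude (intensity) sequence.
        A = np.asarray(amplitudes, dtype=float)
        return np.abs(np.diff(A)).mean() / A.mean()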
CPP and HNR are measures of signal periodicity (while jitter is a measure of signal aperiodicity) that have also been linked to perceptions of breathiness (Hillenbrand et al., 1994) and harshness (Halberstam, 2004). For sustained vowels, percent jitter can be as effective as CPP in measuring harshness (Halberstam, 2004); however, CPP was even more informative when utilized
on continuous speech. Heman-Ackah et al. (2003) found that CPP provided somewhat more robust measures of overall dysphonia than did jitter, when using a fixed-length windowing technique on read speech obtained at a six-inch mouth-to-mic distance. Since we are working with far-field (approximately two meters mouth-to-mic distance) audio recordings of spontaneous speech, voice quality measures may be less reliable. Thus, we incorporate all four descriptors of voice quality, totaling eight features. We calculated HNR (for 0-1500 Hz) and CPP using an implementation available in VoiceSauce (Shue et al., 2010); the original method is described in Hillenbrand et al. (1994) and Hillenbrand and Houde (1996). Average CPP was taken per vowel. Then, median and IQR (variability) of the vowel-level measures were computed per-speaker as features (as done with jitter and shimmer).
Study II Features: Speaking Rate and Syllabic Intonation
Speaking rate is calculated using forced-aligned syllable boundaries as the median over turns in #syllables/s. As in the first experiment (Bone et al., 2014a), we consider the segmental intonation contours of short lexical units. This technique may capture speaker idiosyncrasies in micro-prosodic production. We calculate syllable-level second-order polynomial parametrization of pitch, then calculate session-level medians and inter-quartile ratios of slope and curvature. The overall median pitch is also calculated, totaling 5 features.
Study II Features: Suprasegmental Intonation
We characterize individuals' intonation patterns in order to quantify perceptions
of either monotonic or exaggerated intonation. In particular, the macro-prosodic
movement of pitch is modeled using an automatic signal-derived method, Momel
(MOdeling MELody), which provides a phonetic representation of pitch intonation
patterns (Hirst et al., 2000). The algorithm produces a smooth curve that models
the macro-melodic movements of pitch, where deviations are attributed to micro-
prosodic movements related to segmental constraints. Taking raw fundamental
frequency as input, a set of target points for quadratic interpolation is output. We
compute the median absolute difference between Momel points to capture dynamic variability.
These target points can be further transformed into a symbolic representation
of fundamental frequency patterns, namely Intsint (International Transcription
System for Intonation (Hirst et al., 2000)). Intsint comprises a limited set of
abstract tonal symbols, grouped as absolute or relative. Absolute tones refer to a
speaker's overall pitch range, and are categorized as top, mid, or bottom (T, M, B).
Relative tones are determined relative to the previous tone, and are categorized as:
same (S) if less than a threshold from the previous target; non-iterative high/low
step (H/L); or iterative up/down step (U/D), which tend to be smaller than H
or L steps. An example intonation contour and corresponding Momel and Intsint
outputs is displayed in Figure 2.2.
We calculate the frequency of relative tone changes (H, U, S, D, L) as features, which may capture a speaker's tendency towards specific pitch dynamics. We implemented versions of the proposed Momel/Intsint algorithms (Hirst, 2007) in Matlab. Thresholds for small and large steps were set at 0.125 and 0.25 octaves from center. Analysis is done in the OME (Octave MEdian) scale (De Looze and Hirst, 2014), a log-pitch transformation as in Eq. 2.2 through which speakers tend to have the same pitch range of one octave.
OME = log2(f0_Hz) - log2(median(f0_Hz))    (2.2)
Figure 2.2: Example intonation contour (octaves vs. time, for the utterance "IF YOU DON'T BE NICE TO THEM THEY WILL STING YOU") plotted with Momel target points and the Intsint symbolic representation (T H L D H D S D U).
Since a speaker's range has been observed to reliably be one OME around center in
neutral speech, all speakers should have a comparable range regardless of median
pitch (unlike for Hz).
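A simplified sketch of the OME transformation (Eq. 2.2) and of the relative-tone labeling follows; the full Momel/Intsint procedure (Hirst, 2007) also estimates absolute tones and optimizes the coding iteratively, so this illustration, using the 0.125- and 0.25-octave thresholds given above and hypothetical function names, is only a minimal approximation:

    import numpy as np

    def ome(f0_hz):
        # Eq. 2.2: pitch in octaves relative to the speaker's median.
        f0 = np.asarray(f0_hz, dtype=float)
        return np.log2(f0) - np.log2(np.median(f0))

    def relative_tones(targets_ome, small=0.125, large=0.25):
        # Label successive Momel targets with relative Intsint tones:
        # Same (S), iterative Up/Down steps (U/D), non-iterative High/Low (H/L).
        labels = []
        for prev, cur in zip(targets_ome[:-1], targets_ome[1:]):
            step = cur - prev
            if abs(step) < small:
                labels.append("S")
            elif abs(step) < large:
                labels.append("U" if step > 0 else "D")
            else:
                labels.append("H" if step > 0 else "L")
        return labels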
Study II Features: Goodness of Pronunciation
Phonemic spectral distortions due to atypical, or immature, articulatory controls may be perceived as "atypical" speech production. As such, we utilize a measure of phonetic pronunciation quality, goodness of pronunciation (GOP; Witt and Young, 2000). GOP has been shown useful for other paralinguistic tasks such as nativeness detection (Black et al., 2015a). GOP uses acoustic models (AMs) trained on domain data to quantify pronunciation quality:
GOP = (1/N) * log( P(O | Transcript) / P(O | AM loop) )    (2.3)
where N is the number of frames and O are the acoustic features. The numerator is the likelihood of the acoustics given the transcription and the native AMs, which is equivalent to forced alignment. The denominator is typically estimated via automatic speech recognition with the same set of AMs and an "AM loop" grammar. Higher GOP scores indicate higher pronunciation quality. GOP computation is performed using Kaldi (Povey et al., 2011) with a slight modification: in our implementation of the denominator computation, we do not allow for transitions between phones within the forced-aligned phone boundaries.
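Once the two log-likelihoods are obtained from the recognizer (forced alignment for the numerator, the constrained phone-loop decode for the denominator), the score itself is a one-line computation; a minimal sketch, assuming those quantities are precomputed per phone segment:

    def goodness_of_pronunciation(loglik_forced, loglik_loop, num_frames):
        # Eq. 2.3: length-normalized log-likelihood ratio; higher values
        # indicate pronunciation closer to the native-trained acoustic models.
        return (loglik_forced - loglik_loop) / num_frames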
Study II Features: Prosodic Coordination
We suspected that individuals with ASD may coordinate prosodic modalities in unique ways. To assess this hypothesis, we introduce a feature which measures the simultaneous movements of pitch, duration, and intensity. In particular, for each syllable we compute the duration, median pitch, and median intensity. These features are concatenated per speaker, and then the Spearman's rank-correlation coefficient is calculated pairwise, producing three features.
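A minimal sketch of these coordination features, assuming the syllable-level streams have already been extracted for one speaker (scipy supplies the rank correlation):

    from scipy.stats import spearmanr

    def prosodic_coordination(durations, pitches, intensities):
        # Pairwise Spearman rank-correlations between syllable-level duration,
        # median pitch, and median intensity; three features per speaker.
        return (spearmanr(pitches, durations).correlation,
                spearmanr(pitches, intensities).correlation,
                spearmanr(durations, intensities).correlation)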
2.1.5 Statistical Analysis
Study I Statistical Analysis
Spearman's non-parametric correlation between continuous speech features and the discrete ADOS severity score was used to establish significance of relationships. Pearson's correlation was used when comparing two continuous variables. The statistical significance level is set at p < 0.05. However, we sometimes report p-values for the reader's consideration that did not meet this criterion, but nonetheless may represent trends that would be significant with a larger sample size (i.e., p < 0.10).
Additionally, underlying variables (psychologist identity, child age and gender, and signal-to-noise ratio) are often controlled by using partial correlation in an effort to affirm significant correlations. Signal-to-noise ratio (SNR) is a measure of the speech signal quality affected by recording conditions (such as background noise, vocal intensity, or recorder gain). SNR was calculated as the relative energy within utterance boundaries (per speaker), compared to the energy in regions exclusive of utterance boundaries for either speaker.
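A simplified sketch of this SNR computation, assuming a mono waveform and per-speaker (start, end) utterance times in seconds; for brevity it treats all audio outside the given boundaries as noise, whereas the actual computation excluded both speakers' utterances from the noise region:

    import numpy as np

    def utterance_snr_db(samples, sample_rate, utterances):
        # Energy within the speaker's utterance boundaries vs. energy
        # outside those boundaries, expressed in dB.
        samples = np.asarray(samples, dtype=float)
        in_speech = np.zeros(len(samples), dtype=bool)
        for start_s, end_s in utterances:
            in_speech[int(start_s * sample_rate):int(end_s * sample_rate)] = True
        speech_power = np.mean(samples[in_speech] ** 2)
        noise_power = np.mean(samples[~in_speech] ** 2)
        return 10.0 * np.log10(speech_power / noise_power)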
Study II Statistical Analysis
We conduct correlation analysis as in Study I using Spearman's rank-correlation coefficient as a metric.
2.1.6 Results, Study I: Correlation of Acoustic-Prosodic
Descriptors with ASD Severity
In this subsection, the pair-wise correlations between the 24 child and psychologist prosodic features and the rated ADOS severity are presented (see Table 2.2); we do not discuss psychologist features until Chapter 3. Positive correlations indicate that increasing descriptor values correspond to increasing symptom severity. If not stated otherwise, all reported correlations are still significant at the p < 0.05 significance level after controlling for the underlying variables: psychologist ID, age, gender, and signal-to-noise ratio (SNR).
The pitch features of intonation are examined first. The child's turn-end median pitch slope is negatively correlated with rated severity (r_s(26) = -0.68, p < 0.001); children with higher ADOS severity tend to have more negatively sloped pitch. Negative turn-end pitch slope is characteristic of statements, but also is related to
Table 2.2: Spearman's rank order correlation coefficients between acoustic-prosody descriptors and ADOS severity. Positive correlations indicate that increasing descriptor values occur with increasing severity. **, *, and † indicate statistical significance at the α=0.01, α=0.05, and α=0.10 (marginal) levels, respectively.

Category                    Descriptor            Child    Psychologist
Intonation-Pitch            Curvature Median     -0.53**      -0.12
                            Slope Median         -0.68**       0.30
                            Center Median        -0.12         0.26
                            Curvature IQR         0.22         0.09
                            Slope IQR            -0.03         0.43
                            Center IQR            0.02         0.48
Intonation-Vocal Intensity  Curvature Median     -0.09        -0.13
                            Slope Median         -0.31        -0.25
                            Center Median        -0.14         0.09
                            Curvature IQR        -0.05         0.10
                            Slope IQR             0.36†        0.33†
                            Center IQR            0.18         0.41
Speaking Rate               Non-Boundary Median  -0.00         0.19
                            Boundary Median       0.00        -0.04
                            Non-Boundary IQR      0.22        -0.05
                            Boundary IQR          0.33†       -0.03
Voice Quality               Jitter Median         0.38*        0.43
                            Shimmer Median        0.08         0.04
                            CPP Median           -0.03         0.39
                            HNR Median           -0.38*       -0.37
                            Jitter IQR            0.45*        0.48
                            Shimmer IQR          -0.12        -0.03
                            CPP IQR               0.12         0.67
                            HNR IQR               0.50**       0.58
other communicative functions such as turn-taking. Whether or not this acous-
tic feature may be associated with perceptions of monotonous speech is an area
for further research. The child's turn-end median pitch curvature shows similar
correlations and can also be a marker of statements.
Next, the vocal intensity features that describe intonation and volume were considered. The child's vocal intensity slope variability (IQR) did not reach a statistically significant positive correlation with ADOS severity (p = 0.06).

When examining speaking rate features, we observed qualitatively that some children with more severe symptoms spoke extremely fast, whereas others spoke extremely slowly. The heterogeneity is consistent with the finding of no correlation between either speaker's speaking rate features and the child's rated severity.
Regarding measures of voice quality, several congruent relations with ADOS severity exist. The child's median jitter is positively correlated with the child's rated severity of ASD at r_s(26) = 0.38 (p < 0.05), while the child's median HNR is negatively correlated at r_s(26) = -0.38 (p < 0.05); however, the child's median CPP is not significantly correlated (r_s(26) = 0.08, p = 0.67). As a reminder, jitter is a measure of pitch aperiodicity, while HNR and CPP are measures of signal periodicity, and thus jitter is expected to have the opposite relations as HNR and CPP. Additionally, there are medium-to-large correlations for the child's jitter and HNR variability (r_s(26) = 0.45, p < 0.05; and r_s(26) = 0.50, p < 0.01, respectively); all indicate that increased periodicity variability is found when the child has higher rated severity. These voice quality feature correlations remain after controlling for the listed underlying variables, including signal-to-noise ratio.
2.1.7 Results, Study II: Correlational Feature Analysis
Acoustic and turn-taking feature correlations for both child and psychologist are
provided in Table 2.3, although we do not discuss psychologist features until Chap-
ter 3. We concentrate on correlations with ASD severity, which is better explained
by our features than best-estimate diagnosis. This nding may stem from the
fact that ASD severity is calculated from the ADOS interaction data, whereas
Table 2.3: Correlations of features with ADOS severity and best-estimate diagnosis. * indicates p < 0.05; n.s. is non-significant.

                                                   Child                         Psychologist
Category        Feature                  Trend w/ severity  Sp. grp. diff.  Trend w/ severity  Sp. grp. diff.
Turn-taking &   speaking time (%)        less    0.15*                      n.s.    0.01
speaking rate   turn length (words)      shorter 0.17*                      n.s.    0.13
                intra-turn pause (s)     longer  0.18*           X          n.s.    0.12
                intra-turn pause (%)     n.s.    0.00                       more    0.17*          X
                latency (sec)            n.s.    0.11                       longer  0.24*
                silence (%)              more    0.15*                      more    0.15*
                speaking rate (syl/s)    slower  0.20*           X          n.s.    0.05
Segmental       f0 curve median          lower   0.18*                      n.s.    0.09
pitch cues      f0 slope median          n.s.    0.12                       n.s.    0.12
                f0 curve IQR             more    0.36*           X          n.s.    0.08
                f0 slope IQR             more    0.28*                      more    0.25*          X
                f0 median                higher  0.25*           X          higher  0.21*
Supra-          m.a.d. Momel             higher  0.17*           X          higher  0.25*          X
segmental       Intsint High Tone (%)    higher  0.19*                      higher  0.32*          X
Intonation      Intsint Same Tone (%)    lower   0.19*                      lower   0.24*          X
                Intsint Low Tone (%)     higher  0.19*                      higher  0.31*          X
Prosodic        corr. f0 & dur.          lower   0.25*           X          N/A
Coordination    corr. f0 & intensity     lower   0.17*                      N/A
                corr. dur. & intensity   n.s.    0.09                       N/A
Pronounce       GOP                      lower   0.20*                      N/A
best-estimate diagnosis draws from external factors that we cannot observe. All significant correlations with severity are still significant after controlling for demographic variables (i.e., age, gender, and NVIQ), except child turn length (p = 0.07) and child pitch curvature median (p = 0.11).

Segmental pitch cues show short-term variability in the use of fundamental frequency. Children with higher social-communicative deficits showed negative pitch curvature, which is possibly perceived as "flat" or "monotonic". Additionally, the child's short-term dynamic variability of fundamental frequency increases. Pitch variability may increase with enhanced affect. Also, in our data, children with higher ASD severity tend to speak slower on average.
Suprasegmental cues are essential for communicating intention and affect. While we cannot fully model prosody without knowing the semantic context of
an uttered phrase, we can look at global tendencies. Speakers with higher severity are shown to have more macro-prosodic variability in all four of the features that we examine. Specifically, the child has larger pitch movements (octaves) between successive Momel points. In the symbolic representation (Intsint), there are more high and low tones, and fewer same-level pitch movements. This finding expands on previous reports of higher pitch variability, which often did not use log-scales (pitch is log-normally distributed within speaker) and were simply global functionals on raw fundamental frequency, not providing insight into the dynamics. Note that the intermediate up-step and down-step tones are not displayed, to improve readability, since neither reached significance.
After listening to speaking samples, we suspected that individuals with "atypical" prosody were sometimes modulating a prosodic modality independently of other modalities. We quantified prosodic coordination as the pairwise coordination between three modalities: syllabic fundamental frequency, vocal intensity, and duration. Results support that children with higher ASD symptom severity coordinate their use of pitch with duration and vocal intensity to a lesser extent.

Lastly, we investigate pronunciation quality, motivated by the possibility that articulation distortions, which occur generally in those with language delays, may be perceived as atypical prosody. Results show that children with higher ADOS-ASD severity do tend to have a lower goodness of pronunciation. Whether, and to what degree, articulation distortions affect perceptions of atypical prosody is a topic of future research.
Prediction experiments are reserved for Chapter 3.
2.1.8 Discussion, Study I
Semi-automatic processing and quantification of acoustic-prosodic features of the speech of children with ASD were conducted, demonstrating the feasibility of this paradigm for speech analysis even in the challenging domain of spontaneous dyadic interactions and using far-field sensors. Moreover, some proposed features such as intonation dynamics are novel to the ASD domain, while vocal quality measurements (i.e., jitter) mirrored other preliminary findings. In Chapter 3, we will demonstrate that the speech characteristics of the child and the psychologist were significantly related to the severity of the child's autism symptoms.

As predicted, correlation analyses demonstrated significant relationships between acoustic-prosodic features of both partners and rated severity of autism symptoms. Continuous behavioral descriptors that co-vary with this dimensional rating of social-affective behavior may lead to better phenotypic characterizations that address the heterogeneity of ASD symptomatology. Severity of autistic symptoms was correlated with children's negative turn-end pitch slope, which is a marker of statements. The underlying reason for this relationship is currently uncertain and needs further investigation. The child's jitter median tended to increase while the HNR median decreased; jitter, HNR, and CPP variability also tended to increase in the children's speech with increasing ASD severity. Higher jitter, lower HNR, and lower CPP have been reported to occur with increased breathiness, hoarseness, and roughness (Halberstam, 2004, Hillenbrand et al., 1994, McAllister et al., 1998), while similar perceptions of atypical voice quality have been reported in children with ASD. For example, Pronovost et al. (1966) found speakers with high functioning autism to have hoarse, harsh, and hypernasal qualities. Hence, the less periodic values of jitter and HNR seen for children with higher autism severity
scores suggest the extracted measures are acoustic correlates of perceived atypical voice quality. The findings show promise for automatic methods of analysis, but it is uncertain which aspects of voice quality the jitter, HNR, and CPP measures may be capturing. Since the CPP measure was non-significant for the child while the jitter and HNR measures were significant, further, more controlled investigation of voice quality during interaction is desired in future studies. The results corroborate findings from another acoustic study (Boucher et al., 2011), which found that higher absolute jitter contributed to perceived "Overall Severity" in samples of spontaneous speech of children with ASD.
2.1.9 Conclusions, Study I
A framework was presented for objective, semi-automatic computation and quantification of prosody using speech signal features; such quantification may lead to robust prevalence estimates for various prosodic abnormalities and thus more specific phenotypic analyses in autism. Results indicated that the extracted speech features were informative. This preliminary study suggests that signal processing techniques have the potential to support researchers and clinicians with quantitative description of qualitative behavioral phenomena and to facilitate more precise stratification within this spectrum disorder.

Future research may investigate more specifically the relationship between prosody and overall ASD behavior impairments. Future research will also examine the prevalence of various prosodic abnormalities in children with a wider range of ASD severity and level of language functioning, using computational techniques explored in this study, but scaled to larger datasets. Dependencies of various prosodic abnormalities may also be examined, such as the effects of varying social and cognitive load throughout an interaction. Our recent preliminary work, which
incorporates ratings of social load on the child, further investigates conversational quality by incorporating turn-taking and language features while expanding the analysis to the entire ADOS session (Bone et al., 2013b). Greater understanding of the intricacies of atypical speech prosody can inform diagnosis and lead to more personalized intervention. In addition, examination of children's specific responses to varied speech characteristics in the interacting partner may lead to fine-tuned recommendations for intervention targets and evaluation of mechanisms of change in intervention.
2.1.10 Conclusions, Study II
We examined acoustic-prosodic and turn-taking features in interactions with individuals with neurodevelopmental disorders towards a better, evidence-based understanding. Five groups of features were considered: turn-taking and speaking rate, suprasegmental intonation, segmental pitch, prosodic coordination, and pronunciation quality. Unfortunately, voice quality features were excluded due to potential channel differences between sites. The most robust finding is that segmental and suprasegmental prosodic variability increases for children with higher ASD severity (or for ASD versus non-ASD disorders). Additionally, based on our proposed features, children with higher ASD severity showed lower coordination of pitch with other modalities.
2.2 Acoustic-Prosodic Correlates of "Awkward" Prosody in Story Retellings from Adolescents with Autism
In this chapter, we connect objective signal-derived descriptors of prosody to subjective perceptions of prosodic awkwardness, rather than solely autism diagnostic variables. Subjectively, more awkward speech is less expressive (more monotone) and more often has perceived awkward rate/rhythm, volume, and intonation. We also find that expressivity can be quantified through objective intonation variability features, and that speaking rate and rhythm cues are highly predictive of perceived awkwardness. Acoustic-prosodic features are also able to significantly differentiate subjects with ASD from typically developing (TD) subjects in a classification task, emphasizing the potential of automated methods for diagnostic efficiency and clarity.
2.2.1 Introduction
In this study, we ask naive human raters to assess various types of prosodic awkwardness, then link these perceptions to objective acoustic-prosodic measures. Raters score overall awkwardness, as well as awkwardness of individual components of prosody: rate/rhythm, volume, and intonation/stress. We expect that agreement will be highest at the cumulative level, that something sounds "odd." Vocal expressivity is also rated since ASD prosody is described as monotone or overly exaggerated, which may contribute to perceived awkwardness. Through this work, we aim to enhance our understanding of signal-derived speech prosody measures, which are vital to behavioral interaction analyses (Bone et al., 2014a) and the creation of automated clinical tools.
2.2.2 Methodology
In the following sections we discuss the data collection and participant demograph-
ics, perceptual rating scheme, acoustic-prosodic features, and machine learning
data analysis.
Data Collection and Participants
Data were recorded as part of an affective story retelling task. Participants initially viewed a stimulus video in which an actor, "Safari Bob", stated that he needed someone to fill in for him as host of a children's television show. Participants listened to a story told by Safari Bob, then retold it with the story text displayed on screen. We focus our preliminary analyses on one of the four stories, "Elephants", which contains five sentences. Of the possible 345 utterances for analysis (69x5), 322 are selected (μ = 4.7 s, σ = 1.7 s) after exclusion of poor audio quality and utterances that went far off-script. ASD and TD (typically developing) participant demographics are presented in Table 2.4, including: verbal IQ, performance IQ, receptive vocabulary (as measured by the Peabody Picture Vocabulary Test-Revised (Dunn and Dunn, 1981)), and reading level (as measured by the Woodcock Johnson Test (Woodcock et al., 2001)). We control for the statistically significant
Figure 2.3: Duration sample versus exemplar for one sentence, where "just" was missing from the sample production. Computation: ρ_S = 0.90; fraction of samples with a feature value = 14/15; score = 0.90 x (14/15) = 0.84.
Table 2.4: Participant demographics. `*' designates difference at the α = 0.05 level by Wilcoxon rank-sum test.

      N   Age (yr.)  Female     V-IQ  P-IQ  Rec. Vocab.  Reading
ASD   43  12.9       1 (2.3%)   105   102*  111*         105
TD    26  13.6       2 (7.7%)   112   113*  122*         108
group differences in demographics during later analysis. More details can be found in the primary paper on this database (Grossman et al., 2013).
Perceptual Ratings of Prosody
Our study is motivated by the general perceptions of awkwardness that occur when interacting with individuals with autism; naive raters of prosody are able to detect an overall quality of "awkwardness" in an ASD individual's speech (Grossman, 2014).
Each utterance is scored on N-point Likert scales by 15 naive raters on Amazon Mechanical Turk (MTurk). Raters could listen to the files multiple times while judging a speaker's overall awkwardness and other related constructs. Specifically, 'Awkwardness' is obtained through inversion of a 'Naturalness' (non-awkwardness) rating, which is on a 4-point scale from 'Very Awkward' to 'Natural (Not Awkward).' Also, raters mark the presence of awkwardness (binary) in three components of prosody: 'Rate/Rhythm', 'Volume', and 'Intonation/Stress'. Lastly, 'Expressivity' (animation) is rated on a 5-point scale from 'Extremely Flat or Monotone' to 'Overly Animated.'
Final ratings are obtained through averaging scores per utterance. Given the variable quality of raters from MTurk, we remove raters with very poor agreement with the initial mean (ρ_S < 0.2) and raters who evaluated fewer than 10 utterances. Additionally, awkward prosody component scores (binary) are z-normalized per-rater before fusion, which improves agreement.
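A minimal sketch of this screening-and-fusion step, assuming a raters-by-utterances score matrix with NaN for unrated items; in the actual procedure, the binary component codes would additionally be z-normalized per rater before the final averaging:

    import numpy as np
    from scipy.stats import spearmanr

    def screen_and_fuse(ratings, min_items=10, min_rho=0.2):
        # ratings: (n_raters x n_utterances) array, NaN where a rater skipped.
        mean0 = np.nanmean(ratings, axis=0)          # initial mean rating
        kept = []
        for row in ratings:
            rated = ~np.isnan(row)
            if rated.sum() < min_items:
                continue                             # too few evaluations
            rho = spearmanr(row[rated], mean0[rated]).correlation
            if rho >= min_rho:
                kept.append(row)                     # agrees with consensus
        return np.nanmean(np.vstack(kept), axis=0)   # fused score per utterance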
Acoustic-Prosodic Features
Prosodic atypicalities associated with autism have been reported in the domains
of intonation, volume, rate, and voice quality; as such, we compute a total of 37
features which target these qualitative constructs. A novelty of this work is that we
compare features that jointly model prosody as it occurs with exact lexical content
versus those that do not. Our utterance-level features that do not precisely model
the lexical content are grouped as rate & rhythm, voice quality, and intonation.
Features which model prosody jointly with lexical content include exemplar-based
intonation/stress and transcript-matching cues.
Speech rate & rhythm comprise 12 cues. Speech articulation rate is measured as the median and inter-quartile ratio (IQR) of individual syllable rates per-utterance; syllable boundaries are determined by forced-alignment using HTK. Speech rhythm, the temporal patterning of speech units, is quantified using Pairwise Variability Indices (PVI; Grabe and Low, 2002) and Global Interval Proportions (GIP; Hönig et al., 2011, Ramus, 2002). Pairwise variability indices measure the durational variability of adjacent linguistic units; we compute normalized and unnormalized PVI measures for consonants, vowels, and syllables. GIP features include the percentage of vowel speech and the standard deviations of vowel and consonant durations. We also compute the percentage of pausing within an utterance, a key facet of rhythm.
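For concreteness, the PVI computation over a sequence of unit durations (vowels, consonants, or syllables) can be sketched as follows, with the normalized variant dividing each pairwise difference by the local mean duration, following Grabe and Low (2002); the function name is illustrative:

    import numpy as np

    def pairwise_variability_index(durations, normalized=True):
        # rPVI: mean absolute difference of adjacent unit durations.
        # nPVI: each difference scaled by the mean of the pair (x100).
        d = np.asarray(durations, dtype=float)
        diffs = np.abs(np.diff(d))
        if normalized:
            return 100.0 * np.mean(diffs / ((d[:-1] + d[1:]) / 2.0))
        return np.mean(diffs)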
Voice quality is captured by six features: median and IQR of syllabic jit-
ter, shimmer, and harmonics-to-noise-ratio (HNR). Jitter and shimmer, the local
variability in pitch and intensity, respectively, are calculated using the method
described in Bone et al. (2014a), utilizing Praat (Boersma, 2001b). HNR is
extracted using VoiceSauce (Shue et al., 2011).
We model intonation through syllable-level parametrization of pitch and inten-
sity signals. We compute the slope and curvature of these signals per-syllable,
then calculate utterance-level median and IQR. Raw signal means and standard
deviations are also extracted, totaling 12 features. This technique may capture
speaker idiosyncrasies in intonation.
Using exemplar-based template features, we implicitly model intonation & stress as they occur jointly with the lexical content. These features have previously been used in children's read-prosody assessment, computed against an adult narration exemplar (Duong et al., 2011), and were previously proposed by Bone et al. for studying prosody in ASD (Bone et al., 2013a). Exemplar features model the evolution of a prosodic contour with spoken words compared to a reference; an example for duration is shown in Fig. 2.3. First, prosodic contours are extracted (pitch, intensity, and duration), which are then time-aligned with word boundaries. For each word and contour, a single feature functional is computed (we use median), producing a representation in which each word holds a single prosodic value. Next, we obtain a single exemplar for comparison by averaging the five productions with the best rating (e.g., least awkward). This is done per-rating in a leave-one-subject-out fashion; in the case of predicting ASD diagnosis, we use all TD subjects for deriving the exemplar. Lastly, we compute the Spearman's correlation (ρ_S) between an observed prosodic template and the exemplar, generating one feature per prosodic
Table 2.5: Top five features correlated (ρ_S) with perceptual ratings and ASD diagnosis (ASD=1, TD=0). Bold: p < 0.01; else p < 0.05.

Ranking   Expressivity       Overall Awkwardness   Awkward Rate/Rhy.    Awkward Volume         Awkward Intonation        Autism Spectrum
Feat. 1    0.43 f0 σ          0.47 pause %         -0.40 dur. model      0.28 pause %           0.42 PVI vowels          -0.40 dur. model
Feat. 2    0.42 jitter Mdn   -0.39 dur. model       0.39 pause %        -0.25 int. slope IQR    0.37 vowel dur. σ         0.32 pause %
Feat. 3    0.40 jitter IQR    0.36 PVI vowels       0.35 vowel dur. σ    0.24 HNR Mdn           0.35 int. slope IQR      -0.30 correct %
Feat. 4   -0.30 pause %      -0.36 correct %        0.32 PVI vowels      --                    -0.34 rate Mdn (syl/s)    -0.27 rate IQR (syl/s)
Feat. 5    0.29 int. σ        0.34 insertion %      0.31 PVI syllables   --                     0.33 int. curv. IQR       0.26 substituted %
Table 2.6: Spearman's inter-rater reliability (sig. at α = 0.05).

Code          Expr  Awk   R/R   Vol   Inton
Spearman's ρ  0.70  0.57  0.42  0.37  0.25
signal. Missing words or feature values are penalized; we scale ρ_S by the percentage of valid feature values.
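A minimal sketch of this scoring, assuming word-level prosodic templates with NaN marking missing words or feature values; it reproduces the Fig. 2.3 computation (ρ_S scaled by the fraction of valid values):

    import numpy as np
    from scipy.stats import spearmanr

    def exemplar_score(sample_template, exemplar_template):
        # One prosodic value per word; NaN where the word/feature is missing.
        sample = np.asarray(sample_template, dtype=float)
        exemplar = np.asarray(exemplar_template, dtype=float)
        valid = ~np.isnan(sample)
        rho = spearmanr(sample[valid], exemplar[valid]).correlation
        return rho * valid.mean()   # penalize missing values, as in Fig. 2.3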
Transcript-matching features relating observed and reference transcripts
include percentages of correct, inserted, deleted (Fig. 2.3), and substituted words
(computed via NIST SCTK).
Statistical Analysis and Machine Learning
Two types of analyses are conducted: correlation and prediction/classification. Support Vector Regression and logistic regression (Fan et al., 2008) are performed in a speaker-independent, sentence-independent manner to support generalization of results; parameters are tuned using two-level nested cross-validation. Perceptual ratings may vary for utterances from an individual speaker, but their ASD diagnosis is constant. Therefore, analysis between subjective ratings and acoustics is conducted for all utterances (treated individually); in contrast, we pool (average) samples when predicting autism diagnosis; pooling predictions across utterances also models realistic application of an automatic system. This topic is discussed further in Section 3.1.6.
2.2.3 Analysis of Perceptual Ratings
In this section we discuss the inter-rater reliability of different perceptual codes by our naive raters (Table 2.6), as well as correlations between perceptual ratings, demographic variables (receptive vocabulary, P-IQ, & age), and ASD diagnosis (Table 2.7). Table legend: Expr = expressivity; Awk = awkwardness; awkward R/R = rate/rhythm; Vol = volume; Inton = intonation/stress.
The naive raters achieve moderate or substantial agreement for overall awkwardness and expressivity (calculated as the median Spearman's correlation between each rater and the mean score of the other evaluators that rated those utterances). Understandably, there is much lower agreement for the more specific components of prosodic awkwardness, listed in descending order as follows: rate/rhythm, volume, and intonation/stress. Cumulative perceptions tend to have much higher agreement than more specific items; this is true for autism diagnostic instruments (Lord et al., 2000, 1994a) and for the PVSP prosody examination (Shriberg et al., 1992).
Correlations between different perceptual ratings can also inform the human perceptual process. Speakers that are perceived as more generally awkward are also heard as less expressive (ρ_S = -0.39), or more monotone. The global perception of awkwardness can be further decomposed; awkward speakers tend to have more awkward rate/rhythm (ρ_S = 0.80), awkward intonation (ρ_S = 0.50), and awkward volume (ρ_S = 0.35); appropriate timing, or rate/rhythm, is a critical factor in judging overall awkwardness for an utterance.
Next, we consider dependencies between perceptual codes and demographic variables. Subjects with higher receptive vocabulary are perceived as more expressive (ρ_S = 0.27) and generally less awkward (ρ_S = -0.39); this specifically includes the domains of awkward rate/rhythm (ρ_S = -0.37) and awkward intonation (ρ_S = -0.44). Very similar relations exist between performance IQ and perceptual ratings. Younger subjects tend to have a higher incidence of awkward volume (ρ_S = -0.29).
A primary goal of this study is to examine prosodic awkwardness in autism. ASD subjects' speech was perceived as different than control subjects' speech, even though the MTurk raters were blind to study purpose, demographic makeup, and task description. ASD speech was perceived as more awkward (ρ_S = 0.50), and had a
Table 2.7: Correlations between speaker-averaged ratings, demographics, and ASD diagnosis. Bolded implies sig. at α = 0.05.

        Awk    R/R    Vol    Inton  Vocab  P-IQ   Age    ASD
Expr   -0.39  -0.25   0.09   0.16   0.27   0.31   0.05  -0.06
Awk            0.80   0.35   0.50  -0.39  -0.36   0.12   0.50
R/R                   0.07   0.41  -0.37  -0.30  -0.14   0.48
Vol                         -0.05   0.07  -0.10  -0.29   0.32
Inton                              -0.44  -0.08   0.06   0.33
Vocab                                      0.40  -0.05  -0.22
P-IQ                                             -0.08  -0.27
Age                                                     -0.10
higher rate of awkward rate/rhythm (ρ_S = 0.48), volume (ρ_S = 0.32), and intonation (ρ_S = 0.33). All relations with ASD diagnosis remain significant (α = 0.05) after controlling for demographics (receptive vocab., P-IQ, & age).
2.2.4 Acoustic-Prosodic Cues of Awkwardness
Interpretable, objective signal measures can provide a bottom-up explanation for
these human perceptions of prosody, allowing scalable application to larger data.
In section 2.2.4, the most informative prosodic cues for each perceptual rating are
discussed, and in section 3.1.6, prosodic cues are used to predict perceptual ratings
and autism diagnosis.
Correlational Feature Analysis
The top five features related to each perceptual rating and ASD diagnosis are provided in Table 2.5. Since the value of a feature is dependent on various sources of noise in the feature extraction process, we cannot confidently state that a construct is uninformative of a target variable, only that the extracted feature is not.

The top cues for overall awkwardness relate primarily to timing; speech that is less awkward has less pausing, less local variability in vowel duration, and a higher correlation with the syllable-duration exemplars. Less awkward-sounding
Table 2.8: Regression and classification of perceptual ratings and ASD diagnosis via acoustic features and demographic variables. Bolded statistics are significant at the α = 0.05 level by one-sided tests. N_ratings = 322, N_Diag. = 69.

                              Perceptual Rating                Diagnosis
Features           Expr    Awk     RR      Vol     Into     ASD
Baseline: Demog.   0.16    0.32    0.25    0.22    0.20     63%
Rate/Rhythm        0.25    0.53    0.45    0.24    0.30     69%
Exemplar           0.15    0.41    0.40    0.13    0.02     56%
Voice Qual.        0.43    0.17    0.08    0.12   -0.06     46%
Trans. Match       0.00    0.23    0.24   -0.14    0.04     59%
Intonation         0.38    0.06    0.02    0.12    0.25     48%
Feature Fusion     0.55    0.56    0.47    0.21    0.36     65%
Agreement          0.70    0.57    0.42    0.37    0.25     N/A
metric             Spearman's correlation (ρ_S)             UAR
productions also adhere more to the transcript and insert fewer words. Since raters were not given the text, aberrations from the transcript likely should not directly factor into the ratings, but may have had other prosodic effects (e.g., increased pausing) which are more directly relevant to perceived awkwardness.
Although agreement on awkwardness in prosodic sub-components is lower, acoustic-prosodic cues can still provide insights into the human perceptual process. Perceived awkward rate/rhythm is captured by divergence from the normal relative word duration (exemplar features), increased pausing, and more variable syllabic duration. Awkward use of volume correlates with increased pausing, higher median HNR, and less variable syllabic-intensity slope, a possible marker of the flat, stilted expression seen in ASD. Awkward intonation/stress is best explained by slower articulation rate (syl/s), as well as more variable duration and syllabic-intensity dynamics.

Perceived expressivity is best captured by the variability of pitch and intensity contours. More expressive speech has more variable pitch and intensity, higher and more variable jitter, and less pausing, all indicators of higher vocal arousal (Bone
et al., 2014b). These findings support the use of dynamic-intonation variability measures to assess monotone intonation in ASD.
Acoustic-prosodic cues also serve as evidence of differences between speech from ASD and TD subjects. All presented features are at least marginally significant (α = 0.10 level) after controlling for receptive vocab., P-IQ, and age, unless otherwise specified. ASD productions tend to have lower correlations with durational exemplars trained (speaker-independently) on TD subjects' speech, objective evidence of differences in ASD subjects' use of duration relative to lexical content. Interestingly, ASD subjects had less variable articulation rate (syl/s), another potential correlate of 'monotone' production. Speech from ASD subjects also contained more pauses, and also matched the transcript less often, i.e., fewer correct words and more substitutions, although substitution % becomes non-significant after accounting for demographics.

Many of the most informative signal cues pertain to timing, rate, and rhythm, which is unsurprising since the related subjective code is a highly-explanatory factor for overall perceived awkwardness; for example, pause frequency is an invaluable cue in assessing awkwardness (as in other automatic speech assessment scenarios such as children's literacy (Black et al., 2011b)). Although the agreement on perceived awkward volume and intonation is relatively low, several intuitive signal relations emerged.
Predicting Perceptual Ratings
While the individual correlational analysis in the previous section can inform interpretation in human-human behavioral interaction analyses, automatic systems that support clinical researchers can rely on joint modeling of many features. In this section, we analyze the performance of different feature categories in predicting
ratings of prosody and autism diagnosis (Table 2.8); results of such experiments
can inform not only the acoustical dependence of our perceptions, but also those
cues which are associated with speech abnormalities of autism.
Rate & Rhythm features are significantly predictive of all perceptual ratings, producing the highest performances in nearly all experiments, with the sole exception being expressivity. The analysis of Section 2.2.4 showed timing features were excellent correlates of perceived awkwardness. In fact, Rate & Rhythm features alone meet or exceed the baseline results, and achieve performance on par with inter-rater agreement for all four awkward-prosody codes, e.g., for overall awkwardness ρ_S = 0.53 versus an agreement of ρ_S = 0.57.
Exemplar features, which measure dynamic differences in word-time prosodic feature streams compared to a baseline model (exemplar), produce significant prediction for overall awkwardness, awkward rate/rhythm, and awkward volume, albeit below that of Rate & Rhythm features. The other lexical-modeling features, Transcript Matching statistics, are able to account for a small amount of the variance associated with overall awkwardness and awkward rate/rhythm through prediction.
The remaining two feature sets excel at quantifying expressivity. The global and local variance in f0 and intensity captured by Intonation and Voice Quality features predict expressivity ratings moderately well (ρ_S = 0.38 and ρ_S = 0.43, respectively). When Rate & Rhythm and the other acoustic cues are included in the model, prediction improves to ρ_S = 0.55, still below inter-rater agreement (ρ_S = 0.70). Expressivity is the only code for which there is a large gain from fusion over Rate & Rhythm features alone, highlighting the importance of timing cues.
Lastly, we predict ASD diagnosis using the provided acoustic-prosodic cues.
Since autism diagnosis is an intricate procedure lasting hours and incorporating
various sources of information, we should not expect to achieve very high perfor-
mance from speech alone, much less from a few read utterances. Still, prediction
from acoustics allows us to observe the importance of a group of signal cues in
discovering differential patterns between groups (ASD and TD). Such findings
can eventually lead to improved automatic assessment and monitoring systems or
automatic prosodic tutor systems. Unweighted average recall (UAR) prediction
performances are presented for ASD diagnosis (50% UAR is chance); speaker- and
sentence-independent models are evaluated on each utterance, then predictions are
aggregated through majority voting per speaker.
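A minimal sketch of this evaluation scheme follows: UAR as the mean of per-class recalls, and per-speaker aggregation of utterance-level predictions by majority vote. The function and variable names are illustrative, not from the original implementation.

import numpy as np
from collections import defaultdict

def uar(y_true, y_pred):
    """Unweighted average recall: the mean of per-class recalls
    (50% is chance for a two-class problem)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

def majority_vote(utt_preds, utt_speakers):
    """Aggregate utterance-level predictions into one label per speaker."""
    votes = defaultdict(list)
    for spk, pred in zip(utt_speakers, utt_preds):
        votes[spk].append(pred)
    return {spk: max(set(ps), key=ps.count) for spk, ps in votes.items()}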
Rate & Rhythm features achieve the best performance in classifying ASD
(69%), likely due to their utility in quantifying awkward prosody, which we showed
in Section 2.2.3 to be associated with ASD diagnosis. Moreover, this is the only
individual feature group which produces significant classification UAR (measured
by a conservative one-sided binomial proportions test with N = 2·N_minority-class,
as described in Bone et al. (2014a)). After fusing all acoustic-prosodic features,
classifier performance drops (potentially due to insufficient data size) to 65%.
Demographic features (receptive vocab., P-IQ, and age) also achieve significant
prediction at 63%.
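The binomial significance test referenced above might be sketched as follows; the mapping from UAR to an implied success count, and the minority-class size used in the example, are assumptions for illustration.

from scipy.stats import binom

def uar_significance(uar_value, n_minority, chance=0.5):
    """One-sided binomial proportions test of UAR against chance,
    conservatively using N = 2 * N_minority_class trials (after Bone
    et al., 2014a)."""
    n = 2 * n_minority
    k = int(round(uar_value * n))      # implied number of correct decisions
    return binom.sf(k - 1, n, chance)  # P(X >= k) under Binomial(n, chance)

print(uar_significance(0.69, n_minority=20))  # hypothetical minority-class size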
2.2.5 Conclusion and Future Work
Speech cues are critical to finer characterization of autism spectrum disorder,
yet there has been little headway toward a generalizable operational definition
of prosodic atypicalities in ASD; e.g., prevalence estimates for various prosodic
abnormalities are still unknown. Fortunately, speech processing can provide scal-
able, objective measures to support scientific advances.
In this work, we link acoustic-prosodic cues to general perceptions of speech
prosody. Naive raters reach moderate to substantial agreement on cumulative
aspects of prosody, but have lower agreement about components of prosody, high-
lighting the difficulty of explicating a general assessment; objective cues offer
insights into that process. Rate & rhythm features are predictive of various awk-
wardness codes, producing correlations approaching inter-rater agreement; these
timing cues additionally differentiate ASD and TD groups. Exemplar features,
which jointly model prosody and lexical content, are also significantly informa-
tive of awkwardness. Lastly, dynamic intonation features can objectively quantify
perceived expressivity.
In the future, we will continue to investigate the relationship between per-
ception of prosody and acoustic cues. Acoustic-prosodic cues can provide novel
insights into dyadic interactions involving children with autism (Bone et al., 2014a).
Eventually, systems incorporating automatically extracted signal cues may be cre-
ated for enhanced diagnostics and behavioral monitoring as well as prosodic inter-
vention.
2.3 Outlook for the Quantification of Atypical Prosody
Speech prosody remains a critical research area in autism spectrum disorder for
which objective assessment can have true impact in characterizing and tracking
prosodic deficits. However, one of the primary reasons that speech prosody is
understudied in autism is the difficulty of modeling it during conversational speech,
owing to its variability and dynamic nature. Initial studies were limited to features
like the mean and standard deviation of pitch and intensity.
The following are suggestions for future research towards the goal of creating a
computational characterization of prosody in neurodevelopmental disorders.
Optimal data collection: Collected data should have high quality (for complex
feature extraction), high consistency, and ecological validity. Spontaneous speech is
much preferred over read speech due to its relevance to this social-communicative
disorder.
Maintaining Interpretability: A great appeal of engineering methods is that
complex models can be created that humans need not understand, e.g., deep learn-
ing. However, in this particular problem domain, the primary drawback is that
interpretability is largely abandoned. With a loss of interpretability, it is difficult
to track why a system is successful (which may be for dubious reasons (Bone
et al., 2013a)), and it is less clear whether the system will generalize to independent,
uniquely collected data (compared with, for example, knowledge-driven approaches
(Bone et al., 2014b)).
Selecting a Ground Truth: Supervised learning necessitates a ground truth.
However, atypical prosody is a construct that, while critically important, has no
reliable ground truth. Two choices are apparent: either ASD/non-ASD diagnosis
or human judgment. ASD behavior is highly variable; since not all children have
the same deficits, ASD diagnosis cannot be equated to "autistic" prosody. Alter-
natively, human judgment is unreliable, especially for untrained raters (Bone et al.,
2015b). Thus, it is our suggestion that future studies simultaneously analyze the
relevance of prosodic features against both ground truths. Moreover, it may be
necessary that the final objective measures are entirely bottom-up, derived from
and defined by signals. Such a rule-based approach would come with its own dif-
ficulties in generalization, but would be one solution to creating a fully objective
definition of atypical prosody.
Chapter 3
The Psychologist as an
Interlocutor in ASD Assessment
In this chapter, we investigate differences in the psychologist's behavior in relation
to interacting with children with varied levels of social-communicative ability. The
first section analyzes effects in prosody, turn-taking, and language, while the second
section looks at vocal arousal synchrony.
3.1 Prosody, Turn-taking, and Language of the
Psychologist during ADOS Interactions
The purpose of the following experiments is to examine the speech and language
activities of the psychologist, hypothesizing a mutually interactive relationship
between the psychologists' and the children's speech characteristics. The results
presented in this section build on the experiments of Section 2.1. Specifically,
there are three studies: Study I and Study II are presented in Section 2.1, and
Study III is an addition to Study I (also using the USC CARE Corpus) which
looks at language and turn-taking in the entire ADOS interactions as well as
the effects of Social Demand. Research hypotheses are tested via correlation and
hierarchical and predictive regression between ADOS severity and features (Studies
I, III), as well as classification and support vector regression (Study II). The results
demonstrate that automatically extracted features characterize dyadic interactions
between child and psychologist. (This work appears in Bone et al. (2014a), Bone
et al. (2013b), and Bone et al. (2016b).)
3.1.1 Introduction
Human social interaction necessitates that each participant continually perceive,
plan, and express multi-modal pragmatic and affective cues. Thus, a person's
ability to interact effectively may be compromised when there is an interruption
in any facet of this perception-production loop. Autism spectrum disorder (ASD)
is a developmental disorder defined clinically by impaired social reciprocity and
communication, jointly referred to as social affect (Gotham et al., 2007), as well as
restricted, repetitive behaviors and interests (APA, 2000).
In addition to increased understanding of the prosody of children with autism,
our study paradigm allows careful examination of speech features of the psychol-
ogist as a communicative partner interacting with the child. Synchronous interac-
tions between parents and children with ASD have been found to predict better
long-term outcomes (Siller and Sigman, 2002), and many intervention approaches
include an element of altering the adult's interactions with the child with ASD
to encourage engaged, synchronous interactions. For example, in the SCERTS
(Social Communication, Emotional Regulation, and Transactional Support) model,
parents and other communication partners are taught strategies to "attune affec-
tively and calibrate their emotional tone to that of the less able partner" (Prizant
et al., 2003, p. 308). Changes in affective communication and synchrony of
the caregiver or interventionist with the child are also elements utilized in Piv-
otal Response Training (e.g., (Vernon et al., 2012)), DIR/Floortime (e.g., (Wieder
and Greenspan, 2003)), and the Early Start Denver Model (Dawson et al., 2010).
The behavior of one person in a dyadic interaction generally depends intricately
on the other person's behavior, evidenced in the context provided by age, gen-
der, social status, and culture of the participants (Knapp et al., 2013), or the
behavioral synchrony that occurs naturally and spontaneously in human-human
interactions (Kimura and Daibo, 2006). Thus, we investigate the psychologist's
acoustic-prosodic cues in an effort to understand the degree to which their speech
behavior varies based on interaction with participants of varying social-affective
abilities.
A crucial aim of these studies is to incorporate analysis of the acoustic-prosodic
characteristics of a psychologist engaged in ADOS administration, rather than
focusing only on the child's speech. This transactional, dyadic focus provides an
opportunity to discern the adaptive behavior of the psychologist in the context
of eliciting desired responses from each child, and to examine possible prosodic
attunement between the two participants.
In addition to the prosody discussed in Study I, in the other studies we also incor-
porate turn-taking and language cues that may capture aspects of reciprocal social
communication in child-psychologist interaction. Turn-taking and language cues
provide an enhanced depiction of communication style and quality beyond prosodic
features. Researchers have initiated large-sample, computational study of language
and turn-taking in ASD. Studies have considered, for example: pauses, fillers, and
discourse markers (Heeman et al., 2010); semantic and pragmatic errors (Prud-
hommeaux et al., 2011); and language referencing internal states (Brown et al.,
2012).
In Study III (Bone et al., 2013b), we extend our study of speech and language
cues to the entire 30-60 minute ADOS session data, whereas, in Study I, we focused
only on two similar interview-style subtasks that allow us to examine expressive
prosody in a semi-structured context. Given the varying social demands across the
subtasks of ADOS, our analysis also considers their effect on the nature of the com-
munication cues. A complex interplay between the social, affective, and cognitive
demands has been hypothesized to influence some of the observed variability in
the communication patterns in ASD (Lord et al., 2000, Mesibov et al., 1994). The
investigation of spontaneous speech acoustic-prosodic, turn-taking, and language
cues of both child and psychologist during interactions of varying social demand
provides insights into dyadic interactions involving children with ASD.
The results of Study I (Bone et al., 2014a) show that as rated ASD severity
increased, both participants demonstrated effects for turn-end pitch slope, and
both spoke with atypical voice quality. The psychologist's acoustic cues predicted
the child's symptom severity better than the child's acoustic cues. The psycholo-
gist, acting as evaluator and interlocutor, is shown to adjust behavior in predictable
ways based on the child's social-communicative impairments. The results support
future study of speech prosody of both interaction partners during spontaneous
conversation, while making use of automatic computational methods that allow
for scalable analysis on much larger corpora. In Study II (Bone et al., 2016b), we
found results similar to Study I when predicting ASD severity using either the
child's or the psychologist's features. Like the child's, the psychologist's speech
prosody was more variable.
The results of Study III (Bone et al., 2013b) indicate that conversational qual-
ity degraded for children with higher ASD severity, as the child exhibited diffi-
culties conversing and the psychologist varied her speech and language strategies
to engage the child. When interacting with children with increasing ASD sever-
ity, the psychologist exhibited higher prosodic variability, increased pausing, more
speech, atypical voice quality, and less use of conventional conversational cues such
as assents and non-fluencies. Children with increasing ASD severity spoke less,
spoke slower, responded later, had more variable prosody, and used personal pro-
nouns, affect language, and fillers less often. We also investigated the predictive
power of features from interaction subtasks with varying social demands placed
on the child. We found that acoustic-prosodic and turn-taking features were more
predictive during higher social demand tasks, and that the most predictive features
vary with context of interaction. We also observed that psychologist language fea-
tures may be robust to the amount of speech in a subtask, showing significance
even when the child is participating in minimal-speech, low social-demand tasks.
3.1.2 Social Demand Rating System for Study III
The ADOS consists of 14 activities, or subtasks (e.g., joint-play, emotions inter-
view, telling a story), each with different social presses and, consequently, level
of social demand. Seven psychologists with ADOS research training rated, on a
1-to-N scale, each of the subtasks for Social Difficulty (N=5), Cognitive Difficulty
(N=5), Naturalness (N=3), and the Amount of Speech Required (N=3). Ratings
were z-normalized per rater and averaged.
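The per-rater normalization can be sketched as follows; the rating matrix shown is hypothetical.

import numpy as np

def average_z_ratings(ratings):
    """ratings: (n_raters, n_subtasks). Z-normalize each rater's row,
    then average across raters to get one score per subtask."""
    r = np.asarray(ratings, dtype=float)
    z = (r - r.mean(axis=1, keepdims=True)) / r.std(axis=1, keepdims=True)
    return z.mean(axis=0)

# Hypothetical 1-to-5 social-demand ratings: 3 raters x 5 subtasks.
print(average_z_ratings([[5, 3, 1, 4, 2],
                         [4, 3, 2, 5, 1],
                         [5, 2, 1, 5, 2]]))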
We concentrate our analysis on social demand (difficulty). The description
for this rating is, "How much does the activity require the child to interact
socially (verbally or non-verbally) with the psychologist?" Inter-rater agreement
for social demand was moderate, intra-class correlation = 0.54. The 14 subtasks
were separated into high (5), medium (6), and low (3) social demand based
on the distribution. Social demand was correlated with ratings for cognitive
demand (r_s(14) = 0.80), naturalness (r_s(14) = 0.63), and amount of speech
required (r_s(14) = 0.82).
3.1.3 Turn-taking and Language Features
Study I: Turn-taking Features
The style of interaction (e.g., who is the dominant speaker or the amount of over-
lap) may be indicative of the child's behavior. Thus, we extract four additional
proportion features that represent disjoint segments of the interaction: (1) the
fraction of the time the child speaks and the psychologist is silent, (2) the fraction
of the time the psychologist speaks and the child is silent, (3) the fraction of the
time that both participants speak (overlap), and (4) the fraction of the time that
neither participant speaks (silence). These features are only examined in an initial
statistical analysis.
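These four disjoint proportions can be computed from frame-level voice-activity masks, as in the minimal sketch below; the mask representation is an assumption about the original processing.

import numpy as np

def interaction_proportions(child_active, psych_active):
    """Four disjoint fractions of a session from boolean voice-activity
    masks on a common time grid: child-only speech, psychologist-only
    speech, overlapping speech, and mutual silence (they sum to 1)."""
    c = np.asarray(child_active, dtype=bool)
    p = np.asarray(psych_active, dtype=bool)
    n = float(len(c))
    return {"child_only": float(np.sum(c & ~p)) / n,
            "psych_only": float(np.sum(p & ~c)) / n,
            "overlap":    float(np.sum(c & p)) / n,
            "silence":    float(np.sum(~c & ~p)) / n}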
Study II: Turn-taking and Speaking Rate Features
We compute seven temporal descriptors of the social interaction based on our
previous work (Bone et al., 2012b). Six turn-taking features describe the conver-
sational style of each participant: speaking time (%), turn length (words), latency
(s), overall silence time (%), intra-turn pause length (s), and fraction intra-turn
pausing (%). Speaking rate is calculated using forced-aligned syllable boundaries
as #syllables/s. All features are calculated as the median over a session.
Study III: Acoustic-Prosodic, Turn-taking, and Language Features
We computed 16 prosodic descriptors of social interaction based on our previous
work (Bone et al., 2012b). The features relate to speech intonation, volume, rate,
and voice quality.
Intonation and volume contours (pitch and intensity) are extracted on turn-end
words using Praat (Boersma, 2001b). Log-pitch and intensity are mean-subtracted
per session, then a second-order polynomial parameterization is computed. Median
and inter-quartile ratio (IQR) of slope and curvature, as well as raw mean, are
included as features (10 features).
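A sketch of this parameterization follows; the normalized time axis, and treating the IQR functional as the 75th-minus-25th percentile spread, are both assumptions about the original implementation.

import numpy as np

def contour_poly_features(contour):
    """Second-order polynomial fit to a session-mean-subtracted pitch or
    intensity contour over a turn-end word: returns curvature, slope, mean."""
    y = np.asarray(contour, dtype=float)
    t = np.linspace(-1.0, 1.0, len(y))          # normalized time axis
    curvature, slope, _ = np.polyfit(t, y, 2)   # y ~ a*t^2 + b*t + c
    return curvature, slope, float(y.mean())

def median_and_iqr(values):
    """Session-level functionals over all turn-end words."""
    v = np.asarray(values, dtype=float)
    q75, q25 = np.percentile(v, [75, 25])
    return float(np.median(v)), float(q75 - q25)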
For the remaining features, median values over all words in each subtask are
calculated for robustness. Voice quality is computed through jitter, shimmer, and
harmonics-to-noise using Praat with a 40 ms window and 10 ms shift (3 features).
Our previous study found that jitter, a breathy/rough/hoarse voice quality corre-
late (McAllister et al., 1998), increased for both interlocutors when ASD severity
increased (Bone et al., 2012b). Speaking rate (SR) is separated into 3 features:
SR (#words-per-utterance/time-for-utterance), articulatory SR (syl/s), and intra-
turn silence duration.
Four turn-taking features describe the conversation style of each participant:
speech %, silence %, overlap %, and median latency. Overlap % is the proportion
of time a participant interrupts the interlocutor. Median latency is the time taken
to speak after the previous speaker ends their turn. Silence % is the same for both
speakers.
Language usage potentially related to ASD is quantified using the Linguistic
Inquiry and Word Count (LIWC) toolbox (Pennebaker et al., 2001). LIWC soft-
ware has previously been used to study language in ASD (Brown et al., 2012,
Chaspari et al., 2013). The features are: (1) words per sentence (WPS), a rough
approximation of mean length of utterance (MLU); (2) first-person, singular pro-
nouns (I, me, mine); (3-5) positive emotion, negative emotion, and affect (positive
or negative) language; (6-8) assents (OK, yes), non-fluencies (hm, umm), and fillers
(I mean, you know). Language features are percentages normalized by the total
number of words spoken (besides WPS).
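The percentage normalization can be sketched as below. The mini-lexicons are hypothetical stand-ins (the LIWC dictionaries themselves are proprietary and not reproduced here).

# Hypothetical mini-lexicons standing in for LIWC categories.
CATEGORIES = {
    "i_pronouns":   {"i", "me", "mine", "my"},
    "assents":      {"ok", "okay", "yes", "yeah"},
    "nonfluencies": {"hm", "hmm", "um", "umm", "uh"},
}

def category_percentages(transcript):
    """Percentage of total words that fall in each category."""
    words = transcript.lower().split()
    total = len(words)
    return {cat: 100.0 * sum(w in lexicon for w in words) / total
            for cat, lexicon in CATEGORIES.items()}

print(category_percentages("umm I think yes it is mine you know"))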
3.1.4 Statistical Analysis and Machine Learning
Correlation and Regression for Study I
Spearman's non-parametric correlation between continuous speech features and
the discrete ADOS severity score was used to establish significance of relationships.
Pearson's correlation was used when comparing two continuous variables. The sta-
tistical significance level is set at p < 0.05. However, we sometimes report p-values
for the reader's consideration that did not meet this criterion, but nonetheless may
represent trends that would be significant with a larger sample size (i.e., p < 0.10).
Additionally, underlying variables (psychologist identity, child age and gender, and
signal-to-noise ratio) are often controlled by using partial correlation in an effort
to affirm significant correlations. Signal-to-noise ratio (SNR) is a measure of the
speech signal quality affected by recording conditions (such as background noise,
vocal intensity, or recorder gain). SNR was calculated as the relative energy within
utterance boundaries (per speaker), compared to the energy in regions exclusive
of utterance boundaries for either speaker.
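A minimal sketch of that SNR computation follows; the dB scaling and the sample-level mask representation are assumptions about the original implementation.

import numpy as np

def utterance_snr_db(signal, speaker_mask, other_mask):
    """SNR as the energy of samples inside one speaker's utterance
    boundaries relative to the energy of samples outside both speakers'
    utterances."""
    x = np.asarray(signal, dtype=float)
    speech = x[speaker_mask]                    # inside this speaker's utterances
    noise = x[~(speaker_mask | other_mask)]     # outside either speaker's utterances
    return 10.0 * np.log10(np.mean(speech ** 2) / np.mean(noise ** 2))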
Stepwise regression was performed on the entire dataset in order to assess
explanatory power through adjusted R², as well as to examine selected features. Hier-
archical and predictive regressions were performed to compare the explanatory
power of the child and psychologist acoustic-prosodic features. Given the limited
sample size, stepwise feature selection is performed for all regressions. Parame-
ters for stepwise regression were fixed for the Stepwise Regression and Hierarchical
Regression sections (p_intro = 0.05 and p_remove = 0.10), and optimized for predictive
regression.
Predictive regression is completed with a cross-validation framework to assess
the model's explanatory power on an independent set of data; in particular, one
session is held out for prediction while the stepwise regression model is trained
on all other sessions. The process is repeated in order to obtain a prediction
for each session's severity rating. Then, the predicted severity ratings are corre-
lated with the true severity ratings. All models include for selection the underly-
ing variables (psychologist ID, age, gender, and signal-to-noise ratio), in order to
ensure no advantage is given to either feature set. Parameters of stepwise regres-
sion are optimized per cross-fold; p_intro is selected in the range [0.01, 0.19], with
p_remove = 2·p_intro.
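The held-out prediction loop might be sketched as follows. The forward selection shown is a simplified stand-in for full stepwise selection (it omits the removal step), and all names are illustrative.

import numpy as np
import statsmodels.api as sm
from scipy.stats import spearmanr

def forward_select(X, y, p_intro=0.05):
    """Greedy forward selection by entry p-value."""
    chosen, remaining = [], list(range(X.shape[1]))
    while remaining:
        pvals = {}
        for j in remaining:
            fit = sm.OLS(y, sm.add_constant(X[:, chosen + [j]])).fit()
            pvals[j] = fit.pvalues[-1]          # p-value of the candidate
        best = min(pvals, key=pvals.get)
        if pvals[best] >= p_intro:
            break
        chosen.append(best)
        remaining.remove(best)
    return chosen

def loso_severity_correlation(X, y, p_intro=0.05):
    """Leave-one-session-out predictions, then Spearman correlation of
    predicted vs. rated severity."""
    preds = np.zeros(len(y))
    for i in range(len(y)):
        tr = np.arange(len(y)) != i
        feats = forward_select(X[tr], y[tr], p_intro) or [0]  # fallback if none chosen
        fit = sm.OLS(y[tr], sm.add_constant(X[tr][:, feats])).fit()
        preds[i] = float(fit.predict(np.r_[1.0, X[i, feats]])[0])
    return spearmanr(preds, y).correlation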
Correlation, Regression, and Classification for Study II
We conduct correlation analysis, as well as classification (support vector machine)
and regression (support vector regression, SVR) via Liblinear software (Fan et al.,
2008). Parameters are tuned using two-level nested cross-validation, and average
statistics of ten runs of CV are reported. Spearman's rank-correlation coefficient
and unweighted average recall (UAR, the mean of per-class recall) are selected as
evaluation metrics.
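A nested cross-validation sketch for the regression case, using scikit-learn's LinearSVR (which wraps the same Liblinear library) as a stand-in for the original setup:

import numpy as np
from scipy.stats import spearmanr
from sklearn.model_selection import GridSearchCV, LeaveOneOut, cross_val_predict
from sklearn.svm import LinearSVR

def nested_cv_severity(X, y, Cs=(0.01, 0.1, 1.0, 10.0)):
    """Two-level nested CV: an inner grid search tunes the SVR cost C,
    while the outer leave-one-out loop yields held-out predictions."""
    inner = GridSearchCV(LinearSVR(max_iter=10000), {"C": list(Cs)}, cv=5)
    preds = cross_val_predict(inner, X, y, cv=LeaveOneOut())
    return spearmanr(preds, y).correlation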
Correlation and Regression for Study III
Spearman's rank-correlation and predictive regression are used as in Study I.
3.1.5 Results and Discussion, Study I: Correlation and
Predictive Analysis of ASD Severity from Acoustic-
Prosody
Relationship between Normalized Speaking Times and Symptom Severity
Figure 3.1 illustrates the proportion of time spent talking by each participant, as
well as periods of silence and overlapping speech. Correlations between duration
of speech and ADOS severity are presented in Table 3.1. The percentage of child
speech (audible or inaudible due to background noise) during this sub-sample of
the ADOS is not significantly correlated with ASD severity (r_s(26) = -0.37,
p = 0.06). The percentage of psychologist speech is significantly correlated with
ASD severity (r_s(26) = 0.40, p = 0.03). No relationship was found for percentage
overlap or percentage silence. The data suggest a pattern in speech in which the
more the psychologist speaks, the more severe the rated ASD symptoms.
Figure 3.1: Proportions of conversation containing psychologist and/or child
speech (psych speech, child speech, overlap, silence). Sessions are ordered and
labeled by ADOS severity.
Table 3.1: Spearman's ρ between durational descriptors and ADOS severity and
atypical prosody ratings. * and † indicate statistical significance at the α=0.05
and α=0.10 (marginal) levels, respectively.

                 % Child Speech   % Psych Speech   % Overlap   % Silence
ADOS Severity        -0.37†            0.40*          -0.17        0.15
Child-Psychologist Coordination of Prosody
Certain prosodic features may co-vary between participants, suggesting that one
speaker's vocal behavior is influenced by the other speaker's vocal behavior, or vice-
versa. The strongest correlation between participants is seen for median slope of
vocal intensity (r_p(26) = 0.64, p < 0.01), as illustrated in Figure 3.2. This correlation
is still significant at the p < 0.01 level after controlling for psychologist identity
and signal-to-noise ratio (SNR), presumably the most likely confounding factors.
Coordination of median jitter was not significant (p = 0.24), while coordination of
median harmonics-to-noise ratio (HNR) was significant (r_p(26) = 0.71, p < 0.001), as
displayed in Figure 3.3. Median jitter and HNR capture aspects of voice quality and
can be altered unconsciously to some degree, although they are speaker-dependent.
After controlling for psychologist ID and SNR, significance at the 0.05 level is
reached for median jitter (r_p(26) = 0.47, p = 0.02), as shown in Figure 3.4, and still
exists for median HNR (r_p(26) = 0.70, p < 0.001).
Figure 3.2: Coordination of vocal intensity slope median between child and psy-
chologist.
Figure 3.3: Coordination of median HNR between child and psychologist.
Figure 3.4: Coordination of median jitter between child and psychologist after
controlling for psychologist ID and signal-to-noise ratio.
Two other features show significant coordination between speakers: the pitch
center inter-quartile ratios (IQR) and the cepstral peak prominence (CPP) medi-
ans. But these relations are non-significant when controlling for psychologist ID
and SNR, and thus are disregarded.
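The control for confounds used here is a partial correlation, which can be sketched as correlating the residuals after regressing the confounds out of both variables; names are illustrative.

import numpy as np
from scipy.stats import pearsonr

def partial_corr(x, y, confounds):
    """Pearson correlation between x and y after regressing out the
    confounds (e.g., psychologist-ID dummies and SNR) from both."""
    Z = np.column_stack([np.ones(len(x)), confounds])
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return pearsonr(rx, ry)   # (r, p-value)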
Relationship between Acoustic-Prosodic Descriptors and ASD Severity
In this subsection, the pair-wise correlations between the 24 psychologist prosodic
features and the rated ADOS severity are presented (see Table 2.2). Positive corre-
lations indicate that increasing descriptor values correspond to increasing symptom
severity.
The pitch features of intonation are examined first. The psychologist's pitch
center variability (inter-quartile ratio, IQR) is positively correlated with rated
severity (r_s(26) = 0.48, p < 0.01), as is the psychologist's pitch slope variability
(r_s(26) = 0.43, p < 0.05); the psychologist tends to have more varied pitch center
and pitch slope when interacting with a child with more atypical behavior. How-
ever, psychologist pitch center and slope variability correlations are non-significant
(p = 0.08 and p = 0.07, respectively) after controlling for underlying variables; there-
fore, these results should be interpreted cautiously.
Next, the vocal intensity features that describe intonation and volume were con-
sidered. The psychologist's vocal intensity center variability (IQR) is positively cor-
related with rated severity (r_s(26) = 0.41, p = 0.03). When interacting with a more
atypical child, the psychologist tends to vary her volume level more. The psychol-
ogist's vocal intensity slope variability (IQR) did not reach statistically significant
positive correlation with ADOS severity (p = 0.09 and p = 0.06, respectively).
Regarding measures of voice quality, several congruent relations with ADOS
severity exist. As a reminder, jitter is a measure of pitch aperiodicity, while HNR
and CPP are measures of signal periodicity; thus jitter is expected to have the
opposite relations as HNR and CPP. Similar to the child's features, the psycholo-
gist's median jitter (r_s(26) = 0.43, p < 0.05), median HNR (r_s(26) = -0.37, p < 0.05),
and median CPP (r_s(26) = -0.39, p < 0.05) all indicate lower periodicity for increas-
ing ASD severity of the child. Additionally, there are medium-to-large correlations
for the psychologist's jitter (r_s(26) = 0.48, p < 0.01), CPP (r_s(26) = 0.67, p < 0.001),
and HNR (r_s(26) = 0.58, p < 0.01) variability; all indicate that increased periodicity
variability is found when the child has higher rated severity. All of these voice qual-
ity feature-correlations exist after controlling for the listed underlying variables,
including signal-to-noise ratio.
Stepwise regression
Stepwise multiple linear regression was performed using all child and psychologist
acoustic-prosodic features, as well as the underlying variables (psychologist ID,
age, gender, and signal-to-noise ratio), to predict ADOS severity (Table 3.2). The
stepwise regression chose four features, three from the psychologist and one from
the child. Three of these features were among those most correlated with ASD
severity, indicating that the features contain orthogonal information. A child's
negative pitch slope, and a psychologist's cepstral peak prominence variability,
vocal intensity center variability, and pitch center median all are indicative of a
higher severity rating for the child according to the regression model. None of the
underlying variables were chosen over the acoustic-prosodic features.
Table 3.2: Stepwise regression with prosodic features and underlying variables.

                                     Cumulative Model Statistics           Final β's
Step  Added Feature                  R     R²    Adj. R²  F     F Sig.     Stnd. β  Sig.
1     Psych CPP IQR                  0.71  0.50  0.48     26.3  p<0.001    0.50     p<0.001
2     Child Pitch Slope Median       0.81  0.65  0.62     23.3  p<0.001    -0.33    p<0.01
3     Psych Vocal Int. Center IQR    0.85  0.72  0.68     20.3  p<0.001    0.34     p<0.01
4     Psych Pitch Center Median      0.88  0.78  0.74     20.4  p<0.001    0.26     p=0.02
Hierarchical regression
In this subsection, we optimize a model for the child's or psychologist's features
first, then observe whether orthogonal information is present in the other par-
ticipant's features or the underlying variables (Table 3.3); included underlying
variables are psychologist ID, age, gender, and signal-to-noise ratio.
The same four features selected in the Stepwise Regression experiment are
included in the child-first model, the only difference being that the child's pitch slope
median is selected before the psychologist's CPP variability in this case. The
child-first model only selected one child feature, child pitch slope median, and
reached an adjusted R-square of 0.43. Yet, further improvements in modeling were
found (adjusted R-square = 0.74) after selecting three additional psychologist features: 1)
CPP variability, 2) vocal intensity center variability, and 3) pitch center median.
A negative pitch slope for the child suggests flatter intonation, while the selected
psychologist features may capture increased variability in voice quality and into-
nation.
The other hierarchical model first selects from psychologist features, then con-
siders adding child and underlying features. That model, however, found no sig-
nificant explanatory power was available in the child or underlying features, with
the psychologist's features contributing to an adjusted R-square of 0.78. In partic-
ular, the model consists of four psychologist features: 1) CPP variability, 2) HNR
variability, 3) jitter variability, and 4) vocal intensity center variability. These fea-
tures largely suggest that increased variability in the psychologist's voice quality
is indicative of higher ASD severity for the child.
Table 3.3: Hierarchical stepwise regression in order: i) child's (psychologist's)
prosody; ii) psychologist's (child's) prosody and underlying variables.

Child Prosody, then Psych Prosody and Underlying Variables
                                     Cumulative Model Statistics           Final β's
Step  Added Feature                  R     R²    Adj. R²  F     F Sig.     Stnd. β  Sig.
1     Child Pitch Slope Median       0.67  0.45  0.43     21.6  p<0.001    -0.33    p<0.01
2     Psych CPP IQR                  0.81  0.65  0.62     23.3  p<0.001    0.50     p<0.001
3     Psych Vocal Int. Center IQR    0.85  0.72  0.68     20.3  p<0.001    0.37     p<0.01
4     Psych Pitch Slope Median       0.88  0.78  0.74     20.4  p<0.001    0.27     p=0.02

Psych Prosody, then Child Prosody and Underlying Variables
                                     Cumulative Model Statistics           Final β's
Step  Added Feature                  R     R²    Adj. R²  F     F Sig.     Stnd. β  Sig.
1     Psych CPP IQR                  0.71  0.50  0.48     26.3  p<0.001    0.48     p<0.001
2     Psych HNR IQR                  0.79  0.63  0.60     21.2  p<0.001    0.32     p<0.01
3     Psych Jitter IQR               0.84  0.71  0.69     19.2  p<0.001    0.35     p<0.01
4     Psych Vocal Int. Center IQR    0.90  0.81  0.78     24.3  p<0.001    0.33     p<0.01
Predictive regression
The results shown in Table 3.4 indicate significant prediction of ADOS sever-
ity from acoustic-prosodic features. The psychologist's prosodic features provided
higher correlation than the child's prosodic features, r_s,psych(26) = 0.79 (p < 0.001)
compared to r_s,child(26) = 0.64 (p < 0.001), although the difference between correla-
tions is not significant. Additionally, no improvement is observed when including
the child's features for regression: r_s,psych&child(26) = 0.67 (p < 0.001).
Table 3.4: Spearman's rank order correlation between predicted severity based on
acoustic-prosodic descriptors and actual, rated ADOS severity.
Note: ** p<0.001, * p<0.01.

Descriptors Included   Child Prosody   Psych Prosody   Child and Psych Prosody   Underlying Variables
ρ_S                    0.64**          0.79**          0.67**                    -0.14
Discussion of Study I Results
The contributions of this work are three-fold. First, semi-automatic processing and
quantification of acoustic-prosodic features of the speech of children with ASD was
conducted, demonstrating the feasibility of this paradigm for speech analysis even
in the challenging domain of spontaneous dyadic interactions and using far-field
sensors. Second, the unique approach of analyzing the psychologist's speech in
addition to the child's speech during their interaction provided novel information
about the predictive importance of the psychologist as an interlocutor in charac-
terizing a child's autistic symptoms. Third, as predicted, speech characteristics of
both the child and the psychologist were significantly related to the severity of the
child's autism symptoms. Moreover, some proposed features such as intonation
dynamics are novel to the ASD domain, while vocal quality measurements (i.e.,
jitter) mirrored other preliminary findings.
Examination of speaking duration indicated that the percentage of time the
psychologist spoke in conversation was informative; in interactions with children
with more severe autism symptoms, the psychologist spoke more and the child
spoke non-significantly less (p = 0.06). This finding may suggest that the child with
more severe ASD has difficulty conversing about the emotional and social content
of the interview, and thus the psychologist is attempting different strategies, ques-
tions, or comments to try to draw the child out and elicit more verbal responses.
Similar findings about relative speaking duration have been reported in previ-
ous observational studies of the interactions of adults and children or adolescents
with autism (García-Pérez et al., 2007, Jones and Schwartz, 2009). In addition,
some coordination between acoustic-prosodic features of the child and psycholo-
gist was shown for vocal intensity level variability, median HNR, and median jitter
(only after controlling for underlying variables); this gives evidence of the inter-
dependence of participants' behaviors. Vocal intensity is a significant contributor
to perceived intonation, and HNR & jitter are related to aspects of atypical vocal
quality. These findings suggest that the psychologist tended to match her volume
variability and voice quality to the child's during the interaction.
As predicted, correlation analyses demonstrated significant relationships
between acoustic-prosodic features of both partners and rated severity of autism
symptoms. Continuous behavioral descriptors that co-vary with this dimensional
rating of social-affective behavior may lead to better phenotypic characterizations
that address the heterogeneity of ASD symptomatology. Severity of autistic symp-
toms was correlated with children's negative turn-end pitch slope, which is a marker
of statements. The underlying reason for this relationship is currently uncertain
and needs further investigation. The child's jitter median tended to increase while
HNR median decreased; jitter, HNR, and CPP variability also tended to increase
in the children's speech with increasing ASD severity. Higher jitter, lower HNR,
and lower CPP have been reported to occur with increased breathiness, hoarse-
ness, and roughness (Halberstam, 2004, Hillenbrand et al., 1994, McAllister et al.,
1998), while similar perceptions of atypical voice quality have been reported in
children with ASD. For example, Pronovost et al. (1966) found speakers with high-
functioning autism to have hoarse, harsh, and hypernasal qualities. Hence, the
less periodic values of jitter and HNR seen for children with higher autism severity
scores suggest the extracted measures are acoustic correlates of perceived atypical
voice quality. The findings show promise for automatic methods of analysis, but it
is uncertain which aspect of voice quality jitter, HNR, and CPP may be capturing.
Since the CPP measure was non-significant for the child while the jitter and HNR
measures were significant, further, more controlled investigation of voice quality
during interaction is desired in future studies. The results corroborate findings
from another acoustic study (Boucher et al., 2011), which found that higher abso-
lute jitter contributed to perceived "Overall Severity" in samples of spontaneous
speech of children with ASD.
Examination of the psychologist's speech features revealed that when interact-
ing with a more atypical child, the psychologists tended to vary their volume level
and their pitch dynamics (slope and center) more. This variability may reflect
the psychologist's attempts to engage the child by adding affect to their speech,
since increased pitch variability is associated with increased arousal (Juslin and
Scherer, 2005). However, the pitch dynamic variability was non-significant (p = 0.08
and p = 0.07) after controlling for underlying variables, so this result should be
interpreted with caution. It is also important to note that the data show clearly
that there are certain relations that are very significant, and others that should be
further investigated with a more powerful clinical sample. Additionally, psychol-
ogist speech showed increased aperiodicity (captured by median jitter, CPP, and
HNR) when interacting with children with higher autism severity ratings. This
increased aperiodicity when interacting with more atypical children, together with
the coordination observed between the two participants' median HNR as well as
their median jitter after controlling for underlying variables, suggests that the
psychologist may be altering her voice quality to match that of the child. Fur-
thermore, the psychologist's periodicity variability (captured by jitter, CPP, and
HNR variability) increased, like the child's, for higher severity of autistic symp-
toms. Findings regarding voice quality are stronger for having considered several
alternative measures.
Lastly, this study represents one of the first collections of empirical results that
demonstrate the significance of the psychologist's behavior in relation to the sever-
ity of the child's autism symptoms. In particular, three regression studies were
conducted in this regard: stepwise, hierarchical-stepwise, and predictive-stepwise.
Stepwise regression with selection from all child and psychologist acoustic-prosodic
features and underlying variables demonstrated that both psychologist and child
features had explanatory power for autism severity. Hierarchical-stepwise regres-
sion showed that, independently, both the child's and psychologist's acoustic-
prosodic features were informative. However, evidence suggests that the psy-
chologist's features were more explanatory than the child's; higher R-square was
observed when selecting from the psychologist's features compared to the child's
features, and no child feature was selected after choosing from psychologist features
first. Finally, the predictive value of each feature set was evaluated. The psychol-
ogist's features were more predictive of autism severity than the child's features;
while this difference was non-significant, the findings indicate that the psycholo-
gist's behavior carries valuable information about these dyadic interactions. Fur-
thermore, the addition of the child's features to the psychologist's features did not
improve prediction accuracy.
3.1.6 Results and Discussion, Study II: Correlation,
Prediction, and Classification based on Dynamic
Prosody
Correlational Feature Analysis
Acoustic and turn-taking feature correlations for both child and psychologist are
provided in Table 2.3.
Turn-taking cues provide an overarching depiction of interaction quality. In
our data, children with higher ASD severity tend to speak less, in shorter phrases,
and with longer pauses; additionally, they speak slower on average. These findings
mirror our previous findings in a smaller dataset (Bone et al., 2013b). Since the
psychologist is not only the evaluator, but also a participant in the interaction, we
can observe effects in their behavior according to their participant's social cues.
In particular, the psychologist tends to pause more within a turn and wait longer
to start a turn (more latency). Overall, there is also more silence. In sum, this
suggests that the psychologist may be unsure of when the child will start and end
turns, or that the child may be unresponsive.
Segmental pitch cues show short-term variability in the use of fundamental fre-
quency. For both the child and the psychologist, short-term dynamic variability
of fundamental frequency increases. Pitch variability may increase with enhanced
affect, such as the psychologist trying to engage the child, or in response to an
interlocutor's behavioral patterns.
Suprasegmental cues are essential for communicating intention and affect.
While we cannot fully model prosody without knowing the semantic context of
an uttered phrase, we can look at global tendencies. Likewise, for speakers with
higher severity, the psychologists are shown to have more macro-prosodic vari-
ability in all four of the features that we examine. Specifically, both participants
have larger pitch movements (octaves) between successive Momel points. In sym-
bolic representation (INTSINT), there are more high and low tones, and fewer
same-level pitch movements. This finding expands on previous reports of higher pitch
variability, which often did not use log-scales (pitch is log-normally distributed
within speaker) and were simply global functionals on raw fundamental frequency,
not providing insight into the dynamics. Note that the intermediate up-step and
down-step tones are not displayed, to improve readability, since neither reached
significance.
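The octave-scaled pitch movements mentioned above reduce to base-2 log ratios between successive target frequencies, as in the minimal sketch below; the target values are hypothetical, and no claim is made about the Momel estimation itself.

import numpy as np

def octave_intervals(target_hz):
    """Pitch movement in octaves between successive target points
    (e.g., Momel anchors): the base-2 log of the frequency ratio."""
    f = np.asarray(target_hz, dtype=float)
    return np.diff(np.log2(f))

# Hypothetical Momel targets (Hz): a one-octave rise, then a fall.
print(octave_intervals([110.0, 220.0, 164.8]))   # ~[+1.00, -0.42]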
Prediction Experiments
Correlational analysis informs interpretation in behavioral interactions, but com-
putational systems that support clinical researchers in behavior tracking and inter-
vention can rely on joint modeling of many features. In this section, we analyze
the performance of different feature categories in predicting ASD severity and
best-estimate diagnosis (Table 3.5).
Table 3.5: Regression and classification of ASD severity and best-estimate diagno-
sis via acoustic-prosodic and turn-taking features. Bolded statistics are significant
at the α=0.05 level. Metrics: Spearman's ρ_S (severity), UAR (diagnosis).

                         ASD Severity (ρ_S)          Diagnosis (UAR)
Features                 Child   Psych.   C.&P.      C.&P.
Baseline: Demographics   -0.02   N/A      -0.02      52%
Turn-taking               0.17   0.17      0.18      58%
Segmental                 0.28   0.19      0.30      57%
Suprasegmental            0.08   0.25      0.22      56%
Prosodic Coord.           0.17   N/A       0.17      56%
Pronunciation             0.17   N/A       0.17      52%
Feature Fusion            0.31   0.31      0.35      59%
We initially examine the predictive power of the baseline demographic features:
age, gender, and non-verbal IQ. If these features are predictive on their own, it is
possible that our speech cues are directly predictive of demographics (e.g., IQ or
age), rather than ASD-related social behavior. We find that the demographic
features are not significantly predictive of ASD severity or diagnosis, and thus
conclude that our features are capturing ASD-specific behavioral patterns.
All feature groups are significantly predictive of ASD severity, and all but
goodness of pronunciation significantly classify ASD from non-ASD interactions
based on both the child's and the psychologist's features. Pronunciation quality
may more directly affect or represent social functioning, since there is no diagnostic
relevance.
The top individual feature groups for predicting severity are: the child's seg-
mental pitch features; the psychologist's suprasegmental intonation features; and
the child's and psychologist's combined segmental intonation features. Based on
the analysis of Sections 2.2.4 and 3.1.6, we conclude that increased prosodic vari-
ability of both participants is a reliable predictor of ASD severity. The individ-
ual feature groups achieve similar ASD/non-ASD classification performance (aside
from pronunciation quality); turn-taking and speaking rate statistics reach a peak
of 58% unweighted average recall. Feature fusion leads to the optimal performance
of 0.35 correlation and 59% UAR.
3.1.7 Results and Discussion, Study III: Correlation Anal-
ysis of ASD Severity
The speech and language features of child and psychologist vary throughout the
interaction; recall that each ADOS session comprises several subtasks. In this
section, we study the general communicative behavior of both participants across
the session. Features are first calculated per subtask, but these may have high
variability if the sample of communication is too small. We quantify session-level
behavior as the median of all subtask-level values of each feature. Additionally, in
this analysis we exclude three subtasks for which "Amount of Speech Required"
was rated lowest, so as to reduce variance in subtask-level speech and language
features. The excluded subtasks are Construction, Make-Believe Play, and Break.
In this subsection, the significant pair-wise correlations between the 28 child and
psychologist speech and language features and ASD severity are examined and
interpreted (Table 3.6). A very inclusive significance level is chosen (α = 0.10).
Although many correlations reach higher levels of significance, certain results
should be interpreted carefully.
Acoustic-Prosodic Variation
We first consider pitch and intensity contour functionals of either participant. As
a child's ASD severity increases, the psychologist and the child tend to speak
with more variable prosody. In particular, the psychologist's pitch and intensity
curvature, and the child's pitch curvature, have increased inter-quartile ratio (IQR).
Additionally, the psychologist has increased pitch slope. The psychologist may be
exaggerating intonation in order to convey more affect in an attempt to engage
the child and elicit a desired response. The child's pitch curvature variability is
less intuitive, but it may result from poor control of pitch dynamics or increased
arousal. Furthermore, children with higher ASD severity showed a strong tendency
to exhibit reduced pitch slope; this behavior may relate to the common perceptions
of monotonous intonation in children with ASD, as we suggested in our previous
study (Bone et al., 2012b).
The psychologist's harmonics-to-noise ratio (HNR) tends to decrease when
interacting with a child who has higher ASD severity. HNR can relate to per-
ceptions of breathy, rough, or hoarse voice (McAllister et al., 1998), generally
an atypical voice quality. In our previous work, we found both participants to
have increased atypical voice quality in sessions with children of higher symptom
severity (Bone et al., 2012b).
Conversational Depiction
An image of conversational style and quality as a function of ASD severity emerges
from the correlations in Table 3.6. As ASD severity increases, the psychologist
talks more while the child talks less. This may suggest that the child with more
severe symptoms is more difficult to engage and less comfortable talking, and the
psychologist is trying various speech strategies to elicit engagement. The psy-
chologist articulates more quickly, while the child articulates more slowly. Furthermore,
the psychologist has increased intra-turn silence, while the child has increased latency
to respond. The psychologist may wait for the child to respond (intra-turn silence)
given the child's bias toward delayed response; but, not seeing the desired reaction
from the child, the psychologist again prompts or proceeds.
We also consider the language features. Children with more severe ASD tended
to use personal pronouns less often. Differences have previously been observed in
the production of 'me' and 'you' between autistic and typically developing chil-
dren (Jordan, 1989). In the ADOS, the children are often asked personal questions
which can make the children uncomfortable. We may suspect that children who
have an aversion to making personal assertions, especially those with higher ASD
severity, will use personal pronouns less often. This finding corroborates a study
that found children with autism responding to personal or emotional questions
tended to give de-personalized responses and avoid use of the word 'I' (Baltaxe,
1977). Furthermore, children with higher symptom severity produce less affect
language (positive or negative emotional words). Children with higher ASD sever-
ity may be freely offering personal information less often; the ADOS code "Offers
Information" scores such socially atypical behavior. Additionally, since the psy-
chologist tends to use personal pronouns more frequently, the psychologist may
be making overt attempts to engage the child in reciprocal social communication
about the psychologist's experiences.
The semi-structured interaction appears less conversational with the children who
exhibit more severe social-communicative difficulties. The psychologist produces
less assent (OK, yes) language to back-channel the child's comments. The psy-
chologist also uses less non-fluent language (hm, umm), which could indicate that the
psychologist is making an effort to be very direct and clear in her communica-
tion attempts. Additionally, children with higher ASD severity tend to use fewer
fillers/discourse-markers (I mean, you know). Fillers can serve turn-taking func-
tions such as pacing and stalling. Previous results from Heeman et al. (2010) also
found that children with ASD used fewer fillers, although they were surveying use
of 'uh' and 'um', which are classified as non-fluencies in our study. Lastly, children
with higher ASD severity exhibit decreased words-per-sentence. This indicates that
children with higher ASD severity speak fewer words at a time, in addition to
speaking a smaller percentage of the time.
3.1.8 Results and Discussion, Study III: Prediction over
Varying Social Demand
The severity of social-communicative deficits in children with ASD may not fully
emerge until a certain level of social demand is reached. In this section, we examine
Table 3.6: Correlations of session-level features with ADOS severity.
Note: [†, *, **] = [α=0.10, 0.05, 0.01].

Trend with Severity (Psychologist Feature)      Spearman's ρ
  more positive pitch slope                        +0.32†
  more variable pitch curvature                    +0.33†
  more positive intensity curvature                +0.31†
  more variable intensity curvature                +0.51
  increased articulatory SR                        +0.38
  increased intra-turn silence                     +0.32†
  decreased harmonics-to-noise                     -0.47
  increased speech %                               +0.54
  increased personal pronouns                      +0.38
  decreased assent lang.                           -0.48
  decreased non-fluent lang.                       -0.48

Trend with Severity (Child Feature)             Spearman's ρ
  more negative pitch curvature                    -0.56
  more variable pitch curvature                    +0.45
  more variable intensity curvature                +0.43
  decreased articulatory SR                        -0.34†
  increased latency                                +0.34†
  decreased speech %                               -0.36†
  decreased words per sentence                     -0.42
  decreased personal pronouns                      -0.40
  decreased affect lang.                           -0.40
  decreased fillers                                -0.41
speech and language features from the child and psychologist in subtasks with high,
medium, and low social demand through a predictive modeling task. The high
social demand tasks are: Emotions, Loneliness, Social Difficulties and Annoyance,
Joint Interactive Play, and Cartoons; medium: Friends and Marriage, Description
of a Picture, Creating a Story, Demonstration Task, Telling a Story from a Book,
and Conversation and Reporting; low: Construction, Make-Believe Play, and Break.
Social demand group-level features are the medians of those from corresponding
subtasks.
Multiple linear regression with forward-backward feature selection is performed
using a leave-one-session-out framework with two layers, one for prediction and
another for parameter tuning. Spearman's rank correlation coefficient is used for
analysis and tuning. Acoustic-prosodic and turn-taking cues were used separately
from language descriptors. These were divided in this analysis because we found
overfitting often occurred due to the large size of our initial feature set. Features
chosen in at least 50% of the cross-folds are presented in Table 3.7 (a small sketch
of this stability criterion follows the table).
Table 3.7: Correlations of prosodic and language feature sets' predictions with
ADOS severity over varying social demands. Note: [†, *, **, ***] = [α=0.10, 0.05,
0.01, 0.001]. P = psychologist's features, C = child's features. Each cell lists
Spearman's ρ_S and the most-often chosen features.

Psych Acoustics
  High:   ρ_S = +0.43  (1. P engy curvature IQR; 2. P harmonics-to-noise; 3. P silence %)
  Medium: ρ_S = +0.46  (1. P intra-turn sil DUR; 2. P speech %)
  Low:    ρ_S = +0.20  (1. P engy curvature MED; 2. P intra-turn sil DUR; 3. P speech %)

Child Acoustics
  High:   ρ_S = +0.76  (1. C pitch curvature MED; 2. C pitch curvature IQR; 3. C articulation rate)
  Medium: ρ_S = +0.17  (1. C speaking rate; 2. C overlap %)
  Low:    ρ_S = +0.02  (1. C pitch curvature IQR; 2. C intra-turn sil DUR; 3. C intensity MED)

Both Acoustics
  High:   ρ_S = +0.50  (1. C pitch curvature IQR; 2. P engy curvature IQR; 3. C intra-turn sil DUR)
  Medium: ρ_S = +0.29  (1. P intra-turn sil DUR; 2. P speech %)
  Low:    ρ_S = -0.11  (1. C pitch curvature IQR; 2. P intra-turn sil DUR; 3. P engy curvature MED)

Psych Lang.
  High:   ρ_S = +0.49  (1. P affect; 2. P non-fluencies)
  Medium: ρ_S = +0.10  (1. P non-fluencies)
  Low:    ρ_S = +0.39  (1. P non-fluencies; 2. P affect)

Child Lang.
  High:   ρ_S = -0.01  (1. C filler)
  Medium: ρ_S = +0.35† (1. C filler)
  Low:    ρ_S = -0.97  (none)

Both Lang.
  High:   ρ_S = +0.27  (1. P affect; 2. P non-fluencies)
  Medium: ρ_S = +0.46  (1. C filler; 2. P non-fluencies)
  Low:    ρ_S = +0.27  (1. P non-fluencies)
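The fold-stability criterion used for Table 3.7 (features chosen in at least 50% of cross-validation folds) can be sketched as a simple tally; the fold contents below are hypothetical.

from collections import Counter

def stable_features(selected_per_fold, threshold=0.5):
    """Features chosen in at least `threshold` of cross-validation folds."""
    n_folds = len(selected_per_fold)
    counts = Counter(f for fold in selected_per_fold for f in fold)
    return [f for f, c in counts.items() if c / n_folds >= threshold]

# Hypothetical selections from three folds.
folds = [["pitch curvature IQR", "speech %"],
         ["pitch curvature IQR"],
         ["pitch curvature IQR", "latency"]]
print(stable_features(folds))   # ['pitch curvature IQR']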
Acoustic-Prosodic and Turn-taking Features
Interestingly, both psychologist and child acoustics are significantly predictive of
ASD severity in the subtasks with high social demand. Generally, prosodic curva-
ture variability, voice quality, articulatory speaking rate, and pausing are selected
for prediction. We find that constraining the range of subtasks leads to diverse
modeling of appropriate social functioning: the most-often chosen features for pre-
diction vary between conditions.
We also consider medium and low social demand tasks. The psychologist's
features are only predictive in the medium social demand subtasks. The psy-
chologist's intra-turn silence (pausing) and speaking percentage are selected for
prediction. Low social demand subtasks are sparse in expected speech content
from children, explaining why no significant prediction occurs.
Language Features
The selected language features show some success in prediction of ASD severity,
but generally less so than the acoustic features. Only three features are chosen
for prediction in any group: psychologist affect and non-fluencies, and child fillers.
We note that after separating high and medium social demand, psychologist use
of affect language becomes informative, whereas it was not a significant correlate
with ASD severity before.
The psychologist's language features are significantly predictive of ASD severity
in high social demand activities, but the child's are not. The child's features are
significantly predictive in medium social demand, but the psychologist's are not;
however, the addition of the psychologist's features leads to statistically significant
prediction. The trend of increased prediction with increased social demand does
not seem as pronounced for language cues. Interestingly, the psychologist's features
are significantly predictive in low social demand, although no features of the child
are selected for prediction, underscoring the salience of the psychologist's language
features even when the child receives few presses for interaction.
3.1.9 Conclusions and Future Work
Using our objective, semi-automatic framework, we analyzed the joint interaction
between child and psychologist, since the psychologist is both interlocutor and
evaluator. In Studies I and II, regression analyses provided empirical support for
the significance of the psychologist's behavior in ASD assessment, an intuitive
result given the dependence between dyadic interlocutors in general. In Study III,
we demonstrated that prosodic, turn-taking, and language features taken during
child-psychologist interactions are indicative of the degradation in conversational
quality for children with greater severity of ASD symptoms. In particular, we found
that as ASD severity increases, the psychologist varies her speech and language
strategies in attempts to engage the child in social interactions, while children with
more severe ASD speak less and use fewer affect words and personal pronouns.
Additionally, we found greater predictive power for ASD severity in subtasks with
high social demand, while the psychologist's language cues were predictive even
during minimal-speech, low social-demand tasks.
These results support future study of acoustic-prosody and language during
spontaneous conversation, not only of the child's behavior, but also of the psy-
chologist's speech patterns, using computational methods that allow for analysis
on much larger corpora. These preliminary studies suggest that signal processing
techniques have the potential to support researchers and clinicians with quantita-
tive description of qualitative behavioral phenomena and to facilitate more precise
stratification within this spectrum disorder.
Further experiments will be conducted on larger datasets which include data
from typically developing children to provide a normative context. Normative data
can provide a baseline for expected typical behavior, allowing for greater precision
and detail in modeling interaction. Additional topics, such as at which point the
psychologist makes a decision, the global versus local nature of deficits, and the
modeling potential of unsupervised behavioral signals (Bone et al., 2014c; Lee et al.,
2014), will be examined.
3.2 Joint Modeling of Child's and Psychologist's Affect
Researchers from various disciplines are concerned with the study of affective phe-
nomena, especially arousal. Expressed affective modulations, which reflect both
an individual's internal state and external factors, are central to the communica-
tive process. Bone et al. (2014b) developed a robust, unsupervised (rule-based)
method which provides a scale-continuous, bounded arousal rating from the vocal
signal. In this study, we investigate the joint dynamics of child and psychologist
vocal arousal in autism spectrum disorder (ASD) diagnostic interactions. Arousal
synchrony is assessed with multiple methods. Results indicate that children with
higher ASD severity tend to lead the arousal dynamics more, seemingly because
the children are not as responsive to the psychologist's affective modulations. A
vocal arousal model is also proposed which incorporates social and conversational
constructs. The model captures conversational signal relations, and is able to
distinguish between high and low ASD severity at accuracies well above chance.
(This work appears in an Interspeech 2014 submission, Bone et al. (2014c).)
3.2.1 Introduction
Arousal (also referred to as activation and excitation) is a primary component in dimensional theories of emotion (Fontaine et al., 2007, Yik et al., 1999), and continues to be the focus of interdisciplinary work in domains such as psychology, engineering, linguistics, and biology. Arousal is an internal state that, like other affective constructs, influences our thoughts and actions. For instance, a person's arousal level can affect the decisions they make (Kroeber-Riel, 1979, Leith and Baumeister, 1996) and their performance on certain tasks (Lambourne and Tomporowski, 2010).
The transmission of multimodal affective cues is a core facet of human communication. Perceptual tests and engineering systems have established that expressed affective cues can be discerned from speech (Bachorowski, 1999, Juslin and Scherer, 2005, Lee et al., 2011, Lee and Narayanan, 2005, Schuller et al., 2003). Although arousal is reliably decoded from vocal cues, engineering tools that are broadly applicable to unlabeled databases are lacking. State-of-the-art supervised systems usually incorporate thousands of features (e.g., openSMILE (Eyben et al., 2010)); while large feature sets increase capacity for modeling behavior, they reduce interpretability and risk overfitting in cross-corpora experiments. Bone et al. (2014b, 2012) developed and validated an alternative unsupervised (rule-based) framework for vocal arousal rating that utilizes three features (pitch, vocal intensity, and the ratio of high- and low-frequency energy). This framework enables general, interpretable study of vocal arousal with few constraints.
In human-human interaction, behavioral synchrony (or entrainment) between participants is central to perceptions of overall quality. Harrist and Waugh (2002) define synchrony as a "type of interaction between two people ... an observable pattern of dyadic interaction that is mutually regulated, reciprocal, and harmonious". Several studies have concentrated on infants and adolescents to understand the development and importance of entrainment processes (Bernieri et al., 1988, Feldman, 2007, Harrist and Waugh, 2002). Such work has reported a higher dyadic synchrony when mothers interact with their own infant, rather than an unfamiliar infant (Bernieri et al., 1988); and that parent-infant synchrony is predictive of symbolic play complexity (Feldman, 2007). Overall, the development of behavioral synchrony is seen as key to establishing significant dyadic relationships, enabling the child to grow socially and emotionally (Harrist and Waugh, 2002).
Behavioral synchrony computation often relies on hand-coded behavioral signals like gaze, vocalizations, and affect (Feldman et al., 2011). There is an apparent need for scalable automatic techniques. Engineering methods in computing synchrony are gaining attention. Studies have used automatic measures of heart rate (Feldman et al., 2011); facial features such as smile strength, eye constriction, and mouth opening (Messinger et al., 2009); and prosodic and spectral features (Lee et al., 2014). The present work investigates the joint dynamics of automatically-generated vocal arousal contours.
Autism spectrum disorder (ASD) is a highly heterogeneous and highly prevalent (1 in 68 children (Baio, 2014)) neuro-developmental disorder characterized by social communication deficits, social impairments, and the presence of restricted, repetitive, and/or stereotyped behaviors (APA, 2000). Computational models of social behavior have the potential to aid clinicians in diagnosis, intervention, and long-term monitoring. Further, neurobiological studies in autism need quantitative, dimensional measures of behavior for improved stratification (Lord and Jones, 2012). Toward this end, Bone et al. (2012b, 2014a) found through speech-prosodic analysis of ASD diagnostic sessions that the psychologist adjusted their speech properties based on the child's social-communicative impairments. Bone et al. (2013b) additionally investigated automatically-extracted turn-taking and language cues, observing an overall degradation of conversational quality when the psychologist interacted with children with higher ASD severity. The current work extends this line of research, investigating the joint evolution of affect as captured by vocal arousal.
In this study, we examine child-psychologist affective synchrony in relation to ASD severity. We note that vocal arousal is a function of both internal state and external social and conversational factors. Thus, we additionally propose a model of vocal arousal dynamics which incorporates both internal (self-evolution) and external social and contextual factors. This model is then utilized to discriminate between groups of interaction sessions, divided according to the child's ASD severity.
3.2.2 Experimental Design
The USC CARE Corpus
Child-psychologist interactions are studied in the context of the Autism Diagnostic Observation Schedule (ADOS; Lord et al., 2000). The present work focuses on the ADOS Module 3, designed for subjects who are verbally fluent as judged by the psychologist. During administration, the psychologist leads the child through a variety of activities, or subtasks, designed to elicit social responses. The psychologist then scores 28 codes representing the child's behaviors in the domains of Social Interaction, Communication, and Restricted, Repetitive Behaviors. This analysis uses the revised ADOS Module 3 algorithms (Gotham et al., 2007), and the transformation of the ADOS total to an ASD severity score (Gotham et al., 2009). The ASD severity score is in the range 1 to 10, with higher scores indicating higher severity of ASD symptoms.
The USC CARE Corpus (Black et al., 2011a) is comprised of audio-video recordings (2 HD cameras and 2 high-quality far-field microphones) of ADOS administrations. Sessions are lexically transcribed based on the SALT transcription manual (Miller and Smith, 1983), and temporally marked for utterance boundaries. Demographics of the 29 participants for this study are presented in Table 3.8.
As with previous studies conducted with this corpus (Bone et al., 2012b, 2014a, 2013b), we examine both child and psychologist behavior. Three licensed, research-certified psychologists with extensive clinical experience with ASD children administered the ADOS. Two of the psychologists were bilingual in English and Spanish; bilingual participants were evaluated by bilingual psychologists. Administrations were conducted in English, so small portions of Spanish conversation are disregarded; one subject (of 30) was excluded due to a primarily Spanish discourse.
Table 3.8: Demographic statistics of the 29 recorded children in this study that were administered Module 3 of the ADOS.

  Category          Count/Statistic
  Age (years)       mean: 10.0, std. dev.: 2.6, range: 5.8-15.0
  Gender            male: 23, female: 6
  Native language   Spanish: 9, English: 10, Sp&Eng: 4, unknown: 6
  Ethnicity         Hispanic: 20, White/+Other: 8, African-American: 1
  ADOS module       #3: 29
  ADOS diagnosis    autism: 18, ASD: 5, below ASD cutoffs: 6
Vocal Arousal
Expressed vocal arousal is quantified using a method proposed by Bone et al. (2014b). This rule-based method can provide a scale-continuous arousal rating, bounded in the range [-1, 1], from the vocal signal without the need for any manual labeling. The arousal rating is built upon three knowledge-inspired features, whose individual scores are fused to improve reliability. The method performs consistently across multiple corpora, achieving both high correlations with labels and impressive binary classification performance.
The framework tracks a speaker's variation from a baseline (e.g., higher pitch indicates higher arousal). This baseline data can be neutral-labeled or unlabeled (global normalization). In the case of neutral-data normalization, positive (negative) ratings can be interpreted as higher-than-neutral (lower-than-neutral) arousal. If no labeled data is available, the method is still able to rank instances according to perceived arousal via global normalization; here, the relative value of instances still has meaning, but the absolute value is less interpretable.
In this corpus, vocal arousal is extracted for both speakers. The data is organized into turns. A turn consists of consecutive, uninterrupted utterances from a single speaker. Utterances that are entirely overlapped by another speaker's turn are excluded (rather than performing source separation), while overlaps that represent speaker changes are maintained. Sessions vary from 76-326 turns for each participant. Features are extracted within the voiced frames of a turn. Vocal arousal is computed for each turn with global speaker-normalization.
Sample vocal arousal streams for child and psychologist are shown in Figure 3.5. Coupling is apparent in this 50-turn sample; the Spearman's rank-correlation coefficient between the two arousal contours is ρ_S = 0.66. A varying lead-lag relationship is also evident: in the first segment ("Psych leads"), the psychologist's arousal precedes the child's arousal by one turn. (By convention, time slot t contains the arousal from the psychologist at turn k and the following turn from the child, k+1.) In the second segment ("Child leads"), the child's arousal precedes the psychologist's arousal by approximately two turns.
Synchrony Measures
We consider two measures of synchrony: cross-correlation with peak-picking and Granger causality. Windowed cross-correlation with peak-picking (Boker et al., 2002) is used to evaluate both the magnitude of interaction between vocal arousal streams (peak absolute value of cross-correlation) and the lead/lag relationship (corresponding peak index). Correlation can be either positive or negative, reflecting synchronous or asynchronous interaction; we consider the absolute magnitude of the correlation.

Figure 3.5: Example vocal arousal streams from child (C-Ar.) and psychologist (P-Ar.) over 50 turns, with highlighted regions of synchrony ("Psych leads", "Child leads").

Figure 3.6: Example vocal arousal streams for child (C) and psychologist (P) with dominance (P-Dom.) and backchannels (P-Bc.) annotated.
Granger causality (Granger, 1969) is a statistical method for determining if one signal is useful in predicting another; in this case, through linear autoregressive models for time series prediction (Hamilton, 1994). Given time-varying signals X(t) and Y(t), if predicting Y with previous values of X reduces the residual signal energy compared to using Y alone, then X is said to Granger-cause Y. Three Granger features are extracted (Seth, 2005): the strength of Granger-causality from the child to the psychologist (and vice-versa), taken as the logarithm of the F-statistic; and the child's causal flow, the difference of the out-degree of Granger causality minus the in-degree. Features were extracted within overlapping windows. The number of lags for Granger analysis was decided per window through the Bayesian Information Criterion. Given the limited length of these sessions, Granger-causality magnitudes were included in computations whether or not they reached statistical significance. All computations are made using the GCCA toolbox (Seth, 2010). A similar approach was employed to model dominance effects with non-verbal behaviors by Kalimeri et al. (2012).
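To make these feature definitions concrete, the following sketch (a simplification using statsmodels, not the GCCA toolbox pipeline; a fixed maximum lag and best-lag selection by F-statistic stand in for the per-window BIC procedure) computes the two directed log F-statistics and the causal flow:

    import numpy as np
    from statsmodels.tsa.stattools import grangercausalitytests

    def granger_features(child, psych, max_lag=3):
        def log_f(target, source):
            data = np.column_stack([target, source])  # column 2 tested as cause of column 1
            results = grangercausalitytests(data, maxlag=max_lag)
            # keep the strongest lag by F-statistic (the thesis selects lags via BIC)
            f_stat = max(results[lag][0]['ssr_ftest'][0] for lag in results)
            return np.log(f_stat)

        f_c_to_p = log_f(psych, child)   # child Granger-causing psychologist
        f_p_to_c = log_f(child, psych)   # psychologist Granger-causing child
        causal_flow = f_c_to_p - f_p_to_c  # child's out-degree minus in-degree
        return f_c_to_p, f_p_to_c, causal_flow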
A Conversational Model of Vocal Arousal
Vocal arousal is an expressed signal that is reflective not only of internal arousal, but also of social and conversational factors. Vocal arousal evolves as a function of the interaction; in this data, the coupling is quite significant (Section 3.2.3). Since a person's vocal arousal depends on their conversational partner's vocal arousal, this is one component of the model we propose in Figure 3.7; corresponding signals are shown in Figure 3.6.
The vocal arousal for a particular turn is certainly dependent upon the spoken content. For this purpose, we include dialogue acts in our model of vocal arousal. For instance, compared to acknowledgments, backchannels are expected to be of lower volume and pitch movement; this is because backchannels are spoken with the intention of not overtaking the floor, whereas acknowledgments assert an opinion (Shriberg et al., 1998). Since these features are positive correlates of perceived arousal (Juslin and Scherer, 2005), we may also expect vocal arousal to be lower for backchannels.
Figure 3.7: Conversational model of vocal arousal (psychologist's view shown): the psychologist's arousal Ar_{p,k} depends on the child's preceding arousal Ar_{c,k-1}, the dialogue act DA_{p,k}, and the dominance Dm_{p,k}. Note: p - psychologist; c - child; Ar - vocal arousal; DA - dialogue act; Dm - dominance; bold indicates a vector ending at turn k.
The evolution of vocal arousal contours is additionally contingent on the style of the conversation (e.g., interview or casual conversation). We restrict our focus to a speaker's conversational dominance (power or control). Dominance is a global factor of an interaction, but it varies locally as well. For example, Wöllmer et al. (2012) investigated the temporal modeling of dominance using acoustic features. Acoustic features like vocal intensity influence perceptions of dominance (Jayagopi et al., 2009). We include temporal dominance in our vocal arousal model such that any dependency may be captured and quantified.
ASD Severity Classification with Arousal Model
We implement the proposed model as a vocal arousal sequence prediction model using linear-chain conditional random fields (CRFs). The vocal arousal of speaker A is predicted from the previously mentioned social and conversational factors: speaker B's vocal arousal at the previous turn (plus Δ and ΔΔ), speaker A's current dialogue act, and speaker A's current dominance. Since dialogue act annotations are not available for the USC CARE Corpus, we restrict the dialogue act set to backchannels and others. Backchannels are defined as turns which are at least one-third composed of words from the set listed in Table 3.9; this set overlaps with the most common backchannels in the Switchboard corpus (Stolcke et al., 2000). We accept that false positives will occur.
Dominance is perceived through several cues. Prosody is already used in vocal arousal, so we rely on another feature. Total speaking length has been shown effective in dominance prediction (Jayagopi et al., 2009). We devise a temporal dominance feature based on turn length. For a dyadic conversation, we define speaker A's dominance at turn k as the ratio of the length of speaker A's previous turns to the total length of speaker A's and speaker B's previous turns; we use 3 previous turns with a decaying weight function of [0.15, 0.30, 0.55] for turns [k-2, k-1, k].
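As an illustration, a minimal sketch of this weighted turn-length ratio follows (names are ours; treating a speaker's length as zero on turns where they do not hold the floor is an assumption about the turn bookkeeping):

    # Temporal dominance feature: weighted ratio of speaker A's recent speaking
    # length to the dyad's total, with weights [0.15, 0.30, 0.55] per the text.
    def dominance(a_len, b_len, k, weights=((2, 0.15), (1, 0.30), (0, 0.55))):
        num = sum(w * a_len[k - d] for d, w in weights)
        den = sum(w * (a_len[k - d] + b_len[k - d]) for d, w in weights)
        return num / den if den > 0 else 0.5  # neutral value when no speech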
Table 3.9: List of words defined as a backchannel in our corpus.
List of backchannel words
`mm', `hmm', `mm-hmm', `uh', `huh', `uh-huh', `um', `ah', `oh', `okay', `yeah', `yes',
`cool', `nice', `alright', `I see', `my goodness', `oh no'
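A sketch of this rule with the Table 3.9 list follows (single-token matching is shown for brevity; the multi-word entries such as `I see', `my goodness', and `oh no' would need phrase matching):

    # Rule-based backchannel detection: a turn is labeled a backchannel if at
    # least one third of its words come from the backchannel word list.
    BACKCHANNEL_WORDS = {
        'mm', 'hmm', 'mm-hmm', 'uh', 'huh', 'uh-huh', 'um', 'ah', 'oh',
        'okay', 'yeah', 'yes', 'cool', 'nice', 'alright',
    }

    def is_backchannel(turn_words):
        if not turn_words:
            return False
        hits = sum(w.lower() in BACKCHANNEL_WORDS for w in turn_words)
        return hits >= len(turn_words) / 3.0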
In Section 3.2.3, we perform binary classification of ASD severity with CRFs through a leave-one-session-out approach. We define high-severity ASD as an ASD severity of 7 or higher. This division produces an approximately equal distribution of 14 high-severity and 15 low-severity sessions. CRF models are trained with the HCRF toolbox (Morency et al., 2007). CRFs are first trained for each class to predict speaker A's vocal arousal sequence given a set of features. Then maximum likelihood classification is performed by selecting the model that produces the highest likelihood for the observed arousal sequence of the test session. Rather than computing the likelihood of the exact sequence, we define the sequence likelihood as the product of the likelihood of each observation in the sequence. Vocal arousal is quantized into three equally-balanced categories per session (high, medium, and low). Dominance (Dm) is quantized into three categories as well: Dm <= 1/3, 1/3 < Dm <= 2/3, and Dm > 2/3.
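Schematically, this decision rule compares per-class sequence log-likelihoods. The sketch below assumes each trained class model exposes per-observation likelihoods; predict_proba is an illustrative stand-in, not the HCRF toolbox interface:

    import numpy as np

    # Maximum-likelihood session classification: the sequence likelihood is the
    # product of per-observation likelihoods, computed here in log space.
    def classify_session(features, arousal_seq, model_high, model_low):
        def log_lik(model):
            probs = model.predict_proba(features)  # (n_turns, 3 arousal levels)
            return np.log(probs[np.arange(len(arousal_seq)), arousal_seq]).sum()
        return 'high' if log_lik(model_high) > log_lik(model_low) else 'low'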
3.2.3 Results and Discussion
In the first subsection below, the various measures of arousal synchrony are investigated in relation to ASD symptom severity. Then, we model the arousal dynamics conditioned on other relevant social and conversational features; the content captured by this model is evaluated through a classification task.
Vocal Arousal Synchrony
The social-affective exchange between child and psychologist is expected to differ if the child has social-communicative difficulties. In this section, the synchrony between vocal arousal signals is related to the severity of the child's ASD symptoms.
First, we consider the strength of coupling between the signals, calculated as the maximum absolute cross-correlation value. In order to capture different styles of interaction (e.g., synchrony/asynchrony, or varying linear predictive coefficients with Granger analysis), the sessions are windowed. Window sizes (W_s) of 25-70 turns are used with a step size of W_s/2. Within each window, the lag that maximizes absolute correlation is chosen; then the median coupling and lag are computed across all windows in a session. Coupling magnitude is medium in the sessions: mean = 0.41 (std. dev. = 0.12) for W_s = 25. However, no relation is observed between this coupling magnitude and ASD severity (p > 0.05) for any window size.
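A simplified sketch of the windowed cross-correlation with peak-picking follows (the lag range and the handling of degenerate windows are illustrative assumptions; positive lag means the child leads):

    import numpy as np

    # Windowed cross-correlation with peak-picking: within each window, find
    # the lag with maximum |correlation|, then report session-level medians.
    def coupling_and_lag(child, psych, win=25, max_lag=5):
        peaks, lags = [], []
        for s in range(0, len(child) - win + 1, win // 2):
            c, p = child[s:s + win], psych[s:s + win]
            best = (0.0, 0)
            for l in range(-max_lag, max_lag + 1):
                cc = c[max(0, -l):win - max(0, l)]   # child at turn t
                pp = p[max(0, l):win - max(0, -l)]   # psychologist at turn t + l
                r = abs(np.corrcoef(cc, pp)[0, 1])   # nan if a window is constant
                if r > best[0]:
                    best = (r, l)
            peaks.append(best[0]); lags.append(best[1])
        return float(np.median(peaks)), float(np.median(lags))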
Next we ask the question, 'Who leads the affective exchange, and how does it relate to ASD severity?' This topic is first examined using cross-correlation with peak-picking. Referring to Figure 3.8, a significant positive correlation frequently occurs between peak lag and ASD severity (4 times p < 0.05, twice p < 0.10, and 4 times p > 0.10). This indicates that as ASD severity increases, the child tends to lead the arousal coordination more, as illustrated in Figure 3.9. For very low ASD severity, the psychologist leads the interaction; but for very high ASD severity, the child often leads. However, it is uncertain whether this occurs because the child is less influenced by the psychologist, or because the psychologist is more attuned to the child.
Figure 3.8: Correlation between lag (positive when child leads) and ASD severity vs. window length, with α = 0.05 and α = 0.10 thresholds marked.

Figure 3.9: Lag vs. ASD severity for W_s = 30; ρ_S = 0.49 (p < 0.01).
In order to gain insight into the causal structure of the interaction relative to ASD severity, we utilize Granger causality analysis. The results (displayed in Figure 3.10) vary somewhat according to window size. The child's influence on the psychologist (F_C->P) is not statistically significantly related to the child's ASD severity (p > 0.05) for any window size. But the psychologist's influence on the child (F_P->C) is found to decrease with higher ASD severity for W_s = 35 (p < 0.05). Also, the child's causal flow (cf(C)) is significantly more outward given higher ASD severity for W_s = 45 (p < 0.05). These observations suggest that a child with higher ASD severity is less influenced by the psychologist's vocal arousal. No significant relation to ASD severity is found for the psychologist's attunement to the child's behavior. While these results cohere with the cross-correlation analysis, they should be interpreted cautiously, since few parameter settings showed significance.
Figure 3.10: Correlation between Granger causality parameters and ASD severity vs. window length. F_C->P is the magnitude of interaction for the child G-causing the psychologist's behavior, and vice-versa for F_P->C; cf(C) is the child's causal flow.
Classification with Conversational Model
The relationships between a speaker's vocal arousal and other social (i.e., partner's vocal arousal) and conversational (i.e., backchannels and dominance) factors may capture social-communicative patterns associated with ASD severity. In this section, we perform classification of high and low ASD severity groups using CRF vocal arousal predictive models.
First, we validate that the model captures statistical relations between vocal arousal and the proposed features by examining perplexity. Perplexity is calculated with whole-sequence prediction via leave-one-session-out cross-validation. Since the vocal arousal is evenly distributed among three levels, maximum perplexity is log_2(3) = 1.59. We find a small decrease in perplexity when using backchannels or dominance as features. The largest gain comes from including the partner's vocal arousal; recall that arousal coupling is rather strong in this database. Thus, the model captures information about the relation between feature streams. The remaining question is whether this information is discriminative.
Table 3.10: Model perplexity as a function of feature input.

  feature             baseline  partner  backch.  dom.   all
  perplexity (psych)  1.59      1.49     1.55     1.57   1.46
  perplexity (child)  1.59      1.48     1.57     1.58   1.47
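The perplexity used here is the mean negative log2-likelihood of the quantized arousal labels under the model's predicted per-turn distributions; a minimal sketch, with predict_proba output again an assumed stand-in:

    import numpy as np

    # Mean negative log2-likelihood; equals log2(3) = 1.59 for a uniform guess
    # over the three arousal levels, matching the 'baseline' column above.
    def perplexity(pred_probs, labels):
        nll = -np.log2(pred_probs[np.arange(len(labels)), labels])
        return float(np.mean(nll))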
The results in Table 3.11 indicate that the model can discriminate between high and low ASD severity. Predicting the psychologist's or child's vocal arousal with the preceding vocal arousal of the other speaker achieves above 50% unweighted average recall (UAR), but this is not significantly above chance. The same is observed with the other features used in predicting the child's vocal arousal. Fusion decreases performance, likely due to insufficient data size. Interestingly, the relationships of the psychologist's vocal arousal with backchannels (80% UAR) and dominance (79% UAR) discriminate ASD severity.
Table 3.11: Classification performance in UAR. Note: '*' indicates significance above chance at p < 0.01.

  feature        chance  partner  backch.  dom.   all
  predict psych  50%     58%      80%*     79%*   75%
  predict child  50%     55%      59%      62%    48%
Conditional probability tables may support interpretation. For backchannels, the largest difference appears to be the probability of low vocal arousal given that the current state is a backchannel, i.e., p(Ar_{p,k} = low | BC_{p,k} = true). This probability is higher for the high ASD severity group (0.57 vs. 0.49). This result could occur if the psychologist is more cautious of taking over the floor during backchannels for the children with higher ASD severity. For dominance, there is some difference between groups in the conditional probabilities p(Ar_{p,k} | Dm_k) for high and low dominance, but the interpretation is less certain. We do note that the psychologist is more dominant (mean of Dm_p) for the high ASD severity group (p < 0.05). Still, the proposed vocal arousal model captures relations between vocal arousal and relevant feature streams that are informative of ASD severity. There is again the intriguing finding that the psychologist's behavior alone is informative; this result mirrors findings that the psychologist adjusts prosody, turn-taking, and language behavior relative to the child's social-communicative difficulties (Bone et al., 2013b).
3.2.4 Conclusions and Future Work
We investigated how the social-affective interaction between child and psychologist varies according to ASD severity. Vocal arousal is demonstrated to be a useful automatic measure for affective synchrony studies. The findings reveal that the child with higher ASD severity is less responsive to the psychologist, and thus appears to lead the affective exchange. We also proposed a vocal arousal model that incorporates social and conversational influences. The model discriminated between sessions involving children with high and low ASD severity, using only the psychologist's behavior.
Future work will seek to refine the conversational model and evaluate it on a larger conversational database. We believe that affective-dynamic analysis can provide key insights and advancements toward one of the major goals of Behavioral Signal Processing (Narayanan and Georgiou, 2013): providing tools that support clinicians.
Chapter 4
Machine Learning for Autism
Diagnostics and Screening
In this chapter, machine learning is applied to autism diagnostics and screening in order to optimize unweighted average recall and make instruments more efficient and effective. This chapter concerns two primary works: Bone et al. (2016a, 2015a).
4.1 Machine Learning for Improving Autism Screening and Diagnostic Instruments: Effectiveness, Efficiency, & Multi-Instrument Fusion
4.1.1 Introduction
New technologies, methods of analysis, and access to larger datasets have set the stage for real improvements in the iterative process by which knowledge can influence the ways in which we screen, diagnose, and monitor behavior disorders. In the case of autism spectrum disorder (ASD; American-Psychiatric-Association, 2013), enormous efforts have been undertaken to better identify and understand its wide phenotypic heterogeneity. As our understanding of ASD changes, it becomes apparent that new instruments may be necessary for certain clinical and research purposes. For example, standardized instrument performance may be substantially reduced for challenging populations (i.e., non-ASD disorders that result in secondary impairments in social skills (Molloy et al., 2011)). In some cases, revising algorithms or selecting particular items via better data and/or new computational approaches may be sufficient; but in others, it may also be necessary to develop additional behavioral measures. For instance, DSM-5 introduced certain concepts (e.g., sensory abnormalities) that may not be adequately reflected in diagnostic instruments developed for use under DSM-IV (Huerta et al., 2012). Another issue is the growing number of children in need of ASD diagnostic assessment for clinical purposes (Baio, 2014), as well as increasing interest in ascertaining very large numbers of children with ASD for research (e.g., genetics studies (Abrahams and Geschwind, 2010)). Thus, there is increasing pressure to reduce administration time for standardized diagnostic instruments (Lord and Jones, 2012).
For ASD, and behavioral disorders in general, machine learning (ML) can be useful for improving instrument performance and generalization to unseen data, as well as for reducing the number of codes required by the algorithm. ML is especially applicable to ASD, where instruments are validated in reference to a "gold-standard" best-estimate clinical diagnosis (BEC). Unlike traditional techniques that use correlation-based statistical analysis or handcrafted algorithms, ML classifiers are designed to optimize a desired objective/constraint function, typically some function of sensitivity and specificity.
Handcrafted algorithms tend to be simple summations and thresholds, but because of the prevalence of mobile technologies, reliance on hand-calculation is no longer necessary. Further, given the availability of large ASD datasets, it makes sense to approach instrument revision and new instrument development by first analyzing existing data. If we can identify items or constructs that appear to be optimal at discriminating different groups of children with ASD, then we can focus new efforts on developing measures that build upon those constructs. Importantly, however, results of certain studies seeking to improve ASD diagnostic instruments through ML have been largely invalid due to errors in problem formulation and ML utilization (Bone et al., 2015a). These issues include: the flawed assertion that administration time for the Autism Diagnostic Observation Schedule (ADOS; Lord et al., 2000) is reduced by minimizing the number of codes used; classification from instrument diagnosis rather than BEC; insufficient validation; and lack of generalization of results in replication experiments.
In the present study, we attempt to design both more effective (higher performing) and more efficient (reduced administration time) instrument algorithms through the use of ML. We focus on two caregiver-report instruments: the Autism Diagnostic Interview-Revised (ADI-R; Lord et al., 1994b) and the Social Responsiveness Scale (SRS; Constantino and Gruber, 2007). Our work differs from most previous literature in the following ways (for extended discussion see Bone et al. (2015a)). First, our models predict BEC (the "gold-standard" diagnosis) rather than instrument diagnosis, based on the instrument codes. This approach may actually create more effective algorithms, improving the efficacy of current instrument algorithms. Second, we combine items from multiple instruments (i.e., ADI-R and SRS). Although all instruments focus on relevant behavioral concepts, certain items may "work" better depending on wording and context. For example, observational measures may more effectively capture subtleties of nonverbal communication compared to caregiver reports, whereas parent or teacher reports are crucial in obtaining information about peer interactions. Third, we focus on caregiver instruments, for which administration time may be dramatically reduced with ML. This is in contrast to works that claim to reduce ADOS administration time (Kosmicki et al., 2015, Wall et al., 2012a), which is not plausible since ADOS codes are not tied directly to any subtask, and thus the entire ADOS must still be administered. We note that ML has been employed with the ADI-R (Wall et al., 2012b), but was used to predict ADI-R classification, without certain additional methodological considerations included in the present study. Specifically, the fourth contribution of this work is to optimize parameter selection in multi-level cross-validation and to a priori disregard ADI-R questions that reduce the algorithm's generalizability. Lastly, we work with a challenging dataset that includes many individuals who received non-ASD developmental disorder BEC (non-ASD). Performance of algorithms from the present research should be viewed in light of the difficult nature of the problem, i.e., differentiating children with ASD from children with other disorders (as opposed to children with neurotypical development) by using solely parental reports to approximate a BEC which was made using various sources of information.
4.1.2 Method
Overview
Existing ADI-R and SRS algorithms consist of three components: initial codes, domain knowledge-inspired subdomain totals, and final classification based on an overall total. Similarly, our purely data-driven approach performs an importance-weighted summation of code scores with a built-in threshold to optimize a desired metric. Our experimental approach is illustrated in Figure 4.1, wherein we create a new mapping from ADI-R and SRS behavioral codes to BEC. First, an ML classifier is used to design an algorithm that can map Instrument Codes to BEC Diagnoses; this is the training phase. It requires a set of data independent from the held-out portion of data used for testing (evaluation). In testing, a Predicted BEC Diagnosis is derived from Instrument Codes, and then compared to the previously known BEC Diagnosis. We use the standard protocol of cross-validation to train/test on independent subsets of data.
Figure 4.1: Flow diagram of ML-based algorithm development: training-set Instrument Codes and BEC Diagnoses are fed to a Machine Learning Classifier, producing an ML-Based Instrument Algorithm; in testing, held-out Instrument Codes yield a Predicted BEC Diagnosis, which is compared against the known BEC Diagnoses.
Participant Information and Code Preprocessing
The experimental data we use consist of ADI-R and SRS item scores from a large corpus previously examined by Bone et al. (2015a), referred to as the Balanced Independent Dataset. The ADI-R is a 93-item interview administered by a trained clinician to a caregiver in two to three hours. For children over four years of age, caregivers are asked about their child's current behaviors ("ADI-R-Current"), as well as behaviors exhibited in the past (either when the child was between the ages of 4-5 years old or ever in the past; "ADI-R-Ever"). Since the published ADI-R-Ever algorithm produces classifications of Autism and Non-Autism, which do not accord with the targeted BEC diagnoses of ASD and Non-ASD, we also evaluate a Collaborative Programs of Excellence in Autism (CPEA) classification based on various combinations of the ADI-R-Ever sub-domain totals (see Hus and Lord (2013)). The SRS is a 65-code caregiver or teacher questionnaire that takes approximately 15 minutes to complete; all items are based on current behaviors. The corpus that we employ combines data from clinical and research assessments.
We constrained our analyses to verbal individuals (as determined by code 30 of the ADI-R) for two reasons. First, the subset of non-verbal individuals was much smaller than that of verbal individuals in our sample. Second, the problem of quickly differentiating verbal individuals with and without ASD (e.g., for triaging purposes) is arguably more clinically relevant, since a child who is over four years of age and not yet using phrases likely has severe developmental difficulties that require immediate referral. Multiple assessments were available for many cases in this corpus; however, only the most recent assessment was retained for each case.
All participant data were drawn from an IRB-approved data repository. For all analyses, individuals over 10 years of age (including 10.0 years of age) are treated separately from those below 10 years of age, because several ADI-R codes are only asked for children under 10. Participant age was limited to a minimum of four years, with no maximum age restriction; ages ranged from 4.0 to 55.1 years. Table 4.1 contains demographic information for our experimental samples. While there were no statistically significant differences for age or non-verbal IQ (NVIQ) between groups according to a Mann-Whitney U test, the DD sample contained a higher percentage of females (p < 0.05). For the Above 10 age group, 24.6% of participants were adults (i.e., 19 years or older). There was a small but statistically significant difference in the percentage of adults between groups (22.6% in ASD and 30.4% in DD; p < 0.05). We suspected this to have minimal effects on our results (given the small difference and identical questioning between adults and children over 10 years), but because it was identified, we included age as a demographic variable in our baseline experiments. The SRS was not given to all participants; this decision depended on clinical protocols rather than anything systematic about the populations. For example, some ADI-R data were collected prior to the SRS publication.
Table 4.1: Demographic information for all data subsets. Note that Age 10+ SRS and ADI-R+SRS are identical. '*' indicates differences between ASD and DD at α = 0.05.

                               ADI-R                   SRS                     ADI-R+SRS
                               Age 10-     Age 10+     Age 10-     Age 10+     Age 10-
  # subjects                   993         654         646         319         567
  # ASD                        727         486         440         238         389
  # non-ASD DD                 266         168         206         81          178
  (percent TD in "DD" group)   5.3         11.3        5.3         4.9         3.9
  Mean (Stdv.) age (yr.): ASD  6.8 (1.8)   15.9 (5.3)  6.7 (2.0)   14.7 (5.2)  7.1 (1.7)
  Mean (Stdv.) age (yr.): DD   6.8 (1.8)   17.2 (7.6)  6.6 (2.0)   16.1 (8.8)  7.1 (1.7)
  Mean (Stdv.) NVIQ: ASD       89.6 (23.4) 88.7 (27.7) 92.1 (22.0) 95.5 (23.8) 92.6 (22.3)
  Mean (Stdv.) NVIQ: DD        91.7 (21.0) 84.3 (30.5) 93.4 (20.5) 92.1 (23.7) 92.7 (20.5)
  percent female: ASD          19.1*       18.5        19.1*       21.8*       19.0*
  percent female: DD           30.5*       31.5        32.0*       33.3*       33.1*
We avoided using questions from the ADI-R that were more summative in nature (e.g., #86, Interviewer's Judgment, which can consider all information obtained during the preceding 85 questions), increasing the likelihood that our reduced algorithm would translate into a usable system. We also excluded codes that were not expected to generalize across clinics, suspecting they likely captured idiosyncrasies of the specific clinical research sample (i.e., study recruitment strategy versus general diagnostic relevance). For example, children with non-ASD diagnoses such as Down syndrome and ADHD who were recruited as part of certain research studies would be likely to show symptoms at an earlier or later age than children with ASD (i.e., ADI-R #2, Age parents first concern), but this trend would not necessarily hold for children with other non-ASD diagnoses who were referred for ASD diagnostic evaluations. We also performed novel transformations on ADI-R codes (which are composite ordinal/categorical variables that are not initially optimal for ML), the details of which are presented in Appendix A of Bone et al. (2016a).
Machine Learning Approach
Cross-Validation and Performance Metric A primary contribution of this work is the use of multi-level cross-validation (CV), which allows for testing an algorithm's ability to generalize within a dataset and ensures that algorithm performance is not overstated due to "data-fitting." As illustrated in Figure 4.2, CV consists of separating a dataset into equal-sized disjoint partitions that are used iteratively for training and testing such that each partition is evaluated once. Additionally, any parameter tuning or feature selection is performed in a second ("nested") layer of CV on each training set. Our primary layer of CV consists of five equally-sized folds, while the secondary layer is a 3-fold CV on the first-layer training data. For increased reliability of results, we perform 10 runs of CV unless otherwise stated.
In accordance with previous work (Bone et al., 2015a), we chose unweighted average recall (UAR) as our performance metric. UAR is a superior metric to accuracy when data are imbalanced, since even an algorithm that simply picks the majority class may obtain high accuracy. Researchers typically also refer to other metrics like sensitivity and specificity; UAR is the arithmetic mean of sensitivity (recall of the ASD class) and specificity (recall of the non-ASD class), placing equal weight on both:

  UAR = (sensitivity + specificity) / 2

We utilized UAR as a general metric to ascertain algorithm capabilities for a specific set of codes. Statistical significance was calculated using a conservative t-test for difference of independent proportions with sample size N equal to twice the size of the minority class, as presented by Bone et al. (2015a).

Figure 4.2: Illustration of model training, tuning, and testing through "nested" cross-validation (CV) as used in the Effective Algorithms section: an outer 5-fold CV reserves 1 fold for testing and 4 folds for training, and an inner 3-fold CV on the training folds handles parameter tuning.
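As a sketch of this protocol (grids and names are illustrative, and the thesis experiments use LIBSVM rather than scikit-learn), the nested scheme can be expressed with balanced accuracy serving as UAR:

    from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
    from sklearn.svm import SVC

    # Nested CV: an inner 3-fold grid search tunes C; the outer 5-fold loop
    # estimates generalization, scored with UAR (balanced accuracy).
    def nested_cv_uar(X, y):
        inner = GridSearchCV(SVC(kernel='linear'),
                             param_grid={'C': [1e-5, 1e-4, 1e0, 1e1, 1e2]},
                             scoring='balanced_accuracy',
                             cv=StratifiedKFold(3, shuffle=True, random_state=0))
        outer = StratifiedKFold(5, shuffle=True, random_state=0)
        return cross_val_score(inner, X, y, scoring='balanced_accuracy', cv=outer)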
Technical Details of Classification Framework and Error Tuning Classification was performed using a Support Vector Machine (SVM) with linear kernel via the LIBSVM toolbox (Chang and Lin, 2011). SVM is a maximum-margin classifier, meaning it aims to find the boundary that maximally separates two classes in a high-dimensional feature space. This foundation tends to produce robust boundaries (i.e., algorithms) that generalize well to unseen data. As such, SVM is presently one of the most popular classifiers. We used linear-kernel SVM, which has been shown to work very well even when the number of features is high relative to the size of the data (e.g., Black et al. (2015b)). Initial analyses we conducted suggested SVM performed better than other considered classifiers, such as logistic regression; however, due to space and readability constraints, we only present results for linear-kernel SVM.
The base form of SVM assumes linear separability in feature space, but this is an unrealistic assumption. A tunable regularization parameter is introduced which weights the importance of a misclassification; this is the C parameter in LIBSVM. Higher values of C bias the algorithm to make fewer misclassifications. This parameter is tuned in a second layer of cross-validation with a grid search. For the Effective Algorithms experiments the grid search was defined as C in {10^-5, 10^-4, 10^0, 10^1, 10^2}, and for the Efficient Algorithms experiments the range was reduced to C in {10^-5, 10^-2, 10^0} due to computational complexity. Additionally, LIBSVM allows for differentially weighting the errors that occur for different classes (these are w_1 and w_2 in LIBSVM; eq. 40 from Chang and Lin (2011)). We first balanced errors via the constraint function as described by Rosenberg (2012), in which classes are given weights inversely proportional to their priors (i.e., misclassifications of the minority class are given higher importance). Specifically, w_1 and w_2 are defined as:

  w_1 = (N_1 + N_2) * v / N_1,    w_2 = (N_1 + N_2) * (1 - v) / N_2        (4.1)

where N_1 and N_2 are the counts of samples from class 1 (ASD) and class 2 (non-ASD), and v in [0, 1] is a tunable parameter for which increasing values put more emphasis on sensitivity versus specificity. The Effective Algorithms experiments utilized v = 0.5 to optimize UAR, while the Efficient Algorithms experiments examined v in {0, 0.05, ..., 0.95, 1}. In short, v is the fraction of importance placed on sensitivity, with the remainder for specificity.
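A sketch of the weighting of Eq. 4.1 with scikit-learn follows (class_weight plays the role of LIBSVM's w_1 and w_2; the counts and v are illustrative):

    from sklearn.svm import SVC

    # Class weights per Eq. 4.1: v shifts emphasis toward sensitivity (ASD).
    def weighted_linear_svm(n_asd, n_dd, v=0.5, C=1.0):
        n = n_asd + n_dd
        weights = {1: n * v / n_asd,        # class 1: ASD
                   0: n * (1 - v) / n_dd}   # class 0: non-ASD
        return SVC(kernel='linear', C=C, class_weight=weights)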
Feature Selection and Cross-Validation Analyses We identified groups of features that collectively achieved high performance via greedy forward-feature selection, which is critical since top-performing individual features can be highly correlated and contain little complementary information (i.e., collinearity). In greedy forward-feature selection, the best-performing codes in combination with already-selected codes are chosen iteratively. This process must be performed through "nested" CV as in Figure 4.2 in order to get reliable performances. In this case, we had three layers of CV: the first (5-fold) for assessing performance generalizability for different numbers of features; the second (4-fold) for performing feature selection; and the third (3-fold) for tuning parameters. This computationally intensive approach led to five sets of selected codes per execution. We examined patterns in code selection across many iterations (100 runs, or 500 folds). In Table 5 of Bone et al. (2016a), we presented several code selection results, including: training on the entire training dataset; the optimal forward-feature selection path with a first-order Markov assumption (due to data sparsity) from CV; the overall most frequently selected codes from CV; and the most frequently selected subsets of codes from CV. We selected only five codes based on empirical findings detailed in the Results section.
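A bare-bones sketch of the greedy selection loop follows (the score function would itself be estimated with inner-layer CV in the full pipeline; names are illustrative):

    # Greedy forward-feature selection: repeatedly add the code that most
    # improves the scoring function applied to the current selection.
    def forward_select(codes, score, k=5):
        selected = []
        for _ in range(k):
            best = max((c for c in codes if c not in selected),
                       key=lambda c: score(selected + [c]))
            selected.append(best)
        return selected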
In our experiments we also sought to merge ADI-R-Ever, ADI-R-Current, and SRS codes to produce the smallest set needed for accurate screening or diagnosis. Given that many of these codes are highly correlated, it was difficult to interpret commonalities among code sets selected in different folds of CV. As such, we performed hierarchical clustering, wherein codes that were similar (i.e., have a small distance from one another) were clustered. The distance metric is d = 1 - |ρ_S|, where ρ_S is the Spearman's rank-correlation coefficient. Distance between a group and a code is calculated as the average pairwise distance. Since a code can actually be composed of several features, as detailed in Appendix A of Bone et al. (2016a), we began by grouping all features from the same code. For practical reasons, discrete variables associated with ordinal codes were excluded from clustering. In the Results section, we only report the most commonly selected code from a cluster.
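A sketch of this clustering with scipy follows (the cut threshold is an illustrative choice; average linkage matches the group-to-code distance described above):

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import squareform
    from scipy.stats import spearmanr

    # Average-linkage hierarchical clustering of codes under d = 1 - |rho_S|.
    def cluster_codes(X, threshold=0.3):
        rho, _ = spearmanr(X)                 # columns of X are code scores
        dist = 1.0 - np.abs(rho)
        np.fill_diagonal(dist, 0.0)
        Z = linkage(squareform(dist, checks=False), method='average')
        return fcluster(Z, t=threshold, criterion='distance')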
Clustering primarily had the intuitive effect of grouping ADI-R-Ever codes with corresponding ADI-R-Current codes. Other important groupings are presented in Table 4.2. Group A consists of SRS codes that involve the perception of a child's social awkwardness with other people, particularly peers. ADI-R-52 and ADI-R-54 are also grouped together; both involve the child initiating shared experiences.
Table 4.2: List of important groupings from hierarchical clustering. Note that these groupings were identical in Age 10+ and 10-.

  Group  Code #    ADI-R Code Title or SRS Question
  A      SRS B18   "Has difficulty making friends, even when trying his or her best."
         SRS C29   "Is regarded by other children as odd or weird."
         SRS D33   "Is socially awkward, even when he or she is trying to be polite."
         SRS D37   "Has difficulty relating to peers."
  B      ADI 52    Showing and Directing Attention
         ADI 54    Seeking to Share Enjoyment with Others
4.1.3 Results, Designing Effective Algorithms
In order to demonstrate the utility of ML in designing more accurate and consistent diagnostic algorithms, we created new algorithms that mapped instrument code scores to BEC and compared them to existing instrument classifications. The experiments presented in Table 4.3 display performance (UAR) for predicting BEC diagnosis from various input features: a baseline set using demographic variables (NVIQ, age, gender), as well as instrument codes, totals, and classifications.
Using Instrument Codes as features allows the SVM classifier to find an optimal mapping to BEC diagnosis, i.e., a new instrument algorithm. We also analyzed the performance of existing Instrument Totals; ADI-R Totals consists of the A, B, and C sub-domain totals, while SRS Totals refers to raw sub-domain and total scores. Instrument Classifications represent the established algorithms. For the ADI-R-Ever, we simply found the maximum-likelihood mapping from ADI-R classification (Autism or Non-Autism) to BEC (ASD or Non-ASD); this mapping affords a simple solution. The ADI-R-Ever CPEA conventions were designed for ASD/Non-ASD decisions. The SRS does not possess a singular diagnostic threshold; instead it suggests researchers will "use and validate different cutpoints and screening rates based on study-specific requirements" (SRS-2 Manual). We set a threshold on the standardized overall SRS total (SRS-T) through CV; for example, the trained thresholds had means of 74.1 and 77.7 for Age 10- and Age 10+, respectively. For children over age four years, validated totals and classifications only exist for ADI-R-Ever codes (i.e., "most abnormal 4-5 or ever"). Therefore, we do not present an ADI-R-Current Classification in Table 4.3; but we do calculate ADI-R-Current Totals (which are then used as features in ML) using the same approach as for ADI-R-Ever.
Table 4.3: BEC classification with instrument codes, totals, and classifications as features for ADI-R (Ever and Current) and SRS, split at age 10. Results are in terms of UAR. '*' indicates pairwise difference between Proposed: Codes and Classification at α = 0.05.

                           ADI-R-Ever        ADI-R-Current     SRS
                           Age 10-  Age 10+  Age 10-  Age 10+  Age 10-  Age 10+
  Baseline: Demog.         56.3     57.6     56.3     57.4     56.3     57.2
  CPEA Classification      72.5     73.4     N/A      N/A      N/A      N/A
  Instr. Classification    77.0     76.4     N/A      N/A      65.7*    63.7
  ML: Instr. Totals        76.8     77.4     74.2     74.4     64.7     66.9
  ML: Instr. Codes         79.7     78.3     78.0     74.5     72.3*    68.9
  Sample N: ASD            727      486      727      486      440      238
  Sample N: DD             266      168      266      168      206      81
Comparing the proposed algorithms (Codes) vs. existing algorithms (Classification) in Table 4.3, we saw a trend across experimental settings in which higher performance was achieved with the instrument codes as input features. In other words, we observed a trend in which we were able to design new algorithms via ML that superseded the performance of existing algorithms, despite the discussed competing factors. This difference only met stringent requirements for statistical significance for SRS Age 10-. However, after pooling results across age groups for increased statistical power, there was marginal improvement for ADI-R-Ever (p = 0.09, one-tailed) and statistically significant improvement for the SRS (p < 0.01, one-tailed). Additionally, there was little difference between ADI-R-Ever instrument totals and classifications, indicating that existing thresholds are roughly as effective as new ones based on existing totals. Performance gains appeared to come from a more optimal aggregation of various code scores, based on comparison of the instrument codes performance with the instrument totals performance.

In these data, the CPEA classification, which is designed for ASD/Non-ASD decisions, performed worse than the existing ADI-R-Ever Algorithm, which is designed for Autism/Non-Autism decisions. Error analysis revealed that this was because sensitivity and specificity were more balanced with the ADI-R-Ever Algorithm (age-pooled results: 76.7% UAR, 80.2% sensitivity, 73.3% specificity) than with the CPEA classification (age-pooled results: 72.8% UAR, 90.3% sensitivity, 55.3% specificity).
The demographic variables reached performance only slightly above chance (50% UAR; p < 0.05, one-tailed), likely due to class differences in gender (see Table 4.1). All other feature types statistically significantly outperform this baseline (p < 0.05, one-tailed). ADI-R-Ever codes outperformed ADI-R-Current codes by a small margin for Age 10- (1.7% absolute, 2.2% relative) and a slightly bigger margin for Age 10+ (3.8% absolute, 5.1% relative). While it is tempting to compare performance of the ADI-R and SRS in Table 4.3, the data may be dissimilar. Consequently, we performed separate experiments within the sample of individuals who received both ADI-R and SRS administrations. Since the SRS assesses current behavior, we compared only the ADI-R-Current.
Table 4.4: Instrument fusion in the joint ADI-R-Current and SRS sample, in terms of UAR. '*' indicates pairwise difference between Proposed: Codes and Totals at α = 0.05.

                                   Age 10-  Age 10+  Age Combined
  Demographics                     57.1     57.2     56.8
  SRS Classification               68.9     63.7     67.8
  ML Fusion: ADI-R-C/SRS Totals    74.2*    72.9     73.8*
  ML: ADI-R-C Codes                78.0     73.9     76.7
  ML: SRS Codes                    73.0     68.9     71.7
  ML Fusion: ADI-R-C/SRS Codes     80.0*    76.5     78.8*
  Sample N: ASD                    389      238      627
  Sample N: DD                     178      81       259
The results in Table 4.4 suggest that the instrument-fused SVM classifier was able to design a more effective instrument algorithm than that available from the SRS Classification (p < 0.01, one-tailed) or ADI-R/SRS Totals (p < 0.05, one-tailed), based on age-combined results. In both age groupings the order of performance was: (1) ADI-R-C/SRS Codes; (2) ADI-R-C Codes; (3) ADI-R-C/SRS Totals; (4) SRS Codes; (5) SRS Classification (no ADI-R-Current Classification exists); and (6) Demographics. Pooled results suggested that ADI-R-Current codes were more informative than SRS codes (p < 0.05, two-tailed; 5.0% absolute, 7.0% relative), and that no statistically significant gain occurred when fusing SRS codes with ADI-R codes (p = 0.21, one-tailed; 2.1% absolute, 2.7% relative).
4.1.4 Results, Designing Efficient Algorithms
Optimization of BEC sensitivity and specificity was performed by differentially weighting the relative importance of each (Figure 4.3), i.e., adjusting the parameter v from 0 (all weighting on specificity) to 1 (all weighting on sensitivity). Furthermore, we utilized forward-feature selection with CV to determine a minimal subset of codes needed from the joint set of ADI-R-Ever, ADI-R-Current, and SRS codes (Figures 4.3 and 4.4, and Tables 5 and 6 of Bone et al. (2016a)). For the Age 10- experiments, we limited our analyses to the ADI-R, as no SRS codes were frequently selected in the minimal subset and no degradation in screener performance was observed when using only the ADI-R in that subset; this allowed us to have a higher N for the Age 10- experiment.
The receiver operating characteristic (ROC) curves of Figure 4.3 demonstrate the selective tuning of sensitivity and specificity. Performance, which generally improves with the number of features included, increased rapidly up to approximately five codes and slowed thereafter. Age 10- performance was higher than Age 10+ performance, possibly because the ADI-R Age 10- dataset is much larger than the ADI-R/SRS Age 10+ dataset. We also note that occasionally the All Codes performance dropped below that of the subsets; this can happen randomly or due to certain poorly-performing codes.
Figure 4.3: Receiver operating characteristic plots (sensitivity vs. specificity) for Age 10- and Age 10+ using 1, 5, and all codes. The Equal Error Rate (EER) Line indicates the UAR optimization point, where sensitivity and specificity are weighted equally. Classifiers should perform above the Chance Line, where UAR equals 50%. Note that we plot sensitivity vs. specificity in order to aid interpretation relative to UAR.
In order to assess feature selection versus performance more closely, we fixed the value v in Figure 4.4. This allowed us to see an "elbow-point" after which performance gains were small for increased complexity (number of codes). We define the elbow-point as the point where 95% of maximal performance is reached. A reasonable application of this approach is to design a screener, where it is more important to prevent Type-II errors than Type-I errors. Analysis of the curves in Figure 4.3 indicated that a weighting of v = 0.65 was appropriate. Interestingly, with only four codes for Age 10- and three codes for Age 10+, the screener algorithms reached 95% of maximal performance. This represents a tremendous potential reduction in these coding systems.
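The elbow-point rule is simple enough to state directly (a minimal sketch; performance values are indexed by code-set size):

    # Smallest number of codes reaching 95% of the maximum observed performance.
    def elbow_point(perf_by_n_codes, frac=0.95):
        target = frac * max(perf_by_n_codes)
        return next(i + 1 for i, p in enumerate(perf_by_n_codes) if p >= target)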
Next, we examined the codes selected for a screener that used only five codes (which is larger than the necessary three or four codes based on the results in Figure 4.4). The most commonly selected codes in CV, as well as the one set that was selected when training on the full data, are shown in Table 5 of Bone et al. (2016a), with the corresponding code names presented in Table 6. The most important codes are highlighted there. By convention, highlighted codes are those that (i) are among the 10 most frequently selected codes in CV and (ii) were either selected in the full-data training or through the best forward-feature selection path based on statistical analysis of selected codes in CV. The Age 10- ADI-R screener achieved 89.2% sensitivity and 59.0% specificity across 500 folds. When analyzing the most commonly selected codes across experiments, three codes were selected in 53.2% of folds: ADI-R-Ever 33, 35, and 50. Other frequently selected codes were ADI-R-Ever 64 and 68 and ADI-R-Current 73.
The Age 10+ ADI-R/SRS screener had similar, but slightly lower, performance of 86.7% sensitivity and 53.4% specificity. However, code selection appeared less consistent; for example, the most commonly selected group of three codes was only selected in 8.2% of folds (vs. 53.2% for Age 10-). Overall, the two most selected codes are ADI-R-Current 35 and SRS-D33. ADI-R-Ever 34, 47, 54, 55, and 59 were selected relatively frequently.
Figure 4.4: Optimization curves (optimization objective, its maximum, sensitivity, and specificity) versus number of selected codes for the Age 10- (top) and Age 10+ (bottom) screeners. Optimization is biased towards sensitivity (roughly 2:1). An elbow-point at 95% of maximum performance is marked for both age groups (4 codes for Age 10-, 3 codes for Age 10+).
4.1.5 Discussion
In the section Designing Effective Algorithms, we compared SVM-based instrument algorithms to existing ones. Since ML can optimize a desired objective function (e.g., UAR), we expected it to outperform existing algorithms. However, there are two principal competing factors. First, performance of existing algorithm classifications should be slightly inflated, since they are often available during the BEC decision-making process. Second, while we consider the present data sufficient to draw conclusions, ML approaches generalize better given larger amounts of data; this is more of a concern for the Age 10+ experiments.
Our results indicate that ML is a promising tool for creating instrument algorithms. The ML algorithm achieved higher performance than existing algorithms for both ADI-R-Ever (marginally) and SRS in age-pooled results. We also explored novel ML-fusion of the ADI-R and the SRS, finding no statistically significant gain over the ADI-R alone (p = 0.21, one-tailed). In our sample, the ADI-R was likely relied upon more heavily for BEC diagnosis than the SRS; the ADI-R was generally higher performing across all experiments, including the age-combined experiments in Table 4.4 (p < 0.05, two-tailed). Still, this fusion approach can be generalized to combining any number of instruments, allowing information from multiple sources of varying reliability to be fused within a framework that is objective and can be tuned toward the desired metric. Testing of these approaches in a larger, independent sample wherein clinicians are blind to instrument classification could provide great insights and lead to translational outcomes.
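To illustrate the tunable, fusion-capable framework, the sketch below trains a class-weighted linear SVM on synthetic stand-in data using scikit-learn; the feature matrices, class weights, and training-set evaluation are all illustrative assumptions, and the actual cross-validation and code-selection pipeline is omitted:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)

    # Synthetic stand-ins: rows are children, columns are instrument codes.
    X_adir = rng.integers(0, 4, size=(200, 10)).astype(float)
    X_srs = rng.integers(0, 4, size=(200, 15)).astype(float)
    y = rng.integers(0, 2, size=200)  # 1 = ASD, 0 = non-ASD (BEC label)

    # Instrument fusion by simple feature concatenation.
    X_fused = np.hstack([X_adir, X_srs])

    # Weighting the ASD class more heavily trades specificity for
    # sensitivity, mirroring the sensitivity-biased objective above.
    clf = SVC(kernel="linear", class_weight={1: 2.0, 0: 1.0})
    clf.fit(X_fused, y)

    pred = clf.predict(X_fused)
    sensitivity = np.mean(pred[y == 1] == 1)
    specificity = np.mean(pred[y == 0] == 0)
    print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")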
The results of the section Designing Efficient Algorithms support our ability to create a screening algorithm with a reduced set of interview codes and, presumably, reduced administration time. Future clinical studies could evaluate whether coding differences occur when administering the reduced set (especially by individuals with less training) and whether screener validity translates to independent data (e.g., in a general-referral setting). The considerable redundancy (in terms of what is most diagnostically relevant) in these instruments may be necessary for making a precise diagnosis or for obtaining a complete clinical picture of an individual child; for screening purposes, however, eliminating this redundancy is critical. Specifically, we created an ADI-R screener for below 10 years of age that achieved 89.2% sensitivity and 59.0% specificity in 500 folds of CV. We also created an ADI-R/SRS screener for above 10 years of age that reached 86.7% sensitivity and 53.4% specificity in CV. Given the complexity of these data, which contain many individuals with non-ASD developmental disabilities and/or psychiatric disorders who can be confused with ASD individuals on standardized ASD instruments, this performance represents a reasonable achievement. Moreover, the results point to current limitations of parental reports in distinguishing such difficult cases, the need for more comprehensive work-ups that go beyond caregiver reports to yield valid ASD diagnoses, and the potential utility of ML in designing customized algorithms for various purposes.
A principal methodological decision revolved around how to design (i.e., select codes for) our final proposed screener algorithms. The CV experiments, which sub-sampled the full data for training and testing, serve to estimate how well the ML approach will generalize to similar data. Specifically, we were able to observe the range of sensitivity and specificity across folds, as well as how consistently certain codes were selected. We argue that robust performance is more important than individual code selection for this task; since many of the codes are highly correlated, we can expect that some may be interchanged with little loss in performance. Although codes selected in numerous folds stand out as essential to estimating BEC in these data, the appropriate choice of screener items is the set selected in full-data training, since the procedure through which they were selected is empirically supported by the CV results.
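The sub-sampled CV procedure can be outlined as follows. The sketch uses synthetic data and substitutes a univariate filter (scikit-learn's SelectKBest) for the forward feature selection we actually used, purely to keep the example short:

    import numpy as np
    from collections import Counter
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.model_selection import StratifiedShuffleSplit
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(1)
    X = rng.integers(0, 4, size=(300, 20)).astype(float)  # synthetic codes
    y = rng.integers(0, 2, size=300)                      # synthetic BEC labels

    n_splits = 500
    splitter = StratifiedShuffleSplit(n_splits=n_splits, test_size=0.2,
                                      random_state=1)
    sens, spec, set_counts = [], [], Counter()

    for train, test in splitter.split(X, y):
        # Select 5 codes using the training sub-sample only.
        selector = SelectKBest(f_classif, k=5).fit(X[train], y[train])
        chosen = np.flatnonzero(selector.get_support())
        set_counts[tuple(int(i) for i in chosen)] += 1
        clf = LinearSVC().fit(X[train][:, chosen], y[train])
        pred = clf.predict(X[test][:, chosen])
        sens.append(np.mean(pred[y[test] == 1] == 1))
        spec.append(np.mean(pred[y[test] == 0] == 0))

    print(f"sensitivity: {np.mean(sens):.2f} (sd {np.std(sens):.2f})")
    print(f"specificity: {np.mean(spec):.2f} (sd {np.std(spec):.2f})")
    codes, n = set_counts.most_common(1)[0]
    print(f"most frequent code set {codes}: {100 * n / n_splits:.1f}% of folds")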
Based on the CV experiments for the ADI-R among verbal children below age 10, the three most important codes were ADI-R-Ever 33, 35, and 50; these codes assess stereotyped language, reciprocal conversation, and gaze, respectively. These three codes were reliably selected together in 53.2% of folds, while the other two selected codes were more variable. (Note that ADI-R-35 and ADI-R-50 were also selected in the experiments of Wall et al. (2012b), wherein the authors predicted ADI-R classification from ADI-R codes; this may indicate that these codes are also critical to the current ADI-R algorithm.) The proposed below-age-10 ADI-R screener (consisting only of the five codes from the full-data-training experiment) includes these critical codes plus ADI-R-Current-73 (abnormal response to sensory stimuli) and ADI-R-Ever-34 (social verbalization/chat). ADI-R-Current-73 falls under Restricted and Repetitive Behaviors, while social chat augments the other communication-oriented codes.
For the above-age-10 ADI-R/SRS screener, code selection was considerably more variable, although performance was still consistent. The two most selected codes, ADI-R-Current 35 (reciprocal conversation) and SRS-D33 (socially awkward, even when attempting to be polite), were rarely selected together (only 17.6% of folds). SRS-D33 is interesting because it probes a parent's concern about their child's social skills. Recall that no SRS codes were reliably selected in the Age 10- group; thus, it may be that parents of younger children tend to be less critical of their child's social skills and are less likely to use words such as "awkward" or "odd" to describe their young child, whereas these terms seem more appropriate for describing older children and adolescents. The screener algorithm trained on the full data included ADI-R-Ever 34, 47, 58, and 49 and SRS-D33.
4.1.6 Limitations
Several features of the sample potentially limit generalizability. While this is a relatively large sample, participants ranged from 4 to 55 years of age, which is a very wide range. Future studies should investigate individual differences using narrower age bands, and especially consider differences between adolescents/young adults and individuals in middle or later adulthood. In addition, due to the small number of nonverbal individuals, we were only able to include verbal individuals in our experiments and were therefore unable to offer suggestions about how best to reduce or modify parent-report instruments for individuals with minimal verbal abilities. Another important feature of this sample is that these data largely represent reports from self-referred parents. Algorithms derived from these data might perform differently if applied in general-population settings where parents might not be as concerned; this is particularly relevant for sensitivity, i.e., children whose parents are not seeking autism-specialty clinic evaluations are less likely to be picked up by the screener.
4.1.7 Implications for Future Research and Clinical Translation
ML has the potential to improve certain aspects of instrument design, particularly by decreasing redundant behavioral information and fusing multiple instruments. In general, the approach of using existing data from these instruments can inform future instrument revision and development. Taking a combined approach across multiple instruments may be especially informative, in that we can identify different methods of probing similar behaviors that are more or less useful.
We showed that ML-based instrument algorithms can be selectively tuned depending on the relative importance of Type-I and Type-II errors in a given setting. Using this approach, we developed screener algorithms that may support large-scale neurobiological studies; however, the algorithms should first be tested in independent populations with independent coding to ensure appropriate generalization across samples. Additionally, the approach we employed for feature selection through many folds of CV (given sufficient data) provides empirical information about the most critical codes. We found strong evidence that ADI-R-Ever 33, 35, and 50 are valuable for screening below age 10 years. Having identified certain constructs that appear to be particularly diagnostically salient, instrument revision efforts may focus on those areas of abnormality to maximize sensitivity and specificity.
Future research should also consider designing targeted algorithms for groups of children that share similar characteristics known to be important when measuring ASD symptoms (e.g., age, gender, IQ, language level). However, as mentioned above, it will first be necessary to obtain large enough numbers of participants who vary on these characteristics so as to ensure sufficient power within the different strata (e.g., nonverbal vs. verbal). Then we can use the different item sets identified for the different cells in the development or refinement of measures that can better account for these other variables.
Chapter 5
Conclusions and Future Work
5.1 Conclusions
This thesis describes completed research centered on the use of signal processing and machine learning applied to human behavior towards clinical translation in autism. Specifically, we examined this research problem in the context of three applications: atypical prosody; child-psychologist interaction; and autism diagnostic instrument algorithms.
Speech cues are critical to finer characterization of autism spectrum disorder, yet there has been little headway toward a generalizable operational definition of prosodic atypicalities in ASD; e.g., prevalence estimates for various prosodic abnormalities are still unknown. Fortunately, speech processing can provide scalable, objective measures to support scientific advances. We created a set of acoustic-prosodic features that are a step towards a signal-derived prosodic profile. We linked acoustic-prosodic cues to ASD severity and to general perceptions of speech prosody. Our findings highlight the importance of intonation, rate, and voice quality to atypical prosody.
The child's behavior should not be viewed in isolation, but rather with respect to the mutual effects seen in the psychologist's behavior and vice versa, since the psychologist is both interlocutor and evaluator. We demonstrated that prosodic, turn-taking, and language features drawn from child-psychologist interactions are indicative of the degradation in conversational quality for children with greater severity of ASD symptoms. In particular, we found that as ASD severity increases, the psychologist varies her speech and language strategies in attempts to engage the child in social interaction, while children with more severe ASD speak less and use fewer affect words and personal pronouns. We found that the psychologist's features were at least as predictive of the child's severity as the child's own features. Additionally, we found greater predictive power for ASD severity in subtasks with high social demand, while the psychologist's language cues were predictive even during minimal-speech, low-social-demand tasks.
Machine learning (ML) provides novel opportunities for human behavior research and clinical translation, yet its application can have noted pitfalls (Bone et al., 2015a). In our work, we fastidiously utilized ML to derive autism spectrum disorder (ASD) instrument algorithms in an attempt to improve upon widely used ASD screening and diagnostic tools. Algorithms were created via a robust ML classifier, the support vector machine (SVM), while targeting best-estimate clinical diagnosis of ASD vs. non-ASD. The created algorithms were more effective (higher performing) than current algorithms, were tunable (sensitivity and specificity can be differentially weighted), and were more efficient (achieving near-peak performance with five or fewer codes). We presented a screener algorithm for below (above) age 10 that reached 89.2% (86.7%) sensitivity and 59.0% (53.4%) specificity with only five behavioral codes. ML is useful for creating robust, customizable instrument algorithms. In a unique dataset comprising controls with other difficulties, our findings highlight limitations of current caregiver-report instruments and indicate possible avenues for improving ASD screening and diagnostic tools.
5.2 Open Problems and Future Work
Speech prosody remains a critical research area in autism spectrum disorder, one in which objective assessment can have true impact in characterizing and tracking prosodic deficits. However, a primary reason that speech prosody is understudied in autism is the difficulty of modeling it during conversational speech, owing to its variability and dynamic nature. Initial studies were limited to features like the mean and standard deviation of pitch and intensity. The following are suggestions for future research towards the goal of creating a computational characterization of prosody in neurodevelopmental disorders.
Optimal data collection: Collected data should have high quality (to support complex feature extraction), high consistency, and ecological validity. Spontaneous speech is much preferred over read speech, given its relevance to this social-communicative disorder.
Maintaining Interpretability: A great appeal of engineering methods is that complex models can be created that humans need not understand, e.g., deep learning. However, in this particular problem domain, the primary drawback is that interpretability is largely abandoned. With a loss of interpretability, it is difficult to track why a system is successful (which may be for dubious reasons; Bone et al., 2013a), and it is less clear whether the system will generalize to independent, uniquely collected data (compared with, for example, knowledge-driven approaches; Bone et al., 2014b).
Selecting a Ground Truth: Supervised learning necessitates a ground truth. However, atypical prosody is a construct that, while critically important, has no reliable ground truth. Two choices are apparent: ASD/non-ASD diagnosis or human judgment. ASD behavior is highly variable; since not all children have the same deficits, ASD diagnosis cannot be equated with "autistic" prosody. Alternatively, human judgment is unreliable, especially for untrained raters (Bone et al., 2015b). Thus, it is our suggestion that future studies simultaneously analyze the relevance of prosodic features against both ground truths. Moreover, it may be necessary that the final objective measures be entirely bottom-up, derived from and defined by signals. Such a rule-based approach would come with its own difficulties in generalization, but would be one solution to creating a fully objective definition of atypical prosody.
Given that autism is a social disorder, joint analysis of behavior should continue to be explored. Future work may investigate longitudinal changes in the psychologist's behavior following intervention. Parents and peers may also be modeled. In fact, when the child is nonverbal, we may still gain knowledge from the psychologist's behavior.
Machine learning has many potential uses in ASD research and intervention. Future research should also consider designing targeted algorithms for groups of children that share similar characteristics known to be important when measuring ASD symptoms (e.g., age, gender, IQ, language level). However, it will first be necessary to obtain large enough numbers of participants who vary on these characteristics so as to ensure sufficient power within the different strata (e.g., nonverbal vs. verbal). Then we can use the different item sets identified for the different cells in the development or refinement of measures that can better account for these other variables.
Reference List
Abrahams, B.S., Geschwind, D.H., 2010. Genetics of autism, in: Vogel and Motulsky's Human Genetics. Springer, pp. 699–714.
American Psychiatric Association, 2013. DSM-5. American Psychiatric Association.
American Speech-Language-Hearing Association, 2007a. Childhood Apraxia of Speech [Position Statement]. Available from www.asha.org/policy.
American Speech-Language-Hearing Association, 2007b. Childhood Apraxia of Speech [Tech Report]. Available from www.asha.org/policy.
APA, 2000. Diagnostic and Statistical Manual of Mental Disorders, 4th ed., text revision. American Psychiatric Association, Washington, D.C.
Bachorowski, J.A., 1999. Vocal expression and perception of emotion. Current Directions in Psychological Science 8, 53–57.
Bachorowski, J.A., Owren, M.J., 1995. Vocal expression of emotion: Acoustic properties of speech are associated with emotional intensity and context. Psychological Science 6, 219–224.
Baio, J., 2014. Prevalence of autism spectrum disorder among children aged 8 years – Autism and Developmental Disabilities Monitoring Network, 11 sites, United States, 2010. Morbidity and Mortality Weekly Report, Surveillance Summaries, Volume 63, Number 1. Centers for Disease Control and Prevention.
Baltaxe, C.A., 1977. Pragmatic deficits in the language of autistic adolescents. Journal of Pediatric Psychology 2, 176–180.
Baron-Cohen, S., 1988. Social and pragmatic deficits in autism: Cognitive or affective? Journal of Autism and Developmental Disorders 18, 379–401.
Bernieri, F.J., Reznick, J.S., Rosenthal, R., 1988. Synchrony, pseudosynchrony, and dissynchrony: Measuring the entrainment process in mother-infant interactions. Journal of Personality and Social Psychology 54, 243.
Black, M.P., Bone, D., Skordilis, Z.I., Gupta, R., Xia, W., Papadopoulos, P., Chakravarthula, S.N., Xiao, B., Van Segbroeck, M., Kim, J., et al., 2015a. Automated evaluation of non-native English pronunciation quality: Combining knowledge- and data-driven features at multiple time scales, in: Sixteenth Annual Conference of the International Speech Communication Association.
Black, M.P., Bone, D., Skordilis, Z.I., Gupta, R., Xia, W., Papadopoulos, P., Chakravarthula, S.N., Xiao, B., Van Segbroeck, M., Kim, J., et al., 2015b. Automated evaluation of non-native English pronunciation quality: Combining knowledge- and data-driven features at multiple time scales, in: Sixteenth Annual Conference of the International Speech Communication Association.
Black, M.P., Bone, D., Williams, M.E., Gorrindo, P., Levitt, P., Narayanan, S.S., 2011a. The USC CARE Corpus: Child-psychologist interactions of children with autism spectrum disorders, in: Proceedings of Interspeech.
Black, M.P., Tepperman, J., Narayanan, S.S., 2011b. Automatic prediction of children's reading ability for high-level literacy assessment. IEEE Transactions on Audio, Speech, and Language Processing 19, 1015–1028.
Boersma, P., 2001a. Praat, a system for doing phonetics by computer. Glot International 5, 341–345.
Boersma, P., 2001b. Praat, a system for doing phonetics by computer. Glot International 5, 341–345.
Boker, S.M., Rotondo, J.L., Xu, M., King, K., 2002. Windowed cross-correlation and peak picking for the analysis of variability in the association between behavioral time series. Psychological Methods 7, 338.
Bone, D., Bishop, S., Black, M.P., Goodwin, M.S., Lord, C., Narayanan, S.S., 2016a. Use of machine learning to improve autism screening and diagnostic instruments: effectiveness, efficiency, and multi-instrument fusion. Journal of Child Psychology and Psychiatry.
Bone, D., Bishop, S., Gupta, R., Lee, S., Narayanan, S., 2016b. Acoustic-prosodic and turn-taking features in interactions with children with neurodevelopmental disorders.
Bone, D., Black, M.P., Lee, C.C., Williams, M.E., Levitt, P., Lee, S., Narayanan, S., 2012b. Spontaneous-speech acoustic-prosodic features of children with autism and the interacting psychologist, in: INTERSPEECH, pp. 1043–1046.
Bone, D., Black, M.P., Lee, C.C., Williams, M.E., Levitt, P., Lee, S., Narayanan, S., 2014a. The psychologist as an interlocutor in autism spectrum disorder assessment: Insights from a study of spontaneous prosody. Journal of Speech, Language, and Hearing Research (in press).
Bone, D., Black, M.P., Ramakrishna, A., Grossman, R., Narayanan, S., 2015b. Acoustic-prosodic correlates of awkward prosody in story retellings from adolescents with autism.
Bone, D., Chaspari, T., Audhkhasi, K., Gibson, J., Tsiartas, A., Van Segbroeck, M., Li, M., Lee, S., Narayanan, S., 2013a. Classifying language-related developmental disorders from speech cues: the promise and the potential confounds, in: INTERSPEECH, pp. 182–186.
Bone, D., Goodwin, M.S., Black, M.P., Lee, C.C., Audhkhasi, K., Narayanan, S., 2015a. Applying machine learning to facilitate autism diagnostics: Pitfalls and promises. Journal of Autism and Developmental Disorders 45, 1121–1136.
Bone, D., Lee, C.C., Chaspari, T., Black, M., Williams, M., Lee, S., Levitt, P., Narayanan, S., 2013b. Acoustic-prosodic, turn-taking, and language cues in child-psychologist interactions for varying social demand, in: INTERSPEECH.
Bone, D., Lee, C.C., Narayanan, S., 2014b. Robust unsupervised arousal rating: A rule-based framework with knowledge-inspired vocal features. IEEE Transactions on Affective Computing (accepted).
Bone, D., Lee, C.C., Narayanan, S.S., 2012. A robust unsupervised arousal rating framework using prosody with cross-corpora evaluation, in: INTERSPEECH, pp. 1175–1178.
Bone, D., Lee, C.C., Potamianos, A., Narayanan, S., 2014c. An investigation of vocal arousal dynamics in child-psychologist interactions using synchrony measures and a conversation-based model, in: Interspeech.
Boucher, M.J., Andrianopoulos, M.V., Velleman, S.L., Keller, L.A., Pecora, L., 2011. Assessing vocal characteristics of spontaneous speech in children with autism. Paper presented at the American Speech-Language-Hearing Association.
Brown, B.T., Morris, G., Nida, R.E., Baker-Ward, L., 2012. Brief report: Making experience personal: Internal states language in the memory narratives of children with and without Asperger's disorder. Journal of Autism and Developmental Disorders 42, 441–446.
Busso, C., Lee, S., Narayanan, S., 2009. Analysis of emotionally salient aspects of fundamental frequency for emotion detection. IEEE Transactions on Audio, Speech, and Language Processing 17, 582–596.
Chang, C.C., Lin, C.J., 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST) 2, 27.
Chaspari, T., Bone, D., Gibson, J., Lee, C., Narayanan, S.S., 2013. Using physiology and language cues for modeling verbal response latencies of children with ASD, in: Proceedings of ICASSP.
Constantino, J.N., Gruber, C.P., 2007. Social Responsiveness Scale (SRS). Western Psychological Services, Los Angeles, CA.
Cruttenden, A., 1997. Intonation. Cambridge University Press.
Dawson, G., Rogers, S., Munson, J., Smith, M., Winter, J., Greenson, J., Donaldson, A., Varley, J., 2010. Randomized, controlled trial of an intervention for toddlers with autism: The Early Start Denver Model. Pediatrics 125, e17–e23.
De Looze, C., Hirst, D., 2014. The OMe (octave-median) scale: A natural scale for speech prosody, in: Proceedings of the 7th International Conference on Speech Prosody (SP7).
Diehl, J.J., Watson, D., Bennetto, L., McDonough, J., Gunlogson, C., 2009. An acoustic analysis of prosody in high-functioning autism. Applied Psycholinguistics 30, 385–404.
Dunn, L.M., Dunn, L.M., 1981. Peabody Picture Vocabulary Test-Revised: PPVT-R. American Guidance Service.
Duong, M., Mostow, J., Sitaram, S., 2011. Two methods for assessing oral reading prosody. ACM Transactions on Speech and Language Processing (TSLP) 7, 14.
Eyben, F., Wöllmer, M., Schuller, B., 2010. openSMILE: The Munich versatile and fast open-source audio feature extractor, in: Proceedings of the International Conference on Multimedia, ACM, pp. 1459–1462.
Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J., 2008. LIBLINEAR: A library for large linear classification. The Journal of Machine Learning Research 9, 1871–1874.
Feldman, R., 2007. On the origins of background emotions: From affect synchrony to symbolic expression. Emotion 7, 601.
Feldman, R., Magori-Cohen, R., Galili, G., Singer, M., Louzoun, Y., 2011. Mother and infant coordinate heart rhythms through episodes of interaction synchrony. Infant Behavior and Development 34, 569–577.
Fontaine, J.R., Scherer, K.R., Roesch, E.B., Ellsworth, P.C., 2007. The world of emotions is not two-dimensional. Psychological Science 18, 1050–1057.
Frith, U., 2001. Mind blindness and the brain in autism. Neuron 32, 969–979.
Frith, U., Happé, F., 2005. Autism spectrum disorder. Current Biology 15, R786–R790.
Furrow, D., 1984. Young children's use of prosody. Journal of Child Language 11, 203–213.
García-Pérez, R.M., Lee, A., Hobson, R.P., 2007. On intersubjective engagement in autism: A controlled study of nonverbal aspects of conversation. Journal of Autism and Developmental Disorders 37, 1310–1322.
Gelfer, M.P., 1988. Perceptual attributes of voice: Development and use of rating scales. Journal of Voice 2, 320–326.
Gotham, K., Pickles, A., Lord, C., 2009. Standardizing ADOS scores for a measure of severity in autism spectrum disorders. Journal of Autism and Developmental Disorders 39, 693–705.
Gotham, K., Risi, S., Pickles, A., Lord, C., 2007. The Autism Diagnostic Observation Schedule: Revised algorithms for improved diagnostic validity. Journal of Autism and Developmental Disorders 37, 613–627.
Grabe, E., Low, E.L., 2002. Durational variability in speech and the rhythm class hypothesis. Papers in Laboratory Phonology 7.
Granger, C.W., 1969. Investigating causal relations by econometric models and cross-spectral methods. Econometrica: Journal of the Econometric Society, 424–438.
Grossman, R.B., 2014. Judgments of social awkwardness from brief exposure to children with and without high-functioning autism. Autism.
Grossman, R.B., Edelson, L.R., Tager-Flusberg, H., 2013. Emotional facial and vocal expressions during story retelling by children and adolescents with high-functioning autism. Journal of Speech, Language, and Hearing Research 56, 1035–1044.
Halberstam, B., 2004. Acoustic and perceptual parameters relating to connected speech are more reliable measures of hoarseness than parameters relating to sustained vowels. ORL 66, 70–73.
Hamilton, J.D., 1994. Time Series Analysis. Volume 2. Princeton University Press, Princeton.
Harrist, A.W., Waugh, R.M., 2002. Dyadic synchrony: Its structure and function in children's development. Developmental Review 22, 555–592.
Heeman, P.A., Lunsford, R., Selfridge, E., Black, L., Van Santen, J., 2010. Autism and interactional aspects of dialogue, in: Proceedings of the 11th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Association for Computational Linguistics, pp. 249–252.
Heman-Ackah, Y.D., Heuer, R.J., Michael, D.D., Ostrowski, R., Horman, M., Baroody, M.M., Hillenbrand, J., Sataloff, R.T., 2003. Cepstral peak prominence: A more reliable measure of dysphonia. Annals of Otology, Rhinology and Laryngology 112, 324–333.
Hillenbrand, J., Cleveland, R.A., Erickson, R.L., 1994. Acoustic correlates of breathy vocal quality. Journal of Speech and Hearing Research 37, 769.
Hillenbrand, J., Houde, R.A., 1996. Acoustic correlates of breathy vocal quality: Dysphonic voices and continuous speech. Journal of Speech and Hearing Research 39, 311.
Hirst, D., 2007. A Praat plugin for Momel and INTSINT with improved algorithms for modelling and coding intonation, in: Proceedings of the XVIth International Congress of Phonetic Sciences.
Hirst, D., Di Cristo, A., Espesser, R., 2000. Levels of representation and levels of analysis for the description of intonation systems, in: Prosody: Theory and Experiment. Springer, pp. 51–87.
Hönig, F., Batliner, A., Nöth, E., 2011. Does it groove or does it stumble – automatic classification of alcoholic intoxication using prosodic features, in: INTERSPEECH, pp. 3225–3228.
Huerta, M., Bishop, S.L., Duncan, A., Hus, V., Lord, C., 2012. Application of DSM-5 criteria for autism spectrum disorder to three samples of children with DSM-IV diagnoses of pervasive developmental disorders. American Journal of Psychiatry.
Hus, V., Lord, C., 2013. Effects of child characteristics on the Autism Diagnostic Interview-Revised: Implications for use of scores as a measure of ASD severity. Journal of Autism and Developmental Disorders 43, 371–381.
Jayagopi, D.B., Hung, H., Yeo, C., Gatica-Perez, D., 2009. Modeling dominance in group conversations using nonverbal activity cues. IEEE Transactions on Audio, Speech, and Language Processing 17, 501–513.
Jones, C.D., Schwartz, I.S., 2009. When asking questions is not enough: An observational study of social communication differences in high functioning children with autism. Journal of Autism and Developmental Disorders 39, 432–443.
Jordan, R.R., 1989. An experimental comparison of the understanding and use of speaker-addressee personal pronouns in autistic children. International Journal of Language & Communication Disorders 24, 169–179.
Juslin, P., Scherer, K., 2005. The New Handbook of Methods in Nonverbal Behavior Research. Oxford: Oxford University Press. Chapter 3, Vocal Expression of Affect, pp. 65–135.
Kalimeri, K., Lepri, B., Aran, O., Jayagopi, D.B., Gatica-Perez, D., Pianesi, F., 2012. Modeling dominance effects on nonverbal behaviors using Granger causality, in: Proceedings of the 14th ACM International Conference on Multimodal Interaction, ACM, pp. 23–26.
Katsamanis, A., Black, M., Georgiou, P.G., Goldstein, L., Narayanan, S., 2011. SailAlign: Robust long speech-text alignment, in: Proc. of Workshop on New Tools and Methods for Very-Large Scale Phonetics Research.
Kimura, M., Daibo, I., 2006. Interactional synchrony in conversations about emotional episodes: A measurement by the between-participants pseudosynchrony experimental paradigm. Journal of Nonverbal Behavior 30, 115–126.
Knapp, M., Hall, J., Horgan, T., 2013. Nonverbal Communication in Human Interaction. Cengage Learning.
Kosmicki, J., Sochat, V., Duda, M., Wall, D., 2015. Searching for a minimal set of behaviors for autism detection through feature selection-based machine learning. Translational Psychiatry 5, e514.
Kroeber-Riel, W., 1979. Activation research: Psychobiological approaches in consumer research. Journal of Consumer Research, 240–250.
Lambourne, K., Tomporowski, P., 2010. The effect of exercise-induced arousal on cognitive task performance: A meta-regression analysis. Brain Research 1341, 12–24.
Lee, C., Katsamanis, A., Black, M., Baucom, B., Georgiou, P., Narayanan, S., 2011. An analysis of PCA-based vocal entrainment measures in married couples' affective spoken interactions, in: Proceedings of Interspeech.
Lee, C.C., Katsamanis, A., Black, M.P., Baucom, B.R., Christensen, A., Georgiou, P.G., Narayanan, S.S., 2014. Computing vocal entrainment: A signal-derived PCA-based quantification scheme with application to affect analysis in married couple interactions. Computer Speech & Language 28, 518–539.
Lee, C.M., Narayanan, S., 2005. Towards detecting emotions in spoken dialogs. IEEE TASLP 13, 293–302.
Leith, K.P., Baumeister, R.F., 1996. Why do bad moods increase self-defeating behavior? Emotion, risk taking, and self-regulation. Journal of Personality and Social Psychology 71, 1250.
Levitt, P., Campbell, D.B., 2009. The genetic and neurobiologic compass points toward common signaling dysfunctions in autism spectrum disorders. The Journal of Clinical Investigation 119, 747.
Li, X., Tao, J., Johnson, M.T., Soltis, J., Savage, A., Leong, K.M., Newman, J.D., 2007. Stress and emotion classification using jitter and shimmer features, in: 2007 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp. IV-1081.
Lord, C., Jones, R.M., 2012. Annual research review: Re-thinking the classification of autism spectrum disorders. Journal of Child Psychology and Psychiatry 53, 490–509.
Lord, C., Risi, S., Lambrecht, L., Cook, E., Leventhal, B., DiLavore, P., Pickles, A., Rutter, M., 2000. The Autism Diagnostic Observation Schedule-Generic: A standard measure of social and communication deficits associated with the spectrum of autism. Journal of Autism and Developmental Disorders 30, 205–223.
Lord, C., Rutter, M., DiLavore, P., Risi, S., 1999. Autism Diagnostic Observation Schedule-WPS edition. Los Angeles, CA: Western Psychological Services.
Lord, C., Rutter, M., Le Couteur, A., 1994a. Autism Diagnostic Interview-Revised: A revised version of a diagnostic interview for caregivers of individuals with possible pervasive developmental disorders. Journal of Autism and Developmental Disorders 24, 659–685.
Lord, C., Rutter, M., Le Couteur, A., 1994b. Autism Diagnostic Interview-Revised: A revised version of a diagnostic interview for caregivers of individuals with possible pervasive developmental disorders. Journal of Autism and Developmental Disorders 24, 659–685.
McAllister, A., Sundberg, J., Hibi, S.R., 1998. Acoustic measurements and perceptual evaluation of hoarseness in children's voices. Logopedics Phoniatrics Vocology 23.
McCann, J., Peppe, S., 2003. Prosody in autism spectrum disorders: A critical review. International Journal of Language & Communication Disorders 38, 325–350.
Mesibov, G.B., Schopler, E., Hearsey, K.A., 1994. Structured teaching. Behavioral Issues in Autism, 195–207.
Messinger, D.S., Mahoor, M.H., Chow, S.M., Cohn, J.F., 2009. Automated measurement of facial expression in infant–mother interaction: A pilot study. Infancy 14, 285–305.
Miller, J., Iglesias, A., 2008. Systematic Analysis of Language Transcripts (SALT), English & Spanish (Version 9) [Computer software]. Madison: University of Wisconsin–Madison, Waisman Center, Language Analysis Laboratory.
Miller, J., Smith, A., 1983. SALT transcription manual: Guidelines for transcribing free speech samples. Unpublished paper, Language Analysis Lab, University of Wisconsin–Madison.
Molloy, C.A., Murray, D.S., Akers, R., Mitchell, T., Manning-Courtney, P., 2011. Use of the Autism Diagnostic Observation Schedule (ADOS) in a clinical setting. Autism 15, 143–162.
Morency, L., Quattoni, A., Darrell, T., 2007. Latent-dynamic discriminative models for continuous gesture recognition, in: 2007 IEEE Conference on Computer Vision and Pattern Recognition (CVPR'07), IEEE, pp. 1–8.
Narayanan, S., Georgiou, P.G., 2013. Behavioral signal processing: Deriving human behavioral informatics from speech and language. Proceedings of the IEEE PP, 1–31.
Paccia, J., Curcio, F., 1982. Language processing and forms of immediate echolalia in autistic children. Journal of Speech and Hearing Research 25, 42–47.
Paul, D., Baker, J., 1992. The design for the Wall Street Journal-based CSR corpus, in: Proceedings of the Workshop on Speech and Natural Language, Association for Computational Linguistics, pp. 357–362.
Paul, R., Augustyn, A., Klin, A., Volkmar, F.R., 2005a. Perception and production of prosody by speakers with autism spectrum disorders. Journal of Autism and Developmental Disorders 35, 205–220.
Paul, R., Shriberg, L.D., McSweeny, J., Cicchetti, D., Klin, A., Volkmar, F., 2005b. Brief report: Relations between prosodic performance and communication and socialization ratings in high functioning speakers with autism spectrum disorders. Journal of Autism and Developmental Disorders 35, 861–869.
Pennebaker, J.W., Francis, M.E., Booth, R.J., 2001. Linguistic Inquiry and Word Count: LIWC 2001. Mahway: Lawrence Erlbaum Associates.
Peppe, S., 2011. Speech Prosody in Atypical Populations. J&R Press Ltd. Chapter 1, pp. 1–23.
Peppe, S., McCann, J., Gibbon, F., O'Hare, A., Rutherford, M., 2007. Receptive and expressive prosodic ability in children with high-functioning autism. Journal of Speech, Language, and Hearing Research 50, 1015–1028.
Ploog, B.O., Banerjee, S., Brooks, P.J., 2009. Attention to prosody (intonation) and content in children with autism and in typical children using spoken sentences in a computer game. Research in Autism Spectrum Disorders 3, 743–758.
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J., Stemmer, G., Vesely, K., 2011. The Kaldi speech recognition toolkit, in: IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, IEEE Signal Processing Society. IEEE Catalog No.: CFP11SRW-USB.
Prizant, B.M., Wetherby, A.M., Rubin, E., Laurent, A.C., 2003. The SCERTS model: A transactional, family-centered approach to enhancing communication and socioemotional abilities of children with autism spectrum disorder. Infants & Young Children 16, 296–316.
Pronovost, W., Wakstein, M.P., Wakstein, D.J., 1966. A longitudinal study of the speech behavior and language comprehension of fourteen children diagnosed atypical or autistic. Exceptional Children.
Prud'hommeaux, E.T., Roark, B., Black, L.M., van Santen, J., 2011. Classification of atypical language in autism. ACL HLT 2011, 88.
Ramus, F., 2002. Acoustic correlates of linguistic rhythm: Perspectives.
Rosenberg, A., 2012. Classifying skewed data: Importance weighting to optimize average recall, in: INTERSPEECH, pp. 2242–2245.
Rutter, M., Le Couteur, A., Lord, C., 2003. Autism Diagnostic Interview-Revised. Los Angeles, CA: Western Psychological Services.
van Santen, J.P.H., Prud'hommeaux, E.T., Black, L.M., Mitchell, M., 2010. Computational prosodic markers for autism. Autism 14, 215–236.
Schuller, B., Rigoll, G., Lang, M., 2003. Hidden Markov model-based speech emotion recognition, in: 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'03), IEEE, pp. II-1.
Seth, A.K., 2005. Causal connectivity of evolved neural networks during behavior. Network: Computation in Neural Systems 16, 35–54.
Seth, A.K., 2010. A MATLAB toolbox for Granger causal connectivity analysis. Journal of Neuroscience Methods 186, 262–273.
Sheinkopf, S.J., Mundy, P., Oller, D.K., Steffens, M., 2000. Vocal atypicalities of preverbal autistic children. Journal of Autism and Developmental Disorders 30(4), 345–54.
Shobaki, K., Hosom, J., Cole, R., 2000. The OGI Kids' Speech corpus and recognizers, in: Proceedings of ICSLP.
Shriberg, E., Stolcke, A., Jurafsky, D., Coccaro, N., Meteer, M., Bates, R., Taylor, P., Ries, K., Martin, R., Van Ess-Dykema, C., 1998. Can prosody aid the automatic classification of dialog acts in conversational speech? Language and Speech 41, 443–492.
Shriberg, L.D., Austin, D., Lewis, B.A., McSweeny, J.L., Wilson, D.L., 1997. The Speech Disorders Classification System (SDCS): Extensions and lifespan reference data. Journal of Speech, Language, and Hearing Research 40, 723.
Shriberg, L.D., Kwiatkowski, J., Rasmussen, C., Lof, G.L., Miller, J.F., 1992. The Prosody-Voice Screening Profile (PVSP): Psychometric data and reference information for children.
Shriberg, L.D., Paul, R., Black, L.M., van Santen, J.P., 2011. The hypothesis of apraxia of speech in children with autism spectrum disorder. Journal of Autism and Developmental Disorders 41, 405–426.
Shriberg, L.D., Paul, R., McSweeny, J.L., Klin, A., Cohen, D.J., Volkmar, F.R., 2001. Speech and prosody characteristics of adolescents and adults with high-functioning autism and Asperger syndrome. Journal of Speech, Language, and Hearing Research 44, 1097–1115.
Shue, Y.L., Keating, P., Vicenik, C., Yu, K., 2010. VoiceSauce: A program for voice analysis. Energy 1, H1–A1.
Shue, Y.L., Keating, P., Vicenik, C., Yu, K., 2011. VoiceSauce: A program for voice analysis, in: Proceedings of the 17th International Congress of Phonetic Sciences, pp. 1846–1849.
Siller, M., Sigman, M., 2002. The behaviors of parents of children with autism predict the subsequent development of their children's communication. Journal of Autism and Developmental Disorders 32, 77–89.
Simmons, E.S., Paul, R., Shic, F., 2016. Brief report: A mobile application to treat prosodic deficits in autism spectrum disorder and other communication impairments: A pilot study. Journal of Autism and Developmental Disorders 46, 320–327.
Sönmez, M.K., Heck, L., Weintraub, M., Shriberg, E., 1997. A lognormal tied mixture model of pitch for prosody-based speaker recognition.
Stolcke, A., Ries, K., Coccaro, N., Shriberg, E., Bates, R., Jurafsky, D., Taylor, P., Martin, R., Van Ess-Dykema, C., Meteer, M., 2000. Dialogue act modeling for automatic tagging and recognition of conversational speech. Computational Linguistics 26, 339–373.
Uldall, E., 1960. Attitudinal meanings conveyed by intonation contours. Language and Speech 3, 223–234.
Vernon, T.W., Koegel, R.L., Dauterman, H., Stolen, K., 2012. An early social engagement intervention for young children with autism and their parents. Journal of Autism and Developmental Disorders 42, 2702–2717.
Wall, D., Kosmicki, J., Deluca, T., Harstad, E., Fusaro, V., 2012a. Use of machine learning to shorten observation-based screening and diagnosis of autism. Translational Psychiatry 2, e100.
Wall, D.P., Dally, R., Luyster, R., Jung, J.Y., DeLuca, T.F., 2012b. Use of artificial intelligence to shorten the behavioral diagnosis of autism. PLoS ONE 7, e43855.
Wells, B., Macfarlane, S., 1998. Prosody as an interactional resource: Turn-projection and overlap. Language and Speech 41, 265–294.
Wieder, S., Greenspan, S.I., 2003. Climbing the symbolic ladder in the DIR model through floor time/interactive play. Autism 7, 425–435.
Wing, L., 1988. The continuum of autistic characteristics, in: Diagnosis and Assessment in Autism. Springer, pp. 91–110.
Witt, S.M., Young, S.J., 2000. Phone-level pronunciation scoring and assessment for interactive language learning. Speech Communication 30, 95–108.
Wöllmer, M., Eyben, F., Schuller, B.W., Rigoll, G., 2012. Temporal and situational context modeling for improved dominance recognition in meetings, in: INTERSPEECH.
Woodcock, R.W., McGrew, K., Mather, N., 2001. Woodcock-Johnson Tests of Achievement. Itasca, IL: Riverside Publishing.
Yik, M.S., Russell, J.A., Barrett, L.F., 1999. Structure of self-reported current affect: Integration and beyond. Journal of Personality and Social Psychology 77, 600.
Young, M., Landy, M., Maloney, L., 1993. A perturbation analysis of depth perception from combinations of texture and motion cues. Vision Research 33, 2685–2696.
Abstract
This thesis concerns human-centered signal processing and machine learning, with a focus on creating engineering techniques and systems for societal applications in human health and well-being. Specifically, I aim to develop novel computational methods and tools that will support clinicians and researchers in the domain of autism spectrum disorder (ASD), which has a population prevalence of 1 in 68 (Baio, 2014). Computational methods of behavioral characterization can augment the clinician's analytical capabilities in diagnosis, personalized intervention, and long-term monitoring. Computational dimensional descriptors of behavior may be integral to further developments in the biology and neurology of ASD.