INDIVIDUAL DIFFERENCES IN PHONETIC VARIABILITY
AND PHONOLOGICAL REPRESENTATION
by
Sarah Kolin Harper
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(LINGUISTICS)
December 2021
Copyright 2021 Sarah Kolin Harper
Dedication
This dissertation is dedicated to my grandmothers, Myrna Kolin and Anna Harper.
Acknowledgements
Completing a Ph.D. is hard. Completing a Ph.D. in the middle of a global pandemic when
your research involves putting sensors on people’s tongues should have been the stuff of
nightmares. Fortunately, it was not, due in no small part to the guidance and encouragement of
the many people who have helped to make this dissertation possible. I cannot thank them
enough.
First, endless thanks go out to Louis Goldstein for the insight you have shared and the
time you have devoted to helping me grow as a scientist and a person. Some of the most
challenging and, simultaneously, the most rewarding intellectual experiences of my life have
occurred while sitting in your office or on Zoom, with both of us trying to figure out how best to
address some new issue or thorny theoretical question that has popped up with my dissertation
research. Thank you for giving me both the freedom to explore the questions I was interested in
and the guidance I needed to adequately explore them, and for having the foresight to decide that
I needed to come up with a plan for running the rest of my dissertation experiments online as
soon as everything started to shut down in March 2020.
I am extremely grateful for the guidance from my other committee members, and the
various ways in which they have impacted both my intellectual and professional development.
Dani Byrd’s extensive knowledge in phonetics and phonology has proved invaluable in helping
me hone my intellectual arguments, both in my dissertation research and in many other projects
during graduate school, as well as my skills in experiment design and data presentation. Thank
you for everything you have taught me about professionalization and how to succeed as a woman
in academia (or any field, really), and for deciding during my second year that you were going to
make sure I knew how to use all of the equipment in the Phonetics Lab. I have also benefited
substantially from the perspective Jason Zevin has had to offer on my research, coming from a
more psychology-oriented background, and from his willingness to go along with some of my
more out-there research ideas. Thanks for letting me hang out at your lab meetings for the past
six years, even though I usually didn’t have a reason for being there beyond general curiosity.
I have been lucky to have the support of the faculty in the Linguistics department and
other departments at USC. I have particularly benefited from my interactions with Khalil
Iskarous, Shri Narayanan, and Rachel Walker, whose feedback has helped shape my work
throughout my graduate school career. Thank you as well to the wonderful USC Linguistics
staff, Guillermo Ruiz and Lisa Jo Keefer, for all of the logistical help and non-work-related
conversation you have provided over the years. I am also fortunate to have benefited from my
involvement with the USC Phon group and SPAN during my time at USC, and to have been
supported by the NIH through the Hearing and Communication Neuroscience program. The
exposure to research from various disciplines that I received as part of the HCN community has
had a critical impact on my own research interests, and the training I received has undoubtedly
helped make me a better scientist.
I doubt that I would have ended up pursuing a Ph.D. to begin with without the fabulous
undergraduate education in Linguistics that I received at the University of Michigan, and very
specifically without the influence of Pam Beddor and Nick Henriksen. I am a scientist because of
their influence and the love for scientific research, and for phonetics specifically, that they
fostered in me. I am immensely grateful for the continued support I have received from them, as
well as Jelena Krivokapić (whose prosody seminar sparked an obsession with articulatory
research that’s still going strong), throughout my graduate career.
My graduate school experience was made so much better by the presence and friendship
of my fellow grad students in USC Linguistics. Thank you to my cohort, Jesse Storbeck, Tanner
Sorensen, Yifan Yang and Betül Erbaşı, for being the first friends I made at USC and my
constant companions during this wild ride. Reed Blaylock has had the unenviable job of talking
me through more academic (and non-academic) crises than I can count, and I am immensely
grateful that he’s deigned to put up with me and my hijacking of his social circles for the past
decade and a bit. I’m so thankful for the many enjoyable (and often super random) conversations
I’ve shared with Mairym Llorens Monteserin, Miran Oh, Sarah Lee, Charlie O’Hara, Ana
Besserman, Maury Courtland, Cynthia Lee, Samantha Gordon Danner, Caitlin Smith, Yubin
Zhang, Hayeun Jang, Yijing Lu, Chloe Gfeller, Ian Rigby, Jessie Johnson, Daniel Plesniak, Binh
Ngo, Jessica Campbell, and many others while working in the “Big Office,” during Tea Time,
and just when running into each other in the hallway. Many thanks as well to my friends outside
of USC Linguistics, particularly those from the USC Ballroom Dance Team, for your
encouragement and for never wanting to talk about my work. Special thanks to Max Pflueger for
the much-needed work breaks provided by our “coffee chats.”
Eric, your love, support, and strategically timed distractions have made it so much easier
to survive graduate school with my mental and physical health intact. Thank you for driving
down to LA at least once a month for years, for listening to numerous freak-outs about things
that ended up not actually being that big of a deal, for all of the walks, and for sitting with a
blanket thrown over your head for half an hour because I wanted you to record something in a
“makeshift sound booth.” Words cannot convey how grateful I am for your presence in my life.
An entire universe of gratitude must go to my family for a lifetime of unwavering love
and for being my constant cheerleaders. Thank you so much to my parents for inspiring the love
of learning that guides everything I do, even when it meant unending questions about anything
and everything, and for the support and advice you’ve given me over the past 29 years. Thanks
to you and to Haley for always keeping it real and making me laugh.
And finally, thank you to mid-2000s pop punk and The Great British Baking Show for
providing the background noise necessary for me to focus and get this dissertation done.
Table of Contents
Dedication
Acknowledgements
Table of Contents
List of Tables
List of Figures
Abstract
1. Introduction
1.1. Notes on terminology
1.2. Outline of the dissertation
2. Individual differences in articulatory and acoustic variability
2.1. Introduction
2.1.1. Variation in the production of American English coronal consonants
2.1.2. Articulatory-acoustic relationships
2.2. Study goals
2.3. Methods
2.3.1. Corpus data
2.3.2. Articulatory analysis
2.3.2.1. Identification of temporal landmarks
2.3.2.2. Constriction measurement
2.3.3. Acoustic analysis
2.3.3.1. Formant measurements
2.3.3.2. Spectral measurements
2.3.3.3. MFCC calculations
2.3.4. Calculation of dispersion
2.3.5. Statistical analysis
2.3.5.1. Analysis of individual differences in articulatory and acoustic variability
2.3.5.2. Analysis of articulatory-acoustic relations and recoverability of variability
2.4. Results
2.4.1. Interspeaker differences in articulatory and acoustic variability
2.4.1.1. Articulatory variability
2.4.1.2. Acoustic variability
2.4.2. Articulatory-acoustic relations
2.4.2.1. Encoding of unidimensional articulatory variability in multi-dimensional acoustic space
2.4.2.2. Encoding of unidimensional acoustic variability in multi-dimensional articulatory space
2.5. Discussion
2.6. Conclusion
3. Evidence for the encoding of variability in the representation of phonological units
3.1. Introduction
3.1.1. Factors conditioning inter- and intraspeaker variation
3.1.1.1. Vocal tract morphology
3.1.1.2. Cognitive, sensory, and neural factors
3.1.1.3. Prosodic factors
3.1.2. Global and local patterns in phonetic variability
3.2. Hypothesis and predictions
3.3. Variability within and across phonological units
3.3.1. Analysis 1 Results: Within- and across-context variability
3.3.1.1. Articulatory variability
3.3.1.2. Acoustic variability
3.3.2. Analysis 2 Results: Comparison of variability across phonological segments
3.3.2.1. Articulatory variability
3.3.2.2. Acoustic variability
3.3.3. Analysis 3: Comparisons across articulatory dimensions
3.3.4. Discussion
3.4. Individual differences in vocal tract morphology and variability
3.4.1. Methods
3.4.1.1. Morphological measurements
3.4.2. Results
3.4.2.1. Relationship between vocal tract morphology and articulatory variability
3.4.2.2. Relationship between vocal tract morphology and acoustic variability
3.4.3. Discussion
3.5. Individual differences in characteristic prosody and variability
3.5.1. Methods
3.5.2. Results
3.5.2.1. Relationship between prosodic variables and articulatory variability
3.5.2.2. Relationship between prosodic variables and acoustic variability
3.5.3. Discussion
3.6. General Discussion
3.7. Conclusion
4. Incorporating individual differences in phonological representation
4.1. Introduction
4.1.1. Foundations of DFT
4.1.2. Integration of DFT into models of phonological planning
4.2. Model
4.2.1. Model components
4.2.1.1. Planning fields and behavioral goals
4.2.1.2. Input
A. Task input
B. Specific input
4.2.2. Interaction
4.2.3. Dynamic Target Selection
4.3. Simulations
4.3.1. Simulation 1: Individual differences in stochastic variability
4.3.1.1. Simulation-specific model settings
4.3.1.2. Results
4.3.1.3. Discussion of Simulation 1
4.3.2. Simulation 2: Relationship between stochastic and contextual variability
4.3.2.1. Implementation of coarticulatory effects on speech planning
4.3.2.2. Simulation-specific model settings
4.3.2.3. Results
4.3.2.4. Discussion of Simulation 2
4.3.3. Simulation 3: Production variability and perceptual sensitivity
4.3.3.1. Simulation-specific model settings
4.3.3.2. Results
4.3.3.3. Discussion of Simulation 3
4.4. General Discussion
4.4.1. Preshaped activation distributions as the locus of individual differences
4.4.2. Comparison of the proposed model with other approaches to encoding variability in phonological representation
4.5. Conclusion
5. The relationship between variability in speech production and perceptual sensitivity to subphonemic variation
5.1. Introduction
5.1.1. Previous research on individual differences in variability and speech perception
5.2. Hypotheses and predictions
5.3. Methods
5.3.1. Participants and recruitment
5.3.2. Stimuli
5.3.2.1. Production
5.3.2.2. Perception
A. Stimulus creation: /s/ continua
B. Stimulus creation: /ɹ/ continuum
5.3.3. Procedure
5.3.3.1. Production task
5.3.3.2. Headphone screen
5.3.3.3. Perception task
5.3.4. Analysis
5.3.4.1. Production measurements
5.3.4.2. Perception measurement
5.3.4.3. Statistical analysis
5.4. Results
5.4.1. Summary statistics and comparison to chance performance
5.4.2. Effect of acoustic dimension and interstimulus distance on accuracy
5.4.3. Relationship between production variability and discrimination accuracy
5.5. Discussion
5.6. Conclusion
6. Summary and conclusions
References
Appendices
Appendix A: Analyses using CoV
Appendix B: Relationship between stochastic and contextual variability by position
List of Tables
Table 2.1. Breakdown of examined tokens in the XRMB corpus across segments and speakers.
Table 2.2. Results of Brown-Forsythe test for homogeneity of variance in all segments. All comparisons are significant (p < 0.05).
Table 2.3. Proportion of pairwise comparisons of IQRTOT that were significant for each measurement (by segment).
Table 2.4. Percentage of pairwise comparisons of IQRCROSS that were significant for each measurement (by segment).
Table 2.5. Percentage of pairwise comparisons of IQRCON that were significant for each measurement (by segment).
Table 2.6. Results of Brown-Forsythe test for homogeneity of variance for all spectral measures in /s/ and /ʃ/. All comparisons are significant at the α = 0.05 significance level (all p-values rounded to three decimal places).
Table 2.7. Results of Brown-Forsythe test for homogeneity of variance for all spectral measures in /l/ and /ɹ/. All comparisons are significant at the α = 0.05 significance level (all p-values rounded to three decimal places).
Table 2.8. Percentage of pairwise comparisons of overall variability that were significant at the Bonferroni-corrected α = 0.000067 significance level for each spectral measurement in (a) /s/ and /ʃ/ and (b) /l/ and /ɹ/.
Table 2.9. Percentage of pairwise comparisons of cross-context variability that were significant for each measurement (by segment).
Table 2.10. Percentage of pairwise comparisons of within-context variability that were significant for each measurement (by segment).
Table 2.11. Median marginal and conditional R² values by segment across all LMER models fit with an articulatory dependent variable.
Table 2.12. All marginal and conditional R² values for all LMER models fit with an articulatory dependent variable.
Table 2.13. Median regression coefficient for the random slope by SPEAKER in each LMER model fit to an articulatory dependent variable.
Table 2.14. Spearman’s rank-order correlation for the comparison of IQRTOT and IQRPRED in models fit to articulatory dependent variables. Significant comparisons are bolded.
Table 2.15. Median marginal and conditional R² values by segment across all LMER models fit with an acoustic dependent variable.
Table 2.16. All marginal and conditional R² values for all LMER models fit with an acoustic dependent variable.
Table 2.17. Median regression coefficient for the random slope by SPEAKER in each LMER model fit to an acoustic dependent variable.
Table 2.18. Spearman’s rank-order correlation for the comparison of IQRTOT and IQRPRED in models fit to acoustic dependent variables. Significant comparisons are bolded.
Table 3.1. Acronyms used for the analyzed articulatory and acoustic dimensions.
Table 3.2. Spearman’s rank-order correlation for the comparison of IQRCROSS and IQRCON for each articulatory dimension in each segment. Green cells indicate comparisons that are significant at the uncorrected p < 0.05 level. Comparisons significant after using the Benjamini-Hochberg method to control for false discovery rate (padj < 0.05) are additionally italicized.
Table 3.3. Spearman’s rho (rs) for the comparison of IQRCROSS and IQRCON for each acoustic dimension in each segment. Green cells indicate comparisons that are significant at the uncorrected p < 0.05 level. Comparisons significant after using the Benjamini-Hochberg method to control for false discovery rate (padj < 0.05) are additionally italicized.
Table 3.4. Correlation matrix for the comparison of IQRCON across segments for each articulatory dimension. All correlations calculated using Spearman’s rho. Green cells indicate comparisons that are significant at the uncorrected p < 0.05 level. Comparisons significant after using the Benjamini-Hochberg method to control for false discovery rate are additionally italicized.
Table 3.5. Spearman’s rho for the comparison of IQRCON across segments for each acoustic dimension. Green cells indicate comparisons that are significant at the uncorrected p < 0.05 level. No comparisons were significant using the Benjamini-Hochberg method to control for false discovery rate.
Table 3.6. Correlation matrix for the comparison of IQRCON across articulatory dimensions for each segment. All correlations calculated using Spearman’s rank-order correlation coefficient. Green cells indicate comparisons that are significant at the uncorrected p < 0.05 level. Comparisons significant after using the Benjamini-Hochberg method to control for false discovery rate (padj < 0.05) are additionally italicized.
Table 3.7. Summary of morphological measurements and associated acronyms.
Table 3.8. Spearman’s rank-order correlation for the comparison of vocal tract morphology and IQRCON for each articulatory dimension in each segment. Green cells indicate comparisons that are significant at the uncorrected p < 0.05 level. No comparisons were significant after using the Benjamini-Hochberg method to control for false discovery rate (all padj > 0.05).
Table 3.9. Spearman’s rho (rs) for the comparison of vocal tract morphology and IQRCON for each acoustic dimension in /s/ and /ʃ/. Comparisons significant at the uncorrected p < 0.05 level are shown in green. No comparisons were significant after using the Benjamini-Hochberg method to control for false discovery rate (all padj > 0.05).
Table 3.10. Spearman’s rho (rs) for the comparison of vocal tract morphology and IQRCON for each acoustic dimension in /l/ and /ɹ/. Comparisons significant at the uncorrected p < 0.05 level are bolded and in green. Comparisons significant after using the Benjamini-Hochberg method to control for false discovery rate (padj < 0.05) are additionally italicized.
Table 3.11. Spearman’s rho (rs) for the comparison of speech rate and prosodic phrasing with IQRCON for each articulatory dimension in each segment. Green cells indicate comparisons significant at the uncorrected p < 0.05 level. No comparisons were significant after correcting for false discovery rate.
Table 3.12. Spearman’s rank-order correlation for the comparison of prosodic measurements and IQRCON for each acoustic dimension in /s/ and /ʃ/. Comparisons significant at the uncorrected p < 0.05 level are bolded and shown in green. Comparisons significant after using the Benjamini-Hochberg method to control for false discovery rate (padj < 0.05) are additionally italicized.
Table 3.13. Spearman’s rank-order correlation for the comparison of prosodic measurements and IQRCON for each acoustic dimension in /l/ and /ɹ/. Comparisons significant at the uncorrected p < 0.05 level are bolded and in green. No comparisons were significant after using the Benjamini-Hochberg method to control for false discovery rate (all padj > 0.05).
Table 5.1. List of stimulus words recorded by participants in the production task. Words in the top two rows (/s/ and /ɹ/ initial) are included in the analysis presented here.
Table 5.2. Acoustic dimension(s) manipulated for each target segment and the motivation for their selection. M1 = first spectral moment, M4 = fourth spectral moment, and F3 = third formant.
Table 5.3. Counterbalancing of perception block presentation across groups.
Table 5.4. Summary statistics for Response Accuracy across speakers. Mean and standard deviation (given in parentheses) calculated separately for each level of step size for each of the manipulated acoustic dimensions.
Table 5.5. Spearman’s rho (rs) for the comparison of Response Accuracy and IQRCON across speakers. Correlation coefficients calculated for each unique combination of Step Size and Acoustic Continuum. Comparisons significant at the uncorrected p < 0.05 level are bolded. Comparisons significant after using the Benjamini-Hochberg method to control for false discovery rate (padj < 0.05) are additionally italicized.
Table 5.6. Spearman’s rho (rs) for the comparison of Response Accuracy and CoVCON across speakers. Correlation coefficients calculated for each unique combination of Step Size and Acoustic Continuum (for M1 and F3 only). Comparisons significant at the uncorrected p < 0.05 level are bolded. No comparisons were significant after using the Benjamini-Hochberg method to control for false discovery rate.
Table A.1. Proportion of pairwise comparisons of CoVTOT, CoVCROSS and CoVCON that were significant for LA and CD (by segment).
Table A.2. Proportion of pairwise comparisons of CoVTOT, CoVCROSS and CoVCON that were significant for M1 and M2 in the examined fricative consonants.
Table A.3. Proportion of pairwise comparisons of CoVTOT, CoVCROSS and CoVCON that were significant for F1-F4 in the examined liquid consonants.
Table A.4. Spearman’s rho (rs) for the comparison of CoVCROSS and CoVCON for each ratio scale dimension in each segment. Green cells indicate comparisons that are significant at the uncorrected p < 0.05 level. Comparisons significant after using the Benjamini-Hochberg method to control for false discovery rate (padj < 0.05) are additionally bolded.
Table A.5. Correlation matrix for the comparison of CoVCON across segments for CD and LA. All correlations calculated using Spearman’s rho. Green cells indicate comparisons that are significant at the uncorrected p < 0.05 level. Comparisons significant after using the Benjamini-Hochberg method to control for false discovery rate are additionally bolded.
Table A.6. Spearman’s rho for the comparison of CoVCON across segments for each ratio scale acoustic dimension. No comparisons were found to be significant.
Table A.7. Spearman’s rank-order correlation for the comparison of vocal tract morphology and CoVCON for CD and LA in each segment. Green cells indicate comparisons that are significant at the uncorrected p < 0.05 level. No comparisons were significant after using the Benjamini-Hochberg method to control for false discovery rate (all padj > 0.05).
Table A.8. Spearman’s rho (rs) for the comparison of vocal tract morphology and CoVCON for M1 and M2 in /s/ and /ʃ/. No comparisons were found to be significant.
Table A.9. Spearman’s rho (rs) for the comparison of vocal tract morphology and IQRCON for F1-F4 in /l/ and /ɹ/. Comparisons significant at the uncorrected p < 0.05 level are bolded and in green. No comparisons were significant after controlling for false discovery rate.
Table B.10. Spearman’s rank-order correlation for the comparison of IQRCROSS and IQRCON for each articulatory dimension in each segment in word-initial contexts only. Green cells indicate comparisons significant at the uncorrected p < 0.05 level. Comparisons significant after using the Benjamini-Hochberg method to control for false discovery rate (padj < 0.05) are italicized.
Table B.11. Spearman’s rank-order correlation for the comparison of IQRCROSS and IQRCON for each articulatory dimension in each segment in word-final contexts only. Green cells indicate comparisons significant at the uncorrected p < 0.05 level. Comparisons significant after using the Benjamini-Hochberg method to control for false discovery rate (padj < 0.05) are italicized.
List of Figures
Figure 2.1. Placement of pellets in the XRMB data. Green circles indicate pellets from which articulatory measurements were taken. (Schematic adapted from Westbury, 1994).
Figure 2.2. IQRTOT (scaled) values by segment for all articulatory dimensions. Order of speakers (x-axis) in all graphs is based on the rank-ordering of IQRTOT values for CL in /t/. Error bars show the 95% confidence interval for each speaker’s IQRTOT value.
Figure 2.3. IQRCROSS (scaled) values by segment for all articulatory dimensions. Order of speakers (x-axis) in all graphs is based on the rank-ordering of IQRCROSS values for CL in /t/. Error bars show the 95% confidence interval for each speaker’s IQRCROSS value.
Figure 2.4. IQRCON (scaled) values by segment for all articulatory dimensions. Order of speakers (x-axis) in all graphs is based on the rank-ordering of IQRCON values for CL in /t/. Error bars show the 95% confidence interval for each speaker’s IQRCON value.
Figure 2.5. IQRTOT (scaled) values for all acoustic dimensions in /s/ (top row) and /ʃ/ (bottom row). Order of speakers (x-axis) in all graphs is based on the rank-ordering of IQRTOT values for M1 in /s/. Error bars show the 95% confidence interval for each speaker’s IQRTOT value.
Figure 2.6. IQRTOT (scaled) values for all acoustic dimensions in /l/ (top row) and /ɹ/ (bottom row). Order of speakers (x-axis) in all graphs is based on the rank-ordering of IQRTOT values for M1 in /s/. Error bars show the 95% confidence interval for each speaker’s IQRTOT value.
Figure 2.7. IQRCROSS (scaled) values for all acoustic dimensions in /s/ and /ʃ/. Order of speakers (x-axis) in all graphs is based on the rank-ordering of IQRCROSS values for M1 in /s/. Error bars show the 95% confidence interval for each speaker’s IQRCROSS value.
Figure 2.8. IQRCROSS (scaled) values for all acoustic dimensions in /l/ and /ɹ/. Order of speakers (x-axis) in all graphs is based on the rank-ordering of IQRCROSS values for M1 in /s/. Error bars show the 95% confidence interval for each speaker’s IQRCROSS value.
Figure 2.9. IQRCON (scaled) values for all acoustic dimensions in /s/ and /ʃ/. Order of speakers (x-axis) in all graphs is based on the rank-ordering of IQRCON values for M1 in /s/. Error bars show the 95% confidence interval for each speaker’s IQRCON value.
Figure 2.10. IQRCON (scaled) values for all acoustic dimensions in /l/ and /ɹ/. Order of speakers (x-axis) in all graphs is based on the rank-ordering of IQRCON values for M1 in /s/. Error bars show the 95% confidence interval for each speaker’s IQRCON value.
Figure 2.11. Relationship between IQRTOT (calculated from the actual XRMB data) and IQRPRED (calculated from predicted values of LMER Model 1) across speakers for each articulatory dimension in each segment.
Figure 2.12. Relationship between IQRTOT and IQRPRED across speakers for each acoustic dimension in each segment.
Figure 3.1. Schematic illustration of predicted relationship between within- and cross-context variability (P1).
Figure 3.2. Schematic illustrating the predicted lack of consistent relationship between a trait (here, vocal tract morphology) and phonetic variability across speakers (P3).
Figure 3.3. Relationship between IQRCROSS and IQRCON across speakers for each articulatory dimension in each segment (top row to bottom row: CL, CD, CO, LA, and LP). Relationships significant after the application of a Benjamini-Hochberg correction are indicated by an asterisk; relationships significant at the unadjusted p < 0.05 level only are indicated by †.
Figure 3.4. Relationship between IQRCROSS and IQRCON across speakers for each acoustic dimension in each segment.
Figure 3.5. Comparison of IQRCON across phonological segments for CL.
Figure 3.6. Comparison of IQRCON across phonological segments for CD.
Figure 3.7. Comparison of IQRCON across phonological segments for CO.
Figure 3.8. Comparison of IQRCON across phonological segments for LA.
Figure 3.9. Comparison of IQRCON across phonological segments for LP.
Figure 3.10. Comparison of IQRCON between phonological segments for all measured acoustic dimensions. Top row: liquid contrasts (/l/ vs. /ɹ/); bottom row: fricative contrasts (/s/ vs. /ʃ/).
Figure 3.11. Comparison of IQRCON across articulatory dimensions for /t/.
Figure 3.12. Comparison of IQRCON across articulatory dimensions for /s/.
Figure 3.13. Comparison of IQRCON across articulatory dimensions for /ʃ/.
Figure 3.14. Comparison of IQRCON across articulatory dimensions for /l/.
Figure 3.15. Comparison of IQRCON across articulatory dimensions for /ɹ/.
Figure 3.16. Illustration of anatomical landmarks used to calculate vocal tract morphology measurements.
Figure 3.17. Illustration of linguistic elements used to calculate speech rate and prosodic phrasing.
Figure 4.1. Examples of possible preshape distributions along the TTCL and TTCD planning fields for /s/ and /ʃ/. In each graph, x-axis = planning field for tract variable (range of all possible parameter values associated with that tract variable), y-axis = activation added to each point on the field at each time step of its evolution. Green line at activation level 0.75 indicates the interaction threshold θ; blue line at activation level 4 indicates the selection threshold κ. Top row presents visualizations of preshaped activation corresponding to the representation of (a) an alveolar TTCL constriction in /s/ and (b) an alveopalatal TTCL constriction in /ʃ/. Bottom row presents visualization of preshaped activation corresponding to the representation of a critical TTCD constriction in (c) /s/ and (d) /ʃ/.
Figure 4.2. Fields with the same preshape (centered at 10 with s.d. of 1) after specific input reflecting acoustically distinct perceptual stimuli is added. In each graph, x-axis = planning field for tract variable (range of all possible parameter values associated with that tract variable), y-axis = activation added to each point on the field at each time step of its evolution, z-axis = time step. The green plane at activation level = 0.75 shows the interaction threshold (θ), the blue plane at activation level = 4 shows the selection threshold κ, and the red plane shows the target value selected for production (first value at which activation surpassed κ). (a) Evolution of field after input centered at target value 8 (s.d. = 1) introduced at time step 500. Selected target value (red plane) = 9. (b) Evolution of field after input centered at target value 12 (s.d. = 1) introduced at time step 500. Selected target value (red plane) = 11.2.
Figure 4.3. Schematic representation of lateral inhibition as implemented in DFT. The excitatory region σw of the interaction kernel w(x) determines the range of local excitation. The strength of local excitation is indicated by wexcite and the strength of global inhibition is indicated by winhibit.
Figure 4.4. (a) Example of the Preshape SD manipulation in Simulation 1. The width of the preshaped activation in the left graph is narrower (Preshape SD = 1) than the preshaped activation in the right graph (Preshape SD = 3). Note that activation of both preshapes is low enough that the interaction term is not engaged (interaction threshold = green plane). (b) Introduction of specific input to each of the preshaped fields from (a) leads to the formation of a stable activation peak. Specific input representing an internal command to start planning the production of the gesture was introduced to each field with a high enough weight to catalyze the development of a stable peak. The target value corresponding to the first point within the peak to cross the selection threshold (purple plane) is selected as the target for production.
Figure 4.5. (a) Distribution of values for selected targets across all five levels of Preshape SD. Each graph presents the data from one level of Preshape SD, with graphs arranged in descending order from narrowest to widest preshape. Histogram fill color indicates Preshape SD level. (b) Mean Target Value SD across all five levels of Preshape SD (color indicates Preshape SD level). Error bars indicate the 95% confidence interval for each group mean. Significant comparisons are indicated by asterisks.
Figure 4.6. Effect of the nonlinear cross-field inhibition function on target selection in the TTCL field as a function of the location of the activation engaging χ in the TBCL field. The leftmost column of graphs shows planning fields for TBCL, the center column shows the cross-field inhibition function corresponding to each TBCL graph, and the rightmost column shows planning fields for TTCL. TBCL activation in (a) (centered at 13) corresponds to a more anterior vocal tract location than the TBCL activation in (d) (centered at 17.5). The cross-field inhibition function in (b) (reflecting the effect of (a) on (c)) introduces less of an inhibitory effect on more anterior locations in the TTCL field than the cross-field inhibition function in (e) (reflecting the effect of (d) on (f)). Due to this difference in the inhibition function, the target selected for the TTCL field in (c) is more anterior (selected target = 12) than the target selected for the TTCL field in (f) (selected target = 13.5).
Figure 4.7. Distribution of values for selected targets across all five levels of Preshape SD in the FRONT and BACK vowel conditions. Each graph presents the data from one level of Preshape SD, with graphs arranged in descending order from narrowest to widest preshape. Histogram fill color indicates vowel condition (light blue = FRONT, dark blue = BACK) and histogram outline color indicates Preshape SD level.
Figure 4.8. Mean Target Value (dark and light blue dots) across all five levels of Preshape SD in the FRONT and BACK vowel conditions. Error bars indicate the 95% confidence interval for each group mean. All pairwise comparisons across levels of Vowel Location and Preshape SD are significant.
Figure 4.9. Visualization of the distance between the two specific inputs to the field (Sperc1 and Sperc2) when the first input is located at parameter value 10.5 and the second input is centered at parameter value (a) 9 (-1.5 away), (b) 9.5 (-1 away), or (c) 10 (-0.5 away). The activation distribution corresponding to the first input is shown in a blue-to-yellow gradient, while the activation distribution corresponding to the second input is red in each figure.
Figure 4.10. Mean Perceived Distance (dark and light green dots) across all five levels of Preshape SD in the STRONG MAPPING and WEAK MAPPING conditions. Error bars indicate the 95% confidence interval for each group mean.
Figure 4.11. Mean Perceived Distance (dark and light green dots) across all five levels of Interstimulus Distance in the STRONG MAPPING and WEAK MAPPING conditions. Error bars indicate the 95% confidence interval for each group mean.
Figure 4.12. Mean Perceived Distance (dark and light green dots) across all five levels of Preshape SD in the STRONG MAPPING and WEAK MAPPING conditions. Each graph shows the results for a different level of Interstimulus Distance. Error bars indicate the 95% confidence interval for each group mean.
Figure 5.1. Trace of the spectrum for each of the steps of the (a) M1 and (b) M4 continua for /s/.
Figure 5.2. Definition of lower and upper endpoints for the F3 continuum using Praat FormantGrid objects (left: lower endpoint, right: upper endpoint). Blue line and points indicate the manipulated dimension (F3).
Figure 5.3. Formant trajectories (F1-F5) for each of the seven stimuli in the F3 continuum. The portion of the x-axis replaced by a dotted gray line indicates the temporal domain over which F3 was manipulated (with the different F3 trajectories resulting from this manipulation indicated by different colored lines).
Figure 5.4. Schematic of the trial creation process for the 4IAX task.
Figure 5.5. Response accuracy according to acoustic continuum and interstimulus acoustic distance (step size) for all participants examined here. Notches in boxes indicate the 95% confidence interval around median performance in each combination of acoustic continuum and step size. The dashed green line on the y-axis indicates at-chance performance (50%).
Figure 5.6. Response accuracy by acoustic continuum for all participants examined here. Notches in boxes indicate the 95% confidence interval around median performance in each combination of acoustic continuum and step size. The dashed green line on the y-axis indicates at-chance performance (50%).
Figure 5.7. Response accuracy according to interstimulus acoustic distance (step size) for all participants examined here. Notches in boxes indicate the 95% confidence interval around median performance in each combination of acoustic continuum and step size. The dashed green line on the y-axis indicates at-chance performance (50%).
Figure 5.8. Comparison of IQRCON and Response Accuracy across participants for each step size in the F3 continuum. Significant correlations are indicated by an asterisk. The dashed green horizontal line indicates chance performance (50%).
Figure 5.9. Comparison of IQRCON and Response Accuracy across participants for each step size in the M1 continuum. No correlations are significant. The dashed green horizontal line indicates chance performance (50%).
Figure 5.10. Comparison of IQRCON and Response Accuracy across participants for each step size in the M4 continuum. No correlations are significant. The dashed green horizontal line indicates chance performance (50%).
Figure 5.11. Comparison of CoVCON and Response Accuracy across participants for each step size in the F3 continuum. Significant correlations are indicated by an asterisk. The dashed green line on the y-axis indicates chance performance (50%).
Figure 5.12. Comparison of CoVCON and Response Accuracy across participants for each step size in the M1 continuum. Significant correlations are indicated by an asterisk. The dashed green line on the y-axis indicates chance performance (50%).
Abstract
The articulatory and acoustic properties of any one phonological segment are known to
vary both between speakers and between tokens in an individual’s speech. Much of this observed
inter- and intraspeaker phonetic variation can be explained as the predictable consequence of
various linguistic and extralinguistic factors known to affect the realization of phonological
segments. However, relatively little research has systematically examined the extent to which
this variation may also reflect individual differences in how segments are represented, or in the
mechanisms affecting the selection of production targets in speech. This dissertation extends
existing research on stochastic variability in speech to investigate the hypothesis that individual
differences in speech production and perception reflect differences in the cognitive
representation of phonological units across speakers, specifically differences in the encoding of
variability in these representations.
The first of two empirical studies presented in this dissertation uses articulatory and
acoustic data from forty speakers in the Wisconsin X-Ray Microbeam corpus (Westbury, 1994)
to examine individual differences in phonetic variability in a set of English consonants, and how
these differences pattern across different structural units in speech. The results of this study
indicate that robust individual differences in variability are maintained across multiple levels of
linguistic structure but are not generalized across different phonological segments, supporting
models of phonological representation in which variability is encoded in the cognitive
representation of phonological targets. The results of an analysis of articulatory-acoustic
relations in the same data additionally highlight the potential communicative significance of
these individual differences. Building on the findings of this study, an extension of existing
Dynamic Field Theory models of phonological cognition is proposed to account for the observed
patterns of individual difference in phonetic variability. A series of simulations using this model
is presented to illustrate how the incorporation of both dynamical and invariant elements in the
representation of phonological units can both account for the observed patterns and generate
predictions about the relationship between variability in speech production and individual
differences in speech perception. The final empirical study included in this dissertation tests the
specific predictions generated by the model regarding the relationship between individual
differences in the production of phonetic variability and perceptual sensitivity to subphonemic
variability.
As a whole, the findings of this dissertation support a model of phonological cognition in
which the individual differences observed in acoustic and articulatory variability reflect the
encoding of variability for individual phonological units. Through this, the research presented in
this dissertation provides support for the hypothesis that individual differences in speech
production and perception reflect variation in the cognitive representation of phonological units
across speakers. This dissertation also illustrates how our understanding of the cognitive systems
involved in speech production and perception, as well as behavioral speech phenomena more
generally, is enhanced by considering measures of speaker performance beyond central
tendencies in speech production.
1. Introduction
Variability is an inherent component of motor behavior, particularly complex motor
behaviors like speech production. Research on non-speech motor control has highlighted the
ways in which the systematic examination of variability in motor behavior may provide critical
insight into the systems governing the planning and implementation of goal-oriented movement,
as well as the insights that can be gleaned from the examination of individual differences in
motor variability (e.g., Harris & Wolpert, 1998; Newell & Slifkin, 1998; Riley & Turvey, 2002).
As such, studies seeking to examine the processes underlying movement planning and
implementation have increasingly focused on measurements of intertrial variability in movement
execution (e.g., Cusumano & Dingwell, 2013; Dhawale, Smith, & Ölveczky, 2017; Latash,
Scholz, & Schöner, 2002; van Beers, 2009). This dissertation adopts a similar approach to the
investigation of the cognitive systems of phonological representation underlying the production
and perception of speech. Specifically, the systematic investigation of individual differences in
articulatory and acoustic variability is used to motivate a model of phonological cognition in
which individual differences in variability are encoded in the representation of phonological
units.
Although phonological systems have traditionally been described in terms of discrete,
invariant sound categories, phonetic research has for decades shown that the realization of these
categories in the physical speech signal is highly variable. The precise articulatory and acoustic
realization of any phonological segment varies both between speakers of a language and between
tokens produced by the same speaker (e.g., Allen et al., 2003; Baker et al., 2011; Beckman et al.,
1995; Byrd, 1992; De Decker & Nycz, 2012; Delattre & Freeman, 1968; Hillenbrand et al.,
1995; Johnson et al., 1993; Peterson & Barney, 1952; Whalen et al., 2018; see Bürki, 2018 for an
overview of variation phenomena in speech production).
Much of this observed inter- and intraspeaker phonetic variation can be explained as the
predictable consequence of various linguistic and extralinguistic factors known to affect the
realization of phonological segments.¹ For example, the production of specific segments can
differ between two speakers due to social factors like regional origin, gender, age, ethnicity,
sexual orientation, and social class (see Babel & Munson, 2014; Docherty & Mendoza-Denton,
2012; Foulkes, Scobbie, & Watt, 2010 for reviews of this topic). Interspeaker differences in
segmental production can also reflect differences in their history of language use, particularly for
multilingual speakers (e.g., Flege, 2007; Flege and Eefting, 1987; Flege, Schirru, & MacKay,
2003; Fowler, Sramko, Ostry, Rowland, and Hallé, 2008). Individual differences in vocal tract
morphology, cognitive traits, and sensory acuity have also been shown to engender differences in
the production of certain phonological segments across speakers (Brunner, Fuchs, & Perrier,
2009; Dediu & Moisik, 2019; Franken et al., 2017; Ghosh et al., 2010; Johnson, 2018;
Mooshammer, Perrier, Fuchs, Geng, & Pape, 2004; Ou, Law, & Fung, 2015; Ou & Law, 2017;
Perkell et al., 2008; Rudy & Yunusova, 2013; Weirich & Fuchs, 2013; Yu, 2010, 2016).

¹ The term phonological segment is used here and throughout the dissertation to refer to “any discrete unit that can be identified, either physically or auditorily, in the stream of speech” (Crystal, 2008, p. 426). It is not meant to suggest any particular orientation towards the question of whether these perceived segments are homologous with units of phonological control. A similar term, phonological unit, is also used to describe discrete units that may or may not be homologous with what would typically be considered “segments” (e.g., subsegmental units such as articulatory gestures [Browman & Goldstein, 1986, 1989, 1992 et seq.]).

A considerable amount of the token-to-token variability observed in the speech of an individual speaker can also be predicted by the different contexts in which tokens of a phonological segment are produced. Intraspeaker variation has been shown to emerge as the lawful consequence of the linguistic context in which a segment is produced, with factors like segmental coarticulation (e.g., Gay, 1977; Fowler, 1980; Hardcastle & Hewlett, 1999; Liberman,
Cooper, Shankweiler, & Studdert-Kennedy, 1967; Öhman, 1966; Recasens & Espinosa, 2009;
Recasens, Pallarès, & Fontdevila, 1997), local effects of prosodic structure (e.g., Byrd &
Saltzman, 1998; Byrd, 2000; Byrd et al., 2006; Cho, 2006; Edwards, Beckman, & Fletcher,
1991; Fougeron & Keating, 1997; Krivokapić, 2007; Lehiste, 1973; Oller, 1973; Klatt, 1975;
Shattuck-Hufnagel & Turk, 1998), and the processing demands imposed by lexical or semantic
context (e.g., Clopper & Turnbull, 2018) shown to impact the realization of phonological
segments. Paralinguistic elements, such as global temporal or stylistic properties of an utterance
(e.g., Gay, 1978; Lindblom, 1990; Miller & Liberman, 1979; Moon & Lindblom, 1994; Smith,
2002; Theodore, Miller, & DeSteno, 2009), can also induce intraspeaker variation in production.
All of the factors conditioning individual differences or token-to-token variability in the
production of phonological segments can be considered ‘regular’ in nature, as they constitute
concrete social, cognitive, physical, or linguistic motivations for differences in the phonetic
properties of different tokens of the same phonological unit. However, the effect of these
‘regular’ factors on the realization of speech is not deterministic in nature. Even after accounting
for these predictable effects, a certain amount of the difference in individual speakers’
production tendencies and variability in the token-to-token realization of a phonological segment
within the speech of a single speaker remains unexplained. This unexplained phonetic variation
can be considered ‘stochastic’ or irregular, and can constitute seemingly random token-to-token
variability in the speech of a single speaker (e.g., Beckman et al., 1995; Delattre & Freeman,
1968; Hillenbrand et al., 1995; Johnson et al., 1993; Whalen et al., 2018), interspeaker variation
in token-to-token variability that is not explained by predictable factors (e.g., Bakst, 2021;
Brunner et al., 2009; Rudy & Yunusova, 2013; Whalen et al., 2018), and interspeaker variation
in the impact on production of factors conditioning regular contextual variability, like
coarticulation (e.g., Beddor, 2009; Baker et al., 2011; Grosvald, 2009; Kataoka, 2011; Lubker &
Gay, 1982; Smith et al., 2019; Zellou, 2017), speech rate (e.g., Theodore, Miller, & De Steno,
2009), and prosodic structure (e.g., Byrd et al., 2006; Fougeron & Keating, 1997). This
dissertation examines whether these various, unexplained phenomena reflect differences in the
representation of phonological units across speakers. The overarching hypothesis tested in this
dissertation is that individual differences in speech production and perception reflect variation in
the cognitive representation of phonological units across speakers.
While most modern models of phonological cognition take the perspective that regular
variation in phonetic realization arises from the systems regulating speech production and
perception, they differ from one another in the precise manner in which they encode this
variation and in their incorporation of stochastic variability in the actual cognitive systems
governing speech. One of the main axes along which models differ that is particularly pertinent
to this latter point is the manner in which different models conceptualize phonological targets. A
distinction can be made between models that represent phonological goals as abstract, time- and
context-invariant targets, like Articulatory Phonology (e.g., Browman & Goldstein, 1985, 1986,
1989, 1992, 2000; Byrd, 1996; Byrd & Saltzman, 2003; Goldstein et al., 2006; Nam et al., 2009;
Saltzman et al., 2008), and models that represent phonological goals as distributions of possible phonetic targets² and use these distributions themselves to account for regular variation.

² These include window models (e.g., Byrd, 1996; Keating, 1990, 1996), the DIVA model (e.g., Guenther, 1995, 2016; Guenther et al., 2006; Guenther et al., 1998; Tourville & Guenther, 2011), exemplar models (e.g., Goldinger, 1996; Johnson, 1997; Lacerda, 1995; Pierrehumbert, 2001, 2002, 2016; Wedel, 2006), and models incorporating Dynamic Field Theory (e.g., Gafos & Kirov, 2009; Roon & Gafos, 2016; Tilsen, 2019).

While Articulatory Phonology and similar models are able to generate a wide swath of the regular, contextually-conditioned variability observed in speech production, the abstract and invariant nature of its system of phonological representation precludes the generation of
stochastic variability from phonology-internal sources. This contrasts with models, like exemplar
models of speech (e.g., Goldinger, 1996; Johnson, 1997; Lacerda, 1995; Pierrehumbert, 2001,
2002, 2016; Wedel, 2006) and models incorporating Dynamic Field Theory (e.g., Gafos &
Kirov, 2009; Roon & Gafos, 2016; Tilsen, 2019), that both explicitly incorporate variability in
the representation of phonological units and provide a mechanism for generating stochastic
variability in phonological planning. The potential for individual differences in phonological
representation has also generally been explored more in this latter class of models, and in other
models that incorporate target distributions in representation (e.g., Guenther, 1995, 2016;
Keating, 1990, 1996), due to their greater emphasis on individual experience in the development
of phonological representations.
These differences in the architectures used to model phonological cognition generate
concrete differences in models’ ability to lawfully generate interspeaker variation, contextual
variability, and stochastic variability in speech production, and subsequently may impact their
ability to account for various phenomena observed in speech. This dissertation extends existing
research on stochastic variability in speech in order to evaluate different models’ success in
accounting for observed patterns of individual difference in stochastic variability. The work
presented here specifically argues for a ‘hybrid’ model of phonological cognition that combines
invariant representations of phonological units with dynamic processes for target selection
during planning (Gafos & Kirov, 2009; Roon & Gafos, 2016; Tilsen, 2019), and demonstrates
how this ‘hybrid’ representation of phonological units and their targets can explain connected
patterns of interspeaker differences and intraspeaker variability in speech production and
perception. Through this avenue of investigation, a novel account relating patterns of inter- and
intraspeaker variation across multiple levels of structural organization in speech and across
different speech behaviors is presented and empirically supported.
1.1. Notes on terminology
Following Vaughn et al. (2018), in this dissertation I make a distinction between
difference (also sometimes referred to as variation) and variability in describing the divergence
of the physical realization of a phonological unit across and within speakers.
Difference or variation is used to refer to the manner in which different speakers may
diverge in their production of the same phonological unit. More specifically, this is referred to as
individual differences or interspeaker variation, indicating that this divergence reflects a
speaker-dependent difference in linguistic behavior. Individual differences in speech can be
idiosyncratic or attributable to concrete differences between speakers in a specific trait (e.g.,
dialect or vocal tract morphology). The term variability, on the other hand, is used to refer to
token-to-token divergence in the physical realization of a phonological unit within the speech of
a single speaker.
Variability is further divided into stochastic variability and contextual variability.
Stochastic variability refers to variability in the realization of a phonological unit across tokens
that are produced in the same phonetic context (within-context variability)—e.g., tokens that are
produced with the same flanking segments and in the same position within both lexical and
prosodic structure. The use of the term ‘stochastic’ in reference to this type of within-context
variability is not meant to imply that the behavior of observed trial-by-trial fluctuations in the
physical properties of the phonological unit is actually stochastic in the statistical sense.
Contextual variability, in contrast, refers to the differences between tokens of a phonological unit
produced by the same speaker across different phonetic contexts.
Both difference and variability are measured for individual dimensions of phonetic
realization within a phonological segment. The terms phonetic dimension, articulatory dimension
and acoustic dimension are used throughout the dissertation to refer to the measured and
manipulated dimensions of phonetic realization. A phonetic dimension is any measure that can
be extracted from the physical speech signal or calculated from multiple extracted measures
(such as formant differences), while articulatory and acoustic dimensions are more specifically
measures extracted from articulatory and acoustic signals, respectively.
1.2. Outline of the dissertation
Chapter 2 presents a corpus study examining individual differences in articulatory and
acoustic variability in a set of American English coronal consonants, providing evidence for
robust differences between speakers in the extent of the variability they exhibit in the production
of multiple phonetic dimensions and phonological segments. An analysis of articulatory-acoustic
relations in these same segments is also conducted to evaluate the potential communicative
significance of these differences, with results demonstrating that the variability observed in
articulation is generally recoverable from acoustics, and vice versa.
Chapter 3 uses the same corpus data to more closely examine how individual differences
in articulatory and acoustic variability pattern across different structural units in speech. The
observed patterns indicate that individual differences in variability are maintained across
multiple levels of linguistic structure but are not generalized across different phonological
segments or across dimensions in the same segment that do not share a primary articulator. A
supplementary set of analyses comparing individual differences in stochastic variability to
interspeaker differences in vocal tract morphology and in the realization of paralinguistic
elements indicates that these factors do not consistently predict differences in phonetic
variability. In conjunction, these results indicate that interspeaker differences in phonetic
variability reflect the specific encoding of variability for individual phonological units,
supporting models of phonological representation in which variability is encoded in the target
representations of specific phonological units.
Chapter 4 presents an extension of existing Dynamic Field Theory models of
phonological cognition designed to account for the patterns of individual difference in phonetic
variability observed in Chapters 2 and 3. First, an explanation is provided of the model
architecture and the manner in which individual differences are incorporated in the model.
Second, a series of simulations are presented that demonstrate how this model both accounts for
empirical patterns observed in the analyzed corpus data and can be used to generate predictions
about the relationship between individual differences in production variability and perception.
The chapter concludes with a comparison of the proposed model to other models capable of
encoding variability in phonological representation.
Chapter 5 presents the preliminary results of an experiment examining the relationship
between individual differences in acoustic variability and the perception of subphonemic
variation. Based on previous research, it is expected that individual differences in these domains
will be related across speakers. The preliminary results of this study do suggest that such a
relationship exists, but that its appearance is mediated by the identity of the dimension along
which performance in production and perception are measured.
Chapter 6 presents a summary of the findings of the dissertation and a discussion of their
theoretical implications. Possible directions for future research are also discussed.
2. Individual differences in articulatory and acoustic variability
2.1. Introduction
A large body of research over the past several decades has demonstrated that individuals
differ from one another in their production and perception of phonological segments; in fact, as
Johnson, Ladefoged, and Lindau (1993) point out, “[m]ost studies of speech production find
some differences between speakers” (p. 701). While many of these differences can be viewed as
‘regular’ or ‘predictable’ variation due to their dependencies on speakers’ demographic, social,
anatomical, or cognitive traits, a substantial amount of variability remains that does not appear to
be immediately attributable to clearly defined speaker traits.
Much of the research on variation in speech production and interspeaker differences has
focused on differences in the central tendencies of speakers’ production along particular
articulatory or acoustic dimensions. This has enabled critical insights into the patterns of
variation observed within and between speakers, particularly in our understanding of the factors
that may lawfully condition variation (see review in Chapters 1 and 3) and, more recently, of the
extent to which idiosyncratic differences in production may reflect systematic interspeaker
differences in the cognitive representations and systems underlying speech (e.g., Chodroff &
Wilson, 2017; Smith et al., 2019). However, despite the growing focus on inter- and intraspeaker
variation in central tendencies in production, research on variation in speech production has
largely avoided examining stochastic variability and how this may pattern across phonological
units and across speakers. Almost all research on stochastic variation in the production of
phonological units has focused on the extent to which quantifiable interspeaker differences in
factors like vocal tract morphology, cognitive traits, and auditory or somatosensory acuity
condition interspeaker phonetic variation (e.g., Brunner et al., 2009; Brunner et al., 2011; Ghosh
et al., 2010; Nasir & Ostry, 2006; Ou & Law, 2017; Perkell, Guenther, et al., 2004; Perkell,
Matthies, et al., 2004; Rudy & Yunusova, 2013; Yu, 2016). Research focusing on how stochastic
variability patterns within and across speakers outside of the influence of these quantifiable
factors, and indeed the extent to which the observed differences in variability are of sufficient
magnitude and frequency to be conclusively considered differences between speakers instead of
random sampling error, has been relatively limited in scope.
It has been suggested that the systematic investigation of variability in motor behavior
may provide critical insight into the systems and control structures governing the planning and
implementation of motor movements (e.g., Newell & Slifkin, 1998; Riley & Turvey, 2002).
Building on this line of thought from more domain-general research on perception-action
systems, further investigation into patterns of inter- and intraspeaker variability in speech
production may be critical for a more detailed understanding of speech-related cognition and the
mechanisms by which listeners adapt to and maintain communicative parity amidst the inter- and
intraspeaker variation observed in the speech signal (Liberman & Whalen, 2000; Newman,
Clouse, & Burnham, 2001; Uchanski, Miller, Reed, & Braida, 1992).
Positing a relationship between individual differences in phonological representation and
the extent of variability exhibited in the production of a phonological unit is not without
theoretical precedent. Research has demonstrated that the density and overall patterning of a
phonological (sub)system may influence the extent of variability observed in its composite
phonological units, with more variability permitted along articulatory and/or acoustic dimensions
that are not as heavily relied on to distinguish one segment from others (Keating, 1990, 1996;
Lubker & Gay, 1982; Manuel, 1990, 1999; c.f. Flege, 1989). Recent work has further
demonstrated that this relationship between variability and contrast may itself be further
constrained by the use of specific phonetic dimensions to actively distinguish among
phonological segments, both on the level of a language as a whole and within an individual
(Hauser, 2019). Recent work on the relationship between within-category production variability,
perceptual categorization, and perceptual sensitivity (Chao, Ochoa, & Daliri, 2019; Franken et
al., 2017; Perkell et al., 2008) and the active control of variability in speech production (Parrell
& Niziolek, 2020) also lends credence to the idea that the variability observed in the production
of a phonological unit may be in some way encoded in, and subsequently manifest as, a
reflection of its cognitive representation. This is the topic of the present study.
2.1.1. Variation in the production of American English coronal consonants
The investigation in this chapter focuses on inter- and intraspeaker variation in
articulation and acoustics for a subset of the coronal consonants in American English, namely /t/,
/s/, /ʃ/, /l/ and /ɹ/. This particular set of consonants was chosen for investigation for two reasons.
First, widespread idiosyncratic variation has previously been recorded in their production,
especially in articulation, providing a solid basis for expansion into this more specific
investigation of variability and its relationship to the representation of the phonological units
implicated in these segments’ production. Second, a substantial amount of existing research on
the acoustic properties of these fricative and liquid consonants and the acoustic correlates of
variation in their articulatory realization provides the foundation and guiding structure for a
follow-up investigation of the transmission of variability in articulatory-acoustic relations for
these segments.
Perhaps the most widely studied example of covert articulatory variation in speech
research is that observed in the production of American English /ɹ/. Typically formed via the
coordination of a labial constriction, a lingual constriction along the palatal vault, and a
pharyngeal constriction formed by the tongue root (Alwan et al., 1997; Delattre & Freeman,
1968; Westbury et al., 1998), extensive inter- and intraspeaker variation has been observed in the
production of each of these constriction gestures, and especially in the production of the palatal
constriction (Delattre & Freeman, 1968; Mielke et al., 2010; Tiede et al., 2004; Westbury et al.,
1998). The palatal constriction in /ɹ/ can vary both in its precise location along the palatal vault,
with the possible range of its realization encompassing the postalveolar to the pre-palatal regions,
and in whether the tongue tip, blade, or dorsum is used in its production. Although some of this
observed variation is context-dependent, reflecting both the segmental and prosodic environment
in which it is produced (e.g., Campbell et al., 2010; Delattre & Freeman, 1968; Gick, 1999;
Guenther et al., 1999; Hagiwara, 1995; Mielke et al., 2010, 2016; Ong & Stone, 1998; Westbury
et al., 1998; Zawadzki & Kuehn, 1980), much of the observed variation is idiosyncratic, with
speakers differing from one another both in their preferred lingual configurations for the
production of /ɹ/ and in the precise manner in which their production is influenced by phonetic
context without any clear physiological or acoustic motivation. Speakers also differ broadly from
one another in their production of the labial gesture in this segment, with the magnitude of labial
compression and protrusion of this gesture varying across speakers and contexts (Campbell et al.,
2010; Lindau, 1985; Mielke et al., 2016; Zawadzki & Kuehn, 1980).
Although the variation observed in the production of /ɹ/ has received the most attention,
the other coronal consonants of English are also known to exhibit substantial contextual and
idiosyncratic variation in their realization. For example, variation in the orientation of the tongue
tip during /s/ production in American English has been suspected for nearly a century (Kenyon,
1924, as cited in Johnson et al., 1993). Using lateral cinefluorography, Bladon and Nolan (1977)
found that a majority of the eight speakers they examined produced /s/ and /z/ with a laminal
articulation but varied in terms of their preference of an apical or laminal articulation for /t/ and
/d/. Dart (1991) observed similar variation in the articulation of /t/ and /d/ across twenty speakers
using palatography and additionally observed a much more equal split between apical and laminal
articulations across speakers in their production of /s/ and /z/. She also observed a correlation
between tongue tip orientation during the production of the fricative and the anteriority of its
constriction location, with apical productions tending to be more anterior than laminal
productions. Similar results regarding the prevalence of both apical and laminal /s/ across
speakers of English were observed in Ladefoged and Maddieson (1996), Narayanan, Alwan, &
Haker (1995), and Stone, Gomez, Zhuo, Tchouaga, & Prince (2019). Parallel differences in the
spectral properties of /s/ have also been observed across both demographic groups and individual
speakers (e.g., Dart, 1998; Flipsen, Shriberg, Weismer, Karlsson, & McSweeny, 1999; Jongman
et al., 2000; Newman, Clouse, & Burnham, 2001; Yu, 2016, 2019) and as a function of phonetic
context (e.g., Jongman et al., 2000; Soli, 1981; Yu, 2016, 2019).
The coronal liquid /l/ is also known to vary considerably in the degree of velarization that
it exhibits in American English as a function of phonetic context, with aspects of both the
prosodic and segmental environment in which /l/ is produced noted to induce variation in its
articulatory and acoustic properties (Krakow, 1989; Lin, Beddor, & Coetzee, 2014; Recasens &
Farnetani, 1994; Sproat and Fujimura, 1993). Although interspeaker variability in general
production strategies has not been observed for /l/ in the same way as it has been noted for the
other coronal consonants in American English, speakers have been observed to differ from one
another in the extent to which specific phonetic contexts condition velarization and vocalization
of /l/ (e.g., Lin, Beddor, & Coetzee, 2014).
In each of these phonological segments, articulatory and acoustic variation can manifest
either along dimensions that are thought to define their phonological identity, and thus can be
viewed as explicitly goal-oriented in their implementation, or along dimensions that are not as
directly related to the system of phonological contrast and which may subsequently not have a
controlled target for their production. The extent to which this difference in phonological
significance for production dimensions may or may not be related to the extent of observed
interspeaker differences will be a point of interest in the present investigation, as it bears on
whether these differences in variability may be encoded in the cognitive representations of
phonological units.
2.1.2. Articulatory-acoustic relationships
An additional question of interest, in terms of understanding the role that variability in
articulation and acoustics might play in the representation of phonological action units and in the
maintenance of communicative parity across individuals more generally, is the extent to which
variability in articulation is encoded in acoustic variability and vice versa. The mapping between
articulation and acoustics is non-linear and quantal (e.g., Stevens, 1972), and consequently the
extent to which variation in the location or degree of an articulatory constriction will affect the
acoustic signal differs depending on the region of the vocal tract in which the constriction is
produced, as well as the state of other articulators in the vocal tract and their effect on the vocal
tract sensitivity function. Similarly, due to the influence of the entire vocal tract shape on the
acoustic signal at any one moment in time (e.g., Iskarous, 2010) and the ability for different
configurations of the vocal tract to output relatively similar acoustic signals (what is known as
the “many-to-one” relationship between articulation and acoustics [e.g., Abbs et al., 1984; Atal et
al., 1978]), variability in the acoustic signal is difficult to map on to variability along any single
articulatory dimension. The extent to which variability in articulation is discoverable from
variability in acoustics (and vice versa), and in particular the extent to which these relationships
are generally discoverable across speakers, has implications regarding the potential involvement
of this variability in the transmission and perceptual identity of a phonological unit.
Building on the knowledge of non-linearity in vocal tract acoustics and the fact that
different articulatory configurations can result in relative consistency along certain salient
acoustic dimensions, researchers have claimed that there is less variability in the acoustic
realization of a segment than in its articulation (e.g., Guenther et al., 1999; Johnson, Ladefoged
& Lindau, 1993), a claim that has been used to support accounts positing that the targets of
speech are acoustic. For example, in an examination of vowel production by five speakers of
midwestern American English using x-ray microbeam, Johnson et al. (1993) found that speakers
were internally very consistent in their production of vowels across different repetitions of the
same word but that the strategies used in the production of tense/lax, vowel height, and vowel
rounding distinctions varied both across speakers and across vowel pairs within the speech of a
single individual. The interspeaker variability observed in their study was not accounted for by
differences in gender, speech rate, palate doming, or dialect. Johnson et al. (1993) took these
findings to suggest that variability in articulation is greater than variability in acoustics, under the
assumption that the different strategies used by speakers to produce the phonological distinction
between segments were not reflected in the acoustic signal. However, they did not actually
measure the acoustic signal to verify that the observed articulatory variability lacked acoustic
consequences.
Recent work that did directly examine articulatory and acoustic variability in the same
materials has found no support for Johnson et al.’s conclusion, reintroducing the possibility that
acoustic variation may be used to draw conclusions about variability in the underlying
articulation. Examining individual differences in the realization of vowel height contrasts,
Noiray, Iskarous, and Whalen (2014) found that speakers’ idiosyncratic articulatory strategies for
contrasting three front vowel pairs were reflected in the acoustics of these segments, with
speakers who showed articulatory tongue height ‘reversals’ of the type examined in Johnson et
al. (1993) (e.g., producing the high vowel /ɪ/ with a lower tongue position than the mid vowel /e/)
demonstrating a comparable reversal in first formant values. Whalen, Chen, Tiede & Nam (2018)
similarly found that articulatory variability was positively correlated with acoustic variability
across vowels for 32 speakers of American English in the Wisconsin XRMB corpus (Westbury,
1994), reflecting a significant relationship between variability in these two domains for the
majority of vowels (5/9) when examined separately. These studies support the idea that
variability is comparable in articulation and in acoustic dimensions critical for the identification
of vowels, with variability in these critical acoustic dimensions reflecting variation in the
articulatory constrictions used to generate the vowels (as predicted by Iskarous, 2010).
However, it is unclear whether this same relationship between variability in articulatory
and in perceptually important acoustic dimensions is observed for consonants, a question that has important implications for understanding the role of variability in communication and in
the cognitive phonological representations of these segments. Much of the research that has
observed a relationship between articulatory and acoustic variability in consonants has found that
variability in articulation is more likely to be reflected in acoustic dimensions that are not
perceptually critical. For example, while variation in the location of the palatal constriction and
the shape of the tongue body in /ɹ/ was found to influence the distance between the (presumably)
perceptually unimportant fourth and fifth formants in the acoustic signal (Zhou, Espy-Wilson,
Boyce, Tiede, Holland and Choe, 2008), most research on tongue posture in /ɹ/ has found little
evidence for a relationship between articulatory variation in /ɹ/ and the perceptually relevant
dimension F3. Indeed, Nieto-Castañon, Guenther, Perkell and Curtin (2005) found evidence that
articulatory variability may be harnessed to minimize variability in F3 in /ɹ/. Specifically, they
observed that the vast majority (91%) of articulatory variability in the production of /ɹ/ occurred
in articulatory dimensions that had a minimal impact on F3 values, a finding that they took
to support the proposal that the production target for this segment is acoustic in nature.
However, evidence for the impact of inter- and intraspeaker variation in consonant
articulation on critical acoustic dimensions can be found in the literature, although the extent to
which this reflects gradient articulatory variability (instead of gross differences in tongue
posture) has not been widely examined. Despite the overall lack of an effect of tongue posture on
F3 in /ɹ/, sensitivity of F3 to variation in the degree of the palatal constriction has been observed for this segment (Harper, Goldstein, & Narayanan, 2020). Tongue tip orientation has
also been observed to affect the fricative frequency spectrum for coronal fricatives in English
and other languages (e.g., Dart, 1991, 1998), although the uniformity of this effect across
different spectral properties is not entirely clear. Some empirical evidence suggests that the specific
effect of tongue tip orientation on fricative acoustics may depend on the anteriority of the
constriction, with laminal fricatives exhibiting a greater amount of high frequency energy at
more anterior (descriptively dental) constriction locations but apical fricatives exhibiting greater
high frequency energy when the constriction is more posterior (descriptively alveolar) in both
English and French (Dart, 1991), aligning with the observation of higher spectral center of
gravity measurements for apical post-alveolar fricatives than laminal post-alveolar fricatives in
Ubykh (Ladefoged & Maddieson, 1996).
The question of whether and how effectively variation in one modality can be physically
uncovered from the signal of the other is additionally complicated by indications that the degree
of nonlinearity observed in the relationship between articulation and acoustics may differ
between individual speakers as a function of variation in vocal tract morphology (Bakst &
Johnson, 2018; Lammert, Proctor, Katsamanis, & Narayanan, 2011). This interspeaker variation
in articulatory-acoustic nonlinearities suggests that speakers may differ from one another in the
extent to which variation along a particular articulatory dimension will correlate with variation
along a particular acoustic dimension (a possibility empirically supported by Carignan [2019]
and Weirich & Simpson [2018]). Understanding the extent to which articulatory variability is
reflected in the acoustic signal across this interspeaker variation in the articulatory-acoustic
mapping, and vice versa, may have important implications for the communicative significance of
variability in each domain and the role that this variability may play in the representation of
phonological units.
2.2. Study goals
This chapter builds on this recent work by examining the extent to which intraspeaker
variability in the production of different phonological segments (and, more specifically, along
individual phonetic dimensions in the production of these segments) systematically differs
between speakers. The main goal of the investigation presented in this chapter is to determine
the extent to which measured interspeaker differences in production actually reflect widespread
behavioral differences with potential communicative significance in a group of speakers.
Namely, to what extent are the individual differences in variability observed among speakers in
a population (a) large enough to be attributed to concrete differences between speakers instead
of chance, and (b) widespread enough within the population to indicate a general vector of behavioral difference? To this end, two main questions and a set of associated sub-questions are
addressed:
Q1: Do individual speakers demonstrably differ from one another in the extent of overt
behavioral variability they exhibit in specific articulatory or acoustic dimensions known
to be of significance in planning and executing speech production?
Q2: Is the variability exhibited by a speaker in articulation and/or acoustics recoverable
from the signal in the other domain?
The findings related to Question 1 will be crucial for drawing substantive conclusions
regarding the significance of observed differences in interspeaker variation and motivating
further research probing the potential that these differences may reflect individual differences in
the cognitive representation of phonological units regulating articulatory movement (and what
the foundational nature of those units may be). The manner in which speakers do or do not differ
in variability is analyzed separately for specific coronal segments, phonetic dimensions, and
levels of structural granularity, thereby illuminating how and where these differences may occur
and thus guiding further investigation into the implications of observed patterns of variability for
developing models of phonological representation. Similarly, the extent to which the answer to
Question 2 is specific to certain phonetic dimensions or segments will be critical for determining
the extent to which variability in articulation and acoustics is transmitted between speakers and
the information that listeners may glean from the variability they encounter in heard speech.
2.3. Methods
2.3.1. Corpus data
Articulatory and acoustic data for this study were taken from recordings of 40 speakers in the
Wisconsin XRMB database (Westbury, 1994). All speakers included in this study were native
speakers of American English, with the majority recorded as having a dialect base city within the
Northern dialect region (Labov, Ash, & Boberg, 2006).³ Although recordings of 46 speakers
were available, 6 speakers were excluded from analysis because they were determined to have an
insufficient amount of data for one or more of the segments of interest.
The kinematic articulatory data in the XRMB corpus consists of positional trajectories of
small pellets (2.5 mm) attached to various points on the articulatory anatomy and to stable
reference points in the mouth (Figure 2.1). Pellet trajectories were sampled at different rates
(ranging from 40 to 160 samples/second) and subsequently interpolated and resampled to 160
samples/second. Acoustic data consists of audio recordings captured synchronously with the
kinematic data recording.
³ 33 out of the 40 speakers included in the analysis grew up in areas of Wisconsin, Minnesota, or Illinois belonging to the Northern dialect region, most of whom still lived within this region. Of the seven other speakers, one grew up in
the New York City dialect region, one grew up in the Eastern New England dialect region, one grew up in the Mid-
Atlantic dialect region, one grew up in the Midland dialect region and three grew up in the American West dialect
region. The only notable known difference in the production of the segments of interest across these dialects is the
non-rhoticity traditionally described in the New York City, Eastern New England, and Mid-Atlantic regions, which
was not observed for any of the speakers from those regions. The speakers from outside of the Northern dialect region
were included in the study after establishing that their exclusion did not change the results of the statistical analyses,
indicating that their production of the segments under investigation does not significantly diverge from the speakers
in the Northern dialect region.
Figure 2.1. Placement of pellets in the XRMB data. Green circles indicate pellets from which
articulatory measurements were taken. (Schematic adapted from Westbury, 1994).
A total of 10,899 tokens of /t/, /s/, /ʃ/, /l/, and /ɹ/ (breakdown in Table 2.1) were extracted
from sentences that participants read in two of the tasks included in the corpus, namely a
Sentence reading task and a Prose Passage reading task. These particular segments were chosen
for analysis for two reasons. First, most of these segments are known to exhibit substantial
variability in tongue posture and/or lip-rounding across and within speakers in American English
(Bladon & Nolan, 1977; Dart, 1998; Delattre & Freeman, 1968; Ladefoged & Maddieson, 1996;
Lindau, 1985; Mielke et al., 2016). Second, most of these segments are continuants with
information about their identity transmitted acoustically throughout the entire duration of their
production, making them particularly suitable for examining the relationship between variability
in articulation and acoustics and for planned studies involving speech perception.
Table 2.1. Breakdown of examined tokens in the XRMB corpus across segments and speakers.⁴

                          /t/    /s/    /ʃ/    /l/    /ɹ/
Minimum tokens/speaker     66     50     23     85     50
Maximum tokens/speaker     82     65     27     99     67
Median tokens/speaker      74     57     26     94.5   60

⁴ An analysis using Spearman’s rank-order correlation was conducted to ensure that differences in the number of tokens did not correlate with differences in measured variability. Variability was not found to be correlated with the number of tokens analyzed for a speaker for any dimension in any segment.
Tokens were taken from word-initial and word-final positions that were not directly adjacent to a
lingual consonant (heterosyllabically or tautosyllabically). Although tokens were extracted from
all available vowel and labial consonant environments, in the interest of maintaining a maximally
similar data set across speakers and segments, vowel contexts were excluded from the statistical
analysis if tokens in that context were only available for a subset of the speakers. Additional
tokens were excluded from analysis on a case-by-case basis due to sensor tracking errors (384
tokens total, approximately 3.5% of all data).
2.3.2. Articulatory analysis
2.3.2.1. Identification of temporal landmarks
For all gestures of interest in a particular segment, temporal landmarks were
automatically identified using a version of the findgest algorithm (originally by Mark Tiede,
Haskins Laboratories) modified to automatically locate and extract temporal gesture landmarks
for all tokens of a specified segment in a set of point tracking data (in this case, the Wisconsin
XRMB corpus). Using an acoustic segmentation of the XRMB data created using the Penn
Phonetics Lab Forced Aligner (Yuan & Liberman, 2008; forced alignment of the XRMB data
performed by Tiede and Yuan), tokens of /t/, /s/, /ʃ/, /l/, and /ɹ/ were located in the articulatory
data by finding the articulatory frames corresponding to the acoustically-defined segment start
and end points. These acoustic boundaries were used to define the search window over which the
algorithm would look for movement extrema in the 2-D time series of the relevant pellets’
position signals.⁵ Temporal landmarks for gestures were identified using the positional time
function of the pellet of interest. The movement extremum of the gesture was defined as the
velocity minimum closest to the segment’s acoustic midpoint. The onset of the gesture was
defined as the point preceding the first peak velocity of the gesture where the velocity signal first
crossed a threshold of 20% of the peak velocity, while the offset of the gesture was defined as
the point following the second (release) peak velocity where the velocity signal fell below 20%
of the peak value.

⁵ For /l/ and /ɹ/, the window limit was extended 50 ms beyond the acoustic end points on either side, due to the observation that target attainment for the tongue tip gesture may precede or follow the acoustic signal for these segments (e.g., Recasens & Farnetani, 1994; Lawson, Stuart-Smith, & Scobbie, 2018).
Time functions of pellets placed on the upper and lower lips (UL & LL) and on the
tongue tip (T1), blade (T2), body (T3) and dorsum (T4) were used to find the time of movement
extremum for the articulatory gesture(s) used to form each segment. For tokens of /t/, /s/, /ʃ/, and
/l/, temporal landmarks were extracted for a tongue tip gesture using the 2D tangential velocity
signal for the anterior lingual pellet (T1). For /ɹ/, temporal landmarks were extracted for pellets
on the tongue tip (T1), tongue blade (T2) and tongue body (T3), with landmarks taken from the
pellet calculated to have the smallest distance from the palate. Positional measurements of pellets
on the tongue dorsum and lower lip (as well as pellets on the tongue blade, tongue body, and
upper lip) were taken at the time of maximum constriction for the anterior lingual constriction
for those segments.
2.3.2.2. Constriction measurement
Positional measurements were extracted for all lingual and labial pellets at the time of movement
extremum (velocity minimum) for the tongue tip/palatal constriction gesture in each consonant.
These positional measurements were used in the analyses described below, except in instances in which temporal landmarks were also extracted for labial and/or tongue dorsum
gestures in a segment, in which case positional measurements of the lips and/or tongue dorsum
were taken at the time of the positional extremum for the corresponding gesture.
Constriction location (CL) and degree (CD) measurements were logged for the pellet
determined to be closest to the palate trace (i.e., the narrowest constriction) for each token (T1 for all tokens of /s/, /ʃ/, /l/, and /t/, and either T1, T2, or T3 for /ɹ/⁶). A small number of tokens
where the pellet had crossed the palate trace at the time of movement extremum were excluded
from analysis. The constriction location measurements were taken by calculating the x-axis
distance of the pellet of interest from the coordinate system origin (positioned at the tip of the
maxillary incisors). All constriction location measurements were normalized by the length of the
speaker’s vocal tract to account for differences in individual speakers’ vocal tract lengths. The length of the vocal tract was measured as the distance from the LI sensor to the closest point on a
pharyngeal wall outline obtained for each speaker (Westbury, 1994, p. 51). Constriction degree
measurements were taken by calculating the Euclidean distance between the pellet’s X-Y
coordinate position and the closest point on the palate trace taken from each speaker. Labial
aperture (LA) and protrusion (LP) were calculated in a manner similar to that used for CL and
CD. Labial protrusion was calculated as the x-axis distance of the pellet on the lower lip from the
coordinate system origin, while labial aperture was calculated as the Euclidean distance between
the X-Y coordinates of pellets placed on the upper and lower lip.

⁶ The use of multiple pellets to measure CL and CD for /ɹ/ was necessary given the (expected) variation observed within and across speakers in which part of the tongue was used to make a constriction along the palate for this segment (e.g., Westbury et al., 1998).
In addition to these measurements, the orientation of the tongue tip (CO) was taken by
calculating the angular orientation of T1 relative to T2 following the method used in Westbury et
al. (1998) (Eq. 2.1).
CO = tan⁻¹((T1y − T2y) / (T1x − T2x))        (2.1)

where T1x, T1y and T2x, T2y are the x- and y-coordinates of the T1 and T2 pellets.
Positive CO values indicate that T1 is closer to the palate than T2, suggesting the token was
produced with a raised tongue tip (more apical), while negative angles indicate that T2 is closer
to the palate than T1, suggesting a lowered tongue tip (more laminal).
2.3.3. Acoustic analysis
2.3.3.1. Formant measurements
For /l/ and /ɹ/, values for the first five formants were automatically extracted at either the
time of maximum constriction for the anterior lingual gesture in the consonant or at the closest
voiced interval to this timepoint. This duality in the selection of a timepoint for acoustic analysis
is due to the observation that the time of maximum constriction for the anterior gesture in the
target consonant occurred outside of its acoustic duration in many tokens of /l/ and /ɹ/. In all of
these instances the target consonant either followed (onset tokens) or preceded (coda tokens) a
labial consonant or a period of silence, consistent with observations that this gesture is often
produced late, and often after the offset of voicing or the transition to another consonant, in pre-
silence or pre-consonantal coda tokens of both /l/ (Lin, Beddor, & Coetzee, 2014; Recasens &
Farnetani, 1994) and /ɹ/ (Lawson, Stuart-Smith, & Scobbie, 2018). Following other studies
where this asynchrony between articulatory and acoustic achievement of the consonant has been
observed, formant measurements were taken at the onset or offset of periodic voicing for the
target segment in tokens of /l/ and /ɹ/ where the time of maximum constriction for the anterior
lingual gesture occurred outside of the acoustic duration of the target segment. Formant
measurements were taken at the timepoint of the maximum articulatory constriction in tokens
where this timepoint occurred during the acoustic duration of the consonant.⁷

⁷ The temporal lag between the acoustic onset/offset of the target segment and the time of maximum constriction for the anterior lingual gesture was also measured for /l/ and /ɹ/, but is not included as a factor in the analyses here.
Formant tracking was configured to find five formants below 5000 Hz for male speakers
and five formants below 5500 Hz for female speakers. Values of extracted formants were
checked to ensure that they fell within a reasonable range of expected values for the target
consonant, defined for F3 in /ɹ/ as being between 1300–2100 Hz and for F2 in /l/ as being between 800–1300 Hz. Tokens that fell outside of these ranges, as well as tokens where one or
more formant values were more than 2.5 standard deviations away from the speaker mean, were
visually inspected and manually corrected in Praat (Boersma & Weenink, 2020).
The first four formants of each liquid consonant were selected for analysis, and an
additional measurement of the distance between F2 and F1 was also included as a dimension of
interest. The low F3 value observed in /ɹ/ is generally considered to be its most prominent
acoustic feature (Boyce & Espy-Wilson, 1997; Delattre & Freeman, 1968; Hagiwara, 1995) and
has been shown to be the most salient acoustic correlate of this segment in perception (O’Connor
et al., 1957; Twist et al., 2007). The high F2 values observed in /ɹ/ have also been shown to have
an important influence on listeners’ perception of rhoticity (Heselwood & Plug, 2011; Polka &
Strange, 1985). While no evidence points to F4 values in /ɹ/ being particularly perceptually
salient to listeners, they are known to covary with variation in the location and postural
realization of the palatal constriction gesture in this segment and are thus included in the analysis
as a possible item of interest. The coronal liquid /l/ is differentiated from /ɹ/ and /w/ by the
relative location of F2 and F3, with F3 much higher in /l/ than /ɹ/ and F2 higher in /l/ than /w/
(Lehiste, 1964; O’Connor et al., 1957). Additionally, both the value of F2 and the F2-F1 distance
have been highlighted as important acoustic correlates of /l/ velarization, with more velarized
tokens exhibiting smaller F2 values and a smaller F2-F1 distance (Lehiste, 1964; Recasens,
2004, 2012; Sproat & Fujimura, 1993).
2.3.3.2. Spectral measurements
The first four spectral moments of each fricative (M1, M2, M3 and M4) were calculated using a DFT taken over a 50 ms Hamming window centered either at the time of the
maximum tongue tip constriction in the consonant’s articulation (MAXC) or at the acoustic
midpoint of the segmented token. MAXC was used as the center of the 50 ms analysis window
by default, but the acoustic midpoint of the segment was selected as an alternate centering point
for a small number of tokens where the windowed interval overlapped with either a period of
silence or another segment when centered on MAXC. Prior to the calculation of spectral
moments, a high-pass filter with a cut-off at 500 Hz was applied to the fricative spectrum to
exclude spectral energy resulting from vocal fold vibration. Tokens of /s/ with measured M1
values below 5000 or above 10000 Hz, tokens of /ʃ/ with M1 values below 3000 or above 5000
Hz, and all tokens where one or more measured dimensions was more than 2.5 standard
deviations away from the speaker mean were visually inspected and, when necessary, manually
corrected in Praat.
The first four spectral moments of a fricative represent the weighted average of its
spectral frequency peaks, also known as the center of gravity (M1); the spread of the distribution
of frequency peaks, or their standard deviation (M2); the tilt of the distribution of spectral energy
around the mean, or its skewness (M3); and the peakedness of the distribution of spectral energy,
or its kurtosis (M4). These four spectral moments are frequently used in the acoustic analysis and
classification of fricative consonants (e.g., Forrest et al., 1988; Fox & Nissen, 2005; Jongman et
al., 2000; Nissen & Fox, 2005; Nittrouer, 1995; Shadle & Mair, 1996; Tomiak, 1990), and
empirical evidence suggests that they reflect the location of and tongue posture during fricative
production (Iskarous et al., 2010; Li et al., 2009). In terms of the articulatory correlates of the
examined spectral dimensions, M1 and M3 are known to exhibit sensitivity to the location of the
primary (lingual) constriction, as well as to labial protrusion and rounding, while M2 and M4 are
thought to reflect differences between laminal and apical configurations (Dart, 1991, 1998). Each
of these spectral measurements has been observed to differentiate /s/ and /ʃ/ in previous research,
although the consistency of this observation and the size of the differentiation between these two
segments varies across these dimensions of measurement. Specifically, while M1 has been
consistently shown to reliably distinguish /s/ from /ʃ/ (Jongman et al., 2000; Nittrouer, 1995;
Shadle & Mair, 1996), and to potentially be sufficient on its own to distinguish these segments
(Li et al., 2009), the extent to which M2, M3 and M4 serve to differentiate /s/ from /ʃ/, and even
which segment tends to exhibit lower values along each dimension, tends to differ across studies
(Jongman et al., 2000; Tomiak, 1990; Li et al., 2009).
[Footnote 8: These differences in the utility of the examined spectral dimensions for distinguishing English coronal fricatives, while not a major focus of the present analysis, may be important to keep in mind when considering the potential implications of the appearance of variability, and of the transmissibility of articulatory variability through the acoustic signal, for interspeaker communication.]
2.3.3.3. MFCC calculations
A set of mel-frequency cepstrum coefficients (MFCCs) was calculated from the XRMB
audio to provide a measurement of the acoustic signal that (a) better reflected the psychophysical
properties of the human auditory system than the formant and spectral measurements and (b) did
not reflect assumptions regarding which properties of the signal were most likely to reflect
articulatory variation in particular phonological segments. MFCCs are commonly used in speech
recognition and speech synthesis (e.g., Rabiner et al., 1993), and have been used successfully to
synthesize articulatory trajectories through acoustic-to-articulatory inversion (e.g., Chartier et al.,
2018). MFCCs were calculated for the acoustic recordings from the XRMB corpus using the
mfcc function in Matlab 2020a (window size: 25 ms, lag: 5 ms). For each token of /s/, /ʃ/, /l/ and
/ɹ/, 13 MFCCs were extracted from this signal at the window whose center was closest to the timepoint at which formant/spectral calculations were made. MFCCs were used only for a set of models constructed to evaluate the extent to which variability along the articulatory dimensions of interest could be predicted by variability in the acoustic signal (see 2.3.5 for more detail) and were not used in any direct analyses of inter- or intraspeaker variability.
[Footnote 9: The 0th MFCC was excluded from analysis due to its association with low-frequency energy related to the glottal voice source and, therefore, not directly reflected in the examined articulatory signal.]
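As a small illustration of the alignment step (not the dissertation's code), given a matrix mfccs with one row of 13 coefficients per 5 ms frame and a measurement time t_meas in seconds, the frame whose window centre lies closest to the measurement timepoint can be selected in R as follows; all names here are hypothetical.

frame_at <- function(mfccs, t_meas, hop = 0.005, win = 0.025) {
  centres <- (seq_len(nrow(mfccs)) - 1) * hop + win / 2   # window-centre times (s)
  mfccs[which.min(abs(centres - t_meas)), ]               # the 13 MFCCs at that frame
}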
2.3.4. Calculation of dispersion
To assess the extent of the variability exhibited along particular articulatory and acoustic
dimensions in specific segments by different speakers, measures of dispersion were calculated
for each speaker along each measured dimension for each segment under study. Analyses were
conducted for both the coefficient of variation (CoV) (Pearson, 1896) and Interquartile Range
(IQR). CoV is a statistical measure of dispersion calculated as the ratio of the standard deviation
of a distribution to its mean (often multiplied by 100, so that it can be presented as a percentage)
(Everitt, 1998). As it is a standardized measurement, it can be used to compare variation in sets
of data with different units of measurement or with large differences in mean values and is
particularly appropriate for use in cases where there is a proportional relationship between the
standard deviation and the mean of the distribution. However, CoV is only meaningful for data
measured on a ratio scale and may generate misleading characterizations of dispersion for data measured on an interval scale (e.g., measurements of location in cartesian coordinates or angular measurements). While all of the formant and most of the spectral measurements examined here are ratio scale data (M3 and M4 being the exceptions to this generalization), among the articulatory dimensions examined only CD and LA unambiguously satisfy the assumptions for using CoV. Although CoV is arguably a more accurate representation of normalized variability for ratio scale measurements, for the sake of consistency all analyses presented in this chapter use Interquartile Range as the index of the extent of variability observed in each speaker's production along a dimension for a particular segment (supplementary analyses using CoV were conducted for all ratio scale dimensions and are given in Appendix A). Instances in which the results of the CoV-based analyses differ from the IQR-based analyses are noted in the text.
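For reference, CoV as defined above reduces to a one-line function; this is a minimal R sketch (named cov_pct here to avoid masking the base R covariance function cov):

cov_pct <- function(x) 100 * sd(x) / mean(x)   # coefficient of variation, as a percentage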
IQR measures the difference between the 25th percentile and the 75th percentile of a set of data. As it represents the spread of the middle 50% of the data, IQR has a breakdown point of 25%
and is therefore considered a fairly robust measure of spread (Hampel, 1971). As its accuracy is
not predicated on the type of data used or any contingent relationship between central tendency
and spread of the data, it is taken to be a more accurate measurement of relative dispersion than
CoV for all non-ratio-scale measurements examined here. IQR values were calculated using the
IQR function from the R stats package (R Core Team, 2020) with a Type 7 calculation of
quartiles (Hyndman & Fan, 1996), and 95% confidence intervals were estimated for each
calculated IQR value using a bootstrap method (10,000 samples of 50 tokens, sampling with
replacement). In addition to calculating measures of dispersion using the raw articulatory
dimensions, measurements obtained from all speakers for a particular dimension in a particular
segment were rescaled from 0 to 100 and used to calculate an additional set of dispersion
metrics. This rescaling created a similar scale of measurement for all dimensions, maximizing
the comparability of calculated IQR values across speakers, segments, and dimensions while still
preserving differences in the spread and relative location of individual speakers’ production
values along each measurement scale. These rescaled metrics were not used in any statistical
analyses and are solely used for visualization purposes in the graphs presented in 2.4.1.
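The following is a minimal sketch in R of the dispersion computation just described, for a single vector x of one speaker's measurements along one dimension in one segment; it is illustrative rather than the exact analysis script.

iqr_with_ci <- function(x, n_boot = 10000, n_tok = 50, level = 0.95) {
  obs  <- IQR(x, type = 7)                                # Type 7 quartiles (the R default)
  boot <- replicate(n_boot,
                    IQR(sample(x, n_tok, replace = TRUE), type = 7))
  list(iqr = obs,                                         # point estimate
       ci  = quantile(boot, c((1 - level) / 2, (1 + level) / 2)))  # bootstrap 95% CI
}
rescale_0_100 <- function(x) 100 * (x - min(x)) / (max(x) - min(x))  # for plotting only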
For each speaker, three different measures of dispersion were calculated for each
articulatory dimension in each consonant using IQR. Overall measures of dispersion (IQRTOT)
were calculated using all tokens produced by a speaker along a single dimension within a single
segment (Eq. 2.2). This measurement was intended to index the variability exhibited by a
speaker in their production of a single dimension in a single consonant across the data set as a
whole (i.e., across all factors within the selected sample of tokens that could induce subphonemic
variability in the speech of a single speaker).
$\mathrm{IQR}_{TOT} = Q_{0.75}(x_{TOT}) - Q_{0.25}(x_{TOT})$   (2.2)
(where $x_{TOT}$ = the full set of tokens of a segment produced by a speaker)
In addition to the overall measure of dispersion, two additional measures were calculated
to capture the variability exhibited by a speaker at a specific level of structural granularity. A
within-context measure of dispersion was calculated for each dimension in each segment as the
average dispersion exhibited by a speaker across tokens within a unique phonetic context
(IQRCON) (Eq. 2.3). This measure was intended to index the extent of stochastic variability a
given speaker exhibited along a dimension for a particular segment. For the purpose of this
analysis, the phonetic context of a token was defined on the basis of the segmental and prosodic
environment in which it occurred, with tokens from the same unique context occurring with the
same surrounding segmental material and in the same (broadly defined) prosodic position within
the utterance.
$\mathrm{IQR}_{CON} = \frac{1}{n} \sum_{i=1}^{n} \left[ Q_{0.75}(x_{c_i}) - Q_{0.25}(x_{c_i}) \right]$   (2.3)
(where $c_i$ = a unique phonetic context and $n$ = the number of unique phonetic contexts)
Finally, cross-context measures of dispersion were calculated as the variability observed
among the median production values obtained for a dimension of interest within unique phonetic
contexts (defined based on the segmental and prosodic environment in which a token occurred)
(IQRCROSS) (Eq. 2.4). This measure was intended to index the extent of contextual
(coarticulatory and/or prosodic) variation a speaker exhibited along a dimension in their
production of a particular segment. As with the overall and context-dependent measurements,
cross-context measures of dispersion were calculated separately for each speaker along each
measured dimension in a given segment.
$\mathrm{IQR}_{CROSS} = Q_{0.75}(\{\tilde{x}_{c_1}, \tilde{x}_{c_2}, \ldots, \tilde{x}_{c_n}\}) - Q_{0.25}(\{\tilde{x}_{c_1}, \tilde{x}_{c_2}, \ldots, \tilde{x}_{c_n}\})$   (2.4)
(where $\tilde{x}_{c_i}$ = the median production value within unique phonetic context $c_i$, and $n$ = the number of unique phonetic contexts)
[Footnote 10: The definition of prosodic environment was based on (a) the position of the segment in a word and (b) its location relative to pauses and sentence boundaries in the transcription used for the forced alignment of the XRMB data.]
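Restated in code, the three dispersion measures of Eqs. 2.2-2.4 can be computed as below; this is a minimal R sketch assuming a hypothetical data frame d holding one speaker's tokens of a segment, with columns value (the measured dimension) and context (the unique phonetic context label).

by_ctx    <- split(d$value, d$context)                  # tokens grouped by phonetic context
iqr_tot   <- IQR(d$value, type = 7)                     # Eq. 2.2: overall dispersion
iqr_con   <- mean(sapply(by_ctx, IQR, type = 7))        # Eq. 2.3: mean within-context IQR
iqr_cross <- IQR(sapply(by_ctx, median), type = 7)      # Eq. 2.4: IQR of context medians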
2.3.5. Statistical analysis
All statistical analyses were conducted in R 4.0.2 (R Core Team, 2020). The theoretical
proposals evaluated in this dissertation are contingent on the existence of concrete (that is,
statistically significant) interspeaker and intergestural differences in the extent of variability
exhibited along specific phonetic dimensions. To evaluate the existence of this relationship,
systematic interspeaker comparisons of dispersion were conducted along each of the measured
articulatory and acoustic dimensions of interest for each segment for each of the levels of
structural granularity for which measures of dispersion were calculated (overall, cross-context
[contextual], and within-context [stochastic]). As existing research on speech and non-speech
motor control unequivocally indicates that there should be some variation in the extent of
variability exhibited by individual speakers in the corpus, the main purpose of this analysis was
to ascertain (a) whether these differences in variability were large enough to infer statistically
significant differences between individual speakers’ distributions of productions and (b) the
extent to which the observed variability differed across segments and across dimensions.
2.3.5.1. Analysis of individual differences in articulatory and acoustic variability
For each articulatory dimension in each segment for which variability was indexed by
IQR, the Brown-Forsythe test for equality of group variances (Brown & Forsythe, 1974) as
implemented by the hov function in the R package HH (Heiberger, 2020) was used to determine
whether there was heterogeneity of variance in the realization of that articulatory dimension
across the group of speakers as a whole. Post-hoc pairwise comparisons of variability between
individual speakers were conducted by comparing the 95% confidence intervals for their IQR values, with a
lack of overlap in these confidence intervals for two speakers taken to indicate that they
significantly differed in how variable they were in their production of that dimension.
Comparison of within- and cross-context IQR values was also conducted by comparing 95%
confidence intervals for individual speaker IQR values.
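Because the Brown-Forsythe statistic is a one-way ANOVA computed on absolute deviations from group medians, the group-level test and the confidence-interval overlap criterion can both be sketched in base R as follows; the hov function cited above is the implementation actually used, and this equivalent is for illustration only.

bf_test <- function(value, speaker) {
  dev <- abs(value - ave(value, speaker, FUN = median))  # deviations from speaker medians
  oneway.test(dev ~ speaker, var.equal = TRUE)           # Brown-Forsythe F test
}
ci_overlap <- function(ci_a, ci_b) ci_a[1] <= ci_b[2] && ci_b[1] <= ci_a[2]
# two speakers differ significantly in variability when !ci_overlap(ci_a, ci_b)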
2.3.5.2. Analysis of articulatory-acoustic relations and recoverability of variability
The second major analysis in this chapter was designed to examine articulatory-acoustic
and acoustic-articulatory relations in the XRMB data and to evaluate how well individual
patterns of variability are reflected between the two modalities. A series of linear mixed effects
models (LMERs) were fit to the data to determine which articulatory dimensions were
significant predictors of each measured acoustic dimension and which acoustic dimensions were
significant predictors of each measured articulatory dimension. These models were then used to
compare how well different articulatory and acoustic dimensions were predicted by data in the
other modality (and how this compared across segments), and the extent to which the predicted
values from each model correlated with and incorporated the patterns of variability observed in
the raw data. All models were fit using the lmer function in the R package lmerTest (Kuznetsova,
Brockhoff, & Christensen, 2016).
Each of the measured acoustic and articulatory dimensions in each of the four segments
for which both articulatory and acoustic data were measured (/s/, /ʃ/, /l/ and /ɹ/) occurred as the
dependent variable in two models. The first model fit to each articulatory dependent variable had
the full set of acoustic measurements of interest (F1-F4 and the F2-F1 distance for /l/ and /ɹ/, and
M1-M4 for /s/ and /ʃ/) as fixed effects and a random intercept for Speaker and Phonetic Context.
A random slope of the fixed effect that was the strongest predictor of the dependent variable by
Speaker was included in each model after confirming that the inclusion of this slope significantly
36
improved model fit by using the step function in the R package lmerTest (Kuznetsova et al.,
2016) to perform Akaike Information Criterion (AIC)-based model simplification.
The second
model fit to each articulatory dependent variable had the vector of calculated MFCC values as
fixed effects, random intercepts for Speaker and Phonetic Context, and a random slope by Speaker for the MFCC with the highest predictive power. The models fit to each acoustic dependent
variable mirrored those fit to the articulatory dimensions, with fixed effects consisting of the
articulatory dimensions of interest (CL, CD, CO, LA and LP) in the first model and the complete
vector of raw pellet positions in the second model (encompassing the coordinates of the positions
of all pellets shown in Figure 1). The random effects structure for the models fit to the acoustic
dimensions of interest mirrored that of the models fit to articulatory dimensions.
[Footnote 11: Although the inclusion of additional random slopes further improved fit for some of the constructed models, the decision to include only the random slope that accounted for the highest proportion of model variance (which was always the slope of the most predictive fixed effect by SPEAKER) was ultimately made to maximize consistency across the many models fit. The decision to exclude additional random slopes did not impact any of the gross results of the acoustic-articulatory analysis.]
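As an illustration of the model structure just described (not the dissertation's exact code), Model 1 for one articulatory dependent variable in a fricative might be specified as follows; the data frame s_data, its column names, and the assumption that M1 is the strongest fixed-effect predictor are all hypothetical.

library(lmerTest)
m1 <- lmer(CD ~ M1 + M2 + M3 + M4 +    # acoustic fixed effects for /s/ and /ʃ/
             (1 + M1 | speaker) +      # by-speaker intercept and slope of strongest predictor
             (1 | context),            # random intercept for phonetic context
           data = s_data)
step(m1)   # lmerTest's backward-elimination check of the model structure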
Conditional and marginal R² values were calculated for each of the models using the r.squaredGLMM function in the R package MuMIn (Barton, 2020) to facilitate the cross-model comparison of fit. For mixed effects models, the marginal R² value represents the amount of variance in the data explained by the model's fixed effect structure (similar to the interpretation of R² for ordinary least squares regression), while the conditional R² represents the amount of variance in the data explained by the entire structure of the model (both fixed and random effects). Both conditional and marginal R² values were calculated using the method presented in Nakagawa, Johnson, and Schielzeth (2017).
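Continuing the hypothetical sketch above, both fit statistics can be extracted from a fitted model in one call:

library(MuMIn)
r.squaredGLMM(m1)   # returns marginal (R2m) and conditional (R2c) R-squared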
Finally, an analysis comparing the patterns of variability in the predicted values
generated by each model to those observed in the raw data was conducted to evaluate how well
interspeaker differences in variability are accounted for in the statistical modeling of articulatory-
acoustic and acoustic-articulatory relations. Measures of dispersion were calculated for each
model’s predicted values using the methods outlined in 2.3.4 and were compared to the measures
of dispersion calculated from the XRMB data using Spearman’s rank-order correlation.
2.4. Results
2.4.1. Interspeaker differences in articulatory and acoustic variability
2.4.1.1. Articulatory variability
To establish whether there was significant heterogeneity among the examined speakers in
the XRMB corpus in the stability of their production of the articulatory dimensions in a given
segment, the Brown-Forsythe test for equality of group variances was conducted for each of the
five articulatory dimensions of interest for each segment. The results of this statistical analysis
are given in Table 2.2, with visualization of interspeaker differences in variability for all
articulatory dimensions in each segment shown in Figure 2.2.
The results of the Brown-Forsythe test were significant for all dimensions in all segments, leading us to reject the null hypothesis of equal variance across the examined speakers. These results conform with visual inspection of the graphs in Figure 2.2, which indicate that speakers differ substantially from one another in how precise their
production is (and subsequently, how much variability is observed in their production) for each
of the examined articulatory dimensions in each segment. On the whole, then, the results of the
analysis of variability across the group of speakers in its entirety indicate that there is at least
some degree of interspeaker difference in the dispersion of productions along every dimension
examined in the data set.
Table 2.2. Results of Brown-Forsythe test for homogeneity of variance in all segments. All
comparisons are significant (p < 0.05).
/t/ /s/ /ʃ/ /l/ /ɹ/
F* p F* p F* p F* p F* p
CL 232.359 0.000 28.490 0.000 111.190 0.000 43.119 0.000 23.514 0.000
CD 88.945 0.000 192.942 0.000 208.583 0.000 214.083 0.000 341.185 0.000
CO 62.923 0.000 13.372 0.000 130.403 0.000 15.028 0.000 97.905 0.000
LA 141.107 0.000 316.189 0.000 339.265 0.000 158.165 0.000 225.894 0.000
LP 354.515 0.000 299.155 0.000 178.807 0.000 379.199 0.000 250.784 0.000
Although the finding that there is at least some variation among speakers is consistent on
the level of the group as a whole, the examination of pairwise comparisons suggests that the
number of speakers who differ from one another in variability and how large these differences
are varies along different articulatory dimensions. Visual inspection of the graphs in Figure 2.2
indicate that, at least for most dimensions in most segments, these differences between speakers
are fairly widespread amongst the group and are not the result of one or two outliers; however,
for some dimensions, such as CD and CO in /t/, it is unclear that this is the case. This impression
of widespread interspeaker differences for some dimensions in some segments, and much more
constrained differences for others, was statistically verified in the analysis of pairwise
comparisons conducted by comparing 95% confidence intervals for IQRTOT values (Table 2.3).
The results of the statistical analysis of pairwise comparisons confirm the impression of a general, but not universal, tendency towards heterogeneity of variance across the population of speakers as a whole, and show differences between a substantial number of individual speakers within the group, rather than one or two outlier speakers diverging from the norms of an otherwise fairly homogeneous group. As can be seen in the summary table, the
percentage of significant pairwise comparisons differs substantially across dimensions and
segments, ranging from 8.9% (LA in /ʃ/) to 50.4% (CD in /l/) of comparisons indicating a
significant difference at the Bonferroni-corrected significance level (α = 0.000067).
Some general trends are observed among the examined segments and dimensions with
respect to speakers’ likelihood to significantly differ from one another in variability. For
example, CD and CL tend to have some of the largest percentages of significant pairwise comparisons, with the largest number of significant pairwise differences in every segment observed for either CL or CD. Speakers also tend to significantly differ from one
another more for /l/ and /ɹ/ than for other segments across all dimensions, although this is not an
absolute pattern. On the whole, the particular dimensions and segments that tend to exhibit more
variation across speakers are those that are either already known to serve as a locus of
interspeaker variation in the production of those segments (such as CL and CO in /ɹ/) or that may
be less likely to reflect individual differences in coarticulatory tendencies due to their direct
relevance to the realization of articulatory goals (such as CL and CD).
Figure 2.2. IQRTOT (scaled) values by segment for all articulatory dimensions. Order of speakers
(x-axis) in all graphs is based on the rank-ordering of IQRTOT values for CL in /t/. Error bars
show the 95% confidence interval for each speaker’s IQRTOT value.
Table 2.3. Proportion of pairwise comparisons of IQRTOT that were significant for each
measurement (by segment).
/t/ /s/ /ʃ/ /l/ /ɹ/
CL 0.306 0.303 0.169 0.469 0.458
CD 0.194 0.366 0.166 0.504 0.297
CO 0.137 0.187 0.251 0.138 0.377
LA 0.154 0.16 0.089 0.245 0.276
LP 0.202 0.33 0.101 0.237 0.133
Similar results were observed for both cross-context (Figure 2.3) and within-context variability
(Figure 2.4). Pairwise comparisons between speakers for each dimension in each segment were again conducted through the comparison of 95% confidence intervals for speakers' calculated IQR values. These comparisons, as well as visual inspection of the graphs in
Figures 2.3 and 2.4, indicate that there is a substantial degree of general heterogeneity in both
IQRCROSS and IQRCON among the group of speakers as a whole, mirroring the results of the
analysis of overall variability (Tables 2.4 and 2.5).
[Footnote 12: The percentages given for all dimensions in /s/ do not include comparisons involving JW58 (the rightmost speaker in each graph), who was a clear outlier in his production of this segment. Specifically, JW58 was the only speaker to exhibit a discretely bimodal distribution in his production of /s/, and this bimodality was observed across multiple dimensions. He is included in all figures for reference, but since the modes in his production of /s/ were remarkably discrete and his variability measurements therefore potentially misleading when compared to other speakers with unimodal production, he is excluded from all statistical analyses in this chapter and in Chapter 3 that use the XRMB data.]
Figure 2.3. IQRCROSS (scaled) values by segment for all articulatory dimensions. Order of
speakers (x-axis) in all graphs is based on the rank-ordering of IQRCROSS values for CL in /t/.
Error bars show the 95% confidence interval for each speaker’s IQRCROSS value.
Figure 2.4. IQRCON (scaled) values by segment for all articulatory dimensions. Order of speakers
(x-axis) in all graphs is based on the rank-ordering of IQRCON values for CL in /t/. Error bars
show the 95% confidence interval for each speaker’s IQRCON value.
Table 2.4. Percentage of pairwise comparisons of IQRCROSS that were significant for each
measurement (by segment).
/t/ /s/ /ʃ/ /l/ /ɹ/
CL 0.142 0.224 0.136 0.279 0.230
CD 0.120 0.156 0.146 0.303 0.142
CO 0.098 0.129 0.195 0.050 0.205
LP 0.046 0.237 0.093 0.078 0.038
LA 0.049 0.016 0.148 0.117 0.136
Table 2.5. Percentage of pairwise comparisons of IQRCON that were significant for each
measurement (by segment).
/t/ /s/ /ʃ/ /l/ /ɹ/
CL 0.276 0.252 0.086 0.291 0.218
CD 0.127 0.327 0.284 0.193 0.074
CO 0.166 0.256 0.299 0.329 0.255
LA 0.173 0.259 0.209 0.140 0.251
LP 0.259 0.272 0.290 0.361 0.181
Evaluating the results of the analysis of pairwise comparisons for IQRCROSS and IQRCON
more closely, speakers are observed to differ from one another to a greater extent in within-
context (stochastic) variability than cross-context (contextual) variability for all segments except
/ɹ/. Notable inversions to this pattern are observed for segments and dimensions where we may
expect robust contextual effects, such as CL and CD in /l/, which may reflect speakers’
differential attestation of positional reduction effects and overall differences in tendencies
towards /l/ velarization across contexts (Gick et al., 2006; Scobbie & Pouplier, 2010; Sproat &
Fujimura, 1993).
No notable patterns are immediately apparent in terms of segments or dimensions that
may engender greater differences between speakers in terms of the extent of within- and cross-
context variability they exhibit in articulation, despite the observation of such patterns in the
analysis of overall variability. Of particular interest, there does not seem to be any clear
difference between dimensions that are likely to be more directly related to phonological goals of
the segments (e.g., CL and CD) and those that would be thought to index more contextual or
strategic variation in the manner in which speakers accomplish those goals (e.g., CO), nor
between those dimensions that align with known loci of individual variation in the production of
certain segments. However, this may reflect the fact that fewer observations were available in for
the calculation of each of the IQRCON and IQRCROSS values than were available for the
calculation of IQRTOT, and confidence intervals for these measurements were wider (and more
prone to overlap) as a consequence.
2.4.1.2. Acoustic variability
A parallel analysis of heterogeneity among the examined speakers was conducted for the
acoustic data using the Brown-Forsythe test. The results of this statistical analysis are given in
Tables 2.6 and 2.7, with the calculated IQRTOT values presented visually for comparison in
Figures 2.5 and 2.6. The results of this analysis mirror those of the analysis of interspeaker
differences in the articulatory data, with the statistical significance of the result in all conducted
Brown-Forsythe tests leading us to reject the null hypothesis of equal variance across the
examined speakers for all acoustic dimensions in all segments.
Figure 2.5. IQRTOT (scaled) values for all acoustic dimensions in /s/ (top row) and /ʃ/ (bottom
row). Order of speakers (x-axis) in all graphs is based on the rank-ordering of IQRTOT values for
M1 in /s/. Error bars show the 95% confidence interval for each speaker’s IQRTOT value.
Table 2.6. Results of Brown-Forsythe test for homogeneity of variance for all spectral measures in /s/ and /ʃ/. All comparisons are significant at the α = 0.05 significance level (all p-values rounded to three decimal places).

          /s/                /ʃ/
        F        p         F        p
M1   42.825   0.000     42.061   0.000
M2   39.360   0.000     17.812   0.000
M3   30.229   0.000     16.408   0.000
M4    3.704   0.000     13.368   0.000
Figure 2.6. IQRTOT (scaled) values for all acoustic dimensions in /l/ (top row) and /ɹ/ (bottom
row). Order of speakers (x-axis) in all graphs is based on the rank-ordering of IQRTOT values for
M1 in /s/. Error bars show the 95% confidence interval for each speaker’s IQRTOT value.
Table 2.7. Results of Brown-Forsythe test for homogeneity of variance for all spectral measures
in /l/ and /ɹ/. All comparisons are significant at the α = 0.05 significance level (all p-values
rounded to three decimal places).
/l/ /ɹ/
F p F p
F1 8.424 0.000 10.922 0.000
F2 38.367 0.000 21.747 0.000
F3 66.580 0.000 56.180 0.000
F4 69.326 0.000 90.836 0.000
F2-F1 16.266 0.000 10.961 0.000
A summary of the results of the analyses of pairwise comparisons for all acoustic
dimensions in all segments is given in Table 2.8. This analysis was conducted with the same
methods described for the pairwise comparisons of speakers’ overall variability for the
articulatory dimensions. Similarly to what was observed in that analysis, the number of
significant pairwise comparisons for the acoustic dimensions are generally large enough to
suggest that the heterogeneity of variance observed across the population of speakers as a whole
reflects more than simply the effect of a couple of outlier speakers – that is, while there are
certainly tendencies for speakers to resemble one another in how variable they are, there are also
a notable number of instances in which speakers exhibit statistically significant differences in
variability. Certain patterns can be observed as to where speakers tend to differ from one another more in variability. Specifically, speakers differ from one another more in /s/ than in /ʃ/ and in /l/ than in /ɹ/. Speakers also appear to differ from one another more in M1 and M4 and, at least for /s/, in M2 than in M3 for the fricative consonants.
Table 2.8. Percentage of pairwise comparisons of overall variability that were significant at the Bonferroni-corrected α = 0.000067 significance level for each acoustic measurement in (a) /s/ and /ʃ/ and (b) /l/ and /ɹ/.

(a)        /s/     /ʃ/
  M1     0.274   0.101
  M2     0.226   0.019
  M3     0.131   0.035
  M4     0.381   0.247

(b)        /l/     /ɹ/
  F1     0.326   0.163
  F2     0.192   0.167
  F3     0.400   0.123
  F4     0.322   0.214
  F2-F1  0.301   0.048
Similar results were observed for both cross-context (Figures 2.7 and 2.8) and within-
context variability (Figures 2.9 and 2.10). The visual and mathematical comparisons of 95%
confidence intervals for these IQR values indicate that there is a considerable degree of general
heterogeneity in variance among the group of speakers for most dimensions in most segments,
although for IQRCON these differences appear to be smaller than they were for the articulatory
data (Tables 2.9 and 2.10).
Figure 2.7. IQRCROSS (scaled) values for all acoustic dimensions in /s/ and /ʃ/. Order of speakers
(x-axis) in all graphs is based on the rank-ordering of IQRCROSS values for M1 in /s/. Error bars
show the 95% confidence interval for each speaker’s IQRCROSS value.
Figure 2.8. IQRCROSS (scaled) values for all acoustic dimensions in /l/ and /ɹ/. Order of speakers
(x-axis) in all graphs is based on the rank-ordering of IQRCROSS values for M1 in /s/. Error bars
show the 95% confidence interval for each speaker’s IQRCROSS value.
Table 2.9. Percentage of pairwise comparisons of cross-context variability that were significant for each measurement (by segment).

           /s/     /ʃ/
  M1     0.090   0.119
  M2     0.109   0.147
  M3     0.082   0.133
  M4     0.216   0.177
           /l/     /ɹ/
  F1     0.115   0.106
  F2     0.038   0.053
  F3     0.116   0.007
  F4     0.186   0.163
  F2-F1  0.170   0.004
Mirroring the pattern observed in the articulatory data with respect to the prevalence of statistically significant differences in cross- and within-context variability, speakers again differed from one another to a larger extent in within-context variability than in cross-
context variability. No exceptions to this pattern were observed in the acoustic data. This
observation may suggest that speakers are more likely to resemble one another in the extent of
contextual variability they exhibit, due to the fairly regular nature of the effect context is
expected to have on the acoustic realization of a segment.
Figure 2.9. IQRCON (scaled) values for all acoustic dimensions in /s/ and /ʃ/. Order of speakers (x-
axis) in all graphs is based on the rank-ordering of IQRCON values for M1 in /s/. Error bars show
the 95% confidence interval for each speaker’s IQRCON value.
Figure 2.10. IQRCON (scaled) values for all acoustic dimensions in /l/ and /ɹ/. Order of speakers
(x-axis) in all graphs is based on the rank-ordering of IQRCON values for M1 in /s/. Error bars
show the 95% confidence interval for each speaker’s IQRCON value.
Table 2.10. Percentage of pairwise comparisons of within-context variability that were significant for each measurement (by segment).

           /s/     /ʃ/
  M1     0.569   0.520
  M2     0.519   0.398
  M3     0.524   0.540
  M4     0.623   0.590
           /l/     /ɹ/
  F1     0.447   0.470
  F2     0.505   0.457
  F3     0.559   0.484
  F4     0.528   0.534
  F2-F1  0.474   0.482
2.4.2. Articulatory-acoustic relations
The results of the analysis of individual differences in articulatory and acoustic variability
indicate that speakers within the same speech community significantly differ from one another in
the extent of the variability they exhibit in their realization of specific articulatory and acoustic
dimensions. This finding leads to the question of how variability in these two domains may be
related and, subsequently, the extent to which speakers may be both ‘aware’ of the consequences
of their articulatory variability for acoustics and able to uncover specific information about the
variability present in another speakers’ articulation from that speaker’s acoustic signal.
Relatedly, such ‘transparency’ between articulatory and acoustic variability could serve to reject
the hypothesis that articulatory variability disappears in acoustics, due to how the nonlinear
relation between articulation and acoustics plays out in different vocal anatomies. A series of
LMERs fit to the data were used to examine articulatory-acoustic and acoustic-articulatory
relations in the XRMB data and to evaluate how well individual patterns of variability are
conveyed between the two modalities. The goodness of fit of the fixed effects structure
(Marginal R²) and the entire model (Conditional R²) were used to evaluate how well variation in
either articulation or acoustics was explained by the other domain across the group of speakers as
a whole and how this relationship may differ across the specific dimensions and segments
examined. This same question was examined on the level of the individual through an analysis of
regression coefficients for the random slope by SPEAKER included in each model. Finally, a
comparison of individual variability in the actual data and the values predicted by the model was
conducted to provide a more detailed look at how well individual differences in variability may
be conveyed between articulation and acoustics and vice versa.
2.4.2.1. Encoding of unidimensional articulatory variability in multi-dimensional acoustic space
Summary statistics for marginal and conditional R² values for all models fit to the articulatory data are given in Table 2.11, with the full set of results in Table 2.12. The similarity of the R² statistics for Models 1 and 2 indicates that the predictive power of the MFCC vector was not substantially higher than that of the spectral and formant measurements. Marginal R² values were on average higher for /l/ and /ɹ/ than they were for /s/ and /ʃ/ across models and dimensions, indicating that the fixed effects structure of the models generally explained a larger proportion of variance in the data for the liquid consonants than for the fricatives. However, this was not an absolute pattern on the level of individual dimensions, with notable reversals of the general pattern observed for CL in /ɹ/ and for LP across the fricative and liquid consonants. The opposite pattern was observed for conditional R², with higher median values observed for /s/ and /ʃ/ than for /l/ and /ɹ/. This indicates that the entire model explained a larger proportion of variance in the data for the fricative consonants than for the liquids, suggesting, in combination with the results for marginal R², that individual differences in the acoustic-articulatory mapping, while critical for explaining the full breadth of variation observed in the XRMB data for all segments examined, have greater explanatory power with respect to the prediction of articulation for /s/ and /ʃ/ than for /l/ and /ɹ/.
Table 2.11. Median marginal and conditional R² values by segment across all LMER models fit with an articulatory dependent variable.

               Marginal R²                     Conditional R²
           /s/    /ʃ/    /l/    /ɹ/        /s/    /ʃ/    /l/    /ɹ/
Model 1  0.025  0.070  0.126  0.116      0.869  0.890  0.566  0.757
Model 2  0.020  0.033  0.114  0.057      0.849  0.877  0.525  0.749
Table 2.12. All marginal and conditional R² values for all LMER models fit with an articulatory dependent variable.

               Marginal R²                     Conditional R²
           /s/    /ʃ/    /l/    /ɹ/        /s/    /ʃ/    /l/    /ɹ/
Model 1
CL 0.025 0.107 0.126 0.077 0.870 0.876 0.566 0.394
CD 0.030 0.105 0.165 0.218 0.593 0.692 0.318 0.656
CO 0.008 0.021 0.152 0.116 0.602 0.890 0.497 0.778
LA 0.011 0.070 0.074 0.196 0.869 0.892 0.679 0.757
LP 0.038 0.016 0.005 0.014 0.898 0.890 0.840 0.852
Model 2
CL 0.007 0.058 0.114 0.041 0.849 0.860 0.525 0.374
CD 0.020 0.076 0.173 0.125 0.601 0.632 0.291 0.623
CO 0.063 0.022 0.125 0.057 0.658 0.882 0.313 0.761
LA 0.004 0.033 0.046 0.178 0.859 0.877 0.640 0.749
LP 0.020 0.017 0.008 0.015 0.895 0.894 0.850 0.861
A similar pattern is observed when comparing the magnitude of the random slope by
SPEAKER fit as part of each model’s random effects structure across speakers and segments.
Table 2.13 shows the median magnitude of the standardized random slope across speakers for
each dimension in each segment for each model type. For all but three model comparisons
(Models 1 and 2 with LP, and Model 1 with CL, as the dependent variable), the median standardized slope is larger for both /l/ and /ɹ/ than for /s/ and /ʃ/; in one of these cases (Model 1
with CL as the dependent variable), /ɹ/ still has a substantially larger median slope than any other
segment, despite /l/ diverging from the general pattern and exhibiting greater similarity to the
fricative slopes. The observation of generally larger median slope values for the liquid
consonants suggests that there is generally a stronger relationship between the independent
acoustic variables and the predicted articulatory dimensions for these consonants on the level of
the individual.
Table 2.13. Median regression coefficient for the random slope by SPEAKER in each LMER
model fit to an articulatory dependent variable.
/s/ /ʃ/ /l/ /ɹ/
Model 1
CL 0.0056 0.0052 0.0051 0.0110
CD 0.0456 0.2647 0.5248 0.1639
CO 0.6444 1.3258 3.1677 6.0903
LA 0.1926 0.3162 0.4297 0.4387
LP 0.1816 0.1957 0.1607 0.1912
Model 2
CL 0.0017 0.0004 0.0081 0.0040
CD 0.1028 0.0171 0.4433 0.1489
CO 0.3411 0.2494 0.6745 2.6359
LA 0.0944 0.1279 0.3716 0.3068
LP 0.0727 0.0526 0.0603 0.0636
In addition to evaluating the general ability of the fit LMER models to account for
variation in the articulatory data, an analysis directly evaluating the models’ ability to account
for the interspeaker differences in variability observed in the actual XRMB data was conducted
by calculating the predicted values of each model using the predict function in the lme4 R
package (Bates, Mächler, Bolker, & Walker, 2015). IQRPRED was calculated for each speaker's
predicted data using the method used to calculate IQRTOT values in the actual data. The
relationship between speakers’ calculated IQRPRED and IQRTOT values for a given articulatory
dimension was evaluated using Spearman’s rank-order correlation (Table 2.14). The results of
this analysis suggest substantial variation in the extent to which the fit LMERs captured the
patterns of interspeaker variability present in the actual data (Figure 2.11), with some
comparisons showing little to no relationship between the actual and predicted IQR values and
others indicating a fairly strong relationship for some dimensions in some segments. Taken as a
whole, however, the relatively substantial number of significant comparisons suggests that information regarding individual differences in variability along a plurality of articulatory dimensions could be recovered from the acoustic signal, although the particular dimensions along which this information could be recovered would likely differ depending on the segment
and, potentially, the specific acoustic parameters that speakers attend to (e.g., as suggested by the
differences between Models 1 and 2 for CL in /ɹ/).
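This comparison can be sketched in R as follows, continuing the hypothetical model m1 and data frame s_data from Section 2.3.5; all names are illustrative.

pred     <- predict(m1)                                        # predictions incl. random effects
iqr_pred <- tapply(pred,       s_data$speaker, IQR, type = 7)  # IQRPRED by speaker
iqr_tot  <- tapply(s_data$CD,  s_data$speaker, IQR, type = 7)  # IQRTOT by speaker
cor.test(iqr_tot, iqr_pred, method = "spearman")               # Spearman's rank-order rho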
Table 2.14. Spearman’s rank-order correlation for the comparison of IQRTOT and IQRPRED in
models fit to articulatory dependent variables. Significant comparisons are bolded.
/s/ /ʃ/ /l/ /ɹ/
ρ       p       ρ       p       ρ       p       ρ       p
Model 1
CL 0.162 0.316 0.435 0.005 0.568 0.000 0.247 0.124
CD 0.454 0.004 0.300 0.061 0.550 0.000 0.457 0.003
CO 0.079 0.629 0.550 0.000 0.482 0.002 0.626 0.000
LA 0.359 0.023 0.064 0.692 0.434 0.005 0.553 0.000
LP 0.388 0.014 0.315 0.048 0.403 0.010 0.036 0.827
Model 2
CL -0.001 0.998 0.487 0.002 0.504 0.001 0.234 0.146
CD 0.625 0.000 0.230 0.153 0.561 0.000 -0.097 0.549
CO 0.321 0.044 0.424 0.007 0.304 0.057 0.385 0.015
LA 0.170 0.292 0.463 0.003 0.314 0.049 0.544 0.000
LP 0.067 0.679 0.233 0.148 0.391 0.013 0.454 0.004
Figure 2.11. Relationship between IQRTOT (calculated from the actual XRMB data) and IQRPRED
(calculated from predicted values of LMER Model 1) across speakers for each articulatory
dimension in each segment.
2.4.2.2. Encoding of unidimensional acoustic variability in multi-dimensional articulatory space
A summary and the full set of marginal and conditional R² values for all models fit to the acoustic data are given in Tables 2.15 and 2.16. Median marginal R² values were generally higher for /l/ and /ɹ/ than they were for /s/ and /ʃ/ across models and dimensions, with the marginal R² values for /ʃ/ and /l/ in Model 2 presenting the sole reversal of this pattern. This mirrors the results from the articulatory analysis and again indicates that the fixed effects structure of this set of models generally explained a larger proportion of variance in the data for the liquid consonants than for the fricatives. Inspection of the marginal R² values for individual models suggests a great deal of similarity between /ʃ/ and /l/ in terms of the amount of variability explained by each model, with prediction for /s/ being markedly worse and prediction for /ɹ/ being markedly better. Diverging from what was observed in the articulatory models, there was generally an improvement in marginal R² values when the full articulatory vector was used as a predictor of the acoustic dimensions (in Model 2), a change that is particularly noticeable for /l/ in Table 2.15 and that is likely due to the omission of articulatory measurements corresponding to the velar constriction in this consonant from the vector of predictors in Model 1. Conditional R² values were on the whole fairly similar across all segments and both models, with slightly higher values generally observed for Model 2 than for Model 1 and with /ɹ/ exhibiting higher values than the other segments. Interestingly, the conditional R² values are on the whole generally lower than those calculated for the articulatory models, potentially pointing towards a lesser predictive power for individual differences in the articulatory-acoustic mapping than in the mapping in the opposite direction. Additionally, marginal R² values are generally higher than they were for the articulatory models (in Tables 2.11 and 2.12), indicating that the acoustic patterns produced by different articulatory vectors are somewhat more unique than the set of possible articulatory vectors suggested by a particular acoustic signal (at least when individual differences in articulatory-acoustic relations are ignored). This is consistent with the view that articulatory-to-acoustic relations are somewhat many-to-one in nature.
Table 2.15. Median marginal and conditional R² values by segment across all LMER models fit with an acoustic dependent variable.

               Marginal R²                     Conditional R²
           /s/    /ʃ/    /l/    /ɹ/        /s/    /ʃ/    /l/    /ɹ/
Model 1  0.074  0.083  0.113  0.295      0.529  0.513  0.527  0.696
Model 2  0.107  0.197  0.171  0.275      0.529  0.620  0.629  0.671
Table 2.16. All marginal and conditional R² values for all LMER models fit with an acoustic dependent variable. Spectral rows (M1-M4) give values for /s/ and /ʃ/; formant rows give values for /l/ and /ɹ/.

               Marginal R²       Conditional R²
Model 1       /s/     /ʃ/        /s/     /ʃ/
  M1         0.11    0.26       0.69    0.84
  M2         0.07    0.05       0.54    0.53
  M3         0.08    0.12       0.52    0.49
  M4         0.00    0.05       0.06    0.46
              /l/     /ɹ/        /l/     /ɹ/
  F1         0.11    0.31       0.28    0.70
  F2         0.14    0.29       0.61    0.70
  F3         0.02    0.16       0.53    0.72
  F4         0.05    0.07       0.56    0.69
  F2-F1      0.16    0.30       0.39    0.57
Model 2       /s/     /ʃ/        /s/     /ʃ/
  M1         0.17    0.31       0.65    0.79
  M2         0.13    0.21       0.62    0.59
  M3         0.08    0.17       0.44    0.65
  M4         0.03    0.18       0.09    0.50
              /l/     /ɹ/        /l/     /ɹ/
  F1         0.31    0.32       0.63    0.60
  F2         0.17    0.28       0.68    0.68
  F3         0.03    0.24       0.57    0.66
  F4         0.09    0.10       0.62    0.67
  F2-F1      0.35    0.37       0.74    0.74
Table 2.17. Median regression coefficient for the random slope by SPEAKER in each LMER model fit to an acoustic dependent variable. Spectral rows (M1-M4) give values for /s/ and /ʃ/; formant rows give values for /l/ and /ɹ/.

Model 1       /s/       /ʃ/
  M1        0.1620    0.1370
  M2        0.0880    0.0930
  M3        0.1600    0.0720
  M4        0.0500    0.1230
              /l/       /ɹ/
  F1        0.0560    0.1790
  F2        0.0720    0.1130
  F3        0.0980    0.0650
  F4        0.0460    0.1190
  F2-F1     0.0390    0.0900
Model 2       /s/       /ʃ/
  M1        0.0420    0.1150
  M2        0.0430    0.1120
  M3        0.0420    0.2730
  M4        0.0330    0.1370
              /l/       /ɹ/
  F1        0.1560    0.0340
  F2        0.1680    0.1620
  F3        0.0940    0.0580
  F4        0.0370    0.0290
  F2-F1     0.1640    0.1850
The results of the comparison of standardized random slopes by SPEAKER for each model
are shown in Table 2.17. Given that the fricative consonants and the liquid consonants were
measured using different acoustic dimensions (spectral vs. formant measurements), the
comparison of slope magnitude across the models fit to different classes of segments (fricatives
vs. liquids) is potentially misleading and will not be discussed here. That being said, comparing
within and across the models fit to like segments, no clear patterns emerge from this data in
terms of specific segments or response variables, with the exception that random slopes for each
speaker tend to be larger in Model 2 for all segments except /s/. This suggests that, at least within
sets of fricative and liquid consonants, the ability to predict acoustics from articulation may be
fairly similar across different acoustic dimensions.
Finally, the results of the comparison of IQRPRED and IQRTOT for the models fit to acoustic dimensions are presented in Table 2.18, with graphical representation in Figure 2.12. The results of this analysis again indicate some degree of variation in the extent to which the fit LMERs captured the patterns of interspeaker acoustic variability present in the actual data, although there is a fairly strong tendency for significant relationships to be observed for those dimensions that have been shown in previous research to be used in perception (e.g., M1 in /ʃ/, F2 and the F2-F1 distance in /l/, and F3 in /ɹ/). It is worth noting that the relationship between IQRPRED and IQRTOT appears to be stronger for /l/ and /ɹ/ than for /s/ and /ʃ/ on the whole, particularly for Model 2. The results of these comparisons suggest that individual differences in acoustic variability may encode information regarding the differences in articulatory tendencies exhibited between speakers.
Table 2.18. Spearman’s rank-order correlation for the comparison of IQRTOT and IQRPRED in
models fit to acoustic dependent variables. Significant comparisons are bolded.
/s/ /ʃ/ /l/ /ɹ/
! p ! p ! p ! p
Model 1
M1 0.299 0.061 0.421 0.007
M2 -0.095 0.558 -0.083 0.612
M3 0.096 0.555 -0.007 0.964
M4 0.517 0.001 0.536 0.000
F1 0.350 0.031 0.787 0.000
F2 0.575 0.000 0.240 0.135
F3 0.492 0.002 0.398 0.011
F4 0.291 0.076 0.533 0.000
F2-F1 0.530 0.001 0.050 0.761
Model 2
M1 0.233 0.159 0.304 0.063
M2 -0.022 0.896 0.073 0.665
M3 -0.061 0.715 0.371 0.022
M4 -0.027 0.872 0.536 0.001
F1 0.350 0.031 0.457 0.003
F2 0.575 0.000 0.175 0.287
F3 0.492 0.002 0.554 0.000
F4 0.291 0.076 0.458 0.003
F2-F1 0.530 0.001 0.486 0.002
Figure 2.12. Relationship between IQRTOT and IQRPRED across speakers for each acoustic
dimension in each segment.
2.5. Discussion
The results of the analyses presented here strongly suggest that there is widespread
interspeaker variation in the extent of articulatory and acoustic variability observed in the
production of American English coronal consonants, and that variability exhibited in one of these physical domains is to some extent recoverable from the signal in the other domain.
findings motivate further research on the precise nature of this interspeaker variation.
Specifically, we must probe the extent to which interspeaker variation reflects the influence of
predictable factors known to affect central tendencies and patterns of dispersion in the
production of phonological segments versus the extent to which it may be motivated as part of an
individual’s representation of phonological units (or other components of the control system) that
serve as the cognitive underpinning to the behavior observed in production. These findings
additionally strengthen the possibility that listeners may not only have access to inter- and
intraspeaker patterns of variability in the acoustic signal, but that they also may be able to access
information about variability in articulatory actions from the acoustic signal.
Significant heteroscedasticity was observed among the group of examined speakers for
all dimensions in all segments, suggesting that some interspeaker variation is the norm rather
than the exception in the production of these consonants and, likely, speech as a whole.
However, the pervasiveness of the observed heteroscedasticity across speakers when compared
individually one to another was notably different from segment to segment and (for the
articulatory analyses) across different modalities and dimensions. This variation across
conditions may have important implications both for highlighting the extent to which listeners
have to reconcile variability in the incoming signal in perception and for our understanding of
constraints on interspeaker variation in production.
The observation that the percentage of speakers who significantly differed from one
another in the production of any articulatory or acoustic dimension was markedly lower for
contextual variability than for stochastic variability could indicate general consistency in the
appearance and possibly magnitude of coarticulatory and prosodic effects across speakers of the
same language (e.g., Grosvald & Corina, 2012; Zellou, 2017). This underlying similarity in the
nature and specifically magnitude of contextual conditioning across speakers may simply be
much greater than the smaller effects of interspeaker differences, leading to greater absolute
similarity in the extent of variability observed due to contextual environments across speakers
than that observed due to more random or stochastic unconditioned elements. That said, this
result may also be an artifact of differences in the underlying data available for each analysis. As
the number of distinct phonetic conditions in each speaker’s data was smaller than the number of
tokens used in the calculation of either overall or stochastic variability, the margin of error and
therefore the degree of uncertainty encoded in the calculated 95% confidence intervals is
expected to be larger for IQRCROSS than either of the other measures of dispersion. The general
observation of fewer significant pairwise comparisons for IQRCROSS in both the articulatory and
the acoustic data could therefore be merely an artifact of differences in the sample size used in
each analysis, a possibility that will be probed further in future research.
Differences in the extent of interspeaker variability observed across different articulatory and acoustic dimensions must be reconciled with and understood in terms of the
transmission of variability between articulation and acoustics. This is important for
understanding both the significance of asymmetries in production for perception and the
implications of this for the maintenance of communicative parity across speakers. For example,
/s/ exhibited a relatively high overall amount of interspeaker difference in variability in both articulation and acoustics when comparing the extent of interspeaker variation across segments; however, the mapping between articulation and acoustics was generally not as strong for /s/ as for the other segments (as demonstrated in the comparison of R² values for the LMER models used to analyze articulatory-acoustic relations and in the relationship between the calculated variability of actual and predicted values from these models). Nevertheless, considerable
interspeaker variation was still observed in the acoustic signal even for those segments and
dimensions for which there was a weaker mapping between articulation and acoustics, and these
differences between speakers should be available to listeners. The precise interpretation of these
differences and the nature of the normalization processes that may unfold may depend on
individuals’ use of the relationship between articulatory and acoustic variability in their own
production in interpreting the acoustic signal, depending on the nature of the control schemes
guiding both perception and production. Although these questions are complex and outside of the
domain of the investigation in this chapter, the apparent nonlinearities of the relationships
between interspeaker variability in articulation and acoustics and the transmission of variability
between these domains may provide crucial information for better understanding how variability
may or may not be harnessed and accommodated for in perception.
[Footnote 13: It is worth noting that the results regarding acoustic variability in, and articulatory-acoustic relations for, the fricative consonants /s/ and /ʃ/ may have been impacted by the presence of the XRMB lingual pellets. Weismer and Bunton (1999) found an infrequent effect of XRMB pellet presence on spectral moments measurements for /s/ and /ʃ/: significant (greater than 1 kHz) differences in M1 were observed in about 20% of paired comparisons of fricative production with and without pellets attached. This is a potential hindrance to the investigation of articulatory-acoustic relations in fricatives, one exacerbated by the fact that the main instruments and methods currently available for the study of articulation all either present this same issue (EMA, electropalatography [McFarland et al., 1996]), produce simultaneous audio with quality insufficient for the analysis of acoustic variation (rtMRI), or have documented issues in imaging the tongue tip and other anterior portions of the tongue (ultrasound [Stone, 2005]). This inconsistent acoustic effect of pellet presence may therefore have introduced additional noise into the acoustic signal for the fricatives and, subsequently, obscured the mapping between articulatory and acoustic variation more than is actually the case, complicating any assessment of the extent to which interspeaker differences in articulatory variability are recoverable from the acoustic signal.]
In sum, the results of the articulatory-acoustic analyses presented in this chapter
introduce the possibility that patterns of interspeaker variability may be of broader
communicative and functional significance, especially for those dimensions and segments in
which individual differences in articulatory variability were found here to be recoverable from
the vector of examined acoustic dimensions. However, the question of whether and to what
extent this information is used in broader communicative contexts remains open. Previous
research has shown that listeners are sensitive to interspeaker differences in the distributional and
relational properties of different phonological segments, including differences that result from
increased variability or overlap between contrasting segments along critical acoustic dimensions
(e.g., Clayards, Tanenhaus, Aslin, & Jacobs, 2008; Newman et al., 2003). It is unclear based
simply on the measurement of production whether the extent of the variability observed across
individual speakers in the present study is large enough to be utilized in perception. Future
research examining listeners’ response to speakers differing in the variability they exhibit for
various articulatory and acoustic dimensions will be necessary to determine whether the
magnitude of the interspeaker differences in variability are of sufficient size to be salient to
listeners and whether they may impact the manner in which listeners analyze and interpret the
speech of different interlocutors.
For example, for fricative consonants, speakers may be able to learn general trends with
respect to what types of acoustic changes are likely to result from a particular change in
articulation. However, as demonstrated by the statistical modeling of articulatory-acoustic
relations in Section 2.4, the precision of the relationship between changes in particular
articulatory dimensions and changes in the acoustic signal is reduced for the (examined) coronal
fricative consonants as compared to the coronal sonorant consonants (and likely for a wider
range of sonorant speech sounds than solely those examined here, as suggested by the results in
Whalen et al. [2018]). Assuming that speakers learn mappings between articulation and acoustics
from their own produced speech (e.g., Newman, 2003), this suggests that the mapping speakers
learn for fricatives likely exhibits greater nonlinearity in the relationship between articulation and
acoustics than that learned for (at least some) sonorants. However, this distinction will likely
vary across speakers as a consequence of factors like the degree of nonlinearity in the
articulatory-acoustic mapping within their own vocal tract (see e.g., Lammert et al., 2011). In
harnessing this mapping to infer information about others’ articulation from their acoustic signal,
listeners’ capacity for acoustic-to-articulatory inversion may thus be less accurate for fricatives
than it is for sonorant consonants and vowels. This would suggest that individuals would not
only be less able to predict the acoustic consequences of variation in their production of fricative
consonants but may also be less likely to make inferences, or at least correct inferences, about the
articulatory variation underlying subtle changes in the fricative spectrum, an interpretation
consistent with the observation in this study that marginal R² values are lower when individual
articulatory dimensions are predicted from the acoustic signal than the other way around. The potential
ramifications of these differences in articulatory-acoustic mappings will be explored further in
Chapters 4 and 5.
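For reference, under the variance-partitioning definition commonly used for linear mixed models (the type of model assumed here for the analyses in Section 2.4; the notation below is illustrative rather than drawn directly from that section), the marginal R² quantifies the share of total variance attributable to the fixed effects alone:

    \[
    R^2_{\mathrm{marginal}} = \frac{\sigma^2_f}{\sigma^2_f + \sigma^2_r + \sigma^2_{\varepsilon}},
    \qquad \sigma^2_f = \mathrm{Var}(X\hat{\beta})
    \]

where σ²f is the variance of the fixed-effect predictions, σ²r the summed random-effect variances, and σ²ε the residual variance. On this reading, a lower marginal R² in the acoustic-to-articulatory direction means simply that the acoustic predictors, entered as fixed effects, capture a smaller share of the total variance in the articulatory dimension.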
2.6. Conclusion
In conclusion, the study of interspeaker differences in articulatory and acoustic variability
presented here indicates that there are robust differences between speakers in the extent of the
variability they exhibit in their production of American English coronal consonants. These
interspeaker differences seem to be widespread both in the consistency of their appearance
across phonetic dimensions and segments and in the extent to which they reflect general patterns
of individual difference within the population of speakers. The potential communicative
significance of these interspeaker differences is highlighted by the extent to which the variability
observed in articulation is recoverable in acoustics and vice versa, with statistical models
approximating this relationship consistently showing an ability to predict change in one domain
from change in the other. Interspeaker differences in variability exhibited in one of these
physical dimensions appear to be recoverable to some extent from the signal in the other domain,
although this capacity varies substantially across segments and dimensions. These findings
motivate further research on the factors underlying these interspeaker differences in phonetic
variability and the role that they play in the cognitive systems underlying speech production and
perception.
3. Evidence for the encoding of variability in the representation of
phonological units
3.1. Introduction
The study presented in Chapter 2 investigated interspeaker variability in articulation and
acoustics and the relationship between variability in these domains. The finding of robust
differences in interspeaker variability across all measured phonetic dimensions leads to the
question of what sources underlie the observed variability. This chapter examines the possibility
that variability is encoded in the cognitive representation of phonological units, and that
interspeaker differences in token-to-token variability consequently reflect speaker differences in
the system of cognitive representations underlying speech production. In doing so, two facets of
the inter- and intraspeaker patterning of token-to-token phonetic variability are examined in
detail: the ability of known speaker traits to predict individual differences in variability, and the
extent to which individual differences in variability reflect speaker-specific tendencies to be
generally more or less variable in production.
3.1.1. Factors conditioning inter- and intraspeaker variation
Speakers are known to differ from one another in not only the precise acoustic and
articulatory properties of the phonological segments they produce (e.g., Allen et al., 2003;
Delattre & Freeman, 1968; Johnson et al., 1993; Mooshammer et al., 2004; Newman et al., 2001;
Noiray et al., 2014; Peterson & Barney, 1952; Westbury et al., 1998; Whalen et al., 2018) but
also in the extent to which contextual factors affect these articulatory and acoustic properties of
segments (e.g., Beddor, 2009; Beddor et al., 2018; Baker et al., 2011; Grosvald, 2009; Kataoka,
2011; Krakow, 1989; Lubker & Gay, 1982; Mielke et al., 2016; Noiray et al., 2011; Smith et al.,
2019; Yu, 2016, 2019; Zellou, 2017) and in the ways that they manipulate different acoustic and
articulatory dimensions to generate contrasts between speech sounds (e.g., Clayards, 2018;
Coetzee et al., 2018; Schertz et al., 2015; Shultz et al., 2012). Some, but not all, of this
interspeaker variation can be explained as a function of speakers’ dialect, demography, or
linguistic history. Attempts to motivate the remainder of the observed interspeaker variation have
found that various anatomical, cognitive, and behavioral traits may play a role in accounting for
these differences (e.g., Brunner et al., 2009; Brunner et al., 2011; Ghosh et al., 2010; Nasir &
Ostry, 2006; Ou & Law, 2017; Perkell, Guenther et al., 2004; Perkell, Matthies et al., 2004;
Rudy & Yunusova, 2013; Yu, 2016).
3.1.1.1. Vocal tract morphology
Anatomical differences in the size and shape of the vocal tract have received considerable
attention for their potential to predict interspeaker differences in speech production. The
morphology of the vocal tract is known to differ across individuals. These differences can be
predictable, such as the changes observed in the shape and proportions of the vocal tract during
child development (e.g., Fitch & Giedd, 1999; Vorperian, Kent, Lindstrom, Kalina, Gentry, &
Yandell, 2005; Vorperian & Wang, 2009) and differences due to sexual dimorphism (e.g.,
Vorperian & Wang, 2009; Vorperian et al., 2011). They can also be unpredictable and reflect
purely idiosyncratic variation in vocal tract anatomy (Vorperian et al., 2005). Research exploring
the relationship between individual differences in vocal tract morphology and phonetic
variability has shown that differences in multiple anatomical dimensions within the vocal tract
may affect both the articulatory and acoustic characteristics of individuals' speech production in
predictable ways. Specifically, individual differences in the size and shape of specific vocal tract
structures, such as the palate, the oral cavity, and the pharyngeal cavity, have been shown to
affect the degree of nonlinearity observed in articulatory-acoustic mappings in regions of the
vocal tract (e.g., Bakst & Johnson, 2018; Brunner, Fuchs & Perrier, 2009; Lammert et al., 2011),
the amount of token-to-token variability exhibited by speakers in their production of specific
segments (e.g., Brunner et al., 2009; Mooshammer et al., 2004; Rudy & Yunusova, 2013), and
individual speakers’ preference for specific articulatory strategies in production (e.g., Dediu &
Moisik, 2019; Johnson, 2018; Weirich & Fuchs, 2013).
Two dimensions of anatomical variation that have been shown to contribute substantially
to interspeaker articulatory variation are palate morphology (e.g., Bakst & Lin, 2015; Brunner et
al., 2009; Johnson, 2018; Mooshammer et al., 2004; Rudy & Yunusova, 2013; Weirich & Fuchs,
2013) and length of the vocal tract or its subcomponents (Fuchs et al., 2008; Honda et al., 1996;
Johnson, 2018; Rudy & Yunusova, 2013). Experimental work examining the effect of palate
morphology on phonetic variation has demonstrated that speakers with flatter hard palates tend
to be less variable in their production of coronal consonants and vowels than speakers with more
palate doming, particularly in terms of vertical range of motion or constriction degree (e.g.,
Bakst & Lin, 2015; Brunner et al., 2009; Mooshammer et al., 2004; Rudy & Yunusova, 2013).
Palate length has also been shown to impact the range of vertical motion in at least one study
in which speakers with longer palates were found to exhibit greater variation in the vertical
positioning of the tongue during the production of coronal consonants (Rudy & Yunusova,
2013). Similar effects on vertical tongue movement have also been attributed to interspeaker
variation in vocal tract length (Fuchs et al., 2008; Johnson, 2018).
Recent modeling investigations have demonstrated that the shape of the palate influences
the articulatory-acoustic mapping, with acoustic sensitivity to articulatory change greater for
speakers with flatter palates (Bakst & Johnson, 2018; Brunner, Fuchs, & Perrier, 2009; Lammert
et al., 2011). This difference in acoustic sensitivity across speakers with different palate shapes
has been highlighted as a possible explanation for the relationship between palate shape and
articulatory variability, as speakers with flatter palates may choose to maintain greater precision
in articulation to maintain a consistent acoustic output (e.g., Brunner et al., 2009).
Palate morphology may also affect patterns of interspeaker variation in articulation in
ways other than influencing the stability or variability of a particular speaker’s production. For
example, Weirich and Fuchs (2013) demonstrated that the particular manner in which individual
speakers produce a contrast between /s/ and /ʃ/ may reflect the angle of their alveopalatal ridge,
while Johnson (2018) found that talkers with flatter palates exhibited a greater preference for
postures involving a raised tongue tip in these same segments. Dediu and Moisik (2019) found
that both the doming of the palate and other aspects of palate morphology (such as the size of the
alveolar ridge) seem to similarly affect the articulatory strategies speakers use in producing /ɹ/.
Although previous research has demonstrated a relationship between interspeaker
variation in vocal tract morphology and phonetic variability, this relationship does not appear to
be deterministic in nature. Johnson (2018) discusses this lack of determinism explicitly in a study
examining the relationship between individual differences in vocal tract anatomy and the
production of vowel and fricative contrasts, noting that any correlations observed between
anatomical features and interspeaker differences in articulation were weak to moderate at best.
Similarly, Stone et al. (2012) noted that while palate morphology may influence speakers’
preference for apical or laminal articulations of /s/, it does not do so in a deterministic fashion.
Investigations probing the relationship between palate morphology and phonetic variability have
also not consistently found evidence for a relationship between these two domains, with Bakst
(2021) finding that palate doming did not directly predict either within- or across-context
articulatory and acoustic variability in the production of /s/ and /ɹ/ in American English.
3.1.1.2. Cognitive, sensory, and neural factors
Alongside variation in vocal tract anatomy, individual differences in sensory processing
and language-peripheral cognitive traits have received considerable attention for their potential
to explain inter- and intraspeaker phonetic variation. Previous research has suggested that
speakers with greater auditory or somatosensory acuity maintain larger distances between targets
for different phonological segments in phonetic space (e.g., Brunner et al., 2011; Ghosh et al.,
2010; Perkell, 2012; Perkell, Guenther et al., 2004; Perkell, Matthies et al., 2004) and exhibit less
subphonemic variability in production (Franken et al., 2017; Perkell et al., 2008). The potential
for these individual differences in sensory acuity to impact the realization of inter- and
intraspeaker phonetic variation is further enhanced by the observation that speakers may exhibit
a “sensory preference” for, and therefore attend more to, either auditory or somatosensory
feedback during speech production (e.g., Lametti, Nasir, & Ostry, 2012).
Certain cognitive characteristics, such as cognitive processing style (Stewart & Ota,
2008; Yu, 2010, 2016), working memory (Lev-Ari & Peperkamp, 2013, 2014; Ou et al., 2015),
and executive functions (Kim & Hazan, 2010; Ou & Law, 2017) may also play a role in
determining individual differences in speech production and perception. For example, Yu (2016)
found that Cantonese speakers with lower AQ (fewer autistic-like traits) exhibited more variation
in /s/ acoustics across vowel context than speakers with higher AQ, mirroring a pattern observed
in perception in which female listeners with lower AQ compensated less for coarticulation in
English sibilant perception (Yu, 2010).
Interspeaker differences in trial-to-trial variability may also arise as the result of
differences in the noisiness in the central and peripheral nervous systems during movement
preparation and execution, as has been observed in certain motor speech disorders (e.g.,
Anderson, Lowit, & Howell, 2008; Marquardt, Jacks, & Davis, 2004; Spencer & Slocomb,
2007). Effects of neural noise on trial-to-trial variability have also been observed more generally
for goal-directed movements such as visual saccades and reaching (e.g., Churchland, Afshar, &
Shenoy, 2006; Haar, Donchin, & Dinstein, 2017; Harris & Wolpert, 1998; van Beers, Baraduc,
& Wolpert, 2002; van Beers, Haggard, & Wolpert, 2004).
3.1.1.3. Prosodic factors
In addition to language-external factors, like the anatomical, cognitive and sensory
dimensions discussed above, habitual differences in the implementation of paralinguistic
elements across speakers may also cause interspeaker differences in phonetic variability. Speech
rate, for example, has been shown to influence articulatory dynamics in speech, with articulatory
movements generally found to become larger, longer, and faster as speakers’ speech rate
increases (e.g., Kelso, Vatikiotis-Bateson, Saltzman, & Kay, 1985; Tasko & McClean, 2004;
Theodore, Miller, & DeSteno, 2009). Although comparable effects have been observed in
acoustic realization in many studies (Fourakis, 1991; Gay, 1968, 1978; Lindblom, 1963; Turner,
Tjaden, & Weismer, 1995; cf. Mefferd & Green, 2010, van Son & Pols, 1992), the inconsistency
of the results across studies points to a lack of necessary relationship between changes in speech
rate and speech acoustics (at least in terms of formant frequencies).
Intraspeaker variation in speech rate is known to impact the appearance of phonetic
variability, with intraspeaker variability in the realization of articulatory movements increasing
as a function of decreased speaking rate (Kelso, Vatikiotis-Bateson, Saltzman, & Kay, 1985;
Kleinow, Smith, & Ramig, 2001; Smith, Goffman, Zelaznik, Ying, & McGillem, 1995; Smith &
Kleinow, 2000; cf. Tomaschek, Arnold, Sering, Tucker, van Rij, & Ramscar, 2020). However,
the extent to which idiosyncratic differences in preferred or “habitual” speech rate (Tsao &
Weismer, 1997) correlate with interspeaker differences in phonetic variability is unclear since
the small body of existing research examining the relationship between habitual speech rate and
phonetic variation has largely focused on the distance between phonological categories in
phonetic space rather than on subphonemic token-to-token variability. For example, Yunusova,
Rosenthal, Rudy, Baljko, and Daskalogiannakis (2012) observed that speakers with slower
habitual speaking rates exhibited larger distances and less overlap between the articulatory target
regions of alveolar consonant pairs, suggesting some relationship between habitual speaking rate
and the articulatory realization of phonological segments. A similar effect of habitual speech rate
on acoustics was not observed in Tsao, Weismer, and Iqbal (2006), who found that speakers with
slower habitual speech rates did not differ in their overall acoustic vowel space size from
speakers with faster habitual speech rates when a binary comparison was made between these
two groups. Tsao et al. (2006) did, however, find greater interspeaker variation in vowels’
acoustic centers within the group of slow talkers than within the group of fast talkers.
Although the direction of the effect of intraspeaker variation in speech rate on articulation
and acoustics is fairly consistent across speakers, the magnitude of the effect of speech rate
fluctuations on the phonetic properties of phonological segments differs across individuals, with
some speakers exhibiting larger rate-induced changes than others. For example, Theodore,
Miller, and DeSteno (2009) observed an overall effect of speech rate on VOT for voiceless stops
in English, with VOT increasing as speech rate decreased across all speakers. However, they also
found a considerable amount of variation in the magnitude of the rate effect across talkers, with
slope values for a linear function relating VOT to vowel duration differing significantly across
speakers. This interspeaker variation in the effect of intraspeaker changes in speech rate suggests
that interspeaker differences in overall phonetic variability could reflect individual differences in
the magnitude of phonetic change observed across variation in speech rate.
Speakers are also known to differ from one another in pause frequency and in the size of
the prosodic units preferred by individual speakers (Bishop, 2020; Fuchs, Petrone, Krivokapić, &
Hoole, 2013; Petrone, Fuchs, & Krivokapić, 2011). Interestingly, variation in the size of these
inter-pause or prosodic units has been observed to covary with variation in speech rate both
between and within individuals’ speech, with speech rate shown to be faster within longer
phrases (e.g., Fougeron & Jun, 1998; Quené, 2008, cf. Jacewicz, Fox, O’Neill, & Salmons, 2009)
and faster overall for speakers who tend to produce fewer pauses (e.g., Byrd, 1994; Crystal &
House, 1982, cf. Jacewicz, Fox, O’Neill, & Salmons, 2009). This would suggest that any
relationship observed between speech rate and phonetic variability could also be observed
between phrase length and phonetic variability. Additionally, interspeaker variation in the
frequency of pauses or prosodic boundaries may affect the appearance of variability in
individuals’ speech outside of its relationship with speech rate due to both the general effect
of such boundaries on the realization of proximal segments and the potential for these effects
themselves to differ somewhat across speakers (e.g., Byrd, Krivokapić, & Lee, 2006; Cole, 2015;
Cole, Mo, & Baek, 2010; Fougeron & Keating, 1997; Kim, 2020). Speakers who produce smaller
phrases and, consequently, more frequent prosodic boundaries would be expected to exhibit
greater modulation of segmental realization, and thus have the potential for overall larger
articulatory and acoustic variation, than speakers who use larger prosodic groupings.
3.1.2. Global and local patterns in phonetic variability
Although the factors mentioned above may be able to account for some interspeaker
differences in variability, the results of previous research suggest that no single factor can
completely account for the differences observed across speakers. This is demonstrated in part by
differences in the strength or appearance of these factors’ effects across phonological segments
(e.g., Bakst, 2021; Brunner et al., 2009) and their general lack of ability to deterministically
explain the variation observed in production (Johnson, 2018). Similarly, research on the activity
in the ventral sensorimotor cortex (vSMC) during speech production has found that variability in
both articulation and acoustics relates to variability in vSMC encoding (Bouchard & Chang,
2014; Chartier, Anumanchipalli, Johnson, & Chang, 2018), suggesting that the variability
observed in speech production also cannot be attributed solely to noise in parts of the neural
motor pathway governing the regulatory control of speech motor actions (i.e., the cerebellum) or
in the implementation of these actions by the peripheral nervous system, but rather arise in the
planning of speech kinematics.
The inability of predictive factors to fully account for the individual differences observed
in token-to-token variability (i.e., stochastic variability), preferred production strategies, and the
realization of coarticulatory and prosodic effects raises the possibility that at least some of the
interspeaker variation observed in speech production is idiosyncratic and not attributable to any
concrete difference between speakers in demographic, anatomical, paralinguistic or cognitive
factors. An outstanding question is whether these differences may reflect local differences in
phonological representation between speakers (i.e., differences in the representation of specific
phonological units), and specifically differences in the phonetic targets defining individual sound
categories (e.g., Allen et al., 2003; Baker et al., 2011; Beddor et al., 2018; Newman, 1997; Yu,
2013, 2016), even if they are truly idiosyncratic. However, idiosyncratic differences between
speakers in the same speech community could also reflect more holistic differences between
speakers in the cognitive and motoric systems guiding speech production, with speakers
exhibiting tendencies towards either greater variability or greater precision across all
phonological units in production.
To the extent that research on speech production has addressed the question of whether
individual differences in phonetic variability tend to reflect global or local tendencies, findings
seem to differ with regard to whether they suggest individual speakers exhibit general tendencies
towards variability. In a study using ultrasound to examine the relationship between palate
morphology and articulatory and acoustic variability, Bakst (2021) observed that individual
differences in overall tongue shape variability in /s/ and /ɹ/ covaried, despite a lack of significant
relationship between palate doming and articulatory variability for either segment individually.
This finding was taken to suggest that speakers were to some extent globally consistent in how
variable they were in their production of these consonants, and potentially in speech production
more broadly. In a study of American English using the Wisconsin XRMB corpus, Whalen et al.
(2018) also found some evidence that individual speakers generally exhibit a similar extent of
variability in their production of multiple vowels. Examining variability in articulation and
acoustics separately, they observed that two out of thirty-six possible pairwise comparisons of
variability between vowels were significant for each domain even after applying a correction for
false discovery rate among multiple comparisons (eight total comparisons were significant for
acoustics and thirteen significant for articulation at the uncorrected significance level). They
interpret this finding as indicating that speakers were consistent in how variable they were
(relative to other speakers) in their production across vowels generally, rather than displaying
vowel-specific tendencies towards greater variability or consistency.
These studies, however, do not specifically examine the realization of the articulatory
goals of the vowels and consonants of interest or measure variability separately for different
articulators. The methods used in both Bakst (2021) and Whalen et al. (2018) to measure
speakers’ articulation were relatively holistic in nature, taking into account the entire shape of
the tongue in an ultrasound frame (Bakst, 2021) or using principal components calculated using
the complete articulatory vector for each XRMB token (i.e., including information about the
horizontal and vertical position of multiple points on the tongue, lips, and jaw) (Whalen et al.,
2018). Similarly, Whalen et al. (2018) used acoustic measurements calculated in three-
dimensional formant space (F1, F2 and F3) with values of tokens within this space calculated
based on the normalized center of each individual speaker’s vowel space. The holistic nature of
the measurements in these studies seems to suggest that they could be picking up covariation in
aspects of the segments’ articulation that have less to do with the goals of specific phonological
units and more to do with individuals’ preferences for particular patterns of articulatory synergy
or particular tongue postures across multiple segments.
Acoustic studies that focus more specifically on dimensions known to be important in the
production of specific phonological contrasts do not tend to find the same tendencies towards
global variability or consistency in individual speakers’ production; that said, these studies have
been rather limited in number and scope. For example, the comparison of acoustic variability in
/s/ and /ɹ/ in Bakst (2021), which used measurements of variability along single acoustic
dimensions (the first spectral moment for /s/ and F3 for /ɹ/), did not find significant covariation
in variability across speakers. The findings of Chao et al. (2019) also suggest a lack of
covariation in acoustic variability across vowels for individual speakers. Measurements of the
standard deviation of F1 and F2 in the vowels /ɛ/ and /æ/ across speakers in their study indicate
that some speakers are more variable in F1-F2 space for /ɛ/, while other speakers are more
variable for /æ/. These differences in relative variability are additionally shown to play a role in
the location of a categorical perception boundary between these vowels across speakers, with
speakers who are more variable in their production of /ɛ/ exhibiting a perceptual boundary closer
to /æ/ and vice versa. This link between segment-specific variability and the speaker’s perception
of a phonological contrast critically suggests variability is in some way incorporated in the
speaker’s phonological knowledge and, subsequently, could form part of the speaker’s cognitive
representation of the relevant phonological units.
It seems likely that the effects of individual differences in cognitive and physiological
factors would impact variability in production through their influence on the establishment and
continual updating of the representations of phonological categories, instead of acting as an
external modulatory force acting on, but not directly affecting, the representation itself (e.g.,
Guenther 1995, 2016). As multiple different factors impacting variability certainly interact in the
development of these representations, and additionally interact with other differences in the
speakers’ linguistic experiences, different phonological units are expected to concomitantly
present with different patterns of representationally encoded variability, and subsequently to
exhibit different patterns of variability in their realization across and within individual speakers.
3.2. Hypothesis and predictions
The study presented in this chapter aims to test an overarching hypothesis regarding the nature of
phonological representation and, more specifically, the role of variability in the representation of
phonological units:
H1: Inter- and intraspeaker differences in phonetic variability reflect differences in the
cognitive representation of phonological units.
If interspeaker differences in phonetic variability arise as a consequence of differences in the
underlying representation of the same phonological unit across individuals, we would expect to
find that (a) the degree of variability observed in production should be internally consistent for a
given speaker’s production of a single phonological unit and (b) the patterns of variability
observed for a single speaker may exhibit some degree of specificity to a single phonological
unit, instead of reflecting the speaker’s general tendency to be globally more or less variable in
production. Consistency in individual behavior is analyzed here by examining the predictability
of a speaker’s variability for a given phonological unit as a function of their relationship to the
population of speakers for other phonological units.
With respect to expectation (a), H1 predicts that interspeaker differences in relative
variability (e.g., whether Speaker A is more or less variable than Speaker B) should remain
consistent for the same phonological unit across different contexts and levels of linguistic
structure (Figure 3.1). For example, interspeaker differences in the patterns of contextual
variation observed for a phonological unit should be consistent with interspeaker differences in
stochastic variability in the realization of the same unit.
P1: Individual differences in stochastic variability for a phonological unit should
correlate with individual differences in the contextual variability of that unit.
This prediction is based on existing research suggesting that the coarticulation observed in
speech likely results from a combination of biomechanical effects (e.g., Baum & Waldstein,
1991; Daniloff, Schuckers, & Feth, 1980; Flege, 1988; Gay, 1977; Recasens, 1987, 1989) and
the controlled implementation of coarticulation in speech motor planning (e.g., Bouchard &
Chang, 2014; Chartier, Anumanchipalli, Johnson, & Chang, 2018; Daniloff & Hammarberg,
1973; Whalen, 1990; Wood, 1991, 1996). Given that coarticulation is (at least partially) planned,
we may expect that individual differences in coarticulation (see references in 3.1.1.) will reflect
individual differences in the representation of target planning space for individual speakers and,
subsequently, will be related to other variation phenomena also arising from individual
differences in target space. In this study, the ‘other variation phenomenon’ to which individual
differences in contextual variation are related is stochastic variability in production. This
prediction relies on the assumption that the target(s) for a particular phonological unit retain
some degree of consistency in their definition across different phonetic contexts, an assumption
shared by most prominent models of speech production (including AP, the DIVA model, and
most contemporary exemplar models).¹⁴
¹⁴ The predicted relationship between stochastic and contextual variation requires consistency in the definition of
target space for a phonological unit across all instances in which it may occur, but does not automatically follow
from the inclusion of an invariant target region in a phonological model. Indeed, it is not clear that any of the models
mentioned here can actually generate this relationship in their current formulation, even those (like the DIVA
model) that seem to assume its existence. This point will be taken up in more detail in Chapter 4.
Figure 3.1. Schematic illustration of predicted relationship between within- and cross-context
variability (P1).
With respect to expectation (b), two predictions are made:
P2: Speakers do not consistently exhibit more or less variability than others in their
production of different phonological units.
P3: Speakers who are similar in some trait T(i) may differ from one another in the
variability they produce, and speakers who differ in some trait T(i) may produce similar
patterns of variability (Figure 3.2).
If speakers’ tendencies to be more or less variable relative to other speakers differ across
different phonological units, this will suggest that speakers’ general tendencies to be more
variable or more consistent in their production of speech do not fully account for the phonetic
variability observed in speech production. It would also suggest that individual differences in
variability are unlikely to be wholly accounted for by anatomical and cognitive traits or by
individual differences in other aspects of speakers’ linguistic competence like their realization of
speech prosody. Note that similarity in the phonological specification of particular segment pairs
(e.g., similarity in CD for /s/ and /ʃ/) is expected to engender consistency in variability along
certain dimensions. Additionally, failing to find consistent effects of anatomical, cognitive, or
linguistic factors on interspeaker differences in variability would further suggest that these
differences are not consistently observed across different phonological units and that these units
may differ from one another in the extent of variability that they exhibit in an individual’s
speech. The lack of system-wide individual differences in phonetic variability suggested by
results in line with P2 and P3 would provide support for the hypothesis that variability is
encoded in the representation of individual phonological units.
Figure 3.2. Schematic illustrating the predicted lack of consistent relationship between a trait
(here, vocal tract morphology) and phonetic variability across speakers (P3).
3.3. Variability within and across phonological units
The same articulatory and acoustic data used for the study of individual differences in
phonetic variability in Chapter 2 of this dissertation was used here to examine how individual
differences in phonetic variability pattern within and across phonological units. Three analyses
testing H1 were carried out:
1. An analysis examining whether the rank order of IQRCON values is significantly
correlated with the rank order of IQRCROSS values across speakers for each phonetic
dimension in each phonological segment (Analysis 1).
2. An analysis examining whether the rank order of IQRCON values for a phonetic
dimension remained constant across speakers for different phonological segments
(Analysis 2).
3. An analysis examining whether the rank order of IQRCON values remained constant
across speakers for different articulatory dimensions in the same phonological
segment (Analysis 3).
The consistency of the rank order of speakers’ variability measurements across different levels of
structure, dimensions, and segments was statistically evaluated using Spearman’s rho. In addition
to testing this main hypothesis, the results of these analyses will also provide critical information
regarding how variability may be incorporated into a model of phonological representation and
speech planning, including the level of representation at which variability may be encoded and
the level of phonological abstraction required to account theoretically for the observed patterns
in behavior.
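For concreteness, the sketch below illustrates the evaluation procedure used throughout these analyses: Spearman's rho computed per comparison across speakers, with a Benjamini-Hochberg adjustment applied over the family of comparisons. The function name bh_adjust, the toy data, and the comparison labels are placeholders for illustration, not the dissertation's actual analysis code.

    # Illustrative sketch (Python): rank-order consistency test with
    # Benjamini-Hochberg control of false discovery rate. Toy data only.
    import numpy as np
    from scipy.stats import spearmanr

    def bh_adjust(pvals):
        """Benjamini-Hochberg adjusted p-values (step-up procedure)."""
        p = np.asarray(pvals, dtype=float)
        m = len(p)
        order = np.argsort(p)                               # ascending p-values
        adj_sorted = p[order] * m / np.arange(1, m + 1)     # p_(i) * m / i
        adj_sorted = np.minimum.accumulate(adj_sorted[::-1])[::-1]  # enforce monotonicity
        adj = np.empty(m)
        adj[order] = np.clip(adj_sorted, 0.0, 1.0)
        return adj

    rng = np.random.default_rng(1)
    n_speakers = 40                                         # hypothetical speaker sample
    results = {}
    for label in ("CL in /s/", "CD in /s/", "CO in /s/"):   # placeholder comparison family
        iqr_con = rng.gamma(2.0, 1.0, n_speakers)           # within-context variability
        iqr_cross = 0.5 * iqr_con + rng.gamma(2.0, 1.0, n_speakers)  # correlated cross-context measure
        rho, p = spearmanr(iqr_con, iqr_cross)
        results[label] = (rho, p)

    p_adj = bh_adjust([p for _, p in results.values()])
    for (label, (rho, p)), pa in zip(results.items(), p_adj):
        print(f"{label}: rs = {rho:.2f}, p = {p:.3f}, padj = {pa:.3f}")

Adjusting over the whole family of dimension-by-segment comparisons, rather than testing each correlation in isolation, is what the padj values reported in the tables below reflect.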
The same acronyms used to denote the measured articulatory and acoustic dimensions in
Chapter 2 are used in the presentation of the results of the analyses conducted in this chapter.
Table 3.1 provides a reminder of what these acronyms signify.
Table 3.1. Acronyms used for the analyzed articulatory and acoustic dimensions

Articulatory Dimensions:
  CL = Constriction Location
  CD = Constriction Degree
  CO = Constriction Orientation
  LA = Lip Aperture
  LP = Lip Protrusion

Fricative Acoustics:
  M1 = 1st Spectral Moment
  M2 = 2nd Spectral Moment
  M3 = 3rd Spectral Moment
  M4 = 4th Spectral Moment

Liquid Acoustics:
  F1 = First Formant
  F2 = Second Formant
  F3 = Third Formant
  F4 = Fourth Formant
  F2-F1 = Distance between F2 and F1
3.3.1. Analysis 1 Results: Within- and across-context variability
3.3.1.1. Articulatory variability
Statistical comparisons of IQRCON and IQRCROSS for each dimension in each phonological
segment were conducted using Spearman’s rho in order to test P1. The results of the statistical
analysis are given in Table 3.2 and visualized in Figure 3.3. A significant correlation between
IQRCROSS and IQRCON is observed for multiple dimensions in each segment, with the direction of
the relationship consistently indicating that speakers who were more variable within contexts
exhibit a greater difference in articulation across different phonetic contexts.
A significant relationship was consistently observed between IQRCROSS and IQRCON for CL
and CD, which directly align with properties of place and manner of articulation critical for the
definition of phonological contrasts in both gestural and feature-based approaches. Comparisons
for these dimensions were significant at the p = 0.05 significance level for all segments, and only
failed to reach significance for the comparison of IQRCROSS and IQRCON for CD in /ʃ/ (padj = 0.057)
88
after controlling for false discovery rate. Significant correlations between IQRCROSS and IQRCON
were also observed for LP in /ʃ/, /l/, and /ɹ/ at both the corrected and uncorrected significance
levels. This pattern of significant correlations for LP also generally supports the idea that the
relationship between IQRCROSS and IQRCON may be stronger for dimensions directly related to the
achievement of a phonological goal (or that are defined by a specific articulatory target). Two of
the three segments for which this relationship is observed (/ʃ/ and /ɹ/) are habitually produced
with some degree of labialization in various dialects of American English (e.g., Delattre &
Freeman, 1968; Mielke et al., 2016; Smith et al., 2019; Toda et al., 2003; Zawadzki & Kuehn,
1980), which may be theorized as reflecting a phonological goal for labial protrusion in these
segments. The observation of a significant relationship between IQRCROSS and IQRCON for LA in
/ɹ/ may also fit this pattern, as Smith et al. (2019) observed that the labialization in the
production of /ɹ/ in American English is accompanied by a reduction in labial aperture for
some (but not all) speakers.¹⁵
These findings notably contrast with the results for CO, which only exhibits a significant
correlation between IQRCROSS and IQRCON for /s/ and which does not align as clearly with goal-
oriented components of vocal tract posture in the production of the phonological segments
examined here. However, it is worth noting that the relationship between IQRCROSS and IQRCON is
always positive, indicating that the relationship between within- and cross-context variability
follows a general pattern of speakers who are more variable within contexts also tending to be
more variable across contexts. Additionally, relatively strong and statistically significant
¹⁵ Both Toda et al. (2003) and Smith et al. (2019) note that labial protrusion is produced without lip narrowing for
/ʃ/. If the strength of the relationship between within- and across-context variability is indeed sensitive to the
presence of a phonological goal for a particular articulatory dimension, this difference in the production of
labialization in /ɹ/ and /ʃ/ may explain why a stronger relationship between IQRCROSS and IQRCON is observed for LA
in /ɹ/.
relationships are observed between IQRCROSS and IQRCON in a few comparisons where the
examined dimension would generally not be thought of as a phonologically contrastive goal for
particular segments (namely, LP in /l/, LA in /s/ and CO in /s/). These observations align with
proposals that individual speakers exhibit general tendencies towards overall greater or lesser
variability in their speech as a whole.
Figure 3.3. Relationship between IQRCROSS and IQRCON across speakers for each articulatory
dimension in each segment (top row to bottom row: CL, CD, CO, LA, and LP). Relationships
significant after the application of a Benjamini-Hochberg correction are indicated by an asterisk;
relationships significant at the unadjusted p < 0.05 level only are indicated by †.
A few notable differences in the results of this analysis are observed when comparisons
are made using CoVCROSS and CoVCON instead of the corresponding IQR measurements for CD
and LA (Appendix A). First, the relationship between within- and cross-context variability for
LA in /s/ fails to reach significance when CoV values are compared, contra the results for IQR in
Table 3.2. Second, the relationship between within- and cross-context variability for CD in /ʃ/
reaches significance even after controlling for false discovery rate when this comparison is
measured using CoV instead of IQR. No other differences were observed between the two analyses. As
CoV is potentially a more appropriate measurement for comparing variability across subjects
than IQR for these articulatory dimensions (see discussion in 2.3.4), these results seem to
provide additional support for the finding that the relationship observed between within- and
across-context variability is stronger for dimensions that directly reflect the realization of a
segment’s phonological goals. This could, in turn, could be interpreted as support for the
incorporation of individual differences in variability into the representation of the corresponding
phonological units, a point that will be discussed further in the discussion section.
Table 3.2. Spearman’s rank-order correlation for the comparison of IQRCROSS and IQRCON for
each articulatory dimension in each segment. Green cells indicate comparisons are significant at
the uncorrected p < 0.05 level. Comparisons significant after using the Benjamini-Hochberg
method to control for false discovery rate (padj < 0.05) are additionally italicized.

        /t/               /s/               /ʃ/               /l/               /ɹ/
        rs    p    padj   rs    p    padj   rs    p    padj   rs    p    padj   rs    p    padj
  CL    0.54  0.00 0.00   0.64  0.00 0.00   0.43  0.01 0.02   0.42  0.01 0.03   0.40  0.01 0.03
  CD    0.52  0.00 0.00   0.58  0.00 0.00   0.33  0.04 0.06   0.62  0.00 0.00   0.37  0.02 0.04
  CO    0.15  0.36 0.49   0.49  0.00 0.01   0.24  0.14 0.21   0.03  0.84 1.00   0.30  0.07 0.10
  LA    0.31  0.05 0.09   0.40  0.01 0.03   0.29  0.08 0.10   0.04  0.80 1.00   0.39  0.01 0.03
  LP    0.24  0.14 0.21   0.00  1.00 1.00   0.36  0.02 0.04   0.40  0.01 0.03   0.51  0.00 0.00
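As a point of reference for how such speaker-level measures can be operationalized, the sketch below computes within-context and cross-context IQR, along with a within-context CoV, from a token table for one speaker. The aggregation choices (median across contexts for the within-context measures; IQR over per-context medians for the cross-context measure) are assumptions for illustration and may differ in detail from the definitions given in Chapter 2.

    # Hedged sketch (Python) of speaker-level variability measures, assuming:
    #   IQR_CON   = median across contexts of the within-context IQR
    #   IQR_CROSS = IQR of the per-context medians
    #   CoV_CON   = median across contexts of the within-context SD/mean
    import numpy as np
    import pandas as pd

    def iqr(x):
        q75, q25 = np.percentile(x, [75, 25])
        return q75 - q25

    def speaker_measures(tokens, dim):
        """tokens: one speaker/segment; columns [dim, 'context']."""
        by_ctx = tokens.groupby("context")[dim]
        return pd.Series({
            "IQR_CON": by_ctx.apply(iqr).median(),     # typical within-context spread
            "IQR_CROSS": iqr(by_ctx.median().values),  # spread of the context centers
            "CoV_CON": by_ctx.apply(lambda s: s.std() / s.mean()).median(),
        })

    # Toy token table: constriction degree (CD) across three vowel contexts
    rng = np.random.default_rng(0)
    contexts = np.repeat(["a", "i", "u"], 20)
    means = np.select([contexts == c for c in ("a", "i", "u")], [2.0, 2.6, 3.1])
    tokens = pd.DataFrame({"CD": means + rng.normal(0.0, 0.3, contexts.size),
                           "context": contexts})
    print(speaker_measures(tokens, "CD"))

CoV normalizes dispersion by the mean, which is presumably why the alternate analyses above are restricted to dimensions with a meaningful zero point (here, CD and LA).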
3.3.1.2. Acoustic variability
A parallel analysis of the relationship between IQRCROSS and IQRCON was conducted for
each acoustic dimension in each segment (Figure 3.4). The results of this analysis indicate that
within- and across-context variability show a much more general pattern of relationship in
acoustics than what was observed in articulation, with IQRCROSS and IQRCON significantly
correlated across speakers for most acoustic dimensions in /s/, /l/ and /ɹ/ (Table 3.3).¹⁶ The
relationship of potential correlates of phonological goals in the acoustic dimensions to the
strength of the relationship between IQRCROSS and IQRCON is not as clear-cut as it was in the
parallel analysis of the articulatory dimensions. Although there is some tendency for acoustic
dimensions highlighted by previous research as being particularly important for differentiating
specific segment pairs in production or perception (such as M1 in /s/, F2 in /l/ and F2 and F3 in
/ɹ/) to exhibit stronger correlations, this is not an absolute pattern. Equally strong correlations are
observed with some frequency for dimensions that do not play as clear a role in maintaining
segmental contrasts in perception or production (such as M4 for /s/ and /ʃ/ and F4 for /l/). It is
possible that this lack of differentiation among different acoustic dimensions could reflect the
relatively global influence of articulatory variability on the acoustic signal, in contrast with the
semi-independence of the measures of variability taken along different articulatory dimensions.
¹⁶ It is not clear why /ʃ/ differs from the other segments in this way. As can be seen in Table 3.3, the relationship
between IQRCROSS and IQRCON is only significant in this segment for M4. One possible explanation is that the range
of variability observed in /ʃ/ acoustics is too small across speakers, and the data too noisy (relative to the size of the
effect), for any relationship that may exist to emerge. The extent of variability observed along any given articulatory
or acoustic dimension was generally smaller for /ʃ/ than for the other segments (this difference is obscured
somewhat in the visualizations using scaled IQR). It seems possible that discrepancies between results for /ʃ/ and
other segments could be due to either this difference in the extent of variability or the overall smaller sample size
available for /ʃ/ in the XRMB data (Table 2.1).
Figure 3.4. Relationship between IQRCROSS and IQRCON across speakers for each acoustic
dimension in each segment.
Table 3.3. Spearman’s rho (rs) for the comparison of IQRCROSS and IQRCON for each acoustic
dimension in each segment. Green cells indicate comparisons that are significant at the
uncorrected p < 0.05 level. Comparisons significant after using the Benjamini-Hochberg method
to control for false discovery rate (padj < 0.05) are additionally italicized.

          /s/                     /ʃ/
          rs     p      padj      rs      p      padj
  M1      0.508  0.001  0.003     0.064   0.695  0.736
  M2      0.286  0.074  0.089    -0.021   0.898  0.898
  M3      0.356  0.025  0.040     0.157   0.332  0.373
  M4      0.693  0.000  0.000     0.458   0.003  0.010

          /l/                     /ɹ/
          rs     p      padj      rs      p      padj
  F1      0.433  0.006  0.011     0.300   0.061  0.078
  F2      0.531  0.001  0.002     0.359   0.024  0.040
  F3      0.447  0.004  0.010     0.347   0.029  0.043
  F4      0.608  0.000  0.000     0.340   0.033  0.045
  F2-F1   0.537  0.000  0.002     0.444   0.004  0.010
When the analysis between within- and across-context variability is conducted using
CoVCROSS and CoVCON instead of IQRCROSS and IQRCON for acoustic dimensions where this may
be appropriate (namely M1, M2, and F1-F4), the relationship between within- and across-context
variability strengthens to statistical significance for M2 in /s/ and /ʃ/ (/s/: rs = 0.4, p = 0.01, padj =
0.01; /ʃ/: rs = 0.41, p = 0.009, padj = 0.03). No other differences in the statistical significance of
correlations are noted between the two analyses, although the size of some correlations
significant in both analyses (e.g., for F3 in /ɹ/ and F4 in /l/ and /ɹ/) strengthens considerably when
CoV measurements are used (Appendix A). On the whole, the results of the alternate analysis
strengthen the appearance of a general relationship between within- and across-context acoustic
variability for the examined segments.
3.3.2. Analysis 2 Results: Comparison of variability across phonological segments
The results of the comparison of stochastic and contextual variability in the previous
section indicate that speakers’ production of variability for a single phonological unit (as indexed
by comparison to other speakers) is correlated across multiple levels of linguistic structure,
implying that an individual speaker’s stochastic variability (in a given dimension of a given
segment) predicts their contextual variability (and vice versa). This in turn may indicate that at
least some of the variability observed in an individual’s speech is phonologically controlled and
incorporated in the representation of phonological units. A second analysis was conducted to
explore the extent to which variability is specific to individual phonological units for the given
speaker, and the extent to which the variability a speaker exhibits along a given phonetic
dimension in one segment predicts the variability exhibited along the same dimension in a
different segment. In this analysis, Spearman’s rank-order correlation was used to
quantify the predictability of individual speakers’ IQRCON values across pairs of phonological
segments for each of the measured phonetic dimensions. The analysis of the variability across
phonological segments was only conducted for IQRCON due to the expectation that this measure
of stochastic variability would reflect less of an influence of contextual or environmental factors
than either across-context or overall variability, and as such would more directly reflect
variability arising from the individual differences in the phonological unit rather than variability
arising from the interaction of these potential differences with other factors.
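Computationally, this analysis reduces to rank-correlating the columns of a speakers-by-segments table of IQRCON values, one table per phonetic dimension. A minimal sketch, again with synthetic data and illustrative names only:

    # Sketch (Python) of Analysis 2: pairwise Spearman correlations of
    # speakers' IQR_CON values across segments, for one dimension (here CL).
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(2)
    segments = ["t", "s", "sh", "l", "r"]
    # Rows = speakers, columns = segments; values = IQR_CON for CL (toy data)
    iqr_con_cl = pd.DataFrame(rng.gamma(2.0, 1.0, size=(40, len(segments))),
                              columns=segments)
    print(iqr_con_cl.corr(method="spearman").round(3))   # cf. Table 3.4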
3.3.2.1. Articulatory variability
The results of the analysis of IQRCON for articulatory dimensions across phonological
segments indicate a general lack of a relationship in the realization of variability for a given
articulatory dimension across segments (Figures 3.5 – 3.9). As seen in Table 3.4, a statistically
significant correlation between IQRCON values for the same dimension in different segments is
only observed in a few isolated incidences. No consistent patterns of significant correlation are
observed for specific segment groupings or dimensions. This overall lack of strong correlations
and of patterns in the observed correlations suggests that speakers who are more variable in their
production of an articulatory dimension in one segment are not necessarily more variable in their
production of that dimension in other segments. In comparisons involving CL across segments,
for example, only one comparison is significant when controlling for false discovery rate (/s/
vs. /t/: rs = 0.475, p = 0.002 and padj = 0.036), with only one additional comparison reaching
significance in the uncontrolled comparison with α = 0.05 (/s/ vs. /ɹ/: rs = 0.395, p = 0.012 and
padj = 0.083). No comparisons were significant for CD, and the comparison for LP in /ʃ/ and /ɹ/
(the only other comparison involving a dimension that may directly reflect an articulatory target
for a pair of segments) was similarly not significant (rs = 0.24, p = 0.127 and padj = 0.319). A
general lack of significant relationships in variability across segments was also observed for CO
and LA. The larger number of significant comparisons observed for LP may reflect general
speaker tendencies to be more or less variable in lip posture for the examined segments.
Figure 3.5. Comparison of IQRCON across phonological segments for CL.
Figure 3.6. Comparison of IQRCON across phonological segments for CD.
Figure 3.7. Comparison of IQRCON across phonological segments for CO.
Figure 3.8. Comparison of IQRCON across phonological segments for LA.
Figure 3.9. Comparison of IQRCON across phonological segments for LP.
When the analysis of variability across segments is conducted using CoVCON instead of
IQRCON for the two articulatory dimensions where this may be appropriate (CD and LA), the
results of the analysis change considerably for LA but not for CD. Using CoVCON, significant
correlations of LA across segments are observed in 7/10 pairwise comparisons both when
controlling and when not controlling for false discovery rate, with LA not significantly related in
the comparisons of /t/ vs. /s/, /s/ vs. /ʃ/, and /ʃ/ vs. /l/. This may also reflect general speaker
tendencies towards more or less variability in the production of lip posture in these segments. In
contrast, only one statistically significant correlation is observed for comparisons of CD across
segments using CoVCON, and none are observed after correcting for false discovery rate (/s/ vs.
/ʃ/: rs = 0.369, p = 0.02 and padj = 0.19). These findings, in combination with the general pattern
of results from the analysis of IQRCON, indicate that articulatory variability is not consistently
related across segments for dimensions directly related to their phonological goals.
Table 3.4. Correlation matrix for the comparison of IQRCON across segments for each articulatory
dimension. All correlations calculated using Spearman’s rho. Green cells indicate comparisons
that are significant at the uncorrected p < 0.05 level. Comparisons significant after using the
Benjamini-Hochberg method to control for false discovery rate are additionally italicized.
/t/ /s/ /ʃ/ /l/ /ɹ/
CL
/t/ 1.000 0.475 0.290 0.065 0.317
/s/ 1.000 -0.076 0.158 0.395
/ʃ/ 1.000 -0.245 -0.020
/l/ 1.000 0.183
/ɹ/ 1.000
CD
/t/ 1.000 0.275 0.152 0.129 -0.173
/s/ 1.000 0.133 0.029 0.160
/ʃ/ 1.000 -0.120 -0.161
/l/ 1.000 0.261
/ɹ/ 1.000
CO
/t/ 1.000 0.015 0.233 0.031 -0.001
/s/ 1.000 0.023 0.572 0.147
/ʃ/ 1.000 -0.014 0.045
/l/ 1.000 0.191
/ɹ/ 1.000
LA
/t/ 1.000 0.148 0.208 -0.158 0.219
/s/ 1.000 0.210 0.340 0.332
/ʃ/ 1.000 0.003 0.214
/l/ 1.000 0.298
/ɹ/ 1.000
LP
/t/ 1.000 0.328 0.536 0.403 0.393
/s/ 1.000 0.406 0.390 0.291
/ʃ/ 1.000 0.244 0.288
/l/ 1.000 0.237
/ɹ/ 1.000
3.3.2.2. Acoustic variability
A similar lack of general relationship in IQRCON across speakers and phonological
segments was observed for the more limited set of possible comparisons across measured
acoustic dimensions (Figure 3.10). As different acoustic dimensions were measured for the liquid
and fricative consonants, only one pairwise comparison between segments was possible for each
measured acoustic dimension. Using Spearman’s rho, significant relationships in IQRCON were
observed only for M2 and M4 in the /s/~/ʃ/ comparison (Table 3.5). These relationships were
only statistically significant at the α = 0.05 significance level, failing to reach significance after a
correction for false discovery rate was applied (M2: rs = 0.375, p = 0.018 and padj = 0.071; M4: rs
= 0.385, p = 0.015 and padj = 0.071). Notably, the significant relationship observed for M2 in the
/s/~/ʃ/ comparison disappeared when the analysis was conducted with CoVCON
instead (M2: rs = -0.0009, p = 0.995, padj = 0.995). Like the results of the analysis of articulatory
variability across segments, the results of the analysis of acoustic variability point toward a
general lack of cross-segment relationships, particularly for dimensions that previous research
has shown to be critical in the production and perception of phonological contrasts for these
consonants (e.g., Boyce & Espy-Wilson, 1997; Delattre & Freeman, 1968; Jongman et al., 2000;
Lehiste, 1964; O’Connor et al., 1957; Li et al., 2009).
Figure 3.10. Comparison of IQRCON between phonological segments for all measured acoustic
dimensions. Top row: liquid contrasts (/l/ vs. /ɹ/); bottom row: fricative contrasts (/s/ vs. /ʃ/).
Table 3.5. Spearman’s rho for the comparison of IQRCON across segments for each acoustic
dimension. Green cells indicate comparisons that are significant at the uncorrected p < 0.05
level. No comparisons were significant using the Benjamini-Hochberg method to control for
false discovery rate.

           M1     M2     M3     M4
  /s/~/ʃ/  0.134  0.375  0.013  0.385

           F1     F2     F3     F4     F2-F1
  /l/~/ɹ/  0.264  0.271  0.194  0.193  0.080
3.3.3. Analysis 3: Comparisons across articulatory dimensions
The results of Analysis 2 indicate that the articulatory and acoustic variability a speaker
exhibits is not consistently related across phonological segments, suggesting that variability may
be encoded separately for individual phonological units. In Analysis 3, variability in the
production of multiple articulatory dimensions within the same segment is compared across
speakers to determine the identity of the phonological unit for which variability may be
controlled. Understanding how variability operates at a segmental or gestural level will be
important for the development and testing of models of phonological planning given differing
views regarding the locus of phonological representational and speech planning units in existing
models (e.g., Browman & Goldstein, 1985, 1986, 1992 et seq.; Guenther, 1995; Guenther et al.,
2006; Pierrehumbert, 2000, 2002, 2016; Tourville et al., 2011). This understanding can
subsequently be viewed as a prerequisite for deciding how best to incorporate individual
differences in variability into this system (e.g., Nieto-Castañon et al., 2005). If a plurality of
articulatory dimensions exhibits a comparable extent of variability within the same phonological
segment for a given speaker, this may suggest variability is controlled at the segmental level. In
contrast, if the variability observed for articulatory dimensions in the same segment that recruit
different active articulators is not related,¹⁷ this may imply variability is specified at a
subsegmental level, for different gestures or tract variables.
To evaluate the extent to which variability might be controlled at a subsegmental versus a
segmental level, Spearman’s correlation coefficient was calculated for pairwise comparisons of
IQRCON across pairs of articulatory dimensions in each segment. To the extent that the variability
exhibited along one dimension is predictive of the variability along the others, this would
argue for segment-specific control of variability. In this instance, a parallel analysis of acoustic
dimensions was not conducted; this decision was
based on the expectation that multiple articulatory dimensions would simultaneously influence
each measured acoustic dimension (e.g., Iskarous, Shadle, & Proctor, 2010; Narayanan, Alwan,
¹⁷ Some degree of “common” variability may be observed among articulatory dimensions that are taken from
measurements of the same functional subregion of active articulators and are therefore measuring different attributes
of the same constriction gesture. In the data presented here, CL, CD and CO are all measuring properties of the same
lingual constriction gesture. If variability is controlled on the level of a subphonemic unit of representation, it is
unclear and unanswerable from the data examined in this study whether it would be controlled at the level of the
tract variable (implying that CL and CD could differ in patterns of variability for the same gesture) or at the level of
the gesture (implying a common control of variability in CL and CD). Biomechanical factors and the manner in
which change in one of these dimensions may impact the others (e.g., CD is likely to change with a change in CO)
may also cause the appearance of shared variability in these dimensions.
104
& Haker, 1997; see also the results of the articulatory-acoustic analysis in Chapter 1) and that
change in a single articulatory dimension may influence multiple acoustic dimensions (e.g., Lin,
Beddor, & Coetzee, 2014). This lack of a one-to-one correspondence between articulation and
acoustics was expected to hinder the ability to tease apart segmental and subsegmental control of
variability in the analysis of acoustic dimensions, making it difficult to interpret the results of
comparisons across acoustic dimensions with respect to this question.
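To make the procedure concrete, the following illustrative Python sketch (not the code used in this dissertation; all names and data shapes are assumptions) computes Spearman's rho for every pairwise comparison of dimensions within a segment and applies the Benjamini-Hochberg correction:

    from itertools import combinations
    import numpy as np
    from scipy import stats
    from statsmodels.stats.multitest import multipletests

    def pairwise_spearman(iqr_con):
        """iqr_con: dict mapping dimension name (e.g., 'CL') to an array of
        per-speaker IQR_CON values, aligned across dimensions by speaker."""
        pairs = list(combinations(sorted(iqr_con), 2))
        rhos, pvals = [], []
        for a, b in pairs:
            rho, p = stats.spearmanr(iqr_con[a], iqr_con[b])
            rhos.append(rho)
            pvals.append(p)
        # Benjamini-Hochberg adjusted p-values across all pairwise tests
        _, padj, _, _ = multipletests(pvals, method="fdr_bh")
        return [(a, b, rho, p, pa)
                for (a, b), rho, p, pa in zip(pairs, rhos, pvals, padj)]

    # Example with random placeholder data for five dimensions in one segment
    rng = np.random.default_rng(1)
    fake = {d: rng.random(40) for d in ["CL", "CD", "CO", "LA", "LP"]}
    for a, b, rho, p, pa in pairwise_spearman(fake):
        print(f"{a} vs. {b}: rs = {rho:.3f}, p = {p:.3f}, padj = {pa:.3f}")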
The results of the comparison of IQRCON across articulatory dimensions in each segment
indicate that there is a tendency for dimensions involving the same primary articulator to
correlate with one another but very little evidence for a strong relationship between dimensions
recruiting different primary articulators (Figures 3.11 – 3.15). At least one of the pairwise
comparisons between any two of the three dimensions involving measurement of the pellet on
the tongue tip (CL, CD and CO) was statistically significant at both the corrected and
uncorrected significance levels for each of the five segments examined (Table 3.6). In contrast,
pairwise comparisons between any of the three lingual dimensions and either of the two labial
dimensions (LA and LP) consistently failed to reach statistical significance after applying the
Benjamini-Hochberg method to control for false discovery rate, with only two comparisons
between different articulatory organs (CL vs. LA in /t/ and CD vs. LA in /t/) significant in the
uncontrolled analysis (Table 3.6). However, the correlation between LA and the two lingual
dimensions in /t/ is no longer statistically significant when the analysis is conducted with
CoVCON instead of IQRCON for LA and CD (CL vs. LA in /t/: rs = 0.193, p = 0.23; CD vs. LA in
/t/: rs = 0.278, p = 0.08213).
The results of this analysis suggest that if variability is incorporated in the representation
of phonological units, the appropriate level of representation for this incorporation is
subsegmental. This conclusion is drawn from the lack of consistent relationship in IQRCON
between dimensions recruiting different primary articulators or constricting organs, namely the
lips and the tongue tip, in the XRMB data. This conclusion is, however, only preliminary for
several reasons that will be elaborated on in the discussion.
Figure 3.11. Comparison of IQRCON across articulatory dimensions for /t/.
Figure 3.12. Comparison of IQRCON across articulatory dimensions for /s/.
Figure 3.13. Comparison of IQRCON across articulatory dimensions for /ʃ/.
Figure 3.14. Comparison of IQRCON across articulatory dimensions for /l/.
Figure 3.15. Comparison of IQRCON across articulatory dimensions for /ɹ/.
Table 3.6. Correlation matrix for the comparison of IQRCON across articulatory dimensions for
each segment. All correlations calculated using Spearman’s rank-order correlation coefficient.
Green cells indicate comparisons that are significant at the uncorrected p < 0.05 level.
Comparisons significant after using the Benjamini-Hochberg method to control for false
discovery rate (pADJ < 0.05) are additionally italicized.
/t/    CL      CD      CO      LA      LP
CL   1.000   0.493   0.168   0.386   0.139
CD           1.000   0.290   0.380  -0.071
CO                   1.000   0.232  -0.117
LA                           1.000  -0.227
LP                                   1.000

/s/    CL      CD      CO      LA      LP
CL   1.000   0.151   0.476   0.011   0.171
CD           1.000   0.511  -0.048  -0.150
CO                   1.000  -0.115  -0.213
LA                           1.000   0.147
LP                                   1.000

/ʃ/    CL      CD      CO      LA      LP
CL   1.000   0.401   0.272   0.096  -0.232
CD           1.000   0.434  -0.125  -0.112
CO                   1.000   0.109  -0.207
LA                           1.000  -0.028
LP                                   1.000

/l/    CL      CD      CO      LA      LP
CL   1.000   0.484   0.251   0.225  -0.039
CD           1.000   0.335   0.310   0.143
CO                   1.000   0.303   0.016
LA                           1.000   0.254
LP                                   1.000

/ɹ/    CL      CD      CO      LA      LP
CL   1.000   0.148   0.573   0.204   0.032
CD           1.000   0.318   0.250   0.045
CO                   1.000   0.207   0.069
LA                           1.000   0.058
LP                                   1.000
3.3.4. Discussion
The analyses of phonetic variability within and across phonological units demonstrate
that the amount of variability that an individual speaker exhibits (relative to the population of
speakers) is systematic at the level of the individual phonological unit, but not across different
phonological units. A general implication of these results is that interspeaker differences in
phonetic variability do not consistently appear to reflect speaker-specific tendencies to be more
variable or more consistent in speech production; they may instead reflect individual differences
in the encoding of variability for different phonological units.
Significant correlations between within- and across-context variability are consistently
observed for multiple articulatory and acoustic dimensions in each segment, and are observed in every segment for the two articulatory dimensions most closely tied to phonological goals
(CL and CD). These relationships largely remain significant even after a correction for multiple
comparisons was applied. This indicates that despite the tendency for patterns of interspeaker
difference in (unconditioned) token-to-token variability and (conditioned) contextual variation to
be examined separately in the literature, interspeaker variability in these two domains is
connected, with speakers who exhibit more stochastic variability also exhibiting greater
contextual effects on their production of a particular phonological unit.
In articulation, this relationship is generally stronger for dimensions that are more closely
related to the contrastive phonological goals for a segment. In acoustics, this relationship is
broadly observed across examined dimensions and does not clearly reflect the presumed
phonological importance or informativity of individual dimensions. The articulatory results lend
themselves to an interpretation in which variability in articulatory production targets is
incorporated in the cognitive representation of phonological units accessed during speech
planning. This interpretation is drawn from the observation of stronger relationships between
within- and across-context variability for dimensions more closely reflecting the phonological
goals of the segment.
In conjunction with this, the results of the parallel analyses of articulatory and acoustic
variability across segments fail to support the proposal that individual differences in variability
primarily reflect general tendencies for individual speakers to be more or less variable in
production. No consistent patterns of significant correlations in variability between specific
segment pairs are observed. Although there are examples of significant correlations in IQRCON
across segments for specific articulatory and acoustic dimensions, regular patterns wherein the
variability for a particular dimension is consistently related across segments are only observed
for two articulatory dimensions (LA, in the analysis with CoVCON, and LP). Critically, significant
cross-segment relationships are extremely rare for the two most consistently goal-oriented
articulatory dimensions examined, CL and CD, with the strongest relationships in these
dimensions observed between segments that arguably share the same phonological specification
for the dimension along which they are related (/s/ and /t/ for CL, and /s/ and /ʃ/ for CD). In
contrast to the consistency observed in the direction of the relationship between stochastic and
contextual variability, both positive and negative correlation coefficients are observed for the
pairwise comparisons of articulatory variability across different segments. All of these
observations seem to undermine the proposal that interspeaker differences in variability reflect a
general propensity of the speaker toward variability, indicating instead that variability may be
encoded separately for individual phonological segments.
On the surface, the results of the comparison of dimension-specific articulatory
variability across segments presented here are fairly similar to those from Whalen et al. (2018).
In both Analysis 2 here and in their paper, a very small number of comparisons between
segments were found to be statistically significant. However, the conclusions drawn from the
results of the present investigation differ from those in Whalen et al. (2018). Whalen et al. (2018)
interpreted their findings as indicating that variability was related across vowels, with speakers
who were more variable in the production of one vowel also generally more variable in the
production of others, since the significance of at least one of the pairwise comparisons at the corrected significance level was taken as sufficient evidence for rejecting the general null hypothesis of their investigation (that there would be no correlation between vowel pairs). In the present study, however, the null hypothesis was not rejected.
This discrepancy in the interpretation of very similar results can be attributed to differences in the focus of investigation and, consequently, in the hypotheses being tested in the two studies. The present study focused specifically on determining whether the observed variability could be tied to the control structures governing the production of the examined segments. Because the control structures relevant to the realization of different segments along particular dimensions may be equivalent, and because similarity in the phonological specification of particular segment pairs (e.g., similarity in CD for /s/ and /ʃ/) was expected to engender significant findings for certain comparisons, a more general pattern of significant relationships in variability (e.g., the observation of relationships between segments that lacked a similar phonological specification) was taken to be necessary for full confidence in rejecting the null hypothesis.
Second, and relatedly, the specific segment pairs for which production variability was
found to be significantly related along a particular articulatory dimension were generally those
that both empirical research and theoretical models of representation would very specifically
suggest have the same or very similar control structures underlying their production. The clearest
example of this is the findings for CL, where the only comparison found to exhibit a significant
relationship at the corrected significance level was that between /s/ and /t/. Previous research has
shown that the tongue tip constriction gesture for both /s/ and /t/ is generally realized at the
alveolar ridge by speakers of American English (Dart, 1998), with substantial overlap in the region of the palate over which they are produced (Mooshammer, Hoole, & Geumann, 2007), although speakers
do differ in the extent to which their target locations for these consonants overlap (Yunusova,
Rosenthal, Rudy, Baljko, & Daskalogiannakis, 2012). This aligns with their identical
specification for place of articulation in both gestural and feature-based accounts of phonological
representation. Instrumental measurement of constriction location for the other consonants
examined in this study, including /l/ (which is often described as identical in place to /s/ and /t/ in
accounts of English phonology), has demonstrated that they do not exhibit as high a degree of
coherency as /s/ and /t/ in terms of realized place of articulation (e.g., Mooshammer et al., 2007).
A similar effect of equivalence in the control structures governing the production of the
specific vowel pairs found to be significantly related in variability could in theory underlie the
results observed in Whalen et al. (2018). The only vowel pairs in their study for which
articulatory variability was significantly correlated after applying a correction for false discovery
rate, the high vowels /i/ and /u/ and the lax front vowels /ɪ/ and /ɛ/, exhibit similarities in their
phonological specification that could be used to motivate a relationship in variability. Their
results, as well as those in Bakst (2021), may also differ from those of the present study due to
their more holistic measurement of articulatory and acoustic variability. Conversely, differences
in the motor control principles or the nature of the representational structures governing vowel
production and consonant production (e.g., Bailly, 1997; Guenther, 1995; Öhman, 1966, 1967)
could underlie asymmetries in whether more global or unit-specific effects are observed in the
production of members of these phonological classes, leading to differences in the globality of
variability in a speaker’s production of vowels and consonants.
The results of Analyses 1 and 2 provide support for P1 and P2, respectively, and
subsequently support the hypothesis that interspeaker differences in articulatory variability are
encoded in the representation of individual phonological units. The results of Analysis 3 suggest
that the appropriate level of representation for this incorporation is subsegmental. This
conclusion is drawn from the lack of consistent relationship in IQRCON between dimensions
recruiting different primary articulators, namely the lips and the tongue tip, in the XRMB data.
This conclusion is, however, only preliminary due to the possibility that different primary
articulators (i.e., the tongue tip and the lips) may differ in their intrinsic movement kinematics in
such a way that they will inherently differ in movement variability and due to the exclusion of
articulatory structures like the velum and larynx from the analysis. Further research will be
necessary to tease apart the extent to which cross-articulatory dimension differences in
variability can be attributed to differences in the abstract nature of the phonological targets
guiding goal-directed movement for each articulatory dimension and the extent to which they may be attributable to physical or biomechanical factors instead.
In theory, the differences in phonetic variability observed across speakers could arise
from the influence of extra- or paralinguistic factors on the realization of static articulatory
targets specified as production goals. Although the suggestion from the analyses presented in this
section that variability is differently represented for different phonological units makes this
eventuality seem rather unlikely, differences in the effect of different conditioning factors on
different consonants or dimensions could result in the appearance of unit-specific variation that
in fact reflects more general patterns of variability conditioning. For example, the individual
differences in variability observed could be due to subtle differences in the prosodic realization
of different tokens that were not picked up by the rather crude definition of prosodic context used
in the analysis here. This would mean that the observed interspeaker differences in within-
context variability would be due simply to differences in their implementation of prosodic
elements instead of differences in their definition of production targets. The analyses presented
in Sections 3.4 and 3.5, examining the relationship between phonetic variability and two classes
of factors known to condition variability (vocal tract morphology and prosodic variation), are
intended to examine whether the interspeaker differences in phonetic variability observed here
are conditioned by speaker traits shown to influence variability in previous research.
3.4. Individual differences in vocal tract morphology and variability
In the first of two analyses designed to evaluate the prediction that factors known to
condition variability in speech would not consistently account for the observed interspeaker
differences in variability (P3), the relationship between interspeaker anatomical variation and
phonetic variability is tested.
3.4.1. Methods
3.4.1.1. Morphological measurements
A set of five dimensions of variation in the morphology of the vocal tract was calculated for each speaker for comparison with the measurements of phonetic variability. The dimensions chosen for inclusion in this analysis were primarily measures of palate morphology. The analysis focused on palate morphology because of the potential of anatomical differences to explain interspeaker phonetic variability, given previous findings suggesting that individual differences in palate morphology may affect the range of variability exhibited by a speaker in the production of coronal consonants (e.g., Brunner et al., 2009; Mooshammer et al., 2004; Rudy &
Yunusova, 2013). The specific dimensions chosen were selected due to either their identification
as a major locus of variation in vocal tract morphology (e.g., Lammert et al., 2013) or previous
work demonstrating a relationship with articulatory variability (e.g., Brunner et al., 2009;
Johnson, 2018; Rudy & Yunusova, 2013).
As the coordinate system used to describe pellet positions for each speaker in the XRMB
data was based on stable anatomical landmarks, with the origin of the system centered at the tips
of the speaker’s central maxillary incisors (Westbury, 1994: 38), it was possible to base
measurements of palate morphology on the raw x-y coordinates of points on the speaker’s palate
trace (Figure 3.16). Four measurements indexing anatomical properties of the speaker’s palate
were calculated based on the palate trace and used in all analyses. The height of the palate (PH)
was measured as the y-axis value of the highest point on the speaker’s palate trace (YMax in
Figure 3.16), while the length of the palate (PL) was calculated as the difference between the x-
axis coordinates of the most anterior and most posterior points on the palate trace (XMax and
XMin, respectively, in Figure 3.16) (Equation 3.1). This measurement was included due to the
observation from Rudy and Yunusova (2013) that palate length was correlated with the range of
vertical values their speakers exhibited in the production of coronal consonants.
PL = XMax_x - XMin_x (3.1)
Previous research has shown that how flat or domed a speaker’s palate is may influence the
extent of variability they exhibit in production, particularly in terms of vertical motion (e.g.,
Brunner et al., 2009). Typical methods of measuring the doming of the palate require a
measurement of the palate width that was unobtainable from the midsagittal XRMB data (e.g.,
Bakst, 2021; Brunner et al., 2009). A measurement approximating the area under the palate in 2-
D space was calculated to compensate (PA). PA was calculated for each speaker using the
method described in Johnson (2018) (Equation 3.2). Larger PA values approximate a larger area under the palate, with larger area predicted to engender greater variability in articulation across speakers.

PA = YMax_y × XMax_x (3.2)
The final measurement of palate morphology calculated was the slope of the palate from the teeth to its highest point (PS).18 Although not explicitly tied to individual differences in phonetic variability in the existing literature, palate slope has been related to individual differences in the production of phonological contrasts (Weirich & Fuchs, 2013) and roughly aligns with two of the major modes of interspeaker variation in palate shape (anteriority and sharpness) noted in Lammert et al. (2013).

PS = YMax_y / YMax_x (3.3)

18 The origin of the coordinate system (0,0) always served as the second point in the calculation of slope. This point, although not technically on the palate, was used due to concerns about the accuracy of the anterior boundary of the palate trace for multiple speakers. Overall results of the analysis did not change when slope was calculated with (XMin_x, XMin_y) as the second point instead.
A measure of Oral Cavity Length (OCL) was also included in the analysis due to concerns about the accuracy of the palate trace for multiple speakers (Equation 3.4). This measurement was calculated as the horizontal distance between the lower incisor (LI) and the closest point on the trace of the speaker's pharyngeal wall (PHAR) while the speaker's vocal tract was in a rest position.

OCL = PHAR_x - LI_x (3.4)
Finally, Pharyngeal Cavity Length (PhCL) was calculated as the vertical distance between the most superior and the most inferior points on the trace of the speaker's pharyngeal wall, as shown in Equation (3.5).

PhCL = PHAR_ymax - PHAR_ymin (3.5)
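For concreteness, a minimal Python sketch of Equations 3.1-3.5 is given below. It is a hypothetical illustration rather than the code used in this dissertation; it assumes the palate and pharyngeal wall traces are available as arrays of x-y points in the XRMB coordinate system, and its sign conventions inherit whatever conventions those coordinates use.

    import numpy as np

    def morphology_measures(palate_xy, phar_xy, li_x):
        """palate_xy, phar_xy: (N, 2) arrays of x-y points on the palate and
        pharyngeal-wall traces (origin at the maxillary incisor tips).
        li_x: x-coordinate of the lower incisor at rest."""
        x, y = palate_xy[:, 0], palate_xy[:, 1]
        ymax_pt = palate_xy[np.argmax(y)]        # highest point on trace (YMax)
        ph = ymax_pt[1]                          # palate height
        pl = x.max() - x.min()                   # PL = XMax_x - XMin_x    (3.1)
        pa = ymax_pt[1] * x.max()                # PA = YMax_y * XMax_x    (3.2)
        ps = ymax_pt[1] / ymax_pt[0]             # PS: slope from (0,0)    (3.3)
        # OCL (3.4): horizontal distance from LI to the pharyngeal-wall point
        # nearest in x; the absolute value sidesteps sign conventions.
        nearest_x = phar_xy[np.argmin(np.abs(phar_xy[:, 0] - li_x)), 0]
        ocl = abs(nearest_x - li_x)
        phcl = phar_xy[:, 1].max() - phar_xy[:, 1].min()   # PhCL          (3.5)
        return {"PH": ph, "PL": pl, "PA": pa, "PS": ps,
                "OCL": ocl, "PhCL": phcl}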
Figure 3.16. Illustration of anatomical landmarks used to calculate vocal tract morphology
measurements.
Table 3.7. Summary of morphological measurements and associated acronyms
PL = Palate Length; PA = Palate Area; PH = Palate Height; PS = Palate Slope; OCL = Oral Cavity Length; PhCL = Pharyngeal Cavity Length
Table 3.8. Spearman’s rank-order correlation for the comparison of vocal tract morphology and IQRCON for each articulatory
dimension in each segment. Green cells indicate comparisons that are significant at the uncorrected p < 0.05 level. No comparisons
were significant after using the Benjamini-Hochberg method to control for false discovery rate (all padj > 0.05).
CL CD CO LA LP
rs p padj rs p padj rs p padj rs p padj rs p padj
PA
/t/ 0.10 0.53 0.64 -0.03 0.84 0.94 -0.13 0.43 0.68 0.07 0.67 0.67 0.13 0.42 1.00
/s/ 0.06 0.73 0.73 0.34 0.03 0.09 0.07 0.68 0.82 -0.07 0.67 0.87 0.08 0.62 0.94
/ʃ/ 0.19 0.23 0.46 0.12 0.45 0.67 -0.08 0.66 0.95 0.07 0.66 0.66 0.06 0.69 0.83
/l/ -0.20 0.21 0.63 -0.30 0.06 0.19 -0.08 0.77 0.77 -0.29 0.07 0.40 0.29 0.07 0.22
/ɹ/ -0.04 0.79 0.80 -0.17 0.31 0.46 -0.41 0.11 0.36 -0.06 0.73 0.87 0.00 0.98 0.98
PL
/t/ 0.03 0.83 0.83 0.07 0.69 0.94 0.13 0.45 0.68 0.18 0.28 0.67 0.02 0.91 1.00
/s/ 0.22 0.17 0.73 0.09 0.59 0.61 0.08 0.67 0.82 -0.03 0.84 0.87 0.22 0.18 0.77
/ʃ/ 0.24 0.13 0.39 -0.07 0.65 0.68 -0.01 0.94 0.95 0.28 0.08 0.46 -0.03 0.87 0.87
/l/ -0.07 0.65 0.76 0.02 0.90 0.90 -0.15 0.50 0.75 -0.21 0.20 0.40 0.29 0.07 0.22
/ɹ/ 0.05 0.75 0.80 0.03 0.88 0.91 -0.36 0.15 0.36 0.23 0.15 0.37 -0.03 0.83 0.98
PH
/t/ 0.13 0.42 0.64 -0.05 0.75 0.94 -0.04 0.79 0.95 0.07 0.67 0.67 0.21 0.19 1.00
/s/ -0.07 0.66 0.73 0.41 0.01 0.06 0.16 0.37 0.82 0.05 0.75 0.87 -0.01 0.96 0.96
/ʃ/ 0.05 0.76 0.90 0.21 0.19 0.58 0.01 0.95 0.95 -0.08 0.64 0.66 0.21 0.20 0.83
/l/ -0.09 0.59 0.76 -0.34 0.03 0.19 0.46 0.06 0.20 -0.14 0.39 0.47 0.14 0.39 0.43
/ɹ/ -0.07 0.66 0.80 -0.21 0.20 0.46 0.28 0.08 0.59 -0.21 0.18 0.37 0.07 0.68 0.98
PS
/t/ 0.11 0.48 0.64 -0.14 0.39 0.94 -0.21 0.21 0.68 -0.24 0.13 0.67 -0.00 0.99 1.00
/s/ -0.09 0.58 0.73 0.08 0.61 0.61 -0.09 0.62 0.82 0.10 0.55 0.87 -0.18 0.26 0.77
/ʃ/ -0.10 0.54 0.81 0.23 0.16 0.58 0.09 0.60 0.95 -0.23 0.16 0.47 -0.08 0.63 0.83
/l/ 0.05 0.76 0.76 0.19 0.24 0.37 0.10 0.71 0.77 0.06 0.70 0.70 -0.19 0.24 0.43
/ɹ/ 0.04 0.80 0.80 0.02 0.91 0.91 0.33 0.18 0.36 -0.01 0.95 0.95 0.18 0.27 0.98
OCL
/t/ -0.29 0.07 0.40 -0.19 0.25 0.94 0.02 0.95 0.95 0.07 0.65 0.67 -0.00 1.00 1.00
/s/ -0.08 0.62 0.73 -0.20 0.21 0.40 -0.34 0.04 0.36 -0.03 0.87 0.87 0.08 0.63 0.94
/ʃ/ -0.34 0.03 0.18 -0.13 0.43 0.67 -0.18 0.24 0.67 0.17 0.30 0.61 0.07 0.69 0.83
/l/ -0.13 0.43 0.76 -0.02 0.89 0.90 -0.14 0.39 0.51 -0.19 0.23 0.40 0.16 0.32 0.43
/ɹ/ -0.17 0.29 0.80 0.17 0.29 0.46 0.05 0.85 0.85 -0.08 0.62 0.87 0.02 0.90 0.98
PhCL
/t/ 0.11 0.50 0.64 0.01 0.94 0.94 -0.15 0.38 0.68 0.09 0.60 0.67 -0.04 0.79 1.00
/s/ -0.07 0.69 0.73 0.18 0.27 0.40 -0.01 0.98 0.98 -0.08 0.62 0.87 -0.02 0.93 0.96
/ʃ/ -0.02 0.90 0.90 -0.07 0.68 0.68 -0.12 0.45 0.95 0.08 0.62 0.66 -0.17 0.30 0.83
/l/ 0.28 0.08 0.51 0.24 0.14 0.28 -0.29 0.30 0.60 0.18 0.26 0.40 0.13 0.43 0.43
/ɹ/ -0.10 0.52 0.80 0.22 0.17 0.46 -0.16 0.49 0.59 0.29 0.07 0.37 -0.01 0.94 0.98
3.4.2. Results
3.4.2.1. Relationship between vocal tract morphology and articulatory variability
Individual differences in vocal tract morphology did not consistently predict speaker
differences in variability for any articulatory dimension nor for any of the examined
phonological segments (Table 3.8). Significant correlations between measurements of vocal tract morphology and IQRCON were observed at the uncorrected p < 0.05 level only for CD in /s/ and
/l/, CL in /ʃ/, and CO in /s/. PH was found to significantly predict IQRCON for CD in /s/ and /l/,
although the effect ran in opposing directions for this articulatory dimension in the two segments, indicating that a higher palate predicted greater variability in
CD for /s/ but less variability in CD for /l/. A significant positive correlation was also observed
between PA and CD in /s/, indicating that speakers with greater area under the hard palate were
more variable in CD for this segment. The other significant correlations all involved OCL and
indicated that speakers with longer oral cavities exhibited less variability in CL for /ʃ/ and CO in /s/.
A similar lack of consistent effects of vocal tract morphology on articulatory variability is
observed when the analysis is conducted with CoVCON for CD and LA. A significant correlation
is still observed between PH and CD variability in /l/ (rs = -0.39, p = 0.01, padj = 0.04), with PA
also significantly correlating with CD variability for this segment (rs = -0.47, p < 0.0001, padj =
0.01). Unlike in the analysis with IQRCON, variability in CD for /s/ does not significantly
correlate with variation in any morphological dimension. However, additional correlations not
observed in the analysis with IQRCON are found involving CD in /t/, with a significant positive
relationship observed between this dimension and both PA and PL (PA vs. /t/ CD: rs = 0.52, p <
0.0001, padj = 0.01; PL vs. /t/ CD: rs = 0.47, p < 0.0001, padj = 0.01). Significant negative
correlations are also found between LA in /l/ and two morphological dimensions: PA (rs = -0.33,
p = 0.04, padj = 0.11) and PH (rs = -0.35, p = 0.03, padj = 0.11).
3.4.2.2. Relationship between vocal tract morphology and acoustic variability
Individual differences in vocal tract morphology were almost universally unrelated to
interspeaker differences in acoustic variability for the examined fricatives, but exhibited a more
substantial relationship with formant variability in the liquids. The results of the analysis relating
vocal tract morphology to IQRCON for fricative spectrum measurements are shown in Table 3.9
and the results for formant measurements in the liquid consonants are shown in Table 3.10. The
only significant correlation between any morphological dimension and IQRCON for any spectral
measurement in /s/ or /ʃ/ was a negative relationship between PhCL and M1 in /ʃ/. No significant
correlations were observed when the analysis for M1 and M2 was redone using CoVCON.
A more substantial array of significant correlations was observed in the comparison of
vocal tract morphology and interspeaker differences in IQRCON for the formant measurements in
/l/ and /ɹ/. F1 variability was predicted by both PH and PS for both liquid consonants, with the
negative relationship observed in all comparisons indicating that speakers with shallower, less
steeply sloped palates exhibited more variability in F1 for both segments. Speakers with lower
palate heights also tended to exhibit greater variability in F2 for both /l/ and /ɹ/, as evidenced by
the significant negative correlation of IQRCON in both /l/ and /ɹ/ with PH. IQRCON for F2 in /l/
also exhibited a significant negative correlation with PA. Finally, PhCL was positively correlated
with F3 variability in /l/, indicating that speakers with longer pharynxes were more variable in
this dimension. Similar results were observed for the comparison of CoVCON with vocal tract
morphology.
Table 3.9. Spearman’s rho (rs) for the comparison of vocal tract morphology and IQRCON for
each acoustic dimension in /s/ and /ʃ/. Comparisons significant at the uncorrected p < 0.05 level
are shown in green. No comparisons were significant after using the Benjamini-Hochberg
method to control for false discovery rate (all padj > 0.05).
M1 M2 M3 M4
rs p padj rs p padj rs p padj rs p padj
PA
/s/ -0.37 0.13 0.67 -0.16 0.35 0.88 -0.21 0.20 0.39 -0.21 0.19 0.47
/ʃ/ 0.02 0.92 0.92 -0.10 0.53 0.91 0.24 0.14 0.85 0.14 0.40 0.92
PL
/s/ -0.13 0.60 0.87 -0.10 0.58 0.88 -0.16 0.32 0.54 -0.22 0.17 0.47
/ʃ/ -0.12 0.48 0.82 0.08 0.60 0.91 0.14 0.38 0.85 -0.01 0.97 0.97
PH
/s/ 0.09 0.71 0.87 -0.12 0.50 0.88 -0.13 0.44 0.60 -0.07 0.67 0.90
/ʃ/ 0.17 0.29 0.82 -0.10 0.53 0.91 0.18 0.27 0.85 0.13 0.42 0.92
PS
/s/ 0.09 0.70 0.87 -0.11 0.54 0.88 -0.03 0.87 0.87 -0.06 0.72 0.90
/ʃ/ 0.14 0.41 0.82 -0.06 0.73 0.91 -0.07 0.67 0.85 -0.04 0.81 0.97
OCL
/s/ 0.04 0.87 0.87 0.03 0.87 0.88 -0.11 0.48 0.60 -0.17 0.31 0.61
/ʃ/ -0.30 0.07 0.33 0.01 0.97 0.97 -0.15 0.35 0.85 -0.12 0.46 0.92
PhCL
/s/ -0.40 0.11 0.67 -0.03 0.88 0.88 -0.23 0.15 0.39 0.00 0.99 0.99
/ʃ/ -0.34 0.03 0.33 -0.24 0.14 0.91 0.03 0.84 0.85 0.14 0.39 0.92
Table 3.10. Spearman’s rho (rs) for the comparison of vocal tract morphology and IQRCON for each acoustic dimension in /l/ and /ɹ/.
Comparisons significant at the uncorrected p < 0.05 level are bolded and in green. Comparisons significant after using the Benjamini-
Hochberg method to control for false discovery rate (padj < 0.05) are additionally italicized.
F1 F2 F3 F4 F2-F1
rs p padj rs p padj rs p padj rs p padj rs p padj
PA
/l/ -0.32 0.05 0.09 -0.17 0.30 0.43 -0.08 0.64 0.86 -0.00 0.98 0.98 -0.27 0.09 0.43
/ɹ/ -0.16 0.33 0.55 -0.39 0.01 0.07 -0.20 0.21 0.52 -0.01 0.97 0.97 -0.25 0.13 0.27
PL
/l/ 0.11 0.52 0.57 0.28 0.08 0.26 0.13 0.44 0.79 0.19 0.23 0.67 0.03 0.84 0.97
/ɹ/ 0.18 0.25 0.51 0.01 0.95 0.98 -0.04 0.82 0.84 0.13 0.43 0.71 0.19 0.23 0.33
PH
/l/ -0.45 0.00 0.04 -0.46 0.00 0.03 -0.16 0.31 0.79 -0.17 0.30 0.67 -0.34 0.03 0.31
/ɹ/ -0.32 0.04 0.21 -0.50 0.00 0.01 -0.25 0.11 0.52 -0.05 0.74 0.93 -0.42 0.01 0.08
PS
/l/ -0.37 0.02 0.06 -0.19 0.23 0.38 -0.03 0.86 0.86 -0.17 0.28 0.67 -0.05 0.74 0.97
/ɹ/ -0.34 0.03 0.21 -0.28 0.08 0.27 -0.30 0.06 0.52 -0.13 0.43 0.71 -0.28 0.08 0.26
OCL
/l/ 0.11 0.49 0.57 0.26 0.10 0.26 -0.03 0.83 0.86 -0.02 0.90 0.98 -0.20 0.21 0.43
/ɹ/ 0.10 0.54 0.58 0.16 0.33 0.54 -0.03 0.84 0.84 0.15 0.34 0.71 0.00 1.00 1.00
PhCL
/l/ 0.31 0.06 0.09 -0.10 0.54 0.54 0.37 0.02 0.22 0.26 0.11 0.67 0.13 0.42 0.69
/ɹ/ -0.11 0.49 0.58 0.00 0.98 0.98 0.05 0.76 0.84 -0.11 0.50 0.71 0.12 0.44 0.56
3.4.3. Discussion
On the whole, the results of this analysis of interspeaker differences in vocal tract
morphology and phonetic variability suggest that anatomical differences do not consistently
predict interspeaker differences in either articulatory or acoustic variability. This provides strong
support for the prediction that interspeaker differences in traits known to condition variability do
not consistently predict interspeaker differences in variability for specific phonological units
(P3), and consequently evidence in support of the hypothesis that interspeaker differences in
phonetic variability reflect differences in the cognitive representation of phonological units.
For the examined articulatory dimensions, CD was the only dimension for which a significant correlation with any morphological dimension was found in more than one segment, with the direction of the effect in /l/ the opposite of what would be expected given previous research (e.g., Brunner et
al., 2009; Rudy & Yunusova, 2013). Vocal tract morphology also did not consistently account
for variability in multiple dimensions for any given segment. A similar lack of relationship with
vocal tract morphology was observed for interspeaker differences in variability along the
acoustic dimensions measured for /s/ and /ʃ/.
The one caveat to this generalization is the finding that interspeaker differences in F1 and
F2 variability for /l/ and /ɹ/ (as measured using IQRCON) significantly correlated with
interspeaker differences along multiple morphological dimensions. This finding diverges from
previous literature that has examined the relationship between vocal tract morphology and
phonetic variability (e.g., Bakst, 2021; Brunner et al., 2009), which has generally failed to
observe significant relationships between acoustic variability and vocal tract morphology. The
lack of a relationship between morphological variation and acoustic variability in previous
studies is used as support for an account highlighting the need to maintain acoustic consistency
in production. In such an account, articulatory variability is more constrained for speakers with
flatter palates than for speakers with more domed palates because slight changes in articulation
result in larger acoustic changes for the speakers with flatter palates. If this constraint on
variability indeed were to underlie the observed relationship between palate shape and
articulatory variability, then we would expect similar variability in acoustics across all speakers
regardless of anatomical variation. The observation of a relationship between acoustic variability
and vocal tract morphology in the present study, as well as the observation that these
relationships are in the opposite direction than what Brunner et al. (2009) would predict, seems
to undermine this previously proposed motivation for the relationship between articulatory
variability and morphology.
In sum, the inconsistent relationship between vocal tract morphology and articulatory variability, as
well as the observation of relationships between morphology and acoustic variability, would
appear to suggest that the need to maintain acoustic consistency in the face of physical variability
is not in fact a consistent driving factor in the control of production variability. This, in turn,
could be extended to support the present proposal that variability is specific to individual
phonological units and encoded in their representation.
That said, it is worth noting that the few relationships observed between articulatory
variability and vocal tract morphology in the data examined here are mostly observed for /s/ and
/ʃ/, with the relationship between palate height and constriction degree in /l/ the only comparison
reaching significance for the liquid consonants. Additionally, the negative relationship between
palate height and /l/’s constriction degree is in the opposite direction from what would be
predicted to reduce acoustic variability for speakers with flatter palates. Given this, it is possible
that the almost categorical lack of a relationship between acoustic variability and vocal tract
morphology for the fricatives could result from speakers exercising greater articulatory precision in these consonants than in the liquids in order to minimize acoustic variability.19 While
this would to some extent undermine taking the reversal of the expected pattern as support for
the encoding of variability in phonological representation, this proposal still would not explain
why we observe multiple examples of relationships between articulatory variability and vocal
tract morphology that go in the opposite direction from what this account would predict. It also
fails to explain why speakers would not similarly minimize articulatory variability in the liquid
consonants in order to maintain acoustic consistency.
The investigation of individual differences in vocal tract morphology was limited in this
study by lack of access to coronal or axial plane measurements of speakers’ hard palates. Most
studies that have observed a relationship between palate morphology and articulatory variability
have used measurements of palate curvature based on the coronal shape of the hard palate (e.g.,
Bakst, 2021; Brunner et al., 2009; Rudy & Yunusova, 2013). Access to coronal measurements or
full 3-D geometries of the investigated speakers’ palates would have allowed for more concrete
measurements of aspects of vocal tract morphology that have been identified by previous
research as correlates of individual differences in articulatory variability. Although it is not clear
that the results of the present study deviate substantially from those of previous studies (because of the inconsistency of observed effects in those studies [e.g., Brunner et al., 2009] and/or other differences in methodology [e.g., Rudy & Yunusova, 2013]), the use of palate area
measurements calculated solely from information in the sagittal plane could have minimized the
appearance of relationships between vocal tract morphology and articulatory variability here.
19 See Bakst (2017) for a similar proposal, albeit with fricative consonants in the position of suboptimal acoustic consistency.
3.5. Individual differences in characteristic prosody and variability
Lastly, an analysis comparing interspeaker differences in suprasegmental properties of
speech to phonetic variability was conducted to further evaluate the prediction that factors
known to condition variability in speech would not consistently account for the observed
interspeaker differences in variability (P3). A potential relationship between habitual speech rate
and phonetic variability arises from previous research suggesting that speakers exhibit
substantial variation in speech rate across utterances, with the magnitude of this variation
differing across speakers (e.g., Miller, Grosjean, & Lomanto, 1984) and that the effect of
intraspeaker rate fluctuations on the phonetic properties of phonological segments differs across
individuals (Theodore, Miller, & DeSteno, 2009).
Given a lack of concrete knowledge about the specific relationship between interspeaker
differences in habitual speech rate and phonetic variability and the possibility that some speakers
in prior studies may have exhibited greater internal variation in their realization of
suprasegmental speech properties, and consequently conditioned more variation in their own
speech, the purpose of the present analysis is twofold. First, it is intended to evaluate the extent
to which intrinsic differences in speech prosody may relate to interspeaker differences in
phonetic variability (cf. Tsao et al., 2006; Yunusova et al., 2012). Second, it is intended to ensure
that the observed interspeaker differences in variability between speakers cannot be accounted
for simply via interspeaker variation in the magnitude of rate fluctuations or changes in prosodic
phrasing during the recording of the XRMB data.
3.5.1. Methods
Three different methods were used to analyze individual differences in speech rate and
phrasing among the speakers. Two of these measurements were designed to directly assess the
speech rate at which individual tokens were produced, allowing for the effect of speech rate on
articulation to be quantified within individual speakers as well as allowing for cross-speaker
comparisons of intrinsic rate differences and their relationship to variability. The third
measurement was designed to assess differences in prosodic phrasing across speakers and was
calculated separately for each file of XRMB data.
Following previous studies examining the effect of rate on consonant articulation (e.g.,
Byrd & Tan, 1996), the first measure of speaking rate, interval duration (IntD), calculated the
acoustic duration of a CVCV or VCVC sequence selected from the same syntactic clause as each
token of interest (Figure 3.17). Larger IntD values are interpreted as indicating slower speech
rate. As the tokens analyzed in the calculation of articulatory variability were taken from a
variety of TIMIT sentences, it was not possible to select an identical CVCV sequence for each
token. As such, various measures were taken to minimize the variability introduced into
the measurement from differences in sequence identity. First, for each stimulus sentence, the
same CVCV sequence was used across all speakers and all productions of the sentence. Second,
all selected sequences contained one lexically stressed vowel and one unstressed vowel. Finally,
whenever possible selected sequences were taken from a single morphological word to minimize
the chance that the sequence would be broken up by a pause or other prosodic break. To increase
the likelihood that the chosen CVCV sequences adequately reflected the speech rate at which the
token of interest was produced, the sequence (fulfilling the outlined structural requirements) that
was closest to the target phonetic segment was selected for each token, generally occurring in a
span of two words on either side of the token.
Figure 3.17. Illustration of linguistic elements used to calculate speech rate and prosodic
phrasing.
The second measure of speaking rate, unstressed vowel duration (UVD), was calculated
as the average duration of unstressed vowels in the same prosodic ‘phrase’ as the target segment
(Equation 3.6). For the purposes of this analysis, ‘prosodic phrase’ was defined as the span of
segments between annotated pauses in the forced alignment of the XRMB data. UVD was
chosen as a second metric for analyzing rate variation between and within speakers due to the
observation that vowels in unstressed syllables undergo larger modulations of their duration as
speech rate changes (Fourakis, 1991; Peterson & Lehiste, 1960; Port, 1981; cf. Gay, 1978,
Tuller et al., 1982) and subsequently may serve as a reliable index of differences in speech rate.
Larger UVD values were interpreted as indexing slower speech rate.
UVD = (ə₁ + ə₂ + ⋯ + əₙ) / n (3.6)
(where ə₁ … əₙ = the durations of the unstressed vowels in the same prosodic 'phrase' as the target segment, and n = the number of unstressed vowels in the inter-break interval)
In addition to these two measurements of speech rate, a final metric, pause ratio (PR),
was calculated to index differences in the patterning of prosodic phrasing across speakers
(Equation 3.7). Specifically, PR measured the ratio of the number of annotated pauses to the
number of segments in each file for each speaker. As the segmental content (i.e., the sentences
read) in each file was constant across speakers, this metric is thought to indicate differences in
the phrase boundary frequency or size of prosodic groupings across individual speakers. Larger
PR values indicate a greater frequency of production of pauses or prosodic breaks.
PR = (∑ᵢ₌₁ᵖ PSᵢ) / (∑ᵢ₌₁ᵍ SGᵢ) (3.7)
(where PS = pause or break, SG = segment, p = PS count, and g = SG count)
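A minimal Python sketch of the three measures follows. It is a hypothetical illustration, not the code used in this dissertation: the tuple-based alignment format and the pause label "sp" are assumptions rather than the format of the actual forced alignments.

    def interval_duration(start, end):
        """IntD: acoustic duration of the selected CVCV/VCVC sequence."""
        return end - start

    def unstressed_vowel_duration(phrase, unstressed_labels):
        """UVD (Eq. 3.6): mean duration of unstressed vowels in the span of
        segments between annotated pauses (the prosodic 'phrase').
        phrase: list of (label, start, end) tuples."""
        durs = [end - start for label, start, end in phrase
                if label in unstressed_labels]
        return sum(durs) / len(durs) if durs else float("nan")

    def pause_ratio(file_segments, pause_label="sp"):
        """PR (Eq. 3.7): ratio of annotated pauses to segments in one file."""
        pauses = sum(1 for label, _, _ in file_segments if label == pause_label)
        segments = sum(1 for label, _, _ in file_segments if label != pause_label)
        return pauses / segments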
Both the median and the standard deviation of each rate measurement were calculated
across all tokens for each speaker separately. The calculated median was taken to index the
speaker’s habitual production of each prosodic variable, while the calculated standard deviation
was taken as indexing how variable the speaker was in their realization of each prosodic variable
(i.e., how much their speech rate or size of prosodic groupings fluctuated across utterances). The
correlation of each of these summary statistics with speakers’ IQRCON values along each
measured articulatory and acoustic dimension was calculated using Spearman's rho to evaluate
whether individual differences in articulatory variability along these dimensions reflected
individual differences in speech rate and prosodic phrasing. The Benjamini-Hochberg method
was used to control for false discovery rate in all analyses.
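The per-speaker summary and correlation step can be sketched as follows (again a hypothetical illustration with assumed names and data shapes, not the code used in this study):

    import numpy as np
    from scipy import stats

    def prosody_vs_variability(measure_by_speaker, iqr_con_by_speaker):
        """measure_by_speaker: dict speaker -> array of per-token values of one
        prosodic measure (IntD, UVD, or PR); iqr_con_by_speaker: dict speaker ->
        IQR_CON for one dimension in one segment."""
        speakers = sorted(measure_by_speaker)
        medians = np.array([np.median(measure_by_speaker[s]) for s in speakers])
        sds = np.array([np.std(measure_by_speaker[s], ddof=1) for s in speakers])
        iqr = np.array([iqr_con_by_speaker[s] for s in speakers])
        return {"median": stats.spearmanr(medians, iqr),   # habitual value
                "sd": stats.spearmanr(sds, iqr)}           # fluctuation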
3.5.2. Results
3.5.2.1. Relationship between prosodic variables and articulatory variability
The results of the analysis of prosodic and articulatory variability are given in Table 3.11.
In this table and all others in this results section, the top portion of the table (labeled with x̄)
presents the results of analyses using the median values of the measured prosodic variables
(corresponding to interspeaker differences in the habitual production of speech rate and
phrasing), while the bottom portion (labeled with SD) presents the results of analyses using the
standard deviation of the measured prosodic variables (corresponding to interspeaker differences
in how variable these dimensions are across the examined speech samples). The results indicate
that interspeaker differences in IQRCON for the measured articulatory dimensions are not
consistently predicted by interspeaker differences in either the median value or the standard
deviation of the considered prosodic variables. This is particularly true for comparisons of
IQRCON to median values of IntD, UVD and PR across speakers, which result in only one
comparison that is significant at the uncorrected p < 0.05 level (UVD vs. /l/ CL) and none that
are significant after a correction for false discovery rate is applied.
Comparisons of IQRCON to the standard deviations of the measured prosodic variables
suggest a greater tendency towards statistically significant correlations, with this tendency driven
mostly by a handful of significant correlations involving CO in /t/, /ʃ/ and /ɹ/. A positive
correlation between the standard deviation of UVD and CL in /l/ parallels the relationship observed between this same dimension in /l/ and median UVD. Additional
relationships are observed between PR and CD in /ʃ/, IntD and LA in /s/, and IntD and LP in /t/.
As was observed for the analysis with medians, no correlations are significant when a correction
for false discovery rate is applied. The results of an analysis using CoVCON for CD and LA are
identical to the analysis using IQRCON.
Table 3.11. Spearman’s rho (rs) for the comparison of speech rate and prosodic phrasing with IQRCON for each articulatory dimension
in each segment. Green cells indicate comparisons significant at the uncorrected p < 0.05 level. No comparisons were significant after
correcting for false discovery rate.
CL CD CO LA LP
rs p padj rs p padj rs p padj rs p padj rs p padj
x̄
IntD
/t/ -0.21 0.19 0.58 -0.05 0.77 0.85 0.07 0.67 0.67 -0.04 0.81 0.85 0.20 0.22 0.29
/s/ 0.16 0.33 0.97 0.18 0.27 0.95 0.25 0.11 0.37 -0.37 0.02 0.17 0.14 0.38 0.73
/ʃ/ -0.27 0.09 0.31 -0.17 0.28 0.31 -0.20 0.21 0.27 -0.24 0.13 0.47 0.04 0.79 0.98
/l/ 0.16 0.33 0.38 -0.00 0.98 0.98 0.05 0.75 0.95 0.02 0.91 0.95 -0.10 0.53 0.79
/ɹ/ -0.13 0.44 0.71 -0.13 0.43 0.71 0.09 0.59 0.96 -0.15 0.37 0.81 0.04 0.80 0.80
UVD
/t/ -0.21 0.19 0.58 0.03 0.85 0.85 0.10 0.53 0.67 -0.14 0.37 0.85 0.19 0.22 0.28
/s/ 0.13 0.41 0.97 0.01 0.95 0.95 0.21 0.18 0.37 -0.31 0.05 0.15 -0.09 0.59 0.73
/ʃ/ -0.09 0.57 0.69 -0.20 0.22 0.31 -0.21 0.19 0.27 -0.22 0.17 0.47 -0.13 0.42 0.98
/l/ 0.37 0.02 0.08 0.02 0.89 0.98 -0.11 0.50 0.95 -0.01 0.95 0.95 -0.18 0.27 0.54
/ɹ/ 0.04 0.82 0.82 -0.20 0.21 0.71 0.05 0.76 0.96 -0.05 0.75 0.90 0.13 0.44 0.66
PR
/t/ -0.10 0.52 0.76 0.04 0.79 0.85 0.23 0.15 0.29 0.03 0.85 0.85 0.19 0.24 0.29
/s/ 0.01 0.97 0.97 0.07 0.67 0.95 0.25 0.11 0.37 -0.27 0.10 0.17 0.00 0.99 0.99
/ʃ/ 0.01 0.95 0.95 -0.18 0.26 0.31 -0.20 0.21 0.27 -0.11 0.52 0.77 -0.05 0.78 0.98
/l/ 0.29 0.07 0.11 0.09 0.58 0.87 0.05 0.75 0.95 0.05 0.78 0.95 0.01 0.96 0.96
/ɹ/ 0.13 0.44 0.71 -0.07 0.66 0.71 0.02 0.91 0.96 0.06 0.72 0.90 0.10 0.55 0.66
SD
IntD
/t/ -0.01 0.94 0.94 0.22 0.16 0.85 0.35 0.03 0.14 -0.06 0.69 0.85 0.35 0.03 0.14
/s/ -0.09 0.56 0.97 -0.02 0.92 0.95 0.07 0.65 0.65 -0.32 0.04 0.15 0.08 0.61 0.73
/ʃ/ -0.21 0.20 0.31 -0.22 0.17 0.31 -0.20 0.22 0.27 -0.04 0.78 0.94 0.03 0.86 0.98
/l/ 0.14 0.38 0.38 0.11 0.50 0.87 0.05 0.75 0.94 0.05 0.74 0.95 0.26 0.10 0.54
/ɹ/ 0.09 0.59 0.71 0.06 0.71 0.71 0.01 0.96 0.96 0.02 0.90 0.90 0.25 0.13 0.22
UVD
/t/ -0.08 0.63 0.76 0.10 0.55 0.85 0.32 0.04 0.14 -0.06 0.72 0.85 0.10 0.55 0.55
/s/ 0.07 0.65 0.97 0.07 0.66 0.95 0.11 0.49 0.59 0.00 0.99 0.99 0.30 0.06 0.35
/ʃ/ -0.22 0.18 0.31 -0.16 0.31 0.31 -0.04 0.81 0.81 -0.19 0.23 0.47 0.01 0.94 0.98
/l/ 0.35 0.03 0.08 0.15 0.34 0.87 0.01 0.97 0.97 0.21 0.20 0.59 -0.05 0.74 0.89
/ɹ/ 0.09 0.57 0.71 0.09 0.59 0.71 0.32 0.04 0.14 0.13 0.41 0.81 0.19 0.25 0.50
PR
/t/ -0.16 0.33 0.67 0.06 0.70 0.85 0.08 0.64 0.67 -0.05 0.78 0.85 0.29 0.07 0.14
/s/ -0.01 0.96 0.97 -0.08 0.63 0.95 0.15 0.36 0.54 -0.10 0.53 0.63 0.10 0.54 0.73
/ʃ/ -0.25 0.12 0.31 -0.35 0.03 0.15 -0.32 0.04 0.24 0.01 0.96 0.96 0.01 0.98 0.98
/l/ 0.31 0.05 0.11 0.30 0.06 0.34 0.30 0.06 0.38 0.30 0.06 0.36 -0.20 0.22 0.54
/ɹ/ 0.14 0.37 0.71 -0.11 0.51 0.71 0.49 0.01 0.16 0.14 0.40 0.81 0.23 0.14 0.25
3.5.2.2. Relationship between prosodic variables and acoustic variability
Relationships between IQRCON and interspeaker differences in the prosodic variables are
more prevalent when considering variability along acoustic dimensions. Table 3.12 displays the
results for the comparison between acoustic and prosodic variability for the spectral dimensions
measured in /s/ and /ʃ/. Significant positive correlations for /s/ are observed between M1 and
both the median and the standard deviation of PR across speakers (in both controlled and
uncontrolled comparisons) and between M2 and PR standard deviations (uncontrolled
comparison only). M2 in /ʃ/ is correlated with both the medians and the standard deviations of
IntD and UVD, with the positive relationships observed with UVD maintaining significance after
controlling for false discovery rate.
Table 3.12. Spearman’s rank-order correlation for the comparison of prosodic measurements and
IQRCON for each acoustic dimension in /s/ and /ʃ/. Comparisons significant at the uncorrected p <
0.05 level are bolded and shown in green. Comparisons significant after using the Benjamini-
Hochberg method to control for false discovery rate (padj < 0.05) are additionally italicized.
M1 M2 M3 M4
rs p padj rs p padj rs p padj rs p padj
x̄
IntD
/s/ -0.04 0.88 0.91 -0.14 0.43 0.95 -0.09 0.57 0.99 0.08 0.61 0.85
/ʃ/ -0.02 0.89 0.93 0.36 0.02 0.08 -0.23 0.15 0.45 -0.24 0.12 0.66
UVD
/s/ 0.20 0.22 0.41 0.04 0.81 0.95 0.12 0.46 0.99 -0.24 0.13 0.85
/ʃ/ 0.19 0.24 0.91 0.48 0.00 0.01 0.07 0.65 0.99 -0.25 0.13 0.37
PR
/s/ 0.53 0.00 0.01 0.15 0.40 0.95 0.20 0.22 0.99 0.05 0.75 0.86
/ʃ/ 0.06 0.70 0.93 0.19 0.24 0.28 -0.01 0.96 0.99 -0.33 0.04 0.22
SD
IntD
/s/ -0.10 0.66 0.91 0.03 0.88 0.95 -0.02 0.92 0.99 0.14 0.40 0.85
/ʃ/ 0.08 0.62 0.93 -0.32 0.04 0.06 -0.44 0.01 0.03 -0.22 0.17 0.37
UVD
/s/ 0.32 0.19 0.72 0.29 0.09 0.95 0.00 0.99 0.99 -0.06 0.69 0.86
/ʃ/ 0.30 0.07 0.50 0.47 0.00 0.01 -0.43 0.01 0.75 -0.09 0.57 0.73
PR
/s/ 0.44 0.01 0.02 0.36 0.02 0.14 -0.07 0.68 0.99 0.01 0.94 0.94
/ʃ/ 0.15 0.36 0.91 0.13 0.42 0.42 -0.20 0.22 0.67 -0.35 0.03 0.17
When the comparison is conducted using CoVCON for M1 and M2, a significant
relationship is observed for the uncontrolled comparison between median PR values and M1 in
/s/ (rs = 0.37, p = 0.019, padj = 0.12) and for the uncontrolled comparison between IntD standard
deviations and M2 in /ʃ/ (rs = 0.37, p = 0.02, padj = 0.12). No additional comparisons were
statistically significant.
The results for the comparison between acoustic and prosodic variability for the formant
measurements in /l/ and /ɹ/ are given in Table 3.13. A significant relationship is observed
between median PR values and IQRCON for multiple formants in /l/ (F2, F4, and the F2-F1
distance), with F4 IQRCON also correlated with median UVD for /l/. Standard deviations for PR
values are correlated with IQRCON for the F2-F1 distance in /l/ and for F3 and F4 in /ɹ/.
Additionally, a negative correlation is observed between standard deviations for IntD and
IQRCON for F2 in /ɹ/. However, none of these relationships are particularly strong, as evidenced
by the fact that none are significant after controlling for false discovery rate.
In a parallel analysis using CoVCON to evaluate the relationship between acoustic and
prosodic variability in the liquid consonants, none of the specific comparisons found to be
significant in the analysis using IQRCON remain statistically significant. Instead, standard
deviations for IntD are found to correlate with F2 in /l/ (rs = -0.34, p = 0.03, padj = 0.18) and F1 in /ɹ/ (rs = -0.33, p = 0.04, padj = 0.22), while median PR is found to correlate with F4 in /ɹ/ (rs = 0.34, p = 0.03, padj = 0.18).
Table 3.13. Spearman’s rank-order correlation for the comparison of prosodic measurements and IQRCON for each acoustic dimension
in /l/ and /ɹ/. Comparisons significant at the uncorrected p < 0.05 level are bolded and in green. No comparisons were significant after
using the Benjamini-Hochberg method to control for false discovery rate (all padj > 0.05).
F1 F2 F3 F4 F2-F1
rs p padj rs p padj rs p padj rs p padj rs p padj
x̄
IntD
/l/ -0.17 0.31 0.71 -0.10 0.54 0.98 -0.13 0.42 0.94 0.20 0.21 0.40 -0.08 0.62 0.84
/ɹ/ -0.20 0.22 0.68 -0.11 0.49 0.75 0.03 0.86 0.88 0.09 0.60 0.78 -0.17 0.31 0.57
UVD
/l/ 0.04 0.82 0.95 0.05 0.78 0.98 0.09 0.58 0.94 0.38 0.01 0.31 0.15 0.37 0.84
/ɹ/ 0.04 0.80 0.86 0.12 0.46 0.75 0.12 0.45 0.88 0.15 0.37 0.78 0.09 0.57 0.58
PR
/l/ 0.21 0.20 0.69 0.44 0.00 0.07 0.04 0.81 0.88 0.38 0.01 0.20 0.36 0.02 0.16
/ɹ/ -0.12 0.48 0.86 0.05 0.75 0.81 0.03 0.83 0.88 0.27 0.09 0.42 0.29 0.07 0.40
SD
IntD
/l/ 0.08 0.61 0.88 0.00 0.98 0.98 -0.03 0.84 0.94 -0.13 0.41 0.51 0.04 0.81 0.86
/ɹ/ 0.18 0.27 0.68 -0.36 0.02 0.07 0.32 0.05 0.14 -0.30 0.06 0.42 0.11 0.51 0.58
UVD
/l/ 0.06 0.73 0.92 0.03 0.87 0.98 -0.02 0.88 0.94 -0.00 0.99 0.99 0.03 0.86 0.86
/ɹ/ -0.05 0.75 0.86 -0.05 0.75 0.81 0.09 0.59 0.88 -0.05 0.78 0.78 0.09 0.58 0.58
PR
/l/ 0.22 0.17 0.69 0.18 0.28 0.98 0.08 0.60 0.88 0.31 0.05 0.20 0.33 0.04 0.20
/ɹ/ 0.05 0.77 0.86 0.15 0.34 0.75 0.37 0.02 0.12 0.37 0.02 0.12 0.22 0.17 0.57
3.5.3. Discussion
Variation in speech rate and prosodic phrasing was not found to consistently predict
interspeaker differences in articulatory variability. The vast majority of comparisons that
exhibited statistically significant relationships (nine out of ten) involved measures of the standard
deviation of the prosodic variables considered rather than differences in the medians of speakers,
indicating that interspeaker differences in articulatory variability are almost universally unrelated
to interspeaker differences in habitual speech rate. Instead, the extent to which interspeaker
differences in articulation could be explained by variation in speech prosody was largely
attributable to speakers differing from one another in their consistency of speech rate or prosodic
grouping size across utterances; however, given that all comparisons were significant only before
applying a correction for false discovery rate, the explanatory power of even these differences in
fluctuation should be interpreted with caution. Variability in the orientation of the tongue tip
(CO) was more affected by interspeaker variation in intraspeaker prosodic variability than any
other dimension, with half of all significant comparisons involving tongue tip orientation.
The only significant relationship observed between habitual speech rate and variability
along an articulatory dimension is positive, indicating that speakers with slower habitual speech
rates are generally more variable in their production of this dimension (CL in /l/). This result
concords with the general direction of the relationship between speech rate and articulatory
variability generally observed in the literature (Kelso et al., 1985; Kleinow et al., 2001; Smith et
al., 1995; Smith & Kleinow, 2000). Most, but not all, of the significant correlations between
interspeaker differences in prosodic and articulatory variability were also positive.
In contrast to the general lack of relationship observed between the examined dimensions
of prosodic variation and articulatory variability, prosodic variation is observed to predict
acoustic variability with slightly higher consistency, at least when IQRCON is used as the index of
acoustic variability. Of particular note, a much wider array of significant relationships is observed between acoustic variability and both habitual speech rate and habitual tendencies in prosodic grouping size. Given the dearth of such relationships in the present analysis of prosodic
variation and articulatory variability, it seems that these relationships are likely reflecting the
effect of habitual speech rate and/or prosodic grouping size on aspects of speech kinematics not
examined here, such as the temporal organization of articulatory gestures. However, as many of
the observed relationships do not reach significance when CoVCON is used to index acoustic
variability for dimensions measured on a ratio scale (M1, M2, and all individual formant
measurements) and as many of the relationships are only significant when not controlling for
false discovery rate, the true extent of the relationship between interspeaker prosodic differences
and acoustic variability remains fairly ambiguous.
As a whole, then, the results of the analysis of prosodic variation and phonetic variability
present a fairly mixed picture in terms of their correspondence with the prediction that factors
known to condition variability in speech would not consistently account for the observed
interspeaker differences in variability (P3). Although there were no consistent effects on
articulatory dimensions closely aligned with potential articulatory goals in the examined
consonants, there was a notable relationship with a dimension conveying information about a
non-contrastive dimension of tongue posture and (ambiguously) a more general relationship with
acoustic variability. As will be discussed further in the general discussion, the lack of consistent
relationship with goal-oriented articulatory dimensions is viewed as critical for the interpretation
of the results of the entire study as regards the role of variability in phonological representation.
3.6. General Discussion
The results of the study presented here paint a clear picture of differences across speakers
in their variability in speech production that is systematic at the level of the individual
phonological unit but not consistent across different phonological units. This suggests that
speaker differences in token-to-token variability reflect differences in their encoding of phonetic
variability in their cognitive representation of phonological units, and argues against the
hypothesis that there are speaker-specific tendencies to be more or less variable in production
generally.
The extent of stochastic variability exhibited by individual speakers mirrors the extent of
contextual variability that they show for many of the articulatory and acoustic dimensions
examined here, with this relationship particularly well attested for the articulatory dimensions
that most directly relate to the phonological goals of each consonant. This consistency in and of
itself points to a unified control of variability across multiple levels of structural granularity in
speech – the factors determining individual differences in stochastic phonetic variability must
directly interact with, if not be the same as, the factors determining individual differences in the
extent of phonetic variability conditioned by coarticulatory and/or prosodic structure.[20]
Particularly when combined with the failure to observe a consistent relationship between speaker
differences in phonetic variability and either vocal tract morphology or prosodic variation, these
results align with an account in which some of the interspeaker variation observed in the
production of phonological segments reflects individual differences in the cognitive
representations of these segments (or their subcomponents) across speakers.

[Footnote 20: It was not possible to isolate the effects of segmental environment from that of prosodic environment in the speech sample from the XRMB data, given the available phonetic contexts that proved suitable for the extraction of target segments. How within-context variability may relate to segmental and prosodic environment separately is a critical question that will be examined in future work using a dataset designed for this purpose.]
The extent to which these results can be taken to truly support such an account, instead of
merely failing to contradict it, may be questioned given that the set of potential conditioning
factors examined here was by no means exhaustive. Individual differences in multiple factors not
examined here, like certain cognitive traits (e.g., Ou et al., 2015; Ou & Law, 2017; Yu, 2010,
2016), sensory acuity (Brunner et al., 2011; Franken et al., 2017; Ghosh et al., 2010; Perkell et
al., 2008), and potentially the inherent noisiness of the central and peripheral neural systems
underlying the planning and execution of articulatory actions (Haar, Donchin, & Dinstein, 2017;
Harris & Wolpert, 1998), as well as factors that have yet to be identified, may play a
critical role in conditioning interspeaker differences in variability. However, patterns observed in
the data under study here would suggest that a strong case in support of the tested hypothesis can
be made even without examining cognitive, sensory, or neural factors not included here.
Specifically, the general lack of systematic interspeaker relationships in variability across
different phonological units diminishes the likelihood that the examination of additional
predictive factors will paint a different picture regarding general predictability of interspeaker
differences in phonetic variability. Although different factors inducing inter- or intraspeaker
variability (such as coarticulatory and prosodic environment, anatomical variation, speech rate,
sensory acuity, etc.) do appear to affect specific phonological units or classes of phonological
units differently (e.g., Brunner et al., 2009; Perkell et al., 2008; Recasens & Espinosa, 2009,
Recasens, Pallarès, & Fontdevila, 1997; Theodore et al., 2009), the lack of cross-segment
coherency in variability and in particular the failure of the cross-segment differences to be shared
across speakers means that such a predictive capacity would require each factor to
influence different phonological units in highly idiosyncratic ways in different speakers.
An example of why this is not likely to hold can be observed through a consideration of
the potential granularity of the effect of sensory acuity on phonetic variability. It has previously
been proposed that auditory and somatosensory acuity may each play an important role in the
interpretation of feedback during speech production, and subsequently the regulation of
variability, for a particular class of phonological segments (e.g., Guenther, Hampson, & Johnson,
1998; Ghosh et al., 2010). Specifically, vowels and vowel-like segments have been proposed to
generally rely more on auditory feedback for their regulation than consonants (or at least
obstruents like /t/, /s/ and /ʃ/), which are expected to generally rely more on somatosensory
feedback (although these tendencies would presumably be modulated by individual differences
in the reliance on each type of sensory feedback, e.g., Lametti et al., 2012). Patterns like this,
where certain factors may affect distinct classes of phonological segments differently, could
account for the lack of consistent effects observed in the data if there is some degree of
consistency in observed effects over definable subclasses of phonological segments or on
particular phonetic dimensions. The paucity of significant relationships in variability even across
those segments that may be considered to belong to similar phonological classes suggests that this
explanation is unlikely to account for the effects observed in the data.
The results of the present study are therefore taken to suggest an unpredictable,
idiosyncratic component to the manifestation of phonetic variability across speakers, which is
best accounted for through the incorporation of individual differences in variability in the
cognitive representation of phonological units. These individual differences in the cognitive
representation of phonological units are proposed to arise from the interaction between multiple
factors impacting variability, such as vocal tract anatomy, cognitive traits, sensory acuity and
personal linguistic history, in the development of these representations. The implication that
some amount of the variability observed in the production of various phonological segments may
be due to individual differences in their cognitive representation does not, of course, preclude
predictable factors from also influencing the manner in which speakers may differ from one
another in production. On the contrary, it is clear from previous research that the predictable effects of
factors like vocal tract morphology, habitual realization of prosody, auditory acuity, and
cognitive traits do influence the appearance of variation in speech production across individuals.
The influence of other factors that have yet to be examined thoroughly in terms of their impact
on speech production, such as individual differences in the noisiness of the neural and
biomechanical systems underlying the planning and implementation of speech motor actions,
may also play a crucial role in the generation of interspeaker differences in articulatory
variability. However, the results of both previous research and the current study clearly indicate
that the influence of these factors is non-deterministic and cannot account for the observed
individual differences in phonetic variability in isolation. Instead, it seems likely that the
cumulative effect of the speaker’s lifetime of linguistic experience interacts with multiple
predictable factors to determine the habitual variability observed in a given speaker’s
production of speech.
3.7. Conclusion
In conclusion, the study of patterns of articulatory and acoustic variability presented here
demonstrates that variability in a speaker’s speech production is systematic at the level of the
individual phonological unit, but not consistent across different phonological units. Patterns of
variability indicate that individual differences in variability are maintained across multiple levels
of linguistic structure but are not generalized across different phonological segments or across
dimensions in the same segment that do not share a primary articulator. These results indicate
that interspeaker differences in phonetic variability reflect not only the extent to which speakers
exhibit overall tendencies towards greater variability or precision in production, but also the
specific encoding of variability for individual phonological units. In conjunction with a general
lack of consistency in the prediction of interspeaker differences in phonetic variability by either
interspeaker anatomical variation or differences in the realization of paralinguistic elements
across speakers, these results are taken to support models of phonological representation and
speech planning in which variability is encoded in the target representations of specific
phonological units. The implications of the results of this study for the precise manner in which
variability is best incorporated into representation, and the predictions that this makes for other
speech behaviors like speech perception, will be explored further in the remaining chapters of
this dissertation.
4. Incorporating individual differences in phonological representation
4.1. Introduction
The development of theoretical models of phonology and speech motor control has been
driven in no small part by the need to incorporate variation in the cognitive representation and
processing underlying speech planning and control. Although early research on speech often
categorized all variation as ‘noise’ that obscured physical invariants of phonological categories
assumed to be necessary for communicative parity (e.g., Blumstein & Stevens, 1979, 1980;
Fujimura, 1986; Stevens, 1972; Stevens & Blumstein, 1978), most modern approaches take the
view that regular variation in phonetic realization arises from the systems regulating speech
production and perception (e.g., Browman & Goldstein, 1985, 1986, 1992, et seq.; Byrd 1996,
Guenther, 1994, 1995, 2016; Guenther et al., 2006; Guenther et al., 1998; Keating, 1990, 1996;
Gafos & Kirov, 2009; Tilsen, 2019; Perrier, 2003; Pierrehumbert, 2000, 2002, 2016). However,
these approaches to phonological representation and planning differ from one another in the
precise manner in which they encode variation in the representation or structural organization of
phonological units. These differences in the encoding of variation lead to models having a
greater or lesser capacity to lawfully incorporate interspeaker variation, contextual variability,
and stochastic variability.
The analyses of XRMB corpus data in previous chapters have established that the robust
individual differences in phonetic variability observed may reflect speaker differences in the
control of variability for specific phonological units. The particular patterns observed in the
realization of interspeaker differences in variability within and across phonological units in
Chapter 3 have definite implications for the manner in which variability may be incorporated
into phonological representations. Specifically, the consistency in the relationship between
stochastic and contextual variability across speakers for individual phonological units indicates
that the target space for any phonological goal is best represented as a distribution of possible
targets within a particular region of phonetic space. The conclusion that phonological goals are
best represented as distributions of possible targets or regions in phonetic space aligns with
existing models like window models (Byrd, 1996; Keating, 1990, 1996), the DIVA model
(Guenther et al., 1995, 2016; Tourville & Guenther, 2011) and exemplar models of phonology
(e.g., Pierrehumbert, 2001, 2002, 2016). This conclusion is driven by the appearance of a
dynamic interaction between the amount of stochastic variability observed in the production of
any given dimension in any given segment and the extent of contextual variability observed in
the production of that segment, which is most easily motivated by having a range of possible
target values for a given parameter of a phonological unit that can be subjected to the predictable
effects of a phonetic context on their realization. However, the specific patterns of variability
observed here point to precise requirements for the organization of these target distributions that
existing models do not clearly satisfy (for reasons elaborated on in Section 4.4.2).
The additional observation in Chapter 3 that the control of variability seems to exhibit
much more unit-internal coherency, and much less cross-unit generalizability, for the articulatory
dimensions most closely related to the hypothetical phonological goals of the examined
consonants (namely CL, CD and for two consonants LP) additionally suggests that variability is
better represented in articulation than in acoustics. This could be interpreted as general support
of models in which the units of representation, or at least the targets directly governing the
production of these units, are primarily articulatory in nature (e.g., Browman & Goldstein, 1985,
1986, 1992 et seq.) and that the target distributions in a model accounting for the observed
patterns of interspeaker variability should be defined in articulatory space. As such, a model that
successfully accounts for the results of the XRMB investigation should be able to both
incorporate articulatory representations of phonological units and generate new, testable
hypotheses regarding the implications of this for other speech behaviors, such as perception, that
theoretically utilize the same representational system or one relying on a ‘common currency’
(e.g., Goldstein & Fowler, 2003).
It is proposed that, in order to account for the patterns of individual difference and
intraspeaker variability observed in the XRMB data, a model must incorporate a dynamic system
for target selection within a system of invariant representation for phonological units. Crucially,
this system must be conceptualized in such a way that the target space associated with a
particular phonological unit can vary across speakers. The primary goal of this chapter is to
present a model of phonological representation and target planning that incorporates these
necessary features and which can subsequently generate the patterns of individual difference
observed in the XRMB data. Specifically, previous work utilizing the framework of Dynamic
Field Theory (DFT) (Erlhagen & Schöner, 2002; Schöner, Kopecz, & Erlhagen, 1997) to
develop dynamical models of phonological representation and articulatory control (Gafos &
Kirov, 2009; Kirov & Gafos, 2007; Roon, 2013; Roon & Gafos, 2016; Tilsen, 2007, 2019) is
expanded on here to illustrate how individual differences in phonetic variability can be
incorporated in this representational system. The expansion presented of DFT-based
phonological models demonstrates how patterns of stored activation across representational
fields associated with particular phonological units can be manipulated to generate individual
differences in stochastic variability and to generate the relationship between stochastic and
contextual variability observed in the XRMB data. The model will also be used to generate
testable predictions regarding the relationship between these individual differences in production
and expected cross-speaker behavioral variation in speech perception.
4.1.1. Foundations of DFT
DFT (Erlhagen & Schöner, 2002; Kopecz & Schöner, 1995; Schöner, Kopecz, &
Erlhagen, 1997; Thelen, Schöner, Scheier, & Smith, 2001) is a mathematical framework for
embodied cognition that uses patterns of activation across (conceptual) neural populations to
model the planning and perception of goal-oriented behavior in response to specific task
parameters. Although originally developed to model the cognitive mechanisms underlying
movement planning (e.g., Erlhagen & Schöner, 2002; Schöner et al., 1997; Kopecz & Schöner,
1995), DFT has been extended to account for the operational dynamics of more abstract
cognitive phenomena such as working memory (e.g., Lins & Schöner, 2014), motion-pattern
perception (Hock, Schöner, & Giese, 2003), the production and interpretation of spatial language
(Lipinski, Schneegans, Sandamirskaya, Spencer, & Schöner, 2012), and perseverative errors
observed in the course of infant cognitive development (e.g., Thelen, Schöner, Scheier, & Smith,
2001). DFT-inspired models of phonological planning and articulatory control have also been
used to account for linguistic phenomena such as subphonemic priming and reaction time effects
in response-distractor tasks (Roon & Gafos, 2016; Tilsen, 2007), non-local harmony patterns
(Tilsen, 2019), anticipatory posturing (Tilsen, 2019), and synchronic lenition processes (Gafos &
Kirov, 2009). In this section, I will first review the basic principles underlying the generation of
goal-oriented behavior in DFT before describing the manner in which these principles have been
integrated into models of phonological representation and articulatory control.
In all instantiations of DFT, including DFT-inspired models of speech planning, each
spatial, motor, or featural dimension relevant to the planning of a behavioral action corresponds
to a continuous, spatially structured distribution of activation over the range of possible values
encompassed by that dimension (e.g., Erlhagen & Schöner, 2002; Schöner & Schutte, 2016).
These activation distributions are commonly referred to as planning fields. The dynamics of
planning fields derive from the principles of neural population coding (Erickson, 1974;
Georgopoulos, 1995; Georgopoulos et al., 1986), which hypothesize that “the properties of
perceptual, behavioral, and cognitive events are reflected by the distribution of activation over
populations of tuned neurons” (Schneegans, Lins, & Schöner, 2015, p. 63, quoting Erickson,
1974). The network properties proposed by population coding generate continuous and complete
coverage over an entire sensory dimension (Georgopoulos, 1995; Georgopoulos et al., 1986),
causing any input to the system to activate many neurons (or, in the conceptualization of DFT,
points along a planning field) that exhibit some degree of sensitivity to its specific parameter
value(s) and subsequently generate a distribution of activation across the population of neurons
(or region of the planning field). The dynamics underlying the planning and perception of
behavior in DFT are predicated on this network architecture. The principle of distributed
activation among populations of neurons allows for the evolution of activation within a planning
field to be modeled as spatially and temporally continuous (e.g., Erlhagen & Schöner, 2002),
with a topographic organization relating adjacent states within the field to contiguous spaces in
the physical or conceptual dimension modeled (e.g., Erlhagen, Bastian, Jancke, Riehle, &
Schöner, 1999; Schöner, 2014).
In the model, the goal or percept of a behavioral action is generated when the distribution
of activation within a planning field evolves to a stable activation peak (Erlhagen & Schöner,
2002). The ability of these stable peaks to form within the field is a consequence of the
combination of excitatory and inhibitory dynamics implemented in the model. The foundational
influence of the principles of population coding on the conceptualization of the planning field
enables DFT to utilize a specific instantiation of activation field dynamics based on
mathematical models of the propagation of activation through a continuous neural field (e.g.,
Amari, 1977; Amari & Arbib, 1977; Kishimoto & Amari, 1979). Network architectures utilizing
lateral inhibition, cooperation, and competition (Amari, 1977; Amari & Arbib, 1977; Erlhagen &
Schöner, 2002) enable the dynamical systems governing planning field activation to converge to
a stable state (an activation peak) given sufficiently strong input. This stable peak is generated
through a combination of excitatory and inhibitory interaction between points exhibiting
activation levels above a soft interaction threshold and other points in the field (see Section 4.2.2
below for further detail on interaction dynamics in the model). Any experienced stimulus or
generated motor action that results in the formation of a stable activation peak will cause a slight
increase in the resting activation level associated with this specific region of the planning field in
memory (Erlhagen & Schöner, 2002). As such, the particular patterns of activation encoded in a
given planning field are developed based on the individual’s experience of previous behavioral
events involving the relevant dimension, with regions in the field more frequently used in the
planning or interpretation of behavioral events likely to have higher resting activation levels.
4.1.2. Integration of DFT into models of phonological planning
In DFT-based models of phonological planning specifically, the behavioral goal selected
by the development of a stable suprathreshold peak within a planning field is a specific tract
variable parameter value that will serve as input to the articulatory implementation system. In
most previous work on speech incorporating DFT, and in the present work, the assumed system
of implementation is the Task Dynamics model (Saltzman & Munhall, 1989). Previous work
using DFT to model phonological planning has integrated it into existing frameworks of
phonological representation by proposing that the dimensions along which contrastive
phonological units are defined correspond to individual planning fields (e.g., Gafos & Kirov,
2009; Roon & Gafos, 2016; Tilsen, 2019). Specifically, most proposals implementing DFT-
inspired dynamical models of phonological representation have proposed a correspondence
between planning fields and the parameters that define the goals of a gesture in Articulatory
Phonology (AP) (Browman & Goldstein, 1986, 1989, 1992, 2000).
In both AP and DFT-based models using AP as their conceptual framework, the
articulatory gesture functions as a unit of phonological representation (e.g., Gafos & Kirov,
2009; Roon & Gafos, 2016; Tilsen, 2019), with contrasts between different lexical items encoded
by the presence or absence of specific articulatory gestures, gestural parameter values, and the
abstract temporal coordination between gestures specified by a coupling graph (e.g., Browman &
Goldstein, 1986, 1989, 1992, 2000; Goldstein et al., 2006; Nam et al., 2009). Planning fields in
DFT-based models do not replace gestures as the unit of phonological representation and
contrast but rather introduce a dynamical mechanism via which gestures’ constriction targets are
actively selected in planning.[21] This is made particularly clear in Tilsen (2019), where a dynamic
target, a tract variable parameter value that is dynamically selected through the evolution of
planning field activation, is distinguished from a gestural target, which is a long-term memory
of the patterns of activation associated with a particular contrastive phonological unit along a
planning field. Dynamic targets are selected anew each time a gesture is planned and, due to
stochastic and contextual interactions influencing planning field activation, will vary across
utterances.

[Footnote 21: Previous work incorporating DFT into AP has not explicitly addressed whether other gestural parameter values, like stiffness, are also dynamically selected, and if so exactly how this should be incorporated into the model. This question is left for future research.]
As a consequence of the gesture’s role as the unit of phonological representation in both
AP and DFT-based models situated within the AP framework, the representation of lexical items
as gestural scores in AP directly translates into DFT-based models (Gafos & Kirov, 2009:225).[22]
The substantive difference between the two models is that instead of defining gestures using
static targets, as AP does, gestural targets in the DFT-based model are defined as a specific
distribution of activation within a planning field (Gafos & Kirov, 2009; Roon & Gafos, 2016).
Contrastive gestures will differ from one another in their characteristic patterns of activation
within a planning field, representing differences in the range of likely tract variable parameter
values that could be specified as the production target for each gesture (see Figure 4.1 in Section
4.2.1.1. for a visualization). Each time a gesture is planned in a DFT-based model, the specific
tract variable parameter values serving as input to the articulatory implementation system (the
dynamic target) are actively selected and, due to the dynamics of activation within the model,
will vary from token to token in production, introducing a mechanism for the generation of
stochastic variability in target selection within the system of representation itself. Critically,
however, since the gestural targets in a lexical representation remain consistent across contexts,
the same distribution of planning field activation always underlies the process of target selection
for a given gesture, maintaining invariance in the representation of the parameter values
associated with a particular gesture irrespective of phonetic context.

[Footnote 22: Most previous work involving DFT-based models has focused solely on how the parameter space associated with a gesture may change over time (Gafos & Kirov, 2009) or how competition between inputs affects the time course of target planning (e.g., Roon & Gafos, 2016). Whether the same system of coupled oscillators currently implemented in AP (e.g., Goldstein et al., 2006; Nam & Saltzman, 2003; Nam et al., 2009) should govern gestures’ coordination relationships, or how the dynamics of intergestural timing should be modeled more generally, has remained largely unexamined in work incorporating DFT (although see Tilsen [2019] for a proposal integrating a DFT-based target selection mechanism with his Selection-coordination theory [Tilsen, 2016, 2018]).]
4.2. Model
The main goal of the modeling work presented here is to illustrate how individual
differences in phonetic variability can be incorporated in a cognitive system of phonological
representation. The proposal is made here, based on the results from Chapters 2 and 3, that this
incorporation is best accomplished by a model in which (a) the target space for any phonological
goal is represented as a context-invariant distribution of possible targets in articulatory space and
(b) the precise properties of these distributions of possible targets form the locus of individual
differences in representation. This section presents a detailed explanation of the dynamics and
components of a model incorporating DFT that satisfies these requirements.
4.2.1. Model components
Behavior is generated in DFT through the evolution of activation levels in a planning
field over time, under the influence of input reflecting the goals or demands of a specific
behavioral task. The evolution of activation within the planning field is governed by self-
stabilizing dynamics (e.g., Amari, 1977; Erlhagen & Schöner, 2002), which are modeled using
the following differential equation (reproduced from Roon & Gafos, 2016:226):
\[ \tau\,\dot{A}(x,t) = -A(x,t) + h + \mathrm{input}(x,t) + \mathrm{interaction}(x,t) + \mathrm{noise} \tag{4.1} \]
In this equation, the term A(x,t) refers to the current activation level of some point x on the
planning field, with change in A(x,t) across values of t modulated by τ, the rate of decay of
activation within this planning field. The resting level of point x, represented by the term h,
indicates an attractor state to which activation at any point in the field will return without the
introduction of input strong enough to engage the interaction term interaction(x,t). This
interaction term itself is a model component determining how activation levels at other points in
the planning field at time t influence the evolution of activation at point x. In the simulations
presented in this chapter, both the input and interaction terms are at times decomposed to reflect
the influence of multiple sources acting upon the planning field activation simultaneously.
Finally, the noise term enables stochastic change in activation at point x across time steps,
ensuring that the parameter value selected as a production target varies across different trials.
The major components of the equation governing the evolution of activation within the
planning field are discussed more thoroughly in the following sections, with a focus on their
contribution to both the generation of goal-based behavior in DFT generally and the specific
ways in which they are harnessed to generate individual differences in speech behaviors in the
version of the model developed here.
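To make these dynamics concrete, the following MATLAB sketch integrates a field governed by Equation 4.1 with a simple forward-Euler scheme. This is a minimal illustration, not the dissertation's actual script: the grid resolution, trial duration, and the zeroed input and interaction terms are assumptions introduced here for exposition.

    % A minimal sketch of Equation 4.1 under forward-Euler integration.
    % Grid, duration, and the zeroed input/interaction terms are assumptions.
    tau    = 150;                          % time constant of the field dynamics
    h      = -3;                           % resting activation level
    noiseW = 6;                            % strength of the stochastic noise term
    x = linspace(0, 24, 241);              % planning field (tract variable values)
    A = h * ones(size(x));                 % field initialized at its resting level
    for t = 1:2000
        inputTerm       = zeros(size(x)); % task + specific input (none in this sketch)
        interactionTerm = zeros(size(x)); % within-field interaction (Section 4.2.2)
        noiseTerm       = noiseW * randn(size(x));
        % Euler update of tau*dA/dt = -A + h + input + interaction + noise
        A = A + (-A + h + inputTerm + interactionTerm + noiseTerm) / tau;
    end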
4.2.1.1. Planning fields and behavioral goals
The representational space within DFT takes the form of activation fields, or planning
fields, over which activation evolves in time and space. As mentioned in Section 4.1.1, planning
fields have primarily been conceptualized as corresponding to the tract variable parameters used
to define constriction goals of the gestures that serve as the basic units of contrast in AP (e.g.,
Gafos & Kirov, 2009; Roon & Gafos, 2016; Tilsen, 2019). In the present work, a firmer stance is
taken with respect to the articulatory nature of the phonetic space represented by the planning
fields. It is specifically hypothesized here, on the basis of the support for gestural accounts of
phonological representation provided by the results of Chapter 3, that each planning field along
which the representation of a phonological unit is defined corresponds to a single tract variable
(Figure 4.1).[23]
This articulatory definition of planning fields becomes critical for the
interpretation of model predictions regarding speech perception, and particularly the perception
of subphonemic variation; these implications are discussed in detail in the presentation of
Simulation 3.
The activation dynamics in the planning field are bistable in the absence of input
(Erlhagen & Schöner, 2002). Each point along the continuum of the planning field has a low-
level attractor state, its resting activation level (h in Equation 4.1) (referred to as the “off” state
in Erlhagen and Schöner [2002]). Any change in the activation level of any point in the system
will ultimately result in the system’s return back to its resting activation level unless the input
causing this change is strong enough to prompt the stabilization of the system at the higher
activation attractor state (the “on” state) (Erlhagen & Schöner, 2002). In the model presented
here, a shift in the state of the system to the “on” state results from the presence of input indicating
a perceived stimulus or the intention to produce a gesture. The precise location within the
planning field where a stable peak emerges upon the system’s shift to the “on” state indicates the
particular tract variable target perceived or to be produced.
[Footnote 23: As mentioned in footnote 4 in Chapter 3, it is unclear from the results of the XRMB data analysis whether it is more appropriate to model variability as controlled at the level of the tract variable or at the level of the articulatory gesture. Given the lack of clear evidence regarding which of these subphonemic units serves as the locus of control of variability, the choice was made here to model variability as if it were controlled at the level of the tract variable in order to make the implementation of coarticulatory effects in Simulation 2 more straightforward. This is not, however, meant to preclude the possibility that the gesture may be the more appropriate level of granularity for the definition of the planning field in this model, a question which is left to future research.]
All gestures that are defined in terms of a specific tract variable will preshape the
planning field associated with that tract variable any time they are part of the task space (i.e., any
occasion in which the speaker-listener may decide to produce this gesture or expect to perceive it
in speech produced by another). This preshaping of the field associated with each gesture in
memory is what allows an incoming sensory stimulus, such as a perceived speech sound, to be
correctly classified as containing an instance of a particular gesture. It is also what enables the
production of gestures with appropriate constriction targets in response to either an internal or
external prompt to produce this gesture.
4.2.1.2. Input
The selection and interpretation of target parameter values during production and
perception, respectively, is prompted by the introduction of input to the planning field. The two
types of possible input to a planning field are the task input and the specific input (Erlhagen &
Schöner, 2002:549-550). Although there is only one input term (input(x,t)) in the equation
governing the evolution of activation within the planning field (Equation 4.1), this umbrella term
can be decomposed to incorporate the additive effect of multiple sources of input influencing the
evolution of activation within the field. Indeed, at any moment where a goal-oriented behavior is
being planned or perceived, there are at least one task input and at least one specific input
affecting the distribution of activation within the planning field.
A. Task input
The task input in the model developed here corresponds to the specific distribution of
planning field activation associated with an articulatory gesture in memory. This distribution of
planning field activation, which is stored in the individual’s memory and serves as their
representation of the possible goal space for a gesture’s tract variable, is referred to as its
preshape (Erlhagen & Schöner, 2002; Roon, 2013, p. 121). Regions of the parameter space,
represented by the planning field, that would constitute acceptable production values for a given
gesture are “preshaped,” meaning they have a level of activation higher than the resting level of
the field at large. Figure 4.1 shows what preshaped activation may look like for the specification
of tongue tip constriction location (TTCL) and tongue tip constriction degree (TTCD) in
American English /s/ and /ʃ/. As demonstrated by this figure, the distribution of preshaped
activation in the TTCL planning field differs noticeably for the two fricatives (Figure 4.1a and
b), corresponding to what would be contrastive gestural parameter values (alveolar and
alveopalatal) for TTCL in AP. However, the distribution of preshaped activation in the TTCD
planning field is identical for /s/ and /ʃ/, reflecting their lack of contrast along this dimension (as
both are fricatives produced with a critical degree of constriction) (Figure 4.1 c and d). These
differences in the preshapes imposed by different gestures are one mechanism by which
lexical contrast is conveyed in DFT-based speech planning models.
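As a toy illustration of this mechanism, the MATLAB sketch below plots two hypothetical TTCL preshapes of the kind schematized in Figure 4.1. The constriction-location means (8 and 14), the preshape strength (0.6), and the field range are invented values chosen only to echo the qualitative alveolar/alveopalatal contrast, not measured parameters.

    % Toy visualization of contrastive TTCL preshapes for /s/ and /ʃ/.
    % All numeric values here are assumed for illustration only.
    ttcl       = linspace(0, 24, 241);            % TTCL planning field
    preshapeS  = 0.6 * exp(-(ttcl - 8).^2  / 2);  % hypothetical alveolar /s/
    preshapeSh = 0.6 * exp(-(ttcl - 14).^2 / 2);  % hypothetical alveopalatal /ʃ/
    figure; hold on
    plot(ttcl, preshapeS); plot(ttcl, preshapeSh);
    yline(0.75, '--');                            % interaction threshold
    xlabel('TTCL target value'); ylabel('Preshaped activation');
    legend('/s/', '/ʃ/', 'threshold');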
Figure 4.1. Examples of possible preshape distributions along the TTCL and TTCD planning
fields for /s/ and /ʃ/. In each graph, x-axis = planning field for tract variable (range of all possible
parameter values associated with that tract variable), y-axis = activation added to each point on
the field at each time step of its evolution. Green line at activation level 0.75 indicates the
interaction threshold θ; blue line at activation level 4 indicates the selection threshold κ. Top row
presents visualizations of preshaped activation corresponding to the representation of (a) an
alveolar TTCL constriction in /s/ and (b) an alveopalatal TTCL constriction in /ʃ/. Bottom row
presents visualization of preshaped activation corresponding to the representation of a critical
TTCD constriction in (c) /s/ and (d) /ʃ/.
In the version of the model developed here, differences in the distribution of preshaped
activation associated with a particular gesture across speakers are proposed as the primary
locus of individual difference in the representation of phonological units. Development of these
stored patterns of activation is based on the individual’s history of exposure to tokens of particular
phonological units. Differences in the characteristics of these preshaped distributions exist across
individuals for the “same” gesture[24]
due to differences in the distribution of tokens speakers have
previously encountered, as well as individual differences in vocal tract articulatory-acoustic
mapping, cognitive traits, sensory acuity, or other factors modulating the individual’s
interpretation of linguistic input. In our current consideration of phonetic variability for
phonological models, the width of the region of preshaped activation (i.e., the range of
parameter values along the planning field that exhibit higher resting activation) is the source of
interspeaker variation manipulated in the simulations.
B. Specific input
Any source of input to the planning field that is not part of the task input but instead
specifies a particular sensory percept or behavioral goal is a specific input (Erlhagen & Schöner,
2002:549). This input can come from an external source, such as a heard speech sound or a read
word, or it can come from an internal impulse such as the decision to say a certain word (e.g.,
Gafos & Kirov, 2009; Roon, 2016; Thelen et al., 2001). The effect of an external stimulus on the
planning field is typically modeled as a localized activation spike, with the location of this spike
within the field corresponding either to the speaker’s representation of the dynamic target space
for a gesture (if the stimulus is a generic prompt to produce a specific gesture) or the perceived
target value of an auditory/visual stimulus (Figure 4.2).
[Footnote 24: Determining precisely how phonological equivalence (and therefore communicative parity) is maintained in this model is outside of the scope of the present work. It is presumed that speakers from the same speech community have a similar enough history of exposure to speech that their distributions of preshaped activation for a particular gesture will reflect norms present in the community, allowing these distributions to be conceptualized as reflecting the same functional goal or task.]
Figure 4.2. Fields with the same preshape (centered at 10 with s.d. of 1) after specific input
reflecting acoustically distinct perceptual stimuli added. In each graph, x-axis = planning field
for tract variable (range of all possible parameter values associated with that tract variable), y-
axis = activation added to each point on the field at each time step of its evolution, z-axis = time
step. The green plane at activation level = 0.75 shows the interaction threshold (θ), the blue plane
at activation level = 4 shows the selection threshold κ, and the red plane shows the target value
selected for production (first value at which activation surpassed κ). (a) Evolution of field after
input centered at target value 8 (s.d. = 1) introduced at time step 500. Selected target value (red
plane) = 9. (b) Evolution of field after input centered at target value 12 (s.d. = 1) introduced at
time step 500. Selected target value (red plane) = 11.2.
4.2.2. Interaction
Aside from discrete sources of input into the planning field, the interaction term
interaction(x,t) critically shapes the planning field dynamics. The capacity of the field to
generate stable activation peaks is largely a consequence of the field’s interaction dynamics, as
these are responsible for the formation of self-sustaining, stable activation peaks within a field
having sufficiently strong input. The interaction term itself can be decomposed into two different
types of interaction dynamics: within-field interaction and cross-field interaction. The principles
of within-field interaction in the model are discussed next, as they are active in the model used in
all simulations presented here. The implementation of cross-field interaction in the current work
is described later (in Section 4.3.2.1), as only one of the three simulations presented in this
chapter involves interaction between fields.
Within-field interaction in the model comprises excitatory and inhibitory connections
between any point on the field exhibiting activation levels sufficiently close to a soft interaction
threshold θ[25] and all other points on the field. The definition of the soft threshold determining
which points along the field contribute to the within-field interaction effects at any point in time
is given as Equation 4.2 (reproduced from Roon & Gafos, 2016, p. 229). The term β in Equation
4.2 defines the steepness of the soft threshold θ.

\[ f(A) = \frac{1}{1 + \exp[-\beta(A - \theta)]} \tag{4.2} \]
Points on the field exhibiting suprathreshold activation levels will spread excitation to
proximal regions of the field, while these same suprathreshold points will have an inhibitory
influence on more distal regions of the field (e.g., Amari, 1977; Erlhagen & Schöner, 2002).
Whether a particular suprathreshold point on the field will excite or inhibit any other point on the
field is determined by the value of σw (the width of the gaussian function defining the region
around a suprathreshold point receiving excitation) in the definition of the interaction kernel w(x)
(Figure 4.3). Equation 4.3 defines w(x) (reproduced from Roon & Gafos, 2016, p. 230).
\[ w(x) = w_{\mathrm{excite}}\, \exp\!\left(-\frac{x^2}{2\sigma_w^2}\right) - w_{\mathrm{inhibit}} \tag{4.3} \]
[Footnote 25: Although the interaction threshold can be modeled as a ‘hard’ threshold by using a step function (Amari, 1977), in DFT the threshold is usually modeled as a ‘soft’ threshold by using a sigmoid function (Erlhagen & Schöner, 2002), allowing interaction effects spurred by a given point in the field to gradually increase in strength as its activation approaches the selection threshold.]
Figure 4.3. Schematic representation of lateral inhibition as implemented in DFT. The excitatory
region σw of the interaction kernel w(x) determines the range of local excitation. The strength of
local excitation is indicated by wexcite and the strength of global inhibition is indicated by winhibit.
Together, these terms determine the amount of activation added to or subtracted from any
point x along the field at time t (Equation 4.4, reproduced from Roon & Gafos, 2016, p. 229)
through a convolution operation in which the interaction kernel w(x) is applied to the areas of the
field determined by the function f(A) to be sufficiently close to the interaction threshold. Note
that the term x′ in Equation 4.4 refers to all other locations in the planning field (other than x).

\[ \mathrm{interaction}_{\mathrm{within}}(x,t) = \int w(x - x')\, f[A(x',t)]\, dx' \tag{4.4} \]
As all locations on the field that are sufficiently activated (i.e., exceed the soft threshold
for interaction effects) will generate local excitation, more activated locations will generate
stronger interaction effects on the field as a whole (Erlhagen & Schöner, 2002; Schöner &
Schutte, 2015). The build-up of activation around the region first approaching the threshold, in
combination with the inhibition of regions of the field further from the activation peak, will
become a self-perpetuating and self-stabilizing process (Amari, 1977). This self-stabilization
results from the local excitatory interactions maintaining the existence of the activation peak,
along with the inhibition of more marginal regions of the field that prevent the peak of activation
from spreading beyond a certain width.
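The pieces defined in Equations 4.2 through 4.4 can be assembled in a few lines of MATLAB, sketched below under assumed grid settings: the sigmoid f(A) soft-gates the field, the kernel w(x) supplies local excitation and global inhibition, and a discretized convolution stands in for the integral in Equation 4.4. The single-peaked activation profile is an arbitrary example used only to exercise the computation.

    % Sketch of the within-field interaction term (Equations 4.2-4.4).
    % The activation profile A and the grid are illustrative assumptions.
    x  = linspace(0, 24, 241);  dx = x(2) - x(1);
    A  = -3 + 5 * exp(-(x - 10).^2 / 2);             % example field with one peak
    theta = 0.7;  beta = 1.5;                        % soft-threshold parameters
    wExc = 0.45;  wInh = 0.1;  sigmaW = 1;           % kernel parameters
    f  = 1 ./ (1 + exp(-beta * (A - theta)));        % Equation 4.2: soft threshold
    xk = -12:dx:12;                                  % kernel support (assumed)
    w  = wExc * exp(-xk.^2 / (2 * sigmaW^2)) - wInh; % Equation 4.3: kernel
    interactionTerm = conv(f, w, 'same') * dx;       % Equation 4.4: convolution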
4.2.3. Dynamic Target Selection
The particular point within a stable activation peak that first crosses the selection
threshold κ will be selected as the specific tract variable parameter value serving as input to the
articulatory system for a production token being planned, or as the perceived target value of an
incoming token. As the dynamic field model in Erlhagen & Schöner (2002) includes a stochastic
noise source among the factors influencing the evolution of activation in the planning field, the
precise parameter value that reaches κ first will fluctuate slightly across instances in which the
production or perception of a gesture is prompted. However, given the processes of lateral
inhibition at play in the generation of the activation peak, the values selected under this
stochastic noise will still cluster around the region of the field that first approaches the
interaction threshold (Gafos & Kirov [2009], p. 228; also illustrated by the results of Simulation
1 in Section 4.3.1.2 of this chapter).
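The selection rule can be sketched as a small helper function that scans a trial's activation history and returns the tract variable value at the first point to reach κ. The function name and its NaN convention for trials in which no point crosses the threshold are hypothetical conveniences, not part of any published implementation.

    % Hypothetical helper sketching dynamic target selection.
    % A: locations-by-timesteps activation matrix; x: field coordinates.
    function target = selectDynamicTarget(A, x, kappa)
        target = NaN;                        % no target if kappa is never reached
        for t = 1:size(A, 2)
            idx = find(A(:, t) >= kappa, 1); % first point to cross the threshold
            if ~isempty(idx)
                target = x(idx);             % corresponding tract variable value
                return
            end
        end
    end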
4.3. Simulations
This section presents a series of simulations testing the ability of the model framework
outlined in Section 4.2., and specifically the proposed manipulation of preshaped activation
distributions within this framework, to account for the patterns of interspeaker difference in
phonetic variability observed in the XRMB data. First, Simulation 1 demonstrates that this
preshape manipulation is able to generate differences in stochastic variability similar to that
observed across speakers in Chapter 2. Building on that, Simulation 2 examines the potential for
the model to generate the relationship between stochastic and contextual variability observed in
the XRMB data, and in doing so additionally proposes an embodied cognitive mechanism for the
incorporation of coarticulatory influence in the selection of dynamic speech targets. This
proposed mechanism can be viewed as an extension of the process of gestural blending in the
Task Dynamics model (Saltzman & Munhall, 1989) as it applies to the dynamics of planning
instead of (or in addition to) the dynamics of articulation.
Following these first two simulations aimed at demonstrating the ability of the proposed
model modifications to generate the patterns of variability observed in the empirical analysis
presented in previous chapters of this dissertation, an additional simulation is presented to
explore the types of predictions this model may make regarding individual differences in
phonetic variability and other speech behaviors. Simulation 3 specifically demonstrates how this
model makes concrete predictions regarding the relationship between phonetic variability in
production and the perception of subphonemic variation.
All simulations were run in MATLAB 2020a using scripts developed to implement the
solutions of the specific equations used for each simulation (detailed further in sections 4.3.1.1.,
4.3.2.1., and 4.3.3.1. below). The basic implementation of Equation 4.1 in each script was
adapted from code available as supplementary material in Roon and Gafos (2016).
4.3.1. Simulation 1: Individual differences in stochastic variability
As discussed in Chapter 2, the variability exhibited in speech articulation and acoustics
was determined, based on its systematicity across different structural levels in speech and its lack
of systematicity across different phonological units/articulators, to arise at least in part from
variability in target selection during speech planning. In Simulation 1, a proposed mechanism for
incorporating individual differences in variability into phonological representations, namely
variation in the width of the preshaped activation interval corresponding to a particular gesture
across speakers, is tested to determine whether it would generate differences in variability of
target selection across speakers.
The choice to focus on this parameter as the most likely component of the model to
exhibit the encoding of individual differences in variability was based partly on theoretical
considerations, discussed further in Sections 4.3.1.3 and 4.4.1. It was also based on the
expectation that this manipulation would generate individual differences in stochastic variability
due to the manner in which the width of the preshaped activation interval would interact with the
noise term influencing the evolution of planning field activation. The inclusion of a noise term
(Eq. 4.1) leads to variation in the precise location of the maximally activated point of the field
across multiple productions of the same gesture. For two activation distributions identical in all
aspects except for the width of the interval preshaped to have higher resting activation, stochastic
noise will have a greater effect on determining the parameter value that is ultimately selected
when the preshape is wider with a more gradual slope in activation. This can be observed in the
comparison of the preshaped activation interval in the two graphs in Figure 4.4(a).
4.3.1.1. Simulation-specific model settings
A total of 2,500 trials were run for Simulation 1, with 500 trials included for each of five
preshape conditions. The values used for τ, h, and the noise term were held constant across all
runs of the model in Simulation 1: τ = 150, h = -3, noise = 6. The values used for variables
governing interaction were also held constant across the entire simulation: θ = 0.7, β = 1.5,
w_excite = 0.45, w_inhibit = 0.1, σ_w = 1. Each trial lasted for 2,000 time steps. All values used for
unmanipulated variables (those held constant throughout the simulation) were selected due to
their use in previous DFT modeling work (e.g., Erlhagen & Schöner, 2002; Roon, 2013; Roon &
Gafos, 2016), out of concern that actively selecting new values for these unmanipulated
variables could bias model behavior.
The input term in the specific equation governing the evolution of planning field
activation in Simulation 1 can be decomposed into two terms, as shown in Equation 4.5:
\[ \mathrm{input}(x,t) = a_{\mathrm{task}}\, S_{\mathrm{task}}(x,t) + a_{\mathrm{spec}}\, S_{\mathrm{spec}}(x,t) \tag{4.5} \]
The term S_task(x,t) represents the task input to the planning field in the form of the
preshape corresponding to the possible target space for a specific gesture. The preshape input
was present for the entire duration of each trial (2,000 time steps). The input strength of the
preshape, as represented by the term a_task, was set to 0.6 for all trials. This value was selected to
be high enough that the raised activation level introduced by the preshape input would not die off
over time, but low enough that the amount of activation introduced by the preshape alone would
stay far enough away from the interaction threshold to avoid triggering the interaction term. In
the interest of simplicity, only the preshape corresponding to the gesture that would be planned
for production was specified as task input in the model, although in actual speech planning
preshapes for additional gestures would also be active in the field.
The preshape was defined as a gaussian function with a mean at parameter value 10
(placing it in the center of the planning field). The width of the gaussian defining the preshape,
defined by its standard deviation (Preshape SD), was manipulated across trials, with five levels
included in the manipulation (from narrowest to widest: 1, 2, 3, 4, 5, with each number
corresponding to the standard deviation specified for the function in a set of 500 trials). Figure
4.4(a) provides an example of how the distribution of preshaped activation in the field differs for
two values of Preshape SD (Preshape SD = 1 and Preshape SD = 3). This manipulation was
conducted to test whether differences in production variability across speakers could be modeled
as arising from differences in the preshape corresponding to the tract variable specification for a
specific gesture in memory.
The specific input S_spec(x,t) was introduced at the 200th time step of each trial and
lasted until the 500th time step. The weight of the specific input, a_spec, was set to 3.5. This value
was selected to be high enough that it would move the system away from the attractor state of the
resting level towards a higher attractor state, prompting the formation of a stable activation peak.
The specific input in this simulation represented an internal command to start planning the
production of a specific gesture and the subsequent need to select a target parameter value for the
implicated tract variables.[26]
From the introduction of the specific input onward, the evolution of
activation within the field reflects the time course of selecting a target value for a specific tract
variable in the production of the specified gesture. As the specific input was meant to represent a
command to produce the same gesture that the planning field was preshaped for, the specific
input was defined as a gaussian function with the same specification for its mean and standard
deviation as the preshape used on that trial (Figure 4.4(b)).
[Footnote 26: Although not fleshed out in the version of the model presented here, this ‘command’ would presumably reflect the excitation of the gesture upon the retrieval of a gestural score incorporating it from lexical memory (see Gafos & Kirov, 2009; Tilsen, 2019).]
Figure 4.4. (a) Example of the Preshape SD manipulation in Simulation 1. The width of the
preshaped activation in the left graph is narrower (Preshape SD = 1) than the preshaped
activation in the right graph (Preshape SD = 3). Note that activation of both preshapes is low
enough that the interaction term is not engaged (interaction threshold = green plane).
(b) Introduction of specific input to each of the preshaped fields from (a) leads to the formation
of a stable activation peak. Specific input representing an internal command to start planning the
production of the gesture was introduced to each field with a high enough weight to catalyze the
development of a stable peak. The target value corresponding to the first point within the peak to
cross the selection threshold (purple plane) is selected as the target for production.
For each trial, the first location x on the field whose activation level evolved to cross a
selection threshold (κ) was selected as the target parameter value for the production of the
gesture. The value of κ was set at 4.5 for the whole simulation. This specific value was chosen
because the stable peak generated upon the introduction of specific input tended to stabilize with
a peak activation level between 4 and 5 on all trials.
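Putting the stated settings together, a single trial of Simulation 1 can be sketched as follows. Only the parameter values (τ = 150, h = -3, noise = 6, a_task = 0.6, a_spec = 3.5, κ = 4.5, and the interaction settings) come from the text; the grid resolution, kernel support, and Euler scheme are assumptions standing in for the actual scripts adapted from Roon and Gafos (2016).

    % Condensed sketch of one Simulation 1 trial; grid and integration
    % scheme are assumptions, parameter values follow the text.
    tau = 150; h = -3; noiseW = 6; kappa = 4.5;
    theta = 0.7; beta = 1.5; wExc = 0.45; wInh = 0.1; sigmaW = 1;
    preshapeSD = 3;                                  % manipulated level (1-5)
    x  = linspace(0, 24, 241);  dx = x(2) - x(1);
    xk = -12:dx:12;
    w  = wExc * exp(-xk.^2 / (2 * sigmaW^2)) - wInh; % interaction kernel
    sTask = exp(-(x - 10).^2 / (2 * preshapeSD^2));  % preshape, mean = 10
    sSpec = sTask;                                   % specific input matches preshape
    A = h * ones(size(x)); target = NaN;
    for t = 1:2000
        inp = 0.6 * sTask;                           % a_task * S_task
        if t >= 200 && t <= 500
            inp = inp + 3.5 * sSpec;                 % a_spec * S_spec, steps 200-500
        end
        f     = 1 ./ (1 + exp(-beta * (A - theta))); % soft threshold
        inter = conv(f, w, 'same') * dx;             % within-field interaction
        A = A + (-A + h + inp + inter + noiseW * randn(size(x))) / tau;
        idx = find(A >= kappa, 1);                   % dynamic target selection
        if ~isempty(idx), target = x(idx); break; end
    end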
4.3.1.2. Results
A visual representation of the simulation results is presented in Figure 4.5. A Brown-
Forsythe test comparing the variance of the distribution of selected target values across all levels
of Preshape SD was statistically significant (F*[4,2495] = 58.939, p < 0.0001), indicating that
the manipulation of Preshape SD engendered differences in the variability of selected targets.
Pairwise comparisons between individual levels of Preshape SD indicated that the distribution of
selected targets significantly differed in variance for all comparisons except for those between
levels 3 and 4 (F*[1,998] = 3.348, p = 0.067, padj = 0.075) and levels 4 and 5 (F*[1,998] = 0.964,
p = 0.33, padj = 0.33) of the manipulation (Figure 4.5(b)).[27]
This confirms that increases in
preshape width can produce gradient differences in output variability.
As can be seen in the histograms in Figure 4.5 (a), the manipulation of Preshape SD did
not appear to also cause a change in the center of the distribution of selected target values across
conditions. To verify this, an additional analysis using a one-way ANOVA to examine the effect
of Preshape SD on the distribution of selected target values was conducted. The results of this
ANOVA confirmed that the manipulation of preshape width did not significantly impact the
mean of the target distributions (F[4,2495] = 1.389, p = 0.235), indicating that differences in
stochastic variability are separable from differences in the “typical” production of a phonological
unit in this model.

[Footnote 27: The lack of significance of these comparisons reflects the number of runs generated for the model at each level of Preshape SD. In a version of the simulation in which the model is run 1,000 times for each level of the manipulation, the pairwise comparison between levels 3 and 4 reaches significance (F*[1,1998] = 6.7, p = 0.009, padj = 0.01), and the comparison between levels 4 and 5 reaches significance when the model is run 2,000 times for each level (F*[1,3998] = 3.86, p = 0.04, padj = 0.04).]
Figure 4.5. (a) Distribution of values for selected targets across all five levels of Preshape SD.
Each graph presents the data from one level of Preshape SD, with graphs arranged in descending
order from narrowest to widest preshape. Histogram fill color indicates Preshape SD level. (b)
Mean Target Value SD across all five levels of Preshape SD (color indicates Preshape SD level).
Error bars indicate the 95% confidence interval for each group mean. Significance between
comparisons indicated by asterisks.
4.3.1.3. Discussion of Simulation 1
The outcome of Simulation 1 confirms that the variability of a distribution of selected
target values is impacted by the width of a preshaped region of activation within a planning field.
Greater variability in target selection was observed when a wider region of the planning field
exhibited raised baseline activation levels as a consequence of preshaping. As these preshaped
regions are taken to constitute an individual’s representation of the possible target space for a
specific phonological unit, the results of this simulation indicate that systematic individual
differences in variability could be accounted for in a DFT-based model by differences in the
manner in which the representation of a given phonological unit preshapes activation within a
planning field across speakers.
Although the manipulation of preshaped activation width was demonstrated to influence
variability in target selection in this simulation, it is not expected that this is the only model
component whose manipulation would result in differences in the amount of stochastic
variability generated across tokens of the same phonological unit. For example, increasing the
value of the noise term in Equation 4.1, which is responsible for inducing stochastic behavior
within the system, would certainly increase the amount of stochastic variability observed in the
behavior of the field and the selection of production targets across many instances of stable peak
formation. However, given that the results presented in previous chapters of this dissertation
indicate that interspeaker differences in phonetic variability reflect in part the local control of
variability for individual phonological units, the model must include a locus of interspeaker
differences in variability that operates at the level of specific phonological units in order to
account for the empirical data. While allowing the noise term to vary across speakers and even
potentially across different planning fields in the same speaker’s cognitive system may be able to
generate variation in stochastic variability across speakers and across gestures defined along
different tract variables, it could not generate the type of segment- or gesture-specific control of
variability observed in the XRMB data (at least not without positing within-field variation in the
value of this term). Variation in the distribution of activation associated with a particular gesture
across speakers, however, operates at a level that is sufficiently local to generate individual
differences in variability at the level of the phonological unit.
4.3.2. Simulation 2: Relationship between stochastic and contextual variability
The first simulation served as a sanity check confirming that differences in stochastic
variability could be generated by variation in the representation of the target space for a
phonological unit, specifically in the distribution of preshaped activation associated with a
particular gesture. As individual differences in stochastic variability were observed to correlate
with individual differences in contextual variability in the analysis presented in Chapter 3, a
straightforward hypothesis is that the generation of individual differences in stochastic and
contextual variability should arise from the same source within the model. Simulation 2 is
designed to test the hypothesis that individual differences in contextual variability in target
selection should also arise from the manipulation of the pattern of preshape activation
associated with a tract variable for a particular gesture, and that (mirroring the results from
Chapter 3) these differences should exhibit a positive correlation with stochastic variability.
4.3.2.1. Implementation of coarticulatory effects on speech planning
The contextual variation in the XRMB data that Simulation 2 is intended to replicate
reflects both (a) the influence of vowel context on the realization of the target segments (all of
which were coronal consonants) and (b) the prosodic position of these segments. As the
coarticulatory environment is expected to more directly affect the selection of target values than
the prosodic context (e.g., Browman & Goldstein, 1992; Fowler & Saltzman, 1993; Fukaya &
Byrd, 2005; Iskarous, McDonough, & Whalen, 2012; Parrell & Narayanan, 2018), the hypothesis
that the manipulation of preshaped activation patterns should also impact contextual variability
was tested here specifically using a model designed to mimic the effect of vocalic context on the
selection of target values for the consonant.
As the consonants examined in the XRMB data are all traditionally defined as
incorporating tongue tip gestures, and vocalic gestures directly recruit the tongue body,
coarticulatory effects on target planning are implemented in the model presented here through
inhibitory coupling between multiple planning fields (specifically, a planning field
conceptualized as corresponding to the tongue tip constriction location [TTCL] tract variable
and a planning field corresponding to the tongue body constriction location [TBCL] tract
variable). The choice to model the interaction between these independent lingual subsystems using inhibitory coupling was made because coproduced tongue tip and tongue body gestures introduce competing demands on the tongue body as an articulator (Butcher & Weiher, 1976; Iskarous, Fowler, & Whalen, 2010; Kent &
Moll, 1972). The cross-field inhibitory effects are specifically modeled using the nonlinear
function presented in Equation 4.6.
$$\text{interaction}_{cross}(x,t) = \left(\frac{1}{1 + e^{-\beta\,(A(b,t) - \chi)}}\right) \cdot w_{inhibit}^{cross} \qquad (4.6)$$
In this equation, the effect of activation present in the TBCL planning field on the activation
level (A) of any point x in the TTCL planning field at time t is a function of both (a) the location
of x in the TTCL field and (b) the location of maximum suprathreshold activation in the TBCL
field.²⁸
Note that only those points in the TBCL field that exceed a soft interaction threshold χ
will influence the evolution of activation in the TTCL field, in the same manner already
described for within-field inhibition (Section 4.2.2.). The influence of the location of maximum
suprathreshold activation in the TBCL field on the evolution of activation in the TTCL field is
determined by the scaling variable w_inhibit^cross, defined in Equation 4.7 for a field where more anterior locations in the vocal tract correspond to smaller (numerical) target values within the planning field. If we let b equal the maximally activated point along the TBCL field (where the length of the TBCL field equals l), and if the activation of b is sufficient to trigger cross-field inhibition onto the TTCL field, then w_inhibit^cross is defined as:
$$w_{inhibit}^{cross} = -\frac{b}{l}\left(\frac{|x - b|}{l}\right) \qquad (4.7)$$

²⁸ While presented as unidirectional here to simplify modeling, cross-field inhibitory effects would in reality be bidirectional, with activation in the TTCL field also affecting the evolution of activation in the TBCL field.
Nonlinearity of the cross-field inhibition function arises both from the use of a soft
threshold (χ) and from the scaling of the strength of the added inhibition as a function of the
location of point x within the planning field. This second source of nonlinearity in the
implementation of cross-field inhibition diverges from how cross-field inhibitory effects have
been defined in previous work using DFT (e.g., Hock, Schöner, & Giese, 2003; Roon, 2013;
Roon & Gafos, 2016). Namely, previous implementations of cross-field inhibitory effects in
DFT have generally accomplished this through the subtraction of a constant amount of activation
across the entirety of the affected field. In the model used here, however, the strength of cross-
field inhibition is shaped so that regions of the TTCL field mapping onto more anterior vocal
tract locations (parameter values closer to zero) receive greater inhibition (Figure 4.6). This,
combined with the influence of the location of TBCL activation on the scaling factor
9
$)*$+$%
34566
, is necessary to generate realistic “shifting” of the selected target due to
coarticulatory influence from the vocalic context onto the consonant (see Section 4.3.2.4. for
further discussion of this decision).
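A minimal sketch of this cross-field inhibition function, under the reading of Equations 4.6 and 4.7 adopted above (the logistic form of the soft threshold and the exact scaling are reconstructions, and all names are illustrative):

    import numpy as np

    def cross_field_inhibition(x_ttcl, A_tbcl, x_tbcl, beta=1.5, chi=0.0, l=20.0):
        b_idx = int(np.argmax(A_tbcl))       # maximally activated TBCL location
        b = x_tbcl[b_idx]
        # Soft threshold: activation near or above chi engages the coupling
        gate = 1.0 / (1.0 + np.exp(-beta * (A_tbcl[b_idx] - chi)))
        # Location-dependent scaling: more anterior TTCL locations (smaller x)
        # lie farther from the vocalic constriction and receive more inhibition
        w_cross = -(b / l) * (np.abs(x_ttcl - b) / l)
        return gate * w_cross                # inhibitory contribution at each TTCL x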
Figure 4.6. Effect of the nonlinear cross-field inhibition function on target selection in the TTCL
field as a function of the location of the activation engaging χ in the TBCL field. The leftmost
column of graphs shows planning fields for TBCL, the center column shows the cross-field
inhibition function corresponding to each TBCL graph, and the rightmost column shows
planning fields for TTCL. TBCL activation in (a) (centered at 13) corresponds to a more anterior
vocal tract location than the TBCL activation in (d) (centered at 17.5). The cross-field inhibition
function in (b) (reflecting the effect of (a) on (c)) introduces less of an inhibitory effect on more
anterior locations in the TTCL field than the cross-field inhibition function in (e) (reflecting the
effect of (d) on (f)). Due to this difference in the inhibition function, the target selected for the
TTCL field in (c) is more anterior (selected target = 12) than the target selected for the TTCL
field in (f) (selected target = 13.5).
4.3.2.2. Simulation-specific model settings
5,000 total trials were run for Simulation 2, with 500 trials included for each combination
of five preshape conditions and two unique vowel conditions. Each trial lasted for 2,000 time
steps. All values used for unmanipulated variables (those held constant throughout the
simulation) were selected due to their use in previous modeling work using DFT (e.g., Erlhagen
& Schöner, 2002; Roon, 2013; Roon & Gafos, 2016), motivated by concerns that active selection
of values for these unmanipulated variables could be used to bias the behavior of the model. The
values used for τ, h, and the noise term were the same as those used for Simulation 1 (τ = 150, h = -3, noise = 6) and were again held constant across all runs of the model.
Two planning fields, a TTCL field and a TBCL field, were created for the simulation.
The task input S_task(x,t) to the TTCL planning field represented the preshape corresponding to the possible target space for a consonant gesture, while the task input to the TBCL planning field represented the preshape corresponding to the possible target space for a vowel gesture. The preshape input in each field was present for the entire duration of each trial. The input strength of the preshape (w_task in Equation 4.2) was set to 0.7 for both fields for all trials. In the interest of
simplicity, only the preshape corresponding to the gesture that would be planned for production
in each field was specified as task input in the model, although in actual speech planning
preshapes for additional gestures would also be active in each field.
The preshape input to each field was defined as a gaussian function. The same
manipulation of gaussian width implemented in Simulation 1 (Preshape SD) was also
implemented in the preshape input to the TTCL field across trials in Simulation 2. The mean of
the TTCL preshape was always set at parameter value 10, and the width of the TTCL preshape
was manipulated across trials, with five levels included in the manipulation (these levels were
the same as in Simulation 1). Each level of Preshape SD was specified as the width of the TTCL
preshape input in a set of 500 trials.
For the task input to the TBCL field, the width of the gaussian function defining the
preshape was held constant across all trials, with the standard deviation of this function always
set to equal 3. The mean of the preshape was manipulated across trials to create a FRONT vowel
condition with its mean at parameter value 7 in the TBCL field and a BACK vowel condition with
its mean at parameter value 17.5 in the TBCL field (Vowel Location). This manipulation was
included in the simulation to create the different vowel “environments” necessary to examine
contextual variation.
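The field axis and gaussian task inputs just described can be sketched as follows (the discretization of the field and the variable names are assumptions, not the simulation code itself):

    import numpy as np

    x = np.linspace(0, 20, 81)   # assumed discretization of a field of length l = 20

    def gaussian_input(x, mean, sd, strength):
        return strength * np.exp(-0.5 * ((x - mean) / sd) ** 2)

    ttcl_preshape = gaussian_input(x, mean=10, sd=3, strength=0.7)  # Preshape SD varies 1-5
    tbcl_front = gaussian_input(x, mean=7, sd=3, strength=0.7)      # FRONT vowel condition
    tbcl_back = gaussian_input(x, mean=17.5, sd=3, strength=0.7)    # BACK vowel condition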
Specific input (S_spec(x,t) in Equation 4.2) was introduced to each field at the 200th time
step of each trial and lasted until the end of the trial. The synchronicity of the specific input to the TTCL and TBCL fields is conceptualized as reflecting the co-planning of the C and the V of a CV syllable. The weight of the specific input (w_spec in Equation 4.2) was set to 3.5 for the TTCL field and 2
for the TBCL field. As in Simulation 1, the specific input in this simulation represented an internal command to start planning the production of a consonant or vowel gesture. The specific
input to each field was defined as a gaussian function with the same specification for its mean
and standard deviation as the preshape used for that field on that trial.
The interaction term in the specific equation governing the evolution of planning field
activation in Simulation 2 can be decomposed into two terms, as shown in Equation 4.5 (using
the terminology in Roon & Gafos [2016]):
$$\text{interaction}(x,t) = \text{interaction}_{within} + \text{interaction}_{cross} \qquad (4.5)$$
Since there are multiple planning fields simultaneously active and coupled with each
other in this model, the full interaction term comprises both the effect of the interaction between
point x and other points within the same field on the activation of point x at time t and the impact
of the cross-field interaction function on point x at time t. Of course, each of these individual
components of the complete interaction term is only influencing the evolution of point x (or any
other point on the planning field) if there is some point in the same planning field that has a high
enough activation level to engage the within-field interaction threshold θ and/or a point on the
other planning field that has a high enough activation level to engage the cross-field interaction
threshold χ. The values used for variables governing within-field interaction were the same as in
Simulation 1 and were again held constant across the entire simulation: θ = 0.7, β = 1.5, wexcite =
0.45, winhibit = 0.1, σw = 1. The value of w_inhibit^cross in the equation governing cross-field
interaction was calculated for each trial with b = 7 or 17.5 (depending on the vowel location
condition for that trial) and l = 20. The cross-field interaction threshold χ was set to 0, meaning
that activation in the TBCL field only triggered cross-field inhibition effects once it got
sufficiently close to an activation level of zero (after starting at the negative resting level of
activation).
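Pulling these pieces together, one illustrative Euler step of the field dynamics, under the reading of Equation 4.1 as a standard dynamic field equation (τ·dA/dt = −A + h + input + interaction + noise); the discretization details are assumptions:

    import numpy as np

    def field_step(A, input_term, interaction_term, tau=150.0, h=-3.0,
                   noise_strength=6.0, dt=1.0, rng=None):
        rng = rng or np.random.default_rng()
        noise = noise_strength * rng.standard_normal(A.shape)
        dA = (-A + h + input_term + interaction_term + noise) / tau
        return A + dt * dA   # field activation at the next time step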
For each trial, since the aim of the simulation was to model the effect of TTCL Preshape
SD on contextual variation in target selection, the primary output of interest was the parameter
value selected as the production target in the TTCL field. The first location x on the TTCL field whose activation level evolved to cross a selection threshold (κ) was selected as the TTCL target for the consonant tongue tip gesture. The value of κ was set at 6.5 for the whole simulation. This
specific value was chosen because the stable peak for the TTCL field generated upon the
introduction of specific input tended to stabilize with a peak activation level between 6 and 7 on
all trials.
4.3.2.3. Results
To assess how coarticulatory effects interacted with Preshape SD in the simulation, a
two-way ANOVA was fit on the dependent variable Selected Target with Preshape SD (1, 2, 3, 4
and 5) and Vowel Location (FRONT vs. BACK)²⁹ as factors. A Brown-Forsythe test run on the model was significant, indicating that unequal variances were observed in the distribution of selected target values across runs of the model with different specifications for Preshape SD and Vowel Location (F* = 32.235, p < 0.0001).

²⁹ This definition of the levels of Vowel Location is purely conceptual, as the value for b in the FRONT vowel condition is likely too posterior to truly mimic the location of a front vowel in the planning field. This is reflected in the observation that the FRONT vowel condition elicits coarticulatory effects on the coronal consonants in the simulation, which we would not necessarily expect to see in actual articulation.

This finding of unequal variances was unsurprising
given the relationship between Preshape SD and variability of selected target values in
Simulation 1. A log-transformation of Selected Target was used as the dependent variable in the
model to mitigate the effect of unequal variances on the comparison of means. However, the raw
(un-transformed) values of this variable are used in all visual representations of the data.
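An illustration of this analysis strategy on synthetic stand-in data (not the original analysis code; the column names and the data-generating assumptions are hypothetical), using statsmodels:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.formula.api import ols

    rng = np.random.default_rng(1)
    df = pd.DataFrame({
        'preshape_sd': np.repeat([1, 2, 3, 4, 5], 200),
        'vowel': np.tile(np.repeat(['FRONT', 'BACK'], 100), 5),
    })
    # Toy targets: variability and coarticulatory shift both grow with preshape width
    df['target'] = (10 + 0.2 * df['preshape_sd'] * (df['vowel'] == 'BACK')
                    + rng.normal(0, 0.1 * df['preshape_sd']))
    df['log_target'] = np.log(df['target'])
    model = ols('log_target ~ C(preshape_sd) * C(vowel)', data=df).fit()
    print(sm.stats.anova_lm(model, typ=2))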
The results of the two-way ANOVA revealed a significant main effect of Preshape SD
(F[4,4990] = 23024, p < 0.0001) and Vowel Location (F[1, 4990] = 79969, p < 0.0001), as well
as a significant interaction between Preshape SD and Vowel Location (F[4, 4990] = 2351, p <
0.001). Although the results of a simulation with specific numerical specification of the FRONT
vowel location at TBCL = 7 and the BACK vowel location at TBCL = 17.5 are presented here, the
same patterns of statistical significance for both the main effects and the interaction term were
observed regardless of the precise numerical values used to specify these conditions. The
robustness of these results across different specifications of vowel location indicates that the
observed pattern of statistical significance reflects a general property of the model dynamics
instead of a coincidental effect of the choice of particular numerical values.
The results of Simulation 2 are shown in Figure 4.7. Tukey post-hoc tests revealed that
greater coarticulatory effects were consistently observed for larger values of Preshape SD (all p <
0.0001) and that the BACK condition elicited greater coarticulatory effects than the FRONT
condition for the Vowel Location manipulation (p < 0.0001). This pattern of results suggests not
only that the model was successful in generating the appearance of cross-field coarticulatory
influences (i.e., the location of suprathreshold activation in the TBCL field impacted the value of
the selected target in the TTCL field), but also that the model predicts that more variable speakers (larger Preshape SD) should exhibit a greater overall tendency towards coarticulatory effects on their
selection of consonantal production targets.
Figure 4.7. Distribution of values for selected targets across all five levels of Preshape SD in the
FRONT and BACK vowel conditions. Each graph presents the data from one level of Preshape SD,
with graphs arranged in descending order from narrowest to widest preshape. Histogram fill
color indicates vowel condition (light blue = FRONT, dark blue = BACK) and histogram outline
color indicates Preshape SD level.
Post-hoc Tukey tests also demonstrated that the significant interaction between Vowel
Location and Preshape SD was the result of a larger difference between the mean of the
distribution of selected target values across the two levels of Vowel Location (FRONT and BACK)
for larger values of Preshape SD (all p < 0.0001) (Figure 4.8). Although all pairwise comparisons
were significant in the post-hoc analysis, the magnitude of the difference in target values
between the FRONT and BACK vowel conditions was over three times as large for the widest Preshape condition (Preshape SD = 5: 1.21) as for the narrowest Preshape condition (Preshape SD = 1: 0.34). Taking into account the relationship between stochastic variability and Preshape
SD in Simulation 1, this interaction indicates that the model simulates the relationship between
stochastic variability and contextual variation observed in the XRMB data (as presented in Section 3.3.1.1.).
Figure 4.8. Mean Target Value (dark and light blue dots) across all five levels of Preshape SD in
the FRONT and BACK vowel conditions. Error bars indicate the 95% confidence interval for each
group mean. All pairwise comparisons across levels of Vowel Location and Preshape SD are
significant.
4.3.2.4. Discussion of Simulation 2
The results of Simulation 2 confirm that the same mechanism used to model individual
differences in stochastic variability in Simulation 1 could also generate a relationship between
stochastic and contextual variability. Specifically, a greater difference was observed between the
average parameter value selected as the production target across the different vowel conditions
for simulations where the region of preshaped activation in the TTCL field (corresponding to the
representation of the target space for a consonant produced with a tongue tip gesture) was wider.
Given the positive relationship between preshape width and stochastic variability observed in
Simulation 1, this suggests that representations of target space that induce greater stochastic
variability in the model also generate greater change in selected target values under the system of
cross-field inhibition implemented to mimic the influence of vowel context on coronal
consonants. This indicates that the model presented here is able to generate two of the main
patterns observed in the XRMB data, namely individual differences in stochastic variability and
the relationship between stochastic and contextual variability, by changing a single property of
the representation of gestural targets in the model.
The inclusion of both word-initial and word-final phonetic contexts, as well as pre- and
post-pausal contexts, in the data set used to calculate contextual variability for the XRMB data
makes it difficult to determine exactly how much of the observed contextual variability was due
to the influence of vowel context. The decision to include all syllabic or prosodic positions in the
same calculation of contextual variability was made due to the relatively small number of
contexts in the corpus that proved suitable for analysis. However, preliminary analyses
separating out these positional effects suggest that the same relationship is observed between
stochastic and contextual variability when the only difference between contexts is the vocalic
environment in which a target segment occurs (Appendix B). While the results of these analyses
should be interpreted with caution given their small sample size, they provide tentative support
for the proposed mechanism’s utility in accounting for the observed relationship between
stochastic and contextual variability.
The shaping of cross-field interaction included in this model diverges from previous work
in DFT incorporating the coupling of activation in multiple planning fields (e.g., Hock et al.,
2003; Roon, 2013; Roon & Gafos, 2016), as it introduces multiple sources of nonlinearity into
the interaction between planning fields. The precise implementation of cross-field interaction in
this model was designed to reflect the biomechanical properties of the vocal tract and kinematic
constraints that may influence the planning system (e.g., Browman & Goldstein, 1989; Fowler,
1980; Gick, Stavness, & Chiu, 2013; Ostry, Gribble, & Gracco, 1996). This approach provides
an embodied cognitive mechanism for incorporating coarticulatory influence in dynamic target
planning. However, future research examining the extent to which these patterns of cross-field coarticulatory influence can be produced without this location-dependent shaping of inhibition, for example in a fuller model explicitly incorporating the biomechanical constraints placed upon the movement of the plant itself, may be beneficial in evaluating if and how best to incorporate coarticulatory effects in the DFT-based model.
This approach to the modeling of coarticulation also differs from its typical
conceptualization in AP and the Task Dynamics model, where coarticulatory effects fall out from
the dynamics of articulation as opposed to the dynamics of planning (e.g., Browman &
Goldstein, 1986, 1989; Iskarous, McDonough, & Whalen, 2012; Fowler & Saltzman, 1993;
Saltzman & Munhall, 1989; Zsiga, 1995). In AP/TD, coarticulation arises when two (or more)
gestures using the same set of articulators exhibit temporal overlap and jointly influence the
shape of the vocal tract (e.g., Browman & Goldstein, 1989; Saltzman & Munhall, 1989; Fowler &
Saltzman, 1993). Specifically, two overlapping gestures defined with the same tract variable will
simultaneously influence the goal parameters for the constriction being produced (e.g., Saltzman
& Munhall, 1989). Crucially, this mechanism would not be predicted to produce the relationship
between stochastic and contextual variability observed here, as the spatial overlap between the
articulatory structures used to produce the modeled consonant and vowel gestures is incomplete
and, subsequently, minimal interference is expected between these coproduced gestures (e.g.,
Fowler & Saltzman, 1993). The proposed mechanism for incorporating coarticulation into the dynamics of planning therefore contrasts with the traditional conceptualization of blending not only in its focus on planning, instead of implementation, but also in that it affects gestures defined along different tract variables.
The proposition that coarticulatory effects can be implemented as part of the process of
target selection, instead of purely through the interaction of the physical structures of the vocal
tract, relies on some degree of synchronicity or overlap in the planning of gestures that also
temporally overall in their articulation. This overlap in target planning for multiple gestures (or,
at least, the simultaneous influence of multiple gestures on planning dynamics) has previously
been proposed for DFT-based models of phonological cognition in Tilsen (2019). Specifically,
Tilsen (2019) combines his Selection-coordination theory (Tilsen, 2016) with a DFT-inspired mechanism for selecting gestural target parameter values such that multiple gestures are simultaneously selected for production, meaning that they jointly influence the evolution of planning field activation. A mechanism of this type would enable the type of interaction modeled
here between multiple gestures during the planning of target parameter values.
4.3.3. Simulation 3: Production variability and perceptual sensitivity
The results of Simulations 1 and 2 indicate that the proposed mechanism for
incorporating individual differences in a DFT-based model, namely variation in the distribution
of planning field activation associated with a particular gesture, is able to reproduce results from
previous chapters that were interpreted as evidence for the encoding of variability in
phonological representation. The apparent viability of this as a method for encoding individual
differences in variability in cognitive representation leads to the question of what this would
predict for other speech behaviors that may make use of the same cognitive representations.
A number of prominent and closely related theories of cognitive action-perception
outside of language, such as common coding theory (e.g., Hommel, Müsseler, Aschersleben, &
Prinz, 2001; Prinz, 1997), ideomotor theory (e.g., Greenwald, 1970), and the ecological theory of
perception (e.g., E. Gibson, 1969, 1988; J. Gibson, 1966, 1977), rely on, or at the very least
incorporate, the assumption that there is an explicit connection between the production of actions
and the perception of these actions or events. Such a perspective is shared by various theories of
speech production and perception (e.g., Fowler, 1986; Fowler et al., 2003; Goldinger, 1998;
Johnson, 1997; Liberman & Mattingly, 1985; Pierrehumbert, 2001, 2003), which propose a
shared system of representation for speech perception and production for reasons including the
requirement of parity for successful communication (e.g., Fowler, 1986; Mattingly & Liberman,
1988) and the body of empirical evidence suggesting a direct connection between the behaviors
exhibited in production and perception. A review of this topic can be found in Casserly and
Pisoni (2010) and in Chapter 5 of this dissertation. This assumption is also reflected in the
original principles of DFT as presented in Erlhagen and Schöner (2002), which intrinsically lend
themselves to embodied approaches to cognition in which both the perception of states or objects
in the world and the execution of actions related to those states or objects are explained using the
same dynamical framework (e.g., Thelen, Schöner, Scheier, & Smith, 2001).
In a similar manner to how speakers differ in the habitual patterns they exhibit in speech
production, listeners are known to differ from one another in their perceptual behavior. These
individual differences are frequently documented in the same domains in which there is
extensive variation across speakers in production (Beddor, 2009; Kong & Edwards, 2016; Mann
& Repp, 1980; Yu, 2010; Yu & Lee, 2014). If the cognitive representations used to guide speech
production are the same as those accessed during perception, as has been suggested, we may
expect that any individual differences in the representational system, such as those proposed in
the modeling work presented here, should have consequences for both the production and the
perception of speech. As such, it should be possible to make concrete predictions regarding how
individuals will differ from one another in the perception of speech based on the specific model
encoding of individual differences in the cognitive representation of phonological units.
Simulation 3 uses the cognitive representations developed for Simulations 1 and 2 to
examine the model’s predictions for the relationship between individual differences in
production and cross-speaker behavioral variation in speech perception. The model is used to
simulate a subphonemic perceptual discrimination task. This task was selected for simulation as
recent research has suggested that individual differences in acoustic variability in production
may be related to individual differences in sensitivity to subphonemic variation in perception
(Franken et al., 2017; Perkell et al., 2008).³⁰ Specifically, experimental work has suggested that
speakers who are less variable than other speakers in their production of a phonological segment
exhibit greater sensitivity to phonetic detail in its realization (i.e., they are able to more
accurately discriminate more acoustically similar tokens). Simulation 3 was designed to test
whether manipulating the distribution of planning field activation associated with a particular
gesture would predict this same relationship between stochastic variability in production and
discriminatory ability in perception. This simulation was also used to predict how variation in the recoverability of articulatory variability from acoustics across segments and dimensions (as was observed in Chapter 2) may affect this relationship.

³⁰ See Chapter 5 for a more thorough discussion of this literature.
4.3.3.1. Simulation-specific model settings
The basic components of the model set-up for Simulation 3 were the same as those used
in Simulations 1 and 2. Namely, the same values were used for τ, h, and the noise term (τ = 150, h = -3, noise = 6) as in the earlier simulations. Additionally, the values used for the variables
governing within-field interaction were the same as those used in the previous simulations (θ =
0.7, β = 1.5, wexcite = 0.45, winhibit = 0.1, σw = 1). The simulation examined the interaction between
multiple specific inputs on a single planning field; as such, only one planning field (which can be
conceptualized as corresponding to a single tract variable) was created for the simulation, and no
cross-field interaction effects were modeled. 20,000 total trials were run for Simulation 3, with
500 trials included for each of forty groups representing all possible combinations of the three
variables manipulated in the simulation. Each trial lasted for 4,000 time steps.
The input term in the specific equation governing the evolution of planning field
activation in Simulation 3 can be decomposed into three terms, as shown in Equation 4.8:

$$input(x,t) = w_{task}\,S_{task}(x,t) + w_{spec}\,S_{spec1}(x,t) + w_{spec}\,S_{spec2}(x,t) \qquad (4.8)$$
The term S_task(x,t)
represents the task input to the planning field in the form of the preshape
corresponding to the possible target space for a specific gesture. The definition of the preshape
and the Preshape SD manipulation were the same as in Simulations 1 and 2. Specifically, the
mean of the preshape was always set at parameter value 10, and the width of the preshape was
manipulated across trials, with five levels included in the manipulation (these levels were the
same as in Simulations 1 and 2). Each level of Preshape SD was specified as the width of the
preshape input in a set of 500 trials. The input strength of the preshape (w_task) was set to 0.6 for
all trials (see Section 4.3.1.1. for an explanation of how this value was chosen).
As the simulation was intended to model the process of discriminating between two
perceptual stimuli, two specific inputs were introduced to the planning field in each trial. The
first, S_spec1(x,t), was introduced at the 200th time step of each trial and lasted until the 500th
time step. Although versions of the simulation were run with this first specific input introduced
at multiple different locations within the planning field, in the data analyzed here this input item
was always a gaussian function centered at parameter value 10.5 within the planning field.
The second specific input, S_spec2(x,t), was introduced at the 2000th time step of each trial and lasted until the 2300th
time step. The location of the second specific input
(Interstimulus Distance) varied across trials – it was centered around a parameter value -1.5, -1,
-0.5, or 0 away from the location of the first specific input (Figure 4.9). These levels of the
Interstimulus Distance manipulation were constrained so that they stayed within one parameter value
of the center of the preshape input to the field (at parameter value 10.5), reflecting what was
thought to be a reasonable extent of stochastic, subphonemic variation in production (based on
the range of selected target values observed in Simulation 1).
Figure 4.9. Visualization of the distance between the two specific inputs to the field (S_spec1 and S_spec2) when the first input is located at parameter value 10.5 and the second input is centered at
parameter value (a) 9 (-1.5 away), (b) 9.5 (-1 away), or (c) 10 (-0.5 away). The activation
distribution corresponding to the first input is shown in a blue-to-yellow gradient, while the
activation distribution corresponding to the second input is red in each figure.
The width of the activation spike introduced to the field by each specific input was
manipulated across trials to reflect differences in listeners’ ability to recover information about
variability in the underlying articulatory actions from the acoustic signal (Stimulus Mapping)
because the results of the articulatory-acoustic analysis in Chapter 2 introduce the possibility that
this may vary across segments and phonetic dimensions. In half of the trials, the width of both
specific inputs was set to 0.5 to index a more precise mapping of the acoustic properties of the perceived stimuli onto the set of tract variable parameters likely to have generated them (STRONG
MAPPING condition). In the other half of the trials, the width of both specific inputs was set to 1,
indexing a less precise mapping of the acoustic properties of the perceived stimuli onto tract
variable parameter values (WEAK MAPPING condition).
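The two stimulus manipulations just described can be sketched as follows (the field axis, function, and names are illustrative; the condition values follow the text):

    import numpy as np

    x = np.linspace(0, 20, 81)
    stim1_center = 10.5

    def gaussian_input(x, mean, sd, strength=3.5):
        return strength * np.exp(-0.5 * ((x - mean) / sd) ** 2)

    distances = (-1.5, -1.0, -0.5, 0.0)   # Interstimulus Distance levels
    widths = (0.5, 1.0)                   # STRONG vs. WEAK MAPPING input widths
    stimuli = {(d, w): (gaussian_input(x, stim1_center, w),
                        gaussian_input(x, stim1_center + d, w))
               for d in distances for w in widths}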
The weight of both specific inputs, w_spec, was set to 3.5 for all trials (see Section 4.3.1.1. for a reminder of how this value was selected). The period of the evolution of the field between the introduction of the first perceptual stimulus at the 200th time step and the introduction of the
second stimulus at the 2000th time step reflects the time course of the listener’s identification of a tract variable value corresponding to the perceived phonetic properties of the first stimulus item. After the introduction of the second stimulus at the 2000th time step, the evolution of the field
reflects the listener’s response to that stimulus item and their evaluation of its corresponding tract
variable value.
For each trial, the primary output of interest was the distance between the tract variable
parameter values selected in the perception of the two stimuli (Perceived Distance). This value
was calculated as the distance in planning field space between the tract variable parameter value
selected for the first stimulus, defined as the first location x in the planning field whose
activation level crossed the selection threshold (κ) (which always occurred before the
introduction of the second stimulus), and the value selected for the second stimulus, which was
selected as the tract variable parameter value with the highest activation level 500 time steps
after the introduction of the second specific input to the field³¹ (time step 2500). The value of κ
used in the selection of the target of the first perceived stimulus was set to 5.5 for all trials. This
specific value was chosen because the stable peak generated consequent to the introduction of
specific input tended to stabilize with a peak activation level between 5 and 6 on all trials.
³¹ The stable peak induced by the first specific input did not destabilize before the introduction of the second specific
input to the field. As a consequence, the maximum activation within the field was already higher than κ when the
second input was introduced, and the field state at a time point chosen to reflect the maximum influence of the second input on activation levels within the planning field was used as the “perceived value” of this input instead. A
mechanism for destabilizing an existing peak (e.g., Sandamirskaya & Schöner, 2010) will be introduced in future
work on this model.
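A sketch of the Perceived Distance computation as described above (the array layout and names are assumptions; the sketch presumes a stable peak forms before the second stimulus arrives):

    import numpy as np

    def perceived_distance(history, x, kappa=5.5, t2_onset=2000, lag=500):
        # history: (timesteps, field_size) activation array for one trial
        above = np.argwhere(history[:t2_onset] > kappa)
        v1 = x[above[0][1]]                        # first crossing of kappa (stimulus 1)
        v2 = x[np.argmax(history[t2_onset + lag])] # peak location 500 steps after input 2
        return abs(v2 - v1)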
4.3.3.2. Results
A three-way ANOVA was fit on the dependent variable Perceived Distance with
Preshape SD (1, 2, 3, 4 and 5), Stimulus Mapping (STRONG MAPPING vs. WEAK MAPPING) and
Interstimulus Distance (-1.5, -1, -0.5, and 0) as factors to assess how differences in the
representation of and articulatory-acoustic mapping for phonological units interacted to determine the perceived distance between perceptual stimuli varying in physical distance.
Brown-Forsythe’s test was significant for this model, indicating that unequal variances were
observed across runs of the model with different specifications for the three factors (F = 25.74, p
< 0.0001). As this violated the assumption of equal variances, Perceived Difference was log-
transformed to reduce heteroscedasticity before re-fitting the three-way ANOVA. Raw values of
Perceived Distance are used in all graphs incorporating data from this simulation.
Significant main effects of Preshape SD (F[4,19960] = 100.87, p < 0.0001), Stimulus
Mapping (F[1,19960] = 187.85, p < 0.0001), and Interstimulus Distance (F[3,19960] = 5153.31,
p < 0.0001) were observed in the results of the three-way ANOVA. Significant two-way
interactions were also found for Preshape SD*Stimulus Mapping (F[4,19960] = 27.13, p < 0.0001), Preshape SD*Interstimulus Distance (F[12,19960] = 2.09, p = 0.01), and Stimulus Mapping*Interstimulus Distance (F[3,19960] = 45.04, p < 0.0001). The three-way interaction
between Preshape SD, Stimulus Mapping and Interstimulus Distance did not reach statistical
significance (F[12,19960] = 1.12, p = 0.34). Similar patterns of statistical significance were
generally observed when ANOVA models were fit to simulations with different numerical values
for the two levels of Stimulus Mapping and the four levels of Interstimulus Distance, although
some slight differences were observed based on the numerical proximity between the selected
values of manipulated factors (e.g., statistically significant main effects of or interaction effects
involving Stimulus Mapping were not observed when the numerical distance between the levels
of that factor was sufficiently small). The robustness of general trends in the results across different model specifications suggests that the results reflect a general property of the model dynamics that is not dependent upon the selection of specific numerical values for certain parameters.
Figure 4.10 illustrates the two-way interaction between Preshape SD and Stimulus Mapping (values collapsed across all levels of Interstimulus Distance), as well as the main effect
of Preshape SD. Tukey HSD post-hoc tests indicated that Perceived Distance was larger for
smaller values of Preshape SD (all p < 0.01). Post-hoc tests also indicated that Perceived
Distance significantly differed across almost all levels of Preshape SD for the STRONG MAPPING condition (p < 0.001 for all comparisons except the comparison between levels 4 and 5, which was not significant [p = 0.07]), but that significant differences in Perceived Distance were only
observed between the most extreme levels of Preshape SD for the WEAK MAPPING condition (5
vs. 2: p < 0.001; 5 vs. 1 and 4 vs. 1: p < 0.0001; all other p > 0.05). This suggests that the extent
to which differences in Preshape SD generate differences in Perceived Distance (and, by
extension, the extent to which speakers differing in their production variability exhibit different
perceptual behavior) depends on the precision of the mapping of perceptual input. If the
perceptual input excites a relatively small area of the planning field (i.e., if its localization is
more precise), less variable speakers appear to perceive a larger phonetic difference between
sequential inputs than more variable speakers. On the other hand, if the perceptual input is less
precisely localized within the planning field, then speakers differing in production variability
will exhibit more similar behavior in perception.
Figure 4.10. Mean Perceived Distance (dark and light green dots) across all five levels of
Preshape SD in the STRONG MAPPING and WEAK MAPPING conditions. Error bars indicate the
95% confidence interval for each group mean.
Figure 4.11 presents the two-way interaction between Stimulus Mapping and
Interstimulus Distance (values collapsed across all levels of Preshape SD). The results of post-
hoc Tukey tests indicated that the significant interaction between these factors reflects a difference in how Perceived Distance varied across the two levels of Stimulus Mapping for stimuli at different distances. Perceived Distance was
significantly smaller in the WEAK MAPPING condition than the STRONG MAPPING condition when
the two stimuli were not identical (Interstimulus Distance = -1.5, -1, or -0.5), with no significant
difference observed between the WEAK and STRONG MAPPING conditions when the two stimuli
were identical (Interstimulus Distance = 0) (p = 0.11).
Figure 4.11. Mean Perceived Distance (dark and light green dots) across all five levels of
Interstimulus Distance in the STRONG MAPPING and WEAK MAPPING conditions. Error bars
indicate the 95% confidence interval for each group mean.
The final two-way interaction between Preshape SD and Interstimulus Distance also
reflects a contrast between the condition where the two stimuli were acoustically identical
(Interstimulus Distance = 0) and all other conditions. While Perceived Distance differed slightly
across levels of Preshape SD whenever the two stimuli differed, with higher Perceived Distance
values observed for values of Preshape SD corresponding to narrower preshapes, no such
difference was observed when there was no difference between the stimuli. This, like the result
of the Stimulus Mapping*Interstimulus Distance interaction, reflects the fact that Perceived
Distance was generally equivalent across groups when Interstimulus Distance was equal to zero
(Figure 4.12), which logically follows from the lack of a difference to perceive in this condition.
Figure 4.12. Mean Perceived Distance (dark and light green dots) across all five levels of
Preshape SD in the STRONG MAPPING and WEAK MAPPING conditions. Each graph shows the
results for a different level of Interstimulus Distance. Error bars indicate the 95% confidence
interval for each group mean.
4.3.3.3. Discussion of Simulation 3
The results of Simulation 3 indicate that manipulating the distribution of planning field
activation associated with a particular gesture predicts a relationship between stochastic
variability in production and perceptual sensitivity to subphonemic variation that is modulated by
listeners’ mapping of acoustic input to tract variable values, the acoustic distance between the
two stimuli, and the positioning of the stimuli relative to the individual’s representation of a
gesture’s target space. As the joint influence of these factors on perception has not been
previously explored in the literature, these findings can generate testable hypotheses regarding
how individual differences in stochastic variability may relate to individual differences in speech
perception, and how this relationship may be modulated by the characteristics of the stimuli
being perceived. Additionally, considered in conjunction with published empirical findings, the results of this simulation more generally indicate the capacity of this model to account for the relationship between production and perception.
The results for trials in the STRONG MAPPING condition largely agree with findings from
empirical work examining the relationship between individual differences in acoustic variability
and perceptual sensitivity to variation in the production of a phonological segment (e.g., Franken
et al., 2017; Perkell et al., 2008). The STRONG MAPPING condition corresponds to the perception
of stimuli where the listener has a fairly good ability to recover articulatory variability from the
acoustic signal. Previous work on articulatory-acoustic relations suggests that variability in articulation strongly correlates with variability in the acoustics of vocalic segments (e.g., Whalen
et al., 2018). The STRONG MAPPING condition in the simulation is likely to provide a more
accurate model of the conditions in past experiments that have only examined vowels. Across
trials in the STRONG MAPPING condition, planning fields preshaped with narrower activation
distributions generated larger differences in the tract variable parameter values selected for the
two stimuli perceived in a single trial than planning fields with wider preshaped activation. As
narrower preshapes were observed to produce less variability in target selection in Simulation 1,
this result for Simulation 3 indicates that the model would predict that less variable speakers
generally perceive a larger difference than more variable speakers between two tokens of a
phonological segment. This same relationship has been observed in previous work, reinforcing
the ability of this model to account for a suite of phenomena observed in the relationship
between speech perception and production.
4.4. General Discussion
The results of the model simulations presented in this chapter demonstrate that the
empirical findings from previous chapters in the dissertation can be accounted for by
incorporating individual differences in the representation of gestural targets in a DFT-based
model of phonological cognition. Simulation 1 confirmed that the pattern of individual
differences in speech production variability observed in Chapter 2 can be generated by varying
the target space for a gesture, defined as the distinct distribution of activation in a dynamic
planning field associated with a particular gesture. Specifically, increasing the range of values
within the planning field that were preshaped to reflect the target region associated with a gesture
in memory led to greater stochastic variability in dynamic target selection. Building on this
observation, Simulation 2 verified that this same manipulation of gestural target space could also
generate the relationship between stochastic and contextual variability observed in Chapter 3.
Crucially, these results were generated through a mechanism for incorporating individual
differences in the representation of phonological units that operates at the level of the specification of targets for an individual articulatory gesture. This is critical given the findings
from Chapter 3 that suggest the control of variability operates on the level of the individual
phonological unit, and more specifically a subsegmental level of phonological representation,
and that variability is better represented in articulation than in acoustics.
This same manipulation of gestural target space was also shown to make predictions regarding the relationship among stochastic variability in production, perceptual sensitivity to subphonemic variation, the strength of the articulatory-acoustic mapping in different segments or gestures, and the acoustic distance between the two stimuli. As part of this, the model was able to
mirror the appearance and direction of a relationship between acoustic variability and perceptual
sensitivity that has been observed for vowels in the existing literature (Franken et al., 2017;
Perkell et al., 2008). Although the predictions made by the model do not directly map onto these
previous findings, partially because this existing research has focused on the overall variability
observed in the acoustic realization of a phonological segment across multiple segmental
environments, the observation of this agreement with previous work still serves to increase
confidence in the ability of the model to account for patterns and relationships in speech
behavior beyond the patterns in interspeaker articulatory variability that it was designed to
directly test.
The simulation results, particularly for the third simulation, point to the potential
generalizability of this model to speech behavior more broadly, and specifically the proposed
incorporation of individual differences in the representation of phonological units in this model.
The same manipulation of preshaped activation can generate not only the observed empirical
patterns of individual differences but also a relationship between individual differences in
acoustic variability and sensitivity to subphonemic variability reported elsewhere in the literature
(Franken et al., 2017; Perkell et al., 2008). This provides a strong basis for believing that such a
mechanism may be able to account for additional phenomena related to individual differences in
speech production and perception. Recent work suggesting that both production accuracy in L2
learners (Huffman & Schuhmann, 2020) and individual propensity towards convergence in
dyadic interaction (Lee, Goldstein, Parrell, & Byrd, 2021) may reflect individual differences in
variability similarly seems to suggest a general encoding of individual differences in the
cognitive systems governing speech behavior, which likely extends to linguistic knowledge more
generally. While the ability of the specific proposal presented here for the incorporation of
individual differences in phonological cognition to account for these specific phenomena
involving individual phonetic variability is unknown at present, the investigation of how these
and other empirical findings may be explainable through a common cognitive mechanism will be
an important future direction for the modeling work started here.
4.4.1. Preshaped activation distributions as the locus of individual differences
As already touched on in Section 4.3.1.3., the choice to incorporate individual differences
in phonetic variability in the width of the planning field interval exhibiting preshaped activation
was made largely because the results from the analyses of XRMB data suggested that these
differences between speakers were encoded at the level of the (subsegmental) phonological unit.
Further, an additional advantage to positing that this component of gestural representation differs
across speakers is that it provides a parsimonious account for the observation of individual
differences in phonetic variability. This emerges because, in DFT, the distribution of planning field activation stored for a particular phonological unit in memory reflects the history of the individual’s experience with tokens of this unit.
In DFT and derivative models, the distribution of activation within a planning field is
modified slightly every time a stable peak is formed within the field (e.g., Gafos & Kirov, 2009;
Johnson, Spencer, & Schöner, 2008; Simmering, Spencer, & Schutte, 2008; Spencer & Schöner,
2003). As such, the stored representation of the planning space for a phonological unit is
continually updated to reflect the phonetic properties of tokens containing that unit as
experienced by the individual, tokens that could either be produced by others or self-produced
(e.g., Gafos & Kirov, 2009). As self-speech is expected to make up a large proportion of
linguistic experiences in the mature speaker’s history of exposure to tokens of a phonological
unit, a speaker who is more variable in their production of a particular phonological unit may be
expected to demonstrate preshaping over a wider interval of planning field than a less variable
speaker, as they would more consistently be exposed to greater variability in their sensory input
for the phonological unit.
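As a toy illustration of how such experience-driven updating could widen a preshape over time (the update rule and learning rate are assumptions for exposition, not the dissertation's implementation):

    import numpy as np

    def update_preshape(preshape, peak_activation, rate=0.01):
        # Nudge the stored preshape toward the normalized pattern of the most
        # recent suprathreshold peak; repeated exposure to variable (self-)productions
        # therefore spreads activation over a wider interval of the field.
        trace = np.clip(peak_activation, 0.0, None)
        trace = trace / (trace.max() + 1e-9)
        return (1 - rate) * preshape + rate * trace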
The observation that the width of the preshape impacts the variability of selected target
values presents a satisfyingly simple mechanism to serve as a potential major locus of individual differences in variability, due in part to this presumed influence of self-speech on the maintenance of preshape width. However, this account does not explain why some speakers have
wider preshapes and are therefore more variable than others to begin with. While individual
differences in preshape width are potentially self-perpetuating, they are not self-generating –
other factors influencing the development of these representations (which may still play an active
role in their maintenance in the mature speaker) must bear the initial responsibility for causing
some speakers to associate a wider range of values in a planning field with a particular
phonological unit.
Some of these mechanisms may originate outside of the system of phonological cognition
itself. The various factors known to correlate with individual differences in speech production,
like differences in vocal tract morphology, sensory acuity, and certain cognitive traits, may
influence how an individual interprets the sensory input they receive. For example, a speaker
whose palate shape introduces greater nonlinearity in the articulatory-acoustic mapping for their
own vocal tract may have a much narrower activation spike introduced to their TTCL planning
field upon hearing tokens of certain phonological segments than a speaker with greater linearity in their articulatory-acoustic mapping. Similarly, differences in the type and frequency of tokens
that individual speakers are exposed to over time may impact the properties of the preshaped
activation distributions that they develop (e.g., Vaughn, Baese-Berk, & Idemaru, 2019), in
addition to the precise state of those activation distributions at a given point in time (e.g., Sancier
& Fowler, 1997; Tobin, Nam, & Fowler, 2017).
Another possibility is that certain general attributes of the cognitive system may impact
the manner in which representations develop for different speakers. This could take the form of
variation in the noisiness of field evolution across speakers or across different planning fields in
the speaker’s cognitive system, as mentioned in Section 4.3.1.3. Additional mechanisms not
included in the model presented here, such as leaky gating (discussed in Tilsen [2019]), could
introduce further avenues for the emergence of individual differences in the representation of
gestural target space. The systematicity of individual differences in phonetic variability at the
level of the phonological unit suggests that whatever factors influence the development of
preshaped activation distributions for the individual speaker must interact in a fairly complex
way in order to generate the unpredictable and idiosyncratic differences in variability observed
across speakers and across different phonological units.
4.4.2. Comparison of the proposed model with other approaches to encoding variability in
phonological representation
Phonological models within the DFT framework have an intrinsic capability to generate
stochastic variability and contextual variation due to the nature of the model dynamics
(elaborated on in Section 4.2.1; e.g., Erlhagen & Schöner, 2002; Gafos & Kirov, 2009; Tilsen,
2019). However, DFT-based models of speech target representation are far from the only
proposed models of phonological representation and speech planning to incorporate variation
into cognitive phonological representation. The reason that this framework was selected for the
modeling here instead of a competing model is that the specific formulation of the target space
and the relationship between dynamic and gestural targets in this model gives it a unique ability
to account for individual differences in both stochastic and contextual variability and the
relationship between them. This advantage becomes particularly apparent in a comparison of the
DFT approach with other prominent models that incorporate variation in the cognitive
representation of a phonological unit. Three of these, the window model (Keating, 1990, 1996),
the DIVA model (Guenther, 1994, 1995, 2016; Guenther et al., 2006; Guenther et al., 1998;
Tourville et al., 2011), and exemplar models (e.g., Goldinger, 1996; Johnson, 1997; Lacerda,
1995; Pierrehumbert, 2000, 2002, 2016; Wedel, 2006) are considered here.
In both Keating’s window model of coarticulation (Keating, 1990, 1996) and the DIVA
model (Guenther, 1994, 1995, 2016; Guenther et al., 2006; Guenther et al., 1998; Tourville et al.,
2011), units of phonological contrast are conceptualized as regions of acceptable target values
within phonetic space (this space being explicitly acoustic/orosensory in the DIVA model, and
either acoustic or articulatory in the window model).[32] As such, the target of a phonological unit constitutes any of a number of articulatory positions or acoustic values falling within a target region rather than a static, canonical target position or value. These models differ slightly from one another in the precise manner in which these regions are specified, with regions proposed as an undifferentiated range of acceptable values for an articulator in the window model (Keating, 1990) and as convex regions defined in high-dimensional phonetic space in the DIVA model (Guenther, 1995, 2016). However, the incorporation of variability along the represented dimension(s) proceeds similarly in the two models; the range of variability permitted (or invariance required) in the production of a phonological unit along specific phonetic dimensions is directly reflected in the size of the region along the axis defined by that dimension (Keating, 1990, 1996; Guenther, 1995, 2016). Work using the DIVA model as its theoretical basis has gone on to explicitly propose that the dimensions of a goal region can differ across speakers, with these differences hypothesized to reflect individual differences in auditory and somatosensory acuity across speakers (Villacorta, Perkell, & Guenther, 2007; Ghosh et al., 2010).

[32] As Guenther (2016) notes, the conceptualization of target regions in the DIVA model “constitutes a generalization of Keating’s (1990) window model of coarticulation” (p. 133).
In both models the extent of the contextual variation observed in the realization of a
phonological unit along a particular phonetic dimension reflects the size of the goal region for
that unit along that dimension in phonetic space. In the window model, articulation is planned as
an interpolated path constructed through the sequence of windows comprising the feature values
for the segments in an utterance (Keating, 1990). Contextual variation arises principally from the
interaction between the articulatory constraints imposed by adjacent windows on articulatory
trajectories; greater contextual variation is observed along a dimension when it is specified with
a wider window for a particular segment because it will accommodate a larger number of paths
interpolating between sequential windows (Keating, 1990). A similar mechanism is in play in the
DIVA model. As Guenther (2016) notes, “no movements are commanded for positions anywhere
within the target range” for a particular phonological unit (p. 134); movement to a new goal state
is considered complete as soon as the biomechanical plant reaches any part of the new target
region (Guenther, 1995, 2016). Carryover coarticulatory effects and some examples of
anticipatory coarticulation consequently occur in this model because the initial state of the
system at the beginning of a movement will determine which point of a new goal target region is
closest and will therefore be reached first. Phonological units with larger regions will exhibit
greater contextual variation due to coarticulation as the closest point of the new target region will
be closer in phonetic space to the initial state of the system than the closest point of a smaller
region would be. Contextual variation due to paralinguistic variation similarly arises with
changes in the size of convex region targets under different speech rate and clarity conditions
(Guenther, 1995, 2016).
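The mechanism the two models share can be illustrated with a minimal one-dimensional sketch: if movement terminates upon reaching any point of the goal region, the endpoint is simply the region's nearest point to the initial state, so wider regions leave endpoints closer to that state (greater carryover coarticulation). The values below are illustrative and are not drawn from either model's published implementations.

import numpy as np

def movement_endpoint(start, lo, hi):
    # Movement is complete on reaching any point of the target region [lo, hi],
    # so the endpoint is the point of the region nearest the initial state.
    return float(np.clip(start, lo, hi))

start = 0.9                                   # state left by the previous segment
print(movement_endpoint(start, 0.45, 0.55))   # narrow region -> endpoint 0.55
print(movement_endpoint(start, 0.30, 0.70))   # wide region -> endpoint 0.70, closer to start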
Although the variable target space in the window and DIVA models seems to provide a
reasonable avenue for stochastic variability to fall out from cognitive representation, and there
seems to be an assumption in both models that window/region size will relate to the amount of
stochastic variability observed in production (e.g., Keating, 1990, p. 11; Nieto-Castañon, Guenther, Perkell, & Curtin, 2005), the precise mechanism by which stochastic variability arises
is not specified in work on either model. Lacking this elaboration, it is not clear that any
relationship can be motivated between stochastic and contextual variability in either of these models, a relationship that is necessary for a complete account of the patterns of variability observed in this dissertation. This differs from DFT-inspired models, in which the dynamic target selection process itself is responsible for some proportion of the variation
observed in the production of a gesture through the inclusion of the noise, input, and interaction
terms in the equation governing planning field activation.
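For concreteness, a minimal sketch of the kind of field equation at issue is given below, with the noise, input, preshape, and interaction terms written out explicitly; the kernel, nonlinearity, and parameter values are illustrative assumptions, not those of the simulations reported earlier in this chapter.

import numpy as np

x = np.linspace(-1, 1, 201)
dx = x[1] - x[0]
h, tau, dt, noise_sd = -5.0, 20.0, 1.0, 0.2

def f(u):
    # sigmoidal output nonlinearity
    return 1.0 / (1.0 + np.exp(-4.0 * u))

# interaction kernel: local excitation, global inhibition
diff = x[:, None] - x[None, :]
k = 15.0 * np.exp(-0.5 * (diff / 0.1) ** 2) - 5.0

preshape = 3.0 * np.exp(-0.5 * (x / 0.3) ** 2)        # memory-based preshape
inp = 4.0 * np.exp(-0.5 * ((x - 0.1) / 0.1) ** 2)     # task input

rng = np.random.default_rng(1)
u = h + preshape
for _ in range(300):
    interaction = (k @ f(u)) * dx
    du = (-u + h + preshape + inp + interaction) / tau    # deterministic terms
    u = u + dt * du + np.sqrt(dt) * noise_sd * rng.standard_normal(x.size)

target = x[np.argmax(u)]   # the stochastically selected gestural target

Because the noise term perturbs field evolution on every step, repeated runs of this selection process yield a distribution of targets whose spread depends jointly on the preshape width and the input strength.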
An additional difference between the DFT-based model presented here and the DIVA
model relates to the dimensions along which targets are defined. In the DIVA model, the primary
phonemic targets of speech are characterized in terms of acoustic dimensions (e.g., Guenther,
2016; Guenther et al., 1998; Nieto-Castañon et al., 2005). As discussed in the introduction to this
chapter, the patterning of individual differences in articulatory variability seems to directly
reflect individual differences along dimensions closely related to contrastive phonological goals,
rather than acoustic variability. This finding was interpreted as evidence that variability is better
represented in articulation than in acoustics, and subsequently that the incorporation of
variability in cognitive representation is better accomplished by a model, like the DFT-based
model presented here, in which the units of representation are primarily articulatory in nature.
Another class of models of cognitive representation that explicitly incorporates
variability is exemplar models. Unlike the window and DIVA models, the stochastic selection of
target values from within a range of possible targets serves as a locus of variability in exemplar
models of speech (e.g., Goldinger, 1996; Johnson, 1997; Lacerda, 1995; Pierrehumbert, 2000,
2002, 2016; Wedel, 2006), similar to the approach in the proposed DFT-based model. In
exemplar models, the representation of each meaningful linguistic unit consists of a cloud of remembered (i.e., encoded or represented) instances of tokens that the individual has experienced. Each token encountered by a listener is classified according to its similarity to the existing exemplar clouds for different categories[33] and is assigned membership in the most similar category (e.g., Lacerda, 1995; Pierrehumbert, 2003). A detailed perceptual trace of that
token is subsequently stored in the listener’s memory, updating their mental representation of the
category with which it was associated. As such, the distribution of exemplars stored as members
of a category contains the full range of variability the individual has historically experienced in
the perception of tokens of that category. The probability distribution associated with the possible production values for a category is then represented by the distribution of stored exemplars (in particular, by the density of stored exemplars found at any specific location in a map of phonetic space).

[33] These categories comprise both linguistic categories (corresponding to words, syllables, positional variants of phonological segments, etc.) and social categories containing detailed socioindexical information about stored exemplars (e.g., Foulkes & Docherty, 2006; Johnson, 2006; Pierrehumbert, 2006).
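A minimal sketch of this classify-store-sample loop in a single phonetic dimension is given below; the similarity function, its width, and the seed exemplars are expository assumptions.

import numpy as np

rng = np.random.default_rng(2)
clouds = {"s": [6600.0, 6700.0], "sh": [4300.0, 4400.0]}   # stored exemplars (e.g., M1 in Hz)

def similarity(token, cloud, w=300.0):
    # summed Gaussian similarity of a token to a category's stored exemplars
    return np.exp(-0.5 * ((token - np.asarray(cloud)) / w) ** 2).sum()

def perceive(token):
    # classify by the most similar cloud, then store a trace of the token there
    label = max(clouds, key=lambda c: similarity(token, clouds[c]))
    clouds[label].append(token)
    return label

def produce(label):
    # sample a production target from the stored distribution for the category
    return rng.choice(clouds[label])

perceive(6550.0)            # classified as /s/; its trace joins the /s/ cloud
print(produce("s"))

Even in this toy version, the distribution of production targets for a category is simply the distribution of its stored exemplars, which is the property at issue in the criticisms discussed below.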
While exemplar models have become popular as a mechanism for explaining how factors
like lexical frequency and socioindexical information can affect the production and perception of
phonetic variants, they are not without their drawbacks. A common criticism leveled against
exemplar models is that their plausibility and potential functional efficiency are complicated by
the enormous demands the storage and processing of the massive number of memory traces
associated with each linguistic category would impose on the neurobiological systems
underlying speech behaviors (Baayen, Hendrix, & Ramscar, 2013; Hendrix, Bolger, & Baayen,
2017). Additional critiques of exemplar models as models of speech production specifically have
highlighted the difficulty current implementations have in accounting for the ways in which the
parameters of a speech task or the information structure of an utterance can affect the detailed
phonetic realization of a particular word or segment (Ernestus, 2014; Fink & Goldrick, 2015).
In considering the patterns observed in the XRMB data, a major drawback of exemplar
models is the lack of necessary abstract invariance in the representation of a phonological unit
across contexts. Although recent work on exemplar models for speech production and perception
admits the necessity of some degree of phonological abstraction in the exemplar space (e.g.,
Ernestus, 2014; McQueen, Cutler, & Norris, 2006; Pierrehumbert, 2002, 2006, 2016), it is not
clear that the same broad representation of possible target values for a segmental or
subsegmental phonological element is accessed each time an utterance involving that element is
planned. Without this type of contextual invariance in the definition of the target space for a
phonological unit, it seems difficult to navigate how the many overlapping exemplar clouds
containing tokens relevant to the definition of the full target space for a particular phonological
gesture or segment may interact to generate the relationship between stochastic and contextual
variability that we have observed and successfully modeled using the DFT-based model.
Likewise, the ways in which the affiliation of different exemplars with different exemplar clouds could modulate or limit the selection of a production target from the possible exemplar space are opaque. The extensional definition of categories in exemplar models may also pose an obstacle
to generating this relationship between stochastic and contextual variability, although, barring an
attempt at a mathematical model of this relationship in an exemplar-based model, the extent to
which this truly poses an obstacle is unclear.
The tendency towards holistic storage of structural units in exemplar models may pose an
additional, separate obstacle to using exemplar models to account for the patterns observed in the
XRMB data. While the results from Chapter 3 clearly suggest individual differences arise at a
subsegmental level, further investigation is necessary to determine whether the appropriate level
of representation at which these differences should be incorporated is gestural or subgestural
(i.e., at the level of the individual tract variables defining a gesture). An observation that individual differences appear at the level of the gesture, instead of the tract variable, would lend itself more easily to exemplar modeling.
4.5. Conclusion
This chapter presented an extension of existing Dynamic Field Theory models of phonological cognition to account for the patterns of individual differences in phonetic variability observed in the previous empirical chapters of the dissertation. A series of
simulations were presented that serve to evaluate the ability of the model to account for these
empirical patterns by leveraging the incorporation of individual differences into the properties of
dynamical representations of gestural targets that explicitly encode variability. The results of
these simulations indicate that the individual differences in stochastic variability and the
relationship between stochastic and contextual variability observed in the XRMB data were
accounted for in this model. The model was also shown to generate predictions about the
relationship between individual differences in production variability and cross-speaker
behavioral variation in speech perception that align with existing experimental findings in the
literature.
On the whole, the results of this chapter suggest that the proposed model for encoding
individual differences in phonological representation works as a formal mechanism for unifying
the generation of individual differences in multiple types of phonetic variation, namely stochastic
variability and contextual variability, and for relating individual differences in speech production
to individual differences in speech perception. Predictions about the relationship between speech
production and perception generated in the evaluation of the model will next be tested in
additional experiments to further evaluate its ability to account for individual differences in
speech behavior.
5. The relationship between variability in speech production and perceptual
sensitivity to subphonemic variation
5.1. Introduction
In a similar manner to how speakers exhibit idiosyncratic tendencies in speech
production, listeners are known to differ from one another in their perceptual behavior. These
individual differences in speech perception frequently arise in the same domains in which
extensive variation is observed across speakers in production (Beddor, 2009; Kong & Edwards,
2016; Mann & Repp, 1980; Yu, 2010; Yu & Lee, 2014). For example, a number of studies have
found a relationship between the coarticulatory patterns exhibited by a specific speaker in
production and the extent to which that speaker ‘compensates’ for coarticulatory effects in
perception (e.g., Beddor, Coetzee, Styler, McGowan, & Boland, 2018; Harrington, Kleber, &
Reubold, 2008; Kleber, Harrington, & Reubold, 2012; Yu, 2019; Zellou, 2017), with individuals
who exhibit greater coarticulation in production more apt to attribute the acoustic consequences
of coarticulation to their source segment. Similarly, research suggests that interspeaker variation
in the strategies chosen to contrast phonological segments in production mirrors individuals’
reliance on different acoustic dimensions in the perception of these contrasts (e.g., Coetzee,
Beddor, Shedden, Styler, & Wissing, 2018; Schertz, Cho, Lotto, & Warner, 2015; Shultz,
Francis, & Llanos, 2012), as does their preferred physical realization of the articulatory and
acoustic properties of these segments (e.g., Newman, 2003).
The apparent ties between individual behavior in speech production and perception
constitute only one of a number of behavioral phenomena suggesting speech production and
speech perception are linked. Additional empirical evidence for an explicit connection between
these two domains comes from observations of speakers’ production shifting to reflect altered
auditory feedback (e.g., Houde & Jordan, 2002; Niziolek & Guenther, 2013) or the phonetic
properties of others’ speech (e.g., Babel, 2012; Fowler, 2003; Honorof, Weihing, & Nielsen,
2011; Pardo, Jordan, Mallari, Scanlon, & Lewandowski, 2013) and from listeners’ use of visual
and haptic information about articulatory actions to facilitate the perception or perceptual
classification of a speech sound (e.g., Fowler & Dekle, 1991; McGuire & Babel, 2012; McGurk
& MacDonald, 1976; Sumby & Pollack, 1954; Traunmüller & Öhrström, 2007). These and other
findings have contributed to the development of theories positing that a shared system of
cognitive and representational structures (at least in part) underlies behavior in these domains
(e.g., Fowler, 1986; Fowler et al., 2003; Goldinger, 1998; Johnson, 1997; Liberman & Mattingly,
1985; Pierrehumbert, 2001, 2003), a stance shared by many domain-general theories of the
cognition of action and perception (e.g., E. Gibson, 1969, 1991; J. Gibson, 1966, 1977;
Greenwald, 1970; Hommel et al., 2001; Prinz, 1997). These theories positing shared coding of
production and perception domains are lent credence by neurophysiological research pointing to
the recruitment of the same neurons or neural populations in both the production and perception
of certain motor tasks (e.g., Fadiga, Fogassi, Pavesi, & Rizzolatti, 1995; Pulvermüller et al.,
2006; Rizzolatti & Arbib, 1998).
This proposed relationship between the cognitive systems governing speech production
and perception entails some degree of shared substance, and very likely isomorphy, between the
representations of phonological units active during each. If the cognitive representations used to
guide speech production are the same as those accessed during perception, we may expect that
any individual differences in the representational system should have consequences for both the
production and the perception of speech. Previous research seeking to examine one facet of this
question, namely how individual differences in the distribution of phonological categories within
phonetic space affect perception, have generally found evidence of relationships that support this
principle of shared representation (e.g., Brunner et al., 2011; Ghosh et al., 2010; Perkell, Guenther et al., 2004; Perkell, Matthies et al., 2004). However, the fairly limited scope of previous research examining how individual differences in perception relate to interspeaker differences in variability leaves open numerous questions regarding the nature of this relationship and how it may be affected by factors like the listener's ability to access information from the acoustic signal about the underlying articulation of a perceived token. The
experiment presented in this chapter investigates whether a relationship is observed between
individual differences in speech production variability, which earlier chapters of this dissertation
have proposed to reflect interspeaker differences in the representation of phonological units, and
individual differences in speech perception, specifically the perception of subphonemic
variability.
5.1.1. Previous research on individual differences in variability and speech perception
The proposal that individual differences in production variability may be related to
individual differences in speech perception has primarily been investigated in previous research
examining the production and perception of vowels. The findings of this research support a
relationship between variability and individual differences in multiple aspects of speech
perception, including aspects of individuals’ perception of phonological contrasts and their
sensitivity to acoustic variability. In a study of American English speakers, Chao, Ochoa, &
Daliri (2019) found that the location of each speaker’s perceptual boundary between /ε/ and /æ/
was strongly correlated with the relative variability (in F1-F2 space) of each vowel category
within their own speech. Specifically, speakers who produced /ε/ less variably than /æ/ exhibited
a categorical perception boundary closer to the center of their /ε/ distribution than speakers who
exhibited the opposite pattern in the relative variability of the two vowels. This result can be
taken as support for the proposal that the same organization of sound categories in phonetic
space is active in production and perception. That said, the failure to observe a similar
relationship when the consistency of speakers’ categorization of ambiguous tokens was
considered instead of the location of their perceptual boundary (Cheng, Niziolek, Buchwalk, &
McAllister, 2021) suggests the factors mediating this relationship may be more complicated than
a simple reflection of phonological organization would entail.
Of more direct relevance to the goal of the investigation here, multiple studies directly
examining the relationship between variability in production and sensitivity to variability in
perception have found evidence in support of a relationship between these two behaviors. In a
study examining the relationship between perceptual acuity and various measures of between-
and within-category dispersion, Perkell et al. (2008) found that individuals’ perceptual threshold
for discrimination along /ɪ/–/ɛ/ and /ɛ/–/æ/ continua was correlated with the average acoustic
variability they exhibited within vowel categories in American English, with the direction of the
relationship indicating that individuals exhibiting less within-category variability were better at
the discrimination task (i.e., were able to discriminate vowel tokens with smaller acoustic
differences). Similar results were obtained in an experiment on speakers of Dutch comparing
variability in production and threshold values for discrimination along /ɪ/–/ɛ/ and /ɑ/–/ɔ/ continua
in Franken et al. (2017), although their observation of such a relationship was variable across
vowel continua and depended on the method for assessing acoustic variability (MFCC-based
analysis or Bark-scale based formant analysis).
Taken together, these findings from previous research point toward the existence of some
type of relationship between individual differences in production variability and perceptual
behavior, including the potential for a relationship between variability in production and
individual differences in sensitivity to subphonemic variation specifically. However, it is unclear
how general these findings of a relationship between individual differences in production
variability and perceptual sensitivity may be, as all of the research focusing specifically on this
question has looked at vowels (cf. Brunner et al. [2011], which compared a combined metric of
within- and across-category dispersion to perceptual thresholds for /s/ and /ʃ/). Given the
predictions made by the modeling work presented in Chapter 4 of this dissertation (specifically
Simulation 3), the question emerges of how the relationship between articulatory and acoustic
variability, as well as the phonological importance of individual dimensions along which
perceived stimuli may vary, may mediate any relationship between variability in production and
perceptual sensitivity to subphonemic variation across individuals.
5.2. Hypotheses and predictions
The focus of the experiment presented in this study is the relationship between individual
differences in acoustic variability (in production) and sensitivity to subphonemic variability (in
perception). The main hypothesis tested here is that individual differences in the cognitive
representation of a phonological unit will lead to differences in both the production and
perception of that unit across speakers. Specifically, based on the results of Simulation 3 in
Chapter 4 of this dissertation, it is predicted that speakers who are less variable in their production of an acoustic dimension[34] will be more sensitive to variability in the perception of that dimension within a particular phonological segment.

[34] Although the model presented in Chapter 4 selects articulatory targets, and therefore directly predicts variation in articulation and not acoustics, this prediction is made given that individual differences in articulatory variability should be reflected in the acoustic signal for the specific dimensions where such a relationship between variability in target selection and perceptual sensitivity is predicted. However, this prediction does pose a potential obstacle to the interpretation of dimensions where articulatory variability is not well reflected in acoustics, an issue that is discussed in Section 5.5.
An additional prediction related to this hypothesis, but not directly derived from
Simulation 3, is made regarding the manner in which the phonological importance of a
dimension may affect the relationship (or lack thereof) between variability in its production and
sensitivity in its perception. Specifically, it is predicted that acoustic dimensions that are less
functionally important for a particular segment will exhibit a weaker relationship between
variability in production and sensitivity in perception, if any is observed at all. This prediction is
predicated on the assumption that less functionally important dimensions are not encoded in the
individual’s cognitive representation of a phonological unit (or, if they do map onto encoded
dimensions, they do so very weakly). As such, individual differences in the production and
perception of these dimensions are not expected to reflect the variation in the cognitive
representation of phonological units upon which this main hypothesis is predicated.
A more general hypothesis related to the nature of the phonetic space within which
phonological goals are defined is also tested in this experiment. The second hypothesis tested is
that phonological targets are represented in articulatory space. The specific acoustic dimensions
examined here in both production and perception make it possible to test this hypothesis. This
hypothesis was an assumption underlying the construction of the model in Chapter 4. Based on
the results of Simulation 3, and specifically the modulation of the relationship between
34
Although the model presented in Chapter 4 selects articulatory targets, and therefore directly predicts variation in
articulation and not acoustics, this prediction is made given that individual differences in articulatory variability
should be reflected in the acoustic signal for the specific dimensions where such a relationship between variability in
target selection and perceptual sensitivity is predicted. However, this prediction does pose a potential obstacle to the
interpretation of dimensions where articulatory variability is not well reflected in acoustics, an issue that is discussed
in Section 5.5.
212
representational differences and perceptual sensitivity by the strength of the mapping of the
acoustic signal to the articulatory planning space, it is predicted that the relationship between
production variability and perceptual sensitivity will be stronger for dimensions where variability
in articulation is more recoverable from the acoustic signal.
5.3. Methods
5.3.1. Participants and recruitment
Seventy participants (38 cisgender women, 29 cisgender men, 2 transgender men, 1
genderqueer individual) were recruited for this experiment using the online recruitment platform
Prolific (www.prolific.co). All participants were monolingual American English speakers
between 18 and 65 years of age (Mean = 31.3, SD = 11.4). Participants were pre-screened prior to
participation to ensure that they were born and currently living in California (USA), with this
geographic restriction imposed to minimize the potential for dialect variation to influence
individual differences in production or perception. No participants reported any speech or
hearing impairments or language disorders.
A target sample size for this study was determined by performing a statistical power
analysis for sample size estimation using the pwr package in R (Champely, 2020). The estimate
of the expected effect size for the current study used for this analysis was 0.35, reflecting the
average correlation between perceptual acuity and production variability in Franken et al. (2017)
and Perkell et al. (2008). Based on this expected effect size, it was projected that a sample size of
55 participants would be needed to obtain sufficient statistical power to evaluate the relationship
between production and perception across speakers at the recommended power of .8 (Cohen,
1988).
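The following sketch reproduces this kind of calculation using the standard Fisher z approximation for a correlation test (the analysis itself was run with pwr in R); the alpha level and sidedness below are assumptions, and the exact sample size returned depends on those choices, so this approximation need not match the figure reported above exactly.

import math
from scipy.stats import norm

def n_for_correlation(r, power=0.8, alpha=0.05, two_sided=True):
    # Approximate n needed to detect a correlation r via Fisher's z-transform
    z_a = norm.ppf(1 - alpha / (2 if two_sided else 1))
    z_b = norm.ppf(power)
    c = 0.5 * math.log((1 + r) / (1 - r))     # Fisher z-transform of r
    return math.ceil(((z_a + z_b) / c) ** 2 + 3)

print(n_for_correlation(0.35))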
Of the 70 participants recruited, 56 were determined to be eligible for inclusion in the full
analysis of the experimental data based on the following criteria. Two participants were excluded
from all analysis due to poor audio recording quality (either excessive background noise or an
insufficiently loud speech signal), two were excluded due to a large amount of missing
production data, and one was excluded due to failing a high proportion (77%) of the attention
checks included in the perception task. In addition, nine participants who successfully completed
the first study session did not return for the second session (12.9% attrition rate). All participants
included in the analysis presented here completed both sessions of the experiment. The
preliminary analysis of experimental results presented in this chapter includes data from 25 of
these 56 participants.
5.3.2. Stimuli
5.3.2.1. Production
Five English monosyllables beginning with /ɹ/ and five monosyllables beginning with /s/
were selected as target words for the production task (Table 5.1). All target words had a CV(C) shape, with the identity of the vowel balanced across words beginning with each target consonant; the coda consonant, when present, was always labial.
additional CV(C) words beginning with either /l/ or /ʃ/ were also included in the word list but are
not analyzed here.
Table 5.1. List of stimulus words recorded by participants in the production task. Words in the
top two rows (/s/ and /ɹ/ initial) are included in the analysis presented here.
     /ɑ/   /e͜i/   /i/    /o͜ʊ/   /u/
/s/  sob   safe   seam   soap   soup
/ɹ/  rob   ray    reef   rope   roof
/l/  law   lei    leap   loaf   loop
/ʃ/  shop  shape  sheep  show   shoe
5.3.2.2. Perception
Stimuli for the perception task were tokens from artificially created /s/ and /ɹ/ continua
varying along selected acoustic features of interest (Table 5.2). Two /ɹ/-initial and two /s/-initial
CV syllables were selected to create these stimuli: rue [ɹu], ray [ɹe͜i], sue [su], and see [si]. A
male monolingual American English model speaker who has lived in California since birth was
recorded reading each of these stimulus words in the carrier phrase “Type a ___ again.” All
recordings were made with Praat (Boersma & Weenink, 2020) in a quiet room with an M-Audio
Producer USB microphone sampling at 44.1 kHz onto a MacBook Pro laptop.
Each target word was elicited from and recorded by the model talker three times, with one of
the three recordings then selected as a “base recording” to be used in the creation of stimulus
items. In each of the selected base recordings, the target word was spliced out of the carrier
sentence and, for fricative-initial target words, additionally had the onset consonant removed.
These spliced-out target words formed the basis for three acoustic continua: an M1 continuum
for /s/, an M4 continuum for /s/, and an F3 continuum for /ɹ/. (Recall that M1 refers to the first
spectral moment of the fricative and M4 refers to the fourth spectral moment). Prior to
continuum creation, the base recordings were scaled to 65 dB and their pitch contour was
manipulated to create a slight fall from 116 Hz to 110 Hz over the voiced portion. This scaling
and pitch manipulation served to minimize differences between stimulus items other than the
target continuum manipulation, minimizing the likelihood that other factors such as pitch or
intensity would impede or facilitate discrimination. All stimulus files were checked after
resynthesis to ensure that their intensity level and pitch contour were consistent.
Table 5.2. Acoustic dimension(s) manipulated for each target segment and the motivation for their selection. M1 = first spectral moment, M4 = fourth spectral moment, and F3 = third formant.

/s/, M1: Commonly used by speakers to differentiate /s/ and /ʃ/ in production and perception (e.g., Jongman et al., 2000; Li et al., 2011); correlates with constriction anteriority (e.g., Gordon et al., 2002).
/s/, M4: May be used by speakers to differentiate /s/ and /ʃ/ (Nissen & Fox, 2005), but to a lesser extent than M1 (e.g., Li et al., 2009); reflects differences in tongue tip orientation (e.g., Li et al., 2009).
/ɹ/, F3: Commonly used by speakers to differentiate /ɹ/ and /l/ in production and perception (e.g., O'Connor et al., 1957).
M1 and F3 were selected for manipulation in generating perception stimuli because they
have been highlighted as critical for both the production and perception of /s/ and /ɹ/,
respectively (Boyce & Espy-Wilson, 1997; Delattre & Freeman, 1968; Hagiwara, 1995;
Jongman et al., 2000; Lehiste, 1964; Li et al., 2011; Nittrouer, 1995; O’Connor et al., 1957;
Shadle & Mair, 1996; Twist et al., 2007). M4 was selected to contrast with these two highly
informative acoustic dimensions, as evidence from previous work suggests that M4 is less
critical for the production and perception of /s/ than M1 (e.g., Li et al., 2009, 2011; Maniwa,
Jongman, & Wade, 2009). This combination of manipulated dimensions allows for the
examination of how the relationship between variability in production and perceptual
discrimination may differ across different phonological segments (M1 for /s/ vs. F3 for /ɹ/) or
across dimensions differing in their functional importance (M1 vs. M4 for /s/). Additionally, the
selection of M1 and F3 for manipulation allows for the evaluation of P3 (the prediction that the relationship between production and perception is stronger for dimensions for which variability in articulation is more recoverable from the acoustic signal), given the differences observed in Chapter 2 regarding the extent to which articulatory variability was recoverable from acoustics for these two dimensions.
A. Stimulus creation: /s/ continua
Seven-step fricative continua along the two dimensions of interest for /s/ (M1 and M4)
were generated from white noise specified for the location, slope, and relative amplitudes of
three spectral peaks (Winn, 2014) (Figure 5.1). Filtered white noise was used instead of directly
manipulating tokens of /s/ produced by the model talker as this method provided precise control
over individual spectral dimensions. The location of peaks was manipulated to create the M1
continuum (with peak slope and amplitude held constant), and the slope of peaks was
manipulated in the M4 continuum (peak location and amplitude held constant). For both
continua, the relative amplitude of the first peak to the second peak was -20 dB and from the
second peak to the third peak was 4 dB. The duration of each synthesized fricative was set to be
200 ms based on the mean duration of onset /s/ in the recordings produced by the model talker.
Each synthesized fricative had an amplitude rise time of 150 ms and a fall time of 40 ms.
The values of peak location and peak slope used to define the lower bounds of the M1
and M4 continua, respectively, were selected to avoid percepts of [ʃ]. To select the lower
bound for each continuum, a twenty-step continuum between endpoints corresponding roughly to
the model speaker’s average productions of /s/ and /ʃ/ was played multiple times for four
consultants (twice in ascending order, and twice in descending order). Each consultant was asked
to indicate when they first heard an /s/ (for ascending repetitions) or /ʃ/ (for descending
repetitions) for each repetition of the continuum. The first step where all consultants consistently
perceived /s/ was then selected as a preliminary lower bound for the continuum and remained the
lower bound in the final seven-step continuum after participants in a pilot study confirmed that
they consistently heard lexical items beginning with /s/ (and not /ʃ/) in the stimulus items. For the
M1 continuum, this lower bound had an M1 value of 6320 Hz; for the M4 continuum, the M4
value of the lowest step equaled -0.808. The upper bound for each continuum was then defined
so that it was the same distance from the speaker's average production of /s/ as the lower bound (M1 continuum: M1 = 6924 Hz; M4 continuum: M4 = -1.639).[35] Intermediate values for
each seven-step continuum were generated using Bark scale frequency interpolation between the
continuum’s selected upper and lower bounds (Figure 5.1). Each of the tokens in the final
continua was scaled to 65 dB intensity and spliced on to the vowel from the /s/ base recordings to
create 4 separate monosyllable continua (2 fricative manipulations x 2 vowel environments).
[35] M1 values for all seven steps of the M1 continuum (in Hz): 6320, 6401, 6487, 6575, 6726, 6818, 6924. M4 values for all seven steps of the M4 continuum: -0.808, -1.066, -1.134, -1.402, -1.461, -1.567, -1.639. Minimal change in all other spectral measurements was observed for each continuum.
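As an illustration of Bark-spaced interpolation between continuum endpoints, the sketch below uses the Traunmüller (1990) Hz-Bark conversion; the choice of conversion formula is an assumption, as is the use of the M1 endpoint values to stand in for the manipulated spectral peak locations.

import numpy as np

def hz_to_bark(f):
    return 26.81 * f / (1960.0 + f) - 0.53       # Traunmüller (1990)

def bark_to_hz(z):
    return 1960.0 * (z + 0.53) / (26.28 - z)     # inverse of the conversion above

lo, hi = 6320.0, 6924.0                          # M1 continuum endpoints (Hz)
steps = bark_to_hz(np.linspace(hz_to_bark(lo), hz_to_bark(hi), 7))
print(np.round(steps))                           # seven values equally spaced in Bark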
Figure 5.1. Trace of the spectrum for each of the steps of the (a) M1 and (b) M4 continua for /s/.
B. Stimulus creation: /ɹ/ continuum
A seven-step continuum manipulating F3 in /ɹ/ was created using a Praat script for
generating formant continua via the resynthesis of naturally recorded speech tokens (Winn,
2016). The two-step process used to create the /s/ continua was also used in the creation of the
F3 continuum, with the upper bound of the seven-step continuum defined as the last step on a
twenty-step continuum between the model talker’s average productions of /ɹ/ and /l/ where the
same four consultants consistently heard /ɹ/ and not /l/ (F3 = 1606 Hz). The lower bound of the
seven-step continuum was set as the minimum F3 value measured in the model talker’s
productions of /ɹ/ (F3 = 1219 Hz).
The procedure by which the F3 continua were resynthesized involved passing the source
signal extracted from each base recording of /ɹ/ through filters manipulated to form a continuum
with the desired number of equally spaced steps. Using the filter extracted from the base
recording as a starting point, the F3 trajectory during the steady-state portion of /ɹ/ and during the
acoustic transition from the /ɹ/ to the following vowel was manipulated to create the filters used
to generate each continuum step. Trajectories for F1, F2 and F4 were held constant across all
steps of the continuum, as were all formants in the vocalic portion of the stimulus. Filters for
continuum endpoints were created by shifting formant points in the steady-state region of /ɹ/ to
the pre-determined upper- or lower-bound value of F3. Formant values during the transition from
the steady-state portion of the /ɹ/ to the following vowel were shifted as a function of their
proximity to the /ɹ/ steady-state region to produce a smooth transition (Figure 5.2).
Figure 5.2. Definition of lower and upper endpoints for the F3 continuum using Praat
FormantGrid objects (left: lower endpoint, right: upper endpoint). Blue line and points indicate
the manipulated dimension (F3).
Figure 5.3. Formant trajectories (F1-F5) for each of the seven stimuli in the F3 continuum. The
portion of the x-axis replaced by a dotted gray line indicates the temporal domain over which F3
was manipulated (with the different F3 trajectories resulting from this manipulation indicated by
different colored lines).
Filters for intermediate continuum steps were then calculated through the interpolation of
the F3 values at a given formant point in each of the endpoint filters, and the source signal from
the base recording was passed through each filter to create the seven continuum steps.
Trajectories of the first five formants in each of the stimulus items in the seven-step F3 continua
can be seen in Figure 5.3. All acoustic energy above 3,500 Hz from the (original, unmanipulated)
base recording lost during the process of LPC decomposition was restored in the resynthesized
stimuli to improve perceived naturalness.
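A sketch of the proximity-weighted shifting described here is given below, assuming a linear fade of the shift over the transition region (the actual weighting function used by the resynthesis script is not specified in this chapter).

import numpy as np

f3_lo, f3_hi = 1219.0, 1606.0
step_targets = np.linspace(f3_lo, f3_hi, 7)      # equally spaced F3 values for steps 1-7

def shift_f3(times, f3_track, steady_end, transition_end, target):
    # Move F3 to `target` during the /r/ steady state, fading the shift out
    # linearly over the transition into the following vowel.
    shifted = np.array(f3_track, dtype=float)
    for i, t in enumerate(times):
        if t <= steady_end:
            w = 1.0
        elif t < transition_end:
            w = (transition_end - t) / (transition_end - steady_end)
        else:
            w = 0.0
        shifted[i] = (1 - w) * f3_track[i] + w * target
    return shifted

times = np.linspace(0.0, 0.2, 21)
f3_track = np.full_like(times, 1400.0)           # toy F3 trajectory (Hz)
print(shift_f3(times, f3_track, 0.08, 0.14, step_targets[0]))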
5.3.3. Procedure
The Gorilla Experiment Builder (www.gorilla.sc) was used to create and host the
experiment (Anwyl-Irvine, Massonnié, Flitton, Kirkham, & Evershed, 2018). The experiment
consisted of a speech production block and three speech perception blocks divided over two
experiment sessions. Stimuli in the three blocks comprising the perception experiment were
blocked by acoustic condition (M1, M4, or F3), and the order of the three perception blocks was
counterbalanced across participants using a Latin Square design (Table 5.3). All participants
completed the speech production block and one of the three speech perception blocks in the first
session and completed the remaining two speech perception blocks in the second session. The
production block always took place before the perception block in the first session. Each session
lasted approximately 45 minutes, and participants received $7.50 for each session ($15 for
completion of both sessions). Participants were required to use a laptop or desktop computer to
complete the experiment, with settings applied in both Prolific and Gorilla that restrict
participants from entering the experiment if they are using a cell phone or a tablet.
Table 5.3. Counterbalancing of perception block presentation across groups.
         First block   Second block   Third block
         (Session 1)   (Session 2)    (Session 2)
Order A  M1            F3             M4
Order B  F3            M4             M1
Order C  M4            M1             F3
5.3.3.1. Production task
Before beginning the production task, participants were provided with an example
recording of a speaker reading a sentence at a target loudness level and pace and were asked to
record themselves reading the same sentence at a loudness and pace similar to the sample
recording. Participants were then asked to play back their own recording of the sentence to check
both that audio capture was working on their computer and that the loudness and pace of their
own speech was similar to the example.
In the production block, target words were visually presented to participants on their
computer screen in the carrier phrase “Type a ___ promptly.” Sentences were presented one at a
time, and the presentation was self-paced. The entire set of production stimuli was presented 10
times during the production block, with stimulus presentation blocked by list repetition and the
order of words randomized within each repetition of the entire list. Participants were instructed
to read each sentence they saw on the screen aloud at a natural reading pace and to restart the
sentence if they had to cough, sneeze, pause, or were otherwise interrupted while reading. They
were given the opportunity to take a short break (up to 3 minutes) after every 40 items in the
word reading list.
5.3.3.2. Headphone screen
Participants were informed at the beginning of each experimental session that they must
wear headphones for the perception blocks. Screening for headphone use was conducted
immediately before the first perception block in each session using the task described in Milne et
al. (2020). The task uses Huggins Pitch stimuli to test for dichotic listening (Akeroyd et al.,
2001; Chait et al., 2006; Cramer & Huggins, 1958) and has been found to correctly detect
headphone use 80% of the time with a 20% false positive rate. (This is a significant improvement
over the 70% correct detection and 31% false positive rate observed for the commonly used
Woods et al. (2017) headphone screen when tested online.
36
) The implementation of the
headphone screen used in this experiment was changed from the ABX paradigm used in Milne et
al. (2020) to a 4IAX paradigm to parallel the structure of trials and response mechanisms used in
the main perception task. Participants who failed the headphone screen once were given the
opportunity to re-do the headphone screen one additional time.
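For reference, a minimal sketch of how a Huggins Pitch stimulus of the sort used in such screens can be generated (interaural phase inversion of a narrow noise band, here around 600 Hz); the band edges, duration, and center frequency are illustrative, not the parameters of Milne et al. (2020).

import numpy as np
from scipy.io import wavfile

fs, dur = 44100, 1.0
rng = np.random.default_rng(1)
noise = rng.standard_normal(int(fs * dur))

# Invert the phase of a narrow band in one ear only; over headphones this
# produces a faint pitch at the band's center, inaudible over loudspeakers.
spec = np.fft.rfft(noise)
freqs = np.fft.rfftfreq(noise.size, 1 / fs)
band = (freqs > 570) & (freqs < 630)
spec[band] *= -1.0                      # pi phase shift in the right channel
right = np.fft.irfft(spec, n=noise.size)

stereo = np.stack([noise, right], axis=1)
stereo = stereo / np.abs(stereo).max()
wavfile.write("huggins_600hz.wav", fs, (stereo * 32767).astype(np.int16))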
5.3.3.3. Perception task
The perception task tested participants’ sensitivity to variability along each manipulated
acoustic dimension using a four-interval same-different (4IAX) discrimination paradigm (Pisoni,
1975; as used in Beddor & Krakow, 1999; Beddor, Harnsberger, & Lindemann, 2002; Fowler,
1981; Zellou, 2017). This paradigm was selected for the experiment as it encourages participants
to attend to auditory information instead of relying on top-down phoneme classification.
Consequently, it is thought to better measure listener sensitivity to acoustic differences than other
tasks that may induce greater phonological processing. This method of stimulus presentation also
removes the bias towards a ‘same’ response present in AX discrimination paradigms, which may
be particularly important for testing sensitivity to the perceptually subtle subphonemic
differences examined in this experiment.
In a 4IAX task, four acoustic stimuli are presented on every trial. Three of these stimuli
are acoustically identical, while one stimulus (either the second or third item presented during the trial) differs from the others along one of the manipulated acoustic dimensions. This design is shown schematically in Figure 5.4. After hearing all four stimulus items within a given trial, the participant was asked to indicate whether the second or third item was different from the others by pressing either the f key or the j key, respectively, on their computer's keyboard. The
four stimuli in every trial were taken from two steps of the same acoustic continuum either 4, 5,
or 6 continuum steps apart. These step sizes were chosen following results from a pilot
experiment suggesting that almost all participants exhibited at-chance performance for trials with
smaller acoustic differences (2- or 3-step) between the stimulus items. The selection of step sizes
for testing resulted in a total of 6 unique stimulus pairings for each continuum (three 4-step
pairings, two 5-step pairings, and one 6-step pairing), leading to a list of 48 possible trial types
for each acoustic manipulation (6 stimulus pairings x 4 possible orders x 2 vowel contexts). In
the discussion of the results of this experiment, trials with a 4-step interstimulus distance will be
referred to as SMALL distance trials, trials with a 5-step distance will be MEDIUM trials, and trials
with a 6-step distance will be LARGE trials.
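The full crossing of stimulus pairings, orders, and vowel contexts can be enumerated as in the sketch below (the vowel labels are placeholders).

import itertools

steps = range(1, 8)
pairs = [(a, b) for a, b in itertools.combinations(steps, 2) if b - a in (4, 5, 6)]
vowels = ["u", "i"]                       # placeholder labels for the two vowel contexts

trials = []
for (a, b), vowel in itertools.product(pairs, vowels):
    for base, odd in ((a, b), (b, a)):    # which member of the pair is the odd one out
        for odd_pos in (1, 2):            # odd item in the 2nd or 3rd interval (0-indexed)
            seq = [base] * 4
            seq[odd_pos] = odd
            trials.append({"vowel": vowel, "steps": seq, "answer": odd_pos + 1})

print(len(trials))                        # 48 trial types per continuum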
Figure 5.4. Schematic of trial creation process for the 4IAX task.
The perception task was conducted in three separate blocks spread over the two
experimental sessions. Stimuli for a single acoustic condition were presented in each block. The
full list of 48 trials for each acoustic condition was repeated three times within the block for that
condition, for a total of 144 trials per manipulation per participant (72 SMALL trials, 48 MEDIUM
trials, and 24 LARGE trials). The presentation of stimuli was blocked by list repetition, and the
order of trials was randomized within each repetition of the full list. The interstimulus interval
within each trial was 500 ms and the intertrial interval between trials was 2 seconds.
One trial serving as an attention check was also randomly placed within each repetition of
the trial list (3 attention checks per perception block). Attention check trials were designed to
verify that participants were following instructions and that they were actually listening to the
experiment audio. In each attention check trial, the participant was shown a screen that instructed
them to do what the audio told them to do and was then played an audio recording telling them
to press one of the two response keys (f or j) used in the experiment. Participants who failed
more than one attention check trial per session were excluded from analysis.[37]

[37] One participant failed seven of the nine attention checks presented across both sessions of the experiment and was excluded from the experiment. Two additional participants failed the first attention check in the first session and were not excluded from the experiment. All other participants answered all attention checks correctly.
Participants were given the opportunity to take a short (3 minute or less) break after each
set of 49 trials. The opportunity for a slightly longer break (up to 5 minutes) was given between
the production and perception blocks in the first session and between the two perception blocks
in the second session.
5.3.4. Analysis
5.3.4.1. Production measurements
The recordings collected from participants during the production block were force-
aligned using the Montreal Forced Aligner (McAuliffe, Socolof, Mihuc, Wagner, &
Sonderegger, 2017). The generated TextGrids were visually inspected and manually corrected as
necessary. These alignments and their corresponding audio files were then used as input to a
Python script that used the TextGridTools (Buschmeier & Włodarczak, 2013) and Parselmouth
(Jadoul, Thompson, & de Boer, 2018) libraries to interface with Praat (Boersma & Weenink,
2020) and collect acoustic measurements from the target segment in each recording. Values of
the first five formants were automatically extracted at the acoustic midpoint of each target /ɹ/,
with formant tracking in Praat configured to find five formants below either 5000 or 5500 Hz
(with the choice of upper bound for formant tracking determined individually for each
participant). For /s/ targets, the first four spectral moments of each fricative were calculated using a DFT calculated over a 50 ms Hamming window centered at the acoustic midpoint of the
/s/. A high-pass filter with a cut-off at 500 Hz was applied to the fricative spectrum prior to
spectral moment calculation for tokens of /s/. Although multiple formants were extracted for /ɹ/
and multiple spectral moments extracted for /s/, only F3 (for /ɹ/) and M1 and M4 (for /s/) are
used in the analyses presented here.
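A sketch of these measurements using Parselmouth under the stated settings (50 ms Hamming window, 500 Hz high-pass, midpoint extraction) follows; this is an illustrative reconstruction, not the script used for the analysis, and the exact normalization of the higher moments is an assumption.

import numpy as np
import parselmouth

def f3_at_midpoint(wav_path, t_mid, max_formant=5500.0):
    snd = parselmouth.Sound(wav_path)
    formant = snd.to_formant_burg(maximum_formant=max_formant)
    return formant.get_value_at_time(3, t_mid)

def spectral_moments(wav_path, t_mid, win=0.050, cutoff=500.0):
    snd = parselmouth.Sound(wav_path)
    seg = snd.extract_part(from_time=t_mid - win / 2, to_time=t_mid + win / 2)
    x = seg.values[0] * np.hamming(seg.values[0].size)
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(x.size, 1 / seg.sampling_frequency)
    keep = freqs >= cutoff                    # high-pass at 500 Hz before moments
    p = power[keep] / power[keep].sum()       # spectrum as a probability mass
    f = freqs[keep]
    m1 = (f * p).sum()
    var = (((f - m1) ** 2) * p).sum()
    m3 = (((f - m1) ** 3) * p).sum() / var ** 1.5
    m4 = (((f - m1) ** 4) * p).sum() / var ** 2 - 3.0   # excess kurtosis
    return m1, np.sqrt(var), m3, m4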
A total of 1,250 tokens of each target segment were collected from the 25 speakers whose
data are presented here (50 tokens/segment/speaker). Tokens of /s/ for which the automatic
procedure found M1 values below 5,000 Hz or above 10,000 Hz, tokens of /ɹ/ for which the
automatic procedure found F3 values below 1,300 Hz or above 2,100 Hz, and tokens of either
segment that were more than 2.5 standard deviations away from the speaker mean along one or
more measured acoustic dimensions were visually inspected and manually corrected in Praat.
Tokens of /ɹ/ were omitted from analysis when an F3 trajectory could not be visually identified
due to low signal intensity (N = 33), as were tokens of /s/ where excessive background noise or
low signal intensity prevented accurate measurement of spectral moments (N = 26). A small
number of tokens were additionally unable to be analyzed due to sporadic recording issues
resulting in lack of audio capture or early termination of the audio file (N = 16 for /ɹ/, N = 23 for
/s/).
To assess variability along each acoustic dimension of interest for each speaker, within-
context measures of dispersion were calculated for each speaker along each of the acoustic
dimensions corresponding to an acoustic manipulation in the perception task (M1, M4 and F3).
Both CoV and IQR measurements were calculated for the ratio scale measurements M1 and F3,
and only IQR was calculated for M4. These measures were calculated using the same procedure
used for the analysis of the XRMB data (as described in Section 2.3 and Appendix A). In this
analysis, within-context measures of dispersion (IQR_CON and CoV_CON) were calculated for each
dimension as the average dispersion exhibited by a participant across tokens of a single target
word.
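A sketch of this calculation in pandas, assuming a long-format table with one row per token (column names and values are hypothetical):

import pandas as pd

def within_context_dispersion(tokens: pd.DataFrame, value: str) -> pd.DataFrame:
    # Dispersion within each target word, then averaged over words per speaker
    def iqr(x):
        return x.quantile(0.75) - x.quantile(0.25)
    per_word = tokens.groupby(["speaker", "word"])[value].agg(
        IQR_CON=iqr,
        CoV_CON=lambda x: x.std() / x.mean(),   # only meaningful for ratio scales (M1, F3)
    )
    return per_word.groupby("speaker").mean()

tokens = pd.DataFrame({
    "speaker": ["p01"] * 4 + ["p02"] * 4,
    "word":    ["sob", "sob", "safe", "safe"] * 2,
    "M1":      [6500.0, 6550.0, 6600.0, 6700.0, 6300.0, 6900.0, 6200.0, 6800.0],
})
print(within_context_dispersion(tokens, "M1"))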
5.3.4.2. Perception measurement
Response Accuracy was calculated as a measurement of participants’ performance in the
4IAX discrimination task. This was done to examine how accuracy in the perception of
differences along a manipulated dimension related to variability in its production across
speakers, and to evaluate the overall effect of the size of acoustic differences between stimuli and
variability along different acoustic continua. Response Accuracy was calculated separately for
each unique combination of step size and acoustic condition for each participant. As participants’
response options on each trial in the 4IAX task presented a binary choice, 50% accuracy would
indicate at-chance performance (inability to differentiate the acoustically distinct stimuli) within
a condition. Participants’ responses for each trial were coded as ‘correct’ or ‘incorrect,’ and the
proportion of correct responses was calculated out of the total number of trials of a given step
size within an acoustic condition. The calculation of response accuracy was collapsed across the
two vowel conditions in the perception experiment.
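A sketch of the accuracy calculation, assuming one row per trial in a long-format table (column names and values are hypothetical):

import pandas as pd

responses = pd.DataFrame({
    "participant": ["p01"] * 4,
    "continuum":   ["M1", "M1", "F3", "F3"],
    "step_size":   [4, 4, 5, 5],
    "correct":     [True, False, True, True],
})

acc = (responses
       .groupby(["participant", "continuum", "step_size"])["correct"]
       .mean()                      # proportion correct; 0.5 is chance for a binary choice
       .rename("response_accuracy")
       .reset_index())
print(acc)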
5.3.4.3. Statistical analysis
To investigate whether individual differences in acoustic variability were related to
accuracy in the discrimination of subphonemic variation across individuals and whether this
relationship differed across acoustic dimensions, a series of statistical analyses were conducted
using Spearman’s rank-order correlation. Specifically, correlations between measures of
dispersion and Response Accuracy were calculated separately for each Step Size in all acoustic
continua. Additionally, repeated measures ANOVA was used to evaluate whether there were
overall differences in listeners’ accuracy in discrimination across the three manipulated acoustic
dimensions and across trials differing in the interstimulus acoustic distance (i.e., differing in Step
Size). Post-hoc pairwise comparisons for the ANOVA were conducted using Tukey’s HSD post-
hoc tests.
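A sketch of the correlation step, pairing each participant's dispersion measure with their accuracy in a given condition (the values shown are placeholders, not experimental data):

import pandas as pd
from scipy.stats import spearmanr

merged = pd.DataFrame({
    "IQR_CON":           [55.0, 80.0, 62.0, 71.0],   # production dispersion per participant
    "response_accuracy": [0.67, 0.54, 0.63, 0.58],   # accuracy for one step size and continuum
})
rho, p = spearmanr(merged["IQR_CON"], merged["response_accuracy"])
print(rho, p)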
5.4. Results
5.4.1. Summary statistics and comparison to chance performance
Table 5.4 shows the mean and standard deviation of Response Accuracy, calculated
separately for each level of Step Size in each acoustic continuum. The mean Response Accuracy
values in all cells indicate that, at the level of the group, participants’ sensitivity to subphonemic
variation was fairly similar across all three of the manipulated acoustic dimensions, with
performance in the M1 continuum slightly lower than that observed for the F3 continuum, and
performance in the F3 continuum slightly lower than that in the M4 continuum. Mean
measurements of accuracy within each Step Size-by-acoustic continuum grouping were generally
fairly close to chance, particularly for the SMALL and MEDIUM step size conditions. However,
group-level performance within each condition was still significantly above chance, as
demonstrated in Figure 5.5 by the failure of the 95% confidence interval for each group median
(indicated by the notches in the boxplot for each condition) to intersect with the chance
performance level of 50% (indicated by the horizontal dashed line). This suggests that the
perception task was difficult, but that listeners were still generally able to distinguish acoustically
distinct stimuli in each acoustic continuum and interstimulus distance tested.
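The notch bounds referred to here follow the standard boxplot approximation to a 95% confidence interval around the median, median ± 1.58 × IQR/√n (see ?boxplot.stats in R). For illustration only, an equivalent nonparametric check of a single cell against chance, which was not itself part of the reported analysis, could take the form:

    # Hypothetical one-sided test of one cell against the 50% chance level,
    # assuming the `accuracy` frame sketched in Section 5.3.4.2:
    with(subset(accuracy, continuum == "F3" & step_size == "SMALL"),
         wilcox.test(acc, mu = 0.5, alternative = "greater"))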
Figure 5.5. Response accuracy according to acoustic continuum and interstimulus acoustic
distance (step size) for all participants examined here. Notches in boxes indicate the 95%
confidence interval around median performance in each combination of acoustic continuum and
step size. The dashed green horizontal line indicates at-chance performance (50%).
Table 5.4. Summary statistics for Response Accuracy across speakers. Mean and standard
deviation (given in parentheses) calculated separately for each level of step size for each of the
manipulated acoustic dimensions.
                         F3            M1            M4
Step Size = 4 (SMALL)    0.59 (0.06)   0.56 (0.11)   0.60 (0.15)
Step Size = 5 (MEDIUM)   0.61 (0.10)   0.61 (0.11)   0.64 (0.14)
Step Size = 6 (LARGE)    0.65 (0.11)   0.62 (0.12)   0.69 (0.15)
5.4.2. Effect of acoustic dimension and interstimulus distance on accuracy
A two-way repeated measures ANOVA was conducted to examine the effect of Acoustic
Continuum (F3, M1, or M4) and Step Size (SMALL, MEDIUM, or LARGE) on Response Accuracy
across subjects. Levene’s test indicated that the assumption of homogeneity of variance was
violated for the comparison of Step Size effects on Response Accuracy across the three acoustic
continua (F = 2.215, p = 0.03). The Shapiro-Wilk test for normality also indicated that the
accuracy data were not normally distributed (W = 0.975, p = 0.0017), with visual inspection
indicating that this was due to a slight positive skew in the distribution of calculated Response
Accuracy. A log transformation of the dependent variable Response Accuracy was used in all
statistical analyses involving ANOVA to accommodate these violations of the assumptions of
equality of variance and normality. The un-transformed values of this variable are used in all
graphs and figures.
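These checks and the transformation can be sketched in R as follows; the use of car::leveneTest, whose default is the median-centered Brown-Forsythe variant of Levene’s test, is an assumption, as the exact software route is not specified here.

    # Assumption checks and log transform, assuming the `accuracy` frame above:
    library(car)

    leveneTest(acc ~ interaction(continuum, step_size), data = accuracy)
    shapiro.test(accuracy$acc)
    accuracy$log_acc <- log(accuracy$acc)   # dependent variable for the ANOVAs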
Figure 5.6. Response accuracy by acoustic continuum for all participants examined here.
Notches in boxes indicate the 95% confidence interval around median performance in each
acoustic continuum. The dashed green horizontal line indicates at-chance performance (50%).
Significant main effects were observed for Continuum (F[2,195] = 5.234, p = 0.006)
and Step Size (F[2,195] = 9.642, p = 0.0001) in the model. Post-hoc pairwise comparisons
revealed that participants were significantly more accurate in their identification of the
acoustically distinct items for the M4 continuum than for the M1 continuum (p = 0.005), with no
significant difference in accuracy between the F3 and M4 continua (p = 0.21) or between the F3 and M1
continua (p = 0.31) (Figure 5.6). Participants were more accurate in the trials with a LARGE
interstimulus acoustic difference (LARGE vs. MEDIUM: p = 0.02; LARGE vs. SMALL: p < 0.0001),
with no significant difference in performance observed between trials with SMALL and MEDIUM
interstimulus distances (p = 0.69) (Figure 5.7). The interaction between Continuum and Step
Size was not statistically significant (F[4,195] = 0.53, p = 0.714).
Figure 5.7. Response accuracy according to interstimulus acoustic distance (step size) for all
participants examined here. Notches in boxes indicate the 95% confidence interval around
median performance for each step size. The dashed green horizontal line indicates at-chance
performance (50%).
5.4.3. Relationship between production variability and discrimination accuracy
Spearman’s rank-order correlation coefficient was used to evaluate the relationship
between individual differences in acoustic variability and discrimination accuracy for each step
size within each acoustic continuum. An analysis between IQRCON and Response Accuracy was
carried out for all three continua (Table 5.5). An analysis between CoVCON and Response
Accuracy was carried out for the F3 and M1 continua only (Table 5.6).
The results of these comparisons indicate that the relationship between acoustic
variability and discrimination accuracy across individuals depends on (a) the identity of the
manipulated acoustic dimension and (b) the size of the interstimulus acoustic difference. In this respect,
the pattern of results observed here mirrors the predictions of the model presented in Chapter 4
(Simulation 3, Figure 4.10) in that the strongest relationship between acoustic variability and
discrimination accuracy is observed for F3 in /ɹ/, which exhibited a stronger mapping between
articulatory and acoustic variability than either M1 or M4 in /s/ in Chapter 2. A significant
negative correlation was observed between IQRCON and Response Accuracy for LARGE distance
trials in the F3 continuum, as shown in Figure 5.8. The correlation between these production and
perception measurements was not statistically significant for the SMALL or MEDIUM distance
conditions, although a greater negative trend was observed for the MEDIUM distance condition
than for the SMALL condition (rs = -0.19 vs. rs = -0.08). No significant relationship was observed
between IQRCON and Response Accuracy in any of the comparisons conducted for the M1
continuum (Figure 5.9) or the M4 continuum (Figure 5.10).
Figure 5.8. Comparison of IQRCON and Response Accuracy across participants for each step size
in the F3 continuum. Significant correlations are indicated by an asterisk. The dashed green
horizontal line indicates chance performance (50%).
Figure 5.9. Comparison of IQRCON and Response Accuracy across participants for each step size
in the M1 continuum. No correlations are significant. The dashed green horizontal line indicates
chance performance (50%).
Figure 5.10. Comparison of IQRCON and Response Accuracy across participants for each step
size in the M4 continuum. No correlations are significant. The dashed green horizontal line
indicates chance performance (50%).
Table 5.5. Spearman’s rho (rs) for the comparison of Response Accuracy and IQRCON across
speakers. Correlation coefficients calculated for each unique combination of Step Size and
Acoustic Continuum. Comparisons significant at the uncorrected p < 0.05 level are bolded.
Comparisons significant after using the Benjamini-Hochberg method to control for false
discovery rate (padj < 0.05) are additionally italicized.
                  F3                       M1                       M4
          rs      p       padj     rs      p       padj     rs      p       padj
SMALL    -0.08    0.71    0.70    -0.02    0.92    0.92    -0.04    0.86    0.92
MEDIUM   -0.19    0.38    0.56    -0.02    0.91    0.92     0.07    0.75    0.92
LARGE    -0.57    0.025   0.01     0.07    0.75    0.92     0.06    0.79    0.92
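The adjustment itself is mechanical and can be reproduced with base R’s p.adjust; the sketch below uses the nine uncorrected p values of Table 5.5 as illustrative input. Note that the adjusted values obtained depend on how comparisons are grouped into families for the correction, so this sketch need not reproduce the padj column exactly.

    # Benjamini-Hochberg false discovery rate control over one family of tests:
    p_raw <- c(0.71, 0.38, 0.025,   # F3: SMALL, MEDIUM, LARGE
               0.92, 0.91, 0.75,    # M1
               0.86, 0.75, 0.79)    # M4
    p.adjust(p_raw, method = "BH")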
The results for the F3 and M1 continua were largely similar when production variability
was measured using CoVCON instead of IQRCON (Table 5.6). CoVCON and Response Accuracy were
again significantly correlated across speakers for LARGE distance trials in the F3 continuum, as
shown in Figure 5.11, but no relationship was observed between the production and perception
measures either for smaller distances in the F3 continuum (Figure 5.11) or for the M1 continuum
(Figure 5.12).
Figure 5.11. Comparison of CoVCON and Response Accuracy across participants for each step
size in the F3 continuum. Significant correlations are indicated by an asterisk. The dashed green
horizontal line indicates chance performance (50%).
Figure 5.12. Comparison of CoVCON and Response Accuracy across participants for each step
size in the M1 continuum. No correlations are significant. The dashed green horizontal line
indicates chance performance (50%).
Table 5.6. Spearman’s rho (rs) for the comparison of Response Accuracy and CoVCON across
speakers. Correlation coefficients calculated for each unique combination of Step Size and
Acoustic Continuum (for M1 and F3 only). Comparisons significant at the uncorrected p < 0.05
level are bolded. No comparisons were significant after using the Benjamini-Hochberg method
to control for false discovery rate.
                  F3                        M1
          rs      p       padj      rs       p       padj
SMALL     0.12    0.56    0.85     -0.19    0.40     0.98
MEDIUM    0.001   0.96    0.97     -0.005   0.98     0.98
LARGE    -0.47    0.02    0.06     -0.07    0.73     0.98
5.5. Discussion
The results of this experiment indicate that in some cases (acoustic) variability in speech
production is related to individual differences in perceptual sensitivity to subphonemic variation,
but that the appearance of this relationship is mediated by the properties of the acoustic
dimensions along which behavior is measured. Specifically, the results of the experiment suggest
that the appearance of this relationship depends on the strength of the acoustic-articulatory
mapping along the acoustic dimension(s) in which tokens of the same phonological segment
differ. The only acoustic dimension along which a relationship was observed between variability
in production and discrimination accuracy, F3 in /ɹ/, exhibited a stronger and more consistent
relationship between articulatory and acoustic variability than either M1 or M4 in /s/ in the
analyses of articulatory-acoustic relations presented in Chapter 2. These findings provide general
support for models of speech communication in which a common currency mediates speech
production and perception (e.g., Goldstein & Fowler, 2003) and more specific support for the
Dynamic Field Theory models of phonological cognition presented in Chapter 4 of this
dissertation.
Higher accuracy in discrimination was observed for participants who were less variable
in their production of F3 in /ɹ/, but only for trials with the largest interstimulus acoustic distance.
For trials with smaller interstimulus acoustic distances in /ɹ/, and across all conditions in the two
/s/-based continua (M1 and M4), no relationship was observed between production variability and
accuracy in discrimination. The observation of a relationship between production and perception
for /ɹ/ (in one condition) and the absence of such a relationship for /s/ (in any condition) align
with the predictions of the model presented in Chapter 4. Specifically, the
simulation of a perceptual discrimination task using the model suggested that less variable
speakers would perceive the difference between two tokens of the same phonological segment as
larger than more variable speakers would, but only if the listener was able to map the perceived
acoustic signal onto a fairly precise region of the possible target space for that segment, i.e., only
if there was a good mapping between articulatory and acoustic variability. If there was too much
uncertainty in the mapping of the acoustic signal onto the target space for the segment, no
differences in perception were observed across speakers differing in production variability. This
suggests that, in a forced-choice paradigm like the one used here, a relationship between
variability in production and individual differences in perceptual discrimination would emerge
for F3 in /ɹ/ before it would for M1 in /s/, as the
mapping between articulation and acoustics was much stronger for /ɹ/ than it was for /s/ in the
analysis of articulatory-acoustic relations presented in Chapter 2. Specifically, marginal R²
values were much larger for linear mixed effects models predicting F3 in /ɹ/ from articulation
than they were for models predicting M1 in /s/ from articulation in Chapter 2, with a similar
pattern of larger marginal R² also observed when comparing models predicting /ɹ/ articulation
with those predicting /s/ articulation from acoustics.
The consistently above-chance group-level performance for all step sizes within all three
acoustic continua suggests that any discrepancy in the relationship between speech production
and perception across conditions is unlikely to be attributable to poor participant performance
in any one condition (although it could have played a role in differences in this relationship
across step sizes for /ɹ/). This is particularly clear when the results
for the M4 continuum are considered and compared to the F3 continuum. Participant accuracy in
the F3 and M4 continua did not significantly differ. If anything, the difference in mean accuracy
between these two conditions is large enough that, if current trends continue, it may reach
significance once data are analyzed for all participants in the study, with higher accuracy
observed in M4. Despite this equivalence in accuracy, and even a potential trend towards higher
accuracy in the M4 continuum, no significant relationship was observed between individual
differences in perception and production variability for M4, while one was observed for F3. This strongly
suggests that the appearance of such a relationship in one condition but not the other is mediated
by factors other than simply participants’ ability to discriminate between manipulated tokens. In
the particular case of M4 versus F3, the lack of a correlation between variability in production
and perception for M4 arguably reflects the relative weakness of the articulatory-acoustic
mapping in M4, although the fact that M4 is of lesser functional importance for the production
and perception of contrasts in /s/ than F3 is for /ɹ/ may also play a role.
For similar reasons related to the discrepancy in the strength of the articulatory-acoustic
mapping for /ɹ/ and /s/, the results of this study do not contradict findings from previously
published work that has fairly consistently observed a relationship between variability in
acoustics and perceptual acuity (e.g., Franken et al., 2017; Perkell et al., 2008). This is because,
as mentioned in Section 4.3.3.3, previous work on this topic has generally focused on the
relationship between production variability and perceptual acuity in vocalic segments. The
strength of the mapping between articulation and acoustics for vowels is likely at least as strong
as, if not stronger than, that observed for /ɹ/ (Whalen et al., 2018), meaning that the F3 manipulation
(for which a significant correlation between production and perception was observed) presents
the closest analogue to previous research examining the relationship between variability in
production and perceptual acuity. As such, the comparable results obtained in this condition
suggest a degree of agreement between the results presented here and previously published
research.
In previous work, observed relationships between individual differences in variability and
perceptual acuity have been interpreted as reflecting a relationship between auditory acuity and
the size of the target space for a phonological unit, with the presupposition that higher auditory
acuity will lead to smaller targets and, subsequently, less variability in the production of the
phonological unit (e.g., Franken et al., 2017; Perkell et al., 2008; see Villacorta et al. [2007] for
evidence of the kind of acuity-dependent propensity for variability reduction motivating this
interpretation). The failure to observe a relationship between individual differences in
variability and perceptual discrimination for either of the two dimensions manipulated in /s/
seems to contradict this interpretation, as it implies that the range of phonetic
space encompassed within a speaker’s representation of a phonological unit is at most only
weakly tied to auditory acuity. However, previous proposals suggesting that /s/ and other
fricatives may generally rely more on somatosensory feedback than auditory feedback for their
regulation (e.g., Guenther et al., 1998; Ghosh et al., 2010) may provide an alternative
explanation, apart from the account based on the strength of the articulatory-acoustic mapping
provided by the previous chapter’s modeling work, for the failure to observe a relationship
between variability in production and perceptual discrimination accuracy in this experiment.
The use of acoustic measurements to index individual differences in production
variability in this experiment presents an additional obstacle for relating the findings directly to
the model presented in the previous chapter. As the model used to generate predictions for this
study situates its system of phonological representation in articulatory space, the variability
predicted to correlate (or not) with individual differences in perceptual discrimination would be
variability in the articulatory targets selected for production. Although variability in the acoustic
signal reflects variability in articulation (and, subsequently, individual differences in the selected
targets) to some degree, the linearity of this mapping varies considerably across individuals and
across phonological units. The results from the articulatory-acoustic analysis in Chapter 2
suggest that interspeaker differences in the linearity of the articulatory-acoustic mapping may
affect the recovery of information about articulatory variability from acoustics more for /s/ than
it does for /ɹ/, as the variability of predicted values did not reflect the variability of actual
acoustic measurements for the models fit to M1 and one of the models fit to M4 in /s/, but it
consistently did for F3 in /ɹ/. This may make it more difficult to estimate the extent to which
the measured variability reflects underlying articulatory variability for M1 and M4 in /s/ than
for F3 in /ɹ/. Given that the relationship between stochastic and contextual variability supports a
representation in articulatory space for both /ɹ/ and /s/ (as presented in Chapter 3), the lack of
relationship between production variability and performance on the perception task for M1 and
M4 in this study could indicate that an underlying relationship in articulation was obscured by
the many-to-one nature of articulatory-acoustic relations in /s/ and the inconsistency of this
relationship across speakers. Future work incorporating the collection of articulatory data in a
similar experimental paradigm will be necessary to tease apart these possibilities.
5.6. Conclusion
In conclusion, the preliminary results of a combined speech production and perception
experiment demonstrate that individual differences in acoustic variability and the perception of
subphonemic variation are, under certain conditions, related across speakers. Accuracy in the
discrimination of /ɹ/ tokens differing in F3 was found to correlate with measures of dispersion
across participants, with less variable participants exhibiting higher accuracy, when there was a
large acoustic difference between tokens. No relationship between variability and perceptual
discrimination was observed for either M1 or M4 in /s/, or for tokens of /ɹ/ that did not
exhibit as large an acoustic difference. This is despite the observation of group-level performance
that was above chance in every condition. Taken together, these results exhibit general alignment
with the predictions of a proposed extension of Dynamic Field Theory models of phonological
cognition, providing some support for the ability of this model to provide unified explanations
for individual differences in speech behavior across multiple modalities. They additionally
provide some general support for models of speech communication in which both speech
perception and production either utilize the same representational system or rely on a ‘common
currency’ for the maintenance of communicative parity.
6. Summary and conclusions
This dissertation focuses on the implications of individual differences in phonetic
variability for our understanding of the cognitive systems governing speech production and
perception. The set of empirical and modeling investigations presented here demonstrate that the
robust individual differences observed in acoustic and articulatory variability reflect the specific
encoding of variability for individual phonological units, and that these individual differences
can be incorporated into cognitive systems of phonological representation in a way that allows
them to explain behavior in speech perception as well as speech production. Through this, the
research presented in this dissertation provides support for the hypothesis that individual
differences in speech production and perception reflect variation in the cognitive representation
of phonological units across speakers.
Evidence for the encoding of individual differences in variability in the cognitive
representations of individual phonological units is provided by the study using the XRMB corpus
(Westbury, 1994) presented in Chapters 2 and 3 of the dissertation. The results of the analyses of
articulatory and acoustic variability presented in those chapters indicate that individual
differences are observed in a set of American English coronal consonants (/t/, /s/, /ʃ/, /l/, and /ɹ/),
with these differences widespread both in terms of the extent to which they reflect general
patterns of individual difference within the population of examined speakers and in terms of their
consistency across phonetic dimensions and segments. Of relevance to the implications of this
variability for phonological representation, differences are robustly observed across all measured
dimensions, including those (such as tongue tip constriction location and degree in articulation)
that reflect the achievement of phonological goals for the examined consonants. The more goal-
oriented articulatory dimensions in each segment were also observed to exhibit stronger
relationships between stochastic and contextual variability and weaker relationships across
different phonological segments than dimensions thought to be less directly related to the
achievement of articulatory goals for the examined segments. Combined with the failure to
observe a consistent relationship between speaker differences in phonetic variability and either
vocal tract morphology or prosodic variability, these results support an account in which
individual differences in the production of phonological segments reflect individual differences
in the cognitive representation of phonological units.
The aforementioned distinction between the behavior of the more goal-oriented and the
less goal-oriented articulatory dimensions in each segment also serves to support an account in
which the targets of units of phonological representation are defined in articulatory space. The
control of variability exhibits much more unit-internal coherency and much less cross-unit
generalizability for the articulatory dimensions most closely related to hypothetical phonological
goals for the examined consonants, while the examined acoustic dimensions did not pattern
differently depending upon their presumed phonological importance. This indication that the
control of variability reflects the phonological goals of a segment in articulation but not in
acoustics suggests that variability in articulatory, and not acoustic, targets is incorporated in the
cognitive representation of cognitive units. Parallel findings suggesting that interspeaker
differences in articulatory variability were somewhat recoverable from the acoustic signal, and
vice versa, suggest that these proposed representational differences may be of communicative
significance, even if the direct target space over which they manifest is articulatory instead of
acoustic.
The specific results of the empirical investigation in Chapters 2 and 3 suggested
important qualities that any model would have to possess in order to account for the observed
patterns of acoustic and articulatory variability. Specifically, it was determined that a model had
to incorporate both dynamical representations of articulatory target space and, due to the
relationship between stochastic and contextual variability, a degree of abstract invariance in the
representation of phonological units across contexts in order to account for the data. It also had
to incorporate a model architecture that was amenable to the introduction of representational
differences across speakers. Chapter 4 demonstrated that existing Dynamic Field Theory models
of phonological cognition (Gafos & Kirov, 2009; Roon & Gafos, 2016; Tilsen, 2019) fulfilled
these requirements when situated within the framework of Articulatory Phonology (Browman &
Goldstein, 1986, 1989, 1992, 2000). As shown in Chapter 4, an extension of this model that
explicitly incorporated variation in a gesture’s target region was able to generate both
interspeaker differences in stochastic variability and the specific relationship between stochastic
and contextual variability used to motivate the encoding of variability in specific phonological
units. Crucially, these phenomena were accounted for through a shared representational
mechanism, providing a unified explanation for multiple phenomena.
Although this model was able to adequately generate patterns observed in the data, the
decision (in the empirical analysis) to include all syllabic or prosodic positions in the same
calculation of contextual variability presents an obstacle to fully understanding how the various
factors defining the phonetic context interact with representational dynamics of the proposed
model. Future research that more thoroughly teases apart the specific effects of segmental
environment, word position, and prosodic position on the relationship between stochastic and
contextual variability will be necessary for further development of the proposed model, and for a
fuller understanding of how each of these factors interacts with individual differences in phonetic
variability. Work examining how the fluctuation of factors like emotional valence, speech rate,
and speech style interact with individual differences in phonological representation will similarly
engender better understanding of the complex dynamics at play in the interaction of different
factors that may impact the selection and achievement of production targets in speech. Relatedly,
how intra- and intergestural temporal relations should be handled in this model is
a major outstanding question that will be critically important to address in future work. This
question is pertinent to both our understanding of how the patterns of temporal variability
observed in speech should be implemented within this model and our understanding of the
relationship of the dynamics of gestural activation to target planning and movement execution.
In addition to generating patterns of individual difference in variability observed in the
XRMB corpus data, the proposed extension of Dynamic Field Theory models of phonological
cognition was also able to generate predictions about the relationship between individual
differences in production variability and behavioral differences in speech perception.
Specifically, based on the results of a simulation designed to mirror the process of perceptually
discriminating two tokens of the same phonological unit, the prediction was made that speakers
who were less variable in their production of a phonetic dimension would be more sensitive to
variability in the perception of that dimension within a particular phonological segment. It was
additionally predicted that this relationship between variability in production (itself reflecting
representational differences between speakers) and perceptual sensitivity would be modulated by
the strength of the mapping of the acoustic signal to the proposed articulatory planning space.
Preliminary results from a study testing these predictions suggest that they are borne out in
behavioral data, with a relationship between F3 variability and response accuracy observed for
large acoustic differences in F3 in /ɹ/, but not for differences in acoustic dimensions in /s/.
These results align with results from previously published research (e.g., Franken et al.,
2017; Perkell et al., 2008) to provide both specific support for the proposed extension of the
Dynamic Field Theory model of phonological cognition and more general support for models
positing a relationship between speech production and perception. Critically, they also point to
the potential for the proposed model to generalize to speech behavior more broadly. This
observation, in combination with existing knowledge about the role of variability in motor
learning and adaptation, has implications for the ability to predict individual differences in
adaptive behaviors, and perhaps other behaviors involved in speech communication, from the
variability observed in a speaker’s behavior at baseline. Variability in motor output has been
shown to play an important role in motor learning, with variability in motor performance
facilitating the development and refinement of new skills in non-speech human motor
behaviors (e.g., Wu, Miyamoto, Gonzalez Castro, Ölveczky, & Smith, 2014; see Dhawale,
Smith, & Ölveczky, 2017 for a review) and in communicative behaviors in non-human animals,
such as birdsong (e.g., Tumer & Brainard, 2007). Recent work on the acquisiton of L2 phonetics
(e.g., Huffman & Schuhmann, 2020) suggests that a similar relationship between variability and
success in learning may also appear in (at least some) speech behavior. Previous work on
accommodation in speech has also either implied (e.g., Babel, 2012; Walker & Campbell-Kibler,
2015) or explicitly shown (Lee et al., 2021) that individual differences in variability should/do
correlate with individual performance in accommodation to various acoustic dimensions. The
proposed model presents a potential way to account for individual differences among these and
other vectors of linguistic behavior through a single formal mechanism, suggesting interesting
avenues for the examination of connected behavioral dynamics in speech.
To conclude, this dissertation adopts an approach that utilizes the systematic investigation
of individual differences in articulatory and acoustic variability to investigate the cognitive
systems of phonological representation underlying speech. The empirical and modeling studies
in this dissertation provide evidence for the incorporation of individual differences in variability
into the representation of phonological units, and more generally support ‘hybrid’ models of
phonological cognition that incorporate both dynamic and invariant elements in phonological
representation. The empirical research presented here also provides support for approaches in
which the targets of phonological units are situated in articulatory space. This dissertation
highlights the importance of both variability and individual differences for our understanding of
phonological cognition and contributes to our understanding of how various behavioral
phenomena observed in speech may in fact arise from a common source.
References
Abbs, J.H., Gracco, V.L., & Cole, K.J. (1984). Control of multimovement coordination:
Sensorimotor mechanisms in speech motor programming. Journal of Motor Behavior,
16(2), 195-232.
Akeroyd, M.A., Moore, B.C.J., & Moore, G.A. (2001). Melody recognition using three types of
dichotic-pitch stimulus. Journal of the Acoustical Society of America, 110, 1498-1504.
Allen, J. S., Miller, J. L., & DeSteno, D. (2003). Individual talker differences in voice-onset-
time. Journal of the Acoustical Society of America, 113, 544-552.
Alwan, A., Narayanan, S., & Haker, K. (1997). Toward articulatory-acoustic models for liquid
approximants based on MRI and EPG data. Part II. The rhotics. Journal of the Acoustical
Society of America, 101(2), 1078–1089.
Amari, S. (1977). Dynamics of pattern formation in lateral-inhibition type neural fields.
Biological Cybernetics, 27, 77–87.
Amari, S., & Arbib, M. A. (1977). Competition and cooperation in neural nets. In J. Metzler
(Ed.), Systems neuroscience (pp. 119–165). New York: Academic Press.
Anderson, A., Lowit, A., & Howell, P. (2008). Temporal and spatial variability in speakers with
Parkinson’s Disease and Friedreich’s Ataxia. Journal of Medical Speech-Language
Pathology, 16(4), 173-180.
Anwyl-Irvine, A.L., Massonnié, J., Flitton, A., Kirkham, N., & Evershed, J.K. (2018). Gorilla in
our midst: An online behavioral experiment builder. Behavior Research Methods, 52,
388-407.
Atal, B.S., Chang, J.J., Mathews, M.V., & Tukey, J.W. (1978). Inversion of articulatory-to-
acoustic transformation in the vocal tract by a computer-sorting technique. Journal of the
Acoustical Society of America, 63(5), 1535-1555.
Baayen, R.H., Hendrix, P, & Ramscar, M. (2013). Sidestepping the combinatorial explosion:
Towards a processing model based on discriminative learning. Language and Speech, 56,
329-347.
Babel, M. (2012). Evidence for phonetic and social selectivity in spontaneous phonetic imitation.
Journal of Phonetics, 40(1), 177–189. https://doi.org/10.1016/J.WOCN.2011.09.001
Babel, M., & Munson, B. (2014). Producing socially meaningful linguistic variation. In M.
Goldrick, V. Ferreira, & M. Miozzo (Eds.), The Oxford handbook of language production
(pp. 308–325). Oxford University Press.
Bailly, G. (1997). Learning to speak. Sensori-motor control of speech movements. Speech
Communication, 22, 251-267.
Baker, A., Archangeli, D., & Mielke, J. (2011). Variability in American English s-retraction
suggests a solution to the actuation problem. Language Variation and Change, 23, 347–
374. https://doi.org/10.1017/S0954394511000135
Bakst, S. (2021). Palate shape influence depends on the segment: Articulatory and acoustic
variability in American English /ɹ/ and /s/. Journal of the Acoustical Society of America,
149(2), 971. https://doi.org/10.1121/10.0003379
Bakst, S., & Johnson, K. (2018). Modeling the effect of palate shape on the articulatory-acoustics
mapping. Journal of the Acoustical Society of America, 144(6), 3936-3949.
Bakst, S., & Lin, S. (2015) An ultrasound investigation into articulatory variation in American /r/
and /s/. In Proceedings of the 18th International Congress of Phonetic Sciences.
Glasgow: The University of Glasgow.
Barton, K. (2020). MuMIn: Multi-model inference. R package version 1.43.17. https://CRAN.R-
project.org/package=MuMIn.
Bates, D., Maechler, M., Bolker, B., & Walker, S. (2015). Fitting Linear Mixed-Effects Models
using lme4. Journal of Statistical Software, 67(1), 1-48. doi:10.18637/jss.v067.i01.
Baum, S. R., & Waldstein, R. S. (1991). Perseveratory coarticulation in the speech of profoundly
hearing-impaired and normally hearing children. Journal of Speech, Language, and
Hearing Research, 34, 1286-1292.
Beckman, M. E., Jung, T.-P., Lee, S.-H., de Jong, K., Krishnamurthy, A. K., Ahalt, S. C., Cohen,
K. B., and Collins, M. J. (1995). Variability in the production of quantal vowels revisited.
Journal of the Acoustical Society of America, 97, 471-490.
Beddor, P. S. (2009). A coarticulatory path to sound change. Language, 85(4), 785-832.
https://doi.org/10.1353/lan.0.0165.
Beddor, P. S., Coetzee, A., Styler, W., McGowan, K., & Boland, J. (2018). The time course of
individuals' perception of coarticulatory information is linked to their production:
Implications for sound change. Language, 94, 1-38.
https://doi.org/10.1353/lan.2018.0051.
Beddor, P. S., Harnsberger, J. D., & Lindemann, S. (2002). Language-specific patterns
of vowel-to-vowel coarticulation: Acoustic structures and their perceptual correlates.
Journal of Phonetics, 30(4), 591–627. https://doi.org/10.1006/jpho.2002.0177
Beddor, P. S., & Krakow, R. A. (1999). Perception of coarticulatory nasalization by speakers of
English and Thai: Evidence for partial compensation. Journal of the Acoustical Society of
America, 106, 2868-2887.
Bishop, J. (2020). Exploring the similarity between implicit and explicit prosody: Prosodic
phrasing and individual differences. Language and Speech. Advance online publication.
https://doi.org/10.1177/0023830920972732
Bladon, R. A. W., & Nolan, F. (1977). A video-fluorographic investigation of tip and blade
alveolars in English. Journal of Phonetics, 5, 185-193.
Blumstein, S. E., & Stevens, K. N. (1979). Acoustic invariance in speech production: Evidence
from measurements of the spectral characteristics of stop consonants. Journal of the
Acoustical Society of America, 66(4), 1001-1017.
Blumstein, S. E., & Stevens, K. N. (1980). Perceptual invariance and onset spectra for stop
consonants in different vowel environments. Journal of the Acoustical Society of
America, 67, 648-662.
Boersma, P., & Weenink, D. (2020). Praat: doing phonetics by computer (Version 6.1.16)
[Computer software]. http://www.praat.org/.
Bouchard, K. E., & Chang, E. F. (2014). Control of Spoken Vowel Acoustics and the Influence
of Phonetic Context in Human Speech Sensorimotor Cortex. Journal of Neuroscience, 34,
12662-12677. https://doi.org/10.1523/JNEUROSCI.1219-14.2014
Boyce, S., & Espy-Wilson, C. (1997). Coarticulatory stability in American English /r/. Journal of
the Acoustical Society of America, 101(6), 3741-3753.
Browman, C. P., & Goldstein, L. M. (1985). Dynamic modeling of phonetic structure. In V.
Fromkin (ed.), Phonetic Linguistics (pp. 35-53). New York: Academic.
Browman, C. P., & Goldstein, L. M. (1986). Towards an articulatory phonology. In C. Ewen &
J. Anderson (eds.), Phonology Yearbook 3. Cambridge: Cambridge University Press,
219-252.
Browman, C. P., & Goldstein, L. (1989). Articulatory Gestures as Phonological Units. Haskins
Laboratories Status Report on Speech Research, 100, 99–69.
Browman, C. P., & Goldstein, L. (1992). Articulatory Phonology: An Overview. Haskins
Laboratories Status Report on Speech Research. New Haven, Connecticut.
Browman, C.P., & Goldstein, L. (2000). Competing constraints on intergestural coordination and
self-organization of phonological structures. Bulletin de la Communication Parlée, no. 5,
pp. 25-34.
Brown, M.B., & Forsythe, A.B. (1974). Robust tests for the equality of variances. Journal of the
American Statistical Association, 69, 364-367. doi:10.1080/01621459.1974.10482955
Brunner, J., Fuchs, S., & Perrier, P. (2009). On the relationship between palate shape and
articulatory behavior. Journal of the Acoustical Society of America, 125(6), 3936–3949.
https://doi.org/10.1121/1.3125313
Brunner, J., Ghosh, S., Hoole, P., Matthies, M., Tiede, M., & Perkell, J. (2011). The Influence of
Auditory Acuity on Acoustic Variability and the Use of Motor Equivalence During
Adaptation to a Perturbation. Journal of Speech, Language, and Hearing Research, 54,
1–13. https://doi.org/10.1044/1092-4388(2010/09-0256)
Bürki, A. (2018). Variation in the speech signal as a window into the cognitive architecture
of language production. Psychonomic Bulletin and Review, 25(6), 1973-2004.
Buschmeier, H., & Włodarczak, M. (2013). TextGridTools: A TextGrid processing and analysis
toolkit for Python. In Proceedings der 24. Konferenz zur Elektronischen
Sprachsignalverarbeitung (pp. 152-157), Bielefeld, Germany.
Butcher, A., & Weiher, E. (1976). An electropalatographic investigation of coarticulation in
VCV sequences. Journal of Phonetics, 4, 59-74. https://doi.org/10.1016/S0095-
4470(19)31222-7
Byrd, D. (1992). Sex, dialects and reduction. In J.J. Ohala, T.M. Neary, B.L. Derwing, M.M.
Hodge & G. E. Wiebe (Eds.), Proceedings of the International Conference on Spoken
Language Processing, vol. 2 (pp. 827-830).
Byrd, D. (1994). Articulatory Timing in English Consonant Sequences. Ph.D. Dissertation,
University of California, Los Angeles.
Byrd, D. (1996). A Phase Window Framework for Articulatory Timing. Phonology, 13(2), 139–
169.
Byrd, D. (2000) Articulatory vowel lengthening and coordination at phrasal junctures.
Phonetica, 57(1), 3-16.
Byrd, D., Krivokapić, J., & Lee, S. (2006). How far, how long: On the temporal scope of
prosodic boundary effects. Journal of the Acoustical Society of America, 120(3), 1589-
1599.
Byrd, D., & Saltzman, E. (1998). Intergestural dynamics of multiple prosodic boundaries.
Journal of Phonetics, 26, 173–199.
Byrd, D., & Saltzman, E. (2003). The elastic phrase: Modeling the dynamics of boundary-
adjacent lengthening. Journal of Phonetics, 31(2), 149–180. https://doi.org/10.1016/S0095-
4470(02)00085-2.
Byrd, D., & Tan, C.C. (1996). Saying consonant clusters quickly. Journal of Phonetics, 24, 263-
282.
Campbell, F., Gick, B., Wilson, I., & Vatikiotis-Bateson, E. (2010). Spatial and temporal
properties of gestures in North American English /r/. Language and Speech, 53(1), 49–
69. https://doi.org/10.1177/0023830909351209
Carignan, C. (2019). A network-modeling approach to investigating individual differences in
articulatory-to-acoustic relationship strategies. Speech Communication, 108, 1-14.
https://doi.org/10.1016/j.specom.2019.01.007
Casserly, E. D., & Pisoni, D. B. (2010). Speech perception and production. Wiley
Interdisciplinary Reviews. Cognitive Science, 1(5), 629–647.
https://doi.org/10.1002/wcs.63
Chait, M., Poeppel, D., & Simon, J. Z. (2006). Neural response correlates ofdetection
ofmonaurally and binaurally created pitches in humans. Cerebral Cortex, 16(6), 835–858.
https://doi.org/10.1093/cercor/bhj027
Champely, S. (2020). pwr: Basic Functions for Power Analysis (Version 1.3-0). [R
package]. https://cran.r-project.org/web/packages/pwr/index.html
Chao, S.-C., Ochoa, D., & Daliri, A. (2019). Production Variability and Categorical Perception
of Vowels Are Strongly Linked. Frontiers in Human Neuroscience, 13, Article 96.
https://doi.org/10.3389/fnhum.2019.00096
Chartier, J., Anumanchipalli, G. K., Johnson, K., & Chang, E. F. (2018). Encoding of
Articulatory Kinematic Trajectories in Human Speech Sensorimotor Cortex. Neuron, 98,
1–13. https://doi.org/10.1016/j.neuron.2018.04.031
Cheng, H. S., Niziolek, C. A., Buchwald, A., & McAllister, T. (2021). Examining the
Relationship Between Speech Perception, Production Distinctness, and Production
Variability. Frontiers in Human Neuroscience, 15, Article 660948.
https://doi.org/10.3389/fnhum.2021.660948
Cho, T. (2006). Manifestation of prosodic structure in articulatory variation: Evidence from lip
kinematics in English. In L. Goldstein, D. H. Whalen, & C. T. Best (Eds.), Laboratory
Phonology 8 (pp. 519–548). Mouton de Gruyter.
Chodroff, E., & Wilson, C. (2017). Structure in talker-specific phonetic realization: Covariation
of stop consonant VOT in American English. Journal of Phonetics, 61, 30–47.
https://doi.org/10.1016/j.wocn.2017.01.001
Churchland, M.M., Afshar, A., & Shenoy, K.V. (2006). A central source of movement
variability. Neuron, 52(6), 1085-1096. https://doi.org/10.1016/j.neuron.2006.10.034
Clayards, M. (2018). Differences in cue weights for speech perception are correlated for
individuals within and across contrasts. Journal of the Acoustical Society of America,
144, EL172–EL177. https://doi.org/10.1121/1.5052025
Clopper, C. G., & Turnbull, R. (2018). Exploring variation in phonetic reduction: Linguistic,
social, and cognitive factors. In F. Cangemi, M. Clayards, O. Niebuhr, B. Schuppler, &
M. Zellers (Eds.), Rethinking Reduction: Interdisciplinary perspectives on conditions,
mechanisms, and domains for phonetic variation (pp. 25–72). De Gruyter Mouton.
https://doi.org/10.1515/9783110524178-002
Coetzee, A. W., Beddor, P. S., Shedden, K., Styler, W., & Wissing, D. (2018). Plosive voicing in
Afrikaans: Differential cue weighting and tonogenesis. Journal of Phonetics, 66, 185-
216. https://doi.org/10.1016/j.wocn.2017.09.009.
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. New York, NY:
Routledge Academic
Cole, J. (2015). Prosody in context: A review. Language, Cognition and Neuroscience, 30(1-2),
1-31. https://doi.org/10.1080/23273798.2014.963130
Cole, J. Mo, Y., & Baek, S. (2010). The role of syntactic structure in guiding prosody perception
with ordinary listeners and everyday speech. Language and Cognitive Processes, 25(7),
1141–1177. https://doi.org/10.1080/01690960903525507
Cramer, E. M., & Huggins, W. H. (1958). Creation of Pitch through Binaural Interaction.
Journal of the Acoustical Society of America, 30(5), 412–417.
https://doi.org/10.1121/1.1909628
Crystal, D. (Ed.). (2008). Segment. In A dictionary of linguistics and phonetics (6th ed.).
Blackwell Pub.
Crystal, T. H., & House, A. S. (1982). Segmental durations in connected speech signals:
Preliminary results. The Journal of the Acoustical Society of America, 72, 705 - 716.
https://doi.org/10.1121/1.388251
Cusumano, J.P., & Dingwell, J.B. (2013). Movement variability near goal equivalent manifolds:
Fluctuations, control, and model-based analysis. Human Movement Science, 32, 899-923.
Dag, O., Dolgun, A., & Konar, N.M. (2018). onewaytests: An R package for one-way tests in
independent groups designs. The R Journal, 10(1), 175-199. https://doi.org/10.32614/RJ-2018-022
Daniloff, R., Schuckers, G., & Feth, L. (1980). The physiology of speech and hearing: An
introduction. Englewood Cliffs, NJ: Prentice-Hall.
Dart, S. N. (1991). Articulatory and Acoustic Properties of Apical and Laminal Articulations.
[Doctoral Dissertation, University of California, Los Angeles].
https://escholarship.org/uc/item/52f5v2x2
Dart, S. N. (1998). Comparing French and English coronal consonant articulation. Journal of
Phonetics, 26, 71–94. https://doi.org/10.1006/jpho.1997.0060
De Decker, P. M., & Nycz, J. R. (2012). Are tense [æ]s really tense? The mapping between
articulation and acoustics. Lingua, 122(7), 810-821.
https://doi.org/10.1016/j.lingua.2012.01.003
Dediu, D., & Moisik, S. R. (2019). Pushes and pulls from below: Anatomical variation,
articulation and sound change. Glossa, 4(1), Article 7. https://doi.org/10.5334/gjgl.646
Delattre, P., & Freeman, D. C. (1968). A dialect study of American r's by x-ray motion picture.
Linguistics, 44, 29-68. https://doi.org/10.1515/ling.1968.6.44.29
Dhawale, A. K., Smith, M. A., & Ölveczky, B. P. (2017). The Role of Variability in Motor
Learning. Annual Review of Neuroscience, 40(1), 479–498.
https://doi.org/10.1146/annurev-neuro-072116-031548
Docherty, G., & Mendoza-Denton, N. (2012). Speaker-related variation—Sociophonetic factors.
In A. C. Cohn, C. Fougeron, & M. K. Huffman (Eds.), Oxford handbook of laboratory
phonology (pp. 43–60). Oxford, UK: Oxford University Press.
https://doi.org/10.1093/oxfordhb/9780199575039.013.0004
Edwards, J., Beckman, M. E., & Fletcher, J. (1991). The articulatory kinematics of final
lengthening. Journal of the Acoustical Society of America, 89, 369-382.
Erickson, R.P. (1974). Parallel ‘population’ neural coding in feature extraction. In F.O. Schmitt
& F.G. Worden (Eds.), The Neurosciences: Third Study Program (pp. 155-169). Cambridge,
MA: MIT Press.
Erlhagen, W., & Schöner, G. (2002). Dynamic field theory of movement preparation.
Psychological Review, 109(3), 545-572. https://doi.org/10.1037/0033-295x.109.3.545
Erlhagen, W., Bastian, A., Jancke, D., Riehle, A., & Schöner, G. (1999). The distribution of
neuronal population activation (DPA) as a tool to study interaction and integration in
cortical representations. Journal of Neuroscience Methods, 94, 53–66.
https://doi.org/10.1016/s0165-0270(99)00125-9
Ernestus, M. (2014). Acoustic reduction and the roles of abstractions and exemplars in speech
processing. Lingua, 142, 27–41. https://doi.org/10.1016/j.lingua.2012.12.006
Everitt, B. (1998). The Cambridge Dictionary of Statistics. Cambridge, UK: Cambridge
University Press. ISBN 978-0521593465.
Fadiga, L., Fogassi, L., Pavesi, G., and Rizzolatti, G. (1995). Motor facilitation during action
observation: A magnetic stimulation study. Journal of Neurophysiology, 73(6), 2608-2611.
https://doi.org/10.1152/jn.1995.73.6.2608
Fink, A., & Goldrick, M. (2015). The influence of word retrieval and planning on phonetic
variation: Implications for exemplar models. Linguistics Vanguard, 1(1), 215–225.
Walter de Gruyter GmbH. https://doi.org/10.1515/lingvan-2015-1003
Fitch, W. T., & Giedd, J. (1999). Morphology and development of the human vocal tract: a study
using magnetic resonance imaging. Journal of the Acoustical Society of America, 106(3),
1511-1522. https://doi.org/10.1121/1.427148
Flege, J. E. (1988). Anticipatory and carry-over nasal coarticulation in the speech of children and
adults. Journal of Speech and Hearing Research, 31, 525-536.
https://doi.org/10.1044/jshr.3104.525
Flege, J. E. (1989). Differences in inventory size affect the location but not the precision of
tongue positioning in vowel production. Language and Speech, 32(2), 123–147.
https://doi.org/10.1177/002383098903200203
Flege, J. E. (2007). Language contact in bilingualism: Phonetic system interactions. In J. Cole &
J. I. Hualde (Eds.), Laboratory Phonology 9 (pp. 353–380). Berlin: Mouton de Gruyter.
Flege, J. E., & Eefting, W. (1987). Cross-language switching in stop consonant perception and
production by Dutch speakers of English. Speech Communication, 6, 185–202.
https://doi.org/10.1016/0167-6393(87)90025-2
Flege, J. E., Schirru, C., & MacKay, I. (2003). Interaction between the native and second
language phonetic subsystems. Speech Communication, 40(4), 467–491.
https://doi.org/10.1016/S0167-6393(02)00128-0
Flipsen, P., Shriberg, L., Weismer, G., Karlsson, H., & McSweeny, J. (1999). Acoustic
characteristics of /s/ in adolescents. Journal of Speech, Language, and Hearing Research,
42(3), 663-677. http://doi.org/10.1016/j.psychsport.2013.04.005
Forrest, K., Weismer, G., Milenkovic, P. & Dougall, R.N. (1988). Statistical analysis of word-
initial voiceless obstruents: Preliminary data. Journal of the Acoustical Society of
America, 84(1), 115-123. https://doi.org/10.1121/1.396977
Fougeron, C., & Keating, P. A. (1997). Articulatory strengthening at edges of prosodic domains.
Journal of the Acoustical Society of America, 101(6), 3728–3740.
http://doi.org/10.1121/1.418332
Fougeron, C. & Jun, S.-A. (1998). Rate effects on French intonation: prosodic organization and
phonetic realization. Journal of Phonetics, 26(1), 45-69.
https://doi.org/10.1006/jpho.1997.0062
Foulkes, P., Scobbie, J. M., & Watt, D. (2010). Sociophonetics. In W. Hardcastle, J. Laver, & F.
Gibbon (Eds.), The Handbook of Phonetic Sciences (2nd ed., pp. 703–754). Oxford:
Blackwell. http://doi.org/10.1002/9781444317251.ch19
Fourakis, M. (1991). Tempo, stress, and vowel reduction in American English. The Journal of
the Acoustical Society of America, 90(4), 1816-1827. https://doi.org/10.1121/1.401662
Fowler, C. A. (1980). Coarticulation and theories of extrinsic timing. Journal of Phonetics, 8(1),
113–133. https://doi.org/10.1016/S0095-4470(19)31446-9
Fowler, C. A. (1981). Production and perception of coarticulation among stressed and unstressed
vowels. Journal of Speech, Language, and Hearing Research, 24(1), 127-139.
https://doi.org/10.1044/jshr.2401.127
Fowler, C. A. (1986). An event approach to a theory of speech perception from a direct-realist
perspective. Journal of Phonetics, 14(1), 3-28.
https://doi.org/10.1016/S0095-4470(19)30607-2
Fowler, C. A., Brown, J. M., Sabadini, L., & Weihing, J. (2003). Rapid access to speech gestures
in perception: Evidence from choice and simple response time tasks. Journal of Memory
and Language, 49(3), 396-413. https://doi.org/10.1016/S0749-596X(03)00072-X
Fowler, C. A., & Dekle, D. J. (1991). Listening with eye and hand: Cross-modal contributions to
speech perception. Journal of Experimental Psychology: Human Perception and
Performance, 17(3), 816-828. https://doi.org/10.1037/0096-1523.17.3.816
Fowler, C. A., & Saltzman, E. (1993). Coordination and coarticulation in speech production.
Language and Speech, 36(2-3), 171-195. https://doi.org/10.1177/002383099303600304
Fowler, C. A., Sramko, V., Ostry, D., Rowland, S., & Hallé, P. (2008). Cross language phonetic
influences on the speech of French-English bilinguals. Journal of Phonetics, 36(4), 649–
663. https://doi.org/10.1016/j.wocn.2008.04.001
Fox, R. A., & Nissen, S. L. (2005). Sex-related acoustic changes in voiceless English fricatives.
Journal of Speech, Language, and Hearing Research, 48, 753-765.
https://doi.org/10.1044/1092-4388(2005/052)
Franken, M., Acheson, D., McQueen, J., Eisner, F., & Hagoort, P. (2017). Individual variability
as a window on production-perception interactions in speech motor control. Journal of
the Acoustical Society of America, 142(4), 2007-2018. https://doi.org/10.1121/1.5006899
Fuchs, S., Petrone, C., Krivokapić, J., & Hoole, P. (2013). Acoustic and respiratory evidence for
utterance planning in German. Journal of Phonetics, 41(1), 29-47.
https://doi.org/10.1016/j.wocn.2012.08.007
Fuchs, S., Winkler, R., & Perrier, P. (2008, December). Do speakers' vocal tract geometries
shape their articulatory vowel space? In R. Sock, S. Fuchs, & Y. Laprie
(Eds.), Proceedings of the International Seminar on Speech Production (pp. 333-336).
Strasbourg, France: INRIA.
Fujimura, O. (1986). Relative invariance of articulatory movements: An iceberg model. In J. S.
Perkell & D. H. Klatt (Eds.), Invariance and Variability in Speech Processes (p. 226-
242). Hillsdale, NJ: Lawrence Erlbaum Associates.
Fukaya, T., & Byrd, D. (2005). An articulatory examination of word-final flapping at phrase
edges and interiors. Journal of the International Phonetic Association, 35(1), 45–58.
https://doi.org/10.1017/S002510030500189
Gafos, A., & Kirov, C. (2009). A dynamical model of change in phonological representations:
The case of lenition. In J. Chitoran, E. Marsico, F. Pellegrino, & C. Coupé (Eds.),
Approaches to Phonological Complexity (pp. 219-240). Berlin/New York: Mouton de
Gruyter.
Gay, T. (1968). Effect of speaking rate on diphthong formant movements. The Journal of the
Acoustical Society of America, 44(6), 1570-1573. https://doi.org/10.1121/1.1911298
Gay, T. (1977). Articulatory movements in VCV sequences. Journal of the Acoustical Society of
America, 62(1), 183-193. https://doi.org/10.1121/1.381480
Gay, T. (1978). Effect of speaking rate on vowel formant movements. Journal of the Acoustical
Society of America, 63, 223-230. https://doi.org/10.1121/1.381717
Georgopoulos, A. P. (1995). Motor cortex and cognitive processing. In M. S. Gazzaniga (Ed.),
The Cognitive Neurosciences (pp. 507–517). Cambridge, MA: MIT Press.
Georgopoulos, A. P., Schwartz, A. B., & Kettner, R. E. (1986). Neuronal population coding of
movement direction. Science, 233(4771), 1416–1419.
https://doi.org/10.1126/science.3749885
Ghosh, S. S., Matthies, M. L., Maas, E., Hanson, A., Tiede, M., Ménard, L., Guenther, F. H.,
Lane, H., & Perkell, J. S. (2010). An investigation of the relation between sibilant
production and somatosensory and auditory acuity. The Journal of the Acoustical Society
of America, 128(5), 3079–3087. https://doi.org/10.1121/1.3493430
Gibson, E. J. (1969). Principles of perceptual learning and development. New York: Appleton-
Century Crofts.
Gibson, E. J. (1988). Exploratory behavior in the development of perceiving, acting, and the
acquiring of knowledge. Annual Review of Psychology, 39, 1-41.
https://doi.org/10.1146/annurev.ps.39.020188.000245
Gibson, J.J. (1966). The senses considered as perceptual systems. Boston: Houghton-Mifflin.
Gibson, J.J. (1977). The ecological approach to visual perception. Boston: Houghton Mifflin
Company.
Gick, B. (1999). A gesture-based account of intrusive consonants in English. Phonology, 16(1),
29-54. https://doi.org/10.1017/S0952675799003693
Gick, B., Campbell, F., Oh, S., & Tamburri-Watt, L. (2006). Towards universals in the gestural
organization of syllables: A cross-linguistic study of liquids. Journal of Phonetics, 34,
49-72. https://doi.org/10.1016/j.wocn.2005.03.005
Gick, B., Stavness, I., & Chiu, C. (2013). Coarticulation in a whole event model of speech
production. Proceedings of Meetings on Acoustics, 19, Article 060207.
https://doi.org/10.1121/1.4799482
Goldinger, S. D. (1996). Words and voices: episodic traces in spoken word identification and
recognition memory. Journal of Experimental Psychology: Learning, Memory, and
Cognition, 22(5), 1166-1183. https://doi.org/10.1037/0278-7393.22.5.1166
Goldinger, S. D. (1998). Echoes of echoes? An episodic theory of lexical access. Psychological
Review, 105(2), 251-279. https://doi.org/10.1037/0033-295x.105.2.251
Goldstein, L., Byrd, D., & Saltzman, E. (2006). The role of vocal tract gestural action units in
understanding the evolution of phonology. In M. Arbib (ed.), From action to language:
The mirror neuron system (pp. 215-249). Cambridge: Cambridge University Press.
Goldstein, L., & Fowler, C. A. (2003). Articulatory Phonology: A phonology for public language
use. In N. O. Schiller & A. S. Meyer (Eds.), Phonetics and Phonology in Language
Comprehension and Production (pp. 159–207). Mouton de Gruyter.
Greenwald, A. G. (1970). Sensory feedback mechanisms in performance control: With special
reference to the ideo-motor mechanism. Psychological Review, 77(2), 73-99.
https://doi.org/10.1037/h0028689
Grosvald, M. (2009). Interspeaker variation in the extent and perception of long-distance vowel-
to-vowel coarticulation. Journal of Phonetics, 37(2), 173-188.
https://doi.org/10.1016/j.wocn.2009.01.002
Grosvald, M., & Corina, D. (2012). The production and perception of sub-phonemic vowel
contrasts and the role of the listener in sound change. In Solé, M.-J., Recasens, D. (Eds.),
The initiation of sound change: Production, perception, and social factors (pp. 77-100).
Amsterdam: John Benjamins.
Guenther, F. H. (1994). A neural network model of speech acquisition and motor equivalent
speech production. Biological Cybernetics, 72(1), 43–53.
https://doi.org/10.1007/BF00206237
Guenther, F. H. (1995). Speech sound acquisition, coarticulation, and rate effects in a neural
network model of speech production. Psychological Review, 102(3), 594–621.
https://doi.org/10.1037/0033-295X.102.3.594
Guenther, F. H. (2016). Neural Control of Speech. MIT Press.
https://doi.org/10.7551/mitpress/10471.003.0006
Guenther, F. H., Ghosh, S. S., & Tourville, J. A. (2006). Neural modeling and imaging of the
cortical interactions underlying syllable production. Brain and Language, 96(3), 280–
301. https://doi.org/10.1016/j.bandl.2005.06.001.
Guenther, F. H., Hampson, M., & Johnson, D. (1998). A theoretical investigation of reference
frames for the planning of speech movements. Psychological Review, 105(4), 611–633.
https://doi.org/10.1037/0033-295X.105.4.611
Guenther, F. H., Husain, F. T., Cohen, M. A., & Shinn-Cunningham, B. G. (1999). Effects of
categorization and discrimination training on auditory perceptual space. Journal of the
Acoustical Society of America, 106, 2900-2912. https://doi.org/10.1121/1.428112
Haar, S., Donchin, O., & Dinstein, I. (2017). Individual movement variability magnitudes are
explained by cortical neural variability. Journal of Neuroscience, 37(37), 9076–9085.
https://doi.org/10.1523/JNEUROSCI.1650-17.2017
Hagiwara, R. (1995). Acoustic realizations of American /r/ as produced by women and men.
UCLA Working Papers in Phonetics, 90, 1-187.
https://escholarship.org/uc/item/8779b7gq
Hampel, F. R. (1971). A general qualitative definition of robustness. The Annals of Mathematical
Statistics, 42(6), 1887-1896. https://doi.org/10.1214/aoms/1177693054
Harper, S., Goldstein, L., & Narayanan, S. (2020). Variability in individual constriction
contributions to third formant values in American English /ɹ/. Journal of the Acoustical
Society of America, 147(6), 3905-3916. https://doi.org/10.1121/10.0001413
Harrington, J., Kleber, F., & Reubold, U. (2008). Compensation for coarticulation, /u/-fronting,
and sound change in standard southern British: An acoustic and perceptual study. Journal
of the Acoustical Society of America, 123(5), 2825-35. https://doi.org/10.1121/1.2897042.
Harris, C. M., & Wolpert, D. M. (1998). Signal-dependent noise determines motor planning.
Nature, 394(6695), 780-784. https://doi.org/10.1038/29528
Hauser, I. (2019). Effects of phonological contrast on within-category phonetic variation.
[Doctoral Dissertation, UMass Amherst]. https://doi.org/10.7275/14836073
Hendrix, P., Bolger, P., & Baayen, H. (2017). Distinct ERP signatures of word frequency, phrase
frequency, and prototypicality in speech production. Journal of Experimental
Psychology: Learning, Memory, and Cognition, 43(1), 128-149.
https://doi.org/10.1037/a0040332
Heselwood, B., & Plug, L. (2011). The role of F2 and F3 in the perception of rhoticity: Evidence
from listening experiments. In Proceedings of the 17th International Congress of
Phonetic Sciences (pp. 867-870).
Hillenbrand, J., Getty, L.A., Clark, M.J., & Wheeler, K. (1995). Acoustic characteristics of
American English vowels. Journal of the Acoustical Society of America, 97(5), 3099-
3111. https://doi.org/10.1121/1.411872
Hock, H. S., Schöner, G., & Giese, M. (2003). The dynamical foundations of motion pattern
formation: Stability, selective adaptation, and perceptual continuity. Perception &
Psychophysics, 65(3), 429-457. https://doi.org/10.3758/BF03194574
Hommel, B., Müsseler, J., Aschersleben, G., & Prinz, W. (2001). The theory of event coding
(TEC): A framework for perception and action planning. Behavioral and Brain Sciences,
24, 849-937. https://doi.org/10.1017/s0140525x01000103
Honda, K., Maeda, S., Hashi, M., Dembowski, J.S., & Westbury, J.R. (1996). Human palate and
related structures: Their articulatory consequences. In Proceedings of the International
Conference on Spoken Language Processing (pp. 784-787).
Honorof, D. N., Weihing, J., & Fowler, C. A. (2011). Articulatory events are imitated under
rapid shadowing. Journal of Phonetics, 39, 18–38.
https://doi.org/10.1016/j.wocn.2010.10.007.
Houde, J. F., & Jordan, M. I. (2002). Sensorimotor adaptation of speech I: Compensation and
adaptation. Journal of Speech, Language and Hearing Research, 45(2), 295-310.
https://doi.org/10.1044/1092-4388(2002/023)
Huffman, M. K., & Schuhmann, K. (2020, December). The relation between L1 and L2 category
compactness and L2 VOT learning. In Proceedings of Meetings on Acoustics, 42(1),
Article 060011. https://doi.org/10.1121/2.0001421
Hyndman, R. J., & Fan, Y. (1996). Sample quantiles in statistical packages. The American
Statistician, 50(4), 361-365. https://doi.org/10.1080/00031305.1996.10473566
Iskarous, K., McDonough, J., & Whalen, D. H. (2012). A gestural account of the velar fricative
in Navajo. Laboratory Phonology, 3(1). https://doi.org/10.1515/lp-2012-0011
Iskarous, K., Fowler, C. A., & Whalen, D. H. (2010). Locus equations are an acoustic expression
of articulator synergy. Journal of the Acoustical Society of America, 128(4), 2021-2032.
https://doi.org/10.1121/1.3479538.
Iskarous, K., Nam, H., & Whalen, D. H. (2010). Perception of articulatory dynamics from
acoustic signatures. Journal of the Acoustical Society of America, 127(6), 3717–3728.
https://doi.org/10.1121/1.3409485
Iskarous, K., Shadle, C. H., & Proctor, M. I. (2011). Articulatory-acoustic kinematics: the
production of American English /s/. Journal of the Acoustical Society of America, 129(2),
944–954. https://doi.org/10.1121/1.3514537
Jacewicz, E., Fox, R. A., O’Neill, C., & Salmons, J. (2009). Articulation rate across dialect, age,
and gender. Language Variation and Change, 21(2), 233.
https://doi.org/10.1017/S0954394509990093
Jadoul, Y., Thompson, B., & De Boer, B. (2018). Introducing Parselmouth: A Python interface to
Praat. Journal of Phonetics, 71, 1-15.
Johnson, J. S., Spencer, J. P., & Schöner, G. (2008). Moving to higher ground: The dynamic
field theory and the dynamics of visual cognition. New Ideas in Psychology, 26(2), 227–
251. https://doi.org/10.1016/j.newideapsych.2007.07.007
Johnson, K. (1997). Speech perception without speaker normalization. In K. Johnson & J. W.
Mullennix (Eds.), Talker variability in speech processing (pp. 145-166). San Diego:
Academic Press.
Johnson, K. (2006). Resonance in an exemplar-based lexicon: The emergence of social identity
and phonology. Journal of Phonetics, 34(4), 485-499.
https://doi.org/10.1016/j.wocn.2005.08.004
Johnson, K. (2018). Speech production patterns in producing linguistic contrasts are partly
determined by individual differences in anatomy. UC Berkeley PhonLab Annual Report,
14(1), 256-282. https://doi.org/10.5070/P7141042483
Johnson, K., Ladefoged, P., & Lindau, M. (1993). Individual differences in vowel production.
Journal of the Acoustical Society of America, 94(2), 701–714.
https://doi.org/10.1121/1.406887
Jongman, A., Wayland, R., & Wong, S. (2000). Acoustic characteristics of English fricatives.
Journal of the Acoustical Society of America, 108(3), 1252-1263.
https://doi.org/10.1121/1.1288413
Kataoka, R. (2011). Phonetic and cognitive bases of sound change. [Doctoral Dissertation,
University of California, Berkeley].
Keating, P. A. (1990). The window model of coarticulation: Articulatory evidence. In J.
Kingston and M. E. Beckman (Ed.), Papers in Laboratory Phonology I (pp. 451–470).
New York, NY: Cambridge University Press.
Keating, P. A. (1996). The Phonology-Phonetics Interface. In U. Kleinhenz (Ed.), Interfaces in
Phonology (pp. 262–278). Berlin: Akademie Verlag.
Kelso, J. A. S., Vatikiotis-Bateson, E., Saltzman, E. L., & Kay, B. (1985). A qualitative dynamic
analysis of reiterant speech production: Phase portraits, kinematics, and dynamic
modeling. Journal of the Acoustical Society of America, 77(1), 266–280.
https://doi.org/10.1121/1.392268
Kent, R. D., & Moll, K. L. (1972). Cinefluorographic analyses of selected lingual consonants.
Journal of Speech and Hearing Research, 15, 453-473.
https://doi.org/10.1044/jshr.1503.453.
Kenyon, J. S. (1924). American Pronunciation: A Text-book of Phonetics for Students of English.
G. Wahr.
Kim, Y. H., & Hazan, V. (2010). Individual variability in the perceptual learning of L2 speech
sounds and its cognitive correlates. In Proceedings of the 6th International Symposium
on the Acquisition of Second Language Speech, Poznań, Poland.
Kim, J. (2020). Individual Differences in the Production and Perception of Prosodic Boundaries
in American English [Doctoral Dissertation]. University of Michigan, Ann Arbor.
Kirov, C., & Gafos, A. (2007). Dynamic Phonetic Detail in Lexical Representations. In
Proceedings of the 16th International Congress of Phonetic Sciences (pp. 637–640).
Saarbrücken, Germany.
Kishimoto, K., & Amari, S. (1979). Existence and stability of local excitations in homogeneous
neural fields. Journal of Mathematical Biology, 7(4), 303–318.
https://doi.org/10.1007/BF00275151
Klatt, D.H. (1975). Voice onset time, frication, and aspiration in word-initial consonant clusters.
Journal of Speech and Hearing Research, 18(4), 686-706.
https://doi.org/10.1044/jshr.1804.686
Kleber, F., Harrington, J., & Reubold, U. (2012). The relationship between the perception and
production of coarticulation during a sound change in progress. Language and
Speech, 55(3), 383-405.
Kleinow, J., Smith, A., & Ramig, L. O. (2001). Speech motor stability in IPD: Effects of rate and
loudness manipulations. Journal of Speech, Language, and Hearing Research, 44, 1041-
1051.
Kong, E. J., & Edwards, J. (2016). Individual differences in categorical perception of speech:
Cue weighting and executive function. Journal of Phonetics, 59, 40–57.
https://doi.org/10.1016/j.wocn.2016.08.006
Kopecz, K., & Schöner, G. (1995). Saccadic motor planning by integrating visual information
and pre-information on neural dynamic fields. Biological Cybernetics, 73(1), 49-60.
https://doi.org/10.1007/BF00199055
Krakow, R. A. (1989). The Articulatory Organization of Syllables: A Kinematic Analysis of
Labial and Velar Gestures [Doctoral Dissertation]. Yale University.
Krivokapić, J. (2007). The planning, production, and perception of prosodic structure [Doctoral
dissertation]. University of Southern California.
Kuehn, D. P., & Moll, K. L. (1976). A cineradiographic study of VC and CV articulatory
velocities. Journal of Phonetics, 4(4), 303–320. https://doi.org/10.1016/s0095-
4470(19)31257-4
Kuznetsova, A., Brockhoff, P. B., & Christensen, R. H. (2017). lmerTest package: Tests in linear
mixed effects models. Journal of Statistical Software, 82(13), 1-26.
https://doi.org/10.18637/jss.v082.i13
Labov, W., Ash, S., & Boberg, C. (2008). The Atlas of North American English: Phonetics,
Phonology and Sound Change. Walter de Gruyter.
Lacerda, F. (1995). The perceptual-magnet effect: an emergent consequence of exemplar-based
phonetic memory. In K. Elenius and P. Branderyd (Eds.), XIIIth International Congress
of Phonetic Sciences (Volume 2) (pp. 140-147). Stockholm.
Ladefoged, P., & Maddieson, I. (1996). Recording the phonetic structures of endangered
languages. UCLA Working Papers in Phonetics, 1-7.
Ladefoged, P., & Maddieson, I. (1996). The sounds of the world's languages. Oxford: Blackwell.
Lametti, D. R., Nasir, S. M., & Ostry, D. J. (2012). Sensory Preference in Speech Production
Revealed by Simultaneous Alteration of Auditory and Somatosensory Feedback. Journal
of Neuroscience, 32(27), 9351-9358. https://doi.org/10.1523/JNEUROSCI.0404-12.2012
Lammert, A., Proctor, M., & Narayanan, S. (2013). Interspeaker variability in hard palate
morphology and vowel production. Journal of Speech, Language, and Hearing Research,
56, S1924-S1933. https://doi.org/10.1044/1092-4388(2013/12-0211).
Lammert, A., Proctor, M., Katsamanis, A., & Narayanan, S. (2011). Morphological variation in
the adult vocal tract: A modeling study of its potential acoustic impact. Proceedings of
INTERSPEECH, Florence (pp. 2813-2816).
Latash, M. L., Scholz, J. P., & Schöner, G. (2002). Motor control strategies revealed in the
structure of motor variability. Exercise and sport sciences reviews, 30(1), 26-31.
https://doi.org/10.1097/00003677-200201000-00006.
Lawson, E., Stuart-Smith, J., & Scobbie, J. M. (2018). The role of gesture delay in coda /r/
weakening: An articulatory, auditory and acoustic study. The Journal of the Acoustical
Society of America, 143, 1646–1657. https://doi.org/10.1121/1.5025325
Lee, Y., Goldstein, L., Parrell, B., & Byrd, D. (2021). Who converges? Variation reveals
individual speaker adaptability. Speech Communication, 131, 23-34.
https://doi.org/10.1016/j.specom.2021.05.001.
Lehiste, I. (1964). Acoustical characteristics of selected English consonants. Bloomington:
Indiana University Press.
Lehiste, I. (1973). Rhythmic units and syntactic units in production and perception. The Journal
of the Acoustical Society of America, 54(5), 1228-1234.
https://doi.org/10.1121/1.1914379
Lev-Ari, S., & Peperkamp, S. (2013). Low inhibitory skill leads to non-native perception and
production in bilinguals’ native language. Journal of Phonetics, 41(5), 320-331.
https://doi.org/10.1016/j.wocn.2013.06.002.
Lev-Ari, S., & Peperkamp, S. (2014). The influence of inhibitory skill on phonological
representations in production and perception. Journal of Phonetics, 47, 36-46.
https://doi.org/10.1016/j.wocn.2014.09.001.
Li, F., Edwards, J., & Beckman, M. E. (2009). Contrast and covert contrast: The phonetic
development of voiceless sibilant fricatives in English and Japanese toddlers. Journal of
Phonetics, 37(1), 111–124. https://doi.org/10.1016/j.wocn.2008.10.001.
Li, F., Munson, B., Edwards, J., Yoneyama, K., & Hall, K. (2011). Language specificity in the
perception of voiceless sibilant fricatives in Japanese and English: implications for cross-
language differences in speech-sound development. Journal of the Acoustical Society of
America, 129(2), 999–1011. https://doi.org/10.1121/1.3518716.
Liberman, A. M., & Whalen, D. H. (2000). On the relation of speech to language. Trends in
Cognitive Sciences, 4(5), 187–196. https://doi.org/10.1016/S1364-6613(00)01471-6.
Liberman, A. M., Cooper, F. S., Shankweiler, D. P., & Studdert-Kennedy, M. (1967). Perception
of the speech code. Psychological Review, 74(6), 431-461.
https://doi.org/10.1037/h0020279.
Liberman, A.M., & Mattingly, I.G. (1985). The motor theory of speech perception revised.
Cognition, 21(1), 1-36. https://doi.org/10.1016/0010-0277(85)90021-6.
Lin, S., Beddor, P. S., & Coetzee, A. W. (2014). Gestural reduction, lexical frequency, and sound
change: A study of post-vocalic /l/. Laboratory Phonology, 5(1), 9-36.
https://doi.org/10.1515/lp-2014-0002.
Lindau, M. (1985). The story of /r/. In V.A. Fromkin (ed.), Phonetic Linguistics: Essays in
Honor of Peter Ladefoged (pp. 157-168). Academic Press.
Lindblom, B. (1963). Spectrographic study of vowel reduction. Journal of the Acoustical Society
of America, 35(11), 1773-1781. https://doi.org/10.1121/1.1918816.
Lindblom, B. (1990). Explaining phonetic variation: A sketch of H&H Theory. In W. J.
Hardcastle & A. Marchal (Eds.), Speech Production and Speech Modelling (pp. 403–
439). Kluwer Academic Publishers.
Lins, J., & Schöner, G. (2014). A neural approach to cognition based on dynamic field theory. In
S. Coombes, P. beim Graben, R. Potthast, & J. J. Wright (Eds.), Neural Fields: Theory and
Applications (pp. 319-339). Berlin: Springer.
Lipinski, J., Schneegans, S., Sandamirskaya, Y., Spencer, J., & Schöner, G. (2012). A neuro-
behavioral model of flexible spatial language behaviors. Journal of Experimental
Psychology: Learning, Memory, and Cognition, 38, 1490-1511.
Lubker, J., & Gay, T. (1982). Anticipatory labial coarticulation: Experimental, biological, and
linguistic variables. Journal of the Acoustical Society of America, 71, 437-448.
Maniwa, K., Jongman, A., & Wade, T. (2009). Acoustic characteristics of clearly spoken English
fricatives. Journal of the Acoustical Society of America, 125(6), 3962–3973.
https://doi.org/10.1121/1.2990715
Mann, V., & Repp, B. (1980). Influence of vocalic context on the perception of the [ʃ]-[s]
distinction. Perception & Psychophysics, 28(3), 213-228.
Manuel, S. (1999). Cross-language studies: Relating language-particular coarticulation patterns
to other language-particular facts. In W. J. Hardcastle and N. Hewlett (eds.),
Coarticulation: Theory, Data and Techniques (pp. 179-198). Cambridge, UK: Cambridge
University Press.
Marquardt, T.P., Jacks, A., & Davis, B. (2004). Token-to-token variability in developmental
apraxia of speech. Clinical Linguistics & Phonetics, 18(2), 127-144.
Mattingly, I. G., & Liberman, A. M. (1988). Specialized perceiving systems for speech and other
biologically significant sounds. In G. M. G. Edelman, W. E. Gall, & W.M. Cowan
(Eds.), Auditory function: Neurological bases of hearing (pp. 775-793). Wiley: New
York.
McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., & Sonderegger, M. (2017, August).
Montreal Forced Aligner: Trainable Text-Speech Alignment Using Kaldi. In Proceedings
of Interspeech (pp. 498-502).
McFarland, D.H., Baum, S.R., & Chabot, C. (1996). Speech compensation to structural
modifications of the oral cavity. Journal of the Acoustical Society of America, 100, 1093-
1104.
McGuire, G., & Babel, M. (2012). A cross-modal account for synchronic and diachronic patterns
of /f/ and /θ/ in English. Laboratory Phonology, 3, 1-41.
McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746–748.
Mefferd, A. S., & Green, J. R. (2010). Articulatory-to-acoustic relations in response to speaking
rate and loudness manipulations. Journal of Speech, Language, and Hearing Research,
53(5), 1206-1219.
Mielke, J., Baker, A., & Archangeli, D. (2010). Variability and homogeneity in American
English /ɹ/ allophony and /s/ retraction. In C. Fougeron, B. Kühnert, M. D'Imperio, &
N. Vallée (Eds.), Laboratory Phonology 10 (pp. 699-730). De Gruyter Mouton.
Mielke, J., Baker, A., & Archangeli, D. (2016). Individual-level contact limits phonological
complexity: Evidence from bunched and retroflex /ɹ/. Language, 92, 101–140.
Miller, J. L., & Liberman, A. M. (1979). Some effects of later-occurring information on the
perception of stop consonant and semivowel. Perception & Psychophysics, 25(6), 457-
465.
Miller, J. L., Grosjean, F., & Lomanto, C. (1984). Articulation rate and its variability in
spontaneous speech: A reanalysis and some implications. Phonetica, 41(4), 215–225.
Milne, A. E., Bianco, R., Poole, K. C., Zhao, S., Oxenham, A. J., Billig, A. J., & Chait, M.
(2020). An online headphone screening test based on dichotic pitch. Behavior Research
Methods, 53(4), 1551-1562. https://doi.org/10.3758/s13428-020-01514-0
Moon, S.-J., & Lindblom, B. (1994). Interaction between duration, context, and speaking style in
English stressed vowels. Journal of the Acoustical Society of America, 96(1), 40–55.
Mooshammer, C., Hoole, P., & Geumann, A. (2007). Jaw and order. Language and Speech,
50(2), 145–176. https://doi.org/10.1177/00238309070500020101
Mooshammer, C., Perrier, P., Fuchs, S., Geng, C., & Pape, D. (2004). An EMMA and EPG study
on token-to-token variability. AIPUK, 36, 47-63.
Nakagawa, S., Johnson, P. C. D., & Schielzeth, H. (2017). The coefficient of determination R²
and intra-class correlation coefficient from generalized linear mixed-effects models
revisited and expanded. Journal of the Royal Society Interface, 14, Article 20170213.
Nam, H., & Saltzman, E. (2003). A Competitive, Coupled Oscillator Model of Syllable
Structure. Proceedings of ICPhS 15, 2253–2256.
Nam, H., Goldstein, L., & Saltzman, E. (2009). Self-organization of syllable structure: A coupled
oscillator model. In F. Pellegrino, E. Marsico, I. Chitoran, & C. Coupé (Eds.),
Approaches to phonological complexity (pp. 299-328). Berlin/New York: Mouton de
Gruyter.
Narayanan, S. S., Alwan, A. A., & Haker, K. (1995). An articulatory study of fricative
consonants using magnetic resonance imaging. Journal of the Acoustical Society of
America, 98(3), 1325–1347.
Narayanan, S. S., Alwan, A. A., & Haker, K. (1997). Toward articulatory-acoustic models for
liquid approximants based on MRI and EPG data. Part I. The laterals. Journal of the
Acoustical Society of America, 101(2), 1064–1077. https://doi.org/10.1121/1.418030
Nasir, S. M., & Ostry, D. J. (2006). Somatosensory Precision in Speech Production. Current
Biology, 16(19), 1918–1923. https://doi.org/10.1016/j.cub.2006.07.069
Newell, K. M., & Slifkin, A. B. (1998). The nature of movement variability. In J. P. Piek
(Ed.), Motor behavior and human skill: A multidisciplinary approach (pp. 143-160).
Champaign, IL: Human Kinetics.
Newman, R. S. (1997). Individual differences and the link between speech perception and speech
production. Doctoral Dissertation, SUNY Buffalo.
Newman, R. S. (2003). Using links between speech perception and speech production to evaluate
different acoustic metrics: A preliminary report. The Journal of the Acoustical Society of
America, 113(5), 2850–2860. https://doi.org/10.1121/1.1567280
Newman, R.S., Clouse, S.A., & Burnham, J.L. (2001). The perceptual consequences of within-
talker variability in fricative production. Journal of the Acoustical Society of America,
109(3), 1181-1196.
Nieto-Castanon, A., Guenther, F. H., Perkell, J. S., & Curtin, H. D. (2005). A modeling
investigation of articulatory variability and acoustic stability during American English /r/
production. Journal of the Acoustical Society of America, 117(5), 3196–3212.
https://doi.org/10.1121/1.1893271.
Nissen, S. L., & Fox, R. A. (2005). Acoustic and spectral characteristics of young children's
fricative productions: a developmental perspective. Journal of the Acoustical Society of
America, 118, 2570-2578.
Nittrouer, S. (1995). Children learn separate aspects of speech production at different rates:
Evidence from spectral moments. The Journal of the Acoustical Society of
America, 97(1), 520-530.
Niziolek, C. A., & Guenther, F. H. (2013). Vowel category boundaries enhance cortical and
behavioral responses to speech feedback alterations. Journal of Neuroscience, 33(29),
12090-12098. https://doi.org/10.1523/JNEUROSCI.1008-13.2013
Noiray, A., Cathiard, M.-A., Ménard, L., & Abry, C. (2011). Test of the movement expansion
model: Anticipatory vowel lip protrusion and constriction in French and English
speakers. Journal of the Acoustical Society of America, 129(1), 340-349.
Noiray, A., Iskarous, K., & Whalen, D. H. (2014). Variability in English vowels is comparable in
articulation and acoustics. Laboratory Phonology, 5(2), 271–288.
https://doi.org/10.1515/lp-2014-0010
O'Connor, J. D., Gerstman, L. J., Liberman, A. M., Delattre, P. C., & Cooper, F.S. (1957).
Acoustic cues for the perception of initial /w, j, r, l/ in English. Word, 13(1), 24-43.
Öhman, S. E. (1966). Coarticulation in VCV utterances: Spectrographic measurements. The
Journal of the Acoustical Society of America, 39(1), 151-168.
Öhman, S. E. (1967). Numerical model of coarticulation. The Journal of the Acoustical Society
of America, 41(2), 310-320.
Oller, D. K. (1973). The effect of position in utterance on speech segment duration in
English. Journal of the Acoustical Society of America, 54(5), 1235-1247.
Ong, D., & Stone, M. (1998). Three-dimensional vocal tract shapes in /r/ and /l/: A study of MRI,
ultrasound, electropalatography, and acoustics. Phonoscope, 1(1), 1-13.
Ostry, D.J., Gribble, P.L., & Gracco, V.L. (1996). Coarticulation of jaw movements in speech
production: Is context sensitivity in speech kinematics centrally planned? Journal of
Neuroscience, 16, 1570-1579.
Ou, J., & Law, S. P. (2017). Cognitive basis of individual differences in speech perception,
production and representations: The role of domain general attentional switching.
Attention, Perception, and Psychophysics, 79(3), 945–963.
https://doi.org/10.3758/s13414-017-1283-z.
Ou, J., Law, S. P., & Fung, R. (2015). Relationship between individual differences in speech
processing and cognitive functions. Psychonomic Bulletin & Review, 22, 1725–1732.
Pardo, J. S., Jordan, K., Mallari, R., Scanlon, C., & Lewandowski, E. (2013). Phonetic
convergence in shadowed speech: The relation between acoustic and perceptual
measures. Journal of Memory and Language, 69(3), 183-195.
Parrell, B., & Niziolek, C. A. (2020). Formant variability is actively regulated in vowel
production. Neuroscience, 7(9), 907-915.
Parrell, B., & Narayanan, S. (2018). Explaining Coronal Reduction: Prosodic Structure and
Articulatory Posture. Phonetica, 75(2), 151–181. https://doi.org/10.1159/000481099
Pearson, K. (1896). Mathematical contributions to the theory of evolution. III. Regression,
heredity, and panmixia. Philosophical Transactions of the Royal Society of London.
Series A, Containing papers of a mathematical or physical character, 187, 253-318.
Perkell, J. S. (2012). Movement goals and feedback and feedforward control mechanisms in
speech production. Journal of Neurolinguistics, 25(5), 382–407.
https://doi.org/10.1016/J.JNEUROLING.2010.02.011
Perkell, J. S., Guenther, F. H., Lane, H., Matthies, M. L., Stockmann, E., Tiede, M., &
Zandipour, M. (2004). The distinctness of speakers’ productions of vowel contrasts is
related to their discrimination of the contrasts. Journal of the Acoustical Society of
America, 116(4), 2338–2344. https://doi.org/10.1121/1.1787524
Perkell, J. S., Lane, H., Ghosh, S. S., Matthies, M. L., Tiede, M., Guenther, F. H., & Ménard, L.
(2008). Mechanisms of vowel production: Auditory goals and speaker acuity. In
Proceedings of the 8th International Seminar on Speech Production (pp. 29-32).
Strasbourg, France.
Perkell, J. S., Matthies, M. L., Tiede, M., Lane, H., Zandipour, M., Marrone, N., ... & Guenther,
F. H. (2004). The distinctness of speakers’ /s/-/ʃ/ contrast is related to their auditory
discrimination and use of an articulatory saturation effect. Journal of Speech, Language,
and Hearing Research, 47(6), 1259-1269.
Perrier, P. (2003). About speech motor control complexity. In Proceedings of the 6th
International Seminar on Speech Production (pp. 225-230), Sydney, Australia.
Peterson, G., & Barney, H. (1952). Control methods used in the study of vowels. Journal of the
Acoustical Society of America, 24, 175-184.
Peterson, G. E., & Lehiste, I. (1960). Duration of syllable nuclei in English. The Journal of the
Acoustical Society of America, 32(6), 693-703.
Petrone, C., Fuchs, S., & Krivokapić, J. (2011). Consequences of working memory differences
and phrasal length on pause duration and fundamental frequency. 9th International
Seminar on Speech Production 2011, ISSP 2011, 01, 393–400.
Pierrehumbert, J. B. (2001). Exemplar dynamics: Word frequency, lenition and contrast. In J.
Bybee & P. Hopper (Eds.), Frequency effects and the emergence of linguistic structure
(pp. 137–157). Amsterdam: John Benjamins.
Pierrehumbert, J. B. (2002). Word-specific phonetics. In C. Gussenhoven & N. Warner (eds.),
Laboratory Phonology VII (pp. 101-140). Berlin: Mouton de Gruyter.
Pierrehumbert, J. B. (2003). Phonetic diversity, statistical learning, and acquisition of phonology.
Language and Speech, 46(2-3), 115-154.
Pierrehumbert, J. B. (2006). The next toolkit. Journal of Phonetics, 34(4), 516–530.
https://doi.org/10.1016/J.WOCN.2006.06.003
Pierrehumbert, J. B. (2016). Phonological Representation: Beyond Abstract Versus Episodic.
Annual Review of Linguistics, 2(1), 33–52. https://doi.org/10.1146/annurev-linguistics-
030514-125050
Pisoni, D. B. (1975). Auditory short-term memory and vowel perception. Memory &
Cognition, 3(1), 7-18.
Polka, L., & Strange, W. (1985). Perceptual equivalence of acoustic cues that differentiate /r/ and
/l/. Journal of the Acoustical Society of America, 78(4), 1187-1197.
Port, R. F. (1981). Linguistic timing factors in combination. Journal of the Acoustical Society of
America, 69(1), 262-274.
Prinz, W. (1997). Perception and Action Planning. European Journal of Cognitive
Psychology, 9(2), 129-154. DOI: 10.1080/713752551
Pulvermüller, F., Huss, M., Kherif, F., Moscoso del Prado Martin, F., Hauk, O., & Shtyrov, Y.
(2006). Motor cortex maps articulatory features of speech sounds. Proceedings of the
National Academy of Sciences, 103(20), 7865-7870.
Quené, H. (2008). Multilevel modeling of between-speaker and within-speaker variation in
spontaneous speech tempo. Journal of the Acoustical Society of America, 123(2), 1104–
1113. https://doi.org/10.1121/1.2821762
R Core Team. (2020). R: A language and environment for statistical computing (Version 4.0.2).
Vienna, Austria: R Foundation for Statistical Computing.
Rabiner, L., & Juang, B.-H. (1993). Fundamentals of Speech Recognition. Prentice-Hall.
Recasens, D. (1987). An acoustic analysis of V-to-C and V-to-V coarticulatory effects in Catalan
and Spanish VCV sequences. Journal of Phonetics, 15(4), 299-312.
Recasens, D. (1989). Long range coarticulation effects for tongue dorsum contact in VCVCV
sequences. Speech Communication, 8(4), 293-307.
Recasens, D. (2004). Darkness in [l] as a scalar phonetic property: Implications for phonology
and articulatory control. Clinical Linguistics & Phonetics, 18, 593-603.
https://doi.org/10.1080/02699200410001703556
Recasens, D. (2012). A cross-language acoustic study of initial and final allophones of
/l/. Speech Communication, 54(3), 368-383.
Recasens, D., & Espinosa, A. (2009). An articulatory investigation of lingual coarticulatory
resistance and aggressiveness for consonants and vowels in Catalan. Journal of the
Acoustical Society of America, 125(4), 2288-2298.
Recasens, D., & Farnetani, E. (1994). Spatiotemporal properties of different allophones of /l/:
phonological implications. In W.U. Dressler, M. Prinzhorn, & J.R. Rennison (eds.),
Phonologica 1992 (pp. 195-204). Turin: Rosenberg and Sellier.
Recasens, D., Pallarès, M., & Fontdevila, J. (1997). A model of lingual coarticulation based on
articulatory constraints. Journal of the Acoustical Society of America, 102, 544-561.
Riley, M. A., & Turvey, M.T. (2002). Variability and Determinism in Motor Behavior. Journal
of Motor Behavior, 34(2), 99–125.
Rizzolatti, G., & Arbib, M. A. (1998). Language within our grasp. Trends in Neurosciences,
21(5), 188-194.
Roon, K. D. (2013). The dynamics of phonological planning. [Doctoral Dissertation]. New York
University.
Roon, K.D., & Gafos, A.I. (2016). Perceiving while producing: Modeling the dynamics of
phonological planning. Journal of Memory and Language, 89, 222–243.
https://doi.org/10.1016/j.jml.2016.01.005.
Rudy, K., & Yunusova, Y. (2013). The effect of anatomic factors on tongue position variability
during consonants. Journal of Speech, Language, and Hearing Research, 56(1), 137–149.
https://doi.org/10.1044/1092-4388(2012/11-0218)
Saltzman, E.L., & Munhall, K.G. (1989). A dynamical approach to gestural patterning in speech
production. Ecological Psychology, 1(4), 333–382.
Saltzman, E., Nam, H., Krivokapić, J., & Goldstein, L. (2008). A task-dynamic toolkit for
modeling the effects of prosodic structure on articulation. In P.A. Barbosa, S. Madureira,
& C. Reis (Eds.), Proceedings of the Speech Prosody 2008 Conference (pp. 175-184).
Campinas: International Speech Communications Association.
Sancier, M.L., & Fowler, C.A. (1997). Gestural drift in a bilingual speaker of Brazilian
Portuguese and English. Journal of Phonetics, 25(4), 421-436.
Sandamirskaya, Y., & Schöner, G. (2010). An embodied account of serial order: How
instabilities drive sequence generation. Neural Networks, 23(10), 1164-1179.
Schertz, J., Cho, T., Lotto, A., & Warner, N. (2015). Individual differences in phonetic cue use in
production and perception of a non-native sound contrast. Journal of Phonetics, 52, 183–
204. https://doi.org/10.1016/J.WOCN.2015.07.003
Schneegans, S., Lins, J., & Schöner, G. (2015). Embedding Dynamic Field Theory in
Neurophysiology. In G. Schöner, J.P. Spencer, and the DFT Research Group, Dynamic
Thinking: A primer on Dynamic Field Theory (pp. 61-94). Oxford University Press: New
York.
Schöner, G., & Schutte, A.R. (2015). Dynamic Field Theory: Foundations. In G. Schöner, J.P.
Spencer, and the DFT Research Group, Dynamic Thinking: A primer on Dynamic Field
Theory (pp. 35-60). Oxford University Press: New York.
Schöner, G., Kopecz, K., & Erlhagen, W. (1997). The dynamic neural field theory of motor
programming: Arm and eye movements. In P. G. Morasso & V. Sanguineti (Vol. Eds.)
and G. E. Stelmach & P. A. Vroon (Series Eds.), Advances in psychology: Self-
organization, computational maps and motor control (Vol. 119, pp. 271–310).
Amsterdam: Elsevier-North Holland.
Scobbie, J. M., & Pouplier, M. (2010). The role of syllable structure in external sandhi: An EPG
study of vocalization and retraction in word-final English /l/. Journal of Phonetics, 38,
240-259.
Shadle, C., & Mair, S. (1996). Quantifying spectral characteristics of fricatives. In International
Conference on Spoken Language Processing (pp. 1521-1524).
Shattuck-Hufnagel, S., & Turk, A. (1998). The domain of phrase-final lengthening in English.
In The Sound of the Future: A Global View of Acoustics in the 21st Century, Proceedings
of the 16th International Congress on Acoustics and 135th Meeting Acoustical Society of
America (pp. 1235-1236).
Shultz, A. A., Francis, A. L., & Llanos, F. (2012). Differential cue weighting in perception and
production of consonant voicing. Journal of the Acoustical Society of America, 132,
EL95-EL101.
Simmering, V. R., Schutte, A. R., & Spencer, J. P. (2007). Generalizing the dynamic field theory
of spatial cognition across real and developmental time scales. Brain Research, 1202, 68-
86. https://doi.org/10.1016/j.brainres.2007.06.081
Smith, A., Goffman, L., Zelaznik, H. N., Ying, G., & McGillem, C. (1995). Spatiotemporal
stability and patterning of speech movement sequences. Experimental Brain
Research, 104(3), 493-501.
Smith, A., & Kleinow, J. (2000). Kinematic correlates of speaking rate changes in stuttering and
normally fluent adults. Journal of Speech, Language, and Hearing Research, 43(2), 521-
536.
Smith, B. J., Mielke, J., Magloughlin, L., & Wilbanks, E. (2019). Sound change and
coarticulatory variability involving English /ɹ/. Glossa: A Journal of General Linguistics,
4(1), 63. https://doi.org/10.5334/gjgl.650
Smith, B. L. (2002). Effects of Speaking Rate on Temporal Patterns of English. Phonetica, 59,
232-244.
Soli, S. D. (1981). Second formants in fricatives: Acoustic consequences of fricative-vowel
coarticulation. Journal of the Acoustical Society of America, 70(4), 976–984.
https://doi.org/10.1121/1.387032
Spencer, J. P., & Schöner, G. (2003). Bridging the representational gap in the dynamic systems
approach to development. Developmental Science, 6(4), 392-412.
Spencer, K.A., & Slocomb, D.L. (2007). The neural basis of ataxic dysarthria. The Cerebellum,
6, 58-65.
Sproat, R., & Fujimura, O. (1993). Allophonic variation in English /l/ and its implications for
phonetic implementation. Journal of Phonetics, 21, 291–311.
Stevens, K. N. (1972). The quantal nature of speech: Evidence from articulatory-acoustic data. In
E. E. David Jr. & P. B. Denes (eds.), Human Communication: A Unified View (pp. 51-
66). New York: McGraw Hill.
Stevens, K. N., & Blumstein, S. E. (1978). Invariant cues for place of articulation in stop
consonants. Journal of the Acoustical Society of America, 64(5), 1358-1368.
Stewart, M. E., & Ota, M. (2008). Lexical effects on speech perception in individuals with
“autistic” traits. Cognition, 109(1), 157-162.
Stone, M. (2005). A guide to analysing tongue motion from ultrasound images. Clinical
linguistics & phonetics, 19(6-7), 455-501.
Stone, M., Gomez, A.D., Zhuo, J., Tchouaga, A.L., & Prince, J.L. (2019). Quantifying tongue tip
shape in apical and laminal /s/: Contributions of palate shape. Journal of Speech,
Language, and Hearing Research, 62(9), 3149-3159.
Stone, M., Rizk, S., Woo, J., Murano, E. Z., Chen, H., & Prince, J. L. (2012). Frequency of
Apical and Laminal /s/ in Normal and Postglossectomy Patients. Journal of Medical
Speech-Language Pathology, 20(4). http://www.ncbi.nlm.nih.gov/pubmed/26157329
Sumby, W. H., & Pollack, I. (1954). Visual contribution to speech intelligibility in noise. Journal
of the Acoustical Society of America, 26(2), 212-215.
Tasko, S. M., & McClean, M. D. (2004). Variations in Articulatory Movement with Changes in
Speech Task. Journal of Speech, Language, and Hearing Research, 47(1), 85–100.
https://doi.org/10.1044/1092-4388(2004/008)
Thelen, E., Schöner, G., Scheier, C., & Smith, L.B. (2001). The dynamics of embodiment: A
field theory of infant perseverative reaching. Behavioral and Brain Sciences, 24, 1–86.
Theodore, R. M., Miller, J. L., & DeSteno, D. (2009). Individual talker differences in voice-
onset-time: Contextual influences. Journal of the Acoustical Society of America, 125(6),
3974-3982.
Tiede, M. K., Boyce, S. E., Holland, C. K., & Choe, K. A. (2004). A new taxonomy of American
English /r/ using MRI and ultrasound. Journal of the Acoustical Society of America,
115(5), 2633–2634. https://doi.org/10.1121/1.4784878
Tilsen, S. (2007). Vowel-to-vowel coarticulation and dissimilation in phonemic response
priming. UC Berkeley Phonology Lab Annual Report, 3, 416–458.
Tilsen, S. (2016). Selection and coordination: the articulatory basis for the emergence of
phonological structure. Journal of Phonetics, 55, 53–77. https://doi.org/10.1016/j.wocn.
2015.11.005
Tilsen, S. (2018). Three mechanisms for modeling articulation: selection, coordination, and
intention. Cornell Working Papers in Phonetics and Phonology, 1–49.
Tilsen, S. (2019). Motoric mechanisms for the emergence of non-local phonological patterns.
Frontiers in Psychology: Language Sciences, 10, 1-15.
https://doi.org/10.3389/fpsyg.2019.02143
Tobin, S. J., Nam, H., & Fowler, C. A. (2018). Phonetic drift in Spanish-English bilinguals:
Experiment and a self-organizing model. Journal of Phonetics, 65, 45–59.
https://doi.org/10.1016/j.wocn.2017.05.006
Toda, M., Maeda, S., Carlen, A. J., & Meftahi, L. (2003). Lip protrusion/rounding dissociation in
French and English consonants: /w/ vs. /ʃ/ and /ʒ/. In Proceedings of the 15th
International Congress of Phonetic Sciences (pp. 1763-1766). Barcelona.
Tomaschek, F., Arnold, D., Sering, K., Tucker, B.V., van Rij, J., & Ramscar, M. (2020).
Articulatory variability is reduced by repetition and predictability. Language and Speech,
64(3), 654-680. https://doi.org/10.1177/0023830920948552
Tomiak, G. R. (1990). An evaluation of a spectral moments metric with voiceless fricative
obstruents. Journal of the Acoustical Society of America, 87(S1), S106-S107.
Tourville, J. A., & Guenther, F. H. (2011). The DIVA model: A neural theory of speech
acquisition and production. Language and Cognitive Processes, 26(7), 952–981.
Traunmüller, H., & Öhrström, N. (2007b). The effect of incongruent visual cues on the heard
quality of front vowels. In J. Trouvain and W. J. Barry (Eds.), Proceedings of the 16th
International Congress of Phonetic Sciences (pp. 721-724). Saarbrücken: Universität des
Saarlandes.
Tsao, Y. C., & Weismer, G. (1997). Interspeaker variation in habitual speaking rate: Evidence
for a neuromuscular component. Journal of Speech, Language, and Hearing Research,
40(4), 858–866. https://doi.org/10.1044/jslhr.4004.858
Tsao, Y.-C., Weismer, G., & Iqbal, K. (2006). The effect of intertalker speech rate variation on
acoustic vowel space. Journal of the Acoustical Society of America, 119(2), 1074-1082.
Tuller, B., Kelso, J. S., & Harris, K. S. (1982). Interarticulator phasing as an index of temporal
regularity in speech. Journal of Experimental Psychology: Human Perception and
Performance, 8(3), 460.
Tumer, E. C., & Brainard, M. S. (2007). Performance variability enables adaptive plasticity of
“crystallized” adult birdsong. Nature, 450, 1240-1244.
Turner, G. S., Tjaden, K., & Weismer, G. (1995). The influence of speaking rate on vowel space
and speech intelligibility for individuals with amyotrophic lateral sclerosis. Journal of
Speech and Hearing Research, 38(5), 1001–1013. https://doi.org/10.1044/jshr.3805.1001
Twist, A., Baker, A., Mielke, J., & Archangeli, D. (2007). Are "covert" /ɹ/ allophones really
indistinguishable? University of Pennsylvania Working Papers in Linguistics, 13(2), 207-
216.
Uchanski, R. M., Millier, K. M., Reed, C. M., & Braida, L. D. (2011). Effects of token
variability on vowel identification. In M. E. H. Schouten (Ed.), The Auditory Processing
of Speech: From Sounds to Words (pp. 291-302). De Gruyter Mouton.
van Beers, R. J. (2009). Motor Learning Is Optimally Tuned to the Properties of Motor Noise.
Neuron, 63(3), 406–417. https://doi.org/10.1016/j.neuron.2009.06.025
van Beers, R. J., Baraduc, P., & Wolpert, D. M. (2002). Role of uncertainty in sensorimotor
control. Philosophical Transactions of the Royal Society of London. Series B: Biological
Sciences, 357(1424), 1137-1145. https://doi.org/10.1098/rstb.2002.1101
van Beers, R.J., Haggard, P., & Wolpert, D.M. (2004). The role of execution noise in movement
variability. Journal of Neurophysiology, 91(2), 1050-1063.
Van Son, R. J., & Pols, L. C. (1992). Formant movements of Dutch vowels in a text, read at
normal and fast rate. Journal of the Acoustical Society of America, 92(1), 121-127.
Vaughn, C., Baese-Berk, M., & Idemaru, K. (2018). Re-Examining Phonetic Variability in
Native and Non-Native Speech. Phonetica, 76(5), 327-358.
https://doi.org/10.1159/000487269
Villacorta, V. M., Perkell, J. S., & Guenther, F. H. (2007). Sensorimotor adaptation to feedback
perturbations of vowel acoustics and its relation to perception. The Journal of the
Acoustical Society of America, 122(4), 2306–2319. https://doi.org/10.1121/1.2773966
Vorperian, H. K., Kent, R. D., Lindstrom, M. J., Kalina, C. M., Gentry, L. R., & Yandell, B. S.
(2005). Development of vocal tract length during early childhood: A magnetic resonance
imaging study. Journal of the Acoustical Society of America, 117(1), 338-350.
Vorperian, H. K., Wang, S., Chung, M. K., Schimek, E. M., Durtschi, R. B., Kent, R. D., ... &
Gentry, L. R. (2009). Anatomic development of the oral and pharyngeal portions of the
vocal tract: An imaging study. Journal of the Acoustical Society of America, 125(3),
1666-1678.
Vorperian, H.K., Wang, S., Schimek, M.E., Durtschi, R.B., Kent, R.D., Gentry, L.R., & Chung,
M.K. (2011). Developmental sexual dimorphism of the oral and pharyngeal portions of
the vocal tract: an imaging study. Journal of Speech, Language, and Hearing Research,
54(4), 995-1010.
Walker, A., & Campbell-Kibler, K. (2015). Repeat what after whom? Exploring variable
selectivity in a cross-dialectal shadowing task. Frontiers in Psychology, 6, Article
546. https://doi.org/10.3389/fpsyg.2015.00546.
Wedel, A. (2006). Exemplar models, evolution and language change. The Linguistic Review, 23,
247-274.
Weirich, M., & Fuchs, S. (2013). Palate morphology can influence speaker-specific realizations
of phonemic contrasts. Journal of Speech, Language, and Hearing Research, 56(6),
S1894-S1908.
Weirich, M., & Simpson, A.P. (2018). Individual difference in acoustic and articulatory
undershoot in a German diphthong - Variation between male and female speakers.
Journal of Phonetics, 71, 35-50. https://doi.org/10.1016/j.wocn.2018.07.007.
Westbury, J. R. (1994). X-ray microbeam speech production database user's handbook.
Madison, WI.
Westbury, J. R., Hashi, M., & Lindstrom, M. J. (1998). Differences among speakers in lingual
articulation for American English /r/. Speech Communication, 26, 203–226.
Whalen, D. H. (1990). Coarticulation is largely planned. Journal of Phonetics, 18(1), 3–35.
https://doi.org/10.1016/s0095-4470(19)30356-0
Whalen, D. H., Chen, W. R., Tiede, M. K., & Nam, H. (2018). Variability of articulator positions
and formants across nine English vowels. Journal of Phonetics, 68, 1–14.
https://doi.org/10.1016/j.wocn.2018.01.003
Woods, K. J., Siegel, M. H., Traer, J., & McDermott, J. H. (2017). Headphone screening to
facilitate web-based auditory experiments. Attention, Perception, &
Psychophysics, 79(7), 2064-2072.
Wu, H. G., Miyamoto, Y. R., Gonzalez Castro, L. N., Ölveczky, B. P., & Smith, M. A. (2014).
Temporal structure of motor variability is dynamically regulated and predicts motor
learning ability. Nature Neuroscience, 17, 312-321.
Yu, A. C. L. (2010). Perceptual compensation is correlated with individuals' "autistic" traits:
Implications for models of sound change. PLoS ONE, 5, e11950.
Yu, A. C. L. (2013). Individual differences in socio-cognitive processing and the actuation of
sound change. In A. C. L. Yu (ed.), Origins of Sound Change: Approaches to
Phonologization. (pp. 201-227). Oxford, UK: Oxford University Press.
Yu, A. C. L. (2016). Vowel-dependent variation in Cantonese /s/ from an individual-difference
perspective. Journal of the Acoustical Society of America, 139, 1672-1690.
Yu, A. C. L. (2019). Individual Differences in Language Processing: Phonology. Language and
Linguistics Compass, 5, 6.1-6.20. https://doi.org/10.1111/lnc3.12167
Yu, A. C. L., & Lee, H. (2014). The stability of perceptual compensation for coarticulation
within and across individuals: a cross-validation study. Journal of the Acoustical Society
of America, 136, 382-388.
Yuan, J., & Liberman, M. (2008). Speaker identification on the SCOTUS corpus. In Proceedings
of Acoustics '08 (pp. 5687-5690).
Yunusova, Y., Rosenthal, J.S., Rudy, K., Baljko, M., & Daskalogiannakis, J. (2012). Positional
targets for lingual consonants defined using electromagnetic articulography. Journal of
the Acoustical Society of America, 132, 1027-1038. https://doi.org/10.1121/1.4733542
Zawadzki, P. A., & Kuehn, D. P. (1980). A Cineradiographic Study of Static and Dynamic
Aspects of American English /r/. Phonetica, 37, 253–266.
Zellou, G. (2017). Individual differences in the production of nasal coarticulation and perceptual
compensation. Journal of Phonetics, 61, 13–29.
https://doi.org/10.1016/j.wocn.2016.12.002
Zhou, X., Espy-Wilson, C. Y., Boyce, S., Tiede, M., Holland, C., & Choe, A. (2008). A
magnetic resonance imaging-based articulatory and acoustic study of retroflex and
bunched American English /r/. The Journal of the Acoustical Society of America, 123(6),
4466–4481. https://doi.org/10.1121/1.2902168
Appendices
Appendix A: Analyses using CoV
Measures of overall CoV (CoVTOT), within-context CoV (CoVCON), and cross-context
CoV (CoVCROSS) were calculated for (a) the two ratio-scale articulatory dimensions measured
for each consonant (CD and LA), (b) the first two spectral moments in fricative consonants (M1
and M2), and (c) all formant measurements taken in the liquid consonants (F1-F4). The
equations used to calculate these measures of dispersion are given as Equations A.1-A.3.
!"#
!"!
= ($
#$$
/$̅>??
)∗100 (A.1)
T?U
@AB
= V
6
:!
/ ;
:!
6
6
:0
/ ;
:0
6⋯6
6
:<
/ ;
:<
)
W∗100 (A.2)
(where ci = a unique phonetic context and n = number of unique phonetic contexts).
T?U
@DAEE
= X
/
/ ;
:=>
"̅/ ;
:=>
Y∗100 (A.3)
(where n = number of unique contexts and $̅@AB
= set of means calculated for each
unique context).
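As a worked illustration of Equations A.1-A.3, the following minimal Python sketch computes the three CoV measures from tokens grouped by phonetic context. This is an added illustration rather than the dissertation's analysis code: the function names and example values are hypothetical, and sample standard deviations (ddof=1) are assumed.

```python
import numpy as np

def cov_tot(tokens):
    # Eq. A.1: pooled standard deviation over pooled mean, in percent
    return np.std(tokens, ddof=1) / np.mean(tokens) * 100

def cov_con(tokens_by_context):
    # Eq. A.2: average of the per-context CoVs (sigma_ci / xbar_ci)
    ratios = [np.std(c, ddof=1) / np.mean(c) for c in tokens_by_context]
    return np.mean(ratios) * 100

def cov_cross(tokens_by_context):
    # Eq. A.3: CoV of the set of per-context means
    means = np.array([np.mean(c) for c in tokens_by_context])
    return np.std(means, ddof=1) / np.mean(means) * 100

# Hypothetical constriction degree (CD) tokens for one speaker in three contexts
contexts = [np.array([2.1, 2.3, 2.0, 2.2]),
            np.array([2.8, 2.6, 2.7, 2.9]),
            np.array([2.4, 2.5, 2.2, 2.3])]
print(cov_tot(np.concatenate(contexts)))  # CoVTOT
print(cov_con(contexts))                  # CoVCON
print(cov_cross(contexts))                # CoVCROSS
```

Note that CoVCON captures token-to-token dispersion around each context mean, while CoVCROSS captures dispersion of the context means themselves, so the two can differ substantially for the same speaker.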
Table A.1. Proportion of pairwise comparisons of CoVTOT, CoVCROSS and CoVCON that were
significant for LA and CD (by segment).
/t/ /s/ /ʃ/ /l/ /ɹ/
CoVTOT CD 0.16 0.271 0.131 0.271 0.288
LA 0.256 0.282 0.229 0.455 0.429
CoVCROSS CD 0.120 0.156 0.146 0.303 0.142
LA 0.049 0.016 0.148 0.117 0.136
CoVCON CD 0.127 0.327 0.284 0.193 0.074
LA 0.173 0.259 0.209 0.140 0.251
Table A.2. Proportion of pairwise comparisons of CoVTOT, CoVCROSS and CoVCON that were
significant for M1 and M2 in the examined fricative consonants.
/s/ /ʃ/
CoVTOT M1 0.307 0.137
M2 0.207 0.117
CoVCROSS M1 0.090 0.119
M2 0.109 0.147
CoVCON M1 0.569 0.520
M2 0.519 0.398
Table A.3. Proportion of pairwise comparisons of CoVTOT, CoVCROSS and CoVCON that were
significant for F1-F4 in the examined liquid consonants.
/l/ /ɹ/
CoVTOT F1 0.356 0.202
F2 0.398 0.115
F3 0.413 0.164
F4 0.396 0.393
CoVCROSS F1 0.115 0.106
F2 0.038 0.053
F3 0.116 0.007
F4 0.186 0.163
CoVCON F1 0.447 0.470
F2 0.505 0.457
F3 0.559 0.484
F4 0.528 0.534
Table A.4. Spearman’s rho (rs) for the comparison of CoVCROSS and CoVCON for each ratio-scale
dimension in each segment. Green cells indicate comparisons that are significant at the
uncorrected p < 0.05 level. Comparisons significant after using the Benjamini-Hochberg method
to control for false discovery rate (padj < 0.05) are additionally bolded.
/t/ /s/ /ʃ/ /l/ /ɹ/
rs p padj rs p padj rs p padj rs p padj rs p padj
CD 0.55 0.00 0.00 0.60 0.00 0.00 0.52 0.00 0.00 0.45 0.00 0.00 0.51 0.00 0.00
LA 0.35 0.02 0.09 0.27 0.08 0.11 0.17 0.28 0.28 0.26 0.10 0.11 0.32 0.04 0.06
M1 0.51 0.00 0.00 0.14 0.4 0.42
M2 0.4 0.01 0.02 0.41 0.01 0.02
F1 0.33 0.03 0.05 0.26 0.09 0.11
F2 0.44 0.00 0.01 0.33 0.03 0.05
F3 0.4 0.01 0.02 0.59 0.00 0.00
F4 0.71 0.00 0.00 0.70 0.00 0.00
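The padj values reported in these tables reflect Benjamini-Hochberg control of the false discovery rate over each family of Spearman correlations. As a rough sketch of that adjustment (an added illustration, not the dissertation's analysis code; scipy's spearmanr is assumed for the correlations, and the BH step-up procedure is implemented directly on hypothetical data):

```python
import numpy as np
from scipy.stats import spearmanr

def bh_adjust(pvals):
    # Benjamini-Hochberg adjusted p-values: sort ascending, scale p_(i) by m/i,
    # then enforce monotonicity from the largest rank downward and cap at 1
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    adj = np.minimum(1.0, np.minimum.accumulate(scaled[::-1])[::-1])
    out = np.empty(m)
    out[order] = adj
    return out

# Hypothetical per-speaker dispersion measures for three dimensions
rng = np.random.default_rng(1)
x, y, z = rng.random(40), rng.random(40), rng.random(40)
rs, ps = zip(*(spearmanr(a, b) for a, b in [(x, y), (x, z), (y, z)]))
print(np.round(rs, 2), np.round(bh_adjust(ps), 2))
```

This adjustment reproduces the behavior of R's p.adjust(..., method = "BH"), consistent with the R-based statistical tooling cited in the references.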
Table A.5. Correlation matrix for the comparison of CoVCON across segments for CD and LA.
All correlations calculated using Spearman’s rho. Green cells indicate comparisons that are
significant at the uncorrected p < 0.05 level. Comparisons significant after using the Benjamini-
Hochberg method to control for false discovery rate are additionally bolded.
/t/ /s/ /ʃ/ /l/ /ɹ/
CD
/t/ 1.000 0.307 0.266 -0.01 -0.069
/s/ 1.000 0.369 0.189 -0.035
/ʃ/ 1.000 -0.066 0.101
/l/ 1.000 0.014
/ɹ/ 1.000
LA
/t/ 1.000 0.293 0.335 0.478 0.435
/s/ 1.000 0.307 0.395 0.451
/ʃ/ 1.000 0.354 0.096
/l/ 1.000 0.429
/ɹ/ 1.000
Table A.6. Spearman’s rho for the comparison of CoVCON across segments for each ratio-scale
acoustic dimension. No comparisons were found to be significant.
M1 M2 F1 F2 F3 F4
/s/~/ʃ/ 0.254 -0.0009
/l/~/ɹ/ 0.083 0.106 0.084 0.026
Table A.7. Spearman’s rank-order correlation for the comparison of vocal tract morphology and
CoVCON for CD and LA in each segment. Green cells indicate comparisons that are significant at
the uncorrected p < 0.05 level. No comparisons were significant after using the Benjamini-
Hochberg method to control for false discovery rate (all padj > 0.05).
CD LA
rs p padj rs p padj
PA
/t/ 0.51 0.00 0.01 -0.12 0.46 0.46
/s/ 0.13 0.42 0.82 -0.06 0.72 0.84
/ʃ/ 0.23 0.15 0.33 0.15 0.34 0.61
/l/ -0.47 0.00 0.01 -0.33 0.04 0.11
/ɹ/ -0.22 0.17 0.34 -0.06 0.73 0.88
PL
/t/ 0.47 0.00 0.01 0.20 0.23 0.34
/s/ 0.06 0.70 0.82 0.06 0.69 0.84
/ʃ/ 0.22 0.17 0.33 0.16 0.32 0.61
/l/ -0.29 0.07 0.14 -0.04 0.82 0.82
/ɹ/ 0.02 0.92 0.92 0.14 0.40 0.68
PH
/t/ 0.31 0.05 0.09 -0.24 0.14 0.29
/s/ 0.13 0.43 0.82 -0.03 0.84 0.84
/ʃ/ 0.08 0.64 0.81 0.13 0.41 0.61
/l/ -0.39 0.01 0.04 -0.35 0.03 0.11
/ɹ/ -0.30 0.06 0.19 -0.15 0.36 0.68
PS
/t/ -0.25 0.11 0.14 -0.13 0.41 0.46
/s/ 0.06 0.71 0.82 0.11 0.49 0.84
/ʃ/ 0.07 0.68 0.81 -0.14 0.40 0.61
/l/ 0.20 0.21 0.25 -0.07 0.68 0.82
/ɹ/ -0.03 0.84 0.92 0.12 0.46 0.68
OCL
/t/ -0.01 0.95 0.95 -0.27 0.10 0.29
/s/ -0.22 0.17 0.82 -0.22 0.16 0.84
/ʃ/ 0.00 0.99 0.99 0.08 0.64 0.77
/l/ -0.13 0.42 0.42 -0.28 0.08 0.15
/ɹ/ 0.31 0.05 0.09 0.02 0.89 0.89
PCL
/t/ 0.30 0.06 0.09 0.27 0.09 0.29
/s/ -0.04 0.82 0.82 0.08 0.62 0.84
/ʃ/ 0.25 0.12 0.33 0.00 0.98 0.98
/l/ 0.22 0.16 0.24 0.25 0.11 0.17
/ɹ/ 0.19 0.25 0.37 0.31 0.06 0.33
Table A.8. Spearman’s rho (rs) for the comparison of vocal tract morphology and CoVCON for M1
and M2 in /s/ and /ʃ/. No comparisons were found to be significant.
M1 M2
rs p padj rs p padj
PA
/s/ 0.02 0.88 0.98 -0.17 0.31 0.91
/ʃ/ 0.16 0.32 0.91 -0.01 0.96 0.98
PL
/s/ 0.28 0.08 0.62 -0.28 0.08 0.62
/ʃ/ 0.1 0.53 0.95 -0.04 0.81 0.98
PH
/s/ -0.09 0.58 0.95 0.03 0.87 0.98
/ʃ/ 0.14 0.36 0.91 -0.14 0.39 0.91
PS
/s/ 0.01 0.94 0.98 -0.14 0.39 0.91
/ʃ/ 0.28 0.08 0.62 0.19 0.25 0.91
OCL
/s/ 0.06 0.72 0.98 -0.04 0.81 0.98
/ʃ/ -0.25 0.18 0.91 -0.10 0.53 0.95
PCL
/s/ -0.01 0.95 0.98 0.11 0.50 0.95
/ʃ/ -0.13 0.43 0.93 0.15 0.37 0.91
Table A.9. Spearman’s rho (rs) for the comparison of vocal tract morphology and CoVCON for
F1-F4 in /l/ and /ɹ/. Comparisons significant at the uncorrected p < 0.05 level are bolded and in
green. No comparisons were significant after controlling for false discovery rate.
F1 F2 F3 F4
rs p padj rs p padj rs p padj rs p padj
PA
/l/ -0.23 0.15 0.76 0.05 0.76 0.99 -0.01 0.94 0.99 0.03 0.85 0.99
/ɹ/ -0.05 0.73 0.99 -0.04 0.77 0.99 0.11 0.48 0.89 0.11 0.50 0.89
PL
/l/ 0.09 0.56 0.93 0.23 0.14 0.76 0.19 0.24 0.86 0.12 0.45 0.89
/ɹ/ -0.01 0.25 0.99 0.19 0.23 0.86 0.20 0.22 0.86 0.21 0.21 0.83
PH
/l/ -0.25 0.12 0.74 -0.08 0.59 0.93 -0.18 0.26 0.86 -0.11 0.49 0.89
/ɹ/ -0.15 0.34 0.86 -0.06 0.67 0.98 0.12 0.44 0.89 0.15 0.37 0.86
PS
/l/ -0.18 0.26 0.86 -0.03 0.87 0.99 -0.17 0.29 0.86 -0.11 0.50 0.89
/ɹ/ -0.15 0.34 0.86 -0.19 0.23 0.86 -0.26 0.02 0.40 -0.15 0.34 0.86
OCL
/l/ 0.11 0.48 0.89 0.26 0.10 0.99 0.09 0.54 0.91 0.003 0.98 0.99
/ɹ/ -0.14 0.37 0.86 0.16 0.33 0.79 0.15 0.36 0.86 0.30 0.09 0.35
PCL
/l/ 0.46 0.002 0.16 -0.10 0.54 0.71 0.43 0.005 0.18 0.29 0.11 0.31
/ɹ/ -0.27 0.08 0.70 0.00 0.98 0.99 -0.08 0.62 0.93 -0.18 0.26 0.86
Appendix B: Relationship between stochastic and contextual variability by position
Table B.10. Spearman’s rank-order correlation for the comparison of IQRCROSS and IQRCON for
each articulatory dimension in each segment in word-initial contexts only. Green cells indicate
comparisons are significant at the uncorrected p < 0.05 level. Comparisons significant after using
the Benjamini-Hochberg method to control for false discovery rate (padj < 0.05) are italicized.
/t/ /s/ /ʃ/ /l/ /ɹ/
rs p padj rs p padj rs p padj rs p padj rs p padj
CL 0.38 0.01 0.1 0.54 0.00 0.00 0.51 0.00 0.00 0.38 0.01 0.04 0.44 0.00 0.02
CD 0.28 0.08 0.15 0.52 0.00 0.00 0.07 0.68 0.74 0.51 0.00 0.01 0.34 0.03 0.06
CO 0.11 0.47 0.53 0.26 0.10 0.17 -0.17 0.27 0.34 0.35 0.02 0.1 0.19 0.24 0.31
LA -0.13 0.44 0.52 0.40 0.01 0.05 0.25 0.12 0.18 0.31 0.06 0.15 0.25 0.11 0.18
LP 0.02 0.88 0.88 0.28 0.08 0.16 0.46 0.00 0.02 0.27 0.09 0.16 0.33 0.03 0.06
Table B.11. Spearman’s rank-order correlation for the comparison of IQRCROSS and IQRCON for
each articulatory dimension in each segment in word-final contexts only. Green cells indicate
comparisons are significant at the uncorrected p < 0.05 level. Comparisons significant after using
the Benjamini-Hochberg method to control for false discovery rate (padj < 0.05) are italicized.
/t/ /s/ /ʃ/ /l/ /ɹ/
rs p padj rs p padj rs p padj rs p padj rs p padj
CL 0.42 0.01 0.4 0.26 0.11 0.27 0.36 0.02 0.05 0.51 0.00 0.01 0.44 0.00 0.02
CD 0.49 0.00 0.01 0.47 0.00 0.01 0.49 0.00 0.01 0.54 0.00 0.00 0.32 0.04 0.13
CO 0.22 0.16 0.22 0.54 0.00 0.00 0.25 0.12 0.19 -0.2 0.2 0.34 0.05 0.75 0.89
LA -0.24 0.14 0.21 0.04 0.81 0.89 0.06 0.69 0.73 0.12 0.45 0.55 0.24 0.13 0.18
LP 0.04 0.81 0.81 0.17 0.30 0.40 0.15 0.34 0.41 0.27 0.09 0.25 0.48 0.00 0.01
Abstract
The articulatory and acoustic properties of any one phonological segment are known to vary both between speakers and between tokens in an individual’s speech. Much of this observed inter- and intraspeaker phonetic variation can be explained as the predictable consequence of various linguistic and extralinguistic factors known to affect the realization of phonological segments. However, relatively little research has systematically examined the extent to which this variation may also reflect individual differences in how segments are represented, or in the mechanisms affecting the selection of production targets in speech. This dissertation extends existing research on stochastic variability in speech to investigate the hypothesis that individual differences in speech production and perception reflect differences in the cognitive representation of phonological units across speakers, specifically differences in the encoding of variability in these representations.

The first of two empirical studies presented in this dissertation uses articulatory and acoustic data from forty speakers in the Wisconsin X-Ray Microbeam corpus (Westbury, 1994) to examine individual differences in phonetic variability in a set of English consonants, and how these differences pattern across different structural units in speech. The results of this study indicate that robust individual differences in variability are maintained across multiple levels of linguistic structure but are not generalized across different phonological segments, supporting models of phonological representation in which variability is encoded in the cognitive representation of phonological targets. The results of an analysis of articulatory-acoustic relations in the same data additionally highlight the potential communicative significance of these individual differences. Building on the findings of this study, an extension of existing Dynamic Field Theory models of phonological cognition is proposed to account for the observed patterns of individual difference in phonetic variability. A series of simulations using this model is presented to illustrate how the incorporation of both dynamical and invariant elements in the representation of phonological units can both account for the observed patterns and generate predictions about the relationship between variability in speech production and individual differences in speech perception. The final empirical study included in this dissertation tests the specific predictions generated by the model regarding the relationship between individual differences in the production of phonetic variability and perceptual sensitivity to subphonemic variability.

As a whole, the findings of this dissertation support a model of phonological cognition in which the individual differences observed in acoustic and articulatory variability reflect the encoding of variability for individual phonological units. Through this, the research presented in this dissertation provides support for the hypothesis that individual differences in speech production and perception reflect variation in the cognitive representation of phonological units across speakers. This dissertation also illustrates how our understanding of the cognitive systems involved in speech production and perception, as well as behavioral speech phenomena more generally, is enhanced by considering measures of speaker performance beyond central tendencies in speech production.
Linked assets: University of Southern California Dissertations and Theses
Conceptually similar
Articulatory dynamics and stability in multi-gesture complexes
Beatboxing phonology
The planning, production, and perception of prosodic structure
The role of individual variability in tests of functional hearing
Harmony in gestural phonology
Minimal contrast and the phonology-phonetics interaction
The prosodic substrate of consonant and tone dynamics
Toward understanding speech planning by observing its execution—representations, modeling and analysis
Emotional speech production: from data to computational models and applications
Articulatory knowledge in phonological computation
Investigating the production and perception of reduced speech: a cross-linguistic look at articulatory coproduction and compensation for coarticulation
Signs of skilled adaptation in the co-speech ticking of adults with Tourette's
Dynamics of consonant reduction
Sound sequence adaptation in loanword phonology
Speech production in post-glossectomy speakers: articulatory preservation and compensation
A computational framework for exploring the role of speech production in speech processing from a communication system perspective
Sources of non-conformity in phonology: variation and exceptionality in Modern Hebrew spirantization
The Spanish feminine el at the syntax-phonology interface
Effects of speech context on characteristics of manual gesture
Effects of language familiarity on talker discrimination from syllables
Asset Metadata
Creator: Harper, Sarah Kolin (author)
Core Title: Individual differences in phonetic variability and phonological representation
School: College of Letters, Arts and Sciences
Degree: Doctor of Philosophy
Degree Program: Linguistics
Degree Conferral Date: 2021-12
Publication Date: 09/23/2021
Defense Date: 09/02/2021
Publisher: University of Southern California (original); University of Southern California. Libraries (digital)
Tag: acoustic variability, Acoustics, articulation, articulatory variability, articulatory-acoustic relations, individual differences, OAI-PMH Harvest, perception-production relations, phonetics, phonological cognition
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Goldstein, Louis (committee chair); Byrd, Dani (committee member); Zevin, Jason (committee member)
Creator Email: skhaao@gmail.com, skharper@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-oUC15925682
Unique identifier: UC15925682
Legacy Identifier: etd-HarperSara-10096
Document Type: Dissertation
Rights: Harper, Sarah Kolin
Type: texts
Source: University of Southern California (contributing entity); University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email: cisadmin@lib.usc.edu