INVESTIGATING THE PRODUCTION AND PERCEPTION OF REDUCED
SPEECH: A CROSS-LINGUISTIC LOOK AT ARTICULATORY COPRODUCTION
AND COMPENSATION FOR COARTICULATION
by
David Cheng-Huan Li
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(LINGUISTICS)
August 2014
Copyright 2014 David Cheng-Huan Li
Dedication
For Ian and Jadelynn, each equally my pride and joy
Acknowledgements
I would like to thank the members of my committee for their guidance and
support in helping me develop and complete this dissertation. I am especially indebted to
my advisor, Dr. Elsi Kaiser, who has not only spent countless hours discussing,
reading, and commenting on this work, but also helped me gain the skills and expertise
necessary for completing this document during my graduate career. Finally, I would like
to express my appreciation to my wife, Wei-Yun, for shouldering far more than her fair
share of the parenting and household responsibilities while I pursued this degree.
Table of Contents
Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1 Introduction
1.1 Speech reduction, production, and perception
1.2 Theoretical accounts of coarticulation
1.3 Perception of reduced speech and compensation for coarticulation
1.4 Overview of the dissertation
Chapter 2 Mandarin Syllable Contraction
2.1 Introduction
2.2 Experiment 1
2.3 Discussion
Chapter 3 Compensation for Assimilation: Contextual Cues and Gradience
3.1 Introduction
3.2 Experiment 2
3.3 Experiment 3
3.4 General discussion, conclusions
Chapter 4 Nonnative compensation for assimilation: When phonological systems clash
4.1 Introduction
4.2 Experiment 4
4.3 Discussion
Chapter 5 General Discussion
5.1 Summary of the findings
5.2 Directions for future research
References
List of Tables
Table 1 Examples of common Mandarin contracted syllables (Tseng, 2005)
Table 2 The two different set types in the stimuli
Table 3 Targets and controls from Set 1
Table 4 Targets and controls from Set 2
Table 5 Targets and controls from Set 3
Table 6 Targets and controls from Set 4
Table 7 Targets and controls from Set 5
Table 8 Targets and controls from Set 6
Table 9 The chosen nonwords
Table 10 Example of splicing for fast and slow conditions
Table 11 Summary of main biographical characteristics for nonnative speakers of English
List of Figures
Figure 1 Gestural score of non-contracted [uomɤn], ‘we’
Figure 2 Gestural score of contracted ‘we’ under the in-phase coupling proposal
Figure 3 Gestural score of extreme contraction of ‘we’ under the in-phase coupling proposal
Figure 4 Gestural score of contracted ‘we’ under the target undershoot proposal
Figure 5 Durations of targets and controls
Figure 6 Average F2 frequency values for all targets
Figure 7 Average F2 frequency values for all targets excluding Set 7
Figure 8 Comparison of targets and controls for sets with back-to-front vowel order
Figure 9 Comparison of targets and controls for sets with front-to-back vowel order
Figure 10 Set 1 average F2 frequency values
Figure 11 Set 1 target and control durations
Figure 12 Set 2 average F2 frequency values
Figure 13 Set 2 target and control durations
Figure 14 Set 3 average F2 frequency values
Figure 15 Set 3 target and control durations
Figure 16 Set 4 average F2 frequency values
Figure 17 Set 4 target and control durations
Figure 18 Set 5 average F2 frequency values
Figure 19 Set 5 target and control durations
Figure 20 Set 6 average F2 frequency values
Figure 21 Set 6 target and control durations
Figure 22 Pictures of all alien creatures
Figure 23 Sample visual display for target tokens
Figure 24 Percentage of unassimilated vs. assimilated form selected in the fast and slow rate conditions
Figure 25 Coronal-picture advantage for the targets Vone and Kine
Figure 26 Coronal-picture advantage for the target Shoon
Figure 27 Representative example of measurement of an articulatory gesture
Figure 28 Correlation between coronal advantage scores and articulatory overlap measures for Vone and Kine items
Figure 29 Correlation between coronal advantage scores and articulatory overlap measures for Vone, Kine, and Shoon
Figure 30 Percentage of unassimilated vs. assimilated form selected by the L1 English speakers
Figure 31 Percentage of unassimilated vs. assimilated form selected by the L2 English speakers
Figure 32 Percentage of unassimilated vs. assimilated form selected by the L1 English speakers in the fast and slow rate conditions
Figure 33 Percentage of unassimilated vs. assimilated form selected by the L2 English speakers in the fast and slow rate conditions
Figure 34 Coronal-picture advantage for L1 English speakers
Figure 35 Coronal-picture advantage for L2 English speakers
Figure 36 Coronal-picture advantage for L1 and L2 English speakers in the fast rate condition
Figure 37 Coronal-picture advantage for L1 and L2 English speakers in the slow rate condition
Abstract
The pronunciation of a word in continuous speech is often reduced, different from when
it is spoken in isolation. Speech reduction reflects a fundamental property of spoken
language—the movements of articulators can overlap in time, also known as
coarticulation. A large number of experimental findings support the role of articulatory coproduction as the underlying mechanism in speech production (e.g., Bell-Berti & Harris, 1981; Browman & Goldstein, 1986; Fowler, 1977). However, on the perception
side, prior research has not specifically examined the role of articulatory overlap or the
nature of the relation between production and perception. Given the central role of
articulatory overlap in speech production, this dissertation set out to address the question
of whether we can make use of the notion of articulatory coproduction in order to take
steps towards building a unified theory for how reduced speech is both produced and
perceived. To examine whether such a unified account exists, we investigated the
following primary questions: (i) in the production of reduced speech, can we find
additional evidence that alternative pronunciations of words can be attributed to
coproduction of articulation, (ii) how does the perception system deal with coproduction
information from preceding words (i.e., global cues from prior context) as well as coproduction information at the
critical word during real-time word recognition, (iii) what is the relationship between the
degree of articulatory overlap (in the speaker’s output) and word recognition (by the
listener), and (iv) how do nonnative speakers process coarticulation information
associated with coproduction of articulation. My investigation of Mandarin syllable
contraction (Experiment 1) shows that coproduction of articulation can lead to reduction
in the production process. Turning our attention to the perception side and to English
coronal place assimilation, we found that listeners compensate more for assimilation in a
fast-speech context, which is associated with greater extent of reduction, than in a slow-
speech context (Experiment 2). We also showed that the degree of articulatory overlap in
a speaker’s output is correlated with listeners’ word recognition (Experiment 3). After
investigating how native English speakers compensate for assimilation on the basis of
speech rate, we then investigated whether nonnative English speakers who are native
Mandarin speakers process coarticulatory information in the same way as the native
speakers (Experiment 4). We found that adult L1 Mandarin learners who speak English as an L2 can compensate for assimilation like the L1 speakers, but do so with a delay. This dissertation also presents a new methodology (Experiment 3) for studying the link between production and perception, in which we directly measured the degree of articulatory overlap via electromagnetic articulography (EMA) and correlated it with the perceptual measure of eye movements.
Chapter 1
Introduction
1.1 Speech reduction, production, and perception
The pronunciation of a word in continuous speech is often different from when it
is spoken in isolation. The processes that account for these phonetic variations have been
termed “connected speech processes” (Gimson, 1970; Jones, 1969). For instance, a
number of different phonetic reduction processes occur in continuous speech in English,
including (i) assimilation of place, manner, voicing; (ii) reduction of vowels to schwa
when unaccented and (iii) deletion of some consonants and vowels (Gimson, 1970, see
also Kohler, 1999 on German, Ernestus, 2000 on Dutch for related findings). These
reduction processes also occur very frequently: over 60% of the words in Johnson’s
(2004) corpus of conversational American English are more reduced than the canonical
form in some way (e.g., involving assimilation or deletion of segments).
Speech reduction reflects a fundamental property of spoken language, namely that
during the production of multiple segments, the movements of articulators (e.g., the
tongue and the lips) can overlap in time. This overlap is known as coarticulation. The
overlapping nature of successive phonetic segments means that the subsequent vocal tract
configuration for a particular segment is influenced by the neighboring segments, which
leads to the contextual variability of speech sounds. For example, in the phrase run
quickly, the [n] in run may become [ŋ] when the place of articulation shifts towards that
of the following consonant, [k]. This is also known as assimilation, a frequently observed
phenomenon in which the phonetic properties of a speech sound are modified by those of
an adjacent segment.
In prior work, researchers have proposed a range of accounts of how
coarticulation guides the speech production mechanism and leads to reduction effects.
These include the theories of (i) coarticulation as a phonological component (e.g.,
Daniloff & Hammarberg, 1973; Hammarberg, 1976) (ii) coarticulation as speech
economy (e.g., Lindblom 1983, 1989, 1990), and (iii) coarticulation as coproduction
(e.g., Bell-Berti & Harris, 1981; Browman & Goldstein, 1986; Fowler, 1977). We will
discuss these in more detail below in Section 1.2.
As we will see, the third theory, coarticulation as coproduction (standardly known
as the coproduction account), appears to be supported by a robust amount of
experimental data (e.g., Farnetani & Recasens, 2010). In particular, the coproduction
account argues that speech reduction may be a function of the amount of articulatory
overlap, which is not predicted by other accounts. One of the predictions stemming from
this theoretical approach is that a greater amount of overlap is associated with increased
reduction (e.g. more assimilation, more vowel reduction).
So far we have been focusing on speech production, and the observation that
movements of the articulators overlap during production. Let us now consider the other
side of the coin, namely speech perception. Does the notion of articulatory overlap play a
role in the perception of reduced speech? While the proposal that perceiving speech is
perceiving articulatory movements—such as the motor theory of speech perception (e.g.,
Liberman & Mattingly, 1985; Liberman et al., 1967)—has had a mixed scientific
reception (Galantucci et al., 2006), it is not clear whether listeners may still activate
representations connected to articulatory coproduction. In other words, we are interested
in learning how the perception system might deal with varying amounts of overlap
information and to ask whether the notion of coproduction can inform our thinking about
the processes involved in the perception of reduced speech in particular.
This question has not received much attention in existing work. While research on
the perception of reduced speech has focused on how listeners compensate for
coarticulation (e.g. Coenen et al., 2001; Gaskell & Marslen-Wilson, 1996, 1998; Gow
2001; Mann, 1980; Mann & Repp, 1981), it has not looked specifically at the role of
articulatory overlap. Much of the recent work on the perception of reduced forms has
concentrated on how word recognition is influenced by the context following the critical
word (Coenen et al., 2001; Darcy, 2007; Gaskell & Marslen-Wilson, 1996, 1998, 2001;
Gow 2001, 2002; Mitterer & Blomert, 2003). For example, in “A quick rum picks you
up”, the word ‘rum’ may be ambiguous between ‘run’ and ‘rum’ (Gaskell & Marslen-
Wilson, 2001) because the [m] from rum may be interpreted to be an underlying [n]
assimilating to [m] due to the influence of the [p] from picks. Similarly, from the second
language acquisition (SLA) literature, studies on how second language (L2) learners
compensate for coarticulation patterns in the second language (e.g., Darcy et al., 2007;
Darcy et al., 2009) have examined the acquisition of phonological processes, but did not
specifically investigate whether varying the amount of coarticulation influences word
recognition.
The main question that this dissertation aims to address is this: Can we make use
of the notion of coproduction in order to take steps towards building a unified theory for
how reduced speech is both produced and perceived? In particular, given the central role
of articulatory overlap in speech production, we would like to ask what role articulatory
overlap plays in speech perception and word recognition, especially in the presence of
reduction phenomena. If such a unified approach is viable, we should be able to find support for
coproduction of articulation in different linguistic domains as well as across different
languages.
This dissertation aims to explore the role of articulatory coproduction in the
domains of speech production, spoken word recognition, and second language
acquisition, by combining insights and methods from phonetics and psycholinguistics. To
do so, the first part of the dissertation (Chapter 2) examines the production mechanism
for a reduction phenomenon called Mandarin syllable contraction. Since this
phenomenon is unsatisfactorily explained by existing accounts (which do not consider the
extent of articulatory overlap), this part of the dissertation provides additional evidence
for the extent to which coproduction of articulation plays a role in speech production and
in grammar.
The second part of the dissertation (Chapter 3) looks at a different domain, and a
different language: In that part, we report a series of experiments in English investigating
the production and perception of speech at different speech rates. The experiments
investigate whether cues from prior context – in particular the speech rate of preceding
words – influence how people perceive the critical word. Specifically, we test whether
people are more likely to compensate for assimilation in a fast-speech context, which is
more likely to be reduced, than in a slow-speech context. This part of the dissertation also
aims to directly establish the link between production and perception by examining the
correlation between (i) the extent of speakers’ articulatory overlap (as measured by
electromagnetic articulography) and (ii) listeners’ perception (as measured by visual-
world eye tracking).
Finally, the third part of the dissertation (Chapter 4) focuses on the domain of
second language acquisition. In a follow-up study building on the experiments in Chapter
3, this experiment aims to better understand to what extent nonnative speakers
(specifically, adults who speak Mandarin as their L1 and English as their L2) can
compensate for coarticulation patterns that are not allowed in the phonotactics of their
native language. Specifically, we investigate whether, when adults acquire a new
phonological system, the two systems can co-exist without interference.
In the next section of this chapter, we provide a general overview of the
theoretical accounts of coarticulation with respect to the types of language variations they
help explain. Section 1.3 gives a brief review of the literature on the recognition of
reduced speech. In section 1.4, we lay out an overview of the dissertation that
corresponds to these questions.
1.2 Theoretical accounts of coarticulation
There have been a number of different theoretical accounts for the underlying
mechanisms of coarticulation. Importantly, these accounts propose varying mechanisms
in terms of how reduction is produced in continuous speech. This section first surveys (i)
research which views coarticulation as a phonological component and (ii) literature
which claims that coarticulation occurs in an effort to maintain speech economy, before
turning to (iii) the account of coarticulation as coproduction—the main focus of this
dissertation.
1.2.1 Coarticulation as a phonological component
The account of coarticulation as a phonological component developed as a result
of the debate over whether assimilation and coarticulation stem from qualitatively different
mechanisms. Standard generative phonology (Chomsky & Halle, 1968) makes a clear
distinction between assimilation—in which an adjacent segment modifies a phonetic
segment in terms of its phonetic properties—and coarticulation. Under this view,
assimilation is part of the phonological component, and governed by language-specific,
phonological rules. In contrast, coarticulation is regarded as part of the physical
properties of the speech mechanism, governed by universal rules that hold for all
languages. In fact, under this view, coarticulation is merely the consequence of inertia in
the speech mechanism, “the transition between a vowel and an adjacent consonant, the
adjustments in vocal tract shape made in anticipation of subsequent motions, etc.”
(Chomsky & Halle, 1968, p. 295). However, this clear-cut dichotomy between
assimilation and coarticulation has been challenged by evidence showing that
coarticulation effects are manifested differently in different languages. For example,
Clumeck (1976) looked at six different languages and found that velar coarticulation in
these languages had different temporal patterns. Even within a language, there is
evidence that context-dependent assimilations, where the underlying mechanism should
be identical, can produce different articulatory patterns (e.g., Farnetani, 1986).
To address these concerns, the theory of feature spreading (Daniloff &
Hammarberg, 1973; Hammarberg, 1976) assumes coarticulation to be part of the
phonological component that can lead to the phonetic variations. Under this view, the
phonological component is separate from the phonetic component but controls its
implementation. When defined in this way, the phonological component lacks the spatial
and temporal specification that gives lexical organization of speech its distinctness in
other theories of coarticulation (e.g., Browman & Goldstein, 1991). In addition, given
that the physical speech mechanism can only execute (but not modify) phonological
components, variations attributed to coarticulation must be part of the input to the speech
mechanism. Moreover, since “[c]oarticulation is… a process whereby the properties of a
segment are altered due to the influences exerted on it by neighboring segments” (Hammarberg, 1976, p. 576), coarticulation occurs as a way to minimize the differences
between the segment of interest and its context. Crucially, if coarticulation is part of the
phonological component, then forms of reduced speech should be explainable by
phonological rules. However, the feature spreading theory has been criticized for its
failure to explain extensive carryover effects in V-to-V coarticulation (e.g., Magen,
1989; Recasens, 1989) and appears to be incompatible with the findings that some
segments resist the influences of a neighboring segment or segments, also known as
coarticulatory resistance (e.g., Benguerel & Cowan, 1974; Magen, 1989; Sussman &
Westbury, 1981).
1.2.2 Coarticulation as speech economy
Instead of viewing coarticulation as a consequence of inertia, Lindblom (1983,
1989, 1990) argues that phonetic variation is part of a continual process of adaptation by
the speech mechanism to meet the demands of the communicative situation. On the basis
of treating speech as a biological activity, Lindblom’s theory of adaptive variability and
Hyper- and Hypo-speech aims to explain how linguistic forms are realized as the motor
control system interacts with the physical and physiological constraints impinging on
speech. In particular, the Hyper- and Hypo-speech theory (Lindblom, 1990) states that
much of the variation in speech occurs as a result of the speaker juggling output-oriented
and system-oriented goals. The output-oriented goal of being understood will force the
speaker to use hyperspeech, clear speech that is hyper-articulated. On the other hand, the
system-oriented and low-cost goal will allow the speaker to use hypospeech, more casual
speech that is under-articulated. Crucially, Lindblom’s theory of adaptive variability and
Hyper- and Hypo-speech suggest that during the course of speech communication, the
speaker continuously adapts to the needs of the listeners by estimating how much
hypospeech is permitted in order to reduce cost associated with the act of producing
speech. According to this account, coarticulation pervades casual speech because
hypospeech is a low-cost behavior, governed by the principle of economy.
Lindblom’s account of coarticulation is motivated by his study of vowel reduction
(Lindblom, 1963). He found a systematic reduction in frequencies of the first two
formants as the durations of vowels decrease. To account for these observations, he
introduced the model of target undershoot, in which the target is an ideal spectral
configuration that is context-free. Lindblom demonstrated that formants tend to reach
their target values when the vowel is long. However, as the duration of vowels decreases,
formants shift towards the value of adjacent consonants. Therefore, he concluded that
reduction is an articulatory process in response to the speech motor system under
conditions of increased motor activities. In other words, when the speech mechanism
needs to execute motor commands at very short temporal intervals, articulators tend not
to have sufficient time to reach their targets before the arrival of the next motor
command.
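To make the duration dependence concrete, the following is a minimal illustrative sketch of the kind of exponential undershoot function this family of models implies; the function name, the rate constant, and the Hz values are hypothetical and are not taken from Lindblom's papers. The observed formant approaches its context-free target as vowel duration grows and falls back toward the consonantal locus as duration shrinks.

    # Hedged sketch of duration-dependent formant undershoot (hypothetical constants).
    import math

    def observed_f2(f2_target, f2_locus, duration_ms, rate_constant=0.02):
        """Observed F2 moves from the consonantal locus toward the vowel target
        as duration increases; short vowels undershoot the target."""
        undershoot = (f2_target - f2_locus) * math.exp(-rate_constant * duration_ms)
        return f2_target - undershoot

    # A long vowel nearly reaches its target; a short vowel falls well short of it.
    print(observed_f2(f2_target=2300, f2_locus=1800, duration_ms=200))  # ~2291 Hz
    print(observed_f2(f2_target=2300, f2_locus=1800, duration_ms=40))   # ~2075 Hz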
However, this account is challenged by findings that in general, motor systems do
not automatically undergo reduced movements with high rates of speech motor
commands (e.g., Gay, 1978; Kuehn & Moll, 1976; van Son & Pols, 1990). Moreover,
reduction can still occur even when speech rates are slow (Nord, 1986). In a revised
version of the target undershoot model (Moon & Lindblom, 1994), shorter durations are
not necessarily accompanied by target undershoot and factors such as speech style can
modify the amount of reduction/coarticulation. Regardless of the changes made in the
revised account of articulatory undershoot, both the original and revised account view
coarticulation not as a grammatical process, but one that is dependent on time pressure.
Importantly, reduction according to this view is the consequence of articulators not
having sufficient time to reach their target.
Studies on the production and perception of fast speech, when speech is more
likely to be reduced, have also found contradictory evidence for Lindblom’s account of
coarticulation as speech economy. On the one hand, if a speaker modifies his or her
speech with the needs of the listener in mind, the Hyper- and Hypo-speech theory
predicts that the speaker would speed up when the content they are conveying is less
informative, but speak more slowly during the most informative parts. According to a
hypothesis by Foulke (1971), the temporal organization of speech becomes increasingly
important when speech rate increases. In support of the hypothesis, duration studies of
normal and fast rate speech have shown that speech is not sped up linearly – in particular,
researchers have observed that consonant durations are reduced more than vowel durations (e.g.,
Gay, 1978; Lehiste, 1970). The relative difference in duration between stressed and
unstressed syllables increases in faster speech as well. On the other hand, perception
studies comparing linearly compressed fast speech (which preserves the reduction and
spectral characteristics of slower speech) to naturally produced fast speech found slower
phoneme recognition in naturally produced fast speech (Janse et al., 2003; Janse, 2004).
These results from Janse et al. seem to go against the prediction of the Hyper- and Hypo-
speech theory that naturally produced speech would have improved phoneme recognition
as the speaker adapts his or her speech to what the listener needs to comprehend the
message.
1.2.3 Coarticulation as coproduction
In contrast to the research traditions that treat coarticulation as a phonological
component and as speech economy, the theory of coarticulation as coproduction focuses
on the notion of articulatory overlap (e.g., Bell-Berti & Harris, 1981; Browman &
Goldstein, 1986; Fowler, 1977). The coproduction account bridges the distinction
between two different types of representations in feature-based theories: (i) the
phonological or cognitive structure and (ii) the phonetic or physical structure. This type
of division into phonological vs phonetic representations has been argued to be
unsatisfactory in capturing systematic physical differences in different languages
(Ladefoged, 1980; Port, 1981). To overcome the distinction between phonological and phonetic
structure, Browman and Goldstein (1995) postulate that these representations are the
macroscopic and microscopic properties of the same system. In this system, the basic
phonological unit is the gesture (Browman & Goldstein, 1986, 1991; Saltzman &
Munhall, 1989). Gestures are temporally discrete “events that unfold during speech
production and whose consequences can be observed in the movements of the speech
articulators” (Browman & Goldstein, 1992, p. 23). As these gestures are assumed to be
invariant and inherently spatio-temporal, they can be coproduced or overlapped in time. It
is the overlap of invariant gestures that may result in context-varying articulatory
trajectories and subsequent changes in acoustic signals.
As a part of the coproduction account, Browman and Goldstein’s (1990, 1992)
theory of articulatory phonology also has the advantage of offering a unifying
explanation for apparently unrelated speech processes. Gradient changes in the amount of
overlap between gestures can account for allophonic variations as well as connected
speech processes such as assimilation, deletions and reductions. For example, in the fast
execution of the utterance “perfect memory”, the [t] may appear to be deleted. However,
Browman and Goldstein (1990) showed that the movement of tongue tip for [t] is
articulatorily present but acoustically hidden due to its overlap with the lips gesture for
[m]. When speech rate is relatively fast, the overlap of gestures for consonants can either
hide each other when they involve different articulators or blend their acoustic
characteristics when the same articulator is involved.
Extending the notion of articulatory overlap, Fowler and Saltzman (1993) also
offer an explanation for the phenomenon of coarticulatory resistance, in which a segment
resists the influences of a neighboring segment. They suggest that the level of resistance
is dependent on the “blending strength” associated with the overlapping gestures. In this
view, gestures with stronger blending strength are likely to suppress gestures with weaker
blending strength; the blending of gestures with similar strength will be an average of the
influence from overlapping gestures.
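As an illustration of this blending idea, the realized target of two overlapping gestures can be thought of as a strength-weighted average of their individual targets, so that a much stronger gesture effectively suppresses a weaker one while gestures of similar strength average out. The weighting rule and the numbers below are a hedged sketch only, not Fowler and Saltzman's actual equations.

    # Hedged sketch of gesture blending by "blending strength" (illustrative values).
    def blended_target(target_a, strength_a, target_b, strength_b):
        """Realized target when two gestures overlap: a strength-weighted average."""
        return (target_a * strength_a + target_b * strength_b) / (strength_a + strength_b)

    # Similar strengths: roughly the mean of the two targets.
    print(blended_target(10.0, 1.0, 14.0, 1.0))   # 12.0
    # A much stronger (more resistant) gesture dominates the outcome.
    print(blended_target(10.0, 5.0, 14.0, 0.5))   # ~10.4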
There is a growing body of evidence in support of the coproduction account.
Experimental data on contrast neutralization show that allophonic variation is a graded,
rather than a categorical, process as predicted by articulatory phonology (Beckman &
Shoji, 1984; Port & O’Dell, 1985). Data on English place assimilation is also consistent
with the observation that there is an intermediate stage between the absence of
assimilation and complete assimilation (Kerswill & Wright, 1989; Wright & Kerswill,
1989; Nolan, 1992). Experimental findings on tongue tip/blade displacement in English
(Bladon & Nolan, 1977), vowel-to-vowel coarticulation in Catalan (Recasens, 1984) and
consonant-to-vowel coarticulation in Italian (Farnetani & Recasens, 1993) are also
compatible with Fowler and Saltzman’s account of coarticulatory resistance.
In light of the evidence supporting the coproduction account, this dissertation
focuses on the question of the extent to which articulatory overlap plays a role in the
production and perception of reduced speech, an area that has received relatively less
attention.
1.3 Perception of reduced speech and compensation for coarticulation
Much of the existing research on coarticulation has approached this from the
production side. The work that has been done on the perception of coarticulated speech
has mostly focused on the context-dependent nature of articulatory gestures (e.g.,
Liberman et al., 1954; Liberman & Mattingly, 1985; Fowler, 1981; Browman &
Goldstein, 1989). For example, Liberman and colleagues (1954) found that the formant
transitions for [d] in [di] and [du] are very different due to the coproduction of the
consonant and vowel: the second formant transition (F2) for [di] is rising and has a high
frequency locus, while the F2 for [du] is falling and has a low frequency locus. Strikingly,
Liberman and colleagues found that despite these acoustic differences, we perceive the
consonants in [di] and [du] to be the same ([d] in both cases).
In the face of such context-dependence in production, listeners are known to
compensate for coarticulation in their perception (Mann 1980; Mann & Repp, 1981;
Repp & Mann, 1981, 1982). In a seminal study, Mann (1980) showed that when listeners
are asked to identify ambiguous sounds in a continuum from [da] to [ga], they were more
likely to report hearing [ga] when the preceding syllable was [al] than when the preceding
syllable was [ar]. Thus, the same acoustic segments were perceived differently depending
on the preceding syllable. Mann attributed this to the listeners’ perceptual systems
compensating for coarticulatory processes occurring during language production: When
[g] is pronounced after [l], the relatively forward tongue position associated with [l]
means that in natural speech, the [g] tends to become acoustically similar to [d] (which
also has a forward tongue position). This does not occur when [g] is pronounced after [r],
because the tongue is further back during the articulation of [r]. Thus, an acoustic
sequence that sounds like [alda] may in fact be underlyingly [alga], but on the surface
sounds somewhat like [alda] due to coarticulation. Listeners can compensate for this by
‘recovering’ the underlying form (e.g. [ga]) in contexts where coarticulation may have
occurred (e.g. after [al]).
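One simple way to picture this compensation is as a context-dependent shift of the listener's category boundary along the [da]-[ga] continuum: after [al], the boundary moves so that more of the ambiguous steps are labeled [ga]. The sketch below is a toy illustration with invented step numbers and boundary values, not a model fit to Mann's data.

    # Toy sketch of compensation for coarticulation as a category-boundary shift.
    # Continuum steps 1..7 run from clear [da] to clear [ga]; all values are invented.
    def label_step(step, preceding_syllable):
        """Categorize a continuum step; the [da]/[ga] boundary shifts toward [da]
        after [al], so more steps are heard as [ga] in that context."""
        boundary = 3.0 if preceding_syllable == "al" else 4.0
        return "ga" if step > boundary else "da"

    ambiguous_step = 4
    print(label_step(ambiguous_step, "al"))  # 'ga': compensation for coarticulation
    print(label_step(ambiguous_step, "ar"))  # 'da': no compensation in this context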
Recent work on the perception of reduced forms has concentrated on the influence
of the following context (Coenen et al., 2001; Darcy, 2007; Gaskell & Marslen-Wilson,
1996, 1998; Gow 2001, 2002; Mitterer & Blomert, 2003). This was first demonstrated
experimentally by Gaskell and Marslen-Wilson (1996). In a series of cross-modal
priming experiments, they showed that an unassimilated word form was activated only in
contexts where assimilation is phonologically viable. For example, the unassimilated
form wicked was found to be activated when the assimilated form wickib appeared in a
phonologically viable context, such as a sentence like “That was a wickib prank”. Here,
the bilabial [p] at the start of prank can trigger place assimilation of [d] at the end of
wicked, making it sound more like wickib. In such contexts, listeners who heard wickib
also activated wicked. However, when wickib was in a context where assimilation cannot
occur (e.g., “That was a wickib game,” where the velar [g] at the start of ‘game’ cannot
result in ‘wicked’ becoming wickib), listeners did not activate wicked. In sum, when the
critical word is followed by a potential assimilation trigger (e.g., wicked + prank =>
wickib + prank), listeners compensated for the assimilation and ‘recovered’ the
underlying, unassimilated form (e.g. wicked).
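The contextual logic behind this finding can be written as a small viability check; this is a hypothetical illustration with simplified segment classes, not Gaskell and Marslen-Wilson's model. A word-final surface labial (or velar) is a plausible realization of an underlying coronal only if the following word begins with a segment of that same place of articulation.

    # Hedged sketch of "phonological viability" for English coronal place assimilation.
    LABIAL = {"p", "b", "m"}
    VELAR = {"k", "g"}

    def coronal_recoverable(surface_final, next_initial):
        """Could the word-final surface segment be an assimilated coronal, given the
        first segment of the following word?"""
        if surface_final in LABIAL:
            return next_initial in LABIAL
        if surface_final in VELAR:
            return next_initial in VELAR
        return False

    print(coronal_recoverable("b", "p"))  # True: 'wickib' before 'prank' -> viable
    print(coronal_recoverable("b", "g"))  # False: 'wickib' before 'game' -> not viable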
Using a similar experimental design, Gaskell and Marslen-Wilson (1998) found
that listeners can even compensate for assimilation of nonwords. With sentences like
“Luckily, the ship was only a frayb/prayb bearer/carrier”, where frayb can result from a
phonologically viable context with bearer from a real word freight and prayb from a
nonword preight, listeners falsely reported the presence of [t] in frayb and prayb—albeit
to a weaker extent in prayb. Recent work by Gow and McMurray (Gow 2001, 2002,
2003; Gow & McMurray, 2007), also examining the processing of assimilated speech,
showed that coronal assimilation is a non-neutralizing, gradient process in production and
argued that the spoken word recognition process is characterized by both regressive
effects (which resolve ambiguity created by assimilation) and progressive effects (which
affect the perception of upcoming material).
In sum, prior work on the perception of coarticulated speech has found robust
evidence for how the perception system copes with the context-dependent nature of
speech, in particular the effects of the following context.
1.4 Overview of the dissertation
My dissertation aims to investigate the role of coproduction of articulation, if any,
in both the production and perception of reduced speech. More specifically, this research
seeks to address the question of whether we can make use of the notion of coproduction
in order to take steps towards building a unified theory for how reduced speech is both
produced and perceived. While prior work has shown—from a production perspective—
that the properties of reduced speech can be explained by making reference to
coproduction of articulatory gestures, many questions still remain open regarding the
perception of reduced speech (in particular, whether the patterns observed in perception
can be explained by making reference to overlap of articulatory gestures), as well as the
nature of the relation between production and perception. These overall concerns are
translated into a number of sub-questions with respect to coproduction of articulation:
i. In the production of reduced speech, can differences in the alternative
pronunciations of words be attributed to coproduction of articulation? In addition,
how does the coproduction account contrast with other accounts of coarticulation?
(Chapter 2)
ii. How does the perception system deal with acoustic information that may convey
the extent to which articulators overlap? How does coproduction information
from the preceding words (i.e., global cues from prior context) as well as
gradient articulatory overlap information at the critical word influence the
process of real-time word recognition? Moreover, what is the relationship
between the degree of articulatory overlap (in the speaker’s output) and word
recognition (by the listener)? (Chapter 3)
iii. If coproduction information influences the perception for native speakers, how
about nonnative speakers? In other words, how does the presence of an additional
phonological system influence the recognition of reduced forms? (Chapter 4)
Our investigation of these questions combines methodologies from the fields of phonetics
and psycholinguistics. From the field of phonetics, we use acoustic measures to study the
extent of coarticulation (Chapter 2) and electromagnetic articulography (EMA; Chapter
3) to track the movements of articulators. From the field of psycholinguistics, we use
visual world eye-tracking (Chapters 3 and 4) to tap into spoken word recognition.
Specifically, Chapter 2 aims to provide additional evidence that the production of
reduced speech can be attributed to coproduction of articulation. To do so, we examine a
special type of pronunciation reduction in Mandarin Chinese observed in spontaneous
speech. The phenomenon has been called “contraction of syllables” (Cheng 1985; Chung
1997) or more recently “syllable merger” (Duanmu, 2000). In Experiment 1, we measure
the acoustic properties of function words, which have been documented to undergo more
contraction. Under the framework of articulatory phonology, we propose that Mandarin
syllable contraction occurs as a result of gestural reorganization shifting from an anti-
phase to an in-phase relationship.
Chapter 3 focuses on the perception of reduced speech and its connection to the
notion of articulatory overlap. The chapter includes two experiments which investigate
how two different sources of coarticulation information influence word recognition. In
Experiment 2, we use the visual world eye-tracking paradigm to study the phenomenon
of nasal place assimilation in English. We examine whether there are compensation
behavior differences in a fast-speech context, which is more likely to be reduced, than in a
slow-speech context. We are also interested in investigating how the precise extent of
articulatory overlap can influence lexical access in native speakers. In Experiment 3, we
investigate the relationship between (i) the comprehension data from Experiment 2 and
(ii) the production data from EMA, which allows us to measure articulatory gestures
during production. If the perception system is capable of detecting small, gradient differences in
the overlap of articulatory gestures, a correlation between the two measures is expected.
Chapter 4 is an extension of Experiment 2 in Chapter 3 to a different population:
adult L1 Mandarin learners of English as a second language. This allows us to investigate
how nonnative speakers of English process coarticulatory information associated with
coproduction of articulation. Additionally, the chapter addresses the question of whether
interference will occur in the processing of reduced speech with the development of a
nonnative phonological system. Mandarin speakers were chosen for Experiment 4
because Mandarin phonotactics does not allow words to end in the nasal coda [m],
though [n] is allowed, which directly pits it against the process of nasal place assimilation
in English.
Chapter 5 summarizes the findings in Chapters 2-4 and addresses the theoretical implications of this work, as well as directions for future research.
Chapter 2
Mandarin Syllable Contraction
2.1 Introduction
In spontaneous speech, spoken words are often reduced due to a number of
linguistic and pragmatic factors. The coproduction theory of coarticulation (e.g.,
Browman & Goldstein, 1986; Fowler, 1977; Bell-Berti & Harris, 1981) argues that the
overlap of articulatory gestures is what causes the variability observed in speech
reduction. Given the different accounts of coarticulation reviewed in Chapter 1, this
chapter provides additional evidence in support of the coproduction theory in particular.
More specifically, we will show that coproduction of articulation can be used to explain
the phenomenon of syllable contraction in Mandarin (e.g., Duanmu, 2000).
While it is not rare for words to be reduced at the syllable level (e.g., Ernestus,
2000 on Dutch), Mandarin syllable contraction is well known for the extent of
contraction it undergoes (e.g., Tseng, 2005a, 2005b) and how the new alternative
pronunciations can sometimes become lexicalized (e.g., [tʂɤiaŋzí] to [tçiaŋts] ‘this way’). Experiment 1 presented in this chapter tests the predictions of the
coproduction account (discussed below in Section 2.1.4). As we will see, the
coproduction account can also explain why nasal coda [m] exists in Mandarin contracted
syllables – which is a pattern that other accounts have not been able to explain
satisfactorily.
The following sections review the literature on Mandarin syllable contraction,
including a discussion of why existing theories (e.g., Chung, 1997; Hsu, 2003; Cheng &
Xu, 2009) remain unsatisfactory. Section 2.2 presents the design of Experiment 1, and we
discuss the findings of the experiment in Section 2.3.
2.1.1 Mandarin syllable contraction
In many Chinese dialects, there is a special type of pronunciation reduction or
segmental reduction that occurs frequently in spontaneous speech. The phenomenon
involves a sequence of two or more syllables that are reduced to a lesser number of
syllables—akin to I’m or gonna in English—and has subsequently been known as
“syllable contraction”, “syllable fusion”, and “syllable merger” (Cheng, 1985; Chung,
1997; Duanmu, 2000). Given the recent interest in the phenomenon, syllable contraction
has been studied in many dialects of Chinese, including Southern Min (Cheng 1985;
Myers & Li, 2009), Cantonese (Wong, 2006), and Hakka (Chung, 1997).
A growing number of studies have also examined syllable contraction in
Mandarin (Cheng & Xu, 2009; Hsiao 2002; Tseng, 2005a, 2005b). Like other dialects
that undergo syllable contraction, Mandarin syllable contraction may be disyllabic or
trisyllabic. As illustrated in Table 1 [1], in ex. 1, the disyllabic word [ɹanxou], ‘afterwards’,
is contracted into [ʔã]. A trisyllabic compound like [tʂɤiaŋzi], in ex. 5, ‘so + this way’, is
contracted into [tçiaŋts]. Note that contraction may occur within words as well as across
words. Therefore, disyllabic contractions may result from the merger of two
monosyllabic words or one disyllabic word. Similarly, trisyllabic contractions may be the
combination of three monosyllabic words, one monosyllabic and one disyllabic word, or
vice versa.
Footnote 1: Given the disputed status of glides in Mandarin, [w] and [j] are transcribed as [u] and [i] in this chapter.
Table 1. Examples of common Mandarin contracted syllables (Tseng, 2005)

            Non-contracted Form   Contracted Form   Gloss
Disyllabic
ex. 1       ɹanxou                ʔã                ‘afterwards’
ex. 2       uomɤn                 uom, om, m        ‘we’
ex. 3       suoi                  suɤ, sʷɛ          ‘so’
ex. 4       kɤi                   kɤ                ‘can’
Trisyllabic
ex. 5       tʂɤiaŋzí              tçiaŋts           ‘this way’
ex. 6       suoiuo                suiʔ              ‘so + I’
ex. 7       inueiuo               inuo              ‘because + I’
Examining the extent of contraction in Mandarin, Tseng (2005a, 2005b)
conducted a corpus study in which eight spontaneous conversations were recorded and
analyzed. Following Tseng and Liu (2002), she considered syllables to be contracted
when perceptually there was a deletion of syllables or no clear cues for syllable boundary.
Ignoring syllable contractions that were longer than three syllables, the study reported
that 32% of a total 39,490 syllables in the study were contracted. Of these contracted
syllables, disyllabic contractions were the most frequent, accounting for 74% of the
contractions. This may be correlated with the fact that the majority of words or
compounds in Chinese are disyllabic. On the combinations of syllables that result in
contractions, Tseng found that 65% of the disyllabic contractions occur in disyllabic
words while 60% of the trisyllabic contractions involve a monosyllabic and a disyllabic
word. Remarkably, despite how frequently syllables are contracted in Mandarin, most
speakers of the language are unaware of the phenomenon (Chung, 2006).
Consistent with other dialects in Chinese (see e.g., Cheng, 1985 for Southern
Min), contraction in Mandarin also occurs more frequently in function words (Tseng,
2001). For example, while there may be variation in what the first word might be in
contracted syllables of two monosyllabic words, the second monosyllabic word is often a
function word (e.g., [i], ‘one’ and [iou] ‘have’). In addition, words that are contracted
tend to be higher in frequency (Tseng, 2005a, 2005b and see Myers & Li, 2009 for an
experiment testing the relationship of lexical frequency and Southern Min syllable
contraction). Tseng suggests that function words are often contracted because they are
only a small part the vocabulary of a language. Compared to content words, function
words are more frequent, and thus may be uttered more rapidly because they are more
predictable. Also, since function words are less likely to be stressed than content words,
they are more likely to become attached to the preceding word. On the functionality of
syllable contraction, Chung (2006) argues that speakers use contractions to background
those parts of an utterance that are low in semantic content, thus helping listeners focus
on the more valuable information in the content words.
2.1.2 Phonology of syllable contraction
There have been various analyses of Chinese syllable contraction (e.g., Cheng,
1985; Cheng & Xu, 2009; Myers & Li, 2009; Hsu, 2003; Xu & Li, 2009). From a formal
linguistics perspective, Chung (1997) proposes an edge-in model to account for the
phenomenon. Chung suggests that contraction occurs according to the rule stated in (1),
based on an assumed structure of onset (C), nucleus (V), and coda (X; henceforth
abbreviated as CVX [2]) for all Chinese dialects (see Duanmu, 2000 for a discussion of the
CVX structure). Crucially, the result of contraction should adhere to the CVX structure.
(1) Syllable contraction: CVX + CVX => CVX
As shown in (2), Chung (1997) illustrates this rule with an example from Hakka.
(2)  CVX + CVX => CVX
     ŋai + ten => ŋan
Following Chung, Hsu (2003) extended her model by incorporating the sonority scale of
a > ɔ > e > i > u to determine the vowel in the nucleus position. According to this model, the
target of syllable contraction must obey the rules of Chinese phonotactics.
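To make the edge-in rule and Hsu's (2003) sonority extension concrete, the following is a hedged sketch with simplified segment handling and hypothetical helper names: the onset is taken from the left edge, the coda from the right edge, and the nucleus is the most sonorous of the available vowels on the scale a > ɔ > e > i > u.

    # Hedged sketch of the edge-in model with Hsu's sonority scale (simplified).
    # Syllables are represented as (onset, nucleus, coda) tuples.
    SONORITY = {"a": 5, "ɔ": 4, "e": 3, "i": 2, "u": 1}

    def edge_in_contract(syll1, syll2):
        """Contract CVX + CVX into a single CVX: left-edge onset, right-edge coda,
        and the more sonorous of the two nuclei."""
        onset1, nucleus1, _coda1 = syll1
        _onset2, nucleus2, coda2 = syll2
        nucleus = max(nucleus1, nucleus2, key=lambda v: SONORITY.get(v, 0))
        return (onset1, nucleus, coda2)

    # Hakka example (2): ŋai + ten -> ŋan.
    print(edge_in_contract(("ŋ", "a", "i"), ("t", "e", "n")))  # ('ŋ', 'a', 'n')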
Despite its explanatory power to describe many contracted syllables observed, the
edge-in model remains an unsatisfactory account of Mandarin syllable contraction. As
Tseng (2005a, 2005b) noted in her corpus study, many contracted syllables did not obey the rule in (1). In Mandarin, the consonant [m] is never allowed in the
coda position (unlike Southern Min and Hakka). However, many contracted syllables that
she observed in the corpus have a clear, often lengthened coda consonant [m]. An
example of this is illustrated in ex. 2 of Table 1. If Chinese syllable contraction is
determined by (1), a consonant [m] in the coda position should be impossible.
Footnote 2: CVX or CGVX structure has been proposed to account for glides and coda nasals (Duanmu, 2000; Chung, 1997). In Mandarin, the syllable structure is either open syllable CV or closed syllable CVC.
2.1.3 Articulatory undershoot
Offering a different perspective on syllable contraction, Cheng and Xu (2009)
argue that the underlying mechanism of syllable contraction is the gradient undershoot of
the articulatory target (Lindblom, 1990, see Section 1.2.2 in Chapter 1). Using nonsense
Taiwan Mandarin disyllabic sequences as targets of contraction, Cheng and Xu
manipulated speech rate and studied the relationship between contraction and time
pressure. To control rate, they asked subjects to speak at three different rates: (i) slow as
if reciting in class, (ii) natural as if having a conversation, and (iii) as fast as possible. To
rate the degree of contraction, Cheng and Xu classified the produced speech as non-
contracted, semi-contracted, or contracted based on the degree of interruption of formants
by the intervocalic consonant, presence of nasal murmur, or a clearly lowered F1. The
results show that contraction can be induced in nonsense sequences and that speech rate
also has a significant effect on contraction. Therefore, they conclude that duration is a
key factor for the occurrence of contraction in Taiwan Mandarin. Cheng and Xu argue
that when duration of articulation decreases below a certain threshold due to time
pressure, the execution of the articulatory movement toward a consonantal target
eventually becomes difficult. However, like the edge-in model, the articulatory
undershoot model cannot offer an explanation for the presence of coda consonant [m] in
Mandarin contracted syllables. As we will see in the next section, a solution is offered by
the theory of articulatory phonology that adopts the coproduction view.
2.1.4 Articulatory phonology
Browman and Goldstein’s seminal theory of articulatory phonology (1990, 1992)
examines the phenomenon of speech reduction from the perspective of coproduction. In
articulatory phonology, gestures are the basic units of combination (Browman &
Goldstein, 1995). According to this view, speech is composed of an ensemble of gestures
overlapped temporally, which are coupled by a system of planning oscillators, or clocks
(Goldstein et al., 2006). In this model, the oscillators would entrain to two stable modes
of inter-gestural coordination, in-phase and anti-phase, two modes that are intrinsically
accessible. When the onsets of gestures are synchronous, clocks would have an in-phase
relationship (0°). Clocks would have an anti-phase relationship (180°) when the onsets of
gestures are sequential.
While both the in-phase and anti-phase modes are intrinsically accessible,
research has suggested that an in-phase mode is more stable than an anti-phase one. As
seen in finger tapping experiments, while in-phase coupling remains stable as the rate of
production increases, anti-phase coupling becomes unstable and spontaneously switches to
an in-phase mode (Kelso, 1984). This was replicated in speech studies. de Jong et al.
(2002) demonstrated that VC syllables will also spontaneously switch to CV syllables as
speakers increased their speech rate. From an articulatory phonology perspective, the
consonant in a VC syllable can be viewed to couple with the preceding vowel in an anti-
phase relationship. On the other hand, the consonant in a CV syllable is coupled with the
vowel in an in-phase relationship. The implication of the study of de Jong et al. is that in
speech, the coordination of gestures will switch from an anti-phase relationship to an in-
phase relationship as a result of increased speech rate, a case of gestural reorganization.
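The stability asymmetry between in-phase and anti-phase coupling that drives this reorganization is commonly modeled with a relative-phase potential of the Haken-Kelso-Bunz type. The sketch below is only an illustration of that standard model (the parameter values are hypothetical and are not taken from this dissertation or from de Jong et al.): the potential always keeps a minimum at 0° (in-phase), whereas the minimum at 180° (anti-phase) disappears as the rate-dependent ratio b/a drops, the dynamical analogue of the anti-phase-to-in-phase switch described above.

    # Hedged sketch: Haken-Kelso-Bunz potential V(phi) = -a*cos(phi) - b*cos(2*phi).
    # Increasing movement rate is modeled as a decreasing ratio b/a; values are illustrative.
    import math

    def hkb_potential(phi_deg, a=1.0, b=1.0):
        """Relative-phase potential; its minima are the stable coordination modes."""
        phi = math.radians(phi_deg)
        return -a * math.cos(phi) - b * math.cos(2 * phi)

    def antiphase_is_stable(a, b):
        """Anti-phase (180 degrees) is a local minimum only when b/a > 1/4."""
        return b / a > 0.25

    print(hkb_potential(0), hkb_potential(180))   # -2.0 0.0 (potential at the two modes, a = b = 1)
    print(antiphase_is_stable(a=1.0, b=1.0))      # True: slow rate, anti-phase still stable
    print(antiphase_is_stable(a=1.0, b=0.1))      # False: fast rate, only in-phase remains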
Figure 1. Gestural score of non-contracted [uomɤn], ‘we’,
where the gestures for [n] are anti-phase coupled to the vowel
The presence of coda consonant [m]
The change in coupling relations allowed by articulatory phonology has the
advantage of explaining the presence of the coda consonant [m] in Mandarin syllable
contraction. For instance, when the disyllabic word [uomɤn], ‘we’, is not contracted, its
gestural score, a symbolic graphical representation of when a gesture is activated, is
shown in Figure 1. As specified under the articulatory phonology framework, gestures are coupled with respect to each other in either an in-phase or anti-phase relationship.
Notice that the gestures for [n] in [uomɤn] are anti-phase coupled to the vowel gesture of the
tongue body. However, when the rate of production increases, such as during syllable
contraction, the duration of each gesture can shorten. Crucially, anti-phase coupling may
spontaneously switch to in-phase coupling because of the higher stability of in-phase
relations, as shown in Figure 2. More specifically, the shortened gestures for [m] in [mɤn]
(velum opening, bilabial stop) may become in-phase with the gestures for [n] (velum
opening and alveolar stop). When this occurs, the fact that the lips are closed at the same
time that contact is made between the tongue tip and alveolar ridge produces a sound that
is perceived as an [m] in the coda position.
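For concreteness, a gestural score of the kind shown in Figures 1-3 can be sketched as a simple data structure. This is a hedged illustration only: the gesture inventory is simplified, the field values are not measured, and the coupling labels are stated relative to the tongue-body vowel gesture for [ɤ]; the point is just that contraction amounts to re-coupling the coda gestures from anti-phase (sequential) to in-phase (synchronous).

    # Hedged sketch of a gestural score for [mɤn], the second syllable of [uomɤn] 'we'.
    # Coupling is stated relative to the tongue-body vowel gesture for [ɤ].
    from dataclasses import dataclass

    @dataclass
    class Gesture:
        articulator: str   # e.g., "lips", "velum", "tongue tip"
        goal: str          # constriction goal
        coupling: str      # "in-phase" or "anti-phase" with the vowel gesture

    non_contracted = [
        Gesture("lips", "closure for onset [m]", "in-phase"),
        Gesture("velum", "opening for onset [m]", "in-phase"),
        Gesture("tongue tip", "alveolar closure for coda [n]", "anti-phase"),
        Gesture("velum", "opening for coda [n]", "anti-phase"),
    ]

    # Under contraction the anti-phase gestures re-couple in-phase, so the lip closure
    # and the tongue-tip closure co-occur and the output is heard as a coda [m].
    contracted = [Gesture(g.articulator, g.goal, "in-phase") for g in non_contracted]
    print(all(g.coupling == "in-phase" for g in contracted))  # True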
Figure 2. Gestural score of contracted ‘we’ under the in-phase coupling proposal
Figure 3. Gestural score of extreme contraction of
‘we’ under the in-phase coupling proposal
It is important to note that there is a clear distinction between the predictions
made by articulatory phonology and the target undershoot model. Consider Figure 4, the
gestural score of contracted ‘we’ under the target undershoot proposal. All the gestures
are shortened due to time pressure during syllable contraction. However, the coda
consonant of the last syllable in Figure 4 would still be [n] because the word would still
end with the velum opening and alveolar stop occurring at the same time, no matter how
short the durations become.
Figure 4. Gestural score of contracted ‘we’ under the target undershoot proposal
2.1.5 Aims of this chapter
The experiment presented in this chapter uses acoustic measures to investigate the
underlying speech production mechanism of Mandarin syllable contraction. The
experiment has two aims:
The first aim is to examine the phenomenon of Mandarin syllable contraction
from the perspective of articulatory phonology. Previous accounts of Mandarin syllable
contraction (e.g., Chung, 1997; Cheng & Xu, 2009) have been unable to explain the
presence of the coda consonant [m], which violates Mandarin phonotactics. If
coarticulation is part of the phonological component as suggested by Chung (1997), then
the reduced form should be explainable by phonological rules. While reduction can occur
as a result of the articulators not having sufficient time to reach their target, Cheng and
Xu’s articulatory undershoot account still cannot explain why the coda consonant [m] exists (as discussed in Sections 2.1.2 and 2.1.3 above). Therefore, we hypothesize that Mandarin
syllable contraction undergoes a change in gestural coordination (as discussed in Section
2.1.4). The experiment presented in this chapter investigates this possible change by
comparing the formant frequencies of contracted and non-contracted words in Mandarin.
The second aim of the experiment is to investigate the extent to which gestural
reorganization can occur. While an in-phase relationship is more stable than an anti-phase
one, it may not be the case that all words will undergo gestural reorganization when they
are contracted in Mandarin. The experiment analyzes the individual sets of
words used in the study to determine whether a change in gestural coordination has
occurred.
2.2 Experiment 1
To test whether change in gestural coordination occurs in contracted syllables of
Taiwan Mandarin, this study compared the coordination of gestures in function words,
which are more likely to be contracted (Tseng, 2001), to that in non-function words, which are
less likely to be contracted. Focusing on vowels only, when a disyllabic word is
contracted, the vowels in the two syllables are predicted to switch to an in-phase
relationship. That is, the vowel gestures are ‘pulled’ towards occurring simultaneously. If
the disyllabic word contains a front vowel in the first syllable and a back vowel in the
second syllable (or vice versa), the strength of in-phase coupling will determine the
amount of coproduction of the back and front vowel. Such coproduction is measurable
via a change in the second formant (F2) frequencies. Therefore, if a disyllabic word with
front-to-back vowel order is contracted, F2 frequency values of the first syllable are
predicted to decrease compared to the non-contracted form. If the disyllabic word has a
back and front vowel in the first and second syllable respectively, F2 frequency values
are expected to increase.
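This prediction can be restated as a simple comparison of F2 shifts. The following is a hypothetical analysis sketch only; the function name and the Hz values are invented and are not results from Experiment 1. The coproduction effect is the target's contraction-induced F2 shift over and above whatever shift the matched control shows.

    # Hedged sketch of the planned F2 comparison (invented example values, in Hz).
    def f2_shift(f2_contracted, f2_non_contracted):
        """First-syllable F2 change due to contraction (positive = raised F2)."""
        return f2_contracted - f2_non_contracted

    # Back-to-front target: coproduction with the following front vowel should raise
    # the first syllable's F2 when the word is contracted.
    target_shift = f2_shift(f2_contracted=1150, f2_non_contracted=1000)
    # Back-to-back control: no backness difference, so little F2 shift is expected.
    control_shift = f2_shift(f2_contracted=1010, f2_non_contracted=1000)

    print(target_shift - control_shift)  # 140 (Hz): the coproduction effect proper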
2.2.1 Method
Participants
Four native speakers (1 male and 3 females) of Mandarin from Taiwan living in
the Los Angeles area participated in the study. They reported no hearing or language-related impairments and were paid $10 for their participation.
Experimental stimuli
Seven sets of mostly disyllabic [3]
words were selected. In order to maximally
match the phonetic environments used within a set, two instances of a combination of
two monosyllabic words and three instances of a combination of a monosyllabic word
and a disyllabic word were used. The first two syllables of all words were controlled to
meet a variety of conditions.
Footnote 3: Most of the words used in the targets and controls are considered disyllabic; however, given the difficulty of defining what counts as a word in Chinese, some disyllabic words used in this experiment may be considered a compound of two monosyllabic words (e.g., from Set 6, [xɤ ku] ‘river valley’).
31
In order to test the effect of contraction, there are two word types within each set:
contracted candidates and non-contracted candidates. Contracted candidates selected are
either conjunctions or adverbs because conjunctions are generally considered to be
function words. According to Wang (1987: 21), adverbs in Chinese are somewhere
“between a function and content word”. Non-contracted candidates were either verbs or
nouns.
Table 2. The two different set types in the stimuli
                            Set Possibility 1            Set Possibility 2
                            Vowel order    Example       Vowel order    Example
1 contracted target         front-to-back  iʂou          back-to-front  tʂʰufei
1 non-contracted target     front-to-back  iʂou          back-to-front  tʰufei
1 contracted control        front-to-back  itsai         back-to-front  tʂupu
1 non-contracted control    front-to-back  i#tsʰaitiɑʊ   back-to-front  tʂupʰu
To make the measurement of coproduction feasible, we manipulated the backness
of the vowels in the first and second syllable to create two main types of words: (a) In
one type, which we will refer to as front-to-back vowel order, the first syllable has a front
vowel and the second syllable has a back vowel, and (b) in another type, which we will
refer to as back-to-front vowel order, the first syllable has a back vowel and the second
syllable has a front vowel.
In addition, it is important to tease apart the effect of coproduction from
contraction. For the purpose of this experiment, the coproduction effect mentioned here is
defined as the consequences of articulatory overlap. The effect of contraction is defined
as all other changes associated with reduction of syllables, but not due to the overlap of
gestures. To tease apart coproduction and contraction, we used (i) contracted and non-
contracted candidates that differ in the vowel backness of the first and second syllables as
targets, and (ii) candidates that do not differ in the vowel backness of the two
syllables as controls. Furthermore, the vowel in the first syllable of the controls needs to
match the corresponding vowel of the targets in backness. For example, if the order of the
two vowels in the targets is back-to-front, then the order of vowels in the controls is
back-to-back. As illustrated in Table 2, there are two possibilities for the order of vowels
in the targets. Each set has one contracted target, one non-contracted target, one
contracted control, and one non-contracted control. Of the 7 sets in the study, there are
four sets with front-to-back vowel order and three with back-to-front vowel order in the
targets.
All targets and controls have the form of (C1)V1(C2)V2 for the first two
syllables, with C1 and C2 being optional. All V1’s within a set are identical, and C1’s
and C2’s are maximally matched across targets and across controls to minimize the
differing effects of consonants on formant frequencies. The targets and controls are
placed in the middle of a carrier sentence, always at the beginning of a prosodic phrase
and preceded by a comma.⁴ To minimize its effect on the formants, the vowels preceding
the targets and controls always match the backness of the following vowel. For ease of
segmentation, the targets or controls are always followed by a stop or affricate.
Each carrier sentence is always the second part of a dialogue. An example with
the target chufei, ‘unless’, is illustrated in (3). Each stimulus is set up to be like a
conversation between two individuals. In this setup, the first part of the dialogue is
intended to be either a statement or a question from one individual. To mimic this, the
⁴ The use of commas is somewhat arbitrary in Chinese: commas are not limited to separating words or groups of words, and they can occur at the end of a grammatical clause without a conjunction.
first part of the dialogue was pre-recorded (with a Logitech USB microphone at 44,100
Hz) and played to the participants over headphones. The second part of the dialogue,
which contains the targets or controls, is intended to be a response to the recorded speech
and was produced by the subjects. The aim of this setup is to simulate a real conversation,
in order to encourage participants to speak as normally as possible.
(3) ni gen ni pengyou dasuan sheme shihou yao hui jia
you and your friends plan what time ASP go home
“When do you plan on going home with your friends?”
wo deng bu xiaqu xiang yao xian zou chufei bisai keyi mashang jieshu
I wait no longer want ASP first go unless game ASP soon end
“I can’t wait any longer unless the game will end soon.”
Procedure and design
Subjects were shown one stimulus at a time on a computer screen. Both the
statement/question portion and the response portion of the dialogue were visible to the
subjects. They were instructed to pretend they were the person providing the response
portion and to produce utterances that were as natural as possible. At the beginning of
each trial, they were given time to practice and rehearse the stimulus either internally or
by saying it out loud. When they indicated they were ready to proceed, the first portion of
the dialogue was played back to them over headphones. After the playback, subjects said
the response aloud. This was recorded digitally using Praat (Boersma, 2001) via a
Logitech USB microphone at 44,100 Hz. This process of playback and sentence
production was repeated at least six times for each participant and for each stimulus. If
subjects became disfluent or produced a speech error, they were asked to repeat their
response.
The experiment consisted of 28 stimuli. Subjects were given a break of no more
than 10 minutes halfway through the experiment if needed. The total duration of the
experiment was approximately one hour.
Acoustic Measurements
The acoustic measurements made in this study included (i) second formant (F2)
frequency values at the midpoint of V1 and (ii) the duration of C1, V1, C2, and V2,
where applicable. Labels were placed using Praat according to the following criteria:
1. Consonants onset:
a. If the consonant is a fricative, the onset was defined as the first appearance
of aperiodic noise on the waveform, accompanied by frication noise above 2500 Hz in the spectrogram.
i. In the case of consonant [x], it is not always clear where the aperiodic
noise begins. When that was the case, the onset was defined by the onset
of a 10 ms frame during which a decrease of 150 Hz in F2 frequency
value occurred.
b. If the consonant is a stop or affricate, the onset was defined by the
substantial decrease in amplitude. In cases where the stop or affricate is
preceded by silence, the onset was defined by the release burst.
2. Consonant offset: Regardless of whether the consonant is a fricative, stop, or affricate, the offset of the consonant was defined as the onset of the vowel.
3. Vowel onset: The onset was defined as the onset of periodicity.
4. Vowel offset: The offset was defined as the onset of a consonant.
5. V1 midpoint: The point halfway between the onset and offset of V1 was taken as the midpoint of the vowel.
It is important to note that segmentation was not always possible in contracted
targets or controls. When boundaries were not clear, vowels and consonants were left
unsegmented. For the contracted targets or controls where the onset of V1 was difficult to
identify, the V1 midpoint was calculated by adding half the average duration of the other V1 tokens for the
same target or control to the onset of the vowel. This occurred in six V1 midpoint
measurements.
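To make the measurement procedure concrete, the following sketch shows one way the F2-at-V1-midpoint measurement could be automated with the praat-parselmouth Python bindings for Praat. It is an illustration rather than the script used in this study: the file name, the V1 boundary times, and the formant-analysis settings are placeholder assumptions.

# Sketch (not the analysis script of this study): F2 at the midpoint of a
# hand-labelled V1 interval, using the praat-parselmouth bindings for Praat.
import parselmouth

def f2_at_v1_midpoint(wav_path, v1_onset, v1_offset,
                      maximum_formant=5500.0, max_number_of_formants=5.0):
    """Return F2 (Hz) at the temporal midpoint of V1 (criterion 5 above)."""
    sound = parselmouth.Sound(wav_path)
    # Burg formant analysis; a 5500 Hz ceiling is a common default for female voices.
    formants = sound.to_formant_burg(
        max_number_of_formants=max_number_of_formants,
        maximum_formant=maximum_formant)
    midpoint = (v1_onset + v1_offset) / 2.0
    return formants.get_value_at_time(2, midpoint)  # formant number 2 = F2

# Placeholder example: one token with V1 labelled from 0.412 s to 0.538 s.
print(f2_at_v1_midpoint("token.wav", 0.412, 0.538))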
A series of repeated measures Analyses of Variance (ANOVAs) was conducted
for statistical evaluation. The within-subject factors considered here were Contraction
(contracted vs. non-contracted), Vowel Order (back-to-front vs. front-to-back), and
Vowel Environment (targets, for which the two vowels are different in backness, vs.
controls, for which the vowels are the same).
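As an illustration of this analysis design (not the analysis script itself), a repeated-measures ANOVA over subject means could be set up as in the sketch below; the long-format data file and its column names are assumptions.

# Sketch of a repeated-measures ANOVA on subject means, using statsmodels' AnovaRM.
# The file "v1_f2_subject_means.csv" and its columns (subject, contraction,
# vowel_order, f2) are placeholders for one row per subject x condition cell.
import pandas as pd
from statsmodels.stats.anova import AnovaRM

means = pd.read_csv("v1_f2_subject_means.csv")

model = AnovaRM(data=means,
                depvar="f2",                      # F2 at V1 midpoint
                subject="subject",
                within=["contraction", "vowel_order"])  # add the Vowel Environment
                                                        # factor for the target/control
                                                        # comparisons in 2.2.2.3
print(model.fit())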
2.2.2. Results
2.2.2.1 Durations of contracted vs. non-contracted targets and controls
Although the literature has suggested that function words are more likely to be
contracted, it is not certain whether the conjunctions and adverbs that we used will also
undergo contraction, because their grammatical status between content words and
function words is somewhat unclear. As a sanity check, since contracted syllables are
expected to be shorter, the durations of the contracted targets and controls are compared
with their non-contracted counterparts. Indeed, as can be seen in Figure 5, this is the case.
Figure 5 shows the average durations of all the words that were repeated at least six times
by each participant. To test the statistical significance of this pattern, an ANOVA was
performed on subject means of syllable duration for Contraction (contracted vs. non-
contracted). The test revealed a main effect of Contraction (F(1, 3)=16.13, p < .05),
which confirms that the words selected to be contracted candidates were indeed shorter in
duration than non-contracted candidates.
Figure 5. Durations of targets and controls (y-axis: Time (s); x-axis: Non-contracted vs. Contracted; series: Control, Target)
2.2.2.2 F2 of contracted vs. non-contracted targets
If the vowels in the targets are coproduced due to in-phase coupling, F2 is
expected to decrease in the contracted form if the first syllable contains a front vowel and
the second syllable contains a back vowel. On the other hand, F2 is expected to increase
in the contracted form if the first syllable contains a back vowel and the second syllable
contains a front vowel. This can be observed in Figure 6. An ANOVA was performed on
subject means of F2 values at the midpoint of V1 for Contraction and Vowel Order
(back-to-front vs. front-to-back). There was a main effect of Vowel Order (F(1, 3)=74.12,
p<.01), but the interaction of Contraction and Vowel Order was only marginal (F(1, 3)=7.06, p=.08). This suggests that contraction shifted F2 values in the expected direction, but the effect did not reach significance.
Figure 6. Average F2 frequency values for all targets (y-axis: F2 Frequency (Hz); x-axis: Non-contracted vs. Contracted; series: Back-to-front, Front-to-back)
Figure 7. Average F2 frequency values for all targets excluding Set 7
As previously mentioned, the medial consonant [x] in Set 7 created a segmentation issue that required two different measurement criteria. It is possible that the F2 values from Set 7 generated too much variance. Therefore, an additional ANOVA was conducted without Set 7, which was removed from further analysis. As shown in Figure 7, the removal caused little change in average F2 frequency. However, the ANOVA again showed a main effect of Vowel Order (F(1, 3)=93.88, p<.01), and this time the interaction of Contraction and Vowel Order was significant (F(1, 3)=15.04, p<.05), showing that F2 values changed significantly in the expected direction.
2.2.2.3 F2 of targets vs. controls
Although the comparison of F2 between contracted and non-contracted targets
showed a significant interaction, this could simply reflect a difference between contracted and non-contracted targets that is not due to coproduction. To investigate the effect of
coproduction, targets need to be compared with controls that vary in the same way in
terms of contraction but do not contain vowels that would lead to F2 changes due to
coproduction (i.e. vowels that are both front or both back).
Of the six sets of words used in the experiment (excluding Set 7), Figure 8
illustrates the comparison of targets with back-to-front vowel order to controls with back-
to-back vowel order. At first glance, there appeared to be an interaction of Contraction
and Vowel Environment (target vs. control). Notice that F2 for targets increased in the
contracted form while F2 for controls decreased. An ANOVA resulted in a marginal effect of Contraction (F(1, 3)=6.15, p=.09) and a marginal interaction (F(1, 3)=7.42, p=.07). These results indicate that, when all sets with back-to-front vowel order are combined, coproduction did not make the F2 of the targets differ significantly from that of the controls under contraction, although the trend was in the expected direction.
Figure 8. Comparison of targets and controls for sets with back-to-front vowel order (y-axis: F2 Frequency (Hz); x-axis: Non-contracted vs. Contracted; series: Control, Target)
Figure 9. Comparison of targets and controls for sets with front-to-back vowel order
The comparison of targets with front-to-back vowel order to their respective
controls is shown in Figure 9. F2 for both targets and controls was lower in the contracted
form than the non-contracted form. Note that F2 for targets appeared to decrease to a
greater extent compared to the controls. An ANOVA resulted in a significant interaction
of Contraction and Vowel Environment (F(1, 3)=18.42, p<.05). This showed a significant
influence of coproduction for the sets with front-to-back vowel order.
2.2.2.4 Within-set comparisons
By comparing targets to controls, sets with front-to-back vowel order showed a
significant interaction of Contraction and Vowel Environment and sets with back-to-front
vowel order a marginal interaction. Why did the sets with back-to-front vowel order not
quite reach significance? A possible reason is that gestural reorganization may not always
occur, even though an in-phase relationship is more stable than an anti-phase one.
Therefore, the analyses presented in this section examine the F2 changes and durations of
each individual set of the contracted target, non-contracted target, contracted control, and
non-contracted control. We have excluded Set 7 and consider the rest of the sets individually
below. These analyses probe the extent to which gestural reorganization may occur.
Table 3. Targets and controls from Set 1
Character IPA Contraction Vowel Pair Gloss Type
pupi Contracted Back-to-front ‘no need’ Target
upi Non-contracted Back-to-front ‘scandal’ Target
pukuo Contracted Back-to-back ‘however’ Control
fukuo Non-contracted Back-to-back ‘rich country’ Control
Table 3 lists the words used in Set 1. As seen in Figure 10, Set 1 appeared to have
an interaction of Contraction and Vowel Environment. As predicted, F2 was higher in the
contracted form for the targets while F2 was lower for the controls. An ANOVA was
conducted on subject means of F2 and showed a main effect of Contraction (F(1,
3)=20.19, p<.05), a marginally significant effect of Vowel Environment (F(1, 3)=6.47,
p=.08), and crucially, a significant interaction (F(1, 3)=14.91, p<.05). As shown in Figure
11, the durations of targets and controls were also as expected: contracted syllables were
shorter than their non-contracted counterparts.
Figure 10. Set 1 average F2 frequency values (y-axis: F2 Frequency (Hz); x-axis: Non-contracted vs. Contracted; series: Control, Target)
Figure 11. Set 1 target and control durations (y-axis: Time (s); x-axis: Non-contracted vs. Contracted; series: Control, Target)
Table 4. Targets and controls from Set 2
Character IPA Contraction Vowel Pair Gloss Type
tʂʰufei Contracted Back-to-front ‘unless’ Target
tʰufei Non-contracted Back-to-front ‘bandit’ Target
tʂupu Contracted Back-to-back ‘step by step’ Control
tʂupʰu Non-contracted Back-to-back ‘master butler’ Control
Table 4 lists the words used in Set 2. Set 2 also appeared to have an interaction, as
shown in Figure 12, with the targets having higher F2 in the contracted form and the
controls holding a relatively similar value. An ANOVA confirmed that there was an
interaction of Contraction and Vowel Environment (F(1, 3)=14.98, p<.05). Note that in Figure 13, the duration of the contracted control was in fact longer than that of its non-contracted counterpart.
Figure 12. Set 2 average F2 frequency values (y-axis: F2 Frequency (Hz); x-axis: Non-contracted vs. Contracted; series: Control, Target)
Figure 13. Set 2 target and control durations
Table 5. Targets and controls from Set 3
Character IPA Contraction Vowel Pair Gloss Type
xɤpi Contracted Back-to-front ‘no need’ Target
xɤ#pʰitçiu Non-contracted Back-to-front ‘drink+beer’ Target
xɤkʰu Contracted Back-to-back ‘why bother’ Control
xɤku Non-contracted Back-to-back ‘river valley’ Control
Table 5 lists the words used in Set 3. The unexpected finding here was that the targets had a lower F2 value in the contracted form, as in Figure 14. However, crucially, when an ANOVA was conducted, there was no interaction (although there was a main effect of Contraction (F(1, 3)=14.45, p<.05) and of Vowel Environment (F(1, 3)=31.73, p<.05)), so this may simply be due to random noise. This change in F2 values opposite to the predicted direction is likely to have contributed to the finding of a marginal interaction in
sets with back-to-front vowel order. As seen in Figure 15, the durations of both targets
and controls obeyed the expected pattern.
Figure 14. Set 3 average F2 frequency values (y-axis: F2 Frequency (Hz); x-axis: Non-contracted vs. Contracted; series: Control, Target)
Figure 15. Set 3 target and control durations (y-axis: Time (s); x-axis: Non-contracted vs. Contracted; series: Control, Target)
Table 6. Targets and controls from Set 4
Character IPA Contraction Vowel Pair Gloss Type
iʂou Contracted Front-to-back ‘single handedly’ Target
iʂou Non-contracted Front-to-back ‘transfer’ Target
itsai Contracted Front-to-front ‘repeatedly’ Control
i#tsʰaitiɑʊ Non-contracted Front-to-front ‘ASP+fire’ Control
Table 6 lists the words used in Set 4. The targets in Set 4 perhaps offered the best comparison of contracted and non-contracted syllables: they are a perfect match, identical not only in segments but also in tone. As shown in Figure 16, there appeared to
be an interaction with a lower F2 in the contracted form. An ANOVA confirmed that
there was indeed an interaction of Contraction and Vowel Environment (F(1, 3)=55.57,
p<.01). The durations of targets and controls also obeyed the expected pattern, as in
Figure 17.
Figure 16. Set 4 average F2 frequency values (y-axis: F2 Frequency (Hz); x-axis: Non-contracted vs. Contracted; series: Control, Target)
Figure 17. Set 4 target and control durations
Table 7. Targets and controls from Set 5
Character IPA Contraction Vowel Pair Gloss Type
itu Contracted Front-to-back ‘once’ Target
itʰu Non-contracted Front-to-back ‘plan’ Target
içi Contracted Front-to-front ‘perhaps’ Control
i#çikuɑn Non-contracted Front-to-front ‘ASP+accustom’ Control
Table 7 lists the words used in Set 5. As shown in Figure 18, an interaction was also apparent. However, an ANOVA found only a marginal interaction (F(1, 3)=7.12, p=.08). Figure 19 shows that the controls exhibit the same pattern as in Set 2: the contracted control was actually slightly longer in duration than the non-contracted control.
Figure 18. Set 5 average F2 frequency values (y-axis: F2 Frequency (Hz); x-axis: Non-contracted vs. Contracted; series: Control, Target)
Figure 19. Set 5 target and control durations (y-axis: Time (s); x-axis: Non-contracted vs. Contracted; series: Control, Target)
Table 8. Targets and controls from Set 6
Character IPA Contraction Vowel Pair Gloss Type
itʰou Contracted Front-to-back ‘suddenly’ Target
pi#tʰou Non-contracted Front-to-back ‘tip of nose’ Target
itçi Contracted Front-to-front ‘and’ Control
itçi Non-contracted Front-to-front ‘prostitute’ Control
Table 8 lists the words used in Set 6. Figure 20 shows another unexpected pattern: contracted targets were higher in F2 than non-contracted forms. An ANOVA revealed no interaction of Contraction and Vowel Environment. Duration patterns of the targets and controls were also as expected (Figure 21).
Figure 20. Set 6 average F2 frequency values (y-axis: F2 Frequency (Hz); x-axis: Non-contracted vs. Contracted; series: Control, Target)
Figure 21. Set 6 target and control durations
In summary, three out of the six sets analyzed showed a significant interaction of
Contraction and Vowel Environment (Set 1, Set 2, and Set 4) with respect to the F2
values while one set (Set 5) showed a marginal interaction. In addition, the durations of
the targets and controls were as expected for four out of the six sets analyzed (Set 1, Set
3, Set 4, and Set 6); the durations of the targets themselves were as expected in all six sets.
2.3 Discussion
The findings presented in this chapter are consistent with the coproduction
account of coarticulation (e.g., Browman & Goldstein, 1986; Fowler, 1977; Bell-Berti &
Harris, 1981). In particular, Experiment 1 provides initial evidence that disyllabic
contractions in Taiwan Mandarin involve a change in coupling modes as we hypothesized
in section 2.1.5. According to articulatory phonology (Browman & Goldstein, 1990,
1992), when a disyllabic word has a front vowel in the first syllable and a back vowel in
the second syllable, F2 is predicted to be lower if the vowels are coarticulated due to in-
phase coupling during contraction. On the other hand, if the word has a back vowel in the
first syllable and a front vowel in the second syllable, F2 is predicted to be higher.
This study first established that the contracted syllables used in the study were
significantly shorter than the non-contracted syllables. Then, when the vowel backness
order was front-to-back, the trend from all 7 sets of data was that, among targets only, F2
was lower in the contracted form—the direction predicted by the coarticulation of the two
vowels. The reverse trend was also true when the vowel order was back-to-front. While
this pattern was marginal initially, it became significant once data from Set 7, for which
the acoustic measurements were suspected to be unreliable, was removed. These results
support the analysis that coarticulation of vowels led to the change in F2 values.
To alleviate concerns that the change in F2 values was due to contraction but not
coproduction (also see Section 2.2.1 above), targets were compared to controls that had
vowels that are identical in terms of backness. Statistical analyses revealed that the
targets with front-to-back vowel order had a significant change in F2 values due to
coarticulation while the effect was marginal for targets with back-to-front vowel order.
To determine the extent to which gestural reorganization occurred, individual sets of data were also analyzed: three of the six sets analyzed showed a significant change in F2, and one showed a marginal change, in the predicted directions. These findings support the
hypothesis that Mandarin syllable contraction can involve a change in gestural
coordination. More specifically, at least for four of the contracted targets, the vowels in
the two syllables were pulled towards each other.
The remaining two sets did not have a significant change of F2. This may be due
to too much noise with the small sample, but it may also reflect the fact that even at a
faster rate of production, gestural reorganization does not always occur. Even though in-
phase coupling is more stable, other factors, such as the frequency, may exert additional
influence on inter-gestural coordination. Moreover, the choice of consonant for one of
these two sets, Set 6, may also have contributed to the F2 values. In Set 6, the contracted target is [itʰou] while the non-contracted target is [pi#tʰou]. The F2 value at the midpoint of the vowel may be influenced by the presence of the labial gesture, especially when the vowel gesture is shortened due to contraction. Although the trend is not significant, the F2 values of the non-contracted form are lower than those of the contracted form, opposite to the predicted direction.
Although the results of the experiment indicate that syllable contraction in
Mandarin may occur as a result of gestural reorganization, these results should not be treated
as evidence that the phenomenon is a categorical process. It is possible that both gradient
and categorical changes occur in the production process when syllables contract, but the
experiment was not designed to differentiate between these possibilities. From a
perception standpoint, the fact that some contracted syllables have become lexicalized
(e.g., the contracted form of ‘this way’) is an indication that for some words, syllable
contraction may be a categorical process. However, it is also not clear to what extent
listeners are sensitive to small changes in gestural reorganization, which means that
gradient changes could also occur without detection.
It is also important to note that although the results of our experiment demonstrate
that there is more overlap between the vowel gestures in contracted syllables, we do not
know the extent of the gestural reorganization that leads to the increased overlap. The
increase in overlap might have happened in two ways. It may occur due to gestural
reorganization on a more limited basis, such as involving only the vowel gestures. An
alternative explanation is that gestural reorganization may have happened on a global
level due to the general increase in clock speed, causing changes to more than just the
vowels.
In summary, Experiment 1 presented in this chapter demonstrates that the
reduction process in Mandarin syllable contraction may involve the overlap of gestures
due to change in coupling mode. Compared with the previous accounts of coarticulation
as a phonological component (Chung, 1997), or as a result of articulatory undershoot
(Cheng & Xu, 2009), the coproduction account has the power to explain attested patterns
in Mandarin, such as the presence of nasal coda [m]. In addition, as shown by the F2
values of contracted syllables in Experiment 1, syllable contraction in Mandarin is not
merely a case of shortened gestures, as suggested by the undershoot model; it is also a
reduction process that can be explained by the overlap of articulatory gestures due to
change in coupling mode.
In this chapter, we have shown how coproduction of articulation can lead to
reduction in the production process. However, despite the role that the overlap of articulatory gestures plays in speech reduction, relatively few studies have focused on how it
is processed on the perception side. In other words, if the coproduction of articulation
plays a role in speech reduction, how does it influence word recognition? In the following
chapter, we investigate how different sources of coarticulatory information might impact
speech perception and lexical access: (i) global cues from prior context – namely the
speech rate of the preceding words and (ii) the extent of articulatory overlap.
Chapter 3
Compensation for Assimilation: Contextual Cues and Gradience
3.1 Introduction
The results of the production study on Mandarin syllable contraction in Chapter 2
suggest that a shift in coupling mode or coproduction of articulation offers an explanation
for how speech may be reduced. In light of the growing body of work indicating that
articulatory overlap influences speech reduction in production, what is its role in
perception? How does coproduction of articulation influence the perception of reduced
speech? Prior research has focused on how listeners compensate for coarticulation (Mann
1980; Mann & Repp, 1981; Repp & Mann, 1981, 1982) and more specifically,
assimilation (e.g., Darcy, 2007; Gaskell & Marslen-Wilson, 1996, 1998; Gow 2001,
2002, 2003; Gow & McMurray, 2007). Recent psycholinguistic work on compensation
for coarticulation has also explored the relation between lexical access and compensatory
processes (e.g., Magnuson et al., 2003; McQueen et al., 2009; Samuel & Pitt, 2003, see
also Gaskell & Marslen-Wilson, 2001).
This chapter builds on this work, extending prior research in two key ways, in the
form of two experiments: we report a perception study (Experiment 2) that was done with
visual-world eye-tracking and crucially, that used stimuli whose articulatory properties
we analyzed in detail using electromagnetic articulography (EMA) in Experiment 3.
The visual world eye-tracking study – Experiment 2 – tests whether cues from
prior context – namely the speech rate of the preceding words – can trigger compensation
for assimilation. Prior studies have focused mostly on situations where assimilation
(coarticulation) is triggered by subsequent segments (with the exception of Mann’s
original 1980 study, see also Elman & McClelland 1988 and related work). In light of a
growing body of work on the role of expectations and anticipatory processing,
Experiment 2 investigates whether compensation for coarticulation is also guided by
expectations that listeners create on the basis of prior context, expectations which are not
triggered by any specific lexical item but rather by global properties of the context – in
particular speech rate. Faster speech rates are associated with greater extent of reduction
and importantly, articulatory overlap. If listeners interpret potentially ambiguous words
differently when the speech rate of the carrier phrases is manipulated (while the speech
rate of the critical words is kept constant), this would provide evidence that cues
encountered before the critical words modulate the extent to which listeners compensate
for coarticulation.
We also report a second set of analyses (Experiment 3), focusing on the degree of
the articulatory overlap, which is gradient and variable, in line with natural speech. We
investigate directly how gradience influences the listener’s lexical access process by
looking at the relation between the (i) articulatory gestures produced by the speaker
(Experiment 3) and (ii) the eye-movements of the listener during word recognition
(Experiment 2). Prior work has not explicitly linked articulatory data with perception data
using online methods such as eye-tracking, and has not looked at the articulatory aspects
of gradient assimilation.
To tap directly into the articulatory properties of words, we use electromagnetic
articulography (EMA) to measure the actual articulatory gestures that are produced by the
speaker and used in the eye-tracking study. This two-pronged approach can show how the
production measure is linked to the perceptual measure of eye-movements. This aspect of
the chapter also complements Experiment 2: Whereas Experiment 2 focuses on
investigating effects of cues encountered prior to the critical word (i.e., speech rate of the
preceding words), Experiment 3 explores the consequences of the extent of
articulatory overlap (as measured by EMA) at the critical word.
The rest of this chapter is structured as follows: In Sections 3.1.1 to 3.1.3, we
review the literature on the perception of fast speech, contextual cues, and the issue of
gradience. Experiment 2 is reported in Section 3.2 and Experiment 3 is reported in
Section 3.3. We discuss the findings of the two experiments in Section 3.4.
3.1.1 Speech rate
This chapter investigates whether compensation for assimilation can be triggered
by cues from prior context, namely the speech rate of the preceding words. With respect
to production, not only can speech rate vary considerably in a given conversation,
ranging from 140 to 180 words per minute (Miller et al., 1984), it also alters the patterns
of coarticulation and assimilation (Browman & Goldstein, 1990; Byrd & Tan, 1996). In
addition, increased speech rate may lead to other phonetic consequences, such as deletion
of segments (Ernestus et al., 2002; Koreman, 2006), and disproportional change in the
duration of consonants and vowels (Max & Caruso, 1997) or stressed and unstressed
syllables (Peterson & Lehiste, 1960). As a result, listeners frequently need to adjust for
the different speech rates of speakers in their recognition of spoken words (Miller et al.,
1984; Miller & Liberman, 1979). However, despite the wealth of knowledge in the field
of speech perception, relatively little is known about how listeners accommodate varying
speech rates.
In studying how fast speech is perceived, Janse and colleagues conducted some of
the few studies regarding speech rate and spoken word recognition (Janse et al., 2003;
Janse, 2004, Adank & Janse, 2009). Their work primarily examined word recognition in
artificially time-compressed speech. Janse et al. (2003) found word intelligibility was
higher when speech was linearly compressed, which preserves the vocal tract gestural
overlap patterns of speech at a slower rate, in comparison to when it was compressed to
reflect the durational characteristics of natural fast speech (meaning that unstressed syllables
were shortened more than stressed syllables). Janse (2004) also compared the perception
of naturally-produced fast speech to linearly time-compressed speech. Using a phoneme
monitoring task, Janse found that naturally-produced fast speech slows down word
processing speed more than time-compressed speech does, because in naturally-produced
fast speech, important segmental details tended to be diminished in favor of ease of
production.
An important implication of Janse’s work is that varying speech rates can have
perceptual consequences. However, the studies of Janse and colleagues (Janse et al.,
2003; Janse, 2004, Adank & Janse, 2009) examined only computer-manipulated speech
and natural speech of the same rate, not naturally produced speech of different rates.
Natural speech, understandably, exhibits different temporal characteristics from computer-manipulated speech. The following section addresses how different speech rates can
potentially create different expectations regarding the extent of articulatory overlap.
3.1.2 Assimilation, speech rate, and contextual cues
In this chapter, we investigate the effects of speech rate and articulatory overlap
on word recognition by using the phenomenon of coronal place assimilation in English.
Place assimilation is a process in which a segment assimilates to the place of articulation
of the subsequent consonant. Coronal place assimilation occurs when a coronal segment
(e.g., [t], [d], [n]; ‘coronal’ meaning that it is articulated with the tip/blade of the tongue)
takes on the place of a following non-coronal (labial or velar) segment. For example, in
“A quick rum picks you up”, the word rum may be ambiguous between ‘run’ and ‘rum’ (Gaskell & Marslen-Wilson, 2001). This lexical ambiguity arises if the [m] from rum is interpreted to be an underlying [n] assimilating to [m] due to the labial gesture of the [p] of
picks overlapping the coronal gesture of the [n]. In other words, the word rum in the
sentence may be perceived as rum or run because of its occurrence in the context of
picks.
Related work by Gow and McMurray (2007) found that listeners can use
progressive assimilation patterns to anticipate upcoming words. For example, hearing
greem boat, with nasal place assimilation on green (green => greem) helps listeners to
anticipate the upcoming noun boat. Building on these findings, Experiment 2 presented in
this chapter tests whether contextual cues, namely speech rate of the preceding words,
can influence listeners’ expectations regarding coarticulation and thereby guide how they
interpret potentially ambiguous words.
How might listeners’ expectations change with respect to speech rate? Studies
have shown that faster speech tends to be more coarticulated than slower speech (Bell-
Berti & Krakow, 1991; Matthies et al., 2001; Munhall & Löfqvist, 1992). This is also
consistent with the prediction by the coproduction account of coarticulation. As
previously reviewed in Section 2.1.4, articulatory phonology (Browman & Goldstein,
1992, 1995) assumes the gestures as the basic units of combination. Articulatory gestures
are coupled by a system of planning oscillators (Goldstein et al., 2006) and entrain to two
stable modes of inter-gestural coordination, in-phase and anti-phase. While the both the
in-phase and anti-phase mode are intrinsically accessible, research has suggested that an
in-phase mode is more stable than an anti-phase one in finger tapping experiments (Kelso
1984) and in speech (e.g., de Jong et al., 2002). Therefore, while the anti-phase relation
remains relatively stable at a slower rate, as rate increases, gestures may become
increasingly coupled in an in-phase relationship, leading to the increased overlap of
articulatory gestures.
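One standard way to make this stability difference explicit is the coupled-oscillator potential associated with the finger-tapping findings cited above (the Haken-Kelso-Bunz model); it is offered here purely as an illustration, since the sources cited do not commit to a specific parameterization. The potential over the relative phase φ of two coupled oscillators is

V(φ) = -a cos φ - b cos 2φ

Both φ = 0 (in-phase) and φ = π (anti-phase) are local minima of V, but as movement rate increases, the ratio b/a shrinks, the anti-phase minimum becomes shallower and eventually disappears, and only the in-phase minimum remains stable.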
Thus, the key question is: If someone hears a sequence like rum picks embedded
in a fast speech-rate carrier phrase vs. a slow speech-rate carrier phrase, will this
influence how likely they are to access the potentially underlying unassimilated form
run? (Crucially, the speech rate of the critical sequence itself is held constant.) If word
recognition is sensitive to the rate of prior speech, then a potentially ambiguous sequence
like rum picks will result in more consideration of run when embedded in a fast carrier
phrase than when it is embedded in a slower carrier phrase.
This kind of result would provide evidence that cues encountered before the
critical words modulate the extent to which listeners compensate for assimilation. This
would show that the process of compensation for assimilation is influenced by listeners’
expectations that are based on information encountered prior to the critical region.
3.1.3 Gradience and articulatory overlap
In addition to the contextual cues with respect to the amount of coarticulation, the
extent to which the articulators may overlap at the critical word region can play a role in
how a word is perceived. This is especially important in the study of assimilation
because assimilation in natural speech has been shown to be variable and gradient (e.g.
Byrd, 1996; Nolan, 1992, Gow, 2001, 2002, 2003, see also Gow and McMurray’s 2007
comprehension data). In light of this gradience, it may seem surprising that most prior
psycholinguistic studies investigating the comprehension of assimilation have not
focused on the gradient or partial nature of assimilation. Gaskell and Marslen-Wilson
(1996, 1998, 2001) intentionally created tokens that were deemed to be either assimilated
or unassimilated (as confirmed by pretests). In more recent work, Gow and McMurray
(2007) established a systematic relationship between gradient measures in production and
perception, although they did not look at articulatory measures. However, they asked the
speaker to intentionally pronounce the words with a particular type of pronunciation (e.g.,
a labial assimilation such as greem). Given what is known about gradience in naturalistic
production, it seems that this type of elicitation may result in articulatory and acoustic
patterns that differ from naturalistic speech where the final consonant of a word like
‘green’ may be only partially assimilated.
3.1.4 Aims of this chapter
The experiments presented in this chapter use (i) visual-world eye-tracking, which
allows us to tap into the process of lexical access during comprehension (Experiment 2),
and (ii) electromagnetic articulography (EMA), which allows us to measure articulatory
gestures during production (Experiment 3). The chapter has two main aims:
The first aim is to test whether information from prior context – namely the
speech rate of the preceding words – can trigger compensation for coarticulation.
Experimental work on compensation for coarticulation has focused mostly on articulatory
cues presented after the critical segment (with the exception of Mann’s original 1980
study), but in light of a growing body of work highlighting the role of expectations and
anticipatory processing (e.g. Altmann & Kamide, 1999; Kamide et al., 2003a; Kamide et
al., 2003b), we investigate whether compensation for coarticulation is also guided by
listener’s expectations. To do so, we elicited spoken stimuli at different speech rates.
These spoken stimuli were then used in a visual-world eye-tracking study, where we
tested whether the speech rate of prior words influences word recognition at a subsequent
point in the speech stream.
The second aim of the chapter is to use gradient assimilation to investigate the
relation between articulation and perception, in order to learn more about how bottom-up
information influences compensation for coarticulation. Even though we did not
manipulate speech rate of the critical sequence (e.g., rum picks) in the experiments (we
only manipulated the speech rate of the carrier phrases), the degree of the
assimilation/articulatory overlap in our critical sequences is gradient/variable because of
how they were elicited. It is especially important to investigate tokens where assimilation
is incomplete because assimilation in natural speech has been shown to be variable and
gradient (e.g. Byrd, 1996; Nolan, 1992, Gow, 2001, 2002, 2003; Gow & McMurray,
2007).
To learn more about the gradient nature of assimilation, we investigate both
production and comprehension. More specifically, Experiment 3 uses electromagnetic
articulography (EMA) to measure the actual articulatory gestures naturally produced by
the speaker, thereby gaining concrete information about the extent of articulatory overlap
in the stimuli. This allows us to look for correlations in the articulatory domain
(Experiment 3) and the perception responses of the listeners (Experiment 2) and to
explore the extent to which bottom-up cues in the acoustic signal stemming from
variability in gestural overlap can be perceived by listeners.
In what follows, we present the visual-world eye-tracking study as Experiment 2,
which focuses on the perception side. We turn to the production side (as well as the
relation between perception and production) in Experiment 3, where we analyzed the
articulatory data from the items used in Experiment 2 using electromagnetic
articulography.
3.2 Experiment 2: Perception
Experiment 2 tests whether compensation for coarticulation is triggered by
contextual cues – in the form of speech rate variation – encountered before a potentially
ambiguous word. The faster someone speaks, the more coarticulated speech tends to be
(e.g., Bell-Berti & Krakow, 1991; Munhall & Löfqvist, 1992). Do listeners use this
information to guide the process of word recognition?
In a visual-world eye-tracking task, participants heard sentences like ex.4 while
they saw images of different ‘alien characters’ on the computer screen. (Before the start
of the experiment, participants were familiarized with the alien characters and their
names.)
(4) Seeing that every Fedoran has eaten too much since coming to Earth,
{Kine/Kime} proposes to skip lunch every other day.
We wanted to see whether the speech rate of the carrier sentence (shown in italics in
ex.4) influences how people interpret the critical word (e.g. Kine/Kime in ex.4). The
speech rate of the critical words themselves (in bold in ex.4) was held constant. We used
visual-world eye-tracking to probe the process of word recognition: Prior work using the
visual world eye-tracking paradigm has shown that eye-movements to objects in a display
are closely time-locked to the linguistic signal (e.g., Tanenhaus et al., 1995), and provide
a means to tap into lexical activation (e.g., Allopenna et al., 1998).
If the speech rate of the preceding words influences listeners’ interpretation of the
critical words, this would indicate that listeners build expectations based on the amount
of coarticulation they encounter in preceding speech (since the speech rate of the critical
words themselves was not manipulated).
3.2.2 Method
Participants
Twenty-eight native speakers of American English participated, all with normal
or corrected-to-normal vision and hearing. Three participants were excluded because
post-experimental questions revealed that they identified possible hints of splicing in the
sound files. One additional participant was excluded because they realized the study was
about assimilation. Twenty-four participants were included in the final analyses.
Design and materials
We manipulated the speech rate of the carrier phrases (shown in italics in ex.4
above). All critical items included a target sequence (underlined in ex.4) consisting of a
noun ending in a coronal nasal ([n]) followed by a verb beginning with a labial stop ([p],
[b]). The critical noun+verb sequence was preceded by an initial phrase or subordinate
clause (18-20 syllables in length). Thus, by the time participants heard the critical
noun+verb sequence, they had been exposed to the speech rate manipulation.
In the target sequences, we used nonwords (‘alien names’) instead of real words
to avoid confounds with frequency, neighborhood density and plausibility. Three pairs of
nonwords — Vone, Vome, Kine, Kime, Shoon, and Shoom — were chosen from the ARC
nonword database (Rastle et al., 2002). (See Table 9 for IPA transcriptions.) Each pair
consisted of one word ending in [n] and another word ending in [m]. We will refer to the
[n]-final words as the unassimilated forms, and the [m]-final words as the assimilated
forms. For all three pairs, we controlled the number of neighbors, the summed frequency
of neighbors, the number of phonological neighbors and the summed frequency of
phonological neighbors as closely as possible (Table 9). Even though the numbers are not
exactly the same, across the four categories as a whole, neither the unassimilated nor the
assimilated form had a clear advantage.
Table 9. The chosen nonwords (NN = number of neighbors; SFN = summed frequency of
neighbors; NPN = number of phonological neighbors; SFPN = summed frequency of
phonological neighbors).
IPA NN SFN NPN SFPN
Vone voʊn 8 3097 13 555
Vome voʊm 13 853 23 530
Kine kaɪn 16 1060 32 5476
Kime kaɪm 7 1680 16 4330
Shoon ʃun 2 91 16 438
Shoom ʃum 6 185 20 396
Figure 22. Pictures of all alien creatures
To justify the use of nonwords, participants were told that these nonwords were
the names of alien creatures from a planet called Fedora. In addition to the three pairs of
[n]-[m] words, we also used three filler aliens – Gome, Kise, Shan – to help prevent
participants from noticing the patterns in the target words. The images for all alien
creatures are shown in Figure 22.
Auditory component of target items
To create the targets, we spliced the critical noun+verb sequences into carrier
phrases of different speech rates. All recordings were done at 22050 Hz. A female
speaker of American English produced all recordings.
Generating the carrier phrases: To elicit carrier phrases (ex.5) at different speech
rates, the speaker listened to a beat (played over a headset), timed at 4/3Hz (fast rate) and
1Hz (slow rate), and was instructed to match her speaking rate to the rhythm of the beat
as best she could. The resulting sentences were used as the carrier phrases for the eye-
tracking experiment (with the noun+verb sequence spliced out, and replaced with the
critical noun+verb sequence). We will refer to the first part of the carrier phrase as Part 1,
and the second part as Part 2.
The carrier sentences in the fast and slow conditions had average speech rates of
5.46 syllables/s and 3.81 syllables/s, respectively.
(5) Sample carrier phrase:
[Seeing that every Fedoran has eaten too much since coming to Earth,]part1 Kine proposes [to skip lunch every other day.]part2
Generating the target sequences of noun + verb: The critical noun+verb
sequences were elicited separately from the carrier sequences. The speaker generated
sentences containing the critical noun+verb sequence (ex.6) while speaking to match a
steady beat of 4/3 Hz over the headset. The noun that the speaker was shown was the
version ending in the coronal [n] (i.e., Kine, Vone, Shoon). We used a fairly fast speech
rate in order to elicit stimuli that could potentially be ambiguous between assimilated and
unassimilated versions (e.g., Kine/Kime). Each of the three target alien names was used in
twelve targets, for a total of 36 targets.
The sentence contexts used to elicit the noun+verb sequences were similar in
structure to the actual carrier phrases, but differed in lexical content. This allows us to
avoid potential complications from other coarticulation phenomena or plausibility effects.
Then, we spliced out just the critical noun+verb sequence and inserted it into a carrier
phrase. Table 10 shows an example of how the sentences were cross-spliced. The
acoustic materials in the target sequence were held constant within an item.
(6) [Despite what everyone in this group says and wants to believe,] Kine proposes
[to the coach to skip practice every week.]
Table 10. Example of splicing for fast and slow conditions.
Carrier Phrase Part 1 (Fast or Slow):   "Seeing that every Fedoran has eaten too much since coming to Earth,"
Target Noun + Verb Sequence:            "Kine proposes"
Carrier Phrase Part 2 (Fast or Slow):   "to skip lunch every other day"
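The cross-splicing itself is a simple cut-and-concatenate operation on the waveforms. The sketch below illustrates it under stated assumptions: the file names and cut times are placeholders (in practice the cut points came from hand-placed labels), and this is not the script that was actually used to build the stimuli.

# Sketch of the cross-splicing in Table 10: a constant-rate noun+verb chunk is
# inserted between Part 1 and Part 2 of a fast- or slow-rate carrier phrase.
# File names and cut times (in seconds) are illustrative placeholders.
import numpy as np
from scipy.io import wavfile

def cut(path, start_s, end_s):
    """Return (sample_rate, samples) for a time slice of a mono wav file."""
    rate, samples = wavfile.read(path)
    return rate, samples[int(start_s * rate):int(end_s * rate)]

rate, part1 = cut("carrier_fast.wav", 0.00, 3.30)  # "Seeing that ... to Earth,"
_, chunk = cut("elicitation.wav", 2.10, 2.95)      # "Kine proposes" (constant rate)
_, part2 = cut("carrier_fast.wav", 4.05, 6.20)     # "to skip lunch every other day."

wavfile.write("target_fast.wav", rate, np.concatenate([part1, chunk, part2]))
# The same chunk would be spliced into the slow-rate carrier for the slow condition.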
Visual component of target items
All auditory tokens of the same target set had the same visual display, consisting
of four images, with the name of each alien below it (Figure 23). Two of the images were
the two members of the alien pair (Vone and Vome, or Kine and Kime, or Shoon and
Shoom). The third image was a clipart image (e.g. object or location). The fourth image
was one of the three ‘filler aliens’ (see Figure 23). The positions of the two critical aliens
were counterbalanced.
Figure 23. Sample visual display for target tokens.
In addition to 36 target items, participants also encountered 50 filler items. Like
the target tokens, the auditory stimuli for the fillers were two-clause sentences. We varied
whether the alien name was mentioned in the first or second clause. Thirty-two fillers
were similar to targets, but did not have contexts where assimilation could occur. The
remaining 18 fillers involved a different type of lexical ambiguity, namely homographic
homophones (e.g., ‘crane’ for a machine or an animal; ‘organ’ for a body part or an
instrument). The speech rate at which fillers were recorded was varied (16 fillers were
recorded at the same beat rate as slow target carrier phrases, 17 at the same beat rate as
fast targets, and 17 at an even faster beat rate of 5.86 syllables/s).
The study was designed so that each participant heard all of the fillers but never
heard more than one version of each target. Overall, each participant came across a total
of 86 items (36 targets; 50 fillers) and saw the two conditions an equal number of times.
Procedure
Participants listened to pre-recorded sentences while viewing scenes like Figure 23.
They were told that they would be hearing sentences about fictional aliens from the
planet Fedora. At the end of each sentence, participants’ task was to click on the picture
that best matched the sentence. Participants were told that they would hear sentences by a
female speaker practicing how to speak as a radio talk show host. Prior to the start of the
experiment, they were familiarized with the aliens’ names in a training phase.
Furthermore, to eliminate any memory problems, the names of aliens were shown on the
screen (see Figure 23). Eye-movements were recorded using an Eyelink II eyetracker
sampling at 500 Hz.
3.2.3 Results for perception study
We first present the offline click response for all the target items. With respect to
eye-movements, we present the results for the target items involving Vone/Vome and
Kine/Kime, and then we present separately the results for targets involving Shoon/Shoom.
If the speech rate of the preceding carrier phrase modulates the extent to which
participants compensate for coarticulation during their perception of the critical word, we
should see more consideration of the unassimilated form (i.e., Vone, Kine, Shoon) in the
fast-speech conditions than in the slow-speech conditions. This is because an [m]-final
word in a fast-speech context may be the result of coarticulation (an underlying [n]
becoming [m] due to assimilation with the following bilabial), but an [m]-final word
spoken in a slow-speech context is not expected to result from coarticulation, so there is
no reason to consider the [n]-final variant (unassimilated form).
3.2.3.1 Mouse click patterns
In the experiment, the participants’ task was to click on the picture that best
matched the auditory stimulus. Figure 24 shows the offline click responses for the target
items Vone/Vome, Kine/Kime, and Shoon/Shoom in terms of the percentage of trials on
which participants selected the coronal or the labial picture. We will refer to the [n]-final
words as the unassimilated forms, and the [m]-final words as the assimilated forms. The
offline choices by the listeners appear to be relatively impervious to effects of speech
rate. We observe a preference for the assimilated form as opposed to the unassimilated
form in both the fast and slow rate conditions. To evaluate the statistical significance of
these patterns, a coronal-picture advantage score was calculated by subtracting the
proportion of mouse clicks for the assimilated form from the mouse clicks for the
unassimilated form. Separate paired T-tests were conducted using the coronal advantage
score, and we found no significant effect of Speech Rate.
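The advantage score itself is a simple difference of proportions; the sketch below shows how it could be computed over per-subject click data and compared across speech rates with a paired t-test. The data file and its column names are assumptions, not the analysis code used in this study.

# Sketch: coronal-picture advantage for the click data, per subject and speech rate,
# defined as P(click coronal picture) - P(click labial picture), then a paired t-test
# across rates. The file "clicks.csv" and its columns (subject, rate, choice) are assumed.
import pandas as pd
from scipy.stats import ttest_rel

clicks = pd.read_csv("clicks.csv")   # one row per target trial

advantage = (clicks.groupby(["subject", "rate"])["choice"]
                   .apply(lambda c: (c == "coronal").mean() - (c == "labial").mean())
                   .unstack("rate"))

print(ttest_rel(advantage["fast"], advantage["slow"]))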
Figure 24. Percentage of unassimilated vs. assimilated form
selected in the fast and slow rate conditions
3.2.3.2 Fixation patterns for targets containing Vone and Kine
The eye-movement data from 0ms to 1500ms after the onset of the alien name are
shown in Figure 25, plotted in terms of the coronal-picture advantage score. The
coronal advantage score is calculated by subtracting the proportion of fixations to the
assimilated form, which ends in a labial such as [m] (e.g., Vome, Kime, or Shoom), from
the fixations to the unassimilated form, which ends in a coronal such as [n] (e.g., Vone,
Kine, or Shoon). Therefore, a positive coronal advantage score means a higher proportion
of fixations to the unassimilated form (e.g., Vone) and a negative advantage score means
a higher proportion of fixations to the assimilated form (e.g., Vome). The vertical dashed
line indicates the average offset of the alien word. The speech rate labels refer to the
speed of carrier phrase portion of the target sentence: the target sequence of noun+verb
was identical across the different conditions.
As can be seen in Figure 25, the eye-movement patterns for the fast speech and
slow speech conditions do not diverge from each other until around 800ms after the onset of the alien name. From about 800ms to 1500ms, participants looked more to the
unassimilated form ([n]-final name) in the fast speech condition and more to the
assimilated form ([m]-final name) in the slow speech condition.
Figure 25: Coronal-picture advantage for the targets Vone and Kine
(dashed line = average offset of alien name; 0ms = onset of
alien name, determined individually for each trial)
To assess these patterns statistically, a series of paired T-tests using the coronal
advantage score was conducted on 100ms time-slices, beginning from the onset of the
alien word until 1500ms post-onset. We found a main effect of Speech Rate (more looks
to the unassimilated form in the fast speech condition), which was significant from
1000ms to 1100ms in the item analysis and marginal in the subject analysis (T1(23)=
1.98, p1=.060; T2(23)=2.31, p2<.05), significant from 1100ms to 1300ms in both subject
and item analysis (1100 to 1200ms: T1(23)=2.55, p1<.05; T2(23)=2.90, p2<.01; 1200ms
to 1300ms: T1(23)=3.00, p1<.01; T2(23)=2.98, p2<.01), significant from 1300ms to
1400ms in the item analysis and marginal in the subject analysis (T1(23)=1.84, p1=.079;
T2(23)=2.16, p2<.05).⁵
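To make the time-slice analysis concrete, the sketch below shows one way the coronal advantage score could be computed per subject, speech rate, and 100ms bin, with a by-subject paired t-test in each bin. The data layout is an assumption, and this is not the script used for the analyses reported here.

# Sketch of the 100 ms time-slice analysis: coronal advantage per subject, rate,
# and time bin, followed by a paired t-test per bin. Assumed input: one row per
# eye-tracking sample with columns subject, rate, time_ms, picture.
import pandas as pd
from scipy.stats import ttest_rel

samples = pd.read_csv("fixations.csv")
samples = samples[(samples.time_ms >= 0) & (samples.time_ms < 1500)]
samples["bin"] = (samples.time_ms // 100) * 100      # 0, 100, ..., 1400

adv = (samples.groupby(["bin", "subject", "rate"])["picture"]
              .apply(lambda p: (p == "coronal").mean() - (p == "labial").mean())
              .unstack("rate"))

for start, scores in adv.groupby(level="bin"):
    result = ttest_rel(scores["fast"], scores["slow"])
    print(f"{start}-{start + 100} ms: t = {result.statistic:.2f}, p = {result.pvalue:.3f}")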
In sum, we find significant effects of speech rate on listeners’ comprehension of
potentially ambiguous sequences: Faster speech rates result in more consideration of the
unassimilated form. This shows that word recognition is modulated by the likelihood of
compensation for coarticulation (more likely in fast than slow speech).
3.2.3.3 Fixation patterns for targets containing Shoon
Unexpectedly, targets with Shoon patterned differently from targets with Vone
and Kine. Although participants also looked more to the unassimilated form (e.g., picture
of Shoon) in the fast rate condition and to the assimilated form (e.g., picture of Shoom) in
the slow rate condition, the difference began to emerge very early on, near the onset of
the alien word (see Figure 26), with the difference diminishing around 600ms. Around
1100ms, the fixation patterns began to differ again, with more looks to the
unassimilated form in the faster condition; this difference disappeared around 1500ms.
5
After the critical noun+verb sequence, the speech rates of the carrier phrases differ (fast or slow rate),
although the lexical items are the same. Thus, the timing properties of the acoustic input after the critical
noun+verb sequence (e.g. "to skip lunch every other day") do not line up in exactly the same way in the
fast and slow conditions. Importantly, in our opinion the eye-movement patterns in Figure 25 are not an
artifact of this 'alignment difference.' There is no reason to think that small timing differences in when
listeners encounter the remaining words (e.g. "to skip lunch every other day") would trigger the pattern in
Figure 25. Instead, we take the patterns in Figure 25 as evidence that the unassimilated form (e.g. Vone) is
considered more in the fast speech conditions than in slow speech conditions (a finding which is logically
coherent given our expectations regarding compensation for coarticulation in fast vs. slow speech).
Figure 26: Coronal-picture advantage for the target Shoon
(dashed line = offset of alien name, 0ms = onset of alien name)
To assess this pattern statistically, paired T-tests were conducted on 100ms time-
slices, beginning from the onset of the alien word until 1500ms post-onset. We found an
early main effect of Speech Rate (stronger coronal advantage in the fast rate than in the
slow rate condition), which was significant from 100ms to 200ms in the subject analysis
but not in the item analysis (T1(23)=2.27, p1<.05; T2(11)=2.00, p2=.18), significant
from 200ms to 600ms in both the subject and item analyses (200ms to 300ms: T1(23)=3.28,
p1<.01; T2(11)=2.60, p2<.05; 300ms to 400ms: T1(23)=2.79, p1<.05; T2(11)=3.15,
p2<.01; 400ms to 500ms: T1(23)=3.94, p1<.001; T2(11)=5.04, p2<.001; 500ms to
600ms: T1(23)=2.74, p1<.05; T2(11)=6.16, p2<.001), and marginal from 1200ms to 1300ms
in both the subject and item analyses (T1(23)=1.80, p1=.08; T2(11)=1.96, p2=.08).
3.2.4 Discussion for perception study
In Experiment 2, we tested whether compensation for assimilation can be triggered
by contextual cues encountered before a potentially ambiguous word. More specifically,
to see whether expectations can modulate the compensation for coarticulation process, we
explored whether speech rate influences the interpretation of subsequent words.
According to articulatory phonology (Browman & Goldstein, 1992, 1995), fast speech is
more likely to have more gestural overlap than slow speech. Therefore, we hypothesized
that – if fast speech rate can trigger compensation for coarticulation – then a potentially
ambiguous sequence such as 'Kine/Kime proposes' will result in more consideration of
the unassimilated form, Kine, when embedded in fast speech than when embedded in
slower speech.
This prediction is confirmed: Participants’ eye-movements in the Kine/Kime and
Vone/Vome trials show that the unassimilated form is fixated more in fast speech
conditions than in slower speech conditions, even though the speech rate of the actual
target strings was held constant. (We only varied the speech rate of the carrier phrases.)
Our findings indicate that coproduction information before the immediate context of
lexically ambiguous sounds can indeed lead to compensation for coarticulation (Mann
1980; Mann & Repp 1981; Repp & Mann, 1981, 1982) during real-time word
recognition.
Timing of the speech-rate effects
For Vone and Kine, the difference between the fast and slow rate conditions
became significant 1000ms after the onset of the critical word (alien name), which is
roughly 700ms after the acoustic offset of that word. Since it typically only takes about
100 to 200ms to program a saccade (e.g., Matin et al., 1993), this may seem rather late.
However, there are at least two main reasons why this delay is not unexpected. First, the
fact that the critical words ended in nasals ([n], [m]) may be causing a delay. This is
because a number of studies have shown that acoustic signals regarding the place of
articulation of nasals in particular are more difficult to perceive (House, 1957; Mohr &
Wang, 1968; Narayan, 2008). In particular, Hura et al. (1992) found that fricatives were
less confusable than nasals when they investigated the perception of fricative and nasal
assimilation. Thus, the extra time needed for the speech rate effect to reach significance
may indicate processing difficulties stemming from the nature of the acoustic signals
from the nasal. Second, the extra time may be due to the heightened lexical competition
between the two possible candidates. In related eye-tracking work on assimilation, Gow
and McMurray (2007) also found delayed effects in the presence of lexical competition,
and note that “such ambiguity slows the dynamics of processing” (Gow & McMurray,
2007, p. 189). In sum, although the timing of the speech rate effect may at first seem later
than expected, existing work suggests that this delay may be attributed to the acoustic
properties of nasals and the presence of lexical ambiguity.
Different patterns in fricative-initial words
We have not yet addressed the reasons for the discrepancies in fixation patterns
between Vone/Kine vs. Shoon. While the difference for Vone and Kine between the fast
and slow rate conditions became significant 1000ms post-onset, the difference for Shoon
became significant as early as 100ms post-onset, much earlier than expected. The early
divergence in the Shoon items suggests that participants were possibly using the earliest
acoustic cues available before the nasal, presumably in the fricative [ʃ], as suggested by
Beddor et al. (2013). In fact, existing research indicates that subsequent bilabial
nasals ([m]) can indeed influence the articulation and acoustics of a preceding fricative:
The articulation and acoustic realization of the fricative [ʃ] are known to be influenced
both by preceding and following segments (Hoole et al., 1993). In the case of the
Shoon/Shoom items, the frequency of the word-initial fricative may be influenced
differently by the anticipatory coarticulation with [m] and [n] (but the items used were
never actually produced with a final [m]). This is in part supported by the findings of
Daniloff & Moll (1968) who found that anticipatory coarticulation triggered by lip
rounding can be detected up to four segments before the segment of interest. In sum, we
attribute the early eye-movement patterns observed with the Shoon/Shoom items to the
fact that fricatives are highly susceptible to anticipatory coarticulation triggered by
subsequent segments.
In any case, the findings for Shoon items do not invalidate our key observation
that speech rate, or the potential coproduction information associated with it, influences
comprehenders’ perception of potentially assimilated/potentially ambiguous words.
When comprehenders encounter ambiguous acoustic signals in speech of different rates,
the coproduction information has an impact on how the word recognition system processes
the acoustic signals. In sum, the results of Experiment 2 demonstrated that listeners are
able to use cues present in the speech stream before the ambiguous word (the speech rate
manipulation) to guide their processing.
In Experiment 3, we address the issue that assimilation in natural speech is often
not absolute and test the extent to which bottom-up cues in the acoustic signal stemming
from variability in gestural overlap can be perceived by listeners. To do this, we explore
the relationship between the speaker's articulatory gestures and the listeners' eye-movements
during word recognition. Prior work has not used gradient stimuli and has not
explicitly linked articulatory data with perception data. To tap directly into the
articulatory properties of words, we used electromagnetic articulography (EMA) to
measure the actual articulatory gestures that were produced by the speaker as she uttered
the words that were subsequently used in the eye-tracking study. This two-pronged
approach allows us to study how the production measure is linked to the perceptual
measure of eye-movements.
3.3 Experiment 3: Production
Even in phonologically viable contexts, assimilation, which is caused by the
overlap of articulatory gestures, may not always occur, and even when it does, it may not
be complete (Byrd, 1996; Nolan, 1992). This raises the question of what the relationship
is between the extent of coarticulation (e.g., the degree of gestural overlap
between the movements of lips and tongue tip) and perception (e.g., Suprenant &
Goldstein, 1998). In the example of “A quick run picks you up”, if the duration of the
articulatory movement of the tongue tip for [n] in run barely overlaps the duration of the
articulatory movement of the lips for [p] in picks, then the perceived word is very likely
to mean running. If the duration of the tongue tip movement almost completely overlaps
the duration of the lip movement, then the perceived word is very likely to mean an
alcoholic drink (rum). However, how would listeners interpret the word when the
durations of the two movements overlap each other only partially? How sensitive are they
to this kind of bottom-up information? In other words, how does the perception system
respond to information regarding the coproduction of articulation? This is an important
question, given prior work showing that assimilation in naturalistic speech is often only
partial.
To shed light on these questions, in Experiment 3 we analyze the articulatory data
of the critical noun+verb sequences from Experiment 2 to explore the link between (i) the
articulatory gestures produced by the speaker and (ii) how listeners perceive the resulting
words, as reflected by eye-movements during comprehension. The articulatory data was
recorded via three-dimensional Electromagnetic Articulography (EMA, also known as
Electromagnetic Articulometry). In EMA, sensors consisting of small electromagnetic
coils are attached to different articulators such as the tongue tip, the lips, and the jaw. To
locate the movements of these sensors in real-time, EMA measures the
electromagnetically induced currents in the sensors in the midsagittal plane (also see
Kaburagi et al., 2005 and Kröger et al., 2008 for a detailed discussion of EMA systems).
Over the past decade, EMA systems have been established as a key tool in studying
speech production (Hoole, 1996; Perkell et al., 1992; Schönle et al., 1987; Zierdt et al.,
1999). These systems are suitable for experimental speech research because they can
measure the movements and the orientation angles of the articulators with high temporal
resolution.
Why do we choose to analyze articulatory information (instead of the acoustic
measures) to study its connection to perception? First, due to the well-known lack of
invariance problem, it is unclear whether any one (or two or three) acoustic cues chosen
for analysis would necessarily convey all the information representing coproduction. In
essence, given that articulation is the actual, ‘true’ bottom-up source of acoustic
variation, it seems more desirable to use articulatory data as a measure of the extent of
coarticulation rather than its acoustic correlates. Second, by studying the possible link
between perception and production directly at the articulatory level, the study can also
contribute to existing debates concerning different theories of speech perception (see e.g.,
Liberman & Mattingly, 1985 and Fowler, 1986 for gestural theories of speech perception;
Diehl & Kluender, 1989 and Massaro, 1998 for general auditory and learning approaches
to speech perception).
By exploring the relation between articulation and perception, we aim to learn
more about listeners’ sensitivity to fine-grained bottom-up cues. Do small changes on the
articulatory level lead to corresponding small, gradient changes on the perceptual level?
Or, are only large changes in how words are articulated detected by the perception
system, perhaps in a categorical way? The unique use of EMA to measure aspects of
production and eye-tracking to measure aspects of perception allows us to tap into this
issue in a direct fashion.
3.3.1 Method
There were two parts to this experiment: (i) on the production side, EMA
recorded the speaker’s articulatory gestures and (ii) on the perception side, visual-world
eye-tracking recorded listeners’ word recognition processes. The eye-tracking methods
and data are from Experiment 2 above (and thus not repeated here). The methods
presented here refer to the measurement of articulatory data and subsequent analyses
linking the articulatory data and eye-movements.
Participants
The speaker who generated the spoken stimuli for Experiment 3 was a female
native speaker of American English with no known language impairments. (The
eye-tracking data is the data from Experiment 2.)
Electromagnetic Articulography (EMA)
Articulatory data was collected using an electromagnetic articulometer (NDI
Wave). Transducers were attached at the vermilion border of the upper and lower lip, and
the tongue tip. A reference sensor was also attached to the bridge of the nose. The
articulatory data was collected at 100 Hz and the acoustic data at 22050 Hz. Head
movement was corrected during data collection by the Wave system.
Articulatory Overlap
As our focus here is on coronal place assimilation, all alien words end with a
coronal nasal [n] (e.g., Vone, Kine, Shoon) and the immediately following words begin
with a labial, [p] or [b] (e.g., bites and peeks). The articulation of the coronal nasal was
tracked by the movement of the tongue tip (TT) while the articulation of the labial was
tracked by lip aperture (LA). LA was calculated by the Euclidean distance in the sagittal
plane between the sensors on the upper and lower lip.
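A minimal sketch of the LA computation is given below (this is not the actual processing script; the array layout and the coordinate values are hypothetical).

```python
import numpy as np

def lip_aperture(upper_lip, lower_lip):
    """Lip aperture (LA) over time: the Euclidean distance in the
    sagittal plane between the upper- and lower-lip sensors.

    upper_lip, lower_lip: arrays of shape (n_samples, 2) with the
    (x, y) positions of each sensor at every sample (100 Hz here)."""
    return np.linalg.norm(upper_lip - lower_lip, axis=1)

# Example with two fabricated samples (positions in mm):
ul = np.array([[0.0, 12.0], [0.0, 11.0]])
ll = np.array([[0.0,  2.0], [0.5,  3.5]])
print(lip_aperture(ul, ll))  # [10.0, 7.52...]
```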
The identification of the constriction formation interval was conducted using the
MVIEW software package developed by Mark Tiede at Haskins Laboratories. The
identification algorithm uses a manually estimated midpoint of an articulatory gesture
from an EMA sensor or derived variable such as the LA. Then it uses the velocity of the
sensor/variable to locate the velocity minimum closest to the input point, known as the
point of maximum constriction. It also identifies the peak velocity before and after the
midpoint of the articulatory gesture. By using the two velocities, it locates the start and
the end of the articulatory gesture, called gesture onset and gesture offset, by identifying
a point at which the velocity crosses a set percentage of the corresponding peak velocity. Following past work (e.g.,
Byrd et al., 2008), thresholds for identification of gesture onset and offset were set at
20%. A representative example is shown in Figure 27.
The example here is the tongue tip gesture from [n] and the lip aperture gesture
from [b] in “Vone buys…” (LA = lip aperture; TT = tongue tip). Articulatory overlap is
calculated by taking the overlap duration over the total active duration. The identification
of the gestural movement by the MVIEW software package allows for consistent
measurements across subjects. After the gestures were identified, all locations were
checked by hand to ensure the algorithm did not misidentify any gestures. The MVIEW
algorithm, in part because of the frequent occurrence of [r] and [l] after the labial
sound in the verbs (e.g., believes, prefers, and protests), was able to accurately identify
only 16 of the targets used. The rest were excluded from further analysis, which is the standard
procedure used for misidentified gestures in articulatory phonetics (e.g., Byrd et al.,
2008; Krivokapić & Byrd, 2012; Parrell, 2011).
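The landmark-finding procedure described above can be approximated as follows. This is a simplified sketch of a velocity-threshold algorithm of the kind implemented in MVIEW, not the MVIEW code itself; the search-window size around the estimated midpoint and the function and variable names are assumptions made for illustration.

```python
import numpy as np

def gesture_landmarks(signal, fs, guess_idx, threshold=0.20, win=20):
    """Simplified velocity-threshold landmark finding for one gesture.

    signal:    1-D articulator time series (e.g., LA or TT position)
    fs:        sampling rate in Hz (100 Hz for the EMA data)
    guess_idx: manually estimated sample index near the gesture midpoint
    threshold: fraction of peak velocity defining onset/offset (20% here)
    win:       half-width in samples of the search window (assumed value)

    Returns (onset, max_constriction, offset) as sample indices.
    """
    speed = np.abs(np.gradient(signal) * fs)

    # Point of maximum constriction: speed minimum nearest the estimate.
    lo, hi = max(guess_idx - win, 1), min(guess_idx + win, len(speed) - 1)
    max_c = lo + int(np.argmin(speed[lo:hi]))

    # Peak velocities before and after the point of maximum constriction.
    peak_before = int(np.argmax(speed[:max_c]))
    peak_after = max_c + int(np.argmax(speed[max_c:]))

    # Onset: last sample before the pre-peak where speed is below threshold.
    low = np.where(speed[:peak_before] < threshold * speed[peak_before])[0]
    onset = int(low[-1]) if low.size else 0

    # Offset: first sample after the post-peak where speed drops below threshold.
    low = np.where(speed[peak_after:] < threshold * speed[peak_after])[0]
    offset = peak_after + int(low[0]) if low.size else len(speed) - 1

    return onset, max_c, offset
```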
Figure 27. Representative example of measurement of an articulatory gesture.
From the MVIEW-identified variables, we computed the measure of articulatory
overlap. Articulatory overlap was calculated as the proportion of the overlap duration
over the total active duration of the constriction formation of the gesture (Chitoran,
2002). Note that the point of maximum constriction, instead of gesture offset, was chosen
as the end point of a gesture because earlier work has suggested the point is at or near the
end of active control of a gesture (e.g., Browman & Goldstein, 1989; Pouplier & Goldstein,
2010). Articulatory movements after this point are frequently influenced by control of the
following gestures. Overlap duration was defined as the amount of time during which the
gesture onset-to-maximum-constriction intervals of TT and LA co-occur (Chitoran, 2002). Total
active duration was defined as the duration from the earliest gesture onset from either TT
or LA to the latest maximum constriction from either TT or LA (Browman & Goldstein,
1989; Pouplier & Goldstein, 2010).
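A minimal sketch of this overlap measure, assuming each gesture is summarized by its onset and maximum-constriction times, is given below (illustrative only; not the actual analysis code, and the example times are fabricated).

```python
def articulatory_overlap(tt, la):
    """Proportion of articulatory overlap between two gestures.

    tt, la: (onset, max_constriction) times in seconds for the tongue-tip
    and lip-aperture gestures, each interval running from gesture onset
    to the point of maximum constriction.

    Overlap duration = time during which the two intervals co-occur;
    total active duration = earliest onset to latest maximum constriction
    (a sketch of the Chitoran (2002)-style measure used in the text)."""
    overlap = min(tt[1], la[1]) - max(tt[0], la[0])
    overlap = max(overlap, 0.0)                      # no co-occurrence
    total = max(tt[1], la[1]) - min(tt[0], la[0])
    return overlap / total

# Example: TT gesture 0.10-0.26s, LA gesture 0.18-0.32s.
print(articulatory_overlap((0.10, 0.26), (0.18, 0.32)))  # ~0.36
```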
3.3.3 Results and discussion for production study
To test the relation between perception and information about coproduction of
articulation, we investigated the correlation between the measures of articulatory overlap
on the production side and eye-movements on the perception side. In essence, we wanted
to see whether the perception system is sensitive, in a gradient way, to varying amounts
of gestural overlap. If the perception system can detect small changes in articulation, it is
expected that an increasing amount of gestural overlap should lead to increasing
consideration of alien names that end in a labial (e.g., Kime, Vome, Shoom).
In light of the eye-tracking results for Experiment 2, we chose to analyze (i) Vone
and Kine together, and (ii) Shoon separately. When thinking about which time-window to
use for the eye-tracking data, we opted for the time-slices where, in the eye-tracking
analysis, we found effects of speech rate (i.e., 1100 to 1300ms after word onset for Vone
and Kine; 200 to 600ms after word onset for Shoon). The eye-movements were
aggregated over these time-slices and converted to coronal advantage scores. Data
points from both the fast and slow carrier phrase conditions were included in the
analyses. Then we assessed how these eye-movement coronal advantage scores
correlated to the articulatory overlap scores computed from the EMA data.
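The correlation analysis itself can be sketched as follows; the paired values below are fabricated for illustration and are not the measurements reported in this chapter.

```python
import numpy as np
from scipy import stats

# Hypothetical paired measurements: one articulatory overlap value per
# token (from the EMA recordings) and the listeners' mean coronal
# advantage score for that token in the relevant time window.
overlap = np.array([0.22, 0.55, 0.61, 0.70, 0.83, 0.88])
coronal_advantage = np.array([0.6, 0.2, 0.1, -0.2, -0.5, -0.7])

r, p = stats.pearsonr(overlap, coronal_advantage)
print(f"r({len(overlap) - 2}) = {r:.2f}, p = {p:.3f}")
```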
Figure 28. Correlation between coronal advantage scores and articulatory
overlap measures for Vone and Kine items. (The eye-tracking data for coronal
advantage were aggregated from 1100 to 1300ms post target word onset)
As shown in Figure 28, for Vone and Kine, the two measures of articulatory
overlap and coronal advantage were significantly correlated (r(22)=-.45, p<.05). Thus,
participants looked more to Vome and Kime when there was a greater amount of
articulatory overlap, and more to Vone and Kine when there was a smaller amount
of articulatory overlap. The data points for the Vone and Kine items clearly show that
assimilation is not always complete, and that articulatory overlap varies in a gradient
manner. The overlap measure (x-axis) had values ranging from approximately 0.2 to 0.9,
with greater values indicating a greater extent of articulatory overlap between the
movements of the tongue tip and the lips. The same observation of gradience is also true
for the measure of coronal advantage (y-axis). Coronal advantage scores had values from
-1.0 to 1.0, with values distributed relatively equally across the range.
For Shoon (not included in Figure 28), the correlation between articulatory
overlap and coronal advantage scores was not significant, probably due to the
small number of data points (only 8).
A potential concern with the correlation shown in Figure 28 is that there are no
data points between the articulatory overlap values of 0.3 and 0.5. If the two data points
around the values of 0.2 in articulatory overlap were actually outliers, they could have
skewed the correlation. The concern is alleviated, however, if we combine the data for
Vone and Kine with the data for Shoon. As can be seen in Figure 29, when the data is
pooled together, the data points around the value of 0.2 are clearly not outliers. A
correlation test of this larger data set also found a significant negative correlation
between the production measure of articulatory overlap and the perception measure of
eye-movements (r(30)=-.41, p<.01).
The significant negative correlation between the production measure of
articulatory overlap and perception measure of eye-movements shows that the perception
system is very sensitive to coproduction information at the point of assimilation. As the
overlap between the tongue tip and the lip gesture increases, the coronal advantage score
becomes increasingly negative: The more overlap there is in the speaker’s articulation,
the more likely listeners are to look to the assimilated form (e.g. Vome). Importantly, as
we can see in Figures 28 and 29, listeners’ perceptual responses are gradiently sensitive
to small changes in articulatory overlap.
Figure 29. Correlation between coronal advantage scores and
articulatory overlap measures for Vone, Kine, and Shoon.
3.4 General discussion, conclusions
The research reported here started out with two main aims. Experiment 2 explored
whether expectations created by prior context (fast vs. slow speech rate) can trigger
compensation for coarticulation because of the varying amounts of articulatory overlap
associated with different speech rates. If so, this would provide evidence that
compensation for coarticulation – traditionally regarded as a process triggered by specific
acoustic cues in the input – can also be influenced by expectations regarding the extent to
which articulatory gestures are coproduced. Our second aim, in Experiment 3, was to learn
more about how bottom-up information about the articulatory overlap at the word itself
influences compensation for coarticulation. Specifically, we looked for a relationship
between the degree of overlap in a speaker’s articulatory gestures and the perception
responses of the listeners.
In Experiment 2, we found that listeners’ interpretation of the critical word
(whose speech rate was held constant) was influenced by the speech rate of surrounding
words. It appears that listeners are able to use coproduction information at a global level
to help guide the resolution of lexical ambiguities. In Experiment 3, we investigated the
link between the speaker’s articulation and listeners’ perception. This experiment
demonstrates a direct illustration of gradience on the articulatory level being correlated
with gradience on the perceptual level. In other words, the perception system is capable
of detecting small, gradient differences in the overlap of articulatory gestures. It shows
that the perception system is fine-tuned to the nature of the production system: how we
articulate words contributes to how they are perceived.
Our study is the first to combine direct measurements of articulatory overlap with
real-time perception data in the form of eye-movements, and contributes to our
understanding of how both expectations based on contextual cues and gradient bottom-up
cues guide the process of compensating for coarticulation. Our findings also relate to the
debate regarding the role of articulation in speech perception. According to a widely
accepted view, listeners extract phonetic cues from the acoustic signal, in a process that is
separate from the articulation process—also known as the general auditory and learning
approaches to speech perception (e.g., Diehl & Kluender, 1989; Klatt, 1979; Massaro,
1998). However, other researchers (e.g., Liberman & Mattingly, 1985; Fowler, 1986)
argue in favor of gestural, articulation-based theories of speech perception. According to
the gestural theories of speech perception, both speech production and speech perception
recruit the motor system, with articulatory gestures serving as the ‘common currency’
that links production and perception. In other words, perceiving speech is perceiving
gestures.
At first glance, the results of the experiments appear to be compatible with both
gestural-based theories of speech perception and the general auditory and learning
account. With respect to gesture-based theories, our findings of compensation on the
basis of the amount of coarticulation as well as a correlation between the extent of
gestural overlap and eye-movements are clearly consistent with the idea that
‘perceiving speech is perceiving gestures’ (see also Fadiga et al. 2002, Mitra et al. 2012).
On the other hand, the general auditory and learning account can also explain how
listeners could compensate for assimilation on the basis of speech rate. This account
posits that the perceptual system can learn how words may be reduced in a particular
manner through a general perceptual process, without appealing to the representation of
gestures.
For example, in a sentence like “A quick run picks you up”, the perception system
may learn that the word run is more likely to be the intended target than rum when
the acoustic signal is ambiguous and speech rate is fast. This is consistent with claims
made by exemplar models of phonology (e.g., Bybee, 2010), which propose that
phonological structure emerges from generalizations over the phonetic detail of speech as
experienced by a speaker-listener (e.g., Pierrehumbert, 2003). According to this view,
shared phonetic properties of words are dependent on a statistical relationship, onto
which phonological units are established. In other words, listeners could establish that
run is more likely to be the intended spoken word on the basis of his or her
experience. Without appealing to the representation of gestures, listeners may still be able
to disambiguate words on the basis of speech rate.
However, the fact that the experiments found an effect with the use of nonwords
may weaken the case for the general auditory and learning approaches to speech
perception. Without prior examples, statistical learning is also not likely to occur (or can
occur only to a limited extent over the course of the experiment). What is crucial is
whether the exemplar model would allow the transfer of statistical learning from real
words to nonwords. If not, the findings with the nonwords seem to be more easily
captured by gestural theories. If the transfer of statistical learning is allowed, then the
perceptual system would treat ambiguous nonwords the same way as real words and can
compensate for coarticulation accordingly.
In sum, we have shown that for native speakers of English, coproduction
information from contextual cues as well as at the critical word itself influences
perception of reduced speech. The experiments presented here found that (a) the
coproduction information encountered before the point of coronal assimilation can impact
the perceptual processes that compensate for coarticulation and (b) there is a fine-grained
sensitivity to coproduction information – as measured by the actual amount of
articulatory overlap – at the point where coronal assimilation occurs.
A relevant question with regard to the processing of reduced speech is this: if
listeners are sensitive to varying amounts of coproduction information in their native
language, what about nonnative speakers? In other words, does the presence of an
additional phonological system influence the recognition of reduced forms? To address
this question, in the next chapter we present the results of a modified Experiment 2 by
comparing two different populations: adult native speakers of English and adult
Mandarin learners of English.
Chapter 4
Nonnative Compensation for Assimilation:
When Phonological Systems Clash
4.1 Introduction
The results of the experiments in Chapter 3 suggest that native speakers of
English can compensate for coproduction of articulation on the basis of speech rate
information in the preceding context. In particular, prior work has shown that faster
speech rates are associated with greater extent of reduction and greater amount of
coarticulation, and indeed, in Chapter 3 we saw that when the speech rate of carrier
phrases differs, English listeners exhibit different compensatory behavior to an identical
acoustic sequence. We measured this by means of analyzing people’s eye-movements in a
lexical identification task, which were found to be correlated with the extent of the
speaker’s articulatory overlap during production.
This brings us to the related question of how nonnative speakers deal with
reduction patterns that are absent from their native language. To what extent can they compensate for
coarticulatory patterns in a second language? When adults acquire a new (nonnative)
phonological system, do the two systems compete against one another, or co-exist
without interference? What might be the role of articulatory gestures in the perception of
nonnative phonological processes? In this chapter, we present the results of a follow-up
study, building on the experiments from Chapter 3. The study reported in this chapter had
participants from two different populations: adult native (L1) speakers of English and
adult Mandarin learners (L2) of English.
We chose to look at native speakers of Mandarin because Mandarin phonotactics
does not allow words to end in the nasal coda [m]. This is relevant because in Chapter 3
we examined the process of English coronal place assimilation. Coronal place
assimilation occurs when a coronal segment (e.g., [t], [d], [n]) takes on the place of a
following non-coronal (labial or velar) segment. For example, in “A quick rum picks you
up”, the word ‘run’ may be ambiguous between ‘run’ and ‘rum’ (Gaskell & Marslen-
Wilson, 2001). This lexical ambiguity arises if the [m] from rum is interpreted to be an
underlying [n] assimilating to [m] due to the labial gesture of the [p] of picks overlapping
the coronal gesture of the [n]. In Experiment 2 and 3, all critical items included a target
sequence consisting of a noun ending in a coronal nasal ([n]) followed by a verb
beginning with a labial stop ([p], [b]). Therefore, if Mandarin phonotactics does not allow
words to end in the nasal coda [m], how would Mandarin speakers deal with the process
of coronal place assimilation in English?
It is important to note that although Mandarin phonotactics does not allow [m] in
the coda position, Mandarin speakers do have experience with word forms that end with
the labial nasal in their language. For example, as we have seen in the discussion of
Mandarin syllable contraction in Chapter 2, words can contract to produce forms such as
[uom], ‘we’. However, it is unclear to what extent Mandarin speakers can compensate for
assimilation to the nasal coda [m], since [m] codas are only present in contracted words
(which occur in casual speech) and, for example, are not part of the standard (written)
language. Two possibilities exist regarding how Mandarin speakers may compensate for
assimilation to nasal coda [m]: (i) they may simply disregard sounds that end with a [m]
because there is no need to consider a competing category of sounds that does not exist in
Mandarin in its citation form or (ii) they may also compensate for coarticulation in the
same manner as the English speakers in Chapter 3.
In this chapter, we investigate the extent to which the absence (or near-absence,
since it only occurs in spoken contracted forms) of a competing category (i.e. [m]) in the
coda position for the L1 (Mandarin) system influences how listeners compensate for
coronal place assimilation in their L2 (English). Broadly speaking, offline mouse-clicks
show that L1 phonotactics influence L2 speakers’ interpretations (explicit choices), but
online eye-movements show that the L2 speakers are able to compensate for assimilation,
similar to L1 speakers. Our findings highlight the benefits of using online measures such
as eye-movements in both L1 and L2 research. The experiment also demonstrates that
even for high-proficiency L2 speakers, there appears to be a delay in
compensation behavior, indicating the difficulty of processing in an L2.
4.1.1 Native and nonnative compensation for assimilation
A number of studies have recently focused on how native speakers compensate
for the context-dependent phonological variation resulting from a subtype of
coarticulation, namely assimilation (Coenen et al., 2001; Gaskell & Marslen-Wilson,
1996, 1998; Gow 2001, 2002; Mitterer & Blomert, 2003). For example, in English, there
is a tendency towards place assimilation. In sentences like “A quick rum picks you up”,
the word rum may be interpreted either as run or rum (Gaskell & Marslen-Wilson,
2001). This lexical ambiguity arises if the [m] from rum is interpreted to be an underlying
[n] assimilating to [m] due to the influence of the [p] from picks. In other words, the word
rum in the sentence may be interpreted as rum or run because of its occurrence in the
context of picks. Gaskell and Marslen-Wilson (1996) demonstrated that the unassimilated
word form is activated only in a context where assimilation is phonologically viable.
Gaskell and Marslen-Wilson (1998) also showed that listeners can compensate for
assimilation of nonwords. Gow and colleagues (Gow 2001, 2002, 2003; Gow &
McMurray, 2007) showed that assimilation is characterized by regressive and progressive
effects and is best described as a non-neutralizing, gradient process.
As opposed to the place assimilation seen in English, other languages display
different assimilation patterns. For example, French has a tendency towards voicing
assimilation. When a word like botte ([bot]; ‘boot’) is followed by grise ([griz]; ‘grey’), it
is produced as [bod] (Darcy, 2007), where the voicing of [t] from botte assimilates to the
voicing of [g] in grise. The voicing does not change, however, when botte is followed by
mauve ([mov]; ‘purple’). This is because voicing assimilation in French only occurs in
obstruent clusters in the same phonological phrase. Therefore, the [m] in mauve does not
trigger assimilation because it is not an obstruent. The voicing assimilation pattern
observed in French and the place assimilation pattern in English are both language-
specific; that is, the phonological variations and contextual dependencies are specific to a
particular language and may be different from language to language. Such rules (e.g.,
substitution, insertion, or deletion of a segment as a result of phonological context) are
common across the world’s languages and can pose a challenge for word recognition:
speakers of a language need to learn how the form of a word is modified in their
respective language. Second language acquisition (SLA) studies have shown this
challenge is even greater for second language (L2) learners: in addition to having to
accommodate the phonological variations in their L1, they face the same challenge
in acquiring another phonological system. This raises the question of to what
extent L2 learners can compensate for the context dependencies in their L2 when the L1
and L2 have different language-specific phonological variations.
In exploring the question of language-specific phonological inference, Darcy and
colleagues (Darcy et al., 2007; Darcy et al., 2009) investigated how L1 English speakers
who speak French as their L2, and L1 French speakers who speak English as their L2
deal with assimilation rules in their L2s, namely voicing assimilation in French and place
assimilation in English. Using a word detection task, Darcy et al. (2009) presented French
and English speakers with target words that may correspond (or not) to the existent
assimilation process in their respective languages. In other words, listeners heard target
words that were appropriately and inappropriately modified according to the context for
voicing assimilation in French and place assimilation in English. All target words in their
study contain only legal sequences and native phonetic categories in their L1s in both
conditions (place and voicing). The results of Darcy et al. (2009) showed that both L1
English and French speakers demonstrated a higher degree of compensation for
phonological changes that exist in their native language: French speakers seem to be
more fine-tuned to voicing assimilation and English speakers to place assimilation:
L1 French participants detected the target [bot] more often in bo[dg]rise than in
bo[dm]auve, signaling the compensation of voicing assimilation. In contrast, they
responded less to the target lune [lyn] ‘moon’ in the (appropriate place) lu[mp]ale and
(inappropriate) lu[mR]ousse conditions. L1 English participants detected assimilated
targets for place assimilation but responded less in the appropriate and inappropriate
voicing contexts. In other words, L1 English speakers compensated significantly more for
place assimilation while L1 French speakers compensated more for voicing assimilation
– in accordance with the presence/absence of these assimilation patterns in their native
languages.
Additionally, Darcy et al. found a small but significant compensation effect
for the non-native assimilation processes for their two populations. They concluded that
L2 learners are indeed acquiring the ability to compensate for the assimilation process not
present in their native language, albeit to a lesser extent compared to the native language.
These results suggest that compensation for assimilation is dependent on language-
specific phonological knowledge even when their participants have variable L2
experience. The French native speakers in their study grew up monolingually with only
limited and late experience with English; the native English speakers also grew up
monolingually but roughly 1/3 were highly fluent in French while the rest were beginning
learners.
Darcy et al. (2007) further investigated the role of language fluency in L2
learners’ compensation behavior. They again studied both L1 speakers of English and
French, who are L2 speakers of French and English, respectively. Using the same
experimental paradigm, they tested their patterns of compensation in both their native
language and their L2. They found that L2 learners whose fluency is at the beginner level
compensate for assimilation in the same way as in their L1: the same compensation
pattern is observed in their L1 and L2. In contrast, L2 learners whose fluency is at the
advanced level can compensate for the assimilation process absent in their native
language while also maintaining the compensation pattern in their native language. This
leads to the conclusion that the two separate phonological systems can co-exist without
interference.
4.1.2 Nonnative acquisition of phonological processes
Darcy et al.’s work on assimilation contributes to a growing literature on the more
general question of whether L2 learners are capable of acquiring the phonological
processes of the target language in the same way or to the same extent as L1 learners. So
far, the findings in this area are mixed as to whether L2 learners are
able to acquire native-like phonological abilities in their second language. We discuss
some of the key findings below:
There is a set of findings, including those of Darcy and colleagues, which are
consistent with the view that L2 experience can modify speech perception and even
suggest that L2 learners may become native-like in at least some aspects of their L2
phonology (Flege, 1992a, 1992b; Flege et al., 1997). For example, in studying a group of
German, Spanish, Mandarin, and Korean learners of English, Flege et al. (1997) showed
that the more experienced nonnative participants produced and perceived English vowels
more accurately than their less experienced counterparts. In other words, phonological
knowledge may be malleable as a result of experience in a L2.
Further evidence for this view comes from work on foreign accent judgments of
nonnative speakers (Bongaerts, 1999; Bongaerts et al., 2000), which found that some L2
learners can achieve native-like ultimate attainment in this subdomain. In a series of
experiments, Bongaerts and colleagues (1999, 2000) showed that some Dutch learners of
English and French as well as English learners of Dutch could produce sentences that
were rated as native-like by the respective native speakers. They postulate that the
potential neurological disadvantage due to a late start in acquiring a language may be
overcome by quality of input, personal motivation, duration of exposure, or instructional factors.
In addition, in a perceptual learning experiment, Norris et al. (2003) showed that a
category boundary shift is possible even with a short exposure to sounds in a special
context. In their experiment with two groups of Dutch speakers, Norris et al replaced the
final fricative of target words with an ambiguous sound between [f] and [s]. One group
heard ambiguous [f]-final words (e.g., [wItlo?] from witlof, 'chicory', where [?] is used to
denote the ambiguous sound) and unambiguous [s]-final words (e.g., [na:ldbos] from
naaldbos, 'pine forest'). The other group heard the opposite manipulations (e.g.,
ambiguous [na:ldbo?] and unambiguous [wItlof]). Norris et al. found that listeners who
were exposed to [?] in [f]-final words were more likely to categorize sounds as [f]
compared to those exposed to [?] in [s]-final words. Such use of lexical feedback for
learning, they argue, is beneficial in acquiring a second language.
In contrast, many studies have highlighted the difficulty that L2 learners experience in
acquiring the phonological properties of the target language, even with extensive L2
experience (Long, 1990; Flege, 1995; Flege et al., 1999; Johnson & Newport, 1989;
Pallier et al., 1997; Weber-Fox and Neville 1996). For example, Pallier et al. (1997)
studied Catalan-Spanish bilinguals raised in a highly bilingual society. These participants
started learning their L2 before the age of six and were highly proficient speakers of both
Catalan and Spanish. However, Pallier et al. found that even Catalan-Spanish
bilinguals who began learning Catalan as early as age four still could not master the
contrast between the Catalan [e] and [ɛ] at a level comparable to the native speakers.
Further evidence for inability to master aspects of the L2 despite a long duration of
exposure comes from studies by Flege and colleagues (Flege, 1995; Flege et al., 1999). In
studying L1 Italian speakers who spoke English as their L2 (mean length of residence in
Canada: 32 years), Flege (1995) found that native speakers of English could detect the
accent of the L1 Italian speakers’ English speech even for L1 Italian speakers who had
arrived in Canada at age four. In a subsequent study on learners whose native language
was not related to English, Flege et al. (1999) found that native English speakers were
able to identify the accent of L1 Korean speakers speaking in English even when the
speakers had arrived in the U.S. at age one.
There have been a number of explanations proposed for the underlying
mechanism for the L2 learners’ difficulty in achieving native competence. The
observation of age effects on L2 performance has led to the suggestion that the ability to
acquire an L2 is limited by maturational factors—the critical period hypothesis (e.g.,
Long, 1990; Flege, 1995; Flege et al., 1995; Flege et al., 1999; Johnson & Newport, 1989;
Johnson & Newport, 1991; Weber-Fox & Neville 1996). While many studies point to the
role of age-related factors, aspects of the maturational account remain controversial. First,
studies (e.g., Birdsong, 1992; Bongaerts et al, 1997; White & Genesee, 1996) have shown
that at least some L2 learners can achieve native-like competence. Second, it remains
unclear which aspects of second language acquisition are (most) susceptible to critical-
period effects. Some researchers claim that pronunciation is most susceptible to such
effects (e.g., Long, 1990), while others argue it is the only aspect affected (e.g.,
Scovel, 1988).
In contrast to the maturational/critical period approach, other researchers have
suggested that the difficulty in acquiring properties of an L2 is related to
interference/competition effects between the L1 and L2 (e.g., Oyama, 1979; Flege, 1995;
Bialystok, 1997). This explanation posits that L2 performance is influenced by and
derived from the nature and the extent of the interaction between a bilingual's languages. The
interference may occur when a nonnative listener processes the L2 input using his or her
L1 phonological system. This explanation is also potentially confounded with the
maturational approach because it implicitly suggests that, the more fully-developed the
L1 is by the time L2 learning begins, the more L2 learning will be influenced by the L1.
A model that is consistent with the interference/competition account is the
Perceptual Assimilation Model (PAM; Best, 1995). Derived from the gestural theories of
speech perception (Gibson, 1966, 1979; Fowler, 1986), the model draws on the fact that
there is a great amount of overlap in the gestural constellations used in the different
languages. In this view, gestural elements that do not match precisely with the native
constellations are considered to be the nonnative sounds. The fundamental premise of the
PAM is that nonnative gestural constellations are likely to be perceived according to their
similarities to the native constellations that are closest to them in phonological space. For
example, if the native language of a listener does not have a dental stop but has bilabial,
alveolar, velar stops, the tongue tip constriction of the dental stop would be closest to the
alveolar stop in native phonological space. Several studies have found support for this
model, where L2 listeners process L2 phonological information by relying on
their L1 phonological structure, causing interference (McAllister et al., 2002; Strange et
al., 1998; Hallé et al., 1999; Weber & Cutler, 2004).
4.1.3 Aims of this chapter
The results of Darcy and colleagues (Darcy et al., 2007; Darcy et al., 2009) have
shown that L2 learners are capable of learning to compensate for phonological variations
that are not present in their L1. In addition, advanced learners can compensate for the
different types of assimilation in their L1 and L2, thereby demonstrating command of the
phonological properties of both languages. However, a question is left unexplored by
Darcy and colleagues: What would happen if one of the competing categories the second
language assimilates to is not allowed in the listeners’ first language? We attempt to
address this question by studying L1 Mandarin speakers who speak English as their L2.
Mandarin is well-suited for this question because while Mandarin phonotactics does not
allow the nasal coda [m], English coronal place assimilation could lead to a word that ends in [m].
Our first aim is to investigate how the learners’ L1 and L2 phonological systems
might interact when there is a mismatch between the phonological properties of the L1
and L2. In order to accomplish this goal, we investigated how native Mandarin speakers
with English as an L2 compensate for assimilation, compared to native English speakers.
Mandarin phonotactics does not allow words to end in the nasal coda [m], though [n] is
allowed. Therefore, the study allows us to study the extent to which the L1 (Mandarin)
system influences whether/how listeners compensate for coarticulation in their L2
(English). Given that previous work has shown that L2 learners whose fluency is at the
beginner level may compensate for assimilation only in the same way as in their L1, we
focused on learners whose English fluency is more advanced. In addition, this study will
help shed light on the issue of interference. If Mandarin speakers do not consider the
category of sounds disallowed in their L1 phonotactics, they should be more likely to
consider sounds that end with [n] and would not demonstrate compensation for
coarticulation. On the other hand, if they can compensate for assimilation in an L2, they
should exhibit different behavior with respect to different speech rates in their offline and
online tasks.
Our second aim is to gain a better understanding of the time course of lexical
access given the potential interference between the L1 and L2. Darcy et al. have shown
that the advanced learners can compensate for the assimilation in their L2. However, a
word detection task may not be sufficient in capturing the moment-to-moment
differences occurring immediately after the target word is processed. The experiments
from Chapter 3 showed that prior context – namely the speech rate of the preceding
words – can trigger compensation for coarticulation. In that chapter, we showed that
contextual cues, namely speech rate of the preceding words, can influence listeners’
expectations regarding coarticulation and thereby guide how they interpret potentially
ambiguous words. Experiment 4 from this chapter closely resembles Experiment 2 from
Chapter 3, a visual-world eye-tracking experiment to probe the process of word
recognition. Eye-tracking allows us to gain insights into the lexical representations that
listeners are considering over time, which will help us assess to what extent the
L2 learners differ from their L1 counterparts in processing contextual
coarticulatory information in a situation where it impacts word recognition.
4.2 Experiment 4
4.2.1 Method
Participants
Thirty-two subjects participated, all with normal or corrected-to-normal vision
and hearing. Sixteen of the participants were native speakers of American English while
the other sixteen were native speakers of Mandarin (from mainland China) living in the Los
Angeles area. The participants were paid $10 per hour for their time.
Many Mandarin speakers often speak a Chinese regional dialect in addition to
Mandarin, but we wanted to minimize the influence of other dialects on Mandarin
speakers’ acquisition of English. Thus, we opted for Mandarin speakers with a maximally
simple language profile. Eight of the sixteen native Mandarin speakers only speak
Mandarin as an L1 and English as an L2 and do not speak another regional dialect. As for
the rest of the Mandarin speakers, only three participants speak a regional dialect (Changsha,
Shanghai, and Henan) natively or at a proficiency level they considered to be higher than that
of their English. To the best of my knowledge, these regional dialects do not allow the
nasal coda [m] – in other words, they are like Mandarin in the relevant respect.
Both objective and subjective measures of phonological proficiency were
collected for the Mandarin speakers. The objective measures include the duration of
English classroom instruction received (possibly relating to the amount of English
input), length of residence in the U.S. (possibly relating to the amount and quality of
English input), a cloze test, and their scores on the Internet-Based Test (iBT) variant of
the TOEFL (Test of English as a Foreign Language). A cloze test was administered to
all Mandarin participants to determine their English proficiency. The cloze test used in
the study was a shortened version of a test developed by Oshita (1997) and used by other
researchers (e.g., Oh, 2010; Nava, 2010; Yin, 2012, who also evaluated the reliability of
the shortened test). The subsection we used – based on the evaluation of Yin (2012) –
contains a short story with 25 blanks, with the participants’ scores being equivalent to the
number of correct words used in the blanks. The TOEFL is a standardized test that is
commonly taken by international students studying in the United States. Fourteen of the
sixteen Mandarin speaking participants had taken the TOEFL.
In addition to these objective measures, three subjective measures of English
proficiency were also collected: the participants were asked to rate their proficiency, on a
scale of one to ten, with ten being the most proficient, in listening, speaking and reading
English.
Table 11. Summary of main biographical characteristics for nonnative speakers of
English. * The perfect score for the cloze test is 25. ** The TOEFL scores reported here
are for the Internet-Based Test (iBT); 14 of the 16 participants had taken this test.
*** The self-ratings of the participants' English listening, speaking, and reading
abilities are on a 10-point scale, with 10 indicating the highest fluency possible.

                                       mean      sd      range
Length of English instruction (yrs)    10.56     3.08    7-12
Length of residence in USA (yrs)       2.38      1.78    1-7
Cloze test *                           20.63     2.58    14-25
TOEFL **                               95.93     9.52    81-106
Self-rated English listening ***       6.91      1.28    5-9
Self-rated English speaking ***        6.81      1.22    4-9
Self-rated English reading ***         7.97      1.02    6-10
The Mandarin speakers’ biographical data, including the length of time they had
received English classroom instruction, length of residence in the USA, cloze test scores,
TOEFL scores, and self-ratings of English speaking, listening, and reading abilities, are
reported in Table 11.
Experimental stimuli
The stimuli used closely resembled those in Experiment 2 in Chapter 3. The
speech rates of the carrier phrases were manipulated and all critical items included a
target sequence consisting of a noun followed by a verb. A notable difference from
Chapter 3 is that instead of the three pairs of nonwords (‘alien names’) used in
Experiment 2, only two pairs—Vone, Vome, Kine, Kime—were used as the critical nouns
in this experiment. This was done because in Experiment 2, fixation patterns for
Shoon/Shoom differed from those of Vone/Vome and Kine/Kime, for reasons that are not
entirely clear (see Section 3.2.4 for discussion of possible reasons). In addition to the two
pairs of [n]-[m] words, five additional aliens – Gome, Kise, Shan, Shoon, Shoom – served
as fillers to help prevent participants from noticing the patterns in the target words. The
images for all aliens were identical to ones used in Experiment 2 (Chapter 3).
Procedure and design
The procedure was nearly identical to Experiment 2 in Chapter 3. As in Chapter 3,
the participants went through a training phase to learn the nonwords. To justify the usage
of nonwords, the participants were told that they were names of alien creatures from a
planet called Fedora. After the experiment, the participants completed the cloze test as
well as a questionnaire about their language background and proficiency.
The design of the experiment closely resembled Experiment 2, a visual-world
eye-tracking experiment that recorded listeners’ word recognition processes. Eye-
movements were recorded using an Eyelink II eyetracker sampling at 500 Hz. For this
experiment, the carrier sentences in the fast and slow conditions had average speech rates
of 4.55 syllables/s and 3.47 syllables/s, respectively. Then, we spliced out just the critical
noun+verb sequence, and inserted it into a carrier phrase. The number of target and filler
items is also different from Experiment 2 (due to the use of only two alien name pairs):
There were 16 target items, and 40 filler items in this study. Overall, each participant
came across a total of 56 items and saw the two conditions an equal number of times.
4.2.2 Results
4.2.2.1 Mouse click patterns
In the experiment, the participants’ task was to click on the picture that best
matched the auditory stimulus. Figures 30 and 31 show these offline click responses for
the L1 and L2 English-speaking participants, in terms of the percentage of coronal- and
labial picture selected, respectively. The offline choices by the two populations appear to
be quite different. As in Chapter 3, we will refer to the [n]-final words as the
unassimilated forms, and the [m]-final words as the assimilated forms. For the L1 English
speakers (Figure 30), there was a preference for the assimilated form (i.e., Vome, Kime;
57%), as opposed to the unassimilated form (i.e., Vone, Kine; 43%). Even though the
speaker who produced the stimuli for the study was instructed to produce the alien names
with a coronal final sound, L1 English speakers appeared to be willing to accept them as
ending with a labial – presumably due to the influence of the labial consonant in the verb.
In contrast, the pattern was reversed for the L2 English speakers (Figure 31), who exhibited a strong preference for the unassimilated form (65%) as opposed to the assimilated form
(34%). The L2 participants’ preference for the unassimilated form is consistent with their
L1 phonotactics: in Mandarin, a word can only end with a coronal, not a labial.
Figure 30. Percentage of unassimilated vs. assimilated
form selected by the L1 English speakers
To establish the statistical significance of these patterns, a coronal-picture advantage score was calculated by subtracting the proportion of mouse clicks for the assimilated form (e.g., Vome, Kime) from the proportion of mouse clicks for the unassimilated form
(e.g., Vone, Kine). An ANOVA with the factors Language (between subjects) and Speech
Rate (within subjects) was run using the coronal advantage score. This revealed a main
effect of Language (F1(1,30)=17.27, p1<.01; F2(1, 15)=13.88, p2<.01). There was no
significant effect of Speech Rate and no interaction.6
Figure 31. Percentage of unassimilated vs. assimilated
form selected by the L2 English speakers
To explore the effect of speech rate within each population, Figures 32 and 33 show the participants' click responses with respect to rate for the L1 and L2 speakers, respectively. As shown in Figure 32, the L1 English speakers show almost no difference in their click responses with respect to the different speech rates. In contrast, Figure 33 shows that there is a greater difference between the fast and slow rate conditions for the
L2 English speakers, with a greater preference for the unassimilated form in the slow rate
condition.
6 The same data were also analyzed using a linear mixed-effects model (Jaeger, 2008; Baayen et al., 2008), testing the effects of Language and Speech Rate and adding random effects of Subject and Item. These analyses replicated the results of the ANOVA: the odds of choosing the unassimilated form are significantly higher for L2 speakers than for L1 speakers. There is no effect of Speech Rate and no interaction.
Figure 32. Percentage of unassimilated vs. assimilated picture selected
by the L1 English speakers in the fast and slow rate conditions
To evaluate the statistical significance, separate paired T-tests were conducted to
investigate the effect of speech rate within each population. While there is no effect of
Speech Rate for the L1 speakers, the effect was marginal in the item analysis for the L2
speakers (T1(15)=-1.51, p1=.15; T2(15)=-1.94, p2=.07).
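For readers who wish to see the computation spelled out, the sketch below illustrates how the click-based coronal advantage score and the two comparisons just described could be derived from trial-level data. It is a minimal illustration only, not the scripts used for this dissertation: the file name, the column names (subject, language, rate, clicked_coronal), and the condition labels "fast"/"slow" and "L1"/"L2" are hypothetical, and the between-group contrast is shown as a t-test on subject means rather than the full by-subject/by-item ANOVA.

```python
# Minimal sketch of the click analysis (hypothetical file and column names).
import pandas as pd
from scipy import stats

# One row per trial: subject, language ("L1"/"L2"), rate ("fast"/"slow"), clicked_coronal (0/1)
trials = pd.read_csv("clicks.csv")

# Proportion of coronal (unassimilated) choices per subject and speech rate
props = (trials.groupby(["subject", "language", "rate"])["clicked_coronal"]
               .mean()
               .reset_index())

# Coronal advantage = P(coronal choice) - P(labial choice) = 2 * P(coronal) - 1
props["advantage"] = 2 * props["clicked_coronal"] - 1

# Between-subjects contrast (cf. the main effect of Language): collapse over rate
by_subj = props.groupby(["subject", "language"])["advantage"].mean().reset_index()
l1 = by_subj.loc[by_subj["language"] == "L1", "advantage"]
l2 = by_subj.loc[by_subj["language"] == "L2", "advantage"]
print("Language:", stats.ttest_ind(l1, l2))

# Within each group, paired comparison of the fast vs. slow rate conditions
for lang, grp in props.groupby("language"):
    wide = grp.pivot(index="subject", columns="rate", values="advantage")
    print(lang, "rate:", stats.ttest_rel(wide["fast"], wide["slow"]))
```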
Figure 33. Percentage of unassimilated vs. assimilated picture selected
by the L2 English speakers in the fast and slow rate conditions
4.2.2.2 Eye fixation patterns
The eye-movement data for the native English speaking participants from 0ms to
1500ms after the onset of the alien name is shown in Figure 34. The data is plotted in
terms of the coronal-picture advantage score, which is calculated by subtracting the
proportion of fixations to the assimilated form (e.g., Vome, Kime) from the fixations to
the unassimilated form (e.g., Vone, Kine). Thus, positive numbers mean there are more
fixations to the unassimilated form and negative numbers mean there are more fixations
to the assimilated form.
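Written out as a formula (our notation, not taken from the original figures), the score for a given time window t is simply the difference between the two fixation proportions:

```latex
\text{coronal advantage}(t) = p_{\text{fix}}(\text{coronal picture},\, t) - p_{\text{fix}}(\text{labial picture},\, t)
```

For purely illustrative numbers, a window in which 55% of fixations fall on the coronal picture and 30% on the labial picture yields a score of +0.25; the reverse pattern yields -0.25.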
Figure 34: Coronal-picture advantage for L1 English speakers (dashed line = average
offset of alien name; 0ms = onset of alien name, determined individually for each trial;
shaded = region with significant difference in both the subject and item analysis)
The patterns observed for native English speakers are consistent with the results
from Experiment 2: native speakers of English exhibited compensatory behavior in their
eye-movements based on the speech rate of the preceding context. As shown in Figure
34, the eye-movement patterns for the fast speech and slow speech conditions do not
diverge from each other until around 500ms after the onset of the alien name. From about
600ms to 1500ms, participants looked more to the unassimilated forms in the fast speech
condition and more to the assimilated forms in the slow speech condition. (Recall that the
critical sequence of noun+verb was identical in both conditions; it was only the speech
rate of the carrier phrase that was manipulated.)
To assess these patterns statistically, paired T-tests were conducted using the
coronal advantage score on 100ms time-slices, beginning from the onset of the alien word
until 1500ms post-onset. We found a main effect of Speech Rate (more looks to the
unassimilated form when the context had a fast speech rate), which was significant in the
subject analysis and marginal in the item analysis from 600ms to 700ms (T1(15)= 2.14,
p1<.05; T2(15)=2.01, p2=.06), significant from 700ms to 900ms (700 to 800ms: T1(15)=
3.79, p1<.01; T2(15)=3.43, p2<.01; 800 to 900ms: T1(15)= 2.29, p1<.05; T2(15)=2.35,
p2<.05). (All other time slices had non-significant results for effects of Speech
Rate). The results from the L1 participants demonstrate the same significant effects of
speech rate as in Experiment 2: Faster speech rates result in more consideration of the
unassimilated forms.
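As a concrete illustration of this time-slice procedure, the sketch below shows one possible implementation. It is schematic only and assumes a hypothetical long-format table of eye-tracking samples with columns subject, rate, time_ms, fix_coronal, and fix_labial; it reproduces only the by-subject (T1) analysis, not the by-item (T2) analysis.

```python
# Schematic version of the 100ms time-slice analysis (hypothetical column names).
import pandas as pd
from scipy import stats

samples = pd.read_csv("fixations.csv")               # one row per eye-tracker sample
samples["bin"] = (samples["time_ms"] // 100) * 100   # assign each sample to a 100ms slice

# Per subject, rate condition, and slice: proportion of looks to each picture,
# and the coronal advantage score (coronal minus labial).
agg = (samples.groupby(["subject", "rate", "bin"])
              .agg(coronal=("fix_coronal", "mean"), labial=("fix_labial", "mean"))
              .reset_index())
agg["advantage"] = agg["coronal"] - agg["labial"]

# By-subject (T1-style) paired t-test of fast vs. slow in every slice up to 1500ms
for b in range(0, 1500, 100):
    wide = (agg[agg["bin"] == b]
            .pivot(index="subject", columns="rate", values="advantage"))
    t, p = stats.ttest_rel(wide["fast"], wide["slow"])
    print(f"{b}-{b + 100}ms: t = {t:.2f}, p = {p:.3f}")
```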
The eye-movement data for the L2 participants is shown in Figure 35. The L2
participants displayed a different pattern in their eye-movements from the L1 participants. There appears to be a slight preference for the unassimilated form in the slow rate condition at first and for the assimilated form in the fast rate condition, but at 1000ms, the
pattern switches. From 1000ms to close to 1400ms, the L2 participants have a preference
for the unassimilated form in the fast rate condition compared to the slow rate condition.
To assess statistical significance of these patterns, paired T-tests were conducted
on 100ms time-slices, beginning from the onset of the alien word until 1500ms post-
onset. The effect of Speech Rate was marginal from 200ms to 300ms in the item
analysis (T1(15)=-1.38, p1=.19; T2(15)= -2.01, p2=.06) and was significant from 300ms
to 400ms in the item analysis (T1(15)=-1.42, p1=.18; T2(15)=-2.35, p2<.05). The
patterns became significant in both analyses from 1100 to 1200ms (T1(15)=2.64, p1<.05;
T2(15)=2.34, p2<.05) and marginal from 1200 to 1300ms in both analyses (T1(15)=1.96,
p1=.07; T2(15)=2.02, p2=.06). In sum, while Figure 34 and Figure 35 appear to be quite different for the L1 and L2 participants, the statistical analyses for the two populations actually revealed a comparable pattern: faster speech rates result in more
consideration of the unassimilated form (700ms to 900ms in L1 participants; 1100ms to
1200ms in L2 participants).
Figure 35: Coronal-picture advantage for L2 English speakers (dashed line = average
offset of alien name; 0ms = onset of alien name, determined individually for each trial;
shaded = region with significant difference in both the subject and item analysis)
To test whether the magnitude of the compensation behavior is different with
respect to speech rate across the different populations, two ANOVAs with the factors of
Language (L1 English vs. L1 Mandarin; between-subjects) and Speech Rate (within-
subjects) were conducted using the coronal advantage scores from 100ms time-slices that
showed significant effects of Speech Rate in the earlier T-tests. (We realize that using
different time slices is not ideal, but it allows us to focus on those epochs that showed
significant differences for each group.) Therefore, the scores from the 1100 to 1200ms time-slice for the L2 speakers were compared with those from the 700 to 800ms and 800 to 900ms time-slices for the L1 speakers (recall that there is a significant
difference with respect to speech rate from 700 to 900ms post-onset of the critical word
for the L1 speakers and 1100 to 1200ms post-onset for the L2 speakers). The tests
showed no interaction of Language and Speech Rate, suggesting that the magnitude of
the effect is similar in both populations.
Let us now take a closer look at the timing of the eye-movement patterns. To
facilitate comparison of the L1 and L2 speakers, we have plotted both groups’ coronal
advantage data for the 0ms to 1500ms time window in Figure 36 for the fast rate
condition and in Figure 37 for the slow rate condition. It appears that in both the fast and
slow conditions, there was a delay in terms of L2 participants’ eye-movements, compared
to that of the L1 participants. More specifically, as shown in Figure 36, L1 participants’
coronal advantage scores in the fast rate condition peak around 800ms after the onset of
the noun (signaling the strongest bias to look at unassimilated form at that time), whereas
L2 participants’ scores show a peak around 1200ms after the onset of the noun. Likewise,
in the slow rate condition, L1 participants’ coronal advantage score first reached a low
point around 600ms after the onset of the noun (signaling the strong bias to look at the
assimilated form), whereas L2 participants’ scores show a similar pattern but at a later
time (around 900ms post onset).
Figure 36: Coronal-picture advantage for L1 and L2 English speakers in the fast rate
condition (dashed line = average offset of alien name; 0ms = onset of alien name,
determined individually for each trial)
To assess these timing differences statistically, two separate series of paired T-
tests were conducted, on 100ms time-slices, using the coronal advantage score from the
onset of the alien word until 1500ms post-onset. For the fast rate condition, the effect of
Language was significant from 700ms to 900ms in both analyses (700ms to 800ms:
T1(15)=2.51, p1<.05; T2(15)=2.31, p2<.05; 800ms to 900ms: T1(15)=2.49, p1<.05;
T2(15)=2.33, p2<.05), marginal in the subject analysis and significant in the item
analysis from 1100ms to 1200ms (T1(15)=-1.86, p1=.07; T2(15)=-2.36, p2<.05),
significant from 1200ms to 1500ms in both analyses (1200ms to 1300ms: T1(15)=-2.59,
p1<.05; T2(15)=-3.28, p2<.01; 1300ms to 1400ms: T1(15)=-2.22, p1<.05; T2(15)=-2.56,
p2<.05; 1400ms to 1500ms: T1(15)=-2.04, p1<.05; T2(15)=-2.32, p2<.05).
Figure 37: Coronal-picture advantage for L1 and L2 English speakers in the slow rate
condition (dashed line = average offset of alien name; 0ms = onset of alien name,
determined individually for each trial)
For the slow rate condition, the effect of Language was marginal in the subject
analysis from 200ms to 300ms (T1(15)=-1.76, p1=.09; T2(15)=-1.72, p2=.11), marginal from 300ms to 400ms in both analyses (T1(15)=-1.93, p1=.06; T2(15)=-2.03, p2=.06),
marginal from 400ms to 500ms in the subject analysis (T1(15)=-1.90, p1=.07; T2(15)=-
1.67, p2=.12), significant in the subject analysis and marginal in the item analysis from
500ms to 600ms (T1(15)=-2.42, p1<.05; T2(15)=-1.97, p2=.07), marginal from 600ms to
700ms in the subject analysis (T1(15)=-1.83, p1=.08; T2(15)=-1.67, p2=.07), significant
from 1200ms to 1500ms in both analyses (1200ms to 1300ms: T1(15)=-2.82, p1<.01;
T2(15)=-2.29, p2<.05; 1300ms to 1400ms: T1(15)=-4.56, p1<.01; T2(15)=-2.62,
p2<.05; 1400ms to 1500ms: T1(15)=-3.78, p1<.01; T2(15)=-2.36, p2<.05). These
analyses suggest that there are differences in the timing of when native-English speaking
participants and second-language learners compensate for coarticulation based on speech
rate.
4.3 Discussion
This chapter reports an experiment that investigates how L1 Mandarin speakers
who speak English as their L2 compensate for English coronal assimilation. The
experiment aims to address the following question: when adults acquire a new (nonnative) phonological system, can the two systems co-exist without interference, given the differences in the phonological properties of the languages? I chose to look at Mandarin speakers
who speak English as a L2, because whereas Mandarin phonotactics does not allow nasal
coda [m] (except in some contracted forms, see Chapter 2), English coronal place
assimilation could lead to a word that ends in [m].
The results of this experiment indicate that there are differences as well as
similarities in the L1 and L2 speakers’ compensatory behavior on the basis of speech
rate. From the L1 and L2 speakers’ offline click response, there are clear differences in
how L1 and L2 speakers of English interpret the potentially ambiguous target words.
From the click response, the L1 speakers appear to be more likely to accept the
possibility that alien words may be underlyingly labial in the coda position, and less
likely to interpret them as words ending with a coronal being coarticulated with the
following labial. For example, when L1 English speakers hear a word that is ambiguous between Vone and Vome, they are much more likely to choose Vome as their offline response.
On the other hand, the L2 speakers’ offline choices clearly indicate their
preference for the unassimilated form, which is the opposite of the L1 speakers’
preferences. For example, when they hear the ambiguous Vone/Vome, they are far more
likely to choose Vone when they click on the mouse. A likely underlying cause for the L2
speakers’ preference for the coronal pictures is their L1 (Mandarin) phonotactics. As
previously mentioned, while Mandarin words can have coronal consonants in the coda
position, no words can end with a labial consonant. It appears that L2 speakers’ L1
phonotactics may have pushed them away from the possibility that a word may end in the
nasal coda [m].
While it may seem that L1 and L2 speakers’ behaviors are drastically different in
light of their click response, their eye fixation patterns reveal a more complex picture. As
shown in Figure 34, L1 speakers’ eye-movements demonstrated that they do compensate
for contextual coarticulatory information on the basis of speech rate, with a preference for
the unassimilated form in the fast rate condition. This makes sense, because higher rates
of speech are associated with greater extent of coproduction. Therefore, even though the
critical sequences of noun+verb are identical across different speech rates, L1 speakers
looked more at the unassimilated form to compensate for the higher likelihood of
assimilation. The preference, which is statistically significant (in both the subject and
item analyses), begins at 700ms and lasts for 200ms. In contrast, as shown in Figure 35, the L2 speakers' eye-movements again appear to be quite different from those of the L1
speakers. There is an initial preference for the unassimilated form in the slow rate
condition, and then, when the preference for the unassimilated form in the fast rate
condition becomes significant, the timing is different from that of the L1 speakers. However,
despite these differences, it is important not to overlook the significant patterns observed
in both populations: faster speech rates result in more consideration of the unassimilated
form (700ms to 900ms in L1 participants; 1100ms to 1200ms in L2 participants).
The comparison of L1 and L2 speakers’ compensatory behavior shows that the L2
speakers were able to compensate for global contextual cues associated with the extent of
coproduction, although it took them longer to accomplish what the L1 speakers did. In
other words, L2 speakers appear to be able to process reduced speech like the L1 speakers, but are slower in terms of their compensatory response. This is shown in Figures 36 and 37, which plot the L1 and L2 speakers' eye-movement data in each respective rate
condition. In Figure 36, both the L1 and L2 speakers’ fixation patterns started out around
a coronal advantage score of zero, reached a much higher positive value, and then began
dropping over time. However, L1 speakers' coronal-picture advantage score peaked around 800ms whereas the L2 speakers' score peaked around 1200ms. In Figure 37, both L1 and L2
speakers’ fixation patterns were first becoming more negative before moving towards a
more positive direction in terms of the coronal advantage score. However, the L2
speakers’ eye-movements in the slow rate condition also appeared to be ‘a step behind’
that of the L1 speakers.
In summary, the offline mouse clicks indicate that Mandarin speakers’ L1
phonological knowledge, notably Mandarin phonotactics, increases the likelihood that
they disregard the assimilated form in their offline choices. However, their eye-tracking
results reveal a somewhat different picture: the L1 Mandarin, L2 English listeners are
compensating for assimilation similarly to the L1 English speakers, but later. This
suggests that Mandarin speakers who learn English as a L2 are capable of compensating
for coarticulatory information similarly to L1 English speakers, but take longer to do so.
The ANOVAs examining the interaction of Speech Rate and Language did not find an
interaction, which indicates that the magnitude of the compensation behavior does not differ between the two populations. The delay observed in L2 speakers'
compensation pattern may signal that processing in a L2 is more difficult, even for this
group of more advanced English learners. Taken together, these findings highlight the
benefits of using online measures such as eye-movements in both L1 and L2 research. If
we had only used the offline mouse clicks as a measure of the L1 and L2 speakers' processing, we would not have had the opportunity to discover the similarities revealed by the visual-world eye-tracking paradigm.
L2 (English) proficiency
The study also addresses the question of interference in the acquisition of a L2. In
light of findings from Darcy et al. (2007) that L2 learners’ ability to compensate is
correlated with their language proficiency, we included participants who are more
advanced in their L2 proficiency as our group of interest. Their proficiency was
determined on the basis of two objective measures: (i) a cloze test and (ii) the TOEFL
iBT test. For the cloze test, our L2 participants scored an average of 20.63 while a control
group of twelve native speakers of English scored an average of 22.25 (out of 25). Also,
fourteen out of the sixteen participants had previously taken the TOEFL test. Their
average score was 95.93; according to ETS, which administers the test, a score of 96 ranks in the 72nd percentile based on statistics from 2006.
The findings of this experiment show that among these advanced L2 learners,
there are still processing differences with respect to their L2 compensation behavior. The
difference is most clear in the offline mouse click data. When given a choice to carry out
a metalinguistic task of choosing which sound they heard, L2 speakers have a strong
preference for the coronal picture (e.g., Vone) over the labial picture (e.g., Vome). In
addition, as the L2 speakers' eye-movements show, their initial preference on the basis of speech rate and the timing of the significant effects are also different from those of the L1 speakers, reflecting the fact that L2 processing is more difficult and thus delayed.
The results presented in this chapter are also consistent with the Perceptual
Assimilation Model (Best, 1995), derived from the direct realist view of speech
perception (Gibson, 1966, 1979; Fowler, 1986). As part of the gestural, articulation-based
theories of speech perception, it draws on the fact that there is a great amount of overlap
in the gestural constellations used across languages. The model posits that nonnative
gestural constellations are likely to be perceived according to their similarities to the
native constellations that are closest to them in phonological space. Therefore, the
perception of labial [m] may be more challenging because the [m] representation does not exist in the coda position in the speakers' L1. This may explain why L2 speakers
prefer the unassimilated form in their offline click response; an ambiguous [n]/[m] in the
coda position is closest to the [n] in their native phonological space.
Chapter 5
General Discussion
5.1 Summary of the findings
This dissertation set out to address the question of whether we can make use of
the notion of articulatory coproduction in order to take steps towards building a unified
theory for how reduced speech is both produced and perceived. Although there have been
a range of different explanations—from a production perspective—of speech reduction
(e.g., Daniloff & Hammarberg, 1973; Hammarberg, 1976; Lindblom 1983, 1989, 1990), a
large number of experimental findings support the role of articulatory coproduction as
the underlying mechanism (e.g., Bell-Berti and Harris, 1981; Browman & Goldstein,
1986; Fowler, 1977). However, on the perception side, prior research has not specifically
examined the role of articulatory overlap or the nature of the relation between production
and perception. Given the central role of articulatory overlap in speech production, we
are interested in investigating whether listeners may still activate representations
connected to articulatory coproduction in speech perception and word recognition,
especially in the presence of reduction phenomena. If it is the case that articulatory
overlap plays a role in both speech production and perception, we would expect to find
support for coproduction of articulation in different linguistic domains and across
different languages.
To examine whether such a unified account exists, in this dissertation we
investigated the following primary questions: In the production of reduced speech, can
differences in the alternative pronunciations of words be attributed to coproduction of
articulation? (Chapter 2); How does the perception system deal with coproduction
information from preceding words (i.e., global cues from prior context) as well as at the
critical word during real-time word recognition? What is the relationship between the
degree of articulatory overlap (in the speaker’s output) and word recognition (by the
listener)? (Chapter 3); How do nonnative speakers process coarticulation information
associated with coproduction of articulation? (Chapter 4)
Our investigation of Mandarin syllable contraction in Experiment 1 shows that
coproduction of articulation can lead to reduction in the production process (Chapter 2).
The experiment provides initial evidence that the phenomenon of syllable contraction in
Mandarin can be explained by the overlap of articulatory gestures due to a change in
coupling mode.
Turning our attention to the perception side and to a different language, we
examine the process of compensation for English coronal place assimilation in Chapter 3. In Experiment 2, using visual-world eye-tracking, we found that listeners
compensate more for assimilation in a fast-speech context, which is associated with
greater extent of reduction, than in a slow-speech context. In Experiment 3 (combining
data from eye-tracking and electromagnetic articulography), we studied the issue of
gradience in the production and perception of reduced speech, and showed that the degree
of articulatory overlap in a speaker's output is correlated with listeners' word recognition.
After investigating how native English speakers compensate for assimilation on
the basis of speech rate, we then investigated whether nonnative English speakers who
are native Mandarin speakers process coarticulatory information in the same way as the
native speakers (Chapter 4). We found that adult L1 Mandarin speakers who speak English as a L2 can compensate for assimilation like the L1 speakers but do so with
a delay, as revealed by the online eye-movement patterns. With respect to the offline
mouse click response, the results are most consistent with the Perceptual Assimilation
Model (Best, 1995), which posits that nonnative sounds are likely to be perceived as existing sounds in listeners' native phonological space.
The experiments presented in this dissertation investigate different reduction
phenomena in different linguistic domains in English and Mandarin Chinese. While the
notion of articulatory coproduction has been more widely accepted in speech production,
its role in the perception process has been highly controversial. On the one hand, general
auditory and learning approaches to speech perception argue that speech perception is
separate from articulation (e.g., Diehl & Kluender, 1989; Klatt, 1979; Massaro, 1998). On
the other hand, gestural approaches to speech perception argue that perceiving speech is
perceiving gestures (e.g., Liberman & Mattingly, 1985; Fowler, 1986). In this
dissertation, Chapter 2 first presents additional evidence for the theory of coarticulation
as coproduction in the speech production process, and in Chapters 3 and 4, we report
findings from the perception of reduced speech that are consistent with the gestural view
of speech perception (although the experiments were not specifically designed to
differentiate the different approaches to the mechanism of speech perception).
This dissertation has also presented a new methodology for studying the link between production and perception. In Experiment 3, we directly measured the degree of articulatory overlap via electromagnetic articulography (EMA) and correlated it with the
perceptual measure of eye-movements. This is different from most studies that
investigate the underlying mechanism of speech perception because these studies
generally use acoustic measures of speech production and do not use online measures of
speech perception and word recognition. Thus, the combined methodology of EMA and
eye-tracking presents a new tool in studying the role of articulation in perception research
and can potentially help shed new light on the controversy surrounding the role of
articulation.
5.2 Directions for future research
As we have shown in Chapters 3 and 4, listeners can compensate for assimilation on the basis of speech rate. This makes sense, given that varying speech rates are associated with different amounts of articulatory overlap. As speech rate can vary
considerably in a given conversation (Miller et al., 1984) and individual speakers vary in
their speech rate, here we pose two additional questions regarding the perception of
reduced speech:
i. How does reduced speech influence word intelligibility and processing speed?
ii. Are speech rate and the coproduction information associated with it part of the talker-specific knowledge that is used in word recognition?
We briefly review the literature related to these questions and discuss some possible
directions for future research below.
Intelligibility of reduced speech
The results of Experiment 2 show that articulatory overlap information
contributes to resolving lexical ambiguity, thereby potentially improving
comprehension. However, previous research has shown that when speech is reduced (i.e.,
as articulatory overlap is increased), the intelligibility of speech decreases. In studying
the perception of natural and linearly compressed speech, Janse (2004) used a phoneme
monitoring task and found that naturally-produced fast speech slows down processing
speed more than time-compressed speech does. She suggests that this might be due to
important segmental details being diminished in favor of ease of production, particularly
in naturally-produced fast speech.
The fact that most listeners show no apparent difficulty in adjusting to the varying
speech rates of different speakers challenges the notion that more reduced speech is
always less intelligible and requires longer processing time. In addition, because Janse (2004) used a phoneme monitoring task, her results might reflect a level of processing that is different from the task of word recognition. In order to assess whether the amount
of coproduction information may potentially influence the speed of word recognition, we
propose to employ the modified cross-modal priming paradigm used by Gaskell &
Marslen-Wilson (1996) in comparing naturally fast speech to linearly compressed speech.
We predict that when using target words that are assimilated (e.g., lean bacon, with the
first word pronounced as leam), there would be a facilitation effect in word recognition
when the word is embedded in naturally fast speech. Natural speech, understandably,
exhibits different temporal characteristics from computer-compressed speech. We
hypothesize that increased coproduction characteristics of fast speech will help the
listeners recognize an assimilated word faster.
Talker identity and reduced speech
Another line of future research is to investigate whether speech rate and the
articulatory coproduction associated with it are part of the talker-specific knowledge that
listeners use in word recognition. Prior work has shown that listeners can undergo a
process of ‘phonological abstraction’ in the speech recognition system by adjusting the
bounds of phonemic categories (e.g., Cutler et al., 2010; Eisner & McQueen, 2005, 2006;
Norris et al., 2003). Using a two-phase paradigm, Norris et al. (2003) designed an
experiment in which Dutch listeners first went through a training phase before the testing
phase. In the training phase, they listened to words that ended in [f] or [s]. For one group
of listeners, the final [f] was replaced with an ambiguous fricative [?] between [f] and [s]
but the [s]-final words were not altered. For the other group, the [f]-final words were
unchanged but the final [s] was changed to a fricative [?] between [f] and [s]. After
training, listeners were asked to categorize sounds from a [ɛf]-[ɛs] continuum. For those
who listened to natural [s]-final words and ambiguous [f]-final words, Norris et al. (2003)
reported that the listeners were more likely to categorize sounds as [f] and for those who
listened to words of the opposite pattern, they were more likely to categorize the sounds
as [s]. This shows that listeners were able to adjust the phonemic boundary for a deviant
sound after a short exposure in a special setting. In expanding this area of research, other
studies have shown that this type of learning is talker specific (Eisner & McQueen,
2005), is stable over time (Eisner & McQueen, 2006), and can generalize to previously unheard words that also contain the same phoneme (McQueen et al., 2006). Studies on
talker identity also show that preschool-aged children can use talker acoustics
information predictively (Creel, 2010) and listeners may store talker acoustics as a
highly-detailed representation of speech and semantically associate the talker with
information in the outside world (Creel & Tumlin, 2011).
These results bring up a related question: do listeners use the speech rate of an individual and the associated coproduction characteristics of that speaker in their word recognition? To answer this question, we propose to adopt the same two-phase paradigm
as Norris et al. (2003). During the training phase, the listeners will also be divided into
two groups: one group will be exposed to speech produced at a fast rate whereas the other
group will be exposed to speech that is considerably slower. Importantly, the same
individual will produce the different rates of speech so that only speech rate is different
between the two conditions during training. During the testing phase, we will study listeners' word recognition using a visual-world eye-tracking paradigm. Again, we will use target words that are in an assimilation context to see whether the training the listeners receive will lead them to behave differently in their word recognition.
In summary, this dissertation explored the extent to which articulatory
coproduction plays a role in the production and perception of reduced speech and provided
evidence that (i) coproduction of articulation can explain how speech is reduced, (ii) both
native and nonnative speakers of a language can use speech rate (and associated
articulatory coproduction information) to resolve lexical ambiguity, and (iii) a speaker’s
articulatory overlap patterns are associated with listeners’ perceptual measures. Although
the role of articulatory overlap has received relatively little attention in perception
studies, this work demonstrates that representations connected to articulatory coproduction
are activated in listeners’ compensation behavior.
References
ADANK, PATTI; and ESTHER JANSE. 2009. Perceptual learning of time-compressed and
natural fast speech. The Journal of the Acoustical Society of America 126.2649.
ALLOPENNA, PAUL D.; JAMES S. MAGNUSON; and MICHAEL K. TANENHAUS. 1998.
Tracking the time course of spoken word recognition using eye movements:
Evidence for continuous mapping models. Journal of memory and language 38.419–
439.
ALTMANN, G.; and Y. KAMIDE. 1999. Incremental interpretation at verbs: Restricting the
domain of subsequent reference. Cognition 73.247–264.
BAAYEN, R. HARALD; DOUGLAS J. DAVIDSON; and DOUGLAS M. BATES. 2008. Mixed-
effects modeling with crossed random effects for subjects and items. Journal of
memory and language 59.390–412.
BECKMAN, MARY; and ATSUKO SHOJI. 1984. Spectral and perceptual evidence for CV
coarticulation in devoiced /si/ and /syu/ in Japanese. Phonetica 41.61–71.
BELL-BERTI, FREDERICKA; and KATHERINE S. HARRIS. 1981. A temporal model of speech
production. Phonetica 38.9–20.
BELL-BERTI, FREDERICKA; and RENA ARENS KRAKOW. 1991. Anticipatory velar
lowering: A coproduction account. Journal of the Acoustical Society of America
90.112–123.
BENGUEREL, A.-P.; and HELEN A. COWAN. 1974. Coarticulation of upper lip protrusion in
French. Phonetica 30.41–55.
BEST, CATHERINE T. 1995. A direct realist view of cross-language speech perception.
Speech perception and linguistic experience: Issues in cross-language research.
Baltimore: York Press.
BIALYSTOK, ELLEN. 1997. The structure of age: In search of barriers to second language
acquisition. Second language research 13.116–137.
BIRDSONG, DAVID. 1992. Ultimate attainment in second language acquisition.
Language.706–755.
BLADON, R. A. W.; and FRANCIS NOLAN. 1977. A video-fluorographic investigation of tip
and blade alveolars in English. Journal of Phonetics 5.185–193.
BOERSMA, PAUL. 2002. Praat, a system for doing phonetics by computer. Glot
international 5.341–345.
BONGAERTS, THEO. 1999. Ultimate attainment in L2 pronunciation: The case of very
advanced late L2 learners. Second language acquisition and the critical period
hypothesis, ed. by David Birdsong, 133–159. Mahwah, NJ: Lawrence Erlbaum
Associates.
BONGAERTS, THEO; SUSAN MENNEN; and FRANS VAN DER SLIK. 2000. Authenticity of
pronunciation in naturalistic second language acquisition: The case of very advanced
late learners of Dutch as a second language. Studia Linguistica 54.298–308.
BONGAERTS, THEO; CHANTAL VAN SUMMEREN; BRIGITTE PLANKEN; and ERIK SCHILS.
1997. Age and ultimate attainment in the pronunciation of a foreign language.
Studies in second language acquisition 19.447–465.
BROWMAN, C. P; and L. GOLDSTEIN. 1986. Towards an articulatory phonology.
Phonology yearbook 3.219–252.
BROWMAN, C. P; and L. GOLDSTEIN. 1990. Tiers in articulatory phonology, with some
implications for casual speech. Papers in Laboratory Phonology I: Between the
grammar and physics of speech, ed. by J. Kingston and M. E. Beckman, 341–376.
Cambridge, UK: Cambridge University Press.
BROWMAN, C. P; and L. GOLDSTEIN. 1992. Articulatory phonology: An overview.
Phonetica 49.155–180.
BROWMAN, C. P.; and L. GOLDSTEIN. 1989. Articulatory gestures as phonological units.
Phonology 6.201–251.
BROWMAN, C. P.; and L. M. GOLDSTEIN. 1995. Dynamics and articulatory phonology.
Mind as Motion, ed. by T. van Gelder and R. F. Port, 175–193. Cambridge, MA:
MIT Press.
BROWMAN, CATHERINE P.; and LOUIS GOLDSTEIN. 1991. Gestural structures:
Distinctiveness, phonological processes, and historical change. Modularity and the
motor theory of speech perception, ed. by I. G. Mattingly and M. Studdert-Kennedy, 313–338. Hillsdale, NJ: Lawrence Erlbaum Associates.
BYBEE, JOAN L. 2010. Language, usage and cognition. Cambridge: Cambridge University Press.
BYRD, DANI. 1996. Influences on articulatory timing in consonant sequences. Journal of
Phonetics 24.209–244.
BYRD, DANI; SUNGBOK LEE; and REBEKA CAMPOS-ASTORKIZA. 2008. Phrase boundary
effects on the temporal kinematics of sequential tongue tip consonants. The Journal
of the Acoustical Society of America 123.4456.
BYRD, DANI; and CHENG CHENG TAN. 1996. Saying consonant clusters quickly. Journal
of Phonetics 24.263–282.
CHENG, C.; and Y. XU. 2009. Extreme reductions: Contraction of disyllables into
monosyllables in Taiwan Mandarin. Tenth Annual Conference of the International
Speech Communication Association.
CHENG, ROBERT L. 1985. Sub-syllabic morphemes in Taiwanese. Journal of Chinese
linguistics 13.12–43.
CHITORAN, IOANA; LOUIS GOLDSTEIN; and DANI BYRD. 2002. Gestural overlap and
recoverability: Articulatory evidence from Georgian. Laboratory phonology 7.419–
447.
CHOMSKY, NOAM; and MORRIS HALLE. 1968. The sound pattern of English. Cambridge,
MA: MIT Press.
CHUNG, K. S. 2006. Contraction and backgrounding in Taiwan Mandarin. Concentric:
Studies in Linguistics 32.69–88.
CHUNG, R. F. 1997. Syllable contraction in Chinese. Chinese Languages and Linguistics
III: Morphology and Lexicon.199–235.
CLUMECK, HAROLD. 1976. Patterns of soft palate movements in six languages. Journal of
Phonetics 4.337–351.
COENEN, ELSE; PIENIE ZWITSERLOOD; and JENS BÖLTE. 2001. Variation and assimilation
in German: Consequences for lexical access and representation. Language and
Cognitive Processes 16.535–564.
CREEL, S. C; and M. A TUMLIN. 2011. On-line acoustic and semantic interpretation of
talker information. Journal of Memory and Language.264–285.
CREEL, SARAH C. 2010. Considering the source: preschoolers and adults use talker
acoustics predictively and flexibly in on-line sentence processing. Proceedings of the
32nd Annual Conference of the Cognitive Science Society, ed. by S. Ohlsson and R.
Catrambone, 15:1810–1815.
CUTLER, A.; F. EISNER; J. M. MCQUEEN; and D. NORRIS. 2010. How abstract phonemic
categories are necessary for coping with speaker-related variation. Laboratory
phonology 10.91–111.
DANILOFF, RAYMOND; and ROBERT E. HAMMARBERG. 1973. On defining coarticulation.
Journal of Phonetics 1.239–248.
DANILOFF, RAYMOND; and KENNETH MOLL. 1968. Coarticulation of Lip Rounding.
Journal of Speech and Hearing Research 11.707–721.
DARCY, ISABELLE; SHARON PEPERKAMP; and EMMANUEL DUPOUX. 2007. Bilinguals play
by the rules: Perceptual compensation for assimilation in late L2-learners.
Laboratory phonology 9.411–442.
DARCY, ISABELLE; FRANCK RAMUS; ANNE CHRISTOPHE; KATHERINE KINZLER; and
EMMANUEL DUPOUX. 2009. Phonological knowledge in compensation for native and
non-native assimilation. Variation and gradience in phonetics and phonology
14.265.
DE JONG, KENNETH J.; B. J. LIM; and KYOKO NAGAO. 2002. Phase transitions in a
repetitive speech task as gestural recomposition. IULC Working Papers [On-line
serial] 2.
DIEHL, RANDY L.; and KEITH R. KLUENDER. 1989. On the objects of speech perception.
Ecological Psychology 1.121–144.
DUANMU, SAN. 2002. The phonology of standard Chinese. Oxford: Oxford University
Press.
EISNER, FRANK; and JAMES MCQUEEN. 2005. The specificity of perceptual learning in
speech processing. Attention, Perception, & Psychophysics 67.224–238.
EISNER, FRANK; and JAMES M. MCQUEEN. 2006. Perceptual learning in speech: Stability
over time. The Journal of the Acoustical Society of America 119.1950.
ELMAN, JEFFREY L.; and JAMES L. MCCLELLAND. 1988. Cognitive penetration of the
mechanisms of perception: Compensation for coarticulation of lexically restored
phonemes. Journal of Memory and Language 27.143–165.
ERNESTUS, MIRJAM. 2000. Voice assimilation and segment reduction in casual Dutch: A
corpus-based study of the phonology. Utrecht: LOT.
ERNESTUS, MIRJAM; HARALD BAAYEN; and ROB SCHREUDER. 2002. The Recognition of
Reduced Word Forms. Brain and Language 81.162–173.
FADIGA, LUCIANO; LAILA CRAIGHERO; GIOVANNI BUCCINO; and GIACOMO RIZZOLATTI.
2002. Speech listening specifically modulates the excitability of tongue muscles: a
TMS study. European Journal of Neuroscience 15.399–402.
FARNETANI, EDDA. 1986. A pilot study of the articulation of /n/ in Italian using electro-
palatography and airflow measurements. 15e Journées d'Etudes sur la Parole. GALF
23–26.
FARNETANI, EDDA; and DANIEL RECASENS. 1993. Anticipatory consonant-to-vowel
coarticulation in the production of VCV sequences in Italian. Language and Speech
36.279–302.
FARNETANI, EDDA; and DANIEL RECASENS. 2010. Coarticulation and connected speech
processes. Handbook of Phonetic Sciences, ed. by William J. Hardcastle, John
Laver, and Fiona E. Gibbon. 2nd ed. Wiley-Blackwell.
FLEGE, JAMES E. 1992a. Speech learning in a second language. Phonological
development: Models, research, implications, ed. by C. Ferguson, L. Menn, and C.
Stoel-Gammon, 565–604. Timonium, MD: York Press.
FLEGE, JAMES E. 1992b. The intelligibility of English vowels spoken by British and
Dutch talkers. Intelligibility in speech disorders: Theory, measurement, and
management, ed. by R. Kent, 1:157–232. Amsterdam: John Benjamins.
FLEGE, JAMES E. 1995. Second language speech learning: Theory, findings, and
problems. Speech perception and linguistic experience: Issues in cross-language
research, ed. by Winifred Strange, 233–277. Timonium, MD: York Press.
FLEGE, JAMES EMIL; OCKE-SCHWEN BOHN; and SUNYOUNG JANG. 1997. Effects of
experience on non-native speakers’ production and perception of English vowels.
Journal of Phonetics 25.437–470.
FLEGE, JAMES EMIL; MURRAY J. MUNRO; and IAN RA MACKAY. 1995. Effects of age of
second-language learning on the production of English consonants. Speech
Communication 16.1–26.
FLEGE, JAMES EMIL; GRACE H. YENI-KOMSHIAN; and SERENA LIU. 1999. Age constraints
on second-language acquisition. Journal of memory and language 41.78–104.
FOULKE, E. 1971. The perception of time-compressed speech. The perception of
language, ed. by D. L. Horton and J. J. Jenkins. Columbus, OH: Charles E. Merrill
Publishing Company.
FOWLER, CAROL A. 1981. Production and perception of coarticulation among stressed and
unstressed vowels. Journal of Speech, Language and Hearing Research 24.127.
FOWLER, CAROL A. 1986. An event approach to the study of speech perception from a
direct-realist perspective. Journal of Phonetics 14.3–28.
FOWLER, CAROL A.; and ELLIOT SALTZMAN. 1993. Coordination and coarticulation in
speech production. Language and speech 36.171–195.
FOWLER, CAROL ANN. 1977. Timing control in speech production. Bloomington: Indiana
University Linguistics Club.
GALANTUCCI, B.; C. A FOWLER; and M. T TURVEY. 2006. The motor theory of speech
perception reviewed. Psychonomic Bulletin & Review 13.361.
GASKELL, M. G.; and W. D. MARSLEN-WILSON. 1996. Phonological variation and
inference in lexical access. Journal of Experimental Psychology: Human Perception
and Performance 22.144.
GASKELL, M. G.; and W. D. MARSLEN-WILSON. 1998. Mechanisms of phonological
inference in speech perception. Journal of Experimental Psychology: Human
Perception and Performance 24.380.
GASKELL, M.G.; and W.D. MARSLEN-WILSON. 2001. Lexical ambiguity resolution and
spoken word recognition: Bridging the gap. Journal of Memory and Language
44.325–349.
GAY, THOMAS. 1978. Effect of speaking rate on vowel formant movements. The journal
of the Acoustical society of America 63.223–230.
GIBSON, J. J. 1966. The senses considered as perceptual systems. Boston, MA: Houghton
Mifflin.
GIBSON, J. J. 1979. The ecological approach to visual perception. Boston, MA: Houghton
Mifflin.
GIMSON, ALFRED CHARLES. 1970. An introduction to the pronunciation of English. 2nd
ed. London: Edward Arnold.
GOLDSTEIN, L.; D. BYRD; and E. SALTZMAN. 2006. The role of vocal tract gestural action
units in understanding the evolution of phonology. Action to language via the mirror
neuron system, ed. by Michael Arbib, 215–249. Cambridge, UK: Cambridge
University Press.
GOW, D. W.; and BOB MCMURRAY. 2007. Word recognition and phonology: The case of
English coronal place assimilation. Papers in Laboratory Phonology, ed. by J. S.
Cole and J. Hualdo, 9:173–200. New York: Mouton de Gruyter.
GOW, DAVID W. 2001. Assimilation and Anticipation in Continuous Spoken Word
Recognition. Journal of Memory and Language 45.133–159.
GOW, DAVID W. 2002. Does English coronal place assimilation create lexical ambiguity?
Journal of Experimental Psychology: Human Perception and Performance 28.163–
179.
GOW, DAVID W. 2003. Feature parsing: Feature cue mapping in spoken word recognition.
Perception & Psychophysics 65.575–590.
HALLÉ, PIERRE A.; CATHERINE T. BEST; and ANDREA LEVITT. 1999. Phonetic vs.
phonological influences on French listeners’ perception of American English
approximants. Journal of Phonetics 27.281–306.
HAMMARBERG, ROBERT. 1976. The metaphysics of coarticulation. Journal of Phonetics
4.353–363.
HOOLE, P. 1993. Methodological considerations in the use of electromagnetic
articulography in phonetic research. Forschungsberichte-Institut für Phonetik und
Sprachliche Kommunikation der Universität München.43–64.
HSU, HUI-CHUAN. 2003. A sonority model of syllable contraction in Taiwanese Southern
Min. Journal of East Asian Linguistics 12.349–377.
HURA, SUSAN L.; BJORN LINDBLOM; and RANDY L. DIEHL. 1992. On the Role of
Perception in Shaping Phonological Assimilation Rules. Language and Speech
35.59–72.
JAEGER, T. FLORIAN. 2008. Categorical data analysis: Away from ANOVAs
(transformation or not) and towards logit mixed models. Journal of memory and
language 59.434–446.
JANSE, E. 2004. Word perception in fast speech: artificially time-compressed vs. naturally
produced fast speech. Speech Communication 42.155–173.
JANSE, E.; S. NOOTEBOOM; and H. QUENÉ. 2003. Word-level intelligibility of time-
compressed speech: prosodic and segmental factors. Speech Communication 41.287–
301.
JOHNSON, JACQUELINE S.; and ELISSA L. NEWPORT. 1989. Critical period effects in
second language learning: The influence of maturational state on the acquisition of
English as a second language. Cognitive psychology 21.60–99.
JOHNSON, JACQUELINE S.; and ELISSA L. NEWPORT. 1991. Critical period effects on
universal properties of language: The status of subjacency in the acquisition of a
second language. Cognition 39.215–258.
JOHNSON, KEITH. 2004. Massive reduction in conversational American English.
Spontaneous speech: Data and analysis. Proceedings of the 1st session of the 10th
international symposium, ed. by K. Yoneyama and K. Maekawa, 29–54. Tokyo,
Japan: The National International Institute for Japanese Language.
JONES, DANIEL. 1969. An outline of English phonetics. 9th ed. Cambridge: Cambridge
University Press.
KABURAGI, TOKIHIKO; KOHEI WAKAMIYA; and MASAAKI HONDA. 2005. Three-
dimensional electromagnetic articulography: A measurement principle. The Journal
of the Acoustical Society of America 118.428.
KAMIDE, Y.; G. ALTMANN; and S. L. HAYWOOD. 2003. The time-course of prediction in
incremental sentence processing: Evidence from anticipatory eye movements.
Journal of Memory and Language 49.133–156.
KAMIDE, Y.; C. SCHEEPERS; and G. T. M. ALTMANN. 2003. Integration of syntactic and
semantic information in predictive processing: Cross-linguistic evidence from
German and English. Journal of Psycholinguistic Research 32.37–55.
KELSO, J. A. SCOTT. 1984. Phase transitions and critical behavior in human bimanual
coordination. American Journal of Physiology 246.R1000–R1004.
KLATT, D. H. 1979. Speech perception: a model of acoustic-phonetic analysis and lexical access. Journal of Phonetics.279–312.
KOHLER, KLAUS J. 1990. Segmental reduction in connected speech in German:
Phonological facts and phonetic explanations. Speech production and speech
modelling, 69–92. Springer.
KOREMAN, JACQUES. 2006. Perceived speech rate: The effects of articulation rate and
speaking style in spontaneous speech. The Journal of the Acoustical Society of
America 119.582.
KRIVOKAPIĆ, JELENA; and DANI BYRD. 2012. Prosodic boundary strength: An articulatory
and perceptual study. Journal of Phonetics.
KRÖGER, BERND J.; MARIANNE POUPLIER; and MARK K. TIEDE. 2008. An Evaluation of
the Aurora System as a Flesh-Point Tracking Tool for Speech Production Research.
Journal of Speech, Language, and Hearing Research 51.914–21.
KUEHN, DAVID P.; and KENNETH L. MOLL. 1976. A cineradiographic study of VC and
CV articulatory velocities. Journal of Phonetics.303–320.
LADEFOGED, PETER. 1980. What are linguistic sounds made of? Language.485–502.
LEHISTE, I. 1970. Suprasegmentals. Cambridge, MA: MIT Press.
LIBERMAN, A. M; and I. G MATTINGLY. 1985. The motor theory of speech perception
revised. Cognition 21.1–36.
LIBERMAN, A. M.; FRANKLIN S. COOPER; D. SHANKWEILER; and M. STUDDERT-KENNEDY.
1967. Perception of the speech code. Psychological Review.431–461.
LIBERMAN, ALVIN M.; PIERRE C. DELATTRE; FRANKLIN S. COOPER; and LOUIS J.
GERSTMAN. 1954. The role of consonant-vowel transitions in the perception of the
stop and nasal consonants. Psychological Monographs: General and Applied 68.1–
13.
LINDBLOM, B. 1963. Spectrographic Study of Vowel Reduction. The Journal of the
Acoustical Society of America 35.1773–1781.
LINDBLOM, BJÖRN. 1983. Economy of speech gestures. The Production of Speech, ed. by
P. F. MacNeilage, 217–246. New York: Springer.
LINDBLOM, BJÖRN. 1989. Phonetic invariance and the adaptive nature of speech. Working
Models of Human Perception, ed. by B. A. G. Elsendoorn and H. Bouma, 139–73.
London: Academic Press.
LINDBLOM, BJÖRN. 1990. Explaining phonetic variation: A sketch of the H&H theory.
Speech production and speech modelling, ed. by William J. Hardcastle and A.
Marchal, 403–439. Dordrecht: Kluwer Academic Publishers.
LONG, MICHAEL H. 1990. Maturational constraints on language development. Studies in
second language acquisition 12.251–285.
MAGNUSON, JAMES S.; BOB MCMURRAY; MICHAEL K. TANENHAUS; and RICHARD N.
ASLIN. 2003. Lexical effects on compensation for coarticulation: The ghost of
Christmash past. Cognitive Science 27.285–298.
MANN, V. A. 1980. Influence of preceding liquid on stop-consonant perception. Attention,
Perception, & Psychophysics 28.407–412.
MANN, V. A.; and B. H. REPP. 1981. Influence of preceding fricative on stop consonant
perception. The Journal of the Acoustical Society of America 69.548.
MASSARO, DOMINIC W. 1998. Perceiving talking faces: From speech perception to a
behavioral principle. The MIT Press.
MATTHIES, MELANIE; PASCAL PERRIER; JOSEPH S. PERKELL; and MAJID ZANDIPOUR.
2001. Variation in Anticipatory Coarticulation with Changes in Clarity and Rate.
Journal of Speech, Language, and Hearing Research 44.340–353.
MCALLISTER, ROBERT; JAMES E. FLEGE; and THORSTEN PISKE. 2002. The influence of L1
on the acquisition of Swedish quantity by native speakers of Spanish, English and
Estonian. Journal of phonetics 30.229–258.
MCQUEEN, JAMES M.; ALEXANDRA JESSE; and DENNIS NORRIS. 2009. No lexical–
prelexical feedback during speech perception or: Is it time to stop playing those
Christmas tapes? Journal of Memory and Language 61.1–18.
MILLER, JOANNE L.; FRANÇOIS GROSJEAN; and CONCETTA LOMANTO. 1984. Articulation
rate and its variability in spontaneous speech: A reanalysis and some Implications.
Phonetica 41.215–225.
MILLER, JOANNE L.; and ALVIN M. LIBERMAN. 1979. Some effects of later-occurring
information on the perception of stop consonant and semivowel. Perception &
Psychophysics 25.457–465.
MITRA, VIKRAMJIT; HOSUNG NAM; CAROL ESPY-WILSON; ELLIOT SALTZMAN; and LOUIS
GOLDSTEIN. 2012. Recognizing articulatory gestures from speech for robust speech
recognition. The Journal of the Acoustical Society of America 131.2270.
MITTERER, HOLGER; and LEO BLOMERT. 2003. Coping with phonological assimilation in
speech perception: Evidence for early compensation. Perception & Psychophysics
65.956–969.
MOON, SEUNG-JAE; and BJÖRN LINDBLOM. 1994. Interaction between duration, context,
and speaking style in English stressed vowels. The Journal of the Acoustical society
of America 96.40–55.
MUNHALL, KEVIN; and ANDERS LÖFQVIST. 1992. Gestural aggregation in speech-
laryngeal gestures. Journal of Phonetics 20.111–126.
MYERS, JAMES; and YINGSHING LI. 2009. Lexical frequency effects in Taiwan Southern
Min syllable contraction. Journal of Phonetics 37.212–230.
NOLAN, FRANCIS. 1992. The descriptive role of segments: evidence from assimilation.
Papers in laboratory phonology II: Gesture, segment, prosody.261–280.
NORD, LENNART. 1986. Acoustic studies of vowel reduction in Swedish. Speech
Transmission Laboratory Quarterly Progress and Status Report 4.19–36.
NORRIS, D.; J. M. MCQUEEN; and A. CUTLER. 2003. Perceptual learning in speech.
Cognitive Psychology 47.204–238.
OYAMA, SUSAN. 1979. The concept of the sensitive period in developmental studies.
Merrill-Palmer Quarterly of Behavior and Development.83–103.
PALLIER, CHRISTOPHE; LAURA BOSCH; and NÚRIA SEBASTIÁN-GALLÉS. 1997. A limit on
behavioral plasticity in speech perception. Cognition 64.B9–B17.
PARRELL, BENJAMIN. 2011. Dynamical account of how /b, d, g/ differ from /p, t, k/ in Spanish: Evidence from labials. Laboratory Phonology 2.423–449.
PERKELL, JOSEPH S.; MARC H. COHEN; MARIO A. SVIRSKY; MELANIE L. MATTHIES;
IÑAKI GARABIETA; and MICHEL TT JACKSON. 1992. Electromagnetic midsagittal
articulometer systems for transducing speech articulatory movements. The Journal
of the Acoustical Society of America 92.3078.
PETERSON, GORDON E.; and ILSE LEHISTE. 1960. Duration of Syllable Nuclei in English.
The Journal of the Acoustical Society of America 32.693–703.
PORT, ROBERT F. 1981. Linguistic timing factors in combination. The Journal of the
Acoustical Society of America 69.262–274.
PORT, ROBERT F.; and MICHAEL L. O’DELL. 1985. Neutralization of syllable-final voicing
in German. Journal of Phonetics.
POUPLIER, MARIANNE; and LOUIS GOLDSTEIN. 2010. Intention in articulation:
Articulatory timing in alternating consonant sequences and its implications for
models of speech production. Language and Cognitive Processes 25.616–649.
RASTLE, KATHLEEN; JONATHAN HARRINGTON; and MAX COLTHEART. 2002. 358,534
nonwords: The ARC Nonword Database. The Quarterly Journal of Experimental
Psychology: Section A 55.1339–1362.
RECASENS, DANIEL. 1984. Vowel-to-vowel coarticulation in Catalan VCV sequences. The
Journal of the Acoustical Society of America 76.1624–1635.
RECASENS, DANIEL. 1989. Long range coarticulation effects for tongue dorsum contact in
VCVCV sequences. Speech Communication 8.293–307.
REPP, B. H.; and V. A. MANN. 1982. Fricative–stop coarticulation: Acoustic and
perceptual evidence. The Journal of the Acoustical Society of America 71.1562–
1567.
REPP, BRUNO H.; and VIRGINIA A. MANN. 1981. Perceptual assessment of fricative–stop
coarticulation. The Journal of the Acoustical Society of America 69.1154.
SALTZMAN, ELLIOT L.; and KEVIN G. MUNHALL. 1989. A dynamical approach to gestural
patterning in speech production. Ecological psychology 1.333–382.
SAMUEL, ARTHUR G.; and MARK A. PITT. 2003. Lexical Activation (and Other Factors)
Can Mediate Compensation for Coarticulation. Journal of Memory and Language
48.416–434.
SCHÖNLE, PAUL W.; KLAUS GRÄBE; PETER WENIG; JÖRG HÖHNE; JÖRG SCHRADER; and
BASTIAN CONRAD. 1987. Electromagnetic articulography: Use of alternating
magnetic fields for tracking movements of multiple points inside and outside the
vocal tract. Brain and Language 31.26–35.
SCOVEL, THOMAS. 1988. A time to speak: A psycholinguistic inquiry into the critical
period for human speech. Rowley, MA: Newbury House.
STRANGE, WINIFRED; REIKO AKAHANE-YAMADA; RIEKO KUBO; SONJA A. TRENT;
KANAE NISHI; and JAMES J. JENKINS. 1998. Perceptual assimilation of American
English vowels by Japanese listeners. Journal of phonetics 26.311–344.
SURPRENANT, AIMÉE M.; and LOUIS GOLDSTEIN. 1998. The perception of speech
gestures. The Journal of the Acoustical Society of America 104.518–529.
SUSSMAN, HARVEY M.; and JOHN R. WESTBURY. 1981. The effects of antagonistic
gestures on temporal and amplitude parameters of anticipatory labial coarticulation.
Journal of Speech and Hearing Research 24.16–24.
TANENHAUS, MICHAEL K.; MICHAEL J. SPIVEY-KNOWLTON; KATHLEEN M. EBERHARD;
and JULIE C. SEDIVY. 1995. Integration of visual and linguistic information in
spoken language comprehension. Science 268.1632–1634.
TSENG, S. 2005. Contracted syllables in Mandarin: Evidence from spontaneous
conversations. Language and Linguistics 6.153–180.
TSENG, S. C. 2005. Syllable contractions in a Mandarin conversational dialogue corpus.
International Journal of Corpus Linguistics 10.63–83.
TSENG, S. C.; and Y. F. LIU. 2002. Annotation of spontaneous Mandarin. Technical
Report No. 02-01. Chinese Knowledge Processing Group, Academia Sinica (in
Chinese).
TSENG, SHU-CHUAN. 2001. Highlighting utterances in Chinese spoken discourse.
Language, Information and Computation (PACLIC 15).163–174.
VAN SON, ROB J. J. H.; and LOUIS C. W. POLS. 1990. Formant frequencies of Dutch vowels in
a text, read at normal and fast rate. The Journal of the Acoustical Society of America
88.1683–1693.
WANG, L. 1987. Contemporary Chinese Syntax. Lan Deng (in Chinese).
WEBER, ANDREA; and ANNE CUTLER. 2004. Lexical competition in non-native spoken-
word recognition. Journal of Memory and Language 50.1–25.
WEBER-FOX, CHRISTINE M.; and HELEN J. NEVILLE. 1996. Maturational constraints on
functional specializations for language processing: ERP and behavioral evidence in
bilingual speakers. Journal of Cognitive Neuroscience 8.231–256.
WHITE, LYDIA; and FRED GENESEE. 1996. How native is near-native? The issue of
ultimate attainment in adult second language acquisition. Second Language
Research 12.233–265.
WONG, WAI YI PEGGY. 2006. Syllable fusion in Hong Kong Cantonese connected speech.
Doctoral dissertation, The Ohio State University.
WRIGHT, SUSAN; and PAUL KERSWILL. 1989. Electropalatography in the analysis of
connected speech processes. Clinical Linguistics & Phonetics 3.49–57.
ZIERDT, ANDREAS; PHILIP HOOLE; and HANS G. TILLMANN. 1999. Development of a
system for three-dimensional fleshpoint measurement of speech movements.
Proceedings of the 14th International Congress of Phonetic Sciences (ICPhS99),
1:73–75.
Abstract
The pronunciation of a word in continuous speech is often reduced, differing from its pronunciation in isolation. Speech reduction reflects a fundamental property of spoken language: the movements of the articulators can overlap in time, a phenomenon known as coarticulation. A large body of experimental findings supports the role of articulatory coproduction as the underlying mechanism in speech production (e.g., Bell‐Berti and Harris, 1981