PROSODY AND INFORMATIVITY: A CROSS-LINGUISTIC INVESTIGATION
By
Iris Chuoying Ouyang
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(LINGUISTICS)
December 2015
Copyright 2015 Iris Chuoying Ouyang
Dedication
To my dad, Dr. Chiung Ouyang, for his belief in me. Without him I would not have found myself
in academia.
To my mom, Chuntun Yen, and Connie, and Fiona, each of whom has a special place in my heart.
Acknowledgements
I would like to thank Elsi Kaiser for showing me how amazing an advisor can be. I am forever
indebted to the priceless and countless hours she has spent on shaping me into the scholar I am
today. I would also like to thank Dani Byrd, Louis Goldstein, Khalil Iskarous, Andrew Simpson
and Rachel Walker, for the advice, feedback, encouragement and support they have given me
throughout the years. I am also very grateful to Hagit Borer, Elena Guerzoni, Audrey Li, Toby
Mintz, Roumyana Pancheva and Barry Schein, for their invaluable contribution to my
professional training and to my experience at USC. A special thanks goes to Joyce Perez for her
help, support and friendship. The chocolate in her office kept me going through numerous long
days of class, teaching, and running experiments. In addition, I would like to take this
opportunity to thank Hsiu-Fang Yang, who opened my eyes to the beauty of linguistics, and
Mengbing Xiang, for his continued mentorship long after I left Asia.
I gratefully acknowledge the support of a Doctoral Dissertation Research Improvement Grant
from the National Science Foundation (BCS-1451596). I would also like to thank Zesheng Chen,
Emily Fedele, Josephine Lim, Katie Roberts, and Sasha Spala for assistance with the
experiments.
Parts of the research presented in Chapter 5 are published in Language, Cognition and
Neuroscience (Ouyang & Kaiser 2015b) and parts of the research in Chapter 3 are published in a
chapter in Individual Differences in Speech Production and Perception (Ouyang & Kaiser
2015a).
Table of Contents
Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
1.1. Prosodic prominence and information structure
1.2. Prosodic prominence and information-theoretic factors
1.3. Prosodic prominence and perspective-taking
1.4. Overview of this dissertation
Chapter 2: The interplay between information structure and information-theoretic factors in prosody
2.1. Introduction
2.2. Experiment 1: Aims and expected outcome
2.3. Experiment 1: Methods
2.4. Experiment 1: Results
2.5. Discussion
Chapter 3: Individual differences in the prosodic encoding of informativity
3.1. Introduction
3.2. Aims and expected outcome
3.3. Results on individual differences in Experiment 1
3.4. Discussion
Chapter 4: The role of interlocutors in the prosodic encoding of information structure
4.1. Introduction
4.2. Experiments 2 and 3: Aims and expected outcome
4.3. Experiments 2 and 3: Methods
4.4. Experiments 2 and 3: Results
4.5. Discussion
Chapter 5: Discourse-level prosody in a tone language: Prosodic encoding of information structure in Mandarin
5.1. Introduction
5.2. Experiment 4: Aims and expected outcome
5.3. Experiment 4: Methods
5.4. Experiment 4: Results
5.5. Discussion
Chapter 6: General discussion
References
Appendix 1: Target items in Experiment 1
Appendix 2: Target items in Experiments 2 and 3
List of Tables
Table 2.1: Manipulation of word frequency and contextual probability in Experiment 1.
Table 2.2: Lexical frequency of the target words in Experiment 1.
Table 4.1: Conditions in Experiments 2 and 3.
Table 4.2: Manipulation of contextual probability in Experiments 2 and 3.
Table 5.1: Structure of target trials in Experiment 4.
Table 5.2: Target words in Experiment 4.
List of Figures
Figure 2.1: Sample display in the selection task of Experiment 1.
Figures 2.2-2.5: Best-fitted curves with 95% confidence intervals for the f0 values in the pre-focus, focus and post-focus regions of an utterance in Experiment 1.
Figure 3.1: Observed mean f0 for participants 01, 04, 06, 07 and 09 in the narrow new-information focus, high word frequency and high contextual probability condition in Experiment 1.
Figure 3.2: Observed mean f0 of participant 04 in all the conditions of Experiment 1.
Figure 3.3: Best-fitted curves with 95% confidence intervals for the f0 values produced by participants 01, 04 and 06 in Experiment 1.
Figure 3.4: The observed f0 ranges with 95% confidence intervals for individual participants in the sentence region from the pre-focus interval to the post-focus interval in Experiment 1.
Figure 3.5: The observed f0 ranges in the sentence region from the Focus interval to the Post-Focus interval for individual participants in the condition of high word frequency and high contextual probability in Experiment 1.
Figure 3.6: The observed differences in f0 ranges in the sentence region from the Focus interval to the Post-Focus interval for individual participants who conform to more than one group trend in Experiment 1. The differences were calculated based on the group trend in each condition.
Figures 4.1-4.4: F0 ranges of the target words in Speaker A’s responses in each condition of Experiments 2 and 3.
Figure 5.1: Possible ways of expanding ranges.
Figure 5.2: Sample display of Experiment 4.
Figure 5.3: Average duration of the target words in each condition of Experiment 4.
Figure 5.4: Average F0 ranges of the target words in each condition of Experiment 4.
Figure 5.5: Average intensity ranges of the target words in each condition of Experiment 4.
Figure 5.6: Average maximum and minimum F0 of the target words in each condition of Experiment 4.
Figure 5.7: Average maximum and minimum intensity of the target words in each condition of Experiment 4.
Abstract
This dissertation aims to extend our knowledge of prosody – in particular, what kinds of
information may be conveyed through prosody, which prosodic dimensions may be used to
convey them, and how individual speakers differ from one another in how they use prosody.
Four production studies were conducted to examine how various factors interact with one
another in shaping the prosody of an utterance and how prosody fulfills its multi-functional role.
Experiment 1 explores the interaction between two types of informativity, namely information
structure and information-theoretic properties. The results show that the prosodic
consequences of new-information focus are modulated by the focused word’s frequency,
whereas the prosodic consequences of corrective focus are modulated by the focused word’s
probability in the context. Furthermore, f0 ranges appear to be more informative than f0 shapes
in reflecting informativity across speakers. Specifically, speakers seem to have individual
‘preferences’ regarding f0 shapes, the f0 ranges they use for an utterance, and the magnitude of
differences in f0 ranges by which they mark information-structural distinctions. In contrast, there
is more cross-speaker validity in the actual directions of differences in f0 ranges between
information-structural types.
Experiments 2 and 3 further show that the interaction found between corrective focus and
contextual probability depends on the interlocutor’s knowledge state. When the interlocutor
has no access to the crucial information concerning utterances’ contextual probability, speakers
prosodically emphasize contextually improbable corrections, but not contextually probable
corrections. Furthermore, speakers prosodically emphasize the corrections in response to
contextually probable misstatements, but not the corrections in response to contextually
improbable misstatements. In contrast, completely opposite patterns are found when words’
contextual probability is shared knowledge between the speaker and the interlocutor: speakers
prosodically emphasize contextually probable corrections and the corrections in response to
contextually improbable misstatements.
Experiment 4 demonstrates the multi-functionality of prosody by investigating its discourse-level
functions in Mandarin Chinese, a tone language where a word’s prosodic pattern is crucial to its
meaning. The results show that, although prosody serves fundamental, lexical-level functions in
Mandarin Chinese, it nevertheless provides cues to information structure as well. Similar to what
has been found with English, corrective information is prosodically more prominent than non-
corrective information, and new information is prosodically more prominent than given
information.
Taken together, these experiments demonstrate the complex relationship between prosody and
the different types of information it encodes in a given language. To better understand prosody, it
is important to integrate insights from different traditions of research and to investigate across
languages. In addition, the findings of this research suggest that speakers’ assumptions about
what their interlocutors know – as well as speakers’ ability to update these expectations – play a
key role in shaping the prosody of utterances. I hypothesize that prosodic prominence may
reflect the gap between what speakers had expected their interlocutors to say and what their
interlocutors have actually said.
Chapter 1: Introduction
Prosody plays a vital role in spoken communication. The intonation and rhythm of a spoken
utterance often convey information that the words in the utterance do not. People use prosodic
cues to interpret ambiguous syntactic structure (e.g. Kjelgaard & Speer, 1999; Price, Ostendorf,
Shattuck-Hufnagel, & Fong, 1991; Snedeker & Trueswell, 2003), to identify important elements
in discourse (e.g. Bartels & Kingston 1994; Breen, Fedorenko, Wagner, & Gibson, 2010; Cutler,
1976), to predict upcoming messages in a conversation (e.g. Brown, Salverda, Dilley, &
Tanenhaus, 2011; Ito & Speer, 2008; Watson, Tanenhaus, & Gunlogson, 2008), and to infer the
speaker’s attitudes and feelings (e.g. Brennan & Williams, 1995; Johnstone & Scherer, 1999;
Morley, van Santen, Klabbers, & Kain, 2011; Swerts & Krahmer, 2005), among many other
functions.
Although much is known about what prosody can convey, relatively little is known about how prosody
carries out these different functions at the same time. One widely-accepted view is that prosodic
prominence signals the extent to which a linguistic element is ‘informative’. Prior work has
approached the relationship between prosody and informativity from various angles. Two
popular ones are information structure (e.g. Breen, Fedorenko, Wagner, & Gibson, 2010;
Brown, 1983; Cooper, Eady, & Mueller, 1985; Couper-Kuhlen, 1984; Eady & Cooper, 1986;
Hay, Sato, Coren, Moran, & Diehl, 2006; Katz & Selkirk, 2011; Krahmer & Swerts, 2001; Ladd,
1996; Pierrehumbert & Hirschberg, 1990) and information theory (e.g. Aylett & Turk, 2004;
Baker & Bradlow, 2009; Bell, Jurafsky, Fosler-Lussier, Girand, Gregory, & Gildea, 2003; Bell,
Brenier, Gregory, Girand, & Jurafsky, 2009; Calhoun, 2010; Clopper & Pierrehumbert, 2008;
Gahl, 2008; Gregory, Raymond, Bell, Fosler-Lussier, & Jurafsky, 1999; Lieberman, 1963;
Munson & Solomon, 2004; Pan & Hirschberg, 2000; Pitrelli, 2004; Pluymaekers, Ernestus, &
Baayen, 2005a, 2005b; Scarborough, 2010; van Son, Koopmans-van Beinum, & Pols, 1999;
Wright, 2003). I will discuss these two approaches in more depth in Sections 1.1 and 1.2.
It has been found that the acoustic properties of an utterance such as duration, f0, intensity, and
spectral characteristics provide cues for the relative informativity of its components (see Wagner
& Watson, 2010, for a review).
Furthermore, as language communication involves not only the talker/speaker but also the
listener/addressee, the informativity of a word could depend on the addressee’s knowledge about
the word. Because the speaker’s and the addressee’s knowledge states are often different, this
brings up the question of whether speakers’ encoding of information takes into account
addressees’ knowledge states or whether it is egocentric, driven by speaker-internal
considerations. The general question of perspective-taking has received considerable attention
in prior work (audience design or addressee-oriented production vs. speaker egocentrism or
production-internal processing: e.g. Arnold, Kahn, & Pancani, 2012; Bard, Anderson, Sotillo,
Aylett, Doherty-Sneddon, & Newlands, 2000; Galati & Brennan, 2010; Horton & Keysar, 1996;
Kahn & Arnold, 2012; Kaland, Swerts, & Krahmer, 2013; Lockridge & Brennan, 2002;
Roßnagel, 2000, 2004; Rosa & Arnold, 2011). However, open questions remain about the extent
to which speakers’ prosodic encoding of informativity takes into account the knowledge state of
the addressee.
Issues regarding the prosodic encoding of informativity become more complex when we
consider tone languages, where the acoustic dimensions involved in prosody – such as duration,
pitch and intensity – are also used to distinguish lexical items from one another (e.g. Mandarin:
Shih, 1986; Shen, 1990; Whalen & Xu, 1992; Xu, 1997). In other words, we are faced with the
question of whether these acoustic dimensions also provide cues for post-lexical information,
and if so, how they accomplish this dual role. In Mandarin, for example, prior work has led to
divergent results regarding which types of information structure prosodically differ from each
other and which prosodic features distinguish between them (e.g. Chen, 2006; Chen & Braun,
2006; Chen & Gussenhoven, 2008; Chen, Wang, & Xu, 2009; Greif, 2010; Jin, 1996; Xu, 1999).
In addition to the interaction between different types or levels of information, another important
factor that influences an utterance’s prosodic representation is individual differences. Research
has shown that speakers should not be assumed to be homogenous even if they speak the same
language/dialect. Speakers can differ in their acoustic characteristics and the prosodic patterns
they use to signal linguistic categories (e.g. Allen, Miller, & DeSteno, 2003; Dahan & Bernard,
1996; Fougeron, 2004; Fougeron & Kewley-Port, 2007; Loakes & McDougall, 2010; Niebuhr,
D’Imperio, Fivela, & Cangemi, 2011; Smith & Hawkins, 2012; Theodore, Miller, & DeSteno,
2007; Trouvain & Grice, 1999).
In the remainder of this chapter, I will first discuss the previous research on prosody from the
perspectives of information structure (Section 1.1), information theory (Section 1.2), and
perspective-taking (Section 1.3). Then, I will describe the research questions of this dissertation,
which integrates insights from these different lines of research in order to further our
understanding of prosody and informativity (Section 1.4). The previous research on tone
languages and individual differences will not be discussed until the later chapters where it becomes
relevant (Chapters 5 and 3, respectively).
1.1. Prosodic prominence and information structure
In the information-structure-based tradition, acoustic prominence is associated with linguistic
material in the foreground, or in focus – broadly speaking, material that adds new information to
the conversation. Depending on the preceding discourse, speakers may emphasize particular
words in an utterance to direct their addressee’s attention to the important message they are
trying to convey. It has been found that some types of information structure differ acoustically
from each other. For instance, consider the word ‘toys’ in the following contexts:
(1.1) a. What did David find on the stairs?
b. He found toys on the stairs. [‘toys’ = narrow, new-information focus]
(1.2) a. Did David find toys on the stairs?
b. Yes, he found toys on the stairs. [‘toys’ = given information, unfocused]
(1.3) a. What happened?
b. David found toys on the stairs. [‘toys’ = wide, new-information focus]
In response to (1.1a), toys in (1.1b) is in new-information focus, as it conveys information that
has not been mentioned and cannot be inferred from the preceding discourse. In contrast, the
same word toys in (1.2b) responding to (1.2a) is unfocused, given information, because what it
conveys has been expressed in the preceding discourse (e.g. Prince, 1992; Rooth, 1992).
Furthermore, toys in (1.1b) is narrowly focused new information, since it is the only component
of the utterance that introduces new information to the conversation. However, the same word
toys in (1.3b) in response to (1.3a) is new information in wide focus, because the entire utterance
with multiple constituents including the word toys is in new-information focus (e.g.
Gussenhoven, 1983; Selkirk, 1984). It has been shown that new elements are acoustically more
prominent than given elements (e.g. Brown, 1983; Eady & Cooper, 1986; Hay, Sato, Coren,
Moran, & Diehl, 2006; Krahmer & Swerts, 2001; Ladd, 1996), and that material in narrow new-
information focus is acoustically more prominent than the same material in wide new-
information focus (e.g. Breen, Fedorenko, Wagner, & Gibson, 2010; Eady & Cooper, 1986).
Another kind of information structure that has been extensively studied is contrastive focus, of
which various subtypes have been identified (e.g. Vallduví & Vilkuna, 1998). For example, two
common types of contrastive focus, both involving explicit alternatives in the preceding
discourse, are shown in (1.4-1.5). Toys in (1.4b) – responding to (1.4a) – picks ‘toys’ from the
set consisting of ‘books’ and ‘toys’ that has been established via (1.4a), and toys in (1.5b) –
responding to (1.5a) – is intended to contradict the information socks conveyed in (1.5a). In this
dissertation, I concentrate on the latter type of contrastive focus, which has been referred to as
corrective focus (e.g. Dik, 1997). I chose this subtype because its information-structural
properties are well-understood and it is prevalent in communication. Contrastive/corrective
elements have been shown to receive greater acoustic prominence than non-contrastive/non-
corrective elements, whether they are given or new material in the discourse (e.g. Breen,
Fedorenko, Wagner, & Gibson, 2010; Cooper, Eady, & Mueller, 1985; Couper-Kuhlen, 1984;
Katz & Selkirk, 2011; Krahmer & Swerts, 2001).
(1.4) a. Did David find books or toys on the stairs?
b. He found toys on the stairs. [‘toys’ = narrow, contrastive focus]
(1.5) a. Did David find socks on the stairs?
b. No, he found toys on the stairs. [‘toys’ = narrow, contrastive/corrective focus]
Various acoustic properties have been found to reflect information-structural focus, including
types or presence of accents on and after the focused element (e.g. Krahmer & Swerts, 2001;
Ladd, 1996; Pierrehumbert & Hirschberg, 1990), expanded vowel space and increased formant
movement in the focused element (e.g. Hay, Sato, Coren, Moran, & Diehl, 2006), increased
duration, f0 (see footnote 1), intensity, and more f0 protrusion during the focused element, more f0 compression
following the focused element (e.g. Breen, Fedorenko, Wagner, & Gibson, 2010; Brown, 1983;
Couper-Kuhlen, 1984; Katz & Selkirk, 2011), decreased duration, f0 and intensity preceding the
focused element (e.g. Eady & Cooper, 1986), and a sudden drop or sharper fall within or
following the focused element (e.g. Cooper, Eady, & Mueller, 1985; Couper-Kuhlen, 1984; Eady
& Cooper, 1986).
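Several of the f0-based cues listed above concern how far f0 moves within a word. As a rough illustration only, the Python sketch below shows one possible way to summarize such movement as an f0 range. The semitone scale, the treatment of zero or missing values as unvoiced frames, the function name f0_range_semitones, and the sample values are all assumptions made for this sketch; none of them is taken from the studies cited here.

```python
from math import log2


def f0_range_semitones(f0_hz):
    """Return the span between the highest and lowest voiced f0 sample, in semitones.

    Samples that are None or not positive are treated as unvoiced and skipped
    (an assumption about how the pitch tracker marks unvoiced frames).
    """
    voiced = [f for f in f0_hz if f is not None and f > 0]
    if len(voiced) < 2:
        raise ValueError("need at least two voiced f0 samples")
    # 12 semitones per octave; one octave is a doubling of frequency.
    return 12 * log2(max(voiced) / min(voiced))


# Hypothetical f0 tracks (Hz) for the same word in two discourse contexts.
given_word = [180, 185, 182, 178, 0, 176]
corrective_word = [185, 210, 240, 215, 0, 170]

print(f0_range_semitones(given_word))
print(f0_range_semitones(corrective_word))
```

With these made-up values, the corrective token has the wider range, which is the direction most of the work cited above would lead one to expect.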
1.2. Prosodic prominence and information-theoretic factors
In addition to work from the information-structural perspective, there is also research in the
information-theoretic tradition, where a correlation has been found between acoustic reduction
and the redundancy, or predictability, of a linguistic element. Depending on what occurs more
(or less) often in the language or the given linguistic environment, certain elements may be
pronounced with more or less acoustic prominence. A wide variety of probabilistic
measurements have been used to represent the predictability of a segment, phoneme, or word.
Commonly-considered measures include context-independent properties such as frequency and
neighborhood density (e.g. Gahl, 2008; Munson & Solomon, 2004; Pitrelli, 2004; Scarborough,
2010; Wright, 2003) and context-dependent properties such as joint probability, conditional
probability, mutual information, and semantic predictability (e.g. Bell, Jurafsky, Fosler-Lussier,
Girand, Gregory, & Gildea, 2003; Clopper & Pierrehumbert, 2008; Lieberman, 1963; Pan &
Hirschberg, 2000; Scarborough, 2010; van Son, Koopmans-van Beinum, & Pols, 1999). For
instance, the word ‘time’ has a higher lexical frequency than the word ‘thyme’, because ‘time’ is
used more often than ‘thyme’ (e.g. Baayen, Piepenbrock, & van Rijn, 1993; Gahl, 2008).
Moreover, the word ‘nine’ in a familiar proverb such as “a stitch in time saves nine” is highly
predictable from the sentence context (a stitch in time saves…), while the same word ‘nine’ in a
relatively neutral context such as “the number that you will hear is nine” is far less predictable
(e.g. Lieberman, 1963).
Footnote 1: In Breen et al. (2010), focus breadth (narrow vs. wide) and contrastiveness (corrective vs. non-corrective) have
opposite effects on f0. Narrow focus is marked with higher mean and maximum f0 than wide focus, while
a correctively focused word is produced with lower mean and maximum f0 than a non-correctively focused word. This
finding about contrastiveness diverges from other previous research.
It has been shown that elements with higher across-the-board frequency in the language are
acoustically more reduced than elements that occur less frequently. Likewise, elements that are
more likely to appear in a particular environment (based on adjacent items or semantic context)
are more acoustically reduced than elements that are less likely to appear in the environment.
Research has found that information-theoretic predictability is realized with decreased duration
and amplitude (e.g. Bell, Jurafsky, Fosler-Lussier, Girand, Gregory, & Gildea, 2003; Gahl, 2008;
Lieberman, 1963), lower likelihood of accentuation (e.g. Pan & Hirschberg, 2000; Pitrelli, 2004),
lower center of gravity of the power spectrum (CoG), less extreme distance between the first and
second formants (e.g. van Son, Koopmans-van Beinum, & Pols, 1999), shorter vowels, and less
dispersed vowel space (e.g. Clopper & Pierrehumbert, 2008; Munson & Solomon, 2004;
Scarborough, 2010; Wright, 2003).
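To make the distinction between context-independent and context-dependent measures concrete, the following Python sketch estimates a word's relative frequency, its conditional probability given the preceding word, the corresponding surprisal, and a pointwise version of mutual information. It is a minimal illustration under stated assumptions: the six-sentence toy corpus, the choice of the immediately preceding word as the context, and all function names are hypothetical, and the studies cited above use much larger corpora and a range of other estimators.

```python
from collections import Counter
from math import log2

# Toy corpus (hypothetical; chosen only to echo the examples in the text).
corpus = [
    "a stitch in time saves nine",
    "a stitch in time saves nine",
    "the number that you will hear is nine",
    "the number that you will hear is four",
    "time saves money",
    "she bought thyme",
]

sentences = [line.split() for line in corpus]
unigrams = Counter(word for sent in sentences for word in sent)
bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
total_tokens = sum(unigrams.values())
total_bigrams = sum(bigrams.values())


def relative_frequency(word):
    """Context-independent measure: share of all tokens that are `word`."""
    return unigrams[word] / total_tokens


def conditional_probability(word, previous):
    """Context-dependent measure: estimate of P(word | previous word)."""
    context_count = sum(n for (first, _), n in bigrams.items() if first == previous)
    if context_count == 0:
        return 0.0
    return bigrams[(previous, word)] / context_count


def surprisal(word, previous):
    """Information in bits; raises ValueError if the bigram was never observed."""
    return -log2(conditional_probability(word, previous))


def pointwise_mutual_information(word, previous):
    """log2 of P(previous, word) / (P(previous) * P(word)); unseen bigrams raise ValueError."""
    joint = bigrams[(previous, word)] / total_bigrams
    return log2(joint / (relative_frequency(previous) * relative_frequency(word)))


print(relative_frequency("time"), relative_frequency("thyme"))
print(conditional_probability("nine", "saves"), conditional_probability("nine", "is"))
print(surprisal("nine", "saves"), surprisal("nine", "is"))
print(pointwise_mutual_information("nine", "saves"))
```

In this toy corpus, ‘time’ is more frequent than ‘thyme’, and ‘nine’ is more probable (and so carries less surprisal) after ‘saves’ than after ‘is’, mirroring the proverb example above.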
1.3. Prosodic prominence and perspective-taking
So far, we have discussed two existing approaches, information structure and information theory,
which address the relationship between prosody and informativity. Nevertheless, prior work
suggests that a third angle also needs to be considered when investigating the prosodic encoding
of informativity: the role of the addressee, in particular what the addressee knows or doesn’t
know. It has been shown that a speaker’s utterances are affected by his/her addressee’s
knowledge state. For example, people speak differently when their intended addressees have
hearing loss or speech impairments (e.g. Baker & Bradlow, 2009; Fougeron, 2004; Fougeron &
Kewley-Port, 2007; Hummert & Shaner, 1994), are distracted (e.g. Bavelas, Coates, & Johnson,
2000; Pasupathi, Stallworth, & Murdoch, 1998; Rosa & Arnold, 2011; Rosa, Finch, Bergeson,
& Arnold, 2015) or ill-informed (e.g. Galati & Brennan, 2010; Kaland, Swerts, & Krahmer, 2013;
Lockridge & Brennan, 2002; Yoon & Brown-Schmidt, 2014). People also speak differently
when they are leaving messages for themselves (e.g. Fussell & Krauss, 1989a, 1989b), talking to
computers (e.g. Brennan, 1991) or well-informed addressees (e.g. Arnold, Kahn, & Pancani,
2012; Isaacs & Clark, 1987). This ability to take other people’s perspectives into account has
been referred to as perspective-taking (see Brown-Schmidt & Hanna, 2011, for a review).
In terms of prosody, greater acoustic prominence such as lengthening and expanded vowel space
has been found where the intended addressees have hearing loss or are distracted (e.g. Baker &
Bradlow, 2009; Fougeron, 2004; Fougeron & Kewley-Port, 2007; Rosa, Finch, Bergeson, &
Arnold, 2015). Furthermore, when there are two addressees that are informed to different extents,
speakers are able to track both addressees’ knowledge states and prosodically mark informativity
from the perspective of the addressee they are talking to at the given point of time. For example,
Galati and Brennan (2010) show that, when speakers tell a story twice to one addressee and once
to another addressee using the same sentences, their utterances are perceptually less intelligible
when they are talking to the addressee who has already heard the story once before. Similarly,
Kaland, Swerts and Krahmer (2013) show that, in a task where Dutch speakers give different
instructions to two addressees about where to place objects (e.g. in Dutch, instructing one
addressee to move ‘the blue triangle’, and then either asking the same addressee or the different
one to move ‘the red triangle’), contrastive information (‘red’) is pronounced with higher
perceptual prominence when it is also contrastive to the addressee (i.e. when the addressee is the
one who has moved a triangle before) than when it is not.
However, perspective-taking has been argued to be not automatic but a costly process that
speakers don’t carry out without the support of favorable conditions. Under time pressure or
under high cognitive load, speakers are more likely to ignore the knowledge gaps between
themselves and their addressees, i.e. to rely on their own privileged knowledge as much as the
common ground shared with their addressees (e.g. Horton & Keysar, 1996; Horton & Gerrig,
2005; Roßnagel, 2000, 2004). Insufficient and/or unnatural feedback from the addressee also
appears to deflect the speaker’s attention from the addressee’s knowledge state (e.g. Lockridge &
Brennan, 2002, compared with Brown & Dell, 1987; Dell & Brown, 1991). When favorable
conditions are absent, although speakers nevertheless alter their utterances based on
informativity, they do so regardless of their addressees’ knowledge states. For instance, Kraljic
and Brennan (2005) found that speakers produce cues that resolve syntactic ambiguities even
when the situation being described is not ambiguous and there is no addressee. Brown and Dell
(1987) and Dell and Brown (1991) found that, compared to typical information, speakers
explicitly mention atypical information more often, but this holds whether their addressees are
well-informed or clueless about the typicality. With regard to prosody, Bard and colleagues
(Bard, Anderson, Sotillo, Aylett, Doherty-Sneddon & Newlands, 2000; Bard & Aylett, 2004)
found that people speak in a less intelligible manner when repeating information, whether or not
their addressees have heard the original mention or see the mentioned item. According to these
results, language production might be fundamentally egocentric (see Keysar, 2007, for a review).
Furthermore, evidence suggests that even when speakers take their addressees’ knowledge states
into consideration, they do not necessarily tailor their utterances to their addressees’ needs. For
example, Rosa and Arnold (2011) show that speakers use less specific (more reduced) referring
expressions when their addressees are distracted, which is in the opposite direction of what
would be expected if speakers attempted to aid addressees’ comprehension. With regard to
prosody, Arnold, Kahn and Pancani (2012) found that, when speakers are giving instructions
about where to place objects (e.g. the teapot goes on red), they begin speaking more quickly and
pronounce the word ‘the’ with shorter duration if their addressees have anticipated hearing the
object and pick it up before the instruction. However, there are no prosodic consequences on the
head noun of the object (teapot), which largely rules out accounts that attribute the observed
reduction to addressees’ not needing to hear clearly. Thus, these effects of perspective-taking
might result from speaker-internal (rather than addressee-oriented) mechanisms of language
processing (e.g. processing fluency: Kahn & Arnold, 2015).
In sum, previous research shows that speakers may change the way they speak based on their
addressees’ knowledge states, although a consensus has not been reached on the cognitive
processes and mechanisms behind this behavior.
1.4. Overview of this dissertation
This dissertation investigates how different types of information are integrated and conveyed
through prosody. As discussed above, prosody has been shown to provide cues for a variety of
information, including the predictability of words, the information structure of discourse, the
addressee’s knowledge state, as well as lexical categories in tone languages. Prosody can also
show considerable speaker-specific variation. However, many questions remain open regarding
how these different factors interact in shaping the prosody of an utterance. In particular, this
research looks into four questions. The first question concerns the interaction between
information structure and words’ predictability (Chapter 2). The second issue I look at is how
prosody exhibits inter-speaker variability while encoding the interaction between information
structure and words’ predictability (Chapter 3). The third question has to do with the interaction
between information structure and speakers’ expectations about their addressees (Chapter 4).
Finally, the fourth question I look at is how prosody encodes information structure while also
providing lexical cues in a tone language (Chapter 5).
More specifically, Chapters 2 and 3 explore the interaction between word frequency, contextual
probability, and the type of focus. The results show that the prosodic consequences of new-
information focus are modulated by the focused word’s frequency, whereas the prosodic
consequences of corrective focus are modulated by the focused word’s probability in the context.
Furthermore, f0 ranges appear to be more informative than f0 shapes in reflecting informativity
across speakers. Speakers seem to have individual ‘preferences’ regarding f0 shapes, the f0
ranges they use for an utterance, and the magnitude of differences in f0 ranges by which they
mark information-structural distinctions. In contrast, there is more cross-speaker validity in the
actual directions of differences in f0 ranges between information-structural types.
Chapter 4 investigates the motivation behind the patterns observed in Chapter 2. The results
show that, in conversations between two people, the speaker’s prosodic realization of corrective
focus and contextual probability depends on the conversational partner’s knowledge state. When
their conversational partners have no access to the crucial information concerning utterances’
contextual probability, speakers prosodically emphasize contextually improbable corrections, but
not contextually probable corrections. Furthermore, speakers prosodically emphasize the
corrections in response to contextually probable misstatements, but not the corrections in
response to contextually improbable misstatements. In contrast, completely opposite patterns are
found when words’ contextual probability is shared knowledge between the speaker and the
conversational partner: speakers prosodically emphasize contextually probable corrections and
the corrections in response to contextually improbable misstatements. These findings suggest
that speakers’ assumptions about what their conversational partners know – as well as speakers’
ability to update these expectations – play a key role in shaping the prosody of utterances.
Prosodic prominence might reflect the gap between what speakers had expected their
conversational partners to say and what their conversational partners have actually said.
Chapter 5 concentrates on the dual role of prosody in a tone language, where prosody
presumably serves both lexical and post-lexical functions. The tone language investigated in this
dissertation is Mandarin Chinese, in which a word’s meaning depends on its prosodic patterns
such as pitch, intensity and duration. The results show that the acoustic dimensions involved in
lexical tones nevertheless provide cues to information structure as well. Similar to what has been
found with English, corrective information is prosodically more prominent than non-corrective
information, and new information is prosodically more prominent than given information.
Parts of this dissertation have been published in conference proceedings, book chapters or
journal articles. Earlier versions of the work in Chapters 2 and 3 (Experiment 1) have appeared
in Ouyang and Kaiser (2014a, 2014b, 2015a). Earlier versions of the work in Chapter 5
(Experiment 4) have appeared in Ouyang and Kaiser (2011, 2012, 2015b).
Chapter 2: The Interplay between Information Structure and Information-Theoretic
Factors in Prosody
2.1. Introduction
In the preceding chapter, I briefly reviewed three traditions of research that have addressed the
relationship between informativity and prosody: work on information structure, on information
theory, and on perspective-taking. In the current chapter, I will first focus on the interplay
between information-structural and information-theoretic factors, which will yield insights into
how perspective-taking might play a role in this interplay. The experiment presented in this
chapter shows that prosody is influenced by the complex interplay between word frequency,
contextual probability, and the type of discourse focus.
While the information-structural and the information-theoretic traditions focus on different
factors of informativity from distinct perspectives, they have found similar prosodic patterns that
signal the relative degree of informativity between linguistic elements. In particular, prior work
shows that a higher degree of informativity in general results in more exhaustive use of a
prosodic space, whichever acoustic dimension it is that a particular study examines. This leads us
to the question of how information structure and information-theoretic properties interact in
influencing prosody: do they combine with each other in an additive way, or do they interact in a
non-additive way? For example, being in discourse focus and being less predictable have both
been shown to increase a word’s prosodic prominence. However, consider a correctively-focused
word that is highly improbable in its sentence context: the word ‘seals’ in the context “Did David
find socks on the stairs? No, he found SEALS on the stairs.” Will this word be pronounced with
extremely high prosodic prominence, i.e., the effects of corrective focus and contextual
probability are additive? Or, will this word be prosodically less prominent than a corrective word
normally is, because low contextual probability might somehow weaken the
correctiveness/correctness? So far, these issues are not yet well understood.
There are a number of studies that mostly adopt an information-theoretic approach but also
include the repeated use of words as a redundancy factor. Since repeated words are by definition
given, or at least not entirely new, information, the information-theoretic notion of repetition can
be regarded as givenness in an information-structural view (e.g. Fowler & Housum, 1987).
Generally speaking, prior work has identified effects of word repetition, over and above (other)
information-theoretic factors, on different kinds of linguistic units. For example, Aylett and Turk
(2004) measure how many times a referent has been previously mentioned, and show that
syllable duration decreases as the order of mention increases, in addition to the effects of word
frequency and syllable conditional trigram probability. In related work on suffixed words in
Dutch, Pluymaekers, Ernestus, and Baayen (2005b) measure how many times a word has been
uttered, and show that repetition significantly reduces the duration of suffixes and marginally
reduces the duration of stems and entire words, in addition to the effects of mutual information
with the adjacent words. Additional work on repetition was done by Bell, Brenier, Gregory,
Girand and Jurafsky (2009) who found that English content words are shorter when repeated,
more frequent, or more predictable from the following word, while function words are not so
affected by repetition and word frequency, but are affected by the predictability from the
following word. The predictability from the preceding word only shortens very frequent function
words. Lastly, Gregory, Raymond, Bell, Fosler-Lussier and Jurafsky (1999) find that word
duration decreases as the following redundancy factors increase: word frequency, mutual
information, conditional bigram probability, semantic relatedness, and repetition. In sum, word
repetition has been shown to cause shortening at the syllable, morpheme and word levels, even
when we take into account word frequency and other statistical-probabilistic factors based on
adjacent items or semantic context.
To the best of my knowledge, there is only one existing study that addresses the interaction
between word repetition and (other) information-theoretic factors. In a production experiment
where participants read a number of paragraphs twice, Baker and Bradlow (2009) found that
word frequency influences the amount of reduction a word undergoes when it is mentioned for
the second time. Higher-frequency words exhibit more shortening upon second mention than
lower-frequency words, when word length is controlled. Furthermore, this interaction is only
found in so-called ‘plain’ speech, i.e. when participants are instructed to speak as if they are
talking to someone familiar with their voice and speech patterns. It does not occur in so-called
‘clear’ speech, i.e. when participants are instructed to speak as if they are talking to a listener
with a hearing loss or to a non-native speaker learning their language. From the perspective of
information structure, this finding can be restated as: The duration cue for new information (i.e.
first mention) is weaker in lower-frequency words, and weaker in clear speech compared to plain
speech. Thus, there seems to be a saturation effect such that the prosodic cues for information
structure are weakened when information-theoretic factors also demand prosodic prominence.
However, it remains unclear whether other kinds of information-theoretic factors, such as
contextual probability, have a similar impact and whether other kinds of information structure,
such as corrective focus, are affected in a similar way. Related work by Calhoun (2010) shows
that whether a word carries a nuclear accent, non-nuclear accent, or no accent can be predicted
using models including word frequency, bigram probability, the presence/absence of focus, as
well as other factors. Nevertheless, no interaction between these factors is mentioned. In sum, it
is not yet well understood how information-theoretic properties and information structure
interact to influence prosody.
2.2. Experiment 1: Aims and expected outcome
The previous research discussed in Sections 1.1-1.2 shows that an utterance’s prosodic
representation depends on how informative each of its constituents is. Information-structural
status, such as being in narrow focus, and information-theoretic properties, such as lexical
frequency and contextual probability, both play a role in prosody. It is striking that little attention
has been paid to the potential interaction between information structure and information-
theoretic factors, given the considerable efforts that have been devoted to both kinds of factors
separately. To shed light on this issue, I conducted a psycholinguistic production study to
investigate whether information structure and information-theoretic factors interact in
determining a word’s prosodic prominence, and if so, whether different information-structural
types interact with different information-theoretic factors in similar ways. For instance, could it
be that the prosodic cues for new-information vs. corrective focus differ in terms of whether they
are sensitive to word frequency vs. contextual probability?
Since prior work has found that the effect of givenness on duration is stronger when the repeated
words are high-frequency (Baker & Bradlow, 2009), I hypothesized that the prosodic
consequences of information structure would be stronger in words with low informativity in the
information-theoretic dimensions. In other words, the prosodic cues for information structure
might be weakened when other factors – such as information-theoretic properties – also
demand prosodic prominence. Building on Baker and Bradlow (2009), this study explored
effects of word frequency and narrow new-information focus.
I also looked at the effects of another information-theoretic factor, namely contextual probability,
as well as another type of information structure, namely narrow corrective focus. Including
multiple factors of each kind of informativity allowed me to investigate the potentially complex
interplay between them. Specifically, I expected that narrow focus would be prosodically distinct
from wide focus when the target word is highly frequent and/or highly contextually probable (i.e.
has low information-theoretic informativity). In contrast, when the target word is low-frequency
and/or low-probability (i.e. has high information-theoretic informativity), I predicted that the
prosodic distinctions between narrow and wide focus might be weakened or even absent:
prosodic reflexes of information structure might be observed in only one or perhaps in neither of
the two narrow-focus conditions. If these predictions are borne out, we can then look into
whether different information-structural types (i.e. corrective vs. new) could react differently to
different information-theoretic factors (i.e. lexical frequency vs. contextual probability).
In terms of the acoustic correlates of prosodic prominence, I focused on (i) the shape of an f0
contour and (ii) the size of excursions in an f0 contour (which will be called ‘f0 range’
henceforth). I chose f0 because it is an acoustic dimension that has been extensively studied in
the information-structural tradition yet not much so in the information-theoretic tradition. In
other words, by conducting this study, I also hoped to provide further evidence for the effects of
information-theoretic factors on f0. Furthermore, because there are studies showing that
intonational categories (e.g. H*, L+H*) do not necessarily map straightforwardly onto focus
types (e.g. Katz & Selkirk, 2011; Krahmer & Swerts, 2001; Watson, Tanenhaus, & Gunlogson,
2008), I did not take an intonational-phonological approach (e.g. Ladd, 1996; Pierrehumbert &
Hirschberg, 1990). Based on previous research, a good indicator of narrow focus seems to be a
relatively exhaustive use of the acoustic space. In the f0 dimension, as mentioned in Section 1.1,
it has been found that narrow focus differs from wide focus in having greater f0 protrusion or
higher f0 on the narrowly focused element, greater f0 compression or sharper f0 fall following
the focused element, and lower f0 preceding the focused element (e.g. Breen, Fedorenko,
Wagner, & Gibson, 2010; Brown, 1983; Cooper, Eady, & Mueller, 1985; Couper-Kuhlen, 1984;
Eady & Cooper, 1986; Katz & Selkirk, 2011). Therefore, I quantitatively measured both f0
shapes and f0 ranges, which presumably would capture the level of prominence in the f0
dimension.
In this study, the object of a sentence is the narrowly focused word in the discourse. Therefore, I
expected narrow focus to influence prosody in the sentence region containing the object and the
words immediately preceding and following it. Specifically, I predicted that the f0 movement of
this sentence region would be bigger, or at least not smaller, in the narrow-focus conditions than
in the wide-focus condition.
2.3. Experiment 1: Methods
I conducted a production study with an interactive set-up. Each trial consisted of a production
task and a subsequent selection task. In both tasks, participants interacted with a partner, who
was a lab assistant. The production task provided the critical recordings: the target sentences
were produced by the participants during the production task. The selection task was included to
engage both people in the production task: paying attention to what the other person said in the
production task was necessary to successfully perform the selection task. In the following
subsections, I will first discuss the design and procedure of the production task and then move on
to the second phase, the selection task.
2.3.1. Step 1: Production task
A trial began with a production task, where participants worked with a partner in reading aloud
sentence pairs. Each sentence pair consisted of a question spoken by the partner (Sentence A)
and a response (the critical sentence) spoken by the participant (Sentence B), as shown in (2.1-
2.3) below (example from the high word frequency/high contextual probability condition).
Participants saw Sentence B on a computer screen when it was their turn to speak. The target
sentences (Sentence B on target trials) are transitive clauses with the following structure: a third-
person plural pronoun subject, a simple past tense verb, an object, and a prepositional phrase
indicating a location. The critical word of interest is the object of each target sentence (e.g. balls).
The experiment had 48 target items, 4 items in each condition. Every participant encountered
every item but did not see any item more than once. A full list of the target items can be found in
Appendix 1. There were also 48 filler items. The dependent variable I measured was the f0
values of an utterance.
(2.1) Narrow Corrective Focus, High Word Frequency, and High Contextual Probability
A (lab assistant): I heard that Dawn and Alice got gloves at the sports store.
B (participant): No, they got [balls]CORRECTIVE FOCUS at the sports store.
(2.2) Narrow New-Information Focus, High Word Frequency, and High Contextual Probability
A (lab assistant): What did Rachel and Carolyn get at the sports store?
B (participant): They got [balls]NEW-INFO FOCUS at the sports store.
(2.3) Wide/VP Focus, High Word Frequency, and High Contextual Probability
A (lab assistant): What did Angela and Joyce do?
B (participant): They [got balls at the sports store]NEW-INFO FOCUS.
To investigate how information-theoretic factors and information structure interact in shaping the
prosody of an utterance, I manipulated (i) the lexical frequency of the object noun, (ii) whether
the object was probable in the context of the preceding verb and the following location, and (iii)
the object’s information-structural status in relation to the question. Thus, a within-subject
design with three independent variables was implemented: (i) word frequency (with two levels:
high or low frequency), (ii) contextual probability (with two levels: high or low probability), and
(iii) focus type (with three levels: narrow corrective focus, narrow new-information focus, or
wide/VP focus).
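The factorial structure described above can be sketched as follows. This is only an illustration of how the three variables cross, not the script used to build the experiment; the variable names are hypothetical, while the levels and the figure of 4 items per condition come from the text.

```python
from itertools import product

# Levels of the three within-subject variables, as described in the text.
WORD_FREQUENCY = ["high", "low"]
CONTEXTUAL_PROBABILITY = ["high", "low"]
FOCUS_TYPE = ["narrow corrective", "narrow new-information", "wide/VP"]

# The text states that each condition contained 4 target items.
ITEMS_PER_CONDITION = 4

# Fully crossing the variables gives 2 x 2 x 3 = 12 conditions.
conditions = list(product(WORD_FREQUENCY, CONTEXTUAL_PROBABILITY, FOCUS_TYPE))
n_target_items = len(conditions) * ITEMS_PER_CONDITION

print(len(conditions), n_target_items)
```

Crossing the levels in this way yields 12 conditions and 48 target items, matching the item count reported for the experiment.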
I manipulated the focus type of the critical noun by means of the question asked by the partner,
as shown in (2.1-2.3). In the wide/VP focus condition (ex. 2.3), the question asks about the
content of the entire verb phrase (i.e. what did X do?), and the answer spoken by the participant
provides this information. Thus, the whole VP (e.g. got balls at the sports store) is new
information. In the narrow new-information focus condition (ex. 2.2), in contrast, the question
asks for the object of the transitive verb, and therefore only the object is new information in this
condition. Finally, in the narrow corrective focus condition (ex. 2.1), the partner makes a
statement where the object is incorrect (as signaled to the participant by the sentence on their
screen), and thus the object in the participant’s response is correctively focused.
Norming study
The contextual probability of the critical words was estimated through a web-based norming
study (conducted on Ibex Farm, http://spellout.net/ibexfarm/). 116 native speakers of American
English (who did not participate in the main experiment) performed a fill-in-the-blank writing
task. They saw sentences composed of a personal name, a verb, a blank, and a location (e.g.
Christian found ______ in the sea.) and filled in the blank with one or two words. There were 63
items in total; 66 to 70 responses were collected for each item. Four verb-location contexts and
eight objects were ultimately selected for the target sentences in the main study, as shown in
Table 2.1. For each verb-location context, I identified the most frequently-produced nouns, and
chose two of the three most frequently-produced nouns to act as the probable objects for that
context (e.g. found fish in the sea, found shells in the sea). In addition, these same nouns were
used as improbable objects in other contexts. I ensured that the improbable objects used
for a context in the main study never occurred as responses to that context in the norming study
(e.g. no participant completed kicked ______ in the garage with ‘books’ or ‘shells’). Thus, each
of the eight target nouns functioned as a probable object in some contexts (e.g. found shells in
the sea) and an improbable object in other contexts (e.g. kicked shells in the garage). This allows
us to ensure that any effects of contextual probability cannot be attributed to idiosyncratic
properties of specific nouns. Another four nouns were selected to be the ‘incorrect’ objects in the
question that elicited corrective focus (e.g. gloves in ex. 2.1). These nouns had a contextual
probability between the high-probability and low-probability critical words.
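The selection logic of the norming study can be illustrated with a short sketch. The function names and the toy responses below are invented for illustration and are not the norming data; the sketch simply assumes that contextual probability is estimated as the proportion of fill-in-the-blank responses that match a given noun.

```python
from collections import Counter

def cloze_probabilities(responses):
    """Proportion of responses matching each word (hypothetical helper)."""
    counts = Counter(r.strip().lower() for r in responses)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def is_valid_improbable(word, responses):
    """An improbable object must never occur as a response to the context."""
    return word.lower() not in cloze_probabilities(responses)

# Invented toy responses for one verb-location context (NOT the real data).
toy_responses = ["fish", "shells", "fish", "treasure", "shells", "fish"]

probs = cloze_probabilities(toy_responses)
# Candidate probable objects: the most frequently produced nouns.
top_three = sorted(probs, key=probs.get, reverse=True)[:3]

print(top_three)
print(is_valid_improbable("books", toy_responses))
```

Under this scheme a noun qualifies as improbable for a context only if its estimated probability there is zero, which is a stricter criterion than merely being less frequent than the probable objects.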
The word frequency of the object nouns was determined according to the SUBTLEXus database
(Brysbaert & New, 2009). SUBTLEXus provides word frequency measures on the basis of
American English subtitles, and contains 51 million words in total. The critical words in the
high-frequency conditions (i.e. balls, books, cars, and fish) ranged in frequency from 83.49 to
40.16 per million, while those in the low-frequency conditions (i.e. cans, cleats, toys and shells)
ranged in frequency from 13.22 to 0.41 per million, as shown in Table 2.2. The ‘incorrect’
objects in the question that elicited corrective focus (e.g. gloves in ex. 2.1) had frequencies
between the high-frequency and low-frequency critical words.
VERB     OBJECT                                                      LOCATION
         Frequent &   Infrequent &   Frequent &    Infrequent &
         Probable     Probable       Improbable    Improbable
got      balls        cleats         fish          toys           at the sports store
kicked   cars         cans           books         shells         in the garage
found    fish         shells         balls         cans           in the sea
found    books        toys           cars          cleats         on the stairs
Table 2.1: Manipulation of word frequency and contextual probability in Experiment 1. A
full list of target items can be found in Appendix 1.
Word     Frequency in SUBTLEXus (per million)
fish     83.49
books    67.76
cars     45.63
balls    40.16
toys     13.22
cans     7.67
shells   5.57
cleats   0.41
Table 2.2: Lexical frequency of the target words in Experiment 1.
Thus, I manipulated contextual probability and word frequency while controlling for syntactic
construction and the number of syllables. This placed severe constraints on the choice of words,
and therefore bare plural forms were uniformly used, and voiceless segments in the words were
not avoided.
2.3.2. Step 2: Selection task
Each trial had two parts: the production task (described above) and a subsequent selection task.
The selection task began immediately after the production task was completed. The participant
and the partner saw a display of eight pictures and four words. The pictures represented objects
and locations, and the words were verbs in the past tense, as illustrated in Figure 2.1. The partner
was asked to select items based on the sentences spoken during the production task, e.g. if s/he
had asked ‘What did Connie and Sharon kick in the garage?’ and the participant had answered
‘They kicked cars in the garage’, then the partner should now select a picture of ‘cars’.
On a narrow-focus target trial such as ex. 2.1-2.2, the partner only needed to select the picture of
the object (cars). On a wide/VP-focus target trial such as ex. 2.3, the partner had to select three
items: the verb (kicked) and the pictures of the object (cars) and the location (garage). The
participants were asked to watch carefully and provide verbal help if needed and to ensure that
the partner chose the right items. Participants were told to correct their partner if s/he made any
mistakes and to give confirmation when they agreed with their partner’s choice.
The selection task was included to keep people engaged and interacting with each other, and to
make sure that they paid attention to what the other person said in the production task, as the
answer depends on both Sentence A and Sentence B. The partners, who were lab assistants, were
not instructed to intentionally make errors in the selection task, although errors sometimes
occurred naturally.
Figure 2.1: Sample display in the selection task of Experiment 1. This particular display
appeared immediately after the sentence pair ‘What did Connie and Sharon kick in the
garage? They kicked cars in the garage’ was produced.
2.3.3. Participants
Sixteen native speakers of American English participated in this study. All participants, 11
female and 5 male, were students at the University of Southern California. Two lab assistants
interacted with participants in this study. Both lab assistants were female native speakers of
American English and students at the University of Southern California.
2.3.4. Data analysis
768 utterances were collected from the 16 participants, each producing 48 target responses. Out
of the full set of data, 43 utterances (5.6%) were not included in the data analysis, due to speech
errors (16 utterances), disfluencies (6 utterances) and technical issues with the audio recording
(21 utterances).
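The counts reported above can be cross-checked with a few lines of arithmetic. The variable names are hypothetical; all figures are taken from the text.

```python
# Figures reported in the text.
n_participants = 16
targets_per_participant = 48
excluded = {
    "speech errors": 16,
    "disfluencies": 6,
    "recording issues": 21,
}

total_utterances = n_participants * targets_per_participant
n_excluded = sum(excluded.values())
percent_excluded = round(100 * n_excluded / total_utterances, 1)
n_analyzed = total_utterances - n_excluded

print(total_utterances, n_excluded, percent_excluded, n_analyzed)
```

This leaves 725 utterances in the analysis.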
F0 measurements were obtained using the STRAIGHT algorithm (Kawahara, de Cheveigne &
Patterson, 1998) through the VoiceSauce program (Shue, Keating, Vicenik & Yu, 2011). The
raw f0 values were then smoothed (smoothn in MATLAB: Garcia, 2010) to remove f0 tracking
errors and segmental effects. The smoothed values were then converted into a semitone scale, as
semitones reflect pitch perception better than the Hertz scale (e.g. Nolan, 2003). Finally, the data
were normalized by subject using z-scores, to factor out individual differences in f0 registers (e.g.
women usually have wider and higher registers of f0 than men). The z-scores represented each
data point in terms of its number of standard deviations above or below the mean across all
utterances produced by a given speaker.
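The semitone conversion and by-speaker normalization can be sketched as below. This is an illustration rather than the processing script used in the study: the 1 Hz reference, the use of the population standard deviation, the function names and the toy f0 values are all assumptions. The choice of reference does not matter for the outcome, since a different reference only adds a constant to every semitone value, and z-scoring removes constants.

```python
import math
from statistics import mean, pstdev

def hz_to_semitones(f0_hz, ref_hz=1.0):
    """Convert Hz to semitones relative to ref_hz (reference is an assumption)."""
    return 12.0 * math.log2(f0_hz / ref_hz)

def zscore_by_speaker(f0_by_speaker):
    """Z-score each speaker's semitone values against that speaker's own
    mean and standard deviation (population SD assumed here)."""
    normalized = {}
    for speaker, values_hz in f0_by_speaker.items():
        semitones = [hz_to_semitones(v) for v in values_hz]
        mu, sd = mean(semitones), pstdev(semitones)
        normalized[speaker] = [(s - mu) / sd for s in semitones]
    return normalized

# Invented toy values: speaker "B" is exactly one octave below speaker "A".
toy_f0 = {"A": [180.0, 220.0, 260.0], "B": [90.0, 110.0, 130.0]}
z = zscore_by_speaker(toy_f0)

print([round(v, 3) for v in z["A"]])
print([round(v, 3) for v in z["B"]])
```

In the toy data the two speakers differ by exactly one octave, so their normalized contours coincide. This is the sense in which the normalization factors out differences in f0 register.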
To investigate whether different levels of word frequency and contextual probability influence
the prosodic encoding of narrow focus in different ways, I examined the effects of narrow focus
in each of the four conditions of word frequency and contextual probability separately: high-
frequency words in high-probability contexts, high-frequency words in low-probability contexts,
low-frequency words in high-probability contexts, and low-frequency words in low-probability
contexts. In other words, different levels of word frequency or contextual probability were not
directly compared, and the interactions between the independent variables (i.e. word frequency,
contextual probability, and focus type) were not statistically tested. This was because identical
sentences existed only between different types of focus, which is an intrinsic property of the
design, due to the manipulation of word frequency and contextual probability.
As the prosodic consequences of narrow focus were expected in the focused word and the words
immediately before and after it (see Section 2.2 for the predictions of this study), I examined
these regions of a sentence. Specifically, I analyzed f0 shapes and ranges for three individual
intervals, namely Pre-Focus (verb), Focus (object), and Post-Focus (the region from preposition to
article), as well as for two combined regions: from Pre-Focus to Focus (verb and object), and from
Focus to Post-Focus (the region from object to article). The head noun of the prepositional phrase
was not statistically analyzed,
because it was at the end of a sentence, where f0 varied considerably due to factors outside the
scope of this study such as dialects (e.g. Ching, 1982) and turn transition cues (e.g. Wennerstrom
& Siegel, 2003). For the intervals I analyzed, f0 ranges were calculated by subtracting the
minimum f0 from the maximum f0 in each interval. Since the f0 measurements have been
normalized based on a given speaker’s f0 register, a larger f0 range indicates that the participant
was employing a bigger proportion of his or her f0 register.
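The f0 range measure can be illustrated with a minimal sketch. The interval names and the normalized f0 values below are invented; the only step taken from the text is that the range is the maximum minus the minimum within an interval, computed on speaker-normalized values.

```python
def f0_range(values):
    """Range of normalized f0 within an interval: maximum minus minimum."""
    return max(values) - min(values)

# Invented normalized f0 samples (z-scores) for one utterance.
intervals = {
    "pre_focus": [-0.2, 0.1, 0.3],
    "focus": [0.4, 1.1, 0.9],
    "post_focus": [0.5, -0.6, -0.9],
}

ranges = {name: f0_range(vals) for name, vals in intervals.items()}

# Combined regions pool the samples of adjacent intervals before taking the range.
ranges["pre_focus_to_focus"] = f0_range(intervals["pre_focus"] + intervals["focus"])
ranges["focus_to_post_focus"] = f0_range(intervals["focus"] + intervals["post_focus"])

for name, value in ranges.items():
    print(name, round(value, 2))
```

Note that the range of a combined region can never be smaller than the range of either interval it contains, so an effect in a combined region may simply reflect an f0 peak in one interval and a trough in the next.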
To examine the shapes of f0 contours, I used a smoothing spline ANOVA approach. The
smoothing spline ANOVA fits a regression model to continuous data to test differences between curves
(Gu, 2002). I plotted the best-fitted curves with 95% confidence intervals (1.96 standard errors).
For example, in Figures 2.2-2.5, the lines represent the best-fitted curves, and the shading around
each line represents its 95% confidence intervals. The best-fitted values in a regression analysis
can be interpreted as the average patterns of the data being modeled. Two conditions can be
considered as being significantly different if the 95% confidence intervals of their best-fitted
values do not overlap. For example, in Figure 2.2, VP/wide focus significantly differs from the
other two conditions, because the shading of the dotted line does not overlap the shading of the
other two lines in the Post-Focus Interval. However, the two narrow-focus conditions do not
significantly differ from each other, because the shading of the solid and the dashed lines
overlaps throughout the utterance. Similar approaches have been used for other kinds of
continuous data in phonetics, such as tongue shapes (e.g. Davidson, 2006) and formants (e.g.
Baker, 2006). I first extracted 20 data points with equal time spacing from each of the three
consecutive intervals: Pre-Focus, Focus and Post-Focus. Mixed-effects smoothing spline
ANOVA models were then fitted with Focus Type, Time and their Interaction as fixed
effects; Subject and Context (i.e. verb-location collocation) were included as random intercepts
(gss in R: Gu, 2014).
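Two steps of this procedure can be sketched in code: extracting equally spaced points from an interval, and comparing two fitted values by the overlap of their 95% confidence intervals. The sketch does not fit a smoothing spline ANOVA. The function names are hypothetical, and the use of linear interpolation with both interval edges included is an assumption, since the text does not specify how the 20 points were obtained.

```python
def resample(times, values, n_points=20):
    """Linearly interpolate (an assumed method) onto n_points equally spaced
    times spanning the interval. Requires at least two samples and strictly
    increasing times."""
    start, end = times[0], times[-1]
    step = (end - start) / (n_points - 1)
    out = []
    j = 0  # index of the left neighbor of the current target time
    for i in range(n_points):
        t = start + i * step
        while j < len(times) - 2 and times[j + 1] < t:
            j += 1
        weight = (t - times[j]) / (times[j + 1] - times[j])
        out.append(values[j] + weight * (values[j + 1] - values[j]))
    return out

def confidence_interval(fit, se, z=1.96):
    """95% confidence interval as fit plus or minus 1.96 standard errors."""
    return fit - z * se, fit + z * se

def significantly_different(fit_a, se_a, fit_b, se_b):
    """True when the two 95% confidence intervals do not overlap."""
    lo_a, hi_a = confidence_interval(fit_a, se_a)
    lo_b, hi_b = confidence_interval(fit_b, se_b)
    return hi_a < lo_b or hi_b < lo_a

# Invented toy contour: three samples resampled to five points.
print(resample([0.0, 1.0, 2.0], [0.0, 10.0, 20.0], n_points=5))
print(significantly_different(0.0, 0.1, 1.0, 0.1))
```

Applied at each time point of two fitted curves, the second function implements the non-overlap criterion described above. That criterion is conservative, since two intervals can overlap slightly even when a direct test of the difference would reach significance.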
For f0 ranges, mixed-effects models were fitted (anova in R: Chambers &
Hastie, 1992; lme4 in R: Bates, Maechler, Bolker, & Walker, 2014; lmerTest in R: Kuznetsova,
Brockhoff, & Christensen, 2015). Focus Type was included as a fixed effect, and Subject and
Context were included as random effects. When specifying the structure of random effects, I
started with a full model (i.e. including intercepts and slopes for Subject and Context) and
excluded a random slope when it did not significantly contribute to the model. All models in the
final analysis had random intercepts for Subject and Context.
2.4. Experiment 1: Results
Overall, the predictions outlined in Section 2.2 were borne out, as can be seen from the
smoothing spline ANOVA results shown in Figures 2.2-2.5. The lines represent the best-fitted
curves, and the shading around each line represents its 95% confidence intervals. As mentioned
in the preceding section, two conditions can be considered as being significantly different if the
95% confidence intervals of their best-fitted values do not overlap. Thus, in terms of f0 shapes,
we can see that the three types of focus do not significantly differ in the Pre-Focus interval (the
first section marked on the x-axis). Significant differences in f0 shapes start emerging towards
the end of the Focus interval (the middle section marked on the x-axis) and continue for most of
the Post-Focus interval (the last section marked on the x-axis). Narrow corrective focus (solid
lines) and narrow new-information focus (dashed lines) have a steeper f0 drop than wide focus
(dotted lines) in some cases, depending on the narrowly-focused word’s frequency and
contextual probability. More specifically, when the word is high-frequency and occurs in a
probable context (got balls at the sports store), both types of narrow focus differ significantly
from wide focus (Figure 2.2, labelled ‘High Freq + High Prob’). However, when a high-
frequency word is focused in an improbable context (got fish at the sports store), only new-
information focus differs significantly from wide focus; corrective focus patterns with wide
focus (Figure 2.3, ‘High Freq + Low Prob’). In contrast, when the word is lexically infrequent
but contextually probable (got cleats at the sports store), corrective focus differs significantly
from wide focus; new-information focus does not (Figure 2.4, ‘Low Freq + High Prob’). Finally,
neither type of narrow focus differs from wide focus when it is an infrequent word focused in an
improbable context (got toys at the sports store, Figure 2.5, ‘Low Freq + Low Prob’).
The analysis of f0 ranges finds parallel patterns to the above results of f0 shapes. There are no
significant differences in f0 ranges when the Pre-Focus and Focus intervals are analyzed either
jointly (i.e. treated as one region) or separately (t’s < 1.723, p’s > 0.086). The differential effects
of word frequency and contextual probability on different focus types appear when the Post-
Focus interval is analyzed alone or jointly with the Focus interval. In the condition of lexically
frequent and contextually probable words, both types of narrow focus have significantly larger
f0 ranges than wide focus (t’s > 2.524, p’s < 0.05, except for new-information focus in the Post-
Focus interval: t = 1.458; p = 0.147). In the condition of lexically frequent but contextually
improbable words, only new-information focus has larger f0 ranges than wide focus (t’s > 1.994,
p’s < 0.05); corrective focus does not (t’s < 1.650, p’s > 0.100). In contrast, for low-frequency
but high-probability words, corrective focus has larger f0 ranges than wide focus (t’s > 2.159, p’s
< 0.05); new-information focus patterns with wide focus (t’s < 1.091, p’s > 0.276). Lastly,
neither type of narrow focus differs from wide focus when low-frequency words are focused in
low-probability contexts (t’s < 1.366, p’s > 0.173). I do not report statistics for the Focus and
Post-Focus intervals separately here, for reasons of readability, and more importantly because
I do not think that this distinction (i.e. whether it is the Focus or Post-Focus interval which shows
significant differences) is theoretically relevant for the claims made in this dissertation.
Figure 2.2: High Freq + High Prob
Figure 2.3: High Freq + Low Prob
Figure 2.4: Low Freq + High Prob
Figure 2.5: Low Freq + Low Prob
Figures 2.2-2.5: Best-fitted curves with 95% confidence intervals for the f0 values (in
semitone, standardized by speaker) in the pre-focus, focus and post-focus regions of an
utterance in Experiment 1.
As a whole, we see that narrow focus brings greater prosodic prominence than wide focus, but
this effect disappears under certain conditions of word frequency and contextual probability.
Specifically, narrow corrective focus differs from wide focus only when the word carrying
corrective information in narrow focus is probable in its sentence context. Conversely, new-
information focus differs from wide focus when the word carrying new information in narrow
focus is a frequent word. This suggests that the prosodic prominence associated with information
structure is modulated by word frequency and contextual probability.
2.5. Discussion
Experiment 1 investigated how information structure and information-theoretic properties
interact in shaping the prosody of an utterance. Existing studies have examined prosody from
both perspectives, but the interplay between these two kinds of informativity factors has not been
thoroughly investigated. A better understanding of this issue is important because it is involved
in fundamental questions regarding the functions and constraints of the prosodic system.
The results in Section 2.4 show that the prosodic consequences of information structure are
modulated by information-theoretic factors. In particular, we see differential effects of contextual
probability and word frequency on corrective narrow focus vs. new-information narrow focus.
Corrective narrow focus results in significant f0 movement only when the word carrying
corrective information is probable in the context. However, new-information narrow focus
results in significant f0 movement only when the word carrying new information is a frequent
word. When the narrowly focused word is lexically frequent and contextually probable, both
types of narrow focus have greater f0 movement than wide focus. In contrast, when the narrowly
focused word is infrequent and improbable, neither type of narrow focus is distinguishable
from wide focus. This fits with my prediction that the prosodic prominence associated with
information structure would be weakened when other factors also demand prosodic prominence.
One might wonder why the prosodic differences between narrow and wide focus did not appear
in the narrowly-focused word itself (i.e. the object noun), but only appeared in the region
following it. While it is not surprising to find effects of narrow focus in the Post-Focus Interval,
it is unexpected to find no effects of focus in the Focus Interval (e.g. Breen, Fedorenko, Wagner,
& Gibson, 2010; Brown, 1983; Cooper, Eady, & Mueller, 1985; Couper-Kuhlen, 1984; Eady &
Cooper, 1986; Katz & Selkirk, 2011). One explanation has to do with the syntactic construction
used in this study. Prosodic realization of narrow focus varies depending on the type of syntactic
constructions (e.g. whether it is a noun phrase or a verb phrase, see Jun, 2011 for a brief cross-
linguistic discussion). In English, a nuclear pitch accent placed on the final argument of a verb
phrase (e.g. ‘bought a book about cats’ or ‘sent a book to Mary’) can alternatively indicate
narrow focus on the final argument or wide focus on the entire verb phrase (e.g. Birch and
Clifton, 1995; Büring, 2006; Gussenhoven, 1983; Selkirk, 1984, 1985; Welby, 2003). In this
study, the object was the final argument of the verb phrase (e.g. found books), and therefore wide
focus could be prosodically realized by accentuating only the object. Furthermore, the object was
the focused element in the narrow-focus conditions, and hence the only accented word in the
sentence. In other words, an accented object would satisfy the prosodic requirements in both
cases, which might be the reason why narrow focus was prosodically indistinguishable from
wide focus in the Focus interval (i.e. the object).
2.5.1. Relation to prior work
Taken together, the findings of this study pose a challenge to the widespread view that narrow
focus is (consistently) associated with greater prosodic prominence than wide focus. In fact, prior
work on the phonetic realization of information structure suggests a prominence hierarchy, such
that contrastive/corrective information is prosodically more marked than ‘plain’ new information,
and new information in narrow focus is prosodically more marked than new information in wide
focus (e.g. English: Breen, Fedorenko, Wagner, & Gibson, 2010; Katz & Selkirk 2011; German:
Baumann, Grice, & Steindamm, 2006; Mandarin Chinese: Ouyang & Kaiser, 2015; Xu, 1999).
To the contrary, this hierarchy is not found in my data – I found that contextual probability and
word frequency need to be considered in order to understand the relative prosodic prominence of
different focus types. Interestingly, it seems that many existing studies have focused on relatively
probable contexts and have not manipulated word frequency, which may explain the hierarchical
relation previously found between corrective focus, new-information focus and wide focus (i.e.
narrow corrective > narrow new > wide new).
Consider a hypothetical study that has a mix of high-frequency and low-frequency words
focused in probable contexts. Based on my results, in such a study: (a) corrective focus will have
greater prominence than wide focus, since the contexts are probable, and (b) new-information
focus will be less prominent than corrective focus and more prominent than wide focus, because
frequent words pattern with the former but infrequent words pattern with the latter. These
predictions are confirmed by a follow-up analysis where I pooled the conditions of word
frequency and excluded the condition of low contextual probability. Using the approaches
described in Section 2.3.4, I found significant differences in the Focus and Post-Focus intervals.
The f0 movement was largest for corrective focus, second largest for new-information focus, and
smallest for wide/VP focus. In other words, the common generalization about the prominence
hierarchy between the three types of focus might be an epiphenomenon stemming from not
controlling word frequency and using relatively probable contexts. My findings highlight the
importance of disentangling information structure and information-theoretic factors: To fully
understand how prosody encodes informativity, it is important to integrate the work in the
information-theoretic approach and the work in the information-structural approach (see Wagner
& Watson 2010:933, for relevant discussion).
2.5.2. An interaction-based account
In this subsection, I start to explore the question of why information-theoretic factors modulate
the effects of information structure in the way that we saw in Experiment 1. In particular, why
does contextual probability only impact the realization of corrective focus, while lexical
frequency only impacts the realization of new-information focus? I suggest that these differential
effects can be explained in terms of how ‘surprising’ the preceding utterance (the question asked
by the partner) is to the participant. In a conversation, people might have expectations about what
others might say, and when other people’s utterances contradict those expectations, they mark this
surprisal prosodically in their responses.
Let us consider the low-probability corrective focus condition. Here, the correct information (the
sentence read by the participant to correct the partner’s utterance) is improbable (e.g. Bonnie and
Laura found balls in the sea). Relative to this correct ‘state of the world’, the partner’s
misstatement (Bonnie and Laura found boats in the sea) may not strike the participant as
surprising, since boats are quite likely to occur in the sea. This may be what motivates low
prosodic prominence (so low that it is comparable to the prosodic prominence in wide focus) in
the participant’s response. On the other hand, when the correct information/‘state of the world’ is
highly probable (e.g. Bonnie and Laura found fish in the sea), it may be quite surprising to the
participant that the partner is mistaken about this, and thus the participant produces high prosodic
prominence in the correction.
Possibly for similar reasons, new information conveyed by infrequent words is realized with low
prosodic prominence. In the new-information focus condition, if the new information is a
relatively infrequent word (e.g. shells), it may not be surprising to the participant that the partner
asks for this information (what was found in the sea). In contrast, when the new information is a
higher-frequency, often-used word (e.g. fish), it may seem to the participant that the partner’s
question is so obvious that it is (almost) not worth asking, and thus surprising – triggering
increased prosodic prominence in the participant’s response. In general terms, the idea is that
participants produce increased prosodic prominence in their response if the other person’s
preceding utterance is unexpected.
If these ideas are on the right track, my findings can be connected to the notion of ‘surprise’ or
‘astonishment’ developed by researchers in intonation and social interaction (e.g. Dombrowski,
2003; Local, 1996; Niebuhr & Zellers, 2012; Selting, 1996; Wilkinson & Kitzinger, 2006).
Crucially, existing work distinguishes “the social expression of surprise (the public display of
finding something counter to expectation)” from “the psychology of surprise (the emotional
experience of encountering the unexpected)” (Wilkinson & Kitzinger, 2006:152). Related to the
social expression of surprise, so-called linguistic astonishment “is oriented towards the
propositional content of the speaker’s message. It relates a local piece of information to the
overall argumentation by denoting a factual mismatch between the speaker and the external
world.” (Niebuhr & Zellers, 2012:160). My thinking builds on these ideas.
Specifically, I use the term epistemic surprisal to refer to the degree of difference between the
speaker’s expectation and reality. ‘Reality’ as used here is not limited to events or concrete
objects, since a speaker can also be surprised by the preceding content of the conversation (e.g.
what their partner just said). A higher degree of epistemic surprisal means a bigger mismatch
between what has just happened/what has been said and what the speaker had expected (cf.
Bayes’ rule). I use the term ‘epistemic surprisal’ to avoid confusing linguistic astonishment with
emotional astonishment, and to indicate a conceptual connection to the information-theoretic
definition of surprisal (e.g. Levy, 2008; Hale 2003).
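For reference, the information-theoretic notion of surprisal mentioned above is standardly defined as the negative log probability of an outcome given its context. The Python sketch below illustrates that definition only; it is not a measure of epistemic surprisal as used in this dissertation, and the probabilities assigned to the example words are invented for illustration.

```python
import math


def surprisal_bits(probability):
    """Surprisal in bits: -log2 of the probability of an outcome in its context."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


# Hypothetical probabilities of the object noun in "found ... in the sea"
p_given_context = {"fish": 0.5, "balls": 0.03125}

for word, p in p_given_context.items():
    print(word, surprisal_bits(p))
```

Under these made-up values, the less probable word carries the higher surprisal, which is the sense in which a larger mismatch between expectation and outcome corresponds to a larger value.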
My results can thus be recast as follows: The higher the degree of epistemic surprisal
experienced by the speaker, the higher the degree of prosodic prominence that the speaker will
produce on the relevant words. I would like to suggest that people’s expectations about their
conversational partner’s utterances are an important factor involved in the prosodic encoding of
informativity. This may be the motivation behind the interaction between information-theoretic
factors and information structure. Experiments 2 and 3 in Chapter 4 will provide further evidence
for this account.
Chapter 3: Individual Differences in the Prosodic Encoding of Informativity
3.1. Introduction
The results of Experiment 1 in the preceding chapter show that prosody is influenced by a
complex interplay between information structure and information-theoretic properties. Word
frequency modulates the prosodic consequences of new-information focus, whereas contextual
probability modulates the prosodic consequences of corrective focus. However, existing work
has shown that speakers can differ from each other in how they mark a given linguistic distinction
using duration, f0, intensity and spectral parameters. To name a few examples,
individual differences have been identified in the duration and spectral cues for word boundaries
(e.g. Smith & Hawkins, 2012), in voice-onset-time (VOT) for stop consonants (e.g. Allen, Miller
& DeSteno, 2003; Loakes & McDougall, 2010), and in how VOT is affected by other factors such
as speech rate and place of articulation (e.g. Theodore, Miller, & DeSteno, 2007).
It appears that between-subject variability can occur qualitatively and quantitatively, both on a
general level and in specific cases. Along a given acoustic dimension, participants have different
ranges of absolute values, produce different sizes and directions of variation between and within
linguistic categories, and use different kinds and numbers of strategies to signal a linguistic
contrast. For example, in a study where participants were asked to speak at self-selected fast,
normal and slow rates, some people’s fast rates were similar to some others’ slow rates in terms
of the number of syllables they produced per second. Moreover, the participants differed in how
they altered their speech rate: while some people produced more syllables a second for a faster
rate, some others produced longer pauses for a slower rate (German: Trouvain & Grice, 1999). In
a study by Dahan and Bernard (1996) on French emphatic accent with four participants, some
people increased f0 to a greater extent than others. The participants also differed in where and
how they used intensity to signal emphasis. For the emphasized element in a sentence, one
person increased the intensity, another person decreased it, and two other people produced no
difference. In the sentence region preceding the emphasized element, three people decreased the
intensity, while one person produced no differences. Lastly, everyone decreased the intensity in
the sentence region following the emphasized element (Dahan & Bernard, 1996).
In addition to individual differences in the modulation of duration, pauses, f0 and intensity, work
by Niebuhr, D’Imperio, Fivela and Cangemi (2011) found evidence for individual differences in
the realization of pitch accent categories in Standard Northern German (H* vs. H+L*),
Neapolitan Italian (L+H* vs. L*+H) and Pisa Italian (H* vs. H*+L). Niebuhr et al. found that
Standard Northern German and Neapolitan Italian speakers used different strategies in terms of
the alignment and shapes of f0 contours: some people produced systematic differences in the
location of f0 peak with respect to the target syllable, while others produced systematic
differences in how steep and large the f0 rise or fall was. In contrast, Pisa Italian speakers only
differed in cue strength: those who made greater alignment differences also made greater
differences in shapes (Niebuhr, D’Imperio, Fivela & Cangemi, 2011).
Individual differences also exist in the strategies people use for increasing the
audibility/intelligibility of their speech. In one study, participants were first asked to speak
normally and then asked to speak as they would to a hearing-impaired person.
According to normal-hearing listeners in a perception
study, some of the speakers significantly improved their vowel intelligibility while others did not.
It turns out that the former group of speakers increased their vowel duration and raised their F2
for front vowels to a greater extent than the latter group. Also, the former group expanded their
vowel space in the F1 dimension, while the latter group did not (Fougeron, 2004; Fougeron &
Kewley-Port, 2007). In sum, empirical evidence suggests that speakers may differ from one
another substantially in terms of whether and how particular acoustic markers correlate with
particular linguistic factors.
In addition to the studies that explicitly focus on individual differences, research whose primary
focus is not on individual differences has also led to observations about between-subject
variability, i.e. how individuals differ. For example, it has been noted that participants differ in
their duration and spectral cues for the edges of prosodic domains (e.g. Fougeron & Keating,
1997; Krivokapic & Byrd, 2012; Korean: Cho & Keating, 2001), in their pausing and
lengthening cues for levels of discourse structure (e.g. word vs. clause vs. paragraph in Dutch:
van Donzel & Beinum, 1996), and in the effect of word prosodic structure on vowel duration (e.g.
Dutch: Rietveld, Kerkhoff & Gussenhoven, 2004).
More specifically related to informativity, Krahmer and Swerts (2001) investigated the
intonational cues for the distinctions between contrastive focus, non-contrastive focus, and given
information in Dutch. An interactive task was used, where participants worked in pairs to
complete dialogues. It was found that some participants ignored their
partner’s contribution and instead prosodically marked elements that contrasted with their
own last utterance. These participants also tended to end their utterances with a high boundary
tone (H%), which is generally interpreted as signaling the speaker’s intention to hold the turn.
Thus, these ‘egocentric’ participants constituted the exceptions in the data.
In related work on focus types, Andreeva, Barry and Steiner (2007) investigated the cues in
duration, f0, intensity and vowel quality for the distinctions between narrow contrastive focus,
narrow non-contrastive focus, and wide focus in German. They note that some participants
produced larger differences than others, and some participants also used one parameter to a
greater extent than another. Thus, individual participants had their own tendencies and strategies
in producing prosodic prominence.
However, other than these sparse observations, little is known about the extent or nature of
individual differences regarding the prosodic encoding of informativity. In this chapter, I aim to
contribute to our knowledge in this area.
3.2. Aims and expected outcome
The discussion above shows that speakers differ in their acoustic realization of prosody. As there
is not a lot of prior work focusing on individual differences in sentence prosody, I first wanted to
see, on a general level, whether the results of Experiment 1 fit with the previous findings that
sentence prosody is susceptible to speaker-specific effects. I then looked more closely at whether
and how individual differences manifest themselves in the prosodic encoding of informativity.
Roughly speaking, I expected individual differences along multiple prosodic dimensions,
because existing research on other prosody-related topics (as discussed in Section 3.1) found
both qualitative and quantitative variability among the participants of a study, in terms of the
range and characteristics of cues a participant produces along an acoustic dimension as well as
the size and direction of acoustic differences that a participant produces to signal a linguistic
contrast (Andreeva, Barry, & Steiner 2007; Dahan & Bernard, 1996; Fougeron & Kewley-Port,
2007; Krahmer and Swerts 2001; Niebuhr, D’Imperio, Fivela & Cangemi, 2011; Trouvain &
Grice, 1999).
Specifically, I expected the participants to differ from each other in terms of (i) whether they
made distinctions between narrow and wide focus in a given information-theoretic condition, (ii)
whether they increased or decreased prosodic prominence for a given region of the sentence, (iii)
to what extent they varied prosodic prominence to convey the informativity of a word, as well as
(iv) the overall prosody of their utterances. Individual differences were expected to occur on both
of the acoustic correlates I analyzed, namely f0 shapes and ranges.
3.3. Results on individual differences in Experiment 1
In Chapter 2, I summarized the overall patterns in Experiment 1 when all sixteen participants are
investigated as a group. Let us now explore whether and how individual participants differ from
one another. In this section, I will first look at the overall prosody of individual speakers,
focusing on f0 shapes and ranges (Section 3.3.1). Then, I will examine the different experimental
conditions, to see how individual speakers produce different types of focus in different
conditions of word frequency and contextual probability (Section 3.3.2).
As in Chapter 2, a smoothing spline ANOVA approach was used to examine f0 shapes,
while mixed-effects models were fitted to f0 ranges. In the analysis of individual speakers’
overall f0 shapes (presented in Section 3.3.1), smoothing spline ANOVA models were
fitted to each participant’s data separately, and Context was included as a random intercept.
In the analysis of individual speakers’ overall f0 ranges regardless of experimental condition
(also presented in Section 3.3.1), Subject was included as a fixed effect and Context as a
random effect, with both a random intercept and random slopes for Context. Finally,
for the directionality and magnitude of differences in f0 ranges between conditions (presented in
Section 3.3.2), I mainly focused on descriptive statistics, because the numbers of observations
became relatively low when the data were split into small subsets by both subject and condition.
3.3.1. Overall prosody of utterances
Overall, in terms of general prosodic patterns, speaker-specific variation occurs both
qualitatively and quantitatively. Between-subject variability and within-subject consistency were
observed in both the shapes of f0 contours and the ranges of f0 values.
First, the shapes of f0 contours vary greatly from participant to participant. In a given condition,
participants differ in the number, locations and relative height of the f0 peaks and valleys that
they produce in an utterance. To illustrate the extent of variability, I plotted a sample of five out
of sixteen participants whose f0 shapes are clearly distinct from one another. Figure 3.1 shows
the observed f0 contours produced by these participants for new-information, frequent words that
are narrowly-focused in probable contexts (e.g. What did Rachel and Carolyn get at the sports
store? They got balls at the sports store.) We can see that participants 04 (triangles), 06 (dots)
and 09 (dashes) all tend to produce a high tone on the focused word (i.e. balls) – thus showing
overall consistency in this regard. However, their choices regarding the adjacent tones differ.
Participant 06’s utterances on average have a low tone preceding the high tone, participant 04’s
in general have another high tone preceding the high tone, and participant 09’s seem to have a
low tone following the high tone. Furthermore, participant 01 (squares) and participant 04 both
show a clear tendency toward declination, but participant 04’s utterances have two high tones
whereas participant 01’s do not have apparent tone targets. Lastly, participant 07 (solid line)
distinctively produces the focused word with a low tone. Such diversity is found among other
participants and in other conditions as well.
Figure 3.1: Observed mean f0 (in semitones, standardized by speaker) for participants 01,
04, 06, 07 and 09 in the narrow new-information focus, high word frequency and high
contextual probability condition in Experiment 1.
Although different participants produce different shapes of f0 contours, they show consistent
patterns within their own utterances. For example, Figure 3.2 provides a glance at the observed
f0 contours produced by participant 04 in all twelve experimental conditions. We can see that
participant 04’s utterances are quite similar to one another, regardless of the condition. To further
illustrate the intra-subject consistency with better graphical legibility, Figure 3.3 shows the
smoothing spline ANOVA results of three individual participants, including participant 04, in all
four information-theoretic conditions. These three participants were chosen because they had
strong preferences regarding f0 shapes. We can see that participant 01’s utterances (top row)
mostly follow a declination slope, although a low tone occasionally occurs around the end of the
Focus interval. Participant 04 (middle row) consistently produces a high tone in the Pre-Focus
interval and another high tone, downstepped, in the Focus interval, except there is sometimes a
low tone preceding and/or following the second high tone. Participant 06 (bottom row)
generally produces a low tone in the Pre-Focus interval and a high tone in the Focus interval,
which is often followed by another low tone. Speaker-specific preferences of this sort are also
found for most of the other participants in the data.
[Figure 3.1 plot: y-axis, log f0 (z-scored by speaker), from -2 to 2; x-axis, the words ‘got BALLS at the’; one line each for Subj01, Subj04, Subj06, Subj07 and Subj09.]
Figure 3.2: Observed mean f0 (in semitones, standardized by speaker) of participant 04 in
all the conditions of Experiment 1.
[Figure 3.2 plot: y-axis, log f0 (z-scored by speaker), from -2 to 2; x-axis, the words ‘got BALLS/FISH/CLEATS/TOYS at the’; one line per condition, crossing HiFreqHiProb, HiFreqLoProb, LoFreqHiProb and LoFreqLoProb with Corr, New and VP focus.]
[Figure 3.3 panels: columns HiFreq+HiProb, HiFreq+LoProb, LoFreq+HiProb and LoFreq+LoProb; rows Subj 01, Subj 04 and Subj 06.]
Figure 3.3: Best-fitted curves with 95% confidence intervals for the f0 values (in semitones,
standardized by speaker) produced by participants 01, 04 and 06 in Experiment 1.
I also found speaker-specific effects in the ranges of f0 values. Some participants regularly
employ a large proportion of their f0 register, while others regularly employ a small proportion
of their f0 register. To illustrate, let us take a close look at the sentence region from the Pre-
Focus interval to the Post-Focus interval. Figure 3.4 shows the average f0 ranges with 95%
confidence intervals (1.96 standard errors) produced by individual participants. We can see that
every participant differs from some other participant(s). Pairwise comparisons with the
Bonferroni adjustment show that, between the sixteen participants, everyone significantly differs
from at least two other people and as many as thirteen other people (p’s<0.05). For example,
participant 05, whose f0 ranges are largest on average (mean = 2.787) and the least variable
among all participants (standard deviation = 0.512), differs from participants 01, 03, 04, 06, 07,
and 09-16. On the other hand, participant 12, whose f0 ranges are smallest on average (mean =
1.582), differs from participants 02, 04, 05, 06, 08, 09, 11 and 16. Even participant 07, whose f0
ranges are the most variable among all participants (standard deviation = 1.253), differs from
participants 02, 05, 08 and 11 (by being smaller). More details about other participants can be
observed in Figure 3.4.
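The two quantities used in this comparison, a 95% confidence interval taken as 1.96 standard errors around the mean and a Bonferroni adjustment over all pairs of participants, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the analysis itself: the function names and the sample f0 ranges are hypothetical, and the text does not say whether the adjustment was applied to the p-values or to the threshold (the two are equivalent).

```python
import math
from statistics import mean, stdev


def ci95(values):
    """Mean and 95% confidence bounds, computed as mean +/- 1.96 standard errors."""
    m = mean(values)
    se = stdev(values) / math.sqrt(len(values))
    return m, m - 1.96 * se, m + 1.96 * se


def bonferroni_alpha(n_groups, alpha=0.05):
    """Number of pairwise comparisons and the per-comparison threshold."""
    n_pairs = math.comb(n_groups, 2)
    return n_pairs, alpha / n_pairs


# Sixteen participants give 120 pairwise comparisons
n_pairs, per_test_alpha = bonferroni_alpha(16)
print(n_pairs, per_test_alpha)

# Hypothetical f0 ranges for one participant
hypothetical_ranges = [2.1, 2.9, 3.3, 2.6, 3.0]
print(ci95(hypothetical_ranges))
```

Comparing an unadjusted p-value against `per_test_alpha` gives the same decision as multiplying it by the number of comparisons and comparing the result against 0.05.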
Figure 3.4: The observed f0 ranges (calculated from semitones standardized by speaker)
with 95% confidence intervals for individual participants in the sentence region from the
pre-focus interval to the post-focus interval in Experiment 1. A larger f0 range indicates
that the speaker employs a bigger proportion of his/her f0 register for this sentence region.
In sum, individual participants appear to be fairly different from one another, yet consistent
within their own utterances, in terms of the f0 shapes they adopt and how large a proportion of
their f0 register they use. This provides evidence for speaker-specific behavior in the overall
prosodic patterns of utterances and the extent to which people utilize their vocal capacity to
produce prosodic cues.
3.3.2. Prosodic encoding of informativity
Now that we have seen speaker-specific effects on the overall shapes and ranges of f0, let us
move on to individual differences in how participants’ prosody reflects the informativity of linguistic
elements. Since a given participant’s f0 shapes are similar across the conditions, i.e. different
types of focus and different levels of word frequency and contextual probability (see the
preceding subsection), only f0 ranges are of interest in this subsection. To draw a direct
comparison between the group trends and the individual patterns, I present the results of the
sentence region from the Focus interval to the Post-Focus interval, where the group analysis
found significant effects of information structure on prosody (see Section 2.4).
First, we can observe between-subject variability in terms of the direction of distinctions
between different kinds of information. As presented in Section 2.4, there are three main patterns
when all sixteen participants are analyzed as a group: (i) wide focus has smaller f0 ranges than
both types of narrow focus in the condition of frequent and probable words, (ii) narrow new-
information focus has larger f0 ranges than narrow corrective focus and wide focus in the
condition of frequent but improbable words, and (iii) narrow corrective focus has larger f0
ranges than narrow new-information focus and wide focus in the condition of infrequent but
probable words. The analysis of individual participants finds each pattern in eight or nine people
out of sixteen: pattern (i) is exhibited by participants 01, 02, 04, 05, 07, 10, 12, 14 and 15; pattern
(ii) is exhibited by participants 01, 02, 06, 07, 12, and 14-16; pattern (iii) is exhibited by
participants 04 and 07-13. In other words, only about half of the participants conform to the
group trends regarding how information-structural types are differentiated in a given
information-theoretic condition, and it is not the same individuals in every condition. However, it
is worth noting that there are no alternative ‘competitor’ patterns – instead, the participants who
do not match the overall group trends show a mix of patterns in the different conditions. Thus,
although the overall group trends (as summarized in (i)-(iii) above) are not exhibited by
everyone, they nevertheless constitute the clearest patterns that emerge from the data.
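The participant lists given above for patterns (i)-(iii) can be cross-tabulated to see how many of the group trends each person follows. The short Python sketch below does this using only the participant numbers reported in this subsection; the variable and function names are illustrative.

```python
# Participants exhibiting each group pattern, as listed in the text
pattern_i = {1, 2, 4, 5, 7, 10, 12, 14, 15}
pattern_ii = {1, 2, 6, 7, 12, 14, 15, 16}
pattern_iii = {4, 7, 8, 9, 10, 11, 12, 13}

patterns = {"i": pattern_i, "ii": pattern_ii, "iii": pattern_iii}


def conformity_counts(patterns, participants=range(1, 17)):
    """Map each participant to the number of group patterns they exhibit."""
    return {
        p: sum(p in members for members in patterns.values())
        for p in participants
    }


counts = conformity_counts(patterns)
all_three = sorted(p for p, n in counts.items() if n == 3)
more_than_one = sorted(p for p, n in counts.items() if n >= 2)
no_pattern = sorted(p for p, n in counts.items() if n == 0)

print({name: len(members) for name, members in patterns.items()})
print(all_three)      # [7, 12]
print(more_than_one)  # [1, 2, 4, 7, 10, 12, 14, 15]
print(no_pattern)     # [3]
```

On this tally, nine participants show pattern (i) and eight each show patterns (ii) and (iii); eight participants conform to more than one group trend, and two of them conform to all three.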
Participants also differ in the magnitude of the information-structural distinctions they make. To
illustrate, Figure 3.5 shows the f0 ranges of individual participants in the condition of high word
frequency and high contextual probability. It appears that some people make clearer distinctions
than others. For example, the differences between wide and narrow focus are bigger in
participants 07 and 15 than participants 04 and 14. Participants 07 and 15 use substantially larger
f0 ranges for the utterances containing narrow focus than the utterances containing wide focus,
whereas participants 04 and 14 differentiate these two kinds of utterances to a lesser degree.
Similarly variable patterns are found in other conditions as well.
Figure 3.5: The observed f0 ranges (calculated from semitones standardized by speaker) in
the sentence region from the Focus interval to the Post-Focus interval for individual
participants in the condition of high word frequency and high contextual probability in
Experiment 1. A larger f0 range indicates that a bigger proportion of the speaker’s f0
register is employed.
Let us now consider how internally-consistent speakers are in terms of the (i) directionality and
(ii) magnitude of the information-structural distinctions that they produce. We see considerable
trial-by-trial variation in the direction of the information-structural distinctions produced by
individual participants (although the patterns reach significance in the group analysis).
Particularly, there is little indication of interactions between speaker (i.e. who is speaking) and
any of the informativity factors in terms of the direction of distinctions between different kinds
of information. In other words, the overall group results also hold on the level of individual
speakers, and it is generally not the case that, depending on who the speaker is, one particular type
of information would consistently lead to smaller (or larger) f0 ranges than another particular
type of information.
[Figure 3.5 plot: x-axis, Subject ID (01-16); y-axis, f0 excursion size (z-scored semitone); condition, High Freq + High Prob; legend (Focus): Corr, New, VP]
Interestingly, if we look at the magnitude of these distinctions, we find more speaker-internal
consistency. Some participants regularly produce much larger f0 ranges for one type of focus
than another, while some others regularly produce only slightly larger f0 ranges for one type of
focus than another. For example, let us take a close look at the participants who conform to more
than one group trend: participants 01, 02, 04, 07, 10, 12, 14 and 15. It appears that they can be
divided into two subgroups such that, across information-theoretic conditions, one subgroup
consistently produces stronger cues for information-structural distinctions than the other
subgroup. To illustrate, Figure 3.6 shows the differences in f0 ranges produced by the eight
participants in the information-theoretic conditions where they conform to the group trends
regarding the information-structural distinctions. These differences were calculated with respect
to the group trend in each condition, i.e. patterns (i-iii) summarized towards the beginning of this
subsection. Specifically, the bars for the high-frequency high-probability condition represent the
differences between wide focus and the other two types of focus (i.e. the f0 range in wide focus
subtracted from the f0 ranges in new-information narrow focus and corrective narrow focus,
based on pattern (i)), the bars for the high-frequency low-probability condition represent the
differences between narrow new-information focus and the other two types of focus (i.e. the f0
ranges in wide focus and corrective focus subtracted from the f0 range in new-information
narrow focus, based on pattern (ii)), and the bars for the low-frequency high-probability
condition represent the differences between narrow corrective focus and the other two types of
focus (i.e. the f0 ranges in wide focus and new-information focus subtracted from the f0 range in
corrective narrow focus, based on pattern (iii)). For the participants who do not conform to all
three of these group patterns (i.e. participants 01, 02, 04, 10, 14 and 15), I only calculated the
differences in f0 ranges for the information-theoretic conditions where they do. Thus, we can see
that participants 02, 07, 12 and 15 produce pattern (i) with larger differences than participants 01,
04, 10 and 14, participants 02, 07, 12 and 15 produce pattern (ii) with larger differences than
participants 01 and 14, and participants 07 and 12 produce pattern (iii) with larger differences
than participants 04 and 10. Essentially, in a given information-theoretic condition, participants
02, 07, 12 and 15 consistently use larger differences in f0 ranges than participants 01, 04, 10 and
14 for a given direction of information-structural distinctions. In general, this observation leads
us to speculate that what matters (in terms of encoding and perceiving informativity) are not the
absolute but rather the relative values.
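The subtraction scheme described above can be made concrete with a short sketch. The condition labels follow the legend of Figure 3.6; the dictionary layout, the function names, the sign-based conformity check, and the f0 range values are hypothetical and serve only as illustration. The text does not state how the two pairwise differences in each condition are combined into a single bar, so the sketch returns them separately.

```python
# For each condition: (focus types expected to have the larger f0 range,
#                      focus types whose f0 range is subtracted from them).
PATTERNS = {
    "HiFreqHiProb": (["new", "corr"], ["wide"]),  # pattern (i)
    "HiFreqLoProb": (["new"], ["wide", "corr"]),  # pattern (ii)
    "LoFreqHiProb": (["corr"], ["wide", "new"]),  # pattern (iii)
}


def pattern_differences(ranges, condition):
    """Return the pairwise f0-range differences defined by the group pattern.

    ranges: dict mapping focus type ("wide", "new", "corr") to one
    participant's f0 range in the given condition.
    """
    larger, smaller = PATTERNS[condition]
    return [ranges[a] - ranges[b] for a in larger for b in smaller]


def conforms(ranges, condition):
    """Assumed criterion: every difference points in the group direction."""
    return all(d > 0 for d in pattern_differences(ranges, condition))


# Hypothetical f0 ranges (z-scored semitones) for one participant.
example = {"wide": 1.0, "new": 1.5, "corr": 2.0}
diffs_i = pattern_differences(example, "HiFreqHiProb")  # [new - wide, corr - wide]
```

Under this layout, comparing the size of the returned differences across participants corresponds to the magnitude comparison in the text, while the sign of each difference corresponds to its direction.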
Figure 3.6: The observed differences in f0 ranges (calculated from semitones standardized
by speaker) in the sentence region from the Focus interval to the Post-Focus interval for
individual participants who conform to more than one group trend in Experiment 1. The
differences were calculated based on the group trend in each condition.
To sum up, when we look at individual differences in how speakers encode informativity
prosodically, in every given information-theoretic condition, about half of the speakers clearly
exhibit the f0 range patterns that we observed for the group as a whole in terms of which
information-structural types have larger vs. smaller f0 ranges, and the remaining speakers show
more variable data. In terms of the magnitude of their f0 ranges, speakers are largely internally
consistent, and my data suggest that speakers differ in how much they modulate f0 to signal
informativity. Broadly speaking, this suggests that what matters in terms of encoding
information-theoretic notions prosodically are relative, not absolute, values – an observation in
line with prior work on prosody and information structure (e.g. Andreeva, Barry, & Steiner,
2007).
3.4. Discussion
In the preceding section, I analyzed how individual participants in Experiment 1 differed in the
overall prosody of utterances and the prosodic encoding of informativity. Prior work on prosody
mostly focuses on the general trends among speakers, and little has been said about the
differences between or within speakers. In this section, I will discuss how my results speak to the
nature and extent of individual variation, and what their broader implications are.
[Figure 3.6 plot: x-axis, Subject ID (01, 02, 04, 07, 10, 12, 14, 15); y-axis, ∆ f0 excursion size (z-scored semitone); series: HiFreqHiProb, HiFreqLoProb, LoFreqHiProb]
Let us consider the shapes of f0 contours, the ranges of f0 values, the directionality of differences
in f0 ranges (i.e. which conditions have larger/smaller f0 ranges than other conditions), and the
magnitude of differences in f0 ranges (i.e. how much larger/smaller the f0 ranges are in one
condition than another). As we saw in Section 3.3, if we look at the overall prosody and f0
ranges that speakers produce, abstracting away from informativity notions, we find that speakers
differ from one another but are internally quite consistent. In other words, individual speakers
have preferences with regard to the shapes of f0 contours and the ranges of f0 values, generally
speaking. Then, when we look at how individual speakers encode informativity notions
prosodically, we find that the group-level patterns regarding the directionality of information-
structural distinctions are exhibited by many, but not all, speakers. Interestingly, when we look
more closely at how internally consistent speakers are in this regard, we find that speakers show
considerable internal variation in the directionality of distinctions they produce (i.e. whether a
particular type of focus has larger or smaller f0 ranges than another particular type of focus). In
contrast, in terms of the magnitude of distinctions they produce (i.e. how much larger or smaller
the f0 ranges are in one particular type of focus than another), speakers are more internally
consistent while, again, different from one another. Nevertheless, the group patterns are
statistically significant (in analyses that include subjects and items as random factors), and thus
we can conclude that they are still meaningful even in the face of individual variation.
As noted in the preceding chapter, the group analysis reveals three main patterns which highlight
the interplay of information theory and information structure: (i) wide focus has smaller f0
ranges than both types of narrow focus in the condition of frequent and probable words, (ii)
narrow new-information focus has larger f0 ranges than narrow corrective focus and wide focus
in the condition of frequent but improbable words, and (iii) narrow corrective focus has larger f0
ranges than narrow new-information focus and wide focus in the condition of infrequent but
probable words. The results presented in the current chapter show that about half of the
participants clearly exhibit these patterns. Importantly, there is no other ‘competitor pattern’ that
emerges from the data, as the rest of the participants exhibit more than one other pattern (e.g.
some make corrective focus the least prominent while others make new-information focus the
least prominent, as can be seen in Figure 3.5).
Thus, we observe a set of patterns that are, in each condition, exhibited by a large subset of
participants, despite other, seemingly highly variable and non-systematic patterns. It seems that
speakers loosely follow principles determined by information-theoretic factors and information
structure, and collectively show a systematic relationship between prosodic prominence and
informativity. A related phenomenon has been found in the field of speech processing. Studies
on accent prediction have argued that using speaker-dependent parameters does not substantially
improve a model’s performance in predicting whether a word receives an accent or not, because
the variability in placing an accent or not between speakers is similar to that within a speaker
(Badino & Clark, 2007; Shriberg, Ladd, Terken, & Stolcke, 1996; Yuan, Brenier, & Jurafsky,
2005). My results are consistent with these findings.
While the directions of differences in f0 ranges are closely tied to informativity factors, some
other aspects of f0 – including the ranges of f0 values, the sizes of differences in f0 ranges, and
the shapes of f0 contours – appear to show speaker-specific behavior. Given the multi-
functionality of prosody, it is not surprising that these other f0 parameters do not supply strong
cues for the particular factors I investigated. In terms of the range of f0 values in an utterance
and the magnitude of fluctuations in f0 ranges across utterances, prior work has found that these
aspects of f0 ranges can reflect the speaker’s emotions and psychological traits. For example, sad,
depressed, anxious, irritated, tense or fearful speech employs more limited f0 ranges than happy
or angry speech (e.g. Johnstone & Scherer, 1999; Morley, van Santen, Klabbers, & Kain, 2011).
Furthermore, children and young adults with autistic spectrum disorders use more exaggerated f0
ranges than individuals with typical development (e.g. Hubbard & Trauner, 2007; Paul, Bianchi,
Augustyn, Klin, & Volkmar, 2008; Sharda et al., 2010). Thus, it is likely that the speaker-
specific patterns regarding f0 ranges observed in this study correlate with individual participants’
mood or personal characteristics.
Similarly, f0 shapes have been shown to convey many other kinds of pragmatic meanings that
are not investigated in this study, such as the speaker’s attitudes or how the speaker’s current
utterance is related to a subsequent one (e.g. Pierrehumbert & Hirschberg, 1990; Ward &
Hirschberg, 1985, 1986). Due to the nature of Experiment 1 (i.e. reading aloud sentence pairs),
the stimuli were not fully specified in these respects and were subject to each participant’s own
interpretations. Therefore, the presence of speaker-specific patterns in f0 shapes might imply that
individual speakers have preferences regarding how to fill in unspecified details at the pragmatic
level. This is an interesting question that would benefit from future work.
Thus, based on my results, it appears that f0 shapes are less informative than f0 ranges in
distinguishing the three information-structural types of interest, namely corrective narrow focus,
new-information narrow focus, and wide focus. F0 shapes differentiate these three types of focus
when we look at all speakers as a whole, but not when we look at each speaker individually. In
contrast, the directions of differences in f0 ranges distinguish focus types at both the group level
and the individual level. This suggests that f0 ranges may have a greater contribution than f0
shapes to the prosodic marking of information structure. I leave this question open for future
work.
In sum, this study contributes to our understanding of individual differences, providing empirical
evidence for inter- and intra-speaker variability in the prosodic encoding of informativity. My
results are consistent with previous observations that prosody can exhibit speaker-specific
behavior. Furthermore, I show that apparent differences among the participants in a study do not
necessarily constitute stable speaker-specific patterns. Instead, the prosodic dimensions that do
not show participants’ individual preferences may be the key dimensions that reflect the
linguistic distinctions in question (e.g. the direction of differences in f0 ranges in this study). In
addition, I discuss possible explanations for speaker-specific behavior in those prosodic
dimensions where speaker-specific effects are found. Prosody appears to be highly multi-
functional and tolerant of inter-speaker variability to a considerable extent.
Chapter 4: The Role of Interlocutors in the Prosodic Encoding of Information Structure
4.1. Introduction
The results of Experiment 1 show that prosody is influenced by a complex interplay between
information structure and information-theoretic properties. In Chapter 2, I proposed an
interaction-based account that explains the observed patterns in terms of epistemic surprisal. I
use the term ‘epistemic surprisal’ to refer to the degree of difference between what actually
happened and what the speaker had expected. I speculated that when participants answered
questions or corrected misstatements in Experiment 1 (the study where participants worked with
a partner in reading aloud sentence pairs), they produced greater prosodic prominence in their
responses to unexpected questions and statements than to expected ones, because the former had
higher epistemic surprisal than the latter. Specifically, participants might have expected their
partner to have common-sense world knowledge, and therefore were surprised when their partner
asked about or was mistaken about highly predictable information (see Section 2.5.2 for more detailed
discussion). Thus, in this case, epistemic surprisal involves two key elements: the speaker’s
expectation prior to the conversation and the interlocutor’s behavior during the conversation. In
the rest of this section, I review prior work concerning the effects of these two factors in
language production. I will first consider them – speakers’ prior expectations and their
interlocutors’ actual behavior – separately, and then turn to the interplay between them.
Although relatively little research has explicitly focused on (i) the speaker’s prior expectations
and (ii) the interlocutor’s actual behavior, it has been shown that both factors indeed influence
various aspects of utterances, such as the number of words in a sentence, the amount of details in
a description, the choice of referential expressions, the use of words (e.g. literally vs.
figuratively), as well as prosody. For instance, among the existing work discussed in Section 1.3,
participants in some studies were asked to imagine that they were communicating with particular
types of addressees, such as non-native speakers, people who have hearing loss or speech
impairments, strangers, or participants themselves (e.g. Baker & Bradlow, 2009; Fougeron, 2004;
Fougeron & Kewley-Port, 2007; Fussell & Krauss, 1989a; Hummert & Shaner, 1994). It is worth
noting that in these particular studies, participants were not actually talking to an addressee; they
were simply instructed to act as if they were talking to another person. Thus, the effects of
perspective-taking in this set of studies clearly result from speakers’ own assumptions about their
intended addressees. Previous research on language perception has also shown that listeners’
reactions to speakers’ utterances depend on what they know about the speakers (e.g. Arnold, Kam,
& Tanenhaus, 2007).
In contrast, participants in some other studies worked with ‘live’, physically present partners
whose task-relevant characteristics, e.g. their levels of expertise on the conversation topic or
whether they anticipated what the participants were going to say, were unknown to the
participants in advance (e.g. Arnold, Kahn, & Pancani, 2012; Isaacs & Clark, 1987). Therefore,
the effects of perspective-taking in these studies clearly result from what the partners did and/or
said in their interaction with the participants.
Thus, existing work suggests that perspective-taking involves at least two dimensions: (i) what
speakers have assumed about their partners at the start of the conversation, regardless of whether
they are talking to real or imaginary partners, and (ii) what speakers learn about their partners
during the conversation, based on their partners’ utterances or behavior.
Many other studies on perspective-taking, however, do not clearly distinguish the effects of
speakers’ initial assumptions from the effects of their partners’ actual behavior (e.g. Bavelas,
Coates, & Johnson, 2000; Galati & Brennan, 2010; Kaland, Swerts, & Krahmer, 2013;
Lockridge & Brennan, 2002; Pasupathi, Stallworth, & Murdoch, 1998; Rosa & Arnold, 2011;
Rosa, Finch, Bergeson, & Arnold, 2015; Yoon & Brown-Schmidt, 2014). Most such studies
address issues where the distinction between these two factors is irrelevant or not crucial.
Participants in these studies were told about their partners’ knowledge or attention states before
the start of the conversation, and their partners’ behavior during the conversation matched their
expectations.
Even less attention has been paid to the interplay between speakers’ expectations and
interlocutors’ behavior (but see Brennan, 1991; Kuhlen & Brennan, 2010; Russell & Schober,
1999). Existing work suggests that what speakers expect their interlocutors to do/say and what
the interlocutors actually do/say over the course of the conversation can contribute to the shaping
of utterances to different extents. For example, Russell and Schober (1999) show that speakers
stick with their assumptions based on information they receive prior to the conversation,
regardless of what their interlocutors’ utterances suggest, whereas Brennan (1991) shows that
speakers abandon their initial assumptions and adjust to what their interlocutors’ utterances
reveal:
In Russell and Schober (1999), some participants were led to have false initial assumptions about
their partners’ conversation goals and the amount of information needed in order to achieve these
goals. Specifically, pairs of participants described abstract figures to each other, and they had
different task goals that required different amounts of information. One person had to identify the
exact figure being described, and the other person only had to determine whether the figure fell
inside or outside a region on their sheet. Pairs of participants were either informed of the goal
differences, misinformed that their goals were the same, or uninformed about the goal
differences. It was found that the misinformed participants tailored their utterances based on their
false assumptions and never changed the amount of detail they provided over the course of the
experiment, even in the face of unexpected feedback from their partners that could have
informed them about their partners’ true needs. Furthermore, the uninformed participants
patterned with the misinformed participants, suggesting that they also falsely assumed and stuck
with their assumptions that their own task goals were shared by their partners.
In Brennan (1991), some participants were told that they were communicating with human
partners, when all participants were in fact interacting with computer programs. However, some
‘partners’ produced human-like, complete responses, while others produced computer-like, short
responses. It was found that participants initially spoke in a style that stemmed from their prior
beliefs, i.e., they used complete sentences for human partners but short sentences for computer
partners. However, as the experiment went on, participants adapted their utterances to match the
styles of their partners’ responses, regardless of their own prior expectations and initial strategies.
This shows that people can revise the assumptions they have before the start of the conversation,
as well as change the way they speak, based on the way their interlocutors speak.
Kuhlen and Brennan (2010) provide additional evidence that utterances are shaped by the
interaction between speakers’ prior expectations and addressees’ actual behavior. In a joke-
retelling task, participants told the jokes with more vivid details when they (i) expected attentive
addressees (based on information that had been provided before the start of the task) and (ii)
were in fact talking to attentive addressees. In contrast, participants produced less vivid
narrations when they either (i) expected distracted addressees or (ii) were in fact talking to
distracted addressees. Moreover, participants put in more effort when their addressees’ behavior
matched their expectations than when it did not. That is, participants who expected attentive
addressees spent more time telling the jokes to attentive addressees than distracted addressees,
while those who expected distracted addressees spent more time telling the jokes to distracted
addressees than attentive addressees.
These complex patterns reveal that what speakers know about their interlocutors prior to the
conversation and what speakers learn about their interlocutors during the conversation both play
important roles in determining the way speakers say their utterances. Nevertheless, it remains
unclear how these two factors interact in shaping utterances, since little work has been devoted
to this question.
4.2. Experiments 2 and 3: Aims and expected outcome
In Sections 1.3 and 4.1, I reviewed existing work which shows that the way people speak greatly
depends on the kind of interlocutors they are talking to (or that they assume they are talking to).
These prior findings on perspective-taking support the idea of ‘epistemic surprisal’ that I use to
explain the complex interplay between information structure and information-theoretic properties
in Chapter 2. My interaction-based explanation states that the degree of prosodic prominence
reflects the extent to which interlocutors’ actual behavior differs from speakers’ prior
expectations. Since little existing work has focused on either (i) the interaction between
information-structural and information-theoretic factors or (ii) the mismatches between speakers’
expectations and interlocutors’ behavior, the objectives of this study are twofold. First, I wanted
to replicate my findings on (i), namely the results in Chapter 2. Second and more importantly, I
also wanted to provide further evidence for my interaction-based hypothesis in terms of (ii),
namely epistemic surprisal.
To fulfill these objectives, I conducted Experiments 2 and 3, which differed from Experiment 1
in three main aspects:
The first difference is one of scope. The results of Experiment 1 show that the prosodic
prominence associated with corrective focus and new-information focus depends on the
focused word’s contextual probability and word frequency, respectively. Experiments 2 and 3
concentrate on corrective focus and contextual probability, with the aim of replicating the findings
in a different experimental setting (see the other differences below).
The second, and more important, difference is that in Experiment 1 the speaker
corrected a statement uttered by the interlocutor that the speaker knew to be incorrect. For
example, the interlocutor said ‘I heard that Dawn and Alice got gloves at the sports store’ and
the speaker responded to this with ‘No, they got [balls] at the sports store.’ In other words, the
speaker corrects the object. Contextual probability was manipulated only on the correctively-
focused object (balls vs. fish in the response); the incorrect information was held constant
(gloves in the misstatement uttered by the interlocutor). However, in Experiments 2 and 3, I
manipulated contextual probability in both the interlocutor’s misstatements and the speaker’s
responses. This will be explained in more detail in the following section.
I added this dimension to the design because investigating the effects of interlocutors’
statements is necessary for testing my interaction-based hypothesis, which crucially makes
reference to interlocutors’ utterances. Furthermore, the incorrect information – as well as the
correction – is an integral part of corrective focus structure, since correction is by default in
contrast to pre-existent incorrect information. Examining the effects of contextual probability in
both parts of corrective focus structure can provide a more fine-grained understanding of how
contextual probability interacts with corrective focus. For instance, in the example above, does
the contextual probability of the incorrect object ‘gloves’ also play a role in shaping the prosody
of the correction? Would speakers produce the correction differently if the incorrect object is
contextually improbable (e.g. I heard that Dawn and Alice got cereal at the sports store)?
Finally, Experiment 1 and the experiments presented in this chapter (Experiments 2 and
3) also differ in terms of how the contextual probability is determined, to test my interaction-
based hypothesis. I established contextual probability through the context provided by preceding
discourse (see the following section for details) and manipulated speakers’ expectations about
their interlocutors’ knowledge states. Specifically, in Experiment 2 (shared knowledge),
speakers and their interlocutors both had the knowledge necessary for assessing the contextual
probability. In Experiment 3 (privileged knowledge), the crucial context was exclusively
presented to the speakers and unavailable to their interlocutors. In other words, in Experiment 2,
contextual probability was part of the common ground between the speakers and their
interlocutors, but in Experiment 3 contextual probability was part of the speakers’ private
knowledge.
Predictions
Thus, the interlocutors’ knowledge state in Experiment 2 was similar to that in Experiment 1, where the
contextual probability was based on common-sense world knowledge, which I assumed was also
shared between the speakers and their interlocutors. Therefore, if my interaction-based
explanation for Experiment 1’s results is correct, I expect that contextually improbable
misstatements by the interlocutors would elicit high prosodic prominence in the corrective
responses in Experiment 2. This is because it would be surprising to the speakers that their
interlocutors, despite the awareness of the context, had contextually improbable beliefs (because
their interlocutors ‘should have known better’). In other words, the interlocutors’ improbable
utterances should evoke high epistemic surprisal, which should motivate high prosodic
prominence in the speakers’ utterances. In contrast, I expect that contextually probable
misstatements by the interlocutors would not elicit high prosodic prominence in Experiment 2,
because the interlocutors’ beliefs did not contradict the speakers’ assumption about their
interlocutors’ knowledge state. In other words, the interlocutors’ probable utterances should only
cause low epistemic surprisal, which should not provide motivation for high prosodic
prominence in the speakers’ utterances.
In addition, because of the close similarity between Experiments 2 and 1, I expect Experiment 2
to resemble Experiment 1 in terms of how contextual probability modulates the effect of
corrective focus. That is, high prosodic prominence would occur in corrective focus when the
focused word was contextually probable, but not when the focused word was contextually
improbable.
For Experiment 3, I expect to see substantially different patterns from Experiment 2, since the
speakers had entirely different assumptions about their interlocutors. In this study, it was made
very clear to the speakers that their interlocutors had no knowledge about the contextual
probability of the words. If my predictions for Experiments 2 and 3 are borne out, we can then
look into how speakers’ prior expectations and interlocutors’ actual behavior interact in shaping
the prosody of utterances.
4.3. Experiments 2 and 3: Methods
Similar to Experiment 1, Experiments 2 and 3 were production studies with an interactive set-up.
Each trial consisted of a production task and a subsequent comprehension task. Naïve
participants worked in pairs on the production task and independently on the comprehension task.
The production task provided the critical recordings, namely the target sentences produced by
participants. The comprehension task was included to engage both people in the production task,
as paying attention to the dialogue was necessary to successfully perform the comprehension
task.
4.3.1. Design and procedures
A trial began with a production task, where two naïve participants (Speakers A and B) worked
with each other in reading aloud dialogues. In both Experiments 2 and 3, the primary speaker of
interest is Speaker A. I will first present the design and set-up of Experiment 2, and then turn to
Experiment 3.
In Experiment 2, each dialogue consisted of five sentences (Sentences 1-5), as shown in
examples (4.1)-(4.2). Participants saw the text of the sentences on paper. Sentences 1 and 2 were
spoken by Speaker A, introducing a character (e.g. Jacky) and his or her preference or need (e.g.
Jacky likes fruit in her salads but not vegetables, or Zac likes vegetables in his salads but not
fruit). Sentences 3 and 4 were spoken by Speaker B. Sentence 3 provided information about a
recent event (e.g. Jacky/Zac going grocery shopping). Sentence 4 commented on this event,
starting with “I heard that…” and describing something this person did. Crucially, this described
event either matched or mismatched the information from Sentences 1-2 (e.g. match: Jacky got
apples; mismatch: Zac got apples) – I refer to this manipulation as Statement Type. Once
Sentence 4 was produced, the other speaker, Speaker A, spoke Sentence 5, starting with either
“Yes” or “No” to confirm or correct the previous sentence. Similar to the manipulation in
Sentence 4, this described event either matched or mismatched the information from Sentences
1-2 – I refer to this manipulation as Response Type. Participants thus interacted with each other
to produce the dialogues, and each participant only had access to the text of sentences that he or
she was responsible for.
(4.1) CORRECTIVE FOCUS
A: Jacky prefers her salad a certain way. [Sentence 1]
A: She loves fruit but hates vegetables. [Sentence 2]
B: She went grocery shopping yesterday evening. [Sentence 3]
B: I heard that she got some apples (match) at the farmer’s market. [Sentence 4]
A: No, she got some [lettuce] (mismatch; CORRECTIVE) at the farmer’s market. [Sentence 5]
(4.2) NON-CORRECTIVE INFORMATION
A: Zac tends to put certain things in his salad. [Sentence 1]
A: He loves vegetables but hates fruit. [Sentence 2]
B: He went grocery shopping this morning. [Sentence 3]
B: I heard that he got some apples (mismatch) at the supermarket. [Sentence 4]
A: Yes, he got some [apples] (mismatch; NON-CORRECTIVE) at the supermarket. [Sentence 5]
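The coding of examples (4.1) and (4.2) can be sketched as a small data structure. This is an illustrative reconstruction under stated assumptions, not the materials or scripts used in the experiments: the class name, field names and toy preference sets are hypothetical. Statement Type and Response Type are derived by checking the object noun against the character's stated preference, and correctiveness by comparing the two object nouns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dialogue:
    """One dialogue, reduced to the parts relevant for condition coding."""
    preferred: frozenset   # objects consistent with Sentence 2 (toy assumption)
    statement_object: str  # object noun in Sentence 4 (Speaker B)
    response_object: str   # object noun in Sentence 5 (Speaker A)

    def _fit(self, noun):
        return "match" if noun in self.preferred else "mismatch"

    def statement_type(self):
        """Does Speaker B's statement fit the character's preference?"""
        return self._fit(self.statement_object)

    def response_type(self):
        """Does Speaker A's response fit the character's preference?"""
        return self._fit(self.response_object)

    def is_corrective(self):
        """Speaker A corrects when the response object replaces the stated one."""
        return self.response_object != self.statement_object

    def response_particle(self):
        return "No" if self.is_corrective() else "Yes"


# Example (4.1): Jacky loves fruit; B says apples, A corrects to lettuce.
ex_4_1 = Dialogue(frozenset({"apples"}), "apples", "lettuce")
# Example (4.2): Zac loves vegetables; B says apples, A confirms apples.
ex_4_2 = Dialogue(frozenset({"lettuce"}), "apples", "apples")
```

Coding the two examples this way reproduces the labels shown above: (4.1) is match, then mismatch and corrective; (4.2) is mismatch, then mismatch and non-corrective.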
The experiment had 192 target items and 96 fillers. Each pair of participants encountered 48-96
items and did not see any item more than once. Each participant served as Speaker A in half of
the dialogues and Speaker B in the other half, i.e., the roles of Speakers A and B were intermixed
throughout the study. These two halves had different sets of characters and scenarios. For
example, one participant began and finished all the dialogues that involved Gary, Lauren, and the
part of their house that they need to buy new things for (e.g. bathroom vs. patio), whereas the
other participant began and finished all the dialogues about Jacky, Zac, and the kind of salad they
like (e.g. fruit vs. vegetable). Before the main experiment, there was a familiarization phase,
during which participants were asked to memorize the preferences and needs of the set of
characters occurring in the half of the dialogues where they would be serving as Speaker A (e.g.
Jacky loves fruit but hates vegetables, whereas Zac loves vegetables but hates fruit). There were
sixteen characters and six scenarios in the dialogues; each participant was familiarized with eight
people and three scenarios. Participants’ feedback suggested that this was doable. A full list of
the target items can be found in Appendix 2.
In all dialogues, Sentence 4 (Speaker B’s statement) and Sentence 5 (Speaker A’s response) were
the critical sentences, which contained transitive clauses with the following structure: a third-
person singular pronoun subject, a simple past tense verb, an object noun phrase, and a
prepositional phrase indicating a location or beneficiary. The critical word of interest is the head
noun of the object noun phrase (e.g. apples or lettuce). To investigate the modulation of
corrective prosody by contextual probability, and the role of interlocutors in such modulation, I
manipulated the information-structural status of Speaker A’s response with respect to Speaker
B’s statement, as exemplified in (4.1)-(4.2), and the contextual probability of both people’s
utterances, relative to the character’s preference and need.
Furthermore, in order to distinguish what speakers initially expect about their interlocutors from
what speakers learn about their interlocutors during the conversation, I conducted a parallel
experiment, Experiment 3, and manipulated Speaker B’s knowledge state between the two
experiments. Specifically, in Experiment 3, I created a knowledge gap between Speakers A and
B by removing Speaker B’s access to Sentence 2, the sentence introducing the character’s
preference/need and thereby establishing a contextual bias. Participants were asked not to read
Sentence 2 aloud, which essentially made it Speaker A’s private knowledge. In other words, the
contextual probability of the critical words was privileged information kept by Speaker A and
unavailable to Speaker B. Everything else was held the same between Experiments 2 and 3.
Thus, four independent variables were implemented:
(i) Correctiveness (Corrective vs. Non-Corrective Information): whether the object head
noun in Speaker A’s response (Sentence 5) was in corrective focus. For instance, in ex. 4.1, the
object ‘lettuce’ in Speaker A’s response (No, she got some lettuce at the farmer’s market) is
corrective information because Speaker A was correcting Speaker B’s statement (Sentence 4, I
heard that she got some apples at the farmer’s market). In contrast, in ex. 4.2, the object ‘apples’
in Speaker A’s response (Yes, he got some apples at the supermarket) is non-corrective
information because Speaker A was confirming Speaker B’s statement (I heard that he got some
apples at the supermarket).
(ii) Contextual probability of the statement; Statement Type (Probable vs.
Improbable Statement): whether the object head noun in Speaker B’s statement (e.g. apples in ex.
4.1) matched or conflicted with the character’s preference/need. For instance, in ex. 4.1, Speaker
B says that s/he heard Jacky bought apples, and we know Jacky loves fruit, so ex. 4.1 shows a
contextually probable statement. However, consider an alternative where Speaker B says that
s/he heard Jacky bought lettuce. Because we know from the preceding context that Jacky hates
vegetables, ‘she bought lettuce’ is a contextually improbable statement here.
(iii) Contextual probability of the response; Response Type (Probable vs. Improbable
Response): whether the object head noun in Speaker A’s response (e.g. lettuce in ex. 4.1)
matched or conflicted with the character’s preference/need. For instance, in ex. 4.1, Speaker A
says Jacky bought lettuce, but we know from the preceding context that Jacky hates vegetables;
therefore, ex. 4.1 shows a contextually improbable response. However, consider an alternative
where Speaker A says Jacky bought apples. Because we know from the preceding context that
Jacky loves fruit, ‘she bought apples’ is a contextually probable response here.
(iv) Knowledge Type (Shared vs. Privileged Knowledge), whether the contextual
probability of the object head nouns was shared knowledge (Experiment 2) or Speaker A’s
privileged knowledge (Experiment 3).
Knowledge Type was manipulated between experiments, and all the other three independent
variables were manipulated within experiments. Thus, both Experiments 2 and 3 had six
conditions, as shown in Table 4.1. The dependent variable I measured was the f0 range
(calculated by subtracting the f0 minimum from the f0 maximum) in the object head noun of
Speaker A’s response (e.g. lettuce in ex. 4.1).
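As a minimal sketch of this measure, the f0 range of a word can be computed as the maximum minus the minimum over its voiced f0 samples. The function name, the use of None for unvoiced frames, and the sample values are illustrative assumptions and are not taken from the original analysis scripts.

```python
from typing import Optional, Sequence


def f0_range(f0_track: Sequence[Optional[float]]) -> float:
    """Return the f0 maximum minus the f0 minimum for one word.

    Unvoiced frames are assumed to be coded as None and are skipped.
    Values may be negative if the track has already been normalized.
    """
    voiced = [value for value in f0_track if value is not None]
    if not voiced:
        raise ValueError("no voiced frames in this word")
    return max(voiced) - min(voiced)


# Hypothetical f0 samples for one object head noun (arbitrary units).
example_track = [2.1, 3.4, None, 5.0, 4.2, 1.8]
example_range = f0_range(example_track)
```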
Condition                                                     Examples of objects in the statement and response (Statement-Response)
Probable Statement, Probable and Corrective Response          apples-cherries; cherries-apples
Probable Statement, Probable and Non-Corrective Response      apples-apples; cherries-cherries
Probable Statement, Improbable and Corrective Response        apples-lettuce; apples-spinach; cherries-lettuce; cherries-spinach
Improbable Statement, Probable and Corrective Response        lettuce-apples; lettuce-cherries; spinach-apples; spinach-cherries
Improbable Statement, Improbable and Corrective Response      lettuce-spinach; spinach-lettuce
Improbable Statement, Improbable and Non-Corrective Response  lettuce-lettuce; spinach-spinach
Table 4.1: Conditions in Experiments 2 and 3. The examples are from items where the
character ‘loves fruit but hates vegetables’ (such as ex. 4.1).
There were 48 items in each of the four corrective conditions and 24 items in each of the two non-corrective
conditions. The number of items in each condition varies because, in non-corrective
conditions, Speaker A’s response (Sentence 5) could only repeat Speaker B’s statement
(Sentence 4). For example, as shown in Table 4.1, if the object in Speaker B’s statement was
‘lettuce’, the object in Speaker A’s non-corrective response had to also be ‘lettuce’ and could not
be anything else. In contrast, corrective conditions had more flexibility in the choice of objects.
For example, the object in Speaker A’s corrective response could be ‘apples’, ‘cherries’, or
‘spinach’ when the object in Speaker B’s statement was ‘lettuce’.
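The asymmetry between corrective and non-corrective conditions can be illustrated with a short sketch that enumerates every statement-response pair for one scenario and labels it with the three within-experiment variables. The function and variable names are hypothetical; the object lists are those of the fruit vs. vegetables scenario for a character who loves fruit but hates vegetables.

```python
from collections import Counter
from itertools import product


def classify_pairs(liked, disliked):
    """Label each (statement object, response object) pair with its condition."""
    conditions = {}
    for statement, response in product(liked + disliked, repeat=2):
        statement_type = "Probable" if statement in liked else "Improbable"
        response_type = "Probable" if response in liked else "Improbable"
        # A response is non-corrective only when it repeats the statement.
        focus = "Non-Corrective" if statement == response else "Corrective"
        conditions[(statement, response)] = (statement_type, response_type, focus)
    return conditions


pairs = classify_pairs(["apples", "cherries"], ["lettuce", "spinach"])
pairs_per_condition = Counter(pairs.values())

for condition, count in sorted(pairs_per_condition.items()):
    print(condition, count)
```

For this scenario the enumeration yields the six conditions listed in Table 4.1, and a non-corrective pair arises only when the response repeats the object of the statement.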
Norming study
The contextual probability of the critical words was estimated through a web-based norming
study (conducted on Amazon Mechanical Turk, https://www.mturk.com/). 132 native speakers of
American English (who did not participate in any of the other experiments reported in this
dissertation) performed a rating task. They saw sequences of four sentences and judged the
probability of the fourth sentence using a 7-point scale. The first three sentences in a sequence
constructed a scenario involving someone’s preference or need (e.g. Jacky likes her salad a
certain way. She loves fruit but hates vegetables. Yesterday evening she went grocery shopping.)
The last sentence described something this person did (e.g. She got some apples at the farmer’s
market.) Participants were told to rate “how likely it is that the following event [referring to the
last sentence] took place” “if the above statements [referring to the first three sentences] are true”.
I manipulated the person’s preference/need (e.g. loves fruit but hates vegetables) and the object
noun in the last sentence (e.g. cherries), such that what the person did in the last sentence might
seem probable (e.g. loves fruit and bought apples) or improbable (e.g. hates fruit but bought
apples) in the context.
I also tested various choices of wording (e.g. loves/hates vs. likes/doesn’t like) and background
settings (e.g. yesterday evening vs. this morning, supermarket vs. farmer’s market). There were
270 items in total; 21 to 25 responses were collected for each item. Six scenarios were ultimately
selected, each with four objects and two background settings, as shown in Table 4.2. All the
selected items had median ratings not smaller than 5 in the probable condition and not larger than
3 in the improbable condition.
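The selection criterion can be sketched as a simple filter on median ratings. The thresholds (5 and 3 on the 7-point scale) come from the text above; the function name and the example ratings are hypothetical and do not reproduce the norming data.

```python
from statistics import median


def passes_norming(probable_ratings, improbable_ratings,
                   probable_floor=5, improbable_ceiling=3):
    """Check that an item is rated likely in the probable context
    (median not smaller than 5) and unlikely in the improbable context
    (median not larger than 3)."""
    return (median(probable_ratings) >= probable_floor
            and median(improbable_ratings) <= improbable_ceiling)


# Made-up ratings for one candidate item in its two contexts.
keep_item = passes_norming([6, 7, 5, 6, 4], [2, 1, 3, 2, 4])
drop_item = passes_norming([4, 4, 5, 6, 3], [2, 2, 2])
```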
Preference/Need (X vs. Y)     Category X Objects      Category Y Objects
meat vs. seafood              beef, lamb              fish, shrimp
fruit vs. vegetables          apples, cherries        lettuce, spinach
bathroom vs. patio stuff      bath mat, face wash     lawn chairs, yard lights
bedroom vs. kitchen stuff     dresser, mattress       blender, mixer
carpenter vs. chef tools      hammers, wrenches       burgers, pizzas
farm vs. jungle animals       cow, sheep              bear, lion
Table 4.2: Manipulation of contextual probability in Experiments 2 and 3. A full list of the
target items can be found in Appendix 2.
Comprehension task
Each trial had two parts: the production task (described above) and a subsequent comprehension
task. The comprehension task began immediately after the production task was completed on
each trial. Participants saw a question about the dialogue they just read. The question asked
about one of the three elements in the dialogue: the character’s preference or need (e.g. What
does Jacky prefer to eat?), Speaker B’s statement (e.g. What did people think that Jacky got?), or
Speaker A’s response (e.g. What did Jacky get?). Participants were asked to write their answers
on paper. Take the dialogue in (4.1) for example: the correct answers are that Jacky prefers fruit,
people thought that she got apples, but she actually got lettuce. On a given trial, the two
participants never saw the same question, and therefore each participant had to recall the
information on his or her own.
4.3.2. Participants
Twenty-six native speakers of American English participated in Experiments 2 and 3, six pairs in
Experiment 2 and seven pairs in Experiment 3. All participants were students or staff at the
University of Southern California.
4.3.3. Data analysis
850 utterances were collected from the 26 participants, each producing 24-48 target responses.
Out of the full set of data, 54 utterances (6.4%) were not included in the data analysis due to
speech errors or disfluencies. F0 measurements were obtained using the YAAPT (Yet Another
Algorithm for Pitch Tracking) algorithm (Zahorian & Hu, 2008). Then, similar to Experiment 1,
the raw f0 values were smoothed, converted into a semitone scale, and normalized by subject
(see Section 2.3.4 for more details).
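The conversion and normalization steps can be sketched as follows. The semitone formula is the standard one, but the reference frequency of 100 Hz, the use of z-scores for by-subject normalization, and all names are assumptions made for illustration. The smoothing step is omitted, and the actual procedure is the one described in Section 2.3.4.

```python
import math
from statistics import mean, stdev


def hz_to_semitones(f0_hz, ref_hz=100.0):
    """Convert an f0 value in Hz to semitones relative to ref_hz (assumed)."""
    return 12.0 * math.log2(f0_hz / ref_hz)


def normalize_by_subject(f0_by_subject):
    """z-score each subject's semitone values against that subject's own
    mean and standard deviation (an assumed form of normalization)."""
    normalized = {}
    for subject, hz_values in f0_by_subject.items():
        semitones = [hz_to_semitones(value) for value in hz_values]
        subject_mean = mean(semitones)
        subject_sd = stdev(semitones)
        normalized[subject] = [(value - subject_mean) / subject_sd
                               for value in semitones]
    return normalized


# Hypothetical f0 samples (Hz) for two speakers with different pitch levels.
z_scores = normalize_by_subject({
    "S01": [180.0, 200.0, 220.0],
    "S02": [100.0, 110.0, 125.0],
})
```

After this step, each speaker's values are centred on zero, so f0 ranges from speakers with different overall pitch levels can be compared on a common scale.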
To investigate whether the prosodic encoding of correctiveness is influenced by contextual
probability and interlocutors’ knowledge states, I examined the effects of corrective focus in both
experiments for each of the four conditions of contextual probability separately: (a) probable
responses, (b) improbable responses, (c) responses elicited by probable statements, and (d)
responses elicited by improbable statements. In other words, different levels of contextual
probability were not directly compared, e.g. I did not compare ‘lettuce’ from dialogues about
Jacky (who hates vegetables) with ‘lettuce’ from dialogues about Zac (who loves vegetables).
The interactions between the independent variables (i.e. correctiveness, response type, and
statement type) in each experiment were also not statistically tested. This is because identical
sentences existed only between different types of focus, due to the variability in characters and
background settings that I added to the design to keep it reasonably natural. In all cases, I
analyzed the acoustics of Speaker A’s responses (Sentence 5) and did not analyze the acoustics
of Speaker B’s statements (Sentence 4), as mentioned in Section 4.3.1.
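The grouping into subsets (a)-(d) can be sketched as below. Each trial contributes once according to its Response Type and once according to its Statement Type, and correctiveness is then compared within each subset. The field names and f0 values are hypothetical, and the subset means only stand in for the statistical comparison, which in the study was made with mixed-effects models.

```python
from statistics import mean


def analysis_subsets(trials):
    """Group trials into the four subsets; each trial lands in two of them."""
    subsets = {
        "probable responses": [],
        "improbable responses": [],
        "probable statements": [],
        "improbable statements": [],
    }
    for trial in trials:
        subsets[trial["response_type"] + " responses"].append(trial)
        subsets[trial["statement_type"] + " statements"].append(trial)
    return subsets


def mean_f0_range_by_focus(subset):
    """Mean f0 range of corrective vs. non-corrective responses in a subset."""
    groups = {True: [], False: []}
    for trial in subset:
        groups[trial["corrective"]].append(trial["f0_range"])
    return {"corrective": mean(groups[True]),
            "non-corrective": mean(groups[False])}


# Made-up trials, one per condition; the f0 ranges are not data from the study.
trials = [
    {"statement_type": "probable", "response_type": "probable", "corrective": True, "f0_range": 1.0},
    {"statement_type": "probable", "response_type": "probable", "corrective": False, "f0_range": 0.5},
    {"statement_type": "probable", "response_type": "improbable", "corrective": True, "f0_range": 1.5},
    {"statement_type": "improbable", "response_type": "probable", "corrective": True, "f0_range": 2.0},
    {"statement_type": "improbable", "response_type": "improbable", "corrective": True, "f0_range": 1.0},
    {"statement_type": "improbable", "response_type": "improbable", "corrective": False, "f0_range": 0.5},
]
subsets = analysis_subsets(trials)
summary = mean_f0_range_by_focus(subsets["improbable statements"])
```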
Mixed-effects models were fitted to f0 ranges (anova in R: Chambers & Hastie, 1992; lme4
in R: Bates, Maechler, Bolker, & Walker, 2014; lmerTest in R: Kuznetsova, Brockhoff, &
Christensen, 2015). Correctiveness was included as a fixed effect, and Subject and Scenario were
included as random effects. When specifying the structure of random effects, I started with a full
model (i.e. including intercepts and slopes for Subject and Scenario) and excluded a random
slope from the final analysis when it did not significantly contribute to the model. All models in
the final analysis had random intercepts for Subject and Scenario.
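As a sketch of the minimal model structure described above (random intercepts only), the f0 range of observation i from subject j in scenario k can be written as follows. The symbols are notation introduced here for illustration, and any random slopes retained in individual models are not shown.

```latex
y_{ijk} = \beta_0 + \beta_1\,\mathrm{Corrective}_{ijk} + u_j + w_k + \varepsilon_{ijk},
\qquad
u_j \sim \mathcal{N}(0, \sigma_u^2),\quad
w_k \sim \mathcal{N}(0, \sigma_w^2),\quad
\varepsilon_{ijk} \sim \mathcal{N}(0, \sigma^2)
```

Here \beta_1 is the fixed effect of Correctiveness, u_j is the random intercept for Subject, and w_k is the random intercept for Scenario. In lme4 formula syntax this would correspond to something like f0_range ~ correctiveness + (1 | subject) + (1 | scenario), where the variable names are hypothetical.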
4.4. Experiments 2 and 3: Results
Overall, the predictions outlined in Section 4.2 were borne out, as can be seen in Figures 4.1-4.4.
As mentioned above, all of these analyses focus on the acoustics of Speaker A’s responses
(corrective or non-corrective); I did not analyze the acoustics of Speaker B’s statement that
expresses his/her misbelief. Thus, although the results are discussed in terms of the
(im)probability of Speaker B’s statements (Statement Type) as well as the (im)probability of
Speaker A’s responses (Response Type), the acoustic analyses focus only on the utterances
produced by Speaker A. It is also worth noting that both Figures 4.1 and 4.2 include the full set
of data from Experiment 2, and likewise, both Figures 4.3 and 4.4 include the full set of data
from Experiment 3. In other words, there is no data splitting; different figures simply present the
effects of different factors (Statement Type and Response Type) in an experiment.
In Experiment 2, where the information concerning words’ contextual probability was known to
both the speaker (Speaker A) and the interlocutor (Speaker B), correctively-focused responses
were produced with larger f0 ranges than non-corrective responses when they were elicited by
contextually improbable statements (e.g. Speaker B has the misbelief that Jacky bought lettuce
when we know that Jacky hates vegetables; the pair of bars on the left of Figure 4.1, t= 2.080, p<
0.05) or when the responses themselves were contextually probable (e.g. Speaker A corrects B
by saying that Jacky bought apples, when we know that Jacky loves fruit; the pair of bars on the
right of Figure 4.2, t= 1.974, p< 0.05). No significant differences in f0 ranges were found
between corrective and non-corrective responses when they were elicited by contextually
probable statements (e.g. when Speaker B has the misbelief that Jacky bought apples, in a
context where Jacky loves fruit; the pair of bars on the right of Figure 4.1, t= 1.253, p= 0.2109)
or when the responses themselves were contextually improbable (e.g. when Speaker A’s
correction states that this time Jacky bought lettuce, in a context where Jacky hates vegetables;
the pair of bars on the left of Figure 4.2, t= 1.560, p= 0.1194).
In contrast, the results of Experiment 3 are in the opposite direction from those of Experiment 2. In
Experiment 3, where the interlocutor had no access to the information concerning words’
contextual probability (i.e., Speaker B does not know that Jacky loves fruit but hates vegetables;
only Speaker A knows this), correctively-focused responses were produced with larger f0 ranges
than non-corrective responses when they were elicited by contextually probable statements (e.g.
when Speaker B says that Jacky bought apples; but recall that contextual probability is now only
defined in terms of Speaker A’s knowledge; the pair of bars on the right of Figure 4.3, t= 2.692,
p< 0.01) or when the responses themselves were contextually improbable (e.g. when Speaker A
corrects Speaker B by saying that Jacky bought lettuce, when Jacky in fact hates vegetables; the
pair of bars on the left of Figure 4.4, t= 2.986, p< 0.01). No significant differences in f0 ranges
were found between corrective and non-corrective responses when they were elicited by
contextually improbable statements (e.g. when Speaker B says that Jacky bought lettuce, but
Jacky in fact hates vegetables, although Speaker B doesn’t know this; the pair of bars on the left
of Figure 4.3, t= 1.871, p= 0.0623) or when the responses themselves were contextually probable
(e.g. when Speaker A’s correction states that Jacky bought apples, in a context that Jacky loves
fruit; the pair of bars on the right of Figure 4.4, t= 1.506, p= 0.1331).
[Figures 4.1-4.4: bar charts. Figures 4.1 and 4.3 group the bars by the object in Speaker B’s statement (I heard that she got… lettuce vs. apples); Figures 4.2 and 4.4 group the bars by the object in Speaker A’s response (No/Yes, she got… lettuce vs. apples). All four panels use the context ‘Jacky loves fruit but hates vegetables’, and each corrective vs. non-corrective comparison is marked * or n.s.]
Figures 4.1-4.4: F0 ranges (calculated from semitones, standardized by speaker) of the
target words in Speaker A’s responses in each condition of Experiments 2 and 3. Error bars
show +/- 1 standard errors. Contextual probability is shared knowledge in Experiment 2
but privileged knowledge in Experiment 3.
4.5. Discussion
Experiments 2 and 3 investigate how corrective focus and contextual probability interact in
shaping the prosody of utterances, and how the speaker’s expectation and realization about the
interlocutor’s knowledge state contribute to this shaping of prosody. As discussed in Sections 2.1
and 4.1, previous research has examined prosody from all of these angles, but has not paid close
attention to their interplay. A better understanding of these issues is important because they are
concerned with fundamental questions regarding the functions of prosody in interpersonal
communication. In this section, I discuss how my results relate to my key aims sketched out in
Section 4.2, as well as their broader implications.
One aim of Experiments 2 and 3 is to test the idea of ‘epistemic surprisal’ that I propose to
account for the patterns in Experiment 1. My hypothesis is that prosodic prominence such as f0
ranges can reflect the extent to which speakers’ prior expectations about their interlocutors differ
from what they learn about their interlocutors through the conversation. The results of
Experiments 2 and 3 support this hypothesis. In Experiment 2, where the interlocutors were
expected to be fully aware of the context, speakers prosodically marked corrective information
when responding to contextually improbable statements, not when responding to contextually
probable statements. In Experiment 3, where the interlocutors were expected to have no
knowledge about the context, speakers prosodically marked corrective information in their
responses to contextually probable statements, not in their responses to contextually improbable
statements. These patterns can be explained in terms of epistemic surprisal – in this case, how
‘surprised’ the speakers were at their interlocutors’ statements, given the contextual knowledge
that the speakers had expected their interlocutors to have (or not to have).
Let us consider the conversational setting in Experiment 2, where the characters’ preferences and
needs were mentioned in the dialogues. Expecting their interlocutors to have this knowledge,
speakers might not have found it surprising when their interlocutors had misbeliefs consistent
with the characters’ preferences and needs. The low epistemic surprisal might be what prevented
speakers from prosodically emphasizing the corrective responses to contextually probable
misstatements. In contrast, it might have struck speakers as surprising when their supposedly-
informed interlocutors had misbeliefs conflicting with the context. The high epistemic surprisal
might be what elicited extra prosodic prominence in the corrective responses to contextually
improbable misstatements.
Probably for similar reasons, the conversational setting in Experiment 3, where the characters’
preferences and needs were known only to the speaker, yielded the opposite patterns. Here,
speakers might have instead been surprised at their interlocutors’ probable misstatements,
because it might have seemed that their interlocutors were somehow able to form probable
beliefs without the critical knowledge. In contrast, improbable misstatements might not have
struck speakers as surprising, since improbable beliefs might have fit their assumption about
their interlocutors’ lack of critical knowledge. In other words, the level of prosodic prominence
still reflects the degree of epistemic surprisal, but compared to Experiment 2, (im)probability of
the interlocutors’ utterances had the opposite effect in Experiment 3, because the speakers had
the opposite expectation about their interlocutors’ knowledge states.
My findings thus add to the small body of literature concerning how speakers update their prior
assumptions about their interlocutors by taking incoming cues from their interlocutors’ behavior
(e.g. Brennan, 1991; Kuhlen & Brennan, 2010; Russell & Schober, 1999). Speakers’ responses
are influenced by both what they assume their interlocutors to know and what their interlocutors
appear to know. My results show that the prosody of speakers’ responses can reveal the gap
between speakers’ expectations and the reality they encounter (e.g. what their interlocutors
actually say). In the research on prosody and information structure, the current study draws
attention to the importance of investigating the interlocutors’ utterances (e.g. the misstatement in
corrective-focus structure) that elicit the speakers’ responses (e.g. the correction). This is
consistent with views that highlight the role of interlocutors in language production (see Sections
1.3 and 4.1 for more details).
My other aim is to provide more evidence for the interplay between corrective focus – an
information-structural factor – and contextual probability – an information-theoretic factor, since
Experiment 1 seems to be the first study that demonstrates this interplay. The results of
Experiments 2 and 3 replicate and extend my prior findings. Similar to Experiment 1, contextual
probability was part of the common ground between speakers and their interlocutors in
Experiment 2. In both experiments, speakers prosodically emphasized corrective words that were
contextually probable, but not those that were contextually improbable. However, the two
experiments focused on different types of contextual probability. In Experiment 1, the words’
contextual probability hinged on world knowledge (e.g. got BALLS vs. FISH at the sports store),
whereas in Experiment 2, the words’ contextual probability was established via preceding
discourse (e.g. Jacky loves fruit and got APPLES vs. Jacky hates vegetables but got LETTUCE).
Thus, we see that both types of contextual knowledge modulated the prosodic marking of
corrective focus.
Different from Experiments 1 and 2, contextual probability was privileged knowledge available
only to the speakers in Experiment 3. Nevertheless, contextual probability still interacted with
corrective focus in affecting the speakers’ prosody, even though the information concerning
contextual probability was inaccessible to their interlocutors. Furthermore, the results of Experiment
3 are in the opposite direction from those of Experiment 2 (and Experiment 1). In Experiment 3,
speakers prosodically emphasized corrective words that were contextually improbable, but not
those that were contextually probable. Taken together, these findings suggest that the effects of
contextual probability on corrective prosody are modulated by interlocutors’ knowledge states, in
accordance with the discussion above.
I am currently working on extending these investigations in two further directions: (i) probing
potential effects of non-linguistic unexpectedness and (ii) testing adaptation to interlocutors’
knowledge over the course of a conversation.
First, let us consider the idea of linguistic vs. non-linguistic unexpectedness. My interaction-
based account in terms of ‘epistemic surprisal’, at the current stage, does not distinguish
expectations about linguistic elements (e.g. what the interlocutor might say) from expectations
about non-linguistic elements (e.g. things in the environment, such as what the speaker sees). In
the studies reported in this dissertation, epistemic surprisal can be operationally defined as the
likelihood of a conversational partner’s utterance given the linguistic and conversational
circumstances such as the discourse context and speakers’ knowledge about their interlocutors.
The possible sources of expectation-reality mismatches, however, are not limited to these kinds
of elements. It will be interesting to see whether non-linguistic unexpectedness (e.g. seeing
something odd while talking) and linguistic unexpectedness (e.g. hearing the interlocutor say
something odd) have similar impacts on prosody, and if not, how prosody reflects non-linguistic
unexpectedness. For example, imagine that I am asked for directions by a stranger in my
neighborhood. As I am describing the route, I am surprised to see many balloons floating near the
shop I am pointing at – does this affect the prosody of the instructions I am giving to the stranger,
even if I do not mention the balloons?
On the one hand, existing work suggests that visual information can ‘contaminate’ language
processing and cause speech errors (Harley, 1984). On the other hand, it is reasonable to suspect
that some non-linguistic elements may be sufficiently salient to cause updates of epistemic states
but not relevant enough to alter the conversation. Research on the prosodic encoding of non-
linguistic or paralinguistic information has mostly been devoted to emotion (e.g. Campbell &
Erickson, 2004; Hammerschmidt & Jürgens, 2007; Ishi, Ishiguro, & Hagita, 2005). However,
other work suggests that the emotional state of surprise may not influence prosody in the same
way as epistemic surprisal (Dombrowski, 2003). Furthermore, most work on paralinguistic
prosody focused on monosyllabic words, not sentence prosody. Exploring potential factors in the
environment that could indirectly contribute to a conversation will help us to better understand
the systems that contribute to language production. Such research will also shed light on the
discussion about the extent to which cognitive systems (e.g. linguistic and visual representations)
interfere with each other in the human mind.
The second direction in which I am extending these investigations has to do with the question of
how rapidly we are able to recalibrate our assumptions about what others know. Talking with
strangers or new friends is a common part of everyday life, and we often learn details about
people through the conversations with them. The results of Experiments 2 and 3 suggest that
prosody can reflect the mismatches between what speakers initially assume about their
conversational partners and what they learn during the conversation (i.e. epistemic surprisal).
However, it is unclear (i) how quickly speakers’ initial assumptions can be recalibrated in
accordance with their partners’ true characteristics revealed in the course of the conversation,
and (ii) to what extent speakers rely on the prior knowledge they have at the start of the
conversation, compared to the incoming cues during the conversation. For example, imagine that
I am chatting with a stranger at a conference dinner. Because this is a linguistics conference, I
assume this person is a linguist, and therefore I make linguistic references in our conversation.
However, this person actually does not have much background in linguistics (but is just a friend
of a conference attendee). Will I be able to realize this fact based on the person’s responses, if he
or she does not tell me directly? How long will it take me to change my speaking style and stop
using linguistic jargon?
Most of the existing work on partner-specific processing has investigated – and found evidence
for – listeners’ ability to rapidly compute and track information about unfamiliar talkers’ speech
characteristics (e.g. Creel, Aslin & Tanenhaus, 2008; Horton & Slaten, 2012; Trude & Brown-
Schmidt, 2012). Relatively little work has looked into these questions in the other direction of
communication, namely speakers’ adaptation to their addressees’ behavior (but see details about
Brennan (1991), Kuhlen & Brennan (2010), and Russell and Schober (1999) in Section 4.1). In
natural conversations, people often take turns and switch roles between listeners and speakers.
Studying the dynamics between interlocutors, which my on-going work investigates, will help us
better understand how utterances are planned and how communication is achieved via language.
I have developed an interactive design where I manipulate whether people are told beforehand
about their interlocutors’ knowledge and what they are told. In some conditions, speakers are
given information beforehand about what their interlocutor knows/doesn’t know, but in other
conditions, speakers are given no prior information. In all conditions, speakers interact with
interlocutors who frequently or infrequently make errors, from which speakers may infer the
extent of their knowledge. For example, some participants are told prior to the experiment that
their partner is good at identifying the flags of different countries but bad at identifying the logos
of different companies. Some other participants, in contrast, are told nothing about their partner.
During the experiment, participants give instructions to their partner about where to move
flags/logos, and the partner sometimes makes errors and moves wrong flags/logos. Thus, if a
partner makes many errors in flags but few errors in logos, this would imply that this partner is
actually bad at flags and good at logos. Crucially, I also have conditions where the information
that speakers are provided with at the start of the conversation turns out to not match their
interlocutors’ behavior over the course of the conversation. For example, some participants are
told beforehand that their partner is good at logos but bad at flags, but it turns out that their
partner makes far more errors in logos than flags. This design allows me to test (i) how rapidly
speakers can recalibrate their initial assumptions to reflect the partners’ true characteristics that
are revealed over the course of the conversation, and (ii) to what extent speakers rely on the prior
knowledge they have before the start of the conversation, compared to the incoming cues during
the conversation.
Chapter 5: Discourse-level prosody in a tone language: Prosodic encoding of information
structure in Mandarin
5.1. Introduction
We have so far been focusing on how prosody encodes information structure and related factors
in English. However, as briefly mentioned in Chapter 1, the question of how prosody conveys
discourse-level information becomes even more complex when we consider tone languages,
where duration, f0, and intensity also distinguish between lexical items (e.g. African languages:
Zerbian, Genzel, & Kugler, 2010; Cantonese: Bauer, Cheung, Cheung, & Ng, 2004; Vietnamese:
Jannedy, 2007). In Mandarin Chinese, for example, four pitch patterns – commonly referred to as
‘tones’ – function as phonemes: high (Tone 1), rising (Tone 2), low (Tone 3), and falling (Tone
4). They can alter lexical meaning, as shown in (5.1). In addition to the four-way distinction
based on f0 movement, lexical tones in Mandarin also differ in amplitude and length. Tone 2,
Tone 3, and Tone 4 are perceptible solely on the basis of their amplitude contours (Whalen & Xu,
1992), and Tone 3 is 1.5 times longer than the other tones when produced in isolation (Xu, 1997).
(5.1) Tone 1 ma [High] ‘mother’
Tone 2 ma [Rising] ‘hemp’
Tone 3 ma [Low] ‘horse’
Tone 4 ma [Falling] ‘scold’
Given that f0, intensity, and duration are used to distinguish lexical items in Mandarin, we are
faced with the question of whether these acoustic dimensions also function as signals to
information structure, and if so, how they accomplish this dual role. In this chapter, I present a
production study on Mandarin, investigating the prosodic encoding of information-structural
distinctions that have been well established in semantics, such as corrective vs. non-corrective
and new vs. given (see Section 1.1 for more details). In the rest of this section, I will review
existing work on how prosody encodes information structure in Mandarin. I will first summarize
work on new-information focus and then look at work on corrective focus. As will become clear,
existing research on different information-structural categories in Mandarin has led to divergent
results regarding the specifics of which prosodic cues distinguish different information-structural
categories.
Nevertheless, it is worth noting that, in terms of f0, most research on Mandarin agrees that
information structure is conveyed not by the shapes of f0 contours but by their ranges
(e.g. Jin, 1996; Chen & Braun, 2006). For example, a study on corrective focus by
Chen & Gussenhoven (2008) found that the f0 shapes specified by different lexical tones are still
distinct from one another within an information-structural type. This makes sense given that the
shapes of f0 contours are the major cue for lexical tones in Mandarin, and thus need to be
maintained to ensure successful spoken word recognition.
Even in English, as briefly mentioned in Section 2.2, recent evidence suggests that f0 shapes do
not necessarily map straightforwardly onto information-structural types (e.g. Katz & Selkirk,
2011; Krahmer & Swerts, 2001; Watson, Tanenhaus, & Gunlogson, 2008). It has traditionally
been argued that different types of discourse information occur with distinct pitch accents (i.e.
pitch targets that are temporally associated with particular syllables). Pierrehumbert and
Hirschberg (1990) conclude that words in new-information focus receive an H* pitch accent (i.e.
a high pitch target aligned with the stressed syllable), whereas words in contrastive focus receive an
L+H* pitch accent (i.e. a high pitch target aligned with the stressed syllable and a low pitch target
preceding it). However, Watson, Tanenhaus and Gunlogson’s (2008) visual-world eye-tracking
study suggests that the mapping between pitch accents and focus types may not be entirely
straightforward: In a comprehension/perception study, they found that listeners look towards
contrastive referents when they hear an L+H* pitch accent, whereas hearing an H* accent leads
listeners to consider both new and contrastive referents. A production study by Katz and Selkirk
(2011) further shows that the acoustic differences between contrastive focus and new-
information focus cannot be explained by pitch accents and should be analyzed as differences in
prosodic prominence.
Let us now review existing work on how prosody encodes information structure in Mandarin,
starting with claims regarding new-information focus, e.g. mangos in (5.2b). Researchers have
compared words that are new information in narrow focus to those in broad focus, exemplified in
(5.3b).
(5.2) a. What did Peter buy at the market?
b. Peter bought [mangos]. (narrow new-information focus)
(5.3) a. What happened?
b. [Peter bought mangos]. (broad new-information focus)
Prior research on Mandarin has found that words that are new information in narrow focus have
longer duration (Jin, 1996), larger f0 ranges (Jin, 1996; Xu, 1999) and higher mean f0 (Chen,
Wang, & Xu, 2009), when compared to words that are new information in broad focus. The
results for intensity in this domain are less clear: Chen et al. (2009) claim that new information in
narrow focus has a higher mean intensity than new information in broad focus, but this is not
found by Jin (1996). It is important to note that in general, these studies compared narrow new-
information focus (ex. 5.2b) and broad new-information focus (ex. 5.3b), rather than new-
information focus and non-focus (i.e. given information, e.g. mangos in ex. 5.4b). Consequently,
their results shed light on the differences between narrow new-information focus and broad new-
information focus, but do not allow us to draw conclusions regarding the nature of the prosodic
differences between focused material and non-focused (given) material.
(5.4) a. Did Peter buy mangos?
b. (Yes,) Peter bought mangos. (given information)
The only existing experimental work in Mandarin that I have seen directly comparing new and
given information (termed ‘rheme’ and ‘theme’ in their study) is Chen & Braun (2006). They
found that new information has longer duration and larger f0 ranges than given information.
However, their data were elicited through a one-person reading task where individual
participants saw question-answer pairs and read aloud the answers according to the questions. To
the best of my knowledge, there is no prior work investigating this issue with a more natural
design.
Let us now turn to corrective focus, as exemplified in (5.5). A number of studies on corrective
focus realization in Mandarin compared correctively-focused words to non-focused (given)
words embedded in an utterance that contains some kind of focus, e.g. new-information focus as
in (5.6), such as Chen (2006), or corrective focus as in (5.7), such as Chen & Gussenhoven
(2008). Thus, these studies compared corrective ‘mangos’ in a context like (5.5b) to unfocused
or presupposed ‘mangos’ in contexts like (5.6b) or (5.7b).
(5.5) a. Did Peter buy apples at the market?
b. No, he bought [mangos]. (corrective focus on ‘mangos’)
(5.6) a. What did Peter do to the mangos?
b. He [bought] mangos. (new-information focus on ‘bought’)
(5.7) a. Did John buy mangos?
b. [Peter] bought mangos. (corrective focus on ‘Peter’)
As a whole, these studies found that correctively-focused words have longer durations (Chen,
2006; Chen & Gussenhoven, 2008) and larger f0 ranges (Chen & Gussenhoven, 2008) than non-
focused words. This result fits well with data from other languages as well as the general
intuition that corrective words tend to be more prominent (in various ways) than words that refer
to already-mentioned or presupposed information. Although duration and f0 have been
investigated, there seems to be no prior work investigating the intensity of correctively-focused
words in Mandarin.
Some existing studies on Mandarin have compared new-information focus and corrective focus,
but with conflicting results. Greif (2010) found that correctively-focused words had longer
durations than words in new-information focus, but did not differ reliably in terms of their f0
ranges. (Greif compared narrow new-information focus with two types of corrective focus:
semantic correction and pragmatic correction. I will not discuss Greif’s findings for what he calls
‘pragmatic correction’, since only semantic correction is similar to the corrective focus that I
investigated.) In contrast, when Chen & Braun (2006) compared corrective focus to new-
information focus, they found differences in f0 ranges but no differences in duration. Neither
Greif (2010) nor Chen & Braun (2006) looked for differences in intensity between correctively-
focused words and new-information words.
It is worth noting that Chen & Braun (2006)’s study is the first prosodic investigation of
information structure in Mandarin that investigated different information-structural categories in
a way that closely tied them to theoretical work. Their information-structural notions and
terminology are based on Steedman (2000): theme, rheme, background, and focus. In the
discussion preceding and following the current paragraph, I refer to their ‘normal rheme focus’
and ‘corrective rheme focus’ as ‘new-information focus’ and ‘corrective focus’ for the sake of
consistency. More specifically, Chen & Braun (2006) examine four categories of discourse
information: theme background, theme focus, rheme background, and rheme focus. Following
Steedman (2000)’s framework, these four categories are based on two layers of information
structure: a primary distinction between rheme and theme, and a secondary distinction between
focus and background. While ‘rheme’ and ‘theme’ roughly correspond to the new and given
information as defined in my study, the division between ‘focus’ and ‘background’ is based on
prosody – focus is intonationally marked and background is not. Chen & Braun (2006) find that
both rheme and focus are marked by lengthening and f0 range expansion, and that the distinction
between rheme and theme is prosodically more prominent than the distinction between focus and
background. Additionally, and most relevantly for us – they look into two subtypes of rheme
focus (essentially, new-information focus and corrective focus), and find that correctively-
focused rhemes have bigger f0 ranges than non-correctively focused rhemes, but do not differ in
duration – a finding which contrasts with Greif (2010), who found that corrective focus leads to
longer duration but not bigger f0 ranges than new-information focus. As a whole, Chen & Braun
(2006)’s findings suggest that different information-structural distinctions can be encoded in the
same prosodic dimensions with different degrees of prominence, as well as reflected in different
prosodic dimensions.
In sum, while existing findings for Mandarin generally agree that (i) narrow new-information
focus involves increases in f0 displacement and duration when compared to broad new-
information focus, and that (ii) corrective focus similarly involves increases in f0 displacement
and duration when compared to unfocused words, there is as of yet no consensus about how the
two focus types (new-information focus and corrective focus) differ from each other, nor about
how constituents in new-information focus differ from non-focused constituents.
In light of the outcomes of Greif (2010) and Chen & Braun (2006), one might conclude that
perhaps new-information focus and corrective focus do not differ reliably in their acoustic
encoding. However, because the details of the target sentences, the designs, and the information-
structural manipulations in these two studies differ from each other (because their general aims
were different), we should be careful in comparing them directly. In essence, then, existing work
has not led to a conclusion about whether and how corrective focus and new-information focus
differ from each other.
It is quite striking that the prosodic properties of two major types of information structure that
have received the most attention in research on English (e.g. Breen, Fedorenko, Wagner, &
Gibson, 2010; Katz & Selkirk, 2011; Krahmer & Swerts, 2001; Pierrehumbert & Hirschberg,
1990; Watson, Tanenhaus, & Gunlogson, 2008) – new information and contrastive focus – are
not yet well-understood in Mandarin. An understanding of this question in a cross-linguistic
context is important for theories of information structure. This question relates to fundamental
issues about whether new-information focus and corrective focus are semantically distinct
categories or variants of the same category, and whether their prosodic realizations are
categorically different or variants on a continuum.
5.2. Experiment 4: Aims and expected outcome
As we saw in the preceding section, it remains unclear to what extent prosodic cues differentiate
one type of discourse information from another in Mandarin. To shed light on this issue, and
more generally on the question of how information structure is realized prosodically in Mandarin,
a tone language where all three prosodic dimensions – duration, f0, and intensity – already serve
lexical purposes, I conducted a psycholinguistic production study that investigates three main
questions.
First, do words in new-information focus vs. corrective focus differ from each other in
terms of their prosodic realization in Mandarin, and if so, which parameters (e.g. duration, mean
f0, f0 range, or intensity) encode these differences? Second, I also investigated whether and how
the distinction between ‘new’ and ‘given’ influences the acoustic realization of both corrective
and non-corrective words. As we saw in the preceding section, prior work does not yield a clear
picture.
Broadly speaking, I expected new information to be prosodically more prominent than
given information in Mandarin, based on existing work on other languages (e.g. Brown, 1983;
Eady & Cooper, 1986; Fowler & Housum, 1987; Hay, Sato, Coren, Moran, & Diehl, 2006;
Krahmer & Swerts, 2001; Ladd, 1996) as well as Chen and Braun (2006)’s findings regarding
increased duration and larger f0 ranges. However, the question of whether and how
corrective/contrastive focus differs from new-information focus is more open. As we have seen
above, existing work on Mandarin has not reached a consensus (e.g. Chen & Braun, 2006 vs.
Greif, 2010). Moreover, recent studies find no one-to-one mapping between information-
structural types and pitch accents in English (Katz & Selkirk, 2011; Krahmer & Swerts, 2001;
Watson, Tanenhaus, & Gunlogson, 2008). To contribute to our understanding of this
phenomenon in a cross-linguistic setting, I wanted to see how and whether the two focus types
differ in Mandarin, a language where discourse-driven prosodic cuing is constrained by the
existence of lexical tone.
The third key aim of my study is a more inclusive analysis of the acoustic parameters
relevant to information structure, to ensure that potentially crucial distinctions are not
inadvertently overlooked. This aim has two sub-parts:
First, I analyzed not only duration and f0, but also intensity. Prior studies on Mandarin
mostly focused only on duration and f0. The two studies that did look at intensity (Jin, 1996;
Chen, Wang, & Xu, 2009) analyzed mean intensity but did not look at potential effects of
information structure on intensity ranges (i.e. difference between maximum and minimum
intensity) – and even in the domain of mean intensity, their results do not agree with each other.
Given that intensity contours, as well as f0 contours, are associated with lexical tones in
Mandarin, I expected that intensity ranges could reflect discourse-level information just like f0
ranges do. Thus, I investigated how and whether all three prosodic dimensions encode
information structure.
The second sub-part of this third aim has to do with how f0 and intensity range
expansion is accomplished. Theoretically, there are three possible ways of expanding the range
of an excursion: (a) raising the maximum and lowering the minimum, (b) only raising the
maximum, and (c) only lowering the minimum, as illustrated schematically in Figure 5.1. If one
finds, say, f0 range expansion for both new-information focus and corrective focus, then in order
to assess whether the phenomena are really the same or not, one needs to investigate the
underlying source of the range expansion, i.e. which of the three ‘strategies’ is being used.
Failing to do this could result in incorrectly grouping together two phenomena that are
underlyingly different. Thus, I conducted detailed analyses not only of f0 and intensity ranges,
but also maxima, minima and means.
Figure 5.1: Possible ways of expanding ranges. Panels (a), (b), and (c) correspond to the three strategies listed above.
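The three strategies can be told apart by comparing the maximum and minimum of a word in one condition against those in a baseline condition. The following is a minimal sketch, assuming that such maxima and minima are available; the function name, the labels, and the tolerance parameter are illustrative and are not taken from the analysis scripts used in this study.

```python
def classify_range_expansion(base_min, base_max, new_min, new_max, tol=0.0):
    """Label how a range expanded relative to a baseline.

    Returns "a" (maximum raised and minimum lowered), "b" (maximum raised,
    minimum not lowered), "c" (minimum lowered, maximum not raised),
    or "none" if the range did not expand or no edge moved by more than tol.
    """
    if (new_max - new_min) <= (base_max - base_min):
        return "none"
    raised_max = new_max > base_max + tol
    lowered_min = new_min < base_min - tol
    if raised_max and lowered_min:
        return "a"
    if raised_max:
        return "b"
    if lowered_min:
        return "c"
    return "none"


# Hypothetical f0 values in Hz: baseline 100-200 vs. an expanded 90-220.
print(classify_range_expansion(100, 200, 90, 220))  # a
```

Two conditions that both show range expansion would count as the same phenomenon under this sketch only if they also receive the same label.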
5.3. Experiment 4: Methods
To investigate the prosodic encoding of information structure in a tone language, I conducted a
production study on Mandarin. Participants produced instructions based on pictures and arrows
shown on a computer screen. They were told to imagine that another person in another room
would be listening to their instructions and moving the objects on the screen accordingly (though
this person might sometimes make mistakes), and that the movements made by the listener
would be visible on the participants’ computer screen. Participants’ utterances were recorded
through a head-mounted microphone. Using pictures allowed us to avoid presenting participants
with written sentences, which can result in unnatural ‘reading’ intonation (especially when it is a
one-person task, like this study). In this section, I will first discuss the experimental design and
the stimuli, and then go over the procedure.
In this study, I focus on two major distinctions between information-structural types: (i) Do new
and corrective elements differ from each other in terms of their prosodic realization in Mandarin?
(ii) Are new and given elements marked differently in prosody? Although these kinds of issues
have been investigated in prior work, the existing results do not yield a clear picture.
5.3.1. Design and stimuli
Participants saw colored pictures on the computer screen. Objects were presented in circles, each
with its name shown below. There were six pictures on each screen. Arrows were used to
indicate the commands participants should produce. For example, in Figure 5.2a, the arrow
points from the cigarette to the lounge chair, so participants should say: ‘Move the cigarette next
to the lounge chair.’ After they produced the instruction, participants saw a moving event on the
computer screen that responded to the instruction either correctly or incorrectly. For example, in
Figure 5.2b, the cigarette is moved next to the lounge chair, which is a correct response.
Figure 5.2: Sample display of Experiment 4. Panel (a) shows an arrow pointing from the
cigarette to the lounge chair, and panel (b) shows that the cigarette has been moved next to
the lounge chair.
To examine discourse-level intonation across lexical tones, I manipulated the information
structure of the target words and controlled their tonal combinations. Specifically, a repeated-
measures within-subjects design with two independent variables was used: (i) correctiveness
(with two levels: presence or absence of correction) and (ii) givenness (with two levels: new or
given information). Target words were bisyllabic, with one of the three tonal combinations:
High-High (HH), High-Low (HL), or Low-High (LH). A third of the target words were HH, a
third were HL and a third were LH. All sentences were produced in the frame illustrated in (7).
For instance, the sentence “ba xiangyan (‘cigarette’) fangdao/fangzai tangyi (‘lounge chair’)
pangbian” would be produced for the display in Figure 5.2. For the verb ‘put’, the variant fang is
also possible, in addition to fang-dao and fang-zai. These forms are interchangeable across
speakers in this context. Participants were asked to use the one most natural to them; only one
participant used the short form fang.
(7) ba OBJECT fang-dao/-zai LOCATION pangbian
BA OBJECT put-PREP LOCATION side
‘Move the OBJECT next to the LOCATION’
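The frame can be read as a template with two noun slots and a choice of verb form. The following is a minimal sketch, assuming the unsegmented romanization used in the example sentence above; the function name and the `verb` parameter are illustrative and are not part of the experiment software.

```python
def build_instruction(obj, location, verb="fangdao"):
    """Fill the sentence frame: ba OBJECT fang(-dao/-zai) LOCATION pangbian."""
    # The three verb variants described as interchangeable in this context.
    allowed = {"fang", "fangdao", "fangzai"}
    if verb not in allowed:
        raise ValueError(f"unexpected verb variant: {verb}")
    return f"ba {obj} {verb} {location} pangbian"


print(build_instruction("xiangyan", "tangyi"))
# ba xiangyan fangdao tangyi pangbian
```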
A target word always appeared in the OBJECT role in a sentence. Table 5.1 shows the summary
of the four conditions. Target sentences were the last two sentences in a trial, i.e. the sentences in
bold in Table 5.1. Next, let us consider the four conditions in more detail.
Trial type         New information                     Given information
1st Sentence       (a) Move A next to B                (d) Move A next to TARGET
1st Visual event   (Correct moving)                    (Correct moving)
2nd Sentence       (b) Move TARGET next to C           (e) Move TARGET next to C
                   [Non-Corrective New]                [Non-Corrective Given]
2nd Visual event   (Wrong object is moved next to C)   (Wrong object is moved next to C)
3rd Sentence       (c) Move TARGET next to C           (f) Move TARGET next to C
                   [Corrective New]                    [Corrective Given]
3rd Visual event   (Correct moving)                    (Correct moving)
Table 5.1: Structure of target trials in Experiment 4.
Broadly speaking, there were two types of target trials: New-information trials and Given-
information trials. The New-information trials were composed of three spoken instructions
(sentence (a), (b), and (c) in Table 5.1). First, a participant saw an image with an arrow and
produced the corresponding sentence, e.g. Move the cigarette next to the lounge chair (sentence
(a)). The object moved correctly in the display, and after that, another arrow appeared. To
convey the information represented by the second arrow, the participant produced another
sentence, e.g. Move the juice next to the crow (sentence (b)). After the second sentence was
uttered, an incorrect object moved to the location. For example, instead of the juice, the pacifier
moved next to the crow. To correct the moving event, the participant repeated the instruction, e.g.
Move the JUICE next to the crow (sentence (c)). This time, the correct object moved on the
screen. The corrective sentence was the last sentence in a trial.
Note that on these New-information trials, the target word (juice in this case) had not been
mentioned until sentence (b) was uttered for the first time, i.e. neither of the two nouns
mentioned in sentence (a) (cigarette and lounge chair in this case) was the target word in
sentence (b). Thus, ‘juice’ was new information when it was mentioned in sentence (b), which I
refer to as the Non-Corrective New information condition. Later, when the participant
repeated the instruction in order to correct the incorrect moving event, thus producing sentence
(c), I refer to this as the Corrective New information condition since the target word (juice)
here was uttered in a corrective context. As a result, the distinction between new and given in
this study is defined from the hearer’s perspective. From the perspective of the speaker, by the
time they got to sentence (c), ‘juice’ was already given information because the speaker had
already uttered it in sentence (b). However, from the perspective of the hearer, ‘juice’ was
presumably still new information at the point where sentence (c) was uttered, since the hearer
apparently did not hear sentence (b) correctly and moved an incorrect object instead of the
‘juice’. Thus, the fact that a wrong object was moved implies that the hearer did not pay
attention to sentence (b) and misheard the object to be moved – i.e. it was still new information
to the hearer in sentence (c) (I discuss this more below).
Having considered the New-information trials, let us now turn to the Given-information trials.
The Given-information trials had the same structure as the New-information trials, except for the
information-structural properties of the target words. Specifically, in the Given trials, the target
word had already been mentioned in the LOCATION role of the first sentence (sentence (d) in
Table 5.1). In other words, it had already been involved in an earlier moving event, as shown in
Table 5.1. Thus, in the Given-information trials, the second spoken instruction (sentence (e) in
Table 5.1) is in the Non-Corrective Given information condition, and the third sentence
(sentence (f) in Table 5.1) is in the Corrective Given information condition.
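The two trial types differ only in whether the target noun appears in the first sentence. The following is a minimal sketch of the structure in Table 5.1, using English glosses; the function name and the placeholder nouns A, B, and C are illustrative.

```python
def make_trial(target, givenness, a="A", b="B", c="C"):
    """Return the three (instruction, condition) pairs of one target trial.

    givenness is "New" or "Given". The nouns a, b, and c stand for
    non-target pictures, as in Table 5.1.
    """
    if givenness == "New":
        first = f"Move {a} next to {b}"        # target not yet mentioned
    elif givenness == "Given":
        first = f"Move {a} next to {target}"   # target mentioned as LOCATION
    else:
        raise ValueError("givenness must be 'New' or 'Given'")
    instruction = f"Move {target} next to {c}"
    return [
        (first, None),                                 # not a target sentence
        (instruction, f"Non-Corrective {givenness}"),  # followed by a wrong move
        (instruction, f"Corrective {givenness}"),      # repeated as a correction
    ]


for sentence, condition in make_trial("juice", "Given"):
    print(sentence, condition)
```

The second and third instructions are string-identical within a trial, so any prosodic difference between them has to come from the corrective context and not from the wording.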
In this study, I used the distinction between the speaker’s perspective and the hearer’s
perspective to differentiate the Corrective Given and Corrective New conditions. Importantly,
there is a considerable body of work showing that speakers’ prosodic realizations are indeed
sensitive to hearers’ perspectives and attention states (see Section 1.3 for more details). The
experiments presented in Chapter 4 also show that speakers take their conversational partners’
knowledge states into account. (More specifically, we saw that in English, the directions in
which contextual probability modulates speakers’ corrective prosody depend on whether their
conversational partners know the context (and are thus able to estimate words’ contextual
probability)). In addition, corpus work on word order patterns and non-canonical constructions
(in English and other languages) shows that people are very sensitive to whether or not
something has been mentioned in preceding discourse / successfully introduced into the
discourse model (e.g. Birner & Ward, 1998, see also Prince, 1992). Based on these findings, it
seems reasonable to assume that speakers understand that in the Corrective New condition, the
critical target noun is new to the hearer, whereas in the Corrective Given condition, the critical
target noun has already been introduced to the discourse (and the listener’s mental model of the
discourse) by virtue of being successfully moved after the first sentence (sentence (d)).
Definition of new/given. It is important to note that in my design, the distinction between ‘given’
and ‘new’ information is drawn in terms of discourse-status, i.e. whether the noun has already
been mentioned in the preceding discourse or not (e.g. Birner & Ward, 1998; Prince, 1992;
Kaiser & Trueswell, 2004). Thus, any entity that has already been mentioned in the prior
discourse, regardless of grammatical position, is by definition discourse-old/given information.
In particular, the target word in the given conditions first occurs in the LOCATION role and then
in the OBJECT role. Prior work has identified connections between prosodic cues and discourse-
status (e.g. Féry, Kaiser, Hörnig, Weskott, & Kliegl, 2009 on f0 contours in German) as well as
word order patterns and discourse-status (e.g. Birner & Ward, 1998 on English). This definition
of old vs. new differs from Terken and Hirschberg (1994) and Schwarzschild (1999), who note
that a change in syntactic role (and/or surface position) can render something ‘new’ for purposes
of accent assignment. According to their view, the target nouns in the Given conditions should in
fact count as ‘new’ for purposes of prosodic prominence, because their syntactic role has
changed. However, an eye-tracking study by Dahan, Tanenhaus and Chambers (2002) shows that
when listeners hear temporarily-ambiguous discourse-old nouns realized in a new syntactic
position with prosodic prominence, they do not interpret these prosodic cues as potentially
referring to discourse-new referents. In other words, Dahan et al.’s results suggest that discourse-
old entities mentioned again in a different syntactic position do not pattern in the same way that
discourse-new nouns do, seemingly in contrast to the claims of Terken & Hirschberg (1994).
Pre-empting my findings somewhat, let us note already that my results indicate that Mandarin
speakers are indeed sensitive to the given vs. new distinction defined in terms of discourse-
information status and independently of grammatical role or surface position. Although the given
information manipulation in my experiment may not represent the ‘highest degree’ of givenness
(due to the change in grammatical role), the key observation (discussed more in the results
sections) is that the difference between the new and given information conditions is reflected in
participants’ prosody.
Each condition contained 9 items, 3 in each of the 3 tone combinations. (See Table 5.2 for the
list of target words.) There were 18 target trials, each including a non-corrective target sentence
and a corrective target sentence. The dependent variables that I measured were the duration, f0
range, and intensity range of the target word region in a target sentence. In the target trials, either
three or four of the objects were mentioned in a particular trial. The extra two to three pictures in
a display were presented to ensure that participants could not predict which picture was going to
be involved in the next instruction. The experiment also included 36 filler trials, which differed
from the target trials along one or more of the following parameters: the number of sentences in
a trial, whether and in which sentence a wrong moving event occurs, on which noun the
correction needs to be made in a corrective sentence, and the lexical tones of the nouns.
Tone             Word
HH (high+high)   xiang.yan ‘cigarette’
                 wu.ya ‘crow’
                 qing.wa ‘frog’
HL (high+low)    qiu.yin ‘earthworm’
                 ying.wu ‘parrot’
                 ban.ma ‘zebra’
LH (low+high)    gui.wu ‘ghost house’
                 yu.yi ‘raincoat’
                 hai.ou ‘seagull’
Table 5.2: Target words in Experiment 4.
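The item counts reported above (9 items per condition, 18 target trials, 36 target sentences) follow from crossing the nine target words with the two factors. The following is a minimal sketch of that enumeration, assuming that each target word occurs once per givenness level and that each trial contributes one non-corrective and one corrective target sentence; the variable names are illustrative.

```python
from itertools import product

# Target words from Table 5.2, grouped by tonal combination.
TARGET_WORDS = {
    "HH": ["xiang.yan", "wu.ya", "qing.wa"],
    "HL": ["qiu.yin", "ying.wu", "ban.ma"],
    "LH": ["gui.wu", "yu.yi", "hai.ou"],
}

words = [(tone, w) for tone, ws in TARGET_WORDS.items() for w in ws]

# One target trial per word and givenness level; each trial contains a
# non-corrective and a corrective target sentence.
trials = list(product(words, ["New", "Given"]))
sentences = [
    (tone, word, correctiveness, givenness)
    for (tone, word), givenness in trials
    for correctiveness in ["Non-Corrective", "Corrective"]
]

print(len(words), len(trials), len(sentences))  # 9 18 36
```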
Among all the trials, any two of the nouns co-occurred less than four times, and any three of the
nouns co-occurred less than once in a trial. The tonal combinations used in target words (HH, HL,
and LH) appeared 56 times each, and the tonal combination which was only used in filler words
(LL) appeared 48 times. All the words (i.e. names of the pictures) were concrete nouns that
denoted movable objects and had a frequency no higher than 15.23 counts per million according
to Cai and Brysbaert (2010). I controlled for various factors including the number of syllables,
the combination of tones, semantic properties, and word frequency. This placed severe
constraints on the choice of words, and thus voiceless and sibilant consonants in the words could
not be entirely avoided. The positions and directions of arrows on the displays were
counterbalanced across trials, so that none of the movements were particularly predictable or
surprising to the participants. Thus, in this study, I focused on information structure and did
not look at information-theoretic factors. Nevertheless, predictability is also an important
direction for prosody research, as demonstrated by existing work as well as my studies in the
preceding chapters. The results of the current study can provide a foundation for future work to
build on and investigate whether the prosodic marking of information structure is modulated by
information-theoretic properties and interlocutor-related factors in Mandarin prosody (as
has been found for English in this dissertation).
5.3.2. Procedure
Participants were told to give instructions to move objects based on the pictures and arrows on
the computer screen, to check if their instructions were carried out correctly, and to provide a
correction if their instructions were not followed. They were asked to only use the sentence
frame in (7) during the entire experiment, and to speak as naturally as possible. Participants
were told to imagine that they were speaking to a person in another room, in front of another
computer connected to the participants’ computer, and that the listener, who might sometimes
get distracted and make mistakes, would move the objects according to the participants’
instructions. This was done in order to make the task as natural as possible; my assumption was
that people would be most likely to mark information-structural cues in their prosody in a
communicative situation.
5.3.3. Participants
Ten adult native speakers of Mandarin, five women and five men, participated. All were either
born in Beijing or had lived in Beijing since age 13 or younger. All participants were born and
raised in China. All of them were students or visiting scholars at the University of Southern
California, and had left Beijing no more than two years before their participation in this study.
5.3.4. Data analysis
Acoustic analyses were done using the Praat software with the ProsodyPro script (Xu,
2005-2011). Duration, f0, and intensity were extracted by the script. Repeated-measures ANOVAs and
paired t-tests were conducted on the duration, f0 ranges (maximum f0 – minimum f0), and
intensity ranges (maximum intensity – minimum intensity) of target words. All ANOVAs
presented in this paper had Correctiveness (correction or non-correction) and Givenness (given
or new) as independent variables. In the by-subject analyses, the tonal combination of each target
word (HH, HL, or LH) was also included as a control variable. Here I focus on the by-subject
analyses, because the design of the study does not allow for by-item analyses of all targets, as the
nine target words were ‘cycled’ through the 36 target items. However, if one analyzes the nine
target words in all four conditions, the statistical patterns closely resemble the by-subject
analyses.
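The range measures and the by-subject aggregation described above can be sketched as follows. This is an illustration only, assuming that each record carries a participant label, the two factor levels, and one measured value; the function names and the sample values are hypothetical, and this is not the ProsodyPro script or the statistical code used in this study.

```python
from collections import defaultdict
from statistics import mean


def value_range(samples):
    """Range of a contour: maximum minus minimum."""
    return max(samples) - min(samples)


def by_subject_cell_means(records):
    """Average a measure within each participant x condition cell.

    records: iterable of (participant, correctiveness, givenness, value).
    Returns {(participant, correctiveness, givenness): mean value}.
    """
    cells = defaultdict(list)
    for participant, correctiveness, givenness, value in records:
        cells[(participant, correctiveness, givenness)].append(value)
    return {cell: mean(values) for cell, values in cells.items()}


# Hypothetical f0 samples (Hz) for two productions by one participant.
records = [
    ("P01", "Corrective", "New", value_range([180.0, 240.0, 210.0])),
    ("P01", "Corrective", "New", value_range([175.0, 255.0, 200.0])),
]
print(by_subject_cell_means(records))
# {('P01', 'Corrective', 'New'): 70.0}
```

With ten participants, each condition contributes ten such cell means, which is why the by-subject tests reported below have 9 degrees of freedom in the error term.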
To make sure that my data or conclusions are not distorted by the occurrence of creaky voice, I
manually removed the markings for aperiodic pulses (resulting from irregular vibration of vocal
folds) before the f0 values were computed. This is because pitch tracking for aperiodic
waveforms is often inaccurate, which could potentially impact the analysis of f0, as low tones
frequently induce creaky voice in tone languages (e.g. Yoruba: Welmers, 1973:109;
Cantonese: Vance, 1977; Mandarin: Belotel-Grenié & Grenié, 1994). To minimize this problem,
all of the f0 analyses focus on the non-creaky portions. Since the minimum f0 in the HL and LH
tonal combinations appears during the low tone component, the data might not accurately reflect
the actual f0 ranges. In a HL or LH word that contained a creaky region, the actual minimum f0
could be lower than what was measured, and if so, the f0 range calculated by subtracting
minimum f0 from maximum f0 would be smaller than the actual range. Therefore, the effects of information structure
on f0 ranges could be consequently overestimated or underestimated, if creaky voice occurred in
one type of information structure more than another. However, a pre-analysis I conducted on the
data did not find creaky voice associated with particular information-structural types. (Instead, it
seems that some participants produced creaky voice more often than others.) Thus, I do not
regard the occurrence of creaky voice as a problem for this study.
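The direction of this potential bias can be illustrated with a toy contour. The values below are invented for illustration (in real recordings, f0 in the creaky stretch may not be measurable at all), and the helper name is hypothetical.

```python
def f0_range(samples):
    """Maximum minus minimum over a list of f0 values in Hz."""
    return max(samples) - min(samples)


# Hypothetical HL word: (f0 in Hz, is_creaky) per sample. The values in the
# creaky stretch are invented; in real data they may not be measurable.
contour = [(220, False), (215, False), (170, False), (120, True), (110, True)]

measured = f0_range([f0 for f0, creaky in contour if not creaky])
actual = f0_range([f0 for f0, _ in contour])

print(measured, actual)  # 50 110
```

Because the excluded samples sit at the low end of the contour, the measured range can only be equal to or smaller than the actual one. A bias in the comparison between conditions arises only if creak is unevenly distributed across those conditions.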
5.4. Experiment 4: Results
Three sentences were missing from the recordings due to technical problems, and two sentences
were misspoken. They amount to 1.39% of the data. In this section, I first present the results for
duration of target words. Then I turn to f0 and intensity measures, which were analyzed in three
different ways: (i) the range of f0 and intensity measures on the target words (difference
between maximum and minimum on a given word), (ii) the maximum and minimum measures
for f0 and intensity, and (iii) the mean values for f0 and intensity on the target word.
5.4.1. Duration, f0 ranges, and intensity ranges
Duration
Overall, as can be seen in Figure 5.3, words in the corrective conditions (the two bars on the left)
are longer than words in the non-corrective conditions (the two bars on the right). Within the
non-corrective conditions, words that are given information have shorter durations than words
that are new information, but this distinction between given and new does not appear in the
corrective conditions. The observations are confirmed by statistical analyses: ANOVAs show a
main effect of correctiveness (F(1,9)=20.020, p< .01), with corrective conditions showing
significantly longer duration than non-corrective conditions, and no main effect of givenness
(F(1,9)=2.189, p= .173). There is a significant interaction between correctiveness and givenness
(F(1,9)=6.260, p< .05). More specifically, planned comparisons reveal that the correctiveness
effect on duration occurs in both the new information conditions (Corrective New has longer
duration than Non-Corrective New: t(9)=4.177, p< .01) and the given information conditions
(Corrective Given has longer duration than Non-Corrective Given: t(9)=4.641, p< .01), but the
givenness effect on duration emerges only when the words are non-corrective (Non-Corrective
New has longer duration than Non-Corrective Given: t(9)=3.333, p< .01) and not when the
words are corrective (Corrective New does not differ from Corrective Given: t(9)= .331, p= .748).
In sum, while non-corrective words show an effect of givenness, no effect of givenness is
detected on corrective words.
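For readers who want to see how such F(1,9) values and planned comparisons arise, here is a minimal sketch. It relies on the general fact that, in a fully within-subject 2x2 design, each effect has one degree of freedom and its F equals the squared one-sample t statistic on the corresponding per-subject contrast. The function names, the cell labels (CN, CG, NN, NG), and the numbers are hypothetical; this is not the analysis script used for the experiment.

```python
import math
from statistics import mean, stdev

def one_sample_t(scores):
    """t statistic (df = n - 1) testing whether the mean score is zero."""
    n = len(scores)
    return mean(scores) / (stdev(scores) / math.sqrt(n))

def anova_2x2_within(subjects):
    """F(1, n - 1) for both main effects and the interaction.

    subjects: one dict per speaker holding by-subject means under the keys
    "CN", "CG", "NN", "NG" (Corrective/Non-corrective x New/Given).
    """
    contrasts = {
        "correctiveness": [s["CN"] + s["CG"] - s["NN"] - s["NG"] for s in subjects],
        "givenness": [s["CN"] + s["NN"] - s["CG"] - s["NG"] for s in subjects],
        "interaction": [s["CN"] - s["CG"] - s["NN"] + s["NG"] for s in subjects],
    }
    return {effect: one_sample_t(scores) ** 2 for effect, scores in contrasts.items()}

def paired_t(a, b):
    """Paired t statistic for a planned comparison between two conditions."""
    return one_sample_t([x - y for x, y in zip(a, b)])

# Hypothetical by-subject mean durations in ms; not the experiment's data.
subjects = [
    {"CN": 462.0, "CG": 458.0, "NN": 425.0, "NG": 401.0},
    {"CN": 448.0, "CG": 451.0, "NN": 412.0, "NG": 396.0},
    {"CN": 471.0, "CG": 465.0, "NN": 433.0, "NG": 409.0},
    {"CN": 455.0, "CG": 457.0, "NN": 418.0, "NG": 399.0},
]
f_values = anova_2x2_within(subjects)
t_given_noncorr = paired_t([s["NN"] for s in subjects], [s["NG"] for s in subjects])
```

With ten speakers, as the reported degrees of freedom suggest, the same functions would yield F(1,9) and t(9) statistics; p-values would additionally require the F and t distributions, which this sketch leaves out.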
Figure 5.3: Average duration of the target words in each condition of Experiment 4 (Error
bars show +/- 1 SE).
F0 ranges
Having considered duration, let us now turn to the findings for f0 ranges. Overall, f0 ranges
show a pattern similar to that of duration, as can be seen in Figure 5.4. Mirroring the duration results,
words in the corrective conditions have larger f0 ranges than words in the non-corrective
conditions; given information has smaller f0 ranges than new information in the non-corrective
conditions, but this given/new distinction is not present in the corrective conditions. These
observations are again confirmed statistically: ANOVAs show a main effect of correctiveness
(F(1,9)=22.232, p< .01) but no main effect of givenness (F(1,9)= .749, p=.409). There is a
significant interaction between correctiveness and givenness (F(1,9)=5.892, p< .05). Planned
comparisons reveal that the correctiveness effect on f0 ranges occurs in both the new information
conditions (Corrective New has a larger range than Non-Corrective New: t(9)=3.536, p< .01) and
the given information conditions (Corrective Given has a larger range than Non-Corrective
Given: t(9)=5.059, p< .01). However, the givenness effect on f0 ranges emerges only when the
words are non-corrective (Non-Corrective New has a bigger range than Non-Corrective Given:
t(9)=3.348, p< .01) and not when they are corrective (Corrective New does not differ from
Corrective Given: t(9)=-.669, p= .521).
[Figure 5.3 plot: y-axis Duration (ms); x-axis conditions Corrective New, Corrective Given, Non-Corrective New, Non-Corrective Given.]
Figure 5.4: Average F0 ranges of the target words in each condition of Experiment 4 (Error
bars show +/- 1 SE).
Intensity ranges
Finally, let us move on to the findings for the third parameter, intensity ranges, shown in Figure
5.5. Interestingly, the intensity range patterns differ from the patterns observed for f0 ranges and
duration. On the one hand, the distinction between corrective and non-corrective conditions remains:
Words in the corrective conditions have larger intensity ranges than words in the non-corrective
conditions. However, new information does not differ from given information on intensity ranges,
in either the corrective or non-corrective conditions. ANOVAs show a main effect of
correctiveness (F(1,9)=9.659, p< .05). There is no main effect of givenness (F(1,9)= .130, p=.727)
and no interaction between correctiveness and givenness (F(1,9)=.563, p= .472).
[Figure 5.4 plot: y-axis F0 Ranges (Hz); x-axis conditions Corrective New, Corrective Given, Non-Corrective New, Non-Corrective Given.]
Figure 5.5: Average intensity ranges of the target words in each condition of Experiment 4
(Error bars show +/- 1 SE).
5.4.2. F0 and intensity maximum, minimum, and mean
In contrast to prior work where the distinction between correction and new information was only
found in one prosodic dimension (duration in Greif, 2010; f0 ranges in Chen & Braun, 2006), the
results presented in the preceding sections show that corrective focus and new-information focus
differ in all three prosodic dimensions. However, in order to better understand the underlying
nature of the f0 and intensity ranges, further analyses are needed. As mentioned in the ‘aims’
section, f0 ranges or intensity ranges are not a single parameter in and of themselves. To
understand how f0 and intensity ranges are altered acoustically, we need to examine their
‘components’ – the maxima and minima of f0 and intensity. Inspecting the maxima and minima
allows us to see how range expansion is accomplished, e.g. (i) by raising the maximum, (ii)
lowering the minimum, or (iii) both? (See Figure 5.1). I aim to answer two questions: First, are
f0 range expansion and intensity range expansion achieved in the same way? Second, do new-
information focus and corrective focus involve different ‘strategies’ of range expansion for
either f0 or intensity? In this section, I present the results of maximum and minimum f0, and
maximum and minimum intensity.
[Figure 5.5 plot: y-axis Intensity Ranges (dB); x-axis conditions Corrective New, Corrective Given, Non-Corrective New, Non-Corrective Given.]
Maximum and minimum f0
Breaking f0 ranges down into maximum and minimum f0, we see that different types of
information structure enlarge f0 ranges through different means. As shown in Figure 5.6, words
in the corrective conditions have overall higher maximum f0 and lower minimum f0 than words
in the non-corrective conditions – in other words, the increased f0 range that we observe for
correction is accomplished by both raising the maximum and lowering the minimum. Taking a
closer look at the conditions, we see that the Corrective Given and Corrective New conditions do
not differ from each other in terms of their f0 maxima or minima (|t(9)|s< .643, p’s> .536).
However, when we look at the non-corrective conditions, we see that Non-Corrective Given and
Non-Corrective New differ in terms of their maximum f0 (higher for new information) but not in
terms of their minimum f0. Thus, the f0 range expansion that I reported above for words that
are new information in non-corrective contexts is accomplished by raising the maximum f0
without changing the minimum f0.
The statistical analyses confirm these observations to a large extent. ANOVAs show a main
effect of correctiveness on maximum f0 (F(1,9)=27.844, p< .01) and minimum f0 (F(1,9)=7.415,
p< .05), with maximum f0 being significantly higher and minimum f0 significantly lower in the
corrective words than the non-corrective words. Somewhat unexpectedly, there is no significant
interaction between correctiveness and givenness in either maximum f0 (F(1,9)=2.994, p= .118)
or minimum f0 (F(1,9)= .802, p= .394). Nevertheless, given and new information in the non-
corrective conditions differ in maximum f0 by 5 Hz, which is in magnitude the same as the
statistically significant difference in minimum f0 between corrective and non-corrective words.
Figure 5.6: Average maximum and minimum F0 of the target words in each condition of
Experiment 4 (Error bars show +/- 1 SE).
Maximum and minimum intensity
Turning now to the question of intensity ranges, a very different pattern emerges: In contrast to
f0, maximum intensity stays the same among different types of information structure, as
indicated in Figure 5.7. The presence and absence of correction is reflected only in minimum
intensity: Minimum intensity is lower in corrective words than non-corrective words. ANOVAs
show a main effect of correctiveness on minimum intensity (F(1,9)=9.013, p< .05) but not on
maximum intensity (F(1,9)= .059, p= .813). Consistent with the patterns of intensity ranges,
there is no interaction between correctiveness and givenness in either minimum intensity
(F(1,9)=1.119, p= .318) or maximum intensity (F(1,9)= .802, p= .394).
[Figure 5.6 plot: y-axis Max & Min F0 (Hz); series Max and Min; x-axis conditions Corrective New, Corrective Given, Non-Corrective New, Non-Corrective Given.]
Figure 5.7: Average maximum and minimum intensity of the target words in each
condition of Experiment 4 (Error bars show +/- 1 SE).
In sum, we find that range expansion for f0 and for intensity is accomplished in
different ways, with intensity range expansion accomplished simply by lowering the minimum
intensity and f0 range expansion showing a more complex pattern: The f0 range expansion
observed for corrective focus is accomplished by both raising the maximum and lowering the
minimum, whereas the f0 range expansion observed for non-corrective new information focus is
accomplished by raising the maximum f0 without changing the minimum f0.
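The three routes to range expansion can be made concrete with a small sketch. It labels a change from a baseline condition to a focused condition as (a) maximum raised and minimum lowered, (b) only maximum raised, or (c) only minimum lowered. The function name, the numeric tolerance, and the example values are hypothetical; in particular, the tolerance merely stands in for the significance tests reported above.

```python
def expansion_strategy(base_max, base_min, focus_max, focus_min, tol=0.0):
    """Label how a range grew from a baseline to a focused condition.

    Returns "a" (maximum raised and minimum lowered), "b" (only maximum
    raised), "c" (only minimum lowered) or "none". `tol` is a hypothetical
    threshold standing in for the significance tests used in the text.
    """
    raised = (focus_max - base_max) > tol
    lowered = (base_min - focus_min) > tol
    if raised and lowered:
        return "a"
    if raised:
        return "b"
    if lowered:
        return "c"
    return "none"

# Hypothetical condition means; not values from the experiment.
print(expansion_strategy(200.0, 150.0, 215.0, 142.0, tol=2.0))  # f0 in Hz -> a
print(expansion_strategy(200.0, 150.0, 206.0, 150.5, tol=2.0))  # f0 in Hz -> b
print(expansion_strategy(75.0, 58.0, 75.2, 54.0, tol=1.0))      # intensity in dB -> c
```

The three calls mimic, in order, the patterns described for f0 under corrective focus, f0 under non-corrective new-information focus, and intensity under corrective focus.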
Mean f0 and mean intensity
Having inspected the ranges of f0 and intensity and their maxima and minima, we now look at
the means of f0 and intensity. In the preceding section, we saw that only in some contexts did
range expansion involve both raising maxima and lowering minima (i.e. f0 range expansion for
corrective focus). In other contexts where the ranges were expanded by only raising the maxima
(i.e. f0 range expansion for new-information focus) or only lowering minima (i.e. intensity range
expansion for corrective focus), means could potentially also reflect information structure.
Somewhat surprisingly, mean f0 and mean intensity do not robustly differ between information-
structural types. Despite a main effect of correctiveness on mean f0 (F(1,9)=12.244, p< .01), the
difference in mean f0 is significant only between the given information conditions (i.e.
Corrective Given is significantly higher than Non-Corrective Given: t(9)=4.307, p< .01) but
marginal between the new information conditions (i.e. Corrective New is marginally higher than
Non-Corrective New: t(9)=1.858, p= .096). There is no main effect of givenness on mean f0
(F(1,9)= .616, p=.453). Also, neither correctiveness (F(1,9)= .110, p=.748) nor givenness
(F(1,9)= .486, p=.503) has a main effect on mean intensity.
In sum, we see that – unlike ranges, maxima, and minima – means of f0 and intensity do not
provide reliable cues about information status in Mandarin. This finding highlights the
importance of a comprehensive analysis of prosodic features; in our case, the major cues for
information-structural distinctions would have been overlooked if one had only analyzed the
means of f0 and intensity.
5.4.3. Regarding lexical tone combinations
As mentioned in Section 5.3.1, I used bisyllabic target words in one of three lexical tonal
combinations: High-High (HH), High-Low (HL), or Low-High (LH). These tonal combinations
were equally distributed among the target words and the different conditions, to ensure that my
conclusions would not be restricted to a single tonal combination type. Because I controlled for
factors such as word frequency and semantic properties, the identity of the segments in target
words was not controlled across tonal combinations. This means that prosodic differences
between tonal combinations may come from segmental variance rather than tonal properties.
Thus, any comparison between tonal combinations must be viewed cautiously. The analyses in
this section are included for the sake of completeness, but the reader should keep in mind that
these are post-hoc analyses and the study was not designed with these analyses in mind (due to
the variability in the segmental properties of the target words).
When we look at the prosodic features within each tonal combination, we see that the majority of
the patterns discussed in the results section emerge within each tonal combination as well. In the
remainder of this section, I will first consider the Corrective vs. Non-Corrective manipulation,
and then the Given vs. New manipulation.
Corrective vs. Non-Corrective in each tonal combination
Paired t-tests were conducted comparing corrective new vs. non-corrective new for all three tonal
combinations, and corrective given vs. non-corrective given for all three tonal combinations.
Overall, the effects of correctiveness in both new and given information that we observed in the
preceding sections appear in all tonal combinations, although some of them do not reach
significance. The absence of significance is not surprising, given (i) the fact that identity of the
segments in target words was not controlled across tonal combinations and (ii) the reduction in
power that comes from looking at a third of the entire dataset (since there are three tonal
combinations), and even less when it is split into given vs. new and corrective vs. non-corrective.
Nonetheless, the distribution of significance among the conditions shows patterns compatible
with prior work that examines the dependence of focus-driven prosody on lexical tones (Chen &
Gussenhoven, 2008). In fact, while all tonal combinations are significantly lengthened in
corrective focus, only the HL combination robustly differs in the f0 and intensity dimensions
between corrective and non-corrective conditions. The properties of tones seem to impose
greater restrictions on LH and HH than on HL in terms of the extent to which f0 and intensity
can be altered to encode information structure (see Chen & Gussenhoven, 2008 for relevant
further discussion).
Duration: Corrective words are longer (all t(9)s>3.707, p’s< .01) in all tonal combinations than
non-corrective words. F0 ranges: Corrective words have larger f0 ranges (all t(9)s>2.659,
p’s< .05) than non-corrective words (except corrective new vs. non-corrective new in LH, which
is marginal, p=.083, and corrective new vs. non-corrective new in HH which is ns). Intensity
ranges: Corrective words have larger intensity ranges than non-corrective words (all t(9)s>2.599,
p’s< .05; except for corrective given vs. non-corrective given in LH which is marginal, p=.057,
and corrective new vs. non-corrective new in HH which is ns). F0 maxima: Corrective words
have higher f0 maxima than non-corrective words (all t(9)s>2.986, p’s< .05, except for
corrective new vs. corrective given in LH which is ns). F0 minima: Corrective words have
numerically lower f0 minima than non-corrective words (n.s.). Intensity maxima: Corrective
words have numerically higher intensity maxima (corrective new vs. non-corrective new and
corrective given vs. non-corrective given for HH reach significance, p’s<.05). Intensity minima:
Corrective words have lower intensity minima (all t(9)s<-2.522, p’s< .05; except for corrective
new vs. non-corrective new LH which is marginal, p=.052, and corrective new vs. non-corrective
new HH which is ns).
Given vs. New in each tonal combination
Paired t-tests were conducted comparing corrective new vs. corrective given for all three tonal
combinations, and non-corrective new vs. non-corrective given for all three tonal combinations.
Numerically, the prosodic properties of the three tonal combinations largely mirror the data
pattern presented in the preceding sections, although the analyses do not reach significance
(except for intensity ranges: Non-Corrective New information has a larger intensity range than
Non-Corrective Given information in HH words, t(9)=2.519, p< .05). This is not surprising, as
discussed above for the correctiveness manipulation. The descriptive statistics are largely
consistent with our previous observations: Within a tonal combination, most of the numerical
differences between the Non-Corrective New and Non-Corrective Given conditions go in the
same direction as in the main analyses.
5.5. Discussion
Experiment 4 investigates the prosodic cues for two kinds of distinctions between discourse-
information structures in Beijing Mandarin: the presence or absence of corrective focus, and the
new vs. given distinction. Although existing findings on Mandarin prosody generally agree that
new-information focus and corrective focus both involve increases in f0 displacement and
duration (relative to given words or words in broad focus), there is as of yet no consensus about
whether and how the two focus types – new-information focus and corrective focus – differ from
each other. The role of intensity is also not well-understood. A better understanding of these
issues is important because they are involved in fundamental questions regarding the relationship
of information structure and prosody, such as how the prosodic system represents different
information-structural categories and lexical contrasts at the same time. In this section, I discuss
how my results relate to my three key aims sketched out at the start of the chapter, as well as their
broader implications.
My first aim was to investigate whether words in new-information focus vs. corrective focus
differ from each other in terms of their prosodic realization in Mandarin, and if so, which
parameters (e.g. duration, mean f0, f0 range, or intensity) encode these differences. My second
aim was to explore whether and how the distinction between ‘new’ and ‘given’ influences the
acoustic realization of both corrective and non-corrective words.
As regards the first aim, my results suggest that the cues for corrective focus do differ from those
signaling the new vs. given distinction. Corrective focus and new-information focus are
distinguished from each other both by the degrees of prominence they induce along the same
acoustic dimensions, and by the different acoustic dimensions they occupy. On the one hand,
corrective focus and new-information focus both affect duration and f0, but corrective focus has
a stronger impact on these prosodic features than new-information focus. On the other hand,
intensity cues only appear for corrective focus, not for new-information focus. Despite the fact
that my study did not focus on the same kinds of information-structural distinctions, my findings
are consistent with Chen & Braun (2006) on a conceptual level. This indicates the full
complexity of prosodic structure that allows many information-structural categories to be
encoded distinctively.
As regards the second aim, I found that the prosodic distinction between new and given
information only emerged in non-corrective words in my study. There are several possible
reasons for why correctively focused words do not show a distinction between given and new.
Cognitively, correction might be more salient than ‘newness’ (e.g. Kaiser, 2011), which could in
some sense ‘overwhelm’ the distinction between new and given information in a context where
the words are also corrective. In principle, such a ‘ceiling effect’ could also be caused by
physiological constraints. Since correction yields extremely strong prosodic prominence even
when the information is given, it might be difficult or inefficient to further increase the
prominence for new information. However, a prior study has found duration lengthening for
corrective focus in two different emphatic degrees (Chen & Gussenhoven, 2008). This suggests
that physiology is unlikely to be hindering the prosodic realization of different degrees of
emphasis (for related work on speech perception, see Ladd & Morton, 1997).
Another explanation is that speakers in my experiment might define givenness from their own
perspective (rather than from the hearer’s perspective as intended), which would then effectively
remove the distinction between new and given in the corrective focus conditions in my
experiment. More specifically, recall that target sentences in the corrective conditions had been
uttered by the speaker, although the listener apparently did not hear them properly the first time.
If speakers fail to keep a log of the listener’s knowledge state, then the prosodic differences
between new and given information in non-corrective conditions might actually reflect whether
the words had been uttered more than once, rather than whether the listener had heard the words.
This explanation, if it turns out to be on the right track, would be in line with existing work that
indicates egocentric or production-internal processes to be fundamental in language production
(see Section 1.3 for more details).
Lastly, as discussed in the method section, existing work has shown that given information is
prosodically prominent when it appears in a different syntactic role (Terken & Hirschberg, 1994;
Schwarzschild, 1999). In my study, a target word in the given-information conditions first occurs
in the LOCATION role and then in the OBJECT role. Although I did find prosodic differences
between given information and new information when there is no correction, the degree of
givenness might not be large enough for given information to be substantially de-accented in
corrective focus due to the change in syntactic roles. These are intriguing questions that deserve
to be investigated further in future work.
My third aim was to provide a more inclusive analysis of the acoustic parameters relevant to
information structure, including an analysis of intensity as well as a closer look at how f0 and
intensity range expansion is accomplished (e.g. lowering minima, raising maxima, or both). Regarding
intensity, my results show that it can provide information-structural cues, though to a limited
extent: While corrective focus results in intensity range expansion, new-information
focus does not.
Regarding range expansion, I found that range expansion for f0 and for intensity was
accomplished via multiple routes. As illustrated in Figure 5.1, the two parameters involved in
the expansion of ranges – maximum and minimum – potentially form three ways of achieving
range expansion: (a) raising the maximum and lowering the minimum, (b) only raising the
maximum, and (c) only lowering the minimum. Intuitively, one might expect that extending both
the upper and lower bound of a range (option (a)) would require the least articulatory effort while
accomplishing the largest range expansion, so this pattern might be the most widely used.
However, analyses of the production data from this experiment reveal that both maximum and
minimum were employed, and all three possible ways of expanding ranges emerged in different
portions of the data. Recall that f0 ranges were extended for both corrective focus and new-
information focus, whereas intensity ranges were extended only for corrective focus. I found that
corrective focus expands f0 ranges by raising the maximum and lowering the minimum (option
(a)), new-information focus expands f0 ranges by only raising the maximum (option (b)), and
corrective focus expands intensity ranges by only lowering the minimum (option (c)). My
findings rule out a simple hypothesis that the semantically or pragmatically prominent words are
merely spoken slower, louder and with higher pitch. Indeed, the duration of the words does
become longer, but f0 range expansion during a focused word results from extending not only
the upper bound but also the lower bound of the range. Moreover, the expansion of intensity
ranges is due to a decrease in intensity during some part of the word, rather than an increase.
In earlier work, Chen and Gussenhoven (2008) found that the expansion of f0 ranges for
information structure in Mandarin is mainly accomplished by raising maximum f0, which is
somewhat inconsistent with my findings. As pointed out by Chen and Gussenhoven (2008), this
might have to do with the fact that f0 lowering in a low lexical tone leads to serious creakiness,
which makes it difficult to assess the effect of emphasis on f0 minimum. In Section 5.4.3, the
analyses of f0 minimum also show no significant difference between information-structural
conditions within a tonal combination. However, a possibility that cannot be excluded is that the
cues for range expansion do not necessarily exist at both sides of the range; I leave the question
open for future work.
Let us now briefly consider the fact that in a tone language like Mandarin, discourse-level
intonation and lexical tones potentially occupy the same acoustic dimensions.
Existing work on Mandarin has found that all three prosodic dimensions – duration, f0, and
intensity – provide cues for discourse-level information, as well as make contrasts (i.e. lexical
tones) between word meanings. Largely consistent with prior studies (Jin, 1996; Xu, 1999; Chen,
2006; Chen & Braun, 2006; Chen & Gussenhoven, 2008; Greif, 2010), I found lengthening and
f0 range expansion in corrective focus and new-information focus (although Chen and Braun
(2006) did not find lengthening and Greif (2010) did not find f0 range expansion for corrective
focus, as discussed in Section 5.1). Furthermore, my results show that intensity ranges may also
be expanded to emphasize words in an utterance: Intensity excursions become larger when the
speakers express a correction (although intensity excursions do not distinguish new information
from given information). In other words, there is no evidence for specialized functions where
some prosodic dimensions mark information structure and others mark lexical items, e.g. it is not
the case that intensity ranges are used to mark only lexical contrasts whereas f0 ranges are used
to mark only information-structural distinctions. All three prosodic dimensions are multi-
functional.
It is worth noting that covariation between f0 and intensity has been found in production studies,
regardless of whether the participants are asked to vary f0 and intensity separately, or whether
they have professional vocal/singing training (e.g. Chen, Park, Kreiman, & Alwan, 2014;
Gramming, Sundberg, Ternström, Leanderson, & Perkins, 1988; Holmberg, Hillman, & Perkell,
1988). Nevertheless, previous research has also shown that the effects of f0 and intensity are not
necessarily accompanied by each other. For instance, in a study on voice quality by Chen, Park,
Kreiman and Alwan (2014), eight participants were asked to vary their f0 while keeping their
intensity constant. It was found that f0 is a significant predictor of the measure of interest (the
relative magnitude of the first two harmonics) for all eight participants. However, only three
participants exhibited a small effect of intensity; the other five participants did not show a
significant effect of intensity. These mixed findings indicate that f0 and intensity are neither fully
co-dependent nor fully separate dimensions. This is reflected in the results of Experiment 4,
where f0 marks both correction and new information, while intensity marks only correction, not
new information.
The results of Experiment 4 are also compatible with earlier observations regarding the manner
in which prosodic cues in Mandarin encode discourse-information structure and lexical
distinctions – in particular, the idea that these two kinds of information are encoded differently:
The latter has to do with the shapes of f0 movement and intensity movement, and the former
with the ranges of their movement. Earlier work has pointed out that, for different lexical tones,
the shapes of f0 contours clearly differ, whereas with information-structural types, what vary are
the ranges of f0 contours (Xu, 1997; Chen & Gussenhoven, 2008). Whalen and Xu (1992)
suggest that f0 and intensity are positively correlated in lexical tones, which enables Mandarin
speakers to perceive tones without the presence of contrastive f0 patterns. Indeed, if one inspects
the contours of f0 and intensity in the data, both their shapes considerably differ between tonal
combinations while staying similar across different conditions of information structure. Given
my results showing that intensity ranges are used to differentiate information-structural types,
there appear to be parallels between f0 and intensity in the specialization of parameters. Lexical
information is encoded by the shapes of f0 and intensity contours, whereas discourse information
is marked by the ranges of f0 excursions and, as indicated by my findings, the ranges of intensity
excursions. This highlights the fine-grained ability of the language production system to utilize
different aspects of acoustic dimensions.
The results reported in this chapter have focused on information-structural factors, but did not
directly test the effects of information-theoretic factors. In on-going work, I am investigating
whether word frequency and contextual probability influence prosody in Chinese languages in the
same ways as they do in English. To what extent do lexical tones affect the ways in which other
types of information are encoded in prosody? To my knowledge, there is little work in this regard
(but see, e.g., Wiener, Speer, & Shank, 2012, a study that has explored the prosodic consequences
of word frequency and repetition). Furthermore, future research can investigate whether
information structure interacts with information-theoretic properties and perspective-taking
factors in Chinese prosody, as has been found for English in this dissertation.
Compared to English, Chinese languages have considerably more homophones that can be
differentiated in writing. This property of Chinese languages allows us to strictly control the
identity of segments we use in a study, and thereby better understand whether or to what extent
a word’s phonological representation affects its prosody (see e.g., Gahl, 2008, for more
discussion on the potential mechanisms that drive the effect of word frequency based on English
homophone data such as ‘time’ vs. ‘thyme’). Cross-linguistic comparisons can disentangle
language-specific features from what might be universal, which will inform our understanding of
the human language faculty.
Chapter 6. General Discussion
This dissertation set out to extend our knowledge of prosody by addressing fundamental
questions including what kinds of information are conveyed through prosody, which prosodic
dimensions are used to convey them, and how individual speakers differ from one another. I
conducted four production experiments to examine how various factors interact with one another
in shaping the prosody of an utterance and how prosody fulfills its multi-functional role.
One widely-accepted view on prosody is that prosodic prominence signals the extent to which a
linguistic element is ‘informative’. The notion of informativity, despite being intuitive, is hard to
define and not yet well-understood. This dissertation aims to contribute to this body of literature
by combining insights from different lines of research. Prior work has approached the
relationship between prosody and informativity from various angles, of which two popular ones
are information structure and information theory. However, little attention has been paid to the
potential interaction between information-structural categories – such as new or corrective
information – and information-theoretic properties – such as lexical frequency or contextual
probability. With Experiment 1 I found that, in English, the prosodic consequences of new-
information focus are modulated by the focused word’s frequency, whereas the prosodic
consequences of corrective focus are modulated by the focused word’s probability in the context
(Chapter 2).
Furthermore, as existing work shows that speakers can differ in their ways of marking linguistic
distinctions using prosody, I looked into the inter- and intra-speaker variability in my data.
Previous research has rarely examined individual differences in the prosodic encoding of
informativity, let alone the interaction between information structure and information-theoretic
properties. I found that participants in Experiment 1 seemed to have individual ‘preferences’
regarding f0 shapes, the f0 ranges they used for an utterance, and the magnitude of differences in
f0 ranges by which they marked information-structural distinctions. In contrast, there is more
cross-speaker validity in the actual directions of differences in f0 ranges between information-
structural categories with respect to information-theoretic conditions. Thus, my results suggest
that f0 ranges might be more informative than f0 shapes in reflecting informativity across
English speakers (Chapter 3).
Experiments 2 and 3 investigate the motivation behind the interplay between corrective focus
and contextual probability. I show that the knowledge state of the interlocutor (i.e. the speaker’s
conversational partner) plays a key role. More specifically, prosody is influenced both by the
speakers’ assumptions (about what their interlocutors know) and by their interlocutors’ utterances
(that confirm or conflict with the speakers’ assumptions). Little previous research has
investigated how these interlocutor-related factors affect the way people speak. I found that the
effects of contextual probability on corrective prosody depend on whether the information
concerning words’ contextual probability is only known to the speaker or also available to the
interlocutor. When contextual probability is private knowledge, speakers prosodically emphasize
their corrective responses to their interlocutors’ contextually probable misstatements. In contrast,
when contextual probability is shared knowledge, speakers prosodically emphasize their
corrective responses to their interlocutors’ contextually improbable misstatements. I suggest that
these patterns can be explained by ‘epistemic surprisal’, a term I use to refer to the gap between
speakers’ prior assumptions and what they encounter/learn during the conversation. I propose
that the degree of prosodic prominence might reflect the extent to which the interlocutors’
utterances conflict with what the speakers expect to hear (Chapter 4).
To further illustrate the notion of epistemic surprisal, consider the following scenario: Ann, Betty
and Chris are close friends. Ann and Betty are planning a weekend trip and want to invite Chris
to come along. Ann says, “Chris is picky about the places she visits.” Betty says, “We can go
hiking in the Rocky Mountains.” Ann is surprised by Betty’s suggestion, because Ann and Betty
both know that Chris is not interested in outdoor athletic activities. Now, Ann responds, “Or we
can go to Six Flags.” According to my interaction-based hypothesis, Ann will prosodically
emphasize her response ‘Six Flags’ because it contrasts with Betty’s suggestion ‘Rocky
Mountains’, which, to Ann, was an unexpected suggestion that yields high epistemic surprisal.
However, imagine an alternative scenario, where only Ann is a close friend of Chris. Betty has
no idea whether Chris likes hiking or not, and Ann knows this too. Therefore, Betty’s outdoorsy
suggestion is not surprising to Ann in this situation. According to my interaction-based
hypothesis, Ann will not prosodically emphasize her response ‘Six Flags’ in this scenario,
because she has no particular expectations about what kinds of tourist attractions Betty should
(not) suggest. In other words, Betty’s utterance does not yield high epistemic surprisal for Ann,
because it does not conflict with Ann’s prior assumptions.
My idea that speakers’ expectations can modulate the encoding of correction is in line with work
in other areas of psycholinguistics. Indeed, a substantial body of literature has investigated the
effects of expectations in language comprehension and production. Comprehension studies show
that listeners anticipate what kinds of words will come next in ongoing utterances, given the
preceding context. Studies using the visual-world eye-tracking paradigm have found that people
move their eyes towards semantically or prosodically appropriate referents before they actually
hear the referential expression (e.g. Altmann & Kamide, 1999; Boland, 2005; Brown, Salverda,
Dilley, & Tanenhaus, 2011; Ito & Speer, 2008; Watson, Tanenhaus, & Gunlogson, 2008).
Listeners’ expectations have also been investigated using the event-related potential (ERP)
technique. Hearing words that do not fit their morphosyntactic contexts evokes larger N400
effects than hearing words that are morphosyntactically appropriate (e.g. van Berkum, Brown,
Zwitserlood, Kooijman, & Hagoort, 2005; DeLong, Urbach, & Kutas, 2005; Wicha, Bates,
Moreno, & Kutas, 2003). Turning to language production, expectation-related findings have
been reported in information-theoretic work. Experiments and spoken-corpus studies have
shown that speakers choose different forms of a word according to how predictable the
word is in the linguistic context (see Section 1.2 for more details). In sum, there is abundant
evidence for predictive processing at various linguistic levels, especially at the lexical and
syntactic levels (see also Kaiser & Trueswell, 2004, and others for evidence for predictive
processing at the discourse level). My work contributes to this literature by exploring the
prosodic domain.
Furthermore, it is important to acknowledge that language users have a dual role – as speaker
and as listener – in spoken communication. Most existing work on predictive processing
examines either production or comprehension; few studies consider the most common
conversational setting, where people rapidly switch between the roles of speaker and listener (but see
Section 4.1 for discussion on Brennan (1991), Kuhlen & Brennan (2010), and Russell & Schober
(1999)). For example, when people listen and respond to what others say, they might first
employ predictive processes as listeners and then as speakers. Does their predictive processing
as listeners have consequences for their utterances when they later become speakers?
Experiments 2 and 3 address this question and show that the prosody of speakers’ utterances can
reflect the ‘epistemic surprisal’ they previously experienced when listening to others’ utterances.
Experiment 4 further demonstrates the multi-functionality of prosody by investigating its
discourse-level functions in Mandarin Chinese, a tone language where a word’s prosodic pattern
is crucial to its meaning. Previous work on Mandarin reports mixed findings as to which
information-structural distinctions are encoded in prosody and what kinds of acoustic cues signal
those distinctions. My results show that, although prosodic dimensions such as pitch, intensity
and duration serve fundamental, lexical-level functions in Mandarin, they nevertheless provide
cues to information structure as well. Similar to what previous research has found with English,
corrective information is prosodically more prominent than non-corrective information, and new
information is prosodically more prominent than given information (Chapter 5).
Taken together, these experiments show a complex relationship between prosody and the
different types of information it encodes in a given language. To better understand prosody, it is
important to integrate insights from different traditions of research and to investigate prosody
across languages.
References
Allen, J. S., Miller, J. L., & DeSteno, D. (2003). Individual talker differences in voice-onset-
time. Journal of the Acoustical Society of America.
Altmann, G. T., & Kamide, Y. (1999). Incremental interpretation at verbs: Restricting the
domain of subsequent reference. Cognition, 73(3), 247-264.
Andreeva, Bistra, William J. Barry, and Ingmar Steiner. 2007. Producing phrasal prominence in
German. In Proceedings of the 16th International Congress of Phonetic Sciences, 1209-
1212.
Arnold, J. E., Kahn, J. M., & Pancani, G. C. (2012). Audience design affects acoustic reduction
via production facilitation. Psychonomic bulletin & review, 19(3), 505-512.
Arnold, J. E., Kam, C. L. H., & Tanenhaus, M. K. (2007). If you say thee uh you are describing
something hard: the on-line attribution of disfluency during reference
comprehension. Journal of Experimental Psychology: Learning, Memory, and
Cognition, 33(5), 914.
Aylett, M., & Turk, A. (2004). The smooth signal redundancy hypothesis: A functional
explanation for relationships between redundancy, prosodic prominence, and duration in
spontaneous speech. Language and speech, 47(1), 31-56.
Baayen, R. H., Piepenbrock, R., and Gulikers, L. (1995). The CELEX Lexical Database (Release
2) [CD-ROM]. Linguistic Data Consortium, University of Pennsylvania [Distributor],
Philadelphia, PA
Badino, L., & Clark, R. A. (2007). Issues of optionality in pitch accent placement. In SSW (pp.
252-257).
Baker, A. (2006). Quantifying diphthongs: a statistical technique for distinguishing formant
contours. NWAV35, November. Ohio State University, Columbus, OH.
Baker, R. E., & Bradlow, A. R. (2009). Variability in word duration as a function of probability,
speech style, and prosody. Language and speech, 52(4), 391-413.
Bard, E. G., Anderson, A. H., Sotillo, C., Aylett, M., Doherty-Sneddon, G., & Newlands, A.
(2000). Controlling the intelligibility of referring expressions in dialogue. Journal of
Memory and Language, 42(1), 1-22.
Bard, E. G., Aylett, M. P., Trueswell, J., & Tanenhaus, M. (2004). Referential form, word
duration, and modeling the listener in spoken dialogue. Approaches to studying world-
situated language use: Bridging the language-as-product and language-as-action
traditions, 173-191.
Bartels, C., & Kingston, J. (1994). Salient pitch cues in the perception of contrastive focus. The
Journal of the Acoustical Society of America, 95(5), 2973-2973.
Bates, Douglas, Martin Maechler, Ben Bolker, and Steven Walker. 2014. lme4: Linear mixed-
effects models using Eigen and S4. R package version 1.1-7.
Bauer, R. S., Cheung, K.-H., Cheung, P.-M., & Ng, L. (2004). Acoustic correlates of focus-stress
in Hong Kong Cantonese. Papers from the Eleventh Annual Meeting of the Southeast
Asian Linguistics Society. Arizona State University, Program for Southeast Asian
Studies.
Baumann, S., Grice, M., & Steindamm, S. (2006, May). Prosodic marking of focus domains-
categorical or gradient. In Proceedings of speech prosody (pp. 301-304).
Bavelas, J. B., Coates, L., & Johnson, T. (2000). Listeners as co-narrators. Journal of personality
and social psychology, 79(6), 941.
Bell, A., Brenier, J. M., Gregory, M., Girand, C., & Jurafsky, D. (2009). Predictability effects on
durations of content and function words in conversational English. Journal of Memory
and Language, 60(1), 92-111.
Bell, A., Jurafsky, D., Fosler-Lussier, E., Girand, C., Gregory, M., & Gildea, D. (2003). Effects
of disfluencies, predictability, and utterance position on word form variation in English
conversation. The Journal of the Acoustical Society of America, 113(2), 1001-1024.
Belotel-Grenié, A. & Grenié, M. (1994). Phonation types analysis in Standard Chinese.
Proceedings of the 3rd International Conference on Spoken Language Processing
(ICSLP 1994), Yokohama, Japan.
Büring, D. (2006). Focus projection and default prominence. The architecture of focus, 82, 321.
Birch, S., & Clifton, C. (1995). Focus, accent, and argument structure: Effects on language
comprehension. Language and speech, 38(4), 365-391.
Birner, B. J., & Ward, G. (1998). Information status and noncanonical word order in
English (Vol. 40). John Benjamins Publishing.
Boland, J. E. (2005). Visual arguments. Cognition, 95(3), 237-274.
Breen, M., Fedorenko, E., Wagner, M., & Gibson, E. (2010). Acoustic correlates of information
structure. Language and Cognitive Processes, 25(7-9), 1044-1098.
Brennan, S. E. (1991). Conversation with and through computers. User modeling and user-
adapted interaction, 1(1), 67-86.
Brennan, S. E., & Williams, M. (1995). The feeling of another’s knowing: Prosody and filled
pauses as cues to listeners about the metacognitive states of speakers. Journal of memory
and language, 34(3), 383-398.
Brown, G. (1983). Prosodic structure and the given/new distinction. In Prosody: Models and
measurements (pp. 67-77). Springer Berlin Heidelberg.
Brown, M., Salverda, A. P., Dilley, L. C., & Tanenhaus, M. K. (2011). Expectations from
preceding prosody influence segmentation in online sentence processing. Psychonomic
bulletin & review, 18(6), 1189-1196.
Brown, P. M., & Dell, G. S. (1987). Adapting production to comprehension: The explicit
mention of instruments. Cognitive Psychology, 19(4), 441-472.
Brown-Schmidt, S., & Hanna, J. E. (2011). Talking in another person’s shoes: Incremental
perspective-taking in language processing. Dialogue and Discourse, 2, 11-33.
Brysbaert, M., & New, B. (2009). Moving beyond Kučera and Francis: A critical evaluation of
current word frequency norms and the introduction of a new and improved word
frequency measure for American English. Behavior research methods, 41(4), 977-990.
Cai, Q., & Brysbaert, M. (2010). SUBTLEX-CH: Chinese Word and Character Frequencies
Based on Film Subtitles. Plos ONE, 5 (6), e10729.
Calhoun, S. (2010). How does informativeness affect prosodic prominence? Language and
Cognitive Processes, 25(7-9), 1099-1140.
Campbell, N., & Erickson, D. (2004). What do people hear? A study of the perception of non-
verbal affective information in conversational speech. Journal of the Phonetic Society of
Japan, 7(4), 9-28.
Chambers, J. M. and Hastie, T. J. (1992) Statistical Models in S, Wadsworth & Brooks/Cole.
Chen, G., Park, S. J., Kreiman, J., & Alwan, A. (2014). Investigating the effect of F0 and vocal
intensity on harmonic magnitudes: Data from high-speed laryngeal videoendoscopy.
In Fifteenth Annual Conference of the International Speech Communication Association.
Chen, S. W., Wang, B., & Xu, Y. (2009, September). Closely related languages, different ways
of realizing focus. In Interspeech (pp. 1007-1010).
Chen, Y. (2006). Durational adjustment under corrective focus in Standard Chinese. Journal of
Phonetics, 34(2), 176-201.
Chen, Y., & Braun, B. (2006). Prosodic realization of information structure categories in
standard Chinese.
Chen, Y., & Gussenhoven, C. (2008). Emphasis and tonal implementation in Standard
Chinese. Journal of Phonetics, 36(4), 724-746.
Ching, Marvin. K. L. 1982. The question intonation in assertions. American Speech, 95-107.
Cho, T., & Keating, P. A. (2001). Articulatory and acoustic studies on domain-initial
strengthening in Korean. Journal of Phonetics, 29(2), 155-190.
Clopper, C. G., & Pierrehumbert, J. B. (2008). Effects of semantic predictability and regional
dialect on vowel space reduction. The Journal of the Acoustical Society of
America, 124(3), 1682-1688.
Cooper, W. E., Eady, S. J., & Mueller, P. R. (1985). Acoustical aspects of contrastive stress in
question–answer contexts. The Journal of the Acoustical Society of America, 77(6),
2142-2156.
Couper-Kuhlen, E. (1984). A new look at contrastive intonation. In Modes of interpretation:
Essays presented to Ernst Leisi (pp. 137-158). Gunter Narr Verlag.
Creel, S. C., Aslin, R. N., & Tanenhaus, M. K. (2008). Heeding the voice of experience: The role
of talker variation in lexical access. Cognition, 106(2), 633-664.
Cutler, A. (1976). Phoneme-monitoring reaction time as a function of preceding intonation
contour. Perception & Psychophysics, 20(1), 55-60.
Dahan, D., & Bernard, J. M. (1996). Interspeaker variability in emphatic accent production in
French. Language and speech, 39(4), 341-374.
Dahan, D., Tanenhaus, M. K., & Chambers, C. G. (2002). Accent and reference resolution in
spoken-language comprehension. Journal of Memory and Language, 47(2), 292-314.
Davidson, L. (2006). Comparing tongue shapes from ultrasound imaging using smoothing spline
analysis of variance. The Journal of the Acoustical Society of America, 120(1), 407-415.
Dell, G. S., & Brown, P. M. (1991). Mechanisms for listener-adaptation in language production:
Limiting the role of the “model of the listener.” Bridges between psychology and
linguistics: A Swarthmore Festschrift for Lila Gleitman, 105.
DeLong, K. A., Urbach, T. P., & Kutas, M. (2005). Probabilistic word pre-activation during
language comprehension inferred from electrical brain activity. Nature
neuroscience, 8(8), 1117-1121.
Dik, S. C., & Hengeveld, K. (1997). The theory of functional grammar: the structure of the
clause. Walter de Gruyter.
Dombrowski, E. (2003). Semantic features of accent contours: effects of F0 peak position and F0
time shape. In Proc. 15th International Congress of Phonetic Sciences, Barcelona,
Spain (pp. 1217-1220).
Eady, S. J., & Cooper, W. E. (1986). Speech intonation and focus location in matched statements
and questions. The Journal of the Acoustical Society of America, 80(2), 402-415.
Ferguson, S. H. (2004). Talker differences in clear and conversational speech: Vowel
intelligibility for normal hearing listeners. Journal of the Acoustical Society of America
116:2365-2373.
Ferguson, S. H., & Kewley-Port, D. (2007). Talker differences in clear and conversational
speech: Acoustic characteristics of vowels. Journal of Speech, Language, and Hearing
Research 50:1241-1255.
Féry, C., Kaiser, E., Hörnig, R., Weskott, T., & Kliegl, R. (2009). Perception of intonational
contours on given and new referents: a completion study and an eye-movement
experiment. Phonology in perception, 15, 235.
Fougeron, C., & Keating, P. A. (1997). Articulatory strengthening at edges of prosodic
domains. The journal of the acoustical society of America, 101(6), 3728-3740.
Fowler, C. A., & Housum, J. (1987). Talkers' signaling of “new” and “old” words in speech and
listeners' perception and use of the distinction. Journal of Memory and Language, 26(5),
489-504.
Fussell, S. R., & Krauss, R. M. (1989a). The effects of intended audience on message production
and comprehension: Reference in a common ground framework. Journal of
Experimental Social Psychology, 25(3), 203-219.
Fussell, S. R., & Krauss, R. M. (1989b). Understanding friends and strangers: The effects of
audience design on message comprehension. European Journal of Social
Psychology, 19(6), 509-525.
Gahl, S. (2008). Time and thyme are not homophones: The effect of lemma frequency on word
durations in spontaneous speech. Language, 84(3), 474-496.
Galati, A., & Brennan, S. E. (2010). Attenuating information in spoken communication: For the
speaker, or for the addressee? Journal of Memory and Language, 62(1), 35-51.
Garcia, D. (2010). Robust smoothing of gridded data in one and higher dimensions with missing
values. Computational statistics & data analysis, 54(4), 1167-1178.
Gramming, P., Sundberg, J., Ternström, S., Leanderson, R., & Perkins, W. H. (1988).
Relationship between changes in voice pitch and loudness. Journal of Voice, 2(2), 118-
126.
Gregory, M. L., Raymond, W. D., Bell, A., Fosler-Lussier, E., & Jurafsky, D. (1999). The effects
of collocational strength and contextual predictability in lexical production. In Chicago
Linguistic Society (Vol. 35, pp. 151-166).
Greif, M. (2010). Contrastive focus in Mandarin Chinese. In Proceedings of Speech Prosody.
Gu, Chong. 2002. Smoothing Spline ANOVA Models. New York: Springer.
Gu, Chong. 2014. gss: General Smoothing Splines. R package version 2.1-4.
Gussenhoven, C. (1983). Focus, mode and the nucleus. Journal of linguistics, 19(2), 377-417.
Hale, J. (2003). The information conveyed by words in sentences. Journal of Psycholinguistic
Research, 32(2), 101-123.
Hammerschmidt, K., & Jürgens, U. (2007). Acoustical correlates of affective prosody. Journal of
Voice, 21(5), 531-540.
Harley, T. A. (1984). A critique of top-down independent levels models of speech production:
Evidence from non-plan-internal speech errors. Cognitive Science, 8: 191-219.
Hay, J. F., Sato, M., Coren, A. E., Moran, C. L., & Diehl, R. L. (2006). Enhanced contrast for
vowels in utterance focus: A cross-language study. The Journal of the Acoustical Society
of America, 119(5), 3022-3033.
Holmberg, E. B., Hillman, R. E., & Perkell, J. S. (1988). Glottal airflow and transglottal air
pressure measurements for male and female speakers in soft, normal, and loud
voice. The Journal of the Acoustical Society of America, 84(2), 511-529.
Horton, W. S., & Gerrig, R. J. (2005). The impact of memory demands on audience design
during language production. Cognition, 96(2), 127-142.
Horton, W. S., & Keysar, B. (1996). When do speakers take into account common
ground? Cognition, 59(1), 91-117.
Horton, W. S., & Slaten, D. G. (2012). Anticipating who will say what: The influence of
speaker-specific memory associations on reference resolution. Memory and Cognition,
40(1), 113-126.
Hubbard, K., & Trauner, D. A. (2007). Intonation and emotion in autistic spectrum
disorders. Journal of psycholinguistic research, 36(2), 159-173.
Hummert, M. L., & Shaner, J. L. (1994). Patronizing speech to the elderly as a function of
stereotyping. Communication Studies, 45(2), 145-158.
Isaacs, E. A., & Clark, H. H. (1987). References in conversation between experts and
novices. Journal of experimental psychology: general, 116(1), 26.
Ishi, C. T., Ishiguro, H., & Hagita, N. (2005). Proposal of acoustic measures for automatic
detection of vocal fry. In INTERSPEECH (pp. 481-484).
Ito, K., & Speer, S. R. (2008). Anticipatory effects of intonation: Eye movements during
instructed visual search. Journal of Memory and Language, 58(2), 541-573.
Jannedy, S. (2007). Prosodic focus in Vietnamese. In S. Ishihara, S. Jannedy, & A. Schwarz
(Eds.), Interdisciplinary Studies on Information Structure (Vol. 8, pp. 209-230).
Potsdam: Universitätsverlag Potsdam.
Jin, S. (1996). An acoustic study of sentence stress in Mandarin Chinese. Doctoral dissertation,
the Ohio State University.
Johnstone, T., & Scherer, K. R. (1999, August). The effects of emotions on voice quality.
In Proceedings of the XIVth International Congress of Phonetic Sciences (pp. 2029-
2032). San Francisco: University of California, Berkeley.
Jun, S. A. (2011). Prosodic markings of complex NP focus, syntax, and the pre-/post-focus
string. In Proceedings of the 28th West Coast Conference on Formal Linguistics (pp.
214-230).
Kahn, J. M., & Arnold, J. E. (2012). A processing-centered look at the contribution of givenness
to durational reduction. Journal of Memory and Language, 67(3), 311-325.
Kahn, J. M., & Arnold, J. E. (2015). Articulatory and lexical repetition effects on durational
reduction: speaker experience vs. common ground. Language, Cognition and
Neuroscience, 30(1-2), 103-119.
Kaiser, E. (2011). Focusing on pronouns: Consequences of subjecthood, pronominalization and
contrastive focus. Language and Cognitive Processes 26, 1625-1666.
Kaiser, E., & Trueswell, J. C. (2004). The role of discourse context in the processing of a flexible
word-order language. Cognition, 94(2), 113-147.
Kaland, C., Swerts, M., & Krahmer, E. (2013). Accounting for the listener: Comparing the
production of contrastive intonation in typically-developing speakers and speakers with
autism. The Journal of the Acoustical Society of America, 134(3), 2182-2196.
Katz, J., & Selkirk, E. (2011). Contrastive focus vs. discourse-new: Evidence from phonetic
prominence in English. Language, 87(4), 771-816.
Kawahara, H., de Cheveigné, A., & Patterson, R. D. (1998, December). An instantaneous-
frequency-based pitch extraction method for high-quality speech transformation: revised
TEMPO in the STRAIGHT-suite. In ICSLP.
Keysar, B. (2007). Communication and miscommunication: The role of egocentric processes.
Intercultural Pragmatics, 4(1), 71–84.
Kjelgaard, M. M., & Speer, S. R. (1999). Prosodic facilitation and interference in the resolution
of temporary syntactic closure ambiguity. Journal of Memory and Language, 40(2),
153-194.
Krahmer, E., & Swerts, M. (2001). On the alleged existence of contrastive accents. Speech
communication, 34(4), 391-405.
Kraljic, T., & Brennan, S. E. (2005). Prosodic disambiguation of syntactic structure: For the
speaker or for the addressee? Cognitive psychology, 50(2), 194-231.
Krivokapić, J., & Byrd, D. (2012). Prosodic boundary strength: An articulatory and perceptual
study. Journal of phonetics, 40(3), 430-442.
Kuhlen, A. K., & Brennan, S. E. (2010). Anticipating distracted addressees: How speakers'
expectations and addressees' feedback influence storytelling. Discourse Processes, 47(7),
567-587.
Kuznetsova, Alexandra, Per Bruun Brockhoff, and Rune Haubo Bojesen Christensen. 2015.
lmerTest: Tests in Linear Mixed Effects Models. R package version 2.0-25.
Ladd, D. R. (1996). Intonational phonology. Cambridge University Press.
Ladd, D. R., & Morton, R. (1997). The perception of intonational emphasis: continuous or
categorical? Journal of Phonetics, 25(3), 313-342.
Levy, R. (2008). Expectation-based syntactic comprehension. Cognition, 106(3), 1126-1177.
Lieberman, P. (1963). Some effects of semantic and grammatical context on the production and
perception of speech. Language and speech, 6(3), 172-187.
Loakes, D., & McDougall, K. (2010). Individual Variation in the Frication of Voiceless Plosives
in Australian English: A Study of Twins' Speech. Australian Journal of
Linguistics, 30(2), 155-181.
Local, J. (1996). Conversational phonetics: Some aspects of news receipts in everyday
talk. Studies in Interactional Sociolinguistics, 12, 177-230.
Lockridge, C. B., & Brennan, S. E. (2002). Addressees' needs influence speakers' early syntactic
choices. Psychonomic bulletin & review, 9(3), 550-557.
Morley, E., Van Santen, J., Klabbers, E., & Kain, A. (2011, May). F0 range and peak alignment
across speakers and emotions. In Acoustics, Speech and Signal Processing (ICASSP),
2011 IEEE International Conference on (pp. 4952-4955). IEEE.
Munson, B., & Solomon, N. P. (2004). The effect of phonological neighborhood density on
vowel articulation. Journal of speech, language, and hearing research, 47(5), 1048-
1058.
Niebuhr, O., & Zellers, M. (2012). Late pitch accents in hat and dip intonation
patterns. Understanding prosody–the role of function, context, and communication, 159-
186.
Niebuhr, O., D’Imperio, M., Fivela, B. G., & Cangemi, F. (2011). Are there “shapers” and
“aligners”? Individual differences in signalling pitch accent category. In Proceedings of
the 17th International Congress of Phonetic Sciences, 120–123.
Nolan, F. (2003). Intonational equivalence: an experimental evaluation of pitch scales.
In Proceedings of the 15th International Congress of Phonetic Sciences,
Barcelona (Vol. 39).
Ouyang, I. C. & Kaiser E. (2012). Focus-marking in a tone language: Prosodic cues in Mandarin
Chinese. In Extended Abstracts of the 86th Annual Meeting of the Linguistic Society of
America, held January 2012, Portland, Oregon.
Ouyang, I. C. & Kaiser E. (2011). Prosodic cues for information structure in a tone language. In
Proceedings of the 2011 Western Conference on Linguistics, held November 2011.
Ouyang, I. C. & Kaiser E. (2014a). Prosodic encoding of informativity: Word frequency and
contextual probability interact with information structure. In Proceedings of the 36th
annual meeting of the Cognitive Science Society, held July 2014, Quebec City, Canada.
Ouyang, I. C. & Kaiser E. (2014b). Prosody marks different kinds of informativity: Interactions
between frequency, probability and focus. In Proceedings of the 38th Annual Penn
Linguistics Conference, held March 2014, Philadelphia, Pennsylvania.
Ouyang, I. C. & Kaiser, E. (2015a). Individual differences in the prosodic encoding of
informativity. In S. Fuchs, D. Pape, C. Petrone, and P. Perrier (Eds.), Individual
Differences in Speech Production and Perception. Peter Lang International Academic
Publishers.
Ouyang, I. C. & Kaiser, E. (2015b). Prosody and information structure in a tone language: An
investigation of Mandarin Chinese. Language, Cognition and Neuroscience, 30(1-2): 57-
72.
Pan, S., & Hirschberg, J. (2000). Modeling local context for pitch accent prediction.
In Proceedings of the 38th annual meeting on association for computational
linguistics (pp. 233-240). Association for Computational Linguistics.
Pasupathi, M., Stallworth, L. M., & Murdoch, K. (1998). How what we tell becomes what we
know: Listener effects on speakers’ long-term memory for events. Discourse
Processes, 26(1), 1-25.
Paul, R., Bianchi, N., Augustyn, A., Klin, A., & Volkmar, F. R. (2008). Production of syllable
stress in speakers with autism spectrum disorders. Research in Autism Spectrum
Disorders, 2(1), 110-124.
Pierrehumbert, J., & Hirschberg, J. (1990). The meaning of intonational contours in the
interpretation of discourse. Intentions in communication, 271, 311.
Pitrelli, J. F. (2004). ToBI prosodic analysis of a professional speaker of American English.
In Speech Prosody 2004, International Conference.
Pluymaekers, M., Ernestus, M., & Baayen, R. (2005b). Articulatory planning is continuous and
sensitive to informational redundancy. Phonetica, 62(2-4), 146-159.
Pluymaekers, M., Ernestus, M., & Baayen, R. H. (2005a). Lexical frequency and acoustic
reduction in spoken Dutch. The Journal of the Acoustical Society of America, 118(4),
2561-2569.
Price, P. J., Ostendorf, M., Shattuck-Hufnagel, S., & Fong, C. (1991). The use of prosody in
syntactic disambiguation. The Journal of the Acoustical Society of America, 90(6), 2956-2970.
Prince, E. F. (1992). The ZPG letter: Subjects, definiteness, and information-status. Discourse
description: diverse analyses of a fund raising text, 295-325.
Rietveld, T., Kerkhoff, J., & Gussenhoven, C. (2004). Word prosodic structure and vowel
duration in Dutch. Journal of Phonetics, 32(3), 349-371.
Rooth, M. (1992). A theory of focus interpretation. Natural language semantics, 1(1), 75-116.
Rosa, E. C., & Arnold, J. E. (2011). The role of attention in choice of referring
expressions. Proceedings of PRE-Cogsci: Bridging the gap between computational,
empirical and theoretical approaches to reference.
Rosa, E. C., Finch, K. H., Bergeson, M., & Arnold, J. E. (2015). The effects of addressee
attention on prosodic prominence. Language, Cognition and Neuroscience, 30(1-2), 48-
56.
Roßnagel, C. S. (2004). Lost in thought: Cognitive load and the processing of addressees’
feedback in verbal communication. Experimental Psychology, 51(3), 191-200.
Roßnagel, C. (2000). Cognitive load and perspective-taking: applying the automatic-controlled
distinction to verbal communication. European Journal of Social Psychology, 30(3),
429-445.
Russell, A. W., & Schober, M. F. (1999). How beliefs about a partner's goals affect referring in
goal-discrepant conversations. Discourse Processes, 27(1), 1-33.
Scarborough, R. (2010). Lexical and contextual predictability: Confluent effects on the
production of vowels. Laboratory phonology, 10, 557-586.
Schwarzschild, R. (1999). Givenness, AvoidF and other constraints on the placement of
accent. Natural language semantics, 7(2), 141-177.
Selkirk, E. O. (1984). Phonology and syntax: The relation between sound and structure.
Cambridge, MA: MIT Press.
Selkirk, Elisabeth (1995). Sentence prosody: Intonation, stress, and phrasing. In: John A.
Goldsmith (ed.), The Handbook of Phonological Theory, 550-569. Oxford: Blackwell.
Selting, M. (1996). Prosody as an activity-type distinctive cue in conversation: the case of so-
called ‘astonished’. Prosody in conversation: Interactional studies, (12), 231.
Sharda, M., Subhadra, T. P., Sahay, S., Nagaraja, C., Singh, L., Mishra, R., Sen, A., Singhal,
Erickson, D., & Singh, N. C. (2010). Sounds of melody—Pitch patterns of speech in
autism.Neuroscience letters, 478(1), 42-45.
116
Shen, X. N. S. (1990). the prosody of Mandarin Chinese (Vol. 118). University of California
Press.
Shih, C. L. (1986). The Prosodic Domain of Tone Sandhi in Chinese. Doctoral dissertation,
University of California San Diego.
Shriberg, E., Ladd, D. R., Terken, J., & Stolcke, A. (1996). MODELING PITCH RANGE
VARIATION WITHIN AND ACROSS SPEAKERS: PREDICTING F 0 TARGETS
WHEN “SPEAKING UP”. In Proceedings of the 4th international conference on spoken
language processing (pp. 1-4).
Shue, Yen-Liang, Patricia Keating, Chad Vicenik, and Kristine Yu. 2011. VoiceSauce: A
program for voice analysis. In Proceedings of the 17th International Congress of
Phonetic Sciences, 1846–1849.
Smith, R., & Hawkins, S. (2012). Production and perception of speaker-specific phonetic detail
at word boundaries. Journal of Phonetics, 40(2), 213-233.
Snedeker, J., & Trueswell, J. (2003). Using prosody to avoid ambiguity: Effects of speaker
awareness and referential context. Journal of Memory and language,48(1), 103-130.
Steedman, M. (2000). Information structure and the syntax-phonology interface.Linguistic
inquiry, 31(4), 649-689.
Swerts, M., & Krahmer, E. (2005). Audiovisual prosody and feeling of knowing.Journal of
Memory and Language, 53(1), 81-94.
Terken, J., & Hirschberg, J. (1994). Deaccentuation of words representing ‘given’information:
Effects of persistence of grammatical function and surface position. Language and
Speech, 37(2), 125-145.
Theodore, R. M., Miller, J. L., & DeSteno, D. (2007). The effect of speaking rate on voice-onset-
time is talker-specific. In Proceedings of ICPHs (Vol. 16, pp. 473-476).
Trouvain, J., & Grice, M. (1999). The effect of tempo on prosodic structure. In Proceedings of
14th International Congress of Phonetic Sciences (Vol. 1067, p. 1070).
Trude, A. M. and Brown-Schmidt, S. (2012). Talker-specific perceptual adaptation during online
speech perception. Language and Cognitive Processes, 27(7-8), 979-1001.
Vallduví , E., & Vilkuna, M. (1998). On rheme and kontrast. Syntax and semantics, 79-108.
Van Berkum, J. J., Brown, C. M., Zwitserlood, P., Kooijman, V., & Hagoort, P. (2005).
Anticipating upcoming words in discourse: evidence from ERPs and reading
117
times. Journal of Experimental Psychology: Learning, Memory, and Cognition, 31(3),
443.
Van Donzel, M. E., & Beinum, F. J. (1996, October). Pausing strategies in discourse in Dutch.
In Spoken Language, 1996. ICSLP 96. Proceedings., Fourth International Conference
on (Vol. 2, pp. 1029-1032). IEEE.
Van Son, R. J. J. H., & Pols, L. C. (1999). An acoustic description of consonant
reduction. Speech communication, 28(2), 125-140.
Vance, T. J. 1977. Tonal distinctions in Cantonese. Phonetica, 34, 93-107.
Wagner, M., & Watson, D. G. (2010). Experimental and theoretical advances in prosody: A
review. Language and cognitive processes, 25(7-9), 905-945.
Ward, G. L., & Hirschberg, J. (1986). Reconciling Uncertainty with Incredulity: A Unified
Account of the L*+ HLH% Intonational Contour.
Ward, G., & Hirschberg, J. (1985). Implicating uncertainty: The pragmatics of fall-rise
intonation. Language, 747-776.
Watson, D. G., Tanenhaus, M. K., & Gunlogson, C. A. (2008). Interpreting pitch accents in
online comprehension: H* vs. L+ H*. Cognitive Science, 32(7), 1232-1244.
Welby, P. (2003). Effects of pitch accent position, type, and status on focus projection. Language
and Speech, 46(1), 53-81.
Welmers, W.E. (1973). African Language Structures. Berkeley: University of California Press.
Wennerstrom, A., & Siegel, A. F. (2003). Keeping the floor in multiparty conversations:
Intonation, syntax, and pause. Discourse Processes, 36(2), 77-107.
Whalen, D. H., & Xu, Y. (1992). Information for Mandarin tones in the amplitude contour and in
brief segments. Phonetica, 49(1), 25-47.
Wicha, N. Y., Bates, E. A., Moreno, E. M., & Kutas, M. (2003). Potato not Pope: human brain
potentials to gender expectation and agreement in Spanish spoken
sentences. Neuroscience Letters, 346(3), 165-168.
Wiener, S., Speer, S. R., & Shank, C. (2012). Effects of frequency, repetition and prosodic
location on ambiguous Mandarin word production. In Proceedings of the 6th
International Conference on Speech Prosody (pp. 528-531).
Wilkinson, S., & Kitzinger, C. (2006). Surprise as an interactional achievement: Reaction tokens
in conversation. Social psychology quarterly, 69(2), 150-182.
118
Wright, J. (2003). Pricing in debit and credit card schemes. Economics Letters,80(3), 305-309.
Xu, Y. (1997). Contextual tonal variations in Mandarin. Journal of phonetics,25(1), 61-83.
Xu, Y. (1999). Effects of tone and focus on the formation and alignment of F 0 contours. Journal
of phonetics, 27(1), 55-105.
Xu, Y. (2005-2011). ProsodyPro.praat. Available from:
http://www.phon.ucl.ac.uk/home/yi/ProsodyPro/.
Yoon, S. O., & Brown-Schmidt, S. (2014). Adjusting conceptual pacts in three-party
conversation. Journal of Experimental Psychology: Learning, Memory, and
Cognition, 40(4), 919.
Yuan, J., Brenier, J. M., & Jurafsky, D. (2005). Pitch accent prediction: effects of genre and
speaker. In Interspeech (pp. 1409-1412).
Zahorian, S. A., & Hu, H. (2008). A spectral/temporal method for robust fundamental frequency
tracking. The Journal of the Acoustical Society of America, 123(6)., 4559-4571.
Zerbian, S., Genzel, S., & Kügler, F. (2010). Experimental work on prosodically-marked
information structure in selected African languages (Afroasiatic and Niger-
Congo). Proceedings of speech prosody 2010, 1-4.
119
Appendix 1: Target items in Experiment 1
The 48 critical sentence pairs in Experiment 1 can be reconstructed as follows. There are 12
conditions, formed by crossing three types of question-response pairs (X-Z) with four kinds of
object nouns in the responses (A-D). Each condition has four items (1-4), which are
differentiated by the verb-location context in which the object noun occurs. The subject of a
question always consists of two personal names; no personal name occurs more than once in the
experiment.
1. Context: got…at the sports store
(X) Narrow Corrective Focus
Partner asks: I heard that {Dawn and Alice; …} got gloves at the sports store.
(Y) Narrow New-Information Focus
Partner asks: What did {Rachel and Carolyn; …} get at the sports store?
(Z) VP/Wide Focus
Partner asks: What did {Angela and Joyce; …} do?
(A) High Frequency and High Probability: balls
(B) Low Frequency and High Probability: cleats
(C) High Frequency and Low Probability: fish
(D) Low Frequency and Low Probability: toys
Participant responds: (No,) they got {balls; cleats; fish; toys} at the sports store.
2. Context: kicked…in the garage
(X) Narrow Corrective Focus
Partner asks: I heard that {Teresa and Martha; …} kicked dirt in the garage.
(Y) Narrow New-Information Focus
Partner asks: What did {Connie and Sharon; …} kick in the garage?
(Z) VP/Wide Focus
Partner asks: What did {Evelyn and Jacqueline; …} do?
(A) High Frequency and High Probability: cars
(B) Low Frequency and High Probability: cans
(C) High Frequency and Low Probability: books
(D) Low Frequency and Low Probability: shells
Participant responds: (No,) they kicked {cars; cans; books; shells} in the garage.
3. Context: found…in the sea
(X) Narrow Corrective Focus
Partner asks: I heard that {Bonnie and Laura; …} found boats in the sea.
(Y) Narrow New-Information Focus
Partner asks: What did {Mary and Irene; …} find in the sea?
(Z) VP/Wide Focus
Partner asks: What did {Lillian and Gladys; …} do?
(A) High Frequency and High Probability: fish
(B) Low Frequency and High Probability: shells
(C) High Frequency and Low Probability: balls
(D) Low Frequency and Low Probability: cans
Participant responds: (No,) they found {fish; shells; balls; cans} in the sea.
4. Context: found…on the stairs
(X) Narrow Corrective Focus
Partner asks: I heard that {Matthew and Edward; …} found socks on the stairs.
(Y) Narrow New-Information Focus
Partner asks: What did {Joseph and Steven; …} find on the stairs?
(Z) VP/Wide Focus
Partner asks: What did {Daniel and Jason; …} do?
(A) High Frequency and High Probability: books
(B) Low Frequency and High Probability: toys
(C) High Frequency and Low Probability: cars
(D) Low Frequency and Low Probability: cleats
Participant responds: (No,) they found {books; toys; cars; cleats} on the stairs.
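The crossing described above can be checked with a short enumeration. The following is a minimal Python sketch, not part of the original materials: the names `CONTEXTS`, `FOCUS_TYPES`, `NOUN_TYPES` and `experiment1_items` are hypothetical, and the sketch assumes that the optional "(No,)" is realized only in the corrective-focus condition (X). The partner's questions are left out because the full list of personal names is elided in the appendix.

```python
from itertools import product

# Verb-location contexts (items 1-4) with the nouns for cells A-D,
# copied from the lists above.
CONTEXTS = {
    1: ("got", "at the sports store", ["balls", "cleats", "fish", "toys"]),
    2: ("kicked", "in the garage", ["cars", "cans", "books", "shells"]),
    3: ("found", "in the sea", ["fish", "shells", "balls", "cans"]),
    4: ("found", "on the stairs", ["books", "toys", "cars", "cleats"]),
}
FOCUS_TYPES = {
    "X": "Narrow Corrective Focus",
    "Y": "Narrow New-Information Focus",
    "Z": "VP/Wide Focus",
}
NOUN_TYPES = ["A", "B", "C", "D"]


def experiment1_items():
    """Enumerate one participant response per context, focus type and noun type."""
    items = []
    for (item, (verb, place, nouns)), focus in product(CONTEXTS.items(), FOCUS_TYPES):
        for noun_type, noun in zip(NOUN_TYPES, nouns):
            # Assumption: "(No,)" is spoken only under corrective focus (X).
            prefix = "No, they" if focus == "X" else "They"
            items.append({
                "item": item,
                "focus": focus,
                "noun_type": noun_type,
                "response": f"{prefix} {verb} {noun} {place}.",
            })
    return items


if __name__ == "__main__":
    items = experiment1_items()
    print(len(items))  # 4 contexts x 3 focus types x 4 noun types
    print(items[0]["response"])
```

Under these assumptions the enumeration yields 12 conditions with four items each, matching the 48 critical sentence pairs stated at the start of this appendix.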
Appendix 2: Target items in Experiments 2 and 3
The 192 critical dialogues in Experiments 2 and 3 can be reconstructed as follows. Each dialogue has
five sentences (S1-S5); the last two sentences of each dialogue (S4 and S5) constitute a
statement-response pair. There are 6 conditions, formed by combining two kinds of
statement-response relationship (Corrective vs. Non-Corrective), two types of object nouns in the
statements (Probable vs. Improbable), and two types of object nouns in the responses (Probable
vs. Improbable). A corrective response begins with the word ‘no’ and has an object noun that is
different from the object noun in the statement. In contrast, a non-corrective response begins
with the word ‘yes’ and has the same object noun as the statement, so its statement and response
nouns are necessarily of the same type; this is why the design has 6 conditions rather than 8.
Every condition has items from six scenarios (1-6). Each scenario contains two sub-scenarios (X
vs. Y, e.g. Christopher prefers meat vs. Abbey prefers seafood) and is associated with a set of
four object nouns (e.g. beef, lamb, fish, and shrimp). Two of the four object nouns are probable
in the context of sub-scenario X but improbable in sub-scenario Y. The other two object nouns,
in contrast, are probable in the context of sub-scenario Y but improbable in sub-scenario X.
1. Preference/Need: meat vs. seafood
(1X) Probable Objects: beef, lamb; Improbable Objects: fish, shrimp
S1: {Christopher; Gary; Joseph} is particular about food.
S2: He loves meat but hates seafood.
S3: He went out for lunch yesterday.
S4: I heard that he had {beef; lamb; fish; shrimp} at a restaurant.
S5: {Yes; No}, he had {beef; lamb; fish; shrimp} at a restaurant.
(1Y) Probable Objects: fish, shrimp; Improbable Objects: beef, lamb
S1: {Abbey; Lauren; Sabrina} is a picky eater.
S2: She loves seafood but hates meat.
S3: She went out for dinner last night.
S4: I heard that she had {beef; lamb; fish; shrimp} at a restaurant.
S5: {Yes; No}, she had {beef; lamb; fish; shrimp} at a restaurant.
2. Preference/Need: fruit vs. vegetables
(2X) Probable Objects: apples, cherries; Improbable Objects: lettuce, spinach
S1: {Elisa; Jacky; Tina} prefers her salad a certain way.
S2: She loves fruit but hates vegetables.
S3: She went grocery shopping yesterday evening.
S4: I heard that she got some {apples; cherries; lettuce; spinach} at the farmer’s
market.
S5: {Yes; No}, she got some {apples; cherries; lettuce; spinach} at the farmer’s
market.
(2Y) Probable Objects: lettuce, spinach; Improbable Objects: apples, cherries
S1: {Bob; Daniel; Zac} tends to put certain things in his salad.
S2: He loves vegetables but hates fruit.
S3: He went grocery shopping this morning.
S4: I heard that he got some {apples; cherries; lettuce; spinach} at the supermarket.
S5: {Yes; No}, he got some {apples; cherries; lettuce; spinach} at the supermarket.
3. Preference/Need: getting bathroom vs. patio stuff
(3X) Probable Objects: bath mats, face wash; Improbable Objects: lawn chairs, yard lights
S1: {Abbey; Lauren; Sabrina} just moved into her new house.
S2: She has enough patio stuff but needs more things for the bathroom.
S3: She went out shopping today.
S4: I heard that she bought some {bath mats; face wash; lawn chairs; yard lights} at
a store.
S5: {Yes; No}, she bought some {bath mats; face wash; lawn chairs; yard lights} at
a store.
(3Y) Probable Objects: lawn chairs, yard lights; Improbable Objects: bath mats, face wash
S1: {Christopher; Gary; Joseph} just moved into his new house.
S2: He has enough bathroom stuff but needs more things for the patio.
S3: He went out shopping this afternoon.
S4: I heard that he bought some {bath mats; face wash; lawn chairs; yard lights} at
the mall.
S5: {Yes; No}, he bought some {bath mats; face wash; lawn chairs; yard lights} at
the mall.
4. Preference/Need: selling bedroom vs. kitchen stuff
(4X) Probable Objects: dresser, mattress; Improbable Objects: blender, mixer
S1: {Elisa; Jacky; Tina} is moving in with her boyfriend.
S2: They want to keep her kitchen stuff but get rid of her bedroom furniture.
S3: They are not going to donate anything.
S4: I heard that she sold her {dresser; mattress; blender; mixer} at a garage sale.
S5: {Yes; No}, she sold her {dresser; mattress; blender; mixer} at a garage sale.
(4Y) Probable Objects: blender, mixer; Improbable Objects: dresser, mattress
S1: {Bob; Daniel; Zac} just moved into his new house.
S2: They want to keep his bedroom furniture but get rid of his kitchen stuff.
S3: They are not going to give anything away.
S4: I heard that he sold his {dresser; mattress; blender; mixer} in the classified ads.
S5: {Yes; No}, he sold his {dresser; mattress; blender; mixer} in the classified ads.
5. Preference/Need: playing carpenter vs. chef
(5X) Probable Objects: hammers, wrenches; Improbable Objects: burgers, pizzas
S1: {Christopher; Gary; Joseph}'s nephew is six, and he likes to imagine what he
wants to do when he grows up.
S2: He loves to pretend to be a carpenter, but never plays chef.
S3: {Christopher; Gary; Joseph} went to a flea market over the weekend.
S4: I heard that he bought toy {burgers; pizzas; hammers; wrenches} for his
nephew.
S5: {Yes; No}, he bought toy {burgers; pizzas; hammers; wrenches} for his
nephew.
(5Y) Probable Objects: burgers, pizzas; Improbable Objects: hammers, wrenches
S1: {Abbey; Lauren; Sabrina}'s niece is four, and she likes to play make-believe.
S2: She loves to pretend to be a chef, but never plays carpenter.
S3: {Abbey; Lauren; Sabrina} found some make-believe props on eBay recently.
S4: I heard that she bought toy {burgers; pizzas; hammers; wrenches} for her niece.
S5: {Yes; No}, she bought toy {burgers; pizzas; hammers; wrenches} for her niece.
6. Preference/Need: farm vs. jungle animals
(6X) Probable Objects: cow, sheep; Improbable Objects: bear, lion
S1: {Elisa; Jacky; Tina}’s son is not a fan of all animals.
S2: He is obsessed with farm animals but completely uninterested in jungle animals.
S3: She took her son to a toy store downtown.
S4: I heard that he got a stuffed {cow; sheep; bear; lion} at the shop.
S5: {Yes; No}, he got a stuffed {cow; sheep; bear; lion} at the shop.
(6Y) Probable Objects: bear, lion; Improbable Objects: cow, sheep
S1: {Bob; Daniel; Zac}’s daughter only likes certain animals.
S2: She is obsessed with jungle animals but completely uninterested in farm animals.
S3: He took his daughter to a kid's store the other day.
S4: I heard that she got a stuffed {cow; sheep; bear; lion} at the shop.
S5: {Yes; No}, she got a stuffed {cow; sheep; bear; lion} at the shop.
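The design of Appendix 2 can likewise be enumerated. The sketch below is a hedged illustration rather than the author's procedure: the names `SCENARIOS` and `experiment23_dialogues` are hypothetical, and it assumes that every ordered pairing of a statement noun (S4) with a response noun (S5) is used within each sub-scenario. The appendix does not state this explicitly; it is an inference that happens to reproduce the stated total of 192 dialogues. Speaker names are ignored, since they do not define conditions.

```python
from itertools import product

# Per scenario: (nouns probable in sub-scenario X, nouns probable in sub-scenario Y),
# copied from the lists above.
SCENARIOS = {
    1: (["beef", "lamb"], ["fish", "shrimp"]),
    2: (["apples", "cherries"], ["lettuce", "spinach"]),
    3: (["bath mats", "face wash"], ["lawn chairs", "yard lights"]),
    4: (["dresser", "mattress"], ["blender", "mixer"]),
    5: (["hammers", "wrenches"], ["burgers", "pizzas"]),
    6: (["cow", "sheep"], ["bear", "lion"]),
}


def noun_type(noun, probable_nouns):
    """Label a noun relative to the sub-scenario's probable set."""
    return "Probable" if noun in probable_nouns else "Improbable"


def experiment23_dialogues():
    """Enumerate statement-response noun pairings for every sub-scenario."""
    dialogues = []
    for scenario, (x_nouns, y_nouns) in SCENARIOS.items():
        all_nouns = x_nouns + y_nouns
        for sub, probable in (("X", x_nouns), ("Y", y_nouns)):
            # Assumption: all ordered (S4 noun, S5 noun) pairings occur.
            for s4_noun, s5_noun in product(all_nouns, repeat=2):
                dialogues.append({
                    "scenario": f"{scenario}{sub}",
                    # Same noun in S4 and S5 means a 'yes' (non-corrective) response.
                    "relation": "Non-Corrective" if s4_noun == s5_noun else "Corrective",
                    "statement_noun": s4_noun,
                    "statement_type": noun_type(s4_noun, probable),
                    "response_noun": s5_noun,
                    "response_type": noun_type(s5_noun, probable),
                })
    return dialogues


if __name__ == "__main__":
    dialogues = experiment23_dialogues()
    conditions = {(d["relation"], d["statement_type"], d["response_type"])
                  for d in dialogues}
    print(len(dialogues), len(conditions))
```

Under this assumption, each of the 12 sub-scenarios contributes 16 pairings (12 corrective, 4 non-corrective), and the pairings fall into exactly the six conditions described at the start of this appendix.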
Abstract
This dissertation aims to extend our knowledge of prosody: in particular, what kinds of information may be conveyed through prosody, which prosodic dimensions may be used to convey them, and how individual speakers differ from one another in how they use prosody. Four production studies were conducted to examine how various factors interact with one another in shaping the prosody of an utterance and how prosody fulfills its multi-functional role.
Experiment 1 explores the interaction between two types of informativity, namely information structure and information-theoretic properties. The results show that the prosodic consequences of new-information focus are modulated by the focused word’s frequency, whereas the prosodic consequences of corrective focus are modulated by the focused word’s probability in the context. Furthermore, f0 ranges appear to be more informative than f0 shapes in reflecting informativity across speakers. Specifically, speakers seem to have individual ‘preferences’ regarding f0 shapes, the f0 ranges they use for an utterance, and the magnitude of differences in f0 ranges by which they mark information-structural distinctions. In contrast, there is more cross-speaker validity in the actual directions of differences in f0 ranges between information-structural types.
Experiments 2 and 3 further show that the interaction found between corrective focus and contextual probability depends on the interlocutor’s knowledge state. When the interlocutor has no access to the crucial information concerning utterances’ contextual probability, speakers prosodically emphasize contextually improbable corrections, but not contextually probable corrections. Furthermore, speakers prosodically emphasize the corrections in response to contextually probable misstatements, but not the corrections in response to contextually improbable misstatements. In contrast, the opposite patterns are found when words’ contextual probability is shared knowledge between the speaker and the interlocutor: speakers prosodically emphasize contextually probable corrections and the corrections in response to contextually improbable misstatements.
Experiment 4 demonstrates the multi-functionality of prosody by investigating its discourse-level functions in Mandarin Chinese, a tone language where a word’s prosodic pattern is crucial to its meaning. The results show that, although prosody serves fundamental, lexical-level functions in Mandarin Chinese, it nevertheless provides cues to information structure as well. Similar to what has been found with English, corrective information is prosodically more prominent than non-corrective information, and new information is prosodically more prominent than given information.
Taken together, these experiments demonstrate the complex relationship between prosody and the different types of information it encodes in a given language. To better understand prosody, it is important to integrate insights from different traditions of research and to investigate across languages. In addition, the findings of this research suggest that speakers’ assumptions about what their interlocutors know, as well as speakers’ ability to update these expectations, play a key role in shaping the prosody of utterances. I hypothesize that prosodic prominence may reflect the gap between what speakers had expected their interlocutors to say and what their interlocutors have actually said.
Asset Metadata
Creator: Ouyang, Iris Chuoying (author)
Core Title: Prosody and informativity: a cross-linguistic investigation
School: College of Letters, Arts and Sciences
Degree: Doctor of Philosophy
Degree Program: Linguistics
Publication Date: 11/23/2015
Defense Date: 10/19/2015
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: addressee's knowledge state, between-speaker variability, contextual probability, corrective focus, duration, epistemic surprisal, f0 excursion, given information, individual differences, informativity, intensity range expansion, Mandarin, new-information focus, OAI-PMH Harvest, pitch range expansion, prior expectation, prosody, semitone, within-speaker variability, word frequency
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Kaiser, Elsi (committee chair), Goldstein, Louis (committee member), Simpson, Andrew (committee member), Wood, Justin (committee member)
Creator Email: iris.oy@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c40-202298
Unique identifier: UC11277336
Identifier: etd-OuyangIris-4053.pdf (filename), usctheses-c40-202298 (legacy record id)
Legacy Identifier: etd-OuyangIris-4053.pdf
Dmrecord: 202298
Document Type: Dissertation
Rights: Ouyang, Iris Chuoying
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA