Effects of language familiarity on talker discrimination from syllables
EFFECTS OF LANGUAGE FAMILIARITY ON
TALKER DISCRIMINATION FROM SYLLABLES
by
Andrés Benítez Pozo
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(LINGUISTICS)
December 2020
Copyright 2021 Andrés Benítez Pozo
Dedication
To Mom, Dad and Chiqui.
Acknowledgments
This has been the journey of a lifetime. I would like to thank the following people, who have walked the journey with me and held my hand through thick and thin:
Jason Zevin. I cannot begin to express how much I owe to you. You have been a supportive, understanding and insightful advisor, in work and in life. It is no exaggeration to say that I would have simply not made it this far without you. Thank you for being equal parts an amazing scientist and an awesome human.
Sandra Ferrari Disner. Thank you for being a wonderful mentor and role model of professionalism. I am honored that you trusted me to collaborate with you on so many occasions.
Rand Wilcox. Thank you for kick-starting my love for statistics and setting me on the wondrous path to data science and coding. Robust statistics will prevail!
The Zevin Lab. Thank you to all the Zevlings who have made our lab my home away from home, especially Maury, Erin, Melissa, Julia and Mai. I also owe much gratitude to the many research assistants who have spent innumerable hours helping me run my experiments, and to the hundreds of students who volunteered their time to participate in them.
The Linguistics Department. Thank you to my colleagues and friends at the USC Linguistics Department, with whom I’ve made a lifetime of memories. Special thanks go to Binh, Cynthia, Caitlin, Brian, Sarah, Betul, Mairym, Miran, Merouane, Jesse, Luismi, Lucy, and Monica. Thanks as well to our staff, especially Lisa Jo, Guillermo and Brandon.
My teachers. I’ve made it to where I am today thanks to every teacher and mentor who guided me on the way and inspired me to learn what they loved. I have a bit of each of you in me. Above all, Raúl, Antonio, Juan Carlos, José Manuel, Paul, Louis, Elsi, Rachel, Jason, Sandy, Rand, Ken: thank you.
My friends. To Amy, Andrew, Betul, Jens, Jesse, Marian, Paco and Rocko: thank you for being next to me. In person or from 5,578 miles away, you inspire me with your accomplishments, help me with your advice, and just make me feel good for no particular reason.
Michael and Kelcie. Thank you for always believing in me, especially when I didn’t. This dissertation would have never been finished without your guidance through the darkest times.
My family. I owe everything to my mom, my dad and my brother. Thank you for your effort and sacrifice, for your love and your support, and for encouraging me to find my own path. A very big thanks goes to Jaebeom for being the cutest nephew ever and for calling me “tito”, and to his mom, Jisun, for helping me find Korean participants and for taking care of Juhee when I was too busy.
And finally, Juhee: thank you for loving me. Thank you for the incalculable amount of love and support you’ve given me while I was working on this dissertation. I hope you will allow me to repay you in kind over many years to come.
Table of Contents
Dedication
Acknowledgments
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
1.1 The Goals of this Dissertation
1.2 Talker variability in speech
1.2.1 Perceiving speech from voices
1.2.2 Perceiving voices from speech
1.3 The language familiarity advantage in talker processing
1.4 Chapter outline
Chapter 2: Effects of consonant familiarity on talker discrimination
2.1 Experiment 1
2.1.1 Methods
2.1.1.1 Stimuli
2.1.1.2 Participants
2.1.1.3 Procedure
2.1.2 Results
2.1.2.1 Overall categorization and goodness ratings
2.1.2.2 Breakdown by context
2.2 Experiment 2
2.2.1 Methods
2.2.1.1 Stimuli
2.2.1.2 Participants
2.2.1.3 Procedure
2.2.1.4 Statistics
2.2.2 Results
2.3 Experiment 3
2.3.1 Methods
2.3.1.1 Stimuli
2.3.1.2 Participants
2.3.1.3 Procedure
2.3.1.4 Statistics
2.3.2 Results
2.4 Discussion
Chapter 3: Effects of vowel familiarity on talker discrimination
3.1 Experiment 4
3.1.1 Methods
3.1.1.1 Participants
3.1.1.2 Stimuli
3.1.1.3 Procedure
3.1.1.4 Statistics
3.1.2 Results
3.1.2.1 Talker discriminability by vowel
3.1.2.2 Multidimensional scaling
3.2 Experiment 5
3.2.0.1 Participants
3.2.0.2 Stimuli
3.2.0.3 Procedure
3.2.0.4 Statistics
3.2.1 Results
3.2.1.1 Talker discriminability by vowel
3.3 Discussion
Chapter 4: Discussion
Bibliography
List of Tables
2.1 Summary of results of Experiment 1
2.2 Results of Experiment 1: bilabial stops
2.3 Results of Experiment 1: alveolar stops
2.4 Results of Experiment 1: velar stops
2.5 Results of Experiment 1: alveo-palatal affricates
List of Figures
2.1 Paired results of Experiment 2
2.2 Difference scores of Experiment 2
2.3 Paired results of Experiment 3
2.4 Difference scores of Experiment 3
3.1 Discriminability results of Experiment 4
3.2 Multidimensional scaling results of Experiment 4
3.3 Correlations between MDS dimensions and acoustic variables of the stimuli in Experiment 4
3.4 Discriminability results of Experiment 5
Abstract
Speech is a complex multidimensional signal that carries linguistic information (what is said) and indexical information (who said it) simultaneously. These two dimensions are deeply intertwined in the acoustic signal of speech and listeners rely on knowledge of one to decode the other. For example, it is well established that listeners exploit knowledge of their language to process information about talkers, such that they are notably better at distinguishing talkers in a familiar language. However, it is not clear what the mechanism behind this effect is, or at what level of speech processing it operates. This dissertation seeks to expand our understanding of the language familiarity advantage in talker processing by exploring the interaction between linguistic and indexical information in non-native speech. Specifically, this dissertation investigates how familiarity with the sounds of a language affects the processing of talker identity at the level of individual syllables, focusing on the specific contributions of both consonants and vowels. We hypothesized that the discriminability of talkers in an unfamiliar language may be affected by first-language perceptual mechanisms similar to those that determine the discriminability of non-native speech contrasts. To test this, we assessed English listeners’ ability to discriminate pairs of talkers across Korean sounds differing in degree of familiarity using a variety of psycho-acoustic methods. The results of our experiments were inconsistent across different testing paradigms, but were tentatively supportive of the initial hypothesis, as well as of models that situate the language familiarity effect at an abstract level of phonological processing.
Chapter 1
Introduction
1.1 The Goals of this Dissertation
Speech is a complex multidimensional signal that carries information at multiple levels. At the broadest level, the information carried by speech has been traditionally separated into two types: linguistic and indexical (Abercrombie, 1984). Linguistic information within speech is in turn made up of multiple levels of linguistic analysis (phonological, morpho-syntactic, semantic and pragmatic) that together convey the propositional content of the utterance. On the other hand, indexical information signals properties of the talker, such as their identity, some types of group membership (e.g., age, gender, socio-economic background, geographic origin), and changing emotional states (e.g., anger, sadness, etc.). More generally, listeners can use indexical information to identify or distinguish individual talkers.
Research has increasingly shown that these two dimensions are deeply intertwined in the speech signal and that listeners rely on knowledge of one to decode the other. It is well established that listeners use their knowledge of indexical information to more effectively process the linguistic content of an utterance, such that familiarity with a talker facilitates speech comprehension. On the other hand, the parallel phenomenon whereby listeners exploit knowledge of their language to better process information about talkers has received less attention. Listeners are notably better at distinguishing talkers in a language they understand, but it is unclear what the mechanism behind this so-called “language familiarity advantage” is, or at what level of speech processing it operates.
The overall goal of this dissertation is to expand our understanding of the language familiarity advantage in talker recognition by exploring the interaction between linguistic and indexical information in non-native speech. Specifically, this dissertation investigates how familiarity with the sounds of a language affects the processing of talker identity at the level of individual syllables, focusing on the specific contributions of both consonants (Chapter 2) and vowels (Chapter 3). The main hypothesis guiding the project is that the discriminability of talkers in an unfamiliar language may be affected by first-language perceptual mechanisms similar to those that determine the discriminability of non-native speech contrasts.
The rest of this chapter situates the project in the broader context of the speech perception and talker processing literature. After an overview of the interaction between the two phenomena, it focuses on the effects of speech perception on the perception of talker identity from the voice, mainly characterized by the language familiarity advantage. It then introduces the experiments reported in the following chapters, their methodology and hypotheses.
1.2 Talker variability in speech
Talkers have different voices for a variety of reasons. Most obvious are differences in anatomy and physiology across individuals. For example, the overall size of the vocal tract, the mass and stiffness of the vocal folds, the structure of the palate and teeth, among others, all physically shape the sounds produced by a person. Although these parameters constrain what a speaker can do with their voice to some extent, speakers are also able to acquire individuating speech habits. For example, speakers can vary their habitual pitch or speaking rate. In addition to this type of idiosyncratic behavior, speakers also learn socio-cultural speech habits that signal group membership, such as specific dialects that index age, gender, socio-economic group, geographic origin and others. Ultimately, the specific way in which all these biological, idiosyncratic and socio-cultural factors make up the voice of any given individual is probably unknowable, but they nevertheless affect exactly how the individual will physically produce spoken language and encode it in the acoustic signal of speech. In practice, this means that the same linguistic message can be encoded differently by different speakers, or even by the same speaker at different times. This presents an obvious problem for understanding speech communication: if the same linguistic information can be encoded in speech differently by different speakers, how do listeners manage to map such a highly variable signal to their own linguistic knowledge during speech comprehension?
1.2.1 Perceiving speech from voices
Classic models of speech perception have circumvented this problem of “lack of invariance” in the speech signal by treating voice perception as a paralinguistic phenomenon with minor or no implications for linguistic processing (review in Klatt, 1989). In these early models, the variability imposed by an individual’s voice onto the linguistic form of an utterance is, for the purposes of speech understanding, discarded during early auditory-acoustic processing. What the listener ultimately perceives is thought to be a normalized, talker-independent form that matches the posited abstract symbolic units in the language system. In this process, known as speaker normalization, linguistically-relevant, speaker-dependent acoustic cues are interpreted in relation to other aspects of the signal. For example, listeners adjust their perception of voice onset time values depending on the talker’s speaking rate (e.g., Newman and Sawusch, 1996), or they interpret formant values in vowels relative to other cues (like fundamental frequency) that are a proxy for the talker’s vocal tract size (review in Johnson, 2005). Crucially, these perceptual adjustments are believed to be the result of low-level auditory processes that are not specific to speech. Therefore, in this view, the variable indexical properties of the speech signal are removed during early auditory-acoustic processing and are not used in the linguistic processing of speech.
More recently, multiple studies have challenged the abstractionist view of speaker normalization, and it is by now widely accepted that recovering linguistic information from speech is, to some extent, contingent on processing indexical information. Evidence for this comes from studies on what is known as the talker interference effect, whereby performance is lower across a number of speech perception tasks when stimuli are produced by multiple talkers, as opposed to a single talker (Creelman, 1957; Goldinger et al., 1991; Mullennix et al., 1989). Another source of evidence comes from studies of a related phenomenon, the talker specificity effect: familiarization with individual talkers facilitates performance across multiple speech perception tasks (e.g., Goldinger, 1996; Nygaard and Pisoni, 1998). These two effects demonstrate that talker variability in the speech signal is much more relevant to speech processing than initially thought, as listeners appear to actively exploit their knowledge of individual talkers to facilitate the processing of linguistic information. To account for these findings, exemplar-based models of speech perception have been proposed (e.g., Pierrehumbert, 2001; Goldinger, 1998; Johnson, 1997). This approach posits that listeners store specific instances of speech in long-term memory, with all their speaker-dependent detail. Categories are therefore not normalized abstractions, but rather collections of stored instances that are defined online by the activation of the instances that are relevant to the talker in question and the deactivation of those that are not relevant. The development of these models has stimulated the treatment of speech perception and talker perception as inter-dependent phenomena that need to be considered together, with some caveats that are discussed in the next section.
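The exemplar-based idea can be illustrated with a minimal sketch. It is not taken from any of the cited models: the stored exemplars, the formant values, the exponential similarity function, and the `sensitivity` and `talker_boost` parameters are all assumptions chosen for illustration. The sketch only shows how the same acoustic input could be categorized differently once exemplars from the current talker are weighted more heavily.

```python
import math
from collections import defaultdict

# Hypothetical stored exemplars: (vowel category, talker, (F1, F2) in Hz).
# All values are invented for illustration.
EXEMPLARS = [
    ("i", "talker_A", (300.0, 2300.0)),
    ("i", "talker_B", (350.0, 2700.0)),
    ("e", "talker_A", (450.0, 2000.0)),
    ("e", "talker_B", (520.0, 2400.0)),
]


def activation(probe, stored, sensitivity=0.005):
    """Similarity-based activation that decays exponentially with distance."""
    return math.exp(-sensitivity * math.dist(probe, stored))


def categorize(probe, talker=None, talker_boost=3.0):
    """Sum exemplar activations per category.

    Exemplars produced by the current talker are weighted more heavily
    (an assumed stand-in for talker-specific activation).
    """
    evidence = defaultdict(float)
    for label, who, features in EXEMPLARS:
        weight = talker_boost if who == talker else 1.0
        evidence[label] += weight * activation(probe, features)
    return max(evidence, key=evidence.get)


# The same acoustic token, heard in the context of two different talkers.
ambiguous_token = (400.0, 2350.0)
as_talker_a = categorize(ambiguous_token, talker="talker_A")
as_talker_b = categorize(ambiguous_token, talker="talker_B")
```

With these invented values, the token is labeled "i" in the context of talker A and "e" in the context of talker B, because each talker's own exemplars dominate the summed evidence.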
1.2.2 Perceiving voices from speech
Recognizing talkers from speech is a routine task for most people, yet the cognitive mechanisms underlying this remarkable ability have remained elusive. There is, however, one clear finding that is of relevance to the work in this dissertation: the processing of familiar voices and the processing of unfamiliar voices are strikingly different cognitive abilities. In fact, each of these two abilities engages separate cortical systems and involves fundamentally different cognitive tasks (Van Lancker and Kreiman, 1987; Kreiman and Sidtis, 2011). Specifically, unfamiliar voices are processed featurally, while familiar voices are processed holistically using Gestalt principles. Featural processing involves attending to the values of a known set of features (e.g., fundamental frequency, formants, etc.) and focusing on those that specifically distinguish an individual voice from others. Holistic processing instead involves matching the whole pattern of a voice against a collection of known voices.
One reason for the dissociation of these two intuitively similar abilities is that they involve fundamentally different tasks. Processing familiar voices typically involves recognition and identification tasks, while processing unfamiliar voices typically involves discrimination tasks. In discrimination tasks, two sample voices are compared against each other in short-term memory. That is a very different experience from recognizing a familiar voice, which involves listening to only one sample voice and deciding if the talker is present in long-term memory, or identifying a familiar voice, which encompasses recognition but involves the extra step of naming the specific known talker. Only occasionally, and almost always in experimental or forensic setups, are unfamiliar voices involved in identification procedures (e.g., in voice lineups for an unknown talker that was heard previously).
The most likely reason, however, for the dissociation between familiar and unfamiliar voice processing is their different evolutionary histories. Recognizing personally relevant voices is an ability shared with many animal species that has clear survival value (e.g., ensuring mothers provide care to the right offspring, facilitating reunions, promoting social bonding and cooperation, distinguishing insiders from outsiders, etc.) and one that likely predates the development of language by thousands of years (Belin, 2006; Kreiman and Sidtis, 2011). This skill starts very early in life: humans, for example, can recognize their mother’s voice in utero (Kisilevsky et al., 2003). On the other hand, the ability to distinguish the voices of strangers, lacking the vital function of familiar voice recognition, is unheard of in the animal kingdom and is likely a much more recent human development.
However, possibly because it lacks the dedicated evolutionary architecture of familiar voice processing, unfamiliar voice discrimination appears to be at least partially reliant on speech perception processes. Multiple studies, detailed in the next section, have found that listeners use linguistic knowledge to assist in the processing of talker information. Notably, the featural analysis that characterizes unfamiliar voice processing is readily compatible with speech perception processes. Unsurprisingly, there is considerable overlap in the range of acoustic features that listeners are known to rely on for speech perception and for unfamiliar talker discrimination (e.g., f0, vowel formants, etc.).
1.3 The language familiarity advantage in talker processing
The finding that listeners are able to use knowledge of the language being spoken to facilitate the processing of (unfamiliar) voices emphasizes the close integration of linguistic and indexical information in speech. The main source of evidence for this comes from the study of the language familiarity effect, whereby distinguishing talkers is easier in a language that is known to the listener. For example, in a lineup setup, monolingual English listeners can identify the same English-German bilingual talkers more accurately in English than in German (Goggin et al., 1991). Similarly, listeners learn to discriminate voices as a function of their experience with the language of the talker. In Bregman and Creel (2014), monolingual English listeners were slower at learning to distinguish Korean voices than English voices. Korean-English bilinguals showed the opposite pattern, but performance in their second language (English) was gradiently correlated with the listeners’ age of acquisition. Even varying levels of familiarity with varieties of the same language can have an effect on talker identification. Perrachione et al. (2010) asked listeners to identify voices of different racial groups, and found that listeners had an advantage for talkers that matched their own language variety (General American or African-American English), regardless of racial affiliation. Acoustic analyses showed that the listeners were able to exploit their knowledge of phonetically-relevant, socially-acquired features of their own language variety such as differences in vowel formant values, rather than potential differences in vocal tract anatomy across the racial groups.
While the language familiarity effect is well established, its underlying cause is not yet well understood. One straightforward possibility is that understanding an utterance somehow facilitates talker recognition (Köster and Schiller, 1997), but that may be an overly simplistic account in light of the finding that 7-month-old infants, with poor comprehension abilities, can discriminate speakers of their native language better than foreign speakers (Johnson et al., 2011). In addition, pairs of unintelligible time-reversed speech samples are rated as more dissimilar when spoken in the language of the listeners, as opposed to a foreign language (Fleming et al., 2014). Both studies strongly suggest that comprehension is not essential for obtaining a language familiarity advantage in talker recognition.
Instead, these findings point towards the more likely possibility that knowledge of the sound system of the language is responsible for providing this talker recognition advantage. Familiarity with the phonological structure of a language may help listeners distinguish linguistic variation from talker variation in speech. Listeners who are aware of which sound contrasts are relevant in the language may be more likely to identify acoustic features that are idiosyncratic to an individual talker. This idea is in line with Johnson et al.’s (2011) finding of a language familiarity advantage in 7-month-old infants, since perceptual commitment to the sounds of the native language occurs between 6 and 12 months of age (e.g., Kuhl et al., 2006). Additionally, adult dyslexics do not present a language familiarity advantage, performing equally poorly in their native English as in Chinese in a talker recognition task, presumably because of impoverished phonological processing (Perrachione et al., 2011). Finally, Bregman and Creel’s finding that the language familiarity effect in bilingual listeners is gradient and correlated with age of acquisition offers additional support for the important role of phonological knowledge in talker recognition, as age of onset is the strongest predictor of speech perception abilities in a second language (Flege, 1995; Flege et al., 1999; Meador et al., 2000; Mayo et al., 1997).
Unsurprisingly, naïve non-native listeners of a language, without knowledge of the relevant sound distinctions, may not be able to separate linguistically relevant variation from the type of talker-specific variation that distinguishes one speaker from another. Perceptual specialization for the first language sound system may facilitate talker recognition in the native language, but may inevitably impair it in non-native languages. There is a vast literature showing that learning a first language fundamentally alters speech perception processes (e.g., Flege, 1995; Iverson et al., 2003; Näätänen et al., 1997), such that native listeners become attuned to the specific sound contrasts in their language and less sensitive to aspects that are not relevant in it (Kuhl et al., 2006). This commitment to the native language makes the learning of new vowel categories and contrasts very difficult, and listeners tend to perceive non-native speech through the filter of their native language system. Models of non-native speech perception such as Flege’s (1995) Speech Learning Model (SLM) or Best’s (1995) Perceptual Assimilation Model (PAM) predict the relative difficulty of discriminating non-native sounds based on how they perceptually assimilate to the first language (L1) phonology. For example, PAM, which specifically deals with naïve non-native perception (rather than second-language perception, as SLM does), predicts that when two contrasting non-native sounds are categorized as different native phonemes (two-category assimilation), discriminability will be excellent because of the close alignment of the native and non-native categories. On the other hand, if the two non-native sounds are recognized as belonging equally well to the same L1 category (single-category assimilation), discriminability will be poor. However, discriminability will be somewhere in between if one of the two sounds is perceived as a better fit to the native category (category-goodness assimilation).
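The three PAM assimilation patterns described above can be written as a small decision rule. The sketch below is illustrative only: the `Assimilation` record, the 1-to-7 goodness scale, and the `goodness_margin` threshold are assumptions rather than part of PAM or of this dissertation's method, and PAM assimilation types beyond the three mentioned here are not covered.

```python
from dataclasses import dataclass


@dataclass
class Assimilation:
    """How listeners map one non-native sound onto their native (L1) system."""
    native_category: str  # L1 phoneme chosen most often for this sound
    goodness: float       # mean goodness-of-fit rating (assumed 1-7 scale)


def pam_prediction(a: Assimilation, b: Assimilation,
                   goodness_margin: float = 1.0) -> str:
    """Return the assimilation type and predicted discriminability.

    goodness_margin is a hypothetical threshold for deciding that one
    sound is a clearly better fit to the shared native category.
    """
    if a.native_category != b.native_category:
        return "two-category: excellent"
    if abs(a.goodness - b.goodness) >= goodness_margin:
        return "category-goodness: intermediate"
    return "single-category: poor"


# Invented ratings for three hypothetical non-native contrasts.
two_category = pam_prediction(Assimilation("p", 6.0), Assimilation("b", 5.5))
category_goodness = pam_prediction(Assimilation("p", 6.0), Assimilation("p", 3.5))
single_category = pam_prediction(Assimilation("p", 5.0), Assimilation("p", 5.2))
```

In practice the category labels and goodness ratings would come from a perceptual categorization task, and where to place the boundary between "equally good" and "better fit" is an empirical decision rather than a fixed constant.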
If alignment between the native and non-native phonologies determines the discriminability of linguistically-relevant variants in the non-native language, it is possible that the same mechanisms affect the discriminability of the talkers producing those variants. For example, the same pair of foreign talkers may be easier to discriminate across a pair of contrasting phones that are assimilated to two different native categories. Conversely, it may be harder to discriminate the same pair of foreign talkers across a pair of contrasting phones that are assimilated to a single native category, as general discriminability is lower for within-category stimulus pairs relative to pairs near or crossing category boundaries (Lacerda, 1998; Liberman et al., 1957; Iverson and Kuhl, 1995). In other words, difficulties in discriminating non-native contrastive phones may result in concomitant difficulties in discriminating the talkers producing them.
Such an effect of speech sound discriminability on talker discriminability can be reasonably expected considering the patterns of mutual interference between linguistic and indexical information found across multiple studies. Most evidence for this mutual interference comes from studies employing the Garner paradigm (Garner, 1974), a selective attention task used to test the interdependence of two dimensions in a stimulus. Subjects classify two-dimensional stimuli while attending to only one of the dimensions, while the unattended dimension either varies randomly or stays constant. If random variation in the unattended dimension interferes with the processing of the attended dimension, the two dimensions are understood to be processed holistically. A number of studies have paired a segmental linguistic dimension with an indexical dimension and shown a pattern of integral processing of the two. For example, Mullennix and Pisoni (1990) tested the interdependence of consonant identity (/b/ vs. /p/) in real words and talker gender (male vs. female). While the effect of talker gender variation on consonant identity was stronger, the processing of talker gender was also affected by random changes in consonant identity. Green et al. (1997) tested the same dimensions in nonsense syllables (/bi/ vs. /pi/), but found only an effect of gender variation on consonant classification, and no effect in the opposite direction. However, Kaganovich et al. (2006) did find strong mutual and symmetrical interference between linguistic and indexical dimensions in single phonemes (English /ɛ/ vs. /æ/) paired with within-gender talker identity (male 1 vs. male 2). Cutler et al. (2011) similarly found that vowels in VC stimuli interfered with the classification of two male talkers, although less than talker identity interfered with vowel classification.
10
1.4 Chapter outline
The studies reviewed in the previous section clearly show that familiarity with the language
being spoken facilitates the processing of talker information. While this language familiarity
advantage is well established, little is known about its causes. A possible explanation is that
knowledge of the sounds of language may help the listener distinguish which variation in the
speech signal is due to the language and which to the talker. If so, we could expect to find
evidence of the effect even in small linguistic units such as individual segments or syllables.
The chapters that follow explore the language familiarity effect by testing listeners’ ability to
discriminate talkers across syllables that differ in how familiar they are to the listeners.
Chapter 2 reports on three experiments assessing the effects of consonant familiarity on
talker discrimination. In Experiment 1, we tested English listeners’ perception of Korean stops
to obtain empirically-derived stimuli for the subsequent experiments. Experiment 2 tested
English listeners’ ability to discriminate pairs of talkers across consonants of varying familiarity
using an AXB task. Experiment 3 extended the paradigm to test the effects of familiar and
unfamiliar consonant-vowel pairs.
Chapter 3 reports on two experiments that assessed the effects of vowel familiarity on talker
discrimination. Experiment 4 used an AX same-different paradigm to test listeners’ ability to
discriminate talkers across familiar and unfamiliar vowels. We further used multidimensional
scaling to create and compare models of the perceptual features or dimensions that listeners use
to judge the similarity of pairs of talkers across vowels of varying familiarity. Experiment 5 used
a Garner paradigm to test the interdependence of linguistic and talker information in vowels.
Chapter 4 interprets the results of the experiments and discusses their implications for our
understanding of the language familiarity effect in talker processing.
Chapter 2
Effects of consonant familiarity on talker discrimination
The experiments reported in this chapter investigated the language familiarity effect in voice
perception by testing listeners’ ability to discriminate voices across syllables that differed in how
familiar they were to the listeners. Specifically, we asked native English listeners to discriminate
between pairs of native Korean talkers producing syllables in Korean. We were able to manipulate
how familiar the syllables were to the native English speakers by using Korean consonants
or consonant-vowel pairs that differed in terms of their alignment with the sound system of
English. In contrast to the two-way laryngeal contrast between stops in English, Korean has
a three-way contrast for word-initial voiceless obstruents: lenis in /pul/ ‘fire’, aspirated in /pʰul/
‘grass’, and fortis in /p*ul/ ‘horn’. While the Korean aspirated consonants are perceived by native
English listeners as being virtually identical to their English counterparts, their perception of
the Korean lenis and especially the fortis consonants has been shown to be inconsistent and
problematic (Schmidt, 2007), reflecting a poor fit to the listeners’ English categories.
In Experiment 1, we attempted to replicate the findings of Schmidt (2007) regarding native
English listeners’ perception of Korean stops. To that end, we recorded and normed Korean
word-initial consonants in different vowel contexts spoken by multiple native Korean speakers.
We then asked a group of native English speakers to categorize and rate them in terms of their
own English stop categories. Following the results of Experiment 1, we selected Korean syllables
with consonants that the English listeners rated as being good exemplars of their own English
categories (and therefore were familiar with), and other syllables with consonants that they
rated as bad exemplars and had difficulty categorizing (meaning they were unfamiliar to them).
We used those syllables as stimuli for Experiment 2, where we tested English listeners’ ability to
discriminate pairs of talkers across the different types of Korean consonants. In Experiment 3
we extended the paradigm to test the effects of familiar and unfamiliar Korean consonant-vowel
pairs. We expected to find that the English listeners would be better at discriminating talkers
when listening to those Korean sounds that they were familiar with, relative to other Korean
sounds that were unfamiliar to their native sound system.
2.1 Experiment 1
Experiment 1 tested the perception of Korean word-initial obstruents by native English listeners
without previous exposure to Korean. The purpose of Experiment 1 was to obtain empirically-derived
stimuli and hypotheses for the subsequent experiments on talker discrimination across
syllables of varying familiarity.
Seoul Korean obstruent realizations vary by position, but only the initial position consonants
are relevant here. In word-initial position, Korean has a three-way laryngeal contrast between
bilabial stops (/p pʰ p*/), dental or denti-alveolar stops (/t tʰ t*/), velar stops (/k kʰ k*/), and
alveo-palatal affricates (/tʃ tʃʰ tʃ*/). The first member of each set is typically referred to as lenis (or
lax), the second as aspirated, and the third as fortis (or tense); they are traditionally produced
with respective mid, long and short voice-onset time (VOT) values (Han and Weitzman, 1970;
Kagaya, 1974). Contemporary Seoul Korean has largely neutralized the VOT contrast between
lenis and aspirated consonants, but it retains the distinction primarily by a much lower f0 of the
vowel following the lenis consonants (e.g., Kang, 2014).
We expected to replicate the findings in Schmidt’s (2007) more comprehensive study of
English-Korean cross-language consonant identification. For our consonants of interest, Schmidt
found that English listeners reliably identified Korean aspirated stops as their English counterparts,
and rated them as very similar, suggesting that the category-relevant acoustic cues are
similar in both languages. The results for the lenis stops were only slightly worse. On the
other hand, Korean fortis stops were identified without much consistency and received poorer
similarity ratings, indicating that there is no good English counterpart to them and therefore
English listeners have difficulty interpreting the appropriate acoustic cues.
In this study, we presented the consonants in the context of the Korean vowels /i ɑ u ɯ/.
While Schmidt found /i ɑ/ to be well assimilated into the corresponding English categories, /u/
systematically received lower similarity ratings. We decided to include Korean /ɯ/, which was
not tested by Schmidt, because it is a typologically rare vowel which has no straightforward
English counterpart. As an extreme case of a context that would be unfamiliar to English
listeners, we included /ɯl/ as well, as Korean /l/ appears to be produced differently than English
/l/ (although its exact properties and cross-linguistic interpretation are not known). Note that
while we included different vowel contexts, we only asked participants to categorize and rate the
consonants.
2.1.1 Methods
2.1.1.1 Stimuli
Recordings. Five college-age female native speakers of Seoul Korean were individually
recorded in a noise-attenuating booth at a sampling rate of 44,100 Hz at 16 bits per sample. The
speakers read out loud a series of Korean syllables presented in Korean orthography sequentially
on a computer screen. The order of presentation was randomized across speakers. The syllables
consisted of Korean’s nine word-initial stops (/p pʰ p*, t tʰ t*, k kʰ k*/) and three word-initial
affricates (/tʃ tʃʰ tʃ*/) in CV position, with the vowels /i ɑ u ɯ/, and in CVC position with /ɯ/ and
the lateral /l/ as the final consonant. Each of the five speakers produced 60 different syllables
(12 consonants x 5 V(C) contexts), each repeated four times, for a total of 240 CV(C) sequences
per speaker, and a grand total of 1,200.
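
As a quick check on these counts, the recording inventory can be enumerated programmatically. The following is only an illustrative sketch; the variable names are hypothetical and were not part of the original recording protocol.

```python
from itertools import product

# Inventory as described in the text (12 consonants, 5 V(C) contexts).
CONSONANTS = ["p", "pʰ", "p*", "t", "tʰ", "t*", "k", "kʰ", "k*", "tʃ", "tʃʰ", "tʃ*"]
CONTEXTS = ["i", "ɑ", "u", "ɯ", "ɯl"]
N_REPETITIONS = 4   # each syllable was read four times
N_SPEAKERS = 5      # five speakers were recorded

# Every consonant combined with every V(C) context.
syllables = [consonant + context for consonant, context in product(CONSONANTS, CONTEXTS)]

tokens_per_speaker = len(syllables) * N_REPETITIONS
total_tokens = tokens_per_speaker * N_SPEAKERS

print(len(syllables), tokens_per_speaker, total_tokens)  # 60 240 1200
```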
Stimuli Selection. The recordings were automatically segmented and labelled, with manual
correction by a native Korean speaker. Of the 1,200 syllables, 12 had to be discarded due
to recording errors. The remaining 1,188 tokens were amplitude-normalized using the
root-mean-square method. To verify the accuracy and naturalness of the syllables, four female native
Korean speakers from Seoul, aged 18-30, were asked to listen to all 1,188 tokens, categorize them
and rate them for goodness. The participants sat in a sound-attenuating booth and listened to
the Korean syllables, one at a time, over Sony MDR7506 headphones at a comfortable level
consistent across participants. The stimuli were evenly distributed across blocks and randomized
within. The order of presentation of the blocks varied for each listener. The recordings were
played sequentially and the participants’ responses recorded using the experiment module in
Praat. The listeners were asked to categorize the initial consonant in the syllable they heard by
choosing from one of the options presented on a computer screen. The options presented were
the nine voiceless stops and three voiceless affricates, written in Korean orthography with the
corresponding vowel or vowel-consonant sequence (that is, listeners had to identify the initial
consonant that they heard, but the rest of the syllable was shown to them). In addition, they
were asked to rate each token for goodness of fit to the category they selected, on a scale of 1
(worst) to 4 (best). Of the 1,188 syllables, 211 (17.7%) were discarded because they were either
miscategorized by at least one of the listeners (11.1%), or because, while correctly categorized
by all listeners, they were rated 3.5 out of 4 or below on average for category goodness (6.5%).
The discarded items were randomly distributed, and at least two repetitions of each syllable per
talker survived the selection process. The remaining 977 Korean syllables were presented to the
English listeners in the experiment.
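
The normalization and selection steps described above can be sketched as follows. This is a minimal illustration rather than the scripts actually used: the function names and the target RMS level are assumptions, and the selection rule simply encodes the two stated criteria (unanimous correct categorization and a mean goodness rating above 3.5).

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def rms_normalize(samples, target_rms=0.1):
    """Scale samples so that their RMS equals target_rms (hypothetical level)."""
    current = rms(samples)
    if current == 0:
        return list(samples)  # silence cannot be rescaled
    gain = target_rms / current
    return [x * gain for x in samples]


def keep_token(intended, responses, ratings, min_mean_rating=3.5):
    """Return True if a token survives the norming criteria.

    intended  -- the consonant the speaker was asked to produce
    responses -- the category chosen by each norming listener
    ratings   -- the 1-4 goodness rating given by each norming listener
    """
    # Discard tokens miscategorized by at least one listener.
    if any(response != intended for response in responses):
        return False
    # Discard tokens rated 3.5 or below on average.
    return sum(ratings) / len(ratings) > min_mean_rating
```

For instance, a token labelled correctly by all four listeners with ratings of 4, 4, 4 and 3 (mean 3.75) would be kept, whereas one with ratings of 4, 4, 3 and 3 (mean 3.5) would be discarded.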
2.1.1.2 Participants
A total of 25 listeners completed Experiment 1. The listeners were native speakers of American
English recruited from the undergraduate student population at the University of Southern
California (USC). As indicated by self-report, all participants had normal vision and hearing,
and no history of speech, language or learning problems. None of the listeners reported any
experience with Korean. They received course credit for their voluntary participation. The
study was approved by the USC Institutional Review Board.
2.1.1.3 Procedure
The stimuli were evenly distributed across four blocks of approximately 245 items each, and each
participant listened to only one of those blocks. The participants sat in a sound-attenuating
booth and listened to the Korean syllables, one at a time and in random order, over Sony
MDR7506 headphones at a comfortable level consistent across participants. They received written
instructions to categorize the initial consonant in each syllable by choosing from one of the
options presented on a computer screen. The options presented were all six English stops /b d g p
t k/ and two affricates /dʒ tʃ/, written in conventional English spelling and with the corresponding
vowel or vowel-consonant sequence (that is, listeners had to categorize the initial consonant that
they heard, but the rest of the syllable was shown to them). The consonants /b d g p t k tʃ dʒ/
were represented by the graphs ‘b, d, g, p, t, k, ch, j’, and the vowels /i ɑ u ɯ/ by ‘ee, a, oo, eu’.
In addition, listeners were asked to rate each token for goodness of fit to the English category
they selected, on a scale of 1 (worst) to 4 (best). The participants’ responses were recorded
using the experiment module in Praat (Boersma and Weenink, 2020).
Table 2.1: Percentage of Korean syllable-initial consonants categorized as English consonants. Minor percentages
(<3) not shown. Mean goodness ratings in parentheses.
Korean English
/p/ /b/ /t/ /d/ /k/ /g/ /tʃ/ /dʒ/
/pʰ/ 95.93% (3.29)
/p/ 73.33% (3.14) 24.00% (3.33)
/p*/ 30.62% (2.92) 65.38% (3.07)
/tʰ/ 97.04% (3.12)
/t/ 70.43% (3.01) 25.91% (3.23)
/t*/ 34.48% (2.81) 61.47% (2.98)
/kʰ/ 94.88% (3.19)
/k/ 72.87% (3.07) 24.24% (3.29)
/k*/ 40.67% (2.94) 54.14% (2.96)
/tʃʰ/ 50.49% (2.96) 49.02% (3.01)
/tʃ/ 40.06% (2.82) 11.57% (3.17) 30.42% (2.87) 14.99% (2.87)
/tʃ*/ 24.79% (2.69) 27.16% (2.86) 5.57% (2.75) 10.72% (2.56) 31.34% (2.76)
2.1.2 Results
2.1.2.1 Overall categorization and goodness ratings
Table 2.1 shows the total percentages of Korean syllable-initial consonants categorized as English
consonants by the English listeners, collapsed across the five V(C) contexts, and with
mean goodness ratings in parentheses (on a 4-point scale). As expected, the Korean voiceless
aspirated stops were nearly always categorized as the corresponding English voiceless aspirated
stops. The Korean lenis stops were categorized as the corresponding English voiceless aspirated
stops at approximately a 3 to 1 ratio against the voiced English counterparts. This pattern was
somewhat reversed for the Korean fortis stops, which were preferentially categorized as voiced
stops rather than voiceless stops, in approximately 60% of the trials. The Korean affricates followed a
very similar pattern. The aspirated affricate was always interpreted as an English voiceless
consonant; the lenis affricate was split between a voiced and voiceless interpretation, strongly favoring
the voiceless; the fortis affricate was also split between a voiced and voiceless categorization, but
the preference was for the voiced. However, the Korean affricates were not always perceived as
English affricates by the English listeners. In all cases, approximately 50% of Korean affricates
were categorized as English alveolar stops.
As for goodness ratings, the Korean aspirated stops were consistently rated as good samples
of the corresponding English stop categories. Despite the fact that they were categorized as
English voiced stops only in 25% of trials, the Korean lenis stops received similarly high mean
ratings when categorized as such, while the ratings when categorized as voiceless were slightly
lower. The Korean fortis stops received notably lower mean goodness ratings overall, except
when categorized as English /b/. Mean goodness ratings for the Korean affricates were also
consistently lower, with the exception of the Korean lenis affricate being categorized as /d/.
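
To make the construction of these summaries concrete, the sketch below tabulates categorization percentages and mean goodness ratings from trial-level responses, hiding cells below 3% as in the tables. It is an illustration only: the function name and the assumed trial format (Korean consonant, English response, rating) are hypothetical and do not describe the actual analysis scripts.

```python
from collections import defaultdict


def categorization_table(trials, min_percent=3.0):
    """Summarize trials given as (korean_consonant, english_response, rating).

    Returns {korean: {english: (percent_of_trials, mean_rating)}}, omitting
    responses that account for less than min_percent of a consonant's trials.
    """
    ratings = defaultdict(lambda: defaultdict(list))
    for korean, english, rating in trials:
        ratings[korean][english].append(rating)

    table = {}
    for korean, by_response in ratings.items():
        n_total = sum(len(values) for values in by_response.values())
        row = {}
        for english, values in by_response.items():
            percent = 100.0 * len(values) / n_total
            if percent >= min_percent:
                mean_rating = sum(values) / len(values)
                row[english] = (round(percent, 2), round(mean_rating, 2))
        table[korean] = row
    return table
```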
2.1.2.2 Breakdown by context
Tables 2.2, 2.3, 2.4 and 2.5 show, respectively, the percentages of Korean bilabial, alveo-dental,
velar and alveo-palatal consonants categorized as English consonants, broken down by V(C)
context, and with mean goodness ratings in parentheses. Overall, the breakdown follows the
general pattern described above, but there is some notable variation induced by some of the
contexts. Most notably, the /ɯ/ and /ɯl/ contexts cause a drop in goodness ratings across all
consonants, except when Korean lenis stops are classified as the corresponding English voiced
consonants. The same contexts also cause the Korean aspirated bilabial stop to be categorized
as English /k/ in about 5% of trials. Categorization rates are consistent across contexts for the
Korean lenis and aspirated stops. However, the fortis stops show much more variation across
contexts; while categorization as an English voiced consonant is always preferred, it is so at
different rates across contexts. The vowel /ɑ/ leads to the Korean alveolar stops being categorized
as bilabial English stops occasionally. As is evident in Table 2.5, the Korean affricates reflect the
most variability by far of all the consonants in question. While following the general trend of
their Korean stop counterparts, they present the added complication of being categorized almost
with equal frequency as English affricates as well as stops. The /u/ context consistently produced
a drop in goodness ratings for almost all consonants.
Table 2.2: Percentage of Korean syllable-initial bilabial stops categorized as English consonants. Minor
percentages (<3) not shown. Mean goodness ratings in parentheses.
Korean English
/p/ /b/ /t/ /d/ /k/ /g/ /tʃ/ /dʒ/
/pɑ/ 91.25% (3.68) 6.25% (3.20)
/pi/ 93.07% (3.70) 6.93% (2.86)
/pu/ 83.93% (3.01) 9.82% (3.18)
/pɯ/ 90.80% (2.80) 9.20% (3.75)
/pɯl/ 89.47% (2.53) 6.32% (3.83)
/pʰɑ/ 98.53% (3.66)
/pʰi/ 98.80% (3.74)
/pʰu/ 96.30% (3.32)
/pʰɯ/ 93.94% (2.88) 5.05% (2.60)
/pʰɯl/ 88.78% (2.78) 6.12% (2.33)
/p*ɑ/ 15.97% (3.58) 84.03% (3.54)
/p*i/ 11.86% (3.00) 88.14% (3.61)
/p*u/ 34.91% (2.84) 59.43% (2.83)
/p*ɯ/ 17.60% (2.50) 72.00% (2.66)
/p*ɯl/ 22.32% (2.48) 69.64% (2.61)
Table 2.3: Percentage of Korean syllable-initial alveolar stops categorized as English consonants. Minor
percentages (<3) not shown. Mean goodness ratings in parentheses.
Korean English
/p/ /b/ /t/ /d/ /k/ /g/ /tʃ/ /dʒ/
/tɑ/ 8.05% (3.29) 81.61% (3.45) 8.05% (3.57)
/ti/ 90.80% (3.43) 9.20% (3.13)
/tu/ 85.98% (3.13) 10.28% (3.64)
/tɯ/ 81.72% (2.64) 13.98% (3.46)
/tɯl/ 81.05% (2.48) 16.84% (2.94)
/tʰɑ/ 5.38% (3.00) 92.47% (3.34)
/tʰi/ 98.65% (3.64)
/tʰu/ 97.67% (3.30)
/tʰɯ/ 98.39% (2.92)
/tʰɯl/ 98.02% (2.39)
/t*ɑ/ 22.97% (3.24) 21.62% (2.63) 54.05% (3.20)
/t*i/ 18.40% (2.96) 80.80% (3.53)
/t*u/ 41.90% (3.05) 58.10% (3.13)
/t*ɯ/ 20.54% (2.48) 78.57% (2.81)
/t*ɯl/ 26.00% (2.23) 72.00% (2.31)
Table 2.4: Percentage of Korean syllable-initial velar stops categorized as English consonants. Minor percentages
(<3) not shown. Mean goodness ratings in parentheses.
Korean English
/p/ /b/ /t/ /d/ /k/ /g/ /tʃ/ /dʒ/
/kɑ/ 91.07% (3.62) 7.14% (3.63)
/ki/ 88.79% (2.99) 8.41% (3.56)
/ku/ 77.78% (3.41) 19.75% (3.38)
/kɯ/ 89.08% (2.78) 7.56% (3.33)
/kɯl/ 85.71% (2.71) 9.82% (3.45)
/kʰɑ/ 94.62% (3.65)
/kʰi/ 7.14% (3.50) 75.00% (3.57) 17.86% (3.00)
/kʰu/ 99.06% (3.20)
/kʰɯ/ 98.11% (2.97)
/kʰɯl/ 97.00% (2.62)
/k*ɑ/ 7.63% (3.22) 33.90% (3.13) 56.78% (3.45)
/k*i/ 30.09% (3.15) 63.72% (3.29)
/k*u/ 44.54% (2.77) 50.42% (2.82)
/k*ɯ/ 18.49% (2.73) 80.67% (2.77)
/k*ɯl/ 28.57% (2.41) 64.71% (2.52)
Table 2.5: Percentage of Korean syllable-initial alveo-palatal affricates categorized as English consonants. Minor
percentages (<3) not shown. Mean goodness ratings in parentheses.
Korean English
/p/ /b/ /t/ /d/ /k/ /g/ /tʃ/ /dʒ/
/tʃɑ/ 74.34% (3.42) 18.58% (2.86) 6.19% (3.71)
/tʃi/ 39.29% (2.89) 11.61% (2.69) 33.93% (2.82) 14.29% (2.88)
/tʃu/ 79.25% (3.13) 17.92% (3.32)
/tʃɯ/ 57.14% (2.44) 4.46% (2.00) 28.57% (2.59) 7.14% (2.50)
/tʃɯl/ 54.72% (2.17) 5.66% (3.00) 27.36% (2.55) 11.32% (2.75)
/tʃʰɑ/ 66.67% (3.38) 33.33% (3.24)
/tʃʰi/ 37.63% (2.66) 61.29% (2.98)
/tʃʰu/
/tʃʰɯ/ 54.84% (2.51) 45.16% (2.52)
/tʃʰɯl/ 54.72% (2.38) 44.34% (2.62)
/tʃ*ɑ/ 12.61% (3.20) 57.98% (3.42) 24.37% (3.03)
/tʃ*i/ 20.17% (3.13) 22.69% (2.67) 21.01% (3.16) 9.24% (2.55) 26.89% (2.91)
/tʃ*u/ 5.04% (3.00) 21.01% (2.92) 69.75% (2.94)
/tʃ*ɯ/ 28.81% (2.15) 34.75% (2.61) 4.24% (2.20) 7.63% (1.89) 23.73% (2.32)
/tʃ*ɯl/ 30.51% (1.86) 26.27% (2.26) 5.08% (1.50) 11.86% (2.43) 26.27% (2.42)
While the listeners in Schmidt’s study performed an open-set identification task, rather
than the categorization task in our study, our results are virtually identical in terms of what
categories English listeners perceived Korean consonants to belong to. The pattern of goodness
ratings is also strikingly similar (but note that we used a 4-point scale rather than 5). New
in our findings, although expected, is the fact that the vowel context /ɯ/ causes English
listeners to dramatically lower goodness ratings for the accompanying consonants, and even
change the categorization pattern occasionally. It is possible that listeners may have been rating
the syllables as a whole, but that is unlikely to be an issue for our purposes.
In sum, the Korean aspirated stops appear to be perceived by English listeners as being very
close to their English counterparts, while the Korean fortis stops are not perceived consistently
and are not deemed good representatives of any English category. These results indicate that
these two categories of stimuli are appropriate to test the effects of different levels of linguistic
familiarity in English listeners’ talker discrimination abilities.
2.2 Experiment 2
The goal of Experiment 2 was to test English listeners’ ability to discriminate pairs of talkers
across syllables of varying familiarity. Using the results of Experiment 1 as a guide, we selected
two groups of syllables as the conditions for Experiment 2.
2.2.1 Methods
2.2.1.1 Stimuli
Based on the results of Experiment 1, we selected Korean stimuli that differed in how familiar
they were to native English speakers. Half of the stimuli were syllables with aspirated stops,
which the English listeners in Experiment 1 had rated as being very close to their English stop
counterparts. The other half was made up of syllables with fortis stops and affricates, which
were on the other extreme of familiarity for the English listeners, as reflected by very inconsistent
categorization and low goodness ratings.
The selection of syllables with aspirated stops included /pʰi, tʰi, tʰɑ/. The selection of syllables
with fortis stops and affricates included /tʃ*i, k*ɑ, t*u/. Each of the six speakers from Experiment
1 contributed two repetitions of each, for a total of 72 stimuli (6 syllables x 6 speakers x 2
repetitions). The design of the experiment followed an AXB paradigm, where listeners would
hear three stimuli in sequence, and had to say whether the second (X) was more similar to the
first (A) or to the third (B). Therefore, we arranged the stimuli such that, for each syllable,
a given speaker would face every other speaker four times. Of those four times, each speaker
would be on the X position twice. In total there were 240 unique AXB syllable triplets (4 speaker
face-offs x 10 speaker pairs x 6 syllables).
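
The arrangement of triplets can be illustrated with a short enumeration. This is a sketch under stated assumptions: it uses five placeholder speaker labels (the number recorded in Experiment 1, and the number that yields 10 speaker pairs), and it does not model which repetition of a syllable filled each slot, which the description leaves open.

```python
from itertools import combinations

SPEAKERS = ["S1", "S2", "S3", "S4", "S5"]              # placeholder labels
SYLLABLES = ["pʰi", "tʰi", "tʰɑ", "tʃ*i", "k*ɑ", "t*u"]  # Experiment 2 syllables


def axb_triplets(speakers, syllables):
    """Enumerate AXB trials: four face-offs per speaker pair and syllable."""
    triplets = []
    for syllable in syllables:
        for s1, s2 in combinations(speakers, 2):
            for x in (s1, s2):                       # each speaker is X twice
                for a, b in ((s1, s2), (s2, s1)):    # X matches A once, B once
                    triplets.append({"syllable": syllable, "A": a, "X": x, "B": b})
    return triplets


trials = axb_triplets(SPEAKERS, SYLLABLES)
print(len(trials))  # 240
```

In each triplet the correct answer is whichever of A or B shares its speaker with X, so correct "first" and "last" responses are balanced by construction.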
2.2.1.2 Participants
A total of 55 listeners took part in Experiment 2. All participants were native speakers of
American English recruited from the undergraduate student population at USC. They reported
no knowledge of Korean. Their ages ranged from 18 to 27 and averaged 20.25. As indicated by
self-report, all participants had normal vision and hearing, and no history of speech, language
or learning problems. They provided written informed consent and received course credit for
their participation. The study was approved by the USC Institutional Review Board. (The data
from nine other participants were excluded from analysis because they reported some degree of
familiarity with Korean.)
2.2.1.3 Procedure
The participants sat in a sound-attenuating booth and listened to the Korean syllables over
Sony MDR7506 headphones at a comfortable level consistent across participants. They received
written instructions to listen to three syllables in quick succession (A, X, B) and to try to identify
which two of the three syllables had been produced by the same speaker. They were told that the
second syllable in the sequence (X) would always be the speaker in question, and so their task
was to identify which of the other two syllables, first (A) or last (B), was produced by the same
speaker. Listeners selected their answer by choosing from two on-screen buttons labeled ‘first’
and ‘last’. Listeners completed a practice block with 6 trials with the help of an experimenter
who provided feedback to ensure they understood the task. The stimuli for the practice block
were syllables with lenis consonants from Experiment 1. The presentation of stimuli and the
recording of responses were done using Paradigm (Perception Research Systems, 2007). The 240
syllable triplets were evenly distributed across four blocks and randomized within. The order
of presentation of the blocks was counterbalanced across listeners. The three syllables in each
trial were separated by an inter-stimulus interval of 250 ms, and trials were separated by 1 s.
Participants were allowed to take a break for as long as they wanted between blocks.
2.2.1.4 Statistics
R software (R Core Team, 2018) was used to conduct all statistical analyses. To minimize the
potential effects of outliers or non-normality, robust statistical techniques and measures were
adopted (see Wilcox, 2016) using the WRS2 package (Mair and Wilcox, 2020). In particular,
bootstrap analyses were chosen because they avoid the assumptions of normality and equal
variance, whose violation can greatly reduce the power of traditional ANOVAs and t-tests. Rather
than performing tests on normal distributions that may not accurately reflect the observed
data, they generate a bootstrap distribution by repeatedly sampling with replacement from the
original dataset. The trimmed mean (of the same family of trimmed estimators as the median)
was chosen as a measure of central tendency in the analyses because it has been shown to be less
sensitive to outliers than the mean and to maintain high power when testing from both normal
and non-normal distributions. The 20% cutoff, which removes the lowest 20% and the highest
20% of observations, has proven to be a good default in most situations (Wilcox and Keselman,
2003). When needed to account for multiple comparisons, the familywise Type I error rate was
controlled at α = 0.05 using Rom’s method, an improvement on the Bonferroni correction (Rom,
1990).
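
To illustrate the logic of these techniques, the sketch below computes a 20% trimmed mean and a percentile bootstrap confidence interval for a set of difference scores. It is not the WRS2 code used for the reported analyses (which were run in R); the function names, the number of bootstrap samples and the seed are assumptions, and Rom’s method is not implemented here.

```python
import math
import random


def trimmed_mean(values, trim=0.2):
    """Mean after removing the lowest and highest `trim` proportion of values."""
    ordered = sorted(values)
    g = math.floor(trim * len(ordered))   # observations trimmed from each tail
    kept = ordered[g:len(ordered) - g]
    return sum(kept) / len(kept)


def percentile_bootstrap_ci(diffs, n_boot=2000, trim=0.2, alpha=0.05, seed=0):
    """Trimmed mean of the difference scores and its percentile bootstrap CI."""
    rng = random.Random(seed)
    n = len(diffs)
    # Resample the difference scores with replacement and re-estimate each time.
    estimates = sorted(
        trimmed_mean([rng.choice(diffs) for _ in range(n)], trim)
        for _ in range(n_boot)
    )
    lower = int(round(alpha / 2 * n_boot))
    upper = n_boot - lower - 1
    return trimmed_mean(diffs, trim), (estimates[lower], estimates[upper])


# Hypothetical per-listener difference scores (not data from the study).
example_diffs = [0.02, 0.03, 0.01, 0.04, 0.025, 0.03, 0.02, 0.035, 0.015, 0.03]
estimate, (ci_low, ci_high) = percentile_bootstrap_ci(example_diffs)
```

Under this approach, a difference would be judged significant at the chosen α when the interval excludes zero.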
2.2.2 Results
Figure 2.1 summarizes the results of the experiment. Overall, accuracy is high for most listeners.
Participants appeared to perform better with the syllables containing fortis consonants than with
the ones containing aspirated consonants. This is clear on the left panel of Figure 2.1, where
results are displayed in pairs by participant. That is, each point represents the overall accuracy
of a listener in each condition, with the lines connecting a listener’s data points across the
two conditions. As seen in the plot, most lines point upwards, indicating that listeners may
have had an advantage for discriminating the Korean talkers’ voices in the syllables containing
fortis consonants. In order to test the significance of these observations, we compared the two
conditions using a percentile bootstrap method on the difference scores (see Figure 2.2). The
difference between the conditions was significant ($\hat{\psi} = 0.026$, $p < 0.001$, 95% C.I. = [0.016, 0.034]),
demonstrating an improvement in talker discrimination accuracy in the fortis condition. (Note
that degrees of freedom are not reported, as these are not associated with bootstrap techniques.)
Aggregating accuracy scores by talker pairs (middle panel in Figure 2.1) shows a similar
trend, with all but two of the talker pairs showing a clear increase in accuracy in the fortis
condition. The same is true if aggregating accuracy scores by syllable triplets, as the beanplot
on the right panel of Figure 2.1 shows. While the syllable triplets are not designed to be
paired across conditions, there is a considerable increase in density towards the top of the fortis
distribution of scores, relative to the aspirated distribution. While we did not test the statistical
significance of these differences (as they were not the main purpose of the experiment), they
support the main finding that listeners found it easier to discriminate talkers in syllables that
contained fortis consonants.
[Figure: three panels titled "By listener", "By talker pair" and "By syllable triplet"; y-axis: accuracy; x-axis: aspirated vs. fortis]
Figure 2.1: Plots of paired results of Experiment 2, aggregated by participants (left panel) or by talker pairs
(middle panel). The right panel shows a beanplot of the results aggregated by syllable triplets; the lines show
individual observations, while the area shows the distribution.
In sum, while we found a significant effect of syllable familiarity on English listeners’ abilities
to discriminate voices, the effect was in the opposite direction to the one we originally predicted. That
is, listeners were able to discriminate voices at a higher rate on those syllables that had been
deemed more unfamiliar in Experiment 1, the Korean fortis stops and affricates.
[Figure: "Pairwise differences", by listener; x-axis: effect of consonant type]
Figure 2.2: Plot of difference scores aggregated by participants (accuracy in fortis minus accuracy in aspirated
trials), along with 95% confidence interval of the 20% trimmed mean of the differences.
2.3 Experiment 3
To further investigate the effect of familiarity on talker discrimination, we conducted a variation
of the previous experiment. As in Experiment 2, we used two types of Korean consonants that
varied in their level of familiarity to English listeners (aspirated and fortis). However, to allow
for more relevant comparisons, in Experiment 3 we selected the syllables in each group to be
directly comparable to those in the other. For example, we chose to compare /pʰi/ and /p*i/.
Additionally, we decided to use the consonants in the contexts /ɯ/ and /ɯl/, since they had
attracted much lower ratings and variable categorization in Experiment 1. This would allow us
to make comparisons not just across the two types of consonants, but also between vowel contexts
that were perceived very differently by the English listeners in Experiment 1.
2.3.1 Methods
2.3.1.1 Stimuli
As in Experiment 2, we selected Korean stimuli that differed in how familiar they were to native
English speakers. Half of the stimuli were syllables with aspirated stops, which the English
listeners in Experiment 1 had rated as being very close to their English stop counterparts.
The other half was made up of syllables with fortis stops, which were on the other extreme
of familiarity for the English listeners, as reflected by very inconsistent categorization and low
goodness ratings. Unlike in Experiment 2, we chose the syllables in each group to be directly
comparable to the ones in the other by using only one place of articulation (bilabial) for both
aspirated and fortis stops, and by using the same vowel contexts across conditions.
The selection of syllables with aspirated stops included /pʰi, pʰɯ, pʰɯl/. The selection of
syllables with fortis stops included /p*i, p*ɯ, p*ɯl/. Again, each of the six speakers
from Experiment 1 contributed two repetitions of each, for a total of 72 stimuli (6 syllables x
6 speakers x 2 repetitions). The design of the experiment followed an AXB paradigm, where
listeners would hear three stimuli in sequence, and had to say whether the second (X) was more
similar to the first (A) or to the third (B). Therefore, we arranged the stimuli such that, for
each syllable, a given speaker would face every other speaker four times. Of those four times,
each speaker would be in the X position twice. In total there were 240 unique AXB syllable
triplets (4 speaker face-offs x 10 speaker pairs x 6 syllables).
2.3.1.2 Participants
A total of 63 participants completed Experiment 3. All participants were native speakers of
American English recruited from the undergraduate student population at USC. They reported
no knowledge of Korean. Their age and gender were not collected. As indicated by self-report,
all participants had normal vision and hearing, and no history of speech, language or learning
problems. They provided written informed consent and received course credit for their
participation. The study was approved by the USC Institutional Review Board. (The data from three
other participants were excluded from analysis because they reported some degree of familiarity
with Korean. The data from one other participant were excluded because their responses were
at chance levels.)
2.3.1.3 Procedure
The procedure was identical to the one in Experiment 2.
2.3.1.4 Statistics
The statistical methods were the same as in Experiment 2.
2.3.2 Results
Figure 2.3 summarizes the results of the experiment, aggregated by listeners, by talker pairs and
by syllable triplets. All plots are of paired data points, as indicated by the connecting lines. As
in Experiment 2, accuracy is high for most participants, with many concentrated between 90
and 100%, and almost all above 80%. Visually, there is not as clear a trend as in Experiment
2. We compared the participants’ performance across the two syllable types using a percentile
bootstrap method on the difference scores, which revealed a statistically significant difference
between the two syllable conditions ($\hat{\psi} = 0.011$, $p < 0.001$, 95% C.I. = [0.004, 0.017]). Again,
listeners were better at discriminating talkers in the fortis syllables, but the overall effect was
smaller this time.
Figure 2.4 breaks down the listener scores into the different vowel contexts. Scores appear
to be similarly distributed across all contexts and consonants. Percentile bootstrap tests reveal
small but significant effects of consonant condition on both /-ɯ/ ($\hat{\psi} = 0.019$, $p < 0.001$, 95%
C.I. = [0.007, 0.031]) and /-ɯl/ ($\hat{\psi} = 0.028$, $p < 0.001$, 95% C.I. = [0.017, 0.039]), where again fortis
consonants confer an advantage in talker discrimination over aspirated consonants. On the other
hand, the effect is reversed in the /-i/ context ($\hat{\psi} = -0.017$, $p = 0.006$, 95% C.I. = [−0.030, −0.004]),
where the fortis condition is more difficult than the aspirated.
Further comparisons of interest are those between the same consonant type but across
different vowel contexts. If there is an effect of familiarity on talker discrimination at the syllable level,
we would also expect to find it when comparing only vowels of varying familiarity to the
listeners. We tested the contrast between /p*i/ and /p*ɯ/ and found that the unfamiliar vowel
/ɯ/ caused a significant drop in talker discrimination accuracy ($\hat{\psi} = -0.03$, $p < 0.001$, 95% C.I.
= [−0.046, −0.015]). However, the effect is not mirrored in the comparison between /pʰi/ and
/pʰɯ/ ($\hat{\psi} = 0.004$, $p = 0.542$, 95% C.I. = [−0.009, 0.016]).
In sum, the results largely replicated those of Experiment 2. We found again an effect of
familiarity on English listeners’ abilities to discriminate voices, but in the opposite direction
to the one originally predicted. Syllables with fortis consonants were overall significantly easier for the
listeners when discriminating voices. When breaking down the results by the different vowel
contexts, it is evident that /-ɯ/ and /-ɯl/ are driving the results. On the other hand, /-i/ shows
the opposite pattern, with aspirated stops being the easier condition. Further, comparing the
same consonant condition across different vowel contexts showed inconsistent results. While
/p*i/ was easier than /p*ɯ/, as we would have predicted, /pʰi/ and /pʰɯ/ showed no significant
difference.
2.4 Discussion
The experiments in this chapter had the goal of testing the effect of consonant familiarity on
talker discrimination at the syllable level. We first tested English listeners’ perception of
Korean syllables and identified different levels of familiarity, replicating previous work by Schmidt
(2007). We had predicted that Korean syllables familiar to English listeners (based on their
categorization and goodness ratings in relation to English sounds) would provide an advantage
over less familiar syllables in a talker discrimination task. The results of Experiments 2 and
3 did not align with this prediction. In fact, overall, it appears that the effect may have gone
in the opposite direction. That is, less familiar syllables provided an advantage over the more
familiar ones.
[Figure 2.3 plot. Panels: By listener, By talker pair, By syllable triplet. Axes: Accuracy (0.6 to 1.0) for Aspirated and Fortis.]
Figure 2.3: Plots of paired results of Experiment 3, aggregated by participants (left panel), by talker pairs
(middle), or by syllable triplets (right).
[Figure 2.4 plot. Panels: -ɯ, -ɯl, -i. Axes: Accuracy (0.6 to 1.0) for Aspirated and Fortis; Effect of consonant type (−0.10 to 0.20).]
Figure 2.4: Plots of paired results of Experiment 3, aggregated by participants and broken down by vowel contexts.
The less familiar fortis consonants of Korean provided a consistent advantage to the English
listeners in the two talker discrimination experiments presented here. Statistical estimation
places this boost in the 1-5% range. This is in fact a considerable increase in accuracy relative
to the aspirated consonants, given the fact that accuracy rates were already close to ceiling in the
latter condition for many listeners. From visual examination of the paired plots, it appears that
the effect is particularly dramatic for listeners who performed lowest on the aspirated conditions,
which suggests that there may have been a ceiling effect at play throughout the experiments.
These results are inconsistent with previous studies showing that listeners are better at
identifying talkers in a language or dialect that they are familiar with (e.g. Bregman and Creel,
2014; Perrachione et al., 2010). The discrepancy in the results may be explained by the different
levels of linguistic analysis at work in the different studies. Previous studies have used full
sentences, which may have engaged linguistic processing to a greater extent than the syllables
in our study. Given the short, nonsense syllables used as stimuli, listeners may have been able
to ignore any linguistic aspects of the signal and perform a more purely acoustic analysis of the
talkers’ voices.
In this context, it is difficult to ascribe the difference in accuracy between the fortis and
aspirated consonants to an effect of familiarity. In fact, a simpler explanation may be that
vowels after fortis consonants in Korean are longer relative to vowels after aspirated consonants
(Taehong, 1995), and vowels play a more prominent role than consonants in talker discrimination
(Owren & Cardillo, 2006). However, if overall syllable length were a factor, we would expect to
find higher accuracy rates for the /-ɯl/ context, given the extra sonorant in the syllable coda.
Post-hoc tests did not find any differences in accuracy between /pʰɯ/ and /pʰɯl/, nor between
/p*ɯ/ and /p*ɯl/.
A comparison that could disambiguate the cause of this effect is that between /p*i/ and /p*ɯ/,
which do not differ in length but do differ in the relative familiarity of /i/ and /ɯ/ to English
listeners. Confusing things further, however, the unfamiliar /ɯ/ is significantly harder than the
familiar /i/, reversing the direction of the effect found in the other comparisons. Further, /pʰi/
and /pʰɯ/ do not differ significantly.
In sum, the results resist easy interpretation but warrant further exploration of the
potential effect of familiarity on talker discrimination in syllables, particularly with regard to the
contribution of vowels. The following chapter investigates the issue in detail.
Chapter 3
Effects of vowel familiarity on talker discrimination
The experiments reported in this chapter investigated the language familiarity effect in voice
perception by testing listeners’ ability to discriminate voices across syllables that differed in
how familiar they were to the listeners. We followed a design similar to that in Chapter 2,
but focused specifically on the effects of vowels, which are known to play a more prominent
role than consonants in talker discrimination and are particularly suited to encode indexical
information due to their prominent fundamental frequency and vocal tract length cues, as well
as a rich harmonic structure that can be shaped by inter-talker variability in the vocal-tract
transfer function (Owren and Cardillo, 2006). Bachorowski and Owren (1999) have shown
that individual vowels consistently contain cues to both talker identity and talker sex, and that
acoustic measurements associated with the vocal-tract transfer function (i.e., formant structure)
are the best predictors of talker identity, far surpassing source-related variables such as f0, which
are instead more useful for distinguishing talker sex¹.
To manipulate vowel familiarity, we chose the Korean vowels /i, u, ɯ/. Korean /i/ and /u/ are
acoustically similar to their English counterparts. On the other hand, Korean /ɯ/, a high back
unrounded vowel, is typologically rare and has no straightforward English counterpart. In terms
of contrast pairs, /i/ and /u/ assimilate to the two corresponding English categories (two-category
assimilation in the Perceptual Assimilation Model, PAM; Best, 1995). On the other hand, /u/ and
/ɯ/ assimilate to a single English category /u/, with [ɯ] presumably being perceived as a worse
fit than [u] (category-goodness assimilation in PAM).
¹ While f0 is generally found to be an important vocal feature relevant to talker recognition in longer speech
samples (see Kreiman and Sidtis, 2011), we do not expect it to play a prominent role in talker identity in isolated
vowel sounds. This discrepancy is not surprising if we consider that the f0 of individual vowels may be of little use to
listeners without reference to the overall context of a talker’s f0 range and idiosyncratic prosodic patterns.
In Experiment 4, we compared the performance of native Korean and English listeners
on this stimulus set using a same-different (AX) discrimination task. We expected that patterns
of assimilation of non-native contrasts to native categories would differentially affect the
discriminability of talkers for the English listeners. We then applied multidimensional scaling to
create a map of the perceptual space of the listeners and investigated whether familiarity with
the different vowels affected the size of the perceptual space for talkers. We also measured
correlations between dimensions in the perceptual space and acoustic features of the stimuli to
investigate whether vowel familiarity affected which properties of the stimuli listeners attended
to when discriminating talkers.
Because talker identity is seemingly encoded in the same formant-frequency characteristics
that encode vowel identity, it is likely that the processing of these two dimensions in speech is
integrated in such a way that separating them may be difficult for listeners, especially for naïve
listeners lacking knowledge of whether a perceived acoustic contrast is relevant linguistically,
indexically, or not at all. Experiment 5 explored this possibility by testing the talker
discrimination abilities of English listeners and native Korean listeners under a Garner paradigm.
The Garner task (Garner, 1974) measures interference between different stimulus dimensions, and
may be able to detect different levels of interference of vowel identity (a linguistic dimension)
on talker identity (an indexical dimension) depending on how a phone pair aligns to the native
phonology.
3.1 Experiment 4
The goal of Experiment 4 was to test listeners’ ability to discriminate pairs of talkers across
vowels of varying familiarity. As in the previous experiments, we manipulated familiarity by
using a selection of Korean sounds that differed in phonological alignment to English sound
categories; these are detailed in the next section. We also used an AX same-different paradigm
to test listeners’ ability to discriminate talkers across the different vowels. Unlike in the previous
experiments, we included a group of native Korean listeners to contrast with the English listeners.
Additionally, we applied multidimensional scaling to the numerical pattern of results in
the AX discrimination task to create a map of the perceptual space of the listeners, where
physical proximity reflects perceived similarity between the talkers. Multidimensional scaling
is a mathematical tool that has been extensively used to study the perceptual space of vowel
systems (e.g., Pols et al., 1969; Iverson and Kuhl, 1995; Fox et al., 1995). Applying MDS to
the discrimination data allowed us to create and compare models of the perceptual features or
dimensions that listeners use to discriminate pairs of talkers. We sought to determine whether
vowel similarity affected talker similarity, and whether this effect was different across the two
language background groups. Mapping vowels and talkers to a perceptual space may reveal
relationships among them that may not be readily apparent from raw discrimination data. In
particular, non-native listeners are known to resort to perceptual strategies that are different
from those of native listeners, even when identification or categorization accuracy is near or
at native levels (e.g., Fox et al., 1995). When faced with unfamiliar Korean vowels or difficult
contrasts, English listeners may resolve talker discrimination successfully through more general
auditory mechanisms, while native Korean listeners may exploit their knowledge of the language
to solve the problem. If that is the case, we may observe differences in the MDS models for each
of the groups.
Additionally, by measuring correlations between dimensions in the MDS maps and acoustic
properties of the vowel stimuli, it is possible to draw inferences about specifically which acoustic
properties of the stimuli are relevant to the listeners, and whether the familiarity manipulation
affected which properties listeners attended to when discriminating talkers. We expected to
find differences in the types of properties that native Korean and English listeners employed
to discriminate talkers in vowels, particularly across those vowels that are most unfamiliar
to the English listeners. Speculatively, because F1-F3 simultaneously encode cues to vowel
and talker identity, they may be of less utility to naïve listeners who lack the language-specific
knowledge required to distinguish the two types of cues. Therefore, they may resort to exploiting
other properties that serve less linguistically important functions, such as F4-F5 and f0, where
variability across tokens may more readily be attributed to talker differences. On the other
hand, we expect the native Korean listeners to use a fuller range of stimulus properties, including
those that simultaneously convey vowel and talker cues in a language-specific manner.
3.1.1 Methods
3.1.1.1 Participants
Participants were drawn from two a priori groups: monolingual speakers of American English
and native Korean speakers. All participants were naïve as to the purpose of the study and had
not participated in any other of the experiments in our laboratory. The study was approved by
the USC Institutional Review Board.
A total of 69 native English speakers completed the experiment. Of them, 27 were recruited
from the USC undergraduate community and received course credit for their voluntary
participation. The other 42 were recruited through Prolific (www.prolific.co), an online platform
for participant recruitment, and were compensated for their time at a rate of $5 per 20
minutes. As indicated by self-report, all native English participants were raised monolingual
in the United States and had no history of hearing, speech or language problems. None of the
listeners reported any experience with Korean.
A total of 30 native Korean speakers completed the experiment. Of them, 7 were recruited
through Prolific and 23 were recruited through a Korean language online forum and by
word-of-mouth. All were compensated for their time at a rate of $5 per 20 minutes. As indicated
by self-report, they were raised monolingual in Korea and had no history of hearing, speech or
language problems.
3.1.1.2 Stimuli
The stimuli consisted of the syllables /hi hu hɯ/ recorded by four female native Korean speakers.
The consonant /h/ was selected because of its relatively neutral influence on the adjacent vowels.
The three vowels were selected because of their different alignment to the sound system of
English. The Korean vowels /i/ and /u/ are acoustically similar to their English counterparts.
On the other hand, Korean /ɯ/, a high back unrounded vowel, is typologically rare and has no
straightforward English counterpart. In terms of contrast pairs, /i/ and /u/ assimilate to the two
corresponding English categories (two-category assimilation in PAM terms). On the other hand,
/u/ and /ɯ/ assimilate to a single English category /u/, with [ɯ] presumably being perceived as
a worse fit than [u] (category-goodness assimilation).
The stimuli were recorded by four female native Korean speakers from Seoul, all
undergraduate students at the University of Southern California. The speakers were recorded individually
in a noise-attenuating booth at a sampling rate of 44,100 Hz at 16 bits per sample. They read
out loud the syllables, which were presented in Korean orthography one by one and in random order on a
computer screen. Each speaker recorded 20 repetitions of each syllable, for a total of 60 syllables
per speaker and a grand total of 240.
The recordings were automatically segmented, labelled and amplitude normalized using the
root-mean-square method. A female native Korean speaker from Seoul listened to all 240
stimuli and discarded 16 of them due to recording error or unnatural pronunciation. Finally, 10
repetitions of each syllable per speaker were selected at random, for a total of 30 syllables per
speaker and a grand total of 120 that were used in the experiment.
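As an illustration of root-mean-square amplitude normalization, the following minimal sketch rescales each recording so that all share the same RMS level. It is not the script used for the stimuli: the function names and the target level of 0.1 are hypothetical, and the thesis does not report the level that was used.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def normalize_rms(samples, target_rms=0.1):
    """Scale samples so that their RMS amplitude equals target_rms.

    target_rms is a hypothetical value chosen for illustration only.
    """
    current = rms(samples)
    if current == 0:
        return list(samples)  # silence cannot be rescaled
    gain = target_rms / current
    return [s * gain for s in samples]


# Two toy 220 Hz tones (0.1 s at 44,100 Hz) recorded at very different levels.
quiet = [0.01 * math.sin(2 * math.pi * 220 * t / 44100) for t in range(4410)]
loud = [0.50 * math.sin(2 * math.pi * 220 * t / 44100) for t in range(4410)]

quiet_norm = normalize_rms(quiet)
loud_norm = normalize_rms(loud)
```

After normalization both toy signals have the same RMS amplitude, so level differences between recordings no longer distinguish them.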
3.1.1.3 Procedure
Data collection was conducted remotely using Gorilla (www.gorilla.sc), an online experiment
builder (Anwyl-Irvine et al., 2020). Participants were asked to wear headphones and to complete
the experiment while in a quiet location. To ensure participant compliance, we employed the
headphone screening method developed by Woods et al. (2017). The screening consists of a
simple three-interval alternative forced-choice (3-AFC) intensity-discrimination task. For each
of six trials, participants listen to three pure tones of equal frequency and duration, one of
which is of lower amplitude than the other two, and are asked to identify the lower intensity
tone. Crucially, one of the two equal-intensity tones is presented with a phase difference of 180°
between stereo channels, so that the two channels heavily attenuate each other due to phase cancellation when
played in free field, but not over headphones. This makes the intensity-discrimination task
very difficult to complete correctly when not wearing headphones. Listeners had to identify the
correct interval in at least five of the six trials to pass the headphone screening and proceed to
the main experiment (chance performance in a 3-AFC is two correct trials on average).
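To make the pass criterion concrete, the sketch below computes how likely a listener would be to pass the screening by pure guessing. It assumes independent trials with a 1/3 chance of a correct guess on each; the function name is illustrative and not taken from any library.

```python
from math import comb


def prob_at_least(k, n, p):
    """Probability of k or more successes in n independent trials (binomial)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_trials = 6       # screening trials
p_guess = 1 / 3    # chance of guessing the correct interval in a 3-AFC task

expected_correct = n_trials * p_guess                     # 2 trials on average
p_pass_by_guessing = prob_at_least(5, n_trials, p_guess)  # 13/729, about 0.018
```

Under these assumptions, fewer than 2% of pure guessers would reach the threshold of five correct trials.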
The design of the main experiment followed an AX same-different paradigm, where listeners
would hear two stimuli in sequence, and had to say whether the first (A) and second (X) were
produced by the same speaker or by different speakers. Participants were asked to make their
selection by using the S(ame) or D(ifferent) keys on their keyboard as quickly as possible but
without sacrificing accuracy. The stimuli were presented in three blocks, corresponding to each
of the three syllable conditions /hi hu hɯ/. In each block, a given speaker faced every other
speaker 10 times, each time using a different token of the 10 available from each speaker. Of
those 10 times, 5 were in the A position (e.g. Jinmyung vs. Jiwon) and 5 in the X position
(Jiwon vs. Jinmyung). In total, there were 180 of these ‘different’ trials (6 speaker pairs x 10
face-offs x 3 blocks). In addition, each talker also faced herself 15 times, for a total of 180 ‘same’
trials (4 speakers x 15 face-offs x 3 blocks). The unique tokens available for each speaker were
repeated, but never in the same trial. There was a grand total of 360 trials. The order of
presentation of the blocks was counterbalanced across listeners, and the trials were randomized
within each block. The two stimuli in each trial were separated by an inter-stimulus interval of
250 ms, and trials were separated by 1 s. Participants were allowed to take a break for as long
as they wanted between blocks.
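The trial counts above can be checked with a short enumeration. This is only a sketch of the design logic: the speaker labels are hypothetical placeholders, and the assignment of individual tokens to trials, the randomization within blocks and the counterbalancing of blocks are all left out.

```python
from itertools import combinations

speakers = ["S1", "S2", "S3", "S4"]  # hypothetical labels for the four talkers
blocks = ["hi", "hu", "hɯ"]          # one block per syllable condition
FACE_OFFS_PER_PAIR = 10              # 'different' trials per speaker pair per block
SAME_PER_SPEAKER = 15                # 'same' trials per speaker per block


def build_trials():
    """Return (block, A speaker, X speaker, trial type) tuples for the full design."""
    trials = []
    for block in blocks:
        for a, b in combinations(speakers, 2):
            for i in range(FACE_OFFS_PER_PAIR):
                # Half of the face-offs put each speaker in the A position.
                first, second = (a, b) if i < FACE_OFFS_PER_PAIR // 2 else (b, a)
                trials.append((block, first, second, "different"))
        for s in speakers:
            for _ in range(SAME_PER_SPEAKER):
                trials.append((block, s, s, "same"))
    return trials


trials = build_trials()
n_different = sum(1 for t in trials if t[3] == "different")
n_same = sum(1 for t in trials if t[3] == "same")
```

Four speakers yield six unordered pairs, so the enumeration gives 6 x 10 x 3 = 180 'different' trials and 4 x 15 x 3 = 180 'same' trials, matching the totals reported above.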
3.1.1.4 Statistics
All statistical analyses were conducted in Python. To minimize the potential effects of
outliers or non-normality, robust statistical techniques and measures were adopted (see Wilcox,
2016) using the Hypothesize library (Campopiano and Wilcox, 2020). Planned linear contrasts
were performed using a percentile bootstrap analysis based on the 20% trimmed mean.
Bootstrap analyses were chosen because they avoid the assumptions of normality and equal
variance, violations of which can greatly reduce power when using traditional ANOVAs and t-tests. Rather
than performing tests on theoretical normal distributions that may not accurately reflect the
observed data, they generate a bootstrap distribution by repeatedly sampling with replacement
from the original dataset. Specifically, to test for differences between two groups in each of the
planned linear comparisons, each group of observations was resampled with replacement 4,000 times, computing
the difference in trimmed means between the two groups each time. This led to a bootstrap
distribution of the estimate of the difference in trimmed means between the two groups. At the
level of significance used in this study, α = 0.05, the null hypothesis that the difference between
the two groups is zero was rejected if the 95% confidence interval of the estimated difference
between the two groups did not contain zero.
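The procedure just described can be sketched in a few lines of standard-library Python. This is an illustrative reimplementation for two independent groups, not the Hypothesize library's code; the function names, the seed and the toy data are assumptions made for the example.

```python
import random
from statistics import fmean


def trimmed_mean(values, proportion=0.2):
    """Mean after dropping the lowest and highest `proportion` of observations."""
    ordered = sorted(values)
    g = int(proportion * len(ordered))  # number of values trimmed from each tail
    return fmean(ordered[g:len(ordered) - g])


def percentile_bootstrap_diff(x, y, n_boot=4000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the difference in 20% trimmed means.

    Returns (estimate, (lower, upper), reject), where reject is True when
    the confidence interval does not contain zero.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        boot_x = rng.choices(x, k=len(x))  # resample each group with replacement
        boot_y = rng.choices(y, k=len(y))
        diffs.append(trimmed_mean(boot_x) - trimmed_mean(boot_y))
    diffs.sort()
    cut = round(alpha / 2 * n_boot)        # 100 when n_boot = 4000, alpha = 0.05
    lower, upper = diffs[cut], diffs[n_boot - cut - 1]
    estimate = trimmed_mean(x) - trimmed_mean(y)
    reject = not (lower <= 0.0 <= upper)
    return estimate, (lower, upper), reject


# Toy data (not from the thesis): two groups whose centers differ by 1.0.
group_a = [2.0 + 0.1 * i for i in range(30)]
group_b = [1.0 + 0.1 * i for i in range(30)]
estimate, (lower, upper), reject = percentile_bootstrap_diff(group_a, group_b)
```

For paired comparisons, such as the within-group vowel contrasts, the same idea would be applied to resampled difference scores rather than to two independently resampled groups.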
The trimmed mean (of the same family of trimmed estimators as the median) was chosen
as a measure of central tendency in the analyses because it has been shown to be less sensitive
to outliers than the mean and to maintain high power when testing from both normal and
non-normal distributions. The 20% cutoff, which removes the lowest 20% and the highest 20% of
observations, has proven to be a good default in most situations (Wilcox and Keselman, 2003).
When needed to account for multiple comparisons, the familywise Type I error rate was
controlled at α = 0.05 using Rom’s method, an improvement on the Bonferroni correction
(Rom, 1990).
Figure 3.1: Discriminability results of Experiment 4, by participant group and aggregated by vowel condition.
3.1.2 Results
3.1.2.1 Talker discriminability by vowel
For each participant in each vowel condition, we calculated d-prime scores, a standardized
measure of sensitivity or discriminability derived from signal detection theory that is designed to
be unaffected by an individual’s response biases (Macmillan and Creelman, 2005). Figure 3.1
summarizes the main results of the experiment.
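As a rough illustration of how a d-prime score is obtained from response counts, the sketch below uses the basic signal detection formula d' = z(hit rate) − z(false-alarm rate). Several choices here are assumptions rather than details reported in the thesis: 'different' responses on different-talker trials are treated as hits, 'different' responses on same-talker trials as false alarms, and a log-linear correction (adding 0.5 to each count) keeps rates away from 0 and 1. The counts are invented for the example.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """d' = z(hit rate) - z(false-alarm rate), with a log-linear correction.

    The correction (add 0.5 to each count, 1 to each total) is an assumption
    made here so that perfect scores do not produce infinite z-values.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)


# Hypothetical listener in one vowel block (60 'different' and 60 'same' trials).
sensitive_listener = d_prime(hits=54, misses=6, false_alarms=9, correct_rejections=51)
guessing_listener = d_prime(hits=30, misses=30, false_alarms=30, correct_rejections=30)
```

A listener who responds 'different' equally often on both trial types gets a d-prime of zero regardless of overall bias, which is what makes the measure independent of response bias.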
Between-groups results. The contrasts of interest were the differences in trimmed means
of d-prime scores between the participant groups, overall as well as by each vowel condition.
Overall, the English listeners showed better talker discrimination abilities than the Korean
listeners ($\hat{\psi} = 0.452$, $p < 0.001$, 95% C.I. = [0.210, 0.684]). This main effect was driven specifically
by increased discriminability in the /hi/ ($\hat{\psi} = 0.543$, $p = 0.009$, 95% C.I. = [0.147, 0.931]) and /hu/
($\hat{\psi} = 0.498$, $p = 0.011$, 95% C.I. = [0.143, 0.854]) conditions. Both groups performed equally in
the /hɯ/ condition ($\hat{\psi} = 0.311$, $p = 0.174$, 95% C.I. = [−0.125, 0.731]).
Within-groups results. Within the participant groups, the contrasts of interest were the
differences in trimmed means of (paired) d-prime scores between vowel conditions. Talkers in
the vowel condition /hi/ were more discriminable than in /hu/ for English listeners ($\hat{\psi} = 0.286$,
$p = 0.013$), but were equally discriminable across both vowel conditions for Korean listeners
($\hat{\psi} = 0.235$, $p = 0.128$). The /hɯ/ condition was more discriminable compared to the other two
conditions /hi/ and /hu/, respectively, for both English listeners ($\hat{\psi} = 0.289$, $p = 0.018$; $\hat{\psi} = 0.578$,
$p < 0.001$) and Korean listeners ($\hat{\psi} = 0.514$, $p = 0.005$; $\hat{\psi} = 0.723$, $p = 0.001$).
3.1.2.2 Multidimensional scaling
We employed multidimensional scaling (MDS) to model the pattern of talker discriminability in
the experiment as distances in perceptual space. We used the metric MDS implementation in
scikit-learn (Pedregosa et al., 2011) to compute two-dimensional solutions for each participant
using their individual accuracy scores as the input dissimilarity matrix. Note that, while studies
have typically used the average dissimilarity matrix of all participants as input to a single MDS
model, we follow Howson and Monahan (2020) in instead calculating per-subject MDS models
that account for participant variance and can be submitted to inferential tests. This analysis
results in a set of 4 x- and y-coordinates per participant, where each point represents the location
of a talker in relation to the others in an abstract perceptual space. For each participant in each
vowel condition, we calculated the area of the polygon formed by the set of coordinates as a
measure of the size of the participant’s perceptual space for the 4 talkers. Smaller areas result
from crowded perceptual spaces where the talkers are closer to each other (i.e., more frequently
confused), while larger areas result from perceptual spaces where the talkers are further apart
(less frequently confused). Figure 3.2 summarizes the results.
Figure 3.2: Results of the multidimensional scaling analysis of Experiment 4. Datapoints represent by-participant
sizes of the perceptual space for talkers.
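The area measure can be illustrated with a small standard-library sketch that takes four two-dimensional coordinates (as an MDS solution would provide) and returns the area they enclose. The thesis does not specify how the polygon's vertex order was determined, so the sketch assumes the convex hull of the points; the function names and example coordinates are hypothetical.

```python
def cross(o, a, b):
    """z-component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Convex hull in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Area of a simple polygon from ordered vertices (shoelace formula)."""
    n = len(vertices)
    if n < 3:
        return 0.0
    twice_area = sum(
        vertices[i][0] * vertices[(i + 1) % n][1]
        - vertices[(i + 1) % n][0] * vertices[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2.0


def perceptual_area(coords):
    """Area enclosed by the talkers' coordinates (convex hull is an assumption)."""
    return polygon_area(convex_hull(coords))


# Hypothetical MDS coordinates for four talkers, one (x, y) tuple per talker.
spread_out = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
crowded = [(0.0, 0.0), (0.5, 0.0), (0.5, 0.5), (0.0, 0.5)]
```

Here `perceptual_area(spread_out)` is larger than `perceptual_area(crowded)`, mirroring the interpretation above that larger areas correspond to talkers who are less frequently confused.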
Between-groups results. The contrasts of interest were the differences in trimmed means
of calculated perceptual areas between the participant groups, overall as well as by each vowel
condition. Overall, the perceptual spaces of English listeners were significantly smaller (about
16%) than those of Korean listeners ($\hat{\psi} = -0.036$, $p < 0.001$). This main effect appeared to be
driven in particular by a reduced perceptual space (by about 50%) in the /hi/ condition for the
English group relative to the Korean group ($\hat{\psi} = -0.042$, $p < 0.005$). There were no differences
between the groups in the /hu/ or /hɯ/ conditions ($p = 0.254$ and $p = 0.129$, respectively).
Within-groups results. Within the participant groups, the contrasts of interest were the
differences in trimmed means of (paired) calculated perceptual areas between vowel conditions.
No significant differences were found between any of the vowel conditions ($p > 0.05$).
Correlations to acoustic variables. To investigate whether vowel familiarity affected
which properties of the stimuli listeners attended to when discriminating talkers, we measured
the correlations between the x- and y-coordinates that made up the MDS maps and acoustic
properties of interest in the stimuli. Acoustic variables of interest were calculated automatically
(with manual correction) for all stimuli using VoiceSauce, an application implemented in
MATLAB that provides automated voice measurements (Shue et al., 2009). The variables
calculated included: fundamental frequency (f0), formant frequency (F1-F5), formant bandwidth
(B1-B5), uncorrected harmonic (H1-...) and formant amplitude (A1-...), corrected (c) harmonic
and formant amplitude, cepstral peak prominence (CPP), root mean square energy,
harmonic-to-noise ratios (HNR), total length, and length of voicing. Correlations were measured using
the percentage bend correlation coefficient, a robust measure of association (Wilcox, 1994).
The correlation heat map in Figure 3.3 visualizes the results. Overall, the results appear
nearly identical for both language background groups, which argues against the idea that lack of
familiarity with at least some of the vowels may have caused the English listeners to rely on
different acoustic features to discriminate between talkers as compared to the native Korean
listeners. Instead, both groups appear to have relied on the same set of acoustic parameters.
The x dimension is strongly correlated with the cepstral peak prominence, the harmonics-to-noise
ratios and the root mean square energy of the stimuli. The cepstral peak prominence (CPP) is a
measure of voice quality related to modality and breathiness; high values correlate with modal
phonation and lower values with breathy phonation (Fraile and Godino-Llorente, 2014).
Harmonics-to-noise ratios (HNR; calculated separately for several frequency bands) measure how
much of the voice energy is periodic versus aperiodic; normal voices tend to have high HNRs,
while pathological or whispery voices tend to have lower values (Kreiman and Sidtis, 2011). Root
mean square energy is a measure of loudness. The y dimension, on the other hand, seems to
correlate mostly with F5.
Figure 3.3: Correlations between MDS dimensions and acoustic variables of the stimuli, by language background group.
3.2 Experiment 5
Experiment 5 was designed to test how vowel familiarity affected the ability of listeners to
segregate the linguistic and indexical dimensions of speech. As in previous experiments, we
manipulated familiarity by using a selection of Korean sounds that differed in phonological
alignment to English sound categories. We tested the talker discrimination abilities of English
listeners and native Korean listeners under a Garner paradigm (Garner, 1974). The Garner
paradigm (also known as the speeded classification paradigm) is a selective attention task used to
test the interdependence of dimensions in a stimulus. In our experiment, subjects classified
two-dimensional stimuli by attending to only one of the dimensions, while the unattended dimension
either varied randomly or stayed constant. The two dimensions of the stimuli were vowel identity
(a linguistic dimension) and talker identity (an indexical dimension). If random variation in
the unattended dimension interferes with the processing of the attended dimension, the two
dimensions are understood to be integral and processed holistically. On the other hand, if the
random variation in the unattended dimension does not have an effect on the performance in
the attended dimension, the two dimensions can be said to be separable.
Because talker identity is encoded in the same formant-frequency characteristics that encode
vowel identity, we expected to find that random variation along the unattended vowel identity
dimension would affect listeners’ ability to discriminate talker identity. This is particularly likely
for the English listeners when listening to vowels that are not familiar to them, since they may
lack the language-specific knowledge needed to know whether a perceived acoustic contrast is
relevant linguistically or indexically. That is, we predict that patterns of assimilation of
non-native vowels to native categories may affect the English listeners’ ability to discriminate the
talkers producing those sounds.
3.2.0.1 Participants
Participants were drawn from two a priori groups: monolingual speakers of American English
and native Korean speakers. All participants were naïve as to the purpose of the study and had
not participated in any of the other experiments in our laboratory. The study was approved by
the USC Institutional Review Board.
A total of 69 native English speakers completed the experiment. Of them, 27 were recruited
from the USC undergraduate community and received course credit for their voluntary
participation. The other 42 were recruited through Prolific (www.prolific.co), an online
participant recruitment platform, and were compensated for their time at a rate of $5 per 20
minutes. As indicated by self-report, all native English participants were raised monolingual
in the United States and had no history of hearing, speech or language problems. None of the
listeners reported any experience with Korean.
A total of 34 native Korean speakers completed the experiment. Of them, 7 were recruited
through Prolific and 27 were recruited through a Korean-language online forum and by
word of mouth. All were compensated for their time at a rate of $5 per 20 minutes. As indicated by
self-report, they were raised monolingual in Korea and had no history of hearing, speech or language
problems.
3.2.0.2 Stimuli
The stimuli consisted of the same syllables /hi hu hɯ/ recorded for Experiment 4, but only
the stimuli of two of the Korean speakers were selected for this experiment. Further, only
10 repetitions of each syllable per speaker were selected at random, for a total of 30 syllable
recordings per speaker and a grand total of 60 that were used in the experiment.
To recapitulate, the consonant /h/ was selected because of its relatively neutral influence on
the adjacent vowels. The three vowels were selected because of their different alignment to the
sound system of English. The Korean vowels /i/ and /u/ are acoustically similar to their English
counterparts. On the other hand, Korean /ɯ/, a high back unrounded vowel, is typologically rare
and has no straightforward English counterpart. In terms of contrast pairs, /i/ and /u/ assimilate
to the two corresponding English categories (two-category assimilation in PAM terms). On the
other hand, /u/ and /ɯ/ assimilate to a single English category /u/, with [ɯ] presumably being
perceived as a worse fit than [u] (category-goodness assimilation).
3.2.0.3 Procedure
Data collection was conducted remotely using Gorilla (www.gorilla.sc), an online experiment
builder (Anwyl-Irvine et al., 2020). Participants were asked to wear headphones and to
complete the experiment in a quiet location. To ensure participant compliance, we employed
the headphone screening method developed by Woods et al. (2017), which was described in the
previous experiment.
The design of the experiment followed a Garner paradigm, in which subjects were asked
to classify, as fast as possible, variation from one dimension into two possible categories while
ignoring variation in the other dimension. Listeners heard two stimuli in an AX sequence, and
had to classify them as being produced by either the same or different talkers. That is, the
attended dimension was talker identity, while subjects were told to ignore the vowel identity
dimension. Participants were asked to make their selection by using the S(ame) or D(ifferent)
keys on their keyboard as quickly as possible but without sacrificing accuracy.
Stimulus set conditions include a control set and an orthogonal set. In the control condition,
there is no variation in the unattended vowel identity dimension. That means that both vowels
in the AX sequence are the same, regardless of whether the talkers in the attended dimension
are the same or different. In the orthogonal condition, there is variation in both dimensions
and subjects are exposed to all possible combinations of vowel and talker identities.
Differences between the control and orthogonal conditions, measured as discrimination accuracy,
are used to establish whether the vowel dimension influences the talker dimension. If vowel
identity is processed independently of talker identity, variation in the vowel dimension should
have no effect on talker classification, and we should not find differences between the control
and orthogonal conditions.
The crucial manipulation in the experiment is which vowel pair is used in the unattended
dimension. We used the pairs /i/-/u/ and /u/-/ɯ/, each with a control and an orthogonal condition.
Differences in the amount of interference (orthogonal – control) between the two vowel pairs
would indicate differential effects of vowel contrast status on talker discrimination. We predict
that English listeners will show larger interference of vowel identity on talker discrimination
for the /u/-/ɯ/ pair, as that non-native contrast assimilates to one native category in a
category-goodness manner. We expect Korean listeners to perform equally on both vowel pairs.
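The interference measure described above can be sketched in a few lines of Python. The function names, the dictionary layout and the scores are hypothetical and serve only to illustrate the arithmetic. The sign is chosen here so that a positive value means a drop in sensitivity from the control to the orthogonal block; the text writes the contrast as orthogonal – control, which differs only in sign.

```python
def garner_interference(control_score, orthogonal_score):
    """Drop in sensitivity from the control to the orthogonal block.

    Positive values mean that variation in the unattended dimension
    hurt performance on the attended dimension.
    """
    return control_score - orthogonal_score


def interference_by_pair(scores):
    """Map each vowel pair to its interference.

    `scores` maps a vowel pair label to a dict with the keys
    "control" and "orthogonal" (a hypothetical layout).
    """
    return {
        pair: garner_interference(blocks["control"], blocks["orthogonal"])
        for pair, blocks in scores.items()
    }


# Hypothetical sensitivity scores for a single listener.
listener = {
    "i-u": {"control": 2.5, "orthogonal": 2.0},
    "u-ɯ": {"control": 2.5, "orthogonal": 1.5},
}
drops = interference_by_pair(listener)
# Differential interference: how much larger the drop is for /u/-/ɯ/.
extra_drop = drops["u-ɯ"] - drops["i-u"]
```

Under the prediction stated above, `extra_drop` would be positive for English listeners and close to zero for Korean listeners.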
The stimuli were presented in two sets of two blocks each. Each of the sets corresponded to
one of the vowel pair conditions (/i/-/u/ or /u/-/ɯ/) and included a control block and an orthogonal
block. In each block, the two speakers faced each other 40 times. Of those, 20 were in the A
position (e.g. Jinmyung vs. Jiwon) and 20 in the X position (Jiwon vs. Jinmyung). In
total, there were 160 of these ‘different’ trials (2 talkers × 20 face-offs × 4 blocks). In addition,
each talker also faced herself 40 times in each block, for a total of 160 ‘same’ trials. In the
orthogonal conditions, where vowel identity could vary, vowels faced each other in the same
configuration (i.e., /i/-/u/ faced each other in 40 ‘different’ trials and themselves in 40 ‘same’
trials per orthogonal block). The control blocks only contained ‘same’ trials with respect to
vowel identity, where each vowel faced itself 40 times. The order of presentation of the sets
was counterbalanced across listeners, and the trials were randomized within each block. The
two stimuli in each trial were separated by an inter-stimulus interval of 250 ms, and trials were
separated by 1 s. Participants were allowed to take a break for as long as they wanted between
blocks.
3.2.0.4 Statistics
The statistical methods were the same as in Experiment 4.
Figure 3.4: Discriminability results of Experiment 5, by participant group and aggregated by vowel pair and block.
3.2.1 Results
3.2.1.1 Talker discriminability by vowel
For each participant in each vowel condition, we calculated d-prime scores, a standardized
measure of sensitivity or discriminability derived from signal detection theory that is designed to
be unaffected by an individual’s response biases (Macmillan and Creelman, 2005). Figure 3.4
summarizes the main results of the experiment.
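A minimal Python sketch of a d-prime computation follows. It uses the basic signal detection formula, d′ = z(hit rate) − z(false-alarm rate), and treats a "different" response on a different-talker trial as a hit, which is an assumption. The log-linear correction that keeps the rates away from 0 and 1 is likewise an assumption, and same-different designs admit other decision models that yield different values. The function name and the counts are hypothetical, and the procedure actually used in the analysis is not specified here.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate).

    A log-linear correction (add 0.5 to each count) keeps both rates
    strictly between 0 and 1 so that the z-transform stays finite.
    This correction is an assumption, not taken from the study.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    false_alarm_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(false_alarm_rate)


# Hypothetical counts for one listener in one block:
# 30 of 40 'different' trials detected, 10 of 40 'same' trials misjudged.
example = d_prime(hits=30, misses=10, false_alarms=10, correct_rejections=30)
```

Because the hit and false-alarm rates enter as a difference of z-scores, a listener who simply favours one response key shifts both rates together, which leaves d′ largely unchanged.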
Within-groups results Within the language background groups, the contrasts of interest
were the differences in trimmed means of paired d-prime scores between the control and
orthogonal blocks for each of the vowel pair conditions. This contrast was meant to test the ability
of listeners to separately process the talker identity dimension of the stimuli in the presence
of random variation in the vowel identity dimension, and whether varying familiarity with the
different vowel contrasts affected that ability.
Within the native Korean listener group, the differences between the control and orthogonal
blocks were approximately equal for both vowel pair conditions. The differences between control
and orthogonal blocks in the /i/-/u/ vowel pair condition were statistically significant
($\hat{\psi} = 0.586$, $p = 0.0016$), as were the differences in the /u/-/ɯ/ condition
($\hat{\psi} = 0.569$, $p = 0.0150$). These results show that the variation in the unattended
dimension did affect the listeners’ ability to discriminate talkers on the attended dimension,
and that it did so by reducing sensitivity by approximately 0.6 d-prime units regardless of
vowel pair condition.
Within the English listener group, the differences between the control and orthogonal blocks
were statistically significant in both vowel pair conditions, but differed considerably in magnitude
between them. The difference between control and orthogonal blocks was smaller in the
/i/-/u/ vowel pair condition ($\hat{\psi} = 0.587$, $p < 0.001$) than in the /u/-/ɯ/ condition
($\hat{\psi} = 1.139$, $p < 0.001$). These results mean that in the /i/-/u/ condition, vowel identity variation
in the unattended dimension affected the English listeners approximately to the same degree
that it affected the Korean listeners (about 0.6 d-prime units). However, the effect in the
/u/-/ɯ/ condition, a vowel contrast that is unfamiliar to the English listeners, was almost twice as large
(approximately 1.1 d-prime units).
3.3 Discussion
The experiments in this chapter were intended to test the effects of vowel familiarity on talker
discrimination. We had predicted that discriminating talkers would be more difficult in stimuli
containing vowels that were unfamiliar to the listeners. We used a set of Korean vowels that
differed in familiarity to English listeners by virtue of how they assimilated to their native
English sound system, and compared the results to those of a group of native Korean listeners,
for whom all vowels were equally familiar. Overall results were mixed, with inconsistent results
across different task conditions and analyses.
In Experiment 4, the English listeners performed better in the AX talker discrimination task
than the native Korean listeners, without any consistent effect of vowel familiarity. Contrary to
expectations, they performed best with the unfamiliar vowel, but so did the Korean listeners,
for whom all vowels were equally familiar; this suggests that properties intrinsic to /ɯ/
may be responsible for the increased discriminability.
Despite the overall discriminability results, the multidimensional scaling (MDS) analysis
showed that the perceptual space for talkers of the English listeners may be significantly smaller
than that of the native Korean listeners, particularly in the familiar vowel condition /i/. While
the discriminability results are d-prime corrected to account for individual subject biases in
response strategies, the MDS analysis must be based on raw scores due to limitations of the
design; this may help explain the discrepancy in the results.
Correlating the results of the MDS analysis with a number of acoustic parameters of the
stimuli showed that the two language background groups did not differ in terms of the main
acoustic features they used to perform the talker discrimination task. Both the English and the
Korean listeners exploited features related to individual voice quality to discriminate between
talkers, such as the cepstral peak prominence and the harmonics-to-noise ratio, as well as the
fifth formant, which has traditionally been regarded as conveying talker information. None
of these features plays a role in conveying the linguistic identity of a vowel. This, along with
the discriminability results, suggests some level of independence of talker identity abilities from
language abilities.
The results of Experiment 5, however, may serve to refine our interpretation of the results
above. In a Garner task, both groups of listeners were affected by random vowel variation in
the unattended dimension, lowering their talker discrimination abilities relative to a control
condition where no unattended variation existed. This result suggests that listeners were not
able to fully separate the linguistic and indexical dimensions in the stimuli. Notably, this effect
appeared to be modulated by vowel familiarity. When vowel contrasts were familiar to a group,
they caused comparable drops in talker discrimination performance. In contrast, the drop in
performance nearly doubled in magnitude for the English listeners in the contrast condition that
was unfamiliar only to them. The following chapter discusses this issue in more detail.
Chapter 4
Discussion
The goal of this dissertation was to investigate the language familiarity advantage in talker
recognition. While prior work has consistently found that talker processing is facilitated by
familiarity with the language being spoken, the mechanism responsible for this effect has remained
elusive. In this work, we explored the possibility that knowledge of the phonological structure
of a language may be implicated in providing the advantage to talker processing. We did so by
examining the interaction between linguistic and indexical information in non-native speech at
the level of individual syllables, focusing on the specific contributions of consonants (Chapter 2)
and vowels (Chapter 3). We hypothesized that the discriminability of talkers in an unfamiliar
language may be affected by the kinds of first-language perceptual mechanisms that determine
the discriminability of non-native speech contrasts. Overall, the results of our experiments were
inconsistent across different testing paradigms. The rest of this chapter summarizes these results
and discusses some possible explanations for the discrepancies.
Experiments 2 and 3, in Chapter 2, tested the effect of consonant familiarity on talker
discrimination in syllables. We predicted that Korean consonants that matched the native English
sound inventory (as determined by the results of Experiment 1) would provide a familiarity
advantage in an AXB talker discrimination task. Contrary to this prediction, less familiar
non-native consonants that did not align well with the native English phonology provided a
significant and consistent advantage. These results are, however, confounded by differences in
length of the vowels following the two types of consonants. It is possible that listeners
relied on the longer duration of the vowels following the unfamiliar fortis consonants, compared
to the shorter vowels that followed the familiar aspirated consonants.
Chapter 3 used two different paradigms and additional analyses to isolate the effects of vowel
familiarity on talker discrimination, and added a group of native Korean listeners to compare
against the naïve listeners. Experiment 4, using an AX same-different talker discrimination task,
failed to reveal any consistent effect of vowel familiarity. Further, a multidimensional scaling
analysis of the discrimination data showed that the similarity spaces of naïve listeners and the
native Korean listeners were similar in size and organized around the same acoustic features.
On the other hand, Experiment 5, using a Garner interference task, found a large effect of vowel
familiarity on talker discrimination. The performance of naïve listeners was comparable to that
of native Korean listeners on two familiar Korean vowel contrasts, but dropped considerably on
an unfamiliar Korean vowel contrast that did not assimilate well to the English sound system.
In short, while consonant or vowel familiarity did not consistently affect the discriminability
of talkers when using an AXB (Experiments 2 and 3) or AX task (Experiment 4), it did have a
large effect on talker discrimination performance when using a Garner interference task
(Experiment 5). An explanation for these discrepant results may reside in the different task demands
of the AX/AXB and the Garner interference designs. In the AX and AXB tasks, listeners were
asked to discriminate talkers while the vowel category remained constant. In contrast, vowel
category could vary per talker in the Garner task. Listeners in the AX and AXB tasks may
have ignored any potential linguistic information, discriminating talkers by comparing tokens
in short-term memory in a purely auditory manner. On the other hand, the vowel identity of
the two tokens being compared in a trial of the Garner task cannot be assumed, and listeners
may need to refer to their language representations before being able to discriminate between
talkers.
In other words, the AX/AXB and Garner tasks may have engaged different stages of the
speech perception process. It is generally accepted that the pre-lexical stages of speech
processing involve at least two distinct tasks: auditory-acoustic processing and phonetic processing
(Babel and Johnson, 2010; Pisoni, 1973; Pisoni and Tash, 1974; Werker and Logan, 1985). The
first of these tasks, auditory-acoustic processing (sometimes called psycho-acoustic processing),
involves analysis of the raw acoustics of the stimuli through physiological mechanisms that are
thought to be independent of the listener’s language experience. Phonetic processing (also known
as phonemic or linguistic processing) involves more abstract categorization of the stimuli into
existing linguistic categories. Unlike auditory-acoustic processing, phonetic processing is
necessarily affected by the listener’s native language experience. Because vowel identity was held
constant between talkers in the AX and AXB experiments, it is likely that listeners were
able to ignore the linguistic aspect of the stimuli and perform the talker discrimination
task using auditory-acoustic analysis of the stimuli in short-term memory.
By contrast, the design of the Garner interference task specifically makes the linguistic
dimension of the stimuli difficult for listeners to ignore. The control blocks of the Garner task
are essentially identical to the AX and AXB tasks in that they do not vary vowel identity
between talkers. As such, the results of those blocks by themselves do not reveal any differences
in performance related to the familiarity of the vowel stimuli. As in the AX/AXB tasks, it is
likely that listeners performed the discrimination task in the control blocks using the same type
of auditory-acoustic processing without referring to their linguistic knowledge. However, the
orthogonal blocks in the Garner task vary vowel identity unpredictably. As expected of a
Garner interference task, this orthogonal manipulation of the unattended vowel identity dimension
results in a drop in performance on the attended talker identity dimension, seen across all
familiarity conditions and for both naïve and native listeners. What is notable in these results,
however, is that the drop in performance is approximately doubled for the naïve listeners in the
unfamiliar vowel contrast condition. This much larger drop in talker discrimination performance
cannot be solely attributed either to the expected effect of interference from variation in the
unattended dimension (because the other conditions caused smaller drops) or to the particular
properties of the /u/-/ɯ/ contrast (because the additional performance drop does not occur for
the native Korean listeners). Therefore, the most likely variable to explain this discrepancy is
the linguistic experience of the two different groups of listeners. Because both /i/-/u/ and /u/-/ɯ/
are familiar to the native Korean listeners, the orthogonal block causes comparable drops in
performance for both conditions. Similarly, /i/-/u/ is a familiar contrast for the English listeners,
and so the orthogonal block causes an equivalent drop in performance. However, the /u/-/ɯ/
contrast is unfamiliar for them, potentially explaining the additional drop in talker discrimination
performance.
The problem faced by listeners in the orthogonal blocks of the Garner task can be understood
as a problem of probabilistic inference under different degrees of uncertainty. Ideal observer
models (e.g., Kleinschmidt and Jaeger, 2015; Norris et al., 2016; Pajak et al., 2016) have proposed
that listeners overcome the challenge of talker variability in speech by using their rich knowledge
of the covariation between linguistic features and indexical features to make continuous
probabilistic inferences about their language input. In Bayesian terms, listeners match the properties
of input data against their stored categories (conceived as probability distributions) and select
the category with the highest posterior probability given the input data. Crucially, this type of
probabilistic knowledge of the joint distribution of linguistic features and indexical features can
be exploited for the purposes of resolving both linguistic and indexical inputs. Applying this
reasoning to our Garner data, the English listeners lack knowledge of the probability
distribution of the Korean vowel /ɯ/ that they otherwise have for the distributions of the English-like
Korean vowels /i/ and /u/. When attempting to discriminate talkers across the /i/-/u/ contrast,
English listeners could rely on their rich experience of how vowel identity features and talker
identity features covary in the equivalent English vowels (after having heard them countless
times produced by a vast number of different talkers) to make an informed probabilistic guess
about the situation. This guess would likely take into account whether the two vowels are the same
or not, facilitating the decision about whether the two talkers are the same or not. On the
other hand, English listeners are unlikely to have been exposed to the Korean /ɯ/ vowel, and
they therefore lack any knowledge of how different talkers may affect variability in the vowel.
Presented with an ambiguous situation in the /u/-/ɯ/ contrast, they will make a less reliable
probabilistic guess as to whether they experienced two different talkers or two different vowels,
or both.
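The Bayesian step described above, selecting the stored category with the highest posterior probability given the input, can be illustrated with a deliberately simplified Python sketch. It assumes a single acoustic cue and Gaussian categories; the category means, standard deviations, priors and function name are invented for illustration and do not come from the cited models or from our data.

```python
from statistics import NormalDist


def posterior(observation, categories, priors):
    """Posterior probability of each category given one observed cue value.

    `categories` maps a label to a NormalDist (the stored category);
    `priors` maps the same labels to prior probabilities.
    """
    unnormalized = {
        label: priors[label] * dist.pdf(observation)
        for label, dist in categories.items()
    }
    total = sum(unnormalized.values())
    return {label: value / total for label, value in unnormalized.items()}


# Hypothetical one-dimensional cue (for example, a formant value in Hz).
categories = {
    "i": NormalDist(mu=2500, sigma=200),
    "u": NormalDist(mu=900, sigma=200),
}
priors = {"i": 0.5, "u": 0.5}

beliefs = posterior(2300, categories, priors)
best_category = max(beliefs, key=beliefs.get)
```

In this toy setting a listener with well-estimated categories resolves the input confidently. A listener without a stored distribution for one of the vowels has no corresponding entry to evaluate, which is one way to picture the added uncertainty described for the /u/-/ɯ/ contrast.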
This lack or presence of probabilistic knowledge may have been the modulating factor in the
results of the Garner task. The drop in performance in the orthogonal blocks of those vowel
contrasts that are familiar to the listeners is the expected drop because of interference from
an unattended dimension that forces listeners to engage in linguistic processing before making
talker discrimination decisions. However, the much larger drop in the orthogonal block of the
unfamiliar vowel contrast for the English listeners could reflect the speech signal becoming more
or less separable as a function of language experience.
While these results are tentative, they do offer some preliminary support for the idea that
the language familiarity effect may have its origins in mechanisms similar to those
that determine the discriminability of non-native speech contrasts. They are also supportive of
models that situate the language familiarity effect at an abstract level of phonological processing
rather than on simple familiarity (Johnson et al., 2018). Future work may focus on developing
more specific hypotheses about talker discrimination that are directly compatible with those
made by models of naïve or non-native speech perception. For example, it may be interesting
to test how a single-category vowel assimilation (two non-native vowels that assimilate equally
well to a single native category) affects talker discrimination abilities. Naïve assimilation models
predict poor vowel discrimination, but it is unclear what is expected of talker discrimination
in such a situation. Listeners would likely have experience with both vowels and how talker
variation affects them; however, that experience would fall under a single vowel category in
their native language. It is not clear whether the different categorization would affect talker
discrimination abilities, since the underlying exemplars may be similar regardless.
Bibliography
Abercrombie, D. (1984). Elements of General Phonetics. Edinburgh University Press, Edinburgh. OCLC: 1147258124.
Anwyl-Irvine, A. L., Massonnié, J., Flitton, A., Kirkham, N., and Evershed, J. K. (2020). Gorilla in our midst: An online behavioral experiment builder. Behavior Research Methods, 52(1):388–407.
Babel, M. and Johnson, K. (2010). Accessing psycho-acoustic perception and language-specific perception with speech sounds. Laboratory Phonology, 1(1).
Bachorowski, J.-A. and Owren, M. J. (1999). Acoustic correlates of talker sex and individual talker identity are present in a short vowel segment produced in running speech. The Journal of the Acoustical Society of America, 106(2):1054–1063.
Belin, P. (2006). Voice processing in human and non-human primates. Philosophical Transactions of the Royal Society B: Biological Sciences, 361(1476):2091–2107.
Best, C. T. (1995). A direct realist view of cross-language speech perception. In Strange, W., editor, Speech Perception and Linguistic Experience: Issues in Cross Language Research. York Press, Baltimore.
Boersma, P. and Weenink, D. (2020). Praat: doing phonetics by computer.
Bregman, M. R. and Creel, S. C. (2014). Gradient language dominance affects talker learning. Cognition, 130(1):85–95.
Campopiano, A. and Wilcox, R. (2020). Hypothesize: robust statistics for Python. Journal of Open Source Software, 5(50):2241.
Creelman, C. D. (1957). Case of the unknown talker. The Journal of the Acoustical Society of America, 29(5):655.
Flege, J. E. (1995). Second Language Speech Learning: Theory, Findings, and Problems. In Strange, W., editor, Speech Perception and Linguistic Experience: Issues in Cross Language Research. York Press, Baltimore.
Flege, J. E., MacKay, I. R. A., and Meador, D. (1999). Native Italian speakers’ perception and production of English vowels. The Journal of the Acoustical Society of America, 106(5):2973–2987.
Fleming, D., Giordano, B. L., Caldara, R., and Belin, P. (2014). A language-familiarity effect for speaker discrimination without comprehension. Proceedings of the National Academy of Sciences, 111(38):13795–13798.
Fox, R. A., Flege, J. E., and Munro, M. J. (1995). The perception of English and Spanish vowels by native English and Spanish listeners: A multidimensional scaling analysis. The Journal of the Acoustical Society of America, 97(4):2540–2551.
Fraile, R. and Godino-Llorente, J. I. (2014). Cepstral peak prominence: A comprehensive analysis. Biomedical Signal Processing and Control, 14:42–54.
Garner, W. R. (1974). The Processing of Information and Structure. Erlbaum Associates, Potomac.
Goggin, J. P., Thompson, C. P., Strube, G., and Simental, L. R. (1991). The role of language familiarity in voice identification. Memory & Cognition, 19(5):448–458.
Goldinger, S. D. (1996). Words and voices: Episodic traces in spoken word identification and recognition memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 22(5):1166–1183.
Goldinger, S. D. (1998). Echoes of echoes? An episodic theory of lexical access. Psychological Review, 105(2):251–279.
Goldinger, S. D., Pisoni, D. B., and Logan, J. S. (1991). On the nature of talker variability effects on recall of spoken word lists. Journal of Experimental Psychology: Learning, Memory, and Cognition, 17(1):152–162.
Green, K. P., Tomiak, G. R., and Kuhl, P. K. (1997). The encoding of rate and talker information during phonetic perception. Perception & Psychophysics, 59(5):675–692.
Han, M. and Weitzman, R. (1970). Acoustic Features of Korean /P, T, K/, /p, t, k/ and /pʰ, tʰ, kʰ/. Phonetica, 22(2):112–128.
Howson, P. J. and Monahan, P. J. (2020). A method for comparing perceptual distances and areas with multidimensional scaling. MethodsX, 7:100790.
Iverson, P. and Kuhl, P. K. (1995). Mapping the perceptual magnet effect for speech using signal detection theory and multidimensional scaling. The Journal of the Acoustical Society of America, 97(1):553–562.
Iverson, P., Kuhl, P. K., Akahane-Yamada, R., Diesch, E., Tohkura, Y., Kettermann, A., and Siebert, C. (2003). A perceptual interference account of acquisition difficulties for non-native phonemes. Cognition, 87(1):B47–B57.
Johnson, E. K., Bruggeman, L., and Cutler, A. (2018). Abstraction and the (Misnamed) language familiarity effect. Cognitive Science, 42(2):633–645.
Johnson, E. K., Westrek, E., Nazzi, T., and Cutler, A. (2011). Infant ability to tell voices apart rests on language experience: Infant voice discernment. Developmental Science, 14(5):1002–1011.
Johnson, K. (1997). Speech perception without speaker normalization: An exemplar model. In Johnson, K. and Mullennix, J. W., editors, Talker variability in speech processing, pages 145–165. Academic Press, San Diego, 1st edition.
Johnson, K. (2005). Sp eak er Normalization in Sp eec h P erception. In Pisoni, D. B. and Re-
mez, R. E., editors, The Handb o ok of Sp e e ch Per c eption , Blac kw ell handb o oks in linguistics.
Blac kw ell Pub, Malden, MA.
Kagano vic h, N., F rancis, A. L., and Melara, R. D. (2006). Electroph ysiological evidence for
early in teraction b et w een talk er and linguistic information during sp eec h p erception. Br ain
R ese ar ch , 1114(1):161–172.
Kaga y a, R. (1974). A fib erscopic and acoustic study of the K orean stops, affricates and fricativ es.
Journal of Phonetics , 2(2):161–180.
Kang, Y. (2014). V oice Onset Time merger and dev elopmen t of tonal con trast in Seoul K orean
stops: A corpus study . Journal of Phonetics , 45:76–90.
Kisilevsky , B. S., Hains, S. M., Lee, K., Xie, X., Huang, H., Y e, H. H., Zhang, K., and W ang, Z.
(2003). Effects of exp erience on fetal v oice recognition. Psycholo gic al Scienc e , 14(3):220–224.
Klatt, D. H. (1989). Review of selected mo dels of sp eec h p erception. In Marslen-Wilson, W.,
editor, L exic al R epr esentation and Pr o c ess , pages 169–226. MIT Press, Cam bridge, Mass.
Kleinsc hmidt, D. F. and Jaeger, T. F. (2015). Robust sp eec h p erception: Recognize the familiar,
generalize to the similar, and adapt to the no v el. Psycholo gic al R eview , 122(2):148–20 3.
Kreiman, J. and Sidtis, D. (2011). F oundations of voic e studies: an inter disciplinary appr o ach
to voic e pr o duction and p er c eption . Wiley-Blac kw ell, Oxford, UK.
Kuhl, P . K., Stev ens, E., Ha y ashi, A., Deguc hi, T., Kiritani, S., and Iv erson, P . (2006). Infan ts
sho w a facilitation effect for nativ e language phonetic p erception b et w een 6 and 12 mon ths.
Developmental Scienc e , 9(2):F13–F21.
K öster, O. and Sc hiller, N. O. (1997). Differen t influences of the nativ e language of a listener
on sp eak er recognition. International Journal of Sp e e ch L anguage and the L aw , 4(1):18–28.
Lacerda, F. (1998). An exemplar‐based accoun t of emergen t phonetic categories. The Journal
of the A c oustic al So ciety of A meric a , 103(5):2980–2981.
Lib erman, A. M., Harris, K. S., Hoffman, H. S., a nd Griffith, B. C. (1957). The discrimination
of sp eec h sounds within and across phoneme b oundaries. Journal of Exp erimental Psycholo gy ,
54(5):358–368.
Macmillan, N. A. and Creelman, C. D. (2005). Dete ction The ory: A U ser’s Guide . La wrence
Erlbaum Asso ciates, Mah w ah, N.J, 2nd ed edition.
Mair, P . and Wilco x, R. (2020). Robust statistical metho ds in R using the WRS2 pac kage.
Behavior R ese ar ch Metho ds , 52(2):464–488.
Ma y o, L. H., Floren tine, M., and Buus, S. (1997). Age of second- language acquisition and
p erception of sp eec h in noise. Journal of Sp e e ch, L anguage, and He aring R ese ar ch , 40(3):686–
693.
Meador, D., Flege, J. E., and Mac ka y , I. R. A. (2000). F actors affecting the recognition of w ords
in a second language. Bilingualism: L anguage and Co gnition , 3(1):55–67.
64
Mullennix, J. W. and Pisoni, D. B. (1990). Stim ulus v ariabilit y and pro cessing dep endencies in
sp eec h p erception. Per c eption & Psychophysics , 47(4):379–390.
Mullennix, J. W., Pisoni, D. B., and Martin, C. S. (1989). Some effects of talk er v ariabilit y on
sp ok en w ord recognition. The Journal of the A c oustic al So ciety of A meric a , 85(1):365–378.
Newman, R. S. and Sa wusc h, J. R. (1996). P erceptual normalization for sp eaking rate: Effects
of temp oral distance. Per c eption & Psychophysics , 58(4):540–560.
Norris, D., McQueen, J. M., and Cutler, A. (2016). Predictio n, Ba y esian inference and feedbac k
in sp eec h recognition. L anguage, Co gnition an d Neur oscienc e , 31(1):4–18.
Nygaard, L. C. and Pisoni, D. B. (1998). T alk er-sp ecific learning in sp eec h p erception. Per c eption
& Psychophysics , 60(3):355–376.
Näätänen, R., Leh tok oski, A., Lennes, M., Cheour, M., Huotilainen, M., Iiv onen, A., V ainio, M.,
Alku, P ., Ilmoniemi, R. J., Luuk, A., Allik, J., Sinkk onen, J., and Alho, K. (1997). Language-
sp ecific phoneme represen tations rev ealed b y electric and magnetic brain resp onses. Natur e ,
385(6615):432–434.
Owren, M. J. and Cardillo, G. C. (2006). The relativ e roles of v o w els and consonan ts in dis-
criminating talk er iden tit y v ersus w ord meaning. The Journal of the A c oustic al So ciety of
A meric a , 119(3):1727–1739.
P a jak, B., Fine, A. B., Kleinsc hmidt, D. F., and Jaeger, T. F. (2016). Learning additional lan-
guages as hierarc hical probabilistic inference: insigh ts from first language pro cessing: learning
languages as hierarc hical inference. L anguage L e arning , 66(4):900–944.
P edregosa, F., V aro quaux, G., Gramfort, A., Mic hel, V., Thirion, B., Grisel, O., Blondel, M.,
Prettenhofer, P ., W eiss, R., Dub ourg, V., V anderplas, J., P assos, A., Cournap eau, D., Bruc her,
M., P errot, M., and Duc hesna y , E. (2011). Scikit-learn: Mac hine learning in Python. Journal
of Machine L e arning R ese ar ch , 12:2825–2830.
P erception Researc h Systems (2007). P aradigm Stim ulus Presen tation.
P errac hione, T. K., Chiao, J. Y., and W ong, P . C. (2010). Asymmetric cultural effects on
p erceptual exp ertise underlie an o wn-race bias for v oices. Co gnition , 114(1):42–55.
P errac hione, T. K., Del T ufo, S. N., and Gabrieli, J. D. E. (2011). Human v oice recognition
dep ends on language abilit y . Scienc e , 333(6042):595–595.
Pierreh um b ert, J. B. (2001). Exemplar dynamics: W ord frequency , lenition and con trast. In
Byb ee, J. L. and Hopp er, P . J., editors, T yp olo gic al Studies in L anguage , v olume 45, page 137.
John Benjamins Publishing Compan y , Amsterdam.
Pisoni, D. B. (1973). A uditory and phonetic memory co des in the discrimination of consonan ts
and v o w els. Per c eption & Psychophysics , 13(2):253–260.
Pisoni, D. B. and T ash, J. (1974) . React ion times to comparisons within and across phonetic
categories. Per c eption & Psychophysics , 15(2):285–290.
P ols, L. C. W., v an der Kamp, L. J. T., and Plomp, R. (1969). P erceptual and ph ysical space
of v o w el sounds. The Journal of the A c oustic al So ciety of A meric a , 46(2B):458–467.
65
R Core T eam (2018). R: A L anguage and Envir onment for Statistic al Computing . R F oundation
for Statistical Computing, Vienna, A ustria.
Rom, D. M. (1990). A sequen tially rejectiv e test pro cedure based on a mo dified Bonferroni
inequalit y . Biometrika , 77(3):663–665.
Sh ue, Y.-L., Keating, P ., and Vicenik, C. (2009). V OICESA UCE: A program for v oice analysis.
The Journal of the A c oustic al So ciety of A meric a , 126(4):2221.
V an Lanc k er, D. and Kreiman, J. (1987). V oice discrimination and recognition are separate
abilities. Neur opsycholo gia , 25(5):829–834.
W erk er, J. F. and Logan, J. S. (1985). Cross-language evidence for three factors in sp eec h
p erception. Per c eption & Psychophysics , 37(1) :35–44.
Wilco x, R. R. (1994). The p ercen tage b end correlation co efficien t. Psychometrika , 59(4):601–
616.
Wilco x, R. R. (2016). Intr o duction to R obust Estimation and Hyp othesis T esting . A cademic
press, W altham, MA, 4th edition.
Wilco x, R. R. and Keselman, H. J. (2003). Mo dern robust data analysis metho ds: measures of
cen tral tendency . Psycholo gic al Metho ds , 8(3):254–274.
W o o ds, K. J. P ., Siegel, M. H., T raer, J., and McDermott, J. H. (2017). Headphone screen-
ing to facilitate w eb-based auditory exp erimen ts. A ttention, Per c eption, & Psychophysics ,
79(7):2064–2072.
66
Abstract
Speech is a complex multidimensional signal that carries linguistic information—what is said—and indexical information—who said it—simultaneously. These two dimensions are deeply intertwined in the acoustic signal of speech and listeners rely on knowledge of one to decode the other. For example, it is well established that listeners exploit knowledge of their language to process information about talkers, such that they are notably better at distinguishing talkers in a familiar language. However, it is not clear what the mechanism behind this effect is, or at what level of speech processing it operates. This dissertation seeks to expand our understanding of the language familiarity advantage in talker processing by exploring the interaction between linguistic and indexical information in non-native speech. Specifically, this dissertation investigates how familiarity with the sounds of a language affects the processing of talker identity at the level of individual syllables, focusing on the specific contributions of both consonants and vowels. We hypothesized that the discriminability of talkers in an unfamiliar language may be affected by first-language perceptual mechanisms similar to those that determine the discriminability of non-native speech contrasts. To test this, we assessed English listeners’ ability to discriminate pairs of talkers across Korean sounds differing in degree of familiarity using a variety of psycho-acoustic methods. The results of our experiments were inconsistent across different testing paradigms, but were tentatively supportive of the initial hypothesis, as well as of models that situate the language familiarity effect at an abstract level of phonological processing.
Asset Metadata
Creator: Benítez Pozo, Andrés (author)
Core Title: Effects of language familiarity on talker discrimination from syllables
School: College of Letters, Arts and Sciences
Degree: Doctor of Philosophy
Degree Program: Linguistics
Publication Date: 12/13/2020
Defense Date: 10/29/2020
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: language familiarity, OAI-PMH Harvest, speech perception, talker discrimination, talker identity, talker processing, voice perception
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Zevin, Jason (committee chair), Disner, Sandra (committee member), Wilcox, Rand (committee member)
Creator Email: a.benitez@me.com, benitezp@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c89-406279
Unique identifier: UC11666598
Identifier: etd-BentezPozo-9211.pdf (filename), usctheses-c89-406279 (legacy record id)
Legacy Identifier: etd-BentezPozo-9211.pdf
Dmrecord: 406279
Document Type: Dissertation
Rights: Benítez Pozo, Andrés
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA