SELECTIVITY FOR VISUAL SPEECH IN POSTERIOR
TEMPORAL CORTEX
by
Benjamin Taylor Files
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(NEUROSCIENCE)
December 2013
Copyright 2013 Benjamin Taylor Files
Acknowledgments
I thank the members of my dissertation committee, all of whom made incalculable
contributions to my doctoral education in general, and to my research specifically: Norberto
Grzywacz, Richard Leahy, Bosco Tjan, and Lynne E. Bernstein. I would like to particularly thank
my dissertation adviser, Dr. Bernstein. The work presented in this dissertation is a result of her
guidance, suggestions and ideas. In addition, I would like to acknowledge the contributions of
Edward Auer, who, along with Dr. Bernstein, is a co-author on the published version of the work
in Chapter 3.
Jintao Jiang, Julie Verhoff, Ewen Chao, John Jordan, Justin Aronoff and Brian Cheney
all contributed to work done in the Communication Neuroscience lab at the House Ear Institute.
Jintao Jiang provided stimulus materials, code and advice regarding the computational model on
which much of this research is based. Silvio P. Eberhardt provided valuable technical assistance
with stimulus presentation, hardware design and data collection in Dr. Bernstein’s lab at George
Washington University. I also thank the entirety of Bosco’s lab for helpful discussions and the
students of the Neuroscience Graduate Program who proved to be valuable sources of advice and
camaraderie. In particular, Farhan Baluch consulted on technical questions about EEG.
I thank the support and administrative staff of the Neuroscience Graduate Program.
Gloria Wan, Vanessa Clark, Amy Goodman and Linda Bazilian: thank you for helping me stay
enrolled, funded, and on track.
I was supported by a college doctoral fellowship from the Dornsife College of Letters,
Arts and Sciences and as a trainee of the Hearing & Communication Neuroscience program under
NIH Training Grant T32DC009975. The research in this dissertation was funded by
NIH/NIDCD Grant DC008583 and by the National Science Foundation.
Finally, I owe much to the loving support of my family and friends who kept me mostly
sane throughout graduate school. In particular, thanks go to my wife, Nikki Derdzinske, who
showed patience, support, love and kindness throughout my years at USC, and to my parents John
and Trish who instilled in me a love of learning. Thanks, everyone.
Table of Contents
Acknowledgments ii
List of Tables v
List of Figures vi
Abstract viii
Chapter 1: Introduction 1
Visual Speech Perception 1
Second-order Isomorphism 3
Neural Accounts of Visual Speech Perception 6
Summary and Overview 14
Chapter 2: Visual Speech Discrimination 16
Experiment 1: Discrimination of Natural and Synthetic Speech Syllables 23
Experiment 2: Discrimination of Inverted Visible Speech 54
General Discussion 61
Chapter 3: The visual mismatch negativity elicited with visual speech stimuli 70
Method 75
Results 88
General Discussion 96
Chapter 4: Conclusions 107
Perceptual Dissimilarity Maps to Optical Dissimilarity 107
Visual Speech Mismatch Response Maps to Reliably Different Phonetic Forms 111
Summary 115
Future Directions 117
References 119
Appendix A: GFP and Permutation Tests 132
Global Field Power 132
GFP Simulations 136
Permutation Tests 144
Permutation Simulations 152
Conclusion 160
Appendix B: Multiple Comparisons Correction 162
Methods 169
Results 172
Discussion 178
Conclusion 182
List of Tables
Table 2.1. Summary of all stimulus pairs in Experiments 1 and 2. 30
Table 2.2. Analyses of variance on d’ for the triplets in Experiment 1. 37
Table 2.3: Analyses of variance for response times in Experiment 1. 43
Table 2.4. Analyses of variance on d’ for the triplets in Experiment 2. 59
Table 3.1. Syllables included in each of four block types. 80
Table 3.2. Summary of reliable vMMNs. 93
Table A.1. Number of sweeps per subject in the three permutation test
simulations 153
List of Figures
Figure 2.1. Still frames from natural and synthetic speech stimuli. 27
Figure 2.2. Group mean d’ sensitivity for Experiment 1. 38
Figure 2.3. Correlations between sensitivity and perceptual distance. 40
Figure 2.4. Mean response times for Experiment 1 discrimination. 41
Figure 2.5. Group mean phoneme identification percent correct and entropy in
Experiment 1 45
Figure 2.6. Identification confusion matrices. 46
Figure 2.7. Mean d’ sensitivity for inverted and upright stimuli in Experiment 2. 58
Figure 2.8. Response time for inverted and upright stimuli in Experiment 2. 60
Figure 3.1. Schematic diagram of the proposed roles for left and right posterior
temporal cortices in visual speech perception. 74
Figure 3.2. Temporal kinematics of the syllables. 78
Figure 3.3. Global field power plots for the four vMMN contrasts. 89
Figure 3.4. Source images for “fa” standard. 90
Figure 3.5. Source images for “zha” standard. 91
Figure 3.6. Source images for “ta” standard. 92
Figure 3.7. Group mean ERPs and vMMN analyses for posterior temporal EOI
clusters. 94
Figure 3.8. VMMN comparisons. 95
Figure 3.9. Group mean ERPs and MMN analyses for (A) electrode Fz and (B)
electrode Cz. 96
Figure 4.1: Lipreading screening scores in words correct versus d’ for syllable
discrimination. 115
Figure A.1. The underlying effect for GFP Simulation 1. 137
Figure A.2. Characterization of a single sweep example of the noise used in GFP
Simulations 1 & 2. 138
Figure A.3. Example simulation of how GFP changes with number of sweeps. 139
Figure A.4. Distributions of mean GFP amplitudes over 100 simulations. 140
Figure A.5. The effect of noise on GFP decreases as SNR increases. 141
Figure A.6. Group mean GFP from different numbers of sweeps. 142
Figure A.7. Distributions of sweep-average mean GFP over 100 resamplings 143
Figure A.8. False positive rate in the fixed-majority simulation. 154
Figure A.9. False positive rate over time in the fixed-majority simulation. 155
Figure A.10. False positive rate in the half-ratio simulation. 156
Figure A.11. False positive rate over time in the half ratio simulation. 156
Figure A.12. False positive rate in the quarter-ratio simulation. 157
Figure A.13. False positive rate over time in the quarter ratio simulation. 157
Figure A.14. Comparison of results across minority/majority conditions using
standard paired-comparisons permutation testing 158
Figure A.15. False positive rate in the unbalanced paired samples test in the
fixed-majority condition. 159
Figure A.16. False positive rate over time in the unbalanced paired samples test
in the fixed-majority condition. 160
Figure B.1. Example simulated data used for multiple comparisons simulations. 170
Figure B.2. Hit rates of three multiple comparisons correcting procedures. 173
Figure B.3. False alarm rates of three multiple comparisons correcting
procedures. 174
Figure B.4 d’ sensitivity for three methods of correcting for multiple
comparisons. 175
Figure B.5. Family-wise error rate for three methods of correcting for multiple
comparisons. 176
Figure B.6. False discovery rates of three methods for correcting for multiple
comparisons. 177
Figure B.7 Proportion of simulation results in which false discovery rates
exceeded four cutoffs. 178
Abstract
Visual speech perception, also known as lipreading or speech reading, involves
extracting linguistic information from seeing a talking face. What information is available in a
seen talking face, as well as how that information is extracted are unsolved questions. The
introductory chapter of this dissertation discusses some of the theories describing what
information is available in the talking face and how that information is extracted. These theories
fall into three broad categories based on the structure of the representation of visible speech
information: Auditory models, Motor models, and Late-integration models. Auditory models
propose that visual speech information is transformed into an auditory representation. Motor
models propose that visual speech information is represented in terms of the articulatory gestures
involved in speech production. Late-integration models propose multiple sensory-specific
pathways for speech perception that operate somewhat independently.
The work in this dissertation uses behavioral methods to investigate the visible stimulus
information used for visual speech perception and electrophysiological methods to investigate the
neural representation of visual speech. In both cases, the experiments take advantage of second-
order isomorphism; that is, the dissimilarity relationships among stimuli should match the
dissimilarity relationships among behavioral responses and neural measures.
In the behavioral experiments, a model of the visual representations that drive visual
speech perception (Jiang et al., 2007a) was used to predict visual speech discrimination. The
model makes no use of feature extraction, and is instead a straightforward transformation of
optically-available data. As such, this model is an empirical realization of the claim that visible
syllable dissimilarity can be determined using straightforward visual processes. Previously, the
model accounted for visible speech phoneme identification in terms of perceptually weighted
distance measures computed using 3-dimensional optical recordings. In Behavioral Experiment 1,
participants discriminated natural and synthesized pairs of consonant-vowel spoken nonsense
syllables that were selected on the basis of their modeled perceptual distances. An identification
task was also used to confirm that both natural and synthetic stimuli were perceived as speech.
The synthesized stimuli were generated using the same data that were input to the perceptual
model. Modeled perceptual distance reliably accounted for discrimination sensitivity, measured
as d’, and response times.
The results of Behavioral Experiment 1 showed that the perceptual dissimilarity model
successfully predicted discrimination sensitivity in both natural and synthetic stimulus conditions.
Sensitivity was more strongly related to the predictions of the perceptual dissimilarity model in
the synthetic condition compared to the natural condition, but this discrepancy was largely
attributable to large differences between predicted and measured sensitivity for two natural
speech stimulus pairs. Additionally, discrimination sensitivity was higher than expected from an
implicit identification model, and the perceptual dissimilarity model was a better predictor of
sensitivity than was the implicit identification model.
In Behavioral Experiment 2, the natural stimuli were inverted in orientation during the
discrimination task to investigate whether the success of the perceptual dissimilarity model relied
on a specific orientation. Results largely replicated the pattern of findings of Experiment 1,
suggesting that perception of visible speech information is invariant to stimulus orientation.
Although the model and the synthetic speech were shown to be incomplete, the results of these
behavioral experiments were interpreted as consistent with speech discrimination relying on
dissimilarities in the visible speech stimulus that are closely related to the 3D optical motion on
which the perceptual dissimilarity model is based.
The electrophysiological experiment reported in Chapter 3 uses the visual speech
mismatch negativity (vMMN) to test whether the neural representation of visual speech reflects
the optical dissimilarity of pairs of syllables. The vMMN derives from the brain’s response to
stimulus deviance, and is thought to be generated by the cortex that represents the stimulus. The
vMMN response to visual speech stimuli was used in a study of the lateralization of visual speech
processing. Previous research suggested that the right posterior temporal cortex has specialization
for processing simple non-speech face gestures, and the left posterior temporal cortex has
specialization for processing visual speech gestures. Here, visual speech consonant-vowel (CV)
stimuli with controlled perceptual dissimilarities were presented in a vMMN paradigm. The
vMMNs were obtained using the comparison of event-related potentials (ERPs) for separate CVs
in their roles as deviant versus their roles as standard. Four separate vMMN contrasts were tested,
two with the perceptually far deviants (i.e., “zha” or “fa”) and two with the near deviants (i.e.,
“zha” or “ta”). Only far deviants evoked the vMMN response over the left posterior temporal
cortex. All four deviants evoked vMMNs over the right posterior temporal cortex. The results are
interpreted as evidence that the left posterior temporal cortex represents speech stimuli that are
perceived as different consonants, and the right posterior temporal cortex represents face gestures
that may not be reliably discriminated as different CVs.
The data gathered in the electrophysiological experiment pose a number of challenges to
conventional statistical analyses. The data are not normally distributed, and the data from
different stimulus conditions do not have equal variance. In the past, these kinds of data have
been analyzed using paired-samples permutation tests, but these data are problematic even for the
conventional paired-samples permutation test. Appendix A of this dissertation presents a
modified permutation test that addresses these problems. The data also consist of many non-
independent samples upon which statistical comparisons are made; this necessitates an
appropriate correction for multiple comparisons in order to avoid an uncontrolled Type I error
rate. Appendix B compares several approaches to the multiple comparisons problem as it applies
to the data from the electrophysiological experiment and motivates the use of a cluster-based
method.
The main theoretical contribution that follows from the behavioral results is to reinforce
the possibility that an account of visual speech perception need not make recourse to abstract or
gestural features. This is important, because much of the theorizing about visual speech
perception relies on a feature-level representation to provide a common format for integration of
visual and auditory speech information. The close mapping between modeled dissimilarity,
computed from 3D optical data, and perceptual dissimilarity is consistent with a model of visual
speech perception in which visible speech stimuli are processed extensively by visual cortex
before coordinating with representations of speech from other sensory modalities.
The main theoretical contribution of the electrophysiology result is to provide converging
evidence for a representation of visual speech in left posterior temporal cortex. Bernstein and
colleagues (Bernstein et al., 2011) used a system of control stimulus contrasts in a functional
magnetic resonance imaging experiment to isolate the temporal visual speech area (TVSA) that is
specifically sensitive to visual speech stimuli. Consistent with the hypothesis that TVSA is the
site of representation of visual speech, the work in Chapter 3 shows a left posterior temporal
cortical region that is highly sensitive to dissimilarities in visible speech syllables that signal
reliable differences in syllable identity.
Overall, the contribution of this work is to provide support for the theoretical position
that visual speech perception is a visual process. This claim is supported by showing that non-
visual representations are not required in order to account for visual syllable discrimination and
by contributing evidence that the TVSA is sensitive to stimulus contrasts that signal a reliable
syllable difference.
Chapter 1: Introduction
The topic of this dissertation is the visual perception of visible phonetic information. The
research described here seeks to demonstrate selectivity for visual speech in posterior temporal
cortex, and that the selectivity arises through visual processing. This introduction reviews
research that shows visual speech perception to be a well-established phenomenon in both
hearing and deaf populations. The idea of second-order isomorphism is introduced as an
organizing concept for the research approach taken. Finally, this introduction provides an
overview of theories of speech perception, and how they relate specifically to visual speech
perception.
Visual Speech Perception
Visual speech perception (also known as lipreading or speechreading) is possible when
auditory speech stimuli are unavailable (Bernstein et al., 2000; Bernstein et al., 2001; Mattys et
al., 2002; Mohammed et al., 2005; Auer and Bernstein, 2007). Normally-hearing individuals are
able to take advantage of visible speech information, although as a population pre-lingually deaf
individuals raised in an oral environment are better lipreaders (Bernstein et al., 2000; Bernstein et
al., 2001; Auer and Bernstein, 2007).
The perception of visible speech¹ has been studied almost exclusively in the context of
research on audiovisual speech integration, examples of which include enhanced perception of
degraded acoustic speech in noise when seeing a talker’s face (Sumby and Pollack, 1954;
MacLeod and Summerfield, 1987; Ross et al., 2007; Ma et al., 2009), enhanced comprehension
with good quality acoustic speech but a difficult-to-understand spoken message (Arnold and Hill,
2001; Arnold and Oles, 2001; Davis and Kim, 2004), and the McGurk effect, which is said to
occur when a heard syllable presented with a different seen syllable results in perception of a
third different syllable (McGurk and MacDonald, 1976; Jiang and Bernstein, 2011).
¹ I refer to the stimulus as visible speech or visual phonetic information and its perception as visual speech perception.
The term visual phonetic information refers to the linguistically relevant physical
characteristics of a seen talking face. Although the evidence cited above shows that there is
indeed linguistically relevant information in the seen talking face, there is currently no agreement
as to what, exactly, constitutes the visual phonetic information in the talking face. More
formally, one of the questions this dissertation addresses is what the functional features of the
seen speech stimulus are. By functional features, I mean those aspects of the stimulus that afford
extraction of the visual phonetic information. The visual speech stimulus is complex (Ramsay et
al., 1996) and affords an open-ended list of potentially functional visual stimulus features. For
example, visible speech varies over time, and the kinematics of the talking face carry phonetic
information (e.g., Rosenblum et al., 1996; Rosenblum and Saldaña, 1996); yet, the syllable being
spoken can in some cases be identified from static images (Calvert and Campbell, 2003). Speech
stimuli are non-invariant from one utterance to the next and from one speaker to the next. A
priori, it is not known what specific stimulus attributes constitute the functional features.
Functional features of visible speech
Built into the word lipreading is the assumption that the visible speech signal is primarily
in the talker’s lips, so it is perhaps not surprising that some past work has focused the search for
functional features on lip shape and position. For example, the fourth and fifth principal
components of lip motion can be used to discriminate the Japanese vowel /u/ from other vowels
(Mishima et al., 2011). These principal components were mapped to mouth opening height and
width, but these data were not sufficient to distinguish vowels other than /u/. In another
experiment testing the hypothesis that monophthong vowels have a characteristic lip shape,
several measures from a single frame of video were taken, including height and width of mouth
opening (Montgomery and Jackson, 1983). Results showed that mouth-opening height, width,
and their ratio were related to lipreading performance of a subset of monophthong vowels,
particularly the front vowels. However, visible speech information is available in parts of the
face other than the lips (Thomas and Jordan, 2004; Jiang et al., 2007a) and some visual phonetic
information can be derived from whole-head movements (Munhall et al., 2004). In contrast to
experiments with human data, computer recognition of visible speech based on analysis of lip
motion has had some success, but a major hurdle for automatic visible speech recognition is
variability in talker, viewing angle and illumination (Liew and Wang, 2009).
In summary, while visual speech perception has been reliably observed, the functional
features of the talking face remain unknown. The next section outlines the approach taken in this
dissertation toward establishing the relationship between the talking face and visual speech
perception.
Second-order Isomorphism
One approach to learning whether visual speech perception relies on visual processes to
integrate the linguistically relevant stimulus is to examine relationships between measures of the
physical stimulus and measures of the perceptual response. We refer to this type of direct
mapping as a first-order isomorphism. Past research has attempted to map features of the talking
face, such as mouth opening width, mouth opening height, and lip rounding (Montgomery and
Jackson, 1983), mouth shape (Massaro, 1998), or principal components of lip motion (Mishima et
al., 2011) onto phonetic dimensions. Such first-order isomorphic mappings have not been notably
successful for understanding how humans lipread (Bernstein and Jiang, 2009).
An alternative approach derives from considering the need for distinctiveness in
language. Language is thought to balance at least two factors: maximization of perceptual
distinctiveness (to convey information and avoid confusion in the perceiver) and minimization of
articulatory cost (Lindblom and Maddieson, 1988). Simply put, phonologies (the sound systems
of languages) appear to develop such that phonemes (the contrastive units of speech) are
sufficiently distinct from each other to comprise words while still respecting articulatory
constraints, suggesting that the dissimilarity structure of a language is one of its critical
components. Moreover, dissimilarity processing is a general property of neural representations
(Shepard and Chipman, 1970; Edelman, 1998; Kriegeskorte et al., 2008; Kriegeskorte, 2009).
The need for perceptual distinctiveness suggests that the functional features of the visible
speech stimulus are those that effectively map to linguistically relevant perceptual dissimilarities.
This type of mapping was referred to by Shepard and Chipman (1970) as second-order
isomorphism. They posited that an approach to understanding how complex stimuli are perceived
is to consider stimulus dissimilarity relationships, and how they map to perceptual dissimilarity
relationships. As Shepard and Chipman expressed it, “Thus, although the internal representation
for a square need not itself be square, it should (whatever it is) at least have a closer functional
relation to the internal representation for a rectangle than to that, say, for a green flash or the taste
of a persimmon” (p. 2). The advantage of working within the space of second-order
isomorphisms is that perceptual systems are generally expected to preserve relevant stimulus
dissimilarity, and perceptual dissimilarity can be measured using standard psychophysical
techniques, whereas first-order isomorphism typically involves a priori selection of stimulus
features or examination of an unspecified number of potentially relevant features for which
measures must be developed.
Second-order isomorphism originally described how perceptual dissimilarity could be
mapped onto physical dissimilarity measures (Shepard and Chipman, 1970), but second-order
isomorphism is also expected between perceptual and neural representation (Edelman, 1998).
Second-order isomorphism has served to bridge between perceptual, neural, and computational
mappings, providing a powerful, generalized approach to linking observations across non-
comparable measures (Kriegeskorte et al., 2008; Kriegeskorte, 2009).
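To make this second-order strategy concrete, the sketch below correlates the unique pairwise entries of a physical dissimilarity matrix with those of a perceptual dissimilarity matrix, in the spirit of representational similarity analysis. It is only an illustration: the syllable labels, the matrix values, and the choice of Spearman correlation are assumptions made for the example, not analyses reported in this dissertation.

    # Illustrative sketch (not an analysis from this dissertation): testing a
    # second-order isomorphism by correlating two dissimilarity structures.
    # The matrices and syllable labels below are hypothetical placeholders.
    import numpy as np
    from scipy.stats import spearmanr

    syllables = ["fa", "ta", "zha", "wa"]  # hypothetical stimulus set

    # Physical (e.g., optical) dissimilarities between syllable pairs.
    physical = np.array([
        [0.0, 1.2, 2.5, 3.1],
        [1.2, 0.0, 1.4, 2.8],
        [2.5, 1.4, 0.0, 2.0],
        [3.1, 2.8, 2.0, 0.0],
    ])

    # Perceptual dissimilarities for the same pairs (e.g., derived from
    # identification confusions or discrimination sensitivity).
    perceptual = np.array([
        [0.0, 0.9, 2.2, 2.9],
        [0.9, 0.0, 1.1, 2.4],
        [2.2, 1.1, 0.0, 1.8],
        [2.9, 2.4, 1.8, 0.0],
    ])

    # Compare only the unique pairs (upper triangle, excluding the diagonal).
    iu = np.triu_indices(len(syllables), k=1)
    rho, p = spearmanr(physical[iu], perceptual[iu])
    print(f"Second-order correspondence: rho = {rho:.2f}, p = {p:.3f}")

The key point of the sketch is that the two matrices are never compared element-for-element in any first-order sense; only the relational structure across stimulus pairs is compared.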
Applying second-order isomorphism
In order to obtain evidence for a representation of visible speech using second-order
isomorphism, two main components are needed. The first component is a flexible and robust
measure of stimulus dissimilarity that maps to perceptual dissimilarity. A candidate physical
measure of visible speech dissimilarity is described below. The flexibility and robustness of this
measure is tested in Chapter 2 of this dissertation. The second component needed for obtaining
evidence for a representation of visible speech is a neural measure of the dissimilarity of
representations of visible speech. Chapter 3 describes the visual mismatch response that is used
as a neural measure of dissimilarity of representations.
Jiang and colleagues (Jiang et al., 2007a) investigated a second-order isomorphic
mapping between 3-dimensional (3D) optical recordings of visible speech syllables and
perceptual identification of those syllables presented as natural video recordings. In their study,
the perceptual identifications of spoken visible syllables were transformed into a
multidimensional perceptual space, and the dissimilarities among the syllables in that space were
shown to be isomorphic to the physical stimulus dissimilarity space of the optical data. Across a
large set of stimuli (with varied vowels in the consonant-vowel context), multiple tokens, and
four different talkers, the variance accounted for in the perceptual syllable identification data
ranged between 46% and 66%.
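As a rough illustration of how identification confusions can be turned into a perceptual dissimilarity space, the sketch below converts a small confusion matrix into symmetric dissimilarities and embeds them with classical multidimensional scaling. The confusion counts, the symmetrization, and the scaling procedure are simplifying assumptions for the example; they are not the specific transformation used by Jiang and colleagues (2007a).

    # Generic sketch of turning identification confusions into a perceptual
    # space; the confusion counts below are hypothetical.
    import numpy as np

    # Rows: presented syllable; columns: response category (same order).
    confusions = np.array([
        [80, 10,  5,  5],
        [12, 70, 10,  8],
        [ 4, 12, 74, 10],
        [ 6,  9, 11, 74],
    ], dtype=float)
    p = confusions / confusions.sum(axis=1, keepdims=True)

    # Symmetric similarity from mutual confusability, then dissimilarity.
    similarity = (p + p.T) / 2.0
    dissim = 1.0 - similarity
    np.fill_diagonal(dissim, 0.0)

    # Classical (Torgerson) multidimensional scaling via double centering.
    n = dissim.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (dissim ** 2) @ J
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1]   # largest eigenvalues first
    dims = 2
    coords = eigvecs[:, order[:dims]] * np.sqrt(np.maximum(eigvals[order[:dims]], 0))
    print(coords)  # 2-D perceptual configuration of the four syllables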
The second-order isomorphic approach taken by Jiang and colleagues (2007a) relating
the weighted physical dissimilarity of visible three-dimensional motion to perceptual dissimilarity
derived from identification confusion forms the basis for the approach used in this dissertation.
Chapter 2 presents behavioral experiments designed to test the generalizability of the second-
order isomorphic relationship posited by Jiang and colleagues (2007a). These behavioral
experiments further support the second-order isomorphism between optical dissimilarity and
perceptual dissimilarity. Firmly establishing the perceptual relevance of the optical dissimilarity
of seen speech syllables provides a strong platform from which to examine the neural
representation of visible speech.
Neural Accounts of Visual Speech Perception
A number of theoretical positions have emerged that make claims regarding the neural
basis of visual speech perception. Some of these hypotheses are yoked to positions on the
functional features of visible speech perception. These models of visual speech perception can be
placed broadly into three main categories that are discussed below: auditory models, motor
models, and late-integration models.
Auditory models
Auditory models posit an auditory pathway for speech perception in auditory cortex.
However, if one is theoretically committed to an exclusively auditory model of speech
perception, a challenge is how to explain the overwhelming evidence that vision contributes to
speech perception (see above). One proposed explanation is that visual information is re-
represented in terms of auditory features through some mechanism. Summerfield (1992) presents
this phenomenological account: “speech in noise sounds clearer when one can see the talker’s
face; the combination of an acoustical ‘ba’ with a visual ‘ga’ sounds like ‘da’ or ‘dha’.
Therefore, some aspect of the visual information must be converted into an auditory
representation during audio-visual integration” (p. 76). Functional neuroimaging evidence is
available showing that seen talking faces modulate activity in auditory cortex (Sams et al., 1991;
Calvert et al., 1997; Calvert and Campbell, 2003; Pekkola et al., 2005; Pekkola et al., 2006; Saint-
Amour et al., 2007; Kauramaki et al., 2010), although activity in auditory cortex is not always
observed with visual speech stimuli (Bernstein et al., 2002; Bernstein et al., 2011). It has been
suggested that such activity might be evidence of an auditory representation of visible speech
(Sams et al., 1991; Calvert et al., 1997; Pekkola et al., 2005). However, changes in activation are
insufficient evidence for a representation. Further evidence suggests an alternative explanation of
the finding that visual speech might modulate activity in auditory cortex: recordings in monkeys
show that the modulation of auditory cortical neurons resulting from sensory stimuli in other
senses (Ghazanfar et al., 2005; Lakatos et al., 2007; Kayser et al., 2008; Lakatos et al., 2008) is a
general mechanism by which the temporal structure of neural processing is altered in order to
efficiently sample the environment (Lakatos et al., 2009; Kayser et al., 2010) and is not specific
to speech or vocalization processing, although certain temporal properties of speech may be
particularly well-suited to this processing mechanism (Schroeder et al., 2008; Giraud and
Poeppel, 2012).
Late-integration models posit, in addition to an auditory pathway for speech perception,
other sensory-specific pathways (see below). In this sense, late-integration models and auditory
models differ in how visual speech perception contributes to speech, but perhaps do not differ in
how auditory speech perception is carried out.
Non-invariance and auditory models
A criticism that has been leveled at auditory models is that a straightforward mapping
between acoustic properties and phonemes is not possible (e.g., Liberman and Mattingly, 1985).
The physical signals of speech are non-invariant due, in part, to coarticulation (for an overview of
this and other issues in articulatory phonology, see Browman and Goldstein, 1992). The sounds
and sights of a phoneme can vary dramatically depending on what came before and what is to
follow that phoneme. This criticism of auditory models for speech perception can easily be
leveled at any account that makes specific claims about how sensory information is represented.
This criticism does not appear to be fatal due to the hierarchical organization of the auditory
cortex (Okada et al., 2010). By pooling the non-invariant responses of lower-level cortical areas,
higher-level cortical areas can achieve invariant representations of non-invariant stimuli.
In summary, the evidence for auditory representations of acoustic speech information is
abundant. However, the role for seen speech information is not accounted for using an auditory-
only account.
Motor Models of Speech Perception
While there are multiple models that posit the involvement of motor cortex in perception,
the motor theory of Liberman and Mattingly specifically addresses motor involvement in speech
perception (Liberman and Mattingly, 1985, 1989; Liberman and Whalen, 2000). Mirror neurons
have been proposed as a possible neural substrate for motor representations of speech (Rizzolatti
and Arbib, 1998). In this model, the functional features of speech are intended articulatory
gestures. By appealing to the speaker’s intention, rather than the actual realization accessible to
the observer, this model avoids the difficulties of non-invariance, as presumably the intended
gesture is the same across instances, and it is merely the realization of that gesture that varies
(Galantucci et al., 2006). However, this model trades one problem for another, as there appears
to be no proposal for how the sensory consequences of speech are transformed into motor
representations (Liberman and Whalen, 2000).
Experimental evidence for and against motor models
Some experimental evidence is available that motor systems are closely related to speech
perception. Hearing and/or seeing speech leads to an increased activation of motor and pre-motor
cortical areas (Wilson et al., 2004; Hall et al., 2005; Skipper et al., 2005; Pulvermuller et al.,
2006). Moreover, hearing (Fadiga et al., 2002) or seeing (Watkins et al., 2003) speech leads to
lowered threshold of activation of muscles driving the articulators that would be used to produce
that speech. Increases in neural activation in inferior frontal areas evoked by heard speech appear
to occur after posterior temporal activation (Pulvermuller et al., 2003), suggesting this pre-motor
activity is a result of linguistic processing. While this evidence shows that systems for production
and perception of speech are inter-related, evidence that speech production systems play a crucial
role in perception is lacking.
Further evidence supports the hypothesis that motor cortical areas are interconnected with
representations of speech, but that this interconnection does not imply that speech is represented
in motor cortex. Motor and premotor cortices appear to be insensitive to speech versus nonspeech
sounds (Agnew et al., 2011). Transcranial magnetic stimulation designed to inactivate the
putative motor speech circuit results in deficits in some, but not all, language tasks (Sato et al.,
2009). Rather than playing an essential role in speech perception, this circuit
could be recruited in language development and working memory (Hickok and Poeppel, 2007;
Scott et al., 2009).
Amodal models of speech perception
Closely related to motor models of speech perception are amodal models. Amodal
models make no specific claim about the cortical substrate of speech perception, and instead
focus on what is represented. According to amodal models of speech perception, speech is not
represented in terms of the specific sensory pathway by which that information enters the brain;
rather it is represented in terms of properties shared across sensory modalities (for a review, see
Rosenblum, 2008). Under normal circumstances, acoustic and optical speech signals are
generated by the same articulatory event. Measures of the acoustic and optical signals resulting
from speech exhibit strong correlations (Jiang et al., 2002b; Munhall and Vatikiotis-Bateson,
2004).
Cross-modal source matching
The phenomenon of cross-modal source matching has been viewed as evidence for
amodal encoding of speech. Cross-modal source matching refers to the ability to match seen and
heard speech uttered by the same individual talker. Seeing an individual talker enables perceivers
to select heard speech generated by that talker over heard speech generated by a different talker,
and vice versa (Kamachi et al., 2003; Lachs and Pisoni, 2004b, a, c; Rosenblum et al., 2006).
Moreover, one hour of experience attempting to understand a talker in visual-only conditions
leads to a significant improvement in understanding auditory-only speech in noise produced by the
same talker (Rosenblum et al., 2007). Cross-modal source matching has been interpreted as
evidence for amodal speech information encoding and challenges a strictly auditory or visual
speech representation, because visual experience of a talker’s idiolect (i.e. idiosyncrasies in the
talker’s speech production) has consequences for auditory perception of that talker. Seen speech
gives evidence of both what was said and how it was said, which can subsequently be accessible
to auditory speech perception.
Although the details of what exactly is encoded amodally, and where in the brain it is
represented remain controversial (Munhall and Buchan, 2004), one suggestion is that the
temporal structure of idiosyncratic prosody is the basis for cross-modal source matching (Lander
et al., 2007). This proposed representation of temporal structure could be instantiated by the
mechanisms described above for cross-sensory temporal modulation.
Late Integration Models
Late integration models propose multiple sensory pathways for speech perception. In
addition to an auditory pathway, a visual pathway is proposed. After sensory-specific processing,
the sensory-specific representations are integrated (Massaro, 1998) or coordinated (Bernstein et
al., 2004; Bernstein, 2005; Altieri et al., 2011) to produce a percept. An important distinction
amongst different late integration models is whether they posit convergence onto a final speech
representation, or whether they posit separate representations that coordinate or associate
(Bernstein et al., 2004; Altieri et al., 2011). The issue of convergence versus association is not
addressed here, but in both kinds of model, a visual representation of speech is required.
Evidence for a visual representation
A number of lines of research support the possibility of a visible speech processing
pathway distinct from an auditory pathway. Behavioral evidence shows that errors in visible
speech word identification are related to visual similarity rather than auditory similarity (Auer,
2002; Mattys et al., 2002). If there were a single amodal or multimodal representation of speech,
then that representation would have a single dissimilarity structure. The observation of multiple
dissimilarity structures implies multiple representations. The optical dissimilarity of visible
speech syllables drives identification of those syllables (Jiang et al., 2007a).
The results in Chapter 2 of this dissertation further support that the optical dissimilarity of visible
speech syllables maps directly to the perceptual discriminability of those syllables. These
findings support the possibility of a visual representation of speech.
Another line of evidence supporting late-integration models comes from response time
analysis. In a double factorial paradigm, the reliability of seen and heard congruent speech was
manipulated and response times were analyzed and compared against predictions based on early
and late integration models. The results were interpreted as supporting a late integration model
with cross-sensory interactions (Altieri and Townsend, 2011).
The McGurk effect and visual representation
Other findings supporting a visual representation of speech that is distinct from an
auditory representation come from manipulations using the McGurk effect (McGurk and
MacDonald, 1976). The McGurk effect is an audio-visual speech effect that occurs when some
visible speech syllables are dubbed with different audible speech syllables. This can result in a
fusion percept in which the speaker seems to have uttered a third syllable that is different from
the heard syllable. For example, a seen /gA/ paired with a heard /bA/ is perceived as /dA/ (for a review,
see Campbell, 2008).
One such finding combined the McGurk effect with speech adaptation. Speech
adaptation experiments expose participants to a continuum of syllables varying in increments
between two phonetic categories and ask participants to judge to which category these
syllables belong. A phoneme boundary is determined as the point on the continuum where the
category judgment changes. After repeated presentation of one of the endpoint syllables, the
category boundary usually shifts toward the adapter. Under a theory of shared representation, adaptation
would always be expected based on the perceived syllable, while late integration theories would
predict adaptation to occur on a sensory-specific basis. Using a combination of heard /bA/ and
seen /vA/ as an adapting stimulus, shared representation theories would predict adaptation to /vA/,
shifting the category boundary on the auditory /bA/ to /vA/ continuum away from /vA/, while late
integration theories would predict adaptation to the heard /bA/, driving the category boundary
away from /bA/ in the auditory continuum. The results clearly showed auditory-specific
adaptation, in support of late integration models (Saldana and Rosenblum, 1994; Shigeno, 2002).
In further support of late-integration models, there are cultural differences in the
frequency of McGurk fusion effects. For example, native speakers of Japanese rarely show
McGurk fusion effects compared to native speakers of American English but still are able to use
auditory and visual speech information (Sekiyama and Tohkura, 1991). Late-integration models
can account for either the absence or presence of fusion effects through varying the degree of
cross-sensory interactions, while amodal and sensory-specific accounts of speech have no
apparent explanation for this kind of variability.
Neural evidence of a visual representation of speech
As discussed above, visible speech is a complex stimulus and viewing speech activates a
broad network of brain regions, some of which are involved in speech perception, and some of
which are not. This network includes visual regions (MacSweeney et al., 2002b; Calvert and
Campbell, 2003; Capek et al., 2004; Hall et al., 2005; Pekkola et al., 2005; Capek et al., 2008). In
order to isolate those cortical regions specifically sensitive to visible speech, functional magnetic
resonance imaging contrasting visible speech with non-speech face motion was performed using
both video and point-light stimuli. A region of posterior superior temporal sulcus and posterior
middle temporal gyrus, part of the late vision pathway, was identified as specifically activated in
response to visible speech (Bernstein et al., 2011). This region, called the temporal visual speech
area (TVSA), was proposed as the cortical location of visual representation of speech.
Summary of Neural Accounts of Visual Speech Perception
While this review is not an exhaustive account of all proposals and models for speech
perception, the auditory, motor, and late integration models have received a great deal of focus
and some degree of support (Altieri et al., 2011). Evidence that has been interpreted as
supporting auditory and motor representations of visual speech, however, can be explained
instead as interactions between representations, rather than shared representations per se. What
emerges from this review, and as others have suggested (Bernstein et al., 2004; Bernstein, 2005;
Altieri et al., 2011), is that visual speech information has a representation that is distinct from
auditory speech representations. In this sense, the visual representation is independent. This is
not to say there is no interaction between visual and auditory speech processing pathways. On
the contrary, there is strong evidence for interaction between these pathways from diverse lines of
research. However, the visual representation of speech is the focus of this dissertation, rather
than the interactions between representations. The main neuroanatomical evidence for a
representation of visual speech information (Bernstein et al., 2011) indicates the temporal visual
speech area (TVSA) as a region that is preferentially activated by visible speech.
Chapter 3 of this dissertation reports the findings of an experiment using
electroencephalography (EEG) that provides neural evidence for a visual representation of seen
syllables. This representation is part of a feed-forward pathway, and its location is consistent
with that of TVSA in posterior temporal cortex.
Summary and Overview
This introductory chapter has laid the groundwork for understanding the subsequent
chapters describing experimental work investigating the representation of seen speech. The
evidence that vision contributes to speech perception is clear, but the functional features of the
stimulus are unknown. The approach taken here is to seek second-order isomorphism among the
physical dissimilarity of seen syllables, the perceptual distinctiveness of those syllables, and
neural responses to them. Chapter 2 describes behavioral experiments extending the work of
Jiang et al., (2007a) to confirm this second-order isomorphism between optical dissimilarity and
perceptual discrimination of seen syllables.
Theories differ on the neuroanatomical substrate for the representation of speech and how
visual information contributes to this representation. The best-supported account is that there is a
visual representation of seen speech, and the proposed cortical location of this representation is in
TVSA, or at least in cortical areas within high-level vision. Chapter 3 describes an EEG experiment
that provides support for the hypothesis that the representation of visual speech is in a visual
feedforward pathway located in posterior temporal visual cortex. Chapter 3 is in press for
publication, in an edited form, in Frontiers in Human Neuroscience with Edward T. Auer and
Lynne E. Bernstein as co-authors. Chapter 3 uses statistical methods that are not widely used; the
Appendices describe these methods in detail as well as provide results of simulations
demonstrating the necessity and suitability of these methods.
Chapters 2 and 3 are largely meant to stand on their own. As such, some of the concepts
discussed in this introduction are briefly re-introduced. A comprehensive conclusion is presented
in Chapter 4, which highlights the theoretical contributions of this work and also proposes
extensions to the present work.
Chapter 2: Visual Speech Discrimination
Visual speech perception (also known as lipreading or speechreading) is possible when
auditory speech stimuli are unavailable (Bernstein et al., 2000; Bernstein et al., 2001; Mattys et
al., 2002; Mohammed et al., 2005; Auer and Bernstein, 2007). Nevertheless, the perception of
visible speech has been studied predominantly in the context of research on audiovisual speech
integration, examples of which include enhanced perception of degraded acoustic speech in noise
when seeing a talker’s face (Sumby and Pollack, 1954; MacLeod and Summerfield, 1987; Ross et
al., 2007; Ma et al., 2009), enhanced comprehension with good quality acoustic speech but a
difficult-to-understand spoken message (Arnold and Hill, 2001; Arnold and Oles, 2001; Davis and
Kim, 2004), and the McGurk effect, which is said to occur when a heard syllable presented with a
different seen syllable results in perception of a third syllable (McGurk and MacDonald, 1976;
Jiang and Bernstein, 2011).
In order for a theory of speech perception to be complete, it must account for the
contribution of sight to speech perception (Summerfield, 1987). Some theories of speech
perception posit that visible speech information is transformed into a representation that is in
common with auditory speech information. The output of this transformation is described in
terms of, for example, intended articulatory gestures (Liberman and Mattingly, 1985) or fuzzy-
logical truth values representing support for linguistic features (Massaro, 1998). Other theories
propose that visible speech information is encoded into vision-specific representations that are
subsequently coordinated with auditory representations at some higher level of processing
(Bernstein et al., 2004; Nahorna et al., 2012). In almost all theories (cf. Fowler, 2004), some
perceptual front-end is proposed that encodes visual information, and this encoded visual
information supports speech perception. The present study contributes to ongoing research to
identify what information is encoded that supports visual speech perception. The goal of such
research is to establish measurable physical dimensions of the visual speech stimulus that can be
used to predict visual speech perception. Below, two classes of approach to investigating the
visible stimulus information used for visual speech perception are outlined.
One approach to quantifying the visible stimulus information that supports visual speech
perception is to examine relationships between measures of the physical stimulus and the
perceived identity of the stimulus. We refer to this type of direct mapping as a first-order
isomorphism. First-order isomorphic mappings have had limited success elucidating the visible
stimulus information humans encode that supports lipreading (Bernstein and Jiang, 2009). For
example, the principal components of visible lip motion were used as a physical measure, and the
relationship between this physical measure and the vowels of Japanese was examined (Mishima
et al., 2011). The fourth and fifth principal components of measured lip motion were sufficient to
reliably distinguish /u/ from the other vowels, but could not be used to distinguish amongst
vowels other than /u/. X-ray beam recording has been used to track both visible and obscured
parts of the face during articulation, and the speed, duration and distance travelled in a vocal
movement stroke of the jaw and, to a lesser extent, the lower lip during vowel articulation were
found sufficient to reliably (with percent correct predictions of 67% and 53%) select the spoken
vowel from a closed set of six possible vowels (Yunusova et al., 2011). In an experiment testing
the hypothesis that monophthong vowels have a characteristic lip shape used for visual speech
perception, several measures from a single frame of video were taken, including height and width
of mouth opening (Montgomery and Jackson, 1983). Results showed that height, width, and their
ratio were related to lipreading performance of a subset of monophthong vowels, particularly the
front vowels. These studies clearly show that some measures of visible stimulus information can
be related to visual speech perception, but these studies have yet to result in a general
understanding of the visible stimulus information used for visual speech perception.
It should be noted that success has been obtained by starting with known features of
speech segments (Miller and Nicely, 1955) and relating those features to visual speech
perception. In the fuzzy logical model of perception (FLMP), a perceptual front-end makes
available sensory features that provide support for speech segments (Massaro, 1998; Massaro and
Cohen, 1999). This model has been successful in accounting for audio-visual speech perception
in many contexts, including, but not limited to audio-visual asynchrony (Massaro et al., 1996),
synthesized visual and auditory speech (Cohen et al., 1996), and spatial quantization of the visual
speech stimulus (Campbell and Massaro, 1997). The FLMP is flexible and powerful, but relies on
an as-yet unspecified perceptual front-end to transform physical stimulus information into a more
abstract feature-based representation, and so it does not address the goal of identifying
measurable physical dimensions of the visible speech stimulus that can be used to predict
perception.
One difficulty that inhibits successful measurement of the visual information used for
speech perception is that the visual speech stimulus is complex (Ramsay et al., 1996) and affords
an open-ended list of potentially effective visual stimulus cues. For example, visible speech
varies over time, and the kinematics of the talking face carry phonetic information (e.g.,
Rosenblum et al., 1996; Rosenblum and Saldaña, 1996); yet, the syllable being spoken can in
some cases be identified from static images (Calvert and Campbell, 2003). Speech stimuli are
non-invariant from one utterance to the next and from one speaker to the next (Browman and
Goldstein, 1992). A priori, it is not known which specific stimulus dimensions to measure.
Second-order isomorphism
An alternative to the first-order approach derives from consideration of how speech
functions. The purpose of language is to communicate, and language is thought to balance at least
two factors: maximization of perceptual distinctiveness (to convey information and avoid
confusion in the perceiver) and minimization of articulatory cost (Lindblom and Maddieson,
1988), suggesting that the dissimilarity structure of a language is one of its critical components
(Auer and Bernstein, 1997; MacEachern, 2000). Given the importance of perceptual dissimilarity
to language, and the complexity inherent to mapping physical stimulus dimensions to perceived
stimulus identity described above, an alternative approach is to examine how stimulus
dissimilarity maps to perceptual dissimilarity. This type of mapping was referred to by Shepard
and Chipman (1970) as second-order isomorphism. They posited that stimulus dissimilarity
relationships, and how they map to perceptual dissimilarity relationships provide an approach to
understanding complex stimuli. Recently, Jiang and colleagues (Jiang et al., 2007a) suggested
that in order to understand the relationship between visible stimulus information and visual
speech perception, a second-order isomorphism may need to be established between physical
signals and perception.
A model of visible syllable perceptual dissimilarity
Jiang and colleagues (2007a) investigated a second-order isomorphism between 3-
dimensional (3D) optical recordings of visible speech syllables and perceptual identification of
those syllables presented as natural video recordings. In their study, the perceptual identifications
of spoken visible syllables were transformed into a multidimensional perceptual space, and the
dissimilarities among the syllables in that space were shown to be isomorphic to the physical
stimulus dissimilarity space of the optical data. Across a large set of stimuli (with varied vowels
in the consonant-vowel context), multiple tokens, and four different talkers, the variance
accounted for in the perceptual syllable identification data ranged between 46% and 66%. The
successful mapping (via second-order isomorphism) of 3D optical recordings and perceptual
identification data provides evidence that the visible stimulus information in the talking face is, to
some extent, captured by the 3D motion of a sparse set of points on the face.
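The sketch below illustrates, in schematic form, the kind of physical dissimilarity measure such a model builds on: a weighted distance between two syllables’ 3D optical marker trajectories after time normalization. The linear resampling, uniform weights, and random trajectories are placeholders for the example; the actual alignment, weighting, and marker set of the Jiang et al. (2007a) model are not reproduced here.

    # Schematic sketch of a weighted optical dissimilarity between two
    # syllables' 3-D marker trajectories; all values are illustrative.
    import numpy as np

    def resample(traj, n_frames):
        """Linearly resample a (frames, markers, 3) trajectory to n_frames."""
        old = np.linspace(0.0, 1.0, traj.shape[0])
        new = np.linspace(0.0, 1.0, n_frames)
        out = np.empty((n_frames,) + traj.shape[1:])
        for m in range(traj.shape[1]):
            for d in range(3):
                out[:, m, d] = np.interp(new, old, traj[:, m, d])
        return out

    def optical_dissimilarity(a, b, weights, n_frames=60):
        """Weighted mean Euclidean distance between time-normalized trajectories."""
        a, b = resample(a, n_frames), resample(b, n_frames)
        per_marker = np.linalg.norm(a - b, axis=2)   # (frames, markers)
        return float(np.mean(per_marker * weights))  # weights: (markers,)

    # Hypothetical example: 20 markers tracked over two utterances.
    rng = np.random.default_rng(0)
    syll_a = rng.normal(size=(55, 20, 3)).cumsum(axis=0)
    syll_b = rng.normal(size=(62, 20, 3)).cumsum(axis=0)
    w = np.ones(20)   # uniform weights as a placeholder for perceptual weights
    print(optical_dissimilarity(syll_a, syll_b, w))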
The present study sought to further test the extent to which the Jiang et al. perceptual
dissimilarity model predicts the relationship between visible stimulus information and visual
speech perception. The perceptual task used by Jiang et al. to develop and evaluate their
dissimilarity model was a syllable identification task. Here, we use a syllable discrimination task
in which participants are presented pairs of visible speech consonant-vowel (CV) stimuli and are
asked to judge whether the stimuli were the same or different syllables. This kind of
discrimination task may allow a more direct assessment of the perception of the visible stimulus
information used for visual speech perception. To explain why this may be the case, a brief
discussion of models of signal classification follows.
Classification tasks, identification and discrimination
A classification task that involves placing a stimulus in its correct category requires two
functional steps: First, the relevant physical stimulus properties of each stimulus must be
extracted from available sensory information about the stimulus. This extraction results in a
sensory representation of the stimulus. Second, these extracted properties must be classified,
resulting in a category-based representation of the stimulus (Neisser and Beller, 1965; Posner and
Mitchell, 1967; Duda and Hart, 1973). The mapping from sensory representation to category
representation is many-to-one (Ahissar et al., 2008), so some detail is necessarily lost between the
sensory and category representations. This general framework for models of classification fits
well with theories of speech perception (Massaro, 1998; Bernstein et al., 2004; Nahorna et al.,
2012) in which the visible stimulus information is encoded as a phonetic form, and the categories
for classification are phonemes. When the task is not classification but instead the task is to make
same/different judgments on the categories of pairs of stimuli, the response could be based on the
outputs of either of the two functional stages. If discrimination is performed at the level of
categories, then the same/different judgment is based on whether the category decision for the
first stimulus is the same as the category decision for the second stimulus. If discrimination is
performed at the level of the sensory representation, then perceptual detail lost in the mapping
between the sensory representation and the category-based representation is available for making
a comparison. Moreover, each stage in the classification model has associated internal noise
(Durlach and Braida, 1969). Perceptual judgments based on the sensory representation stage can
be more precise, because detail is not lost in the projection to the classification stage.
Although speech categories do impact speech perception (Kuhl, 1991; Iverson and Kuhl,
1995, 2000), speech perception is not generally well-described as strictly categorical (Hary and
Massaro, 1982; Massaro and Hary, 1984). Therefore the expectation is not that visual speech
perception is strictly categorical. Nonetheless, in the syllable discrimination task used here, the
task was explicitly to judge whether the two stimuli belong to the same CV syllable category.
Participants were informed that even when the CV syllables were the same, the two stimuli would
differ in other ways. Thus, a plausible model for performance of this task is as an implicit
identification task: the subject could categorize the first stimulus, categorize the second stimulus,
and then respond “same” or “different” based only on whether the two stimuli were given the
same category or not, and ignoring any comparison of phonetic forms.
The most straightforward interpretation of non-categorical perception is that the
discrimination judgment is instead based on a sensory representation (sometimes called a
perceptual trace, e.g., Mirman, Holt & McClelland, 2004). Under this interpretation,
discrimination tasks provide a method to investigate the mapping between visible stimulus
information and visual speech perception without the additional noise of a classification stage.
However, this straightforward interpretation is not the only possible explanation for differences in
discrimination and identification task performance. Discrimination could be based on a different
classification process that uses the same sensory representation but a different memory-encoding
stage with possibly more precise categories, or could even be based on a different sensory
representation.
The possibility that visual speech perception could involve multiple sensory
representations is supported by neuroscientific evidence showing that information in the talking
face is processed by multiple sensory pathways (Campbell et al., 1986; Campbell et al., 1996b;
Bernstein et al., 2011). In particular, violations of regularities established by visible speech
stimuli result in a left- and right-lateralized visual mismatch negativity (vMMN). The left-
lateralized vMMN was associated with differences in phonetic forms that could reliably be
assigned to different phoneme categories, but the right-lateralized vMMN was associated with
finer-grained differences that were not necessarily associated with phoneme categorization and
instead may be involved in more general face motion processing (Files, Auer & Bernstein, in
press). So, while perceptual experiments like those reported here cannot support conclusions
about neural pathways, perceptual experiments can provide information about what visible
stimulus information is used in visual speech perception and can provide some converging
evidence about how that information may be processed.
Here, we report two experiments that were designed to investigate the visible stimulus
information used for visual speech perception and to provide some description of how that
information may be used. Experiment 1 used identification and discrimination tasks with natural
and synthetic visual speech stimuli to investigate if perceptual dissimilarity measured using a
discrimination task is well-predicted by the Jiang et al. (2007) model of perceptual dissimilarity.
The results support the conclusion that the Jiang et al. model successfully predicts perceptual
dissimilarity for most, but not all, syllable pairs tested. Experiment 2 used vertical stimulus
inversion to probe whether the sensory representation supporting visual speech discrimination is
sensitive to facial orientation.
Experiment 1: Discrimination of Natural and Synthetic Speech Syllables
The primary goal of Experiment 1 was to assess whether the dissimilarity model of Jiang
et al. (2007) successfully predicts discrimination sensitivity. Another goal was to test if
discrimination sensitivity is based on additional information that is not well-captured by the 3D
optical measurements used as the input to the perceptual dissimilarity model. An alternative
model was also tested to see if discrimination judgments were well-described as the output of an
implicit identification process.
To test whether discrimination sensitivity reflects the physical dissimilarity structure
elucidated by Jiang et al., pairs of syllables were chosen that had either low or high physical
dissimilarity (i.e. they were predicted to be perceptually near or perceptually far). Discrimination
stimulus pairs were organized into triplets based on what is referred to here as an anchor syllable.
Each triplet comprised a stimulus pair of the same syllable (i.e., anchor vs. anchor) but different
tokens, a pair with a large physical distance (i.e., anchor vs. far), and a pair with an intermediate
physical distance (i.e., anchor vs. near). Same pairs used different stimulus tokens to defend
against discrimination judgments based on phonetically irrelevant stimulus attributes. The stimuli
were taken from the library of consonant-vowel (CV) syllables produced by Talker M2 (Jiang et
al., 2007a). Talker M2 was chosen because the stimuli he produced resulted in the best
correlation between physical similarity and perceptual confusions in previous work (Jiang et al.,
2007a). Simultaneous with the video recording of the CV syllables, 20 points on the talker’s face
were tracked using a 3D motion capture system. The motion was used to generate the
dissimilarity measure described in Jiang et al. (2007a).
Perceptual systems need to deal with variability, or non-invariance, in order to classify
sensory signals. This is particularly true in speech perception, in which there is considerable
within-phoneme variation and inter-talker variation (Demorest and Bernstein, 1992). Without
careful control and characterization of stimulus materials, the inherent variability in speech can
potentially confound experimental effects. Here, by including multiple tokens of each syllable,
we introduced within-syllable stimulus variability. If only one token of each syllable were used,
then differences between tokens—related to speech or not—could have been a basis for
successful task performance. (Including more examples of each syllable would strengthen the
case for generalization to all utterances of that syllable but would not have been practical in this
context because of the combinatorial explosion it would cause in the number of possible stimulus
pairs.)
In order to investigate whether visual speech discrimination is based on the 3D motion
used by the perceptual dissimilarity model, a stimulus was required that presented the 3D motion
but included little else. This was accomplished using synthetic visible speech. Synthetic speech
generally provides better stimulus control than is available with natural speech tokens (Munhall
and Vatikiotis-Bateson, 1998), because it reduces the possibility of performance based on natural
artifacts such as eye gaze or facial expression. Here, we used a visual speech synthesizer that was
driven by the 3D motion-capture data that was gathered simultaneously with the video used in the
natural speech condition (Jiang et al., 2008).
The synthetic stimuli were used to examine whether cues other than those represented by
the 3D data are required for visual speech perception. One specific possibility was that glimpses
of tongue motion when the mouth is open constituted an important cue for visual speech
perception in some, but not all, syllables (Preminger et al., 1998; Jiang and Bernstein, 2011).
Tongue motion correlates with external face motion (Yehia et al., 1998; Jiang et al., 2002b), so
the previously reported relationship between 3D motion and perception could be mediated by
glimpses of the tongue. Because the 3D motion data do not include any direct measures of the
tongue, the synthetic talking face had no direct representation of the tongue.
Identification data were collected with both natural and synthetic stimuli primarily as a
control to ensure that the stimulus modification (see methods) of the natural speech stimuli did
not impede syllable identification and that the synthetic stimuli could reliably be identified. These
data also afforded an opportunity to test an implicit identification model of the visual speech
discrimination task. The identification data were used to estimate the probability that a given
stimulus would be assigned to a given CV category response. Using these estimates, the
probability a pair of syllables would be independently assigned the same category could be
estimated, and simulated same/different responses were generated. These simulated responses
were first compared with actual responses to assess whether the implicit identification model
could plausibly account for discrimination performance. Second, the ability of the implicit
identification model to predict discrimination performance was compared with the ability of the
perceptual dissimilarity model (Jiang et al., 2007a) to predict discrimination performance.
To summarize, Experiment 1 consisted of a discrimination task and an identification task.
Discrimination performance was assessed in terms of d’ and response times. Identification data
were assessed in terms of percent correct and response entropy. Identification data were also used
to test an implicit identification model. Stimuli were natural and synthetic visible speech
syllables. Synthetic stimuli were used to isolate possible non-phonetic cues and to present only
the information captured by the model of dissimilarity. Discrimination d’ and response times
were analyzed to investigate whether the model of dissimilarity successfully predicted
discrimination performance.
Methods
Participants. Twelve volunteers (8 female, 11 right-handed), mean age of 35 years
(range 22 to 47 years), participated in Experiment 1. Participants were selected from an existing
database of volunteers screened to have normal or corrected-to-normal vision, normal hearing,
and lipreading ability no worse than the median for hearing individuals (Auer and Bernstein,
2007). There are large individual differences in lipreading ability among adults with normal
hearing (Bernstein et al., 2000; Auer and Bernstein, 2007). Individuals with low lipreading skill
would be unlikely to provide data that could be used to support any conclusions about visual
speech perception, and so were not invited to continue the experiment. Vision was tested using a
standard Snellen chart with an inclusion criterion of 20/30 or better in both eyes. Hearing was
screened using pure-tone audiometry and a threshold of less than +15 dB HL. All participants
gave informed consent and were compensated financially for their participation. Most participants
completed the experiment in fewer than three hours.
Stimuli. Stimuli were presented on a CRT monitor in a sound-attenuating IAC booth. The
video source was a DVD player driven by custom software. Response times were recorded using
a custom-built timer, triggered by a specially-designed audio track on the DVD (note, no sound
was presented to the participant), which was verified to ensure accurate and reliable
synchronization between the video stimulus and the response timer.
Natural stimuli. The experiment made use of previous video-recordings of natural
consonant-vowel (CV) syllables that were made with a production quality camera (Sony DXC-
D30 digital) and video recorder (Sony UVW 1800) (Jiang et al., 2007a). The speech was
produced by Talker M2 whose perceptual and physical stimulus dissimilarities were most highly
correlated.
The syllables were two tokens each of /bA, pA, dA, tA, gA, kA, dZA, tSA, ZA, hA, lA, nA,
rA, vA, fA/. Two examples of each syllable were used to defend against reliance on stimulus
artifacts for making discrimination judgments. Stimuli began and ended with a neutral face
position (i.e., mouth closed) and were free of large head motion and major artifacts, such as large
eye movements and non-verbal mouth motion. Figure 2.1a shows a video frame from one stimulus. The dots on the talker’s face are retro-reflectors that were used to record face motion.
In order to reduce the time needed for testing, each video was shortened to end after the syllable’s maximal jaw opening. One video frame (at 29.97 frames/s) of the still face was repeated five times before the speech stimulus started and one was repeated five times at the end, resulting in ten frames (333 ms) of neutral face separating the two syllables of a discrimination trial. This was done to defend against visual masking due to abrupt transitions across syllables. All stimuli in the experiment were authored to DVD to allow accurate response time measures and random access to individual stimuli.
Figure 2.1. Still frames from natural and synthetic speech stimuli. The white dots on the face of the talker (A) are retro-reflectors that were used during video recording for motion-capture of 3D motion on the talker’s face. This 3D motion drove the motions of the synthetic talking face (B). Video and synthetic stimuli were presented in full color against a dark blue background.

Stimulus distances. Jiang et al. (2007) describe in detail the recording of 3D motion capture data and the testing of the perceptual weighting of stimulus distances. Here follows an outline of their approach.
During video recording, a three-camera, 3-D motion capture system (Qualisys
MCU120/240 Hz CCD Imager) recorded (120 Hz sampling rate per retro-reflector per dimension)
the positions of passive retro-reflectors during infrared flashes not perceptible to the talker.
Twenty retro-reflectors were tracked on the talker’s face, including three on the forehead (for
head-motion correction), eight on the lip contours, three on the chin and six on the cheeks (Figure
2.1a). The forehead points were used to separate and eliminate overall head movement from
speech movement.
The motion capture data were used to compute Euclidean distances among pairs of
stimuli. To do this, the squared difference in position for each of the 17 retro-reflectors
(excluding the forehead markers after stabilizing motion) was calculated at each frame of data in
each of three directions (vertical, horizontal, depth). These squared differences were summed
over the first 280 ms of each token, and their square root was taken to yield a 51-dimensional
vector (17 markers x 3 directions) of Euclidean distances for each pair of syllables. These vectors
were the raw stimulus distances. Perceptual distances were used to obtain perceptual weightings
for the Euclidean stimulus distances. To do so, perceivers identified stimuli in a 23-alternative (all
initial English consonants) forced-choice consonant identification paradigm. Then the resultant
confusion matrices were submitted to a non-metric multidimensional scaling (Kruskal and Wish,
1978) procedure. After the responses were transformed into perceptual space, Euclidean distances
were calculated among consonant pairs in perceptual space. Then perceptual and raw stimulus
distances were submitted to least squares linear minimization (Kailath et al., 2000) to calculate
the weights on the channels of stimulus data that would best account for the perceptual distances.
This weighting procedure was carried out using half of the perceptual identification data. Then
correlations between stimulus pairs with perceptually weighted distances and perceptual-only
distances from the other half of the data were carried out to evaluate the extent to which the
perceptually weighted stimulus distances accounted for the perceptual distances. The current
study used those previously computed perceptually weighted stimulus distances in constructing
the stimulus pairs for discrimination.
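As a concrete illustration of this two-step computation (per-channel Euclidean distances followed by perceptual weighting), a minimal sketch in Python/NumPy is given below. The original analyses were carried out in MATLAB; the array shapes, frame count, and function names here are illustrative assumptions rather than the published implementation.

    import numpy as np

    def channel_distances(token_a, token_b, n_frames=34):
        """Per-channel Euclidean distances between two motion-capture tokens.

        token_a, token_b: arrays of shape (frames, 17, 3) holding the
        head-motion-corrected positions of the 17 non-forehead retro-reflectors
        in three directions. Only the first n_frames samples (about 280 ms at
        120 Hz) contribute. Returns a 51-element vector (17 markers x 3
        directions) of per-channel distances.
        """
        a = token_a[:n_frames].astype(float)
        b = token_b[:n_frames].astype(float)
        sq_diff = (a - b) ** 2              # (frames, 17, 3)
        summed = sq_diff.sum(axis=0)        # sum over time -> (17, 3)
        return np.sqrt(summed).ravel()      # 51-vector of channel distances

    def fit_perceptual_weights(raw_dists, perceptual_dists):
        """Least-squares weights mapping raw channel distances to the Euclidean
        distances between syllables in the MDS perceptual space.

        raw_dists: (n_pairs, 51) matrix of channel distances for syllable pairs.
        perceptual_dists: (n_pairs,) perceptual distances for the same pairs.
        """
        w, *_ = np.linalg.lstsq(raw_dists, perceptual_dists, rcond=None)
        return w

    # Modeled (perceptually weighted) distance for a new syllable pair:
    # d_model = channel_distances(tok1, tok2) @ w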
Discrimination stimuli. The mean duration for all syllables in the experiment was 0.530
s (SD = 0.18, range 0.27 to 1.12). The variation in stimulus duration needed to be taken into
account in order to compare response times across different stimulus pairs. For this reason,
stimulus pairs were organized as triplets based on what is referred to here as an anchor syllable.
Each triplet comprised (1) a same stimulus pair of the same CV syllable (i.e., anchor vs. anchor)
but different tokens, (2) a far pair with a large physical distance (i.e., anchor vs. far), and (3) a
near pair with an intermediate physical distance (i.e., anchor vs. near). Same pairs used different
stimulus tokens to defend against discrimination judgments based on irrelevant stimulus
attributes. Pairs used in Experiments 1 and 2 are listed in Table 2.1.
In Jiang et al. (2007a), the largest stimulus distances were approximately 4, so 4 was
chosen here as the target for far pairs, and 2 was chosen as an intermediate value for near pairs.
In addition, triplets were checked to assure that the far distance was approximately 4 in both
physical and perceptual distance, and the near pair distance was close to 2 in both physical and
perceptual distances.
The anchor stimulus was always presented second, and response time was measured from
its onset, allowing comparisons of response times within triplets. However, by fixing the order of
the stimuli, the need arose for foil trials, so that the initial token was not a key to the correct
response. To the extent possible, anchors were selected that could also serve as a near or far
syllable in a different triplet in order to reduce the need for foil pairs. Same foils were added to
the stimulus set whenever the near or far phoneme in a triplet never served as an anchor in
another triplet. Different foils were added whenever an anchor in one triplet did not appear in any
other triplet; otherwise, that syllable appearing as the first token in a pair could unambiguously indicate a same trial. In all, there were 94 unique stimulus pairs (foils and discrimination pairs).

Table 2.1. Summary of all stimulus pairs in Experiments 1 and 2. Stimuli were organized into triplets for discrimination judgments of same, far, and near stimulus pairs that used the same anchor stimulus.

Experiment 1
Triplet anchor   Pair        Stimulus distance   Role
dA               pA dA       3.92                far
                 pA pA       0                   foil (same)
                 ZA dA       2.32                near
                 dA dA       0                   same
dZA              vA dZA      3.96                far
                 vA vA       0                   foil (same)
                 dA dZA      2.88                near
                 dZA dZA     0                   same
kA               rA kA       4.08                far
                 rA rA       0                   foil (same)
                 hA kA       2.27                near
                 hA hA       0                   foil (same)
                 kA kA       0                   same
lA               tSA lA      4.04                far
                 tSA tSA     0                   foil (same)
                 gA lA       1.96                near
                 gA gA       0                   foil (same)
                 lA lA       0                   same
                 lA kA       2.05                foil (different)
nA               vA nA       4.24                far
                 kA nA       2.15                near
                 nA nA       0                   same
                 nA kA       2.15                foil (different)
tA               bA tA       3.99                far
                 bA bA       0                   foil (same)
                 dZA tA      2.28                near
                 tA tA       0                   same
ZA               fA ZA       4.04                far
                 fA fA       0                   foil (same)
                 tA ZA       2.28                near
                 ZA ZA       0                   same

Experiment 2
Triplet anchor   Pair        Stimulus distance   Role
dA               pA dA       3.92                far
                 pA pA       0                   foil (same)
                 ZA dA       2.32                near
                 ZA ZA       0                   foil (same)
                 dA dA       0                   same
dZA              vA dZA      3.96                far
                 vA vA       0                   foil (same)
                 dA dZA      2.88                near
                 dZA dZA     0                   same
                 dZA tA      2.28                foil (different)
kA               rA kA       4.08                far
                 rA rA       0                   foil (same)
                 hA kA       2.27                near
                 hA hA       0                   foil (same)
                 kA kA       0                   same
nA               vA nA       4.24                far
                 kA nA       2.15                near
                 nA nA       0                   same
                 nA kA       2.15                foil (different)

Note. Stimulus distance is modeled perceptual dissimilarity. Foils are listed below the pair requiring the foil. When a foil is required due to multiple pairs, it is listed below the first of those pairs.
Synthetic stimuli. A set of synthetic stimuli was generated using a synthetic face model
driven by the motion capture data from the natural stimulus set (Jiang et al., 2008). After
removing head motion using the forehead points, the remaining points were used to drive the
synthetic face mesh using a modified radial basis function. All of the synthetic tokens were
matched to the corresponding natural tokens in terms of the frames that were used for
presentation (Figure 2.1b). Thus, the synthetic stimuli were generated to be of the same duration,
with the same start and end frames as the natural video. Importantly, the synthetic speech used the
motion data that was used to calculate the physical Euclidean distances between the stimuli.
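Jiang et al. (2008) describe the synthesis in detail; purely as a sketch of the general idea of driving a dense mesh from sparse markers, radial-basis-function interpolation can propagate marker displacements to every vertex of a face model. The example below uses SciPy's RBFInterpolator with invented variable names; it is not the modified radial basis function actually used to generate the stimuli.

    import numpy as np
    from scipy.interpolate import RBFInterpolator

    def deform_mesh(rest_markers, frame_markers, rest_vertices):
        """Propagate sparse marker displacements to all mesh vertices.

        rest_markers:  (17, 3) marker positions on the neutral face.
        frame_markers: (17, 3) marker positions for the current frame.
        rest_vertices: (V, 3) vertices of the neutral synthetic face mesh.
        Returns the deformed (V, 3) vertex positions for the frame.
        """
        displacement = frame_markers - rest_markers            # (17, 3)
        rbf = RBFInterpolator(rest_markers, displacement,
                              kernel='thin_plate_spline')
        return rest_vertices + rbf(rest_vertices)

    # Applying deform_mesh frame by frame to the 120 Hz motion-capture data
    # yields an animation time-locked to the natural video recording.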
Procedure.
Participants were tested in a discrimination paradigm followed by an identification
paradigm. The order of natural versus synthetic stimuli was the same across paradigms and
counter-balanced across participants. Prior to discrimination testing, participants received
instructions that they would see video clips of a face silently speaking pairs of nonsense syllables,
and that their task was to judge if the two syllables were the same or different. Instructions
emphasized that even when the syllables were the same, the two video clips would be different.
That is, participants were directed to attend to the consonant identity and not to linguistically
irrelevant information. They were also instructed to respond as quickly and as accurately as
possible. To familiarize participants with the task, two brief six-trial practice blocks—one with
natural stimuli, the other with synthetic—preceded the experimental data collection.
Each of the 94 stimulus pairs comprising a discrimination block was presented in a
pseudo-random order. Blocks were repeated 10 times per condition (synthetic or natural), with
each block having a different pseudo-random pair order. The hand used to report same or
different was counter-balanced across participants.
Feedback indicating the correct response was delivered via a pair of blue light-emitting
diodes mounted on the sides of the display. After a response was given, the light corresponding to
the correct response was briefly illuminated.
Following the discrimination task, closed-set perceptual identification was carried out on
the 15 syllables (two tokens each) from the triplets in the discrimination task. Stimuli were
presented singly and were blocked in terms of whether they were natural or synthetic. Participants
identified each CV syllable by using a computer mouse to click one of fifteen labeled response
buttons on a separate display. Each button showed a letter and an example word to identify the
phoneme. Stimuli were presented in pseudo-random order ten times blocked by condition. No
feedback was provided, and participants were informed that response times were not recorded
during the identification task, so they could take as much time as they needed and could guess if
they were unsure. To ensure familiarity with the task, a brief six-trial practice block preceded the
main identification blocks.
Discrimination measures. Analyses were carried out on the discrimination data from
the triplet sets only, not the foils. Percent correct was calculated as the proportion of different
responses for near and far trials, and same responses for same trials. Discrimination sensitivity,
d’, was calculated on a per-triplet basis using the standard normal assumption. The d’ sensitivity
measurement was chosen, because it uses results from both same and different trials in a single
measure and is a bias-free estimate of sensitivity (Green and Swets, 1966; Macmillan and
Creelman, 1991). Within a triplet, the false alarm rate was the proportion of times the different
response was given when the syllables were the same, and the correct detection rate was the
proportion of times the different response was given when the syllables were near or far. When
hit rates were 1.0 or false alarm rates were 0, these values were recalculated as though half of one
trial was a miss or a false alarm, respectively. In this formulation, z(hit rate) – z(false alarm rate)
yields a d’ for the difference between a same pair and a different pair, but to estimate the d’ for
the difference between the syllables in question, that quantity was multiplied by √2 (Macmillan and Creelman, 1991). To summarize, d’ = √2 × [z(hit rate) − z(false alarm rate)].
Discrimination sensitivity was analyzed using repeated-measures analysis of variance (ANOVA).
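A minimal sketch of this d’ computation (Python/NumPy; the reported analyses were run in SPSS and MATLAB), including the half-trial correction and the √2 scaling described above, is given below; the trial counts and function name are illustrative.

    import numpy as np
    from scipy.stats import norm

    def triplet_dprime(n_hits, n_different_trials, n_false_alarms, n_same_trials):
        """d' for one triplet pair (anchor vs. near or anchor vs. far).

        Hits are 'different' responses on different (near/far) trials; false
        alarms are 'different' responses on same trials. Rates of 1.0 or 0 are
        adjusted as though half of one trial were a miss or a false alarm, and
        the result is scaled by sqrt(2) to estimate the d' between the two
        syllables (Macmillan & Creelman, 1991).
        """
        hit = n_hits / n_different_trials
        fa = n_false_alarms / n_same_trials
        hit = min(hit, 1 - 0.5 / n_different_trials)
        fa = max(fa, 0.5 / n_same_trials)
        return np.sqrt(2) * (norm.ppf(hit) - norm.ppf(fa))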
Response time measures. Response time data were analyzed for each subject. The
mean and standard deviation of all correct responses to trials that were not foils were computed.
In order to remove outliers, responses that were more than 2.5 standard deviations from the subject’s mean were excluded from analysis.
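For example, the 2.5-SD trimming rule could be implemented along the following lines (illustrative sketch, not the original analysis code):

    import numpy as np

    def trim_outliers(rts, criterion=2.5):
        """Keep response times within `criterion` SDs of the subject's mean."""
        rts = np.asarray(rts, dtype=float)
        keep = np.abs(rts - rts.mean()) <= criterion * rts.std()
        return rts[keep]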
Identification measures. Percent correct and response entropy were calculated with
perceptual identification data. Percent correct was calculated for each participant and each
syllable.
Because percent correct identification scores do not take into account the responses other
than those on the diagonal of a stimulus-response confusion matrix, Shannon entropy (Shannon,
1948) was calculated for each syllable as H = −Σ_i p_i log₂ p_i, where the sum runs over the n response categories and p_i is the proportion of responses assigned to category i. A low entropy value implies that
responses to the stimulus were assigned to one or a small number of categories, even if they were
incorrect. A high value implies that perceivers distributed their responses uniformly across
available response categories. Thus, the entropy of assigning each stimulus to only one response,
but doing so incorrectly, is exactly the same entropy as making no errors. With the 15 alternatives
in the identification task, entropy can range between 0 for all correct responses or use of one
incorrect response, to 3.9 for an equal number of responses in all cells of the confusion matrix.
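A sketch of this per-syllable entropy computation (illustrative Python, assuming one row of the 15-column stimulus-response confusion matrix as input):

    import numpy as np

    def response_entropy(row_counts):
        """Shannon entropy (bits) of the responses to one stimulus syllable.

        row_counts: counts for each of the 15 response categories (one row of
        the confusion matrix). Empty cells contribute nothing to the sum.
        """
        p = np.asarray(row_counts, dtype=float)
        p = p / p.sum()
        p = p[p > 0]
        return float(-(p * np.log2(p)).sum())

    # All responses in a single (even incorrect) category -> 0 bits;
    # responses spread evenly over 15 categories -> log2(15) ~ 3.91 bits.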
Implicit identification model
To generate simulated responses based on an implicit identification model, the
probability that a pair of syllables, S_1 and S_2, would be assigned the same category label was estimated for each subject from the results of the identification experiment. First, the probability that a stimulus S would be identified as belonging to a particular syllable category C_i was estimated as P(C_i | S) = R_i / 20, where R_i is the number of times the subject identified stimulus S as belonging to category C_i out of the 20 identification trials for that syllable (two tokens, each presented ten times). The probability that a pair of stimuli S_1 and S_2 would be assigned the same specific syllable category label C_i is then the joint probability that the two syllables will independently be assigned that category label. The probability that the two syllables will be assigned the same category label overall is the sum of the probabilities for the L = 15 individual category labels: P(same | S_1, S_2) = Σ_i P(C_i | S_1) · P(C_i | S_2). Because the only two options in a same/different task are for the syllables to be identified as the same or a different category, the probability that the two syllables will be identified as different is the complement of the probability that they will be identified as same: P(different | S_1, S_2) = 1 − P(same | S_1, S_2).
With these equations, simulated same/different response probabilities were calculated for
each participant and then converted to d’ following the method above for actual responses. The
predictions of the implicit identification model were compared to the actual d’ results using
repeated-measures ANOVA. The correlation of the implicit identification model predicted d’
with actual d’ was compared with the correlation of the perceptual dissimilarity model with actual
d’ to see which model was a better predictor of discrimination d’.
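A sketch of how simulated same/different probabilities and the corresponding d’ might be computed under these equations is given below (illustrative Python; the row-normalization step, indexing scheme, and trial count are assumptions, and the actual simulation details may have differed).

    import numpy as np
    from scipy.stats import norm

    def identification_probabilities(confusion_counts):
        """Row-normalize a subject's identification counts to give P(C_i | S)."""
        counts = np.asarray(confusion_counts, dtype=float)
        return counts / counts.sum(axis=1, keepdims=True)

    def implicit_identification_dprime(p_id, s1, s2, anchor, n_trials=20):
        """Simulated discrimination d' for one pair under the model.

        p_id:   (n_syllables, n_categories) matrix of P(C_i | S).
        s1, s2: indices of the two syllables forming the 'different' pair.
        anchor: index of the anchor syllable (the 'same' pair is two tokens
                of the anchor).
        """
        p_same_diff_pair = float(np.dot(p_id[s1], p_id[s2]))        # P(same | S1, S2)
        p_same_same_pair = float(np.dot(p_id[anchor], p_id[anchor]))
        hit = 1.0 - p_same_diff_pair   # "different" response on a different pair
        fa = 1.0 - p_same_same_pair    # "different" response on a same pair
        # mirror the half-trial correction used for the observed rates
        hit = min(max(hit, 0.5 / n_trials), 1 - 0.5 / n_trials)
        fa = min(max(fa, 0.5 / n_trials), 1 - 0.5 / n_trials)
        return np.sqrt(2) * (norm.ppf(hit) - norm.ppf(fa))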
Analyses. Analyses of identification proportion correct were carried out in parallel with
and without the arcsine transformation to stabilize variance. All of the results were replicated
across transformed and untransformed scores. For simplicity and ease of interpretation, the
proportion correct results are presented. Because correlation coefficient distributions are bounded
and skewed, group-level comparisons of single-subject correlation coefficients were done with
paired-samples permutation testing, a nonparametric method that is robust to outliers and non-
normality (Edgington and Onghena, 2007).
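A paired-samples permutation test of this kind can be sketched as a sign-flipping test on the within-subject differences (illustrative Python; random resampling with a fixed number of permutations, rather than exhaustive enumeration, is an assumption here):

    import numpy as np

    def paired_permutation_test(x, y, n_perm=10000, seed=0):
        """Two-sided paired-samples permutation test via sign-flipping.

        x, y: paired per-subject values (e.g., correlation coefficients for
        natural vs. synthetic stimuli). Under the null hypothesis the sign of
        each within-subject difference is exchangeable. With 12 subjects the
        4096 sign patterns could also be enumerated exactly.
        """
        rng = np.random.default_rng(seed)
        d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
        observed = d.mean()
        signs = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
        null = (signs * d).mean(axis=1)
        return float(np.mean(np.abs(null) >= abs(observed)))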
Repeated-measures analysis of variance (ANOVA) degrees of freedom were corrected for violations of sphericity using the Huynh-Feldt correction (ε̃), when there were more than two
levels of a within-subjects factor. Nuisance factors of presentation order and button side mapping
were initially included in ANOVA models but were removed unless their effect was reliable. To
provide a measure of effect size, η² values are reported. Statistically significant effects in
ANOVAs were followed-up with paired comparisons using Bonferroni correction, except where
noted. Error bars in figures are within-subjects 95% confidence intervals (Morey, 2008).
Statistical analyses were carried out using SPSS version 17.0 and MATLAB 7.10.
Results
Discrimination. The modeled perceptual dissimilarity (near, far) and stimulus
condition (natural, synthetic) factors in the discrimination design were the main interests for
Experiment 1. However, the design also included the stimulus triplet factor. In order to determine
whether to pool results across stimulus triplets, a repeated measures ANOVA was carried out
with within-subjects factors of anchor consonant (/dA, dZA, kA, lA, nA, tA, ZA/), stimulus distance
(far, near), and stimulus condition (natural, synthetic), with pooling across the different tokens
within each of the triplets.
Stimulus distance was a reliable main effect, with stimulus pairs of predicted far perceptual dissimilarity discriminated better than near perceptual dissimilarity, F(1, 11) = 337.5, η² = .573, p < .001 (far mean d’ = 3.67, near mean d’ = 1.10), as was stimulus condition, F(1, 11) = 65.9, η² = .108, p < .001 (natural mean d’ = 2.94, synthetic mean d’ = 1.83). However, anchor consonant was a reliable main effect, F(3.59, 39.49) = 6.2, p = .001, as were its interactions with distance, F(2.1, 22.8) = 3.7, p = .04, and with stimulus condition, F(3.4, 37.3) = 6.3, p = .001. There was also a three-way interaction of anchor consonant with condition and distance, F(4.0, 44.3) = 4.9, p = .002. The anchor effects showed that the d’ measures varied depending on the specific phonemes within a triplet set, and so the analysis could not be simplified by pooling across anchor.
In order to investigate the main issues related to distance and condition, separate
repeated-measures ANOVAs, with within-subjects factors of distance (far, near) and condition
(natural, synthetic), were carried out for each of the seven anchors and their near and far stimuli.
Results are reported in Table 2.2 (see Figure 2.2 also).

Table 2.2. Analyses of variance on d’ for the triplets in Experiment 1. Each of the anchor stimuli and its triplet set were analyzed separately (see text).

Anchor   Source                           df   F       η²    p      Mean difference
/dA/     distance (far, near)             1    364.3   .77   .001   2.10 [1.86 2.35]
         condition (natural, synthetic)   1    9.7     .08   .010   0.69 [0.20 1.17]
         distance with condition          1    1.62    .00   .230
         error                            11   .14
/dZA/    distance (far, near)             1    220.6   .74   .000   2.73 [2.32 3.13]
         condition (natural, synthetic)   1    20.2    .12   .001   1.10 [0.56 1.64]
         distance with condition          1    0.1     .00   .770
         error                            11   .14
/kA/     distance (far, near)             1    217.6   .89   .000   3.09 [2.63 3.55]
         condition (natural, synthetic)   1    4.4     .01   .059   0.34 [-0.02 0.69]
         distance with condition          1    0.4     .00   .530
         error                            11   .09
/lA/     distance (far, near)             1    32.4    .36   .000   2.17 [1.33 3.01]
         condition (natural, synthetic)   1    51.6    .40   .000   2.29 [1.59 3.00]
         distance with condition          1    0.4     .01   .530
         error                            11   .24
/nA/     distance (far, near)             1    78.7    .60   .000   2.92 [2.19 3.64]
         condition (natural, synthetic)   1    25.6    .15   .003   1.46 [0.60 2.33]
         distance with condition          1    6.0     .02   .032
         error                            11   .23
/tA/     distance (far, near)             1    217.7   .74   .000   2.80 [2.39 3.22]
         condition (natural, synthetic)   1    18.4    .12   .001   1.12 [0.55 1.70]
         distance with condition          1    10.1    .01   .009
         error                            11   .12
/ZA/     distance (far, near)             1    156.6   .70   .000   2.18 [1.80 2.56]
         condition (natural, synthetic)   1    12.0    .10   .005   0.80 [0.29 1.31]
         distance with condition          1    5.7     .02   .035
         error                            11   .18

Note. Mean difference is level 1 minus level 2 (i.e., far minus near, natural minus synthetic). Values in brackets are lower and upper 95% confidence intervals.

Each triplet headed by a different anchor consonant resulted in reliable main effects of distance category, with d’ for far pairs higher than
d’ for near pairs, and reliable main effects of stimulus condition, with d’ for natural pairs higher
than d’ for synthetic pairs. There were also interactions of distance category with stimulus
condition in the three triplets with the individual anchors /nA/, /tA/, and /ZA/. For triplets with
anchor syllables /nA/ and /tA/, the interaction is due to a larger effect of natural versus synthetic
stimulus condition in the near pair than in the far pair. For the triplet with anchor
syllable /ZA/, the interaction is due to a smaller effect of stimulus condition in the near pair than
in the far pair.
Figure 2.2. Group mean d’ sensitivity for Experiment 1. The large panel (left) shows results collapsed across stimulus triplets, and the small panels show results for each triplet. Error bars are within-subjects 95% confidence intervals.

Correlations between d’ and modeled perceptual dissimilarity. The stimuli for the discrimination task were organized in triplet sets primarily in order to carry out response time comparisons that controlled for anchor duration. However, that approach converted the metric relationships among all the stimuli to distance categories. In order to test whether the modeled perceptual distance accounted for discrimination sensitivity directly, Pearson correlation coefficients were computed between modeled perceptual distance and mean d’ across all stimulus pairs and participants (Figure 2.3). The correlation for natural speech was r(166) = .676, p < .001, and the correlation for synthetic speech was r(166) = .828, p < .001.
The correlations were also carried out using raw stimulus distances (Jiang et al., 2007).
Much greater variance was accounted for using the modeled than the raw distances. With raw
distances, the correlations were reliable but small, for natural speech, r (166) = .227, p = .003,
and for synthetic speech, r (166) = .253, p = .001. Variance accounted for with the modeled
distances was 46% for natural speech and 68% for synthetic speech. With the raw distances, variance
accounted for was 6% synthetic speech and 5% natural speech.
Because data pooled across participants need not accurately reflect individual response
patterns, Pearson correlations were also computed using individual participant d’ sensitivities
versus modeled perceptual dissimilarity. With natural stimuli, the correlations between d’ and
modeled perceptual distance were statistically reliable (p < .05) in 10 out of the 12 participants,
mean r(12) = .720, range r(12) = .446 to .893 (variance accounted for range 19.9% to 79.7%). None of the correlations based on the raw Euclidean physical distances for natural speech were reliable, mean r(12) = .224, range r(12) = .072 to .501. With synthetic speech, the correlations
between d’ and modeled perceptual distance were reliable for all 12 participants, mean r
(12) = .861, range r (12) = .710 to .981 (variance accounted for range 50.4% to 96.2%); but there
were no statistically reliable correlations between raw stimulus distances and d’ sensitivities,
mean r (12) = .262, range r(12) = .072 to .375. To assess the difference between single-subject
correlation coefficients between the physical dissimilarity measure and d’ from natural (mean
r(12) = .720) and synthetic speech (mean r(12) = .861), single-subject correlation coefficients
were submitted to a paired-samples permutation test. Correlation coefficients were reliably higher
for synthetic compared to natural speech, p = .0044.
Inspection of the scatter plot for perceptual distance versus d’ sensitivity (Figure 2.3)
suggested that stimulus pairs, /lA/-/gA/ and /nA/-/kA/ were responsible for the reliably higher
correlation coefficients of modeled perceptual dissimilarity with synthetic speech d’ compared to
the correlation coefficients of modeled perceptual dissimilarity with natural speech d’. To assess
whether the difference in correlation coefficients was a general property of the dataset, or was a
result of the inclusion of those two pairs, a post-hoc re-analysis was run excluding those two
pairs.

Figure 2.3. Correlations between sensitivity and perceptual distance. Group mean d’ sensitivity versus perceptual distance for natural (A) and synthetic (B) syllable pairs. Error bars are within-subjects 95% confidence intervals. Perceptual distance is the modeled perceptual distance based on the 3D motion of a sparse set of points on the talking face. Points marked with asterisks are the pairs /lA/ - /gA/ and /nA/ - /kA/. Dashed line is a fit to the full dataset; dotted line is a fit to the dataset excluding pairs marked with asterisks (see text).

Without those stimulus pairs, results were comparable across conditions. For example, in
the pooled data, the correlation of modeled perceptual dissimilarity with natural speech stimuli
became r(142) = .806, p < .001, and the correlation of modeled perceptual dissimilarity with
synthetic speech stimuli became r(142) = .812, p < .001. The reason that discrimination of these
two stimulus pairs was well-predicted for synthetic speech, but not natural speech, cannot be
determined from the present study (see discussion). However, the apparent overall difference in
prediction success of the dissimilarity model for synthetic stimuli compared to natural stimuli
appears largely attributable to better prediction for these two stimulus pairs specifically.
Response times. Response time was obtained as a converging measure of perceptual
dissimilarity, with lower response times predicted when sensitivity was greater. After removing
outlier data, 97.3% of the responses (12,545) were retained. Response times were generally long
(individual participant M = 1,231 to 1,781 ms), because they were measured from the onset of the
second stimulus, whose mean duration was 530 ms. Results are summarized in Figure 2.4.
Figure 2.4. Mean response times for Experiment 1 discrimination. Group mean response times pooled
over anchor syllable are shown in the left panel, and group mean response times for each anchor syllable are
shown in the right panels. Error bars are within-subjects 95% confidence intervals.
A repeated-measures ANOVA was carried out for response times with within-subjects
factors of stimulus condition (natural, synthetic), distance category (same, near, far), and anchor
consonant (/dA, dZA, kA, lA, nA, tA, ZA/). Presentation order (natural first, synthetic first) was
retained in this analysis because of an interaction of presentation order with anchor consonant and
distance category (see below). One participant produced no correct answers for the near pair /lA/-
/gA/ in the synthetic speech condition, so that participant’s data were not included in the omnibus
ANOVA.
Presentation order was not a reliable factor, F(1, 9) = 0.3, η² = .024, p = .624, but its interaction with stimulus condition was reliable, F(1, 9) = 19.2, η² = .020, p = .002. The anchor consonant interactions were all reliable: anchor consonant with stimulus condition, F(3, 27) = 6.3, ε̃ = .5, η² = .013, p = .002; anchor consonant with distance category, F(12, 108) = 4.4, ε̃ = 1.0, η² = .021, p < .001; anchor consonant with stimulus condition and distance category, F(8, 71.9) = 3.4, ε̃ = .67, η² = .016, p = .002; and anchor consonant with distance category and presentation order, F(12, 108) = 2.70, ε̃ = 1.0, η² = .013, p = .003.
To investigate these interactions, repeated-measures ANOVAs were carried out for each
anchor consonant with within-subjects factors of distance category (far, near, same) and stimulus
condition (natural, synthetic) and between-subjects factor of presentation order (natural first,
synthetic first). The results are reported in Table 2.3 (see also Figure 2.4). All the main effects of
stimulus condition and of distance category were reliable. Responses to natural stimuli were
always faster than to synthetic stimuli, and far pair responses were always faster than near pair
responses or same pair responses. The effect sizes were uniformly larger for stimulus condition
than for distance in these analyses.
Distance with presentation order was never a reliable interaction, nor was there a reliable
three-way interaction of distance with stimulus condition and presentation order. The interaction
of distance with stimulus condition was reliable in triplets with anchors /dZA/ and /ZA/.
Examination of the estimated marginal means showed that for the /ZA/ triplet, the response time
difference between natural and synthetic stimuli in the far pair (M = 383 ms) was larger than the
difference in the near pair (M = 197 ms) and the same pair (M = 204 ms). For the /dZA/ triplet,
the effect of stimulus condition in the near pair (M = 182 ms) was smaller than the effect of
stimulus condition in the same pair (M = 430 ms), while the effect of condition in the far pair was not reliably different from either.

Table 2.3. Analyses of variance for response times in Experiment 1. Each of the anchor stimuli and its triplet set were analyzed separately (see text).

Anchor   Source                           df          F       ε̃      η²    p
/dA/     distance (far, near, same)       2, 20       17.8    1      .29   .000
         condition (natural, synthetic)   1, 10       198.2   1      .46   .000
         distance with condition          2, 20       1.6     1      .01   .230
/dZA/    distance (far, near, same)       2, 20       23.8    1      .30   .000
         condition (natural, synthetic)   1, 10       348.7   1      .41   .000
         distance with condition          1.6, 16.6   7.2     0.83   .04   .004
/kA/     distance (far, near, same)       2, 20       17.8    1      .41   .000
         condition (natural, synthetic)   1, 10       129.2   1      .31   .000
         distance with condition          1.4, 14.2   3.9     0.71   .02   .055
/lA/     distance (far, near, same)       1.9, 17.4   17.8    0.97   .11   .000
         condition (natural, synthetic)   1, 9        88.6    1      .65   .000
         distance with condition          2, 18       0.9     1      .01   .436
/nA/     distance (far, near, same)       2, 20       23.0    1      .30   .000
         condition (natural, synthetic)   1, 10       80.4    1      .42   .000
         distance with condition          2, 20       0.8     1      .00   .471
/tA/     distance (far, near, same)       2, 20       31.8    1      .43   .000
         condition (natural, synthetic)   1, 10       70.9    1      .26   .000
         distance with condition          1.4, 14.2   0.1     0.71   .00   .819
/ZA/     distance (far, near, same)       2, 20       17.3    1      .27   .000
         condition (natural, synthetic)   1, 10       147.9   1      .36   .000
         distance with condition          2, 20       6.3     1      .04   .008
Correlations between response time and modeled perceptual distance. The
correlation between perceptual distance and response time for natural speech syllable pairs pooled
across participants was negative, such that pairs with larger modeled perceptual distance were
responded to more quickly, r(250) = −.433, p < .001, as was found for synthetic pairs, r(249) = −.482, p < .001. Correlations using single-subject data from natural stimuli were reliable for 9 out
of 12 participants, mean r(12) = -.550, range r(12) = -.812 to -.125. Reliable correlations with
responses to synthetic stimuli were found for 11 out of 12 participants, mean r(12) = -.630, range
r(12) = -.860 to -.354. No difference in correlation coefficients was obtained between natural and
synthetic stimuli.
Summary of Discrimination Results. Natural speech was better discriminated than
synthetic speech. In both natural and synthetic speech conditions, discrimination of far pairs was
better than discrimination of near pairs. Moreover, modeled perceptual dissimilarity values were
reliably correlated with discrimination d’, and the correlation was stronger for synthetic stimuli
than it was for natural stimuli. The reduced correlation between modeled perceptual dissimilarity
and natural speech discrimination was largely attributable to two stimulus pairs that were
discriminated better than would be predicted based on the dissimilarity model.
Percent correct identification. To examine whether the stimuli were perceived as
speech, percent correct was calculated for each participant and each stimulus condition, and a
95% confidence interval was established for natural and synthetic speech trials using standard
binomial statistics. Individual percent correct scores ranged from mean 37% to 51% for natural
speech and from mean 26% to 47% for synthetic speech. No individual participant confidence
interval included the percent correct expected by chance, 6.67%. These accuracy levels are
similar to those Jiang et al. (2007) obtained for the same talker with natural speech (in the /A/
context the range of correct scores was 35%-45%) and support the conclusion that both the
synthetic and natural stimuli were perceived as speech by all of the participants.
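The exact binomial procedure is not specified beyond standard binomial statistics; as one illustration only, a normal-approximation confidence interval per participant could be computed as follows (an assumed method, not necessarily the one used):

    import numpy as np
    from scipy.stats import norm

    def percent_correct_ci(n_correct, n_trials, level=0.95):
        """Normal-approximation binomial confidence interval for one participant."""
        p = n_correct / n_trials
        z = norm.ppf(1 - (1 - level) / 2)
        half = z * np.sqrt(p * (1 - p) / n_trials)
        return max(0.0, p - half), min(1.0, p + half)

    chance = 1 / 15   # 15 response alternatives, ~6.67% correct expected by chance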
Group mean percent correct phoneme identification is shown in the upper panel of Figure
2.5. Confusion matrices are shown in Figure 2.6.

Figure 2.5. Group mean phoneme identification percent correct and entropy in Experiment 1. Group mean percent correct (upper panel) and Shannon entropy (lower panel) for the syllable identification task. Error bars show within-subjects 95% confidence intervals. Correct identification was reliably above chance (6.7%, dashed line) for all syllables except /gA/ in the natural condition and /lA/ and /nA/ in the synthetic condition. Even for cases with low percent correct identification, entropy was generally low, well below the theoretical maximum for this task (3.91, dashed line), indicating that responses for a given syllable stimulus were typically organized into a small number of response categories.

A repeated measures ANOVA was carried out with the within-subjects factors of stimulus condition (natural, synthetic), consonant (/bA/, /tSA/, /dA/, /fA/, /gA/, /hA/, /dZA/, /kA/, /lA/, /nA/, /pA/, /rA/, /tA/, /vA/, /ZA/), and token (Token 1, Token 2). There was not a reliable main effect of token, but there was an interaction of consonant with token, F(14, 154) = 4.70, p < .001, η² = .0181. However, the token factor and its interactions
accounted for less than 2% of the variance in the data, so subsequent analyses were carried out
with data pooled across tokens to reduce the complexity of the dataset. A repeated-measures
ANOVA with the within-subjects factors stimulus condition (natural, synthetic) and consonant
(/bA/, /tSA/, /dA/, /fA/, /gA/, /hA/, /dZA/, /kA/, /lA/, /nA/, /pA/, /rA/, /tA/, /vA/, /ZA/), showed that
natural speech was perceived more accurately (mean percent correct = 44.2) than was synthetic
speech (mean percent correct = 33.4), F(1, 11) = 96.2, η² = .032, p < .001. Consonant was also a reliable factor, F(12.2, 134.6) = 6.91, η² = .264, ε̃ = .874, p < .001, as was its interaction with condition, F(10.9, 120.0) = 16.52, η² = .169, ε̃ = .779, p < .001.
Follow-up paired t-tests showed that natural speech consonants /fA, hA, dZA, lA, nA, rA/
were more accurately identified than synthetic speech consonants, and synthetic speech /gA/ was
more accurately identified than its natural counterparts, p < 0.05 (uncorrected).
Figure 2.6. Identification confusion matrices. Group response proportions are shown for identification
of A. natural stimuli and B. synthetic stimuli. The initial consonant of the CV stimulus (with /a/ in the
vowel context) is shown at the head of each row, and responses to that stimulus are in separate columns.
Correct responses fall on the diagonal, and incorrect responses are in off-diagonal cells of the matrix.
Identification entropy. Percent correct consonant scores used only the diagonal of the
stimulus-response matrix, and identification responses could incorporate response bias. Shannon
entropy measures entropy in response selection, independent of response correctness. The
minimum response entropy was 0 in both conditions, and the obtained maxima were 3.35 and
3.30 for natural and synthetic stimuli, respectively (maximum possible entropy was 3.91). Overall
subject mean entropy scores were calculated as the mean of the entropies over the 15 stimulus
consonants. The values of overall subject mean entropy for natural stimuli ranged from 0.66 to
1.84, with mean = 1.19, and the values for synthetic stimuli ranged from 1.15 to 2.24, with
mean = 1.61.
Group mean Shannon entropy for each syllable is shown in the lower panel of Figure 2.5.
A repeated-measures ANOVA was carried out on entropies with within-subjects factors of
stimulus condition (natural, synthetic) and consonant (/bA/, /tSA/, /dA/, /fA/, /gA/, /hA/, /dZA/, /kA/,
/lA/, /nA/, /pA/, /rA/, /tA/, /vA/, /ZA/) and between-subjects factor of presentation order (natural
first, synthetic first). Lower entropy was obtained with natural than with synthetic stimuli,
F(1, 10) = 77.3, η² = .092, p < .001 (natural, mean entropy = 1.19; synthetic, mean entropy = 1.61). Consonant entropy also varied reliably, F(7.9, 79.2) = 15.7, η² = .385, ε̃ = .566, p < .001, as did the interaction between consonant and stimulus condition, F(12.5, 124.7) = 8.1, η² = .104, ε̃ = .891, p < .001. Follow-up paired comparisons showed that there was significantly lower entropy for natural compared to synthetic stimuli for /tSA, dA, fA, hA, lA, nA, tA/, p < .05 (uncorrected). There was also a reliable interaction between condition and presentation order, F(1, 10) = 5.0, p = .049, but the variance accounted for by this order was extremely small, η² = .006.
Inspection of scatter plots between percent correct and entropy suggested that low percent
correct could be associated with a range of entropies, but high percent correct is predictably
associated only with low entropy. Pooled across participants, the correlation between entropy and
percent correct with natural speech was r(178) = -.487, p < .001, and between entropy and
percent correct with synthetic stimuli was, r(178) = -.446, p < .001. Individual participant
correlations for natural stimuli were mean r(13) = -.561, ranging from -.901 to -.301, with 6 out
of 12 p < .05, and for synthetic stimuli, were mean r(13) = -.454, ranging from -.776 to -.036,
with 5 out of 6 p < .05. The individual results suggest wide individual differences. But most of
the individual results are consistent with the correlations on the pooled data.
Summary of identification results. Natural speech was more intelligible than
synthetic speech, but participants were able to identify the consonants in both and use the
information systematically, although some of the systematic response pattern was due to
systematic errors.
The implicit identification model. The predictions of the implicit identification
model were compared with d’ from the discrimination task by submitting both to a repeated
measures ANOVA with factors of measurement source (discrimination, model), anchor
consonant (/dA, dZA, kA, lA, nA, tA, ZA/), stimulus distance (far, near), and stimulus condition
(natural, synthetic). Because the main factor of interest for this analysis was measurement source,
only the main effect of measurement source and its interactions with the other factors are
reported.
There was a main effect of measurement source, with higher d’ from discrimination
(M = 2.4) than from the model (M = 2.0), F(1, 11) = 9.8, η² = .08, p = .009; however, there was an interaction of distance with measurement source, F(1, 11) = 31.6, η² = .12, p < .001, and interactions of measurement source with anchor consonant, F(6, 66) = 1.9, η² = .05, ε̃ = 1, p = .025; measurement source with anchor consonant and distance, F(4.5, 49.3) = 2.8, η² = .02, ε̃ = .748, p = .032; and a four-way interaction of measurement source with anchor, condition, and distance, F(5.1, 56.4) = 6.0, η² = .02, ε̃ = .855, p < .001.
To follow up on all of these interactions, separate repeated-measures ANOVAs were run
with data from each triplet. The factors for each triplet were, as in the omnibus ANOVA,
measurement source, condition, and distance. A main effect of measurement source (p < .05) was
found for triplets with anchors /dA, nA, dZA/, with discrimination d’ > model predictions. A
reliable (p < .05) interaction of distance with measurement source was found for triplets with
anchors /kA, dA, nA, ZA, tA, dZA/, in all cases reflecting a larger effect of measurement source
(i.e., discrimination d’ > model predictions) for the far pair in the anchor compared to the near
pair. For the triplet with anchor /lA/, a reliable interaction of condition with measurement source,
F(1,11) = 6.6, p = .026 reflected a higher discrimination d’ than model predictions with synthetic
stimuli, but the opposite relationship with natural stimuli. Three-way interactions of measurement
source with stimulus condition and pair distance were reliable (p < .05) for triplets with anchors
/ZA, tA, dZA/. In all cases, this interaction reflected differences across conditions in the size of the
advantage of discrimination d’ over the model prediction in far pairs. To summarize, with the
exception of the triplet with anchor /lA/, the pattern held that discrimination d’ was greater than
predicted by the implicit identification model, in the far pair, but not generally for the near pair.
Comparison of prediction of d’: perceptual dissimilarity vs. implicit
identification. Above, modeled perceptual dissimilarity and discrimination d’ were found to be
reliably correlated and discrimination d’ was found to be higher than predicted from the implicit
identification model. To see whether modeled dissimilarity is a better predictor of discrimination
d’ than the implicit identification model, single-subject Pearson correlation coefficients between
discrimination d’ and modeled dissimilarity were compared to Pearson correlation coefficients
between discrimination d’ and the implicit identification model. Comparisons were made
separately for data collected using natural and synthetic stimuli using paired-samples permutation
tests. Model prediction success was not reliably different for natural stimuli, p = .44, implicit
identification with discrimination mean r(12) = 0.76 (N=12), physical dissimilarity with
discrimination mean r(12) = 0.72 (N=12), but they were reliably different for synthetic stimuli
with the correlation between modeled perceptual dissimilarity and discrimination d’ stronger than
the correlation between the implicit identification model and discrimination d’, p = .007, implicit
identification with discrimination mean r(12) = 0.78 (N=12), physical dissimilarity with
discrimination mean r(12) = 0.86 (N=12). This pattern of results replicated in the pooled data: for
natural speech, the pooled correlation between d’ and perceptual dissimilarity was r(166) = .676
and the pooled correlation between d’ and the implicit identification model was r(166) = .694.
For synthetic speech, the pooled correlation between d’ and perceptual dissimilarity was r(166) =
.828 and the pooled correlation between d’ and the implicit identification model was r(166) =
.778.
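The paired-samples permutation test used for these comparisons can be illustrated with the following sketch, in which per-subject correlation coefficients for the two predictors are compared by randomly flipping the sign of each subject's difference. Variable names and the random data are hypothetical; this is a schematic of the general procedure, not the original analysis code.

import numpy as np

rng = np.random.default_rng(0)

def paired_permutation_p(r_model, r_ident, n_perm=10000):
    # Two-sided p-value for the mean per-subject difference in correlation,
    # obtained by sign-flipping the paired differences.
    d = np.asarray(r_model) - np.asarray(r_ident)
    observed = d.mean()
    signs = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
    null = (signs * d).mean(axis=1)
    return float(np.mean(np.abs(null) >= abs(observed)))

# Example with 12 hypothetical per-subject correlations
r_dissimilarity = rng.uniform(0.75, 0.95, 12)
r_identification = rng.uniform(0.65, 0.90, 12)
print(paired_permutation_p(r_dissimilarity, r_identification))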
Above, it was observed that two pairs of syllables (/lA/-/gA/ and /nA/-/kA/) were
discriminated better than predicted based on their modeled physical dissimilarity. To see if the
difference in correlation strength for natural and synthetic stimuli was attributable to those
stimulus pairs, the above analysis was re-run excluding those pairs. With those pairs excluded,
physical dissimilarity’s correlation with discrimination d’ was reliably stronger than the
correlation of identification d’ with discrimination d’ for natural stimuli, p = 0.01, identification
with discrimination mean r(10) = 0.78 (N=12), physical dissimilarity with discrimination mean
r(10) = 0.86 (N=12), and for synthetic stimuli, p = 0.006, identification with discrimination mean
r(10) = 0.76 (N=12), physical dissimilarity with discrimination mean r(10) = 0.85 (N=12). Again,
this pattern of results replicated in the pooled data. For natural speech, the pooled correlation
between d’ and perceptual dissimilarity was r(144) = .806 and the pooled correlation between d’
and the implicit identification model was r(144) = .717. For synthetic speech, the pooled
correlation between d’ and perceptual dissimilarity was r(144) = .812 and the pooled correlation
between d’ and the implicit identification model was r(144) = .762.
Discussion
The discrimination results showed that perceptual sensitivity increased and response time
decreased with modeled perceptual distance. The pattern of results was consistent across different
stimulus triplet sets and was replicated across natural and synthetic stimuli. But the natural
stimuli were more discriminable than the synthetic stimuli, as reflected in the larger d’ and faster
latency measures.
As was done in Jiang et al. (2007), both the raw physical dissimilarity measures and the
modeled dissimilarities were entered into correlations to evaluate their ability to account for the
perceptual results. Here, as before, there were small reliable correlations between the sensitivity
and the raw physical measures, but the correlations using the modeled dissimilarities were far
stronger. The difference between the model and the raw physical measure is that the model
includes scaling of the dimensions of the 3D motion based on perceptual data. The relative
success of the perceptual dissimilarity model compared to the raw physical measure supports one
of the conclusions of Jiang et al., that appropriate transformations must be applied to physical
measures in order to predict perceptual results.
The perceptual dissimilarity model better accounted for discrimination of synthetic
speech than it did for discrimination of natural speech. The strong correlation between modeled
dissimilarity and discrimination d’ for synthetic speech shows that the information on which these
discriminations were made is well-captured by measurement of 3D optical data. Of course, this
does not mean that the discrimination is based on the 3D optical data directly, as 3D motion of
the talking face is complex and internally correlated (Munhall and Vatikiotis-Bateson, 1998). The
implication of the relatively weaker (but still reliable) correlation between perceptual
dissimilarity and discrimination d’ for natural speech is that discrimination of natural syllables is
based, at least in part, on information that is not well-captured by the perceptual dissimilarity
model.
We observed relatively good discrimination of two pairs of natural syllables that were
predicted to be perceptually near and therefore predicted to be relatively difficult to discriminate.
These two pairs, /lA/-/gA/ and /nA/-/kA/, presumably derive their distinctiveness in natural stimuli
from differences that are not as well-captured by the 3D motion data, and therefore are not
captured by the perceptual dissimilarity model. The model includes perceptual data as well,
including data from the syllables that constitute these pairs, but because the Jiang et al. model
estimates one weighting vector and applies that to the 3D motion data from all recorded syllables,
a small subset of syllable identification data would not have a large impact on the resulting
weighting vector, but could lead to a mismatch between prediction and result of the sort observed
here.
One candidate for stimulus information that is not captured by the 3D motion data but
may be used for visual speech perception is visible tongue motion, which has been shown to be
important in successful visual speech perception (Rosenblum et al., 1996). Analyzing data both
including and excluding these two stimulus pairs highlights the general success of the
dissimilarity model in accounting for discrimination, but also points toward potentially important
sources of visible speech information that the model fails to capture. When those pairs were
excluded, the correlation between modeled physical dissimilarity and discrimination d’ was
comparable for both natural and synthetic stimuli, but d’ was reliably higher for natural than for
synthetic stimuli even for those triplets that did not include those pairs. Taken together, these
observations suggest that although the perceptual dissimilarity model fails to capture some visible
speech information in the natural speech stimulus, the visible speech information that is captured
by the model is more effectively conveyed by the natural speech stimulus.
The identification results showed that most of the syllables were identified reliably above
chance (see Figure 2.5). This result is particularly relevant for the synthetic speech, as the
synthesis was based on fairly sparse 3D optical sampling, and that sampling was the same
sampling used in the perceptual model. The success of the synthetic talking face in conveying CV
syllable identity is further evidence that the 3D optical sampling captures information that is used
for visual speech perception. The response entropy measure showed that even when accuracy was
low (as it was in the synthetic speech condition), in many cases responses were clustered into
relatively few categories. This clustering is consistent with previous observations that visual
speech identification errors are often systematically clustered (Walden et al., 1977; Auer and
Bernstein, 1997) and indicates that even when accuracy is relatively poor, sufficient information
is available to place a stimulus within a group of stimuli that are similar to each other (i.e., phoneme equivalence classes; Auer and Bernstein, 1997).
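Response entropy of this kind can be computed directly from the distribution of identification responses for a stimulus, as in the short sketch below; the response counts are hypothetical and this is not the original analysis code. Low entropy indicates that responses cluster into few categories even when accuracy is low.

import numpy as np

def response_entropy(counts):
    # Shannon entropy (bits) of a response distribution over categories.
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return float(-(p * np.log2(p)).sum())

# Hypothetical counts for one synthetic syllable across response categories
print(response_entropy([30, 20, 8, 2]))  # clustered responses -> low entropy
print(response_entropy([5] * 12))        # diffuse responses -> high entropy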
A comparison between d’ from discrimination and from predictions of an implicit
identification model showed that visual speech discrimination is overall better than would be
predicted if same/different judgments were based strictly on estimated syllable identity. Although
discrimination d’ was predicted equally well by an implicit identification model and modeled
perceptual dissimilarity values for natural speech, for synthetic speech discrimination d’ was
better predicted by the perceptual dissimilarity model. As above, when these analyses were
conducted on data excluding the pairs /lA/-/gA/ and /nA/-/kA/, the difference between natural and
synthetic results was reduced such that under both conditions, the physical dissimilarity model of
Jiang et al. (2007) better predicted discrimination d' than did the implicit identification model.
This result shows that visual speech discrimination is based on more than just implicit syllable
categorization. As an alternative, it is suggested that visual speech discrimination may be
supported by comparison of visual phonetic forms, a suggestion revisited in the general
discussion.
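One common way to construct such an implicit identification prediction is to assume that each member of a pair is covertly labeled according to its identification confusion probabilities and that "different" is reported whenever the two covert labels differ; the hit and false alarm rates so obtained are then converted to d'. The sketch below illustrates that logic with hypothetical identification probabilities and should not be taken as the exact formulation used in this dissertation.

import numpy as np
from scipy.stats import norm

def p_respond_different(p_x, p_y):
    # Probability that covert labels drawn from the two identification
    # distributions disagree on a single trial.
    return 1.0 - float(np.dot(p_x, p_y))

def dprime(hit, fa, floor=0.01, ceil=0.99):
    hit, fa = np.clip([hit, fa], floor, ceil)
    return float(norm.ppf(hit) - norm.ppf(fa))

# Hypothetical identification probabilities over three response categories
p_zha = np.array([0.70, 0.20, 0.10])
p_ta  = np.array([0.30, 0.60, 0.10])

hit = p_respond_different(p_zha, p_ta)   # "different" pair
fa  = p_respond_different(p_zha, p_zha)  # "same" pair (two tokens of one syllable)
print(dprime(hit, fa))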
Overall, Experiment 1 confirmed the success of the Jiang et al. (2007) model for
predicting perceptual dissimilarity. With the exception of some information sources (possibly
glimpses of the tongue) that are important for discrimination of certain segments, the model
appears to generally capture the visual information used for discrimination. In Experiment 2, we
sought to further our understanding of the properties of the representation supporting visual
speech discrimination.
Experiment 2: Discrimination of Inverted Visible Speech
Experiment 2 sought to further test the predictions of the perceptual dissimilarity model
of Jiang et al., (2007). The Jiang et al. dissimilarity model predicts dissimilarity based on the
motion of points on the talking face. The model has no explicit accounting of the position of
these points relative to each other. In other words, the configuration of the points has no direct
impact on the predictions of the model, so the orientation of the face is predicted to have no
impact on visual dissimilarity. One of the general properties of many visual objects is their
perceptual invariance to orientation. Faces are thought to be an important exception, exhibiting
perceptual and neural orientation-specific responses (Yovel and Kanwisher, 2005; Susilo et al.,
2010). Previous neuroimaging results support the view that visible speech processing does not
rely on the FFA (Bernstein et al., 2011). That is, visible speech and face representations appear to
be processed differently by the brain. Experiment 2 investigated whether visible speech stimuli
are like faces or like other objects in terms of their perception when inverted.
There has been some perceptual research on inversion of visible speech stimuli, but the visual-only conditions were typically embedded among audiovisual conditions that were the main focus of those studies.
Rosenblum and colleagues (Rosenblum et al., 2000) investigated the inversion effect with visual
speech stimuli, and they crossed face inversion with mouth orientation (matched or mismatched
orientation). The stimuli were /bA/, /gA/, and /vA/, and all were identified similarly when
otherwise unaltered. Thomas and Jordan (Thomas and Jordan, 2004) used six monosyllabic words
with different initial consonants and different vowels. They implemented several manipulations
of the stimuli, including removal of the movement information beyond the mouth area. They
reported that there was no difference in accuracy for identification of stimuli that showed mouth
movement, even when the extra-oral face information was altered. Jordan and Bevan (1997)
rotated sets of four syllables through 360 degrees and showed no effect on visual identification.
Massaro and Cohen (Massaro and Cohen, 1996) reported that identification was reliably worse in
two of the four synthetic speech syllables used when the stimuli were inverted, but that inversion
had no effect on the information used in audiovisual integration.
Although the effect of inversion on visual speech perception appears to vary by segment
(Jordan and Bevan, 1997; Thomas and Jordan, 2002, 2004), the general pattern is that visual
identification of most syllables is orientation-invariant. However, vertical inversion of a visible
speech stimulus has been shown in some cases to dissociate visual speech perception from audio-
visual speech perception. In an experiment using visual-only and audio-visual speech stimuli,
visual speech identification was generally orientation-invariant, but some incongruent audio-
visual stimuli resulted in fewer fusion results with non-upright facial orientations (Jordan and
Bevan, 1997). In another experiment using visual-only and audio-visual speech stimuli, two of
the four syllables in the experiment were identified significantly worse when inverted, and
consistent with (Jordan and Bevan, 1997), inverted visual stimuli resulted in reliably fewer fusion
responses in the audio-visual condition for all syllables (Massaro and Cohen, 1996). Subsequent
experiments further supported the conclusion that the influence of visible speech stimuli on
audio-visual speech perception is disrupted by inversion to a greater extent than visual syllable
identification is disrupted by inversion (Rosenblum et al., 2000; Thomas and Jordan, 2002, 2004).
Here, we tested whether stimulus inversion disrupted visual syllable discrimination. One
possibility is that discrimination is based on a sensory representation that contributes to audio-
visual (rather than visual-only) speech perceptual processes, or some other sensory representation
(such as that supporting facial identity perception) that is disrupted by inversion (Yin, 1969;
Rossion, 2008; McKone and Yovel, 2009). In that case, discrimination should be generally
disrupted by inversion, and the physical dissimilarity model should no longer accurately predict
discrimination sensitivity. If visual syllable discrimination is based on the same (or a similar)
process as the one involved in visual syllable identification, then inversion effects should be
limited to a subset of syllables and should nonetheless be well-predicted by the physical
dissimilarity model.
Methods
Only the methods unique to Experiment 2 are described in this section. All other methods were carried forward from Experiment 1.
Participants. Twelve volunteers (10 female, all right-handed, mean age 25 years, range
19 to 37 years), none from Experiment 1, gave informed consent and were financially
compensated for their participation.
Stimuli. Stimuli for discrimination were four triplets from the natural stimuli in
Experiment 1. A smaller set was used to reduce the testing time and reduce the number of foils
(see Table 2.1). Foils were included such that the correct response could not be determined by
identifying the first syllable. Same foils were /ZA pA vA hA rA/. Different foils were /nA/-/kA/ and /dZA/-/tA/. The total number of stimulus pairs was 58; however, an additional pair, /gA/-/gA/, was inadvertently included, so the total number of pairs presented was 59.
Stimulus inversion was achieved by presenting the stimuli on an inverted monitor. This
ensured that the upright and inverted stimuli were identical in every respect other than orientation.
Procedure. Stimulus pairs were presented in pseudo-random order within a block.
Blocks were repeated (with stimuli in a different order each time) a total of six times per
condition (upright, inverted). All blocks of a particular condition were completed before blocks of
the other condition began, with condition order counterbalanced across participant groups.
condition there was a six-trial practice to familiarize the participant with the experimental setup.
Because response time effects did not interact with button side mapping in Experiment 1,
Experiment 2 did not counter-balance button side mapping.
Results
Discrimination. Figure 2.7 summarizes the discrimination results and shows that the
pattern of discrimination across near and far stimulus pairs was invariant to orientation. A
repeated measures ANOVA was carried out with within-subjects factors of stimulus distance
(near, far), orientation (upright, inverted), and anchor syllable (/dA dZA kA nA/). Distance was a
reliable main effect, F(1, 11) = 399.5, η
2
= .525, p < .001, with d’ for far (mean d’ = 3.67) greater
than near pairs (mean d’ = 1.50). Anchor was a reliable main effect, F(2.13, 23.43) = 11.21, ε ̃
=.710, η
2
= .11, p < .001. Orientation was a reliable but very small main effect, η
2
= .015, F(1,
11) = 4.85, p = .05, with higher d’ for upright (mean d’ = 2.77) than inverted stimulus pairs (mean
d’ = 2.41). All of the interactions were reliable, but the effect sizes were small: Orientation with
distance, F(1, 11) = 7.15, η
2
= .004, p = .022; Anchor consonant with distance, F(3, 33) = 6.85,
ε ̃ = 1.0, η
2
= .019, p < .001; Orientation with anchor consonant, F(3, 33) = 6.09, ε ̃ = 1.0, η
2
= .04,
p = .002; and Distance, orientation, and anchor, F(3, 33) = 3.39, ε ̃ = 1.0, η
2
= .006, p = .029.
To address the interactions, separate repeated measures ANOVAs were run on each of
the four triplets with the factors orientation (upright, inverted) and stimulus distance (near, far).
Results are reported in Table 2.4. The results suggested that the triplet with anchor /dZA/ was the
main source of effects involving orientation in the omnibus analysis, and within that triplet, the
orientation effect was apparently due to the far pair. Paired t-tests revealed that there was a
reliable inversion effect (upright more discriminable than inverted) for the far pair in the /dZA/
triplet (/vA/-/dZA/) but not for the near pair (/dA/- /dZA/). The effect size for distance was sizeable
across all of the individual triplets, and the effect of orientation was small and generally
unreliable.
Figure 2.7. Mean d' sensitivity for inverted and upright stimuli in Experiment 2. The left panel shows group mean d' collapsed over all anchors, and the small panels show group mean d' separated out by triplet anchor. Error bars are 95% within-subjects confidence intervals.
Response time. Response time results are summarized in Figure 2.8. After removing outliers, 97.3% of the responses (4,396) were retained. A repeated-measures ANOVA was carried out with within-subjects factors of stimulus distance (same, near, far), orientation (upright,
inverted), and anchor (/dA dZA kA nA/), and between-subjects factor of presentation order
(upright first, inverted first).
The main effect of orientation was not reliable, F(1, 10) = 0.5, η² = .006, p = .495. Nor was the main effect of order, F(1, 10) = 2.9, η² = .226, p = .118. Distance was a reliable main effect, F(2, 20) = 31.1, ε̃ = 1.0, η² = .290, p < .001. Responses to far pairs (mean RT = 1,058 ms) were faster than those to near (mean RT = 1,255 ms) or same pairs (mean RT = 1,225 ms), but same and near were not different. Anchor was a reliable main effect, F(2.9, 28.7) = 34.6, ε̃ = .956, η² = .123, p < .001. Responses to the stimuli with /nA/ were reliably slower than those with the other anchors, but the /nA/ tokens were longer (590 and 560 ms) compared to the other anchor syllables (mean duration = 403 ms), and the duration difference between /nA/ and the other anchor tokens was similar to the mean response time effect of 127 ms.
Table 2.4. Analyses of variance on d’ for the triplets in Experiment 2. Each of the anchor
stimuli and its triplet set were analyzed separately (see text).
Anchor   Source                            df   F      η²    p     Mean difference
/dA/     distance (far, near)              1    399.5  .64   .001  1.66 [1.30 2.03]
         orientation (inverted, upright)   1    0.0    .00   .979  0.01 [-0.66 0.67]
         distance with orientation         1    0.0    .00   .905
         error                             11   .36
/dZA/    distance (far, near)              1    79.9   .52   .001  2.08 [1.57 2.59]
         orientation (inverted, upright)   1    13.7   .18   .003  -1.23 [-0.50 -2.00]
         distance with orientation         1    10.8   .04   .007
         error                             11   .26
/kA/     distance (far, near)              1    257.0  .86   .001  2.80 [2.42 3.19]
         orientation (inverted, upright)   1    1.8    .01   .210  0.34 [-0.22 0.91]
         distance with orientation         1    0.1    .00   .773
         error                             11   .13
/nA/     distance (far, near)              1    140.0  .73   .001  2.14 [1.74 2.53]
         orientation (inverted, upright)   1    4.2    .05   .065  -0.56 [-1.16 0.04]
         distance with orientation         1    1.3    .00   .275
         error                             11   .22
Note. Mean difference is level 1 minus level 2 (i.e., far minus near, inverted minus upright). Values in
brackets are lower and upper 95% confidence intervals.
The interaction of orientation with distance category and anchor consonant accounted for a small but significant amount of the variance, F(3.3, 32.7) = 2.8, ε̃ = .55, η² = .019, p = .049.
Follow-up paired comparisons showed that in both upright and inverted conditions, the far pair
response was faster than the near and same pair responses for all of the anchors, except for /vA/-
/dZA/.
The interaction of distance category with anchor syllable was reliable, F(4.0, 40.2) = 6.7, ε̃ = .67, η² = .047, p < .001; however, follow-up paired comparisons showed that all triplets exhibited the same pattern of difference observed in the main effect of distance category, namely that far pairs were faster than near and same (all p < .05), and near and same response times were similar, except with the /nA/ anchor, for which there was a small but reliable difference between near (mean RT = 1,355 ms) and same (mean RT = 1,285 ms).
Figure 2.8. Response time for inverted and upright stimuli in Experiment 2. The panel on the left
shows mean response times across anchors and the smaller panels show mean response times by anchor.
Error bars are 95% within-subjects confidence intervals.
Discussion
The results of the experiment suggest that discrimination sensitivity is generally invariant
to orientation. Consistent with past identification experiments with orientation manipulations
(Massaro and Cohen, 1996; Jordan and Bevan, 1997; Rosenblum et al., 2000; Thomas and
Jordan, 2002, 2004), which found that identification of most syllables is orientation-invariant, orientation did not appreciably impact visual-only speech perception with the exception of one
pair, /vA/-/dZA/. This far pair was also not faster than its near pair comparison when it was
inverted. The consonant /v/ was used in previous identification experiments that also found
inversion to affect identification of that syllable (Massaro and Cohen, 1996; Rosenblum et al.,
2000).
General Discussion
In Experiment 1, perceptual distance category (same, near, far) and continuous
perceptual dissimilarity measures from the Jiang et al. (2007) model were successful in predicting
discrimination sensitivity (d’) and response time measures for pairs of natural and synthetic
speech stimuli. The task required participants to discriminate on the basis of phoneme, and they
could not rely on non-phonemic information, because every stimulus pair (same and different
pairs) comprised two different stimulus tokens. Perceptual identification of synthesized syllables
supported the conclusion that the participants perceived the synthetic stimuli as speech. The
identification results with natural stimuli were similar in accuracy to those obtained previously
(Jiang et al., 2007). Discrimination d’ was better than d’ calculated on the basis of an implicit
identification model, and the perceptual dissimilarity model better predicted discrimination d’
than did the implicit identification model. In Experiment 2, upright and inverted stimuli were
presented for discrimination, and the model of perceptual dissimilarity was supported with the
results from discrimination of inverted stimuli. Here, we discuss the relevance of the results to
understanding how visible speech is perceived and to the use of synthetic visible speech.
Visual speech perception.
Widespread appreciation for the significance of visual speech perception on the part of
speech perception researchers is relatively recent (for a historical overview, Bernstein, 2012a, b),
even though research findings on lipreading in deaf individuals appeared regularly in the clinical
literature during most of the twentieth century (Jeffers and Barley, 1971) and could have alerted
speech researchers to the potential visual basis for speech perception.
However, the prevailing view was that speech perception was a special auditory function
(Liberman, 1982). The discovery of the McGurk effect (McGurk and MacDonald, 1976), that is,
alteration of auditory speech perception by simultaneous presentation of mismatched video,
initiated intense interest in how audiovisual speech information is perceived (Summerfield, 1987)
and integrated by the brain (Sams et al., 1991; Calvert et al., 1999). Yet, interest in audiovisual
integration has only rarely motivated research focused per se on the visual stimulus (cf., Jiang
and Bernstein, 2011), visual speech perception (cf., Montgomery and Jackson, 1983; Thomas and
Jordan, 2002; Munhall and Vatikiotis-Bateson, 2004; Jiang et al., 2007a), or its underlying neural
bases (cf., MacSweeney et al., 2002a; Paulesu et al., 2003; Bernstein et al., 2011). This lacuna is
likely attributable, at least in part, to the pervasive view that audiovisual integration is at a low
level of processing, resulting in amodal or auditory speech representations (Rosenblum, 2008).
The logic is that if visible speech information is converted at a low level to representations that
are not visual (Liberman and Mattingly, 1985; Sams et al., 1991; Kislyuk et al., 2008), there is
not much to investigate with regard to visual processing.
Contrary to this expectation, as we have demonstrated here, a visual model of the
dissimilarity of speech based in part on perceptual identification accounts for discrimination with
upright (natural and synthetic) and inverted stimuli. Jiang and colleagues pointed out that the
success of their model in absence of a feature extraction stage demonstrates that feature-based
representations are not necessary to account for speech perception. Feature-based representations
are common in models of speech perception (Liberman and Mattingly, 1985; Massaro, 1998;
Rosenblum, 2008), although the specifics of what features are encoded differ from one theory to
another. It has been argued that feature-based representations provide a necessary “common
currency” (e.g., Fowler et al., 2003) between auditory and visual speech signals as well as speech
production. The results of the present study reinforce the suggestion of Jiang et al. that feature-
based representations are unnecessary. Of course, the question of whether feature-based
representations are actually used, despite the apparent lack of necessity, is beyond the scope of
behavioral experiments.
The close correspondence between optical dissimilarity (i.e. the inputs to the perceptual
dissimilarity model of Jiang et al.) and perceptual dissimilarity was demonstrated using
identification data (Jiang et al., 2007a) and discrimination data in the present study. These results
are consistent with the stimuli being processed along a visual cortical hierarchy in which higher
levels show selectivity for increasingly complex stimuli combined with an increasing tolerance to
stimulus transformations (Hubel and Wiesel, 1962; Ungerleider and Haxby, 1994; Logothetis and
Sheinberg, 1996; Zeki, 2005). Thus, these results support the conclusion that speech is
represented to a high level of integration within the visual system: That is, the visual stimuli
produced during the act of speaking are integrated within the visual pathway to a level that is
linguistically relevant. This hierarchical structure is not strictly convergent (Ahissar et al., 2008),
and as a complex visual stimulus, visible speech stimuli are expected to activate multiple,
functionally distinct representations (Bruce and Young, 1986; Campbell et al., 1996b).
Multiple representations.
The talking face is complex (Munhall and Vatikiotis-Bateson, 1998), and conveys many
kinds of information in addition to phonetic information, such as emotion, target of attention, and
identity. These different kinds of information are processed by different sensory pathways (Bruce
and Young, 1986; Campbell et al., 1986; Campbell et al., 1996b; Bernstein et al., 2011).
Recently, Files, Auer and Bernstein (in press) demonstrated two visual mismatch negativity
responses elicited by visual syllable deviance: one was localized to left posterior temporal cortex
and was interpreted as reflecting differences in phonetic forms that could be reliably mapped to
different syllable categories. The other was localized to right posterior temporal cortex and was
interpreted as reflecting more fine-grained distinctions in facial motion that were detectable, but
could not necessarily be reliably mapped to different syllable categories.
The relationship of the present results with neural processing pathways cannot be
established without further testing, but the evidence here is consistent with a representation of the
phonetic forms of visible speech that supports phonemic categorization. However, visible speech
discrimination is based on more information than is preserved in this category-based
representation and is more closely related to the motion of the talking face. Whether the sensory
representations supporting discrimination are the same as those supporting identification cannot
be determined with behavioral data alone, but the inversion manipulation of Experiment 2
supports the rejection of two candidates for alternative sensory representations that might
contribute to discrimination but not identification.
Facial identity processing has been shown to be severely impacted by inversion (Yin,
1969; Rossion, 2008; McKone and Yovel, 2009). If visual speech discrimination were supported
by sensory representations that also support facial recognition, we would expect discrimination of
inverted faces to be similarly impacted. The general orientation-invariance for discrimination
found in Experiment 2 argues against that possibility. Neuroimaging data also support
dissociation between facial identity perception and visual speech perception. The fusiform face
area (Kanwisher et al., 1997) is associated with facial recognition processes, and this area was found to be under-activated by visible speech and point-light speech displays when contrasted with non-speech motion controls (Bernstein et al., 2011).
Previous research with inverted and upright visible speech in conjunction with auditory
stimuli has generally found that the impact of visual stimuli on audiovisual speech perception is
reduced when the visual stimuli are inverted, but for most syllables inversion has little impact on
visual-only speech perception (Massaro and Cohen, 1996; Jordan and Bevan, 1997; Rosenblum et
al., 2000; Thomas and Jordan, 2002, 2004). Here, we found that visual speech discrimination, like
visual speech identification, is not generally disrupted by inversion. Insofar as auditory and visual
speech perception are supported by separate, but interacting pathways (Bernstein et al., 2004;
Bernstein, 2012a; Nahorna et al., 2012), orientation-invariance appears to be a general property of
the visual speech perception pathway under both identification and discrimination. If this is so,
why do some segments, such as /v/, consistently show visual-only orientation effects? Thomas
and Jordan (2004) used visual speech stimuli edited to show either only oral movement or only extraoral movement. They crossed this manipulation with orientation and found that when orientation
effects were obtained, they were strongest in the extraoral motion conditions. They suggested that
inversion may impede encoding of extraoral features, but that these features may only be
necessary in some specific gestures.
Discrimination and Identification.
In Experiment 1, discrimination sensitivity was compared to simulated task performance
based on an implicit identification model. Discrimination sensitivity was higher than predicted
based on the implicit identification model, and sensitivity was better-predicted by the perceptual
dissimilarity model than the implicit identification model. This result implies that visual syllable
discrimination judgments are not strictly based on phonemic labeling of the stimuli and instead
may be based on a sensory representation. Category representations have been argued to be
crucial for speech perception, in part because successful speech perception needs to be invariant
to variability caused by different talkers and different contexts (Holt and Lotto, 2010). The many-
to-one relationship between sensory representations and category representations suggests that the
categorization stage of a two-stage model of classification (Durlach and Braida, 1969) may
lose some sensory detail not because of corruption by noise, but because of invariance to
variability that may not reliably convey phonetic information. The evidence that visual speech
discrimination is better than expected from categorization alone shows that although invariance to
certain details may be necessary for reliable phoneme classification, the entire system is not
invariant to those details and apparently can use them for making perceptual judgments.
Visible speech synthesis.
There is a literature spanning decades that details the acoustic attributes of speech
(Stevens, 1998). Acoustic analysis and synthesis has been available since the mid-1950s
(Jakobson et al., 1961; Klatt, 1987) and has been used to learn how the structure of acoustic
speech signals affects speech perception (Pisoni and Remez, 2004), and more recently, how the
brain supports auditory speech perception (Binder et al., 2000; Scott et al., 2006; Hickok and
Poeppel, 2007).
There have been only a few previous implementations of parametrically controllable
synthetic visible speech (Walden et al., 1987; Cohen et al., 1996; Beskow, 2004). Detailed
analysis of the configurations and dynamics of the talking face, some of which is ongoing in the
automatic speech recognition community (Liew and Wang, 2009) and some of which is ongoing
in the speech production community (Yehia et al., 1998; Yehia et al., 1999), in conjunction with
parametric variation of stimuli via synthesis and perceptual studies, is needed to learn the first-
order isomorphic mappings between visible stimulus information and perceptual representations.
However, the results here show that visual speech synthesis should be more detailed than
our current synthesis. The natural stimuli were more accurately identified than the synthetic
stimuli, and sensitivity was higher and response times were faster when discrimination was
measured with natural stimuli. That perceivers were able to exploit the more detailed information
in the natural stimuli is surprising from the perspective of claims that the visual stimulus is highly
impoverished, necessarily resulting in very low levels of perceptual accuracy (Fisher, 1968; Kuhl
and Meltzoff, 1988; Massaro, 1998). We, however, are not surprised by the results demonstrated
here, because quite accurate visual speech perception has been documented in congenitally deaf
and also in some hearing individuals (Bernstein et al., 2000; Mohammed et al., 2005; Auer and
Bernstein, 2007). Furthermore, as has been shown with auditory speech stimuli (Saffran et al.,
1996; Hay et al., 2011), we expect that a lifetime of exposure to visible speech results in learning
its common attributes (Massaro, 1984; Massaro et al., 1986; Auer and Bernstein, 2007).
Several weaknesses of the synthesizer used for this study are readily identifiable. The 3D
optical signals were constrained by the technology to be obtained from the surface of the face
(Jiang et al., 2007). The retro-reflectors that were used to record the infrared flashes were
relatively sparse. This is a particular concern for modeling the lips. Figure 2.1a shows that the lip
retro-reflectors were place around the vermillion boarder of the lip. However, that border does not
maintain a constant relationship with the inner edge of the lips. Consider how the lips thin for
consonants followed by /i/, which spreads the lips, versus those followed by /A/, which leaves
them more relaxed. This type of reliable variation was not available in the 3D data for developing
the model or synthesizing stimuli, because the retro-reflectors could only be attached on the outer
lip margins.
Lipreaders have visual access to the tongue when the mouth is open. Tongue motion is
incompletely correlated with movement of the face (Yehia et al., 1998; Jiang et al., 2002b), so
being able to see into the mouth can afford additional speech information (Rosenblum et al.,
1996). The synthesizer used for Experiment 1 did not include a tongue model, which could
account for the more accurate identification of natural /hA, dZA, lA, nA, rA/ and the lower
entropy for natural /tSA, dA, fA, hA, lA, nA, tA/. The lack of the gesture that places the teeth
against the lower lip could account for the /fA/ natural stimuli being more accurately identified
than the synthetic ones. On the other hand, even with the reduction in information presented by
the synthesizer, the results for discrimination and identification were analogous to those with
natural stimuli. This can be explained in the context of previous demonstrations that speech
information is redundantly distributed on the face. For example, previous results for Talker M2
used in the current study show that phoneme identity could be reliably predicted based on only
the 3D data taken from his cheeks (Jiang et al., 2002a). Also, previous neuroimaging results
obtained with point-light renderings of the 3D data showed that even without the face model, the
points alone produce similar cortical activation patterns to those with natural speech video
(Bernstein et al., 2011). Overall, the current synthesis, driven by 3D optical data and constrained
by a face model that does not include tongue movement or data-based control of inner lip
margins, was found to be useful for stimulus control (i.e., by omitting natural video artifacts,
including head movement) and for subjecting the perceptual model to test using stimuli based
only on the raw data that were used to develop the model.
Summary and conclusions.
The main findings of this study were that modeled perceptual dissimilarity (Jiang et al.,
2007a) of visible speech CV syllables successfully predicted discrimination when stimuli were
natural, synthetic, or natural but inverted. Syllable discrimination was better than predicted from
an implicit identification model, and discrimination sensitivity was better predicted by the
perceptual dissimilarity model than an implicit identification model. Synthetic stimuli generated
using a sparse representation of the motion on a talking face were found to sufficiently convey
phonetic information that synthetic stimuli could be reliably identified and discriminated.
Discrimination of inverted stimuli was generally comparable to discrimination of upright stimuli,
supporting the view that visible speech perception is dissimilar from face identity perception.
Although both were found here to be successful, the model of perceptual dissimilarity and the
synthetic talking face could be improved, possibly through a direct representation of the visibility
of the tongue. Taken together, these results support a model of visual speech perception with a
sensory representation of visible 3D motion as its perceptual front-end.
Chapter 3: The visual mismatch negativity elicited with visual
speech stimuli
The visual mismatch negativity (vMMN) paradigm was used here to investigate visual
speech processing. The MMN response was originally discovered and then extensively
investigated with auditory stimuli (Näätänen et al., 1978; Näätänen et al., 2011). The classical
auditory MMN is generated by the brain’s automatic response to a change in repeated stimulation
that exceeds a threshold corresponding approximately to the behavioral discrimination threshold.
It is elicited by violations of regularities in a sequence of stimuli, whether the stimuli are attended
or not, and the response typically peaks 100-200 ms after onset of the deviance (Näätänen et al.,
1978; Näätänen et al., 2005; Näätänen et al., 2007). The violations that generate the auditory
MMN can range from low-level stimulus deviations such as the duration of sound clicks (Ponton
et al., 1997) to high-level deviations such as speech phoneme category (Dehaene-Lambertz,
1997). More recently, the vMMN was confirmed (Pazo-Alvarez et al., 2003; Czigler, 2007;
Kimura et al., 2011; Winkler and Czigler, 2012). It too is elicited by a change in regularities in a
sequence of stimuli, across different levels of representation, including deviations caused by
spatiotemporal visual features (Pazo-Alvarez et al., 2004), conjunctions of visual features
(Winkler et al., 2005), emotional faces (Li et al., 2012; Stefanics et al., 2012), and abstract visual
stimulus properties such as bilateral symmetry (Kecskes-Kovacs et al., 2013) and sequential
visual stimulus probability (Stefanics et al., 2011).
Speech can be perceived visually by lipreading, and visual speech perception is carried
out automatically by hearing as well as by hearing-impaired individuals (Bernstein et al., 2000;
Auer and Bernstein, 2007). Inasmuch as perceivers can visually recognize the phonemes
(consonants and vowels) of speech through lipreading, the stimuli are expected to undergo
hierarchical visual processing from simple features to complex representations along the visual
pathway (Grill-Spector et al., 2001; Jiang et al., 2007b), just as are other visual objects, including
faces (Grill-Spector et al., 2001), facial expression (Li et al., 2012; Stefanics et al., 2012), and
non-speech face gestures (Puce et al., 1998; Puce et al., 2000; Puce et al., 2007; Bernstein et al.,
2011). Crucially, because the vMMN deviation detection response is thought to be generated at the cortical level that represents the standard and deviant stimuli (Winkler and
Czigler, 2012), it should be possible to obtain the vMMN in response to deviations in visual
speech stimuli. However, previous studies in which a speech vMMN was sought produced mixed
success in obtaining a deviance response attributable to visual speech stimulus deviance detection
(Colin et al., 2002; Colin et al., 2004; Saint-Amour et al., 2007; Ponton et al., 2009; Winkler and
Czigler, 2012). A few studies have even sought an auditory MMN in response to visual speech
stimuli (e.g., Sams et al., 1991; Möttönen et al., 2002).
The present study took into account how visual stimuli conveying speech information
might be represented and mapped to higher levels of cortical processing, say for speech category
perception or for other functions such as emotion, social, or gaze perception. That is, the study
was specifically focused on the perception of the physical visual speech stimulus. The distinction
between representations of the forms of exogenous stimuli versus representation of linguistic
categories is captured in linguistics by the terms phonetic form versus phonemic category.
Phonetic forms are the exogenous physical stimuli that convey the linguistically-relevant
information used to perceive the speech category to which the stimulus belongs. Visual speech
stimuli convey linguistic phonetic information primarily via the visible gestures of the lips, jaw,
cheeks, and tongue, which express the system of phonological contrasts that distinguishes speech
phonemes (Yehia et al., 1998; Jiang et al., 2002b; Bernstein, 2012b). Phonemic categories are the
consonant and vowel categories that a language uses to differentiate and represent words. If
visual speech is processed similarly to auditory speech stimuli, functions related to higher-level
language processing, such as categorization and semantic associations, are carried out beyond the
level of exogenous stimulus form representations (Scott and Johnsrude, 2003; Hickok and
Poeppel, 2007).
This study was concerned with the implications for cortical representation of speech
stimuli in the case that visual speech perception is generally left-lateralized. There is evidence for
form-based speech representations in high-level visual areas, and there is evidence that they are
left-lateralized (Campbell et al., 2001; Bernstein et al., 2011; Campbell, 2011; Nath and
Beauchamp, 2012). For example, Campbell and colleagues (1986) showed that a patient with
right-hemisphere posterior cortical damage failed to recognize faces but had preserved speech lip-
shape recognition, and that a patient with left-hemisphere posterior cortical damage failed to
recognize speech lip-shapes but had preserved face recognition.
Recently, evidence for hemispheric specialization was obtained in a study designed to
investigate specifically the site/s of specialized visual speech processing. Bernstein et al. (2011)
applied a functional magnetic resonance imaging (fMRI) block design while participants viewed
video and point-light speech and non-speech stimuli and tiled control stimuli. Participants were
imaged during localizer scans for three regions of interest (ROIs), the fusiform face area (FFA)
(Kanwisher et al., 1997), the lateral occipital complex (LOC) (Grill-Spector et al., 2001), and the
human visual motion area V5/MT. These three areas were all under-activated by speech stimuli.
Although both posterior temporal cortices responded to speech and non-speech stimuli, only in
the left hemisphere was an area found with differential sensitivity to speech versus non-speech
face gestures. It was named the temporal visual speech area (TVSA) and was localized to the
posterior superior temporal sulcus and adjacent posterior middle temporal gyrus (pSTS/pMTG),
anterior to cortex that was activated by non-speech face movement in video and point-light
stimuli. TVSA is similarly active across video and point-light stimuli. In contrast, right-
hemisphere activity in the pSTS was not reliably different for speech versus non-speech face
gestures. Research aimed at non-speech face gesture processing has also produced evidence of
right-hemisphere dominance for non-speech face gestures, with a focus in the pSTS (Puce et al.,
2000; Puce et al., 2003).
The approach in the current study was based on predictions for how the representation of
visual speech stimuli should differ for the right versus left posterior temporal cortex under the
hypothesis that the left cortex has tuning for speech, but the right cortex has tuning for non-
speech face gestures. Specifically, lipreading relies on highly discriminable visual speech
differences. Visual speech phonemes are not necessarily as distinctive as auditory speech
phonemes. Visual speech consonants are known to vary in terms of how distinct they are from
each other, because some of the distinctive speech features used by listeners (e.g., voicing,
manner, nasality, place) to distinguish phonemes are not visible or are less visible to lipreaders
(Auer and Bernstein, 1997; Bernstein, 2012b). A left posterior temporal cortex area specialized
for speech processing, part of an extensive speech processing pathway, is expected to be tuned to
represent linguistically useful exogenous phonetic forms, that is, forms that can be mapped to
higher-level linguistic categories, such as phonemes. However, when spoken syllables (e.g.,
“zha” and “ta”) do not provide enough visual phonetic feature information, their representations
are expected to generalize. That is, the indistinct stimuli activate overlapping neural populations.
This is depicted in Figure 3.1, in which the visually near (perceptual categories are not distinct)
syllables “ta” and “zha” are represented by almost completely overlapping ovals in the box
labeled left posterior temporal visual cortex. The perceptually far stimulus “fa,” a stimulus that
shares few visible phonetic features with “zha,” is depicted within its own non-overlapping oval
in that box. Here, using the vMMN paradigm, a deviance response was predicted for the left
hemisphere with the stimuli “zha” versus “fa,” representing a far contrast. But the near contrast
“zha”-“ta,” depicted in Figure 3.1, was not predicted to elicit a vMMN response from the left
posterior temporal cortex for “zha” or for “ta” syllables.
In contrast, the right posterior temporal cortex, with its possible dominance for
processing simple non-speech face motions such as eye open versus closed, and simple lips open
versus closed (Puce et al., 2000; Puce et al., 2003), was predicted to generate a deviance response
to both perceptually near and far speech stimulus contrasts. The depiction in Figure 3.1 for the
right posterior temporal cortex shows that the stimulus differences are represented there more
faithfully (i.e., there are more neural units that are not in common). The right posterior temporal
cortex is theoretically more concerned with perception of non-speech face gestures, for example,
gestures related to visible emotion or affect: The representations may even be more analog in the
sense that they are not used as input to a generative system that relies on combinations of representations (i.e., vowels and consonants) to produce a very large vocabulary of distinct words.
Figure 3.1. Schematic diagram of the proposed roles for left and right posterior temporal cortices in visual speech perception. Left posterior temporal visual cortex is hypothesized to represent phonetic forms that support eventual phoneme categorization and to therefore be invariant to variation in facial motion that is unlikely to reliably support speech perception. Pairs of visual speech syllables are predicted to activate largely overlapping populations of neurons in left posterior temporal cortex when the syllables in the pair are perceptually similar (i.e., they are perceptually near). But non-overlapping populations of neurons in left posterior temporal cortex represent syllables that are perceptually highly dissimilar (i.e., they are perceptually far). In contrast, the right posterior temporal cortex is hypothesized to represent non-speech facial gestures. Near pairs of visual speech syllables are predicted to activate only partially overlapping populations of neurons in right posterior temporal visual cortex, and far pairs are predicted to activate non-overlapping populations.
Even very simple low-level visual features or non-speech face or eye motion in the
speech video clips can elicit the vMMN (Puce et al., 2000; Puce et al., 2003; Miki et al., 2004;
Thompson et al., 2007). With natural speech production, phonetic forms vary from one
production to the next. An additional contribution to variability comes from the virtually inevitable shifts in
the talker’s head position, eye gaze, eyebrows, etc., from video recording to recording. Subtle
differences are not necessarily so obvious on a single viewing, but the vMMN paradigm involves
multiple stimulus repetitions, which can render subtle differences highly salient.
The approach here was to use two recordings for each consonant and to manipulate the
stimuli to minimize non-phonetic visual cues that might differentiate the stimuli. The study
design took into account the likelihood that the deviance response to speech stimuli would be
confounded with low-level stimulus differences, if it involved a stimulus as standard (e.g., “zha”)
versus a different stimulus as deviant (e.g., “fa”). Therefore the vMMN was sought using the
event-related potentials (ERPs) obtained with the same stimulus (e.g., “zha”) in its two possible
roles of standard and deviant. Stimulus discriminability was verified prior to ERP recording.
During ERP recording, participants monitored for a rare target phoneme to engage their attention
and hold it at the level of phoneme categorization, rather than at the level of stimulus
discrimination.
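To make the logic of this design concrete, the sketch below shows how a deviant-minus-standard difference wave for the same physical stimulus can be computed with MNE-Python. The file name, event codes, filter settings, and epoch window are hypothetical; this is not the processing pipeline reported in the Results, only an illustration of the same-stimulus comparison.

import mne

raw = mne.io.read_raw_fif("subject01_raw.fif", preload=True)
raw.filter(l_freq=0.1, h_freq=30.0)

events = mne.find_events(raw)
event_id = {"zha_standard": 1, "zha_deviant": 2}

epochs = mne.Epochs(raw, events, event_id=event_id,
                    tmin=-0.1, tmax=0.6,   # time-locked to the deviation onset point
                    baseline=(None, 0), preload=True)

evoked_std = epochs["zha_standard"].average()
evoked_dev = epochs["zha_deviant"].average()

# Subtracting the standard from the deviant for the identical stimulus removes
# stimulus-specific exogenous activity and leaves the deviance response.
vmmn = mne.combine_evoked([evoked_dev, evoked_std], weights=[1, -1])
vmmn.plot_joint()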
Method
Participants. Participants were screened for right-handedness (Oldfield, 1971), normal
or corrected to normal vision (20/30 or better in both eyes using a traditional Snellen chart),
normal hearing, American English as a first and native language, and no known neurological
deficits. Lipreading was assessed with a screening test that has been used to test a very large
sample of normal hearing individuals (Auer and Bernstein, 2007). The screening cutoff was 15%
words correct in isolated sentences to assure that participants who entered the EEG experiment
had some lipreading ability. Forty-nine individuals were screened (mean age = 23 years), and 24
(mean age = 24, range 21 to 31, 18 female, lipreading score M = 28.7% words correct) met the
inclusion criteria for entering the EEG experiment. The EEG data from 11 participants (mean
age = 23.2, range 19 to 31, 7 female, lipreading score M = 33.0) were used here: One participant
was lost to contact, one ended the experiment early, two had unacceptably high initial impedance
levels and were not recorded, and nine had high electrode impedances, excessive bridging
between electrodes, or unacceptable noise levels. Informed consent was obtained from all
participants. Participants were paid. The research was approved by the Institutional Review
Boards at George Washington University and at the University of Southern California.
Stimuli
Stimulus dissimilarity. The stimuli for this study were selected to be of predicted
perceptual and physical dissimilarities. Estimates of the dissimilarities and the video speech
stimuli themselves were obtained from Jiang et al. (2007a), which gives a detailed description of
the methods for predicting and testing dissimilarity. Based on the dissimilarity measures in Jiang
et al. (2007), the stimulus pair “zha” – “fa,” with modeled dissimilarity of 4.04, was chosen to be
perceptually far, and the stimulus pair “zha” – “ta,” with modeled dissimilarity of 2.28, was
chosen to be perceptually near. In a subsequent study, Files and Bernstein (submitted) tested
whether the modeled dissimilarities among a relatively large selection of syllables correctly
predicted stimulus discriminability, and they did.
Stimulus video. Stimuli were recorded so that the talker’s face filled the video screen,
and lighting was from both sides and slightly below his face. A production quality camera (Sony
DXC-D30 digital) and video recorder (Sony UVW 1800) were used simultaneously with an
infrared motion capture system (Qualisys MCU120/240 Hz CCD Imager) for recording 3-
dimensional (3D) motion of 20 retro-reflectors affixed to the talker’s face. The 3D motion
recording was used by Jiang et al. (2007) in developing the dissimilarity estimates. There were
two video recordings of each of the syllables, “zha,” “ta,” and “fa” that were used for eliciting the
vMMNs. Two tokens of “ha,” and of “va” were used as targets to control attention during the
vMMN paradigm. All video was converted to grayscale.
In order to reduce differences in the durations of preparatory mouth motion across
stimulus tokens and increase the rate of data collection, some video frames were removed from
slow uninformative mouth opening gestures. But most of the duration differences were reduced
by removing frames from the final mouth closure. No frames were removed between the sharp
initiation of articulatory motion and the quasi-steady-state portion of the vowel.
During the EEG experiment, the video clips were displayed contiguously through time.
To avoid responses due to minor variations in the position of the head from the end of one token
to the beginning of the next, morphs of 267 ms were generated (Abrosoft’s FantaMorph5) to
create smooth transitions from one token to the next. The morphing period corresponded to the
inter-stimulus-interval.
The first frame of each token was centered on the video monitor so that a motion-capture
dot that was affixed at the center of the upper lip was at the same position for each stimulus. Also,
stimuli were processed so that they would not be identifiable based solely on the talker’s head
movement. This was done by adding a small amount of smooth translational motion and rotation
to each stimulus on a frame-by-frame basis. The average motion speed was 0.5 pixels per frame
(0.87 degrees of visual angle/sec), with a maximum of 1.42 pixels per frame (2.5 degrees/sec).
Rotation varied between plus and minus 1.2 degrees of tilt, with an average change of .055
degrees of tilt per frame (3.28 degrees/sec) and a maximum change of .15 degrees of tilt per
frame (9.4 degrees of tilt/sec). A stationary circular mask with radius 5.5 degrees of visual angle
and luminance equal to the background masked off the area around the face of the talker.
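A frame-by-frame manipulation of this kind can be sketched as below with OpenCV, applying smooth, small translations and rotations to every frame. The parameters approximate those reported above, but the function and its smoothing scheme are illustrative assumptions, not the actual software used to prepare the stimuli.

import cv2
import numpy as np

def jitter_frames(frames, max_shift_px=1.4, max_tilt_deg=1.2, seed=0):
    # Add slow sinusoidal translation and rotation so that tokens cannot be
    # told apart from residual head position alone.
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 2.0 * np.pi, len(frames))
    dx = max_shift_px * np.sin(t + rng.uniform(0, 2 * np.pi))
    dy = max_shift_px * np.sin(0.7 * t + rng.uniform(0, 2 * np.pi))
    tilt = max_tilt_deg * np.sin(0.5 * t + rng.uniform(0, 2 * np.pi))

    out = []
    for frame, x, y, angle in zip(frames, dx, dy, tilt):
        h, w = frame.shape[:2]
        m = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle, 1.0)
        m[:, 2] += (x, y)  # append the translation to the affine transform
        out.append(cv2.warpAffine(frame, m, (w, h),
                                  borderMode=cv2.BORDER_REPLICATE))
    return out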
Stimulus alignment and deviation points. The two tokens of each consonant (e.g.,
“zha”) varied somewhat in their kinematics, so temporal alignments had to be defined prior to
averaging the EEG data. We developed a method to align tokens of each syllable. Video clips
were compared frame by frame separately for “zha,” “fa,” and “ta.” In addition, mouth opening
area was measured as the number of pixels encompassed within a manual tracing of the
vermillion border in each frame of each stimulus. Visual speech stimulus information is widely
distributed on the talking face (Jiang et al., 2007a), but mouth opening area is a gross measure of
speech stimulus kinematics. Figure 3.2 shows the mouth-opening area and video of the lips for
the three different consonant-vowel (CV) stimuli and the two different tokens of each of them.
Figure 3.2. Temporal kinematics of the syllables. For each syllable, “fa,” “zha,” and “ta,” mouth opening area was measured in pixels, normalized to the range 0 (minimum for that syllable) to 1 (maximum for that syllable). Below each mouth opening graph are two rows of video images, one for each token of the stimulus. The images are cropped to show only the mouth area. The full face was shown to the participants in gray-scale. The vertical line in cyan marks the time of deviation for “zha” versus “fa.” The magenta vertical line marks the time of deviation for “zha” versus “ta.”
The stimuli began with a closed neutral mouth and face, followed by the gesture into the
consonant, followed by the gesture into the /a/ vowel (“ta,” “fa,” “zha”). Consonant identity
information develops across time and continues to be present as the consonant transitions into the
following vowel. The steep mouth opening gesture into the vowel partway through the stimulus
was considered a possible landmark for temporal alignment, because it is a prominent landmark
in the mouth area trace, but using this landmark in some cases brought the initial part of the
consonant into gross misalignment. The frames comprising the initial gesture into the consonant
were chosen to be the relevant landmark for alignment across tokens, because they are the earliest
indication of the consonant identity (Jesse and Massaro, 2010).
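The mouth-opening-area measure described above (the number of pixels enclosed by a manual tracing of the vermillion border) can be computed from a traced contour as in the following Python sketch; the tracing coordinates and frame size shown are hypothetical, and this is not the tool actually used.

```python
import numpy as np
from matplotlib.path import Path

def mouth_area_pixels(border_xy, frame_shape):
    """Count pixels enclosed by a traced vermillion-border polygon.
    border_xy: (N, 2) array of (x, y) vertices; frame_shape: (height, width)."""
    h, w = frame_shape
    yy, xx = np.mgrid[0:h, 0:w]
    pixels = np.column_stack([xx.ravel(), yy.ravel()])
    inside = Path(border_xy).contains_points(pixels)   # True for pixels inside the tracing
    return int(inside.sum())

# Hypothetical tracing (roughly elliptical mouth opening) on a 640 x 480 frame:
theta = np.linspace(0, 2 * np.pi, 40)
trace = np.column_stack([320 + 60 * np.cos(theta), 240 + 25 * np.sin(theta)])
area = mouth_area_pixels(trace, frame_shape=(480, 640))
```

Per-syllable normalization to the 0–1 range shown in Figure 3.2 would then subtract the within-syllable minimum and divide by the within-syllable range.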
The question was then when the image of one consonant (e.g., “fa”) deviated from the image of the other (e.g., “zha”). The MMN is typically elicited by stimulus deviation, rather than
stimulus onset (Leitman et al., 2009), and this deviation onset point is used to characterize the
relative timing of the vMMN. Typically, ERPs to visual stimuli require steep visual energy
change (Besle et al., 2004), but visual speech stimulus onset can be relatively slow-moving,
depending on the speech phonetic features. Careful examination of the videos shows that
differences in the tongue are visible across the different consonants. The “zha” is articulated by
holding the tongue in a quasi-steady-state somewhat flattened position in the mouth. This
articulation is expected to take longer to register as a deviation, because of its subtle initial
movement. The “ta” and “zha” stimuli vary primarily in terms of tongue position, which is visible
but difficult to discern without attention to the tongue inside the mouth aperture. The deviation
onset point here was defined as the first frame at which there was a visible difference across
consonants. The 0-ms points in this report are set at the relevant deviation point and vMMN times
are reported relative to the deviation onset.
Procedures
Discrimination pre-test. To confirm the discriminability of the consonants
comprising the critical contrasts in the EEG experiment, participants carried out a same-different
perceptual discrimination task that used “zha”–“fa”, and “zha”–“ta” different stimulus pairs. The
two tokens of each syllable were combined in each of four possible ways and in both possible
orders. Same pairs used different tokens of the same syllable, so that accurate discrimination
required attention to consonant category. This resulted in six unique same pairs and 16 unique
different pairs. To reduce the difference in number of same pairs versus the number of different
pairs, the same pairs were repeated, resulting in 12 same pairs and 16 different pairs per block, for
a total of 28 pairs per block. During each trial, the inter-stimulus interval was filled by a morph
transition from the end of the first token to the start of the second lasting 267 ms. Instructions
emphasized that the tokens might differ in various ways, but that the task was to determine if the
initial consonants were the same or different. Eleven blocks of pseudo-randomly ordered trials
were presented. The first block was used for practice to ensure the participants’ familiarity with
the task, and it was not analyzed.
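As a check on the pair counts just described, the following Python sketch enumerates the pairs; the token labels are hypothetical placeholders.

```python
from itertools import product

tokens = {syl: [syl + "1", syl + "2"] for syl in ("zha", "fa", "ta")}

# Different pairs: the zha-fa and zha-ta contrasts, all token combinations, both orders.
different = []
for a, b in (("zha", "fa"), ("zha", "ta")):
    for t1, t2 in product(tokens[a], tokens[b]):
        different += [(t1, t2), (t2, t1)]

# Same pairs: the two tokens of one syllable, in both orders (never a token with itself).
same = [(t1, t2) for syl in tokens for t1, t2 in product(tokens[syl], repeat=2) if t1 != t2]

print(len(same), len(different))   # 6 and 16, as described above
block = same * 2 + different       # 12 same + 16 different = 28 pairs per block
```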
vMMN procedure. EEG recordings were obtained during an oddball paradigm in which
standard, deviant, and target stimuli were presented. If one stimulus category is used as the
standard and a different category stimulus is used as the deviant in deriving the vMMN, the
vMMN also contains a response to the physical stimuli (Czigler et al., 2002). In order to compare ERPs to standards versus deviants, holding the stimulus constant, each stimulus was tested in the roles of deviant and standard across different recording blocks (Table 3.1).

Table 3.1. Syllables included in each of four block types.

Block type   Standard   Deviant   Target   Dissimilarity^a
1            “zha”      “ta”      “va”     2.28
2            “ta”       “zha”     “va”     2.28
3            “zha”      “fa”      “ha”     4.04
4            “fa”       “zha”     “ha”     4.04

Note. Each block had a standard syllable, a deviant syllable, and a target syllable.
^a Dissimilarity measures the difference between the standard and the deviant syllable.
EEG recording comprised 40 stimulus blocks divided across four block types (Table 3.1).
Each block type had one standard consonant (i.e., “zha,” “fa,”, or “ta”), one deviant consonant
(i.e., “zha,” “fa,” or “ta”), and one target consonant (i.e., “ha,” or “va”). The “zha” served as
deviant or standard with either “fa” or “ta.” Thus, four vMMNs were sought: (1) “zha” in the
context of “ta” (near); (2) “ta” in the context of “zha” (near); (3) “zha” in the context of “fa”
(far); and (4) “fa” in the context of “zha” (far). Each vMMN was based on 10 stimulus blocks
with the vMMN stimulus in either deviant or standard role. During each block, a deviant was
always preceded by five to nine standards. At the beginning of a block, the standard was
presented 9 times before the first deviant. The inter-stimulus-interval was measured as the
duration of the morphs between the end of a stimulus and the beginning of the next, which was
267 ms.
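The block structure just described can be illustrated with the following Python sketch; it generates only the standard/deviant sequence (targets, which were preceded by three to five standards, are omitted for brevity), and the function name and seed are assumptions.

```python
import numpy as np

def build_block(standard, deviant, n_deviants=20, lead_in=9, gap=(5, 9), seed=0):
    """One illustrative oddball block: `lead_in` standards first, then each deviant
    separated from the next by a run of gap[0]..gap[1] standards."""
    rng = np.random.default_rng(seed)
    seq = [standard] * lead_in
    for _ in range(n_deviants):
        seq.append(deviant)
        seq.extend([standard] * int(rng.integers(gap[0], gap[1] + 1)))
    return seq

block = build_block("zha", "ta", seed=42)   # e.g., "zha" standard, "ta" deviant
```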
To ensure that the visual stimuli were attended, participants were instructed to monitor
the stimuli carefully for a target syllable. At the start of each block, the target syllable was
identified by presenting it six times in succession. A target was always preceded by three to five
standards. Participants were instructed to press a button upon detecting the target, which they
were told would happen rarely. In each block, the target was presented four times, and the
deviant was presented 20 times. In all, 85.4% of stimuli in a block were standards, 12.1% were
deviants and 2.4% were targets. This corresponded to 200 deviant trials and approximately 1400
standard trials per contrast per subject. The first standard trial following either a deviant trial or a
target trial was discarded from analysis, because a standard following something other than a
standard might generate an MMN (Sams et al., 1984; Nousak et al., 1996). This resulted in 1160
standard trials for computing the vMMN.
Participants were instructed to take self-paced breaks between blocks, and longer breaks
were enforced every 10 blocks. Recording time was approximately 4.5 hr per participant. After EEG recording, electrode locations were digitized for each subject using a 3-dimensional digitizer
(Polhemus, Colchester, Vermont).
EEG Recording and Offline Data Processing
EEG data were recorded using a 62-electrode cap that was configured with a modified
10-20 system for electrode placement. Two additional electrodes were affixed at mastoid
locations, and bipolar EOG electrodes were affixed above and below the left eye and at the
external canthi of the eyes to monitor eye movements. The EEG was amplified using a high input
impedance amplifier (SynAmps 2, Neuroscan, NC). It was digitized at 1000 Hz with a 200 Hz
low-pass filter. Electrode impedances were measured, and the inclusion criterion was 35 kOhm.
Offline, data were band-pass filtered from 0.5 Hz to 50 Hz with a 12-dB/octave rolloff
FIR zero phase-shift filter using EDIT 4.5 software (Neuroscan, NC). Eyeblink artifacts were
removed using EDIT’s blink noise reduction algorithm (Semlitsch et al., 1986). Data were
epoched from 100 ms before video onset to 1,000 ms after video onset. Epochs were baseline-
corrected by subtracting the average of the voltage measurements from -100 to +100 ms for each
electrode and then average-referenced.
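A minimal numpy sketch of this epoching, baseline correction, and average referencing is shown below; it is illustrative only and assumes continuous data already filtered as described.

```python
import numpy as np

def epoch_and_reference(eeg, onsets, sfreq=1000, tmin=-0.1, tmax=1.0, base=(-0.1, 0.1)):
    """eeg: (n_channels, n_samples) continuous data; onsets: sample indices of video onsets.
    Returns (n_epochs, n_channels, n_times) epochs, baseline-corrected and average-referenced."""
    pre, post = int(-tmin * sfreq), int(tmax * sfreq)
    b0, b1 = int((base[0] - tmin) * sfreq), int((base[1] - tmin) * sfreq)
    epochs = np.stack([eeg[:, ev - pre: ev + post] for ev in onsets])
    epochs -= epochs[:, :, b0:b1].mean(axis=2, keepdims=True)   # subtract -100 to +100 ms mean
    epochs -= epochs.mean(axis=1, keepdims=True)                # average reference across channels
    return epochs

# Hypothetical use with simulated data:
eeg = np.random.randn(62, 60000)
epochs = epoch_and_reference(eeg, onsets=np.array([5000, 15000, 25000]))
```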
Artifact rejection and interpolation were performed using custom scripts calling functions
in EEGLAB (Delorme and Makeig, 2004). Epochs in which no electrode voltage exceeded 50 µV
at any point in the epoch were included. For those epochs in which only one electrode exceeded
the 50 µV criterion, the data for that electrode were interpolated using spherical spline
interpolation (Picton et al., 2000b). This procedure resulted in inclusion of 91% of the EEG
sweeps. To correct for variation in electrode placement between subjects, individual subject data
were projected onto a group average set of electrode positions using spherical spline interpolation
(Picton et al., 2000b).
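The epoch screening rule (keep an epoch if no channel exceeds 50 µV, flag it for single-channel interpolation if exactly one channel exceeds the criterion, otherwise reject it) can be expressed as in the sketch below; the spherical spline interpolation itself is not reimplemented here.

```python
import numpy as np

def screen_epochs(epochs, threshold_uv=50.0):
    """epochs: (n_epochs, n_channels, n_times) in microvolts. Returns boolean masks
    for epochs to keep as-is, to repair by single-channel interpolation, or to reject."""
    exceeded = (np.abs(epochs) > threshold_uv).any(axis=2)   # (n_epochs, n_channels)
    n_bad = exceeded.sum(axis=1)
    return n_bad == 0, n_bad == 1, n_bad > 1
```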
Analyses of Discrimination Data
Same-different discrimination sensitivity was measured with d’ (Green and Swets, 1966).
The hit rate was the proportion different responses to trials with different syllables. The false
alarm rate was the proportion different responses for same pairs. If the rate was zero it was
replaced with 1/(2N), and if it was one it was replaced by 1 – 1/(2N), where N is the number of
trials (Macmillan and Creelman, 1991). Because this is a same-different design, z(hit rate) – z(false alarm rate) was multiplied by √2 (Macmillan and Creelman, 1991).
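A sketch of this same-different d' computation, with the 1/(2N) correction and the √2 scaling described above, is given below; the example counts are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def same_different_dprime(hits, n_different, false_alarms, n_same):
    """d' for the same-different task: rates of 0 or 1 are replaced by 1/(2N) and
    1 - 1/(2N), and z(hit rate) - z(false alarm rate) is multiplied by sqrt(2)."""
    def rate(k, n):
        p = k / n
        return min(max(p, 1.0 / (2 * n)), 1.0 - 1.0 / (2 * n))
    h = rate(hits, n_different)
    f = rate(false_alarms, n_same)
    return np.sqrt(2) * (norm.ppf(h) - norm.ppf(f))

print(same_different_dprime(hits=140, n_different=160, false_alarms=30, n_same=120))
```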
Target detection during the EEG task was also evaluated using d’, but the measure was
z(hit rate) – z(false alarm rate). A response within 4 seconds of the target presentation was
considered a hit, and a false alarm was any response outside this window. All non-target syllables
were considered distracters for the purpose of calculating a false alarm rate. To assess differences
in target detection across blocks, d’ was submitted to repeated-measures ANOVA.
Analyses of EEG Data
Overview. A priori, the main hypothesis was that visual speech stimuli are processed by
the visual system to the level of representing the exogenous visual syllables. Previous research
had suggested that there was specialization for visual speech stimuli by left posterior temporal
cortex (Campbell et al., 2001; Bernstein et al., 2011; Campbell, 2011; Nath and Beauchamp,
2012). Previous research also suggested that there was specialization for non-speech face motion
by right posterior temporal cortex (Puce et al., 1998; Puce et al., 2000; Puce et al., 2007;
Bernstein et al., 2011). Therefore, the a priori anatomical regions of interest (ROI) were the
bilateral posterior temporal cortices. However, rather than merely selecting electrodes of interest
(EOI) over scalp locations approximately over those cortices and carrying out all analyses with
those EOIs, a more conservative, step-by-step approach was taken, which allowed for the
possibility that deviation detection was carried out elsewhere in cortex (e.g., Sams et al., 1991;
Möttönen et al., 2002).
In order first to test for reliable stimulus deviation effects, independent of temporal
window or spatial location, global field power (GFP; Lehmann and Skrandies, 1980; Skrandies,
1990; See also Appendix A) measures were compared statistically across standard versus deviant
for each of the four different vMMN contrasts. The GFP analyses show the presence and
temporal interval of a deviation response anywhere over the scalp. The first 500 ms post-stimulus
deviation was examined, because that interval was expected to encompass any possible vMMN.
Next, source analyses were carried out to probe whether there was evidence for stimulus
processing by posterior temporal cortices, consistent with previous fMRI results on visual speech
perception (Bernstein et al., 2011). Distributed dipole sources (Tadel et al., 2011) were computed
for the responses to standard stimuli and for the vMMN waveforms. These were inspected and
compared with the previous Bernstein et al. (2011) results and also with results from previous EEG
studies that presented source analyses (Bernstein et al., 2008; Ponton et al., 2009). The inspection
focused on the first 500 ms of the source models.
After examining the source models, EOIs were sought for statistical testing of vMMNs,
taking into account the ERPs at individual electrode locations. For this level of analysis, an
approach was needed to guard against double-dipping, that is, use of the same results to select
and test data for hypothesis testing (Kriegeskorte, 2009). Because we did not have an independent
localizer (i.e., an entirely different data set with which to select EOIs), as is recommended for
fMRI experiments, we ran analyses on several different electrode clusters over posterior temporal
cortices. Because all those results were highly similar, only one set of EOI analyses is presented
here.
A coincident frontal positivity has also been reported for Fz and/or Cz in conjunction
with evidence for a vMMN (Czigler et al., 2002; Czigler et al., 2004). The statistical tests for the
vMMN were carried out separately on ERPs from electrodes Fz and Cz to assess the presence of
a frontal MMN. These tests also served as a check on the validity of the EOI selection. Fz and Cz
electrodes are commonly used for testing the auditory MMN (Näätänen et al., 2007). If the same
results were obtained on Fz and Cz as with the EOIs, the implication would be that EOI selection
was biased towards our hypothesis that the posterior temporal cortices are responsible for visual
speech form representations. The results for Fz and Cz were similar to each other but different
from the EOI results, and only the Fz results are presented here. None of the Cz results were
statistically reliable. ERPs evoked by target stimuli were not analyzed, because so few target
stimuli were presented.
Global field power. GFP (Lehmann and Skrandies, 1980; Skrandies, 1990) is the root
mean squared average-referenced potential over all electrodes at a time sample. The GFP was
calculated for each standard and deviant ERP per stimulus and per subject. The analysis window
was 0-500 ms post stimulus deviation. Statistical analysis of group mean GFP differences
between standard and deviant, within syllable, used randomization testing (Blair and Karniski,
1993; Nichols and Holmes, 2002; Edgington and Onghena, 2007) of the null hypothesis of no
difference between the evoked response when the stimulus was a standard versus the evoked
response when the stimulus was a deviant. The level of re-sampling was the individual trial.
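GFP as defined above (the root mean square of the average-referenced potential across electrodes at each time sample) can be computed as in this minimal sketch.

```python
import numpy as np

def global_field_power(erp):
    """erp: (n_channels, n_times) event-related potential. Returns GFP per time sample."""
    erp = erp - erp.mean(axis=0, keepdims=True)   # re-reference to the channel average
    return np.sqrt((erp ** 2).mean(axis=0))       # RMS across channels
```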
Surrogate mean GFP measures were generated for each subject by permuting the single-
trial labels (i.e., standard or deviant) 1999 times and then computing mean GFP differences
(deviant minus standard) for these permutation samples. These single-subject permutation mean
GFP differences were averaged across subjects to obtain a permutation distribution of group
mean GFP differences within the ERPs for a particular syllable. To avoid bias due to using a
randomly generated subset of the full permutation distribution, the obtained group mean GFP
difference was included in the permutation distribution, resulting in a total of 2000 entries in the
permutation distribution. The p-value for a given time point was calculated as the proportion of
surrogate group mean GFP difference values in the permutation distribution that were as or more
extreme than the obtained group mean GFP difference, resulting in a two-tailed test. Justification
of this testing method and simulations supporting its use are in Appendix A.
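The trial-label permutation scheme just described is sketched below for a single subject; in the actual analysis the per-subject permutation differences were further averaged across subjects to form the group distribution, a step omitted here for brevity.

```python
import numpy as np

def gfp_permutation_pvalues(std_trials, dev_trials, n_perm=1999, seed=0):
    """Single-subject sketch. *_trials: (n_trials, n_channels, n_times). Returns a
    pointwise two-tailed p-value for the deviant-minus-standard mean GFP difference."""
    rng = np.random.default_rng(seed)

    def mean_gfp(trials):
        erp = trials.mean(axis=0)                      # ERP from these trials
        erp = erp - erp.mean(axis=0, keepdims=True)    # average reference
        return np.sqrt((erp ** 2).mean(axis=0))        # GFP per time sample

    pooled = np.concatenate([std_trials, dev_trials])
    n_std = len(std_trials)
    observed = mean_gfp(dev_trials) - mean_gfp(std_trials)

    diffs = [observed]                                  # include the observed difference
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))              # permute the trial labels
        diffs.append(mean_gfp(pooled[idx[n_std:]]) - mean_gfp(pooled[idx[:n_std]]))
    diffs = np.abs(np.array(diffs))
    return (diffs >= np.abs(observed)).mean(axis=0)     # two-tailed p per time point
```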
To correct for multiple comparisons over time, a threshold length of consecutive p-values
< .05 was established (Blair and Karniski, 1993; Groppe et al., 2011). The threshold number of
consecutive p-values was determined from the permutation distribution generated in the
corresponding uncorrected test. For each entry in the permutation distribution, a surrogate p-value
series was computed as though that entry were the actual data. Then, the largest number of
consecutive p-values < .05 in that surrogate p-value series was computed for each permutation
entry. The threshold number of consecutive p-values was the 95th percentile of this null
distribution of run lengths. This correction, which offers weak control over family-wise error rate
and is appropriate when effects persist over many consecutive samples (Groppe et al., 2011), is
similar to one used with parametric statistics (Guthrie and Buchwald, 1991) but requires no
assumptions or knowledge about the autocorrelation structure of the underlying signal or noise.
Simulations demonstrating the suitability of this correction, and comparisons with other methods
of correction for multiple comparisons are in Appendix B.
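The run-length criterion can be computed from the same permutation distribution as in the following sketch.

```python
import numpy as np

def longest_run(pvals, alpha=0.05):
    """Length of the longest run of consecutive p-values below alpha."""
    best = run = 0
    for p in pvals:
        run = run + 1 if p < alpha else 0
        best = max(best, run)
    return best

def run_length_threshold(perm_pval_series, alpha=0.05):
    """perm_pval_series: (n_permutations, n_times) p-value series, each computed as
    though one permutation entry were the real data. Returns the run length at the
    95th percentile of the null distribution of longest runs."""
    runs = np.array([longest_run(p, alpha) for p in perm_pval_series])
    return np.percentile(runs, 95)
```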
EEG distributed dipole source models. EEG sources were modeled with
distributed dipole source imaging using Brainstorm software (Tadel et al., 2011). In lieu of
having individual anatomical MRI data for source space and forward modeling, the MNI/Colin 27
brain was used. A boundary element model (Gramfort et al., 2010) was fit to the anatomical
model using a scalp model with 1082 vertices, a skull model with 642 vertices, and a brain model
with 642 vertices. The cortical surface was used as the source space, and source orientations were
constrained to be normal to the cortical surface. Cortical activity was estimated using depth-
weighted minimum-norm estimation (wMNE; Baillet et al., 2001).
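The depth-weighted minimum-norm estimate can be written compactly as J = R Gᵀ (G R Gᵀ + λ²C)⁻¹ M, where G is the lead field, R a diagonal source prior carrying the depth weights, C the noise covariance, and M the sensor data. The following numpy sketch implements that generic formula; it is not Brainstorm's implementation, and the regularization and depth-weighting exponent shown are illustrative assumptions.

```python
import numpy as np

def depth_weighted_mne(leadfield, data, noise_cov, lam=0.1, gamma=0.8):
    """leadfield: (n_channels, n_sources), fixed surface-normal orientations;
    data: (n_channels, n_times); noise_cov: (n_channels, n_channels).
    Returns (n_sources, n_times) estimated dipole amplitudes."""
    # Depth weighting: down-weight superficial sources with large gain vectors.
    w = np.sum(leadfield ** 2, axis=0) ** -gamma
    R = np.diag(w)                                    # source covariance prior
    G = leadfield
    inverse_op = R @ G.T @ np.linalg.inv(G @ R @ G.T + lam ** 2 * noise_cov)
    return inverse_op @ data
```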
EEG source localization is generally less precise than some other neuroimaging
techniques (Michel et al., 2004). Simulations comparing source localization techniques resulted in
a mean localization error of 19.6 mm when using a generic brain model (Darvas et al., 2006). Because similar methods were used here, localization error is estimated at approximately 20 mm. Therefore, the source solutions found here serve as useful visualization tools and for EOI selection, but they are not intended for drawing conclusions about precise anatomical localization.
vMMN analyses. The vMMN analyses used the same general approach as the GFP analyses rather than the more common analysis of difference waveforms. To assess
the reliability of the vMMNs for each stimulus, the average of the ERP for the EOIs for the
token-as-standard was compared with the average of the ERPs for the token-as-deviant using a
standard paired-samples permutation test (Edgington and Onghena, 2007) with the subject mean
ERP as the unit of re-sampling. A threshold number of consecutive p-values < .05 was
established to correct for multiple comparisons using the same criterion (Blair and Karniski,
1993) as described above for the GFP analyses. The EOI cluster results that are presented are
from the clusters left P5, P3, P1, PO7, PO5 and PO3, and right P2, P4, P6, PO4, PO6 and PO8.^2

^2 The alternate EOI clusters that were analyzed were: left (TP7 CP5 P7 P5), right (CP6 TP8 P6 P8); left
(CP5 CP3 CP1 P7 P5 P3 P1 PO7 PO5 PO3 CB1) right (CP2 CP4 CP6 P2 P4 P6 P8 PO4 PO6 PO8 CB2);
and left (TP7 CP5 CP3 CP1 P7 P5 P3 P1 PO7 PO5 PO3 CB1), right (CP2 CP4 CP6 TP8 P2 P4 P6 P8 PO4
PO6 PO8 CB2).
We also carried out comparisons of the difference waveforms across near versus far contrasts.
These were a general check on the hypothesis that far contrasts were different from near
contrasts.
In some cases in which a vMMN is observed, a coincident frontal positivity has also been
reported for Fz and/or Cz (Czigler et al., 2002; Czigler et al., 2004). The statistical tests for the
vMMN were carried out separately on ERPs from electrodes Fz and Cz to assess the presence of
a frontal MMN.
Results
Behavioral Results
The purpose of testing behavioral discrimination was to confirm that the stimulus pair discriminability was as predicted. The 49 screened participants were tested, and the EEG
data from 11 of them are reported here. Discrimination d’ scores were compared across groups
(included vs. excluded participants) using analysis of variance (ANOVA) with the within-subjects
factor of stimulus distance (near vs. far) and between-subjects factor of group (included vs.
excluded). The groups were not reliably different, and group did not interact with stimulus
distance.
Far pairs were discriminated better than near pairs, F(1,47) = 591.7, p < .001, mean
difference in d’ = 3.13. Within the EEG group, mean d’ for the far stimulus pairs was reliably
higher than for the near stimulus pairs, paired-t(10) = 12.25, p < 0.001, mean difference in d’
= 3.02. Mean d’ was reliably above chance for both near, t(10) = 8.09, p < 0.001, M = 1.40, and
far, t(10) = 15.62, p < 0.001, M = 4.51, stimulus pairs.
Detection d’ of “ha” or “va” during EEG recording was high, group mean d’ = 4.83,
range [3.83, 5.91]. The two targets were detected at similar levels, paired-t(10) = 0.23, p = .82.
For neither target syllable was there any effect of which syllable was the standard in the EEG
recording block.
ERPs across vMMN stimulus pairs. The ERP group mean data sets for the four
stimulus pairs were inspected for data quality.
GFP results. GFP measures were computed for each standard and deviant syllable.
Holding syllable constant, the standard versus deviant GFP was compared to determine whether
and, if so, when a reliable effect of stimulus deviance was present in each of the four stimulus
conditions (i.e., “zha” in the near context, “zha” in the far context, “fa” a far contrast, and “ta” a
near contrast). All of the stimulus contrasts resulted in reliable effects. Figure 3.3 summarizes the GFP results for each vMMN.

Figure 3.3. Global field power plots for the four vMMN contrasts. Group mean global field power (mGFP) evoked by the standard and the deviant are shown for (A) “zha” in the near context, (B) “ta” in the near context, (C) “zha” in the far context, and (D) “fa” in the far context. The time axes show time relative to the onset of stimulus deviation. Highlighted regions show times of statistically significant difference (p < .05) between standard and deviant GFPs, as determined by a permutation test corrected for multiple comparisons over time. Statistical comparisons were performed over the times indicated by the heavy black line along the time axis. This time window was selected to include the expected time for a vMMN evoked by the consonant part of the syllable.

The reliable
GFP difference for “zha” in the far
context was 200-500 ms post-deviation
onset. For “zha” in the near context, there
were two intervals of reliable difference,
268-329 ms and 338-500 ms post-
deviation onset. The reliable difference for
“fa” was 52-500 ms post-deviation onset.
The reliable difference for “ta” was 452-
500 ms post-deviation onset.
Distributed dipole source
models. Dipole source models were
computed using ERPs obtained with
standard stimuli (“zha,” “fa,” and “ta”) in
order to visualize the spatiotemporal
patterns of exogenously driven responses
to the stimuli. Figures 3.4-3.6 show the dipole
source strength at 20-ms intervals starting
from 90 ms after onset of visible motion
until 670 ms for the group mean ERPs.
The images are thresholded to only show
dipole sources stronger than 20 pA·m. The
figures show images starting at 90 ms
post-stimulus onset, because no suprathreshold sources were obtained earlier. The images continue through 690 ms to indicate that posterior activity rises and falls within the interval, as would be expected in response to a temporally unfolding stimulus.

Figure 3.4. Source images for “fa” standard. Images show the depth-weighted minimum norm estimate of dipole source strength constrained to the surface of the cortex using a boundary element forward model and a generic anatomical model at 20-ms intervals starting from 90 ms after onset of visible motion for the group mean ERPs for syllable “fa” as standard. The cyan bar indicates the time at which “fa” visibly differs from “zha.” Images are thresholded at 20 pA·m. Initial activity is in the occipital cortex. At 150 ms after syllable onset, the bilateral posterior temporal activity begins that lasts until 290 ms in the left hemisphere and until 490 ms in the right hemisphere. Activation in the right posterior temporal cortex is more widespread and inferior to that on the left. Fronto-central activity is visible from 250 to 510 ms post-stimulus onset.
The right hemisphere overall
appeared to have stronger and more
sustained responses focused on posterior
temporal cortex. Additionally, the right
posterior temporal activation was more
widespread but with a more inferior focus
compared to that in left posterior temporal
cortex. Variations in the anatomical
locations of the foci of activity across
Figures 3.4-3.6 suggest the possibility that activation sites varied as a function of syllable, but these variations cannot be interpreted
with confidence given the relatively low
level of spatial resolution of these
distributed dipole source models.
The temporal differences across
syllable are more interpretable. Variation
across syllables is attributed to differences in stimulus kinematics.

Figure 3.5. Source images for “zha” standard. Images show the depth-weighted minimum norm estimate of dipole source strength at 20-ms intervals starting from 90 ms after the onset of visible motion for the group mean ERPs for syllable “zha” as standard. The cyan bar indicates the time at which “zha” visibly differs from “fa,” and the magenta bar indicates the time at which “zha” visibly differs from “ta.” Images are thresholded at 20 pA·m. Initial activity is in the occipital cortex. At 190 ms after syllable onset, strong, widespread bilateral posterior temporal activity begins that lasts until 290 ms, with weaker activations recurring through 610 ms post-stimulus onset. Activation in the right posterior temporal cortex is more widespread and inferior to that on the left. Fronto-central activity is visible from 270 to 490 ms post-stimulus onset.

The “fa” standard
(Figure 3.4) resulted in sustained right
hemisphere posterior temporal activity
from approximately 120 ms to 490 ms
relative to stimulus onset and sustained left
hemisphere posterior temporal activity
from approximately 170 ms to 270 ms. The
“zha” standard (Figure 3.5) resulted in
sustained right hemisphere posterior
temporal activity from approximately 190
ms to 430 ms and sustained left
hemisphere posterior temporal activity
from approximately 190 ms to 390 ms. The
“ta” standard (Figure 3.6) resulted in
sustained right hemisphere posterior
temporal activity from approximately 150
ms to 250 ms and sustained left
hemisphere posterior temporal activity
from approximately 150 ms to 230 ms. The
shorter period of sustained activity for “ta”
versus “fa” and “zha” can be explained by
its shorter (fewer frames) initial
articulatory gesture (Figure 3.2).
Figure 3.6. Source images for “ta” standard. Images
show the depth-weighted minimum norm estimate of
dipole source strength constrained to the surface of the
cortex using a boundary element forward model and a
generic anatomical model at 20-ms intervals starting
from 90 ms after the onset of visible motion for the group
mean ERPs for syllable “ta” as standard. The magenta
bar indicates the time at which “ta” visibly differs from
“zha.” Images are thresholded at 20 pA·m. Initial activity
is in occipital cortex. At 130 ms after syllable onset,
bilateral posterior temporal activity begins that fades by
250 ms post-stimulus onset, but then recurs from 330 to
590 ms on the right and from 330 to 470 ms on the left.
Fronto-central activity is visible from 270 to 470 ms
post-stimulus onset.
Some fronto-central and central activity emerged starting 220 to 280 ms post-stimulus
onset, particularly with “zha” and “fa.” No other prominent activations were obtained elsewhere
during the initial periods of sustained posterior temporal activity.
Dipole source models were also computed on the vMMN difference waveforms, resulting
in lower signal strength in posterior temporal cortices in comparison with models based on the
standard ERPs. The models support the presence of deviance responses in those cortical areas and
higher right posterior activity for far contrasts than near contrasts. All of the difference waveform
models demonstrate patterns of asymmetric frontal activity with greatest strength generally
beyond 200 ms post-deviation that seems attributable to attention to the deviant.
vMMN results. ERPs of EOI clusters for each syllable contrast and hemisphere were
submitted to analyses to determine the reliability of the deviance responses. Thus, there were four
vMMN analyses per hemisphere. They were for “zha” in its near or far context, “fa” in the far
context, and “ta” in the near context. Summaries of the results are given in Table 3.2. The
duration (begin points to end points) of reliable deviance responses varied across syllables (from 50 ms to 185 ms) and varied in mean voltage (from -0.35 µV to -0.85 µV).

Table 3.2. Summary of reliable vMMNs.

Syllable (contrast)   Electrode(s)   Begin (ms)   End (ms)   Duration (ms)
Zha (Far)             LPT            322          497        176
                      RPT            324          500        177
Zha (Near)            RPT            239          288        50
Fa (Far)              LPT            251          435        185
                      RPT            300          442        143
Ta (Near)             RPT            449          500        52

Note. All times are relative to deviance onset. LPT: left posterior temporal; RPT: right posterior temporal.
^a The p-value corresponds to the entire indicated time window and is corrected for multiple comparisons over time.
^b The mean is the group average deviant minus standard, averaged over the period from the begin to end points.
Figure 3.7 shows the statistical results for the EOI cluster waveforms for each contrast
and hemisphere. The theoretically predicted results were obtained. All of the right-hemisphere
contrasts resulted in reliable deviance responses. They were “zha” in the near context from 239 to
288 ms post-deviation onset, “zha” in the far context from 324 to 500 ms post-deviation onset,
“ta” from 449 to 500 ms post-deviation onset, and “fa” from 300 to 442 ms post-deviation onset.
Only the far contrasts resulted in reliable left-hemisphere deviance responses. They were “zha” in
the far context from 322 to 497 ms post-deviation onset and “fa” from 251 to 435 ms post-
deviation onset.
Comparison of far versus near vMMNs. Difference waveforms were computed
using the standard type of approach to the vMMN, that is, by subtracting the EOI cluster ERPs to
standards from the response to deviants for each stimulus contrast and hemisphere on a per-subject basis.

Figure 3.7. Group mean ERPs and vMMN analyses for posterior temporal EOI clusters. Time shown is relative to stimulus deviation onset. Statistical comparisons were performed over the times indicated by the heavy black line along the time axis. Highlighted regions denote statistically significant (p < .05) differences of ERPs evoked by the stimulus as deviant versus standard, corrected for multiple comparisons over time. Reliable differences were obtained for the right EOI means with all four syllable contrasts. Reliable differences were obtained for the left EOI means with only two far vMMN contrasts.

The magnitudes of the vMMN waveforms were then compared between far and
near contrasts using the resampling method that was applied to the analyses of standards versus
deviants.
The “zha” near and far vMMN waveforms were found to be reliably different (Figure
3.8). On the left, the difference wave for “zha” in the far context was reliably larger (i.e., more
negative) than for “zha” in the near context (320 to 443 ms post-deviation onset), which was not unexpected given that the near context did not result in an observable vMMN. On the
right, the difference wave was also
reliably larger for “zha” in the far
context (from 331 to 449 ms post-
deviation onset), although both contexts
were effective. The results were similar
when the vMMN waveforms were
compared between “fa” versus “ta”
(Figure 3.8B). On the left, the difference
wave for “fa” was reliably larger than
for “ta” (309 to 386 ms post-deviation
onset). On the right, the difference wave
was also reliably larger for “fa” (from
327 to 420 ms post-deviation onset).
Figure 3.8. vMMN comparisons. Group mean vMMN difference waves (deviant minus standard) from EOI means were compared to test whether syllable distance (far vs. near) predicted relative vMMN magnitude. Comparisons were (A) “zha” in the far context versus “zha” in the near context, and (B) “fa” (far context) versus “ta” (near context). Statistical comparisons were performed over the times indicated by the heavy black line along the time axis. Highlighted regions denote statistically significant (p < .05) differences in the vMMNs corrected for multiple comparisons over time. For display purposes only, difference waves were smoothed with a 41-sample moving average.

Fronto-central results. ERPs were analyzed based on recordings from electrodes Fz and Cz, because these electrodes are typically used to
obtain an auditory MMN (Kujala et al., 2007),
but positivities on these electrodes have been
reported for vMMNs (Czigler et al., 2002;
Czigler et al., 2004). Results with Fz (Figure
3.9) showed reliable effects for “ta,” “fa,” and
“zha” far. None of the Cz results were
reliable. Reliable differences with the deviant
ERPs more positive were found on Fz for both
of the far contrasts, from 282 to 442 ms post-
deviation onset for “fa” and from 327 to 492
ms post-deviation onset for “zha” in the far
context. These positive differences occur at
similar times and with opposite polarity as the
posterior temporal vMMNs. A reliable
positivity was also obtained for “ta” from 151
to 218 ms post-deviation onset, but no reliable
difference was obtained for “zha” in the near
context.
Figure 3.9. Group mean ERPs and MMN analyses for (A) electrode Fz and (B) electrode Cz. Group mean ERPs are shown for “zha” in the near context, “ta” in the near context, “zha” in the far context, and “fa” in the far context. Times shown are relative to stimulus deviation onset. Highlighted time regions show statistically significant (p < .05) differences of the ERP evoked by the deviant from the ERP evoked by the standard, corrected for multiple comparisons over time. Statistical comparisons were performed for the times indicated by the heavy black line along the time axis. Reliable positive differences (deviant vs. standard) were obtained on electrode Fz for the two far syllable contrasts and for the near syllable contrast “ta.” No reliable differences were obtained on electrode Cz.

General Discussion

This study investigated the brain’s response to visual speech deviance, taking into account that (1) responses to stimulus deviants are considered to be generated by the cortex that represents the stimulus (Winkler and Czigler,
2012), and (2) that there is evidence that exogenous visual speech processing is lateralized to left
posterior temporal cortex (Campbell, 1986; Campbell et al., 2001; Bernstein et al., 2011). Taken
together these observations imply that the right and left posterior temporal cortices represent
visual speech stimuli differently, and therefore that their responses to stimulus deviance should
differ.
We hypothesized that the right posterior temporal cortex, for which there are indications
of representing simple non-speech face gestures (Puce et al., 2000; Puce et al., 2003), would
generate the deviance response to both perceptually near and perceptually far speech stimulus
changes (Figure 3.1). In contrast, the left hemisphere, for which there are indications of
specialization (Campbell et al., 2001; Bernstein et al., 2011; Campbell, 2011; Nath and
Beauchamp, 2012) for representing the exogenous stimulus forms of speech, would generate the
deviance response only to perceptually far speech stimulus changes. That is, it would be tuned to
stimulus differences that are readily perceived as different consonants (Figure 3.1).
Two vMMNs were sought for far stimulus deviations (one for “zha” and one for “fa”),
and two vMMNs were sought for near stimulus deviations (one for “zha” and one for “ta”). The
“zha” stimulus was used to obtain a perceptually near and a perceptually far contrast in order to
hold consonant constant across perceptual distances. Reliable vMMN contrasts supported the
predicted hemispheric effects. The left-hemisphere vMMNs were obtained only with the highly
discriminable (far) stimuli, but the right-hemisphere vMMNs were obtained with both the near
and far stimulus contrasts. There were also reliable differences between vMMN difference
waveforms as a function of perceptual distance, with larger vMMN difference waveforms
associated with larger perceptual distances.
Evidence for the vMMN deviance response with speech stimuli.
Previous reports have been mixed concerning support for a posterior vMMN specific to
visual speech form-based deviation (i.e., deviation based on the phonetic stimulus forms). An
early study failed to observe any vMMN in a paradigm in which a single visual speech token was
presented as a deviant and a single different speech token was presented as a standard (Colin et
al., 2002). A more recent study (Saint-Amour et al., 2007) likewise failed to obtain a vMMN
response.
In Colin et al. (2004), a posterior difference between the ERP evoked by a standard
syllable and the ERP evoked by a deviant syllable was obtained on Oz (from 155 to 414 ms), but this difference was attributed to low-level (non-speech) stimulus differences and not to speech syllable differences, because the effect involved two different stimuli. A subsequent experiment controlling for stimulus difference found no vMMN for visual speech alone. For example, the original deviance detection could have arisen at a lower level, such as from temporal or spatial frequency differences between the stimuli, or it could have been the result of shifts in the talker’s
eye gaze across stimuli. A study by Möttönen and colleagues (Möttönen et al., 2002) used
magnetoencephalography (MEG) to record the deviance response with a single standard (“ipi”)
versus a single deviant (“iti”). The mismatch response was at 245-410 ms on the left and 245-405
ms on the right. But again, these responses cannot be attributed exclusively to consonant change.
Winkler et al. (2009) compared the ERPs to a “ka” stimulus in its roles as standard versus
deviant and reported a late occipital difference response, and possibly also an earlier negative
difference peak at 260 ms on occipital electrodes that did not reach significance. In their study,
the vMMN could not be attributed to changes in lower-level stimulus attributes.
Ponton et al. (2009) used a similar approach in attempting to obtain vMMNs for “ga” and
“ba.” A reliable vMMN was obtained for “ba” only. The authors speculated that the structure of
the “ga” stimulus might have impeded obtaining a reliable vMMN with it. The stimulus
contained three early rapid discontinuities in the visible movement of the jaw, which might have
each generated their own C1, P1, and N1 responses, resulting in the oscillatory appearance of the
obtained vMMN difference waveforms. Using current density reconstruction modeling (Fuchs et
al., 1999), the “ba” vMMN was reliably localized only to the right posterior superior temporal
gyrus, peaking around 215 ms following stimulus onset. The present study suggests that the
greater reliability for localizing the right posterior response could be due to generally more
vigorous responding by that hemisphere.
As suggested in Ponton et al. (2009), whether a vMMN is obtained for speech stimuli
could depend on stimulus kinematics. The current study took into account kinematics and the
different deviation points across the different stimulus pairs. Inasmuch as the vMMN is expected
to arise following deviation onset (Leitman et al., 2009), establishing the correct time point from
which to measure the vMMN is critical. A method was devised here to establish the onset of
stimulus deviation. The method was fairly gross, involving inspection of the video frames and
measurement of the lip-opening area to align the stimuli within phoneme category and establish
deviation across categories (Figure 3.2), but it resulted in good correspondence of the vMMNs
latencies across stimuli and with previous positive reports (Ponton et al., 2009; Winkler et al.,
2009).
The distributed dipole models of the standard stimuli here (Figures 3.4-3.6) suggest that the
posterior temporal cortex responds to speech stimuli by 170-190 ms post-stimulus onset and
continues to respond for approximately 200 ms. This interval is commensurate with the reliable
vMMNs here (Table 3.2), which were measured using the electrode locations approximately over
the posterior temporal response foci in the distributed dipole models. The results here are
considered strong evidence that there is a posterior visual speech deviance response that is
sensitive to consonant dissimilarity, but that detailed attention to stimulus attributes may be
needed on the part of researchers in order to obtain it reliably.
Hemispheric asymmetry of visual speech stimulus processing.
Beyond demonstrating that visual speech deviance is responded to by high-level visual
cortices, the current study focused on the hypothesis that the right and left posterior temporal
cortices would demonstrate lateralized processing. The distributed dipole source models (Figures 3.4-3.6) show somewhat different areas of posterior temporal cortex to have been activated by each
of the standard stimuli. In addition, during the first 400-500 ms post-stimulus onset, the activation
appears to be greater for the right hemisphere.
There are published results that support functional anatomical asymmetry for processing
non-speech face stimuli. For example, the right pSTS has been shown to be critically involved in
processing eye gaze stimuli (Ethofer et al., 2011). In an ERP study that alternated mouth-open and mouth-closed stimuli, the most prominent effect was a posterior negative potential around 170 ms,
which appeared to be larger on the right but was not reliably so (Puce et al., 2003). The
researchers point out that the low spatial resolution with ERPs precludes the possibility of
attributing their obtained effects exclusively to pSTS, because close cortical areas such as the
human motion processing area (V5/MT) could also contribute to activation that appears to be
localized to pSTS. Thus, although there is evidence in their study and here of functional
specialization across hemispheres, the indeterminacies with EEG source modeling preclude
strong statements about the specific neuroanatomical regions within the posterior temporal
cortices. However, an fMRI study (Bernstein et al., 2011), in which localizers were used, did show
that V5/MT was under-activated by visual speech in contrast with non-speech stimuli.
The left posterior temporal EOI deviance responses here are consistent with the temporal
visual speech area (TVSA) reported by Bernstein et al. (2011) and are generally consistent with
observations in other neuroimaging studies of lipreading (Calvert and Campbell, 2003; Paulesu et
al., 2003; Skipper et al., 2005; Capek et al., 2008). The TVSA appears to be in the pathway that has also been implicated in multisensory speech integration (Calvert, 2001; Nath and Beauchamp, 2011).
The current results are consistent with the suggestion (Bernstein et al., 2011) that visual speech
stimuli are extensively processed by the visual system prior to being mapped to higher-level
speech representations, including semantic representations, in more anterior temporal cortices
(Scott and Johnsrude, 2003; Hickok and Poeppel, 2007).
The right- versus left-hemisphere vMMN results could be viewed as paradoxical under
the assumption that sensitivity to speech stimulus deviation is evidence for specialization for
speech. That is, the four vMMNs on the right might seem to afford more speech processing
information than the two on the left. Here, the near deviant stimuli were discriminable as different
patterns of speech gestures. However, the obtained d’ discrimination measures of approximately 1.4 for the near contrasts are commensurate with previous results showing that the stimuli are not
reliably labeled as different speech phonemes (Jiang et al., 2007a). Stimulus categorization
involves generalization across small and/or irrelevant stimulus variation (Goldstone, 1994; Jiang
et al., 2007b). Neural representations are the recipients of convergent and divergent connections,
such that different lower-level representations can map to the same higher-level representation,
and similar lower-level representations can map to different higher-level representations (Ahissar
et al., 2008). Small stimulus differences that do not signal different phonemes could be mapped to
the same representations on the left but mapped to different representations on the right (Figure
3.1).
The vMMNs on the left are explicitly not attributed to phoneme category representations
but to the representation of the exogenous stimulus forms that are mapped to category
representations, an organizational arrangement that is observed for non-speech object processing
(Grill-Spector et al., 2006; Jiang et al., 2007b). This type of organization is also thought to be true
for auditory speech processing, which is initiated at the cortical level with basic auditory features
(e.g., frequencies, amplitudes) that are projected to exogenous phonetic stimulus forms, and then
to higher-level phoneme, syllable, or lexical category representations (Binder et al., 2000; Scott et
al., 2000; Eggermont, 2001; Scott, 2005; Hickok and Poeppel, 2007; Obleser and Eisner, 2009;
May and Tiitinen, 2010; Näätänen et al., 2011).
Thus, the sensitivity of the left posterior temporal cortex only to larger deviations is
expected for a lateralized language processing system that needs exogenous stimulus
representations that can be reliably mapped to higher-level categories (Binder et al., 2000;
Spitsyna et al., 2006). The deviation detection on the right could be more tightly integrated into a
system responsive to social and affective signals (Puce et al., 2003), for which an inventory of
categories such as phonemes that are combinatorially arranged is not required. For example, the
right-hemisphere sensitivity to smaller stimulus deviations could be related to processing of
emotion or visual attention stimuli (Puce et al., 1998; Puce et al., 2000; Puce et al., 2003;
Wheaton et al., 2004; Thompson et al., 2007).
Dissimilarity.
Here, four vMMNs were sought in a design incorporating between- and within-consonant
category stimuli and estimates of between-consonant category perceptual dissimilarity (Files and
Bernstein, submitted; Jiang et al., 2007a). The perceptual dissimilarities were confirmed, and the
vMMNs were consistent with the discrimination measures: Larger d’ was associated with larger
vMMNs as predicted based on the expectation that the extent of neuronal representation overlap
is related to the magnitude of the vMMN (Winkler and Czigler, 2012) (Figure 3.1). The direct
comparison of the vMMN difference waves showed that, while holding stimulus constant (i.e.,
“zha”), the magnitude of the vMMN varied reliably with the context in which it was obtained. In
the far (“fa”) context, the vMMN was larger than in the near (“ta”) context. To our knowledge,
this is the first demonstration of predicted and reliable relative difference in the vMMN as a
function of visual speech discriminability. This finding was also supported by the results for the
other two stimuli, “ta” and “fa.”
These results converge with previous results on the relationship between visual speech
discrimination and the physical visual stimuli. Jiang et al. (2007a) showed that the perceptual
dissimilarity space obtained through multidimensional scaling of visual speech phoneme
identification can be accounted for in terms of a physical (i.e., 3D optical) perceptually (linearly)
warped multidimensional speech stimulus space. Files and Bernstein (in submission) followed up
on those results and showed that the same dissimilarity space successfully predicts perceptual
discrimination of the consonants. That is, the modeled perceptual dissimilarities based on warped
stimulus differences predicted discrimination results and the deviance responses here.
The controlled dissimilarity factor in the current experiment afforded a unique approach
to investigation of hemispheric specialization for visual speech processing. An alternate approach
would be to compare ERPs obtained with speech versus non-speech face gestures, as has been
done in an fMRI experiment (Bernstein et al., 2011). However, that particular approach could
introduce uncontrolled factors such as different salience of speech versus non-speech stimuli. The
current vMMN results also contribute a new insight about speech perception beyond that obtained
within the Jiang et al. (2007a), and Chapter 2 perceptual studies. Specifically, the results here
suggest that two types of representations contribute to the perceptual discriminability of visual
speech stimuli, speech consonant representations and face gesture representations.
Mechanisms of the vMMN response.
One of the goals of vMMN research, and MMN research more generally, has been to
establish the mechanism/s that are responsible for the brain’s response to stimulus deviance
(Jääskeläinen et al., 2004; Näätänen et al., 2005; Näätänen et al., 2007; Kimura et al., 2009; May
and Tiitinen, 2010). A main issue has been whether the cortical response to deviant stimuli is a
so-called “higher-order memory-based process” or a neural adaptation effect (May and Tiitinen,
2010). The traditional paradigm for deriving the MMN (i.e., subtracting the ERP based on
responses to standards from the ERP based on responses to deviants when deviant and standard
are the same stimulus) was designed to show that the deviance response is a memory-based
process. But the issue then arose whether the MMN is due entirely instead to refractoriness or
adaptation of the same neuronal population activated by the same stimulus in its two different
roles. The so-called “equiprobable paradigm” was designed to control for effects of refractoriness
separate from deviance detection (Schroger and Wolff, 1996, 1997). The current study did not
make use of the equiprobable paradigm, and we did not seek to address through our experimental
design the question whether the deviance response is due to refractoriness/adaptation or a
separate memory mechanism. We do think that our design rules out low-level stimulus effects
and points to higher-level deviance detection responses at the level of speech processing.
The stimuli presented in the current vMMN experiment were not merely repetitions of
the exact same stimulus. Deviants and standards were two different video tokens whose stimulus
attributes differed (see Figure 3.2). These stimulus differences were such that it was necessary to
devise a method to bring them into alignment with each other and to define deviation points,
which were different depending on which vMMN was being analyzed. Furthermore, the stimuli
were slightly jittered in position on the video monitor during presentation to defend additionally
against low-level effects of stimulus repetition. Thus, the deviation detection at issue was relevant
to consonant stimulus forms. We interpret the lateralization effects to be the result of the left
hemisphere being more specialized for linguistically-relevant stimulus forms and the right
hemisphere being more specialized for facial gestures that, while not necessarily discrete categories, were nevertheless detected as different gestures (Puce et al., 1996). However, these
results do not adjudicate between explanations that attempt to separate adaptation/refractoriness
from an additional memory comparison process.
vMMN to attended stimuli.
The auditory MMN is known to be obtained both with and without attention (Näätänen et
al., 1978; Näätänen et al., 2005; Näätänen et al., 2007). Similarly, the vMMN is elicited in the
absence of attention (Winkler et al., 2005; Czigler, 2007; Stefanics et al., 2011; Stefanics et al.,
2012). Here, participants were required to attend to the stimuli and carry out a phoneme-level
target detection task. Visual attention can result in attention-related ERP components in a similar
latency range as the vMMN. A negativity on posterior lateral electrodes is commonly observed
and is referred to as the posterior N2, N2c, or selection negativity (SN) (Folstein and Van Petten,
2008). However, the current results are not likely attributable to the SN, as the magnitude of the
vMMN increased with perceptual dissimilarity of the standard from the deviant, whereas the SN
is expected to increase with perceptual similarity of the deviant to a task-relevant target (Baas et
al., 2002; Proverbio et al., 2009). Here, the target consonant was chosen to be equally dissimilar
from both the standard and the deviant stimuli in a block, and this dissimilarity was similar
across blocks. Therefore, differences in vMMN across syllables are unlikely to be attributable to the
similarity of the deviant to the target: The task was constant in terms of the discriminability of the
target, but the vMMNs varied in amplitude.
No auditory MMN.
Results of this study do not support the hypothesis that visual speech deviations are
exogenously processed by the auditory cortex (Sams et al., 1991; Möttönen et al., 2002). This
possibility received attention previously in the literature (e.g., Calvert et al., 1997; Bernstein et
al., 2002; Pekkola et al., 2005). Seen vocalizations can modulate the response of auditory cortex
(Möttönen et al., 2002; Pekkola et al., 2006; Saint-Amour et al., 2007), but the dipole source
models of ERPs obtained with standard stimuli (Figures 3.4-3.6) do not show sources that can be
attributed to the region of the primary auditory cortex. Nonetheless, the Fz and Cz ERPs obtained
with standards and deviants were compared in part because of the possibility that an MMN
reminiscent of an auditory MMN (Näätänen et al., 2007) might be obtained. Instead, a reliable
positivity was found for the two far syllable contrasts. The timing of this positivity was similar to
that of the vMMN observed on posterior temporal electrodes but was opposite in polarity. Similar
positivities have been reported for other vMMN experiments and could reflect inversion of the
posterior vMMN or some related but distinct component (Czigler et al., 2002; Czigler et al.,
2004).
Summary and conclusions.
Previous reports on the vMMN with visual speech stimuli were mixed, with relatively
little evidence obtained for a visual deviation detection response. Here, the details of the visual
stimuli were carefully observed for their deviation points. The possibility was taken into account
that across hemispheres the two posterior temporal cortices represent speech stimuli differently.
The left posterior temporal cortex, hypothesized to represent visual speech forms as input to a
left-lateralized language processing system, was predicted to be responsive to perceptually large
deviations between consonants. The right hemisphere, hypothesized to be sensitive to face and
eye movements, was predicted to detect both perceptually large and small deviations between
consonants. The predictions were shown to be correct. The vMMNs that were obtained for the
perceptually far deviants were reliable bilaterally over posterior temporal cortices, but the
vMMNs for the perceptually near deviants were reliably observed only over the right posterior
temporal cortex. The results support a left-lateralized visual speech processing system.
Chapter 4: Conclusions
Here, I will summarize the main findings and theoretical implications of the experiments
in Chapters 2 and 3. I will conclude with some suggestions for future research based on the
findings in this dissertation.
Perceptual Dissimilarity Maps to Optical Dissimilarity
The results of this study largely corroborated the predictions of a strictly visual model of
visible syllable perceptual dissimilarity (Jiang et al., 2007a) and strengthen the theoretical
position that visual speech perception is a visual process. The discriminability of pairs of
syllables, as measured with d’ and response times, was shown in two separate experiments to be
correlated with perceptual dissimilarity as modeled from optical 3D data. Stimuli generated using
a synthetic talking face driven by the same 3D optical data conveyed information sufficient for
discrimination and identification of visible speech syllables. Some discrepancies between
discrimination of natural syllables compared with discrimination of synthetic syllables suggest
there were sources of information in the natural stimuli not captured by the 3D optical data. The
Jiang et al. model predicts that visible syllable perceptual dissimilarity is orientation-invariant,
and that prediction was largely corroborated: inversion had a very small effect on discrimination
d’, an effect that was largely concentrated in one syllable pair. Together, these results strengthen the central
theoretical claim that visual speech perception is visual.
Visual speech is visual
The Jiang et al. (2007) model of visible syllable perceptual dissimilarity is an empirical
realization of the claim that using only optically-available information and relatively
straightforward linear transformations, the dissimilarity of pairs of visible speech syllables can be
estimated. The success of the model in correlating with behavioral measures of dissimilarity
demonstrates that sufficient information is visually available in the talking face to support visual
speech perception. Given the existence of individuals capable of highly accurate lipreading
(Bernstein et al., 2000; Mohammed et al., 2005; Auer and Bernstein, 2007), it might be expected
that there must be sufficient information in the talking face to support visual speech perception,
but more surprising is the evidence that such a sparse representation of the face can nonetheless
support accurate visible syllable discrimination.
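To make the character of such a computation concrete, the following is a minimal sketch, in Python with NumPy, of how pairwise dissimilarities could be derived from 3D point trajectories. It is not the Jiang et al. (2007a) procedure, which applies a linear mapping from optical measures toward a perceptual space before computing distances; the time-collapsing features, the Euclidean distance, and all variable names here are illustrative assumptions.

import numpy as np

def token_features(trajectory):
    # trajectory: array of shape (n_frames, n_points, 3) holding the 3D
    # coordinates of tracked face points over time. Collapse over time into
    # a single feature vector (per-point means and standard deviations),
    # a crude stand-in for the temporal collapsing described in the text.
    trajectory = np.asarray(trajectory, dtype=float)
    return np.concatenate([trajectory.mean(axis=0).ravel(),
                           trajectory.std(axis=0).ravel()])

def optical_dissimilarity(tokens):
    # tokens: dict mapping syllable label -> trajectory array.
    # Returns a dict mapping (label_a, label_b) -> scalar dissimilarity.
    labels = sorted(tokens)
    feats = {lab: token_features(tokens[lab]) for lab in labels}
    return {(a, b): float(np.linalg.norm(feats[a] - feats[b]))
            for i, a in enumerate(labels) for b in labels[i + 1:]}

Dissimilarities computed in this spirit could then be correlated with behavioral measures such as discrimination d’, as was done for the modeled perceptual dissimilarities in Chapter 2.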
Other accounts of the relationship between the visible stimulus and visual speech
perception have made use of derived features of the stimulus, such as the ratio of mouth opening
height to mouth opening width (Jackson et al., 1976; Montgomery and Jackson, 1983). Jiang et al.
(2007a) pointed out that because their model uses no special feature representation layer, it
demonstrates that no such representation layer is strictly necessary to account for visible syllable
perceptual dissimilarity. The present results strengthen the case that a model using
straightforward transforms of visible stimulus information correlates with perceptual
dissimilarity. Because such transformations are sufficient to account for the perceptual
dissimilarity of visible speech syllables, the visual system is
well-equipped to perceive visual speech. The model of Jiang et al. and the present results
converge with neural data suggesting that the representation of visual speech is in the temporal
visual speech area, a high-level visual center in the dorsal pathway of visual cortex (Bernstein et
al., 2011; Files et al., 2013).
The evidence for visual system contributions to lipreading includes results from
neuroimaging studies. Bernstein et al. (2011) carried out a functional magnetic resonance imaging
(fMRI) study with speech and non-speech face gestures in natural video or point-light animations.
They showed that left posterior temporal cortex demonstrated significantly greater activity for
visual speech than for non-speech gestures, whereas the homologous right posterior region did
not respond preferentially to speech.
These results and others (Campbell, 1986; Puce et al., 1998; Calvert and Campbell, 2003;
Paulesu et al., 2003; Skipper et al., 2005; Capek et al., 2008; Okada and Hickok, 2009; Bernstein
et al., 2011) point to cortical specialization for visual speech stimuli and possibly also sharing of
neural resources for processing of speech and non-speech face gestures, particularly, by the right
posterior temporal cortex. The neural findings provide converging evidence in support of the
view that speech perception involves extensive modality-specific representations.
From model to perception
Although this work has demonstrated that the optical dissimilarity measurement makes
useful predictions about visual speech perception via the perceptual dissimilarity model (Jiang et
al., 2007a), no claim is made that the perceptual dissimilarity model reflects the processing done
to establish a visual representation of speech. Visual speech perception can be accomplished with
point-light stimuli (Rosenblum et al., 1996; Rosenblum and Saldaña, 1996; Bernstein et al.,
2011), but the particulars of what points on the talking face might be tracked (if indeed facial
motion is perceived in terms of tracked points on the face under normal viewing conditions)
could be entirely different. Visible speech unfolds over time (Jesse and Massaro, 2010). The
optical dissimilarity measurement collapses measurements over the temporal dimension, but there
are certainly other ways to deal with time-varying stimuli: Visual systems integrate evidence over
time (e.g., Burr, 1981) and auditory cortex is described as having spectro-temporal receptive
fields (Eggermont et al., 1981) that respond to specific ‘shapes’ of sounds in spectro-temporal
space. One possibility for how complex, time-varying stimuli can be discriminated is through a
framework of state-dependent temporal processing (Buonomano and Maass, 2009).
One of the questions raised in Chapter 2 was what visible stimulus information is used
for visual speech perception and how that information can be quantified. Based on these results,
we cannot say exactly what constitutes the effective stimulus, but we can offer some
expectations about, and constraints on, the stimulus properties that contribute to visual speech
perception. First and foremost, the visible stimulus information used for visual speech perception
is related fairly directly to visible motion on the talking face. This can be concluded based on the
observed second-order isomorphism, which argues against the necessity of expressing
representations in terms that bear only a tenuous, highly abstract, or highly non-linear relationship
to 3D motion.
The results from the synthetic condition in Chapter 2 indicate that 3D motion capture
data from a limited number of points on the face form a sufficient representation of the talking
face for most syllables. The exceptions appeared to be due to glimpses of the tongue acting as an
informative cue in some pairs of syllables. The sufficiency of such a reduced stimulus
demonstrates that highly-detailed and high-resolution information about the talking face is not
necessary for an effective representation of the talking face.
A further limitation on the effective visual speech stimulus is that it does not appear to
rely in any highly specific way on the orientation of the talking face. Experiment 2 in Chapter 2
demonstrated that syllable discrimination was related to optical dissimilarity via a model of
perceptual dissimilarity (Jiang et al., 2007a) even when the talking face was inverted, and that
discrimination performance did not generally decrease when the talking face was inverted.
Visual speech perception has been studied in the past using inverted talking faces (Massaro and
Cohen, 1996; Jordan and Bevan, 1997; Rosenblum et al., 2000; Thomas and Jordan, 2002, 2004).
Here, we demonstrated that the dissimilarity structure of visible speech is insensitive to
orientation. This insensitivity to orientation contrasts with the orientation-sensitive processing of
face identity (Yin, 1969; Rossion and Gauthier, 2002; McKone and Yovel, 2009). One
explanation for why face identity processing might be orientation-sensitive is that it relies on
receptive fields tuned to specific spatial relationships (Jiang et al., 2006). Given the insensitivity
to inversion for visible speech, it is unlikely that the effective visual speech stimulus is composed
of specific spatial arrangements of features. This conclusion converges with a functional
neuroimaging result showing that the fusiform face area (FFA) was under-activated by talking face stimuli (Bernstein
et al., 2011).
In addition to the evidence about what stimulus features contribute to a visual speech
representation and strengthening the claim that visual speech perception is a visual process, the
results from Chapter 2 serve as a foundation for the work in Chapter 3. By contributing evidence
that visual speech perception is closely related to the 3D optical properties of the stimulus, we
were able to generate hypotheses about the neural basis of the representation of visual speech.
Visual Speech Mismatch Response Maps to Reliably Different Phonetic
Forms
The main result of Chapter 3 was that visible syllable contrasts elicited two visual
mismatch negativities (vMMNs). A vMMN localized to left posterior temporal cortex was
detectable only when the two syllables (the standard and the deviant) could, with high reliability,
be classified as different syllables. A vMMN localized to right posterior temporal cortex was
detectable for both perceptually near and perceptually far pairs of syllables, and the vMMN for the
far pairs was reliably larger than the vMMN for the near pairs. Using the terms and framework
laid out in the introductory chapter and in Chapter 3, these results show that both left and right
posterior temporal cortex have a second-order isomorphism relationship with the perceptual
dissimilarity of these pairs of syllables. Given that speech perception is generally found to be left-
lateralized, it may seem surprising that representations of the visible speech stimulus might be
found in both left and right hemispheres.
Other research supports the view that right posterior temporal cortex contains
representations of facial motions and/or gestures (Campbell et al., 1986; Campbell et al., 1996a;
Puce et al., 1998; Puce et al., 2000; Campbell et al., 2001; Puce et al., 2003; Wheaton et al., 2004;
Miki et al., 2006; Puce et al., 2007; Campbell, 2011), including eye movements (Ethofer et al.,
2011). Such facial motion can be communicative and highly behaviorally relevant even if the
facial motion does not itself constitute speech (Campbell, 2011). If the right hemisphere
represents face motion and the left hemisphere represents speech, or speech-like facial motion,
then both would be expected to be activated in response to visible speech. This is generally the
finding of functional neuroimaging experiments comparing visible speech to non-speech stimuli
(Campbell et al., 2001; Bernstein et al., 2011).
The view that emerges from past work on face motion processing is that the right
posterior temporal cortex contains representations of behaviorally-relevant facial motion,
including speech face motion. The emergence of a right-hemisphere vMMN to visible speech stimuli that shows
second-order isomorphism with perceptual dissimilarity may seem at odds with the expectation
that visual speech perception should be left-lateralized. Below, I offer two possible explanations
that might account for these results.
One possible explanation is that, despite efforts to introduce dissimilarities in the two
tokens used for each syllable, some coincidental regularity in facial motion was established by the
repetition of the standard syllable and then violated by the presentation of the deviant syllable.
By ‘coincidental,’ I mean a regularity specific to the stimuli used, one that does not represent a
distinction reliably signaling phoneme differences and that would not necessarily generalize to
other utterances of the same CV syllable. This possibility
could be tested using the same general framework as was used in Chapter 3, but including much
more dramatic between-token-within-syllable variability. This could be accomplished by using
tokens of multiple talkers saying the same syllable several times each. This would maximize the
chance that any established regularity would be due to the syllable spoken. A constantly changing
stream of face identities could be distracting and disruptive, so an alternative approach would be
to use synthetic visible speech driven by the recorded 3D motion of several talkers, similar to the
synthetic condition in Chapter 2. Such an experiment could potentially be used to isolate neural
responses related to visual speech perception to the exclusion of perception of non-speech face
motion, if the two are truly separable. As described, the stimuli required would be impractical to
record given the difficulty and cost of recording 3D optical motion, but could be simplified
dramatically if commercial face-capture software could be used (see below).
Another possibility is that the distinction between speech and non-speech facial motion is
not as dichotomous as previous results might suggest. Perhaps right posterior temporal cortex
encodes information that could be used for visual speech perception, but generally is not.
Evidence supporting this possibility is limited and indirect, and so here I can only speculate.
One source of indirect support for this possibility comes from research on a stroke patient with
extensive damage to left posterior temporal cortex, including superior temporal sulcus (Baum et
al., 2012). The patient’s audio-visual speech perception was tested and found to be consistent
with normal audio-visual speech perception. Functional neuroimaging of the patient while doing
an audio-visual speech task was compared with normal controls. The stroke patient showed
activity in right posterior temporal cortex that was of greater magnitude and extent than that of
age-matched controls. Pre-injury data were not available, but Baum and colleagues argued that the
most likely explanation was that cortical reorganization in right temporal cortex allowed that
cortex to process multisensory speech. This case study provides support for the possibility that
right posterior temporal cortex has the capacity to support audio-visual speech perception. Given
the anatomical proximity of audio-visual and visual-only speech representations (Bernstein et al.,
2011), it also indirectly supports the possibility that right posterior temporal cortex could support
visual-only speech perception. Whether this potential may be realized in the absence of neurological
insult is another matter for speculation.
During participant screening for the experiment in Chapter 3, 49 individuals were tested
for lipreading ability (Auer and Bernstein, 2007) and their ability to discriminate the syllables
used in the vMMN experiment was also tested. Informal observation of the lipreading data
alongside the discrimination data suggested that the two might be associated. Figure 4.1 shows a
scatter plot of each screened participant’s lipreading test result (words correct) versus
discrimination d’ for the near pair and the far pair used in Chapter 3. There was a reliable
positive relationship between number of words correct on the lipreading screening test and d’
sensitivity for the near pair, Pearson r(47) = .448, p = .0013, and for the far pair Pearson r(47) =
.297, p = .0385. The association between lipreading ability and discrimination of the near pair is
intuitive, in that people who are able to make better distinctions between perceptually similar
syllables should be more successful lipreaders. The interpretation in Chapter 3 that right
posterior temporal cortex has a representation that is sensitive to distinctions among perceptually
similar syllables, combined with the observed relationship between d’ and lipreading, suggests a
hypothesis: Perhaps better lipreaders are better because they have learned to make more
extensive use of the representations of facial motion in right posterior temporal cortex for visual
speech perception. To test this hypothesis, functional imaging of many subjects with a wide
range of lipreading ability doing a visual speech perceptual task would be required. To my
knowledge, the only fMRI experiment of visual speech perception that reported subject lipreading
ability is that of Bernstein and colleagues (Bernstein et al., 2011). In that study, lipreading ability
was not a screening criterion, and participants’ lipreading test scores (on the same test as used in
Chapter 3) ranged from 6 to 79 words correct (M=28). No quantitative assessment of the
relationship between lipreading ability and activity in right posterior STS was carried out, but
their supplemental Figure 5 shows individual subject right posterior STS activation maps, and the
most extensive speech-specific activation appears in the subjects with the higher lipreading
scores. Clearly, such informal observations cannot support or refute this hypothesis, and testing
on a larger sample that spans a larger range of the normal population would be required.
Figure 4.1: Lipreading screening scores in words correct versus d’ for syllable discrimination.
Data from all participants screened in Chapter 3 are included here, N = 49. Circles denote participants
whose EEG data were analyzed (n = 11), triangles denote all other participants. The dashed line indicates
the screening cutoff criterion of 37 words correct, the solid line is the line of best fit. Number of words
correct (out of 253) on the lipreading screening test related positively to d’ in the near pair /ZA-tA/, Pearson
r(47) = .448, p = .0013 and the far pair /ZA-fA/, Pearson r(47) = .297, p = .0385.
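For concreteness, a minimal sketch of how the d’ values and the correlations reported above (and in Figure 4.1) can be computed is given below, in Python with SciPy. The hit and false-alarm rates and the data arrays are hypothetical; only the formulas (equal-variance d’ and the Pearson correlation) correspond to what was actually computed.

import numpy as np
from scipy.stats import norm, pearsonr

def d_prime(hit_rate, false_alarm_rate):
    # Equal-variance signal detection sensitivity: d' = z(H) - z(FA).
    return norm.ppf(hit_rate) - norm.ppf(false_alarm_rate)

# Example: a hypothetical participant with 85% hits and 20% false alarms.
print(f"d' = {d_prime(0.85, 0.20):.2f}")

# Hypothetical screening data, one value per screened participant.
words_correct = np.array([22, 35, 41, 18, 50, 29])       # lipreading score
near_dprime = np.array([0.8, 1.2, 1.9, 0.6, 2.3, 1.0])   # d' for the near pair

r, p = pearsonr(words_correct, near_dprime)
print(f"Pearson r({len(words_correct) - 2}) = {r:.3f}, p = {p:.4f}")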
Summary
One of the main findings of the work in this dissertation is that visual speech perception
is closely related to the motion on the talking face when the face motion is submitted to a model
of perceptual dissimilarity that makes no recourse to abstract or gestural features. This close
relationship holds even when the stimuli are an impoverished, synthetic rendering of the talking
face and when they are vertically inverted. Moreover, discrimination of visible speech stimuli is
better than would be predicted by an implicit identification model. Another main finding was the
observation of visual mismatch negativities (vMMNs) to visible speech syllables with generators
in left and right posterior temporal cortex. The left-lateralized vMMN was found only for syllable
distinctions that were highly reliable, and the right-lateralized vMMN was found for both
perceptually near and far syllable pairs. The main theoretical contribution of the first main result
is to reinforce the possibility that an account of visual speech perception need not make recourse
to abstract or gestural features. This is important, because much of the theorizing about visual
speech perception relies on a feature-level representation to provide a common format for
integration of visual and auditory speech information. Instead, the claim is that straightforward
visual processes are sufficient to achieve visual speech syllable discrimination. The main
theoretical contribution of the second main result is to provide converging evidence for a
representation of visual speech in left posterior temporal cortex. Bernstein and colleagues (2011) used a
system of control stimulus contrasts to isolate a temporal visual speech area (TVSA) that is
specifically sensitive to visual speech stimuli. Consistent with the hypothesis that TVSA is the
site of representation of visual speech, the work in Chapter 3 shows a left posterior temporal
cortical region that is highly sensitive to dissimilarities in visible speech syllables that signal
reliable differences in syllable identity.
In addition to the theoretical contributions just described, this work involved a number of
technical developments. The discrimination experiments reported in Chapter 2 are the first, to my
knowledge, to directly assess the discriminability of visible speech CV syllables while controlling
for non-phonetic confounds. The results of Chapter 3 are the first report of a vMMN attributable
to violations of regularity in visual speech syllable identity. Past reports either failed to find
a visual speech vMMN or reported responses that could not be attributed specifically to syllable identity. Finally, the
application of permutation testing methods to global field power (GFP) measures with unequal
numbers of trials is a novel solution to a problem that has historically limited the use of GFP
measures. Having summarized the main findings and contributions of the work in this
dissertation, I will now briefly outline some potential future directions of research suggested by
this work.
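To illustrate the approach just referred to, the following is a minimal sketch of a label-permutation test on a global field power (GFP) difference with unequal numbers of standard and deviant trials, written in Python with NumPy. It is not the exact procedure used in Chapter 3; the array layout, the test statistic (the maximum absolute GFP difference over time), and the number of permutations are assumptions made for illustration.

import numpy as np

rng = np.random.default_rng(0)

def gfp(trials):
    # trials: array (n_trials, n_channels, n_times).
    # Average the trials, re-reference to the average reference, and take
    # the standard deviation across channels at each time point.
    erp = trials.mean(axis=0)
    erp = erp - erp.mean(axis=0, keepdims=True)
    return erp.std(axis=0)

def gfp_permutation_test(std_trials, dev_trials, n_perm=1000):
    # Permutation p-value for max |GFP(deviant) - GFP(standard)|, preserving
    # the (unequal) numbers of standard and deviant trials in each surrogate.
    observed = np.max(np.abs(gfp(dev_trials) - gfp(std_trials)))
    pooled = np.concatenate([std_trials, dev_trials], axis=0)
    n_std = std_trials.shape[0]
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled.shape[0])
        surrogate = np.max(np.abs(gfp(pooled[perm[n_std:]]) -
                                  gfp(pooled[perm[:n_std]])))
        if surrogate >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)

Because each surrogate average is formed from the same numbers of trials as the observed averages, the differing noise levels of the standard and deviant averages are built into the null distribution rather than biasing the comparison.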
Future Directions
One limitation of the optical dissimilarity measure (Jiang et al., 2007a) used here is that it
takes as input 3D motion information gathered using a professional-grade motion capture system.
Correctly using such a system requires expertise and resources not often available in speech
perception laboratories. The desire for realistic face animation for animated films and video
games is exerting commercial pressure toward unobtrusive and easy-to-use face motion capture.
For example, Faceware software (Faceware Technologies Ltd., Santa Monica, CA) is targeted
toward synthetic facial animation using video recording rather than motion capture as its source
data. The ability to rapidly and easily extract the two-dimensional motion of any set of points on
the talking face would allow further exploration of how much and what kind of visual
information is sufficient to convey linguistic information. Before such technology could be used,
testing procedures such as those presented in Chapter 2 and in Jiang et al. (2007a) would need to
be applied to face motion from automatic motion capture software. Additionally, the sufficiency
of 2D motion for optical dissimilarity measurement would need to be established. By adapting
the optical dissimilarity measure to low-cost ubiquitous equipment, the utility of the research
described here would be greatly expanded.
Another future direction for this research is to take further advantage of second-order
isomorphism to strengthen the case for a visual representation of visible speech dissimilarity in
TVSA. Representational similarity analysis (RSA) is a technique that compares the dissimilarity
structure of a pattern of neural activation evoked by a set of stimuli to the dissimilarity structure
expected based on some model of the dissimilarity of those stimuli (Kriegeskorte et al., 2008;
Kriegeskorte, 2009). RSA could be applied using fMRI by first using a functional localizer to
isolate a subject’s TVSA (Bernstein et al., 2011) and then displaying a set of visible speech
syllables. The dissimilarity structure of the activations in TVSA would then be compared to the
dissimilarity structure predicted by optical dissimilarity (Jiang et al., 2007a) and to alternative
predictions, perhaps from acoustic or low-level visual features. This proposed experiment builds
on the research in this dissertation in two ways. First, the work in Chapter 2 further validates the
optical dissimilarity measure as a perceptually-relevant measure of the dissimilarity of seen
syllables. Second, the work in Chapter 3 demonstrated that the mismatch response likely arises
from a feed-forward visual pathway. This is important because the proposed experiment uses
fMRI which has relatively low temporal resolution. By combining evidence from the high spatial
resolution of fMRI and the high temporal resolution of EEG, a convincing case can be built for a
feed-forward visual representation of seen speech in high-level vision.
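The core comparison in the proposed RSA experiment can be sketched in a few lines of Python. Following common RSA practice (Kriegeskorte et al., 2008), the neural representational dissimilarity matrix (RDM) is built from correlation distances between response patterns and compared to a model RDM with a rank correlation; the variable names and array shapes here are assumptions, not a worked-out analysis pipeline.

import numpy as np
from scipy.stats import spearmanr

def neural_rdm(patterns):
    # patterns: array (n_stimuli, n_voxels) of response patterns.
    # Dissimilarity between stimuli = 1 - Pearson correlation of patterns.
    return 1.0 - np.corrcoef(patterns)

def compare_rdms(rdm_a, rdm_b):
    # Spearman rank correlation between the upper triangles of two RDMs.
    iu = np.triu_indices_from(rdm_a, k=1)
    rho, p = spearmanr(rdm_a[iu], rdm_b[iu])
    return rho, p

# Hypothetical use: response patterns from a localized TVSA region of interest
# compared with a model RDM built from the optical dissimilarity measure.
# rho, p = compare_rdms(neural_rdm(tvsa_patterns), optical_rdm)

In the proposed experiment, the model RDM would be derived from the optical dissimilarity measure, and competing model RDMs, for example from acoustic or low-level visual features, could be evaluated in the same way.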
References
Agnew ZK, McGettigan C, Scott SK (2011) Discriminating between auditory and motor cortical
responses to speech and nonspeech mouth sounds. J Cogn Neurosci 23:4038-4047.
Ahissar M, Nahum M, Nelken I, Hochstein S (2008) Reverse hierarchies and sensory learning.
Philosophical Transactions of the Royal Society B 364:285-299.
Altieri N, Townsend JT (2011) An assessment of behavioral dynamic information processing
measures in audiovisual speech perception. Front Psychol 2:238.
Altieri N, Pisoni DB, Townsend JT (2011) Some behavioral and neurobiological constraints on
theories of audiovisual speech integration: a review and suggestions for new directions.
Seeing Perceiving 24:513-539.
Arnold P, Hill F (2001) Bisensory augmentation: A speechreading advantage when speech is
clearly audible and intact. Br J Psychol 92 Part 2:339-355.
Arnold P, Oles J (2001) Bisensory augmentation of complex spoken passages. Br J Audiol 35:53-
58.
Auer ET, Jr. (2002) The influence of the lexicon on speech read word recognition: contrasting
segmental and lexical distinctiveness. Psychon Bull Rev 9:341-347.
Auer ET, Jr., Bernstein LE (1997) Speechreading and the structure of the lexicon:
computationally modeling the effects of reduced phonetic distinctiveness on lexical
uniqueness. J Acoust Soc Am 102:3704-3710.
Auer ET, Jr., Bernstein LE (2007) Enhanced visual speech perception in individuals with early-
onset hearing impairment. Journal of Speech, Language, and Hearing Research 50:1157-
1165.
Baas JM, Kenemans JL, Mangun GR (2002) Selective attention to spatial frequency: an ERP and
source localization analysis. Clinical Neurophysiology 113:1840-1854.
Baillet S, Mosher JC, Leahy RM (2001) Electromagnetic brain mapping. Signal Processing
Magazine, IEEE 18:14-30.
Baum SH, Martin RC, Hamilton AC, Beauchamp MS (2012) Multisensory speech perception
without the left superior temporal sulcus. NeuroImage 62:1825-1832.
Benjamini Y, Hochberg Y (1995) Controlling the False Discovery Rate: A Practical and Powerful
Approach to Multiple Testing. Journal of the Royal Statistical Society Series B
(Methodological) 57:289-300.
Benjamini Y, Yekutieli D (2001) The Control of the False Discovery Rate in Multiple Testing
under Dependency. The Annals of Statistics 29:1165-1188.
Bernstein LE (2005) Phonetic Processing by the Speech Perceiving Brain. In: The Handbook of
Speech Perception (Pisoni DB, Remez RE, eds), pp 79-98. Malden, MA: Blackwell
Publishing.
Bernstein LE (2012a) Multisensory information integration for communication and speech. In:
The New Handbook of Multisensory Integration (Stein BE, ed). Boston: MIT.
Bernstein LE (2012b) Visual speech perception. In: AudioVisual Speech Processing (Vatikiotis-
Bateson E, Bailly G, Perrier P, eds), pp 21-39. Cambridge: Cambridge University.
Bernstein LE, Jiang J (2009) Visual Speech Perception, Optical Phonetics, and Synthetic Speech.
In: Visual Speech Recognition: Lip Segmentation and Mapping (Liew AW, Wang S,
eds), pp 439-461. Hershey, PA: Information Science Reference.
Bernstein LE, Demorest ME, Tucker PE (2000) Speech perception without hearing. Perception &
Psychophysics 62:233-252.
Bernstein LE, Auer ET, Jr., Tucker PE (2001) Enhanced speechreading in deaf adults: can short-
term training/practice close the gap for hearing adults? Journal of Speech, Language, and
Hearing Research 44:5-18.
Bernstein LE, Auer ET, Jr., Moore JK (2004) Audiovisual Speech Binding: Convergence or
Association? In: Handbook of Multisensory Processes (Calvert GA, Spence C, Stein BE,
eds), pp 203-223. Cambridge, MA: MIT.
Bernstein LE, Auer ET, Jr., Wagner M, Ponton CW (2008) Spatiotemporal dynamics of
audiovisual speech processing. NeuroImage 39:423-435.
Bernstein LE, Jiang J, Pantazis D, Lu ZL, Joshi A (2011) Visual phonetic processing localized
using speech and nonspeech face gestures in video and point-light displays. Hum Brain
Mapp 32:1660-1676.
Bernstein LE, Auer ET, Jr., Moore JK, Ponton CW, Don M, Singh M (2002) Visual speech
perception without primary auditory cortex activation. Neuroreport 13:311-315.
Beskow J (2004) Trainable articulatory control models for visual speech synthesis. International
Journal of Speech Technology 7:335-349.
Besle J, Fort A, Delpuech C, Giard M-H (2004) Bimodal speech: early suppressive visual effects
in human auditory cortex. Eur J Neurosci 20:2225-2234.
Binder JR, Frost JA, Hammeke TA, Bellgowan PS, Springer JA, Kaufman JN, Possing ET (2000)
Human temporal lobe activation by speech and nonspeech sounds. Cereb Cortex 10:512-
528.
Blair RC, Karniski W (1993) An alternative method for significance testing of waveform
difference potentials. Psychophysiology 30:518-524.
Browman CP, Goldstein L (1992) Articulatory phonology: an overview. Phonetica 49:155-180.
Bruce V, Young A (1986) Understanding face recognition. Br J Psychol 77 ( Pt 3):305-327.
Buonomano DV, Maass W (2009) State-dependent computations: spatiotemporal processing in
cortical networks. Nat Rev Neurosci 10:113-125.
Burr DC (1981) Temporal summation of moving images by the human visual system. Proc R Soc
Lond B Biol Sci 211:321-339.
Calvert GA (2001) Crossmodal processing in the human brain: insights from functional
neuroimaging studies. Cereb Cortex 11:1110-1123.
Calvert GA, Campbell R (2003) Reading speech from still and moving faces: the neural
substrates of visible speech. J Cognit Neurosci 15:57-70.
Calvert GA, Brammer MJ, Bullmore ET, Campbell R, Iversen SD, David AS (1999) Response
amplification in sensory-specific cortices during crossmodal binding. Neuroreport
10:2619-2623.
Calvert GA, Bullmore ET, Brammer MJ, Campbell R, Williams SC, McGuire PK, Woodruff PW,
Iversen SD, David AS (1997) Activation of auditory cortex during silent lipreading.
Science 276:593-596.
Campbell CS, Massaro DW (1997) Perception of visible speech: influence of spatial quantization.
Perception 26:627-644.
Campbell R (1986) The lateralization of lip-read sounds: a first look. Brain and Cognition 5:1-21.
Campbell R (2008) The processing of audio-visual speech: empirical and neural bases.
Philosophical Transactions of The Royal Society Of London Series B: Biological
Sciences 363:1001-1010.
Campbell R (2011) Speechreading and the Bruce-Young model of face recognition: early
findings and recent developments. Br J Psychol 102:704-710.
Campbell R, Landis T, Regard M (1986) Face recognition and lipreading. A neurological
dissociation. Brain 109 ( Pt 3):509-521.
Campbell R, De Gelder B, De Haan E (1996a) The lateralization of lip-reading: a second look.
Neuropsychologia 34:1235-1240.
Campbell R, Brooks B, de Haan E, Roberts T (1996b) Dissociating face processing skills:
decision about lip-read speech, expression, and identity. Q J Exp Psychol A 49:295-314.
Campbell R, MacSweeney M, Surguladze S, Calvert G, McGuire P, Suckling J, Brammer MJ,
David AS (2001) Cortical substrates for the perception of face actions: an fMRI study of
the specificity of activation for seen speech and for meaningless lower-face acts
(gurning). Cognitive Brain Research 12:233-243.
Capek CM, Bavelier D, Corina D, Newman AJ, Jezzard P, Neville HJ (2004) The cortical
organization of audio-visual sentence comprehension: an fMRI study at 4 Tesla.
Cognitive Brain Research 20:111-119.
Capek CM, Macsweeney M, Woll B, Waters D, McGuire PK, David AS, Brammer MJ, Campbell
R (2008) Cortical circuits for silent speechreading in deaf and hearing people.
Neuropsychologia 46:1233-1241.
Cohen J (1969) Statistical power analysis for the behavioral sciences. New York: Academic
Press.
Cohen MM, Walker RL, Massaro DW (1996) Perception of synthetic visual speech. In:
Speechreading by humans and machines (Stork D, Henneck M, eds), pp 153-168. Berlin:
Springer-Verlag.
Colin C, Radeau M, Soquet A, Deltenre P (2004) Generalization of the generation of an MMN by
illusory McGurk percepts: voiceless consonants. Clinical Neurophysiology 115:1989-
2000.
Colin C, Radeau M, Soquet A, Demolin D, Colin F, Deltenre P (2002) Mismatch negativity
evoked by the McGurk-MacDonald effect: a phonetic representation within short-term
memory. Clinical Neurophysiology 113:495-506.
Czigler I (2007) Visual mismatch negativity: Violation of nonattended environmental regularities.
Journal of Psychophysiology 21:224-230.
Czigler I, Balazs L, Winkler I (2002) Memory-based detection of task-irrelevant visual changes.
Psychophysiology 39:869-873.
Czigler I, Balazs L, Pato LG (2004) Visual change detection: event-related potentials are
dependent on stimulus location in humans. Neurosci Lett 364:149-153.
Dehaene-Lambertz G (1997) Electrophysiological correlates of categorical phoneme perception
in adults. Neuroreport 8:919-924.
Darvas F, Ermer JJ, Mosher JC, Leahy RM (2006) Generic head models for atlas-based EEG
source analysis. Hum Brain Mapp 27:129-143.
Davis C, Kim J (2004) Audio-visual interactions with intact clearly audible speech. Quarterly
Journal of Experimental Psychology Section A 57:1103-1121.
Dawson GD (1947) Cerebral Responses to Electrical Stimulation of Peripheral Nerve in Man. J
Neurol Neurosurg Psychiatry 10:134-140.
Delorme A, Makeig S (2004) EEGLAB: an open source toolbox for analysis of single-trial EEG
dynamics including independent component analysis. J Neurosci Methods 134:9-21.
Demorest ME, Bernstein LE (1992) Sources of variability in speechreading sentences: a
generalizability analysis. Journal of Speech and Hearing Research 35:876-891.
Duda RO, Hart PE (1973) Pattern Classification and Scene Analysis. New York: John Wiley &
Sons.
Durlach NI, Braida LD (1969) Intensity perception. I. Preliminary theory of intensity resolution.
The Journal of the Acoustical Society of America 46:372-383.
Edelman S (1998) Representation is representation of similarities. The Behavioral and Brain
Sciences 21:449-467; discussion 467-498.
Edgington ES, Onghena P (2007) Randomization tests, 4th Edition. Boca Raton, FL: Chapman &
Hall/CRC.
Eggermont JJ (2001) Between sound and perception: reviewing the search for a neural code.
Hearing Res 157:1-42.
Eggermont JJ, Aertsen AM, Hermes DJ, Johannesma PI (1981) Spectro-temporal characterization
of auditory neurons: redundant or necessary. Hearing Res 5:109-121.
Ethofer T, Gschwind M, Vuilleumier P (2011) Processing social aspects of human gaze: a
combined fMRI-DTI study. NeuroImage 55:411-419.
Fadiga L, Craighero L, Buccino G, Rizzolatti G (2002) Speech listening specifically modulates
the excitability of tongue muscles: a TMS study. Eur J Neurosci 15:399-402.
Files BT, Auer ET, Bernstein LE (2013) The Visual Mismatch Negativity Elicited with Visual
Speech Stimuli. Front Hum Neurosci 7.
Fisher CG (1968) Confusions among visually perceived consonants. J Speech Hear Res 11:796-
804.
Folstein JR, Van Petten C (2008) Influence of cognitive control and mismatch on the N2 component
of the ERP: a review. Psychophysiology 45:152-170.
Fowler CA (2004) Speech as a Supramodal or Amodal Phenomenon. In: The Handbook of
Multisensory Processes (Calvert GA, Spence C, Stein BE, eds), pp 189-202. Cambridge,
MA: MIT Press.
Fowler CA, Brown JM, Sabadini L, Weihing J (2003) Rapid access to speech gestures in
perception: Evidence from choice and simple response time tasks. Journal of Memory
and Language 49:396-413.
Freeman WJ (2006) Origin, structure, and role of background EEG activity. Part 4: Neural frame
simulation. Clinical Neurophysiology 117:572-589.
Fuchs M, Wagner M, Köhler T, Wischmann HA (1999) Linear and nonlinear current density
reconstructions. J Clin Neurophysiol 16:267-295.
Galantucci B, Fowler CA, Turvey MT (2006) The motor theory of speech perception reviewed.
Psychon Bull Rev 13:361-377.
Garrido MI, Kilner JM, Stephan KE, Friston KJ (2009) The mismatch negativity: a review of
underlying mechanisms. Clin Neurophysiol 120:453-463.
Ghazanfar AA, Maier JX, Hoffman KL, Logothetis NK (2005) Multisensory integration of
dynamic faces and voices in rhesus monkey auditory cortex. J Neurosci 25:5004-5012.
Giraud AL, Poeppel D (2012) Cortical oscillations and speech processing: emerging
computational principles and operations. Nat Neurosci 15:511-517.
Goldstone RL (1994) Influences of categorization on perceptual discrimination. Journal of
Experimental Psychology: General 123:178-200.
Gramfort A, Papadopoulo T, Olivi E, Clerc M (2010) OpenMEEG: opensource software for
quasistatic bioelectromagnetics. BioMedical Engineering OnLine 9:45.
Green DM, Swets JA (1966) Signal Detection Theory and Psychophysics. New York: Wiley.
Greenblatt RE, Pflieger ME (2004) Randomization-based hypothesis testing from event-related
data. Brain Topography 16:225-232.
Grill-Spector K, Kourtzi Z, Kanwisher N (2001) The lateral occipital complex and its role in
object recognition. Vision Res 41:1409-1422.
Grill-Spector K, Henson R, Martin A (2006) Repetition and the brain: neural models of stimulus-
specific effects. Trends in Cognitive Sciences 10:14-23.
Groppe DM, Urbach TP, Kutas M (2011) Mass univariate analysis of event-related brain
potentials/fields I: a critical tutorial review. Psychophysiology 48:1711-1725.
Guthrie D, Buchwald JS (1991) Significance testing of difference potentials. Psychophysiology
28:240-244.
Hall DA, Fussell C, Summerfield Q (2005) Reading fluent speech from talking faces: typical
brain networks and individual differences. J Cogn Neurosci 17:939-953.
Handy TC (2005) Basic Principles of ERP Quantification. In: Event-Related Potentials: A
Methods Handbook (Handy TC, ed), pp 33-56. Cambridge, MA: The MIT Press.
Hary JM, Massaro DW (1982) Categorical results do not imply categorical perception. Perception
& Psychophysics 32:409-418.
Hay JF, Pelucchi B, Estes KG, Saffran JR (2011) Linking sounds to meanings: Infant statistical
learning in a natural language. Cognitive Psychology 63:93-106.
Hemmelmann C, Horn M, Süsse T, Vollandt R, Weiss S (2005) New concepts of multiple tests
and their use for evaluating high-dimensional EEG data. J Neurosci Methods 142:209-
217.
Hickok G, Poeppel D (2007) The cortical organization of speech processing. Nat Rev Neurosci
8:393-402.
Holt LL, Lotto AJ (2010) Speech perception as categorization. Atten Percept Psychophys
72:1218-1227.
Horváth J, Czigler I, Jacobsen T, Maess B, Schröger E, Winkler I (2008) MMN or no MMN: no
magnitude of deviance effect on the MMN amplitude. Psychophysiology 45:60-69.
Hubel DH, Wiesel TN (1962) Receptive fields, binocular interaction and functional architecture
in the cat's visual cortex. Journal of Physiology 160:106-154.
Iverson P, Kuhl PK (1995) Mapping the perceptual magnet effect for speech using signal
detection theory and multidimensional scaling. J Acoust Soc Am 97:553-562.
Iverson P, Kuhl PK (2000) Perceptual magnet and phoneme boundary effects in speech
perception: do they arise from a common mechanism? Percept Psychophys 62:874-886.
Jääskeläinen IP, Ahveninen J, Bonmassar G, Dale AM, Ilmoniemi RJ, Levänen S, Lin F-H, May
P, Melcher J, Stufflebeam S, Tiitinen H, Belliveau JW (2004) Human posterior auditory
cortex gates novel sounds to consciousness. Proc Natl Acad Sci U S A 101:6809-6814.
Jackson PL, Montgomery AA, Binnie CA (1976) Perceptual dimensions underlying
vowel lipreading performance. Journal of Speech and Hearing Research 19:796-812.
Jakobson R, Fant CGM, Halle M (1961) Preliminaries to Speech Analysis: The Distinctive
Features and Their Correlates. Cambridge, MA: MIT.
Jeffers J, Barley M (1971) Speechreading (Lipreading). Springfield, IL: Charles C. Thomas.
Jesse A, Massaro DW (2010) The temporal distribution of information in audiovisual spoken-
word identification. Attention, Perception, & Psychophysics 72:209-225.
Jiang J, Bernstein LE (2011) Psychophysics of the McGurk and other audiovisual speech
integration effects. Journal of Experimental Psychology: Human Perception and
Performance.
Jiang J, Aronoff JM, Bernstein LE (2008) Development of a visual speech synthesizer via
second-order isomorphism. In: Acoustics, Speech and Signal Processing, 2008. ICASSP
2008. IEEE International Conference on, pp 4677-4680.
Jiang J, Alwan A, Bernstein LE, Auer ET, Jr., Keating P (2002a) Similarity structure in
perceptual and physical measures for visual consonants across talkers. In: ICASSP 2002,
pp 441-444. Orlando, FL.
Jiang J, Alwan A, Keating P, Auer ET, Jr., Bernstein LE (2002b) On the relationship between
face movements, tongue movements, and speech acoustics. EURASIP Journal on
Applied Signal Processing: Special issue on Joint Audio-Visual Speech Processing
2002:1174-1188.
Jiang J, Auer ET, Jr., Alwan A, Keating PA, Bernstein LE (2007a) Similarity structure in visual
speech perception and optical phonetic signals. Perception & Psychophysics 69:1070-
1083.
Jiang X, Rosen E, Zeffiro T, VanMeter J, Blanz V, Riesenhuber M (2006) Evaluation of a Shape-
Based Model of Human Face Discrimination Using fMRI and Behavioral Techniques.
Neuron 50:159-172.
Jiang X, Bradley E, Rini RA, Zeffiro T, Vanmeter J, Riesenhuber M (2007b) Categorization
training results in shape- and category-selective human neural plasticity. Neuron 53:891-
903.
Jordan TR, Bevan K (1997) Seeing and hearing rotated faces: influences of facial orientation on
visual and audiovisual speech recognition. Journal of Experimental Psychology: Human
Perception and Performance 23:388-403.
Kailath T, Sayed AH, Hassibi B (2000) Linear Estimation. Upper Saddle River, NJ: Prentice
Hall.
Kamachi M, Hill H, Lander K, Vatikiotis-Bateson E (2003) "Putting the face to the voice":
matching identity across modality. Curr Biol 13:1709-1714.
Kanwisher N, McDermott J, Chun MM (1997) The fusiform face area: a module in human
extrastriate cortex specialized for face perception. J Neurosci 17:4302-4311.
Karniski W, Blair RC, Snider AD (1994) An exact statistical method for comparing topographic
maps, with any number of subjects and electrodes. Brain Topography 6:203-210.
Kauramaki J, Jaaskelainen IP, Hari R, Mottonen R, Rauschecker JP, Sams M (2010) Lipreading
and covert speech production similarly modulate human auditory-cortex responses to
pure tones. The Journal of neuroscience 30:1314-1321.
Kayser C, Petkov CI, Logothetis NK (2008) Visual modulation of neurons in auditory cortex.
Cereb Cortex 18:1560-1574.
Kayser C, Logothetis NK, Panzeri S (2010) Visual enhancement of the information representation
in auditory cortex. Curr Biol 20:19-24.
Kecskes-Kovacs K, Sulykos I, Czigler I (2013) Visual mismatch negativity is sensitive to
symmetry as a perceptual category. The European journal of neuroscience 37:662-667.
Kimura M, Schroger E, Czigler I (2011) Visual mismatch negativity and its importance in visual
cognitive sciences. Neuroreport 22:669-673.
Kimura M, Katayama J, Ohira H, Schroger E (2009) Visual mismatch negativity: new evidence
from the equiprobable paradigm. Psychophysiology 46:402-409.
Kislyuk DS, Mottonen R, Sams M (2008) Visual processing affects the neural basis of auditory
discrimination. J Cognit Neurosci 20:2175-2184.
Klatt DH (1987) Review of text-to-speech conversion for English. J Acoust Soc Am 82:737-793.
Koenig T, Melie-Garcia L (2010) A method to determine the presence of averaged event-related
fields using randomization tests. Brain Topogr 23:233-242.
Koenig T, Kottlow M, Stein M, Melie-Garcia L (2011) Ragu: A Free Tool for the Analysis of
EEG and MEG Event-Related Scalp Field Data Using Global Randomization Statistics.
Comput Intell Neurosci 2011:938925.
Kriegeskorte N (2009) Relating Population-Code Representations between Man, Monkey, and
Computational Models. Frontiers in Neuroscience 3:363-373.
Kriegeskorte N, Mur M, Bandettini P (2008) Representational similarity analysis - connecting the
branches of systems neuroscience. Frontiers in Systems Neuroscience 2:4.
Kruskal JB, Wish M (1978) Multidimensional scaling. Beverly Hills, CA: Sage.
Kuhl PK (1991) Human adults and human infants show a "perceptual magnet effect" for the
prototypes of speech categories, monkeys do not. Perception & Psychophysics 50:93-
107.
Kuhl PK, Meltzoff AN (1988) Speech as an intermodal object of perception. In: Perceptual
Development in Infancy (Yonas A, ed), pp 235-266. Hillsdale, NJ: Lawrence Erlbaum
Associates, Inc.
Kujala T, Tervaniemi M, Schroger E (2007) The mismatch negativity in cognitive and clinical
neuroscience: theoretical and methodological considerations. Biological psychology
74:1-19.
Lachs L, Pisoni DB (2004a) Cross-modal source information and spoken word recognition.
Journal of Experimental Psychology: Human Perception and Performance 30:378-396.
Lachs L, Pisoni DB (2004b) Specification of cross-modal source information in isolated
kinematic displays of speech. J Acoust Soc Am 116:507-518.
Lachs L, Pisoni DB (2004c) Crossmodal Source Identification in Speech Perception. Ecological
Psychology 16:159-187.
Lakatos P, Chen C-M, O'Connell MN, Mills A, Schroeder CE (2007) Neuronal oscillations and
multisensory interaction in primary auditory cortex. Neuron 53:279-292.
Lakatos P, Karmos G, Mehta AD, Ulbert I, Schroeder CE (2008) Entrainment of neuronal
oscillations as a mechanism of attentional selection. Science 320:110-113.
Lakatos P, O'Connell MN, Barczak A, Mills A, Javitt DC, Schroeder CE (2009) The leading
sense: supramodal control of neurophysiological context by attention. Neuron 64:419-
430.
Lander K, Hill H, Kamachi M, Vatikiotis-Bateson E (2007) It's not what you say but the way you
say it: matching faces and voices. J Exp Psychol Hum Percept Perform 33:905-914.
Lehmann D, Skrandies W (1980) Reference-free identification of components of checkerboard-
evoked multichannel potential fields. Electroencephalogr Clin Neurophysiol 48:609-621.
Leitman DI, Sehatpour P, Shpaner M, Foxe JJ, Javitt DC (2009) Mismatch negativity to tonal
contours suggests preattentive perception of prosodic content. Brain Imaging Behav
3:284-291.
Li X, Lu Y, Sun G, Gao L, Zhao L (2012) Visual mismatch negativity elicited by facial
expressions: new evidence from the equiprobable paradigm. Behav Brain Funct 8:7.
Liberman AM (1982) On finding that speech is special. American Psychologist 37:148-167.
Liberman AM, Mattingly IG (1985) The motor theory of speech perception revised. Cognition
21:1-36.
Liberman AM, Mattingly IG (1989) A specialization for speech perception. Science 243:489-494.
Liberman AM, Whalen DH (2000) On the relation of speech to language. Trends in Cognitive
Sciences 4:187-196.
Liew AW-C, Wang S (2009) Visual Speech Recognition: Lip Segmentation and Mapping.
Hershey, NY: IGI Global.
Lindblom B, Maddieson I (1988) Phonetic universals in consonant systems. In: Language,
Speech, and Mind (Hyman L, Li C, eds), pp 62-78. New York: Routledge.
Logothetis NK, Sheinberg DL (1996) Visual object recognition. Annu Rev Neurosci 19:577-621.
Luck SJ (2005) Ten Simple Rules for Designing ERP Experiments. In: Event-Related Potentials:
A Methods Handbook (Handy TC, ed), pp 17-32. Cambridge, MA: The MIT Press.
Ma WJ, Zhou X, Ross LA, Foxe JJ, Parra LC (2009) Lip-reading aids word recognition most in
moderate noise: a Bayesian explanation using high-dimensional feature space. PLoS
ONE 4:e4638.
MacEachern MR (2000) On the visual distinctiveness of words in the English lexicon. J Phon
28:367-376.
MacLeod A, Summerfield Q (1987) Quantifying the contribution of vision to speech perception
in noise. Br J Audiol 21:131-141.
Macmillan NA, Creelman CD (1991) Detection theory: A user's guide. Cambridge, England;
New York: Cambridge University Press.
MacSweeney M, Calvert GA, Campbell R, McGuire PK, David AS, Williams SC, Woll B,
Brammer MJ (2002a) Speechreading circuits in people born deaf. Neuropsychologia
40:801-807.
MacSweeney M, Woll B, Campbell R, McGuire PK, David AS, Williams SC, Suckling J, Calvert
GA, Brammer MJ (2002b) Neural systems underlying British Sign Language and audio-
visual English processing in native users. Brain 125:1583-1593.
Maris E (2004) Randomization tests for ERP topographies and whole spatiotemporal data
matrices. Psychophysiology 41:142-151.
Maris E, Oostenveld R (2007) Nonparametric statistical testing of EEG- and MEG-data. J
Neurosci Methods 164:177-190.
Massaro DW (1984) Children's perception of visual and auditory speech. Child Dev 55:1777-
1788.
Massaro DW (1998) Perceiving Talking Faces: From Speech Perception to a Behavioral
Principle. Cambridge, Massachusets: The MIT Press.
Massaro DW, Hary JM (1984) Categorical results, categorical perception, and hindsight.
Perception & Psychophysics 35:586-588.
Massaro DW, Cohen MM (1996) Perceiving speech from inverted faces. Perception and
Psychophysics 58:1047-1065.
Massaro DW, Cohen MM (1999) Speech perception in perceivers with hearing loss: synergy of
multiple modalities. Journal of Speech, Language, and Hearing Research 42:21-41.
Massaro DW, Cohen MM, Smeele PM (1996) Perception of asynchronous and conflicting visual
and auditory speech. The Journal of the Acoustical Society of America 100:1777-1786.
Massaro DW, Thompson LA, Barron B, Laren E (1986) Developmental changes in visual and
auditory contributions to speech perception. J Exp Child Psychol 41:93-113.
Mattys SL, Bernstein LE, Auer ET, Jr. (2002) Stimulus-based lexical distinctiveness as a general
word-recognition mechanism. Perception and Psychophysics 64:667-679.
May PJC, Tiitinen H (2010) Mismatch negativity (MMN), the deviance-elicited auditory
deflection, explained. Psychophysiology 47:66-122.
McGurk H, MacDonald J (1976) Hearing lips and seeing voices. Nature 264:746-748.
McKone E, Yovel G (2009) Why does picture-plane inversion sometimes dissociate perception of
features and spacing in faces, and sometimes not? Toward a new theory of holistic
processing. Psychon Bull Rev 16:778-797.
Michel CM, Murray MM, Lantz G, Gonzalez S, Spinelli L, de Peralta RG (2004) EEG source
imaging. Clinical Neurophysiology 115:2195-2222.
Miki K, Watanabe S, Kakigi R, Puce A (2004) Magnetoencephalographic study of
occipitotemporal activity elicited by viewing mouth movements. Clinical Neurophysiology 115:1559-1574.
Miki K, Watanabe S, Kakigi R, Puce A (2006) Cortical activities elicited by viewing mouth
movements: a magnetoencephalographic study. Suppl Clin Neurophysiol 59:27-34.
Miller GA, Nicely PE (1955) An Analysis of Perceptual Confusions Among Some English
Consonants. The Journal of the Acoustical Society of America 27:338-352.
Mishima K, Yamada T, Matsumura T, Moritani N (2011) Analysis of lip motion using principal
component analyses. Journal of Cranio-Maxillofacial Surgery 39:232-236.
Mohammed T, Campbell R, MacSweeney M, Milne E, Hansen P, Coleman M (2005)
Speechreading skill and visual movement sensitivity are related in deaf speechreaders.
Perception 34:205-216.
Montgomery AA, Jackson PL (1983) Physical characteristics of the lips underlying vowel
lipreading performance. J Acoust Soc Am 73:2134-2144.
Möttönen R, Krause CM, Tiippana K, Sams M (2002) Processing of changes in visual speech in
the human auditory cortex. Cognitive Brain Research 13:417-425.
Munhall KG, Vatikiotis-Bateson E (1998) The moving face during speech communication. In:
Hearing by Eye II: advances in the psychology of speechreading and auditory-visual
speech (Campbell R, Dodd B, Burnham D, eds): Psychology Press.
Munhall KG, Vatikiotis-Bateson E (2004) Spatial and temporal constraints on audiovisual speech
perception. In: The Handbook of Multisensory Processes (Calvert GA, Spence C, Stein
BE, eds), pp 177-188. Cambridge, MA: MIT Press.
Munhall KG, Buchan JN (2004) Something in the way she moves. Trends Cogn Sci 8:51-53.
Munhall KG, Jones JA, Callan DE, Kuratate T, Vatikiotis-Bateson E (2004) Visual prosody and
speech intelligibility: head movement improves auditory speech perception. Psychol Sci
15:133-137.
Murray MM, Brunet D, Michel CM (2008) Topographic ERP analyses: a step-by-step tutorial
review. Brain Topography 20:249-264.
Näätänen R, Alho K (1995) Mismatch negativity--a unique measure of sensory processing in
audition. Int J Neurosci 80:317-337.
Näätänen R, Gaillard AWK, Mäntysalo S (1978) Early selective-attention effect on evoked
potential reinterpreted. Acta Psychologica 42:313-329.
Näätänen R, Jacobsen T, Winkler I (2005) Memory-based or afferent processes in mismatch
negativity (MMN): a review of the evidence. Psychophysiology 42:25-32.
Näätänen R, Kujala T, Winkler I (2011) Auditory processing that leads to conscious perception:
A unique window to central auditory processing opened by the mismatch negativity and
related responses. Psychophysiology 48:4-22.
Näätänen R, Paavilainen P, Rinne T, Alho K (2007) The mismatch negativity (MMN) in basic
research of central auditory processing: a review. Clin Neurophysiol 118:2544-2590.
Nahorna O, Berthommier F, Schwartz JL (2012) Binding and unbinding the auditory and visual
streams in the McGurk effect. The Journal of the Acoustical Society of America
132:1061-1077.
Nakagawa S (2004) A farewell to Bonferroni: the problems of low statistical power and
publication bias. Behav Ecol 15:1044-1045.
Nath AR, Beauchamp MS (2011) Dynamic changes in superior temporal sulcus connectivity
during perception of noisy audiovisual speech. J Neurosci 31:1704-1714.
Nath AR, Beauchamp MS (2012) A neural basis for interindividual differences in the McGurk
effect, a multisensory speech illusion. NeuroImage 59:781-787.
Neisser U, Beller HK (1965) Searching through word lists. Br J Psychol 56:349-358.
Nichols TE, Holmes AP (2002) Nonparametric permutation tests for functional neuroimaging: a
primer with examples. Hum Brain Mapp 15:1-25.
Nousak JM, Deacon D, Ritter W, Vaughan HG, Jr. (1996) Storage of information in transient
auditory memory. Brain research Cognitive brain research 4:305-317.
Obleser J, Eisner F (2009) Pre-lexical abstraction of speech in the auditory cortex. Trends in
Cognitive Sciences 13:14-19.
Okada K, Hickok G (2009) Two cortical mechanisms support the integration of visual and
auditory speech: A hypothesis and preliminary data. Neurosci Lett 452:219-223.
Okada K, Rong F, Venezia J, Matchin W, Hsieh IH, Saberi K, Serences JT, Hickok G (2010)
Hierarchical organization of human auditory cortex: evidence from acoustic invariance in
the response to intelligible speech. Cereb Cortex 20:2486-2495.
Oldfield RC (1971) The assessment and analysis of handedness: the Edinburgh inventory.
Neuropsychologia 9:97-113.
Paulesu E, Perani D, Blasi V, Silani G, Borghese NA, De Giovanni U, Sensolo S, Fazio F (2003)
A functional-anatomical model for lipreading. J Neurophysiol 90:2005-2013.
Pazo-Alvarez P, Cadaveira F, Amenedo E (2003) MMN in the visual modality: a review.
Biological Psychology 63:199-236.
Pazo-Alvarez P, Amenedo E, Cadaveira F (2004) Automatic detection of motion direction
changes in the human brain. The European journal of neuroscience 19:1978-1986.
Pekkola J, Ojanen V, Autti T, Jaaskelainen IP, Mottonen R, Sams M (2006) Attention to visual
speech gestures enhances hemodynamic activity in the left planum temporale. Hum Brain
Mapp 27:471-477.
Pekkola J, Ojanen V, Autti T, Jaaskelainen IP, Mottonen R, Tarkiainen A, Sams M (2005)
Primary auditory cortex activation by visual speech: an fMRI study at 3 T. Neuroreport
16:125-128.
Picton TW, Alain C, Otten L, Ritter W, Achim A (2000a) Mismatch Negativity: Different Water
in the Same River. Audiology and Neurotology 5:111-139.
Picton TW, Bentin S, Berg P, Donchin E, Hillyard SA, Johnson R, Miller GA, Ritter W, Ruchkin
DS, Rugg MD, Taylor MJ (2000b) Guidelines for using human event-related potentials to
study cognition: recording standards and publication criteria. Psychophysiology 37:127-
152.
Pisoni DB, Remez RE (2004) Handbook of Speech Perception. Cambridge: MIT.
Ponton CW, Bernstein LE, Auer ET, Jr. (2009) Mismatch negativity with visual-only and
audiovisual speech. Brain Topography 21:207-215.
Ponton CW, Don M, Eggermont JJ, Kwong B (1997) Integrated mismatch negativity (MMNi): a
noise-free representation of evoked responses allowing single-point distribution-free
statistical tests. Electroencephalogr Clin Neurophysiol 104:143-150.
Posner MI, Mitchell RF (1967) Chronometric analysis of classification. Psychol Rev 74:392-409.
Preminger JE, Lin HB, Payen M, Levitt H (1998) Selective visual masking in speechreading.
Journal of Speech, Language, and Hearing Research 41:564-575.
Proverbio AM, Del Zotto M, Crotti N, Zani A (2009) A no-go related prefrontal negativity larger
to irrelevant stimuli that are difficult to suppress. Behav Brain Funct 5:25.
Puce A, Smith A, Allison T (2000) Erps evoked by viewing facial movements. Cognitive
Neuropsychology 17:221-239.
Puce A, Epling JA, Thompson JC, Carrick OK (2007) Neural responses elicited to face motion
and vocalization pairings. Neuropsychologia 45:93-106.
Puce A, Allison T, Asgari M, Gore JC, McCarthy G (1996) Differential sensitivity of human
visual cortex to faces, letterstrings, and textures: a functional magnetic resonance
imaging study. The Journal of Neuroscience 16:5205-5215.
Puce A, Allison T, Bentin S, Gore JC, McCarthy G (1998) Temporal cortex activation in humans
viewing eye and mouth movements. The Journal of Neuroscience 18:2188-2199.
Puce A, Syngeniotis A, Thompson JC, Abbott DF, Wheaton KJ, Castiello U (2003) The human
temporal lobe integrates facial form and motion: evidence from fMRI and ERP studies.
NeuroImage 19:861-869.
Pulvermuller F, Shtyrov Y, Ilmoniemi R (2003) Spatiotemporal dynamics of neural language
processing: an MEG study using minimum-norm current estimates. NeuroImage
20:1020-1025.
Pulvermuller F, Huss M, Kherif F, Moscoso del Prado Martin F, Hauk O, Shtyrov Y (2006)
Motor cortex maps articulatory features of speech sounds. Proc Natl Acad Sci U S A
103:7865-7870.
Ramsay JO, Munhall KG, Gracco VL, Ostry DJ (1996) Functional data analyses of lip motion. J
Acoust Soc Am 99:3718-3727.
Rizzolatti G, Arbib MA (1998) Language within our grasp. Trends Neurosci 21:188-194.
Rosenblum LD (2008) Speech Perception as a Multimodal Phenomenon. Current Directions in
Psychological Science 17:405-409.
Rosenblum LD, Saldaña HM (1996) An audiovisual test of kinematic primitives for visual speech
perception. Journal of Experimental Psychology: Human Perception and Performance
22:318-331.
Rosenblum LD, Johnson JA, Saldaña HM (1996) Point-light facial displays enhance
comprehension of speech in noise. Journal of Speech and Hearing Research 39:1159-
1170.
Rosenblum LD, Yakel DA, Green KP (2000) Face and mouth inversion effects on visual and
audiovisual speech perception. Journal of Experimental Psychology: Human Perception
and Performance 26:806-819.
Rosenblum LD, Miller RM, Sanchez K (2007) Lip-read me now, hear me better later: cross-
modal transfer of talker-familiarity effects. Psychol Sci 18:392-396.
Rosenblum LD, Smith NM, Nichols SM, Hale S, Lee J (2006) Hearing a face: cross-modal
speaker matching using isolated visible speech. Percept Psychophys 68:84-93.
Ross LA, Saint-Amour D, Leavitt VM, Javitt DC, Foxe JJ (2007) Do you see what I am saying?
Exploring visual enhancement of speech comprehension in noisy environments. Cereb
Cortex 17:1147-1153.
Rossion B (2008) Picture-plane inversion leads to qualitative changes of face perception. Acta
Psychologica 128:274-289.
Rossion B, Gauthier I (2002) How does the brain process upright and inverted faces? Behav Cogn
Neurosci Rev 1:63-75.
Saffran JR, Aslin RN, Newport EL (1996) Statistical learning by 8-month-old infants. Science
274:1926-1928.
Saint-Amour D, Sanctis PD, Molholm S, Ritter W, Foxe JJ (2007) Seeing voices: High-density
electrical mapping and source-analysis of the multisensory mismatch negativity evoked
during the McGurk illusion. Neuropsychologia 45:587-597.
Saldaña HM, Rosenblum LD (1994) Selective adaptation in speech perception using a compelling
audiovisual adaptor. J Acoust Soc Am 95:3658-3661.
Sams M, Alho K, Näätänen R (1984) Short-term habituation and dishabituation of the mismatch
negativity of the ERP. Psychophysiology 21:434-441.
Sams M, Aulanko R, Hämäläinen M, Hari R, Lounasmaa OV, Lu ST, Simola J (1991) Seeing
speech: visual information from lip movements modifies activity in the human auditory
cortex. Neurosci Lett 127:141-145.
Sato M, Tremblay P, Gracco VL (2009) A mediating role of the premotor cortex in phoneme
segmentation. Brain Lang 111:1-7.
Schroeder CE, Lakatos P, Kajikawa Y, Partan S, Puce A (2008) Neuronal oscillations and visual
amplification of speech. Trends Cogn Sci 12:106-113.
Schröger E, Wolff C (1996) Mismatch response of the human brain to changes in sound location.
Neuroreport 7:3005-3008.
Schröger E, Wolff C (1997) Fast preattentive processing of location: a functional basis for
selective listening in humans. Neurosci Lett 232:5-8.
Scott SK (2005) Auditory processing--speech, space and auditory objects. Curr Opin Neurobiol
15:197-201.
Scott SK, Johnsrude IS (2003) The neuroanatomical and functional organization of speech
perception. Trends Neurosci 26:100-107.
Scott SK, McGettigan C, Eisner F (2009) A little more conversation, a little less action--candidate
roles for the motor cortex in speech perception. Nat Rev Neurosci 10:295-302.
Scott SK, Blank CC, Rosen S, Wise RJ (2000) Identification of a pathway for intelligible speech
in the left temporal lobe. Brain 123 Pt 12:2400-2406.
Scott SK, Rosen S, Lang H, Wise RJ (2006) Neural correlates of intelligibility in speech
investigated with noise vocoded speech--a positron emission tomography study. J Acoust
Soc Am 120:1075-1083.
Sekiyama K, Tohkura Y (1991) McGurk effect in non-English listeners: few visual effects for
Japanese subjects hearing Japanese syllables of high auditory intelligibility. J Acoust Soc
Am 90:1797-1805.
Semlitsch HV, Anderer P, Schuster P, Presslich O (1986) A Solution for Reliable and Valid
Reduction of Ocular Artifacts, Applied to the P300 ERP. Psychophysiology 23:695-703.
Shepard RN, Chipman S (1970) Second-order isomorphism of internal representations: Shapes of
states. Cognitive Psychology 1:1-17.
Shigeno S (2002) Anchoring effects in audiovisual speech perception. J Acoust Soc Am
111:2853-2861.
Skipper JI, Nusbaum HC, Small SL (2005) Listening to talking faces: motor cortical activation
during speech perception. NeuroImage 25:76-89.
Skrandies W (1990) Global field power and topographic similarity. Brain Topography 3:137-141.
Spitsyna G, Warren JE, Scott SK, Turkheimer FE, Wise RJ (2006) Converging language streams
in the human temporal lobe. J Neurosci 26:7328-7336.
Stefanics G, Kimura M, Czigler I (2011) Visual mismatch negativity reveals automatic detection
of sequential regularity violation. Front Hum Neurosci 5:46.
Stefanics G, Csukly G, Komlosi S, Czobor P, Czigler I (2012) Processing of unattended facial
emotions: a visual mismatch negativity study. NeuroImage 59:3042-3049.
Stevens KN (1998) Acoustic Phonetics. Cambridge, MA: MIT Press.
Sumby W, Pollack I (1954) Visual Contribution to Speech Intelligibility in Noise. J Acoust Soc
Am 26:212-215.
Summerfield Q (1987) Some Preliminaries to a Comprehensive Account of Audio-visual Speech
Perception. In: Hearing by Eye: The Psychology of Lip-reading (Dodd B, Campbell R,
eds), pp 3-51. London: Lawrence Erlbaum Associates.
Summerfield Q (1992) Lipreading and audio-visual speech perception. Philos Trans R Soc Lond
B Biol Sci 335:71-78.
Susilo T, McKone E, Edwards M (2010) Solving the upside-down puzzle: Why do upright and
inverted face aftereffects look alike? J Vis 10:1-16.
Tadel F, Baillet S, Mosher JC, Pantazis D, Leahy RM (2011) Brainstorm: a user-friendly
application for MEG/EEG analysis. Computational Intelligence and Neuroscience
2011:879716.
Thomas SM, Jordan TR (2002) Determining the influence of Gaussian blurring on inversion
effects with talking faces. Perception & Psychophysics 64:932-944.
Thomas SM, Jordan TR (2004) Contributions of oral and extraoral facial movement to visual and
audiovisual speech perception. Journal of Experimental Psychology: Human Perception
and Performance 30:873-888.
Thompson JC, Hardee JE, Panayiotou A, Crewther D, Puce A (2007) Common and distinct brain
activation to viewing dynamic sequences of face and hand movements. NeuroImage
37:966-973.
Ungerleider LG, Haxby JV (1994) "What" and "where" in the human brain. Curr Opin Neurobiol
4:157-165.
Walden BE, Montgomery AA, Prosek RA (1987) Perception of synthetic visual consonant-vowel
articulations. Journal of Speech and Hearing Research 30:418-424.
Walden BE, Prosek RA, Montgomery AA, Scherr CK, Jones CJ (1977) Effects of training on the
visual recognition of consonants. J Speech Hear Res 20:130-145.
Watkins KE, Strafella AP, Paus T (2003) Seeing and hearing speech excites the motor system
involved in speech production. Neuropsychologia 41:989-994.
Wheaton KJ, Thompson JC, Syngeniotis A, Abbott DF, Puce A (2004) Viewing the motion of
human body parts activates different regions of premotor, temporal, and parietal cortex.
NeuroImage 22:277-288.
Wilson SM, Saygin AP, Sereno MI, Iacoboni M (2004) Listening to speech activates motor areas
involved in speech production. Nat Neurosci 7:701-702.
Winkler I, Czigler I (2012) Evidence from auditory and visual event-related potential (ERP)
studies of deviance detection (MMN and vMMN) linking predictive coding theories and
perceptual object representations. Int J Psychophysiol 83:132-143.
Winkler I, Horváth J, Weisz J, Trejo LJ (2009) Deviance detection in congruent audiovisual
speech: evidence for implicit integrated audiovisual memory representations. Biological
Psychology 82:281-292.
Winkler I, Czigler I, Sussman E, Horváth J, Balazs L (2005) Preattentive binding of auditory and
visual stimulus features. J Cognit Neurosci 17:320-339.
Yehia H, Rubin P, Vatikiotis-Bateson E (1998) Quantitative association of vocal-tract and facial
behavior. Speech Communication 26:23-43.
Yehia H, Kuratate T, Vatikiotis-Bateson E (1999) Using speech acoustics to drive facial motion.
In: ICPhS 1999, pp 631-634. San Francisco, CA.
Yeung N, Bogacz R, Holroyd CB, Cohen JD (2004) Detection of synchronized oscillations in the
electroencephalogram: an evaluation of methods. Psychophysiology 41:822-832.
Yeung N, Bogacz R, Holroyd CB, Nieuwenhuis S, Cohen JD (2007) Theta phase resetting and
the error-related negativity. Psychophysiology 44:39-49.
Yin RK (1969) Looking at upside-down faces. Journal of Experimental Psychology 81:141-145.
Yoder PJ, Blackford JU, Waller NG, Kim G (2004) Enhancing power while controlling family-
wise error: an illustration of the issues using electrocortical studies. J Clin Exp
Neuropsychol 26:320-331.
Yovel G, Kanwisher N (2005) The neural basis of the behavioral face-inversion effect. Curr Biol
15:2256-2262.
Yunusova Y, Weismer GG, Lindstrom MJ (2011) Classifications of vocalic segments from
articulatory kinematics: healthy controls and speakers with dysarthria. Journal of Speech,
Language, and Hearing Research 54:1302-1311.
Zeki S (2005) The Ferrier Lecture 1995: Behind the Seen: The functional specialization of the
brain in space and time. Philosophical Transactions: Biological Sciences 360:1145-1183.
Appendix A: GFP and Permutation Tests
This appendix serves to clarify and justify the approach to statistical testing taken in
Chapter 3. First, global field power (GFP) is introduced. Noise and its impact on GFP are
discussed, and simulations are used to illustrate the impact of noise on GFP supporting the
conclusion that GFP is biased upward by noise. Permutation tests in general are briefly described,
and two approaches to paired-samples data are described. Further simulations then show that the
previously demonstrated bias of GFP has practical implications for the standard paired-samples
permutation test when applied to GFP data but not for the unbalanced paired-samples permutation
test approach used in Chapter 3.
Global Field Power
Global field power (GFP) (Lehmann and Skrandies, 1980; Skrandies, 1990) is the spatial
standard deviation of a set of average-referenced electrode voltages measured at a particular time
sample. GFP is given as:

\mathrm{GFP}(t) = \sqrt{\frac{1}{N}\sum_{i=1}^{N} v_i(t)^{2}},

where N is the number of electrodes in the montage, and v_i(t) is the average-referenced
voltage on electrode i at time t (adapted from Lehmann and Skrandies, 1980, equation 1B). When
the voltages are relatively uniform, the spatial standard deviation is relatively small, so the GFP
will be relatively low. When there are extreme positive and extreme negative values in the
voltages, the spatial standard deviation is relatively large, so the GFP will be relatively high. GFP
is insensitive to the spatial arrangement of the voltages on a set of electrodes. In particular, the
spatial arrangement of electrode voltages can change without changing the GFP. For example,
consider two spatial arrangements of electrodes in which one is obtained by multiplying the other
by -1. These two arrangements would have inverted polarity, but would have identical GFP. As
another example, consider the spatial arrangement of voltages generated by a single current
dipole of unit strength, and the arrangement generated by that same dipole rotated 30 degrees in
some direction. These arrangements would have different topographies, but would have identical
GFP. For more examples, see (Murray et al., 2008). Therefore, GFP is an especially suitable
measure for latency and amplitude effects, rather than for effects that are expected to be primarily
spatial.
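For concreteness, the computation can be sketched in a few lines of Python/NumPy. This is a minimal illustration only; the function and array names (global_field_power, erp) and the 64-electrode, 400-sample shape are illustrative and are not taken from the analyses reported in this dissertation.

```python
import numpy as np

def global_field_power(v):
    """Spatial standard deviation across electrodes at each time sample.

    v : array of shape (n_electrodes, n_samples) of average-referenced
        voltages; returns an array of shape (n_samples,).
    """
    # Remove the spatial mean at each sample; for average-referenced data
    # this mean is already (near) zero, so the step is only a safeguard.
    v = v - v.mean(axis=0, keepdims=True)
    return np.sqrt(np.mean(v ** 2, axis=0))

# GFP is insensitive to polarity inversion, as noted in the text.
rng = np.random.default_rng(0)
erp = rng.standard_normal((64, 400))
assert np.allclose(global_field_power(erp), global_field_power(-erp))
```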
Noise and GFP
Up to this point in this appendix, GFP has been treated as the result of a mathematical
operation. The interpretation of GFP depends critically on what kind of data serve as the basis for
obtaining the GFP. There are two main types of data of which GFP is usually taken, and they
have different implications for the interpretation of GFP. These types of data will be introduced,
and then the connection will be made between these types of data and how the GFP resulting
from them can be interpreted.
The first type of data is continuous EEG recordings (or equivalently single sweeps, which
are short continuous EEG recordings around some stimulus event of interest). At any given
moment during an EEG experiment, an EEG electrode will measure electric fields generated by a
myriad of sources, including neural activity related to the experiment, neural activity unrelated to
the experiment, muscle activity, and ambient/environmental electric fields (Handy, 2005). The
second type of data is the averaged evoked potential (EP). An EP is the result of averaging the
EEG resulting from many repeated presentations of a stimulus event using some temporal
landmark (such as stimulus onset) as a temporal reference (Dawson, 1947). The reason for
averaging together multiple sweeps is to reduce the noise under the assumption that noise will not
be phase-locked to a stimulus event, but the signal will be. Averaging together EEG from
multiple stimulus presentations will cancel out the noise while the signal will remain, although
total noise cancellation is never actually obtained: In general, the signal to noise ratio (i.e. the
amplitude of the noise-free signal divided by the RMS of the signal-free noise) increases with the
square-root of the number of sweeps contributing to the average (Luck, 2005). In summary, the
critical difference between these two types of data is that the first contains signal and noise and
the second contains a signal plus whatever noise remains after averaging. In other words, the
single-trial EEG emphasizes both the signal and the noise, and the averaged evoked potential
emphasizes the signal.
The differences in emphasis in these two kinds of data lead to the differences in
interpretation of the GFP of those data. Because single-sweep measures emphasize both the signal
and the noise, a relatively high GFP means that the signal, the noise, or both the signal and the
noise had extreme values. A relatively low GFP in single-sweep data means that either the signal
and the noise were both low or the signal and the noise cancelled each other out by chance. When
averaging is sufficient to remove most of the noise (as in the averaged EP), the interpretation of
the GFP is different: In the case of EP data, a relatively high GFP means that there were extreme
values in the signal, and a relatively low GFP means that there was little if any signal. In short,
taking the GFP of single-trial data characterizes the signal and the noise, but taking the GFP of
averaged evoked potentials characterizes the evoked response and deemphasizes the role of the
noise.
It is, of course, possible to take an average of multiple GFP measurements. One might
consider taking the GFP of N single sweeps and averaging those N GFP measurements. This
would result in an averaged GFP. Because each single sweep GFP measurement measures the
spatial standard deviation of the signal and the noise, averaging these single sweep GFP
measurements would result in an average spatial standard deviation of the signal and the noise. In
particular, the noise would not cancel out. This stands in contrast to taking the GFP of the
averaged evoked potential, because averaging cancels out some amount of the noise. In other
words, the average of the single-sweep GFP is an estimate of the typical spatial standard
deviation of a single trial, while the GFP of the averaged evoked potential is an estimate of the
spatial standard deviation of the signal.
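The two quantities can be written side by side; the following is a minimal sketch (self-contained, with hypothetical array names and sizes, and pure noise as a stand-in for single-trial EEG):

```python
import numpy as np

def gfp(v):
    # Spatial standard deviation across electrodes at each time sample.
    v = v - v.mean(axis=0, keepdims=True)
    return np.sqrt(np.mean(v ** 2, axis=0))

rng = np.random.default_rng(1)
sweeps = rng.standard_normal((200, 64, 400))   # trials x electrodes x samples

# Average of single-sweep GFPs: typical spatial SD of signal plus noise.
mean_of_gfps = np.mean([gfp(s) for s in sweeps], axis=0)

# GFP of the averaged evoked potential: spatial SD of (mostly) the signal.
gfp_of_mean = gfp(sweeps.mean(axis=0))
```

With pure-noise input, as in this placeholder, mean_of_gfps stays near the single-sweep noise level while gfp_of_mean shrinks toward zero as more trials are averaged.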
GFP and the Number of Trials
As mentioned above, averaging single-sweep EEG data will cancel out some of the noise,
but not all of it. As the number of trials going into an average increases, the noise remaining in
that average decreases. As described in the previous sections, the presence or absence of noise
alters the interpretation of GFP, but this is an over-simplification as there will always be some
noise. This section lays out a qualitative description of the relationship between the number of
trials contributing to an average and the GFP of the resulting average. This qualitative account is
substantiated in the following section with examples from simulations.
When there is no signal at all, then noise can only increase the spatial standard deviation.
Standard deviation involves a squaring of a real number, so standard deviation cannot be less than
zero. The spatial standard deviation of a set of uniform voltages is zero, so any change (due to
e.g. noise) would increase the spatial standard deviation. The more noise is added to a set of
uniform voltages, the larger the increase will be. Again, averaging reduces noise with the square
root of the number of trials going into that average, so in the case in which there is no signal, the
GFP of the average will decrease as the number of trials increases.
The situation is slightly more complicated when there is a signal. When a signal is
present, noise might cancel out some or all of the signal. In that case, the spatial standard
deviation of the signal plus noise would be smaller than the spatial standard deviation of the
signal alone. On the other hand, noise could add to the signal and thereby increase the spatial
standard deviation. As mentioned above, the spatial standard deviation must be positive, so there
is a lower limit to the effect of noise on GFP: noise cannot reduce GFP to below zero. In contrast,
there is no upper limit to the effect of noise on GFP except for the maximum amplitude of the
noise itself. Because noise can cause a limited reduction in GFP but an unlimited increase in GFP,
the net effect of noise will be to increase GFP. More formally, GFP is biased positively by noise.
The degree of bias depends on the imbalance between the limited reduction to GFP and the
unlimited increase to GFP that noise can cause. Intuitively, if the signal is very large relative to
the noise, it is unlikely that the noise could cancel out the signal and thereby collide with the limit
on reduction of GFP, so the imbalance between the limited reduction and unlimited increase
would be small. If the noise is very large relative to the signal, then the noise could very easily
cancel out the signal and the imbalance between decrease and increase would be large. In other
words, the degree to which noise biases GFP depends on the relative size of the signal and the
noise.
To summarize the qualitative relationship between the number of trials and GFP: an
increase in the number of trials will decrease the amount of noise in the average. A decrease in
the amount of noise will reduce the positive biasing effect of noise on GFP. The impact of this
bias varies with the GFP of the underlying signal such that when there is little or no signal, the
bias will be greatest, but the bias will be reduced where the signal is large relative to the noise.
Simulations described in the next section were run to confirm these qualitative relationships.
GFP Simulations
To confirm the qualitative relationships among the number of trials, GFP, and signal to
noise ratio, three simulations were run. The first simulation examines the relationship between
GFP and number of trials in an extremely simple artificial case. The second simulation shows that
the relationship between GFP and number of trials is mediated by the signal to noise ratio in this
extremely simple case. The third simulation uses real EEG data to show that the conclusions
drawn from the extremely simple case hold even when using real data.
Simulation 1: GFP decreases as the number of trials increases
Simulation 1 examined how GFP changes with the number of sweeps that go into
creating the average evoked potential of which the GFP is computed. The general approach was
to generate synthetic single-sweep data by adding synthetic pseudo-random noise to a synthetic
underlying effect. Averages were computed by averaging together multiple such synthetic single
sweeps. GFP was then computed for averages based on varying numbers of sweeps to see how
GFP varied with number of trials. The specific details of this simulation were chosen to
approximate a simple but plausible EEG signal while avoiding unnecessary complications.
Figure A.1. The underlying effect for GFP Simulation 1. The upper panel shows the voltages on each electrode at each sample. The lower panel shows how GFP of the true effect varies with time.

The goal for the underlying effect was to achieve a spatially and temporally smooth effect that summed to zero, which is consistent with a physiologic source projected onto an average-referenced electrode array. One approach to generating a true effect would be to compute a forward model relating neural sources to
electrode voltages and then to simulate a dipole source and its projection onto the electrodes (e.g.,
Yeung et al., 2007). However, this approach is both complicated and unnecessary, as GFP is
insensitive to spatial effects. Therefore, the true effect data were generated as a raised cosine over
time with a period of 0.5 s multiplied by a sine wave over electrodes with a linear arrangement
with a period of 64 electrodes (Figure A.1). The simulated sampling rate was 400 Hz. A baseline
interval equal to the duration of the signal interval preceded the signal interval. The peak
amplitude of the true effect was 1.5 µV and the peak of the GFP for the true effect was 1.05 µV.
These amplitude values were chosen to approximate a moderate ERP effect. A range of values for
effect size could have been used, but as the simulation addressed how GFP varies with number of
trials, the effect was kept constant and the number of trials was varied.
Figure A.2. Characterization of a single-sweep example of the noise used in GFP Simulations 1 & 2. The upper panel shows an example of a single sweep of noise. The histogram in the lower panel shows the relative frequency of occurrence of different noise values.

To this underlying effect, noise was added to generate synthetic single sweeps. In keeping with other simulations of EEG noise (e.g., Yeung et al., 2004; Freeman, 2006), the noise added was zero-mean pink noise (i.e., with a 1/f spectrum) generated by taking
the two-dimensional inverse Fourier transform of a 1/f spectrum (where f ranges from 1 to 400
Hz) with pseudo-random phase shifts. The standard deviation of the noise taken over the number
of time samples (400) and electrodes (64) in a sweep was 2.62. A typical example of the noise
added to the true effect for a single sweep is depicted in (Figure A.2). Using the noise distribution
described results in voltages that rarely exceed +/- 10 µV. Typical artifact rejection calls for
discarding any data in which the voltage exceeds values such as +/- 50, 100 or 200 µV (Picton et
al., 2000b), so these values are well within what would be accepted as clean data.
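A rough, self-contained sketch of this procedure is given below. The true effect and noise follow the description above, but the pink-noise construction here is a simplified one-dimensional 1/f filter per sweep rather than the two-dimensional inverse Fourier transform used in the appendix, and the helper names (gfp, pink_noise) and scaling details are illustrative rather than the original code.

```python
import numpy as np

rng = np.random.default_rng(2)
n_elec, fs = 64, 400                       # electrodes, sampling rate (Hz)
n_base = n_sig = fs // 2                   # 0.5 s baseline + 0.5 s signal

def gfp(v):
    v = v - v.mean(axis=0, keepdims=True)
    return np.sqrt(np.mean(v ** 2, axis=0))

# True effect: raised cosine in time (0.5 s period) times a sinusoid over a
# linear electrode arrangement (period of 64 electrodes), peak 1.5 uV.
t = np.arange(n_sig) / fs
time_course = 0.5 * (1.0 - np.cos(2 * np.pi * t / 0.5))
space = np.sin(2 * np.pi * np.arange(n_elec) / n_elec)
true_effect = np.concatenate(
    [np.zeros((n_elec, n_base)), 1.5 * np.outer(space, time_course)], axis=1)

def pink_noise(shape, scale=2.62):
    """Zero-mean noise with an approximately 1/f spectrum (simplified)."""
    spec = np.fft.rfft(rng.standard_normal(shape), axis=-1)
    freqs = np.fft.rfftfreq(shape[-1], d=1.0 / fs)
    freqs[0] = freqs[1]                    # avoid division by zero at DC
    noise = np.fft.irfft(spec / freqs, n=shape[-1], axis=-1)
    noise -= noise.mean()
    return scale * noise / noise.std()

for n_trials in (16, 32, 64, 128, 256, 512):
    sweeps = true_effect + np.stack(
        [pink_noise(true_effect.shape) for _ in range(n_trials)])
    g = gfp(sweeps.mean(axis=0))
    # Baseline GFP (pure noise) shrinks as the number of trials grows.
    print(n_trials, round(g[:n_base].mean(), 3))
```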
GFP was calculated based on 16, 32, 64, 128, 256, and 512 simulated trials in order to demonstrate the effect that varying the number of trials has on GFP. A new pseudo-random noise sample was generated for each trial. Figure A.3 shows the GFP of an average ERP generated using these numbers of trials.

Figure A.3. Example simulation of how GFP changes with number of sweeps. Lines show GFP of a mean ERP with increasing numbers of trials. GFP appears to be larger as the number of trials is smaller, and this effect is most pronounced over the baseline interval. The thick black line shows the GFP of the noise-free true effect.

The result of this simulation shows a clear trend during the baseline period: GFP is larger as the number of sweeps contributing to the average is smaller. Whether this is true during the signal period is less clear, although inspection of the
result reveals a possible trend. To further investigate the signal period, the simulation was
repeated 100 times. To summarize the result of each simulation, the GFP was averaged over the
time window spanning 185 to 310 ms (i.e. a 125 ms window centered on the actual peak of the
GFP) resulting in a single estimate of the GFP amplitude for each number of sweeps and
simulation repetitions. Boxplots of these summary values show a clear trend of decreasing GFP
with increasing numbers of trials (Figure A.4).
Based on these observations, it is clear that GFP increases as the number of trials
decreases under the circumstances simulated. The effect of the number of trials seems largest
when the amplitude of the underlying effect is smallest relative to the noise (i.e., during the
baseline period), in line with the qualitative expectation outlined above. Simulation 2 investigates
the relationship between GFP and the signal to noise ratio.
Figure A.4. Distributions of mean GFP amplitudes over 100 simulations. Boxplots show the distribution of the mean GFP taken over the 125 ms surrounding the peak of the underlying effect over 100 repetitions. The number of sweeps contributing to the GFP is doubled in each consecutive group. The median of the GFP mean amplitude approaches but does not reach the noise-free value.
Simulation 2: GFP depends on the signal to noise ratio
To see the effect of signal to noise ratio on GFP, a simulation similar to Simulation 1 was
run, but instead of systematically varying the number of sweeps, the number of sweeps was held
constant at 1000 and the true underlying GFP effect increased linearly from zero to 21 µV. To
see the effect of instantaneous SNR on the GFP, the instantaneous SNR was calculated as the
GFP of the true underlying signal divided by the standard deviation of the noise (2.62). Under
normal empirical circumstances, SNR must be defined in terms of an estimate of the noise and an
estimate of the signal. In these simulated data, both the signal and the noise are known. The
effect of SNR was calculated as the mean of the noise-contaminated GFP minus the mean of the
noise-free GFP. The effect of noise falls off rapidly at low SNRs but the fall-off decelerates as
SNR increases (Figure A.5).
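A compact sketch of this construction follows. It is self-contained and illustrative only: a single realization with a Gaussian residual after averaging is used in place of the appendix's 1000 sweeps of 1/f noise averaged over repetitions, and the variable names are assumptions, but the qualitative pattern is the same.

```python
import numpy as np

rng = np.random.default_rng(3)
n_elec, n_samp, n_trials, noise_sd = 64, 400, 1000, 2.62

def gfp(v):
    v = v - v.mean(axis=0, keepdims=True)
    return np.sqrt(np.mean(v ** 2, axis=0))

# Spatial pattern scaled to unit GFP, so the noise-free GFP equals the ramp.
pattern = np.sin(2 * np.pi * np.arange(n_elec) / n_elec)
pattern = pattern / np.sqrt(np.mean(pattern ** 2))
ramp = np.linspace(0.0, 21.0, n_samp)          # noise-free GFP, 0 to 21 uV
signal = np.outer(pattern, ramp)

# Residual noise left in an average of n_trials sweeps (std shrinks by sqrt(n)).
residual = (noise_sd / np.sqrt(n_trials)) * rng.standard_normal((n_elec, n_samp))

snr = ramp / noise_sd                          # instantaneous SNR
noise_effect = gfp(signal + residual) - ramp   # bias due to remaining noise
# Plotting noise_effect against snr shows the rapid fall-off at low SNR.
```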
Figure A.5. The effect of noise on GFP decreases as SNR increases. The effect of noise is the difference between the mean of the noise-contaminated GFP and the noise-free GFP. The SNR is the noise-free GFP divided by the standard deviation of the noise distribution.
Simulation 3: GFP varies with number of sweeps using real EEG data
Simulations 1 and 2 demonstrated using simulated data that the GFP decreases with
increasing numbers of trials and SNR, respectively. Although the simulations were designed to be
reasonable approximations of EEG data, some unexpected difference between simulated and real
data might invalidate those conclusions. Simulation 3 used EEG data recorded in an actual EEG
experiment. The data comprised between 600 and 800 trials per subject from 11 subjects, in which an identical stimulus was displayed under identical experimental conditions. To see how different
numbers of trials affect the group mean GFP, n trials were pseudo-randomly selected from all the
available trials for each subject, and for each subject a mean ERP was computed and then the
GFP for that mean ERP was computed, resulting in 11 GFP waveforms. These 11 GFP waveforms were averaged to generate a group-mean GFP. The number of trials selected was varied from 16, 32, 64, 128, 256, and 512 trials. For each number of trials, this randomization was repeated 100 times.

Figure A.6. Group mean GFP from different numbers of sweeps. Group-mean GFP of real EEG data increases as the number of sweeps decreases. Each trace shows the group-mean GFP averaged over 100 repetitions.
The average taken over all the repetitions of the group-mean GFP at each of the numbers
of trials is shown in Figure A.6. Over the entire time interval, the GFP is larger as the number of
trials is smaller. Another view is offered in Figure A.7, which shows the distribution of the mean
taken over time (i.e. the sweep-average) of the group-mean GFP at each of the 100 repetitions.
The sweep-average GFP is tightly clustered about the median, which decreases as the number of
sweeps increases.
Figure A.7. Distributions of sweep-average mean GFP over 100 resamplings. The distribution of sweep-average mean GFP for each of the 100 repetitions is depicted with boxplots. The sweep-average GFP clusters tightly around the median and decreases as the number of sweeps per subject increases.

This result is consistent with the results of Simulations 1 and 2. Together, these simulations provide clear evidence that the number of sweeps contributing to a GFP measure has a systematic effect on the GFP such that the fewer trials contributing to an average, the
larger GFP will be. The next section explores the impact of this relationship between number of
trials and GFP on a class of statistical tests called permutation tests.
Permutation Tests
This section introduces the fundamentals of permutation tests. Following this
introduction, mismatch negativity (MMN) experiments are briefly described to set the context for
applying permutation tests to a research question. Thereafter, simulations examine the
performance of two kinds of permutation tests when applied to GFP measures obtained in a
MMN experiment.
Permutation tests are a class of non-parametric statistical tests (Edgington and Onghena,
2007) and have been applied to electrophysiological measures (Blair and Karniski, 1993;
Karniski et al., 1994; Maris, 2004) including GFP (Murray et al., 2008; Koenig and Melie-Garcia,
2010; Koenig et al., 2011). Simulations show that permutation methods are equivalent to
traditional parametric methods when the underlying assumptions of the parametric methods are
known to be true (Greenblatt and Pflieger, 2004).
The mechanics of permutation testing have been well-described (for excellent tutorials on
this topic, see Blair and Karniski, 1993; Nichols and Holmes, 2002; Maris, 2004; Edgington and
Onghena, 2007; Maris and Oostenveld, 2007), but they will be briefly reviewed here. A
permutation test starts with a set of observations. Each observation has an associated label. For
example, each observation might be a subject’s score on some test, and the label for that
observation might be the experimental treatment that subject underwent. Of interest is whether
some summary statistic (such as the mean) of the data with one label differs in some way from
the same summary statistic of the data with another label. The corresponding null hypothesis,
then, is that the difference in the summary statistic could have arisen regardless of how the
observations were labeled. In other words, the null hypothesis is that the labeling of these
observations is arbitrary.
In order to test this null hypothesis, the test statistic obtained from the observations with
their labels intact is compared to a null distribution. The null distribution is generated by
permuting the labels of the data many times and calculating the test statistic for each of these
permuted data sets. If the null hypothesis is true, then there is nothing special about how the
observed data were labeled, and the test statistic should be typical of test statistics in the null
distribution. If, on the other hand, the null hypothesis is false, then the labeling of the data is not
arbitrary, and the observed test statistic should be unusual compared to the test statistics in the
null distribution. More formally, the p-value for the test is the proportion of test statistics in the
null distribution that are as extreme as or more extreme than the actually-observed test statistic.
Therefore, if the observed statistic falls in one of the extreme tails of the null distribution, the
observed data are unlikely to have arisen if the null hypothesis were true.
Any measure or statistic may be used in permutation testing. The choice of the test
statistic will change how the result of the test is interpreted, but it does not impact the underlying
logic of the test. Normalized statistics such as t or F may be used, although in many cases a
simpler statistic such as the mean difference between groups gives equivalent results at reduced
computational cost (Edgington and Onghena, 2007).
In the above example, the observations were measures from a single subject, but the
observations could be derived from any suitable experimental unit. An experimental unit can be
anything to which an experimental treatment may be assigned, such as an experimental subject, a
block in a functional MRI experiment, or a trial in a psychophysical experiment. Permutation
tests have been extended to repeated-measures (a.k.a. paired samples or dependent measures)
designs by restricting re-labeling to be within-subjects. In repeated-measures designs, the
experimental unit is the session or trial for a subject. In a two-condition experiment, each subject
might participate in two sessions with each session being randomly (albeit not independently)
assigned a condition.
Permutation tests require fewer assumptions than related parametric statistical tests. In
particular, while most parametric tests assume the data are randomly sampled from some larger
population, permutation tests make no such assumptions. Not requiring random sampling leads
to a limitation of permutation testing: it is impossible to make statistical inferences about the
population from which the data were sampled based on a permutation test (Edgington and
Onghena, 2007). This restriction might appear to limit the usefulness of permutation tests, but
parametric tests are only free of this restriction under the assumption of random sampling, and
true random sampling is difficult or impossible to achieve under most circumstances.
Having reviewed the relevant basics of permutation testing, the next section describes a
specific research question and then describes how two different permutation tests would be
applied to that question.
The mismatch negativity
The mismatch negativity (MMN) (Näätänen et al., 1978; Näätänen and Alho, 1995) is a
neural response to a stimulus that deviates in some way from an established regularity (Garrido et
al., 2009; May and Tiitinen, 2010). The MMN is typically elicited in an oddball paradigm in
which regularity is established by repetitive presentation of one stimulus (the standard) with rare
presentations of another stimulus (the deviant) that deviates from the regularity. The MMN is
then isolated by subtracting the ERP evoked by the standard from the ERP evoked by the
deviant. This subtraction method is only valid when it is known that the differences in the ERP
due to differences between the standard and the deviant are trivial compared to the MMN. As
this assumption cannot usually be made, a slightly different method is required (Horváth et al.,
2008; Ponton et al., 2009). In order to isolate the MMN, a longer experiment with two kinds of
blocks is used. In one block, one stimulus (stimulus A) is the standard while another stimulus
(stimulus B) is the deviant. In the other block, their roles are reversed and stimulus B is the
standard and stimulus A is the deviant. Then, two MMNs can be isolated by subtracting the
stimulus-as-standard from the stimulus-as-deviant ERP waveforms (i.e. stimulus A-as-deviant
minus stimulus A-as-standard).
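Assuming per-condition average ERPs are available as channels-by-samples arrays, the two-block subtraction can be sketched as follows. The arrays here are random placeholders with hypothetical names; the point is only the structure of the contrast.

```python
import numpy as np

rng = np.random.default_rng(4)
erp_A_as_standard = rng.standard_normal((64, 440))
erp_A_as_deviant = rng.standard_normal((64, 440))
erp_B_as_standard = rng.standard_normal((64, 440))
erp_B_as_deviant = rng.standard_normal((64, 440))

# Naive difference (deviant B minus standard A) confounds the MMN with
# stimulus-specific ERP differences:
naive_diff = erp_B_as_deviant - erp_A_as_standard

# Same-stimulus differences isolate the deviance response:
mmn_A = erp_A_as_deviant - erp_A_as_standard
mmn_B = erp_B_as_deviant - erp_B_as_standard
```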
When investigating the MMN for a particular stimulus manipulation that has been
thoroughly studied previously, the MMN may be measured by examining the ERP at a particular
EEG electrode location (Näätänen et al., 2007). When the cortical generator of a MMN is
unknown for a stimulus manipulation, the GFP (see above) can be used (Ponton et al., 2009) in
place of a particular electrode. In order to assess the statistical reliability of the MMN as
measured using GFP, a statistical test is needed that assesses whether the GFP elicited by the
stimulus-as-deviant is reliably different from the GFP elicited by the stimulus-as-standard.
To summarize, the data for a MMN experiment consist of recordings from some number
of subjects. Each subject’s recordings consist of some number of sweeps or trials in which the
stimulus-as-standard was presented and some number of sweeps in which the stimulus-as-deviant
was presented. Importantly, the stimulus-as-deviant is presented rarely, so the number of
stimulus-as-standard sweeps is large relative to the number of stimulus-as-deviant sweeps. The
goal is to assess the statistical reliability of the MMN. Next, two permutation tests are introduced
that attempt to achieve this goal.
The Standard Paired-samples Permutation Test
Standard paired-samples permutation testing in a two-condition experiment uses a
subject’s condition blocks as experimental units. For example, if an experiment is broken into
two sessions, with each subject randomly assigned to Condition A in the first session and
Condition B in the second session or vice versa, then the experimental unit is the subject’s
session. The summary statistic is the mean difference between the results under Condition A and
the results under condition B, denoted as D. The null distribution is calculated by permuting the
condition labels on the obtained data within the subjects. Because there are only two possible
mappings between conditions and sessions (i.e. A first or B first) for each subject, the number of
possible permutations is simply the number of unique ways in which these two mappings can be
combined. This number is calculated as 2^N, where N is the number of subjects.
As a technical aside, when the measure of interest is D, then the permutation distribution
is symmetrical about zero since x – y = – (y – x). Therefore only half of the permutations need to
actually be computed, and the full permutation distribution can be obtained by multiplying the
first half by negative one.
When N is relatively small, the full permutation distribution can be calculated in a
reasonable amount of time. If N becomes large, then it is not feasible to calculate the full
permutation distribution. Under these circumstances, a randomly-selected subset of possible
permutations can be calculated instead. This has the potential to slightly lower the power of
permutation tests, but does not raise the false positive rate (Edgington and Onghena, 2007).
A p-value can be obtained from this test as the proportion of entries in the permutation
distribution that are as extreme as or more extreme than the actually-obtained data. Note that the
minimum possible p-value is 2/(2^N) because both the actually-obtained data and its opposite are
entries in the permutation distribution, and both are (by definition) as extreme as the actually-
obtained data.
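As a concrete sketch of these mechanics, the following Python snippet runs the standard paired-samples permutation test on per-subject condition measures. The data are random placeholders; because relabeling a subject's two sessions flips the sign of that subject's difference, the full 2^N null distribution is enumerated as sign patterns, as in the technical aside above.

```python
import numpy as np
from itertools import product

def paired_permutation_test(a, b):
    """Exact paired-samples permutation test on the mean difference D.

    a, b : per-subject scalar measures for the two conditions (length N).
    Relabeling a subject's conditions flips the sign of that subject's
    difference, so the 2**N relabelings are all sign patterns.
    """
    d = np.asarray(a) - np.asarray(b)
    observed = d.mean()
    null = np.array([(np.array(signs) * d).mean()
                     for signs in product([1, -1], repeat=len(d))])
    return np.mean(np.abs(null) >= abs(observed))   # two-sided p-value

# Placeholder data: 11 subjects, e.g. GFP_dev and GFP_std averaged over a window.
rng = np.random.default_rng(5)
gfp_dev = rng.standard_normal(11) + 0.5
gfp_std = rng.standard_normal(11)
print(paired_permutation_test(gfp_dev, gfp_std))
```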
When assessing the MMN using GFP, one possible procedure would be to calculate for
each subject the GFP of the ERP elicited by the standard (GFP_std) and the GFP of the ERP elicited by the deviant (GFP_dev) and take as the overall measure the mean of the difference between GFP_dev and GFP_std. A permutation distribution of this overall measure could be obtained as described above, permuting the labels of GFP_dev and GFP_std within the subjects.³
A difficulty with this approach is that the experimental unit is difficult to define. As
reviewed above, an experimental unit is a chunk of the experiment to which an experimental
condition may be (randomly) assigned. “All the trials in which a standard was presented” is not
an experimental unit because there is no random assignment under which all of those trials might have been deviants: there are simply many more trials in which a standard was presented than trials in which a deviant was presented. Furthermore, as described above and demonstrated in Simulations 1-3, the measure used here, GFP, increases as SNR
decreases, so comparing GFP from a relatively small number of trials against GFP from a
relatively large number of trials could systematically bias the outcome measure. Whether these
difficulties present problems only in theory, or if they present practical problems under realistic
conditions is here addressed with simulations.
Simulation 4, below, demonstrates that the standard paired-samples permutation test has
a false-positive rate much higher than the nominal value under conditions that attempt to simulate
a MMN paradigm. The false-positive rate approaches the nominal value as the number of deviant
sweeps per subject increases, but this increase is dramatically slowed if the number of standard
sweeps per subject also increases.
³ Actually, this procedure would be applied at each time sample, yielding a null distribution at each time sample and a corresponding p-value for each time sample. Some form of correction for multiple comparisons would be required in order to correctly interpret these p-values. For discussions of correction for multiple comparisons in the context of permutation testing, see Blair RC, Karniski W (1993) An alternative method for significance testing of waveform difference potentials. Psychophysiology 30:518-524; Yoder PJ, Blackford JU, Waller NG, Kim G (2004) Enhancing power while controlling family-wise error: an illustration of the issues using electrocortical studies. J Clin Exp Neuropsychol 26:320-331; and Appendix B of this dissertation.
The Unbalanced Paired-samples Permutation Test
To address the shortcomings of the standard paired-samples permutation test as applied to
GFP data from a MMN experiment, a modified permutation test was developed for unbalanced
GFP data. The mechanics of the test are described here, and the validity of the test is
demonstrated in Simulation 5.
In the unbalanced paired-samples permutation test, the experimental unit is the single
trial. The single trial fits the definition of an experimental unit because any given trial could be a
standard or a deviant assuming the number of standard trials between deviant trials varies
randomly over some range as they do in typical MMN paradigms and specifically in the paradigm
used in Chapter 3. Using this experimental unit, the null hypothesis is that there is no difference
between trials that are labeled standard and trials that are labeled deviant. The logic of this test is
that under the null hypothesis, the actually-obtained difference in group-mean GFP would be no
different if the labels of the single sweeps were permuted. The group level measure is identical to
that used in the standard paired-samples permutation test: the average of the difference between
GFP
std
and GFP
dev
. Where the unbalanced test differs is in how the permutation distribution is
generated and how the p-value is determined.
If the null hypothesis is true, then rather than having N subjects, each having j standard
trials and k deviant trials, what we really have for each of the N subjects is j + k trials that are
essentially the same (modulo random noise) and j of them have been arbitrarily labeled standard
trials. To generate a permutation sample, then, for each subject, the collected trials’ labels are
randomly permuted, such that there are still j trials labeled standard and k trials labeled deviant.
These permuted trial data are averaged for each subject and the GFP of this average is taken to
create N GFP_std(permuted) and N GFP_dev(permuted) waveforms. The group-mean of the difference in these permutation GFPs is entered as a member of the permutation distribution. The number of entries in a permutation distribution that included all possible ways of relabeling the data is given by the expression ((j + k)! / (j! k!))^N. For example, even with unreasonably small numbers, such as j = 8, k = 4, and N = 6, the number of possible relabelings is over 1.4 × 10^16. Because
there are potentially so many possible ways of relabeling the data, systematically producing all of
them is not feasible. Therefore, a random subset of all possible relabelings is selected. As with
the standard paired-samples permutation test, using less than all possible relabelings reduces the
power of the test slightly, but this problem can be ameliorated by increasing the size of the
random subset (Edgington and Onghena, 2007).
Unlike the null distribution for the standard paired-samples permutation test, the
permutation distribution in this test is not necessarily symmetric around zero. This is
demonstrated below in Simulation 5, but can also be understood as a consequence of the
resampling scheme. In the standard paired-samples null distribution, for any given entry, the
mathematical opposite of that entry should also be in the null distribution. Using the resampling
scheme here, there is no such expectation. Even in the enormous full permutation distribution,
there is no expectation of symmetry because the magnitude of the GFP increases as the number of
trials decreases and there are always far fewer trials labeled as deviant than are labeled as
standard.
As with any permutation test, the p-value is calculated as the proportion of entries in the
null distribution that are as extreme as or more extreme than the actually-obtained value. Because
the null distribution is not symmetrical around zero, the definition of extreme cannot be relative
to zero, but must be relative to the null distribution. To accomplish this, two proportions are
calculated: First, the proportion of the null distribution that is greater than or equal to the actually-
observed value is calculated. Next, the proportion of the null distribution that is less than or equal
to the actually-observed value is calculated. The p-value is then two times the value of whichever
of these two proportions is smaller.
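A self-contained sketch of the unbalanced test is given below. It follows the description above (within-subject relabeling of single trials preserving the j/k counts, subject-level GFPs of the relabeled averages, and a two-sided p-value taken relative to a possibly asymmetric null distribution), but the data shapes, trial counts, and function names are illustrative, a random subset of relabelings is drawn, and the p-value is capped at 1 for interpretability.

```python
import numpy as np

rng = np.random.default_rng(6)

def gfp(v):
    v = v - v.mean(axis=0, keepdims=True)
    return np.sqrt(np.mean(v ** 2, axis=0))

def subject_effect(trials, is_dev):
    """GFP of the deviant-labeled average minus GFP of the standard-labeled average."""
    return gfp(trials[is_dev].mean(axis=0)) - gfp(trials[~is_dev].mean(axis=0))

def unbalanced_permutation_test(subject_trials, subject_labels, n_perm=1999):
    """subject_trials: list of (n_trials_i, n_elec, n_samp) arrays.
    subject_labels: list of boolean arrays, True where a trial is a deviant."""
    observed = np.mean([subject_effect(tr, lab)
                        for tr, lab in zip(subject_trials, subject_labels)], axis=0)
    null = np.empty((n_perm,) + observed.shape)
    for p in range(n_perm):
        effects = []
        for tr, lab in zip(subject_trials, subject_labels):
            perm = rng.permutation(lab)          # same j/k counts, labels shuffled
            effects.append(subject_effect(tr, perm))
        null[p] = np.mean(effects, axis=0)
    # Two-sided p-value relative to an asymmetric null distribution.
    p_hi = np.mean(null >= observed, axis=0)
    p_lo = np.mean(null <= observed, axis=0)
    return np.minimum(2 * np.minimum(p_hi, p_lo), 1.0)

# Toy data: 11 subjects, 200 standards and 50 deviants each, identical stimuli,
# with small array sizes so the example runs quickly.
subject_trials, subject_labels = [], []
for _ in range(11):
    trials = rng.standard_normal((250, 32, 50))
    labels = np.zeros(250, dtype=bool)
    labels[rng.choice(250, size=50, replace=False)] = True
    subject_trials.append(trials)
    subject_labels.append(labels)

p_values = unbalanced_permutation_test(subject_trials, subject_labels, n_perm=199)
```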
Simulation 5, below, shows that this unbalanced paired-samples permutation test is valid
regardless of the number of sweeps used or the degree of imbalance between them. A minor
drawback to this method is that it cannot practically be used under the requirement that all
permutations be systematically generated. A major drawback to this approach compared to the
standard approach is that the unbalanced paired-samples permutation test is extremely
computationally intensive. As computer memory size and processing speed increase, this
drawback is reduced.
Permutation Simulations
Simulation 4 examines the performance of the standard paired-samples permutation test
on GFP constructed from unequal numbers of trials. Simulation 5 examines the performance of
the unbalanced paired-samples permutation test on GFP constructed from unequal numbers of
trials. The general approach for evaluating these statistical methods was to create a data set in
which the null hypothesis was true. Specifically, data sets were created in which there was no
difference between the data labeled standard and the data labeled deviant except for noise in the
recordings. When the null hypothesis is known to be true, then any positive findings of a
statistical test (i.e. rejections of the null hypothesis) are false positives. A valid statistical test will
have a nominal false positive rate equal to α, the criterion for rejection.
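The logic of this check can be sketched generically: build data for which the null hypothesis holds by construction, run the test many times, and compare the rejection rate with α. In the sketch below the test plugged in is a simple sign-flip permutation test on pure-noise differences, used only to illustrate the harness; the function names and sizes are assumptions, and any of the tests described in this appendix could be substituted.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(7)

def toy_paired_test(d):
    """Exact sign-flip permutation test on the mean of per-subject differences d."""
    observed = d.mean()
    null = np.array([(np.array(s) * d).mean()
                     for s in product([1, -1], repeat=len(d))])
    return np.mean(np.abs(null) >= abs(observed))

def empirical_false_positive_rate(n_experiments=250, n_subjects=11, alpha=0.05):
    # The "conditions" are identical by construction, so every rejection is a
    # false positive; a valid test should reject about alpha of the time.
    p_values = [toy_paired_test(rng.standard_normal(n_subjects))
                for _ in range(n_experiments)]
    return np.mean(np.array(p_values) < alpha)

print(empirical_false_positive_rate())   # expected to be near 0.05
```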
Permutation test simulations used EEG data. The data used here were collected from 11
subjects participating in a visual speech mismatch negativity experiment. Only data from one
syllable, /ZA2/-as-standard, are used here. The data have been filtered, epoched from 100 ms
before to 1000 ms after stimulus onset, baseline corrected over the first 200 ms of each epoch,
artifact rejected and projected onto a common set of electrode locations. See Chapter 3 for the
full details of how the EEG data were recorded and processed.
Table A.1. Number of sweeps per subject in the three permutation test simulations.

                  minority   majority
Fixed Majority         50        300
                      100        300
                      150        300
                      200        300
                      250        300
                      300        300
Half Ratio             25         50
                       50        100
                      100        200
                      200        400
Quarter Ratio          25        100
                       50        200
                       75        300
                      100        400
Simulation 4: Standard paired-samples permutation testing
To examine the validity of applying the standard paired-samples permutation test to data
consisting of GFP measured over unbalanced numbers of sweeps, the approach was to create a
dataset in which the null hypothesis was true. Specifically, the same EEG data used in
Simulation 3 was pseudorandomly partitioned into two sets of trials per subject. One set was
called the minority set and the other set was called the majority set. The number of sweeps going
into the two sets differed, but it must be emphasized that every trial used in this simulation was
generated using an identical stimulus and under identical experimental conditions.
After partitioning the data into minority and majority sets for each subject, the standard
paired-samples permutation test was run as described above. Because the number of subjects was
11, the number of possible permutations is 2^11 = 2048, so the full permutation distribution could
be systematically generated. As noted above, using the full
permutation distribution maximizes the power without changing
the validity of the test. Under this testing scheme, because the
null hypothesis is true, any p-value less than α (following
convention, here α = .05) is a false positive, and the number of
false positives divided by the number of tests conducted is the
false positive rate. A valid statistical test is expected to have a
false positive rate of α for any α between 0 and 1.
Table A.1 shows the number of sweeps going into the
majority and minority data sets. These numbers were selected to
investigate how the false positive rate of the standard paired-
samples permutation test changed with increasing numbers of
trials in the minority data set under three basic conditions. The
first condition is with the number of trials in the majority set fixed at 300. The second condition
is with the number of trials in the majority set fixed to be twice the number of trials in the minority set, and the third condition is with the number of trials in the majority set fixed to be four times the number of trials in the minority set.
Selecting these numbers of trials to create minority and majority datasets and then
carrying out permutation testing was repeated 250 times per pair. The results are presented
separately for the three conditions in terms of false positive rates both at each time sample and
collapsed over all time samples.
Fixed-Majority. In the fixed-majority condition, the number of sweeps in the majority
set was fixed at 300, and the number of sweeps in the minority set varied between 50 and 300.
Figure A.8. False positive rate in the fixed-majority simulation. When the number of sweeps in the majority data set is fixed at 300, the false positive rate decreases as the number of sweeps in the minority data set increases. The values are not uniform over time, with larger false positive rates occurring at the times of the minima in the GFP (see Figure A.6), around 100 and 1000 ms.
Figure A.8 shows the false positive rate with α = .05 averaged over the 250 repetitions at each
time sample. Figure A.9 shows the distribution of false positive rate for each of the 250
repetitions taken over the full time window. Both demonstrate that with the number of sweeps in
the majority data set fixed at 300, the false positive rate is very high (> .40) when the number of
sweeps in the minority is 100, but the false positive rate approaches .05 as the number of sweeps
in the minority data set approaches 300.
Figure A.9. False positive rate over time in the fixed-majority simulation. Boxplots show the distribution of false positive rates computed over the entire time window for each of 250 repetitions in the fixed-majority condition. Although variability over simulations is high, the nominal .05 false positive rate is not included in the distribution until the number of sweeps exceeds 150, but the median is very close to .05 for both the 250 and 300 conditions.

Half-ratio. In the half-ratio condition, the number of sweeps in the majority set was always twice the number in the minority set. Figure A.10 shows the false positive rate with α = .05 averaged over the 250 repetitions at each time sample. Figure A.11 shows the distribution of false positive rate for each of the 250 repetitions taken over the full time window. As in the fixed-majority condition, the false positive rate decreases as the number of sweeps in the minority set increases. In contrast to the fixed-majority condition, in the half-ratio condition even the largest minority set tested still had a false positive rate (median = .172, mean = .178, std = .065) substantially higher than the nominal rate of .05, t(249) = 30.8, p < .0001.

Figure A.10. False positive rate in the half-ratio simulation. When the ratio of the number of sweeps in the minority set to the number of sweeps in the majority set is fixed to be one half, the false positive rate decreases as the number of sweeps increases and is non-uniform over time.

Figure A.11. False positive rate over time in the half-ratio simulation. Boxplots show the distribution of false positive rates computed over the entire time window for each of 250 repetitions in the half-ratio condition. Although variability over simulations is high, the nominal .05 false positive rate is not included in the distribution until the number of sweeps is 200, but the median is still over .15.
Quarter-ratio. In the quarter-ratio condition, the number of sweeps in the majority set was always four times the number in the minority set. Figure A.12 shows the false positive rate with α = .05 averaged over the 250 repetitions at each time sample. Figure A.13 shows the distribution of false positive rate for each of the 250 repetitions taken over the full time window. The false positive rate decreases as the number of sweeps in the minority increases, but the false positive rate remains high for the largest number of sweeps in the minority set tested.

Figure A.12. False positive rate in the quarter-ratio simulation. When the ratio of the number of sweeps in the minority set to the number of sweeps in the majority set is fixed to be one quarter, the false positive rate decreases as the number of sweeps increases and is non-uniform over time.

Figure A.13. False positive rate over time in the quarter-ratio simulation. Boxplots show the distribution of false positive rates computed over the entire time window for each of 250 repetitions in the quarter-ratio condition. Although variability over simulations is high, the nominal .05 false positive rate is not included in the distribution for any of the numbers of sweeps tested.
Comparison of Conditions. The median false positive rate for each of the three
conditions at varying numbers of sweeps in the minority set is shown in Figure A.14. All three
conditions included simulations using 50 and 100 sweeps in the minority set. Examination of
these data points further illustrates that as the ratio of minority sweeps to majority sweeps gets
smaller, the false positive rate increases.
Figure A.14. Comparison of results across minority/majority conditions using standard paired-comparisons permutation testing. A shows the median experiment-wise false positive rate over 250 simulated experiments with different numbers of trials in the minority set. The three conditions differ in the number of trials in the majority set. All three conditions used 50 and 100 trials in the minority set. Boxplots showing the distributions of experiment-wise false positive rates versus the ratio of trials in the minority set to trials in the majority set are shown for 50 and 100 trials in the minority set in B and C, respectively.

Simulation 5: Unbalanced paired-samples permutation test

In order to examine how the unbalanced paired-samples permutation test performs, it was applied to data sets generated as in Simulation 4. Conditions tested were also the same as in Simulation 4: fixed-majority, half-ratio and quarter-ratio (Table A.1). As mentioned above, the computation time for the unbalanced paired-samples permutation test is long compared to the
standard paired-samples permutation test. In order to save computation time, a single null
distribution with 1999 resamplings was computed for each condition and 250 examples of ‘real’
data were generated and p-values were calculated as described above using the single null
distribution. This is equivalent to repeating the unbalanced paired-samples permutation test 250
times except that instead of creating a new null distribution each time, the same null distribution
was used each time.⁴

⁴ This short-cut was taken to reduce computation time. As described, this computation took approximately 10 hours, with the majority of that time devoted to generating the single permutation distribution.

Figure A.15. False positive rate in the unbalanced paired-samples test in the fixed-majority condition. The unbalanced paired-samples permutation test false positive rate remains unchanged over time samples in the fixed-majority condition.

Results for the fixed-majority (Figures A.15 & A.16), half-ratio, and quarter-ratio conditions all show false positive proportions close to the nominal value of .05. Figures are not
presented for the half-ratio and quarter-ratio conditions because the results are uniform across
conditions. No systematic effects of number of trials or ratio of minority to majority are apparent.
Conclusion
Simulations 1-3 showed that decreasing the number of trials contributing to a GFP
measure tends to increase that GFP measure. To see if this bias has practical implications when
applying permutation testing to GFP measures, Simulations 4 and 5 examined how two different
permutation testing approaches perform when comparing GFP measures generated from unequal
numbers of trials. The standard paired-samples permutation test showed high false positive rates
when the number of trials was unequal. In contrast, the unbalanced paired-samples permutation
test showed false positive rates close to the nominal value.
Figure A.16. False positive rate over time in the unbalanced paired samples test in the fixed-majority condition. Boxplots of false positive rate distribution of the unbalanced paired-samples permutation test in the fixed-majority condition. Higher-than-nominal (.05) false positive rates occur, but the median is close to the nominal value.
Two trends are apparent for the standard paired-samples permutation test: First, as the
number of trials in the minority set increases, the false positive rate decreases. The second trend
indicates that the rate of this decrease is reduced as the ratio of minority to majority trials
decreases. These inferences are drawn over relatively small ranges. As the ratio gets small (e.g.
< 1/10) but the number of trials in the minority set increases (e.g. > 300), the false positive rate
would likely still be inflated, but the extent to which this inflation might have practical
implications would need to be demonstrated before the standard paired-samples permutation test
could safely be applied.
The unbalanced paired-samples permutation test appears to have a nominal false positive
rate regardless of the number of trials used or the ratio of minority trials to majority trials.
Therefore, it is valid under the conditions simulated. Given the invariance of the false positive
rate to number of trials and their ratio, the unbalanced paired-samples permutation test would be
expected to generalize to any number of trials.
Using GFP as a measure of neural activity has many advantages over measures using single electrodes; however, GFP is biased by the number of trials. When GFP measures are analyzed with the unbalanced paired-samples permutation test, this bias is accounted for and valid statistical inferences can be drawn.
Appendix B: Multiple Comparisons Correction
This appendix briefly introduces the multiple comparisons problem and describes some
solutions to this problem that are in common use, including the cluster-based approach used in
Chapter 3 of this dissertation. Simulations are described that demonstrate the suitability of the
cluster-based approach for effects of the type expected in a visual mismatch negativity
experiment.
The multiple comparisons problem is shorthand for the increase in the probability of
incorrectly rejecting a null hypothesis (committing a Type I error) as the number of null
hypotheses tested increases. Statistical tests have some chance to produce p-values at or below
the threshold for significance even when the underlying null hypothesis is true, resulting in a false
alarm. For a valid statistical test, the expected value of the proportion of tests to produce a false
alarm is equal to the threshold for significance. In other words, since each test has a chance to
produce a false alarm, as the number of tests increases, the chance of one or more false alarms
also increases. Specifically, the chance of getting one or more false alarms in a family of N tests in which the null hypothesis is always true is equal to 1 − (1 − α)^N, where α is the single-test significance threshold. The chance of committing one or more Type I errors in a family of statistical tests is referred to as the family-wise error rate (FWER). For a significance threshold of .05 and 10 tests, the expected FWER is approximately .4, or 40%. This
inflation in FWER with increases in the number of tests presents a clear problem for situations
such as those arising in Chapter 3 in which statistical tests are performed on each sample of data
recorded at a high sampling rate over a long duration.
Many solutions to the multiple comparisons problem have been proposed. Here I will
briefly describe some of these solutions and comment on their suitability to data of the sort used
in Chapter 3. More details about these procedures can be found in the excellent tutorial review of
Groppe, Urbach, & Kutas (2011).
One straightforward approach to the multiple comparisons problem is the Bonferroni
procedure, in which α is adjusted for each test such that the FWER is .05. This is accomplished
by setting α to .05/N. As N becomes large, the criterion becomes very small. While this provides
strong control over FWER, it can be far too conservative (e.g., Nakagawa, 2004), resulting in
accepting the null hypothesis when an alternative is actually true (committing a Type II error).
This problem is exacerbated when the underlying test statistics are positively correlated, as the
effective number of tests is reduced by the correlation structure of the data, but this effective
reduction is not accounted for by the Bonferroni procedure. As such, the Bonferroni procedure
sacrifices power to achieve strong FWER control. This might be suitable in cases where the costs
of any Type I errors are high and the costs of Type II errors are low, but the Bonferroni procedure
is otherwise too conservative.
An alternative to controlling the FWER is to attempt to control the false discovery rate
(FDR). The FDR is the proportion of the rejected null hypotheses that are actually true. If 100
null hypotheses were rejected and in 5 of those cases the null hypothesis was true (and in the
remaining 95 cases an alternative hypothesis was true) then the FDR would be 5%. This says
nothing about the FWER which depends on the total number of hypotheses tested rather than the
number of null hypotheses rejected. The first procedure for controlling FDR was introduced by
Benjamini and Hochberg (1995; referred to hereafter as B&H FDR), but other procedures have
also been developed with each having slightly different strengths and weaknesses (Groppe et al.,
2011). A weakness of two commonly used FDR methods (Benjamini and Hochberg, 1995; Benjamini and Yekutieli, 2001) is that their power and the variability of their FDR vary with the proportion of null hypotheses that are actually false (Hemmelmann et al., 2005).
Because the constraint of maintaining a low FWER is relaxed, FDR methods in general have fairly good power compared to the Bonferroni procedure.
The two previous methods for solving the multiple comparisons problem rely on
adjusting the critical p-value for accepting or rejecting a null hypothesis. The following two
methods use permutations similar to those used in permutation testing (See Appendix A) to
determine the distribution of a family-wide test statistic under the null hypothesis. The first of
these methods is called Tmax (Blair and Karniski, 1993; Nichols and Holmes, 2002; Groppe et al., 2011). The Tmax procedure starts with a multivariate dataset composed of some number of entries
(generated by different subjects, for example) split across some number of conditions with each
entry having the same number of variables or observations. For example, in Chapter 3 when
comparing the mismatch response in the far context to the mismatch response in the near context,
the dataset is mean GFP traces from 11 subjects in two conditions. Each entry consists of 1101
samples (at 1000 Hz). The Tmax procedure begins by performing a permutation test, but instead of permuting each sample, permutations are done at the level of a full entry. This keeps adjacent samples in a particular entry together, regardless of how the condition label is changed. Thus far, the Tmax procedure is identical to an uncorrected permutation test except that the Tmax procedure
requires some extra bookkeeping, but here the two procedures diverge. In an uncorrected
permutation test, the actual statistic at each sample would be compared to the permutation
distribution of statistics generated at that sample. In contrast, in the Tmax procedure, the actual statistic at each sample is compared to the permutation distribution of maximal test statistics, which is constructed in the following way: in each permutation, the maximal statistic, regardless of sample, is added to a single distribution of maximal test statistics. If the actual test statistic at a sample exceeds the 100(1 − α)th percentile of the distribution of maximal test statistics, then that sample can be considered significantly different at a level of α, corrected for all comparisons performed.
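As a rough illustration of the Tmax logic just described, the sketch below implements it for a paired, two-condition design (a Python/NumPy sketch, not the code used for the analyses in Chapter 3; the function and variable names are hypothetical). Condition labels are permuted at the level of whole entries by flipping the sign of a subject's entire difference trace, and corrected p-values are read off the single distribution of maximal statistics.

```python
import numpy as np

def tmax_correction(cond_a, cond_b, n_perm=2000, rng=None):
    """Tmax-corrected p-values for a paired two-condition design.

    cond_a, cond_b : arrays of shape (n_subjects, n_samples)
    Returns one corrected p-value per sample.
    """
    rng = np.random.default_rng() if rng is None else rng
    diffs = cond_a - cond_b                      # paired differences, one row per subject
    n_subj = diffs.shape[0]

    def paired_t(d):                             # paired t-statistic at every sample
        return d.mean(axis=0) / (d.std(axis=0, ddof=1) / np.sqrt(n_subj))

    t_obs = np.abs(paired_t(diffs))

    # Null distribution of the maximal |t| over samples: permuting condition labels
    # within a subject is equivalent to flipping the sign of that subject's whole
    # difference trace (entry-level permutation).
    t_max_null = np.empty(n_perm)
    for i in range(n_perm):
        signs = rng.choice([-1.0, 1.0], size=(n_subj, 1))
        t_max_null[i] = np.abs(paired_t(diffs * signs)).max()

    # Corrected p-value: fraction of maximal statistics >= the observed statistic.
    return (t_max_null[None, :] >= t_obs[:, None]).mean(axis=1)
```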
The Tmax procedure has benefits and drawbacks. The Tmax procedure guarantees strong
control of FWER, meaning that regardless of the number of true and false null hypotheses, the
FWER will be at or below α. An additional benefit is that it provides interpretable corrected p-
values at each sample. One drawback to Tmax is that it might be overly conservative, with critical values higher than the Bonferroni criterion but sometimes lower than what would be required using FDR procedures (Groppe et al., 2011). The Tmax procedure is best for detecting large
differences that may occur over a small number of variables or samples. When the effect of
interest is likely to be spread out over many samples, then another method might be better.
An alternative approach to the Tmax procedure for correcting for multiple comparisons
that still uses permutation testing is the cluster-based approach. In cluster-based approaches,
some definition of a cluster is adopted as is a method for determining a cluster size. For time-
sequence data, a common definition of a cluster is a series of consecutive samples at which an
uncorrected statistical test would reject the null hypothesis. The cluster size is then defined as
either the number of consecutive samples or as the sum of the absolute value of the test statistic at
each sample (called T|sum|). First, clusters are identified in the actual data and then the cluster
sizes of these clusters are compared to a permutation distribution of cluster sizes under the null
hypothesis. As with Tmax, in the cluster-based approach permutations are done on entire entries
rather than on samples within entries, and the largest cluster (using the same definition of size)
from each permutation is added to the permutation distribution of cluster sizes. Clusters
identified in the actual data are compared to clusters in the permutation distribution, and those
clusters whose size exceeds the 100(1 − α)th percentile of the permutation distribution are accepted as statistically significant at a level of α, corrected for multiple comparisons.
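The following sketch illustrates this cluster-based logic for the consecutive-samples definition of cluster size (again Python/NumPy rather than the software actually used; the helper names are illustrative). It assumes that uncorrected p-values have already been computed for the actual data and for each entry-level permutation.

```python
import numpy as np

def run_lengths(sig_mask):
    """Start indices and lengths of runs of consecutive True values."""
    runs, start = [], None
    for i, sig in enumerate(np.append(sig_mask, False)):  # sentinel closes a trailing run
        if sig and start is None:
            start = i
        elif not sig and start is not None:
            runs.append((start, i - start))
            start = None
    return runs

def cluster_correction(p_obs, p_perm, alpha=0.05):
    """Consecutive-samples cluster test.

    p_obs  : uncorrected p-values for the actual data, shape (n_samples,)
    p_perm : uncorrected p-values for each permutation, shape (n_perm, n_samples)
    Returns the observed clusters (start, length) judged significant at level alpha.
    """
    # Null distribution: the longest run of p < alpha within each permutation entry.
    max_null_runs = np.array(
        [max((length for _, length in run_lengths(p < alpha)), default=0) for p in p_perm]
    )
    threshold = np.quantile(max_null_runs, 1 - alpha)   # e.g. the 95th percentile of run lengths
    # Observed clusters longer than the threshold are accepted as significant.
    return [(start, length) for start, length in run_lengths(p_obs < alpha)
            if length > threshold]
```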
The chosen definition of a cluster and measure of cluster size can have an impact on the
kinds of effects cluster-based methods will find. The T|sum| definition is more likely to find
moderately large, sustained effects and brief but very large effects, while clusters measured by the
number of consecutive samples will focus fairly exclusively on effects smeared out over time.
Cluster-based methods in general have good power and are less conservative than the Tmax approach. A drawback of cluster-based methods is that they offer weak control over FWER.
This means that when all null hypotheses are true, the expected value of FWER is less than or
equal to α. When some null hypotheses are false, however, then the probability of finding a false
positive in addition to a true positive may increase (Groppe et al., 2011). An additional drawback
to cluster-based methods is that they provide less interpretable detail: They do not provide
corrected p-values at each sample, and rejection of the null hypothesis occurs at the level of
clusters. This means that while a cluster might be accepted as statistically significant, no
specific statements about the individual samples in a cluster can be made.
Justification of the approach used in Chapter 3
The approach used in Chapter 3 for correcting for multiple comparisons is a cluster-based
permutation method using consecutive significant results as a cluster size definition. As stated
above, this approach is suitable for effects that are likely to be smeared out in time, but less so for effects that are brief, even if they are large. This approach fits the expected effects of the experiment in
Chapter 3 well for two main reasons. The first reason is that the expected effect is a visual
mismatch response which is expected to be a small, long-lasting response. The second reason is
that individual variability in response is expected to smear the effect over time. Both of these
reasons will be explained in further detail below.
The visual mismatch response is small and slow. The expectation that a visual
speech mismatch response would be small and slow is motivated by three lines of evidence. The
first is analogy with the classical auditory MMN which is typically small and slow. The second is
from results in a variety of visual mismatch responses. The third is from a previous experiment
that sought a visual speech mismatch response.
The classical auditory MMN is a relatively long-lasting ERP effect (Näätänen et al.,
1978; Picton et al., 2000a; Näätänen et al., 2007) and is generally expected to be of low amplitude
relative to recording noise (Ponton et al., 1997; Picton et al., 2000a; Näätänen et al., 2007). This
leads to the expectation that experimental effects that induce an MMN will best be detected by a
statistical method that is sensitive to small but long-lasting effects. These expectations come
from auditory MMN results, but appear to hold in the context of visual mismatch responses. The
existence of a visual homologue of the auditory MMN has been repeatedly demonstrated (Pazo-
Alvarez et al., 2003; Czigler, 2007; Kimura et al., 2011). Like the auditory MMN, the visual
mismatch response is typically small and long-lasting (Pazo-Alvarez et al., 2003). In particular, a
visual mismatch response was elicited by visible speech mismatch (Ponton et al., 2009) that
appears to be long-lasting and of small amplitude. The duration and amplitude of this mismatch
response were not evaluated statistically, but descriptive results show a mismatch response lasting
approximately 80 ms and with peak amplitude on the order of 0.5 µV.
These three lines of evidence suggest that a visible speech mismatch response like that
sought in Chapter 3 should be long-lasting and of small amplitude. Statistical methods that rely
on single samples with large effect sizes are likely to miss effects that are slow but last many
samples. The FDR methods and the Tmax method described above fall under this description, and
so were considered unlikely to be sensitive to the kinds of effects expected in Chapter 3. The
cluster-based methods described above would be expected to be sensitive to this kind of effect,
and particularly the consecutive-samples method is best suited to effects that are small at any
given sample.
Individual variability temporally smears expected effects. Aside from the size
and duration of the expected visible speech mismatch response, variability amongst individual
subjects could lead to smearing of the effect at the group level. Ongoing accumulation of
evidence of mismatch has been implicated in the auditory MMN for complex, time-varying
stimuli (Picton et al., 2000a), so individual differences in the rate of evidence accumulation, or
individual differences in sensitivity to evidence of dissimilarity could lead to differences in
latency, amplitude and duration of the individual visual speech mismatch response. As
individuals are known to vary dramatically in general lipreading ability (Auer and Bernstein,
2007) and in visible syllable discrimination specifically (Chapter 3), this variability is expected to
impact the visual mismatch response results in Chapter 3. This temporal smearing of the
mismatch response at the group level further makes Tmax and FDR methods less preferable,
because temporal smearing will likely eliminate sharp peaks with large effects, thereby reducing
the sensitivity of these methods.
Simulations
Based on the known properties of the methods for correcting for multiple comparisons
outlined above and the expected properties of the visual speech mismatch response, a cluster-based method using consecutive significant results as a size metric should be the most appropriate
method for assessing the statistical reliability of a visual speech mismatch response. In order to
confirm this expectation, simulations were run comparing the performance of Bonferroni, B&H
FDR (Benjamini and Hochberg, 1995), Tmax, and a cluster-based method of correcting for multiple
comparisons. The design decisions are outlined and justified below; full details of the
simulations appear in the methods section of this appendix.
The data used for simulation were samples drawn from normal distributions and then
smoothed to add temporal correlation. This might not be a realistic representation of EEG data,
but at issue here is multiple comparisons correction, so the particulars of the simulated data are
not too important. To simulate a multi-subject, repeated-measures two condition experiment, each
simulation consisted of 11 simulated single-subject averages in each of two conditions. To see
how multiple comparisons methods dealt with varying proportions of true and false null
hypotheses, a small effect was added to a varying number of consecutive samples in the second
condition. The number of samples to which an effect was added included zero to test for weak
FWER as well as small and large proportions to specifically assess the performance of the B&H
FDR method under its best and worst conditions. The effect was small and the samples were
consecutive in order to approximate a visual mismatch response. Simulation performance was
evaluated on four measures: FWER, FDR, false alarm rate and hit rate. FWER is reported as the
proportion of simulations that yielded any false positives. Where possible, d’ from signal
detection theory (Green and Swets, 1966) was calculated because it takes into account false alarm
and hit rate simultaneously. Because the B&H FDR correction has been criticized for resulting in
potentially high FDR, the probability of obtaining high FDR was assessed for all correction
methods.
Strictly speaking, the cluster-based method used here evaluates clusters of time points
rather than each individual time point, so the measures described above might be somewhat
inappropriate for evaluating the actual characteristics of this method. These measures are
included anyway for two reasons. The first is one of convenience to provide a simple way to
compare performance of the cluster-based method to the point-wise methods. The second is that
despite careful explanation of how a cluster-based result should be interpreted, readers might
interpret cluster-based results as point-wise results, so knowing how this method performs on a
point-wise basis is of interest.
Methods
5,000 simulations were carried out, and in each simulation, a standard paired-samples
permutation test was performed to achieve uncorrected p-values. The corrections for multiple
comparisons were then carried out and assessed. For each simulation, a dataset was constructed
by pseudorandomly drawing from a normal distribution with a mean of 0 and a standard deviation
of 1. Positive dependence between consecutive data points was achieved by smoothing the data
using a moving average with a span of 5 time points. Each dataset consisted of 100 time points
for each of 11 subjects and two conditions. The only difference between the two conditions was
that in some cases an effect was added to the second condition.
The 5,000 simulations were divided equally amongst 5 different proportions of time
samples at which to add an effect. Those proportions were 0, .1, .2, .5 and 1. In each simulation,
a starting time for the effect was pseudorandomly selected such that the proportion of time points
could consecutively have an effect added. The effect was added to the selected time points by
adding 0.6 to the data at that time in condition 2. (This corresponds to a Cohen's d of 0.6, which Cohen (1969, Statistical power analysis for the behavioral sciences, New York: Academic Press) classifies as a medium effect size, but that classification is relative to the research method being employed. An effect of this size should account for approximately 8.3% of the variance in the data.) For an example of the group means for one such simulated dataset, see Figure B.1.

Figure B.1. Example simulated data used for multiple comparisons simulations.
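For concreteness, a minimal sketch of this data-generation scheme is given below (Python/NumPy; the original simulations were run in MATLAB, and the edge handling of the moving average here may differ slightly from the original smoothing). All names are illustrative.

```python
import numpy as np

def simulate_dataset(n_subjects=11, n_samples=100, effect_prop=0.2,
                     effect_size=0.6, span=5, rng=None):
    """One simulated two-condition dataset as described above (a sketch, not the original code)."""
    rng = np.random.default_rng() if rng is None else rng

    # White Gaussian noise (mean 0, SD 1), then a moving average (span 5) to induce
    # positive dependence between consecutive time points.
    noise = rng.standard_normal((2, n_subjects, n_samples))
    kernel = np.ones(span) / span
    data = np.apply_along_axis(lambda x: np.convolve(x, kernel, mode="same"), -1, noise)

    cond1, cond2 = data[0], data[1]

    # Add a fixed effect to a pseudorandomly placed block of consecutive samples
    # in condition 2 only.
    n_effect = int(round(effect_prop * n_samples))
    if n_effect > 0:
        start = rng.integers(0, n_samples - n_effect + 1)
        cond2[:, start:start + n_effect] += effect_size
    return cond1, cond2
```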
On each simulated data set, a standard paired-samples permutation test was performed
(for the details of the mechanics of this test, see Appendix A) resulting in an uncorrected test. On
each test result, four methods correcting for multiple comparisons were applied. Bonferroni
correction was achieved by dividing the critical α of .05 by the number of time points (100)
resulting in a corrected critical α of .0005. B&H FDR (Benjamini and Hochberg, 1995)
correction was applied using the implementation of FDR included with EEGLAB software
(Delorme and Makeig, 2004). In this implementation, the p-values are sorted in ascending order
and the null hypotheses corresponding to those p-values are labeled 1:m, where m is the total
number of time points. Then, the largest i is found such that p(i) ≤ (i/m)·q, where p(i) is the i-th smallest p-value and q is the target false discovery rate of .05, and null hypotheses 1 through i are rejected.
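The step-up rule just described can be sketched as follows (an illustrative NumPy re-expression, not the EEGLAB implementation that was actually used; names are illustrative):

```python
import numpy as np

def bh_fdr_reject(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure (a sketch of the rule described above).

    Returns a boolean array: True where the null hypothesis is rejected.
    """
    p = np.asarray(p_values)
    m = p.size
    order = np.argsort(p)                      # sort p-values in ascending order
    thresholds = q * np.arange(1, m + 1) / m   # (i/m) * q for i = 1..m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])       # largest i with p_(i) <= (i/m) q
        reject[order[: k + 1]] = True          # reject hypotheses 1..i
    return reject
```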
The Tmax procedure was applied by computing a paired t-statistic at each time point in
each permutation re-labelling. For each permutation, the maximal absolute paired t-statistic taken
over time was added to the distribution of maximal paired t-statistics. The absolute paired t-
statistic at each time point for the actual simulated data was compared to this distribution of
maximal paired t-statistics, and a corrected p-value was calculated at each time point as the
proportion of the distribution that was greater than or equal to the paired t-statistic at that time
point.
Finally, the cluster-based method was applied by computing uncorrected p-values for
each entry in the permutation distribution. For each entry in the permutation distribution at a
given time point, the paired t-statistic was re-expressed in terms of its percentile of the entire
distribution. This is equivalent to proceeding as though that permutation entry were the ‘actual’
data entry and computing a p-value as usual (see Appendix A). Once p-values were computed for
each permutation entry, consecutive runs of p-values < .05 were identified and the length of the
longest run in each permutation entry was added to the reference distribution of maximal run
lengths. The lengths of runs in the actual p-values were then compared to the reference
distribution, and any runs that were longer than the 95th percentile of the reference distribution were accepted as reliable.
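The re-expression of each permutation entry as per-sample p-values can be sketched as follows (an illustrative, brute-force Python version, not the code actually used); these p-values then feed the run-length step sketched earlier in this appendix.

```python
import numpy as np

def perm_entry_pvalues(abs_t_perm):
    """P-values for every entry of the permutation distribution, per time point.

    abs_t_perm : absolute paired t-statistics, shape (n_perm, n_samples). Each row is
    treated in turn as if it were the observed data and compared against the full
    distribution (an O(n_perm^2) version, written for clarity rather than speed).
    """
    return np.array([(abs_t_perm >= row).mean(axis=0) for row in abs_t_perm])
```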
Each correction was evaluated in terms of five quantities. False alarm rate was the
proportion of true null hypotheses that were rejected (Type I errors). Hit rate was the proportion
of false null hypotheses that were rejected. To calculate d', the equation d' = Z(hit rate) − Z(false alarm rate) was used, where Z denotes the inverse of the normal cumulative distribution function. To avoid d' values going to positive or negative infinity, rates of 0 were
replaced by .005 and rates of 1 were replaced by .995. FDR was the proportion of rejected null
hypotheses that were true. FWER was computed as the proportion of simulations in which any
false alarms occurred. For each measure, the mean rate was computed and a 95% confidence
interval was constructed using a bias corrected and accelerated percentile bootstrap method
implemented in MATLAB (bootci) using 2000 bootstrap samples. To assess the probability of
obtaining higher than expected FDRs, FDR was also evaluated in terms of how often the FDR of
the simulated result exceeded thresholds of .05, .15, .25 and .35.
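A sketch of these evaluation measures for a single simulated experiment is given below (illustrative Python/SciPy; the bootstrap confidence intervals computed with MATLAB's bootci are not reproduced here, and all names are illustrative).

```python
import numpy as np
from scipy.stats import norm

def evaluate(rejected, false_null):
    """Per-simulation performance measures.

    rejected, false_null : boolean arrays over time points; `rejected` marks rejected
    null hypotheses, `false_null` marks points where an effect was actually added.
    """
    hits = np.sum(rejected & false_null)
    false_alarms = np.sum(rejected & ~false_null)
    n_false_null = np.sum(false_null)
    n_true_null = np.sum(~false_null)

    hit_rate = hits / n_false_null if n_false_null else 0.0
    fa_rate = false_alarms / n_true_null if n_true_null else 0.0

    # Clamp rates of 0 and 1 to .005 and .995 so that d' stays finite.
    clamp = lambda r: min(max(r, 0.005), 0.995)
    d_prime = norm.ppf(clamp(hit_rate)) - norm.ppf(clamp(fa_rate))  # Z(hit) - Z(fa)

    n_rejected = rejected.sum()
    fdr = false_alarms / n_rejected if n_rejected else 0.0
    any_false_alarm = false_alarms > 0   # aggregated over simulations to estimate FWER
    return hit_rate, fa_rate, d_prime, fdr, any_false_alarm
```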
Results
Results are presented in terms of each measure of performance, starting with the sample-
wise measures of hit rate, false alarm rate and d’. Next the FWER is presented followed by FDR
results including proportions of simulated experiments exceeding threshold cutoffs of FDR.
Bonferroni correction consistently resulted in no rejections of any null hypotheses, so no further
results are reported about that method.
Hit Rate
Average hit rate varied considerably depending on the method of multiple comparisons
correction and number of false null hypotheses (Figure B.2). The Tmax method's highest average
hit rate was 15.2%. The B&H FDR method had an average hit rate that increased as the
proportion of false null hypotheses increased; its average hit rate with 10 false null hypotheses
(out of 100) was 11.2% and with 100 false null hypotheses was 73.4%. The cluster-based method
had the highest average hit rate of any method for all but the one condition with 100 false null
hypotheses, with hit rates over 50% in all conditions. (Hit rates are not meaningful when all null
hypotheses are true.)
Figure B.2. Hit rates of three multiple comparisons correcting procedures. Hit rate is measured as the
proportion of false null hypotheses that were correctly rejected after correction for multiple comparisons.
Error bars are bootstrapped 95% confidence intervals. When no null hypotheses were false, hit rate is
defined to be zero.
False Alarm Rate
Average false alarm rates were generally low with one exception (Figure B.3). Tmax had
average false alarm rates under 0.1% in all conditions. The B&H FDR method’s false alarm rate
increased as the number of false null hypotheses increased up to a false alarm rate of 1.8% with
50 false null hypotheses. The cluster-based method had a false alarm rate close to 0.6% in all but
the condition with no false null hypotheses in which the false alarm rate was 0.16%. (False alarm
rate is not meaningful when all null hypotheses are false.)
Figure B.3. False alarm rates of three multiple comparisons correcting procedures. False alarm rate is
the proportion of true null hypotheses that are incorrectly rejected. Note the difference in the vertical axis
from the previous plot. Error bars are bootstrapped 95% confidence intervals. When all null hypotheses
are false, no false alarms are possible.
Sensitivity
Because d’ cannot be computed when either all or none of the null hypotheses are true,
results are presented for the intermediate conditions. The cluster-based method shows a clear
advantage compared to the others when evaluated in terms of d' (Figure B.4). Tmax and B&H
FDR have similarly low sensitivities in the 10 false null hypotheses condition, with d’ < 0.7 for
both methods. As more null hypotheses are false, the B&H FDR method improves, achieving a
d’ of 2.3 when half the null hypotheses are false. Similarly, the T
max
method improves with more
false null hypotheses, but it only gets as high as a d’ of 1.3. The cluster-based method has a d’ of
2.2 or greater for all conditions and has the highest d’ for each condition.
Figure B.4 d’ sensitivity for three methods of correcting for multiple comparisons. Sensitivity is based
on hit rate and false alarm rate and is undefined when all null hypotheses are true or all null hypotheses are
false. Error bars are bootstrapped 95% confidence intervals.
Family-wise Error Rate
Family-wise error rate estimates the probability of having one or more false positives in a
family of statistical tests. Average FWER for each method is shown in Figure B.5. The Tmax method had a low FWER close to 5% in all conditions. B&H FDR and the cluster-based method
both had low FWER when no null hypotheses were false with both less than 2.5%. When some
null hypotheses were false, both B&H FDR and the cluster-based method had higher FWER, with
the cluster-based FWER reaching a maximum of 28% with 10 false null hypotheses and B&H
FDR reaching a maximum FWER of 45% with 50 false null hypotheses.
Figure B.5. Family-wise error rate for three methods of correcting for multiple comparisons. Family-
wise error rate is the probability of committing one or more Type I errors (false alarms) in an entire family
of statistical tests. Error bars are bootstrapped 95% confidence intervals.
False Discovery Rate
False discovery rate is the proportion of rejected null hypotheses that are actually true.
Overall, average FDR for each method was low (Figure B.6). Simulation results were compatible
with each method having an average FDR of 5% or lower. The Tmax and cluster-based methods
had average FDRs greater than 5% when 0 and 10 null hypotheses were false, respectively, but
the 95% confidence intervals for both included FDR of 5%. (False discovery rate is not
meaningful when all null hypotheses are false.)
Figure B.6. False discovery rates of three methods for correcting for multiple comparisons. False
discovery rate is the proportion of rejected hypotheses that are actually true. When no null hypotheses are
false, then any rejected hypotheses are actually true, so false discovery rate is 1 in those cases, or 0 when
there are no rejections. When all null hypotheses are false, then no rejected hypotheses can be true, so false
discovery rate is 0. Error bars are bootstrapped 95% confidence intervals.
To assess the probability of obtaining unusually high FDRs, the probability of obtaining
FDRs above a series of thresholds (5%, 15%, 25% and 35%) were estimated as the proportion of
simulation results greater than each threshold (Figure B.7). The Tmax method had a low
probability of exceeding any of the thresholds under all conditions, with no probability exceeding
.059. The B&H method had high probabilities of exceeding the 5% threshold with .21 and .24 in
the 20 and 50 false null hypothesis conditions respectively. The cluster-based method also had
high probabilities of exceeding the 5% threshold with .28 and .24 in the 10 and 20 false null
hypothesis conditions respectively.
Figure B.7. Proportion of simulation results in which false discovery rates exceeded four cutoffs. Each
simulated experiment had a false discovery rate. Shown are the proportions of those experiments that had
false discovery rates higher than four thresholds for each of the three methods for correcting for multiple
comparisons. Error bars are bootstrapped 95% confidence intervals.
Discussion
Four methods of correcting for multiple comparisons were tested using simulations
designed to emulate an effect similar to the one expected from a visual speech mismatch
response. The performance of each method was evaluated in terms of sensitivity, FDR and
FWER. Bonferroni correction was clearly the worst performer of the methods tested as it resulted
in no rejections of any null hypotheses. Tmax appeared to be too conservative, resulting in poor sensitivity.
The other methods were somewhat more mixed in terms of their performance characteristics, but
the best appears to be a cluster-based method with B&H FDR as a close second.
Bonferroni Correction
The Bonferroni-corrected threshold for significance is given by α/N, where N is the
number of statistical tests in the family and α is the single-test threshold for significance. The
simulations consisted of 100 tests, and α was .05, so the Bonferroni-corrected criterion was
.0005. Permutation tests have a minimum possible p-value equal to 2/B, where B is the number of entries in the permutation distribution. In a paired-samples test like the one used in the simulations, B = 2^n, where n is the number of participants. With 11 participants, the minimum p-value is approximately .0009, and since .0009 > .0005, a hit was mathematically impossible using Bonferroni correction with these conditions.
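A quick check of this arithmetic (illustrative Python):

```python
# Minimum attainable p-value of the paired-samples permutation test versus the
# Bonferroni-corrected criterion used in the simulations.
n_subjects, n_tests, alpha = 11, 100, 0.05
B = 2 ** n_subjects                      # 2048 sign-flip relabellings
min_p = 2 / B                            # ~0.00098
bonferroni_criterion = alpha / n_tests   # 0.0005
print(min_p > bonferroni_criterion)      # True: no sample can ever reach the criterion
```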
Tmax Correction
The Tmax method had the lowest sensitivity of the three remaining methods over two of
the three proportions of false null hypotheses and tied for the lowest with B&H FDR in the
remaining case. This method was the only one to maintain strong control over FWER, with a
FWER close to 5% under all conditions tested in which FWER is meaningful. (When all null
hypotheses are false, FWER is by definition 0.) In the conditions in which FWER is meaningful,
Tmax avoids sacrificing more sensitivity than it needs to by maintaining a FWER close to the nominal level instead of achieving a FWER that is much below the nominal level. While Tmax achieves the goal of strong control over FWER while still having some sensitivity, it is still a very
conservative test in the simulated conditions, never achieving an average hit rate higher than
15.2%.
B&H FDR
In terms of sensitivity, B&H FDR correction tied with Tmax for the least sensitive when
10 (out of 100) of the null hypotheses were false, but sensitivity increased as the proportion of
false null hypotheses increased. This placed B&H FDR above Tmax in terms of sensitivity but was
still well behind the cluster-based method. The B&H FDR method has as its goal the maintenance
of a nominal FDR. On average, it achieved this goal, with mean FDR well below the nominal
value in all conditions simulated. Although the average FDR taken over all the simulated
experiments was maintained, the distribution of single-experiment FDR values was heavy-tailed.
This was apparent from the probability of obtaining FDR values over a set of thresholds, with
FDR values over .05 in as many as 24% of experiments, and FDR values over .15 in 10% of
experiments in the condition with 20 false null hypotheses. Over the range of tested proportions
of false null hypotheses, it appears that the probability of obtaining FDR over .05 increases as the
number of false null hypotheses increases. The topic of FDR variability will be revisited below.
B&H FDR does not have as a goal the maintenance of any particular FWER, and the
results of the simulations here showed that FWERs far in excess of the nominal 5% were often obtained: when half the null hypotheses were false, the FWER approached 45%. This is not
surprising, as in order to achieve an average FDR of approximately .05 at least some false alarms
should be committed, or the FDR would approach zero. Because FDR is the ratio of false positives to all rejections, the probability of obtaining one or more false alarms should be higher as the
number of false null hypotheses increases. B&H FDR does appear to offer weak FWER control,
which means that when all null hypotheses are true, the FWER is at or below the nominal 5%.
Cluster-based Correction
The cluster-based method used here had by far the highest sensitivity in terms of d’
compared to the other three methods. The cluster-based method maintained an average FDR
close to the nominal .05, although when 10 (out of 100) of the null hypotheses were false, the
average FDR trended above 5% with a mean value of 5.5%, but the 95% confidence interval
included 5%. The distribution of FDR appeared to have heavy tails when the number of false null
hypotheses was low, such that the probabilities of obtaining FDR higher than 5% and 15% were
.28 and .14, respectively. When half the null hypotheses were false, though, these probabilities
were .05 and .002, respectively. Similar to B&H FDR, the cluster-based method does not exert
strong control over FWER, with FWER as high as 27.6%. Also like B&H FDR, the cluster-based
method offers weak FWER control, maintaining FWER less than 5% when all null hypotheses
are true.
FDR Variability
All three of the methods (excluding Bonferroni) had good average FDR, but under some
conditions the chance of getting higher than expected FDR was high for both the cluster-based
and B&H FDR methods. Expressed only in terms of FDR, there appears to be a trade-off, such
that the cluster-based method has a high risk of high FDR when few null hypotheses are false and
B&H FDR has a high risk of high FDR when many null hypotheses are false. When considered
in terms of absolute numbers of false positives, though, the errors made by these two methods are
not at all equivalent. When only 10 null hypotheses are false and we assume a 100% hit rate,
then even one false alarm raises the FDR to 9%; even a FDR of 17% represents only two false
alarms. Considering the opposite end of the spectrum when many null hypotheses are false,
maintaining a low FDR becomes more important as even a modestly inflated FDR can mean a
large increase in the absolute number of false alarms.
The reason for the different behaviors of the two methods can be understood based on the
mechanics of the methods. The cluster-based method identifies clusters. It is unlikely that, in
absence of any effect, noise will be identified as part of a significant cluster. When there is an
effect, though, a noisy observation might be identified as part of a cluster if that noisy observation
happens to be temporally adjacent to the effect. In other words, cluster-based methods are liable
to false alarm on those samples that are next to the edges of genuine clusters. In time-series data
like those considered here, any given cluster can only have two edges: where the cluster starts and
where the cluster stops. The number of edges is independent of the size of the cluster, so the
number of false alarms will stay approximately constant. Because the number of hits is
increasing with the size of the cluster, FDR decreases. An additional property of false alarms in
the cluster-based method is that because they occur on the edges of clusters, they are unlikely to
lead to dramatically incorrect scientific conclusions.
The B&H FDR correction establishes a family-wide significance threshold, and that
threshold depends on the p-values in the family such that the more very low p-values are
obtained, the higher the threshold will be. As the number of false null hypotheses increases, there
are more chances to obtain very small p-values, potentially raising the significance threshold.
This increased threshold can inflate the FDR by causing more true null hypotheses to be rejected
simply because of the relaxed significance threshold. This inflation of FDR is non-specific, in
that it increases false alarms throughout the true null hypotheses. This is in contrast to the
cluster-based method which localizes errors to be near true effects.
Conclusion
Based on the simulation results, the two clear best performers are Tmax and the cluster-based method. The Tmax method provided the best control of FWER regardless of the number of
false null hypotheses, but this comes at the expense of sensitivity. The cluster-based method
provided the best sensitivity, but only provides weak control of FWER. The cluster-based
method also has a heavy-tailed distribution of FDR when a small number of null hypotheses are
false. This leads to an increased probability of obtaining a higher than expected FDR, but this
drawback is considered minor as it leads to a small number of false alarms in absolute terms, and
any false alarms are likely to be temporally adjacent to true effects.
As a general recommendation based on these simulations, the Tmax method is best when it is important to be extremely conservative. Cases in which a false alarm is much more costly or undesirable than a miss are appropriate for Tmax correction. When sensitivity is a priority and
incorrectly over-estimating the size of a cluster of effects is not too much of a problem, then the
cluster-based method is appropriate. In cases like the one in Chapter 3, in which the effect is
expected to be small and temporally distributed, the cluster-based method is the best of the
methods considered here. The kinds of errors that are likely with the cluster-based method are
also not major problems for interpretation of experimental results of the kind in Chapter 3.
Observing an effect 270 ms after stimulus onset when the true effect does not actually occur until
273 ms is unlikely to alter the overall conclusion based on the data.
Abstract
Visual speech perception, also known as lipreading or speech reading, involves extracting linguistic information from seeing a talking face. What information is available in a seen talking face, as well as how that information is extracted are unsolved questions. The introductory chapter of this dissertation discusses some of the theories describing what information is available in the talking face and how that information is extracted. These theories fall into three broad categories based on the structure of the representation of visible speech information: Auditory models, Motor models, and Late-integration models. Auditory models propose that visual speech information is transformed into an auditory representation. Motor models propose that visual speech information is represented in terms of the articulatory gestures involved in speech production. Late-integration models propose multiple sensory-specific pathways for speech perception that operate somewhat independently.

The work in this dissertation uses behavioral methods to investigate the visible stimulus information used for visual speech perception and electrophysiological methods to investigate the neural representation of visual speech. In both cases, the experiments take advantage of second-order isomorphism, that is, the dissimilarity relationship of stimuli should match to the dissimilarity relationship of behavioral responses and neural measures.

In the behavioral experiments, a model of the visual representations that drive visual speech perception (Jiang et al., 2007a) was used to predict visual speech discrimination. The model makes no use of feature extraction, and is instead a straightforward transformation of optically-available data. As such, this model is an empirical realization of the claim that visible syllable dissimilarity can be determined using straightforward visual processes. Previously, the model accounted for visible speech phoneme identification in terms of perceptually weighted distance measures computed using 3-dimensional optical recordings. In Behavioral Experiment 1, participants discriminated natural and synthesized pairs of consonant-vowel spoken nonsense syllables that were selected on the basis of their modeled perceptual distances. An identification task was also used to confirm that both natural and synthetic stimuli were perceived as speech. The synthesized stimuli were generated using the same data that were input to the perceptual model. Modeled perceptual distance reliably accounted for discrimination sensitivity, measured as d', and response times.

The results of Behavioral Experiment 1 showed that the perceptual dissimilarity model successfully predicted discrimination sensitivity in both natural and synthetic stimulus conditions. Sensitivity was more strongly related to the predictions of the perceptual dissimilarity model in the synthetic condition compared to the natural condition, but this discrepancy was largely attributable to large differences between predicted and measured sensitivity for two natural speech stimulus pairs. Additionally, discrimination sensitivity was higher than expected from an implicit identification model, and the perceptual dissimilarity model was a better predictor of sensitivity than was the implicit identification model.

In Behavioral Experiment 2, the natural stimuli were inverted in orientation during the discrimination task to investigate whether the success of the perceptual dissimilarity model relied on a specific orientation. Results largely replicated the pattern of findings of Experiment 1, suggesting that perception of visible speech information is invariant to stimulus orientation. Although the model and the synthetic speech were shown to be incomplete, the results of these behavioral experiments were interpreted as consistent with speech discrimination relying on dissimilarities in the visible speech stimulus that are closely related to the 3D optical motion on which the perceptual dissimilarity model is based.

The electrophysiological experiment reported in Chapter 3 uses the visual speech mismatch negativity (vMMN) to test whether the neural representation of visual speech reflects the optical dissimilarity of pairs of syllables. The vMMN derives from the brain's response to stimulus deviance, and is thought to be generated by the cortex that represents the stimulus. The vMMN response to visual speech stimuli was used in a study of the lateralization of visual speech processing. Previous research suggested that the right posterior temporal cortex has specialization for processing simple non-speech face gestures, and the left posterior temporal cortex has specialization for processing visual speech gestures. Here, visual speech consonant-vowel (CV) stimuli with controlled perceptual dissimilarities were presented in a vMMN paradigm. The vMMNs were obtained using the comparison of event-related potentials (ERPs) for separate CVs in their roles as deviant versus their roles as standard. Four separate vMMN contrasts were tested, two with the perceptually far deviants (i.e., "zha" or "fa") and two with the near deviants (i.e., "zha" or "ta"). Only far deviants evoked the vMMN response over the left posterior temporal cortex. All four deviants evoked vMMNs over the right posterior temporal cortex. The results are interpreted as evidence that the left posterior temporal cortex represents speech stimuli that are perceived as different consonants, and the right posterior temporal cortex represents face gestures that may not be reliably discriminated as different CVs.

The data gathered in the electrophysiological experiment pose a number of challenges to conventional statistical analyses. The data are not normally distributed, and the data from different stimulus conditions do not have equal variance. In the past, these kinds of data have been analyzed using paired-samples permutation tests, but these data are problematic even for the conventional paired-samples permutation test. Appendix A of this dissertation presents a modified permutation test that addresses these problems. The data also consist of many non-independent samples upon which statistical comparisons are made