MAPPING THE NEURAL ARCHITECTURE OF CONCEPTS
by
Kingson Man
Advisor: Antonio Damasio
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
NEUROSCIENCE
December 2014
Copyright 2014 Kingson Man
! ""!
EPIGRAPH
"I can see, he can hear. We make a great combination."
- Warren Buffett, speaking of his partner and friend, Charlie Munger
! """!
DEDICATION
To my mother, father, and grandmother.
! "#!
ACKNOWLEDGEMENTS
It has been nothing short of a dream made real to pursue my Ph.D. under
the mentorship of Antonio Damasio. His book, The Feeling of What Happens,
struck me like a thunderbolt: "This is it! This is what I must do with my life!" I read
the book in the middle of my turbulent college years, at a time when most of the
authority figures around me dismissed the topic of consciousness. I was told that
it was a dead end, an epiphenomenon, a topic unworthy of scientific
investigation. Antonio's body of work proved — decisively for me — otherwise. I
am grateful to Antonio, firstly, for giving me the courage to follow my curiosity.
At USC my curiosity has led me to use the tools of neuroscience to
investigate one small corner of the topic of consciousness: how our different
senses are brought together to build our unified experience of the world. Towards
this end I have been extremely fortunate to receive all the resources, attention,
time, and mentorship that I have ever requested from Antonio and Hanna
Damasio. The uniquely stimulating and elegant working environment that Antonio
and Hanna have created in the Brain and Creativity Institute has, I think, spoiled
me for ever working anywhere else.
I am grateful to Irving Biederman for serving on my qualifying committee
and for his frank and critical attention to my work. It has made me a better thinker
and scientist. I am grateful to Zhong-Lin Lu for serving on my qualifying
committee and for his helpful feedback on my research at various stages. I am
grateful to John Monterosso for serving on my dissertation committee and for
always having words of encouragement at the ready. I am grateful to Randall
Wetzel for serving as the outside member on my qualifying and dissertation
committees, seeing me through two different research projects and playing a
greater role than a mere "outside" member.
I've also learned, in grad school, that it's far better to be lucky than smart.
My greatest stroke of luck here was to have met and been taken on by Kaspar
Meyer and Jonas Kaplan. Kaspar, you are a gentleman. Jonas, you are a wizard.
Learning from you, puzzling over data with you, batting ideas back and forth with
you — these were the formative experiences of my scientific development. You
two have been the model of competence, decency, integrity, and generosity. This
work would have been impossible without your mentorship. If I ever become a
gentleman wizard it will be thanks to you guys.
Thanks to Helder, Glenn, and Fei, my BCI siblings. Laughing together is
better than crying together! And lord knows we've done both. Thank you for
making our workplace a joyful place.
Thank you, Jordan, for being my pal. Thank you Pauline, George, Amy,
and Rachel, for your love and friendship. Thank you Komperda, Brad, and Jillian,
for taking nice long walks up high with me.
! #"!
Thanks to the very special teachers who took a risk on me, who gave me
a chance – and more, who saw in me a better person than the one I knew. Mrs.
Spiro, Ms. Suri, Vinay Parikh, Martin Sarter, and David Meyer: I still aspire to
become that better person.
Thanks to my sisters Cindy and Carolene, and to my brother-in-law Henry,
for teaching me the meaning of family. It knows neither distance nor bounds. Finally,
to my mother and father, Huu Le and Cheuk Yiu, and to my grandmother, Hoan
Huynh, I can't give enough thanks. You have loved me into this world. You have
given me everything. I can only hope to honor your sacrifices.
A note about tools: Tools enable work; good tools enable good work. My work,
for the most part, involves the manipulation of knowledge and information (not
the same things). Here I'd like to express gratitude for certain products and
resources - tools - that have made my work possible. The core products from
Google and Apple have been force multipliers for me. While almost nothing about
my experiments has "just worked", Google's search and Apple's computers
have met this extremely high standard. I am also grateful to the open source
software community for making freely available scientific tools such as FSL and
PyMVPA, without which my work would certainly have been impossible. I would
also like to thank the many journeymen scientists who have put their esoteric
computer scripts and code snippets onto the internet for the benefit of other
esoteric souls, such as myself. Finally, it is exciting to witness the first stirrings of
the open science movement in this past decade. I hope that tools for open
access publication, open peer review, open data sharing, and enlightened
measures of research impact will proliferate and mature. The continuing success
of the free and open source software movement may be the best rebuttal to the
oft-cited tragedy of the commons. May we emulate their success and tend to our
scientific commons together.
TABLE OF CONTENTS
EPIGRAPH
DEDICATION
ACKNOWLEDGEMENTS
ABSTRACT
CHAPTER 1: Introduction
1.1. What is a concept?
1.2. Insights from neuroscience: the convergence-divergence framework
1.3. From univariate to multivariate studies of semantic knowledge
1.4. Filling in the convergence hierarchy
1.5. Concept sharing and communication
1.6. Outlook
Chapter 1 Figures
Chapter 1 References
CHAPTER 2. Sensory convergence and divergence: a review of anatomical pathways and functional neuroimaging
2.1. Experimental neuroanatomy of sensory convergence
2.2. Integrating multisensory information in the mammalian cerebral cortex
2.3. From percepts to concepts
Chapter 2 Figures
Chapter 2 References
CHAPTER 3. Neural convergence of sight and sound to form audiovisual invariant representations
3.1. Introduction
3.2. Materials and Methods
3.3. Results
3.4. Discussion
Chapter 3 Figures and Tables
Chapter 3 References
CHAPTER 4. Mapping neural abstraction of object representations across sight, sound, and touch
4.1. Introduction
4.2. Materials and Methods
4.3. Results
4.4. Discussion
Chapter 4 Figures and Tables
Chapter 4 References
CHAPTER 5. "See what I mean?" Abstract representations are shared across individuals
5.1. Introduction
5.2. Experiment 1: Auditory, Visual, and Audiovisual-invariant representations are shared across individuals
5.3. Experiment 2: Decoding representations across subjects in our trimodal dataset
5.4. General discussion
Chapter 5 Tables
Chapter 5 References
CHAPTER 6. General discussion
6.1. Summary
6.2. Concepts: Are we there yet?
6.3. Outstanding Issues
Chapter 6 References
! "!
ABSTRACT
How are concepts represented in the brain? When we hear the ringing of a bell,
or watch a bell swinging back and forth, is there a shared "BELL" pattern of
neural activity in our brains? Philosophers have debated the nature of concepts
for centuries, but recent technical advances have allowed neuroscientists to
make contributions to this topic. The combination of functional neuroimaging and
multivariate pattern analysis (MVPA) has allowed us to examine distributed
patterns of activity in the human brain to decode what they represent about the
world, and to what level of abstraction. I first review some historical findings of
experimental neuroanatomy in relation to pathways of sensory convergence.
Next I describe an experiment which used MVPA to map a region in the brain
that linked the auditory and visual features of an object. A second experiment
extended the method to three modalities: sight, sound, and touch. This permitted
the mapping of distinct levels of convergence: a lower level of bimodal
convergences (audiovisual, audiotactile, and visuotactile) and a higher level of
trimodal convergence (audiovisuotactile). A third experiment addressed the
social function of concepts: enabling the communication of similar thoughts and
associations between people. I re-analyzed the first two datasets to determine
the extent to which different individuals shared similar neural substrates for the
same abstract representations. Overall, this work reveals a unified hierarchical
organization to the vast and variously defined association cortices of the human
brain, and provides a neuroanatomical grounding for a previously proposed
convergence-divergence framework of the cerebral cortex.
Chapter 1: Introduction
1.1. What is a concept?
What is a concept? And how is it instantiated in the brain? The nature of
mental concepts has remained one of the most vexing and durable problems in
the history of philosophy. The debate may be roughly summarized as one
between concept empiricists and concept nativists. The empiricists, dating back
at least to Locke and Hume, claim that concepts are composed of associations
learned from sensory experience. A contemporary restatement of this idea is
given by the neo-empiricist Jesse Prinz: "All concepts are copies or combinations
of copies of perceptual representations." (Prinz 2002) Thus, a BELL concept is
composed of the sight, sound, touch, etc., of bells we've encountered in our past.
Concept nativists, on the other hand, reject the dependence of concepts
on sensory experience. The domain-specific hypothesis (Caramazza and Shelton
1998) proposes that conceptual categories are not derived from an individual's
sensory experience, but rather are innately given. They would reflect a
preoccupation with specific kinds of objects encountered by an organism's distant
ancestors. The "radical nativist" Jerry Fodor would go so far as to deny that
concepts contain any representational content at all, but are instead just amodal
placeholders (Fodor 1998). Thus, there is nothing bell-like about the BELL
concept; it is just a generic indicator that blinks on whenever bells are
encountered in the environment.
1.2. Insights from neuroscience: the convergence-divergence framework
Bridging the empiricist and nativist camps, my mentor Antonio Damasio
proposed, in 1989, a unifying theoretical framework in which the brain's
perceptual machinery and its "amodal" knowledge stores share an intimate,
obligate link (Damasio 1989a, 1989b, 1989c). Thus a BELL concept entails not only the
activation of "ding-dong" representations in auditory cortex and "swinging"
representations in visual cortex, but also the activation of a modality-invariant
representation that ties them together. The convergence-divergence zone (CDZ)
framework establishes a hierarchy of successive levels of neural representation.
At the base level are the early sensory cortical maps for each modality. They
feed information in a bottom-up manner through successive levels of
convergence zones. CDZs at each level are tuned to detect specific conjunctions
of features from the preceding CDZ level. At the highest level are the
multisensory CDZs (Fig. 1.1, CDZn). At each stage there are also reciprocal
feedback connections that allow a CDZ to retro-activate the same set of features
it has been tuned to detect. In such a framework, mapped information from one
sensory modality may traverse up the hierarchy, reach a multisensory CDZn, and
be propagated back down to the sensory maps of different modalities. Semantic
information about the various modalities and combinations thereof would be
found at each step along the way.
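To make the bottom-up/top-down logic of the framework concrete, the toy sketch below caricatures a single multisensory CDZ in Python. The class, its thresholded matching rule, and the example patterns are illustrative assumptions for exposition only; they are not a model used in this dissertation.

```python
# A minimal toy sketch of the convergence-divergence zone (CDZ) idea: a CDZ
# stores the conjunction of lower-level patterns it was tuned to detect, and can
# retro-activate those patterns when driven from a single modality.
import numpy as np

class ToyCDZ:
    def __init__(self, patterns):
        # `patterns` maps modality name -> the lower-level activity pattern
        # (a vector) that this CDZ was tuned to detect as a conjunction.
        self.patterns = {m: np.asarray(p, dtype=float) for m, p in patterns.items()}

    def match(self, modality, activity, threshold=0.8):
        """Bottom-up step: does the incoming pattern resemble the stored template?"""
        template = self.patterns[modality]
        similarity = np.dot(activity, template) / (
            np.linalg.norm(activity) * np.linalg.norm(template) + 1e-12)
        return similarity >= threshold

    def retroactivate(self):
        """Top-down step: reinstate the full set of modality-specific templates."""
        return dict(self.patterns)

# A "BELL" CDZ tuned to an auditory and a visual feature pattern.
bell = ToyCDZ({"auditory": [1, 0, 1, 0], "visual": [0, 1, 1, 1]})

# Hearing a bell-like sound activates the CDZ ...
if bell.match("auditory", np.array([0.9, 0.1, 1.0, 0.0])):
    # ... which diverges back down, retro-activating the visual template too.
    reinstated = bell.retroactivate()
    print(reinstated["visual"])  # the "swinging bell" visual pattern
```

The only point of the sketch is the two-way traffic: a pattern arriving from one modality that matches the stored conjunction can reinstate the templates of the other modalities, which is the retro-activation step exploited experimentally in the following chapters.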
1.3. From univariate to multivariate studies of semantic knowledge
Prior human fMRI studies have mapped sensory convergence by
performing univariate activation analysis, in which the overall change in activity
level of a brain voxel or region of interest is assessed. Accordingly, multisensory
cortices were defined by mapping sensory co-activation, e.g., regions activated
both by sounds and by pictures (Driver & Noesselt 2008). A more direct link to
conceptual knowledge is established by the semantic congruency effect, e.g.,
certain regions are more activated by the sight and sound of the same object,
compared to the sight of one object paired with the sound of a different object
(Doehrmann & Naumer 2008). Many regions are sensitive to semantic
congruency, including posterior superior temporal sulcus (pSTS), superior
temporal gyrus (STG), posterior parietal cortex (PPC), premotor cortex (PMC),
and inferior frontal cortex (IFC) (Fig. 1.2).
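As a rough illustration of this univariate logic, the sketch below contrasts per-voxel activation between congruent and incongruent trials with a simple two-sample t-test on synthetic data; real analyses fit a general linear model over the full time series, so this is only a caricature of the congruency contrast, not the published procedure.

```python
# A toy univariate congruency contrast: for each voxel, test whether mean
# activation is higher for congruent than for incongruent audiovisual trials.
# The data, trial counts, and threshold below are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_trials, n_voxels = 30, 500

# Synthetic per-trial activation estimates (e.g., beta values) per voxel.
congruent = rng.normal(size=(n_trials, n_voxels))
incongruent = rng.normal(size=(n_trials, n_voxels))
congruent[:, :50] += 0.8  # pretend the first 50 voxels prefer congruent stimuli

# Voxelwise two-sample t-test: congruent > incongruent.
t_vals, p_vals = stats.ttest_ind(congruent, incongruent, axis=0)
congruency_sensitive = (t_vals > 0) & (p_vals < 0.001)  # uncorrected threshold
print(f"{congruency_sensitive.sum()} voxels show a (toy) congruency effect")
```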
However, these brain regions may signal that some object was presented
congruently, without necessarily signaling what object was presented
congruently. To answer the "what" question, one can perform multivariate pattern
analysis (MVPA) to decode distributed patterns of brain activity for their
representational content (Mur et al. 2009). In Experiment 1 I applied MVPA in a
novel crossmodal manner to detect audiovisual invariant representations: I
trained a classification algorithm with auditory fMRI data, collected while subjects
heard the sounds of various objects, and then decoded the objects from visual
fMRI data, collected while subjects watched corresponding videos of those
objects. Despite the variety of regions implicated by the semantic congruency
effect, I found that crossmodal classification was only successful in a focal region
near the posterior superior temporal sulcus. Only there were objects i)
distinguishable from each other and ii) represented similarly whether they were
seen or heard (Man et al. 2012). A whole brain analysis showed a near-complete
lateralization of pSTS voxels to the right hemisphere, indicating that the
representations were not "merely" linguistic entries (which would be expected to
be left lateralized), but rather involved semantic properties (Damasio H. et al.
2004).
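As a rough illustration of the crossmodal train/test logic described above, here is a minimal sketch using scikit-learn with synthetic data in place of real fMRI patterns; the array shapes, object labels, and choice of a linear support vector machine are assumptions for the example, not a description of the actual analysis pipeline.

```python
# Crossmodal MVPA sketch: train a classifier on auditory-run patterns, then
# test it on visual-run patterns of the same objects. Synthetic data only.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_trials, n_voxels = 40, 200

# Fake beta patterns for two objects ("bell", "dog"), one set from auditory
# runs and one from visual runs of the same objects.
labels = np.repeat(["bell", "dog"], n_trials // 2)
auditory_patterns = rng.normal(size=(n_trials, n_voxels)) + (labels == "bell")[:, None] * 0.5
visual_patterns = rng.normal(size=(n_trials, n_voxels)) + (labels == "bell")[:, None] * 0.5

# Train on one modality, test on the other: above-chance accuracy implies a
# pattern code that generalizes across hearing and seeing the same object.
clf = LinearSVC().fit(auditory_patterns, labels)
crossmodal_accuracy = clf.score(visual_patterns, labels)
print(f"audio -> video decoding accuracy: {crossmodal_accuracy:.2f}")
```

The same logic runs in the reverse direction (train on visual, test on auditory); only generalization across the two directions licenses the inference of a modality-invariant code.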
1.4. Filling in the convergence hierarchy
Having established the existence of audiovisual invariant representations,
and, further, mapped them to a focal region between the auditory and visual
cortices, in Experiment 2 I extended the method of crossmodal MVPA to reveal
additional nodes in the sensory convergence hierarchy. By the addition of a third
modality, touch, I identified new regions of audiotactile and visuotactile
convergence. In addition to those bimodally invariant representations, I also
revealed a region containing "CDZ
n+1
", or a superordinate level of trimodally
invariant representations, across sight, sound, and touch.
1.5. Concept sharing and communication
In Experiment 3 I address the social function of concepts: enabling the
consistent transmission of thoughts and associations between people. I re-
analyzed the first two datasets to determine the extent to which different
individuals shared similar neural substrates for the same abstract
representations.
1.6. Outlook
In Chapter 2 I review the neuroanatomical basis for sensory convergence
and divergence in the mammalian cerebral cortex, drawing upon studies from
experimental neuroanatomy as well as functional neuroimaging. Chapter 3
reports my MVPA experiment on audiovisual invariant representations. Chapter 4
reports my MVPA experiment on trimodally invariant representations across
sight, sound, and touch. Chapter 5 reports my reanalysis of data from the prior
two experiments to decode concepts across individuals. The final Chapter 6
summarizes the main findings from my three studies, considers the role of
neuroscience in investigating concepts, and considers some outstanding issues
with MVPA studies of abstract representations.
Chapter 1 Figures
Figure 1.1. A schematic of the convergence-divergence zone (CDZ) framework.
(Meyer & Damasio 2009)
Figure 1.2. Regions of the brain sensitive to the semantic congruency effect.
(Doehrmann & Naumer 2008)
Chapter 1 References
Caramazza, A., & Shelton, J. R. (1998). Domain-specific knowledge systems in the brain: The animate-inanimate distinction. Journal of Cognitive Neuroscience, 10(1), 1–34.
Damasio, A. (1989a). The brain binds entities and events by multiregional activation from convergence zones. Neural Computation, 1, 123–132.
Damasio, A. (1989b). Time-locked multiregional retroactivation: A systems-level proposal for the neural substrates of recall and recognition. Cognition, 33, 25–62.
Damasio, A. (1989c). Concepts in the brain. Mind & Language, 4(1-2), 24–28. doi:10.1111/j.1468-0017.1989.tb00236.x
Damasio, H., Tranel, D., Grabowski, T., Adolphs, R., & Damasio, A. (2004). Neural systems behind word and concept retrieval. Cognition, 92(1-2), 179–229. doi:10.1016/j.cognition.2002.07.001
Doehrmann, O., & Naumer, M. J. (2008). Semantics and the multisensory brain: How meaning modulates processes of audio-visual integration. Brain Research, 1242, 136–150. doi:10.1016/j.brainres.2008.03.071
Driver, J., & Noesselt, T. (2008). Multisensory interplay reveals crossmodal influences on "sensory-specific" brain regions, neural responses, and judgments. Neuron, 57(1), 11–23. doi:10.1016/j.neuron.2007.12.013
Fodor, J. (1998). Concepts: Where Cognitive Science Went Wrong. New York: Oxford University Press.
Man, K., Kaplan, J. T., Damasio, A., & Meyer, K. (2012). Sight and sound converge to form modality-invariant representations in temporoparietal cortex. The Journal of Neuroscience, 32, 16629–16636.
Meyer, K., & Damasio, A. (2009). Convergence and divergence in a neural architecture for recognition and memory. Trends in Neurosciences, 32, 376–382.
Mur, M., Bandettini, P. A., & Kriegeskorte, N. (2009). Revealing representational content with pattern-information fMRI: An introductory guide. Social Cognitive and Affective Neuroscience, 4(1), 101–109. doi:10.1093/scan/nsn044
Prinz, J. (2002). Furnishing the Mind: Concepts and Their Perceptual Basis. Cambridge, MA: MIT Press.
! "+!
CHAPTER 2. Sensory convergence and divergence: a review of anatomical
pathways and functional neuroimaging
(Corresponding publication: Man K, Kaplan J, Damasio H, Damasio A (2013)
Neural convergence and divergence in the mammalian cerebral cortex: From
experimental neuroanatomy to functional neuroimaging. The Journal of
Comparative Neurology 521:4097–4111.)
Abstract:
A development essential for understanding the neural basis of complex behavior
and cognition is the description, during the last quarter of the twentieth century,
of detailed patterns of neuronal circuitry in the mammalian cerebral cortex. This
effort established that sensory pathways exhibit successive levels of
convergence, from the early sensory cortices to sensory-specific association
cortices and to multisensory association cortices, culminating in maximally
integrative regions; and that this convergence is reciprocated by successive
levels of divergence, from the maximally integrative areas all the way back to the
early sensory cortices. This chapter first provides a brief historical review of these
neuroanatomical findings, which were relevant to the study of brain and mind-
behavior relationships using a variety of approaches and to the proposal of
heuristic anatomo-functional frameworks. In a second part, the chapter reviews
new evidence that has accumulated from studies of functional neuroimaging,
employing both univariate and multivariate analyses, as well as
electrophysiology, in humans and other mammals, that the integration of
information across the auditory, visual, and somatosensory-motor modalities
proceeds in a content-rich manner. Behaviorally and cognitively relevant
! ""!
information is extracted from and conserved across the different modalities, both
in higher-order association cortices and in early sensory cortices. Such stimulus-
specific information is plausibly relayed along the neuroanatomical pathways
alluded to above. The evidence reviewed here suggests the need for further in-
depth exploration of the intricate connectivity of the mammalian cerebral cortex in
experimental neuroanatomical studies.
2.1. Experimental neuroanatomy of sensory convergence
In the last quarter of the twentieth century, a critical development occurred in the
history of understanding complex behavior and cognition. This development was
the systematic description of detailed patterns of neuronal circuitry in the
mammalian cerebral cortex. In this section, we first provide a brief review of
these studies of experimental neuroanatomy. Of particular interest are a number
of landmark neuroanatomical studies, conducted in non-human primates and
authored by, among others, E.G. Jones, T.P. Powell, Deepak Pandya, Gary W.
Van Hoesen, Kathleen Rockland, and Marsel Mesulam, that revealed intriguing
sets of connections projecting from primary sensory regions to successive
regions of association cortex. The connections were organized according to an
ordered and hierarchical architecture whose likely functional result was a
convergence of diverse sensory signals into certain cortical areas, which, of
necessity, became multisensory. The studies also revealed that, in turn, the
convergent projections were usually reciprocated, diverging in succession back
to the originating primary sensory regions. For example, Jones and Powell
! "#!
(1970) found that the primary sensory areas project unidirectionally to their
adjacent sensory-specific association areas. The initial strict topography of the
primary sensory areas was progressively lost along the stream of associational
processing. The sensory-specific association areas further converged onto
multimodal association areas, most notably in the depths of the superior temporal
sulcus, and these multimodal association areas originated reciprocal, divergent
projections back to the sensory-specific association cortices. The grand design of
this sort of architecture was made quite transparent by the manner in which
connections from varied sensory regions, such as visual and auditory, gradually
converged into the hippocampal formation via the entorhinal cortex, a critical
gateway into the hippocampus; and by how, in turn, the entorhinal cortex initiated
diverging projections that reciprocated the converging pathways all the way back
to the originating cortices. In this regard, the work of Gary W. Van Hoesen and of
his colleagues stands out by offering remarkable evidence for sensory
convergence and divergence. Specifically, Van Hoesen, Pandya, and Butters
(1972) revealed the sources of afferents to the entorhinal cortex. Three cortical
regions, themselves the recipients of already highly convergent multimodal input,
were found to be the main inputs to entorhinal cortex. They were the
parahippocampal cortex; prepiriform cortex; and ventral frontal cortex (Fig. 2.1).
In 1975 this same research group published a more comprehensive description
of the cortical afferents and efferents of the entorhinal cortex, describing (a)
afferents from ventral temporal cortex (Van Hoesen and Pandya 1975a); (b)
afferents from orbitofrontal cortex (Van Hoesen et al., 1975); and (c) efferents
! "$!
that form the perforant pathway of the hippocampus (Van Hoesen and Pandya
1975b).
The connectional patterns that emerged from these studies were novel and
functionally suggestive. Through a series of convergent steps, the hippocampus
was provided with modality-specific and multisensory input hailing from
association areas throughout the cerebral cortex. The entorhinal and perirhinal
cortices, which are in and of themselves divisible into subregions with distinct
afferent/efferent connectivity profiles, acted as final waystations, funneling
sensory information into the hippocampus proper.
Work from Seltzer and Pandya (1976) confirmed that the parahippocampal area
— which projects to entorhinal cortex — receives projections from each of the
association cortices of the auditory, visual, and somatosensory modalities, and
that the cytoarchitectonic subdivisions of the superior temporal sulcus (STS)
possess distinctive profiles of afferents from auditory, visual, and somatosensory
associational cortices (Seltzer and Pandya 1994). Also multisensory, but perhaps
to a lesser extent, the inferior parietal cortices were found to receive inputs from
the visual and somatosensory cortices (Seltzer and Pandya 1980). Later, making
use of a neuropathological signature of Alzheimer’s disease — the presence of
neurofibrillary tangles — and of the fact that Alzheimer's disease pathology can
be highly selective and regional, Van Hoesen and his colleagues would make an
effort to demonstrate equivalent patterns of connectivity in the human brain
! "%!
(Hyman et al., 1984; Hyman et al., 1987; Van Hoesen and Damasio 1987; Van
Hoesen et al., 1991). For example, both the input and output layers of the
entorhinal cortex were heavily compromised by tangles, but not the other layers,
in effect disconnecting the sensory cortices from the hippocampus proper (Fig. 2.2).
From a perspective of comparative morphology, the development of higher
cognitive functions was enabled by the progressive addition of new association
cortices and new levels of convergence. Simpler architectural designs, such as
direct connections between primary sensory and limbic regions, only provide a
relatively narrow behavioral repertoire. Additional, intermediary regions of
association – modality-specific association cortices and then higher-order
multimodal association cortices – provide greater behavioral flexibility,
complexity, and abstraction, as suggested in prior theoretical work that in part
inspired these anatomical studies (Geschwind 1965a,b; reviewed by Catani &
ffytche 2005) (Fig. 2.3).
Attempts to investigate comparable connectional patterns in humans make
use of a form of noninvasive tractography, diffusion tensor imaging (DTI), based
on magnetic resonance scanning. A recent study employing DTI demonstrated
convergence of auditory, visual, and somatosensory white-matter fibers in the
temporo-parietal cortices (Bonner et al., 2013) (Fig. 2.4).
! "&!
The convergent bottom-up projections are reciprocated by top-down projections
that cascade from medial temporal lobe structures to multisensory association
cortices, and on to parasensory association cortices and then finally to early
sensory cortices. The pattern of divergent projections from sensory association
cortices back to primary sensory cortices has been confirmed for the auditory
modality (Pandya 1995), and for the visual (Rockland and Pandya 1979) and
somatosensory modalities (Vogt and Pandya 1978). The back-projections exhibit
a distinctive laminar profile, targeting the superficial layers of the primary sensory
cortices (reviewed in Meyer 2011). More recent work has also revealed sparse
direct projections between the primary and secondary sensory cortices of, for
example, audition and vision (Falchier et al., 2002; Rockland and Ojima 2003;
Clavagnier et al., 2004; Cappe and Barone 2005; Hackett et al., 2007; Falchier et
al., 2010; and reviewed by Falchier et al., 2012) (Fig. 2.5). We will address the
significance of these findings later in the article. We also note that we have
focused exclusively on neuroanatomical findings in cerebral cortex. We must, of
necessity, omit discussion of the rich pathways of subcortical convergence and
cortical-subcortical interactions (reviewed in several chapters of Stein, ed.,
2012), although, for an excellent example of the latter, see Jiang et al. (2001).
2.2. Integrating multisensory information in the mammalian cerebral cortex
2.2.1. Anatomo-functional framework
The work cited so far established the following facts: (i) sensory pathways exhibit
successive levels of convergence, from the early sensory cortices to sensory-
! "'!
specific association cortices and to multisensory association cortices, culminating
in maximally integrative regions such as in medial temporal lobe cortices and
both lateral and medial prefrontal cortices; and (ii) the convergence of sensory
pathways is reciprocated by successive levels of divergence, from the maximally
integrative areas to the multisensory association cortices, to the sensory-specific
association cortices, and finally to the early sensory cortices.
In an attempt to bridge these anatomical facts and the evidence provided by a
variety of approaches to the study of brain and mind-behavior relationships, a
number of anatomo-functional frameworks were proposed, for example by
Damasio (1989a; 1989b; Meyer and Damasio 2009), and by Mesulam (1998)
(Fig. 2.6). Briefly, the Damasio framework proposes an architecture of
convergence-divergence zones (CDZ) and a mechanism of time-locked
retroactivation. Convergence-divergence zones are arranged in a multi-level
hierarchy, with higher-level CDZs being both sensitive to, and capable of
reinstating, specific patterns of activity in lower-level CDZs. Successive levels of
CDZs are tuned to detect increasingly complex features. Each more-complex
feature is defined by the conjunction and configuration of multiple less-complex
features detected by the preceding level. CDZs at the highest levels of the
hierarchy achieve the highest level of semantic and contextual integration, across
all sensory modalities. At the foundations of the hierarchy lie the early sensory
cortices, each containing a mapped (i.e., retinotopic, tonotopic, or somatotopic)
representation of sensory space. When a CDZ is activated by an input pattern
! "(!
that resembles the template for which it has been tuned, it retro-activates the
template pattern of lower-level CDZs. This continues down the hierarchy of
CDZs, resulting in an ensemble of well-specified and time-locked activity
extending to the early sensory cortices. The mid- and high-level CDZs that span
multiple sensory modalities share much in common with Mesulam's (1998)
account of "transmodal nodes", as well as with "hubs", or nodes with high
centrality, from a graph-theoretic approach (Bullmore and Sporns, 2009).
The overall framework allows for regions in which a strict processing hierarchy is
not maintained. Certain sub-sectors may have particularly strong internal
connections, both "vertical" and "horizontal", forming somewhat independent
functional complexes. Two examples are the dorsal and ventral pathways, whose
internal connectivities are associated with, respectively, visually guided action
(Kravitz et al., 2011) and the representation of object qualities (Kravitz et al.,
2013). Contrary to other models that posit nonspecific, or modulatory, feedback
mechanisms, time-locked retroactivation provides a mechanism for the global
reconstruction of specific neural states. In the next section we review recent
findings that are compatible with, and possibly supportive of, this neuroarchitectural
framework.
2.2.2. Investigating sensory convergence and divergence in functional
neuroimaging experiments in humans
! ")!
In this section, we turn to evidence from functional neuroimaging studies in
humans. Specifically, we focus on the integration of multisensory information in
the creation of representations of the objects of perception. These
representations are content-rich, in that they contain cognitively and behaviorally
relevant information about the stimuli. The information is abstracted across
different sensory modalities. Recent work from our laboratory has provided
evidence for content-rich multisensory information integration both at the level of
the early, sensory-specific cortices, which, we argue, rely on divergent
projections, and at the level of the multisensory association cortices, which
depend on convergence.
Our studies were conducted using functional magnetic resonance imaging
(fMRI), which noninvasively measures blood-oxygenation level dependent
signals in the brain as a proxy measure of neural activity. This generates three-
dimensional images of neural activity with good spatial resolution (voxels of
around two cubic millimeters), permitting the fractionation of discrete brain regions (e.g.,
primary visual cortex) into many independently measured voxels. By performing
multivariate (or multivoxel) pattern analysis (MVPA) of these human functional
brain images, the representational content may be decoded from distributed
patterns of activity (Mur et al., 2009). In a common implementation of MVPA,
machine-learning algorithms are first trained to recognize the association
between given classes of stimuli and certain spatial patterns of brain activity.
Next, the algorithm is tested by having it assign class labels to new sets of data
! "*!
based on the recognition of diagnostic spatial patterns learned from the training
set. We used this technique to make a strong case that stimuli of one modality
could orchestrate content-specific activity in the early sensory cortices of another
modality. Visual stimuli in the form of silent but sound-implying movies were
found to evoke content-specific representations in auditory cortices (Meyer et al.,
2010). For example, silent video clips of a violin being played and of a dog
barking were reliably distinguished based solely on activity in the anatomically
defined primary auditory cortices. The nine individual objects as well as the three
semantic categories to which they belonged ("animals", "musical instruments",
and "objects") were successfully decoded. Intriguingly, the subjects' ratings of the
vividness of the video-cued auditory imagery significantly correlated with
decoding performance (Fig. 2.7).
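A minimal sketch of this train-then-test procedure, with synthetic data standing in for the fMRI patterns, might look as follows; the run structure, category labels, and classifier are illustrative assumptions only, not the pipeline used in the studies reviewed here.

```python
# Within-modality MVPA sketch with leave-one-run-out cross-validation:
# train on the patterns from all runs but one, then assign labels to the
# held-out run from the learned spatial patterns. Synthetic data only.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(1)
n_runs, trials_per_run, n_voxels = 6, 9, 300

# Nine object exemplars per run, drawn from three categories.
labels = np.tile(np.repeat(["animal", "instrument", "object"], 3), n_runs)
runs = np.repeat(np.arange(n_runs), trials_per_run)
signal = (labels == "animal")[:, None] * 0.4 - (labels == "object")[:, None] * 0.4
patterns = rng.normal(size=(n_runs * trials_per_run, n_voxels)) + signal

# Cross-validated decoding accuracy, one fold per held-out run.
scores = cross_val_score(LinearSVC(), patterns, labels,
                         groups=runs, cv=LeaveOneGroupOut())
print(f"mean decoding accuracy: {scores.mean():.2f} (chance = 0.33)")
```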
Extending this approach to the somatosensory cortices, Meyer et al. (2011)
demonstrated that visual stimuli could also orchestrate content-specific activity in
somatosensory cortices. Touch-implying videos that depicted haptic exploration
of common objects, such as rubbing a skein of yarn or handling a set of keys,
were shown to subjects. These video stimuli were then successfully decoded
from anatomically defined primary somatosensory cortices.
As detailed in the next chapter, we next reasoned that information from the visual
stimuli would likely have travelled up the convergence hierarchy and then back
down into a different sensory sector, either the auditory or the somatosensory.
Over the course of that journey, information specifying the modality-invariant
identity of the stimulus would have been abstracted from the raw energy pattern
transduced by the sensory organ. The recovery of this modality-invariant
information would have allowed a higher-order convergence zone to specify the
manner in which to retro-activate a cortex of a different modality. Once again,
performing machine learning analyses of fMRI data, we searched for modality-
invariant neural representations across audition and vision (Man et al., 2012). We
tested the hypothesis that the brain contains representations of objects that are
similar both upon seeing the object and hearing the object. Subjects were shown
video clips and audio clips that depicted various common objects. We then
performed a crossmodal classification, by training an algorithm to distinguish
between sound clips and then testing it to decode video clips (and vice versa).
Out of several a priori defined "multisensory" regions of interest that were
activated by both audio and video stimuli, only a region near the posterior superior
temporal sulcus was found to contain content-specific and modality-invariant
representations. A whole-brain searchlight analysis confirmed that the pSTS
uniquely and consistently contained object representations that were invariant
across vision and audition (Fig. 2.8).
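For illustration, a searchlight of this kind can be sketched as a loop over voxel centers, running the crossmodal train/test classification within a small sphere around each one; the toy volume, radius, and synthetic data below are assumptions for exposition, not the published analysis.

```python
# Toy whole-brain searchlight: for every voxel, gather the voxels within a
# small sphere and run crossmodal train/test classification inside it.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(2)
shape = (10, 10, 10)                         # tiny toy brain volume
coords = np.argwhere(np.ones(shape, bool))   # all voxel coordinates
n_trials = 20
labels = np.repeat(["bell", "dog"], n_trials // 2)

# Synthetic whole-volume patterns for the auditory and visual conditions.
audio = rng.normal(size=(n_trials,) + shape)
video = rng.normal(size=(n_trials,) + shape)

def searchlight_accuracy(center, radius=2):
    # Select voxels within `radius` of the center voxel.
    in_sphere = np.linalg.norm(coords - center, axis=1) <= radius
    idx = tuple(coords[in_sphere].T)
    X_train = audio[(slice(None),) + idx]    # trials x sphere-voxels
    X_test = video[(slice(None),) + idx]
    clf = LinearSVC().fit(X_train, labels)   # train on auditory patterns
    return clf.score(X_test, labels)         # test on visual patterns

# Accuracy map: one crossmodal decoding score per voxel center.
accuracy_map = np.array([searchlight_accuracy(c) for c in coords]).reshape(shape)
print(accuracy_map.mean())
```

The resulting accuracy map is then thresholded and tested against chance across subjects; in the study described above, only the pSTS neighborhood survived this test.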
Earlier fMRI studies on the extraction of semantic information across vision and
audition typically followed one of two designs: (i) a comparison of brain
activations for congruent vs. incongruent audiovisual stimuli, or (ii) an
investigation of crossmodal carry-over effects when an auditory stimulus was
! #"!
followed by a congruent or incongruent visual stimulus (or vice versa). A
comprehensive review of these studies (Doehrmann and Naumer 2008) identified
a common pattern: regions in the lateral temporal cortices, including pSTS and
STG, are more activated by audiovisual semantic congruency, whereas regions
in the inferior frontal cortices are more activated by audiovisual semantic
incongruency. The authors reasoned that the lateral temporal cortices organize
stable multisensory object representations, whereas the inferior frontal cortices
operate for the more cognitively demanding incongruent stimuli. This pattern is,
however, somewhat challenged by the finding of Taylor and colleagues
(2006) that perirhinal cortex, but not the pSTS, is sensitive to audiovisual
semantic congruency. Doehrmann and Naumer (2008) suggest that these and
other discrepancies may be due to variation in the stimuli used across studies.
The importance of stimulus control was emphasized in a review of crossmodal
object identification by Amedi and colleagues (2005). There is an important
distinction between "naturally" and "arbitrarily" associated crossmodal stimuli. In
the former, the dynamics of sight and sound inhere within the same source
object. The facial movements that accompany speech sounds are a good
example of such naturally crossmodal stimuli. In the latter, sight and sound are
associated by convention, such as when speech sounds accompany
printed words. One study highlighting this distinction found elevated functional
connectivity between the voice selective areas of auditory cortex and the fusiform
face area, for voice/face stimuli but not for voice/printed name stimuli (von
Kriegstein and Giraud, 2006). However, it remains unclear whether the elevated
functional connectivity was necessarily mediated by an intervening convergence
area — as it was during the training phase of the experiment — or if the face and
voice areas synchronized their activity through direct connections (as suggested
by (Amedi et al., 2005)).
We suggest that content-rich crossmodal patterns of activation require the
involvement of supramodal convergence zones. While direct projections between
sensory cortices have been identified and possibly have a behavioral function
regarding stimuli in the periphery of the visual field (Falchier et al., 2002; Falchier
et al., 2010), we assume that they are too sparse to specify the content-specific
patterns of activity we have observed. Direct connections between the primary
sensory cortices of different modalities seem to be even sparser (Borra and
Rockland, 2011). Rather, direct connections are likely involved in sub-threshold
modulation of ongoing activity (Lakatos et al., 2007) or response latency (Wang
et al., 2008; reviewed in Falchier et al., 2012). For the crossmodal orchestration
of content-specific representations across multiple brain voxels, we submit that
the involvement of convergence zones would be a more plausible account than
that of direct connections alone.
Our findings are relevant to the debate over the neural localization of semantic
congruency effects. While a modality-invariant representation in pSTS implies
crossmodal semantic congruency, the converse is not true: demonstrating a
crossmodal semantic congruency effect does not necessarily imply the existence
of a modality-invariant representation. Given that our study identified pSTS as
the unique location of audiovisual invariant representations, we suggest that it
may be the source of the semantic congruency signal. Upon presentation of a
bimodal stimulus, pSTS would search for a match in its store of audiovisual
invariant representations, and, upon succeeding or failing to find a match, would
announce its verdict of "congruent" or "incongruent" downstream to the other
regions of the brain that also show semantic congruency effects (Man et al.,
2012).
In the following sections we review additional evidence for multisensory
integration of information in the audiovisual, visuotactile, and audiotactile
domains, from the perspectives of bottom-up convergence and top-down
divergence.
2.2.3. Bottom-up crossmodal integration of information
It has long been known that certain regions and neurons of the mammalian
cerebral cortex are multisensory, a fact based on the identification of neurons
responsive to auditory, visual, and tactile stimuli (Jung et al., 1963). It is
reasonable to assume that nearly all neocortex exhibits some multisensory
activity (Calvert, Spence, and Stein, 2004; Ghazanfar and Schroeder, 2006; Driver
and Noesselt, 2008). However, multisensory activations and modulations permit
only a relatively weak and nonspecific characterization of multisensory
processing (Kayser and Logothetis, 2007). Our focus here is on multisensory
processing that retains information about the stimulus, or multisensory integration
of information.
At the highest levels of sensory convergence, single neurons in, for example, the
medial temporal lobes can selectively respond to the identities of certain people,
whether they are seen or heard (Quian Quiroga et al., 2009). A single neuron
robustly responded to the voice, printed name, and various pictures of Oprah
Winfrey, but not of any other person tested (with the exception of minor
responses to Whoopi Goldberg). Furthermore, the progression of anatomical
sensory convergence was recapitulated in the progressive increase in the
proportion of cells showing visual invariance at each stage (Fig. 2.9). From
parahippocampal cortex, to entorhinal cortex, to hippocampus, an increasing
proportion of cells in each region were selectively activated by pictures of a
particular person.
In a review of human neuroimaging studies of audiovisual and visuotactile
crossmodal object identification, three multisensory convergence regions stood
out prominently: the posterior superior temporal sulcus (STS) and superior
temporal gyrus (STG), for audiovisual convergence, and the lateral occipital
tactile-visual area (LOtv) as well as the intra-parietal sulcus (IPS) for visuotactile
convergence (Amedi et al., 2005). As noted earlier, each of these multisensory
regions is bi-directionally connected to the medial temporal cortices.
Audiovisual integration of information, to the level of semantic congruency, was
demonstrated in an electrophysiology study in the superior temporal cortex of the
macaque. Dahl and colleagues (2010) recorded single and multi-unit activity
during the presentation of audiovisual scenes. Congruent scenes (e.g.,
corresponding video and audio tracks of a conspecific's grunt) were more reliably
decoded than incongruent scenes (e.g., the audio of a grunt paired with the video
of a cage door).
Convergence of information across the visual and tactile modalities was found in
an fMRI study involving the perception of abstract clay objects. Lateral and
middle occipital cortices responded more actively to the visual presentation of
objects that had previously been touched, than to objects that had not yet been
touched (James et al., 2002). There is evidence that the lateral occipital cortices
may play a greater role in visuotactile integration for familiar objects, whereas the
intraparietal sulcus would be more involved with unfamiliar objects, during which
spatial imagery would be recruited (Lacey et al., 2009; but see also Zhang et al.,
2004, in which the level of activity in LOC correlated with the vividness of visual
imagery induced by tactile exploration). Another study decoded object-category-related
response patterns across sight and touch in the ventral temporal cortices
(Pietrini et al., 2004). A recent study of visuotactile semantic congruency varied
the order in which the modalities were presented and found greater responses in
lateral occipital cortex, fusiform gyrus, and intraparietal sulcus when visual stimuli
were followed by congruent haptic stimuli as opposed to the other way around,
leading the authors to conclude that vision may predominate over touch in those
regions (Kassuba et al., 2013).
Crossmodal studies of supramodal object representations across hearing and
touch are as yet few, although audiotactile semantic congruency effects have
been reported in the pSTS and the fusiform gyrus (Kassuba et al., 2012).
2.2.4. Top-down crossmodal information integration
There is a vast literature on the modulatory effects of arousal, attention, and
imagery on the sensory cortices; these effects are presumably mediated top-down, from
higher-order cortices toward earlier sensory cortices. In this section, once again,
we focus only on studies demonstrating informative retro-activations of sensory
cortices. In other words, we concentrate on activations that carry content-specific
information from areas of sensory convergence back to early sensory areas.
2.2.4.1. Decoding retro-activations in early visual cortices
Studies of visual imagery, in particular, provide rich evidence for internally
generated, content-specific activity extending into early visual cortices. These
activations are content-specific to the degree that they permit distinctions
between different imagined forms and between different form locations. Thirion
and colleagues (2006) built an inverse model of retinotopic cortex that was able
to reconstruct both the forms of visual imagery (simple geometric patterns) and
their locations (in the left or right visual field) from fMRI data (Fig. 2.10).
This leads naturally to the question of whether visual imagery and visual
perception share a code. Is the same visual cortical representation evoked when
seeing something and when imagining the same thing? Thirion et al. (2006)
found that classification across perception and imagery was indeed successful,
although only in a minority of their subjects. Commonality of representation could
be inferred by training a classification algorithm to distinguish among seen
objects and then testing it to decode imagined objects, or vice versa. Successful
performance indicates generalization of learning across the classes of "seen"
and "imagined". Following this approach, several groups have found abundant
evidence for common coding across visual perception and imagery. The decoded
stimuli include: "X"'s and "O"'s in lateral occipital cortices (Stokes et al., 2009);
the object categories "tools", "food", "famous faces", and "buildings" in ventral
temporal cortex (Reddy et al., 2010); the categories "objects", "scenes", "body
parts", and "faces" in their respective category-selective visual regions (e.g. faces
in fusiform face area), as well as the location of the stimulus (in the left or right
visual fields) in V1, V2, and V3 (Cichy et al., 2012). Taken together, these studies
suggest that visual imagery is a top-down process that reconstructs and uses, to
some extent, the patterns of activity that were established during veridical
perception. Remarkably, these reconstructions appear to extend all the way to
the initial site of cortical visual processing, V1.
The studies of Stokes et al. (2009), Reddy et al. (2010), and Cichy et al. (2012)
used auditory cues to trigger visual imagery. But whether auditory cues are
sufficient to trigger content-specific activity in visual cortices without necessarily
triggering visual imagery is an important follow-up question. After establishing
that various natural sounds could be decoded in V2 and V3, Vetter and
colleagues (2011) constrained visual imagery with an orthogonal working
memory task. Subjects heard natural sounds while memorizing word lists and
performing a delayed match-to-sample task. Despite these constraints on visual
imagery, the sounds were still successfully decoded in early visual cortices.
Perceptual expectation is another top-down cognitive process that modulates V1
in a content-specific manner. An fMRI study (Kok et al., 2012) used auditory
tones to cue subjects to expect that a particular visual stimulus would be shown
immediately afterward. High or low tones were associated with right- or left-
oriented contrast gratings, respectively. This prediction of visual orientation by
auditory tone was valid in 75% of trials (expected condition); the remaining trials
violated the prediction (unexpected condition). Performing MVPA in V1, the
expected visual stimuli were decoded more accurately than the unexpected
ones. (In V2 and V3, however, expectation had no effect on decoding accuracy.)
Interestingly, the gross level of activity in V1 was lower in the expected condition
than in the unexpected condition. These findings support an account of
expectation as producing a sharpening effect: the consistency and
informativeness of representation increases, even as overall level of activity
declines (Fig. 2.11). This sharpening effect was present even when the
expectation was not task-relevant — when subjects performed a contrast
judgment task that was unrelated to stimulus orientation — showing it to be
independent of task-related attention.
2.2.4.2. Decoding retro-activations in early auditory cortices
We noted earlier that we decoded sound-implying visual stimuli from primary
auditory cortices (Meyer et al., 2010). However, visual stimuli that did not imply
sounds failed to be decoded from early auditory cortices (Hsieh et al., 2012).
Other findings from auditory cortex resemble those from visual cortex: there is
substantial multisensory information conveyed back to relatively "early" sensory
cortices (Schroeder and Foxe 2005 Curr Op Neurobiol; but see also Kayser et
al., 2009 Hearing Res). Kayser and colleagues (2010 Curr Biol) recorded local
field potentials and spiking activity from monkey auditory cortex during the
presentation of various naturalistic stimuli. Compared to presenting sounds
alone, the presentation of sounds paired with congruent videos resulted in a
more consistent and informative pattern of neural activity across each
presentation of a particular stimulus. This consistency was evident in both firing
rates and spike timings. For incongruent pairings of sounds and videos, this
consistency across stimulus presentations was reduced.
Another example of higher-level semantic processing influencing crossmodal
activations was given by an fMRI study by Calvert et al. (1997; reviewed in Meyer
2011). Primary auditory cortex activity was observed when subjects watched a
person silently mouthing numbers but not when they watched nonlinguistic facial
movements that did not imply sounds. More recently, an fMRI study of the
McGurk effect (Benoit et al., 2010; reviewed in Meyer 2011) showed a stimulus-
specific response to visual input in auditory cortices. McGurk and MacDonald
(1976) reported a perceptual phenomenon in which the auditory presentation of
the syllable /ba/, combined with the visual presentation of the mouth movement
for /ga/, resulted in the percept of a third syllable, /da/. Benoit and colleagues
(2010), performing fMRI in primary auditory cortices, exploited this effect in a
repetition suppression paradigm. They presented a train of three audiovisual
congruent /pa/ syllables, followed by the presentation of either a fourth congruent
/pa/ or an incongruent McGurk stimulus (visual /ka/, auditory /pa/). Although the
auditory stimulus was the same /pa/ syllable in both conditions, the co-
presentation of an incongruent /ka/ mouth movement evoked greater activity
(release from adaptation) in A1. This demonstrates that primary auditory cortices
can respond differently to the same auditory stimulus when it is paired with
different visual stimuli.
The question of whether auditory cortices receive content-specific tactile
information also deserves attention. Multi-electrode recordings in macaque have
shown that tactile input may modulate A1 excitability by resetting phase
! $"!
oscillatory activity (Lakatos et al., 2007), potentially affecting subsequent auditory
processing in a stimulus-specific manner. A psychophysical study in humans
found that auditory stimuli interfered with a tactile frequency discrimination task
only when the frequencies of the auditory and tactile stimuli were similar (Yau et
al., 2009). This effect is likely mediated by the caudomedial or caudolateral belt
regions of auditory cortex (Foxe 2009). A high-field fMRI study of the auditory
cortex belt area in macaque showed greater enhancement of activity when a
sound stimulus was synchronously paired with a tactile stimulus, compared to
pairing with an asynchronous tactile stimulus (Kayser et al., 2005). This shows
the temporal selectivity of secondary auditory cortex for tactile stimuli. On the
other hand, also in macaque, Lemus and colleagues (2010) performed single
neuron recordings in A1 and found that it could not distinguish between two
tactile flutter stimuli.
2.2.4.3. Decoding retro-activations in early somatosensory cortices
In the study cited above, Lemus and colleagues (2010) also found that primary
somatosensory cortices failed to distinguish between two acoustic flutter stimuli.
However, in an fMRI study in humans, Etzel and colleagues (2008) successfully
decoded sounds made by either the hands or the mouth from activity patterns in
left S1 and bilateral S2. With respect to content-specific activations by visual
stimuli, a univariate fMRI study found selectivity for visually presented shapes
over visually presented textures in precentral sulcus (Stilla and Sathian 2008),
and the multivariate fMRI study of Meyer et al. (2011), cited earlier, decoded the
identities of visual objects from somatosensory cortices.
2.2.4.4. Decoding actions across sensory and motor systems
The discovery of neurons that activate both during action observation and action
execution in macaque premotor cortices (Gallese et al., 1996) has led to
proposals that they participate in a mirror neuron system that underlies action
recognition and comprehension. A pertinent question is whether these activations
constitute action-specific representations, and, if so, whether they represent a
given action similarly across observation and execution. Single cells recorded in
ventral premotor cortices were indeed found to distinguish between two actions,
whether they were seen, heard, or performed (Keysers et al., 2003).
In human fMRI studies the evidence for common coding of actions has been
somewhat mixed. Adaptation studies test the hypothesis that observing an action
after executing the same action (or vice versa) would result in reduced activity
compared to observing and executing different actions. Evidence supporting this
hypothesis has been found for certain goal-directed motor acts (Kilner et al.,
2009) and yet not for other intransitive, non-meaningful motor acts (Lingnau et
al., 2009). Evidence from MVPA studies has been similarly mixed (reviewed by
Oosterhof et al., 2012). Actions were successfully decoded across hearing and
performing them (Etzel et al., 2008) or across seeing and performing them
(Oosterhof et al., 2010); however, another group reported failed cross-
classification between seeing and performing an action (Dinstein et al., 2008).
The likely source of disagreement among these research findings, echoing the
debate over the localization of audiovisual semantic congruency effects, is the
variation in the stimuli employed. However, it seems safe to conclude that there
is at least some commonality of representation between action observation and
execution, with room for disagreement on the degree of fidelity between
representations.
2.3. From percepts to concepts
The evidence reviewed above establishes that (i) the multisensory association
cortices extract information regarding the congruence or identity of stimuli across
modalities in a content-specific manner; and (ii) the early sensory cortices
contain content-specific information regarding heteromodal stimuli (that is, stimuli
of a modality different from the sensory cortex surveyed). We suggest that these
findings are consonant with a neuroarchitectural framework of convergence-
divergence zones, in which content-rich information is integrated across the
different sensory modalities, as well as broadcast back to them.
The systematic linkage of sensory information has long been seen as playing a
role in conceptual processes by philosophers (Hume 1739; Prinz 2002) as well
as by cognitive neuroscientists (Damasio 1989c; Barsalou 1999; Martin 2007).
This position, known as concept empiricism, holds that abstract thought and
explicit sensory representation share an intimate and obligate link. For example,
the concept of a "bell" is determined by its sensory associations: it is composed
of the sound of its ding-dong, the sight of its curved and flared profile, and the
feel of its cold and rigid surface. Even highly abstract concepts, such as "truth" or
"disjunction", may be decomposed into sensory primitives (Barsalou 1999).
Concept empiricism has accumulated supporting evidence from several lines
of inquiry. Studies of experimental neuroanatomy in the final quarter of the 20th
century established the existence of pathways of sensory convergence and
divergence in the mammalian cerebral cortex. Largely as a result of studies
performed in the past decade, these neuroanatomical pathways were then found,
by electrophysiological and functional neuroimaging methods, to carry content-
rich information both up to integrating centers and back down to sensory cortices.
The present review focused on audition, vision, and touch, owing to the relative
dearth of studies for the other modalities. However, concepts also make
abundant reference to the modalities of taste, smell, vestibular sensation, and
interoception, and additional studies are needed on the integration among these
modalities. Moreover, while bi-modally invariant representations (across, e.g.
vision and audition or vision and touch) have been found, the identification of tri-
modally invariant representations would buttress the argument that they
participate in conceptual thought, and would establish a yet higher level atop the
hierarchy of convergence-divergence. I detail this work in Chapter 4.
We hope that the human evidence discussed in this Chapter can lead to further
in-depth exploration of the intricate connectivity of the mammalian cerebral cortex
using experimental neuroanatomy.
Chapter 2 Figures
Fig. 2.1. The multisynaptic pathways by which sensory information converges
upon the hippocampus, depicted on the rhesus monkey brain. Primary
association areas for the visual, auditory, and somatosensory modalities are labeled VA1,
AA1, and SA1, respectively. Secondary association areas are labeled similarly,
as VA2, AA2, and SA2. Reproduced, with permission, from (Van Hoesen et al.,
1972).
Fig. 2.2. Distribution of neurofibrillary tangles in the hippocampal formation in
Alzheimer's disease. Neurofibrillary tangles appear yellow in these Congo red-
stained sections from the brain of a human patient with Alzheimer's disease.
Tangles are selectively apparent in (A) subicular CA1 fields and in (B) layers II
and IV of entorhinal cortex. Reproduced, with permission, from (Hyman et al
1984).
Fig. 2.3. Phylogenetic expansion of higher-order association cortices and
increased complexity in sensory convergence architectures. The top sequence
shows the expansion of inferior parietal cortex across species. The bottom
sequence shows the variation in connectivity patterns across species, with cross-
sensory interactions mediated by higher levels of association cortices in man.
Reproduced, with permission, from (Catani & ffytche 2005).
Fig. 2.4. DTI of a multisensory association region. Tractography connecting a multisensory
activation cluster in the temporo-parietal cortices (angular gyrus, tan) with
regions responsive to sight-, sound-, and manipulation-implying words.
Reproduced, with permission, from (Bonner et al., 2013).
Fig. 2.5. Direct feedback connections to V1. A retrograde tracer study revealed
long-distance feedback projections to V1 from auditory cortices, multisensory
cortices, and a perirhinal area. Lines in black represent the dorsal stream; lines in
gray, the ventral stream. The thickness of lines represents the strength of
connection. Reproduced, with permission, from (Clavagnier et al. 2004).
Fig. 2.6. Large-scale neural frameworks of convergence and divergence. A) A
schematic illustration of the convergence-divergence zone framework. Red lines
indicate bottom-up connections, blue lines, top-down. From (Meyer & Damasio
2009). B) An illustration of transmodal nodes, in red, connecting visual regions, in
green, with auditory regions, in blue. Each concentric ring represents a different
synaptic level. Reproduced, with permission, from (Mesulam 1998).
Fig. 2.7. The decoding accuracy of silent videos from auditory cortex activity
correlated with subjects' ratings of the vividness of auditory imagery.
Reproduced, with permission, from (Meyer et al. 2010).
Fig. 2.8. Audiovisual-invariant representations were most consistently found in
the temporoparietal cortices, near the pSTS. Reproduced, with permission, from
(Man et al. 2012).
Fig. 2.9. The proportion of visually-responding neurons that were selective for a
particular person across several different pictures, shown as the portion of the
circles filled with purple, increased in the progression from parahippocampal
cortex to entorhinal cortex and finally to hippocampus. Reproduced, with
permission, from (Quian Quiroga et al. 2009).
Fig. 2.10. There is a rough likeness between the target visual pattern, shown in
black circles, and the pattern reconstructed from brain activations in early visual
cortex during imagery of the target pattern, shown as colored blobs. Reproduced,
with permission, from (Thirion et al. 2006).
Fig. 2.11. The expectation of a certain visual stimulus resulted in reduced BOLD
activity, but higher decoding accuracy for that stimulus, in V1. Adapted and
reproduced, with permission, from (Kok et al. 2012).
Chapter 2 References
Amedi A, Von Kriegstein K, Van Atteveldt NM, Beauchamp MS, Naumer MJ
(2005) Functional imaging of human crossmodal identification and object
recognition. Exp Brain Res 166:559–571.
Barsalou LW (1999) Perceptual symbol systems. Behav Brain Sci 22:577–660.
Borra E and Rockland KS (2011) Projections to early visual areas V1 and V2 in
the calcarine fissure from parietal association areas in the macaque. Front
Neuroanat 5:1–12.
Benoit MM, Raij T, Lin F-H, Jääskeläinen IP, Stufflebeam S (2010) Primary and
multisensory cortical activity is correlated with audiovisual percepts. Hum Brain
Mapp 31:526–538.
Bonner MF, Peelle JE, Cook PA, Grossman M (2013) Heteromodal conceptual
processing in the angular gyrus. NeuroImage 71:175–186.
Bullmore E, Sporns O (2009) Complex brain networks: graph theoretical analysis
of structural and functional systems. Nat Rev Neurosci 10:186–198.
Calvert GA (1997) Activation of Auditory Cortex During Silent Lipreading.
Science 276:593–596.
Calvert GA, Spence C, Stein BE eds. (2004) The handbook of multisensory
processes. MIT Press.
Cappe C, Barone P (2005) Heteromodal connections supporting multisensory
integration at low levels of cortical processing in the monkey. Eur J Neurosci
22:2886–2902.
Catani M, ffytche DH (2005) The rises and falls of disconnection syndromes.
Brain 128:2224–2239.
Cichy RM, Heinzle J, Haynes J-D (2012) Imagery and perception share cortical
representations of content and location. Cereb Cortex 22:372–380.
Clavagnier S, Falchier A, Kennedy H (2004) Long-distance feedback projections
to area V1: implications for multisensory integration, spatial awareness, and
visual consciousness. Cogn Affect Behav Neurosci 4:117–126.
Dahl CD, Logothetis NK, Kayser C (2010) Modulation of visual responses in the
superior temporal sulcus by audio-visual congruency. Front Int Neurosci 4:10.
Damasio A (1989a) Time-locked multiregional retroactivation: A systems-level
proposal for the neural substrates of recall and recognition. Cognition 33:25–62.
Damasio A (1989b) The brain binds entities and events by multiregional
activation from convergence zones. Neural Comput 1:123–132.
Damasio A (1989c) Concepts in the brain. Mind Lang 4:24–28.
Dinstein I, Gardner JL, Jazayeri M, Heeger DJ (2008) Executed and observed
movements have different distributed representations in human aIPS. J Neurosci
28:11231–11239.
Doehrmann O, Naumer MJ (2008) Semantics and the multisensory brain: how
meaning modulates processes of audio-visual integration. Brain Res 1242:136–
150.
Driver J, Noesselt T (2008) Multisensory interplay reveals crossmodal influences
on “sensory-specific” brain regions, neural responses, and judgments. Neuron
57:11–23.
Etzel JA, Gazzola V, Keysers C (2008) Testing simulation theory with cross-
modal multivariate classification of fMRI data. PloS One 3:e3690.
Falchier A, Cappe C, Barone P, Schroeder CE (2012) Sensory convergence in
low-level cortices. In: The New Handbook of Multisensory Processing (Stein BE,
ed), pp.67. Cambridge, MA: MIT Press.
Falchier A, Clavagnier S, Barone P, Kennedy H (2002) Anatomical evidence of
multimodal integration in primate striate cortex. J Neurosci 22:5749–5759.
Falchier A, Schroeder CE, Hackett TA, Lakatos P, Nascimento-Silva S, Ulbert I,
Karmos G, Smiley JF (2010) Projection from visual areas V2 and prostriata to
caudal auditory cortex in the monkey. Cereb Cortex 20:1529–1538.
Foxe JJ (2009) Multisensory integration: frequency tuning of audio-tactile
integration. Curr Biol 19:R373–5.
Gallese V, Fadiga L, Fogassi L, Rizzolatti G (1996) Action recognition in the
premotor cortex. Brain 119:593–609.
Geschwind N (1965a) Disconnexion syndromes in animals and man. I. Brain
88:237–294.
Geschwind N (1965b) Disconnexion syndromes in animals and man. II. Brain
88:585–644.
Ghazanfar AA, Schroeder CE (2006) Is neocortex essentially multisensory?
Trends Cogn Sci 10:278–285.
Hackett TA, Smiley JF, Ulbert I, Karmos G, Lakatos P, De la Mothe LA,
Schroeder CE (2007) Sources of somatosensory input to the caudal belt areas of
auditory cortex. Perception 36:1419–1430.
Hsieh P-J, Colas JT, Kanwisher N (2012) Spatial pattern of BOLD fMRI activation
reveals cross-modal information in auditory cortex. J Neurophysiol 107:3428–
3432.
Hume D (1739) A treatise of human nature. Project Gutenberg, eBook Collection.
http://www.gutenberg.org/files/4705/4705-h/4705-h.htm
Hyman BT, Van Hoesen GW, Damasio A, Barnes CL (1984) Alzheimer’s
disease: cell-specific pathology isolates the hippocampal formation. Science
225:1168–1170.
Hyman BT, Van Hoesen GW, Damasio A (1987) Alzheimer’s disease: glutamate
depletion in the hippocampal perforant pathway zone. Ann Neurol 22:37–40.
James TW, Humphrey GK, Gati JS, Servos P, Menon RS, Goodale MA (2002)
Haptic study of three-dimensional objects activates extrastriate visual areas.
Neuropsychologia 40:1706–1714.
Jiang W, Wallace MT, Jiang H, Vaughan JW, Stein BE (2001) Two cortical areas
mediate multisensory integration in superior colliculus neurons. J Neurophysiol
85:506–522.
Jones EG, Powell TP (1970) An anatomical study of converging sensory
pathways within the cerebral cortex of the monkey. Brain 93:793–820.
Jung R, Kornhuber H, Da Fonseca J (1963) Multisensory convergence on
cortical neurons. In: Progress in Brain Research (Moruzzi G, Fessard A, Jasper
HH, eds), pp.207–240. New York, NY: Elsevier.
Kassuba T, Klinge C, Hölig C, Röder B, Siebner HR (2013) Vision holds a
greater share in visuo-haptic object recognition than touch. NeuroImage 65:59–
68.
Kassuba T, Menz MM, Röder B, Siebner HR (2012) Multisensory Interactions
between Auditory and Haptic Object Recognition. Cereb Cortex 23:1097-1107.
Kayser C, Logothetis NK (2007) Do early sensory cortices integrate cross-modal
information? Brain Struct Func 212:121–132.
Kayser C, Logothetis NK, Panzeri S (2010) Visual enhancement of the
information representation in auditory cortex. Curr Biol 20:19–24.
Kayser C, Petkov CI, Augath M, Logothetis NK (2005) Integration of touch and
sound in auditory cortex. Neuron 48:373–384.
Kayser C, Petkov CI, Logothetis NK (2009) Multisensory interactions in primate
auditory cortex: fMRI and electrophysiology. Hearing Res 258:80–88.
Keysers C, Kohler E, Umiltà MA, Nanetti L, Fogassi L, Gallese V (2003)
Audiovisual mirror neurons and action recognition. Exp Brain Res 153:628–636.
Kilner JM, Neal A, Weiskopf N, Friston KJ, Frith CD (2009) Evidence of mirror
neurons in human inferior frontal gyrus. J Neurosci 29:10153–10159.
Kok P, Jehee J, de Lange F (2012) Less is more: expectation sharpens
representations in the primary visual cortex. Neuron 75:265–270.
Kravitz DJ, Saleem KS, Baker CI, Mishkin M (2011) A new neural framework for
visuospatial processing. Nat Rev Neurosci 12:217–230.
Kravitz DJ, Saleem KS, Baker CI, Ungerleider LG, Mishkin M (2013) The ventral
visual pathway: an expanded neural framework for the processing of object
quality. Trends Cog Sci 17:26–49.
Lacey S, Tal N, Amedi A, Sathian K (2009) A putative model of multisensory
object representation. Brain Topogr 21:269–274.
Lakatos P, Chen C-M, O’Connell MN, Mills A, Schroeder CE (2007) Neuronal
oscillations and multisensory interaction in primary auditory cortex. Neuron
53:279–292.
Lemus L, Hernández A, Luna R, Zainos A, Romo R (2010) Do sensory cortices
process more than one sensory modality during perceptual judgments? Neuron
67:335–348.
Lingnau A, Gesierich B, Caramazza A (2009) Asymmetric fMRI adaptation
reveals no evidence for mirror neurons in humans. Proc Natl Acad Sci U S A
106:9925–9930.
Man K, Kaplan JT, Damasio A, Meyer K (2012) Sight and sound converge to
form modality-invariant representations in temporoparietal cortex. J Neurosci
32:16629–16636.
Martin A (2007) The representation of object concepts in the brain. Annu Rev
Psychol 58:25–45.
McGurk H, MacDonald J (1976) Hearing lips and seeing voices. Nature 264:746–
748.
Mesulam M (1998) From sensation to cognition. Brain 121:1013–1052.
Meyer K (2011) Primary sensory cortices, top-down projections and conscious
experience. Prog Neurobiol 94:408–417.
Meyer K, Damasio A (2009) Convergence and divergence in a neural
architecture for recognition and memory. Trends Neurosci 32:376–382.
Meyer K, Kaplan JT, Essex R, Damasio H, Damasio A (2011) Seeing touch is
correlated with content-specific activity in primary somatosensory cortex. Cereb
Cortex 21:2113–2121.
Meyer K, Kaplan JT, Essex R, Webber C, Damasio H, Damasio A (2010)
Predicting visual stimuli on the basis of activity in auditory cortices. Nat Neurosci
13:1–26.
Mur M, Bandettini PA, Kriegeskorte N (2009) Revealing representational content
with pattern-information fMRI--an introductory guide. Soc Cogn Affect Neur
4:101–109.
Oosterhof NN, Tipper SP, Downing PE (2012) Viewpoint (In)dependence of
Action Representations: An MVPA Study. J Cognitive Neurosci 24:975–989.
Oosterhof NN, Wiggett AJ, Diedrichsen J, Tipper SP, Downing PE (2010)
Surface-based information mapping reveals crossmodal vision-action
representations in human parietal and occipitotemporal cortex. J Neurophysiol
104:1077–1089.
Pandya D (1995) Anatomy of the auditory cortex. Rev Neurol 151:486–494.
Pietrini P, Furey ML, Ricciardi E, Gobbini MI, Wu W-HC, Cohen L, Guazzelli M,
Haxby J V (2004) Beyond sensory images: Object-based representation in the
human ventral pathway. Proc Natl Acad Sci U S A 101:5658–5663.
Prinz J (2002) Furnishing the Mind: Concepts and Their Perceptual Basis. MIT
Press.
Quian Quiroga R, Kraskov A, Koch C, Fried I (2009) Explicit encoding of
multimodal percepts by single neurons in the human brain. Curr Biol 19:1308–
1313.
Reddy L, Tsuchiya N, Serre T (2010) Reading the mind’s eye: decoding category
information during mental imagery. NeuroImage 50:818–825.
Rockland KS, Ojima H (2003) Multisensory convergence in calcarine visual areas
in macaque monkey. Int J Psychophysiol 50:19–26.
Rockland KS, Pandya D (1979) Laminar origins and terminations of cortical
connections of the occipital lobe in the rhesus monkey. Brain Res 179:3–20.
Schroeder CE, Foxe J (2005) Multisensory contributions to low-level,
“unisensory” processing. Curr Op Neurobiol 15:454–458.
Seltzer B, Pandya D (1976) Some Cortical Projections to the Parahippocampal
Area in the Rhesus Monkey. Exp Neurol 160:146–160.
Seltzer B, Pandya D (1980) Converging visual and somatic sensory cortical input
to the intraparietal sulcus of the rhesus monkey. Brain Res 192:339–351.
Seltzer B, Pandya D (1994) Parietal, temporal, and occipital projections to cortex
of the superior temporal sulcus in the rhesus monkey: a retrograde tracer study.
J Comp Neurol 343:445–463.
Stein BE, ed. (2012) The New Handbook of Multisensory Processing. MIT Press.
Stilla R, Sathian K (2008) Selective visuo-haptic processing of shape and texture.
Hum Brain Mapp 29:1123–1138.
Stokes M, Thompson R, Cusack R, Duncan J (2009) Top-down activation of
shape-specific population codes in visual cortex during mental imagery. J
Neurosci 29:1565–1572.
Taylor KI, Moss HE, Stamatakis EA, Tyler LK (2006) Binding crossmodal object
features in perirhinal cortex. Proc Natl Acad Sci U S A 103:8239–8244.
Thirion B, Duchesnay E, Hubbard E, Dubois J, Poline J-B, Lebihan D, Dehaene
S (2006) Inverse retinotopy: inferring the visual content of images from brain
activation patterns. NeuroImage 33:1104–1116.
Van Hoesen GW, Damasio A (1987) Neural correlates of cognitive impairment in
Alzheimer’s disease. In: Handbook of Physiology (Mountcastle VB, Plum F,
Geiger SR, eds), pp.871–898. Bethesda, Maryland: American Physiological
Society.
Van Hoesen GW, Hyman BT, Damasio A (1991) Entorhinal cortex pathology in
Alzheimer’s disease. Hippocampus 1:1–8.
Van Hoesen GW, Pandya D (1975a) Some connections of the entorhinal (area
28) and perirhinal (area 35) cortices of the rhesus monkey I temporal lobe
afferents. Brain Res 95:1–24.
Van Hoesen GW, Pandya D (1975b) Some connections of the entorhinal (area
28) and perirhinal (area 35) cortices of the rhesus monkey III efferent
connections. Brain Res 95:39–59.
Van Hoesen GW, Pandya DN, Butters N (1972) Cortical afferents to the
entorhinal cortex of the Rhesus monkey. Science 175:1471–1473.
Van Hoesen GW, Pandya DN, Butters N (1975) Some connections of the
entorhinal (area 28) and perirhinal (area 35) cortices of the rhesus monkey. II.
Frontal lobe afferents. Brain Res 95:25–38.
Vetter P, Smith FW, Muckli L (2011) Decoding natural sounds in early visual
cortex. J Vis 11:779–779.
Vogt BA, Pandya D (1978) Cortico-cortical connections of somatic sensory cortex
(areas 3, 1 and 2) in the rhesus monkey. J Comp Neurol 177:179–191.
Von Kriegstein K, Giraud A-L (2006) Implicit multisensory associations influence
voice recognition. PLoS Biol 4:e326.
Wang Y, Celebrini S, Trotter Y, Barone P (2008) Visuo-auditory interactions in
the primary visual cortex of the behaving monkey: electrophysiological evidence.
BMC Neurosci 9:79.
Yau JM, Olenczak JB, Dammann JF, Bensmaia SJ (2009) Temporal frequency
channels are linked across audition and touch. Curr Biol 19:561–566.
Zhang M, Weisser VD, Stilla R, Prather SC, Sathian K (2004) Multisensory
cortical processing of object shape and its relation to mental imagery. Cogn Aff
Behav Neurosci 4:251–259.
CHAPTER 3. Neural convergence of sight and sound to form audiovisual
invariant representations
(Corresponding publication: Man K, Kaplan JT, Damasio A, Meyer K (2012) Sight
and sound converge to form modality-invariant representations in
temporoparietal cortex. The Journal of Neuroscience 32:16629–16636.)
Abstract
People can identify objects in the environment with remarkable accuracy,
irrespective of the sensory modality they use to perceive them. This suggests
that information from different sensory channels converges somewhere in the
brain to form modality-invariant representations, i.e., representations that reflect
an object independently of the modality through which it has been apprehended.
In this functional magnetic resonance imaging study, we first identified brain
areas in human subjects that responded to both visual and auditory stimuli and
then used crossmodal multivariate pattern analysis to evaluate the neural
representations in these regions for content-specificity (i.e., do different objects
evoke different representations?) and modality-invariance (i.e., do the sight and
the sound of the same object evoke a similar representation?). While several
areas became activated in response to both auditory and visual stimulation, only
the neural patterns recorded in a region around the posterior part of the superior
temporal sulcus displayed both content-specificity and modality-invariance. This
region thus appears to play an important role in our ability to recognize objects in
our surroundings through multiple sensory channels and to process them at a
supra-modal (i.e., conceptual) level.
3.1. Introduction
Whether we see a bell swing back and forth or, instead, hear its distinctive
ding-dong, we easily recognize the object in both cases. Upon recognition, we
are able to access the wide conceptual knowledge we possess about bells and
we use this knowledge to generate motor behaviors and verbal reports. The fact
that we are able to do so independently of the perceptual channel through which
we were stimulated suggests that the information provided by different channels
converges, at some stage, into one or several modality-invariant neural
representations of the object.
Neuroanatomists have long identified areas of multisensory convergence in the
monkey brain, for instance, in the lateral prefrontal and premotor cortices, the
intraparietal sulcus, the parahippocampal gyrus, and the posterior part of the
superior temporal sulcus (pSTS) (Seltzer and Pandya, 1978, 1994). Lesion and
tracer studies have shown that the pSTS region not only receives projections
from visual, auditory, and somatosensory association cortices but returns
projections to those cortices as well (Seltzer and Pandya, 1991; Barnes and
Pandya, 1992). Also, electrophysiological studies have identified bi- and tri-modal
neurons in the pSTS (Benevento et al., 1977; Bruce et al., 1981; Hikosaka et al.,
1988). Recent functional neuroimaging studies in humans are in line with the
anatomical and electrophysiological evidence and have located areas of
multisensory integration in the lateral prefrontal cortex, premotor cortex, posterior
parietal cortex, and the pSTS region (for reviews, see e.g. Calvert 2001, Amedi
et al. 2005; Beauchamp 2005; Doehrmann and Naumer 2008; Driver & Noesselt
2008). These observations alone, however, do not address two important
questions. First, are the neural patterns established in these multimodal brain
regions content-specific? In other words, do they reflect the identity of a sensory
stimulus, rather than a more general aspect of the perceptual process? Second,
are the neural patterns modality-invariant? In other words, does an object evoke
similar neural response patterns when it is apprehended through different
modalities?
In the present study, we used multivariate pattern analysis (MVPA) of functional
magnetic resonance imaging (fMRI) data to probe multimodal regions for neural
representations that were both content-specific and modality-invariant. We first
performed a univariate fMRI analysis to identify brain regions that were activated
by both visual and auditory stimuli, and these regions corresponded well with
those found in previous studies. Next, we tested the activity patterns in these
regions for content-specificity by asking whether a machine-learning algorithm
could predict from a specific pattern which of several audio or video clips a
subject had perceived. Finally, we tested for modality-invariance by decoding the
identities of objects not only within, but across modalities: the algorithm was
trained to distinguish neural patterns recorded during visual trials and used to
classify neural patterns recorded during auditory trials. The crossmodal MVPA
analysis revealed that out of all the multisensory regions identified, only the pSTS
region contained neural representations that were both content-specific and
modality-invariant.
3.2. Materials and Methods
Subjects
Nine right-handed subjects were originally enrolled in the study. One subject was
excluded from the analysis due to excessive head movement during the scan.
The data presented come from the remaining eight participants, five female and
three male. The experiment was undertaken with the informed written consent of
each subject.
Stimuli
Audiovisual clips depicting a church bell, a gong, a typewriter, a jackhammer, a
pneumatic drill, and a chainsaw were downloaded from www.youtube.com
(Movie 1). All clips were truncated to five seconds. Two additional sets of clips
were generated from the original versions: an auditory set containing the audio
tracks presented on a black screen and a visual set containing the video tracks
presented in silence. All audio tracks were peak-leveled using The Levelator 2
software (The Conversations Network).
Offline Stimulus Familiarization
Prior to scanning, subjects watched the six original (i.e., audiovisual) stimuli on a
loop for ten minutes in order to familiarize themselves with the correspondence
between the auditory and visual content of the clips.
Online Stimulus Presentation.
Once inside the fMRI scanner, subjects were only exposed to clips from the
auditory and visual sets, presented in separate auditory and visual runs that
alternated for a total of eight runs. Each run contained 36 stimulus presentations:
each of the six clips was shown six times in randomized order with no back-to-
back repeats. One clip was presented every 11 s. A sparse-sampling scanning
paradigm was used to ensure that all clips were presented in the absence of
scanner noise: a single whole-brain volume was acquired starting two seconds
after the end of each clip. The two-second long image acquisition was followed
by a two-second pause, after which the next trial began. Timing and presentation
of the video clips was controlled with MATLAB 7.9.0 (The Mathworks), using the
freely available Psychophysics Toolbox 3 software (Brainard, 1997). Video clips
were displayed on a rear-projection screen at the end of the scanner bore which
subjects viewed through a mirror mounted on the head coil. Sound intensity of
the audio clips was adjusted to the loudest comfortable level for each subject.
Participants were instructed to keep their eyes open during all runs.
Prior to each auditory run, subjects were told to “imagine seeing in your
mind’s eye the images that go with these sound clips as vividly as possible.”
Analogously, each visual run was preceded by the instruction to “imagine hearing
in your mind’s ear the sounds that go with these video clips as vividly as
possible.” We included this imagery instruction based on findings from a previous
study in which subjects were instructed to passively watch sound-implying (but
silent) video clips (Meyer et al., 2010). Those subjects reported that they
! &*!
spontaneously generated auditory imagery, and the reported vividness of their
imagery correlated with the classification accuracies of the evoking stimuli from
neural activity in the early auditory cortices.
In addition to the eight MVPA runs, there were two runs of a functional localizer,
placed at the beginning and the middle of the scanning session, respectively.
These runs served as an independent dataset on which we performed a
conventional univariate analysis to identify ROIs for MVPA. The functional
localizer runs employed a slow event-related design with continuous image
acquisition. Each run contained 24 stimulus presentations (two repetitions each
of the six auditory clips and the six visual clips, presented in randomized order
with no back-to-back repeats). The duration of each trial was randomly jittered up
to one second about the mean duration of 15 seconds.
Image Acquisition
Images were acquired with a 3-Tesla Siemens MAGNETON Trio System. Echo-
planar volumes for MVPA runs were acquired with the following parameters: TR
= 11,000 ms, TA = 2,000 ms, TE = 25 ms, flip angle = 90 degrees, 64 x 64
matrix, in-plane resolution 3.0 mm x 3.0 mm, 41 transverse slices, each 2.5 mm
thick, covering the whole brain. Volumes for functional localizer runs were
acquired with the same parameters except in continuous acquisition with TR =
2,000 ms. We also acquired a structural T1-weighted MPRAGE for each subject
(TR = 2,530 ms, TE = 3.09 ms, flip angle = 10 degrees, 256 x 256 matrix, 208
coronal slices, 1 mm isotropic resolution).
Univariate Analysis of Functional Localizer Runs
We performed a univariate analysis of the functional localizer runs using FSL
(Smith et al., 2004). Data preprocessing involved the following steps: motion
correction (Jenkinson et al., 2002), brain extraction (Smith, 2002), slice-timing
correction, spatial smoothing with a 5-mm full-width at half-maximum Gaussian
kernel, high-pass temporal filtering using Gaussian-weighted least-squares
straight line fitting with sigma (standard deviation of the Gaussian distribution)
equal to 60s, and pre-whitening (Woolrich et al., 2001).
The two stimulus types, auditory and visual, were modeled separately with
two regressors derived from a convolution of the task design and a gamma
function to represent the hemodynamic response function. Motion correction
parameters were included in the design as additional regressors. The two
functional localizer runs for each participant were combined into a second-level
fixed-effects analysis, and a third-level inter-subject analysis was performed
using a mixed-effects design.
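To make this step concrete, the sketch below builds two such regressors by convolving boxcars of trial onsets with a gamma-variate HRF and stacking them into a design matrix. It is a minimal Python/NumPy outline rather than FSL's own implementation; the run length, onset times, and HRF parameters are illustrative placeholders.

```python
import numpy as np
from scipy.stats import gamma

TR = 2.0        # localizer repetition time (s), as described above
N_VOLS = 180    # assumed number of volumes in one localizer run

def gamma_hrf(tr, duration=30.0, shape=6.0, scale=1.0):
    """Simple gamma-variate hemodynamic response function, peaking at about 5-6 s."""
    t = np.arange(0.0, duration, tr)
    h = gamma.pdf(t, a=shape, scale=scale)
    return h / h.sum()

def make_regressor(onsets, duration, tr, n_vols):
    """Boxcar over the stimulus events, convolved with the HRF."""
    frame_times = np.arange(n_vols) * tr
    boxcar = np.zeros(n_vols)
    for onset in onsets:
        boxcar[(frame_times >= onset) & (frame_times < onset + duration)] = 1.0
    return np.convolve(boxcar, gamma_hrf(tr))[:n_vols]

# Illustrative onset times (s) for auditory and visual trials in one run
auditory_reg = make_regressor([15, 45, 105, 165], duration=5.0, tr=TR, n_vols=N_VOLS)
visual_reg = make_regressor([30, 75, 135, 195], duration=5.0, tr=TR, n_vols=N_VOLS)

# Design matrix: the two task regressors plus an intercept; the six motion
# parameters from motion correction would be appended as further columns.
X = np.column_stack([auditory_reg, visual_reg, np.ones(N_VOLS)])
```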
Registration of the functional data to the high-resolution anatomical image of
each subject and to the standard Montreal Neurological Institute (MNI) brain was
performed using the FSL FLIRT tool (Jenkinson and Smith, 2001). Functional
images were aligned to the high-resolution anatomical image using a six-degree-
of-freedom linear transformation. Anatomical images were registered to the MNI-
152 brain using a 12-degree-of-freedom affine transformation.
Two activation maps were defined using the functional localizer data: the
areas activated during the presentation of video clips, as compared to rest, and
the areas activated during the presentation of audio clips, as compared to rest.
The visual and auditory activation maps were thresholded with FSL’s cluster
thresholding algorithm using a minimum Z-score of 2.3 (P < 0.01) and a cluster
size probability of P < 0.05. An audiovisual map was defined by the overlap of the
auditory and visual activation maps. In order to detect the greatest number of
shared voxels in the overlap, the component maps were Z-score thresholded as
above but received no cluster size thresholding. Each map was normalized,
overlapping voxels were summed, and the resultant map was smoothed with a 5-
mm full-width at half-maximum Gaussian kernel.
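A rough sketch of this overlap computation is given below, using nilearn purely for illustration (the analysis itself was carried out in FSL); the synthetic input images and the choice of normalizing each map by its own maximum are assumptions made for the example.

```python
import numpy as np
import nibabel as nib
from nilearn import image

# Synthetic stand-ins for the auditory>rest and visual>rest Z-statistic maps;
# in practice these would be loaded from the localizer output.
rng = np.random.default_rng(0)
affine = np.eye(4)
aud_z = nib.Nifti1Image(rng.normal(0, 1, (20, 20, 20)) + 2.0, affine)
vis_z = nib.Nifti1Image(rng.normal(0, 1, (20, 20, 20)) + 2.0, affine)

# Threshold at Z > 2.3, normalize each map (here: by its own maximum, an
# assumption), keep voxels active in both maps, and sum their values.
aud_thr = image.math_img("(img > 2.3) * img / img.max()", img=aud_z)
vis_thr = image.math_img("(img > 2.3) * img / img.max()", img=vis_z)
overlap = image.math_img("a * (b > 0) + b * (a > 0)", a=aud_thr, b=vis_thr)

# Smooth the overlap map with a 5-mm FWHM Gaussian kernel, as in the text.
overlap_smoothed = image.smooth_img(overlap, fwhm=5)
```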
Voxel Selection for Multivariate Pattern Analysis
The six ROIs for the MVPA analysis were generated based on the largest activity
cluster from the group-level auditory activation map (located in the superior
temporal lobe), the largest cluster from the visual activation map (located in the
occipital lobe), and the four largest clusters in the audiovisual overlap map
(located, respectively, in lateral premotor cortex, medial premotor cortex, anterior
insula, and around the posterior superior temporal sulcus). In order to define
these ROIs in each individual subject, we centered a sphere at the peak voxel of
each cluster on the group-level maps and then warped these spheres into the
functional spaces of the individual subjects. All spheres had a radius of 20 mm,
except the one for the anterior insula, which, due to the smaller size of the
activity cluster, had a radius of 12 mm. Within the spheres located in the
functional maps of each subject, we then selected the 500 voxels with the
highest Z-scores from that subject’s relevant localizer. This method allowed us to
select voxels from the same anatomical regions across subjects, but with
subject-specific sensitivity to activation.
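The voxel-selection step can be outlined as follows; this is an illustrative NumPy sketch with hypothetical array names and grid sizes, not the code used for the study.

```python
import numpy as np

def select_top_voxels(z_map, sphere_mask, n_voxels=500):
    """Boolean mask of the n_voxels highest-Z voxels inside the sphere.

    z_map       : 3-D array of the subject's localizer Z-scores (functional space)
    sphere_mask : boolean 3-D array for the sphere warped from the group-level peak
    """
    z_in_sphere = np.where(sphere_mask, z_map, -np.inf)   # ignore voxels outside the sphere
    order = np.argsort(z_in_sphere, axis=None)[::-1]      # flattened voxel indices, descending Z
    selected = np.zeros(z_map.size, dtype=bool)
    selected[order[:n_voxels]] = True
    return selected.reshape(z_map.shape) & sphere_mask

# Example with synthetic data: a 20-mm-radius sphere spans roughly 7 voxels
# in radius at the acquired in-plane resolution.
rng = np.random.default_rng(0)
z_map = rng.normal(size=(64, 64, 41))
x, y, z = np.ogrid[:64, :64, :41]
sphere_mask = (x - 32) ** 2 + (y - 32) ** 2 + (z - 20) ** 2 <= 7 ** 2
roi_mask = select_top_voxels(z_map, sphere_mask)
```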
Multivariate Pattern Analysis
Within the six ROIs described above we performed both intramodal and
crossmodal MVPA. For both intramodal and crossmodal classification, all
possible two-way discriminations among the six stimuli were carried out (n = 15).
For intramodal classification, we performed a leave-one-run-out cross-validation
procedure: a classifier was trained on data from three of the four runs of a given
modality and then tested on the fourth run. This procedure was repeated four
times with the data from each run serving as the testing set once. For
crossmodal classification, a classifier was either trained on all four auditory runs
and tested on all four video runs, or vice versa. MVPA was performed using the
PyMVPA software package (Hanke et al., 2009) in combination with LibSVM’s
implementation of the linear support vector machine (Chang and Lin, 2011). Data
from the eight MVPA runs of each subject were concatenated and motion-
corrected to the middle volume of the entire time series, then linearly detrended
and converted to Z-scores by run.
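The logic of the intramodal and crossmodal classification can be sketched as follows. The analysis itself used PyMVPA with LibSVM's linear support vector machine; in this illustration scikit-learn's LinearSVC stands in for it, the data are synthetic, and the array shapes simply mirror the design described above (four runs of 36 trials per modality, 500 voxels per ROI).

```python
import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC

def zscore_by_run(X, runs):
    """Z-score each voxel within each run, as in the preprocessing described above."""
    Xz = X.astype(float).copy()
    for r in np.unique(runs):
        m = runs == r
        Xz[m] = (X[m] - X[m].mean(axis=0)) / X[m].std(axis=0)
    return Xz

def mean_pairwise_accuracy(X_train, y_train, X_test, y_test):
    """Average accuracy over all 15 two-way discriminations among the six stimuli."""
    accs = []
    for a, b in combinations(np.unique(y_train), 2):
        tr, te = np.isin(y_train, [a, b]), np.isin(y_test, [a, b])
        clf = LinearSVC(max_iter=10000).fit(X_train[tr], y_train[tr])
        accs.append(clf.score(X_test[te], y_test[te]))
    return float(np.mean(accs))

# Synthetic stand-ins: 4 runs x 36 trials per modality, 500 voxels in the ROI.
rng = np.random.default_rng(0)
X_aud, X_vis = rng.standard_normal((144, 500)), rng.standard_normal((144, 500))
y = np.tile(np.arange(6), 24)            # stimulus labels for the six objects
runs = np.repeat(np.arange(4), 36)       # run labels

X_aud, X_vis = zscore_by_run(X_aud, runs), zscore_by_run(X_vis, runs)

# Crossmodal: train on all auditory runs, test on all visual runs (or vice versa).
cross_acc = mean_pairwise_accuracy(X_aud, y, X_vis, y)

# Intramodal (leave-one-run-out): train on three visual runs, test on the held-out run.
intra_accs = [mean_pairwise_accuracy(X_vis[runs != r], y[runs != r],
                                     X_vis[runs == r], y[runs == r])
              for r in range(4)]
```

With the synthetic data the accuracies hover around the two-way chance level of 0.5; applied to real ROI data, the same procedure yields the per-subject accuracies that enter the group statistics reported below.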
Whole-Brain Searchlight Analyses
In order to conduct an ROI-independent search for brain regions containing
information relevant to intra- or crossmodal classification, we performed a
searchlight procedure (Kriegeskorte et al., 2006). In each subject, a 6-way
classifier (trained and tested on all stimuli simultaneously) was repetitively
applied to small spheres (r = 8 mm) centered on every voxel of the brain. The
classification accuracy for each sphere was mapped to its center voxel to obtain
a whole-brain accuracy map for intramodal and crossmodal classification. In
order to illustrate which brain regions showed high performance across subjects,
we thresholded the individual maps at the 95th percentile, warped them into the
standard space, and summed them to create an overlap map.
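One way to sketch the crossmodal variant of this procedure is with nilearn's SearchLight estimator, shown below purely for illustration (the study used its own searchlight implementation); the tiny synthetic volume, the trial counts, and the single train-auditory/test-visual split are assumptions made for the example.

```python
import numpy as np
import nibabel as nib
from nilearn.decoding import SearchLight
from sklearn.svm import LinearSVC

# Tiny synthetic dataset: 72 trial volumes (36 auditory, then 36 visual) on a
# 10 x 10 x 10 grid, with the six stimulus labels cycling through the trials.
rng = np.random.default_rng(0)
shape, n_trials = (10, 10, 10), 72
bold_imgs = nib.Nifti1Image(rng.standard_normal(shape + (n_trials,)), np.eye(4))
brain_mask = nib.Nifti1Image(np.ones(shape, dtype=np.int8), np.eye(4))
labels = np.tile(np.arange(6), n_trials // 6)
modality = np.repeat(["auditory", "visual"], n_trials // 2)

# Crossmodal split: the six-way classifier is trained on auditory trials and
# tested on visual trials within every sphere.
train_idx = np.where(modality == "auditory")[0]
test_idx = np.where(modality == "visual")[0]

searchlight = SearchLight(
    mask_img=brain_mask,
    radius=8.0,                       # 8-mm spheres, as in the text
    estimator=LinearSVC(),
    cv=[(train_idx, test_idx)],
)
searchlight.fit(bold_imgs, labels)
accuracy_map = searchlight.scores_    # per-voxel accuracy of the sphere centered there

# The individual accuracy maps would then be thresholded at the 95th percentile,
# warped to standard space, and summed across subjects to form the overlap map.
```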
Statistical Analyses
Given that our hypothesis was directional — classifier performance on two-way
discriminations should be higher than the chance result of 0.5 — we employed
one-tailed t-tests across all eight subjects to assess the statistical significance of
our results. When comparing classifier performances to each other (e.g., those
obtained in different ROIs), we used two-tailed, paired t-tests across all eight
subjects.
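For illustration, the directional test against chance and the paired comparison between ROIs could be computed as follows; the accuracy values are invented for the example and are not the study's data.

```python
import numpy as np
from scipy import stats

# Per-subject mean two-way accuracies in two ROIs (illustrative numbers, n = 8).
psts_cross = np.array([0.62, 0.58, 0.55, 0.64, 0.60, 0.57, 0.63, 0.61])
ai_cross = np.array([0.51, 0.49, 0.52, 0.48, 0.50, 0.47, 0.53, 0.46])

# One-tailed one-sample t-test against the two-way chance level of 0.5.
t, p_two_tailed = stats.ttest_1samp(psts_cross, popmean=0.5)
p_one_tailed = p_two_tailed / 2 if t > 0 else 1 - p_two_tailed / 2

# Two-tailed paired t-test comparing classifier performance across ROIs.
t_paired, p_paired = stats.ttest_rel(psts_cross, ai_cross)

print(f"vs. chance: t = {t:.2f}, one-tailed p = {p_one_tailed:.4f}")
print(f"pSTS vs. aI: t = {t_paired:.2f}, two-tailed p = {p_paired:.4f}")
```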
3.3. Results
Functional Localizer Scans
At the group level, the presentation of auditory stimuli, as compared to rest,
yielded a prominent activity cluster in the auditory cortices (Heschl’s gyrus,
planum temporale, planum polare, and the surrounding regions of the superior
temporal gyrus), as well as smaller foci throughout the brain (Table 1, Fig. 3.1).
The presentation of visual stimuli, as compared to rest, revealed a main cluster in
the visual cortices (medial and lateral surfaces of the occipital lobe) and, again,
smaller foci dispersed throughout the brain. An overlap of the two activation
maps revealed several brain regions that responded to both auditory and visual
stimulation; the four largest clusters were located in lateral premotor cortex,
medial premotor cortex, anterior insula, and an area around the pSTS, which
included parts of the superior temporal gyrus and the inferior parietal lobule
(Table 3.1, Fig. 3.1).
Intramodal Classification of Stimuli
From the two unimodal and the four multimodal clusters described above, voxels
were selected in a subject-specific manner (see Materials and Methods) to define
six regions of interest (ROIs) for MVPA: auditory cortices (AC), visual cortices
(VC), lateral premotor cortex (lPM), medial premotor cortex (mPM), anterior
insula (aI), and pSTS. As expected, audio clips were successfully classified
within AC: averaged across all subjects and all two-way discriminations among
the six stimuli, classification performance was 0.838, significantly higher than the
chance level of 0.5 (P < 1 x 10^-6; one-tailed t-test, n = 8; Table 3.2, Fig. 3.2A).
Similarly, classification of video clips within VC yielded an average performance
of 0.843 (P < 1 x 10^-4). Classification of video clips from AC and classification of
audio clips from VC yielded more modest though still significant results (Fig.
3.2A). Audio and video clips were also well classified from the pSTS region
(audio clips: 0.765, P < 1 x 10^-5; video clips: 0.691, P < 1 x 10^-4; Fig. 3.2B).
Classification performance was significantly higher in the pSTS than in any other
multimodal ROI (P < 0.001 in all cases; two-tailed, paired t-tests).
In each subject, we performed a whole-brain searchlight analysis to
conduct an ROI-independent search for regions classifying audio or video clips,
respectively. The individual subjects’ searchlight maps were thresholded at the
95th percentile, warped to the standard space, and summed to create a group-
level overlap map. Highest classification accuracy for audio and video clips was
found in auditory and visual cortices, respectively (Fig. 3.2C).
Crossmodal Classification of Stimuli
Crossmodal classification performance in the pSTS region was significantly
better than chance (0.601, P < 0.01) and significantly better than in all other
multimodal ROIs, which were at chance level (lPM: 0.505, mPM: 0.506, aI: 0.497;
P > 0.05 in each case; Table 3.2, Fig. 3.3A). Ten out of the fifteen pairwise
discriminations among the six items were significantly above chance (Fig. 3.3B).
Crossmodal classification was modest, but also significant, in the auditory ROI
(0.547, P < 0.01); it was at chance level in the visual ROI (0.490, P > 0.05). The
converse crossmodal classifications, training on auditory and testing on visual
stimuli, yielded comparable results.
The whole-brain crossmodal searchlight analysis confirmed that the only
region consistently containing crossmodal information was located around the
pSTS, near the junction of the temporal and parietal lobes, almost completely
lateralized to the right hemisphere (Fig. 3.3C). This region was similar in location
to the pSTS cluster identified in the audiovisual overlap map presented in Fig.
3.1. In spite of the lateralization evident in the searchlight, the classification
accuracies between the right and left pSTS ROIs were not significantly different
(P > 0.05).
3.4. Discussion
We used crossmodal MVPA to investigate the response patterns of
various brain regions to corresponding audio and video clips of common objects.
Although several regions were activated by stimuli of both modalities, both the
ROI and the searchlight analyses indicate that the pSTS region may be the
unique site of successful crossmodal classification. This suggests the pSTS is
different from other multimodal brain areas in that it holds neural representations
of common objects that are both content-specific and modality-invariant.
Previous studies have aimed to identify brain regions that engage in
multisensory processing using a variety of criteria: a region may simply be
activated by more than one modality of stimulation; or its response to a unimodal
stimulus may be enhanced or depressed by the addition of a stimulus of a
different modality; or, more stringently, a region may be more strongly activated
by a multimodal stimulus than by the summed activations from the unimodal
stimuli presented separately (superadditivity) (Calvert, 2001; Beauchamp, 2005;
Laurienti et al., 2005). However, none of these approaches directly address the
question of whether the activation pattern found in a multisensory region is
specific to the identity of a stimulus and whether it is modality-invariant. Modality-
invariance is more directly addressed by the semantic congruency effect: a brain
area would display a higher level of activity when responding to the sight and
sound of the same object than when responding to the sight and sound of
different objects. Doehrmann and Naumer (2008) reviewed several human fMRI
studies manipulating semantic congruency and identified a general pattern of
activation in lateral temporal cortex for semantically congruent stimuli and in
inferior frontal cortex for semantically incongruent stimuli. Our results are in
general agreement with this trend but allow for conclusions that go beyond those
that can be drawn from a semantic congruency effect: activity in the pSTS
reflects which specific object was presented, not merely that the audiovisual pairing was congruent.
MVPA heretofore has been used to identify neural patterns which
generalize across formats within the same sensory modality, e.g. between words
and pictures (Shinkareva et al., 2011), and also across modalities, e.g. between
action observation and action execution (Etzel et al., 2008) or between visual and
auditory displays of emotion (Peelen et al., 2010). Until now, however, there has
not been an explicit test for conserved representations of object identities across
different sensory modalities (although Pietrini et al. (2004) found conserved
representations of object categories across the visual and tactile modalities).
Degrees of Modality Invariance
Our claim that neural patterns in the pSTS display modality-invariance may
appear tempered by the fact that crossmodal classification accuracies are not
perfect. While many MVPA studies attempt to predict perceptual stimuli from
neural activity in early sensory cortices, the current investigation was concerned
with activity patterns in association cortices. Conceivably, the topographical
organization of the early cortices would be particularly conducive to successful
pattern classification, and it is well known that this organization is gradually lost
at higher levels of the sensory hierarchies (Felleman and Van Essen, 1991). This
may be reflected by our intramodal classification accuracies: they were high for
the early sensory cortices (around 0.84 for both audio and video clips) but
declined in the pSTS (0.69 for video clips and 0.77 for audio clips). Furthermore,
it stands to reason that crossmodal classification would be unlikely to yield
accuracies exceeding the lower of the two intramodal classification accuracies.
The pSTS crossmodal accuracy of around 0.6 thus should be appreciated from
the perspective of 0.69 as a “soft ceiling”.
The functional organization of the pSTS may provide additional insight into
the comparison between crossmodal and intramodal classification accuracies.
Superior temporal cortex has been found to contain intermixed millimeter-scale
patches that respond preferentially to auditory (A), visual (V), or audiovisual
stimuli (AV) (Beauchamp et al., 2004; Dahl et al., 2009). Under this regime,
auditory stimuli would activate A and AV patches, and visual stimuli would
activate V and AV patches. Crossmodal classification, however, could rely solely
on the AV patches. Consequently, crossmodal classification accuracy would be
expected to be lower than intramodal classification accuracy because (1) it has
access to fewer information-bearing voxels; and (2) the A and V patches, from
the classifier’s perspective, add noise to the analysis.
In light of these arguments, we wish to make clear that we use
"invariance" as a relative term. What our analysis shows is that the neural activity
patterns induced by corresponding auditory and visual stimuli are more similar
than the patterns induced by non-corresponding stimuli or, in other words, that
some features of the patterns evoked by corresponding auditory and visual
stimuli are shared.
Support for the Framework of Convergence-Divergence Zones
According to a neuroarchitectural framework proposed by Damasio (Damasio,
1989; Meyer and Damasio, 2009), neuron ensembles in higher-order association
cortices constitute convergence-divergence zones (CDZs), which register
associations among perceptual representations from multiple sensory modalities.
Due to the convergent bottom-up projections they receive from early sensory
cortices, CDZs can be activated, for instance, both by the sight and the sound of
a specific object. Once activated, the CDZs can re-instantiate the associated
representations in different sensory cortices by means of divergent top-down
projections. (For a visualized account of the framework, see Meyer and Kaplan,
2011.) In brief, according to the CDZ framework, the bottom-up processing of
sensory stimuli in a given modality would be continuously accompanied by the
top-down reconstruction of associated patterns in different modalities.
In keeping with this prediction, previous MVPA studies have shown that
visual stimuli implying sound or touch lead to content-specific representations in
early auditory and somatosensory cortices, respectively (Meyer et al., 2010,
2011). Conversely, auditory stimuli induce content-specific neural patterns in
early visual cortices (Vetter et al., 2011). The current results provide an additional
piece of support for the framework, suggesting that the pertinent audiovisual
CDZs may be located around the pSTS.
In the context of the CDZ framework, it is interesting to consider whether
modality-invariance would extend from multisensory association cortices toward
early sensory cortices. If watching a silent video clip of a jackhammer results in
the reconstruction of a neural activity pattern in the early auditory cortices, it is
conceivable that the reconstructed pattern would resemble the one established
when the jackhammer was actually heard. This question is the reason why our
subjects received an explicit instruction to imagine the sensory counterpart
(auditory or visual) of the stimuli (visual or auditory) they perceived. We did not
find any indication of modality-invariance in the early visual cortices, and while
cross-modal classification performance was indeed significant in the auditory
ROI, the searchlight analysis suggests this may have been due to the partial
overlap of the AC and pSTS ROIs. In brief, the results of the present study do not
provide evidence for modality-invariance at the level of early sensory cortices.
Is it possible, on the other hand, that crossmodal classification in the pSTS
was successful only due to mental imagery? In other words, have we
demonstrated cross-format classification between perception and imagery within
a single modality, rather than truly crossmodal classification between auditory
perception and visual perception? We do not believe this is the case for two
reasons. First, the pSTS ROI was defined based on activations recorded during
the localizer runs, in which subjects did not receive any imagery instruction.
Second, previous studies have performed cross-format classification between
visual perception and visual imagery, finding above-chance performance in
ventral visual stream regions such as lateral occipital cortex and ventral temporal
cortex (Stokes et al., 2009; Reddy et al., 2010; Lee et al., 2011). Our crossmodal
analysis did not identify any of these regions, and, conversely, the mentioned
studies did not find above-chance cross-format classification in the pSTS,
suggesting that our results cannot be explained by cross-classification of visual
imagery and perception. As for the auditory modality, to the best of our
knowledge, cross-format classification between perception and imagery
heretofore has not been performed. Generally speaking, auditory perception and
auditory imagery do activate overlapping areas in planum temporale and the
posterior superior temporal gyrus (Bunzeck et al., 2005; Zatorre and Halpern,
2005), admitting of the theoretical possibility for cross-format auditory
classification in our experiment. However, such an interpretation would be
pressed to explain the asymmetry in our results between (successful) auditory
and (non-successful) visual cross-format classification. Given the large amount of
evidence that implicates the pSTS in multisensory processing, it appears most
parsimonious to interpret the neural representations in pSTS as generalizing
across the auditory and visual modalities, rather than only across auditory
perception and auditory imagery.
To summarize, we have shown that a region around the posterior superior
temporal sulcus is activated in content-specific fashion by both visual and
auditory stimuli, and that the activity patterns induced by the same object
presented in the two modalities exhibit a certain degree of similarity. Such
modality-invariant representations of the objects of perception may be the initial
stage of the neural process that allows us to recognize and react to sensory
stimuli independently of the modality through which we perceive them.
Chapter 3 Figures
Figure 3.1. Brain regions activated by auditory and visual stimuli. Auditory
activations (first panel on the left) most notably comprised the superior part of the
temporal lobe (Heschl’s gyrus, planum temporale, planum polare, and
adjacent parts of the superior temporal gyrus), as well as additional regions in the
parietal and frontal lobes. Visual activations (second panel from the left) included
the medial and lateral portions of the occipital lobe, the posterior temporal lobe,
as well as additional regions in the parietal and frontal lobes. The intersection of
the auditory and visual activations was calculated and the four largest clusters on
the audiovisual overlap map were subsequently used for MVPA. The four
multisensory regions of interest were the posterior superior temporal sulcus
(pSTS), inferior frontal cortex (IFC), medial premotor cortex (mPM), and the
anterior insula (aI). Slices across different planes and regions are not to scale.
Figure 3.2. Intramodal classification performance. A, Auditory and visual stimuli
were reliably classified from their respective sensory cortices, auditory cortex
(AC) and visual cortex (VC). Classification performance of unimodal stimuli in
heteromodal cortices was much more modest, albeit still significant for visual
stimuli classified from AC. B, Audio and video clips were classified significantly
more accurately in the pSTS than in the other three multimodal ROIs. ns, not
significant; *, P < 0.05; **, P < 0.01; ***, P < 0.001; P-values FDR-corrected. All
error bars indicate SEM. C, Intramodal searchlight analysis. Audio clips were
classified most accurately from voxels in the superior temporal lobe, while video
clips were best discriminated from voxels in the occipital lobe. The figure shows
voxels that classified above the 95th percentile of accuracy in four or more
subjects. The mean of the 95th percentile thresholds was 0.250 for the auditory
searchlight and 0.227 for the visual searchlight; chance performance in the six-
way searchlights was 0.167.
Figure 3.3.A-B. Crossmodal classification performance. A, Mean crossmodal
accuracy in the multimodal ROIs, superimposed with individual subject
accuracies (open circles). Classification was significantly better than chance only
in the pSTS. The pSTS was significantly higher than any of the other multimodal
regions. ns, not significant; *, P < 0.05; **, P < 0.01; ***, P < 0.001. B, Pairwise
classification accuracies. There were fifteen pairwise discriminations among the
six stimuli. In the pSTS ROI, twelve out of the fifteen discriminations yielded
classification performances significantly above chance, as indicated by darker
shaded bars (top panel). None of the pairwise comparisons were significantly
better than chance in IFC and mPM, and only one was in aI (bottom panel). C,
Crossmodal searchlight analysis. The figure shows voxels that classified above
the 95th percentile of accuracy in four or more subjects. The mean of the 95th
percentile thresholds was 0.211, chance performance for the six-way
searchlights being 0.167. Voxels cluster around the pSTS, close to the junction of
the temporal and parietal lobes, and are almost completely lateralized to the
right.
Table 3.1. Coordinates of the peak voxels of the two unimodal and four
multimodal activity clusters, reported for the left and right hemispheres in MNI
space (x, y, z). The clusters were located in the superior temporal lobe (auditory
vs. rest), the medial and lateral occipital lobe (visual vs. rest), and, within the
audiovisual overlap, around the posterior superior temporal sulcus and in lateral
premotor cortex, medial premotor cortex, and the anterior insula.
ROI                                  Intramodal auditory   Intramodal visual   Crossmodal
Auditory cortices                    0.838***              0.591***            0.547**
Visual cortices                      0.535*                0.843***            0.490 (ns)
Posterior superior temporal sulcus   0.765***              0.691***            0.601**
Lateral premotor cortex              0.563***              0.571***            0.505 (ns)
Medial premotor cortex               0.541**               0.555***            0.506 (ns)
Anterior insula                      0.7 (ns)              0.531*              0.497 (ns)

Table 3.2. Intramodal and crossmodal classification accuracies for all six ROIs.
*, P < 0.05; **, P < 0.01; ***, P < 0.001; ns, not significant.
Chapter 3 References
Beauchamp M (2005) Statistical criteria in FMRI studies of multisensory
integration. Neuroinformatics:93–113.
Brainard DH (1997) The Psychophysics Toolbox. Spat Vis 10:433–436.
Bunzeck N, Wuestenberg T, Lutz K, Heinze H-J, Jancke L (2005) Scanning
silence: mental imagery of complex sounds. NeuroImage 26:1119–1127.
Calvert G A (2001) Crossmodal processing in the human brain: insights from
functional neuroimaging studies. Cereb Cortex 11:1110–1123.
Chang CC, Lin CJ (2011) LIBSVM: A library for support vector machines. ACM
TIST 2:27:2-27:27.
Damasio A (1989) Time-locked multiregional retroactivation: A systems-level
proposal for the neural substrates of recall and recognition. Cognition 33:25–62.
Doehrmann O, Naumer MJ (2008) Semantics and the multisensory brain: how
meaning modulates processes of audio-visual integration. Brain Res 1242:136–
150.
Etzel JA, Gazzola V, Keysers C (2008) Testing simulation theory with cross-
modal multivariate classification of fMRI data. PloS ONE 3:e3690.
Felleman DJ, Van Essen DC (1991) Distributed hierarchical processing in the
primate cerebral cortex. Cereb Cortex 1:1–47.
Hanke M, Halchenko YO, Sederberg PB, Hanson SJ, Haxby JV, Pollmann S
(2009) PyMVPA: A python toolbox for multivariate pattern analysis of fMRI data.
Neuroinformatics 7:37–53.
Jenkinson M, Bannister P, Brady M, Smith S (2002) Improved optimization for
the robust and accurate linear registration and motion correction of brain images.
NeuroImage 17:825–841.
Jenkinson M, Smith S (2001) A global optimisation method for robust affine
registration of brain images. Med Image Analysis 5:143–156.
Kriegeskorte N, Goebel R, Bandettini P (2006) Information-based functional brain
mapping. Proc Natl Acad Sci 103:3863–3868 .
Laurienti PJ, Perrault TJ, Stanford TR, Wallace MT, Stein BE (2005) On the use
of superadditivity as a metric for characterizing multisensory integration in
functional neuroimaging studies. Exp Brain Res 166:289–297.
Lee S-H, Kravitz DJ, Baker CI (2011) Disentangling visual imagery and
perception of real-world objects. NeuroImage:1–10.
Meyer K, Damasio A (2009) Convergence and divergence in a neural
architecture for recognition and memory. Trends Neurosci 32:376–382.
Meyer K, Kaplan JT (2011) Cross-modal multivariate pattern analysis. J Vis Exp
57:e3307.
Meyer K, Kaplan JT, Essex R, Damasio H, Damasio A (2011) Seeing touch is
correlated with content-specific activity in primary somatosensory cortex. Cereb
Cortex 21:2113–2121.
Meyer K, Kaplan JT, Essex R, Webber C, Damasio H, Damasio A (2010)
Predicting visual stimuli on the basis of activity in auditory cortices. Nat Neurosci
1–26.
Peelen MV, Atkinson AP, Vuilleumier P (2010) Supramodal representations of
perceived emotions in the human brain. J Neurosci 30:10127–10134.
Pietrini P, Furey ML, Ricciardi E, Gobbini MI, Wu W-HC, Cohen L, Guazzelli M,
Haxby JV (2004) Beyond sensory images: Object-based representation in the
human ventral pathway. Proc Natl Acad Sci 101:5658–5663.
Reddy L, Tsuchiya N, Serre T (2010) Reading the mind’s eye: decoding category
information during mental imagery. NeuroImage 50:818–825.
Shinkareva SV, Malave VL, Mason RA, Mitchell TM, Just MA (2011)
Commonality of neural representations of words and pictures. NeuroImage
54:2418–2425.
Smith SM (2002) Fast robust automated brain extraction. Hum Brain Mapp
17:143–155.
Smith SM, Jenkinson M, Woolrich MW, Beckmann CF, Behrens TEJ, Johansen-
Berg H, Bannister PR, De Luca M, Drobnjak I, Flitney DE, Niazy RK, Saunders J,
Vickers J, Zhang Y, De Stefano N, Brady JM, Matthews PM (2004) Advances in
functional and structural MR image analysis and implementation as FSL.
NeuroImage 23 Suppl 1:S208–19.
Stokes M, Thompson R, Cusack R, Duncan J (2009) Top-down activation of
shape-specific population codes in visual cortex during mental imagery. J
Neurosci 29:1565–1572.
Vetter P, Smith FW, Muckli L (2011) Decoding natural sounds in early visual
cortex. J Vis 11:779–779.
Woolrich MW, Ripley BD, Brady M, Smith SM (2001) Temporal autocorrelation in
univariate linear modeling of FMRI data. NeuroImage 14:1370–1386.
Zatorre RJ, Halpern AR (2005) Mental concerts: musical imagery and auditory
cortex. Neuron 47:9–12.
CHAPTER 4. Mapping neural abstraction of object representations across sight,
sound, and touch
Abstract
Objects are encountered in the world through multiple sensory modalities, but it
is unclear how the brain integrates these disparate features into abstract
representations. We studied the organization and convergence of sensory
information in the human brain with functional magnetic resonance imaging and
multivariate pattern analysis. We presented common objects across three
different modalities — sight, sound, and touch — and mapped brain regions
containing information about the identity of these objects. We first performed
intramodal classification, decoding object identity within one modality of
presentation. We mapped the spread of auditory, visual, and tactile sensory
information, which started from their respective primary sensory cortices,
converged with (one or two) other modalities in association cortices, and even
invaded the sensory cortices of a different modality. We also performed
crossmodal classification, decoding object identity across different sensory
modalities of presentation. We conclude that representations that abstract across
sensory modalities may form a neural basis for mental concepts.
4.1. Introduction
The objects that we encounter in the world are richly detailed, dynamic,
and multisensory. (Consider, however, that the objects we encounter in the
laboratory are rarely all three.) Here, we studied the last of these characteristics to
understand how activity in the segregated brain regions known to be chiefly
responsible for each sensory modality can come to generate unified, abstract
representations of objects. In pursuit of this goal we employed real-world objects
with dynamic and detailed auditory, visual, and tactile properties. These rich
properties may provide access to abstract, but difficult to enumerate,
representations at intermediate levels of sensory convergence. For example, the
scratchy feel of Velcro may be linked to the hacking sound one hears when
pulling it apart, in a way that does not necessarily link to its visual qualities. In this
study we sought to map multiple levels of object representation among three
sensory modalities.
Nearly half a century ago, Jones and Powell (1970) studied the stepwise
convergence of sensory pathways among the auditory, visual, and
somatosensory modalities. Using a lesion tracing method in the macaque, they
mapped the convergence of all three modalities to a region homologous to
the human angular and supramarginal gyri. More recently, this region was
implicated in humans using diffusion tensor imaging (Bonner et al. 2013), and
several parietal regions were implicated in a human fMRI sensory co-activation
study (Bremmer et al. 2001). Here, we performed fMRI in human subjects as
they encountered objects by sight, sound, or touch, and used multivariate pattern
analysis (MVPA) to detect information specifying which object was presented.
We performed MVPA both intramodally and crossmodally. Using intramodal
MVPA, we assessed whether the neural representations in the sensory regions
! )&!
of a specific modality (e.g., the visual cortices) would allow us to decode the
identity of stimuli presented in that modality. Using crossmodal MVPA, on the
other hand, we asked whether the same object presented in two different
modalities (e.g., feeling Velcro and hearing it pulled apart) would lead to invariant
(i.e., abstracted) representations in multimodal regions of the brain. Finally, we
mapped a superordinate level of convergence by identifying regions containing
abstract representations involving all three modalities.
4.2. Materials and Methods
Subjects
Nineteen right-handed subjects were originally enrolled in the study. One
subject was excluded from the analysis due to incomplete coverage of the brain
by the scan's field of view. The data presented come from the remaining eighteen
participants, eight female and ten male. The experiment was undertaken with the
informed written consent of each subject.
Stimuli
Auditory, visual, and tactile stimuli were generated from three objects: a
whistle, a wine glass, and a strip of Velcro. The visual stimuli consisted of video
recordings made while the objects were being actively used by the author: the
wine glass was tapped with a fingernail, the whistle was blown, and the pieces of
Velcro were attached and separated. The visual stimuli were always played in
silence. Auditory stimuli were generated from the audio tracks of the videos and
played over a black screen with the volume adjusted to the loudest comfortable
level for each subject. The audio and video extracts did not overlap in time; this
helped to prevent a correspondence between the sight and sound of the same
object at the level of simple dynamics. All audio and video clips were truncated to
five seconds.
Tactile stimuli were delivered by placing the objects themselves into the hands of
subjects for bimanual active exploration. Subjects were instructed to place their
hands together in a shallow cupped position at waist level with palms facing up to
receive the object. As soon as the object was received, they actively explored
the object with both hands for five seconds. Subjects were instructed to avoid
looking down at their hands while manipulating the objects; the hands were also
occluded from view by the head coil and mirror. Subjects were further
instructed to avoid producing sounds with the objects.
Stimulus Presentation
Prior to scanning, subjects were briefly introduced to the three objects and
allowed to touch them. Once inside the fMRI scanner, subjects were presented
with stimuli in separate auditory (A), visual (V), and tactile (T) runs. There were
12 runs in total, in the order A V T V A T A T V T V A. Each run contained 15 stimulus
presentations: each of the three objects was presented five times in
pseudorandomized order with no back-to-back repeats. One stimulus was
presented every 11 s. A sparse-sampling scanning paradigm was used to ensure
that all stimuli were presented in the absence of scanner noise: a single whole-
brain volume was acquired starting two seconds after the end of each 5-second
clip or 5-second object manipulation period. The two-second long image
acquisition was followed by a two-second pause, after which the next trial began.
Timing and presentation of the auditory and visual stimuli was controlled with
MATLAB 7.9.0 (The Mathworks), using the freely available Psychophysics
Toolbox 3 software (Brainard, 1997). Participants were instructed to keep their
eyes open during all runs and to pay attention to the sensory qualities of each
stimulus.
Prior to the 12 unimodal runs for MVPA, there were two runs of a
functional localizer at the beginning of the scanning session. These functional
localizers served as an independent dataset on which we performed a
conventional univariate analysis to identify ROIs for MVPA. The functional
localizer runs employed a slow event-related design with continuous image
acquisition. The three different objects were presented unimodally (sound only,
sight only, and touch only, using the same set of stimuli as in MVPA runs) as well
as trimodally, in which subjects simultaneously heard, saw, and touched the
objects. Each of these twelve unique stimuli was presented three times, for a
total of 36 stimulus presentations in each functional localizer. The stimuli were
presented in pseudorandomized order with no back-to-back repeats. The
duration of each trial was randomly jittered up to one second about the mean
duration of 15 seconds.
Image Acquisition
Images were acquired with a 3-Tesla Siemens MAGNETOM Trio System.
Echo-planar volumes for MVPA runs were acquired with the following
parameters: TR = 11,000 ms, TA = 2,000 ms, TE = 25 ms, flip angle = 90
degrees, 64 x 64 matrix, in-plane resolution 3.0 mm x 3.0 mm, 41 transverse
slices, each 2.5 mm thick, covering the whole brain. Volumes for functional
localizer runs were acquired with the same parameters except in continuous
acquisition with TR = 2,000 ms. We also acquired a structural T1-weighted
MPRAGE for each subject (TR = 2,530 ms, TE = 3.09 ms, flip angle = 10
degrees, 256 x 256 matrix, 208 coronal slices, 1 mm isotropic resolution). The
structural scan was collected after the sixth MVPA run, serving as a passive rest
period for the subject in the middle of the 12-run sequence.
Univariate Analysis of Functional Localizer Runs
We performed a univariate analysis of the functional localizer runs using
FSL (Smith et al., 2004). Data preprocessing involved the following steps: motion
correction (Jenkinson et al., 2002), brain extraction (Smith, 2002), slice-timing
correction, spatial smoothing with a 5-mm full-width at half-maximum Gaussian
kernel, high-pass temporal filtering using Gaussian-weighted least-squares
straight line fitting with sigma (standard deviation of the Gaussian distribution)
equal to 60 s, and pre-whitening (Woolrich et al., 2001).
The four stimulus types, auditory (A), visual (V), tactile (T), and trimodal
(AVT) were modeled separately with four regressors derived from a convolution
of the task design and a gamma function to represent the hemodynamic
response function. Motion correction parameters were included in the design as
additional regressors. The two functional localizer runs for each participant were
combined into a second-level fixed-effects analysis, and a third-level inter-subject
analysis was performed using a mixed-effects design.
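In outline, the construction of one such regressor can be sketched as follows. This is a minimal Python/NumPy illustration rather than the FSL implementation; the run length, trial onsets, and HRF parameters shown here are hypothetical placeholders.

import numpy as np
from scipy.stats import gamma

TR = 2.0                  # continuous acquisition in the localizer runs
n_vols = 280              # hypothetical number of volumes in one localizer run

def gamma_hrf(t, lag=6.0, sd=3.0):
    # Single-gamma hemodynamic response function (6 s mean lag, 3 s width; illustrative values)
    shape = (lag / sd) ** 2
    scale = sd ** 2 / lag
    return gamma.pdf(t, a=shape, scale=scale)

hrf = gamma_hrf(np.arange(0.0, 32.0, TR))

def task_regressor(onsets, duration=5.0):
    # Boxcar for one condition (e.g., all auditory trials), convolved with the HRF
    boxcar = np.zeros(n_vols)
    for onset in onsets:
        start = int(round(onset / TR))
        boxcar[start:start + int(round(duration / TR))] = 1.0
    return np.convolve(boxcar, hrf)[:n_vols]

# Four condition regressors (A, V, T, AVT), plus the motion parameters,
# together form the columns of the design matrix
auditory_onsets = [15.0, 60.0, 105.0]      # hypothetical trial onsets, in seconds
design_column_A = task_regressor(auditory_onsets)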
Registration of the functional data to the high-resolution anatomical image
of each subject and to the standard Montreal Neurological Institute (MNI) brain
was performed using the FSL FLIRT tool (Jenkinson and Smith, 2001).
Functional images were aligned to the high-resolution anatomical image using a
six-degree-of-freedom linear transformation. Anatomical images were registered
to the MNI-152 brain using a 12-degree-of-freedom affine transformation.
Five activation maps were defined using the functional localizer data:
three "unimodal" maps of the areas activated during the presentation of each
modality compared to rest (A > R, V > R, T > R), a "multisensory" map of regions
activated during simultaneous AVT stimulation compared to rest (AVT > R), and
a "superadditive" map of areas in which the trimodal stimuli evoked activity
greater than the sum of the activations from the three unimodal stimuli (AVT >
A+V+T). The activation maps were thresholded with FSL’s cluster thresholding
algorithm using a minimum Z-score of 2.3 (P < 0.01) and a cluster size probability
of P < 0.05.
Localizer-Based Voxel Selection for Multivariate Pattern Analysis
Voxels were selected for six different masks within which MVPA was
performed, based on data from the functional localizers. Three unimodal masks
were created by selecting the 1,000 voxels with the highest Z-scores from each
subject's corresponding A, V, and T unimodal functional localizer maps. Three
bimodal masks, AV, AT, and VT, were created by finding the respective overlaps
between the two corresponding unimodal functional localizer maps. For instance,
to obtain the AV mask, we identified voxels that were significantly activated in
both the A > R and the V > R maps. Voxel values were then rescaled to be
the percentage of the maximum value of each respective map. These values
were summed across the two maps. The top 1000 voxels in this combined map
were taken to form the AV mask. This procedure allowed the selection of voxels
that were most active in both the A > R and V > R maps. An analogous
procedure was used to create the AT and VT masks.
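As a concrete illustration, the voxel selection can be sketched as follows. This is a minimal NumPy sketch, assuming the thresholded localizer Z-statistic maps are already available as arrays; the variable names are hypothetical.

import numpy as np

def unimodal_mask(zmap, k=1000):
    # Select the k voxels with the highest localizer Z-scores (e.g., from the A > R map);
    # ties at the cutoff may admit a few extra voxels
    cutoff = np.sort(zmap.ravel())[-k]
    return zmap >= cutoff

def bimodal_mask(zmap_1, zmap_2, k=1000):
    # Rescale each thresholded map to its percent-maximum, sum the two maps, and keep
    # the top k voxels among those significantly activated (nonzero) in both maps
    overlap = (zmap_1 > 0) & (zmap_2 > 0)
    combined = 100.0 * (zmap_1 / zmap_1.max() + zmap_2 / zmap_2.max())
    combined[~overlap] = 0.0
    cutoff = np.sort(combined.ravel())[-k]
    return combined >= cutoff

# e.g., the AV mask from a subject's A > R and V > R localizer maps:
# mask_AV = bimodal_mask(zmap_A, zmap_V)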
Multivariate Pattern Analysis
Data from the 12 MVPA runs of each subject were concatenated and
motion-corrected to the middle volume of the entire time series, then linearly
detrended and converted to Z-scores by run. MVPA was performed using the
PyMVPA software package (Hanke et al., 2009) in combination with LibSVM’s
implementation of the linear support vector machine (Chang and Lin, 2011). We
performed both intramodal and crossmodal classification within the localizer-
based masks. There were three possible two-way discriminations among the
! *"!
three stimuli (wine glass vs. whistle, wine glass vs. Velcro, whistle vs. Velcro).
For intramodal classification, we performed a leave-one-run-out cross-validation
procedure: a classifier was trained on data from three of the four runs of a given
modality and then tested on the fourth run. This procedure was repeated four
times with the data from each run serving as the testing set once. For
crossmodal classification, a classifier was trained on all four runs of one modality
and tested on all four runs of a different modality. For the three modalities A, V,
and T, there were six orderings of training and testing: [A,V] [V,A] [A,T] [T,A] [V,T]
[T,V].
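The classification scheme can be sketched as follows. Here scikit-learn's linear SVM stands in for the PyMVPA/LIBSVM pipeline, and the trial-by-voxel array, label vector, and run and modality indices are hypothetical inputs assumed to have already been detrended and Z-scored by run.

import numpy as np
from sklearn.svm import LinearSVC

# X: trials x voxels within a mask; y: object labels; run, mod: run index and
# modality ('A', 'V', or 'T') of each trial -- hypothetical arrays prepared elsewhere

def intramodal_accuracy(X, y, run, mod, modality):
    # Leave-one-run-out cross-validation within one modality of presentation
    sel = mod == modality
    Xm, ym, rm = X[sel], y[sel], run[sel]
    accuracies = []
    for test_run in np.unique(rm):
        train, test = rm != test_run, rm == test_run
        clf = LinearSVC().fit(Xm[train], ym[train])
        accuracies.append(clf.score(Xm[test], ym[test]))
    return np.mean(accuracies)

def crossmodal_accuracy(X, y, mod, train_modality, test_modality):
    # Train on all runs of one modality, test on all runs of another (e.g., A then V)
    clf = LinearSVC().fit(X[mod == train_modality], y[mod == train_modality])
    return clf.score(X[mod == test_modality], y[mod == test_modality])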
Whole-Brain Searchlight Analyses
In order to conduct a mask-independent search for brain regions
containing information relevant to intra- or crossmodal classification, we
performed a searchlight procedure (Kriegeskorte et al., 2006). In each subject, a
three-way classifier (trained and tested on all three stimuli simultaneously) was
repetitively applied to small spheres (r = 4 voxels) centered on every voxel of the
brain. The classification accuracy for each sphere was mapped to its center voxel
to obtain a whole-brain accuracy map for each of three intramodal and six
crossmodal classifications. Maps from each subject were then warped into the
standard space and visualized with FSL and MRIcroGL (Chris Rorden,
http://www.cabiatl.com/mricro/mricrogl). The three unimodal searchlight maps
were overlaid and regions of overlap were found. Values of the overlap map
were calculated by scaling the values of each input map to its percent-maximum
and then summing them. From the six crossmodal classifications, three
crossmodal searchlight maps, AV, AT, and VT, were generated by averaging
across the two directions of training and testing for two given modalities.
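In outline, the searchlight can be sketched as follows. This is a brute-force NumPy/scikit-learn version rather than the PyMVPA sphere implementation; a plain four-fold split stands in for the leave-one-run-out scheme, and the input arrays are hypothetical.

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def searchlight_map(data, labels, brain_mask, radius=4):
    # data: x-by-y-by-z-by-trials array; labels: object label per trial;
    # brain_mask: boolean x-by-y-by-z array (all hypothetical inputs)
    dims = brain_mask.shape
    accuracy = np.zeros(dims)
    grid = np.array(np.meshgrid(*[np.arange(d) for d in dims], indexing='ij'))
    for center in zip(*np.nonzero(brain_mask)):
        # Spherical neighborhood of the given radius (in voxels) around the center voxel
        dist = np.sqrt(((grid - np.array(center).reshape(3, 1, 1, 1)) ** 2).sum(axis=0))
        sphere = (dist <= radius) & brain_mask
        X = data[sphere].T                       # trials x voxels within the sphere
        scores = cross_val_score(LinearSVC(), X, labels, cv=4)
        accuracy[center] = scores.mean()         # map accuracy back to the center voxel
    return accuracy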
Statistical Analyses
Given that our hypothesis was directional — classifier performance on
two-way discriminations should be higher than the chance result of 0.5 — we
employed one-tailed t-tests across all 18 subjects to assess the statistical
significance of our results in functional localizer masks.
We further interrogated the group searchlight maps for evidence of above
chance classification with a conjunction group analysis (Heller et al. 2007). This
analysis is a potentially more powerful way to establish the existence of an effect,
seeking to show it in only a portion of the subjects studied and not necessarily in
the entire group, as with a t-test. Briefly, we tested the partial conjunction
hypothesis that at least n out of 18 subjects performed the classification better
than chance, sequentially for values of n from 1 to 18. Classification accuracies
for individual subjects were transformed to P values based on the binomial
distribution (Pereira & Botvinick 2011). We used Stouffer's method to combine
the 18 independent P values to generate a new, pooled P value. The resulting
searchlight maps were thresholded with FSL’s cluster thresholding algorithm
using a minimum Z-score of 0.95 (a threshold more permissive than the conventional
value corresponding to a one-tailed P of 0.05; the results presented here should thus
be interpreted with caution) and corrected for multiple comparisons with a
cluster size probability threshold of P < 0.05. Each remaining voxel was then
assigned the largest n for which the partial conjunction null hypothesis could be
rejected.
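One way to realize this test is sketched below: per-subject binomial P values are combined with Stouffer's method, following the partial conjunction logic of Benjamini and Heller (2008), in which the u - 1 smallest P values are set aside when testing "at least u out of 18"; cluster thresholding is handled separately in FSL, and the function and variable names are hypothetical.

import numpy as np
from scipy.stats import binom, norm

def binomial_p(accuracy, n_trials, chance=1.0 / 3.0):
    # P value for at least this many correct classifications arising by chance
    # (chance is 1/3 for the three-way searchlight classifier, 0.5 for two-way tests)
    n_correct = int(round(accuracy * n_trials))
    return binom.sf(n_correct - 1, n_trials, chance)

def partial_conjunction_p(p_values, u):
    # Pooled P value for the null hypothesis "fewer than u subjects show the effect":
    # discard the u - 1 smallest P values and combine the rest with Stouffer's method
    pooled = np.sort(np.asarray(p_values))[u - 1:]
    z = norm.isf(pooled)                  # transform each P value to a Z score
    return norm.sf(z.sum() / np.sqrt(len(z)))

def largest_u(p_values, alpha=0.05):
    # Largest u for which "at least u out of len(p_values) subjects" can be claimed
    rejected = [u for u in range(1, len(p_values) + 1)
                if partial_conjunction_p(p_values, u) < alpha]
    return max(rejected) if rejected else 0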
4.3. Results
Functional localizer scans
At the group level, the unimodal maps of regions activated by auditory,
visual, or tactile stimuli, as compared to rest, identified peak activity clusters in
the auditory, visual, and somatosensory/motor cortices, respectively (Fig. 4.1.a).
In the trimodal map, the presentation of simultaneous AVT stimuli compared to
rest activated broad regions of the cortex resembling a concatenation of the
unimodal maps, with peaks in auditory, visual, and somatosensory/motor cortices
(Fig. 4.1.b.). The peak regions identified on the trimodal map were nearly
identical to the peak regions identified on the superadditive map which, as
mentioned earlier, identified regions in which activation during AVT stimulation
was greater than the sum of activations during individual A, V, and T stimulation.
Thus, the superadditive map revealed a multimodal potentiation effect: activity in
the auditory, visual, and tactile cortices was higher during simultaneous trimodal
stimulation than during isolated unimodal stimulation.
Classifier Performance in Localizer-Based Masks
MVPA was performed within masks based on each subject's own
functional localizer data (see Materials and Methods). Accuracies were averaged
across 18 subjects and three pairwise discriminations of the three objects (Table
4.1).
Intramodal classification in unimodal masks (Table 4.1.a). Classification of
the three sounds within the auditory mask, of the three video clips within the
visual mask, and of the three tactile stimuli within the somatosensory mask were
all significantly better than chance, at the group level. We were also interested in
the ability of a classifier to distinguish among stimuli presented in one modality
within "heteromodal" brain regions, that is, within masks generated based on
stimuli of a different modality. We found that auditory stimuli could be
distinguished at levels slightly but significantly better than chance in the visual
mask but not significantly better than chance in the tactile mask. Visual stimuli
were classified significantly better than chance within both the auditory and tactile
masks. Tactile stimuli were classified significantly better than chance within both
the auditory and visual masks.
Crossmodal classification in unimodal masks (Table 4.1.b). We assessed
the accuracies of crossmodal classifiers, trained in one modality and tested in
another, within the masks for one or the other modality. For example, we trained
a classifier to distinguish between sounds produced by different objects, then
tested it to distinguish between videos depicting the same objects, using voxels
from only the auditory mask or only the visual mask. Out of the six crossmodal
classifications (AV, VA, AT, TA, VT, TV), each performed in the two relevant
unimodal masks, the only accuracies significantly better than chance were found
! *&!
in 1) the train-visual test-tactile classification within the tactile mask and 2) the
train-tactile test-visual classification, also within the tactile mask.
Crossmodal classification in bimodal masks (Table 4.1.b). We assessed the
accuracies of crossmodal classifiers in the bimodal masks, which were formed by
combining the two relevant unimodal maps (see Materials and Methods). Out of
the six crossmodal classifications, each performed in its respective bimodal
mask, only the train-visual test-tactile and the train-tactile test-visual
classifications were significantly better than chance at the group level.
Searchlight Classifier Performance Across the Whole Brain
We visualized the regions that most consistently contained stimulus-
specific information across our subjects by mapping the results of the conjunction
group analyses.
Intramodal searchlights. Sounds were most consistently decoded, peaking
at 17 out of 18 subjects, in the auditory cortices, including Heschl's gyrus and the
planum temporale, and extending into the middle and posterior superior temporal
gyrus and sulcus, bilaterally (Fig. 4.2.a). Videos were most consistently decoded,
peaking at 17 out of 18 subjects, in the visual cortices, throughout the calcarine
sulcus and occipital pole, and extending into lateral and ventral occipital cortices,
superior parietal lobe, and postcentral sulcus, bilaterally. Touches were most
consistently decoded, peaking at 13 out of 18 subjects, in somatosensory and
motor cortices, including the central sulcus and pre- and post-central gyri, and
extending into the secondary somatosensory cortex, angular gyrus, superior
parietal lobe, lateral occipital cortex, and middle superior temporal gyrus,
bilaterally.
Regions of overlap among the auditory, visual, and tactile unimodal
searchlight maps were also found (Fig. 4.2.b). Regions that appeared in both the
A and V searchlight maps included posterior superior temporal gyrus and sulcus,
as well as postcentral gyrus and sulcus, bilaterally. Regions that appeared in
both the A and T searchlight maps included parietal operculum, precuneus, and
posterior cingulate cortices, bilaterally. Regions that appeared in both the V and
T searchlight maps included superior parietal lobe and lateral and ventral
occipital cortices, bilaterally. Regions that appeared in all three unimodal
searchlight maps, A, V, and T, included parietal operculum, the temporo-parieto-
occipital junction, and precuneus, bilaterally (Fig. 4.2.c).
Crossmodal searchlights. AV crossmodal classification was most
consistently found, peaking at 5 out of 18 subjects, bilaterally in lateral occipital
cortex, and in the right ventral temporal cortex, superior parietal lobule and
supramarginal gyrus (Fig. 4.3.a). AT crossmodal classification was most
consistently found, peaking at 3 out of 18 subjects, bilaterally in postcentral
gyrus, precuneus, supramarginal gyrus, and lateral occipital cortex. VT
crossmodal classification was most consistently found, peaking at 8 out of 18
subjects, in the right postcentral gyrus and sulcus, and bilaterally in the superior
parietal lobule, supramarginal gyrus, inferior temporal cortex, and the medial,
dorsal, and ventral premotor cortices.
Regions of overlap among the AV, AT, and VT crossmodal searchlight
maps were found (Fig. 4.3.b). Regions that appeared in both the AV and AT
maps included postcentral sulcus/parietal operculum and lateral occipital cortex,
bilaterally. Regions that appeared in both the AV and VT maps included
postcentral sulcus and supramarginal gyrus, inferior parietal lobule, and posterior
inferior temporal gyrus, bilaterally. Regions that appeared in both the AT and VT
maps included supramarginal gyrus and postcentral gyrus leading into parietal
operculum, bilaterally. Two regions appeared in all three crossmodal searchlight
maps: the ventral aspect of the postcentral gyrus and sulcus, and the dorsal
aspect of the supramarginal gyrus, both bilaterally.
4.4. Discussion
We mapped brain regions in which we could detect modality specific and
modality invariant representations of a set of objects that were presented across
three different modalities. The three objects, a wine glass, a whistle, and a piece
of Velcro, were common, yet richly detailed, real-world objects, each of which
had distinct auditory, visual, and tactile qualities. Our intramodal results show
that the sounds, videos, and touches could be distinguished from each other not
only within their respective namesake cortices, but also in heteromodal cortices,
i.e., the sensory cortices of a different modality. Our intramodal searchlights
reveal the far-flung reaches of the information extracted from these sensory
stimuli. Our crossmodal classifications performed in predefined masks derived
from functional localizers were, for the most part, unsuccessful at the group level.
Exceptions were the detection of visuotactile invariant representations in the
unimodal T mask and the bimodal VT mask. In those regions, at the group level,
a classifier trained to distinguish objects presented visually could then distinguish
the same objects presented in touch, and vice versa. Looking throughout the
brain with searchlights, however, our conjunction group analyses located
bimodally invariant representations in small but significant portions of our
subjects. In these circumscribed regions of the association cortices, we detected
object information that generalized across auditory and visual, auditory and
tactile, as well as visual and tactile presentations of the objects.
We located the overlaps between the AV & AT, AV & VT, and AT & VT
maps. For example, a region that appears in both the AV and AT maps, but not
the VT map, contains representations that abstract across hearing and seeing an
object, or hearing and touching an object, but that do not abstract across seeing
and touching an object. We describe such representations as "trimodal with an
auditory bias". In these regions, hearing an object evokes a representation that
shares a certain set of features with the representation evoked by seeing that
object, and shares a different set of features with the representation evoked by
touching that object. Similarly, the AV-VT overlap may contain trimodal
representations with a visual bias, and the AT-VT overlap, trimodal
representations with a tactile bias. "Biased" multisensory representations have
previously been identified, although only in the bimodal case: Kassuba et al.
(2013) detected representations that abstract across vision and touch, with vision
predominating.
We note that our method of crossmodal classification could only
definitively establish bimodal invariance, not trimodal invariance. We may,
however, identify candidate regions for containing trimodally invariant
representations, by finding the overlap of the bimodal invariances. This reasoning
is bolstered by the observation that bimodal invariance was often found in the
overlap of the unimodal maps. For example, regions that distinguished videos
and also distinguished touches were found to contain bimodally invariant VT
representations. We found the AV-AT-VT overlap, i.e., regions that appeared in
all three crossmodal maps. This final subset identifies the bilateral postcentral
sulcus and supramarginal gyrus as candidates for containing trimodally invariant
representations with fairly equal contributions of auditory, visual, and tactile
information.
Our various classifications and overlaps support an overall model of
multiple stages of information processing in the sensory and association cortices
(Fig 4.4). As sensorimotor information traverses through deeper stages of
processing, modality-specific representations are transformed into increasingly
more abstract representations that generalize across two and possibly three
modalities. Bimodal and trimodal convergences follow the distance-minimizing
topographic principle of occurring at the borders separating their respective
sensory cortices (Wallace et al. 2004). We further distinguish the candidates for
trimodal representations into those that remain biased towards one modality and
those, perhaps at a superordinate level of information convergence, that are
trimodally invariant and unbiased towards any particular modality.
! "++!
Decoding Sensory Representations Around the Brain
Our detection of intramodal information within heteromodal cortices
contributes to a growing appreciation that "modality specific" cortices may in fact
have rather more catholic interests (Schroeder & Foxe 2005). Our results
complement recent work showing that primary sensory cortices of the auditory,
visual, and somatosensory modalities contain information that can determine in
which other modality a stimulus was presented (Liang et al. 2013). For example,
activation patterns in primary somatosensory cortex encoded whether an
auditory beep or visual flash was presented. Furthermore, those authors
decoded the location of visual and tactile stimuli within heteromodal cortices. The
visual and tactile unimodal maps in our study, showing classification of videos
and touches in heteromodal cortices, provide a conceptual replication of Liang et
al. (2013). Our auditory unimodal map extends the evidence to heteromodal
classification of sounds as well.
Our results replicate and unify the findings of various other studies
showing sensory stimulus-specific information in heteromodal cortices: auditory
information can be detected in visual (Vetter et al. 2011; Vetter et al. 2014) and
somatosensory (Etzel et al. 2008) cortices; visual information can be detected in
auditory (Meyer et al. 2010, Hsieh et al. 2012) and somatosensory (Meyer et al.
2011; Smith & Goodale 2013) cortices; tactile information can be detected in
auditory (Kayser et al. 2005, in macaque) and visual (Oosterhof et al. 2012)
cortices.
! "+"!
By virtue of presenting the same set of objects across three different
modalities, our results also unify the piecemeal findings of individual studies that
have identified bimodally invariant representations among sight, sound, and
touch: audiovisual (Peelen et al. 2010; Akama et al. 2012; Simanova et al. 2012;
Man et al. 2012; Ricciardi et al. 2013), audiotactile (Etzel et al. 2008, Kassuba et
al. 2012), and visuotactile (Pietrini et al. 2004; Oosterhof et al. 2010; Oosterhof et
al. 2012; see also Lacey & Sathian 2011 for a review of non-MVPA studies).
Our crossmodal maps that included the tactile modality (VT and AT) may
also relate to "mirroring" mechanisms for action recognition. Watching or hearing
an object being handled activated representations similar to actually handling the
object. These representations were mostly detected outside of the canonical
mirror neuron regions, ventral premotor cortex and inferior frontal gyrus,
consistent with other MVPA studies of the human mirror neuron system
(Oosterhof, et al. 2013). We therefore interpret visual mirror neurons and auditory
"echo neurons" (Kohler et al. 2002) to be examples of a more general
mechanism of action-perception convergence, in which other, and possibly all,
sensory modalities may participate, and which occurs throughout human
association cortex.
Limitations on Inference to Particular Subjects or the Whole Group
Given that our conjunction group analysis could only establish that some
portion of our group contained bimodally invariant representations, we wondered
if multiple crossmodal classifications were ever successful in the same subject. If
! "+#!
so, we could then establish that the same spatial pattern learned by a classifier to
identify an object in one modality could also identify the object in two other
modalities. It is unclear, however, if this is permitted in a conjunction group
analysis, as currently defined (Heller et al. 2007; Benjamini & Heller 2008). When
pooling P values to conclude that "At least 8 out of 18 subjects can cross-decode
videos and touches in this region", it does not necessarily follow that one can
then select the subjects with the eight highest accuracies (and smallest P
values).
Our goal was to establish the existence (and map the locations) of
modality invariant representations. While the adoption of conjunction group
analyses comes at the cost of forgoing inferences about individual subjects or
about the whole group, we believe that cost is outweighed by the ability to detect our
desiderata in partial conjunctions of the group. In those cases in which we
employed group average t-tests — the localizer-based crossmodal classifications
— our results were mixed. Modality invariant representations may thus be in
highly variable locations across subjects. Our activation-based method of
defining masks may also be sub-optimal, since it was shown that higher
decoding accuracies may actually be associated with lower levels of overall
activation; these were called sharpened representations (Kok et al. 2012).
Is Neocortex Essentially Multisensory?
In an influential article, Ghazanfar and Schroeder (2006) explored whether
all of neocortex was essentially multisensory. According to our results, if not all,
! "+$!
then most of it is. Our maps show deep interpenetrations of sensory information
throughout the cortical mantle, with the exception of the rostral prefrontal
cortices. These results agree with a recent MVPA study locating regions in which
modality information could not be decoded: dorsolateral prefrontal cortex and
anterior insula (Tamber-Rosenau et al. 2013). Those authors interpreted these
regions to be an "amodal" cortical bottleneck involved in response selection. But
this account of its function might interchangeably be called "pan-modal": a
response is always made to something, and that something is sensory input,
however far removed.
We mapped the representations of real world, dynamic objects, presented
across audition, vision, and touch, to large swathes of the sensory and
association cortices. We distinguished several stages of representational
abstraction, occurring in distinct regions of cortex, and culminating perhaps in
unbiased trimodally invariant representations in the candidate sites of
supramarginal gyrus and secondary somatosensory cortex. We stress, however,
that even these super-ordinate representations would not constitute concepts on
their own. Any particular concept, reaching its full efflorescence, will involve the
activation of fragmentary associations located in multiple sensory-motor
modalities (Damasio 1989; Meyer and Damasio, 2009). Super-ordinate
representations do the necessary but not sufficient work of linking and
coordinating the retrieval of these fragments. Thus, a mental concept may be
more appropriately defined as a particular global state of the multisensory brain.
! "+%!
Figure 4.1.a. Functional localizer maps. Image is in radiological convention (right side of image corresponds to left side of brain). (All volumetric data files from this chapter may be downloaded at this persistent link: http://goo.gl/zyxy7T)
! "+&!
Figure 4.1.b. Functional localizer maps.
! "+'!
Figure 4.2.a. Intramodal searchlight maps: regions of the brain in which sounds, videos, and touches, respectively, were decoded. Color indicates the consistency of decoding, or the portion of subjects in which stimulus-specific information was detected.
! "+(!
Figure 4.2.b. Overlap of intramodal (auditory, visual, and tactile) searchlight maps:
color indicates overlap score (in arbitrary units), a measure of the combined
classification consistencies from each component map.
! "+)!
Figure 4.2.c. Summary figure of intramodal searchlights. Color indicates
classification type and brightness indicates consistency. Hues are additive in
regions of overlap.
! "+*!
Figure 4.3.a. Crossmodal searchlight maps. Regions of the brain in which bimodally invariant representations were detected. Color indicates the consistency of decoding, or the portion of subjects in which stimulus-specific information was detected.
! ""+!
Figure 4.3.b. Crossmodal searchlight overlaps. Overlaps between the A-V, A-T,
and V-T maps were found. Color indicates overlap score. Regions which
appeared in all three crossmodal maps are circled in green.
! """!
Figure 4.3.c. Summary figure of crossmodal searchlights. Color indicates
classification type and brightness indicates consistency. Hues are additive in
regions of overlap.
! ""#!
Figure 4.4. A schematic map of hierarchical abstraction of representations from
auditory, visual, and tactile stimuli. Contours indicate level of abstraction.
! ""$!
Table 4.1. Classifier performance in localizer-based masks at the group level.
Shaded cells indicate intramodal classification performed in the namesake mask
or crossmodal classification performed in the pertinent bimodal mask. Chance
level was 0.5; the statistical significance of better than chance accuracy at the
group level was assessed with one-tailed t-tests. *, P < 0.05; **, P < 0.01; ***, P <
0.001; ****, P < 0.0001; ns, not significant.
a. Intramodal classification (statistical significance of above-chance accuracy)

Stimuli \ Mask    Auditory    Visual    Tactile
Sounds            ****        *         ns
Videos            **          ****      ****
Touches           ***         **        ****

b. Crossmodal classification (statistical significance of above-chance accuracy)

Train-test \ Mask    Auditory    Visual    Tactile    Audiovisual    Audiotactile    Visuotactile
AV                   ns          ns        -          ns             -               -
VA                   ns          ns        -          ns             -               -
AT                   ns          -         ns         -              ns              -
TA                   ns          -         ns         -              ns              -
VT                   -           ns        **         -              -               **
TV                   -           ns        **         -              -               **
! ""%!
Chapter 4 References
Akama H, Murphy B, Na L, Shimizu Y, Poesio M (2012) Decoding semantics
across fMRI sessions with different stimulus modalities: a practical MVPA study.
Frontiers in Neuroinformatics 6:24.
Benjamini Y, Heller R (2008) Screening for partial conjunction hypotheses.
Biometrics 64:1215–1222.
Bonner MF, Peelle JE, Cook PA, Grossman M (2013) Heteromodal conceptual
processing in the angular gyrus. NeuroImage 71:175–186.
Brainard DH (1997) The Psychophysics Toolbox. Spatial Vision 10:433–436.
Bremmer F, Schlack A, Shah NJ, Zafiris O, Kubischik M, Hoffmann K, Zilles K,
Fink GR (2001) Polymodal Motion Processing in Posterior Parietal and Premotor
Cortex. Neuron 29:287–296.
Chang CC, Lin CJ (2011) LIBSVM: A Library for Support Vector Machines. ACM
TIST 2:27.
Damasio A (1989) Concepts in the brain. Mind & Language 4:24–28.
Etzel JA, Gazzola V, Keysers C (2008) Testing simulation theory with cross-
modal multivariate classification of fMRI data. PloS one 3:e3690.
Ghazanfar AA, Schroeder CE (2006) Is neocortex essentially multisensory?
Trends in Cognitive Sciences 10:278–285.
Hanke M, Halchenko YO, Sederberg PB, Hanson SJ, Haxby JV, Pollmann S
(2009) PyMVPA: A python toolbox for multivariate pattern analysis of fMRI data.
Neuroinformatics 7:37–53.
Heller R, Golland Y, Malach R, Benjamini Y (2007) Conjunction group analysis:
an alternative to mixed/random effect analysis. NeuroImage 37:1178–1185.
Hsieh P-J, Colas JT, Kanwisher N (2012) Spatial pattern of BOLD fMRI activation
reveals cross-modal information in auditory cortex. Journal of Neurophysiology
107:3428–3432.
Jenkinson M, Bannister P, Brady M, Smith S (2002) Improved optimization for
the robust and accurate linear registration and motion correction of brain images.
NeuroImage 17:825–841.
Jenkinson M, Smith S (2001) A global optimisation method for robust affine
registration of brain images. Medical Image Analysis 5:143–156.
! ""&!
Jones EG, Powell TP (1970) An anatomical study of converging sensory
pathways within the cerebral cortex of the monkey. Brain 93:793–820.
Kassuba T, Klinge C, Hölig C, Röder B, Siebner HR (2013) Vision holds a
greater share in visuo-haptic object recognition than touch. NeuroImage 65:59–
68.
Kassuba T, Menz MM, Röder B, Siebner HR (2013) Multisensory Interactions
between Auditory and Haptic Object Recognition. Cerebral Cortex 23: 1097–
1107.
Kayser C, Petkov CI, Augath M, Logothetis NK (2005) Integration of touch and
sound in auditory cortex. Neuron 48:373–384.
Kohler E, Keysers C, Umiltà MA, Fogassi L, Gallese V, Rizzolatti G (2002)
Hearing sounds, understanding actions: action representation in mirror neurons.
Science 297:846–848.
Kriegeskorte N, Goebel R, Bandettini P (2006) Information-based functional brain
mapping. Proceedings of the National Academy of Sciences of the United States
of America 103:3863–3868 .
Lacey S, Sathian K (2011) Multisensory object representation: insights from
studies of vision and touch, 1st ed. Elsevier B.V.
Liang M, Mouraux A, Hu L, Iannetti GD (2013) Primary sensory cortices contain
distinguishable spatial patterns of activity for each sense. Nature
Communications.
Man K, Kaplan JT, Damasio A, Meyer K (2012) Sight and sound converge to
form modality-invariant representations in temporoparietal cortex. The Journal of
Neuroscience 32:16629–16636.
Meyer K, Damasio A (2009) Convergence and divergence in a neural
architecture for recognition and memory. Trends in Neurosciences 32:376–382.
Meyer K, Kaplan JT, Essex R, Webber C, Damasio H, Damasio A (2010)
Predicting visual stimuli on the basis of activity in auditory cortices. Nature
neuroscience.
Meyer K, Kaplan JT, Essex R, Damasio H, Damasio A (2011) Seeing touch is
correlated with content-specific activity in primary somatosensory cortex.
Cerebral Cortex 21:2113–2121.
! ""'!
Oosterhof NN, Wiggett AJ, Diedrichsen J, Tipper SP, Downing PE (2010)
Surface-based information mapping reveals crossmodal vision-action
representations in human parietal and occipitotemporal cortex. Journal of
Neurophysiology 104:1077–1089.
Oosterhof NN, Tipper SP, Downing PE (2012) Visuo-motor imagery of specific
manual actions: a multi-variate pattern analysis fMRI study. NeuroImage 63:262–
271.
Oosterhof NN, Tipper SP, Downing PE (2013) Crossmodal and action-specific:
neuroimaging the human mirror neuron system. Trends in Cognitive Sciences
17:311–318.
Peelen MV, Atkinson AP, Vuilleumier P (2010) Supramodal representations of
perceived emotions in the human brain. The Journal of Neuroscience 30:10127–
10134.
Pereira F, Botvinick M (2011) Information mapping with pattern classifiers: a
comparative study. NeuroImage 56:476–496.
Pietrini P, Furey ML, Ricciardi E, Gobbini MI, Wu W-HC, Cohen L, Guazzelli M,
Haxby JV (2004) Beyond sensory images: Object-based representation in the
human ventral pathway. Proceedings of the National Academy of Sciences of the
United States of America 101:5658–5663.
Ricciardi E, Handjaras G, Bonino D, Vecchi T, Fadiga L, Pietrini P (2013) Beyond
motor scheme: a supramodal distributed representation in the action-observation
network. Public Library of Science One 8:e58632.
Schroeder CE, Foxe J (2005) Multisensory contributions to low-level,
“unisensory” processing. Current Opinion in Neurobiology 15:454–458.
Simanova I, Hagoort P, Oostenveld R, van Gerven MAJ (2012) Modality-
Independent Decoding of Semantic Information from the Human Brain. Cerebral
Cortex.
Smith FW, Goodale MA (2013) Decoding Visual Object Categories in Early
Somatosensory Cortex. Cerebral Cortex.
Smith SM (2002) Fast robust automated brain extraction. Human Brain Mapping
17:143–155.
Smith SM, Jenkinson M, Woolrich MW, Beckmann CF, Behrens TEJ, Johansen-
Berg H, Bannister PR, De Luca M, Drobnjak I, Flitney DE, Niazy RK, Saunders J,
Vickers J, Zhang Y, De Stefano N, Brady JM, Matthews PM (2004) Advances in
! ""(!
functional and structural MR image analysis and implementation as FSL.
NeuroImage 23 Suppl 1:S208–19.
Tamber-Rosenau BJ, Dux PE, Tombu MN, Asplund CL, Marois R (2013) Amodal
processing in human prefrontal cortex. The Journal of Neuroscience 33:11573–
11587.
Vetter P, Smith FW, Muckli L (2011) Decoding natural sounds in early visual
cortex. Journal of Vision 11:779–779.
Vetter P, Smith FW, Muckli L (2014) Decoding sound and imagery content in
early visual cortex. Current Biology 24:1256–1262.
Wallace MT, Ramachandran R, Stein BE (2004) A revised view of sensory
cortical parcellation. Proceedings of the National Academy of Sciences of the
United States of America 101:2167–2172.
Woolrich MW, Ripley BD, Brady M, Smith SM (2001) Temporal autocorrelation in
univariate linear modeling of FMRI data. NeuroImage 14:1370–1386.
CHAPTER 5. "See what I mean?" Abstract representations are shared across
individuals
5.1. Introduction
Fodor (1998) pointed out that any philosophical definition of concepts
must provide a mechanism which enables concepts to be shared between
people. This was called the "publicity" criterion and it underlies the very
possibility of communication between people. From a neural perspective,
communication may be thought of as a method for transferring patterns of brain
activity from one brain to another.
Signals transmit information by virtue of the fact that they cause different
states to occur. There is a dependency relation: different messages reliably
cause different responses (Dretske 1981). Therefore, the receiver of a message
does not have arbitrary latitude to interpret the message. If we are to
communicate, we may not understand whatever we wish from whatever we hear.
When we talk about bells, we both think of ringing and swinging things; we have
the same mental content activated in both of our brains. If you thought instead of
a barking and swinging thing, communication will have failed.
Here we tested the hypothesis that these shared mental contents are
reflected in shared spatial activity patterns in the brains of different people. In
previous work we detected the presence of modality invariant neural
representations of common objects; they were activated whether hearing or
! ""*!
seeing or touching an object. We reanalyzed data from these two experiments to
determine if they were similar across different subjects.
5.2. Experiment 1: Auditory, Visual, and Audiovisual-invariant representations are
shared across individuals.
In previous work we established the existence of modality-invariant
representations in temporoparietal cortex: similar representations are evoked
when hearing or seeing an object. Here we reanalyzed the audiovisual (AV)
dataset to determine the extent to which modality-invariant representations were
shared across individual subjects. Would the same object, seen by one person
and heard by another, evoke similar neural representations in the two
individuals?
5.2.1. Experiment 1: Materials and Methods
We re-analyzed the data collected from our previous audiovisual study
(Man et al. 2012). Briefly, we recorded brain images from eight subjects while
they watched silent videos depicting various objects or listened to the
corresponding sounds.
Creation of Study-Specific BOLD Template Image
Perhaps the greatest challenge to decoding representations across
subjects is bringing the idiosyncratic brains into anatomical or functional
alignment. One solution is to warp each brain image into a common space. A
! "#+!
BOLD template image was created with the Advanced Neuroimaging Tools
(ANTs) image registration and template generation algorithms, which use
diffeomorphic transformations to preserve local topology (Avants et al. 2011; with
custom scripts provided by N.M. van Strien). This template had some desirable
qualities: it was specific to our dataset, being based only on the brains of our
subjects; its processing pipeline used the same functional BOLD data with which
we would perform MVPA; and it did not require an additional warping step through
subjects' anatomical brain images. Each subject's functional data were warped
into the template space and then smoothed with a 2.35 mm FWHM Gaussian
kernel.
Cross-Subject Classification
MVPA was performed using the PyMVPA software package (Hanke et al.,
2009) in combination with LibSVM’s implementation of the linear support vector
machine (Chang and Lin, 2011). We trained classifiers on data from seven
subjects and tested them on data from the remaining left out subject. We
performed eight-fold cross-validation with each subject left out once, then
averaged the accuracies from the eight folds. Intramodal and crossmodal
classification was performed in three regions of interest from our AV study:
auditory cortices (AC), visual cortices (VC), and posterior superior temporal
sulcus (pSTS), the sole region in which we had previously detected audiovisual
invariant representations.
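The leave-one-subject-out scheme can be sketched as follows. This is a minimal illustration assuming each subject's masked, template-space data and labels are held in parallel lists; the names are hypothetical, and the crossmodal variant simply restricts the training and testing trials to different modalities.

import numpy as np
from sklearn.svm import LinearSVC

def cross_subject_accuracy(subject_data, subject_labels):
    # subject_data: one (trials x voxels) array per subject, all warped into the
    # common BOLD template space; subject_labels: the matching label vectors
    accuracies = []
    for left_out in range(len(subject_data)):
        train_X = np.vstack([d for i, d in enumerate(subject_data) if i != left_out])
        train_y = np.concatenate([l for i, l in enumerate(subject_labels) if i != left_out])
        clf = LinearSVC().fit(train_X, train_y)
        # Test on the left-out subject; repeat with each subject left out once
        accuracies.append(clf.score(subject_data[left_out], subject_labels[left_out]))
    return np.mean(accuracies)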
! "#"!
5.2.2. Experiment 1: Results
Intramodal Classification
Sound stimuli were successfully classified from the auditory cortices of
one subject using information from the auditory cortices of seven other subjects
(Table 5.1.a). Videos were likewise successfully classified in the visual cortices.
Heteromodal classification was also successful but with lower accuracies:
sounds were classified across the visual cortices of different subjects, and
videos, across the auditory cortices. The pSTS multisensory region classified
both sounds and videos across different subjects.
Crossmodal Classification
Crossmodal classification was not successful in the auditory or visual
cortices across different subjects. In contrast, the pSTS region successfully
performed crossmodal classification across subjects. Both directions of
crossmodal classification were significantly better than chance level (Table
5.1.b).
5.2.3. Experiment 1: Discussion
We found that auditory stimuli evoke similar activity patterns in the
auditory cortices of different people, and that visual stimuli evoke similar activity
patterns in the visual cortices of different people. To some extent this result may
have been expected, given the topographic organization of the primary visual
! "##!
and auditory cortices. Retinotopy and tonotopy impose a certain level of
representational similarity across different subjects.
Perhaps more surprising is the fact that sounds can cause similar patterns
to be evoked in the visual cortices of different people, and videos can cause
similar patterns to be evoked in the auditory cortices of different people. A
parsimonious explanation would be that sensory cortices do not represent the
heteromodal stimuli per se, but rather that their activity is a sensory-appropriate
association triggered by the heteromodal stimulus. Associations triggered in AC
and VC might once again be topographic in nature, and facilitate cross-subject
classification.
Representations of sounds in the pSTS were similar across different
people; the same was true for representations of videos. The pSTS is not known
to be topographically organized – in fact, it has a discontinuous, patchy
organization (Beauchamp et al. 2004), barring a slavish spatial correspondence
in stimulus representation. Our original study showed that pSTS represented
objects in a similar manner across seeing and hearing them; they were therefore
abstract, non-topographic representations.
Our cross-subject crossmodal classification shows that these abstract
representations are shared. We demonstrate that an object, seen by one person
and heard by another, evokes similar representations in the pSTS of the two
people. Recall too that both persons would also have similar patterns activated in
their auditory cortices (by perception or by association) and in their visual
cortices (likewise).
! "#$!
5.3. Experiment 2: Decoding representations across subjects in our trimodal
dataset
In this reanalysis of the trimodal (AVT) dataset, we assess the
dependence of cross-subject classification on shared representational content. In
the AVT study the sounds and videos were identical across repetitions and
across subjects: the same sound or video clip was played each time. During
tactile stimulation, by contrast, subjects freely explored the object placed in their
hands. Touches and hand movements were different each time the same object
was presented. Any resemblance among touch patterns was certainly even weaker
across different subjects. There would thus be no commonly agreed-upon "touch
representation" for the objects, which we hypothesized would degrade cross-subject
classification performance for representations involving touch.
We capitalized on the difference in cross-subject stimulus variability to
predict that sounds and videos would be better classified than touches across
individuals. We also expected to replicate our prior finding of shared audiovisual
invariant representations across individuals. Finally, we hoped to assess the
extent to which audiotactile and visuotactile invariant representations were
shared across individuals.
! "#%!
5.3.1. Experiment 2: Materials and Methods
We re-analyzed the AVT study data. Briefly, we recorded functional brain
images from 18 subjects while they were presented the same set of objects in
the auditory, visual, and tactile modalities.
Subject Selection and Co-registration
The conjunction group analyses from our original study showed that
information was present in at least some of our subjects. The minimal test to
demonstrate cross-subject information sharing is to decode representations
across two subjects. Accordingly, we selected the top two best-classifying
subjects from each classification type for further cross-subject analysis (Table
5.2). Conjunction group analysis did not permit us to infer that these particular
subjects definitively contained the information of interest, but they were, for our
purposes, the subjects most likely to cross-decode.
We performed the relatively simple registration of one subject's functional
data to another's using the FSL FLIRT tool (Jenkinson and Smith, 2001) and a 12
degree-of-freedom affine transformation. This dispensed with the need to create
a new template image.
Searchlight Based Voxel Selection
For each of the nine different classifications (intramodal A, V, T and
crossmodal AV, VA, AT, TA, VT, TV) we defined a mask of voxels based on their
performance in each subject's searchlight map. The subjects' searchlight
! "#&!
accuracy maps were normalized and summed, then the top 1000 voxels were
selected to compose the mask.
In the context of a cross-subject analysis, selecting voxels to perform a
classification based on the searchlight map of that classification does not
constitute "peeking", or over-fitting one's own data to inflate classifier
performance. We selected the best-classifying subjects and the best-classifying
voxels within each subject in order to test the hypothesis that those voxels are
activated in the same way, across subjects – a conclusion emphatically not
guaranteed by our selection process.
Cross-Subject Classification
We minimized the number of transformations (and attendant image
distortion) by training classifiers in the subject's own functional data space, then
testing the classifier on the other subject's warped-in data. Both subjects' data
were smoothed with a Gaussian kernel with FWHM of 1.35 voxels. Each subject
served in turn as training subject and then testing subject, and the two
accuracies were averaged. We used permutation testing to generate a null
distribution of accuracy values, performing the analysis 10,000 times with
randomly permuted testing labels.
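The permutation test can be sketched as follows: a minimal version in which the trained classifier's predictions for the test subject's trials are scored against shuffled labels, with names that are hypothetical.

import numpy as np

def permutation_p(predictions, test_labels, n_perm=10000, seed=0):
    # predictions: the trained classifier's labels for the test subject's trials
    rng = np.random.default_rng(seed)
    observed = np.mean(predictions == test_labels)
    null = np.array([np.mean(predictions == rng.permutation(test_labels))
                     for _ in range(n_perm)])
    # One-tailed P value: fraction of the null distribution at or above the observed accuracy
    return (np.sum(null >= observed) + 1.0) / (n_perm + 1.0)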
5.3.2. Experiment 2: Results
Voxels in the auditory mask represented sounds with similar activity
patterns across two subjects (Table 5.3.a). Voxels in the visual mask represented
! "#'!
videos similarly for different subjects. Touch representations in the tactile mask
did not generalize to a different subject.
A classifier trained to distinguish sounds in one subject was able to
distinguish videos in the other subject, and likewise for training on videos to
distinguish sounds (Table 5.3.b). Sound-evoked activation patterns in one
subject did not generalize to touch-evoked activation patterns in the other
subject, and vice versa. Video-evoked activation patterns in one subject were
found to generalize to touch-evoked activation patterns in the other subject, and
vice versa.
5.3.3. Experiment 2: Discussion
Consistent with our hypothesis, brain activation patterns evoked by the
idiosyncratic touches were not found to be shared across subjects. Brain
activation patterns evoked by the identically presented sounds and videos,
however, were shared across subjects.
We detected shared bimodally invariant representations. Similar abstract
representations were evoked in the brains of two subjects when one heard the
sound of an object and the other saw a video of the same object, replicating a
result from Experiment 1. We also found that (a likely different set of) similar
abstract representations were evoked when one person saw a video of an object
and the other person touched the same object. We did not detect the presence of
shared audiotactile-invariant representations.
! "#(!
Subjects in our AVT experiment were instructed not to imitate the actions
depicted on video – doing so would have created sounds with those objects and
contaminated the unimodal tactile stimulation. We observed that their otherwise
free manipulation of the objects was highly variable from trial to trial and subject
to subject. Nevertheless, these variable touches evoked visuotactile-invariant
representations that were similar across subjects. Representations that abstract
across vision and touch may therefore be more tolerant to variation and more
similar across individuals than representations that abstract across audition and
touch.
5.4. General discussion
In re-analyses of two different datasets, we found that auditory and visual
representations of common objects were similar across participants. We
speculate that the topographic nature of sensory cortices may contribute to
similarity of representation across subjects. Furthermore, we found
representations that abstracted across seeing or hearing an object, or across
seeing and touching an object, that were also shared across subjects.
Abstract representations of visually presented stimuli have previously
been shown to be similar across subjects. Shinkareva et al. (2008) found that
representations of picture categories (tools or dwellings), as well as of individual
exemplars, were shared across subjects. Follow-up work found representations that
abstracted object category across pictures and printed words were also shared
across subjects (Shinkareva et al. 2011). Another study (Quadflieg et al. 2011)
! "#)!
decoded abstract representations of "up" and "down" across visual stimuli
presented in high or low spatial locations and categories of printed words (high or
low locations, good or bad valence) across subjects.
As hypothesized, we failed to decode touches across subjects. We
attribute this to the variability of hand movements across subjects. It is likely that
consistent touches would have evoked similar representations in the
somatosensory cortices of different subjects; a study from our group found that
videos of touches evoked similar representations in the primary somatosensory
cortices of different subjects (Kaplan & Meyer 2012).
A recent study found that the time courses of activity in the posterior
superior temporal gyrus were similar between a speaker and a listener when an
unambiguous scene was being described; the time courses differed when the
speaker described a scene with many possible interpretations (Dikker et al.
2014). This bolsters the case that inter-brain coupling depends on
shared mental content (Hasson et al. 2012). Future work on shared neural
representations may provide greater insight into the thorny problem of
miscommunication. Perhaps the mere recognition that miscommunication arises
from a mismatch of patterns in the brains of conversants will encourage greater
efforts to bring brains, and people, into synchrony.
! "#*!
Chapter 5 Tables
a. Intramodal classification accuracies
                             Auditory cortex   Visual cortex   pSTS
Sounds                       ****              ns              ****
Videos                       ****              ***             ****

b. Crossmodal classification accuracies
                             Auditory cortex   Visual cortex   pSTS
train sounds, test videos    ns                ns              *
train videos, test sounds    ns                ns              ***
Table 5.1. AV experiment cross-subject classification accuracies. (Chance
performance was 0.5; statistical significance assessed with one-tailed t-tests. *, P
< 0.05; **, P < 0.01; ***, P < 0.001; ****, P < 0.0001; ns, not significant)
! "$+!
Table 5.2. Subjects selected for cross-subject analysis. (One subject pair was
selected for each analysis: intramodal auditory, intramodal visual, intramodal
tactile, and crossmodal A→V and V→A, A→T and T→A, and V→T and T→V.)
! "$"!
a. Intramodal classification accuracies
Sounds     ****
Videos     ***
Touches    ns

b. Crossmodal classification accuracies
A→V    ****
V→A    **
A→T    ns
T→A    ns
V→T    *
T→V    **
Table 5.3. AVT experiment cross-subject classification accuracies. (Chance
performance was 0.5; statistical significance was assessed with permutation
testing.)
! "$#!
Chapter 5 References
Avants BB, Tustison NJ, Song G, Cook PA, Klein A, Gee JC (2011) A
reproducible evaluation of ANTs similarity metric performance in brain image
registration. NeuroImage 54:2033–2044.
Beauchamp MS, Argall BD, Bodurka J, Duyn JH, Martin A (2004) Unraveling
multisensory integration: patchy organization within human STS multisensory
cortex. Nature neuroscience 7:1190–1192.
Chang C-C, Lin C-J (n.d.) LIBSVM: a library for support vector machines.
Dikker S, Silbert LJ, Hasson U, Zevin JD (2014) On the Same Wavelength:
Predictable Language Enhances Speaker-Listener Brain-to-Brain Synchrony in
Posterior Superior Temporal Gyrus. Journal of Neuroscience 34:6267–6272.
Dretske F (1981) Knowledge and the Flow of Information. MIT Press.
Fodor J (1998) Concepts: Where Cognitive Science Went Wrong. New York:
Oxford University Press.
Hanke M, Halchenko YO, Sederberg PB, Hanson SJ, Haxby JV, Pollmann S
(2009) PyMVPA: a Python toolbox for multivariate pattern analysis of fMRI data.
Neuroinformatics 7:37–53.
Hasson U, Ghazanfar AA, Galantucci B, Garrod S, Keysers C (2012) Brain-to-
brain coupling: a mechanism for creating and sharing a social world. Trends in
cognitive sciences 16:114–121.
Jenkinson M, Smith S (2001) A global optimisation method for robust affine
registration of brain images. Medical image analysis 5:143–156.
Kaplan JT, Meyer K (2012) Multivariate pattern analysis reveals common neural
patterns across individuals during touch observation. NeuroImage 60:204–212.
Man K, Kaplan JT, Damasio A, Meyer K (2012) Sight and sound converge to
form modality-invariant representations in temporoparietal cortex. Journal of
Neuroscience 32:16629–16636.
Quadflieg S, Etzel JA, Gazzola V, Keysers C, Schubert TW, Waiter GD, Macrae
CN (2011) Puddles, parties, and professors: linking word categorization to
neural patterns of visuospatial coding. :2636–2649.
! "$$!
Shinkareva SV, Mason RA, Malave VL, Wang W, Mitchell TM, Just MA (2008)
Using fMRI brain activation to identify cognitive states associated with
perception of tools and dwellings. PLoS ONE 3:e1394.
Shinkareva SV, Malave VL, Mason RA, Mitchell TM, Just MA (2011)
Commonality of neural representations of words and pictures. NeuroImage
54:2418–2425.
CHAPTER 6. General Discussion
"Through this whole book there are great pretensions to new discoveries in
philosophy; but if anything can entitle the author to so glorious a name as that of
an inventor, it is the use he makes of the principle of the association of ideas,
which enters into most of his philosophy. Our imagination has a great authority
over our ideas; and there are no ideas that are different from each other which it
cannot separate, and join, and compose into all the varieties of fiction. But
notwithstanding the empire of the imagination, there is a secret tie or union
among particular ideas, which causes the mind to conjoin them more frequently
together, and makes the one, upon its appearance, introduce the other. ... It will
be easy to conceive of what vast consequence these principles must be in the
science of human nature, if we consider that, so far as regards the mind, these
are the only links that bind the parts of the universe together, or connect us with
any person or object exterior to ourselves. For as it is by means of thought only
that anything operates upon our passions, and as these are the only ties of our
thoughts, they are really to us the cement of the universe, and all the operations
of the mind must, in a great measure, depend on them."
David Hume, 1740
6.1. Summary
In this dissertation I reported a set of studies on the nature of abstract
neural representations. I and my colleagues first reviewed the neuroanatomical
basis for sensory convergence pathways, making reference to a rich history of
experimental comparative neuroanatomy. Next we reviewed the human
functional brain imaging evidence for bottom-up and top-down organization of
sensory information across the auditory, visual, and somatosensory-motor
! "$&!
modalities. In my first study we detected the existence of audiovisual-invariant
representations, and mapped them to a known multisensory region in the
temporoparietal cortices. In my second study we detected the existence of
audiovisual-, audiotactile-, and visuotactile-invariant representations; mapped
their locations and various overlaps; and constructed a model of the topographic
organization of multiple stages of representational abstraction across sight,
sound, and touch. In my third study, we showed that audiovisual-invariant and
visuotactile-invariant representations are similar across the brains of different
subjects, and that shared neural activity patterns may depend on shared mental
content.
6.2. Concepts: Are we there yet?
I set out to map the neural architecture of concepts. How close did I get to
the goal? The answer depends, in large part, on how satisfactory one finds the
empiricist account of concepts. If one accepts that concepts are built from
sensations, then mental concepts are just the patterns of neural activity that link
neural sensory representations. But what else could concepts possibly be?
Alternative accounts, of differing plausibility, have been put forth. Radical
nativism holds that we are born with the entire stock of concepts we can ever
possibly have (Fodor 1998). Despite seeming false on its face (where is the
"SCHADENFREUDE" neuron located in an infant?), there are indications that
such a position may be reconciled with modern neuroscience. We may be born
with a maximally interconnected brain in which every imaginable association —
! "$'!
every conceivable concept — is present. This "primary repertoire" then
undergoes severe synaptic and neuronal pruning during development, perhaps
following Darwinian selectionist principles (Edelman 1987). A more moderate
nativist position holds that we may be born with particular neural emphasis on
specific categories of things in the world: perhaps faces, for humans, and lasers
(and other quickly moving bright spots), for cats (Caramazza & Shelton 1998).
Under either formulation of concepts, neuroscience has clearly assumed
the mantle of investigating them. Conceptual representation has been claimed for
brain activities that can tell you whether a pictured object belongs in the kitchen
or the garage, and whether it should be squeezed or rotated (Peelen &
Caramazza 2012). Perhaps a stronger case for concepts was made by
Coutanche & Thompson-Schill (2014), who showed that the representation of a
cued (but not yet presented) object in a high-level convergence region was
strongest when component features of that object were also activated in the
lower-level cortices representing those features. For example, subjects cued to
an upcoming picture of a lime had better decoding accuracy for limes in the
anterior temporal lobe if they also activated representations of "green" in V4 and
of "round" in LOC. Single neurons in the hippocampus and medial temporal
cortices have been called "concept cells" (Quiroga 2012). They display extreme
specificity to particular concepts: "Jennifer Aniston" neurons or "Oprah" neurons
that respond only to the pictures, printed names, or voices of their designees
(Quiroga et al. 2009).
! "$(!
We make the somewhat different case that concepts are specific global
brain states. Meaning is built from combinatorial associations drawing upon
widespread regions of the brain (Pulvermüller 2013). Concepts do not
correspond to the naked activity of the links without the content they bind. Our
work elucidates the neural architecture of these links.
6.3. Outstanding issues
An issue remains with the generally low accuracies of our classifiers.
When we call the detected patterns similar, the patterns are in fact so different
that the information is nearly undetectable. But the information is nevertheless
there, and detected. We can only hope that technological improvements in
functional brain imaging and classification algorithms can improve our ability to
decode the mental contents of brain activity. But for our purposes — to establish
the existence and map the location of content-specific information — the
currently available methods have proved adequate.
A different issue is whether the spatial activity patterns we detected really
constitute "representations". It is clear that the brain does not transact its affairs
at the centimeter-scale neighborhoods of our searchlights. However, MVPA is
sensitive to features, such as orientation specificity in V1, that occur at scales
smaller than single voxels (Kamitani & Tong 2005). The status of such MVPA
"hyperacuity" is still under debate (Op de Beeck 2010; Kamitani & Sawahata
2010). Meyer and Kaplan (2011) discuss the role of spatial scale in decoding
representations. It may be trivial to decode stimulus modality from the whole
! "$)!
brain: the activity of a single voxel in visual cortex may discriminate between the
presentation of a video or the presentation of a sound. The function of that voxel
is not to encode stimulus modality. Rather, its activity contains information which
allows a classification algorithm to make the distinction. (In fact, its activity
contains information specifying much more than that – we are just unable to
make use of it.) Given the restricted scale of our searchlights (10 mm radii) and
the specificity of the distinctions we ask them to make ("Is this more like a bell or
a typewriter?"), the activity patterns that we study may have a claim to being
functional representations.
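To make the restricted scale of the searchlights concrete, the sketch below shows one way a small-radius searchlight classification can be set up. It is a minimal illustration under stated assumptions rather than the pipeline used in these studies: voxel coordinates and trial-wise patterns are assumed to be available as plain NumPy arrays, the 10 mm radius is the only detail taken from the text above, and scikit-learn's LinearSVC and cross_val_score stand in for the PyMVPA/LIBSVM tools that were actually used.

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def searchlight_accuracies(coords_mm, patterns, labels, radius=10.0, cv=5):
    # coords_mm: (n_voxels, 3) voxel center coordinates in millimeters
    # patterns:  (n_trials, n_voxels) activation patterns, one row per trial
    # labels:    (n_trials,) stimulus labels (e.g., "bell" vs. "typewriter")
    n_voxels = coords_mm.shape[0]
    accuracies = np.zeros(n_voxels)
    for v in range(n_voxels):
        # Gather the sphere of voxels within `radius` mm of the center voxel,
        # then classify using only the pattern across that small neighborhood.
        dist = np.linalg.norm(coords_mm - coords_mm[v], axis=1)
        sphere = np.flatnonzero(dist <= radius)
        scores = cross_val_score(LinearSVC(), patterns[:, sphere], labels, cv=cv)
        accuracies[v] = scores.mean()
    return accuracies

The point of the restriction is visible in the code: each classification sees only the voxels inside one sphere, so above-chance accuracy at a given center voxel reflects information carried by that local pattern rather than by gross differences spread across the whole brain.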
! "$*!
Chapter 6 References
Caramazza A, Shelton JR (1998) Domain-specific knowledge systems in the
brain: the animate-inanimate distinction. Journal of cognitive neuroscience 10:1–
34.
Coutanche MN, Thompson-Schill SL (2014) Creating concepts from converging
features in human cortex. Cerebral Cortex.
Edelman G (1987) Neural Darwinism: The Theory of Neuronal Group Selection.
New York: Basic Books.
Fodor J (1998) Concepts: Where Cognitive Science Went Wrong. New York:
Oxford University Press.
Hume D (1739) A treatise of human nature. Project Gutenberg, eBook Collection.
http://www.gutenberg.org/files/4705/4705-h/4705-h.htm
Kamitani Y, Sawahata Y (2010) Spatial smoothing hurts localization but not
information: pitfalls for brain mappers. NeuroImage 49:1949–1952.
Kamitani Y, Tong F (2005) Decoding the visual and subjective contents of the
human brain. Nature neuroscience 8:679–685.
Kaplan JT, Meyer K (2011) Multivariate pattern analysis reveals common neural
patterns across individuals during touch observation. NeuroImage.
Op de Beeck HP (2010) Against hyperacuity in brain reading: spatial smoothing
does not hurt multivariate fMRI analyses? NeuroImage 49:1943–1948.
Peelen MV, Caramazza A (2012) Conceptual object representations in human
anterior temporal cortex. 32:15728–15736.
Pulvermüller F (2013) How neurons make meaning: brain mechanisms for
embodied and abstract-symbolic semantics. Trends in cognitive sciences
17:458–470.
Quian Quiroga R, Kraskov A, Koch C, Fried I (2009) Explicit encoding of
multimodal percepts by single neurons in the human brain. Current Biology
19:1308–1313.
Quiroga RQ (2012) Concept cells: the building blocks of declarative memory
functions. Nature Reviews Neuroscience 13:587–597.