THE REPRESENTATION OF MEDIAL AXES
IN THE PERCEPTION OF SHAPE
by
Mark Daniel Lescroart
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(NEUROSCIENCE)
May 2011
Copyright 2011 Mark Daniel Lescroart
Table of Contents

List of tables
List of figures
Abstract

Chapter 1: Introduction
    Metric vs. coordinate relations
    Specifying relations vs. relational role binding
    Medial axes
    Structure of the dissertation
        Chapter 2
        Chapter 3
        Chapter 4
        Chapter 5

Chapter 2: A cross-cultural study of the representation of shape: sensitivity to generalized-cone dimensions
    Abstract
    Introduction
        Independent processing of dimensions of shape
        Possible role of the presence of simple artifacts
        The Himba
        Texture segregation
    Methods
        Logistics
        Task
        Stimuli
        Training
        Subjects
    Results
        Classifier analysis
    Discussion

Chapter 3: Spontaneous perception of axis-relative relationships under different viewing conditions
    Introduction
    Materials & Methods
        Subjects
        Stimuli
        Similarity rating task
        Statistical analysis
    Results
    Discussion
        Size vs. orientation
        Conclusions

Chapter 4: Efficient mental rotation of medial axis structures
    Introduction
    Materials & Methods
        Subjects
        Stimuli
        Task & trial timing
        Statistical analysis
    Results
        Effect of angular disparity
        Effect of same / different parts on axis structure judgments
        Other factors affecting reaction time and accuracy in both tasks
    Discussion
        Comparison of same-axis and same-part tasks
        What rotates?
        Medial axis structures can be rotated very efficiently
        Conclusions

Chapter 5: Imaging of the representation of axis structures
    Introduction
    Materials & Methods
        Subjects
        Stimuli
        Task: attend to component parts
        fMRI data collection and preprocessing
        fMRI classification analyses
        Regions of interest
    Results
        Behavioral results
        Univariate fMRI results
        fMRI classification results
    Discussion
        Relation to other work
        Why the high accuracy for axis structure classification in V1?
        Why the lower accuracy for classification of parts vs. axis structures?
        Why the low classification accuracy overall?
        Conclusions

Chapter 6: General Conclusions
    Good performance on axis structure tasks
    Persistent effects of orientation / sensitivity to viewpoint
    Intermediate-level representation of medial axis structure

References
Appendix: Full stimulus set
List of tables
Table 1: Mental rotation regression statistics
Table 2: Statistical results by ROI
List of figures
Figure 1: Relation/identity examples
Figure 2: Categorical relations in different reference frames
Figure 3: Shape sorter
Figure 4: Himba visual environment and testing setup
Figure 5: Illustration of texture segregation tasks
Figure 6: Himba / USC behavioral results (RT and Error)
Figure 7: Experimental stimuli for similarity rating (9/54 images used)
Figure 8: Statistical similarity rating results: size variation
Figure 9: Statistical similarity rating results: orientation variation
Figure 10: Mental rotation stimuli
Figure 11: Mental rotation results (RT and sensitivity)
Figure 12: Reaction times for all axis families separately
Figure 13: Other studies investigating speeds of mental rotation
Figure 14: Stimuli for fMRI classification
Figure 15: Regions of interest and activation
Figure 16: Support vector machine classifier results by region of interest
Figure 17: Classification results with voxel count equated for each ROI
Figure 18: SVM classification results for training/testing by orientation
Figure 19: Hemodynamic responses in all ROIs to each axis group
Abstract
Aristotle famously said that vision is “to know what is where, by looking”—
but that is not the whole story. Vision is also to know what is where relative to
everything else. We constantly make use of relative position information, when
we draw, build, or read a map or diagram. How does our visual system divide up
space, to let us know that one object is above another, or that one part of an
object protrudes from the end of another part? Structural description theories of
shape recognition hold that our visual system represents objects as collections of
parts in particular relations. The principal focus of this dissertation is to
investigate one plausible scheme for the encoding of within-object (between-part)
spatial relations: the encoding of medial-axis-relative relations. Medial axes are
imaginary lines that pass through the central part of a volume, as a spit through a
hot dog. Medial axes within each part of an object can define an invariant, stick-figure-like structure of an object, in the sense that the points of attachment and
relative angles between the axes (in rigid objects) will not change with the
perspective from which the object is seen. A behavioral similarity rating
experiment showed that naïve subjects spontaneously judge novel objects to be
more similar if the objects share the same medial axis structure, in many cases
even if the objects are composed of different-shaped parts and shown from
different perspectives. When subjects were asked to distinguish (in a
same/different task) categorically-different medial axis structures seen from
different perspectives, they showed slower recognition times with greater
differences in perspective (though the estimated rates of mental rotation were
very fast). Finally, evidence from a multi-voxel pattern classification fMRI study
showed that medial axis structure—as distinct from simple retinotopic
orientation—is encoded starting at a very early stage (V3) in the visual cortex.
Patterns of activity in V3 were more similar in response to novel objects that
shared the same medial axis structure than in response to novel objects that
shared the same overall orientation.
Chapter 1: Introduction
Identifying an object despite variation in position, depth, and other cues is
a difficult task—but not the only task the human visual system can accomplish.
Our visual cortex provides us not only with the names of the objects around us,
but with information about the relations between objects and the internal
structure of individual objects. A representation of relative position is critical for
many tasks: we use relative position information whenever we draw or build, and
any time that we read a map or diagram. How do we do it? What sort of a
representation could support our ability not only to name things, but to reason
about the relative positions of object parts (independent of their absolute
positions)? Structural description theories of shape perception (Biederman, 1987; Marr, 1982) hold that visual objects are represented as collections of parts and the spatial relations between them.

Figure 1: Relation/identity examples. Pairs of objects defined by different relationships between the same parts.
Spatial relations are not only necessary for explicit judgments of relative
position (for example, the position of a king relative to a rook on a chess board)—
the relative spatial positions of objects’ parts can strongly influence our
perception of an object (or part). Shimon Ullman pointed out that a door knob
may be any shape, but so long as it is attached to a door in the usual manner it
will be recognized as a doorknob (Ullman, 1989). In some cases changes in the
relative position of parts can change the entire identity of an object (Biederman,
1987) (see Figure 1), just as changing the relative positions of the phonemes in a
word can change its meaning (as in “cat” and “tack” or “rough” and “fur”).
The principal purpose of this dissertation will be to investigate one
plausible way to encode the relations between parts of an object: encoding of
medial-axis-based relations.
Metric vs. coordinate relations
The spatial position of an object (or a part of an object) with respect to
another object (or part) can be encoded in several distinct ways. Relations can
be encoded precisely, in terms of absolute distances and/or angles—for
example, “the ball is four feet to the left of the table on the ground”. Specifying a
location or relation in this way requires a coordinate frame with an origin and a
basis—“four feet to the left” has no meaning without the coordinate frame defined
by the table (the origin), the basis defined by gravity (down) and the viewer’s
perspective (which defines left/right). Because of their precision and dependency
on a coordinate frame, relations that specify absolute locations have been called
either metric or coordinate relations (Kosslyn, 1987; Laeng, Chabris, & Kosslyn,
2003). Metric will be favored here, because all relations depend on coordinate
frames. Metric relations are well-suited for guiding action in the real world, where
failure to perceive exact distances can mean a missed step, a knocked over
glass, or a car accident. However, encoding of precise distances is less useful for
identifying objects (objects do not change in identity depending on their size,
orientation, or distance from an observer).
Another way to encode spatial relations is to divide all the locations in the
world into discrete bins or categories; this type of relation encoding has been
called “categorical” (Kosslyn, 1987; Laeng et al., 2003). Dividing space into
categorically-distinct regions emphasizes some changes in position and ignores
others. Steven Pinker provides a vivid example:
Imagine you are in a rainstorm, ten feet from an overhanging ledge. Move
one foot toward it; you still get wet. Move over another foot; you still get
wet. Keep moving, and at some point you no longer get wet. Continue to
move another foot in the same direction; you don’t get any drier. So nature
has set up a discontinuity between the segment of the path where gradual
changes of position leave you equally wet and the segment where gradual
changes of position leave you equally dry. And it is exactly at that
discontinuity that one would begin to describe your position using “under”
rather than “near.” (Pinker, 2007, p. 186)
As highlighted in Pinker’s example, language tends to capture categorical
distinctions, often in prepositions (or their equivalents in other languages) (Pinker
& Bloom, 1990).
Categorical relations, like metric relations, must be specified in reference
to a coordinate frame. Pinker’s example gives one categorical distinction with
respect to a gravitational reference frame; many more such categories are
possible (e.g. “above”, “beside”, “in front of,” and “behind”—see Figure 2a, b).
Within objects, categorical relationships can capture the way the parts of an
object join together, and thus facilitate object recognition (Biederman, 1987). For
example, the legs of a chair are usually below the seat. However, gravitational
categorical relations are only useful for recognizing objects to the extent that
objects usually appear in the same canonical orientation with respect to true
vertical. If a chair is knocked over, the legs are no longer below the seat
(gravitationally speaking), just as the cylinder in Figure 2b is no longer above the
brick if the whole object is rotated 90°.
If our visual system is to make sense of misoriented or novel objects, it
must have some way to represent between-part relations in a more object-
specific manner that would not be completely dependent on the object’s
orientation. One way to encode object-centered spatial relations is to use the
principal axis of elongation of an object to specify a coordinate frame for relations
(Marr, 1982; Marr & Nishihara, 1978; Palmer, 1975). Because the principal axis
of an object will rotate with the object, axial relations will stay constant with changes in perspective (Figure 2c, d).

Figure 2: Categorical relations in different reference frames. a., b. Relations with respect to gravity. c., d. Relations with respect to the principal axis of an object. *Note that ends of the axis will only be distinct if some feature of the object distinguishes them, for example taper along the axis resulting in one end being narrower than the other, or a protrusion at one end (as here).
It is somewhat less clear what constitutes a categorical (vs. metric)
relation change with respect to a principal axis, but several suggestions have
been proposed, such as the distinction between parts adjoining end-to-end (as in
an L) vs. end-to-middle (as in a T) (Biederman, 1987). Categorical distinctions
can also be made between perpendicular and co-linear parts, as well as between
parts that join together via the ends of their medial axes (as the line segments in
either an L or a T) vs. the sides of their medial axes (as two logs in a raft).
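To make these categorical distinctions concrete, here is a minimal sketch (illustrative only; the endpoint-pair representation of axes, the attachment convention, and the angle thresholds are assumptions made for this example, not a model proposed in the dissertation) of how a junction between two part axes could be binned into such categories:

    import numpy as np

    def angle_deg(u, v):
        """Unsigned angle between two 2-D vectors, in degrees."""
        c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

    def categorize_junction(axis_a, axis_b, end_tol=0.15):
        """
        axis_a: ((x1, y1), (x2, y2)) segment for the reference part's medial axis.
        axis_b: ((x1, y1), (x2, y2)) segment for the attached part; by convention
                here, its first endpoint is taken as the attachment point.
        Returns coarse categorical labels for where and at what angle b joins a.
        """
        a0, a1 = np.asarray(axis_a, float)
        b0, b1 = np.asarray(axis_b, float)

        # Where along axis_a does axis_b attach (0 = one end, 1 = the other end)?
        d = a1 - a0
        t = np.clip(np.dot(b0 - a0, d) / np.dot(d, d), 0.0, 1.0)
        attachment = ("end-to-end (L-like)" if t < end_tol or t > 1 - end_tol
                      else "end-to-middle (T-like)")

        # Relative angle between the two axes, binned categorically.
        ang = angle_deg(a1 - a0, b1 - b0)
        if ang < 20 or ang > 160:
            orientation = "collinear"
        elif 70 < ang < 110:
            orientation = "perpendicular"
        else:
            orientation = "oblique"
        return attachment, orientation

    # Example: a vertical part attached to the middle of a horizontal part (a "T").
    print(categorize_junction(((0, 0), (10, 0)), ((5, 0), (5, 8))))
    # -> ('end-to-middle (T-like)', 'perpendicular')

Note that this toy example only covers end-of-axis attachments; side-to-side attachments (the two logs in a raft) would need an additional case.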
The principal objective of this dissertation was to explore the encoding of
axis-based relations in the human visual system.
Specifying relations vs. relational role binding
A (spatial) relation delineates two (or more) positions in space, but does
not necessarily specify the entities that occupy those locations or positions. Fully
encoding the relationship between two objects or object parts additionally entails
binding the objects or parts into roles within the relation. Different role-bindings of
objects (or parts) in the same relation can define highly different scenes (or
objects). For example, a person on top of a house is very different from a house
on top of a person. This investigation was principally concerned with medial axes
as a way to encode within-object relations per se, without addressing how entities
(e.g. specific parts) are bound to roles in those relations. For a discussion of
relational role filling and the binding problem in general, see (Hayworth, 2009;
Hummel & Biederman, 1992; Hummel & Holyoak, 2003; Singer, 1999).
Medial axes
A medial axis is the central line through a volume, as a bone through a
leg. In the late 1960s and ‘70s, Harold Blum (1967; 1978) pointed out that
representing an object as a set of conjoined medial axes captures the structure of
the object, as a stick figure captures the structure of an animal, and provided a
way to compute a Medial Axis Transform (MAT) from a shape’s outline via a
“grassfire” algorithm. Since that time, several researchers have updated Blum’s
algorithm to make it more robust to variation in outline noise (Feldman & Singh,
2006; Pizer, Burbeck, Coggins, Fritsch, & Morse, 1994; Pizer, Siddiqi, Székely,
Damon, & Zucker, 2003).
A single medial axis specifies a vector, with an origin (the base of the part)
and a direction (the angle at the junction with the other part), and potentially an
associated degree of curvature of the axis (which could be zero when the axis is
straight). That is, specifying a medial axis for each part of an object provides a
description of the parts at the level necessary for specifying relations (either
categorical or metric).
A few neurological studies support the idea that medial axes are computed
at an early stage in the visual system, potentially at the level of V1: Lee and
colleagues (1998) found that V1 cells show heightened responses to oriented
bars located along the medial axis of a texture-defined figure. Kimia (2003) has
noted that the lateral connections in V1 are well-situated to compute convex
parts’ medial axes via a computation like Blum’s “grassfire” algorithm.
Psychophysical evidence also suggests that humans have some
automatic representation of medial axes: Kovacs and Julesz (1994)
found that subjects showed increased contrast sensitivity to local edges (Gabors)
located at the center and oriented parallel to the medial axis of a 2-D figure. If
medial axes are computed at an early stage of the visual system, then it seems
likely that later stages of the visual system might encode junctions between
medial axes—that is, relationships between objects’ parts, specified as medial
axis structures.
Lesion studies in both dorsal and ventral regions of the visual cortex
support these conclusions. There are a number of patients in the literature who
have selectively lost the ability to make relative position judgments (including
relative position judgments between parts of objects), though other aspects of
their object recognition are remarkably spared. Many of these patients sustained
damage to the parietal lobe, and their deficits are commonly called
“simultagnosia” (Wolpert, 1924) or Balint’s syndrome.
The most typical deficit in simultagnosia is an inability to simultaneously
perceive multiple entities in the visual field: “only one object, or part of an object,
can be seen at a time.” (Farah, 2004). These patients are able to integrate
information about the parts of an object sufficiently to name the object, though
they often take longer than healthy control subjects in object naming tasks. However,
they cannot reason about the structure of objects or cognitively address one part
of an object at a time. Simultagnosic patients convincingly demonstrate that an
intact ventral stream is not sufficient for explicit representation of the relations
between object parts.
A tempting conclusion to draw from patients with Balint’s syndrome and
simultagnosia, then, is that relative position is explicitly encoded in the parietal
lobe. However, several other patients in the literature complicate the story.
Gallant, Shoup, and Mazer (2000) studied a patient, A.R., with a lesion in the
ventral right hemisphere, a region most likely to comprise the ventral part of V4.
One quadrant of the patient’s visual field was affected by the lesion; in that
quadrant only, he demonstrated (among other deficits) a marked impairment in
reporting which half of a half-white, half-black circle was on top. Control
experiments indicated that he could distinguish white/black circles from solid-
colored circles and reliably report the orientation of the boundary; thus his deficit
was specifically in determining the relative position of the object parts (i.e., the
circle halves). Behrmann et al. (2006) describe another patient, S.M., who had a
lesion in the ventral part of the left lateral occipital area. In an experiment using
two-part geometrical objects, S.M. was selectively impaired at distinguishing
objects that shared the same parts but differed in the relationships between the
parts.
These patients’ deficits show that an intact parietal lobe is not sufficient for
the explicit representation of relative position, either. Instead, they point to an
interaction between dorsal and ventral areas that makes spatial relationships
between object parts cognitively explicit.
Structure of the dissertation
Chapter 2
A central assumption of the present work is that objects’ parts are
represented in the visual system as generalized cones. Generalized cones are
the volumes defined when a cross section is swept along an axis (for example, a
circle traversing a curved axis would define a macaroni-like shape). Research in
human psychophysics (Arguin & Saumier, 2000; H. Op de Beeck, Wagemans, &
Vogels, 2003; Stankiewicz, 2002) and macaque electrophysiology (Kayaert,
Biederman, Op de Beeck, & Vogels, 2005) has shown that the dimensions that
define differences in generalized cones—for example, aspect ratio and axis
curvature—are independently represented in developed-world laboratory
subjects and macaques (lending credence to the theory of volumetric part
representation). In chapter 2, I provide further support for the representation of
parts as generalized cones, by testing whether humans from a culture with
minimal exposure to the simple, manufactured objects that characterize
developed-world visual environments encode axis-relative dimensions of
generalized cones independently, as well. The Himba, a semi-nomadic people in
a remote region of Northwestern Namibia with little exposure to regular, simple
artifacts, were virtually identical to Western observers in representing
generalized-cone dimensions of simple shapes independently. Thus immersion in
a world of simple, manufactured shapes is not required to develop a
representation of parts based on generalized cone dimensions.
Chapter 3
The theoretical motivation for encoding relationships between parts with
respect to a principal medial axis is that the relationships will still be defined in
the same way despite variation in the perspective from which an object is seen.
So a first question to ask is: Would naïve subjects notice axis-structural
commonalities among a set of novel objects, even if the objects vary in size or
orientation? In Chapter 3, I used an “inverse multidimensional scaling” paradigm,
in which naïve subjects rated the similarity of a set of novel objects that varied in
medial axis structure, in the parts that composed the objects, and in overall
orientation and size. I found that medial axis structure influenced the perception
of parts at all sizes and orientations, in that most subjects judged objects sharing
the same parts and the same axis structure to be more similar than objects with
the same parts alone. However, though some subjects did prioritize medial axis
structure per se (independent of common parts, orientation or size), axis
structure was not as readily extracted when the orientation of the objects varied.
Access to medial axis structural information—independent of parts—may only be
made cognitively explicit with focused attention.
Chapter 4
Following up on the results of Chapter 3—that subjects seem to more
readily notice the medial axis structure of upright objects—I asked what sort of
cost would be elicited by a task that forced subjects to compare different medial
axis structures presented at different orientations in depth and in the picture
plane. Subjects were given two different tasks with the same set of images used
in the similarity rating study: first, to judge whether two simultaneously-presented
objects shared the same medial axis structure, and second, to judge whether two
objects shared the same component parts. Subjects identified objects sharing the
same component parts with a minimal cost of the angular disparity between
them. Objects sharing the same medial axis structure were also identified very
quickly, with estimated mental rotation speeds exceeding those from most (any,
to this author’s knowledge) other mental rotation studies in the literature.
However, the costs in reaction time were highly significant and strikingly linear at
increasing orientation disparities, indicating that even categorical distinctions in
medial axis relationships (of objects with the same or different parts) are
substantially more readily perceived if two objects are aligned. Medial axis
structures provide a likely basis for efficient mental rotation, even if the objects to
be aligned do not share any local features in common.
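As background for the rotation-rate estimates mentioned above, the conventional calculation fits the linear reaction-time cost against angular disparity and inverts the slope. The sketch below uses placeholder numbers, not data from this experiment:

    import numpy as np

    disparity_deg = np.array([0, 30, 60, 90, 120])          # angular disparity between views
    mean_rt_ms = np.array([900, 960, 1020, 1080, 1140])     # hypothetical condition means

    # RT = intercept + slope * disparity; the slope (ms/deg) converts to a rate.
    slope_ms_per_deg, intercept_ms = np.polyfit(disparity_deg, mean_rt_ms, 1)
    rotation_rate_deg_per_s = 1000.0 / slope_ms_per_deg     # e.g., 2 ms/deg -> 500 deg/s
    print(slope_ms_per_deg, rotation_rate_deg_per_s)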
Chapter 5
Finally, I investigated the representation of medial axis structure in the
brain using an fMRI multi-voxel pattern classification study. fMRI provides a way
to probe the visual representation at multiple different stages of processing, from
primary visual cortex (V1), which has been implicated in encoding of (single)
medial axes, through higher-order processing stages in the dorsal and ventral
temporal lobes, which have been implicated in processing of relations by lesion
studies (Farah, 2004). Using the same set of novel objects as in the behavioral
experiments, I found that objects sharing the same medial axis structure elicited
reliably more similar patterns of activity than objects sharing the same overall
orientation—that is, a prioritization of axis structural information over retinotopic
orientation information—by the level of V3 (a surprisingly low-level area to show
complex shape structure effects).
General conclusions are discussed in chapter 6.
Chapter 2: A cross-cultural study of the representation
of shape: sensitivity to generalized-cone dimensions
(Published as Lescroart, M. D., Biederman, I., Yue, X., & Davidoff, J. (2010). A cross-cultural study of the representation of shape: sensitivity to generalized-cone dimensions. Visual Cognition, 18(1), 50-66. First published online 22 December 2008, iFirst.)
Abstract
Many of the phenomena underlying shape recognition can be derived from
an assumption that the representation of simple parts can be understood in terms
of independent dimensions of generalized cones, e.g., whether the axis of a
cylinder is straight or curved or whether the sides are parallel or non-parallel.
What enables this sensitivity? One explanation is that the representations derive
from our immersion in a manufactured world of simple objects, e.g., a cylinder
and a funnel, where these dimensions can be readily discerned independent of
other stimulus variations. An alternative explanation is that genetic coding and/or
early experience with extended contours—a characteristic of all naturally varying
visual worlds—would be sufficient to develop the appropriate representations.
The Himba, a semi-nomadic people in a remote region of Northwestern Namibia
with little exposure to regular, simple artifacts, were virtually identical to Western
observers in representing generalized-cone dimensions of simple shapes
independently. Thus immersion in a world of simple, manufactured shapes is not
required to develop a representation that specifies these dimensions
independently.
Introduction
Any simple shape can be represented by a generalized cone (GC)
(Binford, 1971; Marr & Nishihara, 1978), which is the volume created by
sweeping a cross section along an axis as, for example, when a circle is moved
perpendicularly along a straight axis to produce a cylinder. Different volumes can
be created through variations of independent GC dimensions, such as whether
the axis is straight or curved, or whether the cross section remains constant in
size. GCs assume central importance in parts-based accounts of object
recognition (Biederman, 1987; Marr & Nishihara, 1978). It is one thing to show
mathematically, as did Marr and Nishihara, that any shape can be created by
GCs. But are the dimensions that define GCs represented independently?
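As an illustration only (the parameterization and all names below are assumptions made for this sketch, not the code used to build the experimental stimuli), a generalized cone can be constructed by sweeping a circular cross section along a straight or curved axis; the two arguments varied at the bottom correspond to the aspect-ratio and axis-curvature dimensions discussed here:

    import numpy as np

    def generalized_cone(length=4.0, radius=0.5, axis_curvature=0.0,
                         n_axis=50, n_circ=24):
        """Return (x, y, z) arrays for a swept-circle surface.
        Aspect ratio ~ length / (2 * radius); axis_curvature = 1 / bend radius
        (0 gives a straight cylinder, larger values a macaroni-like bend)."""
        s = np.linspace(0.0, length, n_axis)             # arc length along the axis
        theta = np.linspace(0.0, 2 * np.pi, n_circ)

        if axis_curvature == 0.0:                         # straight axis
            axis = np.stack([s, np.zeros_like(s), np.zeros_like(s)], axis=1)
            tangent = np.tile([1.0, 0.0, 0.0], (n_axis, 1))
        else:                                             # circular-arc axis
            R = 1.0 / axis_curvature
            phi = s / R
            axis = np.stack([R * np.sin(phi), R * (1 - np.cos(phi)),
                             np.zeros_like(s)], axis=1)
            tangent = np.stack([np.cos(phi), np.sin(phi), np.zeros_like(s)], axis=1)

        # Build a normal/binormal frame at each axis point and sweep the circle.
        binormal = np.tile([0.0, 0.0, 1.0], (n_axis, 1))
        normal = np.cross(binormal, tangent)
        circle = (np.cos(theta)[None, :, None] * normal[:, None, :] +
                  np.sin(theta)[None, :, None] * binormal[:, None, :])
        surface = axis[:, None, :] + radius * circle
        return surface[..., 0], surface[..., 1], surface[..., 2]

    # A straight cylinder vs. a curved, macaroni-like volume with the same radius.
    straight = generalized_cone(axis_curvature=0.0)
    curved = generalized_cone(axis_curvature=0.5)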
Independent processing of dimensions of shape
In both humans from the developed world and laboratory macaques, there
is strong evidence that GC dimensions are encoded independently, in that
selective attention to one dimension can be exercised without an effect of
variations in another dimension. Such combinations of dimensions are said to be
separable (Garner, 1974) or analyzable (Shepard, 1964). Dimensions that cannot
be treated independently, such as hue and saturation, are termed integral
(Garner, 1974) or non-analyzable (Shepard, 1964), and selective attention is not
efficient (Garner, 1974; Shepard, 1964).
With respect to the specific case of axis curvature and aspect ratio,
Stankiewicz (2002) reported that University of Minnesota subjects could
discriminate noisy variations in one of these GC dimensions, e.g., axis curvature,
independently of the noise level on the other GC dimension, e.g., aspect ratio.
Op de Beeck, Wagemans, and Vogels (2003) showed that the search slopes for
a target that differed from the distractors in a value of a single dimension of either
axis curvature or aspect ratio were less steep than when the distractors differed
from the target in a conjunction of the values from both dimensions. For example,
a low curvature target was more readily detected among high curvature
distractors that varied in aspect ratio than when the target was defined as a low
curvature-high aspect ratio shape and the distractor shapes were a mixture of
high curvature-low aspect ratio shapes and low curvature-high aspect ratio
shapes. A possible neural basis for this selectivity was discovered by Kayaert,
Biederman, Op de Beeck, and Vogels (2005) who found that 95% of the variance
of the firing of macaque IT cells to 2D shapes could be accounted for by
independent representation of GC dimensions.
Possible role of the presence of simple artifacts
However, in all of these studies the humans and the laboratory monkeys
were raised in environments full of geometrically simple, manufactured objects, in
which variation along single GC dimensions could be readily appreciated. For
example, nails vary in aspect ratio and no other dimension, and pasta often varies in axis curvature and no other dimension. A popular toy for toddlers is a “shape sorter” (Figure 3) in which separate dimensions, cross-section shape and aspect ratio (correlated with color for the toy in the image) are varied independently.

Figure 3: Shape sorter. A “shape sorter” in which cross-section shape and aspect ratio (correlated with color in this sorter) are varied independently. (Haba Shape Sorter Board, from www.maukilo.com).

It is
possible that exposure to dimensional variation in such simple shapes facilitates
the learning of independent shape dimensions. Supporting such a view are the
reports of Schyns and Murphy (1994) and Schyns and Rodet (1997) suggesting
that the features that we use for responding to our visual world are not fixed but
flexible, reflecting our categorization needs. Consistent with the idea of encoding
flexibility, Goldstone (1994) showed that humans can learn to perform fine
judgments in one dimension of an integral combination of dimensions (brightness
and saturation) without strongly affecting discrimination performance in the other.
Would individuals from a culture with only limited exposure to developed-
world artifacts show the same independence of shape dimensions evidenced by
the typical artifact-immersed laboratory subject? If dimensions are defined flexibly
according to the needs of a culture, might the combination of dimensions that
Westerners represent separably not be so represented by people from a
markedly different culture?
There is a general belief that the early cortical stages of the visual
system have evolved (or have developed during infancy) in response to the
statistics of the images that characterize the visual world, e.g., Baddeley &
Hancock (1991). For example, there is a 1/f relation between Fourier amplitude
and spatial frequency in natural scenes, and intensity values for adjacent pixels
in natural scenes are highly correlated. These spatial (or Fourier-like) statistics
seem to characterize not only natural scenes, but artifactual environments as well
(Switkes, Mayer, & Sloan, 1978; Tadmor & Tolhurst, 1994), and tuning properties
strikingly similar to V1 cells’ receptive fields can be derived from these statistics
and a few simple assumptions (Olshausen & Field, 1996). Consequently one
would expect little difference in the early coding stages of individuals either
immersed or not immersed in a developed-world visual environment, and no one
has proposed such differences.
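That 1/f relation is easy to check numerically. The sketch below (illustrative only; the synthetic pink-noise image is a stand-in, not one of the scene-statistics analyses cited above) radially averages an image's FFT amplitude spectrum and fits the log-log slope, which for natural images is typically near -1:

    import numpy as np

    def amplitude_spectrum_slope(img):
        """Log-log slope of radially averaged FFT amplitude vs. spatial frequency."""
        img = img - img.mean()
        amp = np.abs(np.fft.fftshift(np.fft.fft2(img)))

        # Radial spatial frequency (cycles/image) of every FFT coefficient.
        h, w = img.shape
        fy = np.fft.fftshift(np.fft.fftfreq(h)) * h
        fx = np.fft.fftshift(np.fft.fftfreq(w)) * w
        r = np.hypot(*np.meshgrid(fx, fy)).round().astype(int)

        # Mean amplitude in integer-frequency bins (DC is excluded below).
        counts = np.bincount(r.ravel())
        sums = np.bincount(r.ravel(), weights=amp.ravel())
        radial_mean = sums / np.maximum(counts, 1)

        freqs = np.arange(1, min(h, w) // 2)          # frequencies up to Nyquist
        slope, _ = np.polyfit(np.log(freqs), np.log(radial_mean[freqs]), 1)
        return slope

    # Synthetic 1/f ("pink") noise as a stand-in for a natural scene.
    rng = np.random.default_rng(0)
    gx, gy = np.meshgrid(np.fft.fftfreq(256), np.fft.fftfreq(256))
    f = np.hypot(gx, gy)
    f[0, 0] = 1.0                                     # avoid division by zero at DC
    pink = np.fft.ifft2(np.fft.fft2(rng.standard_normal((256, 256))) / f).real
    print(amplitude_spectrum_slope(pink))             # expected to be roughly -1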
What about later stages of processing? The sensitivity of human
psychophysical performance and the tuning of macaque cells in more anterior
visual areas cannot be predicted from the Fourier-like tuning properties evident at
earlier stages of processing. Specifically, Fourier tuning does not make
generalized cone dimensions explicit and thus cannot account for the tuning to
GC dimensions evident in the tuning of IT cells (Kayaert et al, 2005) and human
psychophysics (Stankiewicz, 2002; Op de Beeck et al, 2003). An example of non-
Fourier tuning is the finding of Pasupathy & Connor (1999) that approximately
12% of the cells in V4 of the macaque respond to L-vertices at a particular
orientation and angle. These cells are unresponsive
to either the angle bisector or the individual legs of the vertex, effects that would
be expected from Fourier-like tuning. Another example is that macaque IT cells
and human psychophysics demonstrate greater sensitivity to nonaccidental
compared to metric variations in shape (Biederman & Bar, 1999; Kayaert,
Biederman, & Vogels, 2003). The standard statistical analyses of Fourier
components make none of this coding explicit. The origins of higher-level
perceptual categories are still unclear, and it is entirely possible that influences
other than low-level Fourier-based image statistics (e.g., immersion in an
environment filled with simple shapes, cultural emphasis, cognitive demands)
could affect the human representation of shape.
A parallel can be drawn with speech perception. Although there is no
evidence that the basic frequency-sensitivity tuning characteristics of early stages
of audition differ from culture to culture, the particular set of phonemes that can
be readily discriminated does vary with the particular language experienced in
childhood. Children only retain the ability to hear the phonemic contrasts that
convey semantic information.
The Himba
The Himba are a semi-nomadic people living in a remote region of
northwestern Namibia. Figure 4a and b show scenes typical of the Himba
environment. In the more remote encampments, the Himba have little exposure
to simple, modern artifacts and thus provide an opportunity to assess the effects
of the presence of such artifacts (or lack thereof) on the representation of shape.
We are not assuming that there is less variation in GC dimensions in the Himba’s visual world. The issue is whether the exposure to simple shapes in which the dimensions are clearly contrasted, as in the shape sorter (Figure 3), facilitates the representation of the dimensions as independent dimensions.

Figure 4: Himba visual environment and testing setup. a. Himba village showing dung and stick dwellings. b. Watering hole. c. Illustration of the Himba training procedure, with real macaroni and a stick to indicate the texture field boundary (slightly curved on top, highly curved on bottom). d. Himba subject (and son) with experimenter (MDL).
Texture segregation
To examine independent GC coding we employed a texture segregation
task as shown in Figure 5. Tasks such as the one illustrated in Figure 5a-c have
been used to assess whether dimensions such as luminance and shape are
independently coded, e.g., Bach et al. (2000). The subject has to report whether
the boundary between luminance and/or shape regions is vertical or horizontal.
The boundary is always on either side of the middle row (if horizontal) or column
(if vertical), so there is some uncertainty as to its location. In both Figure 5a and b
the boundary is rapidly and effortlessly perceived. However, in Figure 5c, in
which the texture fields are defined by a conjunction of luminance and shape,
scrutiny is required. At first glance, the conjunction condition seems so different
from the other two that the underlying similarity of the displays is not obvious.
However, in all three panels, each texture field contains two of the four elements
(darker and lighter circles and squares). So why should Figure 5c be more
difficult? In Figure 5a and b, the elements on each side of the border differ in the
values of one dimension while the values of the other dimension vary across the
whole display. If the relevant and irrelevant dimensions are separable (i.e.,
represented independently), then selective attention can be employed to respond
only to the relevant dimension. In Figure 5c, both values of each dimension are
on either side of the border so selective attention to one of the dimensions would
not help segregate the fields. Because the border is defined by a conjunction of
values, both dimensions must be processed and the task would be expected to
be more difficult than the single dimension tasks in Figure 5a and b. It is only by
virtue of a dimensionalized representation that conjunction tasks would be
expected to be more difficult than the single dimension tasks.
It seems obvious that shape and luminance (Figure 5a-c) would be
represented as independent dimensions. But what about different dimensions of
shape itself? We used the texture-segregation tasks illustrated in Figure 5d-i to
determine whether University of Southern California (USC) students, individuals
immersed in the artifacts of the developed world, and the Himba would show
independent representation of two generalized-cone metric dimensions: degree
of axis curvature and aspect ratio. These dimensions are the best-studied
examples of GC dimensions, and were used in the studies of Stankiewicz (2002),
Op de Beeck et al (2003), and Kayaert et al (2005), among others. The
dimensions also allowed effective variation of rotation in depth and in the plane to
eliminate a contribution of low-level cues of orientation and luminance. Also,
concrete examples of shapes based on these dimensions (i.e., macaroni) were
readily available for instructional purposes.
Why might the Himba, in contrast to people from the developed world, not
represent shape dimensions independently? Every object that all people see will
have some width and, to the extent that an axis can be ascribed to the object (or
object part), some value of axis curvature. The issue under test, however, was
not whether the individual attributes could be discriminated but whether, when
varied in combination, variations in one attribute could be selected and the other
ignored. As noted previously, developed-world environments provide frequent
exposure to simple manufactured objects, or simple object parts, that vary in only
a single dimension, such as aspect ratio, e.g., nails, pens, and soup cans, or only
in axis curvature, such as coiled power cords and pasta, or where the dimensions
are explicitly varied independently, as in shape sorters. We thus have more
opportunity than the Himba for discrimination training on one of these
dimensions, independently of the other. In addition, developed-world language
and classroom instruction may allow us to express and selectively attend to these
variations, whereas the Himba language, Otjiherero, provides a more limited
vocabulary of shape terms (Viljoen & Kamupingene, 1983).
Might the Himba learn to encode dimensions independently through their
exposure to, say, the aspect ratios of tree trunks or goats’ legs? Possibly, but
since many other attributes vary simultaneously in such examples, learning
would be expected to be more difficult. Furthermore, if it were found that the
Himba did differ from Westerners in any aspect of their representation of shape,
an obvious explanation would be based on the difference in visual environments.
If exposure to simple artifacts facilitated the learning of independent shape
dimensions, we would expect that the Himba would process the single dimension
tasks more like conjunctions, i.e., as integral combinations of dimensions, so
there would be little or no advantage for the single dimension tasks.
Methods
Logistics
The experimenter (MDL) flew to Windhoek, Namibia’s capital, and
undertook a two-day drive in an off-road capable vehicle to a township (Opuwo)
at the edge of Himba territory, to meet the guide and obtain provisions. Because
of ecotourism, which brings the Himba into ever greater contact with modern
artifacts, it was necessary to go to even more remote regions, at least a full day’s
drive or two from Opuwo, to search for current encampments. These remote
Himba still do have occasional interaction with traders bringing blankets, water
jugs, and Western clothes, and NGOs providing health services. They have no
electronics of any kind, no western tools, no running water, and no furniture but
rocks on which to sit and thick blankets for bedding. The experiment provided
their first exposure to a computer. Upon encountering an encampment, the guide
would approach the village chief and ask permission to camp on the outskirts of
the compound and have members of the tribe participate in the experiment.
Given that the guide could only facilitate one experiment at a time (two separate
experiments were being run), with a number of subjects unable or unwilling to
complete the experiment, a good “yield” would be one or two subjects per day.
Figure 5: Illustration of texture segregation tasks.
a-c: Luminance and shape (not used in the experiment but shown here to illustrate
conjunction costs in texture segregation). The boundary in c. is horizontal, between rows
3 and 4. d-f: Examples of displays from the low-variability task. The boundary in d is
defined by axis curvature (and is horizontal, between rows 2 and 3). In e, aspect ratio
(boundary is vertical, between columns 3 and 4), in f, axis curvature+aspect ratio
(conjunction) (boundary is horizontal, between rows 3 and 4). g-i: Examples of displays
from the high-variability task. In g, the boundary is defined by axis curvature (boundary is
vertical, between columns 2 and 3), in h, aspect ratio (boundary is horizontal, between
rows 2 and 3), and i. axis curvature + aspect ratio (conjunction) (boundary is vertical,
between columns 2 and 3).
Task
To test sensitivity to underlying shape dimensions we employed the
texture segregation task illustrated in Figure 5d-i. The task was exactly the same
as in 4a-c, but instead of color and shape, the texture elements varied on the
metric dimensions of aspect ratio and axis curvature. As a result, our stimuli
resembled macaroni noodles. The four different elements of each display were
(informally): (1) narrow, highly curved cylinders, (2) wide, highly curved cylinders,
(3) narrow, slightly curved cylinders, and (4) wide, slightly curved cylinders. The
radii of curvature for the slightly curved cylinders were 68 and 100 pixels for the
narrow and wide elements, respectively; for the highly curved cylinders, the radii
of curvature were 20 and 29 pixels, for the narrow and wide cylinders,
respectively. Each narrow cylinder had an aspect ratio of 1:4, and each wide
cylinder had an aspect ratio of 1.1:2. Each display was composed of 5x5
elements, divided into two regions, each with two types of cylinders. Subjects
judged, as quickly and as accurately as possible, whether the boundary between
the two regions was vertical or horizontal.
There were three possible ways to define the boundary: (1) by axis
curvature (highly curved vs. slightly-curved), (2) by aspect ratio (wide vs. narrow),
or (3) by a combination of the aspect ratio and axis curvature (narrow-highly-
curved and wide-slightly-curved on one side vs. narrow-slightly-curved and wide-
highly-curved on the other). In each of the first two conditions, subjects could
perform the task based on only one GC dimension; in the third, the conjunction
condition, they had to use information from two dimensions simultaneously. Each
subject’s sequence of trials was composed of all three conditions presented in
pseudo-random order.
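A minimal sketch of the display logic may help. This is an illustration in Python, not the original Psychtoolbox/Matlab stimulus code, and the grid encoding and function names are assumptions: each cell of the 5 x 5 array gets one of the four element types, and the condition determines whether the region boundary is carried by one dimension alone or only by their conjunction.

    import numpy as np

    def make_display(condition, boundary="horizontal", split_after=2, rng=None):
        """Assign (aspect-ratio level, curvature level) pairs, each 0 or 1, to a 5 x 5 grid.

        condition: 'curvature', 'aspect_ratio', or 'conjunction'.
        split_after: 1 or 2, so the border falls on either side of the middle row/column.
        """
        rng = rng if rng is not None else np.random.default_rng()
        grid = np.zeros((5, 5, 2), dtype=int)          # [..., 0]=aspect ratio, [..., 1]=curvature
        rows, cols = np.indices((5, 5))
        region = (rows if boundary == "horizontal" else cols) > split_after

        if condition == "curvature":
            grid[..., 1] = region                      # curvature level differs across the border
            grid[..., 0] = rng.integers(0, 2, (5, 5))  # aspect ratio varies everywhere
        elif condition == "aspect_ratio":
            grid[..., 0] = region
            grid[..., 1] = rng.integers(0, 2, (5, 5))
        else:                                          # conjunction of the two dimensions:
            pick = rng.integers(0, 2, (5, 5))          # one region gets the (0,0)/(1,1) pairings,
            grid[..., 0] = pick                        # the other gets (0,1)/(1,0), so neither
            grid[..., 1] = np.where(region, 1 - pick, pick)  # dimension alone defines the border
        return grid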
Stimuli
The texture field of 25 display elements (cylinders) spanned a square of
600 x 600 pixels on a 1024 x 768 pixel screen. The subjects sat approximately
0.66 m from the screen, so the whole square subtended a visual angle of
approximately 10.7°. The centers of the 25 display elements were evenly spaced
but variations in size and planar and depth orientation produced some variability
in the inter-element distances. The average size of each display element at 0°
orientation in depth subtended a visual angle of approximately 1.5°.
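For reference, these visual angles follow from the standard relation between on-screen extent w and viewing distance d (a general formula, not a value taken from this dissertation):

    \theta = 2 \arctan\!\left(\frac{w}{2d}\right)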
To deter participants from basing their decisions on low-level cues such as
local or global orientation or pixel intensity values, we varied both the orientation
and size of the elements. In the Low- (High-) Variability Condition shown in
Figure 5d-f (g-i), we randomly rotated each cylinder over 22.5° (360°) in the plane and, in depth, up to 22.5° (45°). All images were rendered in perspective
projection, and after rendering, the size (in pixels) of each cylinder was randomly
varied by 25% (33%) independent of the size variation from the depth rotation.
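For concreteness, the jitter applied to each element might be sampled as in the sketch below; only the ranges come from the text, while the uniform sampling scheme and names are assumptions:

    import numpy as np

    def jitter_element(high_variability, rng=None):
        """Draw one element's orientation and size jitter (uniform sampling assumed)."""
        rng = rng if rng is not None else np.random.default_rng()
        plane_rot = rng.uniform(0, 360.0 if high_variability else 22.5)   # in-plane rotation (deg)
        depth_rot = rng.uniform(0, 45.0 if high_variability else 22.5)    # rotation in depth (deg)
        size_scale = 1 + rng.uniform(-1, 1) * (0.33 if high_variability else 0.25)
        return plane_rot, depth_rot, size_scale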
Stimulus presentation, response recording, and feedback were done on a
Macintosh Powerbook G4 computer with a 15” screen. The stimulus presentation
code was written using Psychtoolbox3 for Matlab (Brainard, 1997; Pelli, 1997).
Training
The complexity of the task required a thorough explanation and training
procedure for both groups. For the Himba, the experimenter (with the help of the
translator) first illustrated the task using actual macaroni noodles (Figure 4c).
Subjects were asked to divide the array of noodles with a stick, keeping the
shapes that were “the same” on the same side. Once this was grasped (usually
after three or four trials), they then moved on to a practice sequence on the
computer. Initially using a stick placed across the display, as in the training trials,
they were taught to swipe the touchpad in the same (projected) orientation as
that of the stick. After their response a line would appear on the display indicating
the correct location of the border. Subjects were not required to distinguish
between the two possible locations for a divide. Subjects would continue training
until they correctly responded on seven out of eight consecutive trials, or until
they completed 40 trials (at which point it was judged that they did not
understand the task, and they were excluded from the experiment).
Subjects
A total of 32 Himba (16 female, approx. mean age 25.1 years) and 9 USC
subjects (7 female, mean age 20.5 years) participated in the experiment. (The
Himba are uncertain as to their ages.) Himba were compensated with .5 kilos of
maize (corn meal) per hour tested. USC subjects received participation course
credit or were compensated $8 for their time and effort.
Twelve Himba subjects (4 females, approx. mean age 21.7 years) and 7 USC
subjects (6 females, mean age 20.3 years) were included in the final analysis.
Two of the Himba ran in both the low- and high-variability versions of the task, so
a total of 14 Himba sessions were analyzed. All USC subjects performed both
versions of the task, so 14 USC subject sessions were analyzed.
The data from one Himba subject was lost due to battery failure, and the
data from 21 other USC and Himba subjects were excluded for a variety of
reasons, including failure to meet training criterion on the low-variability task (4
Himba, 2 USC), failure to meet training criterion on the high-variability task (7
Himba), voluntarily quitting before half the trials were completed (4 Himba),
different (pilot) testing conditions (3 Himba), and excessive Westernization (1
Himba). None of these individuals were excluded for failure to show a
conjunction cost, and the data from these subjects (only a few training trials in
many cases) were in the same direction as the data from those who completed
the experiment.
There are several reasons for the higher attrition rate among the Himba
than among the Westerners. Social customs required that some of the older
Himba (three of the 19) be allowed to attempt the experimental task, even though
two were incapable of performing above chance and the third older subject quit
after less than half a session. Also, in accordance with USC Institutional Review
Board requirements and to maintain harmonious relations with a village, all
subjects were compensated with maize, whether they completed the experiment
or not (which likely contributed to the higher drop-out rate). It was made very
clear to subjects that they could quit at will, and given that the testing was
repetitive and entirely outside their experience, it is a testament to their
perseverance that only four chose to quit. The three Himba excluded for different
testing conditions were run early in the investigation, while appropriate testing procedures were still being piloted: whether to test inside a tent (in the dark and relative isolation, but often in uncomfortable heat) or outside it, which response device to use (joystick or touchpad), and what stimulus noise levels were reasonable. One subject was excluded for excessive Westernization, since during the
day he was tested, it became obvious that many of the children in his village had
been to a new nearby school, built since the guide’s last trip to the village several
years earlier.
For the first two weeks of the investigation, no runs of the high-variability
version of the experiment were collected, due to a high degree of skepticism from
a senior author who had worked with the Himba on several prior investigations
and our guide as to whether the Himba would be able to perform the task at all.
Thus the priority, in the second two weeks of training, was to run subjects in the
high-variability version of the task, and subjects were advanced on to that task
without first performing complete trials of the low-variability task (as there was a
significant risk of losing subjects to goat-herding responsibilities). In those two
weeks, seven subjects passed the training threshold for the low-variability task,
but did not pass the threshold for the high-variability task.
Those 12 Himba who did complete the training/criterion phase required an
average of 24 trials (mean of 10.8 minutes on the computer). Seven USC
subjects successfully performed the same training procedure, although they were
not trained with the real macaroni phase. The USC subjects required an average
of 19 trials (5.1 minutes). It should be noted that the increased training time for
the Himba included translation lags and familiarization with what was, for all of
them, their first experience with a computer.
Himba subjects were given 72 trials per condition and Western subjects 108. On
every trial, a colored line appeared over the actual position of the texture boundary for
750 ms after the subject responded, as feedback. Green indicated a
correct response, red incorrect. There was no question that the Himba
understood the feedback: they showed obvious signs of displeasure at incorrect
responses.
Conditions of testing were not completely comparable between Himba and
USC subjects: none of the USC subjects breastfed their infants while performing
the task (Figure 4d), nor did a noisy goat ever attempt to enter the USC testing
room.
Results
Reaction times and error rates were analyzed in a mixed 2 × 3 repeated-
measures analysis of variance, with factors Tribe (Himba vs. USC, a between-
subjects, unequal-Ns variable) and Condition (Axis Curvature [single dimension],
Aspect Ratio [single dimension], and Conjunction, a within-subjects variable).
Separate ANOVAs were run for the high- and low-variability tasks, although the
primary discussion will focus on the high-variability task, as its displays were better
controlled for low-level features that could have produced an artifactual
conjunction cost (as described in the classifier section).
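As a concrete illustration, a minimal sketch in Python of this kind of mixed-design ANOVA follows, using the pingouin package's mixed_anova function; the file name and column names (subject, tribe, condition, rt) are illustrative assumptions, not the actual analysis files.

import pandas as pd
import pingouin as pg

# long-format data: one row per subject x condition (mean RT or error rate)
df = pd.read_csv("texture_segmentation_rts.csv")  # hypothetical file
# expected columns: subject, tribe ("Himba"/"USC"),
#                   condition ("AxisCurvature"/"AspectRatio"/"Conjunction"), rt

aov = pg.mixed_anova(data=df, dv="rt", within="condition",
                     subject="subject", between="tribe")
print(aov)  # F, p, and partial eta-squared (np2) for Tribe, Condition, and the interaction

The same call with the error rate as the dependent variable would give the corresponding error-rate ANOVA.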
Given the differences in testing conditions noted earlier, as well as the
complete unfamiliarity with the experience for the Himba and possible general
ability differences, it was not surprising that error rates and RTs were higher for
the Himba than the USC subjects, although only by 12.2% and 1.37 s (Figure 6).
These differences were significant: for error rates, F(1,10) = 5.06, p < 0.05, and
for RTs, F(1,10) = 5.31, p < 0.05.
The primary interest of this investigation was whether the Himba would be
able to selectively attend to a single dimension so that their RTs and error rates
in the single dimension condition would be reduced compared to the conjunction
condition. Both groups had reliably lower error rates and RTs when the boundary
was defined by a single dimension (either aspect ratio or axis curvature)
compared to when the boundary was defined by a conjunction of the two
dimensions (Figure 6) [for error rates, F(2,20) = 41.76, p < 0.001, ηp² = 0.81; for
RTs, F(2,20) = 56.14, p < 0.001, ηp² = 0.85]. In fact, the mean of the two single-
dimension conditions, in both RTs and error rate, was lower than the conjunction
condition for every subject in the experiment. Moreover, the magnitude of the
advantage of the single dimension conditions was comparable for both tribes,
yielding non-significant interactions between Tribe and Conditions for error rates,
F(2,20) = 1.24, p = 0.31, ηp² = 0.11, and for RTs, F(2,20) = 0.612, p = 0.55, ηp² = 0.058.

Figure 6: Himba / USC behavioral results (RT and Error). Mean percent errors and correct
reaction times for the Himba and USC subjects for the single and conjunction conditions in
the Low- and High-Variability tasks.
A Tukey HSD test verified the greater difficulty of the conjunction condition
compared to both single dimension conditions (p < 0.01 for both), but found no
significant difference between the two single-dimension conditions.
Because only two of the Himba were run on both the high- and low-
variability display conditions, whereas all of the Western subjects performed both
levels of the task, a single ANOVA encompassing all the data at both noise levels
could not be run. The reported F values are those for the high-variability displays.
An ANOVA run on the low-variability displays (and the data from those subjects
who quit only part-way through, or for whom we only have training data) gave the
same picture: all subjects showed a conjunction cost. For the excluded subjects,
the single-dimension conditions’ mean error rate was 23.9% and the mean RT
was 10.20 s. For the conjunction condition, the mean error rate was 42.8% and
the mean RT was 15.90 s.
To investigate the possibility that the Himba quickly learned the
dimensions of axis curvature and aspect ratio during the course of the
experiment, we compared the first half of the trials to the second half to see if the
conjunction costs increased over the course of the session. They did not. In fact
the opposite was the case: the difference between the averaged single
dimension conditions and the conjunction condition actually diminished from the
first to the second half of the trials: 21.2% for errors (2.59 s for RTs) in the first
half and 16.2% (1.52 s) in the second half.
It could be the case that the Himba learned to separately encode the
dimensions of aspect ratio and axis curvature within the first few minutes of task
instruction. But if 10 minutes of training will produce an effect equal to a lifetime
of increased exposure to simple shapes, then that, too, speaks to the primacy of
GC dimensions in the neural representation of shape.
The greater difficulty of the conjunction condition is presumed to be a
result of having to attend to two (rather than one) independent “high-level” shape
dimensions, axis curvature and aspect ratio. However, if the border in the single
dimension conditions could be defined by low-level, non-shape cues, either in
orientation or average pixel intensity values, then the greater difficulty of the
conjunction condition could be trivially explained by the unavailability of such
cues in that condition. Since Fourier statistics, which encompass simple local
features like orientation and pixel intensity, have been shown to be essentially
identical in both natural and artificial environments (Tadmor & Tolhurst, 1994), it
is essential to verify that the task could not be done using only such “pre-shape”
information.
Classifier analysis
To test whether the low-level cues of luminance and orientation could be
the source of the difference between the conditions, we created a classifier that
performed the task based solely on orientation and intensity. The classifier used
a subset of the Itti and Koch (2000) model—the feature channels that compute
local orientation (four orientations at six scales) and intensity—to process each of
the 5x5 experimental texture displays. The channels are based on generally
accepted quantitative estimates of early visual filtering in both domains.
Three different decision schemes for the classifier were modeled: one in
which the classifier chose the divide that gave the greatest difference in mean
intensity or orientation, one in which it chose the divide that had the greatest
difference in the variance in intensity or orientation, and one in which it combined
the mean and variance in orientation or intensity of each side of each possible
divide into a 2-dimensional vector, and chose the divide that gave the greatest
Euclidean distance between the vectors. These decision schemes represent
simple ways of making use of low-level image information to do the task (i.e.,
“Does local orientation vary more on one side than another?” rather than “does
one side have greater axis curvature than the other?”). The model that used a
vector consisting of both mean and variance performed slightly more accurately
than the other two, so further discussion will refer to that model. As with humans,
if the classifier correctly chose vertical, but chose the wrong vertical divide (i.e.,
between columns 2 and 3 when the correct divide was between 3 and 4), it was
credited with a correct response. The classifier ran 100 trials of each condition.
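To make the third decision scheme concrete, a minimal sketch in Python of the divide-selection rule alone is given below; it assumes the Itti and Koch feature channels have already been reduced to one feature value (e.g., pooled orientation energy, or mean intensity) per texture element, a step that is not shown here.

import numpy as np

def best_divide(feature_map):
    # feature_map: 5 x 5 array, one feature value per texture element
    best, best_dist = None, -np.inf
    for axis in (0, 1):              # 0 = horizontal divides, 1 = vertical divides
        for cut in range(1, 5):      # divide after row/column `cut`
            if axis == 0:
                a, b = feature_map[:cut, :], feature_map[cut:, :]
            else:
                a, b = feature_map[:, :cut], feature_map[:, cut:]
            va = np.array([a.mean(), a.var()])  # [mean, variance] summary of one side
            vb = np.array([b.mean(), b.var()])
            dist = np.linalg.norm(va - vb)      # Euclidean distance between the two vectors
            if dist > best_dist:
                best, best_dist = (axis, cut), dist
    return best                       # (orientation of divide, position of divide)

print(best_divide(np.random.rand(5, 5)))  # example call with random feature values

The mean-only and variance-only decision schemes correspond to keeping just the first or just the second element of each summary vector.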
For the high-variability displays, the orientation-based classifier showed no
significant difference between its error rates on any of the three conditions (for all
comparisons between conditions, bootstrapped P>0.05). The intensity-based
classifier produced as many errors for the Axis Curvature condition as the
Conjunction condition, with the Aspect Ratio condition associated with the fewest
errors. The ordering of conditions for the classifier was thus inconsistent with the
results shown in Figure 6. For the low-variability displays, the ordering of the
conditions by the classifier did match the ordering of the human subjects, but the
difference was smaller than that observed in the human subjects. The potential
availability of a low-level, non-shape cue (primarily orientation) in the low-
variability condition justifies our variation of orientation and size in the high-
variability condition. Consequently, we conclude, especially for the high variability
conditions, that neither low-level differences in pixel intensity nor differences in
orientation could explain the ordering of conditions.
Discussion
Every Himba and USC subject showed a significant behavioral cost when
they had to perform a task based on a conjunction of generalized cone
dimensions rather than on a single dimension. The displays for all three
conditions contained exactly the same four elements, with two of the elements on
either side of the border. The classifier ruled out an effect of luminance and
orientation differences across the border. Consequently, nothing in the displays
themselves would necessitate the greater difficulty of determining the boundary
in the conjunction condition. It is only by the coding of the display elements as
independent dimensions over which selective attention can be exercised that the
advantage of the single dimension over the conjunction conditions can be
understood. The experiment offers, to our knowledge, the most rigorous
assessment of the effects—or lack thereof—of exposure to modern artifacts on
the underlying dimensions of the representation of shape.
We attribute the decreased accuracy and longer reaction times of the
Himba to the differential testing conditions already mentioned, as well as to their
lack of experience with psychophysical testing (none of our subjects had ever
seen a computer before, much less used one). Many Himba subjects, seemingly
chagrined that they had missed more than they felt they should, told the
experimenter (through the translator), “I’m not used to this.” An additional factor
could be the differences in general ability, known to affect performance on such
tasks (Ree & Carretta, 1994). We also note that there was no question that not
only could the Himba readily appreciate the shape of the images on the screen,
they also appreciated that those shapes could be projections of real-world 3D
objects. One Himba went so far—jokingly—as to accuse the experimenter of
wasting food by placing the macaroni noodles inside the computer, where he
could not eat them!
The bottom line is that the Himba’s pattern of responses did not differ from
that of individuals living in what is, arguably, the most artifactual of environments
(Los Angeles). The sensitivity of both the Himba and USC students to underlying
dimensions of generalized cones suggests that such sensitivity does not require
immersion in a regular, manufactured environment but, instead, is likely a
consequence of non-Fourier statistics of shape, determined through genetics or
early infancy, that characterize virtually any visual world. These constraints would
presumably be incorporated into the tuning of later, shape-selective stages of the
ventral pathway.
Chapter 3: Spontaneous perception of axis-relative
relationships under different viewing conditions
Introduction
Structural theories of shape recognition (Biederman, 1987; Marr, 1982;
Winston, 1975) hold that objects are represented as collections of parts in
particular relationships. Consistent with these theories, human subjects will most
often create categories of objects that share a single distinctive part or dimension
when given a sorting or categorization task, even if the stimuli vary in many
dimensions (Ahn & Medin, 1992; Medin, Wattenmaker, & Hampson, 1987).
Would human subjects spontaneously group objects according to common
categorical between-part relationships as well?
Categorical relationships between parts can be defined in a gravitational/
viewer-centered reference frame (which specifies between-part relationships
such as “above” and “behind”) (Biederman, 1987), or a principal-axis centered
reference frame (which specifies between-part relationships such as “end-of” or
“perpendicular-to”). Relations between a principal axis and the medial axes of
other object parts can define an invariant structure of an object, in the sense that
the relations are consistently defined no matter the perspective from which an
object is seen (Marr, 1982). But are axis structural relationships incorporated into
the representation with which we habitually perceive and evaluate the world?
We created a set of objects composed of one of three different groups of
geometric parts (geons), arranged in one of three categorically-different medial
axis structures. (Thus there were 3x3=9 different objects.) Then, in an “inverse
multidimensional scaling” paradigm (Kriegeskorte, under review), naïve subjects
rated the similarity of the objects under two different types of variation in view:
first, when the objects were each rendered at six different sizes, and second,
when the objects were each rendered at six different orientations (rotations in
plane and depth). In contrast to prior sorting studies, in which subjects grouped
objects based on a single part or dimension (Ahn & Medin, 1992; Medin et al.,
1987), non-metric multi-dimensional scaling of subjects’ similarity ratings
revealed that subjects prioritized both the objects’ parts and the objects’ medial
axis structures (as a secondary dimension) in their similarity judgments.
Materials & Methods
Subjects
36 subjects participated in the similarity rating experiments. Ten subjects
(five female, ages 18-22) rated objects that varied in size, and 26 (16 female,
ages 18-39) rated objects that varied in composite orientation. All subjects were
apprised of their rights in accordance with institutional policies, and received
course credit for participation in the study.
Figure 7: Experimental stimuli for similarity rating (9/54 images used).
Rows of stimuli share the same medial axis structure; columns share the same
component parts. a. Variation in size b. Variation in view.
Stimuli
Stimuli were 54 white-on-black images, consisting of six views each of nine
different objects (Figure 7). The nine objects were each composed of one of
three groups of three geometrical volumes (geons), arranged in one of three
different structures according to the relationships between the parts’ medial axes.
The parts’ medial axes were conjoined according to categorical distinctions in
medial axis relationships suggested by Biederman (1987): either end-to-end (i.e.,
with the medial axes of each part co-linear) or end-to-side (i.e., with the medial
axes of each part perpendicular). The parts joined end-to-side were either
centered or offset, and the two parts adjoining a larger part were either co-planar
or offset.
Thus the stimuli varied in two ways that were relevant to the experiment: in
the identities of the parts that composed them, and in their medial axis structures
(i.e., the structural arrangement of their parts). For the first experiment, we varied
the size of the objects in six roughly equal steps, from ~1.5º to ~4.5º across, a
~300% variation in size. For the second experiment, we varied the overall
orientation of the objects both in plane and in depth (six 22.5˚ increments of
rotation in each). Each image in the orientation-variation experiment subtended
~3º of visual angle. The whole screen subtended ~10º. All stimuli were generated
using Blender (www.blender.org) and presented using the Psychtoolbox
(Brainard, 1997; Kleiner, Brainard, & Pelli, 2007; Pelli, 1997) for Matlab
(Mathworks).
Similarity rating task
Subjects viewed displays of five objects that could be moved around the
screen by clicking and dragging, as in Kriegeskorte et al (in submission—
personal communication). Subjects were instructed to place objects that seemed
similar close together, and objects that seemed dissimilar further apart. Rather
than simply grouping the objects as in other categorization tasks (e.g. Ahn &
Medin, 1992; Medin et al., 1987), subjects were specifically encouraged to
consider the placement of each object with respect to every other object on the
screen. To encourage subjects to consider all possible pair relationships, lines
(color-coded by distance) appeared connecting all of the images on the screen.
The distance (in screen pixels) between each pair of objects was recorded as the
dependent measure. This method of gauging similarity has been called “inverse
multi-dimensional scaling” (Kriegeskorte et al.), because the final configuration of
stimuli at the end of each trial resembled the result of a multi-dimensional scaling
algorithm. The protocol was designed to provide an intuitive interface for subjects
to express what they perceived in the objects. Furthermore, similarity ratings
were done in the context of more than one or two other stimuli, which facilitated
comparison of the stimuli along multiple dimensions.
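A minimal sketch of the dependent-measure computation is given below (illustrative Python, not the experiment code; the object identifiers and coordinates are made up):

import numpy as np
from itertools import combinations

def trial_dissimilarities(positions):
    # positions: dict mapping image id -> (x, y) pixel coordinates at the end of a trial
    out = {}
    for (i, pi), (j, pj) in combinations(positions.items(), 2):
        out[(i, j)] = float(np.hypot(pi[0] - pj[0], pi[1] - pj[1]))  # screen distance in pixels
    return out

# example: five objects placed by a subject on one trial
print(trial_dissimilarities({"obj1": (100, 100), "obj2": (400, 120),
                             "obj3": (150, 500), "obj4": (700, 300),
                             "obj5": (520, 600)}))

Distances collected this way across trials can then be averaged into a full 54 × 54 dissimilarity matrix per subject.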
Subjects practiced using the click-and-drag interface with two sets of five
words (text, not images) that had no relationship to the experimental objects or
the shape properties of interest. The word “shape” was never used in the
subjects’ training; “object” was used instead. After training, each subject rated
approximately 92 displays of five objects, encompassing all 54 of the images but
only about half of all (1,431) possible pairs of images. Every other subject saw
the complementary half of the possible pairs.
Statistical analysis
Each subject’s similarity ratings were normalized by the maximum
distance across the screen. For each subject, we ran a mixed regression model
with the distance ratings for each pair as a dependent variable, and with axis
structure group (same or different) and part group (same or different) as
categorical independent variables, view (size or orientation) as a continuous
independent variable coded from 0-1 in six linear steps, and interaction terms for
all variables. Non-metric multi-dimensional scaling (Kruskal & Wish, 1978; R. N.
Shepard, 1980) was calculated for the average (across-subject) similarity matrix
(criterion=non-metric stress).
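The following is a minimal sketch in Python of these two analysis steps; the statsmodels and scikit-learn calls are stand-ins for the actual analysis scripts, the file and column names are assumptions, and scikit-learn's stress_ value is not normalized in the same way as Kruskal's stress.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.manifold import MDS

pairs = pd.read_csv("similarity_pairs.csv")  # hypothetical long-format file
# expected columns: subject, distance (normalized 0-1),
#   same_axis (0/1), same_parts (0/1), view_step (0-1 in six linear steps)

# per-subject regression with all interaction terms
for subj, d in pairs.groupby("subject"):
    fit = smf.ols("distance ~ same_axis * same_parts * view_step", data=d).fit()
    print(subj, fit.tvalues["same_axis"], fit.pvalues["same_axis"])

# non-metric MDS on the across-subject mean dissimilarity matrix (54 x 54)
D = np.load("mean_dissimilarity_matrix.npy")  # hypothetical precomputed matrix
mds = MDS(n_components=2, metric=False, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)
print("stress:", mds.stress_)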
Results
Of the subjects who viewed the images that varied in size, most (7/10)
spontaneously placed images sharing the same axis structure close to each
other, whether the parts or the size were the same or not (i.e., they showed an
independent effect of axis structure, regression t > 2.51, p < 0.02—see Figure
8a). All subjects showed a strong effect of common part identity (for one subject,
t = 2.08, p < 0.04; for the rest, all t > 3.68, all p < 0.001), and six subjects showed
an independent effect of size (t > 3.11, p < 0.005). Additionally, all 10 subjects
showed a super-additive interaction of common parts and common axis
structures (objects with both parts and axis structures in common were judged to
be more similar than objects with only parts or only axis structures in common—
all t > 2.11, p < 0.04).
Among the subjects who viewed the images that varied in in-plane and
depth orientation, fewer than half (10/26) showed an independent effect of axis
structure (t > 2.99, p < 0.01), though the majority (20/26) of the participants
showed an interaction of axis structure and part identity (t > 2.05, p < 0.05). All
subjects showed a strong effect of part identity on their similarity judgments (all t
> 4.00, all p < 0.0001). Only one subject showed an independent effect of orientation
(t = 2.70, p = 0.007), and two (different) subjects showed an interaction of axis
structure and orientation (t > 2.33, p < 0.02).

Figure 8: Statistical similarity rating results: size variation. Regression and
multi-dimensional scaling (MDS) results for similarity ratings of objects that varied in size
(distance = normalized similarity ratings, criterion = non-metric stress). a. Regression
results. Dot sizes reflect t values obtained for each parameter (common axis structure,
common parts, common size, and interaction terms) from within-subject regressions. t values
range from -1.90 to 14.25 (blue dots represent negative t values). b. Two-dimensional MDS
solution for across-subject similarity matrix.
For all similarity rating subjects, debriefings after the experiments matched
well with the regression analysis results: subjects who showed an independent
effect of axis structure would describe, for example, placing objects “with the
small part on top of the big part” near to each other. They also used words
suggestive of parts in particular relationships, describing putting the objects with
the “fin” or “spout” on one side near each other, or placing objects with “the two
arms next to each other” close together. Subjects who only showed an effect of
parts did not mention anything about the objects’ structure or the parts’ positions.

Figure 9: Statistical similarity rating results: orientation variation. Regression and
multi-dimensional scaling (MDS) results for similarity ratings of objects that varied in
orientation (distance = normalized similarity ratings, criterion = non-metric stress).
a. Regression results. Conventions as in Figure 8. t values range from -2.60 to 19.0 (blue
dots represent negative t values). b. Two-dimensional MDS solution for across-subject
similarity matrix. c. Three-dimensional MDS solution for across-subject similarity matrix.
Planes cleanly separate the three axis groups in the third dimension.
Figure 8b shows the 2-dimensional solution to a non-metric multi-
dimensional scaling of the average (across-subject) similarity matrix
(stress=0.26) for the objects that varied in size. The objects are clearly
segregated into nine groups, with both the axis families and part families visibly
separable. Though common parts and common axes are clearly the most
important dimensions, the slightly high stress value indicates that there was more
variation in the data that was not captured in those dimensions. By convention,
stress values below .1 are generally taken to provide a good description of the
data; a 6-dimensional solution met this criterion, but the additional dimensions
were not readily interpretable. Figure 9b and c show the 2- and 3-D MDS
solutions for the objects that varied in orientation. The part families are cleanly
separable in two dimensions (Figure 9b) and the axis families are cleanly
separable in the third (Figure 9c; stress=0.08 for 2-D; stress=0.069 for 3-D).
Discussion
Many object categorization studies have found that people, from a very
early age (Mash, 2006), show a strong tendency to group objects based on common
parts (Ahn & Medin, 1992; Medin et al., 1987; Tversky & Hemenway, 1984). We
replicated these findings, and additionally found that many subjects judged
objects that shared the same medial axis structure—the same relative
arrangement of their component parts—to be similar, despite variation in the
identities of the component parts and the size or orientation at which they were
seen (Figure 8, Figure 9). The fact that axis structure was apparent in the
subjects’ judgments at all (even as a secondary dimension, as it was when the
view varied—Figure 9) is noteworthy, because several studies (Ahn & Medin,
1992; Medin et al., 1987) have found that subjects group objects based only on a
single part or dimension (despite co-variation in other dimensions that could have
served as the basis for similarity judgments).
Most subjects “noticed” axis structure even when the body orientation
varied, in the sense that they judged objects with the same parts and the same
axis structures to be more similar than objects with only the parts in common
(Figure 9a), indicating that axis structure influences the perception of parts at all
orientations. However, substantially fewer subjects prioritized axis structure per
se in their similarity judgments when the objects varied in their overall orientation
(38%) compared to when the objects varied in size (70%). That is, many subjects
did not spontaneously place objects that shared the same medial axis structure
but had different component parts close together. Relationships between parts
thus seem to be more readily encoded with respect to true vertical—i.e., with
respect to gravity (as in Biederman, 1987; Hummel & Biederman, 1992)—than
with respect to the principal axis of an object.
Another possibility is that the common parts provided such a salient
common feature that subjects did not look further for other similarities. Whatever
the case, the common parts seemed to be more obvious to the subjects than the
common axis structure, even in the similarity rating task with no orientation
variation.
Interestingly, though, only two of twenty-six subjects showed an interaction
between axis structure and body orientation, meaning that in general subjects did
not judge two objects sharing the same axis structure to be similar when they
were oriented the same way, but dissimilar when they were mis-oriented, as
might be expected from a multiple-view representation (Bulthoff & Edelman,
1992). If subjects noticed a commonality in objects’ medial axis structures at all,
they almost always noticed that commonality across all views, as if a perceptual
switch had been flipped.
If medial axis structures are only encoded at particular views/orientations,
there seems to be a ready mechanism to associate those different views into a
coherent percept—potentially some form of feature-based attention that could
index each (and every) separate view of the same axis structure.
Size vs. orientation
In explaining the similarity rating paradigm to the subjects, every effort was
made to avoid the use of the word “shape” or any word or phrase pertaining to
shape properties. However, subjects were instructed to put the objects that they
found similar close together—which carried the (unspoken) implication that they
should base their judgments on properties intrinsic to the objects (whatever they
might judge those to be). Since variation in both size and orientation can be
triggered by accidents of perspective, one might have expected subjects to
disregard both, but that was not the case: while few subjects grouped objects
based on common orientation, many subjects (6/10) judged objects of the same
size to be similar. The greater influence of size on perceived similarity is
particularly interesting in light of recent findings that real-world objects tend to be
associated with a canonical size with respect to the framing space surrounding
them (Konkle & Oliva, 2010).
Conclusions
Medial axis structure influenced the perception of parts at all sizes and
orientations, in that most subjects judged objects sharing the same parts and the
same axis structure to be more similar than objects with the same parts alone.
However, axis structure was not as readily extracted as a feature unto itself.
Access to medial axis structural information—independent of parts—may only be
made cognitively explicit with focused attention.
Chapter 4: Efficient mental rotation of medial axis
structures
Introduction
Several studies have convincingly shown that object recognition and
mental rotation are generally independent processes, showing different patterns
of costs at increasing rotations (Hayward, Zhou, Gauthier, & Harris, 2006;
Lawson, 1999) and involving different cortical regions (Gauthier et al., 2002;
Vanrie, Beatse, Wagemans, Sunaert, & Van Hecke, 2002; Wilson & Farah,
2006). The only tasks that genuinely seem to require mental rotation—i.e., to
show regular, linear costs at increasing angular disparities—are tasks involving
very subtle discriminations between two objects, such as distinguishing mirror-
reflected objects (R.N. Shepard & Cooper, 1982; R.N. Shepard & Metzler, 1971;
S. Shepard & Metzler, 1988), identifying the left or right side of an object rotated
away from its canonical vertical (Jolicoeur, 1988; Wilson & Farah, 2006), and
distinguishing objects from near-identical “mutant” shapes, identical but for a
single vertex slightly displaced (Folk & Luce, 1987) or small protrusion twisted
(Yuille & Steiger, 1982). In fact, the recognition of reflections of side views of
symmetrical objects, which are equivalent to 180° rotations, is decidedly
non-monotonic in that it shows a complete absence of any costs (Biederman &
Cooper, 1991a).
These tasks could all loosely be categorized as spatial relation tasks—but
the relations tested are all relations that people are not adept at distinguishing.
Though we can clearly perceive which direction an object is facing, whether an
object faces left or right rarely has any influence on the time taken to name it.
Numerous studies have shown that mirror images prime each other, indicating
highly similar if not identical visual representations (for purposes of identification,
at least) for mirror-reflected objects (Biederman & Cooper, 1991a; Lawson &
Humphreys, 1996; Stankiewicz, Hummel, & Cooper, 1998). Small metric changes
in position are likewise generally not perceptually salient.
A distinction can be drawn, however, between precise metric (or
coordinate) and coarse categorical spatial relations (Hummel & Stankiewicz,
1996; Kosslyn, 1987). Categorical relation changes assume particular
importance in structural description theories of shape representation (Biederman,
1987; Marr, 1982) as well as in linguistics (Jackendoff, 1992; Pinker, 2007).
Structural description theories hold that objects are represented as collections of
simple parts arranged in a limited set of coarsely-defined (categorical) relations,
such as “perpendicular-to” or “below”.
Would tasks involving distinctions between categorically different relations
still show monotonic increases in reaction time costs over increased rotation
angles? Or would they show more rotation-invariant performance, as has been
shown if the parts of objects differ in categorical (or non-accidental) ways
(Biederman & Gerhardstein, 1993)?
In this study, we tested whether subjects could distinguish categorical
changes in the relations between objects’ parts at a variety of orientation
disparities. Since gravitationally-centered categorical relations (e.g. “above” or
“beside”) will vary with rotation in the image plane, we defined a stimulus set
based on categorical distinctions in medial-axis relative relations (e.g. “co-linear
with” or “centered on”). In order to assure that it was indeed the spatial relations
between the stimuli per se that were being compared (and not some more local
image feature), we created stimuli composed of different geons in the same
medial-axis-defined structures. This allowed us to perform a second test, as well,
with the same stimuli: in separate testing sessions, we tested whether subjects
could distinguish categorical changes in the objects’ parts.
The test of whether medial axis structure, per se, can be rotated efficiently
speaks to another question in the mental rotation literature—namely, what
rotates? Different studies have suggested that entire objects are rotated (Cooper
& Podgorny, 1976; R.N. Shepard & Cooper, 1982), that a frame of reference is
rotated (Hinton, 1981), or that some partial form of an object is rotated separately
(Folk & Luce, 1987; Just & Carpenter, 1975; Yuille & Steiger, 1982). Into the
category of “partial features” fall vertices, volumetric parts, “supramodal features”
(i.e., not-completely-visual abstractions) (Jordan, Heinze, Lutz, Kanowski, &
Jancke, 2001), and also medial axes. Just and Carpenter (1985) explicitly
suggested:
…the representation that is mentally rotated could be a subset of the
representation of the entire object, such as…a skeletal representation
consisting of vectors that correspond to the major axes of each segment of
the figure. (Just & Carpenter, p. 143).
Materials & Methods
Subjects
16 subjects (8 female, ages 19-30) participated in the experiment. The
experimental protocol was approved by the USC IRB, and all subjects were
informed of their rights according to institutional policies. Subjects received
course credit or financial compensation for participation.
Stimuli
Our stimulus set consisted of nine three-part objects (Figure 10a), each
composed of one of three groups of three simple geometrical volumes (geons).
The central part was largest in each object, giving each object a clear principal
axis, and the three parts were joined together in one of three structures. For each
structure, the relationships between the parts’ medial axes differed according to
categorical distinctions in medial axis relations suggested by Biederman (1987).
The parts either joined end-to-end, with the medial axes of each part co-linear
(as in the cone and the tapered brick in the top-left object in Figure 10a), or end-
to-side, with the medial axes of each part perpendicular (as in the cone and the
tapered brick in the top-middle object in Figure 10a). The medial axes of parts
joined perpendicularly were either centered (as in the cone or the macaroni in the
top-right object in Figure 10a) or offset (as in the cone or the macaroni in the top-
middle object in Figure 10a). Two parts adjoining a larger part were either co-
planar, as in the second column of Figure 10a or perpendicular, as in the third
column of Figure 10a.
Each of these relationships between parts—perpendicular/co-linear,
centered/offset, and co-planar/perpendicular—is defined with respect to the
intrinsic axes of the parts, so the relationships will be the same regardless of the
view from which the object is seen. Different combinations of between-part
relations created three categorically distinct “axis structures” (columns in Figure
10a). Each of the nine objects was rendered from six different views, rotated in
the image plane and in depth in six approximately equal (~22.5º) steps from -45º
to +67.5º from the canonical view (with the largest part vertical) shown in Figure
10a (some views were adjusted slightly to avoid occlusions of parts or other
accidents of perspective). This created six possible angular disparities between
pairs of images, from 0º to 112.5º.

Figure 10: Mental rotation stimuli. a. Objects used as stimuli. Rows of objects have the
same component parts, and columns have the same medial axis structure (the same
relationships between their parts). b. Sample trials, as images appeared in the experiment.
For the “same axis structure” task, the subject would have pressed the “same” button for
i. and ii. For the “same component parts” task, the subject would have pressed the “same”
button for i. and iii.
In the actual experiment, the stimuli were rendered in grayscale with a
low-frequency surface texture, in front of a similar low-frequency noise
background (as in Figure 10b). The root mean square contrast of all stimuli was
equated. Stimuli were generated using Blender (www.blender.org) and presented
using the Psychtoolbox (Brainard, 1997; Kleiner et al., 2007; Pelli, 1997) for
Matlab (Mathworks). Subjects viewed the stimuli from a distance of ~60 cm, and
each individual object subtended ~3º of visual angle and was presented ~2º
away from fixation (pairs were ~4º apart).
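A minimal sketch of root-mean-square contrast equalization for a grayscale image follows; the target contrast and mean-luminance values are arbitrary assumptions, not the values used for the actual stimuli.

import numpy as np

def set_rms_contrast(img, target_rms=0.15, mean_lum=0.5):
    # img: 2-D array of luminance values in [0, 1]
    z = img - img.mean()                          # zero-mean luminance
    out = mean_lum + z * (target_rms / z.std())   # rescale contrast, restore the mean
    return np.clip(out, 0.0, 1.0)

img = np.random.rand(256, 256)
print(set_rms_contrast(img).std())                # ~0.15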
Task & trial timing
On each trial subjects viewed two simultaneously presented objects for
100, 150, or 200 ms, with either a 2,500 or 3,000 ms interstimulus interval (trial
timing was varied based on performance in training runs to approximately equate
overall performance across subjects). Note that this is a considerably shorter
duration than used in many experiments investigating mental rotation. Typically in
those experiments objects were left on the screen until subjects responded (e.g.
Hochberg & Gellman, 1977; R.N. Shepard & Metzler, 1971; S. Shepard &
Metzler, 1988; Yuille & Steiger, 1982). The brief trial duration was intended to
constrain the strategies that subjects might use to perform the tasks (Just &
Carpenter, 1985).
Subjects performed two different go/no-go tasks in separate experimental
blocks. In one block of trials, they indicated as quickly and accurately as possible
(via button press on “same” trials) whether the two objects on the screen shared
the same medial axis structure, regardless of component parts or overall
orientation. In another block, they indicated whether the two objects shared the
same component parts regardless of medial axis structure (i.e., part
arrangement) or orientation. Half of the subjects performed the axis task first, and
half performed the part task first. Example trials are shown in Figure 10b, with
correct responses for each task noted in the caption. Before beginning each
section of the experiment, subjects were explicitly instructed (with examples)
which features of the stimulus to attend. They then performed a two-part training
sequence identical to the main experiment, except that for the first part, the
images stayed on the screen until the subject responded. (For the second
training session, the images were presented for 200-300 ms, as in the
experiment.) Subjects had to answer 9 out of 10 and 8 out of 10 consecutive
trials correctly to complete each of the training blocks (and failure to do so
resulted in an increase of trial time from 200 to 300 ms). Subjects required an
average (mean ± std. dev.) of 56 ± 12 trials to train for the axis task, and 52 ± 9
trials to train for the part task. In the main experiment, each subject completed
864 trials per task. Throughout the experiment, feedback was given on every trial
(“x” for incorrect, “o” for correct).
Statistical analysis
Reaction times (RTs) were analyzed for correct trials only; all responses
greater than 3 standard deviations from each subject’s grand mean were
discarded as outliers (1.2% of trials on average). d' was computed for each
subject instead of raw error rate, since it is a criterion-free measure of perceptual
sensitivity.
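As a sketch of these two steps in Python (illustrative, not the analysis scripts; the clamping rule for perfect hit or false-alarm rates is a common convention and an assumption here, since no specific correction is stated):

import numpy as np
from scipy.stats import norm

def dprime(hit_rate, fa_rate, n_same, n_diff):
    # clamp perfect rates so the z-scores stay finite (a common convention)
    hit_rate = min(max(hit_rate, 0.5 / n_same), 1 - 0.5 / n_same)
    fa_rate = min(max(fa_rate, 0.5 / n_diff), 1 - 0.5 / n_diff)
    return norm.ppf(hit_rate) - norm.ppf(fa_rate)   # d' = z(hits) - z(false alarms)

def trim_rts(rts):
    rts = np.asarray(rts, dtype=float)
    keep = np.abs(rts - rts.mean()) <= 3 * rts.std()  # discard responses > 3 SD from the mean
    return rts[keep]

print(dprime(0.90, 0.15, 432, 432))
print(trim_rts([650, 700, 720, 5000]))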
Results
Reaction time and sensitivity (d’) for both tasks—judging same/different
axis structure and same/different component parts—are shown as a function of
angular disparity in Figure 11, and regression statistics are shown in Table 1.
Overall, subjects were able to perform both tasks quite well—overall accuracy
was 84.7% for the axis structure judgments, and 88.7% for the part judgments (d’
of 2.51 and 2.89, respectively), and mean RTs were 767 ms (axis) and 690 ms
(parts). Worth noting is that the minimum RT for the axis structure task (659
msec when the objects were perfectly aligned) was not significantly different from
the minimum RT for the component part task (629 msec when the objects were
perfectly aligned) (t(15) = 1.26, p = 0.23).

Figure 11: Mental rotation results (RT and sensitivity). a. Reaction times and b. sensitivity
(d prime) for the “same axis structure” task, shown as a function of angular disparity between
the two objects presented, with regression lines. c. RTs and d. sensitivity (d’) for the “same
component parts” task with regression lines. The dotted green line in c. is the regression fit
for 45º-112.5º disparity. Note that for the objects with different axis structures in the part task
(orange lines), angular disparity has meaning only for the central part of each object—the
other parts will be misoriented at all disparities due to the differences in axis structure.
Effect of angular disparity
The angular disparity between the stimuli affected the two tasks to
different degrees. There was a marked effect of angular disparity on RTs and
error rates—both increased with increasing disparity—when subjects were
performing the axis structure judgment (t(8) = 8.23, p < 0.0001). When the
subjects were judging whether the objects had the same component parts, the
deleterious effect of angular disparity was smaller but still significant: t(8) = 4.57,
p = 0.002. However, in the component part task, the effect of rotation was
primarily due to fast reaction times to the near-identical images (0º and 22.5º
disparities for images with the same axis structure), with little to no cost of
increasing angular disparity past 45º (Figure 11c, green circles).¹ (A regression
line fit using only disparities greater than 22.5º—with a considerably flatter
slope—is shown as the dotted green line in Figure 11c.)

¹ Note that for the objects with different axis structures in the same-part task (orange lines
in Figure 11c,d), the angular disparity between the stimuli only has meaning for the central
part of each object. The other parts will be misoriented at all disparities due to the differences
in axis structure. Many subjects reported using primarily the convexity/concavity of the central
part for their same/different part judgments, but still, this condition is not as cleanly defined
as the others.

Table 1: Mental rotation regression statistics

Same axis task                Intercept    Effect of angular    Effect of same/    Slope/part
                                           disparity            diff. parts        interaction
Reaction time (r² = 0.96)
  β value                     668.75       1.34                 39.20              0.11
  t(8) value                  60.13        8.23                 2.49               0.49
  p value                     < 0.0001     < 0.0001             0.037              0.634
Sensitivity (d’) (r² = 0.95)
  β value                     3.07         -0.01                0.06               -0.00
  t(8) value                  39.05        -8.80                0.53               -0.34
  p value                     < 0.0001     < 0.0001             0.6076             0.7414

Same part task                Intercept    Effect of angular    Effect of same/    Slope/part
                                           disparity            diff. structure    interaction
Reaction time (r² = 0.88)
  β value                     643.19       0.51                 65.31              -0.50
  t(8) value                  84.69        4.57                 6.08               -3.20
  p value                     < 0.0001     0.0018               0.0003             0.0127
Sensitivity (d’) (r² = 0.82)
  β value                     2.96         0.00                 -0.06              -0.00
  t(8) value                  85.73        0.01                 -1.30              -2.08
  p value                     < 0.0001     0.9911               0.2298             0.0715

Values in bold in the original table highlight significant effects (p < 0.05).
Thus judging whether objects shared the same component parts was
substantially more rotation-invariant than judging whether objects shared the
same medial axis structure. Furthermore, the pattern of reaction times and errors
suggest a speed/accuracy tradeoff in the axis structure task: subjects made more
errors (had lower d primes) on trials with large angular disparities (t(8) = -8.80, p
< 0.0001). The higher error rates could have been elicited by the time pressure
created by the very brief presentation times (< 200 ms); longer presentation
times and more emphasis on correct responses would likely have resulted in a
greater slope for the axis structure judgments.
Effect of same / different parts on axis structure judgments
When judging whether the two images shared the same medial axis
structure, subjects were approximately 40 ms slower if the objects did not share
the same component parts (t(8) = 2.96, p = 0.009), although no differences in
sensitivity were apparent (no independent effect of different parts on d’: t(8) <
1.00, ns.). There was also no interaction of same/different component parts with
the rotation speed (t(8) < 1.00, ns) or sensitivity (t(8) < 1.00, ns). The absence of
an interaction is important, and will be addressed in the discussion. When judging
whether two stimuli shared the same component parts, reaction times were faster
if the objects shared the same axis structure (t(8) = 6.08, p = 0.0003). However,
this difference is difficult to interpret, because in objects with different axis
structures the angular disparity was only defined for the central part (the other
parts were always misaligned more than the disparity would suggest, because of
the differences in axis structures—see Figure 10b, iii for an example). Sharing a
common axis structure may facilitate judgment of whether two objects’ parts are
the same (over and above the facilitation provided by perfectly aligning the parts),
but our data do not speak to this point with any certainty.
Other factors affecting reaction time and accuracy in both tasks
A separate regression analysis that split the experiment into halves
revealed an effect of learning—for both tasks, RTs were lower in the second half
of the experiment: by 49 msec for the axis task, t(17) = 3.20, p = 0.005, and 41
msec for the part task, t(17) = 4.13, p = 0.001. In the axis task, there was also a
trend toward greater sensitivity in the second half, t(17) = 1.91, p = 0.07, and an
interaction of learning with same/different parts on d’ (subjects were slightly more
sensitive to objects with the same parts in the first half, and slightly more
sensitive to objects with different parts in the second half: t(17) = 2.29, p = 0.04).
Effects not noted here did not reach significance—in particular, there were no
interactions of learning (first/second half of the experiment) with rate of rotation.
Figure 12: Reaction times for all axis families separately.
Reaction times as a function of angular disparity with regression lines, for each axis
group individually, with the same (d-f) or different (a-c) component parts. Images shown
are examples of each axis group. a. and d. are axis group 2, with either the same (d.) or
different (a.) parts; b. and e. are axis group 1, c. and f. are axis group 3. Regression
slopes were not significantly different for any axis group.
To determine whether subjects may have been able to rotate some axis
structures faster than others, a third regression analysis was run that modeled
each group of objects sharing the same axis structure separately. The rotation
speeds for all axis structures were approximately equal (see Figure 12), with no
significant interactions of slope and axis structure group (all ps > 0.2).
Discussion
Comparison of same-axis and same-part tasks
We found markedly different costs of angular disparity (in RTs and error
rates) for two different types of comparisons between objects. When subjects
judged whether the parts composing two objects were the same or different
(“same-part task”), subjects showed only a minimal cost of increasing angular
disparities. The effect of angular disparity, though statistically significant, seemed
to be due to a benefit for near-identical images rather than a systematic cost of
increasing angular disparity. Furthermore, the estimated mental rotation speed
for the same-part task was 1,965º per second (or greater—there were indications
of a flattening of the slope at higher disparities)—so fast that it seems fair to call
subjects’ performance substantially rotation invariant. This result dovetails with
several other studies showing rotation-tolerant recognition of objects as long as
the same parts can be readily resolved (1989).
For judgments of whether the two objects had the same categorical
relations between their parts’ medial axes (or the same medial axis structure—
“same-axis task” for short), the cost of rotation was also very low relative to other
mental rotation tasks (see Figure 13): the estimated mental rotation speed was
746º per second (though true speed may have been slower due to a
speed/accuracy trade-off). However, in contrast to the same-part task, the costs
in both reaction time and sensitivity at increasing disparities were highly linear
(Figure 11 a,b, Figure 12), and there was no change in the rotation slope over the
course of the experiment.
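The quoted rotation speeds follow directly from the regression slopes in Table 1 if one assumes those slopes are expressed in milliseconds of RT per degree of angular disparity (an assumption about units, though it reproduces the reported numbers), as this short worked check in Python shows:

# rotation speed (deg/s) = 1000 ms / (ms of RT cost per degree of disparity)
slope_axis_ms_per_deg = 1.34   # same-axis task, Table 1
slope_part_ms_per_deg = 0.51   # same-part task, Table 1

for name, slope in [("axis", slope_axis_ms_per_deg), ("part", slope_part_ms_per_deg)]:
    print(name, round(1000.0 / slope), "deg/s")
# -> axis ~746 deg/s; part ~1961 deg/s, close to the 1,965 deg/s quoted above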
One explanation for the difference between the two tasks could be that the
part changes were more salient. However, there are two arguments against this:
first, by a low-level image similarity metric (Gabor scaling—see Figure 14b, and
Chapter 5, Methods:Stimuli—p. 76—for explanation), objects sharing the same
axis structure and objects sharing the same component parts were approximately
equally self-similar (actually, objects sharing the same axes were slightly more
self-similar). Second, the subjects were equally fast and accurate at both tasks
when the objects were presented at 0º disparity, indicating that they found both
manipulations equally salient when the stimuli were perfectly aligned. Thus in the
absence of locally-distinct parts, it seems that even reasonably salient,
categorical distinctions between medial axis relations cannot be recognized
without a monotonic cost of angular disparity.
What rotates?
How do our results relate to the mental rotation literature, given that some
mental rotation operation seemed to be necessary to determine whether medial
axis structures were the same or categorically different? In particular, a critical
question was how variation in the parts composing each object affected reaction
times at increasing disparities.²

² Note that this and all subsequent discussion will center on the same-axis task.
Shimon Ullman (1971) suggested that mental rotation relies on an
alignment step, during which “alignment keys”—features used to establish the
relative orientations of the objects to be rotated into correspondence—are
extracted. In our experiment, two alignment strategies were possible: one that
relied on the medial axis structure of the stimuli, and one that relied on edges,
vertices, or other local features. If subjects had relied on extraction of local
features, they should have had a substantially more difficult time finding
corresponding features between the axis structures with different component
parts (as in Figure 10b, ii)—since there were practically no corresponding local
features. The difficulty would presumably be even greater at large disparities,
resulting in a steeper slope (slower mental rotation speed) for the different-part
objects, i.e. an interaction of same/different parts and slope. No such interaction
was observed (Figure 11a,b).
Subjects did respond slightly (but significantly) slower when axis structures
were not composed of the same parts, but this could be interpreted either as a
cost of having different parts, or a benefit of having the same parts. The process
of image alignment may be facilitated by shared local features (for example, a
distinctive pointy part at one end), even though those features are not strictly
necessary for alignment. Alternately, parts may be encoded so automatically that
detection of different parts triggered a reflexive double-check to assure that a
“same” response was appropriate. Although this is admittedly post-hoc, it could
explain why there was a cost of different parts in reaction time but not accuracy.
In any case, the parallel slopes for the same- and different-part conditions in the
same axis task strongly suggest that it was possible to efficiently rotate and
compare two objects on the basis of their medial axis structures alone.
Figure 13: Other studies investigating speeds of mental rotation.
All intercepts have been equated for easy comparison. a. Shepard & Metzler (1977):
60º/s. Block figures have 4 segments. b. Hochberg & Gellman (1988) (no landmarks):
62º/s. c. Shepard & Metzler (1977): 129º/s. Block figures have 3 segments. d. Hochberg
& Gellman (1977) (with landmarks): 318º/s. e. Current study (same axis structure task):
746º/s. f. Current study (same component part task): 1,965º/s (for all points) or 8,171º/s
(for 45º-112º disparity).
Medial axis structures can be rotated very efficiently
Many factors can influence the speed of mental rotation; in particular,
many studies have convincingly shown that human subjects can rotate sub-parts
of an object rather than holistically rotating the entire object, and that doing so
often results in faster reaction times and less cost of angular disparity (see Folk &
Luce, 1987; Hochberg & Gellman, 1977; Just & Carpenter, 1976; Pylyshyn, 1979;
Yuille & Steiger, 1982).
Hochberg and Gellman (1977) showed that objects that contained what
they called “landmarks” resulted in increased mental rotation speeds. They
defined landmarks as features readily perceptible in a single glance (i.e.,
resolvable at non-foveal locations) that give strong cues as to the orientation of
an object, as well as which parts correspond to each other in each pair of
objects, and thus which direction a given object should be rotated (see Figure
13d. vs. Figure 13b. for examples). In our same-axis task, the same-part stimuli
did have “landmarks” in the Hochberg and Gellman sense, in that identical cones
(or other parts) sticking off of the central part would be readily perceptible in a
single glance and would provide obvious cues as to how the stimuli should be
rotated into correspondence to compare the axis structures. However, the
different-part stimuli contained no such cues³, and the estimated rotation speed
for different-part objects did not differ from same-part objects (Figure 11). Thus
the efficient rotation we observed could not be due to the presence of landmarks.

³ One caveat to this is that the principal axis defined by the larger central part in all of the
objects may have provided a type of “landmark”; however, Shepard & Metzler’s 3-part stimuli
(Metzler, 1973) also had a fairly obvious middle part, and resulted in considerably slower
reaction times and steeper slopes.
Yuille and Steiger (Yuille & Steiger, 1982) showed that drawing attention to
the critical region at which two stimuli differ will speed reaction times in mental
rotation tasks. Because the relationships between each pair of parts of our stimuli
varied in categorical ways, it was possible to distinguish each pair of our stimuli
based on only two of the three parts of each object. For example, axis family 1
(far left column of Figure 10) could be distinguished from axis family 2 and 3 by
the collinear end-to-end junction between two of its parts (as the junction
between the cone and the brick in the top-left image in Figure 10a). Also, axis
family 2 (middle column of Figure 10a) could be distinguished based on the
parallel but not co-linear smaller parts. However, the three different axis families
in the experiment—and thus, the two different distractors for each axis family—
made it difficult to know which region would be the informative region on any
given trial. Also, the distinctive aspects of each family were only at the level of
medial axis relationships, so there is no way that any single part could have been
used to distinguish one axis family from another. Thus the efficient reaction times
in our study are difficult to attribute to partial-object rotation, particularly as other
authors have described the phenomenon.⁴

⁴ Yuille & Steiger defined partial object rotation as rotation of “figure segments” that
“presumably [consisted of] one arm” of their block objects (Yuille & Steiger, p. 208).
Our study is most closely analogous to Shepard and Metzler’s 1973 work
(Metzler, 1973), in that both used three-part objects with identical parts (or, in one
of our conditions, all different parts) arranged in different ways. However, our
estimated speed of mental rotation is considerably faster—faster even than the
fastest rotation speeds in Hochberg & Gellman and Steiger & Yuille’s studies.
The principal difference between our study and Shepard & Metzler’s (1973) study
is the nature of the distinction being made between the objects in each trial.
The distinctiveness of the junctions between an object’s medial axes
strongly influences the speed at which its structure can be appreciated at
oblique orientations. Indeed, the ability to extract medial axes from an object and
the qualitative manner in which the axes join together seems to be a highly
relevant measure of complexity in mental rotation tasks.
Conclusions
Subjects were able to judge whether objects shared the same component
parts with a minimal cost of the angular disparity between them, even when the
medial axes differed. Objects sharing the same medial axis structure, even those
with different parts, were also identified very quickly, with estimated mental
rotation speeds exceeding those from most (if not all) other mental rotation studies in
the literature. However, the costs in reaction time were highly significant and
strikingly linear at increasing disparities, indicating that even categorical
distinctions in medial axis relationships (of objects with the same or different
parts) are substantially more readily perceived if two objects are aligned. Medial
axis structures provide a likely basis for efficient mental rotation, even if the
objects to be aligned do not share any local features.
Chapter 5: Imaging of the representation of axis
structures
Introduction
Objects are represented as an arrangement of parts. Support for a parts-
based representation derives from studies of behavior (Biederman & Cooper,
1991b; Biederman & Gerhardstein, 1993; Hayward, 1998; Tversky & Hemenway,
1984), single unit electrophysiology (Pasupathy & Connor, 2002; Tsunoda,
Yamane, Nishizaki, & Tanifuji, 2001; Yamane, Tsunoda, Matsumoto, Phillips, &
Tanifuji, 2006), and neuroimaging (Hayworth & Biederman, 2006). But how does
the visual system encode the arrangement—the relative positions—of the parts?
Re-arranging parts can lead to a completely different interpretation of an object
(Biederman, 1987), just as changing the relative positions of phonemes in a word
can change the meaning of the word (as in “cat” and “tack” or “rough” and “fur”).
Still, as critical as between-part relationships are to our understanding of the
world around us, comparatively few studies have investigated how they might be
encoded.
One way to define relationships between object parts is in terms of the
relative positions of the parts’ medial axes—the central lines running through
each part, as bones through fingers. More than 40 years ago, Harold Blum
(Blum, 1967; Blum & Nagel, 1978) observed that specifying an object’s medial
axes provides a compact and intuitive way to parse the object into parts and
thereby describe its structure. Many influential theories of object representation
have used the concept of principal or medial axes to define the origin of an
object-centered coordinate system (Marr, 1982; Marr & Nishihara, 1978), to
divide an object into parts (Hoffman & Singh, 1997), or to define categorical
relationships between parts (Biederman, 1987). Recently, numerous variants of
Blum’s Medial Axis Transform have been developed to reliably compute “shape
skeletons” for two- and three-dimensional shapes (Cornea, Silver, & Min, 2007;
Dey & Sun, 2006; Feldman & Singh, 2006), some of which have been suggested
as a means to index online libraries of 3D graphical models (see
http://www.cs.princeton.edu/gfx/proj/shape/).
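For a two-dimensional silhouette, a shape skeleton of the kind Blum described can be
computed with standard tools. The fragment below is a minimal sketch using
scikit-image's medial-axis transform on a synthetic two-part silhouette; the shape and
its dimensions are invented purely for illustration.

    import numpy as np
    from skimage.morphology import medial_axis

    # Synthetic binary silhouette: a horizontal part joined end-to-side by a vertical part
    silhouette = np.zeros((100, 100), dtype=bool)
    silhouette[45:55, 10:90] = True   # horizontal part
    silhouette[10:45, 45:55] = True   # vertical part attached to its side

    # Medial axis transform: skeleton pixels plus the distance of each skeleton
    # pixel to the boundary, a discrete analogue of Blum's grassfire construction
    skeleton, distance = medial_axis(silhouette, return_distance=True)
    radius_along_axis = distance * skeleton

The resulting skeleton makes the end-to-side junction between the two parts' axes
explicit, which is the kind of categorical relationship manipulated in the stimuli
described below.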
Only a few neurocomputational studies have followed up on the broad and
intuitive appeal of medial axes as shape descriptors. Lee and colleagues (1998)
found that V1 cells show heightened responses to oriented bars located along
the medial axis of a texture-defined figure, and Kimia (2003) has noted that the
lateral connections in V1 are well-situated to compute convex parts’ medial axes
via a computation like Blum’s “grassfire” algorithm.
Early computation of individual parts’ medial axes could lead to encoding
of junctions between medial axes at later stages, analogous to the way that
computation of local orientations in V1 is followed by encoding of junctions of
edges (corners and curves) in V4 (Pasupathy & Connor, 1999). In this study, we
will test whether categorically different medial axis structures elicit reliably
different BOLD fMRI patterns in regions of interest throughout visual cortex, using
a set of novel objects that vary in their overall orientation, the shape of their
component parts, and their medial axis structures.
Materials & Methods
Subjects
Eight right-handed subjects (ages 21-29, two female) with normal or
corrected-to-normal vision participated in the experiment. All were screened for
safety and gave written informed consent before participating. They were
financially compensated for their time, and all subject protocols were approved by
the USC Institutional Review Board (and adhered to the Declaration of
Helsinki).
Stimuli
Stimuli were 54 white-on-black images, consisting of six views each of nine
different objects (Figure 14). The nine objects were each composed of one of
three groups of three geometrical volumes (geons), arranged in one of three
different structures according to the relationships between the parts’ medial axes.
The parts’ medial axes were conjoined according to categorical distinctions in
medial axis relationships suggested in Biederman (1987), either end-to-end (i.e.,
with the medial axes of each part co-linear) or end-to-side (i.e., with the medial
axes of each part perpendicular). The parts joined end-to-side were either
centered or offset, and the two parts adjoining a larger part were either co-planar
or offset.

Figure 14: Stimuli for fMRI classification
a. 9 representative images (of the 54 images used). Each row shares the same medial
axis structure (“axis groups”); each column shares the same component parts (“part
groups”). View groups run roughly diagonally. Stimuli actually appeared in contrast-
equated off-white on a dark gray background; they are presented here at high contrast
for clarity. b. Average Gabor-jet distance between all pairs of stimuli within/between each
group.
To dissociate axis structure from low-level features such as local orientation
and low-frequency outline, the overall orientation of the objects in plane and in
depth was varied in six 22.5˚ increments. To assure that the variation in
orientation did indeed change the low level features of the images, stimuli were
analyzed using a simple computational model of V1 (Lades et al., 1993). The
model computed a “jet” of Gabor coefficients at each of 100 points arranged in
expanding radial circles on each image. Each jet was composed of 40 Gabor
filters: eight equally spaced orientations (22.5º differences in angle) at five spatial
scales, each centered on the same point in the image. The overall result for each
image was a 4000-element vector (40 coefficients x 100 locations) that captured the local
orientation information in the same way that V1 theoretically does.
The low-level feature difference between each pair of images in our stimulus
set was computed as one minus the Pearson correlation between the Gabor-jet
vectors for each image. The average distances between images that either
shared or did not share the same axis structure or overall orientation are shown
in Figure 14b. The images that shared the same global orientation were more
self-similar as a group by the Gabor-jet measure than were the images that
shared the same axis structure. The Gabor-jet metric has been extensively used
for scaling the physical differences between stimuli (Biederman & Kalocsai, 1997;
Fiser, Biederman, & Cooper, 1996; X. Xu, Yue, Lescroart, Biederman, & Kim,
2009), and produces essentially the same results as more complex models of
shape processing (Serre, Oliva, & Poggio, 2007).
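To make the Gabor-jet dissimilarity concrete, the sketch below is a minimal
Python/NumPy illustration of the idea, not the Matlab implementation used here: it
samples Gabor magnitudes on a rectangular grid rather than on the expanding radial
circles described above, and the grid spacing and filter parameters are assumptions
chosen only for illustration.

    import numpy as np
    from scipy.signal import fftconvolve

    def gabor_kernel(size, wavelength, theta, sigma):
        """Odd-phase Gabor patch at a given wavelength (pixels) and orientation."""
        half = size // 2
        y, x = np.mgrid[-half:half + 1, -half:half + 1]
        x_rot = x * np.cos(theta) + y * np.sin(theta)
        envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
        return envelope * np.sin(2.0 * np.pi * x_rot / wavelength)

    def gabor_jet_vector(image, n_orient=8, n_scales=5, grid_step=32):
        """Concatenate Gabor magnitudes sampled at grid points into one vector
        (analogous to the 40 coefficients x 100 locations described above)."""
        rows = np.arange(grid_step // 2, image.shape[0], grid_step)
        cols = np.arange(grid_step // 2, image.shape[1], grid_step)
        features = []
        for s in range(n_scales):
            wavelength = 4.0 * 2**s              # assumed spacing of the five scales
            size = int(4 * wavelength) | 1       # odd kernel width
            for o in range(n_orient):
                theta = o * np.pi / n_orient     # 22.5-degree orientation steps
                kern = gabor_kernel(size, wavelength, theta, sigma=wavelength / 2.0)
                resp = np.abs(fftconvolve(image, kern, mode="same"))
                features.append(resp[np.ix_(rows, cols)].ravel())
        return np.concatenate(features)

    def gabor_jet_distance(img_a, img_b):
        """Dissimilarity = 1 - Pearson correlation between the two jet vectors."""
        va, vb = gabor_jet_vector(img_a), gabor_jet_vector(img_b)
        return 1.0 - np.corrcoef(va, vb)[0, 1]

Averaging such distances over all image pairs within and between the axis, part, and
view groups yields the kind of summary shown in Figure 14b.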
Thus the stimuli were designed such that the medial-axis relationships
between the objects’ parts were the only commonality among all the members of
each “axis group.” Each image subtended ~5.2º of visual angle. All stimuli were
generated using Blender (www.blender.org) and presented using the
Psychtoolbox (Brainard, 1997; Kleiner et al., 2007; Pelli, 1997) for Matlab
(Mathworks).
Task: attend to component parts
During the MRI scans subjects attended to the identities of the geons
composing the shapes, and indicated via button press which of three “part
groups” (columns in Figure 14a) each shape belonged to. The shapes in the first
group all had a straight-sided tapered brick as the central piece, with a cone and
a curved cylinder attached to it. The shapes in the second family all had a large
convex cylinder, a smaller straight-sided brick, and a smaller curved triangular
prism, and the shapes in the third family all had a large concave brick, a smaller
convex cylinder, and a smaller curved, tapered brick. Since each axis family and
body orientation contained an equal number of members of each part family, the
task was orthogonal to the experimental manipulations of interest.
In separate testing sessions, each subject also performed an analogous
task identifying each axis structure group (rows in Figure 14a) by button press
in the same manner.
fMRI data collection and preprocessing
MRI scanning was performed at USC’s Dana and David Dornsife Cognitive
Neuroscience Imaging Center on a Siemens Trio 3T scanner using a 12-channel
head coil. T1-weighted structural scans were performed on each subject using an
MPRAGE sequence (TR=1950 ms, TE=2.26 ms, 160 sagittal slices, 256 x 256
matrix size, 1 x 1 x 1 mm voxels). Functional images were acquired using an
echo planar imaging (EPI) pulse sequence (TR=2000ms, TE=30 ms, flip
angle=65°, in plane resolution 2 x 2 mm, 2.0 or 2.5 mm thick slices, 31 roughly
axial slices). Slices covered as much of the brain as possible, though often the
temporal poles and the crown of the head near the central sulcus were not
scanned (depending on head size).
Subjects were scanned in 7 or 8 scanning runs of 55 trials each. Each trial
consisted of a single stimulus presentation for 200 ms, followed by a 7,800 ms
fixation. Stimuli were presented in pseudo random order (counter-balanced for
axis groups).
fMRI data were collected using on-line motion correction (PACE; Thesen,
Heid, Mueller, & Schad, 2000). Additionally, data were temporally interpolated to
align each slice with the first slice acquired, motion-corrected (trilinear-sinc
interpolation), and temporally smoothed to remove low-frequency drift (kernel = 3
cycles / run). All pre-processing was carried out using Brain Voyager QX version
2.08 (Brain Innovation, Maastricht, Netherlands) (Goebel, Esposito, & Formisano,
2006). Data were not smoothed or normalized; regions of interest were
transformed to the functional data’s space and all pattern analysis was done in
native functional space. The raw activation values for time points 4 and 6
seconds after stimulus onset on each trial were averaged to create a single
activity value per trial. All trial values were converted to z scores (by run) prior to
classification analysis to minimize baseline differences between runs.
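As a schematic illustration of this step (plain NumPy, not the Brain Voyager pipeline;
the array names, shapes, and TR-based indexing are assumptions), the trial values
could be formed and normalized as follows:

    import numpy as np

    def trial_responses(bold_run, onsets_tr):
        """Average the volumes acquired 4 s and 6 s after each stimulus onset
        (TRs +2 and +3 with a 2-s TR) to get one value per voxel per trial.
        bold_run: (n_timepoints, n_voxels) array for one run;
        onsets_tr: trial onset indices in TRs."""
        return np.stack([(bold_run[t + 2] + bold_run[t + 3]) / 2.0 for t in onsets_tr])

    def zscore_by_run(trial_values, run_labels):
        """Z-score trial-by-voxel values separately within each run to minimize
        baseline differences between runs."""
        out = np.empty_like(trial_values, dtype=float)
        for run in np.unique(run_labels):
            sel = run_labels == run
            mu = trial_values[sel].mean(axis=0)
            sd = trial_values[sel].std(axis=0)
            out[sel] = (trial_values[sel] - mu) / np.where(sd == 0, 1.0, sd)
        return out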
Because each trial consisted of only a single presentation of an image
(rather than a block of different images of the same class), it was possible to re-
label trials and attempt to classify different groups within the same data set. Thus
we were able to compare how well a given region distinguished objects with
different axis structures, and compare that to how well the same region
distinguished different orientations of the composite objects, using the same
data.
fMRI classification analyses
We used a linear support vector machine (SVM) to assess whether the
three axis groups elicited reliably different patterns of activation in each visual
area. Linear SVMs have been widely used in fMRI multi-voxel pattern
classification studies (e.g. Eger, Ashburner, Haynes, Dolan, & Rees, 2008; Ester,
Serences, & Awh, 2009; Kamitani & Tong, 2005; Ostwald, Lam, Li, & Kourtzi,
2008), and have been shown to be more sensitive at detecting patterns than
other multivariate measures (Cox & Savoy, 2003). We used an SVM classifier
implemented via the Python Multi-Variate Pattern Analysis package (Hanke et al.,
2009, www.pymvpa.org) using the LibSVM library. The SVM classifier was
trained on seven of the eight fMRI runs and tested on the eighth run. Each of the
runs was withheld as the test set once (n-fold cross-validation), for a total of 440
test trials in subjects with 8 scans.
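The cross-validation scheme can be sketched as follows; this is an illustrative
re-implementation in scikit-learn rather than the PyMVPA/LibSVM code actually
used, and the variable names (X, y, runs) are placeholders for the z-scored trial
patterns, the trial labels, and the run indices.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import LeaveOneGroupOut

    def leave_one_run_out_accuracy(X, y, runs):
        """Train a linear SVM on all runs but one and test on the held-out run,
        cycling through every run (n-fold cross-validation over runs)."""
        accuracies = []
        for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=runs):
            clf = SVC(kernel="linear", C=1.0)
            clf.fit(X[train_idx], y[train_idx])
            accuracies.append(clf.score(X[test_idx], y[test_idx]))
        return float(np.mean(accuracies))

Because each trial is a single image presentation, the same trials can be re-labeled
(by body orientation or by part group instead of by axis structure) and the same
function re-run, which is how the different classification schemes were compared.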
Regions of interest
Regions of interest (Figure 15a) were defined using independent localizer
scans. Rotating contrast-reversing wedges were used to define V1-V4 and V3A
(as in Engel, Glover, & Wandell, 1997; Sereno, 1998). Lateral occipital cortex
(LO) was defined as the region more active to objects than scrambled versions of
the same objects (t contrast with the False Discovery Rate [FDR] set at p < 0.05),
spanning the region from the dorsal part of V3 (dorsally) to V4 (ventrally) (Grill-
Spector et al., 1999). We also defined a ventral visual region encompassing the
fusiform face area (FFA), the parahippocampal place area (PPA), and shape-
selective regions in the posterior fusiform gyrus (pFs) by a contrast of
faces + scenes + objects > scrambled objects. (These regions were initially
analyzed separately, but no differences were found, so they were grouped
together for simplicity.)

Figure 15: Regions of interest and activation
a. Regions of interest for a representative subject, displayed on a posterior view of an
inflated brain. ROIs were defined using independent localizers and anatomical criteria.
Dotted lines represent the horizontal meridian, solid lines represent the vertical meridian,
* represents the foveal confluence, and the thick dashed line marks the intra-parietal
sulcus. The ventral region contained face- and place-selective voxels as well as object-
selective voxels. b. Activation maps of response to all stimuli (t values for contrast of all
conditions – fixation).

A region in the intra-parietal sulcus was defined
by mixed anatomical and functional criteria: we took the region extending dorsally
up the medial bank of the intra-parietal sulcus (IPS) from V3A to a region that
showed increasing activation to increasing working memory load (as in Y. Xu &
Chun, 2006). Since the regions of interest varied substantially in size and mean
activation level, both of which have been shown to influence classification
performance (Cox & Savoy, 2003; Smith, Kosillo, & Williams, 2010), we imposed
two further restrictions on each region. First, for each ROI we sorted the voxels
according to their overall responsiveness (t statistic) to all axis groups across the
experimental runs used to train the classifier, and chose only voxels that showed
a significant (t(1789) > 2, p < 0.05 uncorrected) response to a contrast of all
stimulus conditions vs. fixation. Second, we set a cap on the number of voxels at
300 to keep the number of voxels (in addition to the activity levels) approximately
constant across all regions of interest.
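A minimal sketch of this voxel-selection rule (assuming t_stats holds the per-voxel t
statistic for the all-stimuli-vs-fixation contrast, computed from the training runs only):

    import numpy as np

    def select_voxels(t_stats, t_threshold=2.0, max_voxels=300):
        """Keep voxels with a significant response to all stimuli vs. fixation,
        then cap the count at 300, preferring the most responsive voxels."""
        candidates = np.flatnonzero(t_stats > t_threshold)
        ordered = candidates[np.argsort(t_stats[candidates])[::-1]]  # descending t
        return ordered[:max_voxels]

    # e.g., roi_patterns = X[:, select_voxels(roi_t_stats)]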
Results
Behavioral results
Subjects had no trouble distinguishing which “part group” composed each
object. Mean accuracy was 98.1% correct (essentially at ceiling) and mean
reaction time was 751 ms, with no reliable differences across experimental runs
in reaction times or error rates (repeated measures ANOVA, both F(7,7) < 1.2, p
> 0.30). Nor were there any reliable differences between part groups,
orientations, or axis structures in either reaction times or error rates. It should be
noted that though subjects were making judgments about objects’ parts, there
was a trend toward differences in reaction times for objects belonging to different
axis families, most likely because of greater self-occlusion between parts in the
third axis family in several of the views (Figure 14a, third row), which made part
judgments slightly more difficult. (For RT differences between axis families in the
part-judgment task, F(7,2) = 3.27, p = 0.07; all other F < 1.75, p > 0.13).
In the complementary task (conducted in separate sessions), the same
subjects also very accurately reported which axis group each object belonged to:
mean accuracy was 98.6%, and mean reaction time was 794 msec, again with
no indication of improvement across runs (after training) in either RT or error rate
(both F(7,7) < 1.1, p > 0.40). Subjects understood the task very quickly, and
immediately performed very well. Subjects identified the first medial axis family
(Figure 14a, row 1) more quickly than the other two (mean RT: F(7,2) = 6.03, p =
0.013; post-hoc test (Tukey’s HSD) for axis family 1 vs. both 2 and 3, p < 0.05),
potentially due to its elongation relative to the other structures.
Interestingly (and unlike the part-group task), subjects were also slower at
judging the axis structures of the stimuli rotated the farthest from vertical (for the
most extreme orientations mean RT=835 msec; for vertical, 785 msec; F(7,5) =
10.16, p < 0.0001; post-hoc test (Tukey’s HSD) comparing vertical with extreme
orientations, p < 0.05). All three axis groups—even the first group (Figure 14a,
first row), which would seem to be more distinctive than the others—showed
significant costs of recognition (greater reaction times) at the orientations farthest
from vertical.
Univariate fMRI results
We saw activation throughout visual cortex (Figure 15b) in response to all of
our conditions, with the most (and most significant) activation in the lateral
occipital cortex and surrounding regions. (See Figure 19 for BOLD response
curves for each region.)
fMRI classification results
All regions from V1 to LO were able to distinguish the three different axis
structures (i.e., the three different arrangements of the objects’ parts) significantly
better than chance (all t(7) > 2.43, p < 0.05) (Figure 16a; see Table 2 for a summary of
statistical tests by ROI). In V1 and V2, the classifier performed slightly better at
distinguishing different orientations of the objects (though this difference was not
significant). By the level of V3, however, significantly more accurate classification
was obtained for distinctions between medial axis structures than for distinctions
between body orientations (t(7) = 2.87, p = 0.02). In the ventral and parietal
regions of interest, the same trend was observed, though overall classification
performance did not exceed chance (both t(7) < 1.90, p > 0.10).
Figure 16: Support vector machine classifier results by region of interest.
a. Result when classifier was trained and tested on different scans (8 scans). b. Result
when classifier was trained and tested on different part families (test of generalization to
new stimuli sharing the same axis structure). Asterisks indicate significant differences
between axis structure and body orientation classification, and white diamonds at the
bars’ peaks indicate significantly better-than-chance classification (t(7) > 2.43, p < 0.05).
The dotted line around the bar for classification by part identity indicates that the
classifier task matched the subjects’ behavioral task.
Table 2: Statistical results by ROI
(For each measure, "by run" rows give results when the classifier was trained and
tested on separate runs; "by part family" rows give results when it was trained on
two part families and tested on the third.)

Measure / scheme                 V1       V2       V3       V4       LO       Ventral  V3A      IPS
p(axis > chance), t test
  by run, p                      0.009    0.004    0.005    0.046    0.001    0.116    0.016    0.329
  by run, t(7)                   3.59     4.14     4.03     2.43     5.05     1.80     3.17     1.05
  by part family, p              0.003    0.015    0.003    0.211    0.017    0.582    0.104    0.099
  by part family, t(7)           4.49     3.20     4.47     1.38     3.11     0.58     1.86     1.90
p(axis > chance), shuffled conditions
  by run, p                      0.021    0.014    0.015    0.173    0.039    0.26     0.095    0.309
  by part family, p              0.026    0.026    0.021    0.285    0.083    0.436    0.26     0.26
p(axis > chance), shuffled group labels
  by run, p                      <0.001   <0.001   <0.001   0.033    <0.001   0.023    <0.001   0.180
  by part family, p              <0.001   <0.001   <0.001   0.183    <0.001   0.243    0.077    0.050
p(axis pattern clf. > axis ROI mean clf.), t test
  by run, p                      0.030    0.011    0.013    0.004    0.019    0.318    0.061    0.323
  by run, t(7)                   2.71     3.44     3.31     4.17     3.05     1.07     2.23     1.06
  by part family, p              0.096    0.042    0.026    0.370    0.417    0.158    0.174    0.679
  by part family, t(7)           1.92     2.49     2.81     0.96     0.86     -1.58    1.51     -0.43
p(axis > body orientation), t test
  by run, p                      0.080    0.479    0.024    0.024    0.011    0.049    0.349    0.394
  by run, t(7)                   -2.04    -0.75    2.87     2.87     3.41     2.38     1.00     0.91
  by part family, p              0.426    0.530    0.039    0.200    0.040    0.150    0.415    0.022
  by part family, t(7)           -0.85    -0.66    2.54     1.42     2.52     1.62     0.87     2.93
A similar pattern of results was observed if we used exactly the same
number of voxels in each region of interest (from 50 to 400 voxels, Figure 17), as
well as if we used all voxels within each region of interest.
Even though the subjects were making explicit judgments on each trial as to
which part group each image belonged to (and thus presumably attending to the
features that distinguished the different part groups), no region of interest in our
study was able to classify the part groups more accurately than the axis structure
groups. However, classification by parts was significantly more accurate than
chance in V1 (t(7) = 2.85, p = 0.02) and LO (t(7) = 6.03, p < 0.0001). There was
only a trend toward classification by parts in LO being better than classification
by orientation (t(7) = 2.15, p = 0.067). The interpretation of the higher accuracy
for part group classification in LO is further complicated by the congruence with
the subjects’ task. Nonetheless, the higher classification accuracy in LO is
noteworthy, particularly given the lack of sensitivity shown by V2-V4 to the parts
(vs. orientation).

Figure 17: Classification results with voxel count equated for each ROI.
Since the classifier was tested on novel instances or trials of each of the
stimuli, and not completely novel stimuli, it is possible that the voxels in each
region of interest (and thus the classification algorithm) could have picked up on
some idiosyncratic feature of each axis structure group. For a more rigorous test
of whether these regions represented axis structures per se, we trained the SVM
classifier on trials of two of the three part groups, and tested it on the third (each
part group was left out in turn in a 3-fold cross-validation). Note that we have
specifically chosen parts that varied in dimensions (e.g. curvature/pointedness,
convexity/concavity) that have been shown to modulate neural activity in both
human lateral occipital cortex (H. P. Op de Beeck, Torfs, & Wagemans, 2008)
and macaque inferotemporal cortex (Kayaert et al., 2005; Kayaert et al., 2003)
and V4 (Pasupathy & Connor, 1999), thus making it less likely that objects with
different component parts will elicit similar patterns of activation. Still, even when
tested on stimuli composed of different parts, the classifier based voxels in V3
and LO still distinguished different axis structures above chance and better than
different body orientations (Figure 16b; see
Table 2 for statistics). Classification performance was slightly lower overall than
when trained and tested by runs, but the classifier was also trained on fewer
trials (2/3 of the data set vs. 7/8 for training and testing by runs).
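This generalization test amounts to grouping the cross-validation folds by part family
rather than by run; a sketch under the same assumptions as the earlier
cross-validation fragment, with part_family giving each trial's part group:

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import LeaveOneGroupOut

    def cross_part_family_accuracy(X, y_axis, part_family):
        """Train on the trials of two part families and test on the third, so the
        classifier never sees the held-out component parts during training
        (3-fold cross-validation over part families)."""
        accuracies = []
        for train_idx, test_idx in LeaveOneGroupOut().split(X, y_axis, groups=part_family):
            clf = SVC(kernel="linear", C=1.0).fit(X[train_idx], y_axis[train_idx])
            accuracies.append(clf.score(X[test_idx], y_axis[test_idx]))
        return float(np.mean(accuracies))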
It is worth noting that there is a slight risk of circularity in this analysis, in that
the training and testing data were drawn from interleaved trials in the same
scanning sessions and so may not be 100% statistically independent. However,
we feel that this does not compromise the results. First, the trials were widely
spaced (8 seconds apart) and counter-balanced such that each axis group
appeared before every other an equal number of times, making it highly unlikely
that trials for one axis group were systematically biased by interaction with other
axis groups. Second, we still see poor classification results in some regions (in
V3A, V4, Ventral, and IPS regions), indicating that whatever dependence there
may be between the training and testing sets, that dependence is not sufficient to
explain the above-chance classification. Furthermore, our most critical measure
is a comparison between two classification schemes (classification by common
axis structure and by common body orientation), both of which should have
benefited equally from any statistical dependence between the training and
testing sets—and yet the difference between classification by axis structure and
classification by body orientation persists.
We performed a similar test to see whether classification performance
would generalize over different views of the objects: we trained the classifier on
five of the views of each object, and tested it on the sixth. Each orientation was
left out as the testing set once in successive cross-validation steps. Overall,
classification accuracy for axis structure groups was above chance for V1 to V3,
V3A and LO (t(7) > 3.38, p < 0.05) (Figure 18). For a more rigorous test of
whether axis structure groups elicited consistent patterns over different views, we
separated out the different cross-validation splits of the data, and re-combined
them in two ways. First, we took the average accuracy for cross-validation splits
for which the extreme orientations (~ -45º and ~ +67.5º) were left out as the
testing set—that is, the data sets for which the classifier had to extrapolate to a
novel orientation. Second, we took the average for splits in which one of the
intermediate orientations (~ -22.5º to ~ +45º) was left out as the testing set—that
is, data sets for which the classifier could interpolate to a novel orientation. For
V3, V4, and LO (all regions showing an increased sensitivity to axis structure vs.
body orientation), classification accuracy was significantly better in trials for which
the classifier could interpolate (t(7) > 2.7, p < 0.05; Figure 18). The only reversal
of this trend was in the parietal lobe, for classification by part families (which
matched with the subjects’ task), though this trend did not reach significance (t(7)
= 1.42, p = 0.20).

Figure 18: SVM classification results for training/testing by orientation.
Classification results when the classifier was trained and tested on stimuli with different
body orientations. Bars represent average classification accuracy across all splits of the
data, and asterisks indicate classification accuracy significantly greater than chance (p <
0.05). Filled markers (triangles and circles) indicate significant difference (p < 0.05)
between interpolation and extrapolation splits. The dotted line around the bar for
classification by part identity indicates that the classifier task matched the subjects’
behavioral task.
Because the overall classification accuracy was relatively low, we used two
additional non-parametric measures to determine whether overall classification
performance was significantly greater than chance (these bootstrapped
estimates give more conservative probability estimates than t tests). First, the
classification analysis was repeated 100 times per subject per ROI with trial
labels randomly shuffled within each run. Second, the classification analysis was
repeated 300 times per subject per ROI with the group assignments for each
image randomized, assigning images to the same classification groups that did
not necessarily share the same axis structures, parts, or orientations. Randomly
assigning group labels provided a test of how often any three arbitrary groups of
18 objects would produce higher-than-chance classification performance. The
results of all statistical tests are summarized in Table 2. The trial randomization
bootstrap analysis resulted in a distribution that closely approximated a binomial
distribution for p = 0.33 and n = 440 (the number of trials in our experiment), with
classification accuracies better than ~38% occurring less than 5% of the time.
Very few of the randomly-assigned groups of the stimuli could reliably be
classified across subjects.
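As an illustration of these checks, the sketch below is a simplified label-shuffling
permutation test (the within-run shuffling is schematic, and the scoring function,
such as leave_one_run_out_accuracy from the earlier fragment, is passed in), plus a
parametric binomial cross-check on the accuracy cutoff described above.

    import numpy as np
    from scipy.stats import binom

    def label_shuffle_null(score_fn, X, y, runs, n_perm=100, seed=0):
        """Null distribution of classification accuracy obtained by shuffling
        trial labels within each run and re-running the cross-validated
        classifier (score_fn)."""
        rng = np.random.default_rng(seed)
        null_accuracies = []
        for _ in range(n_perm):
            y_shuffled = y.copy()
            for run in np.unique(runs):
                sel = np.flatnonzero(runs == run)
                y_shuffled[sel] = rng.permutation(y_shuffled[sel])
            null_accuracies.append(score_fn(X, y_shuffled, runs))
        return np.array(null_accuracies)

    # Parametric cross-check: accuracy needed to beat chance (p = 1/3) on 440 trials,
    # which lands in the same neighborhood as the ~38% cutoff estimated above.
    cutoff = binom.ppf(0.95, 440, 1.0 / 3.0) / 440.0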
To verify that any above-chance classification performance observed was
due to differences in patterns of activity rather than simple mean activity
differences among the conditions, we de-convolved the hemodynamic response
for each axis structure group using a finite impulse response (FIR) general linear
model. Results are shown in Figure 19. None of the axis groups produced
reliably different mean activations, though there was a trend apparent in LO for
the response to the third axis group to be slightly greater (repeated measures
ANOVA; for LO, F(7,2) = 3.11, p = 0.0776; all other F < 1.8, p > 0.2). The trend in
LO could be due to the slightly greater difficulty the subjects had in identifying the
parts of objects in the third axis condition.

Figure 19: Hemodynamic responses in all ROIs to each axis group
Hemodynamic responses to each of the three axis structures (lines nearly exactly
overlap). Circles mark the post-stimulus time points that were averaged and used for
classification on each trial. The bar graph to the right of each plot shows the mean
percent signal change for each axis group for those two time points. The mean
activations for the axis groups did not reliably differ across subjects in any of the regions
of interest.
Since a support vector machine is sensitive to even small differences in
mean activation, above-chance classification in the range that we observed (~36-
39%) could potentially be achieved even using a one-dimensional measure like
the mean activity if a simple threshold would suffice to distinguish one group from
the others for a sufficient number of trials. Thus the classification analysis was
repeated using only the mean activity for each ROI instead of the full pattern of
voxel activity in each ROI (as in Meyer et al., 2010). All regions from V1 to LO
showed greater classification accuracy when the voxel patterns were used
compared to when the mean was used (see circles in Figure 16; see
Table 2 for statistical values), indicating that the information about axis structure
was present in the spatial profile of activation rather than simply the average
activation of each region.
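The mean-activity control can be sketched by collapsing each trial's ROI pattern to a
single value before running the identical classifier (a schematic illustration, not the
exact analysis code; the scoring function, such as leave_one_run_out_accuracy from
the earlier fragment, is passed in):

    import numpy as np

    def mean_only_accuracy(score_fn, X, y, runs):
        """Replace each trial's voxel pattern with its ROI mean (a single
        'feature') and run the same cross-validated SVM analysis."""
        X_mean = X.mean(axis=1, keepdims=True)   # n_trials x 1
        return score_fn(X_mean, y, runs)

Comparing this value with the full-pattern accuracy asks whether the axis-structure
information lies in the spatial pattern of activity rather than in the region's overall
response amplitude.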
Discussion
Multi-voxel pattern analysis revealed a prioritization of structural
information—i.e., more accurate classification of groups with the same medial
axis structure than groups with the same body orientation—arising as early as V3
(Figure 16a). This difference could not be reduced to low-level (retinotopic)
feature similarity, since a computational model of V1 found the objects sharing
the same orientation to be more self-similar than objects sharing the same axis
structure (Figure 14b). Furthermore, V1 and V2 showed an opposite trend
(orientation > axis structure—Figure 16). The same pattern of classification
accuracy (axis structure > orientation) was maintained in V3 when the classifier
was tested on trials of novel stimuli not used in the training set (Figure 16b),
indicating that voxels in V3 are sensitive to arrangements of medial axes and not
to other idiosyncratic features of the stimulus set.
We also found that the structural information present in V3, V4, and LO was
still somewhat orientation-dependent, in that the SVM algorithm could not
accurately classify axis structures at orientations outside the range of orientations
for which it had been trained. Thus there may be separate representations for the
same axis structure at different retinotopic orientations. On one hand, this could
be viewed as a failure of the visual system to achieve full view invariance, if the
goal of the visual system is to encode the relationships between object parts in a
completely orientation-invariant, object-centered manner (as in Marr, 1982; Marr
& Nishihara, 1978; Pylyshyn, 1979). However, it is often important to know which
parts of an object are above or below other parts with respect to gravity
(Biederman, 1987; Hummel & Biederman, 1992). In normal vision, a retinotopic
coordinate frame most often corresponds to a gravitationally-centered coordinate
frame. Thus retinotopically-specific representations of axis structures would
preserve information about which parts were “on top” of other parts (in a
gravitational sense). Furthermore, many studies have shown costs in object
identification when objects are rotated in-plane (i.e., when vertical relationships
are changed) (Hayward et al., 2006; Jolicoeur, 1985; Tarr & Pinker, 1989). In
keeping with these studies (and with our fMRI classification results), subjects in
our study were slower at judging objects’ axis structures when they were rotated
further from vertical (though they were quite good at the task overall).
The computation of medial axis structure by the level of V3 presents another
interesting possibility, since V3 is arguably the last visual stage before the ventral
“what” pathway and the dorsal “where” pathway diverge (Felleman & Van Essen,
1991; Ungerleider & Mishkin, 1982). The dorsal stream has been implicated in
spatial reasoning (e.g. mental rotation tasks), whereas the ventral stream has
been implicated in the recognition of mis-oriented objects (Gauthier et al., 2002;
Vanrie et al., 2002; Wilson & Farah, 2006). Since V3 projects to both areas,
medial axis information computed by V3 could feed into both processes.
Relation to other work
Compared to V1 and V2, not much is known about V3. Most cells in
macaque V3 show orientation tuning (when stimulated with simple gratings), and
some cells show multi-peaked orientation tuning curves (Felleman & Van Essen,
1987), which could be involved in computing junctions between medial axes at
different orientations. Many V3 cells also show end-stopping and binocular
disparity tuning (Felleman & Van Essen, 1987; Gegenfurtner, Kiper, & Levitt,
1997), which could also be useful for computing medial axis structures
(potentially in three dimensions). V3 receives direct inputs from V1 with major
inputs from layer 4B, which is associated with the magnocellular pathway (and
processing of low spatial frequency information) (Felleman, Burkhalter, & Van
Essen, 1997). The stimuli used in these single-unit experiments were too simple for
any effect of axis structure to be evident, but their results are all compatible with a
role for V3 in encoding medial axis structure.
LO has been implicated in the processing of between-part relations by the
work of Behrmann and colleagues (2006), who studied a patient with a lesion to
the ventral part of LO. The patient, SM, had difficulty distinguishing objects that
only differed in the relationships between their parts, despite a preserved ability
to distinguish objects that differed in the shapes of their parts. As with all lesion
studies, it is unclear how much damage was done to neighboring regions and
white matter pathways (but see Behrmann, in press, for a more thorough
discussion of the lesion location). However, the lesion did seem to be anterior to
V3, suggesting that whatever structural computations might be occurring in V3
may not be “read out” until the signal has reached LO.
Why the high accuracy for axis structure classification in V1?
It could be seen as surprising that the classification by body orientation did
not reliably exceed the classification by axis structure in V1, but we feel that it is
not. First, many of the contours in the objects were not oriented parallel to the
orientation of the composite object. Distinguishing two composite object
orientations should be much more difficult than distinguishing bars or gratings at
two equivalent orientations, but easier than distinguishing two different axis
structures (judging by a simple model of orientation processing in V1—Figure
14b). Second, the orientation of the objects varied continuously (over 135º), so
distinguishing between views sometimes meant distinguishing between
neighboring views, which would presumably be more difficult. Despite these
difficulties, classification by body orientation was slightly (though not significantly)
more accurate than classification by axis structure in V1 and V2, a qualitatively
different result than what we observed in V3-LO.
Why the lower accuracy for classification of parts vs. axis structures?
Previously, it has been shown that voxels in LO can distinguish objects with
“pointy” protrusions from objects with smoothly curved or blocky protrusions (H.
P. Op de Beeck, Baker, DiCarlo, & Kanwisher, 2006; H. P. Op de Beeck et al.,
2008). However, all of the stimuli in our set contained some blocky parts, some
curved parts, and some pointy parts. The fact that no single dimension could be
used to distinguish one part group from another likely made the discrimination of
part groups substantially more difficult.
Why the low classification accuracy overall?
Due to the careful manipulation of stimulus features, the images we used
were far more similar overall than stimuli that have been used in many other
multi-voxel experiments (e.g. Eger et al., 2008; Haxby et al., 2001; Kriegeskorte
et al., 2008), which differed in familiarity, behavioral utility, and evolutionary
significance, as well as many aspects of texture and form; classification accuracy
for our images might reasonably be expected to be lower. Additionally, the signal
for single trials is much fainter than the signal for blocks of sequentially-presented
objects. However, the design of our study depended critically on using similar
stimuli presented in single trials (so we could re-label trials to reflect different
aspects of the stimuli). Thus we sacrificed a measure of fMRI signal for
theoretical clarity.
A final possibility is that a given cell or voxel may respond to a relationship
between one pair of parts (i.e., a particular way for two medial axes to join). Thus
the multiple parts of each object (and multiple pairs of medial axis relationships)
may have added noise to what would have otherwise been a clearer signal, in
much the same way that the multiple component parts, each with variation in
multiple feature dimensions (concavity / convexity, “pointiness” / “smoothness”)
may have reduced classification accuracy of the part groups.
Conclusions
Our results suggest that information about the relative positions of objects’
parts is encoded as medial axis structures at particular retinotopic (or
gravitational) orientations in V3 and successive stages of the visual cortex. We
do not mean to argue that axis structure is the only feature encoded by V3 or any
of the other regions, or that the entire world looks like stick figures; far from it.
However, spatial abilities known to be mediated by the parietal lobe (such as
mental rotation) may rely on the computation of medial axis structures (Just &
Carpenter, 1976), and many of the categories of objects that have been shown to
be represented in anterior ventral visual regions—especially tools, body parts,
and animals—have substantial differences in their medial axis structures, as well.
Thus a representation of medial axis structure in V3 could provide a link between
local feature tuning in V1 and higher-order processing in both the dorsal and
ventral visual pathways.
Chapter 6: General Conclusions
Good performance on axis structure tasks
Across both behavioral paradigms (Ch. 4 and 5, the mental rotation study
and the axis family identification task that was done in parallel with the fMRI
experiment), subjects easily learned and accurately performed the tasks,
indicating a ready facility to process axis structural information. The mental
rotation speed estimated for making axis judgments was (to the best of my
knowledge) as fast or faster than any other published rate of mental rotation (that
is, among studies that do not show complete view invariance). Furthermore, the
efficiency of mental rotation was not diminished for objects that were composed
of completely different parts, indicating that subjects were able to efficiently
extract, rotate and compare medial axes from two different objects.
Persistent effects of orientation / sensitivity to viewpoint
However, there was a clear sensitivity to viewpoint manifest in all of the
behavioral measures (similarity, mental rotation, and identification) associated
with extracting axis structure. These effects were persistent, too, and did not
seem to decrease with practice (as the costs for naming mis-oriented objects do).
A corresponding pattern of view sensitivity was found in the fMRI results:
though axis structures were classified better than simple retinotopic orientations,
the classification algorithm failed when it was tested on views outside those on
which it had been trained.
Thus both the fMRI and behavioral results seem to favor a multiple-views
account (Bulthoff & Edelman, 1992; Tarr & Bulthoff, 1995) of the encoding of
medial axis structure, in which separate representations are stored for each view
of a medial axis junction. However, a multiple-views account would predict that
some learning of the new views should take place after repeated exposures,
leading to decreased effects of outlying views—which we did not observe.
Intermediate-level representation of medial axis structure
The fMRI study points to an intermediate representation of medial axis
structure, beginning at the level of V3. An interesting aspect of V3 is its position
in the hierarchy of visual areas: it receives direct input from V1 (Lyon & Kaas,
2002), and feeds forward to both dorsal and ventral regions of visual cortex
(Felleman & Van Essen, 1991). Thus it is ideally situated to provide inputs to dorsal
regions mediating mental rotation and ventral regions mediating viewpoint-
invariant object recognition (Gauthier et al., 2002; Jordan et al., 2001; Wilson &
Farah, 2006).
References
Ahn, W., & Medin, D. L. (1992). A two-stage model of category construction.
Cognitive Science, 16, 81-121.
Arguin, M., & Saumier, D. (2000). Conjunction and linear non-separability effects
in visual shape encoding. Vision Research, 40(22), 3099-3115.
Bach, M., Schmitt, C., Quenzer, T., Meigen, T., & Fahle, M. (2000). Summation of
texture segregation across orientation and spatial frequency:
electrophysiological and psychophysical findings. Vision Research,
40(26), 3559-3566.
Behrmann, M., Peterson, M. A., Moscovitch, M., & Suzuki, S. (2006).
Independent representation of parts and the relations between them:
evidence from integrative agnosia. J Exp Psychol Hum Percept Perform,
32(5), 1169-1184.
Biederman, I. (1987). Recognition-by-components: a theory of human image
understanding. Psychological Review, 94(2), 115-147.
Biederman, I., & Bar, M. (1999). One-shot viewpoint invariance in matching novel
objects. Vision Research, 39(17), 2885-2899.
Biederman, I., & Cooper, E. E. (1991a). Evidence for complete translational and
reflectional invariance in visual object priming. Perception, 20(5), 585-593.
Biederman, I., & Cooper, E. E. (1991b). Priming contour-deleted images:
evidence for intermediate representations in visual object recognition.
Cognitive Psychology, 23(3), 393-419.
Biederman, I., & Gerhardstein, P. C. (1993). Recognizing depth-rotated objects:
evidence and conditions for three-dimensional viewpoint invariance. J Exp
Psychol Hum Percept Perform, 19(6), 1162-1182.
Biederman, I., & Kalocsai, P. (1997). Neurocomputational bases of object and
face recognition. Philosophical Transactions of the Royal Society of
London B (Biological Sciences), 352(1358), 1203-1219.
Binford, T. O. (1971). Visual Perception by Computer. Paper presented at the
Proceedings of the IEEE Conference on Systems and Control, Miami.
Blum, H. (1967). A Transformation for Extracting New Descriptors of Shape. In
W. Wathen-Dunn (Ed.), Models for the Perception of Speech and Visual
Form (pp. 362-380): MIT Press.
Blum, H., & Nagel, R. N. (1978). Shape description using weighted symmetric
axis features. The Proceedings of the IEEE Computer Society
Conference, 10(3), 167-180.
Brainard, D. H. (1997). The Psychophysics Toolbox. Spatial Vision, 10(4), 433-
436.
Bulthoff, H. H., & Edelman, S. (1992). Psychophysical support for a two-
dimensional view interpolation theory of object recognition. Proceedings of
the National Academy of Sciences (USA), 89(1), 60-64.
Cooper, L. A., & Podgorny, P. (1976). Mental transformations and visual
comparison processes: Effects of complexity and similarity. Journal of
Experimental Psychology: Human Perception and Performance, 2, 503-
514.
Cornea, N. D., Silver, D., & Min, P. (2007). Curve-skeleton properties,
applications, and algorithms. IEEE Transactions on Visualization and
Computer Graphics, 13(3), 530-548.
Cox, D. D., & Savoy, R. L. (2003). Functional magnetic resonance imaging (fMRI)
"brain reading": detecting and classifying distributed patterns of fMRI
activity in human visual cortex. Neuroimage, 19(2 Pt 1), 261-270.
Dey, T. K., & Sun, J. (2006). Defining and computing curve-skeletons with medial
geodesic function. Paper presented at the Proceedings of the fourth
Eurographics symposium on Geometry processing.
Eger, E., Ashburner, J., Haynes, J. D., Dolan, R. J., & Rees, G. (2008). fMRI
activity patterns in human LOC carry information about object exemplars
within category. Journal of Cognitive Neuroscience, 20(2), 356-370.
Engel, S. A., Glover, G. H., & Wandell, B. A. (1997). Retinotopic organization in
human visual cortex and the spatial precision of functional MRI. Cerebral
Cortex, 7(2), 181-192.
Ester, E. F., Serences, J. T., & Awh, E. (2009). Spatially global representations in
human primary visual cortex during working memory maintenance. Journal
of Neuroscience, 29(48), 15258-15265.
Farah, M. (2004). Visual Agnosia (Second ed.). Cambridge, MA: MIT Press.
Feldman, J., & Singh, M. (2006). Bayesian estimation of the shape skeleton.
Proceedings of the National Academy of Sciences, 103(47), 18014-18019.
Felleman, D. J., Burkhalter, A., & Van Essen, D. C. (1997). Cortical connections
of areas V3 and VP of macaque monkey extrastriate visual cortex. The
Journal of Comparative Neurology, 379(1), 21-47.
Felleman, D. J., & Van Essen, D. C. (1987). Receptive field properties of neurons
in area V3 of macaque monkey extrastriate cortex. Journal of
Neurophysiology, 57(4), 889-920.
Felleman, D. J., & Van Essen, D. C. (1991). Distributed hierarchical processing in
the primate cerebral cortex. Cerebral Cortex, 1(1), 1-47.
Fiser, J., Biederman, I., & Cooper, E. E. (1996). To what extent can matching
algorithms based on direct outputs of spatial filters account for human
object recognition? Spatial Vision, 10(3), 237-271.
Folk, M. D., & Luce, R. D. (1987). Effects of stimulus complexity on mental
rotation rate of polygons. J Exp Psychol Hum Percept Perform, 13(3), 395-
404.
Gallant, J. L., Shoup, R. E., & Mazer, J. A. (2000). A human extrastriate area
functionally homologous to macaque V4. Neuron, 27(2), 227-235.
Garner, W. R. (1974). The processing of information and structure. Potomac,
MD: Lawrence Erlbaum.
Gauthier, I., Hayward, W. G., Tarr, M. J., Anderson, A. W., Skudlarski, P., &
Gore, J. C. (2002). BOLD activity during mental rotation and viewpoint-
dependent object recognition. Neuron, 34(1), 161-171.
Gegenfurtner, K. R., Kiper, D. C., & Levitt, J. B. (1997). Functional properties of
neurons in macaque area V3. Journal of Neurophysiology, 77(4), 1906-
1923.
Goebel, R., Esposito, F., & Formisano, E. (2006). Analysis of functional image
analysis contest (FIAC) data with Brainvoyager QX: From single-subject to
cortically aligned group general linear model analysis and self-organizing
group independent component analysis. Human Brain Mapping, 27, 392-
401.
Goldstone, R. (1994). Influences of categorization on perceptual discrimination.
Journal of Experimental Psychology: General, 123(2), 178-200.
Grill-Spector, K., Kushnir, T., Edelman, S., Avidan, G., Itzchak, Y., & Malach, R.
(1999). Differential processing of objects under various viewing conditions
in the human lateral occipital complex. Neuron, 24(1), 187-203.
Hanke, M., Halchenko, Y. O., Sederberg, P. B., Olivetti, E., Frund, I., Rieger, J.
W., et al. (2009). PyMVPA: A Unifying Approach to the Analysis of
Neuroscientific Data. Front Neuroinformatics, 3, 3.
Haxby, J. V., Gobbini, M. I., Furey, M. L., Ishai, A., Schouten, J. L., & Pietrini, P.
(2001). Distributed and overlapping representations of faces and objects in
ventral temporal cortex. Science, 293(5539), 2425-2430.
Hayward, W. G. (1998). Effects of outline shape in object recognition. Journal of
Experimental Psychology: Human Perception and Performance, 24(2),
427-440.
Hayward, W. G., Zhou, G., Gauthier, I., & Harris, I. M. (2006). Dissociating
viewpoint costs in mental rotation and object recognition. Psychonomic
bulletin & review, 13(5), 820-825.
Hayworth, K. J. (2009). Explicit encoding of spatial relations in the human visual
system: evidence from functional neuroimaging. Unpublished doctoral
dissertation.
Hayworth, K. J., & Biederman, I. (2006). Neural evidence for intermediate
representations in object recognition. Vision Research, 46(23), 4024-4031.
Hinton, G. F. (1981). A parallel computation that assigns canonical object-based
frames of reference. Paper presented at the Proceedings of the 7th
international joint conference on Artificial intelligence - Volume 2.
Hochberg, J., & Gellman, L. H. (1977). The effect of landmark features on mental
rotation times. Memory & Cognition, 5, 23-26.
Hoffman, D. D., & Singh, M. (1997). Salience of visual parts. Cognition, 63(1), 29-
78.
Hummel, J. E., & Biederman, I. (1992). Dynamic binding in a neural network for
shape recognition. Psychological Review, 99(3), 480-517.
Hummel, J. E., & Holyoak, K. J. (2003). A symbolic-connectionist theory of
relational inference and generalization. Psychological Review, 110(2),
220-264.
Hummel, J. E., & Stankiewicz, B. J. (1996). Categorical relations in shape
perception. Spatial Vision, 10(3), 201-236.
Itti, L., & Koch, C. (2000). A saliency-based search mechanism for overt and
covert shifts of visual attention. Vision Research, 40(10-12), 1489-1506.
Jackendoff, R. (1992). Languages of the Mind: Bradford / MIT Press.
Jolicoeur, P. (1985). The time to name disoriented natural objects. Memory &
Cognition, 13(4), 289-303.
Jolicoeur, P. (1988). Mental rotation and the identification of disoriented objects.
Can J Psychol, 42(4), 461-478.
Jordan, K., Heinze, H. J., Lutz, K., Kanowski, M., & Jancke, L. (2001). Cortical
activations during the mental rotation of different visual objects.
Neuroimage, 13(1), 143-152.
Just, M. A., & Carpenter, P. A. (1975). The semantics of locative information in
pictures and mental images. Br J Psychol, 66(4), 427-441.
Just, M. A., & Carpenter, P. A. (1976). Eye fixations and cognitive processes.
Cognitive Psychology, 8(4), 441-480.
Just, M. A., & Carpenter, P. A. (1985). Cognitive coordinate systems: accounts of
mental rotation and individual differences in spatial ability. Psychological
Review, 92(2), 137-172.
Kamitani, Y., & Tong, F. (2005). Decoding the visual and subjective contents of
the human brain. Nature Neuroscience, 8(5), 679-685.
Kayaert, G., Biederman, I., Op de Beeck, H. P., & Vogels, R. (2005). Tuning for
shape dimensions in macaque inferior temporal cortex. European Journal
of Neuroscience, 22(1), 212-224.
Kayaert, G., Biederman, I., & Vogels, R. (2003). Shape tuning in macaque
inferior temporal cortex. Journal of Neuroscience, 23(7), 3016-3027.
Kimia, B. B. (2003). On the role of medial geometry in human vision. Journal of
Physiology, Paris, 97(2-3), 155-190.
Kleiner, M., Brainard, D. H., & Pelli, D. G. (2007). What's new in Psychtoolbox.
Perception, 36(ECVP Abstract Supplement).
Konkle, T., & Oliva, A. (2010). Canonical visual size for real-world objects. J Exp
Psychol Hum Percept Perform.
Kosslyn, S. M. (1987). Seeing and imagining in the cerebral hemispheres: a
computational approach. Psychological Review, 94(2), 148-175.
Kovacs, I., & Julesz, B. (1994). Perceptual sensitivity maps within globally
defined visual shapes. Nature, 370(6491), 644-646.
Kriegeskorte, N., Mur, M., Ruff, D. A., Kiani, R., Bodurka, J., Esteky, H., et al.
(2008). Matching categorical object representations in inferior temporal
cortex of man and monkey. Neuron, 60(6), 1126-1141.
Kruskal, J. B., & Wish, M. (1978). Multidimensional Scaling. Beverly Hills, CA: Sage.
Lades, M., Vorbrüggen, J. C., Buhmann, J., Lange, J., von der Malsburg, C., Würtz,
R., & Konen, W. (1993). Distortion invariant object recognition in the dynamic
link architecture. IEEE Transactions on Computers, 42, 300-311.
Laeng, B., Chabris, C. F., & Kosslyn, S. M. (2003). Asymmetries in encoding
spatial relations. In K. Hugdahl & R. J. Davidson (Eds.), The asymmetrical
brain. Cambridge, MA: MIT Press.
Lawson, R. (1999). Achieving visual object constancy across plane rotation and
depth rotation. Acta Psychol (Amst), 102(2-3), 221-245.
Lawson, R., & Humphreys, G. W. (1996). View specificity in object processing:
evidence from picture matching. J Exp Psychol Hum Percept Perform,
22(2), 395-416.
Lee, T. S., Mumford, D., Romero, R., & Lamme, V. A. (1998). The role of the
primary visual cortex in higher level vision. Vision Research, 38(15-16),
2429-2454.
Lescroart, M. D., Biederman, I., Yue, X., & Davidoff, J. (2010). A cross-cultural
study of the representation of shape: Sensitivity to generalized cone
dimensions Visual Cognition, 18(1), 50-66.
Lyon, D. C., & Kaas, J. H. (2002). Evidence for a modified V3 with dorsal and
ventral halves in macaque monkeys. Neuron, 33(3), 453-461.
Marr, D. (1982). Vision: A Computational Investigation into the Human
Representation and Processing of Visual Information. San Francisco:
W.H. Freeman.
Marr, D., & Nishihara, H. K. (1978). Representation and recognition of the spatial
organization of three-dimensional shapes. Proceedings of the Royal
Society of London: B Biological Sciences, 200(1140), 269-294.
Mash, C. (2006). Multidimensional shape similarity in the development of visual
object classification. J Exp Child Psychol, 95(2), 128-152.
Medin, D. L., Wattenmaker, W. D., & Hampson, S. E. (1987). Family
resemblance, conceptual cohesiveness, and category construction. Cogn
Psychol, 19(2), 242-279.
Metzler, J. (1973). Cognitive analogues of the rotation of three-dimensional
objects. Unpublished Dissertation. Stanford University.
Meyer, K., Kaplan, J. T., Essex, R., Webber, C., Damasio, H., & Damasio, A.
(2010). Predicting visual stimuli on the basis of activity in auditory cortices.
Nature Neuroscience, 13(6), 667-668.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field
properties by learning a sparse code for natural images. Nature,
381(6583), 607-609.
Op de Beeck, H., Wagemans, J., & Vogels, R. (2003). The effect of category
learning on the representation of shape: dimensions can be biased but not
differentiated. Journal of Experimental Psychology: General, 132(4), 491-
511.
Op de Beeck, H. P., Baker, C. I., DiCarlo, J. J., & Kanwisher, N. G. (2006).
Discrimination training alters object representations in human extrastriate
cortex. Journal of Neuroscience, 26(50), 13025-13036.
Op de Beeck, H. P., Torfs, K., & Wagemans, J. (2008). Perceived shape
similarity among unfamiliar objects and the organization of the human
object vision pathway. Journal of Neuroscience, 28(40), 10111-10123.
Ostwald, D., Lam, J. M., Li, S., & Kourtzi, Z. (2008). Neural coding of global form
in the human visual cortex. Journal of Neurophysiology, 99(5), 2456-2469.
Palmer, S. E. (1975). Visual perception and world knowledge: Notes on a model
of sensory-cognitive interaction. In D. A. Norman & D. E. Rumelhart
(Eds.), Explorations in cognition. Hillsdale, NJ: Erlbaum.
Pasupathy, A., & Connor, C. E. (1999). Responses to contour features in
macaque area V4. Journal of Neurophysiology, 82(5), 2490-2502.
Pasupathy, A., & Connor, C. E. (2002). Population coding of shape in area V4.
Nature Neuroscience, 5(12), 1332-1338.
Pelli, D. G. (1997). The VideoToolbox software for visual psychophysics:
transforming numbers into movies. Spatial Vision, 10(4), 437-442.
Pinker, S. (2007). The Stuff of Thought: Language as a Window into Human
Nature. New York: Penguin Books.
Pinker, S., & Bloom, P. (1990). Natural language and natural selection. Behavioral
and Brain Sciences, 13, 707-784.
Pizer, S. M., Burbeck, C. A., Coggins, J. M., Fritsch, D. S., & Morse, B. S. (1994).
Object shape before boundary shape: Scale-space medial axes. Journal
of Mathematical Imaging and Vision, 4(3), 303-313.
Pizer, S. M., Siddiqi, K., Székely, G., Damon, J. N., & Zucker, S. W. (2003).
Multiscale medial loci and their properties. International Journal of
Computer Vision, 55(2/3), 155-179.
Pylyshyn, Z. W. (1979). The rate of "mental rotation" of images: a test of a holistic
analogue hypothesis. Memory & Cognition, 7(1), 19-28.
Ree, M. J., & Carretta, T. R. (1994). The Correlation of General Cognitive Ability
and Psychomotor Tracking Tests. International Journal of Selection and
Assessment, 2, 209-216.
Schyns, P. G., & Murphy, G. L. (1994). The ontogeny of part representation in
object concepts. In D. L. Medin (Ed.), The psychology of learning and
motivation: Advances in research and theory (Vol. 31, pp. 305--349). San
Diego, CA: Academic Press, Inc.
Schyns, P. G., & Rodet, L. (1997). Categorization creates functional features.
Journal of experimental psychology. Learning, memory, and cognition,
23(3), 681-696.
Sereno, M. I. (1998). Brain mapping in animals and humans. Current Opinion in
Neurobiology, 8(2), 188-194.
Serre, T., Oliva, A., & Poggio, T. (2007). A feedforward architecture accounts for
rapid categorization. Proceedings of the National Academy of Sciences
(USA), 104(15), 6424-6429.
Shepard, R. N. (1964). Attention and the metric structure of the stimulus space.
Journal of Mathematical Psychology, 1, 54-87.
Shepard, R. N. (1980). Multidimensional scaling, tree-fitting, and clustering.
Science, 210(4468), 390-398.
Shepard, R. N., & Cooper, L. A. (1982). Mental images and their transformations.
Cambridge, MA: MIT Press / Bradford Books.
Shepard, R. N., & Metzler, J. (1971). Mental rotation of three-dimensional
objects. Science, 171(972), 701-703.
Shepard, S., & Metzler, D. (1988). Mental rotation: effects of dimensionality of
objects and type of task. J Exp Psychol Hum Percept Perform, 14(1), 3-
11.
Singer, W. (1999). Neuronal synchrony: a versatile code for the definition of
relations? Neuron, 24(1), 49-65, 111-125.
Smith, A. T., Kosillo, P., & Williams, A. L. (2010). The confounding effect of
response amplitude on MVPA performance measures. Neuroimage.
Stankiewicz, B. J. (2002). Empirical evidence for independent dimensions in the
visual representation of three-dimensional shape. Journal of Experimental
Psychology: Human Perception & Performance, 28(4), 913-932.
Stankiewicz, B. J., Hummel, J. E., & Cooper, E. E. (1998). The role of attention in
priming for left-right reflections of object images: evidence for a dual
representation of object shape. J Exp Psychol Hum Percept Perform,
24(3), 732-744.
Switkes, E., Mayer, M. J., & Sloan, J. A. (1978). Spatial frequency analysis of the
visual environment: anisotropy and the carpentered environment
hypothesis. Vision Research, 18(10), 1393-1399.
Tadmor, Y., & Tolhurst, D. J. (1994). Discrimination of changes in the second-
order statistics of natural and synthetic images. Vision Research, 34(4),
541-554.
Tarr, M. J., & Bulthoff, H. H. (1995). Is human object recognition better described
by geon structural descriptions or by multiple views? Comment on
Biederman and Gerhardstein (1993). J Exp Psychol Hum Percept
Perform, 21(6), 1494-1505.
Tarr, M. J., & Pinker, S. (1989). Mental rotation and orientation-dependence in
shape recognition. Cogn Psychol, 21(2), 233-282.
Thesen, S., Heid, O., Mueller, E., & Schad, L. R. (2000). Prospective acquisition
correction for head motion with image-based tracking for real-time fMRI.
Magnetic Resonance in Medicine, 44, 457-465.
Tsunoda, K., Yamane, Y., Nishizaki, M., & Tanifuji, M. (2001). Complex objects
are represented in macaque inferotemporal cortex by the combination of
feature columns. Nature Neuroscience, 4(8), 832-838.
Tversky, B., & Hemenway, K. (1984). Objects, parts, and categories. Journal of
Experimental Psychology: General, 113(2), 169-197.
116
Ullman, S. (1989). Aligning pictorial descriptions: an approach to object
recognition. Cognition, 32(3), 193-254.
Ungerleider, L. G., & Mishkin, M. (1982). Two cortical visual systems. In Ingle DJ,
Goodale MA & M. RJ (Eds.), Analysis of Visual Behavior (pp. 549-586).
Cambridge, MA: MIT Press.
Vanrie, J., Beatse, E., Wagemans, J., Sunaert, S., & Van Hecke, P. (2002).
Mental rotation versus invariant features in object perception from different
viewpoints: an fMRI study. Neuropsychologia, 40(7), 917-930.
Wilson, K. D., & Farah, M. J. (2006). Distinct patterns of viewpoint-dependent
BOLD activity during common-object recognition and mental rotation.
Perception, 35(10), 1351-1366.
Winston, P. A. (1975). Learning structural descriptions from examples. In P. H.
Winston (Ed.), The Psychology of Computer Vision (pp. 157-209). New
York, NY: McGraw Hill.
Wolpert, I. (1924). Die Simultanagnosie: Storung Gesamtauffassung. Zeitschrift
fur die gesante Neurologie und Psychiatrie, 93, 397-415.
Xu, X., Yue, X., Lescroart, M. D., Biederman, I., & Kim, J. G. (2009). Adaptation
in the fusiform face area (FFA): image or person? Vision Research,
49(23), 2800-2807.
Xu, Y., & Chun, M. M. (2006). Dissociable neural mechanisms supporting visual
short-term memory for objects. Nature, 440(7080), 91-95.
Yamane, Y., Tsunoda, K., Matsumoto, M., Phillips, A. N., & Tanifuji, M. (2006).
Representation of the spatial relationship among object parts by neurons
in macaque inferotemporal cortex. Journal of Neurophysiology, 96(6),
3147-3156.
Yuille, J. C., & Steiger, J. H. (1982). Nonholistic processing in mental rotation:
some suggestive evidence. Percept Psychophys, 31(3), 201-209.
117
Appendix: Full stimulus set