INTEGRATING TOP-DOWN AND BOTTOM-UP VISUAL ATTENTION
by
Vidhya Navalpakkam
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
December 2006
Copyright 2006 Vidhya Navalpakkam
Dedication
To Appa, Amma, Debo and all favorable sources that inspired me
Acknowledgments
I thank Laurent for his continuous support, feedback and encouragement at all stages of my
thesis. His enthusiasm towards new ideas and his openness and tolerance to unconventional
ideas (including bizarre and ill-defined ones) dared me to think creatively and differently. His
playful approach to research inspired me to explore several new ideas, without fear of losing
time or of failure. This freedom to explore helped me expand my horizon and was critical in
developing the breadth in the first part of my thesis. I thank him for all the liberty to work
on problems of my choice. I subsequently became interested in a specific subproblem, that
of gain modulation during visual search. Laurent was very supportive of my new initiative.
The innumerable meetings that we had, and his timely feedback and criticism helped me
develop a formal theory of optimal gain modulation during visual search. With ceaseless
enthusiasm, he always motivated me towards achieving higher standards than I had initially
dreamed of. For instance, he encouraged me to organize a symposium at the annual Vision
Sciences Society meeting. This was a remarkable opportunity for me to interact with experts
in the field, and receive feedback on my research. His support and encouragement even
extended to my non-academic ventures – he supported me in starting a student organization
on campus to offer free meditation classes. He even agreed to become the faculty mentor
for the group! Without further ado, I think it would be fair to say that Laurent’s enthusiasm,
positive support and encouragement has kindled my passion to learn and pursue a career in
academia.
I thank Michael for the two most inspiring talks that I heard in my first year at USC,
that were instrumental in my decision to abandon my former interest in computer networks
to pursue research in biological computing. In less than an hour, he gave a brilliant and
succinct tour of the history of Computer Science that traced the notions of computing from
finite automata to more recent ones on neural and organic computing. This was followed by a
talk on language and its evolution. These talks inspired me to think big and drew me towards
computational neuroscience. Since then, I have had several discussions with him where he
always dared me to think big, innovate and critically evaluate how my research fits in the big
scheme of things. Michael’s insight and encouragement proved very useful in developing the
breadth in the first part of my thesis.
I thank Biederman for the lively exchanges that we shared on object recognition and
functional brain imaging and other issues beyond visual attention. I thank Malsburg for
sharing his keen insight on neural computing, and for providing me useful feedback.
I thank all my colleagues at iLab and USC, especially my roommates Ran, Jianwei,
David and Nathan for the useful feedback and brainstorming sessions that we shared.
I cannot thank my parents enough for striving hard to provide me an excellent exposure to
science and arts in my young and formative years. My father urged me to excel professionally,
forever setting higher standards for me to reach. My mother and brother complemented his
efforts by emphasizing fine arts and spiritual education. Together, my family helped me
develop a balanced approach to life and supported me in my career.
I cannot be more thankful to my undergraduate years at IIT, where apart from acquiring
technical skills, I found my soul mate, Debo. His enthusiasm towards learning, and his
adventurous spirit towards exploring new research directions inspired me to pursue graduate
studies and to taste the joy of learning. His stubborn and unrelenting attitude towards his
beliefs – often the exact opposite of mine – generated heated debates and encouraged me to
think critically and revisit my own beliefs. I would like to thank Debo for being my best
friend and worst critic. His unquenchable thirst for new research ideas continues to inspire
me to pursue a career in research and to innovate.
Finally, I would like to thank my recently acquired friend, philosopher and guide, Chariji,
who has inspired me to embark on a completely new adventure in my life. His simplicity,
openness, love and tireless service to help humanity continues to inspire me to aspire beyond
my professional goals. I am currently meditating under his guidance and am learning to
integrate a scientific mind with a spiritual heart.
This list is grossly incomplete. I would like to thank everyone that inspired me.
Table of Contents
Dedication
Acknowledgments
List of Figures
List of Tables
Abstract
1 General Introduction
2 A Large-scale Model of Attention and Scene Understanding
2.1 Introduction
2.2 Related work
2.3 An overview of our architecture
2.4 Estimating the task-relevance of scene entities
2.4.1 Symbolic long-term memory (LTM)
2.4.2 Symbolic Working memory (WM)
2.5 Top-down biasing for object detection
2.5.1 Learning the object representation
2.5.2 Object detection using the learned visual representation
2.6 Using attention for Object recognition
2.6.1 Case 1: |Match(f, x)| = 1: Unique match at level x
2.6.2 Case 2: |Match(f, x)| > 1: Ambiguity at level x
2.7 Results
2.7.1 Visual search for a known target in an unknown scene
2.7.2 Consistency with available psychophysical data
2.7.3 One-shot learning
2.7.4 Multiple target detection
2.7.5 Object recognition
2.7.6 Estimating task-relevance of scene locations
2.8 Discussion
2.8.1 Target Representation
2.8.2 Target detection
2.8.3 Target Recognition
2.8.4 Memorization
2.8.5 Scene Representation
2.9 Unsolved challenges
2.10 Conclusion
3 Investigating the Granularity of Top-Down Attention during Visual Search
3.1 Introduction
3.2 Related work
3.3 Design and analysis of experiments
3.3.1 Experiment 1: Intensity
3.3.2 Experiment 2: Size
3.3.3 Experiment 3: Color saturation
3.3.4 Control experiments
3.4 Discussion
4 Optimal Integration of Top-down and Bottom-up Attention during Visual Search
4.1 Introduction
4.2 Related Work
4.3 Model
4.3.1 A Theory of Optimal Feature Gain Modulation
4.3.2 Special cases
4.4 Results
4.4.1 Simulating Visual Search Conditions
4.4.2 Psychophysics experiments
4.5 Alternative objective functions
4.6 Discussion
5 Applications in computer vision
5.1 Introduction
5.2 Theory
5.3 Results
5.3.1 Artificial search arrays
5.3.2 Natural scenes
5.4 Discussion
6 General Discussion
6.1 Summary
6.2 Outstanding issues
Reference List
List of Figures
2.1 Overview of current understanding of how task influences visual attention: Given
a task such as “find humans in the scene”, prior knowledge of the target’s features
is known to influence low-level feature extraction by priming the desired features.
These low-level features are used to compute the gist and layout of the scene as well
as the bottom-up salience of scene locations. Finally, the gist, layout and bottom-up
salience map are somehow combined with the task and prior knowledge to guide
attention to likely target locations. The present study attempts to cast this fairly
vague overview model into a more precise computational framework that can be
tested against real visual inputs.
2.2 Hypothesis: We hypothesize the existence of different kinds of salience maps that
encode different nature of information about the scene. In particular, we hypothesize
that the posterior parietal cortex may encode a visual salience map, the pre-frontal
cortex may encode a top-down task-relevance map, and the superior colliculus may
store an attention guidance map that guides the focus of attention.
2.3 Phase 1 (top left): Eyes closed, Phase 2 (top right): Computing, Phase 3 (bottom
left): Attending, Phase 4 (bottom right): Updating. Please refer to section 2.2 for
details about each phase. All four panels represent the same model; however, to
enable easy comparison of the different phases, we have highlighted the components
that are active in each phase and faded those that are inactive. Dashed lines indicate
parts that have not been implemented yet. Following Rensink’s (2000) terminology,
volatile processing stages refer to those which are under constant flux and regenerate
as the input changes.
2.4 Sample ontology, as used to represent long-term knowledge in our model: The re-
lations include is a, includes, part of, contains, similar, related. While the first five
relations appear as edges within a given ontology, the related relation appears as
edges that connect the three different ontologies. The relations contains and part of
are complementary to each other as in Ship contains Mast, Mast is part of Ship. Sim-
ilarly, is a and includes are complementary. Hand-picked co-occurrence measures
are shown on each edge and the conjunctions, disjunctions are shown using the truth
tables.
2.5 Estimating task-relevance: To estimate the relevance of an entity, we check the ex-
istence of a path from the entity to the task graph and check for property conflicts.
To find “what is the man catching”, we are looking for a hand related object that is
small and holdable, hence a big object like car is considered irrelevant; whereas a
small object like pen is considered relevant.
2.6 Learning a general representation of an object: The model uses a binary target mask
(target is 1 and background is 0) to serve as a location cue. The model learns the
views by extracting the center-surround feature vectors at different spatial scales
from a few locations within the target. Next, it combines the views to form instances.
The instances are in turn combined to form a general representation of the object.
2.7 Top-down biasing model for object detection: To detect a specific target object in
any scene, we use the learned target representation to bias the linear combination of
different feature maps to form the salience map. In the salience map thus formed, all
scene locations whose features are similar to the target become more salient and are
more likely to draw attention.
2.8 Architecture for object recognition: Our model recognizes the object at the attended
scene location by extracting a center-surround feature vector from that location and
finding the best match by comparing it against representations stored in the object
hierarchy.
2.9 Results: This figure shows our model’s results for top-down biasing for a sample
from our database of objects. The first column is the target object that we biased
the model for; the second column shows the distractor object when in a search ar-
ray setup, or “natural” means that a natural cluttered scene was the background or
distractor; the third column shows the 95% confidence interval for improvement in
target salience normalized by maximum salience in the display (biased over naive
models); the fourth column shows the 95% confidence interval for improvement in
number of attentional shifts before detection of the target (naive over biased mod-
els); the fifth column shows the hypothesis supported by the salience data. The null
hypothesis H_0 (mean improvement in normalized target salience = 2.0) or alternative
hypothesis H_2 (mean improvement in normalized target salience > 2.0) was
supported by a majority of the target objects. In some cases where the distractors
were very similar to the target, the alternative hypothesis H_1 (mean improvement
in normalized target salience < 2.0) was supported. The final column shows some
remarks on the effect of biasing on detection time. Note that in the case of pop-out,
improvement in normalized target salience is approximately 1.0 because the target
is already the most salient item in the display (hence, target salience normalized by
maximum salience equals 1.0), and biasing maintains the target as the most salient
item.
2.10 Results for top-down biasing: The example on the left shows the attentional trajec-
tory during free examination of this scene by the naive, bottom-up salience model
(yellow circles represent highly salient locations, green circles represent less salient
locations, red arrows show the scanpath). Even after 20 fixations, the model did not
attend to the coke can, simply because its salience was very low compared to that of
other conspicuous objects in the scene. Displayed on the right is the attentional tra-
jectory after top-down biasing for the coke can object class (built from instances and
views of the coke can from other photographs containing the can in various settings).
Our model detected the target as early as the third fixation.
2.11 Difference between our biased model and Rao et al.’s model: Consider searching
for a red-vertical item among red-horizontal and blue-vertical items. Rao’s model
computes salience of each scene location as the Euclidean distance between the tar-
get and that location in feature space, by progressively considering the information
at coarse-to-fine scales. The corresponding salience maps obtained for the first three
fixations are shown here. As early as the third fixation, the salience map including
the finest scale clearly shows the target to be the single most salient location in the
scene. Thus, Rao’s model predicts that conjunction searches are efficient (see section
2.6.2 for details on our re-implementation of that model). On the other hand, in our
model, biasing promotes the red and vertical features. In the resulting color feature
map, the target as well as red-horizontal distractors become active. Similarly, in the
orientation feature map, the target as well as blue-vertical distractors become active.
Due to spatial interactions within each feature map, the target and the distractors
cancel each other. In the resulting salience map, the salience of the target and the
distractors are comparable, hence, leading to an inefficient search.
2.12 Comparison between the performance of different models: This figure shows a com-
parison between the performance of a random model, our unbiased model, our bi-
ased model, and a top-down model as proposed by Rao et al. The performance of the
models is compared on search arrays creating pop-out in color (first column), pop-
out in orientation (second column), and serial, conjunction searches (third column).
The x axis shows the number of items in the display and the y axis shows the reaction
time (RT) measured as the number of fixations engaged by the model before target
detection. The random model assumes uniform probability of attending to each item
in the display; hence, on average, it attends to half the total number of items in the
display before finding the target. In single feature searches, our unbiased (unknown
target) and biased (known target) models, along with Rao’s model (known target)
correctly predict efficient search as shown in columns 1 and 2. However, in con-
junction searches as shown in column 3, Rao’s model continues to predict efficient
search (slope = 0, reaction time does not change with increasing number of items
in the display), while our unbiased and biased models show an approximately linear
increase in reaction time with increasing number of items in the display, which is
typical of inefficient searches.
2.13 One-shot learning: the model learned a specific instance of the handicap sign from
the image shown in the center and used the learned instance to detect new handicap
signs in different poses, sizes and backgrounds as shown in the other images.
2.14 Sequential detection of multiple targets: The model initialized the working memory
with the targets to be found and their relevance (handicap sign, relevance = 1; fire
hydrant, relevance = 0.5). It biased for the most relevant target (in this case, the
handicap sign), made a false detection, recognized the fixation (fire hydrant), up-
dated the state in its working memory (recorded that it found the fire hydrant), and
proceeded to detect the remaining target by repeating the above steps.
2.15 Statistics for the hierarchical recognition of arbitrary fixations, for a sample of ob-
jects from our database: As an initial implementation, we considered a simple object
hierarchy with just 3 main levels (level 1: all objects, level 2: instances and level 3:
views) and at level 0 was a dummy root that was a general class combining all the
objects. The first column is the target object; the second column shows the per-
centage of false positives (number of distractors that were falsely recognized as the
target, over the total number of distractors); the third column shows the distractor
that accounted for the false positives; the fourth column shows the percentage of
false negatives (number of targets that were not recognized as the target, over the
total number of targets); the fifth, sixth and seventh column show the top 3 contribu-
tors to false negatives. Despite the simplicity of the model (it attempts to recognize
fixations by looking at just one location in the object), it seems to be able to classify
the target in the appropriate category of objects - as shown in this figure, the contrib-
utors for false negatives and false positives share features with the target, i.e., they
are similar to the target.
2.16 Learning the TRM: The model learned the TRM for a driving task by attending,
estimating the relevance of attended scene locations and updating the TRM. The
development of the TRM across 28 fixations is shown here (brighter shades of grey
indicate locations more relevant than baseline). Note that the TRM does not change
significantly after a while and is learned to a reasonable precision within the first
5-10 fixations.
2.17 Learning the TRM: On the same scenes as used for the driving task, we learned the
scene locations that belonged to the sky category. The TRM learned after the first 28
fixations is displayed here. Those locations belonging to the car category are clearly
suppressed or marked irrelevant (dark) compared to baseline (white). It may appear
misleading that the road is marked as relevant. Since the road was non-salient, it
did not attract any attention and hence was not marked as irrelevant and remained at
baseline.
3.1 Testing the hypotheses: Consider searching for a MID intensity target (marked by
a yellow circle for illustration purposes) among LOW, MID and HIGH intensity
distractors. Let the display be processed by neurons that are tuned to LOW, MID
and HIGH intensity intervals. The feature maps corresponding to the LOW, MID
and HIGH intensity intervals are added to form a saliency map that subsequently
guides attention. If top-down guidance were coarse, the gains on LOW, MID and
HIGH intensity intervals would be equal, resulting in equal salience of all items,
thereby yielding equal number of fixations on all intervals. In contrast, if top-down
guidance were fine-grained, the gain on the relevant MID intensity interval would
be higher than LOW and HIGH, resulting in higher salience of items in the MID
interval, thereby yielding higher number of fixations on the MID interval than LOW
or HIGH.
3.2 Reaction time: This figure shows RT for all valid trials in the LOW, MID and HIGH
search conditions within intensity, size and color saturation dimensions. In all fea-
ture dimensions, search was slower in the MID condition than the LOW or HIGH
conditions, as demonstrated by the linear separability theory. Note that there is no
speed-accuracy tradeoff here as the RT was only computed over valid (correct)
trials.
3.3 Results in the intensity dimension: a) The first column shows results during search
for a LOW intensity target. The sample eye trace illustrates that subjects tend to fix-
ate on the relevant LOW intensity distractors. Statistical analysis of all trials reveals
a significantly higher number of fixations on the relevant LOW intensity items (indicated
by a yellow star) than MID or HIGH (paired t-test, p-value < 0.05). Statistical
analysis of fixations as a function of time (in units of block number) reveals that the
strength of biasing does not change with time (see table 2). b) Similar results are
observed for the MID and HIGH conditions. As shown in the second column, when
subjects search for a MID intensity target, they selectively fixate on the MID inten-
sity distractors compared to LOW or HIGH. These results demonstrate that top-down
signals can guide attention to the relevant interval within the intensity dimension.
3.4 Results in the size dimension: a) The first column shows results during search for a
LOW size target. The sample eye trace illustrates that subjects tend to fixate on the
relevant LOW size distractors. Statistical analysis of all trials reveals a significantly
higher number of fixations on the relevant LOW size items (indicated by a yellow
star) than MID or HIGH (paired t-test, p-value < 0.05). Analysis of fixations as a
function of time (measured in units of blocks from 1 to a maximum of 10) reveals
that the strength of biasing does not change with time (see table 2). b) Similar results
are observed for the MID and HIGH conditions. As shown in the second column,
when subjects search for a MID size target, they selectively fixate on the MID size
distractors compared to LOW or HIGH. These results demonstrate that top-down
signals can guide attention to the relevant interval within the size dimension.
3.5 Results in the saturation dimension: a) The first column shows results during search
for a target with LOW saturation. The sample eye trace illustrates that subjects tend
to fixate on the relevant distractors of LOW saturation. Statistical analysis of all
trials reveals a significantly higher number of fixations on relevant items of LOW
saturation (indicated by a yellow star) than MID or HIGH (paired t-test, p-value <
0.05). Analysis of fixations as a function of time (measured in units of blocks from
1 to a maximum of 10) reveals that the strength of biasing does not change with
time (see table 2). b) Similar results are observed for the MID and HIGH conditions.
As shown in the second column, when subjects search for a target with MID satu-
ration, they selectively fixate on distractors with MID saturation compared to LOW
or HIGH. These results demonstrate that top-down signals can guide attention to the
relevant interval within the saturation dimension.
3.6 Control experiments and results: a) Design of the control experiment: Search ar-
ray was presented for a brief duration (120ms only) to minimize the role of serial
scanning processes. Search array consisted of a 3x3 grid of equal number of items
belonging to LOW, MID and HIGH intensity intervals. In each search condition
(LOW/MID/HIGH), the target was fixed. Subjects were instructed to search for the
known target and report the number at its location. The reports were analyzed to
determine the % reports on items of each intensity interval. Results of a paired t-test
showed a significantly higher number of reports on items of the relevant interval. For
instance, when subjects searched for a MID intensity target, there were more reports
on items of the MID intensity interval than LOW or HIGH. These results confirm
the role of parallel gain-based guidance in our search experiments.
4.1 Overview of our model: The incoming visual scene A is analyzed in several feature
dimensions (e.g., color and orientation) by populations of neurons with bell-shaped
tuning curves. Within each dimension, bottom-up saliency maps (s_1(A) ... s_n(A))
are computed for different feature values and combined in a weighted linear manner
to form the overall saliency map (S(A)) for that dimension. Given this model, how
do we choose the optimal set of top-down gains (g_1 ... g_n) such that the target tiger
becomes most salient among distracting clutter? Our theory shows that the intuitive
choice of looking for the tiger’s yellow feature would actually be suboptimal, because
this would activate the distracting grassland more than the tiger. Instead, the
optimal strategy would be to look for orange, which is mildly present in the tiger,
but hardly present in the grasslands, and hence best differentiates between the target
and the distracting background.
4.2 Three phases of visual search: Phase 1) Combined bottom-up and top-down processing
of the visual input: The top-down gains (Phase 3) derived from the observer’s
beliefs (Phase 2) are combined with bottom-up salience computations to yield the
overall salience of the target and distractors. This determines search performance,
measured by SNR. Phase 2) Acquiring a belief: The distributions of target and
distractor features may be learned through estimation from past trials, preview of
picture cues, verbal instructions, or other means. Phase 3) Generating the optimal
top-down gains: The learned belief in target and distractor features is translated into
a belief in salience of the target and distractors, thus yielding SNR_b. The top-down
gains are chosen so as to maximize SNR_b.
4.3 Optimal gains as a function of d′ and Δ_i, computed according to eqn 4.37: When
d′ is high (e.g., d′ ≥ 3), the maximum gain occurs at Δ_i = 0, i.e., when the
target-distractor discriminability is high, a neuron that is tuned to the target feature
is promoted maximally. However, when d′ is low (e.g., d′ = 0.5), the maximum
gain occurs at Δ_i > 0, i.e., when the target-distractor discriminability is low, a
neuron that is tuned to a non-target feature is promoted more than a neuron tuned to
the target feature.
4.4 Simulation results for a variety of search conditions (shown in different rows): The
first column shows the true distribution of the target (T) features (p(Θ|T), solid red)
and distractor (D) features (p(Θ|D), dashed blue), and the second column shows
the observer’s belief (p(Θ_b|T), p(Θ_b|D)). The third column shows the optimal distribution
of neural response gains superimposed over p(Θ|T), p(Θ|D). The fourth
column shows SNR, followed by the implications of our results, along with experimental
evidence. For example, row ’a’ illustrates how lack of prior knowledge
prevents any top-down guidance of search. Let the true distributions p(Θ|T) and
p(Θ|D) peak at different values, e.g., red target among green distractors. When T
and D are unknown, the beliefs p(Θ_b|T), p(Θ_b|D) are uniform distributions with all
features being equally likely. Hence, the optimal gains are set to baseline (g_i = 1,
i ∈ {1...n}). Remarks and supporting experimental evidence for the remaining
search conditions (rows a-h) are shown in the fifth column in this figure. Our theory
is able to formally predict several effects in visual search behavior which have been
previously studied empirically [181, 165, 37, 119, 37, 103, 155, 177, 38, 4, 53].
4.5 Comparison of SNR_i: When the target feature (shown in solid red) is similar to the
distractor feature (shown in dotted blue), neuron 2 that is tuned to an exaggerated
feature provides higher SNR_i than neuron 1 that is tuned to the exact target feature.
4.6 a) Experimental design: We test the theory’s prediction of top-down bias during
search for a low-discriminability target among distractors (figure 4.4i). The top-down
bias is set when subjects perform T_1 trials. After a random number of T_1 trials,
the top-down bias is measured in a T_2 trial. A T_1 trial consists of a fixation followed
by a search array containing one target (55°) among several distractors (50°). Subjects
are instructed to report the target as soon as possible. The subject’s response is
validated on a per-trial basis through a novel No Cheat scheme that is described in
the Methods section. A T_2 trial consists of a fixation, followed by a brief display of
five items representing five features, and by five fine-print random numbers. Subjects
are asked to report the number at the target location. b) Experimental results: We
ran 4 subjects (3 naïve), aged 22-30, normal or corrected vision, with IRB approval.
The T_2 trials were analyzed to find the number of reports on 30°, 50°, 55°, 60°, 80°
features. The number of reports on the relevant feature (60°, marked by a golden
star) is significantly higher (paired t-test, p < 0.05) than the number of reports on
the target feature (55°). c) Controls: In a control experiment, we maintained the
same target feature, but reversed the distractor feature. In the T_1 trials, subjects
now searched for the 55° oriented target among 60° oriented distractors. Everything
else, including the T_2 trials, instructions and analysis, remained the same. Statistical
analysis of the number of reports showed a reversal in trend compared to b), with significantly
higher number of reports on the currently relevant feature (50°, marked by a
golden star) than the target feature (55°).
4.7 a) Experimental design: We test the theory’s prediction of top-down bias in the
color dimension. The experimental design is similar to figure 4.6. The target has a
medium green hue (CIE x=0.24, y=0.42), while the distractor is either more green
(x=0.25, y=0.45, figure 4.7b) or less green (x=0.23, y=0.38, figure 4.7c), and the
irrelevant controls are yellow (x=0.42, y=0.50) and blue (x=0.21, y=0.27). b) Experimental
results: We ran 3 subjects (naïve), aged 22-30, normal or corrected vision,
with IRB approval. The T_2 trials were analyzed to find the number of reports on
the yellow, more green, medium green, less green and blue features. When subjects
searched for a medium green target among less green distractors, as predicted by the
theory, there were significantly more reports (paired t-test, p-value < 0.05) on the
more green feature than the target feature. c) Controls: In a control experiment, we
maintained the same target feature, but reversed the distractor feature. Now, subjects
searched for a medium green target among more green distractors. Statistical
analysis of the number of reports showed a reversal in trend compared to b), with significantly
higher number of reports on the less green feature than the target feature.
These results support optimal feature biasing as suggested by our theory.
4.8 Comparison of different objective functions: These simulations compare the search
performance when gains are modulated to maximize SNR (ratio of expected target
salience relative to expected distractor salience) vs. D′ (discriminability between
target and distractor salience). The first two columns illustrate different search conditions
(each denoted by a particular distribution of target feature P(Θ|T), shown in
solid red, and distractor feature P(Θ|D), shown in dotted blue). According to previous
psychophysics studies, the search condition illustrated in the first column is
known to be more difficult than its counterpart in the second column. While maximizing
SNR successfully accounts for this difference (as shown in the third column, the
ratio of SNR values in easier vs. difficult conditions is > 1), maximizing D′ fails in
some cases (as shown in the fourth column, in conditions e, f, g, the ratio of D′ is < 1).
This validates our choice of SNR as the relevant objective function.
5.1 Overview of our model: Let the incoming visual scene A contain target and distractors
sampled from probability density functions P(Θ|T) and P(Θ|D). Our
model assumes that the visual input is analyzed in different feature dimensions
by a population of neurons with broad and overlapping tuning curves. Bottom-up
saliency maps s_ij(A) are extracted for the i-th feature within the j-th dimension,
i ∈ {1...n}, j ∈ {1...N}. Prior knowledge of the target and distractors is used to
compute the top-down gains g_ij and g_j. The bottom-up maps s_ij(A) are then multiplicatively
weighted by the top-down gains g_ij and are summed to yield S_j(A), the
saliency map for the j-th dimension. The resulting saliency maps S_j(A) are again
weighted by top-down gains g_j and summed across different feature dimensions to
form the overall saliency map S(A). The goal here is to choose optimal top-down
weights that maximize the target’s salience relative to the background, thereby maximizing
the speed of detecting the target.
5.2 Simulation results: This figure shows the results of testing on 750 artificial search
arrays and natural scenes. Each row shows a different search task with different
targets and distractors. The first column shows a sample test scene. The second
column shows the SNR (in decibels) predicted by four different models described
in section 3. The third column shows the distribution of optimal top-down gains
derived from knowledge of the target and distractors, as computed by model T1D1.
The dotted blue lines are the default gains (1) used by model T0D0. The first plot
shows the gains on the intensity (I), color (C) and orientation (O) dimensions. The
subsequent plots show the gains within these dimensions (in the order of intensity,
color and orientation). The final column shows some remarks. As described in section
3.1, these results are consistent with bottom-up and top-down effects reported in
psychophysics experiments. Across all search tasks, model T1D1 performed at least
as well as or better than T1D0 and T0D1, which performed better than T0D0. These results
suggest that knowledge of both the target and the distracting background plays
an important role in improving search speed.
5.3 Example training data
5.4 Comparison of different models: Comparison of saliency maps of the naive bottom-
up model T0D0 (second row) vs. T1D1 (third row) are shown during search for a
phone on a desk (first column), a coke can in a cluttered scene (second column), and
a pen in a distracting background (third column). Although the target is not bottom-
up salient, prior knowledge of the target and the distracting background (acquired
through training) helps in improving the SNR, thereby rendering the target more
salient and suppressing noisy activity due to the distractors.
List of Tables
2.1 Statistics of target salience as computed by the biased model over that computed by
the naive unbiased model: The first column states the target representation that was
used for biasing (see section 2.6.3 for details). The second column shows the mean
improvement in target salience; the third column shows the standard deviation; the
fourth column shows the 95% confidence interval; the fifth and sixth columns show
the minimum and maximum improvements obtained.
2.2 Statistics of target detection time as taken by the naive unbiased model over that
taken by the biased model: The first column states the target representation that
was used for biasing (see section 2.6.3 for details). The second column shows the
mean improvement in target detection time; the third column shows the standard
deviation; the fourth column shows the 95% confidence interval; the fifth and sixth
columns show the minimum and maximum improvements obtained.
3.1 Strength of biasing: For each dimension tested (intensity, size, color saturation),
we find the strength of biasing (computed as percentage of fixations on the relevant
feature interval) in the LOW, MID and HIGH search conditions. A t-test reveals that
in each search condition the strength of biasing is significantly higher (p << 0.01)
than the baseline of 33.33% predicted by chance. The p-values and 95% confidence
intervals in strength of biasing are reported.
3.2 Strength of biasing as a function of time: For a given dimension (e.g., intensity,
size or color saturation), and a given search condition (e.g., search for a target of
LOW, MID or HIGH feature interval), we determine whether the strength of biasing
changes with time by performing a 1-way ANOVA test. Results across all conditions
show that the strength of biasing does not change significantly (p ≥ 0.05) with time
(measured in units of block number ranging from 1 up to a maximum of 10).
Abstract
Visual attention – the brain’s mechanism for selecting important visual information – is in-
fluenced by a combination of bottom-up (sudden, unexpected visual events that are spatio-
temporally different from the surroundings) and top-down (goal-relevant) factors. Although
both are crucial for real-world applications like robot navigation or visual surveillance, most
existing models are either purely bottom-up or top-down. In this thesis, we present a new
model that integrates top-down and bottom-up attention. We begin with a wide perspective
of how a task specification (e.g., “who is doing what to whom”) influences attention dur-
ing scene understanding. We propose and partially implement a general-purpose architecture
illustrating how different bottom-up and top-down components of visual processing such as
the gist, saliency map, object detection and recognition modules, working memory, long term
memory, and task-relevance map may interact and interface with each other to guide attention to
salient and relevant scene locations. Next, we investigate the specifics of how bottom-up
and top-down influences may integrate while searching for a target in a distracting back-
ground. We probe the granularity of information integration within feature dimensions such
as color, size, and luminance. Results of our eye tracking experiments indicate that bottom-up re-
sponses encoding feature dimensions can be modulated by not just one, but several top-down
gain control signals, thus revealing high granularity of integration. Finally, we investigate
the computational principles underlying the integration. We derive a formal theory of opti-
mal integration of bottom-up salience with top-down knowledge about target and distractor
features, such that the target’s salience relative to the distractors is maximized, thereby accel-
erating search speed. Our theory makes a surprising prediction that traditional approaches of
boosting neurons favoring the target features are sub-optimal. Instead, we show that in some
cases, the optimal approach is to boost neurons favoring a non-target feature. We provide
experimental evidence supporting this prediction. Results of testing on artificial and natural
images show that the theory successfully accounts for several effects in human visual search
behavior (including pop-out, target-distractor discriminability, distractor heterogeneity, lin-
ear separability, feature priming, target uncertainty). In summary, this thesis provides insight
into how bottom-up and top-down attention may be integrated in the primate brain.
Chapter 1
General Introduction
Attention as a general selection process: The primate visual system faces the herculean
task of quickly and reliably selecting an important subset of the roughly 10^10 bits/s
of information entering the retina [163]. Visual attention is the brain’s selection mechanism
by which the salient and relevant visual inputs are selected for further processing such as
representation, analysis, control and action. It is known to benefit behavioral performance
in several ways, including improved processing of stimuli at the attended location [124],
enhanced spatial resolution [184, 49], improved feature discrimination [85], increased de-
tectability of stimuli by lowering contrast thresholds [48, 21], improved change detection,
recognition and awareness. It is little wonder that attention has been so extensively studied.
Attentional mechanisms: While earlier studies suggested that attention acts as a “spot-
light” [156], enhancing the neural representation of the attended stimulus [99, 95], later stud-
ies suggest that attention biases competition between multiple stimuli in favor of attended
stimuli, resulting in enhanced processing of attended stimuli and inhibition of unattended
stimuli [98, 131]. Attention is also known to increase baseline firing rates of neurons fa-
voring attended stimuli [24, 77], cause multiplicative gain modulation in neural response
[159, 174], increase contrast sensitivity [132] and dynamically change receptive field prop-
erties [147, 46, 28, 182] (for review, see [78, 130, 94]). Thus, attention plays a critical role
in visual processing by facilitating enhanced processing of salient and relevant stimuli and
filtering out unwanted information.
Bottom-up attention: Attention can be guided in an involuntary, automatic manner towards
unexpected and salient visual stimuli that stand out from the background due to spatiotem-
poral differences. For instance, attentional guidance to a red fruit among green leaves; a
rapidly moving vehicle among several slower ones; an unexpected event such as an abrupt
onset of an obstacle on your path, or a sudden bomb explosion are all examples of involun-
tary, stimulus or image-driven attention. This is also referred to as exogenous or bottom-up
attention. Bottom-up attention operates through feedforward neural processes that rapidly
select salient information in an automatic manner. Several feedforward models of bottom-up
attention have been proposed, deploying different notions of saliency [81, 69, 118, 3, 64] that
guide attention according to a central saliency map [80, 65, 87] or in a distributed manner
[162]. For a detailed review on models of bottom-up attention, please see [67].
Top-down attention: While attention can be driven in a fast, involuntary, stimulus-driven,
bottom-up manner, it can also be driven by volition in a slower manner. For instance, given a
behavioral goal (e.g., find your friend in a crowd), attention can be guided to the goal-relevant
stimuli (e.g., faces), while filtering out irrelevant stimuli. Other examples of voluntary atten-
tion include directing attention to a specific location such as this line of the text while reading
(spatial attention) [124], to a specific feature such as golden yellow while searching for trea-
sure (feature-based attention) [159], to a specific object such as the ball in a soccer game
(object-based attention) [75]. Such voluntary attention is referred to as top-down attention,
and is mediated through feedback, reentrant connections in the cortex. Several models of
top-down attention have been proposed to achieve different goals like object detection [126],
object recognition [138], and scene understanding [51].
Need to integrate Bottom-up and Top-down attention: Most existing attention models are
either bottom-up driven or top-down driven. There have been few attempts to integrate both
top-down and bottom-up attention [113]. Such integration is crucial for robot navigation,
visual surveillance and any realistic visual search. For instance, in visual surveillance, it is
important to detect goal-relevant targets like suspects, and to simultaneously notice unex-
pected visual events like gun shots or sudden explosions. Similarly, robot navigation requires
top-down detection of landmarks and road signs, as well as bottom-up detection of unex-
pected obstacles and accidents. In this thesis, we present a new model that combines both
top-down and bottom-up influences to guide attention during visual search for a target ob-
ject in distracting clutter, and for scene understanding. The following paragraphs provide an
overview of the different chapters in this thesis.
Chapter 2 – A bird’s eye view of attention and scene understanding: In chapter 2, we pro-
pose a computational model for the task-specific guidance of visual attention in real-world
scenes. Our model emphasizes four aspects that are important in biological vision: determin-
ing task-relevance of an entity, biasing attention for the low-level visual features of desired
targets, recognizing these targets using the same low-level features, and incrementally build-
ing a visual map of task-relevance at every scene location. Given a task definition in the
form of keywords, the model first determines and stores the task-relevant entities in work-
ing memory, using prior knowledge stored in long-term memory. It attempts to detect the
most relevant entity by biasing its visual attention system with the entity’s learned low-level
features. It attends to the most salient location in the scene, and attempts to recognize the
attended object through hierarchical matching against object representations stored in long-
term memory. It updates its working memory with task-relevance of the recognized entity
and updates a topographic task-relevance map with the location and relevance of the recog-
nized entity. The model is tested on three types of tasks: single-target detection in 343 natural
and synthetic images, where biasing for the target accelerates target detection over two-fold
on average; sequential multiple-target detection in 28 natural images, where biasing, recog-
nition, working memory and long term memory contribute to rapidly finding all targets; and
learning a map of likely locations of cars from a video clip filmed while driving on a highway.
The model’s performance on search for single features and feature conjunctions is consistent
with existing psychophysical data. These results of our biologically-motivated architecture
suggest that the model may provide a reasonable approximation to many brain processes in-
volved in complex task-driven visual behaviors.
Zooming into the specifics – Attention during visual search: Following a wide perspective
on how top-down and bottom-up influences may integrate to guide attention during scene
understanding, we zoom into a specific subproblem, namely visual search for a target in a
distracting background. In the rest of this thesis, we address several issues such as the gran-
ularity of top-down information, and the computational principles underlying its integration
with bottom-up salience. The following paragraphs provide a summary of these issues.
Chapter 3 – Investigating the granularity of top-down attentional selection during visual
search: While much is known about the sources and modulatory effects of top-down atten-
tional signals, the information capacity of these signals is less known. In chapter 3, we in-
vestigate the granularity of top-down attentional signals. Previous theories in psychophysics
have provided conflicting evidence on whether top-down guidance is coarse-grained (i.e., one
gain control term per feature dimension) or fine-grained (i.e., multiple gain control terms per
dimension). We resolve the conflict by designing new experiments that disentangle top-down
from bottom-up contributions, thereby avoiding confounds existing in previous studies. The
results of our eye tracking experiments indicate that top-down signals are fine-grained and
can specify multiple gain control terms per dimension. As a consequence, top-down signals
can specify not only the relevant feature dimension, but also the relevant interval within a
dimension.
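As a purely illustrative contrast between the two hypotheses (this is not code from the thesis; the maps, gain values and the simple summation below are made-up placeholders), coarse-grained guidance would apply one gain to an entire dimension, whereas fine-grained guidance can weight the interval-tuned channels (e.g., LOW, MID and HIGH intensity) within that dimension separately:

```python
import numpy as np

# Three interval-tuned feature maps within one dimension (e.g., LOW/MID/HIGH intensity);
# random values stand in for bottom-up responses to a search array.
low_map, mid_map, high_map = (np.random.rand(4, 4) for _ in range(3))

# Coarse-grained hypothesis: a single top-down gain for the whole intensity dimension,
# so all three intervals are boosted equally.
g_intensity = 1.5
coarse_saliency = g_intensity * (low_map + mid_map + high_map)

# Fine-grained hypothesis: one gain per interval, so a MID-intensity target can be
# favored over LOW and HIGH distractors.
g_low, g_mid, g_high = 0.5, 2.0, 0.5
fine_saliency = g_low * low_map + g_mid * mid_map + g_high * high_map
```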
Chapter 4 – Optimal integration of Bottom-up and Top-down attention during visual search: In
chapter 4, we investigate the computational principles underlying the integration of top-down
search goals with bottom-up salience. Specifically, how does a visual search goal modulate
activity of neurons encoding different visual features (e.g., color, direction of motion)? Pre-
vious research suggests that goal-driven attention enhances the gain of neurons representing
the target’s visual features. Here, we present mathematical and behavioral evidence that this
strategy is suboptimal and humans do not deploy it. We formally derive the optimal feature
gain modulation theory, which combines information from both the target and distracting
clutter to maximize the relative salience of the target. We qualitatively validate the theory
against existing electrophysiological and psychophysical literature. A surprising prediction
is that it is sometimes optimal to enhance non-target features. We provide experimental ev-
idence towards this through psychophysics experiments on human subjects, thus suggesting
that humans deploy the optimal gain modulation strategy.
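To make the objective concrete, the block below sketches only the generic structure of the problem, using the notation of the figure captions later in this thesis (bottom-up feature maps s_i, top-down gains g_i, overall salience map S, and SNR as the ratio of expected target salience to expected distractor salience); it is a schematic summary rather than the derivation itself, and the closed-form optimal gains (eqn 4.37) are developed in chapter 4:

```latex
\begin{align}
  S(A)            &= \sum_{i=1}^{n} g_i \, s_i(A) \\
  \mathrm{SNR}(g) &= \frac{\mathbb{E}[\,S(A) \mid \mathrm{target}\,]}
                          {\mathbb{E}[\,S(A) \mid \mathrm{distractors}\,]}
                   = \frac{\sum_i g_i \, \mathbb{E}[\,s_i \mid T\,]}
                          {\sum_i g_i \, \mathbb{E}[\,s_i \mid D\,]} \\
  g^{*}           &= \arg\max_{g \,\ge\, 0} \; \mathrm{SNR}(g)
\end{align}
```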
Chapter 5 – Applications in computer vision: Integration of goal-driven, top-down atten-
tion and image-driven, bottom-up attention is crucial for visual search. Yet, previous research
has mostly focused on models that are purely top-down or bottom-up. In chapter 5, we implement
an enriched version of the theory presented in chapter 4, which combines both bottom-up
and top-down attention. The bottom-up component computes the visual salience of scene
locations in different feature maps extracted at multiple spatial scales. The top-down com-
ponent uses accumulated statistical knowledge of the visual features of the desired search
target and background clutter, to optimally tune the bottom-up maps such that target detec-
tion speed is maximized. Testing on 750 artificial and natural scenes shows that the model’s
predictions are consistent with a large body of available literature on human psychophysics of
visual search. These results suggest that our model may provide a good approximation of how
humans combine bottom-up and top-down cues so as to optimize target detection speed.
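The weighting scheme just described (within-dimension gains and across-dimension gains, in the notation of figure 5.1) can be sketched in a few lines; this is an illustrative reduction with placeholder names and toy values, not the thesis implementation:

```python
import numpy as np

def combine_saliency(feature_maps, within_gains, across_gains):
    """feature_maps: dict dimension -> list of 2-D bottom-up maps s_ij(A).
    within_gains:  dict dimension -> list of gains g_ij (one per map).
    across_gains:  dict dimension -> scalar gain g_j.
    Returns the overall saliency map S(A)."""
    overall = 0.0
    for dim, maps in feature_maps.items():
        # Within-dimension combination: S_j(A) = sum_i g_ij * s_ij(A)
        s_dim = sum(g * m for g, m in zip(within_gains[dim], maps))
        # Across-dimension combination: S(A) = sum_j g_j * S_j(A)
        overall = overall + across_gains[dim] * s_dim
    return overall

# Toy usage with two dimensions (color, orientation), two feature maps each.
maps = {"color": [np.random.rand(4, 4) for _ in range(2)],
        "orientation": [np.random.rand(4, 4) for _ in range(2)]}
g_within = {"color": [1.5, 0.5], "orientation": [1.0, 1.0]}   # hypothetical gains
g_across = {"color": 1.2, "orientation": 0.8}
S = combine_saliency(maps, g_within, g_across)
```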
Chapter 6 – Summary: Chapter 6 provides a summary of this thesis. It highlights the gen-
eral as well as specific contributions of this thesis, and identifies important open questions for
future research. In summary, this thesis presents a wide perspective on integrating bottom-up
and top-down attention, that ranges from a systems level engineering design of a large-scale
model of attention and scene understanding to a model of optimal gain modulation at the sin-
gle unit level during visual search that simultaneously accounts for behavioral performance
as well.
Chapter 2
A Large-scale Model of Attention and Scene Understanding
2.1 Introduction
Visual attention is guided by a complex interplay of several factors including bottom-up
salience, top-down relevance or knowledge of the task at hand, the target and distractor ob-
jects, relevant spatial locations, gist, layout of a scene, and others. Figure 2.1 shows an overview
of the current understanding of the role of these different factors in guiding visual attention.
A detailed review of relevant literature can be found in Navalpakkam and Itti [107] and in
section 2.2. In this chapter, we address this challenge and some critical issues regarding the
coordination and interfacing of the diverse components. We propose and partially implement
a large-scale visual attentional system that attempts to integrate the above components. An
overview of our approach is given below:
• Given a task definition in the form of keywords, the model first determines and stores
the task-relevant entities in working memory, using prior knowledge stored in long-
term memory.
[Figure 2.1 appears here as a schematic with the following components: input visual scene; low-level feature detectors; bottom-up salience (conspicuous scene locations); attentional selection; gist (e.g., outdoors beach scene); layout (e.g., 1 = grass, 2 = beach, 3 = sea, 4 = sky).]
Figure 2.1: Overview of current understanding of how task influences visual attention: Given a task
such as “find humans in the scene”, prior knowledge of the target’s features is known to influence low-
level feature extraction by priming the desired features. These low-level features are used to compute
the gist and layout of the scene as well as the bottom-up salience of scene locations. Finally, the
gist, layout and bottom-up salience map are somehow combined with the task and prior knowledge to
guide attention to likely target locations. The present study attempts to cast this fairly vague overview
model into a more precise computational framework that can be tested against real visual inputs.
• It attempts to detect the most relevant entity by biasing its visual attention system with
the entity’s learned low-level features.
• It attends to the most salient location in the scene, and attempts to recognize the at-
tended object through hierarchical matching against object representations stored in
long-term memory.
• It updates its working memory with the task-relevance of the recognized entity and
updates a topographic task-relevance map with the location and relevance of the rec-
ognized entity.
Thus, we have focused on four outstanding issues: determining task-relevance of scene enti-
ties, top-down biasing, recognizing and memorizing. We discuss these in greater detail in the
remaining sections of this chapter.
2.2 Related work
Theories on scene representation: There is an interesting diversity in the range of hypothet-
ical internal scene representations, including the world as an outside memory hypothesis that
claims no photographic memory for visual information [114], the coherence theory according
to which only one spatio-temporal structure or coherent object can be represented at a time
[127], a limited memory of three or four objects in visual short-term memory [62, 63], and fi-
nally, memory for many more previously attended objects in visual short-term and long-term
memory [55, 56, 54]. Together with studies in change detection [74, 129, 127, 128, 170],
this suggests that internal scene representations do not contain complete knowledge of the
scene. To summarize, instead of attempting to segment, identify, represent and maintain de-
tailed memory of all objects in a scene, there is mounting evidence that our brain may adopt
a need-based approach [160], where only desired objects are quickly detected in the scene,
identified and represented. In this chapter, we adopt a similar need-based approach
to determine the task-relevant or desired objects, bias the attentional system to quickly detect
them in the scene, identify and memorize them. In the rest of this section, we review relevant
literature pertaining to these factors that guide visual attention.
Models of bottom-up visual attention: To begin with, visual attention has often been com-
pared to a virtual spotlight through which our brain sees the world [173]. Attention has
been classified into several types based on whether or not it involves eye movements (overt
vs. covert attention), and whether its deployment over a scene is primarily guided by scene
features or volition (bottom-up vs. top-down attention) (for review, see [67]). One of the first
biologically plausible architectures for controlling bottom-up attention was proposed by Koch
and Ullman [80]. In their model, several feature maps (such as color, orientation, intensity)
are computed in parallel across the visual field [156], and combined into a single salience
map. Then, a selection process sequentially deploys attention to locations in decreasing or-
der of their salience. We enhance this architecture by modeling the influence of task on
attention.
Task modulates early visual processing: The influence of task on visual attention can be
observed even in the early stages of visual processing, where task modulates neural activ-
ity by enhancing the responses of neurons tuned to the location and features of a stimulus
[98, 99, 101, 100, 183, 46, 158, 19]. For example, area MT+ is more active during a speed
discrimination task whereas area V1 shows increased activation during a contrast discrimi-
nation task [58]. In addition, psychophysics experiments have shown that knowledge of the
target contributes to an amplification of its salience, e.g., white vertical lines become more
salient if we are looking for them [13]. A recent study even shows that better knowledge of
the target leads to faster search, e.g., seeing an exact picture of the target is better than seeing
a picture of the same semantic type or category as the target [79]. These studies demonstrate
the effects of biasing for features of the target. Other experiments (e.g., [156]) have shown
that searches for feature conjunctions (e.g., color×orientation conjunction search: find a
red-vertical item among red-horizontal and green-vertical items) are slower than “pop-out”
(e.g., find a green item among red items). These observations impose constraints on the pos-
sible biasing mechanisms and eliminate the possibility of generating new composite features
on the fly (as a combination of simple features).
Guided Search model for predicting visual search behavior: Of the several well known
top-down feature biasing models, Guided Search [177] is very popular and accounts for vi-
sual search behavior. It has the same basic architecture as proposed by Koch and Ullman
[80], but in addition, it achieves feature-based biasing by weighing feature maps in a top-
down manner. For example, with the task of detecting a red bar, the red-sensitive feature map
gains more weight, hence making the red bar more salient. However, it is not clear how the
weights are chosen in that model. In our model, we learn a vector of feature weights (one
weight per feature) from images containing the target (see section 2.4.1). Further, we use the
same feature vectors for attentional biasing, short-term memory representation, and object
recognition. Thus, our model differs from Guided Search in that we learn internal target rep-
resentations from images, and use these learned representations for top-down biasing. Our
choice for target representation is influenced by the following three factors.
Choice of features for object representation: First, experiments have revealed several pre-
attentive features, including orientation [72, 180, 34, 150], size [156], closure [41, 161], color
(hue) [103, 104, 5, 90, 40], intensity [6, 157, 86], flicker [71], direction of motion [106, 35].
In our current implementation, we use orientation, color and intensity. Second, while within-
feature conjunctions are considered inefficient, color×color and size×size conjunctions are
efficient in a part-whole setup (e.g., find a red house with yellow windows among red houses
with blue windows and blue houses with yellow windows) [12]. Low-level visual neurons
with center-surround receptive fields and color opponency can help support such observa-
tions. If we represent the target in terms of center-surround features, information about the
part can be obtained from the center, and information about the whole can be obtained from
the surround. Besides, using center-surround features can make the system more robust to
changes in absolute feature values that are typically associated with changing viewing condi-
tions. This motivates us to represent the target by a vector of center-surround feature weights.
Third, maintaining a pyramid of feature maps at different spatial scales is known to provide a
compact image code [20]. Hence, we are motivated to maintain feature responses at multiple
spatial scales.
Computation of saliency maps from feature maps: In summary, our current implementa-
tion uses seven center-surround feature types: on/off image intensity contrast, red/green and
blue/yellow double-opponent channels, and four local orientation contrasts (for implementa-
tion details, please see previous papers [66]). We compute the feature maps at six different
pairs of center and surround spatial scales [66], yielding 42 feature maps in all. Non-linear
interactions and spatial competition occur in each of these feature maps (see section 2.4 in
[69]) before the maps are linearly combined into a salience map. This is a very important
(though often overlooked) aspect of our previously proposed bottom-up attention model, also
used here in the new model. The operational definition of salience implemented in this model
is such that a feature map which is active at many locations is not considered a strong driver
of attention (since one would not know to which of the active locations attention should be di-
rected), while a feature map active at only one location is a strong driver. This is implemented
in the bottom-up model [70, 66] as non-classical surround inhibition within each feature map,
whereby neighboring active locations cancel each other out, while a unique active location
would not be affected (or even is amplified in our model). Finally, in order to find the focus
of attention, we deploy a Winner-Take-All (WTA) spatial competition in the salience map
that selects the most salient location [70].
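To make the pipeline concrete, the following is a minimal Python/NumPy sketch of this bottom-up computation for a single intensity channel. It is not the actual implementation of the model; in particular, map_normalize() is only a crude stand-in for the non-classical surround inhibition described above.

# Minimal sketch of the bottom-up salience pipeline described above (intensity
# channel only, simplified normalization). The map_normalize() step is a crude
# stand-in for the non-classical surround inhibition used in the actual model.
import numpy as np
from scipy.ndimage import gaussian_filter

def center_surround(img, c_sigma, s_sigma):
    """Center-surround contrast: difference of Gaussian-blurred versions of the image."""
    return np.abs(gaussian_filter(img, c_sigma) - gaussian_filter(img, s_sigma))

def map_normalize(m, eps=1e-9):
    """Promote maps with one strong peak, demote maps with many comparable peaks."""
    m = (m - m.min()) / (m.max() - m.min() + eps)
    return m * (m.max() - m.mean()) ** 2   # crude proxy for (max - mean of local maxima)^2

def bottom_up_salience(img, scales=((1, 4), (2, 8), (3, 12))):
    """Sum normalized center-surround maps over several (center, surround) scale pairs."""
    feature_maps = [map_normalize(center_surround(img, c, s)) for c, s in scales]
    return sum(feature_maps)

def winner_take_all(salience):
    """Return the (row, col) of the most salient location."""
    return np.unravel_index(np.argmax(salience), salience.shape)

if __name__ == "__main__":
    scene = np.zeros((128, 128))
    scene[60:68, 60:68] = 1.0              # a single conspicuous patch pops out
    sm = bottom_up_salience(scene)
    print("attended location:", winner_take_all(sm))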
Object recognition: Having selected the focus of attention, it is important to recognize the
entity at that scene location. Many recognition models have been proposed that can be classi-
fied based on factors including the choice of basic primitives (e.g., Gabor jets [176], geomet-
ric primitives like geons [10], image patches or blobs [172], and view-tuned units [133]), the
process of matching (e.g., self organizing dynamic link matching [84], probabilistic match-
ing [172]), and other factors (for reviews, see [134, 1]). In this chapter, we explore how the
pre-attentive features used to guide attention may be re-used for object representation and
recognition. Since we represent the target as a single feature vector, we do not handle com-
plex or composite objects in the current model.
Memorization: Recognition is followed by the problem of memorization of visual informa-
tion. A popular theory, the object file theory of trans-saccadic memory [60, 61, 62], posits
that when attention is directed to an object, the visual features and location information are
bound into an object file [73] that is maintained in visual short term memory across saccades.
Psychophysics experiments have further shown that up to three or four object files may be
retained in memory [146, 120, 60, 89, 63]. Studies investigating the neural substrates of
working memory in primates and humans suggest that the frontal and extrastriate cortices
may both be functionally and anatomically separated into a “what” memory for storing the
visual features of the stimuli, and a “where” memory for storing spatial information [175, 29].
Based on the above, in our model, we memorize the visual representation of the currently at-
tended object by storing its visual features in the visual working memory. In addition, we
store symbolic knowledge such as the logical properties of the currently attended object and
its relationship with other objects, in the symbolic working memory with help from the sym-
bolic long-term memory. To memorize the location of objects, we extend the earlier hypothe-
sis of a salience map [80] to propose a two-dimensional topographic task-relevance map that
encodes the task-relevance of scene entities. Our motivation for maintaining various maps
stems from biological evidence. Single-unit recordings in the visual system of the macaque
indicate the existence of a number of distinct maps of the visual environment that appear to
encode the salience and/or the behavioral significance of targets. Such maps have been found
in the superior colliculus, the inferior and lateral subdivisions of the pulvinar, the frontal-eye
fields and areas within the intraparietal sulcus [27, 45, 148, 83]. Since these neurons are
found in different parts of the brain that specialize in different functions, we hypothesize that
they may encode different types of salience: the posterior parietal cortex may encode a vi-
sual salience map, while the pre-frontal cortex may encode a top-down task-relevance map,
and the final eye movements may be generated by integrating information across the visual
salience map and task-relevance map to form an attention guidance map possibly stored in
the superior colliculus (figure 2.2).
Non-attentional pathways like the gist and layout: Our analysis so far has focused on the
attentional pathway. As shown in figure 2.1, non-attentional pathways also play an important
role; in particular, rapid identification of the gist (semantic category) of a scene is very useful
in determining scene context, and is known to guide eye movements [116, 11, 30, 26, 50,
127, 152]. It is computed rapidly within the first 150ms of scene onset [149], and the neural
correlate of this computation is still unknown. Recently, Oliva and Torralba [112] proposed
a holistic representation of the scene based on spatial envelope properties (such as openness,
naturalness etc.) that bypasses the analysis of component objects and represents the scene as
a single identity. This approach formalizes the gist as a vector of contextual features [152].
By processing several annotated scenes, these authors learned the relationship between the
scene context and categories of objects that can occur, including object properties such as
locations, size or scale, and used it to focus attention on likely target locations [151, 152].
This provides a good starting point for modeling the role of gist in guiding attention. Since
the gist is computed rapidly, it can serve as an initial guide to attention. But subsequently,
our proposed TRM that is continuously updated may serve as a better guide. For instance, in
dynamic scenes such as traffic scenes where the environment is continuously changing and
the targets such as cars and pedestrians are moving around, the gist may remain unchanged
and hence, it may not be so useful, except as an initial guide.

[Figure 2.2 schematic: visual information flows from the visual cortex to the inferotemporal cortex, the posterior parietal cortex (bottom-up saliency map), the prefrontal cortex (top-down task-relevance map), and memory and cognition; the two maps combine into an attention guidance map in the superior colliculus, which drives eye movements through motor and other systems.]
Figure 2.2: Hypothesis: We hypothesize the existence of different kinds of salience maps that encode
different kinds of information about the scene. In particular, we hypothesize that the posterior parietal
cortex may encode a visual salience map, the pre-frontal cortex may encode a top-down task-relevance
map, and the superior colliculus may store an attention guidance map that guides the focus of attention.
Knowledge-based approaches for modeling eye movements: The use of gist in guiding
attention to likely target locations motivates knowledge-based approaches to modeling eye
movements, in contrast to image-based approaches. One such famous approach is the scan-
path theory which proposes that attention is mostly guided in a top-down manner based on
an internal model of the scene [110]. Computer vision models have employed a similar ap-
proach to recognize objects. For example, Rybak et al. [138] recognize objects by explicitly
replaying a sequence of eye movements and matching the expected features at each fixation
with the image features. In the present study, we focus on bottom-up guidance of attention
and its top-down biasing, but we do not model such knowledge-based directed eye move-
ments.
A model using iconic scene representations to guide eye movements: An interesting
model for predicting eye movements during a search and copying task has been proposed by
Rao et al. [126]. These authors use iconic scene representations to predict eye movements
during visual search. They compute salience at a given location as the squared Euclidean
distance between a feature vector containing responses of a bank of filters at that location,
and the memorized vector of target responses. They validate their model against human data
obtained in a search task and copying task and demonstrate some interesting center of gravity
effects. This model is very interesting in that it suggests a highly efficient mechanism by
which salience could be biased for the detection of a known target. However, this approach
suffers from two shortcomings addressed by our model. First, since salience is computed as
the distance between observed and target features, this model does not provide a mechanism
by which attention could be directed in a purely bottom-up manner, when no specific target is
being looked for. Hence, this model cannot reproduce simple pop-out, where a single vertical
bar is immediately found by human observers within an array of horizontal bars, even in cases
where observers had no prior knowledge of what to look for. Second, when target features are
known, we will see in our Results section (2.6.2) that such a template-based approach would
predict that conjunction searches [156] should be as efficient as pop-out searches, which dif-
fers from empirical observations in humans. The biasing mechanism proposed in our model
is less efficient but in better agreement with human data (see Results).
Real-time scene analysis: To summarize, we have motivated the components of our model
which we believe are crucial for scene understanding. Ours is certainly not the first attempt
to address this problem. For example, one of the finest examples of real-time scene analysis
systems is The Visual Translator (VITRA) [51], a computer vision system that generates real-
time verbal commentaries while watching a televised soccer game. Their low-level visual
system recognizes and tracks all visible objects from an overhead (bird’s eye) camera view,
and creates a geometric representation of the perceived scene (the 22 players, the field and
the goal locations). This intermediate representation is then analyzed by a series of Bayesian
belief networks which evaluate spatial relations, recognize interesting motion events, and
incrementally recognize plans and intentions. The model includes an abstract, non-visual
notion of salience which characterizes each recognized event on the basis of recency, fre-
quency, complexity, importance for the game, and other factors. The system finally generates
a verbal commentary, which typically starts as soon as the beginning of an event has been
recognized but may be interjected if highly salient events occur before the current sentence
has been completed. While this system delivers very impressive results in the specific appli-
cation domain considered, due to its computational complexity it is restricted to one highly
structured environment and one specific task, and cannot be extended to a general scene un-
derstanding model. Indeed, unlike humans who selectively perceive the relevant objects in
the scene, VITRA attends to and continuously monitors all objects and attempts to simul-
taneously recognize all known actions. Our approach differs from VITRA in that nothing in
our model commits it to a specific environment or task; in addition, we only memorize those
objects and events that we expect to be relevant to the task at hand, thus saving enormously
on computational complexity.
2.3 An overview of our architecture
Our architecture may be understood in four phases, a summary of which is provided below
(figure 2.3).
Phase 1, Eyes Closed: In the first phase known as the “eyes closed” phase, the symbolic
working memory (WM) is initialized by the user with a task definition in the form of key-
words and their relevance (any number greater than baseline 1.0). Given the relevant key-
words in symbolic WM, volitional effects such as “look at the center of the scene” could
be achieved by allowing the symbolic WM to bias the TRM so that the center of the scene
becomes relevant and everything else is irrelevant (but our current implementation has not ex-
plored this yet). For more complex tasks such as “who is doing what to whom,” the symbolic
WM requires prior knowledge and hence, seeks the aid of the symbolic long-term memory
(LTM). For example, to find what the man in the scene is eating, prior knowledge about eating
being a mouth and hand-related action, and being related to food items helps us guide atten-
tion towards mouth or hand and determine the food item. Using such prior knowledge, the
symbolic WM parses the task and determines the task-relevant targets and how they are re-
lated to each other. Our implementation explores this mechanism using a simple hand-coded
symbolic knowledge base to describe long-term knowledge about objects, actors and actions
(Section 2.3.1). Next, it determines the current most task-relevant target as the desired target
(Section 2.3.2). To detect the desired target in the scene, the visual WM retrieves the learned
visual representation of the target from the visual LTM and biases the low-level visual system
with the target’s features (Section 2.4).

[Figure 2.3 schematic: four panels show the same model during Phase 1 (Eyes closed), Phase 2 (Computing), Phase 3 (Attending) and Phase 4 (Updating); each panel depicts the input visual scene, the attention guidance map, attentional selection, the selection of the next attention target, and the selection and recognition of the attended object.]
Figure 2.3: Phase 1 (top left): Eyes closed, Phase 2 (top right): Computing, Phase 3 (bottom left):
Attending, Phase 4 (bottom right): Updating. Please refer to section 2.2 for details about each phase.
All four panels represent the same model; however, to enable easy comparison of the different phases,
we have highlighted the components that are active in each phase and faded those that are inactive.
Dashed lines indicate parts that have not been implemented yet. Following Rensink’s (2000)
terminology, volatile processing stages refer to those which are under constant flux and regenerate as
the input changes.
Phase 2, Computing: In the second phase known as the “computing” phase, the eyes are
open and the visual system receives the input scene. The low-level visual system that is bi-
ased by the target’s features computes the biased salience map (Section 2.4). Apart from
such feature-based attention, spatial attention may be used to focus on likely target locations,
e.g., gist and layout may be used to bias the TRM to focus on relevant locations (but this is
not implemented yet). Since we are interested in attending to locations that are salient and
relevant, the biased salience and task-relevance maps are combined by taking a pointwise
product to form the attention-guidance map (AGM). To select the focus of attention, we de-
ploy a Winner-take-all competition that chooses the most active location in the AGM [70]. It
is important to note that there is no intelligence in this selection and all the intelligence of the
model lies in the WM.
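As a concrete illustration of this step, the following minimal sketch (array names and sizes are illustrative assumptions) forms the AGM as a pointwise product and selects the winner:

# Minimal sketch of Phase 2: combine the biased salience map and the
# task-relevance map (TRM) pointwise into an attention guidance map (AGM),
# then pick the winner. Array names and shapes here are illustrative.
import numpy as np

def attention_guidance_map(biased_sm: np.ndarray, trm: np.ndarray) -> np.ndarray:
    return biased_sm * trm                               # pointwise product

def select_focus(agm: np.ndarray):
    return np.unravel_index(np.argmax(agm), agm.shape)   # Winner-take-all

biased_sm = np.random.rand(60, 80)
trm = np.ones((60, 80))                                  # baseline relevance of 1.0 everywhere
print(select_focus(attention_guidance_map(biased_sm, trm)))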
Phase 3, Attending: In the third phase known as the “attending” phase, the low-level fea-
tures or prototype objects are bound into a mid-level representation (in our implementation,
this step simply extracts a vector of visual features at the attended location). The object
recognition module determines the identity of the entity at the currently attended location
(Section 2.5), and the symbolic WM estimates the task-relevance of the recognized entity
(Section 2.3).
Phase 4, Updating: In the final phase known as the “updating” phase, the WM updates
its state (e.g., records that it has found the man’s hand). It updates the TRM by recording
the relevance of the currently attended location (Section 2.3). The estimated relevance may
influence attention in several ways. For instance, it may affect the duration of fixation (not
implemented). If the relevance of the entity is less than the baseline 1.0, it is marked as ir-
relevant in the TRM, and hence will be ignored by preventing future fixations on it (e.g., a
chair is irrelevant when we are trying to find what the man is eating. Hence, if we see a chair,
we ignore it). If it is somewhat relevant (e.g., man’s eyes), it may be used to guide attention
to a more relevant target by means of directed attention shifts (e.g., look down to find the
man’s mouth or hand; not implemented). Also if it is relevant (e.g., man’s hand), a detailed
representation of the scene entity may be created for further scrutiny (e.g., a spatio-temporal
structure for tracking the hand; not implemented). The WM also inhibits the current focus
of attention from continuously demanding attention (inhibition of return in SM). Then, the
symbolic WM determines the next most task-relevant target, and the visual WM retrieves the
target’s learned visual representation from visual LTM, and uses it to bias the low-level visual
system.
Termination: This completes one iteration. The computing, attending and updating phases
repeat until the task is complete. Upon completion, the TRM shows all task-relevant loca-
tions and the symbolic WM contains all task-relevant targets.
Implemented modules: As mentioned earlier, our focus in this chapter is on determining
task-relevance, biasing, recognizing, and memorizing. Accordingly, we have designed sym-
bolic LTM and WM modules for estimating task-relevance (sections 2.3.1, 2.3.2) and also
for computing and learning task-relevant locations in a TRM (sections 2.3.2 and 2.6.6); vi-
sual WM and LTM modules for learning object representations (section 2.4.1), reusing the
learned target representations to compute the biased saliency map for object detection (see
section 2.4.2), and matching against learned representations for object recognition (see sec-
tion 2.5). Implementation of other components (such as gist, layout, object trackers) and their
interactions is still under progress and we do not include their details in this document.
2.4 Estimating the task-relevance of scene entities
In this section, we propose a computational framework for estimating the task-relevance of
scene locations. This is essentially a top-down process requiring prior knowledge about the
world and some semantic processing. Hence, we recruit symbolic LTM and WM modules.
Our current architecture is based on research in artificial intelligence and knowledge repre-
sentation [16] and is not biological.
2.4.1 Symbolic long-term memory (LTM)
The Symbolic LTM acts as a knowledge base. Our implementation of the LTM is given be-
low:
LTM as an ontology: LTM contains entities and their relationships. For consistency with
the vocabulary used in knowledge representation research, we refer to it as an ontology from
now on. We currently address tasks such as “who is doing what to whom” and accept task
specifications in the form of object, subject and action keywords. Hence, we maintain object,
subject and action ontologies. Each ontology is represented as a graph with entities as vertices
and their relationships as edges. Our entities include real-world concepts as well as abstract
ones. In our current implementation, we consider simple relationships such as is a, includes,
part of, contains, similar, and related. The following examples motivate the need to store
more information in the edges.
Granularity of a relationship: Consider the case when we want to find a hand. Suppose we
find a finger (hand contains finger) and a man (hand is part of man), how should we determine
which of them is more relevant? Clearly, the finger is more relevant than the man because if
the finger is found, it implies that the hand has been found. However, if the man is found,
we still require a few eye movements before finding the hand within the man. To incorporate
this, we create a partial order on the set of relationships by ranking them according to the
priority or granularity of the relationship, g(r(u,v)), where r(u,v) is the relationship between
entities u and v. In general,

g(“contains”) > g(“part of”)
g(“is a”) > g(“includes”)
g(“related”) > g(“similar”)
[Figure 2.4 schematic: sample object, subject and action ontologies with abstract and real-world entities connected by is a, includes, part of, contains, similar and related edges; edges carry hand-picked co-occurrence values (e.g., 0.6-1.0), and truth tables encode conjunctions such as “Ship is a Transport RO AND Water RO” and “Bath RO is a Water RO AND Human RO”.]
Figure 2.4: Sample ontology, as used to represent long-term knowledge in our model: The relations
include is a, includes, part of, contains, similar, related. While the first five relations appear as edges
within a given ontology, the related relation appears as edges that connect the three different ontolo-
gies. The relations contains and part of are complementary to each other as in Ship contains Mast,
Mast is part of Ship. Similarly, is a and includes are complementary. Hand-picked co-occurrence
measures are shown on each edge and the conjunctions, disjunctions are shown using the truth tables.
Co-occurrence of scene entities: Let’s consider another case where we still want to find the
hand, but we find a pen and a leaf instead, and wish to estimate their relevance. This situation
is unlike the previous one since both entities are hand-related objects and hence, share the
same relationship with the hand. Yet, we consider the pen to be more relevant than the leaf
because in our daily lives, the hand holds a pen more often than it holds a leaf (unless we
are considering gardeners!). Thus, the probability of joint occurrence of entities seems to be
an important factor in determining relevance. Hence, we store the co-occurrence of entities,
c(u,v).
Complex relationships and properties: Apart from storing information in the edges, we
also store information in the nodes. Each node maintains a list of properties in addition to
the list of all its neighbors. To represent conjunctions and disjunctions or other complicated
relationships, we maintain truth tables that describe the probabilities of various combinations
of parent entities. An example is shown in figure 2.4.
Learnability of ontology: Currently, our ontology is not learnable. For the purposes of test-
ing the model, we have hand-coded the ontology with hand-picked values of co-occurrence
and granularity.
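For illustration only, a hand-coded ontology of this kind could be stored as follows; the entities, granularity ranks and co-occurrence values below are assumptions, not the actual knowledge base used in the model.

# Minimal sketch of a hand-coded ontology: entities as nodes, typed edges carrying
# a granularity rank and a co-occurrence value. Names and numbers are illustrative.
from dataclasses import dataclass, field

# Hand-picked partial order on relationship granularity (higher = finer grained).
GRANULARITY = {"contains": 1.0, "part of": 0.8, "is a": 0.9,
               "includes": 0.7, "related": 0.6, "similar": 0.5}

@dataclass
class Edge:
    relation: str          # e.g., "contains", "part of", "related"
    cooccurrence: float    # hand-picked c(u, v)

@dataclass
class Ontology:
    edges: dict = field(default_factory=dict)   # (u, v) -> Edge

    def add(self, u, v, relation, cooccurrence):
        self.edges[(u, v)] = Edge(relation, cooccurrence)

    def neighbors(self, u):
        return [(v, e) for (a, v), e in self.edges.items() if a == u]

onto = Ontology()
onto.add("hand", "finger", "contains", 0.9)
onto.add("hand", "man", "part of", 0.9)
onto.add("hand", "pen", "related", 0.8)
onto.add("hand", "leaf", "related", 0.1)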
2.4.2 Symbolic Working memory (WM)
In this section, we propose how the symbolic WM may estimate the task-relevance of scene
entities with the help of the knowledge base stored in the LTM.
Symbolic WM consults LTM to determine task-relevance of entities: The symbolic WM
creates and maintains task graphs for objects, subjects and actions that contain task-relevant
entities and their relationships. After the entity at the current fixation (fixated entity) is rec-
ognized, symbolic WM estimates its task-relevance as follows. First, it checks whether the
fixated entity is already present in the task graph, in which case, a simple lookup gives the
relevance of the fixated entity. If it fails to find the fixated entity in its task graph, then it seeks
the help of the symbolic LTM in the following way: the symbolic WM requests the symbolic LTM to
check whether there exists a path in the ontology from the fixated entity to any of the entities
in the task graph. If so, the nature of the path reveals how the fixation is related to the current
task graph. If no such path exists, the fixated entity is declared to be irrelevant to the task. In
the case of the object task graph, an extra check is performed to ensure that the properties of
the fixated entity are consistent with the object task graph (see figure 2.5 for examples). If the
tests succeed and the fixated entity is determined to be relevant, the symbolic LTM returns
the discovered paths (from the fixated entity to the entities in the task graph) to the symbolic
WM.
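A minimal sketch of this check, assuming the Ontology structure sketched earlier, is given below; for simplicity it searches along the stored edge direction only, and the property-conflict test is a placeholder predicate.

# Sketch of the relevance check the symbolic WM requests from the symbolic LTM:
# is there a path in the ontology from the fixated entity to any entity already in
# the task graph (and, for objects, no property conflict)? Uses the Ontology sketch
# above; for simplicity, the search follows stored edge direction only.
from collections import deque

def find_path(onto, start, goals, max_depth=4):
    """Breadth-first search for a path from `start` to any entity in `goals`."""
    queue = deque([(start, [start])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node in goals:
            return path
        if len(path) > max_depth:
            continue
        for neighbor, _edge in onto.neighbors(node):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [neighbor]))
    return None                       # no path: the fixated entity is irrelevant

def is_relevant(onto, fixated, task_graph_entities, no_property_conflict=lambda e: True):
    path = find_path(onto, fixated, set(task_graph_entities))
    return (path is not None and no_property_conflict(fixated)), path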
Estimation of relevance of an entity: The symbolic WM computes the relevance of the
fixated entity as a function of the relevance of its neighboring entities (in the task graph)
and the nature of its connecting relations.

[Figure 2.5 schematic: two examples of adding an entity to the task graph (with OBJECT, SUBJECT and ACTION nodes); in one case the candidate entity has a property conflict and is rejected, in the other there is no property conflict and it is included.]
Figure 2.5: Estimating task-relevance: To estimate the relevance of an entity, we check the existence
of a path from the entity to the task graph and check for property conflicts. To find “what is the man
catching”, we are looking for a hand related object that is small and holdable, hence a big object like
car is considered irrelevant; whereas a small object like pen is considered relevant.

Let’s consider the influence of entity u (whose relevance is known) on entity v (whose
relevance is to be computed). This depends on:

• R_u, the relevance of entity u;
• g(r(u,v)), the granularity of the relationship r(u,v);
• the conditional probability P(v is relevant | u is relevant). For the purposes of visual scene analysis, v is considered to be relevant if it helps us find u. Hence, the conditional probability can be estimated from previous experience as P(u will be found | v is found), or P(u occurs | v occurs). This is the same as c(u,v)/P(v), where c(u,v) is the co-occurrence of u and v.

To model the decaying influence with increasing path length (i.e., the number of edges in the path between the two entities), we introduce DF, a decay factor that lies between 0 and 1. Thus we arrive at the following expression for computing the relevance of entity v:

R_v = max_{u : (u,v) is an edge} [ R_u × g(r(u,v)) × c(u,v)/P(v) × DF ]        (2.1)
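For illustration, equation 2.1 could be evaluated as follows, reusing the Ontology and GRANULARITY sketches above; the decay factor and the prior probabilities P(v) are assumed values, not the thesis’ parameters.

# Sketch of equation 2.1: propagate relevance from known task-graph entities to a
# newly recognized entity. Assumes the Ontology and GRANULARITY sketches above;
# DF and P(v) values are illustrative.
DF = 0.9                                    # decay factor, 0 < DF < 1

def relevance(v, known_relevance, onto, prior, df=DF):
    """R_v = max over edges (u, v) of R_u * g(r(u,v)) * c(u,v)/P(v) * DF."""
    candidates = []
    for (u, vv), edge in onto.edges.items():
        if vv != v or u not in known_relevance:
            continue
        g = GRANULARITY[edge.relation]
        candidates.append(known_relevance[u] * g * edge.cooccurrence / prior[v] * df)
    return max(candidates, default=0.0)     # 0.0 => irrelevant (below baseline 1.0)

# Example: relevance of "pen" vs. "leaf" when "hand" is relevant (R_hand = 2.0).
known = {"hand": 2.0}
prior = {"pen": 0.5, "leaf": 0.5}
print(relevance("pen", known, onto, prior), relevance("leaf", known, onto, prior))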
Creation of the initial task graph: The relevance of a new entity depends on the task-
relevant entities already present in the task graph. Hence, creation of the initial task graph is
important. In our implementation, the initial task graph consists of task keywords and their
relevance as input by the user. For instance, given a task specification such as “what is the
man catching”, the user inputs “man” as the subject keyword and “catch” as the action key-
word, along with their relevances (any number greater than baseline 1.0). After adding these
keywords to the task graph, we further expand the task graph through the is a relations. Our
new task graph contains “man is a human”, “catch is a hand-related action”. As a general
rule, upon addition of a new entity into the task graph, we expand it to the related entities
(entities connected through the related relation). In this example, we expand the initial task
graph to “hand-related action is related to hand and hand-related object”. Thus even before
the first fixation, we know that we are looking for a hand-related object, i.e., we have a prior
expectation of which entities are relevant. Such expansion of the task into task-relevant tar-
gets allows the model to compute the relevance of fixated entities in the manner explained
above. For example, if the fixation is recognized as an object belonging to the car category,
then it is determined to be irrelevant as it is not a hand-related object (figure 2.5).
Summary: To summarize, our proposed architecture expands a given task into task-relevant
entities and determines the task-relevance of scene entities. Once the task-relevant entities or
targets are known, the next step is to efficiently detect them in the scene.
2.5 Top-down biasing for object detection
From the elementary information available at the pre-attentive stage in the form of low-level
feature maps tuned to color, intensity and orientation, our model learns representations of
objects in diverse backgrounds. The representation starts with simple vectors of low-level
feature values computed at different locations on the object, called views. We then recursively
combine these views to form instances, in turn combined into simple objects, composite
objects, and so on, taking into account feature values and their variance. Given any new scene,
our model uses the learned representation of the target object to perform top-down biasing
on the attentional system, in order to increase the salience of this object by enhancing its
characteristic features. The details of how our model learns and detects targets are explained
in the following subsections.
2.5.1 Learning the object representation
During the learning phase, the model operates in a free-viewing mode. That is, in the absence
of any task, there are no top-down effects, the TRM is uniform (baseline 1.0 everywhere), and
the AGM is the same as the salience map. Thus, in the absence of task, our model deploys
attention according to the bottom-up salience model [66].
Location cue for training purposes: To guide the model to the location of the target, we
use a binary target mask that serves as a location cue by highlighting the targets in the input
image. It should be noted that we do not use the target mask to segment the target from its
background. In fact, we attempt to learn not only the object properties, but also local neigh-
borhood properties. This is useful since in several cases, the object and its background may
co-occur and hence, the background information may aid in the detection of the object.
Learning views of objects: When the model attends to the target, a few locations are cho-
sen around that salient location (currently, the model chooses 9 locations from a 3x3 grid
of fixed size centered at the salient location). For each chosen location, the visual WM
learns the center-surround features at multiple spatial scales and stores them in the visual
LTM. The coarser scales include information about the background while the finer scales
contain information about the target. Specifically, a 42-component feature vector extracted
at a given location represents a view (red/green, blue/yellow, intensity and four orientations
at six center-surround scales). Thus, we obtain a collection of views contained in the current
instance of the target.
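A minimal sketch of this view extraction is given below; the stack of feature maps is assumed to be precomputed by the front end, and the grid spacing is an illustrative choice.

# Sketch of view extraction: read the 42-component center-surround feature vector
# (7 feature types x 6 center-surround scale pairs) at each of 9 locations on a
# 3x3 grid centered on the attended target. The feature_maps tensor is assumed to
# be precomputed by the bottom-up front end; the grid spacing is illustrative.
import numpy as np

def extract_views(feature_maps: np.ndarray, center, spacing=8):
    """feature_maps: array of shape (42, H, W); returns 9 views of shape (42,)."""
    cy, cx = center
    h, w = feature_maps.shape[1:]
    views = []
    for dy in (-spacing, 0, spacing):
        for dx in (-spacing, 0, spacing):
            y = np.clip(cy + dy, 0, h - 1)
            x = np.clip(cx + dx, 0, w - 1)
            views.append(feature_maps[:, y, x])
    return np.stack(views)                   # shape (9, 42)

feature_maps = np.random.rand(42, 120, 160)
print(extract_views(feature_maps, center=(60, 80)).shape)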
Combining multiple views into an instance: The visual WM combines the different views
obtained above to form a more stable, general representation of an instance of the object that
is robust to noise. It repeats this process by retrieving the stored instances from the visual
LTM and combining them to form a general representation of the object and so on. The
following rules are used for combination of several object classes (equally likely, mutually
exclusive) to form a general representation of the super-object class. Let X_i be the event that
the i-th object class occurs, where i ∈ {1, 2, ..., n}. Let Y be the event that the super-object class
occurs. We define Y as follows:

Y = ∪_i X_i        (2.2)

In other words, an observation is said to belong to the super-object class if and only if it
belongs to any of the object classes (e.g., an observation belongs to an object category if and
only if it belongs to any of the object instances). Let O be the random variable denoting an
observation and O = o be the event that the value o is observed. P(O = o | X_i) refers to the
class conditional density, i.e., the probability of observing O = o given that the i-th object
class has occurred. Let P(O = o | X_i) follow a normal distribution N(μ_i, Σ_i), where μ_i =
[μ_i1 μ_i2 .. μ_i42]^T, i.e., a vector of the mean feature values, and Σ_i is the covariance
matrix. Due to our assumption that the different features are independent, the covariance
matrix reduces to a diagonal matrix, whose diagonal entries equal the variance in feature
values, represented as σ_i² = [σ_i1² σ_i2² .. σ_i42²]^T. Our aim is to find the distribution of
O | Y. As shown in the appendix, we obtain the following:

P(O = o | Y) = Σ_i P(O = o | X_i) w_i        (2.3)
where w_i = P(X_i) / Σ_j P(X_j)        (2.4)
          = 1/n  (since the X_i are equally likely)        (2.5)

μ = E[O | Y]        (2.6)
  = Σ_i w_i μ_i        (2.7)

σ² = E[(O | Y)²] − (E[O | Y])²        (2.8)
   = Σ_i w_i (σ_i² + μ_i²) − μ²        (2.9)
In general, O|Y has a multi-modal distribution. But as a first approximation and to
achieve recursion in our implementation, we consider only up to the second moment and
approximate this multi-modal distribution by a normal distribution N(μ,σ).
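Equations 2.3-2.9 amount to moment matching, which can be sketched as follows for equally likely classes with diagonal covariances (the example values are made up):

# Sketch of equations 2.3-2.9: combine n equally likely Gaussian object classes
# (diagonal covariance, 42 features) into a single Gaussian approximation of the
# super-object class by matching the first two moments.
import numpy as np

def combine_classes(mus: np.ndarray, sigma2s: np.ndarray):
    """mus, sigma2s: arrays of shape (n, 42) with per-class means and variances.
    Returns (mu, sigma2) of the moment-matched Gaussian for the super-object class."""
    w = 1.0 / mus.shape[0]                                    # w_i = 1/n (eq. 2.5)
    mu = w * mus.sum(axis=0)                                  # eq. 2.7
    sigma2 = w * (sigma2s + mus ** 2).sum(axis=0) - mu ** 2   # eq. 2.9
    return mu, sigma2

# Example: combine two instances into one object representation.
mus = np.array([[1.0] * 42, [3.0] * 42])
sigma2s = np.array([[0.5] * 42, [0.5] * 42])
mu, sigma2 = combine_classes(mus, sigma2s)
print(mu[0], sigma2[0])   # mean 2.0; variance 1.5 = within-class 0.5 + between-class 1.0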
Summary: By processing several images containing different poses and sizes of an object,
the visual WM, along with the help of visual LTM, learns the representation of the views,
instances and combines them to form a representation of the object (figure 2.6).
[Figure 2.6 schematic: from the original image and its binary target mask, several views (μ, σ feature vectors) are extracted, combined into instances, and combined again into the object representation.]
Figure 2.6: Learning a general representation of an object: The model uses a binary target mask
(target is 1 and background is 0) to serve as a location cue. The model learns the views by extracting
the center-surround feature vectors at different spatial scales from a few locations within the target.
Next, it combines the views to form instances. The instances are in turn combined to form a general
representation of the object.
2.5.2 Object detection using the learned visual representation
To detect a specific target object in any scene, the visual WM uses the learned representation
stored in the visual LTM to determine the relevant features and accordingly bias the com-
bination of different feature maps to form the salience map. The details of the biasing are
provided in this section.
Determining relevance of a feature: A feature f is considered to be relevant and reliable if
its mean feature value is high and its feature variance is low. Hence, we determine the weight
by which this feature will contribute to the salience map (feature weight) as R(f):

R(f) = relevance of feature f = μ(f) / (1 + σ(f))

where μ(f) = mean response to feature f and σ²(f) = variance in response to feature f.
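In code, this weighting rule is a one-liner over the learned per-feature statistics (a sketch with made-up inputs):

# Sketch of the feature-weight rule above: a feature is weighted up when its
# learned mean response on the target is high and its variance is low.
import numpy as np

def feature_weights(mu: np.ndarray, sigma: np.ndarray) -> np.ndarray:
    """mu, sigma: learned per-feature mean and standard deviation (length 42)."""
    return mu / (1.0 + sigma)        # R(f) = mu(f) / (1 + sigma(f))

print(feature_weights(np.array([2.0, 2.0]), np.array([0.0, 3.0])))  # reliable vs. noisy feature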
Recursive determination of relevance: We compute several classes of features in several
visual processing channels (section 2) and create a channel hierarchy H as follows. H(0)
(leaves): the set of all features at different spatial scales; H(1): the set of subchannels formed
by combining features of different spatial scales and the same feature type; H(2): the set
of channels formed by combining subchannels of the same modality; ...; H(n): the salience map
(where n is the height of H). In order to promote the target in all the feature channels in
the channel hierarchy, each parent channel promotes itself proportionally to the maximum
feature weight of its children channels:

∀ p ∈ ∪_{k=0..n} H(k),    R(p) ∝ max_{c ∈ children(p)} R(c)

[Figure 2.7 schematic: the input image is processed by center-surround differences and normalization into feature maps (orientations, colors, intensity), which are linearly combined into conspicuity maps and then into the saliency map that drives attention (with inhibition of return); the learned target representation (μ, σ) supplies the weighting coefficients of the linear combinations.]
Figure 2.7: Top-down biasing model for object detection: To detect a specific target object in any
scene, we use the learned target representation to bias the linear combination of different feature maps
to form the salience map. In the salience map thus formed, all scene locations whose features are
similar to the target become more salient and are more likely to draw attention.
Example: For instance, if the target has a strong horizontal edge at some scale, then the
weight of the 0° subchannel increases, and so does the weight of the orientation channel.
Hence, those channels that are irrelevant for this target are weighted down and contribute
little to the salience map (e.g., for detecting a horizontal object, color is irrelevant and hence
the color channel’s weights are decreased).
Combining the weighted feature maps to form a saliency map: At each level of the chan-
nel hierarchy, weighted maps of the children channels (Map_c) are summed into a unique map
at the parent channel (Map_p), resulting in the salience map at the root of the hierarchy:

∀ p ∈ ∪_{k=0..n} H(k),    Map_p(x, y) = f( Σ_{c ∈ children(p)} R(c) · Map_c(x, y) )
where f refers to the spatial competition. For details regarding its implementation, please
see section 2.4 in [69]; as mentioned earlier, its role is to prune those feature maps where
many locations are strongly active (and hence none may be considered a stronger attractor of
attention than any other), while promoting maps where a single or a few locations are active
(and tend to pop-out). This aspect of the saliency model [70, 69] is also further discussed in
section 2.6.2 and figures 2.11 and 2.12.
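The following sketch illustrates this weighted combination at one level of the hierarchy; spatial_competition() is the same simplified stand-in used in the earlier bottom-up sketch, not the full interaction scheme of [69], and the weights shown are illustrative.

# Sketch of the biased combination: weight each child map by its learned relevance
# R(c), sum, and apply the spatial competition f() at every level of the hierarchy.
import numpy as np

def spatial_competition(m, eps=1e-9):
    m = (m - m.min()) / (m.max() - m.min() + eps)
    return m * (m.max() - m.mean()) ** 2           # favor maps with a unique peak

def combine(children_maps, children_weights):
    """Map_p = f( sum_c R(c) * Map_c )."""
    weighted = sum(w * m for w, m in zip(children_weights, children_maps))
    return spatial_competition(weighted)

# Example: bias a 'red/green' and a 'vertical' map more strongly than the rest.
maps = [np.random.rand(60, 80) for _ in range(4)]
weights = [1.8, 1.6, 0.3, 0.3]                     # R(c) learned from the target
biased_salience = combine(maps, weights)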
Guiding attention according to the biased saliency map: In the salience map thus formed
by biasing the combination of all feature maps, all scene locations whose local features are
similar to the target’s relevant features become more salient and likely to draw attention
(figure 2.7). The false positives at this stage can be removed at the recognition stage.
2.6 Using attention for Object recognition
Our current implementation for object recognition is aimed at re-using pre-attentive features
used to guide attention. Hence, we adopt the simplest approach and treat the object as a fea-
ture vector, with no explicit representation of spatial structure. While this imposes limitations
on the complexity of objects that our model can recognize, it is fast and may serve to prune
the search space, thus acting as a filter that may feed into more complex, slower recognition
systems.
Matching the currently attended object to the objects stored in LTM: To recognize an ob-
ject, our model attends to any location in the object, and extracts the center-surround feature
vector from that location. We try to recognize the entity at the current fixation by matching
the extracted feature vector (f) with those already learned (O = {o_1, o_2, .., o_n}) and stored
in the visual LTM. We use a maximum likelihood estimation technique to find the match
between f and O, i.e., find the object o_i that maximizes P(f | o_i). Let Match(f, k) denote
the set of nodes that provide a good match among all nodes from the root (level 0) to some
desired level k of specificity.

Figure 2.8: Architecture for object recognition: Our model recognizes the object at the attended
scene location by extracting a center-surround feature vector from that location and finding the best
match by comparing it against representations stored in the object hierarchy.
Progressive matching from coarse to fine object representations: We compute it progres-
sively in increasing levels of specificity by first finding Match(f, 0), then finding Match(f, 1),
and so on up to Match(f, k), i.e., by first comparing against general object represen-
tations and then comparing against more specific representations such as a particular object
or instance or view. At each level, we narrow our search space and improve the speed of
recognition by pruning those subtrees rooted at nodes that do not provide a good match, and
selectively expanding those nodes that provide a good match.
Using likelihood estimates to find a good match: We find a good match among a set of
nodes by comparing the likelihood estimates of the nodes, looking for a unique maximum that
is at least twice as high as the second maximum. If we find such a unique maximum, the corresponding
node provides a good match. Otherwise, in the presence of ambiguity, all nodes whose likelihood
estimates are greater than or equal to the mean likelihood estimate are considered to provide
a good match. Given Match(f, x), we find Match(f, x+1) as follows:
2.6.1 Case 1: |Match(f, x)| = 1: Unique match at level x
Depending on whether level x corresponds to the finest representation or not, we have the
following two conditions:
Level x corresponds to the finest object representation: If level x is the deepest level in
the object hierarchy, then we have successfully found the most specific representation that
matches the fixated entity. We output Match(f, x) and terminate our search.
Level x corresponds to a coarse object representation: Else, given that the general object
representation at level x provides a good match, we proceed deeper into the object hierarchy
to find a better match among more specific representations. We accomplish this by expanding
the matching node at level x into its children nodes at level x+1. If the parent node provides
a better match than the children nodes (e.g., a gray stimulus may match the gray parent better
than its white or black children), we prune the subtree rooted at the parent node and Match(f,
x+1) = Match(f, x). Else, Match(f, x+1) equals the set of children nodes that provide
a good match.
2.6.2 Case 2: |Match(f, x)| > 1: Ambiguity at level x
As in the previous case, we have the following two conditions for matching depending on
whether level x corresponds to the finest representation or not.
Level x corresponds to the finest object representation: If level x is the deepest level in the
object hierarchy, then we declare ambiguity in recognition and output the node that provides
the best match among Match(f, x).
Level x corresponds to a coarse object representation: Else, we resolve the ambiguity at
this level by seeking better matches at the next level x+1. We expand each matching node at
level x into its children nodes at level x+1, taking care to prune the subtree if the parent node
matches better than its children. Among the nodes thus obtained, Match(f, x+1) equals
the set of nodes that provide a good match.
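A compact sketch of this coarse-to-fine matching is given below; the diagonal-Gaussian likelihood, the twice-the-runner-up rule and the parent-versus-children pruning follow the description above, while the tree structure and the example values are illustrative assumptions.

# Compact sketch of coarse-to-fine recognition: score nodes by a diagonal-Gaussian
# likelihood of the fixated feature vector, accept a node if its likelihood is at
# least twice the runner-up's, otherwise keep all nodes at or above the mean, then
# expand accepted nodes into their children (keeping the parent if it matches best).
import numpy as np

class Node:
    def __init__(self, name, mu, sigma2, children=()):
        self.name, self.mu, self.sigma2, self.children = name, mu, sigma2, list(children)

def likelihood(node, f):
    return np.exp(-0.5 * np.sum((f - node.mu) ** 2 / node.sigma2))

def good_matches(nodes, f):
    scores = np.array([likelihood(n, f) for n in nodes])
    order = np.argsort(scores)[::-1]
    if len(nodes) == 1 or scores[order[0]] >= 2 * scores[order[1]]:
        return [nodes[order[0]]]                          # unique match
    return [n for n, s in zip(nodes, scores) if s >= scores.mean()]   # ambiguous

def recognize(root, f, depth):
    matches = good_matches([root], f)
    for _ in range(depth):
        frontier = []
        for node in matches:
            if not node.children:
                frontier.append(node)                     # already at the finest level
                continue
            best = good_matches(node.children + [node], f)
            frontier.extend([node] if node in best else best)   # parent wins => prune subtree
        matches = frontier
    return [n.name for n in matches]

root = Node("object", np.ones(42), np.ones(42),
            children=[Node("instance-1", np.full(42, 0.8), np.ones(42)),
                      Node("instance-2", np.full(42, 3.0), np.ones(42))])
print(recognize(root, np.full(42, 0.8), depth=1))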
Summary: Although simple and limited, this object recognition scheme has proven suffi-
ciently robust to allow us to test the model with some natural scenes, as described in the
following section.
2.7 Results
We tested our model by examining the influence of task on the model’s fixations. In a sim-
ple search task where a known target had to be detected and recognized, we evaluated our
model’s efficiency against a naive bottom-up model by comparing the speed of detection and
the salience of the target, i.e., which model made the target more attractive. In a multiple tar-
get detection task, we determined our model’s efficiency in detecting all the targets as against
the naive model. Apart from search tasks in static scenes, we also tested our model’s ability
to learn target locations in dynamic scenes where the environment is constantly changing and
the targets are moving. Finally, we verified whether our model could expand a given task into
its task-relevant targets and use this knowledge to determine the relevance of various scene
locations. In the rest of this section, we test the different components of our model that are
involved in the above tasks.
2.7.1 Visual search for a known target in an unknown scene
As a first test of the model, we consider a search task for a known target and wish to detect it
as fast as possible. This test aims at evaluating our model’s efficiency against a naive bottom-
up model by comparing the speed of detection and the salience of the target.
Training: We allowed our model to learn the visual features of the target from training
images (12 training images per target object on average, 24 target objects) and their corre-
sponding target masks (the target mask highlighted the target and served as a location cue for
training only).
Statistical testing: To detect the target in a new scene, the visual WM biased the bottom-up
attentional system to enhance the salience of scene locations that were similar to the target.
Attention was guided to locations whose biased salience was high. We tested the model on
343 new scenes and measured the improvement in performance of our top-down biased model
over the naive, bottom-up model [66]. There was a significant improvement in detection and
in the salience of the target in many but not all cases, verified as follows by statistical test-
ing for a significance level of 0.05 (figure 2.9). The null hypothesis H_0 (mean improvement
of 2.00 in target salience normalized by maximum salience in the image) was compared to
alternate hypotheses H_1 (mean improvement in normalized target salience < 2.00) and H_2
(mean improvement in normalized target salience > 2.00).
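One way to implement this decision rule (a sketch with made-up improvement values; the exact statistical test used in the thesis may differ) is to compare a t-based 95% confidence interval for the mean improvement against the reference value of 2.0:

# Sketch of the hypothesis classification implied by the text and figure 2.9:
# build a 95% confidence interval for the mean improvement in normalized target
# salience and compare it against the reference value 2.0. Data below are made up.
import numpy as np
from scipy import stats

def classify_improvement(improvements, reference=2.0, confidence=0.95):
    x = np.asarray(improvements, dtype=float)
    mean, sem = x.mean(), stats.sem(x)
    lo, hi = stats.t.interval(confidence, len(x) - 1, loc=mean, scale=sem)
    if hi < reference:
        return "H_1 (improvement < 2.0)", (lo, hi)
    if lo > reference:
        return "H_2 (improvement > 2.0)", (lo, hi)
    return "H_0 (consistent with 2.0)", (lo, hi)

print(classify_improvement([1.4, 2.3, 2.0, 1.8, 2.6, 1.9]))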
Hard visual search occurs when distractors are similar to the target: In some scenes, the
distractors were similar to the target, making the search tasks difficult (e.g., detect a circle
among ellipses). In such cases, biasing for the target led to an increase in salience of the target
as well as the distractors that shared the target’s features. Due to the spatial competition that
followed, the salience of the target was modulated and there was no significant improvement
in detection time or salience of the target, hence supporting the alternative hypothesis H_1.
Consistency with search asymmetry experiments: A particularly interesting case oc-
curred when we tried to detect a circle among circles with vertical bars. The bottom-up
salience of the circle was very low and biasing improved its salience by a large factor. But
biasing also boosted the salience of all the circles with vertical bars and due to the spatial
competition, the biased salience of the target became low and hence it did not pop-out (just
like this search is always difficult for humans, whether or not they know the target [157]).
But in the opposite case where we tried to detect a circle with a vertical bar among circles,
biasing did not affect the performance since the target was already the most bottom-up salient
item and popped out.

Target    Distractor    95% conf. (salience)    95% conf. (shifts)    Hyp    Detection time
[image]   [image]       [0.78, 2.20]            [0.86, 1.06]          H_0    improves
[image]   [image]       [1.21, 3.84]            [0.70, 1.09]          H_0    not affected
[image]   [image]       [0.13, 0.32]            [0.08, 0.16]          H_1    increases
[image]   [image]       [1.11, 1.40]            [1.00, 1.00]          H_1    Pop-out
[image]   [image]       [1.06, 2.30]            [1.91, 3.88]          H_0    improves
[image]   [image]       [0.26, 1.00]            [0.18, 0.36]          H_1    increases
[image]   [image]       [0.90, 1.01]            [1.00, 1.00]          H_1    Pop-out
[image]   [image]       [17.73, 964.89]         [0.73, 1.09]          H_2    not affected
[image]   [image]       [1.00, 1.11]            [1.00, 1.00]          H_1    Pop-out
[image]   [image]       [0.00, 3319.3]          [11.61, 19.18]        H_0    improves
[image]   [image]       [2.09, 8.72]            [5.69, 13.30]         H_2    improves
[image]   [image]       [1.02, 1.19]            [1.00, 1.00]          H_1    improves
[image]   [image]       [0.97, 1.18]            [1.03, 1.77]          H_1    improves
[image]   [image]       [3032.2, 7060.6]        [20.00, 20.00]        H_2    improves
[image]   natural       [2.48, 23.79]           [0.53, 1.15]          H_2    not affected
[image]   natural       [0.00, 15.13]           [0.49, 1.53]          H_0    not affected
[image]   natural       [1.00, 4.39]            [1.09, 2.66]          H_0    improves
[image]   natural       [1.88, 2.77]            [1.79, 2.39]          H_0    improves

Figure 2.9: Results: This figure shows our model’s results for top-down biasing for a sample from
our database of objects. The first column is the target object that we biased the model for; the second
column shows the distractor object when in a search array setup, or “natural” means that a natural
cluttered scene was the background or distractor; the third column shows the 95% confidence inter-
val for improvement in target salience normalized by maximum salience in the display (biased over
naive models); the fourth column shows the 95% confidence interval for improvement in number of
attentional shifts before detection of the target (naive over biased models); the fifth column shows the
hypothesis supported by the salience data. The null hypothesis H_0 (mean improvement in normalized
target salience = 2.0) or alternative hypothesis H_2 (mean improvement in normalized target salience >
2.0) was supported by a majority of the target objects. In some cases where the distractors were very
similar to the target, the alternative hypothesis H_1 (mean improvement in normalized target salience
< 2.0) was supported. The final column shows some remarks on the effect of biasing on detection
time. Note that in the case of pop-out, improvement in normalized target salience is approximately
1.0 because the target is already the most salient item in the display (hence, target salience normalized
by maximum salience equals 1.0), and biasing maintains the target as the most salient item.
Significant improvement in search performance: In most scenes, despite interference from
the distractors, biasing improved target salience and detection time (data supported H_0 or
H_2). For example, biasing accelerated the detection of a square among rectangles 15.56-fold
on average. An example of a comparison between the number of fixations taken by the biased
vs. unbiased models is shown in figure 2.10.
2.7.2 Consistency with available psychophysical data
The first set of results suggested that the spatially global [139] (one weight per feature map)
biasing mechanism implemented here and similar in spirit to Guided Search [177] may or
may not improve search performance, depending on the presence of shared features between
target and distractors. To further explore the validity of such a mechanism, we compared our
biased model’s predictions with existing psychophysical data and other models such as a ran-
dom model, the bottom-up or unbiased model [66], and the top-down search model proposed
by Rao et al. [126].

Figure 2.10: Results for top-down biasing: The example on the left shows the attentional trajectory
during free examination of this scene by the naive, bottom-up salience model (yellow circles represent
highly salient locations, green circles represent less salient locations, red arrows show the scanpath).
Even after 20 fixations, the model did not attend to the coke can, simply because its salience was
very low compared to that of other conspicuous objects in the scene. Displayed on the right is the
attentional trajectory after top-down biasing for the coke can object class (built from instances and
views of the coke can from other photographs containing the can in various settings). Our model
detected the target as early as the third fixation.
Highlights of Rao et al.’s model: As mentioned in section 1.3.1, Rao et al.’s model assumes
a much stronger biasing mechanism, whereby salience at every location reflects similarity
between the local low-level features and the target features provided top-down (with similar-
ity computed as the Euclidean distance between feature vectors).
Predictions of our model during conjunction search: To develop an intuitive understand-
ing of the comparison between both models, consider a conjunction search array with red
and blue vertical and horizontal elements (and a single red-vertical target) like in figure 2.11.
In our model, biasing for the features of the target means giving a high weight to red color
(the red/green feature maps) and to vertical orientation (the vertical feature maps). Because
each of these feature maps contains many active locations (the target, but also half of the
distractors), the spatial competition in each feature map [69] is expected to drive those maps
to zero, no matter how strongly biased they may be (remember that the spatial competition
tends to promote maps which contain a unique active location and to demote those which
contain many active locations). In the end, biasing is rather ineffective because it increases
the weights of feature maps that are essentially noise rather than attractors of attention. It is not
totally ineffective, though, because the target is amplified twice (once in the red/green maps
and once in the vertical maps) and hence exhibits slightly increased salience, though still very
low.
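To make this intuition concrete, the toy simulation below reproduces the argument in miniature. This is an illustrative sketch, not the actual implementation: the 3x3 array, the gain values, the noise level and the simplified form of the within-map competition are all assumptions. With many comparable peaks in each biased map, the competition drives both maps towards zero, and the target ends up only marginally more salient than the distractors regardless of how large the gains are.

```python
import numpy as np

rng = np.random.default_rng(0)

def compete(fmap, thresh=0.1):
    """Crude stand-in for the within-map spatial competition: scale the map
    by (M - m_bar)^2, where M is its global maximum and m_bar is the mean of
    its other active peaks. A map with a single unique peak keeps its activity;
    a map with many comparable peaks is driven towards zero."""
    peaks = np.sort(fmap[fmap > thresh])[::-1]
    m_bar = peaks[1:].mean() if peaks.size > 1 else 0.0
    return fmap * (peaks[0] - m_bar) ** 2

# 3x3 toy conjunction array: target = red-vertical at (1,1); the other items
# are red-horizontal or blue-vertical distractors. Small jitter keeps the
# competition factor from being exactly zero.
jitter = lambda m: m * (1.0 + 0.05 * rng.standard_normal(m.shape))
red      = jitter(np.array([[1., 0., 1.], [0., 1., 0.], [1., 0., 1.]]))   # red items
vertical = jitter(np.array([[0., 1., 0.], [1., 1., 1.], [0., 1., 0.]]))   # vertical items

for w, label in [(1.0, "unbiased"), (5.0, "biased for red & vertical")]:
    salience = w * compete(red) + w * compete(vertical)
    target = salience[1, 1]                              # center item = target
    best_distractor = np.delete(salience.ravel(), 4).max()
    print(f"{label:28s} target={target:.3f}  best distractor={best_distractor:.3f}")
```

Running this prints a target salience about twice that of the best distractor (it is amplified in both maps), but both values remain close to zero in the biased case as well as the unbiased one.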
Predictions of Rao et al.’s model during conjunction search: In contrast, a template match-
ing algorithm like that of Rao et al. would predict that biasing for the target should render
it salient, since the target will exhibit a feature distance near zero (perfect match between
local features and top-down biasing features, corresponding to highest salience), while dis-
tractors will exhibit non-zero distances (mismatch in at least one feature value). Whether
the difference between target and distractor salience values is sufficient to yield pop-out can
be controlled in Rao et al.’s model by a softmax parameter, λ, which determines how dom-
inantly the location of maximum salience attracts attention compared to locations of lesser
salience. To decide on a fair value for λ, we chose the one which barely allowed our re-
implementation of Rao et al.’s model to find the target in constant time on simple pop-out
search arrays (red-vertical bar among red-horizontal distractors, and red-vertical bar among
blue-vertical distractors). To further allow a fair comparison, our re-implementation of Rao
Figure 2.11: Difference between our biased model and Rao et al.’s model: Consider searching for
a red-vertical item among red-horizontal and blue-vertical items. Rao’s model computes salience of
each scene location as the Euclidean distance between the target and that location in feature space, by
progressively considering the information at coarse-to-fine scales. The corresponding salience maps
obtained for the first three fixations are shown here. As early as the third fixation, the salience map
including the finest scale clearly shows the target to be the single most salient location in the scene.
Thus, Rao’s model predicts that conjunction searches are efficient (see section 2.6.2 for details on our
re-implementation of that model). On the other hand, in our model, biasing promotes the red and
vertical features. In the resulting color feature map, the target as well as red-horizontal distractors
become active. Similarly, in the orientation feature map, the target as well as blue-vertical distractors
become active. Due to spatial interactions within each feature map, the target and the distractors cancel
each other. In the resulting salience map, the salience of the target and the distractors are comparable,
hence, leading to an inefficient search.
et al.’s model used the same set of features and center-surround scales as our model.
Comparison of our biased model and Rao et al.’s model: We tested all models on 100
color-feature searches (where the target differed from the distractors only in color), 100
orientation-feature searches (where the target differed from the distractors only in orienta-
tion), and 100 conjunction searches (where the target differed from the distractors in either
color or orientation). In each category, we plotted the reaction time (time taken by the models
to detect the target) against increasing number of items in the display (density of display was
maintained a constant while the display size was varied). As shown in figure 2.12, while the
random model and Rao et al.’s model showed no difference in performance across search
categories, our biased and unbiased models correctly predicted pop-out in single-feature
searches and confirmed the linear increase in reaction time with increasing set size, as is
typical in conjunction searches. That is, as soon as Rao et al.'s model was able to reliably
detect pop-out targets (by tuning λ), it had become sensitive enough so as to also reliably
detect conjunction targets.
Summary: This result casts doubt on whether a template-matching computation like that
proposed in Rao et al.'s model occurs in the primate brain. Our biased model, as expected
from our intuitive analysis (target salience must increase as it was amplified in two feature
maps but distractors only in one), performed slightly better in the conjunction searches than
the unbiased model.
[Figure 2.12 plots: three panels (color pop-out, orientation pop-out, conjunction search), each showing reaction time against the number of items in the display (10-60) for the random model, the unbiased model, the biased model, and Rao et al.'s model.]
Figure 2.12: Comparison between the performance of different models: This figure shows a com-
parison between the performance of a random model, our unbiased model, our biased model, and a
top-down model as proposed by Rao et al. The performance of the models is compared on search
arrays creating pop-out in color (first column), pop-out in orientation (second column), and serial,
conjunction searches (third column). The x axis shows the number of items in the display and the y
axis shows the reaction time (RT) measured as the number of fixations engaged by the model before
target detection. The random model assumes uniform probability of attending to each item in the dis-
play, hence, on an average, it attends to half the total number of items in the display before finding the
target. In single feature searches, our unbiased (unknown target) and biased (known target) models,
along with Rao’s model (known target) correctly predict efficient search as shown in columns 1 and
2. However, in conjunction searches as shown in column 3, Rao’s model continues to predict efficient
search (slope = 0, reaction time does not change with increasing number of items in the display),
while our unbiased and biased models show an approximately linear increase in reaction time with
increasing number of items in the display, which is typical of inefficient searches.
2.7.3 One-shot learning
Next, we determined our model’s ability to perform one-shot learning and compared the
detection performance when the target object’s category or coarse representation was known,
vs. when the exact instance was known, vs. when a different instance was known.
[Figure 2.13 panels: the central panel is the training image; the surrounding panels are test images.]
Figure 2.13: One-shot learning: the model learned a specific instance of the handicap sign from the
image shown in the center and used the learned instance to detect new handicap signs in different
poses, sizes and backgrounds as shown in the other images.
Detection performance of the one-shot learning mechanism: An example is shown in fig-
ure 2.13 where the model learned a specific instance of a handicap sign from one image and
used the learned instance to detect new handicap signs in novel poses, sizes and backgrounds.
We tested this one-shot-learning mechanism on 28 test images and as shown by the statistics
in tables 2.1 and 2.2, the model accelerated detection over two-fold on average.
Detection performance when the target object’s category or coarse representation is
known: When we allowed the model to learn all instances and combine them to form a gen-
eral target representation, it allowed for greater variance in the possible target shapes and
Operating Mode μ σ 95% Conf. Min Max
Learned-instance 2.72 1.73 [0.91, 4.54] 0.91 5.01
General-object 2.67 1.79 [0.79, 4.54] 0.87 5.39
Exact-instance 3.47 2.45 [0.90, 6.03] 0.95 7.50
Table 2.1: Statistics of target salience as computed by the biased model over that computed by the
naive unbiased model: The first column states the target representation that was used for biasing (see
section 2.6.3 for details). The second column shows the mean improvement in target salience; the
third column shows the standard deviation; the fourth column shows the 95% confidence interval; the
fifth and sixth columns show the minimum and maximum improvements obtained.
Operating Mode μ σ 95% Conf. Min Max
Learned-instance 2.24 1.27 [0.91, 3.58] 1.00 4.35
General-object 2.22 1.24 [0.92, 3.52] 1.00 4.26
Exact-instance 2.25 1.27 [0.92, 3.58] 1.00 4.35
Table 2.2: Statistics of target detection time as taken by the naive unbiased model over that taken
by the biased model: The first column states the target representation that was used for biasing (see
section 2.6.3 for details). The second column shows the mean improvement in target detection time;
the third column shows the standard deviation; the fourth column shows the 95% confidence interval;
the fifth and sixth columns show the minimum and maximum improvements obtained.
sizes. On the one hand, increased variance in feature values allows detection and categorization
of modified targets under the same general object category; on the other hand, it
decreases detection speed due to uncertainty in the exact target features. Hence, biasing
for the general object representation led to a small drop in efficiency as compared to biasing
for the learned instance.
Detection performance when the exact instance of the target object is known: Finally,
when we allowed the model to detect the same instance that it had learned, it was most
efficient. These results support studies in psychophysics suggesting that better or more exact
knowledge of the target leads to better searches [79].
2.7.4 Multiple target detection
For multiple target detection, the visual WM used the target representations previously learned
and stored in the visual LTM (as stated earlier, for learning, we used 12 training images per
target object). The model biased for the multiple task-relevant targets sequentially in decreas-
ing order of their relevance.
Handling errors in detection: As mentioned earlier in this section (exemplified with the
conjunction search arrays of figure 2.12), biasing is likely, but not guaranteed to make the
target most salient. Hence, a less relevant target may be detected while biasing for the most
relevant target. Our model handles such errors by recognizing the fixated entity and updating
the state of the task graph in the symbolic WM to indicate that it has found the less relevant
target, and it proceeds to detect the most relevant target by repeating the above steps.
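A minimal control-loop sketch of this sequential strategy is given below; the callables bias_for, attend_next and recognize are placeholders standing in for the model's biasing, attention and recognition components, not the actual implementation.

```python
def detect_targets(scene, targets, bias_for, attend_next, recognize, max_fixations=50):
    """Sequentially search for multiple targets in decreasing order of relevance.
    `targets` maps target name -> relevance; the three callables stand in for the
    top-down biasing, attention and recognition components of the model."""
    remaining = dict(sorted(targets.items(), key=lambda kv: -kv[1]))  # most relevant first
    found = []
    fixations = 0
    while remaining and fixations < max_fixations:
        current = next(iter(remaining))        # most relevant target still missing
        bias_for(current)                      # top-down biasing of the feature maps
        location = attend_next(scene)          # next most salient location
        fixations += 1
        entity = recognize(scene, location)    # may be a less relevant target, or a distractor
        if entity in remaining:                # update the task graph in symbolic WM
            found.append((entity, location, fixations))
            del remaining[entity]
    return found
```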
Figure 2.14: Sequential detection of multiple targets: The model initialized the working memory
with the targets to be found and their relevance (handicap sign, relevance = 1; fire hydrant, relevance
= 0.5). It biased for the most relevant target (in this case, the handicap sign), made a false detection,
recognized the fixation (fire hydrant), updated the state in its working memory (recorded that it found
the fire hydrant), and proceeded to detect the remaining target by repeating the above steps.
Testing: We tested multi-target detection and recognition on 28 new scenes containing fire
hydrants and handicap signs. Since the influence of the gist on TRM is not implemented in
our model yet, we placed the targets at random locations to eliminate the role of the gist in
aiding the detection of the targets. Results showed that, on average, our model was 6.20 times
faster than the naive unbiased model (95% confidence interval = [1.47, 10.94], min = 0.07,
max = 28.86; figure 2.14).
Summary: In these experiments, we thus tested the top-down biasing and recognition com-
ponents involving visual WM and LTM modules, and the symbolic WM and LTM modules
for creating and maintaining the task graph.
2.7.5 Object recognition
To further test the recognition module, we allowed the model to recognize the entity at the
attended location by matching the visual features extracted at the fixation against those stored
in the object hierarchy in visual LTM. Despite the simplicity of the model (it attempts to
recognize fixations by looking at just one location in the object), it seems to be able to classify
the target into the appropriate category of objects: as shown in figure 2.15, the contributors for
false negatives and false positives share features with the target, i.e., they are similar to the
target.
Target    % False positives   Top contributor   % False negatives   Contributor 1   Contributor 2   Contributor 3
(image)   64.53               (image)            20.00              (20.00)
(image)   30.55               (image)            60.00              (50.00)         (10.00)
(image)    0.00               —                 100.00              (70.00)         (30.00)
(image)    0.00               —                  10.00              (10.00)
(image)    0.00               —                   0.00
(image)   18.18               (image)            10.00              (10.00)
(image)    0.00               —                  10.00              (10.00)
(image)    3.97               (image)           100.00              (30.00)         (20.00)         (10.00)
(image)    0.00               —                  70.00              (40.00)         (30.00)
(image)    6.71               (image)            70.00              (20.00)         (10.00)         (10.00)
(image)    8.86               (image)            30.00              (30.00)
(image)    0.00               —                  80.00              (80.00)
(image)    0.00               —                   0.00
(image)    7.77               (image)            90.00              (70.00)         (20.00)
Figure 2.15: Statistics for the hierarchical recognition of arbitrary fixations, for a sample of objects
from our database: As an initial implementation, we considered a simple object hierarchy with just 3
main levels (level 1: all objects, level 2: instances and level 3: views) and at level 0 was a dummy root
that was a general class combining all the objects. The first column is the target object; the second
column shows the percentage of false positives (number of distractors that were falsely recognized as
the target, over the total number of distractors); the third column shows the distractor that accounted
for the false positives; the fourth column shows the percentage of false negatives (number of targets
that were not recognized as the target, over the total number of targets); the fifth, sixth and seventh
columns show the top 3 contributors to false negatives. Despite the simplicity of the model (it attempts
to recognize fixations by looking at just one location in the object), it seems to be able to classify
the target into the appropriate category of objects: as shown in this figure, the contributors for false
negatives and false positives share features with the target, i.e., they are similar to the target.
2.7.6 Estimating task-relevance of scene locations
Next, we attempted to determine and learn the task-relevant locations in the scene. The vi-
sual WM, with the help of visual LTM, biased the attentional system for the task-relevant
target. Initially, the model had no prior knowledge of the scene, hence the TRM was uniform
(baseline 1.0 everywhere), and the model attended to scene locations based on their visual
salience. For each incoming visual scene, the TRM was updated as follows: at each fixation,
the recognition module, with the help of visual LTM, recognized the entity at the attended
scene location. The symbolic WM, with the aid of symbolic LTM, determined the task-
relevance of the recognized entity. It marked the corresponding location in the TRM with
the estimated relevance. To learn the contents of the TRM across all the incoming scenes,
we computed the average TRM in an online and incremental manner (for this purpose, we
maintained the sum of TRMs and the number of TRMs or scenes seen so far). As shown
below, we designed a task in a dynamic environment to test the learning and working of the
TRM. The other modules that were also involved in the test include the top-down biasing and
recognition modules, the working memory and the long-term memory modules.
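The TRM bookkeeping described above can be sketched as follows (a simplified illustration assuming the TRM is a 2-D array and that each recognized entity marks a small square region; the class and method names are ours, not the model's).

```python
import numpy as np

class TaskRelevanceMap:
    """Topographic map of task-relevance, updated at each fixation and
    averaged incrementally across scenes (running sum + scene count)."""
    def __init__(self, shape, baseline=1.0):
        self.baseline = baseline
        self.trm_sum = np.zeros(shape)    # sum of per-scene TRMs seen so far
        self.n_scenes = 0
        self.current = np.full(shape, baseline)

    def start_scene(self):
        """Reset the per-scene TRM to baseline (no prior knowledge of the scene)."""
        self.current = np.full(self.trm_sum.shape, self.baseline)

    def mark(self, location, relevance, radius=2):
        """Mark an attended location (row, col) with the estimated relevance of
        the recognized entity, over its approximate extent."""
        r, c = location
        self.current[max(r - radius, 0): r + radius + 1,
                     max(c - radius, 0): c + radius + 1] = relevance

    def end_scene(self):
        """Fold the current scene's TRM into the running average."""
        self.trm_sum += self.current
        self.n_scenes += 1

    @property
    def learned(self):
        """Average TRM over all scenes seen so far."""
        return self.trm_sum / max(self.n_scenes, 1)
```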
Determining the relevant scene locations for driving: For a driving task, we allowed the
model to bias for cars and attend to the salient scene locations and recognize them as belong-
ing to the car or the sky category. Initially, the TRM was unbiased due to the lack of any
knowledge of the scene. As the model attended and recognized locations as belonging to the
car category, the relevance of these locations was updated in the TRM. The development of
the TRM over a number of fixations is shown in figure 2.16.
Testing the change in relevance of scene locations as a function of the task: On the same
scenes as used for the driving task, we attempted to learn the scene locations that belonged
[Figure 2.16 panels are snapshots of the TRM after fixations 1-9, 12, 16, 20, 24 and 28.]
Figure 2.16: Learning the TRM: The model learned the TRM for a driving task by attending, es-
timating the relevance of attended scene locations and updating the TRM. The development of the
TRM across 28 fixations is shown here (brighter shades of grey indicate locations more relevant than
baseline). Note that the TRM does not change significantly after a while and is learned to a reasonable
precision within the first 5-10 fixations.
to the sky category. We repeated exactly the same steps as above and obtained the TRM as
shown in figure 2.17. Thus, we explored how different locations in the same scene become
relevant as the task changes.
2.8 Discussion
In this chapter, we have designed and partially implemented an overall biologically plausible
architecture to model how different factors such as bottom-up cues, knowledge of the task
and target influence the guidance of attention. We have tested our model on a variety of tasks
including search tasks in static scenes and a driving task in dynamic scenes. The results show
Figure 2.17: Learning the TRM: On the same scenes as used for the driving task, we learned the scene
locations that belonged to the sky category. The TRM learned after the first 28 fixations is displayed
here. Those locations belonging to the car category are clearly suppressed or marked irrelevant (dark)
compared to baseline (white). It may appear misleading that the road is marked as relevant. Since
the road was non-salient, it did not attract any attention and hence was not marked as irrelevant and
remained at baseline.
that our model can determine the task-relevant targets from a given task definition; detect
the targets amidst clutter and diverse backgrounds; reproduce basic human visual search
behavior; recognize many targets and classify them into their corresponding categories with
few errors; learn the task-relevant locations online in an incremental manner, and use the
learned target features as well as likely target locations to bias the attentional system to guide
attention towards the target. In the rest of this section, we discuss our main contributions in
this chapter, namely target representation, target detection, recognition, and memorization,
followed by a brief discussion on scene representation. Finally, we present the limitations of
our model, along with future directions.
2.8.1 Target Representation
Our model represents the target by center-surround features at different spatial scales.
Comparison of traditional approaches vs. ours: Traditional approaches attempt to seg-
ment the target from the background in order to avoid the confusion between the target and
the background. In contrast, our target representation includes background information at the
coarse spatial scales and contains information about the target at the finer spatial scales.
Circumventing the problem of segmenting the figure from background: In simple cases
where the object appears in similar sizes but in different backgrounds, our model achieves
the equivalent of segmentation by determining the scales that reliably represent the target. If
the background is inconsistent or changing, its variability is reflected in the high variance in
response to the features at coarse spatial scales. Consequently, those features are considered
unreliable and are not promoted during biasing for the target. In other cases where the back-
ground is consistent, the co-occurrence of the target and its background is captured by the
low variance in response to the features at the coarse scales. Thus, our target representation
provides a convenient way to include contextual information.
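As an illustration of this idea, the sketch below estimates per-feature statistics over a few training views and derives biasing weights that favor consistent (low-variance) features; the particular reliability measure mu/(sigma + eps) is an illustrative choice, not necessarily the formula used in the model.

```python
import numpy as np

def learn_feature_weights(exemplars, eps=1e-6):
    """exemplars: array of shape (n_views, n_features) holding the center-surround
    feature responses at the target location for each training view. Returns the
    per-feature means and illustrative reliability weights that favor features
    with strong, consistent (low-variance) responses."""
    mu = exemplars.mean(axis=0)
    sigma = exemplars.std(axis=0)
    weights = mu / (sigma + eps)          # strong + consistent responses -> large weight
    return mu, weights / weights.sum()    # normalized top-down gains

# Toy example: 4 views x 3 features; the last feature (a coarse-scale response
# dominated by a changing background) is highly variable and gets a small gain.
views = np.array([[0.9, 0.4, 0.1],
                  [0.8, 0.5, 0.9],
                  [0.9, 0.4, 0.2],
                  [0.8, 0.5, 0.7]])
mu, gains = learn_feature_weights(views)
print("means:", mu, "gains:", gains.round(2))
```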
2.8.2 Target detection
In our model, feature maps are computed in parallel, non-linear interactions occur in all
of them [69], and they are weighted in a top-down manner before being summed into the
salience map. The target is made salient by adjusting the weights of the low-level feature
maps so as to promote the target’s relevant features and suppress its irrelevant features. Thus,
our model provides a computational implementation of a Guided Search mechanism [177],
and it learns the appropriate feature weights directly from training images containing the tar-
gets.
Prediction of our model: Consequently, our model predicts that all scene locations whose
features are a superset of the target’s features, or share them, also become salient, e.g., a red ellipse
also becomes salient if we are searching for a red circle. This prediction of the model could
be verified with psychophysics experiments.
Efficient target detection depends on both top-down biasing and bottom-up salience: In
addition to top-down factors that influence target salience, bottom-up factors such as spatial
non-linear interactions modulate the target salience based on the salience of the neighboring
distractors [111, 97, 37]. The winner of the spatial competition depends on the positions and
relative salience of the target and distractors. Hence, biasing is likely, but not guaranteed to
make the target most salient.
Consistency with psychophysical data: In important cases like conjunction searches (fig-
ures 2.11 and 2.12), we have shown how biasing is fairly ineffective in our model, in agree-
ment with human data. This reinforces the plausibility of the biasing approach proposed in
our model, especially compared to template-matching models [126] which seem difficult to
reconcile with empirical data (as they do not yield pop-out in feature search cases when the
target features are unknown, as mentioned in section 1.3.1, but yield pop-out in both feature
and conjunction searches alike when target features are known, as shown in section 2.6.2).
2.8.3 Target Recognition
The object recognition model proposed here is simple and shares its resources intimately with
the attentional system by re-using in target representation the pre-attentive features computed
for guiding attention.
Progressive matching from coarse to fine object representations: Hierarchical matching
from general representations like object categories to specific representations like the object,
instance or view allows us to terminate the search at the appropriate level of representation,
depending on our task requirements (e.g., distinguishing between a white and red object may
not require processing down to the level of instances such as white car, or white horse). Fur-
ther, by pruning the subtrees (in the object hierarchy) that do not match, we can accelerate
the search for the best match.
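A sketch of such coarse-to-fine matching with pruning over the object hierarchy is shown below (the node structure, the Euclidean match score and the pruning threshold are illustrative assumptions rather than the stored LTM representation).

```python
import numpy as np

class Node:
    """A node in the object hierarchy (e.g., object class -> instance -> view),
    storing the mean feature vector of the representations beneath it."""
    def __init__(self, name, features, children=()):
        self.name, self.features, self.children = name, np.asarray(features), list(children)

def best_match(fixation_features, node, threshold=1.0):
    """Match the pre-attentive features at the fixation against the hierarchy,
    descending only into subtrees whose node matches well enough (pruning)."""
    dist = np.linalg.norm(fixation_features - node.features)
    if dist > threshold:                   # prune this whole subtree
        return None, np.inf
    best = (node, dist)
    for child in node.children:            # refine: instances, then views
        match, d = best_match(fixation_features, child, threshold)
        if d < best[1]:
            best = (match, d)
    return best

# Usage: a tiny hierarchy with one object class and two views.
can = Node("coke can", [0.8, 0.2],
           [Node("can view 1", [0.85, 0.15]), Node("can view 2", [0.75, 0.25])])
node, d = best_match(np.array([0.84, 0.16]), can)
print(node.name, round(d, 3))
```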
Reuse of attention-guiding pre-attentive features for object recognition: In this chapter,
the goal in designing the recognition model was to explore how the pre-attentive features
used to guide attention may be re-used for object representation and recognition. Since we
represent the target as a feature vector, we do not explicitly handle complex or composite
objects in the current model. Yet, our results indicate that the model could recognize some
complex objects such as geometrical shapes including rectangles, cubes and striped bars to a
reasonable extent (see figure 2.15).
Limitations of our recognition model: Currently, our model attempts to recognize an ob-
ject by matching any one location’s visual features against all learned representations, hence,
there are false recognitions and limitations on the complexity of objects that can be rec-
ognized. Though it cannot recognize complex objects, this could possibly be achieved by
decomposing the complex object into a spatial configuration of simpler objects (parts) [176],
that could each be recognized using our proposed schema. A higher-level mechanism can
then check for the spatial relations between the parts to recognize the whole.
2.8.4 Memorization
On the one hand, we have symbolic knowledge that deals with high-level concepts and ob-
jects. On the other hand, there are low-level neural maps of the scene that encode salience
or other image attributes at each pixel or image location. To bridge the gap between these
extreme representations, we have proposed a two-dimensional topographic map called task-
relevance map (TRM) that encodes task-relevance of the scene entities.
Using the TRM to memorize the target: To memorize the target, an area in the TRM cor-
responding to the locations and approximate size and shape of the target [168] is highlighted
with the target’s relevance, and visual features are stored in visual working memory along
with links to symbolic knowledge.
TRM can be used to prime spatial locations: The TRM has several potential uses as ex-
plained below. It helps to prime a particular scene location by increasing its relevance in the
TRM, thus supporting spatial top-down attentional modulation.
TRM can interact with the gist to aid object detection: The TRM can also help in object
detection by using the gist to guide attention to locations where the object is likely to occur.
Non-attentional scene representations such as gist and layout have already been shown to
play an important role in object detection [116, 11, 30, 26, 50, 127, 152]. An extension of
our model allows incremental learning of the relationship between gist and the constituent
scene objects as follows. The TRM may be used to learn object properties such as locations
where an object is likely to occur and its approximate size. The relation between gist and
object properties may be learned by maintaining a loop between the gist and the TRM (via
working memory). During the feedforward loop, the quick and imprecise gist may be used
to retrieve the appropriate, previously learned TRM and use it as an initial guide to drive
the focus of attention. Subsequently, by the slow and precise processes of attending and
updating, the TRM can be refined and learned online in an incremental manner within the
first few fixations, and be used to drive further fixations. Finally, the feedback loop may
use the TRM to reinforce, confirm or even update the gist. It may also be used to store the
currently learned TRM.
2.8.5 Scene Representation
Knowledge of gist, visual features and location of the object may be important for scene un-
derstanding and representation, but they are not sufficient.
Using relations to bind scene entities: Consider the following example of a scene with a
man, a laptop and a cake. In order to understand the scene, we need to know how the entities
are bound or related to each other. If the man and the laptop are bound by the ’work’ action,
then we can conclude that the man is working. Else if the man and the cake are bound by the
’eat’ action, we can conclude that the man is eating the cake. To represent such relationships
in our model, the symbolic working memory (WM) maintains relations among entities by
seeking the help of the symbolic long-term memory. However, we do not make any claims
on the biological feasibility of our current implementation. It is not clear to us as to how
these relations may be represented in our brain and how the entities may be bound together
into composite structures.
Using links to bind symbolic attributes, visual features and location of the attended ob-
ject: Our model presents the following hypothesis on how a scene may be represented. To
bind the symbolic attributes of the attended object with its visual features and its location,
our model suggests the creation and maintenance of a link between the object in the symbolic
WM, its visual features in the visual WM and the corresponding location in the TRM. This
constitutes our explicit representation of an object file [73]. These links can be very useful
in recall, e.g., an object at a particular location may be recalled by activating that location in
the TRM, that in turn activates the link and the associated object. Similarly, where we saw
a given object may be recalled by activating the object in the working memory that in turn
activates the link to the corresponding location.
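A minimal data-structure sketch of such a link, supporting recall in either direction, could look as follows (field and class names are illustrative, not taken from the implementation).

```python
from dataclasses import dataclass, field
from typing import Tuple, List

@dataclass
class ObjectFile:
    """Link binding an object's symbolic label (symbolic WM), its visual
    features (visual WM) and its location in the TRM."""
    label: str
    features: Tuple[float, ...]
    location: Tuple[int, int]
    relevance: float = 1.0

@dataclass
class WorkingMemory:
    links: List[ObjectFile] = field(default_factory=list)

    def recall_by_location(self, location):
        """What did we see at this TRM location? (location -> object)"""
        return [l.label for l in self.links if l.location == location]

    def recall_by_label(self, label):
        """Where did we see this object? (object -> location)"""
        return [l.location for l in self.links if l.label == label]

# Usage:
wm = WorkingMemory()
wm.links.append(ObjectFile("handicap sign", (0.7, 0.3), (12, 40), relevance=1.0))
print(wm.recall_by_label("handicap sign"))   # -> [(12, 40)]
```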
Extension of our model to explore the role of links in scene representation: The follow-
ing discussion, though not directly tied to the reported model, is an interesting detour that
explores the role of the links (that bind visual and symbolic properties of the stimuli) in scene
representation. We propose to use the above links for scene representation by extending
Rensink’s triadic architecture [127] as follows.
Extension of Rensink’s hypothesis for scene representation: Rensink proposes a coher-
ence field where a spatio-temporal structure is created at the focus of attention and is lost
when the focus of attention shifts. He suggests that the low level visual stages such as proto-
objects are volatile and are bound only at the focus of attention. We extend that hypothesis
and suggest that while the low-level visual stages may be volatile, high-level visual stages
such as the WM (and, further, LTM) may not be volatile and may store the recently attended
relevant objects, their locations and their visual features, even though they may not be the
current focus of attention [56, 55, 54].
Consistency of our extended hypothesis with psychophysical data: This is consistent with
studies showing that visual representation at high-level visual stages may be impoverished
and less precise than their low level counterparts [59, 121], but they can be maintained for
longer durations under backward pattern masking [121] and across saccades [61]. Hence, in
our representation, the links between the TRM and objects in short term memory or working
memory don’t die when the focus of attention shifts.
The role of relevance of the object in the creation, maintenance and destruction of its
links: But several studies have shown that there exist strict limitations (∼4) on how many
object files may coexist at any given time [146, 120, 60, 89, 63]. This implies that there must
be some competition among the links so that the strong links may survive and the weaker ones
may die (see [140] for an activation level based competition). We suggest that the strength
of the link depends on the relevance of the associated object, perhaps directly proportional.
Hence, links are not established for irrelevant objects and, consequently, their visual features
or locations are forgotten. Older links suffer interference from newer links and gradually
weaken and die. A new link also suffers interference from existing links and may die if its
relevance is not high. Thus, links to irrelevant objects/locations or those seen in the remote
past may die or disappear whereas links to the relevant objects/locations seen recently may
be strong and consequently, we remember the associated details.
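Purely as an illustration of this qualitative account, the sketch below lets link strengths decay, refresh in proportion to relevance, and compete for a capacity of about four slots; the decay rate, the update rule and the capacity are assumptions standing in for the description above, not claims about the underlying mechanism.

```python
def compete_links(links, capacity=4, decay=0.9):
    """links: list of (name, relevance, strength). On each update, every link's
    strength decays, is refreshed in proportion to its object's relevance, and
    only the `capacity` strongest links survive (cf. the ~4-object-file limit)."""
    updated = [(name, rel, decay * strength + rel) for name, rel, strength in links]
    updated.sort(key=lambda link: -link[2])
    return updated[:capacity]              # weaker / older / irrelevant links die

links = [("cake", 0.2, 1.0), ("laptop", 0.9, 1.0), ("man", 1.0, 1.0),
         ("chair", 0.1, 1.0), ("door", 0.05, 1.0)]
for _ in range(3):
    links = compete_links(links)
print([name for name, _, _ in links])      # only the most relevant links survive
```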
2.9 Unsolved challenges
Need to include directed attentional shifts: More sophisticated top-down attentional con-
trol requires directed attentional shifts (e.g., look upwards if searching for a face, but found a
foot) that use prior knowledge of spatial relationships between scene entities. Including these
into our model would require memorization of spatial relationships between scene entities.
Top-down models such as that of Rybak et al. [138] provide an excellent starting point for this
extension.
Need to create a learnable knowledge base: For executing tasks in real world scenes, the
relevant scene locations and objects must be estimated, memorized and learned for subse-
quent use. Thus, the knowledge base or ontology has to be extensive and learnable, and must
be updated by the contents of the TRM and working memory. In our current model, the
knowledge base is hand-coded with a few objects and actions related to humans. Extension
on this front requires research in knowledge representation and learnable ontologies.
Need to detect targets in cluttered, distracting environments: Real world scenes are clut-
tered and often, the target object is hidden among several distractors. Our current model
uses a suboptimal strategy to bias early visual areas with the knowledge of the target’s visual
features such that the target’s salience may be enhanced. It does not use knowledge of the
distractors to enhance the search efficiency. For deployment of our model in the real world,
we need a reliable and quick detection mechanism that detects targets in natural, cluttered
backgrounds with heterogeneous distractors.
Need to improve the recognition model: Our current implementation uses a naive recogni-
tion model that uses preattentive features to detect fairly simple objects. For deployment in
real world scenes, robust recognition models with the ability to recognize objects under vari-
ations in 3D pose, occlusion and cluttered backgrounds are required. This may be achieved
by using one of the many recognition models available in the literature, e.g., models based
on the choice of basic primitives (e.g., Gabor jets [176], geometric primitives like geons [10],
image patches or blobs [172], and view-tuned units [133]), models based on the process of
matching (e.g., self organizing dynamic link matching [84], probabilistic matching [172]),
and others (for reviews, see [134, 1]).
2.10 Conclusion
In this chapter, we have proposed and partially implemented a computational model for the
task-specific guidance of attention in real-world scenes.
Highlights of our approach: Given a task specification, our model determines the
task-relevant entities, biases for the current most task-relevant entity, recognizes the fixated
entity, memorizes the task-relevance of the fixated entity, updates its working memory and
repeats the process until the task is complete.
Our contributions: Our main contributions in this chapter are: First, providing a biologically
plausible architecture for object detection, by top-down biasing the bottom-up attentional sys-
tem for the object’s pre-attentive features so as to make the object more salient; second, object
recognition by re-using the pre-attentive features for object representation and matching hier-
archically against stored representations; and, third, memorization of relevant scene locations
in visual working memory by learning their locations and approximate sizes in a topographic
two-dimensional task-relevance map. We have also proposed a non-biological computational
scheme to estimate the task-relevance of scene entities using an ontology containing entities
and their relationships.
Summary: The promising results of our model suggest that the model may provide a rea-
sonable approximation to many of the brain processes involved in complex task-driven visual
behaviors.
APPENDIX
Here, we show the derivation of the class conditional density, P(O = o|Y), of a super-class Y that is
formed by combining several equally likely and mutually exclusive object classes X_i (refer to section
2.4.1).

P(O = o|Y) = P(O = o | \cup_i X_i)                                   (using eqn. 2)
           = P(O = o, \cup_i X_i) / P(\cup_i X_i)                     (using Bayes rule)
           = P(\cup_i X_i | O = o) P(O = o) / P(\cup_i X_i)           (using Bayes rule)
           = \sum_i P(X_i | O = o) P(O = o) / \sum_i P(X_i)           (since X_i are mutually exclusive)
           = \sum_i P(X_i, O = o) / \sum_i P(X_i)                     (using Bayes rule)
           = \sum_i P(O = o | X_i) P(X_i) / \sum_i P(X_i)             (using Bayes rule)
           = \sum_i P(O = o | X_i) w_i                                (2.10)

where w_i = P(X_i) / \sum_j P(X_j) = 1/n (since X_i are equally likely).
The mean of O|Y is derived as follows:

E[O|Y] = \int_o o P(O = o|Y) do                                       (2.11)
       = \int_o o (\sum_i P(O = o|X_i) w_i) do                        (using eqn. 10)
       = \sum_i w_i (\int_o o P(O = o|X_i) do)
       = \sum_i w_i E[O|X_i]                                          (substituting Y by X_i in eqn. 11)

\mu = \sum_i w_i \mu_i

By definition of variance,

\sigma_i^2 = E[(O|X_i - E[O|X_i])^2]
           = E[(O|X_i)^2] - (E[O|X_i])^2
\sigma_i^2 = E[(O|X_i)^2] - \mu_i^2                                   (2.12)
\sigma^2   = E[(O|Y)^2] - \mu^2   (similarly)                         (2.13)

E[(O|Y)^2] = \int_o o^2 P(O = o|Y) do                                 (by definition of expectation)
           = \int_o o^2 (\sum_i P(O = o|X_i) w_i) do                  (using eqn. 10)
           = \sum_i w_i (\int_o o^2 P(O = o|X_i) do)
           = \sum_i w_i E[(O|X_i)^2]                                  (by definition of expectation)
           = \sum_i w_i (\sigma_i^2 + \mu_i^2)                        (using eqn. 12)  (2.14)

\sigma^2 = \sum_i w_i (\sigma_i^2 + \mu_i^2) - \mu^2                  (using eqns. 13, 14)  (2.15)
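As a quick numerical sanity check of equations (2.10)-(2.15), the snippet below compares the closed-form mixture mean and variance with a Monte Carlo estimate for Gaussian class-conditional densities (the particular means, variances and sample size are arbitrary choices for illustration).

```python
import numpy as np

rng = np.random.default_rng(1)

# Component classes X_i with equal weights w_i = 1/n.
mus    = np.array([1.0, 3.0, 6.0])
sigmas = np.array([0.5, 1.0, 2.0])
w = np.full(len(mus), 1.0 / len(mus))

# Closed-form mixture moments from eqns (2.10)-(2.15).
mu  = np.sum(w * mus)
var = np.sum(w * (sigmas**2 + mus**2)) - mu**2

# Monte Carlo estimate: sample a class, then sample O | X_i.
idx = rng.integers(len(mus), size=200_000)
samples = rng.normal(mus[idx], sigmas[idx])
print(f"analytic: mu={mu:.3f}, var={var:.3f}")
print(f"sampled : mu={samples.mean():.3f}, var={samples.var():.3f}")
```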
Chapter 3
Investigating the Granularity of Top-Down Attention during
Visual Search
3.1 Introduction
Importance of studying granularity of top-down signals: The natural world contains prey and
predators that are camouflaged, and hence, visually non-salient. For instance, a lion camouflaged in
the dry savannah is hard to detect because its golden fur has a similar tint to the yellowish grasslands.
In such situations where bottom-up guidance is minimal, the prey's survival depends on whether top-
down signals can guide attention by selecting the fine-grained target feature (in this case, selecting the
relevant shade of yellow among different shades). Hence, the granularity of top-down signals plays a critical
role in determining visual search performance. Despite its importance, the granularity or information
capacity of top-down signals has been less studied than their sources or modulatory effects on early
sensory areas [98, 25, 100, 158, 85, 139, 23]. There exists some evidence from single unit record-
ings for differential gains on neurons tuned to different features [159]. Although such differential
gain modulation has been demonstrated for dimensions like direction of motion and orientation, it is
not yet known whether the same is true for other dimensions like intensity, color saturation and size.
Moreover, the electrophysiological studies do not perform an important test of granularity, which is
whether the intermediate feature in a dimension can be selectively enhanced while suppressing flank-
ing distractor features. Few psychophysics studies have tried to address these issues of granularity,
but their results provide conflicting evidence. Some studies suggest that top-down signals are coarse-
grained (figure 3.1a, e.g., one gain control term for the intensity dimension, thereby selecting all values
or intervals of intensity) [102, 43], while others suggest that top-down signals are fine-grained (figure
3.1b, e.g., multiple gain control terms within the intensity dimension allowing selection of a particular
interval of intensity) [123] (see below for a detailed literature review). Investigating the granularity of
top-down signals is therefore crucial for further progress in understanding top-down attention modu-
lation. In the next section, we present an overview of relevant literature.
3.2 Related work
Guided Search theory: One of the most influential theories of visual search is the Guided search the-
ory [177]. It successfully accounts for several observed phenomena in human visual search behavior,
such as pop-out vs. conjunction [156], target-distractor discriminability [119, 37, 103, 155], distrac-
tor heterogeneity [37], feature priming [91, 178]. It suggests a two-stage model of visual processing.
In the preattentive stage, feature maps are computed in parallel in several feature dimensions (e.g.,
red, blue, green and yellow feature maps in color hue dimension; steep, shallow, left, right maps in
the orientation dimension). In the second stage, top-down multiplicative gains are applied on these
bottom-up maps and the weighted feature maps are combined additively to form an activation map
which eventually guides visual attention in a sequential manner. Thus, during search for a red item,
the theory suggests that the weight on the red feature may be increased, resulting in increased activity
of all red items in the scene. Although the theory includes top-down guidance through a multiplicative
gain control mechanism, it does not directly address the issue of granularity of top-down guidance.
For some dimensions like orientation, it explicitly states that there may be multiple gain control terms
for steep, shallow, left and right features. However, for other dimensions like intensity, size and color
saturation, it does not comment on the granularity.
Linear separability effect: A popular effect observed in visual search behavior suggests that search
is easier when the target can be separated from the distractors by a line in feature space. For instance,
in the intensity dimension, search for the brightest item is easier than search for a medium bright
item among brighter and darker items. This effect has been reported in several dimensions such as
color [38], chromaticity [4], luminance [4, 53], and size [154, 53] (Wolfe & Bose, 1991, unpublished).
Inefficient search for a MID type target seems to suggest that top-down guidance cannot select the MID inter-
val within a feature dimension. Hence, these results seem to support the hypothesis that top-down
guidance is coarse-grained. However, the above experiments varied both the target and the distrac-
tor stimuli across the search conditions, thereby varying both bottom-up and top-down guidance, and
making it difficult to tease apart the top-down contribution. For instance, in the HIGH search condi-
tion, their subjects searched for a single HIGH intensity target among many MID and LOW intensity
distractors. Whereas in the MID condition, subjects searched for a single MID intensity target among
many LOW and HIGH intensity distractors. Thus, both the target and distractor stimuli varied across
search conditions, leading to changes in both bottom-up and top-down effects. Indeed, bottom-up
guidance alone suffices to account for the previous results. Search for the MID intensity target is
slower as the target is not bottom-up salient (due to its similarity to both LOW and HIGH intensity
distractors). Whereas search for the HIGH intensity target is faster as it is more bottom-up salient (due
to the large difference from the LOW intensity distractors). Due to the entangling of top-down and
bottom-up effects, the above experiments cannot reveal the role of top-down guidance. We overcome
this confound by maintaining the background stimulus a constant, while varying only the target across
the search conditions. Thus, bottom-up factors remain nearly a constant, while the top-down factor
varies, allowing us to infer its role unambiguously.
Subset search: Classical conjunction searches (e.g., search for a red vertical among green vertical
and red horizontal bars) were known to be hard [156]. But later experiments revealed a distractor-
ratio effect [39, 76, 2, 142, 185], i.e., subjects tend to search in a dimension defined by the smaller
subset of distractors (i.e., if there are fewer red horizontal bars than green vertical bars, then subjects
focus on the color dimension and search through the red items). Thus conjunction search can become
efficient if subjects search through the smaller subset. Although these experiments demonstrate that
subjects can selectively attend to a feature within a dimension (e.g., red within the color dimension), they
do not indicate whether subjects can select an intermediate feature in a dimension.
Dimension weighting: Several studies in the past investigated top-down guidance to feature di-
mensions. Their results show that prior knowledge of the target dimension can facilitate search
[154, 102, 43, 82]. A prominent theory is the dimension weighting account [102, 43]. It suggests
that during search for a target, feature dimensions are weighted so that the known target dimension
is promoted. The experimental paradigm was as follows: In a within-dimension condition, the target
dimension was known and remained constant across trials, but its value varied within that dimension.
Whereas in the cross-dimension condition, the target dimension varied across trials. Treisman ob-
served shorter reaction times in the within-dimension condition than the cross-dimension condition.
Muller and colleagues observed such within-dimension facilitation even between successive trials.
This led to a dimension weighting account suggesting that the known target dimension receives a
higher weight compared to other unknown dimensions, thereby increasing the target’s activity in the
master map, resulting in faster search for the target. Although these studies suggest weights on fea-
ture dimensions (e.g., intensity), they do not suggest weights within a dimension. Here, we wish to
investigate the granularity of weights. For instance, is there one weight per dimension, or one weight
per feature interval within a dimension? In other words, can the coarse feature dimension weighting
account be extended to a finer feature interval weighting account?
Visual guidance in complex scenes: A recent study [123] investigates visual guidance to low-level
features in complex natural scenes. The experiment consists of the following paradigm – subjects first
preview a target patch (74x74 pixels) extracted from the image, and subsequently search for the target
in the image. Analysis of eye movement data reveals that subjects saccade to image regions that have
similar intensity, contrast, spatial frequency and orientation as the target. For instance, if the target has
MID intensity, there are more saccades to MID intensity regions of the image, than to LOW or HIGH
intensity regions. This difference in saccadic selectivity was assumed to reflect top-down guidance.
The author proceeded to compare the strength of guidance to different feature dimensions, showing
decreasing order of guidance for intensity, contrast, spatial frequency and orientation. The clever
experiment design allowed the author to break up each dimension into smaller intervals and infer the
spread of guidance through the distribution of saccades as a function of distance from target interval.
However, the above experiment suffers from the same confound as experiments on linear separability.
Across the LOW, MID and HIGH search conditions, the author varied not only the target, but also
the background image. Hence, the measured guidance reflects a combination of both top-down and
bottom-up effects, making it difficult to tease apart the contribution of top-down guidance. Indeed,
bottom-up effects were not controlled in that experiment. It was not verified whether the regions
similar to the target were bottom-up salient or not. Although the proportion of LOW, MID and HIGH
intensity regions was equal when pooled over all images, the proportion was not controlled within
a given search condition. It may have been possible that during search for the MID type target,
there were fewer MID type regions, thereby increasing their bottom-up salience, and yielding higher
saccadic selectivity. Indeed, the author confirms this by reporting feature-ratio effects, i.e., a feature
that is present in a smaller proportion in the image attracts a higher number of saccades. Such
bottom-up effects need to be controlled to allow unambiguous inference on the role of top-down
guidance. We achieve this by varying the target stimulus, while keeping the background a constant.
This alteration in the experimental design allows us to investigate top-down guidance without any
bottom-up confounds. More details are given in the next section.
3.3 Design and analysis of experiments
Contending hypotheses: In this chapter, we investigate the granularity of top-down signals by compar-
ing two competing hypotheses: (a) top-down guidance is coarse-grained vs. (b) it is fine-grained. As
mentioned earlier, the coarse-grained hypothesis is supported by several existing visual search the-
ories. For instance, the dimension weighting account [102, 43] of visual search behavior suggests
coarse-grained top-down guidance of only one gain control term per feature dimension (i.e., the gains
on all intervals within that dimension are equal, see ’a’ in figure 3.1). The competing hypothesis is
that top-down guidance is fine-grained and contains several gain control terms per feature dimension
(see ’b’ in figure 3.1) [123].
Testing the hypotheses: To test these hypotheses, we designed visual search experiments (see figure
3.1) where subjects searched for a target belonging to a fine-grained feature interval among distractors
belonging to many intervals within a feature dimension (e.g., within the intensity dimension, search
for a medium intensity target among distractors of low, medium and high intensities). Assuming that
attention is guided by a saliency map formed by summing feature maps, the coarse and fine-grained
hypotheses generate contradictory predictions on search behavior: According to the coarse-grained
hypothesis, the gains on all intervals are equal. Hence, all feature maps contribute equally to the
saliency map, resulting in equal salience of items of all intervals, yielding equal number of fixations
on each interval. In contrast, the fine-grained hypothesis predicts a higher gain on the relevant interval,
leading to an increased contribution of the relevant feature map, resulting in higher salience of items
of the relevant interval, thereby yielding higher number of fixations. Thus, the fine-grained hypothesis
predicts that items of the relevant fine-grained interval should be preferentially fixated. To test these
hypotheses, we design the following experiments.
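Before turning to the experiments, the two predictions can be illustrated with a toy saliency-map simulation. This is a sketch under simplifying assumptions: the display composition, the gain values, the noise level and the deterministic "fixate the most salient item" rule are all ours, not part of the experimental design.

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate(gains, n_items=30, n_trials=2000, noise=0.2):
    """Each trial: n_items split equally among LOW/MID/HIGH intervals.
    Salience of an item = gain on its interval + noise; the first fixation goes
    to the most salient item. Returns the fraction of first fixations per interval."""
    intervals = np.repeat([0, 1, 2], n_items // 3)        # 0=LOW, 1=MID, 2=HIGH
    counts = np.zeros(3)
    for _ in range(n_trials):
        salience = gains[intervals] + noise * rng.standard_normal(intervals.size)
        counts[intervals[np.argmax(salience)]] += 1
    return counts / n_trials

# Searching for a MID target:
coarse = np.array([1.0, 1.0, 1.0])   # one gain for the whole intensity dimension
fine   = np.array([1.0, 1.5, 1.0])   # a separate, higher gain on the MID interval
print("coarse gains ->", simulate(coarse).round(2))   # roughly [0.33, 0.33, 0.33]
print("fine gains   ->", simulate(fine).round(2))     # fixations concentrate on MID
```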
3.3.1 Experiment 1: Intensity
This section describes the design and analysis of eye movements to determine whether top-down guid-
ance can selectively enhance the relevant interval within the intensity dimension.
Design of the stimuli: The details of our experiments are as follows: The intensity dimension is
divided into 3 fine-grained feature intervals: LOW, MID and HIGH. The target and distractor stimuli
belong to one of LOW, MID or HIGH intervals, and the distractors are L shaped, while the target is
rotated by 180 degrees. This rotation enables recognition of the target, but disables preattentive guid-
ance (control experiments reveal search efficiency of 23 ms/item, indicating that the rotated L target
is preattentively indistinguishable from the upright L distractors).
Search conditions: There are three search conditions: LOW, MID and HIGH based on whether the
target interval belongs to LOW/MID/HIGH intervals within that dimension. To avoid confounds due
to stimulus-driven bottom-up factors, we maintain the same background stimulus (equal numbers of
LOW, MID and HIGH intensity distractors) across all 3 search conditions, and vary only the target.
For example, in the MID condition, subjects search for a MID intensity target among equal numbers
of LOW, MID and HIGH intensity distractors, while in the HIGH condition, they search for a HIGH
intensity target among equal numbers of LOW, MID and HIGH intensity distractors. Examples of our
displays are shown in figure 2a, 2b, 2c.
[Figure 3.1 schematic: panel (a) coarse top-down guidance and panel (b) fine top-down guidance; in each panel, LOW, MID and HIGH intensity feature maps are summed into a saliency map, with the predicted number of fixations per interval shown alongside.]
Figure 3.1: Testing the hypotheses: Consider searching for a MID intensity target (marked by a
yellow circle for illustration purposes) among LOW, MID and HIGH intensity distractors. Let the
display be processed by neurons that are tuned to LOW, MID and HIGH intensity intervals. The
feature maps corresponding to the LOW, MID and HIGH intensity intervals are added to form a
saliency map that subsequently guides attention. If top-down guidance were coarse, the gains on
LOW, MID and HIGH intensity intervals would be equal, resulting in equal salience of all items,
thereby yielding equal number of fixations on all intervals. In contrast, if top-down guidance were
fine-grained, the gain on the relevant MID intensity interval would be higher than LOW and HIGH,
resulting in higher salience of items in the MID interval, thereby yielding higher number of fixations
on the MID interval than LOW or HIGH.
Additional details of stimuli: Each item is 64x64 pixels in size (1.2°). To avoid spatial biasing, the
target and distractors can randomly appear anywhere in the invisible 5x5 grid that fills the search
array. Further, jitter is introduced by rotating each item randomly up to 5°, and random colored noise
is added to the display. Stimuli are presented on a 22” computer monitor (LaCie Corp; 1280x1024,
60.27Hz double-scan, mean screen luminance 30 cd/m², room 4 cd/m²). The search array (of size
1024x1024 pixels) is embedded on a black background and displayed at the center of the monitor
screen (1280x1024). The display is viewed at a distance of 80 cm and the viewing angle is 28°x21°.
The stimuli parameters are as follows: in the intensity dimension, LOW: 4.1 cd/m², MID: 21 cd/m²,
HIGH: 112 cd/m². These values of the LOW, MID and HIGH intervals are chosen according to Weber's
law. Examples of our displays are shown in figure 2. To avoid any confounds in inference due
to differences in other features, our stimuli are always designed to be identical in all irrelevant feature
dimensions and differ only in the intervals within the relevant feature dimension. Thus, in the intensity
experiments, all stimuli have the same size, color and orientation, and differ only in their luminance values.
Experimental organization: Subjects perform one search condition a day, for three consecutive days.
Each search condition lasts up to an hour and is comprised of a maximum of 10 blocks, containing 20
trials each. Subjects run as many blocks as they can (in the range of 8-10) within an hour. Subjects are
allowed to take a break in between blocks.
“No Cheat” scheme for response validation: Each trial begins with a central fixation for 250ms
followed by stimulus onset. Subjects search for the target as fast as possible and hit a key upon find-
ing it. Due to boredom or weariness or other factors, subjects may falsely report that they found the
target. To avoid such false positives, we introduce a novel “No Cheat” scheme: Upon the key press
indicating that the target was found in the display, we flash a grid of two-digit random numbers (of
size 0.6° each) for 120ms and ask the subject to report the random number that flashed at the target’s
location. Subjects could correctly report the number only if they were fixating at the target location.
Online feedback (’Correct’ or ’Wrong’) is provided to the subject based on whether the reported num-
ber matches with the flashed number. Only ’Correct’ trials (i.e., where the subject correctly reported
the number at the location of the target) are considered for analysis of eye movement patterns. Our
choice of the “No Cheat” paradigm instead of traditional target-absent trials was motivated by the fol-
lowing reasons: although target-absent trials yield more fixations per trial, they are more time consuming.
Besides, by validating the subject’s response on a per trial basis, the “No Cheat” paradigm provides
a better guarantee that subjects are actively biasing for the target on each and every trial. This also
minimizes data wastage by rejecting only the ’Wrong’ trials (instead of rejecting the entire block in
which it occurs).
Details of eye tracking: A 9-point eye tracker calibration is performed at the beginning of each block.
Each calibration point consists of fixating a central cross, then a blinking dot at a random point on a
3x3 matrix. The experiment is self-paced and the subjects can stretch before any 9-point calibration.
Subjects fixate on a central cross and press a key to start, at which point the trial begins. The eye
tracker records from the beginning of the display of the search array to the point when the key is
pressed. Each search array image is entirely pre-loaded into memory. Eye position is tracked using a
240Hz infrared-video-based eye-tracker (ISCAN, Inc). All analysis is performed offline.
Data cleaning: To verify whether top-down guidance can select the relevant fine-grained interval, we
analyzed the eye movement data of three subjects with normal or corrected vision, who participated
for course credit or volunteered. Blocks with bad eye tracker calibration were not considered for sub-
sequent analysis (0-4% data). Similarly, trials with too many blinks were discarded (0-7% data). As
mentioned previously, ’Wrong’ trials (incorrect report of the random number flashed at the target’s
location) were also discarded (0-3% data). It was very rare that subjects indeed found the target but
did not report the number correctly (from personal communication with subjects, this error varied
between 0-2% for different subjects). Subjects had to fixate on a central fixation at the beginning of
each trial (to avoid any subject biases towards specific spatial regions). Those trials in which subjects
began by fixating more than 3 degrees away from the center were also discarded (0-8% data). All
subsequent analysis was performed only on the remaining valid trials.
Reaction time: We computed reaction time (RT) as the time taken to find the target (time from stim-
ulus onset until key press). We compared RTs across the LOW, MID and HIGH search conditions.
As reported in earlier studies [38, 4], the RT was significantly higher (p < 0.05) in the MID condition than in the LOW or HIGH conditions (see figure 3a), replicating the results of previous studies.
Figure 3.2: Reaction time: This figure shows RT for all valid trials in the LOW, MID and HIGH search conditions within the intensity, size and color saturation dimensions. In all feature dimensions, search was slower in the MID condition than in the LOW or HIGH conditions, as predicted by the linear separability theory. Note that there is no speed-accuracy tradeoff here, as the RT was computed only over valid (correct) trials.
Saccadic selectivity: We measured the saccadic selectivity towards LOW, MID and HIGH intensity
intervals in the following manner: We parsed the eye movement patterns in the valid trials into fix-
ations and saccades, and assigned each fixation to the nearest item in the search array (sample eye
traces are shown in the first row of figure 4). Saccadic selectivity for an interval was computed as the
total number of fixations that were assigned to items belonging to that interval. For a given search
condition (e.g., search for LOW intensity target), we compared saccadic selectivity across different
intervals by pooling the trials across all blocks and subjects, and performing a paired t-test. Statistical
analysis revealed a significantly higher saccadic selectivity (p < 0.05) towards the relevant interval
than toward irrelevant intervals. For instance, in the MID condition, search for a MID intensity target led to more fixations on MID intensity items than on LOW or HIGH intensity items. This was consistent
for all search conditions (see second row of figure 4).
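The fixation-assignment step described above can be summarized in a short sketch. This is a hypothetical illustration rather than the analysis code used for the thesis; the data layout, function name and nearest-item rule (Euclidean distance, ties broken by the first minimum) are assumptions:

import numpy as np

def saccadic_selectivity(fixations, item_positions, item_intervals):
    """Assign each fixation to the nearest item in the search array and count
    fixations per feature interval ('LOW', 'MID', 'HIGH')."""
    counts = {"LOW": 0, "MID": 0, "HIGH": 0}
    positions = np.asarray(item_positions, dtype=float)
    for fx in np.asarray(fixations, dtype=float):
        d = np.linalg.norm(positions - fx, axis=1)      # distance from this fixation to every item
        counts[item_intervals[int(np.argmin(d))]] += 1  # the nearest item receives the fixation
    return counts

# Hypothetical example: three items on a horizontal line, two fixations.
items = [(0.0, 0.0), (5.0, 0.0), (10.0, 0.0)]
labels = ["LOW", "MID", "HIGH"]
print(saccadic_selectivity([(4.8, 0.2), (9.7, -0.1)], items, labels))  # {'LOW': 0, 'MID': 1, 'HIGH': 1}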
Strength of biasing: For a given search condition, we determined the strength of biasing as a function
of time, by computing the percentage of fixations on the relevant interval for each block. The third
row in figure 4 shows a plot of the strength of biasing as a function of time. Given that there are equal numbers of items in all three intensity intervals, each interval should attract 33.3% of fixations by chance. Yet, a t-test reveals that the percentage of fixations on the relevant interval is significantly higher (p << 0.01) than predicted by chance (see table 1). This reveals a clear effect of top-down guidance through selective enhancement of the relevant interval. The strength of biasing is also higher in the LOW and HIGH search conditions (95% confidence intervals [50.58, 57.9] and [53.30, 61.79], respectively) than in the MID condition (95% confidence interval [37.26, 44.78]). This suggests why the RT is lower in the LOW and HIGH conditions than in the MID condition (figure 3). Does the strength of biasing vary with time? To answer this question, we performed a one-way ANOVA (table 2). Results show that there is no main effect of time (in units of block number) on the strength of biasing.
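For illustration, the two statistics reported above (a t-test of the per-block percentage of fixations on the relevant interval against the 33.3% chance level, and a one-way ANOVA of that percentage over block number) can be sketched as follows. This is a hypothetical sketch assuming the per-block percentages are available as (block, percent) pairs; it is not the thesis analysis code:

from scipy import stats

def biasing_stats(percent_relevant_by_block):
    """percent_relevant_by_block: list of (block_number, percent of fixations on the
    relevant interval) pairs, pooled over trials, blocks and subjects."""
    blocks = sorted({b for b, _ in percent_relevant_by_block})
    pcts = [p for _, p in percent_relevant_by_block]

    # One-sample t-test against the 33.33% chance level (equal numbers of items per interval).
    t, p_chance = stats.ttest_1samp(pcts, 100.0 / 3.0)

    # One-way ANOVA: does the strength of biasing change with time (block number)?
    groups = [[p for b, p in percent_relevant_by_block if b == blk] for blk in blocks]
    F, p_time = stats.f_oneway(*groups)
    return {"t_vs_chance": t, "p_vs_chance": p_chance, "F_time": F, "p_time": p_time}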
3.3.2 Experiment 2: Size
To verify the generality of the top-down biasing effect observed in the intensity dimension, we re-
peated similar experiments and analysis on the size dimension.
Figure 3.3: Results in the intensity dimension: a) The first column shows results during search for a LOW intensity target. The sample eye trace illustrates that subjects tend to fixate on the relevant LOW intensity distractors. Statistical analysis of all trials reveals a significantly higher number of fixations on the relevant LOW intensity items (indicated by a yellow star) than on MID or HIGH items (paired t-test, p < 0.05). Statistical analysis of fixations as a function of time (in units of block number) reveals that the strength of biasing does not change with time (see table 2). b) Similar results are observed for the MID and HIGH conditions. As shown in the second column, when subjects search for a MID intensity target, they selectively fixate on the MID intensity distractors compared to LOW or HIGH. These results demonstrate that top-down signals can guide attention to the relevant interval within the intensity dimension.
Experimental design: In the size experiments, all stimuli have the same luminance, color saturation,
orientation and differ only in the size values. The values LOW: 0.6°, MID: 1.2°, HIGH: 2.4° are
                 LOW                     MID                     HIGH
Intensity        p < 10^-26              p < 10^-5               p < 10^-25
                 ci = [50.58, 57.9]      ci = [37.26, 44.78]     ci = [53.30, 61.79]
Size             p < 10^-20              p < 10^-6               p < 10^-12
                 ci = [54.32, 63.84]     ci = [36.22, 46.78]     ci = [49.94, 62.23]
Saturation       p < 10^-15              p < 10^-36              p < 10^-17
                 ci = [44.18, 51.36]     ci = [51.96, 58.39]     ci = [50.10, 59.34]
Table 3.1: Strength of biasing: For each dimension tested (intensity, size, color saturation), we find
the strength of biasing (computed as percentage of fixations on the relevant feature interval) in the
LOW, MID and HIGH search conditions. A t-test reveals that in each search condition the strength
of biasing is significantly higher (p << 0.01) than the baseline 33.33% predicted by chance. The
p-values and 95% confidence intervals (ci) of the strength of biasing are reported.
chosen according to Weber’s law. Other experimental details are similar to those in intensity.
Saccadic selectivity: As seen in figure 5, in all search conditions, there was significantly higher sac-
cadic selectivity (paired t-test, p< 0.05) towards the relevant size interval. For instance, in the MID
condition, during search for a MID sized target, there were more fixations on the MID sized items
compared to LOW or HIGH sized items.
Strength of biasing: The percentage of fixations on the relevant size interval was significantly higher
(t-test, p << 0.01) than the baseline 33.3% predicted by chance. In the LOW condition, the 95% confidence interval was as high as [54.32, 63.84]; in the HIGH condition, it was [49.94, 62.23]; and in the MID condition, it was [36.22, 46.78]. Thus, in all conditions, the strength of biasing was
significantly higher than by chance, thereby indicating strong effects of top-down guidance. As with
                 LOW                  MID                  HIGH
Intensity        F(8, 485) = 1.29     F(9, 491) = 1.16     F(9, 442) = 1.69
                 p = 0.2448           p = 0.322            p = 0.0912
Size             F(7, 442) = 1.45     F(8, 477) = 0.5      F(8, 447) = 1.35
                 p = 0.191            p = 0.8521           p = 0.2249
Saturation       F(9, 422) = 1.25     F(9, 461) = 0.6      F(8, 438) = 1.56
                 p = 0.2651           p = 0.8004           p = 0.1374
Table 3.2: Strength of biasing as a function of time: For a given dimension (e.g., intensity, size
or color saturation), and a given search condition (e.g., search for a target of LOW, MID or HIGH
feature interval), we determine whether the strength of biasing changes with time by performing a one-way ANOVA. Results across all conditions show that the strength of biasing does not change
significantly (p >= 0.05) with time (measured in units of block number ranging from 1 up to a
maximum of 10).
intensity, the results of a one-way ANOVA show that the strength of biasing did not change with time
(see table 2).
3.3.3 Experiment 3: Color saturation
Here, we repeat similar experiments and analysis on the color saturation dimension.
Experimental design: In these experiments, we require all stimuli to have the same luminance, orientation and size, and to differ only in color saturation. This is trickier because the perceived luminance of different color saturations is observer dependent. Hence, we run heterochromatic
photometry experiments [122] in which the observer adjusts the luminance values of two chromatic
lights presented in fast alternation (15-20 Hz), until it appears flicker-free. The stimuli thus generated
Figure 3.4: Results in the size dimension: a) The first column shows results during search for a
LOW size target. The sample eye trace illustrates that subjects tend to fixate on the relevant LOW
size distractors. Statistical analysis of all trials reveals a significantly higher number of fixations on the relevant LOW size items (indicated by a yellow star) than on MID or HIGH items (paired t-test, p <
0.05). Analysis of fixations as a function of time (measured in units of blocks from 1 to a maximum of
10) reveals that the strength of biasing does not change with time (see table 2). b) Similar results are
observed for the MID and HIGH conditions. As shown in the second column, when subjects search
for a MID size target, they selectively fixate on the MID size distractors compared to LOW or HIGH.
These results demonstrate that top-down signals can guide attention to the relevant interval within the
size dimension.
have the same luminance and size and differ only in color saturation (LOW: CIE x = 0.331, y = 0.363;
MID: CIE x = 0.453, y = 0.363; HIGH: x = 0.621, y = 0.363). Other experimental details are similar
to those in intensity.
Saccadic selectivity: As seen in figure 6, in all search conditions, there was significantly higher sac-
cadic selectivity (paired t-test, p< 0.05) towards the relevant saturation interval. For instance, in the
MID condition, during search for a MID saturated target, there were more fixations on the MID satu-
rated items compared to LOW or HIGH saturated items.
Strength of biasing: The percentage of fixations on the relevant saturation interval was significantly
higher (t-test, p << 0.01) than the baseline 33.3% predicted by chance. In the LOW condition, the 95% confidence interval was [44.18, 51.36]; in the MID condition it was [51.96, 58.39]; and in the HIGH condition it was [50.10, 59.34]. Thus, in all conditions, the strength of biasing was
significantly higher than by chance, thereby indicating strong effects of top-down guidance. As with
intensity and size, the results of a one-way ANOVA show that the strength of biasing did not change
with time (see table 2).
3.3.4 Control experiments
Could the observed results be due to covert attention / recognition only? One concern is whether
the observed saccadic selectivity for the relevant feature interval in experiments 1-3 is due to serial
scanning using covert attention and recognition, rather than due to parallel processes that provide top-
down guidance to the relevant feature interval. While some previous studies suggest that search is
serial [156], some others suggest that it is parallel [33], and yet others suggest that it is a mixture of
both [179, 7]. To address this issue in the context of our search experiments, we conducted additional
control experiments in the intensity dimension. We hypothesized that if the observed saccadic selec-
tivity is due to covert serial scanning processes only, then decreasing the presentation time to 120 ms
should eliminate the contribution of serial processes and eye movements [115, 164], hence selectivity
should disappear. On the other hand, if selectivity is due to a parallel, gain-based mechanism, then
even under brief presentation conditions, selectivity for the relevant feature interval should be high.
Figure 3.5: Results in the saturation dimension: a) The first column shows results during search for a
target with LOW saturation. The sample eye trace illustrates that subjects tend to fixate on the relevant
distractors of LOW saturation. Statistical analysis of all trials reveals a significantly higher number of fixations on the relevant items of LOW saturation (indicated by a yellow star) than on MID or HIGH items (paired t-test, p < 0.05). Analysis of fixations as a function of time (measured in units of blocks from 1 to a maximum of 10) reveals that the strength of biasing does not change with time (see table 2). b) Similar results are observed for the MID and HIGH conditions. As shown in the second column, when subjects search for a target with MID saturation, they selectively fixate on distractors with MID saturation compared to LOW or HIGH. These results demonstrate that top-down signals can guide attention to the relevant interval within the saturation dimension.
We tested this hypothesis through the following control experiments.
Design of control experiments: Similar to experiments 1-3, we ran three search conditions: LOW,
MID and HIGH, where subjects searched for a target belonging to the LOW, MID or HIGH inten-
sity interval respectively. Figure 6a shows a sample trial from the MID condition. Each trial began
with a central fixation (for 250 ms), followed by a brief presentation of the search array (for 120 ms).
The search array consisted of a 3x3 grid of items including one target (rotated L shape) and eight
distractors (L shape) belonging to different feature intervals. Pilot experiments in this brief display
paradigm revealed that the task was too difficult with a set size of 5x5 items (used in experiments 1-3)
and subjects became frustrated (search accuracy< 5%), hence we decreased the set size to 3x3 items.
Other parameters such as the size of the target and distractors, and inter-stimulus distance were the
same as in experiments 1-3. The search array was followed by a brief presentation of a grid of random
two digit numbers (as part of the “No Cheat” scheme described in section 2). Similar to experiments
1-3, subjects were instructed to find the target as fast as possible and report the number at its location.
Subjects received feedback on accuracy of target detection. This completed one trial.
Results: Figure 6b shows the results obtained from three naive subjects (who performed 30 blocks of
10 trials each). The task was not easy, as reflected by the low accuracy of target detection (computed
as % reports on the target) that varied between 35-45% across different search conditions and sub-
jects. Although search accuracy was low, all subjects showed a significantly higher number of reports on items belonging to the relevant interval than on irrelevant intervals (as determined by a paired t-test,
p value < 0.05). These results confirm that the underlying search mechanism in our experiments is
parallel (gain-based), rather than serial only.
3.4 Discussion
Reaction time: As reported in earlier studies [38, 4, 53], in all dimensions tested (intensity, size and
color saturation), the RT was significantly longer (p < 0.05) in the MID condition than in the LOW or
HIGH conditions (see figure 3). For instance, search for a medium sized target was slower than search
for a small or big target. This replicates the results of previous studies.
Granularity of top-down attention: In all dimensions tested (intensity, size and color saturation),
our results indicate that subjects could selectively fixate on items belonging to the relevant fine-grained
interval defined by the target. These results negate the coarse-grained hypothesis (figure 1a), which
predicts equal number of fixations on all intervals. Instead, they confirm the fine-grained hypothesis
(figure 1b) that indeed, top-down signals can select the relevant fine-grained interval within a dimen-
sion.
Did the target shape provide any guidance? Although there are 25 items in the display (8 distractors
each of LOW, MID and HIGH intervals, plus one target), the average number of fixations is fairly low,
between 3-6. This raises a concern of whether there was any special guidance due to the target shape.
But pilot experiments confirmed that the target was preattentively indistinguishable from the distrac-
tors (RT slope 23ms/item, indicating a hard search). This rules out any guidance due to the shape of
the target. Also, the low number of fixations (3-6 on average) is not an indicator of special guidance to
the target for the following reasons: Since most fixations occur only on items of the relevant interval
(see table 1), any model that scans randomly among the 8 items within the relevant feature interval
would also predict 4 fixations on average, which is in agreement with our observations. Thus, the
observed results reveal a clear effect of top-down selectivity for the relevant feature interval, rather
than special guidance to the target shape.
Reconciliation with previous data: There is an apparent contradiction between the fine-
grained hypothesis supported by our results, and the “linear separability theory” supported by previous
results [38, 4, 53]. The latter reports that search for a MID type target is slower than search for LOW
or HIGH type targets (figure 3), suggesting that top-down signals cannot select the MID interval. On
the other hand, the pattern of fixations observed in our results clearly indicates that top-down signals can select the MID interval. If indeed top-down signals can select the fine-grained MID interval, why is search
slower? This apparent conflict can be resolved by considering the following model of visual process-
ing: The incoming visual scene is analyzed in each feature dimension by a population of neurons with
broad and overlapping tuning curves. The activity of each such neuron is assumed to be modulated
by a top-down gain control (similar to figure 1b). According to this model, a MID type target can be
found by selectively promoting the neuron that responds maximally to the MID interval (henceforth
referred to as MID neuron). This results in increased salience of all items sharing the MID interval,
thereby attracting more fixations as shown in our results. However, since the MID neuron is broadly
tuned, it not only responds to the MID type target and distractors, but in addition, weakly responds to
the LOW and HIGH type distractors. The responses to LOW and HIGH type distractors interfere with
search for a MID target, leading to a slow search. A direct consequence of this model is that saccadic
selectivity for the MID interval increases as the spacing between LOW, MID and HIGH increases
(i.e., if the LOW and HIGH intervals are widely separated, the MID neuron will respond only to the
MID interval, thereby increasing the salience of MID interval items relative to LOW or HIGH). This
predicts faster RTs – a prediction that is consistent with existing behavioral reports [4].
Time-scale of biasing: Our results show that the strength of biasing does not change as a function of
time within a session. This suggests that the top-down bias that is set up initially during the training
period (first 20 trials in the session) does not change in the rest of the session (lasting up to an hour).
However, this does not rule out short-term priming [91] where the strength of biasing may improve
within a few trials. Nor does it rule out long-term priming or perceptual priming [8] effects where the
strength of biasing may improve over a period of days.
Implications for visual search behavior and performance: While previous studies based on RT
measures report that search in the MID condition is slower than LOW or HIGH conditions, they
do not reveal the underlying cause or granularity of top-down signals. Here, we used eye tracking
methods to infer top-down guidance by analyzing whether subjects fixate on the relevant fine-grained
interval or not. Based on the results of our study, we conclude that top-down signals carry fine-grained
information that can specify the relevant feature interval, rather than coarse-grained information that
can only specify the relevant feature dimension. Some implications are given below. Theories such
as dimension weighting accounts [102, 43] which suggest a single gain control term per dimension,
predict equal gain on LOW, MID and HIGH intervals within the dimension and hence cannot account
for the greater number of fixations on the MID interval in our MID condition. Clearly, such theories
need to be updated from a coarse grained, one gain factor per feature dimension to a fine grained, one
gain factor per feature interval. The conditions for efficient search should be revised: search should
be easy not only when the target and distractors differ in some dimension, but also when they differ
in some interval within a dimension. This model also accounts for some observed effects in search
asymmetry. For instance, faster search for a saturated red than desaturated red [157] can be explained
by the model as saturated red activates a HIGH interval, while desaturated red activates the MID in-
terval and the background activates the LOW interval.
Figure 3.6: Control experiments and results: a) Design of the control experiment: Search array
was presented for a brief duration (120ms only) to minimize the role of serial scanning processes.
Search array consisted of a 3x3 grid of equal number of items belonging to LOW, MID and HIGH
intensity intervals. In each search condition (LOW/MID/HIGH), the target was fixed. Subjects were
instructed to search for the known target and report the number at its location. The reports were
analyzed to determine the % reports on items of each intensity interval. Results of a paired t-test
showed a significantly higher number of reports on items of the relevant interval. For instance, when
subjects searched for a MID intensity target, there were more reports on items of the MID intensity
interval than LOW or HIGH. These results confirm the role of parallel gain-based guidance in our
search experiments.
Chapter 4
Optimal Integration of Top-down and Bottom-up Attention
during Visual Search
4.1 Introduction
Attentional guidance is a combination of bottom-up and top-down factors: It is well known that
attention is guided to both stimulus-driven (bottom-up salient [68]) and goal-driven (top-down rele-
vant [57]) locations and features [98, 100, 159]. Yet, the mechanisms by which the top-down relevance of features is determined and combined with bottom-up salience remain relatively unknown. Below, we
address one such outstanding question in the context of visual search.
Attentional modulation of feature gains: Imagine that you are on a safari. The guide cautions you
to beware of tigers hiding in the grasslands. Which visual features will you enhance or suppress in
order to quickly detect a tiger? Enhancing the typical yellow color of a tiger’s skin might seem like
a good strategy. Indeed, previous research [159, 100, 25, 93, 181, 165] on top-down attention suggests that attention enhances the neural representation of the target-defining features. For instance, the
feature similarity gain model [159] suggests that gains increase as a function of similarity between the
neuron’s preferred feature and the target feature. While this may be true in simple scenes where there
is no background clutter or the target and distractor features are very different, it may not apply to
more complex scenes where the distractor features are similar to the target. Here, we investigate the
optimal gain modulation strategy and ask whether humans deploy it. Understanding human feature
selection strategies is not only crucial for further progress in understanding top-down attention, but
may help in designing better robots and machines for active vision.
4.2 Related Work
Biased competition: In this section, we present a brief overview of visual search literature. It suggests
that multiple stimuli compete in a mutually suppressive manner to gain access to the limited resources
(such as representation, analysis, control) and attention biases this competition towards the salient and
behaviorally relevant locations or features. Although the details of the amount of top-down feature
bias are not formally specified, the general idea is that visual inputs that match the target description
(or “attentional template” [37]) are favored in the visual cortex [18]. In other words, the top-down competitive bias on a stimulus depends on its similarity to the “attentional template”, yielding a stronger bias on the target than on similar distractors, and a stronger bias on similar distractors than on dissimilar distractors [33]. This
theory has received much support from neurophysiology of spatial [88, 131, 77] and object-based at-
tention [25]. Several neurodynamic implementations of the biased competition hypothesis have also
been proposed [31, 47].
Feature similarity gain model: The biased competition model suggests that attentional modulation
of the competition is stronger when the multiple competing stimuli are near each other (e.g., within the
neuron’s receptive field), and negligible when they are far (e.g., opposite hemifields). However, recent
studies have shown strong feature-based attentional modulation effects in a spatially global manner
across the entire visual field (including the opposite hemifield) [159, 139]. These observations led to a
Figure 4.1: Overview of our model: The incoming visual scene A is analyzed in several feature dimensions (e.g., color and orientation) by populations of neurons with bell-shaped tuning curves. Within each dimension, bottom-up saliency maps (s_1(A) ... s_n(A)) are computed for different feature values and combined in a weighted linear manner to form the overall saliency map (S(A)) for that dimension. Given this model, how do we choose the optimal set of top-down gains (g_1 ... g_n) such that the target tiger becomes most salient among distracting clutter? Our theory shows that the intuitive choice of looking for the tiger's yellow feature would actually be suboptimal, because this would activate the distracting grassland more than the tiger. Instead, the optimal strategy would be to look for orange, which is mildly present in the tiger but hardly present in the grasslands, and hence best differentiates between the target and the distracting background.
simple and elegant “Feature similarity gain” model, where attention causes a multiplicative change in
the response gain of a neuron that depends on the similarity between its preferred feature (or location)
and the attended feature (or location). This theory has recently received more experimental support
[93, 7].
Feature gating: Cave [22] proposed a neural network implementation of the Guided Search model
[177] that combines both bottom-up and top-down influences. It consists of a hierarchy of spatial
feature maps and the flow of information is selectively gated from lower to higher levels of the visual
hierarchy. The top-down bias is applied by opening (or closing) gates at each level depending on the
similarity (or dissimilarity) between the target features and the features at that location. Thus, the
top-down component of this model enhances locations whose features are similar to the target.
Selective tuning model: Tsotsos [162] suggests that attention to a stimulus (location or feature)
causes selective tuning by triggering a cascade of top-down winner-take-all selection processes along the visual hierarchy. The attended stimulus (the most salient or task-relevant stimulus) is selected at the top; in the subsequent WTA selections at lower stages, the neural inputs that contribute most to the attended stimulus are selected, and interfering irrelevant signals are eliminated. Thus, attention causes selective tuning to the attended stimulus. The model includes a task-specific executive
controller that selects the task-relevant feature at the top. While the details of the task-specific feature
bias are not specified, they suggest that the working memory may store a target template and the WTA
selection may activate stimuli that resemble the target.
Other models: Several other models have been proposed. Hamker [47] suggests that prefrontal areas
might store a target template. Feedback connections from prefrontal to IT (and from IT to V4) may
enhance the activity of neurons whose visual input matches the target template. As a result of the
reentry signals, locations whose features are similar to the target are enhanced, while others are sup-
pressed. Rao and colleagues [126] proposed a saliency model to explain eye movements during visual
search. In their model, salience was computed as the Euclidean distance between a target template (a memorized vector of responses to the target stimulus) and the responses at each location.
Summary of previous models: Several models of top-down attention have been proposed earlier,
and all of them include a top-down biasing or feature selection process that enhances features that
are similar to the target. In the rest of this chapter, we investigate whether this target-similarity based
feature selection strategy is optimal. We formally derive the optimal top-down feature biasing strategy
and contrast it to the above target-similarity based approaches.
4.3 Model
Relevant objective function during visual search: We formally derive a theory of how prior sta-
tistical knowledge of the target and distractor features optimally influences feature gains. From a
theoretical standpoint, gains must be modulated in order to maximize search speed, which is a func-
tion of at least two critical variables: S_T(A), the mean perceived salience of target instances in the display A (formed as a result of combined top-down and bottom-up influences), and S_D(A), the mean perceived salience of distractor instances. The relative values of S_T(A) and S_D(A) determine visual search efficiency [65, 178]. Hence, the relevant goal for optimizing top-down gains is to maximize the signal-to-noise ratio (SNR), i.e., the ratio between signal strength (target salience) and noise strength (distractor salience). Such optimization renders the target more salient than the distractors in the display, thereby attracting attention [80] and decreasing search time [178].
Alternative objective functions: In section 4.5, we compare the results obtained by setting gains
according to different objective functions, such as maximizing the discriminability between the salience of the target and distractors vs. maximizing SNR.
4.3.1 A Theory of Optimal Feature Gain Modulation
Factors affecting salience: S_T(A) and S_D(A) are random variables that depend on the top-down gains, as well as on the following bottom-up factors: a) the values of the target and distractor features, Θ|T and Θ|D, in the display (sampled from the probability density functions p(Θ|T) and p(Θ|D), and possibly corrupted by external noise); b) the spatial configuration C of target and distractor items in the display; and c) the internal noise in the neural response, η. Thus, SNR = E_{Θ|T,C,η}[S_T(A)] / E_{Θ|D,C,η}[S_D(A)].
Framework: We formulate the optimal theory within the framework of a “consensus model” based
on current evidence in neurobiology and psychophysics [156, 80, 177, 159, 139] (figure 1). The
visual input is analyzed in each feature dimension (e.g., direction of motion) and spatial location by
a population of n neurons with overlapping tuning curves tuned to different feature values [32]. The i-th neuron (i ∈ {1...n}) is tuned to feature value μ_i, and its output is used to compute the bottom-up salience [69] s_i(x,y,A) at location (x,y) in search array A. The overall perceived salience S for a feature dimension is then computed as a function of the saliences s_i for feature values within that dimension. While many functions are possible, one of the simplest functions consistent with existing data is a linear combination of the s_i [69], weighted in a top-down manner by multiplicative gains g_i [52]:
\[ S(x,y,A) = \sum_{i=1}^{n} g_i \, s_i(x,y,A) \tag{4.1} \]
Global salience for a dimension as a function of local salience: Thus, the saliency map for a
dimension is computed as a weighted sum of saliency maps from all feature values, and is used to
guide attention. The salience of the target (S_T) can be computed as follows:
\[ E[S_T(A)] = E_{\Theta|T,C,\eta}\left[\sum_{i=1}^{n} g_i \, s_{iT}(A)\right] \tag{4.2} \]
\[ = \sum_{i=1}^{n} g_i \, E_{\Theta|T}\big[E_C[E_\eta[s_{iT}(A)]]\big] \tag{4.3} \]
(since η, C, and Θ are independent random variables) (4.4)
\[ E[S_D(A)] = \sum_{i=1}^{n} g_i \, E_{\Theta|D}\big[E_C[E_\eta[s_{iD}(A)]]\big] \quad \text{(similarly)} \tag{4.5} \]
\[ SNR = \frac{\sum_{i=1}^{n} g_i \, E_{\Theta|T}\big[E_C[E_\eta[s_{iT}(A)]]\big]}{\sum_{i=1}^{n} g_i \, E_{\Theta|D}\big[E_C[E_\eta[s_{iD}(A)]]\big]} \tag{4.6} \]
Maximizing SNR with respect to the gains: To maximize SNR, we differentiate it with respect to g_i:
\[ \frac{\partial\, SNR}{\partial g_i} = \left( \frac{E_{\Theta|T}\big[E_C[E_\eta[s_{iT}(A)]]\big]}{E_{\Theta|D}\big[E_C[E_\eta[s_{iD}(A)]]\big]} - \frac{\sum_{j=1}^{n} g_j \, E_{\Theta|T}\big[E_C[E_\eta[s_{jT}(A)]]\big]}{\sum_{j=1}^{n} g_j \, E_{\Theta|D}\big[E_C[E_\eta[s_{jD}(A)]]\big]} \right) \frac{E_{\Theta|D}\big[E_C[E_\eta[s_{iD}(A)]]\big]}{\sum_{j=1}^{n} g_j \, E_{\Theta|D}\big[E_C[E_\eta[s_{jD}(A)]]\big]} \tag{4.7} \]
\[ = \left( \frac{SNR_i}{SNR} - 1 \right) \alpha_i \tag{4.8} \]
where α_i is a normalization term and SNR_i = E_{Θ|T}[E_C[E_η[s_iT(A)]]] / E_{Θ|D}[E_C[E_η[s_iD(A)]]]. It is easy to show that g_i/g_{i0} (where g_{i0} = 1 is the default baseline gain) increases as SNR_i/SNR increases. With the added constraint that the gains must sum to a constant, \(\sum_{i=1}^{n} g_i = n\), the simplest solution is:
\[ g_i = \frac{SNR_i}{\frac{1}{n}\sum_{j=1}^{n} SNR_j} \tag{4.9} \]
Thus, the top-down gain on a visual feature depends on its signal-to-noise ratio (SNR_i).
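Equation 4.9 is simple enough to state as a two-line function: gains are proportional to the per-feature SNR_i and normalized so that they sum to n (equivalently, average to 1). The sketch below is only an illustration of the formula; the example SNR values are arbitrary:

import numpy as np

def optimal_gains(snr_i):
    """Eq. 4.9: g_i = SNR_i / ((1/n) * sum_j SNR_j), so that the gains sum to n."""
    snr_i = np.asarray(snr_i, dtype=float)
    return snr_i / snr_i.mean()

# Arbitrary example: features with above-average SNR_i are enhanced (g_i > 1),
# the rest are suppressed (g_i < 1), and the gains sum to n.
g = optimal_gains([0.5, 1.0, 4.0, 0.5])
print(g, g.sum())   # approx. [0.33 0.67 2.67 0.33], sum = 4.0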
Figure 4.2: Three phases of visual search: Phase 1) Combined bottom-up and top-down processing of the visual input: The top-down gains (Phase 3) derived from the observer's beliefs (Phase 2) are combined with bottom-up salience computations to yield the overall salience of the target and distractors. This determines search performance, measured by SNR. Phase 2) Acquiring a belief: The distributions of target and distractor features may be learned through estimation from past trials, preview of picture cues, verbal instructions, or other means. Phase 3) Generating the optimal top-down gains: The learned belief in target and distractor features is translated into a belief in the salience of the target and distractors, thus yielding SNR^b. The top-down gains are chosen so as to maximize SNR^b.
Top-down belief vs. true underlying feature distributions: The above theory assumes an ideal ob-
server who knows the true distribution of target and distractor features (p(Θ|T),p(Θ|D)). Instead, a
real observer may possess incomplete knowledge or a belief in the likely target and distractor features
(p(Θ^b|T), p(Θ^b|D)). This belief may be learned from a preview of picture cues [181, 165], verbal instructions (e.g., search for a “red” item) [181], or from observations of past trials [91] (see figure 2). In such cases, we assume that the observer can use an internal model to translate his/her belief in features into a belief in the salience of the target and distractors, S^b_T and S^b_D. In this extended framework, it is easy to show that the derivations remain identical, i.e., gains can be chosen so as to maximize SNR^b (the SNR derived from the top-down belief). The overall framework that integrates bottom-up salience with top-down beliefs is shown in figure 2.
4.3.2 Special cases
Deriving analytical expressions for specific visual search conditions: In this section, we attempt
to derive analytical expressions for gains in some common visual search conditions. To simplify the
expressions, we assume that the feature dimension is encoded by neurons with Gaussian tuning curves (f_i) whose preferred features (μ_i) vary continuously along the dimension. In the following equation, σ is the tuning width, a is the amplitude of the firing rate, and b is the background firing rate.
\[ f_i(\theta) = \frac{a}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(\theta-\mu_i)^2}{2\sigma^2}\right) + b \tag{4.10} \]
We further approximate the salience (s_i) by the raw neural response (r_i), which is a Poisson random variable with mean response f_i.
\[ E_{\Theta|T}[E_C[E_\eta[s_{iT}(A)]]] = E_{\Theta|T}[E_C[E_\eta[r_{iT}(A)]]] \tag{4.11} \]
\[ = E_{\Theta|T}[E_C[f_{iT}(A)]] \tag{4.12} \]
\[ = E_{\Theta|T}[f_{iT}(A)] \tag{4.13} \]
\[ E_{\Theta|D}[E_C[E_\eta[s_{iD}(A)]]] = E_{\Theta|D}[f_{iD}(A)] \quad \text{(similarly)} \tag{4.14} \]
\[ SNR_i = \frac{E_{\Theta|T}[f_{iT}(A)]}{E_{\Theta|D}[f_{iD}(A)]} \tag{4.15} \]
\[ g_i = \frac{SNR_i}{\frac{1}{n}\sum_{j} SNR_j} \tag{4.16} \]
Single known target, unknown distractor: We derive the optimal gains when the target is known
and consists of a single feature (P(Θ|T) is a Dirac Delta function), while the distractor is unknown
and may assume any feature with equal probability (P(Θ|D) is a uniform distribution).
\[ P(\Theta|T) = \delta(\theta_t) \tag{4.17} \]
\[ P(\Theta|D) = \frac{1}{\pi} \tag{4.18} \]
\[ E_{\Theta|T}[f_{iT}(A)] = \int_{\Theta|T} f_i(\theta)\, p(\theta)\, d\theta \tag{4.19} \]
\[ = \frac{a}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(\theta_t-\mu_i)^2}{2\sigma^2}\right) + b \tag{4.20} \]
\[ E_{\Theta|D}[f_{iD}(A)] = a + b \tag{4.21} \]
\[ SNR_i = \left[\frac{a}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(\theta_t-\mu_i)^2}{2\sigma^2}\right) + b\right] \Big/ (a+b) \tag{4.22} \]
\[ \text{Let } C_1 = \frac{n}{(a+b)\sum_{j} SNR_j} \tag{4.23} \]
\[ g_i = C_1 \left[\frac{a}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(\theta_t-\mu_i)^2}{2\sigma^2}\right) + b\right] \tag{4.24} \]
where C_1 is a normalization constant. Eqn. 4.24 shows that the gain on a neuron depends on the similarity between its preferred feature and the target feature. Thus, the expression for the optimal gains reduces to the “feature similarity gain model” [159].
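For illustration, this special case can be computed directly: with Gaussian tuning, SNR_i is proportional to the tuning-curve response to the target feature, f_i(θ_t), so after normalization the gain profile is a Gaussian bump centered on the target, i.e., the feature-similarity-gain profile. The parameter values below (σ, a, b and the preferred-feature grid) are arbitrary illustrations, not the thesis settings:

import numpy as np

def gains_known_target(theta_t, mu, sigma=5.0, a=100.0, b=1.0):
    """Optimal gains for a single known target feature among unknown distractors
    (eqs. 4.20-4.24), with the gains normalized to sum to n."""
    f_at_target = a / (sigma * np.sqrt(2 * np.pi)) * np.exp(-(theta_t - mu) ** 2 / (2 * sigma ** 2)) + b
    snr_i = f_at_target / (a + b)      # eq. 4.22
    return snr_i / snr_i.mean()        # eqs. 4.9 / 4.24: gains sum to n

mu = np.arange(0, 181, 10.0)           # preferred features of the population (hypothetical grid)
g = gains_known_target(theta_t=90.0, mu=mu)
print(mu[np.argmax(g)])                # 90.0: the neuron tuned to the target receives the largest gain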
Single known distractor, unknown target: In the opposite case where the distractor feature is known
and the target is unknown, we have the following expression for gains.
\[ P(\Theta|T) = \frac{1}{\pi} \tag{4.25} \]
\[ P(\Theta|D) = \delta(\theta_d) \tag{4.26} \]
\[ SNR_i = (a+b) \Big/ \left[\frac{a}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(\theta_d-\mu_i)^2}{2\sigma^2}\right) + b\right] \tag{4.27} \]
\[ \text{Let } C_2 = \frac{n(a+b)}{\sum_{j} SNR_j} \tag{4.28} \]
\[ g_i = \frac{C_2}{\frac{a}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(\theta_d-\mu_i)^2}{2\sigma^2}\right) + b} \tag{4.29} \]
where C_2 is a normalization constant. Thus, the gain of a neuron decreases as the similarity between its preferred feature and the distractor feature increases.
Known target and distractor: How do target enhancement and distractor suppression combine when
both the target and distractor features are known? Below, we consider the simplest case where both
the target and distractor consist of a single feature.
\[ P(\Theta|T) = \delta(\theta_t) \quad (\delta(\cdot)\ \text{is the Dirac delta function}) \tag{4.30} \]
\[ P(\Theta|D) = \delta(\theta_d) \tag{4.31} \]
\[ SNR_i = \left[\frac{a}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(\theta_t-\mu_i)^2}{2\sigma^2}\right) + b\right] \Big/ \left[\frac{a}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(\theta_d-\mu_i)^2}{2\sigma^2}\right) + b\right] \tag{4.32} \]
\[ \text{Let } \Delta_i = \frac{\theta_t - \mu_i}{\sigma} \tag{4.33} \]
\[ \text{Let } d' = \frac{\theta_d - \theta_t}{\sigma} \tag{4.34} \]
\[ \text{Let } C_3 = \frac{b\sigma\sqrt{2\pi}}{a} \tag{4.35} \]
\[ \text{Let } C_4 = \frac{1}{\frac{1}{n}\sum_{j} SNR_j} \tag{4.36} \]
\[ g_i = C_4 \, \frac{\exp(-\Delta_i^2/2) + C_3}{\exp\!\big(-(\Delta_i + d')^2/2\big) + C_3} \tag{4.37} \]
Figure 4.3: Optimal gains as a function of d' and Δ_i, computed according to eqn 4.37: When d' is high (e.g., d' ≥ 3), the maximum gain occurs at Δ_i = 0, i.e., when the target-distractor discriminability is high, a neuron that is tuned to the target feature is promoted maximally. However, when d' is low (e.g., d' = 0.5), the maximum gain occurs at Δ_i > 0, i.e., when the target-distractor discriminability is low, a neuron that is tuned to a non-target feature is promoted more than a neuron tuned to the target feature.
Thus, we obtain an expression for the optimal gains as a function of d' (the discriminability between the target and distractor features) and Δ_i (the distance between the target feature and the neuron's preferred feature, in units of the tuning width). For a given neuron, as d' increases, SNR_i increases and g_i increases. When d' is very high, we have:
\[ d' \gg \Delta_i \;\Rightarrow\; \Delta_i + d' \simeq d' \tag{4.38} \]
\[ \Rightarrow g_i \simeq C_4 \, \frac{\exp(-\Delta_i^2/2) + C_3}{\exp(-d'^2/2) + C_3} \tag{4.39} \]
\[ \propto \exp(-\Delta_i^2/2) + C_3 \tag{4.40} \]
Thus, when d' is very high, the gain of a neuron decreases as Δ_i (the distance between the target feature and the neuron's preferred feature) increases. In other words, the gains vary according to the feature similarity gain model. The neuron that is best tuned to the target (Δ_i = 0) contributes the maximum SNR_i, and consequently has the maximum gain. However, when d' is low, as shown in figure 4.3, the maximum SNR_i occurs at Δ_i > 0. Thus, a neuron that is sub-optimally tuned to the target has a higher gain than the neuron that is best tuned to the target feature.
Summary – Feature similarity gain model is not valid when the target is similar to the distractor:
When the distractor is unknown, or when the distractor is very different from the target (d' is high), the gains follow the feature similarity gain model. However, when the distractor is similar to the target (d' is low), the gains do not follow the feature similarity gain model. Instead, a neuron whose preferred feature is shifted away from both the target and the distractor feature has a higher gain than the neuron tuned closest to the target.
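This summary can be checked numerically from eq. 4.37: for a large target-distractor separation the gain peaks essentially at Δ_i = 0 (the target-tuned neuron), whereas for a small separation the peak shifts to Δ_i well above 0, i.e., to a neuron tuned away from both the target and the distractor. The sketch below is a minimal illustration; the value of C_3 (which depends on a, b and σ) is an assumption, and the constant C_4 is omitted since it does not affect where the maximum lies:

import numpy as np

def gain_profile(delta, d_prime, c3=0.1):
    """Unnormalized gains from eq. 4.37; the constant C_4 only rescales the profile."""
    return (np.exp(-delta ** 2 / 2) + c3) / (np.exp(-(delta + d_prime) ** 2 / 2) + c3)

delta = np.linspace(-5, 5, 1001)        # preferred-feature offset from the target, in tuning widths
for d_prime in (3.0, 0.5):
    best = delta[np.argmax(gain_profile(delta, d_prime))]
    print(f"d' = {d_prime}: gain is maximal at Delta_i = {best:.2f}")
# For d' = 3.0 the maximum lies essentially at the target-tuned neuron (Delta_i near 0);
# for d' = 0.5 it shifts to Delta_i well above 0, i.e., to an exaggerated, non-target feature.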
4.4 Results
Testing: In this section, we report the theory’s predictions on various search conditions through nu-
merical simulations on networks of neurons encoding features of the target and distractors. Subse-
quently, we test novel predictions of the theory through psychophysics experiments on human partic-
ipants.
4.4.1 Simulating Visual Search Conditions
Overview of simulations: To test the optimal feature gain modulation strategy, we perform detailed numerical simulations. For different search conditions and displays, we compute the bottom-up salience of the target and distractors, s_iT and s_iD, as a function of the true distributions of the target and distractor features, p(Θ|T) and p(Θ|D), using the saliency computations proposed by Itti & Koch [69]. Next, we apply the optimal top-down gains g_i derived from the observer's belief p(Θ^b|T), p(Θ^b|D) to the bottom-up saliency maps (s_i). Then we compute the overall saliences S_T and S_D, and the overall signal-to-noise ratio SNR (figure 2). The resulting SNR may be high and search may be efficient due to high bottom-up salience of the target relative to the distractors (e.g., a red target pops out among green distractors [156] since s_iT >> s_iD in the saliency map tuned to the red feature), or due to efficient top-down guidance to the target (e.g., a red target among randomly colored distractors becomes easy to find once subjects know that the target is red [36], since g_i >> 1 on the red feature), or both.
Simulation details: Additional details of the simulations described in section 4.4.1 are given be-
low. We simulate a simple model of early visual cortex as follows: Let f_i represent the bell-shaped tuning curve of the i-th neuron (with preferred feature value μ_i) in a population of n neurons with broad, overlapping tuning curves. Let the tuning width σ and amplitude a be the same for all neurons. Let r_i(θ) be the neural response to stimulus feature θ; r_i(θ) may be considered a Poisson random variable with mean f_i(θ) [144]. For simulation purposes, we compute the bottom-up salience s_i using the "classic" approach of weighting the local neural response r_i by the square of the difference between the maximum response MAX_i and the mean response MEAN_i in that map (for details, see section 2.3 in [69]). Thus, bottom-up salience is low if a feature map has several active locations (i.e., (MAX_i − MEAN_i)² ≈ 0), and is high if a feature map has few active locations (i.e., (MAX_i − MEAN_i)² > 0). We chose the following values for our simulation parameters: n = 100 (number of neurons in the population), σ = 5 (width of the Gaussian tuning curves), gap = 0.6σ (inter-neuron spacing in units of σ), a = 100 Hz (amplitude of the tuning curve), μ_i ∈ {0...300} (preferred feature of the i-th neuron), N = 3 (i.e., 1 target and N² − 1 = 8 distractors in the display).
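A compact, simplified sketch of one such simulation is given below, under the assumptions stated above (Gaussian tuning curves, Poisson responses, salience approximated by the response weighted by the squared max-minus-mean of its feature map, gains from eq. 4.9). It is a stand-in for the full Itti & Koch computation, with made-up display values, an assumed baseline rate b, and gains computed directly from the display (an ideal-observer shortcut); it is meant only to show how SNR = S_T / S_D is assembled from the pieces:

import numpy as np

rng = np.random.default_rng(0)                      # fixed seed for reproducibility (assumption)
n, sigma, a, b = 100, 5.0, 100.0, 1.0               # population size, tuning width, amplitude; b is assumed
mu = np.linspace(0, 300, n)                          # preferred features of the population

def tuning(theta):
    """Mean firing rate f_i(theta) of every neuron to stimulus feature theta (eq. 4.10)."""
    return a / (sigma * np.sqrt(2 * np.pi)) * np.exp(-(theta - mu) ** 2 / (2 * sigma ** 2)) + b

# One display: a 55-unit target among eight 50-unit distractors (features in arbitrary units).
features = np.array([55.0] + [50.0] * 8)
rates = rng.poisson(np.array([tuning(th) for th in features]))   # (items x neurons) Poisson responses

# Bottom-up salience per feature map: local response weighted by (MAX_i - MEAN_i)^2 of that map.
w = (rates.max(axis=0) - rates.mean(axis=0)) ** 2
s = rates * w                                                    # s_i for every item and neuron

# Optimal gains (eq. 4.9), here computed from the display itself as an ideal-observer shortcut.
snr_i = (s[0] + 1e-9) / (s[1:].mean(axis=0) + 1e-9)
g = snr_i / snr_i.mean()

# Overall salience per item (eq. 4.1) and the resulting signal-to-noise ratio.
S = s @ g
print("SNR =", S[0] / S[1:].mean())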
Results: Figure 4.4 shows the results of our simulations for different search conditions. Figure 4.4a,
b, c together show that for a given target and distractor stimulus, better prior knowledge of their fea-
tures (or decreased uncertainty) allows the relevant features to be primed, thus leading to higher SNR
and faster search. These results are in qualitative agreement with existing psychophysics literature on
the role of uncertainty in target features [181, 165], and the role of feature priming [143, 91, 178].
Figure 4.4d shows that knowledge of the target (only) improves SNR by enhancing the target features.
Evidence for such target-based enhancement has been observed in single unit recordings in MT and
is consistent with the feature similarity gain model [159]. In addition, psychophysics studies provide
evidence that knowledge of the target accelerates search performance [165]. Figure 4.4e predicts that
knowledge of the distractor also improves search by suppressing the distractor features. Partial ex-
perimental evidence comes from studies that show decreased responses to the distractor feature (in
MT [93], in FEF [9]), and psychophysics studies that show a benefit in search performance due to
knowledge of distractors [91, 17].
[Figure 4.4 layout: each row (a-h) shows, from left to right, the true distribution of target and distractor features, the observer's belief, the optimal gains g_i superimposed on the distributions, the resulting SNR, and remarks with experimental evidence.]
a) NO PRIMING / TOP-DOWN GUIDANCE: In the absence of any knowledge of T and D features, the observer believes that all features are equally likely, hence the optimal gains are set to baseline activation (unity). There is no top-down benefit / priming. SNR is low and search is slow compared to when the observer knows the features [8, 9, 11, 19, 21].
b) MINIMAL PRIMING / TOP-DOWN GUIDANCE: The observer learns a partial distribution of T and D features from the initial trials or word precues. Due to the uncertainty, minimal biasing occurs. SNR increases by 2.2dB compared to the previous condition (a) and search becomes faster [8, 9, 11, 19, 21].
c) MAXIMAL PRIMING: Repetition of T and D features in a blocked condition, or an exact picture cue, allows the observer to completely learn their true distribution. This allows maximal priming. SNR increases by 5.5dB and search becomes most efficient [8, 9, 11, 19, 21] compared to the no knowledge condition (a).
d) TARGET ENHANCEMENT: Knowledge of T through repetition of T features or a precue allows target enhancement [22]. Search is faster as SNR increases by 5.4dB compared to the baseline condition (a) [9, 11]. A similar distribution of gains has also been observed in feature-based attention studies, e.g., the feature similarity gain model [5].
e) DISTRACTOR SUPPRESSION: Knowledge of D through repetition of D features allows distractor suppression [22, 23]. Search is faster as SNR increases by 4.4dB compared to the no knowledge condition (a) [19]. Neurons are suppressed according to the similarity between their preferred feature and D's feature.
f) EFFECT OF DISTRACTOR HETEROGENEITY: Increasing distractor heterogeneity decreases search speed as SNR decreases by 9.7dB compared to the homogeneous distractors condition (c) [24]. In this condition, where T is linearly separable from D, knowledge of T and D yields a benefit of 3.3dB.
g) EFFECT OF LINEAR SEPARABILITY: If T is not linearly separable from D, SNR further decreases by 5.9dB compared to the above linearly separable condition [25-27]. In this condition, knowledge of T and D yields a smaller benefit of 0.5dB compared to a benefit of 3.3dB in the above condition (f) [25].
h) EFFECT OF TARGET-DISTRACTOR DISCRIMINABILITY: During search for a less discriminable T among D, SNR decreases by 18.4dB compared to the highly discriminable condition (c). Search becomes very slow [14, 24, 28-30]. Further, a neuron that is sub-optimally tuned to the target's feature is boosted maximally, compared to a neuron that is optimally tuned (see our experiments).
Figure 4.4: Simulation results for a variety of search conditions (shown in different rows): The first column shows the true distribution of the target (T) features (p(Θ|T), solid red) and distractor (D) features (p(Θ|D), dashed blue), and the second column shows the observer's belief (p(Θ^b|T), p(Θ^b|D)). The third column shows the optimal distribution of neural response gains superimposed over p(Θ|T), p(Θ|D). The fourth column shows SNR, followed by the implications of our results, along with experimental evidence. For example, row 'a' illustrates how a lack of prior knowledge prevents any top-down guidance of search. Let the true distributions p(Θ|T) and p(Θ|D) peak at different values, e.g., a red target among green distractors. When T and D are unknown, the beliefs p(Θ^b|T), p(Θ^b|D) are uniform distributions with all features being equally likely. Hence, the optimal gains are set to baseline (g_i = 1, i ∈ {1...n}). Remarks and supporting experimental evidence for the remaining search conditions (rows a-h) are shown in the fifth column in this figure. Our theory is able to formally predict several effects in visual search behavior which have been previously studied empirically [181, 165, 37, 119, 37, 103, 155, 177, 38, 4, 53].
Distractor heterogeneity: Figures 4.4c and 4.4f together demonstrate the effect of distractor het-
erogeneity [37], i.e., search efficiency decreases as the number of types of distractors increases (e.g.,
searching for a red target among blue, green, yellow and white distractors is harder than searching
for a red target among green distractors). Consistent with this effect, our simulations show that SNR
decreases from 23.0dB (figure 4.4c, homogeneous distractors) to 13.3dB (figure 4.4f, heterogeneous
distractors), resulting in slower search due to increased distractor heterogeneity.
Linear separability: A comparison of figures 4.4f and 4.4g reveals the linear separability effect, i.e.,
search for a target flanked by distractor features (as shown in figure 4.4g) is harder than search for
a target that is linearly separable from distractors in feature space (as shown in figure 4.4f). This
effect has been demonstrated in features such as size, chromaticity and luminance [53, 38, 4]. For
example, search for a medium sized target among small and big distractors is known to be harder than
search for a big target among small and medium sized distractors [53]. Our simulation results are
consistent with this effect and show a decline in SNR from 13.3dB (figure 4.4f, linearly separable
target), to 7.4dB (figure 4.4g, target that is not linearly separable). Furthermore, in agreement with
psychophysics [53], our simulations reveal a greater top-down benefit of knowing the target and dis-
tractors in the linearly separable condition (3.3dB in figure 4.4f) than otherwise (0.5dB in figure 4.4g).
Target-distractor discriminability: One of the classic effects in visual search behavior is that search
efficiency decreases as target-distractor discriminability decreases [119, 37, 103, 155, 177]. Figures
4.4c and 4.4h demonstrate this effect. While SNR is high (23.0dB) when the target and distractor features are very different (e.g., a 55° oriented target among 25° oriented distractors, as shown in figure 4.4c), SNR drops to as low as 4.6dB when the target and distractor features are similar (e.g., a 55° oriented target among 50° oriented distractors, as shown in figure 4.4h).
4.4.2 Psychophysics experiments
New prediction: Notably, our theory makes a new prediction that during search for a less discrim-
inable target among distractors, an exaggerated target feature is promoted more than the exact target
feature (see figure 4.4h). Though seemingly counter-intuitive, this occurs since a neuron that is tuned
to an exaggerated target feature provides a higher SNR_i (as it responds much more to the target than to the distractor), whereas a neuron that is tuned to the exact target feature provides a lower SNR_i (as it
responds similarly to the target and distractor). This is shown in figure 4.5. To validate this claim, we
conduct new psychophysics experiments that are designed in two phases: 1) to set up the top-down
bias and 2) to measure the bias.
Figure 4.5: Comparison of SNR_i: When the target feature (shown in solid red) is similar to the distractor feature (shown in dotted blue), neuron 2, which is tuned to an exaggerated feature, provides a higher SNR_i than neuron 1, which is tuned to the exact target feature.
Setting up the top-down bias: To set up the top-down gains, we ask subjects to perform the primary
task T_1, which is a hard visual search for the target (a 55° tilted line) among several distractors (50° tilted lines). A typical T_1 trial is shown in figure 4.6a: it starts with a fixation, followed by the search array. Upon finding the target among the distractors, subjects press any key. To ensure that subjects bias for the target among distractors on each and every trial, we introduce a No Cheat scheme (see the legend of figure 4.6a). Subjects are trained on T_1 trials until their performance stabilizes with at least 80%
accuracy. Thus, the top-down bias is set up by performing T_1 trials.
Measuring the top-down bias: To measure the top-down gains generated by the above task, we randomly insert T_2 trials in between T_1 trials (figure 4.6a). Our theory predicts that during search for
Figure 4.6: a) Experimental design: We test the theory's prediction of the top-down bias during search for a low-discriminability target among distractors (figure 4.4h). The top-down bias is set when subjects perform T_1 trials. After a random number of T_1 trials, the top-down bias is measured in a T_2 trial. A T_1 trial consists of a fixation followed by a search array containing one target (55°) among several distractors (50°). Subjects are instructed to report the target as soon as possible. The subject's response is validated on a per-trial basis through a novel No Cheat scheme that is described in the Methods section. A T_2 trial consists of a fixation, followed by a brief display of five items representing five features, and then by five fineprint random numbers. Subjects are asked to report the number at the target location. b) Experimental results: We ran 4 subjects (3 naïve), aged 22-30, with normal or corrected vision, with IRB approval. The T_2 trials were analyzed to find the number of reports on the 30°, 50°, 55°, 60° and 80° features. The number of reports on the relevant feature (60°, marked by a golden star) is significantly higher (paired t-test, p < 0.05) than the number of reports on the target feature (55°). c) Controls: In a control experiment, we maintained the same target feature but reversed the distractor feature. In the T_1 trials, subjects now searched for the 55° oriented target among 60° oriented distractors. Everything else, including the T_2 trials, instructions and analysis, remained the same. Statistical analysis of the number of reports showed a reversal in trend compared to b), with a significantly higher number of reports on the currently relevant feature (50°, marked by a golden star) than on the target feature (55°).
the target (55°) among distractors (50°), the most relevant feature will be around 60° and not 55°. To test this, we ask subjects to "find the target" in a brief display (300 ms) of five items representing five different features: steepest (80°), relevant as predicted by our theory (R, 60°), target (T, 55°), distractor (D, 50°) and shallowest (30°). The display is brief and its occurrence is unpredictable, in order to minimize any alteration of the top-down gains set up by the T_1 trials. If the top-down gain on a feature is higher than on the other features, then that feature should appear more salient, draw attention and hence be reported. Thus, although subjects search for the target, our theory predicts a higher number of reports on the relevant feature R than on the target feature T (since R has a higher top-down bias than T).
Experimental details: Additional details of the psychophysics experiments described in section 4.4.2 are given below. Subjects were naïve to the purpose of the experiment (except one) and were USC students (2 females, 2 males, mixed ethnicities, ages 22-30, normal or corrected-to-normal vision). Informed written consent was obtained from all subjects, and they either volunteered or participated for course credit. All experiments received IRB approval. Stimuli were presented on a 22" computer monitor (LaCie Corp; 640x480, 60.27 Hz double-scan, mean screen luminance 30 cd/m², room 4 cd/m²). Subjects were seated at a viewing distance of 80 cm (52.5° x 40.5° usable field of view) and rested on a chin-rest. Stimuli were presented on a Linux computer under SCHED_FIFO scheduling, which ensured microsecond-accurate timing.
Intermix of T1 and T2 trials: In the experiment shown in figure 4.6, the top-down bias is set when subjects perform T1 trials. After a random number of T1 trials, the top-down bias is measured in a T2 trial. A T1 trial consists of a fixation for 500 ms followed by a search array containing one target (55°) among 25 distractors (50°). Subjects are instructed to find the target as soon as possible and press any key. The time until keypress varied anywhere between 500-7000 ms. To verify that subjects indeed find the target on every trial, we introduce a novel No Cheat scheme: following the key press when the subject finds the target, we flash a grid of fineprint random numbers briefly (120 ms) and ask subjects to report the number at the target's location. The briefness of the display ensures that subjects find the target and fixate it in order to report the number correctly. Online feedback on the accuracy of the report is provided. Unlike the conventional use of target-absent trials, which cannot isolate individual trials with invalid responses, our No Cheat scheme allows validation of the subject's response on a trial-by-trial basis. Subjects receive training on this experiment until they achieve at least 80% accuracy. During testing, a block is rejected if the accuracy falls below 80%. A T2 trial consists of a fixation for 500 ms, followed by a brief display of five items representing five features (300 ms), and by five fineprint random numbers. The task is the same as in the T1 trials: subjects are asked to report the number at the target location. Each subject performed 10 blocks of 50 trials each, with 160 T2 trials randomly inserted in between 340 T1 trials. For each of the four subjects, the reports on the 160 T2 trials were analyzed using a paired t-test (p < 0.05) to compare the number of reports on the 30°, 50°, 55°, 60°, 80° features.
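For illustration, a minimal Python sketch of this per-subject analysis is given below. The report counts are hypothetical placeholders, and scipy's paired t-test simply stands in for whatever statistics package was actually used.

# Minimal sketch of the T2 analysis described above. The report counts are
# hypothetical placeholders (rows = subjects, columns = probed features).
import numpy as np
from scipy.stats import ttest_rel

features = [30, 50, 55, 60, 80]             # probed orientations (degrees)
reports = np.array([[ 5, 20, 45, 70, 20],
                    [ 8, 25, 40, 65, 22],
                    [ 6, 22, 48, 68, 16],
                    [ 7, 18, 42, 72, 21]])  # hypothetical counts out of 160 T2 trials

target_idx = features.index(55)             # target feature T
relevant_idx = features.index(60)           # theoretically relevant feature R

# Paired t-test across subjects: reports on R vs. reports on T
t_stat, p_value = ttest_rel(reports[:, relevant_idx], reports[:, target_idx])
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # the theory predicts more reports on R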
Results: Experimental results across all subjects indicate a significantly (p < 0.05) higher number of reports on R than on T (figure 4.6b). As predicted by our theory, subjects could not help but be attracted towards R although the task was to search for T. In additional controls, when the distractor feature was reversed (60°) while the target remained the same (55°), the same subjects showed a reversal in the trend of biasing (described in figure 4.6c). Similar results were obtained in the color dimension as well (see figure 4.7). Our results provide experimental evidence that humans may deploy optimal top-down feature gain modulation strategies.

Figure 4.7: a) Experimental design: We test the theory's prediction of top-down bias in the color dimension. The experimental design is similar to figure 4.6. The target has a medium green hue (CIE x=0.24, y=0.42), while the distractor is either more green (x=0.25, y=0.45, figure 4.7b) or less green (x=0.23, y=0.38, figure 4.7c), and the irrelevant controls are yellow (x=0.42, y=0.50) and blue (x=0.21, y=0.27). b) Experimental results: We ran 3 subjects (naïve), aged 22-30, normal or corrected vision, with IRB approval. The T2 trials were analyzed to find the number of reports on the yellow, more green, medium green, less green and blue features. When subjects searched for a medium green target among less green distractors, as predicted by the theory, there were significantly more reports (paired t-test, p < 0.05) on the more green feature than on the target feature. c) Controls: In a control experiment, we maintained the same target feature, but reversed the distractor feature. Now, subjects searched for a medium green target among more green distractors. Statistical analysis of the number of reports showed a reversal in trend compared to b), with significantly more reports on the less green feature than on the target feature. These results support optimal feature biasing as suggested by our theory.

4.5 Alternative objective functions

Maximizing SNR: We have shown that a simple function such as the ratio of the expected salience of the target over that of the distractors is sufficient to account for most visual search data. For a fixed ratio of means, when the target and distractor feature distributions are narrow, as shown in figures 4.4b and 4.4c, SNR increases compared to when the feature distributions are wide. Thus, variance in the target and distractor features is implicitly encoded in SNR. Below, we compare our SNR measure against D'.

Maximizing discriminability: We explore another objective function, D', the discriminability between the salience of the target and that of the distractors, defined as follows:

D' = \frac{E[S_T(A)] - E[S_D(A)]}{\sqrt{0.5\,(V[S_T(A)] + V[S_D(A)])}}    (4.41)

where V[\cdot] refers to the variance. Using the additive hypothesis in eqn 4.1 (i.e., assuming that salience adds across the different saliency maps), we get the following:

D' = \frac{E[\sum_i g_i s_{iT}(A)] - E[\sum_i g_i s_{iD}(A)]}{\sqrt{0.5\,(V[\sum_i g_i s_{iT}(A)] + V[\sum_i g_i s_{iD}(A)])}}    (4.42)

   = \frac{\sum_i g_i\,(E[s_{iT}(A)] - E[s_{iD}(A)])}{\sqrt{0.5\,\sum_i g_i^2\,(V[s_{iT}(A)] + V[s_{iD}(A)])}}    (4.43)

(assuming s_{iT}, s_{jT} and s_{iD}, s_{jD} are independent random variables)    (4.44)
Differentiating D' wrt g_i yields the following:

\frac{\partial D'}{\partial g_i}\Big|_{g_i=1} = \left(\frac{t_i}{T} - 1\right)\alpha_i    (4.45)

where t_i = \frac{E[s_{iT}(A)] - E[s_{iD}(A)]}{V[s_{iT}(A)] + V[s_{iD}(A)]}    (4.46)

where T = \frac{\sum_j \big(E[s_{jT}(A)] - E[s_{jD}(A)]\big)}{\sum_j \big(V[s_{jT}(A)] + V[s_{jD}(A)]\big)}    (4.47)

where \alpha_i = \frac{\sqrt{\tfrac{1}{2}\sum_j \big(V[s_{jT}(A)] + V[s_{jD}(A)]\big)}}{T \times \big(V[s_{iT}(A)] + V[s_{iD}(A)]\big)}    (4.48)

From eqn 4.45, it is easy to show that \frac{g_i}{g_{i0}} (where g_{i0} = 1 is the default baseline gain) increases as \frac{t_i}{T} increases. Assuming the monotonic relationship to be linear, and with an added constraint that the gains must sum to a constant, \sum_{i=1}^{n} g_i = n, the simplest solution is:

g_i = \frac{t_i}{\frac{1}{n}\sum_{j=1}^{n} t_j}    (4.49)
Comparison of different objective functions: To compare SNR and D', we ran simulations (as described in section 4.4.1) and compared the predictions of search performance for different target and distractor feature distributions. For computing the top-down gains in these simulations, we assumed that the salience s_i could be approximated by the raw neural response r_i. While computing D', we further assumed that the neural firing rate followed a Poisson distribution, hence the variance V[\cdot] equals the expectation E[\cdot]. The top-down gains were combined with bottom-up salience (as computed in section 2.8 in [69]) to compute the overall salience. As shown in figure 4.8, our SNR measure effectively captures psychophysical behavior in several search conditions, while D' fails in some cases. This suggests that SNR is the relevant objective function to be optimized for improving visual search behavior.

Figure 4.8: Comparison of different objective functions: These simulations compare the search performance when gains are modulated to maximize SNR (the ratio of expected target salience relative to expected distractor salience) vs. D' (the discriminability between target and distractor salience). The first two columns illustrate different search conditions (each denoted by a particular distribution of the target feature P(Θ|T), shown in solid red, and of the distractor feature P(Θ|D), shown in dotted blue). According to previous psychophysics studies, the search condition illustrated in the first column is known to be more difficult than its counterpart in the second column. While maximizing SNR successfully accounts for this difference (as shown in the third column, the ratio of SNR values in the easier vs. the difficult condition is >1 in all cases: a) 1.25, b) 1.05, c) 1.18, d) 4.87, e) 1.58, f) 3.07, g) 1.86), maximizing D' fails in some cases (as shown in the fourth column, the corresponding D' ratios are a) 14.01, b) 3.56, c) 4.18, d) 3.65, e) 0.57, f) 0.93, g) 0.65, i.e., <1 in conditions e, f, g). This validates our choice of SNR as the relevant objective function.
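To make the two objectives concrete, here is a small Python sketch that computes the per-feature signal-to-noise ratios SNR_i and the t_i terms of eqn 4.46 for a population of orientation-tuned units, assuming Gaussian tuning curves and Poisson variability (V = E); both gain rules are normalized to a population mean of 1, as in eqn 4.49. The tuning parameters are illustrative assumptions, not the values used in the simulations reported above.

# Sketch comparing the per-feature quantities behind the two objectives, assuming
# Gaussian tuning curves and Poisson variability (variance = mean). All parameter
# values are illustrative assumptions.
import numpy as np

prefs = np.arange(20.0, 90.0, 5.0)      # preferred orientations of the population (deg)
sigma = 15.0                            # assumed tuning width (deg)
theta_T, theta_D = 55.0, 50.0           # target and distractor orientations

def mean_response(theta):
    # baseline + Gaussian tuning; stands in for the expected response of each unit
    return 1.0 + 9.0 * np.exp(-(prefs - theta) ** 2 / (2.0 * sigma ** 2))

rT, rD = mean_response(theta_T), mean_response(theta_D)   # E[s_iT], E[s_iD]
vT, vD = rT, rD                                           # Poisson assumption: V = E

snr_i = rT / rD                      # per-feature signal-to-noise ratio
g_snr = snr_i / snr_i.mean()         # SNR-based gains, normalized to mean 1

t_i = (rT - rD) / (vT + vD)          # eqn 4.46
g_dprime = t_i / t_i.mean()          # eqn 4.49

print("SNR-based gains peak at", prefs[np.argmax(g_snr)], "deg")
print("D'-based gains peak at", prefs[np.argmax(g_dprime)], "deg")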
4.6 Discussion
Summary: Several theories of visual search have been proposed in the past – while some attempt to explain the behavior of the organism (e.g., Feature Integration theory, Guided Search theory), others attempt to account for the single unit responses (e.g., Feature Similarity Gain model, Feature Matching hypothesis). Here, by modulating the gains such that behavioral performance (quantified in terms of SNR) is optimized, we provide a simultaneous account of the search behavior of the organism, as well as of the neural gains at the single unit level. Specifically, we suggest that gains are modulated so as to optimize the salience of the target relative to the distractors (which we refer to as the signal-to-noise ratio, SNR). Such optimization of SNR increases both search accuracy and speed. The theory makes a number of testable predictions at the single unit and behavioral levels, and bears implications for electrophysiology, brain imaging and psychophysics of visual search.
Boosting features similar to the target is sub-optimal in some cases: While several models of attention have been proposed in the past, most of them include a top-down component that biases features according to their similarity to the target [33, 31, 47, 159, 15, 22, 162, 126]. For instance, one of the prominent models, the "Feature Similarity Gain model", suggests that the gain on a neuron encoding a visual feature depends on the similarity between the neuron's preferred feature and the target feature. We show that this is a special case of our general theory, which occurs whenever the target feature differs substantially from the distractor feature. Thus, previous experiments with different target and distractor features (e.g., experiments by Bichot et al. (2005) in the color dimension in FEF, and Treue & Trujillo (1999) in direction of motion in MT) that provide evidence for the feature-similarity gain model also provide evidence for our theory. In addition, we show examples of search conditions where the former strategy of enhancing target features is sub-optimal. For instance, when the target and distractor features are similar (e.g., a 5° difference in orientation), neurons tuned to the target respond to the distractor as well (providing a lower SNR_i); hence enhancing such neurons increases the response to the distractor, which is undesirable for performance. On the other hand, a neuron that is tuned to an exaggerated target feature responds much more to the target relative to the distractor, and provides a higher SNR_i than a neuron that is tuned to the exact target feature. Hence, the optimal strategy is to boost a neuron tuned to the exaggerated target feature, and not the exact target feature. This effect has also been reported in discrimination tasks, where a neuron tuned to an exaggerated stimulus feature carries higher Fisher information than a neuron that is tuned to the exact stimulus feature [85]. To the best of our knowledge, this is the first study to demonstrate a similar effect during visual search.
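A back-of-the-envelope numerical sketch of this argument, under an assumed Gaussian tuning width (the numbers are purely illustrative):

# Why boosting an "exaggerated" feature can beat boosting the exact target feature
# when target (55 deg) and distractor (50 deg) are similar. Gaussian tuning with an
# assumed width of 20 deg; values are illustrative only.
import numpy as np

sigma = 20.0
target, distractor = 55.0, 50.0

def response(pref, theta):
    return np.exp(-(pref - theta) ** 2 / (2.0 * sigma ** 2))

for pref in (55.0, 60.0, 70.0):
    snr_i = response(pref, target) / response(pref, distractor)
    print(f"neuron preferring {pref:.0f} deg: SNR_i = {snr_i:.2f}")
# The neuron tuned exactly to the target responds almost as strongly to the distractor
# (SNR_i close to 1); neurons tuned further away from the distractor achieve a higher SNR_i.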
Differences between our model and previous models: Here, we summarise the differences between our model and previous models. 1) Most previous models ignore the role of the distractor in determining gain modulation: they enhance features that are similar to the target. In contrast, we predict that the distractor plays a critical role and determines whether the target feature will be enhanced or not. 2) In several earlier models (e.g., FeatureGate, Feature Similarity Gain, Hamker's model, Rao's model), the top-down bias only works when the target features are known. They cannot predict the top-down bias when distractor features are known but the target is unknown (e.g., when the distractor feature does not change across search trials, but the target feature changes). Our model predicts the top-down bias for all combinations of knowledge of target and distractor features, including when the target is unknown but the distractor is known (or, trivially, when both the target and distractor are unknown, in which case the gains remain at their default values). 3) While most previous models are either purely top-down [126] or bottom-up driven [87], a key distinguishing aspect of our model is that it integrates both bottom-up salience and top-down feature bias.
Proposed theory accounts for existing psychophysics data: By applying optimal top-down gains on
bottom-up salience responses, our theory integrates both goal-driven, top-down and stimulus-driven,
bottom-up factors to guide visual attention. It successfully accounts for a large body of available vi-
sual search literature. For instance, it accounts for several reported knowledge-based effects such as
the role of uncertainty in target features [181, 165], role of feature priming [143, 91, 178], target en-
hancement and distractor suppression [9, 17], and top-down effects on linear separability [53]. It also
demonstrates other well known bottom-up effects such as the role of target-distractor discriminability
[119, 37, 103, 155, 177], distractor heterogeneity [37] and linear separability [38, 4]. Thus, the theory,
despite being simple, yields good predictive power. It is general and applicable to top-down selection
of relevant information in biological as well as artificial systems, in visual and other modalities in-
cluding auditory, somatosensory, and cognitive.
Role of decision processes: Could the observed behavioral response of subjects (in figure 4.6 and
4.7) reflect higher decision processes rather than attentional biasing? Indeed, subjects’ response in
psychophysics studies such as ours is the outcome of several visuo-motor transformations from the
early and intermediate visual areas to higher decision areas. However, it is unlikely that our results
reflect decision-making processes for the following reasons: The presentation time of our probe trials
is brief (66ms in the experiments on color, 300ms for orientation) and prevents scanning of all 5 items
before reporting the target. The briefness of probe trials minimizes the contribution of covert serial
recognition or decision processes, so that the subjects’ response may reflect fast attentional biasing
processes rather than slow recognition or decision processes. Further validation of attentional biasing
and the theory’s predictions on gain modulation calls for more studies in electrophysiology.
Neurobiological implications 1 – gain modulation: So far, gain modulation has been studied sys-
tematically only for two configurations: when the target feature is known [93], or when the target and
distractor features are different [159]. A Feature Similarity Gain model was proposed to account for
the observations. Here, we show that the Feature Similarity Gain model can be explained as a special
case of our general theory (see section 4.3.2). Our theory agrees with the predictions of the feature
similarity gain model under the condition that the target and distractor features are very different. In
addition, we predict that the distribution of gains will be skewed away from the target and distractor
feature, when they are very similar. Indeed, natural scenes are full of clutter, and it is common for
targets of interest (e.g., prey, predators, suspects etc.) to be camouflaged or embedded in distracting
backgrounds. We predict that in such cases, the distractor feature (and not just the target feature)
will play a critical role in gain modulation. We have empirically verified this on natural scenes [109],
where the optimal gain modulation strategy based on the target and distractor features performs better
than one which considers target features only. This prediction remains to be tested neurally.
Neurobiological implications 2 – neural substrates: We have proposed a theory of neural function
which suggests that the “end result” of feature-based attention, possibly mediated through complex
neural interactions and feature processing, is to modulate neural response gains according to their
signal-to-noise ratio. The details of the neural mechanisms in the intermediate steps are not yet ad-
dressed by the theory. The functional role of attention suggested by the theory is general and applica-
ble to any population of neurons that encode a continuous feature dimension in a distributed manner,
e.g., neurons in MT that are tuned to direction of motion, V4 neurons that are tuned to orientation.
For simple feature dimensions such as orientation that we have currently tested in our psychophysics
experiments, we suggest that the attentional modulation may occur as early as in a V1 hypercolumn
[99, 125, 135, 95, 171, 145, 44, 92].
Extension to multiple feature dimensions: The current report primarily focuses on gain modulation
within a single feature dimension. This provides a theoretical foundation for further research on inte-
grating multiple feature dimensions. As shown elsewhere [109], this theory can be easily extended to
multiple dimensions if they are combined linearly as suggested by the Guided Search theory [177].
Optimal eye movement vs. optimal feature selection strategies: By focusing on visual features as
opposed to locations in space, our study on optimal feature gain modulation complements recent stud-
ies on optimal eye position strategies [105]. While the latter suggests that humans can select relevant
locations optimally, here, we show that humans select visual features optimally as well. Together,
these studies suggest that human visual search behavior is optimal.
Chapter 5
Applications in computer vision
5.1 Introduction
State of the art in object detection: Traditional models of object detection use a sliding window
across the image and apply a binary classifier at each window to detect the presence or absence of
the desired target object [117, 96]. While this approach has been successfully applied to detecting
rigid objects such as faces and cars [136, 141, 166] and even pedestrians [117, 167], it is slow and
computationally expensive as each classifier (corresponding to every object) is run independently at
every window within the image.
Role of attention in accelerating detection speed: Recent models of object detection overcome the
speed bottleneck of the sliding window approach by using a generic attention operator to quickly se-
lect a few interest points in the image [42, 14]. This area has received much interest recently, with
several systems using attention as a front-end to accelerate detection speed [69], to reduce complexity
of automated multi-target detection and tracking [169], and to enable automated learning and recog-
nition of objects in cluttered scenes [137]. However, most such models are either purely goal-driven
(top-down) [126] or image-driven (bottom-up) [66, 162].
Need to integrate top-down and bottom-up attentional influences: There have been few attempts
to integrate both top-down and bottom-up attention [113]. Such integration is crucial for robot nav-
igation, visual surveillance and any realistic visual search. For instance, in visual surveillance, it is
important to detect goal-relevant targets like suspects, and to simultaneously notice unexpected visual
events like gun shots or sudden explosions. Similarly, robot navigation requires top-down detection of
landmarks and road signs, as well as bottom-up detection of unexpected obstacles and accidents. In this chapter, we present a new model that combines both top-down and bottom-up influences to guide attention during visual search for a target object in distracting clutter.
Need to consider knowledge of the target and distracting background: One of the central challenges in integrating bottom-up and top-down attention is to find the optimal top-down influence on bottom-up processes such that detection speed is maximized. This challenge remains unsolved, since most models of top-down attention are sub-optimal heuristics driven by knowledge of the desired target only [177, 126, 96, 69], ignoring the contribution of knowledge of the distracting background. A few top-down models do consider the distractors [153] by using global features representing the scene context, but they do not consider the local features of the background, which are known to facilitate search [91, 17].
Current open challenges: Further progress in building fast, next generation target detection systems
requires a thorough investigation of how statistical knowledge of the local features of the target and
distracting background yields optimal top-down attentional signals that combine with bottom-up at-
tention to maximize detection speed.
Highlight of our approach: We propose a new model that combines both bottom-up as well as top-
down attentional influences. Our proposed model first computes the naive, bottom-up salience of
every scene location for different local visual features (e.g., different colors, orientations and intensi-
ties) at multiple spatial scales. Next, the top-down component uses learnt statistical knowledge of the
local features of the target and distracting clutter, to optimize the relative weights of the bottom-up
maps such that the overall salience of the target is maximized relative to the surrounding clutter. Such
optimization renders the target more salient than the distractors, thereby maximizing target detection
speed [178].
Related work: Previously, Navalpakkam and Itti derived a theory of top-down guidance for simpli-
fied stimuli defined within one feature dimension only [108]. Here, we present a new model (theory
and implementation) that combines bottom-up and top-down attention and considers complex targets
and distracting objects that are defined as a conjunction of different features across multiple feature
dimensions. Our model is applicable to natural scenes as well as artificial search arrays. Unlike the
former study that assumes an ideal observer with complete prior knowledge of the target and dis-
tractors, our model allows realistic observers with different beliefs (ranging from no knowledge to
complete knowledge), thereby allowing significantly higher prediction power that captures the perfor-
mance of a novice to an expert.
Our contribution: In section 5.2, we formally derive the optimal theory of top-down and bottom-up attention. In section 5.3, we describe the model's implementation and its results on 750 synthetic search arrays and natural scenes. With little computational cost in the form of multiplicative top-down gains on bottom-up saliency maps, we show that our model can predict many reported bottom-up [156, 119, 37, 103, 155, 177, 38, 4] and top-down effects [37, 181, 165, 53] on human visual search behavior. Systematic evaluation of different models with varying degrees of knowledge reveals that
knowledge of the local features of the distracting background, in addition to the target, yields better
search performance.
5.2 Theory
Relevant objective function to be optimized: Consider searching for a fruit in the trees. While a ripe red fruit readily captures our attention due to its high visual salience, an unripe green fruit does not capture our attention, due to its low salience relative to the distracting leaves, and is hard to detect. Thus, detection speed depends on the ratio of the strength of the signal detecting the target (i.e., target salience) to that detecting the distracting background (i.e., distractor salience) [178]. Here, we will refer to this ratio as the search's signal-to-noise ratio SNR. The relevant goal for maximizing object detection speed is to maximize SNR.
Formalizing visual search: As shown in figure 5.1, let the perceived salience of the target, S_T(A), be a function of the input search array A, which is a function of the visual features of the target Θ|T (sampled from the probability density function P(Θ|T)). A is also a function of the relative locations or spatial configuration of the target and distractors (C). Since C and Θ|T are random variables, so is S_T(A). S_T(A) is also influenced by noise in the neural response, η. Similarly, the salience of the distractors, S_D(A), depends on the distractor features Θ|D, the configuration C and the internal noise η. Thus, we define SNR as the ratio of the expected salience of the target over that of the distractors, with the expectation taken over the random variables Θ|T, Θ|D, C, η:

SNR = E_{\Theta|T,C,\eta}[S_T(A)] \,/\, E_{\Theta|D,C,\eta}[S_D(A)]
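A minimal sketch of how this expectation could be estimated numerically is shown below. The helpers sample_target_feature(), sample_distractor_feature(), render_array() and salience_at() are hypothetical stand-ins (they are not part of the model described here); the point is only to make the Monte Carlo reading of the definition explicit.

# Monte Carlo sketch of SNR = E[S_T(A)] / E[S_D(A)]. All helpers passed in are
# hypothetical stand-ins for whatever stimulus generator and saliency model is used.
import numpy as np

def estimate_snr(sample_target_feature, sample_distractor_feature,
                 render_array, salience_at, n_samples=200, seed=0):
    rng = np.random.default_rng(seed)
    s_target, s_distractor = [], []
    for _ in range(n_samples):
        theta_T = sample_target_feature(rng)       # draw from P(Theta|T)
        theta_D = sample_distractor_feature(rng)   # draw from P(Theta|D)
        A, t_loc, d_locs = render_array(theta_T, theta_D, rng)   # random configuration C
        s_target.append(salience_at(A, [t_loc]).mean())          # includes internal noise
        s_distractor.append(salience_at(A, d_locs).mean())
    return np.mean(s_target) / np.mean(s_distractor)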
Figure 5.1: Overview of our model: Let the incoming visual scene A contain a target and distractors sampled from the probability density functions P(Θ|T) and P(Θ|D). Our model assumes that the visual input is analyzed in different feature dimensions by a population of neurons with broad and overlapping tuning curves. Bottom-up saliency maps s_ij(A) are extracted for the i-th feature within the j-th dimension, i ∈ {1...n}, j ∈ {1...N}. Prior knowledge of the target and distractors is used to compute the top-down gains g_ij and g_j. The bottom-up maps s_ij(A) are then multiplicatively weighted by the top-down gains g_ij and are summed to yield S_j(A), the saliency map for the j-th dimension. The resulting saliency maps S_j(A) are again weighted by the top-down gains g_j and summed across the different feature dimensions to form the overall saliency map S(A). The goal here is to choose optimal top-down weights that maximize the target's salience relative to the background, thereby maximizing the speed of detecting the target.
Computing salience within a dimension: The overall perceived salience (combined top-down and bottom-up salience), S_j, for a feature dimension j is computed as a linear combination of the bottom-up saliences s_ij for the features (values) within that dimension (figure 5.1). To simulate human-like behavior, we assume that the feature responses are modulated in a top-down manner by multiplicative gain modulation [?, 93].

S_j(x,y,A) = \sum_{i=1}^{n} g_{ij}\, s_{ij}(x,y,A)    (5.1)
Combining salience across dimensions: To combine information across N feature dimensions, we integrate linearly across all dimensions to obtain the overall perceived salience S (as suggested by the Guided Search theory [177]).

S(x,y,A) = \sum_{j=1}^{N} g_j\, S_j(x,y,A)    (5.2)
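A minimal sketch of eqns 5.1 and 5.2 as array operations; the map sizes and gain values below are arbitrary and serve only to show the combination step.

# Gain-weighted linear combination of bottom-up saliency maps: eqn 5.1 within each
# dimension, eqn 5.2 across dimensions. Shapes and values here are arbitrary.
import numpy as np

def combine_salience(s, g_feat, g_dim):
    # s[j]      : (n, H, W) array of bottom-up maps s_ij for dimension j
    # g_feat[j] : (n,) within-dimension gains g_ij
    # g_dim     : (N,) across-dimension gains g_j
    S_dims = [np.tensordot(g_feat[j], s[j], axes=(0, 0)) for j in range(len(s))]  # eqn 5.1
    return sum(g_dim[j] * S_dims[j] for j in range(len(s)))                        # eqn 5.2

rng = np.random.default_rng(1)
s = [rng.random((3, 8, 8)) for _ in range(2)]   # 2 dimensions, 3 features each
g_feat = [np.ones(3), np.ones(3)]               # baseline gains g_ij = 1
g_dim = np.ones(2)                              # baseline gains g_j = 1
S = combine_salience(s, g_feat, g_dim)          # overall saliency map S(x, y, A)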
Salience of the target and distractors: The expected salience of the target (S_T) can be computed in terms of its salience s_{ijT}, i ∈ {1...n}, j ∈ {1...N}, in each of the n saliency maps within the N feature dimensions. Further, assuming that η, C, and Θ are independent random variables, we obtain:

E[S_T(A)] = E_{\Theta|T,C,\eta}\Big[\sum_{j=1}^{N} g_j\, S_{jT}(A)\Big]
          = E_{\Theta|T,C,\eta}\Big[\sum_{j=1}^{N} g_j \sum_{i=1}^{n} g_{ij}\, s_{ijT}(A)\Big]
          = \sum_{j=1}^{N} g_j \sum_{i=1}^{n} g_{ij}\; E_{\Theta|T}\big[E_C[E_\eta[s_{ijT}(A)]]\big]
Similarly for the distractors. Thus, we have:

SNR = \frac{\sum_{j=1}^{N} g_j \sum_{i=1}^{n} g_{ij}\, E_{\Theta|T}[E_C[E_\eta[s_{ijT}(A)]]]}{\sum_{j=1}^{N} g_j \sum_{i=1}^{n} g_{ij}\, E_{\Theta|D}[E_C[E_\eta[s_{ijD}(A)]]]}    (5.3)
Maximizing SNR to obtain the optimal gains: To maximize SNR, we differentiate it wrt g_ij and g_j and obtain the following:

\frac{\partial}{\partial g_{ij}} SNR = \left(\frac{SNR_{ij}}{SNR} - 1\right)\alpha_{ij}    (5.4)

\frac{\partial}{\partial g_j} SNR = \left(\frac{SNR_j}{SNR} - 1\right)\alpha_j    (5.5)

where \alpha_{ij}, \alpha_j are positive normalization terms and

SNR_{ij} = \frac{E_{\Theta|T}[E_C[E_\eta[s_{ijT}(A)]]]}{E_{\Theta|D}[E_C[E_\eta[s_{ijD}(A)]]]}    (5.6)

SNR_j = \frac{E_{\Theta|T}[E_C[E_\eta[S_{jT}(A)]]]}{E_{\Theta|D}[E_C[E_\eta[S_{jD}(A)]]]}    (5.7)

The sign of the derivative \frac{\partial}{\partial g_{ij}} SNR determines whether g_{ij} should increase, decrease or remain at the baseline (g_{ij} = 1), in order to maximize SNR. Eqn. 5.4 yields:

\frac{SNR_{ij}}{SNR} < 1 \;\Rightarrow\; \frac{\partial}{\partial g_{ij}} SNR\Big|_{g_{ij}=1} < 0 \;\Rightarrow\; g_{ij} < 1
\frac{SNR_{ij}}{SNR} = 1 \;\Rightarrow\; \frac{\partial}{\partial g_{ij}} SNR\Big|_{g_{ij}=1} = 0 \;\Rightarrow\; g_{ij} = 1
\frac{SNR_{ij}}{SNR} > 1 \;\Rightarrow\; \frac{\partial}{\partial g_{ij}} SNR\Big|_{g_{ij}=1} > 0 \;\Rightarrow\; g_{ij} > 1
Thus g_ij increases as SNR_ij/SNR increases. We simplify this monotonic relationship by assuming proportionality. With an added constraint that the gains cannot increase indiscriminately, but must sum to a constant, \sum_{i=1}^{n} g_{ij} = n, we get:

g_{ij} = \frac{SNR_{ij}}{\frac{1}{n}\sum_{k=1}^{n} SNR_{kj}}    (5.8)

g_j = \frac{SNR_j}{\frac{1}{N}\sum_{k=1}^{N} SNR_k}  \quad (similarly)    (5.9)
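A short sketch of eqns 5.8 and 5.9, given estimates of the expected per-feature saliences. Note that the dimension-level SNR_j of eqn 5.7 involves the within-dimension gains; this sketch plugs in the just-computed gains for that step, which is one possible reading of the definition.

# Optimal gains from eqns 5.8 and 5.9. mean_sT[j] and mean_sD[j] are arrays of the
# expected target / distractor saliences E[s_ijT], E[s_ijD] for dimension j
# (e.g., estimated from training images).
import numpy as np

def optimal_gains(mean_sT, mean_sD):
    g_feat, snr_dim = [], []
    for sT, sD in zip(mean_sT, mean_sD):
        sT, sD = np.asarray(sT, float), np.asarray(sD, float)
        snr_ij = sT / sD                           # eqn 5.6
        g_ij = snr_ij / snr_ij.mean()              # eqn 5.8
        g_feat.append(g_ij)
        # eqn 5.7 with the gain-weighted sums S_jT, S_jD (using the gains above)
        snr_dim.append(np.dot(g_ij, sT) / np.dot(g_ij, sD))
    snr_dim = np.asarray(snr_dim)
    g_dim = snr_dim / snr_dim.mean()               # eqn 5.9
    return g_feat, g_dim

# e.g., two dimensions with three features each (arbitrary illustrative values)
g_feat, g_dim = optimal_gains([[2.0, 5.0, 1.0], [1.0, 1.2, 0.9]],
                              [[1.0, 1.5, 4.0], [1.1, 1.0, 1.0]])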
Interpretation of the result: Thus, the top-down weight on the i-th visual feature in the j-th feature dimension depends on its signal-to-noise ratio SNR_ij relative to the mean within that dimension. Similarly, the top-down gain on the j-th feature dimension depends on its signal-to-noise ratio SNR_j relative to the mean across all dimensions. In other words, a feature is relevant and receives a high weight if it renders the target more salient than the distractors, and is irrelevant otherwise.
Ideal observer vs. real observer: The current analysis considers an ideal observer who knows the true underlying distributions of the target and distractor features (θ|T, θ|D). But in reality, the observer may possess incomplete knowledge or a different belief (θ_b|T, θ_b|D). This belief may be learnt through observation of several displays (bottom-up priming [91]), or through explicit verbal instruction such as "find the red object" [181, 165]. Our model captures such top-down influences in the following manner: a forward internal model translates the observer's belief in feature space (θ_b|T, θ_b|D) into a belief in the salience of the target and distractors (S^b_T, S^b_D), which is then used to derive the belief in the signal-to-noise ratio (SNR^b). Top-down gains are chosen according to eqns. 5.8 and 5.9, thereby optimizing SNR^b. These gains g_ij are then applied to the bottom-up saliency maps (s_ij) within each feature dimension to compute the biased saliency maps S_j, which are multiplied by the gains g_j to obtain the overall saliency map S. Thus, the bottom-up saliency maps are combined with the optimal top-down gains to yield a saliency map in which the target's salience is maximized relative to the distractors. This saliency map is then used to guide attention to likely target locations.
5.3 Results
In this section, we present a systematic evaluation of the model's predictions for different observer beliefs and search tasks, on artificial search arrays and natural scenes.
Computing salience: For computing bottom-up saliency maps, we use the Itti and Koch saliency model [66]. We use the following set of biologically inspired, low-level visual features: 6 hues within the color dimension, 4 intensities within the luminance dimension, and 4 orientations (0°, 45°, 90°, 135°) within the orientation dimension. The input visual scene is analyzed in all feature dimensions in parallel, and for each of the above features, feature maps (topographic maps of feature responses at all scene locations) are extracted at 6 different spatial scales (downsized by a factor of 1, 2, 4, 8, 16, and 32). After local center-surround feature contrast operations and global nonlinear interactions in space, these maps are weighted by the top-down gains (whose baseline is unity) and are linearly combined into a conspicuity map for that feature dimension. The conspicuity maps are also weighted by top-down gains (default weight is 1) and are combined linearly to obtain the overall saliency map. The active locations in this map indicate likely target locations.
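The sketch below is only a toy stand-in for one step of that pipeline, a center-surround contrast obtained by differencing a feature map at a fine and a coarse scale; it is not the actual implementation of the saliency model [66], and the scale choices are arbitrary.

# Toy center-surround operation on a single feature map: compare a fine ("center")
# scale against a coarse ("surround") scale. A simplified stand-in, not the actual
# multi-scale saliency implementation.
import numpy as np

def downsample(m, factor):
    h, w = (m.shape[0] // factor) * factor, (m.shape[1] // factor) * factor
    return m[:h, :w].reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def center_surround(feature_map, center=2, surround=8):
    c = downsample(feature_map, center)
    s = downsample(feature_map, surround)
    s_up = np.kron(s, np.ones((surround // center, surround // center)))  # crude upsampling
    h, w = min(c.shape[0], s_up.shape[0]), min(c.shape[1], s_up.shape[1])
    return np.abs(c[:h, :w] - s_up[:h, :w])   # center-surround contrast map

fmap = np.random.default_rng(2).random((64, 64))   # e.g., one hue feature map
contrast = center_surround(fmap)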
Interpreting SNR: The SNR in the overall saliency map may be high, and search may be efficient, due to high bottom-up salience of the target relative to the distractors (e.g., a green target pops out among red distractors [156], as s_ijT >> s_ijD in at least one saliency map i in dimension j), or due to efficient top-down guidance to the target (e.g., a green target among randomly colored distractors becomes easy to find once subjects know that the target is green [36], since g_ij >> 1 on the green feature), or both.
Comparison of four models: We test the predictions of the above theory by implementing four different models: T0D0, T1D0, T0D1 and T1D1, where T and D refer to the target and distractors. T0D0, the naive, bottom-up model [66], does not know T or D (hence it uses default top-down weights of 1). T1D0 combines bottom-up salience with knowledge of T only; it computes top-down weights based only on the target salience s_ijT, while ignoring D by considering s_ijD to be some constant. T0D1 combines bottom-up salience with knowledge of D only. T1D1 combines bottom-up salience and top-down knowledge of both T and D; it chooses weights according to eqns. 5.8 and 5.9.
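One way to read these four variants as gain rules within a single feature dimension is sketched below; treating the unknown quantity (target or distractor salience) as a constant is my reading of the description above, and the constant then cancels in the normalization of eqn 5.8.

# Within-dimension gains for the four models, under one reading of the description.
import numpy as np

def gains_for_model(model, mean_sT, mean_sD):
    sT, sD = np.asarray(mean_sT, float), np.asarray(mean_sD, float)
    if model == "T0D0":
        return np.ones_like(sT)      # no knowledge: default gains of 1
    if model == "T1D0":
        snr = sT                     # E[s_ijD] assumed constant
    elif model == "T0D1":
        snr = 1.0 / sD               # E[s_ijT] assumed constant
    else:                            # "T1D1"
        snr = sT / sD                # eqn 5.6
    return snr / snr.mean()          # eqn 5.8 normalization

g = gains_for_model("T1D1", mean_sT=[2.0, 5.0, 1.0], mean_sD=[1.0, 1.5, 4.0])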
Training and test data: We compare the performance of the above models on synthetic search array stimuli used in psychophysics tasks (to study human behavior in a controlled and simplified environment), as well as on real-world natural scenes with complex stimuli. For each search condition with the synthetic stimuli, the model learns the belief in salience (S^b_T, S^b_D) from 50 training images, computes the mean salience of the target and distractors (E_{Θ|T,C,η}[S^b_T(A)], E_{Θ|D,C,η}[S^b_D(A)]) and uses it to compute the gains (g_ij, g_j), which are subsequently applied to 100 new, previously unseen test images. In each of these images, the target and distractors can occur randomly at any cell within a 9x9 grid, and their location within the cells is further jittered by up to 10 pixels (thereby changing C). Noise in the stimulus features is also added, in the form of jitter in orientation (up to 5°) and jitter in color values (up to 20 in R, G and B), thereby changing Θ|T, Θ|D. Internal neural noise η is added by the saliency model. Results are reported in figure 5.2a-i.
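For concreteness, a sketch of generating one such jittered array follows; the cell size, distractor count and exact jitter conventions are assumptions made for illustration, since they are not fully specified above.

# Sketch of one jittered synthetic search array: items placed on a 9x9 grid with
# position jitter of up to 10 pixels and orientation jitter of up to 5 degrees.
import numpy as np

def make_search_array(target_theta=55.0, distractor_theta=50.0,
                      n_distractors=25, cell=48, seed=None):
    rng = np.random.default_rng(seed)
    cells = rng.choice(81, size=n_distractors + 1, replace=False)   # distinct grid cells
    items = []
    for k, c in enumerate(cells):
        row, col = divmod(int(c), 9)
        x = col * cell + cell // 2 + rng.integers(-10, 11)          # position jitter (C)
        y = row * cell + cell // 2 + rng.integers(-10, 11)
        base = target_theta if k == 0 else distractor_theta
        theta = base + rng.uniform(-5.0, 5.0)                       # feature jitter (Theta)
        items.append({"x": int(x), "y": int(y), "theta": float(theta), "is_target": k == 0})
    return items

array_items = make_search_array(seed=3)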
Figure 5.2 (data): SNR in the overall saliency map (dB, mean ± std. err.) predicted by each model for the nine search conditions a-i, with the figure's remarks:
a) T0D0: 1.3 ± 0.1; T1D0: 7.7 ± 0.2; T0D1: 7.7 ± 0.2; T1D1: 7.7 ± 0.2. Remark: the target pops out among distractors.
b) T0D0: -0.4 ± 0.1; T1D0: 7.7 ± 0.2; T0D1: 7.7 ± 0.2; T1D1: 7.7 ± 0.2. Remark: search becomes very easy when the target is known.
c) T0D0: -4.6 ± 0.5; T1D0: -3.0 ± 0.5; T0D1: -5.2 ± 0.4; T1D1: -2.6 ± 0.5. Remark: search improves with knowledge, but remains hard.
d) T0D0: -5.8 ± 0.4; T1D0: -5.0 ± 0.5; T0D1: -5.5 ± 0.5; T1D1: -4.9 ± 0.5. Remark: conjunction search remains hard.
e) T0D0: 7.1 ± 0.3; T1D0: 7.6 ± 0.2; T0D1: 7.7 ± 0.2; T1D1: 7.7 ± 0.2. Remark: search for the brightest item is fast.
f) T0D0: -2.5 ± 0.4; T1D0: 5.6 ± 0.6; T0D1: 3.8 ± 0.6; T1D1: 6.0 ± 0.6. Remark: search for a medium-bright item improves with knowledge, but the item does not pop out.
g) T0D0: -1.2 ± 0.4; T1D0: 1.9 ± 0.4; T0D1: 4.5 ± 0.6; T1D1: 5.9 ± 0.6. Remark: knowledge of the distracting background improves search speed.
h) T0D0: -1.0 ± 0.3; T1D0: 3.5 ± 0.6; T0D1: 1.2 ± 0.3; T1D1: 4.8 ± 0.6. Remark: better knowledge leads to faster search.
i) T0D0: 0.1 ± 0.2; T1D0: 0.5 ± 0.2; T0D1: 1.0 ± 0.3; T1D1: 3.0 ± 1.0. Remark: the blue feature in the blue-green pen is suppressed as it activates the distractors.
Figure 5.2: Simulation results: This figure shows the results of testing on 750 artificial search arrays and natural scenes. Each row shows a different search task with different targets and distractors. The first column shows a sample test scene. The second column shows the SNR (in decibels) predicted by the four different models described in section 5.3. The third column shows the distribution of optimal top-down gains derived from knowledge of the target and distractors, as computed by model T1D1. The dotted blue lines are the default gains (1) used by model T0D0. The first plot shows the gains on the intensity (I), color (C) and orientation (O) dimensions. The subsequent plots show the gains within these dimensions (in the order intensity, color, orientation). The final column shows some remarks. As described in section 5.3.1, these results are consistent with bottom-up and top-down effects reported in psychophysics experiments. Across all search tasks, model T1D1 performed at least as well as or better than T1D0 and T0D1, which performed better than T0D0. These results suggest that knowledge of both the target and the distracting background plays an important role in improving search speed.
5.3.1 Artificial search arrays
Pop-out search: Figure 5.2a shows an example of a pop-out search with a green target among red distractors. The naive bottom-up saliency model T0D0 predicts that SNR will be reasonably high (1.3±0.1 dB), indicating that search will be fast [156]. Consistent with psychophysics experiments on "priming of pop-out" [91], knowledge of the target and distractors allows the relevant features to be primed and hence increases SNR to 7.7±0.2 dB, leading to a faster search. The distribution of optimal gains shows that the gain on the color dimension increases while intensity and orientation are suppressed; and within color, the target's green feature is maximally boosted while the distracting red feature and other irrelevant features are suppressed.
Distractor heterogeneity: Figure 5.2b shows an example of search for a green target among heterogeneous distractors of different colors. As observed in human visual search behavior [37], the naive model T0D0 predicts a hard search (SNR is -0.4±0.1 dB). But psychophysics experiments [37] also show that this hard search becomes efficient if the target is known, consistent with the prediction of models T1D0 and T1D1. Note that in both figures 5.2a and 5.2b, the target and distractor features are well separated in feature space; hence, the optimal gain strategy reduces to increasing the gain on the target features and suppressing the others.
Poor target-distractor discriminability: Figure 5.2c shows an example of search for an orange target that is poorly discriminable from the red distractors. The naive model T0D0 correctly predicts a very hard search [119, 37, 103, 155, 177] (SNR is -4.6±0.5 dB). Simply knowing the target feature is not very helpful, since boosting the target's red feature also activates all the distractors that share that feature. Instead, model T1D1, which knows both the target and distractors, performs better as it promotes the yellow feature that selectively activates the target, while suppressing the red feature that activates the distractors.
Conjunction search: Figure 5.2d shows conjunction search for a green-horizontal target among green-vertical and red-horizontal distractors. The naive model T0D0 correctly predicts a very hard search [156] (SNR is -5.8±0.4 dB). Extra knowledge allows model T1D1 to slightly improve search by promoting the target's horizontal feature, while suppressing the distractors' red feature. But consistent with psychophysics experiments, search remains hard.
Linearly separable target: Figure 5.2e shows search for a bright target among medium-bright and dark distractors. Search is easy (SNR is 7.1±0.3 dB), confirming earlier reports of easy search for a target that can be separated from the distractors by a line in feature space [38, 4]. T1D1 suggests a higher gain on the intensity dimension, and within intensity, a higher gain on the high intensity values than on the others.
Non-linearly separable target: Figure 5.2f shows search for a medium-bright target among dark and bright distractors. The naive model T0D0 predicts a hard search (SNR is -2.5±0.4 dB), confirming the "linear-separability effect" [38, 4] that search for a medium-type target that cannot be linearly separated from the distractors is harder than when the target is linearly separable (as shown in figure 5.2e). Consistent with previous experiments, there is a top-down effect of knowledge leading to faster search [53]. In this case, model T1D1 suggests an increased gain on the medium intensity value and suppression of the high and low intensity values (corresponding to boosting the target and suppressing the distractors).

Figure 5.3: Example training data.
5.3.2 Natural scenes
Training and testing: To test the model's performance on natural scenes, we train it on 10 images containing different views of the target, appearing at different locations in the scene. Some examples are shown in figure 5.3. The learned top-down gains are subsequently applied to 50 new test scenes where the target can appear in slightly different backgrounds, locations, views and sizes.
Search for targets in natural scenes: The results for finding a cell phone on a cluttered desk are shown in figure 5.2g. While the naive model T0D0 struggles to find the non-salient phone (SNR is -1.2±0.4 dB), knowledge of the phone and the distracting background (acquired through training) speeds search significantly (model T1D1 yields an SNR of 5.9±0.6 dB). Inspection of the gains reveals that color is the useful dimension and that, within color, the target's blue feature discriminates it best from the background. Similar results are shown during search for a pen (figure 5.2i) and a coke can (figure 5.2h) in distracting backgrounds. Figure 5.4 shows sample saliency maps for further comparison between the naive model (T0D0) and the combined top-down and bottom-up model (T1D1).

Figure 5.4: Comparison of different models: Saliency maps of the naive bottom-up model T0D0 (second row) vs. T1D1 (third row) are shown during search for a phone on a desk (first column), a coke can in a cluttered scene (second column), and a pen in a distracting background (third column). Although the target is not bottom-up salient, prior knowledge of the target and the distracting background (acquired through training) helps in improving the SNR, thereby rendering the target more salient and suppressing noisy activity due to the distractors.
5.4 Discussion
Summary of results: By integrating top-down, knowledge-driven and bottom-up, image-driven approaches, we account for a large body of visual search literature. All models successfully account for fast search in pop-out tasks (e.g., a green target pops out among red distractors) [156], slow search in conjunction tasks (e.g., a green vertical target among green horizontal and red vertical distractors) [156], slow search when the target is more similar to the distractors (e.g., an orange target among red distractors) [119, 37, 103, 155, 177], and faster search for an extreme feature-valued target than for a medium-valued target (e.g., faster search for a bright target among dark and medium distractors, but slower search for a medium target among dark and bright distractors) [38, 4]. In addition, the knowledge-based models T1D0 and T1D1 also account for fast search for a known target among heterogeneous distractors, where the naive model T0D0 indicates a slow search (e.g., search for a green target among red, yellow and blue distractors is slow if we don't know that the target is green, and is otherwise fast) [37].
Better knowledge leads to faster search: Systematic comparison of the different models reveals that model T1D1 performs significantly better than models T1D0 and T0D1, which in turn perform better than T0D0. Thus, we provide a computational correlate of the behavioral effect that better knowledge leads to faster search [181, 165]. The gradual progression of models from T0D0 (no knowledge) to T1D1 (complete knowledge of both target and distractors) allows us to capture the behavior of observers ranging from novices to experts, for example as induced by priming [91].
Role of knowledge of the distracting background: Contrary to previous target-based approaches, which assume that knowledge of the target suffices [177, 126], we suggest that knowledge of the distractors is also crucial. Hence, while the former approaches suggest that the target features always be promoted, as shown in figure 5.2c our model predicts that the target features may even be suppressed if the distractor activates the same features. A similar example is shown in figure 5.2i, where the blue feature in the blue-green pen is suppressed as it activates the background. Thus, the distractors, and not just the target, play an important role in priming features so as to maximize target detection speed.
Integration of bottom-up and top-down attentional influences: With little computational cost in-
curred through multiplicative top-down weights on bottom-up saliency maps, our model combines
both stimulus-driven and goal-driven attention, to optimize speed of guidance to likely target loca-
tions, while simultaneously being sensitive to unexpected stimulus changes. As mentioned earlier,
this is an important ability for robot navigation, visual surveillance and other active vision tasks that
operate in unconstrained environments where unexpected visual events such as accidents may occur.
Future extensions: We currently consider simple, low-level visual features such as intensities, hues and orientations at multiple spatial scales. But the theory derived in section 5.2 is general and can be applied to any feature dimension, such as complex shape features.
Chapter 6
General Discussion
Need to integrate bottom-up and top-down visual attention: The primate brain faces huge capac-
ity limits – while the retina is impinged with an enormous amount of incoming visual information (∼10^10 bits/s), far fewer resources are available for information processing and analysis. Visual attention refers
to the brain’s selection mechanism by which a small, but useful subset of information is selected for
further processing, representation and action. This selection is influenced by two different factors:
Bottom-up and Top-down. While bottom-up factors select visually salient information that is sudden
/ unexpected or differs spatio-temporally from its surroundings; top-down factors select information
that is behaviorally relevant to the goal of the organism. Although most existing attention models are
either bottom-up driven or top-down driven, their integration is crucial for any realistic application
like robot navigation, visual surveillance, or scene understanding. For instance, in visual surveillance,
it is important to detect goal-relevant targets like suspects, and to simultaneously notice unexpected
visual events like gun shots or sudden explosions. In this thesis, we present a new model that combines
both top-down and bottom-up influences to guide attention during visual search for a target object in
distracting clutter, and for scene understanding.
6.1 Summary
Integrating bottom-up and top-down attention during scene understanding: In chapter 1, we
present a wide perspective of how a task specification influences attention during scene understand-
ing. We propose and partially implement a generic, biologically plausible model that accepts an image
and task specification as input (e.g., Who is Osama shaking hands with?), computes the task-relevance
of scene entities (e.g., Osama’s face, his hand, and region around hand are task-relevant), biases the
low-level visual system to detect the task-relevant entities, recognizes them, updates their relevance
in a task-relevance map, combines it with a bottom-up salience map to guide attention to salient and
relevant scene regions. This large-scale model illustrates how different bottom-up and top-down com-
ponents of visual processing such as the gist, saliency map, object detection and recognition modules,
working memory, long term memory and the task-relevance map may interact and interface with each other
to guide attention to salient and relevant scene locations.
Granularity of information integration is high: Next, we investigate the specifics of how bottom-up
and top-down influences may integrate at the single neuron level during a specific task such as search-
ing for a target in distracting background clutter. In chapter 2, we probe the granularity of information
integration within feature dimensions such as color, size, luminance. Results of our eye tracking ex-
periments assert that bottom-up responses encoding feature dimensions can be modulated by not just
one, but several top-down gain control signals, thus revealing high granularity of integration.
Optimal integration of bottom-up and top-down attention in visual search: In chapter 3, we in-
vestigate the computational principles underlying the integration, i.e., how the top-down gain control
terms may be computed in order to improve search performance. We derive a formal theory of optimal
integration of bottom-up salience with top-down information about target and distractor features, such
that the target’s salience relative to the distractors is maximized, thereby accelerating search speed.
Our theory makes a surprising prediction that traditional approaches of boosting neurons favoring
the target features are sub-optimal. Instead, we show that in some cases, the optimal approach is to
boost neurons favoring a non-target feature. We provide experimental evidence that supports this pre-
diction. We further show that the theory successfully accounts for several reported effects in visual
search behavior (including pop-out, target-distractor discriminability, distractor heterogeneity, linear
separability, feature priming, target uncertainty). We extend the theory to compute optimal gains
within multiple feature dimensions. We provide a software implementation of the theory and test its
performance on several artificial and natural images. Results show that across all tested images, the
performance of our model is better than a naive bottom-up model or other models that compute gains
based on the target or distractors only.
General summary: In summary, this thesis presents a wide perspective on integrating bottom-up and top-down attention, ranging from a systems-level engineering design of a large-scale model of attention and scene understanding to a model of optimal gain modulation at the single unit level during visual search that simultaneously accounts for psychophysical performance as well.
6.2 Outstanding issues
Task-relevance and scene representation: In this thesis, we have addressed a few computational is-
sues related to integration of bottom-up and top-down attention. Several other questions remain open
and are beyond the scope of this thesis. We have outlined a few unresolved issues in this section. A
detailed discussion is available at the end of every chapter. In chapter 1, we presented a large-scale
model of attention during scene understanding. We introduced the notion of task-relevance of a scene
entity or location and suggested how it may be computed, and combined with bottom-up salience.
However, neural correlates of task-relevance, and biologically plausible methods of their computation
are relatively unknown. Scene representation, its semantic interpretation and their relation to language
representation and analysis is another interesting issue for future research. Recent advances in ma-
chine learning techniques for natural language processing and semantic analysis can be useful in this
regard.
Granularity of top-down information: In chapter 2, we probed the granularity of top-down atten-
tional signals. We showed that multiple top-down gain control signals (at least three) modulate the
bottom-up responses encoding a feature dimension (such as color, orientation). Future studies are
required to determine the exact number of such independent top-down signals – specifically, can ev-
ery neuron be modulated independently or are the gains on nearby neurons correlated? Apart from
granularity of top-down information, the issue of where the information is computed also remains to
be investigated – are the gains computed in higher areas and then transmitted to the bottom, or are
they computed at the bottom?
Attentional modulation of different neural properties: In chapter 3, we formally derived a the-
ory of how top-down gains on bottom-up responses may be computed such that the target’s salience
relative to the distractors is maximized, thereby accelerating search speed. We showed through simu-
lations that our theory can account for several existing behavioral effects in visual search. But further
experiments in electrophysiology are required to validate the predictions of distribution of gains for
different values of target and distractor features. The neural mechanisms underlying the gain and
SNR computations are also unknown. While we explored only gain modulation, it might very well
be that attention modulates other neural properties like tuning width and preferred features, apart from
the response gain. An important open question is how attention may modulate different neural proper-
ties during visual search. Are they modulated optimally as well? What are their relative contributions
to search performance? For instance, does gain modulation contribute more than modulating tuning
width?
“When, where, what and how” of integration: Several other issues regarding the “when, where,
what and how” of integration remain open. A few are outlined below. When does the integration
occur, or what is its timecourse? Some studies suggest that early neural activity represents bottom-
up influences resulting from feedforward processing, while subsequent activity represents top-down
influences resulting from feedback, reentrant processes. This suggests that bottom-up salience dom-
inates initially, followed by slow volitional processes. However, the exact timecourse of integration
is yet to be investigated for complex tasks. Where does the integration occur? Does it occur at the
level of a single neuron (i.e., each neuron is separately modulated), or at the level of a population of
neurons? Does it occur as early as LGN, or as late as FEF or SC? While earlier studies demonstrated
attentional modulation in higher areas like V4 and FEF, recent studies have shown that top-down in-
fluences occur as early as LGN and V1. Recent evidence suggests that integration occurs everywhere
in the hierarchy, but the strength of top-down influences increase as we go up the hierarchy. How does
the integration occur? How does top-down attention modulate bottom-up responses at the single unit
level – does it change the neuron’s preferred feature, or its tuning width, or its response gain, or its
contrast sensitivity? Does the computational strategy depend on the task performed? What computa-
tional principles underlie the integration at the level of a population of neurons encoding salience and
relevance maps – are these maps combined linearly, or multiplicatively, or in some other non-linear
fashion? Thus, several important questions remain unresolved, promising an exciting future for this
field.
Reference List
[1] F. Arman and J.K Aggarwal. Model-based object recognition in dense-range images - a review.
ACM Computing Surveys (CSUR), 25(1):5–43, 1993.
[2] W. J. Bacon and H. E. Egeth. Goal-directed guidance of attention: evidence from conjunctive
visual search. J Exp Psychol Hum Percept Perform, 23(4):948–961, Aug 1997.
[3] R. J. Baddeley and B. W. Tatler. High frequency edges (but not contrast) predict where we
fixate: A bayesian system identification analysis. Vision Res, 46(18):2824–2833, Sep 2006.
[4] B. Bauer, P. Jolicoeur, and W. B. Cowan. Visual search for colour targets that are or are not
linearly separable from distractors. Vision Res, 36(10):1439–1465, May 1996.
[5] B. Bauer, P. Jolicoeur, and W. B. Cowan. Visual search for colour targets that are or are not
linearly-separable from distractors. Vision Research 36, 10:1439–1465, 1996.
[6] J. Beck, K. Prazdny, and A. Rosenfeld. A Theory of Textural Segmentation. Academic Press,
New York, New York., 1983.
[7] N. P. Bichot, A. F. Rossi, and R. Desimone. Parallel and serial neural mechanisms for visual
search in macaque area v4. Science, 308(5721):529–534, Apr 2005.
[8] N. P. Bichot and J. D. Schall. Saccade target selection in macaque during feature and conjunc-
tion visual search. Vis Neurosci, 16(1):81–89, Jan 1999.
[9] N. P. Bichot and J. D. Schall. Priming in macaque frontal cortex during popout visual search:
feature-based facilitation and location-based inhibition of return. J Neurosci, 22(11):4675–
4685, Jun 2002.
[10] I. Biederman. Recognition-by-components: A theory of human image understanding. Psycho-
logical Review, 94:115–147, 1987.
[11] I. Biederman, R. J. Mezzanotte, and J. C. Rabinowitz. Scene perception: detecting and judging
objects undergrouping relational violations. Cognitive Psychology, 14:143–177, 1982.
[12] A B Bilsky and J M Wolfe. Part-whole information is useful in visual search for size x size but
not orientation x orientation conjunctions. Percept Psychophys, 57(6):749–60, Aug 1994.
[13] E. Blaser, G. Sperling, and Z. L. Lu. Measuring the amplification of attention. Proc. Natl.
Acad. Sci. USA, 96(20):11681–11686, 1999.
[14] Guillaume Bouchard and Bill Triggs. Hierarchical part-based visual object categorization. In
CVPR ’05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision
149
and Pattern Recognition (CVPR’05) - Volume 1, pages 710–715, Washington, DC, USA, 2005.
IEEE Computer Society.
[15] G. M. Boynton. Attention and visual perception. Curr Opin Neurobiol, 15(4):465–469, Aug
2005.
[16] R. J. Brachman and H. J. Levesque. Morgan Kaufmann Publishers, 1985.
[17] J. J. Braithwaite and G. W. Humphreys. Inhibition and anticipation in visual search: evidence
from effects of color foreknowledge on preview search. Percept Psychophys, 65(2):213–237,
Feb 2003.
[18] C. Bundesen. A theory of visual attention. Psychol Rev, 97(4):523–547, Oct 1990.
[19] G.T. Buracas, T.D. Albright, and T.J. Sejnowski. Varieties of attention: A model of visual
search. Institute of Neural Computation Proc. 3rd Joint Symposium on Neural Computation,
6:11–25, 1996.
[20] P J Burt and E H Adelson. The laplacian pyramid as a compact image code. IEEE Trans on
Communications, 31:532–540, 1983.
[21] M. Carrasco, C. Penpeci-Talgar, and M. Eckstein. Spatial covert attention increases contrast
sensitivity across the csf: support for signal enhancement. Vision Res, 40(10-12):1203–1215,
2000.
[22] K. R. Cave. The featuregate model of visual selection. Psychol Res, 62(2-3):182–194, 1999.
[23] D. Chawla, G. Rees, and K. J. Friston. The physiological basis of attentional modulation in
extrastriate visual areas. Nat Neurosci, 2(7):671–676, Jul 1999.
[24] L. Chelazzi, J. Duncan, E. K. Miller, and R. Desimone. Responses of neurons in inferior
temporal cortex during memory-guided visual search. J Neurophysiol, 80(6):2918–2940, Dec
1998.
[25] L. Chelazzi, E. K. Miller, J. Duncan, and R. Desimone. A neural basis for visual search in
inferior temporal cortex. Nature, 363(6427):345–347, May 1993.
[26] M. M. Chun and Y . Jiang. Contextual cueing: Implicit learning and memory of visual context
guides spatial attention. Cognitive Psychology, 36:28–71, 1998.
[27] C. L. Colby and M. E. Goldberg. Space and attention in parietal cortex. Annu Rev Neurosci,
22:319–49, 1999.
[28] C. E. Connor, D. C. Preddie, J. L. Gallant, and D. C. Van Essen. Spatial attention effects in
macaque area v4. J Neurosci, 17(9):3201–3214, May 1997.
[29] S. M. Courtney, L. G. Ungerleider, K. Keil, and J. V . Haxby. Object and spatial visual working
memory activate separate neural systems in human cortex. Cereb Cortex, 6(1):39–49, Jan 1996.
[30] P. De Graef, D. Christiaens, and G. d’Ydewalle. Perceptual effects of scene context on object
identification. Psychological Research, 52:317–329, 1990.
[31] G. Deco and E. T. Rolls. Oxford Press, 2002.
150
[32] S. Deneve, P. E. Latham, and A. Pouget. Reading population codes: a neural implementation
of ideal observers. Nat Neurosci, 2(8):740–745, Aug 1999.
[33] R. Desimone and J. Duncan. Neural mechanisms of selective visual attention. Annu Rev
Neurosci, 18:193–222, 1995.
[34] R L DeValois, D G Albrecht, and L G Thorell. Spatial-frequency selectivity of cells in macaque
visual cortex. Vision Research, 22:545–559, 1982.
[35] P. Driver, J. McLeod and Z. Dienes. Motion coherence and conjunction search: Implications
for guided search theory. Perception and Psychophysics 51, 1:79–85, 1992.
[36] J. Duncan. Boundary conditions on parallel processing in human vision. Perception,
18(4):457–469, 1989.
[37] J Duncan and G W Humphreys. Visual search and stimulus similarity. Psychological Rev,
96:433–458, 1989.
[38] M. D’Zmura. Color in visual search. Vision Research 31, 6:951–966, 1991.
[39] H. E. Egeth, R. A. Virzi, and H. Garbart. Searching for conjunctively defined targets. J Exp
Psychol Hum Percept Perform, 10(1):32–39, Feb 1984.
[40] S Engel, X Zhang, and B Wandell. Colour tuning in human visual cortex measured with
functional magnetic resonance imaging [see comments]. Nature, 388(6637):68–71, Jul 1997.
[41] J. T. Enns. Seeing textons in context. Perception and Psychophysics, 39:143–147, 1986.
[42] R. Fergus, P. Perona, and A. Zisserman. A sparse object category model for efficient learning
and exhaustive recognition. In CVPR ’05: Proceedings of the 2005 IEEE Computer Society
Conference on Computer Vision and Pattern Recognition (CVPR’05) - Volume 1, pages 380–
387, Washington, DC, USA, 2005. IEEE Computer Society.
[43] A. Found and H. J. Muller. Searching for unknown feature targets on more than one dimension:
investigating a ’dimension-weighting’ account. Percept Psychophys, 58(1):88–101, Jan 1996.
[44] S. P. Gandhi, D. J. Heeger, and G. M. Boynton. Spatial attention affects brain activity in human
primary visual cortex. Proc Natl Acad Sci U S A, 96(6):3314–3319, Mar 1999.
[45] J. P. Gottlieb, M. Kusunoki, and M. E. Goldberg. The representation of visual salience in
monkey parietal cortex. Nature, 391:481–484, 1998.
[46] P.E. Haenny and P.H. Schiller. State dependent activity in monkey visual cortex. single cell
activity in v1 and v4 on visual tasks. Exp Brain Res, 69:245–259, 1988.
[47] F. H. Hamker. A dynamic model of how feature cues guide spatial attention. Vision Res,
44(5):501–521, Mar 2004.
[48] H. L. Hawkins, S. A. Hillyard, S. J. Luck, M. Mouloua, C. J. Downing, and D. P. Wood-
ward. Visual attention modulates signal detectability. J Exp Psychol Hum Percept Perform,
16(4):802–811, Nov 1990.
151
[49] S. He, P. Cavanagh, and J. Intriligator. Attentional resolution and the locus of visual awareness.
Nature, 383(6598):334–337, Sep 1996.
[50] J. M. Henderson and A. Hollingworth. High level scene perception. Annual Review of Psy-
chology, 50:243–271, 1999.
[51] Gerd Herzog and Peter Wazinski. VIsual TRAnslator: Linking perceptions and natural lan-
guage descriptions. Artificial Intelligence Review, 8(2-3):175–187, 1994.
[52] S. A. Hillyard, E. K. V ogel, and S. J. Luck. Sensory gain control (amplification) as a mechanism
of selective attention: electrophysiological and neuroimaging evidence. Philos Trans R Soc
Lond B Biol Sci, 353(1373):1257–1270, Aug 1998.
[53] J. Hodsoll and G. W. Humphreys. Driving attention with the top down: the relative contribution
of target templates to the linear separability effect in the size dimension. Percept Psychophys,
63(5):918–926, Jul 2001.
[54] A. Hollingworth. Constructing visual representations of natural scenes: The roles of short- and
long-term visual memory. submitted.
[55] A. Hollingworth and J. M. Henderson. Accurate visual memory for previously attended objects
in natural scenes. Journal of Experimental Psychology: Human Perception and Performance,
28:113–136, 2002.
[56] A. Hollingworth, C. C. Williams, and J. M. Henderson. To see and remember: Visually spe-
cific information is retained in memory from previously attended objects in natural scenes.
Psychonomic Bulletin and Review, 8:761–768, 2001.
[57] J. B. Hopfinger, M. H. Buonocore, and G. R. Mangun. The neural mechanisms of top-down
attentional control. Nat Neurosci, 3(3):284–291, Mar 2000.
[58] Alexander C. Huk and David J. Heeger. Task-related modulation of visual cortex. J Neuro-
physiol, 83:3525–3536, 2000.
[59] D. E. Irwin. Information integration across saccadic eye movements. Cognitive Psychology,
23:420–456, 1991.
[60] D. E. Irwin. Memory for position and identity across eye movements. Journal of Experimental
Psychology: Learning, Memory, and Cognition, 18:307–317, 1992.
[61] D. E. Irwin. Visual memory within and across fixations. In K. Rayner (Ed.),Eye movements and
visual cognition: Scene perception and reading. New York: Springer-Verlag., 1992.
[62] D. E. Irwin and R. Andrews. Integration and accumulation of information across saccadic
eye movements. Attention and performance XVI: Information integration in perception and
communication, Cambridge, MA: MIT Press:125–155, 1996.
[63] D. E. Irwin and G. J. Zelinsky. Eye movements and scene perception: Memory for things
observed. Perception and Psychophysics, 64:882–895, 2002.
[64] L. Itti and P. Baldi. Bayesian surprise attracts human attention. In Advances in Neural In-
formation Processing Systems, Vol. 19 (NIPS*2005), pages 1–8, Cambridge, MA, 2006. MIT
Press.
152
[65] L. Itti and C. Koch. A saliency-based search mechanism for overt and covert shifts of visual
attention. Vision Res, 40(10-12):1489–1506, 2000.
[66] L. Itti and C. Koch. A saliency-based search mechanism for overt and covert shifts of visual
attention. Vision Research, 40(10-12):1489–1506, May 2000.
[67] L. Itti and C. Koch. Computational modeling of visual attention. Nature Reviews Neuroscience,
2(3):194–203, Mar 2001.
[68] L. Itti and C. Koch. Computational modelling of visual attention. Nat Rev Neurosci, 2(3):194–
203, Mar 2001.
[69] L. Itti and C. Koch. Feature combination strategies for saliency-based visual attention systems.
Journal of Electronic Imaging, 10(1):161–169, Jan 2001.
[70] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene
analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254–1259,
Nov 1998.
[71] B. Julesz. Foundations of Cyclopean Perception. University of Chicago Press, Chicago, Illi-
nois., 1971.
[72] B. Julesz and J.R. Bergen. Textons, the fundamental elements in preattentive vision and per-
ception of textures. The Bell System Technical Journal 62, 6:1619–1645, 1983.
[73] D Kahneman and A Treisman. Changing views of attention and automaticity. In D Parasura-
man, R Davies, and J Beatty, editors, Varieties of attention, pages 29–61. Academic, New York,
NY , 1984.
[74] N. Kanwisher. Repetition blindness: Type recognition without token individuation. Cognition,
27:117–143, 1987.
[75] N. Kanwisher and E. Wojciulik. Visual attention: insights from brain imaging. Nat Rev Neu-
rosci, 1(2):91–100, Nov 2000.
[76] N.A Kaptein, J. Theeuwes, and A.H.C van der Heijden. Search for a conjunctively defined
target can be selectively limited to a color-defined subset of elements. J Exp Psychol Hum
Percept Perform, 21(5):1053–1069, October 1995.
[77] S. Kastner, M. A. Pinsk, P. De Weerd, R. Desimone, and L. G. Ungerleider. Increased activity
in human visual cortex during directed attention in the absence of visual stimulation. Neuron,
22(4):751–761, Apr 1999.
[78] S. Kastner and L. G. Ungerleider. Mechanisms of visual attention in the human cortex. Annu
Rev Neurosci, 23:315–341, 2000.
[79] N. Kenner and J. M. Wolfe. An exact picture of your target guides visual search better than any
other representation [abstract]. Journal of Vision, 3(9):230a, 2003.
[80] C Koch and S Ullman. Shifts in selective visual attention: towards the underlying neural
circuitry. Hum Neurobiol, 4(4):219–27, 1985.
153
[81] G. Krieger, I. Rentschler, G. Hauske, K. Schill, and C. Zetzsche. Object and scene analysis by
saccadic eye-movements: an investigation with higher-order statistics. Spat Vis, 13(2-3):201–
214, 2000.
[82] T. Kumada. Feature-based control of attention: evidence for two forms of dimension weight-
ing. Percept Psychophys, 63(4):698–708, May 2001.
[83] A. A. Kustov and D. L. Robinson. Shared neural control of attentional shifts and eye move-
ments. Nature, 384:74–77, 1996.
[84] Martin Lades, Jan C. V orbr¨ uggen, Joachim Buhmann, J. Lange, Christoph von der Malsburg,
Rolf P. W¨ urtz, and Wolfgang Konen. Distortion invariant object recognition in the dynamic
link architecture. IEEE Transactions on Computers, 42:300–311, 1993.
[85] D. K. Lee, L. Itti, C. Koch, and J. Braun. Attention activates winner-take-all competition
among visual filters. Nat Neurosci, 2(4):375–381, Apr 1999.
[86] A G Leventhal. The Neural Basis of Visual Function (Vision and Visual Dysfunction Vol. 4).
CRC Press, Boca Raton, FL, 1991.
[87] Z. Li. A saliency map in primary visual cortex. Trends Cogn Sci, 6(1):9–16, Jan 2002.
[88] S. J. Luck, L. Chelazzi, S. A. Hillyard, and R. Desimone. Neural mechanisms of spatial selec-
tive attention in areas v1, v2, and v4 of macaque visual cortex. J Neurophysiol, 77(1):24–42,
Jan 1997.
[89] S. J. Luck and E. K. V ogel. The capacity of visual working memory for features and conjunc-
tions. Nature, 390:279–281, 1997.
[90] A Luschow and H C Nothdurft. Pop-out of orientation but no pop-out of motion at isolumi-
nance. Vision Research, 33(1):91–104, 1993.
[91] V . Maljkovic and K. Nakayama. Priming of pop-out: I. role of features. Mem Cognit,
22(6):657–672, Nov 1994.
[92] A. Martinez, L. Anllo-Vento, M. I. Sereno, L. R. Frank, R. B. Buxton, D. J. Dubowitz, E. C.
Wong, H. Hinrichs, H. J. Heinze, and S. A. Hillyard. Involvement of striate and extrastriate
visual cortical areas in spatial attention. Nat Neurosci, 2(4):364–369, Apr 1999.
[93] J. C. Martinez-Trujillo and S. Treue. Feature-based attention increases the selectivity of popu-
lation responses in primate visual cortex. Curr Biol, 14(9):744–751, May 2004.
[94] J. H. Maunsell and S. Treue. Feature-based attention in visual cortex. Trends Neurosci,
29(6):317–322, Jun 2006.
[95] C. J. McAdams and J. H. Maunsell. Effects of attention on orientation-tuning functions of
single neurons in macaque cortical area v4. J Neurosci, 19(1):431–441, Jan 1999.
[96] Baback Moghaddam and Alex Pentland. Probabilistic visual learning for object representation.
IEEE Trans. Pattern Anal. Mach. Intell., 19(7):696–710, 1997.
[97] G. Moraglia. Display organization and the detection of horizontal line segments. Perception
and Psychophysics, 45:265–272, 1989.
154
[98] J Moran and R Desimone. Selective attention gates visual processing in the extrastriate cortex.
Science, 229(4715):782–4, Aug 1985.
[99] B C Motter. Focal attention produces spatially selective processing in visual cortical areas v1,
v2, and v4 in the presence of competing stimuli. J Neurophysiol, 70(3):909–19, Sep 1993.
[100] B C Motter. Neural correlates of attentive selection for color or luminance in extrastriate area
v4. J Neurosci, 14(4):2178–89, Apr 1994.
[101] B C Motter. Neural correlates of feature selective memory and pop-out in extrastriate area v4.
J Neurosci, 14(4):2190–9, Apr 1994.
[102] H. J. Muller, D. Heller, and J. Ziegler. Visual search for singleton feature targets within and
across feature dimensions. Percept Psychophys, 57(1):1–17, Jan 1995.
[103] A. L. Nagy and R. R. Sanchez. Critical color differences determined with a visual search task.
Journal of the Optical Society of America A 7, 7:1209–1217, 1990.
[104] A. L. Nagy and R. R. Sanchez. Chromaticity and luminance as coding dimensions in visual
search. Human Factors 34, 5:601–614, 1992.
[105] J. Najemnik and W. S. Geisler. Optimal eye movement strategies in visual search. Nature,
434(7031):387–391, Mar 2005.
[106] K. Nakayama and G. H. Silverman. Serial and parallel processing of visual feature conjunc-
tions. Nature, 320:264–265, 1986.
[107] V . Navalpakkam and L. Itti. Modeling the influence of task on attention. Vision Research,
45(2):205–231, 2005.
[108] V . Navalpakkam and L. Itti. Optimal cue selection strategy. In Advances in Neural Information
Processing Systems, Vol. 19 (NIPS*2005), pages 1–8, Cambridge, MA, 2005. MIT Press.
[109] V . Navalpakkam and L. Itti. An integrated model of top-down and bottom-up attention for op-
timal object detection. In Proc. IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pages 1–7, New York, NY , Jun 2006.
[110] D. Norton and L. Stark. Scanpaths in saccadic eyemovements during pattern perception. Sci-
ence, pages 308–311, 1971.
[111] H. C. Nothdurft. Feature analysis and the role of similarity in preattentive vision. Percept
Psychophys, 52(4):355–375, Oct 1992.
[112] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the
spatial envelope. International Journal of Computer Vision, 42(3):145–175, 2001.
[113] A. Oliva, A. Torralba, M. S. Castelhano, , and J. M. Henderson. Top-down control of visual
attention in object detection. IEEE Proceedings of the International Conference in Image Pro-
cessing, pages September 14–17, 2003.
[114] J K O’Regan. Solving the “real” mysteries of visual perception: The world as an outside
memory. Can J Psych, 46:461–488, 1992.
155
[115] J. Palmer. Set-size effects in visual search: the effect of attention is independent of the stimulus
for simple tasks. Vision Res, 34(13):1703–1721, Jul 1994.
[116] S. E. Palmer. The effect of contextual scenes on the identification of objects. Memory and
Cognition, 3:519–526, 1975.
[117] Constantine Papageorgiou and Tomaso Poggio. A trainable system for object detection. Int. J.
Comput. Vision, 38(1):15–33, 2000.
[118] D. Parkhurst, K. Law, and E. Niebur. Modeling the role of salience in the allocation of overt
visual attention. Vision Res, 42(1):107–123, Jan 2002.
[119] H. Pashler. Target-distractor discriminability in visual search. Percept Psychophys, 41(4):385–
392, Apr 1987.
[120] H. Pashler. Familiarity and the detection of change in visual displays. Perception and Psy-
chophysics, 44:369–378, 1988.
[121] W.A. Phillips. On the distinction between sensory storage and short-term visual memory. Per-
ception and Psychophysics, 16:283–290, 1974.
[122] J. Pokorny, V . C. Smith, and M. Lutze. Heterochromatic modulation photometry. J Opt Soc
Am A, 6(10):1618–1623, Oct 1989.
[123] M. Pomplun. Saccadic selectivity in complex visual search displays. Vision Res, Jan 2006.
[124] M. I. Posner, C. R. Snyder, and B. J. Davidson. Attention and the detection of signals. J Exp
Psychol, 109(2):160–174, Jun 1980.
[125] W. A. Press and D. C. van Essen. Attentional modulation of neuronal responses in macaque
area v1. Soc. Neurosci. Abstr., 23(1026), 1997.
[126] R P Rao, G Zelinsky, M Hayhoe, and D H Ballard. Eye movements in iconic visual search.
Vision Research, 42(11):1447–1463, Nov 2002.
[127] R. A. Rensink. The dynamic representation of scenes. Visual Cognition, 7:17–42, 2000.
[128] R. A. Rensink. Change detection. Annual Review of Psychology, 53:245–277, 2002.
[129] R.A. Rensink, O’Regan J.K., and J.J. Clark. To see or not to see: the need for attention to
perceive changes in scenes. Psychological Science, 8:368–373, 1997.
[130] J. H. Reynolds and L. Chelazzi. Attentional modulation of visual processing. Annu Rev Neu-
rosci, 27:611–647, 2004.
[131] J. H. Reynolds, L. Chelazzi, and R. Desimone. Competitive mechanisms subserve attention in
macaque areas v2 and v4. J Neurosci, 19(5):1736–1753, Mar 1999.
[132] J. H. Reynolds, T. Pasternak, and R. Desimone. Attention increases sensitivity of v4 neurons.
Neuron, 26(3):703–714, Jun 2000.
[133] M Riesenhuber and T Poggio. Hierarchical models of object recognition in cortex. Nature
Neuroscience, 2(11):1019–1025, Nov 1999.
156
[134] M Riesenhuber and T Poggio. Models of object recognition. Nature Neuroscience, pages
1199–1204, 2000.
[135] P. R. Roelfsema, V . A. Lamme, and H. Spekreijse. Object-based attention in the primary visual
cortex of the macaque monkey. Nature, 395(6700):376–381, Sep 1998.
[136] Henry A. Rowley, Shumeet Baluja, and Takeo Kanade. Neural network-based face detection.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):23–38, 1998.
[137] Ueli Rutishauser, Dirk Walther, Christof Koch, and Pietro Perona. Is bottom-up attention useful
for object recognition?, 2004.
[138] I. A. Rybak, V . I. Gusakova, A.V . Golovan, L. N. Podladchikova, and N. A. Shevtsova. A
model of attention-guided visual perception and recognition. Vision Research, 38:2387–2400,
1998.
[139] M. Saenz, G. T. Buracas, and G. M. Boynton. Global effects of feature-based attention in
human visual cortex. Nat Neurosci, 5(7):631–632, Jul 2002.
[140] W.X. Schneider. Visual-spatial working memory, attention, and scene representation: A neuro-
cognitive theory. Psychological Research, 62:220–236, 1999.
[141] H. Schneiderman and T. Kanade. A statistical method for 3d object detection applied to faces
and cars. In In Proc. CVPR, pages 746–751, 2000.
[142] J. Shen, E. M. Reingold, and M. Pomplun. Distractor ratio influences patterns of eye move-
ments during visual search. Perception, 29(2):241–250, 2000.
[143] R. M. Shiffrin and W. Schneider. Controlled and automatic human information processing: Ii.
perceptual learning, automatic attending, and a general theory. Psychological Review, 84:127–
190, 1977.
[144] W. R. Softky and C. Koch. The highly irregular firing of cortical cells is inconsistent with
temporal integration of random epsps. J Neurosci, 13(1):334–350, Jan 1993.
[145] D. C. Somers, A. M. Dale, A. E. Seiffert, and R. B. Tootell. Functional mri reveals spatially
specific attentional modulation in human primary visual cortex. Proc Natl Acad Sci U S A,
96(4):1663–1668, Feb 1999.
[146] G Sperling. The information available in visual presentations. Psychological Monographs,
74:1–29, 1960.
[147] H. Spitzer, R. Desimone, and J. Moran. Increased attention enhances both behavioral and
neuronal performance. Science, 240(4850):338–340, Apr 1988.
[148] K G Thompson and J D Schall. Antecedents and correlates of visual detection and awareness
in macaque prefrontal cortex. Vision Res, 40(10-12):1523–1538, 2000.
[149] S J Thorpe, D Fize, and C Marlot. Speed of processing in the human visual system. Nature,
381:520–522, 1996.
[150] R B Tootell, M S Silverman, S L Hamilton, R L De Valois, and E Switkes. Functional anatomy
of macaque striate cortex. iii. color. Journal of Neuroscience, 8(5):1569–93, May 1988.
157
[151] A. Torralba. Contextual modulation of target saliency. In T. G. Dietterich, S. Becker, and
Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, Cambridge,
MA, 2002. MIT Press.
[152] A. Torralba. Contextual priming for object detection. International Journal of Computer Vi-
sion, 53(2):153–167, 2003.
[153] Antonio Torralba and Pawan Sinha. Statistical context priming for object detection. volume 1,
pages 763–770, 2001.
[154] A. Treisman. Features and objects: the fourteenth bartlett memorial lecture. Q J Exp Psychol
A, 40(2):201–237, May 1988.
[155] A. Treisman. Search, similarity, and integration of features between and within dimensions. J
Exp Psychol Hum Percept Perform, 17(3):652–676, Aug 1991.
[156] A. Treisman and G. Gelade. A feature integration theory of attention. Cognitive Psychology,
12:97–136, 1980.
[157] A. Treisman and S. Gormican. Feature analysis in early vision: Evidence from search asym-
metries. Psychological Review 95, 1:15–48, 1988.
[158] S Treue and J H Maunsell. Attentional modulation of visual motion processing in cortical areas
mt and mst. Nature, 382(6591):539–41, Aug 1996.
[159] S. Treue and J. C. Martinez Trujillo. Feature-based attention influences motion processing gain
in macaque visual cortex. Nature, 399(6736):575–579, Jun 1999.
[160] J. Triesch, D. H. Ballard, M. M. Hayhoe, and B. T. Sullivan. What you see is what you need. J
Vis, 3(1):86–94, 2003.
[161] A. Triesman and J. Souther. Illusory words: The roles of attention and top-down constraints in
conjoining letters to form words. Journal of Experimental Psychology: Human Perception and
Performance, 14:107–141, 1986.
[162] J K Tsotsos, S M Culhane, W Y K Wai, Y H Lai, N Davis, and F Nuflo. Modeling visual-
attention via selective tuning. Artificial Intelligence, 78(1-2):507–45, 1995.
[163] D. C. van Essen and C. H. Anderson. Information processing strategies and pathways in the
primate visual system. Academic Press, FL, 1995.
[164] P. Verghese and L. S. Stone. Combining speed information across space. Vision Res,
35(20):2811–2823, Oct 1995.
[165] Timothy J Vickery, Li-Wei King, and Yuhong Jiang. Setting up the target template in visual
search. J Vis, 5(1):81–92, Feb 2005.
[166] Paul Viola and Michael J. Jones. Robust real-time face detection. Int. J. Comput. Vision,
57(2):137–154, 2004.
[167] Paul Viola, Michael J. Jones, and Daniel Snow. Detecting pedestrians using patterns of motion
and appearance. Int. J. Comput. Vision, 63(2):153–161, 2005.
158
[168] D Walther, L Itti, M Reisenhuber, T Poggio, and C Koch. Attentional selection for object
recognition - a gentle way. Proc. 2nd Workshop on Biologically Motivated Computer Vision
BMCV2002, pages 472–479, Nov 2002.
[169] Dirk Walther, Duane R. Edgington, and Christof Koch. Detection and tracking of objects in
underwater video. In CVPR (1), pages 544–549, 2004.
[170] K. Watanabe. Differential effect of distractor timing on localizing versus identifying visual
changes. Cognition, 88(2):243–257, Jun 2003.
[171] T. Watanabe, Y . Sasaki, S. Miyauchi, B. Putz, N. Fujimaki, M. Nielsen, R. Takino, and
S. Miyakawa. Attention-regulated activity in human primary visual cortex. J Neurophysiol,
79(4):2218–2221, Apr 1998.
[172] M. Weber, M. Welling, and P. Perona. Unsupervised learning of models for recognition. In
Proc. 6th Europ. Conf. Comp. Vis., ECCV2000, Dublin, Ireland, June 2000.
[173] E Weichselgartner and G Sperling. Dynamics of automatic and controlled visual attention.
Science, 238(4828):778–780, Nov 1987.
[174] T. Williford and J. H. Maunsell. Effects of spatial attention on contrast response functions in
macaque area v4. J Neurophysiol, 96(1):40–54, Jul 2006.
[175] F.A. Wilson, S.P. O Scalaidhe, and P.S. Goldman-Rakic. Dissociation of object and spatial
processing domains in primate prefrontal cortex. Science, 260:1955–1958, 1993.
[176] Laurenz Wiskott, Jean-Marc Fellous, Norbert Kr¨ uger, and Christoph von der Malsburg. Face
recognition by elastic bunch graph matching. In Gerald Sommer, Kostas Daniilidis, and Josef
Pauli, editors, Proc. 7th Intern. Conf. on Computer Analysis of Images and Patterns, CAIP’97,
Kiel, number 1296, pages 456–463, Heidelberg, 1997. Springer-Verlag.
[177] J. M. Wolfe. Guided search 2.0: a revised model of visual search. Psyonomic Bulletin and
Review, 1(2):202–238, 1994.
[178] J. M. Wolfe, S. J. Butcher, and M. Hyle. Changing your mind: On the contributions of top-
down and bottom-up guidance in visual search for feature singletons. J Exp Psychol Hum
Percept Perform, 29(2):483–502, 2003.
[179] J. M. Wolfe, K. R. Cave, and S. L. Franzel. Guided search: an alternative to the feature
integration model for visual search. J. Exper. Psychol., 15:419–433, 1989.
[180] J. M. Wolfe, S. R. Friedman-Hill, M. I. Stewart, and K. M. O’Connell. The role of categoriza-
tion in visual search for orientation. Journal of Experimental Psychology: Human Perception
and Performance 18, 1:34–49, 1992.
[181] Jeremy M Wolfe, Todd S Horowitz, Naomi Kenner, Megan Hyle, and Nina Vasan. How fast
can you change your mind? The speed of top-down guidance in visual search. Vision Res,
44(12):1411–1426, Jun 2004.
[182] T. Womelsdorf, K. Anton-Erxleben, F. Pieper, and S. Treue. Dynamic shifts of visual receptive
fields in cortical area mt by spatial attention. Nat Neurosci, 9(9):1156–1160, Sep 2006.
159
[183] R.H. Wurtz, M.E. Goldberg, , and D.L. Robinson. Behavioral modulation of visual responses
in the monkey: Stimulus selection for attention and movement. Progress in Psychobiology and
Physiological Psychology, 9:43–83, 1980.
[184] Y . Yeshurun and M. Carrasco. Attention improves or impairs visual performance by enhancing
spatial resolution. Nature, 396(6706):72–75, Nov 1998.
[185] E. Zohary and S. Hochstein. How serial is serial processing in vision? Perception, 18(2):191–
200, 1989.
160
Abstract
Visual attention, the brain's mechanism for selecting important visual information, is influenced by a combination of bottom-up factors (sudden, unexpected visual events that differ spatio-temporally from their surroundings) and top-down, goal-relevant factors. Although both are crucial for real-world applications such as robot navigation and visual surveillance, most existing models are either purely bottom-up or purely top-down. In this thesis, we present a new model that integrates top-down and bottom-up attention. We begin with a wide perspective of how a task specification (e.g., "who is doing what to whom") influences attention during scene understanding. We propose and partially implement a general-purpose architecture illustrating how different bottom-up and top-down components of visual processing, such as the gist, the saliency map, object detection and recognition modules, working memory, long-term memory, and the task-relevance map, may interact and interface with one another to guide attention to salient and relevant scene locations. Next, we investigate the specifics of how bottom-up and top-down influences may integrate while searching for a target in a distracting background. We probe the granularity of information integration within feature dimensions such as color, size, and luminance. Results of our eye-tracking experiments show that bottom-up responses encoding feature dimensions can be modulated by not just one but several top-down gain-control signals, thus revealing a high granularity of integration. Finally, we investigate the computational principles underlying the integration. We derive a formal theory of the optimal integration of bottom-up salience with top-down knowledge about target and distractor features, such that the target's salience relative to the distractors is maximized, thereby speeding up search.
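As a rough, self-contained illustration of the objective stated above (maximizing the target's salience relative to the distractors by adjusting top-down feature gains), the sketch below uses assumed mean feature responses and a simple signal-to-noise-ratio heuristic for the gains; the numbers, the gain rule, and the use of Python/NumPy are illustrative assumptions and do not reproduce the formal derivation in the thesis.

    import numpy as np

    rng = np.random.default_rng(0)

    # Assumed mean responses of three feature detectors (e.g., red, green,
    # vertical) to the target and to the distractors; numbers are made up.
    mean_target = np.array([0.9, 0.2, 0.6])
    mean_distractor = np.array([0.3, 0.2, 0.6])

    def salience_ratio(gains, n_items=1000, noise_sd=0.05):
        # Salience of an item = gain-weighted sum of its noisy feature responses.
        # Return mean target salience divided by mean distractor salience.
        target_resp = mean_target + rng.normal(0.0, noise_sd, size=(n_items, 3))
        distractor_resp = mean_distractor + rng.normal(0.0, noise_sd, size=(n_items, 3))
        return (target_resp @ gains).mean() / (distractor_resp @ gains).mean()

    uniform_gains = np.ones(3)

    # Heuristic top-down tuning: boost features whose expected response is larger
    # for the target than for the distractors, keeping the total gain constant.
    snr = mean_target / mean_distractor
    tuned_gains = 3.0 * snr / snr.sum()

    print("uniform gains:", salience_ratio(uniform_gains))
    print("tuned gains  :", salience_ratio(tuned_gains))
    # The tuned gains yield a noticeably larger target-to-distractor salience
    # ratio than uniform gains, which is the sense in which top-down knowledge
    # tunes bottom-up salience to speed up search for the target.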
Asset Metadata
Creator: Navalpakkam, Vidhya (author)
Core Title: Integrating top-down and bottom-up visual attention
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science
Publication Date: 10/31/2006
Defense Date: 10/03/2006
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: bottom-up, gain modulation, OAI-PMH Harvest, scene understanding, top-down, visual attention, visual search
Language: English
Advisor: Itti, Laurent (committee chair), Arbib, Michael A. (committee member), Biederman, Irving (committee member), Koch, Christof (committee member)
Creator Email: navalpak@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-m118
Unique identifier: UC1314032
Identifier: etd-Navalpakkam-20061031 (filename), usctheses-m40 (legacy collection record id), usctheses-c127-31138 (legacy record id), usctheses-m118 (legacy record id)
Legacy Identifier: etd-Navalpakkam-20061031.pdf
Dmrecord: 31138
Document Type: Dissertation
Rights: Navalpakkam, Vidhya
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Repository Name: Libraries, University of Southern California
Repository Location: Los Angeles, California
Repository Email: cisadmin@lib.usc.edu