UNDERSTANDING THE RELATIONSHIP BETWEEN GOALS AND ATTENTION

James Tanner

Doctor of Philosophy (Computer Science)
Viterbi School of Engineering, Computer Science Department
Faculty of the USC Graduate School
University of Southern California
May 2018

James Tanner: Understanding the Relationship Between Goals and Attention, © May 2018

Supervisors: Laurent Itti
ABSTRACT

It is well known that tasks have a large influence on human gaze, and many saliency models incorporate some form of top-down features. However, these are all learned features, and little research has gone into quantifying the effects of task on eye movement behavior in a way that can predict those effects a priori. First, we demonstrate a new learning rule that suggests, from a functional perspective, how top-down connections might operate in the brain. Then, we propose a quantitative theory for measuring the relevance of information with respect to tasks. Finally, we perform an experiment to further validate this theory and utilize it to improve a saliency model.
PUBLICATIONS

Some ideas and figures have appeared previously. The following is a complete list of publications to which I have contributed during the course of this degree:

[1] Ali Borji and James Tanner. "Reconciling saliency and object center-bias hypotheses in explaining free-viewing fixations." In: IEEE Transactions on Neural Networks and Learning Systems 27.6 (2016), pp. 1214-1226.

[2] W. Shane Grant, James Tanner, and Laurent Itti. "Biologically plausible learning in neural networks with modulatory feedback." In: Neural Networks 88 (2017), pp. 32-48.

[3] James Tanner and Laurent Itti. "Goal relevance as a quantitative model of human task relevance." In: Psychological Review 124.2 (2017), p. 168.
CONTENTS

I Introduction
1 Introduction
1.1 Background
1.1.1 Top-Down Unsupervised Learning and Border Ownership
1.1.2 Visual Attention
1.1.3 Goal Relevance and Top-Down Attention

II Experiments
2 Border Ownership
2.1 Introduction
2.2 Modulatory Connections
2.2.1 Hebbian Learning and Modulatory Connections
2.3 Introducing Conflict Learning
2.3.1 Conflict Learning and Modulatory Connections
2.4 Network Modeling Results
2.4.1 Border Ownership
2.4.2 Orientation Selectivity
2.5 Discussion
2.5.1 Analyzing the Rule
2.5.2 Implications for Plasticity
2.5.3 Learning Border Ownership
2.6 Conclusion
3 Goal Relevance
3.1 Defining Goal Relevance - Theory
3.2 Defining Goal Relevance - Implementation
3.3 Methods
3.4 Results
3.5 Discussion
4 Saliency Models
4.1 Background
4.2 Experimental Methods
4.2.1 Computation of Goal Relevance
4.3 Analytical Methods
4.3.1 Solution Space Experiment
4.3.2 Saliency Model Experiment
4.4 Results
4.4.1 Solution Space Experiment
4.4.2 Saliency Model Experiment
4.5 Discussion
5 Conclusion

III Appendix
A Border Ownership Appendix
A.1 Stability of Modulatory Connections
A.1.1 Conflict Learning Transitions
A.1.2 Generalized Hebbian Algorithm
A.1.3 BCM
A.2 Learning Rule Details
A.2.1 Activation
A.2.2 Learning
A.3 Experimental Methods
A.3.1 Simple Network
A.3.2 Border Ownership Network Architecture
A.3.3 Orientation Selective Network Architecture
A.3.4 Parameter Listing and Source Code
B Saliency Models Appendix
B.1 Effect of the Fitness Parameter on NSS Scores
B.2 AUC Results

Bibliography
LIST OF FIGURES

Figure 1   An orientation map from a ferret V1.
Figure 2   An optical illusion demonstrating the concept of border ownership. Depending on which side owns the edges, the image can be interpreted either as a vase or as two human faces.
Figure 3   A visualization of the processing in a saliency model by Itti and Koch [60] and its output.
Figure 4   An example image from Yarbus [133], showing that task has a strong effect on eye behavior. Left: Image presented to a participant. Middle: Eye movements of the participant when asked to determine the age of the people in the image. Right: Eye movements of the participant when asked to determine the time period depicted.
Figure 5   Eye movements performed by a participant while making a peanut butter and jelly sandwich. Adapted from [53].
Figure 6   Virtual environment navigated by the participants while performing tasks. Left: Task directions were "pickup purple litter and avoid blue obstacles". Middle: Same as left, but objects are more densely cluttered. Right: Additional salient objects are added to the scene. Adapted from [106].
Figure 7   Left: Driving apparatus. Right: Virtual environment displayed to the subject. The white crosshairs indicate the subject's gaze location. Adapted from [118].
Figure 8   Simple Network With Modulatory Connections
Figure 9   States of the Simple Network
Figure 10  Border Ownership Model Architecture
Figure 11  Learned Feedback Receptive Fields for Border Ownership
Figure 12  Learned Feedforward and Lateral Receptive Fields for Border Ownership
Figure 13  Border Ownership Polarity Assignments
Figure 14  Conflict Learning Components
Figure 15  V1 Network Architecture
Figure 16  Orientation Selectivity Results
Figure 17  A and C: A simple 2D environment without (A) and with (C) an obstacle being evaluated, along with the sampled RRT paths. Although the paths are randomly sampled, the distributions appear less random than expected due to the nature of RRT's sampling method and the so-called Voronoi bias inherent to this method [71]. B and D: Normalized grid counts for each grid cell, computed as the number of RRT paths that traverse a given grid cell, followed by normalization to a probability distribution P(x,y) (color scale at right shows probability density values). We can compute goal relevance as the difference between P(x,y) in panel (B) and P(x,y|D) in panel (D) using Equation 15. In this paper, the relevance values range between 0 and 1.4 Rels. For this example, the relevance of the added obstacle to the task of traveling from start to goal is 0.87 Rels (relevance units).
Figure 18  Environments and probability distribution heat maps required to compare the relevance of two objects in a more complex environment. A: Environment without either of the objects in question. B: Environment with the first object in question. C: Environment with the second object in question. Applying Equation 15 to compare distributions A with B and A with C provides the relevance values of the first and second object, respectively. In this case, the red objects in B and C had relevance values of 0.19 and 0.70 Rels respectively, which agreed with the participant responses (4 and 34 votes respectively).
Figure 19  Left pair: Image pair with rectangular objects. Right pair: Image pair with convex polygon objects. The first and third images contain the more relevant new obstacles according to our theory. We thus expected that a majority of human responses would be "Left" for both image pairs. The second panel is an example of an environment that would be excluded from the experiment, because the relevant obstacle falls outside of the circle formed by the start and goal.
Figure 20  Agreement histogram over image pairs.
Figure 21  Model accuracies for image pairs with different levels of human agreement, and corresponding regression lines. The number of image pairs in each category is shown on the X axis with the agreement values.
Figure 22  Scatter plot of goal relevance differences and human agreements for each image pair, and the regression line.
Figure 23  Experimental setup, where participants played Mario World while having their eye movements recorded.
Figure 24  Example of a goal relevance test object. The enemy Goomba is too high to threaten normal Mario. However, high-jumping Mario must be careful to jump over the pipe while not colliding with the enemy.
Figure 25  More goal relevance test object examples. A: Same as Figure 24. The flying Goomba is too high, so it will only affect high-jumping Mario. B: Normal Mario must go through the middle opening in the wall. However, small Mario can also choose to walk under the bottom, and high-jumping Mario can leap over the wall. C: Small Mario can avoid some enemies by walking under the walkway, but the other Marios cannot fit. D: High-jumping Mario can jump on top of the overhanging walkway to avoid an enemy. E: The shell inside the blocks at the top is highly salient from a bottom-up perspective because it bounces back and forth quickly and makes noise. However, no version of Mario can interact with it. F: The three Spiky enemies at the bottom are also uninteractable.
Figure 26  Visualization of goal relevance computation. In this case, we are computing the goal relevance of the lower set of blocks. A & B: First, a posterior state is created in which the object in question is removed. C & D: Then, the solution set for each state is found using the guided breadth-first search described above. Paths are colored from green to red based on their fitness compared to the optimal path found by A*, shown in yellow. Mario is too large to fit through the gap at the bottom, so he must jump through the middle. E & F: The solution sets are converted into probability distributions via grid discretization. These two images have been edited with increased gamma and blurring to improve visualization. Lastly, the goal relevance value is determined by applying Equation 19 to the two distributions.
Figure 27  Objects on which goal relevance is computed. A: Each red dot represents an object to which we assign a goal relevance value. Goal relevance is computed on enemies, terrain units at corners, blocks, and on Mario himself. B: The results of the goal relevance computations, visualized as a raw saliency mask. This is the same mask that will be used in our saliency model. In this case, the two most relevant objects are Mario and the Bullet Bill enemy. Note that some objects do not appear in the mask because of low or 0 values for goal relevance.
Figure 28  Attention Ratio (AR) and Goal Relevance Ratio (GRR) values for specific subsets of objects. These are the same object subsets shown in Figure 25A-F. For easier comparison, axes are scaled such that the AR and GRR values for regular Mario are centered. In A, the GRR is 0, so we scale based on the high-jumping values instead. Columns are highlighted in green if both the AR and GRR values differ in the same direction as this baseline (both above or both below), and red if they do not. This is explained further in the text. A-E: Triangles indicate that the AR and GRR values for a version are both significantly different from all of the other 3 columns, respectively. F: Instead of comparing between game versions, this graph compares AR and GRR values between interactive and non-interactive enemies, averaged across versions. The larger GRR value for interactive enemies is matched by a larger AR value.
Figure 29  Scatterplots of AR values vs. GRR values for each class of object on which goal relevance is evaluated. The best fitting linear model is drawn for each plot, along with the statistics results for that line.
Figure 30  Example images along with the model prediction masks. We show the mask produced by each individual model (including blurring) along with the final combined model. In each image, the cyan dot marks the participant's gaze location. Note that here we show the final masks, which include blurring, particularly the BU and GR masks. Also, mask images have been edited with increased gamma to improve visibility. In most of these examples, goal relevance does very well. For each row, starting from the top: 1) Gaze is directed to a ledge as Mario falls. 2) Gaze is directed to a cannon which fires missiles. The left-most cannon is not considered relevant because Mario is in the ascent of his jump, so most solutions land beyond it. 3) Gaze is directed towards an enemy. 4) Gaze is directed towards a platform onto which Mario will soon land. 5) The last example shows a situation where the TD model performs much better. Gaze is being directed towards the background above Mario. Goal relevance is incapable of predicting this location because there are no objects there, but the learned model places a low probability there.
Figure 31  NSS scores for all combinations of the 3 model components: bottom-up, top-down learned, and goal relevance. In all cases, a combination that includes goal relevance performs significantly better than the corresponding combination without goal relevance.
Figure 32  States of the Simple Network
Figure 33  NSS scores for the GR and full combined models for different values of the fitness parameter. The difference in the scores is very small, so using a low value does not affect our results.
LIST OF TABLES

Table 1   Descriptions of all features tested for SVM-B. Shapes 1 and 2 are the left and right shapes, respectively.
Table 2   Contributions of each feature used in SVM-B. The accuracy of an SVM trained on each feature individually is shown, along with a progression from highest to lowest accuracy. These results are from the primary experiment. Features 5, 9, 11, and 12 represented the best accuracy of all 1470 combinations of features. Because features 5 & 9 are included without their pairs (features 6 & 10), we also show alternative combinations for comparison.
Table 3   Prediction accuracy of models & inter-subject agreement
Table 4   Comparison of model predictions between primary and control experiments. Accuracy for the primary experiment is recomputed where each individual participant response is predicted, rather than the majority image, since one cannot determine a majority image in the control experiment that used unique stimuli on every trial. Standard deviations across individual participants are shown for each value.
Table 5   Statistical results using the Wilcoxon rank sum test [45], corresponding to the data in Figure 28.
Table 6   Parameter Listing
Table 7   AUC scores for all combinations of the 3 model components: bottom-up, top-down learned, and goal relevance. In all cases, a combination that includes goal relevance performs significantly better than the corresponding combination without goal relevance.
ACRONYMS
BCM Bienenstock-Cooper-Munro
BO Border Ownership
FOV Field Of View
GHA Generalized Hebbian Algorithm
KL Kullback-Leibler
LTD Long-Term Depression
LTP Long-Term Potentiation
MDP Markov Decision Processes
RRT Rapidly-exploring Random Trees
SLT Short and Long-Term
SOM Self-Organizing Map
STDP Spike-Timing-Dependent Plasticity
SVM Support Vector Machine
Part I
INTRODUCTION

1 INTRODUCTION
Human visual attention has been studied for decades, and great strides have been made. We have identified many brain regions involved in the computations of attention [47] and have developed computational models that can predict where people look [94]. Much work has gone into discovering which pieces of information, or features, the brain uses during this computation, and these features can be organized into two broad categories: so-called bottom-up features (e.g., color, orientation, motion) and top-down features (how context and a person's goals affect their attention). Current models include features from both of these categories, but the underlying principles behind the top-down features remain poorly understood. For example, how might top-down connections actually work in the brain? What makes information relevant to a goal? And then, how does the relevance of a piece of information affect its influence on attentional behavior? Most models that include top-down features avoid these issues by applying machine learning techniques [100, 106], which find patterns in the effects of top-down information on attention without explaining the inner mechanisms that give rise to those patterns. The aim of this thesis is to suggest answers to these three questions.
The first of these questions is answered by the experiment in Chapter 2. This chapter describes the development of an unsupervised neural learning model based on a new learning rule for handling top-down, or feedback, connections. The new rule enables the model to learn, for the first time, neurons sensitive to border ownership (BO), a concept known to be computed in the brain [136]. The second question, regarding what makes information relevant to goals, is addressed by the experiment in Chapter 3. Here a quantitative definition of goal relevance is presented to measure the relevance of information with respect to goals, and a theoretical model is built from this definition, which is then shown to agree with human responses to a simple task. Finally, in Chapter 4, we build a saliency model using goal relevance as a top-down feature to determine its impact on eye movement behavior.
1.1 Background

1.1.1 Top-Down Unsupervised Learning and Border Ownership
The brain has a remarkable ability to learn to process complicated input through self-organization, and since the studies of Hubel and Wiesel [129] it has been known that the development of early visual processes is dependent on experience. Learning has also been shown to be prevalent in and instrumental to the function of many other areas [42, 84] and is likely key to understanding the brain. Hebbian-based models have come a long way in explaining potential mechanisms of learning [27, 55, 128], especially in feedforward networks [117], but an increasing amount of literature suggests that explaining plasticity requires novel approaches [79, 134]. Additionally, many neural models give little attention to the learning of feedback connections, despite the fact that they account for the majority of connections in the brain [85]. Given the prevalence of feedback, we cannot hope to truly understand the brain while dismissing more than half of its connections.
Early methods for unsupervised learning, such as Kohonen's Self-Organizing Map (SOM), did not consider top-down (feedback) connections, and were instead concerned with explaining the spatial organization of various brain functions [69]. The SOM, however, has some issues with biological plausibility, especially regarding its reliance on global processes and connectivity [92]. The LISSOM model [113] eliminated global connectivity requirements by incorporating local Hebbian-driven learning, which allowed previously hard-coded lateral connections to be learned. Sirosh and Miikkulainen's work was further extended in GCAL [7], which introduced adaptive firing thresholds and gain control to robustly model the development of V1 [117]. The introduction of Hebbian learning to models of self-organization allowed more phenomena to be explained, though such models have largely remained focused on feedforward networks.
Other attempts to model neural learning have introduced additional mechanisms on top of Hebbian learning, often revolving around explicit synaptic weakening. The Bienenstock-Cooper-Munro (BCM) rule [11] uses a floating threshold based on activation to modulate the magnitude and sign of a Hebbian update. Spike-timing-dependent plasticity (STDP; [114]) uses the relative timing of neuron spikes to similarly affect its Hebbian-like update. The Leabra framework [96], which combines Hebbian learning with error-driven learning, takes ideas from both of these concepts, using activation over multiple time scales to control a BCM-like threshold.
With few exceptions, the traditional focus of many of these self-organizing rules has been on feedforward processing. A common test is their ability to learn orientation selective units from natural images [9, 89, 117, 120, 127]. It is well known that orientation maps develop to maturity based on visual experience in the primary visual cortex of many mammals [37, 78, 101] (shown in Figure 1), and thus it is reassuring if a learning rule replicates this behavior. There are, however, a number of processes in the brain where top-down feedback may be beneficial or even necessary for function or development, such as attention [4, 132], object segmentation [105], object recognition [5], or border ownership [30]. Border ownership is of particular interest here as perhaps one of the simplest mechanisms that has been investigated with feedback models. It is thought to be necessary for figure-ground segmentation, and involves labeling which side of an edge points towards the inside of that edge's owning object [68]. A common illusion demonstrating this is shown in Figure 2.

Figure 1: An orientation map from a ferret V1.
Since the discovery of border ownership selective cells in visual areas V1, V2, and V4 [136], many computational models have attempted to explain its mechanisms, using lateral [107, 135], feedforward [119], or feedback [30] connections to explain the source of surrounding retinotopic border information. Models using feedback have further been shown to integrate well with models of visual attention [91, 102]. However, none of these models has shown or replicated how the complicated connections necessary for computing border ownership might be learned. In Chapter 2, we show that a model of border ownership utilizing recurrent modulatory feedback processing, similar to the structure of Craft et al. [30], can in fact be learned in a biologically plausible fashion using conflict learning, a feat not possible with traditional Hebbian methods.

Figure 2: An optical illusion demonstrating the concept of border ownership. Depending on which side owns the edges, the image can be interpreted either as a vase or as two human faces.
1.1.2 Visual Attention
As the central field of interest in this work is visual attention, it is worth reviewing the fundamentals. In studying visual attention, the primary goal is to understand the mechanisms that determine where humans look in a scene. This location is referred to as the gaze or fixation location. Often, researchers will focus specifically on studying saccadic eye movements, which are rapid eye movements that transfer the gaze location from one point to another. A single such movement is referred to as a saccade.

As theories are crafted to explain why and towards which locations saccades occur, it is necessary to have methods for evaluating them. The standard practice for achieving this is to develop a computational model for predicting eye movements, known as a saliency model. A typical experiment involves recording eye movements while presenting static images to participants. Later, the saliency model is provided with the same images and must produce a saliency map for each image. The exact formulation can vary, but saliency maps are typically probabilistic maps that mark regions of the image towards which higher probabilities of saccades are predicted.

Figure 3: A visualization of the processing in a saliency model by Itti and Koch [60] and its output.
Once these saliency maps are computed, some metric must be applied to determine the similarity between the saliency maps and the recordings of human eye movements. There is no inherently correct method to do this, and many metrics have been proposed. Researchers will frequently report their results using more than one metric because results can differ substantially depending on the metric selected. [19] recently provided a comparative analysis of many of these metrics in an attempt to make their properties more transparent and to make recommendations for which metrics are best in which situations. Many of these are also described in greater detail in [12]. Here, we briefly describe several of the most prominent metrics.
• Normalized Scanpath Saliency (NSS): NSS is defined as the saliency map value at the location of the gaze position, after the map has been normalized to have a mean of 0 and a unit standard deviation. NSS = 0 indicates that the model performs at chance, while NSS = 1 indicates that the predicted saliency values are one standard deviation above average (see the code sketch after this list).
• Area Under the Curve (AUC): Computes the area under the Receiver Operating Characteristic (ROC) curve. The computation involves selecting a threshold and creating a binary classifier on the saliency map, with the human fixations as the ground truth. The curve is drawn by varying the threshold and plotting the true positives vs. the false positives, and the area under that curve represents the model score, with 1 being a perfect score.

• Correlation Coefficient (CC): Measures the linear correlation coefficient between two variables. The first variable is the saliency map, and the second is another map that is empty except for 1's at the fixation locations, often with a Gaussian blur.

• Kullback-Leibler Divergence (KL): Measures the distance between two probability distributions. These distributions are created from the saliency map values at fixation locations and at random locations, placed into separate histograms.

• Information Gain (IG) [70]: Quantifies how much better a posterior predicts data than a prior. In this case, the prior is a baseline saliency map representing any biases present, such as a center bias, and the posterior is the saliency map computed from a specific image. The score is computed by taking the difference between this information gain and the information gain of a gold standard map.
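To make the first two metrics concrete, the following is a minimal sketch in Python of how NSS and a pixel-wise AUC might be computed for a single saliency map. It assumes NumPy and scikit-learn are available; the array shapes, variable names, and the particular AUC variant (every pixel treated as a sample, fixated pixels as positives) are illustrative choices, not the implementation used in this thesis.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def nss(saliency_map, fixations):
        # Normalize the map to zero mean and unit standard deviation,
        # then average the normalized values at the fixated pixels.
        s = (saliency_map - saliency_map.mean()) / saliency_map.std()
        rows, cols = zip(*fixations)  # fixations: list of (row, col) pixels
        return s[list(rows), list(cols)].mean()

    def auc(saliency_map, fixations):
        # Treat every pixel as a sample, with fixated pixels as the
        # positive class, and score the map as a binary classifier.
        labels = np.zeros(saliency_map.shape, dtype=int)
        rows, cols = zip(*fixations)
        labels[list(rows), list(cols)] = 1
        return roc_auc_score(labels.ravel(), saliency_map.ravel())

    # A random map scores at chance: NSS near 0 and AUC near 0.5.
    rng = np.random.default_rng(0)
    smap = rng.random((480, 640))
    fixations = [(100, 200), (240, 320), (400, 500)]
    print(nss(smap, fixations), auc(smap, fixations))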
When data is collected from human participants who are freely viewing the presented scenes, we can refer to the saliency models that try to predict this data as bottom-up models. This name comes from the fact that the models make their predictions purely using the visual data in the images, as in the feedforward (i.e., bottom-up) pathways in the visual cortex. The alternative category is top-down models, which additionally take into account the context of the presented scenes and the goals of the viewer. It has long been established that a viewer's task has a large effect on eye behavior [133], as shown in Figure 4. However, the earliest models focused on bottom-up predictions, and bottom-up models are still much more prominent in the literature than top-down models.

Figure 4: An example image from Yarbus [133], showing that task has a strong effect on eye behavior. Left: Image presented to a participant. Middle: Eye movements of the participant when asked to determine the age of the people in the image. Right: Eye movements of the participant when asked to determine the time period depicted.
1.1.3 Goal Relevance and Top-Down Attention
Relevance is a concept that humans use all the time. It is a measure of the importance of observations with respect to a context. Relevance allows humans to prioritize and filter sensory input by directing their attention to the more relevant sensory events. While driving, for example, people will often ignore things going on inside the car, in the sky, or in the oncoming lane. This might be accomplished with heuristics learned from experience, something humans do well [46].
Relevance has been extensively studied in linguistics [87], data mining, and information retrieval [26, 32, 104]. In these domains, it is often formulated as a statistical similarity metric between two concepts [29]. This works very well for explaining why "Obama" is relevant to "politics", or "E. coli" is relevant to "disease". However, these similarity metrics do not naturally extend to the driving example, because driving is a task. During driving, the relevance of objects is not determined by their similarity to another object, but by how they might affect the driver's goal of safely reaching a destination. This type of relevance, which will be referred to as goal relevance, has not received much attention in cognitive sciences, computer vision, robotics, or perception science, even though it is useful for perception and cognition in humans [24].
Many studies, further detailed below, have shown that task influences human attention [53], providing a wealth of empirical data: humans are very good at rapidly focusing onto the most relevant sensory information to achieve a goal. However, attempts to model this quantitatively have typically been limited to establishing associations between computed features of the sensory data and the tasks being investigated: for example, when driving and entering a left turn, humans tend to look into that turn, and from this tendency one can infer that this perhaps is the most relevant location to look at [100, 106]. These associations often do not render explicit which attributes or features of the sensory data were used to compute the relevance of particular locations to look at, or how such computation is carried out. A better understanding of the mechanisms underlying goal-driven behavior might be possible if one first provides a quantitative measure of goal relevance, which can then be used to establish possible causal relationships between relevant stimulus items or features and subsequent behavior. Thus, one motivation in the present work is to enable the development of top-down attention models that are based on theoretical knowledge, in which attention during a task can be viewed as a search for relevant data.
Intuitively, information received is relevant to an individual if it connects with background information to yield conclusions that matter to the individual [115, 130]. Thus, relevance must be considered together with context (background information) and goals (what matters). Data that yields stronger conclusions is more relevant, and, when all else is equal, Wilson and Sperber suggest using processing complexity as an additional tie-breaker. For example [130], of the three sentences (1) We are serving meat; (2) We are serving chicken; and (3) Either we are serving chicken or 7² - 3 is not 46; sentence (2) is the most relevant, as it yields stronger cognitive conclusions than (1) and is simpler to process than the otherwise logically equivalent (3).
In the context of data mining, one of the earliest formal definitions of relevance relies on logical entailment and inference: a sentence is relevant to a query if that sentence is a necessary assumption (axiom) to prove the query [51]. Note an interesting relation between novelty and relevance through redundancy [23]: relevance of a search result might decrease if that result is unsurprising or redundant with other already returned results. This led Carbonell and Goldstein to define the notion of maximal marginal relevance to penalize highly redundant search results.
Yet, there is still debate over the precise definition of relevance in the information retrieval and web search communities. Hjørland discusses five different views of relevance and argues for what he calls the "subject knowledge view", which suggests that relevance is not an inherent (or objective) attribute, but rather is dependent on the knowledge or beliefs of the subject who evaluates the relevance [57]. All five of the presented views focus on relevance as applied to web search, and Hjørland does not focus on the case where an agent is pursuing a goal. In a previous paper, however, he provides an interesting definition of relevance:

Something (A) is relevant to a task (T) if it increases the likelihood of accomplishing the goal (G) which is implied by T [57].

This is fairly intuitive, but is too vague to yield a quantitative result. It requires a likelihood function specific to the goal G, which must capture most of the problem. There exists a wide range of types of goals, from navigation to purchasing groceries to solving math problems, so the computation of relevance will certainly be specific to each goal. However, a definition that more clearly specifies the computation would be desirable.
In the meantime, for practical purposes, modern information retrieval systems use hundreds of features to rank the relevance of search results. The simplest features are based on the frequency of occurrence of the search terms in the documents, while more complex features also exploit position within a document, page layout and structure, hyperlink graph structure, and user feedback in the form of tags from social media sites [75, 99]. Although the most successful web search corporations do not disclose their exact mix of features, most modern ones employ statistical and machine learning methods to tune the features and their combinations [121]. While this approach is heuristic and still largely ad hoc despite more recent learning elements, it provides a practical definition of relevance in the specific domain of web search. Unfortunately, even the best search engine is often useless in attempting to gauge goal relevance. For example, a web search for "driving" may return pages that present the definition of driving, i.e., operating a vehicle. This is not always helpful in determining which entities in the driver's view are the most relevant to the driving task.

Figure 5: Eye movements performed by a participant while making a peanut butter and jelly sandwich. Adapted from [53].
In attention research, it has been well established that tasks and goals influence human attention, ever since the studies of Buswell [18, 28, 133]. Fecteau et al. use neurophysiological evidence to suggest that the classical saliency map idea, whereby attention is automatically drawn to sensory stimuli that stand out from clutter [61, 67], must be supplemented by a relevance map to produce the overall attention pattern, which they suggest is guided by a priority map that combines salience and relevance [36]. Hayhoe and Ballard provide a review of recent advances in goal-directed attention research [54], in which they note that attention has been examined (among other things) in a number of visuo-motor tasks including walking, driving, sports, and making sandwiches [53, 73, 74, 116] (Figure 5). In these cases, the task is usually considered a top-down influence on attention. Rothkopf et al. designed a virtual experiment where participants were given the task to either "pickup litter" or "avoid obstacles". This study showed a clear effect of task on attention, and it was demonstrated that local and global visual features around the participants' focal point of attention were sufficient to predict the task being performed [106] (Figure 6). Another study [83] showed that top-level features about the task, including scene context, affected the saccade patterns. Navalpakkam and Itti [94] developed a computational model for predicting saccades, in which a task definition provides top-down biasing over which visual features would more strongly contribute to a saliency map that guides attention. Given a task, this model first searches its long-term memory for relevant entities (where relevance is a conceptual distance over a concept ontology), and then biases attention towards their visual features. Items selected by attention are analyzed by an object recognition process, and successfully identified objects in a scene are added to the model's working memory, allowing it to reason about both relevant entities and found entities when planning the next attention biasing and next eye movement. While the definition of relevance used in this model is a simple classical similarity metric, this model provides a computational framework in which relevance directly interacts with attention deployment.

Figure 6: Virtual environment navigated by the participants while performing tasks. Left: Task directions were "pickup purple litter and avoid blue obstacles". Middle: Same as left, but objects are more densely cluttered. Right: Additional salient objects are added to the scene. Adapted from [106].
More recent approaches to top-down saliency models have aimed to supply category-specific saliency models, i.e., provide one saliency map for an image focusing on a cat, and another map for the same image focusing on a bottle instead. This approach allows the appropriate map to be chosen based on the object of interest. In [66], object classes are represented using superpixels, which are learned in a Conditional Random Field (CRF) framework. Another variant [131] uses sparse coding from image patches, which are also learned in a CRF. These methods provide excellent performance, but they are machine learning models and thus still leave us without any conceptual understanding of top-down attention. Also, these methods are essentially solving an object detection task, and therefore will have difficulty on tasks where the contextual information is more than a simple object of interest.
Several studies have used learning from examples to implicitly capture task influences on eye movements [100]. Borji et al. recorded human gaze during video game play to learn a Dynamic Bayesian Network for predicting future gaze [13]. This model was able to outperform purely bottom-up saliency models in predicting a player's next eye movement. The model essentially captures the current state of the game through analysis of past and present video frames, user actions, and eye movements, allowing it to learn associations between game state and relevant locations/objects, where relevance is here defined as the likelihood of being fixated next. A shortcoming of this approach is that it is a black box, in that the learned parameters of the Dynamic Bayesian Network are not easily interpreted. Another is that the learning is task-specific, so the model must be re-trained for each new task. As exemplified by this model, despite the success of these studies, we still do not understand algorithmically how a given task affects attention. As a result, even the current state of the art in gaze prediction during tasks is unable to reliably mimic human gaze behavior [13]. Thus, we do not have models that, given a task, can predict a priori which objects will attract more attention.
Ideas from information theory may contribute to defining goal relevance, especially since information theory has been the basis for many attention models. For example, Renninger et al. developed a model for predicting eye movements while discriminating among images of object silhouettes, which looks first at the parts of the object contours with the most information [103]. They define this as the area with the least entropy over a local orientation histogram. Their implementation can be viewed as a special case of goal relevance, because it is tailored to object contours, and, in this case, equating relevance to information explains human behavior well. In related research, it has been suggested, in the special case of visual search, that human observers make saccades towards regions that maximize posterior information about target location, rather than directly towards the most probable target location [93]. Thus, in this framework (search for a low-contrast target in noisy backgrounds), locations with high posterior are deemed more relevant. These results show that information theory has been used successfully to investigate special cases of goal relevance, although, thus far, it has not given rise to a general definition of goal relevance.
A recent study of human attention under task by Sullivan et al. investigated the effects of uncertainty in task-related attention [118]. In these experiments, participants' eye movements were recorded while performing a simulated driving task in which they were instructed to maintain a distance behind a lead car and maintain a specific speed (Figure 7). In different trials, the same participants were told to place a higher priority on either maintaining the distance or on maintaining the speed. In addition, half of the trials introduced a noise effect, which would randomly adjust the car's speed by small amounts. It was found that adding noise significantly increased the number of saccades made to the speedometer, but only when the task priority was to maintain speed. This not only suggests that task and uncertainty affect attention, but also that there is an interaction between these two factors. A definition of goal relevance may thus need to handle or embrace uncertainty.

Figure 7: Left: Driving apparatus. Right: Virtual environment displayed to the subject. The white crosshairs indicate the subject's gaze location. Adapted from [118].
One such definition of relevance that embraces uncertainty is that of Statistical Relevance, as discussed in the work of Salmon [108]. This theory defines a piece of information to be relevant to an event if the information contributes to the explanation of the event, as measured in terms of entropy from information theory. In other words, information is relevant if the event is more probable or less probable in the presence of the information than in its absence. This formulation is very appealing for our problem for several reasons: 1) it provides a well-defined mathematical framework as its foundation, 2) it handles uncertainty, which is present in many goals, and 3) goals are somewhat analogous to events in that a goal implies a set of potential solutions, or potential events. This suggests that a similar formulation might be used to determine goal relevance.
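As a toy illustration of this criterion, the sketch below (with invented probabilities) marks a piece of information as statistically relevant to an event exactly when it shifts the event's probability, and uses the Kullback-Leibler divergence between posterior and prior as one possible entropy-based magnitude for that shift; the specific numbers and the choice of KL are assumptions of this sketch, not part of Salmon's definition.

    import numpy as np

    def kl_divergence(p, q):
        # D(p || q) in bits; assumes q > 0 wherever p > 0.
        p, q = np.asarray(p, float), np.asarray(q, float)
        mask = p > 0
        return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

    prior = np.array([0.5, 0.5])      # P(E), P(not E) before the information
    posterior = np.array([0.8, 0.2])  # P(E | I), P(not E | I) after it

    # I is statistically relevant to E iff the posterior differs from the
    # prior; the divergence quantifies how strongly.
    print(kl_divergence(posterior, prior))  # > 0, so I is relevant to E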
Gottlieb and Balan [48] present an experiment on attention and decision-making in monkeys that cannot be explained by any previous models. The authors suggest that attention is a decision process based on the utility of information. They also point out neurobiological evidence that the same parietal neurons carrying information about saccadic motor decisions also carry information unrelated to any other aspects of those decisions. It is proposed that this additional information reflects the information utility, which then affects the saccades. The question of how such a utility metric is computed in general is left open and should be addressed by a quantitative goal relevance theory.
Beyond information theory, two recent theoretical frameworks might come to mind as related to goal relevance. The first is that of Markov Decision Processes (MDPs), which are concerned with finding optimal sequential decision making policies in stochastic environments. These processes represent a problem as a set of states, along with available actions that probabilistically transition between those states, and immediate rewards associated with those transitions. The goal is to find a state-to-action mapping (a policy) that maximizes the total long-term expected discounted reward (where possible rewards are discounted by how far in the future they are) given the stochastic nature of the environment (and hence, the fact that actions may fail to yield the desired outcomes). Although MDPs assume, like early top-down attention models [100], a correspondence between state and optimal action (the optimal policy), fully modeling this correspondence is often intractable in large state spaces. A quantitative notion of relevance could in such cases possibly help with defining policies with respect to relevant features of classes of states, as opposed to individual states. The second related concept is that of Value of Information, defined as the price a decision maker would be willing to pay for an additional piece of information, in view of making a decision. It is often applied to problems where decision making and monetary resources are involved, such as business [44] and investing [20]. One might speculate that more valuable information in this framework is more relevant to the goals of the decision maker. This, however, requires that the cost of information be on a comparable scale to that of the outcome, in which case one typically decides to pay for the additional information if its cost is lower than the expected increase in gain that follows the decision made with vs. without the additional information. While this is applicable in scenarios where money is the common value metric (monetary cost of information, monetary gain after decision), for many goal-directed tasks, such as our driving example, establishing a parity of values is not obvious (e.g., what is the cost of making one additional eye movement with respect to the goal of arriving safely at the destination?). Nevertheless, the manner in which the value of information is typically used to change the prior probabilities of a decision making agent provides some additional inspiration towards our goal relevance definition efforts.
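For concreteness, here is a minimal decision-theoretic sketch of the value of perfect information, with invented payoffs: an agent chooses between two actions under uncertainty about the world's state, and the value of the information is the gain in expected payoff from deciding after the uncertainty is resolved rather than before.

    import numpy as np

    p_state = np.array([0.7, 0.3])        # prior over two world states

    payoff = np.array([[10.0, -20.0],     # payoff of action 0 in each state
                       [-5.0,  15.0]])    # payoff of action 1 in each state

    # Without information: commit to the single best action in expectation.
    ev_without = payoff.dot(p_state).max()            # = 1.0

    # With perfect information: take each state's best action, then average.
    ev_with = (payoff.max(axis=0) * p_state).sum()    # = 11.5

    # Expected value of perfect information: the most one should pay for it.
    print(ev_with - ev_without)                       # = 10.5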
Part II
EXPERIMENTS

2 BORDER OWNERSHIP

2.1 Introduction
The brain has a remarkable ability to learn to process complicated input through self-organization, and since the studies of Hubel and Wiesel [129] it has been known that the development of early visual processes is dependent on experience. In the decades since, models of visual development have focused on feedforward pathways, with little attention given to the learning of modulatory connections. Modulatory connections, which adjust existing neuron activations instead of directly driving them, dominate feedback pathways, which themselves constitute a majority of the connections in the brain [85]. Hebbian-based models have come a long way in explaining potential mechanisms of learning [27, 55, 128], especially in feedforward models of V1 [117], but an increasing amount of literature suggests that more comprehensively explaining plasticity requires novel approaches [79, 134]. We will argue that the principles of Hebbian learning, known colloquially as "fire together, wire together," cannot on their own learn correctly or maintain stability in the context of modulatory connections.

The primary contributions of this work are twofold: the development of a new learning rule that handles modulatory connections, and a demonstration that a stimulus-driven feedback model of border ownership can be learned in a biologically plausible way as a result of the new learning rule. The new learning rule, which we call conflict learning, is composed of three conceptually simple, physiologically plausible mechanisms: adjusting plasticity based on the activation of strongly learned connections, using inhibition as an error signal to explicitly unlearn connections, and exploiting several timescales. With border ownership as our prototypical example, we show that a Hebbian learning rule fails to properly learn modulatory connections, while the components of our proposed rule enable it to learn the required connections. Border ownership, which involves the assignment of edges to owning objects, is perhaps one of the earliest and simplest visual processes dependent upon modulatory feedback [68], appearing in V1, V2, and V4 [136]. Although many models of its function exist (e.g., lateral models: [107, 135], feedforward: [119], and feedback: [30]), those incorporating feedback are especially promising, integrating well with models of attention [91, 102] and concepts of grouping [88]. However, until now, all of these models have used fixed, hand-crafted weights, with no demonstration of how the connection patterns for border ownership might be learned.
With our new learning rule, we demonstrate that inhibitory modulation of plasticity and competition are likely crucial mechanisms for learning modulatory connections. Additionally, we show that the rule can be used as a drop-in replacement for a Hebbian learning rule even in networks lacking any modulatory connections, such as an orientation selective model of primary visual cortex. Conflict learning is compared against a recent Hebbian-based rule (GCAL; [117]), which is a good baseline for comparison because its weight updates are governed purely by Hebbian logic and it operates at a level of abstraction that captures important physiological behaviors while still being usable in large scale neural network models (e.g., orientation selectivity) and being adaptable for use in new network architectures (e.g., border ownership). We demonstrate that conflict learning, like a Hebbian rule such as GCAL, can be used to learn a biologically consistent model of orientation selectivity. Our results further suggest that networks learned with conflict learning have improved noise and stability responses.
Conflict learning works in a fundamentally different way from previous learning rules by leveraging inhibition as an error signal to dynamically adjust plasticity. Though many existing techniques built upon Hebbian learning, such as those derived from STDP [spike-timing-dependent plasticity; 114] or BCM learning [11], have some method to explicitly control synaptic weakening (e.g., based on signal timing for STDP or comparisons to long term activation averages for BCM), in those rules inhibition only indirectly affects learning by lowering activation. Our successful application of the rule to learning models of orientation selectivity as well as border ownership serves as a prediction that modulatory connections in the brain require inhibition and competition to play a bigger role in the dynamics of neural plasticity and activation.
2.2 Modulatory Connections
Modulatory connections are the primary motivation for the development of conflict learning. They are found extensively in feedback projections related to visual processing, for example from visual cortex to the thalamus [31, 62, 63], from higher visual areas to primary visual cortex [21, 58], as well as from posterior parietal cortex to V5/MT [40]. Top-down modulatory influences also play a role in phenomena such as attention [4, 10, 132], object segmentation [105], and object recognition [5]. Attention is a modulatory effect and has the greatest impact on already active representations [17]. Modulatory feedback, used in much the same way as in our border ownership experiment, has been used to construct a model of attention that replicates numerous observed attentional effects on both firing rates and receptive field structure [90].
Modulatory connections can alter the existing activation of a neuron, but cannot cause activity in isolation; they must work in conjunction with driving inputs [16]. We can observe this distinction mathematically by first looking at the activation function for an artificial neuron, which is typically modeled by some function of its weighted inputs:

x_j = f\left( \sum_{i \in \text{input}} x_i w_{ij} \right)    (1)

where w_{ij} is the weight between neurons i and j and x_i is the activation of neuron i.
However, as modulatory connections are defined as those that do not directly drive the activation of a neuron, their effect must be distinguished from driving connections, which, in similar fashion to [16], we formalize as:

x_j = f\left( D_j + g(D_j, M_j) \right)    (2)

where D_j = \sum_{i \in \text{driving}} x_i w_{ij} and M_j = \sum_{i \in \text{modulatory}} x_i w_{ij}. g is a monotonically increasing function with respect to D_j, and D_j = 0 implies that g(D_j, M_j) = 0. Typically, g is a simple product between D_j and M_j [e.g., 6, 15, 105], hypothesized to be implemented biologically by backpropagation-activated coupling [76].
When feedforward inputs are taken to be driving and feedback to be modulatory, it can be said that feedback is gated by feedforward, an effect noted by [76]. [105] discuss the idea of gating in detail and use it to support a model of figure-ground segregation. This gating allows networks to integrate feedback without struggling to balance it against feedforward input or incurring spurious top-down-driven activation. The physiological mechanics of modulation have been best studied in relation to the thalamus, with a recent review by [123] showing that modulatory input is extensive and heterogeneous with regard to origin, neurotransmitter, and function. [16] discuss the evidence for the potential physiological implementation of modulatory feedback while developing a network-level circuit model for feedforward and feedback interaction.
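The gating behavior described by Equations 1 and 2 is easy to see in code. The sketch below assumes the common multiplicative choice g(D, M) = D * M and a rectifying f; the weights and inputs are illustrative, not taken from any model in this thesis.

    import numpy as np

    def activation(x_drive, w_drive, x_mod, w_mod):
        # Equation 2 with g(D, M) = D * M: modulatory input scales the
        # driving response but cannot create activity on its own.
        D = np.dot(x_drive, w_drive)  # driving sum, as in Equation 1
        M = np.dot(x_mod, w_mod)      # modulatory sum
        return np.maximum(D + D * M, 0.0)

    w_drive = np.array([0.6, 0.4])
    w_mod = np.array([1.0])

    # Driving input present: feedback boosts the response (prints 2.0).
    print(activation(np.array([1.0, 1.0]), w_drive, np.array([1.0]), w_mod))
    # No driving input: feedback alone produces nothing (prints 0.0).
    print(activation(np.array([0.0, 0.0]), w_drive, np.array([1.0]), w_mod))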
2.2.1 Hebbian Learning and Modulatory Connections
Traditional Hebbian-based learning rules adapt weights based on some function of the coincidental firing of the pre- and postsynaptic neurons:

\Delta w_{ij} = f(w_{ij}, x_i) \, g(x_j)    (3)

Hebbian learning in its most basic formulation has no mechanism to bound weight growth, making it trivially unstable. For our purposes we use a formulation of Hebbian learning that includes a normalization component for stability, adapted from [117]:

\Delta w_{ij} = \frac{w_{ij} + \eta x_i x_j}{\sum_k \left( w_{kj} + \eta x_k x_j \right)} - w_{ij}    (4)

where \eta is the learning rate. This weight update, and its normalization, are applied independently to driving and modulatory connections (i.e., all w_{ij} within a sum are of the same connection type).
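A direct transcription of Equation 4 might look like the following sketch, where w[i, j] connects presynaptic neuron i to postsynaptic neuron j and would hold only one connection type at a time; the learning rate and activation values are illustrative.

    import numpy as np

    def normalized_hebbian_update(w, x_pre, x_post, eta=0.1):
        # Equation 4: Hebbian growth followed by divisive normalization,
        # so each postsynaptic neuron's incoming weights keep summing to 1.
        raw = w + eta * np.outer(x_pre, x_post)      # w_ij + eta * x_i * x_j
        return raw / raw.sum(axis=0, keepdims=True)  # = w_ij + delta_w_ij

    # Two presynaptic inputs onto one postsynaptic neuron.
    w = np.array([[0.5], [0.5]])
    w = normalized_hebbian_update(w, np.array([1.0, 0.2]), np.array([1.0]))
    print(w.ravel(), w.sum())  # weights shift toward the more active input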
To better understand why such a Hebbian rule is not suitable for
learning modulatory connections, let us look at the dynamics of a
minimal network with two competitive neurons, illustrated in Fig-
ure 8. In this context, competitive means that the neurons are con-
nected such that more active neurons inhibit the activation of those
less active through lateral connections. The desired state of this net-
work is to have each competing neuron develop a strong connection
to a unique source of modulatory input. It should be noted that this
end state is considered desired due to its computational usefulness as
a source of top-down information rather than a direct extrapolation
from biology.
We can imagine this network as, for example, a simple attention network concerned with detecting apples or oranges in its input. The modulatory connections act as attentional biases towards either apples (M_1) or oranges (M_2). Though one fruit may be desired over the other (e.g., searching for a specific fruit; M_1 active vs. M_2), the network has no control over what is present in its input. Features related more to apples (N_1) or to oranges (N_2) may be active regardless of the bias signal, even occurring simultaneously. This presents a problem to learning if a pure correlation-based rule, like Hebbian learning, is to be used, as the top-down bias is equally correlated with each bottom-up driving input. Learning a unique source of modulatory input is desirable because it allows the attentional biases to affect only the features with which they are semantically associated. With this in mind, let us analyze how this Hebbian learning rule behaves in this network.
The activity of a neuron in the network can be expressed using (Equation 2) with a product for g(\cdot), along with divisive inhibition [22] for competition [following 16, 117] as well as a noise term \epsilon:

x_j = \frac{D_j + D_j M_j + \epsilon}{1 + \mathrm{Inhib}_j} \qquad (5)
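A direct transcription of (Equation 5), where the noise term \epsilon is drawn from a small Gaussian (the distribution and scale are illustrative assumptions):

```python
import numpy as np

def competitive_activation(D, M, inhib, noise_sd=0.01):
    """Equation 5: x_j = (D_j + D_j * M_j + eps) / (1 + Inhib_j).

    Modulation multiplies the driving input, divisive inhibition
    scales the whole response down, and eps models activation noise.
    """
    eps = np.random.normal(0.0, noise_sd)
    return (D + D * M + eps) / (1.0 + inhib)
```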
We are interested in the dynamics of the network once it has reached
the desired state. Let us assume that the neurons have each already
learned associations to a unique modulatory input, such that w_{M_1 N_1} = w_{M_2 N_2} = w_{max} and w_{M_2 N_1} = w_{M_1 N_2} = w_{min}. Because the weights are normalized (see (Equation 4)), this configuration implies that w_{min} + w_{max} = 1.

Figure 8: A simple network with modulatory connections. Neurons N_1 and N_2 receive identical driving input and compete over input from two modulatory neurons, M_1 and M_2. The colored connections show the desired state of the network, where each competing neuron has learned a unique source of modulatory input. The dashed connection represents lateral inhibition.
Without loss of generality, assume that M_1 is highly active while M_2 is inactive, resulting in M_1 sending strong feedback to N_1. Because of that feedback, N_1 will become more active than N_2 regardless of which was more active before. N_2 is then inhibited by N_1, but because it receives the same driving input, it remains at a lower but non-zero activation. Formally:

x_{M_1} > 0 \quad \text{and} \quad x_{N_2} > 0
Substituting this into (Equation 4) gives:

\Delta w_{M_1 N_2} = \frac{w_{min} + \eta\, x_{M_1} x_{N_2}}{w_{min} + w_{max} + \eta\, x_{M_1} x_{N_2} + 0} - w_{min} = \frac{w_{min} + \eta\, x_{M_1} x_{N_2}}{1 + \eta\, x_{M_1} x_{N_2}} - w_{min} \qquad (6)

Letting \beta = \eta\, x_{M_1} x_{N_2},

\Delta w_{M_1 N_2} = \frac{w_{min} + \beta}{1 + \beta} - w_{min} = \frac{w_{min} + \beta - w_{min}(1 + \beta)}{1 + \beta} = \frac{\beta\,(1 - w_{min})}{1 + \beta} \qquad (7)

Because \beta > 0 and 1 > w_{min} > 0, we have \Delta w_{M_1 N_2} > 0. Thus w_{M_1 N_2} is increasing and the system is not in a steady state. This implies that even if this Hebbian learning rule managed to reach the desired state, it would not be in equilibrium and would be disrupted by any input.
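The sign of (Equation 7) is easy to check numerically; with any plausible activations and learning rate (the values below are arbitrary), the update on w_{M_1 N_2} is strictly positive:

```python
w_min, eta = 0.1, 0.05
x_M1, x_N2 = 0.9, 0.4            # M_1 active; N_2 inhibited but non-zero
beta = eta * x_M1 * x_N2         # the substitution used in Equation 7
delta = beta * (1.0 - w_min) / (1.0 + beta)
print(delta)  # ~0.0159 > 0: the desired state drifts under Hebbian learning
```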
Compared to this simple example, the modulatory inputs in more
general networks will be populations of correlated neurons, and the
competing neurons may not all receive identical driving input. Dis-
tinct populations are assumed to be weakly correlated with each other
(otherwise they would be the same population). The core challenge
of learning modulatory connections, however, can be captured by this
example using two neurons driven by identical input competing over
two independent modulatory inputs.
Implementations of Hebbian learning that bound weight growth
through means other than weight re-normalization, such as the Gen-
eralized Hebbian Algorithm [109], which is closely related to Oja’s
rule [97], or the BCM rule, which uses an adaptive threshold based
on expected average activation to adjust the sign of the weight update,
can also be shown to be either unstable or not guaranteed to reach
the desired state of this network. We will revisit and analyze these
two variations of Hebbian learning in Section A.1 after introducing
conflict learning in the next section.
2.3 introducing conflict learning
Conflict learning was developed to address the demonstrated insta-
bility of Hebbian learning rules in the context of modulatory connec-
tions, and can be intuitively described as a rule that assigns a unique
population of correlated modulatory inputs to each neuron compet-
ing over those inputs. It is a general learning rule composed of three
conceptually simple, physiologically plausible mechanisms: adjusting
plasticity based on the activation of strongly learned connections, us-
ing inhibition as an error signal to explicitly unlearn connections, and
exploiting several timescales. These concepts are formalized by the
following equations:
1. Spreading: Neurons are restricted to increasing weight on only those connections that overlap with their existing preferred stimulus, thus causing a smooth spreading through feature space. This is accomplished using a coefficient \psi applied to the weight update, equal to the maximum activation amongst a neuron's strongly learned connections:

\psi_i = \max_{j \,\mid\, w_{ij}(t) > \frac{1}{2} \max_{j'} w_{ij'}(t)} x_j \qquad (8)

where strongly learned connections are those whose weight exceeds half the strength of the largest weight amongst that individual neuron's connections.
2. Unlearning: Conflict learning treats inhibition as an error signal indicating that the inhibited neuron has mistakenly strengthened any currently active connections. A neuron competing with its neighbors via inhibition exerts pressure on those neurons to unlearn the connections driving its activation, while receiving reciprocal pressure to unlearn the connections that drive its neighbors. The amount of inhibition a neuron receives is used to interpolate between a positive and negative associative weight update:

\delta_{ij} = (1 - \mathrm{Inhib})\, x_i x_j \psi_i - \alpha\, \mathrm{Inhib}\, x_i x_j \qquad (9)

where \alpha (set to 1 in all experiments) can be used to control the rate of learning versus unlearning. The interpolation between learning and unlearning is irrespective of activation strength and depends only upon the amount of inhibition received.
3. Short and Long-Term (SLT): Connection weights are adjusted on a short-term and a long-term timescale, striking a balance between initial exploratory learning and long-term exploitation of a learned pattern. The short-term weight w_{ij} adjusts rapidly to the current stimulus, but decays towards and fluctuates around the more stable, slowly adapting long-term weight w^{ltm}_{ij}. The only visible weight for a neuron is its short-term weight; long-term weights are internal and only observed via their effect on short-term weights. The entire neuron weight update process has four steps:

a) Compute short-term weight updates \delta_{ij}

b) Move long-term weights towards short-term weights:

w^{ltm}_{ij}(t+1) = (1 - s_{ltm})(w_{ij}(t) + \delta_{ij}) + s_{ltm}\, w^{ltm}_{ij}(t) \qquad (10)

c) Move short-term weights towards long-term weights:

w_{ij}(t+1) = (1 - s_{stm})(w_{ij}(t) + \delta_{ij}) + s_{stm}\, w^{ltm}_{ij}(t+1) \qquad (11)

d) Normalize short- and long-term weights independently

where s_{ltm} and s_{stm} are smoothing factors, described in Section A.2.2.2. A minimal code sketch combining all three components follows this list.
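Below is a minimal Python sketch of one conflict-learning step for a single neuron, composing the three components above; the symbols follow the reconstructed equations, while the parameter defaults are placeholders rather than the tuned values of Section A.2.2.2.

```python
import numpy as np

def conflict_step(w_stm, w_ltm, x_pre, x_post, inhib,
                  alpha=1.0, s_ltm=0.99, s_stm=0.9):
    """One conflict-learning update for a neuron (Equations 8-11).

    w_stm, w_ltm : short- and long-term incoming weights
    x_pre        : presynaptic activations
    x_post       : the neuron's own activation
    inhib        : inhibition received, in [0, 1]
    """
    # Spreading (Eq. 8): psi is the maximum activation among strongly
    # learned connections (weight above half the largest weight).
    strong = w_stm > 0.5 * w_stm.max()
    psi = x_pre[strong].max() if strong.any() else 0.0

    # Unlearning (Eq. 9): inhibition interpolates between positive
    # learning (gated by psi) and explicit unlearning.
    delta = (1.0 - inhib) * x_pre * x_post * psi \
            - alpha * inhib * x_pre * x_post

    # SLT (Eqs. 10-11): long-term weights slowly track short-term
    # ones, and short-term weights decay toward long-term ones.
    w_ltm = (1.0 - s_ltm) * (w_stm + delta) + s_ltm * w_ltm
    w_stm = (1.0 - s_stm) * (w_stm + delta) + s_stm * w_ltm

    # Step d: clip at zero and normalize each weight set independently.
    w_stm, w_ltm = np.clip(w_stm, 0, None), np.clip(w_ltm, 0, None)
    return w_stm / max(w_stm.sum(), 1e-12), w_ltm / max(w_ltm.sum(), 1e-12)
```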
Conflict learning uses the same neuron activation principles as
GCAL [117], described in Section A.2.1. It should be noted that the
above equations, although conceptually grounded, are not directly
fit to experimental data. The intent of this formulation is to demon-
strate that these concepts, when used together, provide a stable and
plausible way to learn in networks with modulatory connections that
could exist in some fashion in actual neurons. Although weight re-
normalization (item 3d) is not strictly biologically plausible (see [122]
for more viable alternatives), it ensures that weights are bounded in
a computationally amenable fashion, and furthermore is used in the
weight update equation for GCAL.
An in-depth discussion of each component of conflict learning is
provided after the experiments, in Section 2.5, using the results to
address the components’ contributions towards learning and their
biological plausibility.
2.3.1 Conflict Learning and Modulatory Connections
We can now revisit the simple network of Figure 8 and see how
conflict learning resolves the observed stability problems of the Heb-
bian learning rules. Recall the earlier argument (Section 2.2.1), which showed that the analyzed Hebbian learning rules are not stable in the desired state of the network. Specifically, we noted that w_{M_1 N_2} had a non-zero update. This is not the case when the conflict learning rule is used instead.
Assuming the same weight configuration as for the Hebbian rule, if M_1 and N_1 are active, x_{N_1} > x_{N_2}, and thus \mathrm{Inhib}_{N_1} = 0 and \mathrm{Inhib}_{N_2} = 1. Additionally, because N_1 has an active strongly learned connection, \psi_{N_1} = 1 while N_2 has no strongly learned active connections, so \psi_{N_2} = 0. For simplicity we use 1 and 0 for the values of \psi and \mathrm{Inhib}, though the sign of the update remains the same so long as (1 - \mathrm{Inhib}_{N_2})\, \psi_{N_2} < \alpha\, \mathrm{Inhib}_{N_2} holds. Substituting all of this into the short-term weight update (Equation 9) gives:
\delta_{M_1 N_2} = (1 - (1))\, x_{M_1} x_{N_2}\, (0) - (1)\, x_{M_1} x_{N_2} = -\, x_{M_1} x_{N_2} < 0 \qquad (12)
Since w_{M_1 N_2} already has a value of w_{min}, the effective negative weight update applied will be 0, much like the effective positive weight update for w_{M_1 N_1} will be 0 because it is already at w_{max}. Although N_2 is still partially active, it is being inhibited by N_1, so it performs explicit unlearning towards M_1 instead of positive learning as in the Hebbian case. This same procedure can be applied to the other three feedback connections in this example, and in each case the weight update will be 0 or restricted to 0 by the weight value range. Since all of the connection weights maintain their values, the system is at equilibrium and can maintain this steady state. Knowing that conflict learning is stable in the desired state, we can consider its behavior in the other possible states of the network and how the system transitions from an initial unlearned state to the desired stable state.
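Plugging the desired-state weights into the conflict_step sketch above reproduces this argument numerically; here w_min is taken to be 0 so the clamping at the weight floor is visible (the activation values are again arbitrary):

```python
import numpy as np

# Desired state for N_2: strong weight from M_2, floor weight from M_1.
w = np.array([0.0, 1.0])                  # [w_M1N2, w_M2N2], with w_min = 0
new_stm, new_ltm = conflict_step(w.copy(), w.copy(),
                                 x_pre=np.array([0.9, 0.0]),  # M_1 active
                                 x_post=0.4, inhib=1.0)       # N_2 inhibited
print(new_stm)  # [0. 1.]: the negative update is clamped, equilibrium holds
```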
The network has five functionally distinct states of interest, as seen in Figure 9: 1) the initial state, where no connections have become strongly learned (0SL); 2) a strongly learned connection between one competitive neuron and one modulatory neuron (1SL); two strongly learned connections, either 3) one competitive neuron with a strongly learned connection to both modulatory inputs (2SL-Split), 4) one modulatory neuron with strongly learned connections to both competitive neurons (2SL-Shared), or 5) unique strongly learned connections between modulatory and competitive neurons (2SL-Desired).

Figure 9: State diagram for the simple network of Figure 8. This diagram shows the progression of the network from an initial unlearned state (0SL) to the desired state of each competing neuron learning a unique modulatory input (2SL-Desired). Outgoing transition probabilities as well as the percentage of time spent in each state are shown for both (a) Hebbian learning and (b) conflict learning, based on simulation. Conflict learning enters and remains in the 2SL-Desired state, having no outgoing transitions from 2SL-Desired. By contrast, Hebbian learning oscillates between 2SL-Desired, 1SL, and 2SL-Shared. The components of conflict learning essential for specific transitions are labeled. The spreading component prevents the network from transitioning from the 1SL to the 2SL-Split state. Although the simple network under conflict learning cannot make the transition from 1SL to the 2SL-Shared state (dashed arrow), this transition is possible in general, and made unstable by the unlearning and SLT components.
We performed 30 repeated simulations of this simple network to illustrate the trajectory taken by both the considered Hebbian learning rule and conflict learning through the state space (see Section A.3.1 for experimental procedures). Figure 9 shows the outgoing transition probabilities as well as the percentage of time spent in each state for both learning rules. This demonstrates that the Hebbian learning rule, which cannot prevent both competitive neurons from performing learning, immediately transitions into the 2SL-Shared state before entering an oscillation between 2SL-Shared, 2SL-Desired, and 1SL. The Hebbian rule cannot enter the 2SL-Split state because this state requires one neuron to perform learning while the other does nothing. Conflict learning, as shown in (Equation 12), is capable of performing positive learning on a competitive neuron in isolation, due to its spreading and unlearning components. The spreading component is chiefly responsible for preventing the system from entering the 2SL-Split state. The unlearning and SLT components are similarly responsible for transitioning the network out of the 2SL-Shared state, were it ever to be in that state.
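For reference, a state label of the kind used to build Figure 9 can be read off the 2x2 feedback weight matrix; this classifier is a hypothetical sketch (including the 0.5 threshold standing in for the strongly-learned criterion), not the exact procedure of Section A.3.1.

```python
import numpy as np

def classify_state(W, thresh=0.5):
    """Label the simple network's state from its feedback weights.

    W[m, n] is the weight from modulatory neuron M_m to competitive
    neuron N_n; a connection counts as strongly learned above thresh.
    """
    strong = W > thresh
    if strong.sum() == 0:
        return "0SL"
    if strong.sum() == 1:
        return "1SL"
    if strong[:, 0].all() or strong[:, 1].all():
        return "2SL-Split"    # one N strongly tied to both Ms
    if strong[0, :].all() or strong[1, :].all():
        return "2SL-Shared"   # one M strongly tied to both Ns
    return "2SL-Desired"      # a unique M for each N

print(classify_state(np.array([[0.9, 0.1], [0.1, 0.9]])))  # 2SL-Desired
```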
A case-by-case analysis of the transitions made or avoided by conflict learning can be found in Section A.1.1. Using the nomenclature for the states introduced here, additional analysis simulating two further variations of Hebbian learning, the Generalized Hebbian and BCM learning rules, is provided in Section A.1.2 and Section A.1.3, respectively.
2.4 network modeling results
In contrast to the simple network with two competitive neurons, we
now focus on large scale (several thousand neurons) neural networks.
We test conflict learning by learning a model of border ownership as
well as a model of orientation selectivity. The border ownership net-
work relies on modulatory feedback for proper operation, whereas
the orientation selective network demonstrates that conflict learning
is a general learning rule also applicable in contexts lacking modula-
tory connections.
Conflict learning is compared against an implementation of GCAL
([7, 117]; threshold adjustment is implemented differently, see Sec-
tion A.2.2.1 for full implementation details), a learning rule that uses
purely Hebbian logic to adjust its weights, increasing them when pre
and postsynaptic neurons are simultaneously active. Throughout the
rest of this work, we will often refer to GCAL as the “Hebbian learn-
ing rule” to emphasize the associative nature of its weight update.
GCAL is able to achieve biologically plausible results in applications
such as learning V1-like orientation selective maps by way of adjust-
ing neuron activation through contrast normalization and adaptive
thresholds [117]. For all experiments, both rules use identical activa-
tion functions, activation thresholds, and connection patterns, only
differing in how their weights are adjusted. This section focuses on
reporting the results of the experiments; full technical details on the
experimental procedures are provided in Section A.3. Intuition and fur-
ther analysis of how each component of conflict learning gives rise to
the results shown is provided after the results in Section 2.5.
2.4.1 Border Ownership
The primary benefit of conflict learning is its ability to learn in net-
works with modulatory feedback, a feature that allows it to be used
to learn a model of border ownership. As border ownership (BO) is
a less familiar and more complicated process than orientation selec-
tivity, it is worth briefly revisiting its putative architecture (illustrated
in Figure 10, also see the experimental methods in Section A.3.2) to
fully appreciate the results. The model we develop is a derivative of
the feedback model of [30], which as mentioned in the introduction,
is one of multiple models capturing the observed behavior of actual
border ownership neurons.
BO neurons are identified not just by an orientation, but also by a
polarity, which indicates to which side of their orientation the figure
(or background) lies [136]. The key challenge is to develop receptive
fields such that each BO neuron responds to a single orientation with
a single polarity, with full coverage over all orientations and polarities.
In our model, this relies on learning feedforward and modulatory
feedback connections between columns of BO neurons and a layer
of so-called grouping neurons which pool over multiple BO columns,
integrating non-local information. Learning these connections is espe-
cially challenging because the multiple BO neurons that exist for each
Figure 10: Border ownership model architecture. (a) Diagram of full archi-
tecture. A V1-like layer consisting of Gabor filters processes the input at
four orientations (0, 45, 90, and 135°). Each orientation neuron provides
input to two border ownership cells, which are connected laterally to six
others (for the three remaining orientations) at the same retinotopic loca-
tion within a column in the Border Ownership layer. The grouping layer
pools BO column activation, receiving input from all BO cells within all
columns in a local receptive field. The grouping layer additionally sends
feedback to those same cells. (b) Diagram of a single BO column. Columns
contain eight competing neurons, two for each orientation, and internally
have lateral inhibitory connections between each neuron. They also receive
feedback from a local receptive field in the grouping layer. (c and d) The
effects of an example stimulus (dotted square, actual experiment uses solid
input) on BO columns (cylinders) and grouping cells (circles labeled G). (c)
Feedforward connections from the perspective of a BO column. The column
sends feedforward input to all grouping cells in its receptive field, but only
the grouping cell receiving input from multiple columns is highly active (in-
dicated by increased size). (d) Feedback connections from the perspective
of a grouping cell. Feedback is sent to all BO columns within its receptive
field, but only those along the boundary of the object will be highly active.
(e) Detailed relationship of competition between two BO neurons with the
same orientation. Each BO neuron eventually learns to project to and receive
feedback from a grouping cell on only one side of its orientation.
orientation, destined to develop a specific polarity, must learn consis-
tent and opposite connection patterns. The network accomplishes this
task purely through experience, with no a priori spatial information
– not only are feedforward and feedback weights initially uniform,
but BO neurons within a column must also learn to specialize their
inhibitory lateral connections, a necessary requirement for competi-
tion. While many other models of border ownership require explicit
features for junction (e.g., L, T) detection, our learned model requires
only edge information.
Note that not all components of this model have been directly ob-
served in the brain. Although BO neurons and their responses to var-
ious stimuli have been recorded [136], grouping neurons have yet to
be explicitly discovered [30]. Grouping neurons can thus be seen as
a computational generalization of a more complicated grouping pro-
cess, for which there is mounting evidence [e.g., 88, 125]. This model
is nonetheless a good approximation of the current understanding of
border ownership circuits. Additionally, the structure of the border
ownership network fits within a standard model of computation in
visual cortex: it consists of competition followed by grouping, with in-
creasing receptive field size. This is reminiscent of alternating simple
and complex cells [129], which have formed the basis of many models
of visual cortex [e.g., 43, 110]. The connection from edge responsive
neurons (input in the model) to border ownership neurons is a sim-
plification for the model; we imagine a more realistic circuit would
have edge or contour responsive neurons directly compete with each
other over border ownership polarity.
2.4.1.1 Results
The learned feedback receptive fields for a representative BO col-
umn taken from fully trained networks are shown in Figure 11, and
Figure 11: Learned Feedback Receptive Fields for BO Neurons. Receptive
fields are shown for all eight neurons of a single representative BO column
for both Hebbian and conflict learning rules. Each row of the figure rep-
resents a different orientation. Each BO neuron is marked by a blue pixel,
and green pixels show feedback connections from grouping neurons, with
brightness corresponding to weight strength. Polarity represents the average
degree to which every pair of neurons in a network learns feedback from
grouping neurons on opposite sides, a necessary requirement for consistent
border ownership assignment. The conflict learning network successfully
learns pairs of competing-polarity BO neurons without any a priori information regarding BO or grouping cell spatial positions.
the feedforward and lateral receptive fields are shown in Figure 12
(the full details of training and other experimental procedures can be
found in Section A.3.2). Under conflict learning, each neuron within
a BO column learns to associate with grouping feedback occurring
on only one side of its orientation, with all orientations and polarities
represented. Additionally, the two BO neurons associated with each
orientation learn to become competitive with each other and learn
opposite sides of feedback. This occurs because the opposite sides of
grouping feedback come from distinct populations of grouping neu-
rons, and conflict learning, as was shown in Section 2.3.1, strives to
associate one competing neuron to each population of modulatory
input. The Hebbian learning based rule, however, is unable to de-
velop this partitioning of modulatory feedback amongst competing
Figure 12: Learned feedforward and lateral receptive fields for BO and
grouping neurons. (a) Feedforward receptive fields for a grouping neuron,
shown for both learning rules. Successful learning entails a ring-like pat-
tern of strong connectivity. (b) As in Fig. 5, the results are organized by the
orientation of the BO neuron. For each orientation, the learned outgoing
feedforward projections are displayed first followed by a radial graph of
the corresponding learned lateral inhibition strength for the same neurons.
Lateral connections project to other neurons within the BO column, colored
by the preferred polarity of the inhibited neuron. For example, a red po-
larity corresponds to inhibition towards a horizontal selective BO neuron
with a preference for objects in the lower half of its receptive field. Under
conflict learning, BO neurons learn to primarily inhibit the other neuron
sharing their orientation, as well as applying a small amount of inhibition
to immediately adjacent orientations with overlapping polarities. This pat-
tern of inhibition not only ensures the creation of competing pairs of BO
neurons, but also a winner-take-all like behavior amongst all orientations in
a column.
neurons. The two BO neurons for each orientation learn the same
receptive fields as each other, causing them to be unable to reliably
associate with objects occurring on a particular side of their orienta-
tion. When a stimulus is presented to these neurons, the winner will
be chosen randomly instead of being chosen based on any border
ownership information.
Along with the sampled receptive fields, the average polarity score
for BO neurons of each orientation is shown. This score represents the
degree to which a competing pair of BO neurons learn feedback on
Figure 13: Border Ownership Assignments by a Network Trained with Con-
flict Learning. Black lines represent the stimulus and colored arrows repre-
sent BO assignments at those locations. Each BO neuron is assigned a di-
rection vector based on its learned polarity. Assignments are made by sum-
ming these direction vectors, weighted by activation. All results are taken
from a fully learned network naive to these example inputs. The network
has complete position and orientation invariance. (a) The progression of BO
assignment over time. Feedback begins to arrive in iteration 3. (b-e) Settled
(iteration 9) assignments for various stimuli. (c and d) These shapes have
locally ambiguous border ownership assignments that are resolved through
modulatory feedback from the grouping neurons. (e) The network is not
fully scale invariant because the BO to grouping neuron connections exist
only at a single radius, resulting in the corners being weakly activated.
opposite sides (see Section A.3.2). These averaged scores, computed
from all pairs of BO neurons, demonstrate that the pictured examples
are representative of the whole network.
Figure 13 shows the results of running the trained conflict learning
network on common stimuli from the border ownership literature.
As the network was trained on single presentations of squares (see
Section A.3.2), every shape presented here is one to which the net-
work has never been exposed. The network in its current implementa-
tion has limited scale invariance, demonstrated by the weak response
at the vertices of the triangle input (Figure 13E). The responses to
the tiled squares (Figure 13A), the C pattern (Figure 13D), and the
rounded squares (Figure 13C) are especially interesting because local
information may favor a globally incorrect polarity assignment. The
network, in all cases, is able to use feedback to correct ambiguous
feedforward input in order to reach the correct assignment of border
ownership. To our knowledge, this is the first time a border owner-
ship network has been learned, enabled by the new conflict learning
rule.
Figure 14: Contribution of rule components. (a-h) Representative receptive
fields from a vertical BO neuron pair taken from various configurations of
conflict learning as well as Hebbian learning. Histograms depict the polarity
scores of all vertical BO neurons for a given configuration, with the median
denoted by a red line. Configurations are: (a) Hebbian, (b) spreading com-
ponent only, (c) unlearning component only, (d) SLT component only, (e)
conflict learning without spreading, (f) conflict learning without unlearn-
ing, (g) conflict learning without SLT, (h) full conflict learning. (i) Median
scores for (a-h) with error bars indicating 95th percentile cutoffs. Conflict
learning (h) is significantly higher with respect to all other configurations
(a-g).
Finally, we investigate the contribution of each component of con-
flict learning as it applies to learning the modulatory feedback con-
nections in the border ownership network. Figure 14 shows recep-
tive fields taken from a vertically oriented BO pair for all variations
of rules tested. The receptive fields were chosen to be exemplars of
common failures (if they existed) for the various configurations. His-
tograms of polarity scores over all vertical BO neurons show typi-
cal network-level results. In Figure 14I we compare the median score
across all configurations, showing that conflict learning benefits from the combination of all of its components. The results demon-
strate that there is a non-linear relationship between the introduction
of a rule component and its effect on the polarity score. However, we
can still extract some general conclusions with respect to the polarity
score: while unlearning on its own is very influential (C), the unlearn-
ing and spreading components complement each other and together
(G) account for most of the improvement over Hebbian (A). The SLT
component, by slowly transitioning the network to reflect long-term
statistics, appears to have the effect of eliminating outliers and reduc-
ing the variance of the distributions (e.g., histograms B vs. F, C vs.
E, and G vs. H). Additional discussion on the contribution of each
component follows in the discussion (Section 2.5).
2.4.2 Orientation Selectivity
We next apply conflict learning to a problem that can be seen as a
baseline for self-organizing networks of the brain – orientation selec-
tivity. The network, seen in Figure 15, consists of an input layer, a
center-surround layer, and an output layer, like that used to demon-
strate the properties of GCAL [117]. The connections between the in-
put layer and the center-surround layer are fixed; all learning in this
network takes place between the center-surround neurons and the
output neurons. The network has no modulatory connections, such
that the activation equation for neurons reduces to (Equation 1). The
Figure 15: V1-like feedforward network. Center-surround layers perform
a difference-of-Gaussian like computation either preferring the center (On-
Off) or the surround (Off-On). The orientation selective layer receives input
from both On-Off and Off-On neurons and forms lateral connections within
some radius. The connections between center-surround layers and the ori-
entation selective layer are learned. Figure depicts actual model responses
from a learned network.
desired goal of learning in this network is to develop output neurons
which are orientation selective over all possible input orientations.
Detailed information on the network architecture, training, and ex-
perimental procedures are provided in Section A.3.3.
2.4.2.1 Results
The primary goal of this experiment is to demonstrate that the conflict
learning rule, even when applied to networks lacking modulatory
feedback and compared against a learning rule tailored for such an
environment, produces similar biologically consistent output.
Figure 16A shows the output neurons, for both learning rules, col-
orized by orientation selectivity after training on oriented bar input.
The learned maps show an arrangement that mimics physiological
maps of orientation selectivity in mammalian cortex (e.g., pinwheels,
which are singularities where orientation preference increases clock-
wise or counterclockwise; see [25]). To quantify this subjective simi-
larity, the pinwheel density metric of [117] is computed for the maps.
Figure 16: Orientation selectivity results. (a) Orientation maps for both learning rules, colored according to the preferred orientation of each neuron. Pinwheel locations are determined algorithmically and denoted by white circles. Both learning rules result in a biologically plausible pinwheel density within 3% of \pi. (b) Average selectivity for both learning rules while training with input data corrupted with increasing amounts of Gaussian noise. Selectivity is based on how well a neuron's receptive field can be modeled by any Gabor function. (c) Stability for both learning rules as a function of learning iteration for a range of input noise values. Stability is based on the correlation between the current and final (iteration 20,000) maps.
A pinwheel density of \pi pinwheels per unit hypercolumn area (see Section A.3.3) has been found to be consistent across a number of mammalian species [e.g., tree shrew, galago, cat, and ferret, see: 64, 65], and may be a fundamental constant of map organization [117]. Both learning rules result in pinwheel densities within 3% of \pi.
In testing conflict learning, we also observed noteworthy behavior
when varying amounts of noise were injected into the input of the
system. Figure 16 also shows the results of simulating the orientation
selective network for both learning rules under varying amounts of
Gaussian noise applied to the input neurons (by adjusting their acti-
vation noise term; see Section A.3.3 for details). Figure 16B shows
an increased resistance to the effects of noise in the conflict learn-
ing results. Hebbian learning more quickly succumbs to a significant
drop in the quality of learned receptive fields compared to conflict
learning, which only begins to be affected by noise at very high stan-
dard deviations. The scoring metric for selectivity is based on how
well a receptive field can be represented by any possible Gabor func-
tion for all neurons in the network [98]. Real neurons are subject to
many more sources of noise and variability than is present in our
modeling, and handling that noise is a fundamental requirement for
the nervous system [34]. We discuss reasons why conflict learning is
less affected by noise in the discussion section.
Using the same stability metric as [117], we compare how similar
learned receptive fields are at any given time to the final state of the
network (Figure 16C). Conflict learning reaches a higher plateau of
stability at earlier iterations compared to Hebbian learning. As sta-
bility may be important for the development of downstream brain
regions [117], earlier stability could decrease the delay between a reli-
able orientation selective representation and further visual processing.
Additional experiments comparing stability across a greater number
of iterations did not show any appreciable difference in the time it
took to reach stability or the final values. When looking at stability
over increasing levels of noise, we again see a resistance to noise in
conflict learning that only gives way at high standard deviations.
2.5 discussion
Typically a learning rule is devised with a specific activation function
in mind, so it may not seem surprising that the Hebbian learning rule
we compare against was unable to learn a model of border ownership
dependent on modulatory connections. However, the orientation se-
lective network, in which there is no source of modulatory input,
served as a comparison of the two learning rules in a setting where
the activation function was as expected by the Hebbian rule, yet was
still compatible with conflict learning, which was designed around
the presence of modulatory input. In Section 2.2.1, Section A.1.2, and
Section A.1.3, we demonstrated that unlike conflict learning, for even
a minimal network with modulatory connections, neither the normal-
ized Hebbian rule, the Generalized Hebbian Algorithm, nor BCM are
capable of stably learning modulatory weights.
We suggest that this is because all of these variants of Hebbian
learning are based on a core principle of associative learning, which
alone seems incompatible with modulatory input. Our computational
experiments suggest that a synapse does not have enough informa-
tion as to how a weight should be adjusted using only incoming acti-
vation compared with the output activation of the cell. Even learning
rules like BCM, which control plasticity via an adaptive threshold
based on expected activation, do not solve the problem, because they
do not draw on any additional sources of information. We hypothe-
size that additional control signals are required to support modula-
tory connections, where the incoming activation may be coincident
with the firing of the cell, but not relevant. Conflict learning uses two
additional sources of information for these signals: the activation of
strongly learned synapses within the cell, and inhibitory input driven
by competing neurons. Strongly learned connections identify relevant
firing, while inhibition partitions firing by indicating that a neuron is
losing a local competition amongst connected neurons.
We demonstrated through computational models that using inhi-
bition as a control signal results in a partitioning of correlated firing
in modulatory input amongst competing neurons. Our results (e.g.,
Figure 14) suggest that lowering activation through inhibition is in-
sufficient to prevent unwanted learning from taking place: inhibition
must actively drive the partitioning of modulatory input through un-
learning.
Additionally, we also demonstrated that restricting learning based
on the activation of strongly learned connections results in a success-
ful clustering of correlated firing amongst modulatory input to an
individual neuron. This behavior is complementary to the partition-
ing performed by the inhibitory control signal, resulting in neurons
which compete over correlated firing of incoming connections, regard-
less of whether they are sourced from driving or modulatory input.
These two components of conflict learning, together with short-
and long-term learning, will be discussed and related to experiment
in detail in the following section of the discussion.
Although it may be the case that a different learning rule could gov-
ern driving versus modulatory connections, we think there is some
elegance in a single set of principles being compatible with both types
of excitatory connections. Conflict learning does not directly address
the plasticity of inhibitory connections, which likely do operate with
a different set of mechanisms. In fact, conflict learning cannot be used
for learning inhibitory connections because of its reliance on inhibi-
tion as a control signal (see Section A.2.2.2).
2.5.1 Analyzing the Rule
The results demonstrate that for certain patterns of connections and
firing, traditional Hebbian learning mechanisms are ill-suited for adapt-
ing synaptic weights. This was seen directly in learning a model of
border ownership, where only conflict learning was able to properly
learn the required modulatory feedback connections to perform the
computation correctly. Additionally, conflict learning operates in a bi-
ologically consistent manner even in situations lacking these types
of connections, with the pinwheel density of the learned orientation
selective network matching biology as well as other learning rules.
The orientation selective network experiments also show interesting
properties with regards to increased stability and robustness to noise.
All of these results are a product of the three complementary compo-
nents that make up conflict learning, introduced in Section 2.3, which
we now discuss in detail.
2.5.1.1 First Component - Spreading
The first component of conflict learning states that neurons cannot
strengthen their connection weights unless an already strongly learned
connection is currently active. In the border ownership experiments,
the spreading component helps prevent neurons of a border owner-
ship pair from associating with grouping neurons on both sides of
their oriented edge. While populations of grouping neurons on both
sides are individually co-active with a BO neuron, there is little to no
correlation between the firing of the distant populations themselves.
A Hebbian neuron cannot detect this distinction, whereas a conflict
learning neuron can. This is illustrated most clearly in the learned re-
ceptive fields of the border ownership experiment, seen in Figure 11,
as well as by the simple network of Section 2.3.1.
Spreading is similar to the concept of associative LTP (Long-Term
Potentiation), where the strong firing of a learned synapse supports
the strengthening of a weaker one [81,112]. There has been discussion
on the spatial requirements [33] as well as temporal constraints [77]
of synapses involved in associative LTP, suggesting that it is both
a spatially and a temporally local process. Since we do not model
the physics of our synapses, we use only a temporal constraint. This
means that once a neuron has begun to associate with certain connec-
tions, any further connections it strengthens must co-occur with the
existing ones, which forces connection weights to smoothly spread
outward through feature space from an initially learned pattern. In
situations where initial conditions allow competing neurons to learn
the same set of connections (analogous to the 2SL-Shared state de-
scribed in Section 2.3.1), the spreading component, if used without
the unlearning component, would make it impossible for the neu-
rons to disentangle their learned features. In Figure 14B, the two BO
neurons are correctly learning on only one side of the boundary, but
have no mechanism to prevent them from learning and spreading to
the same features. This effect is exacerbated when combined solely
with the long-term statistics used by the SLT component (Figure 14F),
which compounds the mistaken initialization over time.
Our method of labeling connections within a single neuron as strongly
learned (Equation 8) is a simple abstraction intended to capture the
behavior, but not the exact biological implementation, of such a mech-
anism. It has been demonstrated that the soma can backpropagate
signals to its dendrites for the purpose of manipulating thresholds
[e.g., 76] and that individual dendrites display a wide array of ac-
tive properties such that one synapse can affect the behavior of many
others [82]. Such mechanisms could also be responsible for the manip-
ulation of a learning threshold affecting synaptic plasticity. Therefore
the spreading component, in a real neuron, would likely be imple-
mented through a variety of adaptable thresholds as opposed to the
simple activation strength based product that we employ.
In the context of modulatory connections, the spreading compo-
nent is essential to enabling a neuron to identify a population hidden
within the many correlated activations of its inputs. In networks with-
out modulatory feedback, the spreading component gives increased
resistance to the effects of noise (Figure 16) by lessening the impact of
spurious activation as it is unlikely to consistently coincide with the
strongly learned connections.
2.5.1.2 Second Component - Unlearning
In conflict learning, inhibition, in addition to reducing the activation
of a neuron, causes the neuron to directly unlearn its active connec-
tions. This is in contrast to a typical Hebbian learning rule which
still allows positive learning to take place, dependent on the activa-
tion. It is also distinct from examples of explicit synaptic weakening
in BCM-like rules or STDP, which use activation or timing to control
the unlearning. In conflict learning, a neuron can be strongly active
but still undergo unlearning if its inhibitory input is high enough.
In the border ownership experiments, inhibition primarily occurs be-
tween pairs of border ownership cells competing over feedback from
grouping cells on either side of their local oriented boundary. Un-
learning helps correct mistaken assignments within a BO pair, ulti-
mately resulting in a near even split along the polarity boundary (Fig-
ure 11). Mistaken activation close to the boundary will be frequently
contested and thus unlearned by both cells in the pair.
There is significant evidence of complex interactions between in-
hibition and excitation in the brain. [126] found that inhibition con-
trolled the sign of excitatory plasticity in rat visual cortex, which is re-
markably similar to our unlearning component, via crosstalk between
inhibitory and excitatory signaling. [39] found that the presence or
lack of inhibition could reverse the classic STDP window, causing ei-
ther LTP or LTD (long-term depression) to occur. Additionally, in a
recent review on inhibitory plasticity, [124] emphasize the increasing
evidence that excitation and inhibition are deeply intertwined, with
inhibition potentially providing a mechanism that allows selective
learning to occur.
Unlearning through inhibition allows one neuron to force another
to unlearn common connections between the two, causing the inhib-
ited neuron to return to an initial unlearned state, at which point it is
possible to learn a different population of input. This was first seen in
analysis of the simple network (Section A.1.1), where the unlearning
component is the primary mechanism by which a 2SL-Shared state
is made unstable. A consequence of this component is that when a
neuron competes for features it actively weakens competitors, lead-
ing to a greater separation in feature space (weight values) between
the neurons (Figure 14C). When combined with the spreading com-
ponent in competitive groups of neurons (such as the mutually in-
hibitory groups of neurons in a column), neurons learn in a smooth
yet competitive fashion (Figure 14G and Figure 14H). The neurons
identify populations in the feature space and slowly expand their re-
ceptive fields until they have no more correlated connections to learn
or they are faced with competition from another neuron. In the orien-
tation selectivity experiment, unlearning enforces a greater difference
in connection weight strength between the features learned by each
neuron, meaning responses are more stable and higher levels of noise
can be introduced without confusing the input pattern.
2.5.1.3 Third Component - Short and Long Term
In conflict learning, all neurons have an externally visible short-term
weight as well as an internal long-term weight. The two weights con-
stantly pull on each other until they settle to the same value, with the
rates at which they move towards each other controlling how quickly
a neuron adapts its weights and how steadfast it becomes in its de-
cisions. This short- and long-term learning, or SLT, allows neurons
to quickly associate with populations in their input while remain-
ing sensitive to long-term trends. In the border ownership network,
this ability to be initially flexible but stable in the long run leads to
more neurons learning significantly better separation along BO neu-
ron boundaries (Figure 14I). We found SLT to be especially benefi-
cial for feedforward connections, where capturing long-term statistics
is useful (e.g., BO feedforward connections). SLT, used alone, works
essentially like Hebbian learning (Figure 14D), but when combined
with the other two portions of the rule, leads to a significant improve-
ment and consistency of learned receptive fields (Figure 14H). This
increased consistency can also be seen in Figure 14E compared to
Figure 14C, which differ only by the inclusion of SLT.
The physiological underpinnings of multi-timescale learning are
notably discussed by [137], who review the dynamics of short-term
learning, [2], who review synaptic redistribution and the interplay
between short- and long-term potentiation, and [50], throughout his
extensive development of adaptive resonance theory.
2.5.2 Implications for Plasticity
Our results, accompanied by physiological evidence for the mecha-
nisms we have described, suggest that similar mechanisms are likely
used in the brain for the learning of modulatory connections. By act-
ing as an error signal to instigate unlearning, inhibition can dynami-
cally alter plasticity and encourage diversification amongst compet-
ing neurons, and by requiring strongly learned connections to be
active for learning, spreading allows for the detection of correlated
clusters of activation within non-driving inputs. Our model of pri-
mary visual cortex shows that such mechanisms do not interfere with
learning in more traditional contexts lacking modulation. We there-
fore predict that neurons likely have the key mechanisms of conflict
learning: the ability to adjust their plasticity based on a concept of
synaptic strength, and the usage of inhibition as a control signal for
unlearning. These concepts could be tested in actual neurons with
a series of simple experiments on single neurons. For all of these
proposed experiments, we assume that a single neuron has learned
a preferred stimulus such that it has increased synaptic strength to-
wards the inputs associated with that stimulus. Inhibition is assumed
to originate through interaction with other neurons [e.g., inhibitory
interneurons 86]. Conflict learning predicts that inhibition has addi-
tional effects on plasticity if its presence lowers a neuron's activation without completely suppressing its firing.
If inhibition serves as a signal that unlearning should occur, the
strength of the synapses associated with the learned input should
decrease when inhibition is applied simultaneously to the driving
input. As noted in Section 2.5.1.2, there is existing evidence that this
is indeed a potential role for inhibition. A classical Hebbian theory,
such as any of the rules discussed in Section 2.2.1, or STDP, would
predict no decrease in synaptic strength in such a situation.
To establish the existence of a behavior similar to the spreading
component, a new, independent source of input could be applied
while artificially activating the neuron. Conflict learning predicts that
the lack of activation of the already learned input will prevent or sig-
nificantly impair the learning of the novel input. Existing Hebbian
rules predict that synaptic strength towards the new input should
increase unimpeded.
Finally, the interaction of these two components could be tested
in a combination of the two experiments. While driving the neuron
via its preferred stimulus and supplying a sufficient source of inhibi-
tion, additionally apply a new, independent source of input. In this
situation, conflict learning predicts that the neuron will not increase
synaptic strength towards the novel input, even though it is presented
simultaneously to the learned input. This prediction arises from the
proposed role of inhibition, which in this situation would cause all ac-
tive inputs to the neuron to have their synaptic strength decreased. A
classical Hebbian rule here would predict that the synaptic strength
to the novel input would increase.
Furthermore, if inhibition is indeed a necessary component for
learning modulatory connections, it follows that modulatory connec-
tions (and thus a majority of feedback) must develop to maturity
alongside inhibition. The balance between excitation and inhibition
is a drawn-out process controlled by experience [41], and a potential
additional reason for this delayed maturation could be explained by
a dependence between inhibition and feedback.
2.5.3 Learning Border Ownership
As mentioned in Section 2.4.1, the border ownership network archi-
tecture we present is not fully drawn from physiological observations.
Our results do not definitively rule out that a Hebbian based rule,
under some alternative network configuration, could reproduce the
behavior of border ownership. However, given the prevailing theory
that the computation of border ownership is dependent upon feed-
back [68], along with the argument presented in Section 2.2 demon-
strating Hebbian learning’s incompatibility with modulatory connec-
tions, it seems unlikely that a Hebbian learning rule could learn
a feedback-based border ownership network. Additionally, through
our experiences developing conflict learning, we believe that any net-
work configuration compatible with purely Hebbian learning would
be overwhelmingly complex and likely not support stimulus driven
learning.
As briefly discussed in Section 2.4.1, the network architecture used
here, although only applied to border ownership, is not specifically
tied to computing this one feature. The network has no a priori in-
formation about borders or specific relationships between neurons.
Border ownership is instead an emergent property of the network
given competition over orientation responses coupled with higher
level grouping. A deepened hierarchy composed of the same type of
competition and grouping may potentially lead to the computation
of higher level features more akin to proto-objects [for a discussion of
proto-objects, see 56], and is a target for future work.
2.6 conclusion
In developing conflict learning, we have shown how existing mech-
anisms already found in the brain can interact together to provide
substantial benefits in learning and allow the learning of modula-
tory connections. We have demonstrated the effectiveness of conflict
learning by showing, for the first time, how a model of border own-
ership might be learned through experience. This new rule could ad-
ditionally be beneficial for modeling many brain functions, including
figure-ground segmentation, top-down attention, and object recogni-
tion, which may all benefit from top-down modulation. As we un-
cover more details of the mechanisms governing neural plasticity,
models capable of incorporating this new information, such as con-
flict learning, become increasingly necessary.
3 goal relevance
This experiment aims to understand what makes information rele-
vant to goals, which will be referred to as goal relevance. During
driving, for example, the relevance of objects is determined by how
they might affect the driver’s goal of safely reaching a destination.
The challenge is in finding a general theory that can be applied to
any task.
3.1 defining goal relevance - theory
Here, we define an objective mathematical formula to compute goal
relevance, which we compare to the human subjective notion of rel-
evance in our experiments below. What makes a data observation
relevant to a goal? Let us examine our intuition using the driving
example. Seeing a car brake in front of you or the movement of ob-
jects on the road are certainly relevant to the goal of safely reaching a
destination. Hearing that a bridge is out seems extremely relevant, if
that bridge was on one of the possible routes. However, learning that
“The stock market is up 0.1%” isn’t relevant at all. How can we sup-
port these intuitions while being abstract enough to apply to other
types of goals?
To answer this question, we first identify what is common to all
goals. First, since we are considering the relevance of data observa-
tions while pursuing a goal, there must be an agent that performs the
pursuing. Also, the act of pursuing a goal implies movement through
some real or imaginary state space, from a current (start) state to a
desired (goal) state. Note that like in standard search in Artificial In-
telligence, the goal may be implicitly defined (any state that satisfies
a number of requirements that establish a goal test would qualify
as goal state). We can define S as the set of all paths through this
space which begin at the start and terminate at the goal. The job of
the agent is to follow one of these paths (or a collection of paths
that overlap near the current state of the agent), while monitoring
the environment for possible new data observations D which change its beliefs that a particular path will be successful, and which thus may warrant a re-evaluation of how the goal might be achieved. We thus further define a probability distribution P(S) over S, which represents the current (or "prior") distribution of the agent's beliefs that a given path will achieve the goal. Looking back at the intuitions above, we can note that more relevant pieces of data may be the ones that cause larger changes in P(S), e.g., because a path that was previously highly likely has become obstructed and less or not at all likely to achieve the goal, or because some previously obstructed paths have been cleared. Indeed, if a data observation does not change P(S), it does not affect the agent's beliefs about how to reach the goal(s) and therefore it should not be considered relevant to the goal(s). If a data observation instead significantly affects the agent's beliefs over how to achieve the goal then it should be considered very relevant according to our proposed definition. This inspires our key idea, which is to define goal relevance by the amount of change in P(S) caused by the new data observation.
We thus define the goal relevance of a data observation D with respect to an agent's probability distribution P(S) over the set S of possible ways it could achieve its goals as a distance measure, d(\cdot,\cdot), between the prior distribution of beliefs P(S) and the posterior distribution P(S|D) after observation of data D:

R(D, S) = d(P(S|D), P(S)) \qquad (13)
While many measures are available for d(\cdot,\cdot), here, inspired by the related manner in which the concept of surprise was mathematically defined by Itti & Baldi [59] in a rigorous, quantitative manner, we will use for numerical applications the Kullback-Leibler divergence KL(\cdot,\cdot), such that:

R(D, S) = KL(P(S|D), P(S)) = \int_S P(S|D) \log \frac{P(S|D)}{P(S)}\, dS \qquad (14)

(Note, however, that the Kullback-Leibler divergence is not strictly a distance measure, as it is not symmetric; one could use the symmetric version KL_{sym}(P, Q) = \frac{1}{2}(KL(P, Q) + KL(Q, P)) interchangeably.)
To cement the idea that we are quantifying relevance, we define a unit of relevance (a "Rel") as the amount of relevance corresponding to a two-fold difference between P(s|D) and P(s) for a given s ∈ S, measured as \log_2 \frac{P(s|D)}{P(s)}, similar to the definition of the "wow" in Itti and Baldi's surprise [59] or of the bit in Shannon's theory of information [111]. Just like Shannon's entropy uses a base-2 logarithm to define the bit as the unit of information [111], it is here a natural choice to define the Rel. The base-2 logarithm presents advantages in terms of making the mathematical measures of information (or relevance) scale linearly with intuitive notions of these quantities (e.g., a 6-bit computer memory array can store the same amount of information as two 3-bit memory arrays; but, if one were to omit the logarithm, a 6-bit memory array can store any of 64 possible values while each 3-bit memory array can store any of 8 possible values, and 64 is not twice 8. Likewise, the number of Rels associated with observing two independent, non-interacting events is the sum of the Rels associated with each event separately). The number of Rels associated with data observation D over the entire set S is then obtained via the integration in Equation 14, with the logarithm taken in base 2.
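To make the unit concrete, the following is a minimal sketch (not code from this dissertation) of computing relevance in Rels over a discrete solution space; the distributions and the function name relevance_in_rels are illustrative assumptions.

```python
import numpy as np

def relevance_in_rels(prior, posterior):
    """Base-2 KL divergence between posterior and prior beliefs
    (Equation 14, discrete case), expressed in Rels."""
    prior = np.asarray(prior, dtype=float)
    posterior = np.asarray(posterior, dtype=float)
    return float(np.sum(posterior * np.log2(posterior / prior)))

# Toy example: an observation doubles the belief in the first path and
# halves the belief in the second, leaving the third untouched.
prior = np.array([0.25, 0.50, 0.25])
posterior = np.array([0.50, 0.25, 0.25])
print(relevance_in_rels(prior, posterior))  # 0.25 Rels
```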
Our definition thus interprets goal relevance as the degree to which
the data observationD yields surprising changes (in Itti and Baldi’s
terms) in the observer’s beliefs over how it could achieve its goal(s).
Therefore, in a deterministic universe where the agent also has per-
fect information, nothing is relevant because the agent can optimally
plan how to accomplish its goal based only on the initial S while
ignoring all sensory input. Relevance is important when there is un-
certainty or there are environmental changes that cannot be predicted
by the agent. Note how our definition provides an objective and de-
terministic computation of relevance over a subjective space of paths
from start to goal. In the experiments below, we compare relevance
measured with our mathematical formulation to human evaluations
of the relevance of different visual stimuli.
3.2 defining goal relevance - implementation
Ideally, the proposed definition of goal relevance would be compared
against the prevailing one. However, to the authors’ knowledge, there
are no other quantitative models that explain the relevance of obser-
vations to goals. Instead, since our definition attempts to explain hu-
man cognition, the above model is used to predict human responses
to a relevance task. Given a 2-D environment with obstacles, a start-
ing location, and a goal location, we asked participants to indicate
which of two highlighted obstacles was the most relevant to the task
of traveling from start to goal.
Figure 17: A and C: A simple 2D environment without (A) and with (C) an
obstacle being evaluated, along with the sampled RRT paths. Although the
paths are randomly sampled, the distributions appear less random than
expected due to the nature of RRT’s sampling method and the so-called
Voronoi bias inherent to this method [71]. B and D: Normalized grid counts
for each grid cell, computed as the number of RRT paths that traverse a
given grid cell, followed by normalization to a probability distribution
P(x,y) (color scale at right shows probability density values). We can
compute goal relevance as the difference between P(x,y) in panel (B) and P(x,y|D) in panel (D) using Equation 15. In this paper, the relevance values
range between 0 and 1.4 Rels. For this example, the relevance of the added
obstacle to the task of traveling from start to goal is 0.87 Rels (relevance
units).
To use the equation for relevance, a model space must be chosen
that can represent the possible paths. The 2D environment space was used for this purpose. Because the space of all possible trajectories from start to finish in a 2D environment is intractably large, here the space was discretized into a grid, and we simply considered the belief P(x,y) that a given grid cell at location (x,y) was going to be traversed on the way to the goal. That is, instead of directly considering the distribution of agent beliefs over all possible paths, we projected those paths onto a grid and considered the simpler distribution of beliefs over which locations were more likely to be traversed. This considerably simplified the problem of computing how data observations affected the observer's beliefs, as detailed below, while leaving unaltered our mathematical definition of relevance, which applies to any P(S).
In order to apply Equation 14, we use the discretized form:

R(D, S) = KL(P(S|D), P(S)) = \sum_S P(S|D) \log \frac{P(S|D)}{P(S)}    (15)
To construct prior and posterior belief distributions related to adding
one obstacle to the environment, 1000 paths from start to finish were
randomly sampled in the environment without the obstacle in ques-
tion (prior), and another 1000 in the environment with the extra
obstacle added (posterior). Sampling was done by using Rapidly-
exploring Random Trees (RRTs) [71], which are commonly used in
path-planning systems. RRTs can explore a space by iteratively adding
random branches to form a tree from a start location. When any
branch reaches the goal, the path through the tree from the start to the
end is recorded as the selected path. For each path, we incremented a
counter at every grid cell location (x,y) traversed by that path. Grid
cells were given a minimum value of 1, so that the computation in
Equation 15 was never singular (division by zero). We finally normalized the set of all counter values over all grid locations to create the distributions P(x,y) and P(x,y|D). This process is visualized in Fig-
ure 17, which shows a very simple environment and the resulting
distributions. Once the prior and posterior distributions are known,
the relevance equation can be applied directly to compute the rele-
vance score associated with adding one obstacle to the environment.
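The counting and normalization step is simple enough to sketch; the snippet below is an illustrative reconstruction (not the dissertation's code), assuming path sampling by an RRT planner is done elsewhere and that each path is supplied as a list of traversed (x, y) grid cells.

```python
import numpy as np

def belief_grid(paths, width, height):
    """Build P(x,y) from sampled paths: count traversals per grid cell,
    starting every cell at 1 so Equation 15 is never singular."""
    counts = np.ones((height, width))
    for path in paths:
        for (x, y) in path:
            counts[y, x] += 1
    return counts / counts.sum()

def obstacle_relevance(prior_paths, posterior_paths, width, height):
    """Goal relevance (in Rels) of the obstacle that distinguishes the
    prior path sample from the posterior one."""
    p = belief_grid(prior_paths, width, height)      # without the obstacle
    q = belief_grid(posterior_paths, width, height)  # with the obstacle
    return float(np.sum(q * np.log2(q / p)))
```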
To compare relevance between objects O_1 ... O_N, the above is performed for each object O_i, computing first the prior beliefs P(x,y), followed by the posterior beliefs after adding O_i, P(x,y|O_i), and finally the relevance of O_i, R(O_i, X, Y). The object with the largest relevance score is the most relevant to the goal. Note that when comparing multiple obstacles, the environment used to compute the prior distribution contains none of these obstacles, and so P(x,y) is the same for
Figure 18: Environments and probability distribution heat maps required
to compare the relevance of two objects in a more complex environment. A:
Environment without either of the objects in question. B: Environment
with the first object in question. C: Environment with the second object in
question. Applying Equation 15 to compare distributions A with B and A
with C provides the relevance values of the first and second object,
respectively. In this case, the red objects in B and C had relevance values of
0.19 and 0.70 Rels respectively, which agreed with the participant
responses (4 and 34 votes respectively).
each. The posterior P(x,y|O_i) is computed using only the obstacle O_i in question and the remaining obstacles that are not being evaluated. This comparison is visualized in Figure 18.
This implementation was used to compute relevance scores for the
added obstacles in the image pairs used for the human experiment,
which we used as a model prediction of the human responses. The
details of these image pairs are covered in the next section.
3.3 methods
To investigate the human intuition behind goal relevance and to test
our mathematical definition, we recruited human participants and
asked them about the relative task relevance of objects. Participants
were presented with image pairs, such as the ones shown in Figure 19.
The images represent two dimensional environments, with randomly
placed objects and starting/ending locations (yellow/green dots, re-
spectively). In all cases, each image in the pair was identical to the
other, except for one additional object in each image which was col-
ored red. Participants were told to imagine that they were walking
across the “room” from the start location to the end location. They
were then instructed to consider, “Which of the two red objects is
more relevant to your goal? In other words, which would you pay
more attention to, or be more concerned with, as you cross the room?”
The participants responded to each image pair by indicating whether
they thought the additional object in the image on the left or the right
was more relevant. Response time was recorded along with this deci-
sion.
Environments were randomly generated using the following procedure (a code sketch follows the list):
1. Generate a shape
• For rectangular shapes, use a square with side length 1.
• For convex polygon shapes, start at (0, 0) and repeatedly
travel a random distance and turn clockwise by a random
angle until returning to the start location. Afterwards, ran-
domly rotate the shape.
2. Randomly scale and translate the shape and determine if it has
any collisions with other shapes. If not, place it in the environ-
ment (i.e. add it to the shape list). Otherwise, repeat this step
until a valid placement is found or a maximum number of at-
tempts is reached.
3. Repeat 1-2 a total of 10 times. In practice, there are generally
fewer than 10 shapes because the last few shapes exhaust their
attempts due to a crowded environment.
4. Randomly reorder the shape list so that there is no positioning
bias based on shape order. The last 2 shapes are selected as the
relevant shapes.
5. Randomly select a start point, and repeat until the point is not
inside a shape.
6. Randomly select an end point, and repeat until the point is not
inside a shape.
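As referenced above, the following sketch illustrates the rejection-sampling style of steps 2, 5, and 6 under simplifying assumptions (axis-aligned unit squares only, bounding-box collision tests); the actual generator also produced convex polygons.

```python
import random

def overlaps(a, b):
    """Axis-aligned bounding-box intersection test."""
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def try_place(shapes, max_attempts=100):
    """Step 2: randomly scale/translate a square until a collision-free
    placement is found or the attempts are exhausted."""
    for _ in range(max_attempts):
        size = random.uniform(0.05, 0.2)
        x, y = random.uniform(0, 1 - size), random.uniform(0, 1 - size)
        box = (x, y, x + size, y + size)
        if all(not overlaps(box, other) for other in shapes):
            shapes.append(box)
            return True
    return False  # crowded environment: this shape is dropped

def free_point(shapes):
    """Steps 5-6: resample until the point lies inside no shape."""
    while True:
        px, py = random.random(), random.random()
        if all(not (s[0] <= px <= s[2] and s[1] <= py <= s[3]) for s in shapes):
            return (px, py)

shapes = []
for _ in range(10):       # step 3
    try_place(shapes)
random.shuffle(shapes)    # step 4: the last 2 shapes become the relevant ones
start, goal = free_point(shapes), free_point(shapes)
```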
Figure 19: Left pair: Image pair with rectangular objects. Right pair: Image
pair with convex polygon objects. The first and third images contain the
more relevant new obstacles according to our theory. We thus expected
that a majority of human responses would be “Left” for both image pairs.
The second panel is an example of an environment that would be excluded
from the experiment, because the relevant obstacle falls outside of the
circle formed by the start and goal.
Each participant was presented with 200 image pairs. The first 100
contained only rectangular objects, and the second 100 contained ob-
jects that were all convex polygons. This allowed us to investigate if
shape had any effect on goal relevance. All participants viewed the
same 200 pairs, which were randomly generated in advance of the
study. For each participant, presentation on the left or right for the
two images in each pair was randomly shuffled, as was the presenta-
tion order of the image pairs. However, the rectangular-object image
pairs always came prior to the convex-polygonal pairs. To increase
the number of interesting image pairs (i.e., pairs in which both objects
could be considered relevant), image pairs were only used if both of
the additional objects intersected or lied within the circle whose diam-
eter was the line segment between the start and end points (thus, the
second panel of Figure 19 was actually not used in the experiment).
In total, 38 undergraduate students participated in our study. Par-
ticipants had normal or corrected-to-normal vision, and were com-
Feature   Description
1         Distance from start to centroid of shape 1
2         Distance from start to centroid of shape 2
3         Distance from goal to centroid of shape 1
4         Distance from goal to centroid of shape 2
5         Area of shape 1
6         Area of shape 2
7         Closest distance between optimal path and edge of shape 1
8         Closest distance between optimal path and edge of shape 2
9         Closest distance between optimal path and centroid of shape 1
10        Closest distance between optimal path and centroid of shape 2
11        Distance from midpoint of start and goal to edge of shape 1
12        Distance from midpoint of start and goal to edge of shape 2
13        Length of optimal path without shape 2
14        Length of optimal path without shape 1

Table 1: Descriptions of all features tested for SVM-B. Shapes 1 and 2 are the left and right shapes, respectively.
pensated with course credits. The experimental methods were ap-
proved by our university’s Institutional Review Board (IRB).
In addition to this primary experiment, we conducted a secondary
experiment to ensure that our results were not a product of any quirk
in the particular 200 image pairs used in the primary experiment. We used the same methodology and recruited a separate set of 52 participants, generating a unique set of 200 image pairs for each participant.
The human responses to the image pairs were compared against
the predictions of several models. First, they were compared to the
goal relevance model described in this paper. Also, two intuitive heuris-
tics were used. The first of these heuristics (H1) compared the areas
of the two objects, and simply chose the larger as more relevant; it
should predict human responses very well if participants preferen-
tially picked the largest obstacle as being more relevant. The second
heuristic (H2) used the distances from the midpoint between start
and goal to the edge of the two objects; this model thus coarsely ac-
counted for how close the added obstacles were to the straight line
from start to goal, and should perform very well if participants sim-
ply judged the obstacle closest to that straight line as more relevant.
This was inspired by the common practice of using the straight-line
distance to a goal location on a map (ignoring actual roads) as a guid-
ing heuristic in route planning algorithms such as the A* algorithm
(i.e., the straight-line distance is a weak guide towards the goal during the search for the optimal route, which must correctly follow the available roads [52]).
The last comparison was made against a Support Vector Machine (SVM), which is a standard tool in machine learning used to recognize patterns in data and classify new data (Cortes, 1995). During training, SVMs find a decision boundary that best separates two classes of data points. Here, our SVM attempts to perform the same task as our human participants, learning a decision boundary that lets it choose left or right for an image pair. Compared to other classifiers, SVMs find the decision boundary that is maximally distant from any of the training exemplars (maximum margin). The trained decision boundary can then be used to classify new, unseen data points. The SVM was trained using different features to provide benchmarks for our model. We selected a list of 14 potential features, given in Table 1, exhaustively trained a separate SVM for every combination of up to 4 features from the list, and selected the best performing of the \binom{14}{4} + \binom{14}{3} + \binom{14}{2} + \binom{14}{1} = 1470 SVM-based models tested, referred to as SVM-B. SVM-B used features 5, 9, 11, and 12.
In all cases, the SVMs were trained on half of the image pairs, se-
lected randomly, and tested on the other half. To avoid any bias in
this random split, we trained and tested each SVM using 100 differ-
ent random splits, and report the mean and standard deviation. Note that our
Features         Accuracy (%)
5                56.66 ± 5.78
9                68.98 ± 3.26
11               70.25 ± 3.36
12               71.96 ± 3.31
11, 12           77.33 ± 3.17
9, 11, 12        79.64 ± 3.38
5, 9, 11, 12     81.97 ± 3.46
Alternatives
6, 9, 11, 12     79.36 ± 3.22
5, 10, 11, 12    81.52 ± 3.69
6, 10, 11, 12    79.01 ± 3.02
5, 6, 11, 12     77.25 ± 3.46
9, 10, 11, 12    80.57 ± 3.33

Table 2: Contributions of each feature used in SVM-B. The accuracy of an SVM trained on each feature individually is shown, along with a progression from highest to lowest accuracy. These results are from the primary experiment. Features 5, 9, 11, and 12 represented the best accuracy of all 1470 combinations of features. Because features 5 & 9 are included without their pairs (features 6 & 10), we also show alternative combinations for comparison.
proposed definition of relevance uses no training at all (just Equa-
tion 15).
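A sketch of this exhaustive feature-combination search is shown below, assuming scikit-learn, a feature matrix X (one column per feature in Table 1), and labels y giving the majority choice per pair; the linear kernel and other details are assumptions.

```python
from itertools import combinations
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

def evaluate(X, y, features, n_splits=100):
    """Mean/std accuracy of a linear SVM on one feature subset over
    100 random 50/50 train-test splits."""
    accs = []
    for seed in range(n_splits):
        Xtr, Xte, ytr, yte = train_test_split(
            X[:, features], y, test_size=0.5, random_state=seed)
        accs.append(SVC(kernel="linear").fit(Xtr, ytr).score(Xte, yte))
    return np.mean(accs), np.std(accs)

def find_svm_b(X, y, n_features=14, max_size=4):
    """Score all 1470 subsets of up to 4 of the 14 features; the best is SVM-B."""
    results = {}
    for k in range(1, max_size + 1):
        for subset in combinations(range(n_features), k):
            results[subset] = evaluate(X, y, list(subset))
    return max(results.items(), key=lambda kv: kv[1][0])
```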
To gain some intuition for why these particular features were se-
lected, Table 2 shows the accuracy of them individually as well as
other combinations and alternatives. We can see that features 11 and
12 provide the most information, and feature 5 provides the least by
a significant margin. By comparing features 11 and 12 with 5, 9, 11,
and 12, we can see that adding features 5 and 9 does not contribute
much, and most likely comes down to coincidental correlations. For
example, why did the classifier pick feature 5 instead of its symmetric counterpart, feature 6? The answer is that using feature 5 (in conjunction with features 11, 12 and 9) on average correctly classified 2.6 more of the 100
test samples than did using feature 6 in conjunction with 11, 12 and
9. When using finite training and test samples as we do here, small
differences between features that should in principle be symmetric
may happen just due to the limited sample size.
3.4 results
The data from all 38 participants was analyzed to produce our results.
For each image pair, the number of responses for the left and the
right object were tallied. We calculated inter-subject agreement for
an image pair as the fraction of participants who agreed with the
majority decision, i.e.,
Agreement = \frac{\max(N_L, N_R)}{N}    (16)

where N_L and N_R are the number of participants who chose the left and right objects, and N = 38 is the total number of participants.
This value is 1.0 when all participants agree, and 0.5 when there is
a 19-19 split. A histogram of the agreement values is shown in Fig-
ure 20. The average inter-subject agreement across all images for our
experiment was 77.05%. For each image pair, we also computed the
majority image as the image selected by the majority of participants.
In the primary experiment, model accuracies were computed by us-
ing the models to predict which image in each pair was the majority
image.
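Equation 16 and the majority image are straightforward to compute; a small sketch with illustrative vote counts:

```python
import numpy as np

N = 38                               # participants per image pair
n_left = np.array([30, 19, 4, 25])   # illustrative "Left" vote counts
n_right = N - n_left

agreement = np.maximum(n_left, n_right) / N   # Equation 16
majority_is_left = n_left > n_right           # majority image per pair
print(agreement)   # 1.0 = unanimous, 0.5 = a 19-19 split
```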
To determine the significance of the data, we looked at the num-
ber of participants who voted “Left” on each of the 200 image pairs.
An F-test revealed a statistically significant variance of these num-
bers, F(199,199) = 15.34, p = 1.15 × 10⁻⁶². Linear regression found a significant regression equation predicting response time from agreement, F(1,198) = 49.9, p = 2.73 × 10⁻¹¹, with an R² of 0.201. There
Figure 20: Agreement histogram over image pairs.
Model                      Accuracy (%)
Inter-subject Agreement    77.05
H1                         55.50
H2                         78.00
SVM-B                      81.97 ± 3.46
Goal Relevance             83.50

Table 3: Prediction accuracy of models & inter-subject agreement
was no effect of rectangular vs. convex polygonal objects (F(99,99) =
1.13, p = 0.537).
As mentioned in the previous section, we evaluated the accuracy
of several models in predicting the data. The results are shown in
Table 3. SVM-B was able to achieve a mean accuracy of 81.97%, with
a standard deviation of 3.46%. By contrast, the goal relevance model
had a significantly higher prediction rate, at 83.5% (t(99) = -4.42, p = 2.54 × 10⁻⁵). Note that these accuracies represent each model's ability
to predict which image was selected by the majority of participants
in each pair, rather than each individual human response.
To further examine the predictions, we divided the image pairs into
groups based on inter-subject agreement, and investigated whether
the goal relevance model would achieve lower predictive accuracy for
those pairs for which humans agreed less. This is shown in Figure 21. The significance of each regression line was computed. The human regression line had values of F(1,3) = 441, p = 2.36 × 10⁻⁴, with an R² of 0.993. The goal relevance model had values of F(1,3) = 12.3, p = 0.039, with an R² of 0.804. In contrast, the SVM-B line was not significant at the 5% significance level, with values of F(1,3) = 7.25, p = 0.074, with an R² of 0.707.
Figure 21: Model accuracies for image pairs with different levels of human
agreement, and corresponding regression lines. The number of image pairs
in each category is shown on the X axis with the agreement values.
We also investigated whether the magnitude of difference com-
puted for each object by the goal relevance model was correlated with
the level of human agreement for each image pair. A scatter plot of
these values is shown in Figure 22, along with a regression line. This
line was also statistically significant, F(1,198) = 29.8, p = 1.41 × 10⁻⁷, with an R² of 0.131.
The results of the control experiment are shown in Table 4. SVM-B
was trained by randomly selecting half of the 200 image pairs from
each participant to train a single model, which was then used to pre-
dict the remaining image pairs of each participant separately. This
procedure was repeated 100 times with different random samples. It
Figure 22: Scatter plot of goal relevance differences and human agreements
for each image pair, and the regression line.
is important to remember that, because the control experiment uses
unique stimuli in every trial, there is no notion of majority image
in this experiment and we instead compare model predictions with
individual participant responses on each image pair. At first glance,
it would seem that both models perform much worse in this control
experiment. This is due to the fact that accuracy in the primary ex-
periment (Table 3) is computed by using the models to predict the
majority image. The majority images are based on many responses
for the same image pairs, and so are much less noisy than individ-
ual responses. To demonstrate this, we recomputed the prediction
accuracies for the primary experiment when predicting individual
responses instead of the majority image and averaging them across
participants, and the results are very similar. We found no significant
difference between the accuracies of SVM-B and goal relevance mod-
els.
Model            Control (%)     Primary (%)
H1               55.41 ± 5.59    54.34 ± 5.17
H2               70.50 ± 12.96   67.11 ± 11.21
SVM-B            71.73 ± 13.49   69.71 ± 13.21
Goal Relevance   72.26 ± 13.78   71.84 ± 14.09

Table 4: Comparison of model predictions between primary and control experiments. Accuracy for the primary experiment is recomputed where each individual participant response is predicted, rather than the majority image, since one cannot determine a majority image in the control experiment that used unique stimuli on every trial. Standard deviations across individual participants are shown for each value.
3.5 discussion
Our results show a clear advantage in using the proposed concept
of goal relevance to predict human responses in this task. The model
more closely matched the human reports than did any of the heuris-
tics or the best of the SVM variants constructed. Here, it is important
to note that although SVM-B performed equal to goal relevance in
the control experiment, this is in spite of several advantages given
to the SVM by our procedures. As mentioned in Methods, SVM-B
was selected as the best of 1470 different SVMs trained on this task,
and we did not apply any statistical corrections to accommodate this.
Also, unlike SVM-B, which was the strongest of the alternative mod-
els, goal relevance does not make use of any training at all. Further-
more, the simplicity of the task means that simple linear machine
learning models like SVMs have an easier time capturing it appropri-
ately. These factors make it all the more significant that goal relevance
performed equal or better in all cases. Goal relevance found the same
image pairs easy or challenging as did the humans, indicated by the
slopes of the regression lines in Figure 21 and the significant slope of
the line in Figure 22.
In reviewing the data, we noticed that a small number of our partic-
ipants (3 in the primary experiment, 4 in the control) responded dras-
tically differently, seemingly providing the opposite response from
the remaining participants. This could be due to a misunderstand-
ing of the experiment instructions, or these participants may have an
alternative internal definition of goal relevance. This clearly demon-
strates that the concept of goal relevance is a subjective one. However,
the relatively high accuracies of both our model and SVM-B show that
despite the subjective nature, goal relevance is largely a shared con-
cept which we can attempt to model.
When looking at the results of the H1 and H2 models, it is clear that
H2 was a much stronger predictor of human responses. This suggests
that the size of the objects was not very important, but the distance
between the objects and the midpoint between the start and goal was
quite important. The features selected by the best SVM further sup-
port this, because both features using this computation were chosen
(features 11 and 12).
It is interesting that many of the model accuracies were higher than
the average inter-subject agreement. This may be because these mod-
els, despite being an attempt to approximate the human notion of
task-based relevance, project a subjective measure of relevance for
many different individuals onto an objective measure of universal
relevance. Humans may be less consistent or thorough in their com-
putation and possibly utilize heuristics or other methods for deter-
mining the available solutions to problems, and for detecting changes
in that space. Also, participants may have tried to avoid spending a
large amount of time on any image pair, eventually selecting a ran-
dom answer which would decrease performance. Comparing the re-
sults of the primary and control experiments in Table 4 clearly shows
that there is a noticeable amount of noise in participant responses,
whether it be from the subjectivity of the task or noise within an
individual’s response pattern. Also, much of that noise comes from
the minority of participants who had different responses. However,
our results suggest that the key idea behind goal relevance is sound;
observations are relevant to a goal when they change the set of en-
visioned solutions. This is demonstrated by the success of the goal
relevance model in predicting the participant responses, despite not
utilizing any training data.
The proposed definition of goal relevance is fundamentally related
to Itti and Baldi’s concept of surprise [59], in that they are both based
on a difference between prior and posterior belief distributions. The
idea behind surprise is that an observation is surprising if it changes
your beliefs about the world. Similarly, an observation is relevant to
a goal if it changes your belief about the likely solutions for reach-
ing that goal. Essentially, this means that goal relevance is equivalent
to surprise in a solution space. As surprise has found much success
in predicting bottom-up human visual attention, hopefully goal rele-
vance can be used in a complementary manner to predict top-down
attention. This paper has demonstrated how goal relevance might be
applied to a navigation task, but there are many other types of tasks.
Just like in surprise, the key to extending goal relevance to other types
of goals is in finding the correct belief space, i.e., solution space, and
in establishing P(S).
There are a few potential caveats with this study. First, the theory
does not define how goal relevance can be computed when the prior
or posterior solution set is empty. In this case the KL divergence is
not defined, so a natural interpretation using Equation 14 is not avail-
able. None of our image pairs contained this case, so it was not dealt
with in our data. Yet, this is not an unusual problem when using KL,
and it can be circumvented, for example, by adding a fixed, uniform
low probability density over the solution space (i.e., in our numerical
examples, add a fixed non-zero value to P(x,y) at every (x,y) before
normalization). Also, Nelson et al. compared multiple measures of
value of information, including KL distance, to human information
search, and found that probability gain was the most similar [95]. As
we mentioned earlier when we introduced Equation 13, our choice
to use KL divergence is not intrinsic to the problem of goal rele-
vance and any measure of information distance can be used. It would
be very interesting to investigate if other metrics could improve the
model.
Another potential problem is that in constructing the SVM mod-
els, we could have missed a very informative feature. As mentioned
above, we did try hard to find strong features, and the ones reported
are the best we found, with the caveat that we wanted with the SVM
framework to test only features that would be very different than
what our goal relevance theory used: reasoning about probabilistic
distributions of possible paths. Indeed, we expect that an SVM that
used distributions of paths as a feature should perform just as well
as our goal relevance measure, and would essentially be equivalent
to it rather than constituting a valid alternative. It will be interesting
in the future to see whether even better features can be found. An
issue with the implementation is that the RRT path sampling method
contains the well-known Voronoi bias [80], which tends to bias the
selection of random branches towards open space. We attempted to
minimize this by swapping the start and goal locations for half of
the samples, which eliminated the non-symmetric aspect of the bias.
However, it could still have an impact on the results. Lastly, our study
asked participants to self-report on objects that they found more rel-
evant. It is possible that these reports do not purely reflect the actual
relevances of the objects to each participant. However, the authors be-
lieve that the reports are at least partially indicative of relevance, and
designed the experiment to be simple to reduce further confounding
factors, since the aim was to exhibit the efficacy of a new relevance
metric.
Another key aspect of this study to consider is the choice of the nav-
igational task to test a general theory. Simply put, the presented task
was chosen for the simplicity of its solution space. We designed the
experimental scenario to be simple so that we could expect the sub-
jective beliefs of the observers to be quite similar, and hence the ob-
servers’ relevance computation would yield similar outcomes across
individuals, as confirmed by the high inter-observer agreement. Yet,
the theory described here is general and not limited to navigation
tasks. The central requirement of the proposed approach is that one
needs to be able to explore or sample the space of possible solutions
that achieve the stated goal(s), such that the influence of new data ob-
servations on that sample can be quantified. In complex tasks, such as
cooking a meal or playing a sport, the solution space is typically more
abstract and may have much higher dimensionality, making it more
difficult to sample directly. Thus, techniques for sampling in high
dimensional spaces may be required, such as Probabilistic Roadmap
Planners [3] or Random Decision Trees [35]. Consider for example the
computation of relevance of different game pieces on a chessboard
taken at some instant during a game, which could be useful to know
when planning the next move, or to try to predict the next eye move-
ment or action of a player. In principle, one could use standard tree
search algorithms, such as the minimax algorithm, to compute pos-
sible sequences of valid chess moves that lead to winning the game.
By repeating this computation with the piece of interest present vs.
absent, one could then compare the two distributions of sequences
of moves and compute relevance as we have done above. In practice,
however, fully expanding the search tree for games such as chess is
computationally prohibitive, and hence heuristics (such as considering only a finite horizon of a few moves into the future) and
probabilistic sampling methods would likely be required, although
the computation of relevance as described here would remain identi-
cal. Many real-world tasks can be formulated in terms of sequences of
elementary actions taken in an environment that may present uncer-
tainty and possibly adversity, and thus directly generalize the chess
example. That is, once those elementary actions are defined, well-
studied search algorithms can be used to explore the large space of
possible action sequences that lead to a goal. Additional examples
are symbolic reasoning systems that attempt to infer logical conclu-
sions from a body of abstract knowledge (e.g., theorem provers, ex-
pert medical diagnosis systems), where the moves are more abstract
and defined as the possible syntactically and logically correct ways in
which two sentences can be combined to infer a new conclusion.
For the more complex tasks, we can speculate that the solution
space itself differs from person to person. For example, if we consider
the task of trying to cook a steak, perhaps cinnamon is irrelevant to
most people attempting this task, but relevant to others because they
are familiar with a recipe that includes that ingredient. The knowl-
edge of the additional recipe changes that person’s solution space for
this task, which shapes their computation of relevance.
Finally, we posit that complex agents, including humans, constantly
compute relevance in multiple hypothesis spaces simultaneously, pos-
sibly combining the resulting relevance values, in some manner which
we have not explored here and which we do not understand, to make
decisions and plans. For example, consider that I am planning to take
an airplane to present research results at a distant conference. Observ-
ing that an airport is shut down for bad weather would yield measur-
able relevance in the space of airline routes I might consider for my
trip, in a manner very similar to what our experiments have studied.
Yet, observing that the conference is canceled would likely yield no
change in airline routes and schedules, and thus would carry zero
relevance in that space, although one might intuitively consider this
a highly relevant event. The cancellation would indeed be relevant,
to an extent that can be quantified using our proposed measure, but
in different spaces, such as the space of my scientific goals for the
year, of my travel budget for the year, or of my strategies for how to
best disseminate research results. We expect that future experiments
could be designed to better understand how complex agents might
compute relevance in different spaces using our proposed measure as
a core, and then integrate the different relevance values to give rise
to behavior.
We have described a theory of object relevance in the context of
tasks, which has been termed goal relevance. Also, we have shown
that goal relevance is successful in predicting human responses when
asked to select task-relevant objects. To our knowledge, this is the first
model to capture this concept in a quantitative manner. The results
of this experiment support the idea of information utility discussed
in [48]. Further research will hopefully show that this theory general-
izes to other types of tasks.
4 SALIENCY MODELS
Having proposed and validated our theoretical framework for goal
relevance, the next step is to determine its effects on eye movement
behavior in humans. This chapter presents an experiment designed
to accomplish this.
4.1 background
As the scope of this experiment largely overlaps with that of Chap-
ter 3, the relevant literature is the same, and can be found in Sec-
tion1.1.3. As described there, several studies have used learning from
examples to implicitly capture task influences on eye movements [13,
100]. The problem with these models is that they learn to predict eye
movement behavior without having any understanding of the process
from which that behavior emerges. While their predictive accuracy is
higher than models without learned top-down features, the top-down
features represent a very shallow understanding and there is much
room for improvement. The learned models also present a require-
ment to be retrained on each new task, and cannot predict a priori
which objects will attract more attention. Developing an understand-
ing of the relationship between goal relevance and eye movements
can hopefully address these issues.
4.2 experimental methods
Human participants were recruited, and their eye movements (sac-
cades) recorded while performing a task. Similar to the experiment
presented in [100], the task is a video game. However, in this exper-
iment all participants played a single game, Mario World, and were
be told that their goal was to reach the end of each level as quickly as
possible while taking as little damage as possible (specifically, they
were told to maximize their score, which will be explained later).
Participants were first briefed and then given a short questionnaire
asking about their level of experience with Mario World and with
video games in general. Then, after calibration with the eye tracking
equipment (Figure 23), they had the opportunity to practice playing
the game until they were ready. Finally, they played through eight
levels in a randomized order. Their eye movements were recorded
during this period in addition to during the practice period. In total,
84 participants were recruited (42 males and 42 females).
Figure 23: Experimental setup, where participants played Mario World
while having their eye movements recorded.
Mario World was chosen because of a competition held several
years ago, called MarioAI [1], which released an open-source Java im-
plementation of the game. This allows us to modify Mario World
for our own purposes, to create customized levels to best test our
hypotheses and also adjust some of the game mechanics. For those
familiar with Mario World, the mechanics we have changed are as
follows:
1. Mario has unlimited life. When he takes damage, the partici-
pant receives feedback but Mario is not killed.
2. There is no time limit for completing a level.
3. None of the levels contain any pits into which Mario could fall
and die.
4. All coins and power-ups have been removed from the game.
5. Mario is unable to run, and can only walk.
6. When a Koopa enemy is destroyed, no shell is left behind.
7. At the end of each level, an itemized score is displayed based
on the time taken and how many times Mario took damage.
Some of these changes were to ensure consistency across partici-
pants. Changes 1-3 guaranteed that all participants reached the end
of the customized levels, and therefore saw all of the content in them.
Changes 4-6 removed some of the more complex aspects of the game,
so that there would be less of a difference in performance between
participants who were new to the game and those who had prior
experience. In particular, change #5 prevented very experienced par-
ticipants from completing the levels in a drastically shorter time than
other participants, so there would be less variation in the overall ses-
sion duration.
Change #7 exists to make explicit what the participants’ goals were
as they played. This allows us to confidently design a predictive
model using the same goals as the human participants. The score
displayed to participants was based on two factors: the amount of
time taken to complete the level, and the number of times that Mario
took damage during the level. The total score was calculated using
the following equation:
Score = 100,000 - (t \times 100) - (d \times 3,000)    (17)

where t is the amount of time taken, in seconds, and d is the num-
ber of times Mario received damage. This equation is designed such
that there is no scenario where it is worth taking damage to reach
the end of the level more quickly. The optimal strategy is to prioritize
avoiding damage, and otherwise proceed as quickly as possible. This
was explained to the participants, and the score display after each
level showed the calculations to continually reiterate this point.
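The scoring rule is easy to state in code; the sketch below (variable names assumed) also checks the design property that a single hit, costing as much as 30 extra seconds, is never worth a modest shortcut.

```python
def score(t_seconds: float, damage: int) -> int:
    """Itemized level score: 100 points per second, 3,000 per hit (Equation 17)."""
    return 100_000 - int(t_seconds * 100) - damage * 3_000

# Taking one hit to save 29 seconds still loses points.
assert score(40, 1) < score(69, 0)
```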
Unbeknownst to the participants, they were each randomly assigned
one of four different versions of our game before they arrived, such
that each group contained 21 participants. The different versions are
as follows:
• Normal: Our modified Mario World game as described above.
• Small: Mario is half as tall as in the normal version. This allows
him to fit through some gaps not accessible to normal Mario.
• High Jump: Mario can jump about twice as high as normal
Mario.
• Invulnerable: Mario does not take damage when coming into
contact with enemies, and instead simply passes through them.
Participants were made aware of the special properties of their ver-
sion of the game, but were not told about the other versions. The
assigned version was used for the entirety of the session, both during
practice and data collection.
Figure 24: Example of a goal relevance test object. The enemy Goomba is
too high to threaten normal Mario. However, high-jumping Mario must be
careful to jump over the pipe while not colliding with the enemy.
Each version of the game is designed to have a significant impact
on the space of possible actions that Mario can take, and the lev-
els are also designed to focus on these differences. For example, in
many locations throughout the levels, there are enemies high above
the ground such that they cannot interact with normal Mario (Fig-
ure 24). In this case, the theory of goal relevance labels this enemy
as irrelevant to the participant’s goal. However, high-jumping Mario
must be careful not to jump while passing underneath this enemy
to avoid taking damage. For participants assigned to the High Jump
version, this enemy was considered relevant. Examples such as this
are littered throughout the levels, where goal relevance will assign
different values of relevance to certain objects depending on which
version of Mario World is being played (Figure 25). In this way, we
can investigate a causal effect of changing the solution space on eye
movement behavior.
Figure 25: More goal relevance test object examples. A: Same as Figure 24.
The flying Goomba is too high, so it will only affect high-jumping Mario.
B: Normal Mario must go through the middle opening in the wall. How-
ever, small Mario can also choose to walk under the bottom, and high-
jumping Mario can leap over the wall. C: Small Mario can avoid some en-
emies by walking under the walkway, but the other Marios cannot fit. D:
High-jumping Mario can jump on top of the overhanging walkway to avoid
an enemy. E: The shell inside the blocks at the top is highly salient from a
bottom-up perspective because it bounces back and forth quickly and makes
noise. However, no version of Mario can interact with it. F: The three Spiky
enemies at the bottom are also uninteractable.
4.2.1 Computation of Goal Relevance
The data analysis for this experiment requires the computation of
goal relevance for game objects in each recorded video frame. As de-
scribed in Chapter 3, goal relevance is defined for a data observation
D with respect to an agent’s probability distributionP(S) over the set
S of possible ways it could achieve its goals as a distance measure,
d(,), between the prior distribution of beliefsP(S) and the posterior
distributionP(SjD) after observation of dataD:
R(D,S) =d(P(SjD),P(S)) (18)
4.2 experimental methods 85
And, as in Chapter 3, we will use the Kullback-Leibler divergence for d(·,·), along with a discretized form of the function, giving:

R(D, S) = KL(P(S|D), P(S)) = \sum_S P(S|D) \log \frac{P(S|D)}{P(S)}    (19)
In Mario World, the goal is to reach the end of the level as quickly
as possible while taking damage as few times as possible. How-
ever, because of the side-scrolling nature of Mario World, the end of
each level is not visible on screen until Mario has almost reached it.
The participant is also completely unaware of any enemies and ob-
stacles until Mario has progressed far enough for them to enter the
visible scene. Therefore, we instead define our goal as reaching the
right side of the visible scene as quickly as possible while taking damage as few times as possible. This better matches the goal to-
wards which participants will most likely plan, because they do not
have any information with which to plan further.
A solution for reaching this goal is a list of discrete actions to be
taken in each time step. As there are only three different buttons
used to control Mario (move left, move right, and jump), there are six
distinct actions:
1. Do nothing
2. Move left
3. Move right
4. Jump
5. Jump left
6. Jump right
To enumerate the possible solutions, we perform a breadth-first
search, simulating the game state for each action at each time step,
and recording any path through the breadth-first graph that ends
with a node representing Mario reaching the right side of the screen.
Of course, there would be an infinite number of such solutions because
of the option to do nothing, and aside from that this graph would
expand exponentially and become intractable very quickly. To avoid
both of these issues, we first compute the optimal solution using the
A* search algorithm [52], using the following heuristic for evaluating
the fitness h of each node in the graph:

h(x, y, t, d) = 100,000 - ((t + e(x)) \times 100) - (d \times 3,000)    (20)
where x and y are the coordinates of Mario in the level, t is the
amount of time currently elapsed, in seconds, and d is the number of times Mario has received damage. e(x) is a function for estimating the
amount of time remaining before Mario reaches the level end, and is
simply based on Mario’s current position and maximum speed. This
equation is the same as Equation 17, except that the time component
is broken down into current time and estimated remaining time. The
resulting optimal solution found by A* is used to limit the breadth-
first search by evaluating each node using the same fitness function.
Any node whose fitness is lower than the optimal solution’s fitness
by more than a fixed fitness threshold is pruned, reducing the graph to a
manageable size. Even with this restriction, the size of the search tree
becomes extremely large, and the search must be performed several
times on every frame. Because of this, it became necessary to set this
threshold to 0 to satisfy computational constraints. To ensure that this
did not negatively affect our results, we compare the results of our
saliency models using a higher threshold, but using data from a small
subset of our participants, in Section B.1.
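A sketch of the pruned enumeration follows, assuming a deterministic simulate(state, action) game-step function, a fitness(state) implementing Equation 20, and an at_goal(state) test; these helpers and the state representation are assumptions rather than the thesis code.

```python
from collections import deque

ACTIONS = ["none", "left", "right", "jump", "jump_left", "jump_right"]

def enumerate_solutions(start, simulate, fitness, at_goal,
                        optimal_fitness, threshold=0.0, max_depth=200):
    """Breadth-first search over action sequences; any node whose fitness
    falls more than `threshold` below the A*-optimal fitness is pruned."""
    solutions, frontier = [], deque([(start, [])])
    while frontier:
        state, path = frontier.popleft()
        if at_goal(state):
            solutions.append(path)
            continue
        if len(path) >= max_depth:
            continue
        for action in ACTIONS:
            nxt = simulate(state, action)
            if fitness(nxt) >= optimal_fitness - threshold:
                frontier.append((nxt, path + [action]))
    return solutions
```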
Figure 26: Visualization of goal relevance computation. In this case, we are
computing the goal relevance of the lower set of blocks. A & B: First, a
posterior state is created in which the object in question is removed. C & D:
Then, the solution set for each state is found using the guided breadth-first
search described above. Paths are colored from green to red based on their
fitness compared to the optimal path found by A*, shown in yellow. Mario is
too large to fit through the gap at the bottom, so he must jump through the
middle. E & F: The solution sets are converted into probability distributions
via grid discretization. These two images have been edited with increased
gamma and blurring to improve visualization. Lastly, the goal relevance
value is determined by applying Equation 19 to the two distributions.
After obtaining a list of possible solutions, a model space must
be chosen that can represent them as a distribution. Similar to the
procedure in Section 3.2, we discretize the visible screen into a 2D
grid and project the solutions onto it, where each grid cell is assigned
a value equal to the number of solutions passing through that cell.
The resulting distribution can be compared (using Equation 19) to
another distribution that does not include the object in question to
produce the goal relevance of the object. This full process is shown in
Figure 26.
The last decision to make with regards to computing goal relevance
is what should be considered an object for which we will evaluate
goal relevance. It is clear that we should compute goal relevance for
each enemy, but the method for evaluating the relevance of blocks
and terrain is not obvious. For blocks, we decided to consider each
connected component (groups of adjacent block tiles) as a single ob-
ject. The same strategy cannot be used for terrain because all of the
terrain in an entire level would be considered as a single connected
component, and we want to capture the different sections of the ter-
rain. Instead, we evaluate individual grid units of terrain at corners:
locations where a unit of terrain is bordered by non-terrain on the top
and either the left or the right. Overall, this lets our model consider
many possible places of interest.
We additionally compute the goal relevance of Mario himself. How-
ever, a game state without Mario included would have no solutions,
which would lead to a null distribution over which KL divergence
is undefined. This same problem occurs when there are no paths to
a goal. In Chapter 3 we suggest that this situation be handled by
adding a fixed, uniform low probability density over the solution
space. Therefore, the goal relevance of Mario can be computed by
comparing a flat distribution with the prior. Figure 27 visualizes the
objects on which goal relevance is computed, along with the result.
4.3 analytical methods
With the data collected, two separate analyses were performed: 1) de-
termining if changes in goal relevance values caused by changes in
Figure 27: Objects on which goal relevance is computed. A: Each red dot rep-
resents an object to which we assign a goal relevance value. Goal relevance
is computed on enemies, terrain units at corners, blocks, and on Mario him-
self. B: The results of the goal relevance computations, visualized as a raw
saliency mask. This is the same mask that will be used in our saliency model.
In this case, the two most relevant objects are Mario and the Bullet Bill en-
emy. Note that some objects do not appear in the mask because of low or 0
values for goal relevance.
the solution space are correlated with the amount of time participants
spend looking at those objects, and 2) using the goal relevance model
as an additional feature in a saliency model to see if the predictive
accuracy can be improved. These analyses will be referred to as the
Solution Space Experiment and the Saliency Model Experiment, re-
spectively.
4.3.1 Solution Space Experiment
As noted earlier and shown in Figure 25, the game levels contain many objects designed to have significantly different values for
goal relevance depending on which version of Mario World the par-
ticipant has been assigned, which will be referred to as test objects.
This first experiment uses these to see if changing the solution space
causes a predictable change in human gaze behavior.
To show that changing the solution space causes differences in eye
movements, we need to measure the goal relevance and the amount
of time spent looking at many objects and see if changes in these aver-
ages across game versions are correlated between them. For example,
we expect that enemies hovering high in the air (as in Figure 24) will
have much higher goal relevance values for participants playing the
High Jump version, and we hypothesize that this will lead to those
same participants spending significantly more time looking at those
enemies than participants in the other groups.
In order to measure this, for each participant and each test object
we compute two values, referred to as the Attention Ratio (AR) and
the Goal Relevance Ratio (GRR). The Attention Ratio represents how
much time that participant spent attending to that object, and the
Goal Relevance Ratio represents the average goal relevance of the
object.
AR = \frac{\text{Number of frames attending to the object}}{\text{Number of frames the object is visible}}    (21)

GRR = \frac{\text{Sum of goal relevance of the object for each frame it is visible}}{\text{Number of frames the object is visible}}    (22)
The equations are designed this way to account for the fact that par-
ticipants progress through the game at different speeds. Even though
each participant sees every object, the objects are not visible for the
same amount of time for each participant. Participants who are slower
would have more frames in which to view the objects, so we must av-
erage based on the number of frames in which the objects are visible.
Also note that the goal relevance values are different in each frame,
depending on the game state, so they must also be averaged. The sub-
ject is considered to be attending to the object if their gaze location is
within a fixed threshold distance of the object's center. These values
are then averaged across participants, separately for participants in
each of the four versions of the game. This will provide us with four
AR averages and four GRR averages for each object. Finally, we an-
alyze these values within specific subsets of objects. Specifically, we
investigate the following object subsets:
1. Flying enemies (Figure 25A)
2. Custom wall lowers: The lower portion of the specially designed
wall sections (Figure 25B)
3. Custom wall uppers: The upper portion of the specially de-
signed wall sections (Figure 25B)
4. Tunnels: Long horizontal blocks under which only small Mario
can fit (Figure 25C)
5. Raised ledges: Blocks onto which only high jumping Mario can
jump (Figure 25D)
6. Uninteractable enemies: Enemies hidden behind blocks such
that no version can interact with them (Figure 25E & F)
7. Interactable enemies: All enemies not considered uninteractable
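A sketch of the per-participant, per-object computation of Equations 21 and 22 is given below; the per-frame record layout (visible, attended, goal relevance) is an assumed convention.

```python
def attention_and_relevance_ratios(frames):
    """frames: list of (visible, attended, goal_relevance) tuples, one per
    recorded frame, for one participant and one test object."""
    visible = [f for f in frames if f[0]]
    n = len(visible)
    if n == 0:
        return 0.0, 0.0
    ar = sum(1 for f in visible if f[1]) / n   # Equation 21
    grr = sum(f[2] for f in visible) / n       # Equation 22
    return ar, grr
```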
4.3.2 Saliency Model Experiment
The next experiment aims to show that goal relevance can be used to
improve predictive accuracy in saliency models. Specifically, our aim
is not to show state-of-the-art results but to demonstrate that goal rele-
vance presents a new theory of top-down attention that can compete
with machine-learned top-down features and contribute to a com-
bined model. Our saliency model is a combination of three separate
models: Itti and Koch’s classic bottom-up model [61], the learned top-
down model presented in [100] (their full model also uses [61], but
for clarity we include the two parts separately), and our top-down
goal relevance model. The bottom-up model compares distributions
of low-level features at multiple scales and identifies conspicuous lo-
cations based on image statistics. The learned top-down model com-
putes a reduced feature vector for each image representing the "gist",
and learns a mapping between this vector and likely gaze locations.
To construct the goal relevance saliency mask for each frame of
the recorded video, the goal relevance values are first computed as
described in Section 4.2.1. Then, for each object a bounding box in the
image is computed and its goal relevance is distributed evenly across
the values in the corresponding box of the mask. Distributing the goal
relevance in this way accounts for the fact that some objects are larger
on the screen, and ensures that the total presence of an object in the
mask is proportional to its goal relevance value. An example mask
is shown in Figure 27. This goal relevance mask will be provided to
the combined model, in addition to the other masks produced by the
bottom-up and learned top-down models. Each mask is individually
blurred to maximize performance.
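A sketch of the mask construction, assuming each object supplies a pixel bounding box and a goal relevance value, with a Gaussian blur standing in for the per-mask blurring step:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def goal_relevance_mask(objects, height, width, sigma=8.0):
    """objects: iterable of ((x0, y0, x1, y1), relevance) entries.
    Each object's relevance is spread evenly over its bounding box, so its
    total presence in the mask is proportional to its relevance value."""
    mask = np.zeros((height, width))
    for (x0, y0, x1, y1), rel in objects:
        area = max((x1 - x0) * (y1 - y0), 1)
        mask[y0:y1, x0:x1] += rel / area
    return gaussian_filter(mask, sigma)
```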
To combine these three masks into a single model, we take a simple
linear combination of the masks and their second-order terms. That is,
given masks BU, TD, and GR, the combined mask can be computed
with the following equation:
C = x_0 BU + x_1 TD + x_2 GR + x_3 BU \cdot TD + x_4 BU \cdot GR + x_5 TD \cdot GR + x_6 BU \cdot TD \cdot GR    (23)

where x_0 - x_6 are coefficients selected using the simplex search
method [72], an iterative algorithm for minimizing non-linear func-
tions. To avoid overfitting, we randomly set aside one participant
from each group (2 males, 2 females) and use the results from these 4 participants in the simplex search. The remaining analysis proceeds with these participants excluded, leaving 20 (10 males, 10 females) in
each group.
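Fitting the seven coefficients can be sketched with SciPy's Nelder-Mead (simplex) optimizer; mean_nss, a scorer returning the mean NSS of the combined model on the four held-out participants, is an assumed helper.

```python
import numpy as np
from scipy.optimize import minimize

def combined_mask(x, BU, TD, GR):
    """Equation 23: linear combination with second- and third-order terms."""
    return (x[0] * BU + x[1] * TD + x[2] * GR
            + x[3] * BU * TD + x[4] * BU * GR + x[5] * TD * GR
            + x[6] * BU * TD * GR)

def fit_coefficients(mean_nss, x0=None):
    """Maximize mean NSS on held-out data by minimizing its negative."""
    x0 = np.ones(7) if x0 is None else x0
    return minimize(lambda x: -mean_nss(x), x0, method="Nelder-Mead").x
```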
A number of metrics have been proposed for evaluating saliency
models, from Normalized Scanpath Saliency (NSS) and several vari-
ants of area-under-the-curve (AUC) to the Correlation Coefficient (CC), Kullback-Leibler divergence (KL), and most recently Informa-
tion Gain (IG) [70] metrics. For two reasons, we have decided to pri-
marily report our results using the NSS score (we also computed our
results using the AUC metric, which can be found in Section B.2).
First, typical saliency models operate on datasets of static images,
where each participant views each image for a period of time. This
provides a distribution over the image for each participant of at-
tended locations and allows for a correspondence between partici-
pants. In this experiment we used dynamic video (controlled by real-
time user input), which provides us with only a single point for each
image and prevents any correspondence between frames. This makes
the creation of a human competitor model as a benchmark impossi-
ble. Also, several of the saliency metrics are designed to compare be-
tween two distributions and thus are not as well-suited for comparing
a model prediction to a single point. The CC, KL, and IG scores fall
into this category. Secondly, there has been much discussion in the
saliency community about the growing number of metrics and the
inconsistencies between them. In their review, Bylinskii et al. [19]
recommend KL or IG when evaluating probabilistic saliency models
and the NSS and CC metrics when using saliency models for cap-
turing viewing behavior. Because our experiment falls into the latter
category, we chose NSS.
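With a single gaze point per frame, NSS reduces to reading the z-scored mask at the fixated pixel; a minimal helper (ours) makes this concrete:

    import numpy as np

    def nss(saliency, fixation):
        # Z-score the saliency map, then report the normalized value at
        # the fixated pixel; chance level is 0. With one gaze point per
        # video frame, per-frame scores are simply averaged over frames.
        z = (saliency - saliency.mean()) / (saliency.std() + 1e-12)
        r, c = fixation
        return z[r, c]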
4.4 results
4.4.1 Solution Space Experiment
The results of the object subset analyses are shown in Figure 28. The
AR and GRR values did not follow a normal distribution, but instead
followed a heavy-tailed distribution. This happened because there
were many objects that did not receive any attention at all from par-
ticipants, and also many objects that did not affect any paths to the
goal. Because the AR and GRR values do not follow a normal distri-
bution and because, in the case of Figure 28F, the number of samples
is not the same, analyses were performed using the Wilcoxon rank
sum test [45]. All of the statistical results for this figure are shown in
Table 5.
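Such a comparison can be run directly in SciPy; the arrays below are illustrative placeholders, not our data:

    import numpy as np
    from scipy.stats import ranksums

    # Per-object GRR values for two versions; heavy-tailed with many
    # zeros, which is why a rank-based test is used instead of a t-test.
    grr_high_jump = np.array([0.0, 0.0, 0.31, 0.55, 0.12, 0.0, 0.48])
    grr_regular = np.array([0.0, 0.0, 0.0, 0.02, 0.0, 0.01, 0.0])

    z, p = ranksums(grr_high_jump, grr_regular)  # Wilcoxon rank sum [45]
    print(f"Z = {z:.4f}, p = {p:.3g}")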
We are specifically interested in verifying that any clear trends in
GRR values are matched by the corresponding AR values. Figure 28A
shows the values for all flying enemies across the 4 game versions. As
high-jumping Mario is the only version capable of interacting with
these enemies, the GRR values are trivially 0 for all other versions.
However, note that the high-jumping condition also has the highest
attention ratio, so in this instance there is a correlation between the
GRR and the AR. Although the high-jump GRR is significantly higher
than the other GRR values, the high-jump AR is not significantly
different from the regular Mario AR at the 5% level, despite the visual
trend, because of large standard deviations. Perhaps more data is
needed, or there is something particular to regular Mario that is not
captured by our model. Since we require the simultaneous success of 6
significance tests, there is an increased probability of a failure. There
is still a visible trend that flying enemies had larger GRR values and
were attended to more frequently in the high-jumping version.
Figure 28: Attention Ratio (AR) and Goal Relevance Ratio (GRR) values for
specific subsets of objects. These are the same object subsets shown in Fig-
ure 25A-F. For easier comparison, axes are scaled such that the AR and GRR
values for regular Mario are centered. In A, the regular GRR is 0, so we scale
based on the high-jumping values instead. Columns are highlighted in green
if both the AR and GRR values differ in the same direction from this baseline
(both above or both below), and red if they do not. This is explained further
in the text. A-E: Triangles indicate that the AR and GRR values for a version
are both significantly different from all of the other 3 columns, respectively.
F: Instead of comparing between game versions, this graph compares AR
and GRR values between interactive and non-interactive enemies, averaged
across versions. The larger GRR value for interactive enemies is matched by
a larger AR value.
The custom wall upper sections and the ledges both had significant
results, with the highest GRR values for the high-jumping version
matched by the highest AR values. The higher goal relevance comes
from these objects once again being higher in the air, and mostly
reachable only by high-jumping Mario.
Figure 28B tells a similar, though slightly less intuitive, story.
For the lower section of the custom walls, the GRR value is signifi-
cantly lower for small Mario because small Mario can fit under the
wall, so removing the wall has less of an impact on the solutions. We
can see that this is matched with the significantly lowest AR value.
The GRR is also lower for high-jumping Mario because he can jump
over the entire wall and ignore the bottom section, but the AR in that
case is only a bit lower and it is not significant.
Figure 28D shows the results of what we refer to as tunnels, which
are long sequences of blocks that leave just enough space underneath
for small Mario to pass through. The initial prediction for these ob-
jects was that small Mario would have the lowest GRR because he is
affected the least, just as with the lower wall sections. The small Mario
cases indeed had a lower GRR, but it turns out that high-jumping
Mario has the lowest GRR. This is due to high-jumping Mario being
able to spend more time in the air, where blocks near the ground
have less of an effect. The AR for small Mario is the lowest as in our
prediction, but the results for this chart are not significant.
Table 5: Statistical results using the Wilcoxon rank sum test [45], correspond-
ing to the data in Figure 28. For each object subset, the three GRR and three
AR entries give the comparisons of the indicated version against each of the
other three versions.

Flying - High Jump
    GRR: Z = 15.9949, p = 1.39×10^{-57};  Z = 15.9949, p = 1.39×10^{-57};  Z = 15.9949, p = 1.39×10^{-57}
    AR:  Z = 1.7275, p = 0.0841;  Z = 2.7625, p = 0.0057;  Z = 3.9758, p = 7.02×10^{-5}

Wall Lowers - Small
    GRR: Z = -17.9961, p = 2.10×10^{-72};  Z = -4.3609, p = 1.30×10^{-5};  Z = -20.9571, p = 1.62×10^{-97}
    AR:  Z = -4.3885, p = 1.14×10^{-5};  Z = -3.7750, p = 1.60×10^{-4};  Z = -4.9984, p = 5.78×10^{-7}

Wall Uppers - High Jump
    GRR: Z = 15.9967, p = 1.35×10^{-57};  Z = 22.1635, p < 1×10^{-100};  Z = 14.5943, p = 3.05×10^{-48}
    AR:  Z = 4.1647, p = 3.12×10^{-5};  Z = 6.2896, p = 3.18×10^{-10};  Z = 6.7039, p = 2.03×10^{-11}

Ledges - High Jump
    GRR: Z = 10.9232, p = 8.93×10^{-28};  Z = 11.0262, p = 2.86×10^{-28};  Z = 11.2031, p = 3.94×10^{-29}
    AR:  Z = 2.2740, p = 0.023;  Z = 3.6567, p = 2.55×10^{-4};  Z = 3.8564, p = 1.15×10^{-4}

Interactable vs. Uninteractable
    GRR: Z = 22.6520, p < 1×10^{-100}
    AR:  Z = 14.1159, p = 3.03×10^{-45}
Finally, Figure 28F is different in that it compares two object subsets,
rather than a single subset across the different versions. However,
the results are clear: non-interactive enemies, which trivially have a
goal relevance of 0, are also looked at significantly less frequently
than interactive enemies, which have a significantly higher goal rele-
vance.
Also shown in Figure 28 are the comparisons of the subset versions
against a baseline version. For most panels, the regular version is
used as the baseline, except in Figure 28A, where the high-jumping
version is used instead because the regular GRR value is 0. In cases
where the AR and GRR values of a version are both above the base-
line values or both below the baseline values, a green background
is used; otherwise a red background is used. Of the 16 comparisons
(3 in each of A-E, 1 in F), only 2 show a red background. This
demonstrates a general trend for differences in GRR values to be
matched by differences in AR values. In the two instances where dis-
agreement occurred, it is possible that there are factors specific to
these cases that are not captured in our model, or perhaps more data
is necessary.
Additionally, we visualize the AR and GRR values for all objects
as a scatter plot, shown in Figure 29. To see trends more easily, we
separately plot the data for each class of object (enemies, terrain, and
blocks). Mario is not included in these plots because he contributes
only a single data point. The best fitting linear models are also drawn, and we
can see that in all cases there is a positive correlation between AR
and GRR values. All three regression lines were significant at the 5%
confidence level.
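Each of these fits is an ordinary least-squares regression; a sketch of how one such line and its significance could be computed (the helper name is ours):

    from scipy.stats import linregress

    def ar_grr_fit(grr, ar):
        # Best-fit line for AR as a function of GRR within one object
        # class; fit.pvalue tests the null hypothesis of zero slope,
        # which is what the 5% significance of the lines refers to.
        fit = linregress(grr, ar)
        return fit.slope, fit.intercept, fit.pvalue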
4.4.2 Saliency Model Experiment
As described earlier, we also used goal relevance to create a saliency
mask, which was included in a combined model along with masks
Figure 29: Scatterplots of AR values vs. GRR values for each class of object
on which goal relevance is evaluated. The best fitting linear model is drawn
for each plot, along with the statistical results for that line.
from bottom-up and learned top-down models. Figure 30 visualizes
several example frames and the masks produced by each of the 3 com-
ponents individually, along with the combined model mask. Overall,
goal relevance generally provides more intuitive masks, focusing on
enemies and objects with which Mario will soon interact. However,
human participants also spend significant time looking at other parts
of the scene, such as less relevant objects or the background, and
these cases are better captured by the learned top-down model.
The metric results of the models are shown in Figure 31. Here we
show the model results only for the NSS metric, but results for the
AUC metric can also be found in Section B.2. To compare the contri-
bution of the three models, we show results for each model individ-
ually as well as the three combinations of two models together. For
example, a combined model including only BU and TD is one where
x_2 and x_4 - x_6 are set to 0, and x_0, x_1, and x_3 are found using the
simplex search method as above. These combinations allow us to see
Figure 30: Example images along with the model prediction masks. We show
the mask produced by each individual model (including blurring) along
with the final combined model. In each image, the cyan dot marks the par-
ticipant's gaze location. Note that the masks shown include the final blur-
ring, which particularly affects the BU and GR masks. Mask images have
also been edited with increased gamma to improve visibility. In most of these
examples, goal relevance does very well. For each row, starting from the top:
1) Gaze is directed to a ledge as Mario falls. 2) Gaze is directed to a cannon
which fires missiles. The left-most cannon is not considered relevant because
Mario is in the ascent of his jump, so most solutions land beyond it. 3) Gaze
is directed towards an enemy. 4) Gaze is directed towards a platform onto
which Mario will soon land. 5) The last example shows a situation where
the TD model performs much better. Gaze is being directed towards the
background above Mario. Goal relevance is incapable of predicting this lo-
cation because there are no objects there, but the learned model places a low
probability there.
the effects of adding or removing individual components from the
combined model.
Although goal relevance is not intended to be used as a stand-alone
saliency model, we can still gain insights from comparing the individ-
ual models. Firstly, it is clear that TD performed the best of the three
models, based on its individual score and on the scores of combi-
nations in which it is included. Goal relevance performs well, and
Figure 31: NSS scores for all combinations of the 3 model components:
bottom-up, top-down learned, and goal relevance. In all cases, a combi-
nation that includes goal relevance performs significantly better than the
corresponding combination without goal relevance.
every combination in which it is included has a higher score than the
corresponding combination without it. For example, the TdGr combi-
nation does much better than TD on its own.
The BU model struggled with this task, with the individual BU
model barely scoring above chance. This may be caused by the biases
present in this particular task. For example, there was an extremely
strong bias for participants to attend to the right side of the screen,
because their goal is to move Mario to the right and therefore the
right side contains the information needed for planning a path. The
TD model can handle this bias because it learns where the partici-
pants tended to look, and the GR model discovers this on its own
because the paths with higher fitness all go towards the right. The
BU model has no way to account for this, so it gives just as much
weight to surprising things on the left side as on the right side, signif-
icantly decreasing its score. However, we can see that the information
provided by the BU model is still useful because combining BU with
the other models yields improvements. This happens because of the
second-order terms, where the BU mask is multiplied by other masks.
In these terms, the right-side bias is already contained in the addi-
tional masks and BU simply filters for the most surprising regions
remaining.
The higher scores from the GR model suggest that this task had
a strong effect on eye behavior. Specifically, it indicates that partic-
ipants reliably gazed towards objects that directly affected how the
task could be solved. The goal relevance calculations were able to
capture this, including the right-side bias, purely based on a compu-
tational definition of the task itself. It is interesting that goal relevance
was a reliable predictor of eye behavior even in the presence of very
strong distractors. For example, some of the non-interactive enemies
(such as the shells bouncing back and forth) are extremely distracting
due to the speed at which they move and the sounds they make. In
terms of bottom-up saliency, they are often the most salient objects
on the screen, and it is quite reasonable to expect participants to look
at them frequently. Goal relevance, by contrast, gives them a value
of 0 because they do not affect any solutions. Although the partici-
pants did occasionally look at these enemies, Figure 28F, together
with the NSS scores, shows that the difference in goal relevance had a
very strong effect on the time spent looking at them.
The TD model performed the best because it is able to directly learn
where participants tend to look. In this case, it learned the right-side
bias and also that participants frequently looked towards a particular
spot slightly below the center of the screen, which can be seen in the
TD masks in Figure 30. The game screen scrolls to the side as Mario
progresses through each level, such that Mario is always centered hor-
izontally, and this spot is based on the average height of Mario. The
model also learns other regions of the screen where each participant
tends to look, although on average it does not give them as much
weight as this below-center location.
Finally, introducing an additional component caused a significant
improvement in the NSS score in all cases. This is a pleasing result
which suggests that none of the models overlaps another in terms of
the information it provides. In particular, goal relevance performed
well and significantly contributed to all of the models in which it was
included. There was a significant effect of adding the GR mask to
the BU mask to form the BuGr model (F(279648) = -313.1510, p < 1×10^{-100}).
Similarly, adding GR to the TD mask was also significant
(F(279648) = -91.2175, p < 1×10^{-100}), as was adding GR to the
BuTd model (F(279648) = -89.9835, p < 1×10^{-100}).
4.5 discussion
In Chapter 3, goal relevance was designed to measure the degree to
which information pertains to a task, but that does not necessarily
imply that it would be correlated with eye behavior. However, all of
the results together suggest that goal relevance is indeed correlated
with eye movements. Almost all object subsets tested in Figure 28 pro-
duced intuitive results correlating goal relevance and visual attention.
Figure 28 also shows that changing the task itself causes a change in
goal relevance that is matched by changes in eye behavior. This is a
promising result that might indicate goal relevance can generalize to
other tasks. Finally, the best fit lines in Figure 29 directly showed that
goal relevance is correlated with eye movements. This remained true
regardless of the type of object being measured.
Furthermore, Figure 31 demonstrates the effectiveness of goal rele-
vance as a predictor of eye behavior in a saliency model. The fact that
goal relevance improved the model for each combination in which
it was included suggests that the predictive information provided
by goal relevance is distinct from both the bottom-up and top-down
learned features. This supports our belief that goal relevance provides
a new potential stream of information that can be incorporated into
any saliency model involving a specific task. The best models can
be formed by combining multiple distinct sources of information, in-
cluding learned features and conceptually bound features like those
of the BU and GR models.
Even though the learned top-down model had the highest individ-
ual performance based on NSS scores, the fact that goal relevance
captures distinct information means that it will generally improve the
results of any model in which it is included. Also, goal relevance
performs well despite the significant advantages of machine learning
techniques and comes without their disadvantages. Goal relevance re-
quires no training data at all, whereas most learned top-down models
(including this one) learn from eye behavior of the same participant.
The TD model is not capable of making a priori predictions about new
participants or new tasks, but the goal relevance model handles this
no differently than examples it has seen before. Also, even though we
refer to the learned model as a top-down model, its learning process
does not discriminate between kinds of information. While learning
top-down information, it also captures bottom-up and any other
category of information present in the data, so it is really an all-around
learned model.
Despite the correlations between goal relevance and eye movements
in this study, there are some caveats to this approach. One such issue
is that goal relevance still requires specification by the researcher to
define the task solution space and to designate which pieces of infor-
mation should be evaluated for relevance. Every task is different, and
there is no way to escape the fact that tailoring computation to a task
requires some additional information. Goal relevance can be thought
of as a conceptual framework that provides a simple way to think
about tasks and information. This is similar to Itti & Baldi’s concept
of surprise [59], which also defined a novel way of thinking about
bottom-up attention but left open the question of which features to
use. Of course, this similarity is not an accident, as the two concepts
share much of the same formulation.
The choices of how to model the solution space and which objects
should be evaluated for goal relevance undoubtedly had a large im-
pact on the goal relevance values. There is not always an obvious
answer to what should constitute a single piece of information for
the purposes of this calculation. When deciding how to handle the
blocks, for example, we could instead have evaluated all of the pieces
(individual tiles) of each block individually. Also, we could have eval-
uated the relevance of empty spaces by comparing with game states
containing blocks inserted at the same locations. This decision, along
with the representation of the solution space, can be thought of as
encapsulating the subjective nature of the task. None of these choices
is any more inherently correct than another, and ultimately we chose
to exclude empty spaces and to group blocks into connected compo-
nents because it seemed more intuitive to process the game environ-
ment in that way.
Another important choice was the manner in which goal relevance
values were converted into a saliency mask to be used in the model.
As goal relevance is still a very new concept, the most straightforward
method was chosen, simply using object bounding boxes filled with
intensities proportional to the goal relevance values. This implies a
very basic logic scheme is being used, namely that gaze is directed
towards the most relevant pieces of information. This includes no in-
hibition of return, and does not account for the many sub-problems
that compose complex tasks. For example, even though the overall
goal was to reach the end of a level, a participant might briefly be-
come focused specifically on jumping onto a single ledge. This could
lead them to focus on the positioning of the ledge and on Mario,
ignoring other enemies and obstacles that are still relevant but are
beyond the ledge. Future work might improve upon goal relevance
by developing a hierarchy of goals and subgoals to account for cases
such as this one.
Our results do not imply that the details of how relevance is com-
puted match the implementation in the human brain. Indeed, one
might be doubtful that human participants simulated thousands of
game states at each moment in order to decide where to direct their
gaze. This is certainly another direction for future research, but we
believe it has been established that goal relevance arrives at the same
result as the human computations, even if the brain gets there in
a different manner. It is possible that humans develop heuristics [46],
in which case we would argue that goal relevance is the target that
the heuristics attempt to approximate.
Overall, this study demonstrates that goal relevance provides a
complete theory for the top-down behavioral effects of tasks on hu-
man gaze, from deciding which information is relevant to the task
to how that information affects gaze behavior. In addition, it can
be added into saliency models, even those which already include
learned top-down features, to improve their performance. This is an
important step towards improving our understanding and models of
top-down gaze in humans.
5
C O N C L U S I O N
In Chapter 2, we described a new biologically-plausible learning rule
for explaining how top-down modulatory connections might work in
the brain, and used it to learn, for the first time, a network responsive
to border-ownership in its input. In Chapter 3, goal relevance was
proposed for quantifying the degree to which information is related
to an agent’s task, and we showed that it strongly correlated with
human responses in a 2D navigation experiment. Finally, in Chap-
ter 4 goal relevance was shown to be correlated with human atten-
tion during video gameplay, and additionally was able to improve
the accuracy of a saliency model by being introduced as an addi-
tional feature. Together, these contributions represent a step forward
in our understanding of how tasks affect human attention, by improv-
ing our understanding of both top-down processing in the brain and
top-down attention modeling. In particular, goal relevance provides
a strong conceptual foundation for explaining top-down information
relevance which until now has been elusive. It can be used in models
of eye behavior while providing the intuition and a priori prediction
capabilities that artificial learning methods lack. We hope that this
work can be improved upon to continue uncovering the details of
gaze behavior.
Part III
A P P E N D I X
A
B O R D E R O W N E R S H I P A P P E N D I X
a.1 stability of modulatory connections
This section details the possible transitions that occur in the network
of Figure 8 using various learning rules. The nomenclature used for
the various states of the network is the same as introduced in Sec-
tion 2.3.1.
a.1.1 Conflict Learning Transitions
Traditionally this type of stability analysis is performed by analyzing
the properties of the Jacobian. The discontinuous nature of the spread-
ing component (Equation 8) of conflict learning, which is caused by
the categorization of neurons as strongly learned or not, precludes
writing a single equation for the individual components of the Jaco-
bian. Given the analyzed network, this would mean creating a distinct
Jacobian for each state and categorization of neurons, which would
only serve to complicate the presented analysis. We instead continue
in the same fashion as in Section 2.2.1 and Section 2.3.1.
transitions out of 0sl    When the network is in its initial
unlearned state 0SL, there is no association between modulatory in-
put and competitive neurons, so regardless of which neuron wins or
which modulatory input is active, the update occurs in the same fash-
ion. Without loss of generality, let N_1 and M_1 be the active neurons.
The updates from M_1 are then:

\Delta w_{M_1 N_1} = x_{M_1} x_{N_1} > 0    (24)

\Delta w_{M_1 N_2} = -x_{M_1} x_{N_2} < 0    (25)

which transitions N_1 into being strongly learned towards M_1. Con-
nection weights from M_2 are unchanged because M_2 is inactive.
transitions out of 1sl    Once in a state where one of the com-
peting neurons has a strongly learned connection, there are four pos-
sible scenarios of activation. We will again assume, without loss of
generality, that N_1 has strongly learned connections from M_1, and
that N_2 has no strongly learned connections.

• N_1 and M_1 active: \Delta w_{M_1 N_1} is the only positive update, so the
weight changes proceed as they did under the same conditions
in the initial state, keeping the network in the 1SL state.

• N_1 and M_2 active: Because N_1 is strongly learned towards M_1,
its spreading term will be 0 as M_1 is inactive. Thus none of N_1's
weights can change and the network remains in the same state, the
spreading component preventing the network from transitioning into
the 2SL-Split state. N_2 receives inhibition from N_1, causing it
to unlearn towards the active modulatory neuron M_2, which
results in no effective change as w_{M_2 N_2} is already w_{min}.

• N_2 and M_1 active: In this simple example, the existing feedback
from the strongly learned connection between N_1 and M_1 over-
rides the driving input to N_2, so N_1 becomes active and N_2
inactive, which we have already seen results in no change to the
state.

• N_2 and M_2 active: N_2 has no strongly learned connections, thus
its spreading term is 1 and N_2 can learn towards the active modulatory
input M_2. N_2 becomes strongly learned towards M_2 and the network
enters the 2SL-Desired state.
In more complex networks, it is possible to transition from the 1SL
state to the 2SL-Shared state. In these networks, in place of a sin-
gle neuron, modulatory input comes from correlated populations of
neurons. Depending on the activation of the population, a particular
competitive neuron may only be able to learn a subset of connections
to a population while one of its competitors learns a different subset.
Alternatively, there may be overlap between populations of modula-
tory inputs, meaning that some of the neurons that are learned be-
long to both populations, resulting in a sharing of strongly learned
connections.
transitions out of 2sl-shared When a network is in this
state, more than one competitive neuron has a strongly learned con-
nection to the same modulatory population. In this case, the unlearn-
ing component in conjunction with the SLT component work to make
this an unstable state and move the network back to 1SL: the less ac-
tive competitive neuron will actively unlearn its connection to the ac-
tive population, while the more active one strengthens its connection.
Over time this will result in one of the neurons losing its strongly
learned status to that population, allowing it to return to an initial
unlearned state. The SLT component allows initial changes to hap-
pen quickly and creates momentum via long-term statistics once one
neuron begins to consistently win versus the other.
Consider the behavior of the simple network of Figure 8 if placed
into the 2SL-Shared state: because N_1 and N_2 both have strongly
learned connections to M_1, the following applies identically to either
N_1 or N_2:
• If M_1 becomes active, either N_1 or N_2 will be more active, de-
pending on noise. The winner will update its weights further
towards M_1 while the loser will unlearn its weights towards
M_1. If N_1 were the winner, this differential in weight value will
cause N_1 to win versus N_2 in future cases of M_1 being active,
maintaining these weight updates until the system returns to
the 1SL state:

\Delta w_{M_1 N_1} = (1 - 0)\,x_{M_1} x_{N_1} \cdot 1 - 0 \cdot x_{M_1} x_{N_1} = x_{M_1} x_{N_1} > 0    (26)

\Delta w_{M_1 N_2} = (1 - 1)\,x_{M_1} x_{N_2} \cdot 1 - 1 \cdot x_{M_1} x_{N_2} = -x_{M_1} x_{N_2} < 0    (27)
• If M_2 becomes active, neither N_1 nor N_2 will perform posi-
tive learning because they are strongly learned towards M_1 and
their spreading terms are both 0.
a.1.2 Generalized Hebbian Algorithm
The Generalized Hebbian Algorithm (GHA) can be shown to be un-
stable for the network of Figure 8 using the same procedure as was
used for the normalization based Hebbian learning rule (Equation 4).
GHA adjusts weights as follows:

\Delta w_{ij} = \eta \left( x_j x_i - x_j \sum_{k=1}^{j} w_{ik} x_k \right)    (28)
Figure 32: State diagram for the simple network of Figure 8 for the General-
ized Hebbian Algorithm (Sanger's Rule) and BCM. This diagram shows the
progression of the network from an initial unlearned state (0SL) to the de-
sired state of each competing neuron learning a unique modulatory input
(2SL-Desired), much like Figure 9 did for a normalized Hebbian learning
rule and conflict learning. Outgoing transition probabilities as well as the
percentage of time spent in each state are shown for both (a) the General-
ized Hebbian Algorithm (GHA) and (b) BCM, based on simulation. States
which were not reached by a learning rule have been omitted for clarity. The
3SL state corresponds to exactly three strongly learned connections between
modulatory (M_1 and M_2 in Figure 8) and competitive neurons (N_1 and
N_2), regardless of which set of three connections is strongly learned. The
3SL state was not reachable by either rule shown in Figure 9, either due to
weight normalization or various components of conflict learning. Although
GHA can spend time in the 2SL-Desired state, it is not stable in that con-
figuration and oscillates between four different states. BCM is stable in the
2SL-Desired state, but is also stable in the 2SL-Split state. Using BCM, the
transition out of 1SL is essentially random, and the system cannot reliably
end up in the desired state.
Here we assume most of the same network assumptions as Sec-
tion 2.2.1. This means the network is already in the desired state and
w_{M_1 N_1} = w_{M_2 N_2} = w_{max} and w_{M_2 N_1} = w_{M_1 N_2} = w_{min}. However,
we assume M_2 and N_2 will be highly active instead of M_1 and N_1.
Now, replacing (Equation 4) with (Equation 28) yields:

\Delta w_{M_2 N_1} = \eta \left( x_{N_1} x_{M_2} - x_{N_1} (w_{min}\, x_{N_1}) \right)    (29)

As we are only interested in the sign of \Delta w_{M_2 N_1}, and because \eta > 0
and x_{N_1} > 0, we have:

sgn(\Delta w_{M_2 N_1}) = sgn\left( \eta \left( x_{N_1} x_{M_2} - x_{N_1} (w_{min}\, x_{N_1}) \right) \right) = sgn\left( x_{M_2} - w_{min}\, x_{N_1} \right)    (30)

Because M_2 is highly active while N_1 is being inhibited, x_{M_2} > x_{N_1}.
Considering this along with the fact that 1 > w_{min} > 0, it must be true
that \Delta w_{M_2 N_1} > 0, indicating that the system is not in a steady state.
Results for simulating this learning rule for the network of Figure 8
can be seen in Figure 32A. The simulation confirms that 2SL-Desired
is not a stable state for this learning rule and shows that the network
enters an oscillation between multiple states.
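For reference, a single GHA step in its usual matrix form, a sketch of the same rule as Equation 28 (this vectorized formulation is the textbook one, not necessarily the exact implementation used in our simulations):

    import numpy as np

    def gha_step(W, x, eta=0.01):
        # W: (outputs x inputs) weight matrix, x: input vector.
        # Sanger's rule trains output j on the residual left after
        # subtracting the reconstructions of outputs 1..j; np.tril
        # implements the triangular sum over k <= j.
        y = W @ x
        return W + eta * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)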
a.1.3 BCM
We can also demonstrate that the BCM rule, another variant of Heb-
bian learning, is not guaranteed to converge to the desired state of
the network of Figure 8. BCM uses a Hebbian update modulated by
a dynamic threshold to control explicit synaptic weakening:

\Delta w_{ij} = \eta\, x_i x_j (x_j - \theta_j)    (31)

where \theta_j is the expected value (long-term average) of x_j^2.
The value of \theta_j directly controls whether this rule is stable in the
desired state of the simple network. Let us assume that the network
is in the desired state 2SL-Desired, and M_1 is the current active
modulatory neuron, implying x_{N_1} = x_{active}, x_{N_2} = x_{inhibited}, and
x_{active} > x_{inhibited}. The sign of each weight update is then dependent
solely on the (x_j - \theta_j) term of (Equation 31). For the system to remain in
the stable state, \Delta w_{M_1 N_1} > 0 and \Delta w_{M_1 N_2} \leq 0 must hold, as these
updates maintain the same assignment of strongly learned connec-
tions. The system must simultaneously satisfy the case when M_2 is
the active modulatory neuron, which sets up a similar set of require-
ments: \Delta w_{M_2 N_1} \leq 0 and \Delta w_{M_2 N_2} > 0. Arranging all requirements
and substituting x_{active} and x_{inhibited} where appropriate, we get:

x_{N_1} = x_{active} > \theta_{N_1}
x_{N_2} = x_{inhibited} \leq \theta_{N_2}
x_{N_1} = x_{inhibited} \leq \theta_{N_1}
x_{N_2} = x_{active} > \theta_{N_2}    (32)

which is satisfied if and only if x_{active} > \theta > x_{inhibited}.
However, the BCM rule has another stable state which it can reach,
2SL-Split, which is the state where both modulatory neurons are asso-
ciated with a single competitive neuron. Once in the 2SL-Split state,
the competitive neuron with two strongly learned connections will
always activate more strongly than and inhibit the other because it is
receiving additional feedback input regardless of which modulatory
neuron is active.
Let us investigate the dynamics of the network in the 1SL state,
before it reaches either 2SL-Split or 2SL-Desired. Without loss of gen-
erality, assume N_1 has strongly learned connections from M_1, and
that N_2 has no strongly learned connections. Consider what happens
when the threshold, \theta, falls within the required bounds for stability
in 2SL-Desired, such that the winning neuron with activation x_{active}
will do positive learning, and the inhibited neuron with activation
x_{inhibited} will do negative learning. The interesting case is what hap-
pens when M_2 is the active modulatory neuron, which has no exist-
ing strongly learned connections (i.e. w_{M_2 N_1} = w_{M_2 N_2} = w_{min}). N_1
and N_2 thus receive identical input, so the winner is decided by noise.
Due to the value of the threshold, the winner will increase its weight
towards M_2, and the loser will decrease its weight towards M_2. If the
winner happens to be N_1, the system will transition into 2SL-Split. If
N_2 wins, the system will transition to 2SL-Desired.
This result can be seen in the simulation results presented in Figure 32B,
where the network under the BCM rule has two terminal
states: 2SL-Desired and 2SL-Split. To achieve this, we specifically ini-
tialized the adaptive threshold to a value between the network's
activated and inhibited activation levels.
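A minimal sketch of one BCM step with its sliding threshold (Equation 31), with the long-term average implemented as a simple low-pass filter; the parameter values are illustrative assumptions:

    def bcm_step(w, x_pre, x_post, theta, eta=0.001, tau=100.0):
        # Weight grows when postsynaptic activity exceeds the threshold
        # theta and shrinks below it (Equation 31).
        w = w + eta * x_pre * x_post * (x_post - theta)
        # theta tracks the long-term average of the squared postsynaptic
        # activity, E[x_post^2].
        theta = theta + (x_post ** 2 - theta) / tau
        return w, theta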
a.2 learning rule details
a.2.1 Activation
For all experiments, all model neurons use the same activation func-
tion regardless of learning rule. A neuronj has a continuous firing
ratex based on integrating weighted inputs:
x
j
=f(
FF+ Lat+(FB FF
2
)+
1+ Inhib
,
j
) (33)
where FF, Lat, and FB represent the sum of weighted inputs of all
excitatory (weightw >0) feedforward, lateral, and feedback inputs,
respectively. Each sum is calculated as:
P
i2type
w
ij
x
i
, where w
ij
is
the weight between neuronsi andj. Note that feedback is gated by
feedforward input; it cannot activate a neuron in the absence of feed-
forward driving input.
Inhib is calculated by taking the weighted sum of all inhibitory
inputs from more strongly active neurons.
\xi is a noise term sampled from a normal distribution: N(0, \sigma^2_{noise}).
f(x, \theta) sets the output to zero if it is less than a threshold value.
Thresholds are updated whenever a neuron is active and not inhib-
ited:

\theta \leftarrow \begin{cases} s\,FF + (1-s)\,\theta & \text{if } FF > \theta_{ff} \text{ and } Inhib < \theta_{Inhib} \\ \theta & \text{else} \end{cases}    (34)

where s is a smoothing parameter, \theta_{ff} a threshold for considering a
neuron active, and \theta_{Inhib} a threshold for considering a neuron inhib-
ited.
Thresholds are further bound between a minimum (\theta_{min}) and max-
imum (\theta_{max}) value. The minimum is set such that the noise term is
unlikely to spuriously activate the neuron.
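Read operationally, Equations 33 and 34 amount to the following per-neuron update; this is a sketch under our reading of the extracted equations (in particular the FB·FF/2 gating term), with parameter defaults taken from Table 6:

    import numpy as np

    def activation(ff, lat, fb, inhib, theta, sigma_noise=0.01):
        # Equation 33: feedback is gated by feedforward drive (the
        # fb * ff / 2 term is our reading of the extracted equation), and
        # inhibition acts divisively. Output is thresholded at theta.
        xi = np.random.normal(0.0, sigma_noise)
        x = (ff + lat + fb * ff / 2.0 + xi) / (1.0 + inhib)
        return x if x >= theta else 0.0

    def threshold_update(theta, ff, inhib, s=0.1, theta_ff=0.04,
                         theta_inhib=0.2, theta_min=0.04, theta_max=0.7):
        # Equation 34: smooth theta towards the feedforward drive only
        # when the neuron is active and not inhibited, then clamp.
        if ff > theta_ff and inhib < theta_inhib:
            theta = s * ff + (1.0 - s) * theta
        return float(np.clip(theta, theta_min, theta_max))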
a.2.2 Learning
In the experiments, each model neuron, under either learning rule,
learns each type of connection (i.e. feedback, feedforward, and lateral)
independently.
a.2.2.1 Hebbian Learning
Our experiments use a slightly modified version of GCAL [7] where
the threshold works as described in Section A.2.1, instead of a global
target activation based threshold such as that described in [117]. This
change to the threshold resulted in better performance and easier sys-
tem tuning for both of our experiments. The rule is otherwise the
same, using purely Hebbian logic, i.e. (Equation 4), to determine
weight updates, and the activation function described above (Sec-
tion A.2.1).
a.2.2.2 Conflict Learning
Conflict learning neurons adapt their weights as described in Sec-
tion 2.3. Neurons additionally have an accumulator of lifetime
short-term weight updates which is used for computing the smooth-
ing factor s_{ltm} for the long-term weight update:

acc_{ij}(t+1) = acc_{ij}(t) + \Delta_{ij}    (35)
The smoothing factor for the long-term update, s_{ltm}, is computed
by comparing this neuron's proportion of long-term weight against
its proportion of lifetime accumulator value (normalized w^{ltm}_{ij}(t) vs
acc_{ij}(t+1)). When the w^{ltm}_{ij}(t) update would move the long-term
weight proportion towards that of the accumulator, s_{ltm} is decreased,
proportional to the remaining distance between them. In cases where
the w^{ltm}_{ij} update would move the proportion away from the accumu-
lator, s_{ltm} is increased.

The smoothing factor for the short-term update, s_{stm}, is constant,
with smaller values preferring the short-term weight.
Short-term and long-term weights are divisively normalized inde-
pendently, as in the Hebbian update. Weights initially start lower than
their allowed totals and are not normalized until they have grown to
exceed it.

The full conflict learning rule is not used for learning inhibitory
connections as they serve as control signals for the rule itself. Instead,
these connections have a single weight based upon a normalized life-
time accumulation of weight updates:

acc^{lat-}_{ij}(t+1) = acc^{lat-}_{ij}(t) + x_i\, x_j\, w^{lat-}_{ij}\, (1 - Inhib)    (36)

w^{lat-}_{ij}(t+1) = \frac{acc^{lat-}_{ij}(t+1)}{\sum_k acc^{lat-}_{kj}(t+1)}    (37)
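Equations 36 and 37 amount to accumulating a lifetime trace per inhibitory input and renormalizing; a vectorized sketch for a single postsynaptic neuron j (names ours):

    import numpy as np

    def inhibitory_update(acc, x_pre, x_post, w, inhib):
        # acc, w, x_pre: vectors over all inhibitory inputs k to neuron j.
        # Assumes acc has already accumulated some nonzero mass.
        acc = acc + x_pre * x_post * w * (1.0 - inhib)  # Equation 36
        w_new = acc / acc.sum()                         # Equation 37
        return acc, w_new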
a.3 experimental methods
a.3.1 Simple Network
The simple network of Figure 8 is used to demonstrate the instability
of variants of Hebbian learning when modulatory connections are
present. The results for simulating the network for both weight re-
normalization Hebbian learning and conflict learning are presented
in Figure 9. Additional simulations using the Generalized Hebbian
Algorithm and the BCM rule are presented in Figure 32.

All connection weights are fixed to 1 with the exception of incom-
ing modulatory input to the competitive neurons, which adjust their
weights using the learning rule being tested. Competition is imple-
mented through lateral inhibition between neurons N_1 and N_2. All
tested learning rules use the same network parameters.
The results are averaged over 30 simulations. Each simulation con-
tains 100 presentations of input, which each consist of a uniformly
random modulatory input being active (activation set to 1) while the
driving input to both competitive neurons is simultaneously active
(set to 1). The non-active modulatory input is set to 0. The network
is exposed to each presentation for 100 iterations, followed by 10 it-
erations of all zero-valued input before the next presentation. State
transitions are computed based on the state the network is in before
and after the presentation of an input.
A connection is considered strongly learned if it meets the condi-
tions of the spreading component of conflict learning (Equation 8).
a.3.2 Border Ownership Network Architecture
To analyze our learning rule in feedback contexts, we focus on a
model of border ownership similar to that developed by [30]. The
network is organized into four layers of cells arranged retinotopi-
cally: the input, orientation selective, BO, and grouping layer (Fig-
ure 10). For the main BO experiments, we used a 40x50 grid of cells.
As training time scales with network size, the smallest network that
would still allow interesting stimuli to be presented was used.
The network is given grayscale input. The orientation layer receives
input from the input layer, and uses fixed log Gabor filters [38], pa-
rameterized by
gabor
= andf =
p
, to compute four orientation
maps (0, 45, 90, and 135
) representing a simplified V1-like layer.
There are four orientation selective neurons per grid space, giving
a total of 8,000 neurons.
The output of each of the four angles in the orientation layer below
provides input to two BO neurons at the same retinotopic location.
These BO neurons are grouped into a column at each location with
eight neurons, for a total of 16,000 BO neurons. The neurons within
a column have inhibitory lateral connections, initially with equally
distributed weight. BO neurons in a column have no concept of their
physical position relative to any other neuron, nor their border own-
ership polarity (i.e. left/right, up/down, etc). Initially, without any
learning for the lateral connections, neurons within a column are un-
A.3 experimental methods 120
aware of the neuron with which they will most directly compete to
form a BO pairing.
Each BO neuron provides feedforward input to all grouping cells
within a radius r of its retinotopic position, and receives feedback
from the same set of grouping cells. This radius r determines the scale
of objects that can be handled by the network. Both the feedforward
and feedback connections between these two layers are learned. The
grouping layer is much more sparsely populated than both the input
and border ownership layers, with roughly 1,000 neurons placed ran-
domly using a Poisson-disc algorithm [14]. Finally, there are lateral
connections between grouping neurons in a center-surround fashion,
extending to 0.6r for excitation and 3r for inhibition.
Training involves the repeated presentation of a moving square
with length 10, chosen to be slightly smaller than the grouping neu-
ron receptive field diameter 2r (see Figure 10C and D). Squares are
given a random initial position and orientation, and are scaled up or
down by up to 10% in size. Once placed, squares move in a random
linear path across the FOV until no longer visible. Each positioning
of a square is presented for 10 time steps to allow the network to settle.
The network is given a blank input for 10 time steps after the square is
no longer visible. Training is terminated after 40,000 squares are pre-
sented, a sufficient amount to show a plateau in the polarity scoring
metric, described next.
For evaluation, we compute a polarity vector for each BO neuron,
which represents the strength and preferred polarity direction of a
neuron. The polarity vector is calculated as the sum of retinotopic
vectors, each from the BO neuron to one of the grouping neurons
providing it feedback (scaled by weight strength), multiplied by 1
or -1 depending on which side of the BO neuron's orientation they
fall. The median absolute difference between the magnitude of po-
larity vectors for BO neurons of opposite polarity is then aggregated
across all neurons of each orientation preference, to give the overall
polarity score shown in Figure 11. Significance is established with a
Wilcoxon signed rank test. The polarity vectors are also used in Fig-
ure 13, where the polarity vectors from neurons in the same column
are weighted by activation and summed together to provide the re-
sulting response vectors.
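A sketch of this polarity vector computation for a single BO neuron, under the assumption that the side test uses the normal to the neuron's preferred orientation (the vectorization is ours):

    import numpy as np

    def polarity_vector(bo_pos, orientation, grp_pos, fb_weights):
        # Sum of vectors from the BO neuron to each grouping neuron
        # providing it feedback, scaled by feedback weight strength and
        # multiplied by +1 or -1 depending on which side of the BO
        # neuron's preferred orientation the grouping neuron falls.
        normal = np.array([-np.sin(orientation), np.cos(orientation)])
        total = np.zeros(2)
        for p, w in zip(grp_pos, fb_weights):
            v = np.asarray(p, dtype=float) - np.asarray(bo_pos, dtype=float)
            side = 1.0 if v @ normal >= 0.0 else -1.0
            total += side * w * v
        return total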
Finally, to compare the Hebbian learning rule versus all possible
variants of conflict learning, a smaller 30x30 network is trained for
each configuration (Figure 14), using the same methodology as the
larger network. For each configuration, we use the vertical orientation
as an exemplar, computing a histogram of polarity scores across all
vertical BO neurons. The medians for each score are then compared
and tested for significance with a Wilcoxon signed rank test.
a.3.3 Orientation Selective Network Architecture
The orientation selective network (Figure 15) has three layers: an in-
put layer, a center-surround layer, and an orientation selective layer,
like that used to demonstrate the properties of GCAL [117]. The
center-surround layer consists of both on-off and off-on preferential
cells. In order to avoid anti-aliasing issues and a bias towards diag-
onals caused by square pixels, the resolution of the input is scaled
by some amount, s, for the center-surround convolution. On-off cells
have a difference-of-Gaussians receptive field, with a sigma of 0.33s
for the larger Gaussian and 0.4s for the smaller one. The receptive
field for an off-on cell is the negation of an on-off cell.
Orientation selective neurons receive feedforward input from a disc
of center-surround neurons (on-off and off-on) within some radius r,
initially with equally distributed weight. Learning these connections
Parameter          Value             Description
\sigma_{noise}     0.01              Standard deviation of noise distribution.
\theta_{ff}        4\sigma_{noise}   Threshold of driving input for neuron to be considered active.
\theta_{Inhib}     0.2               Threshold of inhibition for neuron to be considered inhibited.
s                  0.1               Smoothing factor for threshold update.
\theta_{min}       4\sigma_{noise}   Minimum threshold value.
\theta_{max}       0.5 or 0.7        Maximum threshold value. Larger value used for orientation selective network.
\eta               0.01 or 0.001     Learning rate. Lower value used by Hebbian learning.
                   1.0               Balances positive versus negative learning for conflict learning.

Table 6: Parameter Listing
creates the orientation selective behavior of the neurons. Orientation
neurons further have lateral connections in a center-surround fashion
to promote grouping and competition. Excitation extends to 0.27r,
inhibition to 0.73r. Center-surround and orientation selective neu-
rons are placed randomly using the same Poisson-disc algorithm as
used for grouping neurons in the border ownership experiment. The
center-surround and orientation selective layers have approximately
1,600 (800 + 800) and 3,200 neurons, respectively, depending on the
randomness of the Poisson-disc algorithm.
Training involves the repeated presentation of an oriented line seg-
ment spanning the width of the input layer. Lines are given a ran-
dom initial position, orientation, and are translated across the field
of view (FOV) in a random direction until no pixel of the line can be
seen. Each position is held for 10 time steps, which is sufficient for
the network to settle. The network is given a blank input for 10 time
steps after the line is no longer visible in the FOV. Training is termi-
nated after 20,000 lines are presented, a sufficient amount of time to
maximize the selectivity score for a non-noisy network.
Orientations are assigned to neurons by finding the best fitting
Gabor function and taking its orientation and coefficient of determi-
nation (r^2) values, using the MATLAB library knkutils by Kendrick
Kay. The orientation is used for the hue in generating the color maps,
whereas the coefficient of determination is used for determining se-
lectivity. Pinwheel density is computed on orientation maps using
code adapted from Topographica [8] using the methods described in
[117].
For the noise and stability measurements, noise is introduced by
adjusting the standard deviation of the neuron activation noise term,
\xi, to \sigma_{noise} in the input layer (see results Figure 16 for noise values
used). The noise score is the average of the r^2 coefficients across all
orientation selective neurons. For stability, the scoring metric is iden-
tical to the metric used by [117]. We perform a paired-sample t-test to
test for significance.
a.3.4 Parameter Listing and Source Code
Key parameters for the learning and activation functions for both
Hebbian learning and conflict learning are displayed in Table 6. All
parameters were tuned for each experiment to maximize performance
with both rules in mind.
The experiments were performed using a custom framework writ-
ten in C++ explicitly for conflict learning, with some analysis of re-
sults performed using MATLAB or Python scripts. All learning rules
tested were implemented in this same framework. Source code is
available on the website for conflict learning [49].
funding
This work was supported by the National Science Foundation (grant
numbers CCF-1317433 and CNS-1545089), the Army Research Office
(W911NF-12-1-0433), and the Office of Naval Research (N00014-13-1-
0563). The authors affirm that the views expressed herein are solely
their own, and do not represent the views of the United States gov-
ernment or any agency thereof.
B
S A L I E N C Y M O D E L S A P P E N D I X
b.1 effect of \theta_{fitness} on nss scores
In Section 4.2.1, we described our method for dealing with an infi-
nite solution space, which was to prune the search tree by removing
any node with a fitness value more than \theta_{fitness} below the fitness
of the best path found using an A* search. However, computational
constraints required that we set this threshold to 0, meaning that we
pruned branches as soon as they became worse than the best solution.
To confirm that this did not affect our results, we reran the analysis
using higher values but only on the 4 validation participants used
for the simplex search [72]. This is shown in Figure 33. Although the
NSS scores did slightly worse with the higher values, the difference
is small enough that the significance of the results is not affected.
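The pruning itself reduces to a one-line filter on the search frontier; a sketch with hypothetical node objects carrying a fitness attribute:

    def prune(frontier, best_fitness, theta_fitness=0.0):
        # Keep only branches whose fitness is within theta_fitness of the
        # best A* path found so far; with theta_fitness = 0, a branch is
        # cut as soon as it falls below the best solution.
        return [n for n in frontier
                if n.fitness >= best_fitness - theta_fitness]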
b.2 auc results
In Table 7 we show the model scores for the same models shown in
Figure 31, except that we evaluate the models using the AUC met-
ric instead. Just as before, there was a significant effect of adding
the GR mask to the BU mask to form the BuGr model (F(279648) =
-269.6313, p < 1×10^{-100}). Adding GR to the TD mask was also sig-
nificant (F(279648) = -6.8460, p = 7.60×10^{-12}) as well as adding
GR to the BuTd model (F(279648) = -6.7976, p = 1.07×10^{-11}).
Figure 33: NSS scores for the GR and full combined models for different
values of \theta_{fitness}. The difference in the scores is very small, so using a low
value does not affect our results.
Although the differences between the model scores are not as large,
the significance of the results is unaffected.
Table 7: AUC scores for all combinations of the 3 model components: bottom-
up, top-down learned, and goal relevance. In all cases, a combination that
includes goal relevance performs significantly better than the corresponding
combination without goal relevance.
Components AUC Score
BU 0.5282
TD 0.9088
GR 0.7973
BuTd 0.9088
BuGr 0.7982
TdGr 0.9137
All 0.9137
B I B L I O G R A P H Y
[1] 2012 Mario AI Championship. http://www.marioai.org/. Ac-
cessed: 2017-03-22.
[2] Larry F Abbott and Sacha B Nelson. “Synaptic plasticity: tam-
ing the beast.” In: Nature Neuroscience 3 (2000), pp. 1178–1183.
[3] Nancy M Amato and Yan Wu. “A randomized roadmap method
for path and manipulation planning.” In: Robotics and Automa-
tion, 1996. Proceedings., 1996 IEEE International Conference on.
Vol. 1. IEEE. 1996, pp. 113–120.
[4] Farhan Baluch and Laurent Itti. “Mechanisms of top-down at-
tention.” In: Trends in Neurosciences 34.4 (2011), pp. 210–224.
[5] Moshe Bar, Karim S Kassam, Avniel Singh Ghuman, Jasmine
Boshyan, Annette M Schmid, Anders M Dale, Matti S Hämäläi-
nen, Ksenija Marinkovic, Daniel L Schacter, Bruce R Rosen, et
al. “Top-down facilitation of visual recognition.” In: Proceed-
ings of the National Academy of Sciences of the United States of
America 103.2 (2006), pp. 449–454.
[6] Pierre Bayerl and Heiko Neumann. “Disambiguating visual
motion through contextual feedback modulation.” In: Neural
Computation 16.10 (2004), pp. 2041–2066.
[7] James A Bednar. “Building a mechanistic model of the devel-
opment and function of the primary visual cortex.” In: Journal
of Physiology-Paris 106.5 (2012), pp. 194–211.
[8] James A Bednar. “Topographica: building and analyzing map-
level simulations from Python, C/C++, MATLAB, NEST, or
NEURON components.” In: Python in Neuroscience (2015), p. 104.
[9] James A Bednar and Risto Miikkulainen. “Joint maps for orien-
tation, eye, and direction preference in a self-organizing model
of V1.” In: Neurocomputing 69.10 (2006), pp. 1272–1276.
[10] Frederik Beuth and Fred H Hamker. “A mechanistic cortical
microcircuit of attention for amplification, normalization and
suppression.” In: Vision Research 116 (2015), pp. 241–257.
[11] Elie L Bienenstock, Leon N Cooper, and Paul W Munro. “The-
ory for the development of neuron selectivity: orientation speci-
ficity and binocular interaction in visual cortex.” In: The Jour-
nal of Neuroscience 2.1 (1982), pp. 32–48.
[12] Ali Borji and Laurent Itti. “State-of-the-art in visual attention
modeling.” In: IEEE transactions on pattern analysis and machine
intelligence 35.1 (2013), pp. 185–207.
[13] Ali Borji, Dicky N Sihite, and Laurent Itti. “What/where to
look next? Modeling top-down visual attention in complex in-
teractive environments.” In: IEEE Transactions on Systems, Man,
and Cybernetics: Systems 44.5 (2014), pp. 523–538.
[14] Robert Bridson. “Fast Poisson disk sampling in arbitrary di-
mensions.” In: SIGGRAPH Sketches. 2007, p. 22.
[15] Tobias Brosch and Heiko Neumann. “Computing with a canon-
ical neural circuits model with pool normalization and modu-
lating feedback.” In: Neural Computation (2014).
[16] Tobias Brosch and Heiko Neumann. “Interaction of feedfor-
ward and feedback streams in visual cortex in a firing-rate
model of columnar computations.” In: Neural Networks 54 (2014),
pp. 11–16.
[17] Timothy J Buschman and Sabine Kastner. “From Behavior to
Neural Dynamics: An Integrated Theory of Attention.” In: Neu-
ron 88.1 (2015), pp. 127–144.
[18] Guy Thomas Buswell. “How people look at pictures: a study
of the psychology and perception in art.” In: (1935).
[19] Zoya Bylinskii, Tilke Judd, Aude Oliva, Antonio Torralba, and
Frédo Durand. “What do different evaluation metrics tell us
about saliency models?” In: arXiv preprint arXiv:1604.03605 (2016).
[20] Antonio Cabrales, Olivier Gossner, and Roberto Serrano. “En-
tropy and the value of information for investors.” In: The Amer-
ican economic review 103.1 (2013), pp. 360–377.
[21] Edward M Callaway. “Feedforward, feedback and inhibitory
connections in primate visual cortex.” In: Neural Networks 17.5
(2004), pp. 625–632.
[22] Matteo Carandini and David J Heeger. “Normalization as a
canonical neural computation.” In: Nature Reviews Neuroscience
13.1 (2012), pp. 51–62.
[23] Jaime Carbonell and Jade Goldstein. “The use of MMR, diversity-
based reranking for reordering documents and producing sum-
maries.” In: Proceedings of the 21st annual international ACM
SIGIR conference on Research and development in information re-
trieval. ACM. 1998, pp. 335–336.
[24] Monica S Castelhano, Michael L Mack, and John M Hender-
son. “Viewing task influences eye movement control during
active scene perception.” In: Journal of Vision 9.3 (2009), pp. 6–
6.
[25] Barbara Chapman, Michael P Stryker, and Tobias Bonhoef-
fer. “Development of orientation preference maps in ferret pri-
mary visual cortex.” In: The Journal of Neuroscience 16.20 (1996),
pp. 6443–6453.
[26] Alexander L Churchill, Emmanouel G Liodakis, and H Ye Simon. “Twitter relevance filtering via joint Bayes classifiers from user clustering.” In: Journal of University of Stanford (2010).
[27] Claudia Clopath, Lars Büsing, Eleni Vasilaki, and Wulfram Gerstner. “Connectivity reflects coding: a model of voltage-based STDP with homeostasis.” In: Nature Neuroscience 13.3 (2010), pp. 344–352.
[28] Maurizio Corbetta and Gordon L Shulman. “Control of goal-directed and stimulus-driven attention in the brain.” In: Nature Reviews Neuroscience 3.3 (2002), pp. 201–215.
[29] Ingemar J Cox, Matthew L Miller, Thomas P Minka, and Peter N Yianilos. “An optimized interaction strategy for Bayesian relevance feedback.” In: Computer Vision and Pattern Recognition, 1998. Proceedings. 1998 IEEE Computer Society Conference on. IEEE. 1998, pp. 553–558.
[30] Edward Craft, Hartmut Schütze, Ernst Niebur, and Rüdiger von der Heydt. “A neural model of figure–ground organization.” In: Journal of Neurophysiology 97.6 (2007), pp. 4310–4326.
[31] Javier Cudeiro and Adam M Sillito. “Looking back: corticothalamic feedback and early visual processing.” In: Trends in Neurosciences 29.6 (2006), pp. 298–306.
[32] Caroline M Eastman and Bernard J Jansen. “Coverage, relevance, and ranking: The impact of query operators on Web search engine results.” In: ACM Transactions on Information Systems (TOIS) 21.4 (2003), pp. 383–411.
[33] Florian Engert and Tobias Bonhoeffer. “Synapse specificity of long-term potentiation breaks down at short distances.” In: Nature 388.6639 (1997), pp. 279–284.
[34] A Aldo Faisal, Luc PJ Selen, and Daniel M Wolpert. “Noise in the nervous system.” In: Nature Reviews Neuroscience 9.4 (2008), pp. 292–303.
[35] Wei Fan, Haixun Wang, Philip S Yu, and Sheng Ma. “Is random model better? On its accuracy and efficiency.” In: Data Mining, 2003. ICDM 2003. Third IEEE International Conference on. IEEE. 2003, pp. 51–58.
[36] Jillian H Fecteau and Douglas P Munoz. “Salience, relevance, and firing: a priority map for target selection.” In: Trends in Cognitive Sciences 10.8 (2006), pp. 382–390.
[37] D Ferster. “Origin of orientation-selective EPSPs in simple cells of cat visual cortex.” In: The Journal of Neuroscience 7.6 (1987), pp. 1780–1791.
[38] David J Field. “Relations between the statistics of natural images and the response properties of cortical cells.” In: JOSA A 4.12 (1987), pp. 2379–2394.
[39] Elodie Fino, Vincent Paille, Yihui Cui, Teresa Morera-Herreras, Jean-Michel Deniau, and Laurent Venance. “Distinct coincidence detectors govern the corticostriatal spike timing-dependent plasticity.” In: The Journal of Physiology 588.16 (2010), pp. 3045–3062.
[40] KJ Friston and C Büchel. “Attentional modulation of effective connectivity from V2 to V5/MT in humans.” In: Proceedings of the National Academy of Sciences 97.13 (2000), pp. 7591–7596.
[41] Robert C Froemke. “Plasticity of cortical excitatory-inhibitory balance.” In: Annual Review of Neuroscience 38 (2015), p. 195.
[42] Robert C Froemke, Ioana Carcea, Alison J Barker, Kexin Yuan, Bryan A Seybold, Ana Raquel O Martins, Natalya Zaika, Hannah Bernstein, Megan Wachs, Philip A Levis, et al. “Long-term modification of cortical synapses improves sensory perception.” In: Nature Neuroscience 16.1 (2013), pp. 79–88.
[43] Kunihiko Fukushima. “Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position.” In: Biological Cybernetics 36.4 (1980), pp. 193–202.
[44] Srinagesh Gavirneni, Roman Kapuscinski, and Sridhar Tayur. “Value of information in capacitated supply chains.” In: Management Science 45.1 (1999), pp. 16–24.
[45] Jean Dickinson Gibbons and Subhabrata Chakraborti. “Nonparametric statistical inference.” In: International Encyclopedia of Statistical Science. Springer, 2011, pp. 977–979.
[46] Gerd Gigerenzer. “Why heuristics work.” In: Perspectives on Psychological Science 3.1 (2008), pp. 20–29.
[47] Joshua I Gold and Michael N Shadlen. “The neural basis of decision making.” In: Annual Review of Neuroscience 30 (2007), pp. 535–574.
[48] Jacqueline Gottlieb and Puiu Balan. “Attention as a decision in information space.” In: Trends in Cognitive Sciences 14.6 (2010), pp. 240–248.
[49] W Shane Grant, James Tanner, and Laurent Itti. Conflict Learning Source Code. ilab.usc.edu/conflictlearning/. Accessed: 2016-07-03. 2016.
[50] Stephen Grossberg. “Adaptive Resonance Theory: How a brain learns to consciously attend, learn, and recognize a changing world.” In: Neural Networks 37 (2013), pp. 1–47.
[51] Harry Halpin and Victor Lavrenko. “Relevance feedback between hypertext and semantic search.” In: Proc. Conference WWW 2009 (April 20–24, 2009, Madrid, Spain). 2009.
[52] Peter E Hart, Nils J Nilsson, and Bertram Raphael. “A formal basis for the heuristic determination of minimum cost paths.” In: IEEE Transactions on Systems Science and Cybernetics 4.2 (1968), pp. 100–107.
[53] Mary Hayhoe and Dana Ballard. “Eye movements in natural behavior.” In: Trends in Cognitive Sciences 9.4 (2005), pp. 188–194.
[54] Mary Hayhoe and Dana Ballard. “Modeling task control of eye movements.” In: Current Biology 24.13 (2014), R622–R628.
[55] Donald Olding Hebb. The organization of behavior: A neuropsychological approach. John Wiley & Sons, 1949.
[56] Rüdiger von der Heydt. “Figure–ground organization and the emergence of proto-objects in the visual cortex.” In: Frontiers in Psychology 6 (2015).
[57] Birger Hjørland. “The foundation of the concept of relevance.” In: Journal of the American Society for Information Science and Technology 61.2 (2010), pp. 217–237.
[58] Jean-Michel Hupe, Andrew C James, Pascal Girard, Stephen G Lomber, Bertram R Payne, and Jean Bullier. “Feedback connections act on the early part of the responses in monkey visual cortex.” In: Journal of Neurophysiology 85.1 (2001), pp. 134–145.
[59] Laurent Itti and Pierre Baldi. “Bayesian surprise attracts human attention.” In: Advances in Neural Information Processing Systems 18 (2006), p. 547.
[60] Laurent Itti and Christof Koch. “A saliency-based search mechanism for overt and covert shifts of visual attention.” In: Vision Research 40.10 (2000), pp. 1489–1506.
[61] Laurent Itti, Christof Koch, and Ernst Niebur. “A model of saliency-based visual attention for rapid scene analysis.” In: IEEE Transactions on Pattern Analysis and Machine Intelligence 20.11 (1998), pp. 1254–1259.
[62] Helen E Jones, Ian M Andolina, Bashir Ahmed, Stewart D Shipp, Jake TC Clements, Kenneth L Grieve, Javier Cudeiro, Thomas E Salt, and Adam M Sillito. “Differential feedback modulation of center and surround mechanisms in parvocellular cells in the visual thalamus.” In: The Journal of Neuroscience 32.45 (2012), pp. 15946–15951.
[63] Helen E Jones, Ian M Andolina, Stewart D Shipp, Daniel L Adams, Javier Cudeiro, Thomas E Salt, and Adam M Sillito. “Figure-ground modulation in awake primate thalamus.” In: Proceedings of the National Academy of Sciences 112.22 (2015), pp. 7085–7090.
[64] Matthias Kaschube, Michael Schnabel, Siegrid Löwel, David M Coppola, Leonard E White, and Fred Wolf. “Universality in the evolution of orientation columns in the visual cortex.” In: Science 330.6007 (2010), pp. 1113–1116.
[65] Wolfgang Keil, Matthias Kaschube, Michael Schnabel, Zoltan F Kisvarday, Siegrid Löwel, David M Coppola, Leonard E White, and Fred Wolf. “Response to Comment on ‘Universality in the Evolution of Orientation Columns in the Visual Cortex’.” In: Science 336.6080 (2012), pp. 413–413.
[66] Aysun Kocak, Kemal Cizmeciler, Aykut Erdem, and Erkut Erdem. “Top down saliency estimation via superpixel-based discriminative dictionaries.” In: BMVC. 2014.
[67] Christof Koch and Shimon Ullman. “Shifts in selective visual attention: towards the underlying neural circuitry.” In: Matters of Intelligence. Springer, 1987, pp. 115–141.
[68] Naoki Kogo and Raymond van Ee. “Neural mechanisms of figure-ground organization: Border-ownership, competition and perceptual switching.” In: Handbook of Perceptual Organization (2014).
[69] Teuvo Kohonen. “The self-organizing map.” In: Proceedings of the IEEE 78.9 (1990), pp. 1464–1480.
[70] Matthias Kümmerer, Thomas SA Wallis, and Matthias Bethge. “Information-theoretic model comparison unifies saliency metrics.” In: Proceedings of the National Academy of Sciences 112.52 (2015), pp. 16054–16059.
[71] Steven M LaValle. Rapidly-exploring random trees: A new tool for path planning. Tech. rep. TR 98-11. Computer Science Department, Iowa State University, 1998.
[72] Jeffrey C Lagarias, James A Reeds, Margaret H Wright, and Paul E Wright. “Convergence properties of the Nelder–Mead simplex method in low dimensions.” In: SIAM Journal on Optimization 9.1 (1998), pp. 112–147.
[73] Michael F Land and Mary Hayhoe. “In what ways do eye movements contribute to everyday activities?” In: Vision Research 41.25 (2001), pp. 3559–3565.
[74] Michael F Land, David N Lee, et al. “Where we look when we steer.” In: Nature 369.6483 (1994), pp. 742–744.
[75] Amy N Langville and Carl D Meyer. Google’s PageRank and beyond: The science of search engine rankings. Princeton University Press, 2011.
[76] Matthew Larkum. “A cellular mechanism for cortical associations: an organizing principle for the cerebral cortex.” In: Trends in Neurosciences 36.3 (2013), pp. 141–151.
[77] WB Levy and O Steward. “Temporal contiguity requirements for long-term associative potentiation/depression in the hippocampus.” In: Neuroscience 8.4 (1983), pp. 791–797.
[78] Ye Li, David Fitzpatrick, and Leonard E White. “The development of direction selectivity in ferret visual cortex requires early visual experience.” In: Nature Neuroscience 9.5 (2006), pp. 676–681.
[79] Sukbin Lim, Jillian L McKee, Luke Woloszyn, Yali Amit, David J Freedman, David L Sheinberg, and Nicolas Brunel. “Inferring learning rules from distributions of firing rates in cortical neurons.” In: Nature Neuroscience (2015).
[80] Stephen R Lindemann and Steven M LaValle. “Incrementally reducing dispersion by increasing Voronoi bias in RRTs.” In: Robotics and Automation, 2004. Proceedings. ICRA’04. 2004 IEEE International Conference on. Vol. 4. IEEE. 2004, pp. 3251–3257.
[81] David J Linden and John A Connor. “Long-term synaptic depression.” In: Annual Review of Neuroscience 18.1 (1995), pp. 319–357.
[82] Guy Major, Matthew E Larkum, and Jackie Schiller. “Active properties of neocortical pyramidal neuron dendrites.” In: Annual Review of Neuroscience 36 (2013), pp. 1–24.
[83] George L Malcolm and John M Henderson. “Combining top-down processes to guide eye movements during real-world scene search.” In: Journal of Vision 10.2 (2010), pp. 4–4.
[84] Robert C Malenka and Mark F Bear. “LTP and LTD: an embarrassment of riches.” In: Neuron 44.1 (2004), pp. 5–21.
[85] Nikola T Markov, Julien Vezoli, Pascal Chameau, Arnaud Falchier, René Quilodran, Cyril Huissoud, Camille Lamy, Pierre Misery, Pascale Giroud, Shimon Ullman, et al. “Anatomy of hierarchy: feedforward and feedback pathways in macaque visual cortex.” In: Journal of Comparative Neurology 522.1 (2014), pp. 225–259.
[86] Henry Markram, Maria Toledo-Rodriguez, Yun Wang, Anirudh Gupta, Gilad Silberberg, and Caizhi Wu. “Interneurons of the neocortical inhibitory system.” In: Nature Reviews Neuroscience 5.10 (2004), pp. 793–807.
[87] Melvin Earl Maron and John L Kuhns. “On relevance, probabilistic indexing and information retrieval.” In: Journal of the ACM (JACM) 7.3 (1960), pp. 216–244.
[88] Anne B Martin and Rüdiger von der Heydt. “Spike synchrony reveals emergence of proto-objects in visual cortex.” In: The Journal of Neuroscience 35.17 (2015), pp. 6860–6870.
[89] Timothée Masquelier. “Relative spike time coding and STDP-based orientation selectivity in the early visual system in natural continuous and saccadic vision: a computational model.” In: Journal of Computational Neuroscience 32.3 (2012), pp. 425–441.
[90] Thomas Miconi and Rufin VanRullen. “A Feedback Model of Attention Explains the Diverse Effects of Attention on Neural Firing Rates and Receptive Field Structure.” In: PLoS Computational Biology 12.2 (2016), e1004770.
[91] Stefan Mihalas, Yi Dong, Rüdiger von der Heydt, and Ernst Niebur. “Mechanisms of perceptual organization provide auto-zoom and auto-localization for attention to objects.” In: Proceedings of the National Academy of Sciences 108.18 (2011), pp. 7583–7588.
[92] Risto Miikkulainen, James A Bednar, Yoonsuck Choe, and Joseph Sirosh. Computational maps in the visual cortex. Springer Science & Business Media, 2006.
[93] Jiri Najemnik and Wilson S Geisler. “Optimal eye movement strategies in visual search.” In: Nature 434.7031 (2005), pp. 387–391.
[94] Vidhya Navalpakkam and Laurent Itti. “Modeling the influence of task on attention.” In: Vision Research 45.2 (2005), pp. 205–231.
[95] Jonathan D Nelson, Craig RM McKenzie, Garrison W Cottrell, and Terrence J Sejnowski. “Experience matters: information acquisition optimizes probability gain.” In: Psychological Science 21.7 (2010), pp. 960–969.
[96] Randall C O’Reilly, Y Munakata, MJ Frank, TE Hazy, et al. Computational cognitive neuroscience. PediaPress, 2012.
[97] Erkki Oja. “Simplified neuron model as a principal component analyzer.” In: Journal of Mathematical Biology 15.3 (1982), pp. 267–273.
[98] Bruno A Olshausen and David J Field. “Sparse coding with an overcomplete basis set: A strategy employed by V1?” In: Vision Research 37.23 (1997), pp. 3311–3325.
[99] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. Tech. rep. Stanford InfoLab, 1999.
[100] Robert J Peters and Laurent Itti. “Beyond bottom-up: Incorporating task-dependent influences into a computational model of spatial attention.” In: Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on. IEEE. 2007, pp. 1–8.
[101] John D Pettigrew and Masakazu Konishi. “Neurons selective for orientation and binocular disparity in the visual Wulst of the barn owl (Tyto alba).” In: Science 193.4254 (1976), pp. 675–678.
[102] Fangtu T Qiu, Tadashi Sugihara, and Rüdiger von der Heydt. “Figure-ground mechanisms provide structure for selective attention.” In: Nature Neuroscience 10.11 (2007), pp. 1492–1499.
[103] Laura Walker Renninger, James M Coughlan, Preeti Verghese, and Jitendra Malik. “An information maximization model of eye movements.” In: NIPS. 2004, pp. 1121–1128.
[104] Stephen E Robertson and K Sparck Jones. “Relevance weighting of search terms.” In: Journal of the American Society for Information Science 27.3 (1976), pp. 129–146.
[105] Pieter R Roelfsema, Victor AF Lamme, Henk Spekreijse, and Holger Bosch. “Figure–ground segregation in a recurrent network architecture.” In: Journal of Cognitive Neuroscience 14.4 (2002), pp. 525–537.
[106] Constantin A Rothkopf, Dana H Ballard, and Mary M Hayhoe. “Task and context determine where you look.” In: Journal of Vision 7.14 (2007), pp. 16–16.
[107] Ko Sakai and Haruka Nishimura. “Surrounding suppression and facilitation in the determination of border ownership.” In: Journal of Cognitive Neuroscience 18.4 (2006), pp. 562–579.
[108] Wesley C Salmon. Statistical explanation and statistical relevance. Vol. 69. University of Pittsburgh Press, 1971.
[109] Terence D Sanger. “Optimal unsupervised learning in a single-layer linear feedforward neural network.” In: Neural Networks 2.6 (1989), pp. 459–473.
[110] Thomas Serre, Lior Wolf, and Tomaso Poggio. “Object recognition with features inspired by visual cortex.” In: Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on. Vol. 2. IEEE. 2005, pp. 994–1000.
[111] Claude Elwood Shannon. “A mathematical theory of communication.” In: The Bell System Technical Journal 27 (1948), pp. 379–423, 623–656.
[112] Harel Z Shouval, Samuel SH Wang, and Gayle M Wittenberg. “Spike timing dependent plasticity: a consequence of more fundamental learning rules.” In: Spike-timing Dependent Plasticity (2010), p. 60.
[113] Joseph Sirosh and Risto Miikkulainen. “Cooperative self-organization of afferent and lateral connections in cortical maps.” In: Biological Cybernetics 71.1 (1994), pp. 65–78.
[114] Sen Song, Kenneth D Miller, and Larry F Abbott. “Competitive Hebbian learning through spike-timing-dependent synaptic plasticity.” In: Nature Neuroscience 3.9 (2000), pp. 919–926.
[115] Dan Sperber and Deirdre Wilson. Relevance: Communication and cognition. Vol. 142. Citeseer, 1986.
[116] Nathan Sprague and Dana H Ballard. “Eye Movements for Reward Maximization.” In: NIPS. Vol. 16. 2003, p. 2.
[117] Jean-Luc R Stevens, Judith S Law, Ján Antolík, and James A Bednar. “Mechanisms for stable, robust, and adaptive development of orientation maps in the primary visual cortex.” In: The Journal of Neuroscience 33.40 (2013), pp. 15747–15766.
[118] Brian T Sullivan, Leif Johnson, Constantin A Rothkopf, Dana Ballard, and Mary Hayhoe. “The role of uncertainty and reward on eye movements in a virtual driving task.” In: Journal of Vision 12.13 (2012), pp. 19–19.
[119] Hans Supèr, August Romeo, and Matthias Keil. “Feed-forward segmentation of figure-ground and assignment of border-ownership.” In: PLoS One 5.5 (2010), e10705.
[120] Nicholas V Swindale and H Bauer. “Application of Kohonen’s self-organizing feature map algorithm to cortical maps of orientation and direction preference.” In: Proceedings of the Royal Society of London B: Biological Sciences 265.1398 (1998), pp. 827–838.
[121] Michael Taylor, Hugo Zaragoza, Nick Craswell, Stephen Robertson, and Chris Burges. “Optimisation methods for ranking functions with multiple parameters.” In: Proceedings of the 15th ACM international conference on Information and knowledge management. ACM. 2006, pp. 585–593.
[122] Gina G Turrigiano and Sacha B Nelson. “Homeostatic plasticity in the developing nervous system.” In: Nature Reviews Neuroscience 5.2 (2004), pp. 97–107.
[123] Carmen Varela. “Thalamic neuromodulation and its implications for executive networks.” In: Frontiers in Neural Circuits 8 (2014), p. 69.
[124] Tim P Vogels, Robert C Froemke, Nicolas Doyon, Matthieu Gilson, Julie S Haas, Robert Liu, Arianna Maffei, Paul Miller, CJ Wierenga, Melanie A Woodin, et al. “Inhibitory synaptic plasticity: spike timing-dependence and putative network function.” In: Frontiers in Neural Circuits 7 (2013).
[125] Nobuhiko Wagatsuma, Rüdiger von der Heydt, and Ernst Niebur. “Spike Synchrony Generated by Modulatory Common Input through NMDA-type Synapses.” In: Journal of Neurophysiology (2016).
[126] Lang Wang and Arianna Maffei. “Inhibitory plasticity dictates the sign of plasticity at excitatory synapses.” In: The Journal of Neuroscience 34.4 (2014), pp. 1083–1093.
[127] Oliver G Wenisch, Joachim Noll, and J Leo van Hemmen. “Spontaneously emerging direction selectivity maps in visual cortex through STDP.” In: Biological Cybernetics 93.4 (2005), pp. 239–247.
[128] John Widloski and Ila R Fiete. “A model of grid cell development through spatial exploration and spike time-dependent plasticity.” In: Neuron 83.2 (2014), pp. 481–495.
[129] Torsten N Wiesel, David H Hubel, et al. “Single-cell responses in striate cortex of kittens deprived of vision in one eye.” In: Journal of Neurophysiology 26.6 (1963), pp. 1003–1017.
[130] Deirdre Wilson and Dan Sperber. “Truthfulness and relevance.” In: Mind 111.443 (2002), pp. 583–632.
[131] Jimei Yang and Ming-Hsuan Yang. “Top-down visual saliency via joint CRF and dictionary learning.” In: IEEE Transactions on Pattern Analysis and Machine Intelligence 39.3 (2017), pp. 576–588.
[132] Steven Yantis. “The neural basis of selective attention: cortical sources and targets of attentional modulation.” In: Current Directions in Psychological Science 17.2 (2008), pp. 86–90.
[133] Alfred L Yarbus. Eye movements during perception of complex objects. Springer, 1967.
[134] Friedemann Zenke, Everton J Agnes, and Wulfram Gerstner. “Diverse synaptic plasticity mechanisms orchestrated to form and retrieve memories in spiking neural networks.” In: Nature Communications 6 (2015).
[135] Li Zhaoping. “Border ownership from intracortical interactions in visual area V2.” In: Neuron 47.1 (2005), pp. 143–153.
[136] Hong Zhou, Howard S Friedman, and Rüdiger von der Heydt. “Coding of border ownership in monkey visual cortex.” In: The Journal of Neuroscience 20.17 (2000), pp. 6594–6611.
[137] Robert S Zucker and Wade G Regehr. “Short-term synaptic plasticity.” In: Annual Review of Physiology 64.1 (2002), pp. 355–405.