BIOLOGICALLY INSPIRED MOBILE ROBOT VISION LOCALIZATION
by
Christian Siagian
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
August 2009
Copyright 2009 Christian Siagian
Acknowledgements
First of all, I would like to thank Dr. Laurent Itti for the opportunity to contribute
in his lab and for his continued support of my research. Every time I need help, I can
always knock on his door.
I am also very appreciative of my colleagues at iLab throughout the years. With such
diverse talents, I learned so much from all our daily interactions. I would like to thank Rob
Peters, whose debugging skills single-handedly sped up my work by at least 3 months. I
also would like to thank Kai Chang without whom the Beobot 2.0 project would not be
possible. The same goes with Lior Elazary and Randolph Voorhies. I will never forget
the time when Lior opened up his home and helped us machine the two hubs for our
robot. I want to especially thank Randolph for all his help with Altium, CS445 class,
and the summer robotic programs. Finally I would like to thank Manu Viswanathan for
his help in reading many of my manuscripts as well as our gist research collaboration.
In addition, I also would like to thank Dr. Irving Biederman, Dr. Ramakant Nevatia,
Dr. Gaurav Sukhatme, and Dr. Bosco Tjan for their time in serving on my qualifying
exam and thesis defense committee.
I dedicate my dissertation to my mom and dad, Joseph, Vini, Noel and my friends,
for their encouragement and loving support. For my mom and dad, who love me unconditionally;
your prayers kept me going. My brother Joseph, for whom I have enormous
respect. My sister Lavinia, who is so strong and yet so caring. My creative brother
Noel. Keep working hard, don’t give up. Thank you to all my friends who made my
years at USC so wonderful. I am especially grateful for Pastor David Hartono and his
wife Grace; they are always ready to lend a helping hand.
Last but not least, for my Lord and Savior Jesus Christ, the reason that this little
heart is filled with unceasing love, strength, and courage to fight each and every day.
Thank you for saving my soul and allowing me to be part of your plan.
Table of Contents
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1 Introduction
1.1 Vision Localization
1.1.1 Taxonomy of Visual Features
1.1.2 Global Feature Vision Localization Systems
1.1.2.1 Segmentation-Based Vision Localization Systems
1.1.3 Local Feature Vision Localization Systems
1.1.4 Multiple Feature Vision Localization Systems
1.2 Biological Approach to Vision
1.2.1 Anatomical Decomposition of the Human Vision
1.2.2 Functional Decomposition of the Human Vision
1.3 Topological Maps
1.4 Document Organization
Chapter 2 Design And Implementation
2.1 System Overview
2.1.1 Feature extraction: Gist and Salient Regions
2.1.2 Segment and Salient Region Recognition
2.1.3 Localization
2.1.4 Parallel Implementation
2.2 Visual Cortex Feature Extraction, Gist, and Saliency Model
2.2.1 Visual Cortex Feature Extraction
2.2.2 Gist Model
2.2.2.1 Gist Feature Extraction
2.2.2.2 Color Constancy
2.2.2.3 PCA/ICA Dimension Reduction
2.2.2.4 Segment Estimation
2.3 Salient Regions as Localization Cues
2.3.1 Salient Region Selection and Segmentation
2.3.2 Salient Region Recognition
2.4 Storing and Recalling Environment Information
2.4.1 Landmark Database Construction
2.4.1.1 Building a Database Within an Episode
2.4.1.2 Building a Database Across Episodes
2.4.2 Landmark Database Search Prioritization
2.5 Salient Region Tracking
2.6 Monte-Carlo Localization
2.6.1 Motion Model
2.6.2 Segment-Estimation Observation Model
2.6.3 Salient-Region-Recognition Observation Model
Chapter 3 Testing And Results
3.1 Testing Setup
3.1.1 Site 1: Ahmanson Center for Biological Research (ACB)
3.1.2 Site 2: Associates and Founders Park (AnF)
3.1.3 Site 3: Frederick D. Fagg park (FDF)
3.2 Gist Model Testing Results
3.2.1 Experiment 1: Ahmanson Center for Biological Research (ACB)
3.2.2 Experiment 2: Associates and Founders Park (AnF)
3.2.3 Experiment 3: Frederick D. Fagg park (FDF)
3.2.4 Experiment 4: Combined sites
3.3 Localization Testing Results
3.3.1 Experiment 1: Ahmanson Center for Biological Research (ACB)
3.3.2 Experiment 2: Associates and Founders Park (AnF)
3.3.3 Experiment 3: Frederick D. Fagg park (FDF)
3.3.4 Experiment 4: Sub-module Analysis
3.4 Landmark Database Prioritization Testing Results
3.4.1 Search Prioritization
3.4.2 Landmark Database Search Early Exit Strategy
3.5 Salient Region Tracking Testing
Chapter 4 Discussions And Conclusions
4.1 Implementation and Usage of Gist and Saliency Model
4.2 Hierarchical Landmark Database and Multi-level Localization
4.2.1 Hierarchical Landmark Database
4.2.2 Multi-level Localization
4.3 Environment Invariance
Chapter 5 Future Works
5.1 Porting to a Robot
5.2 Gist Model
References
List of Tables
3.1 Ahmanson Center for Biology Segment Classification Experimental Results
3.2 Ahmanson Center for Biology Segment Classification Confusion Matrix
3.3 Associate and Founders Park Segment Classification Experimental Results
3.4 Associate and Founders Park Segment Classification Confusion Matrix
3.5 Frederick D. Fagg Park Segment Classification Experimental Results
3.6 Frederick D. Fagg Park Segment Classification Confusion Matrix
3.7 Combined Sites Segment Classification Experimental Results
3.8 Combined Sites Segment Classification Site-Level Confusion Matrix
3.9 Ahmanson Center for Biology Experimental Results
3.10 Associate and Founders Park Experimental Results
3.11 Frederick D. Fagg Park Experimental Results
3.12 Ahmanson Center for Biology Model Comparison Experimental Results
3.13 Associate and Founders Park Model Comparison Experimental Results
3.14 Frederick D. Fagg Park Model Comparison Experimental Results
3.15 Ahmanson Center for Biology Experiment 1 Results
3.16 Associate and Founders Park Experiment 1 Results
3.17 Frederick D. Fagg Park Experiment 1 Results
3.18 Ahmanson Center for Biology Early-Exit Experiment Results
3.19 Associate and Founders Park Early-Exit Experiment Results
3.20 Frederick D. Fagg Park Early-Exit Experiment Results
3.21 Ahmanson Center for Biology Tracking Experiment Results
3.22 Associate and Founders Park Tracking Experiment Results
3.23 Frederick D. Fagg Park Tracking Experiment Results
List of Figures
1.1 General diagram of a vision localization system.
1.2 Anatomical Decomposition of Human Vision.
1.3 A model of the Functional Decomposition of Image Understanding in the Human Brain.
2.1 Diagram for the presented mobile-robot vision system.
2.2 Side by side comparison of the Gist and Saliency model.
2.3 Gist Decomposition of Vertical Orientation Sub-channel.
2.4 Example of two lighting conditions of the same scene.
2.5 Example of gist operation applied to an image.
2.6 Salient region extraction.
2.7 Process of obtaining multiple salient regions from a frame.
2.8 Matching process of two salient regions using SIFT keypoints and salient feature vector.
2.9 The landmark database building procedure.
2.10 Landmark Database Building for a Single Run.
2.11 An example of obtaining stored salient regions from a series of 8 frames.
2.12 Multi-run Landmark Database Integration.
2.13 Hierarchical Landmark Database.
2.14 Diagram of salient region tracking.
2.15 A snapshot of the system test-run.
3.1 Map of the three experiment sites at the USC campus.
3.2 Examples of images in each segment of the Ahmanson Center for Biological Research (ACB).
3.3 Map of the path segments of the ACB site.
3.4 Lighting conditions used for testing at the ACB site.
3.5 Examples of images in each segment of Associate and Founders park (AnF).
3.6 Map of the path segments of the AnF site.
3.7 Lighting conditions used for testing at the AnF site.
3.8 Examples of images, one from each segment of Frederick D. Fagg park (FDF).
3.9 Map of the path segments of Frederick D. Fagg (FDF) park site.
3.10 Lighting conditions used for testing at the FDF site.
Abstract
The problem of localization is central to endowing mobile machines with intelligence.
Vision is a promising research path because of its versatility and robustness in most
unconstrained environments, both indoors and outdoors. Today, with many available
studies in human vision, there is a unique opportunity to develop systems that take
inspiration from neuroscience. In this work we examine several important issues on how
the human brain deals with vision in general, and localization in particular.
For one, the human visual system extracts a plethora of features from different do-
mains (for example: colors, orientations, intensity). Each of them brings a different
perspective in scene understanding and allows humans to localize in many types of
environment. Furthermore, the human brain also introduces multiple scene abstractions that
complement each other. Here, we focus on the gist model, which rapidly summarizes a
scene (general semantic classifications, spatial layout, etc.), and the saliency model, which
guides visual attention to specific conspicuous regions within the field of view.
One hallmark biological characteristic that we rely upon is the utilization of a coarse-
to-fine paradigm. There are two parts of the system where this is clearly evident. One is
in the multi-level localization module, where the system tries to interchangeably localize
both to a general vicinity, and to a more accurate coordinate location. The second is
in the process of recalling stored environment information through a form of guided
(hierarchical) search using various contextual knowledge, which we believe is a key to its
scalability.
In order to fairly assess our contributions, we test the system in three large scale
outdoor environments - a building complex (126x180ft. area, 13966 testing images), a
vegetation-filled park (270x360ft. area, 26397 testing images), and an open-field area
(450x585ft. area, 34711 testing images) - each with its own challenges. We not only test
its accuracy in terms of coordinate location, but also pay close attention to its efficiency
in frame rate.
In the end, we describe the future directions of our research, such as how to go about
inserting the localization module into a fully autonomous mobile robot system.
Chapter 1
Introduction
Building the next generation mobile robots hinges on solving tasks such as localization,
mapping, and navigation. These tasks critically depend on developing capabilities that
robustly answer the central question: Where are we? Early attempts at localization
used dead reckoning [7, 12, 14], memorizing every robot movement in order to
backtrack to particular locations. However, dead reckoning is not always reliable due to
issues such as odometry drifting or kidnapped robot instances (when the robot is moved
to a new location without telling it). Because of these difficulties, subsequent efforts
incorporate sensors to perceive the characteristics of the current location and compare
them with stored information about previously known locations in the environment to
arrive at the most likely match.
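To make the dead-reckoning idea concrete, the sketch below (our own illustration, not code from the cited systems [7, 12, 14]) integrates hypothetical wheel-odometry readings into a planar pose estimate; because every reading is noisy, the accumulated error grows without bound, which is exactly the drift problem noted above.

    import math

    def integrate_odometry(pose, distance, dtheta):
        """Advance an (x, y, theta) pose by one noisy odometry reading."""
        x, y, theta = pose
        theta += dtheta                      # heading change from the encoders
        x += distance * math.cos(theta)      # forward travel projected onto x
        y += distance * math.sin(theta)      # and onto y
        return (x, y, theta)

    # Hypothetical encoder log: (forward distance in m, heading change in rad).
    readings = [(0.10, 0.00), (0.10, 0.05), (0.10, 0.05), (0.10, 0.00)]
    pose = (0.0, 0.0, 0.0)
    for d, dt in readings:
        pose = integrate_odometry(pose, d, dt)
    # Without external corrections, any error in the readings stays in 'pose'.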
A significant number of mobile robotic approaches address this problem by utilizing
sonar, laser, or other range sensors [23, 43, 90, 47]. They are particularly effective indoors
due to the many structural regularities, including flat walls and narrow corridors. In
the outdoors, however, these sensors become less robust because the structure of the
environment can vary tremendously. It then becomes hard to predict the sensor input
given all the protrusions and surface irregularities [45]. For example, a slight change in
pose can result in large jumps in range reading because of tree trunks, moving branches,
and leaves.
The Global Positioning System (GPS), which provides a sparser estimate (in the range of
1-10Hz for faster models) of the robot’s absolute location, is also used in various systems.
Coupled with other sensors or by itself [16, 3, 102], GPS has also been used extensively
in the robotics field. However, GPS may not be applicable in environments where there
is no satellite visibility, such as underwater, in caves, indoors, or on Mars.
These difficulties with traditional robot sensors have prompted research towards the
primary sensory modality of humans: vision. In most of the places mentioned above,
vision is a viable alternative as long as there is a reasonable level of visibility (although,
even then, night vision cameras can also be used [86, 107]). In addition, the advantage
of using vision is that we can extend it to other tasks, not just localization.
There are, however, major hurdles to overcome such as lighting (especially in the
outdoors), view-point change, and occlusion, all of which carry over from object
recognition to vision localization. In addition, there is also a challenge specific to local-
ization: we have to identify which parts of the image portray entities that are native to
the environment, and which are not. This is critical because of the existence of dynamic
distractions such as people walking, which can mislead the system if they are used as
cues.
In the following section 1.1, we review the available literature on building vision-
based localization systems. Here we focus on selecting the right visual features that
are distinct, and easy to extract and recognize in a cluttered scene. We then look at
the available research in human vision (section 1.2), which is used as inspiration for
the design and implementation of our presented vision-based localization system. In
addition, we also have a section (1.3) that describes the current research in biological
localization pertaining to topological maps, which are analogous to how humans deal
with spatial information. We then conclude with the organization of our work at the
end of the chapter, in section 1.4.
1.1 Vision Localization
In the past few decades [18], vision localization has been an active research branch with
diverse existing approaches. Generally speaking, we find that there are four modules
(illustrated in figure 1.1) that these systems need to consider. They are: image
acquisition and pre-processing, matching and pose estimation, landmark database, and
localization.
In the first module, “image acquisition and preprocessing”, the system obtains the
necessary discriminating visual cues from a raw image. Before we discuss these cues
or features in depth, it bears mentioning that the camera view-type also affects
what features we can use. The two that are used more often are the regular-ground
view (putting the camera on top of a robot and looking straight ahead) and the omni-
directional view (the camera is pointed upward toward a precisely set up prism that
looks out to the environment).
Figure 1.1: General diagram of a vision localization system. A system starts by extracting visual features from an input image
in the “image acquisition and preprocessing” module. These features are then compared with reference features in the “matching
and pose estimation” module on an individual comparison basis, with the “landmark database” being responsible for managing
the information about the environment and the overall efficiency of the matching process. If a match is found, we can then proceed
to localize the robot in the “localization” module. Some systems perform affine matching to produce an image-coordinate level
pose estimation, while others simply report a match and use the corresponding location tag (where the stored database image
was taken). The back-end localization module then incorporates the match into its hypothesis to come up with the best possible
robot localization estimation.
Many systems use the regular view [74, 94] because of its straightforward perspective.
It is easier to imagine what a system is trying to do because the images are not dis-
torted. Without distortion, the corners and lines that appear in the image become easily
discernable and recognizable. However, there are certain advantages of using an omni-
directional view. Unlike regular-view systems, the omni-directional systems [99, 10, 105]
have the luxury of processing visual stimuli from all 360 degrees. This way, a robot
may approach a location from any angle and still register the same image, given a
rotational correction. However, this is not entirely true as some parts of the image that
undergo distortion in the form of enlargement actually occlude their neighboring image
regions. In addition, because regular views are pointing forward, we can see as far as
the horizon. Omni-directional views, on the other hand, are limited in look-ahead range
by the dimensions and angle of the prism used.
The second module, “matching and pose estimation”, is the process of comparing
the extracted features with stored information, obtained during training. In order to
implement a successful vision localization system, just like other problems in computer
vision, everything hinges on achieving reliable matching, one that leads to satisfactory
pose estimation. Here, pose estimation refers to properly aligning current pose of the
robot’s camera view with respect to the matched database (reference) entry. For exam-
ple, robust techniques such as wide-baseline matching [108, 109] perform alignment to
the pixel coordinate accuracy. In order to do this, a system would need features that
contain pixel-level spatial information in its matching correspondences. To speed up
the process, however, many systems use a classifier as a matching module, but it only
provides positive/negative identification.
We separate the matching module from the third, “landmark database”, because
we are putting an emphasis on the importance of the database structure to make the
matching process as a whole more efficient, not just on an individual comparison basis.
For real-time systems such as robots, it is not enough to just have accurate recognition;
we also have to consider how much work is needed to find a match. This is especially
important when exploring large environments where we are bound to have large amounts
of reference data.
In the last, “localization” module, we now relate what we see to where we are in
the environment. When identifying one’s own location, some systems go as far as actual
metric location [89], while others go only to a coarse general vicinity such as a place or a
room number [99]. This localization resolution directly leads to the question of what
kind of maps a system uses. Furthermore, some systems also have to create their own
maps while navigating a never-before-seen environment. This problem is also known
as Simultaneous Localization and Mapping (SLAM) [30, 55, 67], which is an active
branch of robotics research with many challenges of its own. In section 1.3, we are going
to discuss the mapping-related sub-topic as part of a discussion about human spatial
representation. As for our current work, we are going to focus on the localization process
itself, given that the map format and content are provided beforehand.
In recent years, probabilistic localization has matured and standard techniques (which
largely utilize other sensors such as range sensors [23, 90, 47] and GPS [3, 102]) are now
well understood. However, the use of vision sensors (cameras) still has not been as ex-
tensively developed. In this work we focus our research on the second and third modules,
“matching and pose estimation” and “landmark database,” because we believe match-
ing is the most critical part of any vision-based system. That said, we acknowledge that
other parts (back-end localization, for one) also contribute significantly to the robustness
of a system.
From the vision perspective, in order to implement a reliable recognition system, one
has to start with the selection of robust features. Thus, we begin with the taxonomy of
visual features (section 1.1.1) before describing the different types of localization systems
that use them.
1.1.1 Taxonomy of Visual Features
In describing the taxonomy of visual features, we focus on the following aspects: charac-
teristics, scope, and locality of a feature. Characteristics are basic information about an
image that is being encoded by the feature value. For example: color, intensity, edge or
blob contrast value. Scope, on the other hand, describes how large a neighborhood the
value is encoding. This can be just a single pixel, or in the case of using a histogram, the
whole image. In addition, given the scope of a feature, there are different ways to encode
image characteristics within, such as averaging them [99] or building a local histogram
[46]. Finally, locality is the resolution of the spatial location of the feature. For affine
matching, coordinate location is essential, while for many recognition techniques [84, 42],
the knowledge of feature existence within an image is enough. An example of locality
in between these two extremes is a pre-defined grid with the coarser grid locations as
coordinates.
In the past, many research groups [83, 56, 104] have used the terms local and global
features to broadly describe the two opposing spectra of visual features types. Local
features are computed over a limited area of an image, while global features may pool
information (using operations such as averaging) over the entire image into, e.g., his-
tograms. However, just because the features are in the form of a histogram, it does not
mean that they are global features as we can have a histogram of local features. For
example, [68] uses k-means clustering on a set of local features (called textons) to find
a set number of groupings (denoted by centroids) in the feature space. Each centroid is
assigned an entry in the final histogram of feature counts. When an incoming testing
image arrives, each of its extracted local features is compared with all centroids to find
its respective closest centroid. That corresponding centroid then can increase its entry
count to indicate how many features in the image are closer to it than to any other cen-
troid. This histogram profile becomes the signature of the image. When comparing
images for similarity, the system assumes that the better the histogram overlap, the more
similar the images are. Here, although the final comparisons are between histogram en-
try counts, the process of producing the counts themselves comes from comparisons of
local features as the clustering method incorporates other features in the training data.
And so, the main criterion for deciding whether a feature is local or global comes from the
scope of the individual values just prior to the comparison with other reference data.
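As an illustration of this local-versus-global distinction, the sketch below (our own simplification of the texton idea in [68], with hypothetical variable names) assigns each local descriptor to its nearest learned centroid, accumulates the counts into a normalized histogram signature, and compares two images by histogram overlap.

    import numpy as np

    def texton_signature(descriptors, centroids):
        """Histogram of nearest-centroid assignments for one image.

        descriptors: (n, d) array of local features extracted from the image
        centroids:   (k, d) array of cluster centers learned offline (e.g., k-means)
        """
        # Squared distance from every descriptor to every centroid.
        d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        nearest = d2.argmin(axis=1)                        # closest centroid per feature
        hist = np.bincount(nearest, minlength=len(centroids)).astype(float)
        return hist / hist.sum()                           # normalized count profile

    def histogram_overlap(h1, h2):
        """Histogram intersection: larger overlap suggests more similar images."""
        return np.minimum(h1, h2).sum()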
Given this definition, we are going to use the terms global and local features in our
broad groupings below. We present the global features in section 1.1.2 and the local
features in section 1.1.3. As a historical note, the majority of global features were
introduced earlier than the local features. In addition, we are also going to discuss
features which fall somewhere in the middle of the global-to-local paradigm; we call
these segmentation-based region features (section 1.1.2.1). These features also predate
the local features. And finally, we include section 1.1.4 for systems that utilize multiple
feature types and discuss the benefit of doing so.
1.1.2 Global Feature Vision Localization Systems
Global-feature methods generally consider an input image as a whole and extract a
low-dimensional signature that compactly summarizes the image’s statistics and/or se-
mantics, usually in the form of a histogram. These large-scope statistics should
produce more robust values because random pixel-level noise, which may catastrophi-
cally influence local processing, tends to average out globally. Although these holistic
approaches may sacrifice spatial information (the location of the features), they do not
need to perform segmentation to isolate precise region boundaries or matching on a
coordinate level.
These context-based representations usually utilize descriptors that come from a
variety of domains such as color [99, 10], textures (2D Fourier Transform [61], steerable
wavelet pyramids [94]), or a combination of color and textures [81, 65]. However, these
approaches are limited, for the most part, to classifying places (as opposed to exact
geographical locations) because the end result correspondences are not fine enough for
accurate pose estimation. Nevertheless, with the lower localization resolution, global
features gain a sizable advantage in speed as classifiers (back-propagation neural networks,
SVM) usually output their results almost instantaneously.
In order to deal with this lack of spatial information, some systems [62, 94, 81] use
a predefined, regularly-spaced grid and compute global statistics within each grid tile.
Unfortunately, in the current landscape of global feature research, grids are as far as
most systems would go in terms of spatial attributes of an image. This is because the
next step would have to involve content-based delineation which would probably require
a form of segmentation. In recent years, segmentation has fallen out of favor because of
the difficulty in obtaining a robust and efficient performance, especially in unconstrained
environments. Ascribing concise attributes to compare the different regions themselves
is not an easy task. Furthermore, many systems have been shown to be able to recognize
objects/landmarks in the absence of an accurate outline of the target.
In the following section we are going to describe the research that has been done in
segmentation as it pertains to robot localization.
1.1.2.1 Segmentation-Based Vision Localization Systems
Segmentation-based approaches [1, 92, 48] limit their scope of feature locality to match-
ing image regions and their configurational relationships to form a signature of a loca-
tion. At this level of representation, aside from being quite slow, the major hurdle lies in
achieving reliable segmentation and in robustly characterizing individual regions. Naïve
template matching involving rigid relationships is often not flexible enough in the face
of over/under-segmentation. This is especially true when we introduce clutter. What
most systems would then do is to adopt a number of assumptions about the environment
to simplify the problem. By establishing these rules, naturally, these systems become
environment specific.
For example, a system by [70] is designed to work indoors, because it assumes that
the environment has clear line structures and large homogeneous color surfaces to ease
the segmentation process. Also, an approach by [57] is specifically for localization on
the moon or other planets. It looks for isolated rock regions to be considered as land-
marks. They are easily segmented because the ground in those places has a smoother
appearance. Similarly, [39] looks for segmented red-brick buildings on campus and
compares their shapes to localize. A slightly different approach by [50] finds a different
type of uniform region. Instead of looking for specific foreground regions, it tries to
compare the shapes of the background sky region at different locations. By pointing the
camera slightly up, it captures the silhouettes that are created by the backdrop of the
surrounding on-campus buildings.
1.1.3 Local Feature Vision Localization Systems
In recent years the research in image features for classification and recognition has
turned towards local features. By computing visual values over a specific area of an
image (as opposed to the whole image), local features encode scene characteristics that
are more focused in scope. This acuity is what gives them their discriminative ability. In
addition, because these newer features use descriptors with so many associated values,
they are quite invariant to scale and in-plane rotation and, to a lesser degree, to viewpoint
and lighting changes. Also, because the features are dispersed throughout the image, they
can overcome partial occlusion.
The general extraction procedure of this group of techniques includes two phases:
detection and description. That is, a system first uses a detector to isolate local compo-
nents of an image that are considered easily identifiable and repeatable before describing
the neighborhood around those interest points.
In terms of what these systems are looking for, there are two types of detector. One
looks for isolated blobs, while the other looks for edge-like shapes (corners). SIFT [46],
which is the most popular of local features, approximates the Laplacian of the Gaussian of
an image using the Difference of Gaussians and looks for optima (minima and maxima) to
come up with blob-based keypoints. SURF [8] is another detector that looks for optima
in the determinants of the image Hessian for a blob-like region. Another blob-based
detector is MSER [49]. On the other hand, Harris-Affine [52] (using the Harris measure)
and the C1 detector in HMAX [75] look for edge-like interest points. In addition to these
two detector types, there are a few that do not look for specific shapes. For example,
[38] uses an entropy measure to find interest points. Note that, for all of these systems,
the optimum locations are searched in the spatial as well as the frequency domain. By
painstakingly selecting these interest points, we increase the stability of the features.
The second part of the process is assigning descriptors that encode the neighborhood
around an interest point. To be invariant to planar rotation, these features are recorded
with respect to the primary direction of the point (in both edge as well as blob-based
cases). To be invariant to local deformation, both SIFT and SURF use localized dis-
tributions of gradient directions and magnitude. GLOH [53] is another one that uses
a similar strategy. It is important to point out that, although description is separate
from detection, the right combination of the two produces a much better result. This
is because the invariant characteristics around an edge-based interest point are different
from those of a blob.
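As a concrete (purely illustrative) example of the detect-then-describe pipeline, the snippet below runs an off-the-shelf SIFT implementation; it assumes an OpenCV build that ships SIFT (version 4.4 or later) and a hypothetical image path, and it is not the feature pipeline used in this thesis.

    import cv2

    def extract_sift_features(image_path):
        """Detect blob-based interest points and describe their neighborhoods."""
        gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
        sift = cv2.SIFT_create()
        # Detection: Difference-of-Gaussian extrema across scales (blob-like points).
        # Description: 128-D histograms of local gradient orientations, recorded
        # relative to each keypoint's dominant direction for rotation invariance.
        keypoints, descriptors = sift.detectAndCompute(gray, None)
        return keypoints, descriptors

    # Example usage with a hypothetical file name:
    # kps, descs = extract_sift_features("campus_frame_0001.png")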
Because of robust detection and rich (albeit high dimensional) characteristics, these
features perform very well in object recognition, even in large-sized databases. The
drawback, however, is that they are quite slow. As such, subsequent works have taken
steps to alleviate this problem. The first is a dimension reduction on the features. For
example, PCA-SIFT [40] drops the number of dimensions from the original SIFT’s 128
down to 10. Another way is by not performing affine matching, where coordinate locations
of the features are taken into account. Yet an even faster way by [42] is by not having
a detector step in the extraction process. This system extracts SIFT feature descriptors
from an image at a regular spatial grid interval and at a pre-determined angle. By tiling
the features, it does not need to perform feature rotation or translation alignment.
However, the most frequently used speed-up technique applied to local features is
the bag-of-features (BOF). This approach [84, 59, 68] takes local features and compares
each of them to a set of prototypical features to create a frequency histogram. In this
scheme, spatial matching is considered positive regardless of location correspondence.
The process of selecting the prototypical features themselves (using k-means clustering,
for example) is usually done with a particular set of data in mind to optimize the
training process. However, some argue [75, 84] that the same results can be obtained in
the absence of a specific task (free-viewing) in a lifelong-learning paradigm to produce a
so-called universal bag-of-features.
By performing recognition only based on the existence of particular features in any
part of the image, BOF, too, does not perform affine matching. The more important
point about this technique, however, is that it transforms sets of keypoints that differ
in total numbers in each image to an identical number of dimensions across all frames.
This way, we can use kernel-based classifiers and not compare each of the n individual
keypoint sets from a stored database (O(n) operations). Instead, classifiers usually run
in O(k), with k being the number of categories/places, and, thus, output their results
almost instantaneously. An extension by [42] accounts for coarse spatial information
by requiring the features to be in certain quadrants of a regularly pre-determined grid
of the image.
Another variant of the bag-of-features is a technique by [28] which transforms a set of
keypoints to a pyramid representation that takes into account the structure of the data
by using value-driven bins (much like bucket-sort) as its histogram. The technique works
well using lower-dimensional PCA-SIFT features. However, it is less effective for high-
dimensional features [29]. A later version [29] solves this problem by using a vocabulary-
driven pyramid constructed by hierarchical k-means clustering, which actually brings the
technique closer to the original bag-of-features.
However, despite successes in object recognition, transfer to vision localization is not a
straightforward process. For one, the approach would not work as well in environments
where the visual stimuli are lacking in texture, such as the desert or an empty beach
where there are only yellow sand, blue ocean, and the sky. On the other extreme, scenes
that are cluttered with many distracting textures are also difficult to deal with. This is
because one disadvantage of using local features is the number of features that need to
be extracted and stored for each image. In addition, a lot of the collected local features
may not be at all useful as they are, in general, less stable than global features. For
example, in a park full of trees, a majority of the local features are outlines of leaves
that are commonplace.
Consequently, instead of just blindly using all features found in an image [74, 27],
some systems try to find ways to select just a subset, the most useful ones, to speed
up the matching process as well as to avoid features that only add noise. Usually these
features are selected because they describe an object or distinctive sub-regions in the
image [25, 63, 58, 26].
In terms of features utilized, by far, the most frequently used is SIFT [83, 65, 17,
74, 27, 19]. However, there are also systems that use GLOH [66], and SURF [101, 56].
Interestingly, both of the SURF systems utilize omni-directional images.
Another aspect that differentiates localization systems is whether they perform pose
estimation. Some systems [74, 108] estimate pose to the pixel level, through wide-
baseline matching technique, for example. However, a larger percentage [103, 73, 17, 4]
utilize the bag-of-features technique to speed up their performance. Many believe that
having to perform a straightforward affine matching between sets of local features would
probably prevent a system from being able to run in real-time, especially on large scale
environments.
Another useful point of comparison is related to the testing environments, whether
the system works indoors or outdoors. In the past, a majority of them were tested indoors.
Nowadays, there is a significant increase in the percentage of outdoor testing. In addition,
a lot of the data are also available online, with some even taken from a period that
spans multiple seasons, such as from summer to winter [101, 65].
In the current state of vision localization systems, there are plenty of approaches
utilizing all kinds of features with varying degrees of success. Unfortunately, at this
point, we are still unable to theorize and predict which features work with which
environments, and why. This is because a majority of the systems are tested only in
one environment. We believe that the best way to move forward would be to rigorously
test systems in multiple environments, each with different visual challenges. This is why
we selected three large scale outdoor sites for evaluating our system.
1.1.4 Multiple Feature Vision Localization Systems
Recently, more systems have shown that using a combination of local and global features
increases the computation speed while maintaining localization accuracy. In order to
utilize the two contrasting features within the same framework, we have to accommodate
two aspects: representation and time-scale. Local features have a much more detailed
information representation and are slower to match, while the global features are coarser
but more compact, which leads to faster matching.
One way to combine the two feature types is through staged filtering steps with
global feature matching being applied before a full-blown local feature recognition is
performed. For real-time systems such as robot localization, the local feature database
search process often can only afford the first positive match, not the best match. Once
found, the search is stopped. The global feature filtering step is designed to increase
the likelihood of this event occurring early, either by ordering the database entries to be
compared (from the most likely to be matched to the least) [83] or by discarding the
least likely entries altogether [21, 103]. In practice, these systems never have to look
at all the entries because (through experiments) they have a good idea of when to stop
given the unlikelihood that a match will be found thereafter. Note that, in these setups,
a system still has to rely primarily on local features to stay accurate.
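A minimal sketch of that staged-filtering strategy follows (our own rendering, not the exact procedure of [83] or [21, 103]; the callback names are hypothetical): a cheap global-feature score orders the database entries, the expensive local-feature comparison walks down that order, and the search returns on the first positive match or gives up after a fixed budget.

    def prioritized_search(query, database, global_score, local_match, budget=50):
        """Search database entries in order of global-feature promise.

        global_score(query, entry) -> cheap similarity used only for ordering
        local_match(query, entry)  -> expensive local-feature check, True on a match
        budget                     -> stop after this many local comparisons
        """
        ordered = sorted(database, key=lambda e: global_score(query, e), reverse=True)
        for tried, entry in enumerate(ordered[:budget], start=1):
            if local_match(query, entry):        # first positive match, not best match
                return entry, tried              # matched entry and comparisons spent
        return None, min(len(ordered), budget)   # gave up within the budget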
Another way to bring together the local and global feature types is to allow each type
to run independently of the other. For example, a system by [104] picks which features
(local or global) to use for each time step through a measure of confidence level. The
system uses global features when it is fairly certain of its location and switches to local
features when it is not. In our presented work, we do not exclusively select one module
and ignore the other; we let both asynchronously influence the decision-making process
using a distributed computing methodology. That is, because the global features are
computed instantaneously, they can be used to roughly update the location belief on
every frame. Meanwhile, the local feature matching results would then refine the belief
whenever they are available. This way the system does not have to wait for the local
feature matching process to run to completion at the end of each frame.
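The sketch below is a simplified, single-machine stand-in for that distributed arrangement (the callback names are hypothetical): a worker thread performs the slow salient-region matching, while the main loop applies the fast gist observation on every frame and folds in whatever match results have finished, without ever blocking on them.

    import queue
    import threading

    def matcher(frames_in, results_out, match_landmarks):
        """Slow local-feature matching; posts each result as soon as it is ready."""
        for frame in iter(frames_in.get, None):            # None is the stop signal
            results_out.put(match_landmarks(frame))

    def localization_loop(frames, belief, gist_update, landmark_update, match_landmarks):
        frames_in, results_out = queue.Queue(), queue.Queue()
        threading.Thread(target=matcher, args=(frames_in, results_out, match_landmarks),
                         daemon=True).start()
        for frame in frames:
            belief = gist_update(belief, frame)            # fast, applied on every frame
            frames_in.put(frame)                           # hand the slow matching off
            while not results_out.empty():                 # refine with finished matches
                belief = landmark_update(belief, results_out.get())
        frames_in.put(None)                                # tell the worker to stop
        return belief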
1.2 Biological Approach to Vision
Despite recent advances in computer vision and robotics, humans still perform orders of
magnitude better than the best available vision systems in localization and navigation.
As such, it is inspiring to examine the low-level mechanisms as well as the system-level
computational architecture according to which human vision is organized.
Because of the complexity of the human brain, we find that it is beneficial to have
an insight into how it deciphers the world and develops its general scene understanding
before applying these insights specifically to the localization task. Also, because of the incomplete
knowledge about the brain that is available at this point, it is likely that our explanations
have gaps in describing specific details about certain parts of the brain. In these cases,
we try to make educated guesses to fill in the blanks.
We analyze human vision from two different perspectives: anatomical and func-
tional. In the anatomical decomposition, we follow the path of the visual stimuli from
the retina all the way to just before the spinal cord, where action is encoded. In the
functional decomposition, we investigate the brain as an image processor: how it goes
from individual pixel values to a high-level scene understanding.
1.2.1 Anatomical Decomposition of the Human Vision
Figure 1.2 illustrates the flow of visual processing anatomically.
After early preprocessing that takes place at both the retina and LGN (following
figure 1.2), the visual stimuli arrive at the Visual Cortex (cortical visual areas V1, V2,
V4, and MT), which, by far, is the most extensively studied region of the brain, for
low-level feature extractions. In this area, the brain produces various raw features that
are static in nature (orientation, color, disparity) as well as dynamic (motion). From
there, these features are then sent to two different areas through two visual processing
streams called the ventral and dorsal pathways [100].
The Ventral or “what” pathway includes areas such as the hippocampus and para-
hippocampus which are known to be involved in recognition and spatial memory recol-
lection [13]. The important point to note here is that the operations in this area require
a form of memory recall from the extensive bank of information that covers all facets of our
Figure 1.2: Anatomical Decomposition of Human Vision. After visual stimuli are en-
coded in the retina, the brain proceeds to relay them to the Visual Cortex via the
LGN for center-surround contrast enhancements. From there the processing is split into
ventral and dorsal pathways. The former is usually attributed to recognition capabili-
ties, while the latter is more about tracking the movement of the target object. The
two pathways then merge back in the pre-frontal cortex, which precedes the pre-motor
cortex, which generates motion commands that go through the spinal cord.
life experiences as well as the knowledge we have acquired. For mobile robotics, this trans-
lates to tasks such as localization, map making, and path finding. One characteristic
about humans’ ability to remember that we would like to project onto our robots is how one
can remember so much and yet can recall critical information in such a timely manner.
The dorsal or “where” pathway, on the other hand, deals less directly with the identity
of the stimuli. This area is responsible for operations involved in the tracking and
handling of the target stimuli while they are still being recognized or already identified by the
ventral pathway. The cases that apply to mobile robots would be navigational tasks such
as landmark tracking, obstacle avoidance, or staying in the middle of a lane. Because
of the need for memory recall, the modules in the ventral pathway may perform at a
slower pace than those of the dorsal pathway, which demands reactions in real-time and
a more action-based form of recognition.
In the end, both pathways end up at the pre-frontal cortex where conscious high-
level decisions involving the current tasks, motivations, and emotions of the individual are made.
Some of these decisions can simply be mental notes, while others are made known in a
form of motor commands. For the latter to take effect, they have to be encoded by the
primary motor cortex before sending them to the spinal cord.
1.2.2 Functional Decomposition of the Human Vision
We now look at the scene understanding capability of the human visual system by
analyzing the different vision modules that make up the system, as illustrated in figure 1.3.
Early on, even in the initial viewing of a scene, the human visual processing system
already makes decisions to focus attention and processing resources onto those small
regions within the field of view that look more interesting. The mechanism by which
very rapid holistic image analysis gives rise to a small set of candidate salient locations
in a scene has recently been the subject of comprehensive research efforts and is fairly
well understood [95, 106, 37, 36]. These highlighted points of interest would be useful
[24] in selecting landmarks that are the most reliable in a particular environment (a
challenging problem in itself). Moreover, by focusing on specific sub-regions and not
the whole image, the matching process becomes more flexible and less computationally
expensive.
Parallel with attention guidance and mechanisms for saliency computation, humans
demonstrate exquisite ability at instantly capturing the “gist” of a scene; for example,
following presentation of a photograph for just a fraction of a second, an observer may
report that it is an indoor kitchen scene with numerous colorful objects on the countertop
[64, 9, 97, 60]. Such a report at a first glance onto an image is remarkable considering that
it summarizes the quintessential characteristics of an image, a process previously thought
to require much more complex analysis. With very brief exposures (100ms or below),
humans are able to distinguish a few general semantic attributes (e.g., indoors, outdoors,
office, kitchen) and a coarse evaluation of distributions of visual features (e.g., highly
colorful, grayscale, several large masses, many small objects) [72, 69]. Furthermore,
answering specific questions such as whether an animal was present or not in the scene
can be performed reliably down to exposure times of 28ms [87,
54]. Anatomically, gist may be computed in brain areas which have been shown to
preferentially respond to “places,” that is, visual scene types with a restricted spatial
layout [20]. Spectral contents and color diagnosticity have been shown to influence gist
perception [60, 61], leading to the development of the existing computational models
that emphasize spectral analysis [93, 2].
From the point of view of desired results, gist and saliency appear to be opposites:
finding salient parts requires identifying image regions which stand out by being signifi-
cantly different from their neighbors, while computing gist involves accumulating image
Figure 1.3: A model of the Functional Decomposition of Image Understanding in the
Human Brain. The brain starts by extracting various features in the Visual Cortex. It
then performs foveal processing using the saliency model and peripheral processing using
the gist model. These two complementary models allow the system to investigate the
scene from two opposing directions.
statistics, finding the general consensus over the entire scene. Yet, despite
these differences, both of these modules rely upon the same raw features of the early
visual cortex. Furthermore, the idea that gist and saliency are computed in parallel is
demonstrated in a study in which human subjects are able to simultaneously discrimi-
nate rapidly presented natural scenes in the peripheral view while being involved in a
visual discrimination task in the foveal view [44], engaging the subject’s attention. Part
of our contribution is to model the connection between these two crucial components of
biological vision.
To this end, we explicitly explore whether it is possible to devise a working system
where these low-level feature extraction mechanisms (section 1.2.1) are shared and serve
both saliency and gist, as opposed to computing two entirely separate sets of raw features
for each respective machine vision module. Our system (presented in section 2.2.2)
faithfully and efficiently implements the low-level Visual Cortex features which are then
further processed in the gist and saliency models using contrasting biologically-plausible
operations to produce a critical capability such as localization.
From an engineering perspective it is an effective strategy to analyze a scene from
multiple abstractions: a high-level, image-global layout (corresponding to gist) and a
more detailed and focused analysis of saliency. It is also important to note that while
the saliency model primarily utilizes local features [36], the gist features are almost
exclusively global or holistic [61, 94, 81]. In addition, the two models, when run in
parallel, can help each other and provide a more complete description of the scene in
question.
In what follows, we use the term gist in a more specific sense than its broad psy-
chological definition (what observers can gather from a scene over a single glance): we
formalize gist as a relatively low-dimensional (compared to a raw image pixel array) scene
representation which is acquired over very short time frames, and we thus represent gist
as a vector in some feature space. Scene classification based on gist then becomes possi-
ble if and when the gist vector corresponding to a given image can be reliably classified
as belonging to a given scene category.
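To illustrate that formulation (and only as an illustration; the classifiers actually used are described in chapter 2), a deliberately simple scene classifier over such gist vectors is nearest-class-mean: each category is summarized by the mean of its training gist vectors, and a new image is assigned to the category whose mean is closest.

    import numpy as np

    def class_means(gist_vectors, labels):
        """Summarize each scene category by the mean of its training gist vectors."""
        means = {}
        for label in set(labels):
            rows = [g for g, l in zip(gist_vectors, labels) if l == label]
            means[label] = np.mean(rows, axis=0)
        return means

    def classify_gist(gist, means):
        """Assign a gist vector to the scene category whose mean is closest."""
        return min(means, key=lambda label: np.linalg.norm(gist - means[label]))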
1.3 Topological Maps
In addition to biological vision, we also utilize topological maps, which draw from various
human experiments. A topological map [88, 41], which refers to a graph annotation of
an environment, assigns nodes to particular places and edges to paths if direct passage
between pairs of places (end-nodes) exists. One of the distinct ways humans manage
spatial knowledge is by relying more on topological information than metric. That is,
although humans cannot estimate precise distances or directions [98], they can draw a
detailed and hierarchical topological (or cognitive) map to describe their environments
[51]. Nevertheless, approximate metric information is still deducible and is quite useful.
Thus, in our system, as well as a number of others [88, 11], we use an augmented
topological map where we assign a Cartesian coordinate to each node and the corresponding
Euclidean distance as each edge’s cost. The amount of added metric information is
not a heavy burden (in terms of updating and querying) for the system, because the
basic graph organization is very concise. This is in sharp contrast to a more traditional
metric grid map in robotics localization literature [23, 89], where every area in the map
is specified for occupancy, as opposed to being assumed untraversable if not specified as
places or paths.
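A minimal sketch of such an augmented topological map follows (the class and method names are ours, purely for illustration): nodes carry Cartesian coordinates, and each edge's cost is just the Euclidean distance between its end-nodes, so the metric annotation adds almost nothing to the graph's bookkeeping.

    import math

    class AugmentedTopologicalMap:
        """Graph map: nodes have (x, y) coordinates; edges cost Euclidean distance."""

        def __init__(self):
            self.coords = {}                 # node id -> (x, y) coordinate
            self.edges = {}                  # node id -> {neighbor id: edge cost}

        def add_node(self, node, x, y):
            self.coords[node] = (x, y)
            self.edges.setdefault(node, {})

        def add_edge(self, a, b):
            """Connect two existing nodes; the cost is their Euclidean distance."""
            cost = math.dist(self.coords[a], self.coords[b])   # Python 3.8+
            self.edges[a][b] = cost                            # directed edge a -> b
            return cost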
1.4 Document Organization
In the following chapters, we describe the presented system that starts with a system
overview (chapter 2). This chapter also includes the main contributions of the system,
which are inspired by human vision studies and by the current hurdles encountered in the field.
We then extensively test the system (chapter 3) in three different outdoor environ-
ments - a building complex (126x180ft. area, 13966 testing images), a vegetation-filled park
(270x360ft. area, 26397 testing images), and an open-field area (450x585ft. area, 34711
testing images) - each with its own visual challenges. We not only test for the system’s
accuracy in terms of coordinate location, but also investigate its efficiency in terms of
how fast it produces answers. We find that the system is able to perform well at 10
frames/second. In the “Discussions and conclusions” chapter 4 we summarize the work
presented and put it in perspective. In the concluding chapter 5, we are going to describe
the future directions of our research.
Chapter 2
Design And Implementation
The overall objective of this work is to localize a robot in all environments, under all
conditions without the need for an overly elaborate calibration process. We call this
capability environment invariance. The calibration requirement is not considered sat-
isfied anytime the designer has to pre-select specific landmarks, or add or subtract
objects/fiducials to the current environment for the purpose of localizing in it. This
requirement is important because a system that violates it would not be extendable to a
true SLAM (Simultaneous Localization And Mapping) system, which needs to operate
in a never-before-seen environment.
However, before we delve into the details of the presented system, we would like to
start with the author’s over-arching research philosophy. The overall goals of our work
are both to endow machines with intelligence (an engineering thrust) and to study how
humans see (a scientific endeavor). That is, although we are committed to studying
human vision, we also would like to build systems that are useful in the real world now.
This dual intention has its advantages. For one, scientific ideas can be used to build
neuromorphic robots. On the other hand, engineering ideas can help bring inspiration
to explain scientific phenomena. Granted, this relationship is not always perfect. For
one, the human brain and a computer are fundamentally different computing devices.
Another is that our knowledge of the brain is still incomplete, which may cause us to
apply incorrect concepts to the robots. However, despite these problems, we feel that,
given some adjustments, there are a number of neuroscience research ideas that are quite
applicable to solve current engineering problems. Conversely, there are many machine
vision techniques that currently are better at performing specific visual tasks than their
counterpart biological models. They may not be biologically plausible at face value,
but sometimes these alternatives are so compelling that something analogous may be
happening in the human brain.
The author believes that the main contribution of this work is to faithfully emulate
the functionality of the human brain as it applies to the available machines. However,
to advance our other objective as engineers, the application should also be robust and
efficient. As such, when particular parts of the system do not have a
satisfactory biologically inspired option, we substitute the best available sub-
systems in their place. For example, our back-end localization module uses a standard
probabilistic approach [23, 89, 55], which may not be biologically plausible. This is
why we simply claim that the system is biologically inspired. In addition, we also use
the SIFT model [46] as our landmark recognition module because it is the current gold
standard. And so, in its current state, we have a complete vision localization system,
albeit not entirely biologically plausible. However, we believe this is an important first
step in what we hope will eventually be a fully biological system that will help explain the
inner workings of the human brain.
This chapter starts with the system overview (section 2.1) which gives a high-level
snapshot of the involved modules and the order of operation from image acquisition to
motor commands for robot movements. In addition, we also are going to discuss the
motivations that drive the decisions we made. We hope that our attempts to answer
some of the current issues in vision localization can be of use in building and improving
other systems.
2.1 System Overview
Figure 2.1 illustrates the complete biologically-inspired mobile robotics vision system, which consists of a recognition/localization module (top right-hand box in orange) and a navigation system (in cyan). The localization/navigation branching is inspired by the previously mentioned ventral and dorsal pathways [100], where the former is the localization module and the latter performs various navigational tasks such as landmark tracking and following, obstacle detection and avoidance, and lane detection and following. Also note that the split occurs at approximately the same point in the visual processing stream. Both modules utilize the gist features [81] and salient regions [83] that are computed in parallel using shared raw Visual Cortex features. We implemented them as such not only because it is what we observe in the human brain, but also because it is the right engineering decision.
In this thesis we are going to focus on the localization system, with the exception of the salient region tracking system, which originally comes from the navigation system. The tracking mechanism is needed to allow the system to keep searching through the landmark database while the robot is moving. The localization system can be divided into three stages: feature extraction, segment and salient region recognition, and localization. The feature extraction stage takes an image from a camera and outputs the gist features and a set of salient regions. The segment and salient region recognition stage then tries to match them with memorized information about the environment. These matches are then used as inputs to the localization stage, which decides the most likely location of the robot.
Before describing the system in depth, there are a few concepts that first need to be defined, namely segment and salient region. These two terms are part of the stored visual information associated with the map of the environment. The map, which is currently provided to the system, is an augmented topological map with all directed edges. It is a graph-based map with each node having a Cartesian coordinate and each edge having its cost set to the distance between the edge's corresponding end-nodes. This way the system benefits from the compact representation of a graph while preserving the important metric information of the environment. In the map, a robot state (position and viewing direction) is represented by a point which can lie on a node or an edge.
Given the information above, we now introduce the concept of a segment. A segment is an ordered list of edges with each edge connected to the next one in the list to form a continuous path. This grouping is motivated by the fact that the views/layout within one path-segment are coarsely similar. The selected three-edge segment (highlighted in green) in the map in figure 2.1 is an example. Geographically speaking, a segment is usually a portion of a hallway, path, or road interrupted by a crossing or a physical barrier at both ends for a natural delineation.
Figure 2.1: Diagram for the presented mobile-robot vision system. The system extracts features from the color, orientation, and intensity domains to compute the gist features and salient regions. In the next stage, this information is used for both localization (module with orange background) and navigation (cyan). For localization, the system estimates its current segment location using gist and tries to match the salient regions with the ones previously seen in order to refine its location hypothesis. In the navigation module, the system performs tasks such as landmark tracking, obstacle avoidance, and path following to properly steer the robot.
The term segment is roughly equivalent to the generic term "place" used by the place recognition systems mentioned in section 1.1, or the concept of place attributed to area PPA in section 1.2. And so, with this added information, the robot location can now be denoted either as Cartesian coordinates (x,y) (because the map includes a rectangular boundary and an origin) or as a pair of segment number and fraction of length traveled (between 0.0 and 1.0) along the segment (snum, ltrav).
The term salient region refers to a conspicuous area of the input image that is easily detected at a part of the environment, which makes it a good localization cue candidate. An ideal salient region is one that is persistently observed from different points of view as the robot moves about the environment, and at different times of the day. A salient region does not have to depict an object (often it is a small part of an object or a set of objects); it just has to be a point of interest situated in the real world that, as time goes on, proves to be consistently detectable. To this end, a set of salient regions that portrays the same point of interest is grouped together, and the set is called a landmark. Thus, a salient region can be considered as evidence of a landmark, and "to match a salient region with a landmark" means to match a region with the landmark's saved regions.
In the following sub-sections we describe each of the three stages, feature extraction, segment and salient region recognition, and localization, in their order of operation. We then briefly explain the parallel implementation of the algorithm.
2.1.1 Feature extraction: Gist and Salient Regions
The starting point for the proposed system is the existing saliency model of Itti et al. [37, 32], which is freely available on the World-Wide-Web [34]. The saliency and gist models utilize shared low-level features, which emulate the ones found in the primate Visual Cortex and consist of center-surround color, intensity, and orientation features computed in separate channels run in parallel. Because the Visual Cortex feature extraction, gist, and saliency models are tightly connected, we present them together in section 2.2.
At this point in the system, the output of the gist model is ready to be used for recognition at the next stage. The output of the saliency model, on the other hand, still needs to be further processed to produce salient regions. We describe this procedure in section 2.3.
2.1.2 Segment and Salient Region Recognition
This stage attempts to match the visual stimuli (salient regions and gist features) with the stored environment information; the results are then used to localize at the next stage.
When building a robotic system that has to traverse large-scale environments, it is imperative to discuss how best to store the information obtained during training. We devise a landmark database building procedure that specifically minimizes the amount of information stored while still retaining enough to cover all pertinent aspects of the environment.
We also find that (as observed in the brain) a highly connected and descriptive database allows the system to exploit a range of relations (spatial, visual, and temporal) to perform smart indexing for quick information recall. In a hybrid system such as ours, accommodating multiple features requires an understanding of their individual representations and time-scales, to make them fit together and work optimally. We achieve this through a compartmentalized landmark database and an on-line search prioritization technique in which the system uses priors such as gist-based segment estimation to first order the entries from the most likely segment location to the least likely, before it proceeds with the slow matching of salient regions' SIFT features.
The training procedure involves a guided traversal of the robot through all the paths in the map and is done several times to have ample lighting coverage as well as to allow identification of landmarks that are consistent over multiple runs. The system acquires the information in two separate steps: building a landmark database and training the segment classifier using the gist features.
The segment estimation training and run-time classification are described as part of the gist model in section 2.2.2. The salient region run-time matching is described as part of section 2.3, while the landmark database construction is described in section 2.4. As part of the run-time matching, we also explain, in section 2.4.2, the prioritization procedure that allows the system to increase the speed of the landmark database search without sacrificing accuracy.
Despite the effort in speeding up information recall, the system may still take some time to return a match. This is where the dorsal pathway operations, which run in parallel with the ventral localization system, can alleviate the problem. Here, the dorsal pathway tracks the salient region that is being matched with the landmark database in the ventral pathway. So, in spite of the possibly slow recognition process, as long as the salient region of interest is in its field of view, the robot can still move in its environment. We explain this procedure in section 2.5.
For both gist and salient regions, we formulate the output of the recognition process as equations used in the probabilistic localization stage.
2.1.3 Localization
The system uses a straightforward Monte-Carlo Localization (MCL) [23, 89, 55] approach (described in section 2.6), with a motion model to take the robot movements into account, as well as two types of input observations: segment estimation and salient region recognition. We formulate a multi-level localization system so that the two coincide naturally within the same framework. With gist (a global feature) the system can locate its general whereabouts to the segment level, while salient regions can pinpoint its coordinate location by finding distinctive cues situated within the segment.
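As a rough illustration of how these two observation types could enter an MCL update loop (the actual formulation is given in section 2.6), the following Python sketch shows a minimal particle filter. The particle representation, weighting scheme, and noise values here are illustrative assumptions, not the system's actual implementation.

```python
import random

# Minimal Monte-Carlo Localization sketch (illustrative only).
# A particle is (segment_number, fraction_traveled, weight).

def mcl_step(particles, odometry, segment_likelihoods, landmark_matches):
    """One predict/update/resample cycle of a simple particle filter."""
    updated = []
    for seg, ltrav, w in particles:
        # Predict: advance the particle along its segment with noisy odometry.
        ltrav = min(1.0, max(0.0, ltrav + odometry + random.gauss(0.0, 0.01)))

        # Update with the gist observation: per-segment likelihood (sval).
        w *= segment_likelihoods.get(seg, 1e-6)

        # Update with salient region observations: reward particles close to
        # the matched landmarks' stored (segment, ltrav) locations.
        for m_seg, m_ltrav in landmark_matches:
            if m_seg == seg:
                w *= 1.0 + max(0.0, 0.1 - abs(m_ltrav - ltrav)) * 10.0
        updated.append((seg, ltrav, w))

    # Normalize and resample proportionally to weight.
    total = sum(w for _, _, w in updated) or 1e-9
    updated = [(s, l, w / total) for s, l, w in updated]
    resampled = random.choices(updated, weights=[w for _, _, w in updated],
                               k=len(updated))
    return [(s, l, 1.0 / len(resampled)) for s, l, _ in resampled]
```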
2.1.4 Parallel Implementation
It bears mentioning that our localization system is meant to be run on a robot with multiple processors. To make the most of the hardware, the algorithm has to be parallelizable. This is done by implementing the Visual Cortex raw feature extraction (quite computation-heavy) in parallel across the different channels. In addition, the landmark search process, which is the slowest part of the system, uses a priority queue of jobs that is accessible by any (large) number of threads.
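A minimal sketch of this worker-pool pattern is shown below, using Python's standard queue and threading modules. The job contents, worker count, and scoring convention are placeholders meant only to illustrate the shared-priority-queue idea, not the actual multi-processor implementation.

```python
import itertools
import queue
import threading

# Shared priority queue of comparison jobs.  PriorityQueue pops the smallest
# value first, so "most promising" jobs should receive the smallest scores
# (e.g., a negated priority value); the counter breaks ties safely.
job_queue = queue.PriorityQueue()
counter = itertools.count()

def enqueue(priority, job):
    job_queue.put((priority, next(counter), job))

def compare(job):
    # Placeholder for the slow salient-region-vs-landmark (SIFT) comparison.
    return None

def match_worker(results):
    """Each worker thread repeatedly pops the most promising job."""
    while True:
        _, _, job = job_queue.get()
        if job is None:                  # poison pill: shut this worker down
            job_queue.task_done()
            return
        results.append(compare(job))
        job_queue.task_done()

results = []
workers = [threading.Thread(target=match_worker, args=(results,))
           for _ in range(8)]
for w in workers:
    w.start()

enqueue(0.2, "regionA-vs-landmark3")     # hypothetical jobs
enqueue(0.7, "regionA-vs-landmark9")

for _ in workers:                        # one poison pill per worker
    enqueue(float("inf"), None)
for w in workers:
    w.join()
```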
2.2 Visual Cortex Feature Extraction, Gist, and Saliency
Model
One redeeming biological trait in how the human brain performs scene understanding is its tenacity. It finds many ways to get the most out of an image and tries to exploit all deciphered patterns, which is especially important when we consider the different types of environment that exist in the real world (being environment invariant). We can sense in our minds that, when we try to figure something out, our brains seem to run several scenarios or think in multiple abstractions in parallel. The brain starts by extracting visual features from various domains, from the coarsest to the finest resolution in the visual cortex. It then tries to approach a visual problem from all sides and in parallel.
The concurrent use of saliency and gist is an implementation of peripheral and foveal vision, where the scene is analyzed from two opposite points of view: a high-level, image-global layout (gist) and a detailed pixel-wise analysis at the attended locations (saliency). This is important because for localization, unlike object recognition, foreground activities (objects, actors, etc.) are oftentimes distractions, while the background is a main source of information.
In addition, our computations to obtain the gist features contrast with those of the saliency model to provide complementary and non-redundant results. For a quick comparison of the similarities and differences between the two, please observe figure 2.2.
Figure 2.2: Side-by-side comparison of the gist and saliency models. Both use shared raw filter outputs and center-surround (feature) maps. Gist uses them to create a gist vector, while saliency, through winner-take-all mechanisms, creates a saliency map.
From the figure, we see that both models use the same center-surround raw features (also known as feature maps), except for the orientation channel, where we take the Gabor filter outputs instead of the center-surround of Gabors. We do not perform the center-surround operation there because the Gabor filters already perform differencing between two levels. We formulate these maps in section 2.2.1 below.
After the raw features are extracted, the models diverge in their philosophy of computing. In the saliency model, the feature maps are used to detect conspicuous regions in each channel using operations that are competitive in nature. This is done through winner-take-all (WTA) mechanisms, which emphasize locations that substantially differ from their neighbors [37, 32].
The feature maps are first linearly combined within each channel (with a max-norm operation [37, 32] subsequently applied to subdue noisy channels) to create conspicuity maps. The conspicuity maps are then linearly combined to yield a saliency map. The gist model, on the other hand, performs more cooperative grid-averaging operations (detailed in sub-section 2.2.2.1). Our basic approach is to exploit statistics of color and texture measurements in predetermined region subdivisions. These features are independent of shape as they simply denote lines and blobs.
While the computation and application of the saliency map have been extensively reported in previous works [37, 35, 36, 33], here we describe the gist model in detail in section 2.2.2.
One may object that such an approach would be highly inefficient as the resources are spread too thin. However, if a system shares intermediate information between modules so as to avoid unnecessary duplicate computations, as is the case in the human brain and our implementation of it, the benefits far outweigh the cost. And so, to pick up where the saliency model left off, the gist model tries to bring something new to image observation. Furthermore, the gist computations ought to be theoretically non-redundant with respect to saliency's so that the results of the two modules complement each other. Because saliency's computations boil down to competition among neurons or pixel locations, we decided that gist's would be more of an accumulation operation that represents cooperation among the neurons. In addition, there is more spatial emphasis in saliency and less so in gist. Also, in terms of results, instead of looking for irregularities (the most salient locations) in an image, gist looks for regularities.
2.2.1 Visual Cortex Feature Extraction
In our system, an input image is filtered in a number of low-level visual "feature channels" at multiple spatial scales, for features in the color, intensity, and orientation domains. In addition, the flicker and motion channels (also found in the Visual Cortex) have been shown to improve the saliency model. Currently we do not include these channels because we think that they are dominated by the robot's egomotion and hence unreliable for forming a gist signature of a given location.
The color and orientation channels have several sub-channels (2 color opponency
types and 4 orientations, respectively), while the intensity channel only has one (dark-
bright opponency). Each sub-channel has a nine-scale pyramidal representation of fil-
ter outputs, a ratio of 1:1 (level 0) to 1:256 (level 8) in both horizontal and verti-
cal dimensions, with a 5-by-5 Gaussian smoothing applied in between scales. Within
each sub-channel i, the model performs center-surround operations (commonly found
in biological-vision which compares image values in center-location to its neighboring
surround-locations) between filter output maps, $O_i(s)$, at different scales $s$ in the pyramid. This yields feature maps $FM_i(c,s)$, given "center" (finer) scale $c$ and "surround" (coarser) scale $s$. Our implementation uses $c = 2,3,4$ and $s = c + d$, with $d = 3,4$. Across-scale difference (operator $\ominus$) between two maps is obtained by interpolation to the center (finer) scale and pointwise absolute difference (eqn. 2.1).

$$FM_i(c,s) = |O_i(c) \ominus O_i(s)| = |O_i(c) - \mathrm{Interp}_{s-c}(O_i(s))| \qquad (2.1)$$
Hence, we compute six feature maps for each type of feature, at scales 2-5, 2-6, 3-6, 3-7, 4-7, and 4-8, so that the system can gather information in regions at several scales, with added lighting invariance provided by the center-surround comparison (further discussed below).
For the orientation channel, we employ Gabor filters on the greyscale input image (eqn. 2.2) at four different angles ($\theta_i = 0, 45, 90, 135^{\circ}$):

$$Ori_i(c) = Gabor(\theta_i, c) \qquad (2.2)$$

The center-surround or feature map formulation then becomes:

$$Ori_i(c,s) = |Ori_i(c) \ominus Ori_i(s)| \qquad (2.3)$$
The color and intensity channels, on the other hand, combine to compose three pairs of color opponents derived from Ewald Hering's Color Opponency theories [96], which identify four primary colors red, green, blue, and yellow (denoted as $R$, $G$, $B$, and $Y$ in eqns. 2.5, 2.6, 2.7, and 2.8, respectively) and two hueless dark and bright colors (eqn. 2.9), computed from the raw camera $r$, $g$, $b$ outputs [37].

$$R = r - (g+b)/2 \qquad (2.5)$$
$$G = g - (r+b)/2 \qquad (2.6)$$
$$B = b - (r+g)/2 \qquad (2.7)$$
$$Y = r + g - 2(|r-g| + b) \qquad (2.8)$$
$$I = (r+g+b)/3 \qquad (2.9)$$

The color opponency center-surround pairs are the two color channels' red-green and blue-yellow (eqns. 2.10 and 2.11), along with the intensity channel's dark-bright opponency (eqn. 2.12):

$$RG(c,s) = |(R(c)-G(c)) \ominus (R(s)-G(s))| \qquad (2.10)$$
$$BY(c,s) = |(B(c)-Y(c)) \ominus (B(s)-Y(s))| \qquad (2.11)$$
$$I(c,s) = |I(c) \ominus I(s)| \qquad (2.12)$$
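To make the pyramid and across-scale difference concrete, here is a small Python/NumPy sketch of how one such center-surround feature map could be computed. The smoothing kernel, rescaling method, and the use of the intensity channel are simplified assumptions, not the exact implementation used by the saliency code.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def gaussian_pyramid(channel, levels=9):
    """Build a dyadic pyramid by repeated smoothing and downsampling."""
    pyr = [channel.astype(np.float64)]
    for _ in range(1, levels):
        smoothed = gaussian_filter(pyr[-1], sigma=1.0)
        pyr.append(smoothed[::2, ::2])          # 2x decimation per level
    return pyr

def center_surround(pyr, c, s):
    """FM(c,s) = |O(c) - Interp(O(s))|: upsample the coarse map to scale c."""
    fine, coarse = pyr[c], pyr[s]
    factors = (fine.shape[0] / coarse.shape[0], fine.shape[1] / coarse.shape[1])
    upsampled = zoom(coarse, factors, order=1)  # bilinear interpolation
    return np.abs(fine - upsampled)

# Example: dark-bright (intensity) feature maps at the scale pairs in the text.
rgb = np.random.rand(480, 640, 3)               # stand-in for a camera frame
intensity = rgb.mean(axis=2)
pyr = gaussian_pyramid(intensity)
feature_maps = {(c, c + d): center_surround(pyr, c, c + d)
                for c in (2, 3, 4) for d in (3, 4)}
```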
2.2.2 Gist Model
In this section we describe the gist model used by the system to encapsulate the general layout of a scene. Following the gist model diagram (right column of figure 2.2), the model utilizes the same feature maps used by the saliency model to perform gist feature extraction (section 2.2.2.1). Afterwards, we utilize the gist features to classify the segments in the environment. We first apply a PCA/ICA dimension reduction to the gist features (section 2.2.2.3), before using them as inputs to a neural network trained with a back-propagation algorithm (section 2.2.2.4).
Also, during the design process, we performed some experiments to investigate the color invariance of the gist features, which are reported in section 2.2.2.2.
2.2.2.1 Gist Feature Extraction
After the raw features are computed, each sub-channel extracts a gist vector from its corresponding feature map. We apply averaging operations (the simplest neurally-plausible
computation) in a fixed four-by-four grid of sub-regions over the map. Observe figure
2.3 for visualization of the process.
Thus, as proposed in the introduction, gist accumulates information over space in image sub-regions, while saliency relies on competition across space.
Noting that the maps used for gist extraction differ from channel to channel, we formalize gist using two equations, 2.13 and 2.14. The former gist formulation ($G_i^{k,l}(c,s)$) is for the color and intensity channels, while the latter ($G_i^{k,l}(c)$) is for the orientation channel.
Figure 2.3: Gist Decomposition of Vertical Orientation Sub-channel. The original image
(top left) is put through a vertically-oriented Gabor filter to produce a feature map (top
right). That map is then divided into 4-by-4 grid sub-regions. We then take the mean
of each grid to produce 16 values for the gist feature vector (bottom).
However, for all maps, we extract sixteen raw gist features per map, taking sums over each given sub-region (specified by indices k and l in the horizontal and vertical directions, respectively) of the map, then dividing each sum by the number of pixels in the corresponding sub-region.
For the color and intensity channels we use their feature maps (equations 2.10, 2.11, and 2.12). There are twelve in the color channel (6 center-surround combinations for each of the red-green and blue-yellow opponencies) and six in the intensity channel (6 center-surround combinations for the dark-bright opponency).
Thus, the gist extraction formulation for these channels, using the generic center-surround equation 2.1, is:
For the color and intensity channels:

$$G_i^{k,l}(c,s) = \frac{1}{16WH} \sum_{u=\frac{kW}{4}}^{\frac{(k+1)W}{4}-1} \;\; \sum_{v=\frac{lH}{4}}^{\frac{(l+1)H}{4}-1} [FM_i(c,s)](u,v) \qquad (2.13)$$
Where W and H are the width and height of the entire image.
For the orientation channel, we do not use the feature maps, but operate on the raw Gabor outputs instead. In addition, we only use the first four spatial scales $c = 0,1,2,3$ of the filter output pyramid (for each of the four $\theta_i$ sub-channels), for a subtotal of sixteen maps. Thus, the gist formulation for a Gabor map $Ori_i(c)$ (following equation 2.2) is:
For the orientation channel:

$$G_i^{k,l}(c) = \frac{1}{16WH} \sum_{u=\frac{kW}{4}}^{\frac{(k+1)W}{4}-1} \;\; \sum_{v=\frac{lH}{4}}^{\frac{(l+1)H}{4}-1} [Gabor(\theta_i, c)](u,v) \qquad (2.14)$$
also with W and H as the width and height of the entire image.
These sixteen Gabor maps, along with the eighteen color and intensity feature maps, add up to a total of thirty-four maps altogether. And since we have 16 regions/features per map, the total number of raw gist feature dimensions is 544.
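The grid-averaging step itself is simple; the sketch below computes the 16 per-map means for a list of feature maps and concatenates them into the raw gist vector (544-D when 34 maps are supplied). This is an illustrative NumPy version, not the actual implementation.

```python
import numpy as np

def gist_from_map(feature_map, grid=4):
    """Mean of each cell in a grid x grid subdivision of one feature map."""
    H, W = feature_map.shape
    means = np.empty(grid * grid)
    for k in range(grid):          # horizontal sub-region index
        for l in range(grid):      # vertical sub-region index
            cell = feature_map[l * H // grid:(l + 1) * H // grid,
                               k * W // grid:(k + 1) * W // grid]
            means[l * grid + k] = cell.mean()
    return means

def raw_gist_vector(feature_maps):
    """Concatenate 16 grid means per map; 34 maps -> a 544-D raw gist vector."""
    return np.concatenate([gist_from_map(fm) for fm in feature_maps])

# Example with random stand-in maps of one pyramid size.
maps = [np.random.rand(60, 80) for _ in range(34)]
print(raw_gist_vector(maps).shape)   # (544,)
```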
Because the present gist model is not specific to any domain, other channels such as stereo could be used as well. Furthermore, although additional statistics such as variance would certainly provide useful additional descriptors, their computational cost is much higher than that of first-order statistics and their biological plausibility remains debated [15]. Thus, here we explore whether first-order statistics are sufficient to yield reliable classification when one relies on a variety of visual feature domains rather than on more complex statistics from a single domain.
2.2.2.2 Color Constancy
The advantage of coarse, statistics-based gist is its stability in averaging out local and random noise. More concerning is the effect of global biases such as lighting, which change the appearance of the entire image. Color constancy algorithms such as gray world and white patch assume that lighting is constant throughout a scene [5, 6]. Unfortunately, outdoor ambient light is not quite as straightforward. Not only does it change over time, both in luminance and chrominance, but it also varies within a single scene because it is not a point light-source. Different sun positions and atmospheric conditions illuminate different parts of a scene to varying degrees, as illustrated by images taken one hour apart, juxtaposed in the first row of figure 2.4. We can see that the foreground of image 1 receives more light while the background does not. Conversely, the opposite occurs in image 2. It is important to note that the goal of this step is not to recognize/normalize color with high accuracy, but to produce gist features that are stable over color and intensity changes. We also considered another normalization technique called Comprehensive Color Normalization (CCN) [22], which is an iterative (and slower-converging) process that alternates global and local operations.
One indisputable fact is that when textures are lost because of lighting saturation (either too bright or too dark for the camera sensor), no normalization, however sophisticated, can bring them back.
Channel         Sub-channel (c & s)   pSNR (dB)
Raw             r                     9.24
                g                     9.60
                b                     10.08
Red - Green     2 & 5                 32.57
                2 & 6                 32.13
                3 & 6                 34.28
                3 & 7                 33.95
                4 & 7                 36.32
                4 & 8                 35.82
Blue - Yellow   2 & 5                 32.44
                2 & 6                 30.83
                3 & 6                 32.42
                3 & 7                 30.95
                4 & 7                 31.95
                4 & 8                 31.95
Dark - Bright   2 & 5                 15.03
                2 & 6                 12.72
                3 & 6                 13.79
                3 & 7                 12.21
                4 & 7                 13.29
                4 & 8                 13.33
Figure 2.4: Example of two lighting conditions of the same scene. pSNR (peak signal-to-noise ratio) values measure how similar the maps are between images 1 and 2. Higher pSNR values for a given map indicate better robustness of that feature to variations in lighting conditions. Our center-surround channels exhibit better invariance than the raw r, g, b channels although, obviously, they are not completely invariant.
To this end, because of the nature of our gist computation, the best way is to recognize gists of scenes with different lighting conditions separately. We thus opted not to add any preprocessing, but instead to train our gist classifier (described below) on several lighting conditions. The gist features themselves already help minimize the effect of illumination change because of their differential nature (Gabor or center-surround). Peak signal-to-noise ratio (pSNR) tests for the two images with differing lighting conditions in figure 2.4 show better invariance for our differential features than for the raw r, g, b features, especially for the two opponent color channels. This shows that the low-level feature processing produces contrast information that is more robust to lighting change. Note that using differential features comes at a price: baseline information (e.g., absolute color distributions) is omitted from our gist encoding even though it has been shown to be useful [99].
In general, the center-surround operations can be seen as looking only for the boundaries surrounding regions (not the regions themselves), where the contrasts are. Calculating color distributions using the gist extraction grid, on the other hand, approximates the size of those regions. By performing the operations at multiple pyramid scales, the system can pick up different region sizes and indirectly infer the absolute distribution information, with the added benefit of lighting invariance. As an example, the intensity channel output for the illustration image of figure 2.5 shows different-sized regions being emphasized according to their respective center-surround parameters.
2.2.2.3 PCA/ICA Dimension Reduction
We reduce the dimensions of the raw gist features (544 dimensions) using Principal Component Analysis (PCA) followed by Independent Component Analysis (ICA).
Figure 2.5: Example of the gist operation applied to an image. The model utilizes the same feature maps used by the saliency model to perform gist feature extraction. We reduce the gist features using PCA/ICA dimension reduction before using them as inputs to a neural network, trained with a back-propagation algorithm, to classify the segments in the environment.
Using FastICA [31], the features are reduced to a more practical number of 80 while still preserving up to 97% of the variance for a set of upwards of 30,000 campus scenes.
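A rough equivalent of this reduction step can be expressed with scikit-learn, as sketched below; the component count and the PCA-then-FastICA ordering follow the text, but the exact parameters and library used by the actual system are assumptions here.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

# Stand-in training matrix: one 544-D raw gist vector per frame.
raw_gist = np.random.rand(30000, 544)

pca = PCA(n_components=80)                 # keeps ~97% of the variance in practice
ica = FastICA(n_components=80, max_iter=500)

gist_pca = pca.fit_transform(raw_gist)
gist_reduced = ica.fit_transform(gist_pca)  # final 80-D gist features

print(gist_reduced.shape)                   # (30000, 80)
```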
2.2.2.4 Segment Estimation
For scene classification, we use a three-layer neural network (with intermediate layers of
200 and 100 nodes), trained with the back-propagation algorithm. Each segment in an
environment has an associated output node and the output potential is the likelihood
that the scene belongs to that segment, stored in a vector $z'_t$ to be used by the localization algorithm as an observation, where

$$z'_t = \{\, sval_{t,j} \,\}, \quad j = 1 \,\ldots\, N_{segment} \qquad (2.15)$$

with $sval_{t,j}$ being the segment likelihood for time $t$ and segment $j$, which is one of the $N_{segment}$ segments in the environment.
One of the main reasons why the classifier succeeds is the decision to group edges into segments, as it would have been difficult to train an edge-level classifier using coarse features like gist. In addition, it is easy to add more samples, and the training process takes a short amount of time. The complete process is illustrated in figure 2.5.
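For concreteness, the sketch below shows how such a classifier could be set up with scikit-learn's multi-layer perceptron; the 200/100 hidden-layer sizes follow the text, while everything else (solver, data, number of segments) is a placeholder assumption rather than the thesis' actual training code.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

N_SEGMENTS = 9                         # example environment with 9 segments

# Stand-in training data: 80-D reduced gist features with segment labels.
X_train = np.random.rand(2000, 80)
y_train = np.random.randint(0, N_SEGMENTS, size=2000)

# Three-layer network (two hidden layers of 200 and 100 nodes), trained
# with back-propagation (stochastic gradient descent here).
clf = MLPClassifier(hidden_layer_sizes=(200, 100), solver="sgd", max_iter=300)
clf.fit(X_train, y_train)

# At run-time, the per-segment likelihoods form the observation z'_t.
z_prime_t = clf.predict_proba(np.random.rand(1, 80))[0]
print(z_prime_t.round(3))
```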
2.3 Salient Regions as Localization Cues
The basic idea of using salient regions as localization cues is that we hope these parts of the image will readily and consistently pop out whenever we visit a particular area, without needing a considerable amount of prior knowledge (what objects work best for the current environment) or having to perform sophisticated computations (full image segmentation, for example) just to start the process.
The saliency model [37, 32] provides the system with a saliency map, which we can use to find the coordinates of its peak values, called the salient points. Once we record the salient points, we need to be able to identify them on subsequent viewings. From empirical results we find that using just the center-surround feature values at the pixel location is not enough, for reasons such as scale and lighting change. We have to include information from the surrounding neighborhood as well. To do this we extract the region that attracts our model's attention and document the necessary attributes (feature values and spatial information) that can be used to identify that region. Here we describe the salient region extraction process in section 2.3.1, and then the recognition process (how to compare two salient regions) in section 2.3.2.
In addition to being able to reliably identify individual salient regions, we also need to select the best, most repeatable and persistent ones for our localization task. That is, we should only use regions that are consistently salient as we move about an environment. We do not want to use regions that are salient just because they are the result of accidental perspectives, and are thus hard to recall. We want regions that are repeatable when we visit the area at later times. These characteristics rely not just on visual pop-out effects in an image but also on the semantic issue of using regions that depict entities that are native to the environment, and not moving objects such as people. In order to deal with these concerns, we add a temporal aspect to the selection of these regions. That is, we have to see how a salient region behaves over time, within and across training sessions. We examine this issue in section 2.4, which explains the landmark database construction.
2.3.1 Salient Region Selection and Segmentation
The process of obtaining salient regions from a saliency map is illustrated in figure 2.6.
Figure 2.6: A salient region is extracted from the center-surround map that gives rise to
it. We use a shape estimator algorithm to create a region-of-interest (ROI) window and
use inhibition-of-return (IOR) in the saliency map to find other regions.
The system starts at the pixel location of the saliency map's highest value. To extract a region that includes the point, we use a shape estimator algorithm [71] (region growing with adaptive thresholding) to segment the feature map that gives rise to it. To find the appropriate feature map, we compare the values of the conspicuity maps (there is one for each of the 7 sub-channels) at the salient location and select the sub-channel with the highest value. This sub-channel is called the winning sub-channel. Within the winning sub-channel, we compare values at the same location for all the feature maps. The map with the highest value is called the winning center-surround map.
The system then creates a bounding box around the segmented region. Initially, we fit a box in a straightforward manner: find the smallest rectangle that contains all connected pixels. The system then adjusts the size to between 35% and 50% of both the image width and height, if it is not yet within that range. This is because small regions are hard to recognize and overly large ones take too long to match.
Figure 2.7: Process of obtaining multiple salient regions from a frame, where the IOR
mask (last row) dictates the shift in attention of the system to different parts of the
image.
In addition, the system also creates an inhibition-of-return (IOR) mask to suppress that part of the saliency map so that attention can move to subsequent regions. This is done by blurring the region with a Gaussian filter to produce a tapering effect at the mask's border. Also, if a new region overlaps any previous region by more than 66%, it is discarded but is still suppressed.
We continue until one of the following three exit conditions occurs: the unsegmented image area falls below 50%, the number of regions processed reaches 5, or the saliency map value of the next point is lower than 5% of the first (most salient) one. We limit the regions to 5 because,
from experiments, subsequent regions have a much lower likelihood of being repeatable
in testing. Figure 2.7 shows extraction of 5 regions.
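The overall region-extraction loop, with its IOR suppression and three exit conditions, can be summarized by the Python sketch below. The segmentation callback is a placeholder standing in for the shape estimator, and the Gaussian tapering of the IOR mask is omitted, so this should be read as a schematic rather than the actual code.

```python
import numpy as np

def extract_salient_regions(saliency_map, segment_winning_map,
                            max_regions=5, min_unsegmented=0.5,
                            min_peak_fraction=0.05):
    """Iteratively pick saliency peaks, segment a region, then suppress it (IOR)."""
    smap = saliency_map.copy()
    first_peak = smap.max()
    suppressed = np.zeros_like(smap, dtype=bool)
    regions = []

    while len(regions) < max_regions:
        y, x = np.unravel_index(np.argmax(smap), smap.shape)
        if smap[y, x] < min_peak_fraction * first_peak:
            break                                   # next peak too weak
        mask = segment_winning_map(y, x)            # placeholder: shape estimator
        overlap = (mask & suppressed).sum() / max(mask.sum(), 1)
        if overlap <= 0.66:                         # discard heavily overlapping regions
            regions.append(mask)
        suppressed |= mask
        smap[mask] = 0.0                            # inhibition-of-return
        if suppressed.mean() > 1.0 - min_unsegmented:
            break                                   # most of the image already covered
    return regions
```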
There are reasons why using multiple regions per image is better. First, additional perception (there are many salient entities within the field of view) contributes to more accurate localization, given the possibility of occlusion in an image. Second, the first region may be coincidental or a distraction. In figure 2.7, the most salient point is a ray of sunshine hitting a building. Although it is correct from the saliency perspective, it is not a good location cue. The second region is better because it depicts details of a building.
2.3.2 Salient Region Recognition
In order to recall the stored salient regions we have to find a robust way to recognize
them. We use two sets of signatures: SIFT keypoints [46] and salient feature vector.
We employ a straight-forward SIFT recognition system [46] (using all the suggested
parameters and thresholds) but consider only regions that have more than 5 keypoints
to ensure that the match is not a coincidence.
A salient feature vector [79] is a set of values taken from a 5-by-5 window centered at the salient point location (yellow disk in figure 2.8) of a region sreg. These normalized values (between 0.0 and 1.0) come from the sub-channels' feature maps [37, 32] for all channels (color, intensity, and orientation). In total, there are 1050 features (7 sub-channels times 6 feature maps times 5x5 locations). Because the feature maps are produced by the earlier feature extraction for saliency and gist (section 2.2.1), even though they are computed over the entire image for each visual domain, from the salient feature vector perspective they come at almost no computational cost.
To compare the salient feature vectors of two salient regions $sreg_1$ and $sreg_2$, we factor in both feature similarity $sfSim$ (equation 2.16) and salient point location proximity $sfProx$ (equation 2.17). The former is based on the Euclidean distance in feature space:

$$sfSim(sreg_1, sreg_2) = 1 - \frac{\sqrt{\sum_{i=1}^{N_{sf}} (sreg_{1,i} - sreg_{2,i})^2}}{N_{sf}} \qquad (2.16)$$

$N_{sf}$ is the total number of salient features. For a match to be confirmed, the feature similarity has to be above .75 out of the maximal 1.0. The location proximity $sfProx$, on the other hand, is the Euclidean distance in pixel units (denoted by the function $dist$), normalized by the image diagonal length:

$$sfProx(sreg_1, sreg_2) = 1 - \frac{dist(sreg_1, sreg_2)}{l_{Diagonal}} \qquad (2.17)$$
The positive match score threshold for the distance is 95% (within 5% of the input image diagonal). Note that the proximity distance is measured after aligning $sreg_1$ and $sreg_2$ together, which is done after a positive SIFT match is ascertained (observe the fused image in figure 2.8). The SIFT recognition module estimates a planar (translational and rotational) transformation matrix [46] that characterizes the alignment. In short, individual reference-test keypoint pairs are first compared based on descriptor similarity. Each matched pair then "votes" for possible 2D affine transforms (there is no explicit notion of an object location in 3D space) that relate the two images. An outlier elimination is performed using the most likely transform given all matches. Using the remaining pairs, we compute a final affine transform. With this matrix, the system can check the alignment disparity between the two regions' salient point locations.
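Equations 2.16 and 2.17 translate directly into code; the small sketch below assumes the salient feature vectors are already normalized to [0, 1] and that the salient point of the second region has already been mapped into the first region's frame by the SIFT alignment, as described above. The threshold values follow the text.

```python
import numpy as np

def sf_sim(sfeat1, sfeat2):
    """Feature similarity (eqn. 2.16): 1 - ||v1 - v2|| / N_sf."""
    n_sf = sfeat1.size
    return 1.0 - np.linalg.norm(sfeat1 - sfeat2) / n_sf

def sf_prox(point1, point2, image_shape):
    """Location proximity (eqn. 2.17): 1 - pixel distance / image diagonal."""
    diag = np.hypot(image_shape[0], image_shape[1])
    return 1.0 - np.hypot(point1[0] - point2[0], point1[1] - point2[1]) / diag

def salient_vector_match(sfeat1, point1, sfeat2, point2, image_shape):
    """Both criteria must pass: similarity > 0.75 and proximity > 0.95."""
    return (sf_sim(sfeat1, sfeat2) > 0.75 and
            sf_prox(point1, point2, image_shape) > 0.95)

# Example with two random 1050-D salient feature vectors.
v1, v2 = np.random.rand(1050), np.random.rand(1050)
print(salient_vector_match(v1, (120, 80), v2, (124, 83), (480, 640)))
```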
Figure 2.8: Matching process of two salient regions using SIFT keypoints (drawn as red
disks) and salient feature vector, which is a set of feature map values taken at the salient
point (drawn as the yellow disk). The lines indicate the correspondences that are found.
The fused image is added to show that we also estimate the pose change between the
pair.
Once the incoming salient regions are compared with the landmark database, the successful matches (ones which pass both the salient feature vector and SIFT match thresholds described above) are denoted as observation $z''_t$, where

$$z''_t = \{\, omatch_{t,k} \,\}, \quad k = 1 \,\ldots\, M_t \qquad (2.18)$$

with $omatch_{t,k}$ being the $k$-th matched database salient region at time $t$. $M_t$ denotes the total number of positive matches at time $t$. Note that the recognition module may not produce an observation for every time $t$ ($M_t = 0$), as it is possible that it finds no matches or is still searching for a match.
It is important to note that selecting visually distinct parts of an image has an efficiency benefit because it relieves the system from matching whole scenes. By extracting features only within a small window, the number of SIFT keypoints is drastically reduced, which is substantial given that matching is the slowest part of the system. Also, in cluttered environments with lots of distracting textures, such as a park full of trees (observe figure 2.7 above), targeted windows help avoid the high number of ephemeral and repetitive features, which contaminate the matching process. Naturally we have to consider the possible downside of an increased false positive rate, as a decrease in the number of SIFT features indirectly leads to a lower matching threshold. However, this is where having multiple context cues can minimize these errors. Cues such as gist-based segment estimation can act as a prior filter before performing SIFT matching. If there is a coincidental match, a number of independent factors would also have to be in agreement.
2.4 Storing and Recalling Environment Information
The scalability of a vision localization system is especially important when the robot needs to explore large-scale environments. This issue is directly linked to the efficiency with which the system matches input stimuli with its stored information, in our case the landmark database. In this section, we are going to focus on optimizing the size of the database during training. Ideally, we want a database that is as small as possible while keeping all pertinent information. It is, however, important that we are not overly aggressive in paring down the database size, as it still has to completely encapsulate all aspects of the environment that the system needs in order to maintain localization accuracy.
The decision of what to store is tightly coupled with the capability of the system's recognition module. If the module were able to robustly match an object under any condition, then one photograph would suffice for each key place or landmark object. However, this invariance is difficult to achieve, for example with respect to lighting (particularly outdoors) and viewpoint in the case of the SIFT keypoints that we use to represent salient regions. Given these shortcomings, in constructing a database, multiple entries of a landmark should directly increase the robustness of the recognition step in those aspects. In the following section 2.4.1 we describe the process of building the landmark database, which we hope provides a solution for these two problems.
The notion that the smaller the database or the number of features extracted per image (as in the case of using a salient region), the faster the matching process, implicitly assumes the need for a best match, which requires comparison with the whole database. For real-time systems such as robot localization, however, a positive match is all that we need, and once it is found, the search process can be stopped. To increase the likelihood of this happening early, our system utilizes a prioritization step to compare database entries in order from the most likely to the least. Thus, in practice, using several early-exit conditions, it never has to look at all the entries because (through experiments) it has a good idea of when to stop, given that a match is unlikely to be found thereafter. We will discuss this prioritization technique in section 2.4.2.
2.4.1 Landmark Database Construction
As explained earlier, a landmark is a collection of salient regions depicting the same real
world point of interest. Our intention here is to keep the region count to a minimum while
still storing all possible appearances. Because an individual salient region is represented
by SIFT keypoints (as well as the salient feature vector), these regions are less invariant
to out-of-plane viewpoint change and lighting. And thus, multiple entries of a landmark
should directly increase the robustness of the SIFT recognition step to these changes.
To alleviate SIFT's shortcomings under viewpoint change, when building a landmark we keep snapshots from all the viewing angles that are sufficiently different from the previous ones in the list. This strategy actually produces low salient region counts
because a lot of the regions isolate parts of the environment that are physically far
away from the robot (signs, buildings), which means that angle changes induced by its
movement do not affect their viewing as much. Note that we are not doing any three-
dimensional modeling to stitch together these views as it would be very difficult given
that our testing environments are cluttered and unconstrained.
As for lighting changes, as with other textures, SIFT keypoints are usually very different when the lighting disparity is wide. Our solution is to survey and select training sessions to include all distinct conditions but with enough overlap that the same landmarks from each session are considered similar and thus can be connected by the algorithm.
The training process consists of guided traversals of the robot through all the paths in the environment. The runs are performed several times under various lighting conditions (two runs each). This also allows for identification of landmarks that are consistent over
a number of runs. We survey and select training times to include all distinct conditions but with enough overlap for landmarks from each session to be combined. If there is one improvement we need for the system, it would be visual cues that work across a wider range of lightings. In addition, these traversals also provide all of the needed view-points from the intended paths; that is, angles under normal circumstances, and not from odd positions that are highly unlikely to occur if the robot control behaves properly. During the runs, currently, the robot only records images; the actual database construction is done off-line. We are exploring the possibility of doing this online as part of a system that is capable of performing Simultaneous Localization and Mapping (SLAM).
The landmark database building procedure is illustrated in figure 2.9 and it is as
follows: create a landmark database for each training episode (section 2.4.1.1), then
combine them together to create one complete landmark database (section 2.4.1.2). In
these sections, we describe procedures that need a few decision thresholds, which may
be viewed as making the approach weaker. However, they are quite intuitive and we try
to characterize what overall impact each of them has.
The combining process is done in pairs, between the training run database just completed and the resulting database from all the previous runs. At the individual-episode database construction level, we try to minimize the amount of unnecessary region repetition on a frame-to-frame basis, while at the across-episodes level the process of integration is more of an indexing consolidation. That is, when two landmarks from different sessions depict the same point of interest, we do not delete any salient regions; we just combine the lists of regions together and make a note of the sessions in which the landmark is detected.
Figure 2.9: The landmark database building procedure is done in two steps: create a current-episode landmark database for each
training run, then iteratively (one run at a time) integrate them together to create one complete landmark database.
2.4.1.1 Building a Database Within an Episode
Given a series of frames from a single training session/episode of robot traversal, we create a database that stores all the persistent landmarks in that session. Training is currently done with a person controlling the robot's movement, running straight through all the paths in the map. The driver also notes which segment the robot is currently on to enable segment labeling of the frames, because the landmarks in the database are compartmentalized by segment.
The following figure 2.10 illustrates the current run database building process.
Figure 2.10: Landmark database building for a single run. The system first finds and matches salient regions on a frame-to-frame basis. At the end, it then looks for the most consistent landmarks (the ones with high numbers of salient regions or the ones that span a large number of frames) to be permanently stored in the landmark database.
From the first frame we obtain a set of salient regions to create initial landmarks.
When the next set of regions arrives (from the subsequent frame), the system tries to
concurrently match them with the ones in existing landmarks. We first create a two-
dimensional match score matrix between all combinations of the incoming regions and
current landmarks.
As mentioned in section 2.3.2, we have two salient region matching criteria: SIFT and salient feature vector matching. If either one turns up negative, the score entry in the match matrix is set to zero. However, if both thresholds are passed, the score that we enter only involves the salient feature vector matching, which is the product of the salient feature similarity $sfSim$ (equation 2.16) and salient point location proximity $sfProx$ (equation 2.17):

$$sScore(sreg_1, sreg_2) = sfSim(sreg_1, sreg_2) \cdot sfProx(sreg_1, sreg_2) \qquad (2.19)$$
We only take the saliency score into account (and not the SIFT score) because we want to cluster regions based on just the salient landmark they depict, not on the overall region similarity. Adding the SIFT score would allow overlapping regions that portray different landmarks to be clustered together.
Once all the comparisons are performed, we start the insertion process by calculating the best/2nd-best score ratio for each region-landmark pair. The region with the highest ratio at the current iteration is inserted into the corresponding landmark. We keep doing
this until there are no more matches, at which point we create new landmarks for the remaining regions.
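A compact way to express this frame-to-landmark assignment loop is sketched below; the score matrix is assumed to already contain sScore values (zero where either the SIFT or salient feature vector test failed), and the tie-breaking details are simplified relative to the actual implementation.

```python
import numpy as np

def assign_regions(score_matrix):
    """Greedy assignment by best/2nd-best ratio.

    score_matrix[r, l] is the sScore between incoming region r and landmark l
    (0.0 if either matching criterion failed). Returns {region: landmark} for
    matched regions; unmatched regions would then seed new landmarks.
    """
    scores = score_matrix.astype(float)
    assignment = {}
    while True:
        best_ratio, best_region = 0.0, None
        for r in range(scores.shape[0]):
            if r in assignment or scores[r].max() <= 0.0:
                continue
            ordered = np.sort(scores[r])[::-1]
            second = ordered[1] if len(ordered) > 1 else 0.0
            ratio = ordered[0] / (second + 1e-9)
            if ratio > best_ratio:
                best_ratio, best_region = ratio, r
        if best_region is None:
            break                                  # no more positive matches
        assignment[best_region] = int(np.argmax(scores[best_region]))
    return assignment
```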
After all the frames are processed, we prune landmarks using two criteria: the number of salient regions and the range of frame numbers, both of which indicate persistence in the environment. The first criterion is for landmarks that are very salient but are viewed briefly (20 frames or less): their total region counts have to be larger than 7. The second is for landmarks that are less salient but are detected for a long period of time (a frame number range larger than 20): at least 5 regions. This set of thresholds controls how many landmarks we want to keep. We find that these values allow enough of the smaller but useful landmarks (with fewer regions registered) to be kept by the database. Adding even smaller ones would just increase the size of the database without getting much in return.
In the region assignment step, the situation becomes complicated for regions that are positively matched with multiple landmarks. First, we have to find the best landmark to insert into; adding the same salient region to multiple landmarks would unnecessarily increase the size of the database. Here, we select the one with the highest number of regions to create momentum towards the larger landmarks. Multiple landmark matches occur because matches that were supposed to happen in previous frames did not go through, and we are left with more than one landmark depicting the same point of interest. The reason we use the salient region count and not the score is that it is not unusual to have two landmark matches in which one has a large number of salient regions but a lower score, because the other, smaller, landmark holds the region that comes from the previous frame, and thus has a higher score. Usually this is because the smaller landmark is supposed to be part of the larger one, and the new region makes the connection even more evident. With this policy, we keep landmarks from splitting into smaller ones.
Furthermore, for this reason, the system also consolidates all the regions involved in
the multiple-landmark match to the largest among them. It does not combine the land-
marks together because the other regions in those landmarks may legitimately describe
other points of interest. For example, there is a possibility that two landmarks somehow
move closer together and overlap each other to create an ambiguity. What is achieved
by moving these regions to the largest landmark is that the involved landmarks become
less similar, which is a reasonable compromise.
Before explaining the transfer of salient regions, we would like to describe the actual
inner working of a landmark. In training, when a landmark is being built, it actually
consists of two lists: a main list and a temporary list. The main list has regions that
are going to be saved at the end, while the temporary list is discarded. For the purpose
of pruning at the end of training, however, the total number of salient regions is still
taken as the sum of the two lists. When an incoming salient region is compared to a
landmark, it is first compared with the regions in the main list (actually, just the last 10
in reverse order) and, if needed, the temp list as well (also the last 10 in reverse order).
If there is a match in the main list (and the landmark is also selected through the ratio test), the new region goes to the temporary list. If we find no match in the main list but one is found in the temporary list (and it also passes the ratio test), the corresponding matched region in the temporary list is put into the main list and the new one is put into the temporary list. What is essentially accomplished is that one region serves as a representative for as many regions as possible, until its appearance becomes so different from an incoming region that we are forced to store an additional one.
Figure 2.11: An example of how a series of 8 frames affects the number of salient regions
that are stored in a landmark during training/building. The number in the top left
corner of each grid is the frame number. The label next to it is the matching condition
or command that comes from the system.
Figure 2.11 shows how the algorithm works in a series of 8 frames. In the first frame, an initial salient region is used to create a landmark and is automatically placed in the main list. In frames 1 and 2, the new regions are sufficiently similar to region 0, so they are placed in the temp list (this is what happens the majority of the time). Frame 3 is an example where the landmark is momentarily not salient enough in the frame, or out of the field of view, and thus is not detected. In frame 4, the new region again goes to the temp list because it is also similar to region 0. In frame 5, however, we have an interesting case. The incoming region is not similar to region 0 but is close enough to the one from frame 4 (obviously because they are from back-to-back frames). The algorithm thus moves region 4 to the main list and inserts region 5 into the temp list. After another uneventful frame 6, the end signal is received and the last entry in the temp list is moved to the main list to produce a complete list that is to be saved. In this example, we save three frames out of the possible seven. We find that the save-to-discard ratio obtained is, on average, about 1:5; we keep only 20% of the training salient regions.
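The main/temporary list bookkeeping described above can be captured in a few lines of Python. This sketch abstracts the matching and ratio tests into a single is_similar callback and ignores the frame-number bookkeeping, so it should be read as an illustration of the policy rather than the actual code.

```python
class LandmarkBuilder:
    """Keeps one representative region until appearances drift too far."""

    def __init__(self, first_region):
        self.main = [first_region]   # regions saved at the end of training
        self.temp = []               # discardable stand-ins for recent frames

    def add(self, region, is_similar):
        """Insert a region that has already passed the match/ratio tests."""
        if any(is_similar(region, r) for r in self.main[-10:]):
            self.temp.append(region)           # current representative still valid
        elif any(is_similar(region, r) for r in self.temp[-10:]):
            # Appearance drifted: promote the matched temp region, keep the
            # new one as the next stand-in.
            match = next(r for r in reversed(self.temp[-10:])
                         if is_similar(region, r))
            self.temp.remove(match)
            self.main.append(match)
            self.temp.append(region)

    def finalize(self):
        """At the end of the run, promote the last temp entry and save main."""
        if self.temp:
            self.main.append(self.temp[-1])
        return self.main
```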
In the case of transfer-of-evidence for multiple landmark matches, we have to be
careful in how we re-link the lists in landmarks that lose a salient region. There are two different cases: the region to be moved is either in a main list or in a temp list. In the latter case, it can simply be moved because there is a similar region in the main list. If, on the other hand, the region is in the main list, we have to replace it with the one most
similar in the temp list: the one with the closest but higher (later) frame number.
2.4.1.2 Building a Database Across Episodes
The procedure of combining databases from individual training episodes into a complete database is done iteratively; we add one episode at a time. The matches are done at the landmark-to-landmark level for all incoming-and-stored landmark combinations. That is, when deciding whether two landmarks depict the same real-world point of interest, the system counts how many regions from one landmark match the ones from the other.
Figure 2.12: Multi-run Landmark Database Integration. The system consolidates land-
marks from different training runs by checking for overlap of matched salient regions.
If a pair of landmarks have a percentage of matched regions that goes past a preset
threshold, they are combined into one.
In addition, it also looks at the percentage of regions matched in the landmark to be
added. For two landmarks to be combined, they have to pass any of the following
thresholds:
• 2 to 5 match count and >= 50% of matches
• 6 to 10 match count and >= 25% of matches
• above 10 match count
That is, for the first threshold, if a landmark pair has 2 to 5 matches and the percentage of matches with respect to the incoming landmark is above 50%, we combine the two landmarks. The second threshold is similar (except for the numbers), while the last one does not need to check the percentage since there is already a high number (above 10) of matches. We find these values to be fairly safe, in that the combined landmarks are almost always the same points of interest found in different sessions.
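The threshold rules listed above amount to a small decision function; the sketch below is a direct transcription of them, where match_count is the number of cross-matched regions and incoming_size is the region count of the landmark being added (both assumed to be computed elsewhere).

```python
def should_combine(match_count: int, incoming_size: int) -> bool:
    """Decide whether two landmarks from different episodes are the same
    point of interest, following the match-count / match-percentage rules."""
    if incoming_size == 0:
        return False
    pct = match_count / incoming_size
    if match_count > 10:
        return True                      # many matches: no percentage check
    if 6 <= match_count <= 10:
        return pct >= 0.25
    if 2 <= match_count <= 5:
        return pct >= 0.50
    return False                         # 0 or 1 matches: never combine

# Example: 4 matched regions out of a 7-region incoming landmark -> combined.
print(should_combine(4, 7))              # True (4/7 is about 0.57 >= 0.50)
```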
In the actual process of combining two landmarks, we do not delete any salient regions (even if identical ones were taken from different sessions). We simply append one list to the back of the other to keep the episodic progression (for temporal priming of where the landmark is expected to be in later frames) intact. Also, if a landmark from the currently processed database matches more than one landmark in the accumulated database, the stored landmarks are first combined before the incoming landmark is appended. To keep the salient regions properly stored, they are sorted based on the session names first (in alpha-numeric order) and frame numbers second. The knowledge that certain landmarks are found in multiple sessions would be helpful in gauging their recall reliability, which we have not fully exploited. For one, we could add a filtering step at the end that prunes landmarks across sessions, keeping only the ones that occur in more than one episode.
The final result is a hierarchical database structure, illustrated in figure 2.13.
Figure 2.13: Hierarchical landmark database. The landmarks are compartmentalized by segment of origin. The landmarks themselves are collections of salient regions (portrayed by a stack of rectangles) depicting their respective landmarks. Within a landmark, the multiple columns of salient regions and corresponding feature vectors indicate that the landmark is detected in multiple episodes/lighting conditions.
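The hierarchy in figure 2.13 maps naturally onto a few nested data structures; the Python dataclasses below are only a schematic of that organization (the field names are ours), not the serialization format actually used by the system.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SalientRegion:
    session: str                 # training episode the region was recorded in
    frame: int                   # frame number within that episode
    sift_keypoints: list         # SIFT descriptors for the region window
    salient_features: list       # 1050-D salient feature vector

@dataclass
class Landmark:
    regions: List[SalientRegion] = field(default_factory=list)
    sessions: List[str] = field(default_factory=list)   # episodes it appears in

@dataclass
class SegmentEntry:
    segment_number: int
    landmarks: List[Landmark] = field(default_factory=list)

@dataclass
class LandmarkDatabase:
    segments: List[SegmentEntry] = field(default_factory=list)

    def landmarks_in_segment(self, segment_number: int) -> List[Landmark]:
        for seg in self.segments:
            if seg.segment_number == segment_number:
                return seg.landmarks
        return []
```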
2.4.2 Landmark Database Search Prioritization
One of the many advantages of a hierarchical database such as ours is that it allows the
system to avoid having to search through a long single-dimensional list. Through the
use of context information such as segment estimation, the system exploits the database
structure to funnel the search quickly and avoid parts of the database that are irrelevant.
We implement a search prioritization module that quickly puts the landmarks to be
compared in an order from the most likely to be positively matched to the least likely
during run-time. In real-time systems such as robots, it is a given that the database
search ends after the first match is found as the system does not have time to find all
positive matches and compare which one is the best. This procedure speeds up the
database search because, once a match is found, the search is called off and the result is
relayed to the back-end probabilistic localization module.
In addition, this module indirectly influences the salient region recognition step as
having a higher priority means a better chance of being selected than the later ones. This
is especially critical because the number of SIFT keypoints in salient region windows (as
low as single digits) is lower than if we were to use the whole scene. Although having
fewer keypoints to compare speedsup the process, it may lead to a possibility of turning
up false positive matches. So a form of ordering based on coarse feature matching goes
a long way to minimizing this problem.
We formulate a priority value for the comparison between landmark $lmk_{i,j}$ (from segment $i$, of index $j$) and incoming salient region $sReg_k$ according to equation 2.20, which weighs the following factors: segment estimation using gist features, salient feature similarity, and current location belief.

$$priority(lmk_{i,j}, sReg_k) = W_{gist} \cdot sval_i + W_{sal} \cdot salDiff(lmk_{i,j}, sReg_k) + W_{loc} \cdot dist(lmk_{i,j}, loc(S_t)) \qquad (2.20)$$

The weights used are $W_{gist} = .5$, $W_{sal} = .2$, and $W_{loc} = .3$ (found through experimentation).
Because the landmarks are arranged by segment of origin, the segment estimation values sval_i can be used to prioritize search order by segment likelihood. For salient feature similarity (second term), we compute salDiff(lmk_{i,j}, sReg_k), which is the Euclidean distance between the salient feature vector of the incoming region sReg_k and the landmark lmk_{i,j} average feature vector, the pre-computed average of the salient feature values of the regions in the landmark. Because this priority computation is done at the landmark level (landmarks, on average, have a little less than 20 regions), the procedure is still fast. The third term, dist(lmk_{i,j}, loc(S_t)), orders the landmarks by proximity to the current belief location S_t (our probabilistic formulation convention is explained below in section 2.6) for the state of the robot at time t, which adds a temporal aspect to the priority value. For the location of each landmark, we use the centroid of the salient region locations of that landmark. This way the number of distance calculations is on the order of the landmark count.
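To make equation 2.20 concrete, here is a minimal sketch of the priority computation in Python; the landmark and region fields (seg, avg_feature, centroid) and the Euclidean helper are illustrative assumptions about the data layout rather than the actual structures in our implementation, and how the resulting value is mapped into a pop order is left to the priority queue that consumes it.

```python
import math

# Prioritization weights from equation 2.20 (found through experimentation).
W_GIST, W_SAL, W_LOC = 0.5, 0.2, 0.3

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def priority(lmk, sreg, sval, belief_loc):
    """Priority value for comparing landmark lmk against incoming region sreg.

    lmk        : {'seg': segment of origin,
                  'avg_feature': pre-computed average salient feature vector,
                  'centroid': centroid of the landmark's salient region locations}
    sreg       : {'feature': salient feature vector of the incoming region}
    sval       : gist-based segment estimation values, one per segment
    belief_loc : current most-likely robot location on the map
    """
    seg_term = sval[lmk['seg']]                                # segment estimation
    sal_term = euclidean(lmk['avg_feature'], sreg['feature'])  # salient feature similarity
    loc_term = euclidean(lmk['centroid'], belief_loc)          # proximity to belief S_t
    return W_GIST * seg_term + W_SAL * sal_term + W_LOC * loc_term
```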
The system then creates a job item for each incoming salient-region-and-landmark combination for a multi-threaded search. These job items are put into a priority queue accessible by all the processors in the robot, so that each can perform the slow region recognition process in parallel.
We noticed that most regions that are found are discovered early in the search (within the first 2% of the maximum number of comparisons). Equipped with this knowledge, we employ a number of exit conditions that call off the search if:
• 3 regions are matched.
• 2 regions matched and 10% of queue has been processed since the last match.
• 1 region is matched and 20% of queue has been processed since the last match.
• no regions are matched and 33% of queue has been processed.
These heuristics are based on the idea that as the system matches more regions, it becomes increasingly less important for it to find new ones. During testing (section 3), we aggressively lower the above thresholds to 1, 2, and 3%, respectively, to drop the search duration to about 1 second. In the end, it typically takes about 1 to 2% of the maximum number of comparisons to complete a search.
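As a rough illustration of these exit conditions, the check below would be evaluated after each processed job item; the counter names are hypothetical, and the thresholds mirror the conservative values listed above (during the real-time tests they are lowered to 1%, 2%, and 3%).

```python
def should_stop(num_matched, jobs_done, jobs_since_last_match, total_jobs):
    """Return True when the prioritized landmark search should be called off.

    num_matched           : salient regions positively matched so far
    jobs_done             : job items processed so far
    jobs_since_last_match : job items processed since the last positive match
    total_jobs            : size of the priority queue for this frame
    """
    frac_total = jobs_done / float(total_jobs)
    frac_since = jobs_since_last_match / float(total_jobs)
    if num_matched >= 3:
        return True                              # enough regions, stop immediately
    if num_matched == 2 and frac_since >= 0.10:  # 10% of queue since last match
        return True
    if num_matched == 1 and frac_since >= 0.20:  # 20% of queue since last match
        return True
    if num_matched == 0 and frac_total >= 0.33:  # a third of queue, still nothing
        return True
    return False
```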
2.5 Salient Region Tracking
Even with the high efficiency of the landmark search process, the system needs on the order of seconds to go through the database of a large outdoor environment. This can make the robot unacceptably slow if each frame is tied to the completion of the search. We therefore employ a tracking system that allows the robot to move while the search is still in progress. That is, even if the robot moves during the procedure, the subsequent tracked locations of the regions that are currently being matched are also updated, allowing the current search to stay relevant. In this way, the process can run for multiple frames until matching results are ready to be applied as localization module inputs. In the meantime, the gist-based segment estimation result, which is available instantaneously on every frame, is used to maintain the location belief. Thus, both local and global features are able to update the location belief at their own individual time-scales.
The tracking is done by incorporating a Dorsal pathway module called salient region tracking. Originally, the intention was to have this module invoke a go-to-a-landmark behavior once the system identifies a landmark on the path to a goal location. By adding the tracking mechanism to the search, we produce a behavior where the robot is still moving towards its goal while simultaneously keeping an eye out for the next landmarks to follow.
Figure 2.14 illustrates the salient region tracking procedure. The tracker utilizes a template matching algorithm (OpenCV implementation with squared-error distance) on the conspicuity maps [37, 32] from the 7 Visual Cortex sub-channels. For each region, we perform template matching on each map (7 associated templates) before summing the resulting distance maps. The templates are initialized using 7x7 windows from each conspicuity map, centered around the salient point of the region. The conspicuity maps themselves are 40x30 in size, down-sampled from the original 160x120 input image size, which is acceptable given that we are mostly tracking large salient objects. We then weight the summation based on the distance from the previous time step for temporal filtering. The minimum coordinate is the predicted new location.
Figure 2.14: Diagram of salient region tracking. The system uses conspicuity maps from 7 sub-channels to perform template matching tracking. Before selecting the predicted location, the system weights the result with proximity to the point of the previous time step. In addition, we also adapt the templates over time for tracking robustness.
In addition, in each frame, we update the templates for robustness (to lighting changes, among other things) using the following adaptive equation:

T_{t,i} = .9 * T_{t-1,i} + .1 * N_{t-1,i}   (2.21)

Here, T_{t-1,i} is a region's template for sub-channel i at time t-1, while N_{t-1,i} is a new region template around the predicted location. Before the actual update, we add a
threshold to check whether the resulting template would change drastically, a sign that the
tracking may be failing. If this is the case, we do not update, hoping that the error is
just coincidental. However, if this occurs three times in a row, we report that tracking
has failed. This means that the search for that particular region is called off, and the
system resets itself to recognize a new region. A reset can also occur if the search takes
too long - here we use 15 frames.
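A minimal sketch of one tracking step and the template adaptation of equation 2.21 is given below, assuming the 7 conspicuity maps are available as 40x30 float32 NumPy arrays; the squared-error matching and the 0.9/0.1 blend follow the description above, while the exact distance-weighting profile and the border handling are illustrative assumptions.

```python
import cv2
import numpy as np

def track_region(conspicuity_maps, templates, prev_xy, dist_sigma=4.0):
    """One tracking step over the 7 conspicuity maps (each 40x30, float32).

    conspicuity_maps : list of 7 maps for the current frame
    templates        : list of 7 corresponding 7x7 templates for this region
    prev_xy          : (x, y) location predicted at the previous time step
    Returns the new (x, y) prediction for the salient point.
    """
    h, w = conspicuity_maps[0].shape
    total = np.zeros((h - 6, w - 6), np.float32)
    for cmap, tmpl in zip(conspicuity_maps, templates):
        # squared-error template matching, summed over the 7 sub-channels
        total += cv2.matchTemplate(cmap, tmpl, cv2.TM_SQDIFF)
    # penalize candidate locations far from the previous position (temporal filtering)
    ys, xs = np.mgrid[0:total.shape[0], 0:total.shape[1]]
    d2 = (xs + 3 - prev_xy[0]) ** 2 + (ys + 3 - prev_xy[1]) ** 2
    total *= 1.0 + d2 / (dist_sigma ** 2)
    y, x = np.unravel_index(np.argmin(total), total.shape)
    return (x + 3, y + 3)

def adapt_templates(templates, conspicuity_maps, new_xy, alpha=0.1):
    """Equation 2.21: T_t = 0.9*T_{t-1} + 0.1*N, with N cut around the new location.

    Border clipping is omitted for brevity; new_xy is assumed to be at least
    3 pixels away from the map edges.
    """
    x, y = new_xy
    return [(1.0 - alpha) * tmpl + alpha * cmap[y - 3:y + 4, x - 3:x + 4]
            for tmpl, cmap in zip(templates, conspicuity_maps)]
```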
On the other hand, when a positive match does occur in the Ventral pathway, we have to consider the latency between the time when the search starts and the time of completion. In this case, the system tries to match the tracked salient region as it appears in the current frame with the just-identified landmark. The match for the tracked region (in the current frame) is usually only a few index numbers away from the original region of the positively matched landmark, and this extra search usually takes just a few comparisons. This way, there is no need to project forward a matching result for a region that comes from a previous frame.
This process has proven to be quite robust because, aside from following regions that are already salient, the variety of feature domains means that, for a failure to occur, noise has to bias many sub-channels. In addition, the noise is also minimized since our saliency algorithm uses a biologically inspired spatial competition for salience (winner-take-all) [37, 32], where noisy sub-channels with too many peaks are subdued. Also, by re-using the conspicuity maps from the saliency model, the process is fast. But, more importantly, tracking requires no recall of stored information, which allows the procedure to scale with the size of the environment. In the current implementation, the system is usually able to track a salient region for up to 100 frames, and the process only takes about 5ms per frame for each salient region. On each frame, we track up to 10 regions: 5 for the ones already recognized in the previous frame (which drive the robot to a goal location), and 5 more that are currently being recognized.
2.6 Monte-Carlo Localization
When a landmark is recognized, we can use its associated location to deduce where
we are. There is a possibility that two identical landmarks may occur at two different
locations. In this case, we can rely on accumulated temporal context, thanks to our probabilistic localization back-end module.
We estimate robot position by implementing Monte-Carlo Localization (MCL), which utilizes Sampling Importance Resampling (SIR) [23, 89, 55]. We formulate the location belief state S_t as a set of weighted particles: S_t = {x_{t,i}, w_{t,i}}, i = 1 ... N, at time t, with N being the number of particles. Each particle (possible robot location) x_{t,i} is composed of a segment number snum and the percentage of length traveled ltrav along the segment edge, x_{t,i} = {snum_{t,i}, ltrav_{t,i}}. Each particle has a weight w_{t,i}, which is proportional to the likelihood of observing incoming data modeled by the segment and salient region observation models (explained in sections 2.6.2 and 2.6.3 below, respectively).
Note that the segment observation is applied before the salient region observation because segment estimation can be calculated almost instantaneously while salient region matching is much slower. Also, because of the tracking step, while a segment estimation is available at each time step, the region recognition result may not be, as it may not have been found yet or may return no match.
From experiments, N = 100 appears to suffice for the more compact topological localization domain, where a hallway is represented by an edge and not a two-dimensional space. We tried N as high as 1000 with no noticeable change in performance or computation speed. With N = 50 the performance starts to degrade, particularly in kidnapped-robot instances.
We estimate the location belief Bel(S_t) by recursively updating the posterior p(S_t | z_t, u_t), with z_t being the evidence and u_t the motion measurement, using [91]:

Bel(S_t) = p(S_t | z_t, u_t) = α p(z_t | S_t) ∫_{S_{t-1}} p(S_t | S_{t-1}, u_t) Bel(S_{t-1}) dS_{t-1}   (2.22)
We first compute p(S_t | S_{t-1}, u_t) (called the prediction/proposal phase) to take robot movement into account by applying the motion model to the particles. Afterwards, p(z_t | S_t) is computed in the update phase to incorporate the visual information by applying the observation models, segment estimation z'_t (eqn. 2.15) and matched salient regions z''_t (eqn. 2.18), to each particle for the weighted resampling steps.
The following algorithm shows the order in which the system computes its belief estimation Bel(S_t) at each time step t:

1. apply the motion model to S_{t-1} to create S'_t
2. apply the segment observation model to S'_t to create S''_t
3. if M_t > 0, apply the salient region observation model to S''_t to yield S_t; otherwise S_t = S''_t
Here, we specify two intermediate states: S'_t and S''_t. S'_t is the belief state after the motion model is applied to the particles. S''_t is the state after the segment observation (the first step of the update phase p(z_t | S_t)) is subsequently applied to S'_t. The segment observation is applied by weighted resampling using the likelihood function p(z'_t | x'_{t,i}) (equation 2.23 below) as weights. This function denotes the likelihood that a segment estimation z'_t is observed at location x'_{t,i}. Afterwards, the salient region observation model (the second step in the update phase p(z_t | S_t)) is applied to the belief state S''_t to produce S_t. This is done with weighted resampling using the likelihood function p(z''_t | x''_{t,i}) (equation 2.24 below) as weights, representing the likelihood that salient region match z''_t is found at x''_{t,i}.
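A skeleton of this per-frame belief update, following the ordering above, might look as follows; the motion and observation functions are placeholders for the models of sections 2.6.1 through 2.6.3, and the random-particle fractions (10% and 20%) are the values quoted below.

```python
import random

def resample(particles, weights, random_frac, spawn_random):
    """Weighted resampling, with a fraction of uniformly random particles added
    to avoid population degeneration."""
    n = len(particles)
    n_random = int(round(random_frac * n))
    drawn = random.choices(particles, weights=weights, k=n - n_random)
    return drawn + [spawn_random() for _ in range(n_random)]

def mcl_step(particles, odometry, seg_estimate, region_matches,
             motion_model, p_segment, p_regions, spawn_random):
    """One belief update, following the ordering S_{t-1} -> S'_t -> S''_t -> S_t."""
    # 1. prediction phase: apply the motion model to every particle
    s_prime = [motion_model(p, odometry) for p in particles]
    # 2. segment observation (equation 2.23), with 10% random particles
    w_seg = [p_segment(seg_estimate, p) for p in s_prime]
    s_dprime = resample(s_prime, w_seg, 0.10, spawn_random)
    # 3. salient region observation (equation 2.24), with 20% random particles,
    #    applied only when at least one region match (M_t > 0) is available
    if region_matches:
        w_reg = [p_regions(region_matches, p) for p in s_dprime]
        return resample(s_dprime, w_reg, 0.20, spawn_random)
    return s_dprime
```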
2.6.1 Motion Model
The system applies a straightforward motion model to each particle x_{t-1,i} in S_{t-1} by moving it by the distance traveled (odometry reading u_t) plus noise to account for uncertainties such as wheel slippage. We model this by drawing a particle x'_{t,i} from a Gaussian probability density p(x'_{t,i} | u_t, x_{t-1,i}), where the mean is the robot location in the absence of noise and the standard deviation is .1ft (about 1/6th of a typical single step). The latter controls the level of noise in the robot movement measurement. From our experiments, we find that this number does not affect the end result much, because the neighborhood of particles around a converged location (observe the belief map in figure 2.15) is large enough that motion error in any direction is well covered.
In the procedure, the distribution spawns a new location by only changing the length-traveled ltrav portion of a particle x'_{t,i}. The result is then checked for validity with respect to the map, as ltrav has a range of 0.0 to 1.0. If the value is below 0.0, the robot has moved back to a previous segment in the path, while if it is above 1.0, the robot has moved on to a subsequent segment. We take care of these situations by changing the segment snum and normalizing the excess distance (from the end of the original segment) to produce a corresponding ltrav. If the original segment ends in an intersection with multiple continuing segments, we simply select one randomly. If no other segment extends the path, we just resample.
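A sketch of this motion model on the (snum, ltrav) representation follows; the topo_map interface (segment lengths and continuations) is an assumed abstraction, and the backward case is only noted since it mirrors the forward one.

```python
import random

ODOMETRY_SIGMA_FT = 0.1   # standard deviation of the added Gaussian noise, in feet

def motion_model(particle, dist_ft, topo_map):
    """Advance one particle (snum, ltrav) by the odometry reading plus noise.

    topo_map.length_ft(snum)     -> length of segment snum in feet
    topo_map.next_segments(snum) -> segments continuing past its end
    Both are assumed interfaces to the topological map.  The backward case
    (ltrav < 0.0) mirrors the forward one and is only clamped here.
    """
    snum, ltrav = particle
    ltrav += random.gauss(dist_ft, ODOMETRY_SIGMA_FT) / topo_map.length_ft(snum)
    if ltrav > 1.0:
        nxt = topo_map.next_segments(snum)
        if not nxt:
            return None                       # no continuation: caller resamples
        excess_ft = (ltrav - 1.0) * topo_map.length_ft(snum)
        snum = random.choice(nxt)             # pick one continuing segment at random
        ltrav = excess_ft / topo_map.length_ft(snum)
    return (snum, min(max(ltrav, 0.0), 1.0))
```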
2.6.2 Segment-Estimation Observation Model
This model estimates the likelihood that the gist feature-based segment estimation correctly predicts the assumed robot location. So, we weigh each location particle x'_{t,i} in S'_t with w'_{t,i} = p(z'_t | x'_{t,i}) for resampling (with 10 percent random particles added to avoid the well-known population degeneration problem in Monte Carlo methods) to create belief S''_t. We take into account the segment-estimation vector z'_t by using:

p(z'_t | x'_{t,i}) = ( sval_{t,snum'_{t,i}} / Σ_{j=1}^{Nsegment} sval_{t,j} ) * sval_{t,snum'_{t,i}}   (2.23)

Here, the likelihood that a particle x'_{t,i} observes z'_t is proportional to the percentage of the estimation value of the robot's segment location sval_{t,snum'_{t,i}} over the total estimation value (first term) times the robot segment location value itself (second term). The rationale for the first term is to measure the segment's dominance with respect to all values in the vector; the more dominant it is, the more sure we are that the segment estimation is correctly predicting the particle's segment location. The second term preserves the ratio of the robot segment location value with respect to the maximum value of 1.0 so that we can distinguish the confidence level of the segment estimation prediction. Note that the likelihood function only makes use of the segment snum'_{t,i} information from particle x'_{t,i}, while ltrav'_{t,i} is left unused, as the precise location of the robot within the segment does not have any effect on segment estimation.
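In code, equation 2.23 reduces to a few lines; sval is the per-segment gist estimation vector for the current frame and the particle carries only its segment index and ltrav.

```python
def p_segment(sval, particle):
    """Likelihood of equation 2.23.

    sval     : gist-based segment estimation values for the current frame
    particle : (snum, ltrav); only the segment index snum is used
    """
    snum, _ltrav = particle
    total = sum(sval)
    if total <= 0.0:
        return 0.0       # degenerate estimation vector, treat as uninformative
    return (sval[snum] / total) * sval[snum]
```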
2.6.3 Salient-Region-Recognition Observation Model
In this model we want to measure the likelihood of simultaneously observing the matched salient regions given that the robot is at a given location. We weigh each particle x''_{t,i} in S''_t with w''_{t,i} = p(z''_t | x''_{t,i}) for resampling (with 20% random particles added, also to combat population degeneracy) to create belief S_{t+1}, by taking into account the salient region matches z''_t using:

p(z''_t | x''_{t,i}) = Π_{k=1}^{M_t} p(omatch_{t,k} | x''_{t,i})   (2.24)
Given that each salient-region match observation is independent, we simply multiply them to calculate the total likelihood. The probability of an individual match p(omatch_{t,k} | x''_{t,i}) is modeled by a Gaussian with the standard deviation σ set to 5% of the environment map's diagonal. The likelihood value is the probability of drawing a length longer than the distance between the particle and the location where the matched database salient region was acquired. σ is set proportional to the map diagonal to reflect that the larger the environment, the higher the level of uncertainty. The added noise is twice that of the segment observation because the salient region observation probability density is much narrower, and we find that 20% keeps the particle population diverse enough to allow for dispersion and correct re-convergence in a kidnapped-robot event. Also, although the SIFT and salient feature vector matching scores (explained in section 2.3.2 above) are available for use as weights, we do not use them in the likelihood function directly. These matching scores were thresholded to identify the positive salient region matches we are now considering in this section. We do not reason with match quality because the thresholds alone eliminate most false positives.
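Equation 2.24 can be sketched as below; the use of the one-sided Gaussian tail for "the probability of drawing a length longer than the distance" is an interpretation of the text, so treat it as an assumption rather than the exact implementation.

```python
import math

def p_regions(match_locations, particle_xy, map_diag_ft):
    """Likelihood of equation 2.24: product over all matched salient regions.

    match_locations : map locations (x, y) where the matched database regions
                      were originally acquired
    particle_xy     : the particle's position projected onto the map
    map_diag_ft     : diagonal of the environment map; sigma = 5% of it
    """
    sigma = 0.05 * map_diag_ft
    likelihood = 1.0
    for mx, my in match_locations:
        d = math.hypot(particle_xy[0] - mx, particle_xy[1] - my)
        # probability of drawing a length longer than d under N(0, sigma)
        likelihood *= 0.5 * math.erfc(d / (sigma * math.sqrt(2.0)))
    return likelihood
```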
A snapshot illustration of how the system components work together is shown in figure 2.15.
Figure 2.15: A snapshot of the system test-run. The top-left (main) image contains the salient region windows. A green window means a database match, while a red one means no match was found. A salient region match is displayed next to the main image. Below the main image is the segment estimation vector derived from gist (there are 9 possible segments in the environment). The middle image projects the robot state onto the map: cyan disks are the particles, the yellow disks are the locations of the matched database salient regions, and the blue disk (the center of the blue circle, here partially covered by a yellow disk) is the most likely location. The radius of the blue circle is equivalent to five feet. The right-most histogram is the number of particles at each of the 9 possible segments. The robot believes that it is towards the end of the first segment, which is correct within a few feet.
Chapter 3
Testing And Results
In order to meaningfully assess the localization system performance, it is important to set up testing in the most challenging venues to stress it to the limit. In our case, these venues are visually contrasting large-scale outdoor environments. We use the following criteria to select the sites: variety in appearance challenges, variability in area covered and path lengths, and various lighting conditions.
In our testing setup (section 3.1), we isolate issues concerning just localization, and not navigation, in that the system is not asked to actually produce motor commands, only to accurately and efficiently figure out where it is.
We thoroughly test the system, starting with the segment classification (section 3.2). We then test the localization system (section 3.3) without the use of tracking, as a baseline of how accurate the system can be if it is given as much time as needed. We include the prioritization module because it actually helps prune out incorrect salient region matches.
In section 3.4, we then take a step back and examine the information recall efficiency by gauging how much the landmark search-priority technique helps increase the speed of the system and whether it also maintains its accuracy. Lastly, we test the system in a real-time setting (section 3.5), pushing to 10 frames per second, which requires the landmark tracking module.
3.1 Testing Setup
We test the system at three sites on the campus of the University of Southern California (map shown in figure 3.1): the Ahmanson Center for Biological Research (ACB), the Associate and Founders park (AnF), and the Frederick D. Fagg park. Each site has nine segments. Both the training and testing data are available online [76].
We are currently building an affordable mobile robot platform called Beobot 2.0 [77], which will have 8 dual-core 2.2GHz machines. At this point, however, we test the system on a 16-core 2.6GHz computing cluster, operating on 160x120 images. Note that because we are only focusing on localization skills, and not navigation, where autonomous control is involved, the platform on which the algorithm is run has no bearing on the outcome. In addition, we still feed the input to the system as fast as 10 frames per second. In spite of this, we agree that making the system work on a mobile robot in an outdoor environment would be much more impressive, and we are currently working toward a fully functioning localization and navigation system.
As for now, the visual data is gathered using an 8mm handheld camcorder carried by a person. The captured video clips are slightly less stable because of a few rough spots on the road, although the camera itself has a mechanism to smooth out image jitter.
Figure 3.1: Map of the three experiment sites at the USC campus: the Ahmanson Center for Biological Research (ACB), the Associate and Founders park (AnF), and the Frederick D. Fagg park.
In addition, no camera calibration or lens distortion correction is performed, although performing them might have helped salient region matching.
We divide the video clips into predetermined segments for classification using gist. The path is divided along natural geographical delineations, so that images within each segment look similar to a human observer. Moreover, when separating the data at a junction, we take special care to create a clean break between the two involved segments. That is, we stop short of a crossing for the current segment and wait a few moments before starting the next one. This ensures that the system is trained with data where the ground-truth labeling (assigning a segment number to an image) is unambiguous. In addition, all the frames in the clips are included; no by-hand selection is done.
For the current testing setup, we are also slightly selective in filming, as it is done during off-peak hours when fewer people are out walking. It should be pointed out that the gist features can absorb foreground objects as part of a scene as long as they do not dominate it. In addition, because we obtain multiple salient points for each frame, even if there are many distractions, the desired landmarks, so long as they are salient, should still be detected.
At this point the data is still somewhat view-specific, as each location is only traversed from one direction of the path. For larger view-invariant scene recognition, we would need to train the system on multiple views [94]. We have tried to sweep the camera left to right (and vice versa) to create a wider point of view, although to retain performance the sweep has to be done at a much slower pace than regular walking speed [78].
Because the data is recorded at approximately constant speed and we record clips for individual segments separately, we use interpolation to come up with the ground-truth locations for both training and testing. We calculate the walking velocity using the distance of a particular path (available in the map) divided by the amount of time it takes for the person to traverse it (identical to the clip duration). We can place the locations of the start and end of the clip because they are pre-specified. For the frame locations in between, we assume a uniform capture interval to advance the person's location properly. In all experiments, a denoted error signifies the measured difference (in feet) between the robot belief and this generated ground-truth location.
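The interpolation amounts to a linear mapping from frame index to distance along the segment; the tiny sketch below (with made-up numbers in the example) captures the assumption of constant walking speed and a uniform capture interval.

```python
def ground_truth_ft(frame_idx, num_frames, path_length_ft):
    """Interpolated ground-truth distance (in feet) from the segment start.

    Assumes approximately constant walking speed, a uniform capture interval,
    and a clip that spans exactly one pre-specified segment.
    """
    if num_frames <= 1:
        return 0.0
    return path_length_ft * frame_idx / float(num_frames - 1)

# Example (made-up numbers): frame 150 of a 600-frame clip over a 300 ft
# segment is placed about 75.1 ft from the segment start.
# ground_truth_ft(150, 600, 300.0)  ->  75.125...
```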
The main issue in collecting training samples is selecting filming times that include all lighting conditions. Because the lighting space is hard to gauge, we perform trial-and-error to come up with the appropriate times of day (up to 6 per day): from the brightest (noon) to the darkest (early evening). We also include different weather conditions such as overcast and clear, as well as notable changes in appearance due to increases in temperature (hazy mid-afternoon). Although we are bound to exclude some lighting conditions, the results show that the collected samples cover a large portion of the space; in each site we have between 9 and 11 training runs. In addition, for each of the three sites we have 4 testing runs with different lighting conditions. Note that 10 of the 12 testing clips were taken on a different date than the training clips. As for the two testing clips taken on the same day as training, the testing data were recorded in the early evening (dark lighting) while the training data were taken near noon (bright lighting). In all, there are 26,368 training and 13,966 testing frames for the ACB site, 66,291 training and 26,387 testing frames for the AnF site, and 82,747 training and 34,711 testing frames for the FDF site.
3.1.1 Site 1: Ahmanson Center for Biological Research (ACB)
The first site is the 126x180ft. Ahmanson Center for Biological Research (ACB) building complex. This experiment site is chosen to investigate what the system can achieve in a rigid and less spacious man-made environment. Figure 3.2 shows a scene from each of the segments of ACB, ordered 1 to 9, from first to third row, left to right. This is also the arrangement of display for the other two sites. For this site, most of the surroundings are flat walls with little texture and solid lines that delineate the different parts of the buildings.
Observe figure 3.3 for the map of the segments. Each segment is a straight line and
part of a hallway. Some hallways are divided into two segments so that each segment is
approximately of the same length. Note also that the map shows that some segments
are not part of a single continuous path, but a series of available walkways within an
environment, traversed from a single direction.
Figure 3.2: Examples of images in each segment of Ahmanson Center for Biological
Research, ACB (segments 1 through 9 from left to right and top to bottom).
Figure 3.3: Map of the path segments of the ACB site.
Figure 3.4: Lighting conditions used for testing at the ACB site. Left to right: late
afternoon (trial 1), early evening (trial 2), mid-afternoon (trial 3), and noon (trial 4).
Figure 3.4 represents the four lighting conditions used in testing: late afternoon,
early evening (note that the lights are already turned on), mid-afternoon, and noon
(trial number 1, 2, 3, 4, respectively). We chose two dark and two bright conditions to
ensure a wide range of testing conditions.
3.1.2 Site 2: Associates and Founders Park (AnF)
The second site is a 270x360ft. area comprised of two adjoining parks, Associate and Founders park (AnF), in which large parts of the scenes are practically unrecognizable as they are overrun by leaves, which makes it difficult to isolate objects. Compared to the ACB site, this is conceivably a more difficult task: localization along longer paths (about twice the lengths of the segments in the first site) in a vegetation-dominated site. Figure 3.5 displays a sample image from each of the segments, while figure 3.6 maps them out.
Figure 3.7 shows the four lighting conditions tested: overcast (trial 1), early evening with lights already turned on (trial 2), mid-afternoon (trial 3), and noon (trial 4). As we can see in the images, there are fewer rigid structures, and the few objects that exist in the environment (lamp posts and benches) tend to look small with respect to the image size. Also, objects can either be taken away (e.g., the bench in the top right image in figure 3.7) or added, such as service vehicles parked or a huge storage box placed in the park for a day. In addition, whole-scene matching using local features would be very inefficient as well as noisy, because the leaves produce a high number of random texture-like patterns that significantly contaminate the process.
Figure 3.5: Examples of images in each segment of Associate and Founders park, AnF (segments 1 through 9 from left to right and top to bottom).
Figure 3.6: Map of the path segments of the AnF site.
Figure 3.7: Lighting conditions used for testing at the AnF site. Clockwise from top left: overcast (trial 1), early evening (trial 2), mid-afternoon (trial 3), and noon (trial 4).
3.1.3 Site 3: Frederick D. Fagg park (FDF)
The third and final site is a 450x585ft. open area in front of the Leavey and Doheny libraries called the Frederick D. Fagg park (figure 3.8). Students use the area to study
outdoors and to catch some sun. A large portion of the scenes is the sky, a mostly textureless space with random light clouds. The main motivation for testing at this site is to assess the system response to sparser scenes (figure 3.8) and an even larger environment (the segments are about 50% longer than the ones in the second experiment, and three times those of experiment 1). Figure 3.9 shows the map of the segments.
Figure 3.8: Examples of images, one from each segment of Frederick D. Fagg park, FDF
(segments 1 through 9 from left to right and top to bottom).
Figure 3.10 shows the four lighting conditions tested: early evening with the street lights not yet turned on (trial 1), evening with the street lights already turned on (trial 2), noon (trial 3), and middle of the afternoon (trial 4).
Figure 3.9: Map of the path segments of Frederick D. Fagg (FDF) park site.
Figure 3.10: Lighting conditions used for testing at the FDF site. Clockwise from top left: early evening (trial 1), evening (trial 2), noon (trial 3), and middle of afternoon (trial 4).
3.2 Gist Model Testing Results
Here, our focus is on segment classification using the gist signature computed by our
model. We extensively test the model in the three environments: ACB, AnF, and FDF
(sections 3.2.1, 3.2.2, and 3.2.3, respectively).
We use the same neural network classifier architecture in all three sites, each with
nine output layer nodes (same as the number of segments). The network’s intermediate
layers have 200 and 100 nodes, respectively, while we have 80 input nodes (for the 80
features of the PCA/ICA dimension-reduced gist vector). That is a total of: 80*200 +
200*100 + 100*9 = 36,900 connections.
We use absolute encoding for the training data. That is, if the correct answer for an
image is segment 1, the corresponding node is assigned 1.0, while the others are all 0.0.
Our cut-off for convergence is 1% training error.
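For reference, the classifier is a standard fully connected 80-200-100-9 network trained with back-propagation on the one-hot ("absolute") targets; the sketch below only lays out the forward pass and target encoding, and the sigmoid non-linearity and weight initialization are assumptions of the sketch, not details taken from the thesis.

```python
import numpy as np

# Layer sizes: 80 PCA/ICA gist features -> 200 -> 100 -> 9 segment outputs.
# Weight count: 80*200 + 200*100 + 100*9 = 36,900 connections (biases omitted,
# matching the connection count quoted in the text).
SIZES = [80, 200, 100, 9]

def init_weights(seed=0):
    rng = np.random.default_rng(seed)
    return [rng.normal(0.0, 0.1, (m, n)) for m, n in zip(SIZES[:-1], SIZES[1:])]

def forward(weights, gist_vec):
    """Forward pass; the sigmoid non-linearity is an assumption of this sketch."""
    a = np.asarray(gist_vec, dtype=float)
    for w in weights:
        a = 1.0 / (1.0 + np.exp(-(a @ w)))
    return a            # 9 outputs; argmax gives the predicted segment

def one_hot_target(segment_idx, num_segments=9):
    """'Absolute' encoding: 1.0 for the correct segment, 0.0 for all others."""
    t = np.zeros(num_segments)
    t[segment_idx] = 1.0
    return t
```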
Note that, at the time, all training is done on a 1.667GHz Athlon AMD machine,
which explains why the training time is quite slow. However, if we focus only on the
relative training duration between the test-sites, we can still compare the effects of
scalability in the neural-network training.
In addition to separate testing on individual sites, we also combine all three environments to show that the gist model's performance does not degrade when we increase the number of segments (described in section 3.2.4).
3.2.1 Experiment 1: Ahmanson Center for Biological Research (ACB)
We train the system several times before choosing a training convergence that gives the highest classification result, to avoid a local minimum on the training data. After only about twenty epochs, the network converges to less than 1% error. A fast rate of training convergence in the first few epochs appears to be a telling sign of how successful classification will be during testing.
Table 3.1 shows the results of the experiment in this site. Here, the term "False+" for segment x denotes the number of incorrect segment x guesses given that the correct answer is another segment, divided by the total number of frames in the segment. Conversely, "False-" is the number of incorrect guesses given that the correct answer is segment x, divided by the total number of frames in the segment. The table shows that the system is able to classify the segments consistently during the testing phase, with a total error of 12.04%, or an overall 87.96% correctness.
We also report the confusion matrix in table 3.2 and find that the errors are, in general, not uniformly distributed. Spikes of classification errors between segments 1 and 2 suggest a possibility of significant scene overlap (segment 2 is a continuation of segment 1; see figure 3.3). On the other hand, there are also errors that are not as easily explainable. For example, there are 163 false positives for segment 2 when the ground truth is segment 7 (and 141 false positives in the other direction). From figure 3.2 we can see that there is little resemblance between the two appearance-wise. However, if we consider just the coarse layout, the structure of both segments is similar, with a white region on the left side and a dark red one on the right side.
Trial 1 Trial 2 Trial 3 Trial 4 Total
Segment False+ False- False+ False- False+ False- False+ False- False+ Percent. False- Percent.
1 14/390 11/387 17/380 47/410 32/393 27/388 39/445 5/411 102/1608 6.34% 90/1596 5.64%
2 20/346 114/440 133/468 101/436 85/492 54/461 18/325 131/438 256/1631 15.70% 400/1775 22.54%
3 1/463 3/465 0/456 29/485 82/502 43/463 33/475 31/473 116/1896 6.12% 106/1886 5.62%
4 7/348 18/359 24/338 7/321 5/226 84/305 7/148 108/249 43/1060 4.06% 217/1234 17.59%
5 46/348 5/307 52/389 0/337 64/290 95/321 125/403 41/319 287/1430 20.07% 141/1284 10.98%
6 24/567 13/556 39/478 56/495 23/533 24/534 69/564 7/502 155/2142 7.24% 100/2087 4.79%
7 43/410 71/438 55/371 129/445 136/439 95/398 108/486 22/400 342/1706 20.05% 317/1681 18.86%
8 101/391 0/290 18/265 0/247 67/320 21/274 37/303 22/288 223/1279 17.44% 43/1099 3.91%
9 65/320 86/341 46/404 15/373 17/262 68/313 29/227 98/296 157/1213 12.94% 267/1323 20.18%
Total 321/3583 384/3549 511/3457 465/3376 1681/13965
Percent. 8.96% 10.82% 14.78% 13.77% 12.04%
Table 3.1: Ahmanson Center for Biology Segment Classification Experimental Results
Table 3.2: Ahmanson Center for Biology Segment Classification Confusion Matrix
Segment number guessed by algorithm
True segment number
Segment 1 2 3 4 5 6 7 8 9
1 1506 39 0 1 0 12 25 0 13
2 77 1375 0 0 0 54 141 11 117
3 1 6 1780 19 40 11 4 24 1
4 3 1 66 1017 66 49 14 18 0
5 0 10 4 0 1143 4 114 4 5
6 0 9 7 3 61 1987 10 10 0
7 18 163 0 13 1 19 1364 82 21
8 0 14 1 3 15 3 7 1056 0
9 3 14 38 4 104 3 27 74 1056
3.2.2 Experiment 2: Associates and Founders Park (AnF)
The results for the AnF site are shown in table 3.3. The confusion matrix (table 3.4) is also reported. As with the first site, we perform multi-layer neural network classification using the back-propagation algorithm with the same network architecture and parameters. The number of epochs for training convergence is less than 40, about twice that of Experiment 1.
A quick glance at table 3.3 reveals that the total error of 15.79% (84.21% success rate) is higher than in Experiment 1. However, if we consider the challenges presented by the scenes, it is quite an accomplishment to have a drop-off of less than 4% in performance. Furthermore, no calibration is done in moving from the first environment to the second. In addition, the increase in segment length does not affect the results drastically, either. The results from the third experiment, which has even longer segments, will confirm this assessment. It appears that a longer segment does not necessarily mean more variability to absorb, because the majority of the scenes within a segment do not change all that much. The confusion matrix (table 3.4) shows that the errors are marginally more uniform than in Experiment 1 (few zero entries). This is probably because the environment is less structured and prone to more random classification errors, which makes the errors more spread out.
3.2.3 Experiment 3: Frederick D. Fagg park (FDF)
Table 3.5 shows the results for the FDF experiment, listing a total error of 11.38% (88.62% classification). For training, the number of epochs goes up by about tenfold, and the amount of time for convergence is roughly double that of the AnF site experiment, at about 50 minutes.
The result from trial 2 (7.95% error) is the best among all runs for all experiments. We think that this is because the lighting very closely resembles that of the training data. That run was conducted at noon, when the lighting tends to stay the same for long periods of time. As a performance reference, when we test the system with a set of data taken right after a training set, the error rates are about 9% to 11%. Furthermore, when training images with approximately the same lighting condition as the testing data are excluded during training, the error for that testing run usually at least triples (to about thirty to forty percent), which suggests that lighting coverage in the training phase is a critical factor.
The confusion matrix for Experiment 3 (table 3.6) is also reported. Overall, the results are better than Experiments 1 and 2 even though the segments are longer on average. It can be argued that the system performance degrades gracefully with the subjectively-assessed visual difficulty of the environment, experiment 2 (AnF) being the most challenging one.
Table 3.3: Associate and Founders Park Segment Classification Experimental Results
Trial 1 Trial 2 Trial 3 Trial 4 Total
Segment False+ False- False+ False- False+ False- False+ False- False+ Percent. False- Percent.
1 71/559 210/698 177/539 440/802 140/786 245/891 49/733 62/746 437/2617 16.70% 957/3137 30.51%
2 38/544 64/570 107/429 6/328 271/558 187/474 122/584 12/474 538/2115 25.44% 269/1846 14.57%
3 57/851 71/865 54/814 217/977 206/1096 78/968 38/996 5/963 355/3757 9.45% 371/3773 9.83%
4 61/518 31/488 72/611 58/597 221/730 179/688 131/652 111/632 485/2511 19.32% 379/2405 15.76%
5 82/669 30/617 142/867 45/770 121/785 110/774 54/744 87/777 399/3065 13.02% 272/2938 9.26%
6 300/1254 47/1001 265/1210 177/1122 273/1084 192/1003 148/1079 167/1098 986/4627 21.31% 583/4224 13.80%
7 42/297 167/422 177/643 104/570 54/553 62/561 76/416 59/399 349/1909 18.28% 392/1952 20.08%
8 54/577 75/598 73/696 69/692 59/771 85/797 60/770 58/768 246/2814 8.74% 287/2855 10.05%
9 106/737 116/747 53/858 4/809 146/655 353/862 69/732 186/849 374/2982 12.54% 659/3267 20.17%
Total 811/6006 1120/6667 1491/7018 747/6706 4169/26397
Percent. 13.50% 16.80% 21.25% 11.14% 15.79%
Table 3.4: Associate and Founders Park Segment Classification Confusion Matrix
Segment number guessed by algorithm
True segment number
Segment 1 2 3 4 5 6 7 8 9
1 2180 32 21 28 212 374 96 52 142
2 20 1577 43 7 1 186 7 4 1
3 118 26 3402 73 51 4 27 33 39
4 3 131 89 2026 4 66 45 5 36
5 68 2 9 13 2666 49 8 105 18
6 22 86 59 142 40 3641 161 6 67
7 38 62 1 14 15 201 1560 4 57
8 78 24 73 52 14 29 3 2568 14
9 90 175 60 156 62 77 2 37 2608
3.2.4 Experiment 4: Combined sites
In addition, to gauge the system's scalability, we combine scenes from all three sites and train a classifier to differentiate between twenty-seven segments. The only difference in the neural-network architecture is that the output layer now consists of twenty-seven nodes. The number of input and hidden nodes remains the same. The number of connections is increased by 1,800 (18 new output nodes times 100 second-hidden-layer nodes), from 36,900 to 38,700 connections (4.88%). We use the same procedure as well as training and testing data (175,406 and 75,073 frames, respectively). The training process takes much longer than that of the other experiments: about 260 epochs, with the last 200 epochs or so converging very slowly from 3% down to 1%. When training, we print the confusion matrix periodically to analyze the process of convergence, and we find that the network first converges on inter-site classification before going further and eliminating the intra-site errors. We organize the results into segment-level (Table 3.7) and site-level (Table 3.8) statistics.
Table 3.5: Frederick D. Fagg Park Segment Classification Experimental Results
Trial 1 Trial 2 Trial 3 Trial 4 Total
Segment False+ False- False+ False- False+ False- False+ False- False+ Percent. False- Percent.
1 22/657 246/881 40/699 11/670 44/684 207/847 28/735 246/953 134/2775 4.83% 710/3351 21.19%
2 246/1022 12/788 53/749 44/740 56/727 126/797 105/758 225/878 460/3256 14.13% 407/3203 12.71%
3 11/691 178/858 0/689 7/696 341/1218 45/922 282/1147 5/870 634/3475 16.93% 235/3346 7.02%
4 5/799 43/837 3/663 80/740 53/883 7/837 35/757 99/821 96/3102 3.09% 229/3235 7.08%
5 18/440 409/831 11/390 369/748 2/696 0/694 16/870 0/854 47/2396 1.96% 778/3127 24.88%
6 343/1976 47/1680 12/1550 27/1565 182/1772 122/1712 243/1770 145/1672 780/7068 11.04% 341/6629 5.14%
7 0/806 231/1037 25/944 4/923 0/675 182/857 30/886 38/894 55/3311 1.66% 455/3711 12.26%
8 483/1607 48/1172 436/1568 79/1211 319/1581 93/1355 149/1244 175/1270 1387/6000 23.12% 395/5008 7.89%
9 86/825 0/739 65/866 24/825 42/579 257/794 164/788 119/743 357/3058 11.67% 400/3101 12.90%
Total 1214/8823 645/8118 1039/8815 1052/8955 3950/34711
Percent. 13.76% 7.95% 11.787% 11.75% 11.38%
Table 3.6: Frederick D. Fagg Park Segment Classification Confusion Matrix
Segment number guessed by algorithm
True segment number
Segment 1 2 3 4 5 6 7 8 9
1 2641 11 165 0 0 364 0 170 0
2 55 2796 106 0 7 97 0 70 72
3 0 71 3111 0 0 153 0 0 11
4 0 136 1 3006 0 0 0 92 0
5 9 23 0 0 2349 63 13 660 10
6 22 30 254 5 6 6288 0 3 21
7 0 154 0 0 24 62 3256 168 47
8 45 30 40 35 6 3 40 4613 196
9 3 5 68 56 4 38 2 224 2701
For segment-level classification, the total error rate is 13.55%. We expected the results to be somewhat worse than in all three previous experiments, where each site is classified individually. However, such is not the case: the result is better than the AnF result (15.79%) while being marginally worse than the other two (12.04% for ACB and 11.38% for FDF). Notice also that the individual errors between the single-site and combined experiments change as well. The results for AnF segments in the combined setup improve by 2.47% to 13.32% error, while the rates for segments in ACB and FDF degrade by 4.24% and 1.25%, respectively. From the site-level confusion matrix (table 3.8), we see that the system can reliably pin a given test image to the correct site with only 4.54% error (95.46% classification). This is encouraging because we can utilize multiple classifiers that provide various levels of place classification. For instance, when the system is unsure about its segment location, it can at least rely on being at the right site.
Another concern in combining segments from different sites is that the number of samples for each segment becomes unbalanced: some segments take less than 15 seconds to walk through while others can take up to a minute and a half.
Table 3.7: Combined Sites Segment Classification Experimental Results
ACB AnF FDF
Segment False+ % err. False- % err. False+ % err. False- % err. False+ % err. False- % err.
1 292/1657 16.20% 231/1596 14.47% 565/3120 18.11% 582/3137 18.55% 231/2306 10.02% 1276/3351 38.08%
2 277/1710 16.20% 342/1775 19.27% 636/2159 29.46% 323/1846 17.50% 455/3175 14.33% 483/3203 15.08%
3 275/2031 13.54% 130/1886 6.89% 555/4198 13.22% 130/3773 3.45% 893/3881 23.01% 358/3346 10.70%
4 61/1211 5.04% 84/1234 6.81% 233/2401 9.70% 237/2405 24.05% 56/3102 1.81% 189/3235 5.84%
5 129/1208 10.68% 205/1284 15.97% 583/3251 17.93% 270/2938 9.19% 107/3115 3.43% 119/3127 3.81%
6 162/2040 7.94% 209/2087 10.01% 926/4462 20.75% 688/4224 16.29% 784/6426 12.20% 987/6629 14.89%
7 308/1438 21.42% 551/1681 32.78% 298/1680 17.74% 570/1952 29.20% 309/3704 8.34% 316/3711 8.52%
8 83/961 8.64% 221/1099 20.11% 730/3278 22.72% 307/2855 10.75% 300/4833 6.21% 475/5008 9.48%
9 116/1139 10.18% 300/1323 22.68% 257/3115 8.25% 409/3267 12.52% 551/3472 15.87% 180/3101 5.80%
Total 1703/13395 12.714% 2273/13965 16.276% 4783/27664 17.290% 3516/26397 13.320% 3686/34014 10.837% 4383/34711 12.627%
Total 10172/75073= 13.55%
Table 3.8: Combined Sites Segment Classification Site-Level Confusion Matrix
Site ACB AnF FDF False-/Total Pct. err
ACB 12882 563 520 1083/13965 7.76%
AnF 350 25668 379 729/26397 2.76%
FDF 163 1433 33115 1596/34711 4.60%
False+ 513 1996 899 3408
Total 13395 27664 34014 75073
Pct. err 3.83% 7.22% 2.64% 4.54%
It is possible that the lower number of samples for ACB may yield a network convergence that gives heavier weight to correctly classifying the longer segments from AnF and FDF.
From the site-level statistics (table 3.8), we can see that the trend somewhat holds,
although not to an alarming extent.
3.3 Localization Testing Results
We now test the localization system on the same three sites (sections 3.3.1, 3.3.2, and 3.3.3 below). At this time we allow the landmark search process to run to completion for every frame (that is, we do not utilize salient region tracking) to establish a baseline of how accurate the system can be if it is given as much time as needed. In this series of experiments, we use the optimum prioritization weights of W_gist = .5, W_sal = .2, and W_loc = .3. For the early-exit parameters, we use the suggested and very conservative thresholds of 10%, 20%, and 33% for two, one, and zero regions matched, respectively.
In addition, in section 3.3.4, we compare the presented system, which employs both local features (SIFT keypoints within salient regions and the salient feature vector at the salient point) and global (gist) features, with two systems that use only local features (SIFT) or only global features (gist). The back-end Monte-Carlo localization modules in all three instances are kept identical. For the SIFT-only system, we take out the salient feature vector from the region signature to end up with only SIFT features. In [82] we have compared our gist system with other place recognition systems and found that the results are comparable. Thus, the gist-only localization comparison may also be indicative of how well place recognition systems can perform in a metric localization task.
3.3.1 Experiment 1: Ahmanson Center for Biological Research (ACB)
Table 3.9 shows the results for the ACB experiment, with an overall error of 3.24ft. In general, the error is uniformly distributed across segments, although spikes in segments 2 and 5 are clearly visible. The error in segment 2, which comes from trials 1, 2, and 4, occurred because the identified salient regions (mainly the textured white building and its entrance door in figure 3.4) are at the end of the hallway and do not change size much even after a 3m robot displacement. This is also the case for the error spike in segment 5 for trial 4, where the system latches onto a water tower (second image of the second row of figure 3.2).
The errors in segment 5 from trials 3 and 4 (bright lighting) partially originate from the camera's exposure control, which tries to properly normalize the range of frames with wide intensity contrast (the scenes are comprised of very bright sky and dark buildings) and ends up darkening the buildings for a few seconds, something to consider when selecting a camera to film outdoor scenes. During this time, the segment estimator produces incorrect values and the SIFT module is unable to recognize any regions in the image, which throws off the robot belief completely.
Table 3.9: Ahmanson Center for Biology Experimental Results
Trial 1 Trial 2 Trial 3 Trial 4 Total
Segment number error number error number error number error number error
frames (ft) frames (ft) frames (ft) frames (ft) frames (ft)
1 387 3.14 410 3.35 388 2.39 411 2.46 1596 2.84
2 440 6.14 436 9.43 461 2.29 438 5.44 1775 5.78
3 465 3.47 485 2.26 463 2.90 474 4.43 1887 3.26
4 359 3.26 321 3.15 305 3.28 249 3.22 1234 3.23
5 307 3.83 337 2.02 321 5.79 319 6.44 1284 4.49
6 556 1.97 495 3.79 534 2.45 502 1.85 2087 2.50
7 438 1.56 445 1.97 398 2.78 400 2.68 1681 2.23
8 290 1.94 247 3.75 274 2.54 288 2.90 1099 2.75
9 341 2.18 373 1.64 313 1.96 296 1.94 1323 1.92
Total 3583 3.06 3549 3.54 3457 2.87 3377 3.48 13966 3.24
It seems that for the system to fail, all parts (saliency, SIFT, and gist matching) have to fail. In addition, because of the use of Monte Carlo localization, an incorrect match would have to persist for some time in order to significantly perturb the system.
3.3.2 Experiment 2: Associates and Founders Park (AnF)
The results (table 3.10) reveal an overall error of 8.62ft, but with noticeably higher performance disparity between segments. Which segments produce high displacements also differs across trials. On average (last column of the table), though, all segments have roughly equal errors. However, the average error difference (last row) between the two dim lighting trials (1 and 2) and the bright lighting trials (3 and 4) is quite significant. It seems that low lighting, or more importantly the lack of unpredictable and ephemeral sunlight (observe the grass in the bottom two images of figure 3.7), allows for uniform lighting and better correlation between training and testing runs.
Table 3.10: Associate and Founders Park Experimental Results
Trial 1 Trial 2 Trial 3 Trial 4 Total
Segment number error number error number error number error number error
frames (ft) frames (ft) frames (ft) frames (ft) frames (ft)
1 698 3.95 802 6.17 891 13.74 746 5.79 3137 7.74
2 570 7.89 328 6.23 474 18.90 474 6.23 1846 9.99
3 865 5.28 977 10.90 968 6.59 963 15.26 3773 9.62
4 488 10.49 597 5.66 688 5.14 632 9.36 2405 7.46
5 617 10.97 770 4.37 774 5.59 777 11.02 2938 7.84
6 1001 5.09 1122 5.91 1003 10.76 1098 11.08 4224 8.21
7 422 3.57 570 13.15 561 8.03 399 9.17 1952 8.79
8 598 8.28 692 10.20 797 7.26 768 5.53 2855 7.72
9 747 7.02 809 5.45 862 11.61 849 16.53 3267 10.31
Total 6006 6.75 6667 7.50 7018 9.48 6706 10.52 26397 8.62
In the end, although the results are worse than in experiment 1, it is quite an accomplishment given the challenges presented by the scenes, and no by-hand calibration is done in moving from the first environment to the second.
3.3.3 Experiment 3: Frederick D. Fagg park (FDF)
Table 3.11 shows the results, listing an overall error of 11.34ft, worse than the other two sites. It seems that an increase in environment size has some effect on the results. However, we find that the more direct cause is that a substantial number of the landmarks recognized are far away from the robot, particularly in the FDF site, as they tend to be buildings or signs. These are actually good landmarks because they are persistent and easily detected. However, they also make accurate localization (within feet) visually difficult because, as the robot moves toward them, their appearance hardly changes.
Furthermore, this problem is exacerbated by the SIFT recognition module performing scale-invariant matching (with the scale ratio included as part of the result). Because of this, we decided to limit the matching-scale threshold to between 2/3 and 3/2. We do this so that when we get a positive match we do not have to do pose estimation, as a match may have as few as 5 correspondences.
Table 3.11: Frederick D. Fagg Park Experimental Results
Trial 1 Trial 2 Trial 3 Trial 4 Total
Segment number error number error number error number error number error
frames (ft) frames (ft) frames (ft) frames (ft) frames (ft)
1 881 4.72 670 6.51 847 9.46 953 4.63 3351 6.25
2 788 21.55 740 16.14 797 7.56 878 13.08 3203 14.50
3 858 11.32 696 13.52 922 4.89 870 7.03 3346 8.89
4 837 14.89 740 14.03 837 6.46 821 15.05 3235 12.55
5 831 11.21 748 12.40 694 15.40 854 9.96 3127 12.08
6 1680 18.09 1565 12.60 1712 10.64 1672 12.42 6629 13.44
7 1037 11.29 923 9.75 857 10.96 894 10.99 3711 10.76
8 1172 16.19 1211 10.57 1355 7.20 1270 11.01 5008 11.08
9 739 9.93 825 8.95 794 12.04 743 12.31 3101 10.78
8823 13.72 8118 11.61 8815 9.25 8955 10.79 34711 11.34
However, we find that even a scale ratio of 0.8 (the region found is smaller than the one matched in the database) can translate to a geographical difference of 15ft in the FDF site. Thus, although these building/sign salient regions are stable localization cues, they are not good for fine-grained location pin-pointing. We would need closer (<10ft away) regions for convergence to the correct location.
We would also like to add that, in these cases, a stereoscopic system would not have performed better because these landmarks are too far away to perceive accurate depth. On the bright side, however, the position disparities mostly occur along the path where the ground truth is. Compared to the ground truth, the robot's belief is either a bit behind or ahead but not completely off. One more encouraging point is that the system seems to be able to cope with a variety of lighting conditions. The results are better than the preliminary results [79] because of better lighting coverage in training, despite the fact that training and testing are done on separate days. In this site, for example, we have dark (trials 1 and 2) and bright (trials 3 and 4) conditions, even with long shadows cast on the field (trial 4 scene in figure 3.10).
3.3.4 Experiment 4: Sub-module Analysis
Tables 3.12, 3.13, and 3.14 show a comparison of systems that use only local features (SIFT), only global features (gist), and our presented bio-system, which uses both global and local features, for the ACB, AnF, and FDF sites, respectively. The gist-only system cannot localize to the metric level because it can only pin-point the location to the segment level, and some segments have lengths of more than 100 feet. The SIFT-only system, on the other hand, is close to the presented system. However, there is a clear improvement between the two. In the ACB site, the improvement is 42.53%, from 5.63ft in SIFT-only to 3.24ft in our system (one-sided t-test t(27930) = -27.3134, p < 0.01), while in the AnF site it is 18.65%, from 10.60ft to 8.62ft (one-sided t-test t(52792) = -15.5403, p < 0.01), and in the FDF site it is 23.74%, from 14.86ft to 11.34ft (one-sided t-test t(69420) = -32.3395, p < 0.01). On several occasions, the SIFT-only system completely misplaced the robot. In our system, on the other hand, whenever the salient region (SIFT and salient feature vector) matching is incorrect, the gist observation model is available to correct mistakes. In contrast, the SIFT-only system can only make a decision from one recognition module. Additionally, in kidnapped-robot situations (we inserted 4 instances per run for ACB and AnF, and 5 for FDF, about once every several thousand frames), the presented system is faster to correctly relocalize because it receives twice the amount of observations (both global and local) as the SIFT-only system.
Table 3.12: Ahmanson Center for Biology Model Comparison Experimental Results
System
Trial 1 Trial 2 Trial 3 Trial 4
err. (ft) err. (ft) err. (ft) err. (ft)
gist 25.62 24.17 29.92 19.97
SIFT 5.23 5.54 6.29 5.48
bio-system 3.06 3.54 2.87 3.48
Table 3.13: Associate and Founders Park Model Comparison Experimental Results
System
Trial 1 Trial 2 Trial 3 Trial 4
err. (ft) err. (ft) err. (ft) err. (ft)
gist 42.94 59.84 66.00 46.87
SIFT 8.87 9.81 11.35 12.15
bio-system 6.75 7.50 9.48 10.52
In addition, the search time for the SIFT-only model is also much longer than for our system. This is because we use the gist features (segment estimation) not only as an observation model, but also as context information to order the comparisons between input and stored salient regions. By the same token, we also use the salient feature vector as an initial comparison (if the salient feature vectors of the reference and test regions differ significantly, there is no need for SIFT matching).
Table 3.14: Frederick D. Fagg Park Model Comparison Experimental Results
System
Trial 1 Trial 2 Trial 3 Trial 4
err. (ft) err. (ft) err. (ft) err. (ft)
gist 78.61 85.69 80.25 89.41
SIFT 15.02 16.28 12.75 15.51
bio-system 13.72 11.61 9.25 10.79
3.4 Landmark Database Prioritization Testing Results
Now that we have shown that our system is able to satisfactorily localize in a number of challenging outdoor environments with various lighting conditions, we are going to
take a step back and turn our attention toward the other critical factor in applying it to
mobile robots: speed.
In real-time systems such as robots, the balance between speed and accuracy has always been a tricky proposition. This balance is exactly what the prioritization technique tries to strike by sorting the database entries and using the early-exit strategy. In this set of experiments, we systematically tune the parameters to see how fast the system can be without the use of salient region tracking. In addition, we also focus on whether the system can maintain its accuracy as the thresholds are lowered.
As mentioned earlier, we currently test our system on a 16-core 2.6GHz machine,
operating on 160x120 images. We time the individual sub-modules of the system and
find that the gist and saliency computation times (also implemented in parallel where
each sub-channel has its own thread) are about 20ms. The salient region acquisition
(windowing) takes 10ms, while the segment estimation takes less than 1ms. In addition,
the salient region tracking (up to 10 regions per frame) takes 50ms in total and the
back-end Monte Carlo localization itself takes less than 1ms because it only uses 100
particles.
As expected, the slowest part of the system, by far, is the landmark database search process. This is because, unlike the other processes, it needs to recall stored memory for matching an incoming salient region with the ones in the database. Through the use of a hierarchical landmark database, however, we have a chance to speed up the search. Here, we report two experiments that examine the components of our speed-up strategy: landmark prioritization (section 3.4.1) and the early-exit strategy (section 3.4.2). The latter experiment is actually a compound of both, as search prioritization performed well above expectation and actually improves accuracy on two (AnF and FDF) of the three sites.
Also, from here on out, our speed-related testing is done using only the first of the four testing runs, partially because each testing run takes far too long (up to a few days), even though one testing run per site already produces plenty of data. Another reason is that we believe we can make a well-educated deduction from just one run per site, as the aspects that we are examining are clear enough that additional runs would not change the findings. Because of this decision, for this and subsequent sections, we discuss the results not for each site individually, but for the three of them together.
3.4.1 Search Prioritization
We test the system using different prioritization weights in equation 2.20 to demonstrate
the impact of this parameter. As a baseline for this experiment, we assign random
priorities to each landmark in the first run. We then run the system using each individual
priority cue (segment estimation, salient feature vector proximity, and current location
belief) exclusively, by zeroing out the weights of all but one (the desired) of the terms in
the equation. Lastly, we report the results using the optimal weights of $W_{gist} = .5$,
$W_{sal} = .2$, and $W_{loc} = .3$, which we found through experimentation. For this
experiment, we report the efficiency of the system as the number of comparisons with the
salient regions in the landmark database and the processing time in seconds.
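As a minimal sketch of how this weighted prioritization might look, the following assumes hypothetical per-landmark cue scores; only the weight values are taken from the text, everything else is illustrative rather than the actual implementation.

#include <algorithm>
#include <vector>

// Hypothetical per-landmark cue scores, each normalized to [0,1]:
// higher means the landmark is more likely to match the current input.
struct LandmarkCues {
  int    id;
  double segmentScore;   // agreement with the gist-based segment estimation
  double saliencyScore;  // proximity of the salient feature vectors
  double locationScore;  // closeness to the current location belief
};

// Weighted-sum priority, with the optimal weights reported in the text.
static double priority(const LandmarkCues& c,
                       double wGist = 0.5, double wSal = 0.2, double wLoc = 0.3) {
  return wGist * c.segmentScore + wSal * c.saliencyScore + wLoc * c.locationScore;
}

// Sort the database so that the most promising entries are compared first.
void prioritizeSearchOrder(std::vector<LandmarkCues>& db) {
  std::sort(db.begin(), db.end(),
            [](const LandmarkCues& a, const LandmarkCues& b) {
              return priority(a) > priority(b);
            });
}

The individual-cue baselines in the tables correspond to setting two of the three weights to zero before sorting.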
Tables 3.15, 3.16, and 3.17 report the effect of different prioritization policies for
ACB, AnF, and FDF, respectively.
Table 3.15: Ahmanson Center for Biology Experiment 1 Results

Number of Segments: 9            Number of Landmarks in Database: 1501
Number of Training Sessions: 9   Number of Salient Regions/landmark: 19.79
Number of testing frames: 3583   Number of Salient Regions: 29710

Search Order       found                      not found                  total                      error       input rate
Policy             % search   # of sreg./fr.  % search   # of sreg./fr.  % search   # of sreg./fr.  (ft.)       (ms)
random priority    27.64%     2.77±1.14       100.00%    2.13±1.14       59.06%     4.89±0.40       3.46±4.84   4739.60
segment priority   6.17%      2.77±1.14       100.00%    2.13±1.14       46.91%     4.89±0.40       3.57±3.61   4393.80
saliency priority  6.24%      2.77±1.14       100.00%    2.13±1.14       46.95%     4.89±0.40       3.62±4.98   4442.37
location priority  3.38%      2.77±1.14       100.00%    2.13±1.14       45.34%     4.89±0.40       3.67±3.58   4183.09
combination        1.03%      2.77±1.14       100.00%    2.13±1.13       44.00%     4.89±0.40       3.63±3.83   4229.42

NOTE (applies to all 3 tables): The first part of the table reports the environment parameters and training results.
The second part shows the performance of each run (with different prioritization parameters), which consists of the
percentage of the compared salient regions in the database and the average number of regions/frame for input regions
that are found, not found, and the total.
Table 3.16: Associate and Founders Park Experiment 1 Results

Number of Segments: 9             Number of Landmarks in Database: 4664
Number of Training Sessions: 10   Number of Salient Regions/landmark: 17.69
Number of testing frames: 6006    Number of Salient Regions: 82502

Search Order       found                      not found                  total                      error        input rate
Policy             % search   # of sreg./fr.  % search   # of sreg./fr.  % search   # of sreg./fr.  (ft.)        (ms)
random priority    24.87%     3.52±1.14       100.00%    1.47±1.14       46.96%     4.98±0.14       7.48±10.33   11986.18
segment priority   5.77%      3.52±1.14       100.00%    1.47±1.14       33.47%     4.98±0.14       7.60±10.00   10303.86
saliency priority  5.24%      3.52±1.14       100.00%    1.47±1.14       33.09%     4.98±0.14       7.25±9.92    10779.05
location priority  2.36%      3.52±1.14       100.00%    1.47±1.14       31.06%     4.98±0.14       7.35±9.45    9632.37
combination        0.86%      3.52±1.14       100.00%    1.47±1.14       30.00%     4.98±0.14       7.07±9.29    7198.14
Table 3.17: Frederick D. Fagg Park Experiment 1 Results

Number of Segments: 9             Number of Landmarks in Database: 4808
Number of Training Sessions: 11   Number of Salient Regions/landmark: 18.86
Number of testing frames: 8823    Number of Salient Regions: 90660

Search Order       found                      not found                  total                      error         input rate
Policy             % search   # of sreg./fr.  % search   # of sreg./fr.  % search   # of sreg./fr.  (ft.)         (ms)
random priority    29.33%     3.02±1.24       100.00%    1.75±1.25       55.30%     4.77±0.72       15.21±19.37   15226.00
segment priority   8.85%      3.05±1.24       100.00%    1.73±1.25       41.85%     4.78±0.71       14.56±17.44   13439.31
saliency priority  8.18%      3.05±1.24       100.00%    1.73±1.25       41.41%     4.78±0.71       14.00±15.89   12162.98
location priority  3.50%      3.05±1.24       100.00%    1.73±1.25       38.45%     4.78±0.71       14.44±15.23   12184.63
combination        1.28%      3.05±1.24       100.00%    1.73±1.25       37.02%     4.78±0.71       13.95±16.14   8305.34
The first part of each table states the testing information and the size of the database.
Note the large numbers of stored salient regions: between 29,710 and 90,660. The
second part shows the performance of each run (with different prioritization parameters),
which consists of the percentage of the searched salient regions in the database and the
number of input regions per frame for the ones that are found, not found, and the
total. The number of input salient regions is capped at 5 per frame [80], and the totals
differ slightly due to the small amount of noise added in the saliency model. The last
two columns show the error (in feet) and the processing time per frame (in ms), i.e. the
reciprocal of the frame rate. Note that these times are quite high; it can sometimes take
up to a minute to process a single frame.
Within each environment, the errors are approximately the same given the standard
deviations, although at the AnF and FDF sites the error actually decreases. We attribute
this improvement to the fact that prioritization indirectly influences the salient region
recognition step, given that the database search ends after the first match is found.
The tables also show that, on each site, the system improves its matching efficiency
the most using the optimal priority weights, compared to using random or individual priority
terms (segment, saliency, or location), because the combination exploits both the instantaneous
appearance (segment and saliency) and temporal (current predicted location) factors. This difference
in percentage of comparisons is even more pronounced if we focus only on the salient
region matches that are eventually found (1.03%, 0.86%, and 1.28% of the database,
respectively). However, these numbers are drowned out by the number of comparisons
spent on regions that are eventually not found (noted by the 100% of the database
searched) and, thus, the overall running times are still very long. At some point, given
how long the search has gone on, the system should notice that it is better to quit the
search, since a match is unlikely to be found. And so, in the next experiment, we
test this intuition with the early-exit strategy described at the end of section 2.4.2.
3.4.2 Landmark Database Search Early Exit Strategy
We start with the suggested initial parameters (10%, 20%, and 33% of the database for the
cases of two, one, and zero regions already found, respectively) and lower them until we
see a dip in localization accuracy, indicating a performance limit.
Tables 3.18, 3.19, and 3.20 (for ACB, AnF, and FDF, respectively) report the results in
the same format as the set of tables in the previous experiment.
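To make the mechanism concrete, the following is a minimal sketch, with assumed names and structure rather than the actual implementation, of how such an early-exit test can be folded into the prioritized database search: the fraction of the database the search is allowed to visit shrinks as more of the current frame's regions have already been matched.

#include <cstddef>

// Fraction of the landmark database the search may visit before giving up,
// indexed by how many salient regions of the current frame are already matched.
// The values correspond to the most conservative setting tested in the text.
struct EarlyExitPolicy {
  double maxFrac[3] = {0.33, 0.20, 0.10};  // 0, 1, or 2+ regions already found

  bool shouldStop(std::size_t compared, std::size_t dbSize,
                  std::size_t regionsAlreadyFound) const {
    std::size_t idx = regionsAlreadyFound < 2 ? regionsAlreadyFound : 2;
    return static_cast<double>(compared) / static_cast<double>(dbSize) > maxFrac[idx];
  }
};

In the search loop, shouldStop() would be checked after each salient-region comparison, so unlikely matches are abandoned long before the entire database is traversed.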
Just by comparing the first two lines of the tables, we can see a tremendous difference
in the total percentage of the database compared between a system without an early-exit policy
(denoted by 100% on all three columns) and one using the most conservative policy: speed-ups
of 5.99 (44.00%/7.35%), 8.57 (30.00%/3.50%), and 6.05 (37.02%/6.12%) times for
the respective sites. Most of the reduction occurs because the system does not
need to search the entire database before deciding that a match may not exist, down
from 100% to 14.16%, 7.41%, and 12.31% of the database, respectively. It is
encouraging to see that a large fraction of the regions are still found early on, compared
to the eventual total found without early exit (2.50/2.77, 2.84/3.52, and 2.57/3.05 per image, respectively).
Note that the input rates may not correlate linearly with the total search percentages.
For example, in table 3.20, a drop from 37.02% to 6.12% translates to a reduction of
only 2387.08 ms. We believe this is partly because different salient regions have different
numbers of SIFT keypoints, so that one region takes longer to compare than another.
Additionally, our recorded time is real wall-clock time and not CPU time.
Table 3.18: Ahmanson Center for Biology Early-Exit Experiment Results

Early-Exit Parameter     found                      not found                  total                      error       input rate
2 regs  1 reg   0 regs   % search   # of sreg./fr.  % search   # of sreg./fr.  % search   # of sreg./fr.  (ft.)       (ms)
100%    100%    100%     1.03%      2.77±1.14       100.00%    2.13±1.13       44.00%     4.89±0.40       3.63±3.83   4229.42
10%     20%     33%      0.85%      2.50±0.85       14.16%     2.39±0.88       7.35%      4.89±0.40       3.46±3.08   958.97
5%      10%     20%      1.00%      2.53±0.89       8.15%      2.36±0.91       4.45%      4.90±0.40       3.29±2.72   420.75
3.3%    6.7%    10%      0.99%      2.53±0.90       5.95%      2.37±0.92       3.38%      4.89±0.40       3.28±4.76   371.94
1.67%   3.3%    5%       0.93%      2.50±0.91       4.10%      2.39±0.93       2.48%      4.89±0.40       3.59±5.89   327.35
1%      2%      3%       0.87%      2.46±0.93       3.14%      2.44±0.95       2.00%      4.89±0.40       3.57±5.26   249.11
Table 3.19: Associate and Founders Park Early-Exit Experiment Results

Early-Exit Parameter     found                      not found                  total                      error        input rate
2 regs  1 reg   0 regs   % search   # of sreg./fr.  % search   # of sreg./fr.  % search   # of sreg./fr.  (ft.)        (ms)
100%    100%    100%     0.86%      3.52±1.14       100.00%    1.47±1.14       30.00%     4.98±0.14       7.07±9.29    7198.14
10%     20%     33%      0.54%      2.84±0.68       7.41%      2.14±0.68       3.50%      4.98±0.14       6.55±5.20    3278.77
5%      10%     20%      0.58%      2.84±0.68       4.30%      2.14±0.68       2.18%      4.98±0.14       6.58±5.53    2361.05
3.3%    6.7%    10%      0.56%      2.83±0.68       3.15%      2.15±0.68       1.68%      4.98±0.14       6.16±5.34    1903.20
1.67%   3.3%    5%       0.52%      2.81±0.71       2.24%      2.17±0.71       1.27%      4.98±0.14       6.66±8.35    1467.99
1%      2%      3%       0.46%      2.75±0.78       1.79%      2.24±0.79       1.06%      4.98±0.14       7.63±12.54   1236.20
Table 3.20: Frederick D. Fagg Park Early-Exit Experiment Results

Early-Exit Parameter     found                      not found                  total                      error         input rate
2 regs  1 reg   0 regs   % search   # of sreg./fr.  % search   # of sreg./fr.  % search   # of sreg./fr.  (ft.)         (ms)
100%    100%    100%     1.28%      3.05±1.24       100.00%    1.73±1.25       37.02%     4.78±0.71       13.95±16.14   8305.34
10%     20%     33%      0.82%      2.57±0.81       12.31%     2.21±0.92       6.12%      4.78±0.71       12.96±11.40   5918.46
5%      10%     20%      0.82%      2.57±0.81       6.95%      2.21±0.93       3.65%      4.78±0.71       13.12±10.57   3990.99
3.3%    6.7%    10%      0.81%      2.57±0.82       4.81%      2.21±0.93       2.66%      4.78±0.71       12.65±10.14   3007.23
1.67%   3.3%    5%       0.76%      2.54±0.83       3.27%      2.24±0.94       1.94%      4.78±0.71       13.15±13.44   2083.04
1%      2%      3%       0.68%      2.47±0.88       2.55%      2.31±1.00       1.58%      4.78±0.71       15.61±27.16   1630.91
This is an issue because the load on the mini-cluster we used for testing may vary (as
it is a shared machine). In some respects, this would also be expected on a real robot,
where we may need to allocate some CPU time to other assigned tasks. For
these experiments, we therefore prefer to compare performance based on the percentage of
the total number of comparisons; we were less concerned about the times for now because
we expected the algorithm to still be relatively slow. For the following experiments in
section 3.5, we made sure that our system was the only one running on the computer
during testing.
In addition, there is a slight drop in error between the system without the early-exit
strategy and all of the other settings up to the most aggressive one (last line), although
the difference is not statistically significant given the standard deviations. We posit that the
longer the search lasts, the more likely a false positive becomes; as it turns out, a "not found"
may serve the system better. In addition, by quitting early, we can move on to subsequent
frames, hoping that the new regions are easier to identify.
As for the trend in search-percentage reduction when lowering the early-exit parameters,
both AnF and FDF suggest that the cutoff is the 1.67%-3.3%-5% line, as the
setting below it exhibits an increase in error. And if we look at the drop in total search
percentage between the two (2.48% to 2.00%, 1.27% to 1.06%, and 1.94% to 1.58% on the
respective sites), there appears to be a diminishing return for further reduction. Thus,
we choose the 1.67%-3.3%-5% early-exit parameters for the full system with salient region
tracking.
3.5 Salient Region Tracking Testing
In spite of a parallel search implementation using 16 dispatched threads that compare
input regions with different parts of the landmark database, the process still takes seconds.
Using the salient region tracking module, however, we no longer have to bind the
system to the completion of the search on each frame, and in this section we
progressively test the system from 1 fps to 10 fps. The actual mechanism for
speeding up the system is changing the input refresh rate. The system, using a
back-end particle-based Monte Carlo localization [80], has a location belief that is always
available for query. Whenever the motion or observation (segment estimation or database
match) models would like to update the distribution, each does so at its own pace.
We then query the result (which can be done instantaneously) just before we start
processing the next frame, so the faster the refresh rate, the more frequently the query results
are recorded.
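A minimal sketch of this decoupling, with assumed class and method names rather than the actual implementation, is the following: the belief is shared and protected, each model updates it whenever it finishes, and the main loop simply reads the current estimate before grabbing the next frame.

#include <mutex>

// Hypothetical shared location belief maintained by the Monte Carlo back-end.
struct LocationBelief {
  double x = 0.0, y = 0.0;   // most likely coordinate
  int    segment = -1;       // most likely segment
};

class BeliefStore {
 public:
  // Called by the motion model, segment estimation, or database-match
  // observation whenever each finishes, at its own pace.
  void update(const LocationBelief& b) {
    std::lock_guard<std::mutex> lock(mutex_);
    belief_ = b;
  }
  // Called by the main loop just before processing the next frame;
  // returns immediately with whatever the current estimate is.
  LocationBelief query() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return belief_;
  }
 private:
  mutable std::mutex mutex_;
  LocationBelief belief_;
};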
Below, we test the system with input rates of 1000 ms, 500 ms, 200 ms, and 100 ms,
except for ACB, which only uses the last two rates. In addition, as a baseline,
we also test the system with a rate of infinity (the system can search the database to
completion at each frame), taken from the optimum result of the previous experiment. The reason
we discard the other input rates for ACB is that they bring no advantage: the area
covered by the site (or, more directly, the number of salient regions stored) is manageable
for the system, given that the average search time at the infinity rate is only 246.61 ms.
Our goal for the system is 100 ms per frame, or 10 frames per second (fps), which is 1/3 of the
standard NTSC 30 fps. This is because, at this speed, and given that the system's processing
latency (the time it takes to respond to an acquired image) is two times the frame duration,
the robot would appear acceptably fluid in the eyes of a human observer.

Table 3.21: Ahmanson Center for Biology Tracking Experiment Results

Input Rate   Robot Speed   Num.       Average Search   Average Region   Error
(ms.)        (mi./hr)      Searches   Time (ms.)       Identified       (ft.)
infinity     0.124         3583       327.35           2.50/4.89        3.59
200          0.153         2180       268.76           2.17/4.90        4.79
100          0.305         1257       246.40           2.17/4.89        4.95

Table 3.22: Associate and Founders Park Tracking Experiment Results

Input Rate   Robot Speed   Num.       Average Search   Average Region   Error
(ms.)        (mi./hr)      Searches   Time (ms.)       Identified       (ft.)
infinity     0.030         6006       1467.99          2.81/4.98        6.66
1000         0.035         4618       754.74           2.21/4.99        8.79
500          0.071         3517       689.64           2.21/4.99        9.97
200          0.176         1770       589.79           2.16/4.98        11.60
100          0.353         971        577.95           2.13/4.98        11.67
Additionally, using the input rate, we can calculate how fast the robot moves. We
first calculate the cameraman’s walking speed using the frames’ ground-truth locations
and the frame capture speed (30 fps): 5.5 to 7.5 m/s (depending on the environment),
slightly faster than the walking speed of a normal human. Because the frame rate is 1/3
the normal rate, the robot effectively moves at 1/3 the walking speed of the cameraman.
The result for each site is displayed in tables 3.21, 3.22, and 3.23, respectively.
The first column shows the input rate and the second column lists the corresponding
robot speed. The third column shows how many searches occurred in the trial, while the
fourth and fifth columns display the average search time (in ms) and the average number of
salient regions identified over the average number of regions detected per frame, respectively.
The number of searches at the infinity rate is also the total number of test frames. The last
column shows the error (in feet), which is used as a performance/accuracy measure.

Table 3.23: Frederick D. Fagg Park Tracking Experiment Results

Input Rate   Robot Speed   Num.       Average Search   Average Region   Error
(ms.)        (mi./hr)      Searches   Time (ms.)       Identified       (ft.)
infinity     0.027         8823       2083.04          2.54/4.78        13.15
1000         0.043         5828       959.21           2.13/4.74        15.73
500          0.087         4232       847.26           2.17/4.78        16.24
200          0.217         2046       776.19           2.07/4.60        23.01
100          0.435         1220       682.35           1.94/4.65        23.80
In terms of accuracy differences between input rates, notice that none of the errors at
the fastest rates increases past twice the error at the slowest rate (infinity). When the system
is forced to deal with a faster input rate, the segment estimation results play a more significant
role in shaping the location belief, as they provide coarse segment-level perception on each
frame. However, as time goes on, although the segment estimate stays consistent, the
coordinate location drifts away bit by bit.
In addition, in our testing setup, whenever there are breaks between segments, we
added a kidnapped-robot test instance. These instances contribute to the error quite
significantly. In these cases, the segment estimation quickly (about 10 frames at the
100 ms rate) puts the robot in the correct segment, but not at the correct location until
a salient region is recognized (which takes about 30 to 50 frames). Although these instances
contribute high amounts of error, the robot is always able to relocalize itself.
One encouraging observation is that the system displays a level of resourcefulness,
as its performance degrades gracefully with increased frame rate. However, it is also
obvious that once the frame rate is increased to 10 fps, the segment estimation almost
exclusively updates the location belief. This becomes a problem because it is a coarse classifier
(which also injects errors). A key to improving the system is to provide more perception
to localization. The tracking module, for one, can also be used as a localization aid.
At this point, the landmark tracking module only reports to the system when tracking
is lost (to prompt the system to stop the search). We should also use it when tracking
is going well, by combining it with the odometry data to improve the motion model.
Chapter 4
Discussions And Conclusions
Localization is one of the primary problems in bringing more robotics technologies out of
the laboratories and into the general population. In this work, we have implemented a complete
biologically inspired mobile robot localization system with close-to-real-time capability.
This is achieved through the introduction of new ideas in vision localization (summarized in the
following sections), which our testing has shown to be beneficial.
As a performance benchmark, to the best of our knowledge, we have not seen other
systems that are tested in multiple outdoor environments (a building complex, a vegetation-dominated
park, and an open-field area) and that localize to the coordinate level (not just
a general vicinity or a place). At the 2005 ICCV Vision Contest [85], the competing teams
had to localize from a database of GPS-coordinate-tagged street-level photographs of a
stretch (one city block) of urban street. The winner [109] returned 9 of 22 answers within 4
meters of the actual location. Our system has a 6 ft (1.8 m) error in a 126x180 ft building
complex. One difference is that our system can store as many pictures as it wants, while
they had a limited number of reference images. Most purely vision-based systems
report just recognition rates (whether the current view is correctly matched with
stored images), not locations.
4.1 Implementation and Usage of Gist and Saliency Model
As studies of the human visual cortex suggest, the complementary gist [81] and
saliency [37, 32] features are implemented in parallel using shared raw feature channels
(color, intensity, orientation) in an overall biologically plausible framework.
Through the saliency model, the system automatically selects persistently salient
parts of an input image, which allows the system to crop out small windows called
salient regions. Because the system does not perform whole-scene matching (only regions),
the process is more efficient: it only has to compare a small subset of SIFT
keypoints, for a faster matching time.
The system also uses the gist features, which come with saliency at almost no additional
computation cost, to approximate the image layout and provide segment estimation. The gist
model testing results (section 3.2) show that the gist features by themselves can be useful in
outdoor localization for autonomous mobile robots, classifying segments in an environment
by contrasting images of scenes in a global manner and automatically taking obvious
idiosyncrasies into account. In addition, this high-level context information helps rule
out possible coincidental salient region matches and allows the system to maintain accuracy.
It is important to note that the gist module results are achieved without the help of
temporal filtering (one-shot recognition with each image considered individually), which
we expect to further improve the performance [94].
The gist features, however, would have problems differentiating scenes when most of
the background overlaps as is the case for scenes within a segment. Gist, by definition,
is not a mechanism to produce a detailed picture and an accurate localization, just the
coarse context. This is where saliency and localized region recognition complement gist.
One issue that we encounter with this arrangement is the need to synchronize the two
so that they can coincide naturally to provide a complete scene description.
4.2 Hierarchical Landmark Database and Multi-level Localization
We reconcile the representation and time-scale differences between gist features (a global
feature type) and salient regions (a local feature type) by performing multi-cue landmark
recognition through a hierarchical database (section 4.2.1) and then using those results
in a multi-level localization framework (section 4.2.2). That is, with gist we can locate
our general whereabouts to the segment level, while with saliency we can pinpoint our
exact location by finding distinctive salient regions situated within the segment, further
refining the resolution to the metric location, all within a back-end Monte Carlo
Localization framework (section 2.6).
This style of search, which is hierarchical in nature, has been shown [108, 103] to
speed up the matching process as well as to allow additional context information to
prune out possible incorrect matches. By having the system process an image at various
abstractions (specific region and image layout), we are not putting undue pressure on
any individual module, and we allow each to contribute only in the ways it is best at. It
would be hard to expect local features to consider the matching consequences globally,
or to ask global features to find coordinate-level correspondences. On the contrary, we
may even be able to slightly lower certain modules' thresholds to reduce the false-negative
rate while keeping the false-positive rate low as well.
4.2.1 Hierarchical Landmark Database
Using the landmark database, we optimize the speed-accuracy tradeoff of the recognition
process by minimizing both the number of entries in the database (done in the landmark
database construction procedure, during training) and the number of run-time
comparisons needed to determine whether there is a match (during run-time). The
latter is done by prioritizing the database search using segment estimation, salient feature
vector matching, and the current location state (e.g., favoring landmarks that are in the
vicinity of the believed location) before performing the slow salient region recognition.
In the landmark database construction, our goal is to lower the number of salient
regions stored in the database while still keeping all of the necessary information. We
play to the strengths of the individual entries and only add new instances when they
reduce the weaknesses. In this case, the strength of the SIFT features that make up the
salient regions is scale and in-plane rotation invariance, while the weaknesses are
out-of-plane viewpoint and lighting changes. During the individual database construction,
viewpoint invariance is the main reason why we add a salient region to a landmark.
Lighting invariance, on the other hand, is achieved by training the system under multiple
lighting conditions. However, because there is enough lighting overlap, we are able
to consolidate landmarks that depict the same point of interest but were created in
different training sessions.
The on-line landmark database search prioritization is the culmination of the benefits
of using a multi-expert, multi-level approach. By prioritizing the order of salient region
matching (from the most likely-to-be-matched entry to the least), we are able to cut
the average percentage of database entries being compared down to single digits. The
saved computation time can be used to perform more robust and sophisticated recognition.
For example, if one improvement were needed, it would be to add visual cues that
work across a wider range of lighting conditions. Also, in addition to the presented prioritization
factors, it would be easy to add temporal shortcuts, such as always comparing the previously
matched ten landmarks first. In the same spirit, we could also add a priority for the recently
matched training session (taken from the session origin of the recalled reference salient
regions), which, in effect, would provide lighting-condition priming.
There is one problem that we observe in testing with the salient regions. In
an uncontrolled environment, we are bound to have outside distractions that are not
native to the location, such as people who walk by and remain in the field of view long enough.
In these instances, the stored regions depicting people (who are quite salient) become
unnecessary space that only bogs the system down. However, if such a person
is consistently present at a particular location, he or she can be considered a reliable
landmark. What we can do to improve this is to add priority values to landmarks
that are found in multiple training sessions, or even to eliminate, during the construction
process, the ones that are found in only one session.
4.2.2 Multi-level Localization
Multi-level localization, on the other hand, is done by using both segment estimation
and salient region recognition as separate observations in the Monte Carlo Localization
module. Because both features have direct access to update the location belief, there are
no dependencies between the two. The gist features, which are holistic and computed
quickly, are used to provide a coarse but more frequently updated localization. Conversely,
the more accurate, but markedly slower, salient region recognition can infer metric
localization whenever it becomes available. Many global-feature-based methods [99, 10, 94, 81,
65] that are limited to recognizing places (as opposed to geographical points) indicate
that their results can be used as a filter for a more accurate localization attempt with
finer yet more volatile local features. Our system is an implementation of such an
extension.
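A minimal sketch of how two observation models of different granularity might both weight the same particle set is shown below; the function and field names are assumptions for illustration, not the actual module interfaces.

#include <cmath>
#include <vector>

struct Particle { double x, y; int segment; double weight; };

// Coarse observation: gist-based segment estimation, available on every frame.
// segmentLikelihood[s] is the estimated probability that the robot is in segment s.
void updateWithSegmentEstimation(std::vector<Particle>& particles,
                                 const std::vector<double>& segmentLikelihood) {
  for (Particle& p : particles)
    p.weight *= segmentLikelihood[p.segment];
}

// Fine observation: a matched salient region implies a metric location estimate,
// available only when the (slow) database search succeeds.
void updateWithSalientRegionMatch(std::vector<Particle>& particles,
                                  double obsX, double obsY, double sigma) {
  for (Particle& p : particles) {
    double dx = p.x - obsX, dy = p.y - obsY;
    p.weight *= std::exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma));
  }
}

Because each update touches only the particle weights, either observation can be applied whenever it arrives, which is what removes the dependency between the two cues.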
4.3 Environment Invariance
Because our system attacks the problem with multiple recognition sub-systems that work
at different abstractions, we also achieve one of our main goals: making the system
environment invariant. By successfully localizing in three visually contrasting and
challenging large-scale outdoor environments, the system, in effect, validates its accuracy
and scalability.
In addition, one criterion of being environment invariant is limiting the need for detailed
calibration, in which the robot would have to rely on the ad hoc knowledge of the designer for
reliable landmarks. Through the use of the saliency model, the system automatically
selects persistently salient regions as localization cues. In addition, the training process
is simply a matter of running the robot through an environment, after which the system trains
a single neural network (for segment estimation) and builds a landmark database.
Chapter 5
Future Works
Given that the current localization system implementation can run close to real-time, one of our
focuses now is to port it to a robot (section 5.1) in an unconstrained, outdoor environment.
In addition, we would also like to do further biological vision work on the topic of gist
(section 5.2).
5.1 Porting to a Robot
We are planning to port our complete mobile robot vision system (see figure 2.1
in chapter 2) to our robot Beobot 2.0 [77] (which has a computing-cluster platform) to
test both localization and visual navigation. We first have to integrate the
localization module in the ventral pathway with the dorsal pathway module, which is
responsible for autonomous navigation. We have implemented the salient region tracking
sub-module to aid the landmark database search process, and we are currently developing
lane direction detection (and following) and obstacle detection (and avoidance) modules
using gist and saliency.
For lane following, we are planning to use the gist features much like the Fourier-profile
navigation of [2]. In addition, we can also use the salient-point coordinate locations
of the regions stored in the landmark database to keep the robot on its path. During
training, when these regions are obtained, the robot is controlled by a person, who
naturally keeps the robot in the middle of the road. By servoing the robot with
respect to those coordinate locations, we can steer the robot so that it stays on the road. As for
obstacle detection, we are planning to incorporate the motion features that are already
available in the next version of the saliency model. Eventually, we would like to implement a
fully autonomous mobile vision system based on these ideas.
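As an illustration only (this navigation behavior is still future work), a proportional servoing rule of the kind described above could look like the following, with all names and gains hypothetical.

// Hypothetical proportional steering: drive the currently matched landmark's
// image x-coordinate toward the x-coordinate it had during (human-driven) training.
// Positive output steers right, negative steers left; kp is a tuning gain.
double steeringCommand(double matchedX, double trainedX,
                       double imageWidth, double kp = 1.0) {
  // Normalize the horizontal offset to [-1, 1] so the gain is resolution independent.
  double error = (matchedX - trainedX) / (imageWidth / 2.0);
  return kp * error;
}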
The main advantage of implementing an algorithm in an embodied system is that we
can pursue research that utilizes action-perception coupling. For example, we plan to
integrate a behavior that slows down the robot whenever a localization ambiguity
is encountered. A more advanced reaction would be to have the robot perform
exploratory, disambiguating movements whenever the system is confused; one such
case is when the robot is too close to a blank wall and needs to back up.
A problem related to localization that displays the value of having such abilities is
the goal-seeking task. Here, the robot needs to construct and follow a path from its
current location to a commanded goal location, such as the campus library. We are going
to do this with a landmark-hopping strategy. First, the robot localizes itself,
either in place or while moving, identifying at least one landmark. We then use that
landmark to construct the appropriate path and guide the robot in the right direction.
After the robot makes its first move, it tries to recognize subsequent landmarks.
When the next landmark that will advance the robot even further is recognized,
it switches to that one. To make this hand-off as smooth as possible, we use the salient
region tracking module, which tracks both the currently compared landmark and the already
identified landmark that the robot is trying to reach. The process repeats until the robot
arrives at the destination. Here, biasing the saliency module to look specifically in the
direction of the next predicted landmark on the path would be helpful.
Another localization-related project that we plan to implement is a SLAM (Simultaneous
Localization and Mapping) version of the system. This capability reduces the
need for detailed mapping, in which the robot has to rely on the knowledge of a designer
who surveys the environment. Because we use an augmented topological map, the flavor
of the algorithm should be similar to [67]. First of all, however, we would need the gist
features to classify places/segments without having to train them off-line. This requires the
ability to cluster gist feature vectors into segments, much like the approach implemented in
[101]. One of the difficulties here is that the robot has to notice by itself that it is moving
from one segment to another. In addition, this could potentially make the classification
more difficult as the decision boundaries become fuzzier, especially in the transitions between
segments.
Another vision-related issue to pay attention to is that the current testing data is
only uni-directional: all images are taken from the same perspective, the middle of the
road. In autonomous control using lane following, a bit of swerving may occur, so we
need to consider training the system on a less stable data set. However, recording from
every perspective in the environment may push the recognition system, in both segment
classification and salient region recognition, past its limits. A workable compromise
would be to have the camera pan left to right (up to 45°) while the robot is on the road.
We are currently collecting data that exhibits this swerving characteristic.
5.2 Gist Model
The gist features, despite their simple approach, are able to achieve promising localization
performance. The technique highlights the rapid nature of gist while remaining accurate
at its task. This is, in large part, because the basic computational mechanism
for extracting the gist features is simply averaging visual maps from different
domains. Theoretically, scalability is a concern because, when we average over large areas,
critical distinguishing details may be lost. However, although more sophisticated
gist computations can be incorporated, we maintain the instantaneous nature of the
information.
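To illustrate the basic mechanism, the following is a simplified sketch (not the exact feature set or code of the model): the extraction amounts to averaging each sub-channel map over a coarse grid, here 4x4.

#include <vector>

// Average a sub-channel map (row-major, height x width) over a grid x grid
// decomposition, producing grid*grid gist values for that sub-channel.
std::vector<double> gridAverages(const std::vector<double>& map,
                                 int width, int height, int grid = 4) {
  std::vector<double> gist(grid * grid, 0.0);
  std::vector<int> counts(grid * grid, 0);
  for (int y = 0; y < height; ++y) {
    for (int x = 0; x < width; ++x) {
      int gy = y * grid / height;   // which grid row this pixel falls into
      int gx = x * grid / width;    // which grid column this pixel falls into
      gist[gy * grid + gx] += map[y * width + x];
      ++counts[gy * grid + gx];
    }
  }
  for (int i = 0; i < grid * grid; ++i) gist[i] /= counts[i];
  return gist;
}

Concatenating these 16 values over all sub-channel maps (color, intensity, and orientation) yields the gist vector; moving to an 8x8 grid, as discussed later in this section, would simply raise the per-channel count from 16 to 64.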
Furthermore, we also avoid complications that occur when trying to fit more complex
models to unconstrained and noisy data. For example, an alternative is to use a graph
of regions as a layout representation, instead of the current grid-based decomposition.
It would represent a scene with segmented-region feature vectors for the nodes and coarse
spatial relationships for the edges. The node information could provide explicit shape
recognition, which is lacking in our current implementation. However, such an approach
can break down when a segmentation error occurs. We believe that capturing this spatial
expressiveness without the fragile nature of segmentation is the key to moving forward
in the representation.
In terms of robustness, the gist features are able to handle translational, angular,
scale, and illumination changes. Because they are computed from large image sub-regions,
it takes a sizable translational shift to affect the values. As for angular stability,
the natural perturbation of the data produced by a camera carried while walking seems
to aid the demonstrated invariance. For larger angular discrepancies, as in the case of
off-road environments, an engineering approach such as adding a sensor like a gyroscope to
correct the angle of view is advisable. The gist features are also invariant to scale because
the majority of the scene (the background) is stationary and the system is trained at all
viewing distances. The features also achieve solid illumination invariance when trained
with different lighting conditions. Lastly, the combined-sites experiment (section 3.2.4)
shows that the number of differentiable scenes can be quite high; twenty-seven segments
can make up a detailed map of a large area.
A profound effect of using gist is that background information is utilized more
than foreground. However, one drawback of the current gist implementation is that
it cannot carry out partial background matching for scenes in which large parts are
occluded by dynamic foreground objects. As mentioned earlier, the videos are filmed
during off-peak hours when fewer people (or vehicles) are on the road. Nevertheless,
they can still create problems when moving too close to the camera. In our system, such
images can be filtered out using the motion cues from the motion channel of the saliency
algorithm as a preprocessing step, detecting significant occlusion by thresholding the
sum of the motion channel feature maps [33]. Furthermore, a wide-angle lens (with
software distortion correction) may help to see more of the background scene and, in
comparison, decrease the size of the moving foreground objects.
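A minimal sketch of such a preprocessing filter, under the assumption that the summed motion feature maps are available as a plain array (names and threshold hypothetical):

#include <numeric>
#include <vector>

// Flag a frame as heavily occluded when the total motion-channel energy,
// normalized by the map size, exceeds a tuned threshold.
bool isOccludedByMovers(const std::vector<double>& motionMapSum,  // summed motion feature maps
                        double threshold) {
  double energy = std::accumulate(motionMapSum.begin(), motionMapSum.end(), 0.0) /
                  static_cast<double>(motionMapSum.size());
  return energy > threshold;
}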
Another way to increase the theoretical strength of the gist features is to go to a
finer grid to incorporate more spatial information. For the current extraction process,
going to the next level of the pyramid (an eight-by-eight grid) increases the number of
features from 16 to 64 in each sub-channel map. However, more spatial resolution also
means more data to process (four times the amount), and it is not obvious where the
point of diminishing returns is. Thus, we are faced with a decision to strike a balance between
resolution and generalization. We would like to push the expressive complexity
of the features while still keeping robustness and compactness in mind. At the start, we
stated that our goal is to emulate human-level gist understanding that can be applied to
a larger set of problems. As such, our future direction would be to stay faithful to the
available scientific data on human vision.
In the future, however, we plan to test our gist-based place recognition system with
data that challenges the model, using crowded scenes and more extreme translational
or point-of-view changes. In addition, we also plan to improve our model in the expectation
that the current one would not be able to cope with the introduced variability.
References
[1] Y. Abe, M. Shikano, T. Fukuda, F. Arai, and Y. Tanaka. Vision based navigation
system for autonomous mobile robot with global matching. IEEE International
Conference on Robotics and Automation, 20:1299–1304, May 1999.
[2] C. Ackerman and L. Itti. Robot steering with spectral image information. IEEE
Transactions on Robotics, 21(2):247–251, Apr 2005.
[3] M. Agrawal and K.G. Konolige. Real-time localization in outdoor environments
using stereo vision and inexpensive gps. In ICPR06, volume 3, pages 1063–1068,
2006.
[4] A. Angeli, D. Filliat, S. Doncieux, and J.-A. Meyer. A fast and incremental method
for loop-closure detection using bags of visual words. Accepted for publication in
IEEE Transactions on Robotics, Special Issue on Visual SLAM, 2008.
[5] Kobus Barnard, Vlad Cardei, and Brian Funt. A comparison of computational
color constancy algorithms; part one: Methodology and experiments with synthe-
sized data. IEEE Transactions in Image Processing, 11(9):972 – 984, 2002.
[6] Kobus Barnard, Lindsay Martin, Adam Coath, and Brian Funt. A comparison
of color constancy algorithms. Part two: experiments with image data. IEEE
Transactions in Image Processing, 11(9):985 – 996, 2002.
[7] B. Barshan and H. Durrant-Whyte. An inertial navigation system for a mobile
robot. IEEE Trans. Robotics and Automation, 11(3):328–342, June 1995.
[8] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust
features. In In ECCV, pages 404–417, 2006.
[9] I. Biederman. Do background depth gradients facilitate object identification?
Perception, 10:573 – 578, 1982.
[10] P. Blaer and P. Allen. Topological mobile robot localization using fast vision
techniques. In Proc. IEEE ICRA, 2002.
[11] J. L. Blanco, J. Gonzalez, and J. A. Fernández-Madrigal. Consistent observation
grouping for generating metric-topological maps that improves robot localization.
In ICRA, Barcelona, Spain, 2006.
[12] J. Borenstein and L. Feng. UMBmark - a method for measuring, comparing, and
correcting dead-reckoning errors in mobile robots. Technical Report UM-MEAM-
94-22, University of Michigan, December 1994.
[13] N.J. Broadbent, L.R. Squire, and R.E. Clark. Spatial memory, recognition mem-
ory, and the hippocampus. In Proc National Academy of Science USA, volume 11,
pages 14515 – 14520, 2004.
[14] K. Seng Chong and L. Kleeman. Accurate odometry and error modelling for a
mobile robot. In ICRA, volume 4, pages 2783–2788, 20-25 Apr 1997.
[15] C. Chubb and G. Sperling. Drift-balanced random stimuli: a general basis for
studying non-fourier motion perception. JOSA, 5(11):1986 – 2007, 1988.
[16] S. Cooper and H. Durrant-Whyte. A kalman filter model for gps navigation of
land vehicles. In IEEE/RSJ/GI Int. Conf. Intell. Robots Syst., pages 157 – 163,
Munich, Germany, September 1994.
[17] Mark Cummins and Paul Newman. Fab-map: Probabilistic localization and map-
ping in the space of appearance. Int. J. Rob. Res., 27(6):647–665, 2008.
[18] Guilherme N. DeSouza and Avinash C. Kak. Vision for mobile robot navigation:
A survey. IEEE Trans. Pattern Anal. Mach. Intell., 24(2):237–267, 2002.
[19] Pantelis Elinas and James J. Little. σMCL: Monte-Carlo localization for mobile
robots with stereo vision. In Proceedings of Robotics: Science and Systems, Cam-
bridge, USA, June 2005.
[20] R. Epstein, D. Stanley, A. Harris, and N. Kanwisher. The parahippocampal place
area: Perception, encoding, or memory retrieval? Neuron, 23:115 – 125, 2000.
[21] F. Escolano, B. Bonev, P. Suau, W. Aguilar, Y. Frauel, J. M. Saez, and M. Cazorla.
Contextual visual localization: Cascaded submap classification, optimized saliency
detection, and fast view matching. In Proc. IEEE International Conference on
Intelligent Robots and Systems (IROS), pages 1715 – 1722, 10 2007.
[22] G.D. Finlayson, B. Schiele, and J.L. Crowley. Comprehensive colour image nor-
malization. In 5th European Conference on Computer Vision, pages 475 – 490,
May 1998.
[23] Dieter Fox, Wolfram Burgard, Frank Dellaert, and Sebastian Thrun. Monte carlo
localization: Efficient position estimation for mobile robots. In Proc. of Sixteenth
National Conference on Artificial Intelligence (AAAI’99)., July 1999.
[24] S. Frintrop, P. Jensfelt, and H. Christensen. Attention landmark selection for
visual slam. In IROS, Beijing, October 2006.
[25] S. Frintrop, P. Jensfelt, and H. Christensen. Pay attention when selecting features.
In ICPR, Hong Kong, August 2006.
[26] S. Frintrop, P. Jensfelt, and H. Christensen. Attentional robot localization
and mapping. In ICVS Workshop on Computational Attention & Applications
(WCAA), Bielefeld, Germany, March 2007.
[27] L. Goncalves, E. Di Bernardo, D. Benson, M. Svedman, J. Ostrowski, N. Karlsson,
and P. Pirjanian. A visual front-end for simultaneous localization and mapping.
In ICRA, pages 44–49, April 18 - 22 2005.
[28] Kristen Grauman and Trevor Darrell. The pyramid match kernel: Discriminative
classification with sets of image features. In ICCV, pages 1458–1465, 2005.
[29] Kristen Grauman and Trevor Darrell. Approximate correspondences in high di-
mensions. In Advances in Neural Information Processing Systems (NIPS), vol-
ume 19, 2007.
[30] D. Haehnel, W. Burgard, D. Fox, and S. Thrun. An efficient fastslam algorithm
for generating maps of large-scale cyclic environments from raw laser range mea-
surements. In IROS, 2003.
[31] A. Hyvärinen. Fast and robust fixed-point algorithms for independent component
analysis. IEEE Transactions on Neural Networks, 10(3):626–634, 1999.
[32] L. Itti. Models of Bottom-Up and Top-Down Visual Attention. PhD thesis,
California Institute of Technology, Pasadena, California, Jan 2000.
[33] L. Itti. Automatic foveation for video compression using a neurobiological model
of visual attention. IEEE Transactions on Image Processing, 13(10):1304–1318,
Oct 2004.
[34] L. Itti. iLab Neuromorphic Vision C++ Toolkit (iNVT), 2009.
[35] L. Itti and C. Koch. A saliency-based search mechanism for overt and covert shifts
of visual attention. Vision Research, 40(10-12):1489–1506, May 2000.
[36] L. Itti and C.Koch. Computational modelling of visual attention. Nature Reviews
Neuroscience, 2(3):194–203, Mar 2001.
[37] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid
scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence,
20(11):1254–1259, Nov 1998.
[38] Timor Kadir, Andrew Zisserman, and Michael Brady. An affine invariant salient
region detector. In ECCV (1), pages 228–241, 2004.
[39] H. Katsura, J. Miura, M. Hild, and Y. Shirai. A view-based outdoor navigation
using object recognition robust to changes of weather and seasons. In IROS, pages
2974–2979, Las Vegas, NV, Oct 27 - 31 2003.
[40] Yan Ke and Rahul Sukthankar. Pca-sift: A more distinctive representation for
local image descriptors. In CVPR (2), pages 506–513, 2004.
[41] B. Kuipers. An intellectual history of the spatial semantic hierarchy. In Margaret
Jefferies and Albert (Wai-Kiang) Yeap, editors, Robot and Cognitive Approaches
to Spatial Mapping, volume 99, pages 21–71. Springer Verlag, 2008.
[42] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of features:
Spatial pyramid matching for recognizing natural scene categories. In CVPR ’06:
Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision
and Pattern Recognition, pages 2169–2178, Washington, DC, USA, 2006. IEEE
Computer Society.
[43] J. J. Leonard and H. F. Durrant-Whyte. Mobile robot localization by tracking
geometric beacons. IEEE Trans. Robotics and Automation, 7(3):376–382, June
1991.
[44] F. F. Li, R. VanRullen, C. Koch, and P. Perona. Rapid natural scene categorization
in the near absence of attention. In Proc. Natl. Acad. Sci., pages 8378–8383, 2002.
[45] K. Lingemann, H. Surmann, A. Nuchter, and J. Hertzberg. Indoor and outdoor
localization for fast mobile robots. In IROS, 2004.
[46] D. G. Lowe. Distinctive image features from scale-invariant keypoints. Intl. Journal
of Computer Vision, 60(2):91–110, 2004.
[47] F. Lu and E. Milios. Robot pose estimation in unknown environments by matching
2d range scans. Journal of Intelligent and Robotic Systems, 18:249 – 275, 1997.
[48] Shoichi Maeyama, Akihisa Ohya, and Shin’ichi Yuta. Long distance outdoor nav-
igation of an autonomous mobile robot by playback of perceived route map. In
ISER, pages 185–194, 1997.
[49] Jiri Matas, Ondrej Chum, Martin Urban, and Tomás Pajdla. Robust wide baseline
stereo from maximally stable extremal regions. In BMVC, 2002.
[50] Y. Matsumoto, M. Inaba, and H. Inoue. View-based approach to robot navigation.
In IEEE-IROS, pages 1702–1708, 2000.
[51] T. P. McNamara. Memory’s view of space. In G. H. Bower, editor, The psychology
of learning and motivation: Advances in research and theory, volume 27, pages
147–186. Academic Press, 1991.
[52] Krystian Mikolajczyk and Cordelia Schmid. An affine invariant interest point
detector. In ECCV (1), pages 128–142, 2002.
[53] Krystian Mikolajczyk and Cordelia Schmid. A performance evaluation of local
descriptors. IEEE Trans. Pattern Anal. Mach. Intell., 27(10):1615–1630, 2005.
[54] M. J. Mace, S. J. Thorpe, and M. Fabre-Thorpe. Rapid categorization of achromatic
natural scenes: how robust at very low contrasts? Eur J Neurosci.,
21(7):2007 – 2018, April 2005.
[55] M. Montemerlo, S. Thrun, D. Koller, and B. Wegbreit. Fastslam: A factored
solution to the simultaneous localization and mapping problem. In AAAI, 2002.
[56] A.C. Murillo, J.J. Guerrero, and C. Sagues. Surf features for efficient robot lo-
calization with omnidirectional images. In Robotics and Automation, 2007 IEEE
International Conference on, pages 3901–3907, April 2007.
[57] R. Murrieta-Cid, C. Parra, and M. Devy. Visual navigation in natural environ-
ments: From range and color data to a landmark-based model. Autonomous
Robots, 13(2):143–168, 2002.
[58] P. Newman and K. Ho. Slam-loop closing with visually salient features. In In-
ternational Conference on Robotics and Automation (ICRA), Barcelona, Spain,
2005.
[59] D. Nistér and H. Stewénius. Scalable recognition with a vocabulary tree. In IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages
2161–2168, June 2006.
[60] A. Oliva and P.G. Schyns. Coarse blobs or fine edges? evidence that information
diagnosticity changes the perception of complex visual stimuli. Cognitive Psychol-
ogy, 34:72 – 107, 1997.
[61] A. Oliva and P.G. Schyns. Colored diagnostic blobs mediate scene recognition.
Cognitive Psychology, 41:176 – 210, 2000.
[62] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representa-
tion of the spatial envelope. International Journal of Computer Vision, 42(3):145–
175, 2001.
[63] N. Ouerhani, A. Bur, and H. Hügli. Visual attention-based robot self-localization.
In Proc. of European Conference on Mobile Robotics (ECMR), pages 8–13, Ancona,
Italy, September 2005.
[64] M. C. Potter. Meaning in visual search. Science, 187(4180):965–966, 1975.
[65] A. Pronobis, B. Caputo, P. Jensfelt, and H.I. Christensen. A discriminative ap-
proach to robust visual place recognition. In IROS, 2006.
[66] A. Ramisa, A. Tapus, R. Lopez de Mantaras, and R. Toledo. Mobile robot local-
ization using panoramic vision and combination of local feature region detectors.
In ICRA, pages 538–543, Pasadena, CA, May 2008.
[67] A. Ranganathan and F. Dellaert. A rao-blackwellized particle filter for topological
mapping. In ICRA, pages 810– 817, 2006.
[68] L. W. Renninger and J. Malik. When is scene identification just texture recognition?
Vision Research, 44:2301–2311, 2004.
[69] R. A. Rensink. The dynamic representation of scenes. Visual Cognition, 7:17 –
42, 2000.
[70] M. Rous, H. Lupschen, and K.-F. Kraiss. Vision-based indoor scene analysis for
natural landmark detection. IEEE International Conference on Robotics and Au-
tomation, pages 4642– 4647, April 2005.
[71] U. Rutishauser, D. Walther, C. Koch, and P. Perona. Is bottom-up attention useful
for object recognition? In CVPR (2), pages 37–44, 2004.
[72] T. Sanocki and W. Epstein. Priming spatial layout of scenes. Psychol. Sci., 8:374
– 378, 1997.
[73] Grant Schindler, Matthew Brown, and Richard Szeliski. City-scale location recog-
nition. In Computer Vision and Pattern Recognition, IEEE Computer Society
Conference on, volume 0, pages 1–7, Los Alamitos, CA, USA, 2007. IEEE Com-
puter Society.
[74] S. Se, D. G. Lowe, and J. J. Little. Vision-based global localization and mapping
for mobile robots. IEEE Transactions on Robotics, 21(3):364–375, 2005.
[75] T. Serre, L. Wolf, S. M. Bileschi, M. Riesenhuber, and T. Poggio. Robust object
recognition with cortex-like mechanisms. IEEE Trans. Pattern Anal. Mach. Intell.,
29(3):411–426, 2007.
[76] C. Siagian. IEEE-T PAMI data home page, 2007. [Online; accessed 22-September-
2008].
[77] C. Siagian, C. K. Chang, R. Voorhies, and L. Itti. Beobot 2.0, 2009.
[78] C. Siagian and L. Itti. Gist: A mobile robotics application of context-based vi-
sion in outdoor environment. In Proc. IEEE-CVPR Workshop on Attention and
Performance in Computer Vision (WAPCV’05), San Diego, California, pages 1–7,
Jun 2005.
[79] C. Siagian and L. Itti. Biologically-inspired robotics vision monte-carlo localization
in the outdoor environment. In Proc. IEEE International Conference on Intelligent
Robots and Systems (IROS), Oct 2007.
[80] C. Siagian and L. Itti. Biologically-inspired robotics vision monte-carlo localization
in the outdoor environment. In Proc. IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS), Oct 2007.
[81] C. Siagian and L. Itti. Rapid biologically-inspired scene classification using features
shared with visual attention. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 29(2):300–312, Feb 2007.
[82] C. Siagian and L. Itti. Comparison of gist models in rapid scene categorization
tasks. In Proc. Vision Science Society Annual Meeting (VSS08), May 2008.
[83] C. Siagian and L. Itti. Storing and recalling information for vision localization. In
IEEE International Conference on Robotics and Automation (ICRA), Pasadena,
California, May 2008.
[84] Josef Sivic and Andrew Zisserman. Video google: A text retrieval approach to
object matching in videos. In ICCV ’03: Proceedings of the Ninth IEEE Interna-
tional Conference on Computer Vision, page 1470, Washington, DC, USA, 2003.
IEEE Computer Society.
[85] Richard Szeliski. ICCV2005 Computer Vision Contest: Where am I?
http://research.microsoft.com/iccv2005/Contest/, November 2005.
[86] R. Taktak, M. Dufaut, and R. Husson. Vehicle detection at night using image
processing and pattern recognition. ICIP-II, pages 296–300, 1994.
[87] S. Thorpe, D. Fize, and C. Marlot. Speed of processing in the human visual system.
Nature, 381:520 – 522, 1995.
[88] S. Thrun. Learning metric-topological maps for indoor mobile robot navigation.
Artificial Intelligence, 99(1):21–71, 1998.
[89] S. Thrun, M. Bennewitz, W. Burgard, A.B. Cremers, F. Dellaert, D. Fox,
D. Hähnel, C. Rosenberg, N. Roy, J. Schulte, and D. Schulz. MINERVA: A second
generation mobile tour-guide robot. In Proc. of the IEEE ICRA, 1999.
[90] S. Thrun, D. Fox, and W. Burgard. A probabilistic approach to concurrent map-
ping and localization for mobile robots. Machine Learning, 31:29–53, 1998.
[91] S. Thrun, D. Fox, W. Burgard, and F. Dellaert. Robust monte-carlo localization
for mobile robots. Artificial Intelligence, 128(1-2):99–141, 2000.
[92] Sebastian Thrun. Finding landmarks for mobile robot navigation. In ICRA, pages
958–963, 1998.
[93] A. Torralba. Modeling global scene factors in attention. Journal of Optical Society
of America, 20(7):1407 – 1418, 2003.
[94] A. Torralba, K. P. Murphy, W. T. Freeman, and M. A. Rubin. Context-based
vision system for place and object recognition. In ICCV, pages 1023 – 1029, Nice,
France, October 2003.
[95] A M Treisman and G Gelade. A feature-integration theory of attention. Cognit
Psychol, 12(1):97–136, 1980.
[96] R. Steven Turner. In the eye’s mind: vision and the Helmholtz-Hering controversy.
Princeton University Press, 1994.
[97] B. Tversky and K. Hemenway. Categories of the environmental scenes. Cognitive
Psychology, 15:121 – 149, 1983.
[98] Barbara Tversky. Navigating by mind and by body. In Spatial Cognition, pages
1–10, 2003.
[99] I. Ulrich and I. Nourbakhsh. Appearance-based place recognition for topological
localization. In IEEE-ICRA, pages 1023 – 1029, April 2000.
[100] L G Ungerleider and M Mishkin. Two cortical visual systems. In D G Ingle,
M A A Goodale, and R J W Mansfield, editors, Analysis of visual behavior, pages
549–586. MIT Press, Cambridge, MA, 1982.
[101] C. Valgren and A. J. Lilienthal. Incremental spectral clustering and seasons:
Appearance-based localization in outdoor environments. In ICRA, Pasadena, CA,
2008.
[102] G. R. A. Vargas, K. Nagatani, and K. Yoshida. Adaptive kalman filtering for
gps-based mobile robot localization. In IEEE International Workshop on Safety,
Security and Rescue Robotics, Rome, Italy, September 2007.
[103] J. Wang, H. Zha, and R. Cipolla. Coarse-to-fine vision-based localization by
indexing scale-invariant features. IEEE Trans. Systems, Man and Cybernetics,
36(2):413–422, April 2006.
[104] C. Weiss, H. Tamimi, A. Masselli, and Andreas Zell. A hybrid approach for vision-based
outdoor robot localization using global and local image features. In Proc.
IEEE International Conference on Intelligent Robots and Systems (IROS), pages
1047 – 1052, 10 2007.
[105] N. Winters, J. Gaspar, G. Lacey, and J. Santos-Victor. Omni-directional vision
for robot navigation. In In IEEE Workshop on Omnidirectional Vision, pages 21
– 28, June 2000.
[106] J M Wolfe. Visual search in continuous, naturalistic stimuli. Vision Res,
34(9):1187–95, 1994.
[107] F. Xu and K. Fujimura. Pedestrian detection and tracking with night vision. In
Intelligent Vehicle Symposium, volume 1, pages 21–30, Pasadena, CA, June 2002.
[108] W. Zhang and J. Kosecka. Localization based on building recognition. In IEEE
Workshop on Applications for Visually Impaired, pages 21 – 28, June 2005.
[109] W. Zhang and J. Kosecka. Image based localization in urban environments. In
International Symposium on 3D Data Processing, Visualization and Transmission,
3DPVT, Chapel Hill, North Carolina, 2006.
Abstract
The problem of localization is central to endowing mobile machines with intelligence. Vision is a promising research path because of its versatility and robustness in most unconstrained environments, both indoors and outdoors. Today, with many available studies in human vision, there is a unique opportunity to develop systems that take inspiration from neuroscience. In this work we examine several important issues on how the human brain deals with vision in general, and localization in particular.