OUTDOOR VISUAL NAVIGATION AID FOR THE BLIND IN DYNAMIC
ENVIRONMENTS
by
Tung-Sing Leung
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
University of Southern California
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
August 2015
Copyright 2015 Tung-Sing Leung
Table of Contents

List of Tables
List of Figures
Abstract

1 Introduction
  1.1 Issues
  1.2 Overview of our approach
  1.3 Experimental setup
  1.4 Organization of this thesis

2 Related Work
  2.1 Mobility aids for VI individuals
    2.1.1 Canes
    2.1.2 Guide dogs
    2.1.3 GPS and DGPS
    2.1.4 Dead reckoning
    2.1.5 Simultaneous Localization and Mapping
    2.1.6 Visual Teach-and-repeat
    2.1.7 Map-aided localization
  2.2 Robot navigation
    2.2.1 GPS and dead reckoning
    2.2.2 Simultaneous Localization and Mapping
    2.2.3 Map-aided localization
  2.3 Summary

3 Visual Odometry in dynamic environments
  3.1 Introduction
  3.2 Related work
    3.2.1 Static environments
    3.2.2 Dynamic environments
    3.2.3 Generic stereo-based VO
    3.2.4 Libviso2
  3.3 Overview of our approach
  3.4 Ground plane model estimation
    3.4.1 Ground plane detection for short baseline stereo
    3.4.2 Ground plane normal estimation from visual data
  3.5 Egomotion estimation
    3.5.1 Classical egomotion estimation directly from all points
    3.5.2 Egomotion estimation from ground plane
  3.6 Validation using simulated data
  3.7 Benchmarking results with KITTI dataset
  3.8 Validation with our dataset using images only
    3.8.1 Methods
    3.8.2 Evaluation criteria
    3.8.3 Results without IMU
  3.9 Visual odometry with IMU
  3.10 Validation with our dataset with IMU
    3.10.1 Methods
    3.10.2 Results with IMU
  3.11 Motion estimation from smartphone
  3.12 Summary

4 Map Aided Localization
  4.1 Introduction
    4.1.1 Related Work
    4.1.2 Overview of our approach
  4.2 Street graph generation from OSM
    4.2.1 Data available in OSM
    4.2.2 Map-to-street graph conversion
  4.3 Particle Filter
    4.3.1 Bayes Filter with street map prior
    4.3.2 State model
    4.3.3 Measurement model
    4.3.4 State transition
    4.3.5 Street transition
    4.3.6 Mixture of Gaussians
  4.4 Running time
  4.5 Validation
    4.5.1 Localization results with stereo camera
    4.5.2 Localization results with smartphone
    4.5.3 Comparison between stereo Vuzix and mobile phone
    4.5.4 Very large-scale localization results
  4.6 Recovery from temporary loss
    4.6.1 Re-localization test results

5 Conclusion and future work

Reference List
List of Tables

3.1 Four different simulated scenes sorted according to the amount of dynamic points
3.2 Average translational error (E_ave) for different algorithms in different environments without using the IMU
3.3 Average translational error (E_ave) for different algorithms in different environments
4.1 Comparison of computation time averaged over 1,000 frames at a resolution of 320 x 240 pixels. The average numbers of feature points tracked by our algorithm and Libviso2 are 370 and 360, respectively.
4.2 Localization error comparison using the stereo camera without GPS. Sequences that are not localized are marked with "*".
4.3 Localization error comparison using the stereo camera with GPS. Sequences that are not localized are marked with "*".
4.4 Localization error comparison using the mobile phone without GPS. Sequences that are not localized are marked with "*".
4.5 Localization comparison using the mobile phone with GPS.
4.6 Localization comparison between the stereo Vuzix and the mobile phone using the proposed method
4.7 Quantitative localization results
4.8 Re-localization comparison using the stereo camera in different areas without GPS. Experiments that are not localized are labeled with "*".
4.9 Re-localization comparison using the mobile phone without GPS
List of Figures

1.1 (a) Wearable stereo camera (b) Smartphone in pocket with rear camera exposed (c) Crowded scenes
1.2 Block diagram for our approach
1.3 Vuzix Wrap 920 AR with integrated IMU [19]
1.4 Images captured at the spots marked in red are selected as anchored frames. Three examples of common features visible in both Google satellite/aerial imagery and the live image: (a) distinctive ground texture (b) trees (c) markings and curbs at road junctions
1.5 Dense ground truth locations are obtained by interpolation
3.1 Pose estimation using the stereo camera only
3.2 Stereo-based VO output that can be translated to world coordinates if the initial camera pose is known
3.3 Generic stereo-based VO
3.4 (a) Circular matching (b) Libviso2 method
3.5 (a) Selecting points from the entire scene vs. (b) selecting points on the ground
3.6 (a) Detected ground region in yellow and estimated ground plane normal as a red vector (b) Optical flow on the ground surface, inliers in green and outliers in red
3.7 Standard stereo-based VO vs. our method
3.8 Y-coordinates of the ground plane in the image are linearly proportional to the corresponding disparity
3.9 Output road mask highlighted in yellow
3.10 Average translation error over different scenes
3.11 Average translation error over five runs in planar scenes with different observation noise levels
3.12 The KITTI dataset was captured by a car-mounted stereo camera
3.13 Our algorithm can extract and track ground features in different road conditions
3.14 Ground detection and optical flow extracted from the ground
3.15 Optical flow on the ground surface
3.16 Our algorithm is among the top five stereo-based methods; the top two algorithms are laser-based methods
3.17 KITTI dataset results
3.18 The primary sensor of our system: Vuzix Wrap 920 AR with integrated IMU [19]
3.19 (a) Evaluating VO with four anchored frames (b) Translational error between all possible subpaths
3.20 Cafeteria area test result
3.21 Football match dataset
3.22 Shopping area dataset
3.23 Position estimation using stereo camera and IMU
3.24 Coordinate system for the integrated IMU [19]
3.25 Red line and rectangle represent the ground plane normal vector measured by the IMU
3.26 Average translational error comparison with different datasets
3.27 Camera motion estimation from a known ground plane
3.28 The lower 50% of a live image is marked as ground. Inliers and outliers are highlighted in green and red, respectively.
4.1 Map-aided position estimation
4.2 For long-distance navigation, the VO may drift from the ground truth if there is no correction from an absolute positioning device such as GPS
4.3 Google Street View vs. pedestrian's view
4.4 The zebra crossings printed on the crosswalks are not updated in Google Maps (middle diagram)
4.5 (a) Google Street View captured in 2009 (b) Live view recorded in 2014
4.6 Road networks (highlighted in yellow) from OSM superimposed on an aerial photo from Google Earth
4.7 (a) Street segment parameterization. The red circle and yellow line denote the location and heading direction of the user, respectively (b) Street map from OSM. The footpath is highlighted in red dotted lines (c) Corresponding connectivity graph for Figure 4.7(b). Road segment IDs are printed in black
4.8 Output of the proposed position estimation method with the campus dataset (a) Legend (b-h) Outputs for feature tracking, VO, and localization superimposed at different time steps
4.9 A graphical model representation of our proposed localization system
4.10 Street maps for the (a) Campus and (b) Downtown datasets. Both maps encompass an area of approximately 1 km². Only the yellow road segments are converted to a connectivity graph for localization, and thus each file size is smaller than 13 MB. Note that the road network pattern in Downtown Los Angeles is primarily a grid system; therefore, the VO must provide accurate estimates in order for the particle filter to converge to the correct spot.
4.11 Sample frames of the USC Campus dataset
4.12 Sample frames of the Downtown dataset
4.13 Ground truth path (yellow curve) surrounded by high-rise buildings that cause large GPS error
4.14 The red dotted curve denotes the output trajectory of our localization system using the smartphone camera after the user is localized at the spot marked by the green circle
4.15 (a) Rectified stereo image pair captured by the Vuzix as it moves outdoors. The vertical distance between each pair of red grid lines is 12 pixels (b) Corresponding dense disparity map. Under normal operating conditions, the same point in the real scene should appear at the same y-coordinate in each rectified image. That condition is fulfilled here; hence, the disparity image is smooth and accurate
4.16 First to last rows show four consecutive stereo image pairs captured while the Vuzix camera moves under a shaded area. The corresponding disparity maps are in the last column. The disparity maps for the second and third stereo pairs are corrupted; only the first and last rows are acceptable. This is because the left and right images were captured at slightly different times under low-light conditions. Notice that the floor textures do not appear at the same y-coordinates in the left and right images in the second and third rows.
4.17 Image quality comparison between the stereo Vuzix and the smartphone under shade
4.18 Test scenario: the user specifies four waypoints between the starting location and the final destination
4.19 Path output by the path planner; the total distance is 6.7 km and the approximate walking time is 1.5 hours
4.20 Map for navigating from campus to downtown and vice versa. Only the blue road segments are encoded into the map. The file size is approximately 35 MB, which includes both road connectivity information and semantic labels.
4.21 Sample frames from the dataset
4.22 Final trajectory output of our proposed method, shown as a red dotted line
4.23 Enlarged output
4.24 Surrounded by buildings
4.25 Steps taken by the proposed system to recover. The black line shows the most recent ground truth trajectory, and the blue spot with a green ring marks the user's location estimated by the proposed system (a) The user is localized when all particles converge to a single spot (b) A large VO measurement error occurs (c) "Lost" is detected when the particles diverge into multiple modes (not shown) (d) Our method re-initializes the particle filter by resetting the local area around the last known location with a uniform distribution (e) Most particles are eliminated shortly, and two modes remain (f) The user is re-localized as the particles converge into a single spot again
4.26 Positional error (blue line in top graph) increases when there is perturbation in the linear and angular displacements. Red dotted lines in the top graph depict perturbation times. Green depicts the time when the sequence is re-localized. Blue lines in the bottom two graphs are the original VO measurements; red lines depict the perturbed measurements. The two red spikes in the middle plot are caused by perturbations in linear displacement. The corresponding angular displacements are marked by two ellipses in the bottom graph.
4.27 Position error for the re-localization test with the campus dataset captured by smartphone: (a) [13] and (b) ours
5.1 Client-server implementation
Abstract
This thesis proposes a visual navigation aid for the blind. Our goal is to develop a wearable system to help the Visually Impaired (VI) navigate in highly dynamic outdoor environments. The proposed solution uses both visual sensing and an existing map available online. Our work focuses on position estimation, which consists of two parts: Visual Odometry (VO) and localization. We propose different VO methods to compute user motion even in cluttered environments using either a consumer-grade wearable stereo camera or a smartphone. For the stereo camera, instead of computing egomotion from 3D point correspondences in consecutive frames, we propose finding the ground plane and then decomposing the six Degrees of Freedom (6-DoF) egomotion into a motion of the ground plane and a 3-DoF planar motion on the ground plane. The ground plane is estimated at each frame by analysis of the disparity array. To slow the accumulation of VO error, Inertial Measurement Unit (IMU) readings are used to approximate the ground plane orientation when walking along flat terrain. The VO is extended to a monocular system so that the proposed framework is applicable to smartphone platforms with small, pocket-size form factors. In addition, mobile devices are more accessible to VI users.

The VO output is the latest camera pose given in a local coordinate system defined with respect to the first camera frame. To localize the user in global coordinates, the proposed system combines the VO output with the semantic information available in an existing map downloaded from a free map service. The Monte Carlo Localization (MCL) framework is used to match the local motion trajectory of the user with the shape of the street network found in the map. The framework allows the system to estimate both the global position and orientation of the user continuously. Our system is validated on real scenarios covering hours of walking in both open terrain and the urban environment of Downtown Los Angeles. Experimental results show that our method corrects the cumulative drift of the position estimate computed from either a stereo camera or a smartphone. It also localizes users without the Global Positioning System (GPS), even in areas with equiangular intersections and equal-length streets.

To accelerate the localization process, we model GPS noise with a Gaussian distribution and incorporate GPS measurements as prior knowledge to reduce the search area. Experimental results show that GPS readings improve localization speed even when the signal is noisy in an urban area. In the case of temporary loss, which often occurs in dynamic environments, our system improves re-localization speed by initializing the localization locally around the last known user location.

We also focus on choosing appropriate sensors for the position estimation. We explore a variety of possible modalities and different combinations of a wearable stereo camera, IMU, GPS, and the integrated sensors in mobile phones. The lightweight wearable stereo camera not only measures depth, but also provides a natural appearance to the VI user. The disadvantage is that it is composed of two low-quality image sensors with an extremely short baseline. On the other hand, the smartphone camera captures images with higher resolution, but lacks depth measurement. By analyzing the experimental results, we found that, surprisingly, the monocular device provides lower position error in urban environments compared with the stereo camera, which is more accurate only in open areas.
Chapter 1
Introduction
The term "visual impairment" describes any type of vision loss that cannot be cured by standard glasses or medical treatment. According to statistics from the World Health Organization (WHO) [82], approximately 285 million people worldwide currently suffer from visual impairment. Visual impairment leads to a loss of independence in performing several routine and life-enabling tasks. Navigating in unfamiliar environments continues to be a major challenge for the Visually Impaired (VI) population because such patients lose both orientation and mobility skills [12]. Orientation is the skill of finding one's pose relative to a destination. Mobility is the capability of avoiding obstacles, maintaining consistent headings, and estimating distances and angles. Currently, VI individuals rely on Global Positioning System (GPS)-based devices for orientation [119, 87, 25] and use touch, white canes, or guide dogs to overcome the mobility problem. However, GPS is highly inaccurate in urban areas because of signal blockage and reflection by buildings and trees. In addition, white canes and guide dogs cannot solve the orientation problem.
Figure 1.1: (a) Wearable stereo camera (b) Smartphone in pocket with rear camera exposed (c) Crowded scenes

Recently, there has been great progress in robotic navigation research, such as self-driving vehicles [35]. Such autonomous vehicles solve both the orientation and mobility problems on the road using multiple sensors, including Light Detection and Ranging (LIDAR), GPS, and inertial sensors. However, it is difficult to transfer these technologies directly to humans for several reasons. First, humans cannot carry the heavy sensors and power sources on which ground vehicles rely for autonomous navigation. Furthermore, egomotion estimation for a walking user (first-person vision) is radically different from that of a moving vehicle. The motion of a head-mounted camera on a walking person is significantly more complex than that of a vehicle-mounted camera: the former is a combination of 6-DoF head and body motion, whereas the latter is dominated by forward velocity and yaw angle. Robotic applications usually employ wheel odometry as prior information for camera motion estimation, but such measurements are not available for human applications. Currently, the majority of robotic navigation research focuses on relatively static environments. These methods may not be sufficiently robust to handle dynamic environments such as crowded scenes in urban areas.
Therefore, our goal is to develop a wearable visual navigation aid for VI patients in order to overcome the orientation problem and assist them in navigating from point A to point B in dynamic outdoor environments without being loaded down with heavy hardware. To achieve this goal, we designed a vision-based system that estimates the world coordinates of the moving user at regular intervals while he/she moves toward a given destination. Instead of building custom hardware, we validate our method using existing consumer-grade sensors and popular mobile devices accessible to VI users.
1.1 Issues
Visually impaired patients lack the confidence to go to unfamiliar places because they may easily lose their way. The widely used GPS-based localization approach is not suitable for them: the GPS signal is not available all the time, and its accuracy may drop due to multipath and signal blockage, especially in urban areas surrounded by buildings and trees. In addition, the GPS update rate is one second. Users with normal eyesight can still navigate by keeping track of visible landmarks; however, visually impaired users may easily lose their direction in unfamiliar places between GPS updates.

Another issue in the design of a wearable navigation aid is the choice of sensing modality. Many different methods have been proposed to help visually impaired patients navigate independently. Different approaches rely on different sensory modalities or combinations of multiple sensors. This is because each sensor has its own advantages and disadvantages; it is nearly impossible to find a single sensor that suits all environments. Moreover, cutting-edge sensors may not fit our application, because users prefer small and light sensors as well as a processing unit with low power consumption. In order to provide immediate guidance instructions to the user, the vision-based navigation system should run in real time and provide accuracy comparable to or better than GPS. All of these issues motivate us to develop our own approach to solve the navigation problem for VI patients.
1.2 Overview of our approach
The block diagram in Figure 1.2 summarizes our approach. A user initiates the proposed navigation aid by specifying a destination and the current location. The starting area can be a large region, such as Downtown Los Angeles or San Francisco Chinatown. The "Motion Estimation" module computes and records the user's current trajectory using a wearable camera. At the same time, the "Planning" module downloads and prepares a street map that encompasses the starting area and the destination specified by the user. The system can bypass the map generation step if the map is already available in the device's memory. When the map is ready, the "Localization" module starts to locate the user on the map using his/her trajectory. Once the user is localized, the "Planning" module computes the best path linking the user's current location to the destination. The path can be subdivided into a series of waypoints. The "Guidance" module computes the best direction to the next waypoint and informs the user through a tactile user interface similar to [89], an array of micro-vibration motors embedded in a vest that provides guidance cues, as shown in Figure 1.1(a). The user's pose changes continuously as he/she follows the path to the destination. The "Planning" module requires the latest pose of the user computed by the localization module, together with the detected obstacles, for further planning.
Figure 1.2: Block diagram for our approach
Because there is much ongoing research on the modules shown in grey in Figure 1.2, this thesis focuses on pose estimation (motion estimation and localization) using a wearable camera and sensors. The minimum sensor input is a camera, used by the motion estimation module to compute the user's pose through a Visual Odometry (VO) algorithm. Motion estimation uses dead reckoning to compute the latest camera pose in a local coordinate system defined with respect to the first camera frame. Hence, the estimation error propagates and accumulates over time, causing the final trajectory to drift from the true path. To reduce such drift, we integrate the Inertial Measurement Unit (IMU) readings to slow the accumulation of motion estimation error. We also incorporate a map in the localization process to convert the user's pose from the local coordinates output by the motion estimation module to world coordinates, even when GPS is unavailable. The localization continuously estimates both the global position and orientation of the user by comparing the estimated trajectory with the road segments extracted from the map. In addition, the map is used as a global reference to correct drift in the motion estimation.

However, there are two challenges for this localization approach. First, the user has to keep walking until the system has localized him/her. Second, if the road network is a regular grid, similar to that of Downtown Los Angeles as shown in Figure 4.10(b), the localization system might require a significant amount of time to converge to the true location. To overcome the first issue, we incorporate GPS measurements (when the signal is available) to accelerate the localization process and shorten the walking distance. Note that GPS alone is insufficient for the pose estimation of VI users, as mentioned above; however, we find that GPS measurements improve localization speed even when the signal is noisy in urban areas. For the second issue, we implement an accurate and robust VO algorithm in the motion estimation module. Experimental results show that our VO helps the localization module find the user even in areas with equiangular intersections and equal-length streets.
Figure 1.3: Vuzix Wrap 920 AR with integrated IMU [19]
We also focus on choosing sensors appropriate for the system, as shown on the left side of Figure 1.2. To achieve pose estimates accurate enough for navigation, we explore a variety of possible modalities and different combinations of stereo cameras, IMU gravitational sensors, GPS, and the integrated sensors in mobile phones. Note that our approach does not rely on gyroscopes because they might not be available on all devices. We use a consumer-grade, off-the-shelf wearable stereo camera, the Vuzix Wrap 920 shown in Figure 1.3, to capture stereo images. The Vuzix is chosen as a compromise between portability and depth measurement accuracy. The stereo rig is not only lightweight and comfortable to wear, but also gives the VI user a natural appearance. The disadvantage is that it is composed of two low-quality image sensors with an extremely short baseline of 6 cm (half that of the Bumblebee2). On the other hand, smartphone cameras capture images with higher resolution, but lack depth measurement. A comparison of the pose estimation accuracy of these two cameras is given in Section 4.5.3.
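To make the baseline trade-off concrete, the following is a small sketch of the standard pinhole-stereo depth relation and its first-order error; the focal length and disparity-error values mentioned below are placeholders for illustration, not the Vuzix's actual calibration.

    def depth_from_disparity(disparity_px, baseline_m, focal_px):
        """Pinhole stereo relation: Z = f * B / d (depth in meters)."""
        return focal_px * baseline_m / disparity_px

    def depth_uncertainty(depth_m, baseline_m, focal_px, disparity_error_px=0.5):
        """First-order propagation of a disparity error: dZ ~= Z^2 / (f * B) * dd.
        Halving the baseline B roughly doubles the depth uncertainty, which is
        why a 6 cm rig measures depth less accurately than a wider one."""
        return depth_m ** 2 / (focal_px * baseline_m) * disparity_error_px

With these placeholder values (a 500-pixel focal length and a 0.5-pixel disparity error), a point 5 m away carries roughly 0.4 m of depth uncertainty at a 6 cm baseline and roughly 0.2 m at 12 cm.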
1.3 Experimental setup
To validate the proposed localization system, we collected large amounts of data with a stereo camera and a smartphone in both the University Park Campus of the University of Southern California (USC Campus) and Downtown Los Angeles. Our datasets include all sensor measurements and manually labeled ground truth. The measurements consist of stereo and smartphone camera video sequences, IMU data, and GPS data. Both the ground truth and the measurements are synchronized with a global timestamp. All datasets are available for download at http://www-scf.usc.edu/~tungsinl/NavForBlind.
Although GPS measurements are available outdoors, they cannot be used as ground truth because the GPS signal can be blocked by buildings and trees. We therefore extracted the ground truth trajectory using the Google Earth software [33]. This is performed manually by comparing the ground texture found in the recorded images with Google satellite/aerial imagery. We select anchored frames, which are camera images with distinctive ground texture patterns visible, along the path we walked. When there is insufficient ground texture, we may use trees and road junctions that are visible in both Google Earth and the live view, as shown in Figure 1.4. We locate the spots where the anchored frames were captured in Google Earth and use the corresponding UTM coordinates (WGS84) as the ground truth camera locations for those frames. The height of all ground truth points is set to zero. We assume the camera always looks straight ahead during data collection and compute the ground truth heading angles from consecutive ground truth locations.

The ground truth location of each frame is approximated by interpolating the locations of the anchored frames, as illustrated in Figure 1.5. This is achieved by assuming constant forward velocity and constant turning speed between each pair of consecutive anchored frames. The interpolated ground truth is used to evaluate the VO accuracy; details are provided in Section 3.8.
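A minimal sketch of this interpolation step is given below, assuming (purely for illustration) that each anchored frame is stored as (frame_index, easting, northing) in UTM coordinates. For brevity, positions are interpolated linearly, which realizes the constant-velocity assumption; the constant-turning-speed refinement would interpolate headings and positions along an arc instead.

    import numpy as np

    def interpolate_ground_truth(anchors):
        """Interpolate per-frame ground truth positions between anchored frames.

        anchors: list of (frame_index, easting, northing) tuples in UTM,
                 sorted by frame index (a hypothetical format for this sketch).
        Returns a dict mapping every frame index between the first and last
        anchor to an interpolated (easting, northing) position."""
        ground_truth = {}
        for (f0, e0, n0), (f1, e1, n1) in zip(anchors[:-1], anchors[1:]):
            for f in range(f0, f1 + 1):
                t = (f - f0) / float(f1 - f0)   # fraction of the way to the next anchor
                ground_truth[f] = (e0 + t * (e1 - e0), n0 + t * (n1 - n0))
        return ground_truth

    def heading_angles(ground_truth):
        """Approximate the heading at each frame from consecutive positions,
        mirroring the assumption that the camera looks straight ahead."""
        frames = sorted(ground_truth)
        headings = {}
        for fa, fb in zip(frames[:-1], frames[1:]):
            (ea, na), (eb, nb) = ground_truth[fa], ground_truth[fb]
            headings[fa] = np.arctan2(nb - na, eb - ea)
        return headings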
To assess the accuracy of localization in Section 4.5, all ground truth points are projected onto the map data. This is because the estimated user location is constrained by the street map, which is represented with line and arc segments; this representation does not account for road/sidewalk widths or intersection sizes. To speed up inference, we subsample the data to a rate of one frame per second. The average localization error is the average root mean square (RMS) error between the localization output and the projected ground truth.
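For a single sequence, that error metric can be sketched as follows; the array names are placeholders, positions are assumed to be subsampled to 1 Hz, and the ground truth is assumed to be already projected onto the street-map segments. Averaging over multiple sequences would then be a simple mean of these per-sequence values.

    import numpy as np

    def average_localization_error(estimated_xy, projected_gt_xy):
        """RMS position error between the localization output and the
        projected ground truth, both given as N x 2 arrays of planar
        coordinates sampled at one frame per second."""
        estimated_xy = np.asarray(estimated_xy, dtype=float)
        projected_gt_xy = np.asarray(projected_gt_xy, dtype=float)
        residuals = estimated_xy - projected_gt_xy            # per-frame offset
        per_frame_error = np.linalg.norm(residuals, axis=1)   # Euclidean distance per frame
        return float(np.sqrt(np.mean(per_frame_error ** 2)))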
After the ground truth is available, we can conduct experiments with the proposed system. It is possible to implement both the motion estimation and localization modules in a single program; however, the proposed system is more flexible if the two modules run separately as client and server processes. The client process, which functions as the entry point of the entire system, contains all the time-critical processing, including image acquisition and VO. The localization module runs in the server process because it is more computationally intensive.
For all experiments described in this thesis, both the client and server processes run on the same desktop computer. To evaluate the localization accuracy quantitatively under different configurations without repeating the VO process, we first process the entire dataset with different VO algorithms and save the VO output in log files. Each log file contains N lines, where N is the number of frames in the sequence, and each line is the camera pose given in a local coordinate system defined with respect to the first camera frame. The localization module is then run with different settings to process the pre-computed VO log files sequentially. Details are provided in Sections 4.3.3 and 4.5.
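As an illustration of how such a log might be replayed, here is a minimal sketch; the "x y heading" line layout and the localizer.update interface are assumptions made for the example, not the actual file format or API used in this work.

    def read_vo_log(path):
        """Read a pre-computed VO log: one pose per frame, expressed in the
        coordinate frame of the first camera frame. The 'x y heading' layout
        is assumed only for this sketch."""
        poses = []
        with open(path) as f:
            for line in f:
                x, y, heading = map(float, line.split())
                poses.append((x, y, heading))
        return poses

    def run_localization(poses, localizer):
        """Feed the pre-computed poses to a localization module one frame at
        a time, mimicking live operation (localizer.update is a placeholder)."""
        for pose in poses:
            localizer.update(pose)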
To run the entire system online, the VO log file can easily be replaced by a callback function, implemented in the localization module, that processes the live camera poses computed by the VO algorithms. The localization results (in global coordinates) are fed back to the client once available. Therefore, no image data needs to be transmitted between the two processes, and the bandwidth required for the data exchange is extremely low. In practice, the two processes could run on separate mobile devices carried by the user. For instance, the client process could be implemented on a smartphone while the server runs on a powerful laptop placed in a backpack. It is also possible to run the server process remotely on a more powerful machine while keeping the client on the mobile device, as described in Chapter 5. This configuration greatly reduces the weight of the hardware the user must carry.
Currently, our motion estimation module runs in real time, as it is implemented in C++ with the OpenCV library [1]. Our localization module is based on the Python library [13], which takes advantage of multi-core architectures and should run much faster on newer processors with more cores. The localization process is more time consuming during the initialization stage, but it runs in real time once the user is localized. Nevertheless, our system generates the user's location continuously at frame rate by propagating the last localization output with the motion estimation results.
1.4 Organization of this thesis
This thesis is organized as follows. We start with a review of related work in Chapter 2. The issues and our method for VO are discussed in Chapter 3, and the integration of inertial sensors for motion estimation is presented in Section 3.9. Chapter 4 presents the localization module; the results for localization and for recovery from temporary loss are discussed in Sections 4.5 and 4.6, respectively. Finally, the conclusions of the study and future work are summarized in Chapter 5.
Figure 1.4: Images captured at the spots marked in red are selected as anchored frames. Three examples of common features visible in both Google satellite/aerial imagery and the live image: (a) distinctive ground texture (b) trees (c) markings and curbs at road junctions
Figure 1.5: Dense ground truth locations are obtained by interpolation. (a) The locations of the sparse anchored frames are marked with red circles. (b) The ground truth locations (highlighted with green crosses) of every 30th frame are approximated by interpolating the locations of the anchored frames in (a).
Chapter 2
Related Work
Many works have been proposed to solve the navigation problem for autonomous systems/robots and for human locomotion. A summary of recent work on human navigation is provided in Section 2.1, and we highlight some popular approaches from robotics in Section 2.2. Work specific to VO and to localization is reviewed in Sections 3.2 and 4.1.1, respectively.
2.1 Mobility aids for VI individuals
2.1.1 Canes
For many decades, the white cane has been the most popular mobility tool for VI individuals because of its low cost and high portability. The simple foldable stick provides haptic feedback to assist users in obstacle avoidance.
Some smart canes have been developed to replace the standard white cane used by VI people. Benjamin et al. [9] introduced the C-5 Laser Cane, also known as the Nurion Laser Cane, in 1973. The basic concept is optical triangulation with three laser diodes and three photodiodes as receivers. Because of the limited number of photoreceptors, it can only detect close-range obstacles up to 1.5 m to 3.5 m ahead. In addition, it can detect head-height and drop-off obstacles in front of the user. However, it cannot provide accurate depth measurements.
Yuan et al. [121] proposed an environment-exploration device that uses a camera and a laser pointer to detect corners, curbs, drop-offs, and stairways by tracking surfaces. Although this virtual white cane tracks surfaces with high accuracy, it has difficulty classifying reflections from specular surfaces; hence, the device is restricted to indoor use and cannot be used for large-scale navigation in outdoor environments under sunlight.
Another smart cane was introduced by Ye et al. [117] for both wayfinding and object detection. The white cane is integrated with a 3D camera capable of perceiving the surroundings and providing navigational information to its user. The 3D data are used to compute the relative motion of the cane and to construct a 3D map for object detection and obstacle avoidance.
However, none of these tools provides any orientation information for navigation.
2.1.2 Guide dogs
Guide dogs are another option for leading VI people to different places and keeping users from getting too close to potential hazards. However, such intelligent animals require costly training and are available to only a small percentage of VI individuals.
To overcome this, the Robotic Guide Dog (RGD) [57] was introduced to lead the way for VI travelers. The RGD uses navigational information stored in a set of radio-frequency identification (RFID) tags deployed in the environment to move from one point to another. A map containing the locations of the RFID tags and the points of interest is used by the RGD for navigation. The RFID reader on the RGD detects a tag and retrieves the stored information, based on which the RGD determines its location and the action for moving toward the next RFID tag until it reaches the destination. However, the RGD lacks portability, and the approach requires re-engineering the environment (i.e., installing RFID tags).
2.1.3 GPS and DGPS
Because of these limitations, many Electronic Travel Aids (ETAs) [18] have been introduced as alternative navigation tools for the VI population. One example of a proposed GPS-based ETA is the latest blind map [119]. There are also a number of GPS-aided navigation systems designed especially for the VI population. Petrie et al. [87] were the first group to apply a GPS receiver in a navigation system designed for VI individuals. Feiner et al. [25] proposed combining GPS with virtual reality applications in order to explore urban environments. Recently, the majority of cell phones have become equipped with GPS receivers; hence, many smartphone- and PDA-based solutions have been designed to help VI and elderly individuals navigate.
These solutions can provide accurate position information in open outdoor environments, provided that there is good GPS reception. GPS accuracy and availability depend on many external factors, such as the density of tall buildings or trees in the surrounding area, as well as climatic conditions. The position error measured by standard civilian GPS receivers can range from 10 m to 50 m [12]. As a result, ETA systems that rely only on GPS are not sufficiently accurate to orient VI users navigating in urban areas.
In addition, the GPS update rate is 1 s. Users with normal eyesight can continue navigating by observing visible landmarks, but VI users might easily lose their direction in unfamiliar places between GPS updates. Furthermore, the outdoor environment can be highly dynamic when the area is crowded. Therefore, VI individuals can face many difficulties when navigating independently within such uncertain areas.
In order to improve GPS accuracy, some research teams have suggested using Differential GPS (DGPS), which reduces the nominal error to sub-meter accuracy [88]. Drishti [91] provides position information to VI users in both indoor and outdoor environments. Outdoors, Drishti uses DGPS to allow users to walk along the center line of sidewalks. The user can issue simple voice commands to switch the system from outdoor to indoor mode; after the switch, the system uses an ultrasound positioning system to provide precise indoor location measurements. Their experimental results show that the localization error is less than 22 cm in an indoor environment.
However, DGPS is not common in many places because the solution requires a network of ground-based reference stations that are expensive to build and maintain. In addition, the DGPS solution requires expensive, bulky receivers that are not suitable for daily navigation.
2.1.4 Dead reckoning
Because good GPS reception is not always available, many approaches have been proposed to reduce GPS dependency using additional dead-reckoning sensors, such as inertial sensors. The combination of GPS and dead-reckoning sensor readouts provides continuous estimates during GPS outages in harsh environments, such as tunnels, underground passages, and dense urban areas. Positioning data from these two sources are usually integrated by an Extended Kalman filter or a particle filter, which perform well when the errors can be modeled as white noise that is uncorrelated with itself.
For instance, Mayerhofer et al. [71] proposed fusing GPS with inertial sensors in order to provide continuous position updates to VI users through dead-reckoning algorithms when the uncertainty of the GPS measurements exceeds predefined thresholds, or even after satellite loss. This technique is useful only when the GPS signal is lost for short durations, because the accumulated error of the dead-reckoning process grows over time, and the drifting position output might give the user wrong guidance if there is no correction from an absolute positioning device.
The following inertial sensors are popular for human navigation because they are lightweight and have low power consumption:

1. A pedometer is used in the systems of [24] and [46] to estimate the distance walked by a pedestrian by measuring step length. This is done by analyzing the acceleration along the gravity axis: as a person walks, the body undulates with each stride (a minimal step-detection sketch is given after this list). The technique is accurate to within 0.5% to 10%, depending on the gait style. Better accuracy can be achieved by mounting the accelerometer on the footwear [32].

2. Heading direction can simply be read from a magnetic compass. A compass is sensitive to local distortions of the magnetic field caused by vehicles, power lines, etc. [24]; this can be mitigated by fusing the compass with gyroscope measurements.

3. A gyroscope coupled with a Kalman filter can reduce erroneous readouts [32].

4. A head-mounted camera system was proposed in [118]. The method estimates user motion by fusing multiple measurements from both vision and inertial sensors. However, the vision-based solution requires a large amount of data about the environment, and it is not robust to lighting changes or abrupt motion.
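As a rough illustration of the step-counting idea described in item 1, here is a minimal sketch (not taken from [24], [46], or [32]) that counts steps by thresholding peaks in the gravity-axis acceleration and multiplies by an assumed step length.

    import numpy as np

    def count_steps(accel_gravity_axis, threshold=1.5, min_gap=10):
        """Count steps from the acceleration component along the gravity axis
        (m/s^2, gravity removed). A step is registered at each local peak that
        exceeds `threshold`, with at least `min_gap` samples between peaks.
        Both parameters are illustrative and would need per-user tuning."""
        a = np.asarray(accel_gravity_axis, dtype=float)
        steps, last_peak = 0, -min_gap
        for i in range(1, len(a) - 1):
            is_peak = a[i] > a[i - 1] and a[i] >= a[i + 1] and a[i] > threshold
            if is_peak and i - last_peak >= min_gap:
                steps += 1
                last_peak = i
        return steps

    def walked_distance(accel_gravity_axis, step_length=0.7):
        """Dead-reckoned distance, assuming a fixed step length in meters
        (in practice a pedestrian-specific constant)."""
        return count_steps(accel_gravity_axis) * step_length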
2.1.5 Simultaneous Localization and Mapping
Another approach to performing localization in GPS-restricted areas is to estimate orientation with respect to known landmarks. For this, some research groups have focused on building environmental representations "from scratch" through Simultaneous Localization and Mapping (SLAM) approaches. SLAM builds a map of the surrounding space and provides localization at the same time, without employing global navigation methods. A review of SLAM techniques, their modifications, and their results is given in [74].
Visual SLAM is a vision-based technique that can be applied to build a 3D map of the environment. For instance, Saez et al. [96] proposed a stereo-based navigation system built around a six-DoF SLAM for VI users. Their work consists of three components: egomotion, mapping, and observation integration. Egomotion is estimated by integrating 3D and 2D (appearance) information in real time. Mapping is conducted through a randomized global entropy minimization algorithm. When new observations are available, view integration fuses some of them in order to reduce the overall number of variables and allow global rectification to run in real time. The navigation system was further enhanced in [98] to predict the next movement of the VI user and maintain a local 3D map of the environment. This 3D map is used for overhead obstacle detection while the user navigates through a given area.
Pradeep et al. [89] proposed a head-mounted stereo camera system for VI users. By employing techniques such as stereo-based VO and feature-based metric-topological SLAM, they construct a 3D map of the user's vicinity while he/she is moving. This 3D map can be used to generate a traversability map and to detect obstacles.
Lee et al. [62] described an RGB-D-based navigation system that can work without any prior map. To achieve this, the system constructs a map while guiding VI users to a given destination. The navigation system, designed at the University of Southern California (USC), is composed of a helmet with a 3D sensor (also known as an RGB-D camera) and a tactile vest with four micro-vibration motors located on each side of the shoulders and waist. The RGB-D sensor allows the navigation system to continuously construct a 3D model of the indoor environment. The model is converted to a traversability map, which is a 2D floor plan with labels for obstacles and safe walking areas. To avoid an obstacle, a left/right shoulder vibration signals the user to turn left/right, whereas a waist motor vibration indicates a side step in the corresponding direction.
For all of these SLAM solutions, the algorithms must maintain and update a map of ever-increasing size as the user moves about, and must use that map for pose estimation. This can result in intractable memory requirements and computational cost.
2.1.6 Visual Teach-and-repeat
Visual teach-and-repeat navigation allows long-range self-localization without solving the SLAM problem or requiring an accurate global reconstruction [27, 95, 102]. Treuillet et al. [110] proposed a vision-based algorithm that performs self-localization for VI users by employing a visual teach-and-repeat technique similar to that of Royer et al. [94]. The method consists of two stages. In the learning stage, a video sequence is recorded by the onboard vision system and processed off-line to automatically build a 3D map of the environment using Bundle Adjustment (BA). During the navigation assistance stage, the camera is localized by matching the stored 3D landmarks with the projections extracted from the current image. The system output is the position and orientation of the user in the 3D map. The data association between the observed 2D feature points and the landmarks in the 3D map is done by selecting the key frame nearest to the camera origin in terms of Euclidean distance. System tests were conducted with real VI individuals in both outdoor and indoor environments; loop closure was not employed in the experiments. The walking trips were approximately 150 m and 70 m long for the outdoor and indoor environments, respectively. Their experimental results show that the proposed system can keep the walking user within a 1 m wide band along the predefined path.
Liu et al. [69] presented an indoor vision-based localization system based mainly on topological and appearance information. The system consists of two stages. First, the system records a reference sequence consisting of 2D key frame positions and orientation data provided by an IMU. GIST and SURF features are extracted and referenced to their corresponding key frames. During the online localization stage, similar key frames are selected from the reference sequence by comparing the GIST feature extracted from the live scene. SURF features are also extracted from the live image and matched against the landmark features stored in the map. The camera pose is then estimated using a Hidden Markov Model (HMM).
These methods work only when the user travels along the same path on which the learning video sequence was recorded, because the camera viewpoints during the navigation assistance stage must be very similar to those of the stored key frames captured during the learning stage.
2.1.7 Map-aided localization
On the other hand, precise maps of outdoor or indoor environments may already be available, e.g., building or city plans that are accurate to within a few centimeters, an accuracy scarcely achievable by any SLAM or teach-and-repeat technique. Free mapping services such as Google Maps [34] and OpenStreetMap (OSM) [4] provide geo-tagged maps for almost any location worldwide.
For instance, Oh et al. [79] proposed a particle filter system to localize a walking user within a campus map. Instead of initializing particles uniformly over the entire area, they exploited the semantic information available in maps to bias the motion model toward areas of higher probability. The system assigns a higher probability to the edges of a sidewalk, based on the observation that VI people tend to walk along the edges of a sidewalk rather than its center. Given no other information aside from the probability map, a motion model, and the initial position and orientation of the person, their system can estimate the short-term trajectory of the person by considering the most probable behavior. The particle weights are updated by the GPS measurements, which are assumed to be Gaussian distributed.
Assuming that a map of the indoor environment is available, Apostolopoulos et al. [5] implemented an interactive system that lets users inform the navigation system when they reach certain prominent indoor structures, such as doors, staircases, and hallways. Upon receiving this user feedback, the system computes the location of the user from the pre-programmed coordinates of those indoor structures.

The above are some recent navigation aids designed especially for the VI population. We highlight advances in robotic navigation in the following section.
2.2 Robot navigation
Because of transmission delays and the limited number of human operators, many applications require robots to navigate autonomously instead of via remote control. For instance, NASA's Mars Explorer is capable of navigating autonomously on the planet using a stereo camera. Google is developing a self-driving vehicle that can navigate in urban areas using LIDAR and small radar sensors without the need for a human driver. There are many ongoing studies on autonomous robot navigation. These robots/vehicles are usually equipped with different sensors depending on the environments in which they operate.
2.2.1 GPS and dead reckoning
In robotics, GPS receivers are the primary sensors for large-scale outdoor navigation. Because good GPS reception is not always available, many approaches have been proposed for robotic systems to reduce GPS dependency.
For instance, Scaramuzza et al. [100] proposed a VO algorithm to estimate the trajectory of a moving vehicle using a single car-mounted camera. Such a VO algorithm reconstructs the platform motion by estimating and accumulating the coordinate transformations between successive camera images. Typical approaches to VO use feature extraction algorithms to generate robust image descriptors, and then match them between frames to reconstruct the camera motion [77], [78].
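A minimal sketch of that accumulation step is given below, assuming each frame-to-frame estimate is provided as a 4x4 homogeneous transform; this is a generic illustration of VO dead reckoning, not any specific pipeline.

    import numpy as np

    def accumulate_poses(relative_transforms):
        """Chain frame-to-frame 4x4 homogeneous transforms into global poses.

        relative_transforms: iterable of 4x4 arrays T_k mapping points from
        frame k to frame k-1 (a common VO convention, assumed here).
        Returns the camera poses expressed in the first camera frame, which
        is exactly the dead-reckoning step that lets errors accumulate."""
        pose = np.eye(4)                  # the first camera frame is the origin
        trajectory = [pose.copy()]
        for T in relative_transforms:
            pose = pose @ np.asarray(T)   # compose with the newest relative motion
            trajectory.append(pose.copy())
        return trajectory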
Some robotics studies have proposed computing egomotion by fusing measurements from onboard inertial sensors and a camera, or from wheel odometry and a camera [45]. The current position of the system is calculated by dead reckoning, which adds up the previous motion estimates. However, the dead-reckoning process is prone to drift over time, and the accumulated positioning error keeps increasing without bound. A simple remedy is to reset the accumulated error using absolute position measurements from GPS whenever reliable satellite signals are available.
2.2.2 Simultaneous Localization and Mapping
Filter and key frame-based Many vision-based navigation approaches have been proposed for
robotic platforms. One popular approach in the robotics community is vision-based SLAM. In Visual
SLAM, the objective is to incrementally construct a map of the environment while simultaneously estimating
the pose of the camera within that map [97, 89]. The two main categories of Visual SLAM methods
are filtering and key frame-based methods. Filtering-based methods, such as the Extended Kalman Filter (EKF),
combine the information of all images with some pre-defined probability distribution models [45, 21].
Key frame-based methods optimize the camera poses and landmark locations together using nonlinear
optimization on selected key frames [105].
The 1-Point RANSAC method [45] maintains a temporary local map that consists of recently observed
visual features, and uses a modified EKF to track the features and estimate the 6-DoF pose of a mobile
robot. In order to obtain an accurate state estimate, the method employs a RANSAC procedure to
generate hypothesized states and find the Most Supported State Hypothesis (MSSH). The features that
support the MSSH are then used to update the state estimate. However, the method cannot be directly
applied to VI users because wheel odometry data is not available in our case. In addition, like
any other EKF-based estimator, the method linearizes the nonlinear SLAM problem, and thus
accrues state estimation error [17].
Bundle Adjustment (BA) Parallel Tracking and Mapping (PTAM) [54] is one of the best known
monocular SLAM approaches based on BA, which is widely used in structure-from-motion applications in
computer vision research. It is primarily designed for augmented reality applications within small areas.
The algorithm performs real-time camera pose estimation by tracking many feature points between
consecutive frames. The camera poses and landmarks are optimized using BA. Instead of using all
frames, the map optimization builds a 3D map of the environment by performing 3D reconstruction
based on a subset of frames (key frames).
Recently, Blosch et al. [10] proposed a vision-based navigation framework for Micro Aerial Vehicles
(MAVs). Their system utilizes PTAM to accurately estimate the pose of a flying vehicle. However,
the main limitation of PTAM is that it does not scale well to larger environments. In addition, the
algorithm needs to be initialized by translating the camera sideways in order to generate a virtual stereo
pair. This initialization is performed in order to estimate an approximate depth for the initial 3D points.
Then, new 3D points are triangulated according to previously reconstructed key frames. This initialization
procedure plays an extremely important role in the final quality of the 3D map, and the results can differ
substantially from the real ones if the stereo initialization is not sufficiently accurate, as was shown in [115],
for example.
However, most existing visual SLAM systems assume almost static environments in which the
majority of the feature correspondences must come from static structures [56, 72]. These methods are
highly sensitive to outliers, such as those caused by vehicles or pedestrians. They may fail in crowded
scenes where many objects move in a consistent way that can be explained by a single motion model. They may
also fail when only one object moves directly in front of the camera and covers almost the entire FOV.
SLAM for dynamic environments Therefore, many works have been proposed to avoid including
feature points on moving objects as landmarks during map construction. The method by Yuan et
al. [120] first detects feature points from a live image. The RANSAC algorithm [26] is then used to
iteratively estimate the camera motion parameters and to identify and reject outlier feature points from the
optimization. The algorithm is designed for single cameras, but its performance was evaluated with only
a small set of experiments.
The stereo-based approach in [51] relies on a binary classifier [31] to classify image patches into static
or movable objects based on their appearance. An Iterated Sigma Point Kalman Filter (ISPKF) [103]
is used to estimate the camera motion parameters using only the static image patches. Although the
binary classifier plays a crucial role in the algorithm, no detail about the training method was given in
the paper. Some experiments were conducted in [51] using a camera mounted on a vehicle. The
mean square localization error for their longest trial is approximately 30 m², which could be a large
navigation error for walking users.
Unlike the previous methods that discard detected outliers, the work in [59], designed for automobile
navigation, not only classifies regions of stereo images into moving and stationary parts, but also
tracks the moving objects with an EKF in order to predict their trajectories in the near future. Stationary
parts are integrated into an occupancy grid (or traversability map). In order to separate moving vehicles
from the stationary road, their classifier exploits the special structure of the disparity map found in
common road scenes. However, these special disparity structures cannot be located easily on pedestrian
paths. The provided simulation results showed the effectiveness of their algorithm in detecting moving
objects, but no quantitative experiments were conducted on real scenes.
There is also the SLAM framework of [92] that attempts to incorporate moving objects as dynamic
landmarks. The preliminary results from their experiments showed that their approach is capable of
estimating camera motion and moving objects. However, those tests were conducted within a small
office.
2.2.3 Map-aided localization
As mentioned before, precise maps of outdoor or indoor environments may already be available. For
example, in [39], a cadastral map with the footprints of buildings was used for robot localization in urban
environments. In [37], an environmental map of predefined routes was used for path planning of an
autonomous forklift truck. In the DARPA Grand Challenge competitions, a detailed predefined map
of the path [108] and route network [113] was given to all teams. These data included lane and traffic
sign information, thus forming the background for autonomous navigation. In these approaches, the
environmental representation is typically created by experts for a specific scenario at a certain location.
For localization in indoor environments where GPS is totally unavailable, the laser scanner is a popular
sensor when a map of the environment is available. Stachniss et al. [104] proposed a method for
learning accurate 2D maps for a humanoid robot equipped with a laser scanner. These maps
can then be used efficiently in applications that require global localization, such as the museum robot guide
Robotinho presented in [23]. Recently, Hornung et al. [41] proposed a six-DoF laser-based localization
approach for humanoid robots in complex environments. In that work, only localization is
addressed, and therefore a volumetric 3D map of the environment is created in advance. A more
detailed review of map-aided localization is given in Section 4.1.1.
2.3 Summary
We showed that the existing navigation aids for the VI population are not efficient. We also showed
that robot vision is promising. However, we cannot transfer these techniques directly to human systems
because of instrumentation and robustness issues. In the following chapters, we present our design for
a navigation aid that performs egomotion estimation and localization to facilitate navigation for VI users
within both static and dynamic outdoor environments.
Chapter 3
Visual Odometry in dynamic environments
3.1 Introduction
VO algorithms attempt to reconstruct the platform motion by estimating and accumulating coordinate
transformations between successive camera images. Typical approaches to VO utilize feature extraction
algorithms to generate robust image descriptors, and then match them between frames to reconstruct the
camera motion [77], [78].
Figure 3.1: Pose estimation using the stereo camera only (block diagram: stereo camera, position estimation, guidance planning, user interface)
In this chapter, we describe our proposed VO designed to manage dynamic environments, as shown
in Figure 3.1. The chapter is structured as follows: Section 3.1 introduces VO in general. Section 3.2
presents related work in VO. Section 3.3 presents an overview of our approach. Section 3.4 presents our
ground plane estimation algorithm. Section 3.5 presents the egomotion algorithm. Section 3.6 presents
the validation methods and results with simulated data. Section 3.7 presents the validation results
with a public dataset. Section 3.8 presents the validation methods and results with our dataset. Section 3.9
presents our VO method with IMU. Section 3.10 presents the validation results on our dataset with
IMU. Section 3.11 presents the extension to a monocular camera. A summary is given in Section 3.12.
In the discussion from Sections 3.4 to 3.7, we assume that the stereo camera is the only input to the position
estimation system, as shown in the block diagram in Figure 3.1. Unlike a monocular system, stereo provides
higher accuracy without scale drift. In addition, it offers substantially lower size, weight, power, and
cost compared with other consumer depth sensors, such as the Kinect, ToF cameras, and laser scanners.
Assuming that the stereo camera is the only input for position estimation, we can compute VO in
order to estimate the velocity and orientation change of a user. Position can be estimated by integrating
these values over time. VO is computed with respect to the origin of the first camera frame, fixed at
(0,0,0) in 3D space. The initial angle is set at 90 degrees from the X-axis, as shown in the left diagram
in Figure 3.2. If the initial world coordinates (latitude and longitude) and heading are given, we can
transform the estimated trajectory into the world coordinate system shown in Figure 3.2.
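To make this transformation concrete, the sketch below rotates a locally estimated trajectory by the initial heading and offsets it by the initial position, using a simple local flat-earth approximation to convert metric offsets into latitude/longitude. The function name and the flat-earth conversion are illustrative assumptions, not part of the thesis implementation.

import numpy as np

def vo_to_world(traj_xz, lat0, lon0, heading_deg):
    """Transform a local VO trajectory into approximate world coordinates.

    traj_xz     : (N, 2) array of VO positions in meters; x is lateral,
                  z is forward, with the first frame at the origin.
    lat0, lon0  : latitude/longitude of the first frame (degrees).
    heading_deg : initial compass heading of the camera (degrees,
                  clockwise from north).
    Returns an (N, 2) array of (lat, lon) estimates.
    """
    # Rotate the local (x, z) frame so that the camera's forward axis (z)
    # points along the given compass heading.
    h = np.deg2rad(heading_deg)
    east = traj_xz[:, 0] * np.cos(h) + traj_xz[:, 1] * np.sin(h)
    north = -traj_xz[:, 0] * np.sin(h) + traj_xz[:, 1] * np.cos(h)

    # Flat-earth approximation: meters per degree of latitude/longitude.
    m_per_deg_lat = 111320.0
    m_per_deg_lon = 111320.0 * np.cos(np.deg2rad(lat0))

    lat = lat0 + north / m_per_deg_lat
    lon = lon0 + east / m_per_deg_lon
    return np.column_stack([lat, lon])

# Example: a straight 10 m walk due north starting at a given coordinate.
traj = np.column_stack([np.zeros(11), np.linspace(0.0, 10.0, 11)])
print(vo_to_world(traj, 34.0205, -118.2856, heading_deg=0.0)[-1])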
3.2 Related work
Research in VO using visual sensors has been performed for the past decades. There is a large volume of
published studies that describe dierent VO designs using a wide range of sensors. Most assume relative
static environments. In recent years, there has been an increasing amount of literature on camera motion
estimation in dynamic scenes. Here, we brie
y review some of these approaches.
3.2.1 Static environments
Stereo-based methods are popular for VO because of their low power consumption and their capability of
providing scene depth measurements; hence, they exhibit no scale drift in the egomotion outputs.
Figure 3.2: Stereo-based VO output (from a car-mounted stereo camera with a 54 cm baseline), starting at (0, 0, 90°); the trajectory can be translated into world coordinates if the initial pose (latitude, longitude, heading) is known
The relative camera motion between two consecutive frames is estimated by tracking 3D feature points from
one frame to the next.
Libviso1 [52] and Libviso2 [30] are two extensions of generic stereo-based VO designed for the
automobile platform. Libviso1 first computes feature correspondences between two consecutive stereo
image pairs, and estimates egomotion using an ISPKF. Dynamic environments are managed using an outlier
rejection scheme based on RANSAC. In order to improve accuracy, Libviso2 performs a "circular" feature
match among four images: the left and right images of two consecutive stereo pairs. The camera motion is
then computed by minimizing the sum of reprojection errors. A Kalman filter is used to refine the
velocity estimates and to predict the camera motion when the minimization process does not converge.
Although generic stereo-based VO requires a minimum of three pairs of feature correspondences to
solve for the 6-DoF camera motion, Kaess et al. [47] divide the feature correspondences into two groups
(far and close features) based on the disparity map. The camera rotation and translation are then estimated
with two-point and one-point RANSAC, respectively. Scaramuzza [100] further reduced the number of
correspondences needed for motion estimation to one point using the non-holonomic constraint of the vehicle.
Instead of using a stereo camera, Tardif et al. [107] presented a VO algorithm using a car-mounted
omnidirectional camera that decouples the camera motion into rotation and translation estimation.
PTAM [54] is a monocular algorithm that estimates the camera pose through localization. PTAM first initializes
the 3D map from two key frames captured from different views. It then expands the map with subsequent
frames while estimating the camera pose. BA is run simultaneously to reduce drift error. While PTAM
produces good egomotion in small-scale indoor environments, we found that the system cannot detect
good key frames robustly and fails to expand the map continuously in outdoor environments.
3.2.2 Dynamic environments
All the aforementioned methods assume that the environment is almost static and rely on RANSAC [26]
to manage the outliers induced by object movements. They work provided that outliers constitute a minority
of the observations. In addition to detecting and rejecting outliers, the method in [7] uses an EKF to estimate
the position and velocity of moving objects.
Some methods employ prior knowledge of the environment. The method in [8] recovers structure
from motion from two views of a piecewise planar scene, and [80] computes the relative camera motion under
a weak Manhattan-world assumption with IMU measurements.
To manage dynamic scenes, some recent VO works utilize the ground planar surface. For instance,
the wheeled robot in [106] computes egomotion only from points sampled on the ground region.
Panahandeh et al. [84] proposed estimating egomotion by fusing a monocular camera and an IMU. To avoid
tracking moving objects, only ground features are detected and tracked between consecutive frames. A
measurement model is derived from the ground feature correspondences. Such a model imposes visual
constraints on the inertial navigation system to estimate the camera motion in 6-DoF with an SPKF.
Experiments were conducted in small-scale indoor scenarios. Kitt et al. [53] used the ground plane to
resolve the VO scale by assuming prior knowledge of a fixed camera height and depression angle. The
flying robot in [60] uses the geometry of the ground plane to accelerate BA. There are other
systems [15, 16, 67, 49] that exploit the ground plane constraint in VO estimation, but they are all
designed for automobile platforms.
The following is a detailed discussion on two popular stereo-based VO algorithms that inspired our
proposed algorithm.
3.2.3 Generic stereo-based VO
Figure 3.3 illustrates the steps for computing the camera rotation R and translation t between two frames
by tracking 3D points. From the stereo pair acquired at the previous time step k-1, the VO algorithm
measures the image coordinates of each landmark in the left image and uses them to compute the 3D
landmark coordinates by triangulation. When a new stereo pair is acquired at time k, the corresponding
image coordinates of each landmark are measured in the new left image. In Figure 3.3, the 2D points
u_1, u_2, u_3 measured in stereo pair k correspond to the 3D coordinates X_1, X_2, X_3 of the landmarks
measured in stereo pair k-1.
Let M denote the projection that takes a 3D point X and maps it to the pixel u on the current left
image plane. Given a calibrated stereo camera with focal length f, principal point (p_x, p_y), square
pixels, and zero skew, the 3D coordinates of a landmark in the previous stereo pair, X_i = [X_i, Y_i, Z_i]^T,
and its corresponding 2D coordinates u_i = [u_i, v_i]^T in the current stereo pair are related by
\[
\begin{pmatrix} u_i \\ v_i \\ 1 \end{pmatrix}
= M(X_i; \omega, t)
= \begin{pmatrix} f & 0 & p_x \\ 0 & f & p_y \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} R & t \end{pmatrix}
\begin{pmatrix} X_i \\ Y_i \\ Z_i \\ 1 \end{pmatrix}
\tag{3.1}
\]

Figure 3.3: Generic stereo-based VO
The rotation matrix R is approximated by the three parameters \omega = [\omega_x, \omega_y, \omega_z]^T instead of 9 parameters using

\[
R = \begin{bmatrix}
0 & -\omega_z & \omega_y \\
\omega_z & 0 & -\omega_x \\
-\omega_y & \omega_x & 0
\end{bmatrix}
\tag{3.2}
\]
The relative camera rotation R and translation t are estimated by minimizing the sum of reprojection
errors E:

\[
E = \arg\min_{\{\omega, t\}} \sum_{i=1}^{N} \left\| u_i - M(X_i; \omega, t) \right\|^2
\tag{3.3}
\]
Figure 3.4: (a) Circular matching (b) Libviso2 method
3.2.4 Libviso2
Libviso2 [30] improves the generic algorithm by using "circular matching", as illustrated in Figure 3.4(a).
In other words, Libviso2 matches the features in a "circle": starting from all feature candidates in the
current left image, Libviso2 finds the best match in the previous left image within an N×N search
window; next in the previous right image, then the current right image, and last in the current left image
again. A "circular match" is accepted only if the last feature coincides with the first feature.
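As a rough illustration of this acceptance test (not Libviso2's actual implementation), the sketch below chains nearest-neighbor descriptor matches through the four images and keeps a candidate only if the chain returns to the feature it started from; the brute-force matching and the descriptor format are assumptions made for brevity.

import numpy as np

def nearest(desc, db):
    """Index of the nearest descriptor in db for each row of desc (L2 distance)."""
    d = np.linalg.norm(desc[:, None, :] - db[None, :, :], axis=2)
    return np.argmin(d, axis=1)

def circular_matches(cur_left, prev_left, prev_right, cur_right):
    """Return indices i in cur_left whose match chain
    cur_left -> prev_left -> prev_right -> cur_right -> cur_left
    comes back to i. Each argument is an (N_k, D) descriptor array."""
    a = nearest(cur_left, prev_left)          # current left   -> previous left
    b = nearest(prev_left[a], prev_right)     # previous left  -> previous right
    c = nearest(prev_right[b], cur_right)     # previous right -> current right
    d = nearest(cur_right[c], cur_left)       # current right  -> current left
    start = np.arange(len(cur_left))
    return start[d == start]                  # accepted "circular" matches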
The tracking results of "circular matching" provide an additional 2D measurement of each landmark
in the right image. Libviso2 includes this additional measurement in the optimization to improve accuracy,
using (3.4).
\[
E = \arg\min_{\{\omega, t\}} \sum_{i=1}^{N}
\left\| u_i^{(l)} - M^{(l)}(X_i; \omega, t) \right\|^2
+ \left\| u_i^{(r)} - M^{(r)}(X_i; \omega, t) \right\|^2
\tag{3.4}
\]
Here M^{(l)} denotes the projection that takes a 3D point X and maps it to a pixel u^{(l)} on the left
image plane, as given in (3.1). Similarly, M^{(r)} denotes the projection onto the right image plane, given in
(3.5), where B denotes the baseline. Using Gauss-Newton optimization, Libviso2 iteratively minimizes
(3.4) with respect to the transformation parameters R and t. Here u_i^{(l)} and u_i^{(r)} denote the feature
locations in the current left and right images, respectively.
\[
\begin{pmatrix} u_i \\ v_i \\ 1 \end{pmatrix}^{(r)}
= M^{(r)}(X_i; \omega, t)
= \begin{pmatrix} f & 0 & p_x \\ 0 & f & p_y \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} 1 & 0 & 0 & -B \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}
\begin{bmatrix} R & t \\ 0_{1\times 3} & 1 \end{bmatrix}
\begin{pmatrix} X_i \\ Y_i \\ Z_i \\ 1 \end{pmatrix}
\tag{3.5}
\]
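To make the two-view reprojection cost concrete, here is a minimal sketch that evaluates the left/right residuals of (3.4) and minimizes them with scipy's nonlinear least-squares solver rather than the hand-written Gauss-Newton loop used by Libviso2; the rotation-vector parameterization and all variable names are assumptions for illustration.

import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def project(X, omega, t, f, px, py, baseline=0.0):
    """Project 3D points X (N,3) into the left (baseline=0) or right camera."""
    Xc = Rotation.from_rotvec(omega).apply(X) + t   # rigid transform
    Xc = Xc - np.array([baseline, 0.0, 0.0])        # shift into the right camera
    u = f * Xc[:, 0] / Xc[:, 2] + px
    v = f * Xc[:, 1] / Xc[:, 2] + py
    return np.column_stack([u, v])

def residuals(params, X, ul, ur, f, px, py, B):
    omega, t = params[:3], params[3:]
    rl = project(X, omega, t, f, px, py) - ul              # left-image residuals
    rr = project(X, omega, t, f, px, py, baseline=B) - ur  # right-image residuals
    return np.concatenate([rl.ravel(), rr.ravel()])

def estimate_motion(X, ul, ur, f, px, py, B):
    """X: (N,3) points from the previous pair; ul, ur: (N,2) observations in the
    current left/right images. Returns (omega, t) minimizing the cost in (3.4)."""
    sol = least_squares(residuals, np.zeros(6), args=(X, ul, ur, f, px, py, B))
    return sol.x[:3], sol.x[3:]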
The main limitation of generic stereo VO and of Libviso2 is that 3D points on moving objects (dynamic
points) may be included in the optimization (3.3) or (3.4). Libviso2 assumes that the scene is almost
static, uses the RANSAC [26] algorithm to reject such points as outliers, and uses only the 3D points on
static structures (static points) for the final optimization. However, RANSAC may fail in dynamic
environments where moving 3D points form the majority of the scene, as shown in Figure 3.5(a). The optical
flow detected on moving people may be grouped as inliers if the pedestrians move with similar speed and
direction.
3.3 Overview of our approach
In order to prevent moving 3D points from being included in the motion estimation, we propose estimating the
camera motion only from the observed ground plane surface [64]. We tested the generic and Libviso2 VO
algorithms, but both failed in crowded scenes because these approaches can include moving 3D points in
the camera motion estimation. This is illustrated in Figure 3.5, and we provide quantitative results
in Section 3.8. When the user navigates in dynamic environments, the ground surface is a prominent
static structure most of the time, even when the static background is occluded by the crowd. Therefore,
the 3D point cloud on the ground plane is more reliable and stable for camera motion estimation. We
propose to decouple the 6-DoF camera motion into two components:
1. the motion of the ground plane
2. the motion on the ground surface
First, our method extracts the ground plane, and then it computes the motion on it. The details of both steps
are provided in Sections 3.4 and 3.5.
Figure 3.5: (a) Selecting points from the entire scene vs. (b) selecting points on the ground
The block diagrams in Figure 3.7 show the differences between conventional direct stereo odometry
and our method. The standard method (Figure 3.7(a)) first computes feature correspondences
between the left and right frames. Next, it computes the 3D coordinates of each feature point by means of
standard triangulation equations. A 2D motion field is sampled from the most prominent static structure
found in the scene. The camera motion is computed by minimizing the sum of reprojection errors of the
3D points found in the entire scene. Outliers are detected by RANSAC during the optimization. However,
inconsistent motion vectors (outliers) are difficult to detect and reject in dynamic environments rich in
moving objects.
Figure 3.6: (a) Detected ground region in yellow and estimated ground plane normal as a red vector (b) Optical flow on the ground surface, with inliers in green and outliers in red
Figure 3.7: Standard stereo-based VO vs. our method: (a) block diagram of standard stereo visual odometry; (b) block diagram of our approach
Instead, we utilize the global motion field property of the ground planar surface and decompose
the 6-DoF egomotion into 1) the motion of the ground plane and 2) the motion on the plane. Figure 3.7(b)
illustrates the key components of this approach. The pose of the ground plane must be inferred in order
to determine a unique camera pose from the optical flow of the coplanar 3D points. To accomplish this,
the ground regions are detected in every frame from the disparity image, as shown in Figure 3.6(a). We
evaluate two approaches to estimate the normal of the ground plane: one from the stereo measurements
and the other from the IMU measurements. We then perform robust estimation of the egomotion as a
composition of these two motions, as shown in Figure 3.6(b).
There are many differences between our method and those mentioned in Section 2.1. First, as the
block diagrams in Figure 3.7 show, our method samples the 2D motion (optical flow) only on the
ground surface. We estimate the ground plane structure and use it to simplify the egomotion formulation.
Second, we use a wearable stereo camera with a baseline as short as 6 cm, whereas the other methods use
longer baselines that range from 12 cm to 70 cm. In addition, we captured our dataset by walking through
different environments, similar to [7], so no wheel odometry is available and our system must be
robust to any motion blur caused by body motion, whereas data acquired by wheeled robots is relatively
stable. This also explains why the VO algorithms of wheeled robotic platforms cannot be transferred directly
to human applications. We argue that egomotion estimation for a walking user (first-person
vision) is radically different from that for a moving vehicle, especially in urban areas where the majority of
the roads are relatively flat. The motion of a head-mounted camera on a walking person is significantly
more complex than that of a car-mounted camera: the former is a combination of 6-DoF head and body
motions, whereas the latter is dominated only by the forward velocity and the yaw and pitch angles. In addition,
VI applications require a higher rate of VO updates. Hence, the dead-reckoning error accumulated over
time may grow faster than the moving speed of the camera.
Although our work is similar to [106], which estimates the VO of a wheeled robot by
tracking feature points on the ground, our proposed algorithm is designed to work with a smaller stereo
camera worn at the eye level of an adult. The range between our stereo camera and the ground plane
is much larger; hence, the textures of the ground surface become less prominent. Furthermore, the
ground feature points are likely to be occluded by the bodies of pedestrians in crowded environments.
3.4 Ground plane model estimation
Our VO method decomposes the 6-DoF camera motion into: (1) the motion of the ground plane and (2)
the motion on the ground surface. In other words, it first extracts the ground plane and then computes the
motion on the plane. In this section, we explain the ground plane estimation in detail. We represent
the ground plane, π, with the standard 3D plane equation:

\[
\pi : \; AX + BY + CZ + D = 0
\tag{3.6}
\]

where n = [A, B, C]^T is the unit normal vector of the plane and D denotes the distance from the
camera origin to the plane. We assume that the Z-axis of the world coordinate system is aligned with
the camera optical axis, and the X- and Y-axes are aligned with the image axes x and y. The focal
length f and the principal point (p_x, p_y) are estimated by camera calibration [11]. The parameters
A to D are estimated in two steps: ground detection and normal estimation. In ground detection, a
road mask is first computed automatically by analyzing the disparity image. Next, the plane distance
from the origin is computed from the 3D point cloud that intersects the road mask. We investigated two
approaches to estimate the plane normal. The first method uses visual data and fits a plane to the 3D
point cloud within the road mask. The second method estimates the normal from IMU measurements.
The first method updates the plane normal with the actual terrain on which the VI user is walking, but
we found that the second method provides more stable results in the busy urban areas in which we are
interested.
3.4.1 Ground plane detection for short baseline stereo
We first identify the image regions that correspond to the ground surface in the disparity map and
determine whether the ground region is sufficiently large for egomotion estimation. If the ground region
is too small or absent, our VO discards the frame and estimates the camera motion by propagating
the last estimated camera motion with a Kalman filter, assuming a linear Gaussian motion model.
We find that this assumption is valid in practice.
When a new pair of stereo images is captured by the camera, our algorithm computes a dense
disparity map after the standard rectification process. The pixels that correspond to the closest points
have larger disparity values than those farther away. The V-disparity space can be obtained easily from the
disparity map by computing the histogram of disparity values for every image row (y-coordinate).
If we assume that the camera is approximately upright and facing a horizontal, flat ground plane, the
y-coordinates of the ground plane in the image are linearly related to the corresponding disparity
values. This is illustrated in Figure 3.8.
Formally, the disparity value of any 3D point seen by a stereo camera with focal length f and
baseline B is related to the depth Z of the 3D point by:

\[
d = \frac{fB}{Z}
\tag{3.7}
\]
Figure 3.8: The y-coordinates of the ground plane in the image are linearly proportional to the corresponding disparity. (a) Rectified left image; (b) disparity image; (c) vertical profile of the disparity at x=200, marked by the dotted line in (b)
Therefore, the disparity d is inversely proportional to the depth of the point. Given a calibrated head-mounted
stereo camera with depression angle θ to the horizontal ground plane and zero roll angle, the
work in [43] shows that the disparity d of the ground plane surface is linearly related to the y-coordinate on
the image plane by

\[
\frac{dh}{b} = (y - p_y)\cos\theta + f\sin\theta
\tag{3.8}
\]

where f is the focal length, b is the baseline, p_y is the y-coordinate of the principal point, and h is the positive
height of the camera above the ground plane. For each stereo image pair, we compute a dense disparity
image and convert it into a V-disparity image [43] as input to the ground detection algorithm. Image
regions that correspond to the road surface can then be located by fitting a straight line in the V-disparity
domain using a method such as the Hough transform [42].
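As a small sanity check of (3.8), the sketch below predicts the expected ground disparity for each image row from the camera height, tilt, and intrinsics; the parameter values in the example are placeholders, not the calibration of our eyewear.

import numpy as np

def ground_disparity_per_row(rows, f, b, p_y, h, theta_rad):
    """Expected ground-plane disparity for each image row, from (3.8):
    d = (b / h) * ((y - p_y) * cos(theta) + f * sin(theta))."""
    y = np.asarray(rows, dtype=float)
    d = (b / h) * ((y - p_y) * np.cos(theta_rad) + f * np.sin(theta_rad))
    return np.clip(d, 0.0, None)   # negative values lie above the horizon

# Example with placeholder parameters: a 240-row image, 6 cm baseline,
# camera 1.6 m above the ground, tilted 14 degrees downward.
rows = np.arange(240)
d = ground_disparity_per_row(rows, f=300.0, b=0.06, p_y=120.0,
                             h=1.6, theta_rad=np.deg2rad(14.0))
print(d[120], d[239])   # disparity at the principal-point row and the bottom row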
Ground surface detection by the V-disparity algorithm is often a prerequisite for object detection in
automobiles with stereo cameras [59, 58, 28]. Because drivers usually maintain a safe distance from
the vehicles in front, a large road surface is usually captured within the Field-Of-View (FOV) of
the stereo rig, which is mounted at a fixed depression angle on the vehicle. Hence, standard V-disparity works
very well in detecting road regions.
Because of the limited range of short-baseline stereo cameras, the diagonal line that corresponds to
the horizontal plane may be almost vertical in the narrow disparity map, and thus the Hough transform
may detect it and the vertical line in the V-disparity image as a single line. However, we also observe that
this vertical line is more prominent with a short-baseline stereo camera, unless an obstacle is blocking the
lens or the camera is pointing directly at the ground. This is because short-baseline stereo has an
extremely limited range, and large portions of the disparity map are filled with the minimum disparity value
when the camera is looking at open space, hence inducing a prominent vertical line in the V-disparity
image.
We made two modifications to the V-disparity method in [61] to extract the diagonal line. After
thresholding the V-disparity image, we first apply thinning and cleaning morphological operations to clean up
the V-disparity domain.
Second, instead of detecting the diagonal line directly, we remove from the V-disparity image those regions
that do not correspond to the horizontal plane. This is done by summing the binary intensities of each column
in the V-disparity domain; the column with the highest sum corresponds to the vertical line. All columns of
the V-disparity image to the left of the vertical line are set to zero, because nothing should appear at
a distance farther than that disparity value. We also remove the upper portion of the V-disparity map
spanned by the vertical line. The diagonal line can then be extracted quickly by the Hough transform in
the remaining region. If no vertical line is detected, we proceed with diagonal line extraction using the
Hough transform over the entire V-disparity image. Finally, we find the largest connected region in the
disparity map that satisfies the extracted line. If this region covers at least 10% of the image, it
is used as the road mask to estimate the ground plane model in the next step. Otherwise, our algorithm
has two choices: (1) estimate the camera motion directly from all the dominant 3D
points in the scene with RANSAC and attempt to detect the ground plane again in the next frame;
(2) assume that the camera is moving with constant velocity and predict the camera motion using the
Kalman filter described in Section 3.5.1. We use the Kalman filter technique in all our experiments in
order to show the contribution of the ground plane constraint to the final results. An example of a road
mask is highlighted in yellow in Figure 3.9.
Figure 3.9: Output road mask highlighted in yellow
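A compact sketch of this ground detection step is shown below; it builds the V-disparity histogram, suppresses the columns dominated by the vertical line, and then fits the diagonal ground line with a Hough transform. The thresholds, the OpenCV-based line fitting, and the way the road mask is formed are simplifying assumptions rather than the exact implementation.

import numpy as np
import cv2

def v_disparity(disp, max_d):
    """V-disparity image: one disparity histogram per image row."""
    rows = disp.shape[0]
    vd = np.zeros((rows, max_d), dtype=np.float32)
    for y in range(rows):
        d = disp[y]
        d = d[(d > 0) & (d < max_d)]
        vd[y] = np.bincount(d.astype(int), minlength=max_d)
    return vd

def detect_ground_line(disp, max_d=64, thresh=20):
    """Return (slope, intercept) of the diagonal ground line d = m*y + c,
    or None if no line is found."""
    vd = v_disparity(disp, max_d)
    binary = (vd > thresh).astype(np.uint8)

    # Locate the vertical line (column with the largest sum) and zero out
    # that column and everything at larger range (columns to its left).
    col = int(np.argmax(binary.sum(axis=0)))
    cleaned = binary.copy()
    cleaned[:, :col + 1] = 0
    # (The thesis additionally removes the rows spanned by the vertical
    #  line; that step is omitted here for brevity.)

    lines = cv2.HoughLinesP(cleaned * 255, 1, np.pi / 180, threshold=30,
                            minLineLength=20, maxLineGap=5)
    if lines is None:
        return None
    x1, y1, x2, y2 = lines[0][0]          # first detected line: d = m*y + c
    if y1 == y2:
        return None
    m = (x2 - x1) / float(y2 - y1)
    return m, x1 - m * y1

def road_mask(disp, line, tol=1.5):
    """Pixels whose disparity is consistent with the ground line."""
    m, c = line
    y = np.arange(disp.shape[0])[:, None]
    expected = m * y + c
    return (np.abs(disp - expected) < tol) & (disp > 0)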
3.4.2 Ground plane normal estimation from visual data
Each disparity image point (u, v, d) that falls within the road mask maps to a 3D point on the ground
plane in the left camera frame. We first compute the 3D coordinates of all the points on the ground by
means of standard stereo geometry equations. The next task is to estimate [A, B, C, D] in (3.6) by fitting
a linear model to that point cloud. Because we have dense disparity data for the road region, plane
fitting is accomplished by weighted least-squares fitting [109] because of its speed and low computational
complexity. Each 3D point is weighted by the sigmoid function (3.9) of its disparity with the tuning
parameter γ = 20; therefore, more weight, w, is assigned to points nearer to the camera.

\[
w(d) = \frac{1}{1 + e^{-(d - \gamma)}}
\tag{3.9}
\]
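The following sketch shows one way to carry out this weighted least-squares plane fit: each ground point is weighted by the sigmoid of its disparity, the plane Z = aX + bY + c is solved in closed form, and the result is converted to the (A, B, C, D) form of (3.6). The parameterization of the plane and the variable names are assumptions for illustration.

import numpy as np

def sigmoid_weights(disparities, gamma=20.0):
    """Weight of each ground point per (3.9): nearer points (larger d) count more."""
    return 1.0 / (1.0 + np.exp(-(disparities - gamma)))

def fit_ground_plane(points, disparities, gamma=20.0):
    """Weighted least-squares fit of the ground plane AX + BY + CZ + D = 0.

    points      : (N, 3) array of ground 3D points [X, Y, Z] in the camera frame.
    disparities : (N,) disparity of each point.
    Returns (n, D) with n a unit normal.
    """
    w = np.sqrt(sigmoid_weights(disparities, gamma))
    # Solve Z = a*X + b*Y + c in the weighted least-squares sense.
    A = np.column_stack([points[:, 0], points[:, 1], np.ones(len(points))])
    coeffs, *_ = np.linalg.lstsq(A * w[:, None], points[:, 2] * w, rcond=None)
    a, b, c = coeffs
    # a*X + b*Y - Z + c = 0  ->  (A, B, C, D) = (a, b, -1, c), then normalize.
    n = np.array([a, b, -1.0])
    scale = np.linalg.norm(n)
    return n / scale, c / scale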
After recovering the motion of the ground plane, we compute the motion on the ground surface in
the next section.
3.5 Egomotion estimation
After aligning the ground plane between two consecutive frames, we estimate the 3-DoF camera motion
on the plane. We briefly explain conventional egomotion estimation first, followed by a detailed
discussion of our method.
3.5.1 Classical egomotion estimation directly from all points
If the camera moves in space with translational vector t = [t_x, t_y, t_z]^T and angular velocity
ω = [ω_x, ω_y, ω_z]^T, the standard egomotion algorithm decomposes the 6-DoF camera motion into

\[
F^{c_k}_{c_{k-1}} = F^{c_k}_{I_k} \, F^{I_k}_{I_{k-1}} \, F^{I_{k-1}}_{c_{k-1}}
\tag{3.10}
\]
where F^{I_{k-1}}_{c_{k-1}} is the camera-to-image-plane transformation at the previous frame k-1, F^{I_k}_{I_{k-1}} is the
transformation from the previous frame to the current frame measured directly from the optical flow, and
F^{c_k}_{I_k} is the image-plane-to-camera transformation at the current frame.
By assuming a pin-hole camera model, the transformations (F^{I_{k-1}}_{c_{k-1}}, F^{c_k}_{I_k}) relate a 3D point P = [X, Y, Z]^T
to its image coordinates p = [u, v, 1]^T by p = f P / Z, where f is the focal length. Note that these
two matrices are not global transformations in the presence of non-rigid structures, because Z depends
on the disparity. Therefore, in the case of a dynamic scene, the standard VO might converge to a wrong
solution of F^{c_k}_{c_{k-1}} when minimizing the sum of reprojection errors E over all 3D points in (3.11)¹.

\[
E = \arg\min_{\{\omega, t\}} \sum_{i=1}^{N} \left\| x_i - \pi(X_i; \omega, t) \right\|^2
\tag{3.11}
\]
Here x_i denotes the feature location in the current left image, and π(X_i; ω, t) computes the pixel coordinates
of the 3D point X_i = [X, Y, Z, 1]^T in the left image plane using (3.12):
\[
\begin{pmatrix} u \\ v \\ 1 \end{pmatrix}
= \begin{pmatrix} f & 0 & p_x \\ 0 & f & p_y \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} R & t \end{pmatrix}
\begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}
\tag{3.12}
\]
where the rotation matrix R ∈ SO(3) is formed from ω = [ω_x, ω_y, ω_z]^T and t = [t_x, t_y, t_z]^T is the
translation vector.
To manage a dynamic environment, the VO in [106] solves the cost function (3.11) by sampling the
optical flow only on the ground instead of over the entire scene. In other words, the 6-DoF camera motion is
decomposed into

\[
F^{c_k}_{c_{k-1}} = F^{c_k}_{g_k} \, F^{g_k}_{g_{k-1}} \, F^{g_{k-1}}_{c_{k-1}}
\tag{3.13}
\]
¹ Libviso2 improves the accuracy by minimizing the cost function (3.11) over the four views obtained from two consecutive
stereo image pairs.
where (F^{c_k}_{g_k}, F^{g_{k-1}}_{c_{k-1}}) are the transformations between the camera and the image regions that correspond to
the ground at times k and k-1, respectively, and F^{g_k}_{g_{k-1}} is measured from the optical flow found in the
ground regions. (F^{c_k}_{g_k}, F^{g_{k-1}}_{c_{k-1}}) are global transformations because the ground is a rigid structure. However,
computing the 3D motion directly from the motion estimates produced by the coplanar points may lead to
two solutions. This is because the same planar motion field can be induced by two different planes
undergoing two different 3D motions [112]. We can determine a unique solution using the ground plane
structure (computed in either Section 3.4.2 or Section 3.9), as discussed below.
3.5.2 Egomotion estimation from ground plane
We assume that the camera moves in space with translational vector t and angular velocity ω while
observing the planar surface π in (3.6). Let P be a point on π; the equation of π is n^T P = D, where
n = [n_x, n_y, n_z]^T is the unit vector normal to π, and D is the distance between π and the center
of projection. We can further decompose (F^{c_k}_{g_k}, F^{g_{k-1}}_{c_{k-1}}) into (F^{c_k}_{p_k} F^{p_k}_{g_k}, F^{g_{k-1}}_{p_{k-1}} F^{p_{k-1}}_{c_{k-1}}). Hence, the 6-DoF
egomotion becomes

\[
F^{c_k}_{c_{k-1}} = F^{c_k}_{p_k} \, F^{p_k}_{g_k} \, F^{g_k}_{g_{k-1}} \, F^{g_{k-1}}_{p_{k-1}} \, F^{p_{k-1}}_{c_{k-1}}
\tag{3.14}
\]
where F^{c_k}_{p_k} and F^{p_{k-1}}_{c_{k-1}} are the transformations between the camera and the ground plane, which are functions
of (n, D).
Given the parameters of the ground plane, (F^{c_k}_{g_k}, F^{g_{k-1}}_{c_{k-1}}) become global transformations because the
corresponding depth Z of each ground pixel (u, v) is a function of the plane parameters (n, D):

\[
Z = \frac{Df}{n_x u + n_y v + n_z f}
\tag{3.15}
\]
Taking the ground plane geometry (n, D), the set of pixels G within the road mask estimated in the previous
section, the set of tracked ground feature points x_j = (u, v, 1)^T_j for j ∈ G in the current left image, and their
corresponding feature points x'_j = (u', v', 1)^T_j and disparities d'_j in the previous left image, we estimate the
vectors ω and t by minimizing the sum of reprojection errors E over all points in the road mask:

\[
E = \arg\min_{\{\omega, t\}} \sum_{j \in G} \left\| x_j - \pi(X_j; \omega, t) \right\|^2
\tag{3.16}
\]
where

\[
X_j = \begin{bmatrix} X \\ Y \\ Z \end{bmatrix}
= \begin{bmatrix}
(u' - p_x)\, b / d' \\
(v' - p_y)\, b / d' \\
D f / (n_x u' + n_y v' + n_z f)
\end{bmatrix}
\tag{3.17}
\]

and b denotes the baseline. Hence, the original 6-DoF visual odometry is reformulated as a quadratic
curve-fitting problem.
We use the Kanade-Lucas-Tomasi (KLT) feature tracker [122] to select and track the feature points
between two consecutive frames of the left camera. To improve speed and robustness, the KLT tracker
only processes the image regions within the road mask. Although we also tested other feature matching
algorithms, such as the matcher in Libviso2 [30] and the FAST detector [93], we found
that KLT is more accurate in tracking the ground features for our application. In addition, VI people
do not tend to walk at high speed, and thus the difference between consecutive frames can be managed
easily by the pyramidal implementation of the KLT tracker.
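A minimal sketch of this masked KLT step using OpenCV is given below; the corner detection parameters and the way the tracked points are kept are illustrative choices, not the exact settings used in our system.

import numpy as np
import cv2

def track_ground_features(prev_gray, cur_gray, road_mask, max_corners=300):
    """Detect corners inside the road mask of the previous left image and
    track them into the current left image with pyramidal KLT.
    Returns matched point arrays (prev_pts, cur_pts), each of shape (M, 2)."""
    mask = road_mask.astype(np.uint8) * 255
    prev_pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=max_corners,
                                       qualityLevel=0.01, minDistance=7,
                                       mask=mask)
    if prev_pts is None:
        return np.empty((0, 2)), np.empty((0, 2))
    cur_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, cur_gray,
                                                  prev_pts, None,
                                                  winSize=(21, 21), maxLevel=3)
    ok = status.ravel() == 1
    return prev_pts.reshape(-1, 2)[ok], cur_pts.reshape(-1, 2)[ok]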
We derive the Jacobian matrix of (3.16) and iteratively minimize the cost using Gauss-Newton optimization
with respect to the transformation parameters ω and t. RANSAC [26] is used during the optimization
to improve the robustness of the motion estimation against outlier motion vectors. The parameters (ω, t)
are first estimated for N trials, each based on m randomly selected motion vectors on the ground plane. We
choose the parameter set {ω_i, t_i} from the trial with the largest number of inliers and use all such
inliers to compute the final (ω, t). N is adjusted during the trials with the standard RANSAC equation
(3.18)

\[
N = \frac{\log(1 - p)}{\log(1 - u^m)}
\tag{3.18}
\]

where m = 3, p = 0.99, and u denotes the highest ratio of inliers to the total number of motion vectors up
to the current trial.
We also include the same Kalman filter as in [30] to smooth the egomotion output and to predict the
camera motion when insufficient ground features are detected within the FOV.
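Below is a compact sketch of how (3.16)-(3.18) could be put together: previous-frame ground features are back-projected with (3.17), a three-point RANSAC loop with the adaptive trial count of (3.18) selects inliers, and the final (ω, t) is refined on all inliers with scipy's nonlinear least-squares solver instead of a hand-written Gauss-Newton iteration. Function names, the rotation-vector parameterization, and the thresholds are assumptions for illustration.

import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def backproject_ground(prev_pts, prev_disp, n, D, f, px, py, b):
    """3D ground points per (3.17) from previous-frame pixels and disparities."""
    u, v = prev_pts[:, 0], prev_pts[:, 1]
    X = (u - px) * b / prev_disp
    Y = (v - py) * b / prev_disp
    Z = D * f / (n[0] * u + n[1] * v + n[2] * f)
    return np.column_stack([X, Y, Z])

def reproject(X, omega, t, f, px, py):
    Xc = Rotation.from_rotvec(omega).apply(X) + t
    return np.column_stack([f * Xc[:, 0] / Xc[:, 2] + px,
                            f * Xc[:, 1] / Xc[:, 2] + py])

def solve_motion(X, cur_pts, f, px, py, x0=None):
    res = lambda p: (reproject(X, p[:3], p[3:], f, px, py) - cur_pts).ravel()
    sol = least_squares(res, np.zeros(6) if x0 is None else x0)
    return sol.x

def ransac_ground_vo(X, cur_pts, f, px, py, thresh=1.0, m=3, p=0.99, max_trials=200):
    """RANSAC over ground motion vectors with the adaptive trial count of (3.18)."""
    rng = np.random.default_rng(0)
    best_inliers = np.zeros(len(X), dtype=bool)
    N, trial = max_trials, 0
    while trial < min(N, max_trials):
        idx = rng.choice(len(X), size=m, replace=False)
        params = solve_motion(X[idx], cur_pts[idx], f, px, py)
        err = np.linalg.norm(reproject(X, params[:3], params[3:], f, px, py)
                             - cur_pts, axis=1)
        inliers = err < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
            u = inliers.mean()                       # inlier ratio so far
            if 0 < u < 1:
                N = np.log(1 - p) / np.log(1 - u ** m)
        trial += 1
    # Refine on all inliers of the best hypothesis.
    params = solve_motion(X[best_inliers], cur_pts[best_inliers], f, px, py)
    return params[:3], params[3:], best_inliers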
3.6 Validation using simulated data
In this section, we analyze the effectiveness of our method in managing different degrees of dynamic
scenes using simulated data, similar to [81]. We simulate different dynamic scenes using different ratios
of moving 3D points to static 3D points. By adjusting this ratio, we can benchmark the performance
of our VO algorithm in simulated environments with a controlled number of dynamic features, something
that cannot be controlled in the real world.
The ground plane normal is estimated by fitting a plane to the 3D point cloud using the RANSAC
method (Section 3.4.2) in all simulations. We compare our method to Libviso2 [30], which computes
the six-DoF stereo camera motion by minimizing the reprojection error of the sparse feature correspondences
over the entire scene.
Scene / Configuration     Dynamic points [%]   Non-ground static points [%]   Ground 3D points [%]
Scene I (static scene)            0                      45                           55
Scene II                         13                      26                           61
Scene III                        18                      53                           29
Scene IV                         23                      12                           65

Table 3.1: Four different simulated scenes sorted according to the amount of dynamic points
For the purposes of the simulation, a total of 1×10^5 3D points are generated randomly along the trajectory
of a moving stereo camera. We configure four different dynamic scenes by mixing the distribution of
three different types of 3D points: i) static points on the ground plane, ii) static points above the ground
plane, and iii) dynamic points. In order to simulate the 3D points of moving objects in a real scene, such as
humans or vehicles, the dynamic points are configured to move together in some random direction.
Table 3.1 lists the statistics of the different point types seen by the camera as it traverses the four
simulated scenes. The simulated camera, with the same intrinsic parameters as the Vuzix (Figure 1.3), is
configured to move forward at constant speed and altitude, but with random heading angles. The depression
and roll angles are fixed to 14° and 0°, respectively. The 2D observations (u, v) of these features on
the image plane are corrupted by zero-mean Gaussian noise with σ_u and σ_v. These noisy 2D
feature positions are then fed into the odometry code to generate motion estimates. We set σ_u = σ_v
= 0.05 pixel² for all simulations. Because of the random variation, five runs over a given trajectory are
averaged to produce the error estimates. Similar to [81], motion blur is not included in the simulation.
We also assume that the data association is known, because the outliers in feature tracking are managed
by RANSAC and thus are not included in the simulation.
We use the benchmarking scheme in [29] to evaluate the performance of the algorithms. Figure 3.10
shows the average translation errors over all possible subsequences of length (5, 10, 50, 100, 150, ...,
400 m); the error curves for Libviso2 and our algorithm are highlighted in blue and red, respectively.
The simulation suggests that Libviso2 generates larger errors as the amount of dynamic
structure in the scene increases. In contrast, the performance of our algorithm is not affected by the
dynamic points found in the environment. This is because Libviso2 uses RANSAC to identify the inlier
points from the entire 3D point cloud found in the scene. However, when a substantial number
of dynamic points move together, RANSAC may misclassify outliers as inliers and vice versa. In real
scenarios, these situations occur commonly when using a small-FOV camera in crowded environments.
Our method is not affected by such problems because it computes the camera motion based only on ground
features.
Figure 3.10: Average translation error over the different scenes (translation error [%] vs. path length [m], for Libviso2 and our method on Scenes I-IV)
In the second simulation, we fed only ground features to both odometry codes in order to compare their
accuracy. We generated 1×10^5 3D points on the ground plane only. All points are static throughout
the simulations. The 2D observations (u, v) of these features are first rounded off to integer coordinates,
and then further corrupted by zero-mean Gaussian noise with different σ along both the u and v axes. We
use the same benchmarking scheme as before. Figure 3.11 shows that our approach is more robust at
different noise levels.
The test results with simulated data show that our approach is more robust to sensor noise and
dynamic scenes. In the next section, we validate our method using a popular benchmark dataset.
Figure 3.11: Average translation error over five runs in planar scenes with different observation noise levels (σ = 0.02, 0.05, 0.08 pixel²; translation error [%] vs. path length [m] for Libviso2 and our method)
3.7 Benchmarking results with KITTI dataset
In addition to validation with simulated data, we tested our VO using a public dataset [29]. The
objective is to validate whether our method is sufficiently robust to manage a wide range of camera
motions under different environments and road conditions. The only input for all tests in this benchmark
is the stereo images, captured by cameras with very different parameters and baselines compared
with our wearable stereo camera. The dataset was captured by a car-mounted stereo rig, as shown
in Figure 3.12, while the vehicle was driven in different urban areas. This is an extremely challenging
dataset for our VO because our algorithm is not designed for driving scenarios. The purpose of this
experiment is to validate whether our proposed method is sufficiently general to work with other stereo
systems that have different intrinsic parameters and baselines within extremely different environments.
The testing dataset consists of 11 video sequences captured by a pair of high-quality cameras separated
by 54 cm; ground truth is not available. The video sequences include busy traffic scenes and areas where
the ground surface is not clearly visible because of reflections, over-exposure, or shadows.
Some sample images are shown in Figure 3.13.
Figure 3.12: KITTI dataset was captured by car-mounted stereo camera
We ran our VO algorithm over all 11 test video sequences. Figure 3.14 shows a sample of the detected
ground regions highlighted in yellow. The optical flow is measured on the detected ground regions
only and superimposed on the rectified image, as shown in Figure 3.15. The green and red vectors are
the inliers and outliers classified by our VO algorithm, respectively.
We submitted the output trajectories estimated by our algorithm to the KITTI Vision Benchmark
Suite [50]. We call our new algorithm Stereo Odometry for the Visually Impaired (SOVI). Our algorithm
ranked number five among all 19 stereo-based VO methods listed in the benchmark suite, as shown in
Figure 3.16.
Figure 3.13: Our algorithm can extract and track ground features in different road conditions (dynamic traffic, reflective road surfaces, slopes, road features covered by shadow, missing road surface, over-exposure); when no road surface or too few ground features are detected, the camera motion is predicted with the Kalman filter

Figure 3.14: Ground detection and optical flow extracted from the ground
Note that the algorithms ranked #1 and #2 in the list are based on laser point measurements,
and thus they should be excluded from the ranking of stereo-based VO. The Libviso2 algorithm is four
positions behind ours.
Figure 3.15: Optical flow on the ground surface
The top-ranked stereo-based VO is MFI, which requires training on the ten training sequences with ground
truth provided by KIT. In contrast, we do not perform any training for our algorithm. In second
place is a BA method called VoBa, whereas our approach does not rely on BA. In third place is a SLAM-based
algorithm, whereas we only optimize the camera motion without estimating or keeping track of
landmark positions, because such processes consume significant computational resources. No
additional information is available for the fourth algorithm, eVO, and we rank behind it by a
small 0.04% in translation accuracy. The full evaluation report from KIT on our algorithm is shown
in Figure 3.17.
The benchmarking results show that our approach is sufficiently robust to manage a wide range of
environments and different camera settings.
3.8 Validation with our dataset using images only
In addition to the public dataset, we compare the proposed method against other open source VOs using
our own dataset in this section. In this validation, the only input is the stereo images captured by a
consumer off-the-shelf wearable stereo camera, as shown in Figure 3.18. Note that the integrated IMU
is not discussed in this section, but it is discussed in Section 3.10.
Figure 3.16: Our algorithm is among the top five stereo-based methods, because the top two algorithms are laser-based methods
Figure 3.17: KITTI dataset results: (a) translation error vs. path length; (b) rotation error vs. path length; (c) translation error vs. speed; (d), (e) rotation error vs. speed; (f) estimated trajectory vs. ground truth
Stereo was chosen in order to strike a balance between portability and accuracy in depth measurement. The
stereo rig is not only lightweight and comfortable to wear, but it also gives a natural appearance to the VI
user. The limitation is that it is composed of two low-quality image sensors with an extremely short baseline
of 6 cm (half that of the Bumblebee2).
Figure 3.18: The primary sensor of our system: Vuzix Wrap 920 AR with integrated IMU [19]
3.8.1 Methods
We demonstrate our approach using image sequences captured by our stereo camera eyewear while
walking through different outdoor environments. We run our VO algorithm without the IMU and compare
the output trajectories with those of Libviso2 and Libfovis [44]. Libfovis is another open source stereo-based
VO, developed for Unmanned Aerial Vehicles (UAVs).
When the IMU is not used, the ground plane normal is measured by fitting a plane to the 3D points
found on the ground (Section 3.4.2). Note that none of the algorithms uses any form of BA or loop-closure
technique. Therefore, similar to all dead-reckoning systems, our system output degrades with
elapsed time and distance traveled in all experiments. Accuracy can easily be improved by incorporating
other optimization frameworks such as [111] or [90].
3.8.2 Evaluation criteria
Our datasets consist of stereo image pairs captured at 30 fps and IMU data logged at 100 Hz. We also
attempted to record GPS using a handheld receiver (Garmin nuvi 205) and a smartphone (Google Nexus
S). However, after plotting the GPS measurements in Google Earth, we found that both devices received
inaccurate measurements because the GPS signal can be blocked by buildings and trees. Hence, the GPS
measurements cannot be used as ground truth.
Therefore, we extracted the ground truth trajectory using the Google Earth software. This is performed
manually by comparing the ground texture found in the recorded images with the Google satellite map.
We select anchored frames, which are left camera images with distinctive ground texture patterns
visible, at regular intervals along the path we walked. We locate the spots where these frames were
captured in Google Earth and use the corresponding UTM coordinates (WGS84) as the ground truth
camera locations for those frames. The height of all ground truth points is set to zero. The locations of
anchored frames are highlighted by light blue lines in Figures 3.20(e), 3.21(e), and 3.22(e). Unfortunately,
the rotational error cannot be evaluated because accurate ground truth roll-pitch-yaw angles are not available.
Because BA is not used by any of the algorithms in the evaluation, we cannot directly compare the
algorithms using the error of the end points of the estimated trajectories. Therefore, we use criteria
similar to those described in [29] to evaluate the accuracy of the estimated trajectories. We compute the
translational error between all possible pairs of anchored frames (Figure 3.19) and take their average.
Formally, given anchored frames 1...M, the translational error T_err(i, j) between two anchored frames i
and j is defined as:

\[
T_{err}(i, j) = \left\| (p_j - p_i) - (\hat{p}_j - \hat{p}_i) \right\|_2
\tag{3.19}
\]
where \hat{p}, p ∈ R^3 are the estimated camera location and the corresponding ground truth location obtained
from Google Earth, respectively. We normalize each translational error by the length of the corresponding
subpath. The average translational error for an entire trajectory with M anchored frames is defined as:

\[
E_{ave} = \frac{2}{M(M-1)} \sum_{i=1}^{M-1} \sum_{j=i+1}^{M} \frac{T_{err}(i, j)}{\| p_j - p_i \|_2}
\tag{3.20}
\]
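The metric in (3.19)-(3.20) is straightforward to compute; the sketch below does so with numpy, assuming the anchored-frame positions are given as arrays (the variable names are illustrative).

import numpy as np

def average_translational_error(gt, est):
    """E_ave of (3.20) from ground-truth and estimated anchored-frame positions.

    gt, est : (M, 3) arrays of ground-truth and estimated positions p_i, p_hat_i."""
    M = len(gt)
    total = 0.0
    for i in range(M - 1):
        for j in range(i + 1, M):
            t_err = np.linalg.norm((gt[j] - gt[i]) - (est[j] - est[i]))  # (3.19)
            total += t_err / np.linalg.norm(gt[j] - gt[i])
    return 2.0 * total / (M * (M - 1))

# Example with four anchored frames on a straight 30 m path.
gt = np.array([[0, 0, 0], [10, 0, 0], [20, 0, 0], [30, 0, 0]], dtype=float)
est = gt + np.array([[0, 0, 0], [0.5, 0, 0], [1.0, 0, 0], [2.0, 0, 0]])
print(average_translational_error(gt, est))   # fraction of subpath length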
Figure 3.19: (a) Evaluating VO with four anchored frames (b) Translational errors between all possible subpaths
We assume that the location of the first frame is known. The first heading is obtained from a compass,
but subsequent camera orientations are computed by the VO algorithms.
3.8.3 Results without IMU
The quantitative results are summarized in Table 3.2. Our approach provides better results with the
stereo camera alone when compared with other state-of-the-art methods. The details of the three datasets
are given below.
Cafeteria Area The first dataset was captured around a crowded cafeteria area on the USC campus, as
indicated by the ground truth path shown in light blue in Figure 3.20(e). We postpone the discussion
of the yellow curve to Section 3.10.2. The route is approximately 218 m long. The environment was crowded,
and there were people walking in front of our stereo camera, with the exception of the last 50 m of
the route, where the images were overexposed, as shown in Figure 3.20(d). Some snapshots along the route
are shown in the thumbnails in Figure 3.20. The yellow curve in Figure 3.20(e) shows the trajectory
generated by our algorithm when the IMU is used to approximate the ground plane normal; this is discussed
in Section 3.10. The green curve is the output of our algorithm without the IMU, in which the ground plane
was estimated using RANSAC 3D plane fitting. The output trajectory is relatively straight between corners
until it drifts after the second turn (approximately 20 m after label C in Figure 3.20(e)), before entering
a shaded corridor.
Figure 3.20: Cafeteria area test result: (a)-(d) sample images; (e) ground truth and estimated trajectories; (f) average translational error vs. path length for Libviso2, Libfovis, and ours (without IMU)
Football Match The second dataset is the most challenging one because it was collected immediately
after a football game. The route is approximately 351 m long. We used the same setup as in the first
on-campus experiment. As shown in the thumbnails in Figure 3.21, the environment was crowded with
spectators leaving the stadium. Our camera moved in the direction opposite to the crowd in the first half
of the route, and then followed the crowd in the second half. When the IMU is not used, the accuracy of the
green trajectory is degraded, but it does not drift as badly as the other two algorithms. Although
the green curve bends away from the ground truth after the first U-turn because of the accumulated
rotational error, the trajectory remains relatively straight until the second U-turn. Therefore, our
proposed algorithm maintains the lowest translational error throughout the route, as shown in Figure 3.21(f),
although it gives a large end-point error, as shown on the satellite map.
Figure 3.21: Football match dataset: (a)-(d) sample images; (e) ground truth and estimated trajectories; (f) average translational error vs. path length for Libviso2, Libfovis, and ours (without IMU)
Algorithm / Dataset    Libfovis    Libviso2    Ours (without IMU)
Cafeteria               66.31%      52.27%      51.64%
Football                72.03%      68.69%      29.60%
Shopping                64.71%      71.92%      30.42%
Average E_ave           67.68%      64.29%      37.22%

Table 3.2: Average translational error (E_ave) for the different algorithms in different environments without using the IMU
Shopping Area Figure 3.22 shows the dataset we collected in a shopping area. The route is approximately
268 m long. Compared with the two datasets above, this dataset is less crowded, but the road is
surrounded by buildings, as shown in the thumbnails and satellite image in Figure 3.22(e). Because the
sunlight was blocked by some buildings, parts of the image sequence are overexposed (Figure 3.22(c))
and some are underexposed (Figure 3.22(d)). The shape of the trajectory computed by our proposed
algorithm without the IMU is the second closest to the ground truth, as shown by the green curve in Figure
3.22(e). The end-point errors of both trajectories are caused by accumulated rotation error, which could be
easily rectified by incorporating compass measurements. Figure 3.22(f) shows that our algorithm
provides the lowest average translation error both with and without the use of the IMU in the ground plane
estimation.
Figure 3.22: Shopping area dataset. (a)-(d) Sample frames; (e) ground truth and estimated trajectories; (f) average translational error (%) versus path length (m) for Libviso2, Ours (without IMU), and Libfovis.
3.9 Visual odometry with IMU

The test results in the last section show that our VO is more robust than both state-of-the-art methods, Libviso2 and Libfovis. However, our method is still affected by moving objects when the stereo camera is the only input. To overcome this, we found that adding the angle measurements of the IMU (Figure 3.23) improves VO robustness.
An IMU is a low-cost and lightweight electronic device that measures the velocity, orientation, and gravitational forces of the object on which it is mounted, using a combination of accelerometers and gyroscopes, and sometimes also magnetometers. The integrated IMU of our eyewear is designed to track the general head movement of the user. Note that our system is not specifically designed for the Vuzix eyewear with integrated IMU; it should work with any other stereo camera with a rigidly mounted IMU. There is a trade-off between the size of the sensor and its accuracy. We found that the small integrated IMU can detect large changes in head movement, but it ultimately gives a large heading drift error, which is a common problem in many low-cost MEMS IMUs. This drifting problem is also stated in the user manual of the stereo camera [19]. Therefore, our proposed algorithm does not rely on the yaw angle.
Another problem of this IMU is that there is no access to the timestamp value of each IMU measurement: there is no function call in the Software Development Kit (SDK) provided by the manufacturer to read the timestamp. We attempted to integrate the gyroscope measurements using the system time recorded by the computer during data acquisition, but the accuracy is poor because there is an unknown time delay in reading the IMU measurements.
We observed that VI individuals prefer to navigate on flat terrain in daily life. The orientation of the ground plane can therefore be approximated easily from the roll and pitch measurements of the IMU, without spending additional computation power on plane fitting.
Figure 3.23: Position estimation using stereo camera and IMU (block diagram: the stereo camera and inertial sensors/IMU feed position estimation, which feeds guidance planning and the user interface).
Figure 3.24: Coordinate system of the integrated IMU [19]
Formally, the plane normal $\mathbf{n} = [A, B, C]^T$ can be approximated by projecting the up vector of the IMU, $\mathbf{n}_{up} = [0, 1, 0]^T$, using

$$\mathbf{n} = R_{imu}\,\mathbf{n}_{up} \qquad (3.21)$$
Figure 3.25: Red line and rectangle represent ground plane normal vector measured by IMU
where $R_{imu} \in SO(3)$ is the rotation matrix formed from the pitch ($\theta$) and roll ($\phi$) angle measurements:

$$R_{imu} =
\begin{bmatrix}
1 & 0 & 0 \\
0 & \cos\theta & -\sin\theta \\
0 & \sin\theta & \cos\theta
\end{bmatrix}
\begin{bmatrix}
\cos\phi & -\sin\phi & 0 \\
\sin\phi & \cos\phi & 0 \\
0 & 0 & 1
\end{bmatrix}
\qquad (3.22)$$
Figure 3.25 shows some examples of the ground plane normal estimated from the IMU measurements.
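To make this concrete, the following minimal sketch (Python with NumPy; the function name and the choice of up vector are illustrative assumptions that depend on the sensor's coordinate convention in Figure 3.24) implements (3.21) and (3.22):

import numpy as np

def ground_plane_normal_from_imu(pitch, roll, n_up=(0.0, 1.0, 0.0)):
    """Approximate the ground plane normal as n = R_imu * n_up (Eq. 3.21).

    pitch, roll: IMU angles in radians; the yaw angle is ignored because the
    low-cost MEMS gyro exhibits a large heading drift.
    n_up: the IMU up vector; its axis and sign depend on the device's
    coordinate convention (Figure 3.24), so adjust it for your sensor.
    """
    ct, st = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    R_x = np.array([[1, 0, 0],                 # rotation about x by the pitch
                    [0, ct, -st],
                    [0, st, ct]])
    R_z = np.array([[cr, -sr, 0],              # rotation about z by the roll
                    [sr, cr, 0],
                    [0, 0, 1]])
    n = R_x @ R_z @ np.asarray(n_up, dtype=float)
    return n / np.linalg.norm(n)               # unit normal [A, B, C]

# Example: a user looking slightly downward with a small head tilt.
print(ground_plane_normal_from_imu(np.deg2rad(10.0), np.deg2rad(-2.0)))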
3.10 Validation with our dataset with IMU

3.10.1 Methods

We demonstrate our approach with the IMU using the same image sequences as in Section 3.8. We run our VO algorithm with the IMU and compare the output trajectories with those of Libviso2, Libfovis [44], and the proposed VO without the IMU. When the IMU is not used, the ground plane normal is measured by fitting a plane to the 3D points found on the ground (Section 3.4.2). When the IMU is used, the ground plane normal is approximated from the roll and pitch measurements using (3.21). Neither BA nor a loop-closure technique is used.
Figure 3.26: Average translational error comparison across different datasets. (a) Cafeteria area test result; (b) football match dataset; (c) shopping area dataset. Each panel plots translation error (%) versus path length (m) for Libviso2, Ours (with IMU), Ours (without IMU), and Libfovis.
Algorithm/Dataset     Libfovis    Libviso2    Ours (with IMU)    Ours (without IMU)
Cafeteria             66.31%      52.27%      10.38%             51.64%
Football              72.03%      68.69%      15.47%             29.60%
Shopping              64.71%      71.92%      18.91%             30.42%
Average E_ave         67.68%      64.29%      14.92%             37.22%

Table 3.3: Average translational error (E_ave) for different algorithms in different environments
3.10.2 Results with IMU

The quantitative results are summarized in Table 3.3. Our proposed algorithm gives the lowest error in all three datasets when the IMU is used to estimate the ground plane. In addition, our approach provides better results with the stereo camera alone compared with the other state-of-the-art methods. The details of the individual datasets are given below.

Cafeteria Area The yellow curve in Figure 3.20(e) shows the trajectory generated by our algorithm when the IMU is used to approximate the ground plane normal. This trajectory is the closest to the ground truth compared with Libviso2 and Libfovis. Figure 3.26(a) shows the average translational error of the four algorithms for different subpath lengths. Our proposed algorithm with IMU (red curve) clearly provides the lowest translational error across different path lengths. However, when the IMU is not used, our algorithm performs only slightly better than Libviso2 on average, as shown in the first row of Table 3.3.
Football Match The yellow curve in Figure 3.21(e) clearly shows that our VO is the most insensitive to the dynamic changes in the environment when the IMU is used. When the IMU is not used, the accuracy of the green trajectory is degraded but does not drift as badly as the other two algorithms. Our proposed algorithm therefore maintains the lowest translational error throughout the route, as shown in Figure 3.26(b), although it gives a large end-point error, as shown in the satellite map.
Shopping Area The shapes of the two trajectories computed by our proposed algorithm with and without the IMU are the closest to the ground truth, as shown by the yellow and green curves in Figure 3.22(e). The end-point errors of both trajectories are caused by accumulated rotation error that can be easily rectified by incorporating compass measurements. Figure 3.26(c) shows that our algorithm provides the lowest average translation error both with and without the use of the IMU in the ground plane estimation.
3.11 Motion estimation from smartphone

The stereo VO method can be extended to a single camera as in [65]. This monocular extension allows our system to run on smartphones equipped with an IMU and GPS within a very portable form factor. In addition, smartphones are more accessible and affordable to users. Moreover, both the hardware and software of smartphones are upgraded more frequently compared with stereo-based sensors.
Our approach is illustrated in Figure 3.27. By assuming that the user walks on flat terrain, the ground plane normal $[n_x, n_y, n_z]^T$ can be approximated by the roll and pitch angle measurements of the IMU, as described in Section 3.9. Assume that the rotations and translations of the calibrated camera are $[\omega_x, \omega_y, \omega_z]^T$ and $[t_x, t_y, t_z]^T$, respectively, and that the observed optical flow $(v_x, v_y)$ on the ground plane at image coordinates $(x, y)$ is related to the camera motion by equations (3.23) and (3.24):
Figure 3.27: Camera motion estimation from a known ground plane. The smartphone is placed in a shirt pocket; the pitch angle, camera height, and ground plane normal $[n_x, n_y, n_z]^T$ define the plane geometry.
$$v_x = \frac{T_z x - T_x f}{Z} - \omega_y f + \omega_z y + \frac{\omega_x x y}{f} - \frac{\omega_y x^2}{f}, \qquad (3.23)$$

$$v_y = \frac{T_z y - T_y f}{Z} + \omega_x f - \omega_z x + \frac{\omega_y x y}{f} - \frac{\omega_x y^2}{f}, \qquad (3.24)$$

where $Z = \frac{Df}{n_x u + n_y v + n_z f}$, $f$ is the focal length, and $D$ is the slant distance from the camera center to the plane, which can be estimated from the height of the camera and the pitch angle. We found that the height of the camera does not change significantly while the user is walking, even when there is
head motion. Hence, given the optical flow and the roll and pitch measurements from the IMU, we can estimate the camera motion $[\omega_x, \omega_y, \omega_z]^T$, $[t_x, t_y, t_z]^T$ by least-squares fitting with (3.23) and (3.24).
Figure 3.28: Lower 50% of a live image is marked as ground. Inliers and outliers are highlighted in green
and red, respectively.
We cannot detect the ground region with the VDisparity algorithm for a single camera, and thus the lower 50% of the live image is marked as the ground region, as shown in Figure 3.28. Optical flow is measured over the ground region. We use the RANSAC algorithm to detect inliers, which are the optical flow vectors that satisfy the planar constraint in (3.23) and (3.24). The inliers and outliers are marked in green and red, respectively, in Figure 3.28.
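For illustration, the following sketch (Python with NumPy; the function name, RANSAC parameters, and data layout are our assumptions, not the exact implementation) stacks (3.23) and (3.24) into a linear system in the six motion unknowns and solves it by least squares over RANSAC inliers:

import numpy as np

def estimate_motion_from_ground_flow(pts, flow, normal, D, f,
                                     iters=200, thresh=1.0, seed=0):
    """Estimate [T_x, T_y, T_z, w_x, w_y, w_z] from optical flow measured on
    the ground region, following (3.23)-(3.24): the per-pixel depth Z comes
    from the IMU-approximated plane, RANSAC rejects flow vectors that violate
    the planar constraint, and a least-squares fit uses the inliers.

    pts:    (N, 2) image coordinates (x, y) of tracked ground points
    flow:   (N, 2) optical flow (v_x, v_y) at those points
    normal: ground plane normal [n_x, n_y, n_z] from roll/pitch (Section 3.9)
    D:      slant distance from the camera center to the plane
    f:      focal length in pixels
    """
    rng = np.random.default_rng(seed)
    x, y = pts[:, 0], pts[:, 1]
    nx, ny, nz = normal
    Z = D * f / (nx * x + ny * y + nz * f)          # depth of each ground pixel

    # Two linear equations per point in [T_x, T_y, T_z, w_x, w_y, w_z].
    A = np.empty((2 * len(pts), 6))
    b = flow.reshape(-1)                            # [v_x0, v_y0, v_x1, ...]
    zeros = np.zeros_like(Z)
    A[0::2] = np.column_stack([-f / Z, zeros, x / Z,
                               x * y / f, -f - x**2 / f, y])   # rows for (3.23)
    A[1::2] = np.column_stack([zeros, -f / Z, y / Z,
                               f - y**2 / f, x * y / f, -x])   # rows for (3.24)

    best_inliers = np.zeros(len(pts), dtype=bool)
    for _ in range(iters):                          # RANSAC over flow vectors
        idx = rng.choice(len(pts), 3, replace=False)
        rows = np.r_[2 * idx, 2 * idx + 1]
        m, *_ = np.linalg.lstsq(A[rows], b[rows], rcond=None)
        residual = (A @ m - b).reshape(-1, 2)
        inliers = np.linalg.norm(residual, axis=1) < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers

    keep = np.flatnonzero(best_inliers)
    rows = np.r_[2 * keep, 2 * keep + 1]
    motion, *_ = np.linalg.lstsq(A[rows], b[rows], rcond=None)
    return motion, best_inliers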
3.12 Summary

We presented a new robust VO method that works with head-mounted wearable stereo cameras. In order to manage environments rich in moving objects, our method computes egomotion only from the optical flow observed on the ground plane. We described our robust ground detection method for short-baseline stereo. We also presented two different approaches to estimate the ground plane normal. Experimental results showed that our system outperforms existing VO methods that rely on the motion field from the entire scene. Furthermore, by comparing the results with and without the IMU, estimating the ground plane normal with the IMU clearly showed an improvement in egomotion estimation. We also extended the proposed method to a single camera. Similar to other dead-reckoning navigation systems, the trajectory estimated by VO drifts during long navigation because the accumulated position error grows unbounded. This error can be corrected either by using measurements from absolute positioning devices (e.g., GPS) or by incorporating map information, as described in the next chapter.
Chapter 4

Map Aided Localization

4.1 Introduction

In this chapter, we describe our map-aided localization by including a map as an input to the system, as shown in Figure 4.1.
Figure 4.1: Map-aided position estimation (block diagram: the stereo camera with inertial sensors, or a smartphone camera with the smartphone IMU/GPS, together with the map, feed the position estimation module, which is our focus; position estimation connects to guidance planning, the user interface, and obstacle detection).
The drifting error of the motion estimation can be corrected using some global reference, such as
GPS measurements, but the readings could be noisy and imprecise in urban areas because of blockage.
We postpone the discussion on GPS to Section 4.3.3.
To reduce GPS dependence, many vision-based approaches have been proposed to solve the problem using place recognition techniques. These methods rely on a large database that contains depth and image features extracted from the environments where robots or users are to navigate.
Figure 4.2: For long-distance navigation, the VO may drift from the ground truth if there is no correction from an absolute positioning device such as GPS (ground truth and estimated trajectory plotted in longitude/latitude).
Localization is then formulated as an image retrieval problem. However, the cost of creating and maintaining such a huge database is not feasible and could limit the places where VI individuals can navigate.
In contrast, our goal is to solve the localization problem in places no one has visited before. Our approach is inspired by the method in [14] to utilize free maps available online. Currently, many companies offer free map services in different formats, such as satellite maps, road maps, and street views. These maps and images are geo-referenced and cover almost everywhere worldwide. However, storing the entire image database from these mapping services in our portable devices is not feasible. Although we could design a planning system to download the local portion of the map at the moment it performs planning, the downloading time required for such image data could be too long for real-time planning.
We propose utilizing the existing free map service provided by OSM [4] for the task of vision-based localization. Unlike Google Map [34], OSM contains semantic details such as road names, the world coordinates of the starting and ending points of each road, the type of road (such as footpath or highway), and so on. All such information and the map can be downloaded easily in Extensible Markup Language (XML) format and stored in mobile devices with scant storage space requirements because image data is not stored.

Our system is based on probabilistic map localization that utilizes the estimated VO described in Chapter 3 along with the OSM data as the only inputs when GPS is not available. The system localizes the user within the street graph by matching the local motion trajectory with the shape of the road segment. A street graph is a directed graph (Figure 4.7(c)) that captures the connectivity of different roads. The inputs of the localization system include (1) a map that encompasses the starting area and destination and (2) the user trajectory over the last 3 sec. The local motion trajectory is estimated by the VO algorithm in either Section 3.9 or 3.11. GPS is optional, but it provides an initial guess and causes our probabilistic localization to converge more quickly. We first present the steps to generate the street graph from OSM, and follow with the probabilistic model of how a VI user can traverse the graph.
4.1.1 Related Work

GPS measurements are usually employed to correct the accumulated drifting error of VO in outdoor environments. This works extremely well in open areas where there is minimal blockage of GPS signals. The autonomous ground vehicle in [68] examines the precision of the received GPS measurements while navigating with stereo-based VO. The estimated trajectory is corrected once the uncertainty of the GPS readings drops below a certain threshold.
To reduce GPS dependency, some methods [6, 99, 66, 36] manage large-scale navigation using place recognition techniques. However, generating a database of geo-referenced images to cover every place required by VI users is not feasible. In contrast, the maps used by our localization system consume a small amount of storage space. For instance, the map of the entire USC campus requires less than 8 MB.
Recently, large-scale appearance-based geo-localization methods have attracted much attention. FAB-Map [20] is an appearance-only SLAM system based on a bag-of-features representation. It first constructs a visual dictionary by extracting visual cues from a pool of images. Localization is performed by appearance matching using a probabilistic framework. It also detects unseen places and augments its map with new observations. Similar methods [123, 101, 55] have been proposed. These approaches rely on a reference dataset that consists of geo-tagged images. The geo-location of a query image is estimated by finding the best-matching image in the reference database. Schindler et al. [101] developed a method for city-scale localization by constructing vocabulary trees to represent a dataset of street-side images. Vaca-Castano et al. [114] proposed a method for estimating the geo-spatial trajectory of a moving camera in a city-scale urban environment by matching live scenes with a large database of GPS-tagged Street View images.
To leverage existing public geo-tagged image databases, Majdik et al. [70] proposed an appearance-based localization method for UAVs using an existing image database from Google. They manually controlled a UAV to fly sideways and estimated the global position by matching the live camera view of the UAV with Google StreetView. In addition, a crosswalk localization system [76] was recently developed for VI pedestrians by matching the image sequence captured by a smartphone with Google Map. The work assumes that the GPS has sufficient accuracy to determine the road junction where the user is standing. The user is required to record an image sequence using a smartphone by turning 360 deg. The image sequence is then stitched into a 360-deg cylindrical panorama and projected into a top-down aerial view. The aerial view is compared with Google Map using GPS as the initial guess, and the algorithm estimates the precise location of the user.
Another approach for performing localization in GPS-denied areas is to leverage a map. For this, some research groups have focused their work on building the environmental representation "from scratch" through SLAM approaches by optimizing the estimated poses and landmarks together [45, 21, 75]. However, these methods cannot maintain a large map incrementally for large areas. The hybrid camera system [86] extends EKF SLAM to manage large-scale navigation by including measurements from stereo cameras. Recently, the map size limitation has been partially overcome using incremental sparse matrix factorization [48], and the computational speed has been improved using relative representations [72]. Furthermore, better maps can be constructed using new methods with loop-closure detection techniques [20, 116]. However, localization using SLAM methods only works with maps generated previously with similar sensors. In contrast, our proposed method relies on existing maps that cover most places worldwide, and it only requires a few minutes to download and convert them to the mapping format compatible with our system.

Visual teach-and-repeat navigation allows long-range self-localization without solving the SLAM problem or requiring an accurate global reconstruction [27, 95, 102]. A database is generated by manually controlling a robot to move about in the areas of interest while recording all prominent visual features in the environment. During autonomous operation, such a robot can perform self-localization by comparing the visual features in the live scene with the stored features. However, this approach is not suitable for the VI application because it restricts users to travel along pre-defined paths. It also requires someone to travel and pre-record the database.
Typically, SLAM approaches mainly capture the obstacle configuration of the environment. The visual teach-and-repeat method represents the environment using a set of image features without any semantic labeling information. Important information, such as traversable footpaths or regions of interest, is not considered by these approaches.

On the other hand, precise maps of outdoor or indoor environments may already be available, e.g., building or city plans that are accurate to within a few centimeters, an accuracy that is barely achievable by any SLAM technique. Free mapping services such as Google Map and OSM provide geo-tagged maps for almost everywhere worldwide.
Parra et al. [85] proposed a map-aided localization system that matches the motion trajectory of moving vehicles with the road network downloaded from OSM whenever GPS signals are of low quality or GPS measurements are unavailable. The accuracy of this method depends on the accuracy of the motion trajectory estimated from stereo cameras.
To improve the robustness of localization, many navigation projects incorporate particle filters into the system. For instance, Oh et al. [79] proposed a particle filter system to localize a walking user within a map. Instead of initializing particles uniformly over an entire area, they exploited the semantic information available in maps to bias the motion model toward areas of higher probability. The system assigns a higher probability to the edges of a sidewalk based on the observation that VI people tend to walk more on the edges of a sidewalk than on its center. Given no other information, with the exception of the probability map, motion model, and initial position and orientation of a person, their system can estimate the short-term trajectory of this person by considering the most probable behavior. Particle weights are updated by the GPS measurements, which are assumed to be Gaussian-distributed.
To reduce GPS dependency, different measurements have been proposed to update particle weights. For instance, Miller et al. [73] proposed updating a particle filter with both GPS and camera measurements. To localize a self-driving car within a given map, their computer vision system detects the lane markings and stop lines on road surfaces and uses such detections to update particle weights.

Brilhault et al. [12] proposed a particle filter-based localization system to assist VI patients with navigating in urban areas. They run an object detection algorithm to detect prominent objects along the street on which the user walks, such as road signs, mailboxes, and more. The user's world coordinates are computed from the geo-tagged coordinates of the detected objects and the distance between these objects and the user as measured by the wearable stereo camera. The system in [38] localizes moving vehicles by matching the estimated VO with the road network extracted from OSM. Instead of using road networks, the system in [38] updates the weight of the particle filter from the matching results between the estimated planar surfaces from the laser point cloud and the building information from the given map. Similarly, Leung et al. [63] proposed a particle filter localization system that utilizes aerial imagery. A reference map is constructed by extracting boundary features from the aerial imagery. To perform localization, vanishing point information is extracted from live images to infer the orientation of building boundaries. This boundary information is then compared with the reference map to determine the importance factor of the particles in the particle filter.
Brubaker et al. [14] proposed a self-localization algorithm for moving vehicles. The inputs to the localization system are a road map of the surroundings and the VO estimated from a car-mounted stereo camera. The road map is first converted to a directed graph. The probabilistic state of the vehicle is then updated given the noisy VO. The pose of the vehicle is represented by a Mixture of Gaussians within the most probable road segment. Low-probability modes are removed in order to reduce the complexity of the mixture models.
4.1.2 Overview of our approach

Although there are many existing image databases and mapping services available online, such a huge amount of geo-tagged resources cannot be used directly for localization. For instance, we need to extract semantic information from those online maps, but most of the maps, such as Google Map, can only be downloaded in bitmap format. We also found that appearance-based matching methods that use existing image databases, such as Google StreetView and Google Map, are not reliable for the following reasons. First, there is barely any common overlap between the street view captured by a vehicle on the road and the pedestrian view from the sidewalk, as shown in Figure 4.3(a). Second, the online database might be outdated. For instance, the zebra crossings printed on the crosswalks in Downtown Los Angeles are not updated in the aerial photo of Google Map, as shown in Figure 4.4. In addition, a building facade in the live scene could be occluded or not visible, as illustrated in Figure 4.5.
Figure 4.3: Google StreetView vs. pedestrian's view. (a) Google StreetView; (b) pedestrian's view.
To overcome these issues, we propose constructing a street graph from the map downloaded from OSM. Our localization system is based on a particle filter similar to [13], which localizes the user within the street graph by matching the local motion trajectory with the shape of the road segment. A street graph is a directed graph (Figure 4.7(c)) that captures the connectivity of different roads.
Figure 4.4: The zebra crossings printed on the crosswalks are not updated in Google Map (the middle
diagram)
Figure 4.5: (a) Google StreetView captured in 2009. (b) Live view recorded in 2014
The inputs to the localization system include (1) a map that encompasses the starting area and destination and (2) the user trajectory over the last 3 sec. The local motion trajectory is estimated by the VO algorithm in either Section 3.9 or 3.11. GPS is optional, but it provides an initial guess and causes our probabilistic localization to converge more quickly. We first present the steps for generating the street graph from OSM, and follow with the probabilistic model of how VI users can traverse the graph.
4.2 Street graph generation from OSM

4.2.1 Data available in OSM

Unlike the free map services provided by big companies including Google, Apple, Microsoft, and Yahoo, OSM is a collaborative project for creating a free, editable map of the world. The OSM data include semantic labels, road networks, and more, contributed by volunteers. According to statistics from 2014, more than 1,600,000 registered users have contributed to or used the mapping project [83]. The OSM map editor allows users worldwide to update the map by either uploading tracks from portable GPS devices or marking roads and labeling structures on top of aerial photography. Therefore, the map data provided by OSM might be more up-to-date. The OSM map includes the world coordinates of each street, junction, footpath, point of interest, and building. In addition to numerical information, the OSM map includes semantic labels for the structures, such as road name, road type (e.g., highway or footpath), and more. The OSM map can be downloaded under an open content license to promote free use and data re-distribution.
Because all map data are contributed by users, there is no guarantee of the accuracy of the geospatial metadata. We found, however, that the map downloaded from OSM aligns well with Google Earth. Figure 4.6 shows the road networks (highlighted in yellow) from OSM superimposed on an aerial photo from Google Earth.
4.2.2 Map-to-street graph conversion

A map of the area of interest can be downloaded from OSM by specifying the top-left and bottom-right corners of a bounding box in terms of longitude and latitude. The downloaded map is in XML format, which contains the names and coordinates of all buildings, streets, and pedestrian crossings.
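As an illustration of this step, a minimal sketch is given below (Python, using the requests package and the public OSM API 0.6 map call, both of which are our assumptions; for larger areas a bulk-export service would be used instead, and the example coordinates are only illustrative):

import requests

def download_osm_map(min_lon, min_lat, max_lon, max_lat, out_path="map.osm"):
    """Download the OSM XML for a bounding box given in degrees.

    The bbox order expected by the API is left,bottom,right,top; the returned
    XML contains the nodes, ways, and semantic tags described above.
    """
    url = "https://api.openstreetmap.org/api/0.6/map"
    bbox = f"{min_lon},{min_lat},{max_lon},{max_lat}"
    resp = requests.get(url, params={"bbox": bbox}, timeout=60)
    resp.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(resp.content)
    return out_path

# Example: a small bounding box around the USC campus area (illustrative only).
download_osm_map(-118.291, 34.018, -118.277, 34.028)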
Figure 4.6: Road networks (highlighted in yellow) from OSM superimposed on an aerial photo from
Google Earth
We use the software in [14] to convert the OSM data to a street graph, a directed graph that represents the road connectivity found in the map. The software first extracts all crossings and the drivable roads that connect them, and represents the roads as piecewise-linear segments. Each bi-directional street is divided into two one-way streets, and intersections are smoothed using circular arcs. This line-and-arc representation is converted into a connectivity graph: each street segment is represented by a node, and the connectivity information of the streets is captured by the edges. All streets that are either dead-ends or are cropped out of the bounding box are assigned a "Sink" node.
Figure 4.7(a) illustrates the parameters of these line segments, highlighted in blue. Each street segment has three basic parameters: the length $\ell$, and the start and end positions $p_0$ and $p_1$. Each arc street segment is modeled by additional parameters: $c$ is the circle center, $r$ is the radius, $\theta$ is the initial heading of the street segment, $\kappa = \frac{\theta_1 - \theta_0}{\ell}$ is a curvature parameter, and $\theta_0$ and $\theta_1$ are the start and end angles of the arc, respectively.
Because a user could walk in either direction along a street, each street is divided into two one-way streets, and circular arcs are inserted to smooth intersections. A connectivity graph is constructed with nodes representing street segments and edges representing street connectivity.
Figure 4.7: (a) Street segment parameterization (origin, $p_0$, $p_1$, radius $r$, start and end angles); the red circle and yellow line denote the location and heading direction of the user, respectively. (b) Street map from OSM; the footpath is highlighted with red dotted lines. (c) Corresponding connectivity graph for (b); road segment IDs are printed in black.
Figure 4.7(b) shows a small map, and Figure 4.7(c) shows the corresponding connectivity graph for that area. All streets that are either outside the bounding box or are dead-ends are labeled as "Sink" nodes.
The global position $(X, Y)$ and orientation $\psi$ of the walking user are re-defined in terms of the map parameters $(u, d, \alpha)$, where $u$ is the unique road ID of the street segment on which the user currently is, $d$ is the distance from the origin of the street segment, and $\alpha$ is the angle offset from the local street heading. This is illustrated in Figure 4.7(a): the red filled circle and yellow line represent the camera and heading direction, respectively. Formally, $(X, Y, \psi)$ and $(u, d, \alpha)$ are related through the parameters of the street in (4.1):
$$\begin{bmatrix} X \\ Y \end{bmatrix} =
\begin{cases}
\dfrac{\ell - d}{\ell}\, p_0 + \dfrac{d}{\ell}\, p_1 & u \text{ is straight} \\[2ex]
c + r \begin{bmatrix} \cos\!\left(\dfrac{\ell - d}{\ell}\theta_0 + \dfrac{d}{\ell}\theta_1\right) \\[1ex] \sin\!\left(\dfrac{\ell - d}{\ell}\theta_0 + \dfrac{d}{\ell}\theta_1\right) \end{bmatrix} & u \text{ is an arc}
\end{cases}
\qquad
\psi = \theta + \alpha + \kappa d. \qquad (4.1)$$
This conversion is used in Section 4.3.1.
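As an illustration, the following sketch (Python; the Segment record and its field names are our assumptions, not the data structure of the conversion software) evaluates (4.1) to recover the global pose from the map parameters:

import math
from dataclasses import dataclass

@dataclass
class Segment:
    """Minimal street-segment record (illustrative field names)."""
    length: float                # l
    p0: tuple                    # start position
    p1: tuple                    # end position
    theta: float                 # initial heading of the segment
    kappa: float                 # curvature parameter (0 for straight segments)
    is_arc: bool = False
    center: tuple = (0.0, 0.0)   # circle center c (arcs only)
    radius: float = 0.0          # radius r (arcs only)
    theta0: float = 0.0          # start angle of the arc
    theta1: float = 0.0          # end angle of the arc

def map_to_global(seg, d, alpha):
    """Convert map parameters (u, d, alpha) to (X, Y, psi) using Eq. (4.1)."""
    w0, w1 = (seg.length - d) / seg.length, d / seg.length
    if seg.is_arc:
        ang = w0 * seg.theta0 + w1 * seg.theta1
        X = seg.center[0] + seg.radius * math.cos(ang)
        Y = seg.center[1] + seg.radius * math.sin(ang)
    else:
        X = w0 * seg.p0[0] + w1 * seg.p1[0]
        Y = w0 * seg.p0[1] + w1 * seg.p1[1]
    psi = seg.theta + alpha + seg.kappa * d   # global heading along the street
    return X, Y, psi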
4.3 Particle Filter

Our localization method is an extension of [14], which is based on the Monte Carlo Localization (MCL) framework. Figures 4.8(b) to 4.8(h) show the output of the proposed position estimation system at different time steps, from initialization until the user is localized without using GPS. The probability distribution of the user's location over the map is maintained by a particle filter updated with the most recent VO measurements (the red trajectories in Figure 4.8) within a small sliding window. Each particle represents an explicit hypothesis for where the user might be, and moves about in the street network randomly. The legend is shown in Figure 4.8(a). In order not to clutter the entire figure with particles, only the modes are shown (as blue spots) in the localization results. Each mode represents a group of particles with prominent probability on that road segment. Because the user can be anywhere immediately after initialization, the particles are spread over the entire map in Figure 4.8(b). Because each particle is forced to be on a street segment, the motion trajectory of each particle is constrained by the shape of the street. When a new VO measurement is available, the particle filter adjusts the importance weight of each particle by comparing its motion trajectory with the user's trajectory. Particles with weight lower than a threshold are removed because they are unlikely to be on the same street segment as the user. New particles are inserted at locations with higher probability. Therefore, modes (blue spots) disappear from the "Localization Result" windows in Figures 4.8(c) to 4.8(g). When all particles converge to a single spot in Figure 4.8(h), the user is declared "localized" and is marked with a green circle.
Figure 4.8: Output of the proposed position estimation method on the campus dataset. (a) Legend (optical flow from the ground surface, visual odometry, localization result, ground truth, estimated user's heading, mode, and estimated user's position after localization); (b)-(h) outputs for feature tracking, VO, and localization superimposed at different time steps.
4.3.1 Bayes Filter with street map prior

We express localization as a Bayesian filtering problem. Let $X_t = [X, Y, \psi]$ be the position and orientation of the camera at time $t$. Figure 4.9 shows the graphical model of the particle filter, where $t$ is time, and $X$ and $Z$ represent the state and measurement, respectively. We seek to compute the posterior distribution $p(X_t \mid Z_{1:t})$ over the state $X_t$ given all measurements $Z_{1:t}$ up to time $t$. This can be solved by applying the Bayesian filter:

$$p(X_t \mid Z_{1:t}) = \eta\, p(Z_t \mid X_t) \int p(X_t \mid X_{t-1})\, p(X_{t-1} \mid Z_{1:t-1})\, dX_{t-1}, \qquad (4.2)$$

where $\eta$ is the normalizing factor that ensures $p(X_t \mid Z_{1:t})$ sums to one over all $X_t$. The Bayesian filter computes the posterior $p(X_t \mid Z_{1:t})$ recursively from the measurement likelihood $p(Z_t \mid X_t)$, the previous posterior $p(X_{t-1} \mid Z_{1:t-1})$, and the state transition $p(X_t \mid X_{t-1})$, which is the conditional density of the current pose $X_t$ given the previous pose $X_{t-1}$.
Figure 4.9: Graphical model representation of our proposed localization system (hidden states $x_1, \ldots, x_T$ with corresponding measurements $Z_1, \ldots, Z_T$).
The 6-DoF motion of the head-mounted camera of a walking user is complex and cannot be modeled with a linear model. The problem can be simplified by incorporating the street network into the Bayesian filter. To achieve this, we make two assumptions about the motion of the VI user:

1. The user travels on the road most of the time.

2. The user cannot jump from one place to another without a connection.
Then we express the state and measurement models, and the state and street transitions, in terms of map parameters, as in [13].
4.3.2 State model

The position and orientation of the user are represented by a state vector in terms of street parameters: $x_t = (u_t, s_t)$, where $s_t = (d_t, d_{t-1}, \alpha_t, \alpha_{t-1})$, and $d_{t-1}, \alpha_{t-1}$ are the distance and angle at the previous time, defined relative to the current street with road ID $u_t$. Let $y_t$ be the measurement; the posterior and state transition distributions are factorized as:

$$p(x_t \mid y_{1:t}) = p(s_t \mid u_t, y_{1:t})\, p(u_t \mid y_{1:t}), \qquad (4.3)$$

$$p(x_t \mid x_{t-1}) = p(s_t \mid u_t, x_{t-1})\, p(u_t \mid x_{t-1}). \qquad (4.4)$$
The Bayesian filter is transformed to the map domain by substituting (4.3) and (4.4) into (4.2):

$$p(x_t \mid y_{1:t}) = \eta\, p(y_t \mid x_t) \sum_{u_{t-1}} \int p(u_t \mid x_{t-1})\, p(s_t \mid u_t, x_{t-1})\, p(u_{t-1} \mid y_{1:t-1})\, p(s_{t-1} \mid u_{t-1}, y_{1:t-1})\, ds_{t-1} \qquad (4.5)$$

$$p(x_t \mid y_{1:t}) = \eta\, p(y_t \mid x_t) \sum_{u_{t-1}} \int p(u_{t-1} \mid y_{1:t-1})\, p(u_t \mid u_{t-1}, s_{t-1})\, p(s_t \mid u_t, u_{t-1}, s_{t-1})\, p(s_{t-1} \mid u_{t-1}, y_{1:t-1})\, ds_{t-1}. \qquad (4.6)$$
The term $y_t$ is the measurement described in Section 4.3.3. The state transition $p(s_t \mid u_t, x_{t-1})$ and the street transition $p(u_t \mid x_{t-1})$ are modeled through Gaussian distributions [13].
4.3.3 Measurement model

Unlike [13], we incorporate GPS into the measurement model when the signal is available. The term $p(y_t \mid x_t)$ in (4.6) is the combined measurement likelihood from the VO and GPS measurements. Let $v_t$ be the VO observation described in Chapter 3. To minimize MCL complexity, only the linear and angular displacements from time $t-1$ to time $t$ are used. Formally, let $z$ be the axis that points forward and is aligned with the camera optical axis, and let the $x$ and $y$ axes be the image axes. Let $R$ and $T$ be the $3 \times 3$ relative rotation matrix and the translation vector, respectively, that transform 3D points from the current to the previous frame: $R(r) = R_x(r_x)R_y(r_y)R_z(r_z)$ and $T = [t_x, t_y, t_z]^T$. Then the VO input $v_t$ to the localization module is reduced to:

$$v_t = [t_z, r_y]^T. \qquad (4.7)$$
Let $g_t$ be the GPS measurement converted to the map coordinate system at time $t$. We model both $p(v_t \mid x_t)$ and $p(g_t \mid x_t)$ with Gaussian distributions:

$$p(v_t \mid x_t) = \mathcal{N}(v_t \mid M_{u_t} s_t, \Sigma_v) \qquad (4.8)$$

$$p(g_t \mid x_t) = \mathcal{N}(g_t \mid x_t, \Sigma_g), \qquad (4.9)$$
where $M_u = [m_d, m_\alpha]^T$, $m_d = (1, -1, 0, 0)^T$, and $m_\alpha = (\kappa_u, -\kappa_u, 1, -1)^T$. Here, $\kappa_u$ is the curvature parameter of the street segment $u$. The covariance matrix $\Sigma_v$ represents the uncertainty of the VO, learned from data. The covariance matrix $\Sigma_g$ represents the uncertainty of the GPS measurement given by the GPS receiver. The settings of these uncertainties are given in Section 4.5. We assume that the combined measurement likelihood is $p(y_t \mid x_t) = p(v_t \mid x_t)\, p(g_t \mid x_t)$. Note that $p(g_t \mid x_t)$ is fixed to one when the GPS measurement is not used.
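For illustration, a minimal sketch of this combined likelihood is given below (Python with SciPy; the function names and the way the street curvature and covariances are passed in are our assumptions, not the actual implementation):

import numpy as np
from scipy.stats import multivariate_normal

def vo_likelihood(v_t, s_t, kappa_u, Sigma_v):
    """p(v_t | x_t) from Eq. (4.8): v_t = [t_z, r_y] compared against the
    predicted forward displacement and heading change M_u s_t."""
    M_u = np.array([[1.0, -1.0, 0.0, 0.0],            # m_d
                    [kappa_u, -kappa_u, 1.0, -1.0]])  # m_alpha
    return multivariate_normal.pdf(v_t, mean=M_u @ np.asarray(s_t), cov=Sigma_v)

def gps_likelihood(g_t, position, Sigma_g):
    """p(g_t | x_t) from Eq. (4.9); fixed to one when no GPS fix is available."""
    if g_t is None:
        return 1.0
    return multivariate_normal.pdf(g_t, mean=position, cov=Sigma_g)

def measurement_likelihood(v_t, g_t, s_t, position, kappa_u, Sigma_v, Sigma_g):
    """Combined p(y_t | x_t) = p(v_t | x_t) p(g_t | x_t) used to weight a particle."""
    return vo_likelihood(v_t, s_t, kappa_u, Sigma_v) * gps_likelihood(g_t, position, Sigma_g)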
4.3.4 State transition

We model the state transition $p(s_t \mid u_t, x_{t-1})$ as Gauss-linear:

$$p(s_t \mid u_t, x_{t-1}) = \mathcal{N}\!\left(s_t \mid A_{u_t, u_{t-1}} s_{t-1} + b_{u_t, u_{t-1}},\; \Sigma_{x_{u_t}}\right), \qquad (4.10)$$

where $\Sigma_{x_{u_t}}$ is the covariance matrix for a given $u_t$, learned from data. Note that $s_t$ represents a local position of the camera with respect to the current street, $u_t$. In order to transform $s_t$ from one street to another, $A_{u_t, u_{t-1}}$ and $b_{u_t, u_{t-1}}$ are expressed as:
$$A_{u_t, u_{t-1}} =
\begin{bmatrix}
2 & -1 & 0 & 0 \\
1 & 0 & 0 & 0 \\
0 & 0 & \kappa_{u_t} & 0 \\
0 & \kappa_{u_{t-1}} \kappa_{u_t} & 1 & 0
\end{bmatrix} \qquad (4.11)$$

$$b_{u_t, u_{t-1}} =
\begin{cases}
\left(-\ell_{u_{t-1}},\, -\ell_{u_{t-1}},\, 0,\, \Delta\theta_{u_t, u_{t-1}}\right)^T & u_t \neq u_{t-1} \\
(0,\, 0,\, 0,\, 0)^T & u_t = u_{t-1}
\end{cases} \qquad (4.12)$$

where $\Delta\theta_{u_t, u_{t-1}} = \theta_{u_t} - \theta_{u_{t-1}} + \kappa_{u_t} \ell_{u_{t-1}}$, which is the angle between the end of the current street $u_t$ and the beginning of $u_{t-1}$ at the previous time.
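A minimal sketch of how a particle's local state could be propagated under this Gauss-linear model (Python with NumPy; the function name and argument layout are illustrative, and A, b, Sigma stand for the street-dependent quantities defined above):

import numpy as np

def propagate_state(s_prev, A, b, Sigma, rng=None):
    """Sample s_t ~ N(A s_{t-1} + b, Sigma) as in Eq. (4.10).

    s_prev: previous local state (d_{t-1}, d_{t-2}, alpha_{t-1}, alpha_{t-2})
    A, b:   street-dependent transition matrix and offset (Eqs. 4.11-4.12);
            b is the zero vector when the particle stays on the same street
    Sigma:  transition covariance learned from data for the current street
    """
    rng = rng or np.random.default_rng()
    mean = np.asarray(A) @ np.asarray(s_prev) + np.asarray(b)
    return rng.multivariate_normal(mean, Sigma)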
4.3.5 Street transition

The street transition $p(u_t \mid x_{t-1})$ defines the probability of transitioning onto street $u_t$ given the previous state $x_{t-1}$. We use Gaussian transition dynamics to define the probability of changing street segments:

$$p(u_t \mid x_{t-1}) = \beta_{u_t, u_{t-1}} \int_{\ell_{u_{t-1}}}^{\ell_{u_{t-1}} + \ell_u} \mathcal{N}\!\left(x \mid a_d^T s_{t-1},\; a_d^T \Sigma_{x_{u_{t-1}}} a_d\right) dx \qquad (4.13)$$

where $a_d = (2, -1, 0, 0)$ and

$$\beta_{u_t, u_{t-1}} =
\begin{cases}
1 & u_t = u_{t-1} \\
\frac{1}{|N(u_{t-1})|} & u_t \in N(u_{t-1}) \\
0 & \text{otherwise}
\end{cases} \qquad (4.14)$$

where $N(u)$ is the set of streets to which $u$ connects. Equation (4.14) states that, at a road junction, the user has the same probability of entering any new street connected to $u_{t-1}$.
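For illustration, a sketch of this street-transition probability is given below (Python with SciPy; the graph representation, the function name, and the integration limits used for the stay-on-the-same-street case are our assumptions on top of (4.13) and (4.14)):

import numpy as np
from scipy.stats import norm

def street_transition_prob(u_t, u_prev, s_prev, Sigma_prev, lengths, neighbors):
    """p(u_t | x_{t-1}) following Eqs. (4.13)-(4.14).

    lengths:   dict road_id -> segment length l
    neighbors: dict road_id -> list of connected road_ids, i.e. N(u)
    """
    a_d = np.array([2.0, -1.0, 0.0, 0.0])
    mu = a_d @ np.asarray(s_prev)                # predicted distance along u_prev
    sd = float(np.sqrt(a_d @ Sigma_prev @ a_d))  # 1-D predictive std. deviation

    if u_t == u_prev:
        beta, lo, hi = 1.0, 0.0, lengths[u_prev]     # stay within the segment
    elif u_t in neighbors[u_prev]:
        beta = 1.0 / len(neighbors[u_prev])          # uniform over N(u_prev)
        lo = lengths[u_prev]                         # past the end of u_prev
        hi = lengths[u_prev] + lengths[u_t]          # within the next segment
    else:
        return 0.0
    return beta * (norm.cdf(hi, mu, sd) - norm.cdf(lo, mu, sd))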
4.3.6 Mixture of Gaussians

From (4.3), the posterior is factorized into a continuous distribution $p(s_t \mid u_t, y_{1:t})$ and a discrete distribution $p(u_t \mid y_{1:t})$. The term $p(s_t \mid u_t, y_{1:t})$ represents the distribution over the position and orientation along a given road. We model this distribution using a Mixture of Gaussians, i.e.,

$$p(s_t \mid u_t, y_{1:t}) = \sum_{i=1}^{N_{u_t}} \pi_{u_t}^{(i)}\, \mathcal{N}\!\left(s_t \mid \mu_{u_t}^{(i)}, \Sigma_{u_t}^{(i)}\right), \qquad (4.15)$$

where $N_{u_t}$ is the number of components of the mixture associated with $u_t$, and $\pi_{u_t}^{(i)}$, $\mu_{u_t}^{(i)}$, and $\Sigma_{u_t}^{(i)}$ are the weight, mean, and covariance of the $i$-th Gaussian component for road $u_t$, respectively. The other term, $p(u_t \mid y_{1:t})$, represents the distribution over streets, and it is computed by summing the weights of the mixture model. After each update, our system uses the method of [13] to eliminate particles with low weight while maintaining the KL divergence [40] below a threshold fixed at 0.01. We use Maximum-a-Posteriori (MAP) estimation to extract the user location from the posterior distribution in (4.6). The output is the most probable state of the user in the street graph.
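As a small illustration of this last step, a common approximation of the MAP state can be read off the mixture by taking the street with the largest total weight and the heaviest component on that street (Python; the data layout is our assumption):

def map_estimate(mixtures):
    """Approximate MAP user state from per-street Gaussian mixtures.

    mixtures: dict road_id -> list of (weight, mean, cov) components, with
    the weights over all roads summing to one (Eq. 4.15 and p(u_t | y_1:t)).
    Returns (road_id, mean of the heaviest component on that road).
    """
    # p(u_t | y_1:t) is the summed component weight of each street's mixture.
    best_road = max(mixtures, key=lambda u: sum(w for w, _, _ in mixtures[u]))
    # Use the heaviest Gaussian component on that street as the mode.
    _, mu, _ = max(mixtures[best_road], key=lambda comp: comp[0])
    return best_road, mu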
4.4 Running time

The running time of our system was recorded on a desktop with a quad-core i7 CPU running at 3.4 GHz. Our localization system shares the same complexity as the original localization system proposed for automobiles in [13]. The Python implementation achieves real-time performance with 16 cores. In practice, the localization process could run remotely on the server side, whereas the motion estimation can run on the user's mobile device. Details are provided in Chapter 5.

The additional time required by our system is for the VO algorithm. Our stereo VO (Section 3.3) runs at 39.5 ms per frame without the IMU, which is approximately 17 ms slower than Libviso2. The running time can be greatly reduced by computing the dense disparity image with a GPU [22]. The smartphone VO (Section 3.11) is much faster and runs at 20 ms per frame.
4.5 Validation

We modified the open-source MCL [13] with our motion estimation and measurement model described in Chapter 3 and Section 4.3.3, respectively. In order to have a fair comparison, we fixed the values of $\Sigma_v$ and $\Sigma_x$ in all experiments. The diagonal elements of the two diagonal matrices are [8.21e-07, 1.81e-04] and [2.46e-06, 0, 1.14e-02, 0], respectively.
Algorithms/Modules            Stereo VO (without IMU)   Smartphone VO   Libviso2
Dense disparity               17.08 ms                  -               -
Ground detection              1.93 ms                   -               -
Plane fitting                 0.75 ms                   -               -
KLT tracking                  3.46 ms                   3.46 ms         -
Gauss-Newton optimization     16.35 ms                  16.35 ms        -
Total time per frame          39.57 ms                  19.81 ms        22.00 ms

Table 4.1: Comparison of computation time averaged over 1,000 frames at a resolution of 320 x 240 pixels. The average numbers of feature points tracked by our algorithm and Libviso2 are 370 and 360, respectively.
To validate the proposed localization system, we collected a large amount of data with a stereo camera and a smartphone in both the USC Campus and Downtown Los Angeles areas. Our datasets consist of stereo image sequences captured by the stereo Vuzix camera at 30 fps while walking about the areas; IMU data were logged at 100 Hz. We also collected a smartphone dataset by walking with a Samsung Galaxy S5 smartphone placed in a shirt pocket. Video sequences were recorded with the smartphone's rear camera at 30 fps. The IMU and GPS data were logged at 50 Hz and 1 Hz, respectively, using the same smartphone simultaneously. The data are input into our system and the original reference system [13] separately. Both methods run on a desktop computer. All the experiments described below were conducted offline by processing the pre-recorded data in order to benchmark against the original reference system [13]. We will be able to conduct live experiments in real time soon using the server-client configuration described in Chapter 5.
The maps used for the localization experiments are shown in Figures 4.10(a) and 4.10(b). During system initialization, we assign uniform probability to all street segments and place one particle every 10 m along each segment.
Figure 4.10: Street maps for the (a) Campus and (b) Downtown datasets. Both maps encompass an area of approximately 1 km². Only the yellow road segments are converted to a connectivity graph for localization, and thus each file is smaller than 13 MB. Note that the road network pattern in Downtown Los Angeles is primarily a grid system; therefore, the VO estimation must be accurate in order for the particle filter to converge to the correct spot.
The accuracy of the position and orientation is evaluated only when the system has been localized. This occurs when there is a single mode in the posterior for at least 5 sec continuously and the distance from the mode to the ground truth is less than 20 m. Once the localization criteria are met, all subsequent frames are considered localized until the system becomes "lost." This occurs when the localization criteria are not met continuously for more than 5 sec. Errors of the estimated position and heading of the MAP state for localized frames are computed using the ground truth extracted from Google Earth. This is performed manually by comparing the ground texture found in the recorded images with the Google satellite map. The corresponding UTM coordinates (WGS84) from Google Earth are used as the ground-truth camera locations. The height of all ground truth is set to zero. We assume that the camera always looks straight ahead during data collection, and we compute the ground-truth heading angles from consecutive ground-truth locations.
Our system localizes the user even without GPS. When GPS is available, although the signal might be noisy, the measurements provide an initial guess that reduces the search area and accelerates the localization process. The average GPS errors for the USC Campus and Downtown Los Angeles datasets are approximately 10 m and 35 m, respectively. The GPS accuracy in Downtown Los Angeles is poorer because the dataset was recorded in an area surrounded by high-rise buildings, as shown in Figure 4.13. The covariance matrix $\Sigma_g$ is constructed from the accuracy of each GPS reading obtained by calling the Android SDK function location.getAccuracy(), which returns a radius of 68% confidence in meters. Constructing $\Sigma_g$ with the raw GPS accuracy might over-shrink the search area and cause the particle filter to be unable to recover from prediction errors or noisy observation updates. Therefore, we scale up the accuracy value by a factor of three before forming $\Sigma_g$.
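A minimal sketch of how such a covariance could be formed (Python with NumPy; the factor of three follows the text, while the isotropic two-dimensional form is our assumption):

import numpy as np

def gps_covariance(accuracy_m, scale=3.0):
    """Build an isotropic 2-D covariance Sigma_g from a GPS accuracy radius.

    accuracy_m: 68%-confidence radius in meters (e.g., Location.getAccuracy()
    on Android); it is inflated by `scale` so the particle filter can recover
    from prediction errors or noisy observation updates.
    """
    sigma = scale * accuracy_m
    return np.diag([sigma ** 2, sigma ** 2])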
4.5.1 Localization results with stereo camera

We first demonstrate our approach using the image sequences captured by Vuzix (Figure 1.3), an off-the-shelf wearable stereo camera with an IMU, while walking through different outdoor environments. The stereo rig is composed of two low-quality cameras with an extremely short baseline of 6 cm.

We captured a total of 50 min of video sequences and IMU readings around the USC Campus and Downtown Los Angeles areas. The walking distances are 1.7 km and 2.0 km, respectively. The average walking speed is 1.0 m/sec. Figures 4.11 and 4.12 show some sample images from the datasets.
In the first localization experiment, we discard all GPS measurements; the results are listed in Table 4.2. We repeat the experiments with GPS measurements, and the results are listed in Table 4.3. The terms "Localized Error" and "Lost Error" measure the errors over all frames labeled as localized and lost, respectively. "Lost Error" is "NA" when the method does not become "lost" in the video sequence once localized.
Figure 4.11: Sample frames of the USC Campus dataset
Figure 4.12: Sample frames of the Downtown dataset
As indicated in Table 4.2, when GPS is not used, we can estimate the position to within 5.73 m and 12.32 m for the campus and Downtown Los Angeles datasets, respectively. The errors are less than the average GPS error in both areas, which is approximately 10 m and 35 m, respectively. The system requires longer to localize the Downtown Los Angeles sequence because the map used for the experiment is dominated by equiangular intersections and equal-length streets, as shown in Figure 4.10(b). Although [13] provides a reasonable result for the campus sequence, it produces a large position error for the Downtown Los Angeles dataset and does not meet the criteria for localization over the entire sequence.
Figure 4.13: Ground truth path (yellow curve) surrounded by high-rise buildings that cause large GPS
error.
                                    USC Campus          Downtown
                                    [13]     Ours       [13]    Ours
Localized Position Error (m)
  Average                           16.7     5.73       *       12.32
  Standard Dev.                     18.42    5.36       *       11.53
Localized Heading Error (deg)
  Average                           21.68    7.36       *       7.26
  Standard Dev.                     20.25    6.53       *       10.21
Lost Position Error (m)
  Average                           NA       NA         *       NA
  Standard Dev.                     NA       NA         *       NA
Lost Heading Error (deg)
  Average                           NA       NA         *       NA
  Standard Dev.                     NA       NA         *       NA
Time to localize (sec), Average     144      180        *       873

Table 4.2: Comparison of localization errors using the stereo camera without GPS. A sequence that is not localized is marked with "*".
This is because the VO algorithm in [13] is not robust to crowded environments; hence, it provides noisy motion estimation and causes the localized system to become lost. On the other hand, the proposed VO method, which exploits the ground plane constraint, is much more stable.
                                    USC Campus          Downtown
                                    [13]     Ours       [13]     Ours
Localized Position Error (m)
  Average                           8.03     5.49       57.7     12.17
  Standard Dev.                     7.82     4.45       36.32    9.21
Localized Heading Error (deg)
  Average                           16.35    7.18       26.37    6.97
  Standard Dev.                     22.02    6.42       42.13    11.28
Lost Position Error (m)
  Average                           NA       NA         NA       NA
  Standard Dev.                     NA       NA         NA       NA
Lost Heading Error (deg)
  Average                           NA       NA         NA       NA
  Standard Dev.                     NA       NA         NA       NA
Time to localize (sec), Average     15       27         630      705

Table 4.3: Comparison of localization errors using the stereo camera with GPS. A sequence that is not localized is marked with "*".
After including GPS measurements (Table 4.3), the time required to meet the localization criteria (the time-to-localize) is shortened by a large amount for the USC Campus dataset, but shows only a small improvement for the Downtown Los Angeles data. This is because the measured GPS covariance $\Sigma_g$ is larger for the Downtown Los Angeles dataset than for the USC Campus dataset.
4.5.2 Localization results with smartphone

This section presents the localization results with a monocular camera, using the image sequences captured with a Samsung Galaxy S5 smartphone while walking through trajectories similar to those in Section 4.5.1. The dataset was collected by placing the mobile phone in the walker's shirt pocket, as shown in Figure 1.1(b). We placed tissue paper to fill the pocket in order to reduce motion blur from camera shake while walking; no other mounting hardware was used. This dataset is more challenging than hand-held or body-mounted settings because the mobile phone camera moves not only vertically but also horizontally because of the swinging arms while walking. In addition, we could not see the cellphone screen during the video capture process. The walking distances are 2.1 km and 2.8 km, respectively. The average walking speed is 1.0 m/sec. In addition to the camera, our system only utilizes the roll and pitch angle readings from the accelerometer.
We input the datasets into our system as well as the original method [13], and use the same set of parameters as in Section 4.5.1. Although smartphones do not measure depth, the localization results using monocular VO are still accurate to 8.31 m and 9.45 m for the USC Campus and Downtown Los Angeles datasets, respectively (Table 4.4), without relying on GPS, once the system is localized. The original system [13] fails to localize the Downtown Los Angeles dataset when GPS is not used because of noisy VO measurements. Although it can localize the USC Campus dataset, it gives a large position error after becoming "lost" shortly afterward because of incorrect VO updates. On the other hand, our system gives the lowest error in both cases and does not become "lost" after localizing the user. The Downtown Los Angeles sequence requires longer to localize because the map used for the experiment is dominated by equiangular intersections and equal-length streets, as shown in Figure 4.10(b).
After including GPS measurements (Table 4.5), the original system [13] localizes the Downtown Los Angeles dataset as well. Although it becomes "lost" five times, it can recover after incorporating GPS measurements. On the other hand, our system gives the lowest error in both cases and does not become "lost" after localizing the user.
The corresponding qualitative results are shown in Figures 4.14(a) and 4.14(b). The red curve is the localization output with updates from the VO measurements. It is very close to the ground truth (in blue) after the system is localized in both datasets. The black curve shows the GPS measurements, which are accurate only for the USC Campus dataset and have a larger position error in the Downtown Los Angeles area.
                                    USC Campus          Downtown
                                    [13]     Ours       [13]    Ours
Localized Position Error (m)
  Average                           9.23     8.31       *       9.45
  Standard Dev.                     11.22    8.66       *       8.47
Localized Heading Error (deg)
  Average                           12.21    4.75       *       8.36
  Standard Dev.                     21.94    6.53       *       11.44
Lost Position Error (m)
  Average                           303.75   NA         *       NA
  Standard Dev.                     84.35    NA         *       NA
Lost Heading Error (deg)
  Average                           120.41   NA         *       NA
  Standard Dev.                     36.98    NA         *       NA
Time to localize (sec), Average     579.00   81.00      *       372.00

Table 4.4: Comparison of localization errors using the mobile phone without GPS. A sequence that is not localized is marked with "*".
                                    USC Campus          Downtown
                                    [13]     Ours       [13]     Ours
Localized Position Error (m)
  Average                           8.93     6.59       28.39    9.23
  Standard Dev.                     7.90     4.36       35.42    8.39
Localized Heading Error (deg)
  Average                           14.20    4.84       36.80    9.24
  Standard Dev.                     23.04    6.67       44.64    12.48
Lost Position Error (m)
  Average                           NA       NA         75.65    NA
  Standard Dev.                     NA       NA         46.29    NA
Lost Heading Error (deg)
  Average                           NA       NA         69.52    NA
  Standard Dev.                     NA       NA         56.96    NA
Time to localize (sec), Average     9.00     9.00       51.00    57.00

Table 4.5: Comparison of localization errors using the mobile phone with GPS.
4.5.3 Comparison between stereo Vuzix and mobile phone

In order to compare the performance of our system between the stereo and monocular devices, we consolidate the localization results of our method from Tables 4.2, 4.3, 4.4, and 4.5 in Table 4.6. The results show that, by replacing the stereo camera with the monocular smartphone camera, the localized position error is reduced in the Downtown Los Angeles area, even though depth measurements are not available on smartphones. In addition, the time to localize is shortened by more than 50% in all cases. The reasons are explained below.
GPS
Ground Truth
PF
Localized
End
(a) Localization results on USC Campus dataset
without using GPS, total walking distance is 2.1
km
GPS
Ground Truth
PF
Localized
End
(b) Localization results on Downtown dataset with-
out using GPS, total walking distance is 2.8 km
Figure 4.14: Red dotted curve denotes output trajectory for our localization system using smartphone
camera after localized at spot marked by green circle
Figure 4.15: (a) Rectified stereo image pair captured by Vuzix while moving outdoors; the vertical distance between each pair of red grid lines is 12 pixels. (b) Corresponding dense disparity map. Under normal operating conditions, the same scene point should appear at the same y-coordinate in both rectified images. This condition is fulfilled here; hence, the disparity image is smooth and accurate.
The stereo wearable glasses provide depth information. Although the range is limited by the short baseline (6 cm), our VO method relies only on the depth measurements of the ground plane immediately in front of the user. The stereo pair provides reliable depth in open areas under consistent lighting conditions, as shown in Figure 4.15.
However, the depth measurements are noisy in shaded areas, as shown in Figure 4.16. This is because the frame rate of both the left and right cameras drops automatically under low-light conditions, but each camera settles at a slightly different frame rate. Hence, the stereo camera generates corrupted disparity images because the left and right cameras are no longer synchronized. In addition, the auto-exposure adjustment of the Vuzix stereo camera is not robust when walking through the shaded areas commonly found in real urban environments.
On the other hand, the auto gain control of the smartphone is very responsive and manages to change the camera exposure to adapt to different outdoor lighting conditions. However, this smart gain control does not provide a significant improvement in the open terrain found in the USC Campus dataset, in which there is no shaded area along the ground truth path. This explains why the Vuzix stereo camera gives a smaller position error in the USC Campus dataset: the depth measurements are reliable under consistent lighting conditions. Figures 4.17(a) and 4.17(b) show the images taken by the two different sensors at approximately the same location. The Vuzix camera fails to boost the camera exposure when walking under the shade. On the other hand, the smartphone camera adjusts the gain automatically such that the scene outside the shade is overexposed but the details under the shade are clearly displayed.

The time-to-localize value depends on the shape of the road segments and the accuracy of the visual odometry estimation before the user is localized. Although the visual odometry estimated by the stereo camera drifts more slowly than that from the monocular smartphone over the entire trajectory, the smartphone gives more locally accurate visual odometry within a short sliding window because of its higher image quality and better auto gain control. Hence, the smartphone gives faster time-to-localize values even though its average localized position error is poorer than that of the stereo camera.
                      USC Campus                          Downtown
                      without GPS       with GPS          without GPS       with GPS
                      Stereo   Phone    Stereo   Phone    Stereo   Phone    Stereo   Phone
Localized Position Error (m)
  Average             5.73     8.31     4.92     6.59     12.32    9.45     9.94     9.23
  Standard Dev.       5.77     8.66     5.38     4.36     14.7     8.47     8.78     8.39
Localized Heading Error (deg)
  Average             7.36     4.75     7.13     4.84     7.26     8.36     7.36     9.24
  Standard Dev.       6.95     6.53     6.87     6.67     6.01     11.44    8.17     12.48
Time to localize (sec)
  Average             180.00   81.00    27.00    9.00     873.00   372.00   303.00   57.00

Table 4.6: Localization comparison between the stereo Vuzix and the mobile phone using the proposed method
Figure 4.16: The first to last rows show four consecutive stereo image pairs captured while the Vuzix camera moves under a shaded area; the corresponding disparity maps are in the last column. The disparity maps for the second and third stereo pairs are corrupted; only the first and last rows are acceptable. This is because the left and right images were captured at slightly different times under low-light conditions. Notice that the floor textures do not appear at the same y-coordinates in the left and right images in the second and third rows.
Figure 4.17: Image quality comparison between the stereo Vuzix and the smartphone under shade. (a) Stereo Vuzix under shade; (b) smartphone camera under shade.
4.5.4 Very large-scale localization results

We validate the proposed method in a real scenario of hours of walking about in both open terrain and an urban environment. We assume that a user wants to walk from the USC Campus to the City National Bank in the Los Angeles Financial District through four places, as shown in Figures 4.18(b) to 4.18(e). Our framework is initialized with an approximate location of the user, either from a noisy GPS measurement or from the names of nearby roads or landmarks. In this case, the user might specify "USC Campus" as the starting location and "LA City National Bank" as the final destination. In addition, the user might insert multiple waypoints before reaching the destination, as shown in Figures 4.18(b) to 4.18(e).
The proposed system rst downloads a map that encompasses both the USC Campus and Los
Angeles Financial District from OSM if the map does not exist in the device's memory. The road
network is extracted as shown by the blue lines in Figure 4.20, and the corresponding connectivity graph
is constructed. This map construction process is fast and might require from few seconds to a minute
depending on the size of the map area.
Even before the map is ready, the user might start walking along the road. The user's trajectory is
estimated by the VO and is used by the particle filter for localization once the map is ready. The walking
distance required for localization depends on the shape of the road. For instance, the particle filter
converges faster if the user makes some turns.
The ground truth path is shown in Figure 4.19. The total walking distance is 6.7 km. We captured
video and IMU measurements for 1.5 hours with a Samsung Note 3 placed in the walker's shirt pocket,
as shown in Figure 1.1(b). The qualitative result is shown in Figure 4.22, whereas the quantitative
results are listed in Table 4.7. Four areas enclosed within yellow boxes are enlarged in Figures 4.23(a) to
4.23(d). Our system can localize the user after a walking distance of 93 m as marked by the green circle
in Figure 4.23(a). The GPS (black) error in this area is small, with the exception of the area adjacent
to the Shrine Auditorium, marked by the green bubble on the map and in the live image at the bottom right
corner. In contrast, the estimated position (in red) is extremely close to the ground truth, which
is highlighted in blue.
Figure 4.23(c) shows that the GPS measurement is extremely noisy around the area under the flyover
marked by the green bubble. However, the estimated position of our system is not affected. The GPS
error is even worse in the Financial District, as shown in Figure 4.23(d), because that area is surrounded
by high-rise buildings, as shown in Google Earth (Figure 4.24).
(a) Starting Point (b) Waypoint #01 (c) Waypoint #02 (d) Waypoint #03 (e) Waypoint #04 (f) Final Destination
Figure 4.18: Test scenario: the user specifies four waypoints between the starting location and the final destination
Figure 4.19: Path output by the path planner; the total distance is 6.7 km and the approximate walking time is 1.5 hours
[Figure 4.20 plots the extracted road network against longitude (deg) and latitude (deg).]
Figure 4.20: Map for navigating from campus to downtown and vice versa. Only the blue road segments
are encoded into the map. The file size is approximately 35 MB, which includes both road connectivity
information and semantic labels.
4.6 Recovery from temporary loss
The prediction-and-update approach in the particle filter can fail when the difference between the observed
VO and the predicted camera motion is too large. The observation can then validate an incorrect prediction
and cause the particles to diverge into multiple modes. In practice, a localized system may become
"lost" when the user changes velocity unexpectedly or when there is a large error in the VO measurement. An
outdated map or a user leaving the map area can also cause the system to become "lost." The filter can
generate a large positional error when "lost," and it may require many update cycles to recover. Hence, it
is important to develop a recovery strategy to shorten the time required for recovery (or re-localization)
when failure occurs, especially during GPS outages.
Figure 4.21: Sample frames from dataset
To shorten the recovery time and reduce the localization error during a "lost" state, our system restarts
the particle filter when "lost" is detected. There are many possible ways of detecting "lost"; however,
we are more interested in the recovery performance, and thus we assume that the time of being "lost"
is known by comparing the localization output with the ground truth. The restart is performed by
re-initializing all particles to a uniform distribution within 100 m of the last trusted location of the user.
This erases the history of the particle filter and starts a new estimation, as illustrated in Figures
4.25(a) to 4.25(f). Experimental results show that our system provides a much lower localization error in
the "lost" state compared with the state-of-the-art system [13].
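A minimal sketch of this restart step is given below; it is illustrative only (the Particle fields and the metre-based local frame are assumptions), not the actual implementation.

// Sketch: re-initialize all particles uniformly within 100 m of the last
// trusted location once "lost" is detected.  Particle is a hypothetical type.
#include <cmath>
#include <random>
#include <vector>

struct Particle { double x, y, heading, weight; };  // local position (m), heading (rad)

std::vector<Particle> restartFilter(double lastX, double lastY,
                                    int numParticles, std::mt19937& rng) {
    const double kPi = 3.14159265358979323846;
    std::uniform_real_distribution<double> unit(0.0, 1.0);
    std::uniform_real_distribution<double> angle(0.0, 2.0 * kPi);
    std::vector<Particle> particles(numParticles);
    for (Particle& p : particles) {
        // sqrt(u) makes the samples uniform over the 100 m disc rather than
        // clustered at its centre.
        double r = 100.0 * std::sqrt(unit(rng));
        double a = angle(rng);
        p.x = lastX + r * std::cos(a);
        p.y = lastY + r * std::sin(a);
        p.heading = angle(rng);            // heading is unknown after a restart
        p.weight  = 1.0 / numParticles;    // equal weights: the history is erased
    }
    return particles;
}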
We evaluate the re-localization performance using the following parameters: i) the statistics of the
pose error before and after becoming lost, and ii) the time required for the system to re-localize after
becoming lost.
Figure 4.22: Final trajectory output from our proposed method, shown as a red dotted line (bird's-eye view); the four numbered yellow boxes mark the areas enlarged in Figure 4.23
The time to re-localize is the time required for the posterior to meet the criterion of localization, i.e., for
a single mode to remain within 20 m of the ground truth for at least 5 sec. We artificially introduce a
large error into the VO at regular intervals after the sequence is localized for the first time. To perturb the VO, we
scale up the observed linear displacement by a factor of three for 1 sec, and offset the observed angular
displacement by 30 deg for 10 sec. Figure 4.26 illustrates the impact of the two perturbations on the positional error.
All frames after a perturbation are marked as lost until the criteria for localization are
met again. All subsequent frames are then relabeled as localized until the next perturbation.
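The perturbation itself can be summarized by the short sketch below; the per-frame VoFrame type, the frame timing, and the interpretation of the 30 deg offset as being applied to each frame inside the 10 sec window are assumptions made for illustration.

// Sketch: inject the artificial VO error used in the re-localization tests.
#include <vector>

struct VoFrame { double t; double linear; double angular; };  // time (s), displacement (m), rotation (rad)

void perturbVo(std::vector<VoFrame>& frames, double tStart) {
    const double offset30deg = 30.0 * 3.14159265358979323846 / 180.0;
    for (VoFrame& f : frames) {
        if (f.t >= tStart && f.t < tStart + 1.0)
            f.linear *= 3.0;            // scale linear displacement by 3 for 1 sec
        if (f.t >= tStart && f.t < tStart + 10.0)
            f.angular += offset30deg;   // offset angular displacement by 30 deg for 10 sec
    }
}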
(a) Starting Point and destination 01 (b) Destination 02 (c) Destination 03 (d) Destination 04
Figure 4.23: Enlarged output. In (a), on-map annotations mark the starting point and the location where the user is localized.
To obtain more representative statistics, we duplicate both datasets multiple times and concatenate
each into a single video sequence. Because the first and last frames of the datasets were captured at
approximately the same camera pose, the VO has no difficulty computing the transformation between loops.
This ensures that the cumulative drifting error of the VO is propagated from one loop to another. Each
dataset is duplicated 12 times, and a VO error is added every 2,400 sec to create 11 re-localization
tests. Because we are only interested in comparing how the particle filters manage recovery, we
substitute the VO algorithm of the original system [13] with ours, as presented in Chapter 3, which makes
the re-localization test results independent of the motion estimation update.

Figure 4.24: The Financial District is surrounded by high-rise buildings (Google Earth view)

                                  Without GPS    With GPS
Localized Position  Average             9.86         8.00
Error (m)           Standard Dev.      11.34         7.35
Localized Heading   Average             8.33         7.64
Error (deg)         Standard Dev.      11.72         9.46
Time to localize    Average               93           81
(sec)

Table 4.7: Quantitative localization results
4.6.1 Re-localization test results
Tables 4.8 and 4.9 summarize the re-localization test results using the stereo camera and the smartphone,
respectively, without GPS.
Figure 4.25: Steps taken for recovery by the proposed system. The black line shows the most recent ground
truth trajectory, and the blue spot with a green ring marks the localized user's location estimated by the proposed
system. (a) The user is localized when all particles converge to a single spot. (b) A large VO measurement error
occurs. (c) "Lost" is detected when the particles diverge into multiple modes (not shown). (d) Our method
re-initializes the particle filter by resetting the local area around the last known location with a uniform distribution.
(e) Most particles are eliminated shortly after, and two modes remain. (f) The user is re-localized as the particles
converge into a single spot again.
The terms Localized Error and Lost Error measure the error over all frames labeled as localized
and lost, respectively. "Time to re-localize" is the time between the start of the perturbation and the
system becoming localized again. In Tables 4.8 and 4.9, both methods give a similar position error in the
"localized" state because they run the same VO algorithm.
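The sketch below illustrates how these three quantities could be computed from per-frame labels; the LabeledFrame type and its field names are assumptions, not the evaluation code used here.

// Sketch: compute Localized/Lost position error averages and the time to
// re-localize after a perturbation, given per-frame labels.
#include <vector>

struct LabeledFrame { double t; double posError; bool localized; };

struct RelocStats { double localizedErr = 0, lostErr = 0, timeToRelocalize = -1; };

RelocStats evaluate(const std::vector<LabeledFrame>& frames, double perturbTime) {
    RelocStats s;
    int nLoc = 0, nLost = 0;
    for (const LabeledFrame& f : frames) {
        if (f.localized) { s.localizedErr += f.posError; ++nLoc; }
        else             { s.lostErr      += f.posError; ++nLost; }
        // first frame after the perturbation that is labeled localized again
        if (s.timeToRelocalize < 0 && f.t > perturbTime && f.localized)
            s.timeToRelocalize = f.t - perturbTime;
    }
    if (nLoc)  s.localizedErr /= nLoc;   // average error over localized frames
    if (nLost) s.lostErr      /= nLost;  // average error over lost frames
    return s;
}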
[Figure 4.26 consists of three stacked plots against time (sec): positional error (m), perturbation in linear displacement (m), and perturbation in angular displacement (rad).]
Figure 4.26: Positional error (blue line in top graph) increases when there is a perturbation in the linear and
angular displacements. Red dotted lines in the top graph depict the perturbation times, and green depicts the times
when the sequence is re-localized. Blue lines in the bottom two graphs are the original VO measurements; red lines
depict the perturbed measurements. The two red spikes in the middle plot are caused by the perturbations in linear
displacement; the corresponding angular displacements are marked by two ellipses in the bottom graph.
However, when the particle filter becomes "lost," the system in [13] provides poor position estimates. Our system
reduces the pose estimation error in the "lost" state significantly and becomes re-localized much faster. This
improvement is achieved by re-initializing all particles within the local area around the last trusted location
when the particle filter becomes lost.
Figures 4.27(a) and 4.27(b) show the plots of positional error vs. time in blue. The red dotted lines
depict the times when the large error is artificially introduced into the VO, and the green lines depict the times
when the sequence is re-localized. If no action is taken, as in [13], the positional errors remain large between each
pair of red and green dotted lines because the localization system remains in the "lost" state. Our
system reduces the error quickly after re-initializing all particles.

                                          USC Campus             Downtown
                                        [13]      Ours         [13]      Ours
Localized Position   Average           23.54      5.01           *       9.62
Error (m)            Standard Dev.     38.87      5.42           *       9.74
Localized Heading    Average           27.93      7.32           *       6.62
Error (deg)          Standard Dev.     34.8       7.32           *       6.9
Lost Position        Average          160.43     22.95           *     156.61
Error (m)            Standard Dev.    137.18     22              *     117.37
Lost Heading         Average           77.1      47.05           *      57.79
Error (deg)          Standard Dev.     78.32     42.14           *      39.2
Time to Re-localize  Average          672        115             *     643.91
(sec)                Standard Dev.    784.43     34.62           *     335.79

Table 4.8: Re-localization comparison using the stereo camera in different areas without GPS. Experiments
that did not localize are labeled with "*"
                                          USC Campus             Downtown
                                        [13]      Ours         [13]      Ours
Localized Position   Average           10.55     10.389        9.60      9.00
Error (m)            Standard Dev.      9.34      9.05         9.33      8.74
Localized Heading    Average            5.08      5.09         7.19      7.12
Error (deg)          Standard Dev.      6.89      6.89         9.40      9.50
Lost Position        Average          149.48     57.41       116.25     28.69
Error (m)            Standard Dev.    124.69     75.34        95.50     38.66
Lost Heading         Average          114.49     33.55        58.23     15.89
Error (deg)          Standard Dev.     60.61     53.76        42.30     32.00
Time to Re-localize  Average          813.82    205.09       972.81    649.36
(sec)                Standard Dev.    569.70     80.56       416.60     83.39

Table 4.9: Re-localization comparison using the mobile phone without GPS
[Figure 4.27 consists of two plots of positional error (meters) vs. time (sec).]
Figure 4.27: Position error for the re-localization test with the campus dataset captured by the smartphone: (a) [13] and (b) ours
Chapter 5
Conclusion and future work
We presented a pose tracking and localization framework designed to help VI users navigate in urban
environments. The proposed method performs global position estimation using VO together with an existing
map available from the Internet. We explored a gamut of approaches to VO estimation. We also conducted
extensive tests and obtained good results with a wearable stereo camera (Vuzix). Paradoxically,
we showed that by using a single camera with better image quality, we achieved an order of magnitude
improvement in localization speed. We validated the proposed system on real scenarios involving hours of walking
in both open terrain and an urban environment. Our method is suited not just to visual navigation
for VI individuals, but also to many other applications, such as localization for augmented reality and
vision-based navigation for autonomous vehicles.
In order to make the entire system fully functional, there are a few possible extensions to this framework.
The first is exploiting cloud computing to allow the system to achieve real-time performance.
This is done by running VO and localization separately on client and server machines, respectively,
as illustrated in Figure 5.1. Our VO algorithm is implemented in C++ and OpenCV [1] for
Microsoft Windows. It can be ported to the smartphone platform easily with either OpenCV iOS [2]
[Figure 5.1 block diagram relating the motion estimation, localization, map, guidance planning, and user interface modules to the client side (stereo camera or smartphone camera, IMU, GPS) and the server side; the client sends (timestamp, t_z, -r_y) derived from the 6-DoF relative camera motion and receives (timestamp, latitude, longitude, heading).]
Figure 5.1: Client-server implementation
or the OpenCV4Android SDK [3]. The smartphone plays the client role, while the localization process requires
more computational power and should run on a server machine with multiple cores. The server
machine could be a computer located remotely or a powerful laptop carried along with the user.
The server localizes the user based on the motion trajectory computed by the VO algorithm; therefore,
no image data is required in the transmission. The client simply needs to transmit two values (the latest
forward speed and heading) in real time to the server. After the user is localized, his/her latest pose
can be transmitted back to the client as three values: latitude, longitude, and heading. Therefore, the
bandwidth requirement for the data exchange between server and client is extremely low. In the case
of a local server, such as a laptop carried with the user, the client and server can communicate using the
simple User Datagram Protocol (UDP); otherwise, the Transmission Control Protocol/Internet Protocol
(TCP/IP) is more suitable for a remote server implementation.
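The following is a minimal POSIX-socket sketch (not the thesis code) of this exchange for the local-server UDP case; the message layouts, the port number, and the raw-struct serialization are assumptions for illustration.

// Sketch: client sends the latest forward speed and heading over UDP and
// receives the localized pose (latitude, longitude, heading) back.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

struct VoMessage   { double timestamp, forwardSpeed, heading; };        // client -> server
struct PoseMessage { double timestamp, latitude, longitude, heading; }; // server -> client

int main() {
    int sock = socket(AF_INET, SOCK_DGRAM, 0);            // UDP: enough for a local server
    sockaddr_in server{};
    server.sin_family = AF_INET;
    server.sin_port = htons(9000);                        // hypothetical port
    inet_pton(AF_INET, "127.0.0.1", &server.sin_addr);    // e.g. a laptop carried by the user

    VoMessage vo{0.0, 1.2, 0.05};                         // placeholder VO values
    // One small datagram per update keeps the bandwidth requirement extremely low.
    sendto(sock, &vo, sizeof(vo), 0,
           reinterpret_cast<const sockaddr*>(&server), sizeof(server));

    PoseMessage pose{};                                   // reply with the localized pose
    recv(sock, &pose, sizeof(pose), 0);                   // (a real system would serialize portably)

    close(sock);
    return 0;
}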
The second extension is a user interface so that VI users can specify the final destination or insert
new waypoints at any time. The third possible extension is a cue system that informs users about
assistive facilities close to their current position as estimated by our system. Assistive facilities such as zebra
crossings and tactile pavement, together with their corresponding world coordinates, are available from OSM as
semantic labels.
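As a sketch of how such a cue system could use those labels, the snippet below finds labeled facilities within a fixed radius of the user's estimated position; the Facility type, the facility contents, and the 20 m announcement radius are assumptions, not part of the implemented system.

// Sketch: find assistive facilities (e.g. zebra crossings, tactile pavement)
// whose OSM coordinates fall close to the estimated user position.
#include <cmath>
#include <string>
#include <vector>

struct Facility { std::string label; double lat, lon; };   // from OSM semantic labels

// Haversine distance in metres between two latitude/longitude pairs.
double haversineMetres(double lat1, double lon1, double lat2, double lon2) {
    const double kEarthRadius = 6371000.0;
    const double kDeg2Rad = 3.14159265358979323846 / 180.0;
    double dLat = (lat2 - lat1) * kDeg2Rad;
    double dLon = (lon2 - lon1) * kDeg2Rad;
    double a = std::sin(dLat / 2) * std::sin(dLat / 2) +
               std::cos(lat1 * kDeg2Rad) * std::cos(lat2 * kDeg2Rad) *
               std::sin(dLon / 2) * std::sin(dLon / 2);
    return 2.0 * kEarthRadius * std::asin(std::sqrt(a));
}

std::vector<Facility> nearbyFacilities(const std::vector<Facility>& all,
                                       double userLat, double userLon,
                                       double radiusMetres = 20.0) {
    std::vector<Facility> hits;
    for (const Facility& f : all)
        if (haversineMetres(userLat, userLon, f.lat, f.lon) <= radiusMetres)
            hits.push_back(f);      // candidates to announce to the user
    return hits;
}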
Thanks to these encouraging results, we will perform experiments with real patients from the Braille
Institute Sight Center in Los Angeles. Software and datasets for our methods are available at
http://www-scf.usc.edu/~tungsinl/NavForBlind.
Reference List
[1] OpenCV. http://opencv.org/.
[2] OpenCV iOS. http://docs.opencv.org/doc/tutorials/introduction/ios_install/ios_install.html/.
[3] OpenCV4Android. http://opencv.org/platforms/android.html.
[4] OpenStreetMap. https://www.openstreetmap.org/.
[5] I. Apostolopoulos, N. Fallah, E. Folmer, and K. E. Bekris. Feasibility of interactive localization and navigation of people with visual impairments. In 11th IEEE Intelligent Autonomous Systems (IAS-10) Conference, Ottawa, Canada, August 2010.
[6] Georges Baatz, Kevin Köser, and Marc Pollefeys. Leveraging 3D city models for rotation invariant place-of-interest recognition. International Journal of Computer Vision, 96(3):315–334, 2012.
[7] Hernan Badino and Takeo Kanade. A head-wearable short-baseline stereo system for the simultaneous estimation of structure and motion. In IAPR Conference on Machine Vision Applications (MVA), number CMU-RI-TR-, June 2011.
[8] A. Bartoli, P. Sturm, and R. Horaud. Projective structure and motion from two views of a piecewise planar scene. In Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on, volume 1, pages 593–598, 2001.
[9] J. M. Benjamin, N. A. Ali, and A. F. Schepis. A laser cane for the blind. In Proceedings of the San Diego Biomedical Symposium, volume 12, pages 53–57, 1973.
[10] Michael Blösch, Stephan Weiss, Davide Scaramuzza, and Roland Siegwart. Vision based MAV navigation in unknown and unstructured environments. In ICRA, pages 21–28. IEEE, 2010.
[11] J. Y. Bouguet. Camera calibration toolbox for Matlab, 2008.
[12] Adrien Brilhault, Slim Kammoun, Olivier Gutierrez, Philippe Truillet, and Christophe Jouffrais. Fusion of artificial vision and GPS to improve blind pedestrian positioning. In NTMS, pages 1–5. IEEE.
[13] Marcus A. Brubaker, Andreas Geiger, and Raquel Urtasun. Lost! Leveraging the crowd for probabilistic visual self-localization. In CVPR, pages 3057–3064. IEEE, 2013.
[14] Marcus A. Brubaker, Andreas Geiger, and Raquel Urtasun. Lost! Leveraging the crowd for probabilistic visual self-localization. In CVPR, pages 3057–3064. IEEE, 2013.
[15] Vincenzo Caglioti and Simone Gasparini. Uncalibrated visual odometry for ground plane motion without auto-calibration. In VISAPP (Workshop on Robot Vision), pages 107–116, 2007.
[16] Vincenzo Caglioti and Pierluigi Taddei. Planar motion estimation using an uncalibrated general camera.
[17] Javier Civera, Andrew J. Davison, and J. M. M. Montiel. Inverse depth parametrization for monocular SLAM. IEEE Transactions on Robotics, 24(5):932–945, 2008.
[18] Albert M. Cook and Susan Hussey. Assistive Technologies: Principles and Practice (2nd Edition). Mosby, 2 edition, December 2001.
[19] Vuzix Corporation. Wrap 920AR augmented reality eyewear user guide, 2011.
[20] Mark Joseph Cummins and Paul M. Newman. FAB-MAP: Probabilistic localization and mapping in the space of appearance. I. J. Robotic Res., 27(6):647–665, 2008.
[21] A. J. Davison. Real-time simultaneous localisation and mapping with a single camera. In Proc. International Conference on Computer Vision, Nice, October 2003.
[22] Ines Ernst and Heiko Hirschmüller. Mutual information based semi-global stereo matching on the GPU. In George Bebis, Richard D. Boyle, Bahram Parvin, Darko Koracin, Paolo Remagnino, Fatih Murat Porikli, Jörg Peters, James T. Klosowski, Laura L. Arns, Yu Ka Chun, Theresa-Marie Rhyne, and Laura Monroe, editors, ISVC (1), volume 5358 of Lecture Notes in Computer Science, pages 228–239. Springer, 2008.
[23] F. Faber, M. Bennewitz, C. Eppner, A. Gorog, C. Gonsior, D. Joho, M. Schreiber, and S. Behnke. The humanoid museum tour guide Robotinho. In Robot and Human Interactive Communication, 2009. RO-MAN 2009. The 18th IEEE International Symposium on, pages 891–896, Sept 2009.
[24] Lei Fang, Panos J. Antsaklis, Luis Antonio Montestruque, M. Brett McMickell, Michael D. Lemmon, Yashan Sun, Hui Fang, Ioannis Koutroulis, Martin Haenggi, Min Xie, and Xiaojuan Xie. Design of a wireless assisted pedestrian dead reckoning system - the NavMote experience. IEEE T. Instrumentation and Measurement, 54(6):2342–2358, 2005.
[25] Steven Feiner, Blair MacIntyre, and Tobias Höllerer. A touring machine: Prototyping 3D mobile augmented reality systems for exploring the urban environment. pages 74–81, 1997.
[26] Martin A. Fischler and Robert C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 24(6):381–395, June 1981.
[27] Paul Timothy Furgale and Timothy D. Barfoot. Stereo mapping and localization for long-range path following on rough terrain. In ICRA, pages 4410–4416. IEEE, 2010.
[28] Y. Gao, X. Ai, Y. Wang, J. Rarity, and N. Dahnoun. U-V-disparity based obstacle detection with 3D camera and steerable filter. In Intelligent Vehicles Symposium (IV), 2011 IEEE, pages 957–962, 2011.
[29] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3354–3361, 2012.
[30] Andreas Geiger, Julius Ziegler, and Christoph Stiller. StereoScan: Dense 3D reconstruction in real-time. In IEEE Intelligent Vehicles Symposium, Baden-Baden, Germany, June 2011.
[31] Pierre Geurts, Damien Ernst, and Louis Wehenkel. Extremely randomized trees. Machine Learning, 63(1):3–42, 2006.
[32] S. Godha, G. Lachapelle, and M. E. Cannon. Integrated GPS/INS system for pedestrian navigation in a signal degraded environment. pages 2151–2164.
[33] Google. Google Earth. https://www.google.com/earth/.
[34] Google. Google Maps. https://maps.google.com/.
[35] Erico Guizzo. How Google self-driving car works, 2011. http://spectrum.ieee.org/automaton/robotics/artificial-intelligence/how-google-self-driving-car-works.
[36] James Hays and Alexei A. Efros. IM2GPS: estimating geographic information from a single image. In CVPR. IEEE Computer Society, 2008.
[37] M. Hentschel, D. Lecking, and B. Wagner. Deterministic path planning and navigation for an autonomous fork lift truck. In 6th IFAC Symposium on Intelligent Autonomous Vehicles (IAV), Sept 2007.
[38] M. Hentschel and B. Wagner. Autonomous robot navigation based on OpenStreetMap geodata. In Intelligent Transportation Systems (ITSC), 2010 13th International IEEE Conference on, pages 1645–1650, Sept 2010.
[39] M. Hentschel, O. Wulf, and B. Wagner. A GPS and laser-based localization for urban and non-urban outdoor environments. In Intelligent Robots and Systems, 2008. IROS 2008. IEEE/RSJ International Conference on, pages 149–154, Sept 2008.
[40] J. R. Hershey and P. A. Olsen. Approximating the Kullback-Leibler divergence between Gaussian mixture models. In Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, volume 4, pages IV-317–IV-320, April 2007.
[41] A. Hornung, K. M. Wurm, and M. Bennewitz. Humanoid robot localization in complex indoor environments. In Intelligent Robots and Systems (IROS), 2010 IEEE/RSJ International Conference on, pages 1690–1695, Oct 2010.
[42] P. V. C. Hough. Machine Analysis of Bubble Chamber Pictures. In International Conference on High Energy Accelerators and Instrumentation, CERN, 1959.
[43] Z. Hu and K. Uchimura. U-V-disparity: an efficient algorithm for stereovision based scene analysis. In Intelligent Vehicles Symposium, 2005. Proceedings. IEEE, pages 48–54, June 2005.
[44] Albert S. Huang, Abraham Bachrach, Peter Henry, Michael Krainin, Daniel Maturana, Dieter Fox, and Nicholas Roy. Visual odometry and mapping for autonomous flight using an RGB-D camera. In Int. Symposium on Robotics Research (ISRR), Flagstaff, Arizona, USA, Aug. 2011.
[45] Javier Civera, O. Garcia, Andrew J. Davison, and J. M. M. Montiel. Journal of Field Robotics, 27(5):609–631, October 2010.
[46] Rommanee Jirawimut, Piotr Ptasinski, Vanja Garaj, Franjo Cecelja, and Wamadeva Balachandran. A method for dead reckoning parameter correction in pedestrian navigation system. IEEE T. Instrumentation and Measurement, 52(1):209–215, 2003.
[47] M. Kaess, Kai Ni, and F. Dellaert. Flow separation for fast and robust stereo odometry. In Robotics and Automation, 2009. ICRA '09. IEEE International Conference on, pages 3539–3544, May 2009.
[48] Michael Kaess, Hordur Johannsson, Richard Roberts, Viorela Ila, John J. Leonard, and Frank Dellaert. iSAM2: Incremental smoothing and mapping using the Bayes tree. I. J. Robotic Res., 31(2):216–235, 2012.
[49] Qifa Ke and Takeo Kanade. Transforming camera geometry to a virtual downward-looking camera: Robust ego-motion estimation and ground-layer detection. In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR 2003), June 2003.
[50] KIT. The KITTI vision benchmark suite, 2013. http://www.cvlibs.net/datasets/kitti/eval_odometry.php.
[51] B. Kitt, F. Moosmann, and C. Stiller. Moving on to dynamic environments: Visual odometry using feature classification. In Intelligent Robots and Systems (IROS), 2010 IEEE/RSJ International Conference on, pages 5551–5556, Oct. 2010.
[52] Bernd Kitt, Andreas Geiger, and Henning Lategahn. Visual odometry based on stereo image sequences with RANSAC-based outlier rejection scheme. In IEEE Intelligent Vehicles Symposium, San Diego, USA, June 2010.
[53] Bernd Manfred Kitt, Joern Rehder, Andrew D. Chambers, Miriam Schonbein, Henning Lategahn, and Sanjiv Singh. Monocular visual odometry using a planar road model to solve scale ambiguity. In Proc. European Conference on Mobile Robots, September 2011.
[54] Georg Klein and David Murray. Parallel tracking and mapping for small AR workspaces. In Proc. Sixth IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR'07), Nara, Japan, November 2007.
[55] Jan Knopp, Josef Sivic, and Tomáš Pajdla. Avoiding confusing features in place recognition. In Kostas Daniilidis, Petros Maragos, and Nikos Paragios, editors, ECCV (1), volume 6311 of Lecture Notes in Computer Science, pages 748–761. Springer, 2010.
[56] K. Konolige and M. Agrawal. FrameSLAM: From bundle adjustment to real-time visual mapping. Trans. Rob., 24(5):1066–1077, October 2008.
[57] Vladimir A. Kulyukin, Chaitanya Gharpure, John Nicholson, and Grayson Osborne. Robot-assisted wayfinding for the visually impaired in structured indoor environments. Auton. Robots, 21(1):29–41, 2006.
[58] Raphaël Labayrade and Didier Aubert. Robust and fast stereovision based road obstacles detection for driving safety assistance. In MVA, pages 624–627, 2002.
[59] H. Lategahn, T. Graf, C. Hasberg, B. Kitt, and J. Effertz. Mapping in dynamic environments using stereo vision. In Intelligent Vehicles Symposium (IV), 2011 IEEE, pages 150–156, June 2011.
[60] Gim Hee Lee, Friedrich Fraundorfer, and Marc Pollefeys. MAV visual SLAM with plane constraint. In ICRA, pages 3139–3144. IEEE, 2011.
[61] Young Hoon Lee, Tung-Sing Leung, and Gerard Medioni. Real-time staircase detection from a wearable stereo system. In Pattern Recognition (ICPR), 2012 21st International Conference on, pages 3770–3773, Nov. 2012.
[62] Young Hoon Lee and Gérard Medioni. A RGB-D camera based navigation for the visually impaired. In RSS 2011 RGB-D: Advanced Reasoning with Depth Camera Workshop, Los Angeles, USA, June 2011.
[63] Keith Yu Kit Leung, C. M. Clark, and Jan Paul Huissoon. Localization in urban environments by matching ground level video images with an aerial image. In ICRA, pages 551–556. IEEE, 2008.
[64] Tung-Sing Leung and Gerard Medioni. Visual navigation aid for the blind in dynamic environments. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2014.
[65] Tung-Sing Leung and Gerard Medioni. Geolocalization using mobile phone and street grid map in dynamic environment. In IEEE International Conference on Multimedia and Expo, ICME, June 2015. To appear.
[66] Yunpeng Li, Noah Snavely, Dan Huttenlocher, and Pascal Fua. Worldwide pose estimation using 3D point clouds. In Andrew W. Fitzgibbon, Svetlana Lazebnik, Pietro Perona, Yoichi Sato, and Cordelia Schmid, editors, ECCV (1), volume 7572 of Lecture Notes in Computer Science, pages 15–29. Springer, 2012.
[67] Bojian Liang and N. Pears. Visual navigation using planar homographies. In Robotics and Automation, 2002. Proceedings. ICRA '02. IEEE International Conference on, volume 1, pages 205–210, 2002.
[68] Lijun Wei, Cindy Cappelle, Yassine Ruichek, and Frédérick Zann. GPS and stereovision-based visual odometry: Application to urban scene mapping and intelligent vehicle localization. International Journal of Vehicular Technology, 2011.
[69] J. J. Liu, C. Phillips, and K. Daniilidis. Video-based localization without 3D mapping for the visually impaired. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on, pages 23–30, June 2010.
[70] Andras Majdik, Yves Albers-Schoenberg, and Davide Scaramuzza. MAV urban localization from Google Street View data. In IROS, pages 3979–3986. IEEE, 2013.
[71] Bernhard Mayerhofer, Bettina Pressl, and Manfred Wieser. ODILIA - a mobility concept for the visually impaired. In Klaus Miesenberger, Joachim Klaus, Wolfgang L. Zagler, and Arthur I. Karshmer, editors, ICCHP, volume 5105 of Lecture Notes in Computer Science, pages 1109–1116. Springer, 2008.
[72] Christopher Mei, Gabe Sibley, Mark Cummins, Paul M. Newman, and Ian D. Reid. RSLAM: A system for large-scale mapping in constant-time using stereo. International Journal of Computer Vision, 94(2):198–214, 2011.
[73] Isaac Miller, Mark E. Campbell, and Dan Huttenlocher. Map-aided localization in sparse global positioning system environments using vision and particle filtering. J. Field Robotics, 28(5):619–643, 2011.
[74] Michael Montemerlo and Sebastian Thrun. FastSLAM: A Scalable Method for the Simultaneous Localization and Mapping Problem in Robotics (Springer Tracts in Advanced Robotics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2007.
[75] Michael Montemerlo, Sebastian Thrun, Daphne Koller, and Ben Wegbreit. FastSLAM 2.0: An improved particle filtering algorithm for simultaneous localization and mapping that provably converges. In Georg Gottlob and Toby Walsh, editors, IJCAI, pages 1151–1156. Morgan Kaufmann, 2003.
[76] V. N. Murali and J. M. Coughlan. Smartphone-based crosswalk detection and localization for visually impaired pedestrians. In Multimedia and Expo Workshops (ICMEW), 2013 IEEE International Conference on, pages 1–7, July 2013.
[77] Kai Ni and Frank Dellaert. Stereo tracking and three-point/one-point algorithms - a robust approach in visual odometry. In ICIP, pages 2777–2780. IEEE, 2006.
[78] David Nistér, Oleg Naroditsky, and James R. Bergen. Visual odometry for ground vehicle applications. J. Field Robotics, 23(1):3–20, 2006.
[79] Sang Min Oh, Sarah Tariq, Bruce N. Walker, and Frank Dellaert. Map-based priors for localization. In IROS, pages 2179–2184. IEEE, 2004.
[80] Olivier Saurer, Friedrich Fraundorfer, and Marc Pollefeys. Homography based visual odometry with known vertical direction and weak Manhattan world assumption. In Vicomor Workshop at IROS 2012, 2012.
[81] C. F. Olson and A. O. Robinson. Camera-aided human navigation: Advances and challenges. In Emerging Signal Processing Applications (ESPA), 2012 IEEE International Conference on, pages 75–78, Jan. 2012.
[82] World Health Organization. Visual impairment and blindness fact sheet, 2012.
[83] OSM. OpenStreetMap database statistics, 2010. http://wiki.openstreetmap.org/wiki/Stats.
[84] G. Panahandeh, D. Zachariah, and M. Jansson. Exploiting ground plane constraints for visual-inertial navigation. In Position Location and Navigation Symposium (PLANS), 2012 IEEE/ION, pages 527–534, April 2012.
[85] I. Parra, M. A. Sotelo, D. F. Llorca, C. Fernandez, A. Llamazares, N. Hernandez, and I. Garcia. Visual odometry and map fusion for GPS navigation assistance. In Industrial Electronics (ISIE), 2011 IEEE International Symposium on, pages 832–837, June 2011.
[86] Lina M. Paz, P. Pinies, J. D. Tardos, and J. Neira. Large-scale 6-DOF SLAM with stereo-in-hand. Robotics, IEEE Transactions on, 24(5):946–957, Oct 2008.
[87] Helen Petrie, Valerie Johnson, Thomas Strothotte, Andreas Raab, Steffi Fritz, and Rainer Michel. MoBIC: Designing a travel aid for blind and elderly people. Journal of Navigation, 49:45–52, 1996.
[88] Helen Petrie, Valerie Johnson, Thomas Strothotte, Andreas Raab, Steffi Fritz, and Rainer Michel. MoBIC: Designing a travel aid for blind and elderly people. volume 49, pages 45–52, 1996.
[89] V. Pradeep, G. Medioni, and J. Weiland. Robot vision for the visually impaired. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on, pages 15–22, June 2010.
[90] Vivek Pradeep, Gérard G. Medioni, and James Weiland. Visual loop closing using multi-resolution SIFT grids in metric-topological SLAM. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20-25 June 2009, Miami, Florida, USA, pages 1438–1445. IEEE, 2009.
[91] L. Ran, S. Helal, and S. Moore. Drishti: an integrated indoor/outdoor blind navigation system and service. In Pervasive Computing and Communications, 2004. PerCom 2004. Proceedings of the Second IEEE Annual Conference on, pages 23–30, March 2004.
[92] J. G. Rogers, A. J. B. Trevor, C. Nieto-Granda, and H. I. Christensen. SLAM with expectation maximization for moveable object tracking. In Intelligent Robots and Systems (IROS), 2010 IEEE/RSJ International Conference on, pages 2077–2082, Oct. 2010.
[93] Edward Rosten and Tom Drummond. Machine learning for high-speed corner detection. In European Conference on Computer Vision, volume 1, pages 430–443, May 2006.
[94] Eric Royer, Maxime Lhuillier, Michel Dhome, and Thierry Chateau. Localization in urban environments: Monocular vision compared to a differential GPS sensor. In CVPR (2), pages 114–121. IEEE Computer Society, 2005.
[95] Eric Royer, Maxime Lhuillier, Michel Dhome, and Jean-Marc Lavest. Monocular vision for mobile robot localization and autonomous navigation. International Journal of Computer Vision, 74(3):237–260, 2007.
[96] Juan Manuel Saez, Francisco Escolano, and Antonio Penalver. First steps towards stereo-based 6DOF SLAM for the visually impaired. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), page 23, 2005.
[97] Juan Manuel Saez, Francisco Escolano, and Antonio Penalver. First steps towards stereo-based 6DOF SLAM for the visually impaired. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), page 23, 2005.
[98] Juan Manuel Saez Martinez and Francisco Escolano Ruiz. Stereo-based aerial obstacle detection for the visually impaired. In James Coughlan and Roberto Manduchi, editors, Workshop on Computer Vision Applications for the Visually Impaired, Marseille, France, 2008.
[99] Torsten Sattler, Bastian Leibe, and Leif Kobbelt. Fast image-based localization using direct 2D-to-3D matching. In Dimitris N. Metaxas, Long Quan, Alberto Sanfeliu, and Luc J. Van Gool, editors, ICCV, pages 667–674. IEEE, 2011.
[100] Davide Scaramuzza. 1-point-RANSAC structure from motion for vehicle-mounted cameras by exploiting non-holonomic constraints. International Journal of Computer Vision, 95(1):74–85, 2011.
[101] G. Schindler, M. Brown, and R. Szeliski. City-scale location recognition. In Computer Vision and Pattern Recognition, 2007. CVPR '07. IEEE Conference on, pages 1–7, June 2007.
[102] Sinisa Segvic, Anthony Remazeilles, Albert Diosi, and François Chaumette. A mapping and localization framework for scalable appearance-based navigation. Computer Vision and Image Understanding, 113(2):172–187, 2009.
[103] G. Sibley, G. Sukhatme, and L. Matthies. The iterated sigma point Kalman filter with applications to long range stereo. In Proceedings of Robotics: Science and Systems, Philadelphia, USA, August 2006.
[104] C. Stachniss, M. Bennewitz, G. Grisetti, S. Behnke, and W. Burgard. How to learn accurate grid maps with a humanoid. In Robotics and Automation, 2008. ICRA 2008. IEEE International Conference on, pages 3194–3199, May 2008.
[105] H. Strasdat, A. J. Davison, J. M. M. Montiel, and K. Konolige. Double window optimisation for constant time visual SLAM. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2352–2359, Nov. 2011.
[106] Y. Tamura, M. Suzuki, A. Ishii, and Y. Kuroda. Visual odometry with effective feature sampling for untextured outdoor environment. In Intelligent Robots and Systems, 2009. IROS 2009. IEEE/RSJ International Conference on, pages 3492–3497, Oct. 2009.
[107] J.-P. Tardif, Y. Pavlidis, and K. Daniilidis. Monocular visual odometry in urban environments using an omnidirectional camera. In Intelligent Robots and Systems, 2008. IROS 2008. IEEE/RSJ International Conference on, pages 2531–2538, Sept 2008.
[108] S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron, J. Diebel, P. Fong, J. Gale, M. Halpenny, G. Hoffmann, K. Lau, C. Oakley, M. Palatucci, V. Pratt, P. Stang, S. Strohband, C. Dupont, L.-E. Jendrossek, C. Koelen, C. Markey, C. Rummel, J. van Niekerk, E. Jensen, P. Alessandrini, G. Bradski, B. Davies, S. Ettinger, A. Kaehler, A. Nefian, and P. Mahoney. Winning the DARPA Grand Challenge. Journal of Field Robotics, 2006. Accepted for publication.
[109] Geometric Tools. Least squares fitting of data, 2010.
[110] Sylvie Treuillet, Eric Royer, Thierry Chateau, Michel Dhome, and Jean-Marc Lavest. Body mounted vision system for visually impaired outdoor and indoor wayfinding assistance. In Marion A. Hersh and James Ohene-Djan, editors, Proceedings of the Conference and Workshop on Assistive Technologies for People with Vision and Hearing Impairments: Assistive Technology for All Ages (CVHI-2007), Granada, Spain, 28th–31st August 2007, volume 415 of CEUR Workshop Proceedings. CEUR-WS.org, 2007.
[111] Bill Triggs, Philip F. McLauchlan, Richard I. Hartley, and Andrew W. Fitzgibbon. Bundle adjustment - a modern synthesis. In Proceedings of the International Workshop on Vision Algorithms: Theory and Practice, ICCV '99, pages 298–372, 2000.
[112] Emanuele Trucco and Alessandro Verri. Introductory Techniques for 3-D Computer Vision. Prentice Hall PTR, Upper Saddle River, NJ, USA, 1998.
[113] Christopher Urmson, Joshua Anhalt, Hong Bae, J. Andrew (Drew) Bagnell, Christopher R. Baker, Robert E. Bittner, Thomas Brown, M. N. Clark, Michael Darms, Daniel Demitrish, John M. Dolan, David Duggins, David Ferguson, Tugrul Galatali, Christopher M. Geyer, Michele Gittleman, Sam Harbaugh, Martial Hebert, Thomas Howard, Sascha Kolski, Maxim Likhachev, Bakhtiar Litkouhi, Alonzo Kelly, Matthew McNaughton, Nick Miller, Jim Nickolaou, Kevin Peterson, Brian Pilnick, Ragunathan Rajkumar, Paul Rybski, Varsha Sadekar, Bryan Salesky, Young-Woo Seo, Sanjiv Singh, Jarrod M. Snider, Joshua C. Struble, Anthony (Tony) Stentz, Michael Taylor, William (Red) L. Whittaker, Ziv Wolkowicki, Wende Zhang, and Jason Ziglar. Autonomous driving in urban environments: Boss and the Urban Challenge. Journal of Field Robotics Special Issue on the 2007 DARPA Urban Challenge, Part I, 25(8):425–466, June 2008.
[114] G. Vaca-Castano, A. R. Zamir, and M. Shah. City scale geo-spatial trajectory estimation of a moving camera. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 1186–1193, June 2012.
[115] A. Wendel, A. Irschara, and H. Bischof. Natural landmark-based monocular localization for MAVs. In Robotics and Automation (ICRA), 2011 IEEE International Conference on, pages 5792–5799, May 2011.
[116] Brian Patrick Williams, Mark Cummins, José Neira, Paul M. Newman, Ian D. Reid, and Juan D. Tardós. A comparison of loop closing techniques in monocular SLAM. Robotics and Autonomous Systems, 57(12):1188–1197, 2009.
[117] Cang Ye. Navigating a portable robotic device by a 3D imaging sensor. In Sensors, 2010 IEEE, pages 1005–1010, Nov 2010.
[118] Y. Yokokohji, Y. Sugawara, and T. Yoshikawa. Accurate image overlay on video see-through HMDs using vision and accelerometers. In Virtual Reality, 2000. Proceedings. IEEE, pages 247–254, 2000.
[119] John Yong. 'Google Maps' for the blind, 2013.
[120] Chunrong Yuan, I. Schwab, F. Recktenwald, and H. A. Mallot. Detection of moving objects by statistical motion analysis. In Intelligent Robots and Systems (IROS), 2010 IEEE/RSJ International Conference on, pages 1722–1727, Oct. 2010.
[121] D. Yuan and Roberto Manduchi. Dynamic environment exploration using a virtual white cane. In CVPR (1), pages 243–249. IEEE Computer Society, 2005.
[122] Jean-Yves Bouguet. Pyramidal implementation of the Lucas-Kanade feature tracker. Intel Corporation, Microprocessor Research Labs, 2000.
[123] Amir Roshan Zamir and Mubarak Shah. Accurate image localization based on Google Maps Street View. In Kostas Daniilidis, Petros Maragos, and Nikos Paragios, editors, ECCV (4), volume 6314 of Lecture Notes in Computer Science, pages 255–268. Springer, 2010.
Abstract
This thesis proposes a visual navigation aid for the blind. Our goal is to develop a wearable system to help the Visually Impaired (VI) navigate in highly dynamic outdoor environments. The proposed solution uses both visual sensing and an existing map available online. Our work focuses on position estimation, which consists of two parts: Visual Odometry (VO) and localization. We propose different VO methods to compute user motion even in cluttered environments using either a consumer-grade wearable stereo camera or a smartphone. For the case of the stereo camera, instead of computing egomotion from 3D point correspondences in consecutive frames, we propose finding the ground plane, then decomposing the six Degrees of Freedom (6-DoF) egomotion into a motion of the ground plane, and a 3-DoF planar motion on the ground plane. The ground plane is estimated at each frame by analysis of the disparity array. To decelerate the accumulation of VO error, the Inertial Measurement Unit (IMU) readings are used to approximate the ground plane orientation when walking along flat terrain. VO is extended to a monocular system so that the proposed framework is applicable to smartphone platforms with small pocketsize form factors. In addition, mobile devices are more accessible to VI users. ❧ The VO output is the latest camera pose given in a local coordinate system defined with respect to the first camera frame. To localize the user in global coordinates, the proposed system combines the VO output with the semantic information available in an existing map downloaded from a free map service. The Monte Carlo Localization (MCL) framework is used to match the local motion trajectory of the user with the shape of the street network found in the map. The framework allows the system to estimate both the global position and orientation of the user continuously. Our system is validated on real scenarios of hours of walking in both open terrain and urban environment in Downtown Los Angeles. Experiment results show that our method corrects the cumulative drifting error of position estimation computed from either a stereo camera or smartphone. It also localizes users without Global Positioning System (GPS) even in areas with equiangular intersections and equilength streets. ❧ To accelerate the localization process, we model GPS noise with Gaussian distribution and incorporate GPS measurements as prior knowledge to reduce the search area. Experiment results show that GPS readings improve localization speed even when the signal is noisy in an urban area. In the case of temporary loss, which often occurs in dynamic environments, our system improves re-localization speed by initializing the localization locally with the last known user’s location. ❧ We also focus on choosing appropriate sensors for the position estimation. We explore a variety of possible modalities and different combination between wearable stereo camera, IMU, GPS, and integrated sensors in mobile phones. The lightweight wearable stereo camera not only measures depth, but also provides a natural appearance to the VI user. The disadvantage is that it is composed of two low quality image sensors with an extremely short baseline. On the other hand, the smartphone camera captures images with higher resolution, but lacks depth measurement. By analyzing the experimental results, we found that, surprisingly, the monocular device provides lower position error in urban environments compared with the stereo camera, which is only more accurate in open areas.