ROBOT VISION FOR THE VISUALLY IMPAIRED

by Vivek Pradeep

A Dissertation Presented to the FACULTY OF USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (BIOMEDICAL ENGINEERING)

May 2011

Copyright 2011 Vivek Pradeep

Dedication

To my mother, Mrs. Vineeta Pradeep, for all her love, support and encouragement. To my father, Dr. Guru Dayal Pradeep, for always inspiring and pushing me to go for that extra mile. To Mr. Rabindran Arulanandam for instilling in me the love for physics and mathematics. To Dr. Isaac Asimov for the indispensable Three Laws of Robotics.

Acknowledgements

Due apologies to Franz Kafka for appropriating the following lines from an English translation of his short story A Report to an Academy (Ein Bericht für eine Akademie, published Nov., 1917): ‘Nearly five years stand between me and my apehood, a period that may be short in terms of the calendar but is an infinitely long one to gallop through as I have done, accompanied for certain stretches by excellent people, advice, applause and band music...’.

I have been fortunate to have had the support and friendship of a great many people over the entire duration of my research. My advisor, Dr. James Weiland, has always been a source of encouragement and inspiration, and his faith in my abilities enabled me to persevere even through prolonged periods of unproductivity. Countless roadblocks were overcome during my weekly meetings with him, and without his ideas and suggestions, this dissertation would be far from complete. Needless to say, his mentorship has been invaluable and I owe my deepest gratitude to him. I would also like to thank my co-advisor, Dr. Gerard Medioni, for taking me on as his student and teaching me everything I know about computer vision. His enthusiasm for this project and my own personal development as a vision researcher greatly motivated me and his insights were instrumental in developing some of the key ideas proposed in this thesis.

I am also grateful to all my committee members for constantly providing me with new perspectives. Dr. Mark Humayun always challenged me to think about the clinical implications and practical feasibility of my work. My conversations with Dr. Gisele Ragusa, Dr. Norberto Grzywacz and Dr. Manbir Singh helped shape the protocol for human studies. In this regard, I would also like to acknowledge Anita Wright and Judy Hill for generously providing me with facilities and resources at the Braille Institute, Los Angeles to conduct the human studies. A big thank you to Alice Kim, Aminat Adebiyi, Irvin Rabanales, Kelsey Ford, Kesar Varma and Shauna Higgins for helping me out during the experiments. I would be remiss without expressing my appreciation for Carla Pulido, Doris Lee, Eric Bonilla, Mischalgrace Diasanta, Rosie Soltero and Verlinda Hughling for resolving the myriad problems of a graduate student. My thanks also goes to Dr. Jongwoo Lim for mentoring me during and after my internship at the Honda Research Institute, Mountain View and to Ron Buck at Tyzx Inc., for his cooperation in building a custom stereo camera for the project. For the many strong cups of coffee, lunch time conversations and ritualistic procrastinations, I would like to thank my colleagues at the Bioelectronics Research Lab - Dr. Aditi Ray, Dr. Adrien Rowley, Alice Cho, Andrew Weitz, Artin Petrossians, Dr. Ashish Ahuja, Dr. Biju Thomas, Dr. Brooke Basinger, Christian Gutierrez, Devyani Nanduri, Dr. Mathew Behrend, Navya Davuluri, Dr.
Neha Parikh, Dr. Nicholas Sachs, Dr. Rahul Kaliki, Samantha Cunnigham and Tim Nayar. My colleagues at the Insitute for Robotics and Intelligent Systems - Anustup Choudhury, Dr. Chang Yuan, Cheng-Hao Kuo, Elnaz Nouri, Eunyoung Kim, Furqan Khan, Jan Prokaj, Dr. Jongmoo Choi, Dr. Matheen Siddiqui, Dr. Pradeep Natarajan, Pramod Sharma, Prithviraj Banerjee, Dr. Qiang Yu, Dr. Sung Chung, Thang Dinh, Vivek Singh, Dr. Yann Dumortier, Dr. Yuan Li and Dr. YuPing Lin - helped make the task of writing code enjoyable. Manasee and Jeet, thanks a lot for making me feel at home in Los Angeles, with the fabulous dinner spreads! I will always cherish the many memories made over the years with my good friends Ashutosh, Hetal, Pranav. Priyanka and Sabby. I look forward to many more. iv Table of Contents Dedication ii Acknowledgements iii List of Tables ix List of Figures x List of Abbreviations xvii Abstract xix Chapter 1: Introduction 1 1.1 Independent Locomotion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.1.1 Human Navigation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.1.1.1 Role of the Human Visual System in Locomotion. . . . . . . . . . 6 1.1.1.2 Egomotion Estimation . . . . . . . . . . . . . . . . . . . . . . . . 8 1.1.1.3 Spatial Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 1.1.2 Robot Navigation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 1.1.2.1 Sensors for Autonomous Navigation . . . . . . . . . . . . . . . . . 11 1.1.2.2 Vision based Sensors. . . . . . . . . . . . . . . . . . . . . . . . . . 16 1.2 Visual Impairment and Impact on Mobility . . . . . . . . . . . . . . . . . . . . . . 22 1.2.1 Prevalance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 1.2.2 Impact on Mobility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 1.3 Mobility Aids for the Visually Impaired . . . . . . . . . . . . . . . . . . . . . . . . 24 1.3.1 Sonar Based Devices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 1.3.2 Laser and IR Based Devices . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 1.3.3 GPS Based Devices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 1.3.4 Distributed Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 1.3.5 Vision Based Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 1.3.6 Sensory Substitution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 1.4 Scope and Organization of this Dissertation . . . . . . . . . . . . . . . . . . . . . . 31 1.4.1 Specific Aims . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 1.4.2 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 1.4.2.1 Visual Odometry and SLAM . . . . . . . . . . . . . . . . . . . . . 35 1.4.2.2 Obstacle Detection and Traversability Analysis . . . . . . . . . . . 35 1.4.2.3 Motion Prediction and Path Planning . . . . . . . . . . . . . . . . 36 1.4.3 Tactile Cuing and Mobility Experiments . . . . . . . . . . . . . . . . . . . . 36 v Chapter 2: Visual Odometry 38 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 2.1.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 2.1.2 Motion Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 2.1.2.1 The Five Point Algorithm . . . . . . . . . . . . . . . . . . . . . . . 
40 2.1.2.2 The Absolute Orientation Algorithm . . . . . . . . . . . . . . . . . 41 2.1.2.3 The Three Point Algorithm . . . . . . . . . . . . . . . . . . . . . . 42 2.1.2.4 Line SfM Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . 44 2.1.2.5 Hybrid Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 2.1.3 Feature Extraction and Tracking . . . . . . . . . . . . . . . . . . . . . . . . 46 2.1.3.1 Point Feature Detection and Tracking . . . . . . . . . . . . . . . . 46 2.1.3.2 Line Feature Detection and Tracking . . . . . . . . . . . . . . . . 53 2.1.3.3 Noise Characterization . . . . . . . . . . . . . . . . . . . . . . . . 54 2.1.4 Non-linear Refinement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 2.1.4.1 Random Sample Consensus (RANSAC) . . . . . . . . . . . . . . . 56 2.1.4.2 Sparse Bundle Adjustment (SBA) . . . . . . . . . . . . . . . . . . 57 2.2 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 2.2.1 Visual Odometry Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 2.2.1.1 Selection of Feature Trackers . . . . . . . . . . . . . . . . . . . . . 60 2.2.2 Egomotion Using Assorted Features . . . . . . . . . . . . . . . . . . . . . . 62 2.2.2.1 Review of the trifocal relations . . . . . . . . . . . . . . . . . . . . 63 2.2.2.2 Calibrated tensors and linear solution . . . . . . . . . . . . . . . . 64 2.2.2.3 Subspace solution . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 2.2.3 Quaternion-based direct solution . . . . . . . . . . . . . . . . . . . . . . . . 67 2.3 Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 2.3.1 Simulation Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 2.3.1.1 Three-point algorithm with and without SBA . . . . . . . . . . . 70 2.3.1.2 Using point and line features . . . . . . . . . . . . . . . . . . . . . 71 2.3.2 Real World Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 Chapter 3: Simultaneous Localization and Mapping (SLAM) 81 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 3.1.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 3.1.2 Markov Chain and the Role of Uncertainty . . . . . . . . . . . . . . . . . . 85 3.1.3 The Extended Kalman Filter (EKF) . . . . . . . . . . . . . . . . . . . . . . 87 3.1.4 FastSLAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 3.1.4.1 Sampling Importance Resampling (SIR) . . . . . . . . . . . . . . . 90 3.1.4.2 Rao-Blackwellised Particle Filter (RBPF) . . . . . . . . . . . . . . 93 3.1.4.3 Efficient Implementation . . . . . . . . . . . . . . . . . . . . . . . 94 3.2 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 3.2.1 Motion and Observation Functions . . . . . . . . . . . . . . . . . . . . . . . 94 3.2.1.1 Noise Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 3.2.1.2 Proposal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . 97 3.2.1.3 Landmark Update . . . . . . . . . . . . . . . . . . . . . . . . . . 99 3.2.1.4 Importance Resampling . . . . . . . . . . . . . . . . . . . . . . . . 99 3.2.2 Active Matching for Robust Data Association . . . . . . . . . . . . . . . . . 100 3.2.2.1 Landmark Initialization . . . . . . . . . . . . . . . . . . . 
. . . . . 101 3.2.2.2 Projecting Search Windows . . . . . . . . . . . . . . . . . . . . . . 101 3.2.2.3 Template Matching . . . . . . . . . . . . . . . . . . . . . . . . . . 102 3.2.3 Metric-Topological SLAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 vi 3.2.3.1 Local Submap Level . . . . . . . . . . . . . . . . . . . . . . . . . . 105 3.2.3.2 Global, topological Representation . . . . . . . . . . . . . . . . . . 105 3.2.3.3 Keyframes Extraction and Sparse Bundle Adjustment . . . . . . . 106 3.2.4 Visual Loop Closing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 3.2.4.1 Bayesian Framework . . . . . . . . . . . . . . . . . . . . . . . . . . 107 3.2.4.2 Multi-resolution SIFT Grids . . . . . . . . . . . . . . . . . . . . . 110 3.2.4.3 Learning Discriminative Words . . . . . . . . . . . . . . . . . . . . 112 3.3 Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 3.3.1 Active Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 3.3.2 SLAM Maps and Loop Closing . . . . . . . . . . . . . . . . . . . . . . . . . 115 3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 Chapter 4: Obstacle Avoidance and Path Planning 119 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 4.1.1 Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 4.1.1.1 Appearance and Homography Based . . . . . . . . . . . . . . . . . 120 4.1.1.2 Plane Fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 4.1.1.3 Tensor Voting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 4.1.2 Unstructured Obstacle Detection . . . . . . . . . . . . . . . . . . . . . . . . 126 4.2 Approach and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 4.2.1 Voxelized Elevation Map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 4.2.2 Multi-Surface Elevation Map . . . . . . . . . . . . . . . . . . . . . . . . . . 128 4.2.3 Post-processing Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129 4.2.4 Probabilistic Approach for Temporal Consistency . . . . . . . . . . . . . . . 132 4.2.5 Scrolling Region of Interest . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 4.3 Path Planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 4.3.1 Waypoint Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 4.3.2 D* Lite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 Chapter 5: Cuing Modalities and Human Testing 138 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 5.2 Tactile Vest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 5.3 Cuing Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 5.4 Mobility Experiments with Sighted Subjects . . . . . . . . . . . . . . . . . . . . . . 142 5.4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 5.4.2 Evaluation of tactile cues . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 5.4.2.1 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 5.4.2.2 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . 
145 5.4.3 Mobility Experiments: Autonomous steering . . . . . . . . . . . . . . . . . 148 5.4.3.1 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148 5.4.3.2 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . 149 5.5 Mobility Experiments with Visually Impaired Subjects . . . . . . . . . . . . . . . . 150 5.5.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150 5.5.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151 5.5.2.1 Subject Recruitment . . . . . . . . . . . . . . . . . . . . . . . . . . 151 5.5.2.2 Percentage Preferred Walking Speed (PPWS). . . . . . . . . . . . 152 5.5.2.3 Experiment Protocol . . . . . . . . . . . . . . . . . . . . . . . . . 155 5.5.3 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159 5.5.3.1 Number of Collisions . . . . . . . . . . . . . . . . . . . . . . . . . 159 5.5.3.2 Time to Completion . . . . . . . . . . . . . . . . . . . . . . . . . . 161 vii 5.5.3.3 Navigation Trajectories . . . . . . . . . . . . . . . . . . . . . . . . 164 5.5.3.4 Discussion on System Performance . . . . . . . . . . . . . . . . . . 165 5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167 Chapter 6: Conclusion and Future Work 169 6.1 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169 6.2 Directions for Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171 6.2.1 Addressing Failure Modes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171 6.2.2 Incorporating New Features . . . . . . . . . . . . . . . . . . . . . . . . . . . 173 Bibliography 175 viii List of Tables 4.1 Misclassification fractions for the scenes before and after tensor voting. Lower misclassification fraction indicates better performance. Adapted with permission from (Pradeep et al., 2008). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 5.1 Summary of data from subject recruitment. O&M stands for orientation and mo- bility training, RP for retinitis pigmentosa and AMD for age-related macular de- generation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153 5.2 Self-assessment scores for independent mobility. . . . . . . . . . . . . . . . . . . . . 154 ix List of Figures 1.1 (a) A drawing of a section through the human eye; (b) Graph showing rod and cone densities along the horizontal median. Adapted with permission from (Kolb et al.). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.2 Visual input to the brain goes from eye to LGN and then to primary visual cortex, or area V1, which is located in the posterior of the occipital lobe. Adapted with permission from (Kolb et al.). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.3 (a) The geomery of a pinhole camera; (b) sideways view. Adapted with permission from (Hartley and Zisserman, 2000). . . . . . . . . . . . . . . . . . . . . . . . . . . 17 1.4 Epipolargeometry. Cand C ! arethetwocameracenters. Adaptedwithpermission from (Hartley and Zisserman, 2000) . . . . . . . . . . . . . . . . . . . . . . . . . . 20 1.5 (a)Leftcameraand(b)rightcameraviewsofasceneinastereoconfiguration. The offset between two corresponding points (labeled by red dots) is the disparity and inversely proportional to depth. 
(c) shows the disparity map for the entire image, where lighter shades imply larger disparities (and shallower depth with respect to the camera). Adapted with permission from (Hirschmuller and Scharstein, 2007). . 21 1.6 Specific aims addressed in this dissertation: (1) Design and implementation of algorithms to build an obstacle map and extract camera orientation, (2) design and implementation of a human interface to convey navigational assistance to the humanuserand(3)evaluatesystemandanalyzefeasibilitythroughmobilitystudies with human users. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 1.7 Two prototypes of the system: (a) Version 1 with the Bumblebee2 stereo camera manufactured by Point Grey Research; (b) Version 2 with the DeepSea G3 Embed- dedVisionSystem(EVS)soldbyTyzxInc. Thelatterversionallowsanuntethered interface to the camera through a wireless network. The tactile interface in both versions is the same: an array of microvibration motors provides guidance cues. Cuing strategy, however, is different and described in Chapter 5. Adapted with permission from (Pradeep et al., 2010a),©2010 IEEE. . . . . . . . . . . . . . . . . 32 1.8 Block diagram of data flow from stereo camera to the tactile interface. Adapted with permission from (Pradeep et al., 2010b),©2010 IEEE. . . . . . . . . . . . . . 33 x 1.9 Three snapshots of the system’s processing over a navigation sequence. The left column of each sub-image shows the input image acquired by the stereo camera at various time instants (green dots with red lines represent the optic flow for computingcameramotion)andtheadjoiningfiguresontherightare3Drenderings (top-down perspective) of the mapping and path-planning processes. In (a) and (c), far enough way-points are available, and a shortest path is computed from current position. The system will be in Guidance mode, delivering cues to keep the user on this path. However, as shown in (b), the user is free to not follow these cues, and instead turns around to face an obstacle. The most accessible way-point is very close to the current position, and the system switches to Proximity mode. The user could turn around back to (a) or explore a novel area (c), and the system will switch back to Guidance mode as soon as possible. Adapted with permission from (Pradeep at al., 2010b),©2010 IEEE. . . . . . . . . . . . . . . . . . . . . . . 34 2.1 Geometry of a stereo rig.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 2.2 Matching three image figures in (a) and using the corresponding 3D points in (b) to estimate camera motion. Note that the 3D point cloud generated for the two images is not dense over all pixels. This is because of occlusion or low texture. The Bumblebee2 stereo camera was used to generate this data. . . . . . . . . . . . . . 42 2.3 Geometry of the three-point algorithm.. . . . . . . . . . . . . . . . . . . . . . . . . 43 2.4 (a) A 3D line projects to image lines. This 3D line is the intersection of the planes formed by the backprojection of the image lines. (b) Three image lines are required to provide enough constraints for estimating camera matrix. Adapted with permission from (Hartley and Zisserman, 2000). . . . . . . . . . . . . . . . . . 44 2.5 The implemented visual odometry pipeline. . . . . . . . . . . . . . . . . . . . . . . 59 2.6 Tracked features between two frames for various detectors and trackers, with two different intervals between frames. . . . . . . . . . . . . . . . . . . . . . . . . . . . 
61 2.7 (a) Stereo geometry for two views and the point-line-line configuration (b) point- point-point and (c) line-line-line configuration in terms of point-line-line configura- tion. Adapted with permission from (Pradeep et al., 2010c),©2010 IEEE . . . . . 63 2.8 Evaluation of improvement with sparse bundle adjustment. . . . . . . . . . . . . . 71 2.9 Rotation and translation errors comparing performance of linear (red) subspace (green) and quaternion (blue) based formulations of the trifocal minimal solver using arbitrary feature combinations. Adapted with permission from (Pradeep et al.),© 2010 IEEE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 2.10 Rotation and translation errors comparing performance of the 3-point algorithm (red) against trifocal-quaternion solver using 3 (green), 4 (blue) and 5 (black) points. Adapted with permission from (Pradeep et al.),© 2010 IEEE. . . . . . . . 73 2.11 Rotation and translation errors comparing performance of the lines Sfm algorithm (red) against trifocal-quaternion solver using 3 (green), 4 (blue) and 5 (back) lines. Adapted with permission from (Pradeep et al.),© 2010 IEEE. . . . . . . . . . . . 73 xi 2.12 Number of RANSAC trials before an acceptable solution was found for trifocal quaternion-3(red),-4(green)and-5(blue). Adaptedwithpermissionfrom(Pradeep et al.),© 2010 IEEE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 2.13 Camera trajectory generated by the visual odometry pipeline. Red: without SBA, green: after SBA. In (a), the trajectory is overlaid on the floorplan and in (b), a sideways profile is shown. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 2.14 Mean absolute errors in the estimation of camera yaw magnitude for three levels of angular velocity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 2.15 (Top) Lines and points tracked in the system. (Below) Estimated visual odometry superimposed on the floor-plan of an indoor office environment (Red: 3-point al- gorithm, green: proposed algorithm using assorted combinations of point and line features). The 3-point algorithm estimates a path that goes ‘through the wall’ towards the end and incorrectly estimates the camera position beyond the bounds of the room. Adapted with permission from (Pradeep et al.),© 2010 IEEE. . . . 77 2.16 Sequence for two loops around a parking structure (red: 3 point algorithm, green: trifocal quaternion-5). Adapted with permission from (Pradeep et al.), © 2010 IEEE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 2.17 Performance in low-textured, corridor sequence. Adapted with permission from (Pradeep et al.),© 2010 IEEE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 3.1 Formulation of the SLAM problem for a stereo rig. . . . . . . . . . . . . . . . . . . 83 3.2 SLAM as a Dynamic Bayes Network . . . . . . . . . . . . . . . . . . . . . . . . . . 85 3.3 Evolution of particles through the various stages of the SIR filter. . . . . . . . . . . 91 3.4 Schematic representations of the measurement and landmark uncertainties. . . . . 95 3.5 A three image sequence showing active matching (left) and SLAM filtering (right). The red ellipses on the the grayscale images depict the mixture of Gaussian pro- jections for candidate landmarks. These lead to search windows; the green crosses are locations where template matches are found (i.e., observations). 
On the right, the 3D landmark ellipsoids are shown with the stereo camera pose at the bottom. Notice how the ellipsoids have shrunk between frames 1, 2 and 3 for the same land- marks. This is a result of improved EKF estimates due to reobservations. Also, new landmarks have been added in frame 3. . . . . . . . . . . . . . . . . . . . . . 103 3.6 Overview of the metric-topological SLAM algorithm. KF: keyframe. . . . . . . . . 104 3.7 Feature discriminative power. Consider an observation consisting only of feature- words 1, 2 and 3. These words occur with equal probabilities in submap i and j. Hence, both regions will be considered equally probable candidates for loop-closure (priors being the same). However, feature word 4 occurs consistently and exclu- sively in submap j. Since this highly discriminative and stable feature associated with j has not been observed, the confidence over submap j is reduced. Adapted with permission from (Pradeep et al., 2009). © IEEE. . . . . . . . . . . . . . . . 108 xii 3.8 Schematic overview of the loop closure framework. Consider a set of 8 two- dimensionalfeatures(showindifferentcolors)andthreegridsofvaryingresolutions. Associated probabilities for word occurrences in three submaps i, j and k are in- dicated with the corresponding word found in the grid. The radius of the circle along with arrow indicates the weight Υ F p j assigned to the observation. Adapted with permission from (Pradeep et al., 2009),© IEEE. . . . . . . . . . . . . . . . . 111 3.9 Tracking features using active matching. . . . . . . . . . . . . . . . . . . . . . . . . 114 3.10 Indoor experiment result. The non-textured 3D map is projected onto the XY plane. Lower-rightshowsthefloorplanofthetworoomswiththecameratrajectory (red) exhibiting a loop. The highlighted section on the map corresponds to the images (top-right) captured during the beginning and end of loop execution. Loop closure was detected with a probability of 0.9984. Adapted with permission from (Pradeep et al., 2009). © IEEE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 3.11 Top view of the reconstructed 3D map with camera trajectory (red trace) for the outdoor test environment. Selected regions have been magnified for better visu- alization. The flag indicates where loop closure was found. The total trajectory length was 232 meters. Adapted with permission from (Pradeep et al., 2009). © IEEE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 3.12 Similarity probabilities between images in different submaps. Top row corresponds to the detected loop in Figure 3.11. Adapted with permission from (Pradeep et al., 2009),© IEEE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 3.13 Overview of the entire SLAM system. Adapted with permission from (Pradeep at al., 2009),© IEEE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 4.1 Appearance based obstacle detection results. Green pixels correspond to hypothe- sized ground plane pixels. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 4.2 Block diagram of the piecewise planar modeling algorithm. The stereo camera generates a dense 3D point-cloud, to which normals are fit using RANSAC. These normals are globally inconsistent due to local processing per voxel. Tensor voting is employed to refine these normals and clustering is finally performed. Adapted wth permission from (Pradeep et al., 2008). . . . . . . . . . . . . . . . . . . . . . . 
123 4.3 Comparison of results obtained before tensor voting and after. (a) Original 3D point-cloud; (b) input point-cloud manually clustered into horizontal (red) and vertical (blue) planes to generate ground truth; (c) segmentation results after RANSAC only; (d) segmentation results after tensor voting. Adapted with per- mission from (Pradeep et al., 2008). . . . . . . . . . . . . . . . . . . . . . . . . . . 124 4.4 Comparison of results obtained before tensor voting and after. (a) Original 3D point-cloud; (b) input point-cloud manually clustered into horizontal (red) and vertical (blue) planes to generate ground truth; (c) segmentation results after RANSAC only; (d) segmentation results after tensor voting. Adapted with per- mission from (Pradeep et al., 2008). . . . . . . . . . . . . . . . . . . . . . . . . . . 125 xiii 4.5 Final planes segmented from the scenes in (top) Figure 4.3 and (bottom) Figure 4.4. Adapted with permission from (Pradeep et al., 2008). . . . . . . . . . . . . . . 125 4.6 Obstacle detection using the compatible points algorithm. Obstacles are labeled in red. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 4.7 Multi-level surface patch models of 3D point clouds. Red regions show vertial surfaces and green regions indicate horizontal surfaces. The cone represents the current camera orientaion. Adapted with permission from (Pradeep et al., 2010a), © IEEE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 4.8 Intermediate results during post-processing of the elevation grid. Images (b), (c) and (d) are top-down perspectives of the obstacle map.. . . . . . . . . . . . . . . . 131 4.9 Comparison of obstacle maps produced before and after morphological operations for the image in Figure 4.8. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 4.10 (Top) Top-down view of the traversability maps corresponding to different time instants (below) of system operation. Processing is confined within the 5 meter circle around the current location of the user. Blue regions indicate unmapped areas in the subject’s neighborhood, green regions are traversable and red regions are not. The yellow cone shows the current user position and head orientation. Note that as the system starts, the immediate region around the camera is not mapped as it does not appear in the field of view. This information is propagated to the path planner, that ignores the first ‘road-block’ until it enters a mapped region. Adapted with permission from (Pradeep et al., 2010a),© IEEE. . . . . . . 133 4.11 Schematics of waypoint computation. . . . . . . . . . . . . . . . . . . . . . . . . . . 134 4.12 Autonomous updates of waypoint (pink dot). . . . . . . . . . . . . . . . . . . . . . 135 4.13 The obstacle detection and path planning pipeline. . . . . . . . . . . . . . . . . . . 136 5.1 Main components of the tactile vest. . . . . . . . . . . . . . . . . . . . . . . . . . . 139 5.2 System level overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 5.3 Schematic depiction of the cuing triggers for the shoulder motors. . . . . . . . . . . 141 5.4 Tactile vest with the two modes of operation: manual control used for evaluat- ing cuing strategy only and auto control with head-mounted stereo camera for SLAM+obstacle detection. An 8-bit NES remote control was used for the manual interface. Adapted with permission from (Pradeep et al., 2010a),© IEEE. . . . . 
143 5.5 Top down schematic of the obstacle course used in mobility studies. The course length is 12.2 m and the maximum, unrestricted width is 3 m. The grid spacing overlayed is 0.15 m× 0.15 m. Grey areas indicate walls or obstacles placed in the path. Subjects entered on the left side and data was recorded until they reached the exit at the right end.Adapted with permission from (Pradeep et al., 2010a),© IEEE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144 xiv 5.6 Average time to completion of obstacle course for the four groups. Adapted with permission from (Pradeep et al., 2010a),© IEEE. . . . . . . . . . . . . . . . . . . 146 5.7 Plots of incremental learning over trials for Groups 2,3 and 4 with sigmoid fits. The horizontal asymptote indicates learning is achieved between the seventh and eight trials. Adapted with permission from (Pradeep et al., 2010a),© IEEE. . . . 147 5.8 Normalized average time to completion. . . . . . . . . . . . . . . . . . . . . . . . . 147 5.9 Heatmaps displaying the averaged out trajectories for each group. Brighter red signifiesmoretimespendinthatlocation. Adaptedwithpermissionfrom(Pradeep et al., 2010a),© IEEE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148 5.10 Heatmap for autonomous steering experiment. Adapted with permission from (Pradeep et al., 2010a),© IEEE. . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 5.11 Walking speeds of the subjects over 10 trials. The dotted red-box indicates the data points used to calculate the PWS measure for each subject. . . . . . . . . . . 154 5.12 Study protocol. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156 5.13 Front and sideways of the system prototype used in the mobility experiments. . . . 157 5.14 (a) Top down schematic of the obstacle course at the Braille Institute, Los Angeles. (b), (c) and (d) show snapshots of example subjects from each of the three groups in the study. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157 5.15 Map produced by the system compared with the floorplan. . . . . . . . . . . . . . 158 5.16 Comparison of number of collisions per trial for subjects from the three groups before switching. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159 5.17 Collisions before and after switching Group 1 and Group 2. The direction of the arrow indicates which mobility test was performed first for each group. . . . . . . . 160 5.18 Locations of collisions for white cane users of Group 1. The number next to the red blocks indicate the total contacts made summed over all subjects and trials at that face. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160 5.19 PPWS measures per subject. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162 5.20 Average PPWS measures per group. . . . . . . . . . . . . . . . . . . . . . . . . . . 163 5.21 Heatmaps showing trajectories taken by participants from each group. Group 1: caneonly,Group2: camera-vestsystemandGroup3: canepluscamera-vestsystem users. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164 xv 5.22 Two snapshots of the system output during a trial affected by delayed cues. The dashed red box in the top figure shows the region corresponding to which the maps have been diplayed. (Left image) The system has already detected a path to a waypoint to the left (as shown by the white line leading to the pink dot). 
In spite of this, however, the subject keeps walking towards the wall (right image) as no cues have yet been received. 166
5.23 Example failure modes of the system. 166
6.1 Example of depth map produced by the Prime Sensor™ for an untextured surface. 172

List of Abbreviations

AGV Automated Guided Vehicle
AMD Age-related Macular Degeneration
DOF Degrees Of Freedom
DoG Difference of Gaussians
EKF Extended Kalman Filter
ETA Electronic Travel Aid
FAST Features from Accelerated Segment Test
GMM Gaussian Mixture Model
GPS Global Positioning System
INS Inertial Navigation System
IR Infrared
IRB Institutional Review Board
KLT Kanade-Lucas-Tomasi feature tracker
LASER Light Amplification by Stimulated Emission of Radiation
LGN Lateral Geniculate Nucleus
LIDAR Light Detection And Ranging
LM Levenberg-Marquardt minimization
LMedS Least Median of Squares
MEMS Micro Electro-Mechanical Systems
PEP Polynomial Eigenvalue Problem
PPWS Percentage Preferred Walking Speed
PWS Preferred Walking Speed
RANSAC Random Sample Consensus
RBPF Rao-Blackwellised Particle Filter
RFID Radio-frequency Identification
RP Retinitis Pigmentosa
SBA Sparse Bundle Adjustment
SfM Structure from Motion
SIFT Scale Invariant Feature Transform
SIR Sampling Importance Resampling
SLAM Simultaneous Localization and Mapping
SONAR Sound Navigation And Ranging
SURF Speeded Up Robust Features
SVD Singular Value Decomposition

Abstract

Vision is one of the primary sensory modalities for humans that assists in performing several life-sustaining and life-enhancing tasks, including the execution of actions such as obstacle avoidance and path-planning necessary for independent locomotion. Visual impairment has a debilitating impact on such independence, and the visually impaired are often forced to restrict their movements to familiar locations or employ assistive devices such as the white cane. More recently, various electronic travel aids have been proposed that incorporate electronic sensor configurations and the mechanism of sensory substitution to provide relevant information - such as obstacle locations and body position - via audio or tactile cues. By providing higher information bandwidth (compared to the white cane) and at a greater range, it is hypothesized that the independent mobility performance of the visually impaired can be improved. The challenge is to extract and deliver information in a manner that keeps cognitive load at a level suitable for a human user to interpret in real-time.

This dissertation presents a novel mobility aid for the visually impaired that consists of only a pair of cameras as input sensors and a tactile vest to deliver navigation cues. By adopting a head-mounted camera design, the system creates an implicit interaction scheme where scene interpretation is done in a context-driven manner, based on head rotations and body movements of the user. Novel computer vision algorithms are designed and implemented to build a rich 3D map of the environment, estimate the current position and motion of the user and detect obstacles in the vicinity. A multi-threaded and factored simultaneous localization and mapping framework is used to tie all the different software modules together for interpreting the scene accurately and in real-time. The system always maintains a safe path for traversal through the current map, and tactile cues are generated to keep the person on this path, and delivered only when deviations are detected.
With this strategy, the end user only needs to focus on making incremental adjustments to the direction of travel.

This dissertation also presents one of the few computer-vision based mobility aids that have been tested with visually impaired subjects. Standard techniques employed in the assessment of mobility for people with vision loss were used to quantify performance through an obstacle course. Experimental evidence demonstrates that the number of contacts with objects in the path is reduced with the proposed system. Qualitatively, subjects with the device also follow safer paths compared to white cane users in terms of proximity to obstacles. However, members of the former group take longer to complete the course, and this is primarily due to certain inadequacies in the system design that had not been initially anticipated. Solutions for overcoming these problems are discussed in depth towards the end of this thesis. The work presented here makes several novel contributions to the computer vision community and provides insight into the mobility performance of the visually impaired with and without the assistance of a sophisticated travel aid. The research done here can potentially help improve the quality of life of those with severe visual defects and also translate to the development of autonomous robotic systems.

Chapter 1
Introduction

Independent travel through an unfamiliar environment involves several skills, including the incorporation of a rich set of visual, auditory and kinesthetic cues. Significant loss of visual input makes navigation a discouraging prospect for the visually impaired, adversely impacting several life-sustaining and life-enhancing behaviors. Compared to normally sighted people, those with visual dysfunction have to adopt more conservative and inefficient way-finding strategies with the added expense of greater cognitive load. Research in developing algorithms for implementing real-time navigation capabilities on autonomous robotic systems has highlighted the complexity of tasks such as spatial cognition, path-planning and obstacle avoidance, and is an active area (with many problem domains) in the computer vision and robotics communities. While human and robot navigation differ significantly in several aspects, the basic issues are similar (Werner et al., 1997). In spite of the various challenges towards developing such autonomous systems, impressive performance has been demonstrated over the past few years. With the dropping price, size, weight and power profiles of electronic components, it is pertinent to ask the following question: can advances made in robot vision be built upon and translated to a wearable device that substantially improves the quality of life of the visually impaired? This thesis tackles the relevant algorithmic and system design issues and answers the question through rigorous mobility experiments.

The goal of this chapter is to introduce concepts related to independent locomotion and mobility assistive devices for the visually impaired. First, some fundamental issues applicable to both biological and robotic locomotor systems are presented in Section 1.1. Both areas of study are introduced in parallel to elucidate the similarities in desired competencies and approaches between them. Robotics engineers have learnt a lot from how biological systems plan and execute physical movements, and this dissertation (in designing a robotic system to guide humans) goes in the reverse direction.
It is therefore imperative to ascertain whether the current state of the art in robotics is compatible with the requirements for independent human navigation. The role of the visual system in human locomotion is emphasized in Section 1.1.1, and this will also help appreciate the challenges faced by the visually impaired in performing navigational tasks (which is covered in Section 1.2). Research in autonomous navigation for systems/robots is discussed further in Section 1.1.2 and, juxtaposed with the preceding section on human locomotion, will highlight more similarities across the two domains. A brief description of various sensors used in autonomous systems navigation follows in Section 1.1.2.1, which will enable understanding the evolution of technology for assistive devices described in Section 1.3. An awareness of the shortcomings and particular advantages of the different sensors will also justify the choice of vision-based sensors in the system developed as part of this PhD thesis, an overview of which is presented in Section 1.4. The last section also describes how the rest of this dissertation is organized.

1.1 Independent Locomotion

The task of independent navigation requires the locomotor control system to solve various sub-problems. For human navigation, some of these requirements can be summarised as "updating one’s position and orientation during travel, forming and making use of representations of the environment through which travel takes place, and planning routes subject to various constraints (shortest distance, minimal travel time, maximum safety, etc.)" (Loomis et al., 1993). Drawing parallels from the robotics community, autonomous navigation is defined as the process of answering the following three questions (Levitt and Lawton, 1990): (a) "Where am I?"; (b) "Where are other places with respect to me?"; (c) "How do I get to other places from here?". This implies maintaining and updating a single global representation of the environment, from which motor actions may be inferred for driving the robot (Franz and Mallot, 2000).

In both situations, therefore, it becomes necessary to have machinery for computing self-motion as well as a mechanism for representing the surrounding environment. For the latter, it is also essential to have an abstraction scheme to represent only details relevant to the navigation task. For humans, this helps in improving recall and retrieval (Patla, 1997), while for robot implementations, this addresses issues of time and space complexity.

The process of computing self-motion is termed ‘dead-reckoning’ or ‘egomotion estimation’ when applied to either human locomotion (Dusenbery, 1992; Poucet, 1993) or robot navigation. It is hypothesized that the distance and heading directions involved in human motion might be obtained by integrating vestibular signals, with the kinesthetic system also providing estimates of the distance traveled linearly. However, this integration process is prone to corruption by noise in the sensory system and, without any bootstrapping strategy, can operate only over short distances (McNaughton et al., 1991; Loomis et al., 1993). This is also very similar to the problem faced in dead-reckoning navigation systems for autonomous robotic systems, which employ inertial sensors, accelerometers or wheel encoders for maintaining estimates of self-motion. This phenomenon of ‘drift’ (Neira et al., 1999) has been widely reported in systems using a wide variety of sensors, and is also a significant motivating factor towards the development of the algorithms described in this dissertation. It is interesting to note that the rate of error accumulation involved in both biological and artificial systems may be dampened by employing visual input; this will be discussed in more detail in the following subsections (see Sections 1.1.1 and 1.1.2).
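To make the drift problem concrete, the following minimal Python sketch (not part of the dissertation; all parameter values are illustrative assumptions) integrates noisy step-length and heading increments, standing in for vestibular/kinesthetic or wheel-encoder dead reckoning, and reports how the accumulated position error grows with the number of steps.

```python
import numpy as np

rng = np.random.default_rng(0)

def dead_reckon(steps, heading_noise_deg=2.0, step_noise=0.02):
    """Integrate noisy step-length and heading increments into a 2D path.

    The true path is a straight 1 m-per-step walk along x, so any deviation
    of the integrated estimate from (n, 0) is pure dead-reckoning error.
    """
    x = y = heading = 0.0
    true_x = 0.0
    errors = []
    for _ in range(steps):
        step = 1.0 + rng.normal(0.0, step_noise)                    # noisy distance estimate
        heading += np.deg2rad(rng.normal(0.0, heading_noise_deg))   # noisy turn estimate
        x += step * np.cos(heading)
        y += step * np.sin(heading)
        true_x += 1.0
        errors.append(np.hypot(x - true_x, y))                      # position error so far
    return errors

if __name__ == "__main__":
    errs = dead_reckon(500)
    for n in (10, 100, 500):
        print(f"error after {n:4d} steps: {errs[n - 1]:6.2f} m")
```

Because the small heading errors are integrated rather than corrected, the error typically grows super-linearly with distance traveled, which is exactly why an external fix (for example, a visual landmark) is needed to bound it.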
The other issue, pertaining to environment representation in animals and humans, is a topic with many schools of thought. The term ‘cognitive spatial maps’ was first introduced in (Tolman, 1948), where the construction of a map-like (cartographic, Euclidean) representation within the nervous system was hypothesised to explain the maze-solving abilities of rats and, by analogy, humans. While initially ignored, this idea has evolved into a highly dynamic, annotated and abstract mechanism of encoding spatial relations and environmental attributes for interacting within an environment and processing relevant data. In (O’Keefe and Nadel, 1978), the place cell system in the hippocampus is itself believed to be the neural correlate of a cognitive map, encoding a three-dimensional, Euclidean model of the world with rigid geometrical properties. A more recent finding (Sargolini et al., 2006) also attributes the grid cells in the entorhinal cortex to the construction of such maps. While the role of the hippocampus in wayfinding has been accepted by other neuroscientists (Lieblich and Arbib, 1982), the actual structure of this map representation has alternative theories and connotations. Furthermore, cognitive maps have also been associated with the performance of non-spatial tasks. An in-depth review of the progression of research in this area may be found in (Kitchin, 1994). In this dissertation, only specifics relevant to human travel will be mentioned and, in this regard, cognitive maps refer to the mental processing involved in making a geographically structured travel plan (Golledge and Garling, 2004).

When navigating large-scale environments, humans have been observed to predominantly employ the strategy of piloting (Bryne, 1982; Schacter and Nadel, 1991; O’Keefe and Nadel, 1978). This involves identifying landmarks, understanding the distances and directions involved, establishing linked connections between places, and projecting the underlying mental knowledge to the surrounding physical space. The spatial cognitive maps described earlier help accomplish these tasks by providing allocentric information (independent of self location) in the form of combined topological and metric information (Poucet, 1993) about the environment. A topological representation helps in limiting the amount of environment information that needs to be memorized and improves recall. Furthermore, a topological framework provides information about the relative relationships between landmarks and is especially useful when obstacles limit the use of a particular travel path (Patla, 1997). Metric knowledge assists in local navigation and has been claimed to be evidenced by the ability of animals to take short-cuts, though some neuroscientists (Bennett, 1996) are not entirely convinced about this and offer simpler explanations. However, the role of vision is clearly underscored in either explanation and will be delved into in Section 1.1.1.3.

Before moving on, it is relevant to emphasize another point of similarity between robot navigation and human locomotion - this time, with respect to metric-topological mapping. The modern approach to autonomous systems navigation is based on solving what has been termed the problem of ‘Simultaneous Localization and Mapping’ (abbreviated SLAM). Algorithms designed for SLAM exploit onboard sensor input to construct a map of the surroundings and estimate the robot’s position within this map. Modules such as waypoint searching, obstacle avoidance and path-planning are built on top of these estimates, and this approach is also adopted in this work and will be explicitly described over the next few chapters. Recent developments, including publications stemming from this dissertation (Pradeep et al., 2009), have also demonstrated the utility of maintaining a hybrid metric-topological map structure in saving computational and memory costs, as well as improving overall accuracy.
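As a rough illustration of what such a hybrid representation can look like in software, the sketch below (not taken from the dissertation; the class names and the 2D relative-pose simplification are assumptions) keeps each local metric submap as a node of a topological graph and stores only relative poses on the edges, so global consistency is needed only along graph paths rather than in one monolithic coordinate frame.

```python
from dataclasses import dataclass, field

@dataclass
class Submap:
    """A local metric submap: landmark positions expressed in its own frame."""
    submap_id: int
    landmarks: dict = field(default_factory=dict)   # landmark_id -> (x, y, z)

@dataclass
class MetricTopologicalMap:
    """Topological graph whose nodes are metric submaps.

    Edges hold only a relative pose between neighbouring submaps (here a
    simplified 2D offset plus heading change).
    """
    submaps: dict = field(default_factory=dict)      # submap_id -> Submap
    edges: dict = field(default_factory=dict)        # (id_a, id_b) -> (dx, dy, dtheta)

    def add_submap(self, submap_id):
        self.submaps[submap_id] = Submap(submap_id)
        return self.submaps[submap_id]

    def connect(self, id_a, id_b, relative_pose):
        self.edges[(id_a, id_b)] = relative_pose

# usage sketch: two adjoining rooms held as separate metric submaps
world = MetricTopologicalMap()
room_a = world.add_submap(0)
room_a.landmarks["door_corner"] = (2.0, 0.5, 1.1)
room_b = world.add_submap(1)
room_b.landmarks["desk_edge"] = (0.8, 1.2, 0.7)
world.connect(0, 1, (3.5, 0.0, 0.0))   # room B's origin sits 3.5 m ahead of room A's
```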
Before moving on, it is relevant to emphasize another point of similarity between robot naviga- tion and human locomotion - this time, with respect to metric-topological mapping. The modern approach to autonomous systems navigation is based on solving what has been termed as the problem of ‘Simultaneous Localization and Mapping’ (abbreviated SLAM ). Algorithms designed for SLAM exploit onboard sensor input to construct a map of the surroundings and estimate the robot’s position within this map. Modules such as waypoint searching, obstacle avoidance and path-planning are built on top of these estimates - and this approach is also adopted in this work and will be explicitly described over the next few chapters. Recent developments (?, ), includ- ing publications stemming from this dissertation (Pradeep etal., 2009), have also demonstrated the utility of maintaining a hybrid metric-topological map structure in saving computational and memory costs, as well as, improving overall accuracy. 1.1.1 Human Navigation Human navigation is aided by several sensory modalities - kinestheic, vestibular and vision. One of the more critical requirements for safe and effective locomotion is the detection of unexpected ‘perturbations’ for maintaining dynamic stability of the body (i.e. reactive control of balance). However, reactive control strategies are designed to recover from collisions into obstacles and trips, and therefore cannot be employed as a primary means of navigation. The most effective means of ensuring body stability is to proactively avoid such perturbations (Patla, 1997). Vision provides a very poweful machinery for enabling such anticipatory behavior, as information about distant objects can be acquired at the speed of propagation of light waves. Furthermore, humans have to accommodate for surfaces with different geometrical structure (curbs, steps) and material properties (slippery, sandy) while walking. In (Patla etal., 1993), it was demonstrated that foot 5 Figure 1.1: (a) A drawing of a section through the human eye; (b) Graph showing rod and cone densities along the horizontal median. Adapted with permission from (Kolb et al.). placement, body posture and velocity are altered based on visually percieved surface compliance characteristics. Vision is also a more reliable modality for egomotion estimation, as the vestibular system cannot detect constant velocity and requires co-processing with kinesthetic input to dis- ambiguate between body and head motions. That said, visual processing is slower and requires about 50 ms to adjust position, versus the vestibulo-ocular reflex that can do so in 10 ms. 1.1.1.1 Role of the Human Visual System in Locomotion Instead of attempting a comprehensive description of the anatomical and functional features of thehumanvisualsystem, onlythoseaspectsrelevanttolocomotionarehighlightedinthissection. Interested readers may refer to (Kolb etal.) for a more thorough anatomical description. The human visual apparatus is tuned to detect spatio-temporal light patterns over a wide range of intensity and distance variations. The amount of light entering the eye is regulated by the pupil and is focused by the cornea and cyrstalline lens onto the retina. The retinal tissue is a circular disc with a diameter of about 42 mm and a thickness of 0.3 mm (specifically, the human retina is about 72% of a sphere approximately 22mm in diameter). 
Figure 1.2: Visual input to the brain goes from eye to LGN and then to primary visual cortex, or area V1, which is located in the posterior of the occipital lobe. Adapted with permission from (Kolb et al.).

Light-sensitive receptors (rods and cones) are heterogeneously distributed over the retinal surface, and different types of receptors have different sensitivities to light. For instance, the density of cones ranges from 147,000/mm² to 238,000/mm² (Kolb et al.). There are two well-differentiated regions with respect to receptor density on the retina - the central region (macula) corresponding to about 13° of visual field and the peripheral region surrounding it. The macula itself has a distinct region at its center called the fovea that covers approximately 1° of visual field and provides fine and detailed vision for tasks such as reading. The foveal region has a high density of cones that provide high visual acuity in bright light, while the peripheral region predominantly contains rods giving low acuity vision even in dim light (Dusenbery, 1992). Furthermore, the process of light and dark adaptation enables seeing over different ranges of light intensity. The visual field is approximately 200° in the horizontal plane and 110° in the vertical plane (Patla, 1997). The ability to perform coordinated eye and head motions allows scanning the environment beyond this visual field of view without having to reorient the whole body.

The spatio-temporal pattern of light detected by the photoreceptors in the retina is converted into a spatio-temporal pattern of activity (spiking discharge) that is propagated along the optic nerve. These afferent inputs from the retina are processed in parallel by the different visual cortical areas for the extraction of attributes such as color, motion and form. Retinal input is first delivered to the primary visual cortex via the lateral geniculate nucleus and subsequently distributed over to the surrounding cortical areas (Zeki, 1983). There are two major pathways of visual processing from the primary visual cortex - the occipitoparietal (to posterior parietal cortex) and the occipitotemporal (to inferotemporal cortex), serving functions of perception (viewpoint and context invariant recognition of objects and events) and action (visual processing from an egocentric perspective) (Milner and Goodale, 1993). Retinal processing also occurs subcortically in the superior colliculus, which has been demonstrated to play a role in orienting and attention to novel stimuli in the visual field (Stein and Meredith, 1993). The speed and flexibility conferred by such parallel and modular processing of visual input makes the visual system indispensable for controlling whole body movements, such as locomotion. In fact, visually mediated locomotor adaptive strategies are controlled by the corticospinal tract in the motor cortex, while the tectospinal tract descending from the superior colliculus is involved in orienting behavior. As mentioned before, the particular advantage of vision lies in the acquisition of information for making proactive locomotor adjustments. For instance, it has been demonstrated conclusively that exteroceptive information about obstacles is employed in the feedforward control of swing limb trajectory (Patla and Vickers, 1996; Patla et al., 1996). The specific aspects of egomotion estimation and mapping, as introduced in Section 1.1, will be discussed next.

1.1.1.2 Egomotion Estimation

Egomotion estimation or ‘dead reckoning’ is used to retrace the path back to the start position. This is accomplished by computing the distance and direction traveled from the start (Dusenbery, 1992). Since the vestibular and kinesthetic systems are easily corrupted by noise in the sensory system, the errors in self-motion perception can quickly go out of bounds. The visual system helps in correcting such deviations from the desired path by providing visual fixes of environmental landmarks (McNaughton et al., 1991). Apart from landmark recognition, vision also helps in the estimation of ‘optic flow’, which is the pattern of apparent motion of objects in the scene resulting from relative motion between an observer and the scene. Information about self-motion can be extracted from such optic flow. In the human visual system, this is detected primarily by the large peripheral field of view, and several studies have demonstrated its influence on whole body velocity in humans (Konczak, 1994; Warren, 1995; Lackner and DiZio, 1988; Pailhou et al., 1988). The visual system is especially equipped with various reflexive and volitional gaze stabilization mechanisms for reliably estimating optic flow (Barnes, 1993).
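For a sense of the robotic analogue of this cue, the following hedged sketch (not part of the thesis) uses OpenCV's dense Farneback optical flow to extract the dominant image motion between two frames; the median flow vector is a crude proxy for the apparent motion induced by the observer's own movement, assuming a mostly static scene. The file names in the usage comment are hypothetical.

```python
import cv2
import numpy as np

def dominant_flow(prev_bgr, curr_bgr):
    """Estimate the dominant image motion between two frames via dense optic flow.

    Returns the median flow vector in pixels per frame; its sign indicates the
    apparent left/right and up/down drift of the scene relative to the camera.
    """
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    dx = float(np.median(flow[..., 0]))
    dy = float(np.median(flow[..., 1]))
    return dx, dy

# usage sketch (hypothetical frame files from any camera):
# prev, curr = cv2.imread("frame_000.png"), cv2.imread("frame_001.png")
# print(dominant_flow(prev, curr))
```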
1.1.1.3 Spatial Mapping

The role of vision in the formation of spatial cognitive maps is apparent from the earlier discussion that stressed the identification of landmarks and their geometric properties, as well as their relationships to other objects in the environment. In humans, lesion studies (and recently, functional imaging) have been primarily used to identify neural sites for cognitive mapping. For example, acquisition and retention of spatial information is impacted by lesions in the parietal cortex, while frontal cortex injuries affect egocentric spatial information. The hippocampus has been proposed as a site for encoding topological map information, and the parietal cortex (receiving visual and somatosensory input) has been attributed to the metric representation of allocentric space (Poucet). This allocentric spatial information is transformed into spatially directed locomotor movements by the frontal cortex (along with the striatum) (Paillard, 1987). In alternative theories to cognitive mapping (Bennett, 1996), vision continues to play a critical part. One proposed mechanism is the recognition of previously trained landmarks from new viewpoints (mediated by the occipitoparietal pathway) or the association of familiar landmarks near goal points. Visual input, therefore, is necessary not just for local navigation but also for planning routes to destinations that are not visible from the start position.

1.1.2 Robot Navigation

Autonomous navigation for mobile systems or robots is an active area of research, encompassing a broad spectrum of applications and technologies. As pointed out earlier, the goals are very similar to what biological systems have been able to achieve: self-localization, map-building and interpretation, and path-planning. Given the different shapes and sizes of robots (there are about 8.6 million robots around the world - not all of them autonomous or mobile) and the diverse range of intended applications, there are various concepts of what the term navigation implies. ‘Global navigation’ (Dixon and Henlich, 1997) is the ability to move between two points in space (typically, over a large geographical scale) and requires estimation by the robot of self-position in an absolute or relative coordinate system (using GPS devices, for instance).
'Local navigation' is required to complete tasks within an environment and is accomplished by determining position relative to stationary or moving objects and subsequently interacting with them. Finally, 'personal navigation' is akin to proprioception in animals, and implies monitoring of the various parts that make up the robot and anything in contact with it. The problem of human mobility addressed in this thesis scales comparably with that of local navigation, and therefore, topics related to global and personal navigation techniques will be largely omitted from further description.

A fundamental problem in autonomous robot navigation is obtaining a good estimate of self-position. The many different approaches to this may be roughly grouped into two categories, based on the choice of the coordinate system in which position fixing is obtained: relative and absolute (Borenstein et al., 1997). Absolute position fixing implies navigation with respect to a particular, static point in the environment, while relative positioning involves registering position with respect to self-position at some other instant in time. If a map of the environment is available, coordinate transformations can be carried out to obtain fixes with respect to other points for executing large-scale motions. Thus, successful autonomous navigation requires egomotion estimation, as well as the ability to represent spatial maps. Before going any further, it should be pointed out that several aspects of research addressing the navigation of aircraft, spacecraft, automated guided vehicles (AGVs) and even artillery are also applicable to robot navigation (Battin, 1987; Farrell, 1976; Everett, 1995). However, the following discussion will focus only on literature relevant directly to mobile robots. This is justified because sensor systems for robots (which could also reasonably be applied to human wearable devices) must be relatively small, lightweight, low-power and inexpensive.

1.1.2.1 Sensors for Autonomous Navigation

Several kinds of sensor systems exist to complement the many, varied environments and applications for which robotic systems are designed. For autonomous navigation systems, there are primarily two classes of sensors - one for extracting knowledge about self-motion (dead-reckoning/proprioception) and another for acquiring input about the surroundings, such as relative distances between and from objects (range/proximity sensing). For the values returned by the sensors to be useful, the information has to be further processed by a set of algorithms and heuristics. These algorithms will be introduced and described in later chapters, as relevant to this dissertation. In addition, there are several navigation strategies that rely on the deployment of systems (satellites for GPS) or markers (RFID tags and active beacons) in the environment. GPS receivers can provide good accuracy outdoors, to within a few meters (they have also been used in commercially available mobility aids discussed in Section 1.3.3, and are proposed as part of future work in Chapter 6), but do not work indoors nor help in avoiding obstacles. Large-scale installation of markers is not viable for general-purpose mobility applications and will not be explicitly discussed. However, instances where such an approach has been employed for providing navigational assistance to the visually impaired will be mentioned in Section 1.3.4.
A comprehensive survey covering all sensor types and designs is not possible in this thesis (for instance, 17 million variants of rotary encoders are offered by one company (Avolio, 1993) alone!). Only the basic working principles behind the most popular ones will be summarized here. A more detailed exposition may be found in (Everett, 1995).

Proprioceptive Sensors

Odometry/Wheel encoders: Sensors falling under this class measure wheel rotation and/or provide steering orientation. Incremental and absolute optical encoders are the most popular types for mobile robot applications - especially as most such robots are wheeled designs. A coded opaque/transparent pattern is mounted on a rotating intermediate disk attached to the wheel shaft and is used to periodically interrupt a focused beam of light aimed at a matched photodetector. The resolving power of rotation sensors is often limited by the number of holes/patches in the patterned disk. Incremental sensors measure rotational velocity, from which relative positions can be inferred. The more expensive absolute measurement sensors can directly provide the angular position of the wheel. The advantage of the latter approach is that it is easier to recover from the loss of a previous relative position measurement. Odometry units are among the most widely used components in robot navigation systems due to their low cost and high sampling rates. However, used by themselves, they provide good accuracy over short terms only. The equations involved in translating wheel revolutions (Borenstein et al., 1997) into linear displacements relative to the ground are prone to errors due to wheel slippage and other causes. Since incremental motion is integrated over time, this system inevitably leads to the unbounded accumulation of errors. Orientation errors in particular result in lateral position errors that increase proportionally with distance traveled.

Accelerometers: These sensors measure the rate of acceleration and are also classified as 'inertial navigation systems (INS)'. The principle of operation involves sensing minute accelerations along each of the three directional axes; position estimates are obtained by integrating the returned values twice. Conceptually, the acceleration is measured from the displacement of a damped mass on a spring. Piezoelectric, piezoresistive and capacitive components are used for converting mechanical motion into electric signals. Modern accelerometers are often manufactured using micro electro-mechanical systems (MEMS) technology. Similar to wheel encoders, these are also self-contained units and suffer from the same problem of drift - any small constant error grows without bound after integration. In general, accelerometers are poor sensors for mobile robot navigation over extended periods of time and uneven surfaces. One approach to overcome the latter problem proposed incorporating a tilt sensor to cancel the gravity component projections on each axis of the accelerometer (Barshan and Durrant-Whyte, 1995). Despite this, a position drift rate of 1 to 8 cm/s still persisted, which is too high for most applications. Furthermore, the signal-to-noise ratio is very poor at lower accelerations (low-speed turns) (Borenstein et al., 1997).

Gyroscopes: Gyroscopes also fall under the category of INSs and provide the rate of angular motion. This is very useful for providing instant corrections to the aforementioned odometry systems by detecting and compensating for orientation errors.
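To make the dead-reckoning process and its drift concrete, the following minimal sketch (illustrative only and not taken from this dissertation; the unicycle model, the blending weights and the bias value are assumptions) integrates encoder displacements and a gyroscope yaw rate into a 2D pose, and shows how a small uncorrected heading bias produces position error that grows with distance traveled.

```python
import math

def dead_reckon(pose, d_left, d_right, gyro_yaw_rate, dt, wheel_base):
    """Propagate a 2D pose (x, y, theta) by one sampling interval.

    d_left, d_right : incremental wheel displacements from the encoders [m]
    gyro_yaw_rate   : angular rate reported by the gyroscope [rad/s]
    wheel_base      : distance between the two drive wheels [m]
    """
    x, y, theta = pose
    d_center = 0.5 * (d_left + d_right)               # forward displacement
    d_theta_enc = (d_right - d_left) / wheel_base     # heading change from encoders alone
    d_theta_gyro = gyro_yaw_rate * dt                 # heading change from the gyroscope
    # Crude complementary blend; a real system would weight these by their
    # noise characteristics (e.g. with a Kalman filter).
    d_theta = 0.2 * d_theta_enc + 0.8 * d_theta_gyro
    x += d_center * math.cos(theta + 0.5 * d_theta)
    y += d_center * math.sin(theta + 0.5 * d_theta)
    return (x, y, theta + d_theta)

# A constant, uncorrected heading bias makes the integrated position drift
# without bound - the unbounded error accumulation discussed above.
pose = (0.0, 0.0, 0.0)
for _ in range(1000):                                  # 10 s of straight driving at 100 Hz
    pose = dead_reckon(pose, 0.005, 0.005, math.radians(0.5), 0.01, 0.30)
print(pose)
```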
Mechanical gyroscopes operate on the principle of conservation of angular momentum and consist of a spinning wheel mounted in gimbals. The orientation of the wheel remains fixed, irrespective of the motion of the rest of the body, and this is used to monitor the motion of the platform on which it is mounted. Gyroscopes based on other principles are also popular, such as MEMS gyroscopic devices, solid-state ring lasers and fibre optic gyroscopes.

Magnetic Compasses: Magnetic compasses provide measures of absolute heading angle and are often employed to reduce accumulated dead-reckoning errors. These sensors are based on a variety of physical effects related to the Earth's magnetic field, and many different systems are available (fluxgate, Hall-effect and magnetoelastic compasses, to name a few). While rate-integrating gyroscopes can only measure relative angular positions, magnetic compass units provide absolute measurements and do not need to be initially referenced to a known orientation. The Earth's magnetic field, however, is often distorted near power lines and steel structures (Byrne et al., 1992), and therefore it is non-trivial to use these sensors for indoor applications. Also, the sensor coil has to be protected from serious errors induced by the vertical component of the geomagnetic field in scenarios involving uneven terrain.

Proximity Sensors

Contact Sensors: These are one of the most basic sensor types for building autonomous navigation systems and are often incorporated into wheelbase bumpers and robot arms (Pruski, 1993). Different types of switches (mechanical microswitches, mercury switches) are used in designs, and the most commonly implemented mode of operation is to trigger on the rising or falling edge of the square wave that results when the sensor is pressed or released. While cheap and easily available, these sensors are useful only when it is acceptable for the robot to collide with obstacles and recover subsequently. For a human wearable application, as proposed in this dissertation, this is clearly not desirable, as safe navigation implies the use of proactive rather than reactive control strategies.

Ultrasonic/Sonar: Ultrasonic sensors are commonly used for obstacle avoidance, navigation and map-building. The operating principle is the measurement of the time of flight of a high-pitched sound wave emitted by the transmitter (on the sensor) as it strikes an object ahead and returns to the receiver. For instance, the HiTechnic ultrasound sensor emits a short burst of 40 kHz ultrasonic pulses using a piezoelectric transducer. As soon as the pulse is received back, the time-of-flight computation is triggered. Usually, a configuration of differently placed ultrasonic range sensors is employed to provide a wide-field view of the environment. It is important to note that the received pulse is always attenuated with respect to its original strength. In an ideal setting, sound waves propagate as a cone about 30° wide. Power dissipates with distance, and at greater distances obstacles need to be larger to reflect sufficient energy for redetection. Sonars are generally effective for up to 25 feet in indoor environments, while outdoor performance is generally under 10 feet (Dean). Another caveat is that for best performance, the sensor face must be parallel to the surface of the nearest object (Sabatini et al., 1995) and that surface should also be flat, reflective and relatively large.
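A minimal sketch of the time-of-flight principle just described is given below; only the 40 kHz burst and the round-trip nature of the measurement are taken from the text, while the speed-of-sound constant and the example echo time are assumed values.

```python
SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 C (assumed)

def sonar_range(echo_time_s):
    """Convert a round-trip echo time into a range estimate.

    The pulse travels to the obstacle and back, so the one-way distance
    is half of (speed of sound) * (elapsed time).
    """
    return 0.5 * SPEED_OF_SOUND * echo_time_s

# An echo arriving 20 ms after the 40 kHz burst corresponds to an obstacle
# roughly 3.4 m away, well within the ~25 ft indoor range quoted above.
print(sonar_range(0.020))
```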
If the surface is angled, the time-of-flight measurement records the distance to the nearest point that lies within the 30° cone. Specular surfaces (Flynn, 1988) and multiple reflections lead to erroneous estimates of distance.

Active Infrared (IR): IR sensors are cheaper and have faster response times compared to ultrasonic sensors. An important design consideration is to be able to disambiguate between the IR emitted by the sensor and other ambient sources. One solution is to employ frequency modulation to label the emitted light and electronically filter away unwanted signals. A difficulty with IR sensors is that the reflectance properties of object surfaces in the environment have a significant impact on the received signal strength. Some researchers have proposed using both ultrasonic and IR sensors in an integrated system so that they complement each other (Mohammad, 2009; Flynn, 1988). Another design for IR sensors involves projecting the light as a distinct pattern. A camera-based system can then analyze distortions in the projected pattern to infer 3D information about the objects. A similar approach is used in the recently released PrimeSense sensor (PrimeSense, 2006), the reference design of which is used in Microsoft's Kinect (Microsoft) for controller-less gaming. This sensor combines an IR transmitter and receiver and generates depth maps blended with color information. The relevance of this particular sensor for the system developed in this dissertation will be discussed further in Chapter 6.

LASER/LIDAR: A laser rangefinder uses a laser beam for obtaining distance measurements. While the time-of-flight principle is the most commonly used, the speed of light renders it inappropriate for high-precision applications such as 3D modeling. Instead, triangulation or refraction measurement techniques are employed for achieving sub-millimeter accuracy. Laser rangefinders are used in lidar systems to obtain depth data over large regions with very high accuracy and resolution. A typical lidar sensor consists of a laser rangefinder reflected by a rotating mirror for scanning the scene in one sweep. Results with lidar sensors have certainly been impressive in the recent past. The autonomous vehicle Stanley (Thrun et al., 2006), winner of the 2005 DARPA Grand Challenge, used five lidar units (produced by the German company Sick AG (Sick)). In June 2010, a robotic Boeing AH-6 autonomously navigated a flight path that included obstacles, using lidar (Spice, 2010). Compared to vision-based devices, these sensors are more expensive, bulkier and have higher power footprints.

Time-of-flight Cameras: Sensors under this category comprise systems based on several different technologies. While similar in principle to time-of-flight lidar scanners, the advantage here is that the whole scene is captured at the same time (lidars need to sweep the scene to acquire a slice of information). These devices have a range of up to about 60 meters, with a distance resolution of approximately 1 cm. They are also very amenable to real-time applications, as frame rates of 100 Hz can be achieved. Some of the problems, though, are due to multiple reflections and background illumination.

1.1.2.2 Vision based Sensors

Camera systems are one of the most recent entries into the family of sensors used for robot navigation. These sensors project 3D world information onto a 2D plane, and therefore do not preserve depth information.
Since the system described in this dissertation only makes use of cameras to perform range-finding, topics related to the geometry of image formation and how depth might be recovered are discussed next.

The Geometry of Image Formation: The simplest, albeit specialized, camera model is the basic pinhole camera illustrated in Figure 1.3. More generalized models may be found in (Hartley and Zisserman, 2000) and (Faugeras et al., 2001). In the pinhole model, one considers the central projection of points in space onto the image/focal plane identified by $Z = f$, where $f$ is the focal length.

Figure 1.3: (a) The geometry of a pinhole camera; (b) sideways view. Adapted with permission from (Hartley and Zisserman, 2000).

Under these conditions, a point in space with coordinates $\mathbf{X} = (X, Y, Z)^T$ is mapped to the intersection of the image plane with the line joining the point $\mathbf{X}$ to the center of projection. From Figure 1.3(b), applying similar triangles and ignoring the final image coordinate (which is $f$), one gets the following mapping:

\[ (X,\, Y,\, Z)^T \mapsto (fX/Z,\; fY/Z)^T \qquad (1.1) \]

Expression 1.1 describes a mapping from Euclidean 3-space $\mathbb{R}^3$ to Euclidean 2-space $\mathbb{R}^2$. Using homogeneous coordinates, this expression can be conveniently represented using matrix algebra as:

\[ \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} \mapsto \begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} \qquad (1.2) \]

or,

\[ \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} \mapsto \begin{bmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{bmatrix} [\, I \mid \mathbf{0}\,] \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} \qquad (1.3) \]

where $I$ is a $3 \times 3$ identity matrix. It is important to note that the point $\mathbf{X}$ has been expressed in a coordinate system attached to the camera. In general, points in space will have a different Euclidean frame, and the camera coordinate system will be related to this world reference frame by a $3 \times 3$ rotation matrix $R$ and a $3 \times 1$ translation vector $\mathbf{t}$. Taking this coordinate transformation into account, one gets

\[ \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} \mapsto K\,[\, R \mid \mathbf{t}\,] \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} \qquad (1.4) \]

\[ K = \begin{bmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad (1.5) \]

where $K$ is referred to as the internal calibration matrix and $[\, R \mid \mathbf{t}\,]$ is the $3 \times 4$ external calibration matrix. The expressions given above assume that the origin of the image plane coordinate system is at the principal point (the point where the principal ray/optic axis intersects the image plane). Instead, this origin could be at the bottom-left corner of the image, for instance, and might have an offset $(p_x, p_y)$. The pixels can additionally have a skew parameter $s$, and the image coordinates might not have equal scales in both axial directions (non-square pixels). If the number of pixels per unit distance in image coordinates is $m_x$ and $m_y$ along the $x$ and $y$ directions, the internal calibration matrix can be expressed as:

\[ K = \begin{bmatrix} \alpha_x & s & x_0 \\ 0 & \alpha_y & y_0 \\ 0 & 0 & 1 \end{bmatrix} \qquad (1.6) \]

Here, $\alpha_x = f m_x$, $\alpha_y = f m_y$, $x_0 = m_x p_x$, and $y_0 = m_y p_y$. The internal calibration matrix, therefore, has a total of 5 degrees of freedom and, together with the external calibration matrix (6 degrees of freedom - 3 for rotation, 3 for translation), forms the camera calibration matrix

\[ P = K\,[\, R \mid \mathbf{t}\,] \qquad (1.7) \]

which has 11 degrees of freedom (the same number of DOF as a $3 \times 4$ matrix defined up to an arbitrary scale). Thus, a general projective camera will be represented by a homogeneous $3 \times 4$ matrix of rank 3. The rank-3 requirement ensures that the range of the matrix mapping will always be a plane (i.e., a 2D image) and not a line or a point. A general projective camera $P$ maps world points $\mathbf{X}$ to homogeneous image coordinates $\mathbf{x}$ according to

\[ \mathbf{x} = P\mathbf{X} \qquad (1.8) \]

Being a homogeneous system, the above projection is defined only up to a scale factor. The internal calibration parameters can be derived using a known grid pattern, and for future reference it will be assumed that $K$ is already known (i.e., the camera is already calibrated).
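As a concrete illustration of Equations 1.4-1.8, the following minimal sketch (not part of the original text; the focal length, principal point and pose values are made-up assumptions) projects a 3D world point into pixel coordinates via $\mathbf{x} = K[\,R \mid \mathbf{t}\,]\mathbf{X}$ and divides out the homogeneous scale.

```python
import numpy as np

# Assumed internal calibration (Equation 1.6): focal length of 500 pixels,
# zero skew, principal point at (320, 240).
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])

# Assumed external calibration: identity rotation, world origin 2 m in
# front of the camera along the optic axis.
R = np.eye(3)
t = np.array([[0.0], [0.0], [2.0]])
P = K @ np.hstack([R, t])             # Equation 1.7: P = K [R | t]

X = np.array([0.2, -0.1, 3.0, 1.0])   # homogeneous world point (metres)
x = P @ X                             # Equation 1.8: x = P X
u, v = x[0] / x[2], x[1] / x[2]       # divide out the homogeneous scale
print(u, v)                           # pixel coordinates of the projection
```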
Stereo Cameras: A stereo rig exploits the geometry of two calibrated cameras, $P$ and $P'$, for performing 3D depth estimation. Consider a 3D point in space $\mathbf{X}$ that projects onto the two cameras as

\[ \mathbf{x} = P\mathbf{X}, \qquad \mathbf{x}' = P'\mathbf{X} \qquad (1.9) \]

Figure 1.4: Epipolar geometry. C and C' are the two camera centers. Adapted with permission from (Hartley and Zisserman, 2000).

The points $\mathbf{x} \leftrightarrow \mathbf{x}'$ are called corresponding points and, in the idealized case, satisfy a geometrical relationship called the epipolar constraint. This constraint arises from the coplanarity of the two camera centers, the image points $\mathbf{x}$, $\mathbf{x}'$ and the 3D point $\mathbf{X}$ (which together define the epipolar plane), as illustrated in Figure 1.4. Mathematically, this relationship may be expressed as

\[ \mathbf{x}'^{T} F\, \mathbf{x} = 0 \qquad (1.10) \]

where $F$ is a $3 \times 3$ matrix of rank 2, called the fundamental matrix. $F$ can be uniquely determined if $P$ and $P'$ are completely known, which is the case for a calibrated stereo rig. The implication of Equation 1.10 is that the corresponding point in the other image is always located along a line (the epipolar line) specified by the fundamental matrix and the known point. This reduces computational cost, as matching points need not be searched for throughout the whole image. In stereo cameras, the geometry is further simplified by a process called rectification. By applying a projective homography to the images, the epipolar lines are made to align with the horizontal scan lines of the image. Thus, for a point in the left image, the corresponding point in the right image will be located along the same scan line, with an offset.

Figure 1.5: (a) Left camera and (b) right camera views of a scene in a stereo configuration. The offset between two corresponding points (labeled by red dots) is the disparity and is inversely proportional to depth. (c) shows the disparity map for the entire image, where lighter shades imply larger disparities (and shallower depth with respect to the camera). Adapted with permission from (Hirschmuller and Scharstein, 2007).

This offset is called the disparity (see Figure 1.5) and can be used to estimate the 3D point $\mathbf{X} = (X, Y, Z)^T$ directly using the following system of equations, instead of solving the linear system in Equation 1.9:

\[ Z = fB/d, \qquad X = uZ/f, \qquad Y = vZ/f \qquad (1.11) \]

where $u$ and $v$ are the pixel locations in the image, $d$ is the disparity and $B$ is the baseline of the stereo rig (the horizontal separation between the two camera centers). The most difficult step in stereo triangulation is establishing correspondences across the two cameras, and several approaches have been proposed over the years (Scharstein and Szeliski). Incorrect matches lead to erroneous disparities, and for farther objects (which have smaller disparities), the impact of such errors is catastrophic. Furthermore, larger sections of distant regions in the scene project to the same sensor area as do smaller sections of closer objects. These factors, coupled with quantization errors and occlusion, result in 'noisy' 3D maps with anisotropic noise that increases with depth. However, the low cost, passive characteristics, rich information bandwidth (in the form of color images) and power profiles of stereo cameras make it worthwhile to investigate algorithms that can cope with the noise and provide acceptable input data for autonomous navigation. Further characterization of this noise and the relevant algorithms applicable to this thesis will be described in Chapters 2 and 3.
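The following minimal sketch (not from the original text; the focal length, baseline and pixel values are illustrative assumptions) applies Equation 1.11 to a rectified stereo match, and also shows why small disparities are fragile: a one-pixel disparity error on a distant point moves the depth estimate far more than the same error on a nearby one. Note that Equation 1.11 assumes image coordinates measured from the principal point, so the sketch subtracts an assumed principal point explicitly.

```python
def triangulate(u, v, d, f, B, cx, cy):
    """Recover (X, Y, Z) from a rectified stereo match (Equation 1.11).

    u, v   : pixel coordinates in the left image
    d      : disparity in pixels (left x minus right x)
    f, B   : focal length in pixels and baseline in metres
    cx, cy : assumed principal point, so coordinates are centered first
    """
    Z = f * B / d
    X = (u - cx) * Z / f
    Y = (v - cy) * Z / f
    return X, Y, Z

f, B, cx, cy = 500.0, 0.12, 320.0, 240.0
print(triangulate(400, 250, 20.0, f, B, cx, cy))   # nearby point, d = 20 px
# A one-pixel error on a small disparity changes the depth drastically:
print(f * B / 2.0, f * B / 3.0)                    # 30.0 m versus 20.0 m
```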
1.2 Visual Impairment and Impact on Mobility

1.2.1 Prevalence

According to the 2006 National Health Interview Survey, there were 21.2 million Americans over the age of 18 who reported having vision loss (Pleis and Lethbridge-Cejku, 2006; AFB, 2008). The term vision loss here refers to individuals who reported having trouble seeing, in spite of corrective glasses or contact lenses, and also includes those who reported being unable to see at all. Approximately 9 million out of this total were between the ages of 45 and 64 years, and as the population of baby boomers continues to age, the number of seniors (65 years of age or older) with vision loss is expected to grow substantially (6.2 million in 2006). In recent times, combat eye injuries have also increased, and around 368 soldiers have been evacuated for eye-related problems during the last two conflicts involving the United States. From March 2003 to December 2004, 15.8% of all medical evacuations were related to battle eye injuries, and 1,100 eye surgeries have resulted from injury to soldiers in Iraq and Afghanistan. Traumatic Brain Injury (TBI) patients often present symptoms of visual field loss, with some reports suggesting that as many as 39% of patients with TBI also have permanently impaired vision. In addition, age-related macular degeneration (AMD), retinitis pigmentosa (RP) and other vision-impairing diseases continue to afflict veterans and the general public alike.

1.2.2 Impact on Mobility

As can be deduced from the discussion in Section 1.1.1, visual impairment leads to a loss of independence in performing several routine and life-enabling tasks. With advances in several key technology areas, various assistive devices have been proposed over the years that encompass a broad range of functionality, applicability and usability. For instance, an electronic retinal prosthesis (Weiland and Humayun, 2006) is under development to treat retinal degenerative diseases (RP and AMD) by electrically activating surviving cells of the retina to create artificial vision. However, indoor and outdoor mobility continues to be a major challenge, and even for recipients of the prosthesis, only the central 15-20 degrees of visual field would be available due to surgical and engineering limitations. The absence of peripheral vision hampers the ability to detect obstacles and make proactive navigational decisions. Pending the availability of more current information, approximately 109,000 people with vision loss in the US used long canes to get around in 1990 (AFB, 2008; For Health Statistics, 1992; JVIB, 1994) and just over 7,000 used dog guides. It is estimated that annually, only about 1,500 individuals graduate from a dog-guide user program (AFB, 2008; Eames and Eames, 1994; JVIB, 1995). While these navigational aids are currently the most popular, they have limited usability in crowded regions and social situations, and they definitely do not resolve the primary issues of independent mobility and non-reactive navigation (that is, avoiding an obstacle without actually having to touch it). 'Tunnel vision' is often characterized by unaided visual fields of less than 90 degrees, and 'vision mobility impaired' is a term used by and for people with fields of less than 30 degrees. Common causes of peripheral vision impairment are glaucoma and retinitis pigmentosa (RP). In RP, visual field widths typically decrease over a period of 15-20 years to as little as 4 degrees and then, within several more years, total blindness occurs (National Institute for Rehabilitation Engineering, 2003).
Using simulated vision with pixelised images and a restricted field of view, (Chanderli et al., 2006) reported a requirement of a 33×23 degree effective visual field (with 200 pixels) for safe navigation in highly predictable environments. For less predictable environments, at least twice this visual information was required. According to a survey carried out in (Golledge et al., 1997), the tasks of route planning and handling unforeseen obstacles can severely impede the independent travel of sightless individuals and will reduce their willingness to travel, despite having access to the long cane or guide dog. Genensky et al. (Genensky et al., 1979) reported difficulties in detecting small obstacles in the path of travel, uneven walking surfaces and horizontal objects at eye level or lower for a majority of the 94 visually impaired subjects interviewed. An error in self-displacement orientation may lead to accidents, possibly involving other persons, and this risk is further increased when moving and heading through different environments, which normally are unstructured and unpredictable. It has been estimated that more than 30% of the blind population do not ambulate autonomously outdoors (Clark-Carter et al., 1986a). All these factors, therefore, underscore the necessity of a device that aids in detecting obstacles (structured or unstructured) and helps perform wayfinding.

1.3 Mobility Aids for the Visually Impaired

Electronic Travel Aids (ETAs) (Cook and Hussey, 2002) are sophisticated electronic displacement aids used by the visually impaired. The goal is to enhance user confidence for independent travel rather than to replace conventional aids like the cane and guide dog. ETAs are based on various physical principles, and currently no standardized or complete system is available on the market. Most of the approaches are based on the physical principles of ultrasound, laser and vision. All such devices employ three basic components: an element that captures data on environment variables, a system to process this data and, finally, an interface that renders this information in a useful way to the person.

1.3.1 Sonar Based Devices

These devices primarily rely on the reflectance of ultrasonic waves from objects in the vicinity to measure proximity to obstacles. An early example of one such system is (Russel, 1965), which encoded range data into audio signals (see Section 1.3.6). The Mowat Sensor (MOWAT) and the Nottingham Obstacle Detector (NOD) (Bissitt and Heyes, 1980) are two of the oldest commercially available devices in this category. While both of these are hand-held devices, the Binaural Sonic Aid (Sonicguide, (Kay, 1964)) is a wearable system in the form of a pair of spectacle frames, with one ultrasonic wide-beam transmitter mounted between the spectacle lenses and one receiver on each side of the transmitter. The robotics community has taken particular interest in applying technologies from autonomous systems research in this area, and current approaches can typically be grouped into two broad categories: passive systems that only detect and warn, such as the Sonic Pathfinder (Dodds et al., 1984) and the Bat K-Sonar (Laurent and Christian, 2007), or active systems that detect and actually steer the subject around the obstacles (for example, the Guide Cane (Borenstein and Ulrich, 1997)). Devices belonging to the former category require significant training before use, as the lack of contextual information in purely range data limits interpretation of the environment by the accompanying processing algorithms.
Furthermore, the user often has to perform additional measurements when an obstacle is detected in order to determine the object's dimensions. All this requires conscious effort, which also reduces walking speed. The issues with active systems are power consumption, portability, traversability and lack of complete user control.

1.3.2 Laser and IR Based Devices

The operating principle here is very similar to that of ultrasonic devices. However, laser technology provides superior spatial and temporal resolution compared to ultrasound. The C-5 Laser Cane (also known as the Nurion Laser Cane) was introduced in 1973 by Benjamin et al. (Benjamin et al., 1973). It is based on optical triangulation with three laser diodes and three photo-diodes as receivers. The Laser Cane can detect obstacles at head-height, drop-offs in front of the user, and obstacles up to a range of 1.5 m or 3.5 m ahead of the user. Due to the small number of photoreceptors, this device has only limited range-processing capabilities and cannot be used to compute distances in general. The Pilot Light mini-radar and Guideline are hand-held infra-red obstacle detection devices with approximately a 4-foot range. In (Yuan and Manduchi, 2005), an 'environment-exploration' device is proposed that uses a laser pointer and camera to track surfaces for detecting curbs, corners, drop-offs and stairways. This 'Virtual White Cane' also recognizes such environment singularities as the subject scans the scene by waving the device around. In spite of the high accuracy obtained with such sensors, difficulties resolving data from reflections off specular surfaces are encountered, and such an approach is not a practical solution for outdoor navigation, where sunlight often overwhelms infrared and laser sensors.

1.3.3 GPS Based Devices

Since the introduction of the Global Positioning System (GPS) in the late 1980s, there have been several attempts at integrating GPS into a navigation-assistance system for the blind. The proliferation of multimedia-enabled cell phones has furthered the motivation behind such projects, and the Loadstone GPS project (Loadstone), running on Symbian-powered Nokia phones, is a very popular example. For locations where no map data is available (developing or recently industrializing countries), the Loadstone software allows creation, storage and sharing of custom waypoints. The Victor Trekker (HumanWare), launched by Humanware in 2003, is another GPS-powered PDA application that enables a blind person to determine position, create routes and receive information on navigating to a destination. Other examples include Wayfinder Access (Wayfinder), BrailleNote GPS, Mobile Geo (MobileGeo) and MoBIC (Mobility of Blind and Elderly people Interacting with Computers). GPS-based systems provide points-of-interest (POI) and wayfinding information but cannot resolve details at the local level. They do not aid obstacle avoidance or indoor navigation. NAVIG (NAVIG) is an interdisciplinary project that aims to integrate GPS technology with computer vision algorithms for extracting local scene information in addition to global localization. The computer vision techniques developed in this project could likewise be integrated with GPS systems and would enable completely independent navigation in truly large-scale and unfamiliar environments.

1.3.4 Distributed Systems

The mobility aids discussed in the previous two sections are designed to be carried by the user or mounted on the subject's body.
At the other extreme are navigational aids embedded as part of the environment to facilitate access for the visually impaired. These are mostly designed for public places and set up to solve specific problems. Talking Signs (Brabyn et al., 1993), developed at the Smith-Kettlewell Eye Research Institute, Rehabilitation Engineering Research Center in San Francisco, California, is an example of one such system that has been actively deployed. The system consists of permanently installed transmitters that send location-specific audio information to a hand-held receiver that decodes the signal and delivers a voice message through its speaker or headset. The system has been demonstrated to work effectively in both interior and exterior applications and enables a person to orient and localize himself correctly. A similar system is described in (Kulyukin, 2006), consisting of a helper robot that guides a human through an indoor environment using a network of RFID tags for mapping and a sonar sensor for local obstacle avoidance. While these systems perform admirably, they are by design too constrained for general-purpose navigation, and it might not be cost- and time-effective to have them installed at homes and smaller locations.

1.3.5 Vision Based Systems

The availability of cheap cameras and faster processors has encouraged many researchers to investigate computer vision and image processing strategies on information-rich images/videos to build more intelligent mobility aids. The high information bandwidth provided by cameras makes it possible to implement different algorithms for accomplishing diverse tasks within a single system. The vOICe (Meijer), for instance, converts images into sounds and transmits information about the position, size, height and orientation of objects in the scene from a single frame. More interesting systems have come out of multiple-camera configurations that reconstruct 3D information from 2D images. In (Costa et al., 2007), a simple binocular configuration (stereo camera) is used to compute dense scene depth from disparity information on a Digital Signal Processor. This depth information is then conveyed to the user by means of a tactile interface, and the user learns to associate patterns of tactile stimuli with objects. This approach leaves the heavy inference work to the human subject and therefore imposes significant training time and a severe cognitive load. Furthermore, a lot of information is conveyed every frame, and the subject is forced to pay conscious attention to redundant or unnecessary data. (Molton, 1999) describes another stereo vision system integrated with sonar, called ASMONC. The sensors are mounted on the waist and shoulders, and an initial calibration step, performed by standing in an obstacle-free, texture-rich zone, is required. As the user moves, the ground plane is continuously tracked and surface inconsistencies (obstacles or drop-offs) are detected. As the sensors are fixed on the body, the subject has to perform bodily rotations to integrate scene information. Also, this system requires the camera pose to be fixed. Another example is a curb/stairway detection and localization algorithm (Lu and Manduchi, 2005) for use in a navigation assistance system. At the Quality of Life Engineering Research Center (QoLT ERC), Carnegie Mellon University, a novel vision-based wearable assistive device is currently being developed that performs several tasks such as scene geometry estimation and object detection to guide visually impaired people to task-relevant regions.
Some systems have also been devised for other tasks that some visually impaired people might be able to perform, like driving (McCarthy et al., 2008).

In order to collect maximal scene information about the user's vicinity, a head-mounted vision system would be ideal. As the environment is traversed, the subject can scan around, providing the system with wide-field information for computing the optimal traversable path. However, since the camera now undergoes continuous motion in six degrees of freedom (6 DOF), the parameters of the ground plane captured in every frame have to be adjusted for the change in coordinate system. This entails tracking camera motion and registering depth maps from frame to frame. This problem is well known in computer vision as structure from motion (SfM) and in robotics as simultaneous localization and mapping (SLAM). Such a system provides online estimates of the camera trajectory and generates a large-scale 3D map of the environment. In (Saez et al., 2005), a SLAM algorithm is implemented that provides real-time warning of overhead obstacles. That system is the most similar to the approach presented in this dissertation in terms of overall design. However, there are important differences with implications for robustness, speed and additional features. (Koch and Teller, 2008) describes a multi-camera approach to constructing a graph-based topological map of the environment by learning the correspondence between user motion and the evolution of visual features. This system provides only coarse guidance to the user.

1.3.6 Sensory Substitution

Sensory substitution (Arno, 1999) is the mechanism by which characteristics of one sensory modality are transformed into stimuli of another sensory modality. This can potentially restore the ability to perceive a defective sensory modality by using information from a functioning one. For the visually impaired, audio and tactile modalities have primarily been used to convey the information collected and processed by the device. For instance, many assistive devices (such as (Borenstein and Ulrich, 1997; Benjamin et al., 1973; Dodds et al., 1984; Russel, 1965)) encode distance to objects, their location and dimensions by the volume, pitch or frequency of audio signals. Some researchers (Meijer) have also proposed simply encoding pixel values by sound and argued that with training, subjects can learn to recognize distinct patterns associated with objects. Human studies have revealed that, to a certain degree, it is indeed possible to assign audio signatures to objects. However, comprehending such signals can be a time-consuming and arduous process, and visually impaired people prefer to keep their hearing available for other tasks as well. The NavBelt (Shoval et al., 1998) is an ultrasonic assistive aid that translates captured images into a series of (stereophonic) audio cues through which the user can determine which directions are blocked by obstacles and which directions are free for travel. Due to the difficulties in understanding audio cues, test subjects could not travel faster than roughly 0.3 m/sec (1 foot/sec), and even this marginal level of performance required tens of hours of training time. In (Costa et al., 2007), range data was directly encoded into tactile stimuli by means of an array of 27 microvibration motors. This approach also required subjects to recognize and associate patterns with typical objects in the scene (trees, obstacles, and even moving people).
The advantages of using tactile interfaces are that they are less obtrusive and offer more resolution. For instance, in (Jones et al., 2004), a torso-based haptic interface is developed for providing navigational cues to a human. Two kinds of designs are introduced - one using small electric motors and another with contractile shape-memory alloy (SMA) fibers. A 3×3 motor configuration was tested and demonstrated high accuracy in perceiving the direction of activation. The Blind Audio Tactile Mapping System (BATS) project at the University of North Carolina, Chapel Hill is investigating how spatial map information can be effectively delivered to visually impaired people. Tactile displays that convey spatial information for improving the sense of orientation are described in (Nagel et al., 2005; Erp et al., 2006; Veen et al., 2004), while situation awareness devices have been explored in (Lindemann et al., 2005; Tsukada and Yasumura, 2004). The ActiveBelt consists of eight tactors (Tsukada and Yasumura, 2004) that are equally distributed around the user's body; tactors can be activated to point towards directions. In (Erp et al., 2006), it was shown that providing egocentric information about the current orientation through a tactile display can help people counter spatial disorientation. Furthermore, experiments in (Nagel et al., 2005) show that after 6 weeks of wearing a tactile belt, participants demonstrate signs of having the tactile feedback integrated into their normal senses. Participants who embraced this technology reported substantial changes in their perception of the environment.

1.4 Scope and Organization of this Dissertation

1.4.1 Specific Aims

In this research project, a novel, wearable, low-vision aid is proposed that provides navigational assistance to patients with impaired vision deriving from either complete blindness or partial blindness in the periphery. Several different aspects of the design of such a device are considered here and identified as three specific aims (see Figure 1.6).

Figure 1.6: Specific aims addressed in this dissertation: (1) design and implementation of algorithms to build an obstacle map and extract camera orientation, (2) design and implementation of a human interface to convey navigational assistance to the human user and (3) evaluation of the system and analysis of its feasibility through mobility studies with human users.

The first specific aim is to select a sensor that provides high information bandwidth for performing navigational tasks such as obstacle detection and wayfinding and yet is light-weight, low-power and easily integrated into a wearable solution. The incoming data from the sensor has to be processed to extract meaningful information suitable for providing navigational assistance. The design and implementation of such algorithms is a critical part of this dissertation. The second specific aim is to design an interface to provide the user with the computed navigational cues in a manner that does not impose severe
cognitive load and is intuitive to use. Here, a wearable array of microvibration motors is presented that provides tactile cues to alert the user to the presence of obstacles and to guide the user along a computed, safe path. The third specific aim is to assess the feasibility and applicability of the whole system by carrying out extensive mobility experiments with human subjects.

1.4.2 System Overview

The primary components of two different prototypes of the system addressing the first two specific aims are shown in Figure 1.7.

Figure 1.7: Two prototypes of the system: (a) Version 1 with the Bumblebee2 stereo camera manufactured by Point Grey Research; (b) Version 2 with the DeepSea G3 Embedded Vision System (EVS) sold by Tyzx Inc. The latter version allows an untethered interface to the camera through a wireless network. The tactile interface in both versions is the same: an array of microvibration motors provides guidance cues. The cuing strategy, however, is different and is described in Chapter 5. Adapted with permission from (Pradeep et al., 2010a), ©2010 IEEE.

In both versions, a stereo camera generates dense 3D maps of the scene based on the principle described in Section 1.1.2.2. This 3D data is used to perform obstacle detection and compute the shortest and safest path through the environment. However, as the camera system is mounted on the head, it undergoes significant pitch and yaw motions during walking. Furthermore, to guide a human subject through the environment, the system should have knowledge about the location and orientation of the user's head with respect to the obstacle map. The user's motion should be continually tracked (localization) in a global frame of reference, and new obstacles should be registered into this frame (mapping). Another advantage of performing this simultaneous localization and mapping is that the system has memory of obstacles present in the vicinity, even when the camera is not pointed towards them. Given such a global map, it is also possible to provide way-point guidance through an area that has been traversed before. In Figure 1.8, a block diagram showing the flow of data from the stereo camera to the tactile vest is presented.

Figure 1.8: Block diagram of data flow from the stereo camera to the tactile interface. Adapted with permission from (Pradeep et al., 2010b), ©2010 IEEE.

The suite of algorithms performs real-time egomotion estimation by exploiting image optic flow. The camera motion estimates are used to dynamically build an occupancy map, with traversable and untraversable regions labeled. From the current and previous position estimates, the direction of motion being taken by the user is computed. Based on this direction vector and the head orientation, the occupancy map is scanned for the most accessible region and a way-point is established at that coordinate. In version 2, if the computed way-point is in close proximity to the current position, then the system switches to Proximity alert mode, and all
If the way-point is a reasonable distance away, a shortest path is computed leading to it and the system switches to Guidance mode. Now, if the subject deviates from this path (which is computed by locating the intersection between direction of motion and a line segment on the shortest path), corresponding vibration motors (left/right shoulder) are activated for cuing a corrective rotation. Asafailsafe, ifanobstacleispresentincloseproximity, thewaistmotors(right/left)areactivated tostepawayfromit. Thesystemupdatesitsestimateofuserdirectioneveryframe,andtherefore, can switch at any time from Guidance mode to Proximity alert mode (or vice-versa) if the user does not follow the guidance cues and steps too close to obstacles. A sequence demonstrating these behaviors is shown in Figure 1.9. 34 1.4.2.1 Visual Odometry and SLAM Camera motion can be estimated by recursively matching features across two stereo views. Let P t−1 L ,P t−1 R represent the left and right camera matrices before the motion to be computed and P t L ,P t R bethepairafter. Imagefeaturescorrespondingtothesame3Dpointsacross(P t−1 L ,P t−1 R ,P t L ) or (P t−1 L ,P t−1 R ,P t R ) can be used to compute the camera motion using the well-known three-point algorithm (Haralick etal., 1991) in a RANSAC (Fischler and Bolles, 1997) setting. This feature tracking is performed using the Kanade-Lucas-Tomasi (KLT) (Lucas and Kanade, 1981) track- ing algoirithm for estimating optic flow. Once the motion is estimated and corresponding inliers found, sparse bundle adjustment (Sunderhauf etal., 2005a) is applied to get a more consistent solution. The visual odometry pipeline and novel contributions to this aspect of the system are described in Chapter 2. In spite of the numerically and statistically robust techniques applied to computecameramotion, thefinalestimateisstillnoisyduetofeaturemismatches, errorsindepth estimation (described in Section 1.1.2.2) and round-off errors. As this is a recursive pipeline (that is, camera motion estimates are propagated from frame to frame to obtain the trajectory), the noise can grow unbounded leading to inconsistent pose estimates very quickly. In order to delay errors from rapidly growing out of bounds, a Rao-Blackwellised Particle Filter (which implements a Particle Filter in conjunction with Extended Kalman Filters) is employed in a novel way that enables scalability by utilizing a hybrid metric-topological approach. Chapter 3 is devoted to discussing SLAM and the proposed approach in further detail. 1.4.2.2 Obstacle Detection and Traversability Analysis The 3D map around the current position is scanned for vertical and horizontal patches. For computational efficiency, this 3D data is quantized into 30cm×30cm grid cells. If the difference between the minimum and maximum heights of points falling into a cell exceeds a certain thresh- old, the corresponding cell gets labeled as a vertical patch. For each of the remaining cells, its 8-neighbors are scanned and if the variance between heights is low, it gets labeled as a horizontal 35 patch. The next step is to convert this obstacle map into a traversability map. As stereo depth computation is noisy, the grid cells will fluctuate every frame if a simple update strategy is em- ployed. Instead, state information for each of the cells is maintained and tracked, incrementing a confidence variable for every consistent update. Only cells that exceed a threshold are used in the next stage. Chapter 4 provides more description on this approach. 
1.4.2.3 Motion Prediction and Path Planning

Information about the user's intended direction of travel is maintained as the equation of a line joining the previous position to the current position. The previous position is updated only if there is a translational motion greater than a distance threshold, so as to avoid recomputing the way-point based on head rotations, which occur naturally during walking. The heading angle is also updated if the yaw motion exceeds a certain angular threshold. To compute the way-point, the traversability map is searched for the farthest accessible point within a triangular search region, with the current position at the vertex and a line along the direction of travel bisecting the base. Once an accessible way-point is located, the system determines whether to switch to Proximity or Guidance mode. If the latter, a shortest path is computed using the D-Star Lite algorithm (Koenig and Likhachev, 2005). The values of the various thresholds used and the algorithms in this module are further elaborated in Chapter 4.

1.4.3 Tactile Cuing and Mobility Experiments

In Chapter 5, the strategy and control algorithms used to provide tactile guidance for navigation are presented. Results from mobility studies involving sighted, unsighted and visually impaired subjects are also presented in that chapter. These experiments demonstrate the feasibility of the approach presented in this thesis and provide evidence of the usability of the proposed assistive device. At the same time, the results also expose failure modes of the system and shed light on the improvements that need to be made to obtain better performance. A discussion of the weaknesses of the system and proposals for future directions is presented in Chapter 6.

Chapter 2

Visual Odometry

2.1 Introduction

2.1.1 Problem Formulation

Visual odometry is the process of analyzing a sequence of camera images to determine the position and orientation of the moving camera platform (mounted on a mobile robot or carried by a human user). The inertial navigation sensors (accelerometers and gyroscopes) introduced in Section 1.1.2.1 provide good estimates of self-motion only over short distances, due to drift. A small, constant error introduced at one time instant is propagated and grows unbounded over time due to the process of integration. The errors might be due to wheel slippage, motion over non-smooth surfaces and the highly non-linear motion of legged robots or humans. Visual odometry provides more robust estimates, as it requires only image data corresponding to the environment that is being traversed and does not depend on any motion model of the underlying platform or surface. While there are many existing approaches to visual odometry, the majority of them rely on the following processing steps:

• Acquisition of input images at different time instants.
• Extraction of 'interest point' or salient features from the image data.
• Matching of features across image data.
• Filtering out incorrect matches and refining feature matches.
• Initial estimation of camera motion from matched feature sets.
• Camera motion refinement by minimizing a cost function, using the initial estimate to initialize the minimization.

Figure 2.1: Geometry of a stereo rig.

In this dissertation, as mentioned in the previous chapter, a stereo camera is the sensor of choice for acquisition of input image data.
A binocular stereo rig (see Figure 2.1) can be represented in its canonical form, with the left and right camera matrices $P_1 = [\, I \mid \mathbf{0}\,]$ and $P_2 = [\, R_0 \mid \mathbf{t}_0\,]$ respectively. Here, $(R_0, \mathbf{t}_0)$ encodes the rigid geometry of the rig, and is fixed and known a priori. After undergoing an arbitrary rotation $R$ and translation $\mathbf{t}$ due to some unknown motion, the corresponding cameras in the same coordinate system can be written as:

\[ P_3 = [\, R \mid \mathbf{t}\,] \qquad (2.1) \]
\[ P_4 = [\, R_0 R \mid R_0 \mathbf{t} + \mathbf{t}_0\,] \qquad (2.2) \]

These equations can be applied recursively to every pair of stereo views before and after motion, and the goal of the visual odometry algorithm is to estimate $(R, \mathbf{t})$. By concatenating the inter-frame motions, the complete trajectory of the camera may be recovered. Notice that in the equations above, the internal camera calibration matrix $K$ is assumed to be the identity matrix, without loss of generality. In practice, the calibration matrix for a stereo rig is always known a priori from the manufacturer. As a result, for any homogeneous image point $\mathbf{x} = P\mathbf{X}$, one can obtain $\hat{\mathbf{x}} = K^{-1}\mathbf{x}$, thereby eliminating the calibration matrix. $\hat{\mathbf{x}}$ is called the normalized image coordinate corresponding to $\mathbf{x}$, and the camera matrix $K^{-1}P$ is the normalized camera matrix.

2.1.2 Motion Estimation

2.1.2.1 The Five Point Algorithm

Consider the $P_1$ and $P_3$ (left cameras before and after motion) matrices only, and a matched feature pair (expressed in normalized coordinates) $\hat{\mathbf{x}}_1 \leftrightarrow \hat{\mathbf{x}}_3$ such that

\[ \hat{\mathbf{x}}_1 = P_1\mathbf{X}, \qquad \hat{\mathbf{x}}_3 = P_3\mathbf{X} \qquad (2.3) \]

for some 3D point $\mathbf{X}$. In Equation 1.10, the fundamental matrix constraint was presented as the relationship between corresponding features across two views. Since $F$ is a $3 \times 3$ matrix of rank 2, it has only seven degrees of freedom. In the calibrated case, involving normalized image coordinates and normalized camera matrices, this relationship has to be rewritten as

\[ \hat{\mathbf{x}}_3^T E\, \hat{\mathbf{x}}_1 = 0 \qquad (2.4) \]

where

\[ E = K_3^T F K_1 \qquad (2.5) \]

$E$ is called the Essential Matrix and has only 5 degrees of freedom due to additional constraints. Once the essential matrix is known, $R$ and $\mathbf{t}$ can be recovered from the singular value decomposition (SVD) of $E$. The five point algorithm, as proposed in (Nister, 2004b), is an efficient way to perform this estimation using just five point correspondences across two views. This is accomplished by solving a tenth-degree polynomial. Several flavors of algorithms that involve solving polynomial systems exist (Bujnak et al., 2008; Nister, 2004b; Triggs, 1999), and these vary in mathematical structure, assumptions on available camera calibration and even in the minimal number of correspondences required for a feasible solution. Research in this area has also resulted in the development of numerically efficient and stable methods for solving the corresponding polynomial systems. Popular techniques include Groebner bases (Stewénius et al., 2006), the polynomial eigenvalue problem (PEP) (Kukelova et al., 2008b) and the hidden-variable technique (Li and Hartley, 2006).

If two views are sufficient, however, then what might be the purpose behind the four-view system outlined in Equations 2.1 and 2.2? The first reason is that the five point algorithm generates four possible solutions, and an extra view helps to disambiguate the correct solution from the incorrect ones. Secondly, this approach does not recover the scale of the camera motion and is therefore inadequate in applications requiring a Euclidean interpretation of the scene (such as obstacle detection in this dissertation).
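Before turning to how the scale limitation is handled, the SVD-based recovery of $R$ and $\mathbf{t}$ from $E$ mentioned above can be illustrated with a minimal sketch. This follows the standard textbook decomposition rather than any code from this dissertation; the synthetic rotation and translation are made-up values, and in a full pipeline a cheirality (points-in-front-of-both-cameras) test selects the correct candidate among the four.

```python
import numpy as np

def skew(t):
    """Return the 3x3 cross-product matrix [t]_x."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def decompose_essential(E):
    """Return the four (R, t) candidates encoded by an essential matrix."""
    U, _, Vt = np.linalg.svd(E)
    if np.linalg.det(U) < 0:   # enforce proper rotations (det = +1)
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0., -1., 0.], [1., 0., 0.], [0., 0., 1.]])
    R1, R2 = U @ W @ Vt, U @ W.T @ Vt
    t = U[:, 2]                # translation known only up to sign and scale
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]

# Build a synthetic essential matrix E = [t]_x R from a known motion ...
angle = np.radians(10.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.],
                   [np.sin(angle),  np.cos(angle), 0.],
                   [0., 0., 1.]])
t_true = np.array([1.0, 0.2, 0.1])
E = skew(t_true) @ R_true
# ... and check that one of the four candidates matches the true rotation.
print(any(np.allclose(R, R_true) for R, _ in decompose_essential(E)))
```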
A common approach to overcome the scale problem is to initialize the scale of the scene by using a calibration grid (see (Klein and Murray, 2007)). Finally, since the algorithm uses 5 points, it is slower than the 3-point algorithm (discussed next) for reasons that will become apparent in Section 2.1.4.1.

2.1.2.2 The Absolute Orientation Algorithm

Revisiting Section 2.1.1 suggests a very simple algorithm for camera motion estimation. The two stereo pairs $(P_1, P_2)$ and $(P_3, P_4)$ provide two measurements of the same 3D point: before motion ($\mathbf{X}$) and after motion ($\mathbf{X}'$). See Figure 2.2 for an illustration of the approach. Obviously, the coordinate systems of $\mathbf{X}$ and $\mathbf{X}'$ are related by $(R, \mathbf{t})$. If three such point pairs are available ($\mathbf{X}_1 \leftrightarrow \mathbf{X}'_1$, $\mathbf{X}_2 \leftrightarrow \mathbf{X}'_2$, $\mathbf{X}_3 \leftrightarrow \mathbf{X}'_3$), then it is straightforward to compute this coordinate transformation. One method is to find the least-squares solution as given in (Arun et al., 1987). This approach, however, is known to be very unstable (Nister et al., 2004) due to the high uncertainty of 3D triangulation along the depth direction: for the same point viewed in two different stereo frames, there are significant discrepancies along the depth direction, and for small sets of features this can be catastrophic. The next algorithm overcomes this by relying only on image quantities; moving a 3D point along the depth direction does not alter its image projection much, and the uncertainty is thereby compensated for.

Figure 2.2: Matching three image features in (a) and using the corresponding 3D points in (b) to estimate camera motion. Note that the 3D point cloud generated for the two images is not dense over all pixels, because of occlusion or low texture. The Bumblebee2 stereo camera was used to generate this data.

2.1.2.3 The Three Point Algorithm

The geometry exploited in this algorithm is illustrated in Figure 2.3. Consider three 3D points $(\mathbf{X}_1, \mathbf{X}_2, \mathbf{X}_3)$ that have been obtained by triangulation using the pair $(P_1, P_2)$ and that project to image points $(\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3)$ in $P_3$. The algorithm begins by computing the relative distances $(a, b, c)$ between the 3D points. Note that this relative distance computation is more immune to the depth uncertainties that affect the absolute orientation algorithm. Given the pair $\{(a, b, c), (\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3)\}$, the objective is to compute the distances $(s_1, s_2, s_3)$ from the center of camera $P_3$ to the 3D points; from these, the pose of $P_3$ may be inferred with respect to $P_1$.

Figure 2.3: Geometry of the three-point algorithm.

Let $\mathbf{x}_i = (u_i, v_i, f)^T$ be the homogeneous image coordinates, as derived in Equation 1.1, for each $i = 1, 2, 3$, with $f$ being the focal length of the camera. The unit vectors pointing from the center of perspectivity to the 3D points are

\[ \mathbf{j}_i = \frac{1}{\sqrt{u_i^2 + v_i^2 + f^2}} \begin{pmatrix} u_i \\ v_i \\ f \end{pmatrix} \qquad (2.6) \]

Consider the angle $\alpha$ between $s_1$ and $s_3$. Using the dot product, it is given by

\[ \cos\alpha = \mathbf{j}_1 \cdot \mathbf{j}_3 \qquad (2.7) \]

Applying the law of cosines to the triangle $(\mathbf{X}_1, \mathbf{X}_3, P_3)$, whose side opposite $\alpha$ is the known distance between $\mathbf{X}_1$ and $\mathbf{X}_3$, one can write

\[ s_1^2 + s_3^2 - 2\, s_1 s_3 \cos\alpha = \|\mathbf{X}_1 - \mathbf{X}_3\|^2 \qquad (2.8) \]

Equation 2.8 can be reformulated for the remaining two triangles, $(\mathbf{X}_2, \mathbf{X}_3, P_3)$ and $(\mathbf{X}_1, \mathbf{X}_2, P_3)$, giving three quadratic equations in the three unknowns $(s_1, s_2, s_3)$. The final solution is based on solving a fourth-degree polynomial.
(Haralicketal., 1991)outlinesvariousmethodsofsolvingthis system of equations, each differing in the order in which substitutions are made and strategies adopted for change of variables. These subtleties have an impact on the stability of the final solution, and in this dissertation the method of Fischler and Bolles is adopted. 2.1.2.4 Line SfM Algorithms While point features are the most popularly used feature type for visual odometry, some recent workhasbeendoneusingedgesandlines. Theadvantagesofusingsuchastrategywillbediscussed inSection2.1.3.3onnoisecharacterization. Supposetherearetwomatchededgesorlinefeaturesin twoviewsP 1 andP 3 , givenbythehomogeneouscoordinatesl 1 andl 3 , correspondingtoa3Dline L(formedduetotheedgeofadoor,forinstance). AsillustratedinFigure2.4,thebackprojections of the image lines are planesΠ 1 andΠ 3 that intersect atL. However, it is obvious that any two generalplaneswillalwaysintersectinaline. Hence, ifonlytwoimage-linesareavailable, thereare not enough constraints to fix the geometry (they might be two incorrectly matched image edges, whose backprojections still intersect in a 3D line). For estimating camera matrices, therefore, it is imperative that atleast three views be available. The fundamental matrix is inadequte in this situation and the trifocal (described further in Section 2.2.2) and quadrifocal tensors are invoked 44 tosolvetheproblem. Thetrifocaltensorisa3×3×3cubeoperatorthatexpressesthe(projective) geometric constraints between three views independent of scene structure. It can be computed given at least 13 line or 7 point correspondences. Traditionally, line features have been employed in structure from motion algorithms using the multifocal tensor framework (Bartoli and Sturm, 2003;LiuandHuang,1988). (TorrandZisserman,1997)introducesacubicpolynomialsystemfor extracting the tensor from 6 points only and a method to uncover the underlying camera matrices is presented in (Hartley, 1997). The latter also introduces, for the first time, a closed-form linear solution from a combination of point or line features. For a calibrated setting, these matrices can be decomposed to obtain the camera orientation and perform visual odometry. The four-view extension of this concept, called the quadrifocal tensor, was investigated in (Shashua and Wolf, 2000b). In(Comport etal., 2007), the known orientation between the stereo pair is exploited along with the quadrifocal tensor to enforce constraints between image intensities of adjacent stereo pairs. It is evident from all these works, however, that enforcing non-linear dependencies within the tensor indices requires substantial book-keeping (Hartley, 1998; Heyden, 1995) and is ultimately too cumbersome for estimating a 6-dof motion. Another approach, described in (Chandrakeretal., 2009), constructsasmalllow-degreepolynomialsystemandexplicitlyenforces orthonormality constraints. 2.1.2.5 Hybrid Algorithms Someworkhasalsobeendoneusingcombinationofdifferentfeaturetypes. Thisisrelevanttothis dissertation as a novel approach to solve the problem of visual odometry using such an approach is proposed in Section 2.2.2. A unified representation for points, lines and planes by repesent- ing linear features as Gaussian density functions was presented in (Seitz and Anandan, 1999). However, as mentioned by the authors themselves, this representation falls short of being directly applicableforamotion-and-structurerecoveryframeworkduetounresolvedissuesindefiningjoin 45 operations and performing inverse projections. 
Some attempt has been made towards integrat- ing point and line features for the perspective pose estimation problem (Ansar and Daniilidis, 2003; Dornaika and Garcia, 1999). A factorization based multi-frame SfM algorithm also utilizing point and line features is presented in (Oliensis and Werman, 2000). The proposed method, how- ever, is restricted to cases where translation is very small and involves iterative recomputations of rotation and translation parameters. 2.1.3 Feature Extraction and Tracking The algorithms described above require a certain minimal number of feature matches (points or edges) to form and solve the necessary equations. While obtaining numerical solutions to the underlying polynomial systems is non-trivial itself, several algorithms have been proposed that solve specific problems in a stable manner (Kukelova etal., 2008a; Kukelova etal., 2008b; Nister, 2004b). At the same time, the number of features observed, noise-level (in feature localization as well as tracking) and their distribution, all have a major impact on the final motion estimate. Due to errors that might exist in feature matching and for robustness, more than the minimal number of features are detected and matched. Then, the methods described in Sections 2.1.4.1 and 2.1.4.2 are employed to obtain the ‘best’ solution to the matched set. In this section, a summary of the major feature detection and tracking algorithms is provided. During the initial stage of development, all these algorithms were implemented. In Section 2.2.1, the choice and justification for the selected approach in the final system are reported. 2.1.3.1 Point Feature Detection and Tracking Due to their abundance in natural scenes, salient corners in image data have been primarily used asinterestpointsinmostvisualodometrysystems. Whileonecanperformdensetrackingofimage features(wherealmosteverypixelintheimageistrackedtothenextimageinthesequence), only sparse feature tracking (where a relatively small set of 100-150 features are detected and tracked) 46 isrelevanttothevisualodometrysystemdescribedinthechapter. Sparsefeaturetrackingismore stable, requireslesscomputation, andisadequateforcameramotionestimation. Furthermore, for the three point algorithm, 3D data is required corresponding to each feature. Stereo triangulation does not provide 3D information for every pixel due to lack of texture or occlusion in parts of the image (see Figure 2.2(b)). Thus, dense feature tracking would be a waste of resources. Optic Flow Based Tracking Optical flow measures the apparent motion of the image bright- ness pattern. It produces a vector field that can be used to track pixels from one frame to the next. This motion field is determined from the spatial and temporal variations of the image brightness. Consider I as the ‘brightness function’ of an image, that is a function of both the spatial coordinates of the image plane, x and y, and of time t. Formally, I = I(x,y,t). The intuitive meaning is that given a sequence of images, I at an image coordinate returns the pixel magnitude for a particular image. The first step in estimating optic flow is to formulate what is known as the ‘brightness constancy equation’. It is reasonable to assume that (under normal conditions) the apparent brightness of moving objects remains constant. In other words, a pixel that ‘moves’ from one location at time t to another at t+δt should have the same pixel value. 
This can be written as dI(x,y,t) dt =0 (2.9) or, using the chain rule of differentiation, ∂I ∂x dx dt + ∂I ∂y dy dt + ∂I ∂t =0 (2.10) Intheaboveequation, thepartialspatialderivativesofI aresimplythetwocomponentsofthe spatialimagegradient∇I. Themotionfieldv, whichistheobjectofinteresthere, iscomprisedof the temporal derivatives ( dx dt , dy dt ). The partial difference with respect to time can be computed by 47 subtracting the two images att andt+δt . The brightness constancy equations suffers from what isknownasthe‘apertureproblem’: onlythatcomponentofthemotionfieldcanberesolvedwhich is normal to the direction of motion. For instance, if a pixel lies on a line and motion is along the line, then the correct v cannot be obtained. Another way to see this is that Equation 2.10 has two unknowns and cannot be solved with just one equation (from one pixel). Furthermore, as listed in (Trucco and Verri, 1988), the brightness constancy equation is really valid only at points with very high spatial gradients or if the underlying motion is purely translational or such that the illumination direction is parallel to the angular velocity of the rigid motion. One well-known method to get around the aperture problem, and which is followed in this dissertation,istheLucas-KanadeOpticalFlowalgorithm(LucasandKanade,1981). Theassump- tion is that for very small inter-frame motion (such as two adjacent frames for an image stream captured at 30 Hz), the displacement between image contents is very small and approximately constantwithintheneighborhoodofapixel. Thus, tocomputetheopticalflow (v x ,v y )=( dx dt , dy dt ) for a pixel p, one considers a window of n pixels around it {q i } n i=1 . The constancy equation can then be applied to all these pixels I x (q 1 )v x +I y (q 1 )=−I t (q 1 ) I x (q 2 )v x +I y (q 2 )=−I t (q 2 ) ! ! ! I x (q n )v x +I y (q n )=−I t (q n ) (2.11) I x (q i ),I y (q i ) and I t (q i ) are the spatial and partial derivatives of the image computed at each pixel in the window. The above can be rewritten as a linear system of equations and since the system has more equations than unknowns, a weighted least-squares solution is found. The solution is given by 48 v x v y = . i I x (q i ) 2 . i I x (q i )I y (q i ) . i I x (q i )I y (q i ) . i I y (q i ) 2 −1 − . i I x (q i )I t (q i ) − . i I y (q i )I t (q i ) (2.12) A critical decision is the size of the window : a small window size is preferable to avoid smoothing out details in the images, while a large window size helps to handle large motions and adds robustness to noise. Thus, there is tradeoff between local accuracy and robustness. This problem is solved by a pyramidal implementation of the Lucas-Kanade tracker (Bouguet, 2000). Image pyramids consisting of filtered and sampled versions of the original images are formed and the displacement vectors between the feature locations are found by iteratively maximizing a correlation measure over a small window, starting from the coarsest level and ending at the finest (original resolution) level. As mentioned before, it is important to consider points only at which the spatial gradients are high. This is known as ‘interest point’ selection, and two approaches that were implemented are described: Multi-Scale Harris Corners : Note the following matrix that occurs in Equation 2.12: A = . i I 2 x . i I x I y . i I x I y . i I 2 y (2.13) • This matrix is referred to as the structure tensor of the image at the particular pixel about which it is computed. 
The contributions of the pixels in the window may be weighed with a two-dimensionalGaussian whosemeanisalignedwiththecenterpixel. Thismatrixencodes the distribution of the image gradient within the window. One obtains two eigen-values λ 1 and λ 2 , and for an interest-point, both of them must have large positive values (Harris and Stephens, 1988). In an actual implementation, however, finding eigen-values can be quite slow and instead, only the determinant and trace are evaluated. The optimum window-size 49 is again a trade-off and is circumvented by utilizing a multi-scale approach as described for the Lucas-Kanade tracker. FAST Corners : The acronym FAST stands for Features from Accelerated Segment Test (Rosten and Drummond, 2005). The approach in this method is to consider a Bresenham circle of radius r around the candidate pixel. For r=3, this corresponds to a circle of 16 pixels. If the intensitiesofncontiguouspixels(n = 12forr=3)areallaboveorbelowtheintensityofthecenter pixel by some threshold t, then a feature is detected. This algorithm can be highly optimized by only examining pixels 1, 9, 5 and 13 (for the r=3 case) to quickly reject bad candidates. Along with other optimization, this makes FAST corners as one of the fastest feature detectors available and has been extensively employed in many state of the art visual odometry systems (Klein and Murray, 2007). Descriptor Based Tracking Optic flow tracking is feasible only for image sequences where the inter-frame motion is small. The brightness constancy equation cannot be applied to a pair of images captured from widely separated viewpoints (or when the object has moved significantly). The first reason behind this is that pixels belonging to the same physical object can have very different brightness values when imaged from two highly disparate locations (change of scale, for example). This could be because of affine deformation or illumination. Secondly, the window size will have to be increased substantially, implying the assignment of the same motion field to a very large number of pixels. Thus, for matching images in such conditions, there is a requirement for a technique that is robust to changes in illumination, scale and affine deformations. The following class of detectors not only extract stable interest points from images, but also compute a ‘descriptor’ for each feature. The term descriptor here refers to a unique identifier/signature assignedtoapixelthatmaybeusedwhensearchingforamatchinadifferentimage. Asuccessful match is found by favorably comparing two sets of descriptors. 50 SIFT Features : SIFT is an acronym standing for Scale-Invariant Feature Transform. The algorithm was first published in (Lowe, 1999). The first step here is the detection of highy salient and stable interest points. The original image I is convolved with Gaussian filters at different scales to build an image pyramid. Subsequently, the difference of successive Gussian-blurred images are taken to build a Difference of Gaussians (DoG) pyramid. If G(x,y,kσ) represents a Gaussian filter with a scale that is some integer k times an initial σ, the DoG image D(x,y,σ) is given by D(x,y,σ)=G(x,y,k i σ)⊗I(x,y)−G(x,y,k j σ)⊗I(x,y) (2.14) where ⊗ represents the convolution operation. The value of the standard deviations for the Gaussian filters is so chosen such that there are a fixed number of convolved images per octave (an octave corresponds to doubling σ). 
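As a concrete illustration of the pyramid construction described so far, the sketch below builds one octave of Gaussian-blurred images and their pairwise differences (it assumes SciPy's Gaussian filter; the parameter values are illustrative defaults, not those of the original implementation).

import numpy as np
from scipy.ndimage import gaussian_filter

def dog_octave(image, sigma0=1.6, scales_per_octave=3):
    """One octave of the Difference-of-Gaussians (DoG) pyramid.

    Successive blur levels are sigma0 * k**i with k = 2**(1/scales_per_octave),
    so the blur doubles over the octave; a few extra levels are kept so that
    scale-space extrema can later be searched at every DoG level.
    """
    k = 2.0 ** (1.0 / scales_per_octave)
    blurred = [gaussian_filter(image.astype(np.float32), sigma=sigma0 * k**i)
               for i in range(scales_per_octave + 3)]
    dog = [blurred[i + 1] - blurred[i] for i in range(len(blurred) - 1)]
    return blurred, dog

The next octave is typically started from a downsampled copy of the image blurred at twice the initial sigma.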
From the DoG images, interest points are identified as the local extrema (minima/maxima) across scales. Each pixel in the DoG images is compared to its eight neighbors at the same scale and nine pixels in each of the adjacent scales. This process results in a set of candidate interest points. From this set, more localized features are selected by discarding those pixels with low contrast and poor localization (lying on an edge). The DoG pyramid computation is an appoximation of the Laplacian of Gaussians (LoG) and selecting only extrema in such a scale-space is what confers the scale-invariant attribute to the SIFT features. In the next step, one or more orientations are assigned to each selected feature point. The orientations of the gradients surrounding a pixel are distributed over a 36-bin histogram. Each bin entry is weighted by the corresponding gradient magnitude and a Gaussian-weighted circular window with a σ that is 1.5 times the scale of the keypoint. The highest peak in the histogram is considered as the orientation of the feature. In case there are other peaks within 80% of the maximal peak, additional copies of the same keypoint are stored associated with each additional orientation. In the next step, descriptors will be computed for each keypoint. However, these 51 descriptors will be represented relative to the feature orientation estimated here and this helps achieve rotation invariance for SIFT features. The descriptors that will be computed must be highly distinctive and invariant to illumination and changes in viewpoint. The latter two objectives are only partially met here. The 16× 16 neighborhood around the pixel is grouped into 4× 4 pixel sub-regions. Each such sub-region contributes an 8-bin Gaussian-weighted orientation histogram corresponding to its own neighbor- hood. All these histograms are grouped into one vector, resulting in a 4×4×8 = 128-dimensional descriptor. The vector is normalized to improve robustness to illumination changes. SURFFeatures: SURFstandsforSpeededUpRobustFeaturesandthealgorithmwasfirst published in (Bay etal., 2008). This algorithm also describes a keypoint detector and descriptor. The scale-invariant keypoints are found by computing a Hessian matrix H(x,y,σ) at scale σ for each candidate pixel: H(x,y,σ)= L xx (x,y,σ) L xy (x,y,σ) L xy (x,y,σ) L yy (x,y,σ) (2.15) where L xx (x,y,σ) is the convolution of the Gaussian second order derivative with the image I at (x,y). Thesecond-orderGaussiansareinfactapproximatedbyboxtypeconvolutionfiltersim- plemented using integral images (Viola and Jones, 2001). This results in a much greater speed-up compared to SIFT features. The responses to Haar wavelets are used for estimating orientations and subsequently, the descriptor is formed from the wavelet responses in the keypoint neighbor- hood. SURF features have been reported to not differ significantly in performance compared to SIFT features, but are significantly faster to compute (Bauer etal., 2007). DAISY Features : While it is computationally more efficient to compute SURF features compared to SIFT, it comes at a price. In contrast to the spatial weighting scheme of SIFT, all gradients contribute equally for SURF descriptors. In (Tola etal., 2010), it is pointed out 52 that such a scheme results in damaging artifacts when used for dense computation. The DAISY algorithmmaintainsefficiencyandthespatialweightingschemebyusingconvolutionsoftheimage gradients in specific directions with several Gaussian filters. 
Orientation maps at different sizes can be computed at low cost as convolutions with a large Gaussian kernel can be replaced with several consecutive convolutions with smaller kernels. The name of this algorithm stems from the floral pattern of pixel neighborhoods used to compute the descriptors. 2.1.3.2 Line Feature Detection and Tracking The first step in extracting line features from images is image edge detection. This can be done by using a high pass filter coupled with some heuristics to identify well-localized features. A very popular algortithm for edge detection is the Canny edge detector (Canny, 1986). In this dissertation, however, the method outlined in (Chandraker etal., 2009) is borrowed. First, Sobel edge images are constructed by applying convolution kernels to detect sharp intensity changes in thexandy directionsoftheimage. Anon-maximalsuppressionisappliedtolocatethepeak/ridge of strong as well as weak edges: I x = max dx σ( S x −(S x+dx +S x−dx )/2 max((S x+dx +S x−dx )/2,θ edge ;s,m) (2.16) where S x is the Sobel edge response at a pixel (x,y). σ(x;s,m)= 1 1+exp(−(x−m)/2) is a sigmoid function, θ edge is a free parameter and dx is the set of four testing directions given by {(1,0) T ,(0,1) T ,(1,1) T ,(1,−1) T )}. The edge map so produced will have broken edgels (edgel is short for ‘edge element’) and a linking strategy needs to employed for linking them into line segments. By doing so, the edges can be represented as equations of lines. This makes line tracking very robust as even though an edgel might change length from one image to another, the line equation can be used to identify correct matches. The popular Hough Transform algorithm (Shapiro and Stockman, 2001) is slow 53 and does not perform well in cluttered environments. Instead, a divide-and-conquer scheme is used. The edge map is divided into 8× 8-pixel cells and edge-blobs are detected per cell. The thickness of the convex hull of the points is used to determine whether an edge blob is likely to be a line segment. The convex hull, mean and covariance of the point coordinates for each line segment are stored. At the next level, each cell contains 4 sub-cells and pairs of line segments are merged if the thickness of the combined convex hull is within a threshold. This process is repeated until the convex hull covers the full image. Lines can subsequently be tracked one frame to another using the Lucas-Kanade optic flow. 2.1.3.3 Noise Characterization Thedevelopmentofnoveltechniquesforextractingandmatchingpointfeaturesandbreakthrough minimal solvers, as outlined above, have led to robust and efficient visual odometry systems (Nister etal., 2004; Pollefeys etal., 2007). It has also been emipirically observed (Christy and Horaud, 1999) that leveraging image lines instead of points can lead to improved performance in detection/matching(duetomultipixelsupport), occlusionhandlinganddealingwithT-junctions. Furthermore, the abundance of edge features in man-made environments (cityscapes and indoor structures)canbeexploitedtoreducetrackingfailuressignificantly,therebyminimizingsituations where odometry systems can get ‘lost’ and also help to reconstruct high-level scene information. On the other hand, it is also well-known that the constraints imposed by line correspondences on camera pose are much weaker than those provided by points (Hartley, 1997) and there is considerable ambiguity when dealing with curved surfaces like cylindrical columns. 
In (Rosten and Drummond, 2005), an interesting exposition of the complementary noise statistics and failure modes of line-based and point-based feature trackers is presented. Section (2.2.1) presents the visual odometry pipeline for the mobility aid introduced in Section (1.4). This pipeline employs point feature tracking only and factors leading to the choice of appropriate trackers are discussed. In Section (2.2.2), a novel odometry algorithm is proposed that combines the complementary 54 characteristics of the point and line features to build a very robust system. Before going any further, these noise characteristics are discussed in some more detail. While point features are robust to large motions, the appearance can change substantially under large aspect changes. An image is essentially a discretized representation of the projection of a real-world scene onto the camera sensor. Naturally, there are quantization and round-off errors associated with pixelization. The feature detection techniques presented above rely on finding points of extrema in local neighborhoods, but in the simplest case, these neighborhoods are discrete. In general, subpixel refinement is required to find the true points of maxima/minima by employing some form of interpolation. However, for stereo-based visual odometry systems, it is difficult to associate 3D depth data at subpixel resolutions. Hence, no such refinement is performed and the interest points so detected are the closest integer coordinates to the true points. The errors due to pixel quantization, however, are independent and can be modeled using Gaussian distributions. The camera pose that is estimated from a collection of such ‘noisy’ features is also a Gaussian distributions, from direct application of the central limit theorem. Line features are extremely invariant to illumination changes and while a desirable property for good features, this also makes the task of matching very difficult. Due to the stability of the edges, the associated descriptor (a gradient in intensity) is not unique and other techniques need to be employed. One such method that is followed in this dissertation was presented in Section (2.1.3.2). Itisevidentfromsuchastrategythatincorrectmatcheswilloftenbestronglycorrelated and the resulting camera pose will not approach a Gaussian. In (Rosten and Drummond, 2005), this distribution is modeled as a two component GMM (Gaussian Mixture Model). 2.1.4 Non-linear Refinement Given the inherent noisy nature of feature tracking and pose estimation, it is necessary to use techniques that can protect the output camera pose estimates from serious corruption. Two 55 techniques are extensively used in this thesis - one can be thought of as an outlier rejection scheme, while the latter fits a better solution by minimizing a cost function. 2.1.4.1 Random Sample Consensus (RANSAC) The abbreviation RANSAC stands for RANdom Sample Consensus (Fischler and Bolles, 1997). Thismethoditerativelyestimatesparametersofamathematicalmodelfromasetofobserveddata containingoutliers. Thealgorithmwillproduceareasonableresultonlywithacertainprobability (which increases with the number of iterations). Given a set of N input points, n points are randomly selected to compute the output model. Here, n is determined by the minimum number of points required to compute the model (n=3 for the three-point algorithm, n=5 for the five-point algorithm). 
All the remaining points in the set are then tested against the fitted model, and if a point fits well, it is considered a hypothetical inlier. The definition of a ‘good fit’ can be user defined - it could simply be the number of data points that satisfy the model with an error under some predefined threshold, or more robust methods including least-squares, least median of squares (LMedS), or some M-estimators (Faugeras etal., 2001). For visual odometry using the three point algorithm, the estimated (s1,s2,s3) values are compared to the actual depth data in the current frame to get the error estimates. The process is repeated over several random samples and the algorithm terminates when the percentage of inliers reaches a threshold or the maximum number of trials (computed to meet real-time constraints) is reached. The number of trials k required to compute a model such that at least one randomly selected minimal set is completely free of outliers can be derived by assuming an initial percentage w (based on noise in data) of inliers and a desired probability of success p (often set to 0.99). For a minimal set of size n, the probability that all are inliers isw n . Thus, the probability that at least one ofn is an outlier (and therefore, a bad model is estimated) is 1−w n . The probability that the algorithm never selects a set of n points over k trials which are all inliers is given by (1−w n ) k . Thus, 56 1−p = (1−w n ) k (2.17) and hence, k = log(1−p) log(1−w n ) (2.18) In actual implementation, each time a good model is found, the value of w can be updated to give a more current estimate of k. An improved estimate will increase the value of w, and decrease k. Thus, instead of waiting to reach the maximum number of trials, only k trials need be conducted leading to faster convergence. Note that as n increases (the minimal set size), the number of trials before an acceptable solution is obtained also increases. RANSAC can also be applied during feature tracking, where the fundamental matrix between two views is robustly estimated using Equation (1.10). Matches that are found as outliers can be rejected - and this would eliminate mismatched features or those belonging to moving objects from the odometry pipeline. The RANSAC algorithm terminates providing a best fit solution and a set of inliers. However, the solution is based only on the minimal set that was used to generate it. The next step is to recompute the solution using the entire inlier set, starting with the best fit solution as an initialization to an iterative scheme. An overview of the particular method best suited for visual odometry follows next. 2.1.4.2 Sparse Bundle Adjustment (SBA) Consider a set of n 3D points X i seen in m views P j , and let x ij be the projection of the i-th point on image j. Bundle adjustment is defined as the process of refining a set of initial camera and structure (3D points) estimates for finding the set of parameters that most accurately predict the locations of the n points in the set of m images. Formally, this can stated as 57 min Pj,Xi n / i=1 m / j=1 d(P j X i ,x ij ) 2 (2.19) whered(x,y)denotestheEuclideandistancebetweentheimagepointsrepresentedbyxandy. Equation(2.19)seekstominimizethegeometricerrorandthetermbundleadjustmentarisesfrom the fact that this process can be seen as optimally adjusting bundles of light rays originating from each 3D feature and converging on each camera’s optical center. 
If κ and λ are the dimensions in theparametersofeachP j andX i ,thisbecomesaminimizationproblemovermκ+nλparameters. In inter-frame visual odometry with a stereo rig κ=6, as the three parameters for rotation and three for translation are the only unknowns as far as the camera matrices are concerned. One camera can be kept fixed as the reference coordinate system and need not be optimized. However, the value of n can be anything from 10 to 100 depending on the kind of feature tracker employed, structure in the scene and the number of inliers found at the end of a RANSAC run. Hence, a straight-forward implementation of a numerical descent algorithm to solve the bundle adjustment problem would not meet real-time constraints. As pointed out in (Lourakis and Argyros, 2004), the lack of interaction among parameters for different 3D points and cameras enables forming sparse block matrices for implementing a very efficient Levenberg-Marquardt (LM) minimization scheme. If f is a function that maps a parameter vectorp∈R m to a measurement vector ˆ x =f(p),ˆ x∈R n , LM minimizes the squared distance ( T (,( =x− ˆ x by linearly approximating f in the neighborhood of p. For a small*δ p *, a Taylor series expansion gives the following approximation (Golub and Loan, 1996): f(p+δ p )≈f(p)+Jδ p (2.20) where J is the Jacobian matrix ∂f(p) ∂p . The value of δ p is iteratively computed as the solution of the following normal equations: 58 Figure 2.5: The implemented visual odometry pipeline. J T Jδ p =J T ( (2.21) For bundle adjustment, the measurement vector f(p) is given by the set {x ij } and the pa- rameter space p is comprised of the 3D points X i and camera matrices P j . It can be seen that ∂ˆ xij ∂X k =0,∀i,= k and ∂ˆ xij ∂P k =0,∀j ,= k. This results in a very sparse structure for J that can be very efficiently solved and the process is detailed in (Lourakis and Argyros, 2004). Sparse bundle adjustment can be used in real-time to refine inter-frame camera motion estimates using all the inlier data points and intial solution from RANSAC. 2.2 Approach 2.2.1 Visual Odometry Pipeline The visual odometry pipeline implemented in this dissertation is illustrated in Figure 2.5. The stereo camera provides a 3D point-cloud and a color image each frame. Feature tracking begins 59 with the extraction of FAST corners (Rosten and Drummond, 2005) from the very first frame. Only those FAST corners are selected which have 3D data. Thus, each feature µ consists of the following attributes: µ=[u,v,X,Y,Z,d] (2.22) where (u,v) is the pixel coordinate, (X,Y,Z) is the corresponding 3D information and d is the stereo disparity. The set of detected corners {µ i } 0 are stored along with the image and the next frame at time t is acquired. The Kanade-Lucas tracker is used to estimate optic flow given by the current and previous images, and{µ i } 0 . The new feature set{µ i } 1 is populated from the resultingopticflow,onceagainretainingonlythosefeatureswithassociated3Ddata. Theepipolar constraint is enforced using RANSAC, and all the incorrect matches (as well as pixels belonging to moving objects) are filtered out. This step also produces an estimate of the fundamental matrix F. The pixel coordinates of both the feature sets are rectified to satisfy equation 1.10 based on the algorithm given in (Hartley and Zisserman, 2000). The resulting matched pair [{ˆ µ i } 0 ,{ˆ µ i } 1 ] is passed to the visual odomety module that employs the three point algorithm in a RANSAC framework, using the LMedS error function. 
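The hypothesize-and-test stage just described can be summarized by the skeleton below (a sketch only: solve_three_point and pose_error stand in for the minimal solver and the error function of Section 2.1.4.1 and are not defined here). It also shows the adaptive update of the trial count k from Equation 2.18.

import numpy as np

def ransac_pose(feats_prev, feats_curr, solve_three_point, pose_error,
                inlier_thresh, p=0.99, n=3, max_trials=500):
    """Hypothesize-and-test camera pose estimation with an adaptive trial count.

    feats_prev, feats_curr : matched feature arrays (with associated 3D data)
    solve_three_point      : maps a minimal sample to a candidate (R, t), or None
    pose_error             : per-feature residual for a candidate pose
    """
    num = len(feats_prev)
    best_pose, best_inliers = None, np.zeros(num, dtype=bool)
    k, trial = max_trials, 0
    while trial < min(k, max_trials):
        trial += 1
        sample = np.random.choice(num, n, replace=False)
        pose = solve_three_point(feats_prev[sample], feats_curr[sample])
        if pose is None:                       # degenerate minimal sample
            continue
        residuals = pose_error(pose, feats_prev, feats_curr)
        inliers = residuals < inlier_thresh
        if inliers.sum() > best_inliers.sum():
            best_pose, best_inliers = pose, inliers
            w = inliers.mean()                 # updated inlier fraction
            if 0.0 < w < 1.0:
                k = np.log(1.0 - p) / np.log(1.0 - w**n)   # Equation 2.18
    return best_pose, best_inliers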
Finally, the obtained solution and inlier set are provided as input to a sparse bundle adjustment routine to obtain the final camera motion estimate. 2.2.1.1 Selection of Feature Trackers The choice of FAST corners with KLT tracking was made based on requirements for stability and real-time operation. In Figure 2.6, the performance of four different tracking methodologies are shown. KLT tracking gives very dense matches between two adjacent frames, while descriptor based methods outperform when viewpoints are highly disparate. Furthermore, FAST corner tracking operates at around 5.32± 0.07milliseconds, while SIFT feature tracking takes close to 60 Figure 2.6: Tracked features between two frames for various detectors and trackers, with two different intervals between frames. 323.05±35.53 milliseconds. DAISY feature tracking is faster, but only at around 101.12±4.12 milliseconds. As described earlier, the frames are very quickly sampled during odometry esti- mation, and hence, inter-frame motion is (under normal circumstances) well within the tolerable range of optic flow computation with FAST corners. Intheexampleabove,thethresholdfortheFASTfeaturedetectorwassetatt = 40. Naturally, this threshold will vary based on the amount of texture in the environment. While a certain minimum number of features must be available for accurate camera motion estimation, detecting too many features adversely impacts the computation speed. Hence, a dynamic threshold update mechanism is employed to detect only a limited number of features. The threshold is incremented ordecrementedbasedonthenumberoffeaturesavailableeachtimetrackreplenishmentisrequired (when the number of tracked features has fallen to less than 35% of the original frame, features need to be redetected to repopulate the tracks). 61 2.2.2 Egomotion Using Assorted Features 1 Based on the discussion in Section (2.1.3.3), it is desirable to have a visual odometry algorithm that can incorporate any combination of point and line features as available in the image data and yield a camera motion estimate using the combination set that generates the most accurate solution. For a real-time and robust implementation, it is preferable to have a unified framework that,independentoffeaturetype,computesthe6DOFmotionfromminimalsetsovertheavailable data. This section introduces a novel minimal solver that accomplishes this and so is ideal for use in such a hypothesize-and-test setting (Fischler and Bolles, 1997). Notation. Unless otherwise stated, a 3D point in space is represented by a homogeneous 4-vectorX = [X 1 X 2 X 3 X 4 ] # ∈R 4 and its projection on the image plane of camera i by a homo- geneous 3-vectorx i = 0 x i 1 x i 2 x i 3 1 # ∈R 3 . Similarly, any line and its projection on the image plane are denoted by the parameters L =[L 1 L 2 L 3 L 4 ] # and l i = 0 l i 1 l i 2 l i 3 1 # respectively. A projective camera is given by a 3×4 matrixK[Rt], withK being the 3×3 internal calibration matrix and the 3×3 rotationR and 3×1 translation vectort representing the exterior camera orientation. In the remainder of this paper, without loss of generality, we assume normalized camera coordinates and therefore,K is set to be the identity matrixI . The trifocal tensorT={T i } i=1,2,3 , with each 3×3 submatrix denoted byT i is the geometric object of interest in this work. The [...] 
× notation denotes the skew-symmetric matrix for forming a vector cross-product and the colon operator : is usedtoreferencerowsorcolumnsof matrices(forinstance,A(:,i)isthei th column,A(:,[i : j,k]) references columns i to j and column k). 1 This work was done as part of a paid summer internship at the Honda Research Institute, Mountain View 62 Figure 2.7: (a) Stereo geometry for two views and the point-line-line configuration (b) point- point-point and (c) line-line-line configuration in terms of point-line-line configuration. Adapted with permission from (Pradeep et al., 2010c),©2010 IEEE . 2.2.2.1 Review of the trifocal relations Consider a canonical three camera configuration given by P 1 =[I0] , P 2 =[Aa] and P 3 = [Bb]. As described in (Hartley and Zisserman, 2000), the trifocal tensor for these three views, T={T i } is given by T i =A i b # −aB i# (2.23) A i =A(:,i),B i =B(:,i). Assume a line correspondencel 1 ⇔l 2 ⇔l 3 and a point correspon- dence x 1 ⇔x 2 ⇔x 3 across P 1 ,P 2 ,P 3 respectively. The relationships between these features and the tensor can be expressed by the following equations: l 1 i =l 2# T i l 3 (2.24) 0 x 2 1 × 2 / i x 1 i T i 3 0 x 3 1 × =0 3×3 , (2.25) 63 In the general case, a single point triplet generates four linearly independent constraints and a line-line-line correspondence provides two linearly independent constaints over the tensor param- eters. Geometrically, both these configurations are best understood in terms of a hypothetical point-line-line correspondence x 1 ⇔l 2 ⇔l 3 (see figure 2.7(a)). The constraint expressed by this arrangement is simply the incidence relationship between x 1 and the line transferred from l 3 via the homography induced byl 2 . For a line triplet therefore, each of any two points (figure 2.7(c)) on the line in the first image establishes a point-line-line configuration giving two equations, while for a case of three matching points figure2.7(b)), there are four possible permutations obtained by choosing any two linearly independent lines spanned by the points in the P 2 and P 3 . 2.2.2.2 Calibrated tensors and linear solution Reconsider the stereo geometry in Figure (2.1). The framework of the proposed algorithm is composed of two trifocal tensors; T L = 4 T L i 5 arising out of image correspondences between cameras P 1 ,P 2 ,P 3 and T R = 4 T R i 5 from correspondences between P 1 ,P 2 ,P 4 . These tensors, using Equations 2.1, 2.2 and 2.23, are given as T L i = R i 0 t # −t 0 R i# (2.26) T R i = R i 0 (R 0 t+t 0 ) # −t 0 (R 0 R) i# (2.27) Since the stereo configuration is fixed and known, it is only required to estimate the twelve parameters of the underlying motion to fix T L and T R . From correspondence sets of the form 4 l 1 ⇔l 2 ⇔l 3 ;l 1 ⇔l 2 ⇔l 4 5 or 4 x 1 ⇔x 2 ⇔x 3 ;x 1 ⇔x 2 ⇔x 4 5 in Equations 2.24 and 2.25, one can write a concatenated linear system in terms of the twelve unknowns. Furthermore, an image line can be parameterized by taking any two arbitary points lying on it. Thus, Ay =0 (2.28) 64 y=[r 11 r 21 r 31 r 12 r 22 r 32 r 13 r 23 r 33 t 1 t 2 t 3 1] # (2.29) where r ij is the (i,j) th element of the rotation matrix R and translation t = 6 t 1 t 2 t 3 7 # . Note that this linear system will be inhomogeneous due to the form of Equation 2.27. The next step is to geometrically derive the minimum number of feature correspondences (over the four views) required to solve Equation 2.28. Consider x 1 ⇔ x 2 ⇔ x 3 first. With P 1 and P 2 fixed, a corresponding 3D point X is defined from the first two views. 
Thus, x 3 =P 3 X providesonlytwolinearlyindependentequationsfortheunknown (R,t). Theoretically, therefore, 4 x 1 ⇔x 2 ⇔x 3 ;x 1 ⇔x 2 ⇔x 4 5 must generate 4 linearly independent equations. However, P 3 and P 4 form a stereo pair and the following holds for any stereo pair i and j: x i 3 x j 2 −x i 2 x j 3 (2.30) Thus, on concatenating point correspondence constraints fromT L andT R , only 3 linearly inde- pendent equations are obtained. Arguing similarly but noting that Equation 2.30 is not invoked for general points on matching lines, it can be shown that a line quartet provides 4 linearly in- dependent equations. These dependencies can also be seen by performing row operations on A matrix. Hence, given n point- and m line-correpondence sets, matrixA has 3n+4m independent rows. A linear solution, therefore, can be obtained for {4 points} or {3 lines} or {3 points+1 line} (overconstrained) or {2 points+2 lines} (overconstrained). In the presence of noise, it is recommended to use more than the minimum number of equations per correspondence. However, with noisy features, this approach is also not recommended becauseR will not be obtained as an orthonormal rotation matrix. One could refine the solution further by minimizing the Frobenius norm with respect to an orthonormal matrix, but better techniques are presented in Sections 2.2.2.3 and 2.2.3. 65 2.2.2.3 Subspace solution This approach is similar to the works described in (Nister, 2004b; Kukelova etal., 2008b; Li and Hartley, 2006). Using nonlinear techniques, it is possible solve for y in Equation 2.28 from a combination of only {3 points} or {2 points + 1 line} or {2 lines + 1 point}, each of which provides 9, 10 or 11 linearly independent equations. A {2 line} solution can also be obtained (8 equations), but is unstable in the presence of noise. Therefore, this condition is ignored and one the 3 correspondences scenario is considered (the {3 lines} case has a linear solution). With this formulation, the least number of constraints one might expect is 9 equations from {3 points} (for 12 unknowns), and so, the solution can be given by a (12-9 =) 3 dimensional subspace. However, as the system of equations is non-homogeneous, y =y p +αy 1 +βy 2 +γy 3 (2.31) y p is the so-called ‘particular solution’ of a non-homogeneous linear system that can be com- puted from the psuedo-inverse of A and y 1 , y 2 and y 3 are the eigenvectors corresponding to the smallest eigenvalues from a singular value decomposition on A. To solve for α, β and γ and simultaneouslyensureorthonormalityfortherotationpartthesixpolynomialconstraintsexpress- ing unit norm for each row of R and orthogonality between any two rows can be applied. The resulting polynomial system of equations may then be solved by a variety of techniques such as Groebner basis (Kukelova etal., 2008a) or polynomial eigenvalue problem (PEP) (Li and Hartley, 2006). While better than the linear approach, the subspace of A is still very unstable in the presence of noise. This instability can be traced back to the underlying noisy tensor constraints that do not faithfully encode a camera representing rigid motion. This problem of robustly extracting camera matrices from the trifocal tensor has been reported elsewhere in (Hartley, 1997) and a work-around is described that requires estimating epipoles and applying additional constraints on 66 the camera matrices. 
The method described in Section 2.2.3yields a much more robust solution without requiring any additional information or imposing further constraints. In Section 2.3 , further experimential justification is provided that explains why the following method should be preferred over the subspace solution. 2.2.3 Quaternion-based direct solution A rotation matrix R can be parameterized in terms of the unit quaternion q= a +bi+cj+dk: R = 2 6 6 6 6 6 4 a 2 +b 2 −c 2 −d 2 2(bc−ad) 2(ac+bd) 2(ad+bc) a 2 −b 2 +c 2 −d 2 2(cd−ab) 2(bd−ac) 2(ab+cd) a 2 −b 2 −c 2 +d 2 3 7 7 7 7 7 5 RewritingEquation2.28usingthisparameterization,thefollowingsystemofequationsisobtained: A % y % =0 (2.32) y % = 0 a 2 b 2 c 2 d 2 abacadbcbdcdt 1 t 2 t 3 1 1 # The quaternion parameters are solved for first. To this end, using the Equations in2.32, (t 1 ,t 2 ,t 3 ) are expressed in terms of the other parameters by performing the following steps: SolveA % (:,[11 : 13])C t =A % (:,[1 : 10,14]) (2.33) A t =A % (:,[1 : 10,14])−A % (:,[11 : 13])C t (2.34) A t y t =0 (2.35) y t = 0 a 2 b 2 c 2 d 2 abacadbcbdcd 1 1 # The translation part is given by 6 t 1 t 2 t 3 7 # =−C t y t and now, only the polynomial system 2.35 has to be solved. At this point, the quaternion constraint a 2 + b 2 + c 2 + d 2 =1 is also introduced, which ensures that in the presence of noise, a consistent rotation is obtained. To solve 67 for y t , the method of elimination is adopted by writing the monomials 8 b 2 ,c 2 ,d 2 ,bc,bd,cd 9 in terms of 8 a 2 ,ab,ac,ad,1 9 : b 2 c 2 d 2 bc bd cd =C q a 2 ab ac ad 1 (2.36) where, A t (:,[2 : 4,8 : 10])C q =A t (:,[1,5:7,11]) (2.37) TheremustbesufficientcorrespondencessothatEquation2.37isnotunderconstrained. A t (:,[2 : 4,8 : 10]) submatrixmusthaveatleastrank6,implyingatleast6linearlyindependentrows. Sinceoneinde- pendent constraint is already provided by quaternion unit norm condition, 5 or more independent rows must come from the point or line correspondences in A % t . However, note that the manip- ulation in 2.34 introduces dependencies in A t and therefore, minimal configuration sets are {3 points} or {2 lines} or {2 points + 1 line}. The {2 lines} configuration tends to be less stable in the presence of noise, and therefore, only the {3 lines} and {2 lines + 1 point} sets are employed. From a systems point of view, this also lets one use a uniform set size of 3 features for any com- bination in a RANSAC setting. The minimal 3 feature criterion ensures that it will be always possible to solve Equations 2.33 and 2.37. In the presence of noise, it is recommended to use more 68 than just the linearly independent rows provided by each correspondence. Now, the equations in 2.36 can be rewritten as b 2 c 2 d 2 bc bd cd = [1] [1] [1] [2] [1] [1] [1] [2] [1] [1] [1] [2] [1] [1] [1] [2] [1] [1] [1] [2] [1] [1] [1] [2] b c d 1 (2.38) [i] represents an i th degree polynomial in a. Note that not all terms in the LHS of Equation 2.38 are independent and they should infact, satisfy the following conditions: (bc) 2 =(b 2 )(c 2 )(bc)(bd)=(b 2 )(cd) (bd) 2 =(b 2 )(d 2 )(bc)(cd)=(c 2 )(bd) (cd) 2 =(c 2 )(d 2 )(bd)(cd)=(d 2 )(bc) Applying these to the RHS in (2.38), one obtains the following final system of constraints: [3] [3] [3] [4] [3] [3] [3] [4] [3] [3] [3] [4] [3] [3] [3] [4] [3] [3] [3] [4] [3] [3] [3] [4] b c d 1 =0 (2.39) 69 Since by hypothesis, there exists a solution to 6 bcd 1 7 # , any 4×4 submatrix in 2.39 must have a determinant equal to 0. This gives a 13-degree polynomial in a. 
(k 1 a 10 +k 2 a 8 +k 3 a 6 +k 4 a 4 +k 5 a 2 +k 6 )a 3 =0 (2.40) Let α =a 2 : k 1 α 5 +k 2 α 4 +k 3 α 3 +k 4 α 2 +k 5 α+k 6 =0 (2.41) This 5-degree polynomial can be easily solved and the real-values a’s retained. Thus, the rotation matrix R can be composed after (b,c,d) are obtained from the null-vector of the 4×4 submatrix and translation 6 t 1 t 2 t 3 7 # =−C t y t . 2.3 Results 2.3.1 Simulation Experiments Performanceofthedifferentalgorithmsisquantifiedacrossvariousnoiselevelsbycreatingrandom 3D features and camera motion. Camera geometry is created by fixing the baseline to 7.5 cm and generating a random stereo pair. The second stereo pair is displaced with a random motion. 3D scene geometry and image correspondences are also created by randomly generating 3D points and lines in space, and projecting them onto the four cameras. Zero mean, Gaussian noise with varying standard deviations is added to the coordinates of image features in each view. All the experiments are based on 1000 trials. 2.3.1.1 Three-point algorithm with and without SBA Inthistest, experimentswereperformedtoevaluateperformanceofthethree-pointvisualodome- trypipeline(Figure2.5)inthepresenceandabsenceoftwo-framesparsebundleadjustment. Fifty 70 Figure 2.8: Evaluation of improvement with sparse bundle adjustment. random points were generated and the camera motion was computed before and after application of SBA. The results are shown in Figure 2.8. The errors with SBA were significantly lower (null hypothesis was rejected at the 5% significance level) across all non-zero noise levels. 2.3.1.2 Using point and line features Synthetic data experiments are performed to compare the results obtained using the different versions of the trifocal formulations - trifocal linear (Section 2.2.2.2), trifocal subspace (Section 2.2.2.3) and trifocal quaternion - n (Section 2.2.3). The n suffix indicates the number of feature correspondences used in the solver (with n=3 being the minimal case). Comparisons are per- formed against the popularly used three point algorithm (Haralick etal., 1991) and the line-only based motion estimation setup described in Chandraker etal. (2009), which are referred to in the experiments as ‘LinesSfM’. A mechanism to incorporate more than just the minimal set in a closed-form for generating a single hypothesis is one of the advantages of the proposed method, when compared to these algorithms. This is demonstrated in the various experiments below by setting n=4 or 5. Similar to (Chandraker etal., 2009; Nister, 2004b), only the lower quartile of 71 (a) Translation Error (b) Rotation Error Figure2.9: Rotationandtranslationerrorscomparingperformanceoflinear(red)subspace(green) and quaternion (blue) based formulations of the trifocal minimal solver using arbitrary feature combinations. Adapted with permission from (Pradeep et al.),© 2010 IEEE. error distributions for all algorithms is reported, as the targeted use is in a RANSAC framework where finding a fraction of good hypotheses is more important than consistency. Figure 2.9 shows the rotation and translation errors for the various flavors of the trifocal formulation,givendifferentminimalsets. Randompermutationsofpointsandlinesweregenerated for each trial. As alluded to earlier, the linear approach has the worst performance, while the quaternion-based polynomial system works the best. Performance comparisons using only points againstthestandardthree-pointalgorithm(Figure2.10)andusingonlylines(Figure2.11)against LinesSfmarealsopresented. 
Sincethelatteralgorithmworksbestonlyforsmallmotions, random motion was generated such that it does not break this method. For the points-only case, the quaternion-3methodshowsslightlyinferiorerrorbehaviorcomparedtothethreepointalgorithm, but the quaternion-4 and quaternion-5 methods outperform. Using only lines, all versions of the trifocal method do significantly better. The impact on the number of RANSAC trials with increasing n is presented in Figure 2.12. For this test, a random set of 30 points and 15 lines was generated and RANSAC trials by selecting random combinations of features were carried out until a threshold reprojection error (in all four images) was reached. ANOVA tests between the three groups at each noise level found no significant difference, but these results might differ for 72 (a) Translation Error (b) Rotation Error Figure 2.10: Rotation and translation errors comparing performance of the 3-point algorithm (red) against trifocal-quaternion solver using 3 (green), 4 (blue) and 5 (black) points. Adapted with permission from (Pradeep et al.),© 2010 IEEE. (a) Translation Error (b) Rotation Error Figure 2.11: Rotation and translation errors comparing performance of the lines Sfm algorithm (red) against trifocal-quaternion solver using 3 (green), 4 (blue) and 5 (back) lines. Adapted with permission from (Pradeep et al.),© 2010 IEEE. 73 Figure 2.12: Number of RANSAC trials before an acceptable solution was found for trifocal quaternion-3 (red), -4(green) and -5(blue). Adapted with permission from (Pradeep et al.), © 2010 IEEE. real data. Note that for the sake of fairness, simulations only use the particular class of features the benchmark methods have been designed to work with in Figures 2.10 and 2.11. Incorporating mixed feature combinations adds an additional dimension to theodometry pipeline withpotential for greater accuracy. It makes more sense to demonstrate this behavior with real data, as line and point features have different error statistics with significant impact on performance, and are difficult to replicate in simulated experiments. This is done in the next section. 2.3.2 Real World Experiments In Figure 2.13, the results of the camera trajectory generated by the visual odometry pipeline usingthethreepointalgorithmisshown. Thecamerapathtracedwhensparsebundleadjustment is being used is more consistent with the floor-plan of the indoor environment where data was collected. Note how the red trace (visual odometry without SBA) goes through the walls more often. In Figure 2.13(b), a sideways profile is shown. This highlights the drift in visual odometry 74 Figure 2.13: Camera trajectory generated by the visual odometry pipeline. Red: without SBA, green: after SBA. In (a), the trajectory is overlaid on the floorplan and in (b), a sideways profile is shown. (greater when SBA is not used) as evidenced by the departure from the plane of motion (the camera was moved parallel to the ground plane). For further quantitative validation of the odometry pipeline, a commercial Intertial Measure- ment Unit (IMU) was mounted on the stereo camera to provide ground truth data. The IMU, sold and manufactured by Intersense under the trade name InertiaCube3, provides yaw, pitch and roll data at 180 Hz using an array of 9 internal MEMS sensors including gyros, magnetometers and accelorometers. The yaw angles (rotation about the axis perpendicular to the ground) were compared as these are expected to be encountered the most in a head-mounted configuration. 
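To make the comparison concrete, the sketch below extracts the yaw component from the accumulated visual-odometry rotations and computes a mean absolute error against the IMU readings (an illustration only: it assumes the IMU yaw has been resampled to the camera frames and a Z-Y-X yaw-pitch-roll convention with yaw about the vertical axis, which depends on how the rig is mounted).

import numpy as np

def yaw_from_rotation(R):
    """Yaw angle in degrees, assuming a Z-Y-X (yaw-pitch-roll) factorization
    of the 3x3 rotation matrix R, with yaw about the vertical axis."""
    return np.degrees(np.arctan2(R[1, 0], R[0, 0]))

def mean_abs_yaw_error(vo_rotations, imu_yaw_deg):
    """Mean absolute yaw difference between visual odometry and the IMU.

    vo_rotations : sequence of accumulated 3x3 rotation matrices, one per frame
    imu_yaw_deg  : IMU yaw readings resampled to the same frames (degrees)
    """
    vo_yaw = np.array([yaw_from_rotation(R) for R in vo_rotations])
    diff = vo_yaw - np.asarray(imu_yaw_deg, dtype=float)
    diff = (diff + 180.0) % 360.0 - 180.0     # wrap differences into (-180, 180]
    return np.abs(diff).mean()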
Performance studies under three different conditions were carried out: slow camera rotation (less than 5 degrees/second), medium camera motion (5-20 degrees/second) and fast camera motion (20-30 degrees/second). Faster camera motions were not explored as these lead to significant motion blur and small to no image overlap between two consecute frames resulting in tracking failure. In such cases (which can be expected in abnormal conditions like head jerks), the tracking system resets and reinitializes the odometry. The results are presented in Figure 2.14. Since the development of a mixed-feature solver (Section 2.2.2) was motivated by practical difficulties in sparsely textured or badly illuminated indoor environments, real experiments were 75 Figure 2.14: Mean absolute errors in the estimation of camera yaw magnitude for three levels of angular velocity performed in environments that exhibit such pathological behavior to evaluate our algorithms against state of the art. InFigure2.16,thesequenceiscollectedaroundaparkingstructure. Thisisalarge,openspace and several frames have either very little texture or only a few point features that are clustered on one side. The 3-point algorithm performs noticably worse, exhibiting a severe break (see near top of figure) and veering off from the loop. About 2036 frames were captured for this sequence. In Figure 2.17(a), results from a corridor sequence are shown, where lack of sufficient texture in some frames (Figure 2.17(b)), leads to very few point feature tracks. The actual motion simply consists of a walk along with corridor, entrance into an elevator and a return. As is obvious from the estimated trajectory, the use of points and lines in the trifocal quaternion algorithm leads to much better performance. The 3-point algorithm drifts significantly into the wall. The bottom plot in Figure 2.17(a) shows a sideways view of the results. Since the stereo camera was fixed and transported on a platform, the motion was more or less confined along a plane parallel to the ground. Ideally, the estimated camera path should lie on the thin horizontal line from this perspective. The 3-point algorithm begins to drift from this very quickly, while the effect is minimized for the trifocal quaternion-5 algorithm. In Figure 2.17(c), a plot of the number of features available to both algorithms and RANSAC inliers found per frame is shown. This clearly shows the advantage of exploiting both feature types in such situations. 76 Figure 2.15: (Top) Lines and points tracked in the system. (Below) Estimated visual odometry superimposed on the floor-plan of an indoor office environment (Red: 3-point algorithm, green: proposedalgorithmusingassortedcombinationsofpointandlinefeatures). The3-pointalgorithm estimatesapaththatgoes‘throughthewall’towardstheendandincorrectlyestimatesthecamera position beyond the bounds of the room. Adapted with permission from (Pradeep et al.),© 2010 IEEE. 77 20m Thursday, November 19, 2009 Figure 2.16: Sequence for two loops around a parking structure (red: 3 point algorithm, green: trifocal quaternion-5). Adapted with permission from (Pradeep et al.),© 2010 IEEE. 2.4 Summary In this chapter, the visual odometry pipeline developed for the mobility assistive device proposed in this dissertation was presented. The choice of appropriate feature tracking (FAST corner de- tection with KLT tracking) methodology was explained and the importance of applying sparse bundle adjustment was demonstrated through simulation and real-world experiments. 
Further- more, a novel minimal solver that can leverage any combination of point and line features tracked across two stereo views was derived. This solver outperforms the state-of-art minimal solvers and evidence of this was provided through experiments with extensive synthetic and actual data. The visual odometry module is the first step in building an obstacle map around the vicinity of the user and localizing his/her position and orientation within this map. As can be seen from the experimental results, however, the camera motion estimates are not perfect and suffer from drift 78 (a) Estimated camera paths (red : 3-point algorithm, green: trifocal quaternion-5). (Top) Top-down view and (Bottom) sideways view. (b) Low textured corridor images with very few point features. (c) Plot of number of features per frame and RANSAC inliers Figure 2.17: Performance in low-textured, corridor sequence. Adapted with permission from (Pradeep et al.),© 2010 IEEE. 79 in spite of non-linear refinement and pragmatic feature tracking. The next chapter will describe techniques to cope with this drift and obtain more robust estimates of egomotion. 80 Chapter 3 Simultaneous Localization and Mapping (SLAM) 3.1 Introduction The results presented in the previous chapter demonstrate that visual odometry alone cannot be used to maintain robust estimates of camera pose over long image sequences. For the mobility assistive device in this dissertation, recovered camera motion is used to stitch 3D data (acquired each frame using stereo triangulation) to build obstacle maps in the vicinity of the visually im- paired user. Odometry drift implies that map consistency will eventually be compromised and the system might get ‘lost’ over long distances. The implication of this is that appropriate cues to guide the user to a safe region cannot be provided as the system cannot compute the posi- tion and orientation of the person in the map nor the location of obstacles with respect to the head-mounted camera platform. Furthermore, there is no mechanism to correct for a sudden error that might occur due to a rapid motion or lack of texture in the environment. Obviously, errors in map-building and self-localization are strongly correlated. In this chapter, techniques are presented to keep these errors under bound over time, simultaneously build a map and ex- tract egomotion information given noisy input data. This problem is referred to as simultaneous localization and mapping (SLAM). Novel contributions are presented in the form of algorithms 81 and a SLAM framework that can accompish the desired goals in a computationally tractable and stable way. 3.1.1 Problem Formulation If the path of the camera platform were known with certainty, then mapping would be a relatively much easier problem to solve: the positions of the various objects could be estimated using independent filters (Moravec, 1988; Thrun, 2002). Similarly, if a map were available apriori, localization would be straightforward (Dellaert etal., 1999). In SLAM, however, both of these are unknown and must be considered concurrently. This is a ‘chicken-or-egg’ relationship, as pose estimates are corrupted by motion noise, and this in turn will impact the perceived locations of objects in the world. The correlation between errors in path estimates with errors in mapping was first identified by (Smith etal., 1986) and later, in (Leonard and Durrant-Whyte, 1991). 
TherearetwosourcesofinputinformationavailableforsolvingaSLAMproblemasthesensor platform traverses an environment. First is the set of controls that denote (noisy) control signals or measurements of the motion from timet−1 to timet (represented byu t ). The other is a set of observations (given by z t at time t) that are noisy measurements of features in the environment. Each control or observation, along with an appropriate noise model, is a probabilistic constraint: a control input probabilistically constrains two successive poses, while observations constrain the relativepositionsofthesensorandobjectsinthemap. Alotofdifferentsensorsystemshave been reported in literature for solving the SLAM problem, with inertial navigation or wheel encoder systems (described in Section 1.1.2.2) providing the control signals and laser or sonar scanners for collecting observations. In this dissertation, the focus is on SLAM using a stereo camera and therefore, control is derived from visual odometry while 3D landmarks corresponding to interest points detected in an image sequence (described in the previous chapter) comprise observation sets. 82 Figure 3.1: Formulation of the SLAM problem for a stereo rig. In Figure 3.1, a toy example is presented that describes the formulation of SLAM for a stereo rig. Consider the problem of mapping the positions θ 1 , θ 2 and θ 3 of three landmarks (correspond- ingtosomeeasilydetectablefeaturepointsontheobjectsshown)andrecoveringthecameraposes s t−2 ,s t−1 ands t over time instancest−2,t−1 andt. At each instant, the stereo camera acquires an image along with 3D data that provide measurements about these landmarks. The images are denotedbyz t−2 ,z t−1 andz t , withtherespectivefeaturepointsdetectedbyanyoneoftheinterest point extraction schemes outlined in Chapter 2. From Equation 2.22, each measurement consists of a pixel location and 3D coordinates in Euclidean space. Note that features corresponding to not all landmarks are visible each frame - this could be because of occlusion, tracking failure or simply because the object is not in the field of view of the camera. Also, the features are associated with independent Gaussian ellipses in the images representing pixel localization uncer- tainties. The justification for this has already been provided in Section 2.1.3.3. Visual odometry provides estimates of camera motion between t−2 and t−1 (= u t−1 ) and between t−1 and t (= u t ). The uncertainties associated with feature tracking affect odometry and unchecked, will 83 grow unbounded. Hence, at any given time, the estimates of the true camera pose and map will be corrupted by noise. The goal of SLAM algorithms is to estimate the true map and true pose, given the currently available set of observations and controls. The approach adopted in this dissertation is to estimate the posterior probability distribution over all possible maps θ (θ =[θ 1 ,θ 2 ,θ 3 ] in the toy example) and poses s t conditioned on the full set of controls u t (=[u t−1 ,u t ]) and observations z t (=[z t−2 ,z t−1 ,z t ]) at time t. This joint posterior distribution over maps and poses can be written as: p(s t ,θ|z t ,u t ) (3.1) Solving Equation 3.1 is very different from obtaining only the most likely estimate. In noisy environments, maintaining a distribution of all possible solutions leads to more robust algorithms and the measure of uncertainty can be used to bootstrap the information conveyed by different components of the solution (Montemerlo and Thrun, 2007). 
For instance, one section of the map may be very well known due to multiple visitations, while others might be very uncertain. This is advantageous in the application for a mobility device, as the system can assess its own confidence before providing guidance cues to the user. For simplicity, x t ={s t ,θ} is called the state of the system and therefore, p(s t θ|z t ,u t )=p(x t |z t ,u t ). By applying Bayes rule, one gets (Thrun, 2002): p(x t |z t ,u t )=ηp(z t |x t ) ˆ p(x t |u t ,x t−1 )p(x t−1 |z t−1 ,u t−1 )dx t−1 (3.2) where η is a normalizing constant. The constraints imposed by controls and observations start out as relatively weak. In (Dissanayake etal., 2000), it has been shown that as map features are repeatedly observed, these connections become stronger and in the limit of an infinite number of observations and controls, the positions of all map features will become fully correlated. Equation 3.2 allows a recursive estimation of the posterior probability of maps and poses if the two distri- butions p(z t |x t ) and p(x t |u t ,x t−1 ) are given. These can be obtained by modeling the processes 84 Figure 3.2: SLAM as a Dynamic Bayes Network of observation and camera motion. In general, though, an analytical computation cannot be ob- tained in closed form. However, by assuming a particular form of the posterior distribution, the motion and measurement models, approximate solutions may be obtained. Statistical estimation techniques, such as the Kalman filter and the particle filter are techniques that can be applied to accomplish this. 3.1.2 Markov Chain and the Role of Uncertainty In Figure 3.2, the SLAM problem is described as a probabilistic Markov chain. The motion and observation models have been made explicit by the arrows converging at each node: the current pose s t is a function only of the previous pose s t−1 and control u t . This is given by p(s t |s t−1 ,u t ) and also captures how noise in the the motion measurement injects uncertainty into the pose estimate. Similarly, the measurement model (which governs acquisition of sensor observations) is given by p(z t |s t ,θ) - i.e., the current observation z t is only a function of the current pose estimate s t and the map θ. Again, errors in the current pose and map hypotheses will impact the observation errors. Using the motion and observation models, the SLAM posterior at any time can be recursively computed from the posterior at the previous time step. 85 An important aspect to consider in the above Markov chain representation is how uncertainty in pose estimates influences the correlation between the different landmarks. At time t− 1, landmarks θ 1 and θ 2 are observed and the resulting observation z t−1 determines the distribution of the state s t−1 (and the map θ) from Equation 3.2. Subsequently, this estimate is propagated to s t by the recursive update filter and the map θ is again updated at time t. Note that at this instance, only θ 1 and θ 3 were observed. However, information about θ 2 from the previous time step influences the uncertainty in s t , that in turn impacts the probability distribution for the whole map, including θ 3 . Hence, the landmarks are correlated due to the uncertainty in the pose estimate. Two important deductions can be made here about the nature of the SLAM problem: • There is an inherent sparsity in the SLAM formulation - observations and controls constrain only a small subset of the state variables. Landmarks θ 2 and θ 3 are strongly correlated as shown above. 
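To make the recursion of Equation 3.2 concrete, the following is a minimal sketch of one predict-update cycle of a discrete Bayes filter over a one-dimensional gridded state. It is purely illustrative and not the estimator used in this dissertation; the grid size, motion kernel and Gaussian-shaped likelihood are arbitrary assumptions.

```python
import numpy as np

# One cycle of the recursive Bayes filter in Equation 3.2: predict with the motion
# model p(x_t | u_t, x_{t-1}) (here a convolution over grid cells), then reweight by
# the observation likelihood p(z_t | x_t) and renormalize (the constant eta).
def bayes_filter_step(belief, motion_kernel, likelihood):
    predicted = np.convolve(belief, motion_kernel, mode="same")  # integrate over x_{t-1}
    posterior = predicted * likelihood                           # multiply by p(z_t | x_t)
    return posterior / posterior.sum()                           # normalize

belief = np.full(50, 1.0 / 50)                 # uniform prior over 50 grid cells
motion = np.array([0.1, 0.8, 0.1])             # motion-noise kernel (state diffuses)
likelihood = np.exp(-0.5 * ((np.arange(50) - 20) / 2.0) ** 2)  # sensor favors cell 20
belief = bayes_filter_step(belief, motion, likelihood)
```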
However, two very distant landmarks will not have such a close relationship. Information about a very far away landmark will provide only minimal information as it propagates through the many steps of recursion. This sparsity has often been exploited to compute the SLAM posterior in an efficient manner (Bailey, 2002; Guivant and Nebot, 2001; Bosse etal., 2002). • Thelandmarksareonlycorrelatedduetotheuncertaintyinposeestimates. Ifthecomplete, true path were known, then the landmark positions could be estimated independently. In other words, knowledge of the robot’s true path implies conditional independence between the landmark positions. This can be exploited in Equation 3.2 for reducing the problem to lower dimensional estimation problems, as will be done in Section 3.1.4. This approach is followed in this dissertation. The complete path is given by the collection of all poses [...,s t−2 ,s t−1 ,s t ,...] and is highlighted in gray in Figure 3.2. The true path is said to ‘d- separate’ the nodes representing the landmarks (Russell and Norvig, 1995; Montemerlo and Thrun, 2007). 86 3.1.3 The Extended Kalman Filter (EKF) The extended Kalman Filter technique is one of the most popular approaches to the SLAM problem. The particular form of the posterior is approximated as a high-dimensional Gaussian overallfeaturesandcamerapose. ThecovariancematrixofthismultivariateGaussianencodesthe correlations between the various state variables - off-diagonal elements being non-zero. Consider the Gaussian parameterized by a mean µ t (representing the most likely state of the system) and a covariance matrixΣ t . Then, as derived in (Montemerlo and Thrun, 2007) p(s t ,θ|u t ,z t )=N(x t ;µ t ,Σ t ) (3.3) µ t ={µ st ,µ θ1,t ,...,µ θ N,t } (3.4) Σ t = Σ st,t Σ stθ1,t ··· Σ stθ N ,t Σ θ1st,t Σ θ1,t Σ θ1θ2,t . . . . . . Σ θ N st,t Σ θ N θ2,t Σ θ N ,t (3.5) The Kalman Filter algorithm is the optimal estimator for a linear system with Gaussian noise (Bar-Shalom and Fortmann, 1988). The EKF extends the basic algorithm to non-linear systems by replacing the non-linear motion and measurement models with ‘linearized’ models around the most likely state of the system. This is a good approximation only if the actual models are approximately linear and and the discrete time step of the filter is small. In this dissertation, the camera pose state has six degrees of freedom (3 for rotation and 3 for translation) and each landmark is 3-dimensional. Thus, the covariance matrix becomes: 87 Σ t = Σ st,t|6×6 Σ stθ1,t|6×3 ··· Σ stθ N ,t|6×3 Σ θ1st,t|3×6 Σ θ1,t|3×3 Σ θ1θ2,t . . . . . . Σ θ N st,t|3×6 Σ θ N θ2,t|3×3 Σ θ N ,t|3×3 (3.6) Hence, the mean vector µ t is of dimension 6+3N, whereN is the number of landmarks in the map, and the covariance matrix is of size6+3N by6+3N. The system is initialized by setting the camera pose to zero (the world reference) and the covariance matrix also to zeros. The EKF consistsoftwosteps: predictionandupdate, whosegenericequationsaregivenby(Blanco, 2010): Prediction: µ t|t−1 =f(µ t−1 ,u t ) (3.7) Σ t|t−1 = ∂f ∂µ Σ t−1 ∂f ∂µ T +Q t (3.8) Update: γ t =z t −h(µ t|t−1 ) (3.9) Π t = ∂h ∂µ Σ t|t−1 ∂h ∂µ T +R t (3.10) K t =Σ t|t−1 ∂h ∂µ T Π −1 t (3.11) µ t =µ t|t−1 +K t γ t (3.12) 88 Σ t = (1−K t ∂h ∂µ )Σ t|t−1 (3.13) Here, f(·) and h(·) represent the motion and measurement models of the system. 
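As a concrete reference for Equations 3.7 through 3.13, the following is a minimal, generic EKF step. The arguments f, h, F_jac and H_jac are placeholders for the application-specific motion and measurement models and their Jacobians discussed above; this is a sketch of the generic filter, not the SLAM implementation itself.

```python
import numpy as np

# Generic EKF prediction/update following Equations 3.7-3.13.
# f, h: motion and measurement models; F_jac, H_jac: their Jacobians;
# Q, R: motion and measurement noise covariances.
def ekf_step(mu, Sigma, u, z, f, h, F_jac, H_jac, Q, R):
    # Prediction (Eqs. 3.7-3.8)
    F = F_jac(mu, u)
    mu_pred = f(mu, u)
    Sigma_pred = F @ Sigma @ F.T + Q
    # Update (Eqs. 3.9-3.13)
    H = H_jac(mu_pred)
    innovation = z - h(mu_pred)                      # gamma_t
    S = H @ Sigma_pred @ H.T + R                     # innovation covariance Pi_t
    K = Sigma_pred @ H.T @ np.linalg.inv(S)          # Kalman gain
    mu_new = mu_pred + K @ innovation
    Sigma_new = (np.eye(len(mu)) - K @ H) @ Sigma_pred
    return mu_new, Sigma_new
```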
A simple exampleisrange-bearingSLAM,wheref(·)issimplytheposecompoundingoperationthatmath- ematically updates the new pose of the camera given motion estimates from visual odometry and h(·) is the image projection equation as in Equation 1.8. The observations are pixel coordinates of the corresponding landmarks. Exact forms of these functions and the Jacobians ( ∂f ∂µ and ∂h ∂µ ) are application dependent and examples are given in (Blanco, 2010). Q t is the noise in motion estimation (i.e., noise in visual odometry) andR t is the noise in measurement (i.e., noise in stereo triangulation). Thederivationoftheseterms,asrelevanttothisdissertation,isdeferredtoSection 3.2.1. γ t (in Equation 3.9) is called the ‘innovation’ of the system and measures the departure of the actual observation z t from the predicted observation h(µ t|t−1 ).K t is the Kalman Gain and updates the mean of system in Equation 3.12. If the innovation is small, then the updated state of the system is close to the prediction in Equation 3.7, else the necessary correction is applied. While the EKF has been quite popular, there are substantial disadvantages to this approach in terms of scalability and robustness: • Each time a control and an observation are incorporated, the covariance matrixΣ t update requires a number of computations that is quadratic with the number of landmarks N. Hence, the space and time complexity of the algorithm are both O(N 2 ). Thus, the EKF approach does not scale favorably over large maps (and especially in vision-based SLAM algorithms, where the number of feature based landmarks can increase quite rapidly). • The EKF error is very brittle to matching errors. In other words, if the incorrect landmarks are updated for some mismatched image features, the filter will diverge very quickly. This is a well-known failure mode of the EKF (Dissanayake etal., 2000) and as presented in the chapter on visual odometry, feature tracking can be quite error prone. 89 • The camera motion model is linearized and represented as a Gaussian. However, in actual conditions, this is not true - especially for a camera rig mounted on the head of a human user, where motion can be quite jerky. Giventheshortcomingsmentionedabove,analternativeapproachisconsideredinthisdissertation that is more robust and scales well with the number of map landmarks. This is discussed next. 3.1.4 FastSLAM The notion of d-separation in the SLAM problem was introduced in Section 3.1.2. The condi- tional independence between landmarks, given the true camera path allows factorzing the SLAM posterior into a product of two simpler terms: p(s t ,θ|z t ,u t )=p(s t |z t ,u t ) N : n=1 p(θ n |s t ,z t ,u t ) (3.14) Note that the form of the posterior above differs subtlely from Equation 3.2 in that the state nowincludesthecompletepaths t =[...,s t−2 ,s t−1 ,s t ,...]ratherthanjustthecurrentposes t. This factorization first appeared in (Murphy, 1999) and leads to separation of the SLAM posterior into a product of a camera trajectory posterior p(s t |z t ,u t ) and N independent landmark posteriors conditioned on the camera path. The factorization is exact and not an approximation, and a complete derivation may be found in (Montemerlo and Thrun, 2007). 3.1.4.1 Sampling Importance Resampling (SIR) The Extended Kalman Filter was presented as one approach to evaluating p(x t |z t ,u t ) by repre- senting the posterior density as a multivariate Gaussian. 
Particle filters, on the other hand, can track multi-modal beliefs and include non-linear motion and measurement models. This makes them very suited to the development of robust algorithms. In general, arbitrary distributions are 90 Figure 3.3: Evolution of particles through the various stages of the SIR filter. represented using a finite set of sample states, or ‘particles’. A high density of particles corre- sponds to a high probability region in the distribution, while low probability regions contain few or no particles. The true distribution can be reconstructed in the limit of an infinite number of particles (Doucet, 1998). In actual practice, weights are assigned to each particle proportional to the probability of the region it corresponds to in the distribution. Thus, the posterior is repre- sented by a set of m weighted particles {x i t ,w i t } i=1,...,m . The recursive Bayes Filter in Equation 3.2 can be implemented using the following steps of the sampling importance resampling (SIR) algorithm (see Figure 3.3): 1. Sampling: Drawm random samples fromp(x t−1 |z t−1 ,u t−1 ) and add to the set of particles {x i t−1 } i=1,...,m . 2. Importance: Predict (sample) each particle using a transition prior (also known as the proposal distribution) p(x t |x t−1 ,u t ) giving a set{x i t|t−1 } i=1,...,m . At this stage, the samples 91 are distributed according to p(x t |z t−1 ,u t ) and not the desired posterior (which is referred to as the target distribution; in this case being p(x t |z t ,u t )). 3. Resampling: Computetheweightw i t foreachparticlesuchthatthedistributionrepresents the target density. Thus, w i t = targetdistribution proposaldistribution = p(x i t |z t ,u t ) p(x i t |z t−1 ,u t ) (3.15) These weights end up being proportional to the observation density p(z t |x i t ) and the intu- ition is that those particles that best explain the feature observations should receive higher weights. Finally, the weights are normalized such that they add up to 1 and the particles are resampled in proportion to their weights. Thus, more instances of particles explaining the current observations propagate to the next time step. The importance step is the particle filter analogue of the prediction step in the EKF, while resampling corresponds to the update step. Resampling is computationally expensive and instead todoingiteveryframe,asimpleheuristicmaybeappliedtodecidewhentoresample: anestimate of the number of effective particles is maintained: ˆ N eff = 1 . m n=1 (w i t ) 2 (3.16) Resampling is performed only if ˆ N eff falls under a threshold. This prevents particles from eventuallycollapsingtoaverysmallneighborhoodaroundapeakoftheposterior. Ifthishappens, then the filter will not be able to maintain robustness in the presence of large noise. 92 3.1.4.2 Rao-Blackwellised Particle Filter (RBPF) In order to track the posterior accurately, the number of particles required scales exponentially with the dimensionality of the state space (Montemerlo and Thrun, 2007). Hence, an implemen- tation for the full state of the SLAM posterior might require millions of dimensions as N (the number of landmarks) becomes large. The factorization presented in Equation 3.14 , however, can be exploited to maintain N+1 filters, each of which is low-dimensional. The path poste- rior (6-dimensional state space) can be represented using the SIR filter, while the landmarks are maintained using N independent EKFs. 
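A minimal sketch of one SIR cycle, including the effective-sample-size test of Equation 3.16, is given below. The propagate and likelihood callables stand in for the proposal and observation densities; the resample_ratio threshold is an illustrative choice, as the text does not prescribe a particular value.

```python
import numpy as np

# One sampling-importance-resampling (SIR) cycle with the N_eff heuristic of Eq. 3.16.
def sir_step(particles, weights, u, z, propagate, likelihood, resample_ratio=0.5):
    particles = np.array([propagate(x, u) for x in particles])     # sample the proposal
    weights = weights * np.array([likelihood(z, x) for x in particles])
    weights /= weights.sum()                                       # normalize to sum to 1
    n_eff = 1.0 / np.sum(weights ** 2)                             # Equation 3.16
    if n_eff < resample_ratio * len(particles):                    # resample only when depleted
        idx = np.random.choice(len(particles), size=len(particles), p=weights)
        particles = particles[idx]
        weights = np.full(len(particles), 1.0 / len(particles))
    return particles, weights
```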
This representation is known as the Rao-Blackwellised Particle Filter (RBPF) and a very popular implementation strategy is the FastSLAM algorithm outlined in (Montemerlo and Thrun, 2007). Specifically, FastSLAMestimatesp(s t |z t ,u t )usingaparticlefilterandtheremainingN condi- tional landmark posteriorsp(θ n |s t ,z t ,u t ) using fixed-size, low-dimensional EKFs. Each particle is a hypothesis of the camera trajectory and possesses its own set of EKFs (i.e., each particle has its own map). Thus, there are N·M EKFs, M being the total number of particles. Mathematically, each particle is of the form: P m t = ; s t,m ,θ m 1,t ,Σ m θ1,t ,...,θ m N,t ,Σ m θ N,t < (3.17) The SIR steps for the RBPF are modified as follows: 1. Sampling: Draw M particles to generate the set{P m t−1 }. 2. Importance: Givenvisualodometryu t ,drawnewcameraposesbasedonproposalp(s t |s t−1 ,u t ) and add to the path estimate s t|t−1,m for each particle. 3. EKF update: Update the landmark EKFs of the observed feature in each particle using the prediction and update steps similar to those outlined in Section 3.1.3. 93 4. Resampling: Calculate an importance weight for each particle, and draw a new particle set using importance resampling. 3.1.4.3 Efficient Implementation A naive implementation of the FastSLAM algorithm above has a computational compexity given by O(M ·N). The linear complexity in M is because each particle has to be processed at every update. The linear compexity in N arises due to the importance resampling step - a single particle may be duplicated several times during recomputation of the final particle set. This copying also includes replicating the entire map for each particle. However, at any given time, only a few landmarks are observed and therefore, several landmark filters remain unchanged. In (Montemerlo etal., 2002), an efficient book keeping strategy is proposed that maintains a binary treestructureoverthelandmarkfilters. Subtreesbelongingtounchangedlandmarksofduplicated particles are simply shared through pointers and wholesale copying need not be performed. This reduces the complexity to O(M·logN). 3.2 Approach 3.2.1 Motion and Observation Functions 3.2.1.1 Noise Models The first step in implementing FastSLAM for a stereo vision system is choosing appropriate noise models Q t and R t for the motion and observation functions respectively. The camera pose s t is comprised of six parameters s t =[ T x T y T z φχψ] T , where (T x ,T y ,T z ) are the translation parameters of camera motion in Euclidean 3-space and (φ,χ,ψ) are the yaw, pitch and roll angles. Visual odometry provides an estimate of camera pose change between two framesu t =(R,t)=[ .T x .T y .T z .φ .χ .ψ ] T .Eachlandmarkθ i isrepresentedby Euclidean coordinates (X i ,Y i ,Z i ). A measurement is defined as the corresponding homogeneous 94 Figure 3.4: Schematic representations of the measurement and landmark uncertainties. pixel coordinates (u i ,v i ,1) in the image I t and as presented in Equation 2.22, a feature is stored as µ i =[u i ,v i ,X i ,Y i ,Z i ,d i ] , where d i is the stereo disparity. The 2D pixel and 3D landmark coordinates are related by Equation 1.11, repeated here for convenience: Z i =fB/d i X i =u i Z/f Y i =v i Z/f (3.18) where B is the baseline of the stereo rig andf is the focal-length. In this dissertation, a zero- mean Gaussian distribution with fixed pixel (σ x ,σ y ) and disparity (σ d ) deviations are used to intialize the noise parameters, as described in Section 2.1.3.3. 
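To make the per-particle state of Equation 3.17 concrete, the sketch below shows one possible in-memory layout: a 6-DOF pose hypothesis plus an independent 3D EKF (mean and covariance) per observed landmark. The class and field names are hypothetical, and the tree-based map sharing described above is omitted.

```python
import numpy as np
from dataclasses import dataclass, field

# Per-particle state for FastSLAM (Eq. 3.17): pose sample + its own landmark EKFs.
@dataclass
class LandmarkEKF:
    mean: np.ndarray      # (3,) Euclidean landmark position
    cov: np.ndarray       # (3, 3) covariance

@dataclass
class Particle:
    pose: np.ndarray      # (6,) [Tx, Ty, Tz, yaw, pitch, roll]
    weight: float = 1.0
    landmarks: dict = field(default_factory=dict)   # landmark id -> LandmarkEKF

def init_particles(m):
    # All particles start at the world origin (zero pose) with empty maps.
    return [Particle(pose=np.zeros(6), weight=1.0 / m) for _ in range(m)]
```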
When a feature µ i is first observed, the pixel coordinates are associated with a Gaussian uncertainty with mean given by (u i ,v i ) and covariance by R i t = σ 2 u 0 0 σ 2 v (3.19) 95 This model has been employed in other works (Sim etal., 2007; Matthies and Shafer, 1987) and the values used in this dissertation are σ x = σ y =0.1 and σ d =0.4. Errors along the two image axes are considered uncorrelated. This measurement noise in image coordinates has to be propagated to an uncertainty in the 3D location of θ i (given by the 3×3 covariance matrixΣ θi ), based on noise characteristics of the stereo rig. A first-order error propagation can be used to obtain the following covariance matrix for the X i ,Y i and Z i coordinates: Σ θi =J σ 2 u σ 2 v σ 2 d J T (3.20) where J is the Jacobian matrix of the functions in Equation 3.18. Expanding the terms, the following expression is obtained: Σ θi =( B d i ) 2 σ 2 v + σ 2 d v 2 i d 2 i σ 2 d uivi d 2 i σ 2 d vif d 2 i σ 2 d uivi d 2 i σ 2 u + σ 2 d u 2 i d 2 i σ 2 d uif d 2 i σ 2 d vif d 2 i σ 2 d uif d 2 i σ 2 d f 2 d 2 i (3.21) A graphical depiction of the uncertainties corresponding to image measurements and 3D land- marks is shown in Figure 3.4. The other uncertainty that needs to be modeled corresponds to the noise in the motion estimate, Q t . This motion estimate is provided by visual odometry, and as was already discussed in Chapter 2, is a result of incorporating noisy feature observations. Since errors in feature localization are modeled by independent Gaussian distributions, noise in the motion model may also be modeled using a Gaussian by direct application of the central limit theorem. The method of (Moreno etal., 2009) is employed, where Q t =( N / H iT (Σ i ) −1 H i i=1 ) −1 (3.22) 96 Σ i =Σ θi,t +Σ θi,t−1 (3.23) H i = ∂f X ∂&Tx ∂f X ∂&Ty ∂f X ∂&Tz ∂f X ∂&φ ∂f X ∂&χ ∂f X ∂&ψ ∂f Y ∂&Tx ∂f Y ∂&Ty ∂f Y ∂&Tz ∂f Y ∂&φ ∂f Y ∂&χ ∂f Y ∂&ψ ∂f Z ∂&Tx ∂f Z ∂&Ty ∂f Z ∂&Tz ∂f Z ∂&φ ∂f Z ∂&χ ∂f Z ∂&ψ (3.24) Landmark pairs{θ i,t−1 ,θ i,t } are theN 3D coordinates of the two feature sets from the current and previous time steps involved in visual odometry, while f X , f Y and f Z are functions that determine the X,Y and Z coordinates of the 3D points at time t using the landmarks at time t−1 and estimated motion (R,t). These can be derived from θ i,t =f(θ i,t−1 ,R,T),i=1,...,N (3.25) 3.2.1.2 Proposal Distribution The FastSLAM2.0 algorithm (Montemerlo etal., 2003) was modified to support a visual SLAM system in this dissertation. A particle set {S m t−1 } M m=1 is initially available from the previous time step. In the FastSLAM1.0 algorithm (Montemerlo etal., 2002), these particles are simply sampled (using the motion model) to generate the set {S m t|t−1 } M m=1 . However, this strategy is not recommended when noise in the motion estimate is much greater than the observation errors (Montemerlo and Thrun, 2007). This is especially true for a stereo vision sensor employing visual odometry, asQ t is significantly impacted by errors in depth triangulation of the feature sets and incorrect matching. In contrast, the measurement noise is low as depth uncertainties do not affect image projections so much. Hence, the FastSLAM1.0 strategy would result in a lot of particles being sampled from low probability areas of the pose posterior. In FastSLAM2.0, the particles are not directly sampled but rather ‘shifted’ to a more likely areas before being sampled. This 97 shifting is achieved by using the feature measurements to bootsrap the sampling process. 
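The covariance propagation of Equations 3.20-3.21 can be sketched as below, assuming the stereo back-projection model of Equation 3.18 and the deviations quoted above (sigma_u = sigma_v = 0.1, sigma_d = 0.4). The function returns J diag(sigma_u^2, sigma_v^2, sigma_d^2) J^T directly rather than the factored form printed in Equation 3.21.

```python
import numpy as np

# First-order propagation of pixel/disparity noise into a 3-D landmark covariance,
# using X = uZ/f, Y = vZ/f, Z = fB/d (Equation 3.18).
def landmark_covariance(u, v, d, f, B, su=0.1, sv=0.1, sd=0.4):
    Z = f * B / d
    # Jacobian of (X, Y, Z) with respect to (u, v, d)
    J = np.array([
        [Z / f, 0.0,   -u * Z / (f * d)],
        [0.0,   Z / f, -v * Z / (f * d)],
        [0.0,   0.0,   -Z / d],
    ])
    S = np.diag([su ** 2, sv ** 2, sd ** 2])
    return J @ S @ J.T      # 3x3 covariance Sigma_theta_i
```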
The algorithm proceeds as follows: 1. Project s t−1,m forward by applying the visual odometry N(u t ,Q t ) estimate. The result- ing pose will have a Gaussian uncertainty, given by N(ˆ s m t|t−1 , ˆ Σ m ˆ S t|t−1 ). This Gaussian is considered as a prior distribution of an EKF restimating the camera pose. 2. For each feature observation µ i =(u i ,v i ,1), apply the following EKF update equations to incorporate the feature: ˆ µ i =h(ˆ s m t|t−1 ,θ i,t−1 ) (3.26) Z i =R t + ∂h ∂θ i Σ θi ∂h ∂θ i T (3.27) ˆ Σ m ˆ S t|t−1 = ∂h ∂ˆ s m t|t−1 T Z −1 i ∂h ∂ˆ s t|t−1 +( ˆ Σ m ˆ S t|t−1 ) −1 (3.28) ˆ s m t|t−1 =ˆ s m t|t−1 +Σ m ˆ S t|t−1 ∂h ∂ˆ s m t|t−1 T Z −1 i (µ i − ˆ µ i ) (3.29) Equation 3.29 is the Kalman update equation for the mean of the particle’s pose. The function h(·) is simply the camera projection expression from which the Jacobians ∂h ∂θi and ∂h ∂ˆ s m t|t−1 are also computed. The above steps are repeated for each feature observation, and the preceding particle-mean and particle-covariance are updated during each iteration of the loop. Furthermore, a weight is computed for each particle, based on the innovation contributed by each feature in the above steps: L i =R t + ∂h ∂θ i Σ θi ∂h ∂θ i T + ∂h ∂ˆ s m t|t−1 ˆ Σ m ˆ S t|t−1 ∂h ∂ˆ s m t|t−1 T (3.30) w m = N : i=1 N((µ i − ˆ µ i ),L i ) (3.31) 98 At the end of this step, the pose prior is represented as a mixture of weighted Gaussians, {w m ,ˆ s m t|t−1 , ˆ Σ m ˆ S t|t−1 } M m=1 . Note that these weights are not the SIR filter weights, which will be computed in Section 3.2.1.4. 3. Finally, pose samples are randomly drawn from the mixture of Gaussians distribution above giving the set{s m t|t−1 }. 3.2.1.3 Landmark Update The map associated with each particle is updated using independent EKFs for each observed landmark. For each observed feature i, the following steps are executed: 1. Compute the Kalman Gain: K i =Σ θi ∂h ∂θ i T ( ∂h ∂s m t|t−1 Σ θi ∂h ∂s m t|t−1 T +R t ) −1 (3.32) 2. Compute the expected observetion: ˆ µ i =h(s m t|t−1 ,θ i ) (3.33) 3. Apply Kalman updates to the mean and covariance of the landmark distribution: θ i = θ i +K i (µ i − ˆ µ i ) (3.34) Σ θi =(I−K i ∂h ∂θ i )Σ θi (3.35) 3.2.1.4 Importance Resampling The next step is computing importance weights. This is a fairly simple process, and can be done by applying Equation 3.31 for all particles. The only difference this time is that the landmark 99 mean and covariance values are borrowed from Equations 3.34 and 3.35 that reflect the updated map posterior. An important caveat is the representation and normalization of these particle weights. If the weights are maintained as log-likelihoods (logw m ), then a simple addition strategy can be employed instead of multiplication in computing Equation 3.31. However, as pointed out in (Sim etal., 2007), the process of normalization can be quite tricky. The magnitude of the log-likelihood can be such that raising it to the exponential power (to get w m ) may result in machine-precision zero. To deal with this, the maximum log-likelihood is found amongst all the particles and subtracted, setting it to zero. This maximum is also subtracted from all other weights. Subsequently, raising to the exponential causes no problems and normalization can proceed as usual. 3.2.2 Active Matching for Robust Data Association The presentation above assumed that perfect data association was available. In other words, each featuremeasurementµ i isassumedtobecorrectlymatchedagainstrightlandmarkθ i . Inpractice, this problem has to be solved. 
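The log-weight normalization trick described above can be sketched in a few lines; the example weight values are made up.

```python
import numpy as np

# Normalize particle weights kept as log-likelihoods: subtracting the maximum before
# exponentiating avoids machine-precision zeros, as discussed in the text.
def normalize_log_weights(log_w):
    log_w = np.asarray(log_w, dtype=float)
    w = np.exp(log_w - log_w.max())   # the largest weight becomes exp(0) = 1
    return w / w.sum()

weights = normalize_log_weights([-1050.2, -1048.7, -1052.9])
```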
The SLAM algorithm begins by populating the particle maps with a set of landmarks. Each time new features are extracted from the image, a determination has to be made about which of these come from landmarks already present in the map(s). This is knownasthedataassociationproblem. In(Simetal., 2007), SIFTdescriptorsareassociatedwith landmarksand storedin adatabase. SIFT descriptorsfor newfeaturesare thencompared against those in the database to detect landmarks that are being reobserved. However, a descriptor based technique for frame-to-frame SLAM is not feasible due to real-time constraints. In this dissertation, the technique of active matching (Molton etal., 2004) is adopted, which has been very popular for EKF-based visual SLAM systems. This idea is adapted to fit into the FastSLAM framework and is described next. This problem of data association has to be solved before steps 2 and 3 described in Section 3.2.1.2 can be implemented and hence, prior to resolution, the RBPF particles are pose priors with Gaussian distributionsN(ˆ s m t|t−1 , ˆ Σ m ˆ S t|t−1 ). 100 3.2.2.1 Landmark Initialization The first time a landmark is added to the RBPF, an image patch T 0 corresponding to a 13×13 windowaroundthefeatureisalsoextractedandsaved. Thistemplatewillbeusedtoperformdata- association. Furthermore, it is assumed that this image patch is representative of a planar scene geometry in the vicinity of the feature. A normal [ n x n y n z ] T is estimated for this planar patch to represent its orientation. This normal can be computed either using the homography between the stereo views or the underlying 3D data. 3.2.2.2 Projecting Search Windows For each new frame, the first data association step is to extract the subset of landmarks that are likely to be in the field of view of the camera. For computational efficiency, only the most likely particle and its map are considered. Given a set of landmarks {θ i }, the camera projection equation 1.8 can be used to predict image coordinates in the current field of view (using ˆ s m t|t−1 to compose the camera matrix). If these coordinates lie within the dimensions of the image, the landmark is considered a candidate and rejected otherwise. Consider a set C ={θ i } K i=1 of K such candidates. For each such candidate i, the following steps are taken: 1. For each particle j, project the landmark into the image along with its uncertainty ellipse to get a GaussianN(µ j ,Λ j ) µ j =h(ˆ s j t|t−1 ,θ i ) (3.36) Λ j =R+ ∂h ∂θ i Σ θi ∂h ∂θ i T + ∂h ∂ˆ s j t|t−1 ˆ Σ j ˆ S t|t−1 ∂h ∂ˆ s j t|t−1 T (3.37) Thus, each landmark has M Gaussians representing projection uncertainties in the image. 2. Combine the distributions uniformly to form a mixture of Gaussians representation for the landmark’s image projection. Compute the mean and covariance of this mixture (µ i ,Λ i ). 101 3. Compute the singular value decomposition of the 2×2 matrixΛ i to get two eigenvalues λ 1 and λ 2 . This defines a search window for the landmark θ i , centered at µ i with width and height (λ 1 ,λ 2 ). 3.2.2.3 Template Matching Attheendofstep3foralllandmarks, therewillbeK searchwindowsforthecandidates. Inorder toperformdataassociation, asimpletemplatematching(measuringnormalizedcross-correlation) is performed between T 0 (already saved for each landmark at initialization) and 13×13 windows that are translated across the search window. A measurement corresponds to the pixel location thatgivesthehighestmatchingscore,subjecttotheconstraintthatthecross-correlationisgreater than 0.8. 
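The sketch below illustrates active matching for one candidate landmark: deriving an axis-aligned search window from the projected 2x2 covariance (Section 3.2.2.2, step 3), then looking for the stored template inside it with normalized cross-correlation and the 0.8 acceptance threshold (Section 3.2.2.3). Note that the text uses the eigenvalues (lambda_1, lambda_2) directly as window width and height, whereas this sketch gates at n_sigma standard deviations (square roots of the eigenvalues), a common variant; n_sigma and the zero-mean NCC flavor are assumptions.

```python
import cv2
import numpy as np

def search_window(center, cov_2x2, n_sigma=3.0):
    eigvals = np.linalg.eigvalsh(cov_2x2)                 # principal image-space uncertainties
    half_w, half_h = n_sigma * np.sqrt(np.abs(eigvals[::-1]))
    u, v = center
    return int(u - half_w), int(v - half_h), int(u + half_w), int(v + half_h)

def match_in_window(image, template, window, threshold=0.8):
    left, top, right, bottom = window
    roi = image[max(top, 0):bottom, max(left, 0):right]
    if roi.shape[0] < template.shape[0] or roi.shape[1] < template.shape[1]:
        return None                                       # window too small this frame
    scores = cv2.matchTemplate(roi, template, cv2.TM_CCOEFF_NORMED)
    _, best, _, best_loc = cv2.minMaxLoc(scores)
    if best < threshold:
        return None                                       # no reliable measurement
    u = max(left, 0) + best_loc[0] + template.shape[1] // 2   # back to image coordinates
    v = max(top, 0) + best_loc[1] + template.shape[0] // 2
    return u, v, best
```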
One problem is that due to scale changes or camera rotation, the template T 0 may not corresponddirectlytotheneighborhoodaroundthepixelofinterest. Thenormal[ n x n y n z ] T estimated initially is used to compute an affine deformation for the template to compensate for this distortion. The active matching approach outlined here has two major advantages: • Search for data association is constrained to lie within K windows only. This saves a lot of computation power, as the complete image need to be scanned for potential matches. • Since the search window represents the most likely area within which the landmark may be measured, this method is robust to incorrect matching. A sequence of the above processing steps is shown in Figure 3.5. 3.2.3 Metric-Topological SLAM The FastSLAM2.0 algorithm with active matching has a complexity that is logarithmic in the number of landmarks. However, this is still not sufficient to ensure scalability over large areas wherethelandmarksmaynumberinthousands. Theculpritisthestepthatrequiresprojectingall 102 Figure3.5: Athreeimagesequenceshowingactivematching(left)andSLAMfiltering(right). The red ellipses on the the grayscale images depict the mixture of Gaussian projections for candidate landmarks. Theseleadtosearchwindows; thegreencrossesarelocationswheretemplatematches are found (i.e., observations). On the right, the 3D landmark ellipsoids are shown with the stereo camera pose at the bottom. Notice how the ellipsoids have shrunk between frames 1, 2 and 3 for the same landmarks. This is a result of improved EKF estimates due to reobservations. Also, new landmarks have been added in frame 3. 103 Figure 3.6: Overview of the metric-topological SLAM algorithm. KF: keyframe. landmarks into image space to search for candidates that might be in the field of view. Obviously, it does not make sense to test landmarks that are several meters away from the current, estimated position of the camera! Secondly, SLAM algorithms are only effective in preventing significant drift if the same landmarks are reobserved over the entire course of environment exploration. This does not hold when new areas are constantly being traversed (as one might expect in an application for a human wearable device). In this section, a novel metric-topological approach to the FastSLAM algorithm is proposed (Pradeep etal., 2009). Two levels of environment representation are maintained: local, metric and global, topological. Different variations of such a setup can be found in (Estrada etal., 2005b; Clemente etal., 2007; Blanco etal., 2008). This scheme has two main advantages: by computing metric maps over small areas, the map size can be kept bounded and secondly, loop closure (when a much older area is revisited) can be efficiently handled by simply reusing a previously computed submap. An overview of the algorithm is displayed in Figure 3.13. 104 3.2.3.1 Local Submap Level A local submap is generated by simply running the FastSLAM2.0 algorithm as explained in earlier sections. FAST features are tracked using optical flow to provide camera motion estimates through visual odometry. These motion estimates are filtered using the RBPF (withM particles) and active matching. The first frame acquired is set as the origin of the coordinate frame L for this submap, and the camera trajectory and maps are built with reference to it. This is the main thread of the entire system and is responsible for extracting information corresponding to the immediate vicinity of the user. 
In addition to maintaining a metric map, this level also stores a setofrepresentativeimagesandfeaturescalledkeyframes(KF).Thesekeyframeswillbedescribed in Section 3.2.3.3 and are useful for performing loop closure. 3.2.3.2 Global, topological Representation A local submap i with coordinate frame L i , built using the above methodology in the time interval t− τ to t, encapsulates M samples (particles) of the camera pose trajectory and per- particle maps. Suppose, at this instance, a new submap j having coordinate frame L j is to be initialized. A transformation j i Λ between the two coordinate systems is computed from the most likely pose (the sample with the largest weight) at this time in submap i. In fact, this transformation is smoothed using sparse bundle adjustment, as discussed in the next section. An initial map for this new region is constructed by simply copying the landmarks in the associated particle-map corresponding to only the current observations and transforming them into the new coordinate frame. It can be easily observed that adjacent submaps are conditionally independent due to these shared landmarks and coordinate transformation. A formal proof for this can be found in (Estrada etal., 2005b). In general, this structure can be represented by an annotated graph G =/{ i M} i∈Ωt ,{ b a Λ} a,b∈Ωt 0 (3.38) 105 i M are the metric submaps ,Ω t is the set of computed submaps until time t and b a Λ are the coordinate transformations between adjacent maps. This is very similar to (Blanco etal., 2008), but unlike them, multiple-hypotheses on the space of topological maps are not defined. 3.2.3.3 Keyframes Extraction and Sparse Bundle Adjustment A keyframe is defined as a collection of feature observations extracted from an image frame, each accompanied by a descriptor (SIFT or DAISY). Keyframes are extracted and stored each time feature tracking require replenishment (see Section 2.2.1). In order that descriptor computation not interfere with the real time operation of the system, this occurs over a parallel thread. In the chapter on visual odometry, the two-frame sparse bundle adjustment was mentioned as atechniquetoimprovethemotionestimatebetweenadjacentframes. However, thisdoesnottake advantage of the fact that shared features across several frames can provide a smoother motion estimate. Ignoring such information can lead to incremental errors that adversely impact the global transformaions b a Λ between submaps. The keyframe features, once initialized, are tracked along with the FAST corners using optic flow. These tracks (only the keyframes, and not all frames over a submap, are considered) are fed into a sparse bundle adjustment routine to improve b a Λ . These optimizations occur only when a submap is not currently ‘active’ and over a parallel thread to maintain the frame rate. 3.2.4 Visual Loop Closing In this section, a novel visual loop closure detection scheme is proposed that leverages the previ- ously extracted keyframes to build a generative model of appearance over submaps. In contrast to earlier approaches, this method eliminates ambiguities due to repetitive features in different regions while keeping the time complexity linear in the number of submaps (a similarity matrix approach in (Newman etal., 2006) was cubic in the number of features). 
In this regard, the work of (Cummins and Newman, 2007) is the most similar to the presented idea, where a dictionary of 106 featurewordsisbuiltofflineinadifferentenvironmentandaChow-Liutreeisusedtoapproximate joint densities over co-occuring words for loop-closure. The approach proposed here, however, re- quires no such offline training and the words considered come from the same environment that is currently being traversed. Beliefs are defined in the form of a probability distribution over the space of abstract, topolog- ical maps. This distribution determines the active node in the annotated graph, i.e. it gives the probability of being in the same submap, an older one (loop closure) or whether a new node must be initialized. Hence, this framework also drives creation of edges between the graph nodes. The appearance-basedmethodextractsdiscriminative‘words’foreachsubmapandmakesloop-closure decisions based on current observations. 3.2.4.1 Bayesian Framework Given a set of feature observations z t until time t, the goal is to estimate the density over each submap i M inΩ t . This can be formulated as the following recursive Bayes estimation problem p( i M|z t )= p(z t | i M)p( i M|z t−1 ) p(z t |z t−1 ) (3.39) The denominator is simply a normalizing term and p( i M|z t−1 ) is the prior belief over location that can be manually initialized. Hence, the algorithm focuses on estimating the posterior over the current observations z t for each submap. Loop-closure or new submap initialization checks are performed only during keyframe extraction. Therefore, the observation z t includes a set of M SIFT features{F 1 j } available from the image. For now, assume the availability of a precomputed list of N ‘discriminative’ features {F 0 j } which are not currently being detected. These are also added to the observation set. The intuition behind this is that the absence of stable features which predominantly occur in some submaps also provides information about the locations not being currently visited–it helps in reinforcing the belief that those particular submaps are not 107 Figure3.7: Featurediscriminativepower. Consideranobservationconsistingonlyoffeature-words 1, 2 and 3. These words occur with equal probabilities in submap i and j. Hence, both regions will be considered equally probable candidates for loop-closure (priors being the same). However, feature word 4 occurs consistently and exclusively in submap j. Since this highly discriminative and stable feature associated with j has not been observed, the confidence over submap j is reduced. Adapted with permission from (Pradeep et al., 2009). © IEEE. responsible for the observations. This is necessary for distinguishing between similar looking regions with repetitive features. While the common features will assign equal beliefs over the regions, the absence of characteristic features will penalize submaps that are associated with them. This is illustrated in Figure 3.7. Define an indicator functionI(F p j ) which is 1 when the corresponding feature is present in the image and 0 otherwise. Thus, the observation set z t consists of M +N features given by z t ={F p j :I(F p j )=p;p∈ [0,1]} (3.40) Expanding the posterior, p(z t | i M)=p(F 1 1 ,F 1 2 ,...F 1 M ,F 0 1 ,F 0 2 ,...,F 0 N | i M) (3.41) Assume that each feature in the image is actually generated with some probabilityΥ F p j from a feature-wordW F p j . Themethodtogeneratesuchadictionaryoffeaturewordsonlineisdescribed in the next section. 
This description implies the following likelihoods p(F p j |W F p j = 0) = 0if I(F p j ) = 1 (3.42) 108 p(F p j |W F p j = 1) =Υ F p j if I(F p j )=1 (3.43) p(F p j |W F p j = 0) = 1if I(F p j )=0 (3.44) p(F p j |W F p j = 1) = δ F p j if I(F p j ) = 0 (3.45) Equations 3.42 and 3.44 state that in the absence of the generating word, the corresponding feature cannot be detected (no false positives). Equation 3.43 defines the likelihood of observing the feature if the word is present while Equation 3.45 specifies the miss probability of the feature extractor. Equation 3.41 can be now rewritten as follows: p(z t | i M)= / ω p({F 1 j },{F 0 j }|{W F 1 j },{W F 0 j })p({W F 1 j },{W F 0 j }| i M) (3.46) The curly braces have been applied for compactness and encapsulate the two sets of features and their generating words. The summation is to be evaluated over the entire space ω of possible word combinations, with each word taking two possible values –present or absent. Obviously, the size of this space grows exponentially with the number of words and can quickly become intractable to evaluate. It is reasonable to state that each word is conditionally independent of all other words, given the submap under consideration. This simplifies Equation 3.46 to p(z t | i M)= / ω p({F 1 j },{F 0 j }|{W F 1 j },{W F 0 j }) M : j=1 p(W F 1 j | i M) N : j=1 p(W F 0 j | i M) (3.47) Making a further assumption that feature detections are conditionally independent of all other words given the generating word and applying Equations 3.42 through 3.45, 109 p(z t | i M)≈ M : j=1 Υ F 1 j p(W F 1 j =1| i M)[ N : j=1 p(W F 0 j =0| i M)+ N : j=1 δ F 0 j p(W F 0 j =1| i M)] (3.48) Note that the above factorization is only an approximation as not all the product terms are considered while expanding the summation. Intuitively, the first term in Equation 3.48 favors those submaps in which the words corresponding to the observed features, W F 1 j co-occur. As a second level of support, the remaining terms take into account those discriminatory words whose featureswerenotobserved. Submapswithhighprobabilitiesfortheabsenceofthesewordsexplain better why the corresponding features were not detected and get awarded. The δ F 0 j term takes care of unstable features and if equal to 1 (i.e, the miss probability is high), offers no advantage in this regard. 3.2.4.2 Multi-resolution SIFT Grids In the standard bag-of-words approach, clusters of features are assigned a specific word after an offline training step. For a SLAM algorithm that is supposed to run in varying environments, however, such training on an unrelated image set might not build a relevant vocabulary– for instance, a dictionary built using an outdoor image sequence will not necessarily encode sufficient semantics to distinguish between features collected in an indoor setting. The following approach eliminates such ‘subjectiveness’ by using an online approach dependent only on the environment being traversed. Quantized, 128-dimensional ‘SIFT grids’ are built for each submap. These grids store the number of hits (h) and misses (m) in each bin for each submap. The hit counter is incremented each time a feature is successfully tracked, while a miss counter receives an increment when it is not detected in spite of the corresponding landmark being in the camera’s field of view. These grids are global and shared by all the submaps. A SIFT grid is updated when a keyframe has 110 Figure3.8: Schematicoverviewoftheloopclosureframework. 
Considerasetof8two-dimensional features (show in different colors) and three grids of varying resolutions. Associated probabilities forword occurrences inthreesubmapsi,j andk areindicatedwith thecorrespondingwordfound in the grid. The radius of the circle along with arrow indicates the weight Υ F p j assigned to the observation. Adapted with permission from (Pradeep et al., 2009),© IEEE. been extracted, using the h and m counters for the features to carry out the update. A SIFT grid is characterized by its resolution, R that determines a fixed quantization along each dimension. Since SIFT descriptors are essentially 128 numbers, each in the range [0−255], settingR = 32, for instance, would give 8 bins along each dimension. In other words, the total number of bins can be ( 256 R ) 128 . However, not all bins are expected to be present and a hash-table is used to implement the grids. The key is simply generated by quantizing the descriptor according to R and updating the relevant grid element. These grid bins are the words W F p j for each feature. Note that depending upon the quantization level, different features can have the same word but different words cannot include the same feature. A schematic depiction of this representation is provided in Figure 3.8. The value ofR is difficult to arrive at. For loop-closing, it is necessary to be able to establish correspondences between features and words. IfR is set to a small value, minor variations along any one dimension of the descriptor will result in querying for a completely different word than what was intended- the query word might not have even been instantiated in the grid. On the 111 other hand, a large value will collapse all the features into just a few bins and hamper the task of discrimination. ThesolutiontothisproblemisthecreationofseveralSIFTgridswithvaluesofRrangingfrom small to big and assigning detector probabilities Υ F p j based on the respective resolutions. Each time a keyframe is extracted, all the grids are updated by hashing the descriptors accordingly. Sinceeachgridisahash-table,thisoperationcanbedoneveryefficiently. Ineachgrid,allpossible features that can be generated by a word are considered to be equally likely in the space bounded by the resolution of the bin, i.e. Υ F p j ∝ 1 R (3.49) Whencheckingforloop-closurewithafeatureF p j ,asearchismadebylookingfortheexistence of the corresponding hash-entry starting with the finest resolution grid to the coarsest. Search is stopped once an entry is found and Υ i F p j assigned according to Equation 3.49. The hit and miss counters stored for each submap in this bin are extracted to obtain the probabilities for the word given a submap (normalized so that the the results are not biased on the duration of each submap). In a loop-closure scenario, one would expect several high-resolution words to be found (created by their first occurrence) and some low resolution words (corresponding to new features encountered this time). The high resolution words would get higher weights and the probability over each submap can be computed from the counters. In the case of no-loop closure (i.e., a new region), all the new feature observations would correspond to low resolution grids and therefore, the probability over any previous submap will be low. In the implementation, five grids are employed withR values 16, 32, 64, 128 and 256. 3.2.4.3 Learning Discriminative Words As outlined in Section 3.2.4.1, one must also consider words that are predominant in only some submaps. 
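One possible layout for the multi-resolution SIFT grids is sketched below: each grid quantizes a 128-dimensional descriptor at resolution R, hashes the quantized vector, and stores per-submap hit/miss counters; queries proceed from the finest grid to the coarsest and take the detector weight proportional to 1/R (Equation 3.49). The class structure and method names are illustrative, not the dissertation's data layout, and the hit/miss normalization is omitted.

```python
import numpy as np
from collections import defaultdict

RESOLUTIONS = (16, 32, 64, 128, 256)   # finest to coarsest quantization

def word_key(descriptor, R):
    # Quantize each of the 128 dimensions into bins of width R; the tuple is the hash key.
    return tuple((np.asarray(descriptor, dtype=int) // R).tolist())

class SiftGrids:
    def __init__(self):
        # grids[R][key][submap_id] -> [hits, misses]
        self.grids = {R: defaultdict(lambda: defaultdict(lambda: [0, 0])) for R in RESOLUTIONS}

    def update(self, descriptor, submap_id, hit):
        # Called at keyframe extraction: every grid is updated for the feature.
        for R in RESOLUTIONS:
            counters = self.grids[R][word_key(descriptor, R)][submap_id]
            counters[0 if hit else 1] += 1

    def query(self, descriptor):
        # Finest-to-coarsest search; stop at the first grid that already contains the word.
        for R in RESOLUTIONS:
            key = word_key(descriptor, R)
            if key in self.grids[R]:
                return 1.0 / R, self.grids[R][key]   # (weight Upsilon, per-submap counters)
        return None, None                            # word never seen before
```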
These are discriminative words and if the corresponding features are not detected, they 112 provide strong cues about which submaps may be rejected from consideration. Only the finest resolutiongridisconsideredtoformsuchwords. ImaginethatthereareNtotalsubmapsavailable at time t in Ω t. Given the submap occurrence probabilities, a random event X may be defined as the word being present in one of the theseN possible locations, i.e. p(X = i M) . If the word could occur with equal likelihood in all submaps, the associated entropy or uncertainty would be maximum, i.e, H(X) max =−log( 1 N ) (3.50) However, the actual entropy is, H(X)=− N / i=1 p(X = i M)logp(X = i M) (3.51) The entropy is zero if the word occurs with a probability of 1 in any submap (i.e., there is no uncertainty regarding which submap the word occurs in). Such a word is guaranteed to be highly discriminative. However, the maximum possible entropy grows with the number of submaps and what is really relevant is the decrease in entropy. Hence, a normalized version of this quantity is computed, called ‘distinctiveness’, for each word D(W F p j )= H(X) max −H(X) H(X) max (3.52) Thus, when checking for loop closure, all the words with D(W F p j ) above a certain threshold (0.95 intheimplementation)areselected,outofthosewhosecorrespondingfeatureshavebeenobserved are eliminated. The left overs are used to compile the set {F 0 j }. The stability δ F 0 j is computed from the miss counter value. 113 3.3 Results 3.3.1 Active Matching Figure 3.9: Tracking features using active matching. In Figure 3.9, the results of feature tracking (for FastSLAM) using active matching are illus- trated through an image sequence. Landmarks are intialized into the RBPF from frame 1, with associated image templates stored into the database. Subsequently, these features can be reliably tracked despite the occlusion from frames 2 to 10. The unoccluded features are tracked quite robustly as good template matching scores are obtained within the search windows. From frames 11 to 16, the camera is rotated away and brought back to near its original position. The old 114 Figure 3.10: Indoor experiment result. The non-textured 3D map is projected onto the XY plane. Lower-right shows the floor plan of the two rooms with the camera trajectory (red) exhibiting a loop. The highlighted section on the map corresponds to the images (top-right) captured during the beginning and end of loop execution. Loop closure was detected with a probability of 0.9984. Adapted with permission from (Pradeep et al., 2009). © IEEE. landmarks are redetected, providing very strong localization priors for the camera pose. Active matching, therefore, provides very robust data association. 3.3.2 SLAM Maps and Loop Closing The system was tested for full 6 DOF motion using a hand-held stereo camera (Bumblebee2) in indoor and outdoor environments. The results reported below were obtained after offline processing on a 3.39GHz, Pentium dual core PC equipped with 3 GB RAM. No knowledge about the environment (for instance, the ground plane) was assumed. Figure 3.10 displays an indoor experiment result, with a camera trajectory of approximately 30 meters (2206 frames). Figure 3.11 shows the result for a much longer outdoor sequence with a path length of about 232 meters (8000 frames). About 30 to 50 features occurred between adjacent frames. Both these experiments also demonstrate loop closure using the SIFT grids- based approach. 
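The 'distinctiveness' of Equations 3.50-3.52 reduces to a few lines of code, sketched below; the example probabilities are made up. The base of the logarithm cancels in the ratio, so natural logarithms are used here purely for convenience.

```python
import numpy as np

# Distinctiveness of a word: normalized entropy drop of its submap-occurrence
# distribution relative to the uniform (maximum-entropy) case.
def distinctiveness(submap_probs):
    p = np.asarray(submap_probs, dtype=float)
    p = p / p.sum()
    h_max = np.log(len(p))                            # Equation 3.50
    h = -np.sum(p[p > 0] * np.log(p[p > 0]))          # Equation 3.51
    return (h_max - h) / h_max                        # Equation 3.52

# A word seen almost exclusively in one of four submaps is highly discriminative:
print(distinctiveness([0.997, 0.001, 0.001, 0.001]))  # approximately 0.98
```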
The outdoor dataset is characterized by several similar looking regions and in Figure 3.12, the probabilities estimated by the algorithm for region similarity are indicated. This 115 Figure 3.11: Top view of the reconstructed 3D map with camera trajectory (red trace) for the outdoor test environment. Selected regions have been magnified for better visualization. The flag indicates where loop closure was found. The total trajectory length was 232 meters. Adapted with permission from (Pradeep et al., 2009). © IEEE. demonstrates the proposed method’s ability to discriminate between submaps with repetitive features. 3.4 Summary A FastSLAM-based visual metric-topological SLAM algorithm was presented in this chapter. Incorporation of an active matching scheme provides very robust data association that is able to handleocclusionsandcameramotion. Metricmapsoversmallareasatthelocallevelandaglobal, graph-based abstract topological framework help build consistent maps over large distances - as shown in the results. Keyframes extraction using SIFT features enables to perform sparse bundle adjustment and therefore, compute smooth transformations between submaps. Furthermore, the 116 Figure 3.12: Similarity probabilities between images in different submaps. Top row corresponds to the detected loop in Figure 3.11. Adapted with permission from (Pradeep et al., 2009), © IEEE. visualloopclosurealgorithmintroducedhereisabletodistinguishbetweenvisuallysimilarregions with linear complexity in the number of submaps. A block diagram of the entire SLAM system, including the visual odometry block, is provided in Figure 3.13. The next step is to use the resulting map and camera pose estimates from the SLAM module to detect the presence of obstacles in the vicinity of the human user. The algorithms developed to accompish this are discussed in the next chapter. 117 Figure 3.13: Overview of the entire SLAM system. Adapted with permission from (Pradeep at al., 2009),© IEEE. 118 Chapter 4 Obstacle Avoidance and Path Planning 4.1 Introduction The ultimate goal of the system is to detect obstacles in the path and warn users about potential collisions. The SLAM algorithm, as seen from the results in the previous chapter, provides rich 3D maps and camera pose estimates. A scene interpretation module can use this data to make inferences about the structure of the surrounding environment. A contiguous area safe for navigation can be identified and a path plotted through it. In this chapter, the various modules developed to accomplish the above mentioned goals are described. Obstacle detection and path planning are two very old research topics in the field of autonomous robotics. There are several approaches in existence for solving these problems, and various algorithms have been designed after pragmatic consideration of the sensor type, noise in input data and intended application of the system. Thus, due consideration is given in Section 4.1.1 to the peculiarities of depth estimation using a stereo rig and how this makes the task of object detection in this dissertation particularly challenging. Results of applying initially developedalgorithmsbasedonpopularideas(i.e.,colorbasedorenforcingstructureconstraintson 3D data to detect out-of-plane voxels) are presented to motivate towards methods specially suited to unstructured obstacle detection. While a comprehensive review of all available techniques is 119 Figure4.1: Appearancebasedobstacledetectionresults. Greenpixelscorrespondtohypothesized ground plane pixels. 
not possible here, some ideas pertaining to unstructured obstacle detection are introduced in Section 4.1.2. Since the various steps of processing are sequential, the algorithms developed in this dissertation are accompanied with corresponding results in Section 4.2. This organization is different from previous chapters, but makes it easier to understand the next stages in data processing. Path planning is discussed in Section 4.3 and an overview of the entire pipeline is finally summarized in Section 5.6.

4.1.1 Issues

4.1.1.1 Appearance and Homography Based

The simplest strategy for obstacle detection (used in many toy robots or specialized applications) is appearance based. A color image is acquired from the current position, which is assumed to be free of any obstacles. Subsequently, a sub-region close to the bottom edge of this image is extracted and the color values of its pixels analyzed. It is assumed that this region corresponds to a section on the ground plane. All other pixels are matched against this color model, and those that do not give good matching scores are labeled as obstacles. There are many variations to this theme, and in Figure 4.1, results are shown for one approach that uses the hue-saturation colorspace. Varying degrees of success were obtained and certainly, this method is not very effective in highly cluttered environments. An alternative to appearance-based obstacle detection is to exploit the underlying geometry of image formation. The 3D ground plane induces a homography mapping between two images and can be used to identify non-ground-plane pixels. A homography matrix can be computed given four or more point correspondences between the images. This approach, however, also requires that the ground plane be visible in the image sequence. A more difficult requirement is that features lying on the ground plane be reliably matched against each other. These constraints make this approach infeasible for the proposed application in this dissertation.

4.1.1.2 Plane Fitting

Obstacle detection on point cloud data obtained from stereo processing is challenging due to the high noise content, irregular structure of the operating environment, unknown shape of the ground surface, changing camera pose and real-time requirements. Depth information obtained by stereo is inherently noisy due to several factors. As objects farther away have smaller disparities, error tends to increase with distance from the camera. Incorrect stereo matching, occlusions and shadows lead to patches where there might not be any 3D information. The typical approach of imposing a horizontal plane in the cloud (ground plane detection (Molton, 1999)) and computing distances to individual points is not very reliable. Such an approach requires fitting planes to the 3D data, followed by segmentation. The first step to plane fitting is the estimation of a local normal at each 3D data point. Singular value decomposition can be used to calculate a plane for each voxel in the 3D point cloud. Since there might be outliers in the data, a RANSAC-based algorithm can be devised for robustness. However, since this is based on local processing, the obtained normals do not end up globally consistent. The size of the voxel has a significant impact on the estimated plane. High noise content in the voxel calls for a larger size (for smoothing out outliers), whereas potential curvature in the data requires keeping the voxel size small (to avoid overfitting). (Mitra and Nguyen, 2003) describe an adaptive technique to determine voxel size based on curvature and noise content.
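A minimal sketch of the appearance-based baseline described above is given below: model the hue-saturation distribution of a strip near the bottom of the image (assumed to be ground) and label pixels that match it poorly as obstacles. The histogram bin counts, strip height and response threshold are illustrative choices, not values from the dissertation.

```python
import cv2
import numpy as np

# Appearance-based ground/obstacle labeling via hue-saturation histogram back-projection.
def ground_mask(bgr_image, strip_height=40, response_threshold=50):
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    strip = hsv[-strip_height:, :, :]                    # bottom rows, assumed ground plane
    hist = cv2.calcHist([strip], [0, 1], None, [30, 32], [0, 180, 0, 256])
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
    response = cv2.calcBackProject([hsv], [0, 1], hist, [0, 180, 0, 256], 1)
    return response > response_threshold                 # True where the pixel is 'ground-like'
```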
Since structures belonging to steps and curbs tend to be small (given that the camera is head-mounted and detection is required from a distance), curvature begins to play an early role, such that local processing is still a factor. Another problem at such step-ups is that parallel planes at differing depths need to be identified.

4.1.1.3 Tensor Voting

In an initial attempt, tensor voting (Mordohai and Medioni, 2006a) was employed to improve the normal estimates at each point. Specifically, the 3D data was encoded as a set of points associated with 3×3 tensors. The tensor structure is obtained from the initial normals evaluated by the RANSAC-SVD technique. Suppose the plane at a point is given by ax + by + cz + d = 0, so that its normal is n = ai + bj + ck, where i, j and k are the orthogonal unit vectors. Then a, b and c are taken as the eigenvalues along the eigenvectors (1,0,0), (0,1,0) and (0,0,1), respectively. The end result of this processing step is a set of refined normals that are more consistent with the global 3D structure of the scene. Noise is taken care of by the low number of votes cast for any consistent normal by outliers.

Once accurate normals have been obtained, planes are clustered trivially to obtain the dominant planar structures in the data. Assuming the ground plane to be the XZ plane, a histogram of the angles θ between the normals at the data points and the ground plane is computed. A simple Hough voting over the space of these angles clusters points belonging to parallel planes together. For instance, all horizontal planes will vote for θ = 0 and all vertical planes will vote for θ = 90. The ground plane search can be constrained to only a small range of angles, knowing the camera orientation from the SLAM system. A block diagram of the primary processing steps is shown in Figure 4.2.

Figure 4.2: Block diagram of the piecewise planar modeling algorithm. The stereo camera generates a dense 3D point-cloud, to which normals are fit using RANSAC. These normals are globally inconsistent due to local processing per voxel. Tensor voting is employed to refine these normals and clustering is finally performed. Adapted with permission from (Pradeep et al., 2008).

Scene                        RANSAC only    Tensor voting
Single step (Figure 4.3)     0.253          0.073
Two steps (Figure 4.4)       0.351          0.067

Table 4.1: Misclassification fractions for the scenes before and after tensor voting. A lower misclassification fraction indicates better performance. Adapted with permission from (Pradeep et al., 2008).

Figure 4.3 presents a comparison of the results obtained after tensor voting with those from a RANSAC-only scheme. 3D points on a curb have been manually clustered into horizontal and vertical planes to create the ground truth. The effect of local processing and noise can be easily visualized in the RANSAC result, where globally inconsistent normals lead to many misclassifications. Figure 4.4 displays results for another scene with two steps, while Figure 4.5 shows the complete segmentation results, where parallel planes have also been identified. To quantitatively assess the performance of tensor voting, the fraction of misclassified pixels (i.e., horizontal plane pixels classified as vertical plane pixels, or vice versa) relative to the ground truth data was computed. These values are provided in Table 4.1.

While good plane classification was achieved with tensor voting, the frame rate was extremely slow - about 10 seconds per frame. Furthermore, this method works only where the scene can be modeled as a piecewise planar structure. This is not true for many outdoor situations, where unstructured obstacles are present (such as trees). Also, overhanging obstacles cannot be detected using this method. As a result of these observations, a more generic obstacle detection module was finally developed and integrated into the system. The approach shown here could be integrated as part of a future system enhancement to detect the presence of stairs or curbs. While a real-time implementation might not be trivial (a parallel processing architecture could be used to speed up execution), this could be a user-driven module that provides information about the number of step-ups or step-downs as data is gathered over a few frames.

Figure 4.3: Comparison of results obtained before tensor voting and after. (a) Original 3D point-cloud; (b) input point-cloud manually clustered into horizontal (red) and vertical (blue) planes to generate ground truth; (c) segmentation results after RANSAC only; (d) segmentation results after tensor voting. Adapted with permission from (Pradeep et al., 2008).

Figure 4.4: Comparison of results obtained before tensor voting and after. (a) Original 3D point-cloud; (b) input point-cloud manually clustered into horizontal (red) and vertical (blue) planes to generate ground truth; (c) segmentation results after RANSAC only; (d) segmentation results after tensor voting. Adapted with permission from (Pradeep et al., 2008).

Figure 4.5: Final planes segmented from the scenes in (top) Figure 4.3 and (bottom) Figure 4.4. Adapted with permission from (Pradeep et al., 2008).

4.1.2 Unstructured Obstacle Detection

The planar assumption cannot be applied for generic obstacle detection, and even less so when the camera is in a constant state of motion (Batavia and Singh, 2002). In (Talukder et al., 2002), a novel approach is proposed that classifies obstacles based on the height and slope differences between neighboring points. Two surface points p_1 = (p_{1,x}, p_{1,y}, p_{1,z}) and p_2 = (p_{2,x}, p_{2,y}, p_{2,z}) are called compatible with each other if the following two conditions are met:

1. H_min < |p_{2,y} − p_{1,y}| < H_max

2. |p_{2,y} − p_{1,y}| / ||p_2 − p_1|| > sin θ_max

The first condition links points that are within a height threshold of each other, while the second condition groups points such that the line joining them forms an angle with the horizontal plane greater than a threshold. H_max, H_min and θ_max are constant parameters that need to be specified for the algorithm. Geometrically, these constraints look for points that lie in the frustum of a 3D conical search region. Any two points that are compatible, or are connected by a string of compatible points, belong to the same obstacle. The algorithm is made efficient by projecting the search regions to image space (where the frustums become triangular sections). The computational workload was further reduced in this dissertation by modifying this algorithm into a graph-based approach. A fully connected graph was initialized and the neighborhood of each node searched for compatible points. This search was structured such that points that had already been labeled were not rescanned. Edges were created to identify compatible pairs and, subsequently, connected subgraphs were extracted to detect obstacles. The results are shown for two example images in Figure 4.6. Further experimentation revealed one major shortcoming of this approach: H_max, H_min and θ_max needed to be modified for different environments. Thus, this approach could not be generalized. Ultimately, the method of (Triebel et al., 2006) (proposed for lidar scans) was borrowed and adapted to work with stereo point clouds. This is described next.
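As a concrete illustration of the compatibility test and the connected-component grouping just described, a minimal sketch is given below. It is not the dissertation's implementation: the parameter values, the KD-tree neighborhood query used to bound the pairwise search, and the union-find bookkeeping are all illustrative assumptions.

import numpy as np
from scipy.spatial import cKDTree

def compatible(p1, p2, h_min, h_max, theta_max):
    # Talukder-style test: height difference inside [h_min, h_max] and the
    # connecting line steeper than theta_max with respect to the horizontal.
    dy = abs(p2[1] - p1[1])          # y is the height coordinate
    dist = np.linalg.norm(p2 - p1)
    return h_min < dy < h_max and dist > 0 and dy / dist > np.sin(theta_max)

def label_obstacles(points, h_min=0.1, h_max=1.0, theta_max=np.radians(45)):
    # Group points connected by chains of compatible pairs using union-find.
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    # Compatibility is impossible beyond h_max / sin(theta_max), so a fixed
    # radius query bounds the candidate pairs for each point.
    radius = h_max / np.sin(theta_max)
    tree = cKDTree(points)
    for i, neighbors in enumerate(tree.query_ball_point(points, r=radius)):
        for j in neighbors:
            if j > i and compatible(points[i], points[j], h_min, h_max, theta_max):
                union(i, j)

    return np.array([find(i) for i in range(n)])   # one obstacle label per 3D point

The default thresholds above illustrate the very problem noted in the text: they must be retuned for each environment, which is what motivated the switch to the multi-surface map approach.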
Figure 4.6: Obstacle detection using the compatible points algorithm. Obstacles are labeled in red.

4.2 Approach and Results

4.2.1 Voxelized Elevation Map

The first step in the obstacle detection pipeline is reducing the size of the 3D point cloud. On average, the cloud (for each frame) consists of 30,000-40,000 points, and this imposes severe loads on both computational and memory resources. The solution to this problem is voxelization: the cloud is quantized into a two-dimensional grid based on preset resolutions along the x and z axes. Each bin in this grid stores a vector of height values (given by the y coordinate) of the points falling into the bin. Each bin, B_i(X,Z), will therefore be of the form:

B_i(X,Z) = [y_1, ..., y_N]    (4.1)

where N is the number of points. This number will be different for each bin, depending on the density of scene structure at the corresponding location. The current implementation uses uniform values of 20 cm along the two directions, but the resolution can be non-isotropic as well. Given a 3D coordinate (x,y,z) and the quantization levels, a unique number can be computed that maps the 3D point to an integer 'key'. A hash table is maintained and, for each incoming 3D point, the key is computed. Subsequently, the height value is added to the corresponding bin. As not all possible bins in 3D space will be occupied, this data structure leads to a very sparse representation of the point cloud. Furthermore, data can be retrieved in constant time. At this stage, outliers are also filtered by eliminating bins with N less than 10.

4.2.2 Multi-Surface Elevation Map

The technique of multi-surface maps (Triebel et al., 2006) is adopted for detecting obstacles from the elevation grid. A brief review of the algorithm, modified for application in this dissertation, follows:

• Height intervals H_k are computed for each bin from the stored values. Two consecutive heights belong to the same interval if the values are closer than a given gap size, G. In other words,

y_i, y_j ∈ H_k  if  |y_i − y_j| < G    (4.2)

The gap size can be equal to or greater than the height of the person, to enable him/her to pass through. Note that this parameter is not environment dependent and can be kept fixed for the user. Each bin can have more than one height interval. For instance, if there is an open door, points belonging to the floor will fall in one interval and those belonging to the upper section of the wall will fall in another. At the end of this step, each bin will have K intervals:

B_i(X,Z) = [H_1, ..., H_K]    (4.3)

The value of H_i is the median value of the heights in the interval.

• Each interval is classified as a horizontal or vertical patch, based on the height (vertical extent) of the interval. If the height exceeds a certain threshold thickness T, then the interval is considered vertical; otherwise it is horizontal. Again, the value of T depends on the step size of the user and is not environment dependent.

∀ H_j ∈ B_i(X,Z):  H_j = Vertical if H_j > T;  H_j = Horizontal if H_j ≤ T    (4.4)

• Any bin containing a vertical patch is immediately labeled as untraversable (an obstacle).

• The remaining bins are arranged in a graph, which is subsequently scanned from left to right, top to bottom. This graph is organized such that, corresponding to vertical bins, there are empty nodes. Next, the 3×3 neighborhood of each node is checked and, for any non-empty node with a gap interval comparable to that of the candidate node, an edge is added.

• A second pass is made over the graph to check the number of edges at each node.
Bins corresponding to nodes having at least 5 edges are labeled as traversable. The rest of the bins are denoted as untraversable.

The parameters required in the above algorithm are independent of scene structure, and only local neighborhood information is required for obstacle detection. Thus, the difficulties associated with ground plane estimation and noise are circumvented. Examples of results obtained at the end of this stage are shown in Figure 4.7.

4.2.3 Post-processing Steps

Because of quantization effects and holes in the depth map, the multi-surface elevation maps will produce an obstacle map with gaps between objects. Sometimes, these gaps can be wide enough that the path planning module (discussed in Section 4.3) might get tricked into planning a path through untraversable regions. The obstacle map is therefore refined by filling in the breaks between obstacles using the morphological operations of dilation followed by erosion. Intermediate results at different stages of this post-processing step are shown in Figure 4.8. The obstacle maps produced without and with these operations are also shown in Figure 4.9. Note that in the absence of dilation/erosion, the path planner detects a safe path straight ahead, which does not happen to be the case. This is rectified by filling the gaps.

Figure 4.7: Multi-level surface patch models of 3D point clouds. Red regions show vertical surfaces and green regions indicate horizontal surfaces. The cone represents the current camera orientation. Adapted with permission from (Pradeep et al., 2010a), © IEEE.

Figure 4.8: Intermediate results during post-processing of the elevation grid. Images (b), (c) and (d) are top-down perspectives of the obstacle map.

Figure 4.9: Comparison of obstacle maps produced before and after morphological operations for the image in Figure 4.8.

4.2.4 Probabilistic Approach for Temporal Consistency

In practice, a single frame of 3D data cannot be completely relied upon for detecting obstacles in the map. Due to noise and moving objects, there could be false positives or missed objects. A temporal window is required to ensure the stability of obstacle detection. Fortunately, since a SLAM algorithm is being run as the backbone of the entire system, each obstacle map can be registered on top of the one from the previous time step to build a consistent occupancy map. For each overlapping voxel from the preceding time step, the following updates are applied:

1. If the voxel is currently detected as an obstacle, increment a counter O; else increment a counter S.

2. Do not make any decisions about the voxel until O + S reaches a value of 5. This implies that the particular voxel should be observed in at least 5 frames.

3. Evaluate the ratio R = O / (O + S). If R > 0.5, label the voxel as an obstacle; else label it as a traversable region. Increment another counter I each time the voxel switches between a traversable and an untraversable label.

4. Keep updating the values of O, S, R and I by repeating steps 2 and 3.

5. If I reaches a value of 15, label the voxel permanently as an obstacle. Skip these steps the next time the voxel is reobserved.

Step 5 stems from a conservative approach to system design: from the user's point of view, it is safer to label a 'jittery' voxel as an obstacle. Thus, if a slow-moving object walks past the field of view, it might leave a trace of obstacle bins along its path. While not implemented as part of this dissertation, tracking such a moving object could help eliminate such artifacts.
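The per-voxel update rule above can be sketched as a small class. This is an illustrative reconstruction rather than the dissertation's code: the constants (5 observations, the 0.5 ratio, 15 label switches) are taken from the text, while the interface itself is assumed.

class VoxelVote:
    # Temporal filter for one voxel of the occupancy map.
    MIN_OBS, OBSTACLE_RATIO, MAX_SWITCHES = 5, 0.5, 15

    def __init__(self):
        self.O = 0          # frames observed as obstacle
        self.S = 0          # frames observed as free space
        self.I = 0          # number of label switches
        self.label = None   # 'obstacle', 'traversable', or None (undecided)
        self.permanent = False

    def update(self, detected_as_obstacle):
        if self.permanent:
            return self.label            # step 5: frozen as an obstacle
        if detected_as_obstacle:
            self.O += 1
        else:
            self.S += 1
        if self.O + self.S < self.MIN_OBS:
            return None                  # step 2: wait for enough evidence
        R = self.O / (self.O + self.S)   # step 3: obstacle ratio
        new_label = 'obstacle' if R > self.OBSTACLE_RATIO else 'traversable'
        if self.label is not None and new_label != self.label:
            self.I += 1                  # count label switches
        self.label = new_label
        if self.I >= self.MAX_SWITCHES:  # step 5: jittery voxel -> obstacle
            self.label, self.permanent = 'obstacle', True
        return self.label

A tracker would keep one such object per occupied bin in the scrolling region of interest and call update() once for each newly registered frame.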
132 Figure 4.10: (Top) Top-down view of the traversability maps corresponding to different time instants (below) of system operation. Processing is confined within the 5 meter circle around the current location of the user. Blue regions indicate unmapped areas in the subject’s neighborhood, green regions are traversable and red regions are not. The yellow cone shows the current user position and head orientation. Note that as the system starts, the immediate region around the camera is not mapped as it does not appear in the field of view. This information is propagated to the path planner, that ignores the first ‘road-block’ until it enters a mapped region. Adapted with permission from (Pradeep et al., 2010a),© IEEE. 4.2.5 Scrolling Region of Interest It is computationally not feasible to maintain real time operation and build an obstacle map for the entire map returned by the SLAM algorithm - especially, as more new areas are explored. This also serves no purpose for the application, as it is required to only guide the user around objects in the immediate vicinity. This is accompished by limiting obstacle detection to a circle of radius 5 meters around the subject. Mobility studies in visually impaired subjects have found that successful navigation requires information only upto 3 meters ahead (Bath, 1979). As the SLAM algorithm tracks the motion of camera in the map, this circle can be ‘scrolled’ through the environment. All the steps outlined above for obstacle detection, as well for path planning, are confined within this area (see Figure 4.10). 133 Figure 4.11: Schematics of waypoint computation. 4.3 Path Planning The system’s primary mode of operation is providing guidance to the user to stay on a safe path (explainedfurtherinChapter5). Forthis, twocomponentsneedtobecomputed: awaypointthat the subject can safely be lead to and the shortest path along which this may be accomplished. The waypoint could also be provided by the user, but this would require constant interaction between the human and the system. Furthermore, a visually impaired subject will generally not have any information about an unfamiliar environment to direct the system. However, if the entire strategy is automated, then the system can get too intrusive and try to over-ride the user’s intentions. In this dissertation, a mechanism to solve this problem is proposed that leverages the concept of implicit interaction. The system autonomously computes a waypoint but also tries to predict the human user’s motion and update its waypoint estimate based on actions taken in the environment. 134 Figure 4.12: Autonomous updates of waypoint (pink dot). 4.3.1 Waypoint Identification Information about the user’s intended direction of travel is maintained as the equation of a line joining the previous position to the current position. This is saved as a ‘motion vector’ along with the obstacle map. The previous position is updated only if there is a translational motion greaterthan 1meter, soastoavoidrecomputingtheway-pointbasedonheadrotationswhichcan occur naturally during walking. The heading angle is also updated if the yaw motion exceeds 15 degrees. To compute the waypoint, the traversability map is searched for the farthest accessible point within a triangular search region, with the current position as the vertex and motion vector bisecting the base. A schematic of this formulation is shown in Figure 4.11. In Figure 4.12, three shapshots of navigation through an obstacle course are shown. 
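The waypoint search of Section 4.3.1 can be sketched as a brute-force scan over the traversable cells, as shown below. This is a simplified illustration only: the cone half-angle, the use of cell centers and the omission of a line-of-sight check are assumptions, not values taken from the dissertation.

import numpy as np

def find_waypoint(traversable, position, heading, max_range=5.0,
                  half_angle=np.radians(30)):
    # traversable: iterable of (x, z) centers of traversable grid cells.
    # position, heading: 2D numpy arrays; heading is the unit motion vector.
    # Returns the farthest traversable cell inside the triangular search
    # region whose apex is the current position and whose axis is the
    # motion vector, limited to the 5 m scrolling region of interest.
    best, best_dist = None, 0.0
    for cell in traversable:
        offset = np.asarray(cell, dtype=float) - position
        dist = np.linalg.norm(offset)
        if dist < 1e-6 or dist > max_range:
            continue
        cos_angle = np.dot(offset / dist, heading)   # angle to the motion vector
        if cos_angle > np.cos(half_angle) and dist > best_dist:
            best, best_dist = cell, dist
        # A real implementation would also verify that the planned path to the
        # cell stays within traversable space before accepting it.
    return best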
The waypoint is automatically updated during when the user orientation has changed significantly from its previous position. Note that in the middle image, the detected waypoint is very close to the user’s position. As will be explained in the next chapter, the system will only provide cues to a new waypoint if the traversable area leading up to it is larger than the previous one. In this case, cues will still be give based on the waypoint in the first image. However, the third image presents a situation where the user has completely turned around and this time, a long enough free path is available that is directly in the ‘line of sight’. 135 4.3.2 D* Lite TheshortestpathtothewaypointiscomputedusingtheD*Litealgorithm(KoenigandLikhachev, 2005). An obstacle region is assigned a cost of infinity, while traversable nodes are given unit cost values. The goal is to reach the waypoint, avoiding all obstacles, while keeping the total cost low. This is an incremental heuristic search algorithm that is able to cope with online modifications to the obstacle map in real time. The path is updated only for sections of the map that have un- dergone some modification. The D* Lite algorithm produces the same paths as the more popular D*, but is easier to implement and is at least as fast as the latter. A complete description of this algorithm is beyond the scope of this thesis, but details may be found in (Koenig and Likhachev, 2005). 4.4 Summary Figure 4.13: The obstacle detection and path planning pipeline. Figure 4.13 shows the various modules described in this chapter as part of a single data processing pipeline, interfacing with the SLAM system. Incoming stereo data first passes through preliminary obstacle detection blocks that build a multi-level elevation map to detect objects in 136 the scene. This approach overcomes the limitations of standard algorithms that require a model of the ground plane. The elevation map is registered with the already existing obstacle map corresponding to the user’s vicinity and temporal filtering is applied to eliminate false positives and obtain confidence values for the new voxels. Subsequently, a waypoint is computed if the user has moved significantly or changed orientation. Any change in waypoint is transmitted to the path planning module, that employs the D* Lite algorithm to compute the safest and shortest path to it. This enables an implicit interaction between the user and the system, where actions taken by the human are automatically transmitted into system adjustments to anticipate user needs. The next chapter will discribe how this information is efficiently conveyed and present results of field tests of the system with normal sighted and visually impaired subjects. 137 Chapter 5 Cuing Modalities and Human Testing 5.1 Introduction A critical component of electronic travel aids is the human interface, that converts data acquired by sensors and processed by algorithms into a format suitable for interpretation by the user. A wearablemobilityassistivedevicedesignedforobstacledetection, asproposedinthisdissertation, needs to convey cues to the human user for taking evasive action. In previous studies (see Section 1.3.6), themechanismofsensorysubstitution(Arno, 1999)hasbeenusedforprovidingcuestothe visuallyimpairedviatheaudioandtactilemodalities. Forinstance, audiosignaturesareassigned to objects and subjects learn to correlate them after some period of training (Shoval etal., 1998). Audio cuing in this way, however, imposes a severe cognitive load on the user. 
Tactile cuing is less obtrusive and has higher resolution (Jones etal., 2004). In (Erp etal., 2006), it was shown that providing egocentric information about the current orientation through a tactile display can help people in countering spatial disorientation. Given these observations, the tactile modality has been employed in this dissertation to com- plement the computer vision system described over the previous three chapters. A tactile vest withmicrovibrationmotorsandwirelesscontrollogichasbeendesigned,fabricatedandintegrated into the system and will be described in Section 5.2. The control logic for translating locations 138 of detected obstacles and safe paths into vibration cues will be outlined in Section 5.3. To test the performance and feasibility of this approach, extensive tests were performed on blindfolded human subjects and the results are reported in Section 5.4. Mobility experiments (Section 5.5) werealsoperformedwithvisuallyimpairedsubjectstoobtainfeedbackfromthetargetpopulation for this system. Finally, the major findings in this chapter are summarized in Section 5.6. 5.2 Tactile Vest Figure 5.1: Main components of the tactile vest. The primary components of the tactile interface are shown in Figure 5.1. Four microvibration motors are positioned on the vest - one on each shoulder and two on each side of the waist. The vibration motors provide a buzzing sensation similar to that found in cell phones. Each motor can be driven independently or synchronously with the remaining three, based on digital control signalsprovidedbyanonboardmicro-controller(ArduinoDuemilanoveboardwithanATmega328 processor). Thismicro-controllerhas14digitaloutput/inputpins, 6analogpins, a16MHzcrystal 139 Figure 5.2: System level overview. oscillator and can operate with an input volate range of 6-20V (recommended ratings are 7-12V). It is easy to program the board through a standard USB interface. The computer vision algorithms run on a PC and for conducting mobility experiments, a tethered connection between the vest and the computer is not practical. Thus, a wireless link is set up between the tactile system and the processing computer using XBee modules (XB24-ACI- 001). At the computer terminal, 8-bit data packets are streamed to the transmitter via the USB. When these signals are received through the radio-frequency link, the microcontroller (on the vest) sets the corresponding digital pins high. While this allowed the human interface to be fairly self-contained, the Bumblebee2 stereo camera still needed a tethered connection to the computer. For mobility experiments with visually impaired subjects, the Tyzx stereo camera was used for its wireless interface (see Figure 5.2). With this, subjects could walk through an obstacle course without any constraints as the camera streamed images to the processing PC and wireless cues were sent back. The four motors have a very simple interpretation scheme for the human user: left/right shoulder vibration activation requires the subject to turn left/right, while the waist motors imply 140 Figure 5.3: Schematic depiction of the cuing triggers for the shoulder motors. a side-step in the corresponding direction. If the subject gets too close to an obstacle and a potential collision is detected, all the motors are activated at once. In the absence of any cues, the user simply keeps walking straight ahead. When a cue is received, walking motion is suspended and the turn/side-step is executed. 
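As a concrete illustration of the wireless link described above, the host-side transmission of a cue can be sketched with the pyserial library. The serial port name, baud rate and the bit-to-motor assignment below are placeholders assumed for this sketch, not values taken from the dissertation.

import serial  # pyserial

# One bit per vibration motor; this particular assignment is hypothetical.
LEFT_SHOULDER, RIGHT_SHOULDER = 0b0001, 0b0010
LEFT_WAIST, RIGHT_WAIST = 0b0100, 0b1000
ALL_MOTORS = 0b1111            # collision / proximity alert

CUE_TO_MASK = {
    'turn_left': LEFT_SHOULDER,
    'turn_right': RIGHT_SHOULDER,
    'step_left': LEFT_WAIST,
    'step_right': RIGHT_WAIST,
    'collision': ALL_MOTORS,
    'clear': 0b0000,           # switch all motors off
}

def send_cue(port, cue):
    # Write a single 8-bit packet to the XBee transmitter; the Arduino on the
    # vest raises the digital pins corresponding to the set bits.
    port.write(bytes([CUE_TO_MASK[cue]]))

if __name__ == '__main__':
    # Port name and baud rate are placeholders for this sketch.
    with serial.Serial('/dev/ttyUSB0', baudrate=9600, timeout=1) as xbee:
        send_cue(xbee, 'turn_left')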
The subject keeps turning or side-stepping as long as the vibration is on. Once the motor(s) are switched off, normal walking is resumed. In case of a collision alert, the user can explore the region until the system detects a new safe path and starts providing guidance. The control strategy for activating the appropriate motors is discussed next. 5.3 Cuing Strategy The system has two modes of operation: guidance mode and proximity alert mode. When the path-planner is able to compute a waypoint at least 1 meter away from the user and plot a path to it, guidance mode is invoked. The SLAM algorithm continually tracks the camera motion to compute a motion vector (as described in the previous chapter) for predicting the user’s direction of travel. The angle between this vector and path segment between the current position and the way-point is used to determine deviation from the safe path of travel. If this deviation exceeds a certain threshold, and based on the whether the angle is positive or negative, the left or right shoulder motor is activated. This strategy is schematically illustrated in Figure 5.3. However, if the farthest waypoint is under 1 meter away in the direction of the motion vector, the system 141 switches to proximity mode. If this distance is more than half a meter, then search is made for traversible regions to the right or the left. If successful, the corresponding waist motor is switched on. In case no safe region is found or the distance is less than 50 cm, all motors are activated at once. This is to alert the subject that he/she is dangerously close to a collision. At this point, the subject can perform rotations in any direction to map new areas of the environment and the system will return to guidance mode as soon as a reasonable waypoint can be found. Note that as this mobility assistive device will only complement a white cane, the latter can be used in such a situation to move to safer areas. 5.4 Mobility Experiments with Sighted Subjects 5.4.1 Motivation This section presents a series of pilot experiments conducted with blindfolded subjects. One test investigated whether it is possible to guide users through an obstacle course (in the absence of visual input) by means of tactile cues alone and whether such cues offer any advantage(s) over standard mobility aids. As only the tactile vest was being evaluated here, cues were provided by manual control (see Figure 5.4) and the computer vision algorithms played no role. The second experiment was a trial for the whole system , with a head-mounted stereo camera (version 1 with the Bumblebee2) providing input to the SLAM and obstacle detection modules, which au- tonomously drove the vibration motors installed on the vest to guide subjects. These experiments were conducted with the following hypotheses in mind: • Hypothesis #1: Navigation performance in the same environment improves over time until it reaches a plateau - with or without mobility aids. • Hypothesis #2: The white cane only transmits local and restricted environment informa- tion to the user, leading to subjects getting stuck in trap situations. 142 Figure5.4: Tactilevestwiththetwomodesofoperation: manualcontrolusedforevaluatingcuing strategy only and auto control with head-mounted stereo camera for SLAM+obstacle detection. An 8-bit NES remote control was used for the manual interface. Adapted with permission from (Pradeep et al., 2010a),© IEEE. 
• Hypothesis #3: Tactile cuing based on global information about a much larger neighbor- hood around the user facilitates better performance. • Hypothesis #4: The SLAM system, together with the obstacle detection module, can autonomously generate tactile cues in real-time based on global information for optimal performance. 5.4.2 Evaluation of tactile cues 5.4.2.1 Methods The goal of this experiment was to understand the navigational challenges associated with lack of visual input and whether any improvements can be obtained with a tactile cuing system that warns about the presence of obstacles in the vicinity of the subject. An obstacle course, as shown in Figure 5.5 was created for this purpose. Sixteen normal sighted subjects, recruited from 143 Figure5.5: Topdownschematicoftheobstaclecourseusedinmobilitystudies. Thecourselength is 12.2 m and the maximum, unrestricted width is 3 m. The grid spacing overlayed is 0.15 m× 0.15 m. Grey areas indicate walls or obstacles placed in the path. Subjects entered on the left side and data was recorded until they reached the exit at the right end.Adapted with permission from (Pradeep et al., 2010a),© IEEE. amongst the students and staff at the University of Southern California, were divided into four equal groups: Group 1 was assigned as the control with no vision restrictions imposed, while members of the remaining three were blindfolded. All subjects were naive to the nature of the experiment as well as the obstacle course. Furthermore, all groups had the same goal: to navigate from one end to the other by avoiding collisions as much as possible. The blindfolded groups were allowed three different types of navigational assistance: Group 2 members could freely move about both their arms to feel for objects/walls in the path, those in Group 3 were given a 1 meter long walking stick to probe the environment and the tactile suit was worn by subjects in Group 4. The tactile suit, at this stage, was manually controlled by wireless remote and cues were provided depending on the distribution of obstacles in the vicinity. This vicinity was established with a radius of approximately 3 meters (as recommended in (Bath, 1979)), although a stereo vision system with a baseline (horizontal separation between the cameras on the stereo rig) of 10-12 cm provides reasonable depth estimates upto 5 meters. Overhead cameras recorded video data and tracked the trajectories followed by the subjects. This approach is similar to the work described in (Brown etal., 1986), where an ultrasonic ranging method was used to track each subject’s progress through obstacle courses. Ten trials were conducted per subject and no prior training was done (those in Group 4 were simply instructed on how to interpret the tactile cues). 144 5.4.2.2 Results and Discussion The average time spent by the various groups per trial is shown in Figure 5.6. As expected, control Group 1 subjects take the least amount of time and there is hardly any variation across trials as they have access to full visual input. Group 2, 3 and 4 subjects demonstrate learning and their time-performance improves over trials. This can be seen from the sigmoid curves (Figure 5.7) fitted to the means for each group, demonstrating incremental learning intitially, followed by a flattening of the curve (as indicated by the asymptote to the sigmoid). 
Furthermore, the first three trials show the highest means and variability due to unfamiliarity with the task and this number agrees with mobility studies conducted under simulated prosthetic vision conditions (Dagnelie etal., 2007). This also agrees with the observation of (Haymes etal., 1995), where most learning during mobility experiments for partially sighted subjects is observed during the first three trials, with flattening of the curve by the fifth trial. There is also a wide variation in subject performance, both within and across groups. Using the learning curves, this was accounted for by normalizing each subject’s completion time values by the highest time taken in the first three trials. The resulting normalized curves are shown in Figure 5.8. A one-way ANOVA (for the completion time curves) between the three test groups using this data from trials 7,8,9 and 10 returnedap-valueof 1.8×10 −12 (F-number=45.02), highlightingthatdifferentnavigationalaids do have a significant impact on time performance. Subjects with the freedom to move their hands did the best as they were able to easily follow the wall by learning the texture corresponding to the locations of obstacles. However, this performance came at the cost of several collisions during the trials. Groups 3 and 4 had almost no collisions and the p-value between them was obtained as 0.0002, which is also statistically significant. Subjects with the tactile suit did better and were able to reduce their completion time to almost 50% of their starting time, compared to 65% for those with the cane. 145 Figure 5.6: Average time to completion of obstacle course for the four groups. Adapted with permission from (Pradeep et al., 2010a),© IEEE. Togainfurtherinsight,datacorrespondingtothetimesspentalongthetrajectorywasanalysed from the tracking data. As the field of view of the overhead camera was limited, tracking data for only only the first half of the course was obtained. This data was converted into ‘heatmaps’ that showthetypicaltrajectoryforeachgroup(Figure5.9). Fromthisdata, itisobviousthatsubjects with the guide cane spent a lot of time trying to get out of the trap situation created by the 2 meter long wall in the middle. This happened probably because the limited detection area of the cane required subjects to often get very close to structures before being able to take any evasive maneauvers. As they could only obtain very localized information, they were forced to choose randomlytogoleftorrightandtheirnextstepwouldnotalwayshelpthembackontherightpath. In contrast, since subjects with the tactile vest received information several feet before reaching the wall and cues were such that they accounted for the whole surrounding, these situations were avoided. The heatmaps corresponding to this group exhibit the most similarity with the path taken by the control group members. This observation matches closely with wayfinding experimentsusingtactilecuesdescribedin(Pielotetal.,2009),whereavibrotactilebeltwasfound to improve mobility performance by leading to shorter routes and less frequent disorientation. 146 Figure 5.7: Plots of incremental learning over trials for Groups 2,3 and 4 with sigmoid fits. The horizontal asymptote indicates learning is achieved between the seventh and eight trials. Adapted with permission from (Pradeep et al., 2010a),© IEEE. Figure 5.8: Normalized average time to completion. 147 Figure 5.9: Heatmaps displaying the averaged out trajectories for each group. 
Brighter red signifies more time spend in that location. Adapted with permission from (Pradeep et al., 2010a), © IEEE. 5.4.3 Mobility Experiments: Autonomous steering 5.4.3.1 Methods Thegoalofthisexperimentwastoinvestigateifthesystemdescribedinthepreviousthreechapters can autonomously generate tactile cues for close to optimal route planning, as demonstrated by manually generated cues in Section 5.4.2. The Bumblebee2 stereo camera was head-mounted and connected to a laptop running the SLAM and obstacle detection modules. A traversability grid was computed for each data acquisition step and the algorithm generated and transmitted signals to the tactile vest, based on the presence of obstacles in the subject’s vicinity. Note at this stage, thepathplanningmodulewasnotreadyandnavigationalcuesweregeneratedbasedonproximity to obstacles only. Specifically, cues were delivered corresponding to whichever obstacle was in the immediate vicinty and there were hardcoded values constraining the obstacle sizes to expect and dimensions of the obstacle course. Since the camera was tethered, there was a limit to the area that could be traversed by the subject. Two blindfolded subjects, 8 trials each, participated in this experiment. 148 Figure 5.10: Heatmap for autonomous steering experiment. Adapted with permission from (Pradeep et al., 2010a),© IEEE. 5.4.3.2 Results and Discussion As the heatmaps in Figure 5.10 show, performance is quite similar to that obtained under manual guidance. Thisimpliesthatthesystemisabletoprovidereasonableguidancethroughanobstacle course. Thesestudies, however, werenotreliableindicatorsofexpectedperformancewithvisually impairedsubjects. Theprimaryreasonisthatsubjectshadbeenblindfoldedforonlyashortwhile, anddidnothaveanyorientationandmobilitytrainingthatthevisuallyimpairedgenerallypossess. Secondly, the obstacle course was short and the SLAM system had hard-coded values about what obstacle sizes to expect. Both of these cannot be expected in real-life situations. Another very important aspect of this study was the absence of any path planning module. Due to the lack of waypoint computation, the system only had one mode of operation: the proximity mode. Subjects only executed side stepping evasive maneuvers, which after discussion with orientation and mobility instructors, was not considered a viable strategy for an actual implementation. However, the take home points from these studies were: • Tactile cuing has sufficient resolution and meets cognitive load constraints to be used as a reliable means of conveying navigational assistance. • The SLAM and obstacle detection algorithms work as expected and the head-mounted cameradesignisafeasibleapproachtosolvingtheproblemofobstaclemappingwithrespect to the user’s position. 149 For the unanswered questions, another set of mobility studies were carried out with visually impaired subjects. Path planning was fully integrated into the system and both modes - guidance and proximity - were activated. These studies are described next. 5.5 Mobility Experiments with Visually Impaired Subjects 5.5.1 Motivation In(Turanoetal., 2002), 127visuallyimpairedsubjectsansweredquestionnairesdesignedtoassess self-perception of abilities for independent navigation in a variety of environments. 35 different scenarios were rated based on how difficult or easy subjects thought performing the particular tasks would be. 
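The heatmaps of Figures 5.9 and 5.10 summarize dwell time per location, accumulated over all subjects and trials. A minimal sketch of such an accumulation is shown below; the grid resolution, sampling rate and trajectory format are assumptions made only for illustration.

import numpy as np

def trajectory_heatmap(trajectories, course_size=(12.2, 3.0), cell=0.15, fps=30.0):
    # trajectories: list of (N, 2) arrays of tracked (x, y) positions in meters,
    # sampled at fps frames per second. Returns a 2D array of dwell time per
    # grid cell; brighter values correspond to more time spent at a location.
    nx, ny = int(course_size[0] / cell), int(course_size[1] / cell)
    heat = np.zeros((nx, ny))
    for track in trajectories:
        ix = np.clip((track[:, 0] / cell).astype(int), 0, nx - 1)
        iy = np.clip((track[:, 1] / cell).astype(int), 0, ny - 1)
        np.add.at(heat, (ix, iy), 1.0 / fps)   # each sample adds one frame of dwell time
    return heat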
The scenarios were a list of daily activities like ‘moving about in the home’, ‘walking down stairs’, ‘walking through door ways’, ‘avoiding bumping into knee-high obstacles’, ‘walkingatnight’, ‘walkingincrowds’, ‘steppingontoandoffcurbs’, ‘movingaboutinstores’, etc. The perception of ability relates more to performance than just the residual vision functionality. Manypatientswhowerenotlegallyblind, andthushadafairamountofresidualvision, expressed fears of falling and a huge percentage had already had a fall. For an electronic travel aid to be effective therefore, it is necessary to validate whether the target population can adapt to the device and navigate an obstacle course with some degree of confidence. A majority of those with visual impairments use white canes to avoid obstacles. Generally, rehabilitating vision loss patients receive extensive orientation and mobility training to use the cane effectively. While the proposed system is not intended as a replacement for the cane, it might be instructive to evaluate how quickly subjects take to the device and compare performance between the two. This will provide insight into the shortcomings of the system, as well as highlight benefits, and help make furture improvements. Three variables were studied during the course of these experiments. The first wascollisions with obstacles in the path. It is well-known that contrast sensitivity, visual acuity, and visual 150 field defects reduce mobility effectiveness and safety (Bowers etal., 2008; Leat and Love-Kitchin, 2008). Impaired mobility is also a known risk factor for falls (Freeman etal., 2007; Kuang etal., 2008) and is correlated with mortality (Kulmala etal., 2008). Thus, if the number of collisions could be reduced with the mobility assistive device, over the white cane, it could have significant impact on the quality of life of the visually impaired. Secondly, independent mobility needs to be efficient in addition to being safe. Efficiency is described here with respect to time to completion of the navigation task and the general shape of the trajectory adopted through the environment. Hence, the experiments were designed to test the following hypotheses: • Hypothesis#1: The mobility assistive device reduces the number of collisions with obsta- cles as it is especially designed for such a purpose, in contrast to the white cane that might not detect objects of different sizes. • Hypothesis #2: The white cane only transmits local and restricted environment informa- tion to the user, leading to subjects getting stuck in trap situations. The proposed system is able to provide guidance around these objects due to the path planning module. • Hypothesis #3: Visually impaired subjects can learn to navigate with the system, and combineduseofboth-thestereocameraandthewhitecane-mightgivebetterperformance as opposed to using each individually. 5.5.2 Methods 5.5.2.1 Subject Recruitment After obtaining approval from the Institutional Review Board (IRB) at the Unversity of Southern California,10visuallyimpairedsubjectswereenrolledforthisstudyandasignedinformedconsent was obtained from each participant. Subjects were required to be able to speak and read English, 18+ years of age, no history of motion sickness or vertigo, no cognitive or language/hearing impairments and not require any ambulatory aids such as crutches. 
Only participants with best 151 corrected visual acuity of less than 20/60 or significant peripheral visual field defect (less than 90 degrees) or reduced peak contrast sensitivity with either of the two preceding conditions were selected for the study. Recruitment was initially done through flyers and emails. Orientation and mobility instructors at the Braille Institute of Los Angeles were requested to advertise the study amongst their students. However, due to the small number of interested participants, a seminar was conducted at the Braille Institute, Los Angeles where an oral presentation was given describing the system and the proposed study. Attendees were invited to a demonstration of the tacile vest at the end of the presentation and complementary lunch was purchased for them at the Braille Institute’s canteen. All selected subjects were paid $20 in cash upon conclusion of each experiment session. In addition, detailed information pertaining to subjects’ vision loss and self-assessment of mobility was collected. Some relevant data is presented in Table 5.1. Subject CW dropped out of the study later, and therefore, will not be reported in the results section. 5.5.2.2 Percentage Preferred Walking Speed (PPWS) The most salient feature of Table 5.1 is the high variability across subjects in terms of age, medical condition and experience. Subjects were also asked to report (on a scale of 1-5, 5 being the best) how confident they felt about navigating independently when using their preferred mobility aid. These scores are shown in Table 5.2. While subjective, this sheds light over the differences in confidence levels between subjects. Naturally, these and other variables (height, weight, intelligence, etc.) will also impact mobility performance along with the visual factors. In order to be able to compare between various groups, some kind of data normalization has to be performed to account for such confounding variables. The percentage preferred walking speed (PPWS) concept was developed by Clark-Carter and colleagues (Clark-Carter etal., 1986b) to compare ideal walking speed (pace set by an individual with a sighted guide who ensures safety) with alternative modalities. 
The assumption is that if visually impaired people had their sight fully restored, they would walk at their optimal speeds 152 Subject Age(yrs)/Sex Condition Left Eye Right Eye Duration Experience with Mobility aids RR 46/male glaucoma; detached retinas no light perception no light perception 3 years white cane for 3 years; O&M for 1 month KR 48/female congenital; both eyes are prosthetic no light perception no light perception 33 years white cane and O&M through school GB 52/female anolphtoma, microphthalmus prosthetic eyes no light perception no light perception 45 years white cane for 40 years; O&M for 3 years AD 70/female RP; can see gross shapes such as outlines no light perception minimal light perception 39 years white cane for 14 years; guide dog for 5 years RM 69/male cataract (right); dry AMD (left); gross shapes minimal light perception minimal light perception 49 years white cane for 41 years; O&M for 18 months MK 33/female RP; can tell if in shade or sunlight only minimal light perception minimal light perception 5 months Undergoing O&M once a week DW 58/male gun injury in left eye no light perception 20/70 acuity 27 years white cane 27 years; O&M for a few weeks NG 51/male scotoma; cortical blindness minimal light perception minimal light perception 5 years white cane for 5 years; GPS, iPhone OB 52/male Charles Bonnett syndrome - - 1 year hite cane for 1 years; O&M weekly CW 54/male diabetic retinopathy; gross shapes minimal light perception minimal light perception 6 years White cane for 2 years; O&M for 6 months Table 5.1: Summary of data from subject recruitment. O&M stands for orientation and mobility training, RP for retinitis pigmentosa and AMD for age-related macular degeneration. 153 Subject Self-assessment of mobility on a 1-5 scale (5 being the best) RR 2 KR 4 GB 5 AD 4 RM 2 MK 2 DW 3 NG 4 OB 3 CW 5 Table 5.2: Self-assessment scores for independent mobility. Figure 5.11: Walking speeds of the subjects over 10 trials. The dotted red-box indicates the data points used to calculate the PWS measure for each subject. 154 (the preferred walking speed, PWS). The actual speeds that is adopted on a route, therefore, can be expressed as a percentage of the corresponding PWS and this measure effectively captures the impact of vision loss, as all other factors remain the same. The PWS can be measured with a sighted guide accompanying the subject or without - the study reported in (Soong etal., 2000) found no significant difference between the two techniques. In the experiments conducted here, subjects were asked to walk an unobstructed mobility course while accompanied with a sighted subject. The walking speeds obtained for the 10 subjects are shown in Figure 5.11. Subjects were asked to walk across 10 times and as the graph shows, a plateau in the speeds was attained by the seventh trial for almost all subjects. This was because some participants had difficulty maintaining a steady orientation and it took a few tries before they could walk in a straight line down the corridor. Data-points from the last four trials were used to compute the mean PWS values for the subjects. 5.5.2.3 Experiment Protocol The experiments were conducted at facilities provided by the Braille Institute of Los Angeles. The study protocol is sketched in Figure 5.12. After the PWS session, subjects were sent back and a waiting period of at least 1 week was kept before resuming the experiments. 
This is because the space available to do these experiments were limited and both, the PWS and obstacle course sessionswereconductedinthesamecorridor. EventhoughthePWSsessioninvolvednoobstacles, and subjects walked in the opposite direction compared to the obstacle course sessions, care was taken to reduce the impact of any learning (gained during the PWS session) during the mobility tests. Subjectsweredividedintooneofthreegroups. Group1subjectscouldnavigatetheobstacle course with their own white cane, Group 2 subjects used the tactile vest system while Group 3 subjects were given both to use simultaneously. As subject CW dropped out, each group had 3 subjects: RR, KR, GB formed Group 1; AD, RM, MK comprised Group 2 and Group 3 consisted of DW,NGandOB(seeFigure 5.14). Subjectswereinstructedtowalkdownthecourse, avoiding 155 Figure 5.12: Study protocol. 156 Figure 5.13: Front and sideways of the system prototype used in the mobility experiments. Figure 5.14: (a) Top down schematic of the obstacle course at the Braille Institute, Los Angeles. (b), (c) and (d) show snapshots of example subjects from each of the three groups in the study. 157 Figure 5.15: Map produced by the system compared with the floorplan. obstacles encountered in the path. 10 trials were performed each time. A sighted person followed the subject to ensure safety. Once the subject reached the end of the course, the guide would ask the subject to turn around and bring him/her back to the starting point. The obstacle course was kept the same for all subjects and across all trials, with only the start position and orientation varying. Another sighted individual recorded the start and stop times. An example of a map produced by the system during a trial (subject GB, trial 8) is shown in Figure 5.15. Group 2 and Group 3 subjects were simply instructed on the operation of the tactile vest and how to interpret the vibration cues. Prior to start of a session, each motor was individually turned on and the subject was asked to identify which motor was active. Subjects were also asked to demonstrate their speed of turning and in case they were too slow or too fast, appropriate instructions to modulate their speeds were provided. No training over the obstacle course was performed. For Group 3 subjects, they were additionally told to use the white cane as they normally would, and treat the tactile system as a secondary assistive device only. After a waiting period of at least 2 weeks, subjects of Group 1 and Group 2 were switched with each other. Thus, at this point, RR, KR and GB used the tactile system only while AD, RM and MK used the white cane. The purpose of this was to evaluate whether there is any difference 158 Figure5.16: Comparisonofnumberofcollisionspertrialforsubjectsfromthethreegroupsbefore switching. in performance on using the system/cane after already having navigated the obstacle course in the past. This would help understand the applicability of the system in unfamiliar versus familar areas. In case no significant difference is observed, this would help in increasing the number of subjects per group for more robust statistical analyses. 5.5.3 Results and Discussion 5.5.3.1 Number of Collisions A collision was counted each time a subject contacted an obstacle with any part of his/her body. Any intervention on part of the sighted guide to prevent a subject from potential injury was also counted as a collision. The number of contacts per trial for each subject are displayed as a matrix in Figure 5.16. 
As is evident, the number of collisions are lower when the tactile system is being 159 - Figure 5.17: Collisions before and after switching Group 1 and Group 2. The direction of the arrow indicates which mobility test was performed first for each group. Figure 5.18: Locations of collisions for white cane users of Group 1. The number next to the red blocks indicate the total contacts made summed over all subjects and trials at that face. used and non-existent when both the white cane and system are employed. It is surprising that white cane users have so many collisions, given the fact that most of them have undergone O&M training and use it routinely. However, it was observed that most of these collisions were due to the objects coming in contact with the side of the body (See Figure 5.18). Subjects could not gather any information about the depth of obstacles by contacting only the front surface. As they walkedpasttheface,theyoftenerroneouslyassumedthattheobjectwascompletelybehindthem. This is a direct result of the limited range of the cane sensing approach, as subjects walk very close to the detected obstacle while avoiding it. Admittedly, detecting depth of an obstacle when pointed directly at its front surface would be a problem for the computer vision algorithms too. 160 However, in a relatively uncluttered environment such as the obstacle course, the path-planner sticks to widely-traversable areas (steers well clear of obstacles) and keeps users on a path that has more margin for error (see Section 5.5.3.3). In contrast, cane users must contact objects with the cane to avoid collision, so they are by necessity brought closer to the objects they seek to avoid. Most of the collisions for Group 2 subjects were due to interventions on part of the guide (as subjects overcompensated or walked too fast) and errors in wireless communication. These problems will be discussed in next section on system performance (Section 5.5.3.3). For Group 3 subjects, it is not surprising that no collisions were observed. While the tactile system did keep themonpathsthathadmoredistancefromobstacles,failuresofthesystemandovercompensation problems were probably offset by the cane. This group of subjects learnt to distinguish between the good and bad cues. These differences will become more apparent when the trajectories taken by the different groups are analysed. In Figure 5.17, the number of collisions for members of Group 1 and Group 2 are shown before and after switching their mobility aids. Group 1 subjects perform much better when using the device,whileGroup2subjectshavemorecollisionswhenusingonlythewhitecane. Thissupports the earlier argument that the proposed device leads to safer navigation, and this conclusion is irrespective of the composition of the subject group. 5.5.3.2 Time to Completion The percentage preferred walking speed (PPWS) was computed for each subject per trial. The corresponding plots are shown in Figure 5.19, while consolidated values per group are displayed in Figure 5.20. The white cane users have the best time performance and show steady learning over trials, while Group 2 subjects perform the worst. No obvious learning trend can be observed from the data for members of both, Group 2 and Group 3. The pattern apparent from the plots for Group 2, however, is that of quick learning within the first two or three trials, followed by a drop in performance midway and eventual recovery during the last few trials. Subjects would 161 Figure 5.19: PPWS measures per subject. 
162 Figure 5.20: Average PPWS measures per group. initially start out without having a clear understanding of how much to turn when cues were presented. In the absence of vision, it was difficult to impress the appropriate rate of turning required and the only way this could be done was to provide verbal feedback during the initial trials. Once they got accustomed, however, PPWS measures increased. It is probably because of the accompanying increase in confidence with such improvement, that performance dropped after the first few trials. Subjects started moving too fast (especially head motions) for the system to compute motion accurately and provide appropriate cues or they overcompensated (which will be discussed in the next section). These issues resulted in them getting lost or walking longer distances through the obstacle course. Several times, backtracking was observed that further lengthened time taken through the obstacle course. Subsequently, optimum turning speeds had to be reinforced verbally that explain the improvements demonstrated during the last trials. It is interesting to note that Group 3 subjects were comparatively more stable across the trials. This is probably because of having the familar white cane to pace themselves and orient their body positions correctly even when cues were delayed or incorrect. There were also errors in wireless transmission of cues at farther distances from the transmitter (see Section 5.5.3.3). System issues 163 such as these were also largely responsible for Group 2 and Group 3 subjects taking significantly longer times in completing the navigation exercises. 5.5.3.3 Navigation Trajectories Figure 5.21: Heatmaps showing trajectories taken by participants from each group. Group 1: cane only, Group 2: camera-vest system and Group 3: cane plus camera-vest system users. In Figure 5.21, heatmaps are shown that represent the average trajectories followed by the different group members through the course. The brightness level of red corresponds to the time spent at that location across all trials and subjects for the particular group. These plots were generated by manually tracking the subjects through video analysis, as the tracking camera malfunctioned and could not be used for automated heatmap generation. As alluded to in Section 5.5.3.1, whitecaneusershavealimitedrangeofdetectionandwalkdangerouslyclosetoobstacles. This results in collisions as they walk past the side faces of the objects. They also spend a lot of 164 timemakingcontactswithlongerobstaclesandgettingstuckintrapsituations. Thepath-planner of the tactile system eliminates these difficulties and it canbe observed from the first halves of the heatmaps(forGroup2andGroup3)thatthetrajectoriesaretrulysafeandcircumventanytraps. Towards the end, however, the paths are often circuitous (especially for Group 2) and subjects zigzag instead of walking a straight line. One definite reason for this is the delay in receiving the wirelesscuesignals. Duetoline-of-sightissues, subjectswouldreceiveavibrationmuchlaterthan they should have, resulting in them taking evasive actions too late and leading to a vicious cycle of incorrect cues at the wrong times. An example of such a scenario is depicted in Figure 5.22. Anotherreason,whichismoreevidentfromthevideorecordings,isthatsubjectsovercompensated when turning. They had been instructed to perform rotations as long as a vibration motor was active. Sincethevibrationsweresentassteadypulses, therewasnowaytoobtainfeedbackabout how much to turn. 
In the absence of such a strategy, a right turn would result in a deviation from the straight-line path such that eventually a left turn had to be cued to compensate. That left turn would also overshoot the desired target, and so on. Group 3 subjects made use of contacts with the wall and obstacles to avoid such orientation problems, though open spaces often confused them as well.

5.5.3.4 Discussion on System Performance

The delays in wireless transmission and cue control are simple engineering problems that can be rectified by improving the system components and control logic. A more fundamental issue arose when the system itself was in error while computing camera motion, building the map or detecting obstacles. In particular, four failure modes of the system were observed:

• The complete left side of the corridor was a glass panel which, being a specular surface, created errors in feature tracking and stereo triangulation. The visual odometry pipeline handled this by detecting a bad fit for the epipolar constraint and ignoring the frame (a sketch of such a check is given after this list). Nevertheless, if this side of the obstacle course was focussed on for a longer duration, mapping and motion estimation would fail. In such a situation, the trial was restarted.

• Glass also created problems for obstacle detection. Often, the space on the other side of the panel would be considered an open area and the system would try to guide the subject through the panel. A long table was placed in front of the glass panel to create an obstacle, but occasional failures did occur.

• Slippage of the camera would often cause the ground floor to be no longer visible. This led to ‘holes’ in the map, where no obstacles or traversable regions were detected. The path-planner would therefore not have any information to provide cues even if free space was right ahead.

• Fast head motions or walking would result in motion blur and failure to compute camera motion. This was cited earlier as a probable reason for the drop in performance of subjects who became very confident with the system after the first few trials.

Some of these pathological cases are depicted in Figure 5.23.

Figure 5.22: Two snapshots of the system output during a trial affected by delayed cues. The dashed red box in the top figure shows the region for which the maps have been displayed. (Left image) The system has already detected a path to a waypoint to the left (as shown by the white line leading to the pink dot). In spite of this, however, the subject keeps walking towards the wall (right image) as no cues have yet been received.

Figure 5.23: Example failure modes of the system.
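As a rough illustration of the epipolar-consistency check mentioned in the first failure mode above, the listing below estimates a fundamental matrix with RANSAC and skips a frame when too few tracked features agree with it. The thresholds are assumptions for illustration only; this is not the actual visual odometry implementation from Chapter 2.

import cv2
import numpy as np

# Sketch: reject a frame if tracked features fit the epipolar constraint poorly
# (assumed thresholds; not the thresholds used in the actual pipeline).

def frame_is_consistent(prev_pts: np.ndarray, curr_pts: np.ndarray,
                        min_inlier_ratio: float = 0.5, min_points: int = 15) -> bool:
    """prev_pts, curr_pts: Nx2 float32 arrays of matched feature locations."""
    if len(prev_pts) < min_points:
        return False                       # too few tracks to trust the geometry
    F, inlier_mask = cv2.findFundamentalMat(prev_pts, curr_pts,
                                            cv2.FM_RANSAC, 1.0, 0.99)
    if F is None or inlier_mask is None:
        return False                       # RANSAC failed outright
    inlier_ratio = float(inlier_mask.sum()) / len(inlier_mask)
    return inlier_ratio >= min_inlier_ratio

# A frame failing this test would simply be skipped by the odometry loop,
# as described above, rather than contributing a bad motion estimate.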
5.6 Summary

A tactile interface for communicating navigational cues computed by the stereo camera based system was introduced in this chapter. Experiments with blindfolded subjects demonstrated that this was a feasible approach for providing information about the presence of obstacles. A cuing strategy that tries to keep the user on a pre-computed safe path was developed, and an implicit human-computer interaction mechanism was built to keep cognitive load to a minimum. Mobility experiments with visually impaired subjects were conducted at the Braille Institute of Los Angeles. While the system was able to reduce the number of collisions and provided safer paths through an obstacle course, its time performance was not found to be comparable with the white cane. This was due to a combination of several factors, including particular failure modes of the system. In the next chapter, these shortcomings are addressed in detail and future work for eliminating them is proposed.

Chapter 6

Conclusion and Future Work

6.1 Summary of Contributions

The following question was posed at the beginning of this dissertation: can advances made in robot vision be built upon and translated to a wearable device that substantially improves the quality of life of the visually impaired? Over Chapters 2, 3 and 4, state-of-the-art computer vision techniques were explored and novel algorithms proposed, implemented and tested to build a framework for answering that question. These contributions are summarized here:

• A stereo-vision based visual odometry pipeline that uses adaptive feature tracking to cope with varying amounts of texture in different environments (Chapter 2).

• A new minimal solver for computing camera motion given any arbitrary combination of point and line features. This method outperforms state-of-the-art techniques for visual odometry and for extracting camera matrices from noisy trifocal tensors (Chapter 2).

• A hybrid metric-topological SLAM framework that is scalable and keeps error bounded. Active matching is used for running a metric Rao-Blackwellised particle filter, while the resulting maps are stored as nodes in a graph representation. Sparse bundle adjustment helps keep transformations between nodes locally consistent (Chapter 3).

• A novel visual loop closure algorithm that requires no offline training and is able to disambiguate between visually similar regions (Chapter 3).

• A generic obstacle detection module that translates techniques used with lidar scanners to build temporally stable occupancy maps for path-planning (Chapter 4).

In addition to computer vision algorithms, new methods were developed and insights gained for building and testing electronic mobility aids for the visually impaired:

• Implicit interaction between the user and the tactile interface that keeps cognitive load at a level that allows the person to use the information without significant delay. The system autonomously anticipates and predicts user intentions by computing a waypoint based on the motion history of the subject (a sketch of this idea is given after this list). This allows the system to adapt in real time to changing needs, without having to continually query the user for instructions (Chapter 5).

• Demonstration of the feasibility of tactile cuing to guide subjects through an obstacle course. Techniques such as trajectory estimation and heatmap visualization help give a clear understanding of the pros and cons of different mobility aids (Chapter 5).

• Mobility experiments with visually impaired subjects using the white cane, the tactile system and both simultaneously. Results show that the system proposed in this thesis can reduce the number of collisions and lead to subjects following safer paths through an obstacle course. While time performance is not optimal with the system, this dissertation also gained valuable insight into improvements that are feasible to make (Chapter 5).
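As a loose illustration of the waypoint idea in the first bullet of the second list, the listing below picks a goal cell in an occupancy grid by extrapolating the user's recent motion and then snapping to the nearest traversable cell. The grid layout, look-ahead distance and helper names are hypothetical; this is not the path-planner described in Chapter 5.

import numpy as np

# Sketch of waypoint selection from motion history on an occupancy grid
# (0 = free, 1 = occupied); all parameters are hypothetical.

def predict_waypoint(grid: np.ndarray, positions: list, lookahead_cells: int = 10):
    """positions: recent (row, col) cells visited, oldest first."""
    (r0, c0), (r1, c1) = positions[-2], positions[-1]
    heading = np.array([r1 - r0, c1 - c0], dtype=float)
    norm = np.linalg.norm(heading)
    if norm == 0:
        return (r1, c1)                              # user is stationary; stay put
    heading /= norm
    target = np.array([r1, c1]) + lookahead_cells * heading
    # Snap to the nearest free cell so the cue always points somewhere traversable.
    free = np.argwhere(grid == 0)
    best = free[np.argmin(np.linalg.norm(free - target, axis=1))]
    return tuple(int(v) for v in best)

grid = np.zeros((20, 20), dtype=int)
grid[5:15, 10] = 1                                   # a wall segment
print(predict_waypoint(grid, [(10, 2), (10, 4)]))    # a free cell ahead of the user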
From the results shown in the preceding chapters, it is possible to conclude that robot vision is an appropriate modality for developing assistive technologies for the visually impaired. Experiments and feedback obtained from study participants clearly demonstrate the need for, and usefulness of, such a device. However, the system developed here falls short of completely accomplishing its goals of making independent navigation both safe and efficient. During the course of experimental evaluation, shortcomings were observed that need to be eliminated for users to be able to navigate more effectively. These are discussed in the next section.

6.2 Directions for Future Work

6.2.1 Addressing Failure Modes

Failure modes of the system were presented towards the end of Chapter 5. The following suggestions are made here to improve overall performance:

• Errors when facing specular surfaces: Reflections off glassy and other specular surfaces are catastrophic for feature tracking. The system is able to handle only a few frames of tracking failure, and then only when there are a few non-specular objects in the scene. It is not uncommon (especially in indoor environments) for such problematic surfaces to be widely prevalent. It is therefore recommended to employ a cheap accelerometer and gyroscope unit to provide motion estimates when visual odometry has failed (a simplified sketch of this fallback is given after this list). By careful integration of such INS units with SLAM, the system performance can be improved substantially. For instance, in (Zhu et al., 2007), a ten-fold improvement in visual odometry is reported using a combination of IMU sensors with stereo cameras.

• Errors due to motion blur: Fast head motions result in very blurry images that make feature detection and tracking very difficult. Three alternatives are proposed here. One is to employ a camera with a faster capture rate; by sampling images at shorter intervals, the motion blur problem can be avoided, although images captured from such cameras suffer high degradation in quality. Another approach is to use the INS units again to estimate quick motion. Finally, a third alternative is to use edges for coping with image blur. Generally, rapid camera motion is associated with rotations instead of translations, and edges parallel to the direction of motion are often preserved. Camera rotation may be determined using such features, as has been demonstrated in (Klein and Murray, 2007).

• Lack of texture: A serious problem for the visual odometry and SLAM system occurs when navigating textureless environments. The lack of intensity gradients renders the problems of feature detection, optic-flow computation and even stereo triangulation ill-posed. One ready solution to this problem is to incorporate the point/line odometry algorithm proposed in Chapter 2 into the SLAM system; lines will have to be represented in the map to achieve this integration. A more attractive solution that was explored towards the end of this thesis work is replacing the stereo camera itself with the Prime Sensor™ (PrimeSense, 2006). There are three main advantages of this sensor over stereo cameras: (1) the Prime Sensor™ has an infra-red (IR) emitting element and an IR sensor; by projecting a known pattern onto a surface, it is able to obtain dense 3D data even for feature-less surfaces. (2) Since 3D triangulation is performed by active illumination and not by matching images between two cameras, the range data is more dense and accurate compared to stereo cameras. (3) This sensor is intended for mass production and is therefore expected to be cheaply available. An example of a depth map produced by the sensor is presented in Figure 6.1.

Figure 6.1: Example of a depth map produced by the Prime Sensor™ for an untextured surface.

• Delayed cues and overcompensation: The radio-frequency link used to transmit signals from the processing PC to the tactile vest needs to be upgraded. Doing this over a Wi-Fi network (which is already being used for the Tyzx camera) will improve reliability and eliminate line-of-sight problems with the current XBee modules. Furthermore, the intensity of vibration cues needs to be modulated to indicate how much turning is required. Some kind of audio feedback could also be provided to inform the subject to slow down his/her body rotations. In addition to enabling users to walk down straighter paths, this could also help them learn the system through bootstrapping and eventually obtain optimal performance.
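To make the INS fallback suggested in the first bullet above more concrete, the following is a minimal sketch of bridging a visual odometry dropout by integrating gyroscope yaw rate. The interfaces and noise handling are deliberately simplified and hypothetical; a real integration with SLAM (as in Zhu et al., 2007) would fuse the two estimates in a filter rather than switch between them.

# Sketch of bridging visual odometry (VO) failures with gyroscope dead reckoning.
# All interfaces are hypothetical; a practical system would use a Kalman-style
# filter to fuse IMU and VO instead of hard switching as done here.

def update_heading(heading_rad: float, vo_delta_rad, gyro_rate_rad_s: float,
                   dt_s: float) -> float:
    """Return the new heading estimate for one frame.

    vo_delta_rad    -- rotation reported by visual odometry, or None if VO failed
                       (e.g. the frame was rejected by the epipolar check).
    gyro_rate_rad_s -- yaw rate measured by the gyroscope over the same interval.
    """
    if vo_delta_rad is not None:
        return heading_rad + vo_delta_rad          # trust vision when it succeeds
    return heading_rad + gyro_rate_rad_s * dt_s    # otherwise integrate the gyro

heading = 0.0
frames = [(0.02, 0.21, 0.033), (None, 0.20, 0.033), (None, 0.19, 0.033)]
for vo, gyro, dt in frames:                        # VO drops out for two frames
    heading = update_heading(heading, vo, gyro, dt)
print(round(heading, 4))                           # accumulated heading estimate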
6.2.2 Incorporating New Features

• Integration of a GPS receiver: A GPS receiver should be integrated to obtain true autonomy in large-scale, outdoor environments. This could provide global navigation guidance (such as directions to a grocery store), while the SLAM and obstacle detection modules help in local navigation (which the GPS does not have enough resolution to provide). A careful integration could significantly reduce errors in the visual odometry and SLAM pipelines, and provide redundancy should the vision pipelines fail.

• Image-based localization: While a GPS system will help in getting to a particular location outdoors, no such assistance can be made available for indoor environments. However, if the user wants to return to a previously visited location (say, the entrance lobby of a large building), this should be feasible given a map and a set of images corresponding to various points in the map. As a subject enters the building, he or she can start tagging images while in the lobby. The SLAM system can then employ the current position estimate, the global map and the set of images to recognize when a return has been made to an earlier location (a brief sketch of this matching step is given at the end of the chapter).

• New human interface design: An important observation when conducting mobility experiments with the visually impaired was their reluctance at the prospect of wearing a tactile vest. Specifically, this design had no aesthetic appeal for the participants, and a more self-contained, compact form was requested. This could be in the form of bracelets or a belt. Currently, a tactile cane that has all vibration motors located on the handle is in advanced stages of prototyping. This would eliminate the need for any specialized gear.

• Device interaction: To improve the overall experience with the system, a seamless means of communicating with the device should be incorporated. The subject should be able to turn it on or off, start the GPS receiver or image-based localization (when integrated), and so on. For this, a speech-to-text module could accompany the computer vision algorithms, enabling all interaction through voice commands only.

While the recommendations listed above will solve particular problems, miniaturization of the device is desired for a final product. Ideally, the cameras could be installed in glasses and the algorithms run off an embedded system attached to the subject. Several electronic travel aids have been proposed in the past, but weight, cost and form factor have been primary obstacles to their widespread adoption by the visually impaired community. Smart phones these days have incredible processing power, and it is not difficult to imagine that these issues can be overcome with pragmatic engineering effort. Robot vision can indeed improve the quality of life of the visually impaired and, along the way, also lead to applications that could benefit the general public.
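Finally, as a closing illustration of the image-based localization idea in Section 6.2.2, the listing below matches the current camera frame against a set of previously tagged images using ORB descriptors and a brute-force matcher. The feature choice and match-count threshold are assumptions for illustration only, not a description of how this would be integrated with the SLAM system.

import cv2

# Sketch of recognizing a return to a tagged location by descriptor matching.
# ORB + brute-force Hamming matching is an assumption for illustration; the
# thresholds below are arbitrary and would need tuning in practice.

orb = cv2.ORB_create(nfeatures=500)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

def looks_like(tagged_img, current_img, min_matches: int = 40) -> bool:
    """Return True if the current frame plausibly shows the tagged location."""
    _, des_tag = orb.detectAndCompute(tagged_img, None)
    _, des_cur = orb.detectAndCompute(current_img, None)
    if des_tag is None or des_cur is None:
        return False                                  # not enough texture to decide
    matches = matcher.match(des_tag, des_cur)
    good = [m for m in matches if m.distance < 50]    # keep only close descriptors
    return len(good) >= min_matches

# Usage: compare the live frame against every image tagged in the lobby and,
# on a hit, report the corresponding map position to the user.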
174 Bibliography AFB, Sep 2008,/URL: http://www.afb.org/Section.asp?SectionID=15&DocumentID=43980. Ansar, A and Daniilidis, K, Linear pose estimation from points or lines, IEEE PAMI 25:5 2003, pp.578–589. Arno, P., Auditory coding of visual patterns for the blind, Perception 28:8 1999, pp.1013–1029. Arun, K. S., Huang, T. S. and Blostein, S. D., Least-squares fitting of two 3-D point sets, IEEE Transactions on Pattern Analysis and Machine Intelligence 9:5 1987, pp.698 – 700. Avolio, G., Principles of Rotary Optical Encoders, Sensors 1993, pp.10 – 18. Bailey, T., Mobile Robot Localization and Mapping in Extensive Outdoor Environments, Ph.D thesis, University of Sydney 2002. Bar-Shalom, Y. and Fortmann, T., Tracking and Data Association Academic Press, Inc., 1988. Barnes, G. R., Visual-vestibular interaction in the control of head and eye movement: the role of visual feedback and predictive mechanisms, Progr Neurobiol 41:435 – 472 1993. Barshan, B. and Durrant-Whyte, H. F., Inertial Navigation Systems for Mobile Robots, IEEE Transactions on Robotics and Automation 11:3 June 1995, pp.328 – 342. Bartoli, A and Sturm, P, Multiple-view structure and motion from line correspondences, International Conference on Computer Vision (ICCV) 2003. Batavia, P. and Singh, S., Obstacle Detection in Smooth High Curvature Terrain, IEEE In- ternational Conference on Robotics and Automation (ICRA) 2002. Bath, J L, The effect of preview constraint on perceptual motor behaviour and stress level in a mobility task, Unpublished Doctoral Dissertation, University of Louisville 1979. Battin, R. H., An Introduction to the Mathematics and Methods of Astrodynamics AIAA Education Series, New York, NY, ISBN 0-930403-25-8, 1987. Bauer, J., Sunderhauf, N. and Protzel, P., Comparing several implementations of two re- centlypublishedfeaturedetectors,ProceedingsoftheInternationalConferenceonIntelligent and Autonomous Systems (IAV) 2007. Bay, H.etal., SURF: Speeded Up Robust Features, Computer Vision and Image Understanding (CVIU) 110:3 2008, pp.346 – 359. 175 Benjamin, J. M., Ali, N. A. and Schepis, A. F., A Laser Cane for the Blind, Proceedings of the San Diego Biomedical Symposium 12:53–57 1973. Bennett, Andrew T. D., Do Animals have Cognitive Maps?, Journal of Experimental Biol- ogy 219 – 224 1996. Bissitt, D. and Heyes, A. D., An Application of Biofeedback in the Rehabilitation of the Blind, Applied Ergonomics 11:1 1980, pp.31–33. Blanco, J. L., Derivation and Implementation of a Full 6DOF EKF-based Solution to Range- Bearing SLAM, Technical report Perception and Mobile Robotics Research Group, Univer- sity of Malaga, Spain, 2010. Blanco, Jose-Luis,Madrigal, Juan-Antonio FernandezandGonzalez, Javier, Toward a Unified Bayesian Approach to Hybrid Metric-Topological SLAM, IEEE Trans. on Robotics 2008. Borenstein, J. etal., Mobile Robot Positioning - Sensors and Techniques, Journal of Robotic Systems, Special Issue on Mobile Robots 14:4 1997, pp.231 – 249. Borenstein,J.andUlrich,I.,TheGuideCane-Acomputerizedtravelaidfortheactiveguidance of blind pedestrians, IEEE International Conference on Robotics and Automation 1997, pp.1283–1288. Bosse, M. etal., An atlas framework for scalable mapping, Technical report Massachusetts Institute of Technology, 2002. Bouguet, J., Pyramidal Implementation of the Lucas-Kanade Feature Tracker: Description of the Algorithm, Technical report Intel Corporation, Microprocessor Research Labs, 2000. Bowers, A. 
R.,Keeney, K.andPeli, E., Community-based trial of a peripheral prism visual field expansion device for hemianopia, Archives of Ophthalomology 126 2008, pp.657 – 664. Brabyn, J., Crandall, W. and Gerrey, W., Talking signs: a remote signage, solution for the blind, visually impaired and reading disabled, Engineering in Medicine and Biology Society 1993, pp.1309–1310. Brown, B. etal., Contribution of vision variables to mobility in age-related maculopathy pa- tients, American Journal of Optometry and Physiological Optics 63:9 1986, pp.733–739. Bryne,R.W.,GeographicalKnowledgeandOrientationLondon: AcademicPress,1982,pp.239 – 264. Bujnak, M, Kukelova, Z and Pajdla, T, A general solution to the P4P problem for camera with unknown focal length, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) 2008, pp.1–8. Byrne, R H,Klarer, P RandPLetta, J B, Techniques for Autonomous Navigation, Sandia Report SAND92-0457 Sandia National Laboratories, Albuquerque, NM, March 1992. Canny, J., A Computational Approach to Edge Detection, IEEE Transactions on Pattern Anal- ysis and Machine Intelligence 8:6 1986, pp.679 – 698. Chanderli, K. etal., Minimum requirements for useful artificial vision : Mobility and environ- ment predictability, European Conference On Visual Perception 2006. 176 Chandraker, M, Lim, J and Kreigman, D J, Moving in Stereo: Efficient Structure and Motion Using Lines, International Conference on Computer Vision (ICCV) 2009. Christy, S and Horaud, R, Iterative pose computation from line correspondences, CVIU 73:1 1999, pp.137–144. Clark-Carter, D. D., Heyes, A. D. and Howarth, C. I., The effect of non-visual preview uponthewalkingspeedofvisuallyimpairedpeople, Ergonomics29:121986a, pp.1575–1581. Clark-Carter, D. D., Heyes, A. D. and Howarth, C. I., The efficiency and walking speed of visually impaired people, Ergonomics 29 1986b, pp.779 – 789. Clemente, L. A. etal., Mapping large loops with a single hand-held camera, Robotics:Science and Systems 2007. Comport,AI,Malis,EandRives,P,Accuratequadrifocaltrackingforrobust3dvisualodom- etry, IEEE International Conference on Robotics and Automation (ICRA) 2007, pp.40–45. Cook, A. and Hussey, S., Assistive Technologies, Principles and Practice, edited by Mosby, 2nd edition 2002. Costa, G. etal., Mobility and orientation aid for blind persons using artificial vision, J. Phys 2007. Cummins, M. and Newman, P., Probabilistic Appearance Based Navigation and Loop Clos- ing, IEEE International Conference on Robotics and Automation (ICRA) 2007, pp.2042– 2048. Dagnelie,G.etal.,Realandvirtualmobilityperformanceinsimulatedprostheticvision,Neural Engineering 4:92–101 2007. Dean, Tom, Ultrasound Acoustic Sensing. Dellaert, F. etal., Monte carlo localization for mobile robots, Proceedings of the IEEE Inter- national Conference on Robotics and Automation 1999. Dissanayake, M. etal., Lecture Notes in Control and Information Sciences, Experimental Robotics VI Springer-Verlag, 2000. Dixon, Jonathan and Henlich, Oliver, Mobile Robot Navigation June 1997, /URL: http: //www.doc.ic.ac.uk/~nd/surprise_97/journal/vol4/jmd/0. Dodds, A., and Howarth, D. Clark-Carterand C., The sonic path finder:an evaluation, Journal of Visual Impairment Blindness 78:5 1984, pp.206–207. Dornaika, FandGarcia, C, Pose estimation using point and line correspondences, Real-Time Imaging 5:3 1999, pp.215–230. 
Doucet, Arnaud, On sequential simulation-based methods for bayesian filtering, Technical re- port, Signal Processing Group, Department of Engineering, University of Cambridge 1998. Dusenbery, D. B., Sensory Ecology New York: W.H. Freeman Company, 1992. Eames, E and Eames, T, A Guide to Guide Dog Schools, 2nd edition 1994. Erp, J. B. F. Van, Groen, E. L. and Bos, J. E., A tactile cockpit instrument supports the controlofself-motionduringspartialdisorientation, HumanFactors: TheJournalofHuman Factors and Ergonomics Society 48 2006, pp.219–228. 177 Estrada, Carlos,Neira, JoseandTardos, Juan D., Hierarchial SLAM: Real-Time Accurate Mapping of Large Environments, IEEE Trans. on Robotics 21:4 2005b, pp.588–596. Everett, H. R., Sensors for Mobile Robots: Theory and Application A K Peters Ltd., Wellesley, MA, ISBN 1-56881-048-2, 1995. Farrell,J.L.,IntegratedAircraftNavigationAcademicPress,NewYork,NY,ISBN0-12-249750- 3, 1976. Faugeras, Olivier,Luong, Quang-TuanandPapadopoulou, T., TheGeometryofMultiple Images MIT Press, Cambridge, ISBN: 0262062208, 2001. Fischler, M A and Bolles, R C, Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography, International Journal of Computer Vision (IJCV) 22:2 1997, pp.125–140. Flynn, A. M., Combining sonar and infrared sensors for mobile robot navigation, The Interna- tional Journal of Robotics Research 7:6 1988, pp.5 – 14. Franz,MatthiasO.andMallot,HanspeterA., BiomimeticRobotNavigation, Roboticsand Autonomous Systems 30 2000, pp.133 – 153. Freeman, E. E. etal., Visual field loss increases the risk of falls in older adults: the Salisbury eye evaluation, Investigative Ophthalmology and Visual Science 48 2007, pp.4445 – 4450. Genensky, S M, Berry, S H and Bikson, T H, Visual environment adaptation problems of thepartiallysighted: FinalReport(CPS-100-HEW),CenterforthePartiallySighted, Santa Monica Medical Center 1979. Golledge, R. G.andGarling, T., Cognitive Maps and Urban Travel, Handbook of Transport Geography and Spatial Systems 5 2004, pp.501 – 512. Golledge, R. G., Marston, J. R. and Costanzo, C. M., Attitudes of visually impaired persons towards the use of public transportation, J. Vis. Impairment Blindness 90 1997, pp.446–459. Golub, G. H. and Loan, C. F. Van, Matrix Computations, 3rd edition Johns Hopkins Uni- versity Press, 1996. Guivant, J.andNebot, E., Optimizationofthesimultaneouslocalizationandmapbuildingal- gorithmforrealtimeimplementation, IEEETransactionsonRoboticsandAutomation17:3 2001, pp.242 – 257. Haralick,Retal.,Analysisandsolutionsofthethreepointperspectiveposeestimationproblem, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) 1991,/URL: http://ir.lis.nsysu.edu.tw:8080/ir/handle/987654321/205650. Harris, C. and Stephens, M., A combined corner and edge detector, Proceedings of the 4th Alvey Vision Conference 1988, pp.147 – 151. Hartley, R, Lines and points in three views and the trifocal tensor, IJCV 22:2 March 1997, pp.125–140. Hartley, R, Computation of the trifocal tensor, European Conference on Computer Vision (ECCV) 1998, pp.20–35. 178 Hartley, R. I. and Zisserman, A., Multiple View Geometry in Computer Vision Cambridge University Press, ISBN: 0521623049, 2000. Haymes, S etal., Mobility of pople with retinitis pigmentosa as a function of vision and psy- chological factors, Optometry and Vision Science 73:10 1995, pp.621–637. 
Health Statistics, National Center for, Supplement on Assistive Technology Devices and Home Accessibility Features, National Health Interview Survey 1992. Heyden, A, Geometry and Algebra of Multiple Projective Transformations, PhD thesis, Lund University 1995. Hirschmuller, H.andScharstein, D., Evaluation of cost functions for stereo matching, IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops June 2007. HumanWare,,/URL: http://www.humanware.com/en-europe/home0. Jones, L. A., Nakamura, M. and Lockyer, B., Development of a tactile vest, International Symposium on Haptic Interfaces for Virtual Environment and Teleoperator Systems 2004, pp.82–89. JVIB, Demographics Update: Use of White ("Long") Canes, Journal of Visual Impairment and Blindness, Part 2 88:1 1994, pp.4–5. JVIB, Demographics Update: Alternate Estimate of the Number of Guide Dog Users, Journal of Visual Impairment and Blindness, Part2 89:2 1995, pp.4–6. Kay,L.,Anultrasonicsensingprobeasamobilityaidfortheblind,Ultrasonics21964,pp.53–59. Kitchin, R. M., Cognitive Maps: What Are They and Why Study Them?, Journal of Environ- mental Psychology 14 1994, pp.1 – 19. Klein, G. and Murray, D., Parallel Tracking and Mapping for Small AR Workplaces, ISMAR November 2007. Koch, O.andTeller, S., A Self-Calibrating, Vision-Based Navigation Assistant., CVAVI 2008. Koenig, S. and Likhachev, M., Fast Replanning for Navigation in Unknown Terrain, Trans- actions on Robotics 21:3 2005, pp.354–363. Kolb, Helga etal., WebVision: The Organization of the Retina and the Visual System,/URL: http://webvision.med.utah.edu0 – visited on 2010. Konczak, J., Effetcs of optic flow on the kinematics of human gait: a comparison of young and older adults, Journal of Motor Behavior 26 1994, pp.225 – 236. Kuang,T.M.etal., Visualimpairmentandfallsintheelderly: TheShihpaiEyestudy, Journal of the Chinese Medical Association 71 2008, pp.467 – 474. Kukelova, Z, Bujnak, M and Pajdla, T, Automatic generator of minimal problem solvers, European Conference on Computer Vision (ECCV) 2008a, pp.302–315. Kukelova, Z, Bujnak, M and Pajdla, T, Polynomial Eigenvalue Solutions to the 5-pt and 6-pt Relative Pose Problems, BMVC 2008b. Kulmala, J. etal., Visual acuity and mortality in older people and factors on the pathway, Ophthalmic Epidemiology 15 2008, pp.128–134. 179 Kulyukin, V., Robot-assisted wayfinding for the visually impaired in structured indoor environ- ments, Autonomous Robots 21:1 2006, pp.29–41. Lackner, J. R.andDiZio, P., Visual stimulation affects the perception of voluntary leg move- ments during walking, Perception 17 1988, pp.71 – 80. Laurent, B. and Christian, T. N. A., A sonar system modeled after spatial hearing and echolocating bats for blind mobility aid, International Journal of Physical Sciences 2:4 2007, pp.104–111. Leat, S. J. and Love-Kitchin, J. E., Visual function, visual attentionm and mobility perfor- mance in low vision, Optometry and Vision Science 85 2008, pp.1049 – 1056. Leonard, J. J. and Durrant-Whyte, H. F., Simultaneous map building and localization for an autonomous mobile robot, IEEE International Conference on Intelligent Robots and Systems (IROS) 3 1991, pp.1442–1447. Levitt, T. S. and Lawton, D. T., Qualitative Navigation for Mobile Robots, Artificial Intelli- gence 44 1990, pp.305 – 360. Li, H and Hartley, R, Five-point motion estimation made easy, ICPR 1 2006. Lieblich, I. and Arbib, M. 
A., Mulitple representations of space underlying behaviour and associated commentaries, The Behavioural and Brain Sciences 5:4 1982, pp.627 – 660. Lindemann, R W etal., Effectiveness of directional vibrotactile cuing on a building-clearance task, CHI 271–280 2005. Liu, Y andHuang, TS, A linear algorithm for motion estimation using straight line correspon- dences, ICPR 1988, pp.213–219. Loadstone,,/URL: http://www.loadstone-gps.com/about/0. Loomis, Jack M. etal., Nonvisual Navigation by Blind and Sighted: Assessment of Path Integration Ability, Journal of Experimental Psychology: General 122:1 1993, pp.73 – 91. Lourakis, M. and Argyros, A., The Design and Implementation of a Generic Sparse Bundle Adjustment Software Package Based on the Levenberg-Marquardt Algorithm, Technical re- port FORTH-ICS/TR-340 Science and Technology Park, Heraklio, Crete: Computational Vision and Robotics Laboratory, Institue of Computer Science (ICS), Foundation for Re- search and Technology Hellas (FORTH), August 2004. Lowe, D.G., Object Recognition from local scale-invariant features, Proceedings of the Interna- tional Conference on Computer Vision 1999, pp.1150–1157. Lu,X.andManduchi,R.,Detectionandlocalizationofcurbsandstairwaysusingstereo-vision, International Conference on Robotics and Automation 2005, pp.4648–4654. Lucas, B D and Kanade, T, An iterative image registration technique with an application to stereo vision, International Jount Conferences on Aritifical Intelligence (IJCAI) 1981, pp.1151–1156. Matthies, L.andShafer, S., Error moding in stereo navigation, IEEE Journal of Robotics and Automation 3:3 1987, pp.239 – 248. McCarthy, C. etal., Towards a Hazard Perception Assistance System using Visual Motion, CVAVI 2008. 180 McNaughton, B. L.,Chen, L. L.andMarkus, E. J., ’Dead Reckoning’, landmark learning, and the sense of direction: a neurophysiological and computational hypothesis, Journal of Cognitive Neuroscience 3:2 1991, pp.190 – 202. Meijer, P. B. L., Vision technology for the totally blind. Microsoft, Kinect for Xbox 360. Milner, A. D. and Goodale, M. A., Visual pathways to perception and action, Progr Brain Res 95 1993, pp.317 – 337. Mitra, N. J.andNguyen, A., Estimating surface normals in noisy point cloud data, Proceed- ings of the Nineteenth Annual Symposium on Computational Geometry 2003, pp.322–328. MobileGeo,,/URL: http://www.codefactory.es/en/products.asp?id=2500. Mohammad, Tarek, Using Ultrasonic and Infrared Sensors for Distance Measurement, World Academy of Science, Engineering and Technology 51 2009, pp.293 – 298. Molton, N., Robotic sensing for the partially sighted, Robotics and Autonomous Systems 26 1999, pp.185–201. Molton, N., Davison, A. J. and Reid, I., Locally Planar Patch Features for Real-Time Structure from Motion, British Machine Vision Conference 2004. Montemerlo, M. etal., Fast-SLAM: A factored solution to the simultaneous localization and mapping problem, Proceedings of the AAAI National Conference on Artificial Intelligence 2002, pp.593–598. Montemerlo, M. and Thrun, S., FastSLAM: A Scalable Method for the Simultaneous Local- ization and Mapping Problem in Robotics, ISBN 978-3-540-46399-3 Springer, 2007. Montemerlo, M. etal., Fast-SLAM 2.0: An improved particle filtering algorithm for simul- taneous localization and mapping that provably converges, Proceedings of the Eighteenth Inernational Joint Conference on Artifical Intelligence (IJCAI-03) 2003, pp.1151–1156. Moravec, H. P., Sensor fusion in certainty grids for mobile robots, AI Magazine 9:2 1988, pp.61 – 74. 
Mordohai, P. and Medioni, G., Tensor Voting: A Perceptual Organization Approach to Computer Vision and Machine Learning Morgan and Claypool, 2006a. Moreno, F. A., Blanco, J. L. and Gonzalez, J., Stereo vision-specific models for Particle Filter-based SLAM, Robotics and Autonomous Systems 57:9 2009, pp.955 – 970. MOWAT,,/URL: http://www.as.wvu.edu/~scidis/terms/mowat.html0. Murphy, K., Bayesian map learning in dynamic environments, Neural Information Processing Systems (NIPS) 1999, pp.1015–1021. Nagel, S K etal., Beyond sensory substitution–learning the sixth sense, J. Neural Engineer- ing 2:4 2005. NAVIG,,/URL: http://navig.irit.fr/0. Neira, J. etal., Fusing Range and Intensity Images for Mobile Robot Localization, IEEE Trans- actions on Robotics and Automation 15:1 February 1999, pp.76 – 84. 181 Newman, P., Cole, D. and Ho, K., Outdoor SLAM using Visual Appearance and Laser Ranging, Journal of Field Robotics 23:1 2006. Nister, D, An efficient solution to the five-point relative pose problem, IEEE PAMI 26:6 2004b, pp.756–770. Nister, D, Naroditsky, O and Bergen, J, Visual odometry, IEEE Computer Society Con- ference on Computer Vision and Pattern Recognition (CVPR) 1 2004, /URL: http: //ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=13150940, pp.652–659. O’Keefe, J and Nadel, L., The Hippocampus as a Cognitive Map 1978. Oliensis, J and Werman, M, Structure from motion using points, lines, and intensities, IEEE ComputerSocietyConferenceonComputerVisionandPatternRecognition(CVPR)22000. Pailhou, J. etal., Unintentional modulations of human gait by optical flow, Behavioral Brain Research 38 1988, pp.275 – 281. Paillard, J., Cognitive versus sensorimotor encoding of spatial information, in: Ellen, P. and Thinus-Blanc, C., editors, Cognitive Processes and Spatial Orientation in Animal and Man: NeurophysiologyandDevelopmentalAspectsTheHague: MartinusNijhoffPublishers, NATO ASI Series No. 37, 1987, pp.43 – 77. Patla, A. E., Understanding the Roles of Vision in the control of Human Locomotion, Gait and Posture 5 1997, pp.54 – 69. Patla, A. E. etal., Characteristics of voluntary visual sampling of the environment during locomotion over different terrains, Experimental Brain Research 112:3 1996, pp.513 – 522. Patla,A.E.,Prentice,S.andUngers-Peters,G., AccommodatingDifferentCompliantSur- faces in the Travel Path During Locomotions, Proceedings of the Fourteenth International Society for Biomechanics Conference 11 1993, pp.1010 – 1011. Patla, A. E.andVickers, J., Visual search patterns as we approach and slip over obstacles in the travel path, Soc Neurosci Abstr 22:3 1996, p.1848. Pielot, M., Henze, N. and Boll, S., Supporting map-based wayfinding with tactile cues, MobileHCI 2009. Pleis, J R and Lethbridge-Cejku, M, Summary health statistics of US adults: National Health Interview Survey, National Center for Health Statistics 10:235 2006. Pollefeys, M,Nister, Dandal., Detailed real-time urban 3D reconstruction from video, IJCV 2007. Poucet, B., Spatial Cognitive Maps in Animals: New Hypothesis on their Structure and Neural Mechanisms. Poucet, B., Spatial Cognitive Maps in Animals: New Hypothesis on their Structure and Neural Mechanisms, Psychol Rev 100 2 1993, pp.163 – 182. Pradeep, V. and Lim, J., Egomotion Using Assorted Features, IEEE Computer Society Con- ference on Computer Vision and Pattern Recognition (CVPR) August 2010c, pp.1514 – 1521. Pradeep, V., Medioni, G. 
and Weiland, J., Piecewise Planar Modeling for Step Detection using Stereo Vision, European Conference on Computer Vision Workshops, CVAVI 2008. 182 Pradeep, V., Medioni, G. and Weiland, J., Visual loop closing using multi-resolution SIFT grids in metric-topological SLAM, IEEE Computer Society Conference on Computer Vision and Pattern Recognition 0 2009, pp.1438–1445. Pradeep, V., Medioni, G. and Weiland, J., Robot Vision for the Visually Impaired, IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops 2010a, pp.15 – 22. Pradeep, V., Medioni, G. and Weiland, J., A Wearable System for the Visually Impaired, IEEE Engineering in Medicine and Biology Conference 2010b. PrimeSense, 2006,/URL: http://www.primesense.com/0. Pruski, A., Surface contact sensor for robot safety, Sensor Review 6:3 1993, pp.143 – 144. RehabilitationEngineering,VisionAidsforImpairedPeripheralVisionorTunnelVision2003, /URL: http://www.abledata.com/abledata_docs/Peripheral_Vision.htm0. Rosten, E and Drummond, T, Fusing points and lines for high performance tracking, Inter- national Conference on Computer Vision (ICCV) 2 October 2005, pp.1508–1515. Russel, L., Travel path sounder, Proceedings of the Rotterdam Mobility Research Conference, New York 1965. Russell, S. and Norvig, P., Artificial Intelligence: A Modern Approach Prentice Hall, 1995. Sabatini, A. etal., A low-cost, composite sensor array combining ultrasonic and infrared prox- imity sensors, International Conference on Intelligent Robots and Systems 3 August 1995, pp.120 – 126. Saez,J.M.etal.,Firststepstowardsstereo-based6DOFSLAMforthevisuallyimpaired,IEEE Conference on Computer Vision and Pattern Recognition 2 2005, p.23. Sargolini, F. etal., Conjunctive representation of position, direction, and velocity in entorhinal cortex, Science 312:5774 May 2006, pp.758 – 762. Schacter, D. L. and Nadel, L., Varieties of spatial memory: a problem for cognitive neuro- science, in: Lister, R. G. and Weingartner, H. J., editors, Perspectives on Cognitive Neuroscience London: Oxford University Press, 1991, pp.164 – 185. Scharstein, Daniel and Szeliski, Richard, Middlebury Stereo Vision Page, /URL: vision. middlebury.edu/stereo/0 – visited on 2010. Seitz, SMandAnandan, P, Implicit representation and scene reconstruction from probability density functions, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) 2 1999. Shapiro, L. and Stockman, G., Computer Vision Prentice Hall, 2001. Shashua, A and Wolf, L, On the structure and properties of the quadrifocal tensor, Lecture notes in computer science 2000b, pp.710–724. Shoval, S.,Borenstein, J.andKoren, Y., Auditory guidance with the navbelt - A computer- ized travel aid for the blind, IEEE Transactions on Systems, Man, and Cybernetics 28 1998, pp.459–467. Sick, About SICK, /URL: http://www.sick.com/group/EN/home/about_sick/Pages/about_ sick.aspx0. 183 Sim, Robert, Elinas, Pantelis and Little, James J., A study of Rao-Blackwellised Particle Filter for Efficient and Accurate Vision-Based SLAM, International Journal of Computer Vision 74:3 January 2007, pp.303–318. Smith,R.C.,Self,M.andßCheeseman,P.,OntheRepresentationandEstimationofSpatial Uncertainty, Proceedings of the Annual Conference on Uncertainty in AI 1986, pp.435–461. Soong, G. P., Love-Kitchin, J. E. and Brown, B., Preferred walking speed for assessment of mobility performance: sighted guide versus non-sighted guide techniques, Clinical and Experimental Optometry 83:2 2000, pp.279 – 282. 
Spice, Byron, Researchers Help Develop Full-Size Autonomous Helicopter July 2010, /URL: http://www.cmu.edu/news/blog/2010/Summer/unprecedented-robochopper.shtml0 – visited on 6 July 2010. Stein, B. E. and Meredith, A., The Merging of the Senses, in: A Bradford Book Cambridge, MA: MIT Press, 1993. Stewénius, H, Engels, C and Nister, D, Recent developments on direct relative orientation, J. Photogrammetry and Remote Sensing 60 2006, pp.284–294. Sunderhauf,N,Konolige,KandLacroix,S,Visualodometryusingsparsebundleadjustment on an autonomous outdoor vehicle, Levi Jan 2005a, /URL: http://citeseerx.ist.psu. edu/viewdoc/download?doi=10.1.1.92.4881&rep=rep1&type=pdf0. Talukder, A. etal., Fast and Reliable Obstacle Detection and Segmentation for Cross-country Navigation, Proceedings of the 2002 Intelligent Vehicles Conference 2002. Thrun,S.,Roboticmapping: Asurvey,in: Lakemeyer,G.andNebel,B.,editors,Exploring Artificial Intelligence in the New Millenium Morgan Kauffman, 2002. Thrun, Sebastian etal., Stanley: The Robot that Won the DARPA Grand Challenge, Journal of Field Robotics 23:9 2006, pp.661 – 692. Tola, E., Lepetit, V. and Fua, P., DAISY: An Efficient Dense Descriptor Applied to Wide- BaselineStereo, IEEETransactionsonPatternAnalysisandMachineIntelligence32:52010, pp.815 – 830. Tolman, E. C., Cognitive Maps in Rats and Men, Psychological Review 55:4 July 1948, pp.189 – 208. Torr, P H S and Zisserman, A, Robust parameterization and computation of the trifocal tensor, Image and Vision Computing 15 1997, pp.591–605. Triebel, T., Pfaff, P. and Burgard, W., Multi-level surface maps for outdoor terrain map- ping and loop-closing, IEEE International Conference on Intelligent Robots and Systems (IROS) 2276–2282 2006. Triggs, B, Camera pose and calibration from 4 or 5 known 3d points, International Conference on Computer Vision (ICCV) 1999, pp.278–284. Trucco, EmanueleandVerri, Alessandro, Introductory Techniques to 3-D Computer Vision Prentice Hall, 1988. Tsukada and Yasumura, ActiveBelt: Belt-type wearable tactile display for directional naviga- tion, UbiComp 2004. 184 Turano, K. A., Massof, R. W. and Quigley, H. A., A self-assessment instrument designed for measuring independent mobility in RP patients: generalizability to glaucoma patients, Investigative Ophthalmology and Visual Science 43:9 Sep 2002, pp.2874 – 2881. Veen, H. A. H. C. Van, Spape, M. and Erp, J. B. F. Van, Waypoint navigation on land: Different ways of coding distance to the next waypoint, EuroHaptics 2004. Viola, P. and Jones, M., Rapid object detection using a boosted cascade of simple features, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) 2001, pp.511 – 518. Warren, W. H. Jr., Self-Motion: Visual perception and visual content, in: Epstein, W. and Rogers, S., editors, Handbook of Perception and Cognition. Vol. 5. Perception of Space and Motion New York: Academic Press, 1995, pp.263 – 325. Wayfinder,,/URL: http://www.wayfinder.com/?id=912&lang=en-UK0. Weiland, J. D. and Humayun, M. S., Intraocular retinal prosthesis, IEEE Engineering in Medicine and Biology Magazine 25 2006, pp.60–66. Werner,Steffenetal.,SpatialCognition: TheRoleofLandmark,RouteandSurveyKnowledge in Human and Robot Navigation, in: Ph.D. in Computer Science in Springer, 1997, pp.41 – 50. Yuan, D.andManduchi, R., Dynamic Environment Exploration Using a Virtual White Cane, IEEE Conference on Computer Vision and Pattern Recognition 2005. Zeki, S., A Vision of the Brain London: Blackwell Scientific Publication, 1983. Zhu, Z. 
etal., Ten-fold Improvement in Visual Odometry Using Landmark Matching, Interna- tional Conference on Computer Vision (ICCV) 2007, pp.1 – 8. 185
Abstract
Vision is one of the primary sensory modalities for humans that assists in performing several life-sustaining and life-enhancing tasks, including the execution of actions such as obstacle avoidance and path-planning necessary for independent locomotion. Visual impairment has a debilitating impact on such independence and the visually impaired are often forced to restrict their movements to familiar locations or employ assistive devices such as the white cane. More recently, various electronic travel aids have been proposed that incorporate electronic sensor configurations and the mechanism of sensory substitution to provide relevant information - such as obstacle locations and body position - via audio or tactile cues. By providing higher information bandwidth (compared to the white cane) and at a greater range, it is hypothesized that the independent mobility performances of the visually impaired can be improved. The challenge is to extract and deliver information in a manner that keeps cognitive load at a level suitable for a human user to interpret in real-time.